Active Timepoint Selection for Learning Measure-Valued Trajectories

arXiv cs.LG Papers

Summary

This paper introduces a framework for active timepoint selection to infer probability paths from sparse snapshots, using linearized optimal transport to map distributions into a tangent space for Gaussian Process modeling, thereby enabling uncertainty-aware acquisition policies.

arXiv:2605.30625v1 Announce Type: new Abstract: Inferring continuous probability paths from sparse snapshots is a fundamental challenge in domains like single-cell biology, where high-fidelity data acquisition is often destructive and constrained by prohibitive sequencing costs. This motivates the need for active learning strategies to strategically select optimal measurement times. However, designing active learning policies for this setting remains an open problem: the target objects reside on the infinite dimensional Wasserstein space where standard Euclidean metrics are ill-defined, and current interpolation methods lack epistemic uncertainty quantification. We introduce a framework which extends active experimentation to the space of measures. By leveraging Linearized Optimal Transport (LOT), we map distributional snapshots into a tangent space amenable to Gaussian Process modeling, allowing us to construct a tractable probabilistic surrogate for the underlying probability path. This yields an acquisition policy that iteratively selects measurement times to minimize uncertainty. Empirical results demonstrate that our strategy outperforms uncertainty-agnostic baselines on both synthetic and real-world datasets.
Original Article
View Cached Full Text

Cached at: 06/01/26, 09:29 AM

# Active Timepoint Selection for Learning Measure-Valued Trajectories
Source: [https://arxiv.org/html/2605.30625](https://arxiv.org/html/2605.30625)
###### Abstract

Inferring continuous probability paths from sparse snapshots is a fundamental challenge in domains like single\-cell biology, where high\-fidelity data acquisition is often destructive and constrained by prohibitive sequencing costs\. This motivates the need for active learning strategies to strategically select optimal measurement times\. However, designing active learning policies for this setting remains an open problem: the target objects reside on the infinite dimensional Wasserstein space where standard Euclidean metrics are ill\-defined, and current interpolation methods lack epistemic uncertainty quantification\. We introduce a framework which extends active experimentation to the space of measures\. By leveraging Linearized Optimal Transport \(LOT\), we map distributional snapshots into a tangent space amenable to Gaussian Process modeling, allowing us to construct a tractable probabilistic surrogate for the underlying probability path\. This yields an acquisition policy that iteratively selects measurement times to minimize uncertainty\. Empirical results demonstrate that our strategy outperforms uncertainty\-agnostic baselines on both synthetic and real\-world datasets\.

Machine Learning, ICML

## 1Introduction

Measure\-valued trajectories\.Inferring the temporal evolution of a probability distribution \(i\.e\., a probability path\{μt\}t∈\[0,1\]\\\{\\mu\_\{t\}\\\}\_\{t\\in\[0,1\]\}\) is a fundamental challenge across scientific domains, ranging from fluid dynamics\(Benamou and Brenier,[2000](https://arxiv.org/html/2605.30625#bib.bib7)\)to macroeconomics\(Achdouet al\.,[2022](https://arxiv.org/html/2605.30625#bib.bib27)\)\. This problem is particularly acute in single\-cell biology\(Wagneret al\.,[2016](https://arxiv.org/html/2605.30625#bib.bib28)\), where cellular differentiation is modeled as a dynamic process in the high\-dimensional space of gene expressions\(Trapnellet al\.,[2014](https://arxiv.org/html/2605.30625#bib.bib26)\)\. In such settings, full continuous trajectories are often not observable\. Instead, we only have limited access to destructive snapshots, which are empirical measures at discrete time points\. The central task is therefore one ofdistributional interpolation: recovering the underlying continuous trajectoryt↦μtt\\mapsto\\mu\_\{t\}given a finite set of observed marginals at different timepoints\.

Active timepoint selection\.In practice, however, data acquisition is severely constrained by cost\. For instance, in single\-cell transcriptomics, generating high\-fidelity snapshots entails*destructive*sampling and incurs significant expenses, i\.e\. oftenthousands of dollars per time point\(Ziegenhainet al\.,[2017](https://arxiv.org/html/2605.30625#bib.bib25)\), which precludes dense temporal sampling\. Under such budgetary limits, the timing of observations becomes critical\. This motivates an*active learning*framework for measure\-valued processes, designed to iteratively select the next timet∗∈\[0,1\]t^\{\*\}\\in\[0,1\]to take a measurement that best contributes to estimating the underlying probability path given the past observations111The intended regime of our work is active acquisition withexpensivesnapshots, not real\-time selection\.\.

Challenges\.In this setting, active learning presents distinct challenges\. First, the geometry of the output space is inherently non\-Euclidean\. Standard active learning methods, such as those based on Gaussian Processes \(GPs\)\(Schulzet al\.,[2018](https://arxiv.org/html/2605.30625#bib.bib24); Williams and Rasmussen,[1995](https://arxiv.org/html/2605.30625#bib.bib22)\), assume vector\-valued outputs equipped with Euclidean metrics\. In contrast, probability measures live in a nonlinear space that is more naturally described by Wasserstein geometry\(Ambrosioet al\.,[2005](https://arxiv.org/html/2605.30625#bib.bib20)\)\. Second, this makes uncertainty quantification over measures particularly challenging\. Active learning for regression and classification typically relies on acquisition functions that require a notion of epistemic uncertainty, but such uncertainty is not readily available in current distribution interpolation methods\([Lipmanet al\.,](https://arxiv.org/html/2605.30625#bib.bib21); Rohbecket al\.,[2025](https://arxiv.org/html/2605.30625#bib.bib17)\)\. Finally, measure\-valued dynamics are often strongly non\-stationary\. For example, cellular development can vary dramatically in speed: extended periods of homeostasis may be punctuated by rapid, transient branching events\(Haghverdiet al\.,[2016](https://arxiv.org/html/2605.30625#bib.bib23)\)\. As a result, uniformly spaced acquisition times can be highly suboptimal\.

Method\.We propose an active timepoint selection strategy for measure\-valued trajectories that addresses the challenges above\. Our key idea is to*linearize*the Wasserstein space by lifting each observed snapshotμt\\mu\_\{t\}to a tangent space\. Concretely, we use Linearized Optimal Transport \(LOT\)\(Wanget al\.,[2013](https://arxiv.org/html/2605.30625#bib.bib19),[2025](https://arxiv.org/html/2605.30625#bib.bib18)\)to mapμt\\mu\_\{t\}to a tangent vector at a fixed reference measure\. We then compress these tangent vectors into a low\-dimensional representation and place a*warped*Gaussian Process \(GP\) prior over the resulting temporal coefficients\. The GP posterior induces a practical surrogate for epistemic uncertainty, while the warping accounts for non\-stationary dynamics by allowing time to be reparameterized\. Finally, we leverage the GP’s quantified epistemic uncertainty to determine the next time point,t∗t^\{\*\}, at which to take a measurement\.

Contributions\.*Conceptually*, we formulate the problem of active learning for measure\-valued trajectories, extending active experimentation to the space of measures\.*Technically*, we construct a tractable probabilistic surrogate in the Wasserstein space by combining Linearized Optimal Transport with multi\-output Gaussian Processes\.*Empirically*, we demonstrate that our acquisition strategy outperforms uniform and random baselines on both synthetic and real\-world datasets\.

## 2Related Work

Active learning\.Active learning is well\-established for scalar or categorical targets in Euclidean spaces\. Early and widely used heuristics include*uncertainty sampling*, which queries points with maximal predictive ambiguity\(Lewis,[1995](https://arxiv.org/html/2605.30625#bib.bib42)\), and*query\-by\-committee*\(Seunget al\.,[1992](https://arxiv.org/html/2605.30625#bib.bib41)\)\. A complementary Bayesian perspective casts acquisition as optimal experimental design, choosing inputs that maximize expected information gain\(Houlsbyet al\.,[2011](https://arxiv.org/html/2605.30625#bib.bib40)\)\. Other works extend uncertainty and information\-based acquisition to deep models using approximate Bayesian inference \(e\.g\., MC dropout\(Galet al\.,[2017](https://arxiv.org/html/2605.30625#bib.bib39)\)\)\. More closely related is\(Singhet al\.,[2005](https://arxiv.org/html/2605.30625#bib.bib35)\), which actively select timepoints to better fit Euclidean\-valued gene\-expression curves\. However, none of these works tackles active learning in the space of distributions, which is our focus\.

Distribution regression\.A related \(but directionally reversed\) line of work studies*distribution regression*, where the input is probability measures and the output lies inℝd\\mathbb\{R\}^\{d\}\(or a Hilbert space\)\. Classical approaches embed input distributions into an RKHS via kernel mean embeddings and then perform \(kernel\) regression\(Póczoset al\.,[2013](https://arxiv.org/html/2605.30625#bib.bib12); Szabóet al\.,[2016](https://arxiv.org/html/2605.30625#bib.bib13); Muandetet al\.,[2017](https://arxiv.org/html/2605.30625#bib.bib8); Lawet al\.,[2018](https://arxiv.org/html/2605.30625#bib.bib11)\)\. In contrast, our setting treats*time*as the covariate and the*output*as a distribution\. This connects more directly to*distributional interpolation*and*trajectory inference*methods that learn probability flows or stochastic processes consistent with observed marginals, including flow\-matching and score\-matching formulations, multi\-marginal extensions, and Schrödinger\-bridge–based approaches\([Lipmanet al\.,](https://arxiv.org/html/2605.30625#bib.bib21); Tonget al\.,[2024](https://arxiv.org/html/2605.30625#bib.bib10); Leeet al\.,[2025](https://arxiv.org/html/2605.30625#bib.bib9)\)\. However, these methods are primarilyreconstructionmethods: given fixed snapshots, they typically return a single learned probability flow or fitted stochastic dynamics, and hence a single induced marginal path\. By contrast, active acquisition requiresepistemic uncertaintyover plausible probability paths in order to decide which time point to measure next\. Our framework models this uncertainty, making it suitable for active learning\.

Linearized Optimal Transport \(LOT\)\.LOT embeds measures into the tangent space of a reference distribution\(Wanget al\.,[2013](https://arxiv.org/html/2605.30625#bib.bib19); Kolouriet al\.,[2016](https://arxiv.org/html/2605.30625#bib.bib45)\)\. This technique has proven powerful for pattern recognition tasks on measures, such as classification and barycenter estimation, by enabling the use of linear classifiers and regression in the tangent plane\(Moosmüller and Cloninger,[2023](https://arxiv.org/html/2605.30625#bib.bib34)\)\. However, none of these works focus on the active learning problem with measure\-valued trajectories, which requires providing a notion of uncertainty that we introduce in this work\.

## 3Problem Setup

### 3\.1Regression in the space of distributions

Notations\.Let𝒳⊆ℝd\\mathcal\{X\}\\subseteq\\mathbb\{R\}^\{d\}be a feature space\. We consider a probability path, i\.e\., a time\-varying function in the space of probability measuresμ:\[0,1\]→𝒫2​\(𝒳\)\\mu:\[0,1\]\\to\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)\. For notational convenience, in what follows we denote the measure at timettbyμt\\mu\_\{t\}instead ofμ​\(t\)\\mu\(t\), and note that the maximum time can be set to an arbitrarytmaxt\_\{\\text\{max\}\}after scaling\.𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)denotes the space of probability measures on𝒳\\mathcal\{X\}with finite second moments, defined as:

𝒫2​\(𝒳\):=\{ρ∈𝒫​\(𝒳\):∫𝒳‖x‖2​dρ​\(x\)<∞\}\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\):=\\left\\\{\\rho\\in\\mathcal\{P\}\(\\mathcal\{X\}\):\\int\_\{\\mathcal\{X\}\}\\\|x\\\|^\{2\}\\,\\mathrm\{d\}\\rho\(x\)<\\infty\\right\\\}\(1\)A natural metric on this space is the 2\-Wasserstein metric\(Villani,[2021](https://arxiv.org/html/2605.30625#bib.bib43)\)\. For any two measuresμ,ν∈𝒫2​\(𝒳\)\\mu,\\nu\\in\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\), the metric is defined as:

W2​\(μ,ν\):=\(infπ∈Π​\(μ,ν\)∫𝒳×𝒳‖x−y‖2​dπ​\(x,y\)\)1/2,W\_\{2\}\(\\mu,\\nu\):=\\left\(\\inf\_\{\\pi\\in\\Pi\(\\mu,\\nu\)\}\\int\_\{\\mathcal\{X\}\\times\\mathcal\{X\}\}\\\|x\-y\\\|^\{2\}\\,\\mathrm\{d\}\\pi\(x,y\)\\right\)^\{1/2\},\(2\)where∥⋅∥\\\|\\cdot\\\|denotes the Euclidean norm andΠ​\(μ,ν\)\\Pi\(\\mu,\\nu\)represents the set of all joint probability measures on𝒳×𝒳\\mathcal\{X\}\\times\\mathcal\{X\}with marginalsμ\\muandν\\nu\.

Goal\.We assume access to a dataset of temporal snapshots𝒟=\{\(ti,μ^ti\)\}i=1N\\mathcal\{D\}=\\\{\(t\_\{i\},\\hat\{\\mu\}\_\{t\_\{i\}\}\)\\\}\_\{i=1\}^\{N\}, where eachti∈\[0,1\]t\_\{i\}\\in\[0,1\]is a measurement time andμ^ti\\hat\{\\mu\}\_\{t\_\{i\}\}is an empirical measure observed at that time \(from samples drawn from the marginalμti\\mu\_\{t\_\{i\}\}\)\. Our objective is to estimate the underlying probability path\{μt\}t∈\[0,1\]\\\{\\mu\_\{t\}\\\}\_\{t\\in\[0,1\]\}using the snapshots in𝒟\\mathcal\{D\}\.

This objective is applicable to various domains, most notably in computational biology\. In this context, the feature space𝒳\\mathcal\{X\}represents the space of gene expressions \(or a latent space derived from it\)\. Each empirical measureμ^ti\\hat\{\\mu\}\_\{t\_\{i\}\}corresponds to the expression profiles ofnin\_\{i\}distinct cells observed at timetit\_\{i\}, and the goal is to estimate the dynamics of the cells in expression space\.

### 3\.2The active learning problem

In this work, we focus on the following question: given a fixed budget of measurementsBB, how do we select the measurement times\{ti\}i=1B\\\{t\_\{i\}\\\}\_\{i=1\}^\{B\}in order to best estimate\{μt\}t∈\[0,1\]\\\{\\mu\_\{t\}\\\}\_\{t\\in\[0,1\]\}?

This means that we seek an acquisition policyπ\\pithat, given the current history𝒟\\mathcal\{D\}, selects the next measurement timet∗∈\[0,1\]t^\{\*\}\\in\[0,1\]\. We assume the measurement times are not constrained to any specific order and need not be monotonic\.

Single\-cell active sequencing\.For example, biological samples can be collected and cryopreserved at dense time intervals, forming a ”bank” of potential data\. Since sequencing these samples is the primary cost bottleneck, processing the entire bank is often infeasible, as the cost is typically in the order of thousands of dollars per timepoint\. Instead, the active learning policy must sequentially query this bank, selecting optimal time\-points to thaw and sequence to minimize trajectory uncertainty under a fixed budget\.

### 3\.3Key challenges

To understand the difficulty of this active learning problem, it is useful to first recall how active learning operates in standard regression\. A conventional approach is to rely on two elements: aregressorthat predicts the output given the inputs, and a quantification ofepistemic uncertainty\. In Euclidean spaces, uncertainty can be derived from the posterior variance of a probabilistic model \(e\.g\., Gaussian Processes\) or from the empirical variance of an ensemble of predictors\. The active learning policy then utilizes this uncertainty to select the next queryt∗t^\{\*\}\. However, this approach falls short in our setup, due to two fundamental challenges\.

➀Non\-Euclidean geometry\.Standard regression models \(e\.g\., Gaussian Processes\) rely on Euclidean operations to interpolate between observations\. However,𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)is not a vector space, i\.e\. linear combinations of measures do not lead generally to valid measures\. We illustrate the limitations of naive Euclidean interpolation in[Figure1](https://arxiv.org/html/2605.30625#S3.F1)\. Specifically, we consider a one\-dimensional Gaussian trajectory with a time\-varying mean\. Each distribution is represented by its density evaluated on a fixed grid, and a standard Gaussian process is fit directly to these density vectors for interpolation\. As shown in the figure, this Euclidean interpolation splits mass rather than transporting it coherently\.

![Refer to caption](https://arxiv.org/html/2605.30625v1/x1.png)Figure 1:Gaussian Process regression naively applied to densities leads to poor interpolation\.While interpolation schemes compatible with the Wasserstein geometry exist, they typically operate between only*two*reference measures\. An example is the displacement interpolation\(McCann,[1997](https://arxiv.org/html/2605.30625#bib.bib37)\)\. Givenμ0,μ1\\mu\_\{0\},\\mu\_\{1\}in𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\), it is defined by:

μt=\(\(1−t\)​Id\+t​T\)\#​μ0,\\mu\_\{t\}=\(\(1\-t\)\\mathrm\{Id\}\+tT\)\_\{\\\#\}\\mu\_\{0\},\(3\)whereTTis the optimal transport map fromμ0\\mu\_\{0\}toμ1\\mu\_\{1\}\(see[Equation4](https://arxiv.org/html/2605.30625#S4.E4)\), and\#\\\#denotes the pushfoward operation\. Extending this idea to settings withN\>2N\>2snapshots is non\-trivial\. While recent works have proposed interpolation schemes for this setting\(Rohbecket al\.,[2025](https://arxiv.org/html/2605.30625#bib.bib17); Leeet al\.,[2025](https://arxiv.org/html/2605.30625#bib.bib9)\), they remain fundamentally deterministic and do not provide the uncertainty estimates required for active learning, which we discuss next\.

➁Absence of canonical epistemic uncertainty\.Quantifying and leveraging*epistemic*uncertainty is a central idea in active learning\. However, in our setting, constructing a probabilistic model over the infinite\-dimensional space𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)is non trivial: there is no canonical prior analogous to Gaussian Processes that respects the Wasserstein geometry while allowing for tractable posterior computation\. Importantly, we note that stochastic transport formalisms do not resolve this gap\. For example, multi\-marginal Schrödinger Bridges define a stochastic process connecting prescribed marginals, but once the marginals are fixed and under regularity conditions, the solution is asingleprobability path\. Consequently, we lack a mechanism to quantify which regions of the temporal domain show high uncertainty at the*distribution level*\.

Desiderata\.To overcome these limitations, we seek a surrogate modeling framework that allows for probabilistic regression in𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)\. Specifically, the model must satisfy two key criteria: \(i\) it must be capable of producing interpolations in𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)given an arbitrary number of snapshotsN≥2N\\geq 2and \(ii\) it must beprobabilistic, providing a tractable measure of epistemic uncertainty over the trajectory to guide the acquisition policy\.

## 4Method

Tσ​𝒫2​\(𝒳\)T\_\{\\sigma\}\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)μ\\muσ\\sigmaν\\nuvμv\_\{\\mu\}vνv\_\{\\nu\}𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)

![Refer to caption](https://arxiv.org/html/2605.30625v1/x2.png)

Figure 2:Overview of the methodology\.\(Left\)Probability measuresμ,ν\\mu,\\nuin Wasserstein space𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)are projected onto the tangent planeTσ​𝒫2​\(𝒳\)T\_\{\\sigma\}\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)via Linearized Optimal Transport \(LOT\)\.\(Right\)The active learning loop maps snapshots to latent statescic\_\{i\}modeled by Gaussian Processes\. This surrogate quantifies epistemic uncertainty to select the optimal next measurement timet∗t^\{\*\}\.Overview\.In this section, we address the challenges outlined in[Section3](https://arxiv.org/html/2605.30625#S3)and propose a framework to enable active learning on probability paths\. First, to resolve the non\-Euclidean and infinite\-dimensional nature of𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\), we leverageLinearized Optimal Transport \(LOT\)\(Wanget al\.,[2013](https://arxiv.org/html/2605.30625#bib.bib19)\)in order to map the observed empirical measures into a common tangent space\. This effectively linearizes the Wasserstein geometry and allows us to represent distributions as vectors in this tangent space\. Second, we compute a low\-dimensional representation of the tangent vectors and placeGaussian Process \(GP\)priors\(Rasmussen,[2003](https://arxiv.org/html/2605.30625#bib.bib16)\)on them\. This provides a canonical notion of epistemic uncertainty, which we can use within an acquisition function to guide the sequential selection of measurement times\.

### 4\.1Linearizing the Wasserstein space with LOT

The 2\-Wasserstein space\(𝒫2​\(𝒳\),W2\)\(\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\),W\_\{2\}\)has a non\-Euclidean geometry: measures do not interpolate naturally by linear averaging\. To obtain representations of these measures, we therefore*linearize*the geometry around a fixed reference measureσ∈𝒫2​\(𝒳\)\\sigma\\in\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\), following the LOT framework\.

##### Tangent\-space intuition\.

Informally, the*tangent space*atσ\\sigma, denotedTσ​𝒫2​\(𝒳\)T\_\{\\sigma\}\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\), contains the instantaneous ”velocity fields” that moveσ\\sigmaalong Wasserstein paths\. One can view it as a Hilbert space of square\-integrable vector fields underσ\\sigma, i\.e\.,Tσ​𝒫2​\(𝒳\)⊆L2​\(σ;ℝd\)T\_\{\\sigma\}\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)\\subseteq L^\{2\}\(\\sigma;\\mathbb\{R\}^\{d\}\)\(up to standard technicalities, cf\.\(Ambrosioet al\.,[2005](https://arxiv.org/html/2605.30625#bib.bib20)\)\)\. Thus, our objective is to map each measureμ\\muontoTσ​𝒫2​\(𝒳\)T\_\{\\sigma\}\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)\.

##### Optimal transport map\.

To obtain this representation, we need to be able to compute the ”velocity vector” fromσ\\sigmatoμ\\mu\. The geometry of𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)naturally defines it via optimal transport maps\. For any target measureμ∈𝒫2​\(𝒳\)\\mu\\in\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\), we define the optimal transport \(Monge\) map fromσ\\sigmatoμ\\muas

Tσ→μ∈arg⁡minT:T\#​σ=μ​∫𝒳‖x−T​\(x\)‖2​dσ​\(x\)\.T\_\{\\sigma\\to\\mu\}\\in\\arg\\min\_\{T:~T\_\{\\\#\}\\sigma=\\mu\}\\ \\int\_\{\\mathcal\{X\}\}\\\|x\-T\(x\)\\\|^\{2\}\\,\\mathrm\{d\}\\sigma\(x\)\.\(4\)A famous result byBrenier \([1991](https://arxiv.org/html/2605.30625#bib.bib38)\)guarantees that ifσ\\sigmais absolutely continuous with respect to the Lebesgue measure, the optimal transport mapTσ→μT\_\{\\sigma\\to\\mu\}exists and is unique\. The displacement fieldTσ→μ−IdT\_\{\\sigma\\to\\mu\}\-\\mathrm\{Id\}provides the LOT representation in the tangent space\.

##### LOT logarithmic map\.

LOT represents a measureμ\\muby its*displacement field*relative to a reference measureσ\\sigmathrough a logarithmic map\. Specifically, it embeds eachμ\\muas a vector fieldvμ∈L2​\(σ;ℝd\)v\_\{\\mu\}\\in L^\{2\}\(\\sigma;\\mathbb\{R\}^\{d\}\)defined pointwise by

logσ⁡\(μ\)​\(x\)=vμ​\(x\):=Tσ→μ​\(x\)−x,x∈𝒳\.\\log\_\{\\sigma\}\(\\mu\)\(x\)\\;=\\;v\_\{\\mu\}\(x\)\\;:=\\;T\_\{\\sigma\\to\\mu\}\(x\)\-x,\\qquad x\\in\\mathcal\{X\}\.\(5\)Under this embedding, all snapshots in𝒟\\mathcal\{D\}become elements of the same Hilbert space\. Moreover, whenμ\\muandν\\nulie sufficiently close toσ\\sigmain Wasserstein space, squared Wasserstein distances are well\-approximated by squaredL2​\(σ\)L^\{2\}\(\\sigma\)distances between their displacement fields, i\.e\.W2​\(μ,ν\)2≈‖vμ−vν‖L2​\(σ\)2W\_\{2\}\(\\mu,\\nu\)^\{2\}\\approx\\\|v\_\{\\mu\}\-v\_\{\\nu\}\\\|\_\{L^\{2\}\(\\sigma\)\}^\{2\}\.

This motivates choosingσ\\sigmaas the Wasserstein barycenter of the observed snapshots\{μ^ti\}i=1N\\\{\\hat\{\\mu\}\_\{t\_\{i\}\}\\\}\_\{i=1\}^\{N\}, see[SectionA\.1](https://arxiv.org/html/2605.30625#A1.SS1)for the definition\.

Computing the logarithmic map\.To represent each snapshot via the LOT/log maplogσ⁡\(μ^ti\)\\log\_\{\\sigma\}\(\\hat\{\\mu\}\_\{t\_\{i\}\}\), we discretize the reference measureσ\\sigmaas a weighted point cloud withMMlandmarks, approximating it via the empirical measure∑j=1Mwj​δzj\\sum\_\{j=1\}^\{M\}w\_\{j\}\\,\\delta\_\{z\_\{j\}\}\. Equivalently, it can be represented via a matrix𝐙σ∈ℝM×d\\mathbf\{Z\}\_\{\\sigma\}\\in\\mathbb\{R\}^\{M\\times d\}and a probability weight vector𝐰=\(w1,…,wM\)∈ΔM,\\mathbf\{w\}=\(w\_\{1\},\\dots,w\_\{M\}\)\\in\\Delta\_\{M\},whereΔM:=\{𝐰∈ℝ\+M:∑j=1Mwj=1\}\\Delta\_\{M\}:=\\\{\\mathbf\{w\}\\in\\mathbb\{R\}\_\{\+\}^\{M\}:\\sum\_\{j=1\}^\{M\}w\_\{j\}=1\\\}\. Similarly, each empirical distributionμti\\mu\_\{t\_\{i\}\}at timetit\_\{i\}is represented as a matrix𝐙i∈ℝni×d\\mathbf\{Z\}\_\{i\}\\in\\mathbb\{R\}^\{n\_\{i\}\\times d\}, with corresponding weight vector𝐚\(i\)=\(a1\(i\),…,ani\(i\)\)∈Δni\\mathbf\{a\}^\{\(i\)\}=\(a^\{\(i\)\}\_\{1\},\\dots,a^\{\(i\)\}\_\{n\_\{i\}\}\)\\in\\Delta\_\{n\_\{i\}\}\.

For eachi∈\{1,…,N\}i\\in\\\{1,\\dots,N\\\}, we solve a discrete optimal transport problem between\(𝐙σ,𝐰\)\(\\mathbf\{Z\}\_\{\\sigma\},\\mathbf\{w\}\)and\(𝐙i,𝐚\(i\)\)\(\\mathbf\{Z\}\_\{i\},\\mathbf\{a\}^\{\(i\)\}\), yielding a coupling𝜸\(i\)∈ℝ\+M×ni\\boldsymbol\{\\gamma\}^\{\(i\)\}\\in\\mathbb\{R\}\_\{\+\}^\{M\\times n\_\{i\}\}satisfying the marginal constraints

𝜸\(i\)​𝟏ni=𝐰,\(𝜸\(i\)\)⊤​𝟏M=𝐚\(i\)\.\\boldsymbol\{\\gamma\}^\{\(i\)\}\\mathbf\{1\}\_\{n\_\{i\}\}=\\mathbf\{w\},\\qquad\(\\boldsymbol\{\\gamma\}^\{\(i\)\}\)^\{\\top\}\\mathbf\{1\}\_\{M\}=\\mathbf\{a\}^\{\(i\)\}\.The entryγj​k\(i\)\\gamma^\{\(i\)\}\_\{jk\}represents the amount of mass transported from thejj\-th reference landmarkzjz\_\{j\}to thekk\-th target pointxk\(i\)x^\{\(i\)\}\_\{k\}\. Since discrete couplings generally allow mass splitting, a deterministic Monge map \(as in[Equation4](https://arxiv.org/html/2605.30625#S4.E4)\) is not strictly defined\. We therefore approximateTσ→μ^tiT\_\{\\sigma\\to\\hat\{\\mu\}\_\{t\_\{i\}\}\}via the*barycentric projection*, which maps each reference landmarkzjz\_\{j\}to the weighted barycenter of its assigned target mass:

T^\(i\)​\(zj\)=1∑k=1niγj​k\(i\)​∑k=1niγj​k\(i\)​\(𝐙i\)k=1wj​∑k=1niγj​k\(i\)​\(𝐙i\)k,\\hat\{T\}^\{\(i\)\}\(z\_\{j\}\)=\\frac\{1\}\{\\sum\_\{k=1\}^\{n\_\{i\}\}\\gamma^\{\(i\)\}\_\{jk\}\}\\sum\_\{k=1\}^\{n\_\{i\}\}\\gamma^\{\(i\)\}\_\{jk\}\\,\(\\mathbf\{Z\}\_\{i\}\)\_\{k\}=\\frac\{1\}\{w\_\{j\}\}\\sum\_\{k=1\}^\{n\_\{i\}\}\\gamma^\{\(i\)\}\_\{jk\}\\,\(\\mathbf\{Z\}\_\{i\}\)\_\{k\},\(6\)where\(𝐙i\)k\(\\mathbf\{Z\}\_\{i\}\)\_\{k\}denotes thekk\-th row of𝐙i\\mathbf\{Z\}\_\{i\}and the last equality uses∑kγj​k\(i\)=wj\\sum\_\{k\}\\gamma^\{\(i\)\}\_\{jk\}=w\_\{j\}\. In matrix form, the mapped landmarks are𝐙^i=diag\(𝜸\(i\)𝟏ni\)−1𝜸\(i\)𝐙i=diag\(𝐰\)−1𝜸\(i\)𝐙i\\hat\{\\mathbf\{Z\}\}\_\{i\}=\\operatorname\{diag\}\(\\boldsymbol\{\\gamma\}^\{\(i\)\}\\mathbf\{1\}\_\{n\_\{i\}\}\)^\{\-1\}\\boldsymbol\{\\gamma\}^\{\(i\)\}\\mathbf\{Z\}\_\{i\}=\\operatorname\{diag\}\(\\mathbf\{w\}\)^\{\-1\}\\boldsymbol\{\\gamma\}^\{\(i\)\}\\mathbf\{Z\}\_\{i\}

Finally, the linearized representation of theii\-th snapshot is given by the displacement field on the reference landmarks,

𝐕i=𝐙^i−𝐙σ∈ℝM×d\.\\mathbf\{V\}\_\{i\}=\\hat\{\\mathbf\{Z\}\}\_\{i\}\-\\mathbf\{Z\}\_\{\\sigma\}\\in\\mathbb\{R\}^\{M\\times d\}\.

### 4\.2Low\-dimensional representation

Following the LOT projection, each snapshotμ^ti\\hat\{\\mu\}\_\{t\_\{i\}\}is represented by a displacement matrix𝐕i∈ℝM×d\\mathbf\{V\}\_\{i\}\\in\\mathbb\{R\}^\{M\\times d\}\. Our objective is now to obtain a low\-dimensional representation from these matrices\. Since LOT linearizes𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)into the tangent space equipped with theL2​\(σ\)L^\{2\}\(\\sigma\)geometry, we perform dimensionality reduction using the corresponding discrete inner product induced by the reference weights𝐰\\mathbf\{w\}\. Let𝐒:=diag⁡\(w1,…,wM\)∈ℝM×M\\mathbf\{S\}:=\\operatorname\{diag\}\(\\sqrt\{w\_\{1\}\},\\dots,\\sqrt\{w\_\{M\}\}\)\\in\\mathbb\{R\}^\{M\\times M\}and denote the reweighted displacement field𝐕~i:=𝐒​𝐕i\\tilde\{\\mathbf\{V\}\}\_\{i\}\\;:=\\;\\mathbf\{S\}\\,\\mathbf\{V\}\_\{i\}\. We flatten the𝐕~i\\tilde\{\\mathbf\{V\}\}\_\{i\}row\-wise into𝐯~i:=vec⁡\(𝐕~i\)∈ℝD\\tilde\{\\mathbf\{v\}\}\_\{i\}:=\\operatorname\{vec\}\(\\tilde\{\\mathbf\{V\}\}\_\{i\}\)\\in\\mathbb\{R\}^\{D\}, whereD=M​dD=Md, and center𝐯¯=1N​∑i=1N𝐯~i\{\\bar\{\\mathbf\{v\}\}\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\tilde\{\\mathbf\{v\}\}\_\{i\}\.

Assuming that we want to obtain aK−K\-dimension representation, let𝐔~K∈ℝD×K\\tilde\{\\mathbf\{U\}\}\_\{K\}\\in\\mathbb\{R\}^\{D\\times K\}be the topKKPCA directions of\{𝐯~i−𝐯¯\}i=1N\\\{\\tilde\{\\mathbf\{v\}\}\_\{i\}\-\\bar\{\\mathbf\{v\}\}\\\}\_\{i=1\}^\{N\}\. The low\-dimensional representations are then obtained as follows:

𝐜i:=𝐔~K⊤​\(𝐯~i−𝐯¯\)∈ℝK\.\\mathbf\{c\}\_\{i\}\\;:=\\;\\tilde\{\\mathbf\{U\}\}\_\{K\}^\{\\top\}\(\\tilde\{\\mathbf\{v\}\}\_\{i\}\-\\bar\{\{\\mathbf\{v\}\}\}\)\\in\\mathbb\{R\}^\{K\}\.\(7\)which yields a dataset𝒟~=\{\(ti,𝐜i\)\}i=1N\\tilde\{\\mathcal\{D\}\}=\\\{\(t\_\{i\},\\mathbf\{c\}\_\{i\}\)\\\}\_\{i=1\}^\{N\}\.

### 4\.3Modeling uncertainty with Gaussian Processes

With the low\-dimensional temporal snapshots in𝒟~\\tilde\{\\mathcal\{D\}\}obtained via LOT, our objective is to construct a continuous probabilistic mapping that allows for interpolation via the observed data while quantifying epistemic uncertainty\. To satisfy these two desiderata, we consider a surrogate model based on Gaussian Processes \(GP\)\.

##### Priors and observation model\.

We model the temporal evolution of the latent state𝐜∈ℝK\\mathbf\{c\}\\in\\mathbb\{R\}^\{K\}as a vector\-valued function𝐟:\[0,1\]→ℝK\\mathbf\{f\}:\[0,1\]\\to\\mathbb\{R\}^\{K\}governed by a Multi\-Output Gaussian Process \(MOGP\)\. We assume a zero\-mean prior with a matrix\-valued kernel𝐊​\(t,t′\)\\mathbf\{K\}\(t,t^\{\\prime\}\):

𝐟∼𝒢​𝒫​\(𝟎,𝐊​\(⋅,⋅\)\),\\mathbf\{f\}\\sim\\mathcal\{GP\}\\left\(\\mathbf\{0\},\\mathbf\{K\}\(\\cdot,\\cdot\)\\right\),\(8\)where𝐊​\(t,t′\)∈ℝK×K\\mathbf\{K\}\(t,t^\{\\prime\}\)\\in\\mathbb\{R\}^\{K\\times K\}is a positive semi\-definite kernel matrix\. The entry\[𝐊​\(t,t′\)\]j​j′\[\\mathbf\{K\}\(t,t^\{\\prime\}\)\]\_\{jj^\{\\prime\}\}encodes the covariance between thejj\-th andj′j^\{\\prime\}\-th latent dimensions at timesttandt′t^\{\\prime\}\. This general formalism allows us to capture correlations between principal components if desired, though one may also proceed with the simplifying assumption of independence, in which case𝐊​\(t,t′\)\\mathbf\{K\}\(t,t^\{\\prime\}\)is diagonal \(as we do in[Section5](https://arxiv.org/html/2605.30625#S5)\)\.

We assume the observed coefficients𝐜i\\mathbf\{c\}\_\{i\}are noisy realizations of the latent trajectory:

𝐜i=𝐟​\(ti\)\+ϵi,ϵi∼𝒩​\(𝟎,𝚺obs\)\.\\mathbf\{c\}\_\{i\}=\\mathbf\{f\}\(t\_\{i\}\)\+\\boldsymbol\{\\epsilon\}\_\{i\},\\quad\\boldsymbol\{\\epsilon\}\_\{i\}\\sim\\mathcal\{N\}\(\\mathbf\{0\},\\boldsymbol\{\\Sigma\}\_\{\\text\{obs\}\}\)\.\(9\)Here,𝚺obs\\boldsymbol\{\\Sigma\}\_\{\\text\{obs\}\}captures aleatoric uncertainty arising from finite\-sample approximation error in the empirical snapshotsμ^ti\\hat\{\\mu\}\_\{t\_\{i\}\}and potential measurement noise which propagate to the LOT embeddings\.

##### Posterior inference\.

Conditioned on the dataset𝒟~\\tilde\{\\mathcal\{D\}\}, the predictive posterior distribution for the latent trajectory𝐟\\mathbf\{f\}at a query timettis a multivariate Gaussian:

p​\(𝐟​\(t\)∣𝒟~\)=𝒩​\(𝐦​\(t\),𝐒​\(t\)\)\.p\(\\mathbf\{f\}\(t\)\\mid\\tilde\{\\mathcal\{D\}\}\)=\\mathcal\{N\}\\left\(\\mathbf\{m\}\(t\),\\mathbf\{S\}\(t\)\\right\)\.\(10\)Here,𝐦​\(t\)∈ℝK\\mathbf\{m\}\(t\)\\in\\mathbb\{R\}^\{K\}denotes the predicted mean vector, while the predictive covariance matrix𝐒​\(t\)∈ℝK×K\\mathbf\{S\}\(t\)\\in\\mathbb\{R\}^\{K\\times K\}explicitly quantifies the joint epistemic uncertainty of the latent dimensions at timett\. The diagonal elements of𝐒​\(t\)\\mathbf\{S\}\(t\)correspond to the variances of individual components, while off\-diagonal elements capture their posterior correlations\.

##### Reconstructing distributions from the surrogate\.

To obtain the predicted measureμt\\mu\_\{t\}at a query timett, we invert the linearization pipeline\. Given a latent state𝐜^t∈ℝK\\hat\{\\mathbf\{c\}\}\_\{t\}\\in\\mathbb\{R\}^\{K\}\(e\.g\. the posterior mean𝐦​\(t\)\\mathbf\{m\}\(t\)or a sample drawn from the GP posterior\), we first map it back via the inverse PCA projection𝐲t=𝐔~K​𝐜^t\+𝐯¯\\mathbf\{y\}\_\{t\}=\\tilde\{\\mathbf\{U\}\}\_\{K\}\\hat\{\\mathbf\{c\}\}\_\{t\}\+\\bar\{\\mathbf\{v\}\}\. This vector is reshaped into𝐘t∈ℝM×d\\mathbf\{Y\}\_\{t\}\\in\\mathbb\{R\}^\{M\\times d\}, and the displacement field is recovered as𝐕^t=𝐒−1​𝐘t\\hat\{\\mathbf\{V\}\}\_\{t\}=\\mathbf\{S\}^\{\-1\}\\mathbf\{Y\}\_\{t\}\. In our discrete setting, the reconstructed measure is a point cloud supported on locations𝐙^t=𝐙σ\+𝐕^t\\hat\{\\mathbf\{Z\}\}\_\{t\}=\\mathbf\{Z\}\_\{\\sigma\}\+\\hat\{\\mathbf\{V\}\}\_\{t\}, carrying the same weights asσ\\sigma\.

### 4\.4Handling non\-stationarity via time warping

Standard covariance kernels \(e\.g\., RBF or Matérn\) used with GPs are typicallystationary, implying that the correlation structure depends only on the time difference, i\.e\.,𝐊​\(t,t′\)=𝐊base​\(\|t−t′\|\)\\mathbf\{K\}\(t,t^\{\\prime\}\)=\\mathbf\{K\}\_\{\\text\{base\}\}\(\|t\-t^\{\\prime\}\|\)\. This encodes the assumption that the rate of change of the modeled process is constant over time\. However, in domains such as biology, processes can be inherently non\-stationary\. For example, during cellular differentiation, cells often undergo rapid transcriptional bursts followed by prolonged periods of homeostasis\(Kumaret al\.,[2015](https://arxiv.org/html/2605.30625#bib.bib36)\)\.

Intrinsic time parametrization\.To address this, we model the dynamics in anintrinsic temporal domainwhere the rate of distributional change is constant\. We define a warping functionΦ:\[0,1\]→ℝ\+\\Phi:\[0,1\]\\to\\mathbb\{R\}^\{\+\}that maps physical timettto an intrinsic timeτ\\tau, representing the cumulative arc\-length of the trajectory on𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)\. This is given by the integral of the metric speedτ=Φ​\(t\):=∫0t‖μ˙u‖W2​𝑑u\\tau=\\Phi\(t\):=\\int\_\{0\}^\{t\}\\\|\\dot\{\\mu\}\_\{u\}\\\|\_\{W\_\{2\}\}\\,duwhere‖μ˙u‖W2:=limh→0W2​\(μu\+h,μu\)\|h\|\\\|\\dot\{\\mu\}\_\{u\}\\\|\_\{W\_\{2\}\}:=\\lim\_\{h\\to 0\}\\frac\{W\_\{2\}\(\\mu\_\{u\+h\},\\mu\_\{u\}\)\}\{\|h\|\}\. In this warped domain, the distribution evolves at unit speed with respect to the Wasserstein metric \(see[SectionA\.2](https://arxiv.org/html/2605.30625#A1.SS2)\)\. Consequently, a stationary kernel operating on warped inputs effectively induces a non\-stationary kernel on physical time\.

Approximation and inference\.Since the continuous curve is unknown, we approximateΦ​\(t\)\\Phi\(t\)using the discrete empirical snapshots\. Without loss of generality, we assume the timepoints are sorted before computing the warping\. We compute cumulative distances onτ^i=∑j=2iW2​\(μ^tj−1,μ^tj\)\\hat\{\\tau\}\_\{i\}=\\sum\_\{j=2\}^\{i\}W\_\{2\}\(\\hat\{\\mu\}\_\{t\_\{j\-1\}\},\\hat\{\\mu\}\_\{t\_\{j\}\}\)fori=2,…,Ni=2,\\dots,Nwithτ^1=0\\hat\{\\tau\}\_\{1\}=0\. To obtain a continuous mapping for any candidate timett, we fit a monotonic cubic spline to the pairs\{\(ti,τ^i\)\}i=1N\\\{\(t\_\{i\},\\hat\{\\tau\}\_\{i\}\)\\\}\_\{i=1\}^\{N\}\.

We incorporate this warping into the GP framework \([Section4\.3](https://arxiv.org/html/2605.30625#S4.SS3)\) by defining the kernel𝐊\\mathbf\{K\}as the composition of a base stationary kernel𝐊base\\mathbf\{K\}\_\{\\text\{base\}\}and the mappingΦ\\Phi\. Specifically:

𝐊​\(t,t′\)=𝐊base​\(Φ​\(t\),Φ​\(t′\)\)\.\\mathbf\{K\}\(t,t^\{\\prime\}\)=\\mathbf\{K\}\_\{\\text\{base\}\}\\big\(\\Phi\(t\),\\Phi\(t^\{\\prime\}\)\\big\)\.\(11\)This induces a non\-stationary kernel on physical timettthat naturally adapts its effective lengthscale to the local speed of the distributional evolution\.

### 4\.5Acquisition functions

At any iteration of the active learning loop, we select the next measurement time by maximizing an acquisition functionα​\(t;𝒟\)\\alpha\(t;\\mathcal\{D\}\)over a candidate pool𝒯pool\\mathcal\{T\}\_\{\\mathrm\{pool\}\}:

t∗=arg⁡maxt∈𝒯pool⁡α​\(t;𝒟\)\.t^\{\*\}\\;=\\;\\arg\\max\_\{t\\in\\mathcal\{T\}\_\{\\mathrm\{pool\}\}\}\\alpha\(t;\\mathcal\{D\}\)\.\(12\)The acquisition function encodes the criterion used to decide where the next observation is expected to be most informative\. Since our non\-stationarity handling is expressed in intrinsic time, all acquisition scores are evaluated through the warped locationτ=Φ​\(t\)\\tau=\\Phi\(t\)\. Letτ^max:=Φ​\(1\)\\hat\{\\tau\}\_\{\\max\}:=\\Phi\(1\)and let𝐒​\(τ\)∈ℝK×K\\mathbf\{S\}\(\\tau\)\\in\\mathbb\{R\}^\{K\\times K\}denote the GP predictive covariance of the latent coefficients at intrinsic timeτ\\taugiven the current dataset𝒟\\mathcal\{D\}\.

##### Example 1: point\-wise uncertainty\.

A simple acquisition strategy is to query where the model is locally most uncertain\. This yields the score

αunc​\(t;𝒟\)=Tr​\(𝐒​\(Φ​\(t\)\)\)\.\\alpha\_\{\\mathrm\{unc\}\}\(t;\\mathcal\{D\}\)\\;=\\;\\mathrm\{Tr\}\\\!\\left\(\\mathbf\{S\}\(\\Phi\(t\)\)\\right\)\.\(13\)This criterion favors measurement times whose corresponding intrinsic locations have high posterior epistemic uncertainty\.

##### Example 2: expected integrated risk reduction\.

Another possibility is to query the point expected to maximally reduce uncertainty along the entire intrinsic\-time domain\. We quantify the current global epistemic uncertainty by

𝒰​\(𝒟\)=∫0τ^maxTr​\(𝐒​\(τ′\)\)​𝑑τ′\.\\mathcal\{U\}\(\\mathcal\{D\}\)=\\int\_\{0\}^\{\\hat\{\\tau\}\_\{\\max\}\}\\mathrm\{Tr\}\\\!\\left\(\\mathbf\{S\}\(\\tau^\{\\prime\}\)\\right\)\\,d\\tau^\{\\prime\}\.\(14\)For a candidate query at physical timett, letτ=Φ​\(t\)\\tau=\\Phi\(t\)and let𝐂​\(τ′,τ\)∈ℝK×K\\mathbf\{C\}\(\\tau^\{\\prime\},\\tau\)\\in\\mathbb\{R\}^\{K\\times K\}denote the posterior cross\-covariance between the latent states atτ′\\tau^\{\\prime\}andτ\\tau\. Under the GP posterior update, the covariance atτ′\\tau^\{\\prime\}after observing atτ\\tauis

𝐒~​\(τ′\)=𝐒​\(τ′\)−𝐂​\(τ′,τ\)​\(𝐒​\(τ\)\+𝚺obs\)−1​𝐂​\(τ,τ′\)\.\\widetilde\{\\mathbf\{S\}\}\(\\tau^\{\\prime\}\)=\\mathbf\{S\}\(\\tau^\{\\prime\}\)\-\\mathbf\{C\}\(\\tau^\{\\prime\},\\tau\)\\big\(\\mathbf\{S\}\(\\tau\)\+\\boldsymbol\{\\Sigma\}\_\{\\mathrm\{obs\}\}\\big\)^\{\-1\}\\mathbf\{C\}\(\\tau,\\tau^\{\\prime\}\)\.\(15\)The expected integrated risk reduction acquisition is then

αEIRR​\(t;𝒟\)\\displaystyle\\alpha\_\{\\mathrm\{EIRR\}\}\(t;\\mathcal\{D\}\):=𝒰\(𝒟\)−𝒰\(𝒟∪\{\(t,\.\)\}\)\\displaystyle:=\\;\\mathcal\{U\}\(\\mathcal\{D\}\)\-\\mathcal\{U\}\(\\mathcal\{D\}\\cup\\\{\(t,\.\)\\\}\)=∫0τ^maxTr\(𝐂\(τ′,Φ\(t\)\)\(𝐒\(Φ\(t\)\)\+𝚺obs\)−1\\displaystyle=\\;\\int\_\{0\}^\{\\hat\{\\tau\}\_\{\\max\}\}\\mathrm\{Tr\}\\bigg\(\\mathbf\{C\}\(\\tau^\{\\prime\},\\Phi\(t\)\)\\,\\big\(\\mathbf\{S\}\(\\Phi\(t\)\)\+\\boldsymbol\{\\Sigma\}\_\{\\text\{obs\}\}\\big\)^\{\-1\}×𝐂\(τ′,Φ\(t\)\)⊤\)dτ′\.\\displaystyle\\qquad\\qquad\\qquad\\times\\mathbf\{C\}\(\\tau^\{\\prime\},\\Phi\(t\)\)^\{\\top\}\\bigg\)\\,d\\tau^\{\\prime\}\.\(16\)

## 5Experiments

![Refer to caption](https://arxiv.org/html/2605.30625v1/x3.png)
![Refer to caption](https://arxiv.org/html/2605.30625v1/x4.png)

Figure 3:\(Left\)Visualization of the synthetic data projected in 2D\. The trajectory is non\-stationary with two distinct branching events \(marked\)\.\(Right\)Reconstruction performance as a function of the acquisition budget\. We report the mean Wasserstein error and its velocity\-weighted variant \(w\-W2W\_\{2\}\)\. The results are averaged over 5 seeds, and the vertical axes are presented on a logarithmic scale\.In this section, we empirically evaluate the effectiveness of our active learning framework\. We first assess our method on a synthetic dataset designed to mimic branching events\. We then demonstrate its real\-world utility using a large\-scale single\-cell transcriptomics dataset\.

### 5\.1Experimental setup

Baselines\.To quantify the benefits of uncertainty\-aware sampling, we compare our strategy against two standard uncertainty\-agnostic baselines:Random\.Measurement times are sampled uniformly at random from the candidate pool without replacement\.Uniform\.A deterministic baseline where time points are selected on a fixed equidistant grid across the temporal domain\[0,1\]\[0,1\]\. Results with additional baselines can be found in[SectionC\.1](https://arxiv.org/html/2605.30625#A3.SS1)\.

Surrogate model\.To ensure a fair comparison, all acquisition strategies employ the identical probabilistic surrogate configurations to generate predictions and reconstruct the probability path\. We use a Multi\-Output Gaussian Process with independent GPs for each latent dimension, equipped with a Matérn5/25/2kernel\. Gaussian Process hyperparameters are optimized by maximizing the marginal log\-likelihood \(see[SectionB\.4](https://arxiv.org/html/2605.30625#A2.SS4)\)\. Our active method uses the uncertainty\-based acquisition functionαunc\\alpha\_\{\\mathrm\{unc\}\}\. Unless otherwise specified, all methods incorporate the intrinsic time\-warping strategy described in[Section4\.4](https://arxiv.org/html/2605.30625#S4.SS4)to account for non\-stationary dynamics\. All the methods also set the reference to be the Wasserstein barycenter of the observed distributions\.

### 5\.2Synthetic experiment

##### Data\.

We compare the methods on a synthetic dataset designed to mimic non\-stationary developmental trajectories with transient branching events\. Each sample is a time series generated by simulating a low\-dimensional latent process𝐳​\(t\)∈ℝdz\\mathbf\{z\}\(t\)\\in\\mathbb\{R\}^\{d\_\{z\}\}following an Ornstein–Uhlenbeck–type SDE with a time\-dependent target mean that induces two sequential bifurcations over time windows\(a1,b1\)\(a\_\{1\},b\_\{1\}\)and\(a2,b2\)\(a\_\{2\},b\_\{2\}\)\. We also add shared oscillatory motion localized to the branching windows\. Observations are obtained as𝐱​\(t\)=𝐐𝐳​\(t\)\\mathbf\{x\}\(t\)=\\mathbf\{Q\}\\mathbf\{z\}\(t\)with𝐱​\(t\)∈ℝ10\\mathbf\{x\}\(t\)\\in\\mathbb\{R\}^\{10\}and𝐐∈ℝ10×2\\mathbf\{Q\}\\in\\mathbb\{R\}^\{10\\times 2\}having orthonormal columns\. The dataset can be visualized in[Figure3](https://arxiv.org/html/2605.30625#S5.F3), with details in[SectionB\.3](https://arxiv.org/html/2605.30625#A2.SS3)\.

##### Methodology\.

We define an initial candidate pool of times𝒯pool\\mathcal\{T\}\_\{\\text\{pool\}\}consisting of5050time points regularly spaced in the interval\[0,1\]\[0,1\]\. We consider varying acquisition budgets up to2121\. For each method, once the budget is exhausted, we fit the surrogate model\. This surrogate model is then used to obtain predicted distributions\{μ^t\}t∈\[0,1\]\\\{\\hat\{\\mu\}\_\{t\}\\\}\_\{t\\in\[0,1\]\}\(cf\.[Section4\.3](https://arxiv.org/html/2605.30625#S4.SS3)\)\. We assess the quality of the inferred trajectory against the ground truth\{μt\}t∈\[0,1\]\\\{\\mu\_\{t\}\\\}\_\{t\\in\[0,1\]\}on a test set𝒯test\\mathcal\{T\}\_\{\\text\{test\}\}of times\. We report the Mean Wasserstein Error, defined as1\|𝒯test\|​∑t∈𝒯testW22​\(μt,μ^t\)\\frac\{1\}\{\|\\mathcal\{T\}\_\{\\text\{test\}\}\|\}\\sum\_\{t\\in\\mathcal\{T\}\_\{\\text\{test\}\}\}W\_\{2\}^\{2\}\(\\mu\_\{t\},\\hat\{\\mu\}\_\{t\}\)\. Additionally, to provide more granularity on the errors, we report a velocity\-weighted mean Wasserstein error \(w\-W2W\_\{2\}\)\. This metric weighs the error at each test time point by the instantaneous speed of the ground truth process, i\.e\. we define w\-W2=1S​∑t∈𝒯test‖μ˙t‖⋅W22​\(μt,μ^t\)W\_\{2\}=\\frac\{1\}\{S\}\\sum\_\{t\\in\\mathcal\{T\}\_\{\\text\{test\}\}\}\\\|\\dot\{\\mu\}\_\{t\}\\\|\\cdot W\_\{2\}^\{2\}\(\\mu\_\{t\},\\hat\{\\mu\}\_\{t\}\)where‖μ˙t‖\\\|\\dot\{\\mu\}\_\{t\}\\\|represents a metric speed of the true probability path at timettandS=∑t∈𝒯test‖μ˙t‖S=\\sum\_\{t\\in\\mathcal\{T\}\_\{\\text\{test\}\}\}\\\|\\dot\{\\mu\}\_\{t\}\\\|is the normalization constant\. This metric therefore puts more weight on test times where the dynamics are high\.

##### Results\.

We report these metrics in[Figure3](https://arxiv.org/html/2605.30625#S5.F3), as a function of the budget, for55seeds\. OurActivestrategy consistently outperforms both theUniformandRandombaselines across the acquisition budgets\. Crucially, the performance gap widens after a budget of55, suggesting that our active learning method starts exploiting by querying difficult regions\. This is also supported by the velocity\-weighted metrics, confirming that the acquisition function successfully identifies and targets the non\-stationary regions of the trajectory, specifically the rapid branching events, where the interpolation error is naturally highest\. While theUniformbaseline improves stepwise as the grid density increases, the active approach provides a more sample\-efficient strategy for low to moderate budgets\.

##### Qualitative analysis\.

To investigate the mechanism driving this efficiency, we visualize the temporal allocation of measurements in[Figure4](https://arxiv.org/html/2605.30625#S5.F4)\. The top panel displays the instantaneous metric speed‖μ˙t‖\\\|\\dot\{\\mu\}\_\{t\}\\\|of the ground truth trajectory, revealing a highly non\-stationary process with sharp peaks corresponding to the sequential branching events described in[SectionB\.3](https://arxiv.org/html/2605.30625#A2.SS3)\. The bottom panel illustrates the specific time points selected by each strategy\. We see that theActivestrategy exhibits strong adaptivity\. It concentrates its later acquisitions in the high\-velocity windows \(approximatelyt∈\[0\.3,0\.4\]t\\in\[0\.3,0\.4\]andt∈\[0\.7,0\.8\]t\\in\[0\.7,0\.8\]\)\.

##### Sensitivity analysis\.

Building on this observation, we hypothesize that our active learning method is most effective in scenarios characterized by localized heterogeneity in metric speed\.

To further validate our findings, we perform a sensitivity analysis by varying the duration of the branching events\. While the baseline configuration utilized a duration of0\.050\.05\(corresponding to the interval\[0\.3,0\.35\]\[0\.3,0\.35\]\), we also evaluate wider intervals with durations of0\.100\.10and0\.200\.20\. For each configuration, we compute the average relative improvement of the active method over the uniform and random baselines, aggregated across all budgets and55seeds\. Results in[Table1](https://arxiv.org/html/2605.30625#S5.T1)show that the Active strategy yields the most significant gains when branching events are highly localized \(length0\.050\.05\)\. This advantage persists at0\.100\.10but diminishes as windows widen to0\.200\.20\. At this width, improvement over the Uniform baseline becomes negative for the mean Wasserstein error; however, the velocity\-weighted metric gap remains positive, indicating the method still effectively targets high\-velocity regions\. These results confirm that the strength of our active approach lies in its ability to precisely target sharp, transient dynamics that uniform sampling intervals are likely to miss\.

![Refer to caption](https://arxiv.org/html/2605.30625v1/x5.png)Figure 4:Visualization of adaptive timepoint selection\.\(Top\)The instantaneous metric speed‖μ˙t‖\\\|\\dot\{\\mu\}\_\{t\}\\\|of the ground truth trajectory\.\(Bottom\)Comparison of selected measurement times for a budget of 16\. Darker points indicate later acquisitions\.Table 1:Average relative improvement of the active method compared to the baselines across55seeds\.llcontrols the duration of the branching events\. Subscripts denote 90% CI\.

### 5\.3Application to real\-world datasets

##### Data\.

We evaluate our method on the large\-scale single\-cell RNA sequencing dataset fromSchiebingeret al\.\([2019](https://arxiv.org/html/2605.30625#bib.bib31)\), which tracks the reprogramming of mouse fibroblasts into induced pluripotent stem cells \(iPSCs\)\. We specifically focus on the serum culture subset, which exhibits non\-stationary developmental dynamics over an 18\-day period\. The dataset contains 39 distinct measurement times within the intervalt∈\[0,18\]t\\in\[0,18\]\. We project the gene expressions into a 20\-dimensional latent space with PCA, and we whiten the components\. We partition the snapshot timepoints into two interleaved disjoint sets, one serving as the candidate pool for timepoint selection and the other as a test set\.

Results\.We report the results in[Figure5](https://arxiv.org/html/2605.30625#S5.F5)\. The active strategy achieves the lowest reconstruction error in the low\-to\-moderate budget regime \(B≤12B\\leq 12\) where uniform grids are most likely to miss brief, rapidly changing phases of the underlying reprogramming dynamics\. As expected, as we increase the acquisition budget past this point, the performance gap between methods shrinks: once the schedule becomes dense, even uniform or random sampling is likely to cover most transient phases, reducing the advantage of active selection\. To summarize, the clearest advantage of our method is in thehigh\-precision, moderate\-budget regime,while the differences are smaller in the low\-precision regime and at very low or very high budgets\. Finally, we also report the computational cost of our method in[SectionC\.3](https://arxiv.org/html/2605.30625#A3.SS3)\.

![Refer to caption](https://arxiv.org/html/2605.30625v1/x6.png)Figure 5:Evaluation on the single\-cell reprogramming dataset from\(Schiebingeret al\.,[2019](https://arxiv.org/html/2605.30625#bib.bib31)\)
##### Additional dataset\.

In[SectionC\.4](https://arxiv.org/html/2605.30625#A3.SS4), we perform the same experiment on a labor market dataset from\(Floodet al\.,[2024](https://arxiv.org/html/2605.30625#bib.bib3)\), where we show that our method preferentially identifies and explores periods where the distribution changes most rapidly \(around the onset of the COVID\-19 pandemic\)\.

### 5\.4Ablations

![Refer to caption](https://arxiv.org/html/2605.30625v1/x7.png)
![Refer to caption](https://arxiv.org/html/2605.30625v1/x8.png)

Figure 6:Ablation study\.We evaluate the impact of: \(i\) replacing the Matérn\-5/2 kernel with anRBFkernel; \(ii\) fixing the LOT referenceσ\\sigma; \(iii\) reducing the PCA basis rank toK=2K=2; and \(iv\) disabling theintrinsic time warpingstrategy \(No warp\)\.We ablate four components of our surrogate/acquisition pipeline: \(i\) replace the default Matérn\-5/25/2GP kernel with an RBF kernel; \(ii\) fix the LOT referenceσ\\sigmato the initially observed snapshot \(no Wasserstein\-barycenter refitting at each iteration\); \(iii\) reduce the PCA basis rank toK=2K=2; and \(iv\) disable intrinsic\-time warping\.

Results\.We report the results for both the synthetic and the fibroblast reprogramming dataset in[Figure6](https://arxiv.org/html/2605.30625#S5.F6)\. We observe that reducing the latent dimension toK=2K=2results in the most significant performance drop, confirming that a sufficiently high\-dimensional latent representation is critical for capturing complex transcriptomic variations\. Fixing the reference distributionσ\\sigmato the initial snapshot leads to considerably higher error compared to the full method on the synthetic dataset, demonstrating the importance of updating reference to minimize the linearization error\. We provide a detailed analysis on the role of the reference in[SectionC\.2](https://arxiv.org/html/2605.30625#A3.SS2)\. Disabling the intrinsic time warping degrades performance, particularly in the low\-budget regime, which validates the utility of our non\-stationary modeling approach\. Finally, the results with the RBF kernel show that different priors \(injected via the kernel\) can be used with our method\.

## 6Discussion

We presented a framework for active learning on measure\-valued trajectories that leverages Linearized Optimal Transport and Gaussian Processes to quantify uncertainty in Wasserstein space\. Our experiments demonstrate that this strategy outperforms uncertainty\-agnostic baselines by targeting high\-velocity regions, such as branching events in cellular differentiation\. A primary limitation of our approach is the reliance on the tangent space approximation\. While iteratively updating the reference measure mitigates this, large distributional shifts may still induce linearization errors\. Using multiple charts and atlas\-like construction with tools from Riemannian geometry is an interesting direction for future work\. Additionally, the main computational cost of our framework comes from the OT subroutines\. In the intended application regime of this paper, i\.e\. active acquisition settings with expensive snapshots and up to roughly10510^\{5\}samples per snapshot, this overhead is practical\. We do not claim that million\-point snapshots are fully solved: in that regime, both OT computation and memory become major constraints, and additional scalable OT approximations would be required\. Finally, future work could focus on incorporating multidimensional inputs in addition to time variables\.

## Acknowledgments

We thank the four anonymous ICML reviewers for their comments and suggestions\. NH thanks Illumina for their funding and support\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here\. Code to reproduce the experiments can be found at[https://github\.com/nicolashuynh/active\_wass](https://github.com/nicolashuynh/active_wass), and at the wider lab repository[https://github\.com/vanderschaarlab/active\_wass](https://github.com/vanderschaarlab/active_wass)\.

## References

- Y\. Achdou, J\. Han, J\. Lasry, P\. Lions, and B\. Moll \(2022\)Income and wealth distribution in macroeconomics: a continuous\-time approach\.The review of economic studies89\(1\),pp\. 45–86\.Cited by:[§1](https://arxiv.org/html/2605.30625#S1.p1.2)\.
- L\. Ambrosio, N\. Gigli, and G\. Savaré \(2005\)Gradient flows: in metric spaces and in the space of probability measures\.Springer\.Cited by:[§A\.2](https://arxiv.org/html/2605.30625#A1.SS2.p2.5),[§1](https://arxiv.org/html/2605.30625#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.30625#S4.SS1.SSS0.Px1.p1.7)\.
- J\. Benamou and Y\. Brenier \(2000\)A computational fluid mechanics solution to the monge\-kantorovich mass transfer problem\.Numerische Mathematik84\(3\),pp\. 375–393\.Cited by:[§1](https://arxiv.org/html/2605.30625#S1.p1.2)\.
- Y\. Brenier \(1991\)Polar factorization and monotone rearrangement of vector\-valued functions\.Communications on Pure and Applied Mathematics44,pp\. 375–417\.External Links:[Link](https://api.semanticscholar.org/CorpusID:123428953)Cited by:[§4\.1](https://arxiv.org/html/2605.30625#S4.SS1.SSS0.Px2.p1.9)\.
- S\. Flood, M\. King, R\. Rodgers, S\. Ruggles, J\. R\. Warren, D\. Backman, A\. Chen, G\. Cooper, S\. Richards, M\. Schouweiler,et al\.\(2024\)Ipums cps: version 12\.0 \[dataset\]\.Minneapolis, MN: IPUMS10,pp\. D030\.Cited by:[§C\.4](https://arxiv.org/html/2605.30625#A3.SS4.SSS0.Px1.p1.5),[§5\.3](https://arxiv.org/html/2605.30625#S5.SS3.SSS0.Px2.p1.1)\.
- Y\. Gal, R\. Islam, and Z\. Ghahramani \(2017\)Deep bayesian active learning with image data\.InInternational conference on machine learning,pp\. 1183–1192\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p1.1)\.
- L\. Haghverdi, M\. Büttner, F\. A\. Wolf, F\. Buettner, and F\. J\. Theis \(2016\)Diffusion pseudotime robustly reconstructs lineage branching\.Nature methods13\(10\),pp\. 845–848\.Cited by:[§1](https://arxiv.org/html/2605.30625#S1.p3.1)\.
- N\. Houlsby, F\. Huszár, Z\. Ghahramani, and M\. Lengyel \(2011\)Bayesian active learning for classification and preference learning\.arXiv preprint arXiv:1112\.5745\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p1.1)\.
- S\. Kolouri, A\. B\. Tosun, J\. A\. Ozolek, and G\. K\. Rohde \(2016\)A continuous linear optimal transport approach for pattern analysis in image datasets\.Pattern recognition51,pp\. 453–462\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p3.1)\.
- N\. Kumar, A\. Singh, and R\. V\. Kulkarni \(2015\)Transcriptional bursting in gene expression: analytical results for general stochastic models\.PLOS Computational Biology11\(10\),pp\. 1–22\.External Links:[Document](https://dx.doi.org/10.1371/journal.pcbi.1004292),[Link](https://doi.org/10.1371/journal.pcbi.1004292)Cited by:[§4\.4](https://arxiv.org/html/2605.30625#S4.SS4.p1.1)\.
- H\. C\. L\. Law, D\. J\. Sutherland, D\. Sejdinovic, and S\. Flaxman \(2018\)Bayesian approaches to distribution regression\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 1167–1176\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p2.1)\.
- J\. Lee, B\. Moradijamei, and H\. Shakeri \(2025\)Multi\-marginal stochastic flow matching for high\-dimensional snapshot data at irregular time points\.InInternational Conference on Machine Learning,pp\. 33476–33498\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p2.1),[§3\.3](https://arxiv.org/html/2605.30625#S3.SS3.p3.7)\.
- D\. D\. Lewis \(1995\)A sequential algorithm for training text classifiers: corrigendum and additional data\.InAcm Sigir Forum,Vol\.29,pp\. 13–19\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p1.1)\.
- \[14\]Y\. Lipman, R\. T\. Chen, H\. Ben\-Hamu, M\. Nickel, and M\. LeFlow matching for generative modeling\.InThe Eleventh International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.30625#S1.p3.1),[§2](https://arxiv.org/html/2605.30625#S2.p2.1)\.
- R\. J\. McCann \(1997\)A convexity principle for interacting gases\.Advances in mathematics128\(1\),pp\. 153–179\.Cited by:[§3\.3](https://arxiv.org/html/2605.30625#S3.SS3.p3.2)\.
- C\. Moosmüller and A\. Cloninger \(2023\)Linear optimal transport embedding: provable wasserstein classification for certain rigid transformations and perturbations\.Information and Inference: A Journal of the IMA12\(1\),pp\. 363–389\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p3.1)\.
- K\. Muandet, K\. Fukumizu, B\. Sriperumbudur, and B\. Scholkopf \(2017\)Kernel mean embedding of distributions: a review and beyond\.Foundations and Trends in Machine Learning10\(1\-2\),pp\. 1–141\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p2.1)\.
- B\. Póczos, A\. Singh, A\. Rinaldo, and L\. Wasserman \(2013\)Distribution\-free distribution regression\.Inartificial intelligence and statistics,pp\. 507–515\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p2.1)\.
- C\. E\. Rasmussen \(2003\)Gaussian processes in machine learning\.InSummer school on machine learning,pp\. 63–71\.Cited by:[§4](https://arxiv.org/html/2605.30625#S4.p1.1)\.
- M\. Rohbeck, E\. De Brouwer, C\. Bunne, J\. Huetter, A\. Biton, K\. Y\. Chen, A\. Regev, and R\. Lopez \(2025\)Modeling complex system dynamics with flow matching across time and conditions\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§C\.5](https://arxiv.org/html/2605.30625#A3.SS5.p1.1),[§1](https://arxiv.org/html/2605.30625#S1.p3.1),[§3\.3](https://arxiv.org/html/2605.30625#S3.SS3.p3.7)\.
- G\. Schiebinger, J\. Shu, M\. Tabaka, B\. Cleary, V\. Subramanian, A\. Solomon, J\. Gould, S\. Liu, S\. Lin, P\. Berube,et al\.\(2019\)Optimal\-transport analysis of single\-cell gene expression identifies developmental trajectories in reprogramming\.Cell176\(4\),pp\. 928–943\.Cited by:[§B\.3\.2](https://arxiv.org/html/2605.30625#A2.SS3.SSS2.Px1.p1.1),[Figure 5](https://arxiv.org/html/2605.30625#S5.F5),[Figure 5](https://arxiv.org/html/2605.30625#S5.F5.3.2),[§5\.3](https://arxiv.org/html/2605.30625#S5.SS3.SSS0.Px1.p1.1)\.
- E\. Schulz, M\. Speekenbrink, and A\. Krause \(2018\)A tutorial on gaussian process regression: modelling, exploring, and exploiting functions\.Journal of mathematical psychology85,pp\. 1–16\.Cited by:[§1](https://arxiv.org/html/2605.30625#S1.p3.1)\.
- H\. S\. Seung, M\. Opper, and H\. Sompolinsky \(1992\)Query by committee\.InProceedings of the fifth annual workshop on Computational learning theory,pp\. 287–294\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p1.1)\.
- R\. Singh, N\. Palmer, D\. Gifford, B\. Berger, and Z\. Bar\-Joseph \(2005\)Active learning for sampling in time\-series experiments with application to gene expression analysis\.InProceedings of the 22nd international conference on Machine learning,pp\. 832–839\.Cited by:[4th item](https://arxiv.org/html/2605.30625#A3.I1.i4.p1.1),[§2](https://arxiv.org/html/2605.30625#S2.p1.1)\.
- Z\. Szabó, B\. K\. Sriperumbudur, B\. Póczos, and A\. Gretton \(2016\)Learning theory for distribution regression\.Journal of Machine Learning Research17\(152\),pp\. 1–40\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p2.1)\.
- A\. Y\. Tong, N\. Malkin, K\. Fatras, L\. Atanackovic, Y\. Zhang, G\. Huguet, G\. Wolf, and Y\. Bengio \(2024\)Simulation\-free schrödinger bridges via score and flow matching\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 1279–1287\.Cited by:[§2](https://arxiv.org/html/2605.30625#S2.p2.1)\.
- C\. Trapnell, D\. Cacchiarelli, J\. Grimsby, P\. Pokharel, S\. Li, M\. Morse, N\. J\. Lennon, K\. J\. Livak, T\. S\. Mikkelsen, and J\. L\. Rinn \(2014\)The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells\.Nature biotechnology32\(4\),pp\. 381–386\.Cited by:[§1](https://arxiv.org/html/2605.30625#S1.p1.2)\.
- C\. Villani \(2021\)Topics in optimal transportation\.Vol\.58,American Mathematical Soc\.\.Cited by:[§3\.1](https://arxiv.org/html/2605.30625#S3.SS1.p1.9)\.
- A\. Wagner, A\. Regev, and N\. Yosef \(2016\)Revealing the vectors of cellular identity with single\-cell genomics\.Nature biotechnology34\(11\),pp\. 1145–1160\.Cited by:[§1](https://arxiv.org/html/2605.30625#S1.p1.2)\.
- T\. Wang, Y\. Ke, D\. Bhaskar, S\. Krishnaswamy, and A\. Cloninger \(2025\)Linearized optimal transport for analysis of high\-dimensional point\-cloud and single\-cell data\.arXiv preprint arXiv:2510\.22033\.Cited by:[§1](https://arxiv.org/html/2605.30625#S1.p4.3)\.
- W\. Wang, D\. Slepčev, S\. Basu, J\. A\. Ozolek, and G\. K\. Rohde \(2013\)A linear optimal transportation framework for quantifying and visualizing variations in sets of images\.International journal of computer vision101\(2\),pp\. 254–269\.Cited by:[§1](https://arxiv.org/html/2605.30625#S1.p4.3),[§2](https://arxiv.org/html/2605.30625#S2.p3.1),[§4](https://arxiv.org/html/2605.30625#S4.p1.1)\.
- C\. Williams and C\. Rasmussen \(1995\)Gaussian processes for regression\.Advances in neural information processing systems8\.Cited by:[§1](https://arxiv.org/html/2605.30625#S1.p3.1)\.
- C\. Ziegenhain, B\. Vieth, S\. Parekh, B\. Reinius, A\. Guillaumet\-Adkins, M\. Smets, H\. Leonhardt, H\. Heyn, I\. Hellmann, and W\. Enard \(2017\)Comparative analysis of single\-cell rna sequencing methods\.Molecular cell65\(4\),pp\. 631–643\.Cited by:[§1](https://arxiv.org/html/2605.30625#S1.p2.1)\.

## Appendix ATheory

### A\.1Wasserstein barycenter

In our framework, the reference measureσ\\sigmadefines the tangent space onto which all snapshots are projected\. To minimize the global linearization error, we defineσ\\sigmaas theWasserstein barycenterof the observed snapshots\{μ^ti\}i=1N\\\{\\hat\{\\mu\}\_\{t\_\{i\}\}\\\}\_\{i=1\}^\{N\}\.

Given the snapshots and a set of weights\{λi\}i=1N\\\{\\lambda\_\{i\}\\\}\_\{i=1\}^\{N\}such that∑λi=1\\sum\\lambda\_\{i\}=1\(typicallyλi=1N\\lambda\_\{i\}=\\frac\{1\}\{N\}\), the Wasserstein barycenter is the solution to the following optimization problem:

σ∈arg⁡minν∈𝒫2​\(𝒳\)​∑i=1Nλi​W22​\(ν,μ^ti\)\.\\sigma\\in\\arg\\min\_\{\\nu\\in\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)\}\\sum\_\{i=1\}^\{N\}\\lambda\_\{i\}W\_\{2\}^\{2\}\(\\nu,\\hat\{\\mu\}\_\{t\_\{i\}\}\)\.\(17\)
In the discrete setting where snapshots are empirical point clouds, the barycenterσ\\sigmais also a discrete measure\.

### A\.2Intrinsic time parametrization

In this section, we provide the formal derivation showing that the intrinsic time parametrization defined in[Section4\.4](https://arxiv.org/html/2605.30625#S4.SS4)induces a unit\-speed trajectory in the 2\-Wasserstein space𝒫2​\(𝒳\)\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)\.

Setup\.Letμ:\[0,1\]→𝒫2​\(𝒳\)\\mu:\[0,1\]\\to\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)be an absolutely continuous curve\. The metric derivative ofμ\\muat timettis defined as:

v​\(t\):=‖μ˙t‖W2=limh→0W2​\(μt\+h,μt\)\|h\|\.v\(t\):=\\\|\\dot\{\\mu\}\_\{t\}\\\|\_\{W\_\{2\}\}=\\lim\_\{h\\to 0\}\\frac\{W\_\{2\}\(\\mu\_\{t\+h\},\\mu\_\{t\}\)\}\{\|h\|\}\.\(18\)This limit exists for almost everyt∈\[0,1\]t\\in\[0,1\]\(Ambrosioet al\.,[2005](https://arxiv.org/html/2605.30625#bib.bib20)\)\. We assume the curve is regular, such thatv​\(t\)\>0v\(t\)\>0almost everywhere\.

The intrinsic time \(arc\-length\) warping functionΦ:\[0,1\]→\[0,L\]\\Phi:\[0,1\]\\to\[0,L\]is defined as:

τ=Φ​\(t\):=∫0tv​\(u\)​𝑑u,\\tau=\\Phi\(t\):=\\int\_\{0\}^\{t\}v\(u\)\\,du,\(19\)whereL=∫01v​\(u\)​𝑑uL=\\int\_\{0\}^\{1\}v\(u\)duis the total length of the curve\. Sincev​\(t\)\>0v\(t\)\>0,Φ\\Phiis strictly increasing and absolutely continuous, admitting a well\-defined inverseΨ:=Φ−1:\[0,L\]→\[0,1\]\\Psi:=\\Phi^\{\-1\}:\[0,L\]\\to\[0,1\]\.

Reparametrization\.We define the reparametrized curveν:\[0,L\]→𝒫2​\(𝒳\)\\nu:\[0,L\]\\to\\mathcal\{P\}\_\{2\}\(\\mathcal\{X\}\)in the warped domain asντ:=μΨ​\(τ\)\\nu\_\{\\tau\}:=\\mu\_\{\\Psi\(\\tau\)\}\. We aim to show that‖ν˙τ‖W2=1\\\|\\dot\{\\nu\}\_\{\\tau\}\\\|\_\{W\_\{2\}\}=1for almost everyτ\\tau\.

###### Proof\.

Letτ∈\[0,L\]\\tau\\in\[0,L\]be a point whereΨ\\Psiis differentiable and the metric derivative ofμ\\muexists att=Ψ​\(τ\)t=\\Psi\(\\tau\)\. Consider the metric derivative definition forν\\nuatτ\\tau:

‖ν˙τ‖W2=limk→0W2​\(ντ\+k,ντ\)\|k\|=limk→0W2​\(μΨ​\(τ\+k\),μΨ​\(τ\)\)\|k\|\.\\\|\\dot\{\\nu\}\_\{\\tau\}\\\|\_\{W\_\{2\}\}=\\lim\_\{k\\to 0\}\\frac\{W\_\{2\}\(\\nu\_\{\\tau\+k\},\\nu\_\{\\tau\}\)\}\{\|k\|\}=\\lim\_\{k\\to 0\}\\frac\{W\_\{2\}\(\\mu\_\{\\Psi\(\\tau\+k\)\},\\mu\_\{\\Psi\(\\tau\)\}\)\}\{\|k\|\}\.\(20\)SinceΨ\\Psiis strictly increasing,Ψ​\(τ\+k\)≠Ψ​\(τ\)\\Psi\(\\tau\+k\)\\neq\\Psi\(\\tau\)fork≠0k\\neq 0\. We have:

‖ν˙τ‖W2=limk→0\(W2​\(μΨ​\(τ\+k\),μΨ​\(τ\)\)\|Ψ​\(τ\+k\)−Ψ​\(τ\)\|⋅\|Ψ​\(τ\+k\)−Ψ​\(τ\)\|\|k\|\)\.\\\|\\dot\{\\nu\}\_\{\\tau\}\\\|\_\{W\_\{2\}\}=\\lim\_\{k\\to 0\}\\left\(\\frac\{W\_\{2\}\(\\mu\_\{\\Psi\(\\tau\+k\)\},\\mu\_\{\\Psi\(\\tau\)\}\)\}\{\|\\Psi\(\\tau\+k\)\-\\Psi\(\\tau\)\|\}\\cdot\\frac\{\|\\Psi\(\\tau\+k\)\-\\Psi\(\\tau\)\|\}\{\|k\|\}\\right\)\.\(21\)LetΔ​tk=Ψ​\(τ\+k\)−Ψ​\(τ\)\\Delta t\_\{k\}=\\Psi\(\\tau\+k\)\-\\Psi\(\\tau\)\. Ask→0k\\to 0, it follows thatΔ​tk→0\\Delta t\_\{k\}\\to 0\. We have:

limk→0W2​\(μt\+Δ​tk,μt\)\|Δ​tk\|=‖μ˙t‖W2=v​\(t\)\.\\lim\_\{k\\to 0\}\\frac\{W\_\{2\}\(\\mu\_\{t\+\\Delta t\_\{k\}\},\\mu\_\{t\}\)\}\{\|\\Delta t\_\{k\}\|\}=\\\|\\dot\{\\mu\}\_\{t\}\\\|\_\{W\_\{2\}\}=v\(t\)\.\(22\)Using the derivative of the inverse functionΨ′​\(τ\)=1/Φ′​\(Ψ​\(τ\)\)\\Psi^\{\\prime\}\(\\tau\)=1/\\Phi^\{\\prime\}\(\\Psi\(\\tau\)\):

limk→0\|Ψ​\(τ\+k\)−Ψ​\(τ\)\|\|k\|=\|Ψ′​\(τ\)\|=1\|v​\(t\)\|\.\\lim\_\{k\\to 0\}\\frac\{\|\\Psi\(\\tau\+k\)\-\\Psi\(\\tau\)\|\}\{\|k\|\}=\|\\Psi^\{\\prime\}\(\\tau\)\|=\\frac\{1\}\{\|v\(t\)\|\}\.\(23\)Sincev​\(t\)\>0v\(t\)\>0, combining these results yields:

‖ν˙τ‖W2=v​\(t\)⋅1v​\(t\)=1\.\\\|\\dot\{\\nu\}\_\{\\tau\}\\\|\_\{W\_\{2\}\}=v\(t\)\\cdot\\frac\{1\}\{v\(t\)\}=1\.\(24\)Thus, the distribution evolves at unit speed with respect to the Wasserstein metric in the warped domain\. ∎

## Appendix BExperimental Details

### B\.1Algorithm

We provide an algorithm section detailing our method in[Algorithm1](https://arxiv.org/html/2605.30625#alg1)\.

Algorithm 1Active Wasserstein Trajectory Learning1:Input:Initial snapshots

𝒟=\{\(ti,μ^ti\)\}i=1N\\mathcal\{D\}=\\\{\(t\_\{i\},\\hat\{\\mu\}\_\{t\_\{i\}\}\)\\\}\_\{i=1\}^\{N\}, Candidate pool

𝒯pool\\mathcal\{T\}\_\{\\text\{pool\}\}, Acquisition budget

BB, Latent dimension

KK\.

2:Hyperparameters:GP kernel

KbaseK\_\{\\text\{base\}\}, Reference landmark count

MM\.

3:whilebudget

B\>0B\>0do

4:// 1\. Geometry linearization \(Sec\.[4\.1](https://arxiv.org/html/2605.30625#S4.SS1)\)

5:Update reference

σ\\sigmaas the Wasserstein barycenter of

𝒟\\mathcal\{D\}\.

6:Represent

σ\\sigmavia landmarks

𝐙σ∈ℝM×d\\mathbf\{Z\}\_\{\\sigma\}\\in\\mathbb\{R\}^\{M\\times d\}and weights

𝐰\\mathbf\{w\}\.

7:foreach

\(ti,μ^ti\)∈𝒟\(t\_\{i\},\\hat\{\\mu\}\_\{t\_\{i\}\}\)\\in\\mathcal\{D\}do

8:Compute OT coupling

𝜸\(i\)\\boldsymbol\{\\gamma\}^\{\(i\)\}between

\(𝐙σ,𝐰\)\(\\mathbf\{Z\}\_\{\\sigma\},\\mathbf\{w\}\)and

μ^ti\\hat\{\\mu\}\_\{t\_\{i\}\}\.

9:Compute barycentric projection

𝐙^i=diag​\(𝐰\)−1​𝜸\(i\)​𝐙i\\hat\{\\mathbf\{Z\}\}\_\{i\}=\\text\{diag\}\(\\mathbf\{w\}\)^\{\-1\}\\boldsymbol\{\\gamma\}^\{\(i\)\}\\mathbf\{Z\}\_\{i\}\(Eq\.[6](https://arxiv.org/html/2605.30625#S4.E6)\)\.

10:Compute displacement field

𝐕i←𝐙^i−𝐙σ\\mathbf\{V\}\_\{i\}\\leftarrow\\hat\{\\mathbf\{Z\}\}\_\{i\}\-\\mathbf\{Z\}\_\{\\sigma\}\.

11:endfor

12:// 2\. Low\-dimensional representation \(Sec\.[4\.2](https://arxiv.org/html/2605.30625#S4.SS2)\)

13:Reweight and flatten:

𝐯~i=vec​\(𝐒𝐕i\)\\tilde\{\\mathbf\{v\}\}\_\{i\}=\\text\{vec\}\(\\mathbf\{S\}\\mathbf\{V\}\_\{i\}\)where

𝐒=diag​\(𝐰\)\\mathbf\{S\}=\\text\{diag\}\(\\sqrt\{\\mathbf\{w\}\}\)\.

14:Compute PCA basis

𝐔~K\\tilde\{\\mathbf\{U\}\}\_\{K\}on centered vectors

\{𝐯~i−𝐯¯\}i=1N\\\{\\tilde\{\\mathbf\{v\}\}\_\{i\}\-\\bar\{\\mathbf\{v\}\}\\\}\_\{i=1\}^\{N\}\.

15:Project snapshots to latent space:

𝐜i←𝐔~K⊤​\(𝐯~i−𝐯¯\)\\mathbf\{c\}\_\{i\}\\leftarrow\\tilde\{\\mathbf\{U\}\}\_\{K\}^\{\\top\}\(\\tilde\{\\mathbf\{v\}\}\_\{i\}\-\\bar\{\\mathbf\{v\}\}\)\(Eq\.[7](https://arxiv.org/html/2605.30625#S4.E7)\)\.

16:// 3\. Intrinsic time warping \(Sec\.[4\.4](https://arxiv.org/html/2605.30625#S4.SS4)\)

17:Compute cumulative transport distances

τ^i=∑j=1iW2​\(μ^tj−1,μ^tj\)\\hat\{\\tau\}\_\{i\}=\\sum\_\{j=1\}^\{i\}W\_\{2\}\(\\hat\{\\mu\}\_\{t\_\{j\-1\}\},\\hat\{\\mu\}\_\{t\_\{j\}\}\)\.

18:Fit monotone cubic spline

Φ​\(t\)\\Phi\(t\)to pairs

\{\(ti,τ^i\)\}i=1N\\\{\(t\_\{i\},\\hat\{\\tau\}\_\{i\}\)\\\}\_\{i=1\}^\{N\}\.

19:// 4\. Probabilistic Modeling \(Sec\.[4\.3](https://arxiv.org/html/2605.30625#S4.SS3)\)

20:Define warped kernel

𝐊​\(t,t′\)=𝐊base​\(Φ​\(t\),Φ​\(t′\)\)\\mathbf\{K\}\(t,t^\{\\prime\}\)=\\mathbf\{K\}\_\{\\text\{base\}\}\(\\Phi\(t\),\\Phi\(t^\{\\prime\}\)\)\.

21:Fit Multi\-Output GP on

𝒟~=\{\(ti,𝐜i\)\}\\tilde\{\\mathcal\{D\}\}=\\\{\(t\_\{i\},\\mathbf\{c\}\_\{i\}\)\\\}to obtain posterior

p​\(𝐟\|𝒟~\)p\(\\mathbf\{f\}\|\\tilde\{\\mathcal\{D\}\}\)\.

22:// 5\. Acquisition \(Sec\.[4\.5](https://arxiv.org/html/2605.30625#S4.SS5)\)

23:Select next timepoint by maximizing acquisition function:

24:

t∗←arg​maxt∈𝒯pool⁡α​\(t;𝒟\)t^\{\*\}\\leftarrow\\operatorname\*\{arg\\,max\}\_\{t\\in\\mathcal\{T\}\_\{\\text\{pool\}\}\}\\alpha\(t;\\mathcal\{D\}\)\(Eq\.[12](https://arxiv.org/html/2605.30625#S4.E12)\)\.

25:// 6\. Update

26:Acquire measurement

μ^t∗\\hat\{\\mu\}\_\{t^\{\*\}\}\(perform experiment/sequence\)\.

27:

𝒟←𝒟∪\{\(t∗,μ^t∗\)\}\\mathcal\{D\}\\leftarrow\\mathcal\{D\}\\cup\\\{\(t^\{\*\},\\hat\{\\mu\}\_\{t^\{\*\}\}\)\\\}\.

28:

N←N\+1,B←B−1N\\leftarrow N\+1,\\quad B\\leftarrow B\-1\.

29:endwhile

30:Output:Fitted surrogate model

𝐟\\mathbf\{f\}and reference

σ\\sigma\.

### B\.2Computational complexity

The computational cost of our framework is primarily driven by the optimal transport computations and the Gaussian Process inference\. We analyze the complexity with respect to the number of observed snapshotsNN, the average number of points per snapshotnn, the number of reference landmarksMM, and the ambient dimensiondd\.

LOT Embedding and Warping\.Mapping theNNsnapshots to the tangent space requires solvingNNdiscrete optimal transport problems between the referenceσ\\sigmaand eachμ^ti\\hat\{\\mu\}\_\{t\_\{i\}\}\. Additionally, the time warping approximation requires solvingNNpairwise OT problems between consecutive snapshots\. Let𝒞OT​\(M,n\)\\mathcal\{C\}\_\{\\text\{OT\}\}\(M,n\)denote the cost of a single transport solve \(e\.g\.,𝒪​\(M​n\)\\mathcal\{O\}\(Mn\)for Sinkhorn approximations or𝒪~​\(\(M\+n\)3\)\\tilde\{\\mathcal\{O\}\}\(\(M\+n\)^\{3\}\)for exact solvers\)\. The total cost for embedding and warping is𝒪​\(N⋅𝒞OT​\(M,n\)\)\\mathcal\{O\}\(N\\cdot\\mathcal\{C\}\_\{\\text\{OT\}\}\(M,n\)\)\. The subsequent barycentric projection involves matrix multiplications with complexity𝒪​\(N​M​n​d\)\\mathcal\{O\}\(NMnd\)\.

Surrogate Construction\.Dimensionality reduction via PCA on the displacement vectors \(matrix sizeN×M​dN\\times Md\) typically incurs a cost of𝒪​\(N2​M​d\)\\mathcal\{O\}\(N^\{2\}Md\), assumingN≪M​dN\\ll Md\. For the probabilistic model, trainingKKindependent GPs onNNtime points is dominated by the Cholesky decomposition of the kernel matrices, scaling as𝒪​\(K​N3\)\\mathcal\{O\}\(KN^\{3\}\)\.

Active Loop\.In the active learning setting, adding a new measurement requires solving one additional OT problem and updating the GPs\. Evaluating the acquisition functionα​\(t\)\\alpha\(t\)across a dense grid ofTgridT\_\{\\text\{grid\}\}candidate times requires computing predictive variances, scaling as𝒪​\(Tgrid​K​N2\)\\mathcal\{O\}\(T\_\{\\text\{grid\}\}KN^\{2\}\)\. Since our framework is designed for settings where data is sparse \(smallNN\) but high\-dimensional, the cubic scaling of the GP is negligible compared to the cost of optimal transport, which constitutes the main computational bottleneck, and which can be reduced by using entropic formulations\.

### B\.3Datasets\.

#### B\.3\.1Oscillatory sequential branching dataset

##### Latent dynamics\.

We generate a branching trajectory in a low\-dimensional latent space𝐳​\(t\)∈ℝdz\\mathbf\{z\}\(t\)\\in\\mathbb\{R\}^\{d\_\{z\}\}overt∈\[tmin,tmax\]t\\in\[t\_\{\\min\},t\_\{\\max\}\]using an Ornstein–Uhlenbeck–type SDE with a time\-dependent target mean:

d​𝐳​\(t\)=−κ​\(𝐳​\(t\)−𝝁​\(t;s\)\)​d​t\+σdiff​d​𝐖t,d\\mathbf\{z\}\(t\)=\-\\kappa\\big\(\\mathbf\{z\}\(t\)\-\\boldsymbol\{\\mu\}\(t;s\)\\big\)\\,dt\+\\sigma\_\{\\text\{diff\}\}\\,d\\mathbf\{W\}\_\{t\},\(25\)whereκ\>0\\kappa\>0is the drift strength,σdiff\\sigma\_\{\\text\{diff\}\}is the diffusion scale, and𝐖t\\mathbf\{W\}\_\{t\}is standard Brownian motion\. Each trajectory is assigned binary fate labelss=\(σ1,σ2\)∈\{±1\}2s=\(\\sigma\_\{1\},\\sigma\_\{2\}\)\\in\\\{\\pm 1\\\}^\{2\}\.

##### Branching schedule and target mean\.

Two sequential branching events occur within time windows\(a1,b1\)\(a\_\{1\},b\_\{1\}\)and\(a2,b2\)\(a\_\{2\},b\_\{2\}\), with smooth gates

αj​\(t\)=clip​\(t−ajbj−aj,0,1\),j∈\{1,2\}\.\\alpha\_\{j\}\(t\)=\\mathrm\{clip\}\\\!\\left\(\\frac\{t\-a\_\{j\}\}\{b\_\{j\}\-a\_\{j\}\},\\,0,\\,1\\right\),\\quad j\\in\\\{1,2\\\}\.\(26\)Defineg=𝕀​\{σ1=split\_branch\_sign\}g=\\mathbb\{I\}\\\{\\sigma\_\{1\}=\\texttt\{split\\\_branch\\\_sign\}\\\}to enforce that the second split only affects one side of the first branch\. The base mean indz=2d\_\{z\}=2is

𝝁0​\(t;s\)=\[g​β​σ2​α2​\(t\)β​σ1​α1​\(t\)\]\.\\boldsymbol\{\\mu\}\_\{0\}\(t;s\)=\\begin\{bmatrix\}g\\,\\beta\\,\\sigma\_\{2\}\\,\\alpha\_\{2\}\(t\)\\\\ \\beta\\,\\sigma\_\{1\}\\,\\alpha\_\{1\}\(t\)\\end\{bmatrix\}\.\(27\)To introduce oscillatory nuisance motion localized near the branching events, we add a time\-varying perturbation:

ej​\(t\)=20​αj​\(t\)​\(1−αj​\(t\)\),w​\(t\)=sin⁡\(ω​t\),e\_\{j\}\(t\)=20\\,\\alpha\_\{j\}\(t\)\\big\(1\-\\alpha\_\{j\}\(t\)\\big\),\\qquad w\(t\)=\\sin\(\\omega t\),\(28\)𝝁​\(t;s\)=\[μ0,x​\(t;s\)\+A​e1​\(t\)​w​\(t\)μ0,y​\(t;s\)\+A​e2​\(t\)​w​\(t\)\],\\boldsymbol\{\\mu\}\(t;s\)=\\begin\{bmatrix\}\\mu\_\{0,x\}\(t;s\)\+A\\,e\_\{1\}\(t\)\\,w\(t\)\\\\ \\mu\_\{0,y\}\(t;s\)\+A\\,e\_\{2\}\(t\)\\,w\(t\)\\end\{bmatrix\},\(29\)whereAAis the nuisance amplitude andω\\omegathe nuisance frequency\. This adds shared oscillatory motion during the transition windows without changing the branching topology\.

##### Initialization\.

Att=tmint=t\_\{\\min\}we initialize

𝐳​\(tmin\)∼𝒩​\(𝟎,σinit2​𝐈\),\\mathbf\{z\}\(t\_\{\\min\}\)\\sim\\mathcal\{N\}\(\\mathbf\{0\},\\,\\sigma\_\{\\text\{init\}\}^\{2\}\\mathbf\{I\}\),\(30\)with independent samples across trajectories\.

##### Observation model\.

Latent states are mapped into𝐱​\(t\)∈ℝdx\\mathbf\{x\}\(t\)\\in\\mathbb\{R\}^\{d\_\{x\}\}via a fixed orthonormal embedding:

𝐱​\(t\)=𝐐𝐳​\(t\)\+ϵ,\\mathbf\{x\}\(t\)=\\mathbf\{Q\}\\mathbf\{z\}\(t\)\+\\boldsymbol\{\\epsilon\},\(31\)where𝐐∈ℝdx×dz\\mathbf\{Q\}\\in\\mathbb\{R\}^\{d\_\{x\}\\times d\_\{z\}\}has orthonormal columns\.

##### Discretization\.

Trajectories are simulated using Euler–Maruyama with stepΔ​t\\Delta t; we record observations on the same grid\.

##### Data hyperparameters

We provide the hyperparameters used to generate the data in[Table2](https://arxiv.org/html/2605.30625#A2.T2)\.

Table 2:Data hyperparameters for the oscillatory sequential branching dataset

#### B\.3\.2Schiebinger reprogramming dataset \(Waddington\-OT\)

##### Source and biological context\.

We use the time\-resolved single\-cell RNA\-seq dataset introduced in\(Schiebingeret al\.,[2019](https://arxiv.org/html/2605.30625#bib.bib31)\)\(mouse fibroblast reprogramming under OSKM induction\)\. The full study contains multiple culture conditions\. In our experiments we restrict to the*serum*subset\. Cells are annotated with a numeric day post\-induction, and two experimental batches are provided\.

##### Representation and preprocessing\.

We treat each time point as an empirical measure over cell embeddings\. Let𝐱i​\(t\)∈ℝdx\\mathbf\{x\}\_\{i\}\(t\)\\in\\mathbb\{R\}^\{d\_\{x\}\}denote the embedding of celliiat daytt\. We perform PCA to obtaindx=20d\_\{x\}=20components and we whiten these components in our experiments\.

##### Train/eval split and candidate times\.

We use one batch, and we split the available time points into training and evaluation sets by taking a fixed fraction of times for evaluation \(0\.50\.5\)\. For sampling at a requested timett, we select the nearest available time point within a small tolerance\.

##### Time points\.

Time is measured in days post\-induction\. The serum subset includes discrete time points spanningt∈\[0,18\]t\\in\[0,18\]with half\-day spacing and a few quarter\-day measurements \(e\.g\.,8\.258\.25,8\.58\.5,8\.758\.75,9\.59\.5,10\.510\.5,11\.511\.5,12\.512\.5,13\.513\.5,14\.514\.5,15\.515\.5,16\.516\.5,17\.517\.5\)\.

Table 3:Data hyperparameters for the Schiebinger \(Waddington\-OT\) dataset

### B\.4Hyperparameters\.

In[Section5](https://arxiv.org/html/2605.30625#S5), all active, uniform, and random comparisons use the same linearized Wasserstein Gaussian\-process surrogate\. We use independent scalar GPs for the tangent\-space coefficients, with a Matérn\-5/2 covariance kernel and a Wasserstein arc\-length time warp\. The warped time coordinate is then linearly rescaled to\[0,1\]\[0,1\], and output coefficients are scaled by their empirical standard deviation before fitting\. The tangent PCA rank is set to44for the synthetic branching experiment and2020for the single\-cell experiment\. The nominal Matérn lengthscale is0\.40\.4in the synthetic experiment and1\.51\.5in the single\-cell experiment\. When at least two observations are available, this value is reinitialized to the median pairwise distance between the observed, warped, rescaled time points\. Kernel output scales are initialized from the empirical coefficient variance\. GP hyperparameters are optimized by maximizing the exact marginal likelihood with an L\-BFGS\-style SciPy optimizer, using Cholesky jitter10−510^\{\-5\}\. We place a data\-scaled Gamma prior on the noise variance, with prior mean equal to0\.10\.1times the empirical residual variance\. The likelihood noise is initialized from the residual variance with scale10−310^\{\-3\}for the synthetic experiment and10−210^\{\-2\}for the single\-cell experiment\. For the synthetic experiment, the referenceσ\\sigmahas512512support points, and for the single\-cell experiment, it has10241024support points\.

## Appendix CAdditional Results

### C\.1Additional baselines

We consider the following additional baselines:

- •Moments embeddings: we represent any distributionμ\\muonℝd\\mathbb\{R\}^\{d\}by its meanmμm\_\{\\mu\}and covarianceΣμ\\Sigma\_\{\\mu\}, and encode it asψ​\(μ\)=\(mμ,uptri⁡\(Σμ\)\)\\psi\(\\mu\)=\(m\_\{\\mu\},\\operatorname\{uptri\}\(\\Sigma\_\{\\mu\}\)\)before PCA, whereuptri\\operatorname\{uptri\}returns the upper triangular coefficients including the diagonal\. At reconstruction time, we decodeψ^\\hat\{\\psi\}into\(m^,Σ^\)\(\\hat\{m\},\\hat\{\\Sigma\}\), projectΣ^\\hat\{\\Sigma\}to be PSD, and map this to the Gaussian𝒩​\(m^,Σ^\)\\mathcal\{N\}\(\\hat\{m\},\\hat\{\\Sigma\}\)\.
- •Kernel mean embeddings: given a distributionμ\\mu, we fix anchorsZ=\{zℓ\}ℓ=1LZ=\\\{z\_\{\\ell\}\\\}\_\{\\ell=1\}^\{L\}, and, withkη​\(x,z\)=exp⁡\(−‖x−z‖2/\(2​η2\)\)k\_\{\\eta\}\(x,z\)=\\exp\(\-\\\|x\-z\\\|^\{2\}/\(2\\eta^\{2\}\)\), we defineψμ​\(ℓ\)=𝔼x∼μ​\[kη​\(x,zℓ\)\]\\psi\_\{\\mu\}\(\\ell\)=\\mathbb\{E\}\_\{x\\sim\\mu\}\[k\_\{\\eta\}\(x,z\_\{\\ell\}\)\]\. At reconstruction time, fromψ^\\hat\{\\psi\}we recover anchor weightsppby optimizingminp∈ΔL⁡‖KZ​Z​p−ψ^‖22\+λ​‖p‖22\\min\_\{p\\in\\Delta^\{L\}\}\\;\\\|K\_\{ZZ\}p\-\\hat\{\\psi\}\\\|\_\{2\}^\{2\}\+\\lambda\\\|p\\\|\_\{2\}^\{2\}where\(KZ​Z\)ℓ​m=kη​\(zℓ,zm\)\(K\_\{ZZ\}\)\_\{\\ell m\}=k\_\{\\eta\}\(z\_\{\\ell\},z\_\{m\}\), which yieldsμ^=∑ℓ=1Lpℓ​δzℓ\\hat\{\\mu\}=\\sum\_\{\\ell=1\}^\{L\}p\_\{\\ell\}\\,\\delta\_\{z\_\{\\ell\}\}\.
- •Interval midpoints: at each acquisition step, we select the candidate time closest to the midpoint of the largest interval between adjacent observed times \(both with and without time warping\)\.
- •Spline uncertainty: we adapt the acquisition function in\(Singhet al\.,[2005](https://arxiv.org/html/2605.30625#bib.bib35)\)\. Note that\(Singhet al\.,[2005](https://arxiv.org/html/2605.30625#bib.bib35)\)tackles a different problem: selecting time points given measurements fromkkfunctions, while we focus on the active learning problem in the space ofdistributions\. We adapt it to our problem by using their spline uncertainty acquisition function to our LOT\+PCA coefficients \(both with and without time warping\)\.

##### Results\.

We report the results in[Figure7](https://arxiv.org/html/2605.30625#A3.F7), where 1\) the alternative embeddings yield drastically worse results than with the LOT embeddings, showing the importance of capturing the Wasserstein geometry with OT and 2\) our acquisition function, which captures epistemic uncertainty, outperforms the other baselines\.

![Refer to caption](https://arxiv.org/html/2605.30625v1/x9.png)\(a\)Synthetic dataset\.
![Refer to caption](https://arxiv.org/html/2605.30625v1/x10.png)\(b\)Single\-cell reprogramming dataset\.

Figure 7:Reconstruction performance as a function of the acquisition budget\. We report the mean Wasserstein error and its velocity\-weighted variant \(w\-W2W\_\{2\}\)\.

### C\.2Distortion analysis

##### Role of the reference\.

The quality of the surrogate depends on how well LOT preserves Wasserstein geometry\. To reduce the distortion, we choose the reference measureσ\\sigmaas the Wasserstein barycenter of the observed snapshots, rather than fixing an arbitrary reference\. To provide more intuition on the importance of this choice, we perform an analysis quantifying the distortion induced byσ\\sigma: for pairsμi,μj\\mu\_\{i\},\\mu\_\{j\}, we compare the trueW2​\(μi,μj\)2W\_\{2\}\(\\mu\_\{i\},\\mu\_\{j\}\)^\{2\}with its LOT approximation‖zi,σ−zj,σ‖22\\\|z\_\{i,\\sigma\}\-z\_\{j,\\sigma\}\\\|\_\{2\}^\{2\}\. We report mean absolute and relative errors, as well as correlations in[Figure8](https://arxiv.org/html/2605.30625#A3.F8),[Table4](https://arxiv.org/html/2605.30625#A3.T4),[Table5](https://arxiv.org/html/2605.30625#A3.T5), and show that using the Wasserstein barycenter as reference yields substantially lower distortion than using the first snapshot as reference\.

![Refer to caption](https://arxiv.org/html/2605.30625v1/x11.png)\(a\)Synthetic dataset\.
![Refer to caption](https://arxiv.org/html/2605.30625v1/x12.png)\(b\)Single\-cell dataset\.

Figure 8:We compare the pairwise squared Wasserstein 2 distances \(di​j2d\_\{ij\}^\{2\}\) with the pairwise squared distances based on LOT embeddings \(d^i​j,σ2\\hat\{d\}\_\{ij,\\sigma\}^\{2\}\) for two choices of the referenceσ\\sigma: using the Wasserstein barycenter of the observed snapshots vs\. using the first snapshot\.Table 4:Linearization errors for the synthetic dataset\.Table 5:Linearization errors for the single\-cell dataset\.

### C\.3Runtime summary

We report the runtime summary for the single\-cell experiment in[Table6](https://arxiv.org/html/2605.30625#A3.T6)\.

Table 6:Runtime summary for the single\-cell experiment\.
### C\.4Labor market microdata

##### Data\.

We conduct an experiment using real\-world IPUMS\-CPS monthly microdata\(Floodet al\.,[2024](https://arxiv.org/html/2605.30625#bib.bib3)\)\. Specifically, we use monthly non\-ASEC U\.S\. CPS samples from January 2015 to December 2021 and construct a time\-indexed sequence of distributions over weekly earnings\. For each individual record, we retain observations satisfyingAGE≥16\\texttt\{AGE\}\\geq 16,EARNWT\>0\\texttt\{EARNWT\}\>0, andEARNWEEK\>0\\texttt\{EARNWEEK\}\>0\. We represent each month as a weighted empirical measure overlog⁡\(EARNWEEK\)\\log\(\\texttt\{EARNWEEK\}\), with weights normalized within each month usingEARNWT\.

##### Results\.

As shown in[Figure9\(a\)](https://arxiv.org/html/2605.30625#A3.F9.sf1), the active strategy performs similarly to the uniform baseline when error is averaged uniformly over calendar time \(left panel\), and yields a clear improvement under the intrinsic\-time \(velocity\-weighted metric on the right panel\)\. This difference is clear in[Figure9\(b\)](https://arxiv.org/html/2605.30625#A3.F9.sf2): our method preferentially identifies and explores periods where the distribution changes most rapidly\. That occurs around the onset of the COVID\-19 pandemic, especially in March\-April 2020\. During this period, the distribution of weekly earnings shifted quickly, because there were many fewer low\-earning workers in the data\. This shows that our method is a general tool for adaptive experimentation on measure\-valued dynamical systems, and is not restricted to biology\.

![Refer to caption](https://arxiv.org/html/2605.30625v1/x13.png)\(a\)Results for the IPUMS\-CPS dataset\.
![Refer to caption](https://arxiv.org/html/2605.30625v1/x14.png)\(b\)Visualization of adaptive timepoint selection for the IPUMS\-CPS dataset\.\(Top\)The instantaneous metric speed‖μ˙t‖\\\|\\dot\{\\mu\}\_\{t\}\\\|of the ground truth trajectory\.\(Bottom\)Comparison of selected measurement times\. Darker points indicate later acquisitions\.

Figure 9:Additional results for the IPUMS\-CPS dataset\.

### C\.5Alternative reconstruction schemes

Multi\-marginal Schrodinger\-type methods could in principle be used within our framework for the reconstruction step once snapshots have been acquired with our method\. To verify this, we conduct an experiment using MMFM\(Rohbecket al\.,[2025](https://arxiv.org/html/2605.30625#bib.bib17)\)to reconstruct a probability path given our acquired snapshots\. We report the results in[Table7](https://arxiv.org/html/2605.30625#A3.T7)and[Table8](https://arxiv.org/html/2605.30625#A3.T8), where MMFM underperforms our reconstruction framework\. We believe this gap is largely due to the strong non\-stationarity of the underlying dynamics\. Furthermore, unlike in MMSB, our approach can be used to sample multiple plausible probability paths compatible with the observed snapshots\. This allows to represent trajectory\-level ambiguity explicitly, which is what is leveraged for active learning\.

Table 7:Comparison with MMFM on the synthetic dataset for thereconstruction step\. Snapshot timepoints are selected with our method\.Table 8:Comparison with MMFM on the single\-cell dataset for thereconstruction step\. Snapshot timepoints are selected with our method\.

Similar Articles

AvAtar: Learning to Align via Active Optimal Transport

arXiv cs.LG

Presents AvAtar, a principled active alignment framework using optimal transport to actively acquire high-quality supervision for improved alignment, leveraging adjoint-state methods for efficient gradient computation.

Prediction and control with temporal segment models

OpenAI Blog

OpenAI introduces a method for learning complex nonlinear system dynamics using deep generative models over temporal segments, enabling stable long-horizon predictions and differentiable trajectory optimization for model-based control.