Towards Continuous-time Causal Foundation Models

arXiv cs.LG 05/29/26, 04:00 AM Papers
Summary
Proposes a continuity criterion for extending discrete-time causal prior-data fitted networks to continuous time using stochastic differential equations, introducing a taxonomy and fine-grid integration method that outperforms naive integration on irregular observation schedules.
arXiv:2605.28880v1 Announce Type: new Abstract: Extending discrete-time causal Prior-data Fitted Networks for time series to continuous time invites writing the mechanism as a stochastic differential equation (SDE) -- but if the SDE is integrated \emph{once per observation gap}, the trajectory law depends on when it is observed, and the prior remains a discrete-time Markov model in SDE clothing. We propose a precise continuity criterion -- trajectory-law invariance to the observation schedule -- together with a three-tier taxonomy (discrete; naive observation-grid integration; fine-grid integration with decoupled observation) and a construction realising the top tier on a random DAG with OU or small-MLP nonlinear drifts, irregular observation schedules, and hard / soft / time-varying interventions. A $2 \times 2$ encoder $\times$ integrator ablation, run independently on a linear and a nonlinear prior, finds fine-grid integration beats naive on 8/8 cells (sign-consistency $p < 1/256$) with the gap growing as the eval grid refines; the encoder axis is null with fine integration but time-aware-leading with naive. We release the prior and a preliminary zero-shot protocol on pharmacokinetic and physical-system data.
# Towards Continuous-time Causal Foundation Models
Source: [https://arxiv.org/html/2605.28880](https://arxiv.org/html/2605.28880)
###### Abstract

Extending discrete-time causal Prior-data Fitted Networks for time series to continuous time invites writing the mechanism as a stochastic differential equation (SDE), but if the SDE is integrated *once per observation gap*, the trajectory law depends on when it is observed, and the prior remains a discrete-time Markov model in SDE clothing. We propose a precise continuity criterion (trajectory-law invariance to the observation schedule) together with a three-tier taxonomy (discrete; naive observation-grid integration; fine-grid integration with decoupled observation) and a construction realising the top tier on a random DAG with OU or small-MLP nonlinear drifts, irregular observation schedules, and hard / soft / time-varying interventions. A $2\times 2$ encoder $\times$ integrator ablation, run independently on a linear and a nonlinear prior, finds fine-grid integration beats naive on 8/8 cells (sign-consistency $p<1/256$) with the gap growing as the eval grid refines; the encoder axis is null with fine integration but time-aware-leading with naive. We release the prior and a preliminary zero-shot protocol on pharmacokinetic and physical-system data (code: [https://github.com/thummd/continuous-time-causal-pfn](https://github.com/thummd/continuous-time-causal-pfn)).

Causal inference, Prior\-Data Fitted networks, Time series, Stochastic differential equations, Foundation models

## 1 Introduction

Prior-Data Fitted networks (PFNs) (Müller et al., [2022](https://arxiv.org/html/2605.28880#bib.bib1); Hollmann et al., [2023](https://arxiv.org/html/2605.28880#bib.bib2); Nagler, [2023](https://arxiv.org/html/2605.28880#bib.bib22)) pre-train a transformer on datasets sampled from an analytic data-generating prior and then perform in-context inference at test time. In causal settings, Do-PFN (Robertson et al., [2025](https://arxiv.org/html/2605.28880#bib.bib3)) and CausalFM (Ma et al., [2026](https://arxiv.org/html/2605.28880#bib.bib4)) have pushed this recipe to *interventional* tabular prediction by training on synthetic structural causal models (SCMs) (Pearl, [2009](https://arxiv.org/html/2605.28880#bib.bib19)). Recent work extends causal PFNs to multivariate time series by sampling temporal SCMs (TSCMs) with lagged directed acyclic graphs (DAGs), nonlinear autoregressive mechanisms, and multiple intervention types (Thumm and Chen, [2026](https://arxiv.org/html/2605.28880#bib.bib38)).

Existing temporal causal priors (Thumm and Chen, [2026](https://arxiv.org/html/2605.28880#bib.bib38)) are *discrete-time*: the generating process steps on a regular integer grid and the lag structure is a stack of adjacency matrices indexed by integer offsets. A natural response is to rewrite the mechanism as a stochastic differential equation (SDE) and let it run between observations. But the devil is in the integration: if the SDE is stepped *once per observation gap* (Euler–Maruyama (EM) on the observation grid), the joint law of a trajectory depends on when it is observed, and the prior is still effectively a discrete-time Markov model in SDE clothing. The target domains that motivate continuous time are *schedule-heterogeneous* and require more: pharmacokinetic concentrations sampled at clinically chosen times (Boeckmann et al., [1994](https://arxiv.org/html/2605.28880#bib.bib36)), physical systems like the Causal Chamber (Gamella et al., [2024](https://arxiv.org/html/2605.28880#bib.bib35)) with variable-delay events, and electronic health records with missing-at-random and missing-not-at-random gaps (Che et al., [2018](https://arxiv.org/html/2605.28880#bib.bib39); Rubanova et al., [2019](https://arxiv.org/html/2605.28880#bib.bib32)).

This paper takes a step back and asks what exactly a causal PFN prior must satisfy to be called continuous\-time\. Our contributions are:

1. A precise criterion for continuous-time causal priors ([Section 3.1](https://arxiv.org/html/2605.28880#S3.SS1)): the joint law of a sampled trajectory must be invariant to the observation schedule. We give a three-tier taxonomy (discrete with $\Delta t\equiv 1$, naive observation-grid integration, and fine-grid integration with decoupled observation) that operationalises the criterion.
2. A construction that realises the top tier ([Section 3.2](https://arxiv.org/html/2605.28880#S3.SS2)): Ornstein–Uhlenbeck (OU) or small Multilayer Perceptron (MLP) nonlinear drifts on a random DAG with optional hidden confounders and Markov regime switches, irregular observation schedules, and hard / soft / time-varying interventions, all integrated on a fine grid and subsampled to the observation schedule.
3. An empirical $2\times 2$ encoder $\times$ integrator ablation ([Section 4](https://arxiv.org/html/2605.28880#S4)), run independently on a linear-OU prior and a nonlinear neural-drift prior (4 cells $\times$ 2 priors $\times$ single seed $\times$ 10 k steps). The (B)-vs-(C) gap is positive on every encoder cell across three eval discretisations on each of the two priors (Tables [1](https://arxiv.org/html/2605.28880#S4.T1)–[2](https://arxiv.org/html/2605.28880#S4.T2), 8/8; sign-consistency $p<1/256$ under no-effect null). The lead is smallest on the eval that matches the naive variant's training schedule and substep tier and grows when the eval moves to finer substeps. The encoder axis is null with fine integration; with naive integration the time-aware encoder leads on both priors, consistent with fine integration making the data-generating process approximately schedule-invariant and removing the need for explicit time-gap features.

Real-data transfer (Theophylline, Warfarin, Causal Chamber) is preliminary and deferred to Appendix [C](https://arxiv.org/html/2605.28880#A3); the main body argues the continuity case on synthetic data where it can be measured cleanly.

## 2 Background and Related Work

#### Causal PFNs\.

Do-PFN (Robertson et al., [2025](https://arxiv.org/html/2605.28880#bib.bib3)) and CausalFM (Ma et al., [2026](https://arxiv.org/html/2605.28880#bib.bib4)) pre-train transformers on SCMs and estimate conditional interventional distributions in context on independent and identically distributed (i.i.d.) tabular data. They do not address temporal dependencies.

#### Temporal interventional priors\.

Only a handful of generators produce paired (observational, interventional) time-series data: CAnDOIT (Castri et al., [2024](https://arxiv.org/html/2605.28880#bib.bib9)) restricts to hard interventions at known targets; TECDI/RealTCD (Li et al., [2023](https://arxiv.org/html/2605.28880#bib.bib10), [2024](https://arxiv.org/html/2605.28880#bib.bib11)) handle soft or hard interventions in linear structural vector auto-regressive (SVAR) models; CaTSG (Xia and others, [2025](https://arxiv.org/html/2605.28880#bib.bib12)) approximates $\mathrm{do}$-calculus with a learned diffusion model. The most recent CausalTimePrior framework (Thumm and Chen, [2026](https://arxiv.org/html/2605.28880#bib.bib38)) samples nonlinear autoregressive TSCMs with hard, soft, and time-varying interventions but, like all of the above, on a discrete-time grid. We build directly on its lagged-DAG formulation (Boeken and Mooij, [2024](https://arxiv.org/html/2605.28880#bib.bib20)) and replace the mechanism and schedule with continuous-time analogues.

#### Continuous\-time dynamical ML and SDE causality\.

Neural ODEs (Chen et al., [2018](https://arxiv.org/html/2605.28880#bib.bib29)), Neural SDEs (Kidger et al., [2021](https://arxiv.org/html/2605.28880#bib.bib30)), and latent-ODE models for irregular series (Rubanova et al., [2019](https://arxiv.org/html/2605.28880#bib.bib32)) demonstrate that continuous-time parameterisations can match or beat discrete ones on irregular data. Irregular-time attention (Shukla and Marlin, [2021](https://arxiv.org/html/2605.28880#bib.bib33); Tashiro et al., [2021](https://arxiv.org/html/2605.28880#bib.bib34)) and time-series foundation models (Dooley et al., [2023](https://arxiv.org/html/2605.28880#bib.bib23); Taga et al., [2025](https://arxiv.org/html/2605.28880#bib.bib24); Moroshan et al., [2025](https://arxiv.org/html/2605.28880#bib.bib26); Xie et al., [2025](https://arxiv.org/html/2605.28880#bib.bib31)) ingest continuous timestamps but, to our knowledge, none target *interventional* in-context prediction. Closest in spirit to our SDE-based prior, Lorch et al. ([2024](https://arxiv.org/html/2605.28880#bib.bib40)) *learn* a single SDE whose stationary distribution captures interventional behaviour, dropping acyclicity. Our goal is instead to *sample* an analytically specified prior over SDE-driven TSCMs so a transformer can amortise causal inference across the family; the two approaches are complementary.

## 3 Method

### 3.1 What makes a causal prior continuous-time?

Let $\mathcal{P}$ be a prior over (TSCM, trajectory) pairs, and let $\mathcal{P}_{\tau}$ denote the distribution of observations at schedule $\tau=(t_{1}<\ldots<t_{T})$.

###### Definition 3.1 (Continuous-time causal prior).

$\mathcal{P}$ is *continuous-time* if there exists a continuous-path stochastic process $X$ whose law does not depend on $\tau$ and such that $\mathcal{P}_{\tau}$ is the law of $X|_{\tau}$, the restriction of $X$ to the observation times. That is, the observation schedule is pure measurement, not part of the TSCM.

The definition partitions priors into three tiers: (A) *discrete* ($\Delta t\equiv 1$), a VAR-style SCM (Thumm and Chen, [2026](https://arxiv.org/html/2605.28880#bib.bib38)) defined only on the integer grid and failing Definition [3.1](https://arxiv.org/html/2605.28880#S3.Thmtheorem1) by construction; (B) *naive continuous* (observation-grid integration), an SDE stepped once per observation gap $\Delta_{i}$, where the joint kernel parameterises a different Markov model as $\Delta_{i}$ varies, so the law depends on $\tau$; (C) *continuous* (fine-grid integration), the SDE integrated on $\Delta_{\mathrm{fine}}\ll\min_{i}\Delta_{i}^{\mathrm{obs}}$ and subsampled to $\tau$, converging to the true SDE law as $\Delta_{\mathrm{fine}}\to 0$ (Kloeden and Platen, [1992](https://arxiv.org/html/2605.28880#bib.bib28)). At any finite $\Delta_{\mathrm{fine}}$, tier (C) realises Definition [3.1](https://arxiv.org/html/2605.28880#S3.Thmtheorem1) only approximately, with $\|\mathcal{P}_{\tau}^{(C)}-\mathcal{P}_{\tau}^{(\mathrm{SDE})}\|\to 0$ as $\Delta_{\mathrm{fine}}\to 0$; we treat tier (C) as the practical realisation of the criterion.

Whether tiers (B) and (C) differ in practice depends on a stability condition. The standard Euler–Maruyama update on a 1-D OU process $dX=-\theta X\,dt+\sigma\,dW$ has amplification factor $|1-\theta\Delta|$ per step and is mean-square stable only when $\theta\Delta<2$; on a prior that crosses this boundary, naive EM produces exploding trajectories that pin training-target distributions at their normalisation ceiling, which is a numerical-stability artefact rather than a discretisation-bias signature. Stability is necessary but not sufficient for naive $\approx$ fine: the leading per-step Euler–Maruyama bias against the exact Gaussian OU transition kernel is $O(\theta\Delta)$ on the variance, accumulating over the trajectory at the prior's typical $\theta\Delta\approx 0.3$. Eval-loss is partially robust to this transition-kernel bias (it scores predictive likelihood, not path-measure distance), so we expect a smaller but still detectable empirical gap, which [Section 4](https://arxiv.org/html/2605.28880#S4) confirms on both OU and neural priors. The construction ([Section 3.2](https://arxiv.org/html/2605.28880#S3.SS2)) therefore pairs tier-(C) integration with a stability-respecting prior, and the ablation tests both axes.
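
The stability and bias statements above can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are illustrative, and the exact OU transition variance $\sigma^2(1-e^{-2\theta\Delta})/(2\theta)$ is the standard closed form rather than something quoted from the paper.

```python
import math


def em_amplification(theta: float, delta: float) -> float:
    """Factor multiplying X in one Euler-Maruyama step of dX = -theta*X dt + sigma dW."""
    return abs(1.0 - theta * delta)


def em_step_variance(sigma: float, delta: float) -> float:
    """Conditional variance injected by one Euler-Maruyama step."""
    return sigma ** 2 * delta


def exact_step_variance(theta: float, sigma: float, delta: float) -> float:
    """Conditional variance of the exact Gaussian OU transition over a gap delta."""
    return sigma ** 2 * (1.0 - math.exp(-2.0 * theta * delta)) / (2.0 * theta)


theta, sigma = 1.0, 1.0
for delta in (0.3, 1.0, 2.5):
    amp = em_amplification(theta, delta)
    ratio = em_step_variance(sigma, delta) / exact_step_variance(theta, sigma, delta)
    status = "stable" if amp < 1.0 else "unstable"
    print(f"theta*delta={theta * delta:.1f}  |1-theta*delta|={amp:.2f} ({status})  "
          f"EM/exact step variance={ratio:.3f}")
```

The variance ratio equals $2\theta\Delta/(1-e^{-2\theta\Delta}) = 1+\theta\Delta+O((\theta\Delta)^2)$, which is where the $O(\theta\Delta)$ relative bias comes from.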

### 3.2 Construction of the continuous-time prior

#### Graph sampling\.

A sample from the prior draws $N\sim\mathrm{Uniform}(3,N_{\max})$ variables and a DAG $\mathcal{G}$ over them (Thumm and Chen, [2026](https://arxiv.org/html/2605.28880#bib.bib38)). We provide two graph samplers: (i) a named-structure sampler that cycles through eight canonical causal structures (back-door, front-door, instrumental variable, observed confounder, mediator, confounder + mediator, bi-variate, unobserved confounder) and (ii) a *random-DAG* sampler that draws an edge between each pair with probability $p\sim\mathrm{Beta}(\alpha,\beta)$ under a topological ordering, with configurable probability that each non-$(A,Y)$ variable is marked *hidden* (removed from the encoder's input, but active in the dynamics). For each DAG we designate a treatment variable $A$ and an outcome variable $Y$ such that $A$ precedes $Y$ in topological order (see Appendix [B](https://arxiv.org/html/2605.28880#A2)).
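
The random-DAG sampler described above could look roughly like the sketch below. It is an illustration under stated assumptions rather than the released implementation: the function name and return layout are hypothetical, $N$ is taken to be an integer drawn uniformly from $\{3,\ldots,N_{\max}\}$, and $(A,Y)$ is chosen as a uniformly random ordered pair, since the actual selection rule is deferred to Appendix B.

```python
import random


def sample_random_dag(n_max: int, alpha: float, beta: float,
                      p_hidden: float, rng: random.Random):
    """Sample a DAG as parent lists, plus treatment A, outcome Y and hidden variables."""
    n = rng.randint(3, n_max)                 # assumed: integer N in {3, ..., n_max}
    order = list(range(n))
    rng.shuffle(order)                        # random topological ordering
    p_edge = rng.betavariate(alpha, beta)     # one edge probability per graph
    parents = {v: [] for v in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p_edge:
                # Edges only point forward in the ordering, so the graph is acyclic.
                parents[order[j]].append(order[i])
    # Assumption: pick any pair with A earlier than Y in the ordering.
    pos_a, pos_y = sorted(rng.sample(range(n), 2))
    a, y = order[pos_a], order[pos_y]
    hidden = {v for v in range(n) if v not in (a, y) and rng.random() < p_hidden}
    return order, parents, a, y, hidden


rng = random.Random(0)
order, parents, a, y, hidden = sample_random_dag(6, 2.0, 5.0, 0.3, rng)
print("order:", order, "parents:", parents, "A:", a, "Y:", y, "hidden:", sorted(hidden))
```

Hidden variables stay in `parents`, so they still drive the dynamics; only the encoder input would drop them.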

#### Mechanism family\.

Unlike the per-lag adjacency stack used in discrete-time priors, the continuous-time prior reduces temporal dependence to a single parent set per variable. We support two drift families on that parent set. The *linear* drift is an Ornstein–Uhlenbeck mechanism (Øksendal, [2003](https://arxiv.org/html/2605.28880#bib.bib41))

$$
dX_{v}=\Bigl(-\theta_{v}X_{v}+\sum_{u\in\mathrm{Pa}(v)}w_{vu}X_{u}\Bigr)\,dt+\sigma_{v}\,dW_{v}, \tag{1}
$$

with $\theta_{v}>0$, $\sigma_{v}>0$, and $w_{vu}\sim\mathcal{N}(0,\sigma_{w}^{2})$ sampled per TSCM. At $\Delta t\equiv 1$ this reduces to the AR(1) mechanism used by discrete-time causal priors (Thumm and Chen, [2026](https://arxiv.org/html/2605.28880#bib.bib38)). OU admits an exact Gaussian transition kernel between any two times, so the linear-prior naive-vs-fine comparison should be read as EM-vs-EM rather than EM-vs-exact; we use EM uniformly across drift families because no closed form exists for the neural drift.

The *neural* drift replaces the linear parental sum with a small randomly-initialised two-layer $\tanh$-MLP $g_{v}$ on $\mathbf{z}_{v}=[X_{v},X_{u_{1}},\ldots,X_{u_{k}}]$:

$$
dX_{v}=\bigl(-\theta_{v}X_{v}+s_{v}\,g_{v}(\mathbf{z}_{v})\bigr)\,dt+\sigma_{v}\,dW_{v}, \tag{2}
$$

with $g_{v}(\mathbf{z})=\tanh\bigl(W_{2}\tanh(W_{1}\mathbf{z}+b_{1})+b_{2}\bigr)$ and $s_{v}>0$. We retain $-\theta_{v}X_{v}$ outside the MLP so trajectories stay bounded for any weight draw; the outer $\tanh$ bounds the nonlinear contribution to $[-s_{v},s_{v}]$. Each trajectory draws the drift family per variable with a Bernoulli($p_{\mathrm{neural}}$) coin, so a single training run exposes the PFN to a mixture of linear and nonlinear dynamics.
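
As a rough illustration of the two drift families, the sketch below implements the linear drift of (1) and the neural drift of (2) for a single variable. The helper names, the hidden width, and the standard-normal weight initialisation are assumptions made for the example; the paper only states that the MLP is small and randomly initialised.

```python
import math
import random


def linear_drift(x_v: float, parent_vals: list, theta: float, weights: list) -> float:
    """Drift of equation (1): mean reversion plus a weighted sum over parents."""
    return -theta * x_v + sum(w * x for w, x in zip(weights, parent_vals))


def make_neural_drift(n_inputs: int, hidden: int, theta: float, scale: float,
                      rng: random.Random):
    """Drift of equation (2) with a two-layer tanh MLP (weights assumed ~ N(0, 1))."""
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)] for _ in range(hidden)]
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    w2 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    b2 = rng.gauss(0.0, 1.0)

    def drift(x_v: float, parent_vals: list) -> float:
        z = [x_v, *parent_vals]                       # z_v = [X_v, X_u1, ..., X_uk]
        h = [math.tanh(sum(w * zi for w, zi in zip(row, z)) + b)
             for row, b in zip(w1, b1)]
        g = math.tanh(sum(w * hi for w, hi in zip(w2, h)) + b2)
        # The outer tanh keeps scale * g inside [-scale, scale] for any weights.
        return -theta * x_v + scale * g

    return drift


rng = random.Random(1)
neural = make_neural_drift(n_inputs=3, hidden=8, theta=0.7, scale=0.5, rng=rng)
print(linear_drift(2.0, [1.0], theta=0.5, weights=[3.0]), neural(2.0, [1.0, -1.0]))
```

Because the nonlinear term is bounded by `scale`, the drift is dominated by `-theta * x_v` for large `|x_v|`, which is the boundedness argument made in the text.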

#### Regime switching\.

Optionally, a fraction of training trajectories is drawn from a *continuous-time regime-switching* TSCM: $R$ independent OU systems that share variables and observation schedule, arbitrated by a sticky row-stochastic $R\times R$ Markov transition matrix ($P_{rr}\approx 0.9$, expected regime duration $\sim 10$ observations) with rows sampled from a Dirichlet distribution. This lets the prior express structural breaks of the kind observed in pharmacology (e.g. absorption vs. elimination phase) and physical systems.
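
One way to obtain such a sticky matrix is sketched below. The Dirichlet parameterisation is an assumption, not taken from the paper: each row's concentration vector puts a fraction `stay_prob` of its mass on the diagonal so that the row mean has $P_{rr}=0.9$. The Dirichlet draw is built from normalised Gamma variates using only the standard library.

```python
import random


def sample_sticky_transition_matrix(n_regimes: int, stay_prob: float,
                                    concentration: float, rng: random.Random):
    """Row-stochastic matrix whose rows are Dirichlet draws with mean diagonal stay_prob."""
    if n_regimes < 2:
        raise ValueError("regime switching needs at least two regimes")
    rows = []
    for r in range(n_regimes):
        off = (1.0 - stay_prob) / (n_regimes - 1)
        alphas = [concentration * (stay_prob if c == r else off)
                  for c in range(n_regimes)]
        draws = [rng.gammavariate(a, 1.0) for a in alphas]  # Dirichlet via Gammas
        total = sum(draws)
        rows.append([d / total for d in draws])
    return rows


def expected_duration(p_stay: float) -> float:
    """Mean number of consecutive observations spent in a regime (geometric law)."""
    return 1.0 / (1.0 - p_stay)


rng = random.Random(0)
for row in sample_sticky_transition_matrix(3, 0.9, 100.0, rng):
    print([round(p, 3) for p in row])
print("expected duration at P_rr = 0.9:", expected_duration(0.9))
```

The geometric-duration formula $1/(1-P_{rr})$ recovers the roughly 10 observations quoted above when $P_{rr}=0.9$.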

#### Observation schedule\.

Given a horizon $H$ and an expected inter-observation gap $\bar{\Delta}$, we sample one of three schedules: *regular* ($t_{i}=i\bar{\Delta}$), *jittered* ($t_{i+1}-t_{i}=\bar{\Delta}(1+\xi_{i})$ with $\xi_{i}\sim\mathrm{Uniform}[-\rho,\rho]$), or *Poisson* ($t_{i+1}-t_{i}\sim\mathrm{Exp}(1/\bar{\Delta})$). The model never sees the schedule as input; it only sees the actual timestamps.
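
The three schedules can be sketched as follows. The function name, the convention that the first observation falls one gap after $t=0$, and the stopping rule at the horizon are assumptions for illustration. $\mathrm{Exp}(1/\bar{\Delta})$ is read as an exponential with rate $1/\bar{\Delta}$, hence mean $\bar{\Delta}$.

```python
import random


def sample_schedule(kind: str, horizon: float, mean_gap: float, rho: float,
                    rng: random.Random) -> list:
    """Return increasing observation times in (0, horizon] for the given schedule kind."""
    times = []
    t = 0.0
    i = 0
    while True:
        i += 1
        if kind == "regular":
            t = i * mean_gap                              # t_i = i * mean_gap
        elif kind == "jittered":
            # Gap mean_gap * (1 + xi), xi ~ Uniform[-rho, rho]; needs rho < 1.
            t += mean_gap * (1.0 + rng.uniform(-rho, rho))
        elif kind == "poisson":
            t += rng.expovariate(1.0 / mean_gap)          # rate 1/mean_gap
        else:
            raise ValueError(f"unknown schedule kind: {kind}")
        if t > horizon:
            break
        times.append(t)
    return times


rng = random.Random(0)
for kind in ("regular", "jittered", "poisson"):
    ts = sample_schedule(kind, horizon=10.0, mean_gap=1.0, rho=0.5, rng=rng)
    print(kind, [round(t, 2) for t in ts])
```

Only the returned timestamps would be passed on to the model, matching the statement that the schedule type itself is never an input.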

#### Simulation \(fine\-grid integration\)\.

Given a target observation schedule $\tau=(t_{1},\ldots,t_{T})$ we do *not* integrate once per observation gap. Instead we pick a fine step $\Delta_{\mathrm{fine}}\ll\min_{i}\Delta_{i}^{\mathrm{obs}}$, integrate the SDE on the union grid $[t_{1},t_{T}]\cap\{t_{1}+k\Delta_{\mathrm{fine}}\}_{k\geq 0}$ via Euler–Maruyama (Kloeden and Platen, [1992](https://arxiv.org/html/2605.28880#bib.bib28)) with Brownian increments re-sampled per fine step, and subsample the resulting trajectory at $\tau$:

$$
X_{v}(t+\Delta_{\mathrm{fine}})=X_{v}(t)+\mu_{v}(X(t))\,\Delta_{\mathrm{fine}}+\sigma_{v}\sqrt{\Delta_{\mathrm{fine}}}\,Z,
$$

with $Z\sim\mathcal{N}(0,1)$ and $\mu_{v}$ given by ([1](https://arxiv.org/html/2605.28880#S3.E1)) or ([2](https://arxiv.org/html/2605.28880#S3.E2)). Setting $\Delta_{\mathrm{fine}}=\Delta_{i}^{\mathrm{obs}}$ recovers naive tier-(B) integration; $\Delta_{\mathrm{fine}}=1$ with a regular unit-gap schedule recovers tier-(A). The continuity ablation in [Section 4](https://arxiv.org/html/2605.28880#S4) varies this single knob.
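
A minimal sketch of the integrate-then-subsample idea follows. It simplifies the grid described above: instead of a single global $\Delta_{\mathrm{fine}}$, each observation gap is split into `substeps` equal Euler–Maruyama steps, which mirrors the substeps-per-gap knob used in the experiments. `simulate_em` and its arguments are hypothetical names, not the released API.

```python
import math
import random
import statistics


def simulate_em(drift, sigma: list, x0: list, obs_times: list, substeps: int,
                rng: random.Random) -> list:
    """Euler-Maruyama on a refined grid, recording the state only at obs_times.

    substeps=1 is naive (tier B) integration; larger values approach tier C.
    """
    x = list(x0)
    out = [list(x)]
    for t_prev, t_next in zip(obs_times, obs_times[1:]):
        h = (t_next - t_prev) / substeps
        for _ in range(substeps):
            mu = drift(x)
            # A fresh Gaussian increment is drawn for every fine step and variable.
            x = [xi + mi * h + si * math.sqrt(h) * rng.gauss(0.0, 1.0)
                 for xi, mi, si in zip(x, mu, sigma)]
        out.append(list(x))        # subsample: keep only the observation times
    return out


def ou_drift(x: list) -> list:
    return [-1.0 * x[0]]           # 1-D OU with theta = 1


rng = random.Random(0)
obs_times = [float(i) for i in range(2001)]   # regular schedule, gap 1
naive = simulate_em(ou_drift, [1.0], [0.0], obs_times, substeps=1, rng=rng)
fine = simulate_em(ou_drift, [1.0], [0.0], obs_times, substeps=8, rng=rng)
naive_var = statistics.pvariance(p[0] for p in naive[100:])
fine_var = statistics.pvariance(p[0] for p in fine[100:])
print(f"naive variance={naive_var:.3f}  fine variance={fine_var:.3f}  exact stationary=0.500")
```

With $\theta=\sigma=1$ and unit gaps, the EM chain has stationary variance $\sigma^2/(2\theta-\theta^2 h)$, so the naive run sits near 1 while the eight-substep run is much closer to the exact value $\sigma^2/(2\theta)=0.5$.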

#### Interventions\.

For each sample we draw a target $i^{\star}$, a window $[t_{\mathrm{int}}^{\mathrm{start}},t_{\mathrm{int}}^{\mathrm{end}})$ of duration between $10\%$ and $30\%$ of the horizon, and an intervention kind $\in\{\text{hard},\text{soft},\text{time-varying}\}$:

$$
\begin{aligned}
\text{(hard)}\quad & X_{i^{\star}}(t):=c,\\
\text{(soft)}\quad & \mu_{i^{\star}}(X)\mapsto\mu_{i^{\star}}(X)+\delta,\\
\text{(time-varying)}\quad & X_{i^{\star}}(t):=c(t),
\end{aligned}
$$

active on the window. Hard-intervention values are optionally clipped to $[\mu_{i^{\star}}-3\sigma_{i^{\star}},\mu_{i^{\star}}+3\sigma_{i^{\star}}]$ to keep the intervention inside the observed operating range of the target variable, analogous to the causal *positivity* (overlap) assumption (Hernán and Robins, [2020](https://arxiv.org/html/2605.28880#bib.bib15)). The prior returns paired counterfactual and interventional trajectories by re-using the same Wiener noise across runs (cf. Pearl [2009](https://arxiv.org/html/2605.28880#bib.bib19), rung 3).
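
The shared-noise pairing can be illustrated on a two-variable system $A\to Y$ with OU drifts. This is a sketch under assumptions: the function name, parameter defaults, and the choice to apply the clamp at the start of each Euler–Maruyama step are illustrative, and clipping is omitted.

```python
import math
import random


def simulate_pair(kind, value, window, theta=1.0, w_ya=0.8, sigma=0.3,
                  h=0.0625, n_steps=128, seed=0):
    """Return (baseline, intervened) paths of (A, Y) driven by the same noise draws."""
    rng = random.Random(seed)
    noise = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_steps)]
    start, end = window

    def run(intervene: bool) -> list:
        a, y = 0.0, 0.0
        path = []
        for k, (z_a, z_y) in enumerate(noise):
            t = k * h
            active = intervene and start <= t < end
            shift = 0.0
            if active and kind in ("hard", "time-varying"):
                a = value(t) if callable(value) else value   # X_A(t) := c or c(t)
            elif active and kind == "soft":
                shift = value                                # drift of A gains +delta
            path.append((a, y))
            a_next = a + (-theta * a + shift) * h + sigma * math.sqrt(h) * z_a
            y_next = y + (-theta * y + w_ya * a) * h + sigma * math.sqrt(h) * z_y
            a, y = a_next, y_next
        return path

    return run(False), run(True)


base, hard = simulate_pair("hard", 1.5, window=(2.0, 4.0))
_, ramp = simulate_pair("time-varying", lambda t: 0.5 * t, window=(2.0, 4.0))
print("Y at t=4 without / with hard intervention:", base[64][1], hard[64][1])
```

Because both runs consume the identical `noise` list, the two paths agree exactly before the window opens and any later difference is attributable to the intervention alone.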

### 3.3 $\Delta t$-aware PFN encoder

We build on a causal transformer encoder operating on a pre-intervention window (Thumm and Chen, [2026](https://arxiv.org/html/2605.28880#bib.bib38)). We replace its learned integer positional embedding with a Fourier embedding of continuous time:

$$
\phi(t)=W_{\phi}\bigl[\sin(2\pi f_{k}\,t),\,\cos(2\pi f_{k}\,t)\bigr]_{k=1}^{K}, \tag{3}
$$

with a geometric frequency bank $f_{k}\in[f_{\mathrm{min}},f_{\mathrm{max}}]$ (defaults $0.01$ and $10$) and a learnable projection $W_{\phi}$. Times are referenced to intervention onset, $t\leftarrow t-t_{\mathrm{int}}^{\mathrm{start}}$, and inter-observation gaps $\Delta t_{i}$ are embedded with the same family after a $\log(1+\Delta t_{i})$ transform to concentrate resolution at small gaps. The encoder is otherwise identical to the discrete baseline, enabling a controlled ablation.
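
The feature map inside (3) can be sketched without the learnable projection $W_{\phi}$, which would be a trained linear layer applied to these features. The function names and the choice $K=8$ are assumptions for illustration; the geometric spacing and the default range $[0.01, 10]$ follow the text.

```python
import math


def geometric_frequencies(f_min: float, f_max: float, k: int) -> list:
    """K frequencies spaced geometrically between f_min and f_max (inclusive)."""
    if k == 1:
        return [f_min]
    ratio = (f_max / f_min) ** (1.0 / (k - 1))
    return [f_min * ratio ** i for i in range(k)]


def fourier_features(t: float, freqs: list) -> list:
    """[sin(2*pi*f_k*t), cos(2*pi*f_k*t)] for each k, before the projection W_phi."""
    feats = []
    for f in freqs:
        feats.append(math.sin(2.0 * math.pi * f * t))
        feats.append(math.cos(2.0 * math.pi * f * t))
    return feats


def gap_features(dt: float, freqs: list) -> list:
    """Same feature family applied to log(1 + dt) for inter-observation gaps."""
    return fourier_features(math.log1p(dt), freqs)


freqs = geometric_frequencies(0.01, 10.0, 8)
t_onset = 3.0                                  # intervention onset
time_emb = fourier_features(4.25 - t_onset, freqs)   # time relative to onset
gap_emb = gap_features(0.4, freqs)
print(len(time_emb), len(gap_emb))
```

Referencing times to the intervention onset means the embedding of the onset itself is always the fixed vector with every sine at 0 and every cosine at 1.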

At inference time we feed $(X_{\mathrm{obs}},t_{\mathrm{obs}},\text{intervention spec},t_{\mathrm{query}})$ and the model predicts the Gaussian (or quantile) distribution of $Y$ at $t_{\mathrm{query}}$ under the intervention.

### 3.4 Training

The prior runs on-the-fly during training; each batch draws a fresh TSCM, schedule, and intervention. We use either a quantile (pinball loss) or bar-distribution (Thumm and Chen, [2026](https://arxiv.org/html/2605.28880#bib.bib38)) output head; full hyperparameters and architecture sizes are given in Appendix [A](https://arxiv.org/html/2605.28880#A1).
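
For reference, the pinball loss behind a quantile head has the standard form $\max\bigl(q\,(y-\hat{y}),\,(q-1)(y-\hat{y})\bigr)$. The sketch below uses that textbook definition; the quantile levels and the averaging over levels are assumptions, since the paper defers head details to Appendix A.

```python
def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball (quantile) loss for a single prediction at quantile level q."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball_loss(y_true: float, quantile_preds: list, quantiles: list) -> float:
    """Average pinball loss over a set of predicted quantiles for one target."""
    losses = [pinball_loss(y_true, p, q) for p, q in zip(quantile_preds, quantiles)]
    return sum(losses) / len(losses)


quantiles = [0.1, 0.5, 0.9]             # hypothetical quantile levels
preds = [-0.8, 0.1, 1.2]                # hypothetical model outputs for Y at t_query
print(mean_pinball_loss(0.3, preds, quantiles))
```

Under-prediction at a high quantile level is penalised more heavily than over-prediction, which is what pushes each output towards its target quantile.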

## 4 Experiments

A $2\times 2$ encoder $\times$ integrator ablation, run independently on a linear-OU and a nonlinear neural-drift prior, separates the two axes (Tables [1](https://arxiv.org/html/2605.28880#S4.T1), [2](https://arxiv.org/html/2605.28880#S4.T2)). The encoder axis is positional-only (learned absolute embedding, ablating the Fourier-time path) vs. time-aware ([Section 3.3](https://arxiv.org/html/2605.28880#S3.SS3)); the integrator axis is tier-(B) *naive* EM ($s_{\rm train}=1$ substep per observation gap) vs. tier-(C) *fine* EM ($s_{\rm train}=8$). Each prior trains four PFNs (10 k steps, single seed) scored on held-out evals drawn from the same prior; multi-seed replication is future work.

#### Eval distributions\.

Two schedules crossed with eval-grid refinements $s_{\rm eval}\in\{1,8\}$ (best held-out eval-loss over 50 batches): *regular* (uniform $\Delta=1$) and *mixed* (random per-trajectory regular / jittered / Poisson, the pre-training schedule). On *regular* the positional encoder's `arange(T)` positions equal the actual timestamps, so the two encoders see identical inputs at eval time; this isolates pos-vs-time gaps as training-side residue. The eval substep tier $s_{\rm eval}$ probes the SDE limit independently of the model.

Table 1: Encoder $\times$ integrator on *regular* eval ($s_{\rm eval}=1$), both priors. Rows: trained integrator (*naive*: $s_{\rm train}=1$, tier B; *fine*: $s_{\rm train}=8$, tier C). Cols: encoder. Fine integration beats naive on every cell (4/4). Encoder gap (pos $-$ time, last column) is positive in every naive row and $\leq 0.0003$ in every fine row: with fine integration the encoder choice is empirically inert.

Table 2: Integrator on *mixed* schedule (time-aware encoder) at two eval-grid refinements. Cols: trained integrator (*naive*: $s_{\rm train}=1$; *fine*: $s_{\rm train}=8$). Rows: eval-time substep tier $s_{\rm eval}$ of the held-out test trajectories (independent of the model). Fine beats naive on every cell (4/4). *Fine's lead grows when the eval is more refined*: $+0.0018\to+0.0057$ on OU, $+0.0048\to+0.0088$ on Neural, an integrator-specific signature.

#### Findings\.

Fine-grid integration wins on 8/8 fine-vs-naive comparisons across both tables; under a no-effect null the probability of this sign pattern is $1/256$, providing evidence stronger than the per-cell magnitudes alone. Crucially, fine's lead grows monotonically as the eval grid refines (Table [2](https://arxiv.org/html/2605.28880#S4.T2), $+0.0018\to+0.0057$ on OU and $+0.0048\to+0.0088$ on Neural), which is the discretisation-bias signature predicted by [Section 3.1](https://arxiv.org/html/2605.28880#S3.SS1): as the eval distribution approaches the SDE limit, the model trained at the SDE limit pulls ahead. The encoder axis is conditional on the integrator: null with fine, positive (time-aware leads) with naive. We read the interaction as follows: with fine integration the data-generating process is approximately schedule-invariant (Definition [3.1](https://arxiv.org/html/2605.28880#S3.Thmtheorem1)), so the model has little to gain from explicit time-gap features; with naive integration the conditional dynamics genuinely depend on $\Delta_{i}$, and the time-aware encoder's Fourier embedding of inter-observation gaps gives the model a route to compensate. The positional-only encoder is structurally OOD on *mixed* (positions $\neq$ times) and omitted from Table [2](https://arxiv.org/html/2605.28880#S4.T2); its naive-vs-fine pattern mirrors time-aware. An instability check on $\theta_{\rm range}=[0.5,2.0]$ (where naive cells saturated $\sim 30$–$50\,\%$ of batches at the clip) and PK / chamber zero-shot transfer are in Appendices [C](https://arxiv.org/html/2605.28880#A3), [D](https://arxiv.org/html/2605.28880#A4).
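
The $1/256$ figure is the one-sided sign-test probability of 8 wins out of 8 when each comparison is an independent fair coin flip. The short sketch below reproduces that arithmetic exactly; treating the eight cells as independent is the assumption behind the null, and the function name is illustrative.

```python
from fractions import Fraction
from math import comb


def one_sided_sign_test(wins: int, n: int) -> Fraction:
    """P(at least `wins` successes out of n) under a fair-coin (no-effect) null."""
    return sum(Fraction(comb(n, k), 2 ** n) for k in range(wins, n + 1))


p_all_wins = one_sided_sign_test(8, 8)
print(p_all_wins, float(p_all_wins))
```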

## 5 Discussion and Limitations

#### What the prior buys today\.

A precise continuity criterion realised by tier\-\(C\) integration\. Across two independent priors, fine\-grid integration produces models that transfer better, including on the eval that matches the naive variant’s training tier\. The encoder axis is conditional on the integrator: with fine, encoder choice is empirically inert; with naive it is not\.

#### Limitations and future work\.

Per-cell $\Delta$s are small; multi-seed replication is needed to harden the cross-prior agreement. Within-regime noise is Markov; neural drifts capture nonlinear dependence but not time-correlated noise. Model capacity is small, and real-data transfer ([Appendix C](https://arxiv.org/html/2605.28880#A3)) is preliminary. Jump-diffusion SDEs and Neural-SDE drifts (Tzen and Raginsky, [2019](https://arxiv.org/html/2605.28880#bib.bib44)) would extend the construction; latent-ODE–style hidden states could address non-Markov confounding.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here\.

## References

- A. J. Boeckmann, L. B. Sheiner, and S. L. Beal (1994). NONMEM users guide: part V. NONMEM Project Group, University of California, San Francisco. Note: Reference dataset: theophylline pharmacokinetics, 12 subjects.
- P. Boeken and J. M. Mooij (2024). Dynamic structural causal models. In AAAI 2024 Workshop on Causal Inference for Time Series (CI4TS).
- L. Castri, S. Mghames, M. Hanheide, and N. Bellotto (2024). CAnDOIT: causal discovery with observational and interventional data from time series. Advanced Intelligent Systems.
- Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu (2018). Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8(1), pp. 6085.
- R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems, Vol. 31.
- S. Dooley, G. S. Khurana, C. Mohapatra, S. V. Naidu, and C. White (2023). ForecastPFN: synthetically-trained zero-shot forecasting. Advances in Neural Information Processing Systems 36, pp. 2403–2426.
- J. L. Gamella, P. Bühlmann, and J. Peters (2024). The causal chambers: real physical systems as a testbed for AI methodology. Nature Machine Intelligence. Note: Data release: [https://causalchamber.org](https://causalchamber.org/)
- A. Hamberg, M. Dahl, M. Barban, M. G. Scordo, M. Wadelius, V. Pengo, R. Padrini, and E. N. Jonsson (2007). A PK–PD model for predicting the impact of age, CYP2C9, and VKORC1 genotype on individualization of warfarin therapy. Clinical Pharmacology & Therapeutics 81(4), pp. 529–538.
- M. A. Hernán and J. M. Robins (2020). Causal inference: what if. Chapman & Hall/CRC.
- N. Hollmann, S. Müller, K. Eggensperger, and F. Hutter (2023). TabPFN: a transformer that solves small tabular classification problems in a second. In International Conference on Learning Representations.
- P. Kidger, J. Foster, X. Li, and T. Lyons (2021). Neural SDEs as infinite-dimensional GANs. In International Conference on Machine Learning, pp. 5453–5463.
- P. E. Kloeden and E. Platen (1992). Numerical solution of stochastic differential equations. Springer-Verlag, Berlin.
- P. Li, Y. Meng, X. Wang, F. Shen, Y. Li, J. Wang, and W. Zhu (2023). Causal discovery in temporal domain from interventional data. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 1306–1315.
- P. Li, X. Wang, Z. Zhang, Y. Meng, F. Shen, Y. Li, J. Wang, Y. Li, and W. Zhu (2024). RealTCD: temporal causal discovery from interventional data with large language model. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 4669–4677.
- L. Lorch, A. Krause, and B. Schölkopf (2024). Causal modeling with stationary diffusions. In International Conference on Artificial Intelligence and Statistics. Note: arXiv:2310.17405
- Y. Ma, D. Frauen, E. Javurek, and S. Feuerriegel (2026). Foundation models for causal inference via prior-data fitted networks. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=d2L1ndOKjq)
- V. Moroshan, J. Siems, A. Zela, T. Carstensen, and F. Hutter (2025). TempoPFN: synthetic pre-training of linear RNNs for zero-shot time series forecasting. In NeurIPS 2025 Workshop on AI for Tabular Data.
- S. Müller, N. Hollmann, S. P. Arango, J. Grabocka, and F. Hutter (2022). Transformers can do Bayesian inference. In International Conference on Learning Representations.
- T. Nagler (2023). Statistical foundations of prior-data fitted networks. In International Conference on Machine Learning, pp. 25660–25676.
- B. Øksendal (2003). Stochastic differential equations: an introduction with applications. 6th edition, Springer, Berlin.
- J. Pearl (2009). Causality: models, reasoning, and inference. Cambridge University Press.
- J. Robertson, A. Reuter, S. Guo, N. Hollmann, F. Hutter, and B. Schölkopf (2025). Do-PFN: in-context learning for causal effect estimation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=OaNbl9b56B)
- Y. Rubanova, R. T. Q. Chen, and D. Duvenaud (2019). Latent ordinary differential equations for irregularly-sampled time series. In Advances in Neural Information Processing Systems, Vol. 32.
- S. N. Shukla and B. M. Marlin (2021). Multi-time attention networks for irregularly sampled time series. In International Conference on Learning Representations.
- E\. O\. Taga, M\. E\. Ildiz, and S\. Oymak \(2025\)TimePFN: effective multivariate time series forecasting with synthetic data\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 20761–20769\.Cited by:[§2](https://arxiv.org/html/2605.28880#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Tashiro, J\. Song, Y\. Song, and S\. Ermon \(2021\)CSDI: conditional score\-based diffusion models for probabilistic time series imputation\.Advances in Neural Information Processing Systems34,pp\. 24804–24816\.Cited by:[§2](https://arxiv.org/html/2605.28880#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Thumm and Y\. Chen \(2026\)Interventional time series priors for causal foundation models\.In1st ICLR Workshop on Time Series in the Age of Large Models,External Links:[Link](https://openreview.net/forum?id=JbTgx2L9Z2)Cited by:[§1](https://arxiv.org/html/2605.28880#S1.p1.1),[§1](https://arxiv.org/html/2605.28880#S1.p2.1),[§2](https://arxiv.org/html/2605.28880#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2605.28880#S3.SS1.p2.10),[§3\.2](https://arxiv.org/html/2605.28880#S3.SS2.SSS0.Px1.p1.8),[§3\.2](https://arxiv.org/html/2605.28880#S3.SS2.SSS0.Px2.p1.4),[§3\.3](https://arxiv.org/html/2605.28880#S3.SS3.p1.7),[§3\.4](https://arxiv.org/html/2605.28880#S3.SS4.p1.1)\.
- B\. Tzen and M\. Raginsky \(2019\)Neural stochastic differential equations: deep latent gaussian models in the diffusion limit\.External Links:1905\.09883,[Link](https://arxiv.org/abs/1905.09883)Cited by:[§5](https://arxiv.org/html/2605.28880#S5.SS0.SSS0.Px2.p1.1)\.
- S\. Xiaet al\.\(2025\)Causal time series generation via diffusion models\.arXiv preprint arXiv:2509\.20846\.Cited by:[§2](https://arxiv.org/html/2605.28880#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Xie, V\. Feofanov, M\. Alonso, A\. Odonnat, J\. Zhang, T\. Palpanas, and I\. Redko \(2025\)CauKer: classification time series foundation models can be pretrained on synthetic data only\.CoRRabs/2508\.02879\.Cited by:[§2](https://arxiv.org/html/2605.28880#S2.SS0.SSS0.Px3.p1.1)\.

## Appendix A Generator defaults and additional details

Table 3: Default prior hyperparameters used in the experiments of [Section 4](https://arxiv.org/html/2605.28880#S4). In [Section 4](https://arxiv.org/html/2605.28880#S4) all other knobs are held fixed: tightened $\theta_{\mathrm{range}} = [0.1, 0.5]$ so that worst-case $\theta\Delta < 1$ across the schedule distribution (a prior that respects the EM stability condition of [Section 3.1](https://arxiv.org/html/2605.28880#S3.SS1)); back-door TSCM topology; *mixed* observation schedule (random per-trajectory choice between regular, jittered, and Poisson); 10k training steps; identical model size (1.1M parameters).
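
As a rough illustration of the *mixed* observation schedule named in the caption, the sketch below draws one of the three schedule kinds per trajectory and reports the worst-case $\theta\Delta$ for a given schedule. The function names, the jitter fraction, the horizon, and the way the Poisson schedule is rescaled to the horizon are all hypothetical choices made for this example; they are not taken from the released prior.

```python
import numpy as np

def sample_schedule(rng, n_obs=32, horizon=10.0, jitter_frac=0.3):
    """Draw one observation schedule; returns (kind, sorted times in [0, horizon])."""
    kind = str(rng.choice(["regular", "jittered", "poisson"]))
    base = np.linspace(0.0, horizon, n_obs)
    if kind == "regular":
        times = base
    elif kind == "jittered":
        # Perturb each regular time by a fraction of the regular step.
        step = horizon / (n_obs - 1)
        noise = rng.uniform(-jitter_frac, jitter_frac, size=n_obs) * step
        times = np.clip(base + noise, 0.0, horizon)
    else:
        # Poisson-like: i.i.d. exponential gaps, rescaled to end at the horizon.
        gaps = rng.exponential(scale=1.0, size=n_obs)
        times = np.cumsum(gaps)
        times = times / times[-1] * horizon
    return kind, np.sort(times)

def worst_case_theta_delta(times, theta_max=0.5):
    """Largest theta * gap over the schedule, using the top of the theta range."""
    return theta_max * float(np.max(np.diff(times)))

rng = np.random.default_rng(0)
kind, times = sample_schedule(rng)
print(kind, worst_case_theta_delta(times))
```

A check like `worst_case_theta_delta` is one way to confirm that a sampled schedule stays under the $\theta\Delta < 1$ bound quoted above; irregular schedules with long gaps are the ones most likely to approach it.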

## Appendix B Canonical TSCM structures

The named-structure sampler exposes the eight structures (back-door, front-door, instrumental variable, randomised controlled trial, mediator, confounder-plus-mediator, observed confounder, unobserved confounder) in Figure [1](https://arxiv.org/html/2605.28880#A2.F1). We reuse them as canonical sanity checks; the random-DAG sampler of [Section 3.2](https://arxiv.org/html/2605.28880#S3.SS2) generalises this to any $N$ up to $N_{\max}$.

![Refer to caption](https://arxiv.org/html/2605.28880v1/x1.png) Figure 1: Canonical SCM structures used by the named-structure sampler. Each panel shows a back-door / front-door / IV-style template with the treatment $A$ (left), outcome $Y$ (right), and any mediators or confounders. The random-DAG sampler in [Section 3.2](https://arxiv.org/html/2605.28880#S3.SS2) subsumes these as special cases.

## Appendix C Preliminary zero-shot transfer to real irregular data

This appendix documents an early transfer study of the tier-(C) prior to three irregularly-sampled real-world datasets. We flag the numbers as corroborative and *not yet competitive* with dataset-specific baselines; treating them as a full zero-shot claim would require (i) broader mechanism families in the prior, (ii) calibration against domain baselines (NONMEM fits for PK, system-identification baselines for Causal Chamber), and (iii) sensitivity analysis to prior misspecification. Closing these gaps is our primary line of future work.

#### Theophylline pharmacokinetics\.

We use the 12-subject NONMEM-distributed Theophylline dataset (Boeckmann et al., [1994](https://arxiv.org/html/2605.28880#bib.bib36)), which records oral doses and 11 plasma-concentration measurements per subject over $\sim 24$ hours. We treat dose as a time-varying intervention and plasma concentration as the outcome $Y$. Times are converted to seconds and rescaled so that $\bar{\Delta}$ matches the training distribution; values are per-subject $z$-scored and rescaled back for reporting.
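
The preprocessing described above can be sketched as follows. This is a minimal illustration, assuming that $\bar{\Delta}$ denotes the mean gap between consecutive observations and that the $z$-scoring uses the subject's own mean and standard deviation; the helper names and the sampling times in the usage lines are invented for the example and are not the Theophylline data.

```python
import numpy as np

def rescale_times(t_hours, target_mean_gap):
    """Convert hours to seconds, then scale so the mean gap equals target_mean_gap."""
    t_sec = np.asarray(t_hours, dtype=float) * 3600.0
    t_sec = t_sec - t_sec[0]                      # start each subject at t = 0
    mean_gap = np.mean(np.diff(t_sec))
    return t_sec * (target_mean_gap / mean_gap)

def zscore(y, eps=1e-8):
    """Per-subject z-score; returns the normalised values and the statistics used."""
    y = np.asarray(y, dtype=float)
    mu = float(y.mean())
    sigma = max(float(y.std()), eps)
    return (y - mu) / sigma, mu, sigma

def unscale(z, mu, sigma):
    """Map normalised predictions back to original units for reporting."""
    return np.asarray(z, dtype=float) * sigma + mu

# Made-up irregular sampling times (hours) and concentrations, for illustration only.
t_hours = [0.0, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 24.0]
conc = np.array([0.0, 2.8, 4.4, 6.9, 8.1, 7.2, 5.0, 3.4, 1.1])
t_model = rescale_times(t_hours, target_mean_gap=1.0)
z, mu, sigma = zscore(conc)
```

Keeping `mu` and `sigma` alongside the normalised series is what allows RMSE to be reported in original units after prediction.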

#### Warfarin PK/PD\.

32 subjects with irregular oral-dose, plasma-concentration, and PD (prothrombin complex activity) observations (Hamberg et al., [2007](https://arxiv.org/html/2605.28880#bib.bib37)). Variables are aligned to the canonical $(A, M, Y)$ front-door layout with dose as $A$, concentration as $M$, and PD as $Y$, which is the cleanest match to a named-structure TSCM sample from the prior. Per-variable $z$-scoring parallels Theophylline.

#### Causal Chamber \(wind\-tunnel\)\.

The light-tunnel `lt_walks_v1/actuators_white` benchmark used by earlier drafts produces uninformative causal-effect estimates (white-noise actuators, Pearson $r \approx 0$); see [Appendix D](https://arxiv.org/html/2605.28880#A4) for the failure analysis. Phase 14b switches to `wt_intake_impulse_v1` (Gamella et al., [2024](https://arxiv.org/html/2605.28880#bib.bib35)), the wind-tunnel impulse rig, which (i) carries an explicit binary `intervention` column, where each $0 \to 1$ pulse marks a known toggle of the intake-fan setpoint $\texttt{load\_in} \in \{0.01, 1.0\}$, and (ii) has real downstream dynamics on a 5-variable subgraph $\texttt{load\_in} \to \{\texttt{current\_in}, \texttt{rpm\_in}, \texttt{pressure\_intake}, \texttt{pressure\_downwind}\}$. We extract 200 episodes (50 pre / 20 post samples around each toggle, real per-row timestamps with median 0.15 s and max 2.4 s) and query each of three downstream variables; Pearson $r$ now varies meaningfully ([Table 4](https://arxiv.org/html/2605.28880#A3.T4)).
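
The episode extraction can be sketched as a rising-edge search on the binary `intervention` column. The function name and the choice to drop windows that run past either end of the recording are assumptions made for this example; the released code may handle boundaries differently.

```python
import numpy as np

def extract_episodes(intervention, n_pre=50, n_post=20):
    """Return (start, onset, stop) index triples, one per 0 -> 1 toggle.

    Rows start:onset are the pre-intervention window and onset:stop the
    post-intervention window.
    """
    flag = np.asarray(intervention).astype(int)
    # A rising edge at position i means flag[i] == 0 and flag[i + 1] == 1.
    onsets = np.flatnonzero(np.diff(flag) == 1) + 1
    episodes = []
    for onset in onsets:
        start, stop = int(onset) - n_pre, int(onset) + n_post
        if start >= 0 and stop <= len(flag):   # skip windows that leave the record
            episodes.append((start, int(onset), stop))
    return episodes

# Synthetic flag column with three pulses; the first is too early to keep.
flag = np.zeros(200, dtype=int)
flag[10:12] = 1
flag[60:70] = 1
flag[150:160] = 1
episodes = extract_episodes(flag)
print(episodes)
```

Because the onset comes from an explicit column rather than a change-point heuristic, each window is aligned to a known toggle of the setpoint.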

#### Eval protocol\.

Each pretrained checkpoint is evaluated zero-shot. The PK adapter can optionally prepend $N$ synthetic pre-baseline observations (zero values, $z$-scored as $-\mu/\sigma$) so the encoder sees a non-empty pre-intervention window. We swept $N \in \{0, 2, 4, 8, 16\}$; $N = 0$ is best across both datasets and both mechanism families, so the numbers reported in [Table 4](https://arxiv.org/html/2605.28880#A3.T4) use no padding. The full sweep (mean Pearson $r$ on Warfarin cp stays in $[0.79, 0.89]$ across $N$ but drops elsewhere; on Theophylline mixed it flips sign from $+0.16$ at $N = 0$ to $-0.54$ at $N = 16$) confirms that the cross-variable mixer's empty-context fallback is acceptable as-is and that the augmentation *hurts* more often than it helps. We treat synthetic pre-baseline padding as a *negative result*: it is useful to know that it does not earn its keep on these benchmarks, but it is not a method we recommend.
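
For concreteness, the padding option can be sketched as below. A raw value of zero maps to $-\mu/\sigma$ after $z$-scoring, which is the value the prepended rows take. The function name and the spacing `dt` of the synthetic rows are hypothetical; the source does not state how the padded timestamps are chosen.

```python
import numpy as np

def prepend_prebaseline(t, y_norm, mu, sigma, n_pad, dt):
    """Prepend n_pad rows whose raw value is 0, i.e. -mu/sigma once normalised."""
    t = np.asarray(t, dtype=float)
    y_norm = np.asarray(y_norm, dtype=float)
    if n_pad == 0:
        return t, y_norm                       # N = 0: no padding
    # Synthetic timestamps placed before the first real observation.
    pad_t = t[0] - dt * np.arange(n_pad, 0, -1)
    pad_y = np.full(n_pad, -mu / sigma)
    return np.concatenate([pad_t, t]), np.concatenate([pad_y, y_norm])

t_pad, y_pad = prepend_prebaseline([0.0, 1.0, 2.0], [0.1, 0.2, 0.3],
                                   mu=2.0, sigma=4.0, n_pad=2, dt=0.5)
print(t_pad, y_pad)
```

With `n_pad=0` the inputs pass through unchanged, which corresponds to the configuration the reported numbers use.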

Table 4: Zero-shot transfer of the two Phase-13b `pnc000` checkpoints (linear / mixed mechanism family, single seed, no eval-time padding, $N = 0$). Lift over the naive (constant-mean) baseline is small on PK because both targets cluster narrowly around their per-subject means; on the wind-tunnel chamber the lift is $\sim 50\,\%$ because the regime-mean shift between $\texttt{load\_in} = 0.01$ and $1.0$ is large. Pearson $r$ is the load-bearing dynamics-tracking metric. Causal Chamber numbers are from 200 episodes of `wt_intake_impulse_v1/load_out_0.5_osr_downwind_4` (Phase 14b).

| Dataset (variable) | Mech. | RMSE ↓ | naive | lift | Pearson $r$ ↑ |
| --- | --- | --- | --- | --- | --- |
| Theophylline (concentration) | linear | 2.41 | 2.37 | −1.8% | +0.53 |
| Theophylline (concentration) | mixed | 2.44 | 2.37 | −3.2% | +0.16 |
| Warfarin (concentration) | linear | 3.45 | 3.51 | +1.8% | +0.88 |
| Warfarin (concentration) | mixed | 3.48 | 3.51 | +0.8% | +0.89 |
| Warfarin (PD response) | linear | 24.99 | 25.25 | +1.0% | +0.36 |
| Warfarin (PD response) | mixed | 25.17 | 25.25 | +0.3% | +0.31 |
| Chamber-WT (rpm_in) | linear | 660.3 | 1264.8 | +47.8% | +0.39 |
| Chamber-WT (rpm_in) | mixed | 660.3 | 1264.8 | +47.8% | +0.95 |
| Chamber-WT (current_in) | linear | 77.1 | 159.8 | +51.8% | −0.16 |
| Chamber-WT (current_in) | mixed | 77.1 | 159.8 | +51.8% | −0.16 |
| Chamber-WT (pressure_downwind) | linear | 3.87 | 7.78 | +50.2% | +0.03 |
| Chamber-WT (pressure_downwind) | mixed | 3.87 | 7.78 | +50.2% | +0.01 |

#### Findings\.

Two headline numbers, one per domain\.

*(i) Warfarin plasma concentration with Pearson $r \approx 0.88$ across both mechanism families*: a strong dynamics-tracking signal on real PK data, obtained by a model that was never fine-tuned on Warfarin. Lift over naive is small (1–2 %) because the per-subject concentration time series cluster narrowly around their means (the naive baseline is hard to beat on RMSE), but Pearson $r$ shows unambiguously that the predictions co-vary with the dose-driven trajectory. The PD outcome is harder ($r \in [0.31, 0.36]$), as expected, since PD responds to concentration with a slow non-stationary delay that our front-door TSCM template only crudely approximates. Theophylline is intermediate ($r \approx 0.53$ for the linear PFN; mixed drops to $r \approx 0.16$). The pattern is consistent: the linear-mechanism PFN is more robust than the mixed-mechanism PFN under PK distribution shift.

*(ii) Wind-tunnel `rpm_in` with Pearson $r = +0.95$ for the mixed-mechanism PFN*: an unambiguous within-episode dynamics-tracking signal on a real physical system. RMSE lift over the naive baseline is already $\sim 50\,\%$ on every queried sensor because the regime-mean shift between $\texttt{load\_in} = 0.01$ and $1.0$ is large; Pearson $r$ is the metric that distinguishes regime-mean recovery from causal-effect tracking. `rpm_in` ramps slowly toward a `load_in`-dependent setpoint (visible exponential rise over $\sim 20$ samples), and the mixed-mechanism PFN tracks that ramp closely. Faster sensors (`current_in`, $r \approx -0.16$; `pressure_downwind`, $r \approx 0$) carry mostly high-frequency noise on top of the regime-mean shift, so the PFN's bet on slow dynamics anti-correlates or zero-correlates with their within-episode noise. The pattern flips relative to PK: on the chamber the *mixed*-mechanism PFN dominates the linear one ($r = 0.95$ vs. $0.39$ on `rpm_in`), consistent with `rpm_in`'s dynamics being noticeably nonlinear (saturating exponential) and the mixed prior having seen nonlinear drifts during pre-training. We caveat that this flip rests on one seed per checkpoint; replicating across seeds before reading it as a domain-dependence finding, rather than a single-seed observation, is on our priority list. We discuss this domain dependence in [Section 5](https://arxiv.org/html/2605.28880#S5).
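
The distinction between RMSE lift and Pearson $r$ can be made concrete with a small synthetic episode. The sketch assumes lift is computed as (naive RMSE − model RMSE) / naive RMSE, which reproduces the 47.8 % entry of Table 4 from its RMSE columns. The ramp, the noise level, the all-zeros stand-in for the naive baseline, and the convention that a constant prediction has $r = 0$ are all invented for illustration.

```python
import numpy as np

def rmse(pred, y):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(y)) ** 2)))

def lift(model_rmse, naive_rmse):
    """Relative RMSE improvement over the naive baseline."""
    return (naive_rmse - model_rmse) / naive_rmse

def pearson(pred, y):
    pred, y = np.asarray(pred, dtype=float), np.asarray(y, dtype=float)
    if pred.max() == pred.min() or y.max() == y.min():
        return 0.0  # convention used here: a constant series tracks nothing
    return float(np.corrcoef(pred, y)[0, 1])

# Consistency check against the rpm_in row of Table 4.
table_lift = lift(660.3, 1264.8)

# Synthetic post-toggle episode: saturating exponential ramp plus noise.
rng = np.random.default_rng(0)
k = np.arange(20)
ramp = 1000.0 * (1.0 - np.exp(-k / 5.0))
y = ramp + rng.normal(0.0, 10.0, size=20)

naive = np.zeros(20)                 # hypothetical baseline stuck at the old level
regime_only = np.full(20, y.mean())  # recovers the new regime mean, no dynamics
for name, pred in [("regime_only", regime_only), ("ramp", ramp)]:
    print(name, round(lift(rmse(pred, y), rmse(naive, y)), 3), round(pearson(pred, y), 3))
```

In this toy setting a predictor that only recovers the new regime mean earns a large lift while tracking none of the within-episode ramp, which is the pattern the fast chamber sensors show in Table 4.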

## Appendix D Additional failure modes and caveats

Three concrete failure modes surfaced during the development of this prior; all three reshaped the experimental design in ways worth documenting\.

#### Clip\-saturation pathology under unstable priors\.

An early grid trained on $\theta_{\mathrm{range}} = [0.5, 2.0]$ (worst-case $\theta\Delta \approx 3.6$, above the EM stability boundary) had every naive-substeps batch saturate the $\pm 10\,\sigma$ target normalisation clip on at least one sample, and roughly half the fine-substeps batches did the same. The resulting "naive-vs-fine" gap there was a numerical-stability artefact rather than a discretisation-bias signature. Tightening the prior to $\theta_{\mathrm{range}} = [0.1, 0.5]$ and raising the clip ceiling to $\pm 50$ pushed empirical $y_{\max}$ below 5 across the entire grid; with the artefact removed, the residual (B)-vs-(C) integrator gap reported in [Section 4](https://arxiv.org/html/2605.28880#S4) is what the discretisation-bias accounting of [Section 3.1](https://arxiv.org/html/2605.28880#S3.SS1) predicts. This is the empirical motivation for the stability condition.
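
A short numerical check illustrates why $\theta\Delta \approx 3.6$ is problematic for a single Euler–Maruyama step per gap. For an OU drift $-\theta x$, one EM step of size $h$ multiplies the mean by $(1 - \theta h)$, whereas the exact factor over a gap $\Delta$ is $e^{-\theta\Delta}$. The sketch below is a generic textbook calculation, not the paper's integrator; the pairing $\theta = 2.0$, $\Delta = 1.8$ is an assumed way of reaching $\theta\Delta = 3.6$, and 64 substeps is an arbitrary choice.

```python
import numpy as np

def em_mean_factor(theta, delta, n_sub=1):
    """Factor applied to E[X] across one gap `delta` by Euler-Maruyama
    with n_sub equal substeps, for the OU drift -theta * x."""
    h = delta / n_sub
    return (1.0 - theta * h) ** n_sub

theta, delta = 2.0, 1.8                            # theta * delta = 3.6 (assumed pairing)
exact = float(np.exp(-theta * delta))              # true contraction over the gap
naive = em_mean_factor(theta, delta)               # one step per observation gap
fine = em_mean_factor(theta, delta, n_sub=64)      # fine grid inside the gap
tight = em_mean_factor(0.5, delta)                 # top of the tightened theta range

print(f"exact={exact:.4f} naive={naive:.4f} fine={fine:.4f} tight={tight:.4f}")
```

With one step per gap the magnitude of the factor exceeds 1, so the simulated mean grows and alternates in sign instead of decaying, which is consistent with targets hitting a normalisation clip. With fine substeps, or with $\theta\Delta < 1$, the factor stays between 0 and 1.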

#### Zero\-context\-augmentation broke per\-variable normalisation\.

A separate training-time fix for the PK regime (where the encoder sees an empty pre-intervention window) used to fire a Bernoulli($p_{\mathrm{no\_context}}$) coin per sample to force `int_onset_idx` $= 0$. The downstream per-variable $z$-scoring then computed mean and std over an empty pre-window: the masked statistics fell back to $(\mu, \sigma) = (0, \epsilon)$ with $\epsilon = 10^{-2}$, blowing $Y_{\mathrm{true,norm}}$ up by a factor of $\sim 100$ and pinning the targets at the new clip. Eval loss climbed monotonically with $p_{\mathrm{no\_context}}$ ($0.34 \to 1.1 \to 2.3$), the opposite of what the augmentation was meant to achieve. The eval-side counterpart (synthetic pre-baseline padding in the PK adapter, Appendix [C](https://arxiv.org/html/2605.28880#A3)) avoids the issue because the prepended zero rows make the pre-window non-empty.
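
The failure can be reproduced in a few lines. The sketch assumes the normalisation statistics are taken over the pre-intervention rows only and fall back to $(0, \epsilon)$ when that window is empty, as described above; the function name and the toy series are hypothetical.

```python
import numpy as np

def masked_zscore_stats(y, pre_mask, eps=1e-2):
    """Mean and std over the pre-intervention rows, with an empty-window fallback."""
    pre = np.asarray(y, dtype=float)[np.asarray(pre_mask, dtype=bool)]
    if pre.size == 0:
        return 0.0, eps                      # the fallback that caused the blow-up
    return float(pre.mean()), max(float(pre.std()), eps)

y = np.array([1.0, 1.2, 0.9, 3.0, 3.5])      # toy series, intervention at index 3

# Normal case: three pre-intervention rows.
mu, sigma = masked_zscore_stats(y, [True, True, True, False, False])
y_norm = (y - mu) / sigma

# Forced onset at index 0: the pre-window is empty.
mu0, sigma0 = masked_zscore_stats(y, [False] * 5)
y_norm_empty = (y - mu0) / sigma0            # equals 100 * y
y_clipped = np.clip(y_norm_empty, -50.0, 50.0)

print(y_norm_empty, y_clipped)
```

Every target in the empty-window case lands on the $\pm 50$ clip, so the loss no longer reflects the underlying dynamics.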

#### The `actuators_white` chamber benchmark motivated a benchmark switch.

Earlier drafts evaluated chamber transfer on the light-tunnel `lt_walks_v1` `actuators_white` experiment, with episodes defined by a change-point detector on the eight polarizer / lamp actuator columns. That benchmark turned out to be *structurally* unsuited to a causal-effect claim. Every actuator (`pol_1`, `pol_2`, `l_11`, $\ldots$) is independently white-noise-driven: $>99\,\%$ of consecutive samples have step changes $>0.5$ in every actuator simultaneously. The "intervention episodes" the detector finds are not interventions in the SCM sense; they are cross-sections of a continuously-randomised process. The post-intervention variance of the queried sensor (`red`) is $95\,\%$ within-episode dynamics and only $5\,\%$ between-episode regime mean, and Pearson $r$ between any model's predictions and ground truth is statistically zero. Apparent "lift over naive" on this dataset is regime-mean recovery, not causal-effect tracking. The other two `lt_walks_v1` experiments (`smooth_polarizers`, `color_mix`) have continuous actuator sweeps and produce zero episodes under any reasonable change-point heuristic. The wind-tunnel `wt_intake_impulse_v1` dataset, used in [Table 4](https://arxiv.org/html/2605.28880#A3.T4), fixes all three issues at once: explicit binary `intervention` column (no change-point heuristic), real physical-system dynamics (`rpm_in` ramps over $\sim 20$ samples), and real per-row timestamps with non-trivial jitter. The Pearson $r = +0.95$ headline on the wt rig exists only because the lt benchmark was diagnosed and replaced.
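
The white-noise diagnostic quoted above (the share of consecutive samples in which every actuator jumps at once) can be sketched as follows. The function name, the uniform value range of the synthetic actuators, and the single-toggle comparison signal are assumptions for illustration; none of the numbers are taken from the Causal Chamber data.

```python
import numpy as np

def frac_simultaneous_jumps(actuators, thresh=0.5):
    """Fraction of consecutive row pairs where every column changes by more than thresh."""
    steps = np.abs(np.diff(np.asarray(actuators, dtype=float), axis=0))
    return float(np.mean(np.all(steps > thresh, axis=1)))

rng = np.random.default_rng(0)

# Eight independently white-noise-driven actuators (synthetic stand-in).
white = rng.uniform(0.0, 180.0, size=(2000, 8))

# A single clean toggle on one actuator, everything else held fixed.
impulse = np.zeros((2000, 8))
impulse[1000:, 0] = 1.0

print(frac_simultaneous_jumps(white), frac_simultaneous_jumps(impulse))
```

A value near 1 indicates that there is no quiet pre-intervention baseline to contrast against, so windows picked out by a change-point detector are not interventions in the SCM sense.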