Tadpole: Autoencoders as Foundation Models for 3D PDEs with Online Learning
Summary
Tadpole introduces a foundation model for 3D PDEs, pre-trained as an autoencoder via efficient online data generation, enabling large-scale diverse training without storage overhead. It demonstrates strong fine-tuning performance for dynamics learning and generative modeling across heterogeneous physical systems.
# Tadpole: Autoencoders as Foundation Models for 3D PDEs with Online Learning
Source: [https://arxiv.org/html/2605.15284](https://arxiv.org/html/2605.15284)
###### Abstract
We introduce Tadpole, a novel foundation model for three\-dimensional partial differential equations \(PDEs\) that addresses key challenges in transferability, scalability to high dimensionality, and multi\-functionality\. Tadpole is pre\-trained as an autoencoder on synthetic 3D PDE data generated by an efficient online data\-generation framework\. This enables large\-scale, diverse training without storage or I/O overhead, demonstrated by scaling to an equivalent of hundreds of terabytes of training data\. By autoencoding single\-channel spatial crops, Tadpole learns rich and transferable representations across heterogeneous physical systems with varying numbers of state variables and spatial resolutions\. Although pre\-trained solely as an autoencoder, Tadpole can be efficiently applied for multiple downstream tasks beyond reconstruction, including dynamics learning and generative modeling\. For dynamics learning, we propose a novel parameter\-efficient fine\-tuning strategy that integrates low\-rank adaptation, latent\-space transformations, and reintroduced skip connections, achieving accurate temporal modeling with a minimal number of trainable parameters\. Tadpole demonstrates strong fine\-tuning performance across various downstream tasks, highlighting its versatility and effectiveness as a foundation model for 3D PDE learning\. Source code and pre\-trained weights of Tadpole are available at [https://github\.com/tum\-pbs/tadpole](https://github.com/tum-pbs/tadpole)\.
Machine Learning, ICML
## 1 Introduction
The foundation model paradigm has achieved transformative success in Natural Language Processing \(NLP\) and Computer Vision \(CV\) \(Myers et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib22); Awais et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib21)\)\. Recently, it has been adapted to scientific machine learning to solve Partial Differential Equations \(PDEs\) \(Subramanian et al\., [2023](https://arxiv.org/html/2605.15284#bib.bib44); Ashton et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib23)\)\. Unlike specialized solvers, these foundation models aim to learn transferable representations across diverse physical systems to allow efficient fine\-tuning for new dynamics\.
The prevailing strategy for building PDE foundation models is to learn the PDE dynamics by pre\-training on large\-scale trajectory datasets that capture a rich diversity of physical phenomena\. These datasets are composed of numerous simulations, each representing a unique system defined by its governing equations, boundary conditions, and parameters \(for instance, fluid viscosities or material stiffnesses\)\. The model’s task is to approximate the mapping from a system’s past states to its future states\. Formally, for a given phenomenon $\mathbb{P}^{i}$, the model learns to predict the state $\mathbf{u}_{t+\Delta t}^{i}$ from prior states $\mathbf{u}_{\leq t}^{i}$\. The ambition is that by learning from many such examples, the model will distill universal physical principles, allowing it to zero\-shot generalize to new phenomena and adapt to specific tasks with minimal additional fine\-tuning\.
Despite the attractiveness and potential of PDE foundation models, three fundamental challenges remain\. First, there is a notable shortage of PDE foundation models for three\-dimensional data\. Most existing PDE foundation models focus on 1D or 2D problems, and the few notable exceptions that support 3D often rely on datasets that combine 3D with 2D/1D data \(Rautela et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib57); McCabe et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib69)\), or rely purely on 2D data \(Hao et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib68)\)\. Besides the heavily increased computational cost, a key reason for this lack of 3D models is the difficulty of collecting diverse and large\-scale 3D PDE datasets for pre\-training\. Generating, storing, reading, and processing 3D data is significantly more expensive than 2D data, which fundamentally limits the diversity and scale of precomputed 3D PDE datasets\. Since many real\-world applications \(e\.g\., weather forecasting, fluid dynamics, and material science\) inherently involve 3D spatial domains, developing effective 3D PDE foundation models is crucial for advancing scientific machine learning\.
Figure 1: Overview of Tadpole: a\) Tadpole is pre\-trained as an autoencoder on single\-channel crops of 3D PDE data generated on\-the\-fly by a GPU\-based solver with an efficient buffer strategy to eliminate I/O and storage bottlenecks\. b\) The pre\-trained Tadpole can be used for various downstream tasks, including autoencoding, dynamics learning with the novel Tadpole\-DFT method, and generative modeling via latent flow matching\.
In addition, transferability and generalization remain inconsistent\. Ideally, most parameters of a foundation model can be reused without retraining, since the network should have learned general, transferable representations\. For example, zero\-shot evaluation and Parameter\-Efficient Fine\-Tuning \(PEFT\) have become standard benchmarks for the quality of NLP and CV foundation models \(Ding et al\., [2023](https://arxiv.org/html/2605.15284#bib.bib26); Han et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib27); Xin et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib28); Zhang et al\., [2025a](https://arxiv.org/html/2605.15284#bib.bib29); Meng et al\., [2022](https://arxiv.org/html/2605.15284#bib.bib30)\)\. However, most PDE foundation models still rely on full\-parameter fine\-tuning \(FPFT\), and preliminary zero\-shot/PEFT experiments have shown limited success \(McCabe et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib47); Holzschuh et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib51); Rautela et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib57)\)\. The reliance on FPFT casts doubt on the paradigm of training PDE foundation models: whether a model can really learn a generalizable representation through pre\-training on PDE dynamics with extreme variability\.
Finally, current PDE foundation models focus solely on dynamics learning, neglecting the ability to extend to other functionality\. For example, generative modeling has emerged as a powerful paradigm in scientific machine learning \(Liu and Thuerey, [2024](https://arxiv.org/html/2605.15284#bib.bib31); Rühling Cachay et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib78); Jacobsen et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib77)\)\. Achieving multi\-functionality across diverse downstream tasks, such as generative modeling, is a new challenge for PDE foundation models\.
Therefore, developing a foundation model for 3D PDEs that efficiently and reliably generalizes across different tasks remains an open problem\. Our work takes important steps toward addressing these challenges with Tadpole, three\-dimensional autoencoders for PDEs with online learning\. It challenges the widespread notion that PDE foundation models require pre\-training on the PDE dynamics of massive precomputed local data\. We instead establish that foundation models can be trained for representation learning using simple, synthetic data generated on\-the\-fly during training\. In contrast to foundation models in NLP, where representations emerge implicitly from next\-token prediction, we achieve representation learning via autoencoding, explicitly optimizing a continuous latent space to capture the underlying data manifold\. Our key innovations are:
- A synthetic online learning framework: We propose an efficient online learning framework with highly accurate, yet efficient GPU\-based pseudo\-spectral solvers and a novel buffer strategy, effectively bypassing I/O bottlenecks and storage limits at training time\.
- Transferable representations: By pre\-training Tadpole as an autoencoder on cropped individual fields, our models learn rich, transferable representations, enabling them to process varying PDE systems across different resolutions\.
- Efficient dynamics fine\-tuning: We propose a novel PEFT method for dynamics learning that integrates latent transformations, re\-introduced skip connections, and LoRA \(Hu et al\., [2022](https://arxiv.org/html/2605.15284#bib.bib34)\) fine\-tuning, which better utilizes the pre\-trained representations and achieves high accuracy\.
- Multi\-task versatility: We demonstrate that Tadpole excels across different downstream tasks, including autoencoding, dynamics learning, and generative modeling, and at resolutions up to $1024^{3}$ \(i\.e\., on more than one billion degrees of freedom\)\.
## 2 Related Work
The potential of pre\-trained neural networks to generalize across diverse physical systems was first characterized by Subramanian et al\. \([2023](https://arxiv.org/html/2605.15284#bib.bib44)\)\. Subsequent research has prioritized architectural scalability, from traditional U\-Net structures \(Thuerey et al\., [2020](https://arxiv.org/html/2605.15284#bib.bib80); Siddik et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib52)\) to modern vision\-transformer \(ViT\) designs \(Herde et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib46); Hao et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib68); Holzschuh et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib51)\)\. Poseidon \(Herde et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib46)\) utilizes a multiscale transformer with time\-conditioned layer norms to achieve continuous\-in\-time evaluations, while DPOT \(Hao et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib68)\) scales to 1 billion parameters using a Fourier\-attention\-based architecture\. Other similar works include MPP \(McCabe et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib47)\) and Walrus \(McCabe et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib69)\), the latter introducing compute\-adaptive tokenization to maintain stability\.
A central line of research focuses on the representation and embedding of heterogeneous PDE systems\. Researchers have explored encoding PDEs as computational graphs to capture symbolic and numerical information simultaneously \(Ye et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib53), [2025](https://arxiv.org/html/2605.15284#bib.bib54)\), introducing point\-wise deep conditions to guide the global attention of transformers \(Zhou et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib64)\), and utilizing SymPy\-based libraries for automated symbolic tokenization \(Jollie et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib60)\)\. To overcome the limitations of single\-modality inputs, multimodal frameworks such as PROSE\-PDE \(Sun et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib48)\) and UPS \(Shen et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib50)\) integrate numerical states with symbolic or textual descriptions \(Wiesner et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib63); Negrini et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib56)\)\. In addition, UPS \(Shen et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib50)\) warm\-starts from pre\-trained Large Language Models \(LLMs\) to explicitly align data and improve computational efficiency\.
Drawing inspiration from LLMs, recent studies have investigated In\-Context Learning \(ICL\) for PDE foundation models \(Yang et al\., [2023](https://arxiv.org/html/2605.15284#bib.bib45); Cao et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib72); Song et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib62)\)\. Zebra \(Serrano et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib71)\) and VICON \(Cao et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib72)\) leverage prompt\-based trajectories to solve parametric PDEs, while Liu et al\. \([2025b](https://arxiv.org/html/2605.15284#bib.bib61)\) utilize a block causal transformer to treat historical frames as contextual priors for next\-frame prediction\. Parallel to these ICL methods, PhysiX \(Nguyen et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib66)\) utilizes discrete tokenization and autoregressive next\-token prediction to model physical processes\.
Beyond these themes, the field is advancing on several adjacent topics\. PreLowD \(Hemmasian and Farimani, [2024](https://arxiv.org/html/2605.15284#bib.bib74)\), MORPH \(Rautela et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib57)\), and OmniArch \(Chen et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib67)\) have proposed lower\-dimensional pre\-training\. Frequency\-adaptive fine\-tuning has also been proposed \(Zhang et al\., [2025b](https://arxiv.org/html/2605.15284#bib.bib75)\), while constraint\-aware pre\-training \(Totounferoush et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib59)\) and physics\-informed temporal alignment \(Zhu et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib65)\) incorporate PDE residuals to ensure physical consistency\. A recent work \(Zhou and Farimani, [2024](https://arxiv.org/html/2605.15284#bib.bib4)\) also pre\-trains autoencoders for 2D PDEs, where the decoder is removed for dynamics fine\-tuning, similar to previous latent\-space learners \(Wiewel et al\., [2019](https://arxiv.org/html/2605.15284#bib.bib37); Regazzoni et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib38)\)\. Finally, frontiers such as operator discovery \(Rahman et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib73); Morel et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib70)\) and reward\-model\-driven reasoning \(Mansingh et al\., [2025](https://arxiv.org/html/2605.15284#bib.bib58)\) represent the latest efforts for scientific foundation models\.
## 3 Self\-Supervised Pre\-training
### 3\.1 Training Objective
In traditional pre\-training for PDE foundation models, the models learn the dynamics mapping a previous state $\mathbf{u}_{t}$ to a future state $\mathbf{u}_{t+\Delta t}$\. In contrast, we pre\-train Tadpole as an autoencoder that reconstructs $\mathbf{u}_{t}$ to learn rich, transferable spatial features of $\mathbf{u}_{t}$ itself\. Specifically, Tadpole is pre\-trained as a Variational Autoencoder \(VAE\) with an adversarial loss to encourage sharper reconstructions, following the success of representation learning paradigms in CV \(Esser et al\., [2021](https://arxiv.org/html/2605.15284#bib.bib39); Rombach et al\., [2022](https://arxiv.org/html/2605.15284#bib.bib40)\)\. Tadpole consists of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$\. The encoder transforms the input $\mathbf{u}_{t}$ into a latent distribution $p_{\mathcal{E}}(\mathbf{z}_{t}|\mathbf{u}_{t})$, while the decoder reconstructs the input from a sampled latent representation $\mathbf{z}_{t}$\. A discriminator network $\mathcal{A}$ is optimized simultaneously to distinguish between real and reconstructed inputs and to provide feedback to the backbone training\. Details of the pre\-training target are provided in [Section C\.2](https://arxiv.org/html/2605.15284#A3.SS2)\. We choose reconstruction as the pre\-training target over the dynamics target for the following reasons:
- In dynamics pre\-training, a single $\mathbf{u}_{t}$ may evolve into significantly different future states depending on PDE type, boundary conditions, and physical parameters\. This necessitates high architectural complexity, as the network must distinguish between different physical systems by embedding a diverse set of parameters\.
- The dynamics pre\-training target can usually only be applied to dynamics downstream tasks\. In contrast, reconstruction pre\-training provides a meaningful latent space of the solution domain, enabling more diverse applications in different types of downstream tasks\.
- Reconstruction only requires learning the low\-dimensional manifold of admissible PDE solutions, which is often smooth and highly structured due to spatial correlations induced by differential operators\. In contrast, predicting the mapping from $\mathbf{u}_{t}$ to $\mathbf{u}_{t+\Delta t}$ entails learning the nonlinear flow on this manifold, which is inherently more difficult than reconstruction, as the model must accurately capture both the geometry of the solution space and the vector field governing its evolution\. A more detailed discussion can be found in [Section A\.1](https://arxiv.org/html/2605.15284#A1.SS1)\.
We show below that Tadpole, with its reconstruction\-based pre\-training, can be effectively fine\-tuned for various downstream tasks \(including dynamics prediction\) thanks to the generalization learned during pre\-training\. Details of the fine\-tuning methods are discussed in [Section 4](https://arxiv.org/html/2605.15284#S4)\.
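As a rough illustration of the variational part of this objective, the following NumPy sketch shows the reparameterized sampling of $\mathbf{z}_{t}$ from the encoder's mean and log\-variance and a reconstruction\-plus\-KL loss. It is a sketch under stated assumptions and not the paper's implementation: the function names and the `kl_weight` value are hypothetical, the reconstruction term is assumed to be a mean squared error, and the adversarial term from the discriminator $\mathcal{A}$ is deliberately left out.

```python
import numpy as np


def reparameterize(mean, log_var, rng):
    """Draw z ~ N(mean, exp(log_var)) via the reparameterization trick."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, exp(log_var)) || N(0, I)), averaged over all latent entries."""
    return 0.5 * np.mean(np.exp(log_var) + mean**2 - 1.0 - log_var)


def vae_loss(u, u_rec, mean, log_var, kl_weight=1e-6):
    """Reconstruction MSE plus a weighted KL term.

    kl_weight is a hypothetical value; the adversarial loss is omitted here.
    """
    rec = np.mean((u - u_rec) ** 2)
    return rec + kl_weight * kl_to_standard_normal(mean, log_var)


rng = np.random.default_rng(0)
mean = rng.standard_normal((2, 8))       # stand-in for the encoder mean output
log_var = rng.standard_normal((2, 8))    # stand-in for the encoder log-variance
z = reparameterize(mean, log_var, rng)   # latent sample that would feed the decoder
```

The KL term vanishes exactly when the encoder outputs a standard normal distribution, which is why it is usually given a small weight relative to the reconstruction term in autoencoders that are meant to reconstruct sharply.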
### 3\.2 Online Learning Framework
Another significant difference between the training of Tadpole and traditional PDE foundation models is that we use an efficient online training \(Hoi et al\., [2021](https://arxiv.org/html/2605.15284#bib.bib2); Meyer et al\., [2023](https://arxiv.org/html/2605.15284#bib.bib3); Terraz et al\., [2017](https://arxiv.org/html/2605.15284#bib.bib81)\) pipeline\. This pipeline is guided by two desiderata: \(i\) expose the model to a diverse training distribution, and \(ii\) eliminate I/O overheads and storage challenges associated with large\-scale 3D PDE datasets while sustaining high\-throughput training without stalling\.
Data Generation: All pre\-training data are generated on\-the\-fly using a PyTorch\-based GPU solver\. Spatial derivatives are computed via pseudo\-spectral methods based on Fast Fourier Transforms \(FFTs\), and time integration is performed using Exponential Time Differencing Runge–Kutta \(ETDRK\) schemes \(Cox and Matthews, [2002](https://arxiv.org/html/2605.15284#bib.bib35); Koehler et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib76)\), providing a highly efficient simulation backbone\. Details of the solver can be found in [Section B\.1\.2](https://arxiv.org/html/2605.15284#A2.SS1.SSS2)\. Although Fourier\-spectral solvers impose periodic boundary conditions at the global domain level, Tadpole is trained exclusively on randomly sampled crops, as will be discussed in [Section 3\.3](https://arxiv.org/html/2605.15284#S3.SS3)\. These crops correspond to local regions with non\-periodic boundaries, thereby preventing the model from learning spurious periodic structures\. Meanwhile, we also impose various initial conditions for different PDEs, further increasing data diversity \(cf\. [Section B\.1\.3](https://arxiv.org/html/2605.15284#A2.SS1.SSS3)\)\.
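To make the pseudo\-spectral idea concrete, the sketch below computes a spatial derivative of a periodic 3D field with FFTs. It uses NumPy on the CPU rather than the PyTorch GPU solver described above, the function name and the domain length argument are assumptions made for illustration, and the ETDRK time integration is not shown.

```python
import numpy as np


def spectral_derivative(u, length, axis=0):
    """First derivative along `axis` on a periodic domain of size `length`."""
    n = u.shape[axis]
    # Angular wavenumbers matching the FFT output ordering.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    shape = [1] * u.ndim
    shape[axis] = n
    u_hat = np.fft.fft(u, axis=axis)
    # Differentiation in physical space is multiplication by i*k in Fourier space.
    return np.real(np.fft.ifft(1j * k.reshape(shape) * u_hat, axis=axis))


# Toy check on a field that varies like sin(x) along the first axis.
n, length = 32, 2.0 * np.pi
x = np.arange(n) * length / n
u = np.sin(x)[:, None, None] * np.ones((1, 4, 4))
du_dx = spectral_derivative(u, length, axis=0)
max_error = np.max(np.abs(du_dx - np.cos(x)[:, None, None]))
```

For smooth periodic fields the error of such a derivative is limited by floating\-point precision rather than by grid spacing, which is one reason spectral solvers can be both accurate and cheap on coarse grids.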
Buffering: To prevent the simulation speed from limiting the training speed, we employ a three\-stage buffering strategy\. Simulation outputs are first written to a small First\-In\-First\-Out \(FIFO\) buffer and asynchronously forwarded to the training processes\. Each training process maintains a second FIFO buffer and a larger cache governed by a Most\-Frequently\-Used \(MFU\) replacement policy, from which batches are drawn directly during training\. Background threads continuously replenish the MFU cache from newly arriving samples, effectively hiding simulation and communication latency behind training computation\. The designed communication and buffer strategy can be effectively extended to multi\-node HPC setups, and more details are provided in [Section B\.2](https://arxiv.org/html/2605.15284#A2.SS2)\.
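The interplay of the FIFO buffer and the MFU cache can be sketched in a few lines of plain Python. This is a single\-threaded toy under stated assumptions: the class and function names are hypothetical, "MFU" is read as "replace the entry that has been drawn most often", and the asynchronous transfer and background threads of the actual framework are omitted.

```python
import random
from collections import deque


class MFUCache:
    """Fixed-capacity sample cache that replaces the most frequently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = []      # cached training samples
        self.use_counts = []   # how often each sample was drawn for a batch

    def insert(self, sample):
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
            self.use_counts.append(0)
            return
        # Cache is full: overwrite the entry that training has seen most often.
        victim = max(range(self.capacity), key=self.use_counts.__getitem__)
        self.samples[victim] = sample
        self.use_counts[victim] = 0

    def draw(self, rng):
        idx = rng.randrange(len(self.samples))
        self.use_counts[idx] += 1
        return self.samples[idx]


def refill(fifo, cache):
    """Move everything currently waiting in the FIFO buffer into the cache."""
    while fifo:
        cache.insert(fifo.popleft())


fifo = deque(["sim_0", "sim_1", "sim_2"])   # stand-ins for simulation outputs
cache = MFUCache(capacity=2)
refill(fifo, cache)
batch = [cache.draw(random.Random(0)) for _ in range(4)]
```

Replacing heavily used entries first keeps the cache from over\-sampling old simulations while fresh ones are still arriving, and drawing batches from the cache means training never has to wait for the solver.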
### 3\.3 Dataset Structure
Another significant challenge in pre\-training PDE foundation models is the variability in spatial domain sizes and state variable counts across different systems, which complicates batching for diverse datasets\. We address this in Tadpole by pre\-training on single\-channel crops of 3D PDE data\. Specifically, given training data of shape $[B, C, X, Y, Z]$, where $B$ is the batch size, $C$ is the channel dimension representing the number of state variables, and $X, Y, Z$ are the spatial dimensions, we collapse the channel dimension into the batch dimension and randomly sample contiguous crops of shape $[B\times C, 1, H_{X}, H_{Y}, H_{Z}]$ for training\. During inference, large domains are processed by encoding crops for each state variable in mini\-batches, thereby avoiding the memory overhead of jointly processing all variables and spatial locations\. In this work, we set the crop sizes to $H_{X,Y,Z}=64$ for pre\-training\. Meanwhile, to reduce data transfer volume and increase sample diversity, we apply an intermediate pre\-cropping step before sending simulation data to the training process\. Each simulation output is first cropped to an intermediate spatial size $H_{X,Y,Z}^{\prime}=96$, smaller than the full simulation resolution but larger than the final training crop size $H_{X,Y,Z}$, enabling multiple distinct random crops to be drawn from the same transmitted sample\. Notably, in conventional dynamics pre\-training, learning from single\-channel crops of arbitrary size is challenging due to dynamics dependencies across variables \(e\.g\., velocity and pressure\) and error accumulation at crop boundaries during long rollouts\. In contrast, Tadpole pre\-training does not suffer from these issues due to its reconstruction learning target\.
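The reshaping and cropping described above can be sketched with NumPy as follows. The function name is hypothetical, one independent random crop per single\-channel field is an assumption about how the crops are drawn, and the toy sizes are much smaller than the 96 and 64 used in the text so that the example stays light.

```python
import numpy as np


def single_channel_crops(batch, crop=64, rng=None):
    """Fold channels into the batch axis and cut one random cubic crop per field.

    batch: array of shape [B, C, X, Y, Z]
    returns: array of shape [B*C, 1, crop, crop, crop]
    """
    rng = np.random.default_rng() if rng is None else rng
    b, c, x, y, z = batch.shape
    fields = batch.reshape(b * c, 1, x, y, z)
    out = np.empty((b * c, 1, crop, crop, crop), dtype=batch.dtype)
    for i, field in enumerate(fields):
        # Random offsets such that the crop stays inside the field.
        ox = rng.integers(0, x - crop + 1)
        oy = rng.integers(0, y - crop + 1)
        oz = rng.integers(0, z - crop + 1)
        out[i] = field[:, ox:ox + crop, oy:oy + crop, oz:oz + crop]
    return out


rng = np.random.default_rng(0)
# Toy stand-in for a pre-cropped sample (the text uses 96^3 with 64^3 crops).
pre_cropped = rng.standard_normal((2, 3, 24, 24, 24)).astype(np.float32)
crops = single_channel_crops(pre_cropped, crop=16, rng=rng)   # shape [6, 1, 16, 16, 16]
```

Because every crop has a single channel and a fixed size, samples from systems with different numbers of state variables and different resolutions can share one batch without padding.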
### 3\.4 Network Architecture
We employ P3D \(Holzschuh et al\., [2026](https://arxiv.org/html/2605.15284#bib.bib17)\), a state\-of\-the\-art transformer for 3D PDEs, as the backbone for Tadpole\. We adapt it to the reconstruction objective while preserving its core design principles\. Specifically, we remove all embeddings related to PDE parameters, as the autoencoder focuses on input reconstruction rather than modeling complex dynamics\. Furthermore, we eliminate the skip connections between the encoder and decoder to ensure that all information necessary for reconstruction is contained in the bottleneck latent space\. An additional projection layer is appended to the encoder to map its output to the latent distribution parameters \(mean and log\-variance\)\. Detailed architectural specifications are provided in [Section C\.1](https://arxiv.org/html/2605.15284#A3.SS1)\.
A primary motivation for choosing P3D, beyond its efficiency and scalability for 3D data, is its hybrid architecture, which combines convolutional layers with a transformer\-based bottleneck\. This configuration leverages the translation equivariance of convolutions\. Consequently, during inference, crops of different sizes than those used in training can be processed by directly applying the convolutional layers to the new inputs\. This design not only provides greater flexibility for processing data with different spatial resolutions but also lays a firm foundation for downstream fine\-tuning, as discussed in the following sections\. It is worth noting that the proposed training and fine\-tuning strategy is not limited to the P3D architecture but also applies to other architectures with translation equivariance\.
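The translation equivariance argument can be illustrated with a one\-dimensional toy example. The sketch below is not the P3D architecture: it uses a single unpadded NumPy convolution with a random kernel, and the helper name is hypothetical. It only shows the two properties the paragraph relies on, namely that shifting the input shifts the output and that the same kernel applies to inputs of any length.

```python
import numpy as np


def valid_conv1d(signal, kernel):
    """Sliding-window filter without padding (output is shorter than the input)."""
    return np.convolve(signal, kernel, mode="valid")


rng = np.random.default_rng(0)
kernel = rng.standard_normal(3)
signal = rng.standard_normal(40)

full = valid_conv1d(signal, kernel)          # length 38
shifted = valid_conv1d(signal[5:], kernel)   # input translated by 5 samples, length 33

# Translating the input translates the output by the same amount.
equivariant = np.allclose(shifted, full[5:])

# The same kernel runs unchanged on an input of a different size.
longer = valid_conv1d(rng.standard_normal(100), kernel)
```

With padding, the identity holds only away from the boundaries, which is consistent with the boundary effects of crops discussed elsewhere in the text.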
Together, the above components enable continuous, storage\- and bottleneck\-free online pre\-training on effectively unlimited amounts of 3D PDE data\. An overview of the resulting pipeline is shown in [Figure 1](https://arxiv.org/html/2605.15284#S1.F1)a\)\.
## 4 Flexible Fine\-Tuning on Downstream Tasks
Below, we outline our fine\-tuning methodology for the core competencies of scientific foundation models: autoencoding, dynamics, and generative modeling\. In particular, we introduce the Tadpole dynamics fine\-tuning \(Tadpole\-DFT\) method for the dynamics task\. An overview of the fine\-tuning pipeline is shown in [Figure 1](https://arxiv.org/html/2605.15284#S1.F1)b\)\.
Dynamics Learning: With a pre\-trained autoencoder like Tadpole, a natural approach for downstream dynamics tasks is to learn the PDE dynamics in the latent space rather than the high\-dimensional physical space \(Wiewel et al\., [2019](https://arxiv.org/html/2605.15284#bib.bib37); Regazzoni et al\., [2024](https://arxiv.org/html/2605.15284#bib.bib38)\)\. However, since the latent representation $\mathbf{z}_{t}$ is significantly more compact than the original input $\mathbf{u}_{t}$, capturing precise dynamics purely in the latent space is often challenging\. To address this, we propose the novel Tadpole\-DFT approach, which encompasses:
- Latent transformation: Tadpole\-DFT introduces a lightweight sub\-network $\mathcal{S}$ between the pre\-trained Tadpole encoder and decoder with a residual connection\. As discussed in [Section 3\.3](https://arxiv.org/html/2605.15284#S3.SS3), capturing cross\-variable interactions is crucial for dynamics learning\. To this end, we aggregate the latent spaces of all state variables after the encoder, enabling $\mathcal{S}$ to learn correlations between state variables\.
- Re\-introduced skip connections: During fine\-tuning, the skip connections between the encoder and decoder are re\-established, each governed by a zero\-initialized, trainable scale factor $\gamma$\. This allows the model to leverage both the latent dynamics from the sub\-network and high\-resolution spatial information from the skip connections to predict the future state $\mathbf{u}_{t+\Delta t}$\.
- LoRA fine\-tuning: The pre\-trained encoder $\mathcal{E}$ and decoder $\mathcal{D}$ are fine\-tuned using LoRA \(Hu et al\., [2022](https://arxiv.org/html/2605.15284#bib.bib34)\) while their core weights remain frozen\. For a pre\-trained weight matrix $W_{0}\in\mathbb{R}^{d\times k}$, LoRA approximates the update $\Delta W$ via a low\-rank decomposition $\Delta W=AB$, where $A\in\mathbb{R}^{d\times r}$, $B\in\mathbb{R}^{r\times k}$, and the rank $r\ll\min(d,k)$\. During fine\-tuning, the effective weight matrix is computed as $W=W_{0}+AB$, where only the low\-rank matrices $A$ and $B$ are trainable while $W_{0}$ remains frozen\. The updated weights are merged back into the original weights at inference time to avoid additional computational overhead\. LoRA fine\-tuning enables the backbone to adapt to the new information flow from the skip connections and the sub\-network $\mathcal{S}$ while preserving the robust representations acquired during pre\-training\.
In addition, to prevent error accumulation at crop boundaries during rollouts, we exploit the translation equivariance of the Tadpole encoder to construct latent representations from uncropped data\. Despite removing cropping, Tadpole\-DFT remains easier to scale to larger datasets than conventional setups, as inputs are still encoded in mini\-batches along the channel axis\. Meanwhile, we zero\-initialize the weights of the sub\-network and backbone LoRA layers, as well as the scale factors for skip connections, which ensures that the model starts as a standard pre\-trained autoencoder \(outputting the current state $\mathbf{u}_{t}$\) and gradually learns to transform $\mathbf{u}_{t}$ into the future state $\mathbf{u}_{t+\Delta t}$ during training, further maintaining the prior pre\-trained knowledge\.
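The LoRA update $W=W_{0}+AB$ and the effect of zero initialization can be checked numerically with a small NumPy sketch. The dimensions and helper names are illustrative assumptions, and so is the choice of which factor starts at zero: the text only states that the LoRA layers are zero\-initialized, so here $B$ is set to zero, which makes the product $AB$ vanish at the start of fine\-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2                         # toy sizes with r << min(d, k)

W0 = rng.standard_normal((d, k))          # frozen pre-trained weight
A = 0.01 * rng.standard_normal((d, r))    # trainable low-rank factor
B = np.zeros((r, k))                      # trainable, zero-initialized (assumption)


def lora_forward(x, W0, A, B):
    """Compute x @ (W0 + A B)^T without forming the full update matrix."""
    return x @ W0.T + (x @ B.T) @ A.T


def merge(W0, A, B):
    """Fold the low-rank update into a single weight matrix for inference."""
    return W0 + A @ B


x = rng.standard_normal((4, k))

# At initialization the adapted layer reproduces the pre-trained layer exactly.
same_as_pretrained = np.allclose(lora_forward(x, W0, A, B), x @ W0.T)

# After (simulated) training, merged and unmerged forward passes still agree.
B_trained = rng.standard_normal((r, k))
merged_matches = np.allclose(lora_forward(x, W0, A, B_trained),
                             x @ merge(W0, A, B_trained).T)

trainable = A.size + B.size               # d*r + r*k instead of d*k
```

Merging removes the extra matrix products at inference time, so the adapted model costs the same as the original one while only $r(d+k)$ parameters per layer were trained.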
Autoencoding: The pre\-trained Tadpole model can be applied directly to unseen PDE systems for zero\-shot autoencoding tasks\. Alternatively, it can be fine\-tuned on specific downstream systems to further enhance reconstruction quality, either by updating all model parameters or using LoRA to reduce the number of trainable parameters\.
Generative Modeling: With a fine\-tuned Tadpole, we can efficiently build a latent generative model \(Rombach et al\., [2022](https://arxiv.org/html/2605.15284#bib.bib40)\) for 3D PDEs\. For this, a new generative component is trained in the $\mathbf{z}_{t}$ space using a standard flow matching \(Lipman et al\., [2023](https://arxiv.org/html/2605.15284#bib.bib12)\) objective\. At inference time, new samples are drawn from the latent flow matching model and transformed into high\-fidelity 3D PDE data by the Tadpole decoder $\mathcal{D}$\. Operating in the compact latent space rather than the high\-dimensional physical space enables high\-quality generative modeling of complex 3D physical systems while minimizing memory and processing requirements\.
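A minimal sketch of latent flow matching is given below. It assumes the linear interpolation path between Gaussian noise and data latents, which is one common instance of flow matching; the text only states that a standard objective is used. The names `velocity_fn`, `flow_matching_loss`, and `sample` are hypothetical, the velocity model is passed in as a plain callable instead of a trained network, and sampling uses a simple explicit Euler integrator.

```python
import numpy as np


def flow_matching_loss(velocity_fn, z1, rng):
    """Regress the model velocity onto (z1 - z0) along the path (1-t) z0 + t z1."""
    z0 = rng.standard_normal(z1.shape)                       # noise endpoint
    t = rng.uniform(size=(z1.shape[0],) + (1,) * (z1.ndim - 1))
    zt = (1.0 - t) * z0 + t * z1                             # point on the path
    target = z1 - z0                                         # constant path velocity
    return np.mean((velocity_fn(zt, t) - target) ** 2)


def sample(velocity_fn, shape, rng, steps=50):
    """Integrate dz/dt = v(z, t) from noise at t=0 to a latent sample at t=1."""
    z = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((shape[0],) + (1,) * (len(shape) - 1), i * dt)
        z = z + dt * velocity_fn(z, t)
    return z


def zero_velocity(z, t):
    """Placeholder velocity model that predicts no motion."""
    return np.zeros_like(z)


rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 8))       # stand-in for encoded latents z_t
loss = flow_matching_loss(zero_velocity, latents, rng)
new_latents = sample(zero_velocity, (2, 8), rng)   # would be passed to the decoder
```

In a full pipeline the callable would be a neural network trained to minimize this loss on encoder outputs, and the integrated latents would be decoded by $\mathcal{D}$ into physical fields.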
## 5 Experiments
In this section, we first summarize the pre\-training statistics and then evaluate Tadpole across various downstream tasks that involve challenging 3D phenomena\. We assess its zero\-shot capabilities and fine\-tuning performance, comparing it against state\-of\-the\-art PDE foundation models and network architectures for direct, spectral, and distributional metrics\.
### 5\.1 Pre\-training Statistics
With the proposed online\-learning framework, we pre\-train Tadpole on 7 distinct PDE systems at four spatial resolutions \($64^{3}$, $128^{3}$, $256^{3}$, and $384^{3}$\)\. Details of the PDEs and corresponding configurations can be found in [Section B\.1\.1](https://arxiv.org/html/2605.15284#A2.SS1.SSS1)\. We perform a spectral distribution analysis on the pre\-training dataset, illustrating its large spectral diversity, cf\. [Section B\.1\.4](https://arxiv.org/html/2605.15284#A2.SS1.SSS4)\. Meanwhile, we also train Tadpole models of varying sizes\. Specifically, we investigate S, B, and L sizes of 8\.8, 38\.1, and 152\.1 million parameters, with corresponding compression ratios of 16, 8, and 4\. The pre\-trained Tadpole achieves average reconstruction RMSEs of $5.48\times 10^{-3}$, $2.83\times 10^{-3}$, and $2.06\times 10^{-3}$ for the S\-, B\-, and L\-size models across all pre\-training PDEs\. The online data generation pipeline yields approximately 202 TB of training data, with no local storage required\. While naturally dependent on hardware specifics, the online training pipeline achieves an overall $1.8\times$ speedup over an offline setup with pre\-generated data in our high\-speed SSD environment\. In HPC environments, which are typically used for foundation model training and have lower I/O bandwidths for data loading, we have measured differences that are an order of magnitude larger\. Thus, the online training is especially effective in such cases\.



Figure 2: Performance of Tadpole on the downstream autoencoding task \(exact NRMSE values in [Table 2](https://arxiv.org/html/2605.15284#A1.T2)\)\. a\) Zero\-shot reconstruction NRMSE of Tadpole with different model sizes\. Tadpole shows consistent scaling with respect to model size\. b\) Zero\-shot relative NRMSE of Tadpole\-B models pre\-trained on datasets with varying diversity compared to the full\-size online setup \(at 1\.0\)\. All variants perform worse, with rel\. NRMSE values larger than 1\.0; thus, the wide range of PDEs improves zero\-shot performance\. c\) Relative NRMSE of Tadpole\-B models with different fine\-tuning methods compared to models trained from scratch\. Pre\-training consistently improves performance\. E\.g\., LoRA\-32 fine\-tuning reduces errors by more than 60% for MHD compared to training from scratch\.
Figure 3: Visualizations of Tadpole\-B zero\-shot reconstruction on different datasets\. Only velocity channels are shown here; additional ones are provided in [Section A\.7](https://arxiv.org/html/2605.15284#A1.SS7)\. The datasets feature high resolutions, ranging from $96^{2}\times 192$ for TCF to $1024^{3}$ for Iso\.
### 5\.2 Autoencoding
Let $D=C\times X\times Y\times Z$ denote the dimension of a 3D PDE state\. To evaluate Tadpole’s autoencoding performance for unseen data, we consider four representative 3D PDE systems: isotropic turbulence \(Iso, $D=4\times 1024^{3}$\), turbulent channel flow \(TCF, $D=3\times 96^{2}\times 192$\), magnetohydrodynamics \(MHD, $D=10\times 512^{3}$\), and transitional boundary layer flows \(TBL, $D=4\times 224^{3}$\)\. These systems exhibit diverse physical characteristics and significantly higher resolutions that Tadpole has not encountered during pre\-training\. Details of these datasets are provided in [Section B\.3](https://arxiv.org/html/2605.15284#A2.SS3)\.
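To give a feeling for these state sizes, the short script below evaluates $D$ for the four systems and converts it to memory per snapshot. The storage precision of 4 bytes per value \(float32\) is an assumption made for illustration and is not stated in the text, and the ordering of the TCF axes is arbitrary since only the product matters.

```python
# Shapes (C, X, Y, Z) of the four evaluation systems as listed in the text.
systems = {
    "Iso": (4, 1024, 1024, 1024),
    "TCF": (3, 96, 96, 192),
    "MHD": (10, 512, 512, 512),
    "TBL": (4, 224, 224, 224),
}

BYTES_PER_VALUE = 4  # assumption: float32 storage


def degrees_of_freedom(shape):
    """Return D = C * X * Y * Z for one state."""
    c, x, y, z = shape
    return c * x * y * z


dof = {name: degrees_of_freedom(shape) for name, shape in systems.items()}

for name, d in dof.items():
    gib = d * BYTES_PER_VALUE / 2**30
    print(f"{name}: D = {d:,} values, about {gib:.2f} GiB per snapshot")
```

Under this float32 assumption a single Iso snapshot already occupies 16 GiB, which illustrates why encoding crops per state variable in mini\-batches matters at these resolutions.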
Zero-Shot Performance w.r.t. Model and Dataset Sizes. We first evaluate Tadpole in zero-shot settings. The results indicate a clear scaling trend with respect to model size: larger models consistently outperform smaller ones in the zero-shot setting, as shown in [Figure 2](https://arxiv.org/html/2605.15284#S5.F2)a). In addition, [Figure 2](https://arxiv.org/html/2605.15284#S5.F2)b) compares models pre-trained on datasets with varying diversity. Three successively simpler online training setups are introduced: pre-training with only three PDEs (KS, Burgers, and KPP-Fisher), with one PDE (KS), or with only the initial conditions of the PDEs. The findings show that incorporating additional PDEs during pre-training consistently improves zero-shot performance. In contrast, the model shows reduced performance when pre-trained solely on synthetic initial conditions. This suggests that the PDE dynamics generate novel features that enhance reconstruction generalization. We also include a model pre-trained on a 500 GB local dataset generated with the same PDEs and parameter distributions (cf. [Table 9](https://arxiv.org/html/2605.15284#A2.T9)). The model pre-trained with the online learning framework outperforms this local variant, highlighting that the online learning strategy increases data diversity while eliminating the training I/O and data storage bottleneck. Visualizations of zero-shot reconstructions are shown in [Figure 3](https://arxiv.org/html/2605.15284#S5.F3).
Figure 4: Reconstruction NRMSE of Tadpole-B fine-tuned with different LoRA ranks on the Iso dataset. As the rank increases, performance approaches that of full-parameter fine-tuning.

Figure 5: Performance of Tadpole on two distinct downstream dynamics tasks, TBL and Iso. a) Prediction $\mathrm{NRMSE}_{\mathrm{ES}}$ of different foundation models. Tadpole performs best on TBL and second-best on Iso. b) The trainable parameters of different foundation models. Thanks to the LoRA introduced in the Tadpole-DFT method, only very few parameters are fine-tuned in Tadpole, so its trainable parameter count is significantly smaller than that of the best-performing competitor, Walrus.

Fine-Tuning with Pre-trained Model: To verify the effect of pre-training, we fine-tune Tadpole under different settings and compare the performance with models trained from scratch. The results are summarized in [Figure 2](https://arxiv.org/html/2605.15284#S5.F2)c). Pre-trained Tadpole demonstrates substantial advantages over the from-scratch variant. Models initialized from pre-trained weights, including both FPFT and LoRA-based PEFT, consistently outperform their from-scratch counterparts. Notably, even zero-shot Tadpole-B clearly surpasses from-scratch models on the Iso, MHD, and TBL tasks, which further highlights the effectiveness of pre-training.
Effect of LoRA Rank: [Figure 4](https://arxiv.org/html/2605.15284#S5.F4) further illustrates the impact of the LoRA rank on fine-tuning performance. Even with a small LoRA rank, Tadpole achieves lower reconstruction error than models trained from scratch ($\mathrm{NRMSE} = 7.17 \times 10^{-2}$), highlighting the efficacy of pre-trained representations. As the LoRA rank increases, performance continues to improve and approaches that of FPFT, while maintaining substantially fewer trainable parameters.
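Why a higher rank costs more parameters but stays far below full fine-tuning can be seen from the standard LoRA construction (Hu et al., 2022), where a frozen weight $W$ is augmented by a trainable low-rank product $BA$. The sketch below counts parameters for a single dense layer. The layer width of 1024 is hypothetical, and how LoRA is attached to Tadpole's own layers is not specified in this section, so the numbers are illustrative only.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple:
    """Return (frozen, trainable) parameter counts for one LoRA-adapted dense layer.

    The frozen weight W has shape (d_out, d_in). LoRA adds B with shape
    (d_out, rank) and A with shape (rank, d_in); the effective weight is W + B @ A.
    """
    frozen = d_out * d_in
    trainable = rank * (d_in + d_out)
    return frozen, trainable


# Hypothetical layer width, chosen only for illustration.
D_IN = D_OUT = 1024
for rank in (4, 16, 32, 64):
    frozen, trainable = lora_param_counts(D_IN, D_OUT, rank)
    share = 100.0 * trainable / frozen
    print(f"rank {rank:>2}: {trainable:>7,} trainable vs {frozen:,} frozen ({share:.2f}%)")
```

The trainable count grows linearly in the rank, so doubling the rank doubles the adaptation capacity while the frozen backbone is left unchanged.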
Flexibility in Domain Size and State Variable Count: It is worth noting that the proposed crop-based strategy enables Tadpole to seamlessly handle varying domain sizes and state variable counts. In particular, for Iso, which features extremely high spatial resolutions, and MHD, which has a high state variable count, Tadpole effectively accommodates these variations without any architectural modifications or retraining. Although Tadpole is trained on crops with $H_{X,Y,Z} = 64$, we adopt different inference crop sizes for TCF ($H_{X,Y,Z} = 48$) and TBL ($H_{X,Y,Z} = 32$) to fully cover the spatial domains in these two cases, and Tadpole still obtains impressive zero-shot performance by leveraging the translation equivariance of convolutional layers. Furthermore, for spatial resolutions such as TCF with $96^2 \times 192$, the entire spatial domain can be processed in a single forward pass with similar performance (cf. [Section A.3](https://arxiv.org/html/2605.15284#A1.SS3)). This flexibility is essential for practical scenarios in which PDE systems may differ substantially in configuration.
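The crop sizes quoted above follow from simple divisibility: 48 tiles the $96^2 \times 192$ TCF domain and 32 tiles the $224^3$ TBL domain, while 64 does not divide 224. The sketch below enumerates single-channel crops for each dataset. It assumes non-overlapping tiling and a crop size of 64 for Iso and MHD, neither of which is stated explicitly in this section; `single_channel_crops` is a hypothetical helper.

```python
from itertools import product
from typing import Iterator, Tuple

Crop = Tuple[int, slice, slice, slice]


def single_channel_crops(channels: int, shape: Tuple[int, int, int], h: int) -> Iterator[Crop]:
    """Yield (channel, x-slice, y-slice, z-slice) for non-overlapping cubic crops.

    Every spatial extent must be divisible by the crop size h.
    """
    if any(n % h for n in shape):
        raise ValueError(f"crop size {h} does not tile domain {shape}")
    starts = [range(0, n, h) for n in shape]
    for c in range(channels):
        for x, y, z in product(*starts):
            yield c, slice(x, x + h), slice(y, y + h), slice(z, z + h)


# TCF and TBL crop sizes are quoted in the text; 64 for Iso and MHD is an assumption.
for name, channels, shape, h in [
    ("Iso", 4, (1024, 1024, 1024), 64),
    ("MHD", 10, (512, 512, 512), 64),
    ("TCF", 3, (96, 96, 192), 48),
    ("TBL", 4, (224, 224, 224), 32),
]:
    n_crops = sum(1 for _ in single_channel_crops(channels, shape, h))
    print(f"{name}: {n_crops} single-channel crops of size {h}^3")
```

Each yielded tuple can index a `C x X x Y x Z` array directly, so the same encoder can be applied crop by crop regardless of how many channels or grid points a dataset has.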
### 5\.3Dynamics Learning
In this section, we evaluate Tadpole on a challenging dynamics learning task involving 3D cropped turbulent flows, following the setup of prior work (Holzschuh et al., [2026](https://arxiv.org/html/2605.15284#bib.bib17)). The test contains an isotropic turbulence (Iso) and a turbulence boundary layer (TBL) simulation, both cropped from a larger domain, with $128^3$ points. The cropping removes periodicity and introduces complex boundaries, thereby substantially increasing the difficulty of the dynamics learning task. We compare Tadpole against the state-of-the-art 3D-PDE foundation models MORPH (Rautela et al., [2025](https://arxiv.org/html/2605.15284#bib.bib57)) and DPOT (Hao et al., [2024](https://arxiv.org/html/2605.15284#bib.bib68)). We select variants with total parameter counts comparable to those of the corresponding Tadpole models to ensure fair comparisons. We also include a concurrent foundation model, Walrus (McCabe et al., [2025](https://arxiv.org/html/2605.15284#bib.bib69)), with a significantly larger parameter count than all the other models. All PDE foundation model baselines are fine-tuned via FPFT from their released pre-trained weights. For Tadpole variants, we employ Tadpole-DFT with a sub-network $\mathcal{S}$ using a standard encoder-only transformer architecture in which the spatial dimensions of the latent space are flattened into the token dimension. Details of $\mathcal{S}$ can be found in [Section C.1](https://arxiv.org/html/2605.15284#A3.SS1). The default LoRA rank for Tadpole-DFT is 32 unless specifically mentioned. We additionally compare Tadpole against several state-of-the-art network architectures trained from scratch, whose results can be found in [Section A.6](https://arxiv.org/html/2605.15284#A1.SS6).
We utilize an enstrophy-based spectrum metric, $\mathrm{NRMSE}_{\mathrm{ES}}$, to accurately evaluate the rollout performance (cf. Chen et al., [2024](https://arxiv.org/html/2605.15284#bib.bib6); Holzschuh et al., [2026](https://arxiv.org/html/2605.15284#bib.bib17); and [Section D.1](https://arxiv.org/html/2605.15284#A4.SS1)). Corresponding error values in pixel space are also provided in [Section A.6](https://arxiv.org/html/2605.15284#A1.SS6).


Figure 6: Performance improvements on the dynamics test from pre-training of Tadpole. a) Relative $\mathrm{NRMSE}_{\mathrm{ES}}$ of Tadpole-B fine-tuned using various methods compared to the from-scratch variant. Increasing the LoRA rank in Tadpole-DFT consistently improves performance. b) Trainable parameters for different fine-tuning methods. The largest Tadpole-DFT variant utilizes only 22.3% of the trainable parameters required by the FPFT/from-scratch variant.

[Figure 5](https://arxiv.org/html/2605.15284#S5.F5) summarizes model performance for two dynamics tasks. Tadpole models exhibit meaningful scaling: larger models outperform smaller ones. The Tadpole models always perform better than the DPOT and MORPH models; this holds even for the smallest S model, which has 10× fewer trainable parameters. Importantly, in the TBL test, both the L- and B-size models outperform the Walrus model, which has over two orders of magnitude more trainable parameters: Tadpole-B features a 10-step enstrophy error of 3.37, while the 200× larger Walrus model yields 4.97. It is worth noting that Walrus performs better on the Iso test case, which, however, is included in its training data (marked with subscript \* in [Figure 5](https://arxiv.org/html/2605.15284#S5.F5)a)). This provides an advantage that the other models in this comparison do not have.
Figure 7: One-step $\mathrm{NRMSE}_{\mathrm{ES}}$ of the Tadpole-B model with different sub-network sizes and LoRA ranks. The LoRA rank in particular positively affects performance.

Fine-Tuning Methodologies: [Figure 6](https://arxiv.org/html/2605.15284#S5.F6) presents a comparison between the Tadpole-DFT fine-tuning strategy, FPFT, and training from scratch. As the LoRA rank increases, Tadpole-DFT consistently demonstrates improved performance. Notably, the LoRA-32 configuration outperforms the from-scratch model across all evaluated metrics. Furthermore, the LoRA-64 variant achieves a 49% reduction in the 10-step average prediction error on the TBL dataset compared to the from-scratch approach, while utilizing 77.7% fewer trainable parameters. This result highlights the effectiveness of Tadpole-DFT in adapting the pre-trained Tadpole model to dynamics-learning tasks. The FPFT variant of Tadpole-B, however, does not consistently surpass the Tadpole-DFT variants, despite requiring substantially more trainable parameters. This behavior can be attributed to the fact that Tadpole is pre-trained as an autoencoder without explicit exposure to temporal dynamics; consequently, directly fine-tuning all parameters may lead to suboptimal local minima and unstable training. In contrast, Tadpole-DFT preserves the pre-trained representation by leveraging frozen weights in LoRA and incrementally incorporates dynamics learning through latent transformations and skip connections, thereby facilitating more effective learning of new dynamics features. [Figure 12](https://arxiv.org/html/2605.15284#A1.F12) in the appendix presents the validation loss curves for Tadpole-B fine-tuned with Tadpole-DFT LoRA-32 and with FPFT. Tadpole-DFT not only achieves a lower final error but also converges faster and exhibits more stable training behavior than FPFT.
Figure 8: One-step $\mathrm{NRMSE}_{\mathrm{ES}}$ of Tadpole-B fine-tuned with different Tadpole-DFT components.

LoRA Rank vs Capacity of $\mathcal{S}$: [Figure 7](https://arxiv.org/html/2605.15284#S5.F7) presents an ablation study on fine-tuning capacity for the dynamics learning task. We evaluate the Tadpole-DFT strategy under different configurations by varying the LoRA rank (LoRA-16 and LoRA-64) while keeping the sub-network size fixed, and by varying the sub-network size while fixing the LoRA rank. While increasing the LoRA rank consistently improves performance, the size of the sub-network has little effect. In [Section A.6](https://arxiv.org/html/2605.15284#A1.SS6), we provide NRMSE values in physical space, where increasing the sub-network size slightly improves the prediction accuracy but still has less effect than the LoRA rank. Thus, model performance benefits more strongly from the LoRA rank than from the sub-network’s capacity.
Table 1: Statistical evaluation of the generated samples. Bold and underlined text shows the best and second-best values, respectively. Tadpole with FPFT fine-tuning performed best across all metrics.

DFT Components: Our proposed Tadpole-DFT strategy consists of three major ingredients. To evaluate their individual efficacy, we conducted an ablation study with results presented in [Figure 8](https://arxiv.org/html/2605.15284#S5.F8). We evaluate the performance of Tadpole-B fine-tuned with different Tadpole-DFT variants: removing the latent transformation sub-network, removing the reintroduced skip connections, and freezing the backbone without LoRA fine-tuning. The results indicate that each component contributes meaningfully to overall performance, and removing any single component results in a noticeable increase in $\mathrm{NRMSE}_{\mathrm{ES}}$. Although the previous analysis in [Figure 7](https://arxiv.org/html/2605.15284#S5.F7) suggests that varying the sub-network size has a relatively limited impact on final performance, completely removing the sub-network results in a 4× increase in error. In addition, reintroducing skip connections incurs almost no increase in trainable parameters while improving performance by 68%, highlighting their effectiveness in enhancing information flow across scales during dynamics learning. We also evaluated a pure latent dynamics variant, in which a network with the same architecture as $\mathcal{S}$ was trained to predict directly in the latent space encoded by the best-performing FPFT Tadpole encoder. The results are summarized in [Tables 3](https://arxiv.org/html/2605.15284#A1.T3), [4](https://arxiv.org/html/2605.15284#A1.T4), [5](https://arxiv.org/html/2605.15284#A1.T5) and [6](https://arxiv.org/html/2605.15284#A1.T6) in the Appendix. This latent dynamics variant performed significantly worse than all Tadpole-DFT variants, despite having more trainable parameters.
These results highlight the limitations of relying exclusively on the latent space for dynamics prediction and emphasize the necessity of utilizing all components of Tadpole-DFT.
### 5\.4Generative Modeling
In this section, we evaluate Tadpole as a backbone for generative modeling of 3D turbulent flows. We focus on the TCF dataset, which exhibits complex, anisotropic flow structures. We implement a latent generative model based on flow matching, in which a 12.3M-parameter network with the same architecture as $\mathcal{S}$ is trained in the latent space defined by the Tadpole encoder to generate realistic 3D TCF fields. During inference, the latent flow matching model generates latent samples, which are subsequently decoded into high-fidelity 3D TCF fields using the Tadpole decoder. We consider three Tadpole-B variants: trained from scratch, fine-tuned with FPFT, and fine-tuned with LoRA-32 on the TCF autoencoding task, as described in [Section 5.2](https://arxiv.org/html/2605.15284#S5.SS2). These Tadpole-based latent generative models are compared with baselines trained to generate samples directly in physical space, without leveraging pre-trained backbones. We use several metrics to evaluate model performance, including $\chi^2_{\mathrm{PQM}}$ (Lemos et al., [2025](https://arxiv.org/html/2605.15284#bib.bib9)), the Wasserstein-1 distance $\mathcal{W}_1$, the Maximum Mean Discrepancy with a Radial Basis Function kernel $\mathrm{MMD}_{\mathrm{RBF}}$ (Gretton et al., [2012](https://arxiv.org/html/2605.15284#bib.bib8)), and the NRMSE of the mean and standard deviation of the distribution. Details of these metrics can be found in [Section D.2](https://arxiv.org/html/2605.15284#A4.SS2). We also evaluate the relative sample generation time (Rel. Time), normalized w.r.t. the best-performing model.
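The latent flow matching setup described above can be sketched with the commonly used linear-interpolation variant of flow matching. This is an assumption: the exact formulation used for Tadpole may differ, the names `training_pair`, `euler_sample`, and `oracle_velocity` are hypothetical, and the analytic velocity field below merely stands in for the trained 12.3M-parameter network.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def training_pair(x1: Vector, t: float, rng: random.Random) -> Tuple[Vector, Vector]:
    """Build (x_t, target velocity) for one latent sample x1 at time t in [0, 1].

    x0 is Gaussian noise, x_t = (1 - t) * x0 + t * x1, and the regression
    target for the velocity network is x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def euler_sample(velocity: Callable[[Vector, float], Vector], x0: Vector, steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with explicit Euler."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Stand-in for a trained network: the exact conditional field towards one latent.
target_latent = [0.5, -1.0, 2.0]


def oracle_velocity(x: Vector, t: float) -> Vector:
    return [(b - a) / (1.0 - t) for a, b in zip(x, target_latent)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target_latent]
sample = euler_sample(oracle_velocity, noise, steps=10)
print(sample)
```

In the setting described in the text, the vector returned by the sampler would be a latent code that the Tadpole decoder maps back to a 3D TCF field.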
Fine-tuned Tadpole substantially improves generative modeling performance, as shown in [Table 1](https://arxiv.org/html/2605.15284#S5.T1). The latent generative model built upon the FPFT Tadpole achieves the best performance across all metrics. The LoRA-32 variant is the second-best model except for the $\mathcal{W}_1$ and the Std. metrics, where UNet$_{\mathrm{GenCFD}}$ is better but 183 times slower. These results highlight the advantages and effectiveness of the latent representation learned with Tadpole’s pre-training.
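Two of the distribution metrics reported in Table 1 have compact textbook forms, sketched below for small samples. The biased $\mathrm{MMD}^2$ estimator, the fixed kernel bandwidth `sigma`, and the restriction of $\mathcal{W}_1$ to equally sized 1D samples are assumptions made for brevity; the estimators actually used in the paper are specified in its Section D.2 and may differ.

```python
import math
from typing import List, Sequence


def rbf(x: Sequence[float], y: Sequence[float], sigma: float) -> float:
    """Radial basis function kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_rbf(xs: List[Sequence[float]], ys: List[Sequence[float]], sigma: float = 1.0) -> float:
    """Biased estimate of the squared Maximum Mean Discrepancy between two samples."""
    def mean_kernel(us, vs):
        return sum(rbf(u, v, sigma) for u in us for v in vs) / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def wasserstein1_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equally sized 1D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch requires samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


reference = [[0.0], [1.0]]
generated = [[5.0], [6.0]]
print(mmd2_rbf(reference, reference), mmd2_rbf(reference, generated))
print(wasserstein1_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0]))
```

Both quantities are zero for identical samples and grow as the generated distribution drifts away from the reference, which is why lower values in Table 1 indicate better samples.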
## 6Conclusions, Limitations and Outlook
We have introduced Tadpole, a foundation model for 3D PDEs that leverages an efficient crop\-based training strategy and a novel online pre\-training framework using synthetic data generators\. Tadpole is pre\-trained as a variational autoencoder on a diverse set of 3D PDEs and can be effectively fine\-tuned for various downstream tasks, including autoencoding, dynamics learning, and generative modeling\. Extensive experiments demonstrate Tadpole’s strong zero\-shot reconstruction capabilities and its ability to achieve impressive performance across multiple tasks\.
At the same time, several limitations exist: as the approach focuses on regular grids, unstructured grids are a natural extension. While our work, like other approaches, focuses on short-term rollouts, long-term predictions represent an important challenge for all scientific foundation models. Likewise, despite the high accuracy, even larger Tadpole models should be pre-trained and evaluated. In the future, it will be highly interesting to evaluate Tadpole’s capacity to predict a broader range of physical systems (Ohana et al., [2024](https://arxiv.org/html/2605.15284#bib.bib83)), to couple it with differentiable solvers (List et al., [2025](https://arxiv.org/html/2605.15284#bib.bib79)), and to combine the framework with active learning techniques (Musekamp et al., [2025](https://arxiv.org/html/2605.15284#bib.bib5); Pestourie et al., [2020](https://arxiv.org/html/2605.15284#bib.bib82)).
## Acknowledgements
Qiang Liu acknowledges the support from the China Scholarship Council \(No\.202206120036\) for his Ph\.D research at the Technical University of Munich\. The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center \(NHR@FAU\) of the Friedrich\-Alexander\-Universität Erlangen\-Nürnberg \(FAU\) under the NHR project b278bb\. NHR funding is provided by federal and Bavarian state authorities\. The authors also gratefully acknowledge the computational and data resources as well as the support provided by the Leibniz Supercomputing Centre\. The authors also acknowledge the EuroHPC Joint Undertaking for providing access to the EuroHPC supercomputer LEONARDO, hosted by CINECA \(Italy\) and the LEONARDO consortium\.
## References
- N. Ashton, J. Brandstetter, and S. Mishra (2025). Fluid Intelligence: A Forward Look on AI Foundation Models in Computational Fluid Dynamics. External Links: arXiv:2511.20455, [Link](https://arxiv.org/abs/2511.20455). Cited by: [§1](https://arxiv.org/html/2605.15284#S1.p1.1).
- M. Awais, M. Naseer, S. Khan, R. M. Anwer, H. Cholakkal, M. Shah, M. Yang, and F. S. Khan (2025). Foundation Models Defining a New Era in Vision: A Survey and Outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(4), pp. 2245–2264. External Links: [Document](https://dx.doi.org/10.1109/TPAMI.2024.3506283), [Link](https://ieeexplore.ieee.org/document/10834497/). Cited by: [§1](https://arxiv.org/html/2605.15284#S1.p1.1).
- Y. Cao, Y. Liu, L. Yang, R. Yu, H. Schaeffer, and S. Osher (2025). VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction. External Links: arXiv:2411.16063, [Link](https://arxiv.org/abs/2411.16063). Cited by: [§2](https://arxiv.org/html/2605.15284#S2.p3.1).
- T. Chen, H. Zhou, Y. Li, H. Wang, C. Gao, R. Shi, S. Zhang, and J. Li (2025). OmniArch: Building Foundation Model for Scientific Computing. In International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=UlprLwWYKP). Cited by: [§2](https://arxiv.org/html/2605.15284#S2.p4.1).
- Y. Chen, M. Goldstein, M. Hua, M. S. Albergo, N. M. Boffi, and E. Vanden-Eijnden (2024). Probabilistic Forecasting with Stochastic Interpolants and Föllmer Processes. In Proceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (Eds.), Proceedings of Machine Learning Research, Vol. 235, pp. 6728–6756. External Links: [Link](https://proceedings.mlr.press/v235/chen24n.html). Cited by: [§5.3](https://arxiv.org/html/2605.15284#S5.SS3.p1.4).
- S. M. Cox and P. C. Matthews (2002). Exponential Time Differencing for Stiff Systems. Journal of Computational Physics 176(2), pp. 430–455. External Links: ISSN 0021-9991, [Document](https://dx.doi.org/https%3A//doi.org/10.1006/jcph.2002.6995), [Link](https://www.sciencedirect.com/science/article/pii/S0021999102969950). Cited by: [§B.1.2](https://arxiv.org/html/2605.15284#A2.SS1.SSS2.p4.6), [§B.1.2](https://arxiv.org/html/2605.15284#A2.SS1.SSS2.p5.1), [§3.2](https://arxiv.org/html/2605.15284#S3.SS2.p2.1).
- N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C. Chan, W. Chen, J. Yi, W. Zhao, X. Wang, Z. Liu, H. Zheng, J. Chen, Y. Liu, J. Tang, J. Li, and M. Sun (2023). Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence 5(3), pp. 220–235. External Links: ISSN 2522-5839, [Document](https://dx.doi.org/10.1038/s42256-023-00626-4), [Link](https://doi.org/10.1038/s42256-023-00626-4). Cited by: [§1](https://arxiv.org/html/2605.15284#S1.p4.1).
- P. Esser, R. Rombach, and B. Ommer (2021). Taming Transformers for High-Resolution Image Synthesis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12868–12878. External Links: [Document](https://dx.doi.org/10.1109/CVPR46437.2021.01268), [Link](https://openaccess.thecvf.com/content/CVPR2021/html/Esser_Taming_Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.html). Cited by: [§C.2](https://arxiv.org/html/2605.15284#A3.SS2.p3.4), [§3.1](https://arxiv.org/html/2605.15284#S3.SS1.p1.10).
- A. Franz, H. Wei, L. Guastoni, and N. Thuerey (2026). PICT – A differentiable, GPU-accelerated multi-block PISO solver for simulation-coupled learning tasks in fluid dynamics. Journal of Computational Physics 544, pp. 114433. External Links: ISSN 0021-9991, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jcp.2025.114433), [Link](https://www.sciencedirect.com/science/article/pii/S0021999125007156). Cited by: [§B.3](https://arxiv.org/html/2605.15284#A2.SS3.p4.4).
- A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola (2012). A Kernel Two-Sample Test. J. Mach. Learn. Res. 13, pp. 723–773. External Links: [Link](https://dl.acm.org/doi/10.5555/2503308.2188410), [Document](https://dx.doi.org/10.5555/2503308.2188410). Cited by: [§5.4](https://arxiv.org/html/2605.15284#S5.SS4.p1.4).
- J. Guibas, M. Mardani, Z. Li, A. Tao, A. Anandkumar, and B. Catanzaro (2021). Efficient Token Mixing for Transformers via Adaptive Fourier Neural Operators. In International Conference on Learning Representations. External Links: [Link](https://arxiv.org/abs/2111.13587). Cited by: [§A.6](https://arxiv.org/html/2605.15284#A1.SS6.p1.2).
- Z. Han, C. Gao, J. Liu, J. Zhang, and S. Q. Zhang (2024). Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. Transactions on Machine Learning Research. External Links: ISSN 2835-8856, [Link](https://openreview.net/forum?id=lIsCS8b6zj). Cited by: [§1](https://arxiv.org/html/2605.15284#S1.p4.1).
- Z. Hao, C. Su, S. Liu, J. Berner, C. Ying, H. Su, A. Anandkumar, J. Song, and J. Zhu (2024). DPOT: auto-regressive denoising operator transformer for large-scale PDE pre-training. In International Conference on Machine Learning, ICML’24. External Links: [Link](https://dl.acm.org/doi/10.5555/3692070.3692773). Cited by: [§1](https://arxiv.org/html/2605.15284#S1.p3.1), [§2](https://arxiv.org/html/2605.15284#S2.p1.1), [§5.3](https://arxiv.org/html/2605.15284#S5.SS3.p1.4).
- A. Hemmasian and A. B. Farimani (2024). Pretraining a neural operator in lower dimensions. Transactions on Machine Learning Research. External Links: [Link](https://openreview.net/forum?id=ZewaRoZehI). Cited by: [§2](https://arxiv.org/html/2605.15284#S2.p4.1).
- M. Herde, B. Raonić, T. Rohner, R. Käppeli, R. Molinaro, E. de Bézenac, and S. Mishra (2024). POSEIDON: efficient foundation models for PDEs. In Advances in Neural Information Processing Systems, NeurIPS, Red Hook, NY, USA. External Links: ISBN 9798331314385, [Link](https://dl.acm.org/doi/10.5555/3737916.3740227). Cited by: [§2](https://arxiv.org/html/2605.15284#S2.p1.1).
- S. C. H. Hoi, D. Sahoo, J. Lu, and P. Zhao (2021). Online learning: A comprehensive survey. Neurocomputing 459, pp. 249–289. External Links: ISSN 0925-2312, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.neucom.2021.04.112), [Link](https://www.sciencedirect.com/science/article/pii/S0925231221006706). Cited by: [§3.2](https://arxiv.org/html/2605.15284#S3.SS2.p1.1).
- B. Holzschuh, G. Kohl, F. Redinger, and N. Thuerey (2026). P3D: Scalable Neural Surrogates for High-Resolution 3D Physics Simulations with Global Context. In International Conference on Learning Representations. External Links: arXiv:2509.10186, [Link](https://arxiv.org/abs/2509.10186). Cited by: [Figure 47](https://arxiv.org/html/2605.15284#A3.F47), [Figure 47](https://arxiv.org/html/2605.15284#A3.F47.3.2), [§C.1](https://arxiv.org/html/2605.15284#A3.SS1.p1.2), [§3.4](https://arxiv.org/html/2605.15284#S3.SS4.p1.1), [§5.3](https://arxiv.org/html/2605.15284#S5.SS3.p1.4).
- B. Holzschuh, Q. Liu, G. Kohl, and N. Thuerey (2025). PDE-transformer: efficient and versatile transformers for physics simulations. In International Conference on Machine Learning, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 267, pp. 23562–23602. External Links: [Link](https://proceedings.mlr.press/v267/holzschuh25a.html). Cited by: [§1](https://arxiv.org/html/2605.15284#S1.p4.1), [§2](https://arxiv.org/html/2605.15284#S2.p1.1).
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022). LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=nZeVKeeFYf9). Cited by: [3rd item](https://arxiv.org/html/2605.15284#S1.I1.i3.p1.1), [3rd item](https://arxiv.org/html/2605.15284#S4.I1.i3.p1.13).
- C. Jacobsen, Y. Zhuang, and K. Duraisamy (2025). CoCoGen: Physically Consistent and Conditioned Score-Based Generative Models for Forward and Inverse Problems. SIAM Journal on Scientific Computing 47(2), pp. C399–C425. External Links: [Document](https://dx.doi.org/10.1137/24M1636071), [Link](https://epubs.siam.org/doi/10.1137/24M1636071). Cited by: [§1](https://arxiv.org/html/2605.15284#S1.p5.1).
- D. Jollie, J. Sun, Z. Zhang, and H. Schaeffer (2024). Time-Series Forecasting, Knowledge Distillation, and Refinement within a Multimodal PDE Foundation Model. External Links: arXiv:2409.11609, [Link](https://arxiv.org/abs/2409.11609). Cited by: [§2](https://arxiv.org/html/2605.15284#S2.p2.1).
- A. Kassam and L. N. Trefethen (2005). Fourth-Order Time-Stepping for Stiff PDEs. SIAM Journal on Scientific Computing 26(4), pp. 1214–1233. External Links: [Document](https://dx.doi.org/10.1137/S1064827502410633), [Link](https://doi.org/10.1137/S1064827502410633). Cited by: [§B.1.2](https://arxiv.org/html/2605.15284#A2.SS1.SSS2.p5.1).
- F. Koehler, S. Niedermayr, R. Westermann, and N. Thuerey (2024). APEBench: A Benchmark for Autoregressive Neural Emulators of PDEs. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37, pp. 120252–120310. External Links: [Document](https://dx.doi.org/10.52202/079017-3822), [Link](https://tum-pbs.github.io/apebench-paper/). Cited by: [§B.1.2](https://arxiv.org/html/2605.15284#A2.SS1.SSS2.p5.1), [§3.2](https://arxiv.org/html/2605.15284#S3.SS2.p2.1).
- P. Lemos, S. N. Sharief, N. Malkin, S. Salhi, C. Stone, L. P. Levasseur, and Y. Hezaveh (2025). PQMass: Probabilistic Assessment of the Quality of Generative Models using Probability Mass Estimation. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. External Links: [Link](https://openreview.net/forum?id=n7qGCmluZr). Cited by: [§5.4](https://arxiv.org/html/2605.15284#S5.SS4.p1.4).
- Y. Li, E. Perlman, M. Wan, Y. Yang, C. Meneveau, R. Burns, S. Chen, A. Szalay, and G. Eyink (2008). A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence. Journal of Turbulence 9, pp. N31. External Links: [Document](https://dx.doi.org/10.1080/14685240802376389), [Link](https://doi.org/10.1080/14685240802376389). Cited by: [§B.3](https://arxiv.org/html/2605.15284#A2.SS3.p2.3), [§B.3](https://arxiv.org/html/2605.15284#A2.SS3.p6.4), [§B.3](https://arxiv.org/html/2605.15284#A2.SS3.p7.4).
- J. H. Lim and J. C. Ye (2017). Geometric GAN. External Links: arXiv:1705.02894, [Link](https://arxiv.org/abs/1705.02894). Cited by: [§C.2](https://arxiv.org/html/2605.15284#A3.SS2.p1.4).
- Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2023). Flow Matching for Generative Modeling. In The Eleventh International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=PqvMRDCJT9t). Cited by: [§4](https://arxiv.org/html/2605.15284#S4.p5.2).
- B. List, L. Chen, K. Bali, and N. Thuerey (2025). Differentiability in unrolled training of neural physics simulators on transient dynamics. Computer Methods in Applied Mechanics and Engineering 433, pp. 117441. Cited by: [§6](https://arxiv.org/html/2605.15284#S6.p2.1).
- Q. Liu, F. Koehler, and N. Thuerey (2025a). TorchFSM: Fourier spectral method with PyTorch. External Links: [Document](https://dx.doi.org/10.5281/zenodo.15350210), [Link](https://qiauil.github.io/torchfsm/). Cited by: [§B.1](https://arxiv.org/html/2605.15284#A2.SS1.p3.1).
- Q. Liu and N. Thuerey (2024). Uncertainty-Aware Surrogate Models for Airfoil Flow Simulations with Denoising Diffusion Probabilistic Models. AIAA Journal 62(8), pp. 2912–2933. External Links: [Document](https://dx.doi.org/10.2514/1.J063440), [Link](https://doi.org/10.2514/1.J063440). Cited by: [§1](https://arxiv.org/html/2605.15284#S1.p5.1).
- Y. Liu, J. Sun, and H. Schaeffer (2025b). BCAT: A Block Causal Transformer for PDE Foundation Models for Fluid Dynamics. External Links: arXiv:2501.18972, [Link](https://arxiv.org/abs/2501.18972). Cited by: [§2](https://arxiv.org/html/2605.15284#S2.p3.1).
- Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, F. Wei, and B. Guo (2022). Swin Transformer V2: Scaling Up Capacity and Resolution. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11999–12009. External Links: [Document](https://dx.doi.org/10.1109/CVPR52688.2022.01170), [Link](https://ieeexplore.ieee.org/document/9879380). Cited by: [§A.6](https://arxiv.org/html/2605.15284#A1.SS6.p1.2).
- I. Loshchilov and F. Hutter (2019). Decoupled Weight Decay Regularization. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=Bkg6RiCqY7). Cited by: [§C.3](https://arxiv.org/html/2605.15284#A3.SS3.SSS0.Px1.p1.9).
- S. Mansingh, J. Amarel, R. Arnab, A. Mohan, K. Singh, G. J. Kunde, N. Hengartner, B. Migliori, E. Casleton, N. A. Debardeleben, A. Biswas, D. Oyen, and E. Lawrence (2025). Towards Reasoning for PDE Foundation Models: A Reward-Model-Driven Inference-Time-Scaling Algorithm. External Links: arXiv:2509.02846, [Link](https://arxiv.org/abs/2509.02846). Cited by: [§2](https://arxiv.org/html/2605.15284#S2.p4.1).
- M. McCabe, B. R. Blancard, L. Parker, R. Ohana, M. Cranmer, A. Bietti, M. Eickenberg, S. Golkar, G. Krawezik, F. Lanusse, M. Pettee, T. Tesileanu, K. Cho, and S. Ho (2024). Multiple physics pretraining for spatiotemporal surrogate models. In Advances in Neural Information Processing Systems, NeurIPS, Red Hook, NY, USA. External Links: ISBN 9798331314385, [Link](https://neurips.cc/virtual/2024/poster/96095). Cited by: [§A.6](https://arxiv.org/html/2605.15284#A1.SS6.p1.2), [§1](https://arxiv.org/html/2605.15284#S1.p4.1), [§2](https://arxiv.org/html/2605.15284#S2.p1.1).
- M. McCabe, P. Mukhopadhyay, T. Marwah, B. R. Blancard, F. Rozet, C. Diaconu, L. Meyer, K. W. K. Wong, H. Sotoudeh, A. Bietti, I. Espejo, R. Fear, S. Golkar, T. Hehir, K. Hirashima, G. Krawezik, F. Lanusse, R. Morel, R. Ohana, L. Parker, M. Pettee, J. Shen, K. Cho, M. Cranmer, and S. Ho (2025). Walrus: A Cross-Domain Foundation Model for Continuum Dynamics. External Links: arXiv:2511.15684, [Link](https://arxiv.org/abs/2511.15684). Cited by: [§1](https://arxiv.org/html/2605.15284#S1.p3.1), [§2](https://arxiv.org/html/2605.15284#S2.p1.1), [§5.3](https://arxiv.org/html/2605.15284#S5.SS3.p1.4).
- Y. Meng, J. Huang, Y. Zhang, and J. Han (2022). Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35, pp. 462–477. External Links: [Link](https://dl.acm.org/doi/10.5555/3600270.3600304). Cited by: [§1](https://arxiv.org/html/2605.15284#S1.p4.1).
- L. Meyer, M. Schouler, R. A. Caulk, A. Ribes, and B. Raffin (2023). Training deep surrogate models with large scale online learning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. External Links: [Link](https://dl.acm.org/doi/10.5555/3618408.3619432). Cited by: [§3.2](https://arxiv.org/html/2605.15284#S3.SS2.p1.1).
- T. Miyato and M. Koyama (2018). cGANs with projection discriminator. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=ByS1VpgRZ). Cited by: [§C.2](https://arxiv.org/html/2605.15284#A3.SS2.p1.4).
- R. Morel, J. Han, and E. Oyallon (2025). DISCO: learning to DISCover an evolution operator for multi-physics-agnostic prediction. In International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=6EZ3MDDf6p). Cited by: [§2](https://arxiv.org/html/2605.15284#S2.p4.1).
- D. Musekamp, M. Kalimuthu, D. Holzmüller, M. Takamoto, and M. Niepert (2025). Active Learning for Neural PDE solvers. In The Thirteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=x4ZmQaumRg). Cited by: [§6](https://arxiv.org/html/2605.15284#S6.p2.1).
- D. Myers, R. Mohawesh, V. I. Chellaboina, A. L. Sathvik, P. Venkatesh, Y. Ho, H. Henshaw, M. Alhawawreh, D. Berdik, and Y. Jararweh (2024). Foundation and large language models: fundamentals, challenges, opportunities, and social impacts. Cluster Computing 27(1), pp. 1–26. External Links: [Link](https://link.springer.com/article/10.1007/s10586-023-04203-7). Cited by: [§1](https://arxiv.org/html/2605.15284#S1.p1.1).
- E\. Negrini, Y\. Liu, L\. Yang, S\. J\. Osher, and H\. Schaeffer \(2025\)A Multimodal PDE Foundation Model for Prediction and Scientific Text Descriptions\.External Links:2502\.06026,[Link](https://arxiv.org/abs/2502.06026)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p2.1)\.
- T\. Nguyen, A\. Koneru, S\. Li, and A\. Grover \(2025\)PhysiX: A Foundation Model for Physics Simulations\.External Links:2506\.17774,[Link](https://arxiv.org/abs/2506.17774)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p3.1)\.
- R\. Ohana, M\. McCabe, L\. Meyer, R\. Morel, F\. Agocs, M\. Beneitez, M\. Berger, B\. Burkhart, S\. Dalziel, D\. Fielding,et al\.\(2024\)The well: a large\-scale collection of diverse physics simulations for machine learning\.Advances in Neural Information Processing Systems37,pp\. 44989–45037\.External Links:[Link](https://neurips.cc/virtual/2024/poster/97882)Cited by:[§6](https://arxiv.org/html/2605.15284#S6.p2.1)\.
- R\. Pestourie, Y\. Mroueh, T\. V\. Nguyen, P\. Das, and S\. G\. Johnson \(2020\)Active learning of deep surrogates for PDEs: application to metasurface design\.npj Computational Materials6\(1\),pp\. 164\.External Links:[Link](https://www.nature.com/articles/s41524-020-00431-2)Cited by:[§6](https://arxiv.org/html/2605.15284#S6.p2.1)\.
- M\. A\. Rahman, R\. J\. George, M\. Elleithy, D\. Leibovici, Z\. Li, B\. Bonev, C\. White, J\. Berner, R\. A\. Yeh, J\. Kossaifi,et al\.\(2024\)Pretraining codomain attention neural operators for solving multiphysics pdes\.Advances in Neural Information Processing Systems37,pp\. 104035–104064\.External Links:[Link](https://neurips.cc/virtual/2024/poster/93155)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p4.1)\.
- M\. S\. Rautela, A\. Most, S\. Mansingh, B\. C\. Love, A\. Biswas, D\. Oyen, and E\. Lawrence \(2025\)MORPH: PDE Foundation Models with Arbitrary Data Modality\.External Links:2509\.21670,[Link](https://arxiv.org/abs/2509.21670)Cited by:[§1](https://arxiv.org/html/2605.15284#S1.p3.1),[§1](https://arxiv.org/html/2605.15284#S1.p4.1),[§2](https://arxiv.org/html/2605.15284#S2.p4.1),[§5\.3](https://arxiv.org/html/2605.15284#S5.SS3.p1.4)\.
- F\. Regazzoni, S\. Pagani, M\. Salvador, L\. Dede’, and A\. Quarteroni \(2024\)Learning the intrinsic dynamics of spatio\-temporal processes through Latent Dynamics Networks\.Nature Communications15\(1\),pp\. 1834\.External Links:ISSN 2041\-1723,[Document](https://dx.doi.org/10.1038/s41467-024-45323-x),[Link](https://doi.org/10.1038/s41467-024-45323-x)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p4.1),[§4](https://arxiv.org/html/2605.15284#S4.p2.2)\.
- R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer \(2022\)High\-resolution image synthesis with latent diffusion models\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 10684–10695\.External Links:[Link](https://www.computer.org/csdl/proceedings-article/cvpr/2022/694600k0674/1H1iFsO7Zuw)Cited by:[§3\.1](https://arxiv.org/html/2605.15284#S3.SS1.p1.10),[§4](https://arxiv.org/html/2605.15284#S4.p5.2)\.
- S\. Rühling Cachay, B\. Zhao, H\. Joren, and R\. Yu \(2024\)Dyffusion: A dynamics\-informed diffusion model for spatiotemporal forecasting\.Advances in Neural Information Processing Systems36\.External Links:[Link](https://arxiv.org/abs/2306.01984)Cited by:[§1](https://arxiv.org/html/2605.15284#S1.p5.1)\.
- L\. Serrano, A\. K\. Koupaï, T\. X\. Wang, P\. Erbacher, and P\. Gallinari \(2025\)Zebra: In\-Context Generative Pretraining for Solving Parametric PDEs\.InInternational Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=22kNOkkokU)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p3.1)\.
- J\. Shen, T\. Marwah, and A\. Talwalkar \(2024\)UPS: efficiently building foundation models for PDE solving via cross\-modal adaptation\.Trans\. Mach\. Learn\. Res\.2024\.External Links:[Link](https://openreview.net/forum?id=0r9mhjRv1E)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p2.1)\.
- A\. B\. Siddik, D\. Oyen, A\. Most, M\. Kucer, and A\. Biswas \(2025\)SPUS: A Lightweight and Parameter\-Efficient Foundation Model for PDEs\.External Links:2510\.01370,[Link](https://arxiv.org/abs/2510.01370)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p1.1)\.
- Z\. Song, J\. Yuan, and H\. Yang \(2024\)FMint: Bridging Human Designed and Data Pretrained Models for Differential Equation Foundation Model\.External Links:2404\.14688,[Link](https://arxiv.org/abs/2404.14688)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p3.1)\.
- S\. Subramanian, P\. Harrington, K\. Keutzer, W\. Bhimji, D\. Morozov, M\. W\. Mahoney, and A\. Gholami \(2023\)Towards foundation models for scientific machine learning: characterizing scaling and transfer behavior\.InAdvances in Neural Information Processing Systems,NeurIPS,Red Hook, NY, USA\.External Links:[Link](https://arxiv.org/abs/2306.00258)Cited by:[§1](https://arxiv.org/html/2605.15284#S1.p1.1),[§2](https://arxiv.org/html/2605.15284#S2.p1.1)\.
- J\. Sun, Y\. Liu, Z\. Zhang, and H\. Schaeffer \(2025\)Towards a foundation model for partial differential equations: Multioperator learning and extrapolation\.Phys\. Rev\. E111,pp\. 035304\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevE.111.035304),[Link](https://link.aps.org/doi/10.1103/PhysRevE.111.035304)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p2.1)\.
- T\. Terraz, A\. Ribes, Y\. Fournier, B\. Iooss, and B\. Raffin \(2017\)Melissa: large scale in transit sensitivity analysis avoiding intermediate files\.InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis,SC ’17,New York, NY, USA\.External Links:ISBN 9781450351140,[Link](https://doi.org/10.1145/3126908.3126922),[Document](https://dx.doi.org/10.1145/3126908.3126922)Cited by:[§3\.2](https://arxiv.org/html/2605.15284#S3.SS2.p1.1)\.
- N\. Thuerey, K\. Weissenow, L\. Prantl, and X\. Hu \(2020\)Deep learning methods for reynolds\-averaged navier–stokes simulations of airfoil flows\.AIAA Journal58\(1\),pp\. 25–36\.External Links:[Link](https://ge.in.tum.de/publications/2018-deep-flow-pred/)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p1.1)\.
- A\. Totounferoush, S\. Kotchourko, M\. W\. Mahoney, and S\. Staab \(2025\)Paving the way for scientific foundation models: enhancing generalization and robustness in PDEs with constraint\-aware pre\-training\.External Links:2503\.19081,[Link](https://arxiv.org/abs/2503.19081)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p4.1)\.
- D\. Tran, R\. Ranganath, and D\. M\. Blei \(2017\)Hierarchical implicit models and likelihood\-free variational inference\.InAdvances in Neural Information Processing Systems,NeurIPS,Red Hook, NY, USA,pp\. 5529–5539\.External Links:ISBN 9781510860964Cited by:[§C\.2](https://arxiv.org/html/2605.15284#A3.SS2.p1.4)\.
- F\. Wiesner, M\. Wessling, and S\. Baek \(2025\)Towards a Physics Foundation Model\.External Links:2509\.13805,[Link](https://arxiv.org/abs/2509.13805)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p2.1)\.
- S\. Wiewel, M\. Becher, and N\. Thuerey \(2019\)Latent Space Physics: Towards Learning the Temporal Evolution of Fluid Flow\.Computer Graphics Forum38\(2\),pp\. 71–82\.External Links:[Document](https://dx.doi.org/10.1111/cgf.13620),[Link](https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.13620)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p4.1),[§4](https://arxiv.org/html/2605.15284#S4.p2.2)\.
- Y\. Xin, J\. Yang, S\. Luo, Y\. Du, Q\. Qin, K\. Cen, Y\. He, Z\. Zhang, B\. Fu, X\. Yang, G\. Zhai, M\. Yang, and X\. Liu \(2025\)Parameter\-Efficient Fine\-Tuning for Pre\-Trained Vision Models: A Survey and Benchmark\.External Links:2402\.02242,[Link](https://arxiv.org/abs/2402.02242)Cited by:[§1](https://arxiv.org/html/2605.15284#S1.p4.1)\.
- L\. Yang, S\. Liu, T\. Meng, and S\. J\. Osher \(2023\)In\-context operator learning with data prompts for differential equation problems\.Proceedings of the National Academy of Sciences120\(39\),pp\. e2310142120\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2310142120),[Link](https://www.pnas.org/doi/abs/10.1073/pnas.2310142120)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p3.1)\.
- Z\. Ye, X\. Huang, L\. Chen, H\. Liu, Z\. Wang, and B\. Dong \(2024\)PDEformer: towards a foundation model for one\-dimensional partial differential equations\.InICLR 2024 Workshop on AI4DifferentialEquations In Science,External Links:[Link](https://openreview.net/forum?id=GLDMCwdhTK)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p2.1)\.
- Z\. Ye, Z\. Liu, B\. Wu, H\. Jiang, L\. Chen, M\. Zhang, X\. Huang, Q\. Meng, J\. Zou, H\. Liu, and B\. Dong \(2025\)PDEformer\-2: A Versatile Foundation Model for Two\-Dimensional Partial Differential Equations\.External Links:2507\.15409,[Link](https://arxiv.org/abs/2507.15409)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p2.1)\.
- D\. Zhang, T\. Feng, L\. Xue, Y\. Wang, Y\. Dong, and J\. Tang \(2025a\)Parameter\-Efficient Fine\-Tuning for Foundation Models\.External Links:2501\.13787,[Link](https://arxiv.org/abs/2501.13787)Cited by:[§1](https://arxiv.org/html/2605.15284#S1.p4.1)\.
- H\. Zhang, C\. Kang, Y\. Wang, and D\. Zou \(2025b\)F\-Adapter: Frequency\-Adaptive Parameter\-Efficient Fine\-Tuning in Scientific Machine Learning\.Advances in Neural Information Processing Systems\.External Links:[Link](https://neurips.cc/virtual/2025/poster/119790)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p4.1)\.
- A\. Zhou and A\. B\. Farimani \(2024\)Masked Autoencoders are PDE learners\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=rZNuiFwXVs)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p4.1)\.
- H\. Zhou, Y\. Ma, H\. Wu, H\. Wang, and M\. Long \(2025\)Unisolver: PDE\-conditional transformers are universal neural PDE solvers\.InICLR 2025 Workshop on Foundation Models in the Wild,External Links:[Link](https://openreview.net/forum?id=6HlgUqkjmW)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p2.1)\.
- C\. Zhu, X\. Xu, J\. Han, and J\. Chen \(2025\)Physics\-informed Temporal Alignment for Auto\-regressive PDE foundation models\.InInternational Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=OKDN1Hg3im)Cited by:[§2](https://arxiv.org/html/2605.15284#S2.p4.1)\.
APPENDIX
This appendix provides details and background information for Tadpole on the following topics:
- [Appendix A](https://arxiv.org/html/2605.15284#A1): Additional Analysis, Results, and Visualizations
- [Appendix B](https://arxiv.org/html/2605.15284#A2): Dataset and Online Learning Setups
- [Appendix C](https://arxiv.org/html/2605.15284#A3): Training details and network architectures
- [Appendix D](https://arxiv.org/html/2605.15284#A4): Evaluation metrics
- [Appendix E](https://arxiv.org/html/2605.15284#A5): Nomenclature and abbreviations
## Appendix A Additional Analysis, Results, and Visualizations
### A.1 On the Geometric Complexity of Reconstruction and Dynamics Learning
Below, we explain from a manifold point of view why the reconstruction objective is easier to learn than the dynamics objective.
Let $\mathcal{H}$ denote a high-dimensional function space and $\mathcal{M}\subset\mathcal{H}$ the set of admissible PDE solution states $u$. $\mathcal{M}$ is typically concentrated near a low-dimensional, smooth manifold due to the spatial coupling induced by differential operators. Training with the reconstruction target is equivalent to learning a coordinate chart and its inverse on $\mathcal{M}$:

$$\mathcal{E}:\mathcal{M}\rightarrow\mathbb{R}^{k},\;\mathcal{D}:\mathbb{R}^{k}\rightarrow\mathcal{M},\;\mathcal{D}\circ\mathcal{E}\approx\text{Id}_{\mathcal{M}}, \tag{1}$$

where $\mathbb{R}^{k}$ denotes a $k$-dimensional Euclidean latent space. This learning target is primarily governed by the geometric features, e.g., dimensionality and regularity, of the solution manifold $\mathcal{M}$ rather than by the high-dimensional function space $\mathcal{H}$. In particular, when $\mathcal{M}$ is a smooth manifold, the associated coordinate maps can also be chosen to be smooth, and their neural approximations, i.e., $\mathcal{E}$ and $\mathcal{D}$, can be small and simple.

In contrast, the PDE evolution can be treated as a dynamical system on $\mathcal{M}$:

$$\frac{du}{dt}=F(u), \tag{2}$$

where $F(u)\in T_{u}\mathcal{M}$ is a tangent vector assigned to each state $u\in\mathcal{M}$. Learning a one-step prediction (i.e., $\mathbf{u}_{t}\rightarrow\mathbf{u}_{t+\Delta t}$) corresponds to learning the flow map generated by integrating $F$, which is more challenging for several reasons:
- While reconstruction requires approximating the geometry of $\mathcal{M}$, dynamical prediction also requires learning additional vector fields defined on $\mathcal{M}$ besides the manifold's geometry.
- In addition to approximating the tangent vector $F$ on $\mathcal{M}$, a learned dynamics model must remain stable under perturbations, e.g., it should also learn the surrounding vectors, as shown in [Figure 9](https://arxiv.org/html/2605.15284#A1.F9), to map states in a neighborhood of $\mathcal{M}$ back toward $\mathcal{M}$, thereby preserving the manifold's invariance.
- Although $\mathcal{M}$ can be smooth and low-dimensional, the vector field $F$ and the corresponding surrounding vector fields can still exhibit strong variability, particularly when PDEs have strong nonlinear interactions or multiscale effects.

Figure 9: Illustration of the distinction between learning the solution manifold and learning the induced dynamics. The curve represents a low-dimensional solution manifold $\mathcal{M}$ embedded in a higher-dimensional space. Tangent vectors along the manifold correspond to the intrinsic dynamics $F(u)\in T_{u}\mathcal{M}$, while the surrounding vector field illustrates the additional requirement of maintaining consistency with the manifold by attracting perturbed states back toward $\mathcal{M}$. Reconstruction-based learning only needs to learn the geometry of $\mathcal{M}$, whereas dynamics learning also requires learning both tangent and surrounding vectors.

Consequently, reconstruction-based objectives primarily learn the geometric structure of the solution manifold, whereas dynamical models must additionally resolve the induced flow on this manifold. This distinction helps explain why reconstruction-based training is typically easier and generalizes better than directly learning PDE time-stepping.
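The chart-and-inverse picture of Eq. (1) can be illustrated with a deliberately simple toy, sketched below under strong assumptions: the "solution manifold" is a two-dimensional linear subspace of $\mathbb{R}^{64}$, and the encoder and decoder are linear maps obtained from an SVD rather than neural networks. None of the names below come from the Tadpole code base.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "solution manifold": states u(x) = a*sin(x) + b*cos(x) sampled on n grid points.
# Every such state lies in a 2-dimensional linear subspace of R^n (toy assumption).
n, k = 64, 2
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
coeffs = rng.normal(size=(500, 2))
states = coeffs[:, :1] * np.sin(x) + coeffs[:, 1:] * np.cos(x)  # shape (500, n)

# Fit a linear chart from data: the top-k right singular vectors span the manifold.
_, _, vt = np.linalg.svd(states, full_matrices=False)
basis = vt[:k]  # shape (k, n)


def encode(u):
    """E: M -> R^k, project a state onto the learned coordinates."""
    return u @ basis.T


def decode(z):
    """D: R^k -> M, map latent coordinates back to a state."""
    return z @ basis


def relative_error(u):
    """Relative reconstruction error of D(E(u)) with respect to u."""
    return np.linalg.norm(decode(encode(u)) - u) / np.linalg.norm(u)


on_manifold = 0.3 * np.sin(x) - 1.7 * np.cos(x)  # unseen state that lies on M
off_manifold = rng.normal(size=n)                # generic element of H, not on M

err_on = relative_error(on_manifold)
err_off = relative_error(off_manifold)
print(f"relative error on M:  {err_on:.2e}")
print(f"relative error off M: {err_off:.2e}")
```

In this toy, the composition of decoder and encoder acts as the identity only on the manifold, which is the sense of $\mathcal{D}\circ\mathcal{E}\approx\text{Id}_{\mathcal{M}}$; a generic element of the ambient space is not reconstructed.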
### A.2 Methodological Summary of the Tadpole Approach
The discussion above of reconstruction-based objectives for general representation learning highlights key advantages of the proposed approach.
To summarize, Tadpole distinguishes itself from existing approaches for scientific foundation models in the following ways:
- It focuses on autoencoding as a generalizable, central objective for representation learning.
- Tadpole employs online data generation with a fast, semi-spectral, GPU-based solver, circumventing storage and I/O bottlenecks.
- It comes with a highly flexible architecture, e.g., supporting arbitrary numbers of channels, and temporal dynamics (via learned skip connections) for downstream tasks.
- Tadpole's capabilities are demonstrated for a wide range of downstream applications, from reconstruction through generative modeling to temporal dynamics.

These aspects also provide important distinctions from existing approaches for foundation models, in particular from large language models (LLMs): the representations of LLMs typically emerge implicitly from the next-token prediction task using a pre-determined tokenization codebook. Tadpole instead explicitly optimizes for representation learning via autoencoding. Specifically, we induce meaningful representations by combining the reconstruction task with large streams of synthetic PDE data, such that the resulting latent space captures the low-dimensional solution manifold and transfers to new downstream tasks. This is more closely aligned with the learned latent spaces of imaging and video FMs, which, in contrast, typically focus on perceptually driven latent spaces.
### A.3 Comparison Between Crop-based Inference and Whole-domain Inference
We compare the zero-shot reconstruction RMSE of the Tadpole S-size and B-size models on the TCF dataset for crop-based and whole-domain inference. For the S-size model, whole-domain inference improves the reconstruction error slightly, by 2.8%. For the B-size model, the performance degrades slightly, by 0.8%. [Figure 10](https://arxiv.org/html/2605.15284#A1.F10) shows the corresponding reconstructions with whole-domain and crop-based inference, and [Figure 11](https://arxiv.org/html/2605.15284#A1.F11) shows the corresponding absolute error. The whole-domain approach helps to remove the inconsistency near crop boundaries for the zero-shot S-size model. In general, however, the zero-shot reconstruction RMSE on the TCF dataset is not significantly affected by the inference strategy, highlighting Tadpole's strong generalization across different resolutions.


Figure 10: Visualization of the reconstruction of TCF with crop-based and whole-domain inference for the slices $x=X/2$ (left) and $y=Y/2$ (right).

Figure 11: Visualization of the absolute error for the reconstruction of TCF with crop-based and whole-domain inference for the slices $x=X/2$ (left) and $y=Y/2$ (right).
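The difference between the two inference modes can be sketched as follows. This is a minimal illustration only: the non-overlapping tiling, the helper names, and the mean-removing `toy_model` are assumptions made for the example and do not reflect Tadpole's actual cropping scheme or network.

```python
import numpy as np


def crop_based_inference(volume, model, crop):
    """Apply `model` to non-overlapping cubic crops and stitch the outputs.

    Assumes every axis length of `volume` is divisible by `crop`.
    """
    out = np.empty_like(volume)
    nx, ny, nz = volume.shape
    for i in range(0, nx, crop):
        for j in range(0, ny, crop):
            for k in range(0, nz, crop):
                block = volume[i:i + crop, j:j + crop, k:k + crop]
                out[i:i + crop, j:j + crop, k:k + crop] = model(block)
    return out


def whole_domain_inference(volume, model):
    """Apply `model` once to the full volume."""
    return model(volume)


def toy_model(block):
    """Hypothetical stand-in for an autoencoder: removes the mean of its input,
    so its output depends on how much spatial context it sees."""
    return block - block.mean()


rng = np.random.default_rng(0)
field = rng.normal(size=(8, 8, 16))  # toy shape, far smaller than a real TCF field

cropped = crop_based_inference(field, toy_model, crop=4)
whole = whole_domain_inference(field, toy_model)
identity = crop_based_inference(field, lambda b: b, crop=4)

print("stitching is lossless for an identity model:", np.array_equal(identity, field))
print("max |crop-based - whole-domain|:", np.abs(cropped - whole).max())
```

Any model whose output depends on context beyond a single crop can produce small mismatches at crop boundaries under crop-based inference, which whole-domain inference avoids by construction.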
### A.4 Convergence Curve of FPFT and LoRA Fine-tuning on Dynamics Learning
[Figure 12](https://arxiv.org/html/2605.15284#A1.F12) shows the validation RMSE during FPFT and LoRA fine-tuning. FPFT exhibits many oscillations during training, especially at the start, whereas LoRA fine-tuning remains stable. This highlights LoRA fine-tuning's ability to preserve pre-trained knowledge, thereby avoiding training instabilities.
Figure 12: Validation RMSE of Tadpole fine-tuned with LoRA-32 and FPFT. LoRA yields stable training with reduced errors.
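One reason low-rank adaptation starts from the pre-trained behavior can be seen in the standard LoRA formulation, sketched below for a single linear layer. The layer sizes, the scaling factor `alpha`, and the initialization scale are illustrative assumptions; only the rank of 32 mirrors the "LoRA-32" setting named above, and the sketch is not taken from the Tadpole implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; rank 32 mirrors the "LoRA-32" label, the rest is assumed.
d_in, d_out, rank, alpha = 256, 256, 32, 32.0

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)  # frozen "pre-trained" weight
A = rng.normal(size=(rank, d_in)) * 0.01            # trainable, small random init
B = np.zeros((d_out, rank))                         # trainable, zero init


def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); only A and B would receive gradients."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))
y_base = x @ W.T
y_lora = lora_forward(x, W, A, B, alpha, rank)

full_params = W.size
lora_params = A.size + B.size
print("identical to the pre-trained layer at initialization:", np.allclose(y_base, y_lora))
print(f"trainable parameters: {lora_params} (low-rank update) vs {full_params} (all weights)")
```

Because `B` is initialized to zero, the adapted layer reproduces the pre-trained layer exactly at the first step, and the update is restricted to a rank-32 subspace afterwards.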
### A.5 Ablation Study of Dynamics Learning in Pixel Space
In [Figure 7](https://arxiv.org/html/2605.15284#S5.F7) and [Figure 8](https://arxiv.org/html/2605.15284#S5.F8), we perform ablation studies in spectrum space, showing that the sub-network has little effect on model performance. Here, we provide additional evaluations in pixel space. [Figure 13](https://arxiv.org/html/2605.15284#A1.F13) and [Figure 14](https://arxiv.org/html/2605.15284#A1.F14) compare the model performance with different sub-network sizes, LoRA ranks, and DFT components. Compared to the results in spectrum space, the effect of the sub-network is more evident in pixel space, where increasing the sub-network size clearly decreases the NRMSE of the prediction, although increasing the LoRA rank is more efficient. Dropping the sub-network results in the largest increase in pixel-space error compared to removing other components.
These results suggest that LoRA has a greater impact on performance in spectrum space, while the sub-network has a greater impact in pixel space. This is because LoRA acts on the model backbone, where convolutional and vision transformer layers learn spatial correlations. Improving the backbone therefore yields better-learned spatial patterns of the PDE solutions. The sub-network, in contrast, operates in the latent space, where spatial correlations are more difficult to learn because of the backbone's downsampling layers. It instead improves model performance mainly by combining information across state variables. Thus, although it can improve performance in pixel space, the spatial patterns of the PDE solutions are only slightly affected.
Figure 13: One-step NRMSE of the Tadpole-B model with different sub-network sizes and LoRA ranks. The LoRA rank in particular has a positive effect on performance.
Figure 14: One-step NRMSE of Tadpole-B fine-tuned with different Tadpole-DFT components.
### A.6 Detailed Metric Values of the Main Experiments
In this section, we summarize the detailed metric values from the previous experiments. [Table 2](https://arxiv.org/html/2605.15284#A1.T2) shows the reconstruction NRMSE of Tadpole models on the autoencoding task. Corresponding results are illustrated in [Figure 2](https://arxiv.org/html/2605.15284#S5.F2). [Table 3](https://arxiv.org/html/2605.15284#A1.T3) to [Table 6](https://arxiv.org/html/2605.15284#A1.T6) summarize the $\text{NRMSE}^{ES}$ and NRMSE of different methods in dynamics tasks. Corresponding results are illustrated in [Figure 5](https://arxiv.org/html/2605.15284#S5.F5). Here, we also compare the model's performance with Swin3D (Liu et al., [2022](https://arxiv.org/html/2605.15284#bib.bib13)), AViT (McCabe et al., [2024](https://arxiv.org/html/2605.15284#bib.bib47)), and AFNO (Guibas et al., [2021](https://arxiv.org/html/2605.15284#bib.bib14)), all trained from scratch; Tadpole and the other foundation models outperform them in $\text{NRMSE}^{ES}$.
Table 2: Tadpole's reconstruction NRMSE ($\times 10^{-2}$) on downstream autoencoding tasks. The zero-shot model surpasses a from-scratch model across three datasets, highlighting the efficacy of pre-training. Fine-tuning further enhances the performance.

| | | Iso | TCF | TBL | MHD | Parameters |
|---|---|---|---|---|---|---|
| Zero Shot | S | 6.11 | 9.82 | 5.93 | 7.22 | |
| | B | 3.23 | 7.87 | 1.27 | 3.42 | |
| | L | 2.52 | 7.27 | 1.17 | 2.93 | |
| B | 1 PDE | 112.36 | 14.57 | 18.88 | 6.48 | |
| | 3 PDEs | 3.35 | 8.50 | 1.86 | 3.57 | |
| | Initial | 3.75 | 8.20 | 1.63 | 4.04 | |
| | Local | 4.14 | 7.96 | 1.79 | 4.21 | |
| Scratch | | 7.17 | 4.40 | 1.67 | 8.22 | 38.1M |
| FPFT | B | 2.73 | 2.94 | 0.50 | 2.50 | 38.1M |
| LoRA-32 | | 3.01 | 3.82 | 0.78 | 3.18 | 2.8M |

Table 3: Rollout $\text{NRMSE}^{ES}$ ($\times 10^{-2}$) of different models on the dynamics downstream task for the TBL dataset. Tadpole with the DFT fine-tuning strategy outperforms all other specialized and foundation models.

Table 4: Rollout NRMSE ($\times 10^{-2}$) of different models on the dynamics downstream task for the TBL dataset. Tadpole with the DFT fine-tuning strategy outperforms Walrus in one-step prediction with far fewer trainable parameters.

Table 5: Rollout $\text{NRMSE}^{ES}$ ($\times 10^{-2}$) of different models on the dynamics downstream task for the Iso dataset. Apart from Walrus, which has two orders of magnitude more parameters, Tadpole with the DFT fine-tuning strategy outperforms all other specialized and foundation models.

Table 6: Rollout NRMSE ($\times 10^{-2}$) of different models on the dynamics downstream task for the Iso dataset. Tadpole outperforms Walrus in one-step prediction.

### A.7 Visualizations of Autoencoding Reconstructions
Figures [15](https://arxiv.org/html/2605.15284#A1.F15)–[37](https://arxiv.org/html/2605.15284#A1.F37) present qualitative reconstruction results for different datasets under various training strategies. Due to the high spatial resolution of Iso and MHD, differences in the reconstructions are difficult to discern at lower visualization resolutions. We therefore additionally provide visualizations of $128^{3}$ crops for clearer comparison. Overall, models trained from scratch exhibit higher reconstruction errors than the other variants. The zero-shot model occasionally shows inconsistencies at crop boundaries; however, these artifacts disappear after fine-tuning on the corresponding dataset.
Figure 15: Volume rendering of the reconstructed $1024^{3}$ Iso fields generated by different Tadpole training methods.

Figure 16: Visualization of the reconstruction of Iso at the slice where $x=X/2$.

Figure 17: Visualization of the reconstruction of Iso at the slice where $y=Y/2$.

Figure 18: Visualization of the reconstruction of Iso $128^{3}$ crops at the slice where $x=X/2$.

Figure 19: Visualization of the reconstruction of Iso $128^{3}$ crops at the slice where $y=Y/2$.

Figure 20: Visualization of the absolute error for the reconstruction of Iso $128^{3}$ crops at the slice where $x=X/2$.

Figure 21: Visualization of the absolute error for the reconstruction of Iso $128^{3}$ crops at the slice where $y=Y/2$.

Figure 22: Volume rendering of the reconstructed $96^{2}\times 192$ TCF fields generated by different Tadpole training methods. This visualization confirms that all methods have successfully learned the large-scale structures of the data. Differences become apparent in the following visualizations of individual slices through the volume.

Figure 23: Visualization of the reconstruction of TCF at the slices where $x=X/2$ and $y=Y/2$.

Figure 24: Visualization of the absolute error for the reconstruction of TCF at the slices where $x=X/2$ and $y=Y/2$.

Figure 25: Volume rendering of the reconstructed $512^{3}$ MHD fields generated by different Tadpole training methods.

Figure 26: Visualization of the reconstruction of MHD at the slice where $x=X/2$.

Figure 27: Visualization of the reconstruction of MHD at the slice where $y=Y/2$.

Figure 28: Visualization of the reconstruction of MHD at the slice where $z=Z/2$.

Figure 29: Visualization of the reconstruction of MHD $128^{3}$ crops at the slice where $y=Y/2$.

Figure 30: Visualization of the reconstruction of MHD $128^{3}$ crops at the slice where $z=Z/2$.

Figure 31: Visualization of the absolute error for the reconstruction of MHD $128^{3}$ crops at the slice where $y=Y/2$.

Figure 32: Visualization of the absolute error for the reconstruction of MHD $128^{3}$ crops at the slice where $z=Z/2$.

Figure 33: Volume rendering of the reconstructed $224^{3}$ TBL fields generated by different Tadpole training methods.

Figure 34: Visualization of the reconstruction of TBL at the slice where $x=X/2$.

Figure 35: Visualization of the reconstruction of TBL at the slice where $y=Y/2$.

Figure 36: Visualization of the absolute error for the reconstruction of TBL at the slice where $x=X/2$.

Figure 37: Visualization of the absolute error for the reconstruction of TBL at the slice where $y=Y/2$.
### A.8 Visualizations on Dynamics Learning
The following images provide qualitative samples for the dynamics rollout task over the course of three time steps\. Each image compares the ground truth result \(at the top\) with the three baseline models \(next three rows, top to bottom: Walrus, DPOT, and MORPH\), with the Tadpole\-DFT result shown in the bottom row\.
As indicated by the MSE measurements in the main text, both DPOT and MORPH perform worse. Their outputs are relatively smooth, and both methods show clear patch boundaries induced by the underlying architectures. The MORPH model induces a substantial amount of smoothing in the first step, while DPOT loses detail more gradually. In contrast, both Walrus and Tadpole produce outputs that are hard to distinguish from the ground truth. Importantly, Tadpole achieves this with more than $100\times$ fewer trainable weights.
In addition, the Walrus results feature noticeable noise in the pressure channel (e.g., fourth column in [Figure 40](https://arxiv.org/html/2605.15284#A1.F40)). This is most likely caused by the Walrus model having difficulty adjusting to the different scales of the velocity and pressure channels. The outputs of the Tadpole model are significantly smoother without losing detail.


Figure 38: Visualization of the prediction (left) of Iso and the corresponding absolute error (right) at the first rollout step and the slice where $z=Z/2$.

Figure 39: Visualization of the prediction (left) of Iso and the corresponding absolute error (right) at the second rollout step and the slice where $z=Z/2$.

Figure 40: Visualization of the prediction (left) of Iso and the corresponding absolute error (right) at the third rollout step and the slice where $z=Z/2$.

Figure 41: Visualization of the prediction (left) of TBL and the corresponding absolute error (right) at the first rollout step and the slice where $z=Z/2$.

Figure 42: Visualization of the prediction (left) of TBL and the corresponding absolute error (right) at the second rollout step and the slice where $z=Z/2$.

Figure 43: Visualization of the prediction (left) of TBL and the corresponding absolute error (right) at the third rollout step and the slice where $z=Z/2$.
## Appendix B Dataset and Online Learning Setups
### B.1 Pre-training Dataset
A significant challenge in training 3D physics foundation models is the production, storage, and efficient loading of large-scale spatiotemporal datasets. To illustrate the magnitude of this bottleneck, consider a single trajectory of 100 frames on a $384^{3}$ grid for a vector field with three components. In half-precision floating-point format, a single trajectory requires approximately $30\,\text{GB}$ of storage. Scaling this to a dataset of just 100 trajectories per phenomenon across 10 different physical phenomena results in a storage requirement between $10\,\text{TB}$ and $100\,\text{TB}$.
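The storage estimate above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the paragraph and decimal units (1 GB = $10^{9}$ bytes); the variable names are chosen for illustration.

```python
frames = 100         # frames per trajectory
grid = 384           # grid points per spatial dimension
components = 3       # vector field with three components
bytes_per_value = 2  # half precision (float16)

trajectory_bytes = frames * grid**3 * components * bytes_per_value
trajectory_gb = trajectory_bytes / 1e9

trajectories_per_phenomenon = 100
phenomena = 10
dataset_tb = trajectory_bytes * trajectories_per_phenomenon * phenomena / 1e12

print(f"one trajectory: {trajectory_gb:.1f} GB")
print(f"full dataset:   {dataset_tb:.1f} TB")
```

The result is roughly 34 GB per trajectory and 34 TB for the 1,000-trajectory dataset, consistent with the quoted figures of approximately 30 GB and between 10 TB and 100 TB.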
Beyond storage, the I/O overhead required to shuffle and stream this data to GPUs often exceeds the computational cost of the training step itself\. While many 3D phenomena are computationally expensive to simulate \(requiring at least minutes per frame\), there exists a class of semi\-linear Partial Differential Equations \(PDEs\) that admit fast, stable solutions via spectral methods\. By leveraging efficient spectral solvers on modern GPUs, we decouple the training process from disk I/O\. This procedural, online data generation strategy enables access to a theoretically infinite dataset with great spectral diversity and significantly lower engineering overhead than traditional offline storage\.
This appendix details the mathematical formulation of the governing equations, the spectral solver implementation, and the communication strategy used to bridge the simulation and training environments. The corresponding numerical solvers have been independently released as TorchFSM (Liu et al., [2025a](https://arxiv.org/html/2605.15284#bib.bib1)) and can be used in other physics-based deep learning research.
#### B\.1\.1Overview of Equations and their Configurations
Consider a scaled three-dimensional unit cube $\Omega = (0, L)^3 \subset \mathbb{R}^3$, in which $L$ describes the extent along each dimension. We assume periodic boundary conditions in each direction. On this domain, we want to solve equations of the form

$$\partial_t u = \mathcal{L} u + \mathcal{N}(u), \tag{3}$$

where $\mathcal{L}$ describes a linear differential operator and $\mathcal{N}(\cdot)$ a nonlinear differential operator. Specifically, we are interested in equations in which the order of derivatives in $\mathcal{L}$ is higher than in $\mathcal{N}(\cdot)$. PDEs of this form are also called *semi-linear*. We select a diverse set of equations to cover a wide range of spectral characteristics, including diffusive, dispersive, shock-forming, chaotic, and pattern-forming dynamics. [Table 7](https://arxiv.org/html/2605.15284#A2.T7) summarizes the mathematical formulation of the selected equations. The specific sampling parameters (time steps $\Delta t$, grid resolutions $N$, and warmup periods) are detailed in the subsequent implementation section (see [Table 8](https://arxiv.org/html/2605.15284#A2.T8) and [Table 9](https://arxiv.org/html/2605.15284#A2.T9)).
Table 7: Mathematical formulation of the considered PDEs. We denote scalar fields by $u$ and vector fields by $\mathbf{u}$. Parameters drawn from distributions are sampled per simulator instance, see [Section B.2.1](https://arxiv.org/html/2605.15284#A2.SS2.SSS1).

Diffusion and Hyper-Diffusion Equation
The diffusion equation describes an isotropic smoothing process where the initial energy dissipates over time. The spectrum decays radially as $|\hat{u}(\mathbf{k})| \propto \exp(-\nu \|\mathbf{k}\|_2^2 t)$. To introduce higher-order damping effects often seen in turbulence modeling, we also include the hyper-diffusion equation, which applies the fourth-order bi-Laplacian operator, resulting in a steeper quartic spectral decay $|\hat{u}(\mathbf{k})| \propto \exp(-\zeta \|\mathbf{k}\|_2^4 t)$.
Burgers Equation
We utilize the viscous 3D Burgers’ equation to model shock formation and the transfer of energy from large to small scales\. The solution is a three\-dimensional vector field\. This nonlinear PDE generates sharp gradients, which are essential for training the model to resolve high\-frequency features\.
Korteweg\-de\-Vries \(KdV\) Equation
While classically a 1D equation describing soliton waves, we extend the KdV dynamics to 3D by employing a similar nonlinearity as the Burgers equation and by using an extension of the dispersion effect to higher dimensions\. This system balances nonlinearity with dispersion rather than diffusion\.
Kuramoto\-Sivashinsky \(KS\) Equation
The KS equation is characterized by a negative diffusion term (instability at low wavenumbers) stabilized by hyper-diffusion at high wavenumbers. This balance between energy production, energy transfer through the nonlinearity, and high-frequency dissipation is known for exhibiting spatiotemporal chaos, yielding rich spectra.
Fisher\-KPP Equation
The Fisher\-Kolmogorov\-Petrovsky\-Piscounov \(Fisher\-KPP\) equation combines diffusion with a logistic reaction term\. The polynomial nonlinearity contributes interesting spectral components\.
Swift\-Hohenberg \(SH\) Equation
This equation is a canonical model for pattern formation leading to the emergence of complex spatial structures\. We select parameters to ensure the system remains in the pattern\-forming regime\. These patterns are characterized by spectral richness\.
Table 8: Discretization configurations for each considered PDE system. Note that values differ depending on the resolution $N$ on which the system is simulated. The effective $\Delta t$ realized when recording the trajectory is the one given here multiplied by the Save Frequency, allowing substepping on stiffer configurations. After data recording, each frame is voxel-wise normalized to be within Value Range, which was precomputed based on reasonable limits seen in the data distribution.

Table 9: Configurations for recording trajectories. Warmup is the number of frames, starting from the initial condition, that are discarded *before* recording starts. This ensures we are within the physics state space we deem most interesting. The number of frames recorded per trajectory (after warmup) is given by Trajectory Length. Before moving on to the next simulator configuration, we repeat the recording process Num Runs times. As such, each setup always contributes $30$ frames. We also conducted an ablation with an offline local training dataset of approximately 500 GB. The Offline Train Num Trajectory settings denote how many trajectories are produced for each split of the data in the offline setup.
#### B.1.2 Fast Semi-Linear PDE Solvers in PyTorch
We solve the semi-linear PDEs $\partial_t u = \mathcal{L} u + \mathcal{N}(u)$ using Exponential Time Differencing Runge-Kutta (ETDRK) methods. These schemes are particularly well-suited for stiff PDEs where the linear term $\mathcal{L}$ contains high-order derivatives (e.g., $\Delta^2$). By treating the linear part exactly via an integrating factor, we avoid the severe time-step restrictions typical of explicit schemes.
We discretize the scaled unit cube $\Omega = (0, L)^3$ into $N$ intervals of size $\Delta x$ per dimension (yielding $N^3$ voxels in total). We then consider the left end of each interval a nodal degree of freedom, i.e., in one dimension the grid points are located at positions $\left[0, \Delta x, 2\Delta x, \dots, (N-1)\Delta x\right]^T \in \mathbb{R}^N$. This also means that the left end of the domain is considered a degree of freedom, while the right end is not, which naturally encodes periodic boundary conditions and is a prerequisite for most Fast Fourier Transform implementations. The three-dimensional grid is given by the tensor product of the one-dimensional coordinates.
We denote by $\mathbf{u}_t \in \mathbb{R}^{C \times N \times N \times N}$ a state array at time point $t$ with $C$ channels and a value for each valid degree of freedom. Applying a three-dimensional discrete Fourier transformation $\mathcal{F}$ yields $\hat{\mathbf{u}}_t \in \mathbb{C}^{C \times N \times N \times N}$, which is an array of the same shape but with complex-valued entries. (Footnote 1: Given that all our PDEs are real-valued, one could have used a real-valued FFT that typically halves the number of entries in the last array axis and halves the compute cost. However, for simplicity, we opted for the general FFT since we did not observe significant performance degradations in our online learning setup.)
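As a minimal sketch of the grid and transform conventions described above, the following uses NumPy rather than the paper's PyTorch solvers; the sizes `N = 16`, `C = 3`, and all variable names are illustrative assumptions. It builds the left-endpoint grid and compares the output shapes of the general and the real-valued FFT.

```python
import numpy as np

L, N, C = 1.0, 16, 3
dx = L / N

# Left endpoints only: x = L is omitted because it is the periodic image of x = 0.
x = np.arange(N) * dx

rng = np.random.default_rng(0)
u = rng.standard_normal((C, N, N, N))  # state array with C channels

# General complex FFT over the three spatial axes: same shape, complex entries.
u_hat = np.fft.fftn(u, axes=(1, 2, 3))

# Real-valued FFT: the last axis shrinks to N // 2 + 1 entries.
u_hat_real = np.fft.rfftn(u, axes=(1, 2, 3))

# The inverse transform recovers the real state up to rounding error.
u_back = np.fft.ifftn(u_hat, axes=(1, 2, 3)).real

print(u_hat.shape, u_hat_real.shape)
```

The real-valued transform stores roughly half of the last axis, which is the saving mentioned in the footnote.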
In state space, time integration with a fixed time step size $\Delta t$ can be represented by the time-stepping operator $\mathcal{P}$, which yields

$$\mathbf{u}_{t+\Delta t} = \mathcal{P}(\mathbf{u}_t). \tag{4}$$

Due to the diagonalization of derivative operators in Fourier space, we choose to perform time integration in the spectral domain via

$$\mathbf{u}_{t+\Delta t} = \mathcal{F}^{-1}\left(\hat{\mathcal{P}}\left(\mathcal{F}\left(\mathbf{u}_t\right)\right)\right), \tag{5}$$

in which the spectral time-stepping operator $\hat{\mathcal{P}}(\cdot)$ is implemented via a two-stage process

$$\hat{\mathbf{u}}_{*} = \exp(\hat{\mathcal{L}}\Delta t) \odot \hat{\mathbf{u}}_t + \frac{\exp(\hat{\mathcal{L}}\Delta t) - 1}{\hat{\mathcal{L}}} \odot \hat{\mathcal{N}}(\hat{\mathbf{u}}_t), \tag{6}$$

$$\hat{\mathbf{u}}_{t+\Delta t} = \hat{\mathbf{u}}_{*} + \frac{\exp(\hat{\mathcal{L}}\Delta t) - 1 - \hat{\mathcal{L}}\Delta t}{\hat{\mathcal{L}}^2 \Delta t} \left(\hat{\mathcal{N}}(\hat{\mathbf{u}}_{*}) - \hat{\mathcal{N}}(\hat{\mathbf{u}}_t)\right). \tag{7}$$

Here, $\hat{\mathcal{L}}$ and $\hat{\mathcal{N}}(\cdot)$ are the linear and nonlinear differential operators in Fourier space, respectively. Both can be built using spectral derivative operators. As a property of spectral differentiation, these operators diagonalize, which allows all the operations in [Equations 6](https://arxiv.org/html/2605.15284#A2.E6) and [7](https://arxiv.org/html/2605.15284#A2.E7) to be evaluated pointwise. This includes an elementwise exponentiation that is at the core of the ETDRK methods. For the nonlinear operator in Fourier space $\hat{\mathcal{N}}(\cdot)$, we use a pseudo-spectral evaluation strategy: transform to state space, evaluate the nonlinearity there, and transform the result back into Fourier space. We supplement this with appropriate anti-aliasing. The time integration procedure of [Equations 6](https://arxiv.org/html/2605.15284#A2.E6) and [7](https://arxiv.org/html/2605.15284#A2.E7) is second-order accurate. We use this ETDRK2 method for most equations because it offers the best compromise between speed, accuracy, and stability in single precision across our tested systems. Only for the KdV equation do we use the fourth-order ETDRK version (Cox and Matthews, [2002](https://arxiv.org/html/2605.15284#bib.bib35)).
Since these methods are based on array computations, they operate efficiently within modern acceleration frameworks like PyTorch and can therefore be easily ported to GPUs. For more details on Fourier pseudo-spectral ETDRK methods, their implementation, and a discussion of their limitations, we refer to (Cox and Matthews, [2002](https://arxiv.org/html/2605.15284#bib.bib35); Kassam and Trefethen, [2005](https://arxiv.org/html/2605.15284#bib.bib36); Koehler et al., [2024](https://arxiv.org/html/2605.15284#bib.bib76)).
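The two-stage ETDRK2 update of Equations 6 and 7 can be sketched in a few lines. This is a NumPy illustration, not the TorchFSM implementation. The example equation ($\partial_t u = \nu \Delta u + r\,u(1-u)$, a common textbook form of Fisher-KPP), the 2/3-rule dealiasing mask, the parameter values, and all function names are assumptions made for this sketch; the text only states that "appropriate anti-aliasing" is used.

```python
import numpy as np

def etdrk2_coefficients(l_hat, dt):
    """Pointwise ETDRK2 coefficients; analytic limits are used where l_hat == 0."""
    z = l_hat * dt
    exp_z = np.exp(z)
    z_safe = np.where(z == 0, 1.0, z)
    # (exp(L dt) - 1) / L, with limit dt as L -> 0
    c1 = np.where(z == 0, dt, dt * (exp_z - 1.0) / z_safe)
    # (exp(L dt) - 1 - L dt) / (L^2 dt), with limit dt / 2 as L -> 0
    c2 = np.where(z == 0, 0.5 * dt, dt * (exp_z - 1.0 - z) / z_safe**2)
    # Note: these naive formulas lose accuracy for very small |z|;
    # Kassam and Trefethen (2005) discuss a more robust evaluation.
    return exp_z, c1, c2

def etdrk2_step(u_hat, nonlinear_hat, exp_z, c1, c2):
    """One step in Fourier space, following Equations 6 and 7."""
    n_t = nonlinear_hat(u_hat)
    u_star = exp_z * u_hat + c1 * n_t
    return u_star + c2 * (nonlinear_hat(u_star) - n_t)

def make_fisher_kpp(n, length, nu, r):
    """Spectral operators for the assumed form u_t = nu * Lap(u) + r * u * (1 - u)."""
    k1d = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    kx, ky, kz = np.meshgrid(k1d, k1d, k1d, indexing="ij")
    l_hat = -nu * (kx**2 + ky**2 + kz**2)
    # 2/3-rule dealiasing mask (an assumption of this sketch).
    k_cut = (2.0 / 3.0) * np.abs(k1d).max()
    keep = (np.abs(kx) <= k_cut) & (np.abs(ky) <= k_cut) & (np.abs(kz) <= k_cut)

    def nonlinear_hat(u_hat):
        u = np.fft.ifftn(u_hat).real                  # to state space
        return keep * np.fft.fftn(r * u * (1.0 - u))  # evaluate, back to Fourier space

    return l_hat, nonlinear_hat

n, length, dt = 16, 1.0, 1e-3
l_hat, nonlinear_hat = make_fisher_kpp(n, length, nu=0.01, r=1.0)
exp_z, c1, c2 = etdrk2_coefficients(l_hat, dt)

rng = np.random.default_rng(0)
u_hat = np.fft.fftn(0.5 + 0.1 * rng.standard_normal((n, n, n)))
for _ in range(10):
    u_hat = etdrk2_step(u_hat, nonlinear_hat, exp_z, c1, c2)
u = np.fft.ifftn(u_hat).real
```

Because every coefficient is a precomputed array of the same shape as $\hat{\mathbf{u}}_t$, each step costs only a handful of FFTs and elementwise products, which is what makes the scheme a good fit for GPU array frameworks.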
#### B.1.3 Overview of Initializers
For many equations, the long-term behavior throughout the trajectory is influenced by the distribution of the initial state. For Tadpole, we used five initialization routines chosen to ensure a wide range of spectral representations. The primary difference between them is how the spectrum behaves as a function of the wavenumber $\mathbf{k}$. Some of the initialization routines additionally sample hyperparameters according to [Table 10](https://arxiv.org/html/2605.15284#A2.T10).
For most initializers, we apply a post-processing step that clamps the state to the range $(c_{\text{min}}, c_{\text{max}})$. These limits $c_{\text{min}}$ and $c_{\text{max}}$ can either be fixed or randomly drawn. We normalize the initial conditions to ensure that their order of magnitude is around $1$. Note that this normalization is different from the normalization of each recorded frame according to [Table 8](https://arxiv.org/html/2605.15284#A2.T8).
Each PDE system uses a distinct set of initializers and their respective hyperparameters. This enables a sufficiently wide spectral representation in the created frames while ensuring stable time integration. The pairing is listed in [Table 11](https://arxiv.org/html/2605.15284#A2.T11).
In [Figure 44](https://arxiv.org/html/2605.15284#A2.F44), we display 100 samples for each initial condition distribution.
Figure 44: Radially *shell-aggregated* magnitude spectra (100 samples per distribution). Row 1 (GN & TFS): Gaussian (white) noise (GN) exhibits quadratic growth due to the volume of spherical shells in 3D Fourier space. The Truncated Fourier Series (TFS) initializers show distinct spectral cutoffs determined by $k_{\text{limit}}$. Note that TFS-D exhibits vertical variation due to its randomized normalization bounds. Row 2 (DN & DE): Diffused Noise (DN) follows an exponential quadratic decay ($\sim \exp(-\nu \|\mathbf{k}\|_2^2)$), with DN-A appearing to have more spectrally compact samples due to higher diffusivity sampling. Decayed Energy (DE) follows a rough power-law distribution with significant high-frequency contributions. Row 3 (P-TFS): The Poisson (P-TFS) initializers retain the cutoff frequency of the source TFS but exhibit smoother maxima and a steeper decay within the active modes, consistent with the smoothing properties of the inverse Laplacian.

Pixel-Wise Gaussian Noise (GN)
The state array for a single channel $\mathbf{u}_0 \in \mathbb{R}^{1 \times N \times N \times N}$ is built by drawing each degree of freedom value identically and independently from a standard normal distribution

$$u_j \sim \mathcal{N}(0, 1). \tag{8}$$

This corresponds to white noise. We use this initializer for the KS equation, whose long-term behavior is independent of the initial state. For the other PDE systems, this initializer contains too much high-frequency content. We only use it as a starting point to apply spectral modifications.
Truncated Fourier Series \(TFS\)
We generate a spectrally compact IC by filtering white noise in the Fourier domain. Let $\hat{\mathbf{u}} = \mathcal{F}(\mathbf{u}_{\text{noise}})$. We apply a three-dimensional binary mask $M(\mathbf{k})$ which is $1$ if $\mathbf{k} \in [0, k_{\text{limit}}]^3$ and $0$ otherwise:

$$\mathbf{u}_{\text{TFS}} = \mathcal{F}^{-1}(M \odot \hat{\mathbf{u}}). \tag{9}$$

This ensures energy is distributed only among specific low-to-mid frequency modes. Since the mask does not alter the spectrum within its active region, energy is equally distributed among all active modes. We draw the limit $k_{\text{limit}}$ from a discrete uniform distribution with bounds $k_{\text{min}}$ and $k_{\text{max}}$.
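A minimal NumPy sketch of the TFS initializer follows. It is not the paper's code: the mask is applied to the absolute per-axis wavenumber so that the filtered field stays real-valued, which is an assumption about how $\mathbf{k} \in [0, k_{\text{limit}}]^3$ is meant, and the bounds `k_min = 2`, `k_max = 4` are hypothetical placeholders for the values in Table 10.

```python
import numpy as np

def truncated_fourier_series(n, k_limit, rng):
    """Filter white noise with a cubic low-pass mask in Fourier space."""
    noise_hat = np.fft.fftn(rng.standard_normal((n, n, n)))
    k_abs = np.abs(np.fft.fftfreq(n, d=1.0 / n))  # integer wavenumber magnitudes
    keep = k_abs <= k_limit
    mask = keep[:, None, None] & keep[None, :, None] & keep[None, None, :]
    u_tfs = np.fft.ifftn(mask * noise_hat).real
    return u_tfs, mask

rng = np.random.default_rng(0)
k_min, k_max = 2, 4                           # hypothetical bounds
k_limit = int(rng.integers(k_min, k_max + 1))  # discrete uniform, inclusive
u_tfs, mask = truncated_fourier_series(16, k_limit, rng)
```

Any subsequent clamping or normalization of the field, as described for the initializers in general, is left out of this sketch.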
Diffused Noise \(DN\)
To generate smooth fields with physically natural decay, we initialize white noise and integrate the linear diffusion operator $\partial_t u = \nu \Delta u$ for a single time step of size $\Delta t = 1$. The resulting magnitude spectrum follows an exponential quadratic decay:

$$|\hat{u}(\mathbf{k})| \propto \exp(-\nu \|\mathbf{k}\|_2^2). \tag{10}$$

We vary the smoothness of the IC by sampling the diffusivity parameter $\nu$.
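Since diffusion is linear, the single step with $\Delta t = 1$ can be applied exactly in Fourier space. The sketch below illustrates this in NumPy under the assumption that wavenumbers carry the factor $2\pi/L$; the value `nu = 1e-3` and the function name are hypothetical.

```python
import numpy as np

def diffuse_noise(noise, nu, length=1.0):
    """Apply the exact diffusion propagator exp(-nu * |k|^2 * 1) to a noise field."""
    n = noise.shape[0]
    k1d = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    kx, ky, kz = np.meshgrid(k1d, k1d, k1d, indexing="ij")
    decay = np.exp(-nu * (kx**2 + ky**2 + kz**2))
    return np.fft.ifftn(decay * np.fft.fftn(noise)).real

rng = np.random.default_rng(0)
noise = rng.standard_normal((16, 16, 16))
u_dn = diffuse_noise(noise, nu=1e-3)
```

The mean of the field is untouched (the $\mathbf{k} = 0$ mode has a decay factor of one), while all other modes are damped, so larger $\nu$ gives smoother fields.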
Decayed Energy \(DE\)
This initializer creates Gaussian Random Fields (GRF) with a specific power-law spectral density. We explicitly enforce the amplitude of the Fourier modes to follow $|\hat{u}(\mathbf{k})| \propto \|\mathbf{k}\|_2^{\alpha}$, where the exponent $\alpha$ is drawn uniformly from the range $(-5, -2)$, so that the amplitude decays with increasing wavenumber. This directly controls the field's roughness.
Poisson \(P\)
We solve a Poisson equation $\Delta u = -f$, where the source term $f$ is generated via the Truncated Fourier Series (TFS) method described above. In the spectral domain, inversion of the Laplacian acts as a low-pass filter, scaling the spectrum by $\|\mathbf{k}\|_2^{-2}$. This results in smoother initial conditions than the raw TFS source. Since Poisson inversion on periodic boundaries does not move energy between the modes, we retain the characteristics of a compact spectrum (i.e., non-zero energy only in the modes retained by the mask). However, within this active region, the magnitude follows an isotropic polynomial decay. We denote these configurations as P-TFS.
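The spectral Poisson inversion can be sketched as follows. This is a NumPy illustration with hypothetical names; setting the $\mathbf{k} = 0$ mode of the solution to zero (which fixes the otherwise undetermined mean) is an assumption of the sketch, and a generic random field stands in for the TFS source.

```python
import numpy as np

def squared_wavenumbers(n, length=1.0):
    k1d = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    kx, ky, kz = np.meshgrid(k1d, k1d, k1d, indexing="ij")
    return kx**2 + ky**2 + kz**2

def solve_poisson(f, length=1.0):
    """Solve Lap(u) = -f on a periodic cube: u_hat = f_hat / |k|^2."""
    k2 = squared_wavenumbers(f.shape[0], length)
    f_hat = np.fft.fftn(f)
    k2_safe = np.where(k2 == 0, 1.0, k2)
    u_hat = np.where(k2 == 0, 0.0, f_hat / k2_safe)  # zero-mean solution (assumption)
    return np.fft.ifftn(u_hat).real

rng = np.random.default_rng(0)
f = rng.standard_normal((16, 16, 16))  # stand-in for a TFS source field
u_p = solve_poisson(f)

# Check: the spectral Laplacian of u reproduces -f up to the removed mean.
k2 = squared_wavenumbers(16)
lap_u = np.fft.ifftn(-k2 * np.fft.fftn(u_p)).real
```

Each mode is only rescaled by $\|\mathbf{k}\|_2^{-2}$, so modes that are zero in the source remain zero in the solution, which is why the compact support of the TFS spectrum is preserved.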
Table 10: These hyperparameter configurations for the initialization schemes are used to instantiate the initial condition distribution. Normalization bounds are used to clamp the initial condition and to keep their order of magnitude consistent.

Table 11: To achieve maximal spectral diversity, we pair specific initializers and certain PDE systems. See [Table 10](https://arxiv.org/html/2605.15284#A2.T10) for the specifics of each initialization configuration.
#### B.1.4 Spectral Distributions and the Beneficial Effect of Cropping
The combinations of partial differential equations, initialization distributions, and integration horizons were specifically chosen to expose the foundation model to a wide range of plausible physics states. To confirm this, we scraped 100,000 samples from a simulation server and analyzed their spectra. This equals $\approx 3.5\,\text{TB}$ of full-resolution simulation data, which is approximately a tenth of what the Tadpole B-size model was exposed to during its pre-training.
The analysis is based on the random $64^3$ crops of data used to train the model. Since those crops are no longer ensured to be periodic, we first apply a Hann window per dimension

$$w(n) = 0.5 - 0.5 \cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \leq n \leq N-1, \tag{11}$$

before applying the Fourier transform, where $N$ here denotes the crop edge length. While this tapers the cropped field toward its boundaries, it suppresses the spectral leakage that the non-periodic crop edges would otherwise introduce. We radially aggregate the magnitude of the Fourier coefficients. For example, bin $3$ contains the sum of all magnitudes of Fourier coefficients such that $\|\mathbf{k}\|_2 \in [2.5, 3.5)$. The results of this, separated by PDE, resolution, and initializer, are presented in [Figure 46](https://arxiv.org/html/2605.15284#A2.F46). We also show an aggregated version across all PDEs and simulation resolutions in [Figure 45](https://arxiv.org/html/2605.15284#A2.F45).
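The windowing and shell aggregation described above can be sketched in NumPy as follows. The function name and the small `16^3` test crop are illustrative assumptions; `np.hanning` implements the symmetric Hann window of Equation 11.

```python
import numpy as np

def shell_spectrum(crop):
    """Radially shell-aggregated magnitude spectrum of a cubic, non-periodic crop."""
    n = crop.shape[0]
    w = np.hanning(n)  # 0.5 - 0.5 * cos(2 * pi * m / (n - 1)), m = 0..n-1
    window = w[:, None, None] * w[None, :, None] * w[None, None, :]
    magnitude = np.abs(np.fft.fftn(crop * window))

    k1d = np.fft.fftfreq(n, d=1.0 / n)  # integer wavenumbers
    kx, ky, kz = np.meshgrid(k1d, k1d, k1d, indexing="ij")
    k_norm = np.sqrt(kx**2 + ky**2 + kz**2)

    # Bin b collects all modes with k_norm in [b - 0.5, b + 0.5).
    bins = np.floor(k_norm + 0.5).astype(int)
    return np.bincount(bins.ravel(), weights=magnitude.ravel())

rng = np.random.default_rng(0)
crop = rng.standard_normal((16, 16, 16))
spectrum = shell_spectrum(crop)
```

Because the bins sum magnitudes over whole shells, even a flat (white-noise) spectrum grows with the shell index, which matches the quadratic growth noted for Gaussian noise in Figure 44.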
We see that the linear equations (Diffusion & Hyper-Diffusion) retain the characteristic spectra of their respective initializers, but also further diffuse them. The Burgers equation develops noticeable shocks, as evidenced by a richer spectral content than in the initialization. Also, the pattern-forming Fisher-KPP and Swift-Hohenberg equations display rich spectral content characteristic of their polynomial nonlinearities. As expected, the Kuramoto-Sivashinsky equation in its chaotic state attains a characteristic spectrum that depends on the domain extent $L$.
Interestingly, we see that using different simulation resolutions with consistently sized crops exposes the model to the same phenomena at different scales. This is most noticeable for the KS equation, which develops the same spectrum, independent of the resolution. However, because the crops in the high-resolution simulation occupy a smaller physical space, the model sees a smoother version of that same pattern. We hypothesize that this relationship of similar patterns across different resolutions enables the model to fundamentally understand and be applied at varying resolutions.
Figure 45: This collection of all *shell-aggregated* spectra across 100,000 samples (about one tenth of the pre-training amount) from the simulation server highlights the diversity of states exposed to the foundational pre-training of Tadpole.

Figure 46: Distribution of Fourier coefficient magnitude across radially aggregated bins for a spectral analysis of the $64^3$ training crops based on 100,000 samples from the simulation server (about 10% of the amount of data used for Tadpole B-size pre-training). Each row represents a different PDE (according to [Table 7](https://arxiv.org/html/2605.15284#A2.T7)) and each column a different simulation resolution $\{64, 128, 256, 384\}$. Colors indicate different initial condition distributions, see [Table 10](https://arxiv.org/html/2605.15284#A2.T10) and [Table 11](https://arxiv.org/html/2605.15284#A2.T11). The states produced cover a large range of plausible physics spectra.
### B.2 Online Learning Framework
#### B.2.1 Sampling Strategies and Data Pipelines
Since the pre-training objective relies on reconstructing physical fields across a vast manifold of plausible physics states, we devised a procedural sampling strategy that maximizes the diversity of states $\mathbf{u}_t$ while maintaining a balanced distribution of physical phenomena. We employ a decoupled client-server architecture: a dedicated simulation server continuously synthesizes physical trajectories and pushes individual frames to an asynchronous First-In-First-Out (FIFO) queue. The training clients consume data from this queue, ensuring that the computationally expensive simulation steps do not bottleneck the GPU training throughput. For a detailed breakdown of the communication protocol, see [Section B.2.2](https://arxiv.org/html/2605.15284#A2.SS2.SSS2). The procedural generation logic is formalized in [Algorithm 1](https://arxiv.org/html/2605.15284#alg1) and detailed below.
**Continuous Generation Cycle.** The simulation server operates in an infinite loop, cycling through the set of available PDE systems (e.g., Burgers, Swift-Hohenberg) and the set of resolutions $\mathcal{N} = \{64, 128, 256, 384\}$ in random order. By systematically varying the simulation resolution $N$, we ensure that the training crops expose the model to spectral features at different scales, mimicking the multi-scale nature of downstream tasks.
**Throughput Standardization and Channel Batching.** To maintain consistent tensor shapes and maximize computational efficiency, we standardize the generation process around three-channel inputs.
- **Vector Fields:** Systems such as the 3D Burgers equation naturally possess three channels ($C = 3$). These are integrated as a single dependent system.
- **Scalar Fields:** For single-component systems ($C = 1$, e.g., Diffusion, Fisher-KPP), we instantiate three independent initial conditions in parallel. These are batched along the channel dimension to form a pseudo-3-channel tensor, $\mathbf{u}_t \in \mathbb{R}^{3 \times N \times N \times N}$.

Upon completion of a time step, the channels are decoupled and pushed to the queue individually. *Exception:* The Kuramoto-Sivashinsky (KS) equation is integrated as a single channel without parallel batching. Consequently, the KS equation contributes proportionally less data volume (approx. 1/3) compared to other phenomena.
**Transient Dynamics (Physics Warmup).** For some of the PDE systems, we only produce frames within the physically most meaningful regime. In the case of the Burgers equation, this would be in the shock formation and propagation phase. For the KS equation, this is within the chaotic attractor. To ensure we produce samples in these parts of the physics state space, we implement a *physics warmup* phase. For every trajectory, the simulator integrates for $W$ steps (see [Table 9](https://arxiv.org/html/2605.15284#A2.T9)), which are strictly discarded. Data recording commences only after this period.
**Queue Pre-filling (Server Warmup).** Distinct from the physics warmup, we employ a *server warmup* strategy to ensure the initial data distribution is sufficiently diverse. We define a server round counter $r$ and a target threshold $R$. During the startup phase ($r < R$), we artificially truncate the recorded trajectory length by an early-stop ratio $\xi = \min(r/R, 1)$. This increases the turnover rate of simulators, rapidly filling the buffer with diverse physics states.
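The effect of the early-stop ratio on the number of integration steps, $T_{total} = (T \cdot \text{SaveFreq} \cdot \xi) + W$ as used in Algorithm 1, can be illustrated as follows. The function name, the example values for $T$, the save frequency, and $W$, and the choice to round the truncated length down to an integer are assumptions of this sketch; the text does not specify the rounding.

```python
def total_steps(r, R, T, save_freq, W):
    """Integration steps for one run: truncated recording length plus physics warmup."""
    # xi = min(r / R, 1), evaluated in integer arithmetic and rounded down (assumption).
    recorded = (T * save_freq * min(r, R)) // R
    return recorded + W

R = 10                       # server warmup rounds, as in Algorithm 1
T, save_freq, W = 10, 2, 5   # hypothetical trajectory settings

schedule = [total_steps(r, R, T, save_freq, W) for r in range(1, 13)]
print(schedule)
```

In early rounds the recorded portion is short, so simulators are replaced quickly; from round $R$ onward every run records its full length.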
**Trajectory Balancing and Re-initialization.** Regardless of the underlying physics, each simulator runs for a specific number of repetitions (Num Runs, see [Table 9](https://arxiv.org/html/2605.15284#A2.T9)) such that it contributes exactly $30$ distinct time steps to the dataset before being discarded. Once this quota is met, the simulator is torn down and a new system is instantiated with fresh constitutive parameters and initial conditions.
**Spatial Subsampling and Normalization.** To optimize bandwidth, we extract random spatial crops $\mathbf{C}$ of size $H' \times H' \times H'$ (where $H' = 96$). If the native simulation resolution is $N = 64$, the full frame is transmitted. Finally, to stabilize the input distribution, each crop is clamped to the precomputed value ranges defined in [Table 8](https://arxiv.org/html/2605.15284#A2.T8).
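A minimal NumPy sketch of the transport crop and clamping step is given below. The function name and the value range `(-1, 1)` are hypothetical (the actual ranges are those of Table 8), and the crop is assumed not to wrap around the periodic boundary, which the text does not specify.

```python
import numpy as np

def transport_crop(frame, crop_size, value_range, rng):
    """Random cubic crop (if the frame is larger than crop_size), then clamp."""
    n = frame.shape[0]
    if n > crop_size:
        i, j, k = rng.integers(0, n - crop_size + 1, size=3)
        frame = frame[i:i + crop_size, j:j + crop_size, k:k + crop_size]
    low, high = value_range
    return np.clip(frame, low, high)

rng = np.random.default_rng(0)
value_range = (-1.0, 1.0)  # hypothetical limits

large = rng.standard_normal((128, 128, 128)).astype(np.float32)
small = rng.standard_normal((64, 64, 64)).astype(np.float32)

crop_large = transport_crop(large, 96, value_range, rng)  # cropped to 96^3
crop_small = transport_crop(small, 96, value_range, rng)  # full 64^3 frame kept
```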
**Fault Tolerance and Checkpointing.** To support long-running training jobs on preemptible clusters, the simulation server is designed to be fully checkpointable. We serialize the active simulator configurations. This ensures that training can be paused and resumed deterministically without altering the data distribution or repeating sequences.
Algorithm 1: Procedural Data Generation Loop

Require: Set of equations $\mathcal{E}$, set of resolutions $\mathcal{N} = \{64, 128, 256, 384\}$

Require: Server warmup rounds $R = 10$, crop size $H' = 96$, server queue $\mathcal{Q}$

- Set simulation round counter $r = 0$
- **while** server running **do**
  - $r \leftarrow r + 1$
  - **for** $\{E, N\}$ in all combinations of $\mathcal{E} \times \mathcal{N}$ **do**
    - // 1. Simulator Configuration
    - Sample constitutive parameters (e.g., $\nu, \zeta$) per [Table 7](https://arxiv.org/html/2605.15284#A2.T7)
    - Retrieve discretization settings ($\Delta t$, save-freq, value-range) per [Table 8](https://arxiv.org/html/2605.15284#A2.T8)
    - Retrieve trajectory settings (physics-warmup $W$, length $T$, num-runs) per [Table 9](https://arxiv.org/html/2605.15284#A2.T9)
    - Instantiate ETDRK time-stepper $\mathcal{P}$
    - **for** $1 : \text{num-runs}$ **do**
      - // Initialize States
      - **for** $b = 1 : 3$ **do** {For the KS equation, only $b = 1$ is executed}
        - Sample IC type $\mathcal{I}$ and hyper-parameters per [Table 11](https://arxiv.org/html/2605.15284#A2.T11)
        - Generate $\mathbf{u}_0^{(b)} \sim \mathcal{I}$ and apply IC normalization
      - **end for**
      - Stack initial states: $\mathbf{u}_0 \leftarrow \text{Concat}(\mathbf{u}_0^{(1)}, \dots, \mathbf{u}_0^{(3)})$
      - // Time Integration
      - Reset simulator with $\mathbf{u}_0$
      - Compute server early-stop ratio $\xi = \min(r/R, 1)$
      - $T_{total} \leftarrow (T \cdot \text{SaveFreq} \cdot \xi) + W$
      - **for** $1 : T_{total}$ **do**
        - $\mathbf{u}_{t+\Delta t} \leftarrow \mathcal{P}(\mathbf{u}_t)$ {Step via ETDRK ([Equations 6](https://arxiv.org/html/2605.15284#A2.E6) and [7](https://arxiv.org/html/2605.15284#A2.E7))}
        - **if** $t > W$ and $(t - W) \bmod \text{SaveFreq} == 0$ **then**
          - **for** $c = 1 : 3$ **do**
            - // Post-processing & Transport
            - $\mathbf{C} \leftarrow (N > H') \;?\; \text{RandCrop}(\mathbf{u}_{t+\Delta t}[c], H') \;:\; \mathbf{u}_{t+\Delta t}[c]$
            - Clamp $\mathbf{C}$ to limits per [Table 8](https://arxiv.org/html/2605.15284#A2.T8)
            - Push $\mathbf{C}$ to $\mathcal{Q}$
          - **end for**
        - **end if**
      - **end for**
    - **end for**
  - **end for**
- **end while**
#### B.2.2 Buffer and Communication Strategy
To bridge the gap between high\-speed numerical solvers and deep learning frameworks, we implemented a custom asynchronous data loading pipeline\. This system abstracts the continuous simulation stream into a PyTorch\-compatible dataset, allowing the training loop \(managed via PyTorch Lightning\) to interface with the procedural generators as if they were a standard static dataset\.
**Architecture and Communication.** The pipeline operates on a Producer-Consumer model implemented via `torch.multiprocessing`. A dedicated subset of GPU resources is assigned to the *Producer* role (simulation), while the remaining GPUs function as *Consumers* (training). Communication between these processes is handled via asynchronous thread-safe queues. Our approach bypasses the file system entirely. Data is passed directly through shared memory or TCP sockets (depending on the node topology), eliminating disk I/O latency. To further reduce the communication overhead, we employ a “transport crop” strategy: simulation frames are cropped to an intermediate size of $H'_{X,Y,Z} = 96^3$ before transmission, as outlined in [Section B.2.1](https://arxiv.org/html/2605.15284#A2.SS2.SSS1). This significantly reduces payload size compared to full-resolution grids while remaining larger than the final training crop, enabling further data augmentation/cropping on the consumer side.
**Multi-Stage Buffering and Latency Hiding.** Although spectral solvers are computationally efficient, network latency and synchronization overheads can cause pipeline stalls. To mitigate this, we employ a hierarchical buffering strategy:
1. **Transmission Queue (FIFO):** The simulation server pushes completed transport samples into a finite-sized First-In-First-Out buffer. If this buffer fills up, the simulator pauses, preventing memory overflows. From this queue, data is sent to all participating training GPUs in a round-robin fashion.
2. **Local Staging Buffer (FIFO):** Each training GPU maintains an incoming “mailbox” queue. New frames are received here before being processed for the training cache.
3. **Consumer Cache (MFU):** On the training side, frames are moved from the staging buffer into a larger local cache governed by a Most-Frequently-Used (MFU) replacement policy. Background threads continuously replenish this cache.
The training loop samples batches from the Consumer Cache rather than the stream directly\. This decouples the training step time from the simulation step time\. Consequently, even if the simulator throughput fluctuates \(e\.g\., due to varying solver times based on resolution or other overhead\), the trainer always has immediate access to data\.
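The three buffering stages can be sketched with Python's standard library. This is a single-process toy illustration, not the paper's `torch.multiprocessing` pipeline: the class and variable names are hypothetical, the queue and cache sizes are arbitrary, and the MFU policy is interpreted in its usual sense (the entry that has been sampled most often is evicted first), which is an assumption about the authors' implementation.

```python
import queue
import random

class ConsumerCache:
    """Fixed-capacity cache that evicts the most frequently sampled frame."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.frames = []
        self.use_counts = []
        self.rng = random.Random(seed)

    def add(self, frame):
        if len(self.frames) >= self.capacity:
            # MFU eviction: drop the frame with the highest use count.
            victim = max(range(len(self.frames)), key=self.use_counts.__getitem__)
            del self.frames[victim]
            del self.use_counts[victim]
        self.frames.append(frame)
        self.use_counts.append(0)

    def sample(self, batch_size):
        indices = [self.rng.randrange(len(self.frames)) for _ in range(batch_size)]
        for i in indices:
            self.use_counts[i] += 1
        return [self.frames[i] for i in indices]

transmission_queue = queue.Queue(maxsize=4)  # bounded: a full queue stalls the producer
staging_buffer = queue.Queue()               # per-consumer "mailbox"
cache = ConsumerCache(capacity=3)

# Producer side: push frames until the bounded queue is full.
produced = 0
while True:
    try:
        transmission_queue.put_nowait(f"frame-{produced}")
        produced += 1
    except queue.Full:
        break  # a real producer would block here instead of stopping

# Transport and consumer side: drain into the mailbox, then into the cache.
while not transmission_queue.empty():
    staging_buffer.put(transmission_queue.get())
while not staging_buffer.empty():
    cache.add(staging_buffer.get())

batch = cache.sample(batch_size=2)  # the trainer samples from the cache, not the stream
```

Sampling from the cache rather than from the queue is what lets the trainer proceed even when the producer is momentarily slower than the training step.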
**Epoch Definition in Infinite Streams.** In this procedural paradigm, the concept of an “epoch”, traditionally seen as one full pass over a static dataset, becomes ill-defined. We redefine an epoch as a fixed number of samples seen during training, which we set to $13{,}200$.
**Numerical Stability Guardrails.** Given the stochastic initialization of parameters, numerical instabilities (divergences) are rare but possible. To prevent invalid gradients from propagating into the model weights, we implement a strict fail-safe mechanism. Before any frame enters the transmission queue, it is scanned for `NaN` or `Inf` values. If a numerical anomaly is detected:
1. The corrupted trajectory is immediately discarded.
2. The specific simulator instance responsible is reset with a new random seed and parameters.
3. A global error counter is incremented.
If the error counter exceeds a tolerance threshold (set to $10$ events per training run), the entire training process is halted to allow for debugging. This ensures that the model is never exposed to corrupted data and the invalid gradients it would produce.
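The fail-safe described above can be sketched as a small guard object. This is an illustrative NumPy sketch with hypothetical names; discarding the trajectory and resetting the simulator are left to the caller and are indicated only in comments.

```python
import numpy as np

class DivergenceGuard:
    """Rejects frames containing NaN/Inf and halts after too many failures."""

    def __init__(self, max_errors=10):
        self.max_errors = max_errors
        self.error_count = 0

    def admit(self, frame):
        """Return True if the frame may enter the transmission queue."""
        if np.isfinite(frame).all():
            return True
        self.error_count += 1  # global error counter
        if self.error_count > self.max_errors:
            raise RuntimeError("too many diverged simulations; halting training")
        # Caller is expected to discard the trajectory and reset the simulator
        # with a new random seed and parameters.
        return False

guard = DivergenceGuard(max_errors=10)
good = np.zeros((4, 4, 4))
bad = good.copy()
bad[0, 0, 0] = np.nan

admitted_good = guard.admit(good)
admitted_bad = guard.admit(bad)
```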
**Multi-Node and Distributed Training.** Our default configuration utilizes a single node with four GPUs (1 Producer, 3 Consumers). Under PyTorch Lightning’s data-parallel strategy, microbatches are distributed among the consumers, and gradients are synchronized via AllReduce. We also experimented with multi-node configurations by replicating the topology: each node instantiates its own local Producer GPU alongside its local Consumers. This *Local-Producer Strategy* offers a significant bandwidth advantage. By confining the transmission of high-dimensional simulation tensors to the intra-node PCIe bus, we ensure that inter-node interconnects (e.g., InfiniBand) are reserved exclusively for gradient synchronization. This demonstrates the practical scalability of procedural online training, effectively bypassing both disk I/O and inter-node bandwidth bottlenecks.
### B.3 Downstream Datasets
For the downstream tasks, we include 4 challenging datasets\. Details of these datasets are summarized as follows:
Iso contains a direct numerical simulation of homogeneous isotropic turbulence from the Johns Hopkins Turbulence Database (JHTDB) (Li et al., [2008](https://arxiv.org/html/2605.15284#bib.bib7)), in which the statistical properties are invariant under translations and rotations of the coordinate system. We sample 500 frames from the original dataset. In the autoencoding task, random $64^{3}$ crops are generated from the first 420 frames for training, and we select 3 complete $1024^{3}$-resolution frames from the remaining 80 for testing. In the dynamics learning task, random $128^{3}$ crops are likewise generated from the first 420 frames for training, and the testing crops are generated from the remaining 80 frames. Below is a brief summary of the key characteristics of the dataset:
- Spatial resolution: $X=1024, Y=1024, Z=1024$
- Spatial size: $[0,2\pi]\times[0,2\pi]\times[0,2\pi]$
- Reynolds number: $Re=433$
- State variables: x/y/z components of velocity and pressure.
- Time step between stored data: 0.002
- Boundary conditions: periodic
TCF contains 21 simulations with Reynolds numbers ranging from $Re=400$ to $Re=800$, simulated with PICT (Franz et al., [2026](https://arxiv.org/html/2605.15284#bib.bib10)). Each simulation contains 200 snapshots. In the autoencoding task, random $48^{3}$ crops are generated from the first 20 simulations for training, and we select 200 complete $256^{3}$-resolution frames from the remaining simulation for testing. In the dynamics learning task, the latent flow matching models are trained in the latent space of the first 20 simulations. Below is a brief summary of the key characteristics of the dataset:
- Spatial resolution: $X=96, Y=96, Z=192$
- Spatial size: $[-1,1]\times[-1,1]\times[-\pi,\pi]$
- Reynolds number: $Re\in[400,800]$
- State variables: x/y/z components of velocity.
- Time step between stored data: 0.1
- Boundary conditions: periodic (x), wall (y,z)
MHD contains 100 frames sampled from the magnetohydrodynamics turbulence simulation of the JHTDB (Li et al., [2008](https://arxiv.org/html/2605.15284#bib.bib7)). We generate crops of size $512^{3}$ from the original $1024^{3}$ simulations. In the autoencoding task, random $64^{3}$ crops are generated from the first 80 frames for training, and we select complete $512^{3}$-resolution frames from the remaining 20 for testing. Below is a brief summary of the key characteristics of the dataset:
- Spatial resolution: $X=512, Y=512, Z=512$ (cropped from $1024\times 1024\times 1024$)
- Spatial size: $[0,\pi]\times[0,\pi]\times[0,\pi]$ (cropped from $[0,2\pi]\times[0,2\pi]\times[0,2\pi]$)
- Reynolds number: $Re=186$
- State variables: x/y/z components of velocity, pressure, x/y/z components of magnetic field, x/y/z components of vector potential
- Time step between stored data: 0.025
- Boundary conditions: crop
TBL contains 940 frames sampled from the transitional boundary layer simulations of the JHTDB (Li et al., [2008](https://arxiv.org/html/2605.15284#bib.bib7)). We generate crops of size $224^{3}$ from the original $10240\times 1536\times 2048$ simulations. In the autoencoding task, random $32^{3}$ crops are generated from the first 840 frames for training, and we select complete $224^{3}$-resolution frames from the remaining 100 for testing. Below is a brief summary of the key characteristics of the dataset:
- Spatial resolution: $X=224, Y=224, Z=224$ (cropped from $10240\times 1536\times 2048$)
- Spatial size: $[293.2,314.3]\times[0,3.9]\times[0,26.25]$ (cropped from $[0,969.8]\times[0,26.5]\times[0,240]$)
- Reynolds number: $Re=800$
- State variables: x/y/z components of velocity, pressure
- Time step between stored data: 1.25
- Boundary conditions: crop
## Appendix C Training Details and Network Architectures
### C.1 Network Architectures
We build the backbone of Tadpole based on P3D (Holzschuh et al., [2026](https://arxiv.org/html/2605.15284#bib.bib17)). [Figure 47](https://arxiv.org/html/2605.15284#A3.F47) illustrates the architecture of the network, and [Table 12](https://arxiv.org/html/2605.15284#A3.T12) summarizes the hyperparameters used for Tadpole with different sizes. Compared to the original P3D architecture, the embedding layers for PDE parameters and the skip connections are removed, and an additional convolutional layer is appended to the encoder to project its output to the mean and log-variance of the latent distribution. The discriminator network $\mathcal{A}$ adopts the same architecture as the encoder, but removes the final convolutional projection layer. This makes $\mathcal{A}$ a patch-based discriminator, and we utilize the mean of its output as the final belief.
Table 12: Architecture hyperparameters of Tadpole with different sizes. The definition of each hyperparameter can be found in [Figure 47](https://arxiv.org/html/2605.15284#A3.F47).

Figure 47: Network architecture of Tadpole based on P3D (Holzschuh et al., [2026](https://arxiv.org/html/2605.15284#bib.bib17)).

For the sub-network $\mathcal{S}$ used in current experiments, we use a standard encoder-only transformer architecture. [Figure 48](https://arxiv.org/html/2605.15284#A3.F48) illustrates the architecture of the network, and [Table 13](https://arxiv.org/html/2605.15284#A3.T13) summarizes the hyperparameters with different sizes.
Figure 48: Network architecture of $\mathcal{S}$.

Table 13: Architecture hyperparameters of $\mathcal{S}$ with different sizes. The definition of each hyperparameter can be found in [Figure 48](https://arxiv.org/html/2605.15284#A3.F48).
### C.2 Training Objective
The loss function for the VAE consists of three terms: a reconstruction loss, a KL-divergence regularization term, and an adversarial loss term weighted by $\lambda_{\mathcal{A}}$. The discriminator loss function encourages correct classification of real and reconstructed samples. For the adversarial loss, the discriminator $\mathcal{A}$ outputs a scalar score $\mathcal{A}(\mathbf{u}_{t})$ indicating the authenticity of the current state $\mathbf{u}_{t}$. The discriminator is trained using a hinge loss (Lim and Ye, [2017](https://arxiv.org/html/2605.15284#bib.bib41); Tran et al., [2017](https://arxiv.org/html/2605.15284#bib.bib42); Miyato and Koyama, [2018](https://arxiv.org/html/2605.15284#bib.bib43)), and the overall objectives for the VAE and the discriminator are defined as follows:

$$
\begin{split}
\mathcal{L}_{VAE} = \;& \mathbb{E}_{p_{\mathcal{E}}(\mathbf{z}_{t}|\mathbf{u}_{t})}\left[-\log p_{\mathcal{D}}(\mathbf{u}_{t}|\mathbf{z}_{t})\right] \\
& + \lambda_{\text{KL}}\,\text{KL}\left(p_{\mathcal{E}}(\mathbf{z}_{t}|\mathbf{u}_{t})\,\|\,q(\mathbf{z}_{t})\right) \\
& - \lambda_{\mathcal{A}}\,\mathbb{E}_{p_{\mathcal{E}}(\mathbf{z}_{t}|\mathbf{u}_{t})}\left[\mathcal{A}(\mathcal{D}(\mathbf{z}_{t}))\right]
\end{split}
\tag{12}
$$

$$
\mathcal{L}_{Dis} = \mathbb{E}_{p(\mathbf{u}_{t})}\left[\max(0, 1-\mathcal{A}(\mathbf{u}_{t}))\right] + \mathbb{E}_{p_{\mathcal{E}}(\mathbf{z}_{t}|\mathbf{u}_{t})}\left[\max(0, 1+\mathcal{A}(\mathcal{D}(\mathbf{z}_{t})))\right]
\tag{13}
$$
For the KL-divergence term in [Equation 12](https://arxiv.org/html/2605.15284#A3.E12), we set $\lambda_{\text{KL}}=10^{-6}$. For the adversarial loss term, we use a gradient-based scale strategy (Esser et al., [2021](https://arxiv.org/html/2605.15284#bib.bib39)) for $\lambda_{\text{adv}}$ with a maximum scale value of $10^{-4}$. To stabilize training, the discriminator is only trained when the L2 reconstruction loss is below a threshold of 0.001. After training starts, the discriminator's feedback is withheld from Tadpole's training until a 1000-iteration learning rate warm-up stage has finished; $\lambda_{\text{adv}}$ is then warmed up over another 1000 iterations.
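To make the hinge objective of Equation 13 and the adversarial term of Equation 12 concrete, the following sketch evaluates both on batches of discriminator scores. It is an illustration under assumptions: the function names are hypothetical, expectations are replaced by batch means, and `lambda_adv` is passed as a fixed number rather than computed by the gradient-based scaling strategy.

```python
import numpy as np


def discriminator_hinge_loss(scores_real, scores_fake):
    """Batch estimate of Equation 13.

    scores_real: discriminator scores A(u_t) on reference states.
    scores_fake: discriminator scores A(D(z_t)) on reconstructions.
    """
    real_term = np.maximum(0.0, 1.0 - scores_real).mean()
    fake_term = np.maximum(0.0, 1.0 + scores_fake).mean()
    return real_term + fake_term


def adversarial_term(scores_fake, lambda_adv):
    """Batch estimate of the last term of Equation 12 (seen by the autoencoder)."""
    return -lambda_adv * scores_fake.mean()
```

The discriminator loss vanishes once real samples score at least 1 and reconstructions score at most -1, whereas the autoencoder's adversarial term decreases as the scores of its reconstructions increase.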
### C.3 Training Hyperparameters
Table 14: Training hyperparameters of Tadpole.

This section summarizes the training hyperparameters for Tadpole. The primary values are presented in [Table 14](https://arxiv.org/html/2605.15284#A3.T14).
##### Pre\-training:
Pre-training is conducted in bf16-mixed precision using the AdamW optimizer (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.15284#bib.bib16)) with $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a weight decay of $10^{-15}$. A loss-adaptive learning rate scheduler reduces the learning rate by a factor of 0.5 when the training loss decreases by an order of magnitude below the previous threshold. The initial and minimal learning rates are set to $5\times 10^{-5}$ and $5\times 10^{-6}$, respectively. A linear learning rate warm-up is applied during the first 1000 iterations for both Tadpole and the discriminator. The KL-divergence term in [Equation 12](https://arxiv.org/html/2605.15284#A3.E12) is optimized solely by the encoder and becomes increasingly unstable as network size increases. Therefore, the initial learning rate of the Tadpole-L encoder is reduced to $5\times 10^{-6}$. Tadpole models of different sizes are pre-trained for different numbers of iterations, as larger models require more iterations to converge. The training iterations for the S, B, and L-size models are $5.5\times 10^{5}$, $8.25\times 10^{5}$, and $1.7\times 10^{6}$, respectively. The batch size for pre-training is 48, with gradient accumulation employed to reduce VRAM consumption.
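One possible reading of the loss-adaptive scheduler is sketched below. The class name, the initial loss threshold, and the choice to apply at most one reduction per call are assumptions made for illustration; only the factor of 0.5 and the initial and minimal learning rates are taken from the text.

```python
class LossAdaptiveLR:
    """Hypothetical scheduler: halve the learning rate at each decade of loss."""

    def __init__(self, initial_lr=5e-5, min_lr=5e-6, factor=0.5,
                 initial_threshold=1.0):
        self.lr = initial_lr
        self.min_lr = min_lr
        self.factor = factor
        # Assumed starting threshold; the source does not state this value.
        self.threshold = initial_threshold

    def step(self, loss):
        """Update and return the learning rate given the current training loss."""
        if loss < self.threshold / 10.0:
            # The loss dropped an order of magnitude below the previous threshold:
            # reduce the learning rate (never below min_lr) and lower the threshold.
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.threshold /= 10.0
        return self.lr
```

With the stated values, the learning rate reaches its minimum after four reductions, since $5\times 10^{-5}\cdot 0.5^{4}$ is already below $5\times 10^{-6}$.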
##### Downstream autoencoding:
The downstream autoencoding uses the same hyperparameters as pre-training, except the batch size is reduced to 32 and the number of training iterations is set to $1.4\times 10^{4}$.
##### Downstream dynamics:
The same hyperparameter configuration is applied to both the Iso and TCF datasets. Training is performed in bf16-mixed precision using the AdamW optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a weight decay of $10^{-15}$. The learning rate is fixed at $2\times 10^{-4}$. The number of training iterations is $5.6\times 10^{3}$. The batch size is 32, with gradient accumulation used to reduce VRAM consumption.
##### Downstream generative modeling:
Generative modeling training is conducted in bf16-mixed precision using the AdamW optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a weight decay of $10^{-15}$. The learning rate is fixed at $1\times 10^{-4}$. The number of training iterations is $2.64\times 10^{3}$. The batch size is set to 256, as the latent generative model requires less training memory. For models trained in pixel space, gradient accumulation is used to reduce memory consumption.
### C.4 Training Cost
Training foundation models incurs a substantial computational cost. Tadpole training currently involves multiple systems with different hardware configurations. Below, we summarize the training cost of each model in GPU hours. When a model is trained on multiple GPUs in parallel, the total GPU hours are estimated by multiplying the actual training hours by the number of GPUs. Because we typically use a fixed number of training iterations, training costs are similar across downstream datasets, so we report average estimates across datasets and runs. The central cost factor for pre-training is the model size:
- Tadpole S Pre-training: 372 GPU hours with L40S GPUs.
- Tadpole B Pre-training: 620 GPU hours with A100 GPUs.
- Tadpole L Pre-training: 2300 GPU hours with A100 GPUs.
Training via fine\-tuning is substantially faster, but likewise directly scales with model size:
- Tadpole B fine-tuning for autoencoding task (FPFT/Scr.): 40 GPU hours with L40S GPUs
- Tadpole B fine-tuning for autoencoding task (LoRA 32): 36 GPU hours with L40S GPUs
- Tadpole S for dynamics task (Tadpole-DFT): 64 GPU hours with L40S GPUs
- Tadpole B for dynamics task (Tadpole-DFT): 120 GPU hours with L40S GPUs
- Tadpole B for dynamics task (FPFT/Scratch): 110 GPU hours with L40S GPUs
- Tadpole L for dynamics task (Tadpole-DFT): 280 GPU hours with A100 GPUs
- Tadpole latent generative models: 8 GPU hours with L40S GPUs
Nonetheless, even the dynamics task using the L-size model converged in $8\times$ fewer GPU hours than pre-training. The NVIDIA L40S GPUs used above were equipped with 48 GB of RAM, while the A100 GPUs had 40 GB of RAM.
## Appendix D Evaluation Metrics
### D.1 Enstrophy-based Evaluation for Dynamics Learning
Neural networks are known to exhibit spectral bias, favoring the learning of low-frequency components while under-resolving high-frequency structures. This limitation becomes particularly pronounced in autoregressive rollout settings, where prediction errors accumulate over time and manifest as progressive attenuation of high-frequency modes, resulting in overly smooth or blurred solutions. Conventional pixel-space metrics, such as mean squared error, are dominated by large-scale features and may therefore underestimate the degradation of fine-scale structures. To more faithfully assess model performance, especially in long-horizon predictions, we incorporate spectrum-based evaluation metrics that quantify errors across frequency bands. The spectrum-based metrics provide a scale-resolved characterization of model accuracy and explicitly capture the loss of high-frequency content, which is critical in many PDE systems. For this reason, we emphasize the spectral metrics in the current manuscript.
The spectrum-based evaluation for the dynamic rollout test case of the main text is performed as follows: The enstrophy spectrum at wavenumber $k\in\mathbb{R}_{+}$ is given by

$$
S(k)=\sum_{k<|m|\leq k+1}\frac{1}{2}\sum\left(|\widehat{\omega_{x}}(m)|^{2}+|\widehat{\omega_{y}}(m)|^{2}+|\widehat{\omega_{z}}(m)|^{2}\right),
\tag{14}
$$

where $\widehat{\boldsymbol{\omega}_{x,y,z}}(m)$, with $m\in\mathbb{Z}^{3}$, denotes the Fourier coefficients of the vorticity components. To quantify discrepancies between spectra, we compute the NRMSE between the averaged reference spectrum and the averaged spectrum of generated vorticity fields,

$$
\text{NRMSE}^{ES}=\sqrt{\frac{\operatorname{mean}_{k}\bigl((S_{\text{pred}}(k)-S_{\text{ref}}(k))^{2}\bigr)}{\operatorname{mean}_{k}\bigl(S_{\text{ref}}(k)^{2}\bigr)}}
\tag{15}
$$

Since the cropped regions are not periodic, discontinuities at the domain boundaries introduce artifacts in the Fourier transform. To mitigate these effects, we apply a Hann window to smoothly attenuate $\boldsymbol{\omega}$ toward the boundaries prior to computing the Fourier coefficients.
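Equations 14 and 15 can be sketched in NumPy as follows. This is an illustrative reading rather than the evaluation code used for the paper: it assumes a cubic crop with integer wavenumbers, a separable Hann window built from `np.hanning`, an FFT normalized by the number of grid points, and it omits the averaging over frames. The function names are hypothetical.

```python
import numpy as np


def enstrophy_spectrum(omega):
    """Shell-summed enstrophy spectrum of a vorticity crop of shape (3, N, N, N)."""
    n = omega.shape[1]
    # Separable 3D Hann window to attenuate the field toward the crop boundaries.
    w = np.hanning(n)
    window = w[:, None, None] * w[None, :, None] * w[None, None, :]
    omega_hat = np.fft.fftn(omega * window, axes=(1, 2, 3)) / n**3
    # 0.5 * (|w_x|^2 + |w_y|^2 + |w_z|^2) at every wavevector m.
    density = 0.5 * np.sum(np.abs(omega_hat) ** 2, axis=0)
    freqs = np.fft.fftfreq(n, d=1.0 / n)  # integer wavenumbers
    kx, ky, kz = np.meshgrid(freqs, freqs, freqs, indexing="ij")
    k_mag = np.sqrt(kx**2 + ky**2 + kz**2)
    # Shell k collects all m with k < |m| <= k + 1; m = 0 maps to -1 and is dropped.
    shell = np.ceil(k_mag).astype(int) - 1
    valid = shell >= 0
    n_shells = n // 2
    spectrum = np.bincount(shell[valid], weights=density[valid], minlength=n_shells)
    return spectrum[:n_shells]


def spectrum_nrmse(s_pred, s_ref):
    """NRMSE between two spectra, following Equation 15."""
    return np.sqrt(np.mean((s_pred - s_ref) ** 2) / np.mean(s_ref**2))
```

Because the spectrum is quadratic in the vorticity, doubling the amplitude of a field quadruples its spectrum, which gives a quick sanity check for an implementation.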
### D.2 Statistical Evaluation for Generative Modeling
Properly assessing the quality of generative models for scientific data is an open problem. Two central difficulties are that (1) the number of samples in the reference dataset is often small and (2) the dimensionality of the data is very high. The TCF dataset comprises three velocity channels at a spatial discretization of $96\times 96\times 192$ in 3D. When flattened, this corresponds to a vector with approximately 5M dimensions. Sampling from the reference simulation is further complicated because snapshots that are close in time are highly correlated, which can affect statistical evaluations that assume independent samples. To avoid a high auto-correlation of samples from the reference simulations, we take every 10th sample from the TCF dataset for Reynolds numbers in $[400,500,600,700,800]$, which corresponds to a step size $\Delta t=1$. This yields 100 reference samples in total. We generate the same number of samples for each generative model.
To simplify the evaluation process, we split a single high\-resolution sample into multiple low\-dimensional samples\. While this means that information on long\-range correlation and structure is lost, we consider the distributional metrics on the set of low\-dimensional derived samples as a lower bound on the distributional metrics for the high\-resolution data\.
There are many strategies to reduce the high-resolution samples. We choose a crop-based strategy, which partitions the full $3\times 96\times 96\times 192$-sized data into chunks of size $3\times 16\times 16\times 16$. This transforms a single high-resolution sample into 432 smaller samples, each with dimensionality 12,288. The smaller samples are no longer independent; however, we believe that this has a negligible effect on the evaluation. In total, there are 43,200 samples with reduced dimensionality.
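The crop-based partition can be written as a reshape followed by an axis permutation. The sketch below is one way to do it in NumPy and is not taken from the authors' code; the function name is hypothetical.

```python
import numpy as np


def partition_into_chunks(sample, chunk=16):
    """Split a (C, X, Y, Z) sample into non-overlapping (C, chunk, chunk, chunk) blocks."""
    c, x, y, z = sample.shape
    if x % chunk or y % chunk or z % chunk:
        raise ValueError("spatial sizes must be divisible by the chunk size")
    # Split every spatial axis into (number of blocks, position within block).
    blocks = sample.reshape(c, x // chunk, chunk, y // chunk, chunk, z // chunk, chunk)
    # Move the three block indices to the front, keeping channels with each block.
    blocks = blocks.transpose(1, 3, 5, 0, 2, 4, 6)
    return blocks.reshape(-1, c, chunk, chunk, chunk)
```

For a $3\times 96\times 96\times 192$ sample this produces $6\cdot 6\cdot 12=432$ chunks of $3\cdot 16^{3}=12{,}288$ values each, and 100 samples therefore give 43,200 chunks, matching the counts in the text.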
Except for the NRMSE of the mean and standard deviation, we were not able to compute the metrics on the full set of samples or at the full dimensionality due to computation and stability constraints. We denote the maximum number of samples used for computation by $n_{\mathrm{points}}$ and the maximum dimensionality by $d_{\mathrm{max}}$, and select the largest values that fit the computation budget. If $n_{\mathrm{points}}$ is smaller than the dataset size, we randomly sample a subset of size $n_{\mathrm{points}}$ without replacement. If $d_{\mathrm{max}}$ is smaller than the dimensionality of the data, we only use the first $d_{\mathrm{max}}$ dimensions. [Table 15](https://arxiv.org/html/2605.15284#A4.T15) shows the exact values of $n_{\mathrm{points}}$ and $d_{\mathrm{max}}$ used for different distributional metrics.
Table 15: $n_{\mathrm{points}}$ and $d_{\mathrm{max}}$ for different distributional metrics.
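The subsampling rule for $n_{\mathrm{points}}$ and $d_{\mathrm{max}}$ can be sketched as below. The function name, the flattening order, and the fixed seed are assumptions for illustration.

```python
import numpy as np


def subsample(samples, n_points, d_max, seed=0):
    """Limit a sample set to n_points rows and the first d_max flattened dimensions."""
    flat = samples.reshape(samples.shape[0], -1)
    rng = np.random.default_rng(seed)
    if n_points < flat.shape[0]:
        # Random subset of the samples, drawn without replacement.
        idx = rng.choice(flat.shape[0], size=n_points, replace=False)
        flat = flat[idx]
    if d_max < flat.shape[1]:
        # Keep only the first d_max dimensions.
        flat = flat[:, :d_max]
    return flat
```

If either limit is at least as large as the corresponding size of the data, that axis is left unchanged.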
## Appendix E Nomenclature and Abbreviations
### E.1 Nomenclature
- $\mathbb{P}^{i}$: The $i$-th PDE in the PDE family.
- $\mathbf{u}_{t}$: The PDE solution at time step $t$.
- $\mathbf{z}_{t}$: The latent representation of $\mathbf{u}_{t}$.
- $\mathcal{E}$: The encoder network that maps the current state $\mathbf{u}_{t}$ to a latent representation $\mathbf{z}_{t}$.
- $\mathcal{D}$: The decoder network that reconstructs the state from the latent representation $\mathbf{z}_{t}$.
- $\mathcal{A}$: The adversarial network (discriminator) that distinguishes between real and reconstructed states.
- $\mathcal{S}$: The sub-network used for downstream tasks or specific applications.
- $\lambda_{\text{KL}}$: The weight for the KL-divergence term in the loss function.
- $B$: The batch dimension.
- $C$: The number of channels in the input data.
- $X$, $Y$, $Z$: The spatial dimensions (height, width, depth) of the 3D input data.
- $H_{i}$: The final crop size along spatial dimension $i$.
- $H_{i}^{\prime}$: The pre-crop size along spatial dimension $i$.
- $W_{0}$: The pre-trained weights of the Tadpole model.
- $r$: The LoRA rank used in fine-tuning.
- $A, B$: The LoRA adaptation matrices.
- $D$: The dimension of the 3D PDE data, denoted as $D=C\times X\times Y\times Z$.
- $\gamma$: The scale factor for the skip connections.
### E.2 Abbreviations
- PDE: Partial Differential Equation.
- PEFT: Parameter-efficient Fine-tuning.
- NLP: Natural Language Processing.
- CV: Computer Vision.
- FM: Foundation Model.
- FPFT: Full-parameter fine-tuning.
- Iso: Isotropic turbulence.
- TCF: Turbulent Channel Flow.
- MHD: Magnetohydrodynamics.
- TBL: Transitional Boundary Layer.
- LLM: Large Language Models.
- ICL: In Context Learning.
- FIFO: First In First Out.
- MFU: Most Frequently Used.