Breakeven complexity: A new perspective on neural partial differential equation solvers

arXiv cs.LG 05/18/26, 04:00 AM Papers
Summary
This paper introduces breakeven complexity, a metric to determine when neural PDE solvers become cost-effective compared to traditional numerical solvers. The framework uses scaling laws to allocate training budgets and evaluates multiple neural solvers on diverse PDE benchmarks.
arXiv:2605.15399v1 Announce Type: new Abstract: Neural surrogate solvers of partial differential equations (PDEs) promise dramatic speedups over numerical methods, especially in scenarios requiring many solves. However, current accuracy-based evaluations do not fully consider two central issues: (1) neural solvers incur substantial up-front costs for data generation, training, and tuning; and (2) classical solvers can also generate low-fidelity solutions at a sufficiently low simulation cost. To explicitly account for these realities and fully incorporate end-to-end costs, we propose an evaluation framework centered on breakeven complexity, a metric that counts the forward solves before a learned solver is cost-effective relative to an error-equivalent traditional solver. To evaluate this measure, we apply scaling laws to determine how much training budget to allocate to data generation and discuss how to achieve smooth error-matching in diverse settings. We evaluate the breakeven complexity of multiple neural PDE solvers on three PDEs on 2D periodic domains from APEBench and a novel benchmark of flows past multiple obstacles generated by the GPU-native PyFR code. Among other findings, our results suggest that neural PDE solvers become more effective as problems get harder in terms of cost, dimension, rollout, physics regime (e.g. higher Reynolds number), etc.
Original Article
View Cached Full Text
Cached at: 05/18/26, 06:40 AM
# A new perspective on neural partial differential equation solvers
Source: [https://arxiv.org/html/2605.15399](https://arxiv.org/html/2605.15399)
Yijing ZhangUniversity of Wisconsin–Madisonyzhang2637@wisc\.eduNicholas RobertsUniversity of Wisconsin–Madisonnick11roberts@cs\.wisc\.edu Tanya MarwahGoogle DeepMindtmarwah@google\.comMikhail KhodakUniversity of Wisconsin–Madisonkhodak@wisc\.edu

###### Abstract

Neural surrogate solvers of partial differential equations \(PDEs\) promise dramatic speedups over numerical methods, especially in scenarios requiring many solves\. However, current accuracy\-based evaluations do not fully consider two central issues: \(1\) neural solvers incur substantial up\-front costs for data generation, training, and tuning; and \(2\) classical solvers can also generate low\-fidelity solutions at a sufficiently low simulation cost\. To explicitly account for these realities and fully incorporate end\-to\-end costs, we propose an evaluation framework centered onbreakeven complexity,\*\*\*Dataset: https://huggingface\.co/datasets/yijingz/breakeven\_complexity\.a metric that counts the forward solves before a learned solver is cost\-effective relative to an error\-equivalent traditional solver\. To evaluate this measure, we apply scaling laws to determine how much training budget to allocate to data generation and discuss how to achieve smooth error\-matching in diverse settings\. We evaluate the breakeven complexity of multiple neural PDE solvers on three PDEs on 2D periodic domains from APEBench and a novel benchmark of flows past multiple obstacles generated by the GPU\-native PyFR code\. Among other findings, our results suggest that neural PDE solvers becomemore effectiveas problems get harder in terms of cost, dimension, rollout, physics regime \(e\.g\. higher Reynolds number\), etc\.

††footnotetext:Code: https://github\.com/yijingz02/breakeven\_complexity## 1Introduction

PDE simulation is a crucial engineering tool that enabling prediction and analysis of physics systems in settings where experiments are expensive or impractical\[[8](https://arxiv.org/html/2605.15399#bib.bib14),[18](https://arxiv.org/html/2605.15399#bib.bib15)\]\. Recently, advances in machine learning \(ML\) have driven the proliferation of data\-driven surrogate models that can quickly solve transient PDEs using only neural network forward passes\[[19](https://arxiv.org/html/2605.15399#bib.bib8),[21](https://arxiv.org/html/2605.15399#bib.bib22)\]\. This type of acceleration has significant potential impact across numerous large\-scale scientific computing tasks\.

However, while neural PDE surrogates often have fast inference when trained, they require substantial up\-front cost, including data generation, model training and tuning\. Furthermore, neural PDE surrogates are typically low\-fidelity approximations, while classical simulators can trade\-off accuracy and cost via tunable parameters \(grid resolution, number of timesteps, etc\.\), thus also allowing for low\-fidelity simulation at a reduced cost\[[3](https://arxiv.org/html/2605.15399#bib.bib12)\]\. This creates a decision problem that accuracy alone does not address: when is it worthwhile to invest the up\-front training cost of a surrogate, rather than simply running a numerical solver at an appropriately chosen lower fidelity?

In some settings, such real\-time inference, reducing latency is often the main concern, motivating surrogate deployment even with nontrivial training costs so long as the surrogate is faster than a classical solver of similar error\[[30](https://arxiv.org/html/2605.15399#bib.bib9),[3](https://arxiv.org/html/2605.15399#bib.bib12)\]\. We instead are motivated by cases where repeated forward PDE solves are required to fit latent quantities or optimize objectives, as in PDE\-constrained inverse, control, and design problems\[[4](https://arxiv.org/html/2605.15399#bib.bib10)\]\. For such non\-real\-time applications, the decision becomes fundamentally an amortization trade\-off: will the surrogate solver be used enough times such that the compute spent on generating data and training the model is worth it, i\.e\. not better\-spent on querying a traditional solver of similarly low fidelity? Existing evaluation practice for neural PDE surrogates, which mostly focuses on measuring the predictive error against a fixed high\-fidelity ground\-truth simulator, typically under\-represents end\-to\-end costs, making it difficult to answer this question\[[28](https://arxiv.org/html/2605.15399#bib.bib11),[17](https://arxiv.org/html/2605.15399#bib.bib31),[24](https://arxiv.org/html/2605.15399#bib.bib13)\]\.

To address this gap, we propose a compute\-aware evaluation framework that makes the decision trade\-off explicit\. Our approach is centered aroundbreakeven complexity, which measures the number of forward solves required for a neural PDE surrogate to be cost\-effective relative to an error\-matched low\-fidelity classical simulator\. Unlike accuracy\-only evaluations, breakeven complexity accounts for data generation costs, training costs, inference costs, and low fidelity in a single number\. This enables a more realistic comparison between learned models that better\-informs decision\-making and accurately conveys their potential relative to traditional solvers\.

Our specific contributions are the following:

1. 1\.We build an evaluation framework aroundbreakeven complexity, a cost\-aware metric for learned PDE solvers building on the amortization quantity suggested byMcGreivy and Hakim \[[22](https://arxiv.org/html/2605.15399#bib.bib25), Eq\. 1\]\. We formulate its worst\-case and average\-case variants and make their computation practical by \(a\) using scaling laws to allocate compute between optimization and data generation at a fixed budget level and \(b\) showing how to smoothly vary the error while reducing resolution when finding an error\-matched solver\.
2. 2\.We deploy our methodology on three 2D periodic PDE settings from APEBench, solved via the pseudo\-spectral Exponax code\[[17](https://arxiv.org/html/2605.15399#bib.bib31)\], and on a novel dataset calledBreakFlowconsisting of multi\-obstacle flows simulated with the scale\-resolving, GPU\-native PyFR code\[[33](https://arxiv.org/html/2605.15399#bib.bib32)\]\. We will release multi\-fidelity variants of all four datasets along with protocols for enabling the evaluation of breakeven complexity on these and on future benchmarks\.
3. 3\.We evaluate the breakeven complexity of eight leading models \(five supervised solvers and four foundation models\) on the above benchmarks\. Our investigation shows that models that do well according to reported error can sometimes underperform under our cost\-aware metric, that it can be used to study the robustness of learned solvers, and that on the APEBench \(periodic\) tasks learned solvers can require hundreds of thousands of inference call in order to be cost effective\.
4. 4\.On the other hand, by studying breakeven complexity across multiple spatial dimensions, rollout lengths, and Reynolds numbers, we find that it consistentlydecreasesas the task gets harder\. This suggests that the cost\-effectiveness of neural PDE solvers mayimprovewith simulation difficulty\.

## 2Related Work

There are a wide variety of neural network\-based approaches to solving PDEs, including physics\-informed neural networks \(PINNs\)\[[25](https://arxiv.org/html/2605.15399#bib.bib29)\]and neural operators\[[19](https://arxiv.org/html/2605.15399#bib.bib8),[31](https://arxiv.org/html/2605.15399#bib.bib24)\]\. Our work is centered around evaluating methods, such as the latter, which train networks on data generates by classical simulators; this field has witnessed a steady improvement in \(error\-based\) performance\[[31](https://arxiv.org/html/2605.15399#bib.bib24),[7](https://arxiv.org/html/2605.15399#bib.bib23)\]\. Most recently, foundation models have begun to be used in this field as well, promising to alleviate the upfront cost of data generation via transfer from massive pretraining data\[[14](https://arxiv.org/html/2605.15399#bib.bib16),[21](https://arxiv.org/html/2605.15399#bib.bib22)\]\.

These neural surrogates have been evaluated on a variety of benchmarks, largely via their predictive error relative to some high\-fidelity numerical simulator\. Benchmarks such as PDEBench\[[28](https://arxiv.org/html/2605.15399#bib.bib11)\], PDEAreana\[[12](https://arxiv.org/html/2605.15399#bib.bib27)\], APEBench\[[17](https://arxiv.org/html/2605.15399#bib.bib31)\]and the Well\[[24](https://arxiv.org/html/2605.15399#bib.bib13)\]all provides large, ready\-to\-use datasets across different baseline PDEs with standardized metrics and baseline models\. However, focusing on accuracy makes it hard to understand tradeoffs with cost, despite this being a crucial question in a field motivated largely by solver speed\. Our work addresses this issue by comparing explicitly to an error\-matched traditional solver to derive a single interpretable value\.

We are not the first to consider varying\-fidelity classical solvers in evaluating learned approaches, which has been done for PINNs\[[27](https://arxiv.org/html/2605.15399#bib.bib26),[11](https://arxiv.org/html/2605.15399#bib.bib28)\]and \(more relevantly\) for neural operators byMcGreivy and Hakim \[[22](https://arxiv.org/html/2605.15399#bib.bib25)\]\. The latter showed the importance of simultaneously accounting for cost and accuracy and the need to compare to effective, well\-suited, and problem\-specific solvers; these critiques directly motivate our work to develop a cost\-aware basis for evaluating neural PDE solvers\.McGreivy and Hakim \[[22](https://arxiv.org/html/2605.15399#bib.bib25)\]also recommend the metric that in this work we make practical and call “breakeven complexity”; to our knowledge, the measure was not used by them nor in the subsequent literature\.

In addition to the framework and empirical evaluations, we also develop a new flow\-past\-multiple\-obstacles benchmark calledBreakFlow\. While several flow\-past\-object tasks exist\[[29](https://arxiv.org/html/2605.15399#bib.bib18),[5](https://arxiv.org/html/2605.15399#bib.bib17)\], ours crucially is generated via the GPU\-native code PyFR code\[[33](https://arxiv.org/html/2605.15399#bib.bib32)\], enabling a direct and fair comparison to neural network training and inference on the same device type\.

## 3Breakeven Complexity

We are interested in solvingθ\\theta\-parameterized PDEs of form∂tu\(t,x\)=ℒθ\[u\]\(t,x\)\\partial\_\{t\}u\(t,x\)=\\mathcal\{L\}\_\{\\theta\}\[u\]\(t,x\)with initial conditionu\(0,x\)=uθ\(x\)u\(0,x\)=u\_\{\\theta\}\(x\)and boundary termsℬθ\[u\|∂𝒳\]\(t,x\)=0\\mathcal\{B\}\_\{\\theta\}\[u\|\\partial\\mathcal\{X\}\]\(t,x\)=0over a domain𝒳∋x\\mathcal\{X\}\\ni x, with design variableθ\\thetaparameterizing the differential operatorℒθ\\mathcal\{L\}\_\{\\theta\}, the initial conditionsuθu\_\{\\theta\}, and the boundary conditionsℬθ\\mathcal\{B\}\_\{\\theta\}\. The goal of a PDE solver is to output an approximate solutionu^θ\\hat\{u\}\_\{\\theta\}over𝒳\\mathcal\{X\}, with the approximation quality assessed either by comparing to a high\-fidelity reference or via a PDE residual\. Traditionally, transient PDEs have been solved by what we will refer to as “classical” solvers, which typically discretize in space and time and solve a sequence of numerical problems on the resulting grid\. We will use such solvers to generate our data and high\-fidelity reference solutions for evaluations\.

### 3\.1Neural PDE Solvers

A neural PDE solver is one where the values of the functionu^θ\(t,x\)\\hat\{u\}\_\{\\theta\}\(t,x\)at specific space–time points are computed using a neural networkfwf\_\{w\}with weightsw∈ℝnw\\in\\mathbb\{R\}^\{n\}, i\.e\.u^θ\(t,x\)=fw\(θ,t,x\)\\hat\{u\}\_\{\\theta\}\(t,x\)=f\_\{w\}\(\\theta,t,x\)\. They are often trained by sampling a set ofNdataN\_\{\\mathrm\{data\}\}pointsθi,i=1,…,Ndata\\theta\_\{i\},i=1,\\dots,N\_\{\\mathrm\{data\}\}, from some distribution𝒟\\mathcal\{D\}overΘ\\Theta, classically computing corresponding “ground truth” solutionsuiu\_\{i\}for each of them, and trainingfwf\_\{w\}to approximate these solutions\. We will evaluate the performance of learned \(and other non\-reference\) solvers via thenormalized RMSE \(nRMSE\)‖u^θ−uθ‖2/‖uθ‖2\\\|\\hat\{u\}\_\{\\theta\}\-u\_\{\\theta\}\\\|\_\{2\}/\\\|u\_\{\\theta\}\\\|\_\{2\}relative to a ground truth reference solution, averaged over a held\-out test set of pointsθ∼𝒟\\theta\\sim\\mathcal\{D\}\. This serves as an imperfect indicator for the performance of the method on solving downstream tasks, e\.g\. inverse problems over the domainΘ\\Theta\.

### 3\.2Motivation: Where are neural solvers used?

The scientific workloads where neural PDE solvers have the most promise involve*repeated*forward solves of related PDEs, often to optimize some task metric\. For example, solving a design optimization or inverse problem task with an objective likeθ⋆∈arg⁡maxθ∈Θ⁡J\(θ;uθ\)\.\\theta^\{\\star\}\\in\\arg\\max\_\{\\theta\\in\\Theta\}\\;J\(\\theta;u\_\{\\theta\}\)\.typically requires many forward solves at different values ofθ\\theta, which is where acceleration via learned solvers may be useful\. To quantify this, supposeNNforward PDE solves with accuracyε\\varepsilonsuffices to optimize this objective to an acceptable value\. Then the question we seek to answer next is whether it is better to use a classical or neural solver for this purpose\.

### 3\.3Breakeven Complexity

A key feature of this setting is the up\-front cost of a neural PDE solverfwf\_\{w\}, which is the sum of two quantities: \(i\) the cost of generating the data and \(ii\) the cost of training the model on that data\. Since data generation cost can be reasonably assumed to scale linearly with the numberNdataN\_\{\\mathrm\{data\}\}of training points, we write the data generation cost asCgenNdataC\_\{\\mathrm\{gen\}\}N\_\{\\mathrm\{data\}\}, whereCgenC\_\{\\mathrm\{gen\}\}is the average cost of generating one training trajectory via a classical solver\. We then define the total up\-front training cost asB:=CgenNdata\+Ctrain,B:=C\_\{\\mathrm\{gen\}\}N\_\{\\mathrm\{data\}\}\+C\_\{\\mathrm\{train\}\},whereCtrainC\_\{\\mathrm\{train\}\}is the total cost of all training steps\.BBcan be interpreted as the budget we spend on producing a neural solver before it can be deployed\. LetεB\\varepsilon\_\{B\}denote the \(worst\-case or average\) error achieved by the resulting model\.

Now, if our downstream task needsNNtrajectories with errorεB\\varepsilon\_\{B\}to solve, then the total cost spent on the task will beBtotal:=B\+CinfN,B\_\{\\mathrm\{total\}\}:=B\+C\_\{\\mathrm\{inf\}\}N,i\.e\. the sum of the up\-front training costBBand total inference costCinfNC\_\{\\mathrm\{inf\}\}N, whereCinfC\_\{\\mathrm\{inf\}\}is cost per inference usingfwf\_\{w\}\.

![Refer to caption](https://arxiv.org/html/2605.15399v1/x1.png)Figure 1:Illustration of the breakeven complexity definition, using FFNO and Poseidon\-T on Navier\-Stokes as examples\. For a fixed training budgetB=1000B=1000, a learned solver’s total costs for simulatingNNtrajectories isBtotal=B\+CinfNB\_\{\\textrm\{total\}\}=B\+C\_\{\\textrm\{inf\}\}N, while the cheapest classical configuration whose error matches the learned accuracyεB\\varepsilon\_\{B\}has costCδBNC\_\{\\delta\_\{B\}\}N\. The intersection of these two lines, if ever, will occur atN⋆\(B\)=BCδB−CinfN^\{\\star\}\(B\)=\\frac\{B\}\{C\_\{\\delta\_\{B\}\}\-C\_\{\\mathrm\{inf\}\}\}\(Eq\.[1](https://arxiv.org/html/2605.15399#S3.E1)\), which we define as the breakeven complexity\. For any problem that requires fewer inference calls than this, we prefer to use the classical solver\.
![Refer to caption](https://arxiv.org/html/2605.15399v1/x2.png)Figure 2:Example of a scaling landscape for the budget\-optimal error, using Poseidon\-B trained on Gray\-Scott and varying the allocation between training trajectories \(data\-generation\) and optimization steps\. We fit a model to the small\-budget optima and extrapolate to larger budgets to estimate the best\-achievable error at eachBtrainB\_\{\\mathrm\{train\}\}\. A detailed fitting procedure is explained in Appendix[C](https://arxiv.org/html/2605.15399#A3)\.

To fairly compare a neural PDE solver against a classical solver\[[22](https://arxiv.org/html/2605.15399#bib.bib25)\], we compareBtotalB\_\{\\mathrm\{total\}\}to the cost that a similarly accurate classical solver would incur in order to computeNNtrajectories\. Specifically, letδB\\delta\_\{B\}denote the fidelity—e\.g\. the resolution—of anerror\-matchedclassical solver, i\.e\. one that achieves the same \(worst\-case or average\) errorεB\\varepsilon\_\{B\}as the model, and letCδBC\_\{\\delta\_\{B\}\}denote its per\-trajectory cost\. Then, the neural PDE solver is useful if and only if, givenNNtrajectories required by the task,B\+CinfN<CδBN\.B\+C\_\{\\mathrm\{inf\}\}N<C\_\{\\delta\_\{B\}\}N\.Rearranging the equation yields an amortization threshold, which we define as thebreakeven complexity:

N⋆:=Bmax⁡\{CδB−Cinf,0\}\.N^\{\\star\}:=\\frac\{B\}\{\\max\\\{C\_\{\\delta\_\{B\}\}\-C\_\{\\mathrm\{inf\}\},0\\\}\}\.\(1\)
As illustrated in Fig\.[2](https://arxiv.org/html/2605.15399#S3.F2), breakeven complexity is the point when a neural PDE solver and its error\-matched classical solver require the same cost of solving the downstream task\. It thus isolates the amortization regime, being small when \(i\) the up\-front costBBis modest, \(ii\) inference is substantially cheaper than that of the error\-matched classical simulation \(CδB≫CinfC\_\{\\delta\_\{B\}\}\\gg C\_\{\\mathrm\{inf\}\}\), or both\. Conversely, breakeven complexity is infinite when classical low\-fidelity simulation is already cheaper than inference at matched accuracy\. One limitation here is the assumption that simulation costs are roughly the same for all parametersθ\\theta, which could fail in certain settings whereθ\\thetainfluences the difficulty of the problem; however, taking the average cost over a distribution is a reasonable and meaningful approximation\.

## 4Evaluation Methodology

In this section, we describe the experimental evaluation protocol we use to compute breakeven complexity\. The pipeline involves \(i\) determining the cost measure, \(ii\) estimating the best achievable neural solver error for a fixed training budget, \(iii\) matching the latter’s accuracy to a classical baseline’s, and \(iv\) the computing the breakeven complexity via the quotient of the training budget and the difference between neural and traditional generation cost\.

### 4\.1Measuring compute cost

We use the wallclock time \(seconds\) to measure cost\. Although FLOPs are commonly used for scaling laws, they are less useful for comparing neural and classical PDE solvers because \(i\) throughput \(FLOPs/s\) depends strongly on kernel structure/hardware utilization and \(ii\) classical solver costs are dominated by stencil/flux evaluations, memory effects, and communications that are not well\-represented via FLOPs\. Wallclock therefore provides a direct end\-to\-end cost currency shared by both learned and classical methods, one which is also used in related work\[[3](https://arxiv.org/html/2605.15399#bib.bib12)\]\. Note that we exclude training initialization overheads \(e\.g\. compilation and warmup\) as they are not fundamental to the underlying modeling or physics problems\. Lastly, in this paper we run both neural and traditional solvers on the same hardware and ensure high GPU\-utilization in both cases; in future settings where classical simulation must be done on CPU, a monetary or power usage comparison could be used\.

### 4\.2Error\-optimal data generation

A learned solver incurs up\-front costB=CgenNdata\+CtrainB=C\_\{\\textrm\{gen\}\}N\_\{\\textrm\{data\}\}\+C\_\{\\textrm\{train\}\}from data generation and training, whereCgenC\_\{\\textrm\{gen\}\}is per\-trajectory generation time,NdataN\_\{\\textrm\{data\}\}is the number of generated trajectories, andCtrainC\_\{\\textrm\{train\}\}is total training step time\. For a given budgetBB, neural solvers admit many feasible allocations \(more data vs\. more optimization, different model sizes/hyperparameters\) that can yield substantially different errors and inference costs\. Since breakeven complexity compares methods at a*fixed up\-front budget*, using a non\-optimal configuration would conflate the metric with arbitrary training choices\. We therefore estimate the*best achievable*error under each budget and use this budget\-optimal model as the representative learned solver at budgetBB\.

Empirically, this can be done by sweeping feasible model configurations \(i\.e\. data generation and optimization allocation\) for some fixed budget\. However, since running sweeping for large budgets can be costly, we instead take the following scaling laws\-inspired approach\. First, we run hyperparameter sweeps on a set of small budgetsBBwhile also trying different generation vs\. optimization allocations \(varyingNdataN\_\{\\textrm\{data\}\}vs\.CtrainC\_\{\\textrm\{train\}\}while keepingCgenNdata\+Ctrain≤BC\_\{\\textrm\{gen\}\}N\_\{\\textrm\{data\}\}\+C\_\{\\textrm\{train\}\}\\leq B\)\. We setεB\\varepsilon\_\{B\}to be the best \(worst\-case or average\) error achieved at small budgetBBand defineNdata\(B\)N\_\{\\textrm\{data\}\}\(B\)to be the associated optimal number of training trajectories\. Then we fit a scaling model to the labeled pairs\(B,Ndata\(B\)\)\(B,N\_\{\\textrm\{data\}\}\(B\)\)in order to predict the error\-optimal amount of data to generate at larger budgetsBB\[[15](https://arxiv.org/html/2605.15399#bib.bib30)\]\. An example of the fitted scaling landscape graph is show in Fig\.[2](https://arxiv.org/html/2605.15399#S3.F2)\.

### 4\.3Error\-matched classical solvers

While high\-fidelity classical solvers are typically much more expensive than neural PDE solvers, the latter often have much higher error; thus a fair comparison is to compare to low\-fidelity classical solver\. Thus, for each training budgetBB, we find the cheapest configuration of the classical solver whose error matches that of the learned solver’s accuracyεB\\varepsilon\_\{B\}and use that as our baseline\. This error\-matching is essential: classical solvers can often be run at reduced fidelity \(and thus much lower cost\), so comparing surrogate inference cost to an unnecessarily high\-fidelity baseline can overstate the usefulness of neural PDE solvers\[[22](https://arxiv.org/html/2605.15399#bib.bib25)\]\.

More formally, let\{𝖲δ\}\\\{\\mathsf\{S\}\_\{\\delta\}\\\}be a family of classical solver configurations indexed by fidelityδ\\delta, with per\-trajectory runtimeCδC\_\{\\delta\}\. Letε\(δ\)\\varepsilon\(\\delta\)denote its error, computed in the same way as the neural PDE solver error by comparing to a high\-fidelity reference on a set of test trajectories\. Then we define theerror\-matched solverto be the cheapest solver that is at leastεB\\varepsilon\_\{B\}\-accurate:

δB\\displaystyle\\delta\_\{B\}∈arg⁡minδ⁡\{Cδ:ε\(δ\)≤εB\}\\displaystyle\\in\\arg\\min\_\{\\delta\}\\Big\\\{C\_\{\\delta\}:\\ \\varepsilon\(\\delta\)\\leq\\varepsilon\_\{B\}\\Big\\\}\(2\)In practice, we can construct the family\{𝖲δ\}\\\{\\mathsf\{S\}\_\{\\delta\}\\\}of low\-fidelity solvers by coarsening the resolution and timesteps\. Starting from a high\-fidelity solver, we decrease the resolution by e\.g\. setting grid spacingΔx→rΔx\\Delta x\\to r\\Delta xforr∈\{2,4,…\}r\\in\\\{2,4,\\dots\\\}in the case of regular grades or analagously reducing the number nodes in the case of unstructured meshes\. For each spatial coarsening level, we increase the time stepΔt\\Delta tto reduce runtime, while maintaining stability by enforcing a CFL \(Courant–Friedrichs–Lewy\) condition constraint\. Concretely, we chooseΔt\\Delta tso that the nondimensional Courant number remains approximately constant across fidelities, i\.e\.CFL∝umaxΔtΔx≈const,\\mathrm\{CFL\}\\;\\propto\\;\\frac\{u\_\{\\max\}\\,\\Delta t\}\{\\Delta x\}\\approx\\mathrm\{const\},whereumaxu\_\{\\max\}is a characteristic velocity scale for the benchmark equation\. When the solver has additional stability constraints \(e\.g\., diffusion\-like terms\), we additionally capΔt\\Delta tby the corresponding bound\. Note that, for some numerical methods \(e\.g\. ETDRK time\-steppers\[[9](https://arxiv.org/html/2605.15399#bib.bib38),[34](https://arxiv.org/html/2605.15399#bib.bib37)\]\), the stability and CFL condition does not depend significantly onΔt\\Delta t\. In this case, we would fixΔt\\Delta tand only coarsen the grid resolution\. This procedure yields a principled grid\-timestep schedule that produces a monotone cost\-accuracy tradeoff and avoids unstable configurations\.

### 4\.4Computing breakeven complexity

After determining the error\-matched solver by defining a family of low\-fidelity solvers and using Eq\.[2](https://arxiv.org/html/2605.15399#S4.E2), we are finally able to compute the breakeven complexity \([1](https://arxiv.org/html/2605.15399#S3.E1)\)N⋆\(B\)=B/max⁡\{CδB−Cinf,0\}N^\{\\star\}\(B\)=B/\\max\\\{C\_\{\\delta\_\{B\}\}\-C\_\{\\textrm\{inf\}\},0\\\}, withN⋆\(B\)=∞N^\{\\star\}\(B\)=\\inftywhen the denominator is 0, i\.e\. when inference costs more than classical simulation\. Note that, given a static set of test trajectories, we can define the errorεB\\varepsilon\_\{B\}of the neural PDE\-solver and the errorε\(δB\)\\varepsilon\(\\delta\_\{B\}\)of its matched classical solver in \(at least\) two ways: via theaverage nRMSEover the test set or theworst nRMSEof any trajectory in the test set\. We use these to define theaverage\-case breakeven complexityand theworst\-case breakeven complexity, respectively\. Average\-case breakeven complexity indicates the workload size required for a learned solver to amortize training cost to be more useful than classical simulators when targeting typical accuracy, while worst\-case breakeven complexity indicates the workload size required to amortize training cost while meeting robust worst\-trajectory accuracy guarantees\. Note that since we consider the worst\-case for the both the neural solver and the error\-matched classical solver, the worst\-case breakeven complexity is not necessarily worse than the average\-case\.

## 5Benchmarks and datasets

We evaluate breakeven complexity on six PDE families: five extant settings from APEBench\[[17](https://arxiv.org/html/2605.15399#bib.bib31)\]and a new benchmark, BreakFlow, which uses PyFR\[[33](https://arxiv.org/html/2605.15399#bib.bib32)\]to simulate flows in complex geometries\. Simulations are run until the dynamics converge to a steady state or enter limited cycles\. Details of the benchmarks and datasets are presented in Appendix[A](https://arxiv.org/html/2605.15399#A1)\.

Theperiodic benchmarkswe consider involve three canonical equations: Navier\-Stokes \(N\-S\), Kuramoto\-Shivashinsky \(K\-S\), and Gray\-Scott \(G\-S\)\. They are chosen due to their nonlinearity and distinct physical characteristics \(advective turbulence, chaotic instability\-driven dynamics, and reaction–diffusion pattern formation, respectively\)\. While our primary evaluations focus on 2D domains, we also evaluate the K\-S equation in 1D and 3D to analyze difficulty scaling across spatial dimensions\. The data trajectories are simulated via the Exponax solver from APEBench\[[17](https://arxiv.org/html/2605.15399#bib.bib31)\], a GPU\-based pseudo\-spectral solver with ETDRK time\-stepping\. Its Fourier spectral discretization is well\-suited to smooth periodic dynamics and provides an efficient, well\-controlled classical baseline for breakeven analysis\.

TheBreakFlow benchmarkis a new setting we introduce consisting of flows past multiple objects simulated in PyFR\[[33](https://arxiv.org/html/2605.15399#bib.bib32)\], a flux\-reconstruction code that uses artificial viscosity to enable highly parallel GPU operations for of incompressible flows in complex geometries; this makes it particularly well\-suited for breakeven analysis\. The specific PDE instances in BreakFlow involve rightward flows past rectangles, with sizes, positions, and rotations sampled from a distribution ensuring sufficient separation between obstacles and symmetry about the horizontal axis, as detailed in Appendix[A](https://arxiv.org/html/2605.15399#A1)\. We generate simulations over three different range of Reynolds numbers,\(10,40\]\(10,40\],\(40,90\]\(40,90\]and\(90,160\]\(90,160\]\.

## 6Experimental Results\.

![Refer to caption](https://arxiv.org/html/2605.15399v1/x3.png)Figure 3:Plots of the test nRMSE \(top\) and breakeven complexity \(bottom\), both as functions of the training budgetBB\. “×\\times” and “▲\\blacktriangle” represent worst\-case \(error or complexity\) and infinite \(complexity\), respectively\. While error decreases monotonically almost everywhere, breakeven complexity is nonlinear because both its numerator and denominator depend on the budgetBB\.We use the evaluation pipeline in Sec\.[4](https://arxiv.org/html/2605.15399#S4)to compute breakeven complexity across the four PDE setups in Sec\.[5](https://arxiv.org/html/2605.15399#S5)for multiple model classes\. Specifically, we evaluate two categories of neural PDE solvers:*supervised*models trained from scratch on each task and*foundation\-model*\(FM\) approaches adapted to each task via fine\-tuning\. For the former we consider four representative architectures: \(i\)*FFNO*\[[31](https://arxiv.org/html/2605.15399#bib.bib24)\], an efficient and tuned variant of the Fourier Neural Operator\[[19](https://arxiv.org/html/2605.15399#bib.bib8)\], \(ii\)*EddyFormer*, a Transformer\-based model\[[7](https://arxiv.org/html/2605.15399#bib.bib23)\], \(iii\)*DISCO*, which uses hypernetwork to dynamically generate a an autoregressive operator\[[23](https://arxiv.org/html/2605.15399#bib.bib3)\], and \(iv\)*DPOT*, which uses a Fourier\-based attention mechanisms\[[13](https://arxiv.org/html/2605.15399#bib.bib2)\]\. For the latter, we consider two representative FM architectures: \(i\) the*Poseidon*family \(Tiny, Base, and Large\) pre\-trained on diverse fluid data\[[14](https://arxiv.org/html/2605.15399#bib.bib16)\], and \(ii\)*HalfWalrus*, a Transformer\-based cross\-domain FM pre\-trained on a broad set of continuum dynamics from many physical scenarios\[[21](https://arxiv.org/html/2605.15399#bib.bib22)\]\. When computing breakeven complexity, we exclude pretraining cost, treating it as a shared, reusable expense amortized across many downstream tasks and thus not attributable to any single deployment\.

A complete report of our breakeven complexities at different budgets can be found in Appendix[F](https://arxiv.org/html/2605.15399#A6), with model training details provided in Appendix[B](https://arxiv.org/html/2605.15399#A2)\. Note that we compute both*average\-case*and*worst\-case*variants \(Sec\.[4\.4](https://arxiv.org/html/2605.15399#S4.SS4)\), where worst\-case is taken over test trajectories\. In this section, we organize our experimental analysis around the following three claims:

1. 1\.By forcing the consideration of compute cost, breakeven complexity reveals relative model strengths and weaknesses overlooked in previous accuracy\-driven evaluations\.
2. 2\.We can compare worst\-case and average\-case variants of breakeven complexity to gain an understanding of the robustness of a learned model relative to a low fidelity classial solver\.
3. 3\.Breakeven complexity evaluations reveal that neural solvers need to be called hundreds of thousands of times to be effective in the basic toy PDEs that are typically studied in scientific ML, but on the other hand neural solvers becomemorecost\-effective onharderPDE problems\.

![Refer to caption](https://arxiv.org/html/2605.15399v1/x4.png)Figure 4:nRMSE \(top\) and breakeven complexity \(bottom\) for different models across the four tasks at the largest budget\. While nRMSE suggests neural surrogates will struggle on \(classically\) demanding tasks given the same \(training\) budget, our compute\-aware metric suggests the opposite, being lower for PDEs with longer simulation time \(CgenC\_\{\\textrm\{gen\}\}\)\.
![Refer to caption](https://arxiv.org/html/2605.15399v1/x5.png)Figure 5:Comparison of how decreasing resolution increases simulation speed across our four PDE settings; the time saved is relative to the high\-fidelity solver, with the sequence of low\-fidelity solvers determined as in Section[4\.3](https://arxiv.org/html/2605.15399#S4.SS3)\. Unlike the APEBench simulations, BreakFlow is extremely sensitive to changes in resolution, with error increasing rapidly after a slight decrease in solve cost\.

### 6\.1Claim 1: The value of cost\-awareness

In Fig\.[3](https://arxiv.org/html/2605.15399#S6.F3)we plot the budget\-optimal test errorεB\\varepsilon\_\{B\}and the corresponding breakeven complexityN⋆\(B\)N^\{\\star\}\(B\)as functions of the training budgetBB, doing so across all models and settings under consideration\. While the error curves are almost uniformly decreasing across the models and settings, the breakeven complexity can be non\-monotone; notably, EddyFormer struggles on low budgets for both fluid settings \(Navier\-Stokes and BreakFlow\) but becomes much stronger at higher budgets, suggesting that to be effective it must be trained sufficiently long and on sufficiently many samples\. On the other hand, it is undefined on Kuramoto\-Sivashinsky because its inference cost is greater than the simulation cost of an error\-matched classical solver; the same holds for two of the APEBench settings for HalfWalrus, suggesting that a meaningful evaluation of it must be conducted on harder problems\. A surprising finding is that while the Poseidon models and DPOT dominate BreakFlow according to test error, their breakeven complexity is not much better than that of HalfWalrus and FFNO, with the latter being the most cost\-effective at the highest budget we evaluate despite being a supervised model\.

Beyond this, breakeven complexity’s forced consideration of cost tradeoffs reveals performance shifts that standard metrics might overlook\. Most notably, when training cost is equalized we find that Poseidon\-L is the worst FM in its family on all tasks except BreakFlow, according to both breakeven complexity \(albeit barely\) and test nRMSE \(more significantly\)\. In contrast, it was reported to be far better than Poseidon\-T and Poseidon\-B in aggregate on similar tasks\[[14](https://arxiv.org/html/2605.15399#bib.bib16), Table 9\]\. This demonstrates the usefulness of our new measure in forcing an accounting of compute costs when doing evaluation\.

![Refer to caption](https://arxiv.org/html/2605.15399v1/x6.png)Figure 6:Comparison of the error distributions of FFNO and Poseidon\-T \(training budget 8K\) along with the distributions of the classical solvers that match their worst\-case and average\-case performance\. On the right we compare the ratio between worst\-case and average\-case forN⋆N^\{\\star\}and nRMSE\. For both models, their error distribution on Gray\-Scott is much more robust, i\.e\. more concentrated, than on the other PDEs, making the distribution of the worst\-case matching solver fairly close to that of the average\-case; this in turn means the worse solver is not much faster and so the worst\-case breakeven complexity does not increase so much over its average\-case\. This is evidenced in the plots on the right, where the ratio of worst\-to\-average\-case complexity is smaller for Gray\-Scott\. Notably, the ratio of the worst\-case to the average\-case nRMSE does not correlate to robustness in this way\.
### 6\.2Claim 2: Measuring robustness

As discussed in Section[3](https://arxiv.org/html/2605.15399#S3), we can measure breakeven complexity by matching learned and traditional solvers via theiraveragetest error or theirworst\-casetest error\. As shown in Figure[6](https://arxiv.org/html/2605.15399#S6.F6), it is instructive to look at their ratio: both FFNO and Poseidon\-T has relatively closer average and worst\-case complexities on Gray\-Scott, but on the other PDEs the worst\-case tends to be much higher\. Looking at the distribution of errors across the test trajectories of each task in Fig\.[6](https://arxiv.org/html/2605.15399#S6.F6), we see that Gray\-Scott is also the setting where the surrogates’ errors are concentrated as or more tightly than their matched solvers\. This suggests a smaller gap between worst\-case and average\-case breakeven complexity reflects a more robust learned solver, i\.e\. one that is less likely to have instances where it performs much worse than the average instance\. Mathematically, this will be caused by the worst\-case error\-matched solver having an error distribution not too far from the average\-case, thus also not being much cheaper and therefore not significantly increasing the complexity\. As robustness is naturally important for optimization problems that require exploring unseen parts of the optimization, the fact that breakeven complexity—in a way that nRMSE is not—can be useful indicator of robustness is practically meaningful\.

### 6\.3Claim 3: Optimism in the face of harder PDE problems

At first glance, our evaluations in toy settings suggest that hundreds of thousands of calls are necessary to reach cost\-effectiveness against error\-matched traditional solvers \(Fig\.[5](https://arxiv.org/html/2605.15399#S6.F5)\)\. However, the same plot demonstrates that breakeven complexity scales inversely with the computational cost of classically solving the PDEs, a basic notion of problem difficulty\. In particular, neural surrogates typically require only a few thousands inference calls to break even on BreakFlow, an order of magnitude fewer than on Gray\-Scott \(for most models\) and two orders of magnitude fewer than on Navier\-Stokes and Kuramoto\-Sivashinsky\. On the other hand, simulating BreakFlow is two orders more expensive than G\-S and three orders more than N\-S and K\-S\. This trend suggests that the true utility of neural PDE solvers lies in high\-fidelity regimes\.

These results scaling metrics with the cost do not on their own imply such as a conclusion, as there could be other reasons why learned solvers may have such a dramatically smaller breakeven complexity on BreakFlow\. For example, Figure[5](https://arxiv.org/html/2605.15399#S6.F5)shows that on the APEBench settings we can easily decrease the solver cost without significantly increasing error, while on BreakFlow the solver is more sensitive, with error rising dramatically even when the low\-fidelity solver is only10%10\\%faster than the high\-fidelity one\. In such cases, the inference cost of the error\-matched solver will remain large and thus will dominate the denominator of Eq\.[1](https://arxiv.org/html/2605.15399#S3.E1), making breakeven complexity small\.

We thus conduct further analysis along three other, better\-defined difficulty axes: spatial dimensionality \(on K\-S\), temporal rollout length \(on G\-S\), and flow complexity in terms Reynolds numberReRe\(on BreakFlow\)\. As shown in Figure[7](https://arxiv.org/html/2605.15399#S6.F7), in all three settings we observe the same pattern from before: as the problem difficulty increases while the budget is kept fixed, the breakeven complexitydecreaseswhile the test errorincreases\. In more detail, moving from 1D to 3D on Kuramoto\-Sivashinsky substantially increases classical solver cost, lowering the breakeven threshold even when surrogate error increases slightly\. For Gray\-Scott, longer rollout horizons make coarse classical solvers less viable because temporal errors accumulate, requiring finer resolutions and again reducing breakeven complexity\. Lastly, BreakFlow simulations with well\-formed vortex streets \(Re∈\(90,160\]Re\\in\(90,160\]\) require finer grids and more restrictive time steps than steady flows \(Re∈\(10,40\]Re\\in\(10,40\]\), yielding the same trend along the flow complexity axis\[[32](https://arxiv.org/html/2605.15399#bib.bib6),[26](https://arxiv.org/html/2605.15399#bib.bib7)\]\. These results suggests that neural surrogates will be most computationally advantageous precisely in regimes where classical solvers face rapidly growing costs, i\.e\. on more challenging, large\-scale problems\. As these are precisely the problems of interest for practical scientific computing, breakeven complexity thus provides a more optimistic view of the potential of ML for scientific computing than found inMcGreivy and Hakim \[[22](https://arxiv.org/html/2605.15399#bib.bib25)\], despite their first suggestion of the metric\.

![Refer to caption](https://arxiv.org/html/2605.15399v1/x7.png)Figure 7:Plot showing the inverse relationship between test error and breakeven complexity of FFNO as problem difficulty increases on Kuramoto\-Sivashinsky \(left, varying spatial dimension\), Gray\-Scott \(middle, varying rollout length\), and BreakFlow \(right, varying Reynolds number\)\. Across all three axes, increasing problem difficulty sharply reduces breakeven complexity at the same budget, indicating a growing computational advantage for neural surrogates over classical solvers\.

## 7Conclusion

We formalize*breakeven complexity*, a framework for evaluating neural PDE solvers that complements standard error\-based evaluation by quantifying the threshold at which a learned PDE solver becomes cost\-effective relative to an error\-matched classical baseline; it thus integrates end\-to\-end costs \(data generation, training, inference\) while accounting for low\-fidelity solvers\. Our empirical evaluations of breakeven complexity across three 2D periodic PDE setups and on a new benchmark of complex flows demonstrate the importance of accounting for cost, reveal hidden weaknesses in past evaluations, naturally quantify robustness, and suggest that the greatest potential of neural PDE solvers will be realized on more challenging benchmarks\. As a result, we believe this metric should be considered by any rigorous evaluation of neural PDE methods and should also inform future benchmark construction\.

Limitationsof breakeven complexity include its dependence on the fidelity and implementation of a classical baseline, hardware conditions, and the modeling of inference and training costs; these can be more challenging to work out than simple error statistics\. However, doing this assessment is precisely what enables a practitioner to understand the paradigm’s value on their system and use\-case\.

Future workmay thus extend the breakeven complexity framework to incorporate heterogeneous hardware, adaptive classical solvers, and uncertainty in workload forecasts\. Other directions include combining the metric with concrete inverse problems and defining it for the case of hybrid solvers that use learning to correct low\-fidelity traditional solvers\[[16](https://arxiv.org/html/2605.15399#bib.bib19),[7](https://arxiv.org/html/2605.15399#bib.bib23)\]\.

## References

- \[1\]S\. Abnar, H\. Shah, D\. Busbridge, A\. El\-Nouby, J\. M\. Susskind, and V\. Thilak\(2025\)Parameters vs FLOPs: scaling laws for optimal sparsity for mixture\-of\-experts language models\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=l9FVZ7NXmm)Cited by:[Appendix D](https://arxiv.org/html/2605.15399#A4.p1.1)\.
- \[2\]J\. P\. Ahrens, B\. Geveci, and C\. C\. Law\(2005\)ParaView: an end\-user tool for large\-data visualization\.InThe Visualization Handbook,Cited by:[item 4](https://arxiv.org/html/2605.15399#A1.I1.i4.p1.1)\.
- \[3\]N\. Ashton, J\. Brandstetter, and S\. Mishra\(2025\)Fluid intelligence: a forward look on ai foundation models in computational fluid dynamics\.arXiv preprint arXiv:2511\.20455\.Cited by:[Appendix C](https://arxiv.org/html/2605.15399#A3.p1.1),[Appendix D](https://arxiv.org/html/2605.15399#A4.p4.1),[§1](https://arxiv.org/html/2605.15399#S1.p2.1),[§1](https://arxiv.org/html/2605.15399#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.15399#S4.SS1.p1.1)\.
- \[4\]P\. B\. Bochev and D\. Ridzal\(2009\)An optimization\-based approach for the design of pde solution algorithms\.SIAM journal on numerical analysis47\(5\),pp\. 3938–3955\.Cited by:[§1](https://arxiv.org/html/2605.15399#S1.p3.1)\.
- \[5\]N\. Choudhary, V\. Singh, A\. Talwalkar, N\. M\. Boffi, M\. Khodak, and T\. Marwah\(2025\)Pre\-generating multi\-difficulty pde data for few\-shot neural pde solvers\.arXiv preprint arXiv:2512\.00564\.Cited by:[§2](https://arxiv.org/html/2605.15399#S2.p4.1)\.
- \[6\]T\. De Ryck and S\. Mishra\(2022\)Generic bounds on the approximation error for physics\-informed \(and\) operator learning\.Advances in Neural Information Processing Systems35,pp\. 10945–10958\.Cited by:[Appendix C](https://arxiv.org/html/2605.15399#A3.p1.1)\.
- \[7\]Y\. Du and A\. S\. Krishnapriyan\(2025\)EddyFormer: accelerated neural simulations of three\-dimensional turbulence at scale\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=bla5qx2sYe)Cited by:[2nd item](https://arxiv.org/html/2605.15399#A2.I1.i2.p1.3),[§2](https://arxiv.org/html/2605.15399#S2.p1.1),[§6](https://arxiv.org/html/2605.15399#S6.p1.1),[§7](https://arxiv.org/html/2605.15399#S7.p3.1)\.
- \[8\]L\. C\. Evans\(2022\)Partial differential equations\.Vol\.19,American mathematical society\.Cited by:[§1](https://arxiv.org/html/2605.15399#S1.p1.1)\.
- \[9\]Z\. Fu, J\. Shen, and J\. Yang\(2025\)Higher\-order energy\-decreasing exponential time differencing runge\-kutta methods for gradient flows\.Science China Mathematics68\(7\),pp\. 1727–1746\.Cited by:[§4\.3](https://arxiv.org/html/2605.15399#S4.SS3.p2.15)\.
- \[10\]C\. Geuzaine and J\. Remacle\(2009\)Gmsh: a three\-dimensional finite element mesh generator with built\-in pre\- and post\-processing facilities\.International Journal for Numerical Methods in Engineering79,pp\. 1309–1331\.Cited by:[item 2](https://arxiv.org/html/2605.15399#A1.I1.i2.p1.3)\.
- \[11\]T\. G\. Grossmann, U\. J\. Komorowska, J\. Latz, and C\. Schönlieb\(2024\-01\)Can physics\-informed neural networks beat the finite element method?\.IMA Journal of Applied Mathematics89\(1\),pp\. 143–174\.External Links:ISSN 0272\-4960,[Document](https://dx.doi.org/10.1093/imamat/hxae011),[Link](https://doi.org/10.1093/imamat/hxae011),https://academic\.oup\.com/imamat/article\-pdf/89/1/143/58325885/hxae011\.pdfCited by:[§2](https://arxiv.org/html/2605.15399#S2.p3.1)\.
- \[12\]J\. K\. Gupta and J\. Brandstetter\(2023\)Towards multi\-spatiotemporal\-scale generalized PDE modeling\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=dPSTDbGtBY)Cited by:[§2](https://arxiv.org/html/2605.15399#S2.p2.1)\.
- \[13\]Z\. Hao, C\. Su, S\. Liu, J\. Berner, C\. Ying, H\. Su, A\. Anandkumar, J\. Song, and J\. Zhu\(2024\)DPOT: auto\-regressive denoising operator transformer for large\-scale pde pre\-training\.InProceedings of the 41st International Conference on Machine Learning,ICML’24\.Cited by:[4th item](https://arxiv.org/html/2605.15399#A2.I1.i4.p1.3),[§6](https://arxiv.org/html/2605.15399#S6.p1.1)\.
- \[14\]M\. Herde, B\. Raonic, T\. Rohner, R\. Käppeli, R\. Molinaro, E\. de Bezenac, and S\. Mishra\(2024\)Poseidon: efficient foundation models for PDEs\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=JC1VKK3UXk)Cited by:[5th item](https://arxiv.org/html/2605.15399#A2.I1.i5.p1.2),[§2](https://arxiv.org/html/2605.15399#S2.p1.1),[§6\.1](https://arxiv.org/html/2605.15399#S6.SS1.p2.1),[§6](https://arxiv.org/html/2605.15399#S6.p1.1)\.
- \[15\]J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. de Las Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark, T\. Hennigan, E\. Noland, K\. Millican, G\. van den Driessche, B\. Damoc, A\. Guy, S\. Osindero, K\. Simonyan, E\. Elsen, O\. Vinyals, J\. W\. Rae, and L\. Sifre\(2022\)Training compute\-optimal large language models\.InProceedings of the 36th International Conference on Neural Information Processing Systems,NIPS ’22,Red Hook, NY, USA\.External Links:ISBN 9781713871088Cited by:[Appendix C](https://arxiv.org/html/2605.15399#A3.p2.5),[Appendix D](https://arxiv.org/html/2605.15399#A4.p1.1),[§4\.2](https://arxiv.org/html/2605.15399#S4.SS2.p2.9)\.
- \[16\]D\. Kochkov, J\. A\. Smith, A\. Alieva, Q\. Wang, M\. P\. Brenner, and S\. Hoyer\(2021\)Machine learning–accelerated computational fluid dynamics\.Proceedings of the National Academy of Sciences118\(21\)\.Cited by:[§7](https://arxiv.org/html/2605.15399#S7.p3.1)\.
- \[17\]F\. Koehler, S\. Niedermayr, R\. Westermann, and N\. Thuerey\(2024\)APEBench: a benchmark for autoregressive neural emulators of PDEs\.Advances in Neural Information Processing Systems \(NeurIPS\)38\.Cited by:[Appendix A](https://arxiv.org/html/2605.15399#A1.SS0.SSS0.Px1.p2.5),[Appendix A](https://arxiv.org/html/2605.15399#A1.SS0.SSS0.Px2.p2.4),[Appendix A](https://arxiv.org/html/2605.15399#A1.SS0.SSS0.Px3.p2.9),[Appendix D](https://arxiv.org/html/2605.15399#A4.p2.2),[item 2](https://arxiv.org/html/2605.15399#S1.I1.i2.p1.1),[§1](https://arxiv.org/html/2605.15399#S1.p3.1),[§2](https://arxiv.org/html/2605.15399#S2.p2.1),[§5](https://arxiv.org/html/2605.15399#S5.p1.1),[§5](https://arxiv.org/html/2605.15399#S5.p2.1)\.
- \[18\]R\. J\. LeVeque\(2007\)Finite difference methods for ordinary and partial differential equations: steady\-state and time\-dependent problems\.SIAM\.Cited by:[§1](https://arxiv.org/html/2605.15399#S1.p1.1)\.
- \[19\]Z\. Li, N\. B\. Kovachki, K\. Azizzadenesheli, B\. Liu, K\. Bhattacharya, A\. Stuart, and A\. Anandkumar\(2021\)Fourier neural operator for parametric partial differential equations\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=c8P9NQVtmnO)Cited by:[§1](https://arxiv.org/html/2605.15399#S1.p1.1),[§2](https://arxiv.org/html/2605.15399#S2.p1.1),[§6](https://arxiv.org/html/2605.15399#S6.p1.1)\.
- \[20\]Y\. Lu, H\. Chen, J\. Lu, L\. Ying, and J\. Blanchet\(2022\)Machine learning for elliptic PDEs: fast rate generalization bound, neural scaling law and minimax optimality\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=mhYUBYNoGz)Cited by:[Appendix C](https://arxiv.org/html/2605.15399#A3.p1.1)\.
- \[21\]M\. McCabe, P\. Mukhopadhyay, T\. Marwah, B\. R\. Blancard, F\. Rozet, C\. Diaconu, L\. Meyer, K\. W\. Wong, H\. Sotoudeh, A\. Bietti,et al\.\(2025\)Walrus: a cross\-domain foundation model for continuum dynamics\.arXiv preprint arXiv:2511\.15684\.Cited by:[6th item](https://arxiv.org/html/2605.15399#A2.I1.i6.p1.2),[§1](https://arxiv.org/html/2605.15399#S1.p1.1),[§2](https://arxiv.org/html/2605.15399#S2.p1.1),[§6](https://arxiv.org/html/2605.15399#S6.p1.1)\.
- \[22\]N\. McGreivy and A\. Hakim\(2024\-09\)Weak baselines and reporting biases lead to overoptimism in machine learning for fluid\-related partial differential equations\.Nature Machine Intelligence6\(10\),pp\. 1256–1269\.External Links:ISSN 2522\-5839,[Link](http://dx.doi.org/10.1038/s42256-024-00897-5),[Document](https://dx.doi.org/10.1038/s42256-024-00897-5)Cited by:[Appendix D](https://arxiv.org/html/2605.15399#A4.p4.1),[item 1](https://arxiv.org/html/2605.15399#S1.I1.i1.p1.1),[§2](https://arxiv.org/html/2605.15399#S2.p3.1),[§3\.3](https://arxiv.org/html/2605.15399#S3.SS3.p3.7),[§4\.3](https://arxiv.org/html/2605.15399#S4.SS3.p1.2),[§6\.3](https://arxiv.org/html/2605.15399#S6.SS3.p3.3)\.
- \[23\]R\. Morel, J\. Han, and E\. Oyallon\(2025\)DISCO: learning to DISCover an evolution operator for multi\-physics\-agnostic prediction\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=6EZ3MDDf6p)Cited by:[3rd item](https://arxiv.org/html/2605.15399#A2.I1.i3.p1.3),[§6](https://arxiv.org/html/2605.15399#S6.p1.1)\.
- \[24\]R\. Ohana, M\. McCabe, L\. Meyer, R\. Morel, F\. Agocs, M\. Beneitez, M\. Berger, B\. Burkhart, S\. Dalziel, D\. Fielding,et al\.\(2024\)The well: a large\-scale collection of diverse physics simulations for machine learning\.Advances in Neural Information Processing Systems37,pp\. 44989–45037\.Cited by:[§1](https://arxiv.org/html/2605.15399#S1.p3.1),[§2](https://arxiv.org/html/2605.15399#S2.p2.1)\.
- \[25\]M\. Raissi, P\. Perdikaris, and G\. E\. Karniadakis\(2017\)Physics informed deep learning \(part i\): data\-driven solutions of nonlinear partial differential equations\.arXiv preprint arXiv:1711\.10561\.Cited by:[§2](https://arxiv.org/html/2605.15399#S2.p1.1)\.
- \[26\]A\. Roshko\(1954\)On the development of turbulent wakes from vortex streets\.Technical reportCited by:[§6\.3](https://arxiv.org/html/2605.15399#S6.SS3.p3.3)\.
- \[27\]A\. Sacchetti, B\. Bachmann, K\. Löffel, U\. Künzi, and B\. Paoli\(2022\)Neural networks to solve partial differential equations: a comparison with finite elements\.IEEE Access10,pp\. 32271–32279\.Cited by:[§2](https://arxiv.org/html/2605.15399#S2.p3.1)\.
- \[28\]M\. Takamoto, T\. Praditia, R\. Leiteritz, D\. MacKinlay, F\. Alesiani, D\. Pflüger, and M\. Niepert\(2022\)PDEBench: an extensive benchmark for scientific machine learning\.Advances in Neural Information Processing Systems35,pp\. 1596–1611\.Cited by:[§1](https://arxiv.org/html/2605.15399#S1.p3.1),[§2](https://arxiv.org/html/2605.15399#S2.p2.1)\.
- \[29\]R\. Tali, A\. Rabeh, C\. Yang, M\. Shadkhah, S\. Karki, A\. Upadhyaya, S\. Dhakshinamoorthy, M\. Saadati, S\. Sarkar, A\. Krishnamurthy, C\. Hegde, A\. Balu, and B\. Ganapathysubramanian\(2025\)FlowBench: a large scale benchmark for flow simulation over complex geometries\.Journal of Data\-centric Machine Learning Research\.Note:External Links:[Link](https://openreview.net/forum?id=0xrOmmB2KQ)Cited by:[§2](https://arxiv.org/html/2605.15399#S2.p4.1)\.
- \[30\]K\. Tangsali, R\. Ranade, M\. A\. Nabian, A\. Kamenev, P\. Sharpe, N\. Ashton, R\. Cherukuri, and S\. Choudhry\(2025\)A benchmarking framework for ai models in automotive aerodynamics\.arXiv preprint arXiv:2507\.10747\.Cited by:[§1](https://arxiv.org/html/2605.15399#S1.p3.1)\.
- \[31\]A\. Tran, A\. Mathews, L\. Xie, and C\. S\. Ong\(2023\)Factorized fourier neural operators\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=tmIiMPl4IPa)Cited by:[1st item](https://arxiv.org/html/2605.15399#A2.I1.i1.p1.3),[§2](https://arxiv.org/html/2605.15399#S2.p1.1),[§6](https://arxiv.org/html/2605.15399#S6.p1.1)\.
- \[32\]C\. H\. Williamson\(1989\)Oblique and parallel modes of vortex shedding in the wake of a circular cylinder at low reynolds numbers\.Journal of Fluid Mechanics206,pp\. 579–627\.Cited by:[§6\.3](https://arxiv.org/html/2605.15399#S6.SS3.p3.3)\.
- \[33\]F\.D\. Witherden, A\.M\. Farrington, and P\.E\. Vincent\(2014\-11\)PyFR: an open source framework for solving advection–diffusion type problems on streaming architectures using the flux reconstruction approach\.Computer Physics Communications185\(11\),pp\. 3028–3040\.External Links:ISSN 0010\-4655,[Link](http://dx.doi.org/10.1016/j.cpc.2014.07.011),[Document](https://dx.doi.org/10.1016/j.cpc.2014.07.011)Cited by:[item 2](https://arxiv.org/html/2605.15399#S1.I1.i2.p1.1),[§2](https://arxiv.org/html/2605.15399#S2.p4.1),[§5](https://arxiv.org/html/2605.15399#S5.p1.1),[§5](https://arxiv.org/html/2605.15399#S5.p3.3)\.
- \[34\]Z\. Xu, Z\. Sun, and Y\. Zhang\(2025\)Stability and time\-step constraints of exponential time differencing runge–kutta discontinuous galerkin methods for advection\-diffusion equations\.Journal of Scientific Computing105\(2\),pp\. 1–31\.Cited by:[§4\.3](https://arxiv.org/html/2605.15399#S4.SS3.p2.15)\.

## Appendix ADataset specifications

All data generation is conducted on NVIDIA L40S GPUs with no jobs running in parallel\.

#### 2D incompressible Navier\-Stokes

We consider this equation in vorticity form\. Letting the velocity field be𝐮\(t,x\)=\(u1,u2\)\\mathbf\{u\}\(t,x\)=\(u\_\{1\},u\_\{2\}\)with∇⋅𝐮=0\\nabla\\cdot\\mathbf\{u\}=0, define the scalar vorticityω\(t,x\):=∂x1u2−∂x2u1\\omega\(t,x\):=\\partial\_\{x\_\{1\}\}u\_\{2\}\-\\partial\_\{x\_\{2\}\}u\_\{1\}\. The vorticity equation is

∂tω\+𝐮⋅∇ω=νΔω\+f,\\displaystyle\\partial\_\{t\}\\omega\+\\mathbf\{u\}\\cdot\\nabla\\omega=\\nu\\Delta\\omega\+f,\(3\)whereν\>0\\nu\>0is the kinematic viscosity andffis forcing\.

We generate a total of 200K training trajectories plus 1K testing trajectories using Exponax\[[17](https://arxiv.org/html/2605.15399#bib.bib31)\]\. We generate at a resolution of256×256256\\times 256with time stepΔt=1×10−3\\Delta t=1\\times 10^\{\-3\}and simulate overT=1T=1seconds on a 2D square periodic domain of edge\-lengthL=2L=2\. We store 10 frames in total and spectrally downgraded the saved frames into resolution64×6464\\times 64\. The average per\-trajectory simulation time on NVIDIA L40S GPUs is 0\.0435 seconds\.

#### 1D, 2D and 3D Kuramoto\-Sivashinsky

We consider this equation on a periodic domain𝒳=\[0,L\]d\\mathcal\{X\}=\[0,L\]^\{d\}, where the spatial dimension isd∈\{1,2,3\}d\\in\\\{1,2,3\\\}\. While the 2D formulation serves as our standard benchmark, the 1D and 3D variants are included specifically for the difficulty scaling analysis\. The governing equation for a scalar fieldu\(t,x\)u\(t,x\)is:

∂tu\+Δu\+Δ2u\+12‖∇u‖22=0,\\displaystyle\\partial\_\{t\}u\\;\+\\;\\Delta u\\;\+\\;\\Delta^\{2\}u\\;\+\\;\\frac\{1\}\{2\}\\\|\\nabla u\\\|\_\{2\}^\{2\}\\;=\\;0,\(4\)where∇\\nablaandΔ\\Deltadenote the spatial gradient and Laplacian, respectively, andΔ2\\Delta^\{2\}is the biharmonic operator\.

We generate in total of 200K training trajectories plus 1K testing trajectories using Exponax\[[17](https://arxiv.org/html/2605.15399#bib.bib31)\]\. We generate the data at resolution of 256 with time stepΔt=0\.1\\Delta t=0\.1and simulate overT=10T=10seconds on a 2D square periodic domain of edge\-lengthL=50L=50\. We store 50 frames in total and spectrally downgraded the saved frames into resolution64×6464\\times 64\. All simulations uses NVIDIA L40S GPUs\. The average per\-trajectory simulation time for 1D is 0\.0004 seconds\. The average per\-trajectory simulation time for 2D is 0\.0667 seconds\. The average per\-trajectory simulation time for 3D is 2\.7110 seconds\.

#### 2D Gray\-Scott

Letu\(t,x\)u\(t,x\)andv\(t,x\)v\(t,x\)denote chemical concentrations on a 2D domain𝒳\\mathcal\{X\}\. The Gray–Scott system is formulated by

∂tu\\displaystyle\\partial\_\{t\}u=DuΔu−uv2\+F\(1−u\)\\displaystyle=D\_\{u\}\\Delta u\-uv^\{2\}\+F\(1\-u\)\(5\)∂tv\\displaystyle\\partial\_\{t\}v=DvΔv\+uv2−\(F\+k\)v,\\displaystyle=D\_\{v\}\\Delta v\+uv^\{2\}\-\(F\+k\)\\,v,\(6\)for diffusivitiesDu,Dv\>0D\_\{u\},D\_\{v\}\>0and reaction parametersF,kF,k\.

We generate in total of 200K training trajectories plus 1K testing trajectories using Exponax\[[17](https://arxiv.org/html/2605.15399#bib.bib31)\]\. The feed rates used aref=0\.029f=0\.029andk=0\.057k=0\.057and the diffusivities are set to2\.1×10−52\.1\\times 10^\{\-5\}and1\.1×10−51\.1\\times 10^\{\-5\}\. We generate at a resolution of256×256256\\times 256with time stepΔt=0\.5\\Delta t=0\.5and simulate overT=2000T=2000seconds on a 2D square periodic domain of edge\-lengthL=2L=2\. The end\-time is specifically chosen because most Gray\-Scott simulatons stabilize well before then\. We stored 200 frames in total and spectrally downgraded the saved frames into resolution of64×6464\\times 64\. The average per\-trajectory simulation time on NVIDIA L40S GPUs is 0\.4751 seconds\.

#### BreakFlow

The BreakFlow dataset is constructed to systematically evaluate the performance, scalability, and physical consistency of neural partial differential equation \(PDE\) solvers\. An example of the resulting vorticity field is given in Fig\.[8](https://arxiv.org/html/2605.15399#A1.F8)\.

Each simulation consists of four steps:

1. 1\.Domain generation: we simulate over the square domain\[0,50\]×\[−25,25\]\[0,50\]\\times\[\-25,25\]with an outlet boundary on the right and inlet boundaries on left, top, and bottom\. We generate rectangular obstacles by looping in a nested fashion over the integersx=\{11,…,25\}x=\\\{11,\\dots,25\\\}andy∈\{0,…,15\}y\\in\\\{0,\\dots,15\\\}; at each\(x,y\)\(x,y\)coordinate we place a rectangle with probability 0\.5\. The rectangle’s dimensions are determined by randomly sampling an areaaabetween 4 and 30 \(or 4 and 60 ify=0y=0\), then randomly sampling a heighth∈ℤh\\in\\mathbb\{Z\}and widthw∈ℤw\\in\\mathbb\{Z\}such thatwh=awh=a,w≥2w\\geq 2, andh≥2h\\geq 2\. The rectangle’s rotation is randomly sampled from\{0,15,…,165\}\\\{0,15,\\dots,165\\\}degrees, except ify=0y=0in which case it is either not rotated or sampled from\{0,45\}\\\{0,45\\\}if it is a square \(h=wh=w\)\. If the resulting rectangle overlaps or is closer than distance two to a previously placed rectangle, we do not place it\. Lastly, to make the obstacles symmetric about they=0y=0axis we take all rectangles with centers\(x,y\)\(x,y\)and place a mirrored rectangle at\(x,−y\)\(x,\-y\)\.
2. 2\.Mesh generation: we generate a first\-order triangular mesh using the Gmsh utility\[[10](https://arxiv.org/html/2605.15399#bib.bib20)\]\. The mesh is graded to have edge\-length13\\tfrac\{1\}\{3\}near the obstacles,23\\tfrac\{2\}\{3\}at the edge of a refinement zone\[5,45\]×\[−20,20\]\[5,45\]\\times\[\-20,20\], and 1 at the edge of the domain, with edge\-sizes varying smoothly between them\. To decrease the simulation resolution, we simultaneously increase these mesh sizes\.
3. 3\.Simulation: we run PyFR using the settings of the 2D incompressible cylinder flow example,††[https://pyfr\.readthedocs\.io/en/latest/examples\.html\#d\-incompressible\-cylinder\-flow](https://pyfr.readthedocs.io/en/latest/examples.html#d-incompressible-cylinder-flow)but extending the simulation time to 400\. To decrease the simulation resolution, we increase the timestepping parametersdtandpseudo\-dt\. The average per\-trajectory simulation time on NVIDIA L40S GPUs is 136\.9045 seconds\.
4. 4\.Lastly, we convert the data from an unstructured mesh to a regular64×6464\\times 64grid using the ParaView utility\[[2](https://arxiv.org/html/2605.15399#bib.bib21)\]\. Both these and the original mesh files will be released\.

The dataset spans a Reynolds number range ofRe∈\(10,160\]Re\\in\(10,160\]\. Data generation is organized into three discrete bins, withReResampled uniformly at random within each respective range: \(1\) Bin 1:Re∈\(10,40\];\(2\)Re\\in\(10,40\];\(2\)~Bin 2:Re∈\(40,90\]Re\\in\(40,90\]and \(3\) Bin 3:Re∈\(90,160\]Re\\in\(90,160\]\. The dataset simulated for 400 seconds, yielding 40 frames per trajectory, ensuring most simulation dynamics to enter limited cycles\. For each of the Reynolds bins, we generate 2,000 training trajectory and 50 testing trajectory\. We preserve per\-simulation wallclock time, velocityxx, velocityyy, pressure, and vorticity field in the benchmark dataset\. All experiments are trained onxx\-velocity,yy\-velocity, and pressure fields, and reported all\-field nRMSE error\. The main experiment results are trained and tested on the shuffled combinations for all three bins\.

![Refer to caption](https://arxiv.org/html/2605.15399v1/x8.png)Figure 8:Vorticity snapshots att=0,200,400t=0,200,400of BreakFlow simulations of differentReRebins \(Re∈\(10,40\],\(40,90\]Re\\in\(10,40\],\(40,90\]and\(90,160\]\(90,160\]\) generated via PyFR\. There are inlets boundaries at all sides apart from an outlet on the right\. Black boxes represent obstacles\.

## Appendix BTraining Details

We present training details for training our models to compute breakeven complexity:

- •FFNO\[[31](https://arxiv.org/html/2605.15399#bib.bib24)\]: FFNO is trained with model width of 64, 16 modes and 24 number of layers, learning rate of1×10−31\\times 10^\{\-3\}on periodic equations and1×10−41\\times 10^\{\-4\}on Breakflow, and weight decay of1×10−41\\times 10^\{\-4\}across all equations\. The batch size for training on periodic equations is 100, and is 16 for training on Breakflow\. The FFNO model has a total of 6,698,113 parameters\.
- •EddyFormer\[[7](https://arxiv.org/html/2605.15399#bib.bib23)\]: EddyFormer is trained with model width of 32, FFN width of 128, 10 attention blocks, 8 attention heads, learning rate of learning rate of1×10−31\\times 10^\{\-3\}on periodic equations and1×10−41\\times 10^\{\-4\}on Breakflow, weight decay of1×10−41\\times 10^\{\-4\}across all equations\. The batch size for training on periodic equations is 64, and is 16 for training on Breakflow\. The EddyFormer model has a total of 3,499,171 parameters\.
- •DISCO\[[23](https://arxiv.org/html/2605.15399#bib.bib3)\]: DISCO is trained with patch size 16, start channels 8, 4 down/4 up blocks, 4\-group GroupNorm, 6 heads, 12 processor blocks, learning rate of1×10−31\\times 10^\{\-3\}on periodic equations and1×10−41\\times 10^\{\-4\}on Breakflow, and weight decay of1×10−41\\times 10^\{\-4\}across all equations\. The batch size for training on periodic equations is 64, and is 16 for training on Breakflow\. The DISCO model has a total of 100,765,696 parameters\.
- •DPOT\[[13](https://arxiv.org/html/2605.15399#bib.bib2)\]: DPOT is trained with 8×8 patches, embedding width 512, 4 layers, 4 blocks, MLP ratio 1, output head dimension 32, Adam optimizer, learning rate of1×10−31\\times 10^\{\-3\}on periodic equations and1×10−41\\times 10^\{\-4\}on Breakflow, and weight decay of1×10−41\\times 10^\{\-4\}across all equations\. The batch size for training on periodic equations is 64, and is 16 for training on Breakflow\. The model has a total of 7,010,355 parameters\.
- •Poseidon\[[14](https://arxiv.org/html/2605.15399#bib.bib16)\]: Poseidon is tuned on pretrained Poseidon weights††https://huggingface\.co/collections/camlab\-ethz/poseidon, with learning rate of1×10−31\\times 10^\{\-3\}on periodic equations and1×10−41\\times 10^\{\-4\}on Breakflow, and weight decay of 0\.01 across all equations\. The batch size for Navier\-Stokes\. is 100, for Kuramoto\-Sivashinsky is 300, for Gray\-Scott is 200, and for BreakFlow is 16\. All Poseidon T, B and L model share the same set of training configs\. Poseidon T has 21M parameters\. Poseidon B has 158M parameters\. Poseidon L has 629M parameters\.
- •HalfWalrus\[[21](https://arxiv.org/html/2605.15399#bib.bib22)\]: HalfWalrus is tuned on pretrained HalfWalrus weights with parameter count of6\.4×1086\.4\\times 10^\{8\}and all model details kept the same as in the Walrus paperMcCabeet al\.\[[21](https://arxiv.org/html/2605.15399#bib.bib22), Table\. 2\]\. We tuned with learning rate of1×10−31\\times 10^\{\-3\}, weight decay of 0\.01 across all equations\. The batch size for Naiver Stokes is 100, for Kuramoto Sivashinsky is 300, for Gray\-Scott is 200, and for BreakFlow is 16\.

All training parameters are chosen to be the optimal from parameter sweeping\. Batch sizes and learning rate are chosen in tend to be large to enforce faster convergence and lower up\-front training cost\.

For all training, we warm\-up for 5 training steps \(including forward, backward path, optimizer step, etc\.\)\. We then run another 5 warm\-up training steps and take the average of these 5 warm\-up step time as the estimation of training step time and use this estimation to reset the scheduler\. The training budget are discounted with real Wall\-Clock time counted during training\.

All training was done on NVIDIA L40S GPUs with no jobs running in parallel\.

## Appendix CData scaling fitting procedure

To efficiently estimate breakeven complexity, we utilize neural scaling laws to predict optimal data and compute allocations\. The predictable scaling behavior of neural PDE solvers is well\-documented empirically\[[3](https://arxiv.org/html/2605.15399#bib.bib12),[20](https://arxiv.org/html/2605.15399#bib.bib5)\]\. This is further supported by the Universal Approximation Theorem for operator learning\[[6](https://arxiv.org/html/2605.15399#bib.bib4)\], which posits that approximation error can always be systematically reduced given sufficient model capacity and data\. Because this implies the absence of a fundamental, irreducible error limit in the continuous regime, scaling laws serve as a highly effective and theoretically grounded tool for performance prediction\.

To visualize the trade\-off between training compute and data generation under a fixed wall\-clock budget, we draw a scaling landscape over training budgetCC\(seconds\) and number of training trajectoriesNdataN\_\{data\}\. For a given PDE/model pair, we first fit a parametric scaling lawℒ^\(Ndata,C\)\\widehat\{\\mathcal\{L\}\}\(N\_\{data\},C\)to our measured training runs\. In this work we use a Chinchilla\-style functional form\[[15](https://arxiv.org/html/2605.15399#bib.bib30)\], parameterized by\(L∞,a,α,d,β\)\(L\_\{\\infty\},a,\\alpha,d,\\beta\), whereL∞L\_\{\\infty\}is the irreducible loss floor and the remaining parameters control the data\- and compute\-scaling exponents\. After fitting these parameters on empirical sweep points, we evaluate the fitted loss surface on a dense log\-spaced grid:

C∈\[9×102,7×104\],Ndata∈\[4×102,3×104\]\.C\\in\[9\\\!\\times\\\!10^\{2\},7\\\!\\times\\\!10^\{4\}\],\\qquad N\_\{data\}\\in\[4\\\!\\times\\\!10^\{2\},3\\\!\\times\\\!10^\{4\}\]\.We then plot contour lines ofℒ^\(Ndata,C\)\\widehat\{\\mathcal\{L\}\}\(N\_\{data\},C\)in log–log coordinates \(Fig\. 2\)\. To keep the plot readable, we choose contour levels between the 3rd and 97th percentiles of the grid values and label only a small subset of levels\.

#### Budget\-optimal frontier\.

Given the fitted surfaceℒ^\(N,C\)\\widehat\{\\mathcal\{L\}\}\(N,C\), we compute the*budget\-optimal*number of trajectories

N^data\(C\)∈arg⁡minNdata⁡ℒ^\(Ndata,C\),\\hat\{N\}\_\{data\}\(C\)\\in\\arg\\min\_\{N\_\{data\}\}\\widehat\{\\mathcal\{L\}\}\(N\_\{data\},C\),which traces an “optimal frontier” on the landscape\. In practice,N^data\(C\)\\hat\{N\}\_\{data\}\(C\)has a closed form for the chosen scaling law; we plot it as the thick black curve\. Finally, we overlay the empirical budgets used in experiments \(e\.g\.,C∈\{1000,2000,4000,C\\in\\\{1000,2000,4000,and8000\}8000\\\}seconds\) as markers on the frontier\. This plot provides an interpretable summary of \(i\) the predicted best\-achievable loss at each budget and \(ii\) the implied optimal allocation between more data \(largerNdataN\_\{data\}\) and more optimization compute, which we then use to estimate the budget\-optimal errorεB\\varepsilon\_\{B\}for breakeven computation\.

## Appendix DTime vs\. FLOP choice

Table 1:Effective throughput \(FLOP/sec\) measured on warmed\-up FP32 runs with batch size 100\. The rightmost column reports throughput relative to the standard spectral solver\. Both implemented using Pytorch with same torch profiler FLOP profiling\.Although FLOPs are a common measure of computational cost in ML frameworks\[[15](https://arxiv.org/html/2605.15399#bib.bib30),[1](https://arxiv.org/html/2605.15399#bib.bib1)\], when comparing neural and numerical PDE solvers FLOPs alone are not an adequate fair comparison standard, as the realized throughput \(FLOP/sec\) can differ substantially between workloads\.

We estimate FLOP counts and realized throughput for FFNO training, FFNO inference, and a standard RK4 pseudo\-spectral solver, all implemented in PyTorch and measured using the PyTorch profiler on the same NVIDIA L40S GPU hardware\. We exclude Exponax\[[17](https://arxiv.org/html/2605.15399#bib.bib31)\]from this FLOP\-throughput comparison because it is implemented in JAX, whose profiling and FLOP accounting are not directly comparable to the PyTorch\-based measurements used for model training and inference\. As shown in Table[1](https://arxiv.org/html/2605.15399#A4.T1), FFNO training and inference achieve throughput above3\.5×10113\.5\\times 10^\{11\}FLOP/sec on our hardware, whereas the standard spectral solver achieves3\.92×1093\.92\\times 10^\{9\}FLOP/sec\.

The differences span three orders of magnitude\. We suspect this is because neural PDE solver training and inference are dominated by dense tensor operations that are highly optimized for modern accelerators, whereas classical PDE solvers have different computational patterns and may achieve much lower hardware utilization\. As a result, comparing methods purely through FLOPs would obscure the actual computational burden incurred in practice\. Moreover, many numerical methods depend strongly on CPU performance, memory access patterns, communication, and implementation details, making FLOP\-based comparisons less fair across heterogeneous solver families\.

For the above reasons, and to be consistent with previous cost\-aware researches in the neural PDE solver community\[[3](https://arxiv.org/html/2605.15399#bib.bib12),[22](https://arxiv.org/html/2605.15399#bib.bib25)\], we use wall\-clock as the primary cost measure\. Wall\-clock can better reflect the actual tradeoff when choosing between a neural surrogate and an error\-matched numerical solver\.

## Appendix EInterpreting of Breakeven complexity

Breakeven complexity \(N⋆N^\{\\star\}\) is not an intrinsic property of a neural architecture or PDE\. Rather, it is a deployment\-dependent metric dictated by the chosen classical solver, error metric, hardware, and implementation details\. Its value serves as a calibrated amortization threshold for a specific setup, not a universal ranking of solvers\.

As expected from Eq\. \(1\),N⋆N^\{\\star\}depends on the per\-trajectory cost gap,CδB−CinfC\_\{\\delta\_\{B\}\}\-C\_\{\\mathrm\{inf\}\}\. When this gap is small,N⋆N^\{\\star\}becomes highly sensitive to minor hardware or implementation changes, which can dictate whether a neural solver breaks even at all\. This sensitivity reflects the practical reality of deployment: near the boundary where learned and classical per\-trajectory costs are similar, environmental factors determine the optimal choice\.

Therefore, we treat breakeven complexity primarily as a cost\-accounting tool\. It highlights how training, inference, and baseline costs interact, exposing cases where accuracy\-only evaluations are misleading\. In practice, users can recalibrate Eq\. \(1\) for their specific target workloads and runtime environments\.

## Appendix FFull Experimental Results Tables

We present the full experimental data in Sec\.[4](https://arxiv.org/html/2605.15399#S4)here\. Table[2](https://arxiv.org/html/2605.15399#A6.T2)–[5](https://arxiv.org/html/2605.15399#A6.T5)shows detailed results for Fig\.[3](https://arxiv.org/html/2605.15399#S6.F3)\. The 1k–8k budget training reports the optimal from sweeping, while larger budgets report the results trained using the scaling\-predicted number of training trajectories\. Scaling fitting procedure are demonstrated in Appendix[C](https://arxiv.org/html/2605.15399#A3)\. Data%\\%denotes the percentage of budget allocated to data generation\. Worst\-case results are reported inblue cells, while average\-case results are reported ingreen cells\. Table[6](https://arxiv.org/html/2605.15399#A6.T6)–[8](https://arxiv.org/html/2605.15399#A6.T8)shows detailed results for Fig\.[7](https://arxiv.org/html/2605.15399#S6.F7)\.

Table 2:Full results on Navier\-Stokes for Fig\.[3](https://arxiv.org/html/2605.15399#S6.F3)\. Worst\-case results are shown inblue cells, and average\-case results are shown ingreen cells\.Table 3:Full results on Kuramoto\-Sivashinsky for Fig\.[3](https://arxiv.org/html/2605.15399#S6.F3)\. Worst\-case results are shown inblue cells, and average\-case results are shown ingreen cells\.Table 4:Full results on Gray\-Scott for Fig\.[3](https://arxiv.org/html/2605.15399#S6.F3)\. Worst\-case results are shown inblue cells, and average\-case results are shown ingreen cells\.Table 5:Full results on BreakFlow for Fig\.[3](https://arxiv.org/html/2605.15399#S6.F3)\. Worst\-case results are shown inblue cells, and average\-case results are shown ingreen cells\.Table 6:FFNO results across PDE dimensionality for Kuramoto–Sivashinsky equation for Fig\.[7](https://arxiv.org/html/2605.15399#S6.F7)\.Table 7:FFNO results across rollout lengths for Gray\-Scott equation for Fig\.[7](https://arxiv.org/html/2605.15399#S6.F7)\.Table 8:FFNO results across Reynolds bins on BreakFlow for Fig\.[7](https://arxiv.org/html/2605.15399#S6.F7)\.
Breakeven complexity: A new perspective on neural partial differential equation solvers

Similar Articles

Quantitative Sobolev Approximation Bounds for Neural Operators with Empirical Validation on Burgers Equation

Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects

State-Space NTK Collapse Near Bifurcations

Finite Volume-Informed Neural Network Framework for 2D Shallow Water Equations: Rugged Loss Landscapes and the Importance of Data Guidance

@MaximeRivest: At first glance: > Structural Equation Modeling (SEM/Path Analysis) > Neural Ordinary Differential Equations (Neural OD…

Submit Feedback

Similar Articles

Quantitative Sobolev Approximation Bounds for Neural Operators with Empirical Validation on Burgers Equation
Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects
State-Space NTK Collapse Near Bifurcations
Finite Volume-Informed Neural Network Framework for 2D Shallow Water Equations: Rugged Loss Landscapes and the Importance of Data Guidance
@MaximeRivest: At first glance: > Structural Equation Modeling (SEM/Path Analysis) > Neural Ordinary Differential Equations (Neural OD…