Operator Boosting Produces Pareto-Efficient PDE Surrogates

arXiv cs.LG 06/17/26, 04:00 AM Papers
Summary
Operator Boosting is a stagewise residual-learning framework that constructs compact neural operator surrogates for PDEs by training tiny models on residual fields. It achieves accuracy comparable to or better than full-size models while reducing parameters by up to 95%, demonstrating Pareto improvements on several benchmarks.
arXiv:2606.17460v1 Announce Type: new Abstract: Neural operators are widely used as surrogate solution maps for partial differential equations (PDEs), but full-size models can be costly to store, deploy, and evaluate in many-query scientific workflows. This work introduces Operator Boosting, a stagewise residual-learning framework for constructing compact neural-operator surrogates directly, rather than training a large model and compressing it afterward. Starting from the empirical mean predictor in normalized output coordinates, the method trains a sequence of tiny same-family neural operators on residual fields and incorporates each correction through validation-selected shrinkage. We instantiate the framework with Fourier neural operators (FNOs), DeepONets, and convolutional neural operators (CNOs), and compare boosted tiny stacks against full-size monolithic baselines across one-, two-, and three-dimensional PDE benchmarks from PDEBench, APEBench, and The Well. Across 30 dataset-architecture pairs, 21 show positive mean accuracy gains and 17 have positive confidence intervals, while all boosted stacks reduce trainable parameter count by approximately 72-95%. Best-model comparisons show empirical Pareto improvements on 7 of 10 completed PDE benchmarks, including two-dimensional Navier-Stokes, shallow-water dynamics, Darcy flow, one-dimensional transport and reaction systems, and three-dimensional compressible Navier-Stokes. These results show that Operator Boosting often improves the empirical accuracy-parameter Pareto frontier of neural PDE surrogates, while also exposing PDE- and architecture-dependent regimes where residual boosting fails to offset compression.
Original Article
View Cached Full Text
Cached at: 06/17/26, 05:38 AM
# Operator Boosting Produces Pareto-Efficient PDE Surrogates
Source: [https://arxiv.org/html/2606.17460](https://arxiv.org/html/2606.17460)
\[orcid=0009\-0008\-6030\-7641\]\\cormark\[1\]

\\cortext

\[cor1\]Corresponding author\.

1\]organization=College of Computing, Georgia Institute of Technology, city=Atlanta, state=GA, country=United States

2\]organization=Department of Mathematics and Systems Engineering, Florida Institute of Technology, city=Melbourne, state=FL, country=United States

###### Abstract

Neural operators are widely used as surrogate solution maps for partial differential equations \(PDEs\), but full\-size models can be costly to store, deploy, and evaluate in many\-query scientific workflows\. This work introduces*Operator Boosting*, a stagewise residual\-learning framework for constructing compact neural\-operator surrogates directly, rather than training a large model and compressing it afterward\. Starting from the empirical mean predictor in normalized output coordinates, the method trains a sequence of tiny same\-family neural operators on residual fields and incorporates each correction through validation\-selected shrinkage\. We instantiate the framework with Fourier neural operators \(FNOs\), DeepONets, and convolutional neural operators \(CNOs\), and compare boosted tiny stacks against full\-size monolithic baselines across one\-, two\-, and three\-dimensional PDE benchmarks from PDEBench, APEBench, and The Well\. Across 30 dataset–architecture pairs, 21 show positive mean accuracy gains and 17 have positive confidence intervals, while all boosted stacks reduce trainable parameter count by approximately 72–95%\. Best\-model comparisons show empirical Pareto improvements on 7 of 10 completed PDE benchmarks, including two\-dimensional Navier–Stokes, shallow\-water dynamics, Darcy flow, one\-dimensional transport and reaction systems, and three\-dimensional compressible Navier–Stokes\. These results show that Operator Boosting often improves the empirical accuracy–parameter Pareto frontier of neural PDE surrogates, while also exposing PDE\- and architecture\-dependent regimes where residual boosting fails to offset compression\.

###### keywords:

Neural operators\\sepPDE surrogates\\sepOperator Boosting\\sepFunctional gradient boosting\\sepParameter efficiency\\sepScientific machine learning

## 1Introduction

Partial differential equations \(PDEs\) are central to scientific computing, but repeated high\-fidelity simulation can be prohibitively expensive in many\-query settings such as design optimization, uncertainty quantification, inverse problems, control, digital twins, and real\-time monitoring\. Neural operators address this bottleneck by learning maps between function spaces, allowing a trained surrogate to approximate an entire solution operator rather than a single finite\-dimensional regression map\[kovachki2023neuraloperator\]\. Architectures such as Fourier neural operators \(FNOs\), DeepONets, and convolutional neural operators \(CNOs\) have therefore become standard tools for PDE surrogate modeling\[li2021fourier,lu2021deeponet,raonic2023convolutional\]\.

As neural operators move toward deployable scientific workflows, the relevant bottleneck is not only predictive fidelity but also model size, memory footprint, throughput, and hardware availability\. Full\-size monolithic neural operators can be costly to store and evaluate, especially for high\-resolution or three\-dimensional fields\. This has motivated parameter\-efficient operator\-learning methods based on factorization, tensorization, multigrid structure, mixed precision, pruning, quantization, and related compression mechanisms\[tran2023factorized,kossaifi2024multigrid,tu2024guaranteed,han2016deepcompressioncompressingdeep,novikov2015tensorizingNeuralNetworks,lu2025tensorcompressedfullyquantizedtrainingneural\]\. These constraints are especially important for edge and low\-SWaP scientific systems, where surrogates may need to run near sensors, actuators, or embedded controllers rather than on centralized GPU servers\[howes2026realtimesensinginaccessiblephysical\]\.

This work studies a complementary route to parameter efficiency\. Instead of training one large operator and then compressing it, we ask whether compact surrogates can be built directly as stagewise additive stacks of small neural operators\. The idea is motivated by classical boosting, where predictors are constructed sequentially by fitting each new learner to the residual error of the current ensemble\. Functional\-gradient interpretations of boosting formalize this as greedy optimization in function space, and squared\-loss boosting reduces naturally to residual fitting\[mason1999boostingGradientDescent,friedman2001gradientBoostingMachine,buhlmann2003boostingL2Loss\]\. For PDE surrogates, this viewpoint suggests that early small operators may capture dominant coarse structure in the solution map, while later residual operators focus capacity on systematic errors left by earlier stages, such as fine\-scale features, spectral errors, boundary effects\[shikhman2026one\], or architecture\-specific deficiencies\. Recent spectral analyses of FNOs similarly indicate that residual correction can recover information missed by an initial operator\[qin2024betterunderstandingfourierneural\]\.

We introduce*Operator Boosting*, a model\-family\-agnostic framework for constructing parameter\-efficient PDE surrogates from existing neural\-operator architectures\. In the experiments, the framework is applied separately within each architecture family: a full FNO is compared against a boosted stack of tiny FNOs, a full DeepONet against a boosted stack of tiny DeepONets, and a full CNO against a boosted stack of tiny CNOs\. This within\-family protocol isolates the effect of stagewise residual learning from the effect of changing the operator architecture\. The central question is whether homogeneous boosted stacks can improve the deployment\-relevant accuracy–parameter Pareto frontier relative to full\-size monolithic baselines\. The results should not be read as showing that residual boosting universally improves neural operators\. Rather, they show that boosting is a simple wrapper that often improves the empirical accuracy–parameter tradeoff, while its failures reveal when the residual structure is poorly matched to the available tiny operator class\.

The contributions of this work are:

- •We formulate*Operator Boosting*as a stagewise functional\-gradient framework for PDE surrogate modeling, where tiny neural operators are trained sequentially on residual fields and added with validation\-selected shrinkage\.
- •We instantiate the framework with FNO, DeepONet, and CNO base learners, using within\-family comparisons to separate the effect of the boosting procedure from the effect of changing architecture class\.
- •We evaluate boosted tiny stacks against full\-size monolithic baselines across one\-, two\-, and three\-dimensional PDE benchmarks from PDEBench, APEBench, and The Well\[takamoto2022pdebench,koehler2024apebench,ohana2024thewell\]\.
- •We report both predictive error and trainable parameter count, emphasizing empirical accuracy–parameter Pareto tradeoffs rather than relativeL2L^\{2\}error alone\.
- •We identify Pareto\-improving regimes, where boosted stacks achieve comparable or lower error with substantially fewer parameters, and failure regimes, where residual boosting exposes PDE\- or architecture\-dependent compression limits\.

Figure[1](https://arxiv.org/html/2606.17460#S1.F1)demonstrates the difference between a full\-size monolithic neural operator and the boosted residual stack used in Operator Boosting\.

input fieldaafull\-sizeoperatorGfullfG\_\{\\mathrm\{full\}\}^\{f\}full predictionu^full\\widehat\{u\}\_\{\\mathrm\{full\}\}monolithic baselineinput fieldaastage 2H2fH\_\{2\}^\{f\}stage 1H1fH\_\{1\}^\{f\}stageMMHMfH\_\{M\}^\{f\}∑\\sumboosted predictionu^boost\\widehat\{u\}\_\{\\mathrm\{boost\}\}η1\\eta\_\{1\}η2\\eta\_\{2\}ηM\\eta\_\{M\}boosted stackarchitecture familyf∈\{FNO,DeepONet,CNO\}f\\in\\\{\\mathrm\{FNO\},\\mathrm\{DeepONet\},\\mathrm\{CNO\}\\\}Figure 1:Operator Boosting replaces a full\-size monolithic neural operator with an additive stack of residual operators\. Hereaadenotes the input field,G0\(a\)=0G\_\{0\}\(a\)=0is the normalized\-output mean initializer, and each stage contributes a validation\-shrunk correctionηmHmf\(a\)\\eta\_\{m\}H\_\{m\}^\{f\}\(a\)to form the boosted prediction\.
## 2Related Work

### 2\.1Neural operators for PDE surrogate modeling

Neural operators learn maps between function spaces and have become a standard approach for PDE surrogate modeling, where the goal is to approximate solution operators mapping coefficients, initial conditions, forcing terms, or previous states to solution fields\[kovachki2023neuraloperator\]\. Early graph\-based formulations include graph kernel and multipole graph neural operators\[anandkumar2019neural,li2020multipole\]\. Among widely used architectures, DeepONet represents operators through branch and trunk networks\[lu2021deeponet\], while Fourier neural operators learn global spectral updates on regular grids\[li2021fourier\]and have been extended to more general geometries through learned deformations\[li2023geofno\]\. Convolutional neural operators provide a multiscale convolutional alternative for robust PDE learning\[raonic2023convolutional\]\. Benchmark suites such as PDEBench have further standardized empirical evaluation across PDE families\[takamoto2022pdebench\]\. In this work, FNO, DeepONet, and CNO serve as representative operator families for studying whether homogeneous stagewise residual boosting can improve the accuracy–parameter Pareto frontier of PDE surrogates under severe parameter constraints\.

### 2\.2Ensembles, boosting, and residual learning

Ensemble methods have long been used to improve predictive accuracy and stability by combining multiple learned predictors\. Early neural\-network ensembles showed that aggregating independently trained models can reduce generalization error\[hansen1990neural\], while bagging introduced bootstrap aggregation as a general\-purpose variance\-reduction strategy\[breiman1996bagging\]\. Boosting provides a more structured form of ensembling: rather than training predictors independently, it builds an additive model stage by stage, with later learners correcting the errors of earlier ones\. This idea originates in the AdaBoost literature\[freund1997decisionTheoreticBoosting\]and was later interpreted as gradient descent in function space\[mason1999boostingGradientDescent\]\. Friedman’s gradient boosting framework formalized boosting as greedy functional approximation by fitting weak learners to pseudo\-residuals\[friedman2001gradientBoostingMachine\], with stochastic variants improving robustness and scalability\[friedman2002stochasticGradientBoosting\]\. For squared\-error regression, boosting with theL2L\_\{2\}loss is especially close to residual fitting, since each stage is trained to approximate the remaining prediction error\[buhlmann2003boostingL2Loss\]\. Modern deep learning has also made extensive use of ensemble and residual principles: deep ensembles remain a simple and effective approach for improving predictive reliability and uncertainty estimation\[lakshminarayanan2017deepEnsembles\], while residual networks showed that learning additive corrections can make very deep models easier to optimize\[he2016deepResidualLearning\]\. Operator Boosting transfers this residual\-fitting principle to neural\-operator surrogate modeling, where each weak learner is itself an operator mapping input fields to output fields\.

### 2\.3Parameter\-efficient scientific machine learning

Parameter efficiency has long been central to scientific computing, where high\-fidelity PDE and dynamical\-system solvers are often too expensive for repeated evaluation, uncertainty quantification, control, or design optimization\. Classical reduced\-order modeling addresses this issue by projecting high\-dimensional dynamics onto compact approximation spaces\[benner2015projectionModelReduction\], while reduced\-basis methods provide certified low\-dimensional surrogates for parametrized PDEs and related field problems\[chen2010certifiedReducedBasisMaxwell\]\. In deep learning, parameter\-efficient modeling has been studied through several complementary mechanisms, including knowledge distillation from large models or ensembles into smaller students\[hinton2015distillingknowledgeneuralnetwork\], pruning and quantization pipelines for compressed neural networks\[han2016deepcompressioncompressingdeep\], tensor\-train representations of neural\-network layers\[novikov2015tensorizingNeuralNetworks\], and sparse trainable subnetworks\[frankle2018the\]\. More recent scientific machine\-learning work has adapted these ideas to neural PDE solvers and neural operators\. Factorized Fourier neural operators reduce the cost of spectral operator learning by factorizing the Fourier representation\[tran2023factorized\], while multi\-grid tensorized Fourier neural operators combine domain decomposition and tensorized parameter representations for high\-resolution PDE surrogates\[kossaifi2024multigrid\]\. Mixed\-precision neural operators target memory and throughput bottlenecks in FNO\-style models while preserving approximation guarantees\[tu2024guaranteed\], and tensor\-compressed, fully quantized neural PDE solvers further illustrate the growing interest in deployment\-oriented scientific machine learning\[lu2025tensorcompressedfullyquantizedtrainingneural\]\. The present work is complementary to these compression and efficiency strategies\. Rather than compressing a trained full\-size operator, changing numerical precision, or imposing a low\-rank parameterization on a single monolithic architecture, we study whether a stagewise additive stack of tiny neural operators can improve the empirical accuracy–parameter Pareto frontier directly\.

## 3Problem Setting

### 3\.1PDE surrogate modeling

LetΩ⊂ℝd\\Omega\\subset\\mathbb\{R\}^\{d\}denote the spatial domain and letaadenote the input field specifying the instance of a PDE surrogate problem\. Depending on the benchmark,aamay represent an initial condition, a previous solution state, a coefficient field, a forcing field, or a collection of such quantities\. The target field is denoted byuu\. The objective is to learn an operator

𝒢:𝒜→𝒰,a↦u,\\mathcal\{G\}:\\mathcal\{A\}\\to\\mathcal\{U\},\\qquad a\\mapsto u,\(1\)from paired observations

𝒟=\{\(ai,ui\)\}i=1N\.\\mathcal\{D\}=\\\{\(a\_\{i\},u\_\{i\}\)\\\}\_\{i=1\}^\{N\}\.\(2\)For static problems such as Darcy flow,aais a coefficient or permeability field anduuis the corresponding solution field\. For time\-dependent problems, we construct one\-step or finite\-time transition pairs of the form

ai=\(u\(ti\),Δti\),ui=u\(ti\+Δti\),a\_\{i\}=\\bigl\(u\(t\_\{i\}\),\\Delta t\_\{i\}\\bigr\),\\qquad u\_\{i\}=u\(t\_\{i\}\+\\Delta t\_\{i\}\),\(3\)so that the learned operator approximates a time\-advancement map\. All models are trained and evaluated on discretized fields, but the surrogate\-learning objective is operator\-valued: the learned map should approximate the underlying input–output relation between fields rather than a scalar response\.

### 3\.2Full\-size baseline and boosted\-stack comparison

For each dataset and architecture familyff, we compare two predictors\. The first is a full\-size monolithic neural operator,

Gfullf:𝒜→𝒰,G\_\{\\mathrm\{full\}\}^\{f\}:\\mathcal\{A\}\\to\\mathcal\{U\},\(4\)trained directly to predict the normalized output field\. The second is a boosted stack of tiny same\-family neural operators,

Gboostf\(a\)=G0\(a\)\+∑m=1MηmHmf\(a\),G\_\{\\mathrm\{boost\}\}^\{f\}\(a\)=G\_\{0\}\(a\)\+\\sum\_\{m=1\}^\{M\}\\eta\_\{m\}H\_\{m\}^\{f\}\(a\),\(5\)where eachHmfH\_\{m\}^\{f\}is a small neural operator from the same architecture family as the full baseline andηm\\eta\_\{m\}is a validation\-selected shrinkage coefficient\. In our experiments,G0\(a\)=0G\_\{0\}\(a\)=0in normalized output coordinates, corresponding after denormalization to the empirical mean output field\. Thus the boosted predictor contains no learned initialization model; its trainable parameters are entirely contained in the stagewise correction operatorsHmfH\_\{m\}^\{f\}\.

This comparison is performed within each architecture family\. A full FNO is compared against boosted tiny FNOs, a full DeepONet against boosted tiny DeepONets, and a full CNO against boosted tiny CNOs\. This design isolates the effect of stagewise residual learning from the effect of switching model classes\. The total boosted parameter count is

Pboostf=∑m=1MP\(Hmf\),P\_\{\\mathrm\{boost\}\}^\{f\}=\\sum\_\{m=1\}^\{M\}P\(H\_\{m\}^\{f\}\),\(6\)and, when all stages use the same tiny architecture,

Pboostf=MPtinyf\.P\_\{\\mathrm\{boost\}\}^\{f\}=MP\_\{\\mathrm\{tiny\}\}^\{f\}\.\(7\)The full\-size baseline parameter count is denoted byPfullfP\_\{\\mathrm\{full\}\}^\{f\}\.

### 3\.3Evaluation metrics and empirical Pareto dominance

The primary accuracy metric is test relativeL2L^\{2\}error\. For a predicted fieldu^\\widehat\{u\}and target fielduu, we compute

RelL2\(u^,u\)=‖u^−u‖2‖u‖2\+ε,\\mathrm\{RelL2\}\(\\widehat\{u\},u\)=\\frac\{\\\|\\widehat\{u\}\-u\\\|\_\{2\}\}\{\\\|u\\\|\_\{2\}\+\\varepsilon\},\(8\)whereε\>0\\varepsilon\>0is a small numerical stabilizer\. Relative errors are computed after denormalizing predictions back to physical coordinates and are averaged over the test set\.

For each dataset, architecture family, and random seed, we define the performance delta

Δperf=100RelL2full−RelL2boostRelL2full\.\\Delta\_\{\\mathrm\{perf\}\}=100\\frac\{\\mathrm\{RelL2\}\_\{\\mathrm\{full\}\}\-\\mathrm\{RelL2\}\_\{\\mathrm\{boost\}\}\}\{\\mathrm\{RelL2\}\_\{\\mathrm\{full\}\}\}\.\(9\)ThusΔperf\>0\\Delta\_\{\\mathrm\{perf\}\}\>0indicates that the boosted tiny stack outperforms the full\-size monolithic baseline, whileΔperf<0\\Delta\_\{\\mathrm\{perf\}\}<0indicates degradation\. We also report the size delta

Δsize=100Pboost−PfullPfull,\\Delta\_\{\\mathrm\{size\}\}=100\\frac\{P\_\{\\mathrm\{boost\}\}\-P\_\{\\mathrm\{full\}\}\}\{P\_\{\\mathrm\{full\}\}\},\(10\)so that negative values indicate parameter reduction\. Equivalently, the size reduction is−Δsize\-\\Delta\_\{\\mathrm\{size\}\}\.

We say that a boosted stack empirically Pareto\-dominates a full\-size baseline when it achieves no larger mean test relativeL2L^\{2\}error while using fewer trainable parameters\. That is, modelAAdominates modelBBif

RelL2¯\(A\)≤RelL2¯\(B\),PA<PB,\\overline\{\\mathrm\{RelL2\}\}\(A\)\\leq\\overline\{\\mathrm\{RelL2\}\}\(B\),\\qquad P\_\{A\}<P\_\{B\},\(11\)with at least one strict improvement\. In finite\-seed experiments, we report accuracy and parameter\-count changes separately rather than collapsing them into a scalar score\. This distinguishes strong Pareto improvements, where the boosted stack is both smaller and more accurate, from compression tradeoffs, where parameter count is reduced at the cost of some predictive degradation\. Figure[2](https://arxiv.org/html/2606.17460#S3.F2)illustrates this normalized accuracy–parameter Pareto plane and shows representative dominating and tradeoff cases\.

00\.20\.20\.40\.40\.60\.60\.80\.81100\.50\.511best fullNS2D/FNOshallow/CNOCNS3D/FNOactive/FNOMHD/CNOsmaller andno less accurateparameter ratioPmodel/PbestfullP\_\{\\mathrm\{model\}\}/P\_\{\\mathrm\{best\\ full\}\}error ratioRelL2model/RelL2bestfull\\mathrm\{RelL2\}\_\{\\mathrm\{model\}\}/\\mathrm\{RelL2\}\_\{\\mathrm\{best\\ full\}\}Figure 2:Accuracy–parameter Pareto plane normalized by the best full\-size baseline for each PDE\. Points below and to the left of\(1,1\)\(1,1\)empirically Pareto\-dominate the best full\-size baseline\. Representative boosted stacks show both strong Pareto improvements and compression tradeoffs\.

## 4Operator Boosting

### 4\.1Stagewise additive operator model

Let𝒜\\mathcal\{A\}denote the input function space and𝒰\\mathcal\{U\}the output function space\. Given training data

𝒟tr=\{\(ai,ui\)\}i=1Ntr,ai∈𝒜,ui∈𝒰,\\mathcal\{D\}\_\{\\mathrm\{tr\}\}=\\\{\(a\_\{i\},u\_\{i\}\)\\\}\_\{i=1\}^\{N\_\{\\mathrm\{tr\}\}\},\\qquad a\_\{i\}\\in\\mathcal\{A\},\\quad u\_\{i\}\\in\\mathcal\{U\},\(12\)the goal is to approximate an operator𝒢:𝒜→𝒰\\mathcal\{G\}:\\mathcal\{A\}\\to\\mathcal\{U\}\. We work in normalized output coordinates\. Letμy\\mu\_\{y\}andσy\\sigma\_\{y\}denote the empirical mean and standard deviation of the training outputs, computed over the training split, and define

u~i=ui−μyσy\.\\widetilde\{u\}\_\{i\}=\\frac\{u\_\{i\}\-\\mu\_\{y\}\}\{\\sigma\_\{y\}\}\.\(13\)Operator Boosting represents the surrogate as an additive operator\-valued model

GM\(a\)=G0\(a\)\+∑m=1MηmHm\(a\),G\_\{M\}\(a\)=G\_\{0\}\(a\)\+\\sum\_\{m=1\}^\{M\}\\eta\_\{m\}H\_\{m\}\(a\),\(14\)where eachHm:𝒜→𝒰H\_\{m\}:\\mathcal\{A\}\\to\\mathcal\{U\}is a small neural operator andηm≥0\\eta\_\{m\}\\geq 0is a scalar shrinkage parameter\. In this work,G0\(a\)=0G\_\{0\}\(a\)=0in normalized coordinates, corresponding to the empirical mean predictorμy\\mu\_\{y\}after denormalization\. Thus the physical prediction is

u^M\(a\)=μy\+σyGM\(a\)\.\\widehat\{u\}\_\{M\}\(a\)=\\mu\_\{y\}\+\\sigma\_\{y\}G\_\{M\}\(a\)\.\(15\)This construction is a functional\-gradient analogue of classical boosting, where an additive predictor is built by sequentially fitting weak learners to residual or pseudo\-residual targets\[mason1999boostingGradientDescent,friedman2001gradientBoostingMachine,buhlmann2003boostingL2Loss\]\. Here the weak learners are neural operators mapping input fields to output fields\.

### 4\.2Residual fitting and shrinkage selection

For squared\-error loss in normalized coordinates, the negative functional gradient at stagemmis the residual field

ri\(m\)=u~i−Gm−1\(ai\)\.r\_\{i\}^\{\(m\)\}=\\widetilde\{u\}\_\{i\}\-G\_\{m\-1\}\(a\_\{i\}\)\.\(16\)For a fixed operator familyff, the next correction is trained in the tiny model classℋf,tiny\\mathcal\{H\}\_\{f,\\mathrm\{tiny\}\}:

Hmf∈arg⁡minH∈ℋf,tiny⁡1Ntr∑i=1Ntr‖H\(ai\)−ri\(m\)‖22\+βℛspec\(H\(ai\)−ri\(m\)\),H\_\{m\}^\{f\}\\in\\arg\\min\_\{H\\in\\mathcal\{H\}\_\{f,\\mathrm\{tiny\}\}\}\\frac\{1\}\{N\_\{\\mathrm\{tr\}\}\}\\sum\_\{i=1\}^\{N\_\{\\mathrm\{tr\}\}\}\\left\\\|H\(a\_\{i\}\)\-r\_\{i\}^\{\(m\)\}\\right\\\|\_\{2\}^\{2\}\+\\beta\\,\\mathcal\{R\}\_\{\\mathrm\{spec\}\}\\\!\\left\(H\(a\_\{i\}\)\-r\_\{i\}^\{\(m\)\}\\right\),\(17\)whereℛspec\\mathcal\{R\}\_\{\\mathrm\{spec\}\}is an optional Fourier\-weighted residual penalty\. After trainingHmfH\_\{m\}^\{f\}, the shrinkage coefficient is selected on the validation set:

ηm∈arg⁡minη∈Λ⁡1Nval∑j=1Nval‖Gm−1\(ajval\)\+ηHmf\(ajval\)−u~jval‖22,\\eta\_\{m\}\\in\\arg\\min\_\{\\eta\\in\\Lambda\}\\frac\{1\}\{N\_\{\\mathrm\{val\}\}\}\\sum\_\{j=1\}^\{N\_\{\\mathrm\{val\}\}\}\\left\\\|G\_\{m\-1\}\(a\_\{j\}^\{\\mathrm\{val\}\}\)\+\\eta H\_\{m\}^\{f\}\(a\_\{j\}^\{\\mathrm\{val\}\}\)\-\\widetilde\{u\}\_\{j\}^\{\\mathrm\{val\}\}\\right\\\|\_\{2\}^\{2\},\(18\)whereΛ\\Lambdais a finite candidate grid containingη=0\\eta=0\. Includingη=0\\eta=0makes each stage validation\-safe: if a trained correction does not reduce validation error, it can be rejected\. This validation\-selected shrinkage plays a regularizing role analogous to shrinkage and early stopping in classical boosting\[zhang2005boostingEarlyStopping\]\. The boosted surrogate is then updated by

Gm\(a\)=Gm−1\(a\)\+ηmHmf\(a\)\.G\_\{m\}\(a\)=G\_\{m\-1\}\(a\)\+\\eta\_\{m\}H\_\{m\}^\{f\}\(a\)\.\(19\)All primary reported results use the finalMM\-stage predictor rather than selecting the best test\-stage post hoc\. Figure[3](https://arxiv.org/html/2606.17460#S4.F3)summarizes one residual\-fitting stage of the procedure\.

current stackGm−1G\_\{m\-1\}compute residualri\(m\)=u~i−Gm−1\(ai\)r\_\{i\}^\{\(m\)\}=\\widetilde\{u\}\_\{i\}\-G\_\{m\-1\}\(a\_\{i\}\)fit tiny operatorHmf\(ai\)≈ri\(m\)H\_\{m\}^\{f\}\(a\_\{i\}\)\\approx r\_\{i\}^\{\(m\)\}validation shrinkageηm=arg⁡minη∈Λ⁡ℒval\\eta\_\{m\}=\\arg\\min\_\{\\eta\\in\\Lambda\}\\mathcal\{L\}\_\{\\mathrm\{val\}\}update stackGm=Gm−1\+ηmHmfG\_\{m\}=G\_\{m\-1\}\+\\eta\_\{m\}H\_\{m\}^\{f\}next stageor final predictortraining split0∈Λ0\\in\\Lambdato rejectbad correctionsFigure 3:One stage of same\-family Operator Boosting\. Each stage fits the current residual field using a tiny operator from the fixed familyff, selects a scalar shrinkage coefficient on validation data, and adds the shrunken correction to the current stack\.
### 4\.3Algorithm

Algorithm[1](https://arxiv.org/html/2606.17460#alg1)summarizes the training procedure for a fixed operator familyff\.

Algorithm 1Operator Boosting1:Training set

𝒟tr\\mathcal\{D\}\_\{\\mathrm\{tr\}\}, validation set

𝒟val\\mathcal\{D\}\_\{\\mathrm\{val\}\}, operator family

ff, number of stages

MM, shrinkage grid

Λ\\Lambda
2:Compute output normalization statistics

μy\\mu\_\{y\}and

σy\\sigma\_\{y\}on

𝒟tr\\mathcal\{D\}\_\{\\mathrm\{tr\}\}
3:Normalize outputs by

u~=\(u−μy\)/σy\\widetilde\{u\}=\(u\-\\mu\_\{y\}\)/\\sigma\_\{y\}
4:Initialize

G0\(a\)=0G\_\{0\}\(a\)=0
5:for

m=1,…,Mm=1,\\ldots,Mdo

6:Compute residual targets

ri\(m\)=u~i−Gm−1\(ai\)r\_\{i\}^\{\(m\)\}=\\widetilde\{u\}\_\{i\}\-G\_\{m\-1\}\(a\_\{i\}\)for all training samples

7:Train a tiny neural operator

Hmf∈ℋf,tinyH\_\{m\}^\{f\}\\in\\mathcal\{H\}\_\{f,\\mathrm\{tiny\}\}on

\{\(ai,ri\(m\)\)\}i=1Ntr\\\{\(a\_\{i\},r\_\{i\}^\{\(m\)\}\)\\\}\_\{i=1\}^\{N\_\{\\mathrm\{tr\}\}\}
8:Select

ηm∈arg⁡minη∈Λ∑\(aj,u~j\)∈𝒟val‖Gm−1\(aj\)\+ηHmf\(aj\)−u~j‖22\\eta\_\{m\}\\in\\arg\\min\_\{\\eta\\in\\Lambda\}\\sum\_\{\(a\_\{j\},\\widetilde\{u\}\_\{j\}\)\\in\\mathcal\{D\}\_\{\\mathrm\{val\}\}\}\\left\\\|G\_\{m\-1\}\(a\_\{j\}\)\+\\eta H\_\{m\}^\{f\}\(a\_\{j\}\)\-\\widetilde\{u\}\_\{j\}\\right\\\|\_\{2\}^\{2\}
9:Update

Gm\(a\)=Gm−1\(a\)\+ηmHmf\(a\)G\_\{m\}\(a\)=G\_\{m\-1\}\(a\)\+\\eta\_\{m\}H\_\{m\}^\{f\}\(a\)
10:endfor

11:returnDenormalized predictor

u^M\(a\)=μy\+σyGM\(a\)\\widehat\{u\}\_\{M\}\(a\)=\\mu\_\{y\}\+\\sigma\_\{y\}G\_\{M\}\(a\)

### 4\.4Instantiation across model classes

The framework is model\-family agnostic\. We instantiate it with FNO, DeepONet, and CNO\. Fourier neural operators use spectral convolution layers to learn global updates in Fourier space\[li2021fourier\]; DeepONets use a branch–trunk decomposition motivated by universal approximation of nonlinear operators\[lu2021deeponet\]; and convolutional neural operators use multiscale convolutional structure for operator learning on spatial fields\[raonic2023convolutional\]\. In experiments, allHmfH\_\{m\}^\{f\}within a stack use the same tiny architecture, so

Pboostf=MPtinyf\.P\_\{\\mathrm\{boost\}\}^\{f\}=MP\_\{\\mathrm\{tiny\}\}^\{f\}\.\(20\)SinceG0G\_\{0\}is the zero predictor in normalized coordinates, it contributes no trainable parameters\.

## 5Experimental Design

### 5\.1Datasets

We evaluate Operator Boosting on a collection of one\-, two\-, and three\-dimensional PDE surrogate\-modeling tasks\. The one\-dimensional benchmarks are standardized advection, Burgers, and reaction–diffusion datasets\. The two\-dimensional benchmarks include Darcy flow and reaction–diffusion from PDEBench\[takamoto2022pdebench\], incompressible Navier–Stokes from APEBench\[koehler2024apebench\], shallow\-water dynamics from PDEBench, and active matter from The Well\[ohana2024thewell\]\. The three\-dimensional benchmarks are compressible Navier–Stokes from PDEBench and magnetohydrodynamics from The Well\. For standardized datasets, each example is represented as a direct input–output pair\(a,u\)\(a,u\)\. For trajectory datasets, we construct supervised operator\-learning pairs by sampling states separated by a time offsetΔt\\Delta t, using the earlier state and normalized time offset as input and the later state as target\. All datasets are normalized using statistics computed only from the training split\.

The benchmark collection covers several mechanics\-relevant surrogate regimes: transport\-dominated dynamics in advection and Burgers problems, reaction–diffusion dynamics, elliptic porous\-media\-type prediction through Darcy flow, incompressible and compressible fluid dynamics through Navier–Stokes benchmarks, free\-surface geophysical flow through shallow\-water dynamics, and magnetohydrodynamic evolution\. This range is intended to test whether the boosting mechanism is specific to simple low\-dimensional PDEs or persists in higher\-dimensional flow and field\-evolution settings\.

### 5\.2Baselines

For each PDE and architecture family, we compare a full\-size monolithic neural operator against a boosted stack of tiny same\-family operators\. The comparisons are full FNO versus boosted tiny FNOs, full DeepONet versus boosted tiny DeepONets, and full CNO versus boosted tiny CNOs\. This protocol separates the effect of the boosting procedure from the effect of changing architecture class\. We also report best\-model Pareto comparisons in which, for each PDE, the best full\-size architecture is compared with the best boosted tiny stack among the completed runs\. Parameter counts are computed directly from trainable model parameters\. The present study focuses on full\-size monolithic baselines because the primary question is whether stagewise stacks of tiny operators can improve the deployment\-relevant accuracy–parameter frontier relative to standard full\-size neural operators\. Parameter\-matched tiny monolithic models and independently trained tiny ensembles are natural additional baselines for future ablation\.

### 5\.3Training protocol

All models are trained under a fixed\-budget terminal\-iterate protocol\. Full\-size baselines are trained for a prescribed number of epochs and evaluated at the final epoch\. Each boosted correction stage is also trained for the same prescribed stage budget and evaluated at its final epoch\. At stagemm, the current residualri\(m\)=u~i−Gm−1\(ai\)r\_\{i\}^\{\(m\)\}=\\widetilde\{u\}\_\{i\}\-G\_\{m\-1\}\(a\_\{i\}\)is used as the target for the next tiny operator\. After training the correction operator, the shrinkage coefficientηm\\eta\_\{m\}is selected on the validation set from a finite gridΛ\\Lambdacontaining0\. Unless otherwise stated, each boosted stack uses three stages\. Full baselines are trained with mean\-squared error, while boosted residual stages may include a small Fourier\-weighted residual penalty to emphasize high\-frequency residual structure\. All reported relative errors are computed after denormalizing the predicted fields back to physical coordinates\.

Operator Boosting reduces deployed parameter count, but it does not necessarily reduce training cost\. Each boosted stack requires trainingMMresidual operators, whereas each full\-size baseline requires training a single monolithic model\. The reported parameter counts therefore characterize deployed surrogate size rather than total training compute\. This distinction is important in offline surrogate construction, where additional training cost may be acceptable if the resulting model is repeatedly deployed in many\-query workflows\.

### 5\.4Statistical aggregation

Results are aggregated over random seeds using seed\-level means and 95% confidence intervals\. For each PDE and model family, we report the mean performance delta, confidence interval, number of wins, mean full\-baseline error, mean boosted\-stack error, and parameter reduction\. In addition to within\-family comparisons, we report best\-model Pareto tables comparing the best full\-size surrogate and the best boosted tiny surrogate for each PDE\.

## 6Results

### 6\.1Within\-family accuracy–parameter tradeoffs

Across all completed experiments, Operator Boosting improves the mean relative error in 21 of 30 dataset–architecture pairs, with 17 pairs having confidence intervals whose lower endpoint is positive\. All boosted stacks reduce trainable parameter count, with mean reductions ranging from approximately 72% to 95%\. The strongest gains occur for FNO on two\-dimensional Navier–Stokes, where the boosted stack reduces mean relative error by 74\.8% while using 94\.5% fewer parameters, and for CNO and DeepONet on shallow\-water dynamics, which obtain 51\.9% and 45\.2% reductions in relative error with 74\.3% and 88\.1% fewer parameters, respectively\. Three\-dimensional results further support the Pareto\-efficiency interpretation: DeepONet improves on compressible Navier–Stokes by 27\.6% with 72\.2% fewer parameters, while FNO improves on MHD by 4\.4% with 95\.2% fewer parameters\. At the same time, the failures are not isolated numerical noise: several PDE–architecture pairs exhibit negative mean gains, including CNO on Navier–Stokes, FNO on two\-dimensional reaction–diffusion, and CNO on MHD\. These failures motivate reporting the full accuracy–parameter tradeoff rather than only the successful Pareto\-dominating cases\. Figure[4](https://arxiv.org/html/2606.17460#S6.F4)summarizes the aggregate win counts across the completed experiments\.

05510101515202025253030Best\-model Pareto winsCI\-positive pairsMean\-positive pairsDataset–family pairs77171721213030countFigure 4:Aggregate summary of the completed homogeneous Operator Boosting experiments\. Across 30 dataset–family pairs, 21 have positive mean accuracy gains and 17 have completely positive confidence intervals\. In best\-model comparisons, boosted stacks empirically Pareto\-dominate the best full\-size baseline on 7 of 10 PDE benchmarks\.Table[1](https://arxiv.org/html/2606.17460#S6.T1)reports the primary within\-family comparison\. PositiveΔperf\\Delta\_\{\\mathrm\{perf\}\}means that the boosted tiny stack improves over the full\-size baseline\.

Table 1:Within\-family accuracy–parameter tradeoffs grouped by architecture family\. PositiveΔperf\\Delta\_\{\\mathrm\{perf\}\}indicates lower test relativeL2L^\{2\}error for the boosted tiny stack than for the full\-size baseline\.DatasetnnWinsRelL2 full→\\toboostedΔperf\\Delta\_\{\\mathrm\{perf\}\}\(%\)95% CISize red\. \(%\)FNO1D advection1010/100\.0592→0\.03050\.0592\\to 0\.030544\.2\[31\.2, 57\.2\]86\.21D Burgers104/100\.1095→0\.09670\.1095\\to 0\.0967\-2\.0\[\-21\.0, 17\.0\]86\.21D reaction–diffusion1010/107\.580×10−3→4\.074×10−37\.580\{\\times\}10^\{\-3\}\\to 4\.074\{\\times\}10^\{\-3\}43\.3\[33\.1, 53\.5\]86\.22D Navier–Stokes1010/100\.0805→0\.01880\.0805\\to 0\.018874\.8\[69\.2, 80\.4\]94\.52D Darcy106/100\.1383→0\.13340\.1383\\to 0\.13343\.2\[\-2\.3, 8\.7\]94\.52D reaction–diffusion103/100\.0132→0\.01470\.0132\\to 0\.0147\-20\.6\[\-43\.4, 2\.3\]94\.52D shallow water109/101\.921×10−3→1\.475×10−31\.921\{\\times\}10^\{\-3\}\\to 1\.475\{\\times\}10^\{\-3\}19\.9\[6\.6, 33\.2\]94\.52D active matter101/100\.6575→0\.72060\.6575\\to 0\.7206\-11\.1\[\-24\.0, 1\.9\]94\.53D compressible Navier–Stokes107/100\.1790→0\.13680\.1790\\to 0\.136817\.3\[\-1\.8, 36\.5\]95\.23D MHD1010/100\.5104→0\.48790\.5104\\to 0\.48794\.4\[4\.1, 4\.7\]95\.2DeepONet1D advection1010/100\.5009→0\.41270\.5009\\to 0\.412717\.3\[11\.9, 22\.8\]87\.61D Burgers1010/100\.3620→0\.28060\.3620\\to 0\.280622\.1\[18\.3, 25\.9\]87\.61D reaction–diffusion1010/100\.0942→0\.07350\.0942\\to 0\.073522\.0\[17\.9, 26\.1\]87\.62D Navier–Stokes107/101\.0366→0\.98531\.0366\\to 0\.98534\.6\[0\.1, 9\.2\]84\.52D Darcy1010/100\.9133→0\.76830\.9133\\to 0\.768315\.6\[11\.0, 20\.2\]84\.52D reaction–diffusion103/100\.3087→0\.30950\.3087\\to 0\.3095\-0\.3\[\-0\.8, 0\.3\]84\.52D shallow water1010/100\.0107→5\.787×10−30\.0107\\to 5\.787\{\\times\}10^\{\-3\}45\.2\[39\.9, 50\.5\]88\.12D active matter107/101\.1564→1\.13961\.1564\\to 1\.13961\.3\[\-7\.2, 9\.8\]87\.93D compressible Navier–Stokes107/100\.3110→0\.21020\.3110\\to 0\.210227\.6\[2\.3, 52\.8\]72\.23D MHD107/100\.9452→0\.90180\.9452\\to 0\.90184\.3\[0\.4, 8\.3\]72\.3CNO1D advection109/100\.1657→0\.13800\.1657\\to 0\.138016\.0\[6\.9, 25\.1\]74\.51D Burgers107/100\.1510→0\.12300\.1510\\to 0\.123011\.8\[\-5\.0, 28\.5\]74\.51D reaction–diffusion1010/100\.0133→0\.01050\.0133\\to 0\.010521\.0\[13\.1, 28\.9\]74\.52D Navier–Stokes100/100\.0822→0\.21090\.0822\\to 0\.2109\-202\.0\[\-427\.1, 23\.1\]74\.32D Darcy102/100\.2079→0\.22600\.2079\\to 0\.2260\-9\.1\[\-16\.0, \-2\.1\]74\.32D reaction–diffusion108/100\.0258→0\.02500\.0258\\to 0\.02502\.9\[0\.1, 5\.7\]74\.32D shallow water1010/102\.215×10−3→9\.572×10−42\.215\{\\times\}10^\{\-3\}\\to 9\.572\{\\times\}10^\{\-4\}51\.9\[41\.0, 62\.9\]74\.32D active matter103/100\.7705→0\.79100\.7705\\to 0\.7910\-6\.3\[\-19\.3, 6\.6\]74\.33D compressible Navier–Stokes104/100\.2002→0\.20500\.2002\\to 0\.2050\-18\.8\[\-52\.4, 14\.9\]81\.73D MHD100/100\.4034→0\.42250\.4034\\to 0\.4225\-4\.8\[\-6\.8, \-2\.8\]81\.7

### 6\.2Best full baseline versus boosted stack

Table[2](https://arxiv.org/html/2606.17460#S6.T2)selects, for each dataset, the full\-size architecture with the lowest mean test relativeL2L^\{2\}error and compares it with the boosted stack from the same architecture family\. This isolates whether boosting improves the strongest monolithic family for each PDE\.

Table 2:Best full\-size baseline versus its boosted tiny stack\. For each dataset, the best full\-size architecture is selected by mean test relativeL2L^\{2\}error and compared with the boosted tiny stack from the same family\.DatasetFull familyFull RelL2Boosted RelL2Δperf\\Delta\_\{\\mathrm\{perf\}\}\(%\)Size red\. \(%\)Wins1D advectionFNO0\.05920\.030544\.286\.210/101D BurgersFNO0\.10950\.0967\-2\.086\.24/101D reaction–diffusionFNO7\.580×10−37\.580\{\\times\}10^\{\-3\}4\.074×10−34\.074\{\\times\}10^\{\-3\}43\.386\.210/102D Navier–StokesFNO0\.08050\.018874\.894\.510/102D DarcyFNO0\.13830\.13343\.294\.56/102D reaction–diffusionFNO0\.01320\.0147\-20\.694\.53/102D shallow waterFNO1\.921×10−31\.921\{\\times\}10^\{\-3\}1\.475×10−31\.475\{\\times\}10^\{\-3\}19\.994\.59/102D active matterFNO0\.65750\.7206\-11\.194\.51/103D compressible Navier–StokesFNO0\.17900\.136817\.395\.27/103D MHDCNO0\.40340\.4225\-4\.881\.70/10

The best full\-size baseline is FNO for most completed datasets under the current experimental configuration, with CNO selected for three\-dimensional MHD\. Boosted FNOs substantially improve advection, reaction–diffusion, Navier–Stokes, shallow water, and three\-dimensional compressible Navier–Stokes, give a small positive tradeoff on Darcy flow, and underperform on Burgers, two\-dimensional reaction–diffusion, active matter, and MHD when compared against the best full\-size family for those cases\.

### 6\.3Best full baseline versus best boosted stack

Table[3](https://arxiv.org/html/2606.17460#S6.T3)allows the boosted stack family to differ from the best full\-size family\. This comparison measures the deployment\-relevant Pareto frontier: whether the best compact boosted surrogate can outperform the best full\-size monolithic surrogate for the same PDE\. Operator Boosting produces empirical Pareto improvements on 7 of 10 datasets\. The three failures are active matter, two\-dimensional reaction–diffusion, and MHD, where compression is obtained but the best boosted stack does not match the best full\-size baseline error\.

Table 3:Best full\-size baseline versus best boosted tiny stack\. For each dataset, the best full\-size baseline and best boosted tiny surrogate are selected by mean test relativeL2L^\{2\}error\. This comparison permits architecture selection among completed boosted stacks\.DatasetBest full familyFull RelL2Best boosted familyBoosted RelL2RelL2 red\. \(%\)Size red\. \(%\)Pareto status1D advectionFNO0\.0592FNO0\.030548\.686\.2Dominates1D BurgersFNO0\.1095FNO0\.096711\.686\.2Dominates1D reaction–diffusionFNO7\.580×10−37\.580\{\\times\}10^\{\-3\}FNO4\.074×10−34\.074\{\\times\}10^\{\-3\}46\.386\.2Dominates2D Navier–StokesFNO0\.0805FNO0\.018876\.694\.5Dominates2D DarcyFNO0\.1383FNO0\.13343\.594\.5Dominates2D reaction–diffusionFNO0\.0132FNO0\.0147\-11\.194\.5Tradeoff2D shallow waterFNO1\.921×10−31\.921\{\\times\}10^\{\-3\}CNO9\.572×10−49\.572\{\\times\}10^\{\-4\}50\.250\.4Dominates2D active matterFNO0\.6575FNO0\.7206\-9\.694\.5Tradeoff3D compressible Navier–StokesFNO0\.1790FNO0\.136823\.695\.2Dominates3D MHDCNO0\.4034CNO0\.4225\-4\.781\.7Tradeoff

### 6\.4Failure modes

The failures in Table[1](https://arxiv.org/html/2606.17460#S6.T1)are consistent with prior diagnostic evidence that learned PDE surrogates exhibit strongly PDE\- and architecture\-dependent error modes\[shikhman2026diagnosing\]\. Operator Boosting is least effective when the tiny correction operators cannot represent the relevant residual structure, when the full\-size baseline already captures the dominant dynamics, or when residual fitting amplifies architecture\-specific weaknesses\. The clearest degradations occur for CNO on two\-dimensional Navier–Stokes, FNO on two\-dimensional reaction–diffusion, CNO on Darcy flow, CNO on MHD, and active matter for FNO and CNO\. These cases indicate that Operator Boosting should be viewed as an accuracy–parameter Pareto\-improvement strategy rather than a universal replacement for full\-size neural operators\. Its effectiveness depends jointly on the PDE family, the architecture class, and the residual structure left by earlier stages\.

## 7Discussion

### 7\.1When does boosting help?

The results suggest that Operator Boosting helps most when the full\-size monolithic baseline is not simply parameter\-limited, but when small residual operators can correct systematic errors left by earlier low\-capacity stages\. The strongest gains occur when the full model’s error appears to contain a systematic component that remains representable by the tiny operator class\. This is most evident for FNO on two\-dimensional Navier–Stokes, where the boosted stack improves error by 74\.8% while using 94\.5% fewer parameters, and for shallow\-water dynamics, where both CNO and DeepONet boosted stacks substantially improve over their full\-size counterparts\. These cases suggest that the stagewise residual objective is not merely acting as compression; it changes the optimization problem by allowing successive small operators to focus on the remaining error of the current surrogate\. In contrast, failures such as CNO on Navier–Stokes and FNO on two\-dimensional reaction–diffusion suggest that residual fitting is not sufficient when the tiny correction class poorly matches the residual geometry or when the residual is dominated by errors that require capacity absent from the tiny architecture\.

### 7\.2Implications for resource\-constrained surrogate simulation

The main deployment implication is that trainable parameter count can often be reduced by an order of magnitude without sacrificing accuracy, and in many cases while improving it\. Parameter count is not identical to inference latency or peak memory, but it is a relevant proxy for storage, model transfer, and deployability\. Because all correction operators receive the same input field at inference, their forward passes are independent after training and can in principle be evaluated in parallel\. However, actual latency depends on hardware utilization, batching strategy, memory bandwidth, and the cost structure of the operator family\. Thus the present results should be interpreted primarily as parameter\-count and storage improvements rather than measured latency improvements\.

### 7\.3Limitations and Future Work

This study focuses on trainable parameter count and relativeL2L^\{2\}error\. These metrics capture the deployed accuracy–parameter tradeoff, but they do not fully measure deployment cost\. Future evaluations should also report model size, peak activation memory, wall\-clock latency, throughput, and energy use, especially because boosted stacks contain multiple forward modules even when their total parameter count is small\.

The experiments use fixed tiny architectures and three boosting stages\. We do not optimize the number of stages, per\-stage capacity, shrinkage grid, or residual loss weighting for each PDE\. The results should therefore be interpreted as evidence that a simple homogeneous residual\-stacking procedure can improve the empirical Pareto frontier, not as an optimized lower bound on achievable error or model size\.

The method also remains architecture\- and PDE\-dependent\. Some PDE–family pairs degrade when the tiny correction class cannot represent the residual structure left by earlier stages\. Future work should study residual diagnostics that identify whether the remaining error is spectral, localized, low\-rank, multiscale, boundary\-dominated, or otherwise matched to a particular operator family\.

A natural extension is heterogeneous Operator Boosting, where residual stages may come from different model families:

GM\(a\)=G0\(a\)\+∑m=1MηmHmfm\(a\),fm∈\{FNO,DeepONet,CNO\}\.G\_\{M\}\(a\)=G\_\{0\}\(a\)\+\\sum\_\{m=1\}^\{M\}\\eta\_\{m\}H\_\{m\}^\{f\_\{m\}\}\(a\),\\qquad f\_\{m\}\\in\\\{\\mathrm\{FNO\},\\mathrm\{DeepONet\},\\mathrm\{CNO\}\\\}\.This could combine spectral, branch–trunk, and convolutional biases within one boosted surrogate, but it introduces additional stage\-selection and training\-cost questions\. The present work isolates the homogeneous case first\.

Finally, future work should compare Operator Boosting against additional efficiency baselines, including pruning, quantization, distillation, tensorization, parameter\-matched tiny monolithic models, and independently trained tiny ensembles\.

## 8Conclusion

This work introduced Operator Boosting, a homogeneous stagewise residual\-learning framework for constructing compact neural\-operator PDE surrogates\. Rather than training a large monolithic operator and compressing it afterward, Operator Boosting builds the surrogate directly as an additive stack of tiny same\-family operators trained on successive residual fields, with each correction added through validation\-selected shrinkage\.

Across FNO, DeepONet, and CNO backbones, the results show that this residual construction can substantially improve the empirical accuracy–parameter tradeoff\. Across 30 dataset–architecture pairs, 21 show positive mean accuracy gains, 17 have positive confidence intervals, and all boosted stacks reduce trainable parameter count by approximately 72–95%\. In best\-model comparisons, boosted stacks empirically Pareto\-dominate the best full\-size baseline on 7 of 10 completed PDE benchmarks, including transport, reaction–diffusion, Darcy flow, shallow\-water dynamics, and incompressible and compressible Navier–Stokes\.

The experiments also show that Operator Boosting is not a universal replacement for full\-size neural operators: several PDE–architecture pairs degrade despite large parameter reductions, indicating that success depends on whether the remaining residual is representable by the tiny correction class\. Overall, the study demonstrates that homogeneous residual stacks can improve the deployed size–accuracy frontier of neural PDE surrogates, while motivating future work on latency, memory footprint, energy use, stronger compression baselines, and adaptive choices of residual\-stage size and structure\.

## Declaration of Competing Interest

The author declares no competing interests\.

## Data and Code Availability

The code used to generate the experiments, aggregate the results, and produce the tables will be made publicly available upon publication\. The benchmark datasets used in this study are publicly available through their respective sources, including PDEBench, APEBench, and The Well, together with the standardized synthetic PDE datasets used for the one\-dimensional experiments\.

## Acknowledgments

#### Reproducibility

#### Computational Resources

The author gratefully acknowledges Dell Technologies, and in particular the Dell Pro Precision division, for providing computational resources that supported the experiments in this work\. All experiments were conducted on a Dell Pro Max T2 workstation equipped with an Intel Core Ultra 9 285K processor, 128 GB of DDR5 ECC memory, and an NVIDIA RTX PRO 6000 Blackwell GPU\.

#### Declaration of Generative AI

During the preparation of this work, ChatGPT 5\.5 was used to assist with creating software documentation and preparation of the GitHub repository\. The author reviewed and edited the output as needed and takes full responsibility for the content of the published article\.

## References
Operator Boosting Produces Pareto-Efficient PDE Surrogates

Similar Articles

@AnimaAnandkumar: This is something I have been emphasizing since we started our work on Neural Operators. We very quickly went from simp…

Reducing Learner Redundancy in Boosting via Residual Orthogonalization

Perron--Frobenius Operator Matching for Generative Modeling

Generalization Guarantees for Multi-Input Neural Operator Learning in Sobolev Spaces

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

Submit Feedback

Similar Articles

@AnimaAnandkumar: This is something I have been emphasizing since we started our work on Neural Operators. We very quickly went from simp…
Reducing Learner Redundancy in Boosting via Residual Orthogonalization
Perron--Frobenius Operator Matching for Generative Modeling
Generalization Guarantees for Multi-Input Neural Operator Learning in Sobolev Spaces
Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier