Generalization Guarantees for Multi-Input Neural Operator Learning in Sobolev Spaces

arXiv cs.LG 06/17/26, 04:00 AM Papers
Summary
This paper provides approximation and generalization error estimates for multi-input neural operators measured in Sobolev norms, analyzing how multiple input functions with different domains and regularities affect error bounds, applicable to PDE and scientific computing problems.
arXiv:2606.17419v1 Announce Type: new Abstract: We develop approximation and generalization error estimates for multi-input neural operators, with the output error measured in Sobolev norms. In contrast to standard operator-learning settings with a single input function, our framework allows multiple input functions defined on possibly different domains, with different dimensions and Sobolev regularities. The derived rates explicitly quantify the contribution of each input space to the final error bound. In particular, in the balanced regime, the approximation and generalization rates are governed by the interaction between the input dimensions, regularities, and Sobolev orders, while the dependence on the model complexity retains a \(\log\log/\log\)-type structure. Our analysis provides a general theoretical framework for multi-input operator learning, including Sobolev training, and is applicable to operator learning problems arising from partial differential equations and scientific computing.
# Generalization Guarantees for Multi-Input Neural Operator Learning in Sobolev Spaces
Source: [https://arxiv.org/html/2606.17419](https://arxiv.org/html/2606.17419)
Yahong Yang (School of Mathematics, Georgia Institute of Technology, 686 Cherry Street, Atlanta, GA 30332-0160, USA; emails: yyang3194@gatech.edu, weizhu@gatech.edu, wliao60@gatech.edu), Zecheng Zhang (Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556, USA; email: zzhang48@nd.edu), Wenjing Liao (same affiliation as Yahong Yang), Hao Liu (Department of Mathematics, Hong Kong Baptist University, FSC1202, Fong Shu Chuen Building, Hong Kong Baptist University, Kowloon Tong, Hong Kong; email: haoliu@hkbu.edu.hk; corresponding author).

###### Abstract

We develop approximation and generalization error estimates for multi-input neural operators, with the output error measured in Sobolev norms. In contrast to standard operator-learning settings with a single input function, our framework allows multiple input functions defined on possibly different domains, with different dimensions and Sobolev regularities. The derived rates explicitly quantify the contribution of each input space to the final error bound. In particular, in the balanced regime, the approximation and generalization rates are governed by the interaction between the input dimensions, regularities, and Sobolev orders, while the dependence on the model complexity retains a $\log\log/\log$-type structure. Our analysis provides a general theoretical framework for multi-input operator learning, including Sobolev training, and is applicable to operator learning problems arising from partial differential equations and scientific computing.

## 1 Introduction

Operator learning aims to approximate mappings between infinite\-dimensional function spaces and has become an important tool for solving parametric families of partial differential equations \(PDEs\)\. Prominent examples include DeepONet\[[32](https://arxiv.org/html/2606.17419#bib.bib32),[45](https://arxiv.org/html/2606.17419#bib.bib45),[30](https://arxiv.org/html/2606.17419#bib.bib30),[31](https://arxiv.org/html/2606.17419#bib.bib31),[25](https://arxiv.org/html/2606.17419#bib.bib25)\], Fourier neural operators \(FNOs\)\[[28](https://arxiv.org/html/2606.17419#bib.bib28)\], and other structure\-preserving operator\-learning architectures\[[28](https://arxiv.org/html/2606.17419#bib.bib28),[27](https://arxiv.org/html/2606.17419#bib.bib27),[23](https://arxiv.org/html/2606.17419#bib.bib23),[19](https://arxiv.org/html/2606.17419#bib.bib19)\]\. Compared with classical solvers that must be repeatedly applied for each new instance, a trained neural operator can approximate the solution map for new input functions, source terms, boundary conditions, or physical parameters with a single forward evaluation\. This feature makes neural operators especially useful in scientific computing tasks involving many\-query or real\-time simulation of PDE models\. A simple example is the elliptic boundary value problem

$$-\Delta u+a(x)u=f,\qquad u|_{\partial\Omega}=g. \tag{1}$$

In this case, the solution operator depends on three inputs, namely the coefficient $a$, the source term $f$, and the boundary condition $g$, and can therefore be written as $\mathcal{G}(a,f,g)=u$. This illustrates a basic feature of many PDE problems in scientific computing: the solution often depends simultaneously on several varying quantities. For instance, the coefficient $a$ may encode material properties or the underlying medium, the source term $f$ may represent external forcing, and the boundary condition $g$ may also vary from one instance to another. Thus, it is natural to study operator learning in a multi-input setting, where the learned operator takes all of these quantities as inputs rather than treating only one or two of them as variable.

Motivated by the need to apply operator learning to more complex scientific computing problems, multi-input operator learning has recently attracted increasing attention \[[3](https://arxiv.org/html/2606.17419#bib.bib3),[21](https://arxiv.org/html/2606.17419#bib.bib21),[46](https://arxiv.org/html/2606.17419#bib.bib46),[48](https://arxiv.org/html/2606.17419#bib.bib48),[22](https://arxiv.org/html/2606.17419#bib.bib22),[47](https://arxiv.org/html/2606.17419#bib.bib47)\]. For example, the authors of \[[3](https://arxiv.org/html/2606.17419#bib.bib3)\] first established universal approximation for an operator indexed by a function, which is a two-input operator setting. More generally, the target operator may depend on multiple input functions. Precisely, one considers an operator

$$\mathcal{G}:\prod_{i=1}^{\lambda}\mathcal{X}_{i}\to\mathcal{V},\qquad\lambda\geq 2, \tag{2}$$

and aims to construct a neural operator $\mathcal{G}_{\bm{\theta}}$ that approximates $\mathcal{G}$ uniformly or in an appropriate statistical sense. Universal approximation results for multi-input operator-learning architectures have been established in \[[21](https://arxiv.org/html/2606.17419#bib.bib21),[3](https://arxiv.org/html/2606.17419#bib.bib3),[22](https://arxiv.org/html/2606.17419#bib.bib22)\], and both theoretical and numerical studies have demonstrated the effectiveness of such frameworks. However, quantitative scaling laws and error rates for multi-input neural operators remain much less understood.

Recent works \[[46](https://arxiv.org/html/2606.17419#bib.bib46),[48](https://arxiv.org/html/2606.17419#bib.bib48),[47](https://arxiv.org/html/2606.17419#bib.bib47)\] established scaling laws for operator learning indexed by functions \[[3](https://arxiv.org/html/2606.17419#bib.bib3)\]. Their framework mainly focuses on the two-input setting and assumes Lipschitz continuity in the $L^{\infty}$ sense. In this work, we analyze general multi-input operator learning, where the number of inputs can be any integer greater than or equal to two. Importantly, we also allow the inputs to lie in Sobolev spaces, so that the regularity of each input is explicitly reflected in the rate. This makes it possible to quantify how the smoothness of different inputs influences the final generalization rate and the corresponding neural network design. Finally, we measure the output error in Sobolev norms rather than only in the $L^{\infty}$ norm, which is more natural and informative in scientific computing applications. This paper aims to quantify precisely how input-wise dimensions and regularities influence the final learning rate, and to design a neural operator architecture whose complexity reflects this structure.

These extensions are particularly important for scientific computing applications. First, many partial differential equations depend on more than two inputs. For example, a typical boundary value problem ([1](https://arxiv.org/html/2606.17419#S1.E1)) may involve a coefficient $a$, a source term $f$, and a boundary condition $g$. Therefore, the two-input setting is not sufficient for capturing many practically relevant PDE models. Second, different inputs often play fundamentally different roles in the underlying equation, and may possess very different regularities and intrinsic dimensions. As a result, the network structures associated with different inputs should not be treated uniformly. Understanding how to design neural network architectures that reflect the dimension and regularity of each input is thus an important and challenging problem. In practice, such design choices are often made only through extensive trial and error, whereas our framework provides a mathematically grounded perspective on this issue. Finally, if operator learning is to be applied to PDE problems in scientific computing, it is essential to measure the error in Sobolev norms rather than only in $L^{\infty}$. Even for weak solutions, one often needs derivative information, and hence the extension to Sobolev norms is both natural and necessary. This is also closely related to a very active direction in scientific machine learning, usually referred to as Sobolev training \[[11](https://arxiv.org/html/2606.17419#bib.bib11),[43](https://arxiv.org/html/2606.17419#bib.bib43),[44](https://arxiv.org/html/2606.17419#bib.bib44),[42](https://arxiv.org/html/2606.17419#bib.bib42),[20](https://arxiv.org/html/2606.17419#bib.bib20),[15](https://arxiv.org/html/2606.17419#bib.bib15)\].

We first state an informal version of our main approximation result; the full statement is given in Corollary [2](https://arxiv.org/html/2606.17419#Thmcorollary2).

###### Theorem 1 (Informal Sobolev approximation result).

Let $\ell\in\{0,1\}$. Let $\mathcal{X}_{i}\subset W^{n_{i},\infty}([-1,1]^{d_{i}})$, $i=1,\ldots,\lambda$, be uniformly bounded input classes, and let $\mathcal{G}:\prod_{i=1}^{\lambda}\mathcal{X}_{i}\to W^{\ell,\infty}(\Omega_{\lambda+1})$ be a $\lambda$-input operator. Assume that $\mathcal{G}$ is separately Lipschitz continuous with respect to the input $L^{\infty}$-norms, and that its output $W^{n_{\lambda+1},\infty}(\Omega_{\lambda+1})$-norm is uniformly controlled, with $n_{\lambda+1}>\ell$. Then there exists a ReLU neural operator $\mathcal{G}_{\bm{\theta}}$ with $N_{\mathrm{tot}}$ trainable parameters such that, up to lower-order logarithmic factors,

$$\sup_{\prod_{i=1}^{\lambda}\mathcal{X}_{i}}\left\|\mathcal{G}(f_{1},\ldots,f_{\lambda})-\mathcal{G}_{\bm{\theta}}(f_{1},\ldots,f_{\lambda})\right\|_{W^{\ell,\infty}(\Omega_{\lambda+1})}\lesssim\left(\frac{\log N_{\mathrm{tot}}}{\log\log N_{\mathrm{tot}}}\right)^{-\frac{1}{Q_{\max}}}(\log\log N_{\mathrm{tot}})^{\sum_{i=1}^{\lambda}d_{i}},$$

where $Q_{\max}:=\max_{1\leq i\leq\lambda}d_{i}/n_{i}$.

The rate above shows that the final approximation complexity is controlled by the most difficult input space. More precisely, an input space $\mathcal{X}_{i}$ contributes through the effective complexity $d_{i}/n_{i}$. If $\mathcal{X}_{i}$ has a low-dimensional domain or high regularity, then $d_{i}/n_{i}$ is small. Such an input does not change the leading exponent unless it becomes the largest among all input complexities. Conversely, an input with large dimension or low regularity may dominate the maximum $Q_{\max}$, thereby determining the final approximation rate.
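As a rough numerical illustration of this point, the sketch below computes the effective complexities $d_i/n_i$, identifies the dominating input, and evaluates the right-hand side of the informal bound with the hidden constant set to 1. The dimensions and regularities are hypothetical values chosen for illustration (loosely modeled on the three inputs $a$, $f$, $g$ of the elliptic example); they are not taken from the paper, and the function names are our own.

```python
import math

def q_max(dims, regs):
    """Return (Q_max, index of the dominating input), where Q_max = max_i d_i / n_i."""
    ratios = [d / n for d, n in zip(dims, regs)]
    idx = max(range(len(ratios)), key=ratios.__getitem__)
    return ratios[idx], idx

def approx_rate(n_tot, dims, regs):
    """Right-hand side of the informal bound, with the hidden constant taken as 1 (an assumption)."""
    q, _ = q_max(dims, regs)
    log_n = math.log(n_tot)
    loglog_n = math.log(log_n)
    return (log_n / loglog_n) ** (-1.0 / q) * loglog_n ** sum(dims)

# Hypothetical setting: three inputs with domain dimensions d_i and regularities n_i.
dims = [2, 2, 1]
regs = [1, 4, 2]
q, idx = q_max(dims, regs)
print(q, idx)  # 2.0 0

# Raising the regularity of the dominating input lowers Q_max and hence the bound.
print(approx_rate(1e6, dims, [2, 4, 2]) < approx_rate(1e6, dims, regs))  # True
```

In this toy setting only the first input affects the leading exponent; the other two enter only through the $(\log\log N_{\mathrm{tot}})^{\sum_i d_i}$ factor.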

The proof also suggests how the neural operator should be constructed\. The architecture has a shared branch–trunk form

$$\mathcal{G}_{\bm{\theta}}(f_{1},\ldots,f_{\lambda})(\bm{x}):=\sum_{s_{1}=1}^{J_{1}}\cdots\sum_{s_{\lambda}=1}^{J_{\lambda}}\sum_{p=1}^{P}e_{s_{1},\ldots,s_{\lambda},p}\prod_{i=1}^{\lambda}\mathcal{B}_{i,s_{i}}(\mathcal{D}_{m_{i}}f_{i})\,\mathcal{T}_{p}(\bm{x}). \tag{3}$$

Here $\mathcal{D}_{m_{i}}f_{i}$ denotes the finite-dimensional discretization of the $i$-th input function, $\mathcal{B}_{i,s_{i}}$ are input-wise branch networks, and $\mathcal{T}_{p}$ are shared trunk networks. A schematic diagram of this neural-operator architecture is shown in Fig. [1](https://arxiv.org/html/2606.17419#S1.F1).
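The contraction in (3) can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions, not the construction used in the proofs: the discretization is a uniform 1D grid rather than the pseudo-spectral projection, the branch and trunk "networks" are hand-written callables, and all names (`discretize`, `neural_operator`, `coeffs`) are hypothetical.

```python
import itertools

def discretize(f, m):
    """Sample a 1D function at m + 1 uniform points of [-1, 1] (a stand-in for D_m)."""
    return [f(-1.0 + 2.0 * k / m) for k in range(m + 1)]

def neural_operator(coeffs, branches, trunks, samples, x):
    """Evaluate sum over (s_1, ..., s_lambda, p) of e[s, p] * prod_i B_{i, s_i}(D f_i) * T_p(x).

    coeffs   : dict mapping index tuples (s_1, ..., s_lambda, p) to floats (missing = 0)
    branches : branches[i][s] is a callable acting on the i-th sample vector
    trunks   : trunks[p] is a callable acting on the evaluation point x
    samples  : samples[i] is the discretized i-th input
    """
    # Evaluate every branch and trunk once, then reuse the values inside the sum.
    branch_out = [[b(v) for b in bs] for bs, v in zip(branches, samples)]
    trunk_out = [t(x) for t in trunks]
    ranges = [range(len(bs)) for bs in branch_out] + [range(len(trunk_out))]
    total = 0.0
    for index in itertools.product(*ranges):
        *s, p = index
        term = coeffs.get(index, 0.0) * trunk_out[p]
        for i, s_i in enumerate(s):
            term *= branch_out[i][s_i]
        total += term
    return total

def mean(v):
    return sum(v) / len(v)

# Toy example with lambda = 2 inputs, J_1 = 2, J_2 = 1, P = 2.
branches = [[mean, lambda v: max(0.0, v[0])], [mean]]
trunks = [lambda x: 1.0, lambda x: max(0.0, x)]
coeffs = {(0, 0, 0): 1.0, (1, 0, 1): 2.0}
samples = [discretize(lambda t: t * t, 4), discretize(lambda t: 1.0, 2)]
print(neural_operator(coeffs, branches, trunks, samples, 0.5))  # 1.5
```

The number of coefficients $e_{s_1,\ldots,s_\lambda,p}$ grows like $P\prod_i J_i$, which is why the choice of the branch ranks $J_i$ matters for the total parameter count.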

![Refer to caption](https://arxiv.org/html/2606.17419v1/x1.png)

Figure 1: Schematic illustration of the shared branch–trunk architecture for the $\lambda$-input neural operator. Each input function $f_{i}$ is first discretized by $\mathcal{D}_{m_{i}}$, then passed through an input-wise branch network $\mathcal{B}_{i,s_{i}}$. The branch outputs are multiplied, combined with the shared trunk network $\mathcal{T}_{p}$, and summed over the indices $s_{1},\ldots,s_{\lambda},p$.

If the input space $\mathcal{X}_{i}$ has high dimension or low regularity, then a larger discretization level $m_{i}$ is needed to control the discretization error. This, in turn, requires a larger branch rank $J_{i}$. In the balanced construction, up to logarithmic factors,

$$(m_{i}+1)^{d_{i}}\asymp\varepsilon^{-d_{i}/n_{i}},\qquad\log J_{i}\asymp(m_{i}+1)^{d_{i}}\log(m_{i}+1).$$

Thus, the derived rate not only quantifies the approximation complexity, but also provides guidance on how to allocate network capacity across different input functions. A more detailed discussion is given in Remark [2](https://arxiv.org/html/2606.17419#Thmremark2).
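To make the balanced scaling concrete, the following sketch turns a target accuracy $\varepsilon$ into per-input discretization levels and log-ranks. It treats the $\asymp$ relations as equalities with constant 1 and ignores the additional logarithmic factors, which is an assumption made purely for illustration; the function name and the sample values are hypothetical.

```python
import math

def balanced_allocation(eps, dims, regs):
    """Per-input (m_i, (m_i + 1)^{d_i}, log J_i) from the balanced scaling, constants set to 1."""
    rows = []
    for d, n in zip(dims, regs):
        # (m_i + 1)^{d_i} ~ eps^{-d_i / n_i} means m_i + 1 ~ eps^{-1 / n_i} points per axis.
        # The small tolerance guards against floating-point round-up; at least 2 points are kept.
        grid = max(2, math.ceil(eps ** (-1.0 / n) - 1e-9))
        n_samples = grid ** d
        log_rank = n_samples * math.log(grid)
        rows.append({"m": grid - 1, "samples": n_samples, "log_J": log_rank})
    return rows

# Hypothetical inputs: dimensions d_i and regularities n_i, target accuracy eps = 1/16.
rows = balanced_allocation(0.0625, dims=[2, 2, 1], regs=[1, 4, 2])
print([r["m"] for r in rows])        # [15, 1, 3]
print([r["samples"] for r in rows])  # [256, 4, 4]
```

The rough input ($n_1=1$) needs far more sample points, and therefore a much larger branch rank, than the smooth ones, which matches the qualitative guidance in the text.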

Using the approximation error estimates established in Sec. [3](https://arxiv.org/html/2606.17419#S3), together with explicit bounds on the network size and parameter magnitudes, we derive generalization error estimates for the proposed multi-input neural operator class in Sec. [4](https://arxiv.org/html/2606.17419#S4). The analysis is carried out under a hierarchical sampling setting: for each outer input, one observes multiple inner input samples and spatial evaluation points. Since samples sharing the same outer input are not fully independent at the outer level, the dominant statistical error is controlled by the number of outer samples. The proof follows the empirical-process strategy of \[[31](https://arxiv.org/html/2606.17419#bib.bib31)\], but with an additional difficulty caused by Sobolev training. In our setting, the loss involves derivative observations, and the derivative of the ReLU activation is not Lipschitz continuous. Therefore, the standard parameter-perturbation covering argument is not directly applicable to the derivative classes. To overcome this issue, we use uniform empirical covering numbers and bound them through pseudo-dimension estimates, based on \[[2](https://arxiv.org/html/2606.17419#bib.bib2),[50](https://arxiv.org/html/2606.17419#bib.bib50)\]. This yields quantitative generalization bounds depending explicitly on the network complexity and the training sample size. The following informal theorem summarizes the resulting generalization rate.

###### Theorem 2 (Informal Sobolev generalization result).

Let $\ell\in\{0,1\}$. Let $\mathcal{X}_{i}\subset W^{n_{i},\infty}([-1,1]^{d_{i}})$, $i=1,\ldots,\lambda$, be uniformly bounded input classes, and let $\mathcal{G}:\prod_{i=1}^{\lambda}\mathcal{X}_{i}\to W^{\ell,\infty}(\Omega_{\lambda+1})$ be a $\lambda$-input operator. Assume that $\mathcal{G}$ is separately Lipschitz continuous with respect to the input $L^{\infty}$-norms, and that the output has sufficient Sobolev regularity $n_{\lambda+1}>\ell$. Let $n_{1}^{\rm samp}$ denote the number of outer training samples in the hierarchical sampling setting, and define $Q_{\max}:=\max_{1\leq i\leq\lambda}\frac{d_{i}}{n_{i}}$. Then there exists a ReLU neural operator class such that the empirical risk minimizer $\widehat{\mathcal{G}}_{\mathcal{S}}$, trained with the Sobolev empirical loss, satisfies

$$\begin{aligned}
&\mathbb{E}_{\mathcal{S}}\,\mathbb{E}_{\begin{subarray}{c}f_{i}\sim\mu_{i},\ i=1,\ldots,\lambda\\ \bm{z}\sim\mu_{z}\end{subarray}}\left[\sum_{|\gamma|\leq\ell}\left|\partial^{\gamma}\mathcal{G}(f_{1},\ldots,f_{\lambda})(\bm{z})-\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}(\mathcal{D}_{m_{1}}f_{1},\ldots,\mathcal{D}_{m_{\lambda}}f_{\lambda})(\bm{z})\right|^{2}\right]\\
&\qquad\lesssim\left(\frac{\log n_{1}^{\rm samp}}{\log\log n_{1}^{\rm samp}}\right)^{-\frac{2}{Q_{\max}}}(\log\log n_{1}^{\rm samp})^{2\sum_{i=1}^{\lambda}d_{i}}.
\end{aligned} \tag{4}$$

Here $\mathbb{E}_{\mathcal{S}}$ is taken over the training set, while $\mathbb{E}_{f_{i}\sim\mu_{i},\,\bm{z}\sim\mu_{z}}$ is taken over independent test inputs and test output locations.

Theorems [1](https://arxiv.org/html/2606.17419#Thmtheorem1) and [2](https://arxiv.org/html/2606.17419#Thmtheorem2) not only capture the regularity of the input functions, but also measure the output error in the Sobolev norm $W^{\ell,\infty}$. To the best of our knowledge, such a Sobolev-output error estimate has not appeared in previous works on multi-input neural operator learning. In this paper, we focus on the cases $\ell=0$ and $\ell=1$, since our network architecture is based on ReLU activation functions. Extending the results to $\ell\geq 2$ would require smoother activation functions, together with corresponding Sobolev approximation results based on partition-of-unity constructions; see, for example, \[[51](https://arxiv.org/html/2606.17419#bib.bib51)\]. We expect that the overall approximation framework and proof strategy can be extended to this higher-order Sobolev setting, although we do not pursue this direction here.

### Related Work

There have been many works on the error analysis of functional learning and operator learning with inputs taken from infinite\-dimensional spaces\. For functional learning, see\[[9](https://arxiv.org/html/2606.17419#bib.bib9),[26](https://arxiv.org/html/2606.17419#bib.bib26),[55](https://arxiv.org/html/2606.17419#bib.bib55),[39](https://arxiv.org/html/2606.17419#bib.bib39),[53](https://arxiv.org/html/2606.17419#bib.bib53),[34](https://arxiv.org/html/2606.17419#bib.bib34)\]; for operator learning, see\[[10](https://arxiv.org/html/2606.17419#bib.bib10),[3](https://arxiv.org/html/2606.17419#bib.bib3),[25](https://arxiv.org/html/2606.17419#bib.bib25),[24](https://arxiv.org/html/2606.17419#bib.bib24),[23](https://arxiv.org/html/2606.17419#bib.bib23),[30](https://arxiv.org/html/2606.17419#bib.bib30),[22](https://arxiv.org/html/2606.17419#bib.bib22),[21](https://arxiv.org/html/2606.17419#bib.bib21),[31](https://arxiv.org/html/2606.17419#bib.bib31),[18](https://arxiv.org/html/2606.17419#bib.bib18),[50](https://arxiv.org/html/2606.17419#bib.bib50),[46](https://arxiv.org/html/2606.17419#bib.bib46),[48](https://arxiv.org/html/2606.17419#bib.bib48),[47](https://arxiv.org/html/2606.17419#bib.bib47),[29](https://arxiv.org/html/2606.17419#bib.bib29),[38](https://arxiv.org/html/2606.17419#bib.bib38),[33](https://arxiv.org/html/2606.17419#bib.bib33),[49](https://arxiv.org/html/2606.17419#bib.bib49)\]\. Among these works, except for\[[3](https://arxiv.org/html/2606.17419#bib.bib3),[22](https://arxiv.org/html/2606.17419#bib.bib22),[21](https://arxiv.org/html/2606.17419#bib.bib21),[46](https://arxiv.org/html/2606.17419#bib.bib46),[48](https://arxiv.org/html/2606.17419#bib.bib48),[47](https://arxiv.org/html/2606.17419#bib.bib47)\], most focus on the single\-input setting\. 
If we specialize our results (Theorems [1](https://arxiv.org/html/2606.17419#Thmtheorem1) and [2](https://arxiv.org/html/2606.17419#Thmtheorem2)) to the case $\lambda=1$, then our framework reduces to the single-input setting, and the resulting rates are consistent with those in the previous literature.

In some works, the rates are better than those obtained here because they consider smaller or more structured input spaces, such as mixed-Sobolev spaces \[[26](https://arxiv.org/html/2606.17419#bib.bib26)\], finite basis expansion spaces \[[31](https://arxiv.org/html/2606.17419#bib.bib31)\], or infinitely smooth spaces \[[25](https://arxiv.org/html/2606.17419#bib.bib25)\]. Improved rates are also possible when the target functional or operator admits additional low-complexity structure, such as Barron-type structure \[[53](https://arxiv.org/html/2606.17419#bib.bib53)\] or Green’s function structure \[[18](https://arxiv.org/html/2606.17419#bib.bib18)\]. These settings go beyond the scope of the present paper and will be considered in future work.

Among the existing multi-input works \[[3](https://arxiv.org/html/2606.17419#bib.bib3),[22](https://arxiv.org/html/2606.17419#bib.bib22),[21](https://arxiv.org/html/2606.17419#bib.bib21),[46](https://arxiv.org/html/2606.17419#bib.bib46),[48](https://arxiv.org/html/2606.17419#bib.bib48),[47](https://arxiv.org/html/2606.17419#bib.bib47)\], the work \[[3](https://arxiv.org/html/2606.17419#bib.bib3)\] first established universal approximation for two-input operators; the papers \[[22](https://arxiv.org/html/2606.17419#bib.bib22),[21](https://arxiv.org/html/2606.17419#bib.bib21)\] establish general multi-input universal approximation results, but do not provide quantitative scaling laws. The papers \[[46](https://arxiv.org/html/2606.17419#bib.bib46),[48](https://arxiv.org/html/2606.17419#bib.bib48),[47](https://arxiv.org/html/2606.17419#bib.bib47)\] derive such scaling laws, with \[[47](https://arxiv.org/html/2606.17419#bib.bib47)\] giving the strongest result among them. That framework only considers the two-input setting and assumes that the inputs belong merely to Lipschitz classes, while the operator is Lipschitz continuous with respect to both inputs and outputs in the $L^{\infty}$ sense. These assumptions and the proof strategy generalize those in \[[31](https://arxiv.org/html/2606.17419#bib.bib31)\] for the single-input setting.

In contrast, our work is motivated by scientific computing applications, where both the input and output spaces are more naturally measured in Sobolev norms, and where Lipschitz continuity is more appropriately formulated in Sobolev spaces. Moreover, since each input is allowed to belong to a Sobolev space with its own regularity, our rate explicitly reflects the smoothness of each input. In this sense, our result is sharper than those in \[[47](https://arxiv.org/html/2606.17419#bib.bib47),[31](https://arxiv.org/html/2606.17419#bib.bib31)\]. The rate in \[[31](https://arxiv.org/html/2606.17419#bib.bib31),[47](https://arxiv.org/html/2606.17419#bib.bib47)\] takes the form

$$\left(\frac{\log N_{\mathrm{tot}}}{\log\log N_{\mathrm{tot}}}\right)^{-\frac{1}{\max_{1\leq i\leq\lambda}d_{i}}},$$

where $\lambda=1$ corresponds to the setting in \[[31](https://arxiv.org/html/2606.17419#bib.bib31)\], while $\lambda=2$ corresponds to the setting in \[[47](https://arxiv.org/html/2606.17419#bib.bib47)\]. By contrast, our rate becomes

$$\left(\frac{\log N_{\mathrm{tot}}}{\log\log N_{\mathrm{tot}}}\right)^{-\frac{1}{\max_{1\leq i\leq\lambda}d_{i}/n_{i}}}$$

for any $\lambda>0$, thereby incorporating the input-wise regularity $n_{i}$. For this reason, the framework of \[[47](https://arxiv.org/html/2606.17419#bib.bib47),[31](https://arxiv.org/html/2606.17419#bib.bib31)\], which relies on hat-function encoding, cannot be applied. Instead, we use a pseudo-spectral projection, which is better suited to Sobolev spaces and allows us to define and control Sobolev norms in a natural way.

Furthermore, measuring the output error in Sobolev norms $W^{\ell,\infty}$ instead of $L^{\infty}$ leads to two essential differences in the analysis compared with \[[31](https://arxiv.org/html/2606.17419#bib.bib31)\]. First, in the approximation step, we need approximation results for the trunk network in Sobolev norms, which can be obtained by following \[[16](https://arxiv.org/html/2606.17419#bib.bib16),[54](https://arxiv.org/html/2606.17419#bib.bib54),[52](https://arxiv.org/html/2606.17419#bib.bib52)\]. Second, in the generalization error analysis, our empirical loss is based on derivatives, namely,

$$\mathcal{L}_{\mathcal{S}}(\mathcal{G}_{\rm NN}):=\frac{1}{N_{\rm train}}\sum_{\operatorname{data}=1}^{N_{\rm train}}\sum_{|\gamma|\leq\ell}\left|\partial^{\gamma}\mathcal{G}_{\rm NN}\mid_{\operatorname{data}}-\text{data of derivative}\right|^{2},$$

rather than the standard $L^{2}$-type loss. This is closely related to the highly active topic of Sobolev training \[[11](https://arxiv.org/html/2606.17419#bib.bib11),[43](https://arxiv.org/html/2606.17419#bib.bib43),[44](https://arxiv.org/html/2606.17419#bib.bib44),[42](https://arxiv.org/html/2606.17419#bib.bib42),[20](https://arxiv.org/html/2606.17419#bib.bib20),[15](https://arxiv.org/html/2606.17419#bib.bib15)\]. Sobolev training requires matching derivatives of the target operator by derivatives of the neural network, and thus we need to control the complexity of derivatives of neural networks.
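A toy version of this loss may help fix ideas. The sketch below works in one output dimension with $\ell\in\{0,1\}$, uses a flat list of samples instead of the hierarchical sampling scheme, and replaces the neural operator by a hand-written two-neuron ReLU function; the target $u(z)=z^2$ and all names are hypothetical choices for illustration only.

```python
def relu(t):
    return max(0.0, t)

def model(z):
    # |z| written as a two-neuron ReLU network: relu(z) + relu(-z).
    return relu(z) + relu(-z)

def model_grad(z):
    # Derivative of the network above away from z = 0: a step function,
    # which is bounded but not Lipschitz continuous.
    return 1.0 if z > 0 else -1.0

def sobolev_loss(value_fn, grad_fn, data, order):
    """Mean over samples of the squared value error plus, if order >= 1, the squared derivative error."""
    total = 0.0
    for z, u, du in data:
        total += (value_fn(z) - u) ** 2
        if order >= 1:
            total += (grad_fn(z) - du) ** 2
    return total / len(data)

# Hypothetical training triples (z, u(z), u'(z)) for the target u(z) = z^2.
data = [(z, z * z, 2.0 * z) for z in (-1.0, -0.5, 0.5, 1.0)]
print(sobolev_loss(model, model_grad, data, order=0))  # 0.03125
print(sobolev_loss(model, model_grad, data, order=1))  # 0.53125
```

The value-only loss looks small here while the derivative term dominates, which illustrates why a small $L^2$-type loss alone does not certify accuracy in $W^{1,\infty}$.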

We emphasize that the extensions developed in this paper require several technical ingredients beyond existing neural-operator approximation and generalization results. First, in order to obtain approximation rates measured in Sobolev norms and to explicitly incorporate the smoothness of the input and output functions into the final rates, we need a different encoding strategy from those used in \[[31](https://arxiv.org/html/2606.17419#bib.bib31),[47](https://arxiv.org/html/2606.17419#bib.bib47)\]. At the same time, we would like to keep the encoding as pointwise sampling of the input data, rather than using basis coefficients that must be computed in advance, as in \[[41](https://arxiv.org/html/2606.17419#bib.bib41)\]. To achieve this, we use a pseudo-spectral projection based on pointwise samples, which preserves the sampling-based nature of the neural-operator architecture while allowing us to control Sobolev discretization errors. Second, measuring the approximation error in Sobolev norms requires neural network approximations not only of functions but also of their derivatives. This leads to additional technical difficulties compared with $L^{\infty}$ or $L^{2}$ approximation. We handle this by using local polynomial approximation and partition-of-unity constructions, inspired by classical finite element arguments \[[6](https://arxiv.org/html/2606.17419#bib.bib6),[17](https://arxiv.org/html/2606.17419#bib.bib17)\], together with ReLU network realizations that preserve the required Sobolev approximation rates. Third, the generalization analysis also requires new arguments. Since the loss involves derivatives of the neural operator, one needs to estimate the complexity of derivative classes of neural networks.
This cannot be obtained by directly using covering numbers over the whole function space, as in\[[31](https://arxiv.org/html/2606.17419#bib.bib31)\], because derivatives of ReLU networks are not Lipschitz continuous and the corresponding global covering numbers may be infinite\. Instead, we use covering numbers defined on finite data points and follow the strategy of\[[2](https://arxiv.org/html/2606.17419#bib.bib2),[50](https://arxiv.org/html/2606.17419#bib.bib50)\]to control the empirical Sobolev loss\. These ingredients together show that extending existing approximation and generalization theory to the present multi\-input Sobolev setting is highly nontrivial\.

## 2 Preliminaries

### 2.1 Notations

We summarize the basic notation for deep neural networks used in this paper:

1. Define $\sigma(x)=\operatorname{ReLU}(x)=\max\{0,x\}$. With a slight abuse of notation, we define $\sigma:\mathbb{R}^{d}\to\mathbb{R}^{d}$ by $\sigma(\bm{x})=\left[\begin{array}{c}\sigma(x_{1})\\ \vdots\\ \sigma(x_{d})\end{array}\right]$ for any $\bm{x}=\left[x_{1},\cdots,x_{d}\right]^{T}\in\mathbb{R}^{d}$.

2. Let $d,L\in\mathbb{N}_{+}$, and let

$$N_{0}=d,\qquad N_{L+1}=1,\qquad N_{\ell}\in\mathbb{N}_{+},\quad\ell=1,\dots,L.$$

A feedforward neural network with activation function $\sigma$, depth $L$, and layer widths $\{N_{\ell}\}_{\ell=0}^{L+1}$ is defined by

$$\tilde{\bm{h}}_{0}=\bm{x},\qquad\bm{h}_{\ell}=\bm{W}_{\ell}\tilde{\bm{h}}_{\ell-1}+\bm{b}_{\ell},\quad\tilde{\bm{h}}_{\ell}=\sigma(\bm{h}_{\ell}),\quad\ell=1,\dots,L, \tag{5}$$

and

$$\phi(\bm{x})=\bm{h}_{L+1}=\bm{W}_{L+1}\tilde{\bm{h}}_{L}+\bm{b}_{L+1}, \tag{6}$$

where $\bm{W}_{\ell}\in\mathbb{R}^{N_{\ell}\times N_{\ell-1}}$ and $\bm{b}_{\ell}\in\mathbb{R}^{N_{\ell}}$ are the weight matrix and bias vector at the $\ell$-th layer, respectively. The width of the network is defined as $\max_{1\leq\ell\leq L}N_{\ell}$. Given $d_{\mathrm{in}},d_{\mathrm{out}},L,p,K,\kappa,M>0$, we denote by

$$\mathcal{F}_{\mathrm{NN}}(d_{\mathrm{in}},d_{\mathrm{out}},L,p,K,\kappa,M)$$

the class of vector-valued neural networks $\Gamma:\mathbb{R}^{d_{\mathrm{in}}}\to\mathbb{R}^{d_{\mathrm{out}}}$ whose components are of the form ([6](https://arxiv.org/html/2606.17419#S2.E6)). Each network in this class has depth $L$, width at most $p$, at most $K$ nonzero parameters, and satisfies

$$\|\Gamma\|_{L^{\infty}}\leq M,\qquad\|\bm{W}_{\ell}\|_{\infty,\infty}\leq\kappa,\qquad\|\bm{b}_{\ell}\|_{\infty}\leq\kappa,\quad\ell=1,\ldots,L+1.$$

Here $\|\bm{W}_{\ell}\|_{\infty,\infty}$ denotes the largest magnitude of all elements of $\bm{W}_{\ell}$, and $\|\bm{W}_{\ell}\|_{0}$ and $\|\bm{b}_{\ell}\|_{0}$ denote the numbers of nonzero entries of $\bm{W}_{\ell}$ and $\bm{b}_{\ell}$, respectively.

3. Denote $\Omega$ as $[-1,1]^{d}$, $D$ as the weak derivative of a single-variable function, and $D^{\bm{\alpha}}=D_{1}^{\alpha_{1}}D_{2}^{\alpha_{2}}\ldots D_{d}^{\alpha_{d}}$ as the partial derivative, where $\bm{\alpha}=[\alpha_{1},\alpha_{2},\ldots,\alpha_{d}]^{\top}$ and $D_{i}$ is the derivative in the $i$-th variable. Let $n\in\mathbb{N}$ and $1\leq p\leq\infty$. We define Sobolev spaces

$$W^{n,p}(\Omega):=\left\{f\in L^{p}(\Omega):D^{\bm{\alpha}}f\in L^{p}(\Omega)\text{ for all }\bm{\alpha}\in\mathbb{N}^{d}\text{ with }|\bm{\alpha}|\leq n\right\}$$

with a norm

$$\|f\|_{W^{n,p}(\Omega)}:=\left(\sum_{0\leq|\bm{\alpha}|\leq n}\left\|D^{\bm{\alpha}}f\right\|_{L^{p}(\Omega)}^{p}\right)^{1/p},$$

if $p<\infty$, and

$$\|f\|_{W^{n,\infty}(\Omega)}:=\max_{0\leq|\bm{\alpha}|\leq n}\left\|D^{\bm{\alpha}}f\right\|_{L^{\infty}(\Omega)}.$$

### 2.2 Problem Setting and Assumptions

In this section, we introduce the operator\-learning problem studied in this paper, together with the assumptions on the function classes and the neural network architecture used for the approximation\.

Let $\lambda,n_{i}\in\mathbb{N}_{+}$, and set

$$\Omega_{i}=[-1,1]^{d_{i}},\qquad i=1,2,\ldots,\lambda+1.$$

We consider a $\lambda$-input operator defined on the product space

$$\mathcal{G}:\prod_{i=1}^{\lambda}W^{n_{i},\infty}(\Omega_{i})\longrightarrow W^{n_{\lambda+1},\infty}(\Omega_{\lambda+1}).$$

Although the operator is defined on the larger space $\prod_{i=1}^{\lambda}W^{n_{i},\infty}(\Omega_{i})$, our error analysis will be carried out on more regular input classes $\mathcal{X}_{i}\subset W^{n_{i},\infty}(\Omega_{i})$. This restriction allows us to incorporate the regularity of the input and output functions into the error rates and to work with compact subsets of the ambient function spaces. At the same time, it is important to keep $\mathcal{G}$ defined on the larger product space $\prod_{i=1}^{\lambda}L^{\infty}(\Omega_{i})$, since in the proof we will reconstruct certain auxiliary functions that may not belong to the compact regularity classes $\mathcal{X}_{i}$. Equivalently, the approximation problem studied in this paper concerns the restriction

$$\mathcal{G}:\prod_{i=1}^{\lambda}\mathcal{X}_{i}\longrightarrow W^{n_{\lambda+1},\infty}(\Omega_{\lambda+1}).$$
We next impose regularity and Lipschitz assumptions on the operator𝒢\\mathcal\{G\}\.

###### Assumption 1\.

Let $k_{i},\alpha_{i}\in\mathbb{N}$ for $i=1,2,\ldots,\lambda$ and $\ell\in\{0,1\}$. Assume that $\mathcal{G}$ is separately Lipschitz continuous with respect to the Sobolev norms $W^{k_{i},\infty}(\Omega_{i})$ in the following sense. There exist constants $L_{i}>0$, $i=1,\ldots,\lambda$, such that for any $1\leq i_{*}\leq\lambda$, any $g_{i_{*}},h_{i_{*}}\in W^{k_{i_{*}},\infty}(\Omega_{i_{*}})$, and any $f_{i}\in W^{k_{i},\infty}(\Omega_{i})$, $i\neq i_{*}$, we have

$$\begin{aligned}
&\Bigl\|\mathcal{G}(f_{1},\ldots,f_{i_{*}-1},g_{i_{*}},f_{i_{*}+1},\ldots,f_{\lambda})-\mathcal{G}(f_{1},\ldots,f_{i_{*}-1},h_{i_{*}},f_{i_{*}+1},\ldots,f_{\lambda})\Bigr\|_{W^{\ell,\infty}(\Omega_{\lambda+1})}\\
&\qquad\leq L_{i_{*}}\|g_{i_{*}}-h_{i_{*}}\|_{W^{k_{i_{*}},\infty}(\Omega_{i_{*}})}\left(\prod_{\substack{i=1\\ i\neq i_{*}}}^{\lambda}\left(1+\|f_{i}\|_{W^{k_{i},\infty}(\Omega_{i})}\right)\right)^{\alpha_{i_{*}}}.
\end{aligned}$$

###### Assumption 2\.

There exist constants $S,U>0$ such that, for every $i=1,\ldots,\lambda$,

$$\|f_{i}\|_{L^{\infty}(\Omega_{i})}\leq S,\qquad \|f_{i}\|_{W^{n_{i},\infty}(\Omega_{i})}\leq U\qquad\forall\,f_{i}\in\mathcal{X}_{i}.$$

###### Assumption 3\.

Let $\beta_{i}\in\mathbb{N}$, $i=1,\ldots,\lambda$. Assume that there exists a constant $C>0$ such that, for any $f_{i}\in W^{\beta_{i},\infty}(\Omega_{i})$, $i=1,\ldots,\lambda$, we have

$$\left\|\mathcal{G}(f_{1},\ldots,f_{\lambda})\right\|_{W^{n_{\lambda+1},\infty}(\Omega_{\lambda+1})}\leq C\prod_{i=1}^{\lambda}\left(1+\|f_{i}\|_{W^{\beta_{i},\infty}(\Omega_{i})}\right).$$

The above assumptions are natural in the PDE setting\. For example, consider the linear elliptic Dirichlet problem

$$\begin{cases}\mathcal{L}u=f&\text{in }\Omega,\\ u=g&\text{on }\partial\Omega.\end{cases}$$

We view the solution operator as the map

$$\mathcal{G}:(f,g)\mapsto u.$$

Under the standard Calderón–Zygmund assumptions on $\mathcal{L}$ and $\Omega$ as shown in [[13](https://arxiv.org/html/2606.17419#bib.bib13), [14](https://arxiv.org/html/2606.17419#bib.bib14)], the global $W^{2,p}$-estimate gives, for any $p\in(1,\infty)$,

$$\|u\|_{W^{2,p}(\Omega)}=\|\mathcal{G}(f,g)\|_{W^{2,p}(\Omega)}\leq C\left(\|f\|_{L^{p}(\Omega)}+\|g\|_{W^{2-1/p,p}(\partial\Omega)}\right).$$

Taking $p>d$ and using the Sobolev embedding $W^{2,p}(\Omega)\hookrightarrow W^{1,\infty}(\Omega)$, we obtain

$$\|\mathcal{G}(f,g)\|_{W^{1,\infty}(\Omega)}\leq C\left(\|f\|_{L^{\infty}(\Omega)}+\|g\|_{W^{2-1/p,\infty}(\partial\Omega)}\right).$$

Thus, in Assumption [3](https://arxiv.org/html/2606.17419#Thmassumption3), one may take $n_{\lambda+1}=1$, $\beta_{1}=0$, and $\beta_{2}=2-1/p$.

Moreover, by linearity of the PDE, the solution operator is separately Lipschitz\. Indeed,

$$\|\mathcal{G}(f_{1},g)-\mathcal{G}(f_{2},g)\|_{W^{1,\infty}(\Omega)}=\|\mathcal{G}(f_{1}-f_{2},0)\|_{W^{1,\infty}(\Omega)}\leq C\|f_{1}-f_{2}\|_{L^{\infty}(\Omega)},$$

and

$$\|\mathcal{G}(f,g_{1})-\mathcal{G}(f,g_{2})\|_{W^{1,\infty}(\Omega)}=\|\mathcal{G}(0,g_{1}-g_{2})\|_{W^{1,\infty}(\Omega)}\leq C\|g_{1}-g_{2}\|_{W^{2-1/p,\infty}(\partial\Omega)}.$$

Therefore, in this linear case, the exponents $\alpha_{i}$ in Assumption [1](https://arxiv.org/html/2606.17419#Thmassumption1) can be chosen to be zero.

In contrast, if the input includes coefficients of the PDE, then the corresponding Lipschitz bound generally depends on the size of the other inputs\. For instance, consider

$$-\Delta u+a(x)u=f,\qquad u|_{\partial\Omega}=0,$$

and regard the solution operator as $\mathcal{G}(a,f)=u$. If $a\geq 0$, then for fixed $a$ and two right-hand sides $f_{1},f_{2}$, the difference $w=\mathcal{G}(a,f_{1})-\mathcal{G}(a,f_{2})$ satisfies

$$-\Delta w+a(x)w=f_{1}-f_{2},\qquad w|_{\partial\Omega}=0.$$

For any $p>d$, the elliptic $W^{2,p}$-estimate and the Sobolev embedding imply

$$\|\mathcal{G}(a,f_{1})-\mathcal{G}(a,f_{2})\|_{W^{1,\infty}(\Omega)}\leq C_{\Omega,p}\left(1+\|a\|_{L^{\infty}(\Omega)}\right)\|f_{1}-f_{2}\|_{L^{\infty}(\Omega)}.$$

Hence, when the other input is the coefficient $a$, the Lipschitz constant grows at most linearly in $\|a\|_{L^{\infty}(\Omega)}$. This corresponds to taking $\alpha_{i}=1$ in Assumption [1](https://arxiv.org/html/2606.17419#Thmassumption1).

### 2.3 Neural network architecture

Our goal is to construct a parametric neural operator

$$\mathcal{G}_{\bm{\theta}}:\prod_{i=1}^{\lambda}\mathcal{X}_{i}\longrightarrow W^{n_{\lambda+1},\infty}(\Omega_{\lambda+1}).$$

We first describe the two-input case $\lambda=2$, which is the main case used in the detailed construction below. The architecture is similar in spirit to the multiple-input operator architecture considered in [[47](https://arxiv.org/html/2606.17419#bib.bib47)].

In view of the discretization procedure introduced later, for two inputs we consider neural operators of the form

$$\mathcal{G}_{\bm{\theta}}(f,g)(\bm{x})=\sum_{s=1}^{J}\sum_{h=1}^{H}\sum_{p=1}^{P}e_{h,p,s}\,\mathcal{E}(\mathcal{D}_{\bar{m}}f;\bm{\theta}_{0,h})\,\mathcal{B}(\mathcal{D}_{m}g;\bm{\theta}_{1,s})\,\mathcal{T}(\bm{x};\bm{\theta}_{2,p}). \tag{7}$$

Here $\mathcal{E}$ and $\mathcal{B}$ are branch subnetworks corresponding to the discretized inputs $f$ and $g$, respectively, while $\mathcal{T}$ is the trunk subnetwork associated with the spatial variable $\bm{x}$. The coefficients $e_{h,p,s}\in\mathbb{R}$ are trainable scalars and are independent of the particular input pair $(f,g)$.
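The triple sum in (7) is a contraction of the coefficient tensor with the three vectors of branch and trunk outputs. The following is a minimal sketch under stated assumptions: the name `neural_operator` is hypothetical, the branch and trunk subnetworks are passed as arbitrary callables rather than trained networks, and the coefficient array is stored with axis order $(h,p,s)$ to match the index order of $e_{h,p,s}$.

```python
import numpy as np

def neural_operator(e, branch_f, branch_g, trunk, f_samples, g_samples, x):
    """Evaluate the two-input form (7) at a single point x.

    e         : array of shape (H, P, J) holding the coefficients e_{h,p,s}
    branch_f  : callable mapping the sampled f to the H values E_h
    branch_g  : callable mapping the sampled g to the J values B_s
    trunk     : callable mapping x to the P values T_p
    """
    E = np.asarray(branch_f(f_samples))  # shape (H,)
    B = np.asarray(branch_g(g_samples))  # shape (J,)
    T = np.asarray(trunk(x))             # shape (P,)
    # Sum over h, p, s of e[h, p, s] * E[h] * B[s] * T[p].
    return float(np.einsum("hps,h,s,p->", e, E, B, T))

# Hypothetical stand-ins for the subnetworks, chosen only to exercise the contraction.
rng = np.random.default_rng(0)
H, P, J = 2, 3, 4
e = rng.standard_normal((H, P, J))
branch_f = lambda u: np.array([u.sum(), u.max()])
branch_g = lambda v: np.array([v.mean(), v.min(), v[0], v[-1]])
trunk = lambda x: np.array([1.0, x[0], x[0] * x[1]])

f_samples = rng.standard_normal(9)
g_samples = rng.standard_normal(16)
x = np.array([0.3, -0.2])
value = neural_operator(e, branch_f, branch_g, trunk, f_samples, g_samples, x)
```

Storing $e_{h,p,s}$ as one dense array makes the parameter count $HPJ$ of the coefficient tensor explicit, on top of the parameters of the individual subnetworks.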

Let

$$\mathcal{F}_{E}=\mathcal{F}_{\rm NN}(d_{E,{\rm in}},d_{E,{\rm out}},L_{E},p_{E},K_{E},\kappa_{E},M_{E}),$$

$$\mathcal{F}_{B}=\mathcal{F}_{\rm NN}(d_{B,{\rm in}},d_{B,{\rm out}},L_{B},p_{B},K_{B},\kappa_{B},M_{B}),$$

and

$$\mathcal{F}_{T}=\mathcal{F}_{\rm NN}(d_{T,{\rm in}},d_{T,{\rm out}},L_{T},p_{T},K_{T},\kappa_{T},M_{T})$$

be the prescribed architecture classes for the two branch subnetworks and the trunk subnetwork.

We define the corresponding neural operator class by

$$\mathcal{F}_{G}=\Bigl\{\mathcal{G}_{\bm{\theta}}\;:\;\mathcal{G}_{\bm{\theta}}\text{ is of the form (7)},\ \mathcal{E}(\cdot;\bm{\theta}_{0,h})\in\mathcal{F}_{E},\ \mathcal{B}(\cdot;\bm{\theta}_{1,s})\in\mathcal{F}_{B},\ \mathcal{T}(\cdot;\bm{\theta}_{2,p})\in\mathcal{F}_{T},\ \max_{h,p,s}|e_{h,p,s}|\leq\kappa_{e}\Bigr\}. \tag{8}$$

The precise choices of the architecture parameters

$$L_{E},p_{E},K_{E},\kappa_{E},M_{E},\qquad L_{B},p_{B},K_{B},\kappa_{B},M_{B},\qquad L_{T},p_{T},K_{T},\kappa_{T},M_{T},$$

as well as the coefficient bound $\kappa_{e}$, will be specified in the main theorems.

For a general number of inputs $\lambda\geq 2$, the same construction extends naturally to the shared branch–trunk form

$$\mathcal{G}_{\bm{\theta}}(f_{1},\ldots,f_{\lambda})(\bm{x})=\sum_{s_{1}=1}^{J_{1}}\cdots\sum_{s_{\lambda}=1}^{J_{\lambda}}\sum_{p=1}^{P}e_{s_{1},\ldots,s_{\lambda},p}\prod_{i=1}^{\lambda}\mathcal{B}_{i,s_{i}}(\mathcal{D}_{m_{i}}f_{i})\,\mathcal{T}_{p}(\bm{x}).$$

Here $\mathcal{B}_{i,s_{i}}$ is the branch subnetwork associated with the discretized $i$-th input $\mathcal{D}_{m_{i}}f_{i}$, and $\mathcal{T}_{p}$ is the shared trunk subnetwork associated with the output variable $\bm{x}$.

## 3 Approximation Error

In this section, we establish the approximation error for multi-input neural operators. To make the presentation clear, we first treat the two-input case $\lambda=2$ with $d_{1}=d_{2}=d_{3}=d$, that is, $\Omega_{1}=\Omega_{2}=\Omega_{3}=\Omega$. The extension to the general $\lambda$-input case is given afterwards.

### 3.1 Approximation error for $\lambda=2$

When $\lambda=2$, we use the neural operator architecture introduced in ([8](https://arxiv.org/html/2606.17419#S2.E8)). The approximation error is measured by

$$\inf_{\mathcal{G}_{\bm{\theta}}\in\mathcal{F}_{G}}\sup_{f\in\mathcal{X}_{1},\;g\in\mathcal{X}_{2}}\left\|\mathcal{G}(f,g)-\mathcal{G}_{\bm{\theta}}(f,g)\right\|_{W^{\ell,\infty}(\Omega)}$$

for $\ell\in\{0,1\}$. Since the architecture depends on discretized input functions, we first specify the discretization operator. In this paper, we use tensorized Chebyshev grids. For $N\in\mathbb{N}$, define

$$\bm{x}_{k_{1},\dots,k_{d}}=\left(\cos\left(\frac{(k_{1}+\frac{1}{2})\pi}{N+1}\right),\dots,\cos\left(\frac{(k_{d}+\frac{1}{2})\pi}{N+1}\right)\right),\qquad 0\leq k_{1},\dots,k_{d}\leq N.$$

The total number of grid points is $(N+1)^{d}$. We define the sampling operator by

$$\mathcal{D}_{N}(g)=\bigl(g(\bm{x}_{k_{1},\dots,k_{d}})\bigr)_{0\leq k_{1},\dots,k_{d}\leq N}\in\mathbb{R}^{(N+1)^{d}}.$$
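For concreteness, the tensorized Chebyshev grid and the sampling operator $\mathcal{D}_{N}$ can be sketched in NumPy as below. The function names are hypothetical, and the lexicographic ordering of the $(N+1)^{d}$ samples (with $k_{1}$ varying slowest) is an assumption, since the text does not fix an ordering of the entries.

```python
import itertools
import numpy as np

def chebyshev_nodes(N):
    """One-dimensional nodes cos((k + 1/2) * pi / (N + 1)) for k = 0, ..., N."""
    k = np.arange(N + 1)
    return np.cos((k + 0.5) * np.pi / (N + 1))

def sample(g, N, d):
    """Sampling operator D_N: evaluate g on the tensorized grid in [-1, 1]^d.

    Returns a vector of length (N + 1)**d; entries follow the lexicographic
    order of (k_1, ..., k_d), which is an assumed convention.
    """
    nodes = chebyshev_nodes(N)
    points = np.array(list(itertools.product(nodes, repeat=d)))  # shape ((N+1)^d, d)
    return np.array([g(x) for x in points])

# Hypothetical input function on [-1, 1]^2.
g = lambda x: x[0] ** 2 + x[1]
values = sample(g, N=1, d=2)
```

The input dimension of a branch network fed with $\mathcal{D}_{N}(g)$ is therefore $(N+1)^{d}$, which is the quantity denoted $m_{*}$ or $\bar{m}_{*}$ in the theorem below.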
We now state the main approximation theorem for the two-input case with $d_{1}=d_{2}=d_{3}=d$; the proof is given in Section [5.1](https://arxiv.org/html/2606.17419#S5.SS1).

###### Theorem 3\.

Let $S,U,L_{i}>0$, $d,\bar{m},m,k_{i},\alpha_{i},\beta_{i}\in\mathbb{N}$ for $i=1,2$, and $\ell\in\{0,1\}$. Set

$$\bar{m}_{*}=(\bar{m}+1)^{d},\qquad m_{*}=(m+1)^{d}.$$

Define

$$A_{m,\bar{m}}^{(g)}:=L_{1}m^{2k_{2}}(\log m)^{d}\left(1+S\bar{m}^{2k_{1}}(\log\bar{m})^{d}\right)^{\alpha_{2}},$$

$$A_{\bar{m},m}^{(f)}:=L_{2}\bar{m}^{2k_{1}}(\log\bar{m})^{d}\left(1+Sm^{2k_{2}}(\log m)^{d}\right)^{\alpha_{1}},$$

and

$$B_{m,\bar{m}}:=\Bigl[1+L_{1}Sm^{2k_{2}}(\log m)^{d}+L_{2}S\bar{m}^{2k_{1}}(\log\bar{m})^{d}\left(1+Sm^{2k_{2}}(\log m)^{d}\right)^{\alpha_{1}}\Bigr]. \tag{9}$$

Furthermore, set

$$R_{\bar{m},m}^{(n_{3})}:=\left(1+S\bar{m}^{2\beta_{1}}(\log\bar{m})^{d}\right)\left(1+Sm^{2\beta_{2}}(\log m)^{d}\right).$$

For any $J,H,P\in\mathbb{N}$, choose branch networks

$$\mathcal{E}_{h}\in\mathcal{F}_{\mathrm{NN}}(\bar{m}_{*},1,L_{E},p_{E},K_{E},\kappa_{E},1),\qquad h=1,\ldots,H,$$

$$\mathcal{B}_{s}\in\mathcal{F}_{\mathrm{NN}}(m_{*},1,L_{B},p_{B},K_{B},\kappa_{B},1),\qquad s=1,\ldots,J,$$

and trunk networks

$$\mathcal{T}_{p}\in\mathcal{F}_{\mathrm{NN}}(d,1,L_{T},p_{T},K_{T},\kappa_{T},M_{T}),\qquad p=1,\ldots,P,$$

and coefficients $e_{h,p,s}\in\mathbb{R}$. Set the branch network architecture for $\mathcal{B}_{s}$ as

$$L_{B}=\mathcal{O}\!\left(m_{*}^{2}\log m_{*}+m_{*}\log J\right),\qquad p_{B}=\mathcal{O}(1),$$

$$K_{B}=\mathcal{O}\!\left(m_{*}^{2}\log m_{*}+m_{*}\log J\right),\qquad \kappa_{B}=\mathcal{O}\!\left(\sqrt{m_{*}}\,B_{m,\bar{m}}J^{1+\frac{1}{m_{*}}}\right).$$

Set the branch network architecture for $\mathcal{E}_{h}$ as

$$L_{E}=\mathcal{O}\!\left(\bar{m}_{*}^{2}\log\bar{m}_{*}+\bar{m}_{*}\log H\right),\qquad p_{E}=\mathcal{O}(1),$$

$$K_{E}=\mathcal{O}\!\left(\bar{m}_{*}^{2}\log\bar{m}_{*}+\bar{m}_{*}\log H\right),\qquad \kappa_{E}=\mathcal{O}\!\left(\sqrt{\bar{m}_{*}}\,B_{m,\bar{m}}H^{1+\frac{1}{\bar{m}_{*}}}\right).$$

Set the trunk network architecture for $\mathcal{T}_{p}$ as

$$L_{T},K_{T}=\mathcal{O}(\log P),\qquad p_{T},M_{T}=\mathcal{O}(1),\qquad \kappa_{T}=\mathcal{O}\!\left(P^{(n_{3}+d-\ell)/d}\right).$$

Moreover, set the coefficient bound as

$$|e_{h,p,s}|\leq\mathcal{O}\bigl(R_{\bar{m},m}^{(n_{3})}\bigr).$$

For any two-input operator $\mathcal{G}$ satisfying Assumptions [1](https://arxiv.org/html/2606.17419#Thmassumption1), [2](https://arxiv.org/html/2606.17419#Thmassumption2), and [3](https://arxiv.org/html/2606.17419#Thmassumption3), there exist branch and trunk networks with architectures specified above such that for any $(f,g)\in\mathcal{X}_{1}\times\mathcal{X}_{2}$, we have

$$\begin{aligned}
&\left\|\mathcal{G}(f,g)-\sum_{s=1}^{J}\sum_{h=1}^{H}\sum_{p=1}^{P}e_{h,p,s}\,\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)\,\mathcal{B}_{s}(\mathcal{D}_{m}g)\,\mathcal{T}_{p}\right\|_{W^{\ell,\infty}(\Omega)}\\
&\qquad\leq C\Bigl[L_{2}\bar{m}^{-n_{1}}(\log\bar{m})^{d}\left(1+Sm^{2k_{2}}(\log m)^{d}\right)^{\alpha_{1}}+L_{1}m^{-n_{2}}(\log m)^{d}\\
&\qquad\qquad+SA_{m,\bar{m}}^{(g)}\sqrt{m_{*}}\,J^{-1/m_{*}}+SA_{\bar{m},m}^{(f)}\sqrt{\bar{m}_{*}}\,H^{-1/\bar{m}_{*}}+R_{\bar{m},m}^{(n_{3})}P^{-\frac{n_{3}-\ell}{d}}\Bigr].
\end{aligned} \tag{10}$$

The resulting neural operator satisfies the output bound

$$\sup_{(f,g)\in\mathcal{X}_{1}\times\mathcal{X}_{2}}\sup_{\bm{x}\in\Omega}\sum_{|\gamma|\leq\ell}\left|\partial^{\gamma}\left[\sum_{s=1}^{J}\sum_{h=1}^{H}\sum_{p=1}^{P}e_{h,p,s}\,\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)\,\mathcal{B}_{s}(\mathcal{D}_{m}g)\,\mathcal{T}_{p}(\bm{x})\right]\right|\leq CR_{\bar{m},m}^{(n_{3})}P^{\ell/d}. \tag{11}$$

Here $C>0$ is independent of $\bar{m},m,J,H,P,f$, and $g$, but may depend on $d,n_{i},k_{i},\alpha_{i},\beta_{i},\ell$, the constants in the assumptions, and the uniform Sobolev bounds of the input classes.

Based on Theorem[3](https://arxiv.org/html/2606.17419#Thmtheorem3), we now state the final balanced approximation result\. It is obtained by choosing the discretization levels and the numbers of branch and trunk basis functions so that the four error contributions are of the same order\.

###### Corollary 1\(Balanced rate in terms of the total number of parameters\)\.

Suppose that the two-input operator $\mathcal{G}$ satisfies Assumptions [1](https://arxiv.org/html/2606.17419#Thmassumption1), [2](https://arxiv.org/html/2606.17419#Thmassumption2), and [3](https://arxiv.org/html/2606.17419#Thmassumption3) with $n_{3}>\ell$, $d_{i}=d$, $k_{i}=0$, $\alpha_{i}=0$, and $\beta_{i}=0$. Then there exists a neural operator $\mathcal{G}_{\bm{\theta}}$ of the form ([7](https://arxiv.org/html/2606.17419#S2.E7)) such that, for every $(f,g)\in\mathcal{X}_{1}\times\mathcal{X}_{2}$,

$$\left\|\mathcal{G}(f,g)-\mathcal{G}_{\bm{\theta}}(f,g)\right\|_{W^{\ell,\infty}(\Omega)}\leq C\left(\frac{\log N_{\mathrm{tot}}}{\log\log N_{\mathrm{tot}}}\right)^{-\frac{\min\{n_{1},n_{2}\}}{d}}(\log\log N_{\mathrm{tot}})^{2d}. \tag{12}$$

Here $N_{\mathrm{tot}}$ denotes the total number of parameters of $\mathcal{G}_{\bm{\theta}}$, and $C>0$ is independent of $N_{\mathrm{tot}}$.

Corollary [1](https://arxiv.org/html/2606.17419#Thmcorollary1) is proved in Appendix [A.6](https://arxiv.org/html/2606.17419#A1.SS6).

### 3.2 Approximation error for $\lambda\geq 2$

We now extend the two\-input approximation result to the generalλ\\lambda\-input case\. The construction follows the same strategy as in the caseλ=2\\lambda=2: we first discretize each input function, then approximate the dependence on the resulting finite\-dimensional variables by branch networks, and finally approximate the remaining spatial coefficient functions by trunk networks\. Since the argument is essentially iterative, we only state the result here and defer the proof to the appendix\.

###### Corollary 2 (Balanced rate for the $\lambda$-input case).

Suppose that the $\lambda$-input operator $\mathcal{G}$ satisfies Assumptions [1](https://arxiv.org/html/2606.17419#Thmassumption1), [2](https://arxiv.org/html/2606.17419#Thmassumption2), and [3](https://arxiv.org/html/2606.17419#Thmassumption3). Assume also that $n_{i}>2k_{i}$ for $i=1,\ldots,\lambda$, and $n_{\lambda+1}>\ell$. For each input, let

$$M_{i}:=(m_{i}+1)^{d_{i}}$$

denote the number of sampling points used in the discretization $\mathcal{D}_{m_{i}}f_{i}$. Then there exists a neural operator $\mathcal{G}_{\bm{\theta}}$ of the shared branch–trunk form

$$\mathcal{G}_{\bm{\theta}}(f_{1},\ldots,f_{\lambda})(\bm{x})=\sum_{s_{1}=1}^{J_{1}}\cdots\sum_{s_{\lambda}=1}^{J_{\lambda}}\sum_{p=1}^{P}e_{s_{1},\ldots,s_{\lambda},p}\prod_{i=1}^{\lambda}\mathcal{B}_{i,s_{i}}(\mathcal{D}_{m_{i}}f_{i})\,\mathcal{T}_{p}(\bm{x}), \tag{13}$$

such that, for every $(f_{1},\ldots,f_{\lambda})\in\prod_{i=1}^{\lambda}\mathcal{X}_{i}$,

$$\begin{aligned}
&\left\|\mathcal{G}(f_{1},\ldots,f_{\lambda})-\mathcal{G}_{\bm{\theta}}(f_{1},\ldots,f_{\lambda})\right\|_{W^{\ell,\infty}(\Omega_{\lambda+1})}\\
&\quad\lesssim\sum_{i=1}^{\lambda}M_{i}^{-\frac{n_{i}-2k_{i}}{d_{i}}}(\log M_{i})^{d_{i}}\left[\prod_{\substack{j=1\\ j\neq i}}^{\lambda}\left(1+SM_{j}^{\frac{2k_{j}}{d_{j}}}(\log M_{j})^{d_{j}}\right)\right]^{\alpha_{i}}\\
&\qquad+\sum_{i=1}^{\lambda}SL_{i}\sqrt{M_{i}}\,M_{i}^{\frac{2k_{i}}{d_{i}}}(\log M_{i})^{d_{i}}J_{i}^{-1/M_{i}}\left[\prod_{\substack{j=1\\ j\neq i}}^{\lambda}\left(1+SM_{j}^{\frac{2k_{j}}{d_{j}}}(\log M_{j})^{d_{j}}\right)\right]^{\alpha_{i}}\\
&\qquad+\left[\prod_{i=1}^{\lambda}\left(1+SM_{i}^{\frac{2\beta_{i}}{d_{i}}}(\log M_{i})^{d_{i}}\right)\right]P^{-\frac{n_{\lambda+1}-\ell}{d_{\lambda+1}}}.
\end{aligned} \tag{14}$$

Here the implicit constant is independent of $M_{i},J_{i},P$.

In general, the balancing of ([14](https://arxiv.org/html/2606.17419#S3.E14)) depends on the interaction among the parameters $\alpha_{i},\beta_{i},k_{i},n_{i},d_{i}$. In the special case $\alpha_{i}=\beta_{i}=0$, $i=1,\ldots,\lambda$, the dominant approximation rate can be balanced explicitly. Let $N_{\mathrm{tot}}$ denote the total number of trainable parameters of $\mathcal{G}_{\bm{\theta}}$. Define

$$Q_{\max}:=\max_{1\leq i\leq\lambda}\frac{d_{i}}{n_{i}-2k_{i}}.$$

Then, up to higher logarithmic factors,

$$\sup_{(f_{1},\ldots,f_{\lambda})\in\prod_{i=1}^{\lambda}\mathcal{X}_{i}}\left\|\mathcal{G}(f_{1},\ldots,f_{\lambda})-\mathcal{G}_{\bm{\theta}}(f_{1},\ldots,f_{\lambda})\right\|_{W^{\ell,\infty}(\Omega_{\lambda+1})}\lesssim\left(\frac{\log N_{\mathrm{tot}}}{\log\log N_{\mathrm{tot}}}\right)^{-1/Q_{\max}}(\log\log N_{\mathrm{tot}})^{\sum_{i=1}^{\lambda}d_{i}}.$$
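To get a feeling for how slowly a $\log\log/\log$-type rate decays, the right-hand side of the balanced bound can be evaluated numerically. The sketch below is only an illustration for the special case $\alpha_{i}=\beta_{i}=0$: it drops the implicit constant and the higher logarithmic factors, so the numbers indicate a trend and are not actual error bounds. The function names `q_max` and `balanced_rate` are hypothetical.

```python
import math

def q_max(d, n, k):
    """Q_max = max_i d_i / (n_i - 2 k_i); requires n_i > 2 k_i for every input."""
    if any(ni <= 2 * ki for ni, ki in zip(n, k)):
        raise ValueError("the corollary assumes n_i > 2 k_i for every input")
    return max(di / (ni - 2 * ki) for di, ni, ki in zip(d, n, k))

def balanced_rate(n_tot, d, n, k):
    """(log N / log log N)^(-1/Q_max) * (log log N)^(sum_i d_i), constants omitted."""
    if n_tot <= math.e:
        raise ValueError("n_tot must exceed e so that log log n_tot is positive")
    loglog = math.log(math.log(n_tot))
    ratio = math.log(n_tot) / loglog
    return ratio ** (-1.0 / q_max(d, n, k)) * loglog ** sum(d)

# Hypothetical single-input configuration: d_1 = 1, n_1 = 4, k_1 = 0.
small = balanced_rate(1e6, d=[1], n=[4], k=[0])
large = balanced_rate(1e12, d=[1], n=[4], k=[0])
```

In this toy configuration, squaring the parameter count from $10^{6}$ to $10^{12}$ reduces the displayed quantity by less than one order of magnitude. The input with the largest ratio $d_{i}/(n_{i}-2k_{i})$ determines $Q_{\max}$ and therefore the exponent.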

Corollary [2](https://arxiv.org/html/2606.17419#Thmcorollary2) is proved in Appendix [A.6](https://arxiv.org/html/2606.17419#A1.SS6).

## 4 Generalization Error

Let $\mu_{i}$ be a probability measure on $\mathcal{X}_{i}$, $i=1,\ldots,\lambda$, and let $\mu_{z}$ be a probability measure on $\Omega_{\lambda+1}$. We consider the following hierarchical Sobolev training setting.

###### Setting 1 ($\lambda$-input hierarchical Sobolev training setting).

Let

$$\mathcal{G}:\prod_{r=1}^{\lambda}\mathcal{X}_{r}\to W^{\ell,\infty}(\Omega_{\lambda+1})$$

be a $\lambda$-input operator, where $\ell\in\{0,1\}$. Suppose that the training data are generated hierarchically as follows. First, sample

$$\{f^{(1)}_{i_{1}}\}_{i_{1}=1}^{n_{1}^{\rm samp}}\subset\mathcal{X}_{1}.$$

For each $f^{(1)}_{i_{1}}$, sample

$$\{f^{(2)}_{i_{1},i_{2}}\}_{i_{2}=1}^{n_{2}^{\rm samp}}\subset\mathcal{X}_{2}.$$

Proceeding recursively, for each $(f^{(1)}_{i_{1}},\ldots,f^{(r-1)}_{i_{1},\ldots,i_{r-1}})$, sample

$$\{f^{(r)}_{i_{1},\ldots,i_{r}}\}_{i_{r}=1}^{n_{r}^{\rm samp}}\subset\mathcal{X}_{r},\qquad r=2,\ldots,\lambda.$$

Finally, for each sampled $\lambda$-tuple, sample output locations

$$\{\bm{z}_{i_{1},\ldots,i_{\lambda},k}\}_{k=1}^{n_{z}}\subset\Omega_{\lambda+1}.$$

For every multi-index $\gamma$ with $|\gamma|\leq\ell$, the observed Sobolev training labels are

$$y_{i_{1},\ldots,i_{\lambda},k}^{(\gamma)}=\partial^{\gamma}\mathcal{G}\left(f^{(1)}_{i_{1}},f^{(2)}_{i_{1},i_{2}},\ldots,f^{(\lambda)}_{i_{1},\ldots,i_{\lambda}}\right)(\bm{z}_{i_{1},\ldots,i_{\lambda},k})+\varepsilon_{i_{1},\ldots,i_{\lambda},k}^{(\gamma)}.$$

We assume that, for each $|\gamma|\leq\ell$, the noise variables $\varepsilon_{i_{1},\ldots,i_{\lambda},k}^{(\gamma)}$ are independent, mean-zero, and sub-Gaussian with variance proxy bounded by $\sigma^{2}$.

For a neural operator class $\mathcal{F}_{G}$, define the Sobolev empirical risk by

$$\mathcal{L}_{\mathcal{S}}(\mathcal{G}_{\rm NN}):=\frac{1}{N_{\rm train}}\sum_{i_{1}=1}^{n_{1}^{\rm samp}}\cdots\sum_{i_{\lambda}=1}^{n_{\lambda}^{\rm samp}}\sum_{k=1}^{n_{z}}\sum_{|\gamma|\leq\ell}\left|\partial^{\gamma}\mathcal{G}_{\rm NN}\left(\mathcal{D}_{m_{1}}f^{(1)}_{i_{1}},\ldots,\mathcal{D}_{m_{\lambda}}f^{(\lambda)}_{i_{1},\ldots,i_{\lambda}}\right)(\bm{z}_{i_{1},\ldots,i_{\lambda},k})-y_{i_{1},\ldots,i_{\lambda},k}^{(\gamma)}\right|^{2}.$$

The empirical risk minimizer is defined by

$$\widehat{\mathcal{G}}_{\mathcal{S}}\in\operatorname*{arg\,min}_{\mathcal{G}_{\rm NN}\in\mathcal{F}_{G}}\mathcal{L}_{\mathcal{S}}(\mathcal{G}_{\rm NN}).$$

We assume that the total number of inner samples and output locations does not grow too fast compared with the number of outer samples. More precisely, for some sufficiently small $\delta>0$,

$$\log\left(n_{z}\prod_{r=2}^{\lambda}n_{r}^{\rm samp}\right)\leq\left(n_{1}^{\rm samp}\right)^{\delta}.$$
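The index structure of the hierarchical sampling scheme and the growth condition on the inner sample sizes can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, indices are 0-based, and only the index tuples $(i_{1},\ldots,i_{\lambda},k)$ are enumerated, not the functions or labels attached to them.

```python
import itertools
import math

def hierarchical_indices(n_samp, n_z):
    """Iterate over index tuples (i_1, ..., i_lambda, k), one per output location.

    n_samp : list [n_1^samp, ..., n_lambda^samp] of per-level sample sizes
    n_z    : number of output locations per sampled lambda-tuple
    """
    ranges = [range(n) for n in n_samp] + [range(n_z)]
    return itertools.product(*ranges)

def growth_condition_holds(n_samp, n_z, delta):
    """Check log(n_z * prod_{r >= 2} n_r^samp) <= (n_1^samp)^delta."""
    inner = n_z * math.prod(n_samp[1:])
    return math.log(inner) <= n_samp[0] ** delta

# Hypothetical two-input design: 1000 outer samples, 10 inner samples, 20 locations.
n_samp, n_z = [1000, 10], 20
num_locations = sum(1 for _ in hierarchical_indices([3, 2], 4))
ok_loose = growth_condition_holds(n_samp, n_z, delta=0.3)
ok_tight = growth_condition_holds(n_samp, n_z, delta=0.1)
```

With these hypothetical sizes, $\log(200)\approx 5.3$ is below $1000^{0.3}\approx 7.9$ but above $1000^{0.1}\approx 2.0$, so the condition holds for $\delta=0.3$ and fails for $\delta=0.1$.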

Setting [1](https://arxiv.org/html/2606.17419#Thmsetting1) is motivated by common sampling structures in scientific computing. For example, consider

$$-\Delta u+a(x)u=f,\qquad u|_{\partial\Omega}=0,$$

with solution operator $\mathcal{G}(a,f)=u$. The coefficient $a(\bm{x})$ may represent a fixed physical system, material property, or medium parameter. For each fixed $a$, one may observe several solutions corresponding to different source terms $f$. Thus the data are naturally hierarchical: one first samples coefficients $a$, and then, for each coefficient, samples several source terms $f$ and their corresponding solution observations. In this case, $a$ is the outer input with sample size $n_{1}^{\rm samp}$, while $f$ is the inner input with sample size $n_{2}^{\rm samp}$.

###### Theorem 4 (Sobolev generalization error for the two-input case).

Consider Setting [1](https://arxiv.org/html/2606.17419#Thmsetting1) with \(\lambda=2\) and \(\ell\in\{0,1\}\). For simplicity, write

\[
n_f := n_1^{\rm samp}, \qquad n_g := n_2^{\rm samp}.
\]

Suppose that the two-input operator \(\mathcal{G}:\mathcal{X}\times\mathcal{Y}\to W^{\ell,\infty}(\Omega)\) satisfies the assumptions of Corollary [1](https://arxiv.org/html/2606.17419#Thmcorollary1) in the \(W^{\ell,\infty}\)-output setting. Let \(\mathcal{F}_{G}\) be the neural operator class consisting of functions of the form

\[
\mathcal{G}_{\rm NN}(f,g)(\bm{z}) = \sum_{s=1}^{J}\sum_{h=1}^{H}\sum_{p=1}^{P} e_{h,p,s}\,\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)\,\mathcal{B}_{s}(\mathcal{D}_{m}g)\,\mathcal{T}_{p}(\bm{z}).
\]

Let \(\widehat{\mathcal{G}}_{\mathcal{S}}\in\mathcal{F}_{G}\) be the Sobolev empirical risk minimizer defined in Setting [1](https://arxiv.org/html/2606.17419#Thmsetting1).

Choose \(n_g, n_z\) so that \(\log(n_g n_z)\le n_f^{\delta}\), and \(H=n_f^{c}\) with \(c<1-\delta\). Set \(\Lambda_f := \log n_f/\log\log n_f\). The discretization levels and branch ranks can be chosen such that

\[
m_* \asymp \Lambda_f^{d/n_2}, \qquad \bar{m}_* \asymp \Lambda_f^{d/n_1}, \qquad \log J = \mathcal{O}\!\left(\Lambda_f^{d/n_2}\log\log n_f\right).
\]

The trunk rank satisfies \(P=(\log n_f)^{C_P+o(1)}\) for some constant \(C_P>0\). In particular, \(P=n_f^{o(1)}\) and \(\log P=\mathcal{O}(\log\log n_f)\). The \(g\)-branch networks \(\mathcal{B}_s\) may be chosen with \(p_B=\mathcal{O}(1)\),

\[
L_B, K_B = \mathcal{O}\!\left(\Lambda_f^{2d/n_2}\log\log n_f\right), \qquad \log\kappa_B = \mathcal{O}\!\left(\Lambda_f^{d/n_2}\log\log n_f\right).
\]

The \(f\)-branch networks \(\mathcal{E}_h\) may be chosen with \(p_E=\mathcal{O}(1)\),

\[
L_E, K_E = \mathcal{O}\!\left(\Lambda_f^{2d/n_1}\log\log n_f + \Lambda_f^{d/n_1}\log n_f\right), \qquad \log\kappa_E = \mathcal{O}(\log n_f).
\]

The trunk networks \(\mathcal{T}_p\) may be chosen with

\[
L_T, K_T = \mathcal{O}(\log\log n_f), \qquad p_T = \mathcal{O}(1), \qquad \log\kappa_T = \mathcal{O}(\log\log n_f), \qquad M_T \le C.
\]
Then the Sobolev empirical risk minimizer satisfies

\[
\mathbb{E}_{\mathcal{S}}\mathbb{E}_{f\sim\mu_f}\mathbb{E}_{g\sim\mu_g}\mathbb{E}_{\bm{z}\sim\mu_z} \sum_{|\gamma|\le\ell} \left| \partial_{\bm{z}}^{\gamma}\mathcal{G}(f,g)(\bm{z}) - \partial_{\bm{z}}^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}(\mathcal{D}_{\bar{m}}f,\mathcal{D}_{m}g)(\bm{z}) \right|^{2} \le C \left( \frac{\log n_f}{\log\log n_f} \right)^{-\frac{2}{\max\left\{\frac{d}{n_1},\frac{d}{n_2}\right\}}} (\log\log n_f)^{4d}.
\]

Here \(C>0\) is independent of \(n_f, n_g, n_z\).
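To get a feeling for the size of the right-hand side, the bound can be evaluated numerically. The sketch below is only illustrative: it sets the unspecified constant \(C\) to 1, uses natural logarithms, and the function name `two_input_rate` is hypothetical.

```python
import math


def two_input_rate(n_f, d, n1, n2):
    """Evaluate (log n_f / log log n_f)^(-2/q) * (log log n_f)^(4d), q = max(d/n1, d/n2).

    The constant C of the theorem is unknown and is set to 1 here.
    """
    if n_f <= math.e:
        raise ValueError("need n_f > e so that log log n_f > 0")
    log_n = math.log(n_f)
    loglog_n = math.log(log_n)
    q = max(d / n1, d / n2)
    return (log_n / loglog_n) ** (-2.0 / q) * loglog_n ** (4 * d)


# The bound depends on n_f only through log n_f and log log n_f.
for n_f in (1e3, 1e6, 1e12):
    print(f"n_f = {n_f:.0e}: bound with C = 1 is {two_input_rate(n_f, d=1, n1=2, n2=4):.4f}")
```

Because the exponent involves \(\max\{d/n_1, d/n_2\}\), only the less regular of the two inputs affects the value: changing the larger of \(n_1, n_2\) leaves the output unchanged.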

Theorem [4](https://arxiv.org/html/2606.17419#Thmtheorem4) is proved in Section [5.3](https://arxiv.org/html/2606.17419#S5.SS3). In Setting [1](https://arxiv.org/html/2606.17419#Thmsetting1), the generalization error achieves the same convergence rate as the approximation error. Combining this observation with Corollary [2](https://arxiv.org/html/2606.17419#Thmcorollary2), we obtain the following generalization error estimate.

###### Corollary 3 (Sobolev generalization error for the \(\lambda\)-input case).

Consider Setting [1](https://arxiv.org/html/2606.17419#Thmsetting1) for a general \(\lambda\ge 2\) and \(\ell\in\{0,1\}\). Suppose that the \(\lambda\)-input operator \(\mathcal{G}:\prod_{i=1}^{\lambda}\mathcal{X}_i\to W^{\ell,\infty}(\Omega_{\lambda+1})\) satisfies the same assumptions as in Corollary [2](https://arxiv.org/html/2606.17419#Thmcorollary2). In particular, assume that \(\alpha_i=\beta_i=0\) for \(i=1,\ldots,\lambda\). Let \(n_1^{\rm samp}\) denote the number of outer training samples in Setting [1](https://arxiv.org/html/2606.17419#Thmsetting1), and define

\[
Q_{\max} := \max_{1\le i\le\lambda} \frac{d_i}{n_i-2k_i}.
\]

Then one can choose a shared branch–trunk ReLU neural operator class \(\mathcal{F}_{G}^{(\lambda)}\) of the form

\[
\mathcal{G}_{\bm{\theta}}(f_1,\ldots,f_\lambda)(\bm{x}) = \sum_{s_1=1}^{J_1}\cdots\sum_{s_\lambda=1}^{J_\lambda}\sum_{p=1}^{P} e_{s_1,\ldots,s_\lambda,p} \prod_{i=1}^{\lambda}\mathcal{B}_{i,s_i}(\mathcal{D}_{m_i}f_i)\,\mathcal{T}_p(\bm{x}),
\]

where the input-wise discretization levels, branch ranks, and trunk rank are balanced as in Corollary [2](https://arxiv.org/html/2606.17419#Thmcorollary2). Let \(\widehat{\mathcal{G}}_{\mathcal{S}}\in\mathcal{F}_{G}^{(\lambda)}\) be the Sobolev empirical risk minimizer defined in Setting [1](https://arxiv.org/html/2606.17419#Thmsetting1). Then

\[
\begin{aligned}
&\mathbb{E}_{\mathcal{S}}\mathbb{E}_{\begin{subarray}{c}f_i\sim\mu_i,\ i=1,\ldots,\lambda\\ \bm{z}\sim\mu_z\end{subarray}} \left[ \sum_{|\gamma|\le\ell} \left| \partial^{\gamma}\mathcal{G}(f_1,\ldots,f_\lambda)(\bm{z}) - \partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}(\mathcal{D}_{m_1}f_1,\ldots,\mathcal{D}_{m_\lambda}f_\lambda)(\bm{z}) \right|^{2} \right] \\
&\qquad \le C \left( \frac{\log n_1^{\rm samp}}{\log\log n_1^{\rm samp}} \right)^{-\frac{2}{Q_{\max}}} (\log\log n_1^{\rm samp})^{2\sum_{i=1}^{\lambda}d_i}.
\end{aligned}
\tag{15}
\]

Here \(C>0\) is independent of the sample sizes.
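Once the branch and trunk networks have been evaluated, the \(\lambda\)-input form above is a tensor contraction. The following sketch assumes that these network outputs are already available as plain arrays; the function name `evaluate_operator` and the storage order of the coefficient tensor, with shape \((J_1,\ldots,J_\lambda,P)\), are illustrative choices rather than part of the corollary.

```python
import numpy as np


def evaluate_operator(coeff, branch_outputs, trunk_outputs):
    """Contract e_{s_1,...,s_lambda,p} with branch and trunk outputs.

    coeff:          array of shape (J_1, ..., J_lambda, P)
    branch_outputs: list of lambda 1-D arrays; the i-th has length J_i and
                    holds B_{i,s}(D_{m_i} f_i) for s = 1, ..., J_i
    trunk_outputs:  array of shape (n_points, P) holding T_p(x) at each x
    Returns the operator output at the n_points locations.
    """
    out = np.asarray(coeff, dtype=float)
    for b in branch_outputs:
        # Sum over the leading branch index s_i; the remaining axes shift left.
        out = np.tensordot(np.asarray(b, dtype=float), out, axes=(0, 0))
    # Only the trunk index p is left: out has shape (P,).
    return np.asarray(trunk_outputs, dtype=float) @ out


# Toy example with lambda = 3 inputs and random placeholder values.
rng = np.random.default_rng(0)
J1, J2, J3, P, n_points = 3, 4, 2, 5, 7
coeff = rng.normal(size=(J1, J2, J3, P))
b1, b2, b3 = rng.uniform(size=J1), rng.uniform(size=J2), rng.uniform(size=J3)
T = rng.normal(size=(n_points, P))
values = evaluate_operator(coeff, [b1, b2, b3], T)
```

Since the branch factors do not depend on \(\bm{x}\), a spatial derivative of the output is obtained by passing the corresponding derivatives of the trunk networks as `trunk_outputs`.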

## 5 Proof of Main Results

### 5.1 Proof of Theorem [3](https://arxiv.org/html/2606.17419#Thmtheorem3)

The proof of Theorem [3](https://arxiv.org/html/2606.17419#Thmtheorem3) is based on four approximation steps. Let

\[
\mathcal{P}_{N}:\mathbb{R}^{(N+1)^{d}}\to\overline{Q}_{N}
\]

be the multi-dimensional Chebyshev interpolation operator associated with the tensorized Chebyshev grids, and define

\[
f_{\bar{m}} := \mathcal{P}_{\bar{m}}\mathcal{D}_{\bar{m}}f, \qquad g_{m} := \mathcal{P}_{m}\mathcal{D}_{m}g.
\]

We decompose the error as

\[
\begin{aligned}
&\left\| \mathcal{G}(f,g) - \sum_{s=1}^{J}\sum_{h=1}^{H}\sum_{p=1}^{P} e_{h,p,s}\,\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)\,\mathcal{B}_{s}(\mathcal{D}_{m}g)\,\mathcal{T}_{p} \right\|_{W^{\ell,\infty}(\Omega)} \\
\le\;& \underbrace{\left\| \mathcal{G}(f,g) - \mathcal{G}(f_{\bar{m}},g_{m}) \right\|_{W^{\ell,\infty}(\Omega)}}_{\text{Step 1}} + \underbrace{\left\| \mathcal{G}(f_{\bar{m}},g_{m}) - \sum_{s=1}^{J}\mathfrak{d}_{s}(F_{m}^{f_{\bar{m}},\bm{x}})\,\mathcal{B}_{s}(\mathcal{D}_{m}g) \right\|_{W^{\ell,\infty}(\Omega)}}_{\text{Step 2}} \\
&+ \underbrace{\left\| \sum_{s=1}^{J}\left[ \mathfrak{d}_{s}(F_{m}^{f_{\bar{m}},\bm{x}}) - \sum_{h=1}^{H} e_{h,s}(\bm{x})\,\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f) \right]\mathcal{B}_{s}(\mathcal{D}_{m}g) \right\|_{W^{\ell,\infty}(\Omega)}}_{\text{Step 3}} \\
&+ \underbrace{\left\| \sum_{s=1}^{J}\sum_{h=1}^{H}\left[ e_{h,s}(\bm{x}) - \sum_{p=1}^{P} e_{h,p,s}\,\mathcal{T}_{p}(\bm{x}) \right]\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)\,\mathcal{B}_{s}(\mathcal{D}_{m}g) \right\|_{W^{\ell,\infty}(\Omega)}}_{\text{Step 4}}.
\end{aligned}
\tag{16}
\]

Here \(\mathfrak{d}_{s}(F_{m}^{f_{\bar{m}},\bm{x}})\), which depends on \(f_{\bar{m}}\) and \(\bm{x}\), is a coefficient used to approximate \(\mathcal{G}(f_{\bar{m}},g_{m})\) along the second argument. Its expression is presented in Section [5.1.2](https://arxiv.org/html/2606.17419#S5.SS1.SSS2). The four terms correspond to the four main steps of the construction. In Step 1, we discretize the input functions \(f\) and \(g\). The approximants \(f_{\bar{m}}\) and \(g_{m}\) are determined solely by the finite pointwise information \(\mathcal{D}_{\bar{m}}f\) and \(\mathcal{D}_{m}g\), respectively. In Step 2, for each fixed \(f_{\bar{m}}\) and \(\bm{x}\), we regard \(\mathcal{G}(f_{\bar{m}},g_{m})(\bm{x})\) as a function of the finite-dimensional variable \(\mathcal{D}_{m}g\), and approximate it by a basis expansion whose coefficients depend only on \(f_{\bar{m}}\) and \(\bm{x}\). In Step 3, we view these coefficients as functions of the finite-dimensional variable \(\mathcal{D}_{\bar{m}}f\), and apply the same approximation procedure again. In Step 4, the remaining coefficient functions depend only on the spatial variable \(\bm{x}\), and are approximated by trunk networks.

We now state the proposition corresponding to each step. The detailed proofs of these four steps are deferred to the appendix.

#### 5.1.1 Step 1

In this step, we estimate the discretization error. We have the following proposition for Step 1 (see Appendix [A.2](https://arxiv.org/html/2606.17419#A1.SS2) for a proof):

###### Proposition 1 (Approximation error by Chebyshev interpolation).

Suppose Assumptions [1](https://arxiv.org/html/2606.17419#Thmassumption1) and [2](https://arxiv.org/html/2606.17419#Thmassumption2) hold. We have

\[
\left\| \mathcal{G}(f,g) - \mathcal{G}(f_{\bar{m}},g_{m}) \right\|_{W^{\ell,\infty}(\Omega)} \le CU\left[ L_{1}\,\bar{m}^{-n_{1}}(\log\bar{m})^{d}\left( 1 + S m^{2k_{2}}(\log m)^{d} \right)^{\alpha_{1}} + L_{2}\,m^{-n_{2}}(\log m)^{d} \right].
\tag{17}
\]
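The algebraic decay of the discretization error in the number of grid points can be observed in a one-dimensional experiment. The sketch below is an illustration under assumptions: it uses NumPy's `chebinterpolate`, which samples at Chebyshev points of the first kind on \([-1,1]\) (the grid used in the paper may differ), and the test function \(|x|^{3}\) is chosen here only because it has limited Sobolev regularity.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb


def cheb_interp_error(func, degree, n_test=2001):
    """Sup-norm error on [-1, 1] of the degree-`degree` Chebyshev interpolant."""
    coeffs = cheb.chebinterpolate(func, degree)  # uses degree + 1 sample points
    x = np.linspace(-1.0, 1.0, n_test)
    return float(np.max(np.abs(func(x) - cheb.chebval(x, coeffs))))


def rough(x):
    # Third derivative is bounded but discontinuous at x = 0.
    return np.abs(x) ** 3


errors = {N: cheb_interp_error(rough, N) for N in (8, 16, 32, 64)}
for N, err in errors.items():
    print(f"degree {N:3d}: max error {err:.2e}")
```

The interpolant is determined by the sampled values alone, which is the role played by \(\mathcal{P}_{\bar{m}}\mathcal{D}_{\bar{m}}f\) and \(\mathcal{P}_{m}\mathcal{D}_{m}g\) in Step 1.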

#### 5.1.2 Step 2

Next, we estimate the error arising from the approximation of the \(g\)-dependent coefficient.

###### Proposition 2 (Approximation of the \(g\)-dependent coefficient).

Assume that Assumption [1](https://arxiv.org/html/2606.17419#Thmassumption1) holds with the \(W^{\ell,\infty}(\Omega)\)-norm on the left-hand side, where \(\ell\in\mathbb{N}\). Assume also that \(J\) is large enough. Fix \(f_{\bar{m}}\), and define \(F_{m}^{f_{\bar{m}},\bm{x}}(\bm{z}) := \mathcal{G}(f_{\bar{m}},\mathcal{P}_{m}\bm{z})(\bm{x})\) for \(\bm{z}\in[-S,S]^{m_*}\). Set

\[
A_{m,\bar{m}}^{(g)} := L_{1} m^{2k_{2}}(\log m)^{d}\left( 1 + S\bar{m}^{2k_{1}}(\log\bar{m})^{d} \right)^{\alpha_{2}},
\]

and

\[
B_{m,\bar{m}}^{(g)} := 1 + L_{1} S m^{2k_{2}}(\log m)^{d} + L_{2} S \bar{m}^{2k_{1}}(\log\bar{m})^{d}\left( 1 + S m^{2k_{2}}(\log m)^{d} \right)^{\alpha_{1}}.
\tag{18}
\]

Then, for any \(J\in\mathbb{N}\), there exist points \(\bm{z}_{1},\ldots,\bm{z}_{J}\in[-S,S]^{m_*}\) and branch networks \(\mathcal{B}_{s}\in\mathcal{F}_{\mathrm{NN}}(m_*,1,L_B,p_B,K_B,\kappa_B,1)\), \(s=1,\ldots,J\), such that, with

\[
\mathfrak{d}_{s}(F_{m}^{f_{\bar{m}},\bm{x}}) := F_{m}^{f_{\bar{m}},\bm{x}}(\bm{z}_{s}) = \mathcal{G}(f_{\bar{m}},\mathcal{P}_{m}\bm{z}_{s})(\bm{x}),
\]

we have

\[
\left\| \mathcal{G}(f_{\bar{m}},g_{m}) - \sum_{s=1}^{J}\mathfrak{d}_{s}(F_{m}^{f_{\bar{m}},\cdot})\,\mathcal{B}_{s}(\mathcal{D}_{m}g) \right\|_{W^{\ell,\infty}(\Omega)} \le C S A_{m,\bar{m}}^{(g)}\sqrt{m_*}\,J^{-1/m_*}.
\tag{19}
\]

Moreover, the branch networks satisfy \(\left\|\sum_{s=1}^{J}\mathcal{B}_{s}\right\|_{L^{\infty}([-S,S]^{m_*})}\le 2\) and \(\mathcal{B}_{s}\ge 0\). Furthermore, they can be chosen with \(p_B=\mathcal{O}(1)\),

\[
L_B = \mathcal{O}\!\left( m_*^{2}\log m_* + m_*\log J \right), \qquad K_B = \mathcal{O}\!\left( m_*^{2}\log m_* + m_*\log J \right),
\]

and

\[
\kappa_B = \mathcal{O}\!\left( \sqrt{m_*}\,B_{m,\bar{m}}^{(g)}\,J^{1+\frac{1}{m_*}} \right).
\]

The implicit constants are independent of \(m\), \(m_*\), \(J\), and \(f_{\bar{m}}\).

Proposition [2](https://arxiv.org/html/2606.17419#Thmproposition2) is proved in Appendix [A.3](https://arxiv.org/html/2606.17419#A1.SS3).
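The mechanism behind the factor \(J^{-1/m_*}\) can be illustrated with a nonnegative partition of unity on a uniform grid of \([-S,S]^{m_*}\): the expansion \(\sum_s F(\bm{z}_s)\mathcal{B}_s(\bm{z})\) has error proportional to the grid spacing, and a grid with \(J\) points in dimension \(m_*\) has spacing of order \(J^{-1/m_*}\). The sketch below takes \(m_*=2\) and uses exact products of one-dimensional ReLU hat functions. This is an assumption made for readability; the construction in Appendix A.3 works with ReLU networks and may differ in its details. The function `F` is a hypothetical Lipschitz stand-in for \(F_{m}^{f_{\bar{m}},\bm{x}}\).

```python
import numpy as np


def relu(t):
    return np.maximum(t, 0.0)


def hat_partition(z, centers):
    """1-D hat functions on a uniform grid, written with three ReLUs each.

    z: shape (n,), centers: shape (K,). Returns an array of shape (n, K).
    """
    h = centers[1] - centers[0]
    t = (z[:, None] - centers[None, :]) / h
    return relu(t + 1.0) - 2.0 * relu(t) + relu(t - 1.0)


def pou_approximation(F, z, K, S=1.0):
    """Approximate F on [-S, S]^2 by sum_s F(z_s) B_s(z) with J = K**2 nodes."""
    centers = np.linspace(-S, S, K)
    b1 = hat_partition(z[:, 0], centers)
    b2 = hat_partition(z[:, 1], centers)
    B = b1[:, :, None] * b2[:, None, :]  # product hats, shape (n, K, K)
    Z1, Z2 = np.meshgrid(centers, centers, indexing="ij")
    return np.einsum("nij,ij->n", B, F(Z1, Z2)), B


def F(z1, z2):
    # Hypothetical Lipschitz target with a kink away from the grid nodes.
    return np.sin(2.0 * z1) + np.abs(z2 - 0.3)


rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=(2000, 2))
errors = {}
for K in (5, 9, 17, 33):
    approx, B = pou_approximation(F, z, K)
    errors[K] = float(np.max(np.abs(approx - F(z[:, 0], z[:, 1]))))
```

In this sketch the basis functions sum to one and are nonnegative, which mirrors the properties \(\|\sum_s\mathcal{B}_s\|_{L^\infty}\le 2\) and \(\mathcal{B}_s\ge 0\) stated in the proposition.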

#### 5.1.3 Step 3

For the third term, we further approximate the dependence of \(c_{s}^{f_{\bar{m}}}(\bm{x}) := \mathfrak{d}_{s}(F_{m}^{f_{\bar{m}},\bm{x}})\) on \(f_{\bar{m}}\).

###### Proposition 3 (Approximation of the \(f\)-dependent coefficient).

Assume that Assumption [1](https://arxiv.org/html/2606.17419#Thmassumption1) holds with the \(W^{\ell,\infty}(\Omega)\)-norm on the left-hand side, where \(\ell\in\mathbb{N}\). Assume that \(H\) is large enough. Fix \(s\in\{1,\ldots,J\}\), and define \(\bar{F}_{\bar{m},s}^{\bm{x}}(\bm{w}) := \mathfrak{d}_{s}(F_{m}^{\mathcal{P}_{\bar{m}}\bm{w},\bm{x}})\) for \(\bm{w}\in[-S,S]^{\bar{m}_*}\). Set

\[
A_{\bar{m},m}^{(f)} := L_{2}\bar{m}^{2k_{1}}(\log\bar{m})^{d}\left( 1 + S m^{2k_{2}}(\log m)^{d} \right)^{\alpha_{1}},
\]

and

\[
\bar{B}_{\bar{m},m}^{(f)} := 1 + L_{2} S \bar{m}^{2k_{1}}(\log\bar{m})^{d}\left( 1 + S m^{2k_{2}}(\log m)^{d} \right)^{\alpha_{1}} + L_{1} S m^{2k_{2}}(\log m)^{d}.
\tag{20}
\]

Then, for any \(H\in\mathbb{N}\), there exist points \(\bm{w}_{1},\ldots,\bm{w}_{H}\in[-S,S]^{\bar{m}_*}\) and branch networks \(\mathcal{E}_{h}\in\mathcal{F}_{\mathrm{NN}}(\bar{m}_*,1,L_E,p_E,K_E,\kappa_E,1)\), \(h=1,\ldots,H\), such that, with

\[
e_{h,s}(\bm{x}) := \bar{F}_{\bar{m},s}^{\bm{x}}(\bm{w}_{h}) = \mathcal{G}(\mathcal{P}_{\bar{m}}\bm{w}_{h},\mathcal{P}_{m}\bm{z}_{s})(\bm{x}),
\]

we have

\[
\left\| c_{s}^{f_{\bar{m}}} - \sum_{h=1}^{H} e_{h,s}\,\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f) \right\|_{W^{\ell,\infty}(\Omega)} \le C S A_{\bar{m},m}^{(f)}\sqrt{\bar{m}_*}\,H^{-1/\bar{m}_*}.
\tag{21}
\]

Moreover, the branch networks satisfy \(\left\|\sum_{h=1}^{H}\mathcal{E}_{h}\right\|_{L^{\infty}([-S,S]^{\bar{m}_*})}\le 2\) and \(\mathcal{E}_{h}\ge 0\). The networks \(\mathcal{E}_{h}\) can be chosen with \(p_E=\mathcal{O}(1)\),

\[
L_E = \mathcal{O}\!\left( \bar{m}_*^{2}\log\bar{m}_* + \bar{m}_*\log H \right), \qquad K_E = \mathcal{O}\!\left( \bar{m}_*^{2}\log\bar{m}_* + \bar{m}_*\log H \right),
\]

and

\[
\kappa_E = \mathcal{O}\!\left( \sqrt{\bar{m}_*}\,\bar{B}_{\bar{m},m}^{(f)}\,H^{1+\frac{1}{\bar{m}_*}} \right).
\]

The implicit constants are independent of \(m\), \(\bar{m}\), \(H\), \(s\), and \(f\).

Proposition [3](https://arxiv.org/html/2606.17419#Thmproposition3) is proved in Appendix [A.4](https://arxiv.org/html/2606.17419#A1.SS4).

#### 5.1.4 Step 4

For the last term, we approximate the remaining spatial coefficient functions \(e_{h,s}(\bm{x})\) by trunk networks. The trunk networks also satisfy the local-overlap bounds

\[
\sup_{\bm{x}\in\Omega}\sum_{p=1}^{P}|\mathcal{T}_{p}(\bm{x})| \le C,
\]

and, when \(\ell=1\),

\[
\sup_{\bm{x}\in\Omega}\sum_{p=1}^{P}\sum_{r=1}^{d}|\partial_{x_{r}}\mathcal{T}_{p}(\bm{x})| \le C P^{1/d}.
\]

###### Proposition 4 (Trunk approximation of the spatial coefficients).

Assume that Assumption [3](https://arxiv.org/html/2606.17419#Thmassumption3) holds. Let \(\ell\in\{0,1\}\) and assume that \(n_{3}>\ell\). For every \(h=1,\ldots,H\), \(s=1,\ldots,J\), and \(P\ge 2\), there exist coefficients \(e_{h,p,s}\in\mathbb{R}\) and trunk networks \(\mathcal{T}_{p}\in\mathcal{F}_{\mathrm{NN}}(d,1,L_T,p_T,K_T,\kappa_T,M_T)\), \(p=1,\ldots,P\), such that

\[
\left\| e_{h,s} - \sum_{p=1}^{P} e_{h,p,s}\,\mathcal{T}_{p} \right\|_{W^{\ell,\infty}(\Omega)} \le C R_{\bar{m},m}^{(n_{3})} P^{-\frac{n_{3}-\ell}{d}},
\tag{22}
\]

and \(|e_{h,p,s}|\le C R_{\bar{m},m}^{(n_{3})}\) for \(p=1,\ldots,P\), where

\[
R_{\bar{m},m}^{(n_{3})} := \left( 1 + S\bar{m}^{2\beta_{1}}(\log\bar{m})^{d} \right)\left( 1 + S m^{2\beta_{2}}(\log m)^{d} \right).
\]

Moreover, the trunk networks can be chosen to satisfy the local-overlap bounds

\[
\sup_{\bm{x}\in\Omega}\sum_{p=1}^{P}|\mathcal{T}_{p}(\bm{x})| \le C, \quad \sup_{\bm{x}\in\Omega}\sum_{p=1}^{P}\sum_{r=1}^{d}|\partial_{x_{r}}\mathcal{T}_{p}(\bm{x})| \le C P^{1/d}.
\]

The trunk networks can be chosen with \(L_T\le C\log P\), \(p_T\le C\), \(K_T\le C\log P\), \(\kappa_T\le C P^{(n_{3}+d-\ell)/d}\), and \(M_T\le C\). Here \(C>0\) is independent of \(h, s, P, m, \bar{m}\).

Proposition [4](https://arxiv.org/html/2606.17419#Thmproposition4) is proved in Appendix [A.5](https://arxiv.org/html/2606.17419#A1.SS5).
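The two local-overlap bounds can be checked by hand in the simplest case \(d=1\), \(\Omega=[0,1]\), with \(P\) piecewise-linear hat functions on a uniform grid standing in for the trunk networks. This choice is an assumption made for illustration and is not the construction of Appendix A.5. At any point at most two hats are active, so the values sum to one, and the absolute slopes sum to \(2(P-1)\), which is of order \(P^{1/d}\).

```python
import numpy as np


def hat_values_and_slopes(x, P):
    """Values and derivatives of P uniform hat functions on [0, 1].

    x: shape (n,). Returns two arrays of shape (n, P).
    """
    centers = np.linspace(0.0, 1.0, P)
    h = centers[1] - centers[0]
    t = (x[:, None] - centers[None, :]) / h
    values = np.maximum(1.0 - np.abs(t), 0.0)
    # Slope is +1/h on the rising side, -1/h on the falling side, 0 elsewhere.
    slopes = np.where(np.abs(t) < 1.0, -np.sign(t), 0.0) / h
    return values, slopes


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=5000)
overlap = {}
for P in (4, 16, 64):
    values, slopes = hat_values_and_slopes(x, P)
    overlap[P] = (
        float(np.abs(values).sum(axis=1).max()),  # sup_x sum_p |T_p(x)|
        float(np.abs(slopes).sum(axis=1).max()),  # sup_x sum_p |T_p'(x)|
    )
```

The first entry of `overlap[P]` stays bounded as \(P\) grows, while the second grows linearly in \(P\), in line with the two displayed bounds for \(d=1\).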

#### 5.1.5 Proof of Theorem [3](https://arxiv.org/html/2606.17419#Thmtheorem3)

###### Proof of Theorem [3](https://arxiv.org/html/2606.17419#Thmtheorem3).

We prove the result by combining the four approximation steps.

First, define \(f_{\bar{m}} := \mathcal{P}_{\bar{m}}\mathcal{D}_{\bar{m}}f\) and \(g_{m} := \mathcal{P}_{m}\mathcal{D}_{m}g\). By Proposition [1](https://arxiv.org/html/2606.17419#Thmproposition1), we have

\[
\left\| \mathcal{G}(f,g) - \mathcal{G}(f_{\bar{m}},g_{m}) \right\|_{W^{\ell,\infty}(\Omega)} \le C\Bigl[ L_{2}\bar{m}^{-n_{1}}(\log\bar{m})^{d}\left( 1 + S m^{2k_{2}}(\log m)^{d} \right)^{\alpha_{1}} + L_{1} m^{-n_{2}}(\log m)^{d} \Bigr].
\tag{23}
\]

Indeed, the first term corresponds to replacing \(f\) by \(f_{\bar{m}}\) while the second input is held fixed at \(g_{m}\). The factor \(\|g_{m}\|_{W^{k_{2},\infty}}\) is controlled by the Sobolev stability estimate \(\|g_{m}\|_{W^{k_{2},\infty}(\Omega)}\le C S m^{2k_{2}}(\log m)^{d}\).

Next, for fixed \(f_{\bar{m}}\) and \(\bm{x}\in\Omega\), define \(F_{m}^{f_{\bar{m}},\bm{x}}(\bm{z}) := \mathcal{G}(f_{\bar{m}},\mathcal{P}_{m}\bm{z})(\bm{x})\) for \(\bm{z}\in[-S,S]^{m_*}\). By Proposition [2](https://arxiv.org/html/2606.17419#Thmproposition2), there exist points \(\bm{z}_{s}\in[-S,S]^{m_*}\), \(s=1,\ldots,J\), and branch networks

\[
\mathcal{B}_{s}\in\mathcal{F}_{\mathrm{NN}}(m_*,1,L_B,p_B,K_B,\kappa_B,1), \qquad s=1,\ldots,J,
\]

such that, with \(c_{s}^{f_{\bar{m}}}(\bm{x}) := \mathfrak{d}_{s}(F_{m}^{f_{\bar{m}},\bm{x}}) = F_{m}^{f_{\bar{m}},\bm{x}}(\bm{z}_{s})\), we have

\[
\left\| \mathcal{G}(f_{\bar{m}},g_{m}) - \sum_{s=1}^{J} c_{s}^{f_{\bar{m}}}\,\mathcal{B}_{s}(\mathcal{D}_{m}g) \right\|_{W^{\ell,\infty}(\Omega)} \le C S A_{m,\bar{m}}^{(g)}\sqrt{m_*}\,J^{-1/m_*}.
\tag{24}
\]

The network-size bounds for \(\mathcal{B}_{s}\) are precisely those stated in the theorem. Moreover,

\[
\left\| \sum_{s=1}^{J}\mathcal{B}_{s} \right\|_{L^{\infty}([-S,S]^{m_*})} \le 2, \qquad \mathcal{B}_{s}\ge 0.
\]
We then approximate the dependence of \(c_{s}^{f_{\bar{m}}}\) on \(f_{\bar{m}}\). For each \(s=1,\ldots,J\), Proposition [3](https://arxiv.org/html/2606.17419#Thmproposition3) gives points \(\bm{w}_{h}\in[-S,S]^{\bar{m}_*}\), \(h=1,\ldots,H\), and branch networks

\[
\mathcal{E}_{h}\in\mathcal{F}_{\mathrm{NN}}(\bar{m}_*,1,L_E,p_E,K_E,\kappa_E,1), \qquad h=1,\ldots,H,
\]

such that

\[
\left\| c_{s}^{f_{\bar{m}}} - \sum_{h=1}^{H} e_{h,s}\,\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f) \right\|_{W^{\ell,\infty}(\Omega)} \le C S A_{\bar{m},m}^{(f)}\sqrt{\bar{m}_*}\,H^{-1/\bar{m}_*}.
\tag{25}
\]

Using the nonnegativity of \(\mathcal{B}_{s}\) and the bound \(\left\|\sum_{s=1}^{J}\mathcal{B}_{s}\right\|_{L^{\infty}([-S,S]^{m_*})}\le 2\), we obtain

\[
\left\| \sum_{s=1}^{J}\left[ c_{s}^{f_{\bar{m}}} - \sum_{h=1}^{H} e_{h,s}\,\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f) \right]\mathcal{B}_{s}(\mathcal{D}_{m}g) \right\|_{W^{\ell,\infty}(\Omega)} \le C S A_{\bar{m},m}^{(f)}\sqrt{\bar{m}_*}\,H^{-1/\bar{m}_*}.
\tag{26}
\]

Here we used that \(\mathcal{B}_{s}(\mathcal{D}_{m}g)\) is independent of \(\bm{x}\), so the \(W^{\ell,\infty}\)-derivatives act only on the coefficient functions. The corresponding bounds for \(L_E, p_E, K_E, \kappa_E\) are those in Proposition [3](https://arxiv.org/html/2606.17419#Thmproposition3). Moreover,

\[
\left\| \sum_{h=1}^{H}\mathcal{E}_{h} \right\|_{L^{\infty}([-S,S]^{\bar{m}_*})} \le 2, \qquad \mathcal{E}_{h}\ge 0.
\]
It remains to approximate the spatial coefficient functions \(e_{h,s}(\bm{x})\). By Proposition [4](https://arxiv.org/html/2606.17419#Thmproposition4), for each \(h=1,\ldots,H\) and \(s=1,\ldots,J\), there exist coefficients \(e_{h,p,s}\in\mathbb{R}\) and trunk networks

\[
\mathcal{T}_{p}\in\mathcal{F}_{\mathrm{NN}}(d,1,L_T,p_T,K_T,\kappa_T,M_T), \qquad p=1,\ldots,P,
\]

such that

\[
\left\| e_{h,s} - \sum_{p=1}^{P} e_{h,p,s}\,\mathcal{T}_{p} \right\|_{W^{\ell,\infty}(\Omega)} \le C R_{\bar{m},m}^{(n_{3})} P^{-\frac{n_{3}-\ell}{d}},
\tag{27}
\]

and \(|e_{h,p,s}|\le C R_{\bar{m},m}^{(n_{3})}\). Therefore, using \(\left\|\sum_{s=1}^{J}\mathcal{B}_{s}\right\|_{L^{\infty}([-S,S]^{m_*})}\le 2\), \(\left\|\sum_{h=1}^{H}\mathcal{E}_{h}\right\|_{L^{\infty}([-S,S]^{\bar{m}_*})}\le 2\), and \(\mathcal{B}_{s},\mathcal{E}_{h}\ge 0\), we obtain

\[
\begin{aligned}
&\left\| \sum_{s=1}^{J}\sum_{h=1}^{H}\left[ e_{h,s} - \sum_{p=1}^{P} e_{h,p,s}\,\mathcal{T}_{p} \right]\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)\,\mathcal{B}_{s}(\mathcal{D}_{m}g) \right\|_{W^{\ell,\infty}(\Omega)} \\
&\qquad \le C R_{\bar{m},m}^{(n_{3})} P^{-\frac{n_{3}-\ell}{d}} \left\| \sum_{s=1}^{J}\sum_{h=1}^{H}\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)\,\mathcal{B}_{s}(\mathcal{D}_{m}g) \right\|_{L^{\infty}(\Omega)} \\
&\qquad \le C R_{\bar{m},m}^{(n_{3})} P^{-\frac{n_{3}-\ell}{d}} \left\| \sum_{s=1}^{J}\mathcal{B}_{s}(\mathcal{D}_{m}g)\sum_{h=1}^{H}\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f) \right\|_{L^{\infty}(\Omega)} \le C R_{\bar{m},m}^{(n_{3})} P^{-\frac{n_{3}-\ell}{d}}.
\end{aligned}
\tag{28}
\]

Here again the branch factors are independent of \(\bm{x}\), and therefore the \(W^{\ell,\infty}\)-derivatives act only on the trunk approximation error. The stated bounds for \(L_T, p_T, K_T, \kappa_T, M_T\) also follow from Proposition [4](https://arxiv.org/html/2606.17419#Thmproposition4).

Finally, by the triangle inequality, the total error is bounded by the sum of ([23](https://arxiv.org/html/2606.17419#S5.E23)), ([24](https://arxiv.org/html/2606.17419#S5.E24)), ([26](https://arxiv.org/html/2606.17419#S5.E26)), and ([28](https://arxiv.org/html/2606.17419#S5.E28)). This yields ([10](https://arxiv.org/html/2606.17419#S3.E10)).

We also record the output bound of the resulting neural operator. By Proposition [4](https://arxiv.org/html/2606.17419#Thmproposition4), for each \(h,s\),

\[
\sup_{\bm{x}\in\Omega}\sum_{|\gamma|\le\ell}\left| \partial^{\gamma}\left( \sum_{p=1}^{P} e_{h,p,s}\,\mathcal{T}_{p}(\bm{x}) \right) \right| \le C R_{\bar{m},m}^{(n_{3})} P^{\ell/d}.
\]

On the other hand, the partition-of-unity-type bounds for the branch networks give

\[
\sup_{\bm{u}\in[-S,S]^{m_*}}\sum_{s=1}^{J}|\mathcal{B}_{s}(\bm{u})| \le C, \qquad \sup_{\bm{w}\in[-S,S]^{\bar{m}_*}}\sum_{h=1}^{H}|\mathcal{E}_{h}(\bm{w})| \le C.
\]

Since \(\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)\) and \(\mathcal{B}_{s}(\mathcal{D}_{m}g)\) are independent of the output variable \(\bm{x}\), the derivative \(\partial^{\gamma}\) only acts on the trunk expansion. Therefore,

\[
\begin{aligned}
&\sup_{\bm{x}\in\Omega}\sum_{|\gamma|\le\ell}\left| \partial^{\gamma}\left[ \sum_{s=1}^{J}\sum_{h=1}^{H}\sum_{p=1}^{P} e_{h,p,s}\,\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)\,\mathcal{B}_{s}(\mathcal{D}_{m}g)\,\mathcal{T}_{p}(\bm{x}) \right] \right| \\
&\qquad \le \sum_{s=1}^{J}|\mathcal{B}_{s}(\mathcal{D}_{m}g)|\sum_{h=1}^{H}|\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)|\,\sup_{\bm{x}\in\Omega}\sum_{|\gamma|\le\ell}\left| \partial^{\gamma}\left( \sum_{p=1}^{P} e_{h,p,s}\,\mathcal{T}_{p}(\bm{x}) \right) \right| \\
&\qquad \le C R_{\bar{m},m}^{(n_{3})} P^{\ell/d}\left( \sum_{s=1}^{J}|\mathcal{B}_{s}(\mathcal{D}_{m}g)| \right)\left( \sum_{h=1}^{H}|\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)| \right) \\
&\qquad \le C R_{\bar{m},m}^{(n_{3})} P^{\ell/d}.
\end{aligned}
\]

Taking the supremum over \((f,g)\in\mathcal{X}_{1}\times\mathcal{X}_{2}\) gives ([11](https://arxiv.org/html/2606.17419#S3.E11)). The proof is complete. ∎

### 5.2 Proof of Corollary [1](https://arxiv.org/html/2606.17419#Thmcorollary1)

###### Proof of Corollary [1](https://arxiv.org/html/2606.17419#Thmcorollary1).

By Theorem [3](https://arxiv.org/html/2606.17419#Thmtheorem3) with $d_{i}=d$, $k_{i}=0$, $\alpha_{i}=0$, and $\beta_{i}=0$, together with the network-size bounds in the construction, the approximation error satisfies

$$
\begin{aligned}
E_{\mathrm{total}}\lesssim\;&\bar{m}_{*}^{-n_{1}/d}(\log\bar{m}_{*})^{d}+m_{*}^{-n_{2}/d}(\log m_{*})^{d}+SL_{1}\sqrt{m_{*}}(\log m_{*})^{d}J^{-1/m_{*}}\\
&+SL_{2}\sqrt{\bar{m}_{*}}(\log\bar{m}_{*})^{d}H^{-1/\bar{m}_{*}}+\left(1+S(\log\bar{m}_{*})^{d}\right)\left(1+S(\log m_{*})^{d}\right)P^{-\frac{n_{3}-\ell}{d}}.
\end{aligned}
\tag{29}
$$

We first balance the discretization and branch approximation errors in each input variable. Balancing $m_{*}^{-n_{2}/d}(\log m_{*})^{d}$ with $\sqrt{m_{*}}(\log m_{*})^{d}J^{-1/m_{*}}$ gives $J^{1/m_{*}}\asymp m_{*}^{n_{2}/d+1/2}$, or equivalently

$$
\log J\asymp\left(\frac{n_{2}}{d}+\frac{1}{2}\right)m_{*}\log m_{*}.
$$

Similarly, balancing $\bar{m}_{*}^{-n_{1}/d}(\log\bar{m}_{*})^{d}$ with $\sqrt{\bar{m}_{*}}(\log\bar{m}_{*})^{d}H^{-1/\bar{m}_{*}}$ gives $H^{1/\bar{m}_{*}}\asymp\bar{m}_{*}^{n_{1}/d+1/2}$, or equivalently

$$
\log H\asymp\left(\frac{n_{1}}{d}+\frac{1}{2}\right)\bar{m}_{*}\log\bar{m}_{*}.
$$
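The balancing step can be checked numerically. The sketch below is only an illustration: it takes the implied constant in the balance relation to be 1 and sets $SL_{1}=1$, both of which are assumptions, and the function names are hypothetical.

```python
import math

def balanced_log_width(m_star: float, n: float, d: int) -> float:
    """log J (or log H) from J^{1/m} = m^{n/d + 1/2}, with the constant taken as 1."""
    return (n / d + 0.5) * m_star * math.log(m_star)

def discretization_term(m_star: float, n: float, d: int) -> float:
    """The discretization error m^{-n/d} (log m)^d."""
    return m_star ** (-n / d) * math.log(m_star) ** d

def branch_term(m_star: float, log_width: float, d: int) -> float:
    """The branch error sqrt(m) (log m)^d J^{-1/m}, with S * L_1 set to 1 (assumption)."""
    return math.sqrt(m_star) * math.log(m_star) ** d * math.exp(-log_width / m_star)

m_star, n2, d = 64.0, 2.0, 2
log_J = balanced_log_width(m_star, n2, d)
print("log J            :", log_J)
print("discretization   :", discretization_term(m_star, n2, d))
print("branch (balanced):", branch_term(m_star, log_J, d))
```

With these illustrative values the two error terms coincide, while $\log J$ is already about 399. This is why the final rate is stated in terms of $\log N_{\mathrm{tot}}$ rather than $N_{\mathrm{tot}}$.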
Let $0<\varepsilon<1$ be the target approximation accuracy. We choose $m_{*}$ and $\bar{m}_{*}$ such that $m_{*}^{-n_{2}/d}(\log m_{*})^{d}\asymp\varepsilon$ and $\bar{m}_{*}^{-n_{1}/d}(\log\bar{m}_{*})^{d}\asymp\varepsilon$. Thus, up to logarithmic factors, $m_{*}\asymp\varepsilon^{-d/n_{2}}$ and $\bar{m}_{*}\asymp\varepsilon^{-d/n_{1}}$. With the balanced choices of $J$ and $H$ above, the first four terms in ([29](https://arxiv.org/html/2606.17419#S5.E29)) are bounded by $C\varepsilon$.

It remains to choose $P$ so that the trunk approximation term is also of order $\varepsilon$. Since $(\log m_{*})^{d}\asymp(\log\bar{m}_{*})^{d}\asymp(\log(1/\varepsilon))^{d}$, up to constants, it suffices to impose

$$
\left(1+S\left(\log\frac{1}{\varepsilon}\right)^{d}\right)^{2}P^{-\frac{n_{3}-\ell}{d}}\lesssim\varepsilon.
$$

For instance, we may take

$$
P\asymp\varepsilon^{-\frac{d}{n_{3}-\ell}}\left(1+S\left(\log\frac{1}{\varepsilon}\right)^{d}\right)^{\frac{2d}{n_{3}-\ell}}.
$$

Then all five terms in ([29](https://arxiv.org/html/2606.17419#S5.E29)) are bounded by $C\varepsilon$, and hence $E_{\mathrm{total}}\lesssim\varepsilon$.
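The choice of $P$ can be sanity-checked by substituting it back into the trunk term. The following is a minimal sketch with all implied constants set to 1 (an assumption); `trunk_width` and `trunk_term` are hypothetical helper names.

```python
import math

def trunk_width(eps: float, n3: float, ell: int, d: int, S: float = 1.0) -> float:
    """P = eps^{-d/(n3-ell)} * (1 + S log(1/eps)^d)^{2d/(n3-ell)}, constant taken as 1."""
    log_factor = 1.0 + S * math.log(1.0 / eps) ** d
    exponent = d / (n3 - ell)
    return eps ** (-exponent) * log_factor ** (2.0 * exponent)

def trunk_term(P: float, eps: float, n3: float, ell: int, d: int, S: float = 1.0) -> float:
    """(1 + S log(1/eps)^d)^2 * P^{-(n3-ell)/d}, the trunk approximation term."""
    log_factor = 1.0 + S * math.log(1.0 / eps) ** d
    return log_factor ** 2 * P ** (-(n3 - ell) / d)

eps, n3, ell, d = 1e-2, 3.0, 1, 2
P = trunk_width(eps, n3, ell, d)
print("P =", P, " trunk term =", trunk_term(P, eps, n3, ell, d))
```

Under these assumptions the trunk term evaluates to $\varepsilon$ itself, which matches the requirement imposed above.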

We now express $\varepsilon$ in terms of the total number of parameters. For the shared architecture

$$
\mathcal{G}_{\bm{\theta}}(f,g)(\bm{x})=\sum_{s=1}^{J}\sum_{h=1}^{H}\sum_{p=1}^{P}e_{h,p,s}\,\mathcal{E}_{h}(\mathcal{D}_{\bar{m}}f)\,\mathcal{B}_{s}(\mathcal{D}_{m}g)\,\mathcal{T}_{p}(\bm{x}),
$$

there are $H$ networks of type $\mathcal{E}$, $J$ networks of type $\mathcal{B}$, and $P$ networks of type $\mathcal{T}$, together with $JHP$ scalar coefficients. Therefore,

$$
N_{\mathrm{tot}}\lesssim HK_{E}+JK_{B}+PK_{T}+JHP.
$$

The last term counts the coefficients $e_{h,p,s}$.
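The parameter count can be written as a short function. This sketch reads $K_{E}$, $K_{B}$, $K_{T}$ as the number of parameters in a single network of each type, which is an interpretation of the notation; the function name is hypothetical.

```python
def total_parameter_bound(H: int, J: int, P: int, K_E: int, K_B: int, K_T: int) -> int:
    """Evaluate H*K_E + J*K_B + P*K_T + J*H*P (up to the implied constant)."""
    subnetwork_params = H * K_E + J * K_B + P * K_T  # parameters inside the sub-networks
    coefficients = J * H * P                          # one scalar e_{h,p,s} per index triple
    return subnetwork_params + coefficients

print(total_parameter_bound(H=2, J=3, P=4, K_E=10, K_B=20, K_T=30))
```

The coefficient tensor grows multiplicatively in $J$, $H$, and $P$, so it can dominate the count when the widths are large.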

From the choices above,

$$
\log J\asymp m_{*}\log m_{*}\lesssim\varepsilon^{-d/n_{2}}\left(\log\frac{1}{\varepsilon}\right)^{C},\qquad\log H\asymp\bar{m}_{*}\log\bar{m}_{*}\lesssim\varepsilon^{-d/n_{1}}\left(\log\frac{1}{\varepsilon}\right)^{C}.
$$

Moreover, $\log P\lesssim\log(1/\varepsilon)+\log(1+S(\log(1/\varepsilon))^{d})$, which is of lower order compared with $\log J+\log H$. Hence

$$
\log N_{\mathrm{tot}}\lesssim\left(\varepsilon^{-d/n_{1}}+\varepsilon^{-d/n_{2}}\right)\left(\log\frac{1}{\varepsilon}\right)^{C}\lesssim\varepsilon^{-a}\left(\log\frac{1}{\varepsilon}\right)^{C},
$$

where $a:=\max\{d/n_{1},d/n_{2}\}$.

Inverting the above relation gives

$$
\varepsilon\lesssim\left(\frac{\log N_{\mathrm{tot}}}{\log\log N_{\mathrm{tot}}}\right)^{-1/a}(\log\log N_{\mathrm{tot}})^{C}.
$$

Moreover,

$$
\left(1+S(\log\bar{m}_{*})^{d}\right)\left(1+S(\log m_{*})^{d}\right)\lesssim C(\log\log N_{\mathrm{tot}})^{2d}.
$$

Using this non-sharp but uniform logarithmic upper bound, we obtain

$$
E_{\mathrm{total}}\lesssim\left(\frac{\log N_{\mathrm{tot}}}{\log\log N_{\mathrm{tot}}}\right)^{-1/a}(\log\log N_{\mathrm{tot}})^{2d}.
$$

Since $1/a=(\max\{d/n_{1},d/n_{2}\})^{-1}=\min\{n_{1},n_{2}\}/d$, we conclude that

$$
\left\|\mathcal{G}(f,g)-\mathcal{G}_{\bm{\theta}}(f,g)\right\|_{W^{\ell,\infty}(\Omega)}\leq C\left(\frac{\log N_{\mathrm{tot}}}{\log\log N_{\mathrm{tot}}}\right)^{-\frac{1}{\max\left\{\frac{d}{n_{1}},\frac{d}{n_{2}}\right\}}}(\log\log N_{\mathrm{tot}})^{2d}.
$$

This completes the proof. ∎
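To get a feel for the $\log\log/\log$-type rate, one can evaluate the right-hand side of the final bound with the constant $C$ set to 1. That choice is an assumption made only for illustration, and `approximation_rate` is a hypothetical name. The function requires $N_{\mathrm{tot}}>e$ so that $\log\log N_{\mathrm{tot}}>0$.

```python
import math

def approximation_rate(n_tot: float, n1: float, n2: float, d: int) -> float:
    """(log N / log log N)^{-min(n1,n2)/d} * (log log N)^{2d}, with C = 1 (assumption)."""
    a = max(d / n1, d / n2)          # a = max{d/n1, d/n2}, so 1/a = min{n1, n2}/d
    log_n = math.log(n_tot)
    loglog_n = math.log(log_n)
    if loglog_n <= 0:
        raise ValueError("n_tot must exceed e so that log log n_tot > 0")
    return (log_n / loglog_n) ** (-1.0 / a) * loglog_n ** (2 * d)

for n_tot in (1e6, 1e12, 1e24):
    print(n_tot, approximation_rate(n_tot, n1=4.0, n2=2.0, d=1))
```

Because the bound depends on $N_{\mathrm{tot}}$ only through $\log N_{\mathrm{tot}}$ and $\log\log N_{\mathrm{tot}}$, it is an asymptotic statement: with $C=1$ the printed values need not be below 1 for moderate $N_{\mathrm{tot}}$.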

### 5.3 Proof of Theorem [4](https://arxiv.org/html/2606.17419#Thmtheorem4)

The overall proof strategy follows the empirical-process argument in \[[31](https://arxiv.org/html/2606.17419#bib.bib31)\]. The main difference is that we study Sobolev training, where the loss involves both the operator output and its derivatives with respect to the spatial variable. In addition, our networks use the ReLU activation. Although ReLU is Lipschitz continuous, its derivative is the Heaviside step function, which is not Lipschitz continuous. Therefore, the covering-number estimate for the derivative classes cannot be obtained directly from the perturbation argument used in \[[8](https://arxiv.org/html/2606.17419#bib.bib8)\] and \[[31](https://arxiv.org/html/2606.17419#bib.bib31)\], which relies on Lipschitz continuity of the activation function and its derivatives.

To overcome this difficulty, we use a uniform empirical covering argument. More precisely, we regard the derivative-augmented neural operator class as a real-valued function class on finite samples and bound its uniform empirical covering number through pseudo-dimension estimates. The required empirical covering estimate follows from \[[2](https://arxiv.org/html/2606.17419#bib.bib2), Theorem 12.2\]. This allows us to control the stochastic and covering-number terms even in the presence of derivative observations.

After choosing the network size so that the stochastic and covering-number terms are dominated by the approximation term, the resulting generalization error becomes approximation-limited.

###### Proof of Theorem [4](https://arxiv.org/html/2606.17419#Thmtheorem4).

We first introduce the notation $\partial^{\gamma}\mathcal{G}(f,g)(\bm{z}):=\partial_{\bm{z}}^{\gamma}\mathcal{G}(f,g)(\bm{z})$ for $|\gamma|\leq\ell$. For any neural operator $\mathcal{H}$, define the empirical Sobolev norm by

$$
\|\mathcal{H}\|_{n,\ell}^{2}:=\frac{1}{n_{f}n_{g}n_{z}}\sum_{i=1}^{n_{f}}\sum_{j=1}^{n_{g}}\sum_{k=1}^{n_{z}}\sum_{|\gamma|\leq\ell}\left|\partial^{\gamma}\mathcal{H}(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})\right|^{2}.
$$
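As a concrete reading of this definition, the sketch below computes the squared empirical Sobolev norm from precomputed derivative values. The nested-list layout `values[i][j][k]`, holding the values $\partial^{\gamma}\mathcal{H}$ for all $|\gamma|\leq\ell$ at sample $(i,j,k)$, is an assumption made for illustration, as is the function name.

```python
def empirical_sobolev_norm_sq(values):
    """Squared empirical Sobolev norm.

    values[i][j][k] is a list with one entry per multi-index |gamma| <= ell,
    holding the derivative value at the sample (f_i, g_{i,j}, z_{i,j,k}).
    The grid is assumed rectangular, so the sample count is n_f * n_g * n_z.
    """
    total, count = 0.0, 0
    for per_f in values:              # index i = 1..n_f
        for per_g in per_f:           # index j = 1..n_g
            for derivs in per_g:      # index k = 1..n_z
                total += sum(v * v for v in derivs)  # sum over |gamma| <= ell
                count += 1
    return total / count

# n_f = 1, n_g = 2, n_z = 1, two derivative components per sample point
print(empirical_sobolev_norm_sq([[[[1.0, 2.0]], [[0.0, 1.0]]]]))
```

Note that the average is taken over sample points only; the sum over multi-indices is not normalized by their number.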
Since the test samples $g_{j}\sim\mu_{g}$ and $\bm{z}_{j,k}\sim\mu_{z}$ are i.i.d. and independent of the training set $\mathcal{S}$, the expected batch test error is equal to the population Sobolev prediction error:

$$
\begin{aligned}
&\mathbb{E}_{\mathcal{S}}\mathbb{E}_{f\sim\mu_{f}}\mathbb{E}_{\{g_{j}\}_{j=1}^{n_{g}}}\mathbb{E}_{\{\bm{z}_{j,k}\}_{j,k}}\Biggl[\frac{1}{n_{g}n_{z}}\sum_{j=1}^{n_{g}}\sum_{k=1}^{n_{z}}\sum_{|\gamma|\leq\ell}\left|\partial^{\gamma}\mathcal{G}(f,g_{j})(\bm{z}_{j,k})-\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}(\mathcal{D}_{\bar{m}}f,\mathcal{D}_{m}g_{j})(\bm{z}_{j,k})\right|^{2}\Biggr]\\
&\qquad=\mathbb{E}_{\mathcal{S}}\mathbb{E}_{f\sim\mu_{f}}\mathbb{E}_{g\sim\mu_{g}}\mathbb{E}_{\bm{z}\sim\mu_{z}}\sum_{|\gamma|\leq\ell}\left|\partial^{\gamma}\mathcal{G}(f,g)(\bm{z})-\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}(\mathcal{D}_{\bar{m}}f,\mathcal{D}_{m}g)(\bm{z})\right|^{2}.
\end{aligned}
\tag{30}
$$

Denote the left-hand side by $\mathcal{E}_{\rm gen}$. We decompose it as $\mathcal{E}_{\rm gen}={\rm T}_{1}+{\rm T}_{2}$, where

$$
{\rm T}_{1}:=2\,\mathbb{E}_{\mathcal{S}}\left[\|\widehat{\mathcal{G}}_{\mathcal{S}}-\mathcal{G}\|_{n,\ell}^{2}\right],\qquad{\rm T}_{2}:=\mathcal{E}_{\rm gen}-2\,\mathbb{E}_{\mathcal{S}}\left[\|\widehat{\mathcal{G}}_{\mathcal{S}}-\mathcal{G}\|_{n,\ell}^{2}\right].
$$

We estimate these two terms separately.

Bounding ${\rm T}_{1}$. For the Sobolev training data, we have $y_{i,j,k}^{(\gamma)}=\partial^{\gamma}\mathcal{G}(f_{i},g_{i,j})(\bm{z}_{i,j,k})+\varepsilon_{i,j,k}^{(\gamma)}$ for $|\gamma|\leq\ell$. Recall that ${\rm T}_{1}=2\mathbb{E}_{\mathcal{S}}[\|\widehat{\mathcal{G}}_{\mathcal{S}}-\mathcal{G}\|_{n,\ell}^{2}]$, that is,

$$
{\rm T}_{1}=2\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i=1}^{n_{f}}\sum_{j=1}^{n_{g}}\sum_{k=1}^{n_{z}}\sum_{|\gamma|\leq\ell}\Bigl|\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})-\partial^{\gamma}\mathcal{G}(f_{i},g_{i,j})(\bm{z}_{i,j,k})\Bigr|^{2}\Biggr].
\tag{31}
$$

Using $\partial^{\gamma}\mathcal{G}(f_{i},g_{i,j})(\bm{z}_{i,j,k})=y_{i,j,k}^{(\gamma)}-\varepsilon_{i,j,k}^{(\gamma)}$, we obtain

$$
\begin{aligned}
{\rm T}_{1}={}&2\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\Bigl|\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})-y_{i,j,k}^{(\gamma)}+\varepsilon_{i,j,k}^{(\gamma)}\Bigr|^{2}\Biggr]\\
={}&2\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\Bigl(\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})-y_{i,j,k}^{(\gamma)}\Bigr)^{2}\Biggr]\\
&+4\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\Bigl(\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})-y_{i,j,k}^{(\gamma)}\Bigr)\varepsilon_{i,j,k}^{(\gamma)}\Biggr]\\
&+2\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\bigl(\varepsilon_{i,j,k}^{(\gamma)}\bigr)^{2}\Biggr].
\end{aligned}
\tag{32}
$$

By the definition of the empirical risk minimizer $\widehat{\mathcal{G}}_{\mathcal{S}}$, the first term on the right-hand side is bounded by

$$
2\mathbb{E}_{\mathcal{S}}\inf_{\mathcal{G}_{\rm NN}\in\mathcal{F}_{G}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\Bigl(\partial^{\gamma}\mathcal{G}_{\rm NN}(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})-y_{i,j,k}^{(\gamma)}\Bigr)^{2}\Biggr].
\tag{33}
$$

Therefore,

$$
\begin{aligned}
{\rm T}_{1}\leq{}&2\inf_{\mathcal{G}_{\rm NN}\in\mathcal{F}_{G}}\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\Bigl(\partial^{\gamma}\mathcal{G}_{\rm NN}(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})-y_{i,j,k}^{(\gamma)}\Bigr)^{2}\Biggr]\\
&+4\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\Bigl(\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})-y_{i,j,k}^{(\gamma)}\Bigr)\varepsilon_{i,j,k}^{(\gamma)}\Biggr]\\
&+2\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\bigl(\varepsilon_{i,j,k}^{(\gamma)}\bigr)^{2}\Biggr].
\end{aligned}
\tag{34}
$$

Using $y_{i,j,k}^{(\gamma)}=\partial^{\gamma}\mathcal{G}(f_{i},g_{i,j})(\bm{z}_{i,j,k})+\varepsilon_{i,j,k}^{(\gamma)}$, we rewrite the first term in ([34](https://arxiv.org/html/2606.17419#S5.E34)). Subtracting and adding the noise square term gives

$$
\begin{aligned}
{\rm T}_{1}\leq{}&2\inf_{\mathcal{G}_{\rm NN}\in\mathcal{F}_{G}}\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\Bigl\{\Bigl(\partial^{\gamma}\mathcal{G}_{\rm NN}-\partial^{\gamma}\mathcal{G}-\varepsilon^{(\gamma)}\Bigr)^{2}-\bigl(\varepsilon^{(\gamma)}\bigr)^{2}\Bigr\}\Biggr]\\
&+4\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\Bigl(\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}-\partial^{\gamma}\mathcal{G}\Bigr)(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})\,\varepsilon_{i,j,k}^{(\gamma)}\Biggr].
\end{aligned}
\tag{35}
$$

Here, in the first line, $\partial^{\gamma}\mathcal{G}_{\rm NN}$, $\partial^{\gamma}\mathcal{G}$, and $\varepsilon^{(\gamma)}$ are evaluated at $(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j},\bm{z}_{i,j,k})$. Expanding the square in the first term of ([35](https://arxiv.org/html/2606.17419#S5.E35)), and using that the noises are mean-zero and independent of the samples, the cross term vanishes after taking expectation. Hence

$$
\begin{aligned}
{\rm T}_{1}\leq{}&2\inf_{\mathcal{G}_{\rm NN}\in\mathcal{F}_{G}}\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\Bigl|\partial^{\gamma}\mathcal{G}_{\rm NN}(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})-\partial^{\gamma}\mathcal{G}(f_{i},g_{i,j})(\bm{z}_{i,j,k})\Bigr|^{2}\Biggr]\\
&+4\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i,j,k}\sum_{|\gamma|\leq\ell}\Bigl(\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}-\partial^{\gamma}\mathcal{G}\Bigr)(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})\,\varepsilon_{i,j,k}^{(\gamma)}\Biggr].
\end{aligned}
\tag{36}
$$

This gives the desired decomposition of ${\rm T}_{1}$.
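The passage from (34) to (36) rests on a pointwise algebraic identity that is easy to verify numerically. In the sketch below, the scalars `u`, `v`, `G`, and `eps` are hypothetical shorthand for $\partial^{\gamma}\mathcal{G}_{\rm NN}$, $\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}$, $\partial^{\gamma}\mathcal{G}$, and the noise at a single sample point; the function names are likewise illustrative.

```python
import math
import random

def summand_34(u, v, G, eps):
    """Summand of (34): 2(u - y)^2 + 4(v - y) eps + 2 eps^2 with y = G + eps."""
    y = G + eps
    return 2 * (u - y) ** 2 + 4 * (v - y) * eps + 2 * eps ** 2

def summand_35(u, v, G, eps):
    """Summand of (35): 2{(u - G - eps)^2 - eps^2} + 4(v - G) eps."""
    return 2 * ((u - G - eps) ** 2 - eps ** 2) + 4 * (v - G) * eps

def summand_36_with_cross_term(u, v, G, eps):
    """Summand of (36) plus the cross term -4(u - G) eps that vanishes in expectation."""
    return 2 * (u - G) ** 2 - 4 * (u - G) * eps + 4 * (v - G) * eps

random.seed(0)
for _ in range(5):
    u, v, G, eps = (random.uniform(-1, 1) for _ in range(4))
    assert math.isclose(summand_34(u, v, G, eps), summand_35(u, v, G, eps), abs_tol=1e-12)
    assert math.isclose(summand_35(u, v, G, eps), summand_36_with_cross_term(u, v, G, eps), abs_tol=1e-12)
print("identities hold pointwise")
```

The cross term $-4(u-G)\,\varepsilon$ has zero mean only because $\mathcal{G}_{\rm NN}$ is a fixed element of $\mathcal{F}_{G}$; the analogous term with $\widehat{\mathcal{G}}_{\mathcal{S}}$ depends on the noise and must be kept, which is what the second line of (36) records.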

The first term in ([36](https://arxiv.org/html/2606.17419#S5.E36)) is controlled by the Sobolev approximation result in Corollary [1](https://arxiv.org/html/2606.17419#Thmcorollary1). Under the balanced choice of $\bar{m},m,J,H,P$, we have

$$
2\inf_{\mathcal{G}_{\rm NN}\in\mathcal{F}_{G}}\mathbb{E}_{\mathcal{S}}\left[\|\mathcal{G}_{\rm NN}-\mathcal{G}\|_{n,\ell}^{2}\right]\leq 2C_{1}\left(\frac{\log H}{\log\log H}\right)^{-\frac{2}{\max\{\frac{d}{n_{1}},\frac{d}{n_{2}}\}}}(\log\log H)^{4d}.
\tag{37}
$$
We next bound the noise-correlation term in ([36](https://arxiv.org/html/2606.17419#S5.E36)). Since standard ReLU networks do not admit a uniform $W^{1,\infty}$-covering number in general, we use a finite-sample empirical Sobolev covering instead. Let $N:=n_{f}n_{g}n_{z}$. For any finite sample set $\mathcal{A}=\{(f_{a},g_{a},\bm{z}_{a})\}_{a=1}^{N_{\mathcal{A}}}$ with $N_{\mathcal{A}}\leq 2N$, define the empirical Sobolev sup-metric by

$$
d_{\mathcal{A},\ell}(\mathcal{H}_{1},\mathcal{H}_{2}):=\max_{1\leq a\leq N_{\mathcal{A}}}\sum_{|\gamma|\leq\ell}\left|\partial^{\gamma}(\mathcal{H}_{1}-\mathcal{H}_{2})(\mathcal{D}_{\bar{m}}f_{a},\mathcal{D}_{m}g_{a})(\bm{z}_{a})\right|.
$$

Let $\mathcal{N}(\eta,\mathcal{F}_{G},d_{\mathcal{A},\ell})$ denote the corresponding covering number. Define the uniform empirical covering number over at most $2N$ sample points by

$$
\mathcal{N}_{\rm emp}^{(2N)}(\eta,\mathcal{F}_{G}):=\sup_{\mathcal{A}:\,N_{\mathcal{A}}\leq 2N}\mathcal{N}\left(\eta,\mathcal{F}_{G},d_{\mathcal{A},\ell}\right).
$$

For simplicity, we write $\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G}):=\mathcal{N}_{\rm emp}^{(2N)}(\eta,\mathcal{F}_{G})$ throughout the proof.

This metric controls the empirical Sobolev norm used in the loss. Indeed, if $d_{\mathcal{S},\ell}(\mathcal{H}_{1},\mathcal{H}_{2})\leq\eta$, then $\|\mathcal{H}_{1}-\mathcal{H}_{2}\|_{n,\ell}\leq\eta$. This follows because, at each sample point,

$$
\sum_{|\gamma|\leq\ell}\left|\partial^{\gamma}(\mathcal{H}_{1}-\mathcal{H}_{2})(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})\right|\leq\eta,
$$

and hence the corresponding sum of squares is bounded by $\eta^{2}$. Moreover, if $\mathcal{S}^{\prime}$ is an independent ghost sample, then $\mathcal{S}\cup\mathcal{S}^{\prime}$ contains at most $2N$ sample points. Hence

$$
\mathcal{N}\left(\eta,\mathcal{F}_{G},d_{\mathcal{S}\cup\mathcal{S}^{\prime},\ell}\right)\leq\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G}).
$$

Thus the same deterministic covering number $\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})$ can be used in both the one-sample and two-sample empirical-process estimates.
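The claim that the sup-metric dominates the empirical Sobolev norm can be illustrated on random data. In this sketch, `diffs` is an assumed layout: a list of sample points, each holding the derivative differences $\partial^{\gamma}(\mathcal{H}_{1}-\mathcal{H}_{2})$ for all $|\gamma|\leq\ell$; both function names are hypothetical.

```python
import math
import random

def sup_metric(diffs):
    """Empirical Sobolev sup-metric: max over points of the sum of absolute differences."""
    return max(sum(abs(v) for v in point) for point in diffs)

def empirical_norm(diffs):
    """Empirical Sobolev norm: square root of the mean over points of the sum of squares."""
    return math.sqrt(sum(sum(v * v for v in point) for point in diffs) / len(diffs))

random.seed(0)
diffs = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(50)]  # 50 points, q = 3
print(empirical_norm(diffs), "<=", sup_metric(diffs))
```

The inequality holds because a sum of squares is at most the square of the sum of absolute values, and an average is at most a maximum.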

The following lemma gives an upper bound on the second term in ([36](https://arxiv.org/html/2606.17419#S5.E36)); see Appendix [B.1](https://arxiv.org/html/2606.17419#A2.SS1) for a proof.

###### Lemma 1 (Noise correlation bound for Sobolev training).

Assume that, for each $|\gamma|\leq\ell$, the noise variables $\varepsilon_{i,j,k}^{(\gamma)}$ are independent, mean-zero, and sub-Gaussian with variance proxy bounded by $\sigma^{2}$. Then, for any $\eta>0$,

$$
\begin{aligned}
&\mathbb{E}_{\mathcal{S}}\Biggl[\frac{1}{n_{f}n_{g}n_{z}}\sum_{i=1}^{n_{f}}\sum_{j=1}^{n_{g}}\sum_{k=1}^{n_{z}}\sum_{|\gamma|\leq\ell}\left(\partial^{\gamma}\widehat{\mathcal{G}}_{\mathcal{S}}-\partial^{\gamma}\mathcal{G}\right)(\mathcal{D}_{\bar{m}}f_{i},\mathcal{D}_{m}g_{i,j})(\bm{z}_{i,j,k})\,\varepsilon_{i,j,k}^{(\gamma)}\Biggr]\\
&\qquad\leq 2\sigma\left(\sqrt{\mathbb{E}_{\mathcal{S}}\left[\|\widehat{\mathcal{G}}_{\mathcal{S}}-\mathcal{G}\|_{n,\ell}^{2}\right]}+\eta\right)\sqrt{\frac{4\log\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})+6}{n_{f}n_{g}n_{z}}}+\sigma\eta.
\end{aligned}
\tag{38}
$$

Applying Lemma [1](https://arxiv.org/html/2606.17419#Thmlemma1) to the second term in ([36](https://arxiv.org/html/2606.17419#S5.E36)), and combining it with ([37](https://arxiv.org/html/2606.17419#S5.E37)), we obtain

$$
\begin{aligned}
{\rm T}_{1}\leq\;&2C_{1}\left(\frac{\log H}{\log\log H}\right)^{-\frac{2}{\max\{\frac{d}{n_{1}},\frac{d}{n_{2}}\}}}(\log\log H)^{4d}\\
&+8\sigma\left(\sqrt{\mathbb{E}_{\mathcal{S}}\left[\|\widehat{\mathcal{G}}_{\mathcal{S}}-\mathcal{G}\|_{n,\ell}^{2}\right]}+\eta\right)\sqrt{\frac{4\log\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})+6}{n_{f}n_{g}n_{z}}}+4\sigma\eta.
\end{aligned}
\tag{39}
$$
Let

$$
\Gamma:=\sqrt{\mathbb{E}_{\mathcal{S}}\left[\|\widehat{\mathcal{G}}_{\mathcal{S}}-\mathcal{G}\|_{n,\ell}^{2}\right]},\qquad\Delta_{\eta}:=\sqrt{\frac{4\log\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})+6}{n_{f}n_{g}n_{z}}}.
$$

Since ${\rm T}_{1}=2\Gamma^{2}$, inequality ([39](https://arxiv.org/html/2606.17419#S5.E39)) implies

$$
\Gamma^{2}\leq C_{1}\left(\frac{\log H}{\log\log H}\right)^{-\frac{2}{\max\{\frac{d}{n_{1}},\frac{d}{n_{2}}\}}}(\log\log H)^{4d}+4\sigma(\Gamma+\eta)\Delta_{\eta}+2\sigma\eta.
$$

Equivalently, $\Gamma^{2}\leq a+2b\Gamma$, where

$$
a:=C_{1}\left(\frac{\log H}{\log\log H}\right)^{-\frac{2}{\max\{\frac{d}{n_{1}},\frac{d}{n_{2}}\}}}(\log\log H)^{4d}+4\sigma\eta\Delta_{\eta}+2\sigma\eta,
$$

and $b:=2\sigma\Delta_{\eta}$. Using the elementary implication $\Gamma^{2}\leq a+2b\Gamma\Rightarrow\Gamma^{2}\leq 2a+4b^{2}$, we obtain

$$
\Gamma^{2}\leq C\left(\frac{\log H}{\log\log H}\right)^{-\frac{2}{\max\{\frac{d}{n_{1}},\frac{d}{n_{2}}\}}}(\log\log H)^{4d}+C\sigma\eta+C\sigma\eta\Delta_{\eta}+C\sigma^{2}\Delta_{\eta}^{2}.
$$

Since ${\rm T}_{1}=2\Gamma^{2}$, we conclude that

$$
\begin{aligned}
{\rm T}_{1}\leq\;&C\left(\frac{\log H}{\log\log H}\right)^{-\frac{2}{\max\{\frac{d}{n_{1}},\frac{d}{n_{2}}\}}}(\log\log H)^{4d}+C\sigma\eta\\
&+C\sigma\eta\sqrt{\frac{\log\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})+1}{n_{f}n_{g}n_{z}}}+C\sigma^{2}\frac{\log\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})+1}{n_{f}n_{g}n_{z}}.
\end{aligned}
\tag{40}
$$
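For completeness, the elementary implication $\Gamma^{2}\leq a+2b\Gamma\Rightarrow\Gamma^{2}\leq 2a+4b^{2}$ used above can be verified by completing the square. The short derivation below is a standard argument and not taken from the source; it assumes $a,b\geq 0$ and $\Gamma\geq 0$, which hold here.

$$
\begin{aligned}
\Gamma^{2}\leq a+2b\Gamma
&\iff(\Gamma-b)^{2}\leq a+b^{2}\\
&\implies\Gamma\leq b+\sqrt{a+b^{2}}\\
&\implies\Gamma^{2}\leq 2b^{2}+2\left(a+b^{2}\right)=2a+4b^{2},
\end{aligned}
$$

where the last step uses $(x+y)^{2}\leq 2x^{2}+2y^{2}$ with $x=b$ and $y=\sqrt{a+b^{2}}$.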
Bounding ${\rm T}_{2}$. We next bound the deviation term ${\rm T}_{2}$. Recall that ${\rm T}_{2}=\mathcal{E}_{\rm gen}-2\mathbb{E}_{\mathcal{S}}[\|\widehat{\mathcal{G}}_{\mathcal{S}}-\mathcal{G}\|_{n,\ell}^{2}]$. Thus ${\rm T}_{2}$ measures the discrepancy between the population Sobolev prediction error and the empirical Sobolev prediction error. The following lemma is the Sobolev analogue of \[[31](https://arxiv.org/html/2606.17419#bib.bib31), Lemma 4\]; see Appendix [B.2](https://arxiv.org/html/2606.17419#A2.SS2) for a proof.

###### Lemma 2 (Uniform deviation bound for the Sobolev error).

For any $\eta>0$,

$$
{\rm T}_{2}\leq\frac{C\bigl(R_{\bar{m},m}^{(n_{3})}\bigr)^{2}P^{2\ell/d}}{n_{f}}\log\mathcal{N}_{\rm emp}\left(\frac{\eta}{CR_{\bar{m},m}^{(n_{3})}P^{\ell/d}},\mathcal{F}_{G}\right)+C\eta.
\tag{41}
$$

Here $C>0$ is independent of $n_{f},\bar{m},m$ and $\eta$.

Since $\mathcal{E}_{\rm gen}={\rm T}_{1}+{\rm T}_{2}$, combining ([40](https://arxiv.org/html/2606.17419#S5.E40)) with ([41](https://arxiv.org/html/2606.17419#S5.E41)) and using $n_{g}n_{z}\geq 1$, we obtain

$$
\begin{aligned}
\mathcal{E}_{\rm gen}\leq\;&C\left(\frac{\log H}{\log\log H}\right)^{-\frac{2}{\max\{\frac{d}{n_{1}},\frac{d}{n_{2}}\}}}(\log\log H)^{4d}+C\eta\\
&+C\sigma\eta\sqrt{\frac{\log\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})+1}{n_{f}n_{g}n_{z}}}+\frac{C\left(1+\bigl(R_{\bar{m},m}^{(n_{3})}\bigr)^{2}P^{2\ell/d}\right)}{n_{f}}\log\mathcal{N}_{\rm emp}\left(\frac{\eta}{CR_{\bar{m},m}^{(n_{3})}P^{\ell/d}},\mathcal{F}_{G}\right).
\end{aligned}
\tag{42}
$$

Indeed, since $n_{g}n_{z}\geq 1$, the term of order $(n_{f}n_{g}n_{z})^{-1}\log\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})$ is dominated by the outer-sampling term of order $n_{f}^{-1}\log\mathcal{N}_{\rm emp}(\eta/(CR_{\bar{m},m}^{(n_{3})}P^{\ell/d}),\mathcal{F}_{G})$, after increasing the constant.

Covering number estimate. We next estimate the uniform covering number of $\mathcal{F}_{G}$.

###### Definition 1 (Uniform empirical covering number).

Let $\mathcal{F}$ be a class of real-valued functions on a set $\mathcal{A}$. For any $n\in\mathbb{N}$ and any sample $\mathcal{A}_{n}=\{a_{1},\ldots,a_{n}\}\subset\mathcal{A}$, define the empirical $L^{\infty}$-metric by

$$
d_{\mathcal{A}_{n}}(f,g):=\max_{1\leq i\leq n}|f(a_{i})-g(a_{i})|,\qquad f,g\in\mathcal{F}.
$$

Let $\mathcal{N}(\varepsilon,\mathcal{F},d_{\mathcal{A}_{n}})$ denote the covering number of $\mathcal{F}$ with respect to $d_{\mathcal{A}_{n}}$. The uniform empirical covering number of $\mathcal{F}$ over all $n$-point samples is defined by

$$
\mathcal{N}(\varepsilon,\mathcal{F},n):=\sup_{\mathcal{A}_{n}\subset\mathcal{A}:\,|\mathcal{A}_{n}|=n}\mathcal{N}\left(\varepsilon,\mathcal{F},d_{\mathcal{A}_{n}}\right).
$$

###### Lemma 3 (\[[2](https://arxiv.org/html/2606.17419#bib.bib2), Theorem 12.2\]).

Let $\mathcal{F}$ be a class of real-valued functions on a set $\mathcal{A}$. Assume that $|f(a)|\leq B$ for all $f\in\mathcal{F}$ and $a\in\mathcal{A}$. Let $V:=\operatorname{Pdim}(\mathcal{F})$ be the pseudo-dimension of $\mathcal{F}$. Then, for any $\varepsilon>0$ and any $n\geq V$, the uniform empirical covering number in Definition [1](https://arxiv.org/html/2606.17419#Thmdefinition1) satisfies

$$
\mathcal{N}(\varepsilon,\mathcal{F},n)\leq\left(\frac{2enB}{\varepsilon V}\right)^{V}.
$$
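The bound in Lemma 3 is most useful in logarithmic form, since $\log\mathcal{N}$ is what enters the generalization estimates. The helper below is a minimal sketch that evaluates $V\log\bigl(2enB/(\varepsilon V)\bigr)$; the function name and the sample values are hypothetical.

```python
import math

def log_covering_bound(eps: float, n: int, B: float, V: int) -> float:
    """Logarithm of (2 e n B / (eps V))^V, the bound of Lemma 3 (requires n >= V)."""
    if n < V:
        raise ValueError("Lemma 3 requires n >= V")
    return V * math.log(2.0 * math.e * n * B / (eps * V))

# Illustrative values only: pseudo-dimension 1000, 10^6 sample points, B = 1.
print(log_covering_bound(eps=1e-3, n=10**6, B=1.0, V=1000))
```

The log-covering number grows linearly in the pseudo-dimension $V$ and only logarithmically in $n$, $B$, and $1/\varepsilon$.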

For each multi\-indexγ\\gammawith\|γ\|≤ℓ\|\\gamma\|\\leq\\ell, define the scalar derivative class

ℱGγ:=\{\(f,g,𝒙\)↦∂γℋ\(𝒟m¯f,𝒟mg\)\(𝒙\):ℋ∈ℱG\}\.\\mathcal\{F\}\_\{G\}^\{\\gamma\}:=\\left\\\{\(f,g,\\bm\{x\}\)\\mapsto\\partial^\{\\gamma\}\\mathcal\{H\}\(\\mathcal\{D\}\_\{\\bar\{m\}\}f,\\mathcal\{D\}\_\{m\}g\)\(\\bm\{x\}\):\\mathcal\{H\}\\in\\mathcal\{F\}\_\{G\}\\right\\\}\.Letqℓ:=\#\{γ:\|γ\|≤ℓ\}q\_\{\\ell\}:=\\\#\\\{\\gamma:\|\\gamma\|\\leq\\ell\\\}\. Sinceℓ∈\{0,1\}\\ell\\in\\\{0,1\\\}, we haveq0=1q\_\{0\}=1andq1=d\+1q\_\{1\}=d\+1\.

Fix a finite sample set𝒜=\{\(fa,ga,𝒙a\)\}a=1N𝒜\\mathcal\{A\}=\\\{\(f\_\{a\},g\_\{a\},\\bm\{x\}\_\{a\}\)\\\}\_\{a=1\}^\{N\_\{\\mathcal\{A\}\}\}\. For each\|γ\|≤ℓ\|\\gamma\|\\leq\\ell, let𝒞γ\\mathcal\{C\}\_\{\\gamma\}be anη/\(2qℓ\)\\eta/\(2q\_\{\\ell\}\)\-cover ofℱGγ\\mathcal\{F\}\_\{G\}^\{\\gamma\}with respect to the empiricalL∞L^\{\\infty\}\-metric

d𝒜,∞\(F1,F2\):=max1≤a≤N𝒜⁡\|F1\(fa,ga,𝒙a\)−F2\(fa,ga,𝒙a\)\|\.d\_\{\\mathcal\{A\},\\infty\}\(F\_\{1\},F\_\{2\}\):=\\max\_\{1\\leq a\\leq N\_\{\\mathcal\{A\}\}\}\|F\_\{1\}\(f\_\{a\},g\_\{a\},\\bm\{x\}\_\{a\}\)\-F\_\{2\}\(f\_\{a\},g\_\{a\},\\bm\{x\}\_\{a\}\)\|\.
We now construct an $\eta$-cover of $\mathcal{F}_{G}$ with respect to $d_{\mathcal{A},\ell}$. For each $\mathcal{H}\in\mathcal{F}_{G}$ and each $|\gamma|\leq\ell$, choose $F_{\gamma,\mathcal{H}}\in\mathcal{C}_{\gamma}$ such that $d_{\mathcal{A},\infty}(\partial^{\gamma}\mathcal{H},F_{\gamma,\mathcal{H}})\leq\eta/(2q_{\ell})$. This assigns to $\mathcal{H}$ a tuple $(F_{\gamma,\mathcal{H}})_{|\gamma|\leq\ell}\in\prod_{|\gamma|\leq\ell}\mathcal{C}_{\gamma}$. The tuples induce a partition of $\mathcal{F}_{G}$: two operators lie in the same cell exactly when they are assigned the same tuple. For each nonempty cell of this partition, choose one representative $\mathcal{H}_{\eta}\in\mathcal{F}_{G}$. The collection of all such representatives is denoted by $\mathcal{C}$. Clearly,

$$
|\mathcal{C}|\leq\prod_{|\gamma|\leq\ell}|\mathcal{C}_{\gamma}|.
$$

We claim that $\mathcal{C}$ is an $\eta$-cover of $\mathcal{F}_{G}$ with respect to $d_{\mathcal{A},\ell}$. Indeed, let $\mathcal{H}\in\mathcal{F}_{G}$, and let $\mathcal{H}_{\eta}\in\mathcal{C}$ be the representative chosen from the same cell as $\mathcal{H}$, so that $F_{\gamma,\mathcal{H}_{\eta}}=F_{\gamma,\mathcal{H}}$ for every $|\gamma|\leq\ell$. Then, for each $|\gamma|\leq\ell$, the triangle inequality gives

$$
d_{\mathcal{A},\infty}\left(\partial^{\gamma}\mathcal{H},\partial^{\gamma}\mathcal{H}_{\eta}\right)\leq d_{\mathcal{A},\infty}\left(\partial^{\gamma}\mathcal{H},F_{\gamma,\mathcal{H}}\right)+d_{\mathcal{A},\infty}\left(F_{\gamma,\mathcal{H}},\partial^{\gamma}\mathcal{H}_{\eta}\right)\leq\frac{\eta}{q_{\ell}}.
$$

Summing over the $q_{\ell}$ multi-indices, for every $a=1,\ldots,N_{\mathcal{A}}$,

$$
\sum_{|\gamma|\leq\ell}\left|\partial^{\gamma}(\mathcal{H}-\mathcal{H}_{\eta})(\mathcal{D}_{\bar{m}}f_{a},\mathcal{D}_{m}g_{a})(\bm{x}_{a})\right|\leq\eta.
$$

Hence $d_{\mathcal{A},\ell}(\mathcal{H},\mathcal{H}_{\eta})\leq\eta$. Consequently,

$$
\mathcal{N}\left(\eta,\mathcal{F}_{G},d_{\mathcal{A},\ell}\right)\leq\prod_{|\gamma|\leq\ell}\mathcal{N}\left(\frac{\eta}{2q_{\ell}},\mathcal{F}_{G}^{\gamma},N_{\mathcal{A}}\right).
\tag{43}
$$

Taking the supremum over all sample sets $\mathcal{A}$ with $N_{\mathcal{A}}\leq 2n_{f}n_{g}n_{z}$, we obtain

$$
\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})\leq\prod_{|\gamma|\leq\ell}\mathcal{N}\left(\frac{\eta}{2q_{\ell}},\mathcal{F}_{G}^{\gamma},2n_{f}n_{g}n_{z}\right).
\tag{44}
$$

Since $q_{\ell}$ depends only on $d$ and $\ell$, the factor $2q_{\ell}$ can be absorbed into the constants in the logarithmic covering estimates.
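The product-cover construction above can be sketched on a finite toy class. This is only an illustration under stated assumptions: the array `vals`, the helper names `greedy_cover` and `product_cover`, and the random data are hypothetical, and $d_{\mathcal{A},\ell}$ is taken to be $\max_{a}\sum_{|\gamma|\leq\ell}|\cdot|$, which is the quantity controlled in the argument.

```python
import numpy as np

def greedy_cover(vectors, eps):
    """Indices of a greedy eps-net of the rows of `vectors` in the max-norm."""
    centers = []
    for i, v in enumerate(vectors):
        if not any(np.max(np.abs(v - vectors[c])) <= eps for c in centers):
            centers.append(i)
    return centers

def product_cover(vals, eta):
    """vals[k, g, a]: derivative g of candidate k evaluated at sample a."""
    K, q, _ = vals.shape
    eps = eta / (2 * q)
    # One eta/(2q)-cover per derivative class.
    covers = [greedy_cover(vals[:, g, :], eps) for g in range(q)]
    rep_of_cell, reps = {}, []
    for k in range(K):
        # Tuple of cover elements assigned to candidate k, one per derivative.
        cell = tuple(
            next(c for c in covers[g]
                 if np.max(np.abs(vals[k, g] - vals[c, g])) <= eps)
            for g in range(q)
        )
        # The first candidate seen in a cell serves as its representative.
        reps.append(rep_of_cell.setdefault(cell, k))
    return covers, np.array(reps)

rng = np.random.default_rng(0)
vals = rng.uniform(-1.0, 1.0, size=(500, 2, 2))   # K = 500, q = 2, N_A = 2 (toy data)
eta = 2.0
covers, reps = product_cover(vals, eta)

# Distance of every candidate to its representative: max over samples of the sum over derivatives.
dist = np.max(np.sum(np.abs(vals - vals[reps]), axis=1), axis=1)
print(dist.max() <= eta, len(set(reps.tolist())), [len(c) for c in covers])
```

The number of representatives never exceeds the product of the per-derivative cover sizes, mirroring the cardinality bound for $\mathcal{C}$.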

###### Lemma 4 (Pseudo-dimension bound for the derivative classes).

Let $\mathcal{F}_{G}^{\gamma}$ be defined as above. For any multi-index $\gamma$ with $|\gamma|\leq 1$, we have

$$
\operatorname{Pdim}(\mathcal{F}_{G}^{\gamma})\leq CJHP\left(L_{B}^{2}p_{B}^{2}+L_{E}^{2}p_{E}^{2}+L_{T}^{2}p_{T}^{2}\right),
$$

where $C>0$ is independent of the weight magnitudes of the neural networks.

Lemma [4](https://arxiv.org/html/2606.17419#Thmlemma4) is proved in Appendix [B.3](https://arxiv.org/html/2606.17419#A2.SS3).

Covering number estimate and choice of parameters. We first estimate the uniform empirical covering number under the balanced choice of the architecture parameters. By ([44](https://arxiv.org/html/2606.17419#S5.E44)) and Lemma [3](https://arxiv.org/html/2606.17419#Thmlemma3), for any $\eta>0$,

$$
\log\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})\leq C\sum_{|\gamma|\leq\ell}\operatorname{Pdim}(\mathcal{F}_{G}^{\gamma})\log\left(\frac{Cn_{f}n_{g}n_{z}B_{\ell}}{\eta}\right),
\tag{45}
$$

where $B_{\ell}:=CR_{\bar{m},m}^{(n_{3})}P^{\ell/d}$. Similarly,

$$
\log\mathcal{N}_{\rm emp}\left(\frac{\eta}{CR_{\bar{m},m}^{(n_{3})}P^{\ell/d}},\mathcal{F}_{G}\right)\leq CV_{\ell}\log\left(\frac{Cn_{f}n_{g}n_{z}\bigl(R_{\bar{m},m}^{(n_{3})}P^{\ell/d}\bigr)^{2}}{\eta}\right),
\tag{46}
$$

where $V_{\ell}:=\sum_{|\gamma|\leq\ell}\operatorname{Pdim}(\mathcal{F}_{G}^{\gamma})$. By Lemma [4](https://arxiv.org/html/2606.17419#Thmlemma4),

$$
V_{\ell}\leq CJHP\left(L_{B}^{2}p_{B}^{2}+L_{E}^{2}p_{E}^{2}+L_{T}^{2}p_{T}^{2}\right).
$$

We now choose the architecture parameters. The discretization levels $m_{*},\bar{m}_{*}$ and the branch ranks $J,H$ are chosen according to the balanced construction in Corollary [1](https://arxiv.org/html/2606.17419#Thmcorollary1). The trunk rank $P$ is chosen so that the trunk approximation error is of the same order as the branch approximation error. By the balancing argument in Corollary [1](https://arxiv.org/html/2606.17419#Thmcorollary1), this gives $P=(\log H)^{C+o(1)}=H^{o(1)}$ for some constant $C>0$. Hence every polynomial factor involving $P$ is of order $H^{o(1)}$.

With these choices, the squared Sobolev approximation error is bounded by

$$
C\left(\frac{\log H}{\log\log H}\right)^{-\frac{2}{\max\{\frac{d}{n_{1}},\frac{d}{n_{2}}\}}}(\log\log H)^{4d}.
$$

Moreover, up to logarithmic factors, $m_{*}\asymp(\log H/\log\log H)^{d/n_{2}}$ and $\bar{m}_{*}\asymp(\log H/\log\log H)^{d/n_{1}}$, while $\log J\asymp m_{*}\log m_{*}$ and $\log H\asymp\bar{m}_{*}\log\bar{m}_{*}$. Therefore, $J=H^{o(1)}$ and $P=H^{o(1)}$. Since $p_{B},p_{E},p_{T}=\mathcal{O}(1)$, and $L_{B},L_{E},L_{T}$ grow at most polynomially in $\log H$, we obtain

$$
V_{\ell}\leq CJHP\left(L_{B}^{2}p_{B}^{2}+L_{E}^{2}p_{E}^{2}+L_{T}^{2}p_{T}^{2}\right)\leq H^{1+o(1)}.
$$

Furthermore, $R_{\bar{m},m}^{(n_{3})}=\mathcal{O}((\log\log H)^{2d})$, and hence $R_{\bar{m},m}^{(n_{3})}P^{\ell/d}=H^{o(1)}$.
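To make the $H^{o(1)}$ notation concrete, the following small sketch computes the exponent $e(H)$ for which $(\log H)^{C}=H^{e(H)}$ and shows that it shrinks as $H$ grows. The function name `effective_exponent` and the value `C = 2.0` are illustrative assumptions, not quantities fixed by the paper.

```python
import math

def effective_exponent(H, C):
    """Exponent e(H) such that (log H)**C == H**e(H)."""
    return C * math.log(math.log(H)) / math.log(H)

C = 2.0  # hypothetical constant, chosen only for illustration
sizes = (1e3, 1e6, 1e12, 1e24)
exponents = [effective_exponent(H, C) for H in sizes]
for H, e in zip(sizes, exponents):
    print(f"H = {H:.0e}, exponent = {e:.4f}")
```

Any fixed power of $\log H$ or $\log\log H$ behaves the same way, which is why such factors are absorbed into $H^{o(1)}$.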

Taking $\eta=H^{-1}$ in ([46](https://arxiv.org/html/2606.17419#S5.E46)), we obtain

$$
\log\mathcal{N}_{\rm emp}\left(\frac{\eta}{CR_{\bar{m},m}^{(n_{3})}P^{\ell/d}},\mathcal{F}_{G}\right)\leq CH^{1+o(1)}\log(n_{f}n_{g}n_{z}H).
\tag{47}
$$

Similarly,

$$
\log\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})\leq CH^{1+o(1)}\log(n_{f}n_{g}n_{z}H).
\tag{48}
$$

Substituting ([47](https://arxiv.org/html/2606.17419#S5.E47)) and ([48](https://arxiv.org/html/2606.17419#S5.E48)) into ([42](https://arxiv.org/html/2606.17419#S5.E42)), and using $\eta=H^{-1}$, gives

$$
\begin{aligned}
\mathcal{E}_{\rm gen}\leq\;&C\left(\frac{\log H}{\log\log H}\right)^{-\frac{2}{\max\{\frac{d}{n_{1}},\frac{d}{n_{2}}\}}}(\log\log H)^{4d}+CH^{-1}\\
&+C\sigma H^{-1}\sqrt{\frac{H^{1+o(1)}\log(n_{f}n_{g}n_{z}H)}{n_{f}n_{g}n_{z}}}+C\frac{H^{1+o(1)}\log(n_{f}n_{g}n_{z}H)}{n_{f}}.
\end{aligned}
\tag{49}
$$

Here we used $1+\bigl(R_{\bar{m},m}^{(n_{3})}\bigr)^{2}P^{2\ell/d}=H^{o(1)}$.

We now choose $H=n_{f}^{c}$ with a sufficiently small constant $c>0$. More precisely, assume $\log(n_{g}n_{z})\leq n_{f}^{\delta}$ for some sufficiently small $\delta>0$, and choose $c>0$ such that $c+\delta<1$. Then the stochastic terms in ([49](https://arxiv.org/html/2606.17419#S5.E49)) are dominated by the approximation term for sufficiently large $n_{f}$: with $H=n_{f}^{c}$ they decay polynomially in $n_{f}$, whereas the approximation term decays only polylogarithmically. Hence

$$
\mathcal{E}_{\rm gen}\leq C\left(\frac{\log H}{\log\log H}\right)^{-\frac{2}{\max\{\frac{d}{n_{1}},\frac{d}{n_{2}}\}}}(\log\log H)^{4d}.
$$

Finally, since $H=n_{f}^{c}$, we have $\log H\asymp\log n_{f}$ and $\log\log H\asymp\log\log n_{f}$. Therefore,

$$
\mathcal{E}_{\rm gen}\leq C\left(\frac{\log n_{f}}{\log\log n_{f}}\right)^{-\frac{2}{\max\{\frac{d}{n_{1}},\frac{d}{n_{2}}\}}}(\log\log n_{f})^{4d}.
$$

Equivalently,

$$
\mathcal{E}_{\rm gen}\leq C\left(\frac{\log n_{f}}{\log\log n_{f}}\right)^{-\frac{2\min\{n_{1},n_{2}\}}{d}}(\log\log n_{f})^{4d}.
$$

This completes the proof. ∎
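As a quick numerical companion to the final bound, the sketch below evaluates its right-hand side with the unknown constant $C$ set to 1 (an assumption made purely for illustration) and checks that the two ways of writing the exponent agree. The function names and the sample values of $n_{1}$, $n_{2}$, $d$ are hypothetical.

```python
import math

def rate(n_f, n1, n2, d):
    """Right-hand side of the final bound with the constant C set to 1."""
    L = math.log(n_f)
    LL = math.log(L)
    return (L / LL) ** (-2.0 * min(n1, n2) / d) * LL ** (4 * d)

def rate_max_form(n_f, n1, n2, d):
    """Same quantity, written with the exponent -2 / max{d/n1, d/n2}."""
    L = math.log(n_f)
    LL = math.log(L)
    return (L / LL) ** (-2.0 / max(d / n1, d / n2)) * LL ** (4 * d)

n_f, d = 10**6, 2
print(rate(n_f, n1=3, n2=5, d=d), rate(n_f, n1=3, n2=9, d=d))
```

Because only $\min\{n_{1},n_{2}\}$ enters the exponent, raising the larger of the two indices leaves the bound unchanged.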

## 6 Conclusion

In this paper, we studied scaling laws for multi\-input neural operator learning and quantified how the complexity of each input space affects the final approximation and generalization rates\. Our results show that the dimension, regularity, and Lipschitz sensitivity associated with each input function enter explicitly into the final error bound\. In particular, in the balanced regime, the rate is governed by a harmonic\-type interaction among the input dimensions and regularities\. This provides a theoretical explanation of how different inputs contribute to the overall learning complexity and also gives guidance for allocating the network capacity across different branch subnetworks\.

There are several directions for future work\. First, the architecture studied in this paper uses a construction in which the trunk network is introduced at the final stage\. This is different from the standard DeepONet architecture\[[32](https://arxiv.org/html/2606.17419#bib.bib32)\], where the trunk network is constructed at the beginning and then combined with the branch networks\. If one instead constructs the trunk network at an earlier stage, the approximation procedure and the corresponding error analysis would be substantially different\. Comparing these two types of constructions may help identify which architecture is more suitable for different classes of multi\-input operator\-learning problems\.

Second, the discretization procedure in this paper is based on global Chebyshev interpolation. Due to the Markov–Bernstein inequality, the Sobolev stability of this reconstruction introduces a loss of order $2k_{i}$, leading to the effective smoothness $n_{i}-2k_{i}$ in the final rate. Ideally, this loss could be reduced from $2k_{i}$ to $k_{i}$ by using more stable local sampling and reconstruction procedures, such as B-spline quasi-interpolation. For instance, one-dimensional spline approximation estimates of this type can be found in \[[37](https://arxiv.org/html/2606.17419#bib.bib37), Theorem 6.20\]. Developing a multi-dimensional sample-based reconstruction scheme with improved Sobolev stability would lead to sharper rates for multi-input operator learning.

Third, the generalization analysis in this work is carried out under a paired hierarchical sampling setting. In this setting, for each outer sample $f_{i}$, one observes a collection of inner samples $g_{i,j}$ and spatial points $\bm{z}_{i,j,k}$. As a consequence, the dominant statistical error is mainly controlled by the number of outer samples $n_{f}$, rather than by $n_{g}$ alone. This reflects the fact that the samples sharing the same $f_{i}$ are not fully independent at the outer level. In many practical scientific computing tasks, however, one may sample many input functions $f$ and $g$ simultaneously and form different pairings between them. In such settings, the effective sample size and the resulting generalization error may depend on the joint sampling and pairing structure of the inputs. Understanding generalization error for multi-input neural operators under such non-nested or partially paired sampling schemes is an important direction for future work \[[36](https://arxiv.org/html/2606.17419#bib.bib36),[7](https://arxiv.org/html/2606.17419#bib.bib7)\].

## Acknowledgement

W.Z. acknowledges support from the National Science Foundation under awards DMS-2502900 and DMS-2540370, and from the Air Force Office of Scientific Research under Grant No. FA9550-25-1-0079. W.L. acknowledges support from the National Science Foundation under NSF DMS 2145167 and from the U.S. Department of Energy under DOE SC0024348. H.L. acknowledges support from HKRGC ECS 22302123, HKRGC GRF 12301925, and the Guangdong and Hong Kong Universities “1+1+1” Joint Research Collaboration Scheme 2025A0505000007. Z.Z. acknowledges support from the U.S. Department of Energy under DE-SC0025440.

## References

- \[1\] Y. Abu-Mostafa. The Vapnik-Chervonenkis dimension: Information versus complexity in learning. Neural Computation, 1(3):312–317, 1989.
- \[2\] M. Anthony, P. Bartlett, et al. Neural network learning: Theoretical foundations, volume 9. Cambridge University Press, Cambridge, 1999.
- \[3\] A. D. Back and T. Chen. Universal approximation of multiple nonlinear operators by neural networks. Neural Computation, 14(11):2561–2566, 2002.
- \[4\] T. Bagby, L. Bos, and N. Levenberg. Multivariate simultaneous approximation. Constructive Approximation, 18(4):569–577, 2002.
- \[5\] P. Bartlett, V. Maiorov, and R. Meir. Almost linear VC dimension bounds for piecewise polynomial networks. Advances in Neural Information Processing Systems, 11, 1998.
- \[6\] S. Brenner and L. Scott. The mathematical theory of finite element methods, volume 3. Springer, 2008.
- \[7\] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
- \[8\] M. Chen, H. Jiang, W. Liao, and T. Zhao. Nonparametric regression on low-dimensional manifolds using deep ReLU networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4):1203–1253, 2022.
- \[9\] T. Chen and H. Chen. Approximations of continuous functionals by neural networks with application to dynamic systems. IEEE Transactions on Neural Networks, 4(6):910–918, 1993.
- \[10\] T. Chen and H. Chen. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6(4):911–917, 1995.
- \[11\] W. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu. Sobolev training for neural networks. Advances in Neural Information Processing Systems, 30, 2017.
- \[12\] R. A. DeVore and G. G. Lorentz. Constructive approximation, volume 303. Springer Science & Business Media, 1993.
- \[13\] H. Dong and Z. Li. On the $W^{2,p}$ estimate for oblique derivative problem in Lipschitz domains. International Mathematics Research Notices, 2022(5):3602–3635, 2022.
- \[14\] D. Gilbarg and N. S. Trudinger. Elliptic partial differential equations of second order, volume 2. Springer, 1998.
- \[15\] S. Goswami, M. Yin, Y. Yu, and G. Karniadakis. A physics-informed variational DeepONet for predicting crack path in quasi-brittle materials. Computer Methods in Applied Mechanics and Engineering, 391:114587, 2022.
- \[16\] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms. Analysis and Applications, 18(05):803–859, 2020.
- \[17\] I. Gühring and M. Raslan. Approximation rates for neural networks with encodable weights in smoothness spaces. Neural Networks, 134:107–130, 2021.
- \[18\] W. Hao, R. P. Li, Y. Xi, T. Xu, and Y. Yang. Multiscale neural networks for approximating Green’s functions. SIAM Journal on Scientific Computing, 48(2):C240–C270, 2026.
- \[19\] J. He, X. Liu, and J. Xu. MgNO: Efficient parameterization of linear operators via multigrid. In International Conference on Learning Representations, volume 2024, pages 53409–53428, 2024.
- \[20\] S. Hill and F. X.-F. Ye. Geometric regularization of autoencoders via observed stochastic dynamics. arXiv preprint arXiv:2604.16282, 2026.
- \[21\] J. Hu and P. Jin. A hybrid iterative method based on MIONet for PDEs: Theory and numerical examples. Mathematics of Computation, 2025.
- \[22\] P. Jin, S. Meng, and L. Lu. MIONet: Learning multiple-input operators via tensor product. SIAM Journal on Scientific Computing, 44(6):A3490–A3514, 2022.
- \[23\] N. Kovachki, S. Lanthaler, and S. Mishra. On universal approximation and error bounds for Fourier neural operators. Journal of Machine Learning Research, 22(290):1–76, 2021.
- \[24\] S. Lanthaler. Operator learning with PCA-Net: upper and lower complexity bounds. Journal of Machine Learning Research, 24(318):1–67, 2023.
- \[25\] S. Lanthaler, S. Mishra, and G. Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1):tnac001, 2022.
- \[26\] J. Li, S. Huang, H. Feng, D.-X. Zhou, and G. Kutyniok. Sparse-aware neural networks for nonlinear functionals: Mitigating the exponential dependence on dimension. arXiv preprint arXiv:2604.06774, 2026.
- \[27\] Z. Li, D. Z. Huang, B. Liu, and A. Anandkumar. Fourier neural operator with learned deformations for PDEs on general geometries. Journal of Machine Learning Research, 24(388):1–26, 2023.
- \[28\] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020.
- \[29\] H. Liu, B. Dahal, R. Lai, and W. Liao. Generalization error guaranteed auto-encoder-based nonlinear model reduction for operator learning. Applied and Computational Harmonic Analysis, 74:101717, 2025.
- \[30\] H. Liu, H. Yang, M. Chen, T. Zhao, and W. Liao. Deep nonparametric estimation of operators between infinite dimensional spaces. Journal of Machine Learning Research, 25(24):1–67, 2024.
- \[31\] H. Liu, Z. Zhang, W. Liao, and H. Schaeffer. Neural scaling laws of deep ReLU and deep operator network: A theoretical study. arXiv preprint arXiv:2410.00357, 2024.
- \[32\] L. Lu, P. Jin, and G. E. Karniadakis. DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv preprint arXiv:1910.03193, 2019.
- \[33\] C. Marcati and C. Schwab. Exponential convergence of deep operator networks for elliptic partial differential equations. SIAM Journal on Numerical Analysis, 61(3):1513–1545, 2023.
- \[34\] H. N. Mhaskar and N. Hahm. Neural networks for functional approximation and system identification. Neural Computation, 9(1):143–159, 1997.
- \[35\] J. A. A. Opschoor, P. C. Petersen, and C. Schwab. Deep ReLU networks and high-order finite element methods. Analysis and Applications, 18(05):715–770, 2020.
- \[36\] I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, pages 1679–1706, 1994.
- \[37\] L. Schumaker. Spline functions: basic theory. Cambridge University Press, 2007.
- \[38\] C. Schwab, A. Stein, and J. Zech. Deep operator network approximation rates for Lipschitz operators. Analysis and Applications, 24(01):199–239, 2026.
- \[39\] Z. Shi, J. Fan, L. Song, D.-X. Zhou, and J. A. Suykens. Nonlinear functional regression by functional deep neural network with kernel embedding. Journal of Machine Learning Research, 26(284):1–49, 2025.
- \[40\] J. W. Siegel. Optimal approximation rates for deep ReLU neural networks on Sobolev and Besov spaces. Journal of Machine Learning Research, 24(357):1–52, 2023.
- \[41\] L. Song, Y. Liu, J. Fan, and D. Zhou. Approximation of smooth functionals using deep ReLU networks. Neural Networks, 166:424–436, 2023.
- \[42\] S. Srinivas and F. Fleuret. Knowledge transfer with Jacobian matching. In International Conference on Machine Learning, pages 4723–4731. PMLR, 2018.
- \[43\] N. Vlassis and W. Sun. Sobolev training of thermodynamic-informed neural networks for interpretable elasto-plasticity models with level set hardening. Computer Methods in Applied Mechanics and Engineering, 377:113695, 2021.
- \[44\] N. N. Vlassis, R. Ma, and W. Sun. Geometric deep learning for computational mechanics part I: Anisotropic hyperelasticity. Computer Methods in Applied Mechanics and Engineering, 371:113299, 2020.
- \[45\] S. Wang, H. Wang, and P. Perdikaris. Learning the solution operator of parametric partial differential equations with physics-informed DeepONets. Science Advances, 7(40):eabi8605, 2021.
- \[46\] A. Weihs and H. Schaeffer. Generalization bounds and statistical guarantees for multi-task and multiple operator learning with MNO networks. arXiv preprint arXiv:2604.01961, 2026.
- \[47\] A. Weihs and H. Schaeffer. Multiple neural operators achieve near-optimal rates for multi-task learning. arXiv preprint arXiv:2605.22724, 2026.
- \[48\] A. Weihs, J. Sun, Z. Zhang, and H. Schaeffer. A deep learning framework for multi-operator learning: Architectures and approximation theory. arXiv preprint arXiv:2510.25379, 2025.
- \[49\] J.-Q. Yang and L. Shi. Efficient approximation for encoder–decoder neural operators via variation spaces. arXiv preprint arXiv:2606.01244, 2026.
- \[50\] Y. Yang. DeepONet for solving nonlinear partial differential equations with physics-informed training. Neural Networks, page 108490, 2025.
- \[51\] Y. Yang and J. He. Deep neural networks with general activations: Super-convergence in Sobolev norms. arXiv preprint arXiv:2508.05141, 2025.
- \[52\] Y. Yang, Y. Wu, H. Yang, and Y. Xiang. Nearly optimal approximation rates for deep super ReLU networks on Sobolev spaces. arXiv preprint arXiv:2310.10766, 2023.
- \[53\] Y. Yang and Y. Xiang. Approximation of functionals by neural network without curse of dimensionality. Journal of Machine Learning, 1(4):342–372, 2022.
- \[54\] Y. Yang, H. Yang, and Y. Xiang. Nearly optimal VC-dimension and pseudo-dimension bounds for deep neural network derivatives. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- \[55\] Z. Yang, S. Huang, H. Feng, and D.-X. Zhou. Spherical analysis of learning nonlinear functionals. Constructive Approximation, pages 1–29, 2026.
- \[56\] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
- \[57\] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation rates for deep neural networks. Advances in Neural Information Processing Systems, 33:13005–13015, 2020.

## Appendix A Proofs for Approximation Error

This section contains the detailed proof of the approximation error estimate\. The construction consists of four steps\. Step 1, which concerns the discretization of the input functions, has already been discussed in the main text\. We therefore focus here on the remaining steps, especially the coefficient approximations in Steps 2 and 3 and the trunk\-network approximation in Step 4\. We begin by recalling two preliminary results: the pseudo\-spectral projection estimate associated with the tensorized Chebyshev discretization, and a ReLU approximation theorem for Sobolev functions in Sobolev norms\.

### A.1 Step 0

#### A.1.1 Pseudo-spectral projection

As discussed above, a key step in the construction is to discretize the input functions $f$ and $g$. In practical operator learning, the input functions are usually accessed only through finitely many pointwise samples. Therefore, although basis expansions provide a clean mathematical formulation, it is important to develop a discretization procedure based directly on sampling, in the same spirit as DeepONet \[[9](https://arxiv.org/html/2606.17419#bib.bib9),[32](https://arxiv.org/html/2606.17419#bib.bib32)\]. Pseudo-spectral projection based on spectral bases has been used in \[[23](https://arxiv.org/html/2606.17419#bib.bib23),[50](https://arxiv.org/html/2606.17419#bib.bib50)\]. However, Fourier-type spectral projections are naturally suited to periodic domains. For non-periodic problems, one typically needs to introduce extensions outside the physical domain, which is not always natural. On the other hand, the construction in \[[31](https://arxiv.org/html/2606.17419#bib.bib31)\] is designed for Lipschitz-continuous input spaces, while in this paper we work with Sobolev input classes. Therefore, we introduce a sampling-based discretization–reconstruction pair using Chebyshev interpolation.

For $N\in\mathbb{N}$, define the tensorized Chebyshev grid by

$$
\bm{x}_{k_{1},\dots,k_{d}}=\left(\cos\left(\frac{(k_{1}+\frac{1}{2})\pi}{N+1}\right),\dots,\cos\left(\frac{(k_{d}+\frac{1}{2})\pi}{N+1}\right)\right),\qquad 0\leq k_{1},\dots,k_{d}\leq N.
\tag{50}
$$

The total number of grid points is $(N+1)^{d}$.
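As a minimal sketch, the grid (50) can be generated with NumPy as follows. The function names `chebyshev_nodes` and `chebyshev_grid` and the one-row-per-point layout are illustrative choices, not part of the paper.

```python
import numpy as np

def chebyshev_nodes(N):
    """One-dimensional nodes cos((k + 1/2) * pi / (N + 1)), k = 0, ..., N."""
    k = np.arange(N + 1)
    return np.cos((k + 0.5) * np.pi / (N + 1))

def chebyshev_grid(N, d):
    """Tensorized grid of shape ((N + 1)**d, d); each row is one grid point."""
    axes = np.meshgrid(*([chebyshev_nodes(N)] * d), indexing="ij")
    return np.stack([a.ravel() for a in axes], axis=-1)

grid = chebyshev_grid(N=4, d=2)
print(grid.shape)  # (25, 2)
```

All nodes lie strictly inside $(-1,1)$, and in each coordinate they are the roots of the Chebyshev polynomial $T_{N+1}$.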

Let $Q_{N}$ denote the space of polynomials on $\mathbb{R}^{d}$ of total degree at most $N$, and let $\overline{Q}_{N}$ denote the space of polynomials on $\mathbb{R}^{d}$ such that each variable has degree at most $N$. Clearly,

$$
Q_{N}\subset\overline{Q}_{N}.
$$

We define the sampling operator

$$
\mathcal{D}_{N}(g)=\bigl(g(\bm{x}_{k_{1},\dots,k_{d}})\bigr)_{0\leq k_{1},\dots,k_{d}\leq N}\in\mathbb{R}^{(N+1)^{d}},
$$

and let

$$
\mathcal{P}_{N}:\mathbb{R}^{(N+1)^{d}}\to\overline{Q}_{N}
$$

be the multi-dimensional Chebyshev interpolation operator associated with the grid ([50](https://arxiv.org/html/2606.17419#A1.E50)). Then $\mathcal{P}_{N}\circ\mathcal{D}_{N}g$ is the tensor-product Chebyshev interpolant of $g$.
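The sampling and reconstruction operators can be sketched in one dimension ($d=1$) as follows; the tensor-product case applies the same procedure along each coordinate. The names `sample` and `reconstruct` and the choice of solving a Chebyshev Vandermonde system are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def nodes(N):
    """Chebyshev nodes cos((k + 1/2) * pi / (N + 1)), k = 0, ..., N."""
    k = np.arange(N + 1)
    return np.cos((k + 0.5) * np.pi / (N + 1))

def sample(g, N):
    """D_N: evaluate g at the N + 1 Chebyshev nodes."""
    return g(nodes(N))

def reconstruct(values):
    """P_N: Chebyshev coefficients of the degree-N interpolant of `values`."""
    N = len(values) - 1
    V = cheb.chebvander(nodes(N), N)   # V[i, j] = T_j(x_i)
    return np.linalg.solve(V, values)

x_test = np.linspace(-1.0, 1.0, 201)

# A polynomial of degree 3 <= N is reproduced exactly (up to round-off).
p = lambda x: 1.0 - 2.0 * x + 3.0 * x**3
err_poly = np.max(np.abs(cheb.chebval(x_test, reconstruct(sample(p, 5))) - p(x_test)))

# A smooth non-polynomial function is approximated to high accuracy.
err_exp = np.max(np.abs(cheb.chebval(x_test, reconstruct(sample(np.exp, 12))) - np.exp(x_test)))
print(err_poly, err_exp)
```

The exact reproduction of polynomials in $\overline{Q}_{N}$ is the property used later in the proof of Proposition 5.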

To derive the approximation estimate, we first recall several standard lemmas\.

###### Lemma 5 (Jackson’s inequality \[[4](https://arxiv.org/html/2606.17419#bib.bib4)\]).

Let $\Omega=(-1,1)^{d}\subset\mathbb{R}^{d}$, and let $g\in W^{m,\infty}(\Omega)$. Then for each integer $N$, there exists a polynomial $p_{N}^{*}\in Q_{N}\subset\overline{Q}_{N}$ such that for every integer $0\leq k\leq\min\{m,N\}$,

$$
\|g-p_{N}^{*}\|_{W^{k,\infty}(\Omega)}\leq\frac{C(d,m)}{N^{m-k}}\|g\|_{W^{m,\infty}(\Omega)},
$$

where $C(d,m)$ depends only on $d$ and $m$.

###### Lemma 6 (Markov–Bernstein inequality \[[12](https://arxiv.org/html/2606.17419#bib.bib12)\]).

For any polynomial $p_{N}\in\overline{Q}_{N}$ and any multi-index $\bm{\alpha}$, one has

$$
\|\partial^{\bm{\alpha}}p_{N}\|_{L^{\infty}(\Omega)}\leq N^{2|\bm{\alpha}|}\|p_{N}\|_{L^{\infty}(\Omega)}.
$$
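A quick one-dimensional sanity check of this inequality, sketched under the assumption that the Chebyshev polynomial $T_{N}$ is used as the test polynomial (the choice $N=8$ and the evaluation grid are illustrative). For $T_{N}$ the first-derivative bound is attained at the endpoint, since $T_{N}'(1)=N^{2}$.

```python
import numpy as np
from numpy.polynomial import Chebyshev

N = 8
T = Chebyshev.basis(N)                 # T_N on [-1, 1], with sup norm 1
x = np.linspace(-1.0, 1.0, 20001)      # evaluation grid including both endpoints

sup_p = np.max(np.abs(T(x)))
sup_dp = np.max(np.abs(T.deriv()(x)))      # first derivative
sup_d2p = np.max(np.abs(T.deriv(2)(x)))    # second derivative

print(sup_dp / sup_p, N**2)      # ratio is (numerically) N^2
print(sup_d2p / sup_p, N**4)     # ratio stays below N^4
```

The $N^{2}$ growth per derivative is what produces the $N^{2k}$ factor in the stability estimates that follow.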

###### Lemma 7 (Sobolev stability of the Chebyshev interpolation operator).

Let $k\in\mathbb{N}_{0}$. The interpolation operator $\mathcal{P}_{N}\circ\mathcal{D}_{N}:C(\Omega)\to\overline{Q}_{N}$ is linear and satisfies

$$
\|\mathcal{P}_{N}\circ\mathcal{D}_{N}u\|_{W^{k,\infty}(\Omega)}\leq CN^{2k}(\log N)^{d}\|u\|_{L^{\infty}(\Omega)}
$$

for all $u\in C(\Omega)$, where $C>0$ depends only on $d$ and $k$.

###### Proposition 5.

Let $g\in W^{m,\infty}(\Omega)$. Then for every integer $0\leq k\leq\min\{m,N\}$,

$$
\|g-\mathcal{P}_{N}\circ\mathcal{D}_{N}g\|_{W^{k,\infty}(\Omega)}\leq C(d,m)\,N^{2k-m}(\log N)^{d}\|g\|_{W^{m,\infty}(\Omega)}.
\tag{51}
$$

Moreover, the reconstruction operator $\mathcal{P}_{N}:(\mathbb{R}^{(N+1)^{d}},\|\cdot\|_{\infty})\to(W^{k,\infty}(\Omega),\|\cdot\|_{W^{k,\infty}(\Omega)})$ is Lipschitz continuous with

$$
\|\mathcal{P}_{N}\|_{\ell^{\infty}\to W^{k,\infty}(\Omega)}\lesssim N^{2k}(\log N)^{d}.
\tag{52}
$$

###### Proof.

Fix $N\in\mathbb{N}$, and let $p_{N}^{*}\in Q_{N}$ be given by Lemma [5](https://arxiv.org/html/2606.17419#Thmlemma5). For any multi-index $\bm{\alpha}$ with $|\bm{\alpha}|=s\leq k$, we write

$$
\partial^{\bm{\alpha}}\bigl(g-\mathcal{P}_{N}\circ\mathcal{D}_{N}g\bigr)=\partial^{\bm{\alpha}}(g-p_{N}^{*})+\partial^{\bm{\alpha}}\bigl(p_{N}^{*}-\mathcal{P}_{N}\circ\mathcal{D}_{N}g\bigr).
$$

Since $p_{N}^{*}\in\overline{Q}_{N}$ and $\mathcal{P}_{N}\circ\mathcal{D}_{N}p_{N}^{*}=p_{N}^{*}$, we have

$$
p_{N}^{*}-\mathcal{P}_{N}\circ\mathcal{D}_{N}g=\mathcal{P}_{N}\circ\mathcal{D}_{N}(p_{N}^{*}-g).
$$

Hence,

$$
\|\partial^{\bm{\alpha}}(g-\mathcal{P}_{N}\circ\mathcal{D}_{N}g)\|_{L^{\infty}}\leq\underbrace{\|\partial^{\bm{\alpha}}(g-p_{N}^{*})\|_{L^{\infty}}}_{=:I_{1}}+\underbrace{\|\partial^{\bm{\alpha}}(\mathcal{P}_{N}\circ\mathcal{D}_{N}(p_{N}^{*}-g))\|_{L^{\infty}}}_{=:I_{2}}.
$$

By Lemma [5](https://arxiv.org/html/2606.17419#Thmlemma5),

$$
I_{1}\leq\frac{C(d,m)}{N^{m-s}}\|g\|_{W^{m,\infty}(\Omega)}.
$$

For $I_{2}$, Lemma [6](https://arxiv.org/html/2606.17419#Thmlemma6) and Lemma [7](https://arxiv.org/html/2606.17419#Thmlemma7) imply

$$
I_{2}\leq N^{2s}\|\mathcal{P}_{N}\circ\mathcal{D}_{N}(p_{N}^{*}-g)\|_{L^{\infty}}\leq C\,N^{2s}(\log N)^{d}\|p_{N}^{*}-g\|_{L^{\infty}}.
$$

Applying Lemma [5](https://arxiv.org/html/2606.17419#Thmlemma5) again with $k=0$, we get

$$
\|p_{N}^{*}-g\|_{L^{\infty}}\leq C(d,m)N^{-m}\|g\|_{W^{m,\infty}(\Omega)}.
$$

Therefore,

$$
I_{2}\leq C(d,m)\,N^{2s-m}(\log N)^{d}\|g\|_{W^{m,\infty}(\Omega)}.
$$

Combining the above estimates and taking the maximum over all $|\bm{\alpha}|\leq k$, we obtain

$$
\|g-\mathcal{P}_{N}\circ\mathcal{D}_{N}g\|_{W^{k,\infty}(\Omega)}\leq C(d,m)\,N^{2k-m}(\log N)^{d}\|g\|_{W^{m,\infty}(\Omega)},
$$

which proves ([51](https://arxiv.org/html/2606.17419#A1.E51)).

To prove ([52](https://arxiv.org/html/2606.17419#A1.E52)), let \(\bm{f}\in\mathbb{R}^{(N+1)^d}\). For any \(|\bm{\alpha}|\le k\), Lemma [6](https://arxiv.org/html/2606.17419#Thmlemma6) gives

\[
\|\partial^{\bm{\alpha}}(\mathcal{P}_N\bm{f})\|_{L^\infty} \le N^{2|\bm{\alpha}|}\|\mathcal{P}_N\bm{f}\|_{L^\infty} \le N^{2k}\|\mathcal{P}_N\bm{f}\|_{L^\infty}.
\]
For any given \(\bm{f}\in\mathbb{R}^{(N+1)^d}\), we can construct a piecewise linear function \(f\) that interpolates \(\bm{f}\) and satisfies \(\|f\|_{L^\infty}=\|\bm{f}\|_{\infty}\). Using Lemma [7](https://arxiv.org/html/2606.17419#Thmlemma7), we further obtain

\[
\|\mathcal{P}_N\bm{f}\|_{L^\infty} \le C(\log N)^{d}\|f\|_{L^\infty} = C(\log N)^{d}\|\bm{f}\|_{\infty}.
\]
Hence,

\[
\|\partial^{\bm{\alpha}}(\mathcal{P}_N\bm{f})\|_{L^\infty} \le C\,N^{2k}(\log N)^{d}\|\bm{f}\|_{\infty}.
\]
Taking the maximum over all \(|\bm{\alpha}|\le k\) yields ([52](https://arxiv.org/html/2606.17419#A1.E52)). ∎
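The logarithmic factor \((\log N)^d\) supplied by Lemma 7 has the form of a Lebesgue-constant bound for tensor-product interpolation. As an illustration only, the Python sketch below estimates the one-dimensional Lebesgue constant of Lagrange interpolation at Chebyshev-Gauss-Lobatto nodes and prints it next to \(\log N\). The choice of nodes and the function name `lebesgue_constant` are assumptions made for this example; the sampling operator \(\mathcal{D}_N\) is defined elsewhere in the paper and may use a different grid.

```python
import math


def lebesgue_constant(n, samples=1000):
    """Estimate max_x sum_j |l_j(x)| for the n + 1 nodes cos(j * pi / n), j = 0..n."""
    nodes = [math.cos(j * math.pi / n) for j in range(n + 1)]
    best = 0.0
    for i in range(samples + 1):
        x = -1.0 + 2.0 * i / samples
        total = 0.0
        for j, xj in enumerate(nodes):
            # Evaluate the j-th Lagrange basis polynomial at x.
            term = 1.0
            for k, xk in enumerate(nodes):
                if k != j:
                    term *= (x - xk) / (xj - xk)
            total += abs(term)
        best = max(best, total)
    return best


for n in (4, 8, 16, 32):
    # Sampled Lebesgue constant next to log(n) for comparison of growth.
    print(n, round(lebesgue_constant(n), 3), round(math.log(n), 3))
```

Because the maximum is taken over a finite sample, the printed value is a lower estimate of the true Lebesgue constant. In \(d\) dimensions the tensor-product constant is the \(d\)-th power of the one-dimensional one, which matches the exponent \(d\) on the logarithm.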

#### A.1.2 Neural network approximation in Sobolev spaces

As discussed above, the final step of our construction requires approximating Sobolev-regular coefficient functions by neural networks. This problem has been extensively studied in approximation theory and learning theory; see, for example, [[56](https://arxiv.org/html/2606.17419#bib.bib56), [57](https://arxiv.org/html/2606.17419#bib.bib57), [16](https://arxiv.org/html/2606.17419#bib.bib16), [40](https://arxiv.org/html/2606.17419#bib.bib40), [35](https://arxiv.org/html/2606.17419#bib.bib35), [54](https://arxiv.org/html/2606.17419#bib.bib54), [51](https://arxiv.org/html/2606.17419#bib.bib51)]. Our argument follows the standard ReLU approximation framework of [[56](https://arxiv.org/html/2606.17419#bib.bib56)]. The only additional point needed for our later generalization analysis is an explicit bookkeeping of the network complexity. In particular, we need bounds on the depth, width, sparsity, and the magnitude of the weights and biases. For this reason, we recall and rederive the construction with all relevant network-size parameters made explicit.

###### Lemma 8 (Approximate multiplication in \(W^{\ell,\infty}\)).

Let \(M>0\), \(0<\varepsilon<1\), and \(\ell\in\{0,1\}\). There exists a ReLU network \(\widetilde{\times}_{M,\varepsilon}:\mathbb{R}^{2}\to\mathbb{R}\) such that

\[
\left\|\widetilde{\times}_{M,\varepsilon}(x,y) - xy\right\|_{W^{\ell,\infty}((-M,M)^{2})} \le \varepsilon.
\]
Moreover,

\[
\widetilde{\times}_{M,\varepsilon}(0,y) = 0 \quad \text{for all } y\in(-M,M),
\]
and similarly \(\widetilde{\times}_{M,\varepsilon}(x,0)=0\) for all \(x\in(-M,M)\). The network has width bounded by a universal constant, depth at most \(C(M)\log(\varepsilon^{-1})\), and at most \(C(M)\log(\varepsilon^{-1})\) nonzero parameters. In addition, all weights and biases are bounded by \(C(M)\varepsilon^{-1}\).

###### Proof.

For \(\ell=0\), this is the standard ReLU multiplication construction; see, for example, Proposition 3 of [[56](https://arxiv.org/html/2606.17419#bib.bib56)]. The proof for \(\ell=1\) is similar to the construction in [[54](https://arxiv.org/html/2606.17419#bib.bib54), Proposition 4]. Since the present argument requires explicit bounds on the network size and parameter magnitudes, we include the details.

We first construct a ReLU approximation to the square function. For \(\widetilde{x}\in[-1,1]\), define the tent map

\[
T_1(\widetilde{x}) = \begin{cases} 2|\widetilde{x}|, & |\widetilde{x}|\le\frac{1}{2},\\ 2(1-|\widetilde{x}|), & \frac{1}{2}<|\widetilde{x}|\le 1, \end{cases}
\]
and set \(T_j = T_1\circ T_{j-1}\) for \(j\ge 2\). For \(L\in\mathbb{N}_{+}\), define

\[
\widetilde{\psi}_L(\widetilde{x}) = |\widetilde{x}| - \sum_{j=1}^{L}\frac{T_j(\widetilde{x})}{2^{2j}}.
\]
The leading term is \(|\widetilde{x}|\), so that \(\widetilde{\psi}_L\) is even, like the target \(\widetilde{x}^{2}\). The function \(\widetilde{\psi}_L\) can be represented by a ReLU network with width bounded by a universal constant and depth \(CL\). Moreover, the standard sawtooth approximation estimate gives

\[
\left\|\widetilde{\psi}_L - \widetilde{x}^{2}\right\|_{W^{1,\infty}((-1,1))} \le C2^{-L}, \qquad \|\widetilde{\psi}_L\|_{W^{1,\infty}((-1,1))} \le C,
\]
and \(\widetilde{\psi}_L(0)=0\).

We rescale this approximation to \((-2M,2M)\) by setting \(\psi_L(t) := (2M)^{2}\widetilde{\psi}_L(t/(2M))\). Then

\[
\left\|\psi_L - t^{2}\right\|_{W^{1,\infty}((-2M,2M))} \le C(M)2^{-L}, \qquad \psi_L(0)=0.
\]
Now define

\[
\widetilde{\times}_{M,\varepsilon}(x,y) = \frac{1}{2}\left[\psi_L(x+y) - \psi_L(x) - \psi_L(y)\right]. \tag{53}
\]
Since \(xy = \frac{1}{2}\left((x+y)^{2} - x^{2} - y^{2}\right)\), we have

\[
\left\|\widetilde{\times}_{M,\varepsilon}(x,y) - xy\right\|_{W^{1,\infty}((-M,M)^{2})} \le C(M)2^{-L}.
\]
Choosing \(L\asymp\log(\varepsilon^{-1})\) gives the desired \(W^{1,\infty}\)-estimate, and the \(L^{\infty}\)-estimate follows immediately.

The zero-preserving identities follow directly from the definition ([53](https://arxiv.org/html/2606.17419#A1.E53)) and \(\psi_L(0)=0\):

\[
\widetilde{\times}_{M,\varepsilon}(0,y) = \frac{1}{2}\left[\psi_L(y) - \psi_L(0) - \psi_L(y)\right] = 0,
\]
and the identity at \(y=0\) is proved in the same way. The stated bounds on width, depth, number of parameters, and weights follow from the construction of \(\psi_L\) and the rescaling. This completes the proof. ∎
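The construction in this proof can be checked numerically. The following Python sketch is a minimal illustration, not the network itself: it evaluates the sawtooth functions directly rather than through ReLU layers, and it uses the even form \(|\widetilde{x}| - \sum_{j=1}^{L} T_j(\widetilde{x})/2^{2j}\) of the square approximation. The names `tent`, `square_approx`, and `product_approx` are hypothetical. Only the \(L^{\infty}\) error and the zero-preserving property are tested; the derivative error is not.

```python
def tent(x):
    """Tent map T_1 on [-1, 1]: 2|x| for |x| <= 1/2 and 2(1 - |x|) otherwise."""
    a = abs(x)
    return 2.0 * a if a <= 0.5 else 2.0 * (1.0 - a)


def square_approx(x, depth):
    """Sawtooth approximation of x**2 on [-1, 1]: |x| - sum_{j <= depth} T_j(x) / 4**j."""
    value = abs(x)
    t = x
    for j in range(1, depth + 1):
        t = tent(t)  # t now equals T_j(x)
        value -= t / 4.0 ** j
    return value


def product_approx(x, y, bound, depth):
    """Approximate x * y on (-bound, bound)^2 through the polarization identity (53)."""
    def psi(t):
        # Square approximation rescaled to the interval (-2 * bound, 2 * bound).
        return (2.0 * bound) ** 2 * square_approx(t / (2.0 * bound), depth)
    return 0.5 * (psi(x + y) - psi(x) - psi(y))


bound, depth = 2.0, 8
grid = [-1.9 + 0.1 * i for i in range(39)]
worst = max(abs(product_approx(x, y, bound, depth) - x * y) for x in grid for y in grid)
# Sampled sup-norm error next to the reference value bound**2 / 4**depth.
print(worst, bound ** 2 / 4 ** depth)
```

Each additional level of composition divides the sup-norm error by four, so a depth proportional to \(\log(\varepsilon^{-1})\) suffices, in line with the depth bound of Lemma 8.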

###### Proposition 6 (Local-basis ReLU approximation of Sobolev functions in \(W^{\ell,\infty}\)).

Let \(\Omega=[-1,1]^{d}\), \(n\in\mathbb{N}_{+}\), \(R>0\), and \(\ell\in\{0,1\}\) with \(n>\ell\). Then there exists a constant \(C=C(d,n,\ell)>0\) such that, for every \(P\ge 2\), there exist ReLU networks \(f_j:\mathbb{R}^{d}\to\mathbb{R}\), \(j=1,\ldots,P\), independent of \(f\), such that for every \(f\in W^{n,\infty}(\Omega)\) with \(\|f\|_{W^{n,\infty}(\Omega)}\le R\), there exist coefficients \(a_j\in\mathbb{R}\), \(j=1,\ldots,P\), satisfying

\[
\left\|f - \sum_{j=1}^{P}a_j f_j\right\|_{W^{\ell,\infty}(\Omega)} \le CRP^{-(n-\ell)/d}, \qquad |a_j|\le CR.
\]
Moreover, the basis networks can be chosen so that

\[
\sup_{\bm{x}\in\Omega}\sum_{j=1}^{P}|f_j(\bm{x})| \le C, \qquad \sup_{\bm{x}\in\Omega}\sum_{j=1}^{P}\sum_{r=1}^{d}|\partial_{x_r}f_j(\bm{x})| \le CP^{1/d}.
\]
Each \(f_j\) belongs to \(\mathcal{F}_{\mathrm{NN}}(d,1,L_{\rm loc},p_{\rm loc},K_{\rm loc},\kappa_{\rm loc},M_{\rm loc})\), where

\[
L_{\rm loc}\le C\log P, \qquad p_{\rm loc}\le C, \qquad K_{\rm loc}\le C\log P,
\]
and \(\kappa_{\rm loc}\le CP^{(n+d-\ell)/d}\), \(M_{\rm loc}\le C\).

###### Proof.

We prove the result for \(\ell\in\{0,1\}\). The construction follows the local average polynomial approximation argument in [[16](https://arxiv.org/html/2606.17419#bib.bib16)].

Let \(N\in\mathbb{N}\) be chosen later and set \(\mathcal{I}_N=\{0,1,\ldots,N\}^{d}\). For \(\bm{m}=(m_1,\ldots,m_d)\in\mathcal{I}_N\), define \(\bm{x}_{\bm{m}} := -\bm{1} + 2\bm{m}/N\in[-1,1]^{d}\). Let

\[
\psi(t) = \begin{cases} 1, & |t|<1,\\ 2-|t|, & 1\le|t|\le 2,\\ 0, & |t|>2, \end{cases}
\]
and define the local cutoff function

\[
\phi_{\bm{m}}(\bm{x}) = \prod_{r=1}^{d}\psi\left(\frac{3N}{2}(x_r - x_{\bm{m},r})\right).
\]
Then \(\phi_{\bm{m}}\) is supported in a cube of side length \(O(N^{-1})\), and the family \(\{\phi_{\bm{m}}\}_{\bm{m}\in\mathcal{I}_N}\) has uniformly bounded overlap. Moreover,

\[
\|\phi_{\bm{m}}\|_{L^{\infty}(\Omega)}\le C, \qquad \|\nabla\phi_{\bm{m}}\|_{L^{\infty}(\Omega)}\le CN.
\]
Using local average polynomials [[6](https://arxiv.org/html/2606.17419#bib.bib6)] of order \(n-1\), define

\[
f_N(\bm{x}) = \sum_{\bm{m}\in\mathcal{I}_N}\sum_{|\bm{\alpha}|<n}a_{\bm{m},\bm{\alpha}}\,\phi_{\bm{m}}(\bm{x})(\bm{x}-\bm{x}_{\bm{m}})^{\bm{\alpha}},
\]
where \(a_{\bm{m},\bm{\alpha}}\) are the corresponding average polynomial coefficients of \(f\) at \(\bm{x}_{\bm{m}}\). Since \(\|f\|_{W^{n,\infty}(\Omega)}\le R\), we have \(|a_{\bm{m},\bm{\alpha}}|\le C(d,n)R\). The standard local average polynomial estimate, namely the Bramble–Hilbert lemma [[6](https://arxiv.org/html/2606.17419#bib.bib6), Lemma 4.3.8], gives

\[
\|f - f_N\|_{W^{\ell,\infty}(\Omega)} \le CRN^{-(n-\ell)}, \qquad \ell=0,1.
\]
For each pair \((\bm{m},\bm{\alpha})\), set \(q_{\bm{m},\bm{\alpha}}(\bm{x}) := \phi_{\bm{m}}(\bm{x})(\bm{x}-\bm{x}_{\bm{m}})^{\bm{\alpha}}\). The function \(q_{\bm{m},\bm{\alpha}}\) is a product of at most \(d+n\) bounded piecewise-linear or affine factors. The cutoff \(\psi\) is exactly representable by a fixed-size ReLU network, and the affine factors are exactly representable by affine layers. Applying Lemma [8](https://arxiv.org/html/2606.17419#Thmlemma8) repeatedly, for any \(0<\delta<1\), there exists a ReLU network \(\widetilde{q}_{\bm{m},\bm{\alpha}}\), independent of \(f\), such that

\[
\|q_{\bm{m},\bm{\alpha}} - \widetilde{q}_{\bm{m},\bm{\alpha}}\|_{W^{\ell,\infty}(\Omega)} \le C\delta.
\]
Moreover, \(\widetilde{q}_{\bm{m},\bm{\alpha}}\in\mathcal{F}_{\mathrm{NN}}(d,1,L_{\rm loc},p_{\rm loc},K_{\rm loc},\kappa_{\rm loc},M_{\rm loc})\), where

\[
L_{\rm loc}\le C\log(\delta^{-1}), \qquad p_{\rm loc}\le C, \qquad K_{\rm loc}\le C\log(\delta^{-1}),
\]
and \(\kappa_{\rm loc}\le C\max\{N,\delta^{-1}\}\), \(M_{\rm loc}\le C\). Here the factor \(N\) comes from the affine rescaling in the local cutoffs.

We now enumerate all pairs \((\bm{m},\bm{\alpha})\in\mathcal{I}_N\times\{\bm{\alpha}:|\bm{\alpha}|<n\}\) by \(j=1,\ldots,P_N\). Since the number of multi-indices \(\bm{\alpha}\) with \(|\bm{\alpha}|<n\) depends only on \(d\) and \(n\), we have \(P_N\le CN^{d}\). Write \(a_j := a_{\bm{m},\bm{\alpha}}\) and \(f_j := \widetilde{q}_{\bm{m},\bm{\alpha}}\). The networks \(f_j\) are constructed only from the grid points, the cutoff function, and the monomial basis. Hence they are independent of the target function \(f\), and the dependence on \(f\) enters only through the coefficients \(a_j\).

Using \(|a_j|\le CR\) and \(P_N\le CN^{d}\), we obtain

\[
\left\|f_N - \sum_{j=1}^{P_N}a_j f_j\right\|_{W^{\ell,\infty}(\Omega)} \le CRN^{d}\delta.
\]
Choose \(\delta = N^{-(n-\ell)-d}\). Then

\[
\left\|f - \sum_{j=1}^{P_N}a_j f_j\right\|_{W^{\ell,\infty}(\Omega)} \le CRN^{-(n-\ell)}.
\]
With this choice of \(\delta\), each \(f_j\) satisfies \(L_{\rm loc}\le C\log N\), \(p_{\rm loc}\le C\), \(K_{\rm loc}\le C\log N\), \(\kappa_{\rm loc}\le CN^{n+d-\ell}\), and \(M_{\rm loc}\le C\).

It remains to record the local-overlap bounds for the constructed basis. Since \(\phi_{\bm{m}}\) is supported in a cube of side length \(O(N^{-1})\), the support of \(q_{\bm{m},\bm{\alpha}}\) is contained in the same local cube. The multiplication networks in Lemma [8](https://arxiv.org/html/2606.17419#Thmlemma8) may be chosen to preserve the zero property when one factor is zero. Therefore, \(\widetilde{q}_{\bm{m},\bm{\alpha}}\), and hence \(f_j\), has the same local support property, up to a fixed enlargement independent of \(N\).

For every \(\bm{x}\in\Omega\), only \(C(d)\) cutoffs \(\phi_{\bm{m}}\) are nonzero. Since the number of multi-indices \(\bm{\alpha}\) with \(|\bm{\alpha}|<n\) depends only on \(d\) and \(n\), at each \(\bm{x}\) only \(C(d,n)\) basis functions \(f_j\) can be nonzero. Furthermore, \(\|f_j\|_{L^{\infty}(\Omega)}\le C\). Hence

\[
\sup_{\bm{x}\in\Omega}\sum_{j=1}^{P_N}|f_j(\bm{x})| \le C.
\]
When \(\ell=1\), we also estimate the derivatives. Since \(\|\nabla\phi_{\bm{m}}\|_{L^{\infty}(\Omega)}\le CN\), and the monomial factors \((\bm{x}-\bm{x}_{\bm{m}})^{\bm{\alpha}}\) and their first derivatives are uniformly bounded on \(\Omega\), with constants depending only on \(d\) and \(n\), we have

\[
\|\nabla q_{\bm{m},\bm{\alpha}}\|_{L^{\infty}(\Omega)} \le CN.
\]
Since \(\|q_{\bm{m},\bm{\alpha}} - \widetilde{q}_{\bm{m},\bm{\alpha}}\|_{W^{1,\infty}(\Omega)}\le C\delta\) and \(\delta<1\), after increasing the constant we also have \(\|\nabla\widetilde{q}_{\bm{m},\bm{\alpha}}\|_{L^{\infty}(\Omega)}\le CN\). Using again the uniformly bounded overlap of the local supports, we obtain

\[
\sup_{\bm{x}\in\Omega}\sum_{j=1}^{P_N}\sum_{r=1}^{d}|\partial_{x_r}f_j(\bm{x})| \le CN.
\]
Finally, choose \(N=\lfloor cP^{1/d}\rfloor\), where \(c=c(d,n)>0\) is sufficiently small so that \(P_N\le P\). If \(P_N<P\), add \(P-P_N\) zero networks and zero coefficients. This does not affect the approximation error or the local-overlap bounds. Since \(N^{-(n-\ell)}\le CP^{-(n-\ell)/d}\), \(\log N\le C\log P\), and \(N^{n+d-\ell}\le CP^{(n+d-\ell)/d}\), the claimed approximation and network-parameter bounds follow. Moreover,

\[
\sup_{\bm{x}\in\Omega}\sum_{j=1}^{P}|f_j(\bm{x})| \le C,
\]
and, when \(\ell=1\),

\[
\sup_{\bm{x}\in\Omega}\sum_{j=1}^{P}\sum_{r=1}^{d}|\partial_{x_r}f_j(\bm{x})| \le CN \le CP^{1/d}.
\]
This completes the proof. ∎
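The bounded-overlap property of the cutoffs \(\phi_{\bm{m}}\) used in this proof can be seen directly in one dimension. The Python sketch below is an illustration under the assumption \(d=1\), with hypothetical function names `cutoff` and `local_cutoffs`. It evaluates all \(\phi_m\) at a point and reports their sum and the number of nonzero terms; with the scaling \(3N/2\) and grid spacing \(2/N\), a short computation shows that the cutoffs sum to one on \([-1,1]\) and that at most two of them are nonzero at any point.

```python
def cutoff(t):
    """Trapezoidal cutoff psi: 1 for |t| < 1, 2 - |t| for 1 <= |t| <= 2, and 0 beyond."""
    a = abs(t)
    if a < 1.0:
        return 1.0
    if a <= 2.0:
        return 2.0 - a
    return 0.0


def local_cutoffs(x, n_grid):
    """Values phi_m(x), m = 0..n_grid, for grid points x_m = -1 + 2 * m / n_grid (d = 1)."""
    return [
        cutoff(1.5 * n_grid * (x - (-1.0 + 2.0 * m / n_grid)))
        for m in range(n_grid + 1)
    ]


n_grid = 10
for i in range(5):
    x = -1.0 + 0.5 * i
    values = local_cutoffs(x, n_grid)
    # Print the point, the sum of all cutoffs, and how many are nonzero there.
    print(x, sum(values), sum(v > 0.0 for v in values))
```

Each cutoff has slope at most \(3N/2\), which is the source of the factor \(N\) in the gradient bound and, after the choice \(N\asymp P^{1/d}\), of the factor \(P^{1/d}\) in the derivative-overlap estimate.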

### A.2 Proofs in Step 1

###### Proof of Proposition [1](https://arxiv.org/html/2606.17419#Thmproposition1).

We use Assumption [1](https://arxiv.org/html/2606.17419#Thmassumption1) with \(k_i=0\) and \(\alpha_i=0\) together with the pseudo-spectral projection estimate in Proposition [5](https://arxiv.org/html/2606.17419#Thmproposition5). Let

\[
f_{\bar{m}} := \mathcal{P}_{\bar{m}}\mathcal{D}_{\bar{m}}f, \qquad g_m := \mathcal{P}_m\mathcal{D}_m g,
\]
where \(m,\bar{m}\in\mathbb{N}_{+}\). Here, \(f_{\bar{m}}\) depends only on the \(\bar{m}_{*} := (1+\bar{m})^{d}\) sampled values of \(f\), while \(g_m\) depends only on the \(m_{*} := (1+m)^{d}\) sampled values of \(g\). By the separate Lipschitz continuity of \(\mathcal{G}\) with respect to each input, we have

\[
\begin{aligned}
\|\mathcal{G}(f,g) - \mathcal{G}(f_{\bar{m}},g_m)\|_{W^{\ell,\infty}(\Omega)}
&\le \|\mathcal{G}(f,g) - \mathcal{G}(f,g_m)\|_{W^{\ell,\infty}(\Omega)} + \|\mathcal{G}(f,g_m) - \mathcal{G}(f_{\bar{m}},g_m)\|_{W^{\ell,\infty}(\Omega)} \\
&\le L_2\|g - g_m\|_{L^{\infty}(\Omega)} + L_1\|f - f_{\bar{m}}\|_{L^{\infty}(\Omega)}.
\end{aligned}
\]
Applying Proposition [5](https://arxiv.org/html/2606.17419#Thmproposition5) with \(k=0\), together with Assumptions [1](https://arxiv.org/html/2606.17419#Thmassumption1) and [2](https://arxiv.org/html/2606.17419#Thmassumption2), we obtain (for simplicity of notation, the constant \(C\) may vary from line to line)

\[
\begin{aligned}
\|\mathcal{G}(f,g) - \mathcal{G}(f_{\bar{m}},g_m)\|_{W^{\ell,\infty}(\Omega)}
&\le CU\left[L_1\,\|f - f_{\bar{m}}\|_{L^{\infty}(\Omega)}\left(1 + \|g_m\|_{W^{k_2,\infty}(\Omega)}\right)^{\alpha_1} + L_2\,\|g - g_m\|_{L^{\infty}(\Omega)}\left(1 + \|f\|_{W^{k_1,\infty}(\Omega)}\right)^{\alpha_2}\right] \\
&\le CU\left[L_1\,\bar{m}^{-n_1}(\log\bar{m})^{d}\left(1 + \|g_m\|_{W^{k_2,\infty}(\Omega)}\right)^{\alpha_1} + L_2\,m^{-n_2}(\log m)^{d}\right].
\end{aligned}
\]
Here \(U>0\) denotes a uniform bound for the Sobolev norms of the input classes, and \(C>0\) depends only on \(d,n_1,n_2,k_1,k_2\).

It remains to control the Sobolev norm of \(g_m\). By Lemma [7](https://arxiv.org/html/2606.17419#Thmlemma7),

\[
\|g_m\|_{W^{k_2,\infty}(\Omega)} = \|\mathcal{P}_m\mathcal{D}_m g\|_{W^{k_2,\infty}(\Omega)} \le CSm^{2k_2}(\log m)^{d}. \tag{54}
\]
Therefore,

\[
\|\mathcal{G}(f,g) - \mathcal{G}(f_{\bar{m}},g_m)\|_{W^{\ell,\infty}(\Omega)} \le CU\left[L_1\,\bar{m}^{-n_1}(\log\bar{m})^{d}\left(1 + Sm^{2k_2}(\log m)^{d}\right)^{\alpha_1} + L_2\,m^{-n_2}(\log m)^{d}\right]. \tag{55}
\]
∎

### A.3 Proofs in Step 2

Next, we estimate the error arising from the approximation of the \(g\)-dependent coefficient. Fix \(f_{\bar{m}}\) and \(\bm{x}\in\Omega\). Define

\[
F_m^{f_{\bar{m}},\bm{x}}(\bm{z}) := \mathcal{G}(f_{\bar{m}},\mathcal{P}_m\bm{z})(\bm{x}), \qquad \bm{z}\in[-S,S]^{m_{*}}.
\]
Then \(F_m^{f_{\bar{m}},\bm{x}}\) is a function of the finite-dimensional variable \(\bm{z}=\mathcal{D}_m g\in\mathbb{R}^{m_{*}}\). The following lemma shows that this function is Lipschitz uniformly in \(f_{\bar{m}}\) and \(\bm{x}\).

###### Lemma 9 (Sobolev regularity of the \(g\)-dependent coefficient map).

Assume that Assumption [1](https://arxiv.org/html/2606.17419#Thmassumption1) holds with the \(W^{\ell,\infty}(\Omega)\)-norm on the left-hand side, where \(\ell\in\mathbb{N}\). For fixed \(f_{\bar{m}}\) and \(\bm{x}\in\Omega\), define

\[
F_m^{f_{\bar{m}},\bm{x}}(\bm{z}) := \mathcal{G}(f_{\bar{m}},\mathcal{P}_m\bm{z})(\bm{x}), \qquad \bm{z}\in[-S,S]^{m_{*}}.
\]
Then, for every multi-index \(\gamma\) with \(|\gamma|\le\ell\), the function

\[
F_{m,\gamma}^{f_{\bar{m}},\bm{x}}(\bm{z}) := \partial_{\bm{x}}^{\gamma}\mathcal{G}(f_{\bar{m}},\mathcal{P}_m\bm{z})(\bm{x})
\]
belongs to \(W^{1,\infty}([-S,S]^{m_{*}})\). More precisely,

\[
\operatorname{Lip}\!\left(F_{m,\gamma}^{f_{\bar{m}},\bm{x}}\right) \le CL_1 m^{2k_2}(\log m)^{d}\left(1 + S\bar{m}^{2k_1}(\log\bar{m})^{d}\right)^{\alpha_2},
\]
uniformly for all \(|\gamma|\le\ell\), and

\[
\|F_{m,\gamma}^{f_{\bar{m}},\bm{x}}\|_{L^{\infty}([-S,S]^{m_{*}})} \le C\Bigl[1 + L_2 S\bar{m}^{2k_1}(\log\bar{m})^{d}\left(1 + Sm^{2k_2}(\log m)^{d}\right)^{\alpha_1} + L_1 Sm^{2k_2}(\log m)^{d}\Bigr], \tag{56}
\]
where \(C>0\) is independent of \(m,\bar{m},f_{\bar{m}},\bm{x}\), and \(\gamma\).

###### Proof.

Fix a multi-index \(\gamma\) with \(|\gamma|\le\ell\). For any \(\bm{z}_1,\bm{z}_2\in[-S,S]^{m_{*}}\), by the strengthened version of Assumption [1](https://arxiv.org/html/2606.17419#Thmassumption1) applied to the second input, we have

\[
\begin{aligned}
\left|F_{m,\gamma}^{f_{\bar{m}},\bm{x}}(\bm{z}_1) - F_{m,\gamma}^{f_{\bar{m}},\bm{x}}(\bm{z}_2)\right|
&\le \left\|\mathcal{G}(f_{\bar{m}},\mathcal{P}_m\bm{z}_1) - \mathcal{G}(f_{\bar{m}},\mathcal{P}_m\bm{z}_2)\right\|_{W^{\ell,\infty}(\Omega)} \\
&\le L_1\|\mathcal{P}_m(\bm{z}_1 - \bm{z}_2)\|_{W^{k_2,\infty}(\Omega)}\left(1 + \|f_{\bar{m}}\|_{W^{k_1,\infty}(\Omega)}\right)^{\alpha_2}.
\end{aligned}
\]
By the Sobolev stability of the reconstruction operator,

\[
\|\mathcal{P}_m(\bm{z}_1 - \bm{z}_2)\|_{W^{k_2,\infty}(\Omega)} \le Cm^{2k_2}(\log m)^{d}\|\bm{z}_1 - \bm{z}_2\|_{\ell^{\infty}},
\]
and

\[
\|f_{\bar{m}}\|_{W^{k_1,\infty}(\Omega)} \le C\bar{m}^{2k_1}(\log\bar{m})^{d}\|\mathcal{D}_{\bar{m}}f\|_{\ell^{\infty}} \le CS\bar{m}^{2k_1}(\log\bar{m})^{d}.
\]
Therefore,

\[
\operatorname{Lip}\!\left(F_{m,\gamma}^{f_{\bar{m}},\bm{x}}\right) \le CL_1 m^{2k_2}(\log m)^{d}\left(1 + S\bar{m}^{2k_1}(\log\bar{m})^{d}\right)^{\alpha_2}.
\]
It remains to prove the \(L^{\infty}\)-bound. For any \(\bm{z}\in[-S,S]^{m_{*}}\), we write

\[
\begin{aligned}
\left|F_{m,\gamma}^{f_{\bar{m}},\bm{x}}(\bm{z})\right|
&\le \left\|\mathcal{G}(f_{\bar{m}},\mathcal{P}_m\bm{z}) - \mathcal{G}(0,\mathcal{P}_m\bm{z})\right\|_{W^{\ell,\infty}(\Omega)} \\
&\quad + \left\|\mathcal{G}(0,\mathcal{P}_m\bm{z}) - \mathcal{G}(0,0)\right\|_{W^{\ell,\infty}(\Omega)} + \left\|\mathcal{G}(0,0)\right\|_{W^{\ell,\infty}(\Omega)}.
\end{aligned}
\]
Applying the strengthened Assumption [1](https://arxiv.org/html/2606.17419#Thmassumption1) to the first input gives

\[
\left\|\mathcal{G}(f_{\bar{m}},\mathcal{P}_m\bm{z}) - \mathcal{G}(0,\mathcal{P}_m\bm{z})\right\|_{W^{\ell,\infty}(\Omega)} \le L_2\|f_{\bar{m}}\|_{W^{k_1,\infty}(\Omega)}\left(1 + \|\mathcal{P}_m\bm{z}\|_{W^{k_2,\infty}(\Omega)}\right)^{\alpha_1}.
\]
Similarly, applying it to the second input gives

\[
\left\|\mathcal{G}(0,\mathcal{P}_m\bm{z}) - \mathcal{G}(0,0)\right\|_{W^{\ell,\infty}(\Omega)} \le L_1\|\mathcal{P}_m\bm{z}\|_{W^{k_2,\infty}(\Omega)}.
\]
Using

\[
\|f_{\bar{m}}\|_{W^{k_1,\infty}(\Omega)} \le CS\bar{m}^{2k_1}(\log\bar{m})^{d}
\]
and

\[
\|\mathcal{P}_m\bm{z}\|_{W^{k_2,\infty}(\Omega)} \le CSm^{2k_2}(\log m)^{d},
\]
we obtain

\[
\|F_{m,\gamma}^{f_{\bar{m}},\bm{x}}\|_{L^{\infty}([-S,S]^{m_{*}})} \le C\Bigl[1 + L_2 S\bar{m}^{2k_1}(\log\bar{m})^{d}\left(1 + Sm^{2k_2}(\log m)^{d}\right)^{\alpha_1} + L_1 Sm^{2k_2}(\log m)^{d}\Bigr],
\]
where the term \(\|\mathcal{G}(0,0)\|_{W^{\ell,\infty}(\Omega)}\) is absorbed into the constant. This completes the proof. ∎

We next approximate this Lipschitz coefficient map by a finite sum of neural network basis functions.

###### Lemma 10 (Neural network approximation of Lipschitz coefficient maps).

Let \(F:[-S,S]^{m_{*}}\to\mathbb{R}\) satisfy

\[
\|F\|_{L^{\infty}([-S,S]^{m_{*}})} \le B, \qquad \operatorname{Lip}(F)\le L_F.
\]
Then, for any \(0<\varepsilon\le 1\), if

\[
N = \left\lceil C\sqrt{m_{*}}\,L_F S\,\varepsilon^{-1}\right\rceil + 1,
\]
there exist points \(\bm{z}_1,\ldots,\bm{z}_{N^{m_{*}}}\in[-S,S]^{m_{*}}\) and ReLU networks

\[
q_s\in\mathcal{F}_{\mathrm{NN}}(m_{*},1,L_q,p_q,K_q,\kappa_q,1), \qquad s=1,\ldots,N^{m_{*}},
\]
such that

\[
\left\|F - \sum_{s=1}^{N^{m_{*}}}F(\bm{z}_s)q_s\right\|_{L^{\infty}([-S,S]^{m_{*}})} \le \varepsilon, \qquad \left\|\sum_{s=1}^{N^{m_{*}}}q_s\right\|_{L^{\infty}([-S,S]^{m_{*}})} \le 2, \qquad q_s\ge 0.
\]
Moreover, one may choose \(p_q=\mathcal{O}(1)\),

\[
L_q = \mathcal{O}\!\left(m_{*}^{2}\log m_{*} + m_{*}^{2}\log(\varepsilon^{-1})\right), \qquad K_q = \mathcal{O}\!\left(m_{*}^{2}\log m_{*} + m_{*}^{2}\log(\varepsilon^{-1})\right),
\]
and

\[
\kappa_q = \mathcal{O}\!\left(m_{*}B\,\varepsilon^{-1}\left(\sqrt{m_{*}}L_F S\varepsilon^{-1}\right)^{m_{*}}\right).
\]
The implicit constants are independent of \(m_{*}\), \(F\), and \(\varepsilon\).

###### Proof.

The result follows from the Lipschitz-function approximation construction in [[31](https://arxiv.org/html/2606.17419#bib.bib31), Theorem 5]. We recall the parameter dependence for completeness. Partition \([-S,S]^{m_{*}}\) into \(N^{m_{*}}\) cubes with side length of order \(S/N\), and choose one point \(\bm{z}_s\) in each cube. Let \(q_s\) be a ReLU realization of the corresponding localized hat basis function. Since \(\operatorname{Lip}(F)\le L_F\), choosing \(N\ge C\sqrt{m_{*}}L_F S\varepsilon^{-1}\) ensures that the oscillation of \(F\) on each cube is at most \(\varepsilon\). Therefore,

\[
\left\|F - \sum_{s=1}^{N^{m_{*}}}F(\bm{z}_s)q_s\right\|_{L^{\infty}([-S,S]^{m_{*}})} \le \varepsilon.
\]
Taking \(F\equiv 1\), we have \(F(\bm{z}_s)=1\) for all \(s\). Hence, since \(\varepsilon\le 1\),

\[
\left\|\sum_{s=1}^{N^{m_{*}}}q_s\right\|_{L^{\infty}([-S,S]^{m_{*}})} = \left\|\sum_{s=1}^{N^{m_{*}}}F(\bm{z}_s)q_s\right\|_{L^{\infty}([-S,S]^{m_{*}})} \le 1+\varepsilon \le 2.
\]
It remains to record the network parameters. By the construction in [[31](https://arxiv.org/html/2606.17419#bib.bib31), Theorem 5], each \(q_s\) can be represented by a ReLU network with \(M=1\), \(p_q=\mathcal{O}(1)\), and

\[
L_q = \mathcal{O}(m_{*}\log(\delta^{-1})), \qquad K_q = \mathcal{O}(m_{*}\log(\delta^{-1})),
\]
where \(\delta = \varepsilon/(2m_{*}N^{m_{*}}B)\). Hence

\[
\delta^{-1} = \mathcal{O}(m_{*}N^{m_{*}}B\varepsilon^{-1}) = \mathcal{O}\!\left(m_{*}B\left(\sqrt{m_{*}}L_F S\varepsilon^{-1}\right)^{m_{*}}\varepsilon^{-1}\right).
\]
The weights and biases are bounded by \(\mathcal{O}(\delta^{-1}+N)\), which gives the stated bound for \(\kappa_q\). Substituting this estimate into \(L_q\) and \(K_q\) gives the displayed bounds. ∎
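The grid-based approximation in this proof can be illustrated in the simplest case \(m_{*}=1\). The Python sketch below is an assumption-laden toy version: it uses exact hat functions written as combinations of three ReLUs instead of the networks of [31], a fixed constant \(2\) in place of \(C\), and a sample target \(F(z)=|\sin(3z)|\) with Lipschitz constant \(3\). The names `relu`, `hat`, and `hat_approximation` are hypothetical.

```python
import math


def relu(t):
    return max(t, 0.0)


def hat(z, center, width):
    """Hat function of height 1 and half-width `width`, as a combination of three ReLUs."""
    u = (z - center) / width
    return relu(u + 1.0) - 2.0 * relu(u) + relu(u - 1.0)


def hat_approximation(func, radius, n_points):
    """Return z -> sum_s func(z_s) * q_s(z) on [-radius, radius] with equispaced centers z_s."""
    width = 2.0 * radius / (n_points - 1)
    centers = [-radius + s * width for s in range(n_points)]
    values = [func(c) for c in centers]
    return lambda z: sum(v * hat(z, c, width) for v, c in zip(values, centers))


lipschitz, radius, eps = 3.0, 1.0, 0.1
# Number of grid points proportional to L_F * S / eps, as in the choice of N above.
n_points = math.ceil(2.0 * lipschitz * radius / eps) + 1
target = lambda z: abs(math.sin(lipschitz * z))
approx = hat_approximation(target, radius, n_points)
samples = [-radius + 0.001 * i for i in range(2001)]
error = max(abs(target(z) - approx(z)) for z in samples)
print(n_points, error, eps)
```

In this one-dimensional setting the hats are nonnegative and sum to one on the interval, so the error at each point is a weighted average of the differences \(|F(z)-F(z_s)|\) over the neighbouring grid points, each of which is at most \(L_F\) times the grid spacing. In dimension \(m_{*}\) the same grid has \(N^{m_{*}}\) points, which is the origin of the exponential dependence on \(m_{*}\) in \(\kappa_q\).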

###### Proof of Proposition [2](https://arxiv.org/html/2606.17419#Thmproposition2).

By Lemma [9](https://arxiv.org/html/2606.17419#Thmlemma9), for every multi-index $\gamma$ with $|\gamma|\leq\ell$, the function $F_{m,\gamma}^{f_{\bar{m}},\bm{x}}(\bm{z}):=\partial_{\bm{x}}^{\gamma}\mathcal{G}(f_{\bar{m}},\mathcal{P}_{m}\bm{z})(\bm{x})$ satisfies $\operatorname{Lip}(F_{m,\gamma}^{f_{\bar{m}},\bm{x}})\leq CA_{m,\bar{m}}^{(g)}$ and $\|F_{m,\gamma}^{f_{\bar{m}},\bm{x}}\|_{L^{\infty}([-S,S]^{m_{*}})}\leq CB_{m,\bar{m}}^{(g)}$, uniformly in $\bm{x}\in\Omega$ and $|\gamma|\leq\ell$. Here $C$ is independent of $m,\bar{m}$.

Apply Lemma [10](https://arxiv.org/html/2606.17419#Thmlemma10) with $L_{F}=CA_{m,\bar{m}}^{(g)}$ and $B=CB_{m,\bar{m}}^{(g)}$. Let $\varepsilon_{B}=CSA_{m,\bar{m}}^{(g)}\sqrt{m_{*}}\,J^{-1/m_{*}}$, and choose $J$ large enough so that $\varepsilon_{B}\leq 1$. Equivalently, we choose $N\asymp J^{1/m_{*}}$, so that $N^{m_{*}}\asymp J$. Since the points $\bm{z}_{s}$ and the networks $q_{s}$ in Lemma [10](https://arxiv.org/html/2606.17419#Thmlemma10) depend only on the grid and the parameters $L_{F},B,\varepsilon_{B}$, but not on the particular function $F$, the same points and the same networks can be used simultaneously for all derivative coefficient maps $F_{m,\gamma}^{f_{\bar{m}},\bm{x}}$, $|\gamma|\leq\ell$.

Thus, for every $|\gamma|\leq\ell$,

$$\left\|F_{m,\gamma}^{f_{\bar{m}},\bm{x}}-\sum_{s=1}^{J}F_{m,\gamma}^{f_{\bar{m}},\bm{x}}(\bm{z}_{s})\mathcal{B}_{s}\right\|_{L^{\infty}([-S,S]^{m_{*}})}\leq\varepsilon_{B},$$

and the branch networks satisfy $\|\sum_{s=1}^{J}\mathcal{B}_{s}\|_{L^{\infty}([-S,S]^{m_{*}})}\leq 2$ and $\mathcal{B}_{s}\geq 0$.

Define

$$\mathfrak{d}_{s}(F_{m}^{f_{\bar{m}}},\bm{x}):=F_{m}^{f_{\bar{m}},\bm{x}}(\bm{z}_{s})=\mathcal{G}(f_{\bar{m}},\mathcal{P}_{m}\bm{z}_{s})(\bm{x}).$$

Then, for every $|\gamma|\leq\ell$, $\partial_{\bm{x}}^{\gamma}\mathfrak{d}_{s}(F_{m}^{f_{\bar{m}}},\bm{x})=F_{m,\gamma}^{f_{\bar{m}},\bm{x}}(\bm{z}_{s})$. Taking $\bm{z}=\mathcal{D}_{m}g$ and using $g_{m}=\mathcal{P}_{m}\mathcal{D}_{m}g$, we obtain

$$\begin{aligned}
&\left|\partial_{\bm{x}}^{\gamma}\left[\mathcal{G}(f_{\bar{m}},g_{m})(\bm{x})-\sum_{s=1}^{J}\mathfrak{d}_{s}(F_{m}^{f_{\bar{m}}},\bm{x})\mathcal{B}_{s}(\mathcal{D}_{m}g)\right]\right|\\
&\qquad=\left|F_{m,\gamma}^{f_{\bar{m}},\bm{x}}(\mathcal{D}_{m}g)-\sum_{s=1}^{J}F_{m,\gamma}^{f_{\bar{m}},\bm{x}}(\bm{z}_{s})\mathcal{B}_{s}(\mathcal{D}_{m}g)\right|\leq\varepsilon_{B}.
\end{aligned}$$

Taking the supremum over $\bm{x}\in\Omega$ and all $|\gamma|\leq\ell$ gives

$$\left\|\mathcal{G}(f_{\bar{m}},g_{m})-\sum_{s=1}^{J}\mathfrak{d}_{s}(F_{m}^{f_{\bar{m}}},\cdot)\mathcal{B}_{s}(\mathcal{D}_{m}g)\right\|_{W^{\ell,\infty}(\Omega)}\leq C\varepsilon_{B}.$$

This proves ([19](https://arxiv.org/html/2606.17419#S5.E19)).

It remains to record the network parameters. By Lemma [10](https://arxiv.org/html/2606.17419#Thmlemma10), we may choose $p_{B}=\mathcal{O}(1)$ and

$$L_{B}=\mathcal{O}(m_{*}\log(\delta_{B}^{-1})),\qquad K_{B}=\mathcal{O}(m_{*}\log(\delta_{B}^{-1})),$$

where

$$\delta_{B}=\frac{\varepsilon_{B}}{2Cm_{*}JB_{m,\bar{m}}^{(g)}}.$$

Since $\varepsilon_{B}=CSA_{m,\bar{m}}^{(g)}\sqrt{m_{*}}J^{-1/m_{*}}$, we have

$$\delta_{B}^{-1}=\mathcal{O}\!\left(m_{*}B_{m,\bar{m}}^{(g)}\left(SA_{m,\bar{m}}^{(g)}\sqrt{m_{*}}\right)^{-1}J^{1+\frac{1}{m_{*}}}\right).$$

Thus

$$L_{B}=\mathcal{O}\!\left(m_{*}^{2}\log m_{*}+m_{*}\log J\right),\qquad K_{B}=\mathcal{O}\!\left(m_{*}^{2}\log m_{*}+m_{*}\log J\right),$$

where logarithmic factors from $A_{m,\bar{m}}^{(g)}$ and $B_{m,\bar{m}}^{(g)}$ are absorbed into the $m_{*}^{2}\log m_{*}$ term.

Finally, by Lemma [10](https://arxiv.org/html/2606.17419#Thmlemma10), the weights and biases satisfy

$$\kappa_{B}=\mathcal{O}\!\left(\sqrt{m_{*}}B_{m,\bar{m}}^{(g)}J^{1+\frac{1}{m_{*}}}\right).$$

This completes the proof. ∎
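The parameter bookkeeping in this proof can be checked numerically. The following is a minimal sketch, not part of the paper: the function name `branch_parameters` is hypothetical, every unspecified constant $C$ is set to 1 by assumption, and the arguments `A` and `B` stand in for $A_{m,\bar{m}}^{(g)}$ and $B_{m,\bar{m}}^{(g)}$. It evaluates $\varepsilon_{B}$, $\delta_{B}$, and the scale $m_{*}\log(\delta_{B}^{-1})$ that governs $L_{B}$ and $K_{B}$.

```python
import math

def branch_parameters(J, m_star, S=1.0, A=1.0, B=1.0, C=1.0):
    """Evaluate eps_B, delta_B and m_* log(1/delta_B) from the proof.

    C replaces every unspecified constant (an assumption); A and B play
    the roles of A_{m,mbar}^{(g)} and B_{m,mbar}^{(g)}.
    """
    # eps_B = C S A sqrt(m_*) J^{-1/m_*}
    eps_B = C * S * A * math.sqrt(m_star) * J ** (-1.0 / m_star)
    # delta_B = eps_B / (2 C m_* J B)
    delta_B = eps_B / (2.0 * C * m_star * J * B)
    # L_B and K_B are of this order, up to a constant factor
    size_scale = m_star * math.log(1.0 / delta_B)
    return eps_B, delta_B, size_scale

eps_B, delta_B, size_scale = branch_parameters(J=10**6, m_star=4)
print(f"eps_B = {eps_B:.3e}")
print(f"1/delta_B = {1.0 / delta_B:.3e}")
print(f"m_* log(1/delta_B) = {size_scale:.1f}")
```

With these illustrative inputs, $\delta_{B}^{-1}$ grows like $J^{1+1/m_{*}}$ while the network size only grows like $m_{*}\log J$, which is the point of the logarithmic depth bound.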

We also need the following lemma, which is used in the final trunk-approximation step.

###### Lemma 11 (Spatial regularity of the coefficient functions).

Fix $f_{\bar{m}}$. For $\bm{x}\in\Omega$, define

$$F_{m}^{f_{\bar{m}},\bm{x}}(\bm{z}):=\mathcal{G}(f_{\bar{m}},\mathcal{P}_{m}\bm{z})(\bm{x}),\qquad\bm{z}\in[-S,S]^{m_{*}}.$$

Let $c_{s}^{f_{\bar{m}}}(\bm{x}):=\mathfrak{d}_{s}(F_{m}^{f_{\bar{m}},\bm{x}})$, where $\mathfrak{d}_{s}(F):=F(\bm{z}_{s})$ for some fixed $\bm{z}_{s}\in[-S,S]^{m_{*}}$. Assume that

$$K_{m}^{(n_{3})}(f_{\bar{m}}):=\sup_{\bm{z}\in[-S,S]^{m_{*}}}\|\mathcal{G}(f_{\bar{m}},\mathcal{P}_{m}\bm{z})\|_{W^{n_{3},\infty}(\Omega)}<\infty.$$

Then $c_{s}^{f_{\bar{m}}}\in W^{n_{3},\infty}(\Omega)$, and

$$\|c_{s}^{f_{\bar{m}}}\|_{W^{n_{3},\infty}(\Omega)}\leq K_{m}^{(n_{3})}(f_{\bar{m}}).$$

###### Proof.

Since $\mathfrak{d}_{s}$ is point evaluation at $\bm{z}_{s}$, we have

$$c_{s}^{f_{\bar{m}}}(\bm{x})=F_{m}^{f_{\bar{m}},\bm{x}}(\bm{z}_{s})=\mathcal{G}(f_{\bar{m}},\mathcal{P}_{m}\bm{z}_{s})(\bm{x}).$$

Thus, for every multi-index $\beta$ with $|\beta|\leq n_{3}$,

$$D_{\bm{x}}^{\beta}c_{s}^{f_{\bar{m}}}(\bm{x})=D_{\bm{x}}^{\beta}\mathcal{G}(f_{\bar{m}},\mathcal{P}_{m}\bm{z}_{s})(\bm{x}).$$

Taking the $L^{\infty}(\Omega)$-norm and then the maximum over $|\beta|\leq n_{3}$, we obtain

$$\|c_{s}^{f_{\bar{m}}}\|_{W^{n_{3},\infty}(\Omega)}\leq\sup_{\bm{z}\in[-S,S]^{m_{*}}}\|\mathcal{G}(f_{\bar{m}},\mathcal{P}_{m}\bm{z})\|_{W^{n_{3},\infty}(\Omega)}=K_{m}^{(n_{3})}(f_{\bar{m}}).$$

∎

### A.4 Proofs in Step 3

The proof of Proposition [3](https://arxiv.org/html/2606.17419#Thmproposition3) is similar to that of Proposition [2](https://arxiv.org/html/2606.17419#Thmproposition2), and is therefore omitted. It remains to record the spatial regularity of the coefficient functions $e_{h,s}$, which will be used in the trunk approximation step.

###### Lemma 12 (Spatial regularity of $e_{h,s}$).

Assume that

$$K_{\bar{m},m}^{(n_{3})}:=\sup_{\substack{\bm{w}\in[-S,S]^{\bar{m}_{*}}\\ \bm{z}\in[-S,S]^{m_{*}}}}\left\|\mathcal{G}(\mathcal{P}_{\bar{m}}\bm{w},\mathcal{P}_{m}\bm{z})\right\|_{W^{n_{3},\infty}(\Omega)}<\infty.$$

Then, for every $h=1,\ldots,H$ and $s=1,\ldots,J$, we have $e_{h,s}\in W^{n_{3},\infty}(\Omega)$, and

$$\|e_{h,s}\|_{W^{n_{3},\infty}(\Omega)}\leq K_{\bar{m},m}^{(n_{3})}.$$

###### Proof.

By construction, both $\mathfrak{d}_{s}$ and $\mathfrak{e}_{h,s}$ are point-evaluation functionals. Hence there exist points $\bm{z}_{s}\in[-S,S]^{m_{*}}$ and $\bm{w}_{h,s}\in[-S,S]^{\bar{m}_{*}}$ such that $\mathfrak{d}_{s}(F)=F(\bm{z}_{s})$ and $\mathfrak{e}_{h,s}(\bar{F})=\bar{F}(\bm{w}_{h,s})$. Therefore,

$$e_{h,s}(\bm{x})=\bar{F}_{\bar{m},s}^{\bm{x}}(\bm{w}_{h,s})=F_{m}^{\mathcal{P}_{\bar{m}}\bm{w}_{h,s},\bm{x}}(\bm{z}_{s})=\mathcal{G}(\mathcal{P}_{\bar{m}}\bm{w}_{h,s},\mathcal{P}_{m}\bm{z}_{s})(\bm{x}).$$

Thus, for every multi-index $\beta$ with $|\beta|\leq n_{3}$,

$$D_{\bm{x}}^{\beta}e_{h,s}(\bm{x})=D_{\bm{x}}^{\beta}\mathcal{G}(\mathcal{P}_{\bar{m}}\bm{w}_{h,s},\mathcal{P}_{m}\bm{z}_{s})(\bm{x}).$$

Taking the $L^{\infty}(\Omega)$-norm and then the maximum over $|\beta|\leq n_{3}$, we obtain

$$\|e_{h,s}\|_{W^{n_{3},\infty}(\Omega)}\leq K_{\bar{m},m}^{(n_{3})}.$$

This completes the proof. ∎

### A.5 Proofs in Step 4

For the last term, we approximate the remaining spatial coefficient functions $e_{h,s}(\bm{x})$ by trunk networks. The trunk basis is chosen with uniformly bounded local overlap, which will be used to control the sum over the branch indices.

###### Proof.

For the reconstructed inputs, define

$$K_{\bar{m},m}^{(n_{3})}:=\sup_{\substack{\bm{w}\in[-S,S]^{\bar{m}_{*}}\\ \bm{z}\in[-S,S]^{m_{*}}}}\left\|\mathcal{G}(\mathcal{P}_{\bar{m}}\bm{w},\mathcal{P}_{m}\bm{z})\right\|_{W^{n_{3},\infty}(\Omega)}.$$

By the Sobolev stability of the reconstruction operators, for $\bm{w}\in[-S,S]^{\bar{m}_{*}}$ and $\bm{z}\in[-S,S]^{m_{*}}$, we have

$$\|\mathcal{P}_{\bar{m}}\bm{w}\|_{W^{\beta_{1},\infty}(\Omega)}\leq CS\bar{m}^{2\beta_{1}}(\log\bar{m})^{d},\qquad\|\mathcal{P}_{m}\bm{z}\|_{W^{\beta_{2},\infty}(\Omega)}\leq CSm^{2\beta_{2}}(\log m)^{d}.$$

Therefore, Assumption [3](https://arxiv.org/html/2606.17419#Thmassumption3) gives

$$K_{\bar{m},m}^{(n_{3})}\leq C\left(1+S\bar{m}^{2\beta_{1}}(\log\bar{m})^{d}\right)\left(1+Sm^{2\beta_{2}}(\log m)^{d}\right)=CR_{\bar{m},m}^{(n_{3})}.$$

By Lemma [12](https://arxiv.org/html/2606.17419#Thmlemma12), for every $h=1,\ldots,H$ and $s=1,\ldots,J$, we have $e_{h,s}\in W^{n_{3},\infty}(\Omega)$ and $\|e_{h,s}\|_{W^{n_{3},\infty}(\Omega)}\leq K_{\bar{m},m}^{(n_{3})}$. Applying Proposition [6](https://arxiv.org/html/2606.17419#Thmproposition6) to $e_{h,s}$, with $n=n_{3}$, $R=K_{\bar{m},m}^{(n_{3})}$, and the chosen $\ell\in\{0,1\}$, gives coefficients $e_{h,p,s}\in\mathbb{R}$ and trunk networks $\mathcal{T}_{p}\in\mathcal{F}_{\mathrm{NN}}(d,1,L_{T},p_{T},K_{T},\kappa_{T},M_{T})$, $p=1,\ldots,P$, such that

$$\left\|e_{h,s}-\sum_{p=1}^{P}e_{h,p,s}\mathcal{T}_{p}\right\|_{W^{\ell,\infty}(\Omega)}\leq CK_{\bar{m},m}^{(n_{3})}P^{-\frac{n_{3}-\ell}{d}},$$

and $|e_{h,p,s}|\leq CK_{\bar{m},m}^{(n_{3})}$. Since $K_{\bar{m},m}^{(n_{3})}\leq CR_{\bar{m},m}^{(n_{3})}$, this proves ([22](https://arxiv.org/html/2606.17419#S5.E22)) and the stated coefficient bound.

The same application of Proposition [6](https://arxiv.org/html/2606.17419#Thmproposition6) also gives the local-overlap bounds

$$\sup_{\bm{x}\in\Omega}\sum_{p=1}^{P}|\mathcal{T}_{p}(\bm{x})|\leq C,$$

and, when $\ell=1$,

$$\sup_{\bm{x}\in\Omega}\sum_{p=1}^{P}\sum_{r=1}^{d}|\partial_{x_{r}}\mathcal{T}_{p}(\bm{x})|\leq CP^{1/d}.$$

The bounds $L_{T}\leq C\log P$, $p_{T}\leq C$, $K_{T}\leq C\log P$, $\kappa_{T}\leq CP^{(n_{3}+d-\ell)/d}$, and $M_{T}\leq C$ also follow directly from Proposition [6](https://arxiv.org/html/2606.17419#Thmproposition6). This completes the proof. ∎
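The two local-overlap bounds can be illustrated in the simplest case $d=1$. The sketch below is an assumption-laden toy, not the trunk construction of Proposition 6: it uses piecewise-linear hat functions on a uniform grid of $[0,1]$ as a stand-in basis (each hat is exactly representable by a small ReLU network), and the helper name `hat_values_and_slopes` is hypothetical. At every sample point at most two hats are active, so the sum of values stays bounded while the sum of slope magnitudes grows linearly in $P$, matching $P^{1/d}$ with $d=1$.

```python
import numpy as np

def hat_values_and_slopes(x, P):
    """Values and slope magnitudes of P hat functions on a uniform grid of [0, 1].

    Stand-in basis for illustration only; the paper's trunk networks come
    from Proposition 6 and are not reproduced here.
    """
    nodes = np.linspace(0.0, 1.0, P)
    h = nodes[1] - nodes[0]
    dist = np.abs(x[:, None] - nodes[None, :])
    values = np.maximum(0.0, 1.0 - dist / h)   # hat centred at each node
    slopes = np.where(dist < h, 1.0 / h, 0.0)  # |derivative| away from the kinks
    return values, slopes

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)  # random points avoid the grid nodes
for P in (5, 50, 500):
    values, slopes = hat_values_and_slopes(x, P)
    sum_values = values.sum(axis=1).max()
    sum_slopes = slopes.sum(axis=1).max()
    print(P, sum_values, sum_slopes / P)
```

The printed ratio `sum_slopes / P` stays bounded as `P` grows, which is the one-dimensional analogue of the $CP^{1/d}$ derivative bound.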

### A.6 Proof of Corollaries [1](https://arxiv.org/html/2606.17419#Thmcorollary1) and [2](https://arxiv.org/html/2606.17419#Thmcorollary2)

The proof of Corollary [1](https://arxiv.org/html/2606.17419#Thmcorollary1) is a special case of Corollary [2](https://arxiv.org/html/2606.17419#Thmcorollary2). Therefore, it suffices to prove the latter.

###### Proof of Corollary [2](https://arxiv.org/html/2606.17419#Thmcorollary2).

Applying the general pseudo-spectral projection estimate in Sobolev norms and following the same shared branch–trunk approximation argument as in the two-input case gives ([14](https://arxiv.org/html/2606.17419#S3.E14)). We now consider the special case $\alpha_{i}=\beta_{i}=0$, $i=1,\ldots,\lambda$. Then ([14](https://arxiv.org/html/2606.17419#S3.E14)) reduces to

$$\begin{aligned}
E_{\mathrm{total}}\lesssim\;&\sum_{i=1}^{\lambda}M_{i}^{-\frac{n_{i}-2k_{i}}{d_{i}}}(\log M_{i})^{d_{i}}+\sum_{i=1}^{\lambda}SL_{i}\sqrt{M_{i}}\,M_{i}^{\frac{2k_{i}}{d_{i}}}(\log M_{i})^{d_{i}}J_{i}^{-1/M_{i}}\\
&+P^{-\frac{n_{\lambda+1}-\ell}{d_{\lambda+1}}}\prod_{i=1}^{\lambda}(1+S(\log M_{i})^{d_{i}}).
\end{aligned}\tag{57}$$

Set $A_{i}:=(n_{i}-2k_{i})/d_{i}$, $B_{i}:=2k_{i}/d_{i}$, and $\nu_{\ell}:=(n_{\lambda+1}-\ell)/d_{\lambda+1}$. Since $n_{i}>2k_{i}$, we have $A_{i}>0$.

We first balance the $i$-th discretization term with the corresponding branch approximation term:

$$M_{i}^{-A_{i}}(\log M_{i})^{d_{i}}\asymp SL_{i}\sqrt{M_{i}}\,M_{i}^{B_{i}}(\log M_{i})^{d_{i}}J_{i}^{-1/M_{i}}.$$

Canceling the common logarithmic factor gives $J_{i}^{1/M_{i}}\asymp SL_{i}M_{i}^{A_{i}+B_{i}+1/2}=SL_{i}M_{i}^{n_{i}/d_{i}+1/2}$. Hence

$$\log J_{i}\asymp M_{i}\log M_{i},\qquad i=1,\ldots,\lambda,$$

up to constants depending only on $d_{i},n_{i},k_{i},S,L_{i}$.

Let $0<\varepsilon<1$ be the target accuracy. Choose $M_{i}$ so that $M_{i}^{-A_{i}}(\log M_{i})^{d_{i}}\asymp\varepsilon$. Equivalently, up to logarithmic factors,

$$M_{i}\asymp\varepsilon^{-q_{i}}\left(\log\frac{1}{\varepsilon}\right)^{d_{i}q_{i}},\qquad q_{i}:=\frac{1}{A_{i}}=\frac{d_{i}}{n_{i}-2k_{i}}.$$

Consequently,

$$\log J_{i}\lesssim\varepsilon^{-q_{i}}\left(\log\frac{1}{\varepsilon}\right)^{d_{i}q_{i}+1}.$$

Define $Q_{\max}:=\max_{1\leq i\leq\lambda}q_{i}=\max_{1\leq i\leq\lambda}d_{i}/(n_{i}-2k_{i})$. Since $\lambda$ is fixed,

$$\sum_{i=1}^{\lambda}\log J_{i}\lesssim\varepsilon^{-Q_{\max}}\left(\log\frac{1}{\varepsilon}\right)^{C}.$$

It remains to choose $P$. Since $\log M_{i}\asymp\log(1/\varepsilon)$, we have

$$\prod_{i=1}^{\lambda}(1+S(\log M_{i})^{d_{i}})\lesssim C\left(\log\frac{1}{\varepsilon}\right)^{\sum_{i=1}^{\lambda}d_{i}}.$$

Thus it is enough to impose

$$P^{-\nu_{\ell}}\left(\log\frac{1}{\varepsilon}\right)^{\sum_{i=1}^{\lambda}d_{i}}\lesssim\varepsilon.$$

For instance, we may take

$$P\asymp\varepsilon^{-1/\nu_{\ell}}\left(\log\frac{1}{\varepsilon}\right)^{\frac{1}{\nu_{\ell}}\sum_{i=1}^{\lambda}d_{i}}.$$

With these choices of $M_{i},J_{i}$, and $P$, all terms in ([57](https://arxiv.org/html/2606.17419#A1.E57)) are bounded by $C\varepsilon$. Hence $E_{\mathrm{total}}\lesssim\varepsilon$.

We now express $\varepsilon$ in terms of $N_{\mathrm{tot}}$. For the shared architecture ([13](https://arxiv.org/html/2606.17419#S3.E13)), there are $J_{i}$ branch networks for the $i$-th input, $P$ trunk networks, and $P\prod_{i=1}^{\lambda}J_{i}$ scalar coefficients. Hence

$$N_{\mathrm{tot}}\lesssim\sum_{i=1}^{\lambda}J_{i}K_{i}+PK_{T}+P\prod_{i=1}^{\lambda}J_{i},$$

where $K_{i}$ and $K_{T}$ denote the sizes of the corresponding branch and trunk networks. These factors grow at most polynomially in $M_{i},\log J_{i}$, and $\log P$, and therefore only contribute lower-order logarithmic factors. Thus

$$\log N_{\mathrm{tot}}\lesssim\sum_{i=1}^{\lambda}\log J_{i}+\log P+\text{lower-order logarithmic factors}\lesssim\varepsilon^{-Q_{\max}}\left(\log\frac{1}{\varepsilon}\right)^{C}.$$

Inverting this relation gives, up to logarithmic factors,

$$\varepsilon\lesssim\left(\frac{\log N_{\mathrm{tot}}}{\log\log N_{\mathrm{tot}}}\right)^{-1/Q_{\max}}.$$

Moreover, since $\log(1/\varepsilon)\asymp\log\log N_{\mathrm{tot}}$, the explicit logarithmic factor from the trunk term satisfies

$$\prod_{i=1}^{\lambda}(1+S(\log M_{i})^{d_{i}})\lesssim C(\log\log N_{\mathrm{tot}})^{\sum_{i=1}^{\lambda}d_{i}}.$$

Therefore,

$$E_{\mathrm{total}}\lesssim\left(\frac{\log N_{\mathrm{tot}}}{\log\log N_{\mathrm{tot}}}\right)^{-1/Q_{\max}}(\log\log N_{\mathrm{tot}})^{\sum_{i=1}^{\lambda}d_{i}}.$$

Since $1/Q_{\max}=\min_{1\leq i\leq\lambda}(n_{i}-2k_{i})/d_{i}$, this can be written as

$$E_{\mathrm{total}}\lesssim\left(\frac{\log N_{\mathrm{tot}}}{\log\log N_{\mathrm{tot}}}\right)^{-\min_{1\leq i\leq\lambda}\frac{n_{i}-2k_{i}}{d_{i}}}(\log\log N_{\mathrm{tot}})^{\sum_{i=1}^{\lambda}d_{i}}.$$

This completes the proof. ∎
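The balancing argument above can be traced numerically. The following is a rough sketch only: the function names `balanced_choices` and `rate_bound` and the argument names `n_out` and `d_out` (standing for $n_{\lambda+1}$ and $d_{\lambda+1}$) are hypothetical, and every implied constant in the $\asymp$ and $\lesssim$ relations is set to 1 by assumption, so the outputs indicate scaling rather than exact values.

```python
import math

def balanced_choices(eps, n, k, d, n_out, d_out, ell=0):
    """Parameter scalings from the proof, with all implied constants set to 1.

    n, k, d are lists over the lambda inputs; n_out and d_out stand for
    n_{lambda+1} and d_{lambda+1} (names chosen here for illustration).
    """
    log_inv = math.log(1.0 / eps)
    # q_i = d_i / (n_i - 2 k_i), requires n_i > 2 k_i
    q = [di / (ni - 2 * ki) for ni, ki, di in zip(n, k, d)]
    # M_i ~ eps^{-q_i} (log 1/eps)^{d_i q_i}
    M = [eps ** (-qi) * log_inv ** (di * qi) for qi, di in zip(q, d)]
    # log J_i ~ M_i log M_i
    log_J = [Mi * math.log(Mi) for Mi in M]
    # P ~ eps^{-1/nu} (log 1/eps)^{(sum d_i)/nu}, nu = (n_out - ell)/d_out
    nu = (n_out - ell) / d_out
    P = eps ** (-1.0 / nu) * log_inv ** (sum(d) / nu)
    return M, log_J, P, max(q)

def rate_bound(N_tot, Q_max, d):
    """Right-hand side of the final bound, constants dropped; needs log log N_tot > 0."""
    log_N = math.log(N_tot)
    log_log_N = math.log(log_N)
    return (log_N / log_log_N) ** (-1.0 / Q_max) * log_log_N ** sum(d)

M, log_J, P, Q_max = balanced_choices(
    eps=0.1, n=[4, 5], k=[1, 1], d=[1, 2], n_out=4, d_out=2, ell=1
)
print("Q_max =", Q_max)
print("M =", M, "log J =", log_J, "P =", P)
```

In this illustrative setting the second input has the larger $q_{i}$, so it alone determines $Q_{\max}$ and hence the exponent of the final rate; the $\log J_{i}$ values dominate $\log P$, consistent with the branch count driving $\log N_{\mathrm{tot}}$.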

## Appendix B Proof for Generalization error

### B.1 Proof of Lemma [1](https://arxiv.org/html/2606.17419#Thmlemma1)

###### Proof of Lemma [1](https://arxiv.org/html/2606.17419#Thmlemma1).

Condition on the training inputs and locations. Then the empirical sample set is fixed. Applying the same finite-net and sub-Gaussian maximal inequality argument as in \[[31](https://arxiv.org/html/2606.17419#bib.bib31), Lemma 3\] gives the desired estimate with the covering number taken with respect to the empirical metric on this fixed sample. Since a $d_{\mathcal{S},\ell}$-cover also controls the empirical Sobolev norm $\|\cdot\|_{n,\ell}$, and since $\mathcal{N}(\eta,\mathcal{F}_{G},d_{\mathcal{S},\ell})\leq\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})$, the same estimate holds with $\mathcal{N}_{\rm emp}(\eta,\mathcal{F}_{G})$. Taking expectation over the training samples completes the proof. ∎

### B.2 Proof of Lemma [2](https://arxiv.org/html/2606.17419#Thmlemma2)

###### Proof of Lemma [2](https://arxiv.org/html/2606.17419#Thmlemma2).

The proof follows the ghost-sample and finite-covering argument in \[[31](https://arxiv.org/html/2606.17419#bib.bib31), Lemma 4\]. For a more detailed presentation of the same empirical-process argument, see also \[[30](https://arxiv.org/html/2606.17419#bib.bib30), Proof of Lemma 29\]. ∎

### B.3 Proof of Lemma [4](https://arxiv.org/html/2606.17419#Thmlemma4)

###### Proof of Lemma [4](https://arxiv.org/html/2606.17419#Thmlemma4).

The proof follows the same argument as \[[50](https://arxiv.org/html/2606.17419#bib.bib50), Proposition 5\]. We only indicate the minor difference. In \[[50](https://arxiv.org/html/2606.17419#bib.bib50), Proposition 5\], the pseudo-dimension estimate for the trunk network contains a cubic dependence on the depth because the trunk network uses the squared ReLU activation. In the present setting, the trunk network uses the standard ReLU activation, and therefore the corresponding depth dependence is quadratic.

Applying the same counting argument to the $g$-branch networks, the $f$-branch networks, the trunk networks, and the scalar coefficients $e_{h,p,s}$, we obtain

$$\operatorname{Pdim}(\mathcal{F}_{G}^{\gamma})\leq CJHP\left(L_{B}^{2}p_{B}^{2}+L_{E}^{2}p_{E}^{2}+L_{T}^{2}p_{T}^{2}\right).$$

A more careful inspection of the proof of \[[50](https://arxiv.org/html/2606.17419#bib.bib50), Proposition 5\] yields the sharper estimate

$$\operatorname{Pdim}(\mathcal{F}_{G}^{\gamma})\leq C\left(JL_{B}^{2}p_{B}^{2}+HL_{E}^{2}p_{E}^{2}+PL_{T}^{2}p_{T}^{2}+JHP\right).$$

However, this sharper estimate is not needed for the final balanced rate. The coarser bound above already gives stochastic terms that can be dominated by the approximation error after choosing $H$ as a sufficiently small power of the outer sample size. Therefore, for readability, we use the simpler bound. ∎
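To get a feel for the gap between the two pseudo-dimension bounds, one can evaluate both right-hand sides for sample architecture sizes. This is a minimal sketch under stated assumptions: the function names are hypothetical, the unspecified constant $C$ is set to 1 in both bounds (the two constants need not coincide in the paper), and the sample sizes are arbitrary.

```python
def pdim_coarse(J, H, P, LB, pB, LE, pE, LT, pT, C=1.0):
    """Coarse bound: C * J*H*P * (LB^2 pB^2 + LE^2 pE^2 + LT^2 pT^2)."""
    return C * J * H * P * (LB**2 * pB**2 + LE**2 * pE**2 + LT**2 * pT**2)

def pdim_sharp(J, H, P, LB, pB, LE, pE, LT, pT, C=1.0):
    """Sharper bound: C * (J LB^2 pB^2 + H LE^2 pE^2 + P LT^2 pT^2 + J*H*P)."""
    return C * (J * LB**2 * pB**2 + H * LE**2 * pE**2 + P * LT**2 * pT**2 + J * H * P)

# Arbitrary illustrative sizes; C = 1 in both bounds is an assumption.
sizes = dict(J=100, H=50, P=200, LB=10, pB=4, LE=10, pE=4, LT=10, pT=4)
coarse = pdim_coarse(**sizes)
sharp = pdim_sharp(**sizes)
print(f"coarse = {coarse:.3e}, sharp = {sharp:.3e}, ratio = {coarse / sharp:.1f}")
```

For sizes like these the coarse bound is larger by several orders of magnitude, yet, as the proof notes, it still suffices for the final balanced rate.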