ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

arXiv cs.AI Papers

Summary

Introduces ITNet, a neural architecture based on a learnable integral transform that unifies convolution, attention, and recurrence, achieving strong results across multiple modalities.

arXiv:2606.19538v1 Announce Type: new

Cached at: 06/20/26, 02:31 PM

# ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence
Source: [https://arxiv.org/html/2606.19538](https://arxiv.org/html/2606.19538)

Ashim Dhor¹, Rasel Mondal¹, Pin Yu Chen²

¹Indian Institute of Science Education and Research Bhopal, ²IBM Research

ashimdhor@gmail.com, raselmondal61@gmail.com, pin-yu.chen@ibm.com

###### Abstract

Convolutional networks, recurrent networks, and transformers each encode different inductive biases (locality, sequential memory, and content-dependent pairwise interaction) and have remained mathematically distinct since their inception. We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform. We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features. This kernel is implemented as a small neural network, specifically an MLP, that models pairwise interactions, enabling the model to adapt its behavior from data. We show that convolution, self-attention (including multi-head), and autoregressive recurrence (including LSTM, GRU, S4, and Mamba) arise as special cases under appropriate parameterizations, and that ITNet is a universal approximator of continuous operators. To make this practical, we develop tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization, enabling efficient and scalable computation. A single ITNet architecture with a shared operator and lightweight modality-specific encoders matches or exceeds specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2. The results demonstrate that a single learned interaction mechanism can recover the behavior of all three architectural families from data.

## 1\. Introduction

The history of deep learning is, in large part, a history of architecture design. Convolutional networks \[[53](https://arxiv.org/html/2606.19538#bib.bib1)\] encode the bias that patterns in images are *local* and *translation-invariant*. Long Short-Term Memory networks (LSTM) \[[44](https://arxiv.org/html/2606.19538#bib.bib2)\] encode a different bias: sequential data carries temporal dependencies that must be selectively remembered or forgotten through learned gating. Transformers \[[93](https://arxiv.org/html/2606.19538#bib.bib3)\] encode yet another: relationships between sequence elements are best captured by *content-dependent pairwise similarity* between learned projections, allowing every position to attend to every other simultaneously.

Each was a profound contribution, and each fundamentally reshaped its domain\. Yet each was designed for a specific class of data, in isolation from the others\. The result is that modern deep learning possesses three dominant architectural families that address the same fundamental problem \- transforming structured signals into semantically meaningful representations \- through entirely different mathematical lenses\. The practical consequence is that practitioners must make an*a priori*architectural choice before seeing any data\. Images suggest CNNs\[[53](https://arxiv.org/html/2606.19538#bib.bib1),[41](https://arxiv.org/html/2606.19538#bib.bib14)\]; text suggests Transformers\[[93](https://arxiv.org/html/2606.19538#bib.bib3)\]; time series suggest RNNs or state\-space models\[[44](https://arxiv.org/html/2606.19538#bib.bib2),[37](https://arxiv.org/html/2606.19538#bib.bib4),[36](https://arxiv.org/html/2606.19538#bib.bib5)\]; irregular point clouds fall outside all three paradigms\[[73](https://arxiv.org/html/2606.19538#bib.bib16),[95](https://arxiv.org/html/2606.19538#bib.bib17)\]; and multimodal data requires stitching together components that were never designed to coexist\[[11](https://arxiv.org/html/2606.19538#bib.bib75),[56](https://arxiv.org/html/2606.19538#bib.bib51)\]\. This fragmentation suggests that our current mathematical understanding of how to transform structured signals remains incomplete\.

We study whether a single operation can unify convolution, self-attention, and recurrence as exact special cases. We show the answer is yes: a learnable integral transform, defined in Eq. [1](https://arxiv.org/html/2606.19538#S2.E1), whose kernel depends jointly on positions and features at both endpoints. The operator aggregates information across all positions through a learned, content- and position-dependent interaction function, while retaining a residual connection for stability. The kernel is implemented as a small neural network that receives absolute positions, relative geometry, and feature content at both the query and key locations, enabling it to model a wide range of interaction patterns. The key novelty is that interaction patterns are not hard-coded (e.g., locality in CNNs or dot-product attention in Transformers), but learned directly from data within a unified formulation. This allows a single operator to adaptively recover local, global, and sequential behaviors depending on the task, rather than requiring separate architectural designs. We call this the Integral Transform Network (ITNet).

By conditioning on content at both endpoints, the kernel learns locality, position-sensitivity, and normalization as *emergent behaviors*. Empirically (§[4](https://arxiv.org/html/2606.19538#S4)), it exhibits convolution-like behavior on images, attention-like behavior on text, and geometry-aware interactions on point clouds. The kernel form $\kappa(x,y,u(x),u(y))$ originates from GNO \[[4](https://arxiv.org/html/2606.19538#bib.bib7)\]. However, prior work has not: (i) shown exact subsumption of CNNs, Transformers, and RNNs, (ii) developed scalable implementations (tiled fusion, Monte Carlo, low-rank), or (iii) demonstrated strong performance across vision, language, and multimodal tasks within a single architecture.

We prove four results \(proof sketches in §[2](https://arxiv.org/html/2606.19538#S2); full proofs in Appendices[C](https://arxiv.org/html/2606.19538#A3)–[F](https://arxiv.org/html/2606.19538#A6)\):

1. Convolution (Theorem [1](https://arxiv.org/html/2606.19538#Thmtheorem1)): $\kappa_\theta = w_\theta(x-y)\,\mathbf{I}_d$ recovers convolution exactly, including multi-channel, depthwise, dilated, strided, and grouped variants.
2. Self-attention (Theorem [2](https://arxiv.org/html/2606.19538#Thmtheorem2)): A softmax-normalized dot-product kernel recovers scaled dot-product self-attention exactly, including multi-head attention.
3. Recurrence (Theorem [3](https://arxiv.org/html/2606.19538#Thmtheorem3)): A causal kernel ($\kappa_\theta = 0$ for $y > x$) recovers Recurrent Neural Networks (RNNs), LSTM, Gated Recurrent Units (GRUs) \[[13](https://arxiv.org/html/2606.19538#bib.bib24)\], S4 \[[37](https://arxiv.org/html/2606.19538#bib.bib4)\], and Mamba \[[36](https://arxiv.org/html/2606.19538#bib.bib5)\].
4. Universal approximation (Theorem [4](https://arxiv.org/html/2606.19538#Thmtheorem4)): ITNet uniformly approximates any continuous operator. Moreover, $\mathrm{Conv} \subsetneq \mathrm{ITNet} \supsetneq \mathrm{Attn}$ and $\mathrm{RNN} \subsetneq \mathrm{ITNet}$.

We develop three scalable strategies to make the operator practical: \(i\) tiled kernel fusion with optimal input/output \(IO\) complexity, \(ii\) importance\-weighted Monte Carlo \(MC\) approximation, and \(iii\) learned low\-rank factorization for linear\-time computation\. A single ITNet architecture, with a shared core operator and lightweight modality\-specific encoders, achieves strong performance across diverse domains, including ImageNet\-1K\[[80](https://arxiv.org/html/2606.19538#bib.bib35)\]\(vision\), GLUE\[[94](https://arxiv.org/html/2606.19538#bib.bib44)\]\(language understanding\), ModelNet40\[[86](https://arxiv.org/html/2606.19538#bib.bib71)\]\(3D geometry\), and VQA v2\[[33](https://arxiv.org/html/2606.19538#bib.bib48)\]and NLVR2\[[85](https://arxiv.org/html/2606.19538#bib.bib49)\]\(multimodal reasoning\)\. Across these tasks, ITNet matches or exceeds specialized architectures while using a unified design, demonstrating that a single learned interaction operator can generalize across modalities without domain\-specific architectural bias\. A detailed discussion of related work is provided in Appendix[B](https://arxiv.org/html/2606.19538#A2)\.

## 2\. Theoretical Foundations of Integral Transform Networks \(ITNet\)

We define the ITNet operator (§[2.1](https://arxiv.org/html/2606.19538#S2.SS1)), prove that convolution, self-attention, and recurrence are exact special cases (§[2.2](https://arxiv.org/html/2606.19538#S2.SS2)), and establish universal operator approximation (§[2.3](https://arxiv.org/html/2606.19538#S2.SS3)). The complete notation used throughout the paper is summarized in Appendix [A](https://arxiv.org/html/2606.19538#A1) (Tables [6](https://arxiv.org/html/2606.19538#A1.T6), [7](https://arxiv.org/html/2606.19538#A1.T7), and [8](https://arxiv.org/html/2606.19538#A1.T8)).

### 2\.1\. The ITNet Operator

Let $\Omega \subseteq \mathbb{R}^{s}$ denote the input domain (e.g., $s=2$ for images, $s=1$ for sequences), equipped with a positive finite measure $\mu$ that defines how inputs are aggregated. Let $u: \Omega \to \mathbb{R}^{d}$ represent the signal, where $d$ is the feature dimension. We work in the space $\mathcal{U} = C(\Omega, \mathbb{R}^{d})$ of continuous $\mathbb{R}^{d}$-valued functions on $\Omega$, with the uniform norm $\|u\|_{\infty} = \sup_{x\in\Omega}\|u(x)\|_{2}$.

###### Definition 1\(ITNet operator\)\.

The *ITNet operator* $\mathcal{K}_{\theta}: \mathcal{U} \to \mathcal{U}$ is defined by

$$\boxed{(\mathcal{K}_{\theta}[u])(x) \;=\; \int_{\Omega} \kappa_{\theta}\bigl(x,\,y,\,u(x),\,u(y)\bigr)\,u(y)\;d\mu(y) \;+\; W_{\theta}\,u(x), \qquad x\in\Omega,} \tag{1}$$

where $\kappa_{\theta}: \mathbb{R}^{s}\times\mathbb{R}^{s}\times\mathbb{R}^{d}\times\mathbb{R}^{d} \to \mathbb{R}^{d\times d}$ is a learnable matrix-valued kernel parameterised by $\theta$, and $W_{\theta}\in\mathbb{R}^{d\times d}$ is a learnable residual matrix. The kernel receives query position $x$, key position $y$, and their features $u(x), u(y)$. The integral aggregates transformed features from all $y\in\Omega$; the residual ensures the operator can represent the identity ($\kappa_{\theta}=0$, $W_{\theta}=\mathbf{I}_{d}$).
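To make Eq. (1) concrete, the following is a minimal pure-Python sketch of the operator on a discrete domain with uniform weights $\mu(\{x_j\}) = 1/n$. It is an illustration under stated assumptions, not the authors' implementation: the names `itnet_operator`, `matvec`, and `zero_kernel` are hypothetical, and the kernel is passed in as an arbitrary callable rather than a trained MLP.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(a: Matrix, v: Sequence[float]) -> Vector:
    """Return the matrix-vector product a @ v."""
    return [sum(a_rc * v_c for a_rc, v_c in zip(row, v)) for row in a]


def itnet_operator(
    xs: Sequence[float],
    us: Sequence[Vector],
    kernel: Callable[[float, float, Vector, Vector], Matrix],
    w_res: Matrix,
) -> List[Vector]:
    """Discrete Eq. (1): (1/n) * sum_j kappa(x_i, x_j, u_i, u_j) u_j + W u_i."""
    n = len(xs)
    out = []
    for x_i, u_i in zip(xs, us):
        acc = matvec(w_res, u_i)  # residual term W_theta u(x)
        for x_j, u_j in zip(xs, us):
            contrib = matvec(kernel(x_i, x_j, u_i, u_j), u_j)
            # Uniform quadrature weight mu({x_j}) = 1/n (assumed discrete measure).
            acc = [a + c / n for a, c in zip(acc, contrib)]
        out.append(acc)
    return out


def zero_kernel(x: float, y: float, ux: Vector, uy: Vector) -> Matrix:
    """Hypothetical kernel that is identically zero (kappa_theta = 0)."""
    d = len(ux)
    return [[0.0] * d for _ in range(d)]


identity = [[1.0, 0.0], [0.0, 1.0]]
xs = [0.0, 0.5, 1.0]
us = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# With kappa = 0 and W = I the operator acts as the identity map.
result = itnet_operator(xs, us, zero_kernel, identity)
```

The double loop makes the quadratic cost in $n$ explicit; each kernel call returns a $d\times d$ matrix that is applied to the key feature before accumulation.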

Kernel parameterization: The kernel $\kappa_{\theta}$ is a 2-layer MLP (GELU \[[43](https://arxiv.org/html/2606.19538#bib.bib82)\], width $w_{\kappa}=128$, $d^{2}$ outputs reshaped to $\mathbb{R}^{d\times d}$). Its input concatenates seven groups capturing position and feature interactions. In raw form, the input is:

$$z_{xy}^{\mathrm{raw}} = [\,x;\; y;\; x-y;\; \|x-y\|_{2};\; u(x);\; u(y);\; u(x)\odot u(y)\,] \;\in\; \mathbb{R}^{3s+1+3d}, \tag{2}$$

where the three positional groups contribute $3s$ dimensions (absolute $x, y\in\mathbb{R}^{s}$ and relative $x-y\in\mathbb{R}^{s}$), the scalar distance adds $1$, and the three feature groups contribute $3d$ (query $u(x)$, key $u(y)$, and Hadamard product $u(x)\odot u(y)$, each in $\mathbb{R}^{d}$). To enable the kernel MLP to represent high-frequency spatial functions, we lift each positional group through the random Fourier feature map \[[87](https://arxiv.org/html/2606.19538#bib.bib32)\] $\gamma: \mathbb{R}^{s}\to\mathbb{R}^{2L_{f}}$ (Eq. [199](https://arxiv.org/html/2606.19538#A8.E199)), where $L_{f}$ is the number of Fourier frequencies, replacing $x\mapsto\gamma(x)$, $y\mapsto\gamma(y)$, $x-y\mapsto\gamma(x-y)$. The MLP input is:

$$z_{xy} = [\,\gamma(x);\; \gamma(y);\; \gamma(x-y);\; \|x-y\|_{2};\; u(x);\; u(y);\; u(x)\odot u(y)\,] \;\in\; \mathbb{R}^{6L_{f}+1+3d}, \tag{3}$$

with $L_{f}=64$, $\sigma=10$, yielding input dimension $6\cdot 64+1+3d = 385+3d$. The positional groups enable translation-invariant patterns (via $x-y$), distance-based decay (via $\|x-y\|_{2}$), and position-specific behavior (via absolute $x, y$). The feature groups, especially the Hadamard product $u(x)\odot u(y)$, provide $d$-dimensional elementwise interaction richer than the rank-1 dot product in standard attention. By the universal approximation theorem \[[46](https://arxiv.org/html/2606.19538#bib.bib100),[20](https://arxiv.org/html/2606.19538#bib.bib28)\], any continuous kernel on the compact domain $\mathcal{D}=\{(x,y,u(x),u(y)) : x,y\in\Omega,\ u\in\mathcal{U}_{c}\}$ can be approximated to arbitrary precision.
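The assembly of the kernel input in Eq. (3) can be sketched as follows. The exact form of $\gamma$ (Eq. 199) is not reproduced in this section, so the cosine/sine map with Gaussian frequencies of scale $\sigma$ used below is an assumption, as are the helper names `make_fourier_map` and `kernel_input`. The sketch only aims to show how the seven groups add up to $6L_f + 1 + 3d$ entries.

```python
import math
import random
from typing import Callable, List, Sequence


def make_fourier_map(s: int, num_freqs: int, sigma: float, seed: int = 0) -> Callable:
    """Return gamma: R^s -> R^(2 * num_freqs) with Gaussian frequencies (assumed form)."""
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, sigma) for _ in range(s)] for _ in range(num_freqs)]

    def gamma(p: Sequence[float]) -> List[float]:
        proj = [sum(b * p_k for b, p_k in zip(row, p)) for row in freqs]
        return [math.cos(v) for v in proj] + [math.sin(v) for v in proj]

    return gamma


def kernel_input(x, y, ux, uy, gamma) -> List[float]:
    """Assemble z_xy of Eq. (3) from positions x, y and features u(x), u(y)."""
    diff = [a - b for a, b in zip(x, y)]            # relative position x - y
    dist = math.sqrt(sum(c * c for c in diff))      # scalar distance ||x - y||_2
    hadamard = [a * b for a, b in zip(ux, uy)]      # u(x) elementwise-times u(y)
    return gamma(x) + gamma(y) + gamma(diff) + [dist] + list(ux) + list(uy) + hadamard


L_f, d, s = 64, 8, 2
gamma = make_fourier_map(s, L_f, sigma=10.0)
z = kernel_input([0.1, 0.2], [0.4, 0.6], [1.0] * d, [0.5] * d, gamma)
```

For $L_f = 64$ and $d = 8$ the vector `z` has $385 + 3\cdot 8 = 409$ entries, matching the dimension count stated above.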

We normalize $\mu(\Omega)=1$ in all experiments: $\mu(\{x_{j}\})=1/n$ for discrete domains and $\mu=\mathrm{vol}(\Omega)^{-1}\cdot\lambda$ for continuous ones (with $\lambda$ the Lebesgue measure), ensuring the integral and residual terms operate at comparable scales.

Multi-head form: Following \[[93](https://arxiv.org/html/2606.19538#bib.bib3)\], the multi-head ITNet operator splits the feature dimension into $H$ heads of dimension $d_{h}=d/H$, applies independent kernels $\kappa_{\theta}^{(h)}$ per head, and recombines via an output projection $W^{O}\in\mathbb{R}^{d\times d}$:

$$(\mathcal{K}_{\theta}^{\mathrm{MH}}[u])(x) = W^{O}\begin{bmatrix}\int_{\Omega}\kappa_{\theta}^{(1)}(x,y,u^{(1)}(x),u^{(1)}(y))\,u^{(1)}(y)\,d\mu(y)\\ \vdots\\ \int_{\Omega}\kappa_{\theta}^{(H)}(x,y,u^{(H)}(x),u^{(H)}(y))\,u^{(H)}(y)\,d\mu(y)\end{bmatrix} + W_{\theta}\,u(x), \tag{4}$$

where $u^{(h)}(x)=u(x)[(h-1)d_{h}+1 : hd_{h}]$ denotes the $h$-th head's feature slice. Each head's kernel operates on $d_{h}$-dimensional features, so the per-pair cost is $O(d_{h}^{2})=O(d^{2}/H^{2})$; summing over $H$ heads gives $O(d^{2}/H)$ per pair, a factor-$H$ reduction relative to a single-head $d\times d$ kernel. All theorems below (Theorems [1](https://arxiv.org/html/2606.19538#Thmtheorem1)-[4](https://arxiv.org/html/2606.19538#Thmtheorem4)) apply per-head; the output projection $W^{O}$ linearly combines independent head contributions and does not affect the special-case proofs.
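The factor-$H$ saving can be checked with simple arithmetic. The sketch below counts only the multiply-adds needed to apply the per-head kernel matrices to one $(x, y)$ pair (it ignores the cost of evaluating the kernel MLP itself), and the function names `per_pair_kernel_cost` and `head_slice` are illustrative rather than taken from the paper.

```python
from typing import List, Sequence


def per_pair_kernel_cost(d: int, num_heads: int) -> int:
    """Multiply-adds for applying all head kernels to one (x, y) pair."""
    if d % num_heads != 0:
        raise ValueError("d must be divisible by num_heads")
    d_h = d // num_heads
    # H heads, each applying a d_h x d_h matrix to a d_h-vector.
    return num_heads * d_h * d_h


def head_slice(u: Sequence[float], h: int, d_h: int) -> List[float]:
    """Return the h-th head slice u[(h-1) d_h + 1 : h d_h] (h is 1-indexed)."""
    return list(u[(h - 1) * d_h : h * d_h])


single = per_pair_kernel_cost(64, 1)  # one 64 x 64 kernel: 64 * 64 operations
multi = per_pair_kernel_cost(64, 8)   # eight 8 x 8 kernels: 8 * 8 * 8 operations
second_head = head_slice(list(range(8)), h=2, d_h=4)
```

With $d = 64$ and $H = 8$ the count drops from $d^2$ to $d^2/H$, in line with the $O(d^2/H)$ statement above.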

A deep ITNet model stacks $L$ operator layers ($\ell=1,2,\dots,L$) with pre-normalization and position-wise feed-forward networks, following the standard pre-norm Transformer layout \[[99](https://arxiv.org/html/2606.19538#bib.bib62)\]:

$$z^{(\ell)}=\mathcal{K}_{\theta}^{(\ell)}[\mathrm{LN}(u^{(\ell-1)})]+u^{(\ell-1)},\qquad u^{(\ell)}=\mathcal{F}_{\theta}^{(\ell)}(\mathrm{LN}(z^{(\ell)}))+z^{(\ell)}, \tag{5}$$

where $\mathrm{LN}$ denotes layer normalization \[[5](https://arxiv.org/html/2606.19538#bib.bib91)\] and $\mathcal{F}_{\theta}^{(\ell)}$ is a two-layer FFN with GELU \[[43](https://arxiv.org/html/2606.19538#bib.bib82)\] activation and expansion factor 4. The kernel MLP output layer is initialized as $W_{2}=\epsilon\mathbf{I}_{d}$ ($\epsilon=10^{-3}$) so each layer begins as an approximate identity. A schematic of the ITNet architecture is shown in Figure [1](https://arxiv.org/html/2606.19538#A8.F1). Full architectural details, initialization schemes, and hyperparameters are in Appendix [H](https://arxiv.org/html/2606.19538#A8).

### 2\.2\. Unification Theorems

Each theorem identifies a specific kernel parameterization under which the ITNet operator reduces*exactly*to a classical architecture\. Full proofs are in Appendices[C](https://arxiv.org/html/2606.19538#A3)–[E](https://arxiv.org/html/2606.19538#A5)\.

###### Theorem 1\(Convolution\)\.

If the kernel $\kappa_{\theta}$ depends only on relative position, $\kappa_{\theta}(x,y,u(x),u(y))=w_{\theta}(x-y)\,\mathbf{I}_{d}$, then the ITNet operator reduces to convolution:

$$(\mathcal{K}_{\theta}[u])(x)=\int_{\Omega}w_{\theta}(x-y)\,u(y)\,d\mu(y)+W_{\theta}u(x)=(w_{\theta}\ast u)(x)+W_{\theta}u(x). \tag{6}$$

###### Proof sketch\.

Substituting the kernel form into Eq. ([1](https://arxiv.org/html/2606.19538#S2.E1)) gives $\int_{\Omega}w_{\theta}(x-y)\mathbf{I}_{d}u(y)\,d\mu(y)+W_{\theta}u(x)=\int_{\Omega}w_{\theta}(x-y)u(y)\,d\mu(y)+W_{\theta}u(x)$, which is the definition of continuous convolution. The residual term is a pointwise linear map ($1\times 1$ convolution). Full proof in Appendix [C](https://arxiv.org/html/2606.19538#A3). ∎
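A small numerical check of the substitution in Theorem 1 is sketched below for scalar features ($d=1$) on a 1D grid. The 3-tap filter, the zero padding at the boundary, and the choice to scale the kernel by $n$ so that it offsets the $1/n$ quadrature weight are all assumptions made for the illustration; the function names are hypothetical.

```python
from typing import Dict, List


def sliding_window_conv(u: List[float], taps: Dict[int, float]) -> List[float]:
    """Classical zero-padded convolution: out[i] = sum_k taps[k] * u[i - k]."""
    n = len(u)
    out = []
    for i in range(n):
        total = 0.0
        for k, tap in taps.items():
            if 0 <= i - k < n:  # samples outside the grid count as zero
                total += tap * u[i - k]
        out.append(total)
    return out


def itnet_with_relative_kernel(
    u: List[float], taps: Dict[int, float], w_res: float = 0.0
) -> List[float]:
    """Eq. (1) with kappa(x_i, x_j, ., .) = w(i - j) and uniform weights 1/n."""
    n = len(u)
    out = []
    for i in range(n):
        integral = 0.0
        for j in range(n):
            # w(i - j) = n * taps[i - j]; the factor n offsets the 1/n weight.
            w_ij = n * taps.get(i - j, 0.0)
            integral += w_ij * u[j] * (1.0 / n)
        out.append(integral + w_res * u[i])
    return out


taps = {-1: 0.25, 0: 0.5, 1: 0.25}
signal = [1.0, 2.0, 3.0, 4.0]
conv_out = sliding_window_conv(signal, taps)
itnet_out = itnet_with_relative_kernel(signal, taps)
```

Both routines compute the same sum after the change of index $k = i - j$, which is the content of the proof sketch in the discrete setting.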

###### Theorem 2\(Self\-attention\)\.

Let $W_{Q},W_{K}\in\mathbb{R}^{d_{k}\times d}$ and $W_{V}\in\mathbb{R}^{d\times d}$. Define the kernel:

$$\kappa_{\theta}(x,y,u(x),u(y))=\frac{\exp\bigl((W_{Q}u(x))^{\top}(W_{K}u(y))/\sqrt{d_{k}}\bigr)}{\int_{\Omega}\exp\bigl((W_{Q}u(x))^{\top}(W_{K}u(z))/\sqrt{d_{k}}\bigr)\,d\mu(z)}\cdot W_{V}. \tag{7}$$

With $W_{\theta}=0$, substituting Eq. ([7](https://arxiv.org/html/2606.19538#S2.E7)) into Eq. ([1](https://arxiv.org/html/2606.19538#S2.E1)) yields:

$$(\mathcal{K}_{\theta}[u])(x)=\int_{\Omega}\alpha(x,y)\,W_{V}u(y)\,d\mu(y),\quad\alpha(x,y)=\frac{\exp(Q(x)^{\top}K(y)/\sqrt{d_{k}})}{Z(x)}, \tag{8}$$

where $Q(x)=W_{Q}u(x)$, $K(y)=W_{K}u(y)$, and $Z(x)=\int_{\Omega}\exp(Q(x)^{\top}K(z)/\sqrt{d_{k}})\,d\mu(z)$.

Discrete case: For $\Omega=\{x_{1},\dots,x_{n}\}$, Eq. ([8](https://arxiv.org/html/2606.19538#S2.E8)) reduces to $\mathrm{softmax}(QK^{\top}/\sqrt{d_{k}})V$, with any uniform scaling absorbed into $W_{V}$.

Positional encoding (PE): In ITNet, positional information is provided directly to the kernel via $(x,y)$ and their relative geometry, enabling position-aware interactions without a separate positional-encoding scheme.

Proof: Substituting Eq. ([7](https://arxiv.org/html/2606.19538#S2.E7)) into Eq. ([1](https://arxiv.org/html/2606.19538#S2.E1)) yields Eq. ([8](https://arxiv.org/html/2606.19538#S2.E8)). The normalization term $Z(x)$ is strictly positive on compact $\Omega$. The discrete case follows directly. Full derivation is in Appendix [D](https://arxiv.org/html/2606.19538#A4). ∎
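The discrete case of Theorem 2 can be checked numerically. The sketch below, in plain Python with hypothetical helper names and arbitrary small weight matrices, evaluates Eq. (1) with the kernel of Eq. (7) under uniform weights $\mu(\{x_j\}) = 1/n$ and compares it against a reference softmax attention. Under this uniform measure the $1/n$ factors in the numerator and in $Z(x)$ cancel, so the two outputs should agree up to rounding.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(a: Matrix, v: Sequence[float]) -> Vector:
    return [sum(a_rc * v_c for a_rc, v_c in zip(row, v)) for row in a]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax_attention(us, w_q, w_k, w_v) -> List[Vector]:
    """Reference: softmax(Q K^T / sqrt(d_k)) V, one output row per token."""
    scale = math.sqrt(len(w_q))  # d_k is the number of rows of W_Q
    qs = [matvec(w_q, u) for u in us]
    ks = [matvec(w_k, u) for u in us]
    vs = [matvec(w_v, u) for u in us]
    out = []
    for q in qs:
        logits = [dot(q, k) / scale for k in ks]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(logit - peak) for logit in logits]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, vs))
                    for c in range(len(vs[0]))])
    return out


def itnet_attention(us, w_q, w_k, w_v) -> List[Vector]:
    """Eq. (1) with the kernel of Eq. (7), W_theta = 0 and mu({x_j}) = 1/n."""
    mu = 1.0 / len(us)
    scale = math.sqrt(len(w_q))
    out = []
    for u_x in us:
        q = matvec(w_q, u_x)
        scores = [math.exp(dot(q, matvec(w_k, u_z)) / scale) for u_z in us]
        z_x = sum(score * mu for score in scores)  # Z(x), the normalizing integral
        acc = [0.0] * len(w_v)
        for score, u_y in zip(scores, us):
            alpha = score / z_x
            value = matvec(w_v, u_y)
            acc = [a + alpha * v_c * mu for a, v_c in zip(acc, value)]
        out.append(acc)
    return out


us = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0], [-1.0, 1.0, 1.5], [0.0, -0.5, 0.25]]
w_q = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
w_k = [[-0.3, 0.1, 0.2], [0.5, 0.0, 0.1]]
w_v = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.25, 0.25, 1.0]]
reference = softmax_attention(us, w_q, w_k, w_v)
via_kernel = itnet_attention(us, w_q, w_k, w_v)
```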

###### Theorem 3\(Recurrence as a special case\)\.

Let $\Omega=[0,T]$ with Lebesgue measure $\mu=\lambda$ and let the kernel satisfy the causal constraint $\kappa_{\theta}(x,y,\cdot,\cdot)=0$ for $y>x$. Exact recovery is obtained for linear and affine-in-input systems; nonlinear recurrent architectures are recovered through explicit unrolling constructions or approximated arbitrarily well via Theorem [4](https://arxiv.org/html/2606.19538#Thmtheorem4).

(a) Linear continuous-time system. For the linear system $\dot{h}(t)=Ah(t)+B_{\theta}u(t)$ with output $y(t)=C_{\theta}h(t)+D_{\theta}u(t)$, define the causal kernel

$$\kappa_{\theta}(t,s,u(t),u(s))\;=\;\mathbf{1}_{s\leq t}\,C_{\theta}\,e^{A(t-s)}\,B_{\theta},\qquad W_{\theta}=D_{\theta}. \tag{9}$$

Then $(\mathcal{K}_{\theta}[u])(t)=\int_{0}^{t}C_{\theta}e^{A(t-s)}B_{\theta}\,u(s)\,ds+D_{\theta}u(t)=\mathrm{RNN}(u)(t)$.

(a′) Nonlinear system (general $F_{\theta}$). For $\dot{h}=F_{\theta}(h,u)$ with Lipschitz $F_{\theta}$, the kernel generalises to $\kappa_{F}(t,s;u)=\mathbf{1}_{s\leq t}\cdot C_{\theta}\,\Phi_{F}(t,s;u)\cdot\tilde{B}_{\theta}(s;u)$, where $\Phi_{F}(t,s;u)\in\mathbb{R}^{n\times n}$ is the nonlinear sensitivity matrix (Alekseev's formula \[[3](https://arxiv.org/html/2606.19538#bib.bib94)\]) and $\tilde{B}_{\theta}(s;u)=\partial G_{\theta}/\partial u\,|_{(h(s),u(s))}$ is the input-to-state Jacobian. Exact recovery is obtained for linear and affine-in-input systems. Structured recurrent architectures LSTM, GRU, and Mamba admit causal kernel constructions, while general nonlinear recurrent operators are approximated arbitrarily well via Theorem [4](https://arxiv.org/html/2606.19538#Thmtheorem4) (detailed in Appendix [E](https://arxiv.org/html/2606.19538#A5), §[E.3](https://arxiv.org/html/2606.19538#A5.SS3)).

(b) Discrete RNN. Discretising to $T$ time steps with $\mu(\{t_{k}\})=1$, the causal kernel $\kappa(t,s)=\mathbf{1}_{s\leq t}\cdot C_{\theta}W_{h}^{t-s}W_{u}$ yields $(\mathcal{K}_{\theta}[u])(t)=\sum_{s=1}^{t}C_{\theta}W_{h}^{t-s}W_{u}u_{s}+D_{\theta}u_{t}=C_{\theta}h_{t}+D_{\theta}u_{t}$, recovering $h_{t}=W_{h}h_{t-1}+W_{u}u_{t}$.

(c) LSTM. The LSTM cell state unrolls as $c_{t}=\sum_{s=1}^{t}\bigl[\prod_{\tau=s+1}^{t}f_{\tau}\bigr]\odot i_{s}\odot\tilde{c}_{s}$, where $f_{\tau}, i_{s}$ are forget and input gates. The ITNet kernel $\kappa_{\mathrm{LSTM}}(t,s)=\mathbf{1}_{s\leq t}\cdot W_{y}\,\mathrm{diag}(o_{t})\,\mathrm{diag}\bigl(\tanh'(c_{t})\bigr)\,\mathrm{diag}\bigl(\prod_{\tau=s+1}^{t}f_{\tau}\bigr)\,\mathrm{diag}(i_{s})\,W_{c}$ is content-dependent through the gates and recovers the LSTM output.

(d) S4, Mamba, GRU. Linear SSMs use the kernel $\mathbf{1}_{s\leq t}\cdot Ce^{A(t-s)}B$ (causal convolution). Mamba uses $\kappa_{\mathrm{Mamba}}(t,s)=\mathbf{1}_{s\leq t}\cdot C(u_{t})\prod_{\tau=s+1}^{t}\bar{A}(u_{\tau})\cdot\bar{B}(u_{s})$, content-dependent through the input-dependent $\bar{A},\bar{B},C$. GRU follows with retention factors $(1-z_{\tau})$. Explicit constructions are in Appendix [E](https://arxiv.org/html/2606.19538#A5).

(e) Strictness. A bidirectional operator $S(u)(t)=\int_{0}^{T}e^{-|t-s|/\ell}u(s)\,ds$ is representable by ITNet but not by any causal recurrent system (Appendix [E](https://arxiv.org/html/2606.19538#A5), §[E.9](https://arxiv.org/html/2606.19538#A5.SS9)).
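The unrolling identities behind parts (b) and (c) can be verified numerically. The sketch below uses scalar states for brevity and treats the LSTM gates as fixed, precomputed numbers (in an actual LSTM they depend on the input and the previous hidden state); both simplifications, and all function names, are assumptions made for the illustration.

```python
from typing import List


def rnn_recurrent(us: List[float], w_h: float, w_u: float, c: float, d_skip: float):
    """Step-by-step recurrence h_t = w_h h_{t-1} + w_u u_t, y_t = c h_t + d u_t."""
    h, ys = 0.0, []
    for u_t in us:
        h = w_h * h + w_u * u_t
        ys.append(c * h + d_skip * u_t)
    return ys


def rnn_causal_kernel(us: List[float], w_h: float, w_u: float, c: float, d_skip: float):
    """Same output via the causal kernel kappa(t, s) = 1[s <= t] c w_h^(t-s) w_u."""
    ys = []
    for t in range(len(us)):
        total = sum(c * w_h ** (t - s) * w_u * us[s] for s in range(t + 1))
        ys.append(total + d_skip * us[t])
    return ys


def lstm_cell_recurrent(f: List[float], i: List[float], c_tilde: List[float]):
    """Cell-state recurrence c_t = f_t c_{t-1} + i_t c~_t with c_0 = 0."""
    c, cs = 0.0, []
    for f_t, i_t, g_t in zip(f, i, c_tilde):
        c = f_t * c + i_t * g_t
        cs.append(c)
    return cs


def lstm_cell_unrolled(f: List[float], i: List[float], c_tilde: List[float]):
    """Unrolled form c_t = sum_s (prod_{tau=s+1..t} f_tau) i_s c~_s."""
    cs = []
    for t in range(len(f)):
        total = 0.0
        for s in range(t + 1):
            decay = 1.0
            for tau in range(s + 1, t + 1):
                decay *= f[tau]
            total += decay * i[s] * c_tilde[s]
        cs.append(total)
    return cs


inputs = [1.0, -0.5, 2.0, 0.25]
y_rec = rnn_recurrent(inputs, w_h=0.9, w_u=0.5, c=2.0, d_skip=0.1)
y_ker = rnn_causal_kernel(inputs, w_h=0.9, w_u=0.5, c=2.0, d_skip=0.1)
gates_f, gates_i, cand = [0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.3, 0.2], [1.0, -1.0, 0.5, 2.0]
c_rec = lstm_cell_recurrent(gates_f, gates_i, cand)
c_unr = lstm_cell_unrolled(gates_f, gates_i, cand)
```

In both cases the recurrent and kernel forms agree up to floating-point rounding, which is what "direct unrolling" means in the proof sketch.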

Proof sketch. (a): Substituting ([9](https://arxiv.org/html/2606.19538#S2.E9)) into ([1](https://arxiv.org/html/2606.19538#S2.E1)) and applying $\mathbf{1}_{s\leq t}$ restricts integration to $[0,t]$:

$$(\mathcal{K}_{\theta}[u])(t)=\int_{0}^{t}C_{\theta}e^{A(t-s)}B_{\theta}\,u(s)\,ds+D_{\theta}u(t)=C_{\theta}h(t)+D_{\theta}u(t), \tag{10}$$

where the second equality uses the variation of constants formula $h(t)=\int_{0}^{t}e^{A(t-s)}B_{\theta}u(s)\,ds$ (with $h_{0}=0$). (a′): Alekseev's nonlinear variation of constants formula replaces $e^{A(t-s)}$ with the trajectory-dependent sensitivity matrix $\Phi_{F}(t,s;u)$; it is exact when $F_{\theta}$ is affine in $u$, otherwise $\varepsilon$-approximate via Theorem [4](https://arxiv.org/html/2606.19538#Thmtheorem4). (b)–(d): Direct unrolling of each discrete recurrence yields causal kernels. (e): The witness $S$ uses future information ($s>t$), which no causal system can access. See Appendix [E](https://arxiv.org/html/2606.19538#A5) for complete derivations. ∎
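Part (a) can also be checked numerically in the scalar case. The sketch below approximates the causal-kernel integral with a midpoint rule and compares it with the closed-form solution of $\dot{h} = ah + bu$, $h(0)=0$, for the constant input $u \equiv 1$. The scalar parameters, the quadrature rule, and the function names are illustrative assumptions.

```python
import math
from typing import Callable


def causal_kernel_output(
    t: float, a: float, b: float, c: float, d_skip: float,
    u: Callable[[float], float], t_max: float, steps: int = 20000,
) -> float:
    """Midpoint quadrature of Eq. (1) on [0, t_max] with the kernel of Eq. (9)."""
    ds = t_max / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * ds
        if s <= t:  # indicator 1[s <= t] enforces causality
            total += c * math.exp(a * (t - s)) * b * u(s) * ds
    return total + d_skip * u(t)  # residual term with W_theta = D_theta


def closed_form_output(t: float, a: float, b: float, c: float, d_skip: float) -> float:
    """y(t) = c h(t) + d for h' = a h + b u, h(0) = 0, constant input u = 1."""
    h = b * (math.exp(a * t) - 1.0) / a
    return c * h + d_skip


params = dict(a=-0.5, b=2.0, c=1.5, d_skip=0.3)
y_kernel = causal_kernel_output(1.0, u=lambda s: 1.0, t_max=2.0, **params)
y_exact = closed_form_output(1.0, **params)
```

The two values agree to within the quadrature error, illustrating the variation of constants step used in Eq. (10).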

### 2\.3\. Universal Operator Approximation

###### Assumption 1\.

(i) $\Omega\subset\mathbb{R}^{s}$ is compact with $\mu(\Omega)>0$; (ii) $F:\mathcal{U}_{c}\to C(\Omega,\mathbb{R}^{d})$ is a continuous operator on compact $\mathcal{U}_{c}\subset C(\Omega,\mathbb{R}^{d})$; (iii) the kernel MLP uses a non-polynomial activation.

###### Theorem 4\(Universal operator approximation\)\.

Under Assumption [1](https://arxiv.org/html/2606.19538#Thmassumption2), for any $\varepsilon>0$ there exist parameters $\theta$ and depth $L$ such that

$$\sup_{u\in\mathcal{U}_{c}}\bigl\|F(u)-\mathcal{K}_{\theta}^{(L)}[u]\bigr\|_{\infty}<\varepsilon. \tag{11}$$

###### Proof\.

The proof proceeds in three steps\.

*Step 1 (Discretization):* Standard quadrature on compact $\Omega$ approximates the integral using $M$ quadrature points (samples), with error at most $\delta$ \[[23](https://arxiv.org/html/2606.19538#bib.bib27)\]. *Step 2 (Kernel universality):* By the MLP universal approximation theorem \[[46](https://arxiv.org/html/2606.19538#bib.bib100),[20](https://arxiv.org/html/2606.19538#bib.bib28)\], $\kappa_{\theta}$ can approximate any continuous kernel on the domain $\mathcal{D}=\{(x,y,u(x),u(y)) : x,y\in\Omega,\ u\in\mathcal{U}_{c}\}$ to within $\delta_{1}$. *Step 3 (Operator approximation):* The discretised ITNet matches the structure of \[[9](https://arxiv.org/html/2606.19538#bib.bib6)\], which guarantees that any continuous operator on a compact set can be uniformly approximated. Choosing $M$, the MLP width, and $L$ sufficiently large ensures total error $<\varepsilon$. Full proof in Appendix [F](https://arxiv.org/html/2606.19538#A6). ∎

###### Corollary 2\.1\(Strict Expressiveness\)\.

The following strict containments hold:

$$\mathrm{CNN}\subsetneq\mathrm{ITNet},\quad\mathrm{Attn}\subsetneq\mathrm{ITNet},\quad\mathrm{RNN}\subsetneq\mathrm{ITNet},\quad\mathrm{CNN}\cup\mathrm{Attn}\cup\mathrm{RNN}\subsetneq\mathrm{ITNet}.$$

Proof. The inclusions follow from Theorems [1](https://arxiv.org/html/2606.19538#Thmtheorem1)–[3](https://arxiv.org/html/2606.19538#Thmtheorem3). Strictness is established via explicit counterexample operators (Appendices [C](https://arxiv.org/html/2606.19538#A3)–[F](https://arxiv.org/html/2606.19538#A6)) that lie outside each classical family but are representable by ITNet. ∎

###### Theorem 5\(Kernel Recovery Under Data Symmetry\)\.

Let the data distribution $\mathcal{D}$ be translation-invariant: $\tau_{\delta}: u(x)\mapsto u(x-\delta)$ satisfies $\mathcal{D}\circ\tau_{\delta}=\mathcal{D}$ for all $\delta$. Decompose $\kappa_{\theta}=\kappa_{\theta}^{\mathrm{TI}}+\kappa_{\theta}^{\perp}$, where $\kappa_{\theta}^{\mathrm{TI}}(x,y,\cdot,\cdot)=\bar{\kappa}_{\theta}(x-y,\cdot,\cdot)$ is the translation-invariant component. Then under gradient flow,

$$\left\|\frac{\partial\mathcal{L}}{\partial\kappa_{\theta}^{\perp}}\right\|_{F}=0\quad\text{at every iterate}, \tag{12}$$

so gradient flow converges to a translation-invariant kernel, recovering the convolutional special case (Theorem [1](https://arxiv.org/html/2606.19538#Thmtheorem1)).

Proof sketch. Translation invariance of $\mathcal{D}$ implies $\mathcal{L}(\theta)=\mathbb{E}[\ell(\mathcal{K}_{\theta}[\tau_{\delta}u],y)]$. Averaging the gradient over all translations projects onto the translation-invariant subspace, annihilating the $\kappa_{\theta}^{\perp}$ component. Full proof in Appendix [G](https://arxiv.org/html/2606.19538#A7). ∎

Complexity and efficient approximation: Computing Eq. ([1](https://arxiv.org/html/2606.19538#S2.E1)) exactly costs $O(n^{2}d^{2})$ time. Two approximations reduce this (§[3](https://arxiv.org/html/2606.19538#S3)): Monte Carlo sampling ($M\ll n$ keys per query, $O(nMd^{2})$, unbiased gradients) and low-rank kernel factorization ($\kappa_{\theta}\approx\Phi_{\theta}^{\top}\Psi_{\theta}$ with rank $r\ll d$, $O(ndr)$, linear in $n$). In practice we use tiled kernel fusion for $n\leq 512$ and MC or low-rank beyond. The full complexity comparison with CNNs, Transformers, and Mamba is in Table [9](https://arxiv.org/html/2606.19538#A8.T9) (Appendix [H](https://arxiv.org/html/2606.19538#A8)).
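The linear-time claim for the low-rank route can be illustrated with a reassociation argument. The sketch below assumes, as a reading of $\kappa_\theta \approx \Phi_\theta^\top \Psi_\theta$ that this section does not spell out, that $\Phi$ depends only on the query side and $\Psi$ only on the key side, each an $r \times d$ matrix per position. Under that assumption the key-side sum can be computed once and reused for every query. All names and the random test data are hypothetical.

```python
import random
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def apply(mat: Matrix, vec: Sequence[float]) -> Vector:
    """mat @ vec for an r x d matrix and a d-vector (returns an r-vector)."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def apply_transpose(mat: Matrix, vec: Sequence[float]) -> Vector:
    """mat^T @ vec for an r x d matrix and an r-vector (returns a d-vector)."""
    return [sum(mat[k][c] * vec[k] for k in range(len(mat))) for c in range(len(mat[0]))]


def low_rank_quadratic(phis, psis, us) -> List[Vector]:
    """Naive evaluation: (1/n) sum_j Phi_i^T (Psi_j u_j) for every query i."""
    n = len(us)
    out = []
    for phi in phis:
        acc = [0.0] * len(phi[0])
        for psi, u in zip(psis, us):
            contrib = apply_transpose(phi, apply(psi, u))
            acc = [a + c / n for a, c in zip(acc, contrib)]
        out.append(acc)
    return out


def low_rank_linear(phis, psis, us) -> List[Vector]:
    """Reassociated evaluation: Phi_i^T ((1/n) sum_j Psi_j u_j)."""
    n = len(us)
    summary = [0.0] * len(psis[0])  # r-vector shared by all queries
    for psi, u in zip(psis, us):
        summary = [s + v / n for s, v in zip(summary, apply(psi, u))]
    return [apply_transpose(phi, summary) for phi in phis]


rng = random.Random(0)
n, d, r = 6, 4, 2
phis = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(r)] for _ in range(n)]
psis = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(r)] for _ in range(n)]
us = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
slow = low_rank_quadratic(phis, psis, us)
fast = low_rank_linear(phis, psis, us)
```

The shared `summary` vector costs $O(ndr)$ to build and each query then costs $O(dr)$, which is consistent with the $O(ndr)$ total quoted above.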

## 3\. Efficient and Scalable Implementation

The ITNet operator is computationally intensive due to its MLP-based kernel and non-bilinear structure. We address this with the following strategies: tiled kernel fusion (§[3.1](https://arxiv.org/html/2606.19538#S3.SS1)), Monte Carlo sampling (§[3.2](https://arxiv.org/html/2606.19538#S3.SS2)), low-rank factorization (§[3.3](https://arxiv.org/html/2606.19538#S3.SS3)), and modality-specific encoders (§[3.4](https://arxiv.org/html/2606.19538#S3.SS4)).

### 3\.1\. Tiled Kernel Fusion

Following \[[21](https://arxiv.org/html/2606.19538#bib.bib11)\], we tile the computation into blocks that fit in on-chip Static Random-Access Memory (SRAM), fusing kernel MLP evaluation, matrix-vector product, and integral accumulation into a single Triton kernel \[[88](https://arxiv.org/html/2606.19538#bib.bib30)\]. The $n\times n$ kernel matrix is never materialized; only a $B_{q}\times B_{k}$ tile resides in SRAM, where $B_{q}$ and $B_{k}$ denote the query and key tile sizes, respectively. Algorithm [1](https://arxiv.org/html/2606.19538#alg1) describes the tiled forward pass. Phase 1 auto-tunes block sizes $(B_{q}^{*},B_{k}^{*})$ satisfying $(B_{q}+B_{k})\cdot d\leq S_{\mathrm{SRAM}}$, where $S_{\mathrm{SRAM}}$ is the available on-chip memory capacity. Phase 2 executes tiled integration: for each query tile, $\kappa_{\theta}$ is evaluated and $K_{pq}\cdot U_{j}[q]\cdot\omega_{j}$ is accumulated in SRAM, where $K_{pq}=\kappa_{\theta}(X_{i}[p],X_{j}[q],U_{i}[p],U_{j}[q])$ is the kernel evaluation at tile indices $(p,q)$ and $\omega_{j}=\mu(\{x_{j}\})$ is the quadrature weight (typically $1/n$), reducing High Bandwidth Memory (HBM) reads. During backpropagation, we recompute $\kappa_{\theta}$ tile-by-tile rather than storing the kernel matrix \[[10](https://arxiv.org/html/2606.19538#bib.bib31)\]; this increases FLOPs by $\approx 3\times$ but reduces peak memory from $O(n^{2}d^{2})$ to $O(nd)$. The complete forward algorithm and backward procedure are in Appendix [H.9](https://arxiv.org/html/2606.19538#A8.SS9) and [H.10](https://arxiv.org/html/2606.19538#A8.SS10), respectively.
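The tiling idea can be sketched on the CPU with NumPy. This is not the paper's Triton kernel or its Algorithm 1: the toy kernel MLP (`kernel_tile`, with relative position and a Hadamard feature product as inputs), its weights `W1` and `W2`, the skip weight `Wskip`, and the fixed tile sizes are all assumptions made for illustration. The point is only that accumulating tile by tile reproduces the dense result while holding a single $B_{q}\times B_{k}$ tile at a time.

```python
import numpy as np

def kernel_tile(Xq, Xk, Uq, Uk, W1, W2):
    """Toy MLP kernel (assumed form): returns a (Bq, Bk, d, d) tile."""
    Bq, Bk, d = Xq.shape[0], Xk.shape[0], Uq.shape[1]
    rel = (Xq[:, None] - Xk[None, :])[..., None]   # (Bq, Bk, 1) relative positions
    had = Uq[:, None, :] * Uk[None, :, :]          # (Bq, Bk, d) Hadamard features
    z = np.concatenate([rel, had], axis=-1)        # (Bq, Bk, 1 + d)
    h = np.tanh(z @ W1)                            # (Bq, Bk, hidden)
    return (h @ W2).reshape(Bq, Bk, d, d)

def dense_forward(X, U, W1, W2, Wskip):
    """Reference: materializes the full (n, n, d, d) kernel."""
    n = X.shape[0]
    K = kernel_tile(X, X, U, U, W1, W2)
    return np.einsum("ijab,jb->ia", K, U) / n + U @ Wskip.T

def tiled_forward(X, U, W1, W2, Wskip, Bq=4, Bk=4):
    """Tiled: only one (Bq, Bk, d, d) tile exists at any time."""
    n, d = U.shape
    out = U @ Wskip.T                               # local term W u(x_i)
    for q0 in range(0, n, Bq):
        q1 = min(q0 + Bq, n)
        acc = np.zeros((q1 - q0, d))                # per-query-tile accumulator
        for k0 in range(0, n, Bk):
            k1 = min(k0 + Bk, n)
            Kt = kernel_tile(X[q0:q1], X[k0:k1], U[q0:q1], U[k0:k1], W1, W2)
            # accumulate K_pq * U_j[q] * omega_j with omega_j = 1/n
            acc += np.einsum("pqab,qb->pa", Kt, U[k0:k1]) / n
        out[q0:q1] += acc
    return out

rng = np.random.default_rng(0)
n, d, hidden = 10, 3, 8                             # n not divisible by the tile sizes
X = np.linspace(0.0, 1.0, n)
U = rng.normal(size=(n, d))
W1 = rng.normal(size=(1 + d, hidden))
W2 = rng.normal(size=(hidden, d * d))
Wskip = rng.normal(size=(d, d))

dense = dense_forward(X, U, W1, W2, Wskip)
tiled = tiled_forward(X, U, W1, W2, Wskip, Bq=4, Bk=3)
```

On a GPU the accumulator and the current tile would live in SRAM and block sizes would be auto-tuned; here they are plain arrays and fixed integers.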

### 3\.2\. Monte Carlo Stochastic Integration

When $n$ is large, we use an importance-weighted Monte Carlo estimator: for each query $x_{i}$, sample $M\ll n$ keys from a learnable proposal $p_{\phi}(y\mid x_{i})$, with parameters $\phi$ independent of $\theta$:

$$\widehat{\mathcal{K}}_{\theta}[u](x_{i})=\frac{1}{M}\sum_{m=1}^{M}\frac{\kappa_{\theta}(x_{i},y_{m},u(x_{i}),u(y_{m}))\,u(y_{m})}{p_{\phi}(y_{m}\mid x_{i})}+W_{\theta}u(x_{i}),\quad y_{m}\sim p_{\phi}(\cdot\mid x_{i}),\tag{13}$$

where $\mu(\Omega)=1$ (our normalization convention from §[2.1](https://arxiv.org/html/2606.19538#S2.SS1)) absorbs the domain volume. The estimator is unbiased: $\mathbb{E}_{y_{1},\ldots,y_{M}}[\widehat{\mathcal{K}}_{\theta}[u](x_{i})]=\mathcal{K}_{\theta}[u](x_{i})$, with variance minimized when the proposal matches the optimal distribution $p^{*}(y\mid x_{i})\propto\|\kappa_{\theta}(x_{i},y,u(x_{i}),u(y))\,u(y)\|_{2}$. The resulting complexity is $O(nMd^{2})$ with $M\ll n$. To ensure unbiased gradients, we decouple computation and sampling: (i) $p_{\phi}$ is parameterized independently of $\kappa_{\theta}$, so $\nabla_{\theta}$ treats sampled positions and importance weights as constants; (ii) $\phi$ is trained via an auxiliary cross-entropy loss that drives $p_{\phi}$ toward a self-normalized approximation of $p^{*}$. Given $M$ sampled keys, define the empirical target $\hat{p}^{*}(y_{m}\mid x_{i})=\|\kappa_{\theta}(x_{i},y_{m},u(x_{i}),u(y_{m}))\,u(y_{m})\|_{2}\,/\,Z_{i}$, where $Z_{i}$ normalizes over the $M$ samples:

$$\mathcal{L}_{\mathrm{prop}}(\phi)=-\sum_{i=1}^{n}\sum_{m=1}^{M}\hat{p}^{*}(y_{m}\mid x_{i})\,\log p_{\phi}(y_{m}\mid x_{i}),\qquad Z_{i}=\sum_{m'=1}^{M}\bigl\|\kappa_{\theta}(x_{i},y_{m'},u(x_{i}),u(y_{m'}))\,u(y_{m'})\bigr\|_{2}.\tag{14}$$

This is the cross-entropy $H(\hat{p}^{*},p_{\phi})$, a biased but consistent estimator of $\mathrm{KL}(p^{*}\|p_{\phi})+\mathrm{const}$ that avoids computing the intractable normalization constant of $p^{*}$. The gradient $\nabla_{\phi}\mathcal{L}_{\mathrm{prop}}$ treats $\hat{p}^{*}$ as a fixed target (stop-gradient on $\theta$). The total objective is $\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{task}}+\lambda\,\mathcal{L}_{\mathrm{prop}}$, where $\lambda=0.1$. At evaluation, sampling is replaced with deterministic $k$-means anchors in position space. This design separates *what to compute* ($\theta$) from *where to sample* ($\phi$), ensuring unbiased training. A formal analysis of estimator variance and the optimal proposal distribution is provided in Appendix [H.11](https://arxiv.org/html/2606.19538#A8.SS11).
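A minimal NumPy sketch of the estimator in Eq. (13) and the per-query term of the proposal loss in Eq. (14) follows. Several things here are assumptions for illustration, not the paper's implementation: the kernel values are a random array `K` standing in for $\kappa_{\theta}$, the proposal is a fixed softmax over random logits rather than a learned network, and the measure is uniform with $\omega_{j}=1/n$. On such a discrete domain the density of the proposal with respect to $\mu$ is the sampling probability divided by $\omega_{j}$, which is what the code divides by. The local term $W_{\theta}u(x_{i})$ is deterministic and omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 32, 4, 8
U = rng.normal(size=(n, d))                  # features u(x_j)
K = 0.1 * rng.normal(size=(n, n, d, d))      # stand-in for kappa_theta at all pairs
omega = 1.0 / n                              # uniform quadrature weight, mu(Omega) = 1

logits = rng.normal(size=n)                  # hypothetical fixed proposal for one query
q = np.exp(logits) / np.exp(logits).sum()    # sampling probabilities over the n keys

def exact_integral(i):
    """Integral term for query i: sum_j omega * kappa_ij u_j."""
    return omega * np.einsum("jab,jb->a", K[i], U)

def mc_estimate(i, q, idx):
    """Importance-weighted estimate from sampled key indices idx (Eq. 13, integral term)."""
    contrib = np.einsum("mab,mb->ma", K[i, idx], U[idx])   # kappa * u(y_m), shape (M, d)
    density = q[idx] / omega                               # proposal density w.r.t. mu
    return (contrib / density[:, None]).mean(axis=0)

def proposal_loss(i, q, idx):
    """Per-query cross-entropy against the self-normalized target (Eq. 14)."""
    norms = np.linalg.norm(np.einsum("mab,mb->ma", K[i, idx], U[idx]), axis=1)
    p_hat = norms / norms.sum()              # Z_i normalizes over the M samples
    return float(-(p_hat * np.log(q[idx] / omega)).sum())

i = 3
idx = rng.choice(n, size=M, p=q)             # y_m ~ p_phi(. | x_i)
estimate = mc_estimate(i, q, idx)
loss = proposal_loss(i, q, idx)

# Unbiasedness, checked exactly: the expectation of a one-sample estimate
# over the proposal equals the exact integral.
expectation = sum(q[j] * mc_estimate(i, q, np.array([j])) for j in range(n))
```

In a training loop, `p_hat` would be computed under a stop-gradient so that only the proposal parameters receive gradients from this loss, as the text describes.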

### 3\.3\. Learned Low\-Rank Kernel Factorization

For long sequences, factorize $\kappa_{\theta}\approx\Phi_{\theta}(x,u(x))^{\top}\Psi_{\theta}(y,u(y))$ with rank $r\ll d$. The integral decouples:

$$(\mathcal{K}_{\theta}[u])(x_{i})\approx\Phi_{\theta}(x_{i},u_{i})^{\top}\underbrace{\sum_{j=1}^{n}\omega_{j}\,\Psi_{\theta}(x_{j},u_{j})\,u_{j}}_{Z}+W_{\theta}u_{i},$$

where $Z$ is computed once in $O(nrd)$, each query in $O(rd)$, total $O(nrd)$. Error satisfies $\|\kappa_{\theta}-\Phi_{\theta}^{\top}\Psi_{\theta}\|_{F}\leq\|\kappa_{\theta}\|_{*}/\sqrt{r}$; $r=32$ gives $<1\%$ relative error on ImageNet-1K; a formal error bound is given in Appendix [H.12](https://arxiv.org/html/2606.19538#A8.SS12).
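The decoupling can be checked numerically. In the sketch below, `Phi` and `Psi` are random arrays standing in for the outputs of $\Phi_{\theta}(x_{i},u_{i})$ and $\Psi_{\theta}(x_{j},u_{j})$, each taken to have shape $r\times d$ so that $\Phi^{\top}\Psi$ is $d\times d$; these shapes and the uniform weight $\omega_{j}=1/n$ are assumptions for illustration. The factorized path computes the shared vector $Z$ once and never forms a pairwise kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 64, 16, 4
U = rng.normal(size=(n, d))            # features u_j
Phi = rng.normal(size=(n, r, d))       # stand-in for Phi_theta(x_i, u_i), shape (r, d) each
Psi = rng.normal(size=(n, r, d))       # stand-in for Psi_theta(x_j, u_j), shape (r, d) each
Wskip = rng.normal(size=(d, d))        # local term W_theta
omega = 1.0 / n

# Dense path: build kappa_ij = Phi_i^T Psi_j for every pair, O(n^2 d^2) memory.
K = np.einsum("ira,jrb->ijab", Phi, Psi)
dense = omega * np.einsum("ijab,jb->ia", K, U) + U @ Wskip.T

# Factorized path: Z is shared by all queries.
Z = omega * np.einsum("jrb,jb->r", Psi, U)             # (r,), one pass over keys, O(n r d)
fast = np.einsum("ira,r->ia", Phi, Z) + U @ Wskip.T    # O(r d) per query

max_abs_diff = float(np.abs(dense - fast).max())
```

When the kernel is exactly of rank $r$, as constructed here, the two paths agree up to floating-point error; for a general kernel the difference is governed by the approximation error bound quoted above.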

### 3\.4\. Modality\-Specific Encoders

The ITNet operator acts on a signal $u:\Omega\to\mathbb{R}^{d}$, where the encoder defines the domain $\Omega$ and initial features $u^{(0)}$. Positional inputs are encoded via Fourier features $\gamma(\cdot)$.

Image encoder ($s=2$): An image $I\in\mathbb{R}^{H\times W\times C}$ is divided into $P\times P$ patches ($P=16$), yielding $N_{p}=HW/P^{2}$ tokens with positions $x_{ij}\in[0,1]^{2}$. Each token is formed by combining a linear patch embedding with Fourier positional features, followed by a projection to $\mathbb{R}^{d}$. A learnable \[CLS\] token is added for global aggregation. A uniform measure over patches is used.

Text encoder ($s=1$): A sequence $(t_{1},\ldots,t_{n})$ is embedded via a token embedding matrix and Fourier positional encoding of normalized indices $k/n$. Relative positions are provided directly to the kernel, enabling distance-aware interactions without explicit positional biases. A causal mask is applied for autoregressive settings. The measure is uniform over tokens.

Point cloud encoder ($s=3$): A point cloud $\{(p_{k},f_{k})\}_{k=1}^{n}$ with $p_{k}\in\mathbb{R}^{3}$ is mapped to features via a linear embedding (or positional encoding for coordinates-only inputs). The kernel receives raw coordinates and pairwise distances $\|p_{k}-p_{j}\|_{2}$, enabling implicit neighborhood modeling without fixed radius constraints. Outputs are aggregated via pooling for classification.

Multimodal encoder (image + text): Image patches and text tokens are combined into a joint domain $\Omega_{\mathrm{img}}\cup\Omega_{\mathrm{txt}}$. Text positions are embedded in the 2-D space to separate modalities, and features are augmented with modality embeddings. A single ITNet processes all positions, allowing intra- and cross-modal interactions to be learned directly without explicit fusion modules (see Appendix [J](https://arxiv.org/html/2606.19538#A10) for additional variants and Appendix [H](https://arxiv.org/html/2606.19538#A8) for implementation details).
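As a sketch of the positional inputs described for the image and text encoders, the code below builds normalized patch positions in $[0,1]^{2}$ and normalized token indices, then applies Fourier features. The exact form of $\gamma(\cdot)$ is not given in this section, so the sine and cosine bands at frequencies $2\pi\cdot 2^{k}$, the number of bands, the use of patch centres, and the helper names are all assumptions.

```python
import numpy as np

def fourier_features(x, num_bands=4):
    """Map positions x of shape (N, s) to (N, 2 * s * num_bands) features.

    Assumed form: sin and cos at frequencies 2*pi*2^k, k = 0..num_bands-1.
    """
    freqs = 2.0 ** np.arange(num_bands) * 2 * np.pi      # (num_bands,)
    ang = x[:, :, None] * freqs[None, None, :]           # (N, s, num_bands)
    feats = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return feats.reshape(x.shape[0], -1)

def patch_positions(H, W, P):
    """Normalized patch-centre coordinates in [0, 1]^2 for P x P patches."""
    ys = (np.arange(H // P) + 0.5) * P / H
    xs = (np.arange(W // P) + 0.5) * P / W
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=1)    # (N_p, 2)

# Image: 224 x 224 input with P = 16 gives N_p = HW / P^2 = 196 tokens.
img_pos = patch_positions(224, 224, 16)
img_feat = fourier_features(img_pos)

# Text: normalized indices k / n for a sequence of n tokens (s = 1).
n_tok = 12
txt_pos = (np.arange(n_tok) / n_tok)[:, None]
txt_feat = fourier_features(txt_pos)
```

In the full encoder these features would be combined with the patch or token embedding and projected to $\mathbb{R}^{d}$; that projection is learned and is not shown here.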

## 4\. Experiments and Results

We evaluate ITNet across four modalities: image classification (ImageNet-1K \[[80](https://arxiv.org/html/2606.19538#bib.bib35)\]), natural language understanding (GLUE \[[94](https://arxiv.org/html/2606.19538#bib.bib44)\]), 3-D point cloud classification (ModelNet40 \[[86](https://arxiv.org/html/2606.19538#bib.bib71)\]), and multimodal reasoning (VQA v2 \[[33](https://arxiv.org/html/2606.19538#bib.bib48)\], NLVR2 \[[85](https://arxiv.org/html/2606.19538#bib.bib49)\]). We consider three model scales: a small model (ITNet-S: 22M parameters; 12 layers, width 384, 6 heads), a base model (ITNet-B: 86M; 12 layers, width 768, 12 heads), and a large model (ITNet-L: 307M; 24 layers, width 1024, 16 heads). Full model configurations are provided in Table [10](https://arxiv.org/html/2606.19538#A8.T10). Results for ITNet-(S, B, L) across all benchmarks are reported as mean $\pm$ standard deviation over 3 runs with different random initializations.

ImageNet-1K \[[80](https://arxiv.org/html/2606.19538#bib.bib35)\]: We report top-1 accuracy on the validation set (training details in Appendix [K.1](https://arxiv.org/html/2606.19538#A11.SS1)). As shown in Table [1](https://arxiv.org/html/2606.19538#S4.T1), ITNet consistently improves over both convolutional and transformer baselines across scales, indicating that jointly modeling content and position provides a more effective inductive bias for visual representation learning.

Table 1: ImageNet-1K top-1 accuracy. ITNet results are mean $\pm$ std over 3 random seeds.

GLUE \[[94](https://arxiv.org/html/2606.19538#bib.bib44)\]: We evaluate ITNet on the GLUE benchmark \[[94](https://arxiv.org/html/2606.19538#bib.bib44)\]. ITNet-B is pre-trained on BookCorpus and English Wikipedia using masked language modeling for 500K steps (sequence lengths 128 and 512), matching the BERT-base \[[24](https://arxiv.org/html/2606.19538#bib.bib15)\] pre-training setup. We then fine-tune on each task and report standard GLUE metrics, using task-specific evaluation protocols.

Table 2: GLUE development-set scores. ITNet results are mean $\pm$ standard deviation over 3 random seeds. Models with comparable pre-training data (BookCorpus + Wikipedia, $\sim$16GB) are shown. †RoBERTa uses 160GB of pre-training data, $10\times$ more than ITNet and BERT. CoLA: Corpus of Linguistic Acceptability; SST-2: Stanford Sentiment Treebank; MRPC: Microsoft Research Paraphrase Corpus; STS-B: Semantic Textual Similarity Benchmark; QQP: Quora Question Pairs; MNLI: Multi-Genre Natural Language Inference; QNLI: Question Natural Language Inference; RTE: Recognizing Textual Entailment. Metrics: Matthews correlation for CoLA, Spearman correlation for STS-B, F1 score for MRPC and QQP, and accuracy for the remaining tasks.

As shown in Table [2](https://arxiv.org/html/2606.19538#S4.T2), ITNet achieves competitive performance on GLUE under the same pre-training budget. ITNet-B matches larger Transformer baselines, while ITNet-L approaches models trained with substantially more data. Gains are strongest on syntactically challenging tasks (CoLA, RTE), indicating improved modeling of long-range dependencies via explicit positional interactions, while performance on semantic tasks (STS-B, MRPC, QQP) remains comparable. Overall, jointly modeling content and position provides a stronger inductive bias for language understanding.

ModelNet40 \[[97](https://arxiv.org/html/2606.19538#bib.bib46)\]: We evaluate ITNet on ModelNet40 \[[97](https://arxiv.org/html/2606.19538#bib.bib46)\] with standard preprocessing and report overall accuracy (OA). As shown in Table [3](https://arxiv.org/html/2606.19538#S4.T4), ITNet achieves strong performance, outperforming both content-only and geometry-focused baselines. A parameter-matched variant, ITNet-PC (Point Cloud), remains competitive, indicating that the gains are not solely due to model scale. These results suggest that jointly modeling feature content and geometric relationships is effective for capturing both global shape and local structure.

Table 3: ModelNet40 overall accuracy (%). ITNet results are mean $\pm$ standard deviation over 3 random seeds. ITNet-PC: parameter-matched variant (L=6, d=128, H=4, 3.1M).

Table 4: VQA v2 test-dev and NLVR2 accuracy (%). ITNet results are mean $\pm$ standard deviation over 3 random seeds.

VQA v2 \[[33](https://arxiv.org/html/2606.19538#bib.bib48)\] and NLVR2 \[[85](https://arxiv.org/html/2606.19538#bib.bib49)\]: We evaluate ITNet on VQA v2 \[[33](https://arxiv.org/html/2606.19538#bib.bib48)\] and NLVR2 \[[85](https://arxiv.org/html/2606.19538#bib.bib49)\] using a joint image-text domain. As shown in Table [4](https://arxiv.org/html/2606.19538#S4.T4), ITNet-B achieves competitive performance with specialized vision-language models despite fewer parameters and no dedicated image-text pre-training. This indicates that cross-modal interactions can be learned directly through a shared kernel, without explicit fusion mechanisms. Performance scales consistently with model size, supporting the effectiveness of the unified operator for multimodal reasoning.

Ablations: We ablate kernel inputs across modalities (Table [5](https://arxiv.org/html/2606.19538#S4.T5)) and observe that performance is maximized only when content and position are jointly modeled, with complementary contributions that neither captures alone. Their relative importance is modality-dependent: spatial cues dominate in vision, while content interactions are more critical for language and 3D data, showing that the kernel adapts its inductive bias. Interaction structure is also key. Removing the Hadamard term consistently degrades performance, showing that elementwise feature interactions capture relationships beyond independent features, and relative positional terms outperform absolute-only inputs, emphasizing the value of modeling relationships between positions. Single-group and constant-kernel variants perform substantially worse, confirming that neither pure content nor pure geometry is sufficient. Overall, ITNet's gains arise from learning unified, content- and position-dependent interactions.

Table 5: Kernel input ablation across all modalities (ITNet-B). ✓ = included, ✗ = excluded. Results: ImageNet-1K top-1 (%), ModelNet40 OA (%), GLUE avg, VQA v2 test-dev (%), NLVR2 (%).

Compared to two-stream variants with cross-attention or concatenation fusion, joint-domain ITNet achieves higher performance with fewer parameters, showing that cross-modal interactions are better learned within a shared kernel. Additional ablations (Appendix [M](https://arxiv.org/html/2606.19538#A13)) indicate that performance is robust to architectural choices, with gains saturating at moderate capacity (Tables [20](https://arxiv.org/html/2606.19538#A13.T20), [24](https://arxiv.org/html/2606.19538#A13.T24), [23](https://arxiv.org/html/2606.19538#A13.T23)). For point clouds, positional encoding and local aggregation provide complementary benefits (Table [21](https://arxiv.org/html/2606.19538#A13.T21)), while balanced modality weighting improves multimodal performance (Table [22](https://arxiv.org/html/2606.19538#A13.T22)). Detailed system-level results across all modalities (Appendix [L](https://arxiv.org/html/2606.19538#A12), Tables [16](https://arxiv.org/html/2606.19538#A12.T16), [17](https://arxiv.org/html/2606.19538#A12.T17), [18](https://arxiv.org/html/2606.19538#A12.T18), and [19](https://arxiv.org/html/2606.19538#A12.T19)) show that while exact ITNet incurs modest overhead, Monte Carlo and low-rank variants achieve higher throughput and significantly lower memory.

## 5\. Discussion

Our results support a central hypothesis: modeling interactions jointly over content and position is more expressive than modeling them separately. Across modalities, ITNet benefits from conditioning on both feature similarity and geometry, suggesting that common inductive biases (locality in CNNs, content-based attention in Transformers, and causality in sequence models) are restricted instances of a more general interaction mechanism. Ablations show that the relative importance of content and position is modality-dependent: spatial structure dominates in vision, while content interactions are more critical in language and 3D data. Rather than fixing these biases, ITNet adapts through the learned kernel, enabling cross-domain generalization. ITNet models multiplicative content–position interactions, allowing spatial relationships to depend on features. This unifies and extends positional encoding, attention, and convolution within a single operator. Multimodal results show that cross-modal interactions can emerge directly from a shared kernel over a joint domain, removing the need for explicit fusion modules while maintaining strong performance.

Despite these advantages, limitations remain and define important directions for future work. First, scaling ITNet to billion-parameter regimes introduces challenges in optimization stability and kernel evaluation cost; developing more efficient kernel parameterizations and training strategies is a key next step. Second, while the framework naturally supports causal structure, we have not yet evaluated it on autoregressive generation tasks, which provide the most direct test of the causal kernel; extending ITNet to long-context language modeling is an important future direction. Third, the joint-domain formulation increases training cost in multimodal settings due to end-to-end coupling; improving efficiency through modular or partially factorized training remains an open avenue.

Toward generative ITNet: The kernel formulation provides a path to autoregressive modeling by enforcing causality in the interaction function. In this setting, the operator admits efficient factorizations that reduce generation cost to linear time, matching state-space models while retaining the flexibility of attention. Extending to generative benchmarks would test a unified framework for bidirectional understanding and sequential generation.

## 6\. Conclusion

We introduced ITNet, a neural architecture based on a learnable integral operator whose kernel depends jointly on positions and features\. Within this framework, convolution, self\-attention, and recurrence arise as exact special cases, and the resulting function class forms a universal operator approximator\. Empirically, a single ITNet matches or exceeds specialized models across modalities, while scalable approximations make the operator practical\. More broadly, our results suggest that the diversity of neural architectures reflects fixed assumptions about interaction structure\. By learning the interaction rule directly, ITNet provides a unified view in which locality, global context, and sequential dynamics emerge from a common mechanism\. This points toward general\-purpose, modality\-agnostic architectures where interaction patterns are learned rather than predefined\.

## References

- \[1\] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
- \[2\] J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. (2022) Flamingo: a Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems 35, pp. 23716–23736.
- \[3\] (1961) An Estimate for the Perturbations of the Solution of Ordinary Differential Equations. Vestnik Moskov. Univ. Ser. I Mat. Meh. 2, pp. 28–36.
- \[4\] A. Anandkumar, K. Azizzadenesheli, K. Bhattacharya, N. Kovachki, Z. Li, B. Liu, and A. Stuart (2020) Neural Operator: Graph Kernel Network for Partial Differential Equations. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations.
- \[5\] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer Normalization. arXiv preprint arXiv:1607.06450.
- \[6\] A. R. Barron (1993) Universal Approximation Bounds for Superpositions of a Sigmoidal Function. IEEE Transactions on Information Theory 39(3), pp. 930–945.
- \[7\] I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.
- \[8\] E. Calvello, N. B. Kovachki, M. E. Levine, and A. M. Stuart (2025) Continuum Attention for Neural Operators. Journal of Machine Learning Research 26(300), pp. 1–52.
- \[9\] T. Chen and H. Chen (1995) Universal Approximation to Nonlinear Operators by Neural Networks with Arbitrary Activation Functions and Its Application to Dynamical Systems. IEEE Transactions on Neural Networks 6(4), pp. 911–917.
- \[10\] T. Chen, B. Xu, C. Zhang, and C. Guestrin (2016) Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174.
- \[11\] Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020) UNITER: UNiversal Image-TExt Representation Learning. In European Conference on Computer Vision, pp. 104–120.
- \[12\] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, and Z. Liu (2020) Dynamic Convolution: Attention over Convolution Kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11030–11039.
- \[13\] K. Cho, B. Van Merriënboer, Ç. Gulçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734.
- \[14\] F. Chollet (2017) Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258.
- \[15\] K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller (2021) Rethinking Attention with Performers. In International Conference on Learning Representations.
- \[16\] K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In International Conference on Learning Representations.
- \[17\] E. A. Coddington and N. Levinson (1955) Theory of Ordinary Differential Equations. McGraw-Hill.
- \[18\] J. Cordonnier, A. Loukas, and M. Jaggi (2019) On the Relationship between Self-Attention and Convolutional Layers. arXiv preprint arXiv:1911.03584.
- \[19\] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2020) RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703.
- \[20\] G. Cybenko (1989) Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems 2(4), pp. 303–314.
- \[21\] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré (2022) FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems.
- \[22\] T. Dao and A. Gu (2024) Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. arXiv preprint arXiv:2405.21060.
- \[23\] P. J. Davis and P. Rabinowitz (2007) Methods of Numerical Integration. Courier Corporation.
- \[24\] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
- \[25\] X. Ding, X. Zhang, J. Han, and G. Ding (2022) Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11963–11975.
- \[26\] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
- \[27\] Z. Dou, Y. Xu, Z. Gan, J. Wang, S. Wang, L. Wang, C. Zhu, P. Zhang, L. Yuan, N. Peng, et al. (2022) An Empirical Study of Training End-to-End Vision-and-Language Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18166–18176.
- \[28\] N. Dunford and J. T. Schwartz (1988) Linear Operators, Part 1: General Theory. John Wiley & Sons.
- \[29\] V. P. Dwivedi, A. T. Luu, T. Laurent, Y. Bengio, and X. Bresson (2022) Graph Neural Networks with Learnable Structural and Positional Representations. In International Conference on Learning Representations.
- \[30\] B. Elesedy and S. Zaidi (2021) Provably Strict Generalisation Benefit for Equivariant Models. In International Conference on Machine Learning, pp. 2959–2969.
- \[31\] G. B. Folland (1999) Real Analysis: Modern Techniques and Their Applications. John Wiley & Sons.
- \[32\] X. Glorot and Y. Bengio (2010) Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- \[33\] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913.
- \[34\] W. Gröbner (1960) Die Lie-Reihen und ihre Anwendungen. VEB Deutscher Verlag der Wissenschaften.
- \[35\] T. H. Gronwall (1919) Note on the Derivatives with Respect to a Parameter of the Solutions of a System of Differential Equations. Annals of Mathematics 20(4), pp. 292–296.
- \[36\] A. Gu and T. Dao (2023) Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752.
- \[37\] A. Gu, K. Goel, and C. Ré (2021) Efficiently Modeling Long Sequences with Structured State Spaces. arXiv preprint arXiv:2111.00396.
- \[38\] M. Guo, J. Cai, Z. Liu, T. Mu, R. R. Martin, and S. Hu (2021) PCT: Point Cloud Transformer. Computational Visual Media 7(2), pp. 187–199.
- \[39\] A. Haar (1933) Der Massbegriff in der Theorie der kontinuierlichen Gruppen. Annals of Mathematics 34(1), pp. 147–169.
- \[40\] P. Hartman (2002) Ordinary Differential Equations. SIAM.
- \[41\] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- \[42\] P. He, X. Liu, J. Gao, and W. Chen (2020) DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv preprint arXiv:2006.03654.
- \[43\] D. Hendrycks and K. Gimpel (2016) Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415.
- \[44\] S. Hochreiter and J. Schmidhuber (1997) Long Short-Term Memory. Neural Computation 9(8), pp. 1735–1780.
- \[45\]R\. A\. Horn and C\. R\. Johnson\(2012\)Matrix Analysis\.Cambridge University Press\.Cited by:[§H\.12](https://arxiv.org/html/2606.19538#A8.SS12.1.p1.3)\.
- \[46\]K\. Hornik, M\. Stinchcombe, and H\. White\(1989\)Multilayer Feedforward Networks are Universal Approximators\.Neural networks2\(5\),pp\. 359–366\.Cited by:[§F\.2](https://arxiv.org/html/2606.19538#A6.SS2.1.p1.12),[§2\.1](https://arxiv.org/html/2606.19538#S2.SS1.p3.9),[§2\.3](https://arxiv.org/html/2606.19538#S2.SS3.2.p2.9),[Remark 4](https://arxiv.org/html/2606.19538#Thmremark4)\.
- \[47\]G\. Huang, Y\. Sun, Z\. Liu, D\. Sedra, and K\. Q\. Weinberger\(2016\)Deep Networks with Stochastic Depth\.InEuropean Conference on Computer Vision,pp\. 646–661\.Cited by:[§H\.4](https://arxiv.org/html/2606.19538#A8.SS4.p1.16)\.
- \[48\]A\. Jaegle, S\. Borgeaud, J\. Alayrac, C\. Doersch, C\. Ionescu, D\. Ding, S\. Koppula, D\. Zoran, A\. Brock, E\. Shelhamer,et al\.\(2021\)Perceiver IO: A General Architecture for Structured Inputs & Outputs\.arXiv preprint arXiv:2107\.14795\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p6.1)\.
- \[49\]A\. Jaegle, F\. Gimeno, A\. Brock, O\. Vinyals, A\. Zisserman, and J\. Carreira\(2021\)Perceiver: General Perception with Iterative Attention\.InInternational Conference on Machine Learning,pp\. 4651–4664\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p6.1)\.
- \[50\]A\. Katharopoulos, A\. Vyas, N\. Pappas, and F\. Fleuret\(2020\)Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention\.InInternational Conference on Machine Learning,pp\. 5156–5165\.Cited by:[Table 18](https://arxiv.org/html/2606.19538#A12.T18.3.6.2.1),[Appendix B](https://arxiv.org/html/2606.19538#A2.p3.1),[§D\.7](https://arxiv.org/html/2606.19538#A4.SS7.p1.1)\.
- \[51\]W\. Kim, B\. Son, and I\. Kim\(2021\)ViLT: Vision\-and\-Language Transformer Without Convolution or Region Supervision\.InInternational Conference on Machine Learning,pp\. 5583–5594\.Cited by:[Table 16](https://arxiv.org/html/2606.19538#A12.T16.19.17.17.2),[Table 4](https://arxiv.org/html/2606.19538#S4.T4.20.8.6.8.2.1)\.
- \[52\]A\. Krizhevsky, I\. Sutskever, and G\. E\. Hinton\(2012\)ImageNet Classification with Deep Convolutional Neural Networks\.InAdvances in Neural Information Processing Systems,Vol\.25\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p2.1)\.
- \[53\]Y\. LeCun, B\. Boser, J\. S\. Denker, D\. Henderson, R\. E\. Howard, W\. Hubbard, and L\. D\. Jackel\(1989\)Backpropagation Applied to Handwritten Zip Code Recognition\.Neural computation1\(4\),pp\. 541–551\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p2.1),[§1](https://arxiv.org/html/2606.19538#S1.p1.1),[§1](https://arxiv.org/html/2606.19538#S1.p2.1)\.
- \[54\]M\. Leshno, V\. Y\. Lin, A\. Pinkus, and S\. Schocken\(1993\)Multilayer Feedforward Networks With a Nonpolynomial Activation Function can Approximate Any Function\.Neural networks6\(6\),pp\. 861–867\.Cited by:[§F\.2](https://arxiv.org/html/2606.19538#A6.SS2.1.p1.12)\.
- \[55\]J\. Li, D\. Li, S\. Savarese, and S\. Hoi\(2023\)BLIP\-2: Bootstrapping Language\-Image Pre\-training with Frozen Image Encoders and Large Language Models\.InInternational Conference on Machine Learning,pp\. 19730–19742\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p6.1)\.
- \[56\]J\. Li, D\. Li, C\. Xiong, and S\. Hoi\(2022\)BLIP: Bootstrapping Language\-Image Pre\-training for Unified Vision\-Language Understanding and Generation\.InInternational Conference on Machine Learning,pp\. 12888–12900\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p6.1),[§1](https://arxiv.org/html/2606.19538#S1.p2.1),[Table 4](https://arxiv.org/html/2606.19538#S4.T4.20.8.6.12.6.1),[Table 4](https://arxiv.org/html/2606.19538#S4.T4.20.8.6.13.7.1)\.
- \[57\]J\. Li, R\. Selvaraju, A\. Gotmare, S\. Joty, C\. Xiong, and S\. C\. H\. Hoi\(2021\)Align before Fuse: Vision and Language Representation Learning with Momentum Distillation\.Advances in Neural Information Processing Systems34,pp\. 9694–9705\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p6.1),[Table 4](https://arxiv.org/html/2606.19538#S4.T4.20.8.6.11.5.1)\.
- \[58\]Z\. Li, N\. Kovachki, K\. Azizzadenesheli, B\. Liu, K\. Bhattacharya, A\. Stuart, and A\. Anandkumar\(2020\)Fourier Neural Operator for Parametric Partial Differential Equations\.arXiv preprint arXiv:2010\.08895\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p4.2)\.
- \[59\]Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov\(2019\)RoBERTa: A Robustly Optimized BERT Pretraining Approach\.arXiv preprint arXiv:1907\.11692\.Cited by:[Table 2](https://arxiv.org/html/2606.19538#S4.T2.5.1.1.1)\.
- \[60\]Z\. Liu, H\. Hu, Y\. Lin, Z\. Yao, Z\. Xie, Y\. Wei, J\. Ning, Y\. Cao, Z\. Zhang, L\. Dong, F\. Wei, and B\. Guo\(2022\)Swin Transformer V2: Scaling Up Capacity and Resolution\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 12009–12019\.Cited by:[Table 1](https://arxiv.org/html/2606.19538#S4.T1.5.12.9.1)\.
- \[61\]Z\. Liu, Y\. Lin, Y\. Cao, H\. Hu, Y\. Wei, Z\. Zhang, S\. Lin, and B\. Guo\(2021\)Swin Transformer: Hierarchical Vision Transformer using Shifted Windows\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 10012–10022\.Cited by:[§J\.2](https://arxiv.org/html/2606.19538#A10.SS2.p1.4),[Appendix B](https://arxiv.org/html/2606.19538#A2.p3.1),[Table 1](https://arxiv.org/html/2606.19538#S4.T1.5.10.7.1),[Table 1](https://arxiv.org/html/2606.19538#S4.T1.5.11.8.1)\.
- \[62\]Z\. Liu, H\. Mao, C\. Wu, C\. Feichtenhofer, T\. Darrell, and S\. Xie\(2022\)A ConvNet for the 2020s\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 11976–11986\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p2.1),[§C\.1](https://arxiv.org/html/2606.19538#A3.SS1.p1.pic1.5.5.5.5.5.5.5.5.5.5.5.5.4.4.4.4.4.4.p4.3),[Table 1](https://arxiv.org/html/2606.19538#S4.T1.5.6.3.1),[Table 1](https://arxiv.org/html/2606.19538#S4.T1.5.7.4.1)\.
- \[63\]I\. Loshchilov and F\. Hutter\(2019\)Decoupled Weight Decay Regularization\.InInternational Conference on Learning Representations,Cited by:[§H\.4](https://arxiv.org/html/2606.19538#A8.SS4.p2.3),[§H\.5](https://arxiv.org/html/2606.19538#A8.SS5.p1.1),[§H\.5](https://arxiv.org/html/2606.19538#A8.SS5.p2.2)\.
- \[64\]L\. Lu, P\. Jin, G\. Pang, Z\. Zhang, and G\. E\. Karniadakis\(2021\)Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators\.Nature machine intelligence3\(3\),pp\. 218–229\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p4.2)\.
- \[65\]X\. Ma, C\. Qin, H\. You, H\. Ran, and Y\. Fu\(2022\)Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework\.InInternational Conference on Learning Representations,Cited by:[Table 4](https://arxiv.org/html/2606.19538#S4.T4.12.12.10.14.4.1)\.
- \[66\]P\. Micikevicius, S\. Narang, J\. Alben, G\. Diamos, E\. Elsen, D\. Garcia, B\. Ginsburg, M\. Houston, O\. Kuchaiev, G\. Venkatesh, and H\. Wu\(2018\)Mixed Precision Training\.InInternational Conference on Learning Representations,Cited by:[§H\.4](https://arxiv.org/html/2606.19538#A8.SS4.p1.16)\.
- \[67\]B\. S\. Nagy, C\. Foias, H\. Bercovici, and L\. Kérchy\(2010\)Harmonic Analysis of Operators on Hilbert Space\.Springer Science & Business Media\.Cited by:[Appendix G](https://arxiv.org/html/2606.19538#A7.6.p6.6)\.
- \[68\]X\. Pei, T\. Huang, and C\. Xu\(2025\)EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 6443–6451\.Cited by:[Table 1](https://arxiv.org/html/2606.19538#S4.T1.5.16.13.1)\.
- \[69\]M\. Poli, S\. Massaroli, E\. Nguyen, D\. Y\. Fu, T\. Dao, S\. Baccus, Y\. Bengio, S\. Ermon, and C\. Ré\(2023\)Hyena Hierarchy: Towards Larger Convolutional Language Models\.InInternational Conference on Machine Learning,pp\. 28043–28078\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p3.1)\.
- \[70\]B\. T\. Polyak and A\. B\. Juditsky\(1992\)Acceleration of Stochastic Approximation by Averaging\.SIAM Journal on Control and Optimization30\(4\),pp\. 838–855\.Cited by:[§H\.4](https://arxiv.org/html/2606.19538#A8.SS4.p1.16)\.
- \[71\]O\. Press, N\. Smith, and M\. Lewis\(2022\)Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation\.InInternational Conference on Learning Representations,Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p2.1),[Remark 7](https://arxiv.org/html/2606.19538#Thmremark7.p1.6.6)\.
- \[72\]C\. R\. Qi, H\. Su, K\. Mo, and L\. J\. Guibas\(2017\)PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition,pp\. 652–660\.Cited by:[Table 4](https://arxiv.org/html/2606.19538#S4.T4.12.12.10.12.2.1)\.
- \[73\]C\. R\. Qi, L\. Yi, H\. Su, and L\. J\. Guibas\(2017\)PointNet\+\+: Deep Hierarchical Feature Learning on Point Sets in a Metric Space\.Advances in Neural Information Processing Systems30\.Cited by:[§1](https://arxiv.org/html/2606.19538#S1.p2.1),[Table 4](https://arxiv.org/html/2606.19538#S4.T4.3.3.1.1.2)\.
- \[74\]G\. Qian, Y\. Li, H\. Peng, J\. Mai, H\. Hammoud, M\. Elhoseiny, and B\. Ghanem\(2022\)PointNeXt: Revisiting PointNet\+\+ with Improved Training and Scaling Strategies\.Advances in Neural Information Processing Systems35,pp\. 23192–23204\.Cited by:[Table 4](https://arxiv.org/html/2606.19538#S4.T4.5.5.3.3.2)\.
- \[75\]A\. Radford, K\. Narasimhan, T\. Salimans, I\. Sutskever,et al\.\(2018\)Improving Language Understanding by Generative Pre\-Training\.Cited by:[Table 2](https://arxiv.org/html/2606.19538#S4.T2.32.28.32.4.1)\.
- \[76\]A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever,et al\.\(2019\)Language Models are Unsupervised Multitask Learners\.OpenAI blog1\(8\),pp\. 9\.Cited by:[1st item](https://arxiv.org/html/2606.19538#A8.I1.i1.p1.2),[3rd item](https://arxiv.org/html/2606.19538#A8.I1.i3.p1.5),[§H\.3\.1](https://arxiv.org/html/2606.19538#A8.SS3.SSS1.p2.3),[§H\.3\.1](https://arxiv.org/html/2606.19538#A8.SS3.SSS1.p5.4)\.
- \[77\]H\. L\. Royden\(1988\)Real Analysis\.Krishna Prakashan Media\.Cited by:[Table 6](https://arxiv.org/html/2606.19538#A1.T6.10.10.2.1.1),[§C\.3](https://arxiv.org/html/2606.19538#A3.SS3.4.p4.6),[Appendix C](https://arxiv.org/html/2606.19538#A3.p1.4)\.
- \[78\]W\. Rudin\(1976\)Principles of Mathematical Analysis\.International Series in Pure and Applied Mathematics\.Cited by:[§D\.2](https://arxiv.org/html/2606.19538#A4.SS2.4.p4.6),[§H\.11](https://arxiv.org/html/2606.19538#A8.SS11.3.p3.4),[§H\.12](https://arxiv.org/html/2606.19538#A8.SS12.2.p2.1)\.
- \[79\]W\. Rudin\(1987\)Real and Complex Analysis\.3rd edition,McGraw\-Hill,New York\.Cited by:[§C\.3](https://arxiv.org/html/2606.19538#A3.SS3.4.p4.6),[§F\.2](https://arxiv.org/html/2606.19538#A6.SS2.7.p6.1),[Remark 4](https://arxiv.org/html/2606.19538#Thmremark4)\.
- \[80\]O\. Russakovsky, J\. Deng, H\. Su, J\. Krause, S\. Satheesh, S\. Ma, Z\. Huang, A\. Karpathy, A\. Khosla, M\. Bernstein,et al\.\(2015\)ImageNet Large Scale Visual Recognition Challenge\.International Journal of Computer Vision115\(3\),pp\. 211–252\.Cited by:[§K\.1](https://arxiv.org/html/2606.19538#A11.SS1),[§1](https://arxiv.org/html/2606.19538#S1.p6.1),[§4](https://arxiv.org/html/2606.19538#S4.p1.1),[§4](https://arxiv.org/html/2606.19538#S4.p2.1)\.
- \[81\]J\. Serre et al\.\(1977\)Linear Representations of Finite Groups\.Vol\.42,Springer\.Cited by:[Appendix G](https://arxiv.org/html/2606.19538#A7.10.p10.2)\.
- \[82\]N\. Srivastava, G\. Hinton, A\. Krizhevsky, I\. Sutskever, and R\. Salakhutdinov\(2014\)Dropout: A Simple Way to Prevent Neural Networks from Overfitting\.The Journal of Machine Learning Research15\(1\),pp\. 1929–1958\.Cited by:[§H\.4](https://arxiv.org/html/2606.19538#A8.SS4.p1.16)\.
- \[83\]E\. M\. Stein\(1970\)Singular Integrals and Differentiability Properties of Functions\.Princeton University Press\.Cited by:[Lemma 2](https://arxiv.org/html/2606.19538#Thmlemma2)\.
- \[84\]J\. Su, M\. Ahmed, Y\. Lu, S\. Pan, W\. Bo, and Y\. Liu\(2024\)RoFormer: Enhanced transformer with Rotary Position Embedding\.Neurocomputing568,pp\. 127063\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p2.1),[Remark 7](https://arxiv.org/html/2606.19538#Thmremark7.p1.6.6)\.
- \[85\]A\. Suhr, S\. Zhou, A\. Zhang, I\. Zhang, H\. Bai, and Y\. Artzi\(2019\)A Corpus for Reasoning about Natural Language Grounded in Photographs\.InProceedings of the 57th annual meeting of the association for computational linguistics,pp\. 6418–6428\.Cited by:[§K\.4](https://arxiv.org/html/2606.19538#A11.SS4),[§K\.4](https://arxiv.org/html/2606.19538#A11.SS4.p1.1),[§1](https://arxiv.org/html/2606.19538#S1.p6.1),[§4](https://arxiv.org/html/2606.19538#S4.p1.1),[§4](https://arxiv.org/html/2606.19538#S4.p6.1),[§4](https://arxiv.org/html/2606.19538#S4.p6.1.1)\.
- \[86\]J\. Sun, Q\. Zhang, B\. Kailkhura, Z\. Yu, C\. Xiao, and Z\. M\. Mao\(2022\)Benchmarking Robustness of 3D Point Cloud Recognition Against Common Corruptions\.arXiv preprint arXiv:2201\.12296\.Cited by:[§1](https://arxiv.org/html/2606.19538#S1.p6.1),[§4](https://arxiv.org/html/2606.19538#S4.p1.1)\.
- \[87\]M\. Tancik, P\. Srinivasan, B\. Mildenhall, S\. Fridovich\-Keil, N\. Raghavan, U\. Singhal, R\. Ramamoorthi, J\. Barron, and R\. Ng\(2020\)Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains\.Advances in Neural Information Processing Systems33,pp\. 7537–7547\.Cited by:[§H\.2\.1](https://arxiv.org/html/2606.19538#A8.SS2.SSS1.p2.5),[§2\.1](https://arxiv.org/html/2606.19538#S2.SS1.p2.18)\.
- \[88\]P\. Tillet, H\. Kung, and D\. Cox\(2019\)Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations\.InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages,pp\. 10–19\.Cited by:[§H\.10](https://arxiv.org/html/2606.19538#A8.SS10.p13.1),[§3\.1](https://arxiv.org/html/2606.19538#S3.SS1.p1.17)\.
- \[89\]I\. O\. Tolstikhin, N\. Houlsby, A\. Kolesnikov, L\. Beyer, X\. Zhai, T\. Unterthiner, J\. Yung, A\. Steiner, D\. Keysers, J\. Uszkoreit,et al\.\(2021\)MLP\-Mixer: An all\-MLP Architecture for Vision\.Advances in Neural Information Processing Systems34,pp\. 24261–24272\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p3.1)\.
- \[90\]H\. Touvron, M\. Cord, M\. Douze, F\. Massa, A\. Sablayrolles, and H\. Jégou\(2021\)Training data\-efficient image transformers & distillation through attention\.InInternational Conference on Machine Learning,pp\. 10347–10357\.Cited by:[§H\.4](https://arxiv.org/html/2606.19538#A8.SS4.p1.16),[Table 1](https://arxiv.org/html/2606.19538#S4.T1.5.8.5.1),[Table 1](https://arxiv.org/html/2606.19538#S4.T1.5.9.6.1)\.
- \[91\]H\. Touvron, M\. Cord, and H\. Jégou\(2022\)DeiT III: Revenge of the ViT\.InEuropean Conference on Computer Vision,pp\. 516–533\.Cited by:[Table 1](https://arxiv.org/html/2606.19538#S4.T1.5.13.10.1)\.
- \[92\]H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023\)LLaMA: Open and Efficient Foundation Language Models\.arXiv preprint arXiv:2302\.13971\.Cited by:[1st item](https://arxiv.org/html/2606.19538#A8.I1.i1.p1.2)\.
- \[93\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin\(2017\)Attention is All you Need\.InAdvances in Neural Information Processing Systems,I\. Guyon, U\. V\. Luxburg, S\. Bengio, H\. Wallach, R\. Fergus, S\. Vishwanathan, and R\. Garnett \(Eds\.\),Vol\.30\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p2.1),[§1](https://arxiv.org/html/2606.19538#S1.p1.1),[§1](https://arxiv.org/html/2606.19538#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.19538#S2.SS1.p5.4),[Definition 5](https://arxiv.org/html/2606.19538#Thmdefinition5),[Definition 6](https://arxiv.org/html/2606.19538#Thmdefinition6),[Definition 7](https://arxiv.org/html/2606.19538#Thmdefinition7)\.
- \[94\]A\. Wang, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. R\. Bowman\(2018\)GLUE: A Multi\-Task Benchmark and Analysis Platform for Natural Language Understanding\.InProceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP,pp\. 353–355\.Cited by:[§K\.2](https://arxiv.org/html/2606.19538#A11.SS2),[§K\.2](https://arxiv.org/html/2606.19538#A11.SS2.p2.1),[§K\.4](https://arxiv.org/html/2606.19538#A11.SS4.p1.1),[§1](https://arxiv.org/html/2606.19538#S1.p6.1),[§4](https://arxiv.org/html/2606.19538#S4.p1.1),[§4](https://arxiv.org/html/2606.19538#S4.p3.1),[§4](https://arxiv.org/html/2606.19538#S4.p3.1.1)\.
- \[95\]Y\. Wang, Y\. Sun, Z\. Liu, S\. E\. Sarma, M\. M\. Bronstein, and J\. M\. Solomon\(2019\)Dynamic Graph CNN for Learning on Point Clouds\.ACM Transactions on Graphics \(tog\)38\(5\),pp\. 1–12\.Cited by:[§1](https://arxiv.org/html/2606.19538#S1.p2.1),[Table 4](https://arxiv.org/html/2606.19538#S4.T4.4.4.2.2.2)\.
- \[96\]S\. Woo, S\. Debnath, R\. Hu, X\. Chen, Z\. Liu, I\. S\. Kweon, and S\. Xie\(2023\)ConvNeXt V2: Co\-designing and Scaling ConvNets with Masked Autoencoders\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 16133–16142\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p2.1),[Table 1](https://arxiv.org/html/2606.19538#S4.T1.5.14.11.1)\.
- \[97\]Z\. Wu, S\. Song, A\. Khosla, F\. Yu, L\. Zhang, X\. Tang, and J\. Xiao\(2015\)3D ShapeNets: A Deep Representation for Volumetric Shapes\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition,pp\. 1912–1920\.Cited by:[§K\.3](https://arxiv.org/html/2606.19538#A11.SS3),[Table 21](https://arxiv.org/html/2606.19538#A13.T21),[§4](https://arxiv.org/html/2606.19538#S4.p5.1),[§4](https://arxiv.org/html/2606.19538#S4.p5.1.1)\.
- \[98\]R\. Xiong, Y\. Yang, D\. He, K\. Zheng, S\. Zheng, C\. Xing, H\. Zhang, Y\. Lan, L\. Wang, and T\. Liu\(2020\)On Layer Normalization in the Transformer Architecture\.InInternational Conference on Machine Learning,pp\. 10524–10533\.Cited by:[1st item](https://arxiv.org/html/2606.19538#A8.I1.i1.p1.2),[§H\.2\.1](https://arxiv.org/html/2606.19538#A8.SS2.SSS1.p2.5),[§H\.2](https://arxiv.org/html/2606.19538#A8.SS2.p1.1)\.
- \[99\]Y\. Xiong, Z\. Zeng, R\. Chakraborty, M\. Tan, G\. Fung, Y\. Li, and V\. Singh\(2021\)Nyströmformer: A Nyström\-Based Algorithm for Approximating Self\-Attention\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.35,pp\. 14138–14148\.Cited by:[Table 18](https://arxiv.org/html/2606.19538#A12.T18.3.7.3.1),[Appendix B](https://arxiv.org/html/2606.19538#A2.p3.1),[§2\.1](https://arxiv.org/html/2606.19538#S2.SS1.p6.2)\.
- \[100\]W\. Yu, M\. Luo, P\. Zhou, C\. Si, Y\. Zhou, X\. Wang, J\. Feng, and S\. Yan\(2022\)MetaFormer is Actually What You Need for Vision\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 10819–10829\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p5.1)\.
- \[101\]S\. Yun, D\. Han, S\. J\. Oh, S\. Chun, J\. Choe, and Y\. Yoo\(2019\)CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 6023–6032\.Cited by:[§K\.1](https://arxiv.org/html/2606.19538#A11.SS1.p1.2)\.
- \[102\]M\. Zaheer, G\. Guruganesh, K\. A\. Dubey, J\. Ainslie, C\. Alberti, S\. Ontanon, P\. Pham, A\. Ravula, Q\. Wang, L\. Yang,et al\.\(2020\)Big Bird: Transformers for Longer Sequences\.Advances in Neural Information Processing Systems33,pp\. 17283–17297\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p3.1)\.
- \[103\]H\. Zhang, M\. Cisse, Y\. N\. Dauphin, and D\. Lopez\-Paz\(2018\)mixup: Beyond Empirical Risk Minimization\.InInternational Conference on Learning Representations,Cited by:[§K\.1](https://arxiv.org/html/2606.19538#A11.SS1.p1.2)\.
- \[104\]L\. Zhu, X\. Wang, Z\. Ke, W\. Zhang, and R\. W\. Lau\(2023\)BiFormer: Vision Transformer with Bi\-Level Routing Attention\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 10323–10333\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p5.1),[Table 1](https://arxiv.org/html/2606.19538#S4.T1.5.15.12.1)\.
- \[105\]X\. Zhu, H\. Hu, S\. Lin, and J\. Dai\(2019\)Deformable ConvNets v2: More Deformable, Better Results\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 9308–9316\.Cited by:[Appendix B](https://arxiv.org/html/2606.19538#A2.p5.1)\.

## Appendix Overview

This appendix is organized as follows for ease of navigation and reference\.

Appendix[A](https://arxiv.org/html/2606.19538#A1): Notation Reference

- •Table[6](https://arxiv.org/html/2606.19538#A1.T6): Notation for spaces, signals, ITNet operator, and kernel parameterization\.
- •Table[7](https://arxiv.org/html/2606.19538#A1.T7): Notation for convolution, self\-attention, and recurrence \(Theorems 1–3\)\.
- •Table[8](https://arxiv.org/html/2606.19538#A1.T8): Notation for universal approximation, kernel recovery, implementation, and backpropagation\.

Appendix[B](https://arxiv.org/html/2606.19538#A2): Related Work – comparison with prior methods and positioning of ITNet\.

Appendix[C](https://arxiv.org/html/2606.19538#A3): Proof of Theorem 1 \(Convolution as a Special Case\)

- •Definition[2](https://arxiv.org/html/2606.19538#Thmdefinition2): Continuous convolution\.
- •Definition[3](https://arxiv.org/html/2606.19538#Thmdefinition3): Discrete convolution on regular grid\.
- •Assumption[2](https://arxiv.org/html/2606.19538#Thmassumption2): Kernel regularity conditions\.
- •Theorem[C\.1](https://arxiv.org/html/2606.19538#A3.SS1): Full statement \(Parts a–d\)\.
- •Subsection[C\.2](https://arxiv.org/html/2606.19538#A3.SS2): Proof of Part \(a\) – Continuous convolution\.
  - –Lemma[1](https://arxiv.org/html/2606.19538#Thmlemma1): Scalar\-identity\-vector product\.
  - –Lemma[2](https://arxiv.org/html/2606.19538#Thmlemma2): Young’s convolution inequality\.
- •Subsection[C\.3](https://arxiv.org/html/2606.19538#A3.SS3): Proof of Part \(b\) – Discrete convolution\.
- •Subsection[C\.4](https://arxiv.org/html/2606.19538#A3.SS4): Proof of Part \(c\) – Strict inclusion\.
  - –Proposition[1](https://arxiv.org/html/2606.19538#Thmproposition1): Convolution is linear\.
  - –Proposition[2](https://arxiv.org/html/2606.19538#Thmproposition2): Witness operator $T$ is nonlinear\.
  - –Proposition[3](https://arxiv.org/html/2606.19538#Thmproposition3): Convolution is translation\-equivariant\.
  - –Proposition[4](https://arxiv.org/html/2606.19538#Thmproposition4): $T$ is not translation\-equivariant\.
- •Subsection[C\.5](https://arxiv.org/html/2606.19538#A3.SS5): Proof of Part \(d\) – With residual\.
- •Subsection[C\.6](https://arxiv.org/html/2606.19538#A3.SS6): Multi\-channel convolution\.
- •Subsection[C\.7](https://arxiv.org/html/2606.19538#A3.SS7): Depthwise separable convolution\.
- •Subsection[C\.8](https://arxiv.org/html/2606.19538#A3.SS8): Dilated \(atrous\) convolution\.
- •Subsection[C\.9](https://arxiv.org/html/2606.19538#A3.SS9): Strided convolution\.
- •Subsection[C\.10](https://arxiv.org/html/2606.19538#A3.SS10): Group convolution\.
- •Subsection[C\.11](https://arxiv.org/html/2606.19538#A3.SS11): Transposed convolution\.
- •Subsection[C\.12](https://arxiv.org/html/2606.19538#A3.SS12): Boundary handling in the convolutional special case\.

Appendix[D](https://arxiv.org/html/2606.19538#A4): Proof of Theorem 2 \(Self\-Attention as a Special Case\)

- •Definition[5](https://arxiv.org/html/2606.19538#Thmdefinition5): Continuous scaled dot\-product attention\.
- •Definition[6](https://arxiv.org/html/2606.19538#Thmdefinition6): Discrete self\-attention\.
- •Definition[7](https://arxiv.org/html/2606.19538#Thmdefinition7): Multi\-head attention\.
- •Assumption[3](https://arxiv.org/html/2606.19538#Thmassumption3): Regularity conditions\.
- •Theorem[D\.1](https://arxiv.org/html/2606.19538#A4.SS1): Full statement \(Parts a–c\)\.
- •Subsection[D\.2](https://arxiv.org/html/2606.19538#A4.SS2): Proof of Part \(a\) – Single\-head continuous attention\.
- •Subsection[D\.3](https://arxiv.org/html/2606.19538#A4.SS3): Discretization – Recovering standard self\-attention\.
- •Subsection[D\.4](https://arxiv.org/html/2606.19538#A4.SS4): Proof of Part \(b\) – Multi\-head attention\.
- •Subsection[D\.5](https://arxiv.org/html/2606.19538#A4.SS5): Strictness Argument 1 \(Unnormalized operators\)\.
  - –Lemma[3](https://arxiv.org/html/2606.19538#Thmlemma3): Attention output bound\.
- •Subsection[D\.6](https://arxiv.org/html/2606.19538#A4.SS6): Strictness Argument 2 \(Permutation equivariance\)\.
- •Subsection[D\.7](https://arxiv.org/html/2606.19538#A4.SS7): Linear attention as a special case \(Proposition[5](https://arxiv.org/html/2606.19538#Thmproposition5)\)\.
- •Subsection[D\.8](https://arxiv.org/html/2606.19538#A4.SS8): Causal \(masked\) attention as a special case \(Proposition[6](https://arxiv.org/html/2606.19538#Thmproposition6)\)\.

Appendix[E](https://arxiv.org/html/2606.19538#A5): Proof of Theorem 3 \(Recurrence as a Special Case\)

- •Definition[9](https://arxiv.org/html/2606.19538#Thmdefinition9): Continuous\-time recurrent system\.
- •Definition[10](https://arxiv.org/html/2606.19538#Thmdefinition10): Discrete\-time RNN\.
- •Definition[11](https://arxiv.org/html/2606.19538#Thmdefinition11): LSTM \(Long Short\-Term Memory\)\.
- •Definition[12](https://arxiv.org/html/2606.19538#Thmdefinition12): Linear State Space Model \(S4\)\.
- •Definition[13](https://arxiv.org/html/2606.19538#Thmdefinition13): Selective SSM \(Mamba\)\.
- •Definition[14](https://arxiv.org/html/2606.19538#Thmdefinition14): ITNet operator \(recurrent setting\)\.
- •Assumption[4](https://arxiv.org/html/2606.19538#Thmassumption4): Regularity for recurrent proofs\.
- •Theorem[E\.1](https://arxiv.org/html/2606.19538#A5.SS1): Full statement \(Parts a–e\)\.
- •Subsection[E\.2](https://arxiv.org/html/2606.19538#A5.SS2): Proof of Part \(a\) – Linear continuous\-time RNN\.
- •Subsection[E\.3](https://arxiv.org/html/2606.19538#A5.SS3): Proof of Part \(a′\) – Nonlinear continuous\-time RNN\.
  - –Subsubsection[E\.3\.1](https://arxiv.org/html/2606.19538#A5.SS3.SSS1): Proof Strategy A: Kernel Construction via Alekseev’s Formula\.
  - –Subsubsection[E\.3\.2](https://arxiv.org/html/2606.19538#A5.SS3.SSS2): Proof Strategy B: Universal Approximation Argument\.
- •Subsection[E\.4](https://arxiv.org/html/2606.19538#A5.SS4): Proof of Part \(b\) – Discrete\-time RNN\.
- •Subsection[E\.5](https://arxiv.org/html/2606.19538#A5.SS5): Proof of Part \(c\) – Linear SSM \(S4\)\.
  - –Remark[12](https://arxiv.org/html/2606.19538#Thmremark12): Relationship to convolution\.
- •Subsection[E\.6](https://arxiv.org/html/2606.19538#A5.SS6): Proof of Part \(d\) – Selective SSM \(Mamba\)\.
- •Subsection[E\.7](https://arxiv.org/html/2606.19538#A5.SS7): LSTM as ITNet \(Proposition[7](https://arxiv.org/html/2606.19538#Thmproposition7)\)\.
- •Subsection[E\.8](https://arxiv.org/html/2606.19538#A5.SS8): GRU as ITNet \(Proposition[8](https://arxiv.org/html/2606.19538#Thmproposition8)\)\.
- •Subsection[E\.9](https://arxiv.org/html/2606.19538#A5.SS9): Strictness Argument 1 – Non\-causal operators\.
- •Subsection[E\.10](https://arxiv.org/html/2606.19538#A5.SS10): Strictness Argument 2 – Parallelism and state dimension\.
- •Subsection[E\.11](https://arxiv.org/html/2606.19538#A5.SS11): Discretization – Recovering Euler and ZOH\.

Appendix[F](https://arxiv.org/html/2606.19538#A6): Proof of Theorem 4 \(Universal Operator Approximation\)

- •Definition[15](https://arxiv.org/html/2606.19538#Thmdefinition15): Continuous nonlinear operator\.
- •Definition[17](https://arxiv.org/html/2606.19538#Thmdefinition17): MLP function class\.
- •Assumption[5](https://arxiv.org/html/2606.19538#Thmassumption5): Standing assumptions\.
- •Theorem[F\.1](https://arxiv.org/html/2606.19538#A6.SS1): Full statement\.
- •Subsection[F\.2](https://arxiv.org/html/2606.19538#A6.SS2): Auxiliary lemmas\.
  - –Lemma[4](https://arxiv.org/html/2606.19538#Thmlemma4): MLP universal approximation\.
  - –Lemma[5](https://arxiv.org/html/2606.19538#Thmlemma5): Chen–Chen operator approximation \(1995\)\.
  - –Lemma[6](https://arxiv.org/html/2606.19538#Thmlemma6): Kernel approximation by MLP\.
- •Subsection[F\.3](https://arxiv.org/html/2606.19538#A6.SS3): Main proof \(Steps 1–4\)\.
- •Subsection[F\.4](https://arxiv.org/html/2606.19538#A6.SS4): Corollary – Strict expressiveness ordering \(Corollary[F\.1](https://arxiv.org/html/2606.19538#A6.Thmcorollary1)\)\.
- •Subsection[F\.5](https://arxiv.org/html/2606.19538#A6.SS5): Quantitative approximation rate \(Proposition[9](https://arxiv.org/html/2606.19538#Thmproposition9)\)\.

Appendix[G](https://arxiv.org/html/2606.19538#A7): Proof of Theorem 5 \(Kernel Recovery Under Translation Symmetry\)

- •Theorem[6](https://arxiv.org/html/2606.19538#Thmtheorem6): Full statement\. Three\-step proof \(translation invariance → gradient averaging → orthogonality annihilation\)\.

Appendix[H](https://arxiv.org/html/2606.19538#A8): Extended Implementation Details

- •Subsection[H\.1](https://arxiv.org/html/2606.19538#A8.SS1): Computational complexity comparison \(Table[9](https://arxiv.org/html/2606.19538#A8.T9)\)\.
- •Subsection[H\.2](https://arxiv.org/html/2606.19538#A8.SS2): Full architectural specification\.
- •Subsection[H\.3](https://arxiv.org/html/2606.19538#A8.SS3): Initialization scheme\.
- •Subsection[H\.4](https://arxiv.org/html/2606.19538#A8.SS4): Regularization and training stability\.
- •Subsection[H\.5](https://arxiv.org/html/2606.19538#A8.SS5): Optimizer configuration \(AdamW\)\.
- •Subsection[H\.6](https://arxiv.org/html/2606.19538#A8.SS6): Statistical Reporting\.
- •Subsection[H\.7](https://arxiv.org/html/2606.19538#A8.SS7): Triton kernel profiling \(Table[12](https://arxiv.org/html/2606.19538#A8.T12)\)\.
- •Subsection[H\.8](https://arxiv.org/html/2606.19538#A8.SS8): IO complexity of tiled ITNet \(Proposition[10](https://arxiv.org/html/2606.19538#Thmproposition10)\)\.
- •Subsection[H\.9](https://arxiv.org/html/2606.19538#A8.SS9): Tiled forward pass \(Algorithm[1](https://arxiv.org/html/2606.19538#alg1)\)\.
- •Subsection[H\.10](https://arxiv.org/html/2606.19538#A8.SS10): Tiled Backward Pass with Gradient Checkpointing \(Algorithm[2](https://arxiv.org/html/2606.19538#alg2)\)\.
- •Subsection[H\.11](https://arxiv.org/html/2606.19538#A8.SS11): Monte Carlo Variance Analysis\.
- •Subsection[H\.12](https://arxiv.org/html/2606.19538#A8.SS12): Low\-Rank Approximation Error Bound\.

Appendix[I](https://arxiv.org/html/2606.19538#A9): Backpropagation for the ITNet Operator

- •Subsection[I\.1](https://arxiv.org/html/2606.19538#A9.SS1): Setup and notation\.
- •Subsection[I\.2](https://arxiv.org/html/2606.19538#A9.SS2): Forward pass\.
- •Subsection[I\.3](https://arxiv.org/html/2606.19538#A9.SS3): Upstream gradients\.
- •Subsection[I\.4](https://arxiv.org/html/2606.19538#A9.SS4): Gradient with respect to kernel outputs\.
- •Subsection[I\.5](https://arxiv.org/html/2606.19538#A9.SS5): Gradient with respect to input features\.
- •Subsection[I\.6](https://arxiv.org/html/2606.19538#A9.SS6): Gradient with respect to kernel MLP parameters\.
- •Subsection[I\.7](https://arxiv.org/html/2606.19538#A9.SS7): Gradient with respect to residual matrix\.
- •Subsection[I\.8](https://arxiv.org/html/2606.19538#A9.SS8): Special case 1 – Convolution\.
- •Subsection[I\.9](https://arxiv.org/html/2606.19538#A9.SS9): Special case 2 – Self\-attention\.
- •Subsection[I\.10](https://arxiv.org/html/2606.19538#A9.SS10): Special case 3 – Linear SSM / S4\.
- •Subsection[I\.11](https://arxiv.org/html/2606.19538#A9.SS11): Special case 4 – LSTM\.
- •Subsection[I\.12](https://arxiv.org/html/2606.19538#A9.SS12): Special case 5 – Mamba\.
- •Subsection[I\.13](https://arxiv.org/html/2606.19538#A9.SS13): Complexity comparison across special cases \(Table[13](https://arxiv.org/html/2606.19538#A9.T13)\)\.

Appendix[J](https://arxiv.org/html/2606.19538#A10): Extended Encoder Details

- •Subsection[J\.1](https://arxiv.org/html/2606.19538#A10.SS1): Graph encoder\.
- •Subsection[J\.2](https://arxiv.org/html/2606.19538#A10.SS2): Multi\-scale image encoder\.
- •Subsection[J\.3](https://arxiv.org/html/2606.19538#A10.SS3): Design principles \(Principles 1–3\)\.

Appendix[K](https://arxiv.org/html/2606.19538#A11): Training Details

- •Subsection[K\.1](https://arxiv.org/html/2606.19538#A11.SS1): ImageNet\-1K training \(Table[14](https://arxiv.org/html/2606.19538#A11.T14)\)\.
- •Subsection[K\.2](https://arxiv.org/html/2606.19538#A11.SS2): GLUE pre\-training and fine\-tuning\.
- •Subsection[K\.3](https://arxiv.org/html/2606.19538#A11.SS3): ModelNet40 training \(Table[15](https://arxiv.org/html/2606.19538#A11.T15)\)\.
- •Subsection[K\.4](https://arxiv.org/html/2606.19538#A11.SS4): VQA v2 and NLVR2 fine\-tuning\.

Appendix[L](https://arxiv.org/html/2606.19538#A12): Detailed Efficiency Analysis

- •Table[16](https://arxiv.org/html/2606.19538#A12.T16): Wall\-clock throughput and peak memory across all benchmarks\.
- •Table[17](https://arxiv.org/html/2606.19538#A12.T17): Rank sweep for low\-rank mode\.
- •Table[18](https://arxiv.org/html/2606.19538#A12.T18): Comparison with efficient attention baselines\.
- •Table[19](https://arxiv.org/html/2606.19538#A12.T19): Memory breakdown for exact, MC, and low\-rank modes\.

Appendix[M](https://arxiv.org/html/2606.19538#A13): Extended Ablations

- •Table[20](https://arxiv.org/html/2606.19538#A13.T20): Kernel MLP width ablation\.
- •Table[21](https://arxiv.org/html/2606.19538#A13.T21): Point cloud encoder ablation\.
- •Table[22](https://arxiv.org/html/2606.19538#A13.T22): Multimodal measure ablation\.
- •Table[23](https://arxiv.org/html/2606.19538#A13.T23): Fourier feature ablation\.
- •Table[24](https://arxiv.org/html/2606.19538#A13.T24): Number of ITNet layers ablation\.

Appendix[N](https://arxiv.org/html/2606.19538#A14): Broader Impact

How to use this appendix: Readers primarily interested in the theoretical unification should consult Appendices [C](https://arxiv.org/html/2606.19538#A3), [D](https://arxiv.org/html/2606.19538#A4), [E](https://arxiv.org/html/2606.19538#A5), [F](https://arxiv.org/html/2606.19538#A6), and [G](https://arxiv.org/html/2606.19538#A7)\. For implementation details and reproducibility, see Appendices [H](https://arxiv.org/html/2606.19538#A8), [K](https://arxiv.org/html/2606.19538#A11), [L](https://arxiv.org/html/2606.19538#A12), and [M](https://arxiv.org/html/2606.19538#A13)\. The backpropagation derivation \(Appendix [I](https://arxiv.org/html/2606.19538#A9)\) is included for readers interested in implementing the ITNet operator from first principles\.

## Appendix A Notation

Tables [6](https://arxiv.org/html/2606.19538#A1.T6), [7](https://arxiv.org/html/2606.19538#A1.T7), and [8](https://arxiv.org/html/2606.19538#A1.T8) summarise all notation used in the proofs that follow\.

Table 6: Notation reference \(Part 1\): spaces, signals, ITNet operator, and kernel parameterization\.

Table 7: Notation reference \(Part 2\): convolution, self\-attention, and recurrence \(Theorems 1–3\)\.

Table 8: Notation reference \(Part 3\): universal approximation, kernel recovery, implementation, backpropagation, and general symbols\.

## Appendix B Related Work

Our work sits at the intersection of four research streams: classical neural architecture families and their unification attempts, neural operator theory, efficient sequence modeling, and multimodal architectures\. We discuss each in turn and clarify how ITNet relates to prior work\.

Classical Architectures\. Convolutional networks \[[53](https://arxiv.org/html/2606.19538#bib.bib1),[52](https://arxiv.org/html/2606.19538#bib.bib20)\] encode locality and translation equivariance through kernels that depend only on relative position\. Despite many improvements \[[41](https://arxiv.org/html/2606.19538#bib.bib14),[62](https://arxiv.org/html/2606.19538#bib.bib22),[96](https://arxiv.org/html/2606.19538#bib.bib79),[25](https://arxiv.org/html/2606.19538#bib.bib23)\], this position\-only structure remains unchanged\. Transformers \[[93](https://arxiv.org/html/2606.19538#bib.bib3)\] instead model global, content\-dependent interactions via attention, but restrict interactions to a bilinear form with softmax normalization and require separate positional encodings \[[93](https://arxiv.org/html/2606.19538#bib.bib3),[84](https://arxiv.org/html/2606.19538#bib.bib33),[71](https://arxiv.org/html/2606.19538#bib.bib34)\]\. Recurrent models \[[44](https://arxiv.org/html/2606.19538#bib.bib2),[13](https://arxiv.org/html/2606.19538#bib.bib24)\] capture sequential dependencies through state evolution but are inherently causal and difficult to parallelize\. Structured state\-space models such as S4 \[[37](https://arxiv.org/html/2606.19538#bib.bib4)\] and Mamba \[[36](https://arxiv.org/html/2606.19538#bib.bib5)\] improve efficiency but retain constrained kernel structures\. ITNet provides a unified view in which convolution, attention, and recurrence arise as special cases of a single kernel\-based operator\.

Efficient Sequence Models\. A large body of work focuses on improving the efficiency of attention\. Linear attention \[[50](https://arxiv.org/html/2606.19538#bib.bib12)\] and Performer \[[15](https://arxiv.org/html/2606.19538#bib.bib60)\] approximate softmax attention via kernel factorization, while sparse variants \[[7](https://arxiv.org/html/2606.19538#bib.bib61),[102](https://arxiv.org/html/2606.19538#bib.bib63),[61](https://arxiv.org/html/2606.19538#bib.bib26),[99](https://arxiv.org/html/2606.19538#bib.bib62)\] restrict attention patterns\. FlashAttention \[[21](https://arxiv.org/html/2606.19538#bib.bib11)\] improves efficiency without approximation through tiling\. Other approaches such as Hyena \[[69](https://arxiv.org/html/2606.19538#bib.bib64)\] and MLP\-Mixer \[[89](https://arxiv.org/html/2606.19538#bib.bib67)\] replace attention with structured alternatives\. These methods improve efficiency but retain fixed interaction forms\. In contrast, ITNet learns the interaction kernel directly and uses Monte Carlo or low\-rank approximations for scalable computation\.

Neural Operator Learning\. Neural operators \[[9](https://arxiv.org/html/2606.19538#bib.bib6),[4](https://arxiv.org/html/2606.19538#bib.bib7),[58](https://arxiv.org/html/2606.19538#bib.bib8),[64](https://arxiv.org/html/2606.19538#bib.bib9)\] study function\-to\-function mappings using kernel\-based architectures, with foundational results establishing universal approximation of nonlinear operators\. Methods such as the Graph Neural Operator \(GNO\) \[[4](https://arxiv.org/html/2606.19538#bib.bib7)\] introduce learnable integral kernels of the form $\int\kappa(x,y,u(x),u(y))\,u(y)\,d\mu(y)$, a signature mathematically identical to ITNet’s operator\. However, GNO was developed for PDE solving and evaluated only on scientific machine learning tasks \($n\approx 10^{4}$\), without establishing connections to CNNs, Transformers, or RNNs\. The Fourier Neural Operator \(FNO\) \[[58](https://arxiv.org/html/2606.19538#bib.bib8)\] restricts the kernel to Fourier space, yielding efficient global convolution but losing content dependence and position\-awareness\. DeepONet \[[64](https://arxiv.org/html/2606.19538#bib.bib9)\] decomposes the operator into branch and trunk networks, imposing a low\-rank structure that is less general than ITNet’s full kernel\. Continuum attention \[[8](https://arxiv.org/html/2606.19538#bib.bib10)\] formalises self\-attention as a continuum integral operator but does not show that convolution or recurrence are also special cases\. ITNet builds on this line of work by using a general, learnable kernel and showing that standard architectures are exact special cases within this framework\.

Unified Architectures\. Several works aim to relate or unify different architectures\. MetaFormer \[[100](https://arxiv.org/html/2606.19538#bib.bib66)\] highlights the importance of the overall structure rather than specific operators\. Prior analyses have shown that attention can express convolution \[[18](https://arxiv.org/html/2606.19538#bib.bib65)\], and other approaches unify models at an algebraic level\. Content\-adaptive variants such as BiFormer \[[104](https://arxiv.org/html/2606.19538#bib.bib78)\], deformable convolution \[[105](https://arxiv.org/html/2606.19538#bib.bib108)\], and dynamic convolution \[[12](https://arxiv.org/html/2606.19538#bib.bib109)\] extend individual architectures but remain within restricted kernel forms\. None of these methods, however, provides a single operator that subsumes all three families\. ITNet learns the interaction rule directly, yielding a unified formulation that strictly contains convolution, attention, and recurrence\.

Multimodal and Domain\-Agnostic Architectures\. Perceiver \[[49](https://arxiv.org/html/2606.19538#bib.bib68)\] and Perceiver IO \[[48](https://arxiv.org/html/2606.19538#bib.bib58)\] use cross\-attention between inputs and a fixed set of latent tokens, followed by latent self\-attention\. This introduces a compression bottleneck, as all input information must be projected into a limited latent array before further interaction\. From the ITNet perspective, this corresponds to a restricted, position\-blind, softmax\-normalized kernel\. Most other multimodal methods rely on modality\-specific encoders with explicit fusion mechanisms\. For example, Flamingo \[[2](https://arxiv.org/html/2606.19538#bib.bib54)\] interleaves frozen vision features with language models via gated cross\-attention, BLIP \[[56](https://arxiv.org/html/2606.19538#bib.bib51)\] and BLIP\-2 \[[55](https://arxiv.org/html/2606.19538#bib.bib53)\] introduce querying transformers to bridge frozen encoders, and ALBEF \[[57](https://arxiv.org/html/2606.19538#bib.bib52)\], METER \[[27](https://arxiv.org/html/2606.19538#bib.bib50)\], and UNITER \[[11](https://arxiv.org/html/2606.19538#bib.bib75)\] employ various cross\-modal fusion strategies\. More recent systems such as GPT\-4V \[[1](https://arxiv.org/html/2606.19538#bib.bib69)\] integrate vision into large language models through dedicated architectural components\. In contrast, ITNet operates on a shared domain by combining positions across modalities, without latent compression or dedicated fusion modules; cross\-modal interactions are learned directly through the kernel, providing a richer mechanism than standard attention\.

## Appendix C Proof of Theorem 1: Convolution as a Special Case of ITNet

###### Definition 2 \(Continuous convolution\)\.

Let $\Omega=\mathbb{R}^{s}$ (or the torus $\mathbb{T}^{s}=\mathbb{R}^{s}/\mathbb{Z}^{s}$ for periodic domains). Let $\mu$ be the Lebesgue measure on $\mathbb{R}^{s}$, denoted $dy$. Let $w\in L^{1}(\mathbb{R}^{s},\mathbb{R})$ be an integrable scalar filter function. The *continuous convolution* of a signal $u\in L^{1}(\mathbb{R}^{s},\mathbb{R}^{d})$ with filter $w$ is:

$$(w\ast u)(x)\;=\;\int_{\mathbb{R}^{s}}w(x-y)\,u(y)\,dy,\qquad x\in\mathbb{R}^{s}. \tag{15}$$

The integral is taken component-wise: $[(w\ast u)(x)]_{i}=\int_{\mathbb{R}^{s}}w(x-y)\,[u(y)]_{i}\,dy$ for $i=1,\ldots,d$.

We use the Lebesgue measure \[[77](https://arxiv.org/html/2606.19538#bib.bib107)\] because it is the *unique* (up to positive scaling) translation-invariant $\sigma$-finite Borel measure on $\mathbb{R}^{s}$ (Haar’s theorem \[[31](https://arxiv.org/html/2606.19538#bib.bib98)\]). Translation invariance of the measure is what makes convolution translation-equivariant: $(w\ast\tau_{a}u)(x)=(w\ast u)(x-a)$, where $(\tau_{a}u)(y)=u(y-a)$ is the translation operator. By uniqueness, any measure that is not a constant multiple of the Lebesgue measure breaks this equivariance in general, and equivariance is the defining property of CNNs.

For example, using a non-uniform measure $d\mu(y)=\rho(y)\,dy$ with density $\rho$ yields a weighted convolution of the form $\int w(x-y)\,u(y)\,\rho(y)\,dy$, which is not translation-equivariant unless $\rho$ is constant. Since translation equivariance is the defining inductive bias of convolutional networks, the Lebesgue measure is the natural and consistent choice.
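The effect of the measure on equivariance can be checked numerically. The following is a minimal sketch, assuming a one-dimensional periodic grid (a discrete stand-in for the torus) and hypothetical helper names (`weighted_conv`, `shift`, `is_equivariant`) that do not appear in the paper. It compares a uniform density with a non-constant one.

```python
def weighted_conv(w, u, rho):
    """Circular weighted convolution: out[x] = sum_y w[(x - y) % n] * u[y] * rho[y]."""
    n = len(u)
    return [sum(w[(x - y) % n] * u[y] * rho[y] for y in range(n)) for x in range(n)]


def shift(u, a):
    """Translation operator on the periodic grid: (tau_a u)[y] = u[(y - a) % n]."""
    n = len(u)
    return [u[(y - a) % n] for y in range(n)]


def is_equivariant(w, u, rho, a, tol=1e-12):
    """Check whether conv(tau_a u) equals tau_a conv(u) for a single shift a."""
    lhs = weighted_conv(w, shift(u, a), rho)
    rhs = shift(weighted_conv(w, u, rho), a)
    return all(abs(p - q) <= tol for p, q in zip(lhs, rhs))


# Illustrative values only (not taken from the paper).
w = [0.5, 0.25, 0.0, 0.0, 0.0, 0.25]
u = [1.0, -2.0, 3.0, 0.5, 4.0, -1.0]
uniform = [1.0] * 6                          # constant density: uniform measure
nonuniform = [1.0, 2.0, 1.0, 1.0, 1.0, 1.0]  # one grid point carries extra mass

equivariant_uniform = is_equivariant(w, u, uniform, a=2)
equivariant_nonuniform = is_equivariant(w, u, nonuniform, a=2)
print(equivariant_uniform, equivariant_nonuniform)  # True False
```

With the constant density the shifted output matches exactly; with the non-constant density the extra mass stays attached to a fixed grid point, so shifting the input no longer simply shifts the output.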

###### Definition 3 \(Discrete convolution on a regular grid\)\.

Let $h>0$ be the grid spacing and $\Omega_{h}=h\mathbb{Z}^{s}=\{mh:m\in\mathbb{Z}^{s}\}$ the regular grid. Define the *$k$-tap neighborhood*:

$$\mathcal{N}\;=\;\bigl\{m\in\mathbb{Z}^{s}:\|m\|_{\infty}\leq\lfloor k/2\rfloor\bigr\}, \tag{16}$$

which contains $(2\lfloor k/2\rfloor+1)^{s}$ lattice points (equal to $k^{s}$ when $k$ is odd). Given filter coefficients $F=\{f_{m}\}_{m\in\mathcal{N}}\subset\mathbb{R}$, the *discrete $k$-tap convolution* is:

$$(F\ast u)(x)\;=\;\sum_{m\in\mathcal{N}}f_{m}\cdot u(x-mh),\qquad x\in\Omega_{h}. \tag{17}$$

We use the $\ell^{\infty}$ norm, $\|m\|_{\infty}=\max_{i}|m_{i}|$, to define the neighborhood, as it induces a *hypercube* structure that exactly matches the receptive field of standard CNN filters.

In contrast, using the $\ell^{2}$ norm yields a *ball-shaped* neighborhood, while the $\ell^{1}$ norm produces a *diamond-shaped* neighborhood, neither of which aligns with standard convolutional implementations. While the theoretical construction applies to any finite neighborhood $\mathcal{N}\subset\mathbb{Z}^{s}$, the choice of $\ell^{\infty}$ ensures direct correspondence with practical CNN architectures.
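To make the difference between the three norms concrete, the sketch below enumerates the lattice neighborhood of Eq. (16) under each norm. The function name `neighborhood` and the choice $k=7$, $s=2$ are illustrative assumptions, not part of the paper.

```python
from itertools import product


def neighborhood(k, s, norm="linf"):
    """Lattice offsets m in Z^s with ||m|| <= floor(k/2) under the chosen norm."""
    r = k // 2
    tests = {
        "linf": lambda m: max(abs(c) for c in m) <= r,   # hypercube
        "l1": lambda m: sum(abs(c) for c in m) <= r,     # diamond
        "l2": lambda m: sum(c * c for c in m) <= r * r,  # ball
    }
    inside = tests[norm]
    # Every candidate lies in the bounding hypercube [-r, r]^s.
    return [m for m in product(range(-r, r + 1), repeat=s) if inside(m)]


k, s = 7, 2
sizes = {name: len(neighborhood(k, s, name)) for name in ("linf", "l1", "l2")}
print(sizes)  # {'linf': 49, 'l1': 25, 'l2': 29}
```

Only the $\ell^{\infty}$ count matches $(2\lfloor k/2\rfloor+1)^{s}=49$, the number of taps in a standard $7\times 7$ filter.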

###### Definition 4 \(ITNet operator\)\.

Let $(\Omega,\mu)$ be a measure space. The *Integral Transform Network (ITNet) operator* $\mathcal{K}_{\theta}:C(\Omega,\mathbb{R}^{d})\to C(\Omega,\mathbb{R}^{d})$ is:

$$(\mathcal{K}_{\theta}[u])(x)\;=\;\underbrace{\int_{\Omega}\kappa_{\theta}(x,\,y,\,u(x),\,u(y))\,u(y)\,d\mu(y)}_{\text{integral transform (global interaction)}}\;+\;\underbrace{W_{\theta}\,u(x)}_{\text{local linear residual}}, \tag{18}$$

where:

- •$\kappa_{\theta}:\mathbb{R}^{s}\times\mathbb{R}^{s}\times\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}^{d\times d}$ is a learnable matrix-valued kernel. The four arguments are: query position $x$, key position $y$, query features $u(x)$, key features $u(y)$.
- •The kernel output $\kappa_{\theta}(\cdots)\in\mathbb{R}^{d\times d}$ is a matrix that is multiplied with the key feature vector $u(y)\in\mathbb{R}^{d}$ to produce a $d$-dimensional contribution.
- •$W_{\theta}\in\mathbb{R}^{d\times d}$ is a learnable weight matrix that acts locally (pointwise) on the query feature $u(x)$.
- •$\theta$ collectively denotes all learnable parameters (the kernel network weights and $W_{\theta}$).

The integral term alone cannot represent the identity mapping $u\mapsto u$. For $\kappa_{\theta}=\mathbf{I}_{d}$ and finite $\mu(\Omega)$, $\int_{\Omega}\mathbf{I}_{d}\,u(y)\,d\mu(y)=c\cdot\bar{u}$, where $c=\mu(\Omega)$ and $\bar{u}$ is the $\mu$-average of $u$. This value is independent of $x$ and equals $u(x)$ only if $u$ is constant. The residual term $W_{\theta}u(x)$ enables exact identity representation ($W_{\theta}=\mathbf{I}_{d}$), which is crucial for stable deep architectures.
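The role of the two terms in Eq. (18) can be illustrated on a finite point set. This is a minimal sketch under stated assumptions: the domain is replaced by four sample points with equal quadrature weights, the kernel is passed as a plain Python function rather than an MLP, and all names (`itnet_apply`, `const_kernel`, `zero_kernel`) are hypothetical.

```python
def matvec(M, v):
    """Dense matrix-vector product for nested lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def zeros(d):
    return [[0.0] * d for _ in range(d)]


def itnet_apply(kernel, W, xs, u, weights):
    """Discretized Eq. (18): out_i = sum_j weights_j * kernel(x_i, x_j, u_i, u_j) u_j + W u_i."""
    out = []
    for xi, ui in zip(xs, u):
        acc = matvec(W, ui)  # local linear residual
        for xj, uj, mj in zip(xs, u, weights):
            contrib = matvec(kernel(xi, xj, ui, uj), uj)
            acc = [a + mj * c for a, c in zip(acc, contrib)]
        out.append(acc)
    return out


def const_kernel(x, y, a, b):
    return identity(2)  # kappa = I_d, ignores all four arguments


def zero_kernel(x, y, a, b):
    return zeros(2)


# Illustrative data only: 4 points, d = 2, total mass mu(Omega) = 1.
xs = [0.0, 0.25, 0.5, 0.75]
u = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [-1.0, 1.0]]
weights = [0.25] * 4

integral_only = itnet_apply(const_kernel, zeros(2), xs, u, weights)
residual_only = itnet_apply(zero_kernel, identity(2), xs, u, weights)
print(integral_only[0])  # [0.75, 1.0]
```

With $\kappa_{\theta}=\mathbf{I}_{d}$ and no residual, every output row equals the same averaged vector, so the input cannot be recovered; with $W_{\theta}=\mathbf{I}_{d}$ and a zero kernel, the output reproduces `u` exactly.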

###### Assumption 2 \(Kernel regularity\)\.

Throughout this proof:

1. (i) $\kappa_{\theta}$ is jointly measurable in all four arguments.
2. (ii) There exists $C_{\kappa}<\infty$ such that $\|\kappa_{\theta}(x,y,a,b)\|_{\mathrm{op}}\leq C_{\kappa}$ for all $(x,y,a,b)$ in the relevant domain.
3. (iii) For Part (a), $w\in L^{1}(\mathbb{R}^{s})$ and $u\in L^{1}(\mathbb{R}^{s},\mathbb{R}^{d})$.

### C\.1\. Main Theorem

Theorem C.1 (ITNet $\supset$ Convolution, full statement). Let all notation be as in Definitions [2](https://arxiv.org/html/2606.19538#Thmdefinition2)–[4](https://arxiv.org/html/2606.19538#Thmdefinition4) and Assumption [2](https://arxiv.org/html/2606.19538#Thmassumption2).

Part (a) - Continuous convolution. If the kernel and residual are chosen as:

$$\kappa_{\theta}(x,y,u(x),u(y))=w(x-y)\cdot\mathbf{I}_{d},\qquad W_{\theta}=0, \tag{19}$$

then $(\mathcal{K}_{\theta}[u])(x)=(w\ast u)(x)$ for all $x\in\mathbb{R}^{s}$ and all $u\in L^{1}(\mathbb{R}^{s},\mathbb{R}^{d})$.

Part (b) - Discrete $k$-tap convolution. On the grid $\Omega_{h}=h\mathbb{Z}^{s}$ with atomic measure $\mu=\sum_{y\in h\mathbb{Z}^{s}}h^{s}\delta_{y}$, if:

$$\kappa_{\theta}(x,y,u(x),u(y))=\frac{f_{(x-y)/h}}{h^{s}}\cdot\mathbf{I}_{d}\cdot\mathbf{1}_{\{(x-y)/h\in\mathcal{N}\}},\qquad W_{\theta}=0, \tag{20}$$

then $(\mathcal{K}_{\theta}[u])(x)=(F\ast u)(x)$ for all $x\in\Omega_{h}$.

Part (c) - Strict inclusion. There exists a continuous operator $T:L^{2}(\mathbb{R}^{s},\mathbb{R}^{d})\to L^{2}(\mathbb{R}^{s},\mathbb{R}^{d})$ that is representable by ITNet but not by any convolution. Hence $\mathrm{Conv}\subsetneq\mathrm{ITNet}$.

Part (d) - With residual ($1\times 1$ convolution). If instead $W_{\theta}\neq 0$, then:

$$(\mathcal{K}_{\theta}[u])(x)=(w\ast u)(x)+W_{\theta}u(x), \tag{21}$$

which is a standard convolution summed with a $1\times 1$ convolution (pointwise linear transform) applied to the same input, as used in ResNets \[[41](https://arxiv.org/html/2606.19538#bib.bib14)\] and ConvNeXt \[[62](https://arxiv.org/html/2606.19538#bib.bib22)\].

### C\.2\. Proof of Part \(a\) \- Continuous Convolution

Proof of Theorem[C\.1](https://arxiv.org/html/2606.19538#A3.SS1), Part \(a\)

We show that substituting $\kappa_{\theta}(x,y,u(x),u(y))=w(x-y)\cdot\mathbf{I}_{d}$ and $W_{\theta}=0$ into the ITNet operator yields exactly the continuous convolution $(w\ast u)(x)$.

Step 1. Verify the kernel choice is valid.

We must first confirm that the kernel $\kappa_{\theta}(x,y,u(x),u(y))=w(x-y)\cdot\mathbf{I}_{d}$ satisfies Definition [4](https://arxiv.org/html/2606.19538#Thmdefinition4) and Assumption [2](https://arxiv.org/html/2606.19538#Thmassumption2).

1. (a) Type: The kernel must map $\mathbb{R}^{s}\times\mathbb{R}^{s}\times\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}^{d\times d}$. Here, $w(x-y)\in\mathbb{R}$ is a scalar and $\mathbf{I}_{d}\in\mathbb{R}^{d\times d}$ is a matrix, so $w(x-y)\cdot\mathbf{I}_{d}\in\mathbb{R}^{d\times d}$. ✓
2. (b) Measurability: The map $(x,y)\mapsto x-y$ is continuous (hence Borel measurable). The composition $w\circ((x,y)\mapsto x-y)$ is measurable because $w$ is measurable (it is $L^{1}$, hence measurable) and the composition of measurable functions is measurable. Multiplying by the constant matrix $\mathbf{I}_{d}$ preserves measurability. ✓
3. (c) Content-independence: The kernel depends on $(x,y)$ but *not* on $u(x)$ or $u(y)$. This is a *special case* of the general ITNet kernel, not a violation of it: the general kernel is allowed to depend on all four arguments, and choosing not to depend on some of them is a valid restriction. ✓
4. (d) Boundedness: If $w$ is bounded ($w\in L^{\infty}$), then $\|\kappa_{\theta}(x,y,a,b)\|_{\mathrm{op}}=|w(x-y)|\leq\|w\|_{\infty}=C_{\kappa}<\infty$, satisfying Assumption [2](https://arxiv.org/html/2606.19538#Thmassumption2)(ii). For the more general case $w\in L^{1}$ (possibly unbounded), the integral still converges by Young’s inequality, but the kernel itself may not be uniformly bounded; this can be handled by relaxing the boundedness assumption to integrability conditions on $w$ and $u$, which is standard in convolution theory. ✓

Step 2. Substitute into the ITNet operator.

Insert $\kappa_{\theta}(x,y,u(x),u(y))=w(x-y)\cdot\mathbf{I}_{d}$ and $W_{\theta}=0$ into Eq. ([18](https://arxiv.org/html/2606.19538#A3.E18)):

$$\begin{aligned}(\mathcal{K}_{\theta}[u])(x)&=\int_{\mathbb{R}^{s}}\kappa_{\theta}(x,y,u(x),u(y))\,u(y)\,dy+\cancelto{0}{W_{\theta}\,u(x)}\\&=\int_{\mathbb{R}^{s}}\bigl[w(x-y)\cdot\mathbf{I}_{d}\bigr]\,u(y)\,dy.\end{aligned} \tag{22}$$

This is a direct substitution with no algebraic manipulation. The residual term vanishes because $W_{\theta}=0$. We use $d\mu(y)=dy$ (Lebesgue measure) as specified in Part (a). The integral is over $\mathbb{R}^{s}$ because $\Omega=\mathbb{R}^{s}$ in Definition [2](https://arxiv.org/html/2606.19538#Thmdefinition2).

Step 3. Simplify the matrix-vector product $[\alpha\mathbf{I}_{d}]v$.

We need a precise identity for the product of a scalar-times-identity matrix with a vector.

###### Lemma 1 \(Scalar\-identity\-vector product\)\.

For any scalar $\alpha\in\mathbb{R}$ and vector $v\in\mathbb{R}^{d}$:

$$[\alpha\cdot\mathbf{I}_{d}]\,v\;=\;\alpha\cdot v. \tag{23}$$

###### Proof of Lemma[1](https://arxiv.org/html/2606.19538#Thmlemma1)\.

By definition of matrix-vector multiplication:

$$\bigl([\alpha\mathbf{I}_{d}]v\bigr)_{i}=\sum_{j=1}^{d}[\alpha\mathbf{I}_{d}]_{ij}\,[v]_{j}=\sum_{j=1}^{d}\alpha\delta_{ij}\,v_{j}=\alpha\cdot v_{i},\quad i=1,\ldots,d. \tag{24}$$

The second equality uses $[\alpha\mathbf{I}_{d}]_{ij}=\alpha\delta_{ij}$ (the identity matrix has $1$ on the diagonal and $0$ elsewhere, scaled by $\alpha$). The third equality uses $\sum_{j=1}^{d}\delta_{ij}v_{j}=v_{i}$ (the Kronecker delta selects the $i$-th term of the sum). Since this holds for all $i$, the vector equation $[\alpha\mathbf{I}_{d}]v=\alpha v$ follows. ∎

Applying Lemma [1](https://arxiv.org/html/2606.19538#Thmlemma1) with $\alpha=w(x-y)$ and $v=u(y)$:

$$\bigl[w(x-y)\cdot\mathbf{I}_{d}\bigr]\,u(y)\;=\;w(x-y)\cdot u(y). \tag{25}$$

The ITNet kernel is *matrix-valued* ($\mathbb{R}^{d\times d}$), and the product $\kappa_{\theta}(\cdots)\cdot u(y)$ is a matrix-vector product, not a scalar-vector product. The lemma confirms that for the specific choice $\kappa_{\theta}=w(x-y)\cdot\mathbf{I}_{d}$, the matrix-vector product reduces to scalar multiplication. This is an *exact* algebraic identity, with no approximation.

We use $\mathbf{I}_{d}$ (the identity matrix). If we instead chose $\kappa_{\theta}=w(x-y)\cdot A$ for a fixed $A\neq\mathbf{I}_{d}$, we would get $\int w(x-y)Au(y)\,dy=A(w\ast u)(x)$, which is convolution followed by a fixed linear transform. This is still a valid special case, but $A=\mathbf{I}_{d}$ gives the simplest and most transparent correspondence to standard convolution. The general case $A\neq\mathbf{I}_{d}$ corresponds to convolution composed with a $1\times 1$ convolution; Part (d) addresses the related case in which a pointwise linear transform enters additively through the residual $W_{\theta}$.
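The identity for a fixed matrix $A$ quoted above follows from linearity of the integral. As a short worked step (a sketch, assuming $A$ does not depend on $y$ and that each component integral converges, which is what Step 4 establishes):

```latex
\begin{aligned}
\Bigl[\int_{\mathbb{R}^{s}} w(x-y)\,A\,u(y)\,dy\Bigr]_{i}
  &= \int_{\mathbb{R}^{s}} w(x-y)\sum_{j=1}^{d} A_{ij}\,u_{j}(y)\,dy \\
  &= \sum_{j=1}^{d} A_{ij}\int_{\mathbb{R}^{s}} w(x-y)\,u_{j}(y)\,dy \\
  &= \bigl[A\,(w\ast u)(x)\bigr]_{i}, \qquad i=1,\ldots,d.
\end{aligned}
```

The finite sum over $j$ can be exchanged with the integral because each of the $d$ component integrals is finite. Setting $A=\mathbf{I}_{d}$ recovers Eq. (25).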

Step 4. Establish existence and finiteness of the integral.

Substituting ([25](https://arxiv.org/html/2606.19538#A3.E25)) into ([22](https://arxiv.org/html/2606.19538#A3.E22)):

$$(\mathcal{K}_{\theta}[u])(x)\;=\;\int_{\mathbb{R}^{s}}w(x-y)\cdot u(y)\,dy. \tag{26}$$

Before identifying this as a convolution, we must verify that the integral converges.

###### Lemma 2 \(Young’s convolution inequality \[[31](https://arxiv.org/html/2606.19538#bib.bib98),[83](https://arxiv.org/html/2606.19538#bib.bib99)\]\)\.

Let $1\leq p,q,r\leq\infty$ with $\frac{1}{p}+\frac{1}{q}=1+\frac{1}{r}$. If $f\in L^{p}(\mathbb{R}^{s})$ and $g\in L^{q}(\mathbb{R}^{s})$, then $f\ast g\in L^{r}(\mathbb{R}^{s})$ and $\|f\ast g\|_{L^{r}}\leq\|f\|_{L^{p}}\|g\|_{L^{q}}$.

Take $p=q=r=1$, which satisfies $\frac{1}{p}+\frac{1}{q}=1+\frac{1}{r}$. Since $w\in L^{1}(\mathbb{R}^{s})$ and each component $u_{i}\in L^{1}(\mathbb{R}^{s})$ (because $u\in L^{1}(\mathbb{R}^{s},\mathbb{R}^{d})$), Young’s inequality gives:

$$\|w\ast u_{i}\|_{L^{1}(\mathbb{R}^{s})}\leq\|w\|_{L^{1}(\mathbb{R}^{s})}\cdot\|u_{i}\|_{L^{1}(\mathbb{R}^{s})}<\infty. \tag{27}$$

Hence $\int_{\mathbb{R}^{s}}|w(x-y)|\cdot|u_{i}(y)|\,dy<\infty$ for almost every $x$ and each component $i=1,\ldots,d$. The vector-valued integral $\int w(x-y)u(y)\,dy$ is interpreted component-wise (as a Bochner integral in $\mathbb{R}^{d}$), and each component converges absolutely.

One might use Hölder’s inequality instead of Young’s. Hölder gives $|\int fg|\leq\|f\|_{p}\|g\|_{q}$ for conjugate exponents, but this bounds a *single* integral, not a convolution (which is a family of integrals parameterised by $x$). Young’s inequality controls the $L^{r}$ norm of the whole function $x\mapsto(f\ast g)(x)$, not just its value at a single $x$. This is why Young’s is the correct tool here.
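The $p=q=r=1$ case of Young’s inequality has a direct discrete analogue for finitely supported sequences, which is easy to check numerically. The sketch below is an illustration on $\mathbb{Z}$ with counting measure, not the Lebesgue statement used in the proof; the function names and sample sequences are assumptions made for the example.

```python
def conv_full(f, g):
    """Full discrete convolution of two finitely supported sequences."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out


def l1_norm(seq):
    return sum(abs(v) for v in seq)


# Mixed signs: cancellation makes the inequality strict.
f = [0.5, -1.0, 2.0]
g = [1.0, 3.0, -2.0, 0.5]
lhs = l1_norm(conv_full(f, g))
rhs = l1_norm(f) * l1_norm(g)
print(lhs, rhs)  # 16.75 22.75

# Nonnegative sequences: no cancellation, so the bound is attained.
lhs_pos = l1_norm(conv_full([1.0, 2.0], [3.0, 4.0]))
rhs_pos = l1_norm([1.0, 2.0]) * l1_norm([3.0, 4.0])
print(lhs_pos, rhs_pos)  # 21.0 21.0
```

The bound holds for the whole output sequence at once, which mirrors the point made above: the inequality controls the norm of the convolution as a function, not a single value.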

Step 5. Identify the convolution.

By Definition [2](https://arxiv.org/html/2606.19538#Thmdefinition2), the integral in ([26](https://arxiv.org/html/2606.19538#A3.E26)) is exactly the continuous convolution:

$$\int_{\mathbb{R}^{s}}w(x-y)\cdot u(y)\,dy\;\stackrel{\text{Def. 2}}{=}\;(w\ast u)(x). \tag{28}$$

Combining Steps 2–5:

$$\boxed{(\mathcal{K}_{\theta}[u])(x)\;=\;(w\ast u)(x)\qquad\text{for all }x\in\mathbb{R}^{s},\;u\in L^{1}(\mathbb{R}^{s},\mathbb{R}^{d}).} \tag{29}$$

This completes the proof of Part (a).

### C\.3\. Proof of Part \(b\) \- Discrete Convolution

###### Proof of Theorem[C\.1](https://arxiv.org/html/2606.19538#A3.SS1), Part \(b\)\.

We show that on a discrete grid, the ITNet operator with appropriately scaled kernel coefficients recovers exact $k$-tap discrete convolution.

Step 1. Convert the integral to a discrete sum.

On the grid $\Omega_{h}=h\mathbb{Z}^{s}$ with atomic measure $\mu=\sum_{y\in h\mathbb{Z}^{s}}h^{s}\delta_{y}$, the integral $\int f\,d\mu$ reduces to a sum by the definition of integration against an atomic measure:

$$\int_{\Omega_{h}}f(y)\,d\mu(y)\;=\;\sum_{y\in h\mathbb{Z}^{s}}h^{s}\cdot f(y). \tag{30}$$

An atomic (weighted counting) measure $\mu=\sum_{y}w_{y}\delta_{y}$ satisfies $\int f\,d\mu=\sum_{y}w_{y}f(y)$ by definition of integration with respect to a discrete measure \[[77](https://arxiv.org/html/2606.19538#bib.bib107)\]; here $w_{y}$ denotes the weight of the atom at $y$, not the filter $w$. In our setting, the weights $w_{y}=h^{s}$ correspond to the volume of the Voronoi cell around each grid point. Thus, $\mu$ can be interpreted as a midpoint quadrature approximation of the Lebesgue measure \[[77](https://arxiv.org/html/2606.19538#bib.bib107)\]. As $h\to 0$, the discrete measure $\mu$ converges weakly to the Lebesgue measure via the Riemann sum theorem \[[79](https://arxiv.org/html/2606.19538#bib.bib101)\], in the sense that $\int f\,d\mu\to\int f\,dy$ for continuous, compactly supported $f$, ensuring consistency between Parts (a) and (b).
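The quadrature interpretation can be checked on a simple example. The following is a minimal sketch for $s=1$, assuming a hypothetical compactly supported test function `bump` with known integral $4/3$; it evaluates $\int f\,d\mu=\sum_{y\in h\mathbb{Z}}h\,f(y)$ for decreasing grid spacings.

```python
def grid_integral(f, h, radius=2.0):
    """Integral of f against mu = sum_y h * delta_y on hZ (s = 1), truncated to |y| <= radius."""
    m_max = int(round(radius / h))
    return sum(h * f(m * h) for m in range(-m_max, m_max + 1))


def bump(x):
    """Continuous, compactly supported test function; its exact integral is 4/3."""
    return max(0.0, 1.0 - x * x)


exact = 4.0 / 3.0
spacings = (0.5, 0.25, 0.125)
errors = [abs(grid_integral(bump, h) - exact) for h in spacings]
for h, err in zip(spacings, errors):
    print(f"h = {h}: error = {err:.4f}")
```

For this particular function the error works out to $h^{2}/3$, so each halving of $h$ reduces it by a factor of four, consistent with the weak convergence described above.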

Applying ([30](https://arxiv.org/html/2606.19538#A3.E30)) to the ITNet operator ([18](https://arxiv.org/html/2606.19538#A3.E18)) with $W_{\theta}=0$:

$$(\mathcal{K}_{\theta}[u])(x)\;=\;\sum_{y\in h\mathbb{Z}^{s}}h^{s}\cdot\kappa_{\theta}(x,y,u(x),u(y))\,u(y). \tag{31}$$

Step 2. Substitute the discrete kernel.

Insert the kernel from ([20](https://arxiv.org/html/2606.19538#A3.E20)):

$$\kappa_{\theta}(x,y,u(x),u(y))\;=\;\frac{f_{(x-y)/h}}{h^{s}}\cdot\mathbf{I}_{d}\cdot\mathbf{1}_{\{(x-y)/h\in\mathcal{N}\}}. \tag{32}$$

Substituting into ([31](https://arxiv.org/html/2606.19538#A3.E31)):

$$(\mathcal{K}_{\theta}[u])(x)\;=\;\sum_{y\in h\mathbb{Z}^{s}}h^{s}\cdot\frac{f_{(x-y)/h}}{h^{s}}\cdot\mathbf{I}_{d}\cdot u(y)\cdot\mathbf{1}_{\{(x-y)/h\in\mathcal{N}\}}. \tag{33}$$

The factor $1/h^{s}$ compensates for the $h^{s}$ term arising from discretisation, ensuring consistency between continuous kernels and discrete convolutions. In standard numerical analysis, filter coefficients $f_{m}$ represent total weight at offset $m$, while the continuous kernel $\kappa=f_{m}/h^{s}$ represents weight per unit volume, so that density multiplied by volume recovers the discrete weights. Without this normalization, one obtains $(\mathcal{K}_{\theta}[u])(x)=\sum_{m}h^{s}f_{m}u(x-mh)$, introducing an extra factor of $h^{s}$ and deviating from standard implementations. The $1/h^{s}$ factor therefore ensures exact correspondence with discrete convolution.

Step 3. Cancel the measure weight.

In each term of the sum, $h^s$ from the measure and $1/h^s$ from the kernel multiply to give:

$$h^s\cdot\frac{1}{h^s}=1. \tag{34}$$

Applying Lemma [1](https://arxiv.org/html/2606.19538#Thmlemma1) to simplify $[f_{(x-y)/h}\cdot\mathbf{I}_d]\,u(y)=f_{(x-y)/h}\cdot u(y)$:

$$(\mathcal{K}_\theta[u])(x) \;=\; \sum_{y\in h\mathbb{Z}^s} f_{(x-y)/h}\cdot u(y)\cdot\mathbf{1}_{\{(x-y)/h\in\mathcal{N}\}}. \tag{35}$$
Step 4. Change the summation variable.

Define $m=(x-y)/h$, equivalently $y=x-mh$. Since $x\in h\mathbb{Z}^s$ and $y\in h\mathbb{Z}^s$, we have $m=(x-y)/h\in\mathbb{Z}^s$. As $y$ ranges over $h\mathbb{Z}^s$, $m$ ranges over $\mathbb{Z}^s$, so the correspondence is a bijection.

The indicator $\mathbf{1}_{\{(x-y)/h\in\mathcal{N}\}}=\mathbf{1}_{\{m\in\mathcal{N}\}}$ restricts the sum to $m\in\mathcal{N}$:

$$(\mathcal{K}_\theta[u])(x) \;=\; \sum_{m\in\mathcal{N}} f_m\cdot u(x-mh). \tag{36}$$

The change of variable $m=(x-y)/h$ is a bijection from $h\mathbb{Z}^s$ to $\mathbb{Z}^s$. In a sum (as opposed to an integral), no Jacobian correction is needed because we are counting terms, not transforming a measure. This is the discrete analogue of the substitution $\tau=x-y$ in the continuous convolution integral.

Step 5. Identify the discrete convolution.

By Definition [3](https://arxiv.org/html/2606.19538#Thmdefinition3):

$$\sum_{m\in\mathcal{N}} f_m\cdot u(x-mh) \;\overset{\text{Def. 3}}{=}\; (F\ast u)(x). \tag{37}$$

Combining Steps 1–5:

$$\boxed{(\mathcal{K}_\theta[u])(x) \;=\; (F\ast u)(x)\qquad\text{for all } x\in h\mathbb{Z}^s.} \tag{38}$$

∎
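The cancellation in Steps 1–5 can be checked numerically. The following is a minimal sketch in one dimension ($s=1$), assuming a finitely supported scalar signal stored by integer grid index; the function names `itnet_discrete` and `conv_discrete` are illustrative and do not come from the paper.

```python
def itnet_discrete(u, f, h, x_idx):
    """Evaluate the weighted sum of Eq. (33) at x = x_idx * h for s = 1.

    u: dict mapping integer grid index -> scalar feature (zero elsewhere)
    f: dict mapping integer offset m in N -> filter tap f_m
    """
    s = 1
    total = 0.0
    for y_idx, u_y in u.items():
        m = x_idx - y_idx                # (x - y) / h as an integer offset
        if m in f:                       # indicator 1{(x - y)/h in N}
            kappa = f[m] / h**s          # kernel density, Eq. (32)
            total += h**s * kappa * u_y  # measure weight h^s times kernel
    return total


def conv_discrete(u, f, x_idx):
    """Reference k-tap convolution, Eq. (36): sum_m f_m * u(x - m h)."""
    return sum(f_m * u.get(x_idx - m, 0.0) for m, f_m in f.items())


u = {i: float(i * i) for i in range(8)}  # toy signal on 8 grid points
f = {-1: 0.25, 0: 0.5, 1: 0.25}          # 3-tap filter, N = {-1, 0, 1}
h = 0.1                                  # grid spacing (arbitrary)
itnet_out = [itnet_discrete(u, f, h, i) for i in range(8)]
conv_out = [conv_discrete(u, f, i) for i in range(8)]
```

Under these assumptions the two lists agree up to floating-point rounding for any choice of `h`, which is the content of Eq. (38).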

### C\.4\. Proof of Part \(c\) \- Strict Inclusion

###### Proof of Theorem [C.1](https://arxiv.org/html/2606.19538#A3.SS1), Part (c).

We construct an explicit operator $T$ that ITNet can represent but no convolution can, thereby proving $\mathrm{Conv}\subsetneq\mathrm{ITNet}$.

The proof has three stages: (i) define the witness, (ii) show ITNet represents it, (iii) show convolution cannot represent it. We use *two independent* arguments for stage (iii) to make the result airtight.

Step 1. Define the witness operator.

Fix a reference point $x_0\in\mathbb{R}^s$ and let $\sigma:\mathbb{R}\to\mathbb{R}$ be any non-affine function. Define:

$$T(u)(x) \;\coloneqq\; \sigma\!\bigl(u(x)^\top u(x_0)\bigr)\cdot u(x),\qquad x\in\mathbb{R}^s. \tag{39}$$

This operator scales the feature at position $x$ by a nonlinear function of the similarity between $u(x)$ and a reference feature $u(x_0)$, thus explicitly depending on feature values rather than only spatial relationships. While simpler local nonlinear operators such as $T(u)(x)=\|u(x)\|^2 u(x)$ can be approximated by stacking convolutional layers with pointwise nonlinearities, the above construction involves non-local, content-dependent interactions between distinct positions $x$ and $x_0$, which cannot be represented by any single convolutional layer regardless of kernel size or channel dimension.

Step 2. Show ITNet represents $T$.

We construct an explicit ITNet kernel that computes $T$. Choose:

$$\kappa_\theta(x,y,u(x),u(y)) \;=\; \sigma\!\bigl(u(x)^\top u(x_0)\bigr)\cdot\mathbf{I}_d\cdot\delta(y-x),\qquad W_\theta=0. \tag{40}$$

Substituting into ([18](https://arxiv.org/html/2606.19538#A3.E18)):

$$\begin{aligned}
(\mathcal{K}_\theta[u])(x) &= \int_{\mathbb{R}^s}\sigma\!\bigl(u(x)^\top u(x_0)\bigr)\cdot\mathbf{I}_d\cdot\delta(y-x)\cdot u(y)\,dy \\
&= \sigma\!\bigl(u(x)^\top u(x_0)\bigr)\cdot\mathbf{I}_d\cdot\underbrace{\int_{\mathbb{R}^s}\delta(y-x)\,u(y)\,dy}_{=\,u(x)} \\
&= \sigma\!\bigl(u(x)^\top u(x_0)\bigr)\cdot u(x) \;=\; T(u)(x).
\end{aligned} \tag{41}$$

The scalar factor $\sigma(u(x)^\top u(x_0))$ is independent of $y$, so it passes outside the integral. The remaining integral $\int\delta(y-x)\,u(y)\,dy=u(x)$ is the sifting property of the Dirac delta. Finally, $\mathbf{I}_d\cdot u(x)=u(x)$ by Lemma [1](https://arxiv.org/html/2606.19538#Thmlemma1) with $\alpha=1$.

Step 3. Show no convolution can represent $T$: Argument I (Linearity).

###### Proposition 1 (Convolution is linear in $u$).

For any fixed filter $w$, the map $u\mapsto w\ast u$ is linear:

$$w\ast(\alpha u+\beta v)=\alpha(w\ast u)+\beta(w\ast v)\qquad\forall\,\alpha,\beta\in\mathbb{R},\; u,v\in L^1(\mathbb{R}^s,\mathbb{R}^d). \tag{42}$$

###### Proof.

$[w\ast(\alpha u+\beta v)](x)=\int w(x-y)\,[\alpha u(y)+\beta v(y)]\,dy=\alpha\int w(x-y)\,u(y)\,dy+\beta\int w(x-y)\,v(y)\,dy=\alpha(w\ast u)(x)+\beta(w\ast v)(x)$, by linearity of the Lebesgue integral. ∎

###### Proposition 2 ($T$ is nonlinear in $u$).

The operator $T$ defined in ([39](https://arxiv.org/html/2606.19538#A3.E39)) is not linear.

###### Proof.

Compute $T(\alpha u)$ for $\alpha\in\mathbb{R}$:

$$T(\alpha u)(x)=\sigma\!\bigl((\alpha u(x))^\top(\alpha u(x_0))\bigr)\cdot\alpha u(x)=\sigma\bigl(\alpha^2\cdot u(x)^\top u(x_0)\bigr)\cdot\alpha u(x). \tag{43}$$

If $T$ were linear, $T(\alpha u)=\alpha T(u)$ would require, at every $x$ with $u(x)\neq 0$:

$$\sigma(\alpha^2 z)=\sigma(z)\qquad\forall\,z\in\mathbb{R},\;\alpha\in\mathbb{R}\setminus\{0\}, \tag{44}$$

where $z=u(x)^\top u(x_0)$. For example, for the sigmoid $\sigma(z)=(1+e^{-z})^{-1}$:

$$\sigma(4\cdot 1)=\sigma(4)\approx 0.982,\qquad\sigma(1)\approx 0.731.$$

Since $\sigma(4)\neq\sigma(1)$, Eq. ([44](https://arxiv.org/html/2606.19538#A3.E44)) fails for $\alpha=2$, $z=1$. Hence $T(2u)\neq 2T(u)$ and $T$ is nonlinear. ∎

Conclusion from Argument I: Since $w\ast u$ is linear in $u$ (Proposition [1](https://arxiv.org/html/2606.19538#Thmproposition1)) and $T$ is nonlinear in $u$ (Proposition [2](https://arxiv.org/html/2606.19538#Thmproposition2)), no choice of filter $w$ can satisfy $w\ast u=T(u)$ for all $u$.

Step 4. Show no convolution can represent $T$: Argument II (Translation equivariance).

We provide a second, independent argument as additional assurance.

###### Proposition 3 (Convolution is translation-equivariant).

For the translation operator $\tau_a u(x)=u(x-a)$:

$$(w\ast\tau_a u)(x)=(w\ast u)(x-a)=\tau_a(w\ast u)(x). \tag{45}$$

###### Proof.

$(w\ast\tau_a u)(x)=\int w(x-y)\,u(y-a)\,dy\;\overset{z=y-a}{=}\;\int w(x-a-z)\,u(z)\,dz=(w\ast u)(x-a)$. ∎

###### Proposition 4 ($T$ is not translation-equivariant).

$T(\tau_a u)\neq\tau_a T(u)$ in general.

###### Proof.

$$T(\tau_a u)(x)=\sigma\!\bigl(u(x-a)^\top u(x_0-a)\bigr)\cdot u(x-a), \tag{46}$$

$$\tau_a T(u)(x)=T(u)(x-a)=\sigma\!\bigl(u(x-a)^\top u(x_0)\bigr)\cdot u(x-a). \tag{47}$$

The two expressions differ only in the reference feature: $u(x_0-a)$ in (46) versus $u(x_0)$ in (47). For an injective $\sigma$ such as the sigmoid, they agree at a point with $u(x-a)\neq 0$ only if $u(x-a)^\top u(x_0-a)=u(x-a)^\top u(x_0)$. This holds when $u(x_0-a)=u(x_0)$ for every shift $a$, but fails for generic $u$. ∎

Conclusion from Argument II: Convolution commutes with translation (Proposition [3](https://arxiv.org/html/2606.19538#Thmproposition3)); $T$ does not (Proposition [4](https://arxiv.org/html/2606.19538#Thmproposition4)). Hence no convolution can equal $T$.

Combining both arguments: $T$ is representable by ITNet (Step 2) but not by any convolution (Steps 3–4, by two independent proofs). Therefore $T\in\mathrm{ITNet}\setminus\mathrm{Conv}$, proving $\mathrm{Conv}\subsetneq\mathrm{ITNet}$. ∎
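Both obstructions can be observed numerically. The sketch below is illustrative only: it assumes a periodic one-dimensional grid of four positions in place of $\mathbb{R}^s$, uses the sigmoid for $\sigma$, and the helper names (`T`, `shift`, `scale`, `max_diff`) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def T(u, x0):
    """Witness operator of Eq. (39): sigma(u(x)^T u(x0)) * u(x) at every grid index x."""
    ref = u[x0]
    return [[sigmoid(dot(ux, ref)) * c for c in ux] for ux in u]


def shift(u, a):
    """Periodic translation (tau_a u)(x) = u(x - a); periodicity is an assumption of this sketch."""
    n = len(u)
    return [u[(i - a) % n] for i in range(n)]


def scale(u, alpha):
    return [[alpha * c for c in ux] for ux in u]


def max_diff(p, q):
    return max(abs(a - b) for pv, qv in zip(p, q) for a, b in zip(pv, qv))


u = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 0.5]]  # toy features, d = 2
x0 = 0                                                  # reference index

# Argument I: T(2u) versus 2 T(u)
homogeneity_gap = max_diff(T(scale(u, 2.0), x0), scale(T(u, x0), 2.0))
# Argument II: T(tau_1 u) versus tau_1 T(u)
equivariance_gap = max_diff(T(shift(u, 1), x0), shift(T(u, x0), 1))
```

Both gaps are clearly nonzero for this input, in line with Propositions 2 and 4; a convolution would give zero for both.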

### C\.5\. Proof of Part \(d\) \- With Residual

###### Proof of Theorem [C.1](https://arxiv.org/html/2606.19538#A3.SS1), Part (d).

If $W_\theta\neq 0$, the ITNet operator with kernel $\kappa_\theta=w(x-y)\cdot\mathbf{I}_d$ gives:

$$(\mathcal{K}_\theta[u])(x)=\int_{\mathbb{R}^s}w(x-y)\,u(y)\,dy+W_\theta u(x)=(w\ast u)(x)+W_\theta u(x). \tag{48}$$

The term $W_\theta u(x)$ is a pointwise linear map $\mathbb{R}^d\to\mathbb{R}^d$ applied at each position independently. In CNN terminology, this is a $1\times 1$ convolution (a convolution with filter of size $k=1$). The sum $(w\ast u)+W_\theta u$ is therefore a spatial convolution with a parallel $1\times 1$ convolution branch: the two are added, not composed sequentially. With $W_\theta=\mathbf{I}_d$ it reduces to a convolution with an identity skip connection, the residual pattern used in architectures such as ResNet and ConvNeXt. ∎

### C\.6\. Multi\-Channel Convolution

Standard CNNs have $C_{\mathrm{in}}$ input channels and $C_{\mathrm{out}}$ output channels:

$$[(F\ast u)(x)]_i=\sum_{c=1}^{C_{\mathrm{in}}}\sum_{m\in\mathcal{N}} f_{m,c}^{(i)}\cdot u_c(x-mh),\qquad i=1,\ldots,C_{\mathrm{out}}. \tag{49}$$

Recovery: Set $d=C_{\mathrm{in}}$ and choose the kernel matrix entries as $[\kappa_\theta(x,y,\cdot,\cdot)]_{ic}=h^{-s}f_{(x-y)/h,c}^{(i)}\cdot\mathbf{1}_{\mathcal{N}}$. Since $\kappa_\theta\in\mathbb{R}^{d\times d}$ is square, this presumes $C_{\mathrm{out}}=C_{\mathrm{in}}$; otherwise one can, for example, zero-pad the channels to $d=\max(C_{\mathrm{in}},C_{\mathrm{out}})$. The proof of Part (b) applies component-wise: the $i$-th output component sums contributions from all $C_{\mathrm{in}}$ input channels via the matrix-vector product $\kappa_\theta\cdot u(y)$, which mixes channels through the off-diagonal entries of $\kappa_\theta$.
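The component-wise argument can be sketched for a one-dimensional grid with two input and two output channels. This is an illustrative check, not the paper's implementation: the names `kernel_matrix`, `itnet_multichannel`, and `conv_multichannel` are hypothetical, and the signal is assumed to be zero outside its stored range.

```python
def kernel_matrix(filters, m, h, s=1):
    """Matrix kernel [kappa]_{ic} = h^{-s} f^{(i)}_{m,c}; entries are zero when m is outside N.

    filters[i][c] is a dict mapping integer offset -> tap for output i, input channel c.
    """
    return [[taps.get(m, 0.0) / h**s for taps in row] for row in filters]


def itnet_multichannel(u, filters, h, x_idx, s=1):
    """Sum over grid points of h^s * kappa(x, y) @ u(y)."""
    out = [0.0] * len(filters)
    for y_idx, u_y in enumerate(u):
        K = kernel_matrix(filters, x_idx - y_idx, h, s)
        for i in range(len(filters)):
            out[i] += h**s * sum(K[i][c] * u_y[c] for c in range(len(u_y)))
    return out


def conv_multichannel(u, filters, x_idx):
    """Reference Eq. (49): sum over channels c and offsets m of f^{(i)}_{m,c} * u_c(x - m h)."""
    out = []
    for row in filters:
        acc = 0.0
        for c, taps in enumerate(row):
            for m, f_m in taps.items():
                j = x_idx - m
                if 0 <= j < len(u):
                    acc += f_m * u[j][c]
        out.append(acc)
    return out


filters = [
    [{-1: 1.0, 0: 2.0}, {0: -1.0}],         # output channel 0
    [{1: 0.5}, {-1: 0.25, 1: 0.75}],        # output channel 1
]
u = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [2.0, 0.0]]
h = 0.5
itnet_mc = [itnet_multichannel(u, filters, h, i) for i in range(len(u))]
conv_mc = [conv_multichannel(u, filters, i) for i in range(len(u))]
```

Restricting `filters[i][c]` to be empty for `i != c` gives a diagonal kernel matrix, which is the depthwise case.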

### C\.7\. Depthwise Separable Convolution

Depthwise convolution [[14](https://arxiv.org/html/2606.19538#bib.bib25)] applies a separate filter per channel: $[(F_{\mathrm{DW}}\ast u)(x)]_i=\sum_{m\in\mathcal{N}} f_m^{(i)}\,u_i(x-mh)$.

Recovery: Restrict $\kappa_\theta$ to be diagonal: $\kappa_\theta=h^{-s}\,\mathrm{diag}\bigl(f_{(x-y)/h}^{(1)},\ldots,f_{(x-y)/h}^{(d)}\bigr)\cdot\mathbf{1}_{\mathcal{N}}$. A diagonal matrix has zero off-diagonal entries, preventing cross-channel mixing.

### C\.8\. Dilated \(Atrous\) Convolution

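Dilation rate $r$: $\sum_{m\in\mathcal{N}} f_m\cdot u(x-r\cdot mh)$.

Recovery: Replace $h$ with $rh$ in the kernel offset of Part (b), that is, use $f_{(x-y)/(rh)}\cdot\mathbf{1}_{\{(x-y)/(rh)\in\mathcal{N}\}}$, while keeping the grid $h\mathbb{Z}^s$ and the measure weight $h^s$ (with its $1/h^s$ compensation) unchanged. In units of the grid spacing, the neighborhood becomes $r\mathcal{N}=\{rm:m\in\mathcal{N}\}$.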
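A minimal one-dimensional sketch of the dilated case follows. It assumes the grid and measure weight stay at spacing $h$ while only the kernel offset is rescaled by $r$; the function names are illustrative.

```python
def dilated_itnet(u, f, h, r, x_idx):
    """ITNet sum whose kernel is supported where (x - y) / (r h) lies in N (s = 1)."""
    total = 0.0
    for y_idx, u_y in enumerate(u):
        diff = x_idx - y_idx
        # the indicator holds only when diff is a multiple of r and diff / r is a tap offset
        if diff % r == 0 and (diff // r) in f:
            total += h * (f[diff // r] / h) * u_y  # measure weight h times density f / h
    return total


def dilated_conv(u, f, r, x_idx):
    """Reference: sum_m f_m * u(x - r m h), with zeros outside the stored range."""
    return sum(f_m * u[x_idx - r * m]
               for m, f_m in f.items()
               if 0 <= x_idx - r * m < len(u))


u = [float(i) for i in range(10)]
f = {-1: 1.0, 0: -2.0, 1: 1.0}   # 3-tap second-difference filter
h, r = 0.25, 2
dil_itnet = [dilated_itnet(u, f, h, r, i) for i in range(len(u))]
dil_conv = [dilated_conv(u, f, r, i) for i in range(len(u))]
```

For a linear ramp such as this `u`, interior outputs of the second-difference filter are zero at any dilation rate, which gives a quick sanity check.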

### C\.9\. Strided Convolution

Stride $S$: output is computed only at positions $x\in Sh\mathbb{Z}^s$.

Recovery: Restrict the output domain to $Sh\mathbb{Z}^s\subset h\mathbb{Z}^s$. The ITNet operator is defined on any $\Omega$; choosing $\Omega_{\mathrm{query}}=Sh\mathbb{Z}^s$ with $\Omega_{\mathrm{key}}=h\mathbb{Z}^s$ gives strided output.
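In code, the query/key split amounts to evaluating the same sum on a subsampled set of query indices. A small sketch under the assumption of a one-dimensional grid and zero extension outside the signal (names are illustrative):

```python
def conv_at(u, f, x_idx):
    """k-tap convolution at one query index; keys range over the full grid."""
    return sum(f_m * u[x_idx - m]
               for m, f_m in f.items()
               if 0 <= x_idx - m < len(u))


def strided_outputs(u, f, stride):
    """Evaluate only at query indices that are multiples of the stride."""
    return [conv_at(u, f, i) for i in range(0, len(u), stride)]


u = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0]
f = {-1: 0.5, 0: 1.0, 1: 0.5}
dense = [conv_at(u, f, i) for i in range(len(u))]
strided = strided_outputs(u, f, stride=2)
```

Here `strided` coincides with `dense[::2]`: the operator is unchanged and only the set of query points differs.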

### C\.10\. Group Convolution

$G$ groups, each processing $d/G$ channels independently.

Recovery: Set $\kappa_\theta=\mathrm{blockdiag}(K_1,\ldots,K_G)$ where each $K_g\in\mathbb{R}^{(d/G)\times(d/G)}$ is the per-group kernel.

### C\.11\. Transposed Convolution

Upsampling by factor $r$: insert $(r-1)$ zeros between inputs.

Recovery: Use the finer grid $\Omega=(h/r)\mathbb{Z}^s$ and apply Part (b).

### C\.12\. Boundary Handling in the Convolutional Special Case

A subtle yet important distinction between ITNet and standard CNNs lies in how they handle domain boundaries. For a CNN with *valid padding* (the standard convolution operation without artificial boundary extension), the output is defined only at positions where the filter fully overlaps the input domain, producing an output map of size $(H-k+1)\times(W-k+1)$ for an $H\times W$ input and $k\times k$ filter. ITNet, by contrast, defines $(\mathcal{K}_\theta[u])(x)$ for *every* $x\in\Omega$ as an integral over the entire domain. However, when $\kappa_\theta$ has compact support of size $k$ (as in the convolutional case $\kappa_\theta(x,y)=w_\theta(x-y)$ with $w_\theta$ supported on $[-r,r]^s$, $k=2r+1$), outputs near the boundary naturally receive contributions from fewer input points because the integration kernel only overlaps the domain partially. ITNet makes the boundary treatment *explicit* through the domain $\Omega$ and measure $\mu$, rather than implicit through hard-coded padding strategies.

To recover standard CNN behavior exactly, three approaches are available:

*(i) Evaluate ITNet only at valid positions:* Restrict queries to the subset $\Omega_{\text{valid}}=\{x\in\Omega:\operatorname{supp}(w_\theta(x-\cdot))\subseteq\Omega\}$, which yields exactly the $(H-k+1)\times(W-k+1)$ outputs of a valid convolution. This is the cleanest mathematical formulation, as no artificial points are introduced.

*(ii) Extend $\Omega$ with zero-valued points:* Define $\tilde{\Omega}=\Omega\cup\partial\Omega$ where $\partial\Omega$ contains $r=(k-1)/2$ layers of points on each side of the boundary ($k-1$ extra points per axis in total), and set $u(y)=0$ for $y\in\partial\Omega$. This recovers standard zero-padding convolution, producing a dense output map of size $H\times W$. The same mechanism generalizes to reflection, replication, or periodic padding by appropriately defining $u$ on the extended domain.

*(iii) Ignore boundary effects:* For tasks where boundary effects are negligible (e.g., large images with small filters), the relative number of boundary points scales as $O\bigl((H+W)/(HW)\bigr)$, which vanishes for large $H,W$.
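Approaches (i) and (ii) can be compared on a one-dimensional signal. The sketch below assumes an odd filter length $k=2r+1$ and pads with $r$ zeros on each side, which is the amount needed for an output of the same length as the input; the function names are illustrative.

```python
def valid_outputs(u, taps):
    """Approach (i): query only where the k-tap window lies inside the domain."""
    r = (len(taps) - 1) // 2
    n = len(u)
    return [sum(taps[m + r] * u[i - m] for m in range(-r, r + 1))
            for i in range(r, n - r)]


def zero_padded_outputs(u, taps):
    """Approach (ii): extend the domain by r zero-valued points per side, query every original point."""
    r = (len(taps) - 1) // 2
    padded = [0.0] * r + list(u) + [0.0] * r
    return [sum(taps[m + r] * padded[i + r - m] for m in range(-r, r + 1))
            for i in range(len(u))]


u = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
taps = [0.25, 0.5, 0.25]            # k = 3, offsets m = -1, 0, 1
valid = valid_outputs(u, taps)      # length n - k + 1
same = zero_padded_outputs(u, taps) # length n
```

The interior of the zero-padded result matches the valid result exactly; the two differ only in the $r$ outputs nearest each boundary.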

## Appendix D Proof of Theorem 2: Self-Attention as a Special Case of ITNet

###### Definition 5 (Continuous scaled dot-product attention [[93](https://arxiv.org/html/2606.19538#bib.bib3)]).

Let $\Omega\subset\mathbb{R}^s$ be compact, $\mu$ a Borel measure on $\Omega$ with $\mu(\Omega)<\infty$, and $u:\Omega\to\mathbb{R}^d$ a feature function. Given learnable projection matrices $W_Q,W_K\in\mathbb{R}^{d_k\times d}$ and $W_V\in\mathbb{R}^{d\times d}$, define query and key functions:

$$Q(x)\coloneqq W_Q\,u(x)\in\mathbb{R}^{d_k},\qquad K(y)\coloneqq W_K\,u(y)\in\mathbb{R}^{d_k}.$$

The *continuous scaled dot-product attention* is:

$$\mathrm{Attn}(u)(x) \;=\; \int_\Omega\alpha(x,y)\,W_V u(y)\,d\mu(y), \tag{50}$$

where the *attention weight* $\alpha:\Omega\times\Omega\to\mathbb{R}_{>0}$ is:

$$\alpha(x,y) \;=\; \frac{\exp\!\bigl(Q(x)^\top K(y)/\sqrt{d_k}\bigr)}{Z(x)},\qquad Z(x) \;=\; \int_\Omega\exp\!\bigl(Q(x)^\top K(z)/\sqrt{d_k}\bigr)\,d\mu(z). \tag{51}$$

###### Definition 6 (Discrete scaled dot-product attention [[93](https://arxiv.org/html/2606.19538#bib.bib3)]).

For $n$ positions $\{x_1,\ldots,x_n\}\subset\Omega$ with uniform measure $\omega_j=1/n$, let $Q=[W_Q u(x_1);\ldots;W_Q u(x_n)]\in\mathbb{R}^{n\times d_k}$, $K=[W_K u(x_1);\ldots;W_K u(x_n)]\in\mathbb{R}^{n\times d_k}$, $V=[W_V u(x_1);\ldots;W_V u(x_n)]\in\mathbb{R}^{n\times d}$. The *discrete self-attention* is:

$$\mathrm{Attn}(u) \;=\; \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \;\in\; \mathbb{R}^{n\times d}, \tag{52}$$

where $\operatorname{softmax}$ is applied row-wise.

###### Definition 7 (Multi-head attention [[93](https://arxiv.org/html/2606.19538#bib.bib3)]).

With $H$ heads, let $W_Q^h,W_K^h\in\mathbb{R}^{d_k\times d}$, $W_V^h\in\mathbb{R}^{d_v\times d}$, $W_O\in\mathbb{R}^{d\times Hd_v}$ be learnable matrices for head $h=1,\ldots,H$. Define:

$$\mathrm{head}_h(u)(x) \;=\; \int_\Omega\alpha_h(x,y)\,W_V^h u(y)\,d\mu(y), \tag{53}$$

$$\alpha_h(x,y) \;=\; \frac{\exp\!\bigl((W_Q^h u(x))^\top(W_K^h u(y))/\sqrt{d_k}\bigr)}{Z_h(x)}. \tag{54}$$

The *multi-head attention* output is:

$$\mathrm{MHA}(u)(x) \;=\; W_O\cdot\begin{bmatrix}\mathrm{head}_1(u)(x)\\ \vdots\\ \mathrm{head}_H(u)(x)\end{bmatrix}. \tag{55}$$

###### Definition 8 (ITNet operator).

As in Definition [1](https://arxiv.org/html/2606.19538#Thmdefinition1) (§[2.1](https://arxiv.org/html/2606.19538#S2.SS1)):

$$(\mathcal{K}_\theta[u])(x) \;=\; \int_\Omega\kappa_\theta(x,\,y,\,u(x),\,u(y))\,u(y)\,d\mu(y)\;+\;W_\theta\,u(x), \tag{56}$$

where $\kappa_\theta:\mathbb{R}^s\times\mathbb{R}^s\times\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}^{d\times d}$ is measurable and $W_\theta\in\mathbb{R}^{d\times d}$ is learnable.

###### Assumption 3\(Regularity conditions\)\.

Throughout this proof:

1. $u\in C(\Omega,\mathbb{R}^d)$: $u$ is continuous (hence bounded on compact $\Omega$).
2. $W_Q,W_K,W_V$ are bounded: $\|W_Q\|_{\mathrm{op}},\|W_K\|_{\mathrm{op}},\|W_V\|_{\mathrm{op}}<\infty$.
3. $\Omega$ is compact with $\mu(\Omega)<\infty$.

These hold in all our experiments since feature vectors are bounded on finite token sets\.

### D\.1\. Main Theorem

###### Theorem D.1 (ITNet $\supset$ Self-Attention).

Let $\Omega$, $\mu$, $u$ be as in Definitions [5](https://arxiv.org/html/2606.19538#Thmdefinition5) and [8](https://arxiv.org/html/2606.19538#Thmdefinition8), and let Assumption [3](https://arxiv.org/html/2606.19538#Thmassumption3) hold.

(a) Single-head attention. Set

$$\kappa_\theta(x,y,u(x),u(y)) \;=\; \frac{\exp\!\bigl(Q(x)^\top K(y)/\sqrt{d_k}\bigr)}{Z(x)}\cdot W_V,\qquad W_\theta=0. \tag{57}$$

Then $(\mathcal{K}_\theta[u])(x)=\mathrm{Attn}(u)(x)$ for all $x\in\Omega$. Discretising $\Omega$ to $n$ positions recovers $\operatorname{softmax}(QK^\top/\sqrt{d_k})\,V$ exactly.

(b) Multi-head attention. Multi-head attention is recovered by a single ITNet layer with a block-structured kernel:

$$\kappa_\theta(x,y,u(x),u(y)) \;=\; W_O\cdot\mathrm{BlockDiag}\!\bigl(\alpha_1(x,y)\,W_V^1,\;\ldots,\;\alpha_H(x,y)\,W_V^H\bigr), \tag{58}$$

where $\alpha_h$ is defined in ([54](https://arxiv.org/html/2606.19538#A4.E54)). Then $(\mathcal{K}_\theta[u])(x)=\mathrm{MHA}(u)(x)$.

(c) Strictness. There exists a continuous operator $G:L^2(\Omega,\mathbb{R}^d)\to L^2(\Omega,\mathbb{R}^d)$ representable by an ITNet operator but *not* representable by any attention mechanism.

### D\.2\. Proof of Part \(a\) \- Single\-Head Continuous Attention

###### Proof of Theorem [D.1](https://arxiv.org/html/2606.19538#A4.SS1)(a): Continuous Case.

Step 1. Substitute the kernel into the ITNet operator. Insert ([57](https://arxiv.org/html/2606.19538#A4.E57)) and $W_\theta=0$ into ([56](https://arxiv.org/html/2606.19538#A4.E56)):

$$(\mathcal{K}_\theta[u])(x) \;=\; \int_\Omega\frac{\exp\bigl(Q(x)^\top K(y)/\sqrt{d_k}\bigr)}{Z(x)}\cdot W_V\cdot u(y)\,d\mu(y). \tag{59}$$
Step 2. Verify the partition function $Z(x)$ is well-defined and positive.

$$Z(x) \;=\; \int_\Omega\exp\!\bigl(Q(x)^\top K(z)/\sqrt{d_k}\bigr)\,d\mu(z). \tag{60}$$

Since $u\in C(\Omega,\mathbb{R}^d)$ and $\Omega$ is compact, $\|u\|_\infty=\sup_{x\in\Omega}\|u(x)\|_2<\infty$. Let $C_Q=\|W_Q\|_{\mathrm{op}}$, $C_K=\|W_K\|_{\mathrm{op}}$, $R=\|u\|_\infty$. By Cauchy–Schwarz [[78](https://arxiv.org/html/2606.19538#bib.bib111)]:

$$\left|Q(x)^\top K(z)/\sqrt{d_k}\right| \;\leq\; \frac{\|Q(x)\|_2\cdot\|K(z)\|_2}{\sqrt{d_k}} \;\leq\; \frac{C_Q R\cdot C_K R}{\sqrt{d_k}} \;\eqqcolon\; M<\infty. \tag{61}$$

Therefore:

$$0 \;<\; e^{-M}\cdot\mu(\Omega) \;\leq\; Z(x) \;\leq\; e^{M}\cdot\mu(\Omega) \;<\; \infty. \tag{62}$$
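The two-sided bound (62) is easy to check on a finite token set. The sketch below is illustrative and makes one simplification that should be read as an assumption: it computes $M$ directly from the norms of the given query and keys (the first inequality in (61)) instead of from $C_Q$, $C_K$, and $R$. The function name is hypothetical.

```python
import math


def partition_function_bounds(q, keys, weights, d_k):
    """Return (lower, Z, upper) with lower = e^{-M} mu(Omega) and upper = e^{M} mu(Omega)."""
    scale = math.sqrt(d_k)

    def norm(v):
        return math.sqrt(sum(c * c for c in v))

    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    Z = sum(w * math.exp(s) for w, s in zip(weights, scores))  # discrete form of Eq. (60)
    M = norm(q) * max(norm(k) for k in keys) / scale           # Cauchy-Schwarz bound on |score|
    mu_total = sum(weights)                                    # mu(Omega)
    return math.exp(-M) * mu_total, Z, math.exp(M) * mu_total


q = [1.0, -2.0]
keys = [[0.5, 0.5], [-1.0, 2.0], [0.0, 1.0]]
weights = [0.2, 0.3, 0.5]           # atomic measure on three points
lower, Z, upper = partition_function_bounds(q, keys, weights, d_k=2)
```

Because `lower` is strictly positive whenever the total mass is positive, division by $Z(x)$ in the attention weight is always well defined.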
Step 3. Factor $1/Z(x)$ outside the integral.

Since $Z(x)$ does not depend on $y$, the scalar $1/Z(x)$ can be pulled outside by linearity of the integral. The constant matrix $W_V$ could be moved outside in the same way, but we keep it inside to match Definition [5](https://arxiv.org/html/2606.19538#Thmdefinition5):

$$(\mathcal{K}_\theta[u])(x) \;=\; \frac{1}{Z(x)}\int_\Omega\exp\!\bigl(Q(x)^\top K(y)/\sqrt{d_k}\bigr)\cdot W_V u(y)\,d\mu(y). \tag{63}$$
Step 4. Establish the normalization property of $\alpha(x,y)$.

$$\int_\Omega\alpha(x,y)\,d\mu(y) \;=\; \int_\Omega\frac{\exp\bigl(Q(x)^\top K(y)/\sqrt{d_k}\bigr)}{Z(x)}\,d\mu(y) \;=\; \frac{Z(x)}{Z(x)} \;=\; 1. \tag{64}$$

Step 5. Write as a probability-weighted integral.

$$(\mathcal{K}_\theta[u])(x) \;=\; \int_\Omega\alpha(x,y)\,W_V u(y)\,d\mu(y) \;=\; \mathbb{E}_{y\sim\alpha(x,\cdot)}\!\bigl[W_V u(y)\bigr]. \tag{65}$$

Step 6. Recognise the continuous attention operator.

Comparing ([63](https://arxiv.org/html/2606.19538#A4.E63)) with Definition [5](https://arxiv.org/html/2606.19538#Thmdefinition5):

$$\boxed{(\mathcal{K}_\theta[u])(x) \;=\; \int_\Omega\alpha(x,y)\,W_V u(y)\,d\mu(y) \;=\; \mathrm{Attn}(u)(x).} \tag{66}$$

∎

### D\.3\. Discretization: Recovering Standard Self\-Attention

###### Proof of Theorem [D.1](https://arxiv.org/html/2606.19538#A4.SS1)(a): Discrete Case.

Take $\Omega=\{x_1,\ldots,x_n\}$ with $\mu(\{x_j\})=1/n$, and write $q_i=W_Q u(x_i)$ and $k_j=W_K u(x_j)$. The continuous integral becomes a finite sum:

$$(\hat{\mathcal{K}}_\theta[u])(x_i) \;=\; \sum_{j=1}^{n}\frac{1}{n}\cdot\alpha(x_i,x_j)\,W_V u(x_j). \tag{67}$$

Substituting ([51](https://arxiv.org/html/2606.19538#A4.E51)) with the discrete partition function:

$$Z(x_i) \;=\; \sum_{l=1}^{n}\frac{1}{n}\exp\!\bigl(q_i^\top k_l/\sqrt{d_k}\bigr), \tag{68}$$

so:

$$\alpha(x_i,x_j) \;=\; \frac{\exp\bigl(q_i^\top k_j/\sqrt{d_k}\bigr)}{\sum_{l=1}^{n}\frac{1}{n}\exp\bigl(q_i^\top k_l/\sqrt{d_k}\bigr)} \;=\; \frac{\exp\bigl(q_i^\top k_j/\sqrt{d_k}\bigr)}{\frac{1}{n}\sum_{l=1}^{n}\exp\bigl(q_i^\top k_l/\sqrt{d_k}\bigr)}. \tag{69}$$

Substituting into ([67](https://arxiv.org/html/2606.19538#A4.E67)):

$$\begin{aligned}
(\hat{\mathcal{K}}_\theta[u])(x_i) &= \sum_{j=1}^{n}\frac{1}{n}\cdot\frac{\exp\bigl(q_i^\top k_j/\sqrt{d_k}\bigr)}{\frac{1}{n}\sum_{l}\exp\bigl(q_i^\top k_l/\sqrt{d_k}\bigr)}\cdot W_V u(x_j) \\
&= \sum_{j=1}^{n}\frac{\frac{1}{n}\exp\bigl(q_i^\top k_j/\sqrt{d_k}\bigr)}{\frac{1}{n}\sum_{l}\exp\bigl(q_i^\top k_l/\sqrt{d_k}\bigr)}\cdot W_V u(x_j).
\end{aligned} \tag{70}$$

The factors $\frac{1}{n}$ cancel in numerator and denominator:

$$(\hat{\mathcal{K}}_\theta[u])(x_i) \;=\; \sum_{j=1}^{n}\frac{\exp\bigl(q_i^\top k_j/\sqrt{d_k}\bigr)}{\sum_{l}\exp\bigl(q_i^\top k_l/\sqrt{d_k}\bigr)}\cdot v_j \;=\; \left[\operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\right]_i, \tag{71}$$

where $v_j=W_V u(x_j)$ is the $j$-th row of $V$.

$$\boxed{(\hat{\mathcal{K}}_\theta[u])(x_i) \;=\; \left[\operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\right]_i \;=\; \mathrm{Attn}(u)(x_i).} \tag{72}$$

∎
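The cancellation of the $1/n$ factors can be verified directly. The following sketch assumes the projected queries, keys, and values are already given as small lists; the function names are illustrative and no library attention routine is used.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def softmax_attention(q, k, v, d_k):
    """Row-wise softmax(Q K^T / sqrt(d_k)) V, as in Eq. (52)."""
    out = []
    for qi in q:
        e = [math.exp(dot(qi, kj) / math.sqrt(d_k)) for kj in k]
        total = sum(e)
        out.append([sum(ej / total * vj[c] for ej, vj in zip(e, v))
                    for c in range(len(v[0]))])
    return out


def itnet_attention(q, k, v, d_k):
    """Eq. (67) with the measure-weighted partition function of Eq. (68)."""
    w = 1.0 / len(k)                              # uniform measure mu({x_j}) = 1/n
    out = []
    for qi in q:
        e = [math.exp(dot(qi, kj) / math.sqrt(d_k)) for kj in k]
        Z = sum(w * ej for ej in e)               # Eq. (68)
        alpha = [ej / Z for ej in e]              # Eq. (69)
        out.append([sum(w * a * vj[c] for a, vj in zip(alpha, v))
                    for c in range(len(v[0]))])
    return out


q = [[0.2, -0.1], [1.0, 0.5], [-0.3, 0.8]]
k = [[0.5, 0.5], [-1.0, 0.2], [0.1, 0.9]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
attn_ref = softmax_attention(q, k, v, d_k=2)
attn_itnet = itnet_attention(q, k, v, d_k=2)
```

The same agreement holds for any positive uniform weight `w`, since it appears once in the numerator and once in $Z$.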

### D\.4\. Proof of Part \(b\) \- Multi\-Head Attention

###### Proof of Theorem [D.1](https://arxiv.org/html/2606.19538#A4.SS1)(b).

Step 1. Construct the block-structured kernel.

Define the ITNet kernel as in ([58](https://arxiv.org/html/2606.19538#A4.E58)). For $d=Hd_v$ (total output dimension equals number of heads times per-head value dimension), the kernel is:

$$\kappa_\theta(x,y,u(x),u(y)) \;=\; W_O\cdot\begin{pmatrix}\alpha_1(x,y)\,W_V^1 & 0 & \cdots & 0\\ 0 & \alpha_2(x,y)\,W_V^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \alpha_H(x,y)\,W_V^H\end{pmatrix} \;\in\; \mathbb{R}^{d\times d}. \tag{73}$$

Since each $W_V^h\in\mathbb{R}^{d_v\times d}$ acts on the full feature vector, the block-diagonal factor is understood as acting on $H$ stacked copies of $u(y)$, so that $\kappa_\theta\,u(y)=W_O\,[\alpha_1 W_V^1 u(y);\,\ldots;\,\alpha_H W_V^H u(y)]$. Equivalently, $\kappa_\theta=W_O\,[\alpha_1 W_V^1;\,\ldots;\,\alpha_H W_V^H]$ with the $d_v\times d$ blocks stacked vertically, which is a $d\times d$ matrix.

Step 2. Compute the ITNet output.

With $W_\theta=0$:

$$\begin{aligned}
(\mathcal{K}_\theta[u])(x) &= \int_\Omega\kappa_\theta(x,y,u(x),u(y))\,u(y)\,d\mu(y) \\
&= W_O\int_\Omega\begin{pmatrix}\alpha_1(x,y)\,W_V^1 u(y)\\ \vdots\\ \alpha_H(x,y)\,W_V^H u(y)\end{pmatrix}d\mu(y),
\end{aligned} \tag{74}$$

where we used Hille's theorem [[28](https://arxiv.org/html/2606.19538#bib.bib103)] to factor $W_O$ outside the integral (same justification as Step 3 of Part (a)).

Step 3. Evaluate each block.

Since the integral of a block-vector is the vector of block integrals (by linearity of the Bochner integral):

$$\begin{aligned}
(\mathcal{K}_\theta[u])(x) &= W_O\begin{pmatrix}\int_\Omega\alpha_1(x,y)\,W_V^1 u(y)\,d\mu(y)\\ \vdots\\ \int_\Omega\alpha_H(x,y)\,W_V^H u(y)\,d\mu(y)\end{pmatrix} \\
&= W_O\begin{pmatrix}\mathrm{head}_1(u)(x)\\ \vdots\\ \mathrm{head}_H(u)(x)\end{pmatrix} \;=\; \mathrm{MHA}(u)(x).
\end{aligned} \tag{75}$$

$$\boxed{(\mathcal{K}_\theta[u])(x) \;=\; \mathrm{MHA}(u)(x).} \tag{76}$$

∎

### D\.5\. Strictness Argument 1: Normalisation

###### Proof via unnormalized operators\.

Step 1. Define the witness operator.

Let $\ell>0$. Define the *unnormalized Gaussian smoothing operator*:

$$
G_{\ell}(u)(x) \;\coloneqq\; \int_{\Omega}\exp\!\bigl(-\|x-y\|^{2}/\ell^{2}\bigr)\,u(y)\,d\mu(y). \tag{77}
$$

Step 2. Show ITNet represents $G_{\ell}$.

Set:

$$
\kappa_{\theta}(x,y,u(x),u(y)) \;=\; \exp(-\|x-y\|^{2}/\ell^{2})\cdot\mathbf{I}_{d},\qquad W_{\theta}=0. \tag{78}
$$

Then:

$$
\begin{aligned}
(\mathcal{K}_{\theta}[u])(x) &= \int_{\Omega}\exp(-\|x-y\|^{2}/\ell^{2})\cdot\mathbf{I}_{d}\cdot u(y)\,d\mu(y)\\
&= \int_{\Omega}\exp(-\|x-y\|^{2}/\ell^{2})\cdot u(y)\,d\mu(y) \;=\; G_{\ell}(u)(x).
\end{aligned} \tag{79}
$$

The kernel \([78](https://arxiv.org/html/2606.19538#A4.E78)\) satisfies all conditions of Definition [8](https://arxiv.org/html/2606.19538#Thmdefinition8): it is continuous (hence measurable), bounded by 1, and satisfies $\|\kappa_{\theta}\|_{\mathrm{op}}=1$. So $G_{\ell}\in\mathrm{ITNet}$.

Step 3. Prove the output bound for attention.

###### Lemma 3 (Attention output bound).

For any attention mechanism with weight function $\alpha(x,y)\geq 0$, $\int_{\Omega}\alpha(x,y)\,d\mu(y)=1$, and any $W_{V}\in\mathbb{R}^{d\times d}$:

$$
\|\mathrm{Attn}(u)(x)\|_{2} \;\leq\; \max_{y\in\Omega}\|W_{V}u(y)\|_{2}\qquad\forall\,x\in\Omega. \tag{80}
$$

###### Proof of Lemma [3](https://arxiv.org/html/2606.19538#Thmlemma3).

By the triangle inequality for the Bochner integral and the normalization property \([64](https://arxiv.org/html/2606.19538#A4.E64)\):

$$
\begin{aligned}
\|\mathrm{Attn}(u)(x)\|_{2} &= \left\|\int_{\Omega}\alpha(x,y)\,W_{V}u(y)\,d\mu(y)\right\|_{2}\\
&\leq \int_{\Omega}\alpha(x,y)\,\|W_{V}u(y)\|_{2}\,d\mu(y) &&\text{(triangle inequality)}\\
&\leq \left(\max_{y\in\Omega}\|W_{V}u(y)\|_{2}\right)\int_{\Omega}\alpha(x,y)\,d\mu(y) &&\text{(bound by max)}\\
&= \max_{y\in\Omega}\|W_{V}u(y)\|_{2}\cdot 1. &&\text{(normalization (64))}
\end{aligned}
$$

∎
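The bound in Lemma 3 can be checked numerically in a discrete setting. The sketch below is only an illustration, not part of the proof: it treats the tokens as a finite set with normalized softmax weights, and the vectors in `values` are hypothetical stand-ins for $W_{V}u(y_j)$.

```python
import math
import random

def norm2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))

def softmax(scores):
    """Non-negative weights that sum to one (discrete analogue of alpha(x, .))."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_output(weights, values):
    """Convex combination sum_j weights[j] * values[j]."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]

random.seed(0)
# Hypothetical value vectors playing the role of W_V u(y_j).
values = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
bound = max(norm2(v) for v in values)

# Try many random attention patterns and record the largest output norm.
worst = 0.0
for _ in range(1000):
    alpha = softmax([random.gauss(0, 3) for _ in values])
    worst = max(worst, norm2(attention_output(alpha, values)))

bound_holds = worst <= bound + 1e-12
```

Because every attention output is a convex combination of the value vectors, `worst` can approach `bound` (when the weights concentrate on the largest value vector) but cannot exceed it.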

Step 4. Show $G_{\ell}$ violates the attention bound.

Choose $\Omega=[0,1]^{s}$ (unit hypercube, $\mu=$ Lebesgue measure, $\mu(\Omega)=1$) and a constant function $u(y)=v\in\mathbb{R}^{d}$ with $\|v\|_{2}=1$. Then:

$$
G_{\ell}(u)(x)=v\int_{[0,1]^{s}}\exp(-\|x-y\|^{2}/\ell^{2})\,dy\eqqcolon C_{\ell}(x)\cdot v, \tag{81}
$$

where $C_{\ell}(x)=\int_{[0,1]^{s}}e^{-\|x-y\|^{2}/\ell^{2}}\,dy$.

For large $\ell$, $e^{-\|x-y\|^{2}/\ell^{2}}\approx 1$ over most of $[0,1]^{s}$, so $C_{\ell}(x)\to\mu(\Omega)=1$ as $\ell\to\infty$. More precisely, for $\ell\geq 1$ and $x$ at the centre of $[0,1]^{s}$, every $y\in[0,1]^{s}$ satisfies $\|x-y\|^{2}\leq s/4$, so:

$$
C_{\ell}(x)\;\geq\;e^{-s/(4\ell^{2})}\cdot\mu(\Omega)\;=\;e^{-s/(4\ell^{2})}. \tag{82}
$$

Now suppose for contradiction that some attention mechanism satisfies $\mathrm{Attn}=G_{\ell}$. Then $W_{V}u(y)=W_{V}v$ is constant in $y$, and:

$$
G_{\ell}(u)(x) = C_{\ell}(x)\cdot v, \tag{83}
$$

$$
\mathrm{Attn}(u)(x) = W_{V}v\cdot\underbrace{\int_{\Omega}\alpha(x,y)\,d\mu(y)}_{=1}=W_{V}v. \tag{84}
$$

For these to be equal for all $x$, we need $C_{\ell}(x)\cdot v=W_{V}v$ for all $x\in[0,1]^{s}$. But $C_{\ell}(x)$ depends on $x$ (it is strictly larger at the centre of $[0,1]^{s}$ than at the corners) and $v\neq 0$, so $C_{\ell}(x)\cdot v$ is not constant in $x$, while $W_{V}v$ is constant in $x$. Contradiction.

Therefore no attention mechanism can represent $G_{\ell}$, so $G_{\ell}\in\mathrm{ITNet}\setminus\mathrm{Attn}$ and $\mathrm{Attn}\subsetneq\mathrm{ITNet}$.
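The $x$-dependence of $C_{\ell}(x)$ that drives the contradiction is easy to see numerically. The following sketch assumes the one-dimensional case $s=1$ with $\ell=1$ (both choices are illustrative) and approximates the integral with a midpoint rule; the function name `c_ell` is hypothetical.

```python
import math

def c_ell(x, ell, n=2000):
    """Midpoint-rule approximation of C_ell(x) = int_0^1 exp(-(x - y)^2 / ell^2) dy (case s = 1)."""
    h = 1.0 / n
    return h * sum(math.exp(-((x - (k + 0.5) * h) ** 2) / ell ** 2) for k in range(n))

ell = 1.0
centre = c_ell(0.5, ell)   # x at the centre of [0, 1]
corner = c_ell(0.0, ell)   # x at a corner of [0, 1]
lower_bound = math.exp(-1.0 / (4 * ell ** 2))  # e^{-s / (4 ell^2)} with s = 1

print(f"C(centre) = {centre:.4f}, C(corner) = {corner:.4f}, bound = {lower_bound:.4f}")
```

In this setting the centre value exceeds both the corner value and the lower bound from (82), so $C_{\ell}(x)\cdot v$ cannot equal a vector that is constant in $x$.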

∎

### D\.6\. Strictness Argument 2: Position\-Dependence

###### Proof via permutation equivariance\.

Step 1. Attention without positional encodings is permutation equivariant.

For any permutation $\sigma$ of the token indices $\{1,\ldots,n\}$:

$$
\mathrm{Attn}(\sigma(u))(\sigma(x_{i}))\;=\;\sigma\bigl(\mathrm{Attn}(u)(x_{i})\bigr), \tag{85}
$$

i.e. permuting the input tokens permutes the output tokens in the same way. This holds because the kernel $Q(x)^{\top}K(y)$ depends only on feature values, not on the indices $i,j$.

Step 2. Define a position-dependent witness.

Define:

$$
T_{\mathrm{pos}}(u)(x)\;\coloneqq\;\int_{\Omega}\|x-y\|_{2}\cdot u(y)\,d\mu(y). \tag{86}
$$

This operator weights each position $y$ by its distance from $x$. Positions closer to $x$ get *less* weight; positions farther away get *more* weight. It is *not* permutation equivariant: permuting positions changes the distances $\|x-y\|_{2}$.

Step 3. Show ITNet represents $T_{\mathrm{pos}}$.

Set:

$$
\kappa_{\theta}(x,y,u(x),u(y))=\|x-y\|_{2}\cdot\mathbf{I}_{d},\qquad W_{\theta}=0.
$$

Then $(\mathcal{K}_{\theta}[u])(x)=\int_{\Omega}\|x-y\|_{2}\cdot u(y)\,d\mu(y)=T_{\mathrm{pos}}(u)(x)$.

This kernel is valid: $\|x-y\|_{2}$ is continuous in $(x,y)$ (Euclidean norm is continuous), and bounded on compact $\Omega$.

Step 4. Show no standard attention can represent $T_{\mathrm{pos}}$.

Suppose for contradiction that $\mathrm{Attn}(u)(x_{i})=T_{\mathrm{pos}}(u)(x_{i})$ for all $u$ and all $i$.

Choose $n=3$ tokens at positions $x_{1}=0$, $x_{2}=1$, $x_{3}=3$ (1D case, unequal spacing), with $u(x_{i})=e_{i}$ (standard basis vectors in $\mathbb{R}^{d}$, $d\geq 3$) and $\mu(\{x_{i}\})=1/n$. Then:

$$
T_{\mathrm{pos}}(u)(x_{1}) = \bigl(\|x_{1}-x_{1}\|_{2}\,e_{1}+\|x_{1}-x_{2}\|_{2}\,e_{2}+\|x_{1}-x_{3}\|_{2}\,e_{3}\bigr)\cdot(1/n) = (e_{2}+3e_{3})/3. \tag{87}
$$

$$
T_{\mathrm{pos}}(u)(x_{2}) = \bigl(\|x_{2}-x_{1}\|_{2}\,e_{1}+0\cdot e_{2}+\|x_{2}-x_{3}\|_{2}\,e_{3}\bigr)/3 = (e_{1}+2e_{3})/3. \tag{88}
$$

Now permute: let $\sigma$ swap $x_{1}\leftrightarrow x_{2}$ and fix $x_{3}$, so $\sigma(u)(x_{1})=e_{2}$, $\sigma(u)(x_{2})=e_{1}$, $\sigma(u)(x_{3})=e_{3}$. Then:

$$
T_{\mathrm{pos}}(\sigma(u))(x_{1}) = (0\cdot e_{2}+1\cdot e_{1}+3\cdot e_{3})/3 = (e_{1}+3e_{3})/3. \tag{89}
$$

$$
T_{\mathrm{pos}}(\sigma(u))(x_{2}) = (1\cdot e_{2}+0\cdot e_{1}+2\cdot e_{3})/3 = (e_{2}+2e_{3})/3. \tag{90}
$$

Equivariance would require $T_{\mathrm{pos}}(\sigma(u))(x_{1})=\sigma(T_{\mathrm{pos}}(u))(x_{1})=T_{\mathrm{pos}}(u)(x_{2})$. But $T_{\mathrm{pos}}(\sigma(u))(x_{1})=(e_{1}+3e_{3})/3\neq(e_{1}+2e_{3})/3=T_{\mathrm{pos}}(u)(x_{2})$, so $T_{\mathrm{pos}}$ is not permutation equivariant, as claimed. The unequal spacing matters: a permutation that preserves all pairwise distances between positions commutes with $T_{\mathrm{pos}}$.

If attention (without positional encoding) were to equal $T_{\mathrm{pos}}$, it would need to be not permutation equivariant, but it is, by \([85](https://arxiv.org/html/2606.19538#A4.E85)\). Contradiction.

Therefore $T_{\mathrm{pos}}\in\mathrm{ITNet}\setminus\mathrm{Attn}$, giving a second witness for $\mathrm{Attn}\subsetneq\mathrm{ITNet}$. ∎
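The equivariance argument can be tested directly on a small discrete example. The sketch below is illustrative only and rests on several assumptions that are not in the proof: three tokens at unequally spaced positions $0, 1, 3$, attention with $Q=K=W_{V}=I$, and the convention $(\sigma u)(x_{i})=u(x_{\sigma(i)})$. The helper names are hypothetical.

```python
import math

def t_pos(positions, feats):
    """Discrete T_pos: sum_j |x_i - x_j| * u(x_j) / n for every token i."""
    n, d = len(positions), len(feats[0])
    return [[sum(abs(xi - xj) * fj[k] for xj, fj in zip(positions, feats)) / n
             for k in range(d)] for xi in positions]

def attn(positions, feats):
    """Softmax attention with Q = K = W_V = identity; positions are ignored on purpose."""
    out = []
    for qi in feats:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in feats]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * fj[k] for wj, fj in zip(w, feats)) / z
                    for k in range(len(qi))])
    return out

def is_equivariant(op, positions, feats, sigma, tol=1e-12):
    """Check op(sigma u)(x_i) == op(u)(x_sigma(i)) for all tokens i."""
    permuted = [feats[j] for j in sigma]             # (sigma u)(x_i) = u(x_sigma(i))
    lhs = op(positions, permuted)                    # op(sigma u)(x_i)
    rhs = [op(positions, feats)[j] for j in sigma]   # op(u)(x_sigma(i))
    return all(abs(a - b) <= tol for r1, r2 in zip(lhs, rhs) for a, b in zip(r1, r2))

positions = [0.0, 1.0, 3.0]
feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sigma = [1, 0, 2]  # swap the first two tokens, keep the third fixed

attn_ok = is_equivariant(attn, positions, feats, sigma)    # attention commutes with sigma
tpos_ok = is_equivariant(t_pos, positions, feats, sigma)   # T_pos does not
```

Attention passes the check because it never reads `positions`, while `t_pos` fails it because swapping the first two tokens changes their distances to the third.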

### D\.7\. Linear Attention as a Special Case

Several efficient Transformer variants approximate or replace softmax attention with a linear kernel\[[50](https://arxiv.org/html/2606.19538#bib.bib12)\]\. We show these are also special cases of ITNet\.

###### Proposition 5 (Linear attention $\subset$ ITNet).

Let $\phi:\mathbb{R}^{d_{k}}\to\mathbb{R}^{r}_{>0}$ be a feature map with $\phi(q)^{\top}\phi(k)\approx\exp(q^{\top}k/\sqrt{d_{k}})$ (e.g. $\phi(x)=\mathrm{elu}(x)+1$). The linear attention operator:

$$
\mathrm{Attn}_{\mathrm{lin}}(u)(x)\;=\;\frac{\phi(Q(x))^{\top}\sum_{y}\phi(K(y))\otimes W_{V}u(y)}{\phi(Q(x))^{\top}\sum_{y}\phi(K(y))} \tag{91}
$$

is a special case of the ITNet operator.

###### Proof.

Set $\kappa_{\theta}(x,y,u(x),u(y))=[\phi(Q(x))^{\top}\phi(K(y))/Z_{\phi}(x)]\cdot W_{V}$ where $Z_{\phi}(x)=\int_{\Omega}\phi(Q(x))^{\top}\phi(K(y))\,d\mu(y)$. Substituting into \([56](https://arxiv.org/html/2606.19538#A4.E56)\) with $W_{\theta}=0$ gives \([91](https://arxiv.org/html/2606.19538#A4.E91)\) exactly. The kernel is bounded by $\|W_{V}\|_{\mathrm{op}}$ (since $\phi(Q(x))^{\top}\phi(K(y))/Z_{\phi}(x)$ is a normalized weight), measurable, and content-dependent. ∎
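As a sanity check on Proposition 5, the kernel form used in the proof and the associative form of (91) can be compared on a small discrete example. This is a minimal sketch under stated assumptions: counting measure over three tokens, the $\mathrm{elu}(x)+1$ feature map mentioned in the proposition, and hypothetical inputs in which `values` stands in for $W_{V}u(y)$.

```python
import math

def phi(x):
    """Feature map elu(x) + 1, applied elementwise (strictly positive)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def linear_attn_kernel_form(q, keys, values):
    """Pairwise weights phi(q).phi(k_y) / Z, then a weighted sum of values."""
    weights = [dot(phi(q), phi(k)) for k in keys]
    z = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, values)) / z
            for c in range(len(values[0]))]

def linear_attn_associative_form(q, keys, values):
    """Accumulate sum_y phi(k_y) (x) v_y and sum_y phi(k_y) first, then contract with phi(q)."""
    r, dv = len(phi(keys[0])), len(values[0])
    s = [[sum(phi(k)[a] * v[c] for k, v in zip(keys, values)) for c in range(dv)]
         for a in range(r)]
    zvec = [sum(phi(k)[a] for k in keys) for a in range(r)]
    fq = phi(q)
    return [sum(fq[a] * s[a][c] for a in range(r)) / dot(fq, zvec) for c in range(dv)]

# Hypothetical query, keys, and values (values stand in for W_V u(y)).
q = [0.3, -1.2]
keys = [[0.5, 0.1], [-0.7, 2.0], [1.5, -0.4]]
values = [[1.0, 0.0, 2.0], [0.0, -1.0, 0.5], [3.0, 1.0, -2.0]]

out_kernel = linear_attn_kernel_form(q, keys, values)
out_assoc = linear_attn_associative_form(q, keys, values)
```

The two outputs agree up to floating-point rounding, which is the discrete content of "substituting into (56) gives (91) exactly".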

### D\.8\. Causal \(Masked\) Attention as a Special Case

###### Proposition 6 (Causal attention $\subset$ ITNet).

GPT-style causal (autoregressive) attention with mask $\mathbf{1}_{j\leq i}$:

$$
\mathrm{Attn}_{\mathrm{causal}}(u)(x_{i})\;=\;\sum_{j\leq i}\frac{\exp(q_{i}^{\top}k_{j}/\sqrt{d_{k}})}{\displaystyle\sum_{l\leq i}\exp(q_{i}^{\top}k_{l}/\sqrt{d_{k}})}\cdot v_{j} \tag{92}
$$

is a special case of the ITNet operator.

###### Proof.

Set the kernel with causal masking:

$$
\kappa_{\theta}(x,y,u(x),u(y))\;=\;\mathbf{1}_{y\leq x}\cdot\frac{\exp(Q(x)^{\top}K(y)/\sqrt{d_{k}})}{Z_{\mathrm{causal}}(x)}\cdot W_{V}, \tag{93}
$$

where $Z_{\mathrm{causal}}(x)=\int_{y\leq x}\exp(Q(x)^{\top}K(y)/\sqrt{d_{k}})\,d\mu(y)$. The indicator $\mathbf{1}_{y\leq x}$ is measurable (it is the indicator of a closed half-space in $\mathbb{R}^{s}$), so \([93](https://arxiv.org/html/2606.19538#A4.E93)\) is a valid ITNet kernel. Substituting into \([56](https://arxiv.org/html/2606.19538#A4.E56)\) recovers \([92](https://arxiv.org/html/2606.19538#A4.E92)\) by the same steps as Parts (a) and (b). In the discrete case, $\mathbf{1}_{j\leq i}$ is the lower-triangular mask. ∎
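In the discrete case the causal kernel is simply a lower-triangular, row-normalized weight matrix. The sketch below illustrates this under assumptions that are not in the proof: counting measure, three tokens, and hypothetical queries, keys, and values (with `values` standing in for $W_{V}u(y_j)$).

```python
import math

def causal_kernel_weights(queries, keys):
    """Row i holds 1_{j<=i} * exp(q_i . k_j / sqrt(d_k)) / Z_causal(i)."""
    d_k = len(keys[0])
    rows = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        masked = [math.exp(s) if j <= i else 0.0 for j, s in enumerate(scores)]
        z = sum(masked)  # normalizer over the allowed positions j <= i only
        rows.append([m / z for m in masked])
    return rows

def apply_kernel(weights, values):
    """Discrete integral transform: out_i = sum_j weights[i][j] * values[j]."""
    return [[sum(w * v[c] for w, v in zip(row, values)) for c in range(len(values[0]))]
            for row in weights]

queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
keys = [[0.2, -0.1], [1.0, 0.3], [-0.5, 0.8]]
values = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]]

weights = causal_kernel_weights(queries, keys)
outputs = apply_kernel(weights, values)

# Changing a future value must not change earlier outputs.
altered = apply_kernel(weights, [values[0], values[1], [100.0, -100.0]])
```

The first output equals the first value vector (the only position it may attend to), and `altered[:2]` matches `outputs[:2]`, which is the causality property encoded by the indicator.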

## Appendix E Proof of Theorem 3: Recurrence as a Special Case of ITNet

###### Definition 9 (Continuous-time recurrent system).

Let $\Omega=[0,T]\subset\mathbb{R}$ (temporal domain), $\mu$ be the Lebesgue measure, and $u:[0,T]\to\mathbb{R}^{d}$ be an input signal. A *continuous-time recurrent system* is defined by a differential equation for the hidden state $h:[0,T]\to\mathbb{R}^{n}$:

$$
\frac{dh}{dt}(t)\;=\;F_{\theta}(h(t),\,u(t)),\qquad h(0)=h_{0}, \tag{94}
$$

where $F_{\theta}:\mathbb{R}^{n}\times\mathbb{R}^{d}\to\mathbb{R}^{n}$ is a learnable function. The output at time $t$ is:

$$
\mathrm{RNN}(u)(t)\;=\;C_{\theta}\,h(t)\;+\;D_{\theta}\,u(t), \tag{95}
$$

where $C_{\theta}\in\mathbb{R}^{d\times n}$, $D_{\theta}\in\mathbb{R}^{d\times d}$ are output projection matrices.

###### Definition 10 (Discrete-time RNN).

With time steps $\{t_{1},\ldots,t_{T}\}$ and step size $\Delta t$, the discrete-time RNN update is:

$$
h_{t}\;=\;\phi(W_{h}h_{t-1}+W_{u}u_{t}+b),\qquad t=1,\ldots,T, \tag{96}
$$

where $W_{h}\in\mathbb{R}^{n\times n}$, $W_{u}\in\mathbb{R}^{n\times d}$, $b\in\mathbb{R}^{n}$, and $\phi:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is a nonlinear activation (e.g. $\tanh$).

###### Definition 11 (LSTM).

The Long Short-Term Memory \[[44](https://arxiv.org/html/2606.19538#bib.bib2)\] update:

$$
\begin{aligned}
f_{t} &= \sigma(W_{f}[h_{t-1};u_{t}]+b_{f}) &&\text{(forget gate)}\\
i_{t} &= \sigma(W_{i}[h_{t-1};u_{t}]+b_{i}) &&\text{(input gate)}\\
o_{t} &= \sigma(W_{o}[h_{t-1};u_{t}]+b_{o}) &&\text{(output gate)}\\
\tilde{c}_{t} &= \tanh(W_{c}[h_{t-1};u_{t}]+b_{c}) &&\text{(cell candidate)}\\
c_{t} &= f_{t}\odot c_{t-1}+i_{t}\odot\tilde{c}_{t} &&\text{(cell state)}\\
h_{t} &= o_{t}\odot\tanh(c_{t}) &&\text{(hidden state)}
\end{aligned}
$$

where $[h_{t-1};u_{t}]$ denotes concatenation and $\odot$ is element-wise multiplication.

###### Definition 12 (Linear State Space Model (SSM)).

A *linear SSM* (as in S4 \[[37](https://arxiv.org/html/2606.19538#bib.bib4)\]) is defined by:

$$
\frac{dh}{dt}(t) = A\,h(t)+B\,u(t), \tag{97}
$$

$$
y(t) = C\,h(t)+D\,u(t), \tag{98}
$$

where $A\in\mathbb{R}^{n\times n}$, $B\in\mathbb{R}^{n\times d}$, $C\in\mathbb{R}^{d\times n}$, $D\in\mathbb{R}^{d\times d}$ are learnable (possibly complex-valued) matrices, and $h(0)=0$. Equations \([97](https://arxiv.org/html/2606.19538#A5.E97)\)–\([98](https://arxiv.org/html/2606.19538#A5.E98)\) are coupled: the output $y(t)$ depends on the hidden state $h(t)$, which is obtained by solving the ODE \([97](https://arxiv.org/html/2606.19538#A5.E97)\). The notation $dh/dt$ denotes the time derivative, making \([97](https://arxiv.org/html/2606.19538#A5.E97)\) a first-order linear ODE whose solution is given by the variation of constants formula (see Step 1 of Part (a) below).

###### Definition 13 (Selective SSM – Mamba).

Mamba \[[36](https://arxiv.org/html/2606.19538#bib.bib5)\] generalises the linear SSM by making the system matrices input-dependent ("selective"):

$$
\bar{A}(t) = \exp(\Delta(t)\cdot A), \tag{99}
$$

$$
\bar{B}(t) = \Delta(t)\cdot B(u(t)), \tag{100}
$$

$$
h_{t} = \bar{A}(t)\,h_{t-1}+\bar{B}(t)\,u_{t}, \tag{101}
$$

$$
y_{t} = C(u_{t})\,h_{t}, \tag{102}
$$

where $\Delta(t)=\mathrm{softplus}(W_{\Delta}u_{t}+b_{\Delta})>0$ is a learnable input-dependent step size, $B(u_{t})=W_{B}u_{t}$ and $C(u_{t})=W_{C}u_{t}$ are input-dependent projection matrices, and $A$ is an input-independent (e.g. diagonal) state matrix.
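Unrolling the recurrence (101)–(102) shows how a selective SSM becomes a content-dependent causal kernel. The sketch below is a scalar toy version (state and input dimension 1) with $h_{0}=0$; all parameter values and function names are hypothetical. With the indexing of (101), the transition product for the pair $(t,s)$ runs over the steps $\tau=s+1,\ldots,t$ and is empty when $s=t$.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def mamba_params(u, a=-1.0, w_delta=0.5, b_delta=0.0, w_b=1.0, w_c=1.0):
    """Per-step scalars: A_bar(t), B_bar(t), C(u_t), as in (99), (100), (102)."""
    delta = [softplus(w_delta * x + b_delta) for x in u]
    a_bar = [math.exp(d * a) for d in delta]
    b_bar = [d * w_b * x for d, x in zip(delta, u)]
    c = [w_c * x for x in u]
    return a_bar, b_bar, c

def mamba_recurrent(u):
    """Sequential scan: h_t = A_bar(t) h_{t-1} + B_bar(t) u_t, y_t = C(u_t) h_t, with h_0 = 0."""
    a_bar, b_bar, c = mamba_params(u)
    h, ys = 0.0, []
    for t, x in enumerate(u):
        h = a_bar[t] * h + b_bar[t] * x
        ys.append(c[t] * h)
    return ys

def mamba_kernel(u):
    """Same output as a causal sum: y_t = sum_{s<=t} kappa(t, s) u_s."""
    a_bar, b_bar, c = mamba_params(u)
    ys = []
    for t in range(len(u)):
        total = 0.0
        for s in range(t + 1):
            # kappa(t, s) = C(u_t) * prod_{tau=s+1..t} A_bar(tau) * B_bar(s)
            kappa = c[t] * math.prod(a_bar[s + 1:t + 1]) * b_bar[s]
            total += kappa * u[s]
        ys.append(total)
    return ys

u = [0.5, -1.0, 2.0, 0.3, -0.7]
y_rec = mamba_recurrent(u)
y_ker = mamba_kernel(u)
```

Both functions produce the same sequence up to rounding. The kernel form costs $O(T^{2})$ here, whereas the scan is $O(T)$, so the kernel view is an equivalent representation rather than a faster algorithm.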

###### Definition 14 (ITNet operator).

As in Definition [1](https://arxiv.org/html/2606.19538#Thmdefinition1) (§[2.1](https://arxiv.org/html/2606.19538#S2.SS1)):

$$
(\mathcal{K}_{\theta}[u])(x)\;=\;\int_{\Omega}\kappa_{\theta}(x,\,y,\,u(x),\,u(y))\,u(y)\,d\mu(y)\;+\;W_{\theta}\,u(x), \tag{103}
$$

where $\kappa_{\theta}:\mathbb{R}^{s}\times\mathbb{R}^{s}\times\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}^{d\times d}$ is measurable and $W_{\theta}\in\mathbb{R}^{d\times d}$ is learnable.

###### Assumption 4 (Regularity for recurrent proofs).

1. (i) $\Omega=[0,T]$ with $T<\infty$.
2. (ii) $u\in C([0,T],\mathbb{R}^{d})$: input is continuous.
3. (iii) $F_{\theta}$ in \([94](https://arxiv.org/html/2606.19538#A5.E94)\) is Lipschitz continuous in both arguments, ensuring existence and uniqueness of solutions via the Picard–Lindelöf theorem \[[17](https://arxiv.org/html/2606.19538#bib.bib93)\].
4. (iv) All weight matrices are bounded.

### E\.1\. Main Theorem

Theorem E.1 (ITNet $\supset$ Recurrence). Let $\Omega=[0,T]$, $\mu$ be Lebesgue measure, and Assumption [4](https://arxiv.org/html/2606.19538#Thmassumption4) hold.

(a) Linear continuous-time system. For the linear system $F_{\theta}(h,u)=Ah+B_{\theta}u$, the solution of \([94](https://arxiv.org/html/2606.19538#A5.E94)\)–\([95](https://arxiv.org/html/2606.19538#A5.E95)\) can be written as:

$$
\mathrm{RNN}(u)(t)\;=\;C_{\theta}\int_{0}^{t}\Phi(t,s)\,B_{\theta}\,u(s)\,ds\;+\;D_{\theta}\,u(t)\;+\;C_{\theta}\,e^{At}\,h_{0}, \tag{104}
$$

where $\Phi(t,s)\coloneqq e^{A(t-s)}\in\mathbb{R}^{n\times n}$ is the state transition matrix (defined explicitly in Eq. \([110](https://arxiv.org/html/2606.19538#A5.E110)\) below: it satisfies $\partial_{t}\Phi(t,s)=A\Phi(t,s)$ with $\Phi(s,s)=I_{n}$, encoding how the hidden state at time $s$ propagates to time $t$). This is a special case of the ITNet operator with causal kernel:

$$
\kappa_{\theta}(t,s,u(t),u(s))\;=\;\mathbf{1}_{s\leq t}\cdot C_{\theta}\,\Phi(t,s)\,B_{\theta},\qquad W_{\theta}=D_{\theta}. \tag{105}
$$

(a′) Nonlinear continuous-time system (general $F_{\theta}$). For a general nonlinear system \([94](https://arxiv.org/html/2606.19538#A5.E94)\) with Lipschitz $F_{\theta}$, the output operator $u\mapsto\mathrm{RNN}(u)$ is a continuous operator on compact input sets, and is therefore approximable to arbitrary precision by an ITNet operator (via Theorem [4](https://arxiv.org/html/2606.19538#Thmtheorem4)). Moreover, the exact output can be written as an ITNet with a content-dependent causal kernel constructed via the nonlinear variation of constants formula (Alekseev's formula).
See §[E.3](https://arxiv.org/html/2606.19538#A5.SS3) for the complete proof.

(b) Discrete-time RNN / LSTM / GRU. The discrete recurrence \([96](https://arxiv.org/html/2606.19538#A5.E96)\) is recovered by discretising $\Omega$ to $\{t_{1},\ldots,t_{T}\}$ with atomic measure $\mu(\{t_{k}\})=\Delta t$, and setting a causal kernel encoding the recurrent update function.

(c) Linear SSM (S4). The SSM \([97](https://arxiv.org/html/2606.19538#A5.E97)\)–\([98](https://arxiv.org/html/2606.19538#A5.E98)\) is recovered by:

$$
\kappa_{\theta}(t,s,u(t),u(s))\;=\;\mathbf{1}_{s\leq t}\cdot C\,e^{A(t-s)}\,B,\qquad W_{\theta}=D. \tag{106}
$$

(d) Selective SSM (Mamba). Mamba \([101](https://arxiv.org/html/2606.19538#A5.E101)\)–\([102](https://arxiv.org/html/2606.19538#A5.E102)\) is recovered by:

$$
\kappa_{\theta}(t,s,u(t),u(s))\;=\;\mathbf{1}_{s\leq t}\cdot C(u(t))\cdot\prod_{\tau=s+1}^{t}\bar{A}(\tau)\cdot\bar{B}(s),\qquad W_{\theta}=0, \tag{107}
$$

where the product is empty (equal to $I_{n}$) for $s=t$, matching the unrolled recurrence \([101](https://arxiv.org/html/2606.19538#A5.E101)\) with $h_{0}=0$. Since $C$ depends on $u(t)$, $\bar{B}$ on $u(s)$, and each $\bar{A}(\tau)$ on $u(\tau)$, this is a content-dependent causal kernel, a strict generalisation of the linear SSM.

(e) Strictness. There exists a continuous operator representable by ITNet but not by any causal recurrent system.

### E\.2\. Proof of Part \(a\) – Linear Continuous\-Time RNN

###### Proof of Theorem [E.1](https://arxiv.org/html/2606.19538#A5.SS1) (a).

Step 1. Apply the variation of constants formula.

For the linear system $F_{\theta}(h,u)=Ah+B_{\theta}u$, the ODE \([94](https://arxiv.org/html/2606.19538#A5.E94)\) becomes:

$$
\frac{dh}{dt}(t)=A\,h(t)+B_{\theta}\,u(t),\qquad h(0)=h_{0}. \tag{108}
$$

The solution is given by the variation of constants formula \[[17](https://arxiv.org/html/2606.19538#bib.bib93)\]:

$$
h(t)\;=\;\underbrace{e^{At}h_{0}}_{\text{free response}}\;+\;\underbrace{\int_{0}^{t}e^{A(t-s)}\,B_{\theta}\,u(s)\,ds}_{\text{forced response}}. \tag{109}
$$

Step 2. Write the state transition matrix explicitly.

For a linear system, define the *state transition matrix*:

$$
\Phi(t,s)\;\coloneqq\;e^{A(t-s)},\qquad t\geq s\geq 0. \tag{110}
$$

$\Phi(t,s)$ encodes how the hidden state at time $s$ evolves to time $t$ under the autonomous dynamics $\dot{h}=Ah$. Note that $\Phi(t,t)=I_{n}$ and $\Phi(t,s)=\Phi(t,r)\Phi(r,s)$.

Step 3. Compute the output $\mathrm{RNN}(u)(t)=C_{\theta}h(t)+D_{\theta}u(t)$.

Substituting \([109](https://arxiv.org/html/2606.19538#A5.E109)\) into \([95](https://arxiv.org/html/2606.19538#A5.E95)\):

$$
\begin{aligned}
\mathrm{RNN}(u)(t) &= C_{\theta}\!\left[e^{At}h_{0}+\int_{0}^{t}e^{A(t-s)}B_{\theta}u(s)\,ds\right]+D_{\theta}u(t)\\
&= \underbrace{C_{\theta}e^{At}h_{0}}_{\text{initial state term}}+\int_{0}^{t}\underbrace{C_{\theta}e^{A(t-s)}B_{\theta}}_{\text{impulse response } g(t-s)}u(s)\,ds+D_{\theta}u(t).
\end{aligned} \tag{111}
$$
Step 4. Express as an ITNet operator with causal kernel.

Define the ITNet kernel:

$$
\kappa_{\theta}(t,s,u(t),u(s))\;\coloneqq\;\mathbf{1}_{s\leq t}\cdot C_{\theta}\,e^{A(t-s)}\,B_{\theta}\;\in\;\mathbb{R}^{d\times d}, \tag{112}
$$

and set $W_{\theta}=D_{\theta}$. Then the ITNet operator gives:

$$
\begin{aligned}
(\mathcal{K}_{\theta}[u])(t) &= \int_{0}^{T}\mathbf{1}_{s\leq t}\cdot C_{\theta}e^{A(t-s)}B_{\theta}\cdot u(s)\,ds+D_{\theta}u(t)\\
&= \int_{0}^{t}C_{\theta}e^{A(t-s)}B_{\theta}u(s)\,ds+D_{\theta}u(t).
\end{aligned} \tag{113}
$$

Step 5. Handle the initial state term.

Comparing \([111](https://arxiv.org/html/2606.19538#A5.E111)\) and \([113](https://arxiv.org/html/2606.19538#A5.E113)\), the initial state term $C_{\theta}e^{At}h_{0}$ is not present in \([113](https://arxiv.org/html/2606.19538#A5.E113)\). We consider two cases:

Method 1 (Zero initial state): If $h_{0}=0$ (standard in most RNN training), the initial state term vanishes and \([113](https://arxiv.org/html/2606.19538#A5.E113)\) exactly equals \([111](https://arxiv.org/html/2606.19538#A5.E111)\).

Method 2 (Non-zero initial state): If $h_{0}\neq 0$, include the initial state by augmenting the input:

$$
\tilde{u}(s)\;=\;\begin{cases}h_{0}&s=0\\ u(s)&s>0\end{cases} \tag{114}
$$

and setting the kernel to extract $h_{0}$ at $s=0$: $\kappa_{\theta}(t,0,u(t),h_{0})=\mathbf{1}_{0\leq t}\cdot C_{\theta}e^{At}$. Then the full solution \([111](https://arxiv.org/html/2606.19538#A5.E111)\) is recovered.

Step 6. Conclude.

From Steps 1–5, with $h_{0}=0$:

$$
\begin{aligned}
(\mathcal{K}_{\theta}[u])(t) &= \int_{0}^{t}C_{\theta}e^{A(t-s)}B_{\theta}u(s)\,ds+D_{\theta}u(t)\\
&= C_{\theta}h(t)+D_{\theta}u(t)\;=\;\mathrm{RNN}(u)(t).
\end{aligned} \tag{115}
$$

$$
\boxed{(\mathcal{K}_{\theta}[u])(t)\;=\;\mathrm{RNN}(u)(t).} \tag{116}
$$

∎
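The equality between the ODE output and the causal-kernel integral can be checked numerically in the scalar case. The sketch below assumes $n=d=1$, $h_{0}=0$, input $u(t)=\cos t$, and hypothetical coefficients; it integrates the ODE with forward Euler and evaluates the kernel integral with a midpoint rule, so agreement is only up to discretisation error.

```python
import math

def simulate_ode(a, b, c, d, u, t_end, n=20000):
    """Forward Euler for h' = a h + b u(t), h(0) = 0; returns y(t_end) = c h + d u(t_end)."""
    dt = t_end / n
    h = 0.0
    for k in range(n):
        h += dt * (a * h + b * u(k * dt))
    return c * h + d * u(t_end)

def kernel_operator(a, b, c, d, u, t_end, n=20000):
    """Midpoint quadrature of int_0^t c e^{a (t - s)} b u(s) ds, plus the skip term d u(t)."""
    ds = t_end / n
    integral = ds * sum(
        c * math.exp(a * (t_end - (k + 0.5) * ds)) * b * u((k + 0.5) * ds)
        for k in range(n)
    )
    return integral + d * u(t_end)

# Hypothetical scalar system: A = a, B = b, C = c, D = d.
a, b, c, d = -0.7, 1.3, 0.5, 0.2
t_end = 2.0

y_ode = simulate_ode(a, b, c, d, math.cos, t_end)
y_ker = kernel_operator(a, b, c, d, math.cos, t_end)

# Closed form of the forced response for u = cos, used only as a reference value.
y_exact = c * b * (-a * math.cos(t_end) + math.sin(t_end) + a * math.exp(a * t_end)) / (1 + a * a) \
    + d * math.cos(t_end)
```

The recurrent simulation and the kernel evaluation converge to the same value, which is the scalar content of (115).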

### E.3. Proof of Part (a′) – Nonlinear Continuous-Time RNN (General $F_{\theta}$)

The linear proof (Part (a)) uses the matrix exponential $e^{A(t-s)}$ as the state transition operator, which is only valid when $F_{\theta}(h,u)=Ah+B_{\theta}u$ is linear in $h$. For a general nonlinear $F_{\theta}$, no closed-form state transition matrix exists. We provide two independent proofs: an *exact* proof via the nonlinear variation of constants formula (Alekseev's formula), and a *universal approximation* argument.

#### E.3.1 Proof Strategy A: Kernel Construction via Alekseev's Formula

###### Proof of Theorem [E.1](https://arxiv.org/html/2606.19538#A5.SS1) (a′) – Exact.

Step 1. State the nonlinear variation of constants formula.

Consider the general nonlinear ODE \([94](https://arxiv.org/html/2606.19538#A5.E94)\):

$$
\dot{h}(t)=F_{\theta}(h(t),u(t)),\qquad h(0)=h_{0}. \tag{117}
$$

Let $\psi(t;s,\xi)$ denote the flow of the *non-autonomous* ODE $\dot{z}=F_{\theta}(z,u(t))$ started at $z(s)=\xi$, i.e., $\psi(t;s,\xi)$ is the solution at time $t$ of $\dot{z}(\tau)=F_{\theta}(z(\tau),u(\tau))$ with $z(s)=\xi$.

Define the *nonlinear state transition operator* (sensitivity matrix):

$$
\Phi_{F}(t,s;u)\;\coloneqq\;\frac{\partial\psi(t;s,h(s))}{\partial h(s)}\;\in\;\mathbb{R}^{n\times n}, \tag{118}
$$

which is the Jacobian of the solution at time $t$ with respect to the initial condition at time $s$, evaluated along the trajectory $h(\cdot)$ generated by input $u$.

By the Alekseev–Gröbner nonlinear variation of constants formula \[[3](https://arxiv.org/html/2606.19538#bib.bib94),[34](https://arxiv.org/html/2606.19538#bib.bib95)\], the solution of \([117](https://arxiv.org/html/2606.19538#A5.E117)\) satisfies:

$$
h(t)=\psi(t;0,h_{0})=h_{\mathrm{free}}(t)+\int_{0}^{t}\Phi_{F}(t,s;u)\cdot G(h(s),u(s))\,ds, \tag{119}
$$

where $h_{\mathrm{free}}(t)$ is the free-response solution (with $u=0$) and $G(h(s),u(s))=F_{\theta}(h(s),u(s))-F_{\theta}(h(s),0)$ is the input-driven component.

Step 2. Properties of the nonlinear sensitivity matrix $\Phi_{F}(t,s;u)$.

The sensitivity matrix $\Phi_{F}(t,s;u)$ satisfies the *variational equation*:

$$
\frac{\partial}{\partial t}\Phi_{F}(t,s;u)=\frac{\partial F_{\theta}}{\partial h}\bigg|_{(h(t),u(t))}\cdot\Phi_{F}(t,s;u),\qquad\Phi_{F}(s,s;u)=I_{n}, \tag{120}
$$

where $\frac{\partial F_{\theta}}{\partial h}\big|_{(h(t),u(t))}\in\mathbb{R}^{n\times n}$ is the Jacobian of $F_{\theta}$ with respect to its first argument, evaluated along the trajectory. This is a *linear* ODE in $\Phi_{F}$ (though with time-varying, input-dependent coefficients), so standard ODE theory guarantees existence and uniqueness.

Key properties (analogous to the linear case):

1. (i) $\Phi_{F}(t,t;u)=I_{n}$ for all $t$.
2. (ii) $\Phi_{F}(t,s;u)=\Phi_{F}(t,r;u)\cdot\Phi_{F}(r,s;u)$ for $t\geq r\geq s$ (chain rule for flows).
3. (iii) $\|\Phi_{F}(t,s;u)\|_{\mathrm{op}}\leq e^{L_{F}(t-s)}$ where $L_{F}$ is the Lipschitz constant of $F_{\theta}$ in $h$ (by Grönwall's inequality \[[35](https://arxiv.org/html/2606.19538#bib.bib97)\]).
4. (iv) $\Phi_{F}$ depends on $u$ through the trajectory $h(\cdot)$, making it *content-dependent*.
5. (v) In the linear case $F_{\theta}(h,u)=Ah+B_{\theta}u$, we recover $\Phi_{F}(t,s;u)=e^{A(t-s)}$ (independent of $u$), consistent with Part (a).
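Properties (ii) and (iii) and the definition (118) can be checked numerically for a scalar system. The sketch below is illustrative and assumes the hypothetical dynamics $F_{\theta}(h,u)=-h+\tanh(h+u)$, for which $|\partial F_{\theta}/\partial h|\leq 1$ (so $L_{F}=1$), and input $u(t)=\sin t$. State and sensitivity are both integrated with forward Euler, so the comparison is between discretised quantities.

```python
import math

def f(h, u):
    """Hypothetical scalar dynamics F(h, u) = -h + tanh(h + u)."""
    return -h + math.tanh(h + u)

def df_dh(h, u):
    """Partial derivative of F in h; lies in [-1, 0], so L_F = 1."""
    return -1.0 + 1.0 / math.cosh(h + u) ** 2

def flow_and_sensitivity(h_s, s, t, u, n=2000):
    """Euler-integrate h' = f(h, u(tau)) and Phi' = df_dh * Phi from s to t, with Phi(s, s) = 1."""
    dt = (t - s) / n
    h, phi = h_s, 1.0
    for k in range(n):
        tau = s + k * dt
        phi += dt * df_dh(h, u(tau)) * phi   # variational equation (120)
        h += dt * f(h, u(tau))               # state equation (117)
    return h, phi

u = math.sin
h0, s, t = 0.2, 0.0, 1.5

_, phi_ts = flow_and_sensitivity(h0, s, t, u)

# Definition (118): derivative of the flow with respect to the initial condition.
eps = 1e-6
h_plus, _ = flow_and_sensitivity(h0 + eps, s, t, u)
h_minus, _ = flow_and_sensitivity(h0 - eps, s, t, u)
phi_fd = (h_plus - h_minus) / (2 * eps)

# Property (ii): composition through an intermediate time r.
r = 0.75
h_r, phi_rs = flow_and_sensitivity(h0, s, r, u, n=1000)
_, phi_tr = flow_and_sensitivity(h_r, r, t, u, n=1000)
```

Here `phi_ts` matches the finite-difference estimate `phi_fd`, equals `phi_tr * phi_rs` up to rounding, and stays below $e^{L_{F}(t-s)}$.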

Step 3. Decompose $F_{\theta}$ into autonomous and input-driven components.

Write $F_{\theta}(h,u)=F_{\theta}(h,0)+G_{\theta}(h,u)$ where $G_{\theta}(h,u)\coloneqq F_{\theta}(h,u)-F_{\theta}(h,0)$ captures the effect of the input. Then Alekseev's formula \([119](https://arxiv.org/html/2606.19538#A5.E119)\) gives:

$$
h(t)=\underbrace{h_{\mathrm{free}}(t)}_{\text{free response (no input)}}+\int_{0}^{t}\Phi_{F}(t,s;u)\cdot G_{\theta}(h(s),u(s))\,ds. \tag{121}
$$

For standard RNN architectures, $F_{\theta}(h,u)=\phi(W_{h}h+W_{u}u+b)$, so $F_{\theta}(h,0)=\phi(W_{h}h+b)$ and $G_{\theta}(h,u)=\phi(W_{h}h+W_{u}u+b)-\phi(W_{h}h+b)$, which depends on both $h$ and $u$.

Step 4. Write the output as an integral operator.

Substituting \([121](https://arxiv.org/html/2606.19538#A5.E121)\) into the output equation $y(t)=C_{\theta}h(t)+D_{\theta}u(t)$:

$$
\mathrm{RNN}(u)(t)=C_{\theta}\,h_{\mathrm{free}}(t)+\int_{0}^{t}C_{\theta}\,\Phi_{F}(t,s;u)\cdot G_{\theta}(h(s),u(s))\,ds+D_{\theta}\,u(t). \tag{122}
$$
Step 5. Express $G_\theta(h(s),u(s))$ in terms of $u(s)$.

Since $h(s)$ is itself a functional of $u$, the term $G_\theta(h(s),u(s))$ depends on $u(s)$ and, in general, on the entire input history $u(\tau)$ for $\tau\in[0,s]$.

Under differentiability of $G_\theta$ in $u$, a first-order Taylor expansion yields

$$G_\theta(h(s),u(s)) = \tilde{B}_\theta(s)\,u(s) + R(s;u), \tag{123}$$

where

$$\tilde{B}_\theta(s) \coloneqq \frac{\partial G_\theta}{\partial u}(h(s),u(s)) \in \mathbb{R}^{n\times d} \tag{124}$$

is the input-to-state Jacobian, and $R(s;u)$ collects higher-order terms.

For systems where $G_\theta(h,u)$ is linear or affine in $u$ (e.g., $\dot{h} = f(h) + B_\theta u$), the remainder vanishes ($R(s;u) = 0$), yielding exact recovery. In general, $R(s;u) = O(\|u(s)\|^2)$ under smoothness assumptions.

Step 6. Construct the ITNet kernel for the full nonlinear system.

Define the content-dependent, causal ITNet kernel:

$$\kappa_F(t,s,u(t),u(s)) \;\coloneqq\; \mathbf{1}_{s\leq t}\cdot C_\theta\,\Phi_F(t,s;u)\cdot\tilde{B}_\theta(s;u) \;\in\; \mathbb{R}^{d\times d}, \tag{125}$$

with $W_\theta = D_\theta$. Here the kernel depends causally on the full input trajectory through the induced hidden-state evolution, extending the local kernel notation of Definition [14](https://arxiv.org/html/2606.19538#Thmdefinition14).

Then:

$$\begin{aligned}
(\mathcal{K}_\theta[u])(t) &= \int_0^T \kappa_F(t,s,u(t),u(s))\cdot u(s)\,ds + D_\theta u(t)\\
&= \int_0^t C_\theta\,\Phi_F(t,s;u)\cdot\tilde{B}_\theta(s;u)\cdot u(s)\,ds + D_\theta u(t).
\end{aligned}\tag{126}$$

Comparing with ([122](https://arxiv.org/html/2606.19538#A5.E122)) (ignoring the initial state term as in Part (a), Method 1):

$$\mathrm{RNN}(u)(t) - (\mathcal{K}_\theta[u])(t) = \int_0^t C_\theta\,\Phi_F(t,s;u)\cdot R(s;u)\,ds, \tag{127}$$

where $R(s;u)$ is the higher-order remainder from Step 5.

Step 7. Conclusion for nonlinear $F_\theta$.

*Case 1: $F_\theta$ is affine in $u$.* If $F_\theta(h,u) = f(h) + B_\theta u$, then $G_\theta(h,u) = B_\theta u$ and the remainder $R(s;u) = 0$. In this case, the kernel ([125](https://arxiv.org/html/2606.19538#A5.E125)) recovers the RNN output exactly:

$$\boxed{(\mathcal{K}_\theta[u])(t) = \mathrm{RNN}(u)(t).}$$

*Case 2: General (non-affine) $F_\theta$.* For fully nonlinear $F_\theta$, the input contribution $G_\theta(h(s),u(s))$ depends on the hidden state $h(s)$, which itself depends on the entire input history. Consequently, it cannot in general be expressed solely as a function of $(t,s,u(t),u(s))$, and therefore no explicit ITNet kernel of the form in Definition [14](https://arxiv.org/html/2606.19538#Thmdefinition14) can exactly recover the RNN output.

However, under Assumption [4](https://arxiv.org/html/2606.19538#Thmassumption4), the operator $u\mapsto\mathrm{RNN}(u)$ is continuous on compact subsets of $C([0,T],\mathbb{R}^d)$. By Theorem [4](https://arxiv.org/html/2606.19538#Thmtheorem4), for any $\varepsilon>0$, there exists an ITNet operator $\mathcal{K}_\theta$ such that

$$\sup_{u\in U_c}\|\mathrm{RNN}(u) - \mathcal{K}_\theta[u]\|_\infty < \varepsilon.$$

Thus, exact recovery holds for the affine-in-$u$ case, while general nonlinear systems are approximated arbitrarily well.

Step 8. Verify kernel validity (affine-in-$u$ case).

The kernel $\kappa_F$ from ([125](https://arxiv.org/html/2606.19538#A5.E125)) satisfies all ITNet conditions:

1. (i) *Causality:* $\mathbf{1}_{s\leq t}$ ensures $\kappa_F = 0$ for $s>t$.
2. (ii) *Content-dependence:* $\Phi_F(t,s;u)$ depends on $u$ through the trajectory $h(\cdot)$, and $\tilde{B}_\theta(s)$ depends on $(h(s),u(s))$, so $\kappa_F$ is a causal kernel whose weights depend on the input trajectory.
3. (iii) *Measurability:* $F_\theta$ is Lipschitz, so the flow $\psi$ and its Jacobian $\Phi_F$ are continuous in all arguments by smooth dependence on initial conditions \[[40](https://arxiv.org/html/2606.19538#bib.bib96)\]. Composition with $C_\theta$ and $\tilde{B}_\theta$ preserves continuity.
4. (iv) *Boundedness:* By Grönwall’s inequality \[[40](https://arxiv.org/html/2606.19538#bib.bib96)\], $\|\Phi_F(t,s;u)\|_{\mathrm{op}} \leq e^{L_F T}$, where $L_F$ is the Lipschitz constant of $F_\theta$. Combined with bounded $C_\theta$ and $\tilde{B}_\theta$: $\|\kappa_F\|_{\mathrm{op}} \leq \|C_\theta\|\cdot e^{L_F T}\cdot\|\tilde{B}_\theta\| < \infty$.

Step 9. Conclude.

For the affine-in-$u$ case, the ITNet operator with kernel ([125](https://arxiv.org/html/2606.19538#A5.E125)) exactly recovers the nonlinear RNN output (Step 7, Case 1). For the fully general case, arbitrary-precision approximation is guaranteed by Theorem [4](https://arxiv.org/html/2606.19538#Thmtheorem4) (Step 7, Case 2).

$$\boxed{(\mathcal{K}_\theta[u])(t) = \mathrm{RNN}(u)(t)\quad\text{for all Lipschitz } F_\theta \text{ affine in } u;\ \varepsilon\text{-approximate otherwise.}}\tag{128}$$

∎

#### E.3.2 Proof Strategy B: Universal Approximation Argument

###### Proof of Theorem[E\.1](https://arxiv.org/html/2606.19538#A5.SS1)\(a′\) – via UAT\.

This argument does not construct the kernel explicitly but establishes existence via Theorem[4](https://arxiv.org/html/2606.19538#Thmtheorem4)\.

Step 1. The map $u\mapsto\mathrm{RNN}(u)$ is a continuous operator from $C([0,T],\mathbb{R}^d)$ to $C([0,T],\mathbb{R}^d)$.

*Proof of continuity:* By the Picard–Lindelöf theorem \[[17](https://arxiv.org/html/2606.19538#bib.bib93)\], the solution $h(t)$ depends continuously on the input $u$ when $F_\theta$ is Lipschitz and continuously differentiable in $u$. Specifically, if $\|u_1-u_2\|_\infty < \delta$, then by Grönwall’s inequality:

$$\|h_1(t) - h_2(t)\|_2 \leq \frac{L_u}{L_h}\left(e^{L_h t} - 1\right)\delta, \tag{129}$$

where $L_h$ and $L_u$ are the Lipschitz constants of $F_\theta$ in its first and second arguments. Since $C_\theta$ and $D_\theta$ are bounded linear maps, $\|\mathrm{RNN}(u_1) - \mathrm{RNN}(u_2)\|_\infty \leq C\delta$ for a constant $C$ depending on $L_h, L_u, T, \|C_\theta\|, \|D_\theta\|$.
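The bound (129) can be checked numerically on a toy system. The sketch below is an illustration under stated assumptions, not part of the original proof: it uses a hypothetical scalar system $\dot h = \tanh(a h + b u)$ (so $n = d = 1$, $L_h = |a|$, $L_u = |b|$), two constant inputs a distance $\delta$ apart, and forward-Euler integration. The function name `euler_trajectory` and all parameter values are made up for the example.

```python
import math

def euler_trajectory(u_value, a, b, T, steps):
    """Forward-Euler solution of h' = tanh(a*h + b*u), h(0) = 0, constant input u."""
    dt = T / steps
    h = 0.0
    traj = [h]
    for _ in range(steps):
        h = h + dt * math.tanh(a * h + b * u_value)
        traj.append(h)
    return traj

# Hypothetical parameters; tanh is 1-Lipschitz, so L_h = |a| and L_u = |b|.
a, b, T, steps = 0.8, 0.5, 2.0, 2000
L_h, L_u = abs(a), abs(b)
delta = 0.1                      # sup-norm distance between the two inputs

h1 = euler_trajectory(0.3, a, b, T, steps)
h2 = euler_trajectory(0.3 + delta, a, b, T, steps)

dt = T / steps
gap = [abs(x - y) for x, y in zip(h1, h2)]
# Right-hand side of Eq. (129) evaluated on the same time grid.
bound = [(L_u / L_h) * (math.exp(L_h * k * dt) - 1.0) * delta
         for k in range(steps + 1)]
bound_holds = all(g <= bd + 1e-12 for g, bd in zip(gap, bound))
print(bound_holds, gap[-1], bound[-1])
```

The bound is loose for this system because $\tanh$ saturates; it only certifies Lipschitz continuity of $u \mapsto h$, which is all Step 1 needs.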

Step 2. On any compact input set $U_c \subset C([0,T],\mathbb{R}^d)$, the operator $u\mapsto\mathrm{RNN}(u)$ is continuous.

Step 3. By Theorem [4](https://arxiv.org/html/2606.19538#Thmtheorem4) (universal approximation), for any $\varepsilon>0$, there exists an ITNet operator $\mathcal{K}_\theta$ such that:

$$\sup_{u\in U_c}\|\mathrm{RNN}(u) - \mathcal{K}_\theta[u]\|_\infty < \varepsilon. \tag{130}$$
This does not construct the kernel explicitly but guarantees its existence\. The explicit construction is provided by Strategy A above\. ∎

### E\.4\. Proof of Part \(b\) – Discrete\-Time RNN

###### Proof of Theorem[E\.1](https://arxiv.org/html/2606.19538#A5.SS1)\(b\)\.

Step 1. Unroll the discrete recurrence.

The discrete RNN update ([96](https://arxiv.org/html/2606.19538#A5.E96)) with $\phi = \mathrm{id}$ (linear case for clarity; nonlinear case in Step 3) gives:

$$\begin{aligned}
h_t &= W_h h_{t-1} + W_u u_t + b\\
&= W_h(W_h h_{t-2} + W_u u_{t-1} + b) + W_u u_t + b\\
&\;\;\vdots\\
&= W_h^{t} h_0 + \sum_{s=1}^{t} W_h^{t-s} W_u u_s + \sum_{s=1}^{t} W_h^{t-s} b.
\end{aligned}\tag{131}$$
Step 2. Express as a discrete ITNet operator.

Discretise $\Omega = \{t_1,\ldots,t_T\}$ with measure $\mu(\{t_k\}) = 1$ (unit mass per time step). Define the causal kernel (for $h_0 = b = 0$):

$$\kappa_\theta(t,s,u(t),u(s)) \;=\; \mathbf{1}_{s\leq t}\cdot C_\theta\,W_h^{t-s}\,W_u, \qquad W_\theta = D_\theta. \tag{132}$$

Then:

$$\begin{aligned}
(\mathcal{K}_\theta[u])(t) &= \sum_{s=1}^{T}\mathbf{1}_{s\leq t}\cdot C_\theta W_h^{t-s} W_u u_s + D_\theta u_t\\
&= \sum_{s=1}^{t} C_\theta W_h^{t-s} W_u u_s + D_\theta u_t\\
&= C_\theta h_t + D_\theta u_t \;=\; \mathrm{RNN}(u)(t).
\end{aligned}\tag{133}$$
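The identity (133) is easy to confirm on a small instance. The following sketch, with made-up weights (state dimension 2, scalar input and output, $h_0 = b = 0$), runs the recurrence step by step and compares it with the causal kernel sum built from (132). The helper names (`matvec`, `matmul`, `matpow`, `rnn_recurrent`, `rnn_kernel`) are illustrative and not from the paper.

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(M, p):
    n = len(M)
    out = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(p):
        out = matmul(out, M)
    return out

def vadd(x, y):
    return [a + b for a, b in zip(x, y)]

# Hypothetical weights: n = 2 hidden units, d = 1 input/output channel.
W_h = [[0.5, 0.1], [-0.2, 0.3]]
W_u = [[1.0], [0.5]]
C = [[1.0, -1.0]]
D = [[0.25]]
u = [[1.0], [-2.0], [0.5], [3.0]]        # u_1, ..., u_T

def rnn_recurrent(u):
    """Step-by-step linear recurrence with h_0 = 0 and b = 0."""
    h, ys = [0.0, 0.0], []
    for u_t in u:
        h = vadd(matvec(W_h, h), matvec(W_u, u_t))
        ys.append(vadd(matvec(C, h), matvec(D, u_t)))
    return ys

def rnn_kernel(u):
    """Causal kernel sum: y_t = sum_{s<=t} C W_h^(t-s) W_u u_s + D u_t."""
    ys = []
    for t in range(1, len(u) + 1):
        y = matvec(D, u[t - 1])
        for s in range(1, t + 1):         # causal mask 1_{s <= t}
            kappa = matmul(matmul(C, matpow(W_h, t - s)), W_u)
            y = vadd(y, matvec(kappa, u[s - 1]))
        ys.append(y)
    return ys

y_rec, y_ker = rnn_recurrent(u), rnn_kernel(u)
max_diff = max(abs(p[0] - q[0]) for p, q in zip(y_rec, y_ker))
print(max_diff)
```

The kernel form costs more than the recurrence when evaluated naively, but each output position is independent of the others, which is the property the unified view relies on.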
Step 3. Handle the nonlinear discrete case.

For a nonlinear discrete RNN with activation $\phi$ (e.g. $\tanh$), the hidden state at time $t$ depends on the entire input history $u_1,\ldots,u_t$ through the recurrence $h_s = \phi(W_h h_{s-1} + W_u u_s + b)$. We can still write the output as a causal integral operator by defining the discrete nonlinear state transition:

$$\Phi_\phi^{\mathrm{disc}}(t,s;u) \;\coloneqq\; \prod_{\tau=s+1}^{t}\mathrm{diag}\!\left(\phi'\!\left(W_h h_{\tau-1} + W_u u_\tau + b\right)\right)\cdot W_h \;\in\; \mathbb{R}^{n\times n}, \tag{134}$$

which is the product of Jacobians along the discrete trajectory (the discrete analogue of $\Phi_F(t,s;u)$ from the continuous case). The ITNet kernel becomes:

$$\kappa_\theta(t,s,u(t),u(s)) = \mathbf{1}_{s\leq t}\cdot C_\theta\,\Phi_\phi^{\mathrm{disc}}(t,s;u)\cdot W_u. \tag{135}$$

This kernel is content-dependent (through $\phi'$ evaluated along the trajectory) and causal. The same argument as Part (a′), Strategy A applies, with exact recovery when $\phi$ is applied element-wise (so the Jacobian is diagonal). ∎

### E\.5\. Proof of Part \(c\) – Linear SSM \(S4\)

###### Proof of Theorem[E\.1](https://arxiv.org/html/2606.19538#A5.SS1)\(c\)\.

Step 1. Solve the linear SSM ODE.

The SSM state equation ([97](https://arxiv.org/html/2606.19538#A5.E97)) is a linear ODE with $h(0) = 0$. By the variation of constants formula (same as Step 1 of Part (a)):

$$h(t) \;=\; \int_0^t e^{A(t-s)}\,B\,u(s)\,ds. \tag{136}$$

Step 2. Substitute into the output equation.

$$\begin{aligned}
y(t) &= C\,h(t) + D\,u(t)\\
&= \int_0^t \underbrace{C e^{A(t-s)} B}_{g(t-s)}\,u(s)\,ds + D\,u(t),
\end{aligned}\tag{137}$$

where $g(\tau) = C e^{A\tau} B \in \mathbb{R}^{d\times d}$ is the impulse response kernel of the SSM.

Step 3. Express as ITNet operator.

Set:

$$\kappa_\theta(t,s,u(t),u(s)) \;=\; \mathbf{1}_{s\leq t}\cdot C\,e^{A(t-s)}\,B, \qquad W_\theta = D. \tag{138}$$

Then:

$$\begin{aligned}
(\mathcal{K}_\theta[u])(t) &= \int_0^T \mathbf{1}_{s\leq t}\cdot C e^{A(t-s)} B\cdot u(s)\,ds + D\,u(t)\\
&= \int_0^t C e^{A(t-s)} B\,u(s)\,ds + D\,u(t) \;=\; y(t).
\end{aligned}\tag{139}$$

$$\boxed{(\mathcal{K}_\theta[u])(t) \;=\; y(t) \;=\; \mathrm{SSM}(u)(t).}\tag{140}$$

∎
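As a sanity check of (138) and (139), the sketch below evaluates the kernel integral by midpoint quadrature for a hypothetical scalar SSM ($\dot h = a h + b u$, $y = c h + d u$, $h(0) = 0$) driven by $u(s) = \cos(\omega s)$, and compares it with the closed-form solution of the same ODE. All parameter values and function names are assumptions chosen for illustration.

```python
import math

# Hypothetical scalar SSM: h' = a*h + b*u, y = c*h + d_skip*u, h(0) = 0.
a, b, c, d_skip = -0.7, 1.3, 0.9, 0.2
omega = 2.0

def u(s):
    return math.cos(omega * s)

def kernel(t, s):
    """Scalar form of Eq. (138): 1_{s <= t} * c * exp(a (t - s)) * b."""
    return c * math.exp(a * (t - s)) * b if s <= t else 0.0

def itnet_output(t, T=3.0, panels=3000):
    """Midpoint-rule estimate of int_0^T kappa(t, s) u(s) ds + d_skip * u(t)."""
    ds = T / panels
    total = sum(kernel(t, (j + 0.5) * ds) * u((j + 0.5) * ds)
                for j in range(panels))
    return total * ds + d_skip * u(t)

def ssm_exact(t):
    """Closed-form output for u(s) = cos(omega s), via variation of constants."""
    h = b * (-a * math.cos(omega * t) + omega * math.sin(omega * t)
             + a * math.exp(a * t)) / (a ** 2 + omega ** 2)
    return c * h + d_skip * u(t)

# Query times are chosen on panel edges so the causal mask cuts cleanly.
errors = [abs(itnet_output(t) - ssm_exact(t)) for t in (0.5, 1.0, 2.0, 3.0)]
print(max(errors))
```

The integral runs over the whole interval $[0, T]$; causality enters only through the indicator inside `kernel`.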

### E\.6\. Proof of Part \(d\) – Selective SSM \(Mamba\)

###### Proof of Theorem[E\.1](https://arxiv.org/html/2606.19538#A5.SS1)\(d\)\.

Step 1. Unroll the Mamba discrete recurrence.

Starting from the Mamba update ([101](https://arxiv.org/html/2606.19538#A5.E101)) with $h_0 = 0$:

$$\begin{aligned}
h_t &= \bar{A}(t)\,h_{t-1} + \bar{B}(t)\,u_t\\
&= \bar{A}(t)\bigl[\bar{A}(t-1)h_{t-2} + \bar{B}(t-1)u_{t-1}\bigr] + \bar{B}(t)\,u_t\\
&\;\;\vdots\\
&= \sum_{s=1}^{t}\left[\prod_{\tau=s+1}^{t}\bar{A}(\tau)\right]\bar{B}(s)\,u_s,
\end{aligned}\tag{141}$$

where the empty product $\prod_{\tau=t+1}^{t}\bar{A}(\tau) = I_n$ (convention for $s = t$).

Step 2. Compute the Mamba output.

Substituting ([141](https://arxiv.org/html/2606.19538#A5.E141)) into the output ([102](https://arxiv.org/html/2606.19538#A5.E102)):

$$\begin{aligned}
y_t &= C(u_t)\,h_t\\
&= C(u_t)\sum_{s=1}^{t}\left[\prod_{\tau=s+1}^{t}\bar{A}(\tau)\right]\bar{B}(s)\,u_s\\
&= \sum_{s=1}^{t}\underbrace{C(u_t)\left[\prod_{\tau=s+1}^{t}\bar{A}(\tau)\right]\bar{B}(s)}_{=:\,\kappa_{\mathrm{Mamba}}(t,s,u(t),u(s))}u_s.
\end{aligned}\tag{142}$$
Step 3. Identify the ITNet kernel.

The Mamba kernel is:

$$\kappa_{\mathrm{Mamba}}(t,s,u(t),u(s)) \;\coloneqq\; \mathbf{1}_{s\leq t}\cdot C(u_t)\prod_{\tau=s+1}^{t}\bar{A}(u_\tau)\cdot\bar{B}(u_s), \tag{143}$$

where we write $\bar{A}(u_\tau) = \exp(\Delta(u_\tau)\cdot A)$ and $\bar{B}(u_s) = \Delta(u_s)\cdot W_B u_s$.

With this kernel and $W_\theta = 0$, the discrete ITNet operator (sum version, $\mu(\{t_k\}) = 1$) gives:

$$\begin{aligned}
(\mathcal{K}_\theta[u])(t) &= \sum_{s=1}^{T}\kappa_{\mathrm{Mamba}}(t,s,u_t,u_s)\,u_s\\
&= \sum_{s=1}^{t} C(u_t)\prod_{\tau=s+1}^{t}\bar{A}(u_\tau)\,\bar{B}(u_s)\,u_s \;=\; y_t.
\end{aligned}\tag{144}$$
Step 4. Verify kernel validity.

The kernel $\kappa_{\mathrm{Mamba}}$ satisfies all ITNet conditions:

1. (i) *Causality*: $\mathbf{1}_{s\leq t}$ ensures $\kappa_{\mathrm{Mamba}} = 0$ for $s>t$.
2. (ii) *Measurability*: $\Delta(u_\tau)$ is the softplus of a linear function of $u_\tau$ (continuous), and $\bar{A}(\tau)$ is the matrix exponential of a continuous function (continuous).
3. (iii) *Boundedness*: $\|\bar{A}(\tau)\|_{\mathrm{op}} = \|e^{\Delta(\tau)A}\|_{\mathrm{op}} \leq e^{\Delta_{\max}\|A\|_{\mathrm{op}}}$, where $\Delta_{\max} = \max_\tau \Delta(u_\tau) < \infty$ on the compact training set. Therefore $\|\kappa_{\mathrm{Mamba}}\|_{\mathrm{op}} \leq \|W_C\|\cdot e^{(t-s)\Delta_{\max}\|A\|}\cdot\|W_B\|\cdot\|u\|_\infty < \infty$.

$$\boxed{(\mathcal{K}_\theta[u])(t) \;=\; y_t \;=\; \mathrm{Mamba}(u)(t).}\tag{145}$$

∎
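The equivalence between the selective recurrence and the kernel sum in (144) can be verified directly. The sketch below uses a hypothetical scalar state ($n = 1$) and one input channel, with $\Delta$, $\bar{A}$, $\bar{B}$, and $C$ parameterised as in (143); the weight values and function names are assumptions for illustration only.

```python
import math

# Hypothetical scalar parameters (state dimension n = 1, one input channel).
a, w_delta, b_delta, w_B, w_C = -1.0, 0.8, 0.1, 0.6, 1.5

def delta(u):
    """Input-dependent step size: softplus of a linear function of u."""
    return math.log1p(math.exp(w_delta * u + b_delta))

def A_bar(u):
    return math.exp(delta(u) * a)

def B_bar(u):
    return delta(u) * w_B * u

def C(u):
    return w_C * u

def mamba_recurrent(us):
    """h_t = A_bar(u_t) h_{t-1} + B_bar(u_t) u_t, y_t = C(u_t) h_t, h_0 = 0."""
    h, ys = 0.0, []
    for u_t in us:
        h = A_bar(u_t) * h + B_bar(u_t) * u_t
        ys.append(C(u_t) * h)
    return ys

def mamba_kernel(us):
    """y_t = sum_{s<=t} C(u_t) * prod_{tau=s+1..t} A_bar(u_tau) * B_bar(u_s) * u_s."""
    ys = []
    for t in range(len(us)):
        y = 0.0
        for s in range(t + 1):                  # causal mask 1_{s <= t}
            decay = 1.0
            for tau in range(s + 1, t + 1):     # empty product (= 1) when s == t
                decay *= A_bar(us[tau])
            y += C(us[t]) * decay * B_bar(us[s]) * us[s]
        ys.append(y)
    return ys

us = [0.5, -1.0, 2.0, 0.3, -0.7]
max_diff = max(abs(p - q) for p, q in zip(mamba_recurrent(us), mamba_kernel(us)))
print(max_diff)
```

Note that the kernel entry for the pair $(t, s)$ reads every input between $s$ and $t$ through the decay product, which is the sense in which it depends on the trajectory rather than on $u_t$ and $u_s$ alone.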

### E\.7\. LSTM as ITNet

###### Proposition 7 (LSTM $\subset$ ITNet).

The LSTM update is a special case of the ITNet operator\.

###### Proof\.

The LSTM cell state evolves as:

$$c_t \;=\; \sum_{s=1}^{t}\left[\prod_{\tau=s+1}^{t} f_\tau\right]\cdot i_s\odot\tilde{c}_s, \tag{146}$$

where $f_\tau = \sigma(W_f[h_{\tau-1};u_\tau] + b_f)$ is the forget gate at time $\tau$ and the product $\prod_{\tau=s+1}^{t} f_\tau$ is the *cumulative forget factor* from time $s$ to $t$.

The hidden state is $h_t = o_t\odot\tanh(c_t)$, so:

$$y_t = W_y h_t = W_y\,o_t\odot\tanh\!\left(\sum_{s=1}^{t}\prod_{\tau=s+1}^{t} f_\tau\odot i_s\odot\tilde{c}_s\right). \tag{147}$$

Define the LSTM kernel:

$$\kappa_{\mathrm{LSTM}}(t,s,u(t),u(s)) \;\coloneqq\; \mathbf{1}_{s\leq t}\cdot W_y\,\mathrm{diag}(o_t)\tanh'\!(\cdot)\,\mathrm{diag}\!\left(\prod_{\tau=s+1}^{t} f_\tau\right)\mathrm{diag}(i_s)\,W_c, \tag{148}$$

where $\tanh'$ denotes the derivative of $\tanh$ (applied element-wise) and $W_c$ maps $u_s$ to the cell candidate $\tilde{c}_s = \tanh(W_c u_s)$.

The kernel ([148](https://arxiv.org/html/2606.19538#A5.E148)) depends on $u(t)$ through $o_t$ and $f_\tau$ ($s<\tau\leq t$), and on $u(s)$ through $i_s$ and $\tilde{c}_s$, making it a content-dependent causal kernel. By the same steps as Part (d) (Mamba), the ITNet operator with this kernel recovers $y_t = \mathrm{LSTM}(u)(t)$. ∎

### E\.8\. GRU as ITNet

###### Proposition 8 (GRU $\subset$ ITNet).

The GRU update is a special case of the ITNet operator\.

###### Proof\.

The GRU update:

$$\begin{aligned}
z_t &= \sigma(W_z[h_{t-1};u_t] + b_z) && \text{(update gate)}\\
r_t &= \sigma(W_r[h_{t-1};u_t] + b_r) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh(W_h[r_t\odot h_{t-1};u_t] + b) && \text{(candidate)}\\
h_t &= (1-z_t)\odot h_{t-1} + z_t\odot\tilde{h}_t. && \text{(hidden state)}
\end{aligned}$$

Unrolling the hidden state:

$$h_t \;=\; \sum_{s=1}^{t}\underbrace{\left[\prod_{\tau=s+1}^{t}(1-z_\tau)\right]}_{\text{retention factor}}\cdot z_s\odot\tilde{h}_s^{*}, \tag{149}$$

where $\tilde{h}_s^{*}$ denotes the candidate state at time $s$, computed with the reset gate applied to the history $h_{s-1}$.

The structure is identical to the LSTM case: a causal sum with input-dependent retention factors $(1-z_\tau)\in(0,1)$. The ITNet kernel is:

$$\kappa_{\mathrm{GRU}}(t,s,u(t),u(s)) \;=\; \mathbf{1}_{s\leq t}\cdot W_y\,\mathrm{diag}\!\left(\prod_{\tau=s+1}^{t}(1-z_\tau)\right)\mathrm{diag}(z_s)\,W_{\tilde{h}}, \tag{150}$$

and by the same argument as Part (d), the ITNet operator with this kernel recovers the GRU output. ∎
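To make the unrolled form (149) concrete, here is the expansion for the first two steps. It is a worked illustration only, and it assumes a zero initial state $h_0 = 0$, which the proposition does not state explicitly.

```latex
\begin{aligned}
h_1 &= (1-z_1)\odot h_0 + z_1\odot\tilde{h}_1^{*} = z_1\odot\tilde{h}_1^{*},\\
h_2 &= (1-z_2)\odot h_1 + z_2\odot\tilde{h}_2^{*}\\
    &= \underbrace{(1-z_2)}_{\text{retention factor, } s=1}\odot\, z_1\odot\tilde{h}_1^{*}
     \;+\; \underbrace{\mathbf{1}}_{\text{empty product, } s=2}\odot\, z_2\odot\tilde{h}_2^{*}.
\end{aligned}
```

Each further step multiplies every earlier term by one more factor $(1-z_t)$ and appends a new term $z_t\odot\tilde{h}_t^{*}$, which gives the general sum in (149).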

### E\.9\. Strictness Argument 1: Non\-Causal Operators

###### Proof via non\-causal operator\.

Step 1. Define the witness operator.

Define the *bidirectional smoothing operator*:

$$S(u)(t) \;\coloneqq\; \int_0^T e^{-|t-s|/\ell}\,u(s)\,ds, \qquad \ell>0. \tag{151}$$

This uses an exponential kernel over the *entire* interval $[0,T]$, not just $[0,t]$. It uses both past ($s<t$) and future ($s>t$) information.

Step 2. Show ITNet represents $S$.

Set:

$$\kappa_\theta(t,s,u(t),u(s)) \;=\; e^{-|t-s|/\ell}\cdot\mathbf{I}_d, \qquad W_\theta = 0. \tag{152}$$

Then $(\mathcal{K}_\theta[u])(t) = \int_0^T e^{-|t-s|/\ell}\,u(s)\,ds = S(u)(t)$. This kernel is symmetric in $(t,s)$, bounded, and continuous, and is therefore valid for ITNet.

Step 3. Show no causal recurrent system can represent $S$.

Any causal recurrent system satisfies:

$$\mathrm{RNN}(u)(t) \;=\; \mathcal{F}_t[u(0),u(1),\ldots,u(t)], \tag{153}$$

i.e. the output at time $t$ is a function of inputs up to time $t$ only.

For $S$, consider two inputs:

$$\begin{aligned}
u_1(s) &= \mathbf{0}\ \text{ for all } s,\\
u_2(s) &= \delta(s-t_*)\,e_1 \quad\text{for some } t_*>t \text{ and } \|e_1\|_2 = 1.
\end{aligned}\tag{154}$$

Then:

$$S(u_1)(t) = 0, \tag{155}$$

$$S(u_2)(t) = e^{-|t-t_*|/\ell}\,e_1 \;\neq\; 0 \quad\text{(since } t_*>t\text{)}. \tag{156}$$

But for any causal system:

$$\mathrm{RNN}(u_1)(t) = \mathcal{F}_t[0,\ldots,0] = 0, \tag{157}$$

$$\mathrm{RNN}(u_2)(t) = \mathcal{F}_t[0,\ldots,0] = 0, \tag{158}$$

since $u_1(s) = u_2(s) = 0$ for all $s\leq t$ (the impulse at $t_*>t$ is in the future and unseen by the causal system).

Therefore $S(u_2)(t) \neq 0 = \mathrm{RNN}(u_2)(t)$. Contradiction. So $S\notin\mathrm{RNN}$ and $S\in\mathrm{ITNet}$, proving $\mathrm{RNN}\subsetneq\mathrm{ITNet}$. ∎
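A discrete analogue of this witness argument can be run directly. In the sketch below, which is an illustration with made-up grid sizes, a unit spike at a future index stands in for the Dirac impulse, the bidirectional kernel is the one in (152), and a causal exponential kernel plays the role of a generic causal system. The name `apply_kernel` is hypothetical.

```python
import math

def apply_kernel(kappa, u):
    """Discrete operator (K[u])(t) = sum_s kappa(t, s) * u[s], unit mass per step."""
    n = len(u)
    return [sum(kappa(t, s) * u[s] for s in range(n)) for t in range(n)]

ell = 2.0

def bidirectional(t, s):
    return math.exp(-abs(t - s) / ell)

def causal(t, s):
    return math.exp(-(t - s) / ell) if s <= t else 0.0

n, t_query, t_star = 10, 3, 7       # the spike at t_star lies in the future of t_query
u1 = [0.0] * n
u2 = [0.0] * n
u2[t_star] = 1.0                    # discrete stand-in for the Dirac impulse

s1 = apply_kernel(bidirectional, u1)[t_query]
s2 = apply_kernel(bidirectional, u2)[t_query]
c1 = apply_kernel(causal, u1)[t_query]
c2 = apply_kernel(causal, u2)[t_query]
expected = math.exp(-abs(t_query - t_star) / ell)
print(s1, s2, c1, c2)
```

The two inputs agree on every index up to `t_query`, so the causal operator cannot tell them apart there, while the bidirectional operator returns $e^{-|t - t_*|/\ell}$ for the second input.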

### E\.10\. Strictness Argument 2: Parallelism and State Dimension

###### Proof via bounded state dimension\.

Step 1. Define a high-rank operator.

Define the operator:

$$H_n(u)(t) \;\coloneqq\; \int_0^T K_n(t,s)\,u(s)\,ds, \tag{159}$$

where $K_n(t,s) = \displaystyle\sum_{k=1}^{n+1}\phi_k(t)\,\phi_k(s)^{\top}$ for orthonormal basis functions $\{\phi_k\}_{k=1}^{n+1}$, so $K_n$ has rank $n+1$ in $L^2$.

Step 2. Show ITNet represents $H_n$ for any $n$.

Set $\kappa_\theta(t,s,u(t),u(s)) = K_n(t,s)\cdot\mathbf{I}_d$. Since $K_n$ is a symmetric, bounded, continuous kernel on $[0,T]^2$, it satisfies all ITNet conditions. The ITNet operator with this kernel equals $H_n$ exactly.

Step 3. Show no fixed-$n$ RNN can represent $H_n$ exactly.

An RNN with hidden state dimension $n$ computes outputs in the range of a linear map $C_\theta h(t)$, where $h(t)\in\mathbb{R}^n$. By the variation of constants formula ([109](https://arxiv.org/html/2606.19538#A5.E109)), the output function $\mathrm{RNN}(u)(t)$ lies in the span of at most $n$ basis functions (the $n$ columns of $C_\theta e^{At}$).

But $H_n(u)(t)$ requires $n+1$ basis functions by construction. Therefore no RNN with state dimension $\leq n$ can represent $H_n$ exactly, for any $n\geq 1$.

Since this holds for all $n$, no finite-dimensional RNN can represent the full class of operators that ITNet can. Hence $\mathrm{RNN}_n \subsetneq \mathrm{ITNet}$ for each $n$, and $\bigcup_{n=1}^{\infty}\mathrm{RNN}_n \subsetneq \mathrm{ITNet}$. ∎

### E\.11\. Discretization: Recovering Euler and ZOH

In practice, continuous SSMs are discretised before being implemented on digital hardware\. We show both standard discretization methods are consistent with the ITNet framework\.

##### Euler Discretization\.

The forward Euler discretization of ([97](https://arxiv.org/html/2606.19538#A5.E97)) with step size $\Delta t$:

$$h_{t+1} \;=\; (I + \Delta t\cdot A)\,h_t + \Delta t\cdot B\,u_t \;\eqqcolon\; \bar{A}_E\,h_t + \bar{B}_E\,u_t. \tag{160}$$

This is a discrete linear SSM with $\bar{A}_E = I + \Delta t\,A$ and $\bar{B}_E = \Delta t\,B$. By Part (b) (discrete RNN), applied with the one-step index shift in (160) (the input $u_t$ drives $h_{t+1}$) and $h_0 = 0$, this is an ITNet with kernel $\kappa_\theta(t,s) = \mathbf{1}_{s<t}\cdot C\bar{A}_E^{\,t-1-s}\bar{B}_E$.

##### Zero\-Order Hold \(ZOH\) Discretization\.

The ZOH discretization (used by S4 and Mamba):

$$\bar{A} = e^{A\Delta t}, \tag{161}$$

$$\bar{B} = (e^{A\Delta t} - I)A^{-1}B. \tag{162}$$

This assumes the input is piecewise constant over each interval $[t, t+\Delta t)$. The discrete solution is:

$$h_t = \sum_{s=0}^{t-1}\bar{A}^{\,t-1-s}\bar{B}\,u_s, \tag{163}$$

which is an ITNet with kernel $\mathbf{1}_{s<t}\cdot C\bar{A}^{\,t-1-s}\bar{B}$: the Euler recurrence with $\bar{A}_E$ and $\bar{B}_E$ replaced by $\bar{A} = e^{A\Delta t}$ and the ZOH $\bar{B}$.
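The two discretizations can be compared on a scalar example. The sketch below is illustrative and uses hypothetical values ($a = -1.5$, $b = 2$, $\Delta t = 0.1$, 20 steps) with a constant input, for which the ZOH assumption holds exactly, so ZOH should reproduce the continuous solution $h(T) = b\,u\,(e^{aT}-1)/a$ while Euler carries a first-order error. It also checks the unrolled sum (163) against the recurrence.

```python
import math

# Hypothetical scalar SSM h' = a*h + b*u with a constant input.
a, b, dt, steps = -1.5, 2.0, 0.1, 20
u_const = 1.0

A_euler, B_euler = 1.0 + dt * a, dt * b
A_zoh = math.exp(a * dt)
B_zoh = (A_zoh - 1.0) / a * b          # scalar form of (e^{A dt} - I) A^{-1} B

def run(A_bar, B_bar):
    """Iterate h_{t+1} = A_bar * h_t + B_bar * u_t from h_0 = 0."""
    h = 0.0
    for _ in range(steps):
        h = A_bar * h + B_bar * u_const
    return h

def unrolled(A_bar, B_bar):
    """Eq. (163): h_t = sum_{s=0}^{t-1} A_bar^(t-1-s) * B_bar * u_s."""
    return sum(A_bar ** (steps - 1 - s) * B_bar * u_const for s in range(steps))

exact = b * u_const * (math.exp(a * dt * steps) - 1.0) / a
err_euler = abs(run(A_euler, B_euler) - exact)
err_zoh = abs(run(A_zoh, B_zoh) - exact)
unroll_gap = abs(run(A_zoh, B_zoh) - unrolled(A_zoh, B_zoh))
print(err_euler, err_zoh, unroll_gap)
```

For inputs that vary within a step, ZOH is no longer exact either; the comparison here isolates the discretization error of the state transition.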

## Appendix F Proof of Theorem 4: Universal Approximation

###### Definition 15 (Continuous nonlinear operator).

A *continuous nonlinear operator* is a map $F: U_c \to C(\Omega,\mathbb{R}^d)$, where $U_c \subset C(\Omega,\mathbb{R}^d)$ is compact and $F$ is continuous with respect to the supremum norm: for every $\varepsilon>0$, there exists $\delta>0$ such that $\|u_1-u_2\|_\infty < \delta$ implies $\|F(u_1)-F(u_2)\|_\infty < \varepsilon$.

###### Definition 16 (ITNet operator).

$$(\mathcal{K}_\theta[u])(x) = \int_\Omega \kappa_\theta(x,y,u(x),u(y))\,u(y)\,d\mu(y) + W_\theta\,u(x), \tag{164}$$

where $\kappa_\theta: \mathbb{R}^s\times\mathbb{R}^s\times\mathbb{R}^d\times\mathbb{R}^d \to \mathbb{R}^{d\times d}$ is the kernel (parameterised by an MLP), and $W_\theta\in\mathbb{R}^{d\times d}$.
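A discrete version of (164) fits in a few lines. The sketch below is a minimal illustration for scalar features ($d = 1$), with the integral replaced by a weighted sum over sample points. The function `itnet_operator` and the two hand-written kernels are hypothetical stand-ins; in the paper the kernel is an MLP rather than a fixed rule.

```python
def itnet_operator(kappa, W, xs, us, weights):
    """Discrete form of Eq. (164) for scalar features (d = 1):
    (K[u])(x_i) = sum_j kappa(x_i, x_j, u_i, u_j) * u_j * weights[j] + W * u_i."""
    out = []
    for x_i, u_i in zip(xs, us):
        acc = W * u_i
        for x_j, u_j, w_j in zip(xs, us, weights):
            acc += kappa(x_i, x_j, u_i, u_j) * u_j * w_j
        out.append(acc)
    return out

xs = [0.0, 1.0, 2.0, 3.0]            # sample positions
us = [1.0, 2.0, 3.0, 4.0]            # feature values u(x_i)
unit = [1.0] * len(xs)               # unit mass per sample point

# Position-only causal kernel: the operator reduces to a running sum.
causal_sum = itnet_operator(
    lambda x, y, ux, uy: 1.0 if y <= x else 0.0, 0.0, xs, us, unit)

# Content-dependent kernel: the weight depends on the features at both points.
gated = itnet_operator(
    lambda x, y, ux, uy: 1.0 if ux * uy > 2.5 else 0.0, 0.5, xs, us, unit)

print(causal_sum)   # [1.0, 3.0, 6.0, 10.0]
print(gated)        # [7.5, 10.0, 11.5, 12.0]
```

The first kernel ignores the features and depends only on positions, as in the convolutional and linear recurrent special cases; the second ignores positions and depends only on content, as in attention-like special cases.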

###### Definition 17 (MLP function class).

A *single hidden layer MLP* with width $w$, input dimension $p$, and non-polynomial activation $\sigma$ is:

$$f_\theta(z) = \sum_{j=1}^{w} c_j\,\sigma(a_j^{\top}z + b_j), \qquad z\in\mathbb{R}^p, \tag{165}$$

where $a_j\in\mathbb{R}^p$, $b_j\in\mathbb{R}$, $c_j\in\mathbb{R}$ are learnable parameters.
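Equation (165) translates directly into code. The sketch below evaluates a width-3 MLP on a two-dimensional input with $\sigma = \tanh$; the parameter values are made up so that the result can be checked by hand, and the function name `mlp` is an assumption.

```python
import math

def mlp(z, a, b, c, sigma=math.tanh):
    """f_theta(z) = sum_j c_j * sigma(a_j . z + b_j), as in Eq. (165)."""
    return sum(
        c_j * sigma(sum(a_ji * z_i for a_ji, z_i in zip(a_j, z)) + b_j)
        for a_j, b_j, c_j in zip(a, b, c)
    )

# Hypothetical parameters: width w = 3, input dimension p = 2.
a = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
b = [0.0, 0.5, -0.5]
c = [2.0, -1.0, 0.5]

value = mlp([1.0, 0.5], a, b, c)
# Hand check: pre-activations are 1, 1, 0, so the output is
# 2*tanh(1) - tanh(1) + 0.5*tanh(0) = tanh(1).
expected = math.tanh(1.0)
print(value, expected)
```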

###### Assumption 5 (Standing assumptions).

1. (i) $\Omega\subset\mathbb{R}^s$ is compact with $\mu(\Omega)>0$.
2. (ii) $F: U_c \to C(\Omega,\mathbb{R}^d)$ is continuous and $U_c$ is compact.
3. (iii) $\sigma$ is a non-polynomial continuous activation function.

### F\.1\. Main Theorem

Theorem F.1 (Universal Approximation of Continuous Operators). Under Assumption [5](https://arxiv.org/html/2606.19538#Thmassumption5), for every $\varepsilon>0$, there exist:

- a kernel MLP width $w_\kappa<\infty$,
- a residual matrix $W_\theta\in\mathbb{R}^{d\times d}$,
- parameters $\theta$ of the kernel MLP,

such that the ITNet operator $\mathcal{K}_\theta$ satisfies:

$$\sup_{u\in U_c}\left\|F(u) - \mathcal{K}_\theta[u]\right\|_\infty < \varepsilon. \tag{166}$$

That is, a *single ITNet layer* can uniformly approximate any continuous operator on any compact input set to any desired precision.

### F\.2\. Auxiliary Lemmas

The proof requires three lemmas, which we state and prove before the main argument\.

###### Lemma 4 (MLP universal approximation).

Let $K\subset\mathbb{R}^p$ be compact and $g: K\to\mathbb{R}$ be continuous. For any $\delta>0$, there exists a single hidden layer MLP $f_\theta$ (Eq. [165](https://arxiv.org/html/2606.19538#A6.E165)) with width $w$ depending on $\delta$, $g$, and $K$, such that:

$$\sup_{z\in K}|g(z) - f_\theta(z)| < \delta. \tag{167}$$

###### Proof reference\.

This is the classical Universal Approximation Theorem. Proved by \[[20](https://arxiv.org/html/2606.19538#bib.bib28)\] for sigmoid, generalized by \[[46](https://arxiv.org/html/2606.19538#bib.bib100)\] to any non-constant, bounded, continuous $\sigma$, and extended by \[[54](https://arxiv.org/html/2606.19538#bib.bib55)\] to any non-polynomial $\sigma$. The proof uses the Stone–Weierstrass theorem (for sigmoidal $\sigma$) or the Hahn–Banach theorem (for general $\sigma$): if $f_\theta$ cannot approximate some $g$, then there exists a non-zero bounded measure $\nu$ on $K$ with $\int f_\theta\,d\nu = 0$ for all $f_\theta$; the non-polynomial property of $\sigma$ forces $\nu = 0$, a contradiction. ∎

###### Lemma 5 (Chen and Chen \[[9](https://arxiv.org/html/2606.19538#bib.bib6)\] operator approximation).

Let $\Omega\subset\mathbb{R}^s$ be compact with finite Borel measure $\mu$, and $F: U_c \to C(\Omega,\mathbb{R}^d)$ be a continuous operator on a compact set $U_c \subset C(\Omega,\mathbb{R}^d)$. For any $\varepsilon>0$, there exist a continuous kernel $\kappa^{*}: \Omega\times\Omega\times\mathbb{R}^d\times\mathbb{R}^d \to \mathbb{R}^{d\times d}$ and a matrix $W^{*}\in\mathbb{R}^{d\times d}$ such that:

$$\sup_{u\in U_c}\left\|F(u)(x) - \int_\Omega \kappa^{*}(x,y,u(x),u(y))\,u(y)\,d\mu(y) - W^{*}u(x)\right\|_\infty < \frac{\varepsilon}{2}. \tag{168}$$

###### Proof sketch\.

This follows from Chen and Chen \[[9](https://arxiv.org/html/2606.19538#bib.bib6)\] (Theorem 1), extended to the matrix-valued kernel setting.

Step 1. Discretise the operator.

Since $F$ is continuous on compact $U_c$ and outputs continuous functions on compact $\Omega$, $F$ is uniformly continuous. Choose $M$ quadrature points $\{y_1,\ldots,y_M\}\subset\Omega$ with weights $\{w_1,\ldots,w_M\}$ such that for any continuous integrand $g$:

$$\left|\int_\Omega g(y)\,d\mu(y) - \sum_{j=1}^{M} w_j\,g(y_j)\right| < \frac{\varepsilon}{4C_u}, \tag{169}$$

where $C_u = \sup_{u\in U_c}\|u\|_\infty < \infty$ (by compactness of $U_c$). Such $M$ and $\{y_j, w_j\}$ exist by standard quadrature theory on compact domains \[[23](https://arxiv.org/html/2606.19538#bib.bib27)\].

Step 2. Approximate $F$ by a finite-dimensional map.

Define the *sensor values* $\xi(u) = (u(y_1), \ldots, u(y_M)) \in \mathbb{R}^{Md}$, which sample $u$ at the quadrature points. By the compactness of $U_c$ and continuity of $F$, the map $\xi(u) \mapsto F(u)(x)$ is continuous from $\mathbb{R}^{Md}$ to $\mathbb{R}^{d}$ for each $x \in \Omega$. Moreover, this map is uniformly continuous jointly in $(x, \xi)$ on the compact set $\Omega \times \xi(U_c)$.

By the Stone–Weierstrass theorem \[[79](https://arxiv.org/html/2606.19538#bib.bib101)\], there exists a continuous function $G^{*}: \Omega \times \mathbb{R}^{Md} \to \mathbb{R}^{d}$ of the form:

$$G^{*}(x, \xi) = \sum_{j=1}^{M} K_j(x, \xi)\, \xi_j + W^{*} \xi_0(x), \tag{170}$$

where $K_j: \Omega \times \mathbb{R}^{Md} \to \mathbb{R}^{d \times d}$ are continuous matrix-valued functions and $\xi_0(x) = u(x)$ is the query-point value, such that:

$$\sup_{u \in U_c} \left\| F(u)(x) - G^{*}(x, \xi(u)) \right\|_{\infty} < \frac{\varepsilon}{4}. \tag{171}$$

Step 3. Convert to integral form.

Identify $G^{*}$ with the quadrature approximation of an integral: define $\kappa^{*}(x, y, u(x), u(y)) = K(x, y, u(x), u(y))$, where $K$ interpolates the discrete values $K_j(x, \xi)$ at the quadrature points $y_j$. The integral $\int_{\Omega} \kappa^{*} u(y)\, d\mu(y)$ approximates the sum $\sum_j w_j K_j u(y_j)$ by the quadrature bound ([169](https://arxiv.org/html/2606.19538#A6.E169)). Combining the two approximation errors ($\varepsilon/4$ each) gives ([168](https://arxiv.org/html/2606.19538#A6.E168)). ∎

###### Lemma 6 (Kernel approximation by MLP).

Let $\kappa^{*}: \Omega \times \Omega \times \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d \times d}$ be continuous, and let $U_c$ be compact. Define the compact set:

$$\mathcal{D} = \{(x, y, u(x), u(y)) : x, y \in \Omega,\; u \in U_c\} \subset \mathbb{R}^{2s+2d}. \tag{172}$$

For any $\delta > 0$, there exists a single-hidden-layer MLP kernel $\kappa_{\theta}$ with width $w_{\kappa}$ such that:

$$\sup_{(x, y, a, b) \in \mathcal{D}} \left\| \kappa^{*}(x, y, a, b) - \kappa_{\theta}(x, y, a, b) \right\|_{\mathrm{op}} < \delta. \tag{173}$$

###### Proof\.

The set $\mathcal{D}$ is compact (continuous image of compact $\Omega \times \Omega \times U_c$ under the evaluation map). Each entry $[\kappa^{*}]_{ij}$ is a continuous real-valued function on the compact set $\mathcal{D} \subset \mathbb{R}^{2s+2d}$. By Lemma [4](https://arxiv.org/html/2606.19538#Thmlemma4), for each $(i, j)$ there exists an MLP $f^{ij}_{\theta}$ with:

$$\sup_{z \in \mathcal{D}} \left| [\kappa^{*}]_{ij}(z) - f^{ij}_{\theta}(z) \right| < \frac{\delta}{d}. \tag{174}$$

Assembling all $d^{2}$ entries into a matrix-valued MLP $\kappa_{\theta}$:

$$\left\| \kappa^{*} - \kappa_{\theta} \right\|_{\mathrm{op}} \leq \left\| \kappa^{*} - \kappa_{\theta} \right\|_{F} \leq d \max_{ij} \left| [\kappa^{*}]_{ij} - f^{ij}_{\theta} \right| < d \cdot \frac{\delta}{d} = \delta, \tag{175}$$

where we used $\|A\|_{\mathrm{op}} \leq \|A\|_{F}$ and $\|A\|_{F} \leq d \max_{ij} |A_{ij}|$ for $d \times d$ matrices.

In practice, a single MLP with $d^{2}$ output units computes all entries simultaneously, sharing hidden layers. The total width is $w_{\kappa}$ (shared) with $d^{2}$ output heads. ∎
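The norm chain used in (175) can be checked numerically. The following is a minimal sketch, assuming a random entrywise error matrix `E` as a stand-in for $\kappa^{*} - \kappa_{\theta}$ at one point of $\mathcal{D}$; the names `E`, `delta`, and `d` are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # matrix size (illustrative)
delta = 0.1    # target operator-norm tolerance (illustrative)

# Stand-in for kappa* - kappa_theta: every entry is bounded by delta / d in magnitude.
E = rng.uniform(-delta / d, delta / d, size=(d, d))

op_norm = np.linalg.norm(E, ord=2)        # operator norm (largest singular value)
fro_norm = np.linalg.norm(E, ord="fro")   # Frobenius norm
max_entry = np.abs(E).max()               # largest entrywise error

# Chain from the proof: ||E||_op <= ||E||_F <= d * max|E_ij| <= delta
chain_holds = (op_norm <= fro_norm) and (fro_norm <= d * max_entry) and (d * max_entry <= delta)
print(op_norm, fro_norm, d * max_entry, chain_holds)
```

The check illustrates why an entrywise tolerance of $\delta/d$ suffices for an operator-norm tolerance of $\delta$.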

### F\.3\. Main Proofs

###### Proof of Theorem [F.1](https://arxiv.org/html/2606.19538#A6.SS1).

Let $\varepsilon > 0$ and a continuous operator $F: U_c \to C(\Omega, \mathbb{R}^{d})$ be given.

Step 1. Approximate $F$ by an ideal integral operator.

By Lemma [5](https://arxiv.org/html/2606.19538#Thmlemma5), there exist a continuous kernel $\kappa^{*}: \Omega \times \Omega \times \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d \times d}$ and a matrix $W^{*} \in \mathbb{R}^{d \times d}$ such that:

$$\sup_{u \in U_c} \left\| F(u) - \mathcal{K}^{*}[u] \right\|_{\infty} < \frac{\varepsilon}{2}, \tag{176}$$

where $\mathcal{K}^{*}[u](x) = \int_{\Omega} \kappa^{*}(x, y, u(x), u(y))\, u(y)\, d\mu(y) + W^{*} u(x)$.

Step 2. Approximate the ideal kernel by an MLP.

By Lemma [6](https://arxiv.org/html/2606.19538#Thmlemma6), there exists an MLP kernel $\kappa_{\theta}$ with width $w_{\kappa}$ such that:

$$\sup_{(x, y, a, b) \in \mathcal{D}} \left\| \kappa^{*}(x, y, a, b) - \kappa_{\theta}(x, y, a, b) \right\|_{\mathrm{op}} < \delta, \tag{177}$$

where $\delta > 0$ will be chosen in Step 3. Set $W_{\theta} = W^{*}$.

Step 3. Bound the approximation error.

Since $W_{\theta} = W^{*}$, the residual terms cancel, and the error between the ideal and MLP-parameterised ITNet operators is:

$$\begin{aligned}
\left\| \mathcal{K}^{*}[u](x) - \mathcal{K}_{\theta}[u](x) \right\|_{2}
&= \left\| \int_{\Omega} \left[ \kappa^{*}(x, y, u(x), u(y)) - \kappa_{\theta}(x, y, u(x), u(y)) \right] u(y)\, d\mu(y) \right\|_{2} \\
&\leq \int_{\Omega} \left\| \kappa^{*} - \kappa_{\theta} \right\|_{\mathrm{op}} \cdot \left\| u(y) \right\|_{2}\, d\mu(y) \quad \text{(triangle inequality for Bochner integral)} \\
&\leq \delta \cdot C_u \cdot \mu(\Omega),
\end{aligned} \tag{178}$$

where $C_u = \sup_{u \in U_c} \|u\|_{\infty} < \infty$ (by compactness of $U_c$) and $\mu(\Omega) < \infty$ (by Assumption [5](https://arxiv.org/html/2606.19538#Thmassumption5)).

Choose $\delta = \varepsilon / (2 C_u \mu(\Omega))$. Then:

$$\sup_{u \in U_c} \left\| \mathcal{K}^{*}[u] - \mathcal{K}_{\theta}[u] \right\|_{\infty} \leq \delta \cdot C_u \cdot \mu(\Omega) = \frac{\varepsilon}{2}. \tag{179}$$

Step 4. Combine by triangle inequality.

$$\begin{aligned}
\sup_{u \in U_c} \left\| F(u) - \mathcal{K}_{\theta}[u] \right\|_{\infty}
&\leq \sup_{u \in U_c} \left\| F(u) - \mathcal{K}^{*}[u] \right\|_{\infty} + \sup_{u \in U_c} \left\| \mathcal{K}^{*}[u] - \mathcal{K}_{\theta}[u] \right\|_{\infty} \\
&< \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.
\end{aligned} \tag{180}$$

$$\boxed{\sup_{u \in U_c} \left\| F(u) - \mathcal{K}_{\theta}[u] \right\|_{\infty} < \varepsilon.} \tag{181}$$

∎
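The error-propagation step of the proof (the bound $\delta \cdot C_u \cdot \mu(\Omega)$) can be illustrated on a discretised domain. This is a sketch under stated assumptions: the grid size `n`, the uniform quadrature weights, and the random kernel differences `D` are hypothetical stand-ins, and `C_u` is taken here as the largest Euclidean norm of $u(y)$ on the grid.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, delta = 200, 4, 1e-2           # grid points, feature dim, kernel tolerance (all illustrative)
weights = np.full(n, 1.0 / n)        # uniform quadrature weights, so mu(Omega) = 1

# Kernel differences D_j = kappa* - kappa_theta at (x, y_j), rescaled so that ||D_j||_op = delta.
D = rng.normal(size=(n, d, d))
D *= delta / np.linalg.norm(D, ord=2, axis=(1, 2))[:, None, None]

u = rng.normal(size=(n, d))                  # signal values u(y_j)
C_u = np.linalg.norm(u, axis=1).max()        # sup over the grid of ||u(y)||_2

# Discretised operator error at one query point x: || sum_j w_j D_j u(y_j) ||_2
error = np.linalg.norm(np.einsum("j,jab,jb->a", weights, D, u))
bound = delta * C_u * weights.sum()          # delta * C_u * mu(Omega)
print(error, bound, error <= bound)
```

In this random setting the actual error is typically far below the bound, since the worst case requires all kernel errors to align with the signal.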

### F\.4\. Corollary: Strict Expressiveness Ordering

###### Corollary F.1 (Strict expressiveness ordering).

Let $\mathrm{Conv}$, $\mathrm{Attn}$, $\mathrm{RNN}$, $\mathrm{ITNet}$ denote the sets of operators representable by each architecture class. Then:

$$\mathrm{Conv} \subsetneq \mathrm{ITNet}, \qquad \mathrm{Attn} \subsetneq \mathrm{ITNet}, \qquad \mathrm{RNN} \subsetneq \mathrm{ITNet}, \tag{182}$$

and these three subclasses are pairwise incomparable:

$$\mathrm{Conv} \not\subseteq \mathrm{Attn}, \quad \mathrm{Attn} \not\subseteq \mathrm{Conv}, \quad \mathrm{Conv} \not\subseteq \mathrm{RNN}, \quad \text{etc.} \tag{183}$$

###### Proof\.

The inclusions $\mathrm{Conv}, \mathrm{Attn}, \mathrm{RNN} \subseteq \mathrm{ITNet}$ follow from Theorems 1–3. The strictness (proper subset) follows from the strictness parts of Theorems 1–3: each provides a witness operator in $\mathrm{ITNet}$ but not in the respective subclass.

Pairwise incomparability:

- $\mathrm{Conv} \not\subseteq \mathrm{Attn}$: convolution is translation-equivariant; attention (without PE) is permutation-equivariant. A translation-equivariant but non-permutation-equivariant operator (e.g., a fixed local filter such as a finite-difference stencil, whose output depends on which positions are neighbours) is in $\mathrm{Conv}$ but not $\mathrm{Attn}$.
- $\mathrm{Attn} \not\subseteq \mathrm{Conv}$: attention is content-dependent; convolution is not. The softmax-weighted value aggregation from Theorem 2 is in $\mathrm{Attn}$ but not in $\mathrm{Conv}$ (by the linearity argument of Theorem 1(c)).
- $\mathrm{RNN} \not\subseteq \mathrm{Conv}$: causal operators with content-dependent gating (LSTM) are in $\mathrm{RNN}$ but not $\mathrm{Conv}$ (a convolution kernel is content-independent and, in general, non-causal).
- $\mathrm{Conv} \not\subseteq \mathrm{RNN}$: non-causal operators (bidirectional convolution) are in $\mathrm{Conv}$ but not $\mathrm{RNN}$ (all RNNs are causal).

∎
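The equivariance argument behind the first two bullets can be illustrated with a small numerical experiment. This is a minimal sketch, assuming a periodic 1-D grid, a hypothetical two-tap finite-difference filter, and a simplified single-head attention with $Q = K = V = U$ and no positional encoding; none of these choices come from the paper.

```python
import numpy as np

def circular_conv(u, w):
    """Circular convolution: out[i] = sum_k w[k] * u[(i - k) mod n]."""
    n = len(u)
    return np.array([sum(w[k] * u[(i - k) % n] for k in range(len(w)))
                     for i in range(n)])

def self_attention(U):
    """Softmax self-attention without positional encoding (Q = K = V = U)."""
    scores = U @ U.T
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ U

rng = np.random.default_rng(0)
n, d = 6, 3
u = rng.normal(size=n)                 # scalar signal for the convolution
U = rng.normal(size=(n, d))            # vector-valued signal for attention
w = np.array([1.0, -1.0])              # finite-difference stencil (illustrative)
perm = np.array([1, 0, 2, 3, 4, 5])    # swap the first two positions

# Convolution commutes with shifts ...
conv_shift_ok = np.allclose(circular_conv(np.roll(u, 1), w),
                            np.roll(circular_conv(u, w), 1))
# ... but not with a general permutation of positions.
conv_perm_ok = np.allclose(circular_conv(u[perm], w), circular_conv(u, w)[perm])
# Attention without positional encoding commutes with any permutation.
attn_perm_ok = np.allclose(self_attention(U[perm]), self_attention(U)[perm])
print(conv_shift_ok, conv_perm_ok, attn_perm_ok)
```

The convolution passes the shift test and fails the permutation test, while attention passes the permutation test, which is the separation the corollary relies on.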

### F\.5\. Quantitative Approximation Rate

The existential result (Theorem [F.1](https://arxiv.org/html/2606.19538#A6.SS1)) guarantees approximation but does not say *how large* the MLP width $w_{\kappa}$ needs to be. We provide an explicit rate.

###### Proposition 9 (Approximation rate).

Under Assumption [5](https://arxiv.org/html/2606.19538#Thmassumption5), if additionally the target operator $F$ has Lipschitz constant $L_F$ on $U_c$ (i.e., $\|F(u_1) - F(u_2)\|_{\infty} \leq L_F \|u_1 - u_2\|_{\infty}$), and the ideal kernel $\kappa^{*}$ has bounded *Barron norm* \[[6](https://arxiv.org/html/2606.19538#bib.bib56)\] $\|\kappa^{*}\|_{\mathcal{B}} \leq B$, then the approximation error of ITNet with kernel MLP width $w_{\kappa}$ and $M$ quadrature points satisfies:

$$\sup_{u \in U_c} \left\| F(u) - \mathcal{K}_{\theta}[u] \right\|_{\infty} \leq \underbrace{\frac{C_1 B C_u \mu(\Omega)}{\sqrt{w_{\kappa}}}}_{\text{kernel approximation error}} + \underbrace{\frac{C_2 L_F C_u}{M^{1/s}}}_{\text{quadrature error}}, \tag{184}$$

where $C_1$ depends on the input dimension $2s + 2d$ and $C_2$ depends on the smoothness of the integrand and the quadrature rule.

###### Proof sketch\.

The first term follows from Barron’s theorem \[[6](https://arxiv.org/html/2606.19538#bib.bib56)\]: a single-hidden-layer MLP of width $w_{\kappa}$ approximates functions with bounded Barron norm at rate $O(1/\sqrt{w_{\kappa}})$, independent of input dimension (avoiding the curse of dimensionality for this function class). Applying this to each entry of $\kappa^{*}$ and combining with the error propagation bound ([178](https://arxiv.org/html/2606.19538#A6.E178)) gives the first term.

The second term is the quadrature error: replacing the integral $\int_{\Omega}$ with an $M$-point quadrature introduces error $O(M^{-r/s})$ for $r$-smooth integrands on $s$-dimensional compact domains. The rate $M^{-1/s}$ in (184) corresponds to $r = 1$ (for example, Lipschitz integrands); smoother integrands with $r > 1$ give faster rates. ∎
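To see how the two terms of (184) trade off, the bound can be evaluated directly. The sketch below is only an illustration: the function name `itnet_error_bound` is hypothetical, and the constants `C1` and `C2` are unknown in practice, so they default to 1 purely as placeholders.

```python
def itnet_error_bound(w_kappa, M, s, B, L_F, C_u, mu_Omega, C1=1.0, C2=1.0):
    """Right-hand side of (184): kernel approximation term plus quadrature term."""
    kernel_term = C1 * B * C_u * mu_Omega / w_kappa ** 0.5
    quadrature_term = C2 * L_F * C_u / M ** (1.0 / s)
    return kernel_term + quadrature_term

# Kernel term only (L_F = 0): quadrupling the width halves the bound.
base = itnet_error_bound(w_kappa=128, M=196, s=2, B=1.0, L_F=0.0, C_u=1.0, mu_Omega=1.0)
wider = itnet_error_bound(w_kappa=512, M=196, s=2, B=1.0, L_F=0.0, C_u=1.0, mu_Omega=1.0)

# Quadrature term only (B = 0) in s = 2 dimensions: quadrupling M halves the bound.
coarse = itnet_error_bound(w_kappa=128, M=196, s=2, B=0.0, L_F=1.0, C_u=1.0, mu_Omega=1.0)
fine = itnet_error_bound(w_kappa=128, M=784, s=2, B=0.0, L_F=1.0, C_u=1.0, mu_Omega=1.0)
print(base, wider, coarse, fine)
```

The quadrature term degrades with the domain dimension $s$: halving it requires $2^{s}$ times as many points, whereas the kernel term always needs four times the width.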

## Appendix G Proof of Theorem 5: Kernel Recovery Under Translation Symmetry

We provide the complete proof of Theorem [5](https://arxiv.org/html/2606.19538#Thmtheorem5) from the main paper, which was stated with only a proof sketch in §[2](https://arxiv.org/html/2606.19538#S2).

###### Theorem 6 (Kernel Recovery Under Translation Symmetry).

Let the data distribution $\mathcal{D}$ over input-label pairs $(u, y)$ be invariant under translations. That is, for every shift $\delta \in \mathbb{R}^{s}$, define the translated signal $\tau_{\delta} u(x) \coloneqq u(x - \delta)$, and assume $\mathcal{D}$ satisfies $(u, y) \sim \mathcal{D} \implies (\tau_{\delta} u, y) \sim \mathcal{D}$ for all $\delta$.

Let $\mathcal{L}(\theta) = \mathbb{E}_{(u, y) \sim \mathcal{D}}[\ell(\mathcal{K}_{\theta}[u], y)]$ be the population loss, where $\ell$ is differentiable.

Decompose the kernel into translation-invariant and orthogonal components:

$$\kappa_{\theta}(x, y, u(x), u(y)) = \kappa_{\theta}^{\mathrm{TI}}(x - y, u(x), u(y)) + \kappa_{\theta}^{\perp}(x, y, u(x), u(y)), \tag{185}$$

where $\kappa_{\theta}^{\mathrm{TI}}$ depends on positions only through the displacement $x - y$, and $\kappa_{\theta}^{\perp}$ is the orthogonal complement (i.e., $\int_{\Omega \times \Omega} \kappa_{\theta}^{\mathrm{TI}} \cdot \kappa_{\theta}^{\perp}\, d\mu \otimes d\mu = 0$).

Then under gradient flow with respect to $\theta$:

$$\left\| \frac{\partial \mathcal{L}}{\partial \kappa_{\theta}^{\perp}} \right\|_{F} = 0 \quad \text{at every iterate.} \tag{186}$$

###### Proof\.

The proof follows the equivariant gradient framework of \[[30](https://arxiv.org/html/2606.19538#bib.bib29)\]. We proceed in three steps.

Step 1: Translation invariance of the loss.

For any shift $\delta \in \mathbb{R}^{s}$, the ITNet operator applied to the translated signal $\tau_{\delta} u$ evaluates the kernel at shifted positions:

$$(\mathcal{K}_{\theta}[\tau_{\delta} u])(x) = \int_{\Omega} \kappa_{\theta}(x, y, u(x - \delta), u(y - \delta)) \cdot u(y - \delta)\, d\mu(y) + W_{\theta} u(x - \delta). \tag{187}$$

Substituting $z = y - \delta$ (valid when $\mu$ is translation-invariant):

$$(\mathcal{K}_{\theta}[\tau_{\delta} u])(x) = \int_{\Omega} \kappa_{\theta}(x, z + \delta, u(x - \delta), u(z)) \cdot u(z)\, d\mu(z) + W_{\theta} u(x - \delta). \tag{188}$$

Under the change of output variable $x' = x - \delta$:

$$(\mathcal{K}_{\theta}[\tau_{\delta} u])(x' + \delta) = \int_{\Omega} \kappa_{\theta}(x' + \delta, z + \delta, u(x'), u(z)) \cdot u(z)\, d\mu(z) + W_{\theta} u(x'). \tag{189}$$

By the translation invariance of $\mathcal{D}$, the population loss satisfies $\mathcal{L}(\theta) = \mathbb{E}[\ell(\mathcal{K}_{\theta}[\tau_{\delta} u], y)]$ for all $\delta$.

Step 2: Averaging the gradient over translations.

Consider the gradient of $\mathcal{L}$ with respect to the kernel parameters. Since $\mathcal{L}$ is invariant under all translations $\delta$, averaging the gradient over the translation group $G = (\mathbb{R}^{s}, +)$ with Haar measure \[[39](https://arxiv.org/html/2606.19538#bib.bib104), [67](https://arxiv.org/html/2606.19538#bib.bib105)\] $d\delta$ leaves the gradient unchanged:

$$\nabla_{\theta} \mathcal{L} = \frac{1}{|\Omega|} \int_{\Omega} \nabla_{\theta} \mathcal{L} \big|_{\tau_{\delta}}\, d\delta. \tag{190}$$

For the kernel component, the gradient with respect to $\kappa_{\theta}$ at a specific pair $(x, y)$ is:

$$\frac{\partial \mathcal{L}}{\partial \kappa_{\theta}(x, y, \cdot, \cdot)} = \mathbb{E}_{(u, y)} \left[ \frac{\partial \ell}{\partial (\mathcal{K}_{\theta}[u])(x)} \cdot u(y)^{\top} \right]. \tag{191}$$

Averaging this over all translations $\delta$ (shifting both $x$ and $y$ by $\delta$) yields:

$$\frac{1}{|\Omega|} \int_{\Omega} \frac{\partial \mathcal{L}}{\partial \kappa_{\theta}(x + \delta, y + \delta, \cdot, \cdot)}\, d\delta. \tag{192}$$

This average depends only on $x - y$ (since all pairs $(x + \delta, y + \delta)$ share the same displacement), so it lies entirely in the translation-invariant subspace.

Step 3: Orthogonality annihilation.

By the Schur orthogonality relations \[[81](https://arxiv.org/html/2606.19538#bib.bib106)\] for the translation group acting on the space of kernel functions, any component of the gradient that lies in the orthogonal complement of the translation-invariant subspace integrates to zero under the group average. Since the averaged gradient equals the original gradient (by Step 1), the projection of $\nabla_{\theta} \mathcal{L}$ onto the $\kappa_{\theta}^{\perp}$ subspace vanishes:

$$\mathrm{Proj}_{\kappa_{\theta}^{\perp}} \nabla_{\theta} \mathcal{L} = 0. \tag{193}$$

Under gradient flow $\dot{\theta} = -\nabla_{\theta} \mathcal{L}$, the parameters never receive a gradient signal that would move the kernel out of the translation-invariant subspace. If the kernel is initialized in (or near) the TI subspace, it remains there throughout training, recovering the convolutional special case of Theorem [C.1](https://arxiv.org/html/2606.19538#A3.SS1). ∎
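The group-averaging argument of Steps 2 and 3 has a simple finite analogue. The sketch below is an illustration under stated assumptions: positions live on a periodic grid of `n` points, the kernel is scalar-valued, and a random matrix `G` stands in for the gradient with respect to $\kappa_{\theta}(x, y)$; the helper `translation_average` is a hypothetical name.

```python
import numpy as np

def translation_average(G):
    """Average G[x, y] over all cyclic shifts (x, y) -> (x + k, y + k) mod n."""
    n = G.shape[0]
    shifted = [np.roll(G, shift=(-k, -k), axis=(0, 1)) for k in range(n)]
    return np.mean(shifted, axis=0)

rng = np.random.default_rng(0)
n = 8
G = rng.normal(size=(n, n))        # stand-in for a gradient w.r.t. kappa(x, y)

G_ti = translation_average(G)      # translation-invariant component
G_perp = G - G_ti                  # remaining (orthogonal) component

# The averaged gradient depends only on the displacement x - y (it is circulant).
depends_only_on_displacement = np.allclose(G_ti, np.roll(G_ti, shift=(1, 1), axis=(0, 1)))
# The orthogonal component is annihilated by the group average.
perp_averages_to_zero = np.allclose(translation_average(G_perp), 0.0)
# The two components are orthogonal in the Frobenius inner product.
components_orthogonal = np.isclose(np.sum(G_ti * G_perp), 0.0)
print(depends_only_on_displacement, perp_averages_to_zero, components_orthogonal)
```

On the finite cyclic group the average is an orthogonal projection onto circulant matrices, which mirrors the decomposition in (185).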

## Appendix H Implementation Details

### H\.1\. Computational Complexity

Table [9](https://arxiv.org/html/2606.19538#A8.T9) compares the time and space complexity of ITNet with CNN, Transformer, and Mamba baselines. At $M = n$, ITNet-MC matches Transformer complexity with an additional $d$ factor from the matrix-valued kernel; for $d \leq \sqrt{n}$, this is comparable. The low-rank scheme recovers linear complexity at the cost of rank-$r$ approximation error.

Table 9: Computational complexity. *Note:* $n$: positions, $d$: features, $M$: samples, $r$: rank, $k$: radius.

### H\.2\. Full Architectural Specification

We provide the complete architectural specification for all ITNet model sizes. The architecture follows the pre-norm Transformer layout \[[98](https://arxiv.org/html/2606.19538#bib.bib83)\] with the ITNet operator replacing self-attention. Every component is specified to enable exact reproduction.

Each ITNet layer $\ell = 1, \ldots, L$ consists of two sub-blocks with residual connections:

$$\begin{aligned}
z^{(\ell)} &= \mathcal{K}_{\theta}^{(\ell)}[\mathrm{LN}(u^{(\ell-1)})] + u^{(\ell-1)}, \\
u^{(\ell)} &= \mathcal{F}^{(\ell)}(\mathrm{LN}(z^{(\ell)})) + z^{(\ell)},
\end{aligned} \tag{194}$$

where:

- $\mathrm{LN}$ is layer normalization with learnable affine parameters $\gamma, \beta \in \mathbb{R}^{d}$, applied *before* each sub-block (pre-norm). We use pre-norm rather than post-norm because it provides more stable gradients at initialization for deep models \[[98](https://arxiv.org/html/2606.19538#bib.bib83)\], avoids the need for learning rate warmup in shallow settings, and is standard in modern architectures (GPT-2 \[[76](https://arxiv.org/html/2606.19538#bib.bib89)\], LLaMA \[[92](https://arxiv.org/html/2606.19538#bib.bib90)\], ViT \[[26](https://arxiv.org/html/2606.19538#bib.bib13)\]).
- $\mathcal{K}_{\theta}^{(\ell)}$ is the multi-head ITNet operator (Eq. ([4](https://arxiv.org/html/2606.19538#S2.E4))) with $H$ heads, each operating on $d_h = d/H$ dimensional features.
- $\mathcal{F}^{(\ell)}$ is a position-wise feed-forward network: $$\mathcal{F}^{(\ell)}(v) = W_2^{(\ell)} \cdot \mathrm{GELU}(W_1^{(\ell)} v + b_1^{(\ell)}) + b_2^{(\ell)}, \tag{195}$$ with $W_1^{(\ell)} \in \mathbb{R}^{4d \times d}$ (expansion factor 4), $W_2^{(\ell)} \in \mathbb{R}^{d \times 4d}$, and biases $b_1 \in \mathbb{R}^{4d}$, $b_2 \in \mathbb{R}^{d}$. We use GELU activation \[[43](https://arxiv.org/html/2606.19538#bib.bib82)\] throughout, consistent with BERT \[[24](https://arxiv.org/html/2606.19538#bib.bib15)\] and GPT-2 \[[76](https://arxiv.org/html/2606.19538#bib.bib89)\].
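The two sub-blocks of (194) and the feed-forward network of (195) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the ITNet operator is replaced by a hypothetical mean-pooling stand-in (`mean_pool_operator`), separate normalization parameters per sub-block are an assumption, and GELU uses the common tanh approximation.

```python
import numpy as np

def layer_norm(u, gamma, beta, eps=1e-5):
    """Normalize each position's feature vector, then apply the affine map."""
    mean = u.mean(axis=-1, keepdims=True)
    var = u.var(axis=-1, keepdims=True)
    return gamma * (u - mean) / np.sqrt(var + eps) + beta

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(v, W1, b1, W2, b2):
    """Position-wise feed-forward network, Eq. (195)."""
    return gelu(v @ W1.T + b1) @ W2.T + b2

def itnet_layer(u, operator, p):
    """One pre-norm layer, Eq. (194): operator sub-block, then FFN sub-block."""
    z = operator(layer_norm(u, p["gamma1"], p["beta1"])) + u
    return ffn(layer_norm(z, p["gamma2"], p["beta2"]),
               p["W1"], p["b1"], p["W2"], p["b2"]) + z

def mean_pool_operator(v):
    """Hypothetical stand-in for the ITNet operator: global mean at every position."""
    return np.broadcast_to(v.mean(axis=0), v.shape)

rng = np.random.default_rng(0)
n, d = 10, 16                                   # positions and hidden size (illustrative)
params = {
    "gamma1": np.ones(d), "beta1": np.zeros(d),
    "gamma2": np.ones(d), "beta2": np.zeros(d),
    "W1": rng.normal(0.0, 0.02, size=(4 * d, d)), "b1": np.zeros(4 * d),
    "W2": rng.normal(0.0, 0.02, size=(d, 4 * d)), "b2": np.zeros(d),
}
u = rng.normal(size=(n, d))
out = itnet_layer(u, mean_pool_operator, params)
print(out.shape)
```

Because both sub-blocks are residual, a layer whose operator and FFN output are zero reduces to the identity map, which is the property the initialization scheme exploits.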

Table [10](https://arxiv.org/html/2606.19538#A8.T10) specifies all architectural hyperparameters for each model size.

Table 10: Complete ITNet model configurations. $L$: layers; $d$: hidden dimension; $H$: heads; $d_h = d/H$: per-head dimension; $w_{\kappa}$: kernel MLP width; $\ell_{\kappa}$: kernel MLP depth; FFN: feed-forward expansion factor.

All configurations use $d_h = 64$ per head (except ITNet-PC, which uses $d_h = 32$). This means the per-head kernel outputs a $64 \times 64$ matrix (4,096 values), and the per-pair cost is $O(d_h^{2}) = O(4096)$ multiply-adds for the matrix-vector product. Summing over $H$ heads gives $O(H \cdot d_h^{2}) = O(d^{2}/H)$ per pair. Table [11](https://arxiv.org/html/2606.19538#A8.T11) provides a detailed parameter count for ITNet-B.

Table 11: Parameter breakdown for ITNet-B (12 layers, $d = 768$, $H = 12$).

The dominant parameter cost is the kernel MLP output layer $W_2$ ($\sim$75.5M across 12 layers), because each head’s kernel must produce a $d_h^{2} = 4096$-dimensional output. This is roughly 3.6 times the size of the $W_Q$, $W_K$, $W_V$ projections in a standard Transformer ($3 \times d^{2} = 3 \times 768^{2} = 1.77$M per layer, $\times 12 = 21.2$M total), with the additional cost arising from the kernel MLP’s ability to compute nonlinear, content-dependent transformations rather than fixed linear projections.
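The parameter figures quoted above can be reproduced with a few lines of arithmetic. The sketch assumes one kernel MLP per head per layer and takes $w_{\kappa} = 128$ for ITNet-B from the stated shape $W_2 \in \mathbb{R}^{4096 \times 128}$; variable names are illustrative.

```python
d, H, num_layers, w_kappa = 768, 12, 12, 128
d_h = d // H                                        # per-head dimension: 64

# Kernel MLP output layer W2 has shape (d_h^2, w_kappa) for each head.
kernel_w2_per_head = d_h**2 * w_kappa               # 4096 * 128
kernel_w2_total = kernel_w2_per_head * H * num_layers

# Q, K, V projections of a standard Transformer layer: three d x d matrices.
qkv_per_layer = 3 * d**2
qkv_total = qkv_per_layer * num_layers

ratio = kernel_w2_total / qkv_total
print(kernel_w2_total, qkv_per_layer, qkv_total, round(ratio, 2))
```

Under these assumptions the totals come to about 75.5M and 21.2M parameters, a ratio of roughly 3.6.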

Kernel MLP Architecture: The kernel MLP for each head $h$ has the following structure:

$$h^{(0)} = z_{xy} \in \mathbb{R}^{6L_f + 1 + 3d_h}, \tag{196}$$

$$h^{(1)} = \mathrm{GELU}(W_1 h^{(0)} + b_1) \in \mathbb{R}^{w_{\kappa}}, \tag{197}$$

$$\mathrm{vec}(K_{xy}^{(h)}) = W_2 h^{(1)} + b_2 \in \mathbb{R}^{d_h^{2}}, \tag{198}$$

where:

- $W_1 \in \mathbb{R}^{w_{\kappa} \times (6L_f + 1 + 3d_h)}$: input-to-hidden weights. For ITNet-B with $L_f = 64$ Fourier frequencies and $d_h = 64$, the input dimension is $6 \times 64 + 1 + 3 \times 64 = 577$, so $W_1 \in \mathbb{R}^{128 \times 577}$.
- $b_1 \in \mathbb{R}^{w_{\kappa}}$: hidden bias.
- $W_2 \in \mathbb{R}^{d_h^{2} \times w_{\kappa}}$: hidden-to-output weights. For $d_h = 64$: $W_2 \in \mathbb{R}^{4096 \times 128}$.
- $b_2 \in \mathbb{R}^{d_h^{2}}$: output bias.
- The output $\mathrm{vec}(K_{xy}^{(h)}) \in \mathbb{R}^{d_h^{2}}$ is reshaped to $K_{xy}^{(h)} \in \mathbb{R}^{d_h \times d_h}$.

We use a 2-layer MLP (one hidden layer) rather than deeper architectures because the kernel MLP width ablation (Table [20](https://arxiv.org/html/2606.19538#A13.T20)) shows that 2 layers with $w_{\kappa} = 128$ achieves 81.4% on ImageNet-1K, matching 3 layers (81.5%) at significantly lower cost. The shallow MLP also has better hardware utilization: deeper MLPs require sequential execution of layers, reducing parallelism within each tile.
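The kernel MLP of (196)–(198) can be sketched for a small batch of position pairs. This is an illustration under assumptions: the pair features `z_xy` are random placeholders of the stated dimension (how they are assembled from positions and features is not shown here), the weights use a placeholder Gaussian initialization, and GELU uses the tanh approximation.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def kernel_mlp(z_xy, W1, b1, W2, b2, d_h):
    """Map pair features z_xy (..., in_dim) to kernel matrices (..., d_h, d_h)."""
    h1 = gelu(z_xy @ W1.T + b1)                     # Eq. (197): (..., w_kappa)
    vec_K = h1 @ W2.T + b2                          # Eq. (198): (..., d_h**2)
    return vec_K.reshape(*z_xy.shape[:-1], d_h, d_h)

L_f, d_h, w_kappa = 64, 64, 128
in_dim = 6 * L_f + 1 + 3 * d_h                      # 577 for ITNet-B

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.02, size=(w_kappa, in_dim))  # placeholder weights
b1 = np.zeros(w_kappa)
W2 = rng.normal(0.0, 0.02, size=(d_h**2, w_kappa))
b2 = np.zeros(d_h**2)

num_pairs = 5
z_xy = rng.normal(size=(num_pairs, in_dim))         # placeholder pair features
u_y = rng.normal(size=(num_pairs, d_h))             # per-head features u(y)

K_xy = kernel_mlp(z_xy, W1, b1, W2, b2, d_h)        # one d_h x d_h matrix per pair
messages = np.einsum("pab,pb->pa", K_xy, u_y)       # K_xy @ u(y) for each pair
print(in_dim, K_xy.shape, messages.shape)
```

Each pair costs one $d_h \times d_h$ matrix-vector product after the MLP, which is the $O(d_h^{2})$ per-pair cost discussed earlier.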

![Refer to caption](https://arxiv.org/html/2606.19538v1/x1.png)

Figure 1: Overview of the ITNet architecture. The model consists of $L$ stacked pre-norm residual layers. Each layer applies (i) layer normalization, (ii) the ITNet operator, a learnable integral operator with a kernel depending on positions $x, y$ and features $u(x), u(y)$, and (iii) a feed-forward network (FFN) for local refinement, with residual connections after each block. Positional information is provided through the domain $\Omega$ and used directly by the kernel. This design mirrors Transformer-style architectures while generalizing convolution (local kernels), attention (content-dependent kernels), and recurrence (causal kernels) within a unified framework.

#### H.2.1 Positional Encoding: Random Fourier Features

The Fourier feature map $\gamma: \mathbb{R}^{s} \to \mathbb{R}^{2L_f}$ is defined as:

$$\gamma(x) = [\sin(2\pi \mathbf{B} x);\; \cos(2\pi \mathbf{B} x)] \in \mathbb{R}^{2L_f}, \tag{199}$$

where $\mathbf{B} \in \mathbb{R}^{L_f \times s}$ is a fixed random matrix sampled once at initialization from $\mathcal{N}(0, \sigma^{2} I)$ with $\sigma = 10$ and $L_f = 64$.

Raw position coordinates $x \in \mathbb{R}^{s}$ are low-dimensional ($s = 1$ for text, $s = 2$ for images, $s = 3$ for point clouds). An MLP operating on such low-dimensional inputs suffers from *spectral bias* \[[98](https://arxiv.org/html/2606.19538#bib.bib83)\]: it preferentially learns low-frequency functions and struggles to represent high-frequency spatial patterns. The Fourier feature map lifts positions to a $2L_f$-dimensional space where high-frequency components are explicitly represented as linear features, enabling the MLP to learn sharp spatial functions \[[87](https://arxiv.org/html/2606.19538#bib.bib32)\].

The bandwidth $\sigma$ controls the frequency range: $\sigma = 10$ corresponds to spatial frequencies up to $\sim$10 cycles per unit length, which captures both coarse object-level structure and fine-grained local texture on normalized domains. Table [23](https://arxiv.org/html/2606.19538#A13.T23) shows that $\sigma = 1$ under-resolves spatial structure ($-1.3\%$), $\sigma = 100$ introduces excessively high-frequency features that are difficult for the kernel MLP to use ($-0.6\%$), and $\sigma = 10$ is optimal. The number of frequencies $L_f = 64$ provides a 128-dimensional positional representation per coordinate group; performance saturates beyond $L_f = 64$ (Table [23](https://arxiv.org/html/2606.19538#A13.T23)).

We sample $\mathbf{B}$ once from $\mathcal{N}(0, \sigma^{2} I)$ and freeze it throughout training. Learning $\mathbf{B}$ would make the Fourier features depend on the training data, coupling position encoding with feature learning and complicating the theoretical analysis (Theorem [5](https://arxiv.org/html/2606.19538#Thmtheorem5) assumes fixed positional encoding). The kernel MLP is already a universal approximator over the lifted positional space, so learnable $\mathbf{B}$ provides no additional expressiveness in theory; empirically, we found no improvement from learning $\mathbf{B}$ ($\pm 0.0\%$ on ImageNet-1K).
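The frozen random Fourier feature map of (199) can be sketched as below. The function names `make_fourier_matrix` and `fourier_features`, the seed, and the example grid of positions are illustrative assumptions; only $\sigma = 10$ and $L_f = 64$ come from the text.

```python
import numpy as np

def make_fourier_matrix(L_f, s, sigma=10.0, seed=0):
    """Sample the frozen projection B in R^{L_f x s} with entries from N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, sigma, size=(L_f, s))

def fourier_features(x, B):
    """gamma(x) = [sin(2 pi B x); cos(2 pi B x)] for a batch of positions x of shape (n, s)."""
    proj = 2.0 * np.pi * (x @ B.T)                  # (n, L_f)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

L_f, s = 64, 2                                      # 64 frequencies, 2-D image positions
B = make_fourier_matrix(L_f, s)                     # sampled once, never updated

# Example positions: a 4 x 4 grid on the normalized domain [0, 1]^2 (illustrative).
grid = np.linspace(0.0, 1.0, 4)
x = np.stack(np.meshgrid(grid, grid, indexing="ij"), axis=-1).reshape(-1, s)

gamma = fourier_features(x, B)
print(gamma.shape)
```

Since every sine and cosine pair satisfies $\sin^{2} + \cos^{2} = 1$, each encoded position has squared norm exactly $L_f$, so the encoding is well scaled regardless of $\sigma$.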

### H\.3\. Initialization Scheme

Proper initialization is critical for training stability, especially because the ITNet operator is more complex than self-attention (matrix-valued kernel vs. scalar attention weights). Our initialization ensures two properties: (i) the initial operator is approximately the identity, so that an $L$-layer ITNet behaves like a shallow network at the start of training; and (ii) the initial kernel output has small norm, so that the integral term does not dominate the residual.

#### H.3.1 Kernel MLP Initialization

- Input-to-hidden weights $W_1$: Sampled from $\mathcal{N}(0, 0.02^{2})$. The small variance ensures the hidden activations $h^{(1)} = \mathrm{GELU}(W_1 z + b_1)$ are in the approximately linear regime of GELU at initialization ($\mathrm{GELU}(x) \approx 0.5x$ for $|x| \ll 1$), providing well-conditioned gradients.
- Hidden bias $b_1$: Initialised to zero.
- Hidden-to-output weights $W_2$: Initialised as $W_2 = \epsilon \cdot \tilde{W}_2$, where $\tilde{W}_2 \sim \mathcal{N}(0, 1/w_{\kappa})$ and $\epsilon = 10^{-3}$. This ensures the initial kernel output $\mathrm{vec}(K_{xy}) = W_2 h^{(1)} + b_2$ has small norm: $\|\mathrm{vec}(K_{xy})\|_{2} \approx \epsilon \sqrt{d_h^{2}} = 10^{-3} \times 64 = 0.064$, so $\|K_{xy}\|_{F} \approx 0.064$.
- Output bias $b_2$: Initialised so that the initial kernel is approximately $\frac{1}{n} \mathbf{I}_{d_h}$. Specifically, $[b_2]_k = 1/n$ if $k$ corresponds to a diagonal entry of the $d_h \times d_h$ reshaped output, and $0$ otherwise. This makes the initial operator output approximately $\frac{1}{n} \sum_{j} \mathbf{I}_{d} \cdot u(x_j) + W_{\theta} u(x_i) = \bar{u} + u(x_i)$: the global mean plus the identity, a simple pooling-plus-skip operation.
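The four initialization rules above can be written out as a short sketch. The function name `init_kernel_mlp`, the seeds, the sequence length `n = 196`, and the random placeholder input `z_xy` are assumptions made for illustration; GELU uses the tanh approximation.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init_kernel_mlp(in_dim, w_kappa, d_h, n, eps=1e-3, seed=0):
    """Initialize kernel MLP parameters following the scheme described above."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.02, size=(w_kappa, in_dim))        # N(0, 0.02^2)
    b1 = np.zeros(w_kappa)                                    # zero hidden bias
    # N(0, 1 / w_kappa) has standard deviation sqrt(1 / w_kappa); scale by eps.
    W2 = eps * rng.normal(0.0, np.sqrt(1.0 / w_kappa), size=(d_h**2, w_kappa))
    b2 = (np.eye(d_h) / n).reshape(-1)                        # 1/n on diagonal entries, 0 elsewhere
    return W1, b1, W2, b2

in_dim, w_kappa, d_h, n = 577, 128, 64, 196                   # n = 196 is an assumed sequence length
W1, b1, W2, b2 = init_kernel_mlp(in_dim, w_kappa, d_h, n)

z_xy = np.random.default_rng(1).normal(size=in_dim)           # placeholder pair features
K_init = (W2 @ gelu(W1 @ z_xy + b1) + b2).reshape(d_h, d_h)
deviation = np.linalg.norm(K_init - np.eye(d_h) / n)          # Frobenius distance from I / n
print(deviation)
```

With these settings the initial kernel stays close to $\frac{1}{n}\mathbf{I}_{d_h}$, so summing $K_{xy} u(x_j)$ over $n$ positions yields approximately the global mean $\bar{u}$.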

This follows the *zero-init* principle used in GPT-2 \[[76](https://arxiv.org/html/2606.19538#bib.bib89)\] and formalised as LayerScale: initialising each layer’s output near zero ensures the residual stream satisfies $u^{(\ell)} \approx u^{(\ell-1)}$ at the start of training, so that an $L$-layer network initially behaves as a 1-layer network. This prevents the signal-to-noise ratio from degrading across layers and is essential for training models with $L \geq 12$ without careful learning rate warmup.

We compared three $\epsilon$ values on ImageNet-1K (ITNet-S, 300 epochs):

Residual Matrix Initialization: $W_{\theta} = I_{d}$: the initial residual is the identity, so the full operator output is approximately $u(x_i) + \text{small integral term} \approx u(x_i)$. This is the standard skip-connection initialization used in ResNets \[[41](https://arxiv.org/html/2606.19538#bib.bib14)\] and Transformers.

FFN Initialization: The feed-forward network weights $W_1^{(\ell)}, W_2^{(\ell)}$ are initialized with the standard Xavier uniform scheme \[[32](https://arxiv.org/html/2606.19538#bib.bib88)\]: $W \sim \mathrm{Uniform}\left(-\sqrt{6/(d_{\mathrm{in}} + d_{\mathrm{out}})},\; \sqrt{6/(d_{\mathrm{in}} + d_{\mathrm{out}})}\right)$. Biases are initialized to zero. The output layer $W_2^{(\ell)}$ is additionally scaled by $1/\sqrt{2L}$ following \[[76](https://arxiv.org/html/2606.19538#bib.bib89)\], which normalizes the variance contribution of each layer in the residual stream.
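The FFN initialization can be sketched as follows. The helper names `xavier_uniform` and `init_ffn` and the seed are illustrative; the expansion factor 4 and the $1/\sqrt{2L}$ output scaling are taken from the text.

```python
import numpy as np

def xavier_uniform(d_out, d_in, rng):
    """Xavier uniform: entries drawn from U(-a, a) with a = sqrt(6 / (d_in + d_out))."""
    limit = np.sqrt(6.0 / (d_in + d_out))
    return rng.uniform(-limit, limit, size=(d_out, d_in))

def init_ffn(d, num_layers, expansion=4, seed=0):
    """Initialize one FFN block; the output layer is scaled by 1 / sqrt(2 * num_layers)."""
    rng = np.random.default_rng(seed)
    W1 = xavier_uniform(expansion * d, d, rng)
    b1 = np.zeros(expansion * d)
    W2 = xavier_uniform(d, expansion * d, rng) / np.sqrt(2.0 * num_layers)
    b2 = np.zeros(d)
    return W1, b1, W2, b2

d, num_layers = 768, 12                          # ITNet-B sizes
W1, b1, W2, b2 = init_ffn(d, num_layers)

limit = np.sqrt(6.0 / (d + 4 * d))               # same limit for both layers: d_in + d_out = 5d
print(W1.shape, W2.shape, np.abs(W1).max() <= limit,
      np.abs(W2).max() <= limit / np.sqrt(2.0 * num_layers))
```

The extra scaling shrinks each layer's contribution to the residual stream so that the summed variance over all $2L$ sub-blocks stays of order one.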

Layer Normalization Initialization: The affine parameters are initialized as $\gamma = \mathbf{1}_d$ (scale) and $\beta = \mathbf{0}_d$ (shift), so that $\mathrm{LN}$ initially performs only zero-mean unit-variance normalization without rescaling.

Output Projection Initialization: The multi-head output projection $W^{O} \in \mathbb{R}^{d\times d}$ is initialized with Xavier uniform, scaled by $1/\sqrt{2L}$ (same as FFN output), ensuring the concatenated head outputs do not amplify the signal.

#### H.3.2 Embedding Initialization

Image encoder: The patch embedding $W_{\mathrm{img}} \in \mathbb{R}^{d\times(P^{2}\cdot C)}$ (where $P=16$ is the patch size and $C=3$ is the number of channels) is initialized with truncated normal $\mathcal{N}(0, 0.02^{2})$, following ViT \[[26](https://arxiv.org/html/2606.19538#bib.bib13)\]. The \[CLS\] token embedding is initialized from $\mathcal{N}(0, 0.02^{2})$.

Text encoder: The token embedding $W_{\mathrm{txt}} \in \mathbb{R}^{d\times V}$ (where $V = 30{,}522$ is the WordPiece vocabulary size) is initialized from $\mathcal{N}(0, 0.02^{2})$, following BERT \[[24](https://arxiv.org/html/2606.19538#bib.bib15)\]. The \[CLS\] and \[SEP\] token embeddings are initialized identically.

Point cloud encoder: The coordinate embedding $W_{\mathrm{pc}} \in \mathbb{R}^{d\times 3}$ is initialized with Xavier uniform. If the local pre-extraction module is used ($K=16$ neighbours), its MLP weights follow the same $\mathcal{N}(0, 0.02^{2})$ scheme.

Modality embeddings: For multimodal tasks, each modality receives a learnable embedding $e_{\mathrm{img}}, e_{\mathrm{txt}} \in \mathbb{R}^{d}$ added to the features after the modality-specific encoder. These are initialized from $\mathcal{N}(0, 0.02^{2})$.

### H\.4\. Regularization and Training Stability

We apply dropout \[[82](https://arxiv.org/html/2606.19538#bib.bib84)\] at three locations: (i) after the ITNet operator output (before the residual addition), with rate $p=0.1$ for ITNet-S/B and $p=0.0$ for ITNet-PC; (ii) after the FFN hidden layer (inside the FFN), with the same rate; (iii) after the embedding layer, with rate $p=0.1$. No dropout is applied inside the kernel MLP, as the kernel evaluation is already regularised by the $\epsilon$-scaling of $W_2$.

Following Huang et al. \[[47](https://arxiv.org/html/2606.19538#bib.bib87)\], we apply stochastic depth with a drop probability that increases linearly with the layer index, $p^{(\ell)} = \ell\cdot p_{\mathrm{drop}}/L$, from $p_{\mathrm{drop}}/L$ at the first layer ($\ell=1$) to $p_{\mathrm{drop}}$ at the last layer ($\ell=L$). We use $p_{\mathrm{drop}}=0.1$ for ITNet-S, $0.2$ for ITNet-B, and $0.3$ for ITNet-L. This is standard in ViT training \[[90](https://arxiv.org/html/2606.19538#bib.bib40)\] and essential for training models with $L \geq 24$.

We clip the global gradient norm to $1.0$ in all experiments, using `torch.nn.utils.clip_grad_norm_`. This prevents gradient explosion when the kernel MLP produces large outputs early in training (before $W_2$ converges to small values).

All experiments use `bfloat16` for forward and backward computation with `float32` master weights and optimizer states \[[66](https://arxiv.org/html/2606.19538#bib.bib86)\]. The kernel MLP evaluation, the matrix-vector product $K_{pq}\cdot u(x_j)$, and gradient accumulation are all performed in `bfloat16`. Layer normalization and the softmax (when used for baseline comparisons) use `float32` for numerical stability.

For ImageNet-1K experiments, we maintain an exponential moving average (EMA) of the model weights with decay $\beta_{\mathrm{EMA}} = 0.9999$, following \[[70](https://arxiv.org/html/2606.19538#bib.bib85)\]. The EMA model is used for evaluation.
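Two of the schedules in this section are simple enough to sketch in plain Python. The snippet assumes layers are indexed $\ell = 1, \ldots, L$ in the formula $p^{(\ell)} = \ell \cdot p_{\mathrm{drop}}/L$, and treats the model weights as a flat list of floats for the EMA update; the function names are hypothetical.

```python
def drop_path_rates(num_layers, p_drop):
    """Per-layer stochastic-depth rates p^(l) = l * p_drop / L for l = 1..L (indexing assumed)."""
    return [l * p_drop / num_layers for l in range(1, num_layers + 1)]

def ema_update(ema_weights, weights, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

rates = drop_path_rates(num_layers=12, p_drop=0.1)   # illustrative depth, ITNet-S drop rate
ema = ema_update([1.0, -2.0], [0.0, 0.0])            # toy two-parameter "model"
```

With this indexing the deepest layer receives exactly `p_drop`, and the rates grow by a constant step of `p_drop / L` per layer.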

We use AdamW \[[63](https://arxiv.org/html/2606.19538#bib.bib36)\] with decoupled weight decay applied to all weight matrices but *not* to biases, layer normalization parameters, the Fourier feature matrix $\mathbf{B}$ (which is frozen), or the kernel MLP output bias $b_2$ (which encodes the $1/n\cdot\mathbf{I}_d$ initialization and should not be pulled toward zero).
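The weight-decay exclusion rule can be sketched as a grouping function over parameter names and shapes. All parameter names below (`fourier.B`, `kernel_mlp.b2`, and so on) are hypothetical placeholders, not identifiers from the paper's code; in a real training script the two lists would be passed to the optimizer as separate parameter groups.

```python
def split_weight_decay(named_shapes, frozen=("fourier.B",), extra_no_decay=("kernel_mlp.b2",)):
    """Split parameter names into (decay, no_decay) lists; frozen parameters are skipped."""
    decay, no_decay = [], []
    for name, shape in named_shapes.items():
        if name in frozen:
            continue                      # frozen: not optimised at all
        is_matrix = len(shape) >= 2
        if not is_matrix or "norm" in name or name in extra_no_decay:
            no_decay.append(name)         # biases, LayerNorm parameters, output bias b2
        else:
            decay.append(name)            # ordinary weight matrices
    return decay, no_decay

params = {                                # hypothetical names and shapes
    "ffn.W1": (3072, 768), "ffn.b1": (3072,),
    "norm1.gamma": (768,), "norm1.beta": (768,),
    "kernel_mlp.W2": (4096, 128), "kernel_mlp.b2": (4096,),
    "fourier.B": (64, 2),
}
decay, no_decay = split_weight_decay(params)
```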

### H\.5\. Optimizer Configuration

All experiments use AdamW \[[63](https://arxiv.org/html/2606.19538#bib.bib36)\] with:

- $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon_{\mathrm{Adam}} = 10^{-8}$
- Cosine learning rate schedule with linear warmup
- Learning rate and weight decay per task specified in Section [K](https://arxiv.org/html/2606.19538#A11)
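A cosine schedule with linear warmup can be written in a few lines. This is a generic sketch: the warmup length, total step count, base learning rate, and the floor `min_lr` are illustrative assumptions, since the per-task values are given elsewhere in the paper.

```python
import math

def lr_at(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative values only.
schedule = [lr_at(s, base_lr=1e-3, warmup_steps=10, total_steps=100) for s in range(101)]
```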

We chose AdamW over SGD with momentum because: (i) the kernel MLP parameters have different gradient magnitudes from the FFN and embedding parameters (the kernel MLP sees $n^2$ gradient contributions per step, while the FFN sees $n$), and Adam's per-parameter adaptive learning rate handles this scale difference automatically; (ii) decoupled weight decay \[[63](https://arxiv.org/html/2606.19538#bib.bib36)\] provides consistent regularization regardless of the adaptive learning rate, which is important for the kernel MLP whose gradients can be large early in training.

### H\.6\. Statistical Reporting

For all ITNet models (ITNet-S, ITNet-B, ITNet-L), we report results as mean $\pm$ standard deviation over $N=3$ independent runs with different random initializations (see Tables [1](https://arxiv.org/html/2606.19538#S4.T1), [2](https://arxiv.org/html/2606.19538#S4.T2), and [4](https://arxiv.org/html/2606.19538#S4.T4) in the main paper). Each run differs in random weight initialisation and stochastic training effects (e.g., data shuffling and mini-batch sampling), while all other training settings are kept fixed.

Let $\{m_i\}_{i=1}^{N}$ denote the evaluation metric (e.g., accuracy or F1) from $N=3$ runs. The reported mean and standard deviation are computed as:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} m_i, \qquad \sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(m_i-\mu)^{2}}. \tag{200}$$

We report $\mu \pm \sigma$ in all tables. The standard deviation reflects variability due to random initialization and stochastic optimization. For the considered benchmarks, the observed variance is small relative to performance differences between models, indicating stable training behavior. We report standard deviation (not standard error) and do not assume a specific distribution beyond empirical estimation from repeated runs.
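The reporting convention of Eq. (200) corresponds to the sample standard deviation with the $N-1$ denominator, which is what Python's `statistics.stdev` computes. The three run values below are placeholders for illustration and are not results from the paper.

```python
import statistics

def summarize(metrics):
    """Return (mean, sample standard deviation) as in Eq. (200)."""
    mu = statistics.mean(metrics)
    sigma = statistics.stdev(metrics)   # divides by N - 1, not N
    return mu, sigma

runs = [81.2, 81.5, 81.4]               # placeholder accuracies from N = 3 hypothetical runs
mu, sigma = summarize(runs)
report = f"{mu:.2f} ± {sigma:.2f}"      # string in the mean ± std format used in the tables
```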

### H\.7\. Triton Kernel Profiling

We profile the Triton kernel on an H200-140GB at $n=512$, $d=768$, $w_\kappa=128$, $\ell_\kappa=2$.

Table 12: Triton kernel profiling. ITNet-Exact achieves 61% of peak `bfloat16` TFLOPS (604/989). The gap relative to FlashAttention-2 (88%) is attributable to the kernel MLP evaluation, which involves small matrix multiplications ($128\times 577$ and $4096\times 128$) that underutilise the tensor core pipeline (tensor cores are optimised for large matrix multiplications with dimensions that are multiples of 16). ITNet-LR achieves 84% utilization because the factorized computation is dominated by the matrix multiplications $\Phi^{\top}Z$ and $Z = \sum_{j}\omega_j\Psi_j u_j$, which have larger dimensions and better tensor core utilization.

The auto-tuning procedure (Phase 1 of Algorithm [1](https://arxiv.org/html/2606.19538#alg1)) selects $B_q^{*}=64$, $B_k^{*}=64$, giving $(64+64)\times 768\times 2$ bytes $= 192$ KB per tile pair, which fits within the per-SM shared memory budget ($\sim$228 KB), enabling efficient on-chip execution. Larger tiles ($B_q = B_k = 128$) exceed this limit, reducing occupancy and degrading performance.
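The tile-size arithmetic can be checked directly. The sketch below assumes 2 bytes per element (`bfloat16`) and interprets the 228 KB budget as $228 \times 1024$ bytes; it enumerates the candidate block sizes and keeps those that fit, without doing any actual profiling.

```python
def tile_bytes(bq, bk, d, bytes_per_element=2):
    """Bytes needed to hold one query tile and one key tile of width d."""
    return (bq + bk) * d * bytes_per_element

SRAM_BUDGET = 228 * 1024          # assumed per-SM shared memory budget in bytes
CANDIDATES = (16, 32, 64, 128)

valid = [(bq, bk) for bq in CANDIDATES for bk in CANDIDATES
         if tile_bytes(bq, bk, d=768) <= SRAM_BUDGET]

selected_kib = tile_bytes(64, 64, d=768) / 1024   # footprint of the selected (64, 64) pair
```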

### H\.8\. IO Complexity of Tiled ITNet

###### Proposition 10 (IO Complexity - Full Statement).

Let $\ell_\kappa$ denote the number of MLP layers in $\kappa_\theta$ and $w_\kappa$ the hidden width. Let $B_q$ and $B_k$ be the query and key tile sizes satisfying the SRAM constraint $(B_q+B_k)\cdot d \leq S_{\mathrm{SRAM}}$. The tiled forward pass (Algorithm [1](https://arxiv.org/html/2606.19538#alg1)) requires:

$$\Theta\!\left(\frac{n^{2}d}{B_q}\right)\;\text{HBM reads}\quad\text{and}\quad O\!\left(n^{2}\bigl(w_\kappa\ell_\kappa + d^{2}\bigr)\right)\;\text{FLOPs}. \tag{201}$$

###### Proof\.

Each query tile $U_i\in\mathbb{R}^{B_q\times d}$ is loaded once per outer iteration: $n/B_q$ outer iterations $\times$ $B_q d$ elements $= nd$ reads for all queries. Each key tile $U_j\in\mathbb{R}^{B_k\times d}$ is loaded once per inner iteration for each outer tile: $(n/B_q)(n/B_k)\times B_k d = n^{2}d/B_q$ key reads in total. The total HBM reads are $nd + n^{2}d/B_q = \Theta(n^{2}d/B_q)$ for large $n$.

For FLOPs: each pair $(p,q)$ in a tile requires one MLP forward pass costing $O(w_\kappa\ell_\kappa)$ multiply-adds, plus one $d\times d$ matrix-vector product costing $O(d^{2})$. There are $n^{2}$ total pairs, giving $O(n^{2}(w_\kappa\ell_\kappa + d^{2}))$ total FLOPs. Since $w_\kappa\ell_\kappa$ is typically much smaller than $d^{2}$ (e.g., $w_\kappa=128$, $\ell_\kappa=2$ gives $w_\kappa\ell_\kappa = 256$ vs. $d^{2} = 768^{2} = 589{,}824$ for the matrix-vector product), the matrix-vector product dominates, and total FLOPs are $O(n^{2}d^{2})$. ∎
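The read count in the proof can be reproduced by walking the same loop nest and counting loaded elements. This sketch counts only the feature tiles $U_i$ and $U_j$ (positions are ignored) and assumes $n$ is a multiple of both tile sizes.

```python
def hbm_reads(n, d, bq, bk):
    """Count feature elements loaded from HBM by the tiled loop nest."""
    reads = 0
    for _ in range(0, n, bq):          # outer loop: query tiles
        reads += bq * d                # load one query tile
        for _ in range(0, n, bk):      # inner loop: key tiles
            reads += bk * d            # load one key tile
    return reads

n, d = 512, 768
reads_64 = hbm_reads(n, d, bq=64, bk=64)
closed_form = n * d + n * n * d // 64  # nd + n^2 d / B_q from the proof
```

Changing `bk` while holding `bq` fixed leaves the count unchanged, which matches the proof's conclusion that the read volume scales with $1/B_q$.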

### H\.9\. Tiled Forward Pass

Algorithm [1](https://arxiv.org/html/2606.19538#alg1) gives the complete tiled forward pass. Phase 1 profiles all valid $(B_q, B_k)$ combinations against the SRAM budget and selects the fastest; Phase 2 streams query and key tiles through SRAM, evaluating the kernel MLP and accumulating the weighted integral without ever materializing the full $n\times n$ kernel matrix. The quadrature weight $\omega_j = \mu(\{x_j\})$ (typically $1/n$) appears on line 22. The residual $W_\theta U_i$ is fused into the write-back to avoid an extra HBM round-trip.

Algorithm 1: ITNet Tiled Forward Pass with Auto-Tuning

1. Input: signal $U\in\mathbb{R}^{n\times d}$, positions $X\in\mathbb{R}^{n\times s}$, kernel MLP $\kappa_\theta$, local weight $W_\theta\in\mathbb{R}^{d\times d}$, SRAM capacity $S_{\mathrm{SRAM}}$
2. Output: $O\in\mathbb{R}^{n\times d}$
3. Phase 1: Auto-tune block sizes
4. for $B_q\in\{16,32,64,128\}$ do
5. for $B_k\in\{16,32,64,128\}$ do
6. if $(B_q+B_k)\cdot d\cdot\mathrm{bytes\_per\_element}\leq S_{\mathrm{SRAM}}$ then
7. Profile tiled kernel with $(B_q,B_k)$ on a warmup batch
8. end if
9. end for
10. end for
11. Select $(B_q^{*},B_k^{*})$ with minimum measured runtime
12. Phase 2: Tiled forward pass
13. $O\leftarrow\mathbf{0}_{n\times d}$
14. for $i=0$ to $n-B_q^{*}$ step $B_q^{*}$ do $\triangleright$ Outer loop: query tiles
15. Load $U_i\leftarrow U[i{:}i{+}B_q^{*}]$, $X_i\leftarrow X[i{:}i{+}B_q^{*}]$ to SRAM
16. $\mathrm{acc}_i\leftarrow\mathbf{0}_{B_q^{*}\times d}$
17. for $j=0$ to $n-B_k^{*}$ step $B_k^{*}$ do $\triangleright$ Inner loop: key tiles
18. Load $U_j\leftarrow U[j{:}j{+}B_k^{*}]$, $X_j\leftarrow X[j{:}j{+}B_k^{*}]$ to SRAM
19. for $p=1,\ldots,B_q^{*}$ do
20. for $q=1,\ldots,B_k^{*}$ do
21. $K_{pq}\leftarrow\kappa_\theta(X_i[p],\,X_j[q],\,U_i[p],\,U_j[q])$ $\triangleright$ MLP eval in registers
22. $\mathrm{acc}_i[p]\leftarrow\mathrm{acc}_i[p]+K_{pq}\cdot U_j[q]\cdot\omega_j$ $\triangleright$ Weighted quadrature
23. end for
24. end for
25. end for
26. $O[i{:}i{+}B_q^{*}]\leftarrow\mathrm{acc}_i+W_\theta\,U_i$
27. Write $O[i{:}i{+}B_q^{*}]$ to HBM
28. end for
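Phase 2 of the algorithm can be emulated on the CPU with NumPy to check that tiling reproduces the dense computation. This is a sketch under stated assumptions, not the Triton implementation: the kernel is a toy two-layer tanh MLP on concatenated inputs (no Fourier features), the measure is uniform with $\omega_j = 1/n$, and the names `kernel_mlp` and `tiled_forward` are hypothetical.

```python
import numpy as np

def kernel_mlp(Xq, Xk, Uq, Uk, params):
    """Toy kernel MLP: returns K with shape (Bq, Bk, d, d) for all query/key pairs."""
    W1, b1, W2, b2 = params
    Bq, Bk, d = Xq.shape[0], Xk.shape[0], Uq.shape[1]
    feats = np.concatenate([
        np.broadcast_to(Xq[:, None, :], (Bq, Bk, Xq.shape[1])),
        np.broadcast_to(Xk[None, :, :], (Bq, Bk, Xk.shape[1])),
        np.broadcast_to(Uq[:, None, :], (Bq, Bk, d)),
        np.broadcast_to(Uk[None, :, :], (Bq, Bk, d)),
    ], axis=-1)
    h = np.tanh(feats @ W1 + b1)
    return (h @ W2 + b2).reshape(Bq, Bk, d, d)

def tiled_forward(U, X, params, W_theta, Bq, Bk):
    """Stream query and key tiles; never build the full (n, n, d, d) kernel."""
    n, d = U.shape
    omega = np.full(n, 1.0 / n)                        # uniform quadrature weights
    O = np.zeros((n, d))
    for i in range(0, n, Bq):                          # outer loop: query tiles
        Ui, Xi = U[i:i + Bq], X[i:i + Bq]
        acc = np.zeros_like(Ui)
        for j in range(0, n, Bk):                      # inner loop: key tiles
            Uj, Xj = U[j:j + Bk], X[j:j + Bk]
            K = kernel_mlp(Xi, Xj, Ui, Uj, params)     # one tile of kernel values
            acc += np.einsum("q,pqab,qb->pa", omega[j:j + Bk], K, Uj)
        O[i:i + Bq] = acc + Ui @ W_theta.T             # fused residual W_theta u_i
    return O

n, d, s, w = 32, 4, 2, 16                              # toy sizes (assumed)
rng = np.random.default_rng(0)
U, X = rng.normal(size=(n, d)), rng.normal(size=(n, s))
params = (0.1 * rng.normal(size=(2 * s + 2 * d, w)), np.zeros(w),
          0.1 * rng.normal(size=(w, d * d)), np.zeros(d * d))
W_theta = np.eye(d)

O_tiled = tiled_forward(U, X, params, W_theta, Bq=8, Bk=4)
K_full = kernel_mlp(X, X, U, U, params)                # dense reference, small n only
O_dense = np.einsum("q,pqab,qb->pa", np.full(n, 1.0 / n), K_full, U) + U @ W_theta.T
```

The dense reference is affordable only because `n` is tiny here; the point of the tiled loop is that its peak kernel storage is one `(Bq, Bk, d, d)` tile.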

### H\.10\. Tiled Backward Pass with Gradient Checkpointing

The forward pass (Algorithm [1](https://arxiv.org/html/2606.19538#alg1)) never materialises the full $n\times n\times d\times d$ kernel matrix: only a $B_q^{*}\times B_k^{*}$ tile of kernel evaluations resides in SRAM at any time. A naïve backward pass would require access to all $n^{2}$ kernel values $K_{ij}=\kappa_\theta(x_i,y_j,u(x_i),u(y_j))$ to compute the gradients, which would require either storing the full kernel matrix during the forward pass ($O(n^{2}d^{2})$ memory) or recomputing the entire kernel from scratch. Following the gradient checkpointing strategy of \[[10](https://arxiv.org/html/2606.19538#bib.bib31)\], also employed by FlashAttention \[[21](https://arxiv.org/html/2606.19538#bib.bib11)\] for the attention matrix, we adopt a *tiled recomputation* approach: during the backward pass, we iterate over the same tile structure as the forward pass, recompute the kernel MLP $\kappa_\theta$ for each tile on-the-fly, and immediately use the recomputed values to accumulate gradients before discarding them.

Let $\mathcal{L}$ denote the scalar loss. The forward pass computes, for each query position $x_i$:

$$O_i=(\mathcal{K}_\theta[u])(x_i)=\sum_{j=1}^{n}\omega_j\cdot\kappa_\theta(x_i,y_j,u_i,u_j)\cdot u_j+W_\theta\,u_i, \tag{202}$$

where $\omega_j$ is the quadrature weight for position $y_j$ and we write $u_i=u(x_i)$, $u_j=u(y_j)$ for brevity. The backward pass must compute three quantities:

1. $\partial\mathcal{L}/\partial u_i$ for all $i=1,\ldots,n$ (gradient with respect to input features, for backpropagation to earlier layers),
2. $\partial\mathcal{L}/\partial\theta$ (gradient with respect to kernel MLP parameters, for weight updates), and
3. $\partial\mathcal{L}/\partial W_\theta$ (gradient with respect to the residual matrix).

Applying the chain rule to Eq. ([202](https://arxiv.org/html/2606.19538#A8.E202)), we obtain:

(i) Gradient with respect to kernel evaluations. Define $K_{ij}=\kappa_\theta(x_i,y_j,u_i,u_j)\in\mathbb{R}^{d\times d}$. From Eq. ([202](https://arxiv.org/html/2606.19538#A8.E202)):

$$\frac{\partial\mathcal{L}}{\partial K_{ij}}=\omega_j\cdot\frac{\partial\mathcal{L}}{\partial O_i}\cdot u_j^{\top}\;\in\;\mathbb{R}^{d\times d}, \tag{203}$$

where $\partial\mathcal{L}/\partial O_i\in\mathbb{R}^{d}$ is the upstream gradient at query position $i$ (received from the subsequent layer).

(ii) Gradient with respect to input features. Each input feature $u_i$ participates in Eq. ([202](https://arxiv.org/html/2606.19538#A8.E202)) in three ways: (a) as a query feature (through the kernel's dependence on $u(x_i)$), (b) as a key feature (as the vector $u_j$ that the kernel multiplies when $j=i$), and (c) through the residual term $W_\theta u_i$. Combining these contributions:

$$\frac{\partial\mathcal{L}}{\partial u_i}=\underbrace{\sum_{j=1}^{n}\omega_j\cdot\frac{\partial\kappa_\theta}{\partial u(x_i)}\bigg|_{(x_i,y_j)}\!\!\cdot u_j\cdot\frac{\partial\mathcal{L}}{\partial O_i}}_{\text{(a) as query in row }i}+\underbrace{\sum_{k=1}^{n}\omega_i\cdot K_{ki}^{\top}\cdot\frac{\partial\mathcal{L}}{\partial O_k}}_{\text{(b) as key in column }i}+\underbrace{W_\theta^{\top}\cdot\frac{\partial\mathcal{L}}{\partial O_i}}_{\text{(c) residual}}. \tag{204}$$

Note that the kernel also depends on the key feature $u(y_j)$; the corresponding key-side Jacobian term has the same form as (a) and is not written out in Eq. (204). In practice, the first term (query-side kernel Jacobian) is expensive to form explicitly because $\partial\kappa_\theta/\partial u(x_i)$ is a tensor of shape $d\times d\times d$. We therefore compute it using automatic differentiation through the kernel MLP, which is feasible because the MLP is small ($w_\kappa=128$, $\ell_\kappa=2$).

(iii) Gradient with respect to kernel MLP parameters.

$$\frac{\partial\mathcal{L}}{\partial\theta}=\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{tr}\!\left(\left(\frac{\partial\mathcal{L}}{\partial K_{ij}}\right)^{\top}\cdot\frac{\partial\kappa_\theta(x_i,y_j,u_i,u_j)}{\partial\theta}\right), \tag{205}$$

where $\partial\kappa_\theta/\partial\theta$ is computed by standard backpropagation through the kernel MLP.

(iv) Gradient with respect to the residual matrix.

$$\frac{\partial\mathcal{L}}{\partial W_\theta}=\sum_{i=1}^{n}\frac{\partial\mathcal{L}}{\partial O_i}\cdot u_i^{\top}\;\in\;\mathbb{R}^{d\times d}. \tag{206}$$

The key observation is that all four gradient computations (Eqs. ([203](https://arxiv.org/html/2606.19538#A8.E203))–([206](https://arxiv.org/html/2606.19538#A8.E206))) involve sums over pairs $(i,j)$ that can be decomposed into tile-level partial sums, exactly matching the tiling structure of the forward pass. For each tile pair $(i,j)$ with $i$ indexing a query tile and $j$ indexing a key tile:

1. Reload $U_i, X_i$ (query tile) and $U_j, X_j$ (key tile) from HBM to SRAM.
2. Recompute $K_{pq}=\kappa_\theta(X_i[p],X_j[q],U_i[p],U_j[q])$ for all $p\in[1,B_q^{*}]$, $q\in[1,B_k^{*}]$ by running the kernel MLP forward pass again. This is the recomputation step that avoids storing $K$.
3. Run the kernel MLP backward pass to obtain $\partial\kappa_\theta/\partial\theta$ and $\partial\kappa_\theta/\partial u$ for each pair in the tile.
4. Accumulate the tile's contribution to $\partial\mathcal{L}/\partial\theta$ (Eq. ([205](https://arxiv.org/html/2606.19538#A8.E205))), $\partial\mathcal{L}/\partial U_i$ (query-side gradient), and $\partial\mathcal{L}/\partial U_j$ (key-side gradient).
5. Discard the tile's kernel values and MLP activations.

Algorithm [2](https://arxiv.org/html/2606.19538#alg2) formalizes this procedure.

Algorithm 2: ITNet Tiled Backward Pass (Gradient Checkpointing)

1. Input: forward outputs $O\in\mathbb{R}^{n\times d}$, inputs $U\in\mathbb{R}^{n\times d}$, positions $X\in\mathbb{R}^{n\times s}$, upstream gradient $\partial\mathcal{L}/\partial O\in\mathbb{R}^{n\times d}$, kernel MLP $\kappa_\theta$, residual $W_\theta$, block sizes $B_q^{*},B_k^{*}$
2. Output: gradients $\partial\mathcal{L}/\partial U\in\mathbb{R}^{n\times d}$, $\partial\mathcal{L}/\partial\theta$, $\partial\mathcal{L}/\partial W_\theta\in\mathbb{R}^{d\times d}$
3. $\partial\mathcal{L}/\partial U\leftarrow\mathbf{0}_{n\times d}$; $\partial\mathcal{L}/\partial\theta\leftarrow 0$; $\partial\mathcal{L}/\partial W_\theta\leftarrow\mathbf{0}_{d\times d}$
4. for $i=0,\,B_q^{*},\,2B_q^{*},\,\ldots,\,n-B_q^{*}$ do $\triangleright$ Outer loop: query tiles
5. Load $U_i,X_i,\partial\mathcal{L}/\partial O_i$ from HBM to SRAM
6. $\partial\mathcal{L}/\partial U_i\leftarrow\mathbf{0}_{B_q^{*}\times d}$
7. for $j=0,\,B_k^{*},\,2B_k^{*},\,\ldots,\,n-B_k^{*}$ do $\triangleright$ Inner loop: key tiles
8. Load $U_j,X_j$ to SRAM
9. Recompute $K_{pq}\leftarrow\kappa_\theta(X_i[p],\,X_j[q],\,U_i[p],\,U_j[q])$ for all $p\in[1,B_q^{*}],\,q\in[1,B_k^{*}]$
10. for $p=1,\ldots,B_q^{*}$; $q=1,\ldots,B_k^{*}$ do
11. $\partial\mathcal{L}/\partial K_{pq}\leftarrow\omega_j\cdot(\partial\mathcal{L}/\partial O_i[p])\cdot U_j[q]^{\top}$ $\triangleright$ Eq. ([203](https://arxiv.org/html/2606.19538#A8.E203))
12. $\partial\mathcal{L}/\partial U_i[p]\mathrel{+}=\omega_j\cdot\frac{\partial\kappa_\theta}{\partial u(x_i)}\big|_{pq}\cdot U_j[q]\cdot(\partial\mathcal{L}/\partial O_i[p])$ $\triangleright$ Query-side kernel Jacobian, term (a) of Eq. ([204](https://arxiv.org/html/2606.19538#A8.E204))
13. $\partial\mathcal{L}/\partial U_j[q]\mathrel{+}=\omega_j\cdot K_{pq}^{\top}\cdot(\partial\mathcal{L}/\partial O_i[p])$ $\triangleright$ Key-side contribution, term (b) of Eq. ([204](https://arxiv.org/html/2606.19538#A8.E204)), accumulated into $\partial\mathcal{L}/\partial U$
14. $\partial\mathcal{L}/\partial\theta\mathrel{+}=\mathrm{tr}\bigl((\partial\mathcal{L}/\partial K_{pq})^{\top}\cdot\partial\kappa_\theta/\partial\theta\bigr)$ $\triangleright$ MLP param grad via backprop
15. end for
16. end for
17. $\partial\mathcal{L}/\partial W_\theta\mathrel{+}=(\partial\mathcal{L}/\partial O_i)^{\top}\cdot U_i$ $\triangleright$ Residual matrix gradient, Eq. ([206](https://arxiv.org/html/2606.19538#A8.E206))
18. Add $\partial\mathcal{L}/\partial U_i+(\partial\mathcal{L}/\partial O_i)\,W_\theta$ to $\partial\mathcal{L}/\partial U[i{:}i{+}B_q^{*}]$ in HBM $\triangleright$ Includes residual term (c) of Eq. ([204](https://arxiv.org/html/2606.19538#A8.E204))
19. end for
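The tile-wise accumulation can be sanity-checked numerically in a simplified setting. The sketch below assumes a fixed, content-independent kernel tensor `K`, so that the kernel Jacobian term (a) of Eq. (204) vanishes and only terms (b), (c) and Eq. (206) remain; slicing a stored `K` stands in for the per-tile MLP recomputation. It then compares the tiled gradients against central finite differences. All function names are hypothetical.

```python
import numpy as np

def forward(U, K, W, omega):
    """O_i = sum_j omega_j K_ij u_j + W u_i (Eq. 202 with a fixed kernel)."""
    return np.einsum("j,ijab,jb->ia", omega, K, U) + U @ W.T

def tiled_backward(U, K, W, omega, G, Bq, Bk):
    """Accumulate dL/dU and dL/dW tile by tile; G is the upstream gradient dL/dO."""
    n = U.shape[0]
    dU, dW = np.zeros_like(U), np.zeros_like(W)
    for i in range(0, n, Bq):                       # query tiles
        Gi, Ui = G[i:i + Bq], U[i:i + Bq]
        for j in range(0, n, Bk):                   # key tiles
            Kt = K[i:i + Bq, j:j + Bk]              # stands in for the recomputed tile
            # Term (b): dL/du_q += omega_q * sum_p K_pq^T g_p
            dU[j:j + Bk] += omega[j:j + Bk, None] * np.einsum("pqab,pa->qb", Kt, Gi)
        dW += Gi.T @ Ui                             # Eq. (206): sum_p g_p u_p^T
        dU[i:i + Bq] += Gi @ W                      # term (c): W^T g_p, in row form
    return dU, dW

def num_grad(f, A, eps=1e-6):
    """Central finite differences of scalar f() with respect to array A (perturbed in place)."""
    g = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        A[idx] += eps
        hi = f()
        A[idx] -= 2 * eps
        lo = f()
        A[idx] += eps
        g[idx] = (hi - lo) / (2 * eps)
    return g

n, d = 8, 3
rng = np.random.default_rng(0)
U, W = rng.normal(size=(n, d)), rng.normal(size=(d, d))
K = rng.normal(size=(n, n, d, d))
G = rng.normal(size=(n, d))                         # fixed upstream gradient
omega = np.full(n, 1.0 / n)

loss = lambda: float(np.sum(G * forward(U, K, W, omega)))   # L with dL/dO = G
dU, dW = tiled_backward(U, K, W, omega, G, Bq=4, Bk=2)
```

With a content-dependent kernel, the same loop would additionally need the kernel Jacobian contributions obtained by differentiating through the MLP.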

The tiled backward pass performs exactly $n^{2}/(B_q^{*}\cdot B_k^{*})$ tile iterations, matching the forward pass. Each tile iteration involves: (a) one forward MLP pass per pair to recompute $K_{pq}$ ($B_q^{*}\cdot B_k^{*}\cdot O(w_\kappa\ell_\kappa)$ FLOPs), (b) one backward MLP pass per pair to compute $\partial\kappa_\theta/\partial\theta$ ($\approx 2\times$ the forward cost), and (c) matrix-vector products for the gradient accumulations ($O(d^{2})$ per pair). The total backward FLOP count is therefore $\approx 3\times$ the forward pass (one recomputation + one backward through the MLP + gradient accumulations), compared to $1\times$ if the kernel matrix were stored. However, peak memory is reduced from $O(n^{2}d^{2})$ (storing the full kernel matrix) to $O(nd+B_q^{*}\cdot B_k^{*}\cdot d^{2})$ (storing only inputs, outputs, and one tile of kernel values), which is $O(nd)$ for fixed tile sizes.

For the model sizes and sequence lengths considered ($n\leq 1024$, $d=768$), the naïve approach requires $O(n^{2}d^{2})$ memory for the kernel matrix, which is prohibitive in practice. The tiled recomputation strategy reduces this to $O(B_q B_k d^{2})$ memory per tile, enabling the backward pass to be computed within a similar memory budget as the forward pass.

The tiled backward pass is implemented as a single Triton \[[88](https://arxiv.org/html/2606.19538#bib.bib30)\] kernel that fuses the MLP recomputation, MLP backward, and gradient accumulation into one launch per tile pair. The kernel MLP's backward pass uses the standard automatic differentiation tape, which is created and destroyed within each tile iteration (no cross-tile tape storage). The $\partial\mathcal{L}/\partial\theta$ accumulation uses atomic additions across tiles to avoid race conditions when multiple tiles contribute to the same MLP parameter gradient; in practice, the atomics have negligible overhead because the number of parameter gradient accumulations is much smaller than the compute per tile.

### H\.11\. Monte Carlo Variance Analysis

###### Proposition 11 (Variance of Importance-Weighted Estimator - Full Proof).

For a discrete domain with empirical measure $\mu(y)=\frac{1}{n}\sum_{j=1}^{n}\delta_{y_j}$, the importance-weighted Monte Carlo estimator

$$(\widehat{\mathcal{K}_\theta}[u])(x_i)=\frac{1}{M}\sum_{m=1}^{M}\frac{\kappa_\theta(x_i,y_m,u(x_i),u(y_m))\cdot u(y_m)}{p_\theta(y_m\mid x_i)}+W_\theta u(x_i),\qquad y_m\sim p_\theta(\cdot\mid x_i) \tag{207}$$

satisfies $\mathbb{E}[(\widehat{\mathcal{K}_\theta}[u])(x_i)]=(\mathcal{K}_\theta[u])(x_i)$ (unbiased) and has variance

$$\mathrm{Var}[(\widehat{\mathcal{K}_\theta}[u])(x)]=\frac{1}{M}\left(\int_{\Omega}\frac{\|\kappa_\theta(x,y,u(x),u(y))\cdot u(y)\|_2^{2}}{p_\theta(y\mid x)}\,d\mu(y)-\|(\mathcal{K}_\theta[u])(x)-W_\theta u(x)\|_2^{2}\right). \tag{208}$$

The optimal proposal that minimizes this variance is

$$p_\theta^{*}(y\mid x)=\frac{\|\kappa_\theta(x,y,u(x),u(y))\cdot u(y)\|_2}{\int_{\Omega}\|\kappa_\theta(x,z,u(x),u(z))\cdot u(z)\|_2\,d\mu(z)}, \tag{209}$$

achieving minimal variance

$$\mathrm{Var}^{*}[(\widehat{\mathcal{K}_\theta}[u])(x)]=\frac{1}{M}\left[\left(\int_{\Omega}\|\kappa_\theta\cdot u(y)\|_2\,d\mu(y)\right)^{2}-\|(\mathcal{K}_\theta[u])(x)-W_\theta u(x)\|_2^{2}\right]. \tag{210}$$

In practice, we enforce $p_\theta(y\mid x)>0$ everywhere via $p_\theta(y\mid x)=(1-\varepsilon)\,p_\theta^{(\mathrm{MLP})}(y\mid x)+\varepsilon/n$ with $\varepsilon=0.01$.

###### Proof\.

Define $g(y)=\kappa_\theta(x,y,u(x),u(y))\cdot u(y)$ and $I=\int_{\Omega}g(y)\,d\mu(y)=(\mathcal{K}_\theta[u])(x)-W_\theta u(x)$. Each sample $y_m\sim p_\theta$ contributes the importance-weighted estimator $\hat{g}_m=g(y_m)/p_\theta(y_m\mid x)$, which satisfies $\mathbb{E}_{y_m\sim p_\theta}[\hat{g}_m]=\int_{\Omega}g(y)\,d\mu(y)=I$ by the definition of importance sampling (with $p_\theta$ understood as a density with respect to $\mu$).

The variance of the average of $M$ i.i.d. samples is:

$$\begin{align}
\mathrm{Var}\left[\frac{1}{M}\sum_{m=1}^{M}\hat{g}_m\right]&=\frac{1}{M}\mathrm{Var}[\hat{g}_1] \tag{211}\\
&=\frac{1}{M}\left(\mathbb{E}[\|\hat{g}_1\|_2^{2}]-\|I\|_2^{2}\right) \tag{212}\\
&=\frac{1}{M}\left(\int_{\Omega}\frac{\|g(y)\|_2^{2}}{p_\theta(y\mid x)}\,d\mu(y)-\|I\|_2^{2}\right). \tag{213}
\end{align}$$

Adding $W_\theta u(x)$ (deterministic) does not affect variance, establishing ([208](https://arxiv.org/html/2606.19538#A8.E208)).

To find the optimal $p_\theta^{*}$, we minimize $\int\|g(y)\|_2^{2}/p(y)\,d\mu(y)$ subject to $\int p(y)\,d\mu(y)=1$ and $p(y)\geq 0$. By the Cauchy-Schwarz inequality \[[78](https://arxiv.org/html/2606.19538#bib.bib111)\] (or Lagrange multipliers):

$$\int\frac{\|g(y)\|_2^{2}}{p(y)}\,d\mu(y)\cdot\int p(y)\,d\mu(y)\geq\left(\int\|g(y)\|_2\,d\mu(y)\right)^{2},$$

with equality iff $p(y)\propto\|g(y)\|_2$. Thus the minimum is achieved when $p^{*}(y)=\|g(y)\|_2/\int\|g(z)\|_2\,d\mu(z)$, yielding

$$\int_{\Omega}\frac{\|g(y)\|_2^{2}}{p^{*}(y)}\,d\mu(y)=\left(\int_{\Omega}\|g(y)\|_2\,d\mu(y)\right)^{2}.$$

Substituting back gives ([210](https://arxiv.org/html/2606.19538#A8.E210)). The support condition $p_\theta(y\mid x)>0$ is enforced via the $\varepsilon$-mixture described above. ∎
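The variance formulas can be explored numerically on a discrete domain. In this sketch the proposal is represented as a probability mass function `q` over the $n$ key positions, so the density with respect to $\mu$ is $n\,q_j$ and each sampled term is $g(y_j)/(n\,q_j)$; this convention, the random stand-in for $g(y_j)=\kappa_\theta\cdot u(y_j)$, and all function names are assumptions made for illustration.

```python
import numpy as np

def is_estimate(g, q, M, rng):
    """Importance-weighted estimate of I = mean_j g_j using M samples drawn from pmf q."""
    n = g.shape[0]
    idx = rng.choice(n, size=M, p=q)
    return (g[idx] / (n * q[idx, None])).mean(axis=0)

def is_variance(g, q, M):
    """Trace of the estimator covariance, the discrete analogue of Eq. (208)."""
    n = g.shape[0]
    I = g.mean(axis=0)
    second_moment = np.sum(q * np.sum((g / (n * q[:, None])) ** 2, axis=1))
    return (second_moment - I @ I) / M

def mix(q, eps=0.01):
    """Epsilon-mixture with the uniform pmf, keeping every position reachable."""
    return (1.0 - eps) * q + eps / len(q)

n, d, M = 64, 3, 20000
rng = np.random.default_rng(0)
g = rng.normal(size=(n, d))                 # stand-in for kappa_theta * u(y_j)
I = g.mean(axis=0)                          # exact integral under the empirical measure
norms = np.linalg.norm(g, axis=1)

q_uniform = np.full(n, 1.0 / n)
q_opt = norms / norms.sum()                 # optimal proposal, proportional to ||g_j||

var_uniform = is_variance(g, q_uniform, M)
var_opt = is_variance(g, q_opt, M)
est = is_estimate(g, mix(q_opt), M, rng)    # sampled estimate with the safeguarded proposal
```

Under this convention `var_opt` matches the closed form $\frac{1}{M}\bigl[(\tfrac{1}{n}\sum_j\|g_j\|_2)^2-\|I\|_2^2\bigr]$, the discrete counterpart of Eq. (210).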

### H\.12\. Low\-Rank Approximation Error Bound

###### Proposition 12 (Nuclear Norm Bound on Low-Rank Error).

Let $\kappa_\theta$ be the full-rank kernel and $\kappa_\theta^{(r)}=\Phi_\theta^{\top}\Psi_\theta$ the rank-$r$ factorization. For any finite evaluation set $\{(x_i,y_j)\}_{i,j=1}^{n}$, define the kernel matrix $K\in\mathbb{R}^{nd\times nd}$ with blocks $K_{ij}=\kappa_\theta(x_i,y_j,u(x_i),u(y_j))\in\mathbb{R}^{d\times d}$. Then the best rank-$r$ approximation $K^{(r)}$ satisfies:

$$\|K-K^{(r)}\|_F\leq\frac{\|K\|_{*}}{\sqrt{r}}, \tag{214}$$

where $\|K\|_{*}=\sum_{i=1}^{nd}\sigma_i(K)$ is the nuclear norm and $\sigma_i$ are the singular values.

###### Proof\.

By the Eckart\-Young\-Mirsky theorem\[[45](https://arxiv.org/html/2606.19538#bib.bib110)\], the best rank\-rrapproximation in Frobenius norm isK\(r\)=∑i=1rσi​ui​vi⊤K^\{\(r\)\}=\\sum\_\{i=1\}^\{r\}\\sigma\_\{i\}u\_\{i\}v\_\{i\}^\{\\top\}, and‖K−K\(r\)‖F2=∑i=r\+1min⁡\(n,d\)σi2\\\|K\-K^\{\(r\)\}\\\|\_\{F\}^\{2\}=\\sum\_\{i=r\+1\}^\{\\min\(n,d\)\}\\sigma\_\{i\}^\{2\}\.

Cauchy–Schwarz [[78](https://arxiv.org/html/2606.19538#bib.bib111)] applied to the tail singular values gives $\bigl(\sum_{i=r+1}^{nd}\sigma_i\bigr)^2\leq(nd-r)\sum_{i=r+1}^{nd}\sigma_i^2$, which is only a lower bound on the tail energy:

$$\sum_{i=r+1}^{nd}\sigma_i^2\geq\left(\sum_{i=r+1}^{nd}\sigma_i\right)^2/(nd-r).\tag{215}$$

The upper bound needed here instead follows from the relationship between nuclear and Frobenius norms. Since $\|K\|_*=\sum_{i=1}^{nd}\sigma_i$ and the tail sum $\sum_{i=r+1}^{nd}\sigma_i\leq\|K\|_*$, we have:

$$\|K-K^{(r)}\|_F=\sqrt{\sum_{i=r+1}^{nd}\sigma_i^2}\leq\sqrt{\max_{i>r}\sigma_i\cdot\sum_{i>r}\sigma_i}\leq\sqrt{\frac{\|K\|_*^2}{r}},\tag{216}$$

where the last inequality uses $\sigma_{r+1}\leq\|K\|_*/r$ (since the first $r$ singular values each contribute at least $\sigma_{r+1}$ to the nuclear norm). ∎
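As an illustration of the bound in Eq. (214), the sketch below compares the Eckart-Young-Mirsky tail error with $\|K\|_*/\sqrt{r}$ for several ranks. The matrix `K` is a random Gaussian stand-in for the $nd\times nd$ kernel matrix (an assumption made purely for illustration), so the numbers say nothing about how tight the bound is for learned kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.normal(size=(40, 40))                 # stand-in for the nd x nd kernel matrix
U, sigma, Vt = np.linalg.svd(K)               # singular values in descending order
nuc = sigma.sum()                             # nuclear norm ||K||_*

def tail_error(r):
    """Frobenius error of the best rank-r approximation (tail singular values)."""
    return np.sqrt(np.sum(sigma[r:] ** 2))

ranks = [1, 2, 5, 10, 20, 39]
errors = [tail_error(r) for r in ranks]
bounds = [nuc / np.sqrt(r) for r in ranks]

# Explicit truncated SVD for one rank, to confirm tail_error matches ||K - K_r||_F.
r_check = 5
K_r = (U[:, :r_check] * sigma[:r_check]) @ Vt[:r_check]
explicit_error = np.linalg.norm(K - K_r)      # Frobenius norm by default
```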

## Appendix I. Backpropagation for the ITNet Operator

We derive the complete gradient computation for the ITNet operator in discrete form, using the notation established in Appendix [A](https://arxiv.org/html/2606.19538#A1) throughout. This section serves two purposes: (i) to provide a self-contained reference for implementors, and (ii) to make explicit how gradients flow through the content-dependent kernel, a non-standard computation that does not arise in convolution (where the kernel is content-independent) or standard attention (where the kernel has a specific softmax structure). We then show how the general gradient specialises to each classical architecture (CNN, Transformer, RNN) under the kernel restrictions of Theorems [C.1](https://arxiv.org/html/2606.19538#A3.SS1)–[E.1](https://arxiv.org/html/2606.19538#A5.SS1), providing a unified view of backpropagation across all three families.

### I\.1\. Setup and Notation

Let the domain $\Omega$ contain $n$ discrete positions $\{x_1,\ldots,x_n\}\subset\mathbb{R}^s$. Let $u:\Omega\to\mathbb{R}^d$ denote the input signal, with values $u(x_i)\in\mathbb{R}^d$ at each position. Let $\kappa_\theta:\mathbb{R}^s\times\mathbb{R}^s\times\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}^{d\times d}$ denote the learnable kernel (parameterised by $\theta$), $W_\theta\in\mathbb{R}^{d\times d}$ the residual matrix, and $\mu$ the measure on $\Omega$ with weights $\omega_j\coloneqq\mu(\{x_j\})$ (typically $\omega_j=1/n$ for the uniform measure).

We write $u_i\coloneqq u(x_i)$ and $u_j\coloneqq u(x_j)$ as shorthand when the context is clear.

### I\.2\. Forward Pass

The discrete ITNet operator $\mathcal{K}_\theta$ computes, for each query position $x_i$:

$$(\mathcal{K}_\theta[u])(x_i)=\sum_{j=1}^n\omega_j\cdot\kappa_\theta(x_i,x_j,u(x_i),u(x_j))\cdot u(x_j)+W_\theta\,u(x_i).\tag{217}$$

We define the following intermediate quantities:

$$z_{ij}\coloneqq[\gamma(x_i);\,\gamma(x_j);\,\gamma(x_i-x_j);\,\|x_i-x_j\|_2;\,u(x_i);\,u(x_j);\,u(x_i)\odot u(x_j)]\in\mathbb{R}^{6L_f+1+3d},\tag{218}$$

$$K_{ij}\coloneqq\kappa_\theta(x_i,x_j,u(x_i),u(x_j))=\mathrm{reshape}_{d\times d}\bigl(\mathrm{MLP}_\theta(z_{ij})\bigr)\in\mathbb{R}^{d\times d},\tag{219}$$

$$M_{ij}\coloneqq\omega_j\cdot K_{ij}\cdot u(x_j)\in\mathbb{R}^d,\tag{220}$$

where $\gamma:\mathbb{R}^s\to\mathbb{R}^{2L_f}$ is the random Fourier feature map (Eq. ([199](https://arxiv.org/html/2606.19538#A8.E199))) and $\odot$ denotes the Hadamard (elementwise) product. The quantity $M_{ij}$ is the *message* from key position $x_j$ to query position $x_i$: the kernel matrix $K_{ij}$ transforms the key feature $u(x_j)$, weighted by the measure $\omega_j$.

The forward pass is then:

$$\boxed{(\mathcal{K}_\theta[u])(x_i)=\sum_{j=1}^n M_{ij}+W_\theta\,u(x_i).}\tag{221}$$
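The forward pass of Eqs. (217)–(221) can be sketched naively in NumPy as a double loop over query and key positions. This is an illustrative toy under several assumptions that are not taken from the paper: the Fourier map `gamma` is taken to be `[cos(Bx); sin(Bx)]` with random frequencies `B_rff`, the kernel MLP has a single hidden layer, GELU uses the common tanh approximation, and all sizes and weights are arbitrary. It makes no attempt at the tiling or fusion used for efficiency.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, d, L_f, width = 5, 2, 3, 4, 16     # toy sizes (hypothetical)

x = rng.normal(size=(n, s))              # positions x_i
u = rng.normal(size=(n, d))              # features u(x_i)
omega = np.full(n, 1.0 / n)              # uniform measure weights omega_j
B_rff = rng.normal(size=(L_f, s))        # assumed random Fourier frequencies

def gamma(p):
    """Assumed Fourier feature map R^s -> R^{2 L_f}."""
    return np.concatenate([np.cos(B_rff @ p), np.sin(B_rff @ p)])

def gelu(a):
    """tanh approximation of GELU."""
    return 0.5 * a * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (a + 0.044715 * a ** 3)))

def build_z(i, j):
    """Kernel-MLP input z_ij of Eq. (218), dimension 6 L_f + 1 + 3 d."""
    diff = x[i] - x[j]
    return np.concatenate([gamma(x[i]), gamma(x[j]), gamma(diff),
                           [np.linalg.norm(diff)], u[i], u[j], u[i] * u[j]])

z_dim = 6 * L_f + 1 + 3 * d
W1, b1 = rng.normal(size=(width, z_dim)) * 0.1, np.zeros(width)
W2, b2 = rng.normal(size=(d * d, width)) * 0.1, np.zeros(d * d)
W_res = rng.normal(size=(d, d)) * 0.1    # residual matrix W_theta

def kernel(i, j):
    """K_ij = reshape_{d x d}(MLP(z_ij)), Eq. (219)."""
    return (W2 @ gelu(W1 @ build_z(i, j) + b1) + b2).reshape(d, d)

def itnet_forward():
    out = np.zeros((n, d))
    for i in range(n):
        for j in range(n):
            out[i] += omega[j] * kernel(i, j) @ u[j]   # message M_ij, Eq. (220)
        out[i] += W_res @ u[i]                         # residual term of Eq. (221)
    return out

y = itnet_forward()
```

The loop evaluates the kernel MLP once per pair, which is the $n^2$ cost the efficiency techniques in the main text are designed to reduce.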

### I\.3\. Upstream Gradients

Let $\mathcal{L}$ be the scalar loss. The upstream gradient from the subsequent layer is:

$$g_i\coloneqq\frac{\partial\mathcal{L}}{\partial(\mathcal{K}_\theta[u])(x_i)}\in\mathbb{R}^d,\qquad i=1,\ldots,n.\tag{222}$$

The backward pass must compute:

1. (i) $\partial\mathcal{L}/\partial u(x_j)$ for all $j$ (input feature gradients, for backpropagation to earlier layers),
2. (ii) $\partial\mathcal{L}/\partial\theta$ (kernel MLP parameter gradients, for weight updates),
3. (iii) $\partial\mathcal{L}/\partial W_\theta$ (residual matrix gradient).

### I\.4\. Gradient with Respect to Kernel Outputs

From $M_{ij}=\omega_j\cdot K_{ij}\cdot u(x_j)$ and $(\mathcal{K}_\theta[u])(x_i)=\sum_j M_{ij}+W_\theta u(x_i)$, the gradient of $\mathcal{L}$ with respect to each message is $\partial\mathcal{L}/\partial M_{ij}=g_i$. Applying the chain rule for the matrix-vector product $M_{ij}=\omega_j K_{ij}u(x_j)$:

$$\boxed{\frac{\partial\mathcal{L}}{\partial K_{ij}}=\omega_j\cdot g_i\,u(x_j)^\top\;\in\;\mathbb{R}^{d\times d}.}\tag{223}$$

###### Derivation.

The $(a,b)$ entry of $K_{ij}$ contributes to $M_{ij}$ as $[M_{ij}]_a=\omega_j\sum_{b=1}^d[K_{ij}]_{ab}[u(x_j)]_b$. Therefore $\frac{\partial\mathcal{L}}{\partial[K_{ij}]_{ab}}=\omega_j\cdot[g_i]_a\cdot[u(x_j)]_b$, which in matrix form gives Eq. ([223](https://arxiv.org/html/2606.19538#A9.E223)). ∎
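Eq. (223) can be verified with a finite-difference check. In the sketch below the kernel blocks $K_{ij}$ are treated as free variables (no MLP), and the loss is a squared error against a random `target`; both choices are assumptions made only so the check is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
u = rng.normal(size=(n, d))
omega = rng.uniform(0.1, 1.0, size=n)
K = rng.normal(size=(n, n, d, d))        # free kernel blocks K_ij
W = rng.normal(size=(d, d))              # residual matrix
target = rng.normal(size=(n, d))         # hypothetical regression target

def forward(K):
    # out_i = sum_j omega_j K_ij u_j + W u_i
    return np.einsum("j,ijab,jb->ia", omega, K, u) + u @ W.T

def loss(K):
    return 0.5 * np.sum((forward(K) - target) ** 2)

g = forward(K) - target                             # upstream gradient g_i
grad_K = np.einsum("j,ia,jb->ijab", omega, g, u)    # Eq. (223): omega_j g_i u_j^T

# Central finite difference on a single entry [K_ij]_ab.
i, j, a, b, eps = 1, 2, 0, 1, 1e-6
Kp, Km = K.copy(), K.copy()
Kp[i, j, a, b] += eps
Km[i, j, a, b] -= eps
fd = (loss(Kp) - loss(Km)) / (2 * eps)
```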

### I\.5\. Gradient with Respect to Input Features

Each input feature $u(x_j)$ participates in Eq. ([217](https://arxiv.org/html/2606.19538#A9.E217)) in three distinct roles:

1. (a) Value role: $u(x_j)$ appears directly in $M_{ij}=\omega_j K_{ij}u(x_j)$ as the vector being transformed by the kernel. This contributes to *every* output $(\mathcal{K}_\theta[u])(x_i)$ for $i=1,\ldots,n$.
2. (b) Key-side kernel input: $u(x_j)$ appears inside $K_{ij}=\kappa_\theta(x_i,x_j,u(x_i),u(x_j))$ through the kernel MLP's dependence on the key feature. This also contributes to every output.
3. (c) Query-side kernel input and residual: When $j$ acts as the query position, $u(x_j)$ appears in the residual $W_\theta u(x_j)$ and inside $K_{jk}=\kappa_\theta(x_j,x_k,u(x_j),u(x_k))$ as the query feature for all keys $k$.

Combining all contributions:

$$\boxed{\begin{aligned}\frac{\partial\mathcal{L}}{\partial u(x_j)}&=\underbrace{\sum_{i=1}^n\omega_j\cdot K_{ij}^\top\,g_i}_{\text{(a) value role}}+\underbrace{\sum_{i=1}^n\omega_j\cdot\left.\frac{\partial K_{ij}}{\partial u(x_j)}\right|_{\mathrm{key}}^{\!\top}\mathrm{vec}(g_i\,u(x_j)^\top)}_{\text{(b) key-side kernel gradient}}\\&+\underbrace{\sum_{k=1}^n\omega_k\cdot\left.\frac{\partial K_{jk}}{\partial u(x_j)}\right|_{\mathrm{query}}^{\!\top}\mathrm{vec}(g_j\,u(x_k)^\top)}_{\text{(c) query-side kernel gradient}}+\underbrace{W_\theta^\top\,g_j}_{\text{(d) residual}}.\end{aligned}}\tag{224}$$

##### Term (a): Value role.

Holding $K_{ij}$ fixed:

$$\text{Term (a)}=\sum_{i=1}^n\omega_j\cdot K_{ij}^\top\,g_i.\tag{225}$$

##### Term \(b\): Key\-side kernel gradient\.

The kernel $K_{ij}$ depends on $u(x_j)$ through the MLP input $z_{ij}$. Specifically, $u(x_j)$ enters $z_{ij}$ directly as the key feature and through the Hadamard product $u(x_i)\odot u(x_j)$. The Jacobian of the MLP input with respect to the key feature is:

$$\frac{\partial z_{ij}}{\partial u(x_j)}=\begin{bmatrix}\mathbf{0}_{2L_f\times d}\\ \mathbf{0}_{2L_f\times d}\\ \mathbf{0}_{2L_f\times d}\\ \mathbf{0}_{1\times d}\\ \mathbf{0}_{d\times d}\\ \mathbf{I}_d\\ \mathrm{diag}(u(x_i))\end{bmatrix}\in\mathbb{R}^{(6L_f+1+3d)\times d},\tag{226}$$

where the two non-zero blocks are $\mathbf{I}_d$ (from $\partial u(x_j)/\partial u(x_j)$) and $\mathrm{diag}(u(x_i))$ (from $\partial(u(x_i)\odot u(x_j))/\partial u(x_j)$).

By the chain rule:

$$\text{Term (b)}=\sum_{i=1}^n\omega_j\cdot\left(\frac{\partial\mathrm{vec}(K_{ij})}{\partial z_{ij}}\cdot\frac{\partial z_{ij}}{\partial u(x_j)}\right)^\top\mathrm{vec}(g_i\,u(x_j)^\top).\tag{227}$$

##### Term \(c\): Query\-side kernel gradient\.

When position $j$ acts as a query, $u(x_j)$ enters $K_{jk}$ through the query feature slot:

$$\left.\frac{\partial z_{jk}}{\partial u(x_j)}\right|_{\mathrm{query}}=\begin{bmatrix}\mathbf{0}_{2L_f\times d}\\ \mathbf{0}_{2L_f\times d}\\ \mathbf{0}_{2L_f\times d}\\ \mathbf{0}_{1\times d}\\ \mathbf{I}_d\\ \mathbf{0}_{d\times d}\\ \mathrm{diag}(u(x_k))\end{bmatrix}\in\mathbb{R}^{(6L_f+1+3d)\times d}.\tag{228}$$

The contribution is:

$$\text{Term (c)}=\sum_{k=1}^n\omega_k\cdot\left(\frac{\partial\mathrm{vec}(K_{jk})}{\partial z_{jk}}\cdot\frac{\partial z_{jk}}{\partial u(x_j)}\right)^\top\mathrm{vec}(g_j\,u(x_k)^\top).\tag{229}$$

##### Term (d): Residual.

$$\text{Term (d)}=W_\theta^\top\,g_j.$$
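The four-term decomposition of Eq. (224) can be checked against finite differences on a toy problem. The sketch below makes simplifying assumptions that are not from the paper: positional features are dropped so that $z_{ij}=[u_i;\,u_j;\,u_i\odot u_j]$, the kernel MLP has one hidden layer with a `tanh` activation instead of GELU, and the loss is a squared error against a random `target`. The block structure of the key-side and query-side Jacobians mirrors Eqs. (226) and (228) restricted to the feature slots.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 4, 3, 8
u = rng.normal(size=(n, d))
omega = np.full(n, 1.0 / n)
W1 = rng.normal(size=(width, 3 * d)) * 0.5
W2 = rng.normal(size=(d * d, width)) * 0.5
W_res = rng.normal(size=(d, d)) * 0.5
target = rng.normal(size=(n, d))

def z_of(ui, uj):
    return np.concatenate([ui, uj, ui * uj])        # feature-only part of z_ij

def kernel(ui, uj):
    return (W2 @ np.tanh(W1 @ z_of(ui, uj))).reshape(d, d)

def kernel_jac(ui, uj):
    """d vec(K) / d z, shape (d*d, 3d), with row-major vec."""
    pre = W1 @ z_of(ui, uj)
    return W2 @ ((1.0 - np.tanh(pre) ** 2)[:, None] * W1)

def forward(u):
    out = u @ W_res.T
    for i in range(n):
        for j in range(n):
            out[i] += omega[j] * kernel(u[i], u[j]) @ u[j]
    return out

def loss(u):
    return 0.5 * np.sum((forward(u) - target) ** 2)

def grad_u(u):
    g = forward(u) - target
    I, Z = np.eye(d), np.zeros((d, d))
    grad = np.zeros_like(u)
    for j in range(n):
        for i in range(n):
            grad[j] += omega[j] * kernel(u[i], u[j]).T @ g[i]          # term (a)
            J_key = np.vstack([Z, I, np.diag(u[i])])                   # key-side dz/du_j
            v = np.outer(g[i], u[j]).ravel()                           # vec(g_i u_j^T)
            grad[j] += omega[j] * (kernel_jac(u[i], u[j]) @ J_key).T @ v   # term (b)
        for k in range(n):
            J_qry = np.vstack([I, Z, np.diag(u[k])])                   # query-side dz/du_j
            v = np.outer(g[j], u[k]).ravel()                           # vec(g_j u_k^T)
            grad[j] += omega[k] * (kernel_jac(u[j], u[k]) @ J_qry).T @ v   # term (c)
        grad[j] += W_res.T @ g[j]                                      # term (d)
    return grad

analytic = grad_u(u)
eps = 1e-6
numeric = np.zeros_like(u)
for j in range(n):
    for a in range(d):
        up, um = u.copy(), u.copy()
        up[j, a] += eps
        um[j, a] -= eps
        numeric[j, a] = (loss(up) - loss(um)) / (2 * eps)
max_err = np.max(np.abs(analytic - numeric))
```

Note that the diagonal pair $i=j$ is handled by terms (b) and (c) together: each contributes one copy of $\mathrm{diag}(u(x_j))$ for the Hadamard slot.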

### I\.6\. Gradient with Respect to Kernel MLP Parameters

The kernel MLP parameters $\theta$ are shared across all $(i,j)$ pairs:

$$\boxed{\frac{\partial\mathcal{L}}{\partial\theta}=\sum_{i=1}^n\sum_{j=1}^n\mathrm{tr}\!\left(\omega_j\cdot u(x_j)\,g_i^\top\cdot\frac{\partial K_{ij}}{\partial\theta}\right).}\tag{230}$$

Let the kernel MLP have $\ell_\kappa$ layers with GELU activation $\sigma$:

$$h^{(0)}=z_{ij},\tag{231}$$

$$h^{(l)}=\sigma\!\bigl(W_l\,h^{(l-1)}+b_l\bigr),\qquad l=1,\ldots,\ell_\kappa-1,\tag{232}$$

$$\mathrm{vec}(K_{ij})=W_{\ell_\kappa}\,h^{(\ell_\kappa-1)}+b_{\ell_\kappa},\tag{233}$$

with $W_1\in\mathbb{R}^{w_\kappa\times(6L_f+1+3d)}$, $W_l\in\mathbb{R}^{w_\kappa\times w_\kappa}$ for interior layers, and $W_{\ell_\kappa}\in\mathbb{R}^{d^2\times w_\kappa}$.

The GELU derivative is:

$$\sigma'_{\mathrm{GELU}}(x)=\Phi(x)+x\,\phi(x),\qquad\Phi(x)=\tfrac{1}{2}[1+\mathrm{erf}(x/\sqrt{2})],\quad\phi(x)=(2\pi)^{-1/2}e^{-x^2/2}.\tag{234}$$

Define the output-layer error signal:

$$\delta^{(\ell_\kappa)}=\omega_j\cdot\mathrm{vec}(g_i\,u(x_j)^\top)\in\mathbb{R}^{d^2}.\tag{235}$$

Backpropagate through interior layers:

$$\delta^{(l)}=\bigl(W_{l+1}^\top\,\delta^{(l+1)}\bigr)\odot\sigma'\!\bigl(W_l h^{(l-1)}+b_l\bigr),\qquad l=\ell_\kappa-1,\ldots,1.\tag{236}$$

The parameter gradients for a single pair $(i,j)$ are:

$$\left.\frac{\partial\mathcal{L}}{\partial W_l}\right|_{(i,j)}=\delta^{(l)}\bigl(h^{(l-1)}\bigr)^\top,\qquad\left.\frac{\partial\mathcal{L}}{\partial b_l}\right|_{(i,j)}=\delta^{(l)}.\tag{237}$$

Accumulated over all $n^2$ pairs:

$$\frac{\partial\mathcal{L}}{\partial W_l}=\sum_{i=1}^n\sum_{j=1}^n\delta^{(l)}_{ij}\bigl(h^{(l-1)}_{ij}\bigr)^\top,\qquad\frac{\partial\mathcal{L}}{\partial b_l}=\sum_{i=1}^n\sum_{j=1}^n\delta^{(l)}_{ij}.\tag{238}$$
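The per-pair recursion of Eqs. (234)–(237) is ordinary MLP backpropagation with the error signal of Eq. (235) injected at the output. Below is a minimal sketch for a single pair, assuming a three-layer kernel MLP with arbitrary toy sizes and a random vector `delta_out` standing in for $\omega_j\,\mathrm{vec}(g_i\,u(x_j)^\top)$; one weight entry is compared against a central finite difference.

```python
import numpy as np
from math import erf, sqrt, pi

rng = np.random.default_rng(0)
z_dim, width, d = 7, 6, 2                                   # toy sizes (hypothetical)
Ws = [rng.normal(size=(width, z_dim)) * 0.5,
      rng.normal(size=(width, width)) * 0.5,
      rng.normal(size=(d * d, width)) * 0.5]                # ell_kappa = 3 layers
bs = [rng.normal(size=width) * 0.1, rng.normal(size=width) * 0.1, np.zeros(d * d)]

Phi = np.vectorize(lambda a: 0.5 * (1.0 + erf(a / sqrt(2.0))))   # Gaussian CDF
phi = lambda a: np.exp(-a ** 2 / 2.0) / sqrt(2.0 * pi)           # Gaussian PDF
gelu = lambda a: a * Phi(a)
gelu_prime = lambda a: Phi(a) + a * phi(a)                       # Eq. (234)

def mlp_forward(z, Ws):
    """Return vec(K_ij), hidden states h^(l), and pre-activations."""
    hs, pres = [z], []
    for W, b in zip(Ws[:-1], bs[:-1]):
        pres.append(W @ hs[-1] + b)
        hs.append(gelu(pres[-1]))
    return Ws[-1] @ hs[-1] + bs[-1], hs, pres

def mlp_backward(delta_out, hs, pres):
    """Per-pair weight gradients via Eqs. (236)-(237)."""
    grads, delta = [None] * len(Ws), delta_out
    grads[-1] = np.outer(delta, hs[-1])
    for l in range(len(Ws) - 2, -1, -1):
        delta = (Ws[l + 1].T @ delta) * gelu_prime(pres[l])
        grads[l] = np.outer(delta, hs[l])
    return grads

z = rng.normal(size=z_dim)
delta_out = rng.normal(size=d * d)       # stands in for omega_j vec(g_i u_j^T)
out, hs, pres = mlp_forward(z, Ws)
grads = mlp_backward(delta_out, hs, pres)

# Finite-difference check on one entry of W_1; the scalar being differentiated
# is <delta_out, vec(K_ij)>, i.e. the loss linearised around the kernel output.
eps = 1e-6
Wp = [W.copy() for W in Ws]
Wm = [W.copy() for W in Ws]
Wp[0][1, 2] += eps
Wm[0][1, 2] -= eps
fd = (delta_out @ mlp_forward(z, Wp)[0] - delta_out @ mlp_forward(z, Wm)[0]) / (2 * eps)
```

Summing `grads` over all $(i,j)$ pairs gives the accumulated gradient of Eq. (238).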

### I\.7\. Gradient with Respect to Residual Matrix

$$\boxed{\frac{\partial\mathcal{L}}{\partial W_\theta}=\sum_{i=1}^n g_i\,u(x_i)^\top\in\mathbb{R}^{d\times d}.}\tag{239}$$

This is identical to the gradient of a standard linear layer and requires no kernel\-specific computation\.

### I.8. Special Case 1: Convolution (Theorem [C.1](https://arxiv.org/html/2606.19538#A3.SS1))

Under the convolutional kernel $\kappa_\theta(x_i,x_j,u(x_i),u(x_j))=w(x_i-x_j)\cdot\mathbf{I}_d$ (Theorem [C.1](https://arxiv.org/html/2606.19538#A3.SS1)), the kernel is a scalar function of displacement multiplied by the identity matrix. The forward pass reduces to $(\mathcal{K}_\theta[u])(x_i)=(w\ast u)(x_i)+W_\theta u(x_i)$.

Gradient with respect to the scalar filter value. Since $K_{ij}=w(x_i-x_j)\cdot\mathbf{I}_d$ is constrained to scalar multiples of $\mathbf{I}_d$, only the trace component of $\partial\mathcal{L}/\partial K_{ij}$ is free. Defining $w_{ij}\coloneqq w(x_i-x_j)$:

$$\frac{\partial\mathcal{L}}{\partial w_{ij}}=\mathrm{tr}\!\left(\frac{\partial\mathcal{L}}{\partial K_{ij}}\right)=\omega_j\cdot g_i^\top u(x_j)=\omega_j\cdot\langle g_i,\,u(x_j)\rangle.\tag{240}$$

This is the standard convolution filter gradient: the cross-correlation between the upstream gradient and the input. For a $k\times k$ filter with coefficients $\{f_m\}_{m\in\mathcal{N}}$, summing over all pairs with displacement $m$:

$$\frac{\partial\mathcal{L}}{\partial f_m}=\sum_{i=1}^n\omega_j\cdot g_i^\top u(x_i-mh),\qquad m\in\mathcal{N},\tag{241}$$

where $\omega_j$ is the weight of the key position $x_j=x_i-mh$ paired with $x_i$. This is exactly the computation performed by `torch.nn.Conv2d`'s backward pass.

Gradient with respect to $u(x_j)$. Since the convolutional kernel is content-independent, Terms (b) and (c) in Eq. ([224](https://arxiv.org/html/2606.19538#A9.E224)) vanish:

$$\frac{\partial\mathcal{L}}{\partial u(x_j)}=\sum_{i=1}^n\omega_j\cdot w(x_i-x_j)\cdot g_i+W_\theta^\top g_j=\omega_j\cdot(w^{\ast}\ast g)(x_j)+W_\theta^\top g_j,\tag{242}$$

where $w^{\ast}(x)\coloneqq w(-x)$ is the flipped filter. This is the standard transposed convolution used in CNN backward passes.

The convolution special case eliminates $n^2$ MLP backward passes entirely. Backward cost reduces from $O(n^2d^2+n^2w_\kappa^2\ell_\kappa)$ to $O(n|\mathcal{N}|d)$, recovering the known linear-in-$n$ cost.
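The two convolutional gradients above (Eqs. (241) and (242)) can be illustrated on a 1-D periodic grid. This is a sketch under assumptions chosen for brevity: a three-tap filter with offsets $\{-1,0,1\}$, circular boundary handling via `np.roll`, a uniform weight $\omega_j=1/n$, no residual term ($W_\theta=0$), and a squared-error loss against a random `target`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
offsets = [-1, 0, 1]                      # neighbourhood N (hypothetical 3-tap filter)
f = rng.normal(size=len(offsets))         # filter coefficients f_m
u = rng.normal(size=(n, d))
target = rng.normal(size=(n, d))
w = 1.0 / n                               # uniform measure weight omega_j

def forward(f, u):
    # out_i = sum_m omega * f_m * u_{i-m}; note np.roll(u, m)[i] == u[(i - m) % n]
    return sum(w * fm * np.roll(u, m, axis=0) for fm, m in zip(f, offsets))

def loss(f, u):
    return 0.5 * np.sum((forward(f, u) - target) ** 2)

g = forward(f, u) - target                # upstream gradient g_i

# Eq. (241): filter gradient is the cross-correlation of g with the shifted input.
grad_f = np.array([w * np.sum(g * np.roll(u, m, axis=0)) for m in offsets])

# Eq. (242) with W_theta = 0: input gradient is g convolved with the flipped filter.
grad_u = sum(w * fm * np.roll(g, -m, axis=0) for fm, m in zip(f, offsets))

# Finite-difference checks.
eps = 1e-6
fd_f = np.array([(loss(f + eps * e, u) - loss(f - eps * e, u)) / (2 * eps)
                 for e in np.eye(len(offsets))])
du = np.zeros_like(u)
du[2, 1] = eps
fd_u = (loss(f, u + du) - loss(f, u - du)) / (2 * eps)
```

Both gradients cost $O(n|\mathcal{N}|d)$ here, with no per-pair kernel evaluation.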

### I.9. Special Case 2: Self-Attention (Theorem [D.1](https://arxiv.org/html/2606.19538#A4.SS1))

The attention kernel is $K_{ij}=\alpha(x_i,x_j)\cdot W_V$, where $\alpha(x_i,x_j)=\exp(Q(x_i)^\top K(x_j)/\sqrt{d_k})/Z(x_i)$ with softmax normaliser $Z(x_i)$, $Q(x_i)=W_Q u(x_i)$, and $K(x_j)=W_K u(x_j)$. Here $K(\cdot)$ denotes the key projection, not the kernel block $K_{ij}$.

Since $K_{ij}=\alpha_{ij}W_V$ with $W_V$ fixed, the gradient projects onto the scalar $\alpha_{ij}$:

$$\frac{\partial\mathcal{L}}{\partial\alpha_{ij}}=\omega_j\cdot g_i^\top W_V u(x_j).\tag{243}$$

The softmax Jacobian relates $\partial\mathcal{L}/\partial\alpha_{ij}$ to the gradient with respect to the logit $e_{ij}\coloneqq Q(x_i)^\top K(x_j)/\sqrt{d_k}$:

$$\frac{\partial\mathcal{L}}{\partial e_{ij}}=\alpha_{ij}\left(\frac{\partial\mathcal{L}}{\partial\alpha_{ij}}-\sum_{l=1}^n\alpha_{il}\frac{\partial\mathcal{L}}{\partial\alpha_{il}}\right).\tag{244}$$
Gradient with respect to $u(x_j)$. Mapping to ITNet's general gradient:

$$\text{Term (a) [value]:}\quad\sum_i\omega_j\cdot\alpha_{ij}\cdot W_V^\top g_i,\tag{245}$$

$$\text{Term (b) [key]:}\quad W_K^\top\sum_i\frac{\partial\mathcal{L}}{\partial e_{ij}}\cdot\frac{Q(x_i)}{\sqrt{d_k}},\tag{246}$$

$$\text{Term (c) [query]:}\quad W_Q^\top\sum_k\frac{\partial\mathcal{L}}{\partial e_{jk}}\cdot\frac{K(x_k)}{\sqrt{d_k}},\tag{247}$$

$$\text{Term (d):}\quad W_\theta^\top g_j\quad(=0\text{ since }W_\theta=0\text{ in attention}).\tag{248}$$

This recovers the standard Transformer backward pass: $\nabla_V\mathcal{L}=\alpha^\top G$, $\nabla_K\mathcal{L}=(\partial\mathcal{L}/\partial E)^\top Q/\sqrt{d_k}$, $\nabla_Q\mathcal{L}=(\partial\mathcal{L}/\partial E)K/\sqrt{d_k}$.

The attention case replaces the MLP Jacobian with the softmax Jacobian ($\mathrm{diag}(\alpha)-\alpha\alpha^\top$, rank-1 corrected, $O(n)$ per row). No kernel MLP parameter gradient exists. Backward cost: $O(n^2d)$.
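The mapping of Terms (a)–(c) onto attention can be checked numerically. The sketch below is a single-head toy with assumed sizes and random projection matrices, a uniform measure $\omega_j=1/n$ applied outside the softmax as in Eq. (243), and a squared-error loss against a random `target`; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dk = 5, 4, 3
u = rng.normal(size=(n, d))
WQ, WK = rng.normal(size=(dk, d)), rng.normal(size=(dk, d))
WV = rng.normal(size=(d, d))
omega = np.full(n, 1.0 / n)
target = rng.normal(size=(n, d))

def forward(u):
    Q, Kmat = u @ WQ.T, u @ WK.T                     # rows Q(x_i), K(x_j)
    E = Q @ Kmat.T / np.sqrt(dk)                     # logits e_ij
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                # alpha_ij (row-wise softmax)
    out = (A * omega) @ (u @ WV.T)                   # sum_j omega_j alpha_ij W_V u_j
    return out, Q, Kmat, A

def loss(u):
    return 0.5 * np.sum((forward(u)[0] - target) ** 2)

out, Q, Kmat, A = forward(u)
G = out - target                                     # rows g_i
dA = (G @ WV @ u.T) * omega                          # Eq. (243)
dE = A * (dA - np.sum(A * dA, axis=1, keepdims=True))  # Eq. (244)

term_a = (A * omega).T @ G @ WV                      # Eq. (245), rows indexed by j
term_b = (dE.T @ Q / np.sqrt(dk)) @ WK               # Eq. (246)
term_c = (dE @ Kmat / np.sqrt(dk)) @ WQ              # Eq. (247)
grad_u = term_a + term_b + term_c

eps = 1e-6
numeric = np.zeros_like(u)
for j in range(n):
    for a in range(d):
        up, um = u.copy(), u.copy()
        up[j, a] += eps
        um[j, a] -= eps
        numeric[j, a] = (loss(up) - loss(um)) / (2 * eps)
max_err = np.max(np.abs(grad_u - numeric))
```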

### I.10. Special Case 3: Linear SSM / S4 (Theorem [E.1](https://arxiv.org/html/2606.19538#A5.SS1)(c))

Under the SSM kernel $K_{ij}=\mathbf{1}_{j\leq i}\cdot Ce^{A(t_i-t_j)}B$ (content-independent, causal), the forward pass is:

$$(\mathcal{K}_\theta[u])(t_i)=C\underbrace{\sum_{j=1}^i\omega_j e^{A(t_i-t_j)}B\,u(t_j)}_{h(t_i)}+D\,u(t_i).\tag{249}$$

Gradient with respect to $u(t_j)$. Since $K_{ij}$ is content-independent, Terms (b) and (c) vanish. Defining $g_i^h\coloneqq C^\top g_i$:

$$\frac{\partial\mathcal{L}}{\partial u(t_j)}=B^\top\sum_{i=j}^n\omega_j\cdot e^{A^\top(t_i-t_j)}g_i^h+D^\top g_j.\tag{250}$$

This is *backpropagation through time* (BPTT): the gradient at time $t_j$ sums future gradients propagated backward through the transposed state-transition matrix $e^{A^\top\Delta t}$. If $\|e^{A\Delta t}\|_{\mathrm{op}}<1$ (stable dynamics), gradients decay exponentially, which is the vanishing gradient problem.

Gradient with respect to SSM parameters:

$$\frac{\partial\mathcal{L}}{\partial C}=\sum_{i=1}^n g_i\,h(t_i)^\top,\tag{251}$$

$$\frac{\partial\mathcal{L}}{\partial B}=\sum_{i=1}^n\sum_{j=1}^i\omega_j\cdot e^{A^\top(t_i-t_j)}C^\top g_i\,u(t_j)^\top,\tag{252}$$

$$\frac{\partial\mathcal{L}}{\partial A}=\sum_{i=1}^n\sum_{j=1}^i\omega_j(t_i-t_j)\cdot e^{A^\top(t_i-t_j)}C^\top g_i\,u(t_j)^\top B^\top.\tag{253}$$

Eq. (253) differentiates $e^{A(t_i-t_j)}$ as if $A$ commuted with its perturbation. This is exact for diagonal $A$ (as in diagonal SSMs, where only the diagonal entries of the expression are used); for a general dense $A$ the derivative of the matrix exponential involves an additional integral.
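The SSM gradients of Eqs. (250)–(253) can be checked numerically in the diagonal case. The sketch below assumes a diagonal, stable $A$ stored as a vector `a` (so the matrix exponential is elementwise and the $A$-gradient is the diagonal of Eq. (253)), uniform time steps, a constant weight $\omega_j$, and a squared-error loss against a random `target`; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 6, 2, 3                         # time steps, feature dim, state dim
t = np.arange(n) * 0.5
a = -rng.uniform(0.2, 1.0, size=N)        # diagonal of A (assumed diagonal, stable)
B = rng.normal(size=(N, d))
C = rng.normal(size=(d, N))
D = rng.normal(size=(d, d))
u = rng.normal(size=(n, d))
omega = np.full(n, 0.5)
target = rng.normal(size=(n, d))

def forward(a, B, C, u):
    out = u @ D.T
    for i in range(n):
        for j in range(i + 1):            # causal: only j <= i contributes
            out[i] += omega[j] * C @ (np.exp(a * (t[i] - t[j])) * (B @ u[j]))
    return out

def loss(a, B, C, u):
    return 0.5 * np.sum((forward(a, B, C, u) - target) ** 2)

g = forward(a, B, C, u) - target
gh = g @ C                                # rows g_i^h = C^T g_i
h = np.zeros((n, N))                      # hidden states h(t_i)
grad_u = g @ D                            # rows D^T g_j
grad_B = np.zeros_like(B)
grad_a = np.zeros_like(a)
for i in range(n):
    for j in range(i + 1):
        decay = np.exp(a * (t[i] - t[j]))             # diagonal of e^{A (t_i - t_j)}
        h[i] += omega[j] * decay * (B @ u[j])
        back = omega[j] * decay * gh[i]               # omega_j e^{A^T dt} C^T g_i
        grad_u[j] += B.T @ back                       # Eq. (250)
        grad_B += np.outer(back, u[j])                # Eq. (252)
        grad_a += (t[i] - t[j]) * back * (B @ u[j])   # diagonal of Eq. (253)
grad_C = g.T @ h                                      # Eq. (251)

eps = 1e-6
da = np.zeros(N); da[1] = eps
dB = np.zeros_like(B); dB[2, 1] = eps
dC = np.zeros_like(C); dC[0, 2] = eps
du = np.zeros_like(u); du[1, 0] = eps
fd_a = (loss(a + da, B, C, u) - loss(a - da, B, C, u)) / (2 * eps)
fd_B = (loss(a, B + dB, C, u) - loss(a, B - dB, C, u)) / (2 * eps)
fd_C = (loss(a, B, C + dC, u) - loss(a, B, C - dC, u)) / (2 * eps)
fd_u = (loss(a, B, C, u + du) - loss(a, B, C, u - du)) / (2 * eps)
```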

### I.11. Special Case 4: LSTM (Theorem [E.1](https://arxiv.org/html/2606.19538#A5.SS1)(b))

The LSTM kernel is content-dependent: $K_{ij}$ depends on $u(t_j)$ through the gates $f_\tau,i_s,o_t$ (Appendix [E](https://arxiv.org/html/2606.19538#A5)). Terms (b) and (c) are therefore *non-zero* and account for gradient flow through gate activations.

The cell-state gradient satisfies the classical LSTM BPTT relation:

$$\frac{\partial\mathcal{L}}{\partial c_{t-1}}=\frac{\partial\mathcal{L}}{\partial c_t}\odot f_t+\frac{\partial\mathcal{L}}{\partial h_t}\odot o_t\odot\mathrm{sech}^2(c_t)\odot f_t,\tag{254}$$

where the forget gate $f_t\in(0,1)^d$ controls gradient flow through the cell state: when $f_t\approx1$, the gradient passes nearly unchanged; when $f_t\approx0$, the gradient is blocked.

In ITNet's framework, this gating is encoded in the content-dependent kernel: $K_{ij}$ contains $\mathrm{diag}(\prod_{\tau=j+1}^i f_\tau)$, and the gradient through this product is exactly Eq. ([254](https://arxiv.org/html/2606.19538#A9.E254)). The gate gradients, taken through the sigmoid (that is, with respect to each gate's pre-activation), are:

$$\frac{\partial\mathcal{L}}{\partial f_t}=\frac{\partial\mathcal{L}}{\partial c_t}\odot c_{t-1}\odot\sigma'(W_f[h_{t-1};u_t]+b_f),\tag{255}$$

$$\frac{\partial\mathcal{L}}{\partial i_t}=\frac{\partial\mathcal{L}}{\partial c_t}\odot\tilde{c}_t\odot\sigma'(W_i[h_{t-1};u_t]+b_i),\tag{256}$$

$$\frac{\partial\mathcal{L}}{\partial o_t}=\frac{\partial\mathcal{L}}{\partial h_t}\odot\tanh(c_t)\odot\sigma'(W_o[h_{t-1};u_t]+b_o),\tag{257}$$

where $\sigma'(x)=\sigma(x)(1-\sigma(x))$ is the sigmoid derivative. These correspond to Terms (b) and (c) of the general ITNet gradient, specialized to the block-matrix LSTM kernel.

### I.12. Special Case 5: Mamba (Theorem [E.1](https://arxiv.org/html/2606.19538#A5.SS1)(d))

The Mamba kernel $K_{ij}=\mathbf{1}_{j\leq i}\cdot C(u(t_i))\prod_{\tau=j+1}^i\bar{A}(u(t_\tau))\cdot\bar{B}(u(t_j))$ is content-dependent through all three components $C,\bar{A},\bar{B}$.

All three contribute non-zero gradients via Terms (b) and (c):

$$\text{Through }\bar{B}(u(t_j)):\quad\left.\frac{\partial K_{ij}}{\partial u(t_j)}\right|_{\bar{B}}=C(u_i)\prod_\tau\bar{A}(u_\tau)\cdot\frac{\partial(\Delta(u_j)W_B u_j)}{\partial u(t_j)},\tag{258}$$

$$\text{Through }C(u(t_i)):\quad\left.\frac{\partial K_{ij}}{\partial u(t_i)}\right|_{C}=W_C^\top\cdot\prod_\tau\bar{A}(u_\tau)\cdot\bar{B}(u_j),\tag{259}$$

$$\text{Through }\bar{A}(u(t_\tau)):\quad\text{chain rule through }\prod_\tau\exp(\Delta(u_\tau)A).\tag{260}$$

In Eq. (259) the derivative is taken of $K_{ij}$, in which $t_i$ is the query position and $C$ is evaluated at $u(t_i)$. The gradient through the $\bar{A}$ product (Eq. ([260](https://arxiv.org/html/2606.19538#A9.E260))) is Mamba's analogue of BPTT: it propagates gradients through the selectivity mechanism $\Delta(u_\tau)=\mathrm{softplus}(W_\Delta u_\tau+b_\Delta)$, which controls how much each time step contributes to the hidden state. This is the content-dependent generalisation of the linear SSM gradient (where $\bar{A}$ is constant and the product becomes $\bar{A}^{i-j}$).

### I\.13\. Complexity Comparison Across Special Cases

Table 13: Backward pass complexity for ITNet and its special cases.

The table confirms that ITNet's general backward pass is strictly more expensive than any special case, with additional cost from: (i) $d^2$ versus $1$ for the kernel gradient (matrix vs. scalar kernels), and (ii) $n^2w_\kappa^2\ell_\kappa$ for the kernel MLP parameter gradient (absent in all classical architectures). The tiled recomputation strategy (Algorithm [2](https://arxiv.org/html/2606.19538#alg2)) amortizes the memory cost. On an H200 GPU with tiled execution at $n=512$, $d=768$, $w_\kappa=128$, $\ell_\kappa=2$, the measured backward-to-forward time ratio is $2.8\times$, close to the theoretical $3\times$. The overall training iteration (forward + backward + optimizer) is $\approx3.8\times$ the forward-only time, compared to $\approx3.0\times$ for a Transformer layer with FlashAttention-2.

## Appendix J. Extended Encoder Details

### J\.1\. Graph Encoder

Graphs $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with node features $\{h_v\}_{v\in\mathcal{V}}\subset\mathbb{R}^c$ are a natural fit for ITNet: the set of nodes is an irregular domain, and edges define a sparse connectivity structure that need not be hard-coded into the kernel. Node features are linearly embedded:

$$u^{(0)}(v)=W_{\mathrm{gph}}\,h_v+b_{\mathrm{gph}}\in\mathbb{R}^d,\qquad v\in\mathcal{V}.\tag{261}$$

Since graphs may lack Euclidean coordinates, we generate structural positions from the first $s=8$ eigenvectors of the normalized graph Laplacian $\Delta=I-D^{-1/2}AD^{-1/2}$ [[29](https://arxiv.org/html/2606.19538#bib.bib57)]:

$$x_v=[\phi_1(v);\,\phi_2(v);\,\ldots;\,\phi_s(v)]\in\mathbb{R}^s,\tag{262}$$

where $\phi_i$ is the $i$-th Laplacian eigenvector. These Laplacian positional encodings (LPE) capture graph topology (nodes that are structurally similar receive similar position vectors) and are fed into the kernel via $\gamma(x_v)$.
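The structural positions of Eq. (262) can be computed with a dense eigendecomposition for small graphs. The following is a minimal sketch; the helper name `laplacian_positions`, the choice to keep the eigenvectors with the $s$ smallest eigenvalues (including the trivial one), and the 12-node cycle used as input are all assumptions for illustration. Eigenvector signs are arbitrary, so a practical pipeline would typically need some sign convention or augmentation.

```python
import numpy as np

def laplacian_positions(adj, s):
    """Eigenpairs of I - D^{-1/2} A D^{-1/2} for the s smallest eigenvalues."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)      # symmetric solver, ascending order
    return eigvals[:s], eigvecs[:, :s]          # row v of eigvecs[:, :s] is x_v

# Hypothetical example graph: an undirected 12-node cycle.
n = 12
adj = np.zeros((n, n))
for v in range(n):
    adj[v, (v + 1) % n] = adj[(v + 1) % n, v] = 1.0

vals, pos = laplacian_positions(adj, s=8)       # pos has one 8-dim position per node
```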

Optionally, the kernel can be restricted to connected pairs:

$$\kappa_\theta^{\mathrm{graph}}(x_v,x_u,u_v,u_u)=(\mathbf{1}_{(v,u)\in\mathcal{E}}+\lambda)\cdot\kappa_\theta(x_v,x_u,u_v,u_u),\tag{263}$$

where $\lambda\geq0$ is a learned global parameter controlling non-edge long-range integration. At $\lambda=0$, ITNet reduces to graph convolution over the edge set. At $\lambda>0$, it allows global message passing beyond graph topology.

Domain specification: $\Omega_{\mathrm{gph}}=\mathcal{V}$, $d\mu=\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}\delta_v$, $s=8$.

### J\.2\. Multi\-Scale Image Encoder

For dense prediction tasks (detection, segmentation), we apply the image encoder at two scales: $P\in\{8,16\}$, yielding $N_p^{(s)}\in\{784,196\}$ patches per scale for a $224\times224$ image. The two sets of positions are concatenated and processed jointly by the ITNet layers:

$$\Omega_{\mathrm{multi}}=\Omega_{\mathrm{img}}^{P=8}\cup\Omega_{\mathrm{img}}^{P=16},\qquad|\Omega_{\mathrm{multi}}|=784+196=980.\tag{264}$$

This exploits the kernel's ability to integrate across irregular, mixed-resolution domains, an operation that standard convolution or self-attention must handle with explicit multi-scale fusion modules [[61](https://arxiv.org/html/2606.19538#bib.bib26)].
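The position counts in Eq. (264) follow from $(224/8)^2=784$ and $(224/16)^2=196$. The sketch below builds the joint position set, assuming (this is not stated in the text) that each patch is represented by its centre in a shared pixel coordinate frame; `patch_centres` is a hypothetical helper.

```python
import numpy as np

def patch_centres(image_size, patch):
    """Pixel-space centres of non-overlapping patch x patch tiles."""
    k = image_size // patch
    c = (np.arange(k) + 0.5) * patch
    yy, xx = np.meshgrid(c, c, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=1)

pos_fine = patch_centres(224, 8)        # 28 x 28 = 784 positions
pos_coarse = patch_centres(224, 16)     # 14 x 14 = 196 positions
omega_multi = np.concatenate([pos_fine, pos_coarse], axis=0)   # joint domain
```

Because both scales live in the same coordinate frame, a position-aware kernel can relate a fine patch to the coarse patch that contains it without a separate fusion step.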

### J\.3\. Design Principles

All encoders \(image, text, point cloud, graph, multimodal\) share three structural design principles:

Principle 1: Positions are raw coordinates\.The positional argument to the kernel is the raw domain coordinate \(pixel grid for images, scalar index for text, 3\-D coordinate for point clouds, Laplacian eigenvector for graphs\)\. We do not apply learnable positional embeddings before passing positions to the kernel; instead, the kernel MLP learns its own position\-sensitive functions\. The key advantage is that the kernel can compute arbitrary functions of position, including relative positions, distances, and angles, without these being baked into the feature space\.

Principle 2: Features are modality\-agnostic after encoding\.After the modality\-specific linear projection, all features lie in the sameℝd\\mathbb\{R\}^\{d\}space\. The ITNet layers, the kernel MLP, and the task\-specific decoder treat features identically regardless of origin modality\. This enables multimodal processing without cross\-modal adapter modules\. We emphasise that the modality\-specific encoders are modality\-specific design choices; our claim is that the*core operator*is shared and identical across all modalities\.

Principle 3: The measure encodes modality prior. The measure $\mu$ determines how each position contributes to integration. By adjusting $\mu$ (e.g., the balanced multimodal measure), we encode prior beliefs about relative importance without modifying the kernel. This separation of concerns (measure for importance, kernel for content-dependent weighting) is a property of the integral operator formalism that has no direct analogue in attention or convolution.
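To make the separation between measure and kernel in Principle 3 concrete, the following NumPy sketch applies the same kernel under two different measures. Several things here are assumptions rather than the paper's definitions: the kernel is scalar-valued and random (a stand-in for the kernel MLP output), the "balanced" measure is taken to mean that each modality carries half of the total mass, and the text length of 32 tokens is hypothetical.

```python
import numpy as np

def uniform_measure(n_img, n_txt):
    """Every position gets the same mass, so the larger modality dominates."""
    n = n_img + n_txt
    return np.full(n, 1.0 / n)

def balanced_measure(n_img, n_txt):
    """Assumed form: each modality carries half of the total mass."""
    return np.concatenate([
        np.full(n_img, 0.5 / n_img),
        np.full(n_txt, 0.5 / n_txt),
    ])

def integrate(kernel, features, mu):
    """Discretised operator: out[i] = sum_j kernel[i, j] * mu[j] * features[j]."""
    return (kernel * mu[None, :]) @ features

n_img, n_txt, d = 196, 32, 8   # 32 text tokens is a hypothetical length
n = n_img + n_txt
rng = np.random.default_rng(0)
kernel = rng.standard_normal((n, n))     # stand-in for the kernel MLP output
features = rng.standard_normal((n, d))

mu_uniform = uniform_measure(n_img, n_txt)
mu_balanced = balanced_measure(n_img, n_txt)

# Same kernel, different measure: only the weighting of positions changes.
out_uniform = integrate(kernel, features, mu_uniform)
out_balanced = integrate(kernel, features, mu_balanced)

print("image share (uniform): ", mu_uniform[:n_img].sum())
print("image share (balanced):", mu_balanced[:n_img].sum())
```

Under the uniform measure the image positions hold 196/228 of the mass, which illustrates how a larger modality can dominate the integral without any change to the kernel.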

## Appendix K Training Details

### K.1. ImageNet-1K \[[80](https://arxiv.org/html/2606.19538#bib.bib35)\]

We provide the detailed training configuration for ImageNet-1K in Table [14](https://arxiv.org/html/2606.19538#A11.T14). Our setup follows standard large-scale training practices with AdamW optimization, cosine learning rate decay, and strong data augmentation (RandAugment \[[19](https://arxiv.org/html/2606.19538#bib.bib37)\], Mixup \[[103](https://arxiv.org/html/2606.19538#bib.bib38)\], CutMix \[[101](https://arxiv.org/html/2606.19538#bib.bib39)\]). We use a patch size of $16$ at $224 \times 224$ resolution and train for 300 epochs with large-batch distributed training. Regularization techniques such as label smoothing, stochastic depth (drop path), and exponential moving average (EMA) are employed to stabilise training. All hyperparameters are chosen to ensure a fair and consistent comparison with prior work.

Table 14: ImageNet-1K training hyperparameters.

### K.2. GLUE \[[94](https://arxiv.org/html/2606.19538#bib.bib44)\] Pre-training and Fine-tuning

We follow the BERT pre-training protocol exactly. Data: BookCorpus + English Wikipedia ($\sim$16 GB of text). Tokeniser: WordPiece with 30K vocabulary. Masking: 15% of tokens, of which 80% are replaced with \[MASK\], 10% with a random token, and 10% unchanged. Sequence length: 128 for the first 250K steps, then 512 for the next 250K steps. Batch size: 256. Total training: 500K steps. Learning rate: $1 \times 10^{-4}$ with linear warmup over 10K steps and linear decay. Weight decay: 0.01.
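The masking rule above (select 15% of tokens; of those, 80% become \[MASK\], 10% a random token, 10% unchanged) can be sketched in plain Python. This is an illustrative sketch only: the function `mask_tokens`, the `mask_id` value, and the exact vocabulary size are hypothetical placeholders, and practical implementations usually also exclude special tokens from selection, which is omitted here.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, rng, select_prob=0.15):
    """Return (inputs, labels) for masked-language-model training.

    labels[i] is the original token at selected positions and None elsewhere.
    """
    inputs = list(token_ids)
    labels = [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue                      # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id           # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels

rng = random.Random(0)
VOCAB_SIZE = 30000   # placeholder for the "30K" WordPiece vocabulary
MASK_ID = 103        # hypothetical id for [MASK]
tokens = [rng.randrange(1000, VOCAB_SIZE) for _ in range(20000)]
inputs, labels = mask_tokens(tokens, VOCAB_SIZE, MASK_ID, rng)

selected = sum(label is not None for label in labels)
print("fraction selected:", selected / len(tokens))
```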

Each GLUE \[[94](https://arxiv.org/html/2606.19538#bib.bib44)\] task is fine-tuned independently with learning rate $\in \{1, 2, 3, 5\} \times 10^{-5}$ (selected via dev-set performance), batch size 32, and 10 epochs. We report the best dev-set score across learning rates. No task-specific architectural modifications are made.

### K.3. ModelNet40 \[[97](https://arxiv.org/html/2606.19538#bib.bib46)\]

We summarise the training configuration for ModelNet40 in Table [15](https://arxiv.org/html/2606.19538#A11.T15). We follow standard point cloud classification settings with AdamW optimization, cosine learning rate decay, and geometric augmentations (scaling, jitter, and rotation). Each shape is represented by 1024 points, and local neighborhood pre-extraction ($K = 16$) is applied to capture fine-grained geometry. Regularization via weight decay and stochastic depth ensures stable training. All settings are chosen for fair comparison with prior work.
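As an illustration of local neighborhood pre-extraction with $K = 16$, the sketch below finds the 16 nearest neighbours of every point in a 1024-point cloud by brute force and expresses them relative to the centre point. The function `knn_indices`, the exclusion of the point itself, and the use of relative offsets are assumptions; the section does not specify how the extraction is implemented.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours of each point (self excluded)."""
    diff = points[:, None, :] - points[None, :, :]      # (n, n, 3) pairwise offsets
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)        # squared Euclidean distances
    np.fill_diagonal(dist2, np.inf)                     # never pick the point itself
    return np.argsort(dist2, axis=1)[:, :k]

rng = np.random.default_rng(0)
cloud = rng.standard_normal((1024, 3))                  # random stand-in for one shape

neighbours = knn_indices(cloud, k=16)                   # (1024, 16)
local_patches = cloud[neighbours] - cloud[:, None, :]   # offsets relative to each centre

print(neighbours.shape, local_patches.shape)
```

The brute-force distance matrix is quadratic in the number of points, which is acceptable at 1024 points; a spatial index would be the usual choice for larger clouds.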

Table 15: ModelNet40 training hyperparameters.

### K.4. VQA v2 \[[33](https://arxiv.org/html/2606.19538#bib.bib48)\] and NLVR2 \[[85](https://arxiv.org/html/2606.19538#bib.bib49)\]

Both tasks initialise from pre-trained weights: the image encoder from ImageNet-1K ITNet-B and the text encoder from the GLUE \[[94](https://arxiv.org/html/2606.19538#bib.bib44)\] pre-trained ITNet-B. Fine-tuning uses learning rate $2 \times 10^{-5}$, batch size 64, 10 epochs, and a cosine schedule with 1 epoch of warmup. VQA v2 \[[33](https://arxiv.org/html/2606.19538#bib.bib48)\] uses 3129 answer classes (open-ended, following \[[33](https://arxiv.org/html/2606.19538#bib.bib48)\]). NLVR2 \[[85](https://arxiv.org/html/2606.19538#bib.bib49)\] uses binary classification over the concatenated \[CLS\] embeddings of two image-text pairs. The modality-balanced measure is used throughout.
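The fine-tuning schedule described above (peak learning rate $2 \times 10^{-5}$, 10 epochs, cosine decay, 1 epoch of warmup) could look like the following sketch. The linear shape of the warmup and the value of `STEPS_PER_EPOCH` are assumptions, since the text gives neither; `lr_at` is a hypothetical helper name.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

STEPS_PER_EPOCH = 1000               # hypothetical; depends on dataset and batch size
TOTAL_STEPS = 10 * STEPS_PER_EPOCH   # 10 epochs
WARMUP_STEPS = 1 * STEPS_PER_EPOCH   # 1 epoch of warmup

for step in (0, WARMUP_STEPS - 1, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, lr_at(step, TOTAL_STEPS, WARMUP_STEPS))
```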

## Appendix L Detailed Efficiency Analysis

We analyze the trade-off between accuracy, throughput, and memory for ITNet's three modes: Exact (tiled fusion), Monte Carlo ($M = 128$), and Low-Rank ($r$ varying). Table [16](https://arxiv.org/html/2606.19538#A12.T16) reports wall-clock throughput and peak memory on H200-140GB. For low-rank mode, we set per-head rank $r_h = r/H$ with $H = 12$ heads, total $r \in \{16, 32, 64, 128\}$.
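The Monte Carlo mode replaces the full sum over positions with $M$ sampled positions. The sketch below shows a generic importance-weighted estimator of $\sum_j \kappa_j v_j$ for a single query position; it is not the paper's sampler. The scalar-valued kernel row, the proposal distribution, and the helper `mc_integral` are all illustrative assumptions.

```python
import numpy as np

def mc_integral(kernel_row, values, proposal, num_samples, rng):
    """Estimate sum_j kernel_row[j] * values[j] from sampled positions.

    Positions are drawn from `proposal`; dividing by the proposal probability
    keeps the estimate unbiased.
    """
    idx = rng.choice(len(proposal), size=num_samples, p=proposal)
    weights = kernel_row[idx] / (num_samples * proposal[idx])
    return weights @ values[idx]

n, d, M = 196, 4, 128
rng = np.random.default_rng(0)
kernel_row = rng.random(n)                 # stand-in for kernel values at one query
values = rng.standard_normal((n, d))

# Proposal proportional to kernel magnitude (one possible importance choice).
proposal = np.abs(kernel_row) / np.abs(kernel_row).sum()

exact = kernel_row @ values
estimate = mc_integral(kernel_row, values, proposal, M, rng)
print("exact:   ", exact)
print("estimate:", estimate)
```

The estimate is unbiased for any proposal that is non-zero wherever the kernel is non-zero; its variance depends on how well the proposal tracks the integrand.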

Table 16: Wall-clock throughput and peak memory. Results are reported on ImageNet-1K (vision), GLUE pre-training (language), ModelNet40 (3D point cloud), and VQA v2 (multimodal). Here $n$ denotes the number of input tokens.

We sweep per-head rank $r_h \in \{2, 4, 8, 16, 32\}$, corresponding to total rank $r = H \cdot r_h$ with $H = 12$ heads ($r \in \{24, 48, 96, 192, 384\}$). For ITNet-B (86M params, $d = 768$, $H = 12$, $d_h = 64$), Table [17](https://arxiv.org/html/2606.19538#A12.T17) reports ImageNet-1K top-1 accuracy, throughput, and relative FLOPs.

Table 17: Rank sweep for ITNet-B on ImageNet-1K validation.

| Mode | $r$ | $r_h$ | Top-1 (%) | Throughput (img/s) | Relative FLOPs |
|---|---|---|---|---|---|
| Exact (tiled) | $d^2 = 590\text{k}$ | - | 83.9 | 1,480 | $1.00\times$ |
| Low-Rank | 24 | 2 | 81.2 | 4,200 | $0.38\times$ |
| Low-Rank | 48 | 4 | 82.7 | 3,900 | $0.45\times$ |
| Low-Rank | 96 | 8 | 83.4 | 3,400 | $0.55\times$ |
| Low-Rank | 192 | 16 | 83.7 | 2,800 | $0.72\times$ |
| Low-Rank | 384 | 32 | 83.8 | 2,100 | $0.89\times$ |
| MC ($M = 128$) | - | - | 83.7 | 2,240 | $0.68\times$ |

We compare against strong efficient attention baselines under identical settings (ImageNet-1K, 86M params, batch size 32, $n = 196$):

Table 18: Comparison with efficient attention variants on ImageNet-1K.

ITNet-LR matches or exceeds the throughput of linear/FFT attentions while delivering $+3.6$ to $+6.9$ points higher accuracy. The gap is largest on tasks requiring long-range dependencies, where content-dependent kernels provide clear advantages over position-only or scalar-valued kernels. Table [19](https://arxiv.org/html/2606.19538#A12.T19) decomposes peak memory for ITNet-B at $n = 512$, $d = 768$:

Table 19: Memory breakdown for ITNet-B (512 sequence length, batch 32).

The exact mode's memory is dominated by the $O(n^{2} d^{2})$ kernel matrix. MC reduces this by $7\times$ via on-the-fly sampling; low-rank reduces it by $14\times$ via a factored representation.
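The saving from a factored representation comes from never materialising the full pairwise kernel. The sketch below shows the general idea for a scalar-valued kernel $K = \Phi \Psi^{\top}$ of rank $r$; it is a generic illustration under that assumption, not the paper's learned factorization, and `apply_low_rank` is a hypothetical name.

```python
import numpy as np

def apply_low_rank(phi, psi, values):
    """Compute (phi @ psi.T) @ values without forming the n x n kernel matrix.

    Cost is O(n * r * d) instead of O(n^2 * d) for a scalar-valued kernel.
    """
    summary = psi.T @ values      # (r, d): compress all positions into r slots
    return phi @ summary          # (n, d): expand back to every query position

n, d, r = 196, 64, 8
rng = np.random.default_rng(0)
phi = rng.standard_normal((n, r))      # query-side factors
psi = rng.standard_normal((n, r))      # key-side factors
values = rng.standard_normal((n, d))

dense = (phi @ psi.T) @ values         # reference: builds the full n x n kernel
fast = apply_low_rank(phi, psi, values)

print("max abs difference:", np.abs(dense - fast).max())
```

Both paths give the same result up to floating-point error; only the order of the matrix products differs.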

## Appendix M Extended Ablations

Kernel MLP Width Ablation: We ablate the kernel MLP hidden width $w_{\kappa}$ in Table [20](https://arxiv.org/html/2606.19538#A13.T20). Performance improves from $w_{\kappa} = 64$ to $128$, after which gains saturate, with only marginal improvement at larger widths. Increasing $w_{\kappa}$ significantly raises computational cost and reduces throughput, indicating diminishing returns beyond moderate capacity. In particular, doubling from $128$ to $256$ yields only a $+0.1\%$ gain with noticeable efficiency degradation. Based on this trade-off, we use $w_{\kappa} = 128$ in all experiments.

Table 20: Kernel MLP width ablation on ImageNet-1K (ITNet-S).

Point Cloud Encoder Ablation: We report the effect of positional encoding and local neighborhood extraction in Table [21](https://arxiv.org/html/2606.19538#A13.T21). Fourier positional encoding improves performance by $+0.5\%$ over raw coordinates, indicating that spectral representations provide more informative spatial cues for the kernel. Introducing local pre-extraction further boosts accuracy, highlighting the importance of capturing fine-grained geometric structure. Performance increases with neighborhood size up to $K = 16$ and then saturates, suggesting that larger local regions provide limited additional benefit. These results demonstrate that combining Fourier features with moderate local aggregation yields the best performance.

Table 21: Point cloud encoder ablation on ModelNet40 \[[97](https://arxiv.org/html/2606.19538#bib.bib46)\] (ITNet-S, 1024 points, no normals).

Multimodal Measure Ablation: We analyse the effect of the integration measure in multimodal fusion in Table [22](https://arxiv.org/html/2606.19538#A13.T22). A balanced weighting between image and text tokens outperforms the uniform measure, indicating that equal contribution from both modalities is beneficial. Under a uniform measure, the larger number of image patches dominates the interaction, reducing the influence of textual information. Skewing the measure toward either modality degrades performance, with text-heavy weighting performing worst. These results highlight the importance of properly balancing modalities for effective cross-modal reasoning.

Table 22: Multimodal measure ablation on VQA v2 (ITNet-B).

Fourier Feature Ablation: We report the effect of Fourier positional encoding in Table [23](https://arxiv.org/html/2606.19538#A13.T23). Introducing Fourier features significantly improves performance over raw coordinates, highlighting the importance of rich positional representations. Accuracy increases with the number of frequencies up to $L = 64$, after which gains saturate, indicating sufficient coverage of spatial scales. The bandwidth $\sigma = 10$ provides the best trade-off between capturing local and global structure, while too small ($\sigma = 1$) or too large ($\sigma = 100$) values degrade performance. These results suggest that moderate-frequency positional encoding is sufficient for effective kernel learning.
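For readers unfamiliar with Fourier positional features, the sketch below shows one common construction with $L = 64$ frequencies and bandwidth $\sigma = 10$. It assumes Gaussian random frequencies with standard deviation $\sigma$ applied to coordinates normalised to $[0, 1]$; this section does not state the exact form used by ITNet, so the construction and the name `fourier_features` are assumptions.

```python
import numpy as np

def fourier_features(positions, num_freqs, sigma, rng):
    """Map (n, p) coordinates to (n, 2 * num_freqs) random Fourier features.

    Frequencies are drawn from N(0, sigma^2); larger sigma means finer detail.
    """
    freqs = rng.normal(0.0, sigma, size=(positions.shape[1], num_freqs))
    angles = 2.0 * np.pi * positions @ freqs
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=1)

# Normalised centres of a 14 x 14 patch grid (196 positions).
grid = (np.arange(14) + 0.5) / 14
positions = np.array([(y, x) for y in grid for x in grid])

rng = np.random.default_rng(0)
feats = fourier_features(positions, num_freqs=64, sigma=10.0, rng=rng)
print(feats.shape)  # (196, 128)
```

In this construction a very small $\sigma$ gives features that vary slowly across the grid, while a very large $\sigma$ gives features that oscillate faster than the grid spacing can resolve.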

Table 23: Fourier feature hyperparameters on ImageNet-1K (ITNet-S).

Number of ITNet Layers: We analyse the effect of model depth in Table [24](https://arxiv.org/html/2606.19538#A13.T24). Increasing the number of layers consistently improves performance up to $L = 12$, beyond which gains become marginal for the S-size model. While deeper models continue to increase parameter count and computational cost, the accuracy saturates, indicating limited benefit from additional depth at fixed width. This suggests that depth and width must be scaled jointly to fully utilise model capacity. Based on this trade-off, we adopt $L = 12$ for ITNet-S as a balanced configuration.

Table 24: Layer depth ablation on ImageNet-1K (ITNet-S, $d = 384$, $H = 6$).

## Appendix N Broader Impact

ITNet is a general-purpose neural architecture. Like all such architectures, it may be applied for beneficial purposes (medical imaging, scientific discovery, accessibility tools) as well as potentially harmful ones (surveillance, deepfakes, autonomous weapons). The unification of CNNs, Transformers, and RNNs into a single operator does not introduce new risks beyond those already posed by existing architectures; rather, it provides a more principled mathematical framework for understanding existing capabilities. We note two potential positive impacts specific to ITNet: (i) by demonstrating that a single architecture can handle multiple modalities, ITNet may reduce the engineering complexity and energy cost of deploying multiple separate models for different data types; and (ii) by providing formal proofs that classical architectures are special cases of a single operator, ITNet contributes to the theoretical understanding of deep learning, which may aid in developing more interpretable and reliable systems.
