Characterizing the Representational Capacity of Neural Processes

arXiv cs.LG Papers

Summary

This paper theoretically characterizes the representational capacity of Neural Process (NP) architectures, proving a strict hierarchy among Conditional, Attentive, Convolutional, and Transformer NPs, and showing that finite-dimensional latent variables do not expand representational capacity beyond the encoder.

arXiv:2605.24210v1 Announce Type: new Abstract: What functions can Neural Processes represent? We analyze the representational capacity of popular NP architectures: Conditional Neural Processes (CNPs), Attentive Neural Processes (ANPs), Transformer Neural Processes (TNPs), and their latent variants. We prove these architectures form a strict hierarchy. CNP-representable functions are exactly those depending on finitely many expected features of the context distribution. ANPs strictly generalize CNPs via query-dependent reweighting, enabling kernel smoothers. ConvCNPs and ANPs are incomparable; each contains functions outside the other, separated by stationarity versus translation equivariance. TNPs with $L$ self-attention layers capture $L$-hop context interactions. For latent NPs, we show finite-dimensional latents provide coherent sampling but do not circumvent encoder limitations; matching GP posterior distributions requires latent dimension scaling with context size. These results provide a theoretical foundation for architecture selection based on task structure.
Original Article
View Cached Full Text

Cached at: 05/26/26, 09:01 AM

# Characterizing the Representational Capacity of Neural Processes
Source: [https://arxiv.org/html/2605.24210](https://arxiv.org/html/2605.24210)
###### Abstract

What functions can Neural Processes represent? We analyze the representational capacity of popular NP architectures: Conditional Neural Processes \(CNPs\), Attentive Neural Processes \(ANPs\), Transformer Neural Processes \(TNPs\), and their latent variants\. We prove these architectures form a strict hierarchy\. CNP\-representable functions are exactly those depending on finitely many expected features of the context distribution\. ANPs strictly generalize CNPs via query\-dependent reweighting, enabling kernel smoothers\. ConvCNPs and ANPs are incomparable; each contains functions outside the other, separated by stationarity versus translation equivariance\. TNPs withLLself\-attention layers captureLL\-hop context interactions\. For latent NPs, we show finite\-dimensional latents provide coherent sampling but do not circumvent encoder limitations; matching GP posterior distributions requires latent dimension scaling with context size\. These results provide a theoretical foundation for architecture selection based on task structure\.

## 1Introduction

Neural Processes\(Garneloet al\.,[2018a](https://arxiv.org/html/2605.24210#bib.bib14),[b](https://arxiv.org/html/2605.24210#bib.bib15)\)have emerged as a family of models for meta\-learning and few\-shot prediction\. By learning to map context sets to predictive distributions, NPs combine the computational efficiency of neural networks with the uncertainty quantification traditionally associated with Gaussian Processes \(GPs\)\.

Since their introduction, numerous architectural variants have been proposed including Conditional Neural Processes \(CNPs\), Attentive Neural Processes \(ANPs\)\(Kimet al\.,[2019](https://arxiv.org/html/2605.24210#bib.bib18)\), Convolutional CNPs\(Gordonet al\.,[2020](https://arxiv.org/html/2605.24210#bib.bib16)\), and Transformer Neural Processes \(TNPs\)\(Nguyen and Grover,[2022](https://arxiv.org/html/2605.24210#bib.bib21)\)\. Many have observed that these architectures exhibit different capabilities\. ANPs outperform CNPs on tasks requiring local adaptation, ConvCNPs excel on spatially structured stationary tasks, while TNPs excel at tasks requiring global coherence, but the theoretical foundations explaining these differences have been lacking\.

In this paper, we answer a natural question:*What functions can neural processes represent?*

We provide a characterization of CNP\-representable functions as those depending on finitely many expected features of the empirical context distribution\. We prove that ANPs strictly generalize CNPs by enabling query\-dependent reweighting, with an explicit construction showing that kernel smoothers are ANP\-representable but lie outside the CNP function class\. We show that ConvCNPs and ANPs are incomparable in the hierarchy with the separation governed by stationarity versus translation equivariance\. For TNPs, we establish thatLLlayers of self\-attention captureLL\-hop context interactions, and prove matching upper and lower bounds:Θ​\(κ​log⁡\(1/ε\)\)\\Theta\(\\sqrt\{\\kappa\}\\log\(1/\\varepsilon\)\)layers are necessary and sufficient toε\\varepsilon\-approximate GP posteriors, whereκ\\kappais the kernel matrix condition number\. We extend our analysis to latent NPs, showing that finite\-dimensional latent variables provide coherent function sampling but do not expand representational capacity beyond what the encoder permits\.

## 2Related Work

Neural Processes were introduced byGarneloet al\.\([2018a](https://arxiv.org/html/2605.24210#bib.bib14)\)as a computationally efficient alternative to Gaussian Processes for meta\-learning, with the latent variable extension following inGarneloet al\.\([2018b](https://arxiv.org/html/2605.24210#bib.bib15)\)\. Subsequent work has developed numerous architectural variants\. Attentive Neural Processes\(Kimet al\.,[2019](https://arxiv.org/html/2605.24210#bib.bib18)\)replaced mean aggregation with cross\-attention, allowing the context representation to depend on the target location\. Convolutional Conditional Neural Processes\(Gordonet al\.,[2020](https://arxiv.org/html/2605.24210#bib.bib16)\)introduced translation equivariance through convolutional aggregation, achieving strong performance on spatially structured tasks\. Transformer Neural Processes\(Nguyen and Grover,[2022](https://arxiv.org/html/2605.24210#bib.bib21)\)applied self\-attention to the context set before cross\-attention, enabling context points to exchange information\. Practitioners have observed that these architectures exhibit qualitatively different capabilities, but the theoretical foundations explaining these differences have been absent\. We provide the first rigorous characterization of the representational capacity of each architecture class, proving they form a strict hierarchy\.

The representation of functions over sets has been studied extensively in the deep learning literature\.Zaheeret al\.\([2017](https://arxiv.org/html/2605.24210#bib.bib4)\)established that permutation invariant functions can be represented by DeepSets architectures with sum or mean aggregation, while follow up work byWagstaffet al\.\([2019](https://arxiv.org/html/2605.24210#bib.bib8)\)proved limitations that no continuous function with fixed\-dimension aggregation can be injective on sets of unbounded size\. Our work extends these results to the predictive setting where outputs depend on both the context set and a target location\. The distinction is that Neural Processes must produce query\-dependent predictions, not just set\-level summaries\. We show that query\-dependent reweighting strictly expands representational capacity even when the aggregated dimension is held fixed\.

Gaussian Processes are standard for uncertainty quantification in regression, but exact inference scales cubically in the number of observations\. This has motivated extensive work on sparse approximations, including inducing point methods\(Snelson and Ghahramani,[2005](https://arxiv.org/html/2605.24210#bib.bib26); Titsias,[2009](https://arxiv.org/html/2605.24210#bib.bib23)\), random Fourier features\(Rahimi and Recht,[2007](https://arxiv.org/html/2605.24210#bib.bib25)\), and structured kernel interpolation\(Wilson and Nickisch,[2015](https://arxiv.org/html/2605.24210#bib.bib24)\)\. Neural Processes take a different approach as they learn to amortize GP\-like inference through an encoder\-decoder architecture trained across tasks\. Our results characterize when this amortization is possible\. We prove that CNPs and ANPs cannot represent GP posteriors regardless of encoder capacity \(Theorems[6](https://arxiv.org/html/2605.24210#Thmtheorem6)and[12](https://arxiv.org/html/2605.24210#Thmtheorem12)\), while TNPs can approximate GP posteriors with depth scaling asΘ​\(κ​log⁡\(1/ε\)\)\\Theta\(\\sqrt\{\\kappa\}\\log\(1/\\varepsilon\)\)whereκ\\kappais the kernel matrix condition number \(Theorem[15](https://arxiv.org/html/2605.24210#Thmtheorem15), Proposition[14](https://arxiv.org/html/2605.24210#Thmtheorem14)\)\. This provides a precise sense in which TNPs succeed where simpler architectures fail\.

The representational capacity of transformer architectures has received significant recent attention\.Yunet al\.\([2020](https://arxiv.org/html/2605.24210#bib.bib27)\)proved that transformers are universal approximators of sequence\-to\-sequence functions, whilePérezet al\.\([2021](https://arxiv.org/html/2605.24210#bib.bib28)\)established Turing completeness under suitable assumptions\. Our depth lower bound \(Theorem[15](https://arxiv.org/html/2605.24210#Thmtheorem15)\) contributes to this literature by proving task\-specific depth requirements\. Approximating GP posteriors requiresΩ​\(κ​log⁡\(1/ε\)\)\\Omega\(\\sqrt\{\\kappa\}\\log\(1/\\varepsilon\)\)layers regardless of width\. The key insight is that linearization atyC=0y\_\{C\}=0reduces the Jacobian to a polynomial in the attention matrix, enabling the application of classical approximation barriers\. This technique may be applicable to other regression tasks requiring matrix inversion\.

Our depth lower bound relies on classical results from approximation theory, particularly polynomial approximation of rational functions\. The Chebyshev barrier for approximating1/μ1/\\muon an interval dates to the foundational work of Chebyshev and Markov; seeTrefethen \([2019](https://arxiv.org/html/2605.24210#bib.bib6)\)for a modern treatment\. The connection between iterative methods and polynomial approximation is well\-established in numerical linear algebra, where Chebyshev iteration achieves optimal convergence rates for linear systems\(Golub and Van Loan,[2013](https://arxiv.org/html/2605.24210#bib.bib30)\)\. Our contribution is recognizing that transformer self\-attention layers implement polynomial iterations, and that the same barriers governing classical iterative methods govern neural network depth requirements\. The matching upper bound \(Proposition[14](https://arxiv.org/html/2605.24210#Thmtheorem14)\) shows that the Chebyshev polynomial construction can be realized by appropriate choice of attention weights\.

Theoretical analysis of meta\-learning has developed along several axes\. Generalization bounds for learning\-to\-learn were established byBaxter \([2000](https://arxiv.org/html/2605.24210#bib.bib9)\), with subsequent refinements using PAC\-Bayes\(Pentina and Lampert,[2014](https://arxiv.org/html/2605.24210#bib.bib20); Rothfusset al\.,[2021](https://arxiv.org/html/2605.24210#bib.bib22)\)and information\-theoretic\(Jose and Simeone,[2021](https://arxiv.org/html/2605.24210#bib.bib19)\)techniques\. These results bound how many tasks are needed for a meta\-learner to generalize to new tasks\. Our work is complementary\. We characterize what can be represented, not how fast it can be learned\. The expressiveness hierarchy we establish has implications for meta\-learning theory\. A CNP cannot benefit from additional tasks if the target predictor lies outside its function class, but a full sample complexity analysis for Neural Processes remains open\.

## 3Preliminaries

Let𝒳⊆ℝdx\\mathcal\{X\}\\subseteq\\mathbb\{R\}^\{d\_\{x\}\}be the input space and𝒴⊆ℝdy\\mathcal\{Y\}\\subseteq\\mathbb\{R\}^\{d\_\{y\}\}the output space\. A context set is a finite collectionC=\{\(xi,yi\)\}i=1nC=\\\{\(x\_\{i\},y\_\{i\}\)\\\}\_\{i=1\}^\{n\}withxi∈𝒳x\_\{i\}\\in\\mathcal\{X\},yi∈𝒴y\_\{i\}\\in\\mathcal\{Y\}\. The context space𝒞=⋃n=1∞\(𝒳×𝒴\)n/Sn\\mathcal\{C\}=\\bigcup\_\{n=1\}^\{\\infty\}\(\\mathcal\{X\}\\times\\mathcal\{Y\}\)^\{n\}/S\_\{n\}is the set of all finite multisets, with𝒞≤N\\mathcal\{C\}\_\{\\leq N\}the restriction to size at mostNN\. A*predictive map*F:𝒞×𝒳→ΘF:\\mathcal\{C\}\\times\\mathcal\{X\}\\to\\Thetareturns distribution parametersθ∈Θ\\theta\\in\\Thetagiven a context setCCand target locationxtx\_\{t\}\.

#### Conditional Neural Process \(CNP\)\.

A CNP computes:

hi\\displaystyle h\_\{i\}=hϕ​\(xi,yi\)∈ℝd\\displaystyle=h\_\{\\phi\}\(x\_\{i\},y\_\{i\}\)\\in\\mathbb\{R\}^\{d\}\(encode each context point\)\(1\)rC\\displaystyle r\_\{C\}=1n​∑i=1nhi∈ℝd\\displaystyle=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}h\_\{i\}\\in\\mathbb\{R\}^\{d\}\(aggregate via mean\)\(2\)θ\\displaystyle\\theta=gψ​\(xt,rC\)∈Θ\\displaystyle=g\_\{\\psi\}\(x\_\{t\},r\_\{C\}\)\\in\\Theta\(decode at target\)\(3\)wherehϕ:𝒳×𝒴→ℝdh\_\{\\phi\}:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}^\{d\}is the encoder andgψ:𝒳×ℝd→Θg\_\{\\psi\}:\\mathcal\{X\}\\times\\mathbb\{R\}^\{d\}\\to\\Thetais the decoder\.

#### Attentive Neural Process \(ANP\)\.

An ANP uses cross\-attention from target to context:

αi​\(xt;C\)\\displaystyle\\alpha\_\{i\}\(x\_\{t\};C\)=exp⁡\(q​\(xt\)⊤​k​\(xi,yi\)/τ\)∑j=1nexp⁡\(q​\(xt\)⊤​k​\(xj,yj\)/τ\)\\displaystyle=\\frac\{\\exp\(q\(x\_\{t\}\)^\{\\top\}k\(x\_\{i\},y\_\{i\}\)/\\tau\)\}\{\\sum\_\{j=1\}^\{n\}\\exp\(q\(x\_\{t\}\)^\{\\top\}k\(x\_\{j\},y\_\{j\}\)/\\tau\)\}\(4\)rC​\(xt\)\\displaystyle r\_\{C\}\(x\_\{t\}\)=∑i=1nαi​\(xt;C\)​v​\(xi,yi\)\\displaystyle=\\sum\_\{i=1\}^\{n\}\\alpha\_\{i\}\(x\_\{t\};C\)\\,v\(x\_\{i\},y\_\{i\}\)\(5\)θ\\displaystyle\\theta=gψ​\(xt,rC​\(xt\)\)\\displaystyle=g\_\{\\psi\}\(x\_\{t\},r\_\{C\}\(x\_\{t\}\)\)\(6\)whereq:𝒳→ℝdkq:\\mathcal\{X\}\\to\\mathbb\{R\}^\{d\_\{k\}\}is the query network,k:𝒳×𝒴→ℝdkk:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}^\{d\_\{k\}\}is the key network, andv:𝒳×𝒴→ℝdv:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}^\{d\}is the value network\.

#### Transformer Neural Process \(TNP\)\.

A TNP applies self\-attention to the context before cross\-attention:

βi​j\(ℓ\)\\displaystyle\\beta\_\{ij\}^\{\(\\ell\)\}=exp⁡\(qs​\(h~i\(ℓ−1\)\)⊤​ks​\(h~j\(ℓ−1\)\)/τ\)∑m=1nexp⁡\(qs​\(h~i\(ℓ−1\)\)⊤​ks​\(h~m\(ℓ−1\)\)/τ\)\\displaystyle=\\frac\{\\exp\(q\_\{s\}\(\\tilde\{h\}\_\{i\}^\{\(\\ell\-1\)\}\)^\{\\top\}k\_\{s\}\(\\tilde\{h\}\_\{j\}^\{\(\\ell\-1\)\}\)/\\tau\)\}\{\\sum\_\{m=1\}^\{n\}\\exp\(q\_\{s\}\(\\tilde\{h\}\_\{i\}^\{\(\\ell\-1\)\}\)^\{\\top\}k\_\{s\}\(\\tilde\{h\}\_\{m\}^\{\(\\ell\-1\)\}\)/\\tau\)\}\(7\)h~i\(ℓ\)\\displaystyle\\tilde\{h\}\_\{i\}^\{\(\\ell\)\}=∑j=1nβi​j\(ℓ\)​Wv​h~j\(ℓ−1\)\\displaystyle=\\sum\_\{j=1\}^\{n\}\\beta\_\{ij\}^\{\(\\ell\)\}\\,W\_\{v\}\\tilde\{h\}\_\{j\}^\{\(\\ell\-1\)\}\(8\)for layersℓ=1,…,L\\ell=1,\\ldots,L, withh~i\(0\)=h​\(xi,yi\)\\tilde\{h\}\_\{i\}^\{\(0\)\}=h\(x\_\{i\},y\_\{i\}\)\. The final representation is then used in cross\-attention as in ANP\.

#### Convolutional Neural Process \(ConvCNP\)\.

A ConvCNP replaces finite\-dimensional aggregation with functional convolutional channels\. A non\-negative filterw:ℝdx→ℝ\+w:\\mathbb\{R\}^\{d\_\{x\}\}\\to\\mathbb\{R\}\_\{\+\}and pointwise encoderh:𝒴→ℝdh:\\mathcal\{Y\}\\to\\mathbb\{R\}^\{d\}produce:

ρC​\(x\)\\displaystyle\\rho\_\{C\}\(x\)=∑i=1nw​\(x−xi\)\\displaystyle=\\sum\_\{i=1\}^\{n\}w\(x\-x\_\{i\}\)\(density channel\)\(9\)sC​\(x\)\\displaystyle s\_\{C\}\(x\)=∑i=1nw​\(x−xi\)​h​\(yi\)\\displaystyle=\\sum\_\{i=1\}^\{n\}w\(x\-x\_\{i\}\)\\,h\(y\_\{i\}\)\(signal channel\)\(10\)r~C\\displaystyle\\tilde\{r\}\_\{C\}=Φ​\(sC,ρC\)\\displaystyle=\\Phi\(s\_\{C\},\\rho\_\{C\}\)\(CNN processing\)\(11\)θ\\displaystyle\\theta=g​\(r~C​\(xt\),xt\)\\displaystyle=g\(\\tilde\{r\}\_\{C\}\(x\_\{t\}\),x\_\{t\}\)\(readout at target\)\(12\)whereΦ\\Phiis a multi\-layer CNN\. A pure ConvCNP omits the CNN:θ=g​\(sC​\(xt\),ρC​\(xt\),xt\)\\theta=g\(s\_\{C\}\(x\_\{t\}\),\\rho\_\{C\}\(x\_\{t\}\),x\_\{t\}\)\. We distinguish between the pure convolutional aggregation and the subsequent CNN processing, as these contribute qualitatively different representational capabilities\.

The core question of representational capacity is basically: what information about the context can reach the prediction? For CNPs, only the mean\-aggregated encoding and context points are processed independently\. For ANPs, the query can reweight context points, but the weights factorize but context points still don’t interact\. ConvCNPs lift the finite\-dimensional bottleneck via functional representations, and the CNN enables context–context coupling, but only through translation\-equivariant operations\. For TNPs, self\-attention lets context points exchange information before the prediction is made\. This progression from no interaction, to query\-mediated reweighting, to full context\-context coupling is what drives the hierarchy we establish\.

## 4Conditional Neural Processes

###### Definition 1\(hh\-Equivalence\)\.

For a fixed encoderh:𝒳×𝒴→ℝdh:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}^\{d\}, two context setsC,C′C,C^\{\\prime\}arehh\-equivalent, writtenC∼hC′C\\sim\_\{h\}C^\{\\prime\}, if1\|C\|​∑\(x,y\)∈Ch​\(x,y\)=1\|C′\|​∑\(x,y\)∈C′h​\(x,y\)\\frac\{1\}\{\|C\|\}\\sum\_\{\(x,y\)\\in C\}h\(x,y\)=\\frac\{1\}\{\|C^\{\\prime\}\|\}\\sum\_\{\(x,y\)\\in C^\{\\prime\}\}h\(x,y\)\.

###### Proposition 2\(Indistinguishability\)\.

A CNP with encoderhhcannot distinguishhh\-equivalent context sets: ifC∼hC′C\\sim\_\{h\}C^\{\\prime\}, thenCNP​\(C,xt\)=CNP​\(C′,xt\)\\mathrm\{CNP\}\(C,x\_\{t\}\)=\\mathrm\{CNP\}\(C^\{\\prime\},x\_\{t\}\)for allxt∈𝒳x\_\{t\}\\in\\mathcal\{X\}\.

###### Theorem 3\(CNP Characterization\)\.

The class ofdd\-representable predictive maps is exactly the moment statistics classℱmoment\(d\)=\{F:F​\(C,xt\)=ϕ​\(1n​∑iψ​\(xi,yi\),xt\)\}\\mathcal\{F\}\_\{\\mathrm\{moment\}\}^\{\(d\)\}=\\\{F:F\(C,x\_\{t\}\)=\\phi\(\\frac\{1\}\{n\}\\sum\_\{i\}\\psi\(x\_\{i\},y\_\{i\}\),x\_\{t\}\)\\\}for continuousψ:𝒳×𝒴→ℝd\\psi:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}^\{d\},ϕ:ℝd×𝒳→Θ\\phi:\\mathbb\{R\}^\{d\}\\times\\mathcal\{X\}\\to\\Theta\.

###### Proposition 4\(Existence of Collisions\)\.

For any encoderh:𝒳×𝒴→ℝdh:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}^\{d\}and anyn\>d/\(dx\+dy\)n\>d/\(d\_\{x\}\+d\_\{y\}\), there exist distinct context setsC≠C′C\\neq C^\{\\prime\}with\|C\|=\|C′\|=n\|C\|=\|C^\{\\prime\}\|=nsuch thatC∼hC′C\\sim\_\{h\}C^\{\\prime\}\.

###### Example 5\(n=2n=2Collision\)\.

Considerdx=dy=1d\_\{x\}=d\_\{y\}=1and encoderh​\(x,y\)=\(x,y\)∈ℝ2h\(x,y\)=\(x,y\)\\in\\mathbb\{R\}^\{2\}\. The contextsC=\{\(0,1\),\(2,1\)\}C=\\\{\(0,1\),\(2,1\)\\\}andC′=\{\(0\.5,0\.5\),\(1\.5,1\.5\)\}C^\{\\prime\}=\\\{\(0\.5,0\.5\),\(1\.5,1\.5\)\\\}satisfyh¯C=h¯C′=\(1,1\)\\bar\{h\}\_\{C\}=\\bar\{h\}\_\{C^\{\\prime\}\}=\(1,1\), yet for an RBF kernel,k​\(0,2\)≠k​\(0\.5,1\.5\)k\(0,2\)\\neq k\(0\.5,1\.5\), so the GP posterior means differ\. This two\-point collision reappears in the ANP analysis \(Theorem[12](https://arxiv.org/html/2605.24210#Thmtheorem12)\), where we show that even query\-dependent reweighting cannot resolve it\.

###### Theorem 6\(GP Posterior is Not Finitely Representable\)\.

A predictive mapFFis\(d,ε\)\(d,\\varepsilon\)\-representable on𝒞≤n\\mathcal\{C\}\_\{\\leq n\}if there exist continuoush:𝒳×𝒴→ℝdh:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}^\{d\}andg:ℝd×𝒳→Θg:\\mathbb\{R\}^\{d\}\\times\\mathcal\{X\}\\to\\Thetasuch thatsupC∈𝒞≤n,xt∈𝒳‖F​\(C,xt\)−g​\(1\|C\|​∑\(x,y\)∈Ch​\(x,y\),xt\)‖≤ε\\sup\_\{C\\in\\mathcal\{C\}\_\{\\leq n\},x\_\{t\}\\in\\mathcal\{X\}\}\\\|F\(C,x\_\{t\}\)\-g\(\\frac\{1\}\{\|C\|\}\\sum\_\{\(x,y\)\\in C\}h\(x,y\),x\_\{t\}\)\\\|\\leq\\varepsilon\. Exact representability corresponds toε=0\\varepsilon=0\.

For anyddand anyn\>dn\>d, the GP posterior meanμ​\(xt\|C\)=k​\(xt,XC\)​𝐊​\(XC,XC\)−1​yC\\mu\(x\_\{t\}\|C\)=k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}\(X\_\{C\},X\_\{C\}\)^\{\-1\}y\_\{C\}is not\(d,0\)\(d,0\)\-representable on𝒞≤n\\mathcal\{C\}\_\{\\leq n\}for generic positive definite kernelskk\.

###### Theorem 7\(CNP Approximation Lower Bound\)\.

Letν\\nudenote a distribution over target locationsxt∈𝒳x\_\{t\}\\in\\mathcal\{X\}, context locationsx1,…,xn∈𝒳x\_\{1\},\\ldots,x\_\{n\}\\in\\mathcal\{X\}be fixed and letyC∼𝒩​\(0,𝐊\)y\_\{C\}\\sim\\mathcal\{N\}\(0,\\mathbf\{K\}\)\. For the GP posterior mean, any CNP with representation dimensiond<nd<nsatisfies:

infCNP𝔼yC∫\|μ\(xt\|C\)−μ^\(xt\|C\)\|2dν\(xt\)𝔼yC∫\|μ\(xt\|C\)\|2dν\(xt\)≥1−dn\\inf\_\{\\mathrm\{CNP\}\}\\frac\{\\mathbb\{E\}\_\{y\_\{C\}\}\\int\|\\mu\(x\_\{t\}\|C\)\-\\widehat\{\\mu\}\(x\_\{t\}\|C\)\|^\{2\}\\,d\\nu\(x\_\{t\}\)\}\{\\mathbb\{E\}\_\{y\_\{C\}\}\\int\|\\mu\(x\_\{t\}\|C\)\|^\{2\}\\,d\\nu\(x\_\{t\}\)\}\\geq 1\-\\frac\{d\}\{n\}for any target distributionν\\nusuch that the whitened kernel vectors\{𝐊−1/2​k​\(xt,XC\)\}xt∼ν\\\{\\mathbf\{K\}^\{\-1/2\}k\(x\_\{t\},X\_\{C\}\)\\\}\_\{x\_\{t\}\\sim\\nu\}have isotropic second moment\.

###### Example 8\(Linear Regression\)\.

Even ordinary least squares illustrates the bottleneck\. Withkk\-dimensional featuresψ:𝒳→ℝk\\psi:\\mathcal\{X\}\\to\\mathbb\{R\}^\{k\}, the OLS predictorF​\(C,xt\)=⟨\(∑iψ​\(xi\)​ψ​\(xi\)⊤\)−1​∑iyi​ψ​\(xi\),ψ​\(xt\)⟩F\(C,x\_\{t\}\)=\\langle\(\\sum\_\{i\}\\psi\(x\_\{i\}\)\\psi\(x\_\{i\}\)^\{\\top\}\)^\{\-1\}\\sum\_\{i\}y\_\{i\}\\psi\(x\_\{i\}\),\\,\\psi\(x\_\{t\}\)\\ranglerequires encoding both the Gram matrix∑iψ​\(xi\)​ψ​\(xi\)⊤\\sum\_\{i\}\\psi\(x\_\{i\}\)\\psi\(x\_\{i\}\)^\{\\top\}\(k​\(k\+1\)/2k\(k\+1\)/2parameters\) and the moment vector∑iyi​ψ​\(xi\)\\sum\_\{i\}y\_\{i\}\\psi\(x\_\{i\}\)\(kkparameters\), givingd=k​\(k\+3\)/2d=k\(k\+3\)/2\. Since distinct Gram matrices generically yield distinct predictors,d=Ω​\(k2\)d=\\Omega\(k^\{2\}\)is also necessary\.

The bound in Theorem[7](https://arxiv.org/html/2605.24210#Thmtheorem7)is sharp and achieved by the PCA encoderh¯C=A∗​yC\\bar\{h\}\_\{C\}=A^\{\*\}y\_\{C\}that projects onto the topddprincipal components of the whitened weight covariance\. Ford≪nd\\ll n, most GP posterior structure is lost regardless of encoder sophistication\. The isotropic second moment condition holds whenever context and target locations are drawn i\.i\.d\. from the same distribution and the kernel is stationary \(e\.g\. RBF\); for non\-isotropic targets the bound generalizes to∑i=d\+1nλi/∑i=1nλi\\sum\_\{i=d\+1\}^\{n\}\\lambda\_\{i\}/\\sum\_\{i=1\}^\{n\}\\lambda\_\{i\}whereλi\\lambda\_\{i\}are the eigenvalues of𝔼xt∼ν​\[w~xt​w~xt⊤\]\\mathbb\{E\}\_\{x\_\{t\}\\sim\\nu\}\[\\tilde\{w\}\_\{x\_\{t\}\}\\tilde\{w\}\_\{x\_\{t\}\}^\{\\top\}\]\. Full proofs appear in Appendix[A](https://arxiv.org/html/2605.24210#A1)\.

## 5Attentive Neural Processes

The key difference in ANPs is that the representationrC​\(xt\)r\_\{C\}\(x\_\{t\}\)depends on the query locationxtx\_\{t\}via attention weightsαi​\(xt;C\)∝exp⁡\(q​\(xt\)⊤​k​\(xi,yi\)/τ\)\\alpha\_\{i\}\(x\_\{t\};C\)\\propto\\exp\(q\(x\_\{t\}\)^\{\\top\}k\(x\_\{i\},y\_\{i\}\)/\\tau\)\. Unlike CNP\-equivalence, ANP\-equivalence is query\-dependent: context sets equivalent at one query may be distinguishable at another\.

###### Theorem 9\(ANPs Represent Kernel Smoothers\)\.

For any continuous kernelK:𝒳×𝒳→ℝ\+K:\\mathcal\{X\}\\times\\mathcal\{X\}\\to\\mathbb\{R\}\_\{\+\}, the kernel smootherF​\(C,xt\)=∑iK​\(xt,xi\)​yi/∑iK​\(xt,xi\)F\(C,x\_\{t\}\)=\\sum\_\{i\}K\(x\_\{t\},x\_\{i\}\)y\_\{i\}/\\sum\_\{i\}K\(x\_\{t\},x\_\{i\}\)is ANP\-representable to arbitrary precision with value dimensiond=dy\+1d=d\_\{y\}\+1\.

###### Proof sketch\.

Setv​\(x,y\)=\(y,1\)v\(x,y\)=\(y,1\),k​\(x,y\)=ψ​\(x\)k\(x,y\)=\\psi\(x\),q​\(xt\)=ϕ​\(xt\)q\(x\_\{t\}\)=\\phi\(x\_\{t\}\), and chooseϕ,ψ\\phi,\\psiso thatq​\(xt\)⊤​k​\(xi,yi\)≈log⁡K​\(xt,xi\)q\(x\_\{t\}\)^\{\\top\}k\(x\_\{i\},y\_\{i\}\)\\approx\\log K\(x\_\{t\},x\_\{i\}\)\. Then the attention weights recover the normalized kernel weights, and a linear decoder extracts the kernel smoother\. The full construction and universal approximation argument appear in Appendix[B](https://arxiv.org/html/2605.24210#A2)\. ∎

###### Theorem 10\(ANP Characterization\)\.

A predictive mapFFis ANP\-representable if and only ifF​\(C,xt\)=G​\(𝔼PCxt​\[v​\(x,y\)\],xt\)F\(C,x\_\{t\}\)=G\(\\mathbb\{E\}\_\{P\_\{C\}^\{x\_\{t\}\}\}\[v\(x,y\)\],x\_\{t\}\)wherePCxt​\(xi,yi\)∝exp⁡\(s​\(xt,xi,yi\)\)P\_\{C\}^\{x\_\{t\}\}\(x\_\{i\},y\_\{i\}\)\\propto\\exp\(s\(x\_\{t\},x\_\{i\},y\_\{i\}\)\)for some continuous score functionss\.

###### Corollary 11\(Strict Separation\)\.

ℱCNP\(d\)⊊ℱANP\(d\)\\mathcal\{F\}\_\{\\mathrm\{CNP\}\}^\{\(d\)\}\\subsetneq\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}for alldd\.

###### Theorem 12\(GP Posterior Requires Context\-Context Coupling\)\.

For generic positive definite kernels, the GP posterior mean is not ANP\-representable\.

###### Proof sketch\.

The GP posterior weightwi=\[k​\(xt,XC\)​𝐊−1\]iw\_\{i\}=\[k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}\]\_\{i\}depends on the full Gram matrix, coupling all context points\. ANP attention weights factor asαi∝f​\(xt,xi,yi\)\\alpha\_\{i\}\\propto f\(x\_\{t\},x\_\{i\},y\_\{i\}\), depending only on query\-point pairs independently\. The simplest counterexample hasn=2n=2: the GP weight ony1y\_\{1\}is

w1=k​\(xt,x1\)​k​\(x2,x2\)−k​\(xt,x2\)​k​\(x1,x2\)k​\(x1,x1\)​k​\(x2,x2\)−k​\(x1,x2\)2,w\_\{1\}=\\frac\{k\(x\_\{t\},x\_\{1\}\)k\(x\_\{2\},x\_\{2\}\)\-k\(x\_\{t\},x\_\{2\}\)k\(x\_\{1\},x\_\{2\}\)\}\{k\(x\_\{1\},x\_\{1\}\)k\(x\_\{2\},x\_\{2\}\)\-k\(x\_\{1\},x\_\{2\}\)^\{2\}\},which depends on the inter\-context kernel valuek​\(x1,x2\)k\(x\_\{1\},x\_\{2\}\)\. No ANP attention weightα1∝exp⁡\(q​\(xt\)⊤​k​\(x1,y1\)\)\\alpha\_\{1\}\\propto\\exp\(q\(x\_\{t\}\)^\{\\top\}k\(x\_\{1\},y\_\{1\}\)\)can capture this coupling\. See Appendix[B](https://arxiv.org/html/2605.24210#A2)for the general argument\. ∎

## 6Transformer Neural Processes

### 6\.1Polynomial Computation via Self\-Attention

We analyze TNPs under two structural assumptions that enable clean characterization: position\-based self\-attention \(attention weights depend only on input positions, not representations\) and residual connections\. These are relaxed in Section[6\.3](https://arxiv.org/html/2605.24210#S6.SS3)\. Full definitions appear in Appendix[C](https://arxiv.org/html/2605.24210#A3)\.

###### Theorem 13\(Polynomial Representation\)\.

Under position\-based attention with residual connections, afterLLself\-attention layers with scalar value weightsWv\(ℓ\)=αℓ​𝐈W\_\{v\}^\{\(\\ell\)\}=\\alpha\_\{\\ell\}\\mathbf\{I\}, the representation matrix satisfiesH\(L\)=pL​\(𝐊~\)​H\(0\)H^\{\(L\)\}=p\_\{L\}\(\\tilde\{\\mathbf\{K\}\}\)H^\{\(0\)\}wherepL​\(𝐊~\)=∏ℓ=1L\(𝐈\+αℓ​𝐊~\)p\_\{L\}\(\\tilde\{\\mathbf\{K\}\}\)=\\prod\_\{\\ell=1\}^\{L\}\(\\mathbf\{I\}\+\\alpha\_\{\\ell\}\\tilde\{\\mathbf\{K\}\}\)is a degree\-LLpolynomial in the attention matrix𝐊~\\tilde\{\\mathbf\{K\}\}\.

###### Proof\.

By induction:H\(L\)=\(𝐈\+αL​𝐊~\)​H\(L−1\)=∏ℓ=1L\(𝐈\+αℓ​𝐊~\)​H\(0\)H^\{\(L\)\}=\(\\mathbf\{I\}\+\\alpha\_\{L\}\\tilde\{\\mathbf\{K\}\}\)H^\{\(L\-1\)\}=\\prod\_\{\\ell=1\}^\{L\}\(\\mathbf\{I\}\+\\alpha\_\{\\ell\}\\tilde\{\\mathbf\{K\}\}\)H^\{\(0\)\}\. ∎

### 6\.2Approximating GP Posteriors

The GP posterior involves𝐊−1\\mathbf\{K\}^\{\-1\}\. Via the Neumann series𝐊−1=1λmax​∑m=0∞\(𝐈−𝐊/λmax\)m\\mathbf\{K\}^\{\-1\}=\\frac\{1\}\{\\lambda\_\{\\max\}\}\\sum\_\{m=0\}^\{\\infty\}\(\\mathbf\{I\}\-\\mathbf\{K\}/\\lambda\_\{\\max\}\)^\{m\}, this can be approximated by matrix polynomials\. The truncation error isO​\(ρL/λmin\)O\(\\rho^\{L\}/\\lambda\_\{\\min\}\)whereρ=1−1/κ\\rho=1\-1/\\kappa, requiringL=O​\(κ​log⁡\(1/ε\)\)L=O\(\\kappa\\log\(1/\\varepsilon\)\)layers\. The Chebyshev construction improves this toO​\(κ​log⁡\(1/ε\)\)O\(\\sqrt\{\\kappa\}\\log\(1/\\varepsilon\)\):

###### Proposition 14\(Chebyshev Upper Bound\)\.

There exist scalar value weightsα1,…,αL\\alpha\_\{1\},\\ldots,\\alpha\_\{L\}such that‖∏ℓ=1L\(𝐈\+αℓ​𝐊\)−𝐊−1‖≤2λmin​\(κ−1κ\+1\)L\\\|\\prod\_\{\\ell=1\}^\{L\}\(\\mathbf\{I\}\+\\alpha\_\{\\ell\}\\mathbf\{K\}\)\-\\mathbf\{K\}^\{\-1\}\\\|\\leq\\frac\{2\}\{\\lambda\_\{\\min\}\}\\left\(\\frac\{\\sqrt\{\\kappa\}\-1\}\{\\sqrt\{\\kappa\}\+1\}\\right\)^\{L\}\.

The weights are the optimal Chebyshev iteration parametersαℓ=−2/\(λmax\+λmin\+\(λmax−λmin\)​cos⁡θℓ\)\\alpha\_\{\\ell\}=\-2/\(\\lambda\_\{\\max\}\+\\lambda\_\{\\min\}\+\(\\lambda\_\{\\max\}\-\\lambda\_\{\\min\}\)\\cos\\theta\_\{\\ell\}\)withθℓ=\(2​ℓ−1\)​π/\(2​L\)\\theta\_\{\\ell\}=\(2\\ell\-1\)\\pi/\(2L\)\. See Appendix[32](https://arxiv.org/html/2605.24210#Thmtheorem32)for the full proof and idealized TNP approximation construction\.

### 6\.3Lower Bounds for Representation\-Based Attention

The upper bound assumed position\-based attention\. We now show theΩ​\(κ​log⁡\(1/ε\)\)\\Omega\(\\sqrt\{\\kappa\}\\log\(1/\\varepsilon\)\)depth requirement holds even for fully adaptive representation\-based attention\.

The key insight is that the GP posterior meanμ​\(xt\|C\)=k​\(xt,XC\)​𝐊−1​yC\\mu\(x\_\{t\}\|C\)=k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}y\_\{C\}is linear inyCy\_\{C\}\. Linearizing the TNP atyC=0y\_\{C\}=0neutralizes adaptive routing: the Jacobian∂F/∂yC\|yC=0\\partial F/\\partial y\_\{C\}\|\_\{y\_\{C\}=0\}becomes a polynomial in the base attention matrix \(evaluated atyC=0y\_\{C\}=0\), regardless of how sophisticated the attention mechanism\.

###### Theorem 15\(TNP Depth Lower Bound\)\.

For anyκ\>1\\kappa\>1, there exists a family of context configurations with kernel matrices satisfyingκ​\(𝐊\)=κ\\kappa\(\\mathbf\{K\}\)=\\kappasuch that anyLL\-layer TNP with representation\-based self\-attention achievingε\\varepsilon\-approximation of𝐊−1​yC\\mathbf\{K\}^\{\-1\}y\_\{C\}must satisfy:

L≥κ4⋅log⁡\(cε\)L\\geq\\frac\{\\sqrt\{\\kappa\}\}\{4\}\\cdot\\log\\left\(\\frac\{c\}\{\\varepsilon\}\\right\)for a universal constantc\>0c\>0\.

###### Proof sketch\.

The argument proceeds in four stages:

*\(1\) Linearization\.*By Taylor expansion \(Lemma[37](https://arxiv.org/html/2605.24210#Thmtheorem37)in Appendix[D](https://arxiv.org/html/2605.24210#A4)\),ε\\varepsilon\-approximation of the linear target implies the JacobianM​\(X\)=∂F/∂yC\|yC=0M\(X\)=\\partial F/\\partial y\_\{C\}\|\_\{y\_\{C\}=0\}satisfies‖M​\(X\)−𝐊−1‖≤O​\(ε\)\\\|M\(X\)\-\\mathbf\{K\}^\{\-1\}\\\|\\leq O\(\\varepsilon\)\.

*\(2\) Polynomial structure\.*Differentiating throughLLself\-attention layers showsM​\(X\)M\(X\)is a matrix polynomial of degree at most2​L2Lin the attention matrix𝐊~\|yC=0\\tilde\{\\mathbf\{K\}\}\|\_\{y\_\{C\}=0\}\(Appendix[D](https://arxiv.org/html/2605.24210#A4), Corollary[42](https://arxiv.org/html/2605.24210#Thmtheorem42)\)\.

*\(3\) Univariate reduction\.*We construct a family of kernel matrices\{𝐊t\}\\\{\\mathbf\{K\}\_\{t\}\\\}\(Lemma[45](https://arxiv.org/html/2605.24210#Thmtheorem45)\) where all eigenvalues except the minimum are equal to 1, with condition numberκ\\kappa\. On this family, the quadratic formv1⊤​M​\(𝐊t\)​v1v\_\{1\}^\{\\top\}M\(\\mathbf\{K\}\_\{t\}\)v\_\{1\}reduces to a univariate polynomial in the minimum eigenvalueμ1​\(t\)\\mu\_\{1\}\(t\)\.

*\(4\) Chebyshev barrier\.*The targetv1⊤​𝐊t−1​v1=1/μ1​\(t\)v\_\{1\}^\{\\top\}\\mathbf\{K\}\_\{t\}^\{\-1\}v\_\{1\}=1/\\mu\_\{1\}\(t\)must be approximated by a degree\-2​L2Lpolynomial on\[1/κ,1\]\[1/\\kappa,1\]\. The classical Chebyshev lower bound gives error at leastΩ​\(ρ2​L\)\\Omega\(\\rho^\{2L\}\)whereρ=\(κ−1\)/\(κ\+1\)\\rho=\(\\sqrt\{\\kappa\}\-1\)/\(\\sqrt\{\\kappa\}\+1\)\. Solving forLLyields the stated bound\. ∎

The full proof, including the Jacobian evolution through representation\-based attention layers, the eigenvalue\-controlled kernel family construction, and spectral analysis, appears in Appendix[D](https://arxiv.org/html/2605.24210#A4)\.

###### Corollary 16\(Tight Characterization\)\.

For well\-conditioned context geometries,L=Θ​\(κ​log⁡\(1/ε\)\)L=\\Theta\(\\sqrt\{\\kappa\}\\log\(1/\\varepsilon\)\)self\-attention layers are both necessary and sufficient forε\\varepsilon\-accurate GP posterior approximation\.

###### Theorem 18\(Dimension Scaling\)\.

To represent GP posterior predictions on contexts of sizenn: \(a\) CNP and ANP require infinitedd; \(b\) TNP requiresd=O​\(n\)d=O\(n\)\.

## 7Convolutional Neural Processes

ConvCNPs replace finite\-dimensional aggregation with functional convolutional representations\. Two structural properties distinguish them\.

###### Proposition 19\(Translation Equivariance\)\.

Every ConvCNP satisfiesF​\(\{\(xi\+τ,yi\)\},xt\+τ\)=F​\(\{\(xi,yi\)\},xt\)F\(\\\{\(x\_\{i\}\+\\tau,y\_\{i\}\)\\\},x\_\{t\}\+\\tau\)=F\(\\\{\(x\_\{i\},y\_\{i\}\)\\\},x\_\{t\}\)for allτ∈ℝdx\\tau\\in\\mathbb\{R\}^\{d\_\{x\}\}\.

###### Proposition 20\(Injectivity\)\.

Ifwwhas non\-vanishing Fourier transform andhhis injective, then the convolutional aggregationC↦\(sC​\(⋅\),ρC​\(⋅\)\)C\\mapsto\(s\_\{C\}\(\\cdot\),\\rho\_\{C\}\(\\cdot\)\)is injective up to permutation on contexts with distinct locations\.

Thus ConvCNPs avoid the collision problem of[Proposition4](https://arxiv.org/html/2605.24210#Thmtheorem4), but at the cost that any non\-translation\-equivariant predictive map lies outside the ConvCNP function class\. Proofs appear in Appendix[E](https://arxiv.org/html/2605.24210#A5)\.

###### Proposition 21\(Pure ConvCNP Represents Stationary Kernel Smoothers\)\.

For any continuous stationary kernelKK, the Nadaraya–Watson estimatorF​\(C,xt\)=∑iK​\(xt−xi\)​yi/∑iK​\(xt−xi\)F\(C,x\_\{t\}\)=\\sum\_\{i\}K\(x\_\{t\}\-x\_\{i\}\)y\_\{i\}/\\sum\_\{i\}K\(x\_\{t\}\-x\_\{i\}\)is exactly representable by a pure ConvCNP\. However, the GP posterior mean is not representable by any pure ConvCNP \(the same factorization barrier as ANPs applies\)\.

#### CNN layers enable context–context coupling\.

On a regular grid, each CNN layer with residual connection implements a Toeplitz matrix iteration𝐫\(ℓ\)=\(𝐈\+𝐀ℓ\)​𝐫\(ℓ−1\)\\mathbf\{r\}^\{\(\\ell\)\}=\(\\mathbf\{I\}\+\\mathbf\{A\}\_\{\\ell\}\)\\mathbf\{r\}^\{\(\\ell\-1\)\}, analogous to the TNP update but with Toeplitz structure\.

###### Theorem 22\(ConvCNP Depth for GP Posteriors\)\.

On a regular grid with spacingδ\\delta, a ConvCNP withLLCNN layers achieves GP posterior approximation errorO​\(Bk​By​λmin−1​\(\(κ−1\)/\(κ\+1\)\)L\)\+O​\(δ\)O\(B\_\{k\}B\_\{y\}\\lambda\_\{\\min\}^\{\-1\}\(\(\\sqrt\{\\kappa\}\-1\)/\(\\sqrt\{\\kappa\}\+1\)\)^\{L\}\)\+O\(\\delta\), matching the Chebyshev convergence rate of TNPs\.

A distinctive feature of ConvCNPs is the depth–support tradeoff: on periodic grids, all matrices are circulant and simultaneously diagonalized by the DFT\. With unrestricted filter supportp=np=n, a single CNN layer suffices \(Proposition[55](https://arxiv.org/html/2605.24210#Thmtheorem55)in Appendix[E](https://arxiv.org/html/2605.24210#A5)\)\. With restricted support, the required productL⋅⌊p/2⌋L\\cdot\\lfloor p/2\\rflooris governed by the trigonometric approximation number of1/K^​\(ω\)1/\\hat\{K\}\(\\omega\)\(Theorem[56](https://arxiv.org/html/2605.24210#Thmtheorem56)in Appendix[E](https://arxiv.org/html/2605.24210#A5)\), recovering theκ\\sqrt\{\\kappa\}rate\.

###### Theorem 23\(ConvCNPs and ANPs Are Incomparable\)\.

ℱANP⊈ℱConvCNP\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}\\not\\subseteq\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}\(non\-stationary kernel smoothers violate translation equivariance\) andℱConvCNP⊈ℱANP\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}\\not\\subseteq\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}\(ConvCNPs with CNN depth approximate GP posteriors, which are not ANP\-representable\)\.

## 8The Expressiveness Hierarchy

###### Theorem 24\(Expressiveness Hierarchy\)\.

For alld≥1d\\geq 1andL≥0L\\geq 0:

1. \(a\)Attention branch:ℱCNP\(d\)⊊ℱANP\(d\)⊊ℱTNP\(1,d\)⊊ℱTNP\(2,d\)⊊⋯\\mathcal\{F\}\_\{\\mathrm\{CNP\}\}^\{\(d\)\}\\subsetneq\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}\\subsetneq\\mathcal\{F\}\_\{\\mathrm\{TNP\}\}^\{\(1,d\)\}\\subsetneq\\mathcal\{F\}\_\{\\mathrm\{TNP\}\}^\{\(2,d\)\}\\subsetneq\\cdots
2. \(b\)Convolutional branch:ℱCNP\(d\)⊊ℱConvCNP\(0\)⊊ℱConvCNP\(1\)⊊ℱConvCNP\(2\)⊊⋯\\mathcal\{F\}\_\{\\mathrm\{CNP\}\}^\{\(d\)\}\\subsetneq\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(0\)\}\\subsetneq\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(1\)\}\\subsetneq\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(2\)\}\\subsetneq\\cdots
3. \(c\)Incomparability:ℱANP\(d\)⊈ℱConvCNP\(L\)\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}\\not\\subseteq\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(L\)\}andℱConvCNP\(L\)⊈ℱANP\(d\)\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(L\)\}\\not\\subseteq\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}for allL,dL,d\.

CNPANPPure ConvCNPTNP \(L=1L\{=\}1\)ConvCNP \(L=1L\{=\}1\)TNP \(LLlayers\)ConvCNP \(LLCNN\)⋮\\vdots⋮\\vdotsincomparableFigure 1:The expressiveness hierarchy of Neural Process architectures\. Solid arrows denote strict inclusion \(⊊\\subsetneq\)\. The dashed line marks incomparability\. For stationary kernels on regular grids, the ConvCNP branch achieves the same Chebyshev convergence rate as the TNP branch\.The proof assembles the results from previous sections; the full argument appears in Appendix[F](https://arxiv.org/html/2605.24210#A6)\.

## 9Latent Neural Processes

Latent NPs augment the deterministic pathway with a global latentz∼q​\(z\|C\)z\\sim q\(z\|C\),z∈ℝkz\\in\\mathbb\{R\}^\{k\}, yieldingp​\(yT\|XT,C\)=∫p​\(yT\|XT,z\)​q​\(z\|C\)​𝑑zp\(y\_\{T\}\|X\_\{T\},C\)=\\int p\(y\_\{T\}\|X\_\{T\},z\)q\(z\|C\)dz\. The encoderq​\(z\|C\)q\(z\|C\)inherits the limitations of the underlying architecture: ifC∼hC′C\\sim\_\{h\}C^\{\\prime\}for a latent CNP, thenq​\(z\|C\)=q​\(z\|C′\)q\(z\|C\)=q\(z\|C^\{\\prime\}\)regardless of decoder expressiveness\. Thus the impossibility results for CNPs and ANPs lift directly to their latent variants\.

Assuming an arbitrarily powerful encoder, the constraints of finite latent dimension remain:

###### Theorem 25\(Latent NP Cannot Represent GP Posterior\)\.

For a Gaussian latent NP with latent dimensionkkand linear decoder: \(a\) Mean matching for allyC∈ℝny\_\{C\}\\in\\mathbb\{R\}^\{n\}requiresk≥nk\\geq n\. \(b\) Covariance matching atmmtarget points requiresk≥mk\\geq m\. \(c\) Matching for arbitrary target configurations requiresk=∞k=\\infty\.

Finite\-dimensional latent NPs exactly represent the class of rank\-kkGPs: processes of the formf​\(x\)=a​\(x\)⊤​z\+b​\(x\)f\(x\)=a\(x\)^\{\\top\}z\+b\(x\)withz∼𝒩​\(m,S\)z\\sim\\mathcal\{N\}\(m,S\)\. For full\-rank kernels, truncating the Mercer expansion gives approximation error governed by the spectral tail∑j\>kλj\\sum\_\{j\>k\}\\lambda\_\{j\}, with rates depending on kernel smoothness \(exponential for RBF, polynomial for Matérn\)\. Full proofs appear in Appendix[G](https://arxiv.org/html/2605.24210#A7)\.

## 10Discussion

Table[1](https://arxiv.org/html/2605.24210#S10.T1)summarizes our results\. The analysis reveals that CNPs are limited to functions of finitely many expected features; ANPs extend this to query\-dependent reweightings but still factor across context points; only TNPs can capture global structure via self\-attention, with tight depth bounds ofΘ​\(κ​log⁡\(1/ε\)\)\\Theta\(\\sqrt\{\\kappa\}\\log\(1/\\varepsilon\)\)\. ConvCNPs and ANPs are incomparable, separated by stationarity versus translation equivariance\.

Table 1:Architecture capabilities\. Latent variants add coherent sampling but do not circumvent encoder limitations\.Our results provide an answer to a theoretical question open sinceGarneloet al\.\([2018a](https://arxiv.org/html/2605.24210#bib.bib14)\)introduced NPs regarding what functions these architectures can represent\. The analysis reveals a hierarchy\. CNPs with mean aggregation are limited to functions of finitely many expected features of the context distribution which is sufficient for low\-dimensional parametric families but unable to capture context\-context interactions\. ANPs extend this to query\-dependent reweightings, enabling kernel smoothers and local adaptation but still factoring across context points\. Only TNPs with self\-attention can capture the global structure needed for GP posteriors, with depth requirements scaling logarithmically in the kernel matrix condition number\.

The CNP impossibility follows from a dimension\-counting argument\. the set ofhh\-equivalent contexts forms an\(n−1\)​d\(n\-1\)d\-dimensional subspace, so collisions are generic rather than exceptional\. The ANP impossibility is more subtle and stems from a factorization constraint because attention weightsαi∝f​\(xt,xi,yi\)\\alpha\_\{i\}\\propto f\(x\_\{t\},x\_\{i\},y\_\{i\}\)decompose across context points, while GP posterior weights couple all points through𝐊−1\\mathbf\{K\}^\{\-1\}\. This coupling\-versus\-factorization distinction, visible even with just two context points, is the fundamental reason self\-attention is necessary\. The TNP construction exploits the Neumann series𝐊−1=∑m=0∞\(𝐈−𝐊\)m\\mathbf\{K\}^\{\-1\}=\\sum\_\{m=0\}^\{\\infty\}\(\\mathbf\{I\}\-\\mathbf\{K\}\)^\{m\}, which requires learning an appropriate normalization of the kernel matrix, which is a hidden capacity requirement that may explain why TNPs benefit from careful initialization in practice\. Orthogonally to representational capacity, the validity of NP predictions as stochastic processes is governed by conditioning consistency, where the gap for CNPs vanishes asO​\(1/n2\)O\(1/n^\{2\}\)in context size\(Young,[2026a](https://arxiv.org/html/2605.24210#bib.bib13)\), while the quantitative cost of amortization decomposes into label contamination, information bottleneck, and encoder sharing terms\(Young,[2026b](https://arxiv.org/html/2605.24210#bib.bib32)\), with the bottleneck decay rates matching the spectral tail bounds of[Corollary66](https://arxiv.org/html/2605.24210#Thmtheorem66)\.

Several limitations warrant discussion\. Our analysis characterizes representational capacity, not learnability\. A function being TNP\-representable does not guarantee that gradient descent will find the right parameters, and the sample complexity of learning within each function class remains open\. We also assume idealized attention with exact softmax and infinite precision; practical implementations with learned temperatures, finite precision, and approximate attention kernels may exhibit different effective capacity\. For standard kernels \(e\.g\. RBF, Matérn\), the condition numberκ​\(𝐊\)\\kappa\(\\mathbf\{K\}\)typically grows with context sizenn, so the depth boundΘ​\(κ​log⁡\(1/ε\)\)\\Theta\(\\sqrt\{\\kappa\}\\log\(1/\\varepsilon\)\)implies that fixed\-depth TNPs face representational limitations on large contexts\. This is an architectural insight, not a limitation of the analysis as it quantifies the cost of approximating GP posteriors as context sets grow\.

Convolutional CNPs\(Gordonet al\.,[2020](https://arxiv.org/html/2605.24210#bib.bib16)\)replace finite\-dimensional aggregation with functional representations, sidestepping the dimension\-counting arguments of Section[4](https://arxiv.org/html/2605.24210#S4)\. The picture is more nuanced than a simple placement in the hierarchy\. The pure ConvCNP \(without CNN\) sits at the ANP level as its weights on context points factorize, enabling stationary kernel smoothers but not GP posteriors\. However, the full ConvCNP with CNN layers accesses TNP\-level expressiveness for stationary kernels on regular grids, achieving the same Chebyshev convergence rate\(κ−1\)/\(κ\+1\)\(\\sqrt\{\\kappa\}\-1\)/\(\\sqrt\{\\kappa\}\+1\)per layer \(Theorem[22](https://arxiv.org/html/2605.24210#Thmtheorem22)\)\. The structural difference is that CNN iterations are Toeplitz \(constant along diagonals\), while TNP attention matrices are unconstrained\. This makes ConvCNPs and ANPs incomparable rather than nested as each contains functions outside the other \(Theorem[23](https://arxiv.org/html/2605.24210#Thmtheorem23)\)\. The cost of irregularity remains an open question\.

We also offer some thoughts on practicalities\. For few\-shot regression on parametric families, CNPs should suffice and attention adds complexity without benefit\. For image completion and spatial prediction requiring local adaptation, ANPs provide the right inductive bias\. For tasks demanding global coherence such as calibrated uncertainty in Bayesian optimization, consistent identity in face completion, exact GP emulation, TNPs are necessary\. For spatially structured tasks with stationary kernels and approximately regular observation grids, which is common in environmental monitoring and climate modeling, ConvCNPs offer an attractive middle ground with TNP\-level posterior approximation with the parameter efficiency and equivariance guarantees of convolutional architectures\.

## References

- J\. Baxter \(2000\)A model of inductive bias learning\.J\. Artif\. Int\. Res\.12\(1\),pp\. 149–198\.External Links:ISSN 1076\-9757Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p6.1)\.
- T\. Berger \(1971\)Rate distortion theory: a mathematical basis for data compression\.Prentice\-Hall Series in Information and System Sciences,Prentice Hall\.External Links:ISBN 9780137531035Cited by:[Appendix A](https://arxiv.org/html/2605.24210#A1.10.p2.9)\.
- M\. Garnelo, D\. Rosenbaum, C\. Maddison, T\. Ramalho, D\. Saxton, M\. Shanahan, Y\. W\. Teh, D\. Rezende, and S\. M\. A\. Eslami \(2018a\)Conditional neural processes\.InProceedings of the 35th International Conference on Machine Learning,J\. Dy and A\. Krause \(Eds\.\),Proceedings of Machine Learning Research, Vol\.80,pp\. 1704–1713\.External Links:[Link](https://proceedings.mlr.press/v80/garnelo18a.html)Cited by:[§1](https://arxiv.org/html/2605.24210#S1.p1.1),[§10](https://arxiv.org/html/2605.24210#S10.p2.1),[§2](https://arxiv.org/html/2605.24210#S2.p1.1)\.
- M\. Garnelo, J\. Schwarz, D\. Rosenbaum, F\. Viola, D\. J\. Rezende, S\. M\. A\. Eslami, and Y\. W\. Teh \(2018b\)Neural processes\.External Links:1807\.01622,[Link](https://arxiv.org/abs/1807.01622)Cited by:[§1](https://arxiv.org/html/2605.24210#S1.p1.1),[§2](https://arxiv.org/html/2605.24210#S2.p1.1)\.
- G\. H\. Golub and C\. F\. Van Loan \(2013\)Matrix computations \- 4th edition\.edition,Johns Hopkins University Press,Philadelphia, PA\.External Links:[Document](https://dx.doi.org/10.1137/1.9781421407944),[Link](https://epubs.siam.org/doi/abs/10.1137/1.9781421407944),https://epubs\.siam\.org/doi/pdf/10\.1137/1\.9781421407944Cited by:[§C\.4](https://arxiv.org/html/2605.24210#A3.SS4.1.p1.1),[§2](https://arxiv.org/html/2605.24210#S2.p5.1)\.
- J\. Gordon, W\. P\. Bruinsma, A\. Y\. K\. Foong, J\. Requeima, Y\. Dubois, and R\. E\. Turner \(2020\)Convolutional conditional neural processes\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Skey4eBYPS)Cited by:[§1](https://arxiv.org/html/2605.24210#S1.p2.1),[§10](https://arxiv.org/html/2605.24210#S10.p5.1),[§2](https://arxiv.org/html/2605.24210#S2.p1.1)\.
- K\. Hornik, M\. Stinchcombe, and H\. White \(1989\)Multilayer feedforward networks are universal approximators\.Neural Networks2\(5\),pp\. 359–366\.External Links:[Document](https://dx.doi.org/10.1016/0893-6080%2889%2990020-8),ISSN 0893\-6080Cited by:[Appendix B](https://arxiv.org/html/2605.24210#A2.1.p1.10)\.
- S\. T\. Jose and O\. Simeone \(2021\)Information\-theoretic generalization bounds for meta\-learning and applications\.Entropy23\(1\)\.External Links:[Link](https://www.mdpi.com/1099-4300/23/1/126),ISSN 1099\-4300,[Document](https://dx.doi.org/10.3390/e23010126)Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p6.1)\.
- H\. Kim, A\. Mnih, J\. Schwarz, M\. Garnelo, A\. Eslami, D\. Rosenbaum, O\. Vinyals, and Y\. W\. Teh \(2019\)Attentive neural processes\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SkE6PjC9KX)Cited by:[§1](https://arxiv.org/html/2605.24210#S1.p2.1),[§2](https://arxiv.org/html/2605.24210#S2.p1.1)\.
- T\. Nguyen and A\. Grover \(2022\)Transformer neural processes: uncertainty\-aware meta\-learning via sequence modeling\.InProceedings of the 39th International Conference on Machine Learning,K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvári, G\. Niu, and S\. Sabato \(Eds\.\),Proceedings of Machine Learning Research, Vol\.162,pp\. 16569–16594\.External Links:[Link](https://proceedings.mlr.press/v162/nguyen22b.html)Cited by:[§1](https://arxiv.org/html/2605.24210#S1.p2.1),[§2](https://arxiv.org/html/2605.24210#S2.p1.1)\.
- A\. Pentina and C\. Lampert \(2014\)A pac\-bayesian bound for lifelong learning\.InProceedings of the 31st International Conference on Machine Learning,E\. P\. Xing and T\. Jebara \(Eds\.\),Proceedings of Machine Learning Research, Vol\.32,Bejing, China,pp\. 991–999\.External Links:[Link](https://proceedings.mlr.press/v32/pentina14.html)Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p6.1)\.
- J\. Pérez, P\. Barceló, and J\. Marinkovic \(2021\)Attention is turing\-complete\.Journal of Machine Learning Research22\(75\),pp\. 1–35\.External Links:[Link](http://jmlr.org/papers/v22/20-302.html)Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p4.2)\.
- A\. Rahimi and B\. Recht \(2007\)Random features for large\-scale kernel machines\.InAdvances in Neural Information Processing Systems,J\. Platt, D\. Koller, Y\. Singer, and S\. Roweis \(Eds\.\),Vol\.20,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2007/file/013a006f03dbc5392effeb8f18fda755-Paper.pdf)Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p3.2)\.
- J\. Rothfuss, V\. Fortuin, M\. Josifoski, and A\. Krause \(2021\)PACOH: bayes\-optimal meta\-learning with pac\-guarantees\.InProceedings of the 38th International Conference on Machine Learning,M\. Meila and T\. Zhang \(Eds\.\),Proceedings of Machine Learning Research, Vol\.139,pp\. 9116–9126\.External Links:[Link](https://proceedings.mlr.press/v139/rothfuss21a.html)Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p6.1)\.
- E\. Snelson and Z\. Ghahramani \(2005\)Sparse gaussian processes using pseudo\-inputs\.InAdvances in Neural Information Processing Systems,Y\. Weiss, B\. Schölkopf, and J\. Platt \(Eds\.\),Vol\.18,pp\. 1257–1264\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2005/file/4491777b1aa8b5b32c2e8666dbe1a495-Paper.pdf)Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p3.2)\.
- M\. Titsias \(2009\)Variational learning of inducing variables in sparse gaussian processes\.InProceedings of the Twelfth International Conference on Artificial Intelligence and Statistics,D\. van Dyk and M\. Welling \(Eds\.\),Proceedings of Machine Learning Research, Vol\.5,Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA,pp\. 567–574\.External Links:[Link](https://proceedings.mlr.press/v5/titsias09a.html)Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p3.2)\.
- L\. N\. Trefethen \(2019\)Approximation theory and approximation practice, extended edition\.Society for Industrial & Applied Mathematics,Philadelphia, PA, USA\.External Links:ISBN 9781611975932Cited by:[§C\.4](https://arxiv.org/html/2605.24210#A3.SS4.8.p2.8),[§D\.6](https://arxiv.org/html/2605.24210#A4.SS6.1.p1.3),[Appendix F](https://arxiv.org/html/2605.24210#A6.8.p8.10),[§2](https://arxiv.org/html/2605.24210#S2.p5.1)\.
- E\. Wagstaff, F\. Fuchs, M\. Engelcke, I\. Posner, and M\. A\. Osborne \(2019\)On the limitations of representing functions on sets\.InProceedings of the 36th International Conference on Machine Learning,K\. Chaudhuri and R\. Salakhutdinov \(Eds\.\),Proceedings of Machine Learning Research, Vol\.97,pp\. 6487–6494\.External Links:[Link](https://proceedings.mlr.press/v97/wagstaff19a.html)Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p2.1)\.
- A\. Wilson and H\. Nickisch \(2015\)Kernel interpolation for scalable structured gaussian processes \(kiss\-gp\)\.InProceedings of the 32nd International Conference on Machine Learning,F\. Bach and D\. Blei \(Eds\.\),Proceedings of Machine Learning Research, Vol\.37,Lille, France,pp\. 1775–1784\.External Links:[Link](https://proceedings.mlr.press/v37/wilson15.html)Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p3.2)\.
- R\. Young \(2026a\)On the conditioning consistency gap in conditional neural processes\.Transactions on Machine Learning Research\.Note:External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=rLJ5Hm5vbG)Cited by:[§10](https://arxiv.org/html/2605.24210#S10.p3.6)\.
- R\. Young \(2026b\)Three costs of amortizing gaussian process inference with neural processes\.External Links:2605\.21798,[Link](https://arxiv.org/abs/2605.21798)Cited by:[§10](https://arxiv.org/html/2605.24210#S10.p3.6)\.
- C\. Yun, S\. Bhojanapalli, A\. S\. Rawat, S\. Reddi, and S\. Kumar \(2020\)Are transformers universal approximators of sequence\-to\-sequence functions?\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ByxRM0Ntvr)Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p4.2)\.
- M\. Zaheer, S\. Kottur, S\. Ravanbakhsh, B\. Poczos, R\. R\. Salakhutdinov, and A\. J\. Smola \(2017\)Deep sets\.InAdvances in Neural Information Processing Systems,I\. Guyon, U\. V\. Luxburg, S\. Bengio, H\. Wallach, R\. Fergus, S\. Vishwanathan, and R\. Garnett \(Eds\.\),Vol\.30,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2017/file/f22e4747da1aa27e363d86d40ff442fe-Paper.pdf)Cited by:[§2](https://arxiv.org/html/2605.24210#S2.p2.1)\.

## Appendix ACNP Proofs

###### Proof of Theorem[3](https://arxiv.org/html/2605.24210#Thmtheorem3)\(CNP Characterization\)\.

\(⇒\)\(\\Rightarrow\)IfF∈ℱmoment\(d\)F\\in\\mathcal\{F\}\_\{\\mathrm\{moment\}\}^\{\(d\)\}, thenF​\(C,xt\)=ϕ​\(1n​∑iψ​\(xi,yi\),xt\)F\(C,x\_\{t\}\)=\\phi\(\\frac\{1\}\{n\}\\sum\_\{i\}\\psi\(x\_\{i\},y\_\{i\}\),x\_\{t\}\)\. Settingh=ψh=\\psiandg=ϕg=\\phigives exact representation\.

\(⇐\)\(\\Leftarrow\)IfFFisdd\-representable, thenF​\(C,xt\)=g​\(h¯C,xt\)F\(C,x\_\{t\}\)=g\(\\bar\{h\}\_\{C\},x\_\{t\}\)for someh,gh,g\. This is exactly the form ofℱmoment\(d\)\\mathcal\{F\}\_\{\\mathrm\{moment\}\}^\{\(d\)\}withψ=h\\psi=handϕ=g\\phi=g\. ∎

###### Proof of Proposition[4](https://arxiv.org/html/2605.24210#Thmtheorem4)\(Existence of Collisions\)\.

Consider the mapΦ:\(𝒳×𝒴\)n→ℝd\\Phi:\(\\mathcal\{X\}\\times\\mathcal\{Y\}\)^\{n\}\\to\\mathbb\{R\}^\{d\}defined byΦ​\(C\)=1n​∑i=1nh​\(xi,yi\)\\Phi\(C\)=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}h\(x\_\{i\},y\_\{i\}\)\. The domain has dimensionn​\(dx\+dy\)n\(d\_\{x\}\+d\_\{y\}\); the codomain has dimensiondd\. Whenn​\(dx\+dy\)\>dn\(d\_\{x\}\+d\_\{y\}\)\>d, the map cannot be injective, so distinct context sets can have identical representations\.

More explicitly, for fixedhh\-images\(h1,…,hn\)∈\(ℝd\)n\(h\_\{1\},\\ldots,h\_\{n\}\)\\in\(\\mathbb\{R\}^\{d\}\)^\{n\}, perturbations\(δ1,…,δn\)\(\\delta\_\{1\},\\ldots,\\delta\_\{n\}\)satisfying∑iδi=0\\sum\_\{i\}\\delta\_\{i\}=0form a subspace of dimension\(n−1\)​d\(n\-1\)d\. Sincehhis continuous and maps from a higher\-dimensional space whenn​\(dx\+dy\)\>dn\(d\_\{x\}\+d\_\{y\}\)\>d, generic perturbations in\(𝒳×𝒴\)n\(\\mathcal\{X\}\\times\\mathcal\{Y\}\)^\{n\}induce such mean\-preserving perturbations in the representation space\. ∎

###### Proof of Theorem[6](https://arxiv.org/html/2605.24210#Thmtheorem6)\(GP Posterior is Not Finitely Representable\)\.

The GP posterior mean depends on the inverse Gram matrix𝐊−1\\mathbf\{K\}^\{\-1\}, which couples all context points\. The weight onyiy\_\{i\}in the posterior mean depends on all pairwise kernel valuesk​\(xi,xj\)k\(x\_\{i\},x\_\{j\}\)forj≠ij\\neq i\.

By[Proposition4](https://arxiv.org/html/2605.24210#Thmtheorem4), for any encoderhh, there exist distinct context setsC,C′C,C^\{\\prime\}withh¯C=h¯C′\\bar\{h\}\_\{C\}=\\bar\{h\}\_\{C^\{\\prime\}\}\. We constructC,C′C,C^\{\\prime\}with identical mean encodings but different x\-configurations, hence different Gram matrices𝐊​\(XC,XC\)≠𝐊​\(XC′,XC′\)\\mathbf\{K\}\(X\_\{C\},X\_\{C\}\)\\neq\\mathbf\{K\}\(X\_\{C^\{\\prime\}\},X\_\{C^\{\\prime\}\}\)\.

Specifically, withn=d\+1n=d\+1context points, letCChave points\{x1,…,xd\+1\}\\\{x\_\{1\},\\ldots,x\_\{d\+1\}\\\}in general position andC′C^\{\\prime\}have points\{x1′,…,xd\+1′\}\\\{x\_\{1\}^\{\\prime\},\\ldots,x\_\{d\+1\}^\{\\prime\}\\\}arranged so thath¯C=h¯C′\\bar\{h\}\_\{C\}=\\bar\{h\}\_\{C^\{\\prime\}\}\(possible by the null space argument\) but the pairwise distances differ\.

For the same targetxtx\_\{t\}and y\-values, the GP posterior means differ:

μ​\(xt\|C\)=k​\(xt,XC\)​𝐊​\(XC,XC\)−1​yC≠k​\(xt,XC′\)​𝐊​\(XC′,XC′\)−1​yC=μ​\(xt\|C′\)\.\\mu\(x\_\{t\}\|C\)=k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}\(X\_\{C\},X\_\{C\}\)^\{\-1\}y\_\{C\}\\neq k\(x\_\{t\},X\_\{C^\{\\prime\}\}\)\\mathbf\{K\}\(X\_\{C^\{\\prime\}\},X\_\{C^\{\\prime\}\}\)^\{\-1\}y\_\{C\}=\\mu\(x\_\{t\}\|C^\{\\prime\}\)\.Thus no CNP with representation dimensionddcan exactly represent the GP posterior on contexts of sizen\>dn\>d\. ∎

###### Proof of Theorem[7](https://arxiv.org/html/2605.24210#Thmtheorem7)\(CNP Approximation Lower Bound\)\.

We proceed in four steps\.

Step 1: Reduction to linear encoders\.Since the targetμ​\(xt\|C\)=k​\(xt,XC\)​𝐊−1​yC\\mu\(x\_\{t\}\|C\)=k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}y\_\{C\}is linear inyCy\_\{C\}andyCy\_\{C\}is Gaussian, the optimal encoder is linear\. This follows from a standard result in Gaussian rate\-distortion theory: for jointly Gaussian\(yC,T​\(yC\)\)\(y\_\{C\},T\(y\_\{C\}\)\)withTTlinear, the minimum MSE estimator ofT​\(yC\)T\(y\_\{C\}\)given any functionϕ​\(yC\)\\phi\(y\_\{C\}\)compressed todddimensions is achieved by a linearϕ\\phi\(seeBerger \([1971](https://arxiv.org/html/2605.24210#bib.bib3)\)for example\)\.

For fixed context locations, a CNP encoderh:𝒳×𝒴→ℝdh:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}^\{d\}reduces toh¯C=1n​∑i=1nϕi​\(yi\)\\bar\{h\}\_\{C\}=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\phi\_\{i\}\(y\_\{i\}\)for functionsϕi:ℝ→ℝd\\phi\_\{i\}:\\mathbb\{R\}\\to\\mathbb\{R\}^\{d\}\. Linearity givesϕi​\(yi\)=ai​yi\\phi\_\{i\}\(y\_\{i\}\)=a\_\{i\}y\_\{i\}for someai∈ℝda\_\{i\}\\in\\mathbb\{R\}^\{d\}, soh¯C=A​yC\\bar\{h\}\_\{C\}=Ay\_\{C\}for a matrixA∈ℝd×nA\\in\\mathbb\{R\}^\{d\\times n\}\.

Step 2: Whitening\.Letz=𝐊−1/2​yC∼𝒩​\(0,𝐈n\)z=\\mathbf\{K\}^\{\-1/2\}y\_\{C\}\\sim\\mathcal\{N\}\(0,\\mathbf\{I\}\_\{n\}\)\. Define:

w~xt\\displaystyle\\tilde\{w\}\_\{x\_\{t\}\}=𝐊−1/2​k​\(xt,XC\)∈ℝn\\displaystyle=\\mathbf\{K\}^\{\-1/2\}k\(x\_\{t\},X\_\{C\}\)\\in\\mathbb\{R\}^\{n\}A~\\displaystyle\\tilde\{A\}=A​𝐊1/2∈ℝd×n\\displaystyle=A\\mathbf\{K\}^\{1/2\}\\in\\mathbb\{R\}^\{d\\times n\}Thenμ​\(xt\|C\)=w~xt⊤​z\\mu\(x\_\{t\}\|C\)=\\tilde\{w\}\_\{x\_\{t\}\}^\{\\top\}zandh¯C=A~​z\\bar\{h\}\_\{C\}=\\tilde\{A\}z\.

Step 3: Optimal reconstruction\.For Gaussianzz, the optimal predictor ofw~xt⊤​z\\tilde\{w\}\_\{x\_\{t\}\}^\{\\top\}zgivenA~​z\\tilde\{A\}zis:

𝔼​\[w~xt⊤​z\|A~​z\]=w~xt⊤​PA~​z\\mathbb\{E\}\[\\tilde\{w\}\_\{x\_\{t\}\}^\{\\top\}z\|\\tilde\{A\}z\]=\\tilde\{w\}\_\{x\_\{t\}\}^\{\\top\}P\_\{\\tilde\{A\}\}zwherePA~=A~⊤​\(A~​A~⊤\)−1​A~P\_\{\\tilde\{A\}\}=\\tilde\{A\}^\{\\top\}\(\\tilde\{A\}\\tilde\{A\}^\{\\top\}\)^\{\-1\}\\tilde\{A\}is the orthogonal projection onto thedd\-dimensional row space ofA~\\tilde\{A\}\. The MSE at targetxtx\_\{t\}is:

𝔼​\[\|w~xt⊤​z−w~xt⊤​PA~​z\|2\]=‖\(𝐈−PA~\)​w~xt‖2\.\\mathbb\{E\}\[\|\\tilde\{w\}\_\{x\_\{t\}\}^\{\\top\}z\-\\tilde\{w\}\_\{x\_\{t\}\}^\{\\top\}P\_\{\\tilde\{A\}\}z\|^\{2\}\]=\\\|\(\\mathbf\{I\}\-P\_\{\\tilde\{A\}\}\)\\tilde\{w\}\_\{x\_\{t\}\}\\\|^\{2\}\.
Step 4: Integrated error\.Define the whitened weight matrixW~∈ℝn×m\\tilde\{W\}\\in\\mathbb\{R\}^\{n\\times m\}with columnsw~x1∗,…,w~xm∗\\tilde\{w\}\_\{x\_\{1\}^\{\*\}\},\\ldots,\\tilde\{w\}\_\{x\_\{m\}^\{\*\}\}for target pointsx1∗,…,xm∗∼νx\_\{1\}^\{\*\},\\ldots,x\_\{m\}^\{\*\}\\sim\\nu\. The total MSE is:

∑j=1m‖\(𝐈−PA~\)​w~xj∗‖2=‖\(𝐈−PA~\)​W~‖F2\.\\sum\_\{j=1\}^\{m\}\\\|\(\\mathbf\{I\}\-P\_\{\\tilde\{A\}\}\)\\tilde\{w\}\_\{x\_\{j\}^\{\*\}\}\\\|^\{2\}=\\\|\(\\mathbf\{I\}\-P\_\{\\tilde\{A\}\}\)\\tilde\{W\}\\\|\_\{F\}^\{2\}\.This is minimized whenPA~P\_\{\\tilde\{A\}\}projects onto the topddleft singular vectors ofW~\\tilde\{W\}\. IfW~\\tilde\{W\}has singular valuesσ1≥⋯≥σn\\sigma\_\{1\}\\geq\\cdots\\geq\\sigma\_\{n\}\(withσi=0\\sigma\_\{i\}=0fori\>rank⁡\(W~\)i\>\\operatorname\{rank\}\(\\tilde\{W\}\)\), the minimum error is∑i=d\+1nσi2\\sum\_\{i=d\+1\}^\{n\}\\sigma\_\{i\}^\{2\}and the total variance is∑i=1nσi2\\sum\_\{i=1\}^\{n\}\\sigma\_\{i\}^\{2\}\.

Under the isotropic second moment assumption:

1m​W~​W~⊤=1m​∑j=1mw~xj∗​w~xj∗⊤→𝔼xt∼ν​\[w~xt​w~xt⊤\]=α​𝐈n\\frac\{1\}\{m\}\\tilde\{W\}\\tilde\{W\}^\{\\top\}=\\frac\{1\}\{m\}\\sum\_\{j=1\}^\{m\}\\tilde\{w\}\_\{x\_\{j\}^\{\*\}\}\\tilde\{w\}\_\{x\_\{j\}^\{\*\}\}^\{\\top\}\\to\\mathbb\{E\}\_\{x\_\{t\}\\sim\\nu\}\[\\tilde\{w\}\_\{x\_\{t\}\}\\tilde\{w\}\_\{x\_\{t\}\}^\{\\top\}\]=\\alpha\\mathbf\{I\}\_\{n\}for someα\>0\\alpha\>0\. Thus the singular values ofW~\\tilde\{W\}satisfyσi2≈α​m/n\\sigma\_\{i\}^\{2\}\\approx\\alpha m/nfor alli≤ni\\leq n, giving:

∑i=d\+1nσi2∑i=1nσi2=n−dn=1−dn\.∎\\frac\{\\sum\_\{i=d\+1\}^\{n\}\\sigma\_\{i\}^\{2\}\}\{\\sum\_\{i=1\}^\{n\}\\sigma\_\{i\}^\{2\}\}=\\frac\{n\-d\}\{n\}=1\-\\frac\{d\}\{n\}\.\\qed

We also record the linear regression example, which illustrates the representation requirements concretely\.

###### Proposition 26\(Linear Regression Requiresd=O​\(k2\)d=O\(k^\{2\}\)\)\.

Letℱlinear\(k\)\\mathcal\{F\}\_\{\\mathrm\{linear\}\}^\{\(k\)\}be the class of linear predictors withkk\-dimensional featuresψ:𝒳→ℝk\\psi:\\mathcal\{X\}\\to\\mathbb\{R\}^\{k\}:

F​\(C,xt\)=⟨β​\(C\),ψ​\(xt\)⟩,β​\(C\)=arg⁡minβ​∑i=1n\(yi−⟨β,ψ​\(xi\)⟩\)2\.F\(C,x\_\{t\}\)=\\langle\\beta\(C\),\\psi\(x\_\{t\}\)\\rangle,\\quad\\beta\(C\)=\\arg\\min\_\{\\beta\}\\sum\_\{i=1\}^\{n\}\(y\_\{i\}\-\\langle\\beta,\\psi\(x\_\{i\}\)\\rangle\)^\{2\}\.Then:

1. \(a\)d=k​\(k\+3\)2d=\\frac\{k\(k\+3\)\}\{2\}suffices:ℱlinear\(k\)⊆ℱmoment\(k​\(k\+3\)/2\)\\mathcal\{F\}\_\{\\mathrm\{linear\}\}^\{\(k\)\}\\subseteq\\mathcal\{F\}\_\{\\mathrm\{moment\}\}^\{\(k\(k\+3\)/2\)\}\.
2. \(b\)d=Ω​\(k2\)d=\\Omega\(k^\{2\}\)is necessary for exact representation\.

###### Proof\.

\(a\) The OLS solution isβ​\(C\)=\(∑iψ​\(xi\)​ψ​\(xi\)⊤\)−1​\(∑iyi​ψ​\(xi\)\)\\beta\(C\)=\(\\sum\_\{i\}\\psi\(x\_\{i\}\)\\psi\(x\_\{i\}\)^\{\\top\}\)^\{\-1\}\(\\sum\_\{i\}y\_\{i\}\\psi\(x\_\{i\}\)\)\. Define the encoder:

h​\(x,y\)=\(vech⁡\(ψ​\(x\)​ψ​\(x\)⊤\),y⋅ψ​\(x\)\)∈ℝk​\(k\+1\)/2\+k,h\(x,y\)=\(\\operatorname\{vech\}\(\\psi\(x\)\\psi\(x\)^\{\\top\}\),y\\cdot\\psi\(x\)\)\\in\\mathbb\{R\}^\{k\(k\+1\)/2\+k\},Thenh¯C=\(1n​vec​\(∑iψ​\(xi\)​ψ​\(xi\)⊤\),1n​∑iyi​ψ​\(xi\)\)\\bar\{h\}\_\{C\}=\(\\frac\{1\}\{n\}\\mathrm\{vec\}\(\\sum\_\{i\}\\psi\(x\_\{i\}\)\\psi\(x\_\{i\}\)^\{\\top\}\),\\frac\{1\}\{n\}\\sum\_\{i\}y\_\{i\}\\psi\(x\_\{i\}\)\)\. The decoder can recoverβ​\(C\)\\beta\(C\)by inverting the \(rescaled\) Gram matrix and multiplying\.

\(b\) The Gram matrix∑iψ​\(xi\)​ψ​\(xi\)⊤\\sum\_\{i\}\\psi\(x\_\{i\}\)\\psi\(x\_\{i\}\)^\{\\top\}hask​\(k\+1\)/2k\(k\+1\)/2degrees of freedom \(symmetric\)\. Different Gram matrices generically yield different predictors\. Thusd≥k​\(k\+1\)/2=Ω​\(k2\)d\\geq k\(k\+1\)/2=\\Omega\(k^\{2\}\)\. ∎

## Appendix BANP Proofs

###### Proof of Theorem[9](https://arxiv.org/html/2605.24210#Thmtheorem9)\(ANPs Represent Kernel Smoothers\)\.

Set:

- •Value:v​\(x,y\)=\(y,1\)∈ℝdy\+1v\(x,y\)=\(y,1\)\\in\\mathbb\{R\}^\{d\_\{y\}\+1\}
- •Key:k​\(x,y\)=ψ​\(x\)k\(x,y\)=\\psi\(x\)\(independent ofyy\)
- •Query:q​\(xt\)=ϕ​\(xt\)q\(x\_\{t\}\)=\\phi\(x\_\{t\}\)

Chooseϕ,ψ\\phi,\\psisuch thatq​\(xt\)⊤​k​\(xi,yi\)=log⁡K​\(xt,xi\)q\(x\_\{t\}\)^\{\\top\}k\(x\_\{i\},y\_\{i\}\)=\\log K\(x\_\{t\},x\_\{i\}\)\. Iflog⁡K\\log Kadmits a finite inner product decomposition, this is achieved exactly\. For general continuousKKon a compact domain, universal approximation\(Horniket al\.,[1989](https://arxiv.org/html/2605.24210#bib.bib5)\)gives networksϕ,ψ\\phi,\\psiwith\|q​\(xt\)⊤​k​\(xi,yi\)−log⁡K​\(xt,xi\)\|≤δ\|q\(x\_\{t\}\)^\{\\top\}k\(x\_\{i\},y\_\{i\}\)\-\\log K\(x\_\{t\},x\_\{i\}\)\|\\leq\\deltafor anyδ\>0\\delta\>0, yieldingε​\(δ\)\\varepsilon\(\\delta\)\-approximation of the kernel smoother withε​\(δ\)→0\\varepsilon\(\\delta\)\\to 0asδ→0\\delta\\to 0\.

In the exact case, the attention weights are:

αi​\(xt;C\)=exp⁡\(log⁡K​\(xt,xi\)\)∑jexp⁡\(log⁡K​\(xt,xj\)\)=K​\(xt,xi\)∑jK​\(xt,xj\)\.\\alpha\_\{i\}\(x\_\{t\};C\)=\\frac\{\\exp\(\\log K\(x\_\{t\},x\_\{i\}\)\)\}\{\\sum\_\{j\}\\exp\(\\log K\(x\_\{t\},x\_\{j\}\)\)\}=\\frac\{K\(x\_\{t\},x\_\{i\}\)\}\{\\sum\_\{j\}K\(x\_\{t\},x\_\{j\}\)\}\.
The representation is:

rC​\(xt\)=∑iαi​\(xt;C\)​\(yi,1\)=\(∑iK​\(xt,xi\)​yi∑iK​\(xt,xi\),1\)\.r\_\{C\}\(x\_\{t\}\)=\\sum\_\{i\}\\alpha\_\{i\}\(x\_\{t\};C\)\\,\(y\_\{i\},1\)=\\left\(\\frac\{\\sum\_\{i\}K\(x\_\{t\},x\_\{i\}\)y\_\{i\}\}\{\\sum\_\{i\}K\(x\_\{t\},x\_\{i\}\)\},1\\right\)\.
A linear decoder extracts the first component, which is exactly the kernel smoother\. ∎

###### Proof of Theorem[10](https://arxiv.org/html/2605.24210#Thmtheorem10)\(ANP Characterization\)\.

\(⇒\)\(\\Rightarrow\)IfFFis ANP\-representable, the attention mechanism gives exactly this form withs​\(xt,x,y\)=q​\(xt\)⊤​k​\(x,y\)/τs\(x\_\{t\},x,y\)=q\(x\_\{t\}\)^\{\\top\}k\(x,y\)/\\tau\.

\(⇐\)\(\\Leftarrow\)Any suchFFcan be implemented by settingq​\(xt\)⊤​k​\(x,y\)=s​\(xt,x,y\)⋅τq\(x\_\{t\}\)^\{\\top\}k\(x,y\)=s\(x\_\{t\},x,y\)\\cdot\\tauand using an appropriate decoderGG\. ∎

###### Proof of Theorem[12](https://arxiv.org/html/2605.24210#Thmtheorem12)\(GP Posterior Requires Context\-Context Coupling\)\.

The GP posterior mean is:

μ​\(xt\|C\)=k​\(xt,XC\)​𝐊​\(XC,XC\)−1​yC=∑i=1nwi​\(xt;C\)​yi\\mu\(x\_\{t\}\|C\)=k\(x\_\{t\},X\_\{C\}\)\\,\\mathbf\{K\}\(X\_\{C\},X\_\{C\}\)^\{\-1\}\\,y\_\{C\}=\\sum\_\{i=1\}^\{n\}w\_\{i\}\(x\_\{t\};C\)\\,y\_\{i\}wherewi​\(xt;C\)=\[k​\(xt,XC\)​𝐊−1\]iw\_\{i\}\(x\_\{t\};C\)=\[k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}\]\_\{i\}\.

The weightwiw\_\{i\}depends on𝐊−1\\mathbf\{K\}^\{\-1\}, which couples all context points\. Specifically,wiw\_\{i\}depends onk​\(xj,xm\)k\(x\_\{j\},x\_\{m\}\)forj,m≠ij,m\\neq i\.

ANP attention weights factor as:

αi​\(xt;C\)=f​\(xt,xi,yi\)∑jf​\(xt,xj,yj\)\\alpha\_\{i\}\(x\_\{t\};C\)=\\frac\{f\(x\_\{t\},x\_\{i\},y\_\{i\}\)\}\{\\sum\_\{j\}f\(x\_\{t\},x\_\{j\},y\_\{j\}\)\}for some functionffdepending on query\-key pairs independently\. The weight on pointiidepends only on\(xt,xi,yi\)\(x\_\{t\},x\_\{i\},y\_\{i\}\)and the normalizing constant, not on the relationships between other context points\.

Explicit counterexample:Considern=2n=2context points\. The GP weight on point 1 is:

w1=k​\(xt,x1\)​k​\(x2,x2\)−k​\(xt,x2\)​k​\(x1,x2\)k​\(x1,x1\)​k​\(x2,x2\)−k​\(x1,x2\)2\.w\_\{1\}=\\frac\{k\(x\_\{t\},x\_\{1\}\)k\(x\_\{2\},x\_\{2\}\)\-k\(x\_\{t\},x\_\{2\}\)k\(x\_\{1\},x\_\{2\}\)\}\{k\(x\_\{1\},x\_\{1\}\)k\(x\_\{2\},x\_\{2\}\)\-k\(x\_\{1\},x\_\{2\}\)^\{2\}\}\.This depends onk​\(x1,x2\)k\(x\_\{1\},x\_\{2\}\), the kernel value between the two context points\. No ANP attention weight can capture this, asα1∝exp⁡\(q​\(xt\)⊤​k​\(x1,y1\)\)\\alpha\_\{1\}\\propto\\exp\(q\(x\_\{t\}\)^\{\\top\}k\(x\_\{1\},y\_\{1\}\)\)involves only the query\-point\-1 relationship\. ∎

## Appendix CTNP Assumptions and Upper Bound Proofs

### C\.1Structural Assumptions

###### Assumption 1\(Position\-Based Self\-Attention\)\.

Each self\-attention layer uses attention weights that depend only on input positions:

βi​j\(ℓ\)=exp⁡\(qs​\(xi\)⊤​ks​\(xj\)/τ\)∑m=1nexp⁡\(qs​\(xi\)⊤​ks​\(xm\)/τ\)\\beta\_\{ij\}^\{\(\\ell\)\}=\\frac\{\\exp\(q\_\{s\}\(x\_\{i\}\)^\{\\top\}k\_\{s\}\(x\_\{j\}\)/\\tau\)\}\{\\sum\_\{m=1\}^\{n\}\\exp\(q\_\{s\}\(x\_\{i\}\)^\{\\top\}k\_\{s\}\(x\_\{m\}\)/\\tau\)\}whereqs,ks:𝒳→ℝdkq\_\{s\},k\_\{s\}:\\mathcal\{X\}\\to\\mathbb\{R\}^\{d\_\{k\}\}are position encoders\. We denote the resulting attention matrix by𝐊~∈ℝn×n\\tilde\{\\mathbf\{K\}\}\\in\\mathbb\{R\}^\{n\\times n\}, with\[𝐊~\]i​j=βi​j\[\\tilde\{\\mathbf\{K\}\}\]\_\{ij\}=\\beta\_\{ij\}\.

###### Assumption 2\(Residual Connections\)\.

Self\-attention layers use residual connections with learnable value projections:

hi\(ℓ\)=hi\(ℓ−1\)\+∑j=1nβi​j\(ℓ\)​Wv\(ℓ\)​hj\(ℓ−1\)h\_\{i\}^\{\(\\ell\)\}=h\_\{i\}^\{\(\\ell\-1\)\}\+\\sum\_\{j=1\}^\{n\}\\beta\_\{ij\}^\{\(\\ell\)\}W\_\{v\}^\{\(\\ell\)\}h\_\{j\}^\{\(\\ell\-1\)\}whereWv\(ℓ\)∈ℝd×dW\_\{v\}^\{\(\\ell\)\}\\in\\mathbb\{R\}^\{d\\times d\}is the value projection matrix at layerℓ\\ell\.

###### Assumption 3\(Kernel Matrix Conditioning\)\.

The kernelkkis positive definite, and for context sets of sizenn, the Gram matrix𝐊\\mathbf\{K\}has eigenvalues0<λmin≤λ1≤⋯≤λn≤λmax0<\\lambda\_\{\\min\}\\leq\\lambda\_\{1\}\\leq\\cdots\\leq\\lambda\_\{n\}\\leq\\lambda\_\{\\max\}\. The condition number isκ=λmax/λmin\\kappa=\\lambda\_\{\\max\}/\\lambda\_\{\\min\}\.

###### Assumption 4\(Attention Approximates Normalized Kernel\)\.

The attention matrix𝐊~\\tilde\{\\mathbf\{K\}\}satisfies𝐊~=D−1​𝐊\\tilde\{\\mathbf\{K\}\}=D^\{\-1\}\\mathbf\{K\}whereD=diag⁡\(𝐊𝟏\)D=\\operatorname\{diag\}\(\\mathbf\{K\}\\mathbf\{1\}\)is the row\-sum normalization\. Realistically, the TNP must learn position encodersqs,ksq\_\{s\},k\_\{s\}achieving this approximation, which is a nontrivial learning problem\.

### C\.2Matrix Form and Basic Lemmas

###### Lemma 28\(Matrix Form of Updates\)\.

Under Assumptions[1](https://arxiv.org/html/2605.24210#Thmassumption1)and[2](https://arxiv.org/html/2605.24210#Thmassumption2), the layer\-ℓ\\ellupdate is:

H\(ℓ\)=H\(ℓ−1\)\+𝐊~​H\(ℓ−1\)​Wv\(ℓ\)⊤H^\{\(\\ell\)\}=H^\{\(\\ell\-1\)\}\+\\tilde\{\\mathbf\{K\}\}H^\{\(\\ell\-1\)\}W\_\{v\}^\{\(\\ell\)\\top\}or in vectorized form:

vec​\(H\(ℓ\)\)=\(𝐈n​d\+Wv\(ℓ\)⊗𝐊~\)​vec​\(H\(ℓ−1\)\)\\mathrm\{vec\}\(H^\{\(\\ell\)\}\)=\\left\(\\mathbf\{I\}\_\{nd\}\+W\_\{v\}^\{\(\\ell\)\}\\otimes\\tilde\{\\mathbf\{K\}\}\\right\)\\mathrm\{vec\}\(H^\{\(\\ell\-1\)\}\)where⊗\\otimesdenotes the Kronecker product\. For scalar value weightsWv\(ℓ\)=αℓ​𝐈dW\_\{v\}^\{\(\\ell\)\}=\\alpha\_\{\\ell\}\\mathbf\{I\}\_\{d\}:

H\(ℓ\)=\(𝐈n\+αℓ​𝐊~\)​H\(ℓ−1\)\.H^\{\(\\ell\)\}=\(\\mathbf\{I\}\_\{n\}\+\\alpha\_\{\\ell\}\\tilde\{\\mathbf\{K\}\}\)H^\{\(\\ell\-1\)\}\.

###### Lemma 29\(One Layer Captures Kernel\-Weighted Sums\)\.

Under Assumption[1](https://arxiv.org/html/2605.24210#Thmassumption1), if the position encodersqs,ksq\_\{s\},k\_\{s\}satisfyqs​\(x\)⊤​ks​\(x′\)=τ​log⁡K​\(x,x′\)\+cq\_\{s\}\(x\)^\{\\top\}k\_\{s\}\(x^\{\\prime\}\)=\\tau\\log K\(x,x^\{\\prime\}\)\+cfor some kernelKKand constantcc, then:

βi​j=K​\(xi,xj\)∑m=1nK​\(xi,xm\)\.\\beta\_\{ij\}=\\frac\{K\(x\_\{i\},x\_\{j\}\)\}\{\\sum\_\{m=1\}^\{n\}K\(x\_\{i\},x\_\{m\}\)\}\.After one self\-attention layer with value functionv:𝒳×𝒴→ℝdv:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}^\{d\}:

hi\(1\)=hi\(0\)\+∑j=1nK​\(xi,xj\)∑mK​\(xi,xm\)​v​\(xj,yj\)\.h\_\{i\}^\{\(1\)\}=h\_\{i\}^\{\(0\)\}\+\\sum\_\{j=1\}^\{n\}\\frac\{K\(x\_\{i\},x\_\{j\}\)\}\{\\sum\_\{m\}K\(x\_\{i\},x\_\{m\}\)\}v\(x\_\{j\},y\_\{j\}\)\.

###### Proof\.

Direct computation:

βi​j=exp⁡\(qs​\(xi\)⊤​ks​\(xj\)/τ\)∑mexp⁡\(qs​\(xi\)⊤​ks​\(xm\)/τ\)=exp⁡\(log⁡K​\(xi,xj\)\+c\)∑mexp⁡\(log⁡K​\(xi,xm\)\+c\)=K​\(xi,xj\)∑mK​\(xi,xm\)\.\\beta\_\{ij\}=\\frac\{\\exp\(q\_\{s\}\(x\_\{i\}\)^\{\\top\}k\_\{s\}\(x\_\{j\}\)/\\tau\)\}\{\\sum\_\{m\}\\exp\(q\_\{s\}\(x\_\{i\}\)^\{\\top\}k\_\{s\}\(x\_\{m\}\)/\\tau\)\}=\\frac\{\\exp\(\\log K\(x\_\{i\},x\_\{j\}\)\+c\)\}\{\\sum\_\{m\}\\exp\(\\log K\(x\_\{i\},x\_\{m\}\)\+c\)\}=\\frac\{K\(x\_\{i\},x\_\{j\}\)\}\{\\sum\_\{m\}K\(x\_\{i\},x\_\{m\}\)\}\.The representation update follows from the definition\. ∎

###### Proposition 30\(Gram Matrix Encoding\)\.

With representation dimensiond≥nd\\geq nand initial encodingh\(0\)​\(xj,yj\)=\(ej,yj\)∈ℝn\+dyh^\{\(0\)\}\(x\_\{j\},y\_\{j\}\)=\(e\_\{j\},y\_\{j\}\)\\in\\mathbb\{R\}^\{n\+d\_\{y\}\}whereeje\_\{j\}is thejj\-th standard basis vector, one self\-attention layer can produce representations containing theii\-th row of𝐊~\\tilde\{\\mathbf\{K\}\}:

hi\(1\)=\(ei\+𝐊~i⁣⋅,yi\+\(weighted sum of​yj​\)\)h\_\{i\}^\{\(1\)\}=\(e\_\{i\}\+\\tilde\{\\mathbf\{K\}\}\_\{i\\cdot\},y\_\{i\}\+\\text\{\(weighted sum of \}y\_\{j\}\\text\{\)\}\)where𝐊~i⁣⋅=\(βi​1,…,βi​n\)\\tilde\{\\mathbf\{K\}\}\_\{i\\cdot\}=\(\\beta\_\{i1\},\\ldots,\\beta\_\{in\}\)is theii\-th row of the attention matrix\.

###### Proof\.

SetWv\(1\)=𝐈W\_\{v\}^\{\(1\)\}=\\mathbf\{I\}\. Thenhi\(1\)=hi\(0\)\+∑jβi​j​hj\(0\)=\(ei,yi\)\+∑jβi​j​\(ej,yj\)=\(ei\+𝐊~i⁣⋅,yi\+∑jβi​j​yj\)h\_\{i\}^\{\(1\)\}=h\_\{i\}^\{\(0\)\}\+\\sum\_\{j\}\\beta\_\{ij\}h\_\{j\}^\{\(0\)\}=\(e\_\{i\},y\_\{i\}\)\+\\sum\_\{j\}\\beta\_\{ij\}\(e\_\{j\},y\_\{j\}\)=\(e\_\{i\}\+\\tilde\{\\mathbf\{K\}\}\_\{i\\cdot\},y\_\{i\}\+\\sum\_\{j\}\\beta\_\{ij\}y\_\{j\}\)\. ∎

### C\.3Polynomial Computation

###### Proof of Theorem[13](https://arxiv.org/html/2605.24210#Thmtheorem13)\(Polynomial Representation\)\.

By induction\. ForL=1L=1:H\(1\)=\(𝐈\+α1​𝐊~\)​H\(0\)H^\{\(1\)\}=\(\\mathbf\{I\}\+\\alpha\_\{1\}\\tilde\{\\mathbf\{K\}\}\)H^\{\(0\)\}\. Assuming the result forL−1L\-1:

H\(L\)=\(𝐈\+αL​𝐊~\)​H\(L−1\)=\(𝐈\+αL​𝐊~\)​∏ℓ=1L−1\(𝐈\+αℓ​𝐊~\)​H\(0\)=∏ℓ=1L\(𝐈\+αℓ​𝐊~\)​H\(0\)\.H^\{\(L\)\}=\(\\mathbf\{I\}\+\\alpha\_\{L\}\\tilde\{\\mathbf\{K\}\}\)H^\{\(L\-1\)\}=\(\\mathbf\{I\}\+\\alpha\_\{L\}\\tilde\{\\mathbf\{K\}\}\)\\prod\_\{\\ell=1\}^\{L\-1\}\(\\mathbf\{I\}\+\\alpha\_\{\\ell\}\\tilde\{\\mathbf\{K\}\}\)H^\{\(0\)\}=\\prod\_\{\\ell=1\}^\{L\}\(\\mathbf\{I\}\+\\alpha\_\{\\ell\}\\tilde\{\\mathbf\{K\}\}\)H^\{\(0\)\}\.Expanding the product, the coefficient of𝐊~m\\tilde\{\\mathbf\{K\}\}^\{m\}is the sum over all ways to choosemmfactors to contributeαℓ​𝐊~\\alpha\_\{\\ell\}\\tilde\{\\mathbf\{K\}\}\(and the remainingL−mL\-mfactors to contribute𝐈\\mathbf\{I\}\), which is exactlyem​\(α1,…,αL\)e\_\{m\}\(\\alpha\_\{1\},\\ldots,\\alpha\_\{L\}\)\. ∎

###### Corollary 31\(Achievable Coefficient Space\)\.

The set of achievable coefficient vectors\(c0,c1,…,cL\)∈ℝL\+1\(c\_\{0\},c\_\{1\},\\ldots,c\_\{L\}\)\\in\\mathbb\{R\}^\{L\+1\}for polynomials∑m=0Lcm​𝐊~m\\sum\_\{m=0\}^\{L\}c\_\{m\}\\tilde\{\\mathbf\{K\}\}^\{m\}representable by anLL\-layer TNP with scalar value weights is:

𝒜L=\{\(e0​\(α\),e1​\(α\),…,eL​\(α\)\):α∈ℝL\}\\mathcal\{A\}\_\{L\}=\\\{\(e\_\{0\}\(\\alpha\),e\_\{1\}\(\\alpha\),\\ldots,e\_\{L\}\(\\alpha\)\):\\alpha\\in\\mathbb\{R\}^\{L\}\\\}wheree0≡1e\_\{0\}\\equiv 1\. This is anLL\-dimensional algebraic variety inℝL\+1\\mathbb\{R\}^\{L\+1\}, not all ofℝL\+1\\mathbb\{R\}^\{L\+1\}\.

###### Proof\.

The mapα↦\(e0​\(α\),…,eL​\(α\)\)\\alpha\\mapsto\(e\_\{0\}\(\\alpha\),\\ldots,e\_\{L\}\(\\alpha\)\)has image determined by Newton’s identities relating elementary symmetric polynomials to power sums\. Sincee0=1e\_\{0\}=1always, the image lies in the hyperplane\{c0=1\}\\\{c\_\{0\}=1\\\}\. Within this hyperplane, the image is the set of coefficient vectors of constant term 1 polynomials with real roots \(since∏ℓ\(1\+αℓ​z\)=∑mem​zm\\prod\_\{\\ell\}\(1\+\\alpha\_\{\\ell\}z\)=\\sum\_\{m\}e\_\{m\}z^\{m\}has roots−1/αℓ\-1/\\alpha\_\{\\ell\}\)\. This is a proper subset ofℝL\\mathbb\{R\}^\{L\}\. ∎

### C\.4Approximating the Kernel Matrix Inverse

###### Lemma 33\(Neumann Series\)\.

LetA∈ℝn×nA\\in\\mathbb\{R\}^\{n\\times n\}with spectral radiusρ​\(A\)<1\\rho\(A\)<1\. Then\(𝐈−A\)−1=∑m=0∞Am\(\\mathbf\{I\}\-A\)^\{\-1\}=\\sum\_\{m=0\}^\{\\infty\}A^\{m\}, and the truncated series satisfies:

‖\(𝐈−A\)−1−∑m=0L−1Am‖≤ρ​\(A\)L1−ρ​\(A\)\.\\left\\\|\(\\mathbf\{I\}\-A\)^\{\-1\}\-\\sum\_\{m=0\}^\{L\-1\}A^\{m\}\\right\\\|\\leq\\frac\{\\rho\(A\)^\{L\}\}\{1\-\\rho\(A\)\}\.

###### Proof\.

Standard result from matrix analysis\(Golub and Van Loan,[2013](https://arxiv.org/html/2605.24210#bib.bib30)\)\. The truncation error is‖\(𝐈−A\)−1​AL‖=‖\(𝐈−A\)−1‖​‖AL‖≤11−ρ​\(A\)​ρ​\(A\)L\\\|\(\\mathbf\{I\}\-A\)^\{\-1\}A^\{L\}\\\|=\\\|\(\\mathbf\{I\}\-A\)^\{\-1\}\\\|\\\|A^\{L\}\\\|\\leq\\frac\{1\}\{1\-\\rho\(A\)\}\\rho\(A\)^\{L\}\. ∎

###### Proposition 34\(Inverse via Neumann Series\)\.

Under Assumption[3](https://arxiv.org/html/2605.24210#Thmassumption3), settingA=𝐈−𝐊/λmaxA=\\mathbf\{I\}\-\\mathbf\{K\}/\\lambda\_\{\\max\}:

1. \(a\)The eigenvalues ofAAlie in\[0,1−1/κ\]\[0,1\-1/\\kappa\], soρ​\(A\)=1−1/κ<1\\rho\(A\)=1\-1/\\kappa<1\.
2. \(b\)𝐊−1=1λmax​\(𝐈−A\)−1=1λmax​∑m=0∞Am\\mathbf\{K\}^\{\-1\}=\\frac\{1\}\{\\lambda\_\{\\max\}\}\(\\mathbf\{I\}\-A\)^\{\-1\}=\\frac\{1\}\{\\lambda\_\{\\max\}\}\\sum\_\{m=0\}^\{\\infty\}A^\{m\}\.
3. \(c\)The truncated series gives: ‖𝐊−1−1λmax​∑m=0L−1Am‖≤ρLλmin,ρ=1−1κ\.\\left\\\|\\mathbf\{K\}^\{\-1\}\-\\frac\{1\}\{\\lambda\_\{\\max\}\}\\sum\_\{m=0\}^\{L\-1\}A^\{m\}\\right\\\|\\leq\\frac\{\\rho^\{L\}\}\{\\lambda\_\{\\min\}\},\\quad\\rho=1\-\\frac\{1\}\{\\kappa\}\.

###### Proof\.

\(a\) If𝐊​v=λ​v\\mathbf\{K\}v=\\lambda v, thenA​v=\(1−λ/λmax\)​vAv=\(1\-\\lambda/\\lambda\_\{\\max\}\)v\. Sinceλ∈\[λmin,λmax\]\\lambda\\in\[\\lambda\_\{\\min\},\\lambda\_\{\\max\}\], the eigenvalues ofAAlie in\[0,1−λmin/λmax\]=\[0,1−1/κ\]\[0,1\-\\lambda\_\{\\min\}/\\lambda\_\{\\max\}\]=\[0,1\-1/\\kappa\]\.

\(b\)𝐊=λmax​\(𝐈−A\)\\mathbf\{K\}=\\lambda\_\{\\max\}\(\\mathbf\{I\}\-A\), so𝐊−1=1λmax​\(𝐈−A\)−1\\mathbf\{K\}^\{\-1\}=\\frac\{1\}\{\\lambda\_\{\\max\}\}\(\\mathbf\{I\}\-A\)^\{\-1\}\.

\(c\) By Lemma[33](https://arxiv.org/html/2605.24210#Thmtheorem33):‖\(𝐈−A\)−1−∑m=0L−1Am‖≤ρL1−ρ=ρL1/κ=κ​ρL\\\|\(\\mathbf\{I\}\-A\)^\{\-1\}\-\\sum\_\{m=0\}^\{L\-1\}A^\{m\}\\\|\\leq\\frac\{\\rho^\{L\}\}\{1\-\\rho\}=\\frac\{\\rho^\{L\}\}\{1/\\kappa\}=\\kappa\\rho^\{L\}\. Dividing byλmax\\lambda\_\{\\max\}:‖𝐊−1−1λmax​∑m=0L−1Am‖≤κ​ρLλmax=ρLλmin\\\|\\mathbf\{K\}^\{\-1\}\-\\frac\{1\}\{\\lambda\_\{\\max\}\}\\sum\_\{m=0\}^\{L\-1\}A^\{m\}\\\|\\leq\\frac\{\\kappa\\rho^\{L\}\}\{\\lambda\_\{\\max\}\}=\\frac\{\\rho^\{L\}\}\{\\lambda\_\{\\min\}\}\. ∎

###### Proposition 35\(Idealized TNP Approximation of GP Posterior\)\.

Suppose a TNP architecture has access to:

1. \(i\)Position\-based self\-attention with attention matrix equal to the row\-normalized kernel𝐊~=D−1​𝐊\\tilde\{\\mathbf\{K\}\}=D^\{\-1\}\\mathbf\{K\},
2. \(ii\)The normalization constantsDi​i=∑jK​\(xi,xj\)D\_\{ii\}=\\sum\_\{j\}K\(x\_\{i\},x\_\{j\}\)via the encoding,
3. \(iii\)Scalar value weightsα1,…,αL\\alpha\_\{1\},\\ldots,\\alpha\_\{L\}implementing the truncated Neumann series\.

Then the GP posterior mean can be approximated with errorO​\(ρL/λmin\)O\(\\rho^\{L\}/\\lambda\_\{\\min\}\)whereρ=1−1/κ\\rho=1\-1/\\kappa\.

This is an existence result\. It establishes that the approximation lies within the representational capacity of TNPs, but does not address whether gradient\-based learning can find the required parameters\.

###### Proof\.

By Proposition[34](https://arxiv.org/html/2605.24210#Thmtheorem34), the truncated Neumann series𝐊^−1=1λmax​∑m=0L−1\(𝐈−𝐊/λmax\)m\\hat\{\\mathbf\{K\}\}^\{\-1\}=\\frac\{1\}\{\\lambda\_\{\\max\}\}\\sum\_\{m=0\}^\{L\-1\}\(\\mathbf\{I\}\-\\mathbf\{K\}/\\lambda\_\{\\max\}\)^\{m\}satisfies‖𝐊−1−𝐊^−1‖≤ρL/λmin\\\|\\mathbf\{K\}^\{\-1\}\-\\hat\{\\mathbf\{K\}\}^\{\-1\}\\\|\\leq\\rho^\{L\}/\\lambda\_\{\\min\}\.

The approximation error in the posterior mean is:

\|μTNP−μ\|=\|k​\(xt,XC\)​\(𝐊^−1−𝐊−1\)​yC\|≤‖k​\(xt,XC\)‖⋅‖𝐊^−1−𝐊−1‖⋅‖yC‖\.∎\|\\mu\_\{\\mathrm\{TNP\}\}\-\\mu\|=\|k\(x\_\{t\},X\_\{C\}\)\(\\hat\{\\mathbf\{K\}\}^\{\-1\}\-\\mathbf\{K\}^\{\-1\}\)y\_\{C\}\|\\leq\\\|k\(x\_\{t\},X\_\{C\}\)\\\|\\cdot\\\|\\hat\{\\mathbf\{K\}\}^\{\-1\}\-\\mathbf\{K\}^\{\-1\}\\\|\\cdot\\\|y\_\{C\}\\\|\.\\qed

###### Corollary 36\(Depth Requirement\)\.

To achieveε\\varepsilon\-approximation of the GP posterior mean uniformly over targetsxtx\_\{t\}with‖k​\(xt,XC\)‖≤Bk\\\|k\(x\_\{t\},X\_\{C\}\)\\\|\\leq B\_\{k\}and observations‖yC‖≤By\\\|y\_\{C\}\\\|\\leq B\_\{y\}, a TNP requires:

L≥log⁡\(Bk​By/\(ε​λmin\)\)log⁡\(1/ρ\)=O​\(κ​log⁡Bk​Byε​λmin\)L\\geq\\frac\{\\log\(B\_\{k\}B\_\{y\}/\(\\varepsilon\\lambda\_\{\\min\}\)\)\}\{\\log\(1/\\rho\)\}=O\\left\(\\kappa\\log\\frac\{B\_\{k\}B\_\{y\}\}\{\\varepsilon\\lambda\_\{\\min\}\}\\right\)self\-attention layers, usinglog⁡\(1/ρ\)≈1/κ\\log\(1/\\rho\)\\approx 1/\\kappafor largeκ\\kappa\.

###### Proof of Proposition[14](https://arxiv.org/html/2605.24210#Thmtheorem14)\(Chebyshev Upper Bound\)\.

Set

αℓ=−2λmax\+λmin\+\(λmax−λmin\)​cos⁡θℓ,θℓ=\(2​ℓ−1\)​π2​L\.\\alpha\_\{\\ell\}=\-\\frac\{2\}\{\\lambda\_\{\\max\}\+\\lambda\_\{\\min\}\+\(\\lambda\_\{\\max\}\-\\lambda\_\{\\min\}\)\\cos\\theta\_\{\\ell\}\},\\quad\\theta\_\{\\ell\}=\\frac\{\(2\\ell\-1\)\\pi\}\{2L\}\.These are the optimal Chebyshev iteration parameters for inverting a matrix with spectrum in\[λmin,λmax\]\[\\lambda\_\{\\min\},\\lambda\_\{\\max\}\]\. Eachαℓ\\alpha\_\{\\ell\}is real and negative withαℓ∈\(−2/λmin,−2/λmax\)\\alpha\_\{\\ell\}\\in\(\-2/\\lambda\_\{\\min\},\-2/\\lambda\_\{\\max\}\), so the polynomialpL​\(λ\)=∏ℓ\(1\+αℓ​λ\)p\_\{L\}\(\\lambda\)=\\prod\_\{\\ell\}\(1\+\\alpha\_\{\\ell\}\\lambda\)satisfiespL​\(0\)=1p\_\{L\}\(0\)=1and has all roots−1/αℓ\-1/\\alpha\_\{\\ell\}in\[λmin,λmax\]\[\\lambda\_\{\\min\},\\lambda\_\{\\max\}\]\.

The residual polynomialrL​\(λ\)=1−λ⋅pL​\(λ\)r\_\{L\}\(\\lambda\)=1\-\\lambda\\cdot p\_\{L\}\(\\lambda\)is the degree\-LLpolynomial vanishing at0that deviates least from11on\[λmin,λmax\]\[\\lambda\_\{\\min\},\\lambda\_\{\\max\}\]in the sup\-norm\. Standard Chebyshev theory\(Trefethen,[2019](https://arxiv.org/html/2605.24210#bib.bib6)\)gives‖rL‖∞≤2​ρL\\\|r\_\{L\}\\\|\_\{\\infty\}\\leq 2\\rho^\{L\}whereρ=\(κ−1\)/\(κ\+1\)\\rho=\(\\sqrt\{\\kappa\}\-1\)/\(\\sqrt\{\\kappa\}\+1\)\. The stated bound follows from‖pL​\(𝐊\)−𝐊−1‖=‖𝐊−1‖⋅‖rL​\(𝐊\)‖≤λmin−1⋅2​ρL\\\|p\_\{L\}\(\\mathbf\{K\}\)\-\\mathbf\{K\}^\{\-1\}\\\|=\\\|\\mathbf\{K\}^\{\-1\}\\\|\\cdot\\\|r\_\{L\}\(\\mathbf\{K\}\)\\\|\\leq\\lambda\_\{\\min\}^\{\-1\}\\cdot 2\\rho^\{L\}\. ∎

## Appendix DTNP Lower Bound Proofs

### D\.1Linearization

The GP posterior meanμ​\(xt\|C\)=k​\(xt,XC\)​𝐊−1​yC\\mu\(x\_\{t\}\|C\)=k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}y\_\{C\}is linear inyCy\_\{C\}\. Any TNP approximating this target must itself be approximately linear\.

###### Lemma 37\(Linearization\)\.

LetF:𝒞×𝒳→ℝF:\\mathcal\{C\}\\times\\mathcal\{X\}\\to\\mathbb\{R\}be a TNP achieving

sup‖yC‖≤1\|F​\(C,xt\)−k​\(xt,XC\)​𝐊−1​yC\|≤ε\.\\sup\_\{\\\|y\_\{C\}\\\|\\leq 1\}\|F\(C,x\_\{t\}\)\-k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}y\_\{C\}\|\\leq\\varepsilon\.AssumeFFis twice continuously differentiable inyCy\_\{C\}with‖∇yC2F‖≤L2\\\|\\nabla^\{2\}\_\{y\_\{C\}\}F\\\|\\leq L\_\{2\}uniformly on‖yC‖≤1\\\|y\_\{C\}\\\|\\leq 1\. Then:

∥∂F∂yC\|yC=0−k\(xt,XC\)𝐊−1∥≤2ε\+L22\.\\left\\\|\\frac\{\\partial F\}\{\\partial y\_\{C\}\}\\bigg\|\_\{y\_\{C\}=0\}\-k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}\\right\\\|\\leq 2\\varepsilon\+\\frac\{L\_\{2\}\}\{2\}\.

###### Proof\.

LetM=∂F∂yC\|yC=0M=\\frac\{\\partial F\}\{\\partial y\_\{C\}\}\\big\|\_\{y\_\{C\}=0\}andT=k​\(xt,XC\)​𝐊−1T=k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}\. For any unit vectoruu, Taylor expansion gives:

F​\(δ​u\)=F​\(0\)\+δ⋅M​u\+R​\(δ\)F\(\\delta u\)=F\(0\)\+\\delta\\cdot Mu\+R\(\\delta\)where\|R​\(δ\)\|≤L22​δ2\|R\(\\delta\)\|\\leq\\frac\{L\_\{2\}\}\{2\}\\delta^\{2\}by the smoothness assumption\.

The target satisfiesT​\(δ​u\):=k​\(xt,XC\)​𝐊−1​\(δ​u\)=δ⋅T​uT\(\\delta u\):=k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}\(\\delta u\)=\\delta\\cdot Tu\(exactly linear\)\.

By the approximation hypothesis:

\|F​\(δ​u\)−δ⋅T​u\|≤εfor all​\|δ\|≤1\.\|F\(\\delta u\)\-\\delta\\cdot Tu\|\\leq\\varepsilon\\quad\\text\{for all \}\|\\delta\|\\leq 1\.
Evaluating atδ=0\\delta=0:\|F​\(0\)\|≤ε\|F\(0\)\|\\leq\\varepsilon\.

Forδ≠0\\delta\\neq 0:

\|F​\(0\)\+δ⋅M​u\+R​\(δ\)−δ⋅T​u\|≤ε\|F\(0\)\+\\delta\\cdot Mu\+R\(\\delta\)\-\\delta\\cdot Tu\|\\leq\\varepsilon\|δ\|⋅\|M​u−T​u\|≤ε\+\|F​\(0\)\|\+\|R​\(δ\)\|≤2​ε\+L22​δ2\.\|\\delta\|\\cdot\|Mu\-Tu\|\\leq\\varepsilon\+\|F\(0\)\|\+\|R\(\\delta\)\|\\leq 2\\varepsilon\+\\frac\{L\_\{2\}\}\{2\}\\delta^\{2\}\.
Settingδ=1\\delta=1:

\|M​u−T​u\|≤2​ε\+L22\.\|Mu\-Tu\|\\leq 2\\varepsilon\+\\frac\{L\_\{2\}\}\{2\}\.
Sinceuuwas an arbitrary unit vector,‖M−T‖≤2​ε\+L22\\\|M\-T\\\|\\leq 2\\varepsilon\+\\frac\{L\_\{2\}\}\{2\}\. ∎

### D\.2Jacobian Evolution through Self\-Attention

###### Definition 39\(Jacobian Matrix\)\.

At layerℓ\\ell, define the Jacobian matrixB\(ℓ\)∈ℝn×nB^\{\(\\ell\)\}\\in\\mathbb\{R\}^\{n\\times n\}\(suppressing the representation dimensiondd\) by

Bi​m\(ℓ\)=∂hi\(ℓ\)∂ym\|yC=0\.B^\{\(\\ell\)\}\_\{im\}=\\frac\{\\partial h\_\{i\}^\{\(\\ell\)\}\}\{\\partial y\_\{m\}\}\\bigg\|\_\{y\_\{C\}=0\}\.

###### Lemma 40\(Initial Jacobian\)\.

For an encoderh\(0\)​\(xi,yi\)=a​\(xi\)\+b​\(xi\)​yi\+O​\(yi2\)h^\{\(0\)\}\(x\_\{i\},y\_\{i\}\)=a\(x\_\{i\}\)\+b\(x\_\{i\}\)y\_\{i\}\+O\(y\_\{i\}^\{2\}\), the initial Jacobian is diagonal:

Bi​m\(0\)=δi​m​b​\(xi\)\.B^\{\(0\)\}\_\{im\}=\\delta\_\{im\}b\(x\_\{i\}\)\.

###### Theorem 41\(Jacobian Evolution\)\.

Consider a self\-attention layer with representation\-based attention:

hi\(ℓ\)=hi\(ℓ−1\)\+∑j=1nβi​j\(ℓ\)​Wv​hj\(ℓ−1\)h\_\{i\}^\{\(\\ell\)\}=h\_\{i\}^\{\(\\ell\-1\)\}\+\\sum\_\{j=1\}^\{n\}\\beta\_\{ij\}^\{\(\\ell\)\}W\_\{v\}h\_\{j\}^\{\(\\ell\-1\)\}whereβi​j\(ℓ\)=softmaxj​\(q​\(hi\(ℓ−1\)\)⊤​k​\(hj\(ℓ−1\)\)/τ\)\\beta\_\{ij\}^\{\(\\ell\)\}=\\mathrm\{softmax\}\_\{j\}\(q\(h\_\{i\}^\{\(\\ell\-1\)\}\)^\{\\top\}k\(h\_\{j\}^\{\(\\ell\-1\)\}\)/\\tau\)\.

Let𝐊~=β\(ℓ\)\|yC=0\\tilde\{\\mathbf\{K\}\}=\\beta^\{\(\\ell\)\}\|\_\{y\_\{C\}=0\}be the attention matrix evaluated atyC=0y\_\{C\}=0\. Then:

B\(ℓ\)=B\(ℓ−1\)\+𝐊~​Wv​B\(ℓ−1\)\+Γ\(ℓ\)​B\(ℓ−1\)B^\{\(\\ell\)\}=B^\{\(\\ell\-1\)\}\+\\tilde\{\\mathbf\{K\}\}W\_\{v\}B^\{\(\\ell\-1\)\}\+\\Gamma^\{\(\\ell\)\}B^\{\(\\ell\-1\)\}whereΓ\(ℓ\)\\Gamma^\{\(\\ell\)\}is a matrix depending onXXand network parameters, arising from attention gradients\.

###### Proof\.

Differentiating the layer update with respect toymy\_\{m\}and evaluating atyC=0y\_\{C\}=0:

∂hi\(ℓ\)∂ym=∂hi\(ℓ−1\)∂ym\+∑j∂βi​j\(ℓ\)∂ym​Wv​hj\(ℓ−1\)\+∑jβi​j\(ℓ\)​Wv​∂hj\(ℓ−1\)∂ym\.\\frac\{\\partial h\_\{i\}^\{\(\\ell\)\}\}\{\\partial y\_\{m\}\}=\\frac\{\\partial h\_\{i\}^\{\(\\ell\-1\)\}\}\{\\partial y\_\{m\}\}\+\\sum\_\{j\}\\frac\{\\partial\\beta\_\{ij\}^\{\(\\ell\)\}\}\{\\partial y\_\{m\}\}W\_\{v\}h\_\{j\}^\{\(\\ell\-1\)\}\+\\sum\_\{j\}\\beta\_\{ij\}^\{\(\\ell\)\}W\_\{v\}\\frac\{\\partial h\_\{j\}^\{\(\\ell\-1\)\}\}\{\\partial y\_\{m\}\}\.
AtyC=0y\_\{C\}=0, the third term becomes∑j𝐊~i​j​Wv​Bj​m\(ℓ−1\)\\sum\_\{j\}\\tilde\{\\mathbf\{K\}\}\_\{ij\}W\_\{v\}B^\{\(\\ell\-1\)\}\_\{jm\}\.

For the attention gradient term, we use the softmax derivative:

∂βi​j∂ym=βi​j​\(∂si​j∂ym−∑kβi​k​∂si​k∂ym\)\\frac\{\\partial\\beta\_\{ij\}\}\{\\partial y\_\{m\}\}=\\beta\_\{ij\}\\left\(\\frac\{\\partial s\_\{ij\}\}\{\\partial y\_\{m\}\}\-\\sum\_\{k\}\\beta\_\{ik\}\\frac\{\\partial s\_\{ik\}\}\{\\partial y\_\{m\}\}\\right\)wheresi​j=q​\(hi\)⊤​k​\(hj\)/τs\_\{ij\}=q\(h\_\{i\}\)^\{\\top\}k\(h\_\{j\}\)/\\tau\.

The score gradient is:

∂si​j∂ym=1τ​\[\(∇hq\|hi\)⊤​∂hi∂ym⋅k​\(hj\)\+q​\(hi\)⋅\(∇hk\|hj\)⊤​∂hj∂ym\]\.\\frac\{\\partial s\_\{ij\}\}\{\\partial y\_\{m\}\}=\\frac\{1\}\{\\tau\}\\left\[\(\\nabla\_\{h\}q\|\_\{h\_\{i\}\}\)^\{\\top\}\\frac\{\\partial h\_\{i\}\}\{\\partial y\_\{m\}\}\\cdot k\(h\_\{j\}\)\+q\(h\_\{i\}\)\\cdot\(\\nabla\_\{h\}k\|\_\{h\_\{j\}\}\)^\{\\top\}\\frac\{\\partial h\_\{j\}\}\{\\partial y\_\{m\}\}\\right\]\.
AtyC=0y\_\{C\}=0,∂hi\(0\)∂ym=δi​m​b​\(xi\)\\frac\{\\partial h\_\{i\}^\{\(0\)\}\}\{\\partial y\_\{m\}\}=\\delta\_\{im\}b\(x\_\{i\}\), so∂si​j∂ym\|yC=0\\frac\{\\partial s\_\{ij\}\}\{\\partial y\_\{m\}\}\\big\|\_\{y\_\{C\}=0\}is nonzero only whenm∈\{i,j\}m\\in\\\{i,j\\\}\.

Collecting terms, the attention gradient contribution has the formΓ\(ℓ\)​B\(ℓ−1\)\\Gamma^\{\(\\ell\)\}B^\{\(\\ell\-1\)\}whereΓ\(ℓ\)\\Gamma^\{\(\\ell\)\}depends on𝐊~\\tilde\{\\mathbf\{K\}\}, query/key gradients, and initial representations—all functions ofXXalone\. ∎

###### Corollary 42\(Polynomial Structure\)\.

AfterLLself\-attention layers, the JacobianB\(L\)=∂H\(L\)∂yC\|yC=0B^\{\(L\)\}=\\frac\{\\partial H^\{\(L\)\}\}\{\\partial y\_\{C\}\}\\big\|\_\{y\_\{C\}=0\}has the form:

B\(L\)=∑m1,…,mk≥0m1\+⋯\+mk≤LC0​𝐊~m1​C1​𝐊~m2​⋯​Ck−1​𝐊~mk​CkB^\{\(L\)\}=\\sum\_\{\\begin\{subarray\}\{c\}m\_\{1\},\\ldots,m\_\{k\}\\geq 0\\\\ m\_\{1\}\+\\cdots\+m\_\{k\}\\leq L\\end\{subarray\}\}C\_\{0\}\\tilde\{\\mathbf\{K\}\}^\{m\_\{1\}\}C\_\{1\}\\tilde\{\\mathbf\{K\}\}^\{m\_\{2\}\}\\cdots C\_\{k\-1\}\\tilde\{\\mathbf\{K\}\}^\{m\_\{k\}\}C\_\{k\}where eachCi∈ℝn×nC\_\{i\}\\in\\mathbb\{R\}^\{n\\times n\}depends on context locationsXXand network parameters, but not on observationsyCy\_\{C\}\.

In particular,B\(L\)B^\{\(L\)\}is a matrix polynomial in𝐊~\\tilde\{\\mathbf\{K\}\}of degree at most2​L2L\. The factor of22arises because the attention gradient termsΓ\(ℓ\)\\Gamma^\{\(\\ell\)\}involve quadratic products𝐊~i​j​𝐊~i​k\\tilde\{\\mathbf\{K\}\}\_\{ij\}\\tilde\{\\mathbf\{K\}\}\_\{ik\}, contributing up to degree22per layer when projected onto the eigenvalue\-controlled family\.

###### Proof\.

By induction onLL\. The base caseL=0L=0givesB\(0\)=C0B^\{\(0\)\}=C\_\{0\}where\[C0\]i​m=δi​m​b​\(xi\)\[C\_\{0\}\]\_\{im\}=\\delta\_\{im\}b\(x\_\{i\}\)is diagonal\.

For the inductive step, Theorem[41](https://arxiv.org/html/2605.24210#Thmtheorem41)gives:

B\(ℓ\)=\(𝐈\+Γ\(ℓ\)\)​B\(ℓ−1\)\+𝐊~​Wv​B\(ℓ−1\)B^\{\(\\ell\)\}=\(\\mathbf\{I\}\+\\Gamma^\{\(\\ell\)\}\)B^\{\(\\ell\-1\)\}\+\\tilde\{\\mathbf\{K\}\}W\_\{v\}B^\{\(\\ell\-1\)\}whereΓ\(ℓ\)\\Gamma^\{\(\\ell\)\}depends only onXX\. IfB\(ℓ−1\)B^\{\(\\ell\-1\)\}is a degree\-\(ℓ−1\)\(\\ell\-1\)polynomial in𝐊~\\tilde\{\\mathbf\{K\}\}, thenB\(ℓ\)B^\{\(\\ell\)\}is degree at mostℓ\\ell\. ∎

### D\.3Spectral Analysis

###### Definition 43\(Spectral Content\)\.

Let\{v1,…,vn\}\\\{v\_\{1\},\\ldots,v\_\{n\}\\\}be the eigenvectors of𝐊~\\tilde\{\\mathbf\{K\}\}with eigenvaluesμ1≤⋯≤μn\\mu\_\{1\}\\leq\\cdots\\leq\\mu\_\{n\}\. For any vectoru∈ℝnu\\in\\mathbb\{R\}^\{n\}, its spectral content isc=\(c1,…,cn\)c=\(c\_\{1\},\\ldots,c\_\{n\}\)whereu=∑j=1ncj​vju=\\sum\_\{j=1\}^\{n\}c\_\{j\}v\_\{j\}\.

###### Lemma 44\(Quadratic Form Structure\)\.

LetM=∑m=0LCm​𝐊~mM=\\sum\_\{m=0\}^\{L\}C\_\{m\}\\tilde\{\\mathbf\{K\}\}^\{m\}be a commutative degree\-LLmatrix polynomial \(i\.e\.,CmC\_\{m\}commutes with𝐊~\\tilde\{\\mathbf\{K\}\}for allmm\)\. Let\{v1,…,vn\}\\\{v\_\{1\},\\ldots,v\_\{n\}\\\}be the orthonormal eigenvectors of𝐊~\\tilde\{\\mathbf\{K\}\}with eigenvaluesμ1≤⋯≤μn\\mu\_\{1\}\\leq\\cdots\\leq\\mu\_\{n\}\. Then for any eigenvectorvjv\_\{j\}:

vj⊤​M​vj=∑m=0L\(vj⊤​Cm​vj\)​μjm,v\_\{j\}^\{\\top\}Mv\_\{j\}=\\sum\_\{m=0\}^\{L\}\(v\_\{j\}^\{\\top\}C\_\{m\}v\_\{j\}\)\\mu\_\{j\}^\{m\},a univariate polynomial inμj\\mu\_\{j\}of degree at mostLL\.

For the general \(non\-commutative\) interleaved form of Corollary[42](https://arxiv.org/html/2605.24210#Thmtheorem42), the analogous reduction to a univariate polynomial in a single eigenvalue requires restricting to kernel families where all but one eigenvalue is constant, as in Lemma[45](https://arxiv.org/html/2605.24210#Thmtheorem45)\.

###### Proof\.

Write𝐊~=∑kμk​vk​vk⊤\\tilde\{\\mathbf\{K\}\}=\\sum\_\{k\}\\mu\_\{k\}v\_\{k\}v\_\{k\}^\{\\top\}\. Then𝐊~m=∑kμkm​vk​vk⊤\\tilde\{\\mathbf\{K\}\}^\{m\}=\\sum\_\{k\}\\mu\_\{k\}^\{m\}v\_\{k\}v\_\{k\}^\{\\top\}, so:

u⊤​𝐊~m​u=∑kμkm​\(u⊤​vk\)2\.u^\{\\top\}\\tilde\{\\mathbf\{K\}\}^\{m\}u=\\sum\_\{k\}\\mu\_\{k\}^\{m\}\(u^\{\\top\}v\_\{k\}\)^\{2\}\.For the full polynomial:

u⊤​M​u=∑m=0Lu⊤​Cm​𝐊~m​u=∑m=0L∑kμkm⋅u⊤​Cm​vk⋅vk⊤​u\.u^\{\\top\}Mu=\\sum\_\{m=0\}^\{L\}u^\{\\top\}C\_\{m\}\\tilde\{\\mathbf\{K\}\}^\{m\}u=\\sum\_\{m=0\}^\{L\}\\sum\_\{k\}\\mu\_\{k\}^\{m\}\\cdot u^\{\\top\}C\_\{m\}v\_\{k\}\\cdot v\_\{k\}^\{\\top\}u\.ExpandingCm​vk=∑j\(vj⊤​Cm​vk\)​vjC\_\{m\}v\_\{k\}=\\sum\_\{j\}\(v\_\{j\}^\{\\top\}C\_\{m\}v\_\{k\}\)v\_\{j\}:

u⊤​M​u=∑m=0L∑j,kμkm​\(vj⊤​Cm​vk\)​\(u⊤​vj\)​\(u⊤​vk\)\.u^\{\\top\}Mu=\\sum\_\{m=0\}^\{L\}\\sum\_\{j,k\}\\mu\_\{k\}^\{m\}\(v\_\{j\}^\{\\top\}C\_\{m\}v\_\{k\}\)\(u^\{\\top\}v\_\{j\}\)\(u^\{\\top\}v\_\{k\}\)\.SettingQj​k​\(μ\)=∑m=0L\(vj⊤​Cm​vk\)​μkmQ\_\{jk\}\(\\mu\)=\\sum\_\{m=0\}^\{L\}\(v\_\{j\}^\{\\top\}C\_\{m\}v\_\{k\}\)\\mu\_\{k\}^\{m\}gives the result\. EachQj​kQ\_\{jk\}has degree at mostLLinμk\\mu\_\{k\}, hence total degree at mostLL\. ∎

### D\.4Eigenvalue\-Controlled Kernel Family

###### Lemma 45\(Eigenvalue\-Controlled Kernel Family\)\.

For anyκ\>1\\kappa\>1andn≥2n\\geq 2, there exists a family of positive definite matrices\{Kt\}t∈\[0,1−1/κ\]\\\{K\_\{t\}\\\}\_\{t\\in\[0,1\-1/\\kappa\]\}such that:

1. \(i\)All row sums equal 1:Kt​𝟏=𝟏K\_\{t\}\\mathbf\{1\}=\\mathbf\{1\}, henceDt=ID\_\{t\}=IandK~t=Kt\\tilde\{K\}\_\{t\}=K\_\{t\}
2. \(ii\)The minimum eigenvalue isμ1​\(t\)=1/κ\+t∈\[1/κ,1\]\\mu\_\{1\}\(t\)=1/\\kappa\+t\\in\[1/\\kappa,1\]
3. \(iii\)The minimum eigenvectorv1v\_\{1\}is constant across the family, withv1⟂𝟏v\_\{1\}\\perp\\mathbf\{1\}
4. \(iv\)All other eigenvalues equal 1
5. \(v\)α=v1⊤​Dt−1​v1=1\\alpha=v\_\{1\}^\{\\top\}D\_\{t\}^\{\-1\}v\_\{1\}=1for alltt

###### Proof\.

Let\{v1,…,vn−1\}\\\{v\_\{1\},\\ldots,v\_\{n\-1\}\\\}be any orthonormal set in𝟏⟂\\mathbf\{1\}^\{\\perp\}, and setvn=𝟏/nv\_\{n\}=\\mathbf\{1\}/\\sqrt\{n\}\. Define:

K0=1κ​v1​v1⊤\+∑j=2nvj​vj⊤=I\+\(1κ−1\)​v1​v1⊤\.K\_\{0\}=\\frac\{1\}\{\\kappa\}v\_\{1\}v\_\{1\}^\{\\top\}\+\\sum\_\{j=2\}^\{n\}v\_\{j\}v\_\{j\}^\{\\top\}=I\+\\left\(\\frac\{1\}\{\\kappa\}\-1\\right\)v\_\{1\}v\_\{1\}^\{\\top\}\.ThenK0K\_\{0\}is positive definite with eigenvalue1/κ1/\\kappaforv1v\_\{1\}and eigenvalue11forv2,…,vnv\_\{2\},\\ldots,v\_\{n\}\.

Row sums:K0​𝟏=𝟏\+\(1/κ−1\)​v1​\(v1⊤​𝟏\)=𝟏K\_\{0\}\\mathbf\{1\}=\\mathbf\{1\}\+\(1/\\kappa\-1\)v\_\{1\}\(v\_\{1\}^\{\\top\}\\mathbf\{1\}\)=\\mathbf\{1\}sincev1⟂𝟏v\_\{1\}\\perp\\mathbf\{1\}\.

Fort∈\[0,1−1/κ\]t\\in\[0,1\-1/\\kappa\], setKt=K0\+t⋅v1​v1⊤K\_\{t\}=K\_\{0\}\+t\\cdot v\_\{1\}v\_\{1\}^\{\\top\}\. Then:

- •Kt​𝟏=K0​𝟏\+t⋅v1​\(v1⊤​𝟏\)=𝟏K\_\{t\}\\mathbf\{1\}=K\_\{0\}\\mathbf\{1\}\+t\\cdot v\_\{1\}\(v\_\{1\}^\{\\top\}\\mathbf\{1\}\)=\\mathbf\{1\}
- •Kt​v1=\(1/κ\+t\)​v1K\_\{t\}v\_\{1\}=\(1/\\kappa\+t\)v\_\{1\}
- •Kt​vj=vjK\_\{t\}v\_\{j\}=v\_\{j\}forj≥2j\\geq 2

The minimum eigenvalue1/κ\+t1/\\kappa\+tremains below11fort≤1−1/κt\\leq 1\-1/\\kappa, sov1v\_\{1\}remains the minimum eigenvector throughout\. ∎

### D\.5Reduction to Univariate Approximation

###### Lemma 46\(Reduction to Univariate Approximation\)\.

Let𝐊t\\mathbf\{K\}\_\{t\}be the family from Lemma[45](https://arxiv.org/html/2605.24210#Thmtheorem45)witht∈\[μmin,μmax\]t\\in\[\\mu\_\{\\min\},\\mu\_\{\\max\}\], and letM​\(X\)M\(X\)be the Jacobian of anLL\-layer TNP evaluated atyC=0y\_\{C\}=0\. Then there exists a univariate polynomialqqof degree at most2​L2Lsuch that:

v1⊤​M​\(𝐊t\)​v1=q​\(t\)for all​t∈\[μmin,μmax\]\.v\_\{1\}^\{\\top\}M\(\\mathbf\{K\}\_\{t\}\)v\_\{1\}=q\(t\)\\quad\\text\{for all \}t\\in\[\\mu\_\{\\min\},\\mu\_\{\\max\}\]\.

###### Proof\.

By Corollary[42](https://arxiv.org/html/2605.24210#Thmtheorem42),MMis a sum of terms of the formC0​𝐊~m1​C1​⋯​CkC\_\{0\}\\tilde\{\\mathbf\{K\}\}^\{m\_\{1\}\}C\_\{1\}\\cdots C\_\{k\}with∑mi≤L\\sum m\_\{i\}\\leq L, where eachCiC\_\{i\}depends onXXbut not onyCy\_\{C\}\.

For the family𝐊t\\mathbf\{K\}\_\{t\}from Lemma[45](https://arxiv.org/html/2605.24210#Thmtheorem45), we have𝐊~t=𝐊t\\tilde\{\\mathbf\{K\}\}\_\{t\}=\\mathbf\{K\}\_\{t\}\(sinceDt=ID\_\{t\}=I\) with eigenvaluesμ1​\(t\)=1/κ\+t\\mu\_\{1\}\(t\)=1/\\kappa\+tandμj=1\\mu\_\{j\}=1forj≥2j\\geq 2\. The eigenvectorv1v\_\{1\}is fixed across the family\.

Consider any termC0​𝐊~tm1​C1​𝐊~tm2​⋯​CkC\_\{0\}\\tilde\{\\mathbf\{K\}\}\_\{t\}^\{m\_\{1\}\}C\_\{1\}\\tilde\{\\mathbf\{K\}\}\_\{t\}^\{m\_\{2\}\}\\cdots C\_\{k\}\. Expanding in the eigenbasis\{v1,…,vn\}\\\{v\_\{1\},\\ldots,v\_\{n\}\\\}:

v1⊤​C0​𝐊~tm1​C1​⋯​Ck​v1=∑j0,…,jk\(v1⊤​C0​vj0\)​\(vj0⊤​𝐊~tm1​vj1\)​⋯​\(vjk−1⊤​Ck​v1\)\.v\_\{1\}^\{\\top\}C\_\{0\}\\tilde\{\\mathbf\{K\}\}\_\{t\}^\{m\_\{1\}\}C\_\{1\}\\cdots C\_\{k\}v\_\{1\}=\\sum\_\{j\_\{0\},\\ldots,j\_\{k\}\}\(v\_\{1\}^\{\\top\}C\_\{0\}v\_\{j\_\{0\}\}\)\(v\_\{j\_\{0\}\}^\{\\top\}\\tilde\{\\mathbf\{K\}\}\_\{t\}^\{m\_\{1\}\}v\_\{j\_\{1\}\}\)\\cdots\(v\_\{j\_\{k\-1\}\}^\{\\top\}C\_\{k\}v\_\{1\}\)\.Since𝐊~tm​vj=μjm​vj\\tilde\{\\mathbf\{K\}\}\_\{t\}^\{m\}v\_\{j\}=\\mu\_\{j\}^\{m\}v\_\{j\}, each factorvji−1⊤​𝐊~tmi​vji=μjimi​δji−1,jiv\_\{j\_\{i\-1\}\}^\{\\top\}\\tilde\{\\mathbf\{K\}\}\_\{t\}^\{m\_\{i\}\}v\_\{j\_\{i\}\}=\\mu\_\{j\_\{i\}\}^\{m\_\{i\}\}\\delta\_\{j\_\{i\-1\},j\_\{i\}\}\. The sum collapses to paths through eigenvector indices\.

For indicesj≥2j\\geq 2, we haveμj=1\\mu\_\{j\}=1, contributing constant factors\. Only paths passing throughj=1j=1contribute powers ofμ1​\(t\)\\mu\_\{1\}\(t\)\. Therefore:

v1⊤​M​\(𝐊t\)​v1=∑m=0Lcm​μ1​\(t\)m=q​\(μ1​\(t\)\)v\_\{1\}^\{\\top\}M\(\\mathbf\{K\}\_\{t\}\)v\_\{1\}=\\sum\_\{m=0\}^\{L\}c\_\{m\}\\mu\_\{1\}\(t\)^\{m\}=q\(\\mu\_\{1\}\(t\)\)where the coefficientscmc\_\{m\}depend on the matricesCiC\_\{i\}\(hence on network parameters\) but not ontt\. The degree is at most2​L2Lsince each of theLLlayers contributes at most degree22due to the quadratic attention gradient terms\. ∎

### D\.6Chebyshev Barrier

###### Theorem 47\(Chebyshev Lower Bound\)\.

For any degree\-LLpolynomialppand any interval\[a,b\]\[a,b\]with0<a<b0<a<b:

maxμ∈\[a,b\]⁡\|p​\(μ\)−1μ\|≥2a\+b​ρL\\max\_\{\\mu\\in\[a,b\]\}\\left\|p\(\\mu\)\-\\frac\{1\}\{\\mu\}\\right\|\\geq\\frac\{2\}\{a\+b\}\\rho^\{L\}whereρ=b/a−1b/a\+1=κ−1κ\+1\\rho=\\frac\{\\sqrt\{b/a\}\-1\}\{\\sqrt\{b/a\}\+1\}=\\frac\{\\sqrt\{\\kappa\}\-1\}\{\\sqrt\{\\kappa\}\+1\}andκ=b/a\\kappa=b/ais the condition number\.

###### Proof\.

This is a classical result from approximation theory\. The optimal degree\-LLpolynomial approximation to1/μ1/\\muon\[a,b\]\[a,b\]is achieved by appropriately shifted and scaled Chebyshev polynomials, with the stated error bound\. SeeTrefethen \([2019](https://arxiv.org/html/2605.24210#bib.bib6)\)\. ∎

### D\.7Spectral Relationship

###### Lemma 48\(Spectral Relationship\)\.

Letγ=dmax/dmin\\gamma=d\_\{\\max\}/d\_\{\\min\}wheredmax=maxi⁡Di​id\_\{\\max\}=\\max\_\{i\}D\_\{ii\}anddmin=mini⁡Di​id\_\{\\min\}=\\min\_\{i\}D\_\{ii\}\. Then:

1γ​κ​\(𝐊\)≤κ​\(𝐊~\)≤γ⋅κ​\(𝐊\)\.\\frac\{1\}\{\\gamma\}\\kappa\(\\mathbf\{K\}\)\\leq\\kappa\(\\tilde\{\\mathbf\{K\}\}\)\\leq\\gamma\\cdot\\kappa\(\\mathbf\{K\}\)\.

###### Proof\.

The matrix𝐊~=D−1​𝐊\\tilde\{\\mathbf\{K\}\}=D^\{\-1\}\\mathbf\{K\}is similar to𝐊^=D−1/2​𝐊​D−1/2\\hat\{\\mathbf\{K\}\}=D^\{\-1/2\}\\mathbf\{K\}D^\{\-1/2\}:

D1/2​𝐊~​D−1/2=D−1/2​𝐊​D−1/2=𝐊^\.D^\{1/2\}\\tilde\{\\mathbf\{K\}\}D^\{\-1/2\}=D^\{\-1/2\}\\mathbf\{K\}D^\{\-1/2\}=\\hat\{\\mathbf\{K\}\}\.Thus𝐊~\\tilde\{\\mathbf\{K\}\}and𝐊^\\hat\{\\mathbf\{K\}\}share eigenvalues, and we analyze the latter via Rayleigh quotients\.

Upper bound onκ​\(𝐊~\)\\kappa\(\\tilde\{\\mathbf\{K\}\}\):For any unit vectorww, settingu=D−1/2​wu=D^\{\-1/2\}wgives:

w⊤​𝐊^​ww⊤​w=u⊤​𝐊​uu⊤​D​u∈\[λmin​\(𝐊\)dmax,λmax​\(𝐊\)dmin\]\\frac\{w^\{\\top\}\\hat\{\\mathbf\{K\}\}w\}\{w^\{\\top\}w\}=\\frac\{u^\{\\top\}\\mathbf\{K\}u\}\{u^\{\\top\}Du\}\\in\\left\[\\frac\{\\lambda\_\{\\min\}\(\\mathbf\{K\}\)\}\{d\_\{\\max\}\},\\frac\{\\lambda\_\{\\max\}\(\\mathbf\{K\}\)\}\{d\_\{\\min\}\}\\right\]where the bounds follow fromdmin​‖u‖2≤u⊤​D​u≤dmax​‖u‖2d\_\{\\min\}\\\|u\\\|^\{2\}\\leq u^\{\\top\}Du\\leq d\_\{\\max\}\\\|u\\\|^\{2\}\. Therefore:

κ​\(𝐊~\)≤λmax​\(𝐊\)/dminλmin​\(𝐊\)/dmax=γ⋅κ​\(𝐊\)\.\\kappa\(\\tilde\{\\mathbf\{K\}\}\)\\leq\\frac\{\\lambda\_\{\\max\}\(\\mathbf\{K\}\)/d\_\{\\min\}\}\{\\lambda\_\{\\min\}\(\\mathbf\{K\}\)/d\_\{\\max\}\}=\\gamma\\cdot\\kappa\(\\mathbf\{K\}\)\.
Lower bound onκ​\(𝐊~\)\\kappa\(\\tilde\{\\mathbf\{K\}\}\):Letvmaxv\_\{\\max\}be the unit eigenvector of𝐊\\mathbf\{K\}with eigenvalueλmax​\(𝐊\)\\lambda\_\{\\max\}\(\\mathbf\{K\}\)\. Evaluating atw=D1/2​vmaxw=D^\{1/2\}v\_\{\\max\}:

λmax​\(𝐊~\)≥vmax⊤​𝐊​vmaxvmax⊤​D​vmax=λmax​\(𝐊\)vmax⊤​D​vmax≥λmax​\(𝐊\)dmax\.\\lambda\_\{\\max\}\(\\tilde\{\\mathbf\{K\}\}\)\\geq\\frac\{v\_\{\\max\}^\{\\top\}\\mathbf\{K\}v\_\{\\max\}\}\{v\_\{\\max\}^\{\\top\}Dv\_\{\\max\}\}=\\frac\{\\lambda\_\{\\max\}\(\\mathbf\{K\}\)\}\{v\_\{\\max\}^\{\\top\}Dv\_\{\\max\}\}\\geq\\frac\{\\lambda\_\{\\max\}\(\\mathbf\{K\}\)\}\{d\_\{\\max\}\}\.
Letvminv\_\{\\min\}be the unit eigenvector of𝐊\\mathbf\{K\}with eigenvalueλmin​\(𝐊\)\\lambda\_\{\\min\}\(\\mathbf\{K\}\)\. Evaluating atw=D1/2​vminw=D^\{1/2\}v\_\{\\min\}:

λmin​\(𝐊~\)≤vmin⊤​𝐊​vminvmin⊤​D​vmin=λmin​\(𝐊\)vmin⊤​D​vmin≤λmin​\(𝐊\)dmin\.\\lambda\_\{\\min\}\(\\tilde\{\\mathbf\{K\}\}\)\\leq\\frac\{v\_\{\\min\}^\{\\top\}\\mathbf\{K\}v\_\{\\min\}\}\{v\_\{\\min\}^\{\\top\}Dv\_\{\\min\}\}=\\frac\{\\lambda\_\{\\min\}\(\\mathbf\{K\}\)\}\{v\_\{\\min\}^\{\\top\}Dv\_\{\\min\}\}\\leq\\frac\{\\lambda\_\{\\min\}\(\\mathbf\{K\}\)\}\{d\_\{\\min\}\}\.
Combining:

κ​\(𝐊~\)=λmax​\(𝐊~\)λmin​\(𝐊~\)≥λmax​\(𝐊\)/dmaxλmin​\(𝐊\)/dmin=1γ​κ​\(𝐊\)\.∎\\kappa\(\\tilde\{\\mathbf\{K\}\}\)=\\frac\{\\lambda\_\{\\max\}\(\\tilde\{\\mathbf\{K\}\}\)\}\{\\lambda\_\{\\min\}\(\\tilde\{\\mathbf\{K\}\}\)\}\\geq\\frac\{\\lambda\_\{\\max\}\(\\mathbf\{K\}\)/d\_\{\\max\}\}\{\\lambda\_\{\\min\}\(\\mathbf\{K\}\)/d\_\{\\min\}\}=\\frac\{1\}\{\\gamma\}\\kappa\(\\mathbf\{K\}\)\.\\qed

###### Corollary 49\.

When the row\-sum ratioγ=O​\(1\)\\gamma=O\(1\), we haveκ​\(𝐊~\)=Θ​\(κ​\(𝐊\)\)\\kappa\(\\tilde\{\\mathbf\{K\}\}\)=\\Theta\(\\kappa\(\\mathbf\{K\}\)\)\. Consequently, the depth requirement ofΘ​\(κ​\(𝐊~\)​log⁡\(1/ε\)\)\\Theta\(\\sqrt\{\\kappa\(\\tilde\{\\mathbf\{K\}\}\)\}\\log\(1/\\varepsilon\)\)layers for TNP approximation of GP posteriors is equivalent toΘ​\(κ​\(𝐊\)​log⁡\(1/ε\)\)\\Theta\(\\sqrt\{\\kappa\(\\mathbf\{K\}\)\}\\log\(1/\\varepsilon\)\)in terms of the original kernel matrix condition number\.

### D\.8Target Quadratic Form

###### Lemma 50\(Target Quadratic Form\)\.

Letv1v\_\{1\}be the minimum eigenvector of𝐊~\\tilde\{\\mathbf\{K\}\}\. The target value satisfies:

v1⊤​𝐊−1​v1=αμ1v\_\{1\}^\{\\top\}\\mathbf\{K\}^\{\-1\}v\_\{1\}=\\frac\{\\alpha\}\{\\mu\_\{1\}\}whereα=v1⊤​D−1​v1\>0\\alpha=v\_\{1\}^\{\\top\}D^\{\-1\}v\_\{1\}\>0andμ1=λmin​\(𝐊~\)\\mu\_\{1\}=\\lambda\_\{\\min\}\(\\tilde\{\\mathbf\{K\}\}\)\.

###### Proof\.

Since𝐊=D​𝐊~\\mathbf\{K\}=D\\tilde\{\\mathbf\{K\}\}, we have𝐊−1=𝐊~−1​D−1\\mathbf\{K\}^\{\-1\}=\\tilde\{\\mathbf\{K\}\}^\{\-1\}D^\{\-1\}\. Thus:

v1⊤​𝐊−1​v1=v1⊤​𝐊~−1​D−1​v1=1μ1​v1⊤​D−1​v1=αμ1v\_\{1\}^\{\\top\}\\mathbf\{K\}^\{\-1\}v\_\{1\}=v\_\{1\}^\{\\top\}\\tilde\{\\mathbf\{K\}\}^\{\-1\}D^\{\-1\}v\_\{1\}=\\frac\{1\}\{\\mu\_\{1\}\}v\_\{1\}^\{\\top\}D^\{\-1\}v\_\{1\}=\\frac\{\\alpha\}\{\\mu\_\{1\}\}where the second equality uses𝐊~−1​v1=1μ1​v1\\tilde\{\\mathbf\{K\}\}^\{\-1\}v\_\{1\}=\\frac\{1\}\{\\mu\_\{1\}\}v\_\{1\}\. SinceD−1D^\{\-1\}is positive diagonal andv1≠0v\_\{1\}\\neq 0, we haveα\>0\\alpha\>0\. ∎

### D\.9Main Lower Bound

###### Proof of Theorem[15](https://arxiv.org/html/2605.24210#Thmtheorem15)\(TNP Depth Lower Bound\)\.

Step 1: Linearization\.By Lemma[37](https://arxiv.org/html/2605.24210#Thmtheorem37), there existsM​\(X\)M\(X\)with‖M​\(X\)−𝐊−1‖≤2​ε\+L2/2\\\|M\(X\)\-\\mathbf\{K\}^\{\-1\}\\\|\\leq 2\\varepsilon\+L\_\{2\}/2\. Forε\\varepsilonsufficiently small relative to the fixed architecture constantL2L\_\{2\}, this isO​\(ε\)O\(\\varepsilon\)\.

Step 2: Polynomial structure\.By Corollary[42](https://arxiv.org/html/2605.24210#Thmtheorem42),M​\(X\)M\(X\)has the form:

M​\(X\)=∑m1,…,mk≥0m1\+⋯\+mk≤LC0​\(X\)​𝐊~m1​C1​\(X\)​⋯​Ck​\(X\)M\(X\)=\\sum\_\{\\begin\{subarray\}\{c\}m\_\{1\},\\ldots,m\_\{k\}\\geq 0\\\\ m\_\{1\}\+\\cdots\+m\_\{k\}\\leq L\\end\{subarray\}\}C\_\{0\}\(X\)\\tilde\{\\mathbf\{K\}\}^\{m\_\{1\}\}C\_\{1\}\(X\)\\cdots C\_\{k\}\(X\)where eachCi​\(X\)C\_\{i\}\(X\)depends on context locations and network parameters, but not onyCy\_\{C\}\.

Step 3: Restriction to controlled family\.We use the family\{𝐊t\}\\\{\\mathbf\{K\}\_\{t\}\\\}from Lemma[45](https://arxiv.org/html/2605.24210#Thmtheorem45)withμmin=1/κ\\mu\_\{\\min\}=1/\\kappaandμmax=1\\mu\_\{\\max\}=1\. This family satisfies:

- •Condition numberκ​\(𝐊t\)=κ\\kappa\(\\mathbf\{K\}\_\{t\}\)=\\kappa\(ratio of largest to smallest eigenvalue is1/\(1/κ\)=κ1/\(1/\\kappa\)=\\kappa\)
- •Row\-sum ratioγ=1\\gamma=1\(sinceDt=ID\_\{t\}=I\)
- •Eigenvector coefficientα=v1⊤​Dt−1​v1=‖v1‖2=1\\alpha=v\_\{1\}^\{\\top\}D\_\{t\}^\{\-1\}v\_\{1\}=\\\|v\_\{1\}\\\|^\{2\}=1

Step 4: Univariate reduction\.By Lemma[46](https://arxiv.org/html/2605.24210#Thmtheorem46), for this family:

v1⊤​M​\(𝐊t\)​v1=q​\(μ1​\(t\)\)v\_\{1\}^\{\\top\}M\(\\mathbf\{K\}\_\{t\}\)v\_\{1\}=q\(\\mu\_\{1\}\(t\)\)whereμ1​\(t\)=λmin​\(𝐊~t\)\\mu\_\{1\}\(t\)=\\lambda\_\{\\min\}\(\\tilde\{\\mathbf\{K\}\}\_\{t\}\)andqqis a polynomial of degree at most2​L2L\.

Step 5: Target value\.By Lemma[50](https://arxiv.org/html/2605.24210#Thmtheorem50):

v1⊤​𝐊t−1​v1=α​\(t\)μ1​\(t\)≥α0μ1​\(t\)\.v\_\{1\}^\{\\top\}\\mathbf\{K\}\_\{t\}^\{\-1\}v\_\{1\}=\\frac\{\\alpha\(t\)\}\{\\mu\_\{1\}\(t\)\}\\geq\\frac\{\\alpha\_\{0\}\}\{\\mu\_\{1\}\(t\)\}\.
Step 6: Chebyshev barrier\.The approximation requirement‖M−𝐊t−1‖≤O​\(ε\)\\\|M\-\\mathbf\{K\}\_\{t\}^\{\-1\}\\\|\\leq O\(\\varepsilon\)implies:

\|q​\(μ1\)−α​\(t\)μ1\|≤O​\(ε\)for all​μ1∈\[μmin,μmax\]\.\\left\|q\(\\mu\_\{1\}\)\-\\frac\{\\alpha\(t\)\}\{\\mu\_\{1\}\}\\right\|\\leq O\(\\varepsilon\)\\quad\\text\{for all \}\\mu\_\{1\}\\in\[\\mu\_\{\\min\},\\mu\_\{\\max\}\]\.
Sinceα​\(t\)≥α0\\alpha\(t\)\\geq\\alpha\_\{0\}andqqmust approximateα​\(t\)/μ1≥α0/μ1\\alpha\(t\)/\\mu\_\{1\}\\geq\\alpha\_\{0\}/\\mu\_\{1\}, Theorem[47](https://arxiv.org/html/2605.24210#Thmtheorem47)gives:

supμ1∈\[μmin,μmax\]\|q​\(μ1\)−α0μ1\|≥2​α0μmin​\(1\+κ\)⋅ρL\\sup\_\{\\mu\_\{1\}\\in\[\\mu\_\{\\min\},\\mu\_\{\\max\}\]\}\\left\|q\(\\mu\_\{1\}\)\-\\frac\{\\alpha\_\{0\}\}\{\\mu\_\{1\}\}\\right\|\\geq\\frac\{2\\alpha\_\{0\}\}\{\\mu\_\{\\min\}\(1\+\\kappa\)\}\\cdot\\rho^\{L\}whereρ=\(κ−1\)/\(κ\+1\)\\rho=\(\\sqrt\{\\kappa\}\-1\)/\(\\sqrt\{\\kappa\}\+1\)\.

Step 7: Solve forLL\.For the boundO​\(ε\)O\(\\varepsilon\)to hold:

2​α0μmin​\(1\+κ\)⋅ρL≤O​\(ε\)\.\\frac\{2\\alpha\_\{0\}\}\{\\mu\_\{\\min\}\(1\+\\kappa\)\}\\cdot\\rho^\{L\}\\leq O\(\\varepsilon\)\.
Taking logarithms:

L≥1log⁡\(1/ρ\)​log⁡\(c​α0ε​μmin​\(1\+κ\)\)\.L\\geq\\frac\{1\}\{\\log\(1/\\rho\)\}\\log\\left\(\\frac\{c\\alpha\_\{0\}\}\{\\varepsilon\\mu\_\{\\min\}\(1\+\\kappa\)\}\\right\)\.
Usinglog⁡\(1/ρ\)=log⁡κ\+1κ−1=2​arctanh​\(1/κ\)≈2κ\\log\(1/\\rho\)=\\log\\frac\{\\sqrt\{\\kappa\}\+1\}\{\\sqrt\{\\kappa\}\-1\}=2\\,\\mathrm\{arctanh\}\(1/\\sqrt\{\\kappa\}\)\\approx\\frac\{2\}\{\\sqrt\{\\kappa\}\}for largeκ\\kappa, and noting thatqqhas degree at most2​L2L:

2​L≥κ2​log⁡\(c′​α0ε\),henceL≥κ4​log⁡\(c′​α0ε\)2L\\geq\\frac\{\\sqrt\{\\kappa\}\}\{2\}\\log\\left\(\\frac\{c^\{\\prime\}\\alpha\_\{0\}\}\{\\varepsilon\}\\right\),\\quad\\text\{hence\}\\quad L\\geq\\frac\{\\sqrt\{\\kappa\}\}\{4\}\\log\\left\(\\frac\{c^\{\\prime\}\\alpha\_\{0\}\}\{\\varepsilon\}\\right\)for a constantc′c^\{\\prime\}absorbing theμmin\\mu\_\{\\min\}and\(1\+κ\)\(1\+\\kappa\)factors into the logarithm\. ∎

### D\.10Dimension Scaling

###### Proof of Theorem[18](https://arxiv.org/html/2605.24210#Thmtheorem18)\(Dimension Scaling\), part \(c\)\.

The GP posterior mean at targetxtx\_\{t\}is:

μ​\(xt\|C\)=∑i=1nwi​\(xt;C\)​yi,wi​\(xt;C\)=\[k​\(xt,XC\)​𝐊−1\]i\.\\mu\(x\_\{t\}\|C\)=\\sum\_\{i=1\}^\{n\}w\_\{i\}\(x\_\{t\};C\)y\_\{i\},\\quad w\_\{i\}\(x\_\{t\};C\)=\[k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}\]\_\{i\}\.
Encoding strategy:Use initial encodinghi\(0\)=\(ei,yi,xi\)∈ℝn\+dy\+dxh\_\{i\}^\{\(0\)\}=\(e\_\{i\},y\_\{i\},x\_\{i\}\)\\in\\mathbb\{R\}^\{n\+d\_\{y\}\+d\_\{x\}\}whereeie\_\{i\}is a one\-hot index vector\.

AfterLLself\-attention layers:By Proposition[30](https://arxiv.org/html/2605.24210#Thmtheorem30)and the polynomial computation theorem, representations can encode sufficient information about𝐊\\mathbf\{K\}and its powers to approximate𝐊−1​yC\\mathbf\{K\}^\{\-1\}y\_\{C\}via the Neumann series\.

Cross\-attention:With query\-dependent attention weightsαi​\(xt\)∝k​\(xt,xi\)\\alpha\_\{i\}\(x\_\{t\}\)\\propto k\(x\_\{t\},x\_\{i\}\), the cross\-attention output is:

r​\(xt\)=∑iαi​\(xt\)​hi\(L\)\.r\(x\_\{t\}\)=\\sum\_\{i\}\\alpha\_\{i\}\(x\_\{t\}\)h\_\{i\}^\{\(L\)\}\.Ifhi\(L\)h\_\{i\}^\{\(L\)\}encodes theii\-th component of𝐊−1​yC\\mathbf\{K\}^\{\-1\}y\_\{C\}\(call itziz\_\{i\}\), then withαi​\(xt\)=k​\(xt,xi\)/∑jk​\(xt,xj\)\\alpha\_\{i\}\(x\_\{t\}\)=k\(x\_\{t\},x\_\{i\}\)/\\sum\_\{j\}k\(x\_\{t\},x\_\{j\}\):

r​\(xt\)≈∑ik​\(xt,xi\)​zi∑jk​\(xt,xj\)=k​\(xt,XC\)​𝐊−1​yC∑jk​\(xt,xj\)\.r\(x\_\{t\}\)\\approx\\frac\{\\sum\_\{i\}k\(x\_\{t\},x\_\{i\}\)z\_\{i\}\}\{\\sum\_\{j\}k\(x\_\{t\},x\_\{j\}\)\}=\\frac\{k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}y\_\{C\}\}\{\\sum\_\{j\}k\(x\_\{t\},x\_\{j\}\)\}\.A decoder with access to∑jk​\(xt,xj\)\\sum\_\{j\}k\(x\_\{t\},x\_\{j\}\)\(computable via an auxiliary attention head\) recovers the unnormalized posterior mean\. ∎

## Appendix EConvCNP Proofs

###### Proof of Proposition[19](https://arxiv.org/html/2605.24210#Thmtheorem19)\(Translation Equivariance\)\.

The functional channels satisfyρC\+τ​\(x\+τ\)=∑iw​\(x\+τ−xi−τ\)=ρC​\(x\)\\rho\_\{C\+\\tau\}\(x\+\\tau\)=\\sum\_\{i\}w\(x\+\\tau\-x\_\{i\}\-\\tau\)=\\rho\_\{C\}\(x\), and identically forsC\+τs\_\{C\+\\tau\}\. Convolution layers commute with translation: for any filterϕ\\phiand translated functionfτ​\(x\)=f​\(x−τ\)f\_\{\\tau\}\(x\)=f\(x\-\\tau\),\(ϕ∗fτ\)​\(x\)=\(ϕ∗f\)​\(x−τ\)\(\\phi\*f\_\{\\tau\}\)\(x\)=\(\\phi\*f\)\(x\-\\tau\)\. Pointwise nonlinearities preserve this\. By induction over CNN layers,r~C\+τ​\(x\+τ\)=r~C​\(x\)\\tilde\{r\}\_\{C\+\\tau\}\(x\+\\tau\)=\\tilde\{r\}\_\{C\}\(x\)\. ∎

###### Proof of Proposition[20](https://arxiv.org/html/2605.24210#Thmtheorem20)\(Injectivity of Convolutional Aggregation\)\.

Step 1: Recover locations\.Taking Fourier transforms,ρ^C​\(ξ\)=w^​\(ξ\)​∑i=1ne−2​π​i​ξ⋅xi\\hat\{\\rho\}\_\{C\}\(\\xi\)=\\hat\{w\}\(\\xi\)\\sum\_\{i=1\}^\{n\}e^\{\-2\\pi i\\xi\\cdot x\_\{i\}\}\. Sincew^​\(ξ\)≠0\\hat\{w\}\(\\xi\)\\neq 0, we recover the characteristic function of the discrete measureμ=∑iδxi\\mu=\\sum\_\{i\}\\delta\_\{x\_\{i\}\}, which determines\{x1,…,xn\}\\\{x\_\{1\},\\ldots,x\_\{n\}\\\}as a multiset\.

Step 2: Recover values\.Given the locations,sC​\(xj\)=∑i=1nw​\(xj−xi\)​h​\(yi\)s\_\{C\}\(x\_\{j\}\)=\\sum\_\{i=1\}^\{n\}w\(x\_\{j\}\-x\_\{i\}\)\\,h\(y\_\{i\}\)forj=1,…,nj=1,\\ldots,nis the linear system𝐬=W​𝐡\\mathbf\{s\}=W\\mathbf\{h\}withWj​i=w​\(xj−xi\)W\_\{ji\}=w\(x\_\{j\}\-x\_\{i\}\)\. For positive definiteww\(e\.g\. Gaussian\) and distinctxix\_\{i\},WWis a positive definite Gram matrix, hence invertible\. Injectivity ofhhthen recoversyiy\_\{i\}\. ∎

###### Proof of Proposition[21](https://arxiv.org/html/2605.24210#Thmtheorem21)\(Pure ConvCNP Represents Stationary Kernel Smoothers\)\.

Setw=Kw=Kandh​\(y\)=\(y,1\)∈ℝdy\+1h\(y\)=\(y,1\)\\in\\mathbb\{R\}^\{d\_\{y\}\+1\}\. ThensC​\(xt\)=\(∑iK​\(xt−xi\)​yi,∑iK​\(xt−xi\)\)s\_\{C\}\(x\_\{t\}\)=\(\\sum\_\{i\}K\(x\_\{t\}\-x\_\{i\}\)y\_\{i\},\\;\\sum\_\{i\}K\(x\_\{t\}\-x\_\{i\}\)\)andρC​\(xt\)=∑iK​\(xt−xi\)\\rho\_\{C\}\(x\_\{t\}\)=\\sum\_\{i\}K\(x\_\{t\}\-x\_\{i\}\)\. The decoderg​\(s,ρ,xt\)=s1:dy/sdy\+1g\(s,\\rho,x\_\{t\}\)=s\_\{1:d\_\{y\}\}/s\_\{d\_\{y\}\+1\}extracts the kernel smoother exactly\. ∎

###### Proposition 51\(Pure ConvCNP Cannot Represent GP Posteriors\)\.

For generic positive definite stationary kernels, the GP posterior mean is not representable by any pure ConvCNP\.

###### Proof\.

A pure ConvCNP computes:

F​\(C,xt\)=g​\(∑i=1nw​\(xt−xi\)∑jw​\(xt−xj\)​h​\(yi\),ρC​\(xt\),xt\)\.F\(C,x\_\{t\}\)=g\\\!\\left\(\\sum\_\{i=1\}^\{n\}\\frac\{w\(x\_\{t\}\-x\_\{i\}\)\}\{\\sum\_\{j\}w\(x\_\{t\}\-x\_\{j\}\)\}\\,h\(y\_\{i\}\),\\;\\rho\_\{C\}\(x\_\{t\}\),\\;x\_\{t\}\\right\)\.The weightαi​\(xt\)=w​\(xt−xi\)/∑jw​\(xt−xj\)\\alpha\_\{i\}\(x\_\{t\}\)=w\(x\_\{t\}\-x\_\{i\}\)/\\sum\_\{j\}w\(x\_\{t\}\-x\_\{j\}\)on context pointiidepends only on the query–point distancext−xix\_\{t\}\-x\_\{i\}and the set of all such distances\{xt−xj\}j=1n\\\{x\_\{t\}\-x\_\{j\}\\\}\_\{j=1\}^\{n\}\. It does not depend on the inter\-context distances\{xi−xj\}j≠i\\\{x\_\{i\}\-x\_\{j\}\\\}\_\{j\\neq i\}\.

The GP posterior weightwi​\(xt;C\)=\[k​\(xt,XC\)​𝐊−1\]iw\_\{i\}\(x\_\{t\};C\)=\[k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}\]\_\{i\}depends on the full Gram matrix𝐊\\mathbf\{K\}, which couples all context points\. The same two\-point counterexample from Theorem[12](https://arxiv.org/html/2605.24210#Thmtheorem12)applies: the GP weight ony1y\_\{1\}depends onk​\(x1,x2\)k\(x\_\{1\},x\_\{2\}\), which no factorized weight can capture\. ∎

### E\.1CNN Layers and Toeplitz Iteration

###### Proposition 52\(CNN Implements Toeplitz Iteration\)\.

Let context locations\{x1,…,xn\}\\\{x\_\{1\},\\ldots,x\_\{n\}\\\}lie on a regular grid with spacingδ\\delta, and letw=Kw=Kfor a stationary kernelKK\. A CNN layer with residual connection and filterϕℓ\\phi\_\{\\ell\}:

r~\(ℓ\)​\(x\)=r~\(ℓ−1\)​\(x\)\+\(ϕℓ∗r~\(ℓ−1\)\)​\(x\)\\tilde\{r\}^\{\(\\ell\)\}\(x\)=\\tilde\{r\}^\{\(\\ell\-1\)\}\(x\)\+\(\\phi\_\{\\ell\}\*\\tilde\{r\}^\{\(\\ell\-1\)\}\)\(x\)implements, at the context locations, the matrix update:

𝐫\(ℓ\)=\(𝐈\+𝐀ℓ\)​𝐫\(ℓ−1\)\\mathbf\{r\}^\{\(\\ell\)\}=\(\\mathbf\{I\}\+\\mathbf\{A\}\_\{\\ell\}\)\\,\\mathbf\{r\}^\{\(\\ell\-1\)\}where\[𝐀ℓ\]i​j=δdx​ϕℓ​\(xi−xj\)\[\\mathbf\{A\}\_\{\\ell\}\]\_\{ij\}=\\delta^\{d\_\{x\}\}\\phi\_\{\\ell\}\(x\_\{i\}\-x\_\{j\}\)is a Toeplitz matrix determined by the filter, up to boundary effects of orderO​\(δ\)O\(\\delta\)\.

###### Proof\.

Evaluating the convolution at a context locationxix\_\{i\}:

\(ϕℓ∗r~\(ℓ−1\)\)​\(xi\)=∫ϕℓ​\(xi−x′\)​r~\(ℓ−1\)​\(x′\)​𝑑x′\.\(\\phi\_\{\\ell\}\*\\tilde\{r\}^\{\(\\ell\-1\)\}\)\(x\_\{i\}\)=\\int\\phi\_\{\\ell\}\(x\_\{i\}\-x^\{\\prime\}\)\\,\\tilde\{r\}^\{\(\\ell\-1\)\}\(x^\{\\prime\}\)\\,dx^\{\\prime\}\.Whenr~\(ℓ−1\)\\tilde\{r\}^\{\(\\ell\-1\)\}is concentrated near the context locations \(which holds for localizedww\), the integral is dominated by contributions from neighborhoods of eachxjx\_\{j\}:

\(ϕℓ∗r~\(ℓ−1\)\)​\(xi\)≈∑j=1nϕℓ​\(xi−xj\)​∫Bδ​\(xj\)r~\(ℓ−1\)​\(x′\)​𝑑x′≈∑j=1nδdx​ϕℓ​\(xi−xj\)​r~\(ℓ−1\)​\(xj\)\.\(\\phi\_\{\\ell\}\*\\tilde\{r\}^\{\(\\ell\-1\)\}\)\(x\_\{i\}\)\\approx\\sum\_\{j=1\}^\{n\}\\phi\_\{\\ell\}\(x\_\{i\}\-x\_\{j\}\)\\int\_\{B\_\{\\delta\}\(x\_\{j\}\)\}\\tilde\{r\}^\{\(\\ell\-1\)\}\(x^\{\\prime\}\)\\,dx^\{\\prime\}\\approx\\sum\_\{j=1\}^\{n\}\\delta^\{d\_\{x\}\}\\,\\phi\_\{\\ell\}\(x\_\{i\}\-x\_\{j\}\)\\,\\tilde\{r\}^\{\(\\ell\-1\)\}\(x\_\{j\}\)\.Sincexi−xjx\_\{i\}\-x\_\{j\}depends only on the grid displacementi−ji\-j, the matrix\[𝐀ℓ\]i​j=δdx​ϕℓ​\(xi−xj\)\[\\mathbf\{A\}\_\{\\ell\}\]\_\{ij\}=\\delta^\{d\_\{x\}\}\\phi\_\{\\ell\}\(x\_\{i\}\-x\_\{j\}\)is Toeplitz\. ∎

###### Proof of Theorem[22](https://arxiv.org/html/2605.24210#Thmtheorem22)\(ConvCNP Depth for GP Posteriors\)\.

On a regular grid,𝐊\\mathbf\{K\}is Toeplitz\. The Chebyshev iterationpL​\(𝐊\)=∏ℓ=1L\(𝐈\+αℓ​𝐊\)p\_\{L\}\(\\mathbf\{K\}\)=\\prod\_\{\\ell=1\}^\{L\}\(\\mathbf\{I\}\+\\alpha\_\{\\ell\}\\mathbf\{K\}\)uses parametersαℓ\\alpha\_\{\\ell\}as in Proposition[14](https://arxiv.org/html/2605.24210#Thmtheorem14)\. Each factor𝐈\+αℓ​𝐊\\mathbf\{I\}\+\\alpha\_\{\\ell\}\\mathbf\{K\}is Toeplitz \(the sum of identity and a scaled Toeplitz matrix\), so each factor is implementable by a single CNN layer with filterϕℓ​\(u\)=αℓ​K​\(u\)/δdx\\phi\_\{\\ell\}\(u\)=\\alpha\_\{\\ell\}K\(u\)/\\delta^\{d\_\{x\}\}via Proposition[52](https://arxiv.org/html/2605.24210#Thmtheorem52)\.

The first error term is the Chebyshev approximation rate from Proposition[14](https://arxiv.org/html/2605.24210#Thmtheorem14)\. TheO​\(δ\)O\(\\delta\)term accounts for discretization of the convolution integral and boundary effects; both vanish asδ→0\\delta\\to 0for compactly supported contexts\.

AfterLLCNN layers, the representation at each context locationxix\_\{i\}encodes theii\-th component ofpL​\(𝐊\)​yC≈𝐊−1​yCp\_\{L\}\(\\mathbf\{K\}\)y\_\{C\}\\approx\\mathbf\{K\}^\{\-1\}y\_\{C\}\. The readout step recovers the posterior meanμ​\(xt\|C\)=k​\(xt,XC\)​𝐊−1​yC\\mu\(x\_\{t\}\|C\)=k\(x\_\{t\},X\_\{C\}\)\\mathbf\{K\}^\{\-1\}y\_\{C\}via convolutional cross\-attention with filterw=Kw=K\. ∎

### E\.2Depth–Support Tradeoff on Regular Grids

###### Definition 53\(Trigonometric Approximation Number\)\.

For a continuous functionf:\[0,2​π\]→ℝf:\[0,2\\pi\]\\to\\mathbb\{R\}, define𝒩ε​\(f\)\\mathcal\{N\}\_\{\\varepsilon\}\(f\)as the minimum degreeDDsuch that there exists a trigonometric polynomialt​\(ω\)=∑\|j\|≤Dcj​ei​j​ωt\(\\omega\)=\\sum\_\{\|j\|\\leq D\}c\_\{j\}e^\{ij\\omega\}satisfyingsupω∈\[0,2​π\]\|t​\(ω\)−f​\(ω\)\|≤ε\\sup\_\{\\omega\\in\[0,2\\pi\]\}\|t\(\\omega\)\-f\(\\omega\)\|\\leq\\varepsilon\.

###### Lemma 54\(CNN Jacobian on Periodic Grids\)\.

Consider a ConvCNP on a periodic grid of sizennwithLLCNN layers, each with filter support at mostppand residual connections:

r~\(ℓ\)​\(x\)=r~\(ℓ−1\)​\(x\)\+σ​\(ϕℓ∗r~\(ℓ−1\)​\(x\)\+bℓ\),ℓ=1,…,L\.\\tilde\{r\}^\{\(\\ell\)\}\(x\)=\\tilde\{r\}^\{\(\\ell\-1\)\}\(x\)\+\\sigma\\\!\\bigl\(\\phi\_\{\\ell\}\*\\tilde\{r\}^\{\(\\ell\-1\)\}\(x\)\+b\_\{\\ell\}\\bigr\),\\quad\\ell=1,\\ldots,L\.\(13\)Write the encoder ash​\(y\)=h​\(0\)\+h′​\(0\)​y\+O​\(y2\)h\(y\)=h\(0\)\+h^\{\\prime\}\(0\)\\,y\+O\(y^\{2\}\)\. The Jacobian of the full ConvCNP output with respect toyCy\_\{C\}, evaluated atyC=0y\_\{C\}=0, is a circulant matrix that factors in the Fourier domain as:

J^​\(k\)=g^​\(k\)⋅∏ℓ=1L\(1\+dℓ​τ^ℓ​\(k\)\)⋅h′​\(0\)⋅w^​\(k\),k=0,…,n−1,\\hat\{J\}\(k\)=\\hat\{g\}\(k\)\\cdot\\prod\_\{\\ell=1\}^\{L\}\\bigl\(1\+d\_\{\\ell\}\\,\\hat\{\\tau\}\_\{\\ell\}\(k\)\\bigr\)\\cdot h^\{\\prime\}\(0\)\\cdot\\hat\{w\}\(k\),\\quad k=0,\\ldots,n\-1,where:

- •w^​\(k\)\\hat\{w\}\(k\)is the DFT of the aggregation filterww,
- •τ^ℓ​\(k\)\\hat\{\\tau\}\_\{\\ell\}\(k\)is the DFT ofϕℓ\\phi\_\{\\ell\}, a trigonometric polynomial of degree at most⌊p/2⌋\\lfloor p/2\\rfloorinωk=2​π​k/n\\omega\_\{k\}=2\\pi k/n,
- •dℓ∈ℝd\_\{\\ell\}\\in\\mathbb\{R\}is the \(spatially constant\) activation derivative at layerℓ\\ell,
- •g^​\(k\)\\hat\{g\}\(k\)is the Fourier\-domain readout transfer function\.

###### Proof\.

We trace the Jacobian through the three stages of the ConvCNP: encoding/aggregation, CNN processing, and readout\.

Stage 1: Encoding and aggregation\.The signal channel issC​\(xi\)=∑jw​\(xi−xj\)​h​\(yj\)s\_\{C\}\(x\_\{i\}\)=\\sum\_\{j\}w\(x\_\{i\}\-x\_\{j\}\)\\,h\(y\_\{j\}\)\. Its Jacobian with respect toymy\_\{m\}is:

∂sC​\(xi\)∂ym=w​\(xi−xm\)​h′​\(ym\)\.\\frac\{\\partial s\_\{C\}\(x\_\{i\}\)\}\{\\partial y\_\{m\}\}=w\(x\_\{i\}\-x\_\{m\}\)\\,h^\{\\prime\}\(y\_\{m\}\)\.AtyC=0y\_\{C\}=0, this becomesw​\(xi−xm\)⋅h′​\(0\)w\(x\_\{i\}\-x\_\{m\}\)\\cdot h^\{\\prime\}\(0\), the Toeplitz matrixWWwith entriesWi​m=w​\(xi−xm\)W\_\{im\}=w\(x\_\{i\}\-x\_\{m\}\), scaled byh′​\(0\)h^\{\\prime\}\(0\)\. On the periodic grid,WWis circulant with DFTw^​\(k\)\\hat\{w\}\(k\)\.

The density channelρC​\(xi\)=∑jw​\(xi−xj\)\\rho\_\{C\}\(x\_\{i\}\)=\\sum\_\{j\}w\(x\_\{i\}\-x\_\{j\}\)is independent ofyCy\_\{C\}and hence contributes zero to the Jacobian\. Note thatρC\\rho\_\{C\}is constant across grid locations by translation invariance of the periodic grid:ρC​\(xi\)=ρ0\\rho\_\{C\}\(x\_\{i\}\)=\\rho\_\{0\}for allii\.

The constant componenth​\(0\)h\(0\)of the encoder contributes∑jw​\(xi−xj\)​h​\(0\)\\sum\_\{j\}w\(x\_\{i\}\-x\_\{j\}\)\\,h\(0\), which is also independent ofyCy\_\{C\}and does not affect the Jacobian\. Thus only the linear termh′​\(0\)​yh^\{\\prime\}\(0\)\\,yin the encoder expansion contributes\.

Stage 2: CNN processing\.AtyC=0y\_\{C\}=0, the signal channel vanishes \(itsyCy\_\{C\}\-dependent part is zero\) and the density channel is the spatial constantρ0\\rho\_\{0\}\. The CNN input is therefore spatially uniform atyC=0y\_\{C\}=0\.

Consider theℓ\\ell\-th layer \([13](https://arxiv.org/html/2605.24210#A5.E13)\)\. Its Jacobian with respect to its input, evaluated atyC=0y\_\{C\}=0, is:

J\(ℓ\)=I\+Dℓ​Tℓ,J^\{\(\\ell\)\}=I\+D\_\{\\ell\}\\,T\_\{\\ell\},whereTℓT\_\{\\ell\}is the circulant matrix of filterϕℓ\\phi\_\{\\ell\}andDℓ=diag⁡\(σ′​\(ϕℓ∗r~\(ℓ−1\)​\(xi\)\+bℓ\)\|yC=0\)D\_\{\\ell\}=\\operatorname\{diag\}\(\\sigma^\{\\prime\}\(\\phi\_\{\\ell\}\*\\tilde\{r\}^\{\(\\ell\-1\)\}\(x\_\{i\}\)\+b\_\{\\ell\}\)\|\_\{y\_\{C\}=0\}\)\. Since the inputr~\(ℓ−1\)\|yC=0\\tilde\{r\}^\{\(\\ell\-1\)\}\|\_\{y\_\{C\}=0\}is spatially uniform \(by induction the base case holds as shown above, and each layer preserves spatial uniformity when the input is spatially uniform\), the argument ofσ′\\sigma^\{\\prime\}is the same at every location\. ThusDℓ=dℓ​ID\_\{\\ell\}=d\_\{\\ell\}Ifor a scalardℓ∈ℝd\_\{\\ell\}\\in\\mathbb\{R\}\.

The full CNN Jacobian isJΦ=∏ℓ=1L\(I\+dℓ​Tℓ\)J\_\{\\Phi\}=\\prod\_\{\\ell=1\}^\{L\}\(I\+d\_\{\\ell\}T\_\{\\ell\}\)\. Since each factor is circulant and products of circulant matrices are circulant,JΦJ\_\{\\Phi\}is circulant with DFT:

J^Φ​\(k\)=∏ℓ=1L\(1\+dℓ​τ^ℓ​\(k\)\)\.\\hat\{J\}\_\{\\Phi\}\(k\)=\\prod\_\{\\ell=1\}^\{L\}\(1\+d\_\{\\ell\}\\,\\hat\{\\tau\}\_\{\\ell\}\(k\)\)\.Eachτ^ℓ​\(k\)=∑\|m\|≤⌊p/2⌋ϕℓ​\(m​δ\)​e−2​π​i​k​m/n\\hat\{\\tau\}\_\{\\ell\}\(k\)=\\sum\_\{\|m\|\\leq\\lfloor p/2\\rfloor\}\\phi\_\{\\ell\}\(m\\delta\)\\,e^\{\-2\\pi ikm/n\}is a trigonometric polynomial of degree at most⌊p/2⌋\\lfloor p/2\\rfloor\.

Stage 3: Readout\.The readoutg​\(r~C​\(xt\),ρC​\(xt\),xt\)g\(\\tilde\{r\}\_\{C\}\(x\_\{t\}\),\\rho\_\{C\}\(x\_\{t\}\),x\_\{t\}\)is evaluated at a single location\. On the periodic grid with stationary readout, the Jacobian of the readout with respect to the CNN output is a circulant matrix with DFTg^​\(k\)\\hat\{g\}\(k\)\. \(For the readout structure in Proposition[21](https://arxiv.org/html/2605.24210#Thmtheorem21), whereggdivides byρC\\rho\_\{C\},g^​\(k\)\\hat\{g\}\(k\)is the constant1/ρ01/\\rho\_\{0\}\.\)

Composing the three stages by the chain rule gives the stated factorization\. ∎

###### Proposition 55\(Full\-Support Filters Trivialize the Problem\)\.

On a periodic grid of sizenn, if each CNN filter has supportp=np=n, then a single CNN layer \(L=1L=1\) suffices: there exist filter coefficientsϕ1\\phi\_\{1\}such that the ConvCNP Jacobian satisfiesJ^​\(k\)=1/λk\\hat\{J\}\(k\)=1/\\lambda\_\{k\}for allkk, providedd1≠0d\_\{1\}\\neq 0,h′​\(0\)≠0h^\{\\prime\}\(0\)\\neq 0, andw^​\(k\)≠0\\hat\{w\}\(k\)\\neq 0for allkk\.

###### Proof\.

By Lemma[54](https://arxiv.org/html/2605.24210#Thmtheorem54), the Jacobian at frequencykkisJ^​\(k\)=g^​\(k\)⋅\(1\+d1​τ^1​\(k\)\)⋅h′​\(0\)⋅w^​\(k\)\\hat\{J\}\(k\)=\\hat\{g\}\(k\)\\cdot\(1\+d\_\{1\}\\hat\{\\tau\}\_\{1\}\(k\)\)\\cdot h^\{\\prime\}\(0\)\\cdot\\hat\{w\}\(k\)\. SettingJ^​\(k\)=1/λk\\hat\{J\}\(k\)=1/\\lambda\_\{k\}and solving:

τ^1​\(k\)=1d1​\(1λk⋅g^​\(k\)⋅h′​\(0\)⋅w^​\(k\)−1\)\.\\hat\{\\tau\}\_\{1\}\(k\)=\\frac\{1\}\{d\_\{1\}\}\\\!\\left\(\\frac\{1\}\{\\lambda\_\{k\}\\cdot\\hat\{g\}\(k\)\\cdot h^\{\\prime\}\(0\)\\cdot\\hat\{w\}\(k\)\}\-1\\right\)\.This determinesτ^1​\(k\)\\hat\{\\tau\}\_\{1\}\(k\)independently at each of thennfrequencies\. Withp=np=n, the DFTϕ1↦\(τ^1​\(0\),…,τ^1​\(n−1\)\)\\phi\_\{1\}\\mapsto\(\\hat\{\\tau\}\_\{1\}\(0\),\\ldots,\\hat\{\\tau\}\_\{1\}\(n\-1\)\)is a bijection onℂn\\mathbb\{C\}^\{n\}, soϕ1\\phi\_\{1\}exists and is unique\. The solution is well\-defined providedd1≠0d\_\{1\}\\neq 0\(nonzero activation derivative atyC=0y\_\{C\}=0, which holds for standard activations like GELU or softplus evaluated at the uniform input\),h′​\(0\)≠0h^\{\\prime\}\(0\)\\neq 0\(nontrivial encoder, which holds forh​\(y\)=\(y,1\)h\(y\)=\(y,1\)sinceh′​\(0\)=\(1,0\)h^\{\\prime\}\(0\)=\(1,0\)\), andw^​\(k\)≠0\\hat\{w\}\(k\)\\neq 0for allkk\(which holds for positive definite filters such as Gaussians, cf\. Proposition[20](https://arxiv.org/html/2605.24210#Thmtheorem20)\)\. ∎

###### Theorem 56\(Depth–Support Tradeoff\)\.

Let𝐊\\mathbf\{K\}be a circulant positive definite matrix on a periodic grid of sizennwith eigenvaluesλk=K^​\(ωk\)\\lambda\_\{k\}=\\hat\{K\}\(\\omega\_\{k\}\), whereωk=2​π​k/n\\omega\_\{k\}=2\\pi k/n\. Letn\>2​L​⌊p/2⌋n\>2L\\lfloor p/2\\rfloor\. Any ConvCNP withLLCNN layers, each with filter support at mostpp, achievingε\\varepsilon\-approximation of𝐊−1​yC\\mathbf\{K\}^\{\-1\}y\_\{C\}requires:

L⋅⌊p/2⌋≥𝒩O​\(ε\)​\(c/K^\),L\\cdot\\lfloor p/2\\rfloor\\;\\geq\\;\\mathcal\{N\}\_\{O\(\\varepsilon\)\}\\\!\\bigl\(c/\\hat\{K\}\\bigr\),wherec=\(g^⋅h′​\(0\)⋅w^\)−1c=\(\\hat\{g\}\\cdot h^\{\\prime\}\(0\)\\cdot\\hat\{w\}\)^\{\-1\}absorbs the encoding and readout, and𝒩ε\\mathcal\{N\}\_\{\\varepsilon\}is as in Definition[53](https://arxiv.org/html/2605.24210#Thmtheorem53)\.

###### Proof\.

Step 1: Linearization\.The targetT​\(yC\)=𝐊−1​yCT\(y\_\{C\}\)=\\mathbf\{K\}^\{\-1\}y\_\{C\}is linear\. By the same argument as Lemma[37](https://arxiv.org/html/2605.24210#Thmtheorem37)\(applied to the ConvCNP as a mapℝn→ℝn\\mathbb\{R\}^\{n\}\\to\\mathbb\{R\}^\{n\}\),ε\\varepsilon\-approximation ofTTimplies that the JacobianM=∂F/∂yC\|yC=0M=\\partial F/\\partial y\_\{C\}\|\_\{y\_\{C\}=0\}satisfies‖M−𝐊−1‖≤O​\(ε\)\\\|M\-\\mathbf\{K\}^\{\-1\}\\\|\\leq O\(\\varepsilon\), where the constant absorbs the smoothness boundL2/2L\_\{2\}/2which depends on the architecture but not onε\\varepsilonorκ\\kappa\.

Step 2: Fourier structure\.By Lemma[54](https://arxiv.org/html/2605.24210#Thmtheorem54),MMis circulant with DFT:

M^​\(k\)=g^​\(k\)⋅∏ℓ=1L\(1\+dℓ​τ^ℓ​\(k\)\)⋅h′​\(0\)⋅w^​\(k\)\.\\hat\{M\}\(k\)=\\hat\{g\}\(k\)\\cdot\\prod\_\{\\ell=1\}^\{L\}\(1\+d\_\{\\ell\}\\,\\hat\{\\tau\}\_\{\\ell\}\(k\)\)\\cdot h^\{\\prime\}\(0\)\\cdot\\hat\{w\}\(k\)\.Defineq​\(ω\)=∏ℓ=1L\(1\+dℓ​τ^ℓ​\(ω\)\)q\(\\omega\)=\\prod\_\{\\ell=1\}^\{L\}\(1\+d\_\{\\ell\}\\,\\hat\{\\tau\}\_\{\\ell\}\(\\omega\)\), which is a trigonometric polynomial of degree at mostD=L⋅⌊p/2⌋D=L\\cdot\\lfloor p/2\\rfloor\. ThenM^​\(k\)=g^​\(ωk\)⋅q​\(ωk\)⋅h′​\(0\)⋅w^​\(ωk\)\\hat\{M\}\(k\)=\\hat\{g\}\(\\omega\_\{k\}\)\\cdot q\(\\omega\_\{k\}\)\\cdot h^\{\\prime\}\(0\)\\cdot\\hat\{w\}\(\\omega\_\{k\}\)\.

Step 3: Approximation requirement\.SinceMMand𝐊−1\\mathbf\{K\}^\{\-1\}are both circulant,‖M−𝐊−1‖=maxk⁡\|M^​\(k\)−1/λk\|\\\|M\-\\mathbf\{K\}^\{\-1\}\\\|=\\max\_\{k\}\|\\hat\{M\}\(k\)\-1/\\lambda\_\{k\}\|\. The approximation bound from Step 1 gives:

\|g^​\(ωk\)⋅q​\(ωk\)⋅h′​\(0\)⋅w^​\(ωk\)−1λk\|≤O​\(ε\)for all​k=0,…,n−1\.\\left\|\\hat\{g\}\(\\omega\_\{k\}\)\\cdot q\(\\omega\_\{k\}\)\\cdot h^\{\\prime\}\(0\)\\cdot\\hat\{w\}\(\\omega\_\{k\}\)\-\\frac\{1\}\{\\lambda\_\{k\}\}\\right\|\\leq O\(\\varepsilon\)\\quad\\text\{for all \}k=0,\\ldots,n\-1\.Rearranging, the trigonometric polynomialr​\(ω\)=g^​\(ω\)⋅h′​\(0\)⋅w^​\(ω\)⋅q​\(ω\)r\(\\omega\)=\\hat\{g\}\(\\omega\)\\cdot h^\{\\prime\}\(0\)\\cdot\\hat\{w\}\(\\omega\)\\cdot q\(\\omega\)of degree at mostD\+D0D\+D\_\{0\}\(whereD0D\_\{0\}is the degree contribution fromg^\\hat\{g\}andw^\\hat\{w\}, a fixed constant depending on the kernel and readout\) satisfies:

\|r​\(ωk\)−1K^​\(ωk\)\|≤O​\(ε\)for all​k\.\\left\|r\(\\omega\_\{k\}\)\-\\frac\{1\}\{\\hat\{K\}\(\\omega\_\{k\}\)\}\\right\|\\leq O\(\\varepsilon\)\\quad\\text\{for all \}k\.
Step 4: From grid to uniform\.The trigonometric polynomialr​\(ω\)−1/K^​\(ω\)r\(\\omega\)\-1/\\hat\{K\}\(\\omega\)has degree at mostD\+D0\+DKD\+D\_\{0\}\+D\_\{K\}, whereDKD\_\{K\}is the degree of the numerator ofr​\(ω\)​K^​\(ω\)−1r\(\\omega\)\\hat\{K\}\(\\omega\)\-1expressed as a ratio of trigonometric polynomials\. A trigonometric polynomial of degreeNNwith\|t​\(ωk\)\|≤O​\(ε\)\|t\(\\omega\_\{k\}\)\|\\leq O\(\\varepsilon\)atnnequispaced points satisfies‖t‖∞≤O​\(ε\)\\\|t\\\|\_\{\\infty\}\\leq O\(\\varepsilon\)providedn\>2​Nn\>2N\(since a trigonometric polynomial of degreeNNis determined by2​N\+12N\+1equispaced samples\)\. The conditionn\>2​L​⌊p/2⌋n\>2L\\lfloor p/2\\rfloorensures this \(absorbing the fixed contributionD0D\_\{0\}fornnsufficiently large relative toD0D\_\{0\}\)\.

Thusrris a trigonometric polynomial of degree at mostD\+D0D\+D\_\{0\}that uniformlyO​\(ε\)O\(\\varepsilon\)\-approximates1/K^​\(ω\)1/\\hat\{K\}\(\\omega\)on\[0,2​π\]\[0,2\\pi\]\. By Definition[53](https://arxiv.org/html/2605.24210#Thmtheorem53),D\+D0≥𝒩O​\(ε\)​\(1/K^\)D\+D\_\{0\}\\geq\\mathcal\{N\}\_\{O\(\\varepsilon\)\}\(1/\\hat\{K\}\)\. SinceD0D\_\{0\}is a fixed constant,D=L⋅⌊p/2⌋≥𝒩O​\(ε\)​\(1/K^\)−D0D=L\\cdot\\lfloor p/2\\rfloor\\geq\\mathcal\{N\}\_\{O\(\\varepsilon\)\}\(1/\\hat\{K\}\)\-D\_\{0\}, givingL⋅⌊p/2⌋≥𝒩O​\(ε\)​\(c/K^\)L\\cdot\\lfloor p/2\\rfloor\\geq\\mathcal\{N\}\_\{O\(\\varepsilon\)\}\(c/\\hat\{K\}\)as stated\. ∎

###### Corollary 57\(Recovering theκ\\sqrt\{\\kappa\}Rate\)\.

Fix filter supportp=O​\(ℓK/δ\)p=O\(\\ell\_\{K\}/\\delta\), whereℓK\\ell\_\{K\}is the kernel lengthscale\. If the spectral densityK^​\(ω\)\\hat\{K\}\(\\omega\)satisfiesK^min≤K^​\(ω\)≤K^max\\hat\{K\}\_\{\\min\}\\leq\\hat\{K\}\(\\omega\)\\leq\\hat\{K\}\_\{\\max\}withκ=K^max/K^min\\kappa=\\hat\{K\}\_\{\\max\}/\\hat\{K\}\_\{\\min\}, then:

L≥Ω​\(κ​log⁡\(1/ε\)p\)\.L\\geq\\Omega\\\!\\left\(\\frac\{\\sqrt\{\\kappa\}\\,\\log\(1/\\varepsilon\)\}\{p\}\\right\)\.In particular, for the filter support used in Theorem[22](https://arxiv.org/html/2605.24210#Thmtheorem22), the depth requirement matches the Chebyshev upper bound up to the factorpp\.

###### Proof\.

The function1/K^​\(ω\)1/\\hat\{K\}\(\\omega\)takes values in\[1/K^max,1/K^min\]\[1/\\hat\{K\}\_\{\\max\},\\,1/\\hat\{K\}\_\{\\min\}\]\. By the substitutionμ=K^​\(ω\)\\mu=\\hat\{K\}\(\\omega\), approximating1/K^​\(ω\)1/\\hat\{K\}\(\\omega\)in the trigonometric sense reduces to approximating1/μ1/\\muon the interval\[K^min,K^max\]\[\\hat\{K\}\_\{\\min\},\\hat\{K\}\_\{\\max\}\]\.

More precisely, lett​\(ω\)t\(\\omega\)be any trigonometric polynomial satisfying\|t​\(ω\)−1/K^​\(ω\)\|≤ε\|t\(\\omega\)\-1/\\hat\{K\}\(\\omega\)\|\\leq\\varepsilonuniformly\. Definep​\(μ\)=t​\(K^−1​\(μ\)\)p\(\\mu\)=t\(\\hat\{K\}^\{\-1\}\(\\mu\)\)on any monotone branch ofK^\\hat\{K\}\. Then\|p​\(μ\)−1/μ\|≤ε\|p\(\\mu\)\-1/\\mu\|\\leq\\varepsilonon\[K^min,K^max\]\[\\hat\{K\}\_\{\\min\},\\hat\{K\}\_\{\\max\}\]\. The degree ofttis at least the degree needed forpp, which by the classical Chebyshev barrier \(Theorem[47](https://arxiv.org/html/2605.24210#Thmtheorem47)\) isΩ​\(κ​log⁡\(1/ε\)\)\\Omega\(\\sqrt\{\\kappa\}\\,\\log\(1/\\varepsilon\)\)\. \(The substitution can increase degree by at most the factor⌊q/2⌋\\lfloor q/2\\rfloor, the trigonometric degree ofK^\\hat\{K\}itself, which is a fixed constant depending on the kernel support\.\)

Theorem[56](https://arxiv.org/html/2605.24210#Thmtheorem56)then givesL⋅⌊p/2⌋≥Ω​\(κ​log⁡\(1/ε\)\)L\\cdot\\lfloor p/2\\rfloor\\geq\\Omega\(\\sqrt\{\\kappa\}\\,\\log\(1/\\varepsilon\)\), yielding the stated bound\. ∎

### E\.3Incomparability with ANPs

###### Proof of Theorem[23](https://arxiv.org/html/2605.24210#Thmtheorem23)\(ConvCNPs and ANPs Are Incomparable\)\.

ℱANP⊈ℱConvCNP\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}\\not\\subseteq\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}:Letσ:𝒳→ℝ\+\\sigma:\\mathcal\{X\}\\to\\mathbb\{R\}\_\{\+\}be non\-constant and define the non\-stationary kernel smoother:

F​\(C,xt\)=∑iσ​\(xt\)​σ​\(xi\)​K​\(xt−xi\)​yi∑iσ​\(xt\)​σ​\(xi\)​K​\(xt−xi\)\.F\(C,x\_\{t\}\)=\\frac\{\\sum\_\{i\}\\sigma\(x\_\{t\}\)\\sigma\(x\_\{i\}\)K\(x\_\{t\}\-x\_\{i\}\)\\,y\_\{i\}\}\{\\sum\_\{i\}\\sigma\(x\_\{t\}\)\\sigma\(x\_\{i\}\)K\(x\_\{t\}\-x\_\{i\}\)\}\.This is ANP\-representable by Theorem[9](https://arxiv.org/html/2605.24210#Thmtheorem9)with non\-stationary kernelK~​\(x,x′\)=σ​\(x\)​σ​\(x′\)​K​\(x−x′\)\\tilde\{K\}\(x,x^\{\\prime\}\)=\\sigma\(x\)\\sigma\(x^\{\\prime\}\)K\(x\-x^\{\\prime\}\)\. However,FFis not translation equivariant: shifting all inputs byτ\\taureplacesσ​\(xi\)\\sigma\(x\_\{i\}\)withσ​\(xi\+τ\)≠σ​\(xi\)\\sigma\(x\_\{i\}\+\\tau\)\\neq\\sigma\(x\_\{i\}\)generically\. Since every ConvCNP is translation equivariant \(Proposition[19](https://arxiv.org/html/2605.24210#Thmtheorem19)\),F∉ℱConvCNPF\\notin\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}\.

ℱConvCNP⊈ℱANP\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}\\not\\subseteq\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}:For a stationary kernel, the GP posterior mean is translation equivariant \(K​\(x\+τ,x′\+τ\)=K​\(x−x′\)K\(x\+\\tau,x^\{\\prime\}\+\\tau\)=K\(x\-x^\{\\prime\}\)implies invariance of𝐊\\mathbf\{K\}under joint translation\)\. By Theorem[22](https://arxiv.org/html/2605.24210#Thmtheorem22), a ConvCNP with sufficient CNN depth approximates it to arbitrary precision on grid contexts\. By Theorem[12](https://arxiv.org/html/2605.24210#Thmtheorem12), it is not ANP\-representable\. ∎

## Appendix FHierarchy Proof

###### Proof of Theorem[24](https://arxiv.org/html/2605.24210#Thmtheorem24)\(Expressiveness Hierarchy\)\.

Part \(a\): Attention branch\.

ℱCNP\(d\)⊆ℱANP\(d\)\\mathcal\{F\}\_\{\\mathrm\{CNP\}\}^\{\(d\)\}\\subseteq\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}:Set uniform attention weightsαi=1/n\\alpha\_\{i\}=1/n\(achieved by constant query and key functions\)\.

ℱCNP\(d\)≠ℱANP\(d\)\\mathcal\{F\}\_\{\\mathrm\{CNP\}\}^\{\(d\)\}\\neq\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}:Kernel smoothers are inℱANP\(d\)\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}\([Theorem9](https://arxiv.org/html/2605.24210#Thmtheorem9)\) but not inℱCNP\(d\)\\mathcal\{F\}\_\{\\mathrm\{CNP\}\}^\{\(d\)\}\. To see the latter, note that a CNP computesF​\(C,xt\)=g​\(h¯C,xt\)F\(C,x\_\{t\}\)=g\(\\bar\{h\}\_\{C\},x\_\{t\}\)whereh¯C\\bar\{h\}\_\{C\}is independent ofxtx\_\{t\}\. For the Gaussian kernel smoother withn=2n=2points, configurationsC=\{\(0,0\),\(1,1\)\}C=\\\{\(0,0\),\(1,1\)\\\}andC′=\{\(0\.25,0\),\(0\.75,1\)\}C^\{\\prime\}=\\\{\(0\.25,0\),\(0\.75,1\)\\\}can satisfyh¯C=h¯C′\\bar\{h\}\_\{C\}=\\bar\{h\}\_\{C^\{\\prime\}\}for appropriatehh, yet the kernel smoother outputs differ: atxt=0x\_\{t\}=0, configurationCCweights the points roughly equally whileC′C^\{\\prime\}strongly favors the nearby point\(0\.25,0\)\(0\.25,0\)\.

ℱANP\(d\)⊆ℱTNP\(1,d\)\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}\\subseteq\\mathcal\{F\}\_\{\\mathrm\{TNP\}\}^\{\(1,d\)\}:An ANP is a TNP withL=0L=0self\-attention layers\. WithL=1L=1andWv\(1\)=0W\_\{v\}^\{\(1\)\}=0, we recover ANP\.

ℱANP\(d\)≠ℱTNP\(1,d\)\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}\\neq\\mathcal\{F\}\_\{\\mathrm\{TNP\}\}^\{\(1,d\)\}:GP posteriors are inℱTNP\(L,d\)\\mathcal\{F\}\_\{\\mathrm\{TNP\}\}^\{\(L,d\)\}for sufficientLL\(Proposition[35](https://arxiv.org/html/2605.24210#Thmtheorem35)\) but not inℱANP\(d\)\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}\([Theorem12](https://arxiv.org/html/2605.24210#Thmtheorem12)\)\.

ℱTNP\(L,d\)⊊ℱTNP\(L\+1,d\)\\mathcal\{F\}\_\{\\mathrm\{TNP\}\}^\{\(L,d\)\}\\subsetneq\\mathcal\{F\}\_\{\\mathrm\{TNP\}\}^\{\(L\+1,d\)\}:The inclusion follows by adding a layer withWv\(L\+1\)=0W\_\{v\}^\{\(L\+1\)\}=0\. We prove strictness under Assumptions[1](https://arxiv.org/html/2605.24210#Thmassumption1)–[2](https://arxiv.org/html/2605.24210#Thmassumption2)with scalar value weights; the global lower bound of Theorem[15](https://arxiv.org/html/2605.24210#Thmtheorem15)establishes the corresponding result for representation\-based attention at the coarser granularity ofΘ​\(κ​log⁡\(1/ε\)\)\\Theta\(\\sqrt\{\\kappa\}\\log\(1/\\varepsilon\)\)total layers\.

AnLL\-layer TNP with scalar value weights computespL​\(𝐊~\)​H\(0\)p\_\{L\}\(\\tilde\{\\mathbf\{K\}\}\)H^\{\(0\)\}wherepL​\(x\)=∏ℓ=1L\(1\+αℓ​x\)p\_\{L\}\(x\)=\\prod\_\{\\ell=1\}^\{L\}\(1\+\\alpha\_\{\\ell\}x\)\. The achievable set𝒫L=\{pL:α1,…,αL∈ℝ\}\\mathcal\{P\}\_\{L\}=\\\{p\_\{L\}:\\alpha\_\{1\},\\ldots,\\alpha\_\{L\}\\in\\mathbb\{R\}\\\}satisfies𝒫L⊂PolyL\\mathcal\{P\}\_\{L\}\\subset\\mathrm\{Poly\}\_\{L\}, the space of polynomials of degree at mostLL\.

Let𝐊\\mathbf\{K\}have condition numberκ\>1\\kappa\>1with eigenvalues in\[λmin,λmax\]\[\\lambda\_\{\\min\},\\lambda\_\{\\max\}\]\. The GP posterior requires𝐊−1\\mathbf\{K\}^\{\-1\}, which acts as1/λ1/\\lambdaon each eigenspace\. The minimax error for degree\-LLpolynomial approximation to1/x1/xon\[λmin,λmax\]\[\\lambda\_\{\\min\},\\lambda\_\{\\max\}\]is

EL∗=infp∈PolyLsupx∈\[λmin,λmax\]\|p​\(x\)−1/x\|=Θ​\(ρLλmin\)E\_\{L\}^\{\*\}=\\inf\_\{p\\in\\mathrm\{Poly\}\_\{L\}\}\\sup\_\{x\\in\[\\lambda\_\{\\min\},\\lambda\_\{\\max\}\]\}\|p\(x\)\-1/x\|=\\Theta\\left\(\\frac\{\\rho^\{L\}\}\{\\lambda\_\{\\min\}\}\\right\)whereρ=\(κ−1\)/\(κ\+1\)\\rho=\(\\sqrt\{\\kappa\}\-1\)/\(\\sqrt\{\\kappa\}\+1\); seeTrefethen \([2019](https://arxiv.org/html/2605.24210#bib.bib6)\)\. Since𝒫L⊂PolyL\\mathcal\{P\}\_\{L\}\\subset\\mathrm\{Poly\}\_\{L\}:

infp∈𝒫L‖p​\(𝐊\)−𝐊−1‖≥EL∗\.\\inf\_\{p\\in\\mathcal\{P\}\_\{L\}\}\\\|p\(\\mathbf\{K\}\)\-\\mathbf\{K\}^\{\-1\}\\\|\\geq E\_\{L\}^\{\*\}\.
For the upper bound at depthL\+1L\+1, Proposition[14](https://arxiv.org/html/2605.24210#Thmtheorem14)provides explicit real parametersα1,…,αL\+1\\alpha\_\{1\},\\ldots,\\alpha\_\{L\+1\}such thatpL\+1​\(𝐊\)=∏ℓ=1L\+1\(𝐈\+αℓ​𝐊\)p\_\{L\+1\}\(\\mathbf\{K\}\)=\\prod\_\{\\ell=1\}^\{L\+1\}\(\\mathbf\{I\}\+\\alpha\_\{\\ell\}\\mathbf\{K\}\)satisfies

‖pL\+1​\(𝐊\)−𝐊−1‖≤2λmin​ρL\+1\.\\\|p\_\{L\+1\}\(\\mathbf\{K\}\)\-\\mathbf\{K\}^\{\-1\}\\\|\\leq\\frac\{2\}\{\\lambda\_\{\\min\}\}\\rho^\{L\+1\}\.Combining bounds:

infL​\-layer‖p​\(𝐊\)−𝐊−1‖inf\(L\+1\)​\-layer‖p​\(𝐊\)−𝐊−1‖≥EL∗C​ρL\+1/λmin=Θ​\(ρ−1\)\>1\.\\frac\{\\inf\_\{L\\text\{\-layer\}\}\\\|p\(\\mathbf\{K\}\)\-\\mathbf\{K\}^\{\-1\}\\\|\}\{\\inf\_\{\(L\+1\)\\text\{\-layer\}\}\\\|p\(\\mathbf\{K\}\)\-\\mathbf\{K\}^\{\-1\}\\\|\}\\geq\\frac\{E\_\{L\}^\{\*\}\}\{C\\rho^\{L\+1\}/\\lambda\_\{\\min\}\}=\\Theta\(\\rho^\{\-1\}\)\>1\.Thus the GP posterior \(restricted to accuracyε\\varepsilonwithEL\+1∗<ε<EL∗E\_\{L\+1\}^\{\*\}<\\varepsilon<E\_\{L\}^\{\*\}\) lies inℱTNP\(L\+1,d\)∖ℱTNP\(L,d\)\\mathcal\{F\}\_\{\\mathrm\{TNP\}\}^\{\(L\+1,d\)\}\\setminus\\mathcal\{F\}\_\{\\mathrm\{TNP\}\}^\{\(L,d\)\}\.

Part \(b\): Convolutional branch\.

ℱCNP\(d\)⊊ℱConvCNP\(0\)\\mathcal\{F\}\_\{\\mathrm\{CNP\}\}^\{\(d\)\}\\subsetneq\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(0\)\}:Every CNP function is a pure ConvCNP function: setw=𝟏w=\\mathbf\{1\}\(constant filter\) to recover mean aggregationsC​\(xt\)=∑ih​\(yi\)s\_\{C\}\(x\_\{t\}\)=\\sum\_\{i\}h\(y\_\{i\}\)withρC​\(xt\)=n\\rho\_\{C\}\(x\_\{t\}\)=n, givingF​\(C,xt\)=g​\(1n​∑ih​\(yi\),xt\)F\(C,x\_\{t\}\)=g\(\\frac\{1\}\{n\}\\sum\_\{i\}h\(y\_\{i\}\),x\_\{t\}\)\. Strictness: stationary kernel smoothers are inℱConvCNP\(0\)\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(0\)\}\(Proposition[21](https://arxiv.org/html/2605.24210#Thmtheorem21)\) but not inℱCNP\(d\)\\mathcal\{F\}\_\{\\mathrm\{CNP\}\}^\{\(d\)\}for any finitedd, since the kernel smoother weight onyiy\_\{i\}depends onxtx\_\{t\}viaK​\(xt−xi\)K\(x\_\{t\}\-x\_\{i\}\), which the query\-independent CNP representation cannot capture\.

ℱConvCNP\(0\)⊊ℱConvCNP\(1\)\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(0\)\}\\subsetneq\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(1\)\}:The inclusion follows by setting the CNN filter to zero\. Strictness: consider the one\-hop kernel\-weighted sum

F​\(C,xt\)=∑iK​\(xt−xi\)​\[∑jK​\(xi−xj\)​yj\]∑iK​\(xt−xi\)F\(C,x\_\{t\}\)=\\frac\{\\sum\_\{i\}K\(x\_\{t\}\-x\_\{i\}\)\\bigl\[\\sum\_\{j\}K\(x\_\{i\}\-x\_\{j\}\)\\,y\_\{j\}\\bigr\]\}\{\\sum\_\{i\}K\(x\_\{t\}\-x\_\{i\}\)\}which depends on inter\-context distancesK​\(xi−xj\)K\(x\_\{i\}\-x\_\{j\}\)\. A single CNN layer computes this \(Proposition[52](https://arxiv.org/html/2605.24210#Thmtheorem52)\), but a pure ConvCNP cannot represent it: the weight onyjy\_\{j\}in the pure ConvCNP factors asαj​\(xt\)=w​\(xt−xj\)/∑mw​\(xt−xm\)\\alpha\_\{j\}\(x\_\{t\}\)=w\(x\_\{t\}\-x\_\{j\}\)/\\sum\_\{m\}w\(x\_\{t\}\-x\_\{m\}\), which is independent of all other context locations\{xi\}i≠j\\\{x\_\{i\}\\\}\_\{i\\neq j\}\. The one\-hop sum assignsyjy\_\{j\}a weight proportional to∑iK​\(xt−xi\)​K​\(xi−xj\)\\sum\_\{i\}K\(x\_\{t\}\-x\_\{i\}\)K\(x\_\{i\}\-x\_\{j\}\), which couplesxjx\_\{j\}to every other context point through the intermediate summation\.

ℱConvCNP\(L\)⊊ℱConvCNP\(L\+1\)\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(L\)\}\\subsetneq\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(L\+1\)\}:The same Chebyshev argument as in Part \(a\) applies\. On regular grids,LLCNN layers compute Toeplitz polynomials of degreeLLin𝐊\\mathbf\{K\}\(Proposition[52](https://arxiv.org/html/2605.24210#Thmtheorem52)\)\. The set of achievable Toeplitz polynomials at degreeLLis contained inPolyL\\mathrm\{Poly\}\_\{L\}, so the minimax barrierEL∗E\_\{L\}^\{\*\}applies\. At depthL\+1L\+1, the Chebyshev construction of Theorem[22](https://arxiv.org/html/2605.24210#Thmtheorem22)achieves errorC​ρL\+1/λminC\\rho^\{L\+1\}/\\lambda\_\{\\min\}, giving the same separation ratioΘ​\(ρ−1\)\>1\\Theta\(\\rho^\{\-1\}\)\>1\.

Part \(c\): Incomparability\.

ℱANP\(d\)⊈ℱConvCNP\(L\)\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}\\not\\subseteq\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(L\)\}:Non\-stationary kernel smoothers are ANP\-representable \(Theorem[9](https://arxiv.org/html/2605.24210#Thmtheorem9)\) but not ConvCNP\-representable at any depth, since every ConvCNP is translation equivariant and non\-stationary kernel smoothers generically violate translation equivariance\.

ℱConvCNP\(L\)⊈ℱANP\(d\)\\mathcal\{F\}\_\{\\mathrm\{ConvCNP\}\}^\{\(L\)\}\\not\\subseteq\\mathcal\{F\}\_\{\\mathrm\{ANP\}\}^\{\(d\)\}:ForL≥1L\\geq 1, ConvCNPs with CNN layers can approximate GP posteriors for stationary kernels on regular grids \(Theorem[22](https://arxiv.org/html/2605.24210#Thmtheorem22)\)\. GP posteriors are not ANP\-representable \(Theorem[12](https://arxiv.org/html/2605.24210#Thmtheorem12)\)\. ForL=0L=0, stationary kernel smoothers are exactly representable by pure ConvCNPs \(Proposition[21](https://arxiv.org/html/2605.24210#Thmtheorem21)\) but only approximately representable by ANPs \(Theorem[9](https://arxiv.org/html/2605.24210#Thmtheorem9)\), giving a strict separation in the exact representation sense\. ∎

## Appendix GLatent NP Proofs

###### Proposition 59\(Encoder Bottleneck\)\.

For a latent CNP with encoderh:𝒳×𝒴→ℝdh:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}^\{d\}, ifC∼hC′C\\sim\_\{h\}C^\{\\prime\}\(i\.e\.,h¯C=h¯C′\\bar\{h\}\_\{C\}=\\bar\{h\}\_\{C^\{\\prime\}\}\), then:

q​\(z\|C\)=q​\(z\|C′\)and hencep​\(yT\|XT,C\)=p​\(yT\|XT,C′\)q\(z\|C\)=q\(z\|C^\{\\prime\}\)\\quad\\text\{and hence\}\\quad p\(y\_\{T\}\|X\_\{T\},C\)=p\(y\_\{T\}\|X\_\{T\},C^\{\\prime\}\)for all target configurationsXTX\_\{T\}, regardless of decoder expressiveness\.

###### Proof\.

The latent distribution depends onCConly throughrCr\_\{C\}\. IfrC=rC′r\_\{C\}=r\_\{C^\{\\prime\}\}, thenq​\(z\|C\)=q​\(z\|C′\)q\(z\|C\)=q\(z\|C^\{\\prime\}\)\. The predictive distributionp​\(yT\|XT,C\)=∫p​\(yT\|XT,z\)​q​\(z\|C\)​𝑑zp\(y\_\{T\}\|X\_\{T\},C\)=\\int p\(y\_\{T\}\|X\_\{T\},z\)q\(z\|C\)dzthen coincides forCCandC′C^\{\\prime\}\. ∎

### G\.1Mean Bottleneck

###### Assumption 5\(Gaussian Latent NP\)\.

The latent distribution is Gaussian:q​\(z\|C\)=𝒩​\(m​\(C\),S​\(C\)\)q\(z\|C\)=\\mathcal\{N\}\(m\(C\),S\(C\)\)withm:𝒞→ℝkm:\\mathcal\{C\}\\to\\mathbb\{R\}^\{k\}andS:𝒞→ℝ\+\+k×kS:\\mathcal\{C\}\\to\\mathbb\{R\}^\{k\\times k\}\_\{\+\+\}\. The decoder is linear:f​\(x,z\)=a​\(x\)⊤​z\+b​\(x\)f\(x,z\)=a\(x\)^\{\\top\}z\+b\(x\)witha:𝒳→ℝka:\\mathcal\{X\}\\to\\mathbb\{R\}^\{k\}andb:𝒳→ℝb:\\mathcal\{X\}\\to\\mathbb\{R\}\.

Under this assumption, the predictive distribution at targetsXT=\(xt1,…,xtm\)X\_\{T\}=\(x\_\{t\_\{1\}\},\\ldots,x\_\{t\_\{m\}\}\)is Gaussian with:

𝔼​\[yT\|XT,C\]\\displaystyle\\mathbb\{E\}\[y\_\{T\}\|X\_\{T\},C\]=A​\(XT\)​m​\(C\)\+b​\(XT\)\\displaystyle=A\(X\_\{T\}\)m\(C\)\+b\(X\_\{T\}\)\(14\)Cov​\(yT\|XT,C\)\\displaystyle\\mathrm\{Cov\}\(y\_\{T\}\|X\_\{T\},C\)=A​\(XT\)​S​\(C\)​A​\(XT\)⊤\+σ2​I\\displaystyle=A\(X\_\{T\}\)S\(C\)A\(X\_\{T\}\)^\{\\top\}\+\\sigma^\{2\}I\(15\)whereA​\(XT\)∈ℝm×kA\(X\_\{T\}\)\\in\\mathbb\{R\}^\{m\\times k\}has rowsa​\(xti\)⊤a\(x\_\{t\_\{i\}\}\)^\{\\top\}andb​\(XT\)=\(b​\(xt1\),…,b​\(xtm\)\)⊤b\(X\_\{T\}\)=\(b\(x\_\{t\_\{1\}\}\),\\ldots,b\(x\_\{t\_\{m\}\}\)\)^\{\\top\}\.

###### Proof of Theorem[25](https://arxiv.org/html/2605.24210#Thmtheorem25), part \(a\) \(Mean Bottleneck\)\.

Fix context locationsXCX\_\{C\}withnnpoints in general position for a universal kernelkk\. Defineϕ​\(xt\)=𝐊−1​k​\(XC,xt\)∈ℝn\\phi\(x\_\{t\}\)=\\mathbf\{K\}^\{\-1\}k\(X\_\{C\},x\_\{t\}\)\\in\\mathbb\{R\}^\{n\}, so the GP posterior mean isϕ​\(xt\)⊤​yC\\phi\(x\_\{t\}\)^\{\\top\}y\_\{C\}\.

Step 1: Construct target points with independent weight vectors\.For a universal kernel, the vectors\{ϕ​\(xt\):xt∈𝒳\}\\\{\\phi\(x\_\{t\}\):x\_\{t\}\\in\\mathcal\{X\}\\\}spanℝn\\mathbb\{R\}^\{n\}\. Choosenntarget pointsxt1,…,xtnx\_\{t\_\{1\}\},\\ldots,x\_\{t\_\{n\}\}such thatΦ=\[ϕ​\(xt1\),…,ϕ​\(xtn\)\]⊤∈ℝn×n\\Phi=\[\\phi\(x\_\{t\_\{1\}\}\),\\ldots,\\phi\(x\_\{t\_\{n\}\}\)\]^\{\\top\}\\in\\mathbb\{R\}^\{n\\times n\}is invertible\.

Step 2: Analyze the composite map\.DefineF:ℝk→ℝnF:\\mathbb\{R\}^\{k\}\\to\\mathbb\{R\}^\{n\}byF​\(z\)=\(f​\(xt1,z\),…,f​\(xtn,z\)\)F\(z\)=\(f\(x\_\{t\_\{1\}\},z\),\\ldots,f\(x\_\{t\_\{n\}\},z\)\)\. The matching condition requires:

F​\(g​\(XC,yC\)\)=Φ​yC∀yC∈ℝn\.F\(g\(X\_\{C\},y\_\{C\}\)\)=\\Phi y\_\{C\}\\quad\\forall y\_\{C\}\\in\\mathbb\{R\}^\{n\}\.
Step 3: Dimension argument\.SinceΦ\\Phiis invertible, the mapyC↦Φ​yCy\_\{C\}\\mapsto\\Phi y\_\{C\}is a bijection onℝn\\mathbb\{R\}^\{n\}\. ThusF∘g:ℝn→ℝnF\\circ g:\\mathbb\{R\}^\{n\}\\to\\mathbb\{R\}^\{n\}is surjective\.

The image ofg:ℝn→ℝkg:\\mathbb\{R\}^\{n\}\\to\\mathbb\{R\}^\{k\}has dimension at mostmin⁡\(n,k\)\\min\(n,k\)\. For the compositeF∘gF\\circ gto be surjective ontoℝn\\mathbb\{R\}^\{n\}, we needdim\(im​\(g\)\)≥n\\dim\(\\mathrm\{im\}\(g\)\)\\geq n, hencek≥nk\\geq n\.

The stochastic case reduces to the deterministic case by considering the low\-variance limitq​\(z\|C\)=𝒩​\(m​\(C\),σ2​I\)q\(z\|C\)=\\mathcal\{N\}\(m\(C\),\\sigma^\{2\}I\)asσ→0\\sigma\\to 0\. ∎

### G\.2Covariance Bottleneck

###### Theorem 60\(Covariance Rank Bound\)\.

Under Assumption[5](https://arxiv.org/html/2605.24210#Thmassumption5), the predictive covariance satisfies:

rank⁡\(Cov​\(yT\|XT,C\)−σ2​I\)≤k\\operatorname\{rank\}\\left\(\\mathrm\{Cov\}\(y\_\{T\}\|X\_\{T\},C\)\-\\sigma^\{2\}I\\right\)\\leq kfor any target configurationXTX\_\{T\}and any contextCC\.

###### Proof\.

From \([15](https://arxiv.org/html/2605.24210#A7.E15)\),Cov​\(yT\|XT,C\)−σ2​I=A​\(XT\)​S​\(C\)​A​\(XT\)⊤\\mathrm\{Cov\}\(y\_\{T\}\|X\_\{T\},C\)\-\\sigma^\{2\}I=A\(X\_\{T\}\)S\(C\)A\(X\_\{T\}\)^\{\\top\}\. This matrix has rank at mostmin⁡\(m,k,rank⁡\(S​\(C\)\)\)≤k\\min\(m,k,\\operatorname\{rank\}\(S\(C\)\)\)\\leq k\. ∎

###### Theorem 61\(GP Posterior Covariance Rank\)\.

For a universal kernelkk, the GP posterior covariance matrix atmmtarget pointsxt1,…,xtmx\_\{t\_\{1\}\},\\ldots,x\_\{t\_\{m\}\}distinct from the context locationsXCX\_\{C\}has rankmmgenerically\.

###### Proof\.

The GP posterior covariance is:

Σ~T=KT​T−KT​C​𝐊−1​KC​T\\tilde\{\\Sigma\}\_\{T\}=K\_\{TT\}\-K\_\{TC\}\\mathbf\{K\}^\{\-1\}K\_\{CT\}whereKT​T=\[k​\(xti,xtj\)\]i​j∈ℝm×mK\_\{TT\}=\[k\(x\_\{t\_\{i\}\},x\_\{t\_\{j\}\}\)\]\_\{ij\}\\in\\mathbb\{R\}^\{m\\times m\},KT​C=\[k​\(xti,xj\)\]i​j∈ℝm×nK\_\{TC\}=\[k\(x\_\{t\_\{i\}\},x\_\{j\}\)\]\_\{ij\}\\in\\mathbb\{R\}^\{m\\times n\}, andKC​T=KT​C⊤K\_\{CT\}=K\_\{TC\}^\{\\top\}\.

This is the Schur complement of𝐊\\mathbf\{K\}in the joint covariance matrix:

\(𝐊KC​TKT​CKT​T\)\.\\begin\{pmatrix\}\\mathbf\{K\}&K\_\{CT\}\\\\ K\_\{TC\}&K\_\{TT\}\\end\{pmatrix\}\.
For a universal kernel, this joint matrix is positive definite when alln\+mn\+mpoints are distinct\. The Schur complement of a positive definite block in a positive definite matrix is positive definite\. ThusΣ~T≻0\\tilde\{\\Sigma\}\_\{T\}\\succ 0, which impliesrank⁡\(Σ~T\)=m\\operatorname\{rank\}\(\\tilde\{\\Sigma\}\_\{T\}\)=m\. ∎

###### Corollary 62\(Covariance Mismatch\)\.

A Gaussian latent NP with latent dimensionkkand linear decoder cannot match the GP posterior covariance atm\>km\>ktarget points\.

###### Proof\.

The GP posterior covariance \(minus observation noise\) has rankmmby Theorem[61](https://arxiv.org/html/2605.24210#Thmtheorem61)\. The latent NP predictive covariance \(minus observation noise\) has rank at mostkkby Theorem[60](https://arxiv.org/html/2605.24210#Thmtheorem60)\. Form\>km\>k, exact matching is impossible\. ∎

###### Proof of Theorem[25](https://arxiv.org/html/2605.24210#Thmtheorem25)\(Latent NP Cannot Represent GP Posterior\)\.

Part \(a\) is proven above \(Mean Bottleneck\)\. Part \(b\) is Corollary[62](https://arxiv.org/html/2605.24210#Thmtheorem62)\. Part \(c\) follows from \(b\) since we can choose arbitrarily many target points\. ∎

### G\.3What Latent NPs Can Represent

###### Definition 63\(Finite\-Rank GP\)\.

A GP has rankkkif its covariance kernel admits the representation:

k~​\(x,x′\)=∑i,j=1kSi​j​ai​\(x\)​aj​\(x′\)=a​\(x\)⊤​S​a​\(x′\)\\tilde\{k\}\(x,x^\{\\prime\}\)=\\sum\_\{i,j=1\}^\{k\}S\_\{ij\}a\_\{i\}\(x\)a\_\{j\}\(x^\{\\prime\}\)=a\(x\)^\{\\top\}S\\,a\(x^\{\\prime\}\)for somea:𝒳→ℝka:\\mathcal\{X\}\\to\\mathbb\{R\}^\{k\}and positive semidefiniteS∈ℝk×kS\\in\\mathbb\{R\}^\{k\\times k\}\.

###### Theorem 64\(Latent NP Function Class\)\.

A Gaussian latent NP with latent dimensionkkand linear decoder exactly represents the class of stochastic processes:

ℱlatent\(k\)=\{f​\(x\)=a​\(x\)⊤​z\+b​\(x\):z∼𝒩​\(m,S\),a:𝒳→ℝk,b:𝒳→ℝ,m∈ℝk,S∈ℝ\+\+k×k\}\\mathcal\{F\}\_\{\\mathrm\{latent\}\}^\{\(k\)\}=\\left\\\{f\(x\)=a\(x\)^\{\\top\}z\+b\(x\):z\\sim\\mathcal\{N\}\(m,S\),\\;a:\\mathcal\{X\}\\to\\mathbb\{R\}^\{k\},\\;b:\\mathcal\{X\}\\to\\mathbb\{R\},\\;m\\in\\mathbb\{R\}^\{k\},\\;S\\in\\mathbb\{R\}^\{k\\times k\}\_\{\+\+\}\\right\\\}wheremmandSSmay depend on the contextCC\.

This is the class of GPs with rank at mostkk\.

###### Proof\.

\(⇒\)\(\\Rightarrow\)A Gaussian latent NP with linear decoderf​\(x,z\)=a​\(x\)⊤​z\+b​\(x\)f\(x,z\)=a\(x\)^\{\\top\}z\+b\(x\)and latentz\|C∼𝒩​\(m​\(C\),S​\(C\)\)z\|C\\sim\\mathcal\{N\}\(m\(C\),S\(C\)\)induces:

𝔼​\[f​\(x\)\|C\]\\displaystyle\\mathbb\{E\}\[f\(x\)\|C\]=a​\(x\)⊤​m​\(C\)\+b​\(x\)\\displaystyle=a\(x\)^\{\\top\}m\(C\)\+b\(x\)Cov​\(f​\(x\),f​\(x′\)\|C\)\\displaystyle\\mathrm\{Cov\}\(f\(x\),f\(x^\{\\prime\}\)\|C\)=a​\(x\)⊤​S​\(C\)​a​\(x′\)\\displaystyle=a\(x\)^\{\\top\}S\(C\)a\(x^\{\\prime\}\)which is a rank\-kkGP\.

\(⇐\)\(\\Leftarrow\)Any rank\-kkGP with meanμ​\(x\)=a​\(x\)⊤​m\+b​\(x\)\\mu\(x\)=a\(x\)^\{\\top\}m\+b\(x\)and covariancek~​\(x,x′\)=a​\(x\)⊤​S​a​\(x′\)\\tilde\{k\}\(x,x^\{\\prime\}\)=a\(x\)^\{\\top\}Sa\(x^\{\\prime\}\)is realized by setting the latent distribution to𝒩​\(m,S\)\\mathcal\{N\}\(m,S\)and decoder tof​\(x,z\)=a​\(x\)⊤​z\+b​\(x\)f\(x,z\)=a\(x\)^\{\\top\}z\+b\(x\)\. ∎

###### Corollary 65\(Representable Function Families\)\.

The classℱlatent\(k\)\\mathcal\{F\}\_\{\\mathrm\{latent\}\}^\{\(k\)\}includes:

1. \(a\)Bayesian linear regression withkkbasis functions:f​\(x\)=∑j=1kzj​ϕj​\(x\)f\(x\)=\\sum\_\{j=1\}^\{k\}z\_\{j\}\\phi\_\{j\}\(x\)\.
2. \(b\)GPs with degenerate \(finite\-rank\) kernels\.
3. \(c\)Anykk\-parameter function family with linear parameter dependence and Gaussian parameter posterior\.

###### Corollary 66\(Spectral Decay Rates\)\.

For common kernels on\[0,1\]d\[0,1\]^\{d\}:

1. \(a\)RBF kernel:λj∼e−c​j2/d\\lambda\_\{j\}\\sim e^\{\-cj^\{2/d\}\}, giving errorO​\(e−c′​k2/d\)O\(e^\{\-c^\{\\prime\}k^\{2/d\}\}\)\.
2. \(b\)Matérn\-ν\\nukernel:λj∼j−\(2​ν\+d\)/d\\lambda\_\{j\}\\sim j^\{\-\(2\\nu\+d\)/d\}, giving errorO​\(k−\(2​ν\)/d\)O\(k^\{\-\(2\\nu\)/d\}\)\.
3. \(c\)Polynomial kernel of degreepp:λj=0\\lambda\_\{j\}=0forj\>\(p\+dd\)j\>\\binom\{p\+d\}\{d\}, giving zero error fork≥\(p\+dd\)k\\geq\\binom\{p\+d\}\{d\}\.

Similar Articles

Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models

arXiv cs.LG

This paper introduces a quantitative framework for estimating how many near-orthogonal directions a transformer language model's latent space can support, based on the linear representation and superposition hypotheses. The authors define representational capacity as an upper bound on distinguishable features and show it is exponentially sensitive to the allowed deviation from orthogonality, with larger models favoring tighter constraints.

Generalized Neurons

ML at Berkeley

The article explores the Universal Approximation Theorem in deep learning, analyzing the representation capacity of individual neurons and neural network layers using ReLU activation functions.