Learning from almost nothing: How neural networks survive heavy input corruption

arXiv cs.LG Papers

Summary

This paper investigates how neural networks maintain well-above-chance accuracy even when more than 90% of input features are corrupted. Using a mean-field-inspired analysis of infinite-width networks, it derives a centroid-based (nearest-class-mean) decision rule in the high-noise limit.

arXiv:2606.11319v1 Announce Type: new

Cached at: 06/11/26, 01:46 PM

# Learning from almost nothing: How neural networks survive heavy input corruption
Source: [https://arxiv.org/html/2606.11319](https://arxiv.org/html/2606.11319)
Justin Tahmassebpur, Asadullah Bhuiyan (Department of Physics, Cornell University, Ithaca, NY 14853, USA), Hyejin Kim (Department of Physics, Cornell University, Ithaca, NY 14853, USA), Omri Lesser (Department of Physics, Cornell University, Ithaca, NY 14853, USA)

###### Abstract

Learning from imperfect data is a central theme in machine learning, connecting practical questions of robustness to fundamental questions of learnability. Here we examine attribute noise: learning from corrupted inputs while keeping the labels intact, a setting that has received considerably less analytical attention than its label-noise counterpart. We consider two types of corruption models: additive noise and replacement noise. Through experiments with multi-layer perceptrons (MLPs) on corrupted classification datasets, we find that neural networks remain robust, maintaining well-above-chance accuracy even when inputs are $>90\%$ corrupted, far beyond human recognition. To understand this robustness, we analyze infinite-width networks in the heavy-corruption regime using a mean-field-inspired approach and derive a leading-order decision rule for the classification outcome: the network implements a prototype rule, the nearest-class-mean, assigning each test point to the class whose training-set average it most closely resembles. This leading-order decision rule is universal across a broad range of MLP architectures, holding for any depth, as well as a wide class of activation functions and noise distributions. The same centroid mechanism closely matches finite-width network behavior in our experiments and provides an interpretable and analytically tractable account of why learning can succeed even when individual training examples carry almost no signal.

###### Contents

1. [I Introduction](https://arxiv.org/html/2606.11319#S1)
2. [II Feature corruption models and empirical results](https://arxiv.org/html/2606.11319#S2)
   1. [II.1 Corruption models and setup](https://arxiv.org/html/2606.11319#S2.SS1)
   2. [II.2 Robustness to training-set corruption](https://arxiv.org/html/2606.11319#S2.SS2)
   3. [II.3 Robustness to simultaneous training-set and testing-set corruption](https://arxiv.org/html/2606.11319#S2.SS3)
3. [III Analytical treatment](https://arxiv.org/html/2606.11319#S3)
   1. [III.1 Derivation of the high-noise centroid rule](https://arxiv.org/html/2606.11319#S3.SS1)
   2. [III.2 Comparison with finite-width networks](https://arxiv.org/html/2606.11319#S3.SS2)
   3. [III.3 Signal-to-noise ratio](https://arxiv.org/html/2606.11319#S3.SS3)
   4. [III.4 Permutation symmetry and the centroid response](https://arxiv.org/html/2606.11319#S3.SS4)
4. [IV Discussion and outlook](https://arxiv.org/html/2606.11319#S4)
5. [References](https://arxiv.org/html/2606.11319#bib)
6. [A Experimental details](https://arxiv.org/html/2606.11319#A1)
   1. [A.1 Experimental setup](https://arxiv.org/html/2606.11319#A1.SS1)
   2. [A.2 Squared-error loss for classification](https://arxiv.org/html/2606.11319#A1.SS2)
7. [B Further analysis of simultaneous train-time and test-time corruption](https://arxiv.org/html/2606.11319#A2)
8. [C High-noise NTK expansion](https://arxiv.org/html/2606.11319#A3)
   1. [C.1 Setup](https://arxiv.org/html/2606.11319#A3.SS1)
   2. [C.2 Useful identities](https://arxiv.org/html/2606.11319#A3.SS2)
   3. [C.3 Influence of input perturbations on the output](https://arxiv.org/html/2606.11319#A3.SS3)
   4. [C.4 Output fluctuations under Gaussian kernel perturbations](https://arxiv.org/html/2606.11319#A3.SS4)
   5. [C.5 The output for an arbitrary i.i.d. noise model](https://arxiv.org/html/2606.11319#A3.SS5)
9. [D Nonnegativity of the coefficient multiplying the centroid](https://arxiv.org/html/2606.11319#A4)
10. [E Empirical validation of the centroid model for additive-Gaussian noise](https://arxiv.org/html/2606.11319#A5)
11. [F Dependence of test accuracy on activation](https://arxiv.org/html/2606.11319#A6)
12. [G Explanation of non-monotonic test accuracy variance](https://arxiv.org/html/2606.11319#A7)

## I Introduction

Machine learning is fundamentally reshaping science and technology, raising profound questions about both its capabilities and its limitations. A central challenge in this context is learning from imperfect data: how can a model generalize when the examples available during training are degraded? The canonical setting is that of *label noise*, where the target labels in a classification or regression task are corrupted \[[3](https://arxiv.org/html/2606.11319#bib.bib20),[19](https://arxiv.org/html/2606.11319#bib.bib21),[6](https://arxiv.org/html/2606.11319#bib.bib23),[10](https://arxiv.org/html/2606.11319#bib.bib81),[26](https://arxiv.org/html/2606.11319#bib.bib25),[34](https://arxiv.org/html/2606.11319#bib.bib31),[28](https://arxiv.org/html/2606.11319#bib.bib82)\]. The complementary problem of *attribute noise*, where the labels remain correct but the input features themselves are corrupted, has received considerably less analytical attention. Furthermore, attribute noise is common in practice: measured attributes are often degraded by sensor noise, acquisition errors, missing features, or other imperfections, while the associated labels may remain reliable \[[9](https://arxiv.org/html/2606.11319#bib.bib96)\].

![Refer to caption](https://arxiv.org/html/2606.11319v1/x1.png)

Figure 1: Test accuracy vs. corruption probability $p$ for width-128, 3-layer ReLU MLPs trained with cross-entropy loss on images with i.i.d. pixel replacement noise. Three class-balanced datasets are shown: (a) MNIST, (b) Fashion-MNIST, and (c) KMNIST. Below each panel we demonstrate the noise model by showing representative examples of images subject to replacement noise at increasing $p$. For each $p$, an MLP is trained on a dataset where each pixel is independently replaced with uniform noise with probability $p$; the test set is left clean. Accuracy metrics are averaged over 100 independent noisy training realizations per $p$ value, and shaded bands surrounding each curve represent one standard deviation across realizations (see Appendix [G](https://arxiv.org/html/2606.11319#A7) for a discussion of the non-monotonic behavior of the standard deviation). Even when the replacement noise has corrupted up to 80% ($p = 0.8$) of each image in the training set, test accuracy remains at or above $\sim 70\%$ across all three datasets, with accuracy collapsing toward uniform random guessing ($100/C = 10\%$ for $C = 10$ classes, dashed line) only near $p \to 1$.

Attribute noise has been examined from several perspectives. Early work showed that injecting noise into inputs during training can act as a regularizer and improve generalization \[[22](https://arxiv.org/html/2606.11319#bib.bib33),[16](https://arxiv.org/html/2606.11319#bib.bib34),[5](https://arxiv.org/html/2606.11319#bib.bib32),[2](https://arxiv.org/html/2606.11319#bib.bib35),[36](https://arxiv.org/html/2606.11319#bib.bib74),[32](https://arxiv.org/html/2606.11319#bib.bib75),[41](https://arxiv.org/html/2606.11319#bib.bib76),[31](https://arxiv.org/html/2606.11319#bib.bib103)\]. Later work addressed feature corruption and deletion more directly, developing classifiers that are robust to missing inputs and training procedures that explicitly marginalize over corrupted features \[[11](https://arxiv.org/html/2606.11319#bib.bib78),[8](https://arxiv.org/html/2606.11319#bib.bib79),[39](https://arxiv.org/html/2606.11319#bib.bib77)\]. A separate line of research considers *test-time* corruption, asking how models trained on clean data degrade when evaluated on corrupted inputs \[[15](https://arxiv.org/html/2606.11319#bib.bib85)\]. This form of robustness is quantified by benchmarks such as ImageNet-C \[[14](https://arxiv.org/html/2606.11319#bib.bib83)\] and MNIST-C \[[25](https://arxiv.org/html/2606.11319#bib.bib84)\]. Together, these bodies of work address several important aspects of robustness to corruption. However, they leave open a basic question: can neural networks trained on *severely* corrupted attributes, but clean labels, learn a coherent classification rule and generalize well to clean test data?

In this paper, we show that the answer, surprisingly, is “yes”. Empirically, we train multi-layer perceptrons (MLPs) on corrupted versions of the standard image datasets MNIST \[[21](https://arxiv.org/html/2606.11319#bib.bib93)\], KMNIST \[[7](https://arxiv.org/html/2606.11319#bib.bib94)\], and Fashion-MNIST \[[40](https://arxiv.org/html/2606.11319#bib.bib95)\], using two corruption models appropriate for image data: pixel replacement noise and additive noise. Across all three datasets, we find that the networks remain strikingly robust even when the training inputs are severely corrupted; see Fig. [1](https://arxiv.org/html/2606.11319#S1.F1). At such high noise levels, the training images are nearly indistinguishable to the human eye, yet the network can still learn to classify clean test images far better than chance.

We then explain this robustness analytically using the neural tangent kernel (NTK) framework \[[18](https://arxiv.org/html/2606.11319#bib.bib59)\]. For any wide MLP trained on arbitrary datasets with clean labels and generic independent and identically distributed (i.i.d.) input corruption, we derive an explicit analytical form for the learned predictor in the high-noise limit. The resulting high-noise decision rule is simple: the predictor is a linear nearest-class-mean, or centroid \[[38](https://arxiv.org/html/2606.11319#bib.bib67)\], classifier on top of a competing stochastic background.

This theoretical result is then compared to numerical experiments on finite-width MLPs, and we find excellent agreement for both raw logit outputs and the $\mathrm{argmax}$-filtered test accuracy. This empirical agreement with finite-width networks suggests that their surprising robustness on the image datasets considered here comes not from a complicated denoising or feature-recovery procedure, but from aggregating the weak class information that survives the corruption into a prototype-based decision rule: the centroid rule.

Furthermore, the leading form of the high\-noise decision rule is independent of depth, activation, and the details of the chosen noise distribution\. The origin of this universality can be understood through the lens of symmetry and linear response: at pure noise, training examples become exchangeable, and the leading class\-dependent response to a small surviving non\-exchangeable signal is forced to depend on centered class averages\.

These results connect our work to prototype-based views of classification, which have appeared in both classical and modern settings, including nearest-class-mean methods, prototypical networks, and recent analyses of class-mean structure in deep representations \[[38](https://arxiv.org/html/2606.11319#bib.bib67),[37](https://arxiv.org/html/2606.11319#bib.bib89),[13](https://arxiv.org/html/2606.11319#bib.bib88),[29](https://arxiv.org/html/2606.11319#bib.bib90),[35](https://arxiv.org/html/2606.11319#bib.bib91),[20](https://arxiv.org/html/2606.11319#bib.bib92)\]. Our setting differs in emphasis: we do not assume a prototype-based model from the outset, nor do we study prototype geometry as an empirical tendency of learned representations. Instead, we *derive* the prototype rule analytically as the effective predictor of a wide network trained in the high-attribute-noise regime. More broadly, our results identify a rare regime in which neural networks can be understood in unusually explicit terms: despite the complexity of the architecture, the learned rule itself becomes simple, universal, and interpretable.

The remainder of this paper is organized as follows. In Sec. [II](https://arxiv.org/html/2606.11319#S2), we introduce the two corruption models and present our motivating empirical finding: MLPs classify clean test data well above chance even with heavy train-time corruption. In addition, we present results on MLP classification with independent test-time *and* train-time corruption, finding that classification power is optimized when the training-set corruption strength matches that of the test set. Following this, we derive a theoretical prediction for the trained network output in the high-noise regime using the NTK framework and compare with numerical experiments on finite-width MLPs in Sec. [III](https://arxiv.org/html/2606.11319#S3). Furthermore, we present a symmetry argument for the origin of the decision rule and its universality across noisy MLP architectures. Finally, Sec. [IV](https://arxiv.org/html/2606.11319#S4) discusses the implications of our results and gives an outlook. Details of the numerical experiments, technical derivations, and supporting analyses are collected in the Appendices.

## II Feature corruption models and empirical results

In this section, we present an empirical study of multiclass classification with MLPs trained on feature-corrupted inputs. We first present the corruption channels of interest, then empirical results for MLPs trained on correctly labeled but feature-corrupted inputs and tested on clean data; this setup is treated analytically in Sec. [III](https://arxiv.org/html/2606.11319#S3). In addition, we briefly present empirical results for MLP performance when both the training *and* testing examples are independently corrupted, with further details provided in Appendix [B](https://arxiv.org/html/2606.11319#A2).

Empirical findings presented in this section were obtained from fully-connected neural networks with 3 hidden layers of width 128 and rectified linear unit (ReLU) activation, trained for 20 epochs using the Adam optimizer with batch size 128 and cross-entropy loss, and averaged over 100 random seeds. Further details on the experiments can be found in Appendix [A](https://arxiv.org/html/2606.11319#A1).

### II.1 Corruption models and setup

We begin by introducing the notation and noise models used throughout this work. Let $x_{k\chi}$ denote the $k$-th feature of *clean* training example $\chi$, where $k = 1, \ldots, d$ indexes the feature dimension and $\chi = 1, \ldots, N$ indexes the training set. The corresponding corrupted data point is denoted by $\tilde{x}_{k\chi}$. Our main theoretical analysis focuses on the large-feature-dimension regime $d \gg 1$, which is natural for image data and many modern high-dimensional learning settings.

We consider two corruption models, as shown in Fig. [2](https://arxiv.org/html/2606.11319#S2.F2). The first is *replacement noise*,

$$\tilde{x}_{k\chi} = (1 - b_{k\chi})\,x_{k\chi} + b_{k\chi}\,u_{k\chi}, \tag{1}$$

where $b_{k\chi}$ are i.i.d. Bernoulli random variables with $\Pr(b_{k\chi} = 1) = p$, so that each feature (each pixel in image data) is independently replaced with probability $p$, and $u_{k\chi}$ is an i.i.d. noise variable. In our numerical experiments we take $u_{k\chi} \sim \mathrm{Unif}[0,1]$, which is a natural choice for grayscale image data such as MNIST, where each of the $28 \times 28$ pixels takes on a value in $[0,1]$. The second corruption model is *additive noise*,

$$\tilde{x}_{k\chi} = (1 - p)\,x_{k\chi} + p\,\xi_{k\chi}, \tag{2}$$

where $\xi_{k\chi}$ is an i.i.d. noise variable. Notice that in both models, $p = 0$ corresponds to the clean case and $p = 1$ is the maximally corrupted case, where the training data becomes completely random. The empirical results in the main text focus on replacement noise, while additive Gaussian noise is treated in Appendix [E](https://arxiv.org/html/2606.11319#A5).
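The two corruption channels of Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the random stand-in data, and the Gaussian scale `noise_std` are assumptions made here for the example.

```python
import numpy as np

def replacement_noise(x, p, rng):
    """Eq. (1): replace each feature independently, with probability p, by Unif[0, 1] noise."""
    b = rng.random(x.shape) < p           # Bernoulli(p) mask; True means "replace"
    u = rng.uniform(0.0, 1.0, x.shape)    # uniform noise, the choice quoted for the experiments
    return np.where(b, u, x)

def additive_noise(x, p, rng, noise_std=1.0):
    """Eq. (2): convex combination (1 - p) * x + p * xi with i.i.d. Gaussian xi."""
    xi = rng.normal(0.0, noise_std, x.shape)  # noise_std is an illustrative assumption
    return (1.0 - p) * x + p * xi

rng = np.random.default_rng(0)
x_clean = rng.uniform(0.0, 1.0, size=(4, 784))  # stand-in for 4 flattened 28x28 images
x_rep = replacement_noise(x_clean, p=0.9, rng=rng)
x_add = additive_noise(x_clean, p=0.9, rng=rng)
```

In both functions `p = 0` returns the clean input and `p = 1` returns pure noise, matching the two limits described in the text.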

Although our empirical results use specific distributions (uniform and Gaussian), we show in Appendix [C](https://arxiv.org/html/2606.11319#A3) that the analytical framework applies more broadly to generic $u_{k\chi}, \xi_{k\chi}$ drawn i.i.d. from *any* distribution, since only the first two moments of the distribution enter the calculation of the signal. Equivalently, a specific choice of noise distribution can be replaced by a Gaussian with matched mean and variance without altering the result. This is similar in spirit to the Gaussian equivalence principle from the high-dimensional learning literature \[[23](https://arxiv.org/html/2606.11319#bib.bib100),[12](https://arxiv.org/html/2606.11319#bib.bib60),[17](https://arxiv.org/html/2606.11319#bib.bib101),[24](https://arxiv.org/html/2606.11319#bib.bib102)\], under which the training and generalization behavior of a high-dimensional model depends on the data distribution only through its first two moments.

![Refer to caption](https://arxiv.org/html/2606.11319v1/x2.png)

Figure 2: Visualization of the two corruption models used in this work applied to an MNIST image. The top panel shows the replacement noise model, where each pixel is independently replaced with uniform noise with probability $p$; see Eq. ([1](https://arxiv.org/html/2606.11319#S2.E1)). The bottom panel shows the additive noise model, where each pixel is replaced by a convex combination of the clean pixel and an independent noise variable, with mixing parameter $p$; see Eq. ([2](https://arxiv.org/html/2606.11319#S2.E2)). The panels show increasing corruption levels $p$.

Importantly, the labels are assumed to remain clean throughout training. We will denote the label vector associated with class $i = 1, \ldots, C$ as $y_{i\chi}$, with $y_{i\chi}$ one-hot in the class index. We assume that the dataset is class balanced, so that each of the $C$ classes contains the same number of training examples. This assumption is not essential; it is adopted here only to streamline the discussion, and is relaxed in Appendix [C](https://arxiv.org/html/2606.11319#A3).

### II.2 Robustness to training-set corruption

Figure [1](https://arxiv.org/html/2606.11319#S1.F1) summarizes the basic empirical phenomenon. Using the established image datasets MNIST, Fashion-MNIST, and KMNIST \[[21](https://arxiv.org/html/2606.11319#bib.bib93),[7](https://arxiv.org/html/2606.11319#bib.bib94),[40](https://arxiv.org/html/2606.11319#bib.bib95)\], we show the test accuracy of trained MLPs as a function of the corruption probability $p$ for replacement-noise corrupted training data, tested on *clean* data. Representative corrupted images are displayed below each panel. By $p = 0.75$, the training images are already extremely difficult for a human to recognize, yet the networks still achieve strikingly high clean-test accuracy: around $80\%$–$90\%$ on MNIST and $70\%$–$80\%$ on Fashion-MNIST and KMNIST in this regime. In other words, even when trained on inputs that appear almost devoid of recognizable structure, the networks retain predictive power substantially higher than random chance.

We also note an interesting feature of the standard deviation of the test accuracy as a function of $p$ in Fig. [1](https://arxiv.org/html/2606.11319#S1.F1): it is non-monotonic. Specifically, the standard deviation is relatively small near $p = 1$, increases as $p$ decreases, reaches a maximum at an intermediate corruption level, and then decreases again for smaller $p$. At first glance this is surprising: a reasonable expectation is for the network to be maximally uncertain at $p = 1$, where it is trained on pure noise. However, in Appendix [G](https://arxiv.org/html/2606.11319#A7) we show that for multi-class problems ($C > 2$), this is not the case: the standard deviation generically peaks at $p^{*} < 1$. This effect arises from the indicator form of the test accuracy.

### II.3 Robustness to simultaneous training-set and testing-set corruption

![Refer to caption](https://arxiv.org/html/2606.11319v1/x3.png)

Figure 3: Train-time and test-time corruption in MNIST data: accuracy vs. train corruption level $p_{\text{train}}$, for several values of test corruption $p_{\text{test}}$. We observe a non-monotonic behavior, where the optimal $p_{\text{train}}$ (dots in the figure) is near $p_{\text{test}}$, rather than 0. All networks and hyperparameters are the same as in Fig. [1](https://arxiv.org/html/2606.11319#S1.F1).

So far we have trained models with some corruption level $p$ and tested them on clean data. We now consider a generalized scenario, where *both* the training set and the test set are corrupted, with corruption levels $p_{\text{train}}$ and $p_{\text{test}}$, respectively. Figure [3](https://arxiv.org/html/2606.11319#S2.F3) shows the resulting test accuracy as a function of $p_{\text{train}}$, for several values of $p_{\text{test}}$ (we emphasize that the $p_{\text{test}} = 0$ curve is the same as in Fig. [1](https://arxiv.org/html/2606.11319#S1.F1)).

The naive expectation might be that the optimal value of $p_{\text{train}}$ would be zero: one would expect that training with clean data will always be preferred. However, we find that for a given value of $p_{\text{test}}$, the accuracy is non-monotonic as a function of $p_{\text{train}}$, and in fact the optimal value is $p_{\text{train}} \simeq p_{\text{test}}$. The interpretation is that, when training and testing use the same level of corruption, the model generalizes within the same data distribution rather than out of distribution, and in-distribution generalization is more likely to succeed. In contrast, for a fixed value of $p_{\text{train}}$, it is always optimal to use $p_{\text{test}} = 0$, as is evident from Fig. [3](https://arxiv.org/html/2606.11319#S2.F3). In Appendix [B](https://arxiv.org/html/2606.11319#A2) we demonstrate this and also discuss a strategy to handle situations when $p_{\text{test}}$ is not known exactly.

## III Analytical treatment

We now turn to an analytical explanation of the empirical robustness observed above\. We show that, in the severe\-corruption regime, the network output simplifies dramatically: rather than learning a complicated denoising rule, a wide MLP trained on corrupted inputs reduces at leading order to a centroid classifier on the clean data\. We derive this result using the neural tangent kernel \(NTK\) description of infinite\-width networks, in which training with mean\-squared error is equivalent to kernel regression with the NTK\.

Throughout this section, we use the standard NTK initialization, with weights and biases drawn independently from mean-zero Gaussian distributions \[[4](https://arxiv.org/html/2606.11319#bib.bib98)\]. We also normalize both the clean training and test inputs before applying corruption. Specifically, each input is normalized and centered so that

$$\frac{1}{d}\sum_{k=1}^{d} x_{k\alpha} = 0, \qquad \frac{1}{d}\sum_{k=1}^{d} x_{k\alpha}^{2} = 1, \tag{3}$$

where $\alpha = \chi$ denotes a training example and $\alpha = *$ denotes a test example. This normalization is applied to both the training and test sets for notational and conceptual simplicity; the more general case is discussed in Appendix [C](https://arxiv.org/html/2606.11319#A3). Since the clean data are centered and normalized in this section, we correspondingly take the replacement noise to be

$$u_{k\chi} \sim \mathrm{Unif}[-\sqrt{3}, \sqrt{3}],$$

so that the pure-noise inputs also have mean zero and variance one. This choice aligns the first two moments of the noise with the normalized data.
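As a small sketch of this preprocessing step, the snippet below centers and rescales each input as in Eq. (3) and draws moment-matched uniform noise. The function names and the random stand-in data are assumptions for illustration; the only fact used about the noise is that $\mathrm{Unif}[-a, a]$ has variance $a^{2}/3$, which equals one for $a = \sqrt{3}$.

```python
import numpy as np

def normalize_features(x):
    """Center and rescale each row so its feature mean is 0 and its mean square is 1 (Eq. 3)."""
    x = x - x.mean(axis=1, keepdims=True)
    rms = np.sqrt((x ** 2).mean(axis=1, keepdims=True))
    return x / rms  # assumes no row is constant, so rms > 0

def matched_uniform_noise(shape, rng):
    """Unif[-sqrt(3), sqrt(3)]: mean 0 and variance (sqrt(3))**2 / 3 = 1."""
    a = np.sqrt(3.0)
    return rng.uniform(-a, a, shape)

rng = np.random.default_rng(0)
x_raw = rng.uniform(0.0, 1.0, size=(4, 784))  # stand-in for raw pixel intensities
x_norm = normalize_features(x_raw)
u = matched_uniform_noise(x_norm.shape, rng)   # noise with the same first two moments
```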

### III.1 Derivation of the high-noise centroid rule

We now present the effective decision rule learned in the high-noise regime. In the infinite-width limit, training an $L$-layer MLP (we use the convention that a depth-$L$ network has $L-1$ hidden layers) with MSE loss is equivalent to kernel regression with the NTK \[[18](https://arxiv.org/html/2606.11319#bib.bib59)\]. We will go through a sketch of the derivation in this section; a detailed derivation for arbitrary depth and activation is provided in Appendix [C](https://arxiv.org/html/2606.11319#A3).

The class-$i$ output logit of the MLP on a clean test point $\boldsymbol{x}_{*}$ is given by \[[33](https://arxiv.org/html/2606.11319#bib.bib97)\]

$$\tilde{f}_{i}(\boldsymbol{x}_{*};\epsilon) = \tilde{\boldsymbol{\Theta}}_{*\chi}^{(L)}\left[\tilde{\boldsymbol{\Theta}}_{\chi\chi}^{(L)}\right]^{-1}\boldsymbol{y}_{i}, \tag{4}$$

where $\tilde{\boldsymbol{\Theta}}_{\chi\chi}^{(L)}$ is the depth-$L$ NTK on the noisy training set, $\tilde{\boldsymbol{\Theta}}_{*\chi}^{(L)}$ is the test-train NTK, $\boldsymbol{y}_{i} \in \{0,1\}^{N}$ is the target one-hot vector for class $i$, and we have defined the signal strength

$$\epsilon \equiv 1 - p \tag{5}$$

for convenience. The pure-noise limit corresponds to $\epsilon = 0$, while $\epsilon = 1$ corresponds to clean training inputs. Notice that the test-train NTK is a $1 \times N$ vector with respect to the training data, whereas the train-train NTK is an $N \times N$ matrix in the same space.
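The algebra of Eq. (4) is ordinary kernel regression, which can be sketched as follows. The helper names are hypothetical, and `toy_kernel` (a bias plus a normalized overlap) is only an illustrative stand-in so the example runs; it is not the depth-$L$ NTK of the paper. The training set is kept smaller than the feature dimension so that the train-train matrix is invertible.

```python
import numpy as np

def kernel_regression_logits(theta_test_train, theta_train_train, y_onehot):
    """Eq. (4) for all classes at once: Theta_{* chi} [Theta_{chi chi}]^{-1} Y."""
    # Solve the linear system rather than forming the inverse explicitly.
    alpha = np.linalg.solve(theta_train_train, y_onehot)  # shape (N, C)
    return theta_test_train @ alpha                        # shape (n_test, C)

def toy_kernel(xa, xb, c_b=0.1, c_w=1.0):
    """Illustrative stand-in kernel: bias plus normalized overlap (assumed, not the full NTK)."""
    return c_b + c_w * (xa @ xb.T) / xa.shape[1]

rng = np.random.default_rng(0)
N, d, C = 20, 50, 4
x_train = rng.normal(size=(N, d))
labels = np.arange(N) % C               # class-balanced labels
y = np.eye(C)[labels]                   # one-hot targets, shape (N, C)
x_test = rng.normal(size=(3, d))

logits = kernel_regression_logits(toy_kernel(x_test, x_train),
                                  toy_kernel(x_train, x_train), y)
pred = logits.argmax(axis=1)            # predicted class per test point
```

Evaluating the predictor on the training points themselves returns the one-hot targets, which is the interpolation property implied by Eq. (4).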

We now sketch the derivation of the logit output in the high-noise limit. The key simplification occurs in the first layer, where the NTK has the simple form

$$\tilde{\Theta}^{(1)}_{\alpha\beta}(\epsilon) = C_{b} + C_{w}\tilde{X}_{\alpha\beta}(\epsilon), \tag{6}$$

where $C_{w} > 0$ and $C_{b} > 0$ are the weight and bias variances at initialization,

$$\tilde{X}_{\alpha\beta}(\epsilon) \equiv \frac{1}{d}\sum_{k=1}^{d}\tilde{x}_{k\alpha}\tilde{x}_{k\beta} \tag{7}$$

denotes the normalized noisy feature overlap matrix, and $\alpha, \beta$ denote either train or test indices. In a mean-field-inspired decomposition, we separate the overlap matrix into a deterministic term obtained from averaging over the corruption noise, plus a subleading fluctuation

$$\tilde{X}_{\alpha\beta}(\epsilon) = X_{\alpha\beta}(\epsilon) + \delta\tilde{X}_{\alpha\beta}(\epsilon), \tag{8}$$

where

$$X_{\alpha\beta}(\epsilon) \equiv \mathbb{E}[\tilde{X}_{\alpha\beta}(\epsilon)] \tag{9}$$

is the noise-averaged overlap and

$$\delta\tilde{X}_{\alpha\beta}(\epsilon) = \frac{1}{\sqrt{d}}\,\eta_{\alpha\beta}(\epsilon) + \mathcal{O}(d^{-1}) \tag{10}$$

is a non-deterministic fluctuation of $\mathcal{O}(d^{-1/2})$. In the large-$d$ limit, $\eta_{\alpha\beta}(\epsilon)$ is asymptotically Gaussian with zero mean via the central limit theorem, while the $\mathcal{O}(d^{-1})$ term captures the subleading finite-$d$ corrections to this Gaussian limit.
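The mean-plus-fluctuation split of Eq. (8) can be checked numerically for the test-train overlap under replacement noise. For a clean test point and zero-mean noise, averaging Eq. (1) over the corruption gives $\mathbb{E}[\tilde{x}_{k\chi}] = (1-p)\,x_{k\chi}$, so the noise-averaged test-train overlap is $(1-p)$ times the clean overlap; this one-line consequence is worked out here for the illustration rather than quoted from the paper. The synthetic vectors, seed, and sample counts below are arbitrary assumptions.

```python
import numpy as np

def overlap(xa, xb):
    """Eq. (7) for a single pair of examples: (1/d) sum_k xa_k * xb_k."""
    return xa @ xb / xa.shape[-1]

def normalize(v):
    """Center and rescale one example as in Eq. (3)."""
    v = v - v.mean()
    return v / np.sqrt((v ** 2).mean())

rng = np.random.default_rng(0)
d, p, n_draws = 784, 0.9, 2000
a = np.sqrt(3.0)                                  # Unif[-sqrt(3), sqrt(3)] has unit variance

x_star = normalize(rng.normal(size=d))            # synthetic clean test point
x_chi = normalize(x_star + rng.normal(size=d))    # synthetic clean training point, correlated with it

samples = np.empty(n_draws)
for t in range(n_draws):
    b = rng.random(d) < p                         # replacement mask of Eq. (1)
    u = rng.uniform(-a, a, d)
    samples[t] = overlap(x_star, np.where(b, u, x_chi))

mean_overlap = samples.mean()                     # Monte Carlo estimate of the noise-averaged overlap
predicted_mean = (1 - p) * overlap(x_star, x_chi) # (1 - p) times the clean overlap
fluct_scaled = samples.std() * np.sqrt(d)         # stays of order one if fluctuations scale as d**-0.5
```

Comparing `mean_overlap` with `predicted_mean`, and inspecting `fluct_scaled`, gives a quick feel for how a deterministic signal of size $\epsilon$ competes with fluctuations of size $d^{-1/2}$.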

Through the layer-to-layer recursion formulas \[[33](https://arxiv.org/html/2606.11319#bib.bib97),[4](https://arxiv.org/html/2606.11319#bib.bib98)\], the depth-$L$ NTK inherits its entire data dependence, and in particular the mean-plus-fluctuation split of Eq. ([8](https://arxiv.org/html/2606.11319#S3.E8)), from $\tilde{X}_{\alpha\beta}$. The full layer-wise propagation is carried out in Appendix [C](https://arxiv.org/html/2606.11319#A3); here we simply track the two pieces of Eq. ([8](https://arxiv.org/html/2606.11319#S3.E8)) through to the output.

![Refer to caption](https://arxiv.org/html/2606.11319v1/x4.png)

Figure 4: Numerical verification of the high-noise theory. All networks used to create the data in this figure are width-2048, 3-layer $\mathrm{erf}$ MLPs, trained with mean squared error (MSE) loss on noisy (a) MNIST, (b) Fashion-MNIST, and (c) KMNIST with $N = 4000$. Main panels: mean clean-test accuracy versus replacement-noise strength $p$ for a nested ensemble of 20 noisy training sets and 10 MLPs per noisy set (red curve). Error bars show one standard deviation. The grey-dashed curve is the mean clean-test accuracy of the fitted effective model in Eq. ([16](https://arxiv.org/html/2606.11319#S3.E16)). Insets: output-level comparison at $p = 0.97$, using a nested ensemble of 100 noisy training sets and 50 MLPs per noisy set. Each point corresponds to one clean test image and one candidate class, comparing the ensemble-mean network output with the fitted prediction from Eq. ([16](https://arxiv.org/html/2606.11319#S3.E16)). The black dashed line in the inset indicates perfect agreement.

We work near the pure-noise point and in the large-feature-dimension limit: $\epsilon \ll 1$ and $\sqrt{d} \gg 1$. Throughout this expansion, the argument $(0)$ denotes evaluation at pure noise, $\epsilon = 0$ (equivalently $p = 1$). With this in mind, we perform a joint first-order expansion of the noisy overlap matrix Eq. ([8](https://arxiv.org/html/2606.11319#S3.E8)) in $\epsilon$ and $1/\sqrt{d}$, leading to

$$\begin{aligned}\tilde{X}_{\alpha\beta}(\epsilon) &= X_{\alpha\beta}(0) + \epsilon\,\partial_{\epsilon}X_{\alpha\beta}(0) + \frac{1}{\sqrt{d}}\,\eta_{\alpha\beta}(0)\\ &\quad + \mathcal{O}(\epsilon^{2}) + \mathcal{O}(\epsilon d^{-1/2}) + \mathcal{O}(d^{-1}),\end{aligned} \tag{11}$$

where we have also expanded the non-deterministic fluctuation $\eta_{\alpha\beta}$ about $\epsilon = 0$.

Feeding this structure through the recursion, the depth-$L$ NTK takes on a similar form,

$$\begin{aligned}\tilde{\Theta}_{\alpha\beta}^{(L)}(\epsilon) &= \Theta_{\alpha\beta}^{(L)}(0) + \epsilon\,\partial_{\epsilon}\Theta_{\alpha\beta}^{(L)}(0) + \frac{1}{\sqrt{d}}\,\eta^{(L)}_{\alpha\beta}(0)\\ &\quad + \mathcal{O}(\epsilon^{2}) + \mathcal{O}(\epsilon d^{-1/2}) + \mathcal{O}(d^{-1}),\end{aligned} \tag{12}$$

where $\Theta_{\alpha\beta}^{(L)}(0)$ is the depth-$L$ NTK built from the pure-noise averaged overlap $X_{\alpha\beta}(0)$, and $\eta^{(L)}_{\alpha\beta}(0)$ is a zero-mean Gaussian fluctuation inherited from $\delta\tilde{X}$.

Substituting Eq. ([12](https://arxiv.org/html/2606.11319#S3.E12)) into Eq. ([4](https://arxiv.org/html/2606.11319#S3.E4)) and expanding to leading order in $\epsilon$ and $d^{-1/2}$, we obtain

$$
\begin{aligned}
f_{i}(\boldsymbol{x}_{*};\epsilon) &\approx C^{(L)} + A^{(L)}\sqrt{\frac{N/C}{d}}\,\hat{\xi}_{i} \\
&\quad + B^{(L)}\frac{\epsilon N}{C}\left[\frac{1}{d}\sum_{k=1}^{d}x_{k*}\bigl(\bar{x}_{k}^{i}-\bar{x}_{k}\bigr)\right],
\end{aligned}
\tag{13}
$$

where $A^{(L)}$, $B^{(L)}$, and $C^{(L)}$ are class-independent constants which encode depth, activation, initialization, and details of the noise distribution; $\hat{\xi}_{i}$ is a class-dependent Gaussian fluctuation with zero mean and unit variance, and

$$
\bar{x}_{k}^{i}=\frac{C}{N}\sum_{\chi\in i}x_{k\chi},\qquad \bar{x}_{k}=\frac{1}{N}\sum_{\chi}x_{k\chi}
\tag{14}
$$

are the centered empirical mean, or centroid, of class $i$ and the global empirical mean, respectively, in feature $k$ for the *clean* training data. The crucial point is that the leading class-dependent signal depends on the training data only through these *clean* centered-class means. Despite being trained on corrupted inputs, the network extracts from the noisy data a class-dependent statistic depending only on the clean dataset. At this order, the noisy part of the inputs contributes only the fluctuation term $\hat{\xi}_{i}$, while the discriminating signal is the overlap between the test point and the clean class centroid.

Therefore, the leading-order decision rule learned by the network in the joint limit $\epsilon \ll 1$ and $\sqrt{d} \gg 1$ is

$$
i^{\star}=\arg\max_{i=1,\ldots,C}\frac{1}{d}\sum_{k=1}^{d}x_{k*}\bigl(\bar{x}_{k}^{i}-\bar{x}_{k}\bigr).
\tag{15}
$$

In other words, the NTK calculation predicts that, in the high-noise regime, the network reduces to a nearest-class-mean, or centroid, classifier \[[38](https://arxiv.org/html/2606.11319#bib.bib67)\] with an inner-product score in feature space. Remarkably, the leading-order decision rule is universal across depth, activation, and noise distribution. Despite the depth and nonlinearity of the original model, classification collapses to simple linear template matching against one prototype per class.
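
The decision rule of Eq. (15) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' code: the function names (`class_means`, `centroid_scores`, `predict`) and the toy data are hypothetical, and per-class means are computed from per-class counts, which coincides with the $C/N$ prefactor of Eq. (14) only for class-balanced data.

```python
def class_means(X, labels, num_classes):
    """Per-class means and global mean of the clean training inputs (Eq. 14)."""
    d = len(X[0])
    sums = [[0.0] * d for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, c in zip(X, labels):
        counts[c] += 1
        for k in range(d):
            sums[c][k] += x[k]
    centroids = [[s / counts[c] for s in sums[c]] for c in range(num_classes)]
    global_mean = [sum(x[k] for x in X) / len(X) for k in range(d)]
    return centroids, global_mean


def centroid_scores(x_test, centroids, global_mean):
    """Inner-product score (1/d) * sum_k x_k* (xbar_k^i - xbar_k) for each class i."""
    d = len(x_test)
    return [
        sum(x_test[k] * (mu[k] - global_mean[k]) for k in range(d)) / d
        for mu in centroids
    ]


def predict(x_test, centroids, global_mean):
    """Arg-max over classes of the centroid score (Eq. 15)."""
    scores = centroid_scores(x_test, centroids, global_mean)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage: two balanced classes in d = 2 (hypothetical data).
X_train = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
y_train = [0, 0, 1, 1]
centroids, global_mean = class_means(X_train, y_train, num_classes=2)
label = predict([1.0, 0.0], centroids, global_mean)
```

If the prefactor multiplying the centroid term were negative, the same sketch would apply with `max` replaced by `min`.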

We note that the coefficient $B^{(L)}$ multiplying the centroid term in Eq. ([13](https://arxiv.org/html/2606.11319#S3.E13)) is guaranteed to be nonnegative under our normalization convention Eq. ([3](https://arxiv.org/html/2606.11319#S3.E3)) and unit-variance noise models; see Appendix [D](https://arxiv.org/html/2606.11319#A4) for further details. However, nonnegativity of $B^{(L)}$ is not guaranteed for arbitrary non-normalized data or completely general noise models. For settings where $B^{(L)}<0$, the leading decision rule is simply obtained by reversing the orientation of the centroid score, equivalently by replacing $\arg\max$ with $\arg\min$. Thus, the leading class-dependent response remains a centroid-overlap rule, with the sign of $B^{(L)}$ determining its orientation.

Finally, we comment on the controllability of the approximation at large training dataset size $N$. The prefactor of the centroid score in Eq. ([13](https://arxiv.org/html/2606.11319#S3.E13)) comes from expanding the inverse NTK around the pure-noise kernel. For moderate $N$, this expansion is controlled by the small signal strength $\epsilon$. At larger $N$, higher-order terms in $\epsilon$ can become important; however, these terms primarily renormalize the overall coefficient of the centroid statistic, leaving the leading decision rule unchanged (further details are given in Appendix [C](https://arxiv.org/html/2606.11319#A3)). The full expansion obtained by substituting higher-order NTK corrections into Eq. ([4](https://arxiv.org/html/2606.11319#S3.E4)) does not necessarily converge uniformly in $N$. Nonetheless, as shown in the next subsection, the leading correction already agrees well with the empirical network outputs and test accuracy in the high-noise regime.

### III\.2Comparison with finite\-width networks

We now compare the theoretical prediction for the logit output, Eq. ([13](https://arxiv.org/html/2606.11319#S3.E13)), to finite-width networks in a controlled empirical setting matched to the assumptions of the high-noise NTK calculation. This setting differs from the motivating experiments in Fig. [1](https://arxiv.org/html/2606.11319#S1.F1), which use finite-width ReLU MLPs, cross-entropy loss, and raw image inputs in $[0,1]$ to establish the robustness phenomenon in a standard classification setup. Here, by contrast, we use the normalization convention of Eq. ([3](https://arxiv.org/html/2606.11319#S3.E3)) for the inputs, MSE loss, wider networks with erf activation, and fewer training data points so that the numerical experiment isolates the leading-order mechanism predicted by the theory. The theoretical prediction is not intended to reproduce every implementation detail of Fig. [1](https://arxiv.org/html/2606.11319#S1.F1); rather, it identifies a tractable high-noise mechanism, which we validate directly in the finite-width experiments of Fig. [4](https://arxiv.org/html/2606.11319#S3.F4). Further experimental details on the empirical results presented here can be found in Appendix [A](https://arxiv.org/html/2606.11319#A1).

Restoring the corruption probability $p=1-\epsilon$, the form of Eq. ([13](https://arxiv.org/html/2606.11319#S3.E13)) motivates the effective model

$$
f_{i}(\boldsymbol{x}_{*})=a(1-p)\left[\frac{1}{d}\sum_{k=1}^{d}x_{k*}\bigl(\bar{x}_{k}^{i}-\bar{x}_{k}\bigr)\right]+b+c\,\hat{\xi}_{i},
\tag{16}
$$

where $a$, $b$, and $c$ absorb the renormalized architecture-dependent prefactors and fluctuation scale. For simplicity, we model the fluctuations $\hat{\xi}_{i}$ as i.i.d. standard normal random variables. To estimate these coefficients, we train $m$ MLPs for each of $M$ independent noisy-dataset realizations, for a total of $mM$ trained networks. For each test point and class, we first average the output logit over the $m$ networks corresponding to a fixed noisy realization. This produces $M$ realization-averaged logits per class, which are the empirical objects to be compared with Eq. ([16](https://arxiv.org/html/2606.11319#S3.E16)). We then average these logits over the $M$ noisy realizations and determine $a$ and $b$ by a least-squares fit over all test points and classes to the mean of Eq. ([16](https://arxiv.org/html/2606.11319#S3.E16)); the coefficient $c$ does not enter this step because it multiplies a zero-mean fluctuation. Finally, we set $c$ equal to the standard deviation of the $M$ realization-averaged logits, pooled over all test points and classes. Importantly, this is a global three-parameter fit: we do *not* fit separate coefficients for different test examples.
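
The fitting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `fit_effective_model` and the nested-list data layout are hypothetical, and the pooled standard deviation is taken here as the root-mean-square deviation of each realization-averaged logit from its mean over realizations (population normalization), which is one plausible reading of the text.

```python
import math


def fit_effective_model(logits, scores, p):
    """Global fit of (a, b, c) in Eq. (16).

    logits[r][t][i]: logit for noisy realization r, test point t, class i,
                     already averaged over the m networks of that realization.
    scores[t][i]:    centroid score of test point t for class i.
    p:               corruption probability.
    """
    M, T, C = len(logits), len(scores), len(scores[0])

    # Average the realization-averaged logits over the M noisy realizations.
    mean_logit = [
        [sum(logits[r][t][i] for r in range(M)) / M for i in range(C)]
        for t in range(T)
    ]

    # Ordinary least squares of mean_logit on (1 - p) * score, pooled over (t, i).
    xs = [(1.0 - p) * scores[t][i] for t in range(T) for i in range(C)]
    ys = [mean_logit[t][i] for t in range(T) for i in range(C)]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar

    # Assumed pooling: RMS deviation across realizations, over all (t, i).
    sq = sum(
        (logits[r][t][i] - mean_logit[t][i]) ** 2
        for r in range(M) for t in range(T) for i in range(C)
    )
    c = math.sqrt(sq / (M * T * C))
    return a, b, c
```

Because `a` and `b` are fitted once over all test points and classes, the sketch reflects the global nature of the fit; no per-example coefficients appear.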

Figure [4](https://arxiv.org/html/2606.11319#S3.F4) tests this picture in two complementary ways. The insets compare the empirical mean output logits of the MLPs on the full MNIST, Fashion-MNIST, and KMNIST test sets with the prediction obtained from the mean of Eq. ([16](https://arxiv.org/html/2606.11319#S3.E16)). After fitting the two global coefficients $a$ and $b$, we find nearly exact agreement for each dataset. This provides strong evidence that, after averaging over noise realizations, the outputs of the MLP collapse onto the mean of the centroid model, consistent with centroid-classifier behavior. The main panels compare the clean-test accuracy of the empirical ensemble with that of the full centroid model in Eq. ([16](https://arxiv.org/html/2606.11319#S3.E16)), now including the fluctuation term controlled by the fitted coefficient $c$. The agreement for each dataset shows that the centroid model captures not only the mean output structure, but also the fluctuations that govern classification accuracy. In particular, it accurately reproduces the decay of the test accuracy as the noise probability $p$ increases. The main panels also indicate where the empirical networks begin to depart from the leading-order centroid description, signaling the onset of higher-order corrections in $(1-p)$. In our data, this deviation becomes noticeable for $p \lesssim 0.9$.

The agreement in Fig\.[4](https://arxiv.org/html/2606.11319#S3.F4)should therefore be interpreted as a controlled validation of the high\-noise centroid mechanism, rather than as a claim that the theory reproduces every implementation detail of the raw\-input cross\-entropy experiments in Fig\.[1](https://arxiv.org/html/2606.11319#S1.F1)\. Nevertheless, it indicates that the same centroid mechanism remains visible in finite\-width networks once the experimental setting is chosen to match the assumptions of the calculation\.

### III\.3Signal\-to\-noise ratio

Equation ([13](https://arxiv.org/html/2606.11319#S3.E13)) for the output logit also implies a scaling law, since it predicts that, at least for intermediate $N$, the class-dependent mean grows as $(1-p)(N/C)$ while the fluctuation scales as $\sqrt{(N/C)/d}$. Therefore, we expect the signal-to-noise ratio (SNR) to scale as

$$
\mathrm{SNR}\sim(1-p)^{2}\,d\,\frac{N}{C}.
\tag{17}
$$

The theory therefore predicts improved classification accuracy as the surviving signal fraction $1-p$, the feature dimension $d$, or the number of training examples per class $N/C$ is increased. Figure [5](https://arxiv.org/html/2606.11319#S3.F5) confirms these trends for MNIST in the high-noise regime $p=0.97$. At fixed noise strength, test accuracy improves systematically with both $N$ and $d$, in agreement with the suggested scaling in Eq. ([17](https://arxiv.org/html/2606.11319#S3.E17)).
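
As a short worked step, the scaling in Eq. (17) follows from the two scalings quoted from Eq. (13) if one assumes (the text does not state the convention explicitly) that the SNR is the squared ratio of the class-dependent mean to the fluctuation scale, with the class-independent prefactors $A^{(L)}$ and $B^{(L)}$ dropped:

$$
\mathrm{SNR}\sim\frac{\bigl[(1-p)\,(N/C)\bigr]^{2}}{(N/C)/d}=(1-p)^{2}\,d\,\frac{N}{C}.
$$

Under the alternative amplitude-ratio convention, the same inputs would give $(1-p)\sqrt{d\,N/C}$, which increases with the same three quantities.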

![Refer to caption](https://arxiv.org/html/2606.11319v1/x5.png)

Figure 5: Empirical test of the scaling suggested by Eq. ([17](https://arxiv.org/html/2606.11319#S3.E17)). Each curve shows the mean clean-test accuracy of an ensemble of width-2048, 3-layer $\mathrm{erf}$ MLPs trained with MSE loss on MNIST with replacement noise at $p=0.97$, as a function of training-set size $N$ for fixed feature dimension $d$. Error bars show one standard deviation. Larger values of $d$ are obtained by resizing the original MNIST images using bilinear interpolation.

The analysis in this section has focused on uniform replacement noise for concreteness. However, the centroid-classification result is not tied to that specific choice of corruption model. Rather, for the two broad classes of input noise most relevant in practice (general replacement noise and general additive noise), the same leading centroid structure persists. In Appendix [E](https://arxiv.org/html/2606.11319#A5) we present analogous results for the additive noise model [Eq. ([2](https://arxiv.org/html/2606.11319#S2.E2))], demonstrating the generality of our findings.

### III\.4Permutation symmetry and the centroid response

The centroid rule Eq\. \([15](https://arxiv.org/html/2606.11319#S3.E15)\) is independent of noise distribution, depth, and activation\. Here we give intuition for this universality within the NTK framework using a linear\-response argument: we treat the trained network as a system sitting at a permutation\-symmetric pure\-noise reference state and view the surviving signal as a small perturbation away from it\. This provides a symmetry interpretation of the high\-noise expansion derived in Appendix[C](https://arxiv.org/html/2606.11319#A3)\. It should be read as a minimal linear\-response model rather than a second microscopic derivation of the corruption channels of Sec\.[II](https://arxiv.org/html/2606.11319#S2), but it is sufficient to extract the leading\-order class\-dependent decision rule Eq\. \([15](https://arxiv.org/html/2606.11319#S3.E15)\)\.

After separating the $\mathcal{O}(d^{-1/2})$ stochastic fluctuations, the deterministic part of the noisy NTK depends on the data only through the noise-averaged overlap Eq. ([9](https://arxiv.org/html/2606.11319#S3.E9)). At pure noise, $\epsilon=1-p=0$, the training examples become exchangeable: the noise-averaged train–train overlap is invariant under any relabeling $\pi\in S_{N}$ of the training examples,

$$
X_{\pi(\alpha)\pi(\beta)}(0)=X_{\alpha\beta}(0).
\tag{18}
$$

That is, $X_{\alpha\beta}(0)$ depends on $\alpha,\beta$ only through whether $\alpha=\beta$. This symmetry structure is guaranteed at pure noise for any distribution that is i.i.d. with respect to data and feature space and has bounded mean and variance.

Motivated by this structure, we introduce a minimal pure-noise reference ensemble for the training set. This reference ensemble consists of featureless, i.i.d. random variables $\{\xi_{k\alpha}\}$, whose purpose is to capture the exchangeable structure of the deterministic part of the noisy NTK at the pure-noise fixed point. The corresponding pure-noise-averaged test–train and train–train overlap matrices are

$$
X_{0,*\alpha}=\frac{1}{d}\sum_{k=1}^{d}x_{k*}\,\mathbb{E}\!\left[\xi_{k\alpha}\right],
\tag{19}
$$

$$
X_{0,\alpha\beta}=\frac{1}{d}\sum_{k=1}^{d}\mathbb{E}\!\left[\xi_{k\alpha}\xi_{k\beta}\right],
\tag{20}
$$

where the expectation $\mathbb{E}[\cdot]$ is taken with respect to the noise distribution. We leave the first two moments of the reference noise arbitrary, assuming only that they are bounded. For later convenience, we assume the first moment vanishes and denote the second moment by $v$. (Footnote 2: The value of the first moment is inconsequential for the rest of the calculation; a nonzero first moment simply renormalizes the constants $r^{(\ell)}$, $s^{(\ell)}$, and $t^{(\ell)}$ introduced later in the subsection. In contrast, the value of the second moment can affect the orientation of the centroid rule, as discussed below.) Adopting the normalization convention of Eq. ([3](https://arxiv.org/html/2606.11319#S3.E3)) for the clean training inputs $\{x_{k\alpha}\}$ and test input $x_{k*}$, the imposed mean-centering makes the pure-noise test–train overlap vanish, $X_{0,*\alpha}=0$, while the pure-noise train–train overlap remains diagonal, $X_{0,\alpha\beta}=\delta_{\alpha\beta}v$.

We treat the network trained on the pure-noise ensemble $\{\xi_{k\alpha}\}$ as a pure-noise fixed point. Now suppose we perturb the pure-noise inputs by a small signal proportional to the clean features,

$$
\xi_{k\alpha}\to\xi_{k\alpha}+\epsilon\,x_{k\alpha}.
\tag{21}
$$

To study the response, let $\delta$ denote the leading-order variation along the perturbation path $z_{k\alpha}(\epsilon)=\xi_{k\alpha}+\epsilon x_{k\alpha}$. That is, for any function $F$ built from the training inputs, define

$$
\delta F\equiv\left.\epsilon\frac{d}{d\epsilon}\right|_{\epsilon=0}F\bigl[\boldsymbol{z}(\epsilon)\bigr],
\tag{22}
$$

so that $\delta F=\mathcal{O}(\epsilon)$. The leading-order response of the test–train overlap is

$$
\delta X_{0,*\alpha}=\frac{\epsilon}{d}\sum_{k=1}^{d}x_{k*}x_{k\alpha},
\tag{23}
$$

while the response of the train–train overlap vanishes, since its leading nonzero term is $\mathcal{O}(\epsilon^{2})$. Thus only $\delta X_{0,*\alpha}$ carries the signal at leading order.
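
To make the two statements above explicit, one can expand the noise-averaged overlaps along the perturbation path, treating the clean features as fixed and using the vanishing first moment $\mathbb{E}[\xi_{k\alpha}]=0$ assumed for the reference ensemble. The following is a sketch of that intermediate step, not an additional result:

$$
\begin{aligned}
\frac{1}{d}\sum_{k=1}^{d}x_{k*}\,\mathbb{E}\!\left[z_{k\alpha}(\epsilon)\right]
&=\frac{1}{d}\sum_{k=1}^{d}x_{k*}\bigl(\mathbb{E}[\xi_{k\alpha}]+\epsilon\,x_{k\alpha}\bigr)
=\frac{\epsilon}{d}\sum_{k=1}^{d}x_{k*}x_{k\alpha},\\
\frac{1}{d}\sum_{k=1}^{d}\mathbb{E}\!\left[z_{k\alpha}(\epsilon)z_{k\beta}(\epsilon)\right]
&=X_{0,\alpha\beta}
+\frac{\epsilon}{d}\sum_{k=1}^{d}\bigl(x_{k\alpha}\mathbb{E}[\xi_{k\beta}]+x_{k\beta}\mathbb{E}[\xi_{k\alpha}]\bigr)
+\frac{\epsilon^{2}}{d}\sum_{k=1}^{d}x_{k\alpha}x_{k\beta}.
\end{aligned}
$$

The term linear in $\epsilon$ in the second line is proportional to the first moment of the noise and therefore drops out, leaving only the $\mathcal{O}(\epsilon^{2})$ correction.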

We now compute the response at arbitrary depth $L$. Let $\Theta^{(\ell)}_{0,\alpha\beta}$ and $\Theta^{(\ell)}_{0,*\alpha}$ denote the depth-$\ell$ train–train and test–train NTKs at pure noise, respectively. At leading order in $\epsilon$, the class-$i$ logit responds as

$$
\delta f_{i}(\boldsymbol{x}_{*})=\sum_{\alpha,\beta=1}^{N}\delta\Theta^{(L)}_{0,*\alpha}\left[\boldsymbol{\Theta}^{(L)}_{0}\right]_{\alpha\beta}^{-1}y_{i\beta},
\tag{24}
$$

where the variation of the inverse train–train NTK vanishes at leading order in this minimal treatment. Two objects therefore determine the response: the inverse pure-noise train–train NTK $[\boldsymbol{\Theta}^{(L)}_{0}]^{-1}$ and the pure-noise test–train NTK variation $\delta\Theta^{(L)}_{0,*\alpha}$. Both follow from applying the layer-to-layer recursion relations \[[33](https://arxiv.org/html/2606.11319#bib.bib97)\] to the pure-noise fixed point:

$$
K^{(\ell+1)}_{0,\alpha\beta}=C_{b}+C_{w}\langle\sigma_{\alpha}\sigma_{\beta}\rangle_{\boldsymbol{K}_{0}^{(\ell)}},
\tag{25}
$$

$$
\Theta^{(\ell+1)}_{0,\alpha\beta}=K^{(\ell+1)}_{0,\alpha\beta}+C_{w}\langle\sigma^{\prime}_{\alpha}\sigma^{\prime}_{\beta}\rangle_{\boldsymbol{K}_{0}^{(\ell)}}\Theta^{(\ell)}_{0,\alpha\beta},
\tag{26}
$$

where $\boldsymbol{K}^{(\ell)}_{0}$ denotes the depth-$\ell$ pure-noise pre-activation kernel, and the layer-$1$ initial condition is fixed directly by the pure-noise overlap,

$$
K^{(1)}_{0,\alpha\beta}=\Theta^{(1)}_{0,\alpha\beta}=C_{b}+C_{w}X_{0,\alpha\beta}.
\tag{27}
$$

In the recursion relations, $\sigma_{\alpha}\equiv\sigma(u_{\alpha})$, $\sigma^{\prime}_{\alpha}$ is its derivative, and $\langle\cdots\rangle_{\boldsymbol{K}}$ is an expectation over the zero-mean bivariate Gaussian pre-activations $(u_{\alpha},u_{\beta})$ with $2\times 2$ covariance $\boldsymbol{K}$ [Eq. ([41](https://arxiv.org/html/2606.11319#A3.E41))].

Consider first the inverse train–train NTK. At pure noise the train–train overlap is permutation-invariant, and hence so is the layer-$1$ NTK, $\Theta^{(1)}_{0,\alpha\beta}=r^{(1)}\delta_{\alpha\beta}+s^{(1)}$, where $r^{(1)}=C_{w}v$ and $s^{(1)}=C_{b}$. The recursions at pure noise are permutation-equivariant: if the depth-$1$ kernel is permutation-invariant, then this invariance is preserved at arbitrary depth $\ell$. The depth-$L$ train–train NTK therefore retains the two-parameter form $\Theta^{(L)}_{0,\alpha\beta}=r^{(L)}\delta_{\alpha\beta}+s^{(L)}$, for constants $r^{(L)}$ and $s^{(L)}$ that depend on depth, activation, initialization, and the moments of the noise distribution. Equivalently, the depth-$L$ NTK at pure noise lives in the projector algebra $\mathrm{span}(\boldsymbol{P}_{\parallel},\boldsymbol{P}_{\perp})$, with $\boldsymbol{P}_{\parallel}=\frac{1}{N}\boldsymbol{1}\boldsymbol{1}^{\mathrm{T}}$ the projector onto the permutation-invariant sector and $\boldsymbol{P}_{\perp}=\boldsymbol{I}-\boldsymbol{P}_{\parallel}$ its complement:

$$
\boldsymbol{\Theta}^{(L)}_{0}=r^{(L)}\boldsymbol{P}_{\perp}+\left(r^{(L)}+s^{(L)}N\right)\boldsymbol{P}_{\parallel}.
\tag{28}
$$

The NTK is an example of a Gram matrix, which is strictly positive-definite for distinct inputs \[[18](https://arxiv.org/html/2606.11319#bib.bib59)\]. Therefore, its eigenvalues $r^{(L)}$ and $r^{(L)}+s^{(L)}N$ are strictly positive, and so it has a well-defined inverse

$$
\left[\boldsymbol{\Theta}^{(L)}_{0}\right]^{-1}=\frac{\boldsymbol{P}_{\perp}}{r^{(L)}}+\frac{\boldsymbol{P}_{\parallel}}{r^{(L)}+s^{(L)}N}.
\tag{29}
$$
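
The projector form of the inverse in Eq. (29) can be checked numerically for small $N$. The sketch below is illustrative only: the function names and the values of `r`, `s`, and `n` are hypothetical, and the check simply multiplies the two-parameter matrix $r\,\delta_{\alpha\beta}+s$ by the claimed inverse and compares the product with the identity.

```python
def pure_noise_ntk(r, s, n):
    """Two-parameter kernel Theta[a][b] = r * delta_ab + s (permutation-invariant)."""
    return [[(r if a == b else 0.0) + s for b in range(n)] for a in range(n)]


def projector_inverse(r, s, n):
    """Claimed inverse of Eq. (29): P_perp / r + P_par / (r + s * n)."""
    inv = []
    for a in range(n):
        row = []
        for b in range(n):
            p_par = 1.0 / n                               # entry of (1/n) * 1 1^T
            p_perp = (1.0 if a == b else 0.0) - p_par     # entry of I - P_par
            row.append(p_perp / r + p_par / (r + s * n))
        inv.append(row)
    return inv


def matmul(A, B):
    """Plain-Python product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def max_deviation_from_identity(P):
    """Largest absolute entry of P - I."""
    n = len(P)
    return max(abs(P[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n))


# Hypothetical values; any r > 0 and r + s * n > 0 should work.
r, s, n = 2.0, 0.5, 4
product = matmul(pure_noise_ntk(r, s, n), projector_inverse(r, s, n))
deviation = max_deviation_from_identity(product)
```

The check relies only on $\boldsymbol{P}_{\perp}\boldsymbol{P}_{\parallel}=0$ and on both projectors being idempotent, which is why the inverse can be written down sector by sector.
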
Consider next the depth-$L$ test–train NTK variation $\delta\Theta_{0,*\alpha}^{(L)}$. The first-layer variation about pure noise is

$$
\delta\Theta^{(1)}_{0,*\alpha}=\delta K^{(1)}_{0,*\alpha}=C_{w}\,\delta X_{0,*\alpha}.
\tag{30}
$$

Layer-to-layer propagation of the variation can be handled by differentiating the recursion relations using Price’s theorem \[[30](https://arxiv.org/html/2606.11319#bib.bib99)\]. The resulting recursion relations for the test–train kernel and NTK variation are

$$
\delta K_{0,*\alpha}^{(\ell+1)}=C_{w}\langle\sigma^{\prime}_{*}\sigma^{\prime}_{\alpha}\rangle_{\boldsymbol{K}_{0}^{(\ell)}}\,\delta K_{0,*\alpha}^{(\ell)},
\tag{31}
$$

$$
\begin{aligned}
\delta\Theta_{0,*\alpha}^{(\ell+1)} &= \delta K_{0,*\alpha}^{(\ell+1)}+C_{w}\langle\sigma^{\prime\prime}_{*}\sigma^{\prime\prime}_{\alpha}\rangle_{\boldsymbol{K}_{0}^{(\ell)}}\Theta^{(\ell)}_{0,*\alpha}\,\delta K_{0,*\alpha}^{(\ell)} \\
&\quad + C_{w}\langle\sigma^{\prime}_{*}\sigma^{\prime}_{\alpha}\rangle_{\boldsymbol{K}_{0}^{(\ell)}}\,\delta\Theta^{(\ell)}_{0,*\alpha},
\end{aligned}
\tag{32}
$$

where we have used the fact that diagonal variations of the test–train kernel and NTK vanish at leading order. A key observation is that these recursion relations are local in the training index: a perturbation of $X_{0,*\alpha}$ propagates only through the pure-noise test–train NTK $\Theta^{(\ell)}_{0,*\alpha}$ at the same $\alpha$, never coupling distinct training points. The activation-dependent transfer factors depend on $\alpha$ only through the layer-$\ell$ kernel entries of the pair $(*,\alpha)$, and these entries are independent of $\alpha$. At the first layer they are

$$
K^{(1)}_{0,**}=C_{b}+C_{w},\quad K^{(1)}_{0,\alpha\alpha}=C_{b}+C_{w}v,\quad K^{(1)}_{0,*\alpha}=C_{b},
\tag{33}
$$

where $v$ is the second moment of the noise distribution. Because these kernel entries are the same for every training index $\alpha$, the recursion preserves this $\alpha$-independence. Every transfer factor is therefore identical for all $\alpha$, and since $\delta\Theta^{(1)}_{0,*\alpha}\propto\delta X_{0,*\alpha}$, the recursion merely rescales the test–train overlap variation by a constant:

$$
\delta\Theta^{(L)}_{0,*\alpha}=t^{(L)}\,\delta X_{0,*\alpha},
\tag{34}
$$

where $t^{(L)}$ depends on depth, activation, initialization, and the moments of the noise distribution.

Substituting Eqs\. \([29](https://arxiv.org/html/2606.11319#S3.E29)\) and \([34](https://arxiv.org/html/2606.11319#S3.E34)\) into Eq\. \([24](https://arxiv.org/html/2606.11319#S3.E24)\), and using the class\-balanced label contractions

$$
\boldsymbol{P}_{\perp}\boldsymbol{y}_{i}=\boldsymbol{y}_{i}-\frac{1}{C}\boldsymbol{1},\qquad\boldsymbol{P}_{\parallel}\boldsymbol{y}_{i}=\frac{1}{C}\boldsymbol{1},
\tag{35}
$$

the permutation-invariant sector $\boldsymbol{P}_{\parallel}$ contributes only a class-independent constant, while the $\boldsymbol{P}_{\perp}$ sector carries the class signal. With $\delta X_{0,*\alpha}$ from Eq. ([23](https://arxiv.org/html/2606.11319#S3.E23)), this gives

$$
\begin{split}
\delta f_{i}(\boldsymbol{x}_{*}) &= \frac{t^{(L)}}{r^{(L)}}\frac{\epsilon N}{C}\left[\frac{1}{d}\sum_{k=1}^{d}x_{k*}\left(\bar{x}_{k}^{i}-\bar{x}_{k}\right)\right] \\
&\quad + \text{class-independent terms},
\end{split}
\tag{36}
$$

which is exactly the centroid rule. The constants $t^{(L)}$ and $r^{(L)}$ encode depth, activation, initialization, and the first two moments of the noise distribution, while the class-dependent structure is fixed by permutation symmetry at the pure-noise fixed point.
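
The contraction leading to Eq. (36) can be verified numerically on toy data. The sketch below is a hedged illustration, assuming one-hot labels and class-balanced training data (consistent with the contractions in Eq. (35)); the function names, the toy inputs, and the value used for the ratio $t^{(L)}/r^{(L)}$ are hypothetical. It compares the $\boldsymbol{P}_{\perp}$-sector sum over training points with the closed-form centroid expression.

```python
def perp_sector_response(x_test, X, labels, num_classes, eps, t_over_r):
    """(t/r) * sum_alpha deltaX_alpha * (y_i_alpha - 1/C), with deltaX from Eq. (23)."""
    n, d = len(X), len(x_test)
    delta_x = [eps / d * sum(x_test[k] * X[a][k] for k in range(d)) for a in range(n)]
    return [
        t_over_r * sum(
            delta_x[a] * ((1.0 if labels[a] == i else 0.0) - 1.0 / num_classes)
            for a in range(n)
        )
        for i in range(num_classes)
    ]


def centroid_form(x_test, X, labels, num_classes, eps, t_over_r):
    """(t/r) * (eps * N / C) * (1/d) * sum_k x_k* (xbar_k^i - xbar_k), as in Eq. (36)."""
    n, d = len(X), len(x_test)
    per_class = n / num_classes  # N / C, valid for class-balanced data
    global_mean = [sum(x[k] for x in X) / n for k in range(d)]
    out = []
    for i in range(num_classes):
        class_mean = [
            sum(X[a][k] for a in range(n) if labels[a] == i) / per_class
            for k in range(d)
        ]
        overlap = sum(x_test[k] * (class_mean[k] - global_mean[k]) for k in range(d)) / d
        out.append(t_over_r * eps * per_class * overlap)
    return out


# Hypothetical toy data: N = 4, C = 2, d = 3.
X_toy = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.1], [0.0, 1.0, 0.3], [0.2, 0.8, 0.7]]
y_toy = [0, 0, 1, 1]
x_star = [0.3, -0.2, 0.9]
lhs = perp_sector_response(x_star, X_toy, y_toy, 2, eps=0.03, t_over_r=1.7)
rhs = centroid_form(x_star, X_toy, y_toy, 2, eps=0.03, t_over_r=1.7)
```

The two lists agree up to floating-point rounding, and in the balanced case the class-dependent responses sum to zero over classes, reflecting the centering by the global mean.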

A remark on orientation is useful. The prefactor $t^{(L)}/r^{(L)}$ is common to all classes, so only its sign matters: a positive sign gives the nearest-class-mean rule, while a negative sign reverses the orientation of the centroid score. Since $r^{(L)}>0$, the orientation is determined by the sign of $t^{(L)}$. Under the unit-variance convention $v=1$, the layer-$1$ test–train kernel entries become

$$
K^{(1)}_{0,**}=\left.K^{(1)}_{0,\alpha\alpha}\right|_{v=1}=C_{b}+C_{w},\qquad K^{(1)}_{0,*\alpha}=C_{b},
\tag{37}
$$

as follows from Eq. ([33](https://arxiv.org/html/2606.11319#S3.E33)). Thus the test and training diagonals coincide, and the off-diagonal entry is nonnegative. These properties imply that the activation-dependent transfer factors $\langle h_{*}h_{\alpha}\rangle_{\boldsymbol{K}_{0}^{(\ell)}}$, with $h$ being any of $\sigma,\sigma^{\prime},\sigma^{\prime\prime}$, appearing in the kernel and NTK variation recursions are nonnegative at every depth for any activation function; we show this in detail in Appendix [D](https://arxiv.org/html/2606.11319#A4). Consequently, for unit-variance noise under the normalization convention Eq. ([3](https://arxiv.org/html/2606.11319#S3.E3)), we have $t^{(L)}\geq 0$.

For general $v\neq 1$ or non-normalized data, the test and training diagonals of the pure-noise test–train kernel need not coincide, and the sign is not guaranteed for every activation. If the prefactor is negative, however, the leading decision rule is obtained by reversing the orientation of the centroid score, equivalently by replacing $\arg\max$ with $\arg\min$. Thus the leading class-dependent response remains a centroid-overlap rule for any depth, activation, and i.i.d. noise distribution, with the sign of $t^{(L)}$ fixing the orientation.

## IVDiscussion and outlook

The problem of training neural networks with heavily corrupted inputs has revealed numerous surprises\. First, we found that MLPs are remarkably robust to attribute noise, capable of classifying heavily corrupted images with accuracy much higher than chance\. This phenomenon, observed across several datasets, prompted us to tackle the problem analytically, with the naive expectation of revealing the complicated “denoising” process employed by the networks, enabling them to perform so well\. Using a controlled expansion of the NTK in the strong corruption regime, we found that in fact the networks learn a very simple rule: they compare each test image to the average of all training images of a certain class, and check which one is the most similar\. This nearest\-class\-mean \(centroid\) rule, unique to the strong corruption limit, emerges despite the complexity and nonlinearity enabled by the network\. The analytical result also leads to an effective signal\-to\-noise ratio \[Eq\. \([17](https://arxiv.org/html/2606.11319#S3.E17)\)\], which reveals that performance improves with feature dimension and dataset size; we have confirmed that empirically as well\. Taken together, these results show that, in the high\-attribute\-noise regime, wide MLPs have a universal leading class\-dependent response: a centered centroid classifier\. The finite\-width networks studied here visibly follow this mechanism, explaining their surprising robustness on the MNIST\-family datasets\.

The symmetry interpretation in Sec\.[III\.4](https://arxiv.org/html/2606.11319#S3.SS4)explains why this simple centroid rule is not a coincidence of the first\-layer calculation: it is fixed by the exchangeability of the pure\-noise kernel and by the permutation equivariance of the NTK recursion\. This observation also suggests a broader interpretation of the formalism\. The essential ingredients are not the particular MLP recursion alone; the same reasoning should apply beyond fully connected MLPs to any architecture with a well\-defined kernel limit whose pure\-noise kernel has the same exchangeability structure\. When those ingredients are present, linear response around the high\-noise fixed point identifies the first class\-dependent correction to the predictor; in the present setting, that correction is precisely the centroid rule\. More broadly, this points to a way of understanding learning phenomena by locating tractable fixed points of a network’s effective description and deriving the leading decision\-rule corrections that appear when a small signal breaks the fixed\-point symmetry\.

The success of the NTK prediction for finite\-width networks suggests that the high\-noise regime has an effective universality beyond the exactly solvable infinite\-width limit\. Although finite width, nonlinear feature learning, and the choice of loss can change architecture\-dependent prefactors and fluctuation scales, the dominant class\-dependent direction appears to remain the centered centroid overlap\. This should not be read as a claim that MLPs are robust to arbitrary training corruption on every classification dataset\. Rather, the theory predicts a generic leading\-order decision rule for MLPs in the high\-noise regime, while the resulting accuracy depends on the geometry of the dataset itself\. In the datasets studied here, MNIST, Fashion\-MNIST, and KMNIST, the clean class centroids contain enough discriminative information to yield well\-above\-chance classification, which explains why the networks remain robust at high corruption in Fig\.[1](https://arxiv.org/html/2606.11319#S1.F1)\.

Finally, the analytical control enabled in the strong-corruption regime may also have broader implications. Our perturbative result in $(1-p)$ could, in principle, be extended to higher orders, thereby revealing progressively finer structure in the rules learned by the network. Systematically carrying out this expansion may provide access to the intermediate-corruption regime, or even suggest a controlled route toward the clean-data limit, using the corruption strength as an explicit expansion parameter. A related possible direction is the analysis of denoising models, including diffusion-type models, in the high-noise regime. In that setting, one would replace the classification target considered here by a regression target, such as the clean input. An analogous expansion around the pure-noise limit could then reveal what statistics of the clean data are recovered at leading order, and how progressively finer denoising information emerges at higher orders in the signal strength.

## Acknowledgments

We thank P\. Frazier, T\. Arias, C\.\-M\. Jian, H\. Pan, T\. Nebabu, and J\. Kim for useful discussions\. J\.T\. acknowledges funding from the Center for Alkaline\-Based Energy Solutions \(CABES\), part of the Energy Frontier Research Center \(EFRC\) program supported by the U\.S\. Department of Energy, under grant DE\-SC\-0019445\. A\.B\. was supported by a Cornell fellowship from the Cornell University Graduate School\. H\.K\. acknowledges support by the NSF through the grant OAC\-2118310\. O\.L\. acknowledges support from the Bethe\-KIC postdoctoral fellowship at Cornell University\.

## References

- \[1\]M\. Abramowitz and I\. A\. Stegun\(1965\)Handbook of mathematical functions\.Dover,New York\.Cited by:[Appendix D](https://arxiv.org/html/2606.11319#A4.p3.14)\.
- \[2\]G\. An\(1996\)The effects of adding noise during backpropagation training on a generalization performance\.Neural Computation8\(3\),pp\. 643–674\.External Links:[Document](https://dx.doi.org/10.1162/neco.1996.8.3.643)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[3\]D\. Angluin and P\. Laird\(1988\)Learning from noisy examples\.Machine Learning2\(4\),pp\. 343–370\.External Links:[Document](https://dx.doi.org/10.1007/BF00116829)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p1.1)\.
- \[4\]Y\. Bahri, B\. Hanin, A\. Brossollet, V\. Erba, C\. Keup, R\. Pacelli, and J\. B\. Simon\(2024\)Les houches lectures on deep learning at large and infinite width\.Journal of Statistical Mechanics: Theory and Experiment2024\(10\),pp\. 104012\.Cited by:[§III\.1](https://arxiv.org/html/2606.11319#S3.SS1.p4.2),[§III](https://arxiv.org/html/2606.11319#S3.p2.3)\.
- \[5\]C\. M\. Bishop\(1995\)Training with noise is equivalent to Tikhonov regularization\.Neural Computation7\(1\),pp\. 108–116\.External Links:[Document](https://dx.doi.org/10.1162/neco.1995.7.1.108)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[6\]A\. Blum, A\. Kalai, and H\. Wasserman\(2003\)Noise\-tolerant learning, the parity problem, and the statistical query model\.Journal of the ACM50\(4\),pp\. 506–519\.External Links:[Document](https://dx.doi.org/10.1145/792538.792543)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p1.1)\.
- \[7\]T\. Clanuwat, M\. Bober\-Irizar, A\. Kitamoto, A\. Lamb, K\. Yamamoto, and D\. Ha\(2018\)Deep learning for classical japanese literature\.arXiv preprint arXiv:1812\.01718\.External Links:1812\.01718Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p3.1),[§II\.2](https://arxiv.org/html/2606.11319#S2.SS2.p1.6)\.
- \[8\]O\. Dekel and O\. Shamir\(2008\)Learning to classify with missing and corrupted features\.InProceedings of the 25th International Conference on Machine Learning,ICML ’08,New York, NY, USA,pp\. 216–223\.External Links:ISBN 9781605582054,[Link](https://doi.org/10.1145/1390156.1390184),[Document](https://dx.doi.org/10.1145/1390156.1390184)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[9\]T\. Emmanuel, T\. Maupong, D\. Mpoeleng, T\. Semong, B\. Mphago, and O\. Tabona\(2021\)A survey on missing data in machine learning\.Journal of Big Data8\(1\),pp\. 140\.External Links:[Document](https://dx.doi.org/10.1186/s40537-021-00516-9)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p1.1)\.
- \[10\]B\. Frenay and M\. Verleysen\(2014\)Classification in the presence of label noise: a survey\.IEEE Transactions on Neural Networks and Learning Systems25\(5\),pp\. 845–869\.External Links:[Document](https://dx.doi.org/10.1109/TNNLS.2013.2292894)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p1.1)\.
- \[11\]A\. Globerson and S\. Roweis\(2006\)Nightmare at test time: robust learning by feature deletion\.InProceedings of the 23rd International Conference on Machine Learning,ICML ’06,New York, NY, USA,pp\. 353–360\.External Links:ISBN 1595933832,[Link](https://doi.org/10.1145/1143844.1143889),[Document](https://dx.doi.org/10.1145/1143844.1143889)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[12\]S\. Goldt, B\. Loureiro, G\. Reeves, F\. Krzakala, M\. Mézard, and L\. Zdeborová\(2022\)The Gaussian equivalence of generative models for learning with shallow neural networks\.InProceedings of the 2nd Mathematical and Scientific Machine Learning Conference,PMLR, Vol\.145,pp\. 426–471\.Note:arXiv:2006\.14709Cited by:[§II\.1](https://arxiv.org/html/2606.11319#S2.SS1.p3.1)\.
- \[13\]S\. Guerriero, B\. Caputo, and T\. Mensink\(2018\)DEEP nearest class mean classifiers\.InICLR Workshop,External Links:[Link](https://api.semanticscholar.org/CorpusID:3283439)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p7.1)\.
- \[14\]D\. Hendrycks and T\. G\. Dietterich\(2019\)Benchmarking neural network robustness to common corruptions and perturbations\.CoRRabs/1903\.12261\.External Links:[Link](http://arxiv.org/abs/1903.12261),1903\.12261Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[15\]D\. Hendrycks, N\. Mu, E\. D\. Cubuk, B\. Zoph, J\. Gilmer, and B\. Lakshminarayanan\(2020\)AugMix: a simple data processing method to improve robustness and uncertainty\.External Links:1912\.02781,[Link](https://arxiv.org/abs/1912.02781)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[16\]L\. Holmström and P\. Koistinen\(1992\)Using additive noise in back\-propagation training\.IEEE Transactions on Neural Networks3\(1\),pp\. 24–38\.External Links:[Document](https://dx.doi.org/10.1109/72.105415)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[17\]H\. Hu and Y\. M\. Lu\(2023\)Universality laws for high\-dimensional learning with random features\.IEEE Transactions on Information Theory69\(3\),pp\. 1932–1964\.External Links:[Document](https://dx.doi.org/10.1109/TIT.2022.3217698)Cited by:[§II\.1](https://arxiv.org/html/2606.11319#S2.SS1.p3.1)\.
- \[18\]A\. Jacot, F\. Gabriel, and C\. Hongler\(2018\)Neural tangent kernel: convergence and generalization in neural networks\.InAdvances in Neural Information Processing Systems 31 \(NeurIPS 2018\),pp\. 8571–8580\.Note:arXiv:1806\.07572Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p4.1),[§III\.1](https://arxiv.org/html/2606.11319#S3.SS1.p1.1),[§III\.4](https://arxiv.org/html/2606.11319#S3.SS4.p6.16)\.
- \[19\]M\. Kearns\(1998\)Efficient noise\-tolerant learning from statistical queries\.Journal of the ACM45\(6\),pp\. 983–1006\.External Links:[Document](https://dx.doi.org/10.1145/293347.293351)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p1.1)\.
- \[20\]V\. Kothapalli\(2023\)Neural collapse: a review on modelling principles and generalization\.External Links:2206\.04041,[Link](https://arxiv.org/abs/2206.04041)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p7.1)\.
- \[21\]Y\. LeCun, C\. Cortes, and C\. Burges\(2010\)MNIST handwritten digit database\.ATT Labs \[Online\]\. Available: http://yann\.lecun\.com/exdb/mnist2\.Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p3.1),[§II\.2](https://arxiv.org/html/2606.11319#S2.SS2.p1.6)\.
- \[22\]K\. Matsuoka\(1992\)Noise injection into inputs in back\-propagation learning\.IEEE Transactions on Systems, Man, and Cybernetics22\(3\),pp\. 436–440\.External Links:[Document](https://dx.doi.org/10.1109/21.155944)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[23\]S\. Mei and A\. Montanari\(2022\)The generalization error of random features regression: precise asymptotics and the double descent curve\.Communications on Pure and Applied Mathematics75\(4\),pp\. 667–766\.External Links:[Document](https://dx.doi.org/10.1002/cpa.22008)Cited by:[§II\.1](https://arxiv.org/html/2606.11319#S2.SS1.p3.1)\.
- \[24\]A\. Montanari and B\. N\. Saeed\(2022\)Universality of empirical risk minimization\.InProceedings of the 35th Conference on Learning Theory \(COLT\),Proceedings of Machine Learning Research, Vol\.178,pp\. 4310–4312\.Cited by:[§II\.1](https://arxiv.org/html/2606.11319#S2.SS1.p3.1)\.
- \[25\]N\. Mu and J\. Gilmer\(2019\)MNIST\-C: A robustness benchmark for computer vision\.CoRRabs/1906\.02337\.External Links:[Link](http://arxiv.org/abs/1906.02337),1906\.02337Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[26\]N\. Natarajan, I\. S\. Dhillon, P\. Ravikumar, and A\. Tewari\(2013\)Learning with noisy labels\.InAdvances in Neural Information Processing Systems 26 \(NeurIPS 2013\),Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p1.1)\.
- \[27\]NIST digital library of mathematical functions\.Note:Release 1\.2\.x, F\. W\. J\. Olveret al\., eds\.[https://dlmf\.nist\.gov/](https://dlmf.nist.gov/), Chapter 18Cited by:[Appendix D](https://arxiv.org/html/2606.11319#A4.p3.14)\.
- \[28\]C\. Northcutt, L\. Jiang, and I\. Chuang\(2021\-05\)Confident learning: estimating uncertainty in dataset labels\.J\. Artif\. Int\. Res\.70,pp\. 1373–1411\.External Links:ISSN 1076\-9757,[Link](https://doi.org/10.1613/jair.1.12125),[Document](https://dx.doi.org/10.1613/jair.1.12125)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p1.1)\.
- \[29\]V\. Papyan, X\. Y\. Han, and D\. L\. Donoho\(2020\)Prevalence of neural collapse during the terminal phase of deep learning training\.CoRRabs/2008\.08186\.External Links:[Link](https://arxiv.org/abs/2008.08186),2008\.08186Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p7.1)\.
- \[30\]R\. Price\(1958\)A useful theorem for nonlinear devices having gaussian inputs\.IRE Transactions on Information Theory4\(2\),pp\. 69–72\.Cited by:[§III\.4](https://arxiv.org/html/2606.11319#S3.SS4.p7.16)\.
- \[31\]O\. H\. Ramírez\-Agudelo Sr, N\. Gorea, A\. Reif, L\. Bonasera, and M\. Karl Sr\(2025\)The role of noisy data in improving cnn robustness for image classification\.InApplications of Machine Learning 2025,Vol\.13606,pp\. 197–212\.Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[32\]R\. Reed, S\. Oh, and R\.J\. Marks\(1992\)Regularization using jittered training data\.In\[Proceedings 1992\] IJCNN International Joint Conference on Neural Networks,Vol\.3,pp\. 147–152 vol\.3\.External Links:[Document](https://dx.doi.org/10.1109/IJCNN.1992.227178)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[33\]D\. A\. Roberts, S\. Yaida, and B\. Hanin\(2022\)The principles of deep learning theory\.Vol\.46,Cambridge University Press Cambridge, MA, USA\.Cited by:[§III\.1](https://arxiv.org/html/2606.11319#S3.SS1.p2.2),[§III\.1](https://arxiv.org/html/2606.11319#S3.SS1.p4.2),[§III\.4](https://arxiv.org/html/2606.11319#S3.SS4.p5.8)\.
- \[34\]D\. Rolnick, A\. Veit, S\. Belongie, and N\. Shavit\(2017\)Deep learning is robust to massive label noise\.External Links:1705\.10694Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p1.1)\.
- \[35\]M\. E\. A\. Seddik and M\. Tamaazousti\(2022\-Jun\.\)Neural networks classify through the class\-wise means of their representations\.Proceedings of the AAAI Conference on Artificial Intelligence36\(8\),pp\. 8204–8211\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/20794),[Document](https://dx.doi.org/10.1609/aaai.v36i8.20794)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p7.1)\.
- \[36\]J\. Sietsma and R\. J\.F\. Dow\(1991\)Creating artificial neural networks that generalize\.Neural Networks4\(1\),pp\. 67–79\.External Links:ISSN 0893\-6080,[Document](https://dx.doi.org/10.1016/0893-6080%2891%2990033-2),[Link](https://www.sciencedirect.com/science/article/pii/0893608091900332)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[37\]J\. Snell, K\. Swersky, and R\. S\. Zemel\(2017\)Prototypical networks for few\-shot learning\.CoRRabs/1703\.05175\.External Links:[Link](http://arxiv.org/abs/1703.05175),1703\.05175Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p7.1)\.
- \[38\]R\. Tibshirani, T\. Hastie, B\. Narasimhan, and G\. Chu\(2002\)Diagnosis of multiple cancer types by shrunken centroids of gene expression\.Proceedings of the National Academy of Sciences99\(10\),pp\. 6567–6572\.External Links:[Document](https://dx.doi.org/10.1073/pnas.082099299)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p4.1),[§I](https://arxiv.org/html/2606.11319#S1.p7.1),[§III\.1](https://arxiv.org/html/2606.11319#S3.SS1.p8.3)\.
- \[39\]L\. Van Der Maaten, M\. Chen, S\. Tyree, and K\. Q\. Weinberger\(2013\)Learning with marginalized corrupted features\.InProceedings of the 30th International Conference on International Conference on Machine Learning \- Volume 28,ICML’13,pp\. I–410–I–418\.Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.
- \[40\]H\. Xiao, K\. Rasul, and R\. Vollgraf\(2017\)Fashion\-mnist: a novel image dataset for benchmarking machine learning algorithms\.arXiv preprint arXiv:1708\.07747\.External Links:1708\.07747Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p3.1),[§II\.2](https://arxiv.org/html/2606.11319#S2.SS2.p1.6)\.
- \[41\]R\. M\. Zur, Y\. Jiang, L\. L\. Pesce, and K\. Drukker\(2009\-10\)Noise injection for training artificial neural networks: a comparison with weight decay and early stopping\.Medical Physics36\(10\),pp\. 4810–4818\.External Links:[Document](https://dx.doi.org/10.1118/1.3213517)Cited by:[§I](https://arxiv.org/html/2606.11319#S1.p2.1)\.

## Appendix A Experimental details

### A\.1 Experimental setup

For reference, Table [1](https://arxiv.org/html/2606.11319#A1.T1) gathers the datasets, corruption schemes, model architectures, and training protocols used to produce each figure in the main text\.

Table 1: Compact summary of the experimental settings used in the main text and appendix\.

### A\.2 Squared\-error loss for classification

While cross\-entropy is the standard loss function for classification tasks, mean squared error \(MSE\) is frequently employed in analytical studies of neural networks, particularly within the NTK framework\. To adapt MSE for a $C$\-class classification problem, the categorical labels are encoded as one\-hot vectors\. Let $y_{i\chi}\in\{0,1\}$ denote the target for class $i=1,\dots,C$ associated with training example $\chi=1,\dots,N$, where $y_{i\chi}=1$ if the example belongs to class $i$, and $y_{i\chi}=0$ otherwise\. The neural network outputs a continuous scalar $f_{i}(\boldsymbol{x}_{\chi})$ for each class\. The empirical MSE loss over the training set is defined as

$$\mathcal{L}_{\mathrm{MSE}}=\frac{1}{2N}\sum_{\chi=1}^{N}\sum_{i=1}^{C}\left(f_{i}(\boldsymbol{x}_{\chi})-y_{i\chi}\right)^{2}.\tag{38}$$

The predicted class $i^{\star}$ is assigned to the index of the largest output logit:

$$i^{\star}=\arg\max_{i=1,\ldots,C}f_{i}(\boldsymbol{x}_{*}).\tag{39}$$

The primary theoretical advantage of using MSE over cross\-entropy in this context is that the loss gradient becomes strictly linear with respect to the network outputs\.
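As a concrete illustration of Eqs\. \(38\) and \(39\), the following minimal NumPy sketch evaluates the one\-hot MSE loss, its gradient with respect to the outputs, and the arg\-max prediction\. The function names and the toy output values are assumptions made for this example, and arrays are stored as \(example, class\), i\.e\. transposed relative to the index order of $y_{i\chi}$\.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer labels as one-hot rows, shape (N, C)."""
    return np.eye(num_classes)[labels]

def mse_loss(outputs, targets):
    """Eq. (38): sum of squared errors over examples and classes, divided by 2N."""
    n = outputs.shape[0]
    return float(np.sum((outputs - targets) ** 2) / (2.0 * n))

def mse_grad(outputs, targets):
    """Gradient of Eq. (38) with respect to the outputs; linear in the outputs."""
    return (outputs - targets) / outputs.shape[0]

def predict(outputs):
    """Eq. (39): index of the largest output for each example."""
    return np.argmax(outputs, axis=1)

# Toy values (illustrative only, not taken from the experiments).
labels = np.array([0, 2, 1])
targets = one_hot(labels, 3)
outputs = np.array([[0.9, 0.1, 0.0],
                    [0.2, 0.3, 0.5],
                    [0.1, 0.7, 0.2]])
loss = mse_loss(outputs, targets)
grad = mse_grad(outputs, targets)
predicted = predict(outputs)
```

Because `mse_grad` is an affine function of `outputs`, gradient\-flow dynamics under this loss reduce to a linear system, which is the property the paragraph above refers to\.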

## Appendix B Further analysis of simultaneous train\-time and test\-time corruption

In Sec\. [II](https://arxiv.org/html/2606.11319#S2) of the main text, we showed results for train\-time corruption as well as test\-time corruption, characterized by corruption levels $p_{\text{train}}$ and $p_{\text{test}}$, which are in general different\. We found that for a given value of $p_{\text{test}}$, the optimal training corruption level is $p_{\text{train}}\simeq p_{\text{test}}$\. Figure [6\(a\)](https://arxiv.org/html/2606.11319#A2.F6) shows the complementary view: accuracy as a function of $p_{\text{test}}$ for different values of $p_{\text{train}}$\. There, we find monotonic behavior: it is always best to test on clean data \($p_{\text{test}}=0$\), indicating an asymmetry in the effect of train\-time and test\-time corruption\. That networks trained on corrupted data do better when tested on clean data makes sense from the perspective of centroid classification: testing on actual examples from the clean distribution is optimal for the nearest\-class\-mean classifier\.

![Refer to caption](https://arxiv.org/html/2606.11319v1/x6.png)

Figure 6: \(a\) Test accuracy vs\. test corruption $p_{\text{test}}$ for several values of the train corruption $p_{\text{train}}$\. We find monotonically decreasing accuracy as a function of $p_{\text{test}}$, but for given $p_{\text{test}}$ the behavior with respect to $p_{\text{train}}$ is non\-monotonic \(see also Fig\. [3](https://arxiv.org/html/2606.11319#S2.F3) of the main text\)\. \(b\) Area under the curve \(AUC\) of the test accuracy vs\. $p_{\text{test}}$, as a function of $p_{\text{train}}$\. This is an overall measure of performance for a given $p_{\text{train}}$\. We find that the AUC peaks around $p_{\text{train}}^{\text{opt}}\approx 0.65$\. All network parameters are the same as in Fig\. [3](https://arxiv.org/html/2606.11319#S2.F3) of the main text\.

As an overall metric of performance of a network with given $p_{\text{train}}$, we calculate the area under the curve \(AUC\) of the test accuracy vs\. $p_{\text{test}}$\. Figure [6\(b\)](https://arxiv.org/html/2606.11319#A2.F6) shows the AUC as a function of $p_{\text{train}}$, exhibiting non\-monotonic behavior with a peak around $p_{\text{train}}^{\text{opt}}\approx 0.65$\. This implies that, if the test\-time corruption is not known ahead of time, training with corruption level $\approx 0.65$ will likely lead to optimal performance\.
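The AUC metric can be sketched as a trapezoidal integral of accuracy over a grid of test corruption levels\. The snippet below is only an illustration: the integration rule, the grid, and the linear toy accuracy curve are assumptions of this example and are not the measured curves of Fig\. 6\.

```python
import numpy as np

def trapezoid_auc(p_test, accuracy):
    """Area under accuracy(p_test) using the trapezoidal rule."""
    p = np.asarray(p_test, dtype=float)
    acc = np.asarray(accuracy, dtype=float)
    return float(np.sum(0.5 * (acc[1:] + acc[:-1]) * np.diff(p)))

# Toy curve: accuracy falls linearly from 1.0 at p_test = 0 to 0.1 at p_test = 1.
# These numbers are placeholders, not experimental results.
p_grid = np.linspace(0.0, 1.0, 11)
toy_accuracy = 1.0 - 0.9 * p_grid
auc = trapezoid_auc(p_grid, toy_accuracy)
```

Repeating this computation for each value of $p_{\text{train}}$ and taking the arg\-max over the resulting AUC values would give an estimate of the optimal training corruption level\.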

## Appendix C High\-noise NTK expansion

### C\.1 Setup

We consider a training set with $N$ examples, $C$ classes, and feature dimension $d$\. The clean training features are denoted by $\mathbf{x}\in\mathbb{R}^{d\times N}$, while the corrupted training features are denoted by $\tilde{\mathbf{x}}\in\mathbb{R}^{d\times N}$\. The clean labels are denoted by $\mathbf{y}\in\mathbb{R}^{N\times C}$, with $\boldsymbol{y}_{i}\in\mathbb{R}^{N}$ the one\-hot target vector for class $i=1,\ldots,C$\. A clean test feature is denoted by $x_{*}\in\mathbb{R}^{d}$\. The notation in this section is self\-contained\.

The index $\chi$ is reserved for training examples\. Other Greek indices, such as $\alpha,\beta,\gamma,\delta$, may refer either to training examples or to the test point $*$\. We write $z_{\alpha}$ for the input associated with index $\alpha$, so that $z_{\chi}=\tilde{x}_{\chi}$ for a corrupted training input and $z_{*}=x_{*}$ for the clean test input\.

We work in the infinite\-width NTK limit for an arbitrary $L$\-layer MLP with activation function $\sigma(\cdot)$\. The weights and biases at initialization are drawn i\.i\.d\. from mean\-zero Gaussian distributions with variances $C_{w}$ and $C_{b}$, respectively\. In this limit, training with squared loss gives the class\-$i$ output

$$f_{i}(x_{*})=\tilde{\boldsymbol{\Theta}}_{*\chi}^{(L)}\left[\tilde{\boldsymbol{\Theta}}_{\chi\chi}^{(L)}\right]^{-1}\boldsymbol{y}_{i},\tag{40}$$

where $\tilde{\boldsymbol{\Theta}}_{\chi\chi}^{(L)}\in\mathbb{R}^{N\times N}$ is the training–training NTK and $\tilde{\boldsymbol{\Theta}}_{*\chi}^{(L)}\in\mathbb{R}^{1\times N}$ is the test–training NTK\. The tilde indicates that the NTK is evaluated on the corrupted training features\.
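The kernel predictor of Eq\. \(40\) can be sketched numerically as follows\. This is a minimal illustration under stated assumptions: `stand_in_kernel` is a Gaussian kernel used purely as a placeholder for the NTK, the data are random, and all names are hypothetical\. Only the linear\-algebra structure, a test–training kernel row times the inverse training kernel times the one\-hot labels, mirrors the equation\.

```python
import numpy as np

def kernel_predict(theta_train, theta_test_train, Y):
    """Eq. (40) structure: theta_test_train @ inv(theta_train) @ Y.

    A linear solve is used instead of forming the inverse explicitly.
    """
    return theta_test_train @ np.linalg.solve(theta_train, Y)

def stand_in_kernel(A, B):
    """Gaussian kernel between the columns of A (d x n) and B (d x m).

    Placeholder only: this is NOT the NTK defined by the recursion in the text.
    """
    sq_dist = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-0.5 * sq_dist)

rng = np.random.default_rng(0)
d, N, C = 3, 6, 3
X_train = rng.normal(size=(d, N))        # columns play the role of training inputs
labels = np.array([0, 1, 2, 0, 1, 2])
Y = np.eye(C)[labels]                    # one-hot labels, shape (N, C)
x_test = rng.normal(size=(d, 1))         # a single test input

theta_train = stand_in_kernel(X_train, X_train)
train_fit = kernel_predict(theta_train, theta_train, Y)
f_test = kernel_predict(theta_train, stand_in_kernel(x_test, X_train), Y)
predicted_class = int(np.argmax(f_test, axis=1)[0])
```

Evaluating the predictor on the training inputs themselves returns the labels, reflecting that the expression in Eq\. \(40\) interpolates the training targets whenever the training kernel is invertible\.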

For arbitrary functions $g$ and $h$, we define the two\-dimensional Gaussian expectation by

$$\langle g_{\alpha}h_{\beta}\rangle_{K}=\frac{1}{2\pi\sqrt{\det K}}\int da\,db\,g(a)h(b)\exp\left[-\frac{1}{2}\begin{pmatrix}a&b\end{pmatrix}\begin{pmatrix}K_{\alpha\alpha}&K_{\alpha\beta}\\ K_{\beta\alpha}&K_{\beta\beta}\end{pmatrix}^{-1}\begin{pmatrix}a\\ b\end{pmatrix}\right].\tag{41}$$

The corresponding one\-dimensional Gaussian expectation is denoted by

$$\langle gh\rangle_{K_{\alpha\alpha}}=\frac{1}{\sqrt{2\pi K_{\alpha\alpha}}}\int da\,g(a)h(a)\exp\left[-\frac{a^{2}}{2K_{\alpha\alpha}}\right].\tag{42}$$

The noisy Bayesian kernel is defined recursively as

$$\tilde{K}_{\alpha\beta}^{(l+1)}=C_{b}+C_{w}\langle\sigma_{\alpha}\sigma_{\beta}\rangle_{\tilde{K}^{(l)}},\tag{43}$$

with initial condition

$$\tilde{K}_{\alpha\beta}^{(1)}=C_{b}+\frac{C_{w}}{d}\sum_{j=1}^{d}z_{j\alpha}z_{j\beta}.\tag{44}$$

The noisy NTK is defined recursively by

$$\tilde{\Theta}_{\alpha\beta}^{(l+1)}=C_{b}+C_{w}\left(\langle\sigma_{\alpha}\sigma_{\beta}\rangle_{\tilde{K}^{(l)}}+\langle\sigma'_{\alpha}\sigma'_{\beta}\rangle_{\tilde{K}^{(l)}}\tilde{\Theta}_{\alpha\beta}^{(l)}\right),\tag{45}$$

with initial condition

$$\tilde{\Theta}_{\alpha\beta}^{(1)}=\tilde{K}_{\alpha\beta}^{(1)}.\tag{46}$$
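The recursions in Eqs\. \(43\)–\(46\) can be iterated numerically once the Gaussian expectations are available\. The sketch below specializes to ReLU, for which the two expectations have the standard arc\-cosine closed forms; this choice of activation, the function names, and the random inputs are assumptions of the example, whereas the text above treats a generic activation $\sigma$\.

```python
import numpy as np

def relu_expectations(K):
    """Closed-form <relu relu> and <relu' relu'> under N(0, K) for all index pairs.

    Uses the standard arc-cosine kernel formulas; requires a positive diagonal.
    """
    scale = np.sqrt(np.diag(K))
    norm = np.outer(scale, scale)
    cos_t = np.clip(K / norm, -1.0, 1.0)
    np.fill_diagonal(cos_t, 1.0)          # avoid round-off on the diagonal
    angle = np.arccos(cos_t)
    ss = norm * (np.sin(angle) + (np.pi - angle) * cos_t) / (2.0 * np.pi)
    dd = (np.pi - angle) / (2.0 * np.pi)
    return ss, dd

def ntk_recursion(Z, depth, C_w=2.0, C_b=0.0):
    """Iterate Eqs. (43)-(46) for inputs stored as the columns of Z (d x n)."""
    d = Z.shape[0]
    K = C_b + C_w * (Z.T @ Z) / d                  # Eq. (44)
    Theta = K.copy()                               # Eq. (46)
    for _ in range(depth - 1):
        ss, dd = relu_expectations(K)
        Theta = C_b + C_w * (ss + dd * Theta)      # Eq. (45), uses layer-l kernel
        K = C_b + C_w * ss                         # Eq. (43)
    return K, Theta

rng = np.random.default_rng(1)
Z = rng.normal(size=(8, 4))
K1 = 2.0 * (Z.T @ Z) / 8
K_L, Theta_L = ntk_recursion(Z, depth=3)
```

With $C_{w}=2$ and $C_{b}=0$ the ReLU recursion preserves the diagonal of the Bayesian kernel, and the diagonal of the depth\-$L$ NTK equals $L$ times that of the first\-layer kernel, which gives a simple sanity check on the implementation\.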
Since we work in the large\-$d$ limit, the initial feature average obeys a central\-limit expansion,

$$\frac{C_{w}}{d}\sum_{j=1}^{d}z_{j\alpha}z_{j\beta}=\frac{C_{w}}{d}\sum_{j=1}^{d}\mathbb{E}[z_{j\alpha}z_{j\beta}]+\frac{1}{\sqrt{d}}\eta_{\alpha\beta}+\mathcal{O}\left(d^{-1}\right),\tag{47}$$

where $\eta_{\alpha\beta}$ is a mean\-zero Gaussian random tensor with covariance determined by the noise model\. Here and throughout this appendix, $\mathbb{E}[\cdot]$ denotes expectation over the input\-corruption noise\. Thus the first\-layer Bayesian kernel may be decomposed as

$$\tilde{K}_{\alpha\beta}^{(1)}=K_{\alpha\beta}^{(1)}+\frac{1}{\sqrt{d}}\eta_{\alpha\beta}+\mathcal{O}\left(d^{-1}\right),\tag{48}$$

where

$$K_{\alpha\beta}^{(1)}=C_{b}+\frac{C_{w}}{d}\sum_{j=1}^{d}\mathbb{E}[z_{j\alpha}z_{j\beta}].\tag{49}$$
The corruption model is assumed to have a noise strength $p\in[0,1]$\. We define the signal strength

$$\epsilon=1-p,\tag{50}$$

so that $\epsilon=0$ is the pure\-noise limit\. Both $K_{\alpha\beta}^{(1)}$ and $\eta_{\alpha\beta}$ may depend on $\epsilon$\. Since the expansion is around the pure\-noise point, all quantities below are evaluated at $\epsilon=0$ unless otherwise specified\. For example,

$$K_{\alpha\beta}^{(1)}=K_{\alpha\beta}^{(1)}\big|_{\epsilon=0},\qquad\frac{\partial K_{\alpha\beta}^{(1)}}{\partial\epsilon}=\left.\frac{\partial K_{\alpha\beta}^{(1)}(\epsilon)}{\partial\epsilon}\right|_{\epsilon=0}.\tag{51}$$

With this convention, the first\-layer perturbation has the form

$$\delta K_{\alpha\beta}^{(1)}=\epsilon\frac{\partial K_{\alpha\beta}^{(1)}}{\partial\epsilon}+\frac{1}{\sqrt{d}}\eta_{\alpha\beta}.\tag{52}$$

### C\.2 Useful identities

We collect several identities used repeatedly below\. First, Price’s theorem gives the following derivatives of Gaussian expectations with respect to the covariance\. For one\-dimensional Gaussian expectations,

$$\frac{\partial}{\partial K_{\alpha\alpha}}\langle gh\rangle_{K_{\alpha\alpha}}=\langle g'h'\rangle_{K_{\alpha\alpha}}+\frac{1}{2}\langle g''h\rangle_{K_{\alpha\alpha}}+\frac{1}{2}\langle gh''\rangle_{K_{\alpha\alpha}}.\tag{53}$$

For two\-dimensional Gaussian expectations with $\alpha\neq\beta$,

$$\frac{\partial}{\partial K_{\alpha\alpha}}\langle g_{\alpha}h_{\beta}\rangle_{K}=\frac{1}{2}\langle g''_{\alpha}h_{\beta}\rangle_{K},\tag{54}$$

$$\frac{\partial}{\partial K_{\alpha\beta}}\langle g_{\alpha}h_{\beta}\rangle_{K}=\langle g'_{\alpha}h'_{\beta}\rangle_{K},\tag{55}$$

$$\frac{\partial}{\partial K_{\beta\beta}}\langle g_{\alpha}h_{\beta}\rangle_{K}=\frac{1}{2}\langle g_{\alpha}h''_{\beta}\rangle_{K}.\tag{56}$$

Although these expressions are written for smooth activations, non\-smooth activations such as ReLU may be recovered by a limiting procedure\.
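The one\-dimensional identity, Eq\. \(53\), can be checked numerically with Gauss–Hermite quadrature\. The sketch below is an illustration only: the test functions $g(a)=\sin a$ and $h(a)=e^{a/2}$, the variance value, and the finite\-difference step are arbitrary choices made for this example\.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gauss_mean(f, K, n=60):
    """E[f(a)] for a ~ N(0, K), via probabilists' Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n)       # weight function exp(-x^2 / 2)
    return float(np.sum(weights * f(np.sqrt(K) * nodes)) / np.sqrt(2.0 * np.pi))

# Smooth test functions and their first two derivatives (illustrative choices).
g, dg, d2g = np.sin, np.cos, (lambda a: -np.sin(a))
h = lambda a: np.exp(0.5 * a)
dh = lambda a: 0.5 * np.exp(0.5 * a)
d2h = lambda a: 0.25 * np.exp(0.5 * a)

K, step = 0.7, 1e-4
gh = lambda a: g(a) * h(a)

# Left-hand side of Eq. (53): central finite difference in the variance K.
lhs = (gauss_mean(gh, K + step) - gauss_mean(gh, K - step)) / (2.0 * step)

# Right-hand side of Eq. (53): <g'h'> + <g''h>/2 + <gh''>/2.
rhs = (gauss_mean(lambda a: dg(a) * dh(a), K)
       + 0.5 * gauss_mean(lambda a: d2g(a) * h(a), K)
       + 0.5 * gauss_mean(lambda a: g(a) * d2h(a), K))
```

For this particular pair of functions the expectation is available in closed form, $\langle gh\rangle_{K}=e^{-3K/8}\sin(K/2)$, so both sides can also be compared against its analytic derivative\.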

We also use the elementary solution of the scalar recursion

$$B^{(l+1)}=a_{l}+b_{l}B^{(l)}.\tag{57}$$

It is

$$B^{(l)}=\sum_{i=1}^{l-1}a_{i}\prod_{j=i+1}^{l-1}b_{j}+B^{(1)}\prod_{i=1}^{l-1}b_{i}.\tag{58}$$
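The closed form in Eq\. \(58\) can be confirmed against direct iteration of Eq\. \(57\)\. The following short check uses arbitrary coefficient values chosen for illustration; the function names are hypothetical\.

```python
from math import prod

def iterate_recursion(a, b, B1, l):
    """Apply B^(k+1) = a_k + b_k * B^(k) for k = 1, ..., l-1 (Eq. 57)."""
    B = B1
    for k in range(1, l):
        B = a[k] + b[k] * B
    return B

def closed_form(a, b, B1, l):
    """Evaluate Eq. (58); empty products are equal to one."""
    total = B1 * prod(b[i] for i in range(1, l))
    for i in range(1, l):
        total += a[i] * prod(b[j] for j in range(i + 1, l))
    return total

# Arbitrary coefficients keyed by layer index (illustrative values).
a = {1: 0.5, 2: -1.0, 3: 2.0, 4: 0.25}
b = {1: 1.5, 2: 0.8, 3: -0.4, 4: 2.0}
B1 = 3.0
max_gap = max(abs(iterate_recursion(a, b, B1, l) - closed_form(a, b, B1, l))
              for l in range(1, 6))
```

Both the Bayesian\-kernel perturbation and the NTK recursion have this affine form in the layer index, which is why the same solution is reused below\.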
Next, for an $N\times N$ matrix of the form

$$H=h_{d}\mathbf{I}+h_{o}(\mathbf{1}\mathbf{1}^{\mathrm{T}}-\mathbf{I}),\tag{59}$$

where $\mathbf{I}$ is the identity and $\mathbf{1}$ is the vector of all ones, the inverse is

$$H^{-1}=\frac{1}{h_{d}-h_{o}}\left(\mathbf{I}-\frac{h_{o}}{h_{d}-h_{o}+h_{o}N}\mathbf{1}\mathbf{1}^{\mathrm{T}}\right).\tag{60}$$

For vectors $\mathbf{a},\mathbf{b}\in\mathbb{R}^{N}$, defining

$$\mathrm{E}_{\alpha}[a_{\alpha}]=\frac{1}{N}\sum_{\alpha=1}^{N}a_{\alpha},\qquad\mathrm{E}_{\alpha}[a_{\alpha}b_{\alpha}]=\frac{1}{N}\sum_{\alpha=1}^{N}a_{\alpha}b_{\alpha},\tag{61}$$

we therefore have

$$\mathbf{a}^{\mathrm{T}}H^{-1}\mathbf{b}=\frac{N}{h_{d}-h_{o}}\left(\mathrm{E}_{\alpha}[a_{\alpha}b_{\alpha}]-\frac{h_{o}N}{h_{d}-h_{o}+h_{o}N}\mathrm{E}_{\alpha}[a_{\alpha}]\mathrm{E}_{\beta}[b_{\beta}]\right).\tag{62}$$
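Equations \(60\) and \(62\) follow from the Sherman–Morrison formula applied to $H=(h_{d}-h_{o})\mathbf{I}+h_{o}\mathbf{1}\mathbf{1}^{\mathrm{T}}$, and they are easy to verify numerically\. The values of $N$, $h_{d}$, $h_{o}$ and the random vectors below are arbitrary choices for this illustrative check\.

```python
import numpy as np

N, h_d, h_o = 5, 2.0, 0.3
ones = np.ones((N, N))
H = h_d * np.eye(N) + h_o * (ones - np.eye(N))              # Eq. (59)

gap = h_d - h_o
H_inv = (np.eye(N) - h_o / (gap + h_o * N) * ones) / gap    # Eq. (60)

rng = np.random.default_rng(2)
a_vec = rng.normal(size=N)
b_vec = rng.normal(size=N)

# Left-hand side of Eq. (62), computed with a generic linear solve.
lhs = a_vec @ np.linalg.solve(H, b_vec)

# Right-hand side of Eq. (62), written with the empirical averages of Eq. (61).
rhs = N / gap * (np.mean(a_vec * b_vec)
                 - h_o * N / (gap + h_o * N) * np.mean(a_vec) * np.mean(b_vec))
```

The second term in Eq\. \(62\) subtracts a multiple of the product of means, so for large $N$ the quadratic form approaches a covariance\-like centered overlap between $\mathbf{a}$ and $\mathbf{b}$\.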
The following layer\-to\-layer quantities will also be used:

$$P_{\alpha\beta}^{a\rightarrow b}=\prod_{i=a}^{b-1}\langle\sigma'_{\alpha}\sigma'_{\beta}\rangle_{K^{(i)}},\tag{63}$$

$$U_{\alpha}^{a\rightarrow b}=\prod_{i=a}^{b-1}\left(\langle\sigma'\sigma'\rangle_{K_{\alpha\alpha}^{(i)}}+\langle\sigma\sigma''\rangle_{K_{\alpha\alpha}^{(i)}}\right),\tag{64}$$

and

$$M_{\alpha\beta}^{a\rightarrow b}=\frac{1}{2}\sum_{i=a}^{b-1}\langle\sigma''_{\alpha}\sigma_{\beta}\rangle_{K^{(i)}}U_{\alpha}^{a\rightarrow i}P_{\alpha\beta}^{i+1\rightarrow b}.\tag{65}$$

The corresponding NTK\-response coefficients are

$$R_{\alpha}^{a\rightarrow b}=\sum_{i=a}^{b-1}\left(\langle\sigma'\sigma'\rangle_{K_{\alpha\alpha}^{(i)}}+\langle\sigma\sigma''\rangle_{K_{\alpha\alpha}^{(i)}}+\left[\langle\sigma''\sigma''\rangle_{K_{\alpha\alpha}^{(i)}}+\langle\sigma'\sigma'''\rangle_{K_{\alpha\alpha}^{(i)}}\right]\Theta_{\alpha\alpha}^{(i)}\right)U_{\alpha}^{a\rightarrow i}P_{\alpha\alpha}^{i+1\rightarrow b}+P_{\alpha\alpha}^{a\rightarrow b},\tag{66}$$

$$S_{\alpha\beta}^{a\rightarrow b}=\sum_{i=a}^{b-1}\left[\frac{1}{2}\left(\langle\sigma''_{\alpha}\sigma_{\beta}\rangle_{K^{(i)}}+\Theta_{\alpha\beta}^{(i)}\langle\sigma'''_{\alpha}\sigma'_{\beta}\rangle_{K^{(i)}}\right)U_{\alpha}^{a\rightarrow i}+\left(\langle\sigma'_{\alpha}\sigma'_{\beta}\rangle_{K^{(i)}}+\Theta_{\alpha\beta}^{(i)}\langle\sigma''_{\alpha}\sigma''_{\beta}\rangle_{K^{(i)}}\right)M_{\alpha\beta}^{a\rightarrow i}\right]P_{\alpha\beta}^{i+1\rightarrow b},\tag{67}$$

$$T_{\alpha\beta}^{a\rightarrow b}=\sum_{i=a}^{b-1}\left(\langle\sigma'_{\alpha}\sigma'_{\beta}\rangle_{K^{(i)}}+\Theta_{\alpha\beta}^{(i)}\langle\sigma''_{\alpha}\sigma''_{\beta}\rangle_{K^{(i)}}\right)P_{\alpha\beta}^{a\rightarrow i}P_{\alpha\beta}^{i+1\rightarrow b}+P_{\alpha\beta}^{a\rightarrow b}.\tag{68}$$

All quantities in Eqs\. \([63](https://arxiv.org/html/2606.11319#A3.E63)\)–\([68](https://arxiv.org/html/2606.11319#A3.E68)\) are evaluated in the pure\-noise limit\. Therefore their values depend only on whether the indices are training–training, test–training, or test–test indices; they do not otherwise depend on the individual data points\.

### C\.3 Influence of input perturbations on the output

We now compute how a perturbation in the first\-layer Bayesian kernel propagates to the output\. Let

$$\tilde{K}_{\alpha\beta}^{(l)}=K_{\alpha\beta}^{(l)}+\delta K_{\alpha\beta}^{(l)}+\mathcal{O}(\delta K^{2}),\qquad\tilde{\Theta}_{\alpha\beta}^{(l)}=\Theta_{\alpha\beta}^{(l)}+\delta\Theta_{\alpha\beta}^{(l)}+\mathcal{O}(\delta K^{2}).\tag{69}$$

The unperturbed kernels satisfy the pure\-noise recursions

$$\Theta_{\alpha\beta}^{(l+1)}=C_{b}+C_{w}\left(\langle\sigma_{\alpha}\sigma_{\beta}\rangle_{K^{(l)}}+\langle\sigma'_{\alpha}\sigma'_{\beta}\rangle_{K^{(l)}}\Theta_{\alpha\beta}^{(l)}\right),\tag{70}$$

$$K_{\alpha\beta}^{(l+1)}=C_{b}+C_{w}\langle\sigma_{\alpha}\sigma_{\beta}\rangle_{K^{(l)}}.\tag{71}$$

The linear perturbations obey

$$\begin{aligned}\frac{1}{C_{w}}\delta K^{(l+1)}_{\alpha\beta}&=\delta_{\alpha\beta}\left(\langle\sigma'\sigma'\rangle_{K_{\alpha\alpha}^{(l)}}+\langle\sigma\sigma''\rangle_{K_{\alpha\alpha}^{(l)}}\right)\delta K_{\alpha\alpha}^{(l)}\\ &\quad+(1-\delta_{\alpha\beta})\left[\frac{1}{2}\langle\sigma''_{\alpha}\sigma_{\beta}\rangle_{K^{(l)}}\delta K_{\alpha\alpha}^{(l)}+\langle\sigma'_{\alpha}\sigma'_{\beta}\rangle_{K^{(l)}}\delta K_{\alpha\beta}^{(l)}+\frac{1}{2}\langle\sigma_{\alpha}\sigma''_{\beta}\rangle_{K^{(l)}}\delta K_{\beta\beta}^{(l)}\right],\end{aligned}\tag{72}$$

$$\begin{aligned}\frac{1}{C_{w}}\delta\Theta^{(l+1)}_{\alpha\beta}&=\delta_{\alpha\beta}\Bigg[\Big(\langle\sigma'\sigma'\rangle_{K_{\alpha\alpha}^{(l)}}+\langle\sigma\sigma''\rangle_{K_{\alpha\alpha}^{(l)}}+\left[\langle\sigma''\sigma''\rangle_{K_{\alpha\alpha}^{(l)}}+\langle\sigma'\sigma'''\rangle_{K_{\alpha\alpha}^{(l)}}\right]\Theta_{\alpha\alpha}^{(l)}\Big)\delta K_{\alpha\alpha}^{(l)}+\langle\sigma'\sigma'\rangle_{K_{\alpha\alpha}^{(l)}}\delta\Theta_{\alpha\alpha}^{(l)}\Bigg]\\ &\quad+(1-\delta_{\alpha\beta})\Bigg[\frac{1}{2}\left(\langle\sigma''_{\alpha}\sigma_{\beta}\rangle_{K^{(l)}}+\langle\sigma'''_{\alpha}\sigma'_{\beta}\rangle_{K^{(l)}}\Theta_{\alpha\beta}^{(l)}\right)\delta K_{\alpha\alpha}^{(l)}\\ &\qquad\qquad+\left(\langle\sigma'_{\alpha}\sigma'_{\beta}\rangle_{K^{(l)}}+\langle\sigma''_{\alpha}\sigma''_{\beta}\rangle_{K^{(l)}}\Theta_{\alpha\beta}^{(l)}\right)\delta K_{\alpha\beta}^{(l)}\\ &\qquad\qquad+\frac{1}{2}\left(\langle\sigma_{\alpha}\sigma''_{\beta}\rangle_{K^{(l)}}+\langle\sigma'_{\alpha}\sigma'''_{\beta}\rangle_{K^{(l)}}\Theta_{\alpha\beta}^{(l)}\right)\delta K_{\beta\beta}^{(l)}\\ &\qquad\qquad+\langle\sigma'_{\alpha}\sigma'_{\beta}\rangle_{K^{(l)}}\delta\Theta_{\alpha\beta}^{(l)}\Bigg].\end{aligned}\tag{73}$$
Given the solution of the pure-noise Bayesian-kernel recursion, the pure-noise NTK has the closed-form solution

$$
\frac{1}{C_w^{l-1}}\Theta_{\alpha\beta}^{(l)} = \sum_{i=1}^{l-1}\left(C_b + C_w\langle\sigma_\alpha\sigma_\beta\rangle_{K^{(i)}}\right)C_w^{-i}P_{\alpha\beta}^{i+1\rightarrow l} + P_{\alpha\beta}^{1\rightarrow l}\,\Theta_{\alpha\beta}^{(1)}, \tag{74}
$$

and the perturbations propagate as

$$
\begin{aligned}
\frac{1}{C_w^{l-1}}\delta K_{\alpha\beta}^{(l)}
&= \delta_{\alpha\beta}\,U_\alpha^{1\rightarrow l}\,\delta K_{\alpha\alpha}^{(1)} \\
&\quad + (1-\delta_{\alpha\beta})\left(M_{\alpha\beta}^{1\rightarrow l}\,\delta K_{\alpha\alpha}^{(1)} + M_{\beta\alpha}^{1\rightarrow l}\,\delta K_{\beta\beta}^{(1)} + P_{\alpha\beta}^{1\rightarrow l}\,\delta K_{\alpha\beta}^{(1)}\right),
\end{aligned}
\tag{75}
$$

$$
\begin{aligned}
\frac{1}{C_w^{l-1}}\delta\Theta_{\alpha\beta}^{(l)}
&= \delta_{\alpha\beta}\,R_\alpha^{1\rightarrow l}\,\delta K_{\alpha\alpha}^{(1)} \\
&\quad + (1-\delta_{\alpha\beta})\left(S_{\alpha\beta}^{1\rightarrow l}\,\delta K_{\alpha\alpha}^{(1)} + S_{\beta\alpha}^{1\rightarrow l}\,\delta K_{\beta\beta}^{(1)} + T_{\alpha\beta}^{1\rightarrow l}\,\delta K_{\alpha\beta}^{(1)}\right).
\end{aligned}
\tag{76}
$$
At pure noise, the training–training NTK at layer $L$ has the permutation-symmetric form

$$
\frac{1}{C_w^{L-1}}\boldsymbol{\Theta}_{\chi\chi}^{(L)} = \theta_d\,\mathbf{I} + \theta_o\left(\mathbf{1}\mathbf{1}^{\mathrm{T}} - \mathbf{I}\right), \tag{77}
$$

while the test–training NTK has the form

$$
\frac{1}{C_w^{L-1}}\boldsymbol{\Theta}_{*\chi}^{(L)} = \theta_*\,\mathbf{1}^{\mathrm{T}}. \tag{78}
$$

Here $\theta_d$, $\theta_o$, and $\theta_*$ are constants determined by the architecture and the pure-noise kernel. We define

$$
\psi = \frac{N}{\theta_d - \theta_o + \theta_o N}, \qquad \mathrm{Cov}_\alpha^{1}(a_\alpha, b_\alpha) = \mathrm{E}_\alpha[a_\alpha b_\alpha] - \theta_o\psi\,\mathrm{E}_\alpha[a_\alpha]\,\mathrm{E}_\alpha[b_\alpha]. \tag{79}
$$
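The permutation-symmetric form of the pure-noise NTK admits a closed-form inverse through the Sherman-Morrison identity, which is where $\psi$ enters. The sketch below is a minimal numerical check under assumed values: `theta_d`, `theta_o`, `theta_star`, `n`, and the helper name `pure_noise_inverse` are hypothetical and chosen only for illustration. It confirms that the pure-noise prediction $\theta_*\mathbf{1}^{\mathrm{T}}[\theta_d\mathbf{I}+\theta_o(\mathbf{1}\mathbf{1}^{\mathrm{T}}-\mathbf{I})]^{-1}\boldsymbol{y}_i$ reduces to $\theta_*\psi\,\mathrm{E}_\alpha[y_{\alpha i}]$.

```python
import numpy as np


def pure_noise_inverse(theta_d, theta_o, n):
    """Closed-form inverse of theta_d*I + theta_o*(11^T - I) via Sherman-Morrison."""
    psi = n / (theta_d - theta_o + theta_o * n)
    inverse = (np.eye(n) - (theta_o * psi / n) * np.ones((n, n))) / (theta_d - theta_o)
    return inverse, psi


# Hypothetical constants, chosen only for illustration.
theta_d, theta_o, theta_star, n, n_classes = 2.0, 0.5, 0.7, 6, 3
kernel = theta_d * np.eye(n) + theta_o * (np.ones((n, n)) - np.eye(n))
inverse, psi = pure_noise_inverse(theta_d, theta_o, n)

# One-hot labels for equally represented classes.
labels = np.eye(n_classes)[np.arange(n) % n_classes]

# Pure-noise prediction theta_* 1^T Theta^{-1} y versus the leading term theta_* psi E[y].
prediction = theta_star * np.ones(n) @ inverse @ labels
leading_term = theta_star * psi * labels.mean(axis=0)
```

Both `np.allclose(inverse, np.linalg.inv(kernel))` and `np.allclose(prediction, leading_term)` should hold for any $\theta_d \neq \theta_o$ with $\theta_d - \theta_o + \theta_o N \neq 0$.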
Expanding the kernel-regression output to first order in $\delta\Theta$ gives

$$
f_i(x_*) = \left[\boldsymbol{\Theta}_{*\chi}^{(L)}\left[\boldsymbol{\Theta}_{\chi\chi}^{(L)}\right]^{-1} + \delta\boldsymbol{\Theta}_{*\chi}^{(L)}\left[\boldsymbol{\Theta}_{\chi\chi}^{(L)}\right]^{-1} - \boldsymbol{\Theta}_{*\chi}^{(L)}\left[\boldsymbol{\Theta}_{\chi\chi}^{(L)}\right]^{-1}\delta\boldsymbol{\Theta}_{\chi\chi}^{(L)}\left[\boldsymbol{\Theta}_{\chi\chi}^{(L)}\right]^{-1}\right]\boldsymbol{y}_i. \tag{80}
$$

Substituting the pure-noise inverse and the perturbative solution gives

$$
\begin{aligned}
f_i(x_*) ={}& \theta_*\psi\,\mathrm{E}_\alpha[y_{\alpha i}] \\
&+ \frac{N}{\theta_d - \theta_o}\Bigg(\left(S_{\chi *}^{1\rightarrow L} - \theta_*\psi\,S_{\chi\chi}^{1\rightarrow L}\right)\mathrm{Cov}_\alpha^{1}\left(\delta K_{\alpha\alpha}^{(1)}, y_{\alpha i}\right) \\
&\qquad\qquad\qquad + T_{*\chi}^{1\rightarrow L}\,\mathrm{Cov}_\alpha^{1}\left(\delta K_{*\alpha}^{(1)}, y_{\alpha i}\right) \\
&\qquad\qquad\qquad - \theta_*\psi\Bigg[T_{\chi\chi}^{1\rightarrow L}\,\mathrm{E}_\alpha\left[\mathrm{Cov}_\beta^{1}\left(\delta K_{\alpha\beta}^{(1)}, y_{\beta i}\right)\right] \\
&\qquad\qquad\qquad\qquad\quad + S_{\chi\chi}^{1\rightarrow L}\left(1 - \theta_o\psi\right)\mathrm{E}_\alpha\left[\delta K_{\alpha\alpha}^{(1)}\right]\mathrm{E}_\alpha[y_{\alpha i}]\Bigg]\Bigg) \\
&+ \mathcal{O}(\delta K\,N^{0}) + \mathcal{O}(\delta K^{2}).
\end{aligned}
\tag{81}
$$
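The first-order expansion of the kernel-regression output in Eq. (80) can be checked numerically. The sketch below is illustrative only: the constants, the random perturbation directions `d_train` and `d_test`, and the function names are assumptions, not quantities from the paper. It compares the exact perturbed predictor with its linearization and records the error at two perturbation scales, which should shrink roughly quadratically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classes = 8, 2
theta_d, theta_o, theta_star = 2.0, 0.5, 0.7  # hypothetical pure-noise constants

train = theta_d * np.eye(n) + theta_o * (np.ones((n, n)) - np.eye(n))
test = theta_star * np.ones(n)
labels = np.eye(n_classes)[np.arange(n) % n_classes]

# Arbitrary perturbation directions (symmetric for the train-train block).
d_train = rng.normal(size=(n, n))
d_train = (d_train + d_train.T) / 2
d_test = rng.normal(size=n)


def exact(scale):
    """Kernel regression with the fully perturbed kernels."""
    return (test + scale * d_test) @ np.linalg.solve(train + scale * d_train, labels)


def first_order(scale):
    """Linearized predictor: the three bracketed terms of the first-order expansion."""
    inv = np.linalg.inv(train)
    weights = test @ inv + scale * d_test @ inv - scale * test @ inv @ d_train @ inv
    return weights @ labels


# Truncation error at two scales; a tenfold smaller scale should cut it about a hundredfold.
errors = [np.abs(exact(s) - first_order(s)).max() for s in (1e-2, 1e-3)]
```

The quadratic shrinkage of `errors` is the numerical counterpart of the $\mathcal{O}(\delta K^{2})$ remainder.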

### C.4 Output fluctuations under Gaussian kernel perturbations

We now isolate the fluctuation term obtained by taking

$$
\delta K_{\alpha\beta}^{(1)} = \frac{1}{\sqrt{d}}\,\eta_{\alpha\beta}, \tag{82}
$$

where $\eta_{\alpha\beta}$ is a mean-zero Gaussian random tensor. Since Eq. ([81](https://arxiv.org/html/2606.11319#A3.E81)) is linear in $\eta_{\alpha\beta}$, the induced output fluctuation is also Gaussian. It is therefore enough to compute its covariance.

The basic identity is

$$
\mathbb{E}\left[\mathrm{Cov}_\beta^{1}(\eta_{\alpha\beta}, y_{\beta i})\,\mathrm{Cov}_\delta^{1}(\eta_{\gamma\delta}, y_{\delta j})\right] = \frac{1}{N}\,\mathbb{C}\mathrm{ov}(\eta_{\alpha\beta}, \eta_{\beta\gamma})\,\mathrm{Cov}_\alpha^{2}(y_{\alpha i}, y_{\alpha j}), \tag{83}
$$

where

$$
\mathrm{Cov}_\alpha^{2}(a_\alpha, b_\alpha) = \mathrm{E}_\alpha[a_\alpha b_\alpha] - \left(2\theta_o\psi - (\theta_o\psi)^{2}\right)\mathrm{E}_\alpha[a_\alpha]\,\mathrm{E}_\alpha[b_\alpha]. \tag{84}
$$

This identity follows from the i.i.d. structure of the input noise: a covariance between two $\eta$ variables vanishes unless the two index pairs share at least one data index.

For large $N$, the required contractions are

$$
\mathbb{E}\left[\mathrm{Cov}_\alpha^{1}(\eta_{\alpha\alpha}, y_{\alpha i})\,\mathrm{Cov}_\beta^{1}(\eta_{\beta\beta}, y_{\beta j})\right] = \frac{1}{N}\,\mathbb{V}\mathrm{ar}(\eta_{\alpha\alpha})\,\mathrm{Cov}_\beta^{2}(y_{\beta i}, y_{\beta j}), \tag{85}
$$

$$
\mathbb{E}\left[\mathrm{Cov}_\alpha^{1}(\eta_{\alpha\alpha}, y_{\alpha i})\,\mathrm{Cov}_\beta^{1}(\eta_{*\beta}, y_{\beta j})\right] = \frac{1}{N}\,\mathbb{C}\mathrm{ov}(\eta_{\alpha\alpha}, \eta_{*\alpha})\,\mathrm{Cov}_\beta^{2}(y_{\beta i}, y_{\beta j}), \tag{86}
$$

$$
\begin{aligned}
\mathbb{E}\left[\mathrm{Cov}_\alpha^{1}(\eta_{\alpha\alpha}, y_{\alpha i})\,\mathrm{E}_\beta\left[\mathrm{Cov}_\gamma^{1}(\eta_{\beta\gamma}, y_{\gamma j})\right]\right]
&= \frac{1}{N}\,\mathbb{C}\mathrm{ov}(\eta_{\alpha\alpha}, \eta_{\alpha\beta}) \\
&\quad \times \Bigg(\mathrm{Cov}_\gamma^{2}(y_{\gamma i}, y_{\gamma j}) + (1-\theta_o\psi)^{2}\,\mathrm{E}_\gamma[y_{\gamma i}]\,\mathrm{E}_\delta[y_{\delta j}]\Bigg) + \mathcal{O}\left(\frac{1}{N^{2}}\right),
\end{aligned}
\tag{87}
$$

$$
\mathbb{E}\left[\mathrm{Cov}_\alpha^{1}(\eta_{\alpha\alpha}, y_{\alpha i})\,\mathrm{E}_\beta[\eta_{\beta\beta}]\,\mathrm{E}_\gamma[y_{\gamma j}]\right] = \frac{1}{N}\,\mathbb{V}\mathrm{ar}(\eta_{\alpha\alpha})\,(1-\theta_o\psi)\,\mathrm{E}_\beta[y_{\beta i}]\,\mathrm{E}_\gamma[y_{\gamma j}], \tag{88}
$$

$$
\mathbb{E}\left[\mathrm{Cov}_\alpha^{1}(\eta_{*\alpha}, y_{\alpha i})\,\mathrm{Cov}_\beta^{1}(\eta_{*\beta}, y_{\beta j})\right] = \frac{1}{N}\,\mathbb{V}\mathrm{ar}(\eta_{*\alpha})\,\mathrm{Cov}_\beta^{2}(y_{\beta i}, y_{\beta j}), \tag{89}
$$

$$
\begin{aligned}
\mathbb{E}\left[\mathrm{Cov}_\alpha^{1}(\eta_{*\alpha}, y_{\alpha i})\,\mathrm{E}_\beta\left[\mathrm{Cov}_\gamma^{1}(\eta_{\beta\gamma}, y_{\gamma j})\right]\right]
&= \frac{1}{N}\,\mathbb{C}\mathrm{ov}(\eta_{*\alpha}, \eta_{\alpha\beta}) \\
&\quad \times \Bigg(\mathrm{Cov}_\gamma^{2}(y_{\gamma i}, y_{\gamma j}) + (1-\theta_o\psi)^{2}\,\mathrm{E}_\gamma[y_{\gamma i}]\,\mathrm{E}_\delta[y_{\delta j}]\Bigg) + \mathcal{O}\left(\frac{1}{N^{2}}\right),
\end{aligned}
\tag{90}
$$

$$
\mathbb{E}\left[\mathrm{Cov}_\alpha^{1}(\eta_{*\alpha}, y_{\alpha i})\,\mathrm{E}_\beta[\eta_{\beta\beta}]\,\mathrm{E}_\gamma[y_{\gamma j}]\right] = \frac{1}{N}\,\mathbb{C}\mathrm{ov}(\eta_{*\alpha}, \eta_{\alpha\alpha})\,(1-\theta_o\psi)\,\mathrm{E}_\beta[y_{\beta i}]\,\mathrm{E}_\gamma[y_{\gamma j}], \tag{91}
$$

$$
\begin{aligned}
\mathbb{E}\left[\mathrm{E}_\alpha\left[\mathrm{Cov}_\beta^{1}(\eta_{\alpha\beta}, y_{\beta i})\right]\mathrm{E}_\gamma\left[\mathrm{Cov}_\delta^{1}(\eta_{\gamma\delta}, y_{\delta j})\right]\right]
&= \frac{1}{N}\,\mathbb{C}\mathrm{ov}(\eta_{\alpha\beta}, \eta_{\beta\gamma}) \\
&\quad \times \Bigg(\mathrm{Cov}_\delta^{2}(y_{\delta i}, y_{\delta j}) + 3(1-\theta_o\psi)^{2}\,\mathrm{E}_\delta[y_{\delta i}]\,\mathrm{E}_\lambda[y_{\lambda j}]\Bigg) + \mathcal{O}\left(\frac{1}{N^{2}}\right),
\end{aligned}
\tag{92}
$$

$$
\mathbb{E}\left[\mathrm{E}_\alpha\left[\mathrm{Cov}_\beta^{1}(\eta_{\alpha\beta}, y_{\beta i})\right]\mathrm{E}_\gamma[\eta_{\gamma\gamma}]\,\mathrm{E}_\delta[y_{\delta j}]\right] = \frac{2}{N}\,\mathbb{C}\mathrm{ov}(\eta_{\alpha\beta}, \eta_{\beta\beta})\,(1-\theta_o\psi)\,\mathrm{E}_\gamma[y_{\gamma i}]\,\mathrm{E}_\delta[y_{\delta j}] + \mathcal{O}\left(\frac{1}{N^{2}}\right), \tag{93}
$$

$$
\mathbb{E}\left[\mathrm{E}_\alpha[\eta_{\alpha\alpha}]\,\mathrm{E}_\beta[y_{\beta i}]\,\mathrm{E}_\gamma[\eta_{\gamma\gamma}]\,\mathrm{E}_\delta[y_{\delta j}]\right] = \frac{1}{N}\,\mathbb{V}\mathrm{ar}(\eta_{\alpha\alpha})\,\mathrm{E}_\beta[y_{\beta i}]\,\mathrm{E}_\gamma[y_{\gamma j}]. \tag{94}
$$

The variances and covariances appearing here are scalars evaluated at pure noise. In particular, they have no explicit dependence on the clean training data, because the clean signal has been set to zero before evaluating the fluctuation covariance.

Combining these contractions gives

$$
\mathrm{Cov}\left(f_i(x_*), f_j(x_*)\right) = \frac{N/d}{(\theta_d - \theta_o)^{2}}\,\tilde{V}_{ij} + \mathcal{O}\left(\frac{1}{d^{2}}\right) + \mathcal{O}\left(\frac{N^{0}}{d}\right) + \mathcal{O}\left(\frac{N}{d^{2}}\right), \tag{95}
$$

where $\tilde{V}_{ij}$ is an architecture- and noise-dependent coefficient matrix. For $C$ equally partitioned classes,

$$
\mathrm{Cov}_\alpha^{2}(y_{\alpha i}, y_{\alpha j}) = \frac{1}{C}\left(\delta_{ij} - \frac{1}{C}\left[2\theta_o\psi - (\theta_o\psi)^{2}\right]\right). \tag{96}
$$

Thus the coefficient matrix can be written as

$$
\tilde{V}_{ij} = \delta_{ij}\,\tilde{V}_d + (1-\delta_{ij})\,\tilde{V}_o, \tag{97}
$$

for scalars $\tilde{V}_d$ and $\tilde{V}_o$. Since $\mathrm{Cov}^{2}$ has an overall factor of $1/C$, we can pull this out by defining $V_{ij} = C\tilde{V}_{ij}$. Thus, the output fluctuations may be written as

$$
f_i(x_*) - \mathbb{E}[f_i(x_*)] = \sqrt{\frac{N/C}{d}}\,\frac{\sqrt{V_d}}{\theta_d - \theta_o}\,\hat{\xi}_i, \qquad \hat{\boldsymbol{\xi}} \sim \mathcal{N}(0, \Sigma), \tag{98}
$$

where

$$
\Sigma_{ij} = \delta_{ij} + (1-\delta_{ij})\frac{V_o}{V_d}. \tag{99}
$$

### C.5 The output for an arbitrary i.i.d. noise model

For an arbitrary i.i.d. noise model, the first-layer perturbation is

$$
\delta K_{\alpha\beta}^{(1)} = \epsilon\,\frac{\partial K_{\alpha\beta}^{(1)}}{\partial\epsilon} + \frac{1}{\sqrt{d}}\,\eta_{\alpha\beta}. \tag{100}
$$

Substituting this into Eq. ([81](https://arxiv.org/html/2606.11319#A3.E81)), and using the fluctuation result Eq. ([98](https://arxiv.org/html/2606.11319#A3.E98)), gives the general high-noise output. Here, we assume the dataset has been partitioned into $C$ equally represented classes. We define

$$
\Delta\mathrm{E}_\alpha^{i}[a_\alpha] = \frac{C}{N}\sum_{\alpha\in i}a_\alpha - \theta_o\psi\,\frac{1}{N}\sum_{\alpha=1}^{N}a_\alpha, \tag{101}
$$

where $\alpha\in i$ means that the sum is restricted to training examples in class $i$. Since $\mathrm{E}_\alpha[y_{\alpha i}] = 1/C$, we have

$$
\mathrm{Cov}_\alpha^{1}(a_\alpha, y_{\alpha i}) = \frac{1}{C}\,\Delta\mathrm{E}_\alpha^{i}[a_\alpha]. \tag{102}
$$

The output therefore becomes

$$
\begin{aligned}
f_i(x_*;\epsilon) ={}& \frac{\theta_*\psi}{C} + \sqrt{\frac{N/C}{d}}\,\frac{\sqrt{V_d}}{\theta_d - \theta_o}\,\hat{\xi}_i \\
&+ \frac{\epsilon N/C}{\theta_d - \theta_o}\Bigg(\left(S_{\chi *}^{1\rightarrow L} - \theta_*\psi\,S_{\chi\chi}^{1\rightarrow L}\right)\Delta\mathrm{E}_\alpha^{i}\left[\frac{\partial K_{\alpha\alpha}^{(1)}}{\partial\epsilon}\right] \\
&\qquad\qquad\qquad + T_{*\chi}^{1\rightarrow L}\,\Delta\mathrm{E}_\alpha^{i}\left[\frac{\partial K_{*\alpha}^{(1)}}{\partial\epsilon}\right] \\
&\qquad\qquad\qquad - \theta_*\psi\Bigg(T_{\chi\chi}^{1\rightarrow L}\,\mathrm{E}_\alpha\left[\Delta\mathrm{E}_\beta^{i}\left[\frac{\partial K_{\alpha\beta}^{(1)}}{\partial\epsilon}\right]\right] \\
&\qquad\qquad\qquad\qquad\quad + S_{\chi\chi}^{1\rightarrow L}(1-\theta_o\psi)\,\mathrm{E}_\alpha\left[\frac{\partial K_{\alpha\alpha}^{(1)}}{\partial\epsilon}\right]\Bigg)\Bigg) \\
&+ \mathcal{O}(\epsilon N^{0}) + \mathcal{O}(\epsilon^{2}) + \mathcal{O}(N^{0}/\sqrt{d}) + \mathcal{O}(\epsilon N/\sqrt{d}) + \mathcal{O}(N/d).
\end{aligned}
\tag{103}
$$
We now impose the normalization used in the main text,

$$
\frac{1}{d}\sum_{j=1}^{d}x_{j\alpha} = 0, \qquad \frac{1}{d}\sum_{j=1}^{d}x_{j\alpha}^{2} = 1, \tag{104}
$$

for both training and test data.

Here we consider two general noise models. The first is additive noise with any mean-zero, unit-variance noise variable,

$$
\tilde{x}_{j\chi} = \epsilon\,x_{j\chi} + (1-\epsilon)\,\xi_{j\chi}, \qquad \mathbb{E}[\xi_{j\chi}] = 0, \qquad \mathbb{E}[\xi_{j\chi}^{2}] = 1, \tag{105}
$$

for which the first-layer derivatives are

$$
\frac{\partial K_{*\chi}^{(1)}}{\partial\epsilon} = \frac{C_w}{d}\sum_{j=1}^{d}x_{j*}x_{j\chi}, \quad \frac{\partial K_{\chi\chi'}^{(1)}}{\partial\epsilon} = -2C_w\,\delta_{\chi\chi'}. \tag{106}
$$

The second is replacement noise with an arbitrary replacement distribution $u$,

$$
\tilde{x}_{j\chi} = (1-b_{j\chi})\,x_{j\chi} + b_{j\chi}\,u_{j\chi}, \qquad \Pr(b_{j\chi} = 0) = \epsilon, \tag{107}
$$

for which the normalized data imply

$$
\frac{\partial K_{*\chi}^{(1)}}{\partial\epsilon} = \frac{C_w}{d}\sum_{j=1}^{d}x_{j*}x_{j\chi}, \tag{108}
$$

$$
\frac{\partial K_{\chi\chi'}^{(1)}}{\partial\epsilon} = C_w\,\delta_{\chi\chi'}\left(1 - \mathbb{E}[u^{2}]\right) - 2C_w\,(1-\delta_{\chi\chi'})\,\mathbb{E}[u]^{2}. \tag{109}
$$

Thus, for both additive noise and replacement noise, the only class-dependent dependence on the data in Eq. ([103](https://arxiv.org/html/2606.11319#A3.E103)) comes from

$$
\Delta\mathrm{E}_\alpha^{i}\left[\frac{\partial K_{*\alpha}^{(1)}}{\partial\epsilon}\right] = C_w\,\Delta\mathrm{E}_\alpha^{i}\left[\frac{1}{d}\sum_{j=1}^{d}x_{j*}x_{j\alpha}\right]. \tag{110}
$$

All other quantities in Eq. ([103](https://arxiv.org/html/2606.11319#A3.E103)), including $\theta_*$, $\theta_d$, $\theta_o$, $S$, $T$, and $V_d$, are constants determined only by the architecture and the noise distribution. This is because, under the normalization Eq. ([104](https://arxiv.org/html/2606.11319#A3.E104)), the pure-noise kernels depend on the data only through fixed quantities such as $\frac{1}{d}\sum_j x_{j\alpha}$ and $\frac{1}{d}\sum_j x_{j\alpha}^{2}$.
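The replacement-noise derivatives in Eqs. (108) and (109) can be reproduced from the noise-averaged first-layer kernel. The sketch below is a hedged illustration rather than the authors' procedure: it assumes the test point is clean, that the noise is independent across samples and features, and it uses hypothetical moments `mean_u` and `second_u` together with hypothetical helper names. The bias constant $C_b$ is omitted because it does not depend on $\epsilon$. Derivatives at $\epsilon = 0$ are taken by a forward finite difference.

```python
import numpy as np


def normalize(x):
    """Center each row and scale it to unit second moment (zero mean, unit variance per input)."""
    x = x - x.mean(axis=-1, keepdims=True)
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True))


def expected_kernels(x_test, x_train, eps, mean_u, second_u, c_w=1.0):
    """Noise-averaged first-layer kernels under replacement noise (C_b omitted)."""
    d = x_train.shape[1]
    mean_corrupted = eps * x_train + (1 - eps) * mean_u       # E[x~] for each feature
    k_train = c_w * mean_corrupted @ mean_corrupted.T / d     # valid off the diagonal
    diag = c_w * (eps * (x_train ** 2).mean(axis=1) + (1 - eps) * second_u)
    k_train[np.diag_indices_from(k_train)] = diag             # E[x~^2] on the diagonal
    k_test = c_w * mean_corrupted @ x_test / d                # clean test point
    return k_test, k_train


rng = np.random.default_rng(0)
x_train = normalize(rng.normal(size=(4, 50)))
x_test = normalize(rng.normal(size=50))
mean_u, second_u, c_w, step = 0.3, 1.2, 1.5, 1e-6  # hypothetical values

k_test0, k_train0 = expected_kernels(x_test, x_train, 0.0, mean_u, second_u, c_w)
k_test1, k_train1 = expected_kernels(x_test, x_train, step, mean_u, second_u, c_w)
d_test = (k_test1 - k_test0) / step     # compare with (C_w/d) sum_j x_{j*} x_{j chi}
d_train = (k_train1 - k_train0) / step  # compare with C_w(1 - E[u^2]) and -2 C_w E[u]^2
```

Under these assumptions, the diagonal of `d_train` matches $C_w(1-\mathbb{E}[u^{2}])$ and its off-diagonal entries match $-2C_w\,\mathbb{E}[u]^{2}$, neither of which depends on the class structure of the data.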

Therefore, for either noise model,

$$
\begin{aligned}
f_i(x_*;\epsilon) ={}& \mathrm{const} + A^{(L)}\sqrt{\frac{N/C}{d}}\,\hat{\xi}_i + B^{(L)}\,\epsilon\,\frac{N}{C}\,\Delta\mathrm{E}_\alpha^{i}\left[\frac{1}{d}\sum_{j=1}^{d}x_{j*}x_{j\alpha}\right] \\
&+ \mathcal{O}(\epsilon N^{0}) + \mathcal{O}(\epsilon^{2}) + \mathcal{O}(N^{0}/\sqrt{d}) + \mathcal{O}(\epsilon N/\sqrt{d}) + \mathcal{O}(N/d),
\end{aligned}
\tag{111}
$$

where $A^{(L)}$ and $B^{(L)}$ are constants that depend on the architecture and the noise distribution.

Finally, to obtain the expression quoted in the main text, note that when $\theta_o \neq 0$,

$$
\theta_o\psi = \frac{\theta_o N}{\theta_d - \theta_o + \theta_o N} = 1 + \mathcal{O}(N^{-1}). \tag{112}
$$

The exceptional case $\theta_o = 0$ can occur in special symmetry-preserving settings, such as when the activation is chosen to be odd and $C_b = 0$. This is a non-generic degeneracy rather than a fundamental obstruction. Indeed, adding any positive bias variance $C_b > 0$, however small, breaks this exact odd symmetry at the level of the kernel recursion and generically produces a nonzero off-diagonal pure-noise kernel $\theta_o$. Thus the case $\theta_o = 0$ can be avoided by an arbitrarily small and standard modification of the initialization. In the generic case $\theta_o \neq 0$, we therefore have

$$
\Delta\mathrm{E}_\alpha^{i}[a_\alpha] = \frac{C}{N}\sum_{\alpha\in i}a_\alpha - \frac{1}{N}\sum_{\alpha=1}^{N}a_\alpha + \mathcal{O}(N^{-1}). \tag{113}
$$

Defining the empirical class mean (centroid) and global mean by

$$
\bar{x}_j^{i} = \frac{C}{N}\sum_{\alpha\in i}x_{j\alpha}, \qquad \bar{x}_j = \frac{1}{N}\sum_{\alpha=1}^{N}x_{j\alpha}, \tag{114}
$$

we obtain Eq. ([13](https://arxiv.org/html/2606.11319#S3.E13)) of the main text,

$$
\begin{aligned}
f_i(x_*;\epsilon) ={}& \mathrm{const} + A^{(L)}\sqrt{\frac{N/C}{d}}\,\hat{\xi}_i + B^{(L)}\,\epsilon\,\frac{N}{C}\,\frac{1}{d}\sum_{j=1}^{d}x_{j*}\left(\bar{x}_j^{i} - \bar{x}_j\right) \\
&+ \mathcal{O}(\epsilon N^{0}) + \mathcal{O}(\epsilon^{2}) + \mathcal{O}(N^{0}/\sqrt{d}) + \mathcal{O}(\epsilon N/\sqrt{d}) + \mathcal{O}(N/d).
\end{aligned}
\tag{115}
$$

Using $\epsilon = 1 - p$ gives the high-noise expression stated in the main text. The leading class-dependent term is therefore the overlap between the clean test point and the centered empirical class centroid. Consequently, after averaging over the finite-$d$ fluctuations, the network implements a nearest-class-mean rule in the high-noise limit.
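The leading class-dependent term can be turned into a standalone classifier. The sketch below is a minimal illustration, assuming equally represented classes and inputs already normalized to zero mean and unit second moment; the function names and the toy data are hypothetical. It scores each class by the overlap $\frac{1}{d}\sum_j x_{j*}(\bar{x}_j^{i}-\bar{x}_j)$ and predicts the class with the largest score, omitting the class-independent constant, the positive prefactor $B^{(L)}\epsilon N/C$, and the fluctuation term.

```python
import numpy as np


def centroid_scores(x_test, x_train, y_train, n_classes):
    """Overlap (1/d) * sum_j x_{j*} (xbar_j^i - xbar_j) for every test point and class i."""
    d = x_train.shape[1]
    global_mean = x_train.mean(axis=0)
    class_means = np.stack([x_train[y_train == i].mean(axis=0) for i in range(n_classes)])
    return x_test @ (class_means - global_mean).T / d


def nearest_class_mean(x_test, x_train, y_train, n_classes):
    """Assign each test point to the class with the largest centered-centroid overlap."""
    return centroid_scores(x_test, x_train, y_train, n_classes).argmax(axis=1)


# Toy, already-normalized data with two balanced classes (hypothetical).
x_train = np.array([[1.0, 1.0, -1.0, -1.0],
                    [1.0, 1.0, -1.0, -1.0],
                    [-1.0, -1.0, 1.0, 1.0],
                    [-1.0, -1.0, 1.0, 1.0]])
y_train = np.array([0, 0, 1, 1])
x_test = np.array([[1.0, 1.0, -1.0, -1.0],
                   [-1.0, -1.0, 1.0, 1.0]])

scores = centroid_scores(x_test, x_train, y_train, n_classes=2)
predictions = nearest_class_mean(x_test, x_train, y_train, n_classes=2)
```

Note that the rule ranks classes by overlap with the centered centroid rather than by Euclidean distance; the two rankings agree when all centered centroids have the same norm.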

## Appendix D Nonnegativity of the coefficient multiplying the centroid

For the centroid term in Eq. ([13](https://arxiv.org/html/2606.11319#S3.E13)) to have the correct orientation, its coefficient $B^{(L)}$ must be nonnegative. We show here that this is guaranteed by the normalization convention and the unit-variance noise models adopted in the main text.

According to Eq. ([103](https://arxiv.org/html/2606.11319#A3.E103)), the sign of $B^{(L)}$ is that of $T^{(L)}_{*\chi}$, so it is enough to establish $T^{(L)}_{*\chi} \geq 0$. The recursion that generates $T^{(L)}_{*\chi}$ (Appendix [C](https://arxiv.org/html/2606.11319#A3)) involves only the initialization constants $C_b \geq 0$ and $C_w > 0$, the test–train kernels $K^{(\ell)}_{*\chi}$ and NTKs $\Theta^{(\ell)}_{*\chi}$, and the two-dimensional Gaussian expectations

$$
\langle h_* h_\chi\rangle_{K^{(\ell)}}, \tag{116}
$$

in which $h$ stands for the activation or one of its derivatives, evaluated on a pair of jointly Gaussian pre-activations. The nonnegativity of $T^{(L)}_{*\chi}$ follows once each of these ingredients is shown to be nonnegative at every layer.

The essential observation concerns the Gaussian expectation itself. Let $(z_*, z_\chi)$ be jointly Gaussian with zero mean and covariance

$$
K = \begin{pmatrix} K_{**} & K_{*\chi} \\ K_{*\chi} & K_{\chi\chi} \end{pmatrix}, \tag{117}
$$

and suppose the two pre-activations share a common variance and a nonnegative correlation,

$$
K_{**} = K_{\chi\chi} > 0, \qquad K_{*\chi} \geq 0. \tag{118}
$$

Because the two marginals have the common variance $K_{**}$, we may rescale to standard normal variables, writing $z_* = \sqrt{K_{**}}\,X$ and $z_\chi = \sqrt{K_{**}}\,Y$ with $(X, Y)$ jointly standard normal. Their correlation is inherited from the off-diagonal entry of $K$,

$$
\rho = \frac{K_{*\chi}}{K_{**}} \in [0, 1], \tag{119}
$$

where $\rho \geq 0$ follows from $K_{*\chi} \geq 0$, and $\rho \leq 1$ because $K$ is a valid covariance matrix. We now expand the activation in the Hermite polynomials $H_n$, orthogonal with respect to the standard Gaussian weight ($H_0 = 1$, $H_1 = x$, $H_2 = x^{2} - 1$, and so on) [[1](https://arxiv.org/html/2606.11319#bib.bib104),[27](https://arxiv.org/html/2606.11319#bib.bib105)],

$$
h\!\left(\sqrt{K_{**}}\,x\right) = \sum_{n=0}^{\infty}c_n\,H_n(x). \tag{120}
$$

Using the bivariate identity $\mathbb{E}[H_m(X)H_n(Y)] = \delta_{mn}\,n!\,\rho^{\,n}$, the expectation collapses to a single sum over the diagonal,

$$
\langle h_* h_\chi\rangle_K = \mathbb{E}\!\left[h(z_*)\,h(z_\chi)\right] = \sum_{n=0}^{\infty}c_n^{2}\,n!\,\rho^{\,n} \geq 0, \tag{121}
$$

every term of which is nonnegative because $\rho \geq 0$. A common variance together with a nonnegative correlation is therefore sufficient to render $\langle h_* h_\chi\rangle_K$ nonnegative for any activation $h$ square-integrable with respect to a Gaussian measure; in particular this holds for $h = \sigma, \sigma', \sigma''$.
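The bivariate Hermite identity behind Eq. (121) can be verified by quadrature. The sketch below is an illustrative check, not part of the paper: it uses NumPy's probabilists' Hermite module and a tensor-product Gauss-Hermite rule, writing $Y = \rho X + \sqrt{1-\rho^{2}}\,Z$ with $X$ and $Z$ independent standard normals. The function name and the quadrature order are assumptions.

```python
import numpy as np
from numpy.polynomial import hermite_e as He


def hermite_pair_expectation(m, n, rho, order=40):
    """E[He_m(X) He_n(Y)] for standard normal X, Y with correlation rho."""
    nodes, weights = He.hermegauss(order)
    weights = weights / np.sqrt(2 * np.pi)      # normalize to a probability measure
    x = nodes[:, None]
    z = nodes[None, :]
    y = rho * x + np.sqrt(1 - rho ** 2) * z     # Y correlated with X at level rho
    he_m = He.hermeval(x, [0] * m + [1])        # coefficient vector selecting He_m
    he_n = He.hermeval(y, [0] * n + [1])        # coefficient vector selecting He_n
    return float(np.sum(weights[:, None] * weights[None, :] * he_m * he_n))


diagonal = hermite_pair_expectation(3, 3, 0.5)      # identity predicts 3! * 0.5**3
off_diagonal = hermite_pair_expectation(2, 3, 0.5)  # identity predicts zero
```

Because the integrand is a low-degree polynomial, the quadrature is exact up to rounding, so `diagonal` should agree with $3!\,\rho^{3}$ and `off_diagonal` should vanish.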

It remains to verify that the conditions Eq. ([118](https://arxiv.org/html/2606.11319#A4.E118)) are met at every layer. At the first layer the kernel is fixed directly by the input overlaps,

$$
K^{(1)} = C_b + C_w\begin{pmatrix} \frac{1}{d}\sum_j x_{j*}^{2} & \mathbb{E}[u]\,\frac{1}{d}\sum_j x_{j*} \\[4pt] \mathbb{E}[u]\,\frac{1}{d}\sum_j x_{j*} & \mathbb{E}[u^{2}] \end{pmatrix}, \tag{122}
$$

the constant $C_b$ being added to every entry. The normalization

$$
\frac{1}{d}\sum_j x_{j*} = 0, \qquad \frac{1}{d}\sum_j x_{j*}^{2} = 1 \tag{123}
$$

annihilates the off-diagonal signal and fixes the test-point variance to unity, leaving $K^{(1)}_{*\chi} = C_b \geq 0$. The diagonal entries, $C_b + C_w$ and $C_b + C_w\,\mathbb{E}[u^{2}]$, coincide precisely when $\mathbb{E}[u^{2}] = 1$. This is the case for both noise models employed here, since replacement noise with $u \sim \mathrm{Unif}[-\sqrt{3}, \sqrt{3}]$ has $\mathbb{E}[u] = 0$ and $\mathbb{E}[u^{2}] = 1$, and the additive Gaussian noise is taken with zero mean and unit variance. The layer-one kernel thus satisfies Eq. ([118](https://arxiv.org/html/2606.11319#A4.E118)), with $K^{(1)}_{**} > 0$ assured by $C_w > 0$.

This structure is inherited under the layer-to-layer recursion. The diagonal entries evolve according to $K^{(\ell+1)}_{\alpha\alpha} = C_b + C_w\langle\sigma^{2}\rangle_{K^{(\ell)}_{\alpha\alpha}}$, which depends on the layer-$\ell$ kernel only through the single diagonal entry $K^{(\ell)}_{\alpha\alpha}$; the two diagonal entries are therefore governed by the same map, so equal diagonals at layer $\ell$ yield equal diagonals at layer $\ell+1$. The off-diagonal entry evolves according to

$$
K^{(\ell+1)}_{*\chi} = C_b + C_w\langle\sigma_*\sigma_\chi\rangle_{K^{(\ell)}} \geq C_b \geq 0, \tag{124}
$$

where the inequality holds because the quantity $\langle\sigma_*\sigma_\chi\rangle_{K^{(\ell)}}$ is itself nonnegative: this is precisely Eq. ([121](https://arxiv.org/html/2606.11319#A4.E121)) with $h = \sigma$, applied to the layer-$\ell$ kernel, which by hypothesis has equal diagonals and a nonnegative off-diagonal entry. Both conditions Eq. ([118](https://arxiv.org/html/2606.11319#A4.E118)) are thus passed from layer $\ell$ to layer $\ell+1$, and since they hold at the first layer they hold at every layer up to $L$. The same reasoning applied to the NTK recursion of Eqs. (70) and (71), $\Theta^{(\ell+1)}_{*\chi} = K^{(\ell+1)}_{*\chi} + C_w\langle\sigma'_*\sigma'_\chi\rangle_{K^{(\ell)}}\Theta^{(\ell)}_{*\chi}$, beginning from $\Theta^{(1)}_{*\chi} = K^{(1)}_{*\chi} = C_b \geq 0$, shows that $\Theta^{(\ell)}_{*\chi} \geq 0$ throughout.
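The layer-to-layer argument can be traced numerically. The sketch below is an assumption-laden illustration: it picks a tanh activation, hypothetical values $C_b = 0.1$ and $C_w = 1.5$, and evaluates the Gaussian expectations with a tensor-product Gauss-Hermite rule. It iterates the kernel recursion of Eq. (71) together with the NTK recursion in the form of Eq. (70), starting from $K^{(1)}_{*\chi} = \Theta^{(1)}_{*\chi} = C_b$ and the common variance $C_b + C_w$.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

NODES, WEIGHTS = He.hermegauss(60)
WEIGHTS = WEIGHTS / np.sqrt(2 * np.pi)  # normalize to a probability measure


def gauss_pair(f, var, cov):
    """E[f(z1) f(z2)] for zero-mean Gaussians with common variance var and covariance cov."""
    rho = np.clip(cov / var, -1.0, 1.0)
    x = np.sqrt(var) * NODES[:, None]
    y = np.sqrt(var) * (rho * NODES[:, None] + np.sqrt(1 - rho ** 2) * NODES[None, :])
    return float(np.sum(WEIGHTS[:, None] * WEIGHTS[None, :] * f(x) * f(y)))


def recurse(depth, c_b=0.1, c_w=1.5, sigma=np.tanh,
            dsigma=lambda z: 1.0 / np.cosh(z) ** 2):
    """Track (K_*chi, Theta_*chi) across layers; all constants are hypothetical."""
    var = c_b + c_w  # common diagonal entry at layer one (unit-variance inputs and noise)
    k = c_b          # off-diagonal kernel at layer one
    theta = c_b      # off-diagonal NTK at layer one
    history = [(k, theta)]
    for _ in range(depth - 1):
        ss = gauss_pair(sigma, var, k)                # <sigma_* sigma_chi>
        dd = gauss_pair(dsigma, var, k)               # <sigma'_* sigma'_chi>
        theta = c_b + c_w * (ss + dd * theta)         # NTK recursion
        k = c_b + c_w * ss                            # off-diagonal kernel recursion
        var = c_b + c_w * gauss_pair(sigma, var, var) # shared diagonal map
        history.append((k, theta))
    return history


history = recurse(depth=5)
```

In this setting every entry of `history` should satisfy $K^{(\ell)}_{*\chi} \geq C_b$ and $\Theta^{(\ell)}_{*\chi} \geq 0$, up to quadrature error.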

With the common-variance and nonnegative-correlation properties secured at every layer, the Gaussian expectations $\langle h_{*}h_{\chi}\rangle_{K^{(\ell)}}$, for $h=\sigma,\sigma^{\prime},\sigma^{\prime\prime}$ alike, are all nonnegative, as are $K^{(\ell)}_{*\chi}$ and $\Theta^{(\ell)}_{*\chi}$. Every term of the recursion for $T^{(L)}_{*\chi}$ is then a sum of products of nonnegative quantities, so that

$$T^{(L)}_{*\chi}\geq 0,\qquad\text{and hence}\qquad B^{(L)}\geq 0. \tag{125}$$

The centroid term in Eq. ([13](https://arxiv.org/html/2606.11319#S3.E13)) therefore carries a nonnegative coefficient under our normalization convention and unit-variance noise models.

## Appendix E Empirical validation of the centroid model for additive-Gaussian noise

We now show that the same high-noise picture extends to additive-Gaussian corruption. In the insets of Fig. [7](https://arxiv.org/html/2606.11319#A5.F7), we compare the ensemble-averaged MLP logits on the full MNIST, FashionMNIST, and KMNIST test sets with the prediction obtained from the mean of Eq. ([16](https://arxiv.org/html/2606.11319#S3.E16)) for noise strength $p=0.97$. Using a global fit for the coefficients $a$ and $b$, we observe excellent agreement. This indicates that, once averaged over noise realizations, the network outputs are well described by the mean centroid-based predictor in the additive-Gaussian setting as well. The main panels of Fig. [7](https://arxiv.org/html/2606.11319#A5.F7) compare the clean-test accuracy of the empirical ensemble against that of the full effective model in Eq. ([16](https://arxiv.org/html/2606.11319#S3.E16)), including the fluctuation term set by the fitted coefficient $c$. The close match shows that the centroid description accounts not only for the average logits, but also for the noise-induced variations that determine classification performance. In particular, it reproduces the decline in clean-test accuracy as $p$ increases. The figure also shows where the empirical networks begin to deviate from the leading-order centroid prediction, indicating the growing importance of higher-order terms in $(1-p)$. In our experiments, these deviations become visible for $p\lesssim 0.9$.

![Refer to caption](https://arxiv.org/html/2606.11319v1/x7.png)

Figure 7: Numerical verification of the high-noise theory. All networks used to create the data in this figure are width-2048, 3-layer $\mathrm{erf}$ MLPs, trained with MSE loss on noisy MNIST, FashionMNIST, and KMNIST with $N=4000$. Main panel: mean clean-test accuracy versus additive-Gaussian-noise strength $p$ for a nested ensemble of 20 noisy training sets and 10 MLPs per noisy set. Error bars show one standard deviation. The blue curve is the mean clean-test accuracy of the fitted effective model in Eq. ([16](https://arxiv.org/html/2606.11319#S3.E16)). Inset: output-level comparison at $p=0.97$, using a nested ensemble of 100 noisy training sets and 50 MLPs per noisy set. Each point corresponds to one clean test image and one candidate class, comparing the ensemble-mean network output with the fitted prediction from Eq. ([16](https://arxiv.org/html/2606.11319#S3.E16)). The black dashed line in the inset indicates perfect agreement.

We next examine whether the scaling relation predicted in Eq. ([17](https://arxiv.org/html/2606.11319#S3.E17)) is reflected in the test accuracy. As shown in Fig. [8](https://arxiv.org/html/2606.11319#A5.F8), for MNIST in the high-noise regime $p=0.97$, the test accuracy increases systematically with both $N$ and $d$, consistent with the proposed scaling.

![Refer to caption](https://arxiv.org/html/2606.11319v1/x8.png)

Figure 8: Empirical test of the scaling suggested by Eq. ([17](https://arxiv.org/html/2606.11319#S3.E17)). Each curve shows the mean clean-test accuracy of an ensemble of width-2048, 3-layer $\mathrm{erf}$ MLPs trained with MSE loss on MNIST with additive-Gaussian noise at $p=0.97$, as a function of training-set size $N$ for fixed feature dimension $d$. Error bars show one standard deviation. Larger values of $d$ are obtained by resizing the original MNIST images using bilinear interpolation.

## Appendix F Dependence of test accuracy on activation

In this appendix we discuss how the high-noise prediction accuracy depends on the activation function. For analytical simplicity, we specialize the general result of Appendix [C](https://arxiv.org/html/2606.11319#A3) to additive-Gaussian noise. This choice makes the fluctuation covariance particularly transparent, while still capturing the qualitative mechanism expected for general i.i.d. corruption models: the class-dependent signal is controlled by the centroid overlap derived in Eq. ([115](https://arxiv.org/html/2606.11319#A3.E115)), whereas the activation function enters only through architecture-dependent constants.

We use the same data normalization convention as in Appendix [C](https://arxiv.org/html/2606.11319#A3),

$$\frac{1}{d}\sum_{j=1}^{d}x_{j\alpha}=0,\qquad\frac{1}{d}\sum_{j=1}^{d}x_{j\alpha}^{2}=1. \tag{126}$$

Under this normalization, the pure-noise training-training and test-training kernels coincide at the level relevant for this calculation. Thus

$$S_{\chi*}^{1\rightarrow L}=S_{\chi\chi}^{1\rightarrow L}=S,\qquad T_{*\chi}^{1\rightarrow L}=T_{\chi\chi}^{1\rightarrow L}=T,\qquad\theta_{*}=\theta_{o}. \tag{127}$$

For additive Gaussian noise, substituting the corresponding first-layer derivatives into the general high-noise output formula, Eq. ([103](https://arxiv.org/html/2606.11319#A3.E103)), gives

$$\begin{aligned}
f_{i}(x_{*};\epsilon)={}&\frac{\theta_{o}\psi}{C}+\sqrt{\frac{N/C}{d}}\,\frac{\sqrt{V_{d}}}{\theta_{d}-\theta_{o}}\,\hat{\xi}_{i}\\
&+C_{w}\frac{\epsilon N/C}{\theta_{d}-\theta_{o}}\Bigg(-2S(1-\theta_{o}\psi)(1-2\theta_{o}\psi)+T\,\Delta\mathrm{E}^{i}_{\alpha}\left[\frac{1}{d}\sum_{j=1}^{d}x_{j*}x_{j\alpha}\right]\Bigg)\\
&+\mathcal{O}(\epsilon N^{0})+\mathcal{O}(\epsilon^{2})+\mathcal{O}(N^{0}/\sqrt{d})+\mathcal{O}(\epsilon N/\sqrt{d})+\mathcal{O}(N/d),
\end{aligned} \tag{128}$$

where

$$\begin{aligned}
V_{d}=C_{w}^{2}\Bigg[&2S^{2}(1-\theta_{o}\psi)^{2}\left(1-\frac{4}{C}\theta_{o}\psi(1-\theta_{o}\psi)\right)\\
&+T^{2}\left(1-\frac{1}{C}\theta_{o}\psi(2-\theta_{o}\psi)\right)\Bigg].
\end{aligned} \tag{129}$$

Here $\hat{\xi}_{i}$ denotes the Gaussian fluctuation described in Appendix [C](https://arxiv.org/html/2606.11319#A3), and $\Delta\mathrm{E}^{i}_{\alpha}[\cdot]$ is the class-centered average defined in Eq. ([101](https://arxiv.org/html/2606.11319#A3.E101)). The only term in Eq. ([128](https://arxiv.org/html/2606.11319#A6.E128)) with classifying power is the centroid-overlap term

$$\Delta\mathrm{E}^{i}_{\alpha}\left[\frac{1}{d}\sum_{j=1}^{d}x_{j*}x_{j\alpha}\right]. \tag{130}$$

All remaining terms are either class-independent shifts or random fluctuations. Under the normalization and unit-variance noise conventions used here, the coefficient multiplying this centroid term has the correct nonnegative orientation; equivalently, $T\geq 0$, as shown in Appendix [D](https://arxiv.org/html/2606.11319#A4). Thus different activations change the magnitude of the signal and the fluctuation scale, but they do not reverse the orientation of the leading centroid decision rule in the setting considered here.
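To make the centroid-overlap term concrete, here is a minimal sketch of the decision rule it induces, in plain Python. It assumes that the class-centered average $\Delta\mathrm{E}^{i}_{\alpha}[\cdot]$ is the mean over training points of class $i$ minus the mean over all training points; the actual definition is Eq. (101), which is not reproduced in this appendix, so this reading is an assumption. The function names and the toy data are illustrative only.

```python
def overlap(x_test, x_train):
    """Normalized overlap (1/d) * sum_j x_{j*} * x_{j alpha}."""
    d = len(x_test)
    return sum(a * b for a, b in zip(x_test, x_train)) / d


def centroid_scores(x_test, train_x, train_y, num_classes):
    """Class-centered mean overlap of x_test with each class.

    ASSUMPTION: "class-centered" means (class mean) - (global mean).
    Every class is assumed to have at least one training point.
    """
    overlaps = [overlap(x_test, x) for x in train_x]
    global_mean = sum(overlaps) / len(overlaps)
    scores = []
    for c in range(num_classes):
        in_class = [o for o, y in zip(overlaps, train_y) if y == c]
        scores.append(sum(in_class) / len(in_class) - global_mean)
    return scores


def predict(x_test, train_x, train_y, num_classes):
    """Assign x_test to the class with the largest centered overlap."""
    scores = centroid_scores(x_test, train_x, train_y, num_classes)
    return max(range(num_classes), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy two-class data, invented for illustration.
    train_x = [
        [1.0, -1.0, 1.0, -1.0],
        [1.0, -1.0, 0.5, -0.5],
        [-1.0, 1.0, -1.0, 1.0],
        [-0.5, 0.5, -1.0, 1.0],
    ]
    train_y = [0, 0, 1, 1]
    x_star = [0.9, -1.1, 1.0, -0.8]
    print(centroid_scores(x_star, train_x, train_y, 2))
    print(predict(x_star, train_x, train_y, 2))
```

Because the global mean is subtracted from every class score, it shifts all classes equally and does not affect the arg-max; with a nonnegative coefficient $T$, the rule reduces to picking the class whose training mean overlaps most with the test point.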

For fixed data and noise level, activation-dependent robustness is therefore governed by the noise-to-signal ratio between the fluctuation scale and the coefficient of this centroid term. Equivalently, maximizing prediction accuracy corresponds to minimizing

$$\frac{V_{d}}{T^{2}}. \tag{131}$$

Since the optimal activation should not depend sensitively on the number of classes, it is useful to examine the large-$C$ limit. In this limit,

$$V_{d}=C_{w}^{2}\left[2S^{2}(1-\theta_{o}\psi)^{2}+T^{2}\right]. \tag{132}$$

Thus, at large $C$, the excess fluctuation beyond the unavoidable $T$ contribution is controlled by

$$S^{2}(1-\theta_{o}\psi)^{2}. \tag{133}$$

Consequently, for the generic case in which the pure-noise off-diagonal NTK is nonzero, the large-$N$ relation $\theta_{o}\psi=1+\mathcal{O}(N^{-1})$ implies that this term is suppressed by $\mathcal{O}(N^{-2})$. Thus, a broad class of activations should have nearly identical signal-to-noise ratios, and therefore similar test accuracy. In other words, the leading high-noise classifier is not strongly controlled by the detailed choice of activation; the activation mainly changes subleading architecture-dependent prefactors multiplying the same centroid rule.
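The suppression argument can be illustrated by evaluating Eq. (129) directly. The sketch below is a hedged illustration: the values of $S$, $T$, and $C_{w}$ are hypothetical placeholders (the true ones come from kernel recursions not reproduced here), and `theta_psi` stands for the product $\theta_{o}\psi$.

```python
def fluctuation_variance(S, T, theta_psi, num_classes, c_w):
    """V_d as written in Eq. (129), with theta_psi = theta_o * psi."""
    one_minus = 1.0 - theta_psi
    s_part = (
        2.0 * S**2 * one_minus**2
        * (1.0 - (4.0 / num_classes) * theta_psi * one_minus)
    )
    t_part = T**2 * (1.0 - theta_psi * (2.0 - theta_psi) / num_classes)
    return c_w**2 * (s_part + t_part)


def noise_to_signal(S, T, theta_psi, num_classes, c_w):
    """The ratio V_d / T^2 from Eq. (131)."""
    return fluctuation_variance(S, T, theta_psi, num_classes, c_w) / T**2


if __name__ == "__main__":
    # Hypothetical constants; only the trend with theta_psi matters here.
    for theta_psi in (0.5, 0.9, 0.99, 1.0):
        ratios = [
            noise_to_signal(S, 0.5, theta_psi, num_classes=10, c_w=1.2)
            for S in (0.1, 1.0, 3.0)
        ]
        print(theta_psi, ratios)
```

At `theta_psi = 1` the $S$-dependent part of Eq. (129) vanishes and the ratio collapses to $C_{w}^{2}(1-1/C)$ for any $S$ and $T$, which is the sense in which activation-specific constants drop out of the leading noise-to-signal ratio.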

![Refer to caption](https://arxiv.org/html/2606.11319v1/x9.png)

Figure 9: Mean test accuracy, with standard deviations shown as error bars, for an ensemble of 50 independent 3-layer, width-2048 neural networks with four different activations: erf, GELU, Swish, and Tanh. The networks are trained with MSE loss on a noisy subset of 4,000 MNIST data points corrupted with additive-Gaussian noise.

Figure [9](https://arxiv.org/html/2606.11319#A6.F9) tests this prediction numerically. We compare four activations: erf, GELU, Swish, and Tanh. The resulting test-accuracy curves are very similar across the full range of noise probabilities $p$. Aside from a modest advantage for Tanh and Swish at the highest noise levels, all four activations attain comparable performance and display nearly the same dependence on $p$.

## Appendix G Explanation of non-monotonic test accuracy variance

Here we show that the non-monotonic dependence of the test-accuracy variance on the noise probability $p$ follows directly from the general form of the output logits in Eq. ([13](https://arxiv.org/html/2606.11319#S3.E13)). For clarity, we consider the following toy model. The logit of the correct class is

$$l_{*}=\hat{\xi}_{*}+a(1-p), \tag{134}$$

where $\hat{\xi}_{*}\sim\mathcal{N}(0,1)$ and $a>0$ is a constant. The logits of the $C-1$ incorrect classes are

$$l_{i}=\hat{\xi}_{i},\qquad i=1,\dots,C-1, \tag{135}$$

where each $\hat{\xi}_{i}\sim\mathcal{N}(0,1)$. We assume all Gaussian variables are independent. This captures the essential structure of Eq. ([13](https://arxiv.org/html/2606.11319#S3.E13)): the correct class receives a positive deterministic offset, reflecting its larger centroid overlap, while the incorrect classes do not. For simplicity, we take this offset to be $a(1-p)$ with constant $a$, and we normalize the Gaussian fluctuations to zero mean and unit variance.

For a test input $x_{*}$, define the correctness indicator

$$\mathrm{Acc}(x_{*})=1\!\left\{l_{*}(x_{*})>\max_{1\leq i\leq C-1}l_{i}(x_{*})\right\}. \tag{136}$$

The population mean test accuracy is therefore

$$\mathbb{E}[\mathrm{Acc}]=\mathbb{E}_{x_{*}}\!\left[1\!\left\{l_{*}(x_{*})>\max_{1\leq i\leq C-1}l_{i}(x_{*})\right\}\right]. \tag{137}$$

Since $\mathrm{Acc}(x_{*})$ is an indicator variable, $\mathrm{Acc}(x_{*})^{2}=\mathrm{Acc}(x_{*})$, and thus its population variance is

$$\mathbb{V}\mathrm{ar}[\mathrm{Acc}]=\mathbb{E}[\mathrm{Acc}]-\mathbb{E}[\mathrm{Acc}]^{2}=\mathbb{E}[\mathrm{Acc}]\,\bigl(1-\mathbb{E}[\mathrm{Acc}]\bigr). \tag{138}$$

Hence the variance is maximized when $\mathbb{E}[\mathrm{Acc}]=1/2$, so it is necessarily a non-monotonic function of the mean accuracy.

It therefore remains only to show that the mean accuracy is monotonic in $p$. Conditioning on the correct-class noise $\hat{\xi}_{*}=u$, the probability that the correct logit exceeds all $C-1$ incorrect logits is

$$\Phi\bigl(u+a(1-p)\bigr)^{C-1}. \tag{139}$$

Averaging over $u\sim\mathcal{N}(0,1)$ gives

$$\mathbb{E}[\mathrm{Acc}]=\int_{-\infty}^{\infty}du\,\phi(u)\,\Phi\bigl(u+a(1-p)\bigr)^{C-1}, \tag{140}$$

where $\phi$ and $\Phi$ are the standard normal probability density function (PDF) and cumulative distribution function (CDF), respectively. Because $\Phi$ is strictly increasing and the offset $a(1-p)$ decreases with $p$, this expression is monotonically decreasing in $p$. It follows that $\mathbb{V}\mathrm{ar}[\mathrm{Acc}]$, being of the form $m(1-m)$ with $m=\mathbb{E}[\mathrm{Acc}]$, is generically non-monotonic in $p$: it rises while $m$ falls toward $1/2$ and declines once $m$ drops below $1/2$. The crossing occurs whenever $m>1/2$ at small $p$, since at $p=1$ the offset vanishes and symmetry among the $C$ i.i.d. logits gives $m=1/C$, which is below $1/2$ for $C\geq 3$. This proves the claim.

![Refer to caption](https://arxiv.org/html/2606.11319v1/x10.png)

Figure 10: Standard deviation of the test accuracy of the toy model versus the noise probability $p$. We set the constants $a=10$ and $C=10$.

Figure [10](https://arxiv.org/html/2606.11319#A7.F10) shows the predicted standard deviation of the test accuracy in this toy model as a function of $p$ for $a=10$ and $C=10$. The non-monotonic behavior is clearly reproduced, showing that it arises directly from the competition between a decreasing correct-class offset and Gaussian logit fluctuations.
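The toy-model curve can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the authors' plotting code: it evaluates the integral in Eq. (140) with a simple trapezoidal rule on a truncated interval, and the integration bounds, step count, and function names are choices made here for illustration. The defaults $a=10$ and $C=10$ follow the figure.

```python
import math


def std_normal_pdf(u):
    """Standard normal density phi(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(u):
    """Standard normal CDF Phi(u), via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def mean_accuracy(p, a=10.0, num_classes=10, lo=-10.0, hi=10.0, steps=4000):
    """Trapezoidal approximation of Eq. (140) on [lo, hi]."""
    shift = a * (1.0 - p)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        u = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += (
            weight
            * std_normal_pdf(u)
            * std_normal_cdf(u + shift) ** (num_classes - 1)
        )
    return total * h


def accuracy_std(p, **kwargs):
    """sqrt(m * (1 - m)) with m = E[Acc], following Eq. (138)."""
    m = mean_accuracy(p, **kwargs)
    return math.sqrt(max(m * (1.0 - m), 0.0))  # guard tiny negative round-off


if __name__ == "__main__":
    for i in range(11):
        p = i / 10
        print(f"p={p:.1f}  mean={mean_accuracy(p):.4f}  std={accuracy_std(p):.4f}")
```

Scanning $p$ from 0 to 1 with these defaults, the mean accuracy falls monotonically toward $1/C$ while the standard deviation first rises and then falls, peaking where the mean accuracy passes through $1/2$.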
