Do Deep Networks Forget Initialization? A Forgetting-Time View of Practical Inductive Bias
Summary
This paper introduces the concept of 'initialization memory' to study how much of the random initialization bias survives training in deep networks, showing that low-learning-rate SGD preserves initialization while Adam-family optimizers erase it, and linking this to forgetting dynamics.
# Do Deep Networks Forget Initialization? A Forgetting-Time View of Practical Inductive Bias
Source: [https://arxiv.org/html/2605.29152](https://arxiv.org/html/2605.29152)
Mohua Das (MIT), Pierfrancesco Beneventano∗ (MIT), Shibshankar Dey (Northwestern University), Gareth H. McKinley (MIT), Tomaso Poggio (MIT)
###### Abstract
Randomly initialized neural networks induce a prior over functions, but the predictor used in practice is produced only after training. We ask how much of this initial bias survives the training pipeline. To make the question measurable, we introduce *initialization memory*: the dependence of the validation-selected predictor on the scale of the random initialization. We perform controlled CIFAR-10 experiments on ResNets where initialization memory already sharply separates training regimes. Low-learning-rate SGD can interpolate while still remembering its initialization: on ResNet-9 with batch size $b=128$, test accuracy varies by $26.5$ percentage points across initialization scales despite $\geq 99.5\%$ training accuracy. This is not undertraining: extending the same low-learning-rate regime to $5{,}000$ epochs leaves the spread essentially unchanged. In contrast, Adam-family methods largely erase the dependence. SGD can also be made to forget when larger learning rates are paired with explicit $L_2$ norm control. We interpret these findings in terms of the time scale of forgetting: gradient-flow-like dynamics can preserve initialization memory, whereas stochastic finite-step effects, explicit norm decay, and adaptive preconditioning erase it on scales governed by the size of explicit or implicit regularization. The practical inductive bias of a trained network is therefore not the architectural prior alone, but the architectural prior after being filtered by the forgetting dynamics of the training pipeline; and the same regularizers that improve generalization are precisely those that erase memory of initialization.
## 1 Introduction
#### The literature on initialization\.
Modern neural networks are too expressive a priori for performance to be explained by capacity alone\. Thus, the relevant object is not merely the hypothesis class, but the training pipeline that selects one function from it\[[1](https://arxiv.org/html/2605.29152#bib.bib1)\]\. In practice, this pipeline includes data preprocessing, architecture, initialization, optimizer, batching, explicit regularization, and training time\. Understanding performance, therefore, requires understanding not only the prior induced by the architecture, but also how training transforms that prior\. This article investigates a precise subquestion:
What is the role of initialization in explaining performance?
Initialization is a natural place to look for such bias. From a dynamical systems perspective, the initial condition, in the absence of regularization, determines which region of parameter space the trajectory explores (basin of attraction) and which solution is ultimately selected \[[2](https://arxiv.org/html/2605.29152#bib.bib2),[3](https://arxiv.org/html/2605.29152#bib.bib3)\]. In simplified linear and homogeneous networks, initialization is also known to control the implicit bias of gradient-based training \[[4](https://arxiv.org/html/2605.29152#bib.bib4),[5](https://arxiv.org/html/2605.29152#bib.bib5)\]. A line of work studies the function prior of random networks: before seeing labels, architectures and initialization schemes assign much higher probability to some functions than to others, often favoring simple ones \[[6](https://arxiv.org/html/2605.29152#bib.bib6),[7](https://arxiv.org/html/2605.29152#bib.bib7),[8](https://arxiv.org/html/2605.29152#bib.bib8)\]. Mingard et al. \[[7](https://arxiv.org/html/2605.29152#bib.bib7)\] distinguish the first-order question of why overparameterized DNNs generalize at all from the second-order question of how to further improve the performance of already-generalizing models. Our focus is on one concrete part of that bridge: whether the simplicity bias present at initialization survives training strongly enough to remain visible in the final predictor and in practical performance. Answering this requires studying (i) what happens after optimization begins and (ii) whether practical performance is affected by these geometric biases. The simplicity bias present at initialization may be preserved, distorted, or erased by the subsequent training dynamics. Thus, the practical question is not only whether random networks have a simplicity bias, but whether that bias remains visible in the predictor selected by a modern training pipeline. The question, given these works, becomes:
When does training remember initialization’s bias, when does it forget it, and on what time scale? How does this geometric simplicity bias resolve in practical performance?
In apparent contrast with the literature above, a different intuition is common in large-scale model training. There, initialization is often treated less as a source of final inductive bias and more as a mechanism for stabilizing optimization: preventing exploding or vanishing signals, enabling depth, and making training predictable at scale. This perspective underlies variance-preserving initialization schemes, random-matrix and dynamical-mean-field analyses of signal propagation \[[9](https://arxiv.org/html/2605.29152#bib.bib9),[10](https://arxiv.org/html/2605.29152#bib.bib10),[11](https://arxiv.org/html/2605.29152#bib.bib11),[12](https://arxiv.org/html/2605.29152#bib.bib12),[13](https://arxiv.org/html/2605.29152#bib.bib13),[14](https://arxiv.org/html/2605.29152#bib.bib14),[15](https://arxiv.org/html/2605.29152#bib.bib15),[16](https://arxiv.org/html/2605.29152#bib.bib16),[17](https://arxiv.org/html/2605.29152#bib.bib17)\], and modern parameterization theory such as $\mu$P \[[18](https://arxiv.org/html/2605.29152#bib.bib18),[19](https://arxiv.org/html/2605.29152#bib.bib19),[20](https://arxiv.org/html/2605.29152#bib.bib20),[21](https://arxiv.org/html/2605.29152#bib.bib21),[22](https://arxiv.org/html/2605.29152#bib.bib22),[23](https://arxiv.org/html/2605.29152#bib.bib23)\]. From this viewpoint, improvements attributed to initialization may come primarily from making training possible or stable, rather than from a persistent preference among functions.
Recent evidence that random seeds and initialization can affect language\-model training—both during fine\-tuning\[[24](https://arxiv.org/html/2605.29152#bib.bib24)\]and during pretraining\[[25](https://arxiv.org/html/2605.29152#bib.bib25),[26](https://arxiv.org/html/2605.29152#bib.bib26),[27](https://arxiv.org/html/2605.29152#bib.bib27),[28](https://arxiv.org/html/2605.29152#bib.bib28)\]—further sharpens the timeline of the issue\. Related controlled studies of language\-model training pipelines and architecture choices reinforce that large\-scale behavior is shaped by more than the architecture alone\[[29](https://arxiv.org/html/2605.29152#bib.bib29),[30](https://arxiv.org/html/2605.29152#bib.bib30)\]\.
Figure 1: SGD remembers initialization; Adam-family methods forget. ResNet-9 under a shared low-learning-rate training procedure. Each curve shows the mean over $n=10$ seeds; shaded bands indicate the $10^{\mathrm{th}}$ to $90^{\mathrm{th}}$ percentile range. SGD interpolates, but its generalization gap grows with $\sigma_w$. Adam, AdamW, and Muon show substantially weaker dependence on $\sigma_w$. The norm panels show radial memory: SGD retains sensitivity to the initial norm (dashed line), whereas adaptive methods converge toward a common final norm scale. Top row: $b=16$; bottom row: $b=128$.
#### Our contribution\.
As we argue above, there is substantial work showing that random networks and SGD-trained networks are biased toward simple functions, and there is separate work showing that normalization, SGD noise, regularization, and finite-step discretization strongly alter optimization trajectories \[[31](https://arxiv.org/html/2605.29152#bib.bib31),[32](https://arxiv.org/html/2605.29152#bib.bib32),[33](https://arxiv.org/html/2605.29152#bib.bib33),[34](https://arxiv.org/html/2605.29152#bib.bib34),[35](https://arxiv.org/html/2605.29152#bib.bib35)\] (further related work in Appendix [H](https://arxiv.org/html/2605.29152#A8)). What is still comparatively under-articulated is when initialization-induced simplicity survives training and when it is forgotten. That is the precise gap our paper occupies.
This article is an empirical study in a deliberately controlled setting: ResNets on CIFAR-10. We perform extensive ablations across initialization scales, optimizers, batch sizes, depths, training horizons, and explicit regularization. We introduce initialization memory as our diagnostic: the extent to which the predictor returned by a training pipeline still depends on the initialization scale $\sigma_w$.
1. A controlled phase diagram of initialization-scale memory. We show that training procedures differ sharply in whether the returned predictor still depends on initialization scale.
   - Low-learning-rate SGD can interpolate while retaining large initialization-scale memory: for ResNet-9 at $b=128$, test accuracy varies by $26.5$ percentage points across $\sigma_w$ despite $\geq 99.5\%$ training accuracy.
   - In the same diagnostic grid, Adam, AdamW, and Muon instead largely erase this dependence. Hyperparameter ablations and depth stress tests show that the failure mode changes across regimes: poor forgetting can appear either as interpolation with poor generalization in shallower networks, or as degraded trainability in deeper ones.
2. A separation between interpolation, training horizon, and initialization-scale forgetting. We show that erasing initialization-scale dependence is not implied by fitting the labels or by extending the same low-learning-rate dynamics. A $5{,}000$-epoch low-LR SGD control leaves the initialization-scale spread essentially unchanged. SGD can nevertheless be made to forget when the recipe supplies larger effective movement (larger implicit regularization) or explicit regularization such as weight decay, showing that forgetting is a property of the full training recipe rather than the optimizer name. In particular, we show that the training procedure forgets more with a larger learning rate and regularization, or a smaller batch size.
3. A forgetting-timescale mechanism. We organize the results by cumulative optimizer clocks:

   $$\mathcal{T}_{\mathrm{SGD}}=\frac{1}{b}\sum_{k < K}\eta_k^{2},\qquad \mathcal{T}_{L_2}=\lambda\sum_{k < K}\eta_k,\qquad \mathcal{T}_{\mathrm{adapt}}=\sum_{k < K}\eta_k.$$

   These time scales track stochastic finite-step effects, explicit norm decay, and adaptive preconditioning. They explain why epoch count and interpolation time are not reliable proxies for erasing initialization-scale dependence, and are supported by the repair map, sensitivity-collapse plots, and minimal conservation-law model.
Importantly, we see that the main takeaways align with those of the \(implicit\) regularization community:
Gradient\-flow\-like training can remember initialization; stochasticity, adaptivity, and norm control erase it on optimizer\-dependent time scales\.
Figure 2: Large-batch fixed-epoch regimes forget initialization more slowly. ResNet-9 test accuracy at $\tau_{\mathrm{best}}$, averaged over $n=10$ seeds. Each panel corresponds to one batch size $b\in\{16,32,64,128,256\}$; within each panel, rows are optimizers and columns are initialization scales. Cells marked $\times$ did not reach $99.5\%$ mean training accuracy at the best-validation-loss checkpoint $\tau_{\mathrm{best}}$. SGD shows a strong left-to-right degradation that becomes more pronounced at larger batch sizes; adaptive methods remain much flatter.
## 2 Experimental Design
We use the initialization scale as a controlled perturbation of the training pipeline. For each optimizer–batch-size pair, we sweep the global scale $\sigma_w$ of the random kernel initialization while holding all other choices fixed. If the predictor returned by the same training and checkpoint-selection rule varies across $\sigma_w$, the procedure has retained initialization-scale memory; if the dependence is small, this component of the initial condition has been erased.
Formally, let $\Sigma$ be the finite set of scales in the sweep and let $\mathcal{R}$ denote the full training procedure: architecture, data split, optimizer, batch size, schedule, regularization, horizon, and checkpoint rule. Let $A_{\mathcal{R},K}(\sigma,s)$ be the predictor returned after $K$ updates from scale $\sigma$ and seed $s$. For metric $m$, define
$$\mathsf{Mem}_{m}^{\mathrm{ret}}(\mathcal{R},K;\Sigma)=\max_{\sigma\in\Sigma}\mathbb{E}_{s}\left[m\left(A_{\mathcal{R},K}(\sigma,s)\right)\right]-\min_{\sigma\in\Sigma}\mathbb{E}_{s}\left[m\left(A_{\mathcal{R},K}(\sigma,s)\right)\right].$$

In the main experiments, $m$ is the test accuracy, and the returned predictor is the best-validation-loss checkpoint.
Diagnostic. Initialization-scale memory is the finite-horizon dependence of the returned predictor on one controlled initialization perturbation, $\sigma_w$. It is a property of the full training-and-selection procedure.
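The definition above reduces to a max-minus-min over seed-averaged metrics. A minimal sketch of that computation follows; the function name `initialization_memory` and the toy accuracies are illustrative assumptions, not code or data from the paper.

```python
from statistics import mean


def initialization_memory(metric_by_scale):
    """Spread of the seed-averaged metric across initialization scales.

    metric_by_scale maps each sigma in the sweep to a list of per-seed
    metric values (for example, test accuracy at the returned checkpoint).
    """
    seed_means = {sigma: mean(values) for sigma, values in metric_by_scale.items()}
    return max(seed_means.values()) - min(seed_means.values())


# Hypothetical per-seed test accuracies in percent (NOT results from the paper).
toy_results = {
    0.1: [85.0, 84.0, 86.0],
    1.0: [70.0, 72.0, 71.0],
    2.5: [58.0, 60.0, 59.0],
}

memory_pp = initialization_memory(toy_results)
print(f"initialization memory: {memory_pp:.1f} pp")
```

The expectation over seeds is taken before the max and min, so seed-to-seed noise at a single scale does not by itself inflate the spread.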
#### Controlled setting\.
We use CIFAR-10 BatchNorm ResNets. The main ResNet-9 grid fixes the data split, architecture, learning-rate schedule, and training horizon, and varies only initialization scale, optimizer, and batch size. We sweep $\sigma_w\in\{0.10,0.20,\ldots,2.50\}$, optimizers SGD, SGD with momentum, Adam, AdamW, and Muon, and $b\in\{16,32,64,128,256\}$, with $10$ seeds per configuration. All main-grid optimizers use $\eta_0=10^{-3}$ with cosine decay for $300$ epochs. Ablations over learning rate, $L_2$ regularization, training length, normalization, augmentation, and depth (ResNet-56, ResNet-110, R9-AvgPool) are reported in Appendices [4](https://arxiv.org/html/2605.29152#A3.T4)–[E](https://arxiv.org/html/2605.29152#A5); full hyperparameters are in Appendix [A](https://arxiv.org/html/2605.29152#A1).
Figure 3: Interpolation is not forgetting. (a,b) Interpolation epoch $\tau_{\mathrm{interp}}$ versus $\sigma_w$. SGD’s interpolation time grows sharply with $\sigma_w$, especially at large batch sizes; Adam’s is nearly flat. (c,d) Repair gap $\Delta_{\mathrm{repair}}=\mathrm{ValAcc}_{\tau_{\mathrm{best}}}-\mathrm{ValAcc}_{\tau_{\mathrm{interp}}}$. At large batch ($b\geq 128$) Adam-family methods continue to gain validation accuracy after interpolation ($\tau_{\mathrm{best}}\gg\tau_{\mathrm{interp}}$, $\Delta_{\mathrm{repair}}>0$). At small batch ($b\leq 64$) they reach $\tau_{\mathrm{best}}$ *before* interpolating, so the formal repair gap can be negative even though validation accuracy keeps improving (see Appendix [B](https://arxiv.org/html/2605.29152#A2)). Vanilla SGD’s repair gap stays near zero: validation accuracy is flat between $\tau_{\mathrm{best}}$ and $\tau_{\mathrm{interp}}$, even as both grow with $\sigma_w$.
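To make the size of the main grid concrete, the sketch below enumerates the configurations described above. The dictionary keys and optimizer labels are hypothetical names chosen for illustration; the step of 0.10 in the scale sweep is read off the stated endpoints and the 25-point count quoted elsewhere in the paper.

```python
from itertools import product

# 25 initialization scales: 0.10, 0.20, ..., 2.50
sigma_grid = [i / 10 for i in range(1, 26)]
optimizers = ["sgd", "sgd_momentum", "adam", "adamw", "muon"]  # illustrative labels
batch_sizes = [16, 32, 64, 128, 256]
seeds = range(10)

# One entry per training run; lr0 and epochs are the shared main-grid settings.
runs = [
    {
        "sigma_w": sigma,
        "optimizer": opt,
        "batch_size": b,
        "seed": seed,
        "lr0": 1e-3,
        "epochs": 300,
    }
    for sigma, opt, b, seed in product(sigma_grid, optimizers, batch_sizes, seeds)
]

print(len(sigma_grid), "scales,", len(runs), "training runs in the main grid")
```

Only `sigma_w`, `optimizer`, and `batch_size` vary across configurations; the seed index repeats each configuration ten times.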
#### Metrics\.
We report test accuracy at the best-validation-loss checkpoint $\tau_{\mathrm{best}}$. We also track the interpolation epoch,

$$\tau_{\mathrm{interp}}=\min\{t:\mathrm{TrainAcc}_{t}\geq 99.5\%\},$$

the trainable-kernel Frobenius norm

$$\|W\|_{F}=\left(\sum_{\ell}\|W^{(\ell)}\|_{F}^{2}\right)^{1/2},$$

and the checkpoint repair gap,

$$\Delta_{\mathrm{repair}}=\mathrm{ValAcc}_{\tau_{\mathrm{best}}}-\mathrm{ValAcc}_{\tau_{\mathrm{interp}}}.$$
The interpolation epoch measures when the labels are fit\. The norm measures radial memory of the initial scale\. The repair gap measures how validation accuracy changes between interpolation and the selected checkpoint\.
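The three diagnostics can be read directly off per-epoch training logs. The following is a minimal sketch under stated assumptions: epochs are indexed from 0, accuracies are in percent, each layer's kernel is given as a flat list of floats, and all function names and log values are hypothetical.

```python
import math


def interpolation_epoch(train_acc, threshold=99.5):
    """First epoch whose training accuracy reaches the threshold, or None."""
    for epoch, acc in enumerate(train_acc):
        if acc >= threshold:
            return epoch
    return None


def best_val_loss_epoch(val_loss):
    """Epoch of the best-validation-loss checkpoint (tau_best)."""
    return min(range(len(val_loss)), key=val_loss.__getitem__)


def kernel_frobenius_norm(kernels):
    """Square root of the summed squared entries over all layer kernels."""
    return math.sqrt(sum(w * w for kernel in kernels for w in kernel))


def repair_gap(val_acc, val_loss, train_acc):
    """ValAcc at tau_best minus ValAcc at tau_interp (None if never interpolated)."""
    t_interp = interpolation_epoch(train_acc)
    if t_interp is None:
        return None
    return val_acc[best_val_loss_epoch(val_loss)] - val_acc[t_interp]


# Hypothetical five-epoch log, for illustration only.
train_acc = [60.0, 90.0, 99.6, 99.9, 100.0]
val_loss = [1.20, 0.80, 0.70, 0.60, 0.65]
val_acc = [55.0, 70.0, 74.0, 78.0, 77.0]

t_interp = interpolation_epoch(train_acc)
t_best = best_val_loss_epoch(val_loss)
gap = repair_gap(val_acc, val_loss, train_acc)
norm = kernel_frobenius_norm([[3.0, 4.0], [12.0]])
print(t_interp, t_best, gap, norm)
```

In this toy log the best checkpoint comes after interpolation, so the gap is positive; when `t_best` precedes `t_interp` the same formula can return a negative value, which is the small-batch Adam case discussed in the Figure 3 caption.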
## 3 Memory of Initialization: Optimizers and Hyperparameters
We now compare how much initialization-scale memory different procedures retain. The central contrast is that, under the shared low-learning-rate diagnostic procedure, vanilla SGD leaves $\sigma_w$ visible at the selected checkpoint, whereas Adam, AdamW, and Muon largely erase it. Two diagnostics refine this comparison: the kernel norm tests whether the radial scale has been overwritten, and the checkpoint analysis separates fitting the labels from forgetting the initial scale.
### 3.1 SGD vs. Adam and Muon
Figure [1](https://arxiv.org/html/2605.29152#S1.F1) exemplifies the basic optimizer contrast; Figure [2](https://arxiv.org/html/2605.29152#S1.F2) extends it to the full $\sigma_w\times b$ grid.
#### Generalization\.
Vanilla SGD remembers where it started: at $b=128$ it exceeds $99.5\%$ training accuracy across the entire $\sigma_w$ sweep, yet its test accuracy ranges from $85.0\%$ at $\sigma_w=0.1$ down to $58.6\%$ at $\sigma_w=2.5$, a $26.5$ pp spread between two runs that, by training accuracy alone, appear to have learned the same data. Momentum is only a partial cure: at small batch it reduces the spread substantially (from $22.8$ to $10.9$ pp at $b=16$), but at $b=128$ under the same low-LR procedure it barely helps.
Adam-family methods largely wash out initialization scale: with the same data, learning rate, and epoch budget, Adam, AdamW, and Muon have $b=128$ spreads of only $4.3$, $4.7$, and $4.0$ pp, respectively, and Adam’s $b=16$ spread is $1.6$ pp against SGD’s $22.8$ pp. The fact that *Adam itself* (no weight decay) matches AdamW and Muon (decoupled $\lambda=10^{-4}$) shows the effect is not driven by the optimizer-default weight decay (Appendix [A](https://arxiv.org/html/2605.29152#A1)).
#### Weight Norm\.
The norm panels in Figure [1](https://arxiv.org/html/2605.29152#S1.F1) (c,f) show the same distinction in parameter space. Under low-LR SGD, $\|W_{\tau_{\mathrm{best}}}\|_F$ remains coupled to $\|W_{\mathrm{init}}\|_F$: larger initial kernels lead to larger selected kernel norms. Adam, AdamW, and Muon instead move different initial scales toward a common norm range. We therefore read $\|W_{\tau_{\mathrm{best}}}\|_F$ as a radial memory diagnostic, not as a measure of total distance traveled.
#### The message\.
Forgetting initialization is therefore not the event of fitting the labels. Under low-LR SGD, interpolation is reached before radial memory has been erased, and little subsequent repair occurs. The behavior of the Adam family is batch-dependent: at large batch ($b\geq 128$) the best-validation-loss checkpoint is reached *after* interpolation, and the continued post-interpolation movement closes the $\sigma_w$ gap; at small batches ($b\leq 64$) the best-validation-loss checkpoint is reached *before* interpolation, i.e., the gap is already closed in the pre-interpolation phase (Figure [3](https://arxiv.org/html/2605.29152#S2.F3), Appendix [B](https://arxiv.org/html/2605.29152#A2)).
Message 1. Adam(W) and Muon largely reduce initialization memory. SGD generally retains initialization memory.
### 3.2 Dependence on Hyperparameters
In Figure[1](https://arxiv.org/html/2605.29152#S1.F1), we use a deliberately diagnostic low\-learning\-rate SGD baseline, not a tuned SGD procedure\. The natural objection is therefore simple: perhaps SGD only needs more time or better hyperparameters\. In this section, we analyze how forgetting the initialization depends on the hyperparameters\.
Figure [4](https://arxiv.org/html/2605.29152#S3.F4) tests this directly with a targeted ResNet-9 sweep over training length, learning rate, and explicit $L_2$ regularization. See Appendix [4](https://arxiv.org/html/2605.29152#A3.T4) for more details. Precisely, we see that:
Figure 4: What helps SGD forget initialization? Test accuracy vs. $\sigma_w$ for (a) $b=16$, and (b) $b=128$; inset shows the spread (pp) per configuration, sorted ascending. Long training alone leaves the curves nearly unchanged; larger learning rates and especially moderate LR with explicit $L_2$ sharply reduce the spread. Spreads use the $5$-point grid $\sigma_w\in\{0.1,0.5,1.0,1.8,2.5\}$ \[Table [4](https://arxiv.org/html/2605.29152#A3.T4)\]; these agree with the $25$-point spreads of §[3](https://arxiv.org/html/2605.29152#S3) to within $\sim 1$ pp.

- Weight decay and regularization. Explicit regularization is most effective once the learning rate is large enough to move the iterate substantially. With $\eta_0=10^{-2}$ and cosine decay, adding $\lambda_{L_2}=10^{-2}$ reduces the spread to $0.3$ pp at both $b=16$ and $b=128$ (Table [4](https://arxiv.org/html/2605.29152#A3.T4)).
- Learning rate. Increasing the SGD learning rate helps, especially at a small batch. At $b=16$, raising $\eta_0$ from $10^{-3}$ to $10^{-2}$ reduces the spread from $22.8$ to $9.7$ pp, and $\eta_0=10^{-1}$ reduces it to $4.5$ pp. At $b=128$, the same intervention is weaker: $\eta_0=10^{-2}$ still leaves $23.8$ pp, and $\eta_0=10^{-1}$ leaves $8.1$ pp. Large batches need stronger forgetting mechanisms.
- Batch size. Figure [2](https://arxiv.org/html/2605.29152#S1.F2) extends the comparison to the full $\sigma_w\times b$ grid. We find that large batch slows forgetting under a fixed-epoch budget. For SGD, initialization sensitivity remains large across the entire range of batch sizes, with spreads of $22.8$ pp at $b=16$, $26.5$ pp at $b=128$, and $25.7$ pp at $b=256$; as $b$ grows, the failure pattern also becomes more orderly and the absolute test-accuracy degradation more severe. Adaptive methods with the same low-LR training procedure still erase initialization memory more effectively than SGD, but their robustness weakens as the training regime becomes larger-batch and deeper. Adam, indeed, shows the same directional trend, but much more weakly: its spread increases from $1.6$ pp at $b=16$ to $5.3$ pp at $b=256$.
- Normalization and augmentation. Replacing BatchNorm with LayerNorm or adding standard data augmentation reduces the SGD spread but does not eliminate it; Adam remains robust under all these configurations (Appendix [E](https://arxiv.org/html/2605.29152#A5)).
#### The standard recipe fully erases memory\.
Under a well-tuned SGD recipe (lr $=0.1$, momentum $=0.9$, weight decay $=5\times 10^{-4}$, augmentation, 200 epochs), the $\sigma_w$ spread collapses to zero at $b=128$ ($|\Delta|=0.0$ pp) with $94.2\%$ test accuracy (Table [7](https://arxiv.org/html/2605.29152#A6.T7), Appendix [F](https://arxiv.org/html/2605.29152#A6)). In this setting, the regularizers that produce strong generalization also erase initialization memory.
#### Depth changes the failure mode\.
Figure [5](https://arxiv.org/html/2605.29152#S3.F5) extends the initialization sweep to ResNet-56, ResNet-110, and the pooling-matched R9-AvgPool control. The main effect is not a monotonic increase in spread with depth. At $b=128$, the SGD spread is $26.5$ pp for ResNet-9, $24.2$ pp for ResNet-56, and $20.4$ pp for ResNet-110. What changes more decisively is the absolute failure mode: ResNet-9 typically still interpolates and then generalizes poorly, whereas deeper ResNets increasingly fail to train well at large $\sigma_w$. Full numerical results (test and train accuracy, interpolation epochs) and the corresponding train accuracy figure \[Figure [7](https://arxiv.org/html/2605.29152#A4.F7)\] are provided in Appendix [D](https://arxiv.org/html/2605.29152#A4). The regime, therefore, shifts from memorization without good generalization to degraded trainability itself.
SGD can forget, but only when the training procedure makes it forget. The repair map changes the interpretation of the optimizer comparison. The message is not that Adam is intrinsically capable of forgetting while SGD is not. Rather, Adam forgets automatically in this grid, while SGD forgets only when its training pipeline supplies both movement and radial control. Initialization robustness is therefore a property of the full training pipeline, not of the optimizer name alone.
Message 2. Memory of initialization is, in general, a property of the whole training procedure, not of any single component. Larger depth and batch sizes, or smaller regularization or learning rate, increase the memory of initialization.
Figure 5: Depth makes poor forgetting dynamics more damaging; pooling is a partial confound. Test accuracy versus $\sigma_w$ for ResNet-9, R9-AvgPool, ResNet-56, and ResNet-110. Switching pooling strategy reduces ResNet-9 performance, but the 9-layer R9 remains more robust than ResNet-56, suggesting that depth and optimization difficulty contribute beyond pooling alone.
## 4 The Time Scale of Forgetting
Section [3](https://arxiv.org/html/2605.29152#S3) showed *what* is remembered and forgotten; this section asks *why*. The key idea is that epoch count is not the natural unit of forgetting: the relevant clock is the accumulated strength of the regularization, implicit or explicit, that acts along the trajectory.
### 4.1 The time scale is regularization
#### Backward error view\.
Backward error analysis interprets a discrete optimizer as following, to leading order, a nearby continuous dynamics \[[36](https://arxiv.org/html/2605.29152#bib.bib36),[37](https://arxiv.org/html/2605.29152#bib.bib37)\]. In ML, this viewpoint writes the discrete update as a gradient flow on a modified objective
$$\mathcal{L}_{\mathrm{mod}}\;=\;\mathcal{L}\;+\;\rho_{\mathcal{A}}(\eta,b,\lambda,\ldots)\,\mathcal{R}_{\mathcal{A}}\;+\;\text{higher-order terms},$$

where $\mathcal{R}_{\mathcal{A}}$ is the implicit or explicit regularizer associated with algorithm $\mathcal{A}$: implicit gradient regularization for finite-step GD \[[32](https://arxiv.org/html/2605.29152#bib.bib32)\], covariance/Fisher-type regularization for minibatch SGD \[[33](https://arxiv.org/html/2605.29152#bib.bib33),[34](https://arxiv.org/html/2605.29152#bib.bib34)\], momentum-amplified regularization \[[38](https://arxiv.org/html/2605.29152#bib.bib38)\], and geometry-dependent implicit bias for adaptive methods \[[39](https://arxiv.org/html/2605.29152#bib.bib39)\]. What matters is not $\rho_{\mathcal{A}}$ itself, but its accumulated strength along the trajectory: the events that erase initialization-dependent quantities are exactly the perturbation terms of $\mathcal{L}_{\mathrm{mod}}$ accumulating over many steps.
#### Forgetting timescales\.
For the mechanisms our experiments isolate, the leading cumulative scales along a schedule $(\eta_k)_{k < K}$ are
$$\mathcal{T}_{\mathrm{SGD}}=\frac{1}{b}\sum_{k < K}\eta_k^{2},\qquad \mathcal{T}_{L_2}=\lambda\sum_{k < K}\eta_k,\qquad \mathcal{T}_{\mathrm{adapt}}=\sum_{k < K}\eta_k.$$

The first scale corresponds to the minibatch finite-step/covariance effect: the $\eta^{2}/b$ scaling is the cumulative strength of the SGD modified-loss correction \[[33](https://arxiv.org/html/2605.29152#bib.bib33),[34](https://arxiv.org/html/2605.29152#bib.bib34)\]. The second is explicit radial norm decay, which contracts initialization-dependent norm at rate $\lambda\eta$ per step. The third is the first-order adaptive transport scale: adaptive preconditioners do not preserve the symmetries that make the gradient-flow imbalance conserved, so they can move initialization-dependent quantities at order $\eta$ per step \[[39](https://arxiv.org/html/2605.29152#bib.bib39)\]. For constant learning rate these reduce to $K\eta^{2}/b$, $K\eta\lambda$, and $K\eta$. We use these expressions as an organizing principle for the experiments, not as a complete theorem for nonlinear BatchNorm ResNets.
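For a non-constant schedule the three clocks are plain sums over the per-step learning rates. The sketch below evaluates them for a cosine schedule; the specific cosine form (per-step decay from $\eta_0$ to zero) and the helper names are assumptions made for illustration, since the exact schedule implementation is not given in this section.

```python
import math


def cosine_schedule(eta0, num_steps):
    """Assumed per-step cosine decay from eta0 toward zero."""
    return [0.5 * eta0 * (1.0 + math.cos(math.pi * k / num_steps)) for k in range(num_steps)]


def forgetting_clocks(etas, batch_size, weight_decay=0.0):
    """Cumulative clocks T_SGD, T_L2, T_adapt for a learning-rate sequence."""
    sum_eta = sum(etas)
    sum_eta_sq = sum(eta * eta for eta in etas)
    return {
        "T_SGD": sum_eta_sq / batch_size,
        "T_L2": weight_decay * sum_eta,
        "T_adapt": sum_eta,
    }


K, eta0, b, lam = 1000, 1e-3, 128, 1e-2
constant = forgetting_clocks([eta0] * K, b, lam)
cosine = forgetting_clocks(cosine_schedule(eta0, K), b, lam)

ratio_adapt = cosine["T_adapt"] / constant["T_adapt"]
ratio_sgd = cosine["T_SGD"] / constant["T_SGD"]
print(f"cosine / constant: T_adapt ratio {ratio_adapt:.3f}, T_SGD ratio {ratio_sgd:.3f}")
```

Under this assumed schedule, cosine decay roughly halves the first-order clocks and scales the second-order SGD clock by about 3/8 relative to a constant learning rate, so ignoring the decay changes the clocks by an order-one factor rather than by orders of magnitude.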
Figure 6: Initialization sensitivity decays on regularization timescales, not epoch count. (a–d) $|\beta(t)|$ versus epoch for the SGD family (left) and adaptive methods (right) at $b=16$ (top) and $b=128$ (bottom). Vanilla SGD (red) and SGD+momentum (orange) stay flat; Adam/AdamW/Muon decay steadily; SGD with $\eta=10^{-2}$, $L_2=10^{-2}$ (green) reaches the adaptive noise floor. (e) Vanilla SGD $|\beta|$ versus $\mathcal{T}_{\mathrm{SGD}}=(1/b)\sum_k\eta_k^{2}$ across all five batch sizes (colored lines); SGD+$L_2$ versus $\mathcal{T}_{L_2}=\lambda\sum_k\eta_k$ at $b=16,128$ (green). (f) Adam $|\beta|$ versus $\mathcal{T}_{\mathrm{adapt}}=\sum_k\eta_k$ across all five batch sizes. Curves approximately collapse under the proposed rescaling.
### 4.2 From timescales to data
#### A linear\-network sanity check\.
These timescales are visible in the smallest nontrivial homogeneous network: the two-parameter scalar problem $(abx-y)^{2}$ with $a,b\in\mathbb{R}$. (Appendix [G](https://arxiv.org/html/2605.29152#A7) proves the conservation, finite-step, $L_2$, and adaptive-preconditioning statements below for two-factor and deep linear networks.) Gradient flow preserves the imbalance $D=a^{2}-b^{2}$ exactly, so the norm at convergence depends on initialization. Explicit $L_2$ decay kills $D$ at rate $e^{-2\lambda t}$, giving a timescale $O(\lambda\eta K)$. Discrete minibatch SGD breaks the conservation at second order in $\eta$; the batch-dependent part of the leakage scales as $O(\eta^{2}K/b)$ \[[40](https://arxiv.org/html/2605.29152#bib.bib40),[41](https://arxiv.org/html/2605.29152#bib.bib41),[35](https://arxiv.org/html/2605.29152#bib.bib35)\]. Adaptive preconditioning, by contrast, generically breaks this first-order cancellation; the resulting $O(\eta K)$ clock for adaptive methods is proved in our minimal model (Appendix [G](https://arxiv.org/html/2605.29152#A7)) and is suggested by the implicit-bias analyses of \[[39](https://arxiv.org/html/2605.29152#bib.bib39),[38](https://arxiv.org/html/2605.29152#bib.bib38)\]. In the regimes studied here, the effective clocks are ordered roughly as
$$\mathcal{T}_{\mathrm{SGD}}\ll\mathcal{T}_{L_2}\lesssim\mathcal{T}_{\mathrm{adapt}},$$

once explicit $L_2$ is large enough to matter. This ordering matches the observed forgetting hierarchy: low-LR SGD remembers, SGD with sufficient $L_2$ forgets, and adaptive methods forget fastest under the diagnostic grid. In other words, the formulas explain the empirical hierarchy without requiring a universal ordering: low-LR SGD has a tiny $\mathcal{T}_{\mathrm{SGD}}$, explicit $L_2$ becomes effective when $\lambda\sum_k\eta_k$ is order one or larger, and adaptive methods have a first-order movement clock $\sum_k\eta_k$.
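The conservation and $L_2$-decay statements for the scalar model can be checked numerically. The sketch below is an illustration, not the paper's code: it runs full-batch gradient descent with a small step (so minibatch noise and adaptive preconditioning are not modeled), it assumes the penalty is written as $(\lambda/2)(a^{2}+b^{2})$ so that the flow rate is $e^{-2\lambda t}$ with $t=\eta K$, and the initial values are arbitrary. Here `b` is the second network parameter, not the batch size.

```python
import math


def train_two_param(a, b, eta, steps, weight_decay=0.0, x=1.0, y=1.0):
    """Gradient descent on (a*b*x - y)**2 + (weight_decay / 2) * (a**2 + b**2)."""
    for _ in range(steps):
        residual = a * b * x - y
        grad_a = 2.0 * residual * b * x + weight_decay * a
        grad_b = 2.0 * residual * a * x + weight_decay * b
        a, b = a - eta * grad_a, b - eta * grad_b
    return a, b


def imbalance(a, b):
    """D = a^2 - b^2, conserved exactly by unregularized gradient flow."""
    return a * a - b * b


a0, b0 = 2.0, 0.1          # arbitrary unbalanced initialization
eta, steps = 1e-3, 20_000  # small step, total flow time eta * steps = 20
lam = 0.1                  # illustrative L2 strength

d0 = imbalance(a0, b0)
ratio_plain = imbalance(*train_two_param(a0, b0, eta, steps)) / d0
ratio_l2 = imbalance(*train_two_param(a0, b0, eta, steps, weight_decay=lam)) / d0
predicted_l2 = math.exp(-2.0 * lam * eta * steps)

print(f"no L2:   D_final / D_0 = {ratio_plain:.5f}")
print(f"with L2: D_final / D_0 = {ratio_l2:.5f} (flow prediction {predicted_l2:.5f})")
```

Without regularization the imbalance is preserved up to a small second-order leak in $\eta$, while the $L_2$ run tracks the $e^{-2\lambda\eta K}$ prediction closely, which is the sense in which $\lambda\sum_k\eta_k$ acts as the relevant clock in this toy model.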
#### Numerical check\.
The timescale arithmetic explains the repair map of Section [3.2](https://arxiv.org/html/2605.29152#S3.SS2). Ignoring cosine decay and using $K\approx E\,n_{\mathrm{train}}/b$, low-LR SGD at $b=128$ has $\mathcal{T}_{\mathrm{SGD}}\approx 7.3\cdot 10^{-4}$ after $300$ epochs and only $\approx 1.2\cdot 10^{-2}$ after $5{,}000$ epochs; both regimes retain ${\sim}26$ pp of spread. At $b=16$ and $\eta=10^{-2}$, the same timescale reaches $\approx 4.7$ and the spread drops sharply. Explicit $L_{2}$ shows the same threshold: at $b=128$, $\eta=10^{-2}$ with $\lambda=10^{-3}$ gives $\mathcal{T}_{L_{2}}\approx 0.94$ and still leaves $17.4$ pp, whereas $\lambda=5\times 10^{-3}$ gives $\mathcal{T}_{L_{2}}\approx 4.7$ and collapses the spread below $1$ pp. Adaptive methods erase initialization-scale dependence without the SGD-style sweep, consistent with the $K\eta$ scaling of $\mathcal{T}_{\mathrm{adapt}}$.
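The arithmetic above can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the $O(\cdot)$ constants are set to one, $n_{\mathrm{train}}=40{,}000$ is taken from the train/validation split in Appendix A, the low-LR setting is read as $\eta=10^{-3}$ (the $\eta_0$ of the main grid), and the function names are hypothetical.

```python
N_TRAIN = 40_000  # training images after the validation hold-out (Appendix A)

def num_updates(epochs, batch_size, n_train=N_TRAIN):
    """K ~ E * n_train / b, ignoring the ceiling and the cosine decay."""
    return epochs * n_train / batch_size

def t_sgd(eta, epochs, batch_size):
    """Finite-step SGD clock, K * eta^2 / b."""
    return eta ** 2 * num_updates(epochs, batch_size) / batch_size

def t_l2(eta, lam, epochs, batch_size):
    """Explicit L2 clock, K * eta * lambda."""
    return lam * eta * num_updates(epochs, batch_size)

def t_adapt(eta, epochs, batch_size):
    """First-order movement clock for adaptive methods, K * eta."""
    return eta * num_updates(epochs, batch_size)

clocks = {
    "SGD  b=128 eta=1e-3  300 epochs": t_sgd(1e-3, 300, 128),
    "SGD  b=128 eta=1e-3 5000 epochs": t_sgd(1e-3, 5000, 128),
    "SGD  b=16  eta=1e-2  300 epochs": t_sgd(1e-2, 300, 16),
    "L2   b=128 eta=1e-2 lam=1e-3":    t_l2(1e-2, 1e-3, 300, 128),
    "L2   b=128 eta=1e-2 lam=5e-3":    t_l2(1e-2, 5e-3, 300, 128),
    # Assumes the adaptive runs also use eta = 1e-3 for 300 epochs.
    "Adam b=128 eta=1e-3  300 epochs": t_adapt(1e-3, 300, 128),
}
for name, value in clocks.items():
    print(f"{name}: {value:.3g}")
```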
#### Empirical $|\beta(t)|$ collapse.
If these timescales are the right dynamical clocks, then initialization sensitivity at different batch sizes should collapse when plotted against $\mathcal{T}$ rather than epoch count. We test this with $|\beta(t)|=|d\,\mathrm{ValAcc}/d\,\sigma_{w}|$ at each epoch, estimated by ordinary least squares regression across the $25$-point $\sigma_{w}$ sweep ($10$ seeds).
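The slope estimate can be sketched as follows. The text does not specify how the $10$ seeds enter the regression, so this example assumes that all (seed, $\sigma_{w}$) pairs are pooled into a single least-squares fit. The function names are hypothetical, and the accuracy values are synthetic placeholders with a known slope, not results from the paper.

```python
def ols_slope(xs, ys):
    """Ordinary least squares slope of y on x (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def abs_beta(sigmas, val_acc_by_seed):
    """|beta| at one epoch; val_acc_by_seed holds one accuracy list per seed,
    each aligned with `sigmas`. Seeds are pooled (an assumption)."""
    xs, ys = [], []
    for accs in val_acc_by_seed:
        xs.extend(sigmas)
        ys.extend(accs)
    return abs(ols_slope(xs, ys))

# The 25-point sweep sigma_w = 0.10, 0.20, ..., 2.50.
SIGMAS = [round(0.1 * i, 2) for i in range(1, 26)]

# Synthetic accuracies: slope -0.10 plus a per-seed offset (placeholder data).
SEED_OFFSETS = [0.002 * s for s in range(10)]
synthetic = [[0.90 - 0.10 * s + off for s in SIGMAS] for off in SEED_OFFSETS]

beta_hat = abs_beta(SIGMAS, synthetic)
print(f"|beta| = {beta_hat:.4f}")
```

Because every seed shares the same $\sigma_{w}$ grid, the pooled slope coincides with the slope fitted to the seed-averaged accuracies, so per-seed offsets do not bias the estimate.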
Figure [6](https://arxiv.org/html/2605.29152#S4.F6) (a–d) shows $|\beta|$ versus epoch: vanilla SGD stays flat near $10^{-1}$; Adam, AdamW, and Muon decay by $1$–$2$ orders of magnitude; the SGD ablation with $\eta=10^{-2}$ and $L_{2}=10^{-2}$ reaches the adaptive noise floor. Panels (e–f) replot against the cumulative timescale $\mathcal{T}$ for all five batch sizes ($b\in\{16,32,64,128,256\}$): under $\mathcal{T}_{\mathrm{SGD}}$ the vanilla-SGD curves approximately collapse, and adding SGD with $L_{2}$ (plotted against $\mathcal{T}_{L_{2}}$) shows the same decay trajectory as the adaptive methods. Panel (f) confirms that Adam curves at all five batch sizes align under $\mathcal{T}_{\mathrm{adapt}}$. The collapse is not exact ($|\beta|$ is a noisy estimator and the timescale derivations assume linear, scale-invariant dynamics), but the qualitative alignment across a $16\times$ range of batch sizes is clear.
Message 3. The right unit of forgetting is accumulated regularization, not epoch count. Initialization memory gives an empirical readout of implicit and explicit regularization: the optimizer is not only fitting the data, but also deciding how quickly the initial condition stops mattering.
## 5 Conclusion
#### An apparent paradox, and its resolution\.
A line of work argues that the parameter–function map of deep networks is strongly biased toward simple functions, and that this architectural prior already accounts for generalization \[[6](https://arxiv.org/html/2605.29152#bib.bib6), [42](https://arxiv.org/html/2605.29152#bib.bib42), [7](https://arxiv.org/html/2605.29152#bib.bib7)\]. Our results sharpen this picture in a way that initially looks paradoxical: the training regimes that preserve initialization bias (low-LR SGD) leave generalization fragile to the initialization, while those that erase it (Adam, well-tuned SGD with weight decay) generalize uniformly well across the $\sigma_{w}$ sweep. The standard ResNet recipe ($\eta=0.1$, momentum, $\lambda=5\times 10^{-4}$, augmentation) collapses initialization memory to $|\Delta|=0.0$ pp at $b=128$ while achieving the highest test accuracy in our study, $94.2\%$ (Appendix [F](https://arxiv.org/html/2605.29152#A6)). The Occam’s razor that matters in practice is therefore not the one built into the parameter–function map at initialization; it is the one imposed by the regularizers accumulated along the trajectory, and the same mechanisms that close the $\sigma_{w}$ gap are the ones that produce good generalization.
#### Implications, limitations, and future work\.
Our controlled study (6,250+ runs across optimizers, batch sizes, depths, regularizers, normalizations, and training lengths) establishes that interpolation does not imply forgetting, that long, low-LR training does not erase memory, and that forgetting requires accumulated regularization on optimizer-dependent timescales ($K\eta^{2}/b$, $K\eta\lambda$, $K\eta$). A practical, somewhat counterintuitive consequence is that designing initialization schemes purely “for predictor performance” brings little payoff once the pipeline includes sufficient regularization: any initialization-dependent advantage is erased on the same timescale as the one that delivers generalization. Initialization design remains valuable as a *trainability* device (stabilizing optimization, enabling depth, supporting shorter training horizons, and controlling signal propagation), but it should not be expected to leave a persistent fingerprint under modern, regularization-rich procedures. Conversely, in gradient-flow-like regimes (small $\eta$, large $b$, no weight decay) initialization-induced bias does survive, and changing the prior at $t=0$ becomes a real lever on the final predictor.
Our experiments are restricted to BatchNorm ResNets on CIFAR-10, and the timescale expressions are organizing principles derived from linear and scale-invariant arguments rather than full nonlinear theorems, limitations shared with the implicit-bias literature we build on \[[32](https://arxiv.org/html/2605.29152#bib.bib32), [33](https://arxiv.org/html/2605.29152#bib.bib33), [34](https://arxiv.org/html/2605.29152#bib.bib34), [39](https://arxiv.org/html/2605.29152#bib.bib39)\]. Our $\sigma_{w}$ probe is also one-dimensional, leaving bias distributions, layer-wise scalings, and $\mu$P parameterizations unstudied. The most pressing extensions are two. The first is to larger-scale settings (ImageNet, ViTs, LLM pretraining and fine-tuning), where recent seed-dependence results across pretraining \[[25](https://arxiv.org/html/2605.29152#bib.bib25), [26](https://arxiv.org/html/2605.29152#bib.bib26), [27](https://arxiv.org/html/2605.29152#bib.bib27), [28](https://arxiv.org/html/2605.29152#bib.bib28)\] and fine-tuning \[[24](https://arxiv.org/html/2605.29152#bib.bib24)\] suggest initialization memory may be both measurable and consequential. The second is to tighten the backward-error framework into a quantitative theory of forgetting time for homogeneous and approximately scale-invariant networks.
## References
- \[1\]Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals\.Understanding deep learning \(still\) requires rethinking generalization\.Communications of the ACM, 64\(3\):107–115, 2021\.
- \[2\]Steven H\. Strogatz\.Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering\.Westview Press, 2 edition, 2015\.
- \[3\]Morris W\. Hirsch, Stephen Smale, and Robert L\. Devaney\.Differential Equations, Dynamical Systems, and an Introduction to Chaos\.Academic Press, 3 edition, 2013\.
- \[4\]Hancheng Min, Salma Tarmoun, René Vidal, and Enrique Mallada\.On the explicit role of initialization on the convergence and implicit bias of overparametrized linear networks\.InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 7760–7768\. PMLR, 2021\.
- \[5\]Oria Gruber and Haim Avron\.On the role of initialization on the implicit bias in deep linear networks, 2024\.
- \[6\]Guillermo Valle\-Pérez, Chico Q\. Camargo, and Ard A\. Louis\.Deep learning generalizes because the parameter–function map is biased towards simple functions\.InInternational Conference on Learning Representations, 2019\.
- \[7\]Chris Mingard, Henry Rees, Guillermo Valle\-Pérez, and Ard A\. Louis\.Deep neural networks have an inbuilt Occam’s razor\.Nature Communications, 16:220, 2025\.
- \[8\]Thomas Fink\.Deep\-layered machines have a built\-in Occam’s razor\.arXiv preprint arXiv:2603\.01217, 2026\.
- \[9\]Xavier Glorot and Yoshua Bengio\.Understanding the difficulty of training deep feedforward neural networks\.InProceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 ofProceedings of Machine Learning Research, pages 249–256\. PMLR, 2010\.
- \[10\]Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun\.Delving deep into rectifiers: Surpassing human\-level performance on ImageNet classification\.InProceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015\.
- \[11\]Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl\-Dickstein, and Surya Ganguli\.Exponential expressivity in deep neural networks through transient chaos\.InAdvances in Neural Information Processing Systems, volume 29, 2016\.
- \[12\]Samuel S\. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl\-Dickstein\.Deep information propagation\.InInternational Conference on Learning Representations, 2017\.
- \[13\]Jeffrey Pennington, Samuel S\. Schoenholz, and Surya Ganguli\.Resurrecting the sigmoid in deep learning through dynamical isometry: Theory and practice\.InAdvances in Neural Information Processing Systems, volume 30, 2017\.
- \[14\]Boris Hanin and David Rolnick\.How to start training: The effect of initialization and architecture\.InAdvances in Neural Information Processing Systems, volume 31, 2018\.
- \[15\]Boris Hanin\.Which neural net architectures give rise to exploding and vanishing gradients?InAdvances in Neural Information Processing Systems, volume 31, 2018\.
- \[16\]Lechao Xiao, Yasaman Bahri, Jascha Sohl\-Dickstein, Samuel S\. Schoenholz, and Jeffrey Pennington\.Dynamical isometry and a mean field theory of CNNs: How to train 10,000\-layer vanilla convolutional neural networks\.InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 5393–5402\. PMLR, 2018\.
- \[17\]Minmin Chen, Jeffrey Pennington, and Samuel S\. Schoenholz\.Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks\.InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 873–882\. PMLR, 2018\.
- \[18\]Greg Yang and Edward J\. Hu\.Tensor programs IV: Feature learning in infinite\-width neural networks\.InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 11727–11737\. PMLR, 2021\.
- \[19\]Greg Yang, Edward J\. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao\.Tensor programs V: Tuning large neural networks via zero\-shot hyperparameter transfer\.InAdvances in Neural Information Processing Systems, volume 34, 2021\.
- \[20\]Blake Bordelon and Cengiz Pehlevan\.Self\-consistent dynamical field theory of kernel evolution in wide neural networks\.Journal of Statistical Mechanics: Theory and Experiment, 2023\(11\):114009, 2023\.
- \[21\]Blake Bordelon and Cengiz Pehlevan\.Dynamics of finite width kernel and prediction fluctuations in mean field neural networks\.Journal of Statistical Mechanics: Theory and Experiment, 2024\(10\):104021, 2024\.
- \[22\]Blake Bordelon and Cengiz Pehlevan\.Deep linear network training dynamics from random initialization: Data, width, depth, and hyperparameter transfer\.InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 4968–4997\. PMLR, 2025\.
- \[23\]Clarissa Lauditi, Blake Bordelon, and Cengiz Pehlevan\.Adaptive kernel predictors from feature\-learning infinite limits of neural networks\.InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 32617–32648\. PMLR, 2025\.
- \[24\]Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah A\. Smith\.Fine\-tuning pretrained language models: Weight initializations, data orders, and early stopping\.CoRR, abs/2002\.06305, 2020\.
- \[25\]Oskar van der Wal, Pietro Lesci, Max Müller\-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem H\. Zuidema, and Stella R\. Biderman\.PolyPythias: Stability and outliers across fifty language model pre\-training runs\.InInternational Conference on Learning Representations, 2025\.
- \[26\]Finlay Fehlauer, Kyle Mahowald, and Tiago Pimentel\.Convergence and divergence of language models under different random seeds\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32982–32991, Suzhou, China, 2025\. Association for Computational Linguistics\.
- \[27\]Yao Tong, Haonan Wang, Siquan Li, Kenji Kawaguchi, and Tianyang Hu\.SeedPrints: Fingerprints can even tell which seed your large language model was trained from\.InInternational Conference on Learning Representations, 2026\.Poster\.
- \[28\]Siquan Li, Yao Tong, Haonan Wang, and Tianyang Hu\.Transformers are born biased: Structural inductive biases at random initialization and their practical consequences, 2026\.
- \[29\]Zeyuan Allen\-Zhu and Yuanzhi Li\.Physics of language models: Part 3\.1, knowledge storage and extraction\.InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 1067–1077\. PMLR, 2024\.
- \[30\]Zeyuan Allen\-Zhu\.Physics of language models: Part 4\.1, architecture design and the magic of canon layers\.InProceedings of the 39th Conference on Neural Information Processing Systems, NeurIPS ’25, 2025\.Full version available at[https://ssrn\.com/abstract=5240330](https://ssrn.com/abstract=5240330)\.
- \[31\]Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora\.Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate\.InAdvances in Neural Information Processing Systems, volume 33, 2020\.
- \[32\]David G\. T\. Barrett and Benoit Dherin\.Implicit gradient regularization\.InInternational Conference on Learning Representations, 2021\.
- \[33\]Samuel L\. Smith, Benoit Dherin, David G\. T\. Barrett, and Soham De\.On the origin of implicit regularization in stochastic gradient descent\.InInternational Conference on Learning Representations, 2021\.
- \[34\]Pierfrancesco Beneventano\.On the trajectories of SGD without replacement\.arXiv preprint arXiv:2312\.16143, 2023\.
- \[35\]Pierfrancesco Beneventano, Andrea Pinto, and Tomaso Poggio\.How neural networks learn the support is an implicit regularization effect of SGD\.2024\.
- \[36\]David F\. Griffiths and J\. M\. Sanz\-Serna\.On the scope of the method of modified equations\.SIAM Journal on Scientific and Statistical Computing, 7\(3\):994–1008, 1986\.
- \[37\]Ernst Hairer, Christian Lubich, and Gerhard Wanner\.Geometric Numerical Integration: Structure\-Preserving Algorithms for Ordinary Differential Equations, volume 31 ofSpringer Series in Computational Mathematics\.Springer, Berlin, Heidelberg, 2 edition, 2006\.
- \[38\]Avrajit Ghosh, He Lyu, Xitong Zhang, and Rongrong Wang\.Implicit regularization in heavy\-ball momentum accelerated stochastic gradient descent\.arXiv preprint arXiv:2302\.00849, 2023\.
- \[39\]Matias D\. Cattaneo, Jason Matthew Klusowski, and Boris Shigida\.On the implicit bias of Adam\.InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 5862–5906\. PMLR, 2024\.
- \[40\]Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo\.Implicit regularization in deep matrix factorization\.InAdvances in Neural Information Processing Systems, volume 32, 2019\.
- \[41\]Pierfrancesco Beneventano and Blake Woodworth\.Gradient descent converges linearly to flatter minima than gradient flow in shallow linear networks\.arXiv preprint arXiv:2501\.09137, 2025\.
- \[42\]Chris Mingard, Joar Skalse, Guillermo Valle\-Pérez, David Martínez\-Rubio, Vladimir Mikulik, and Ard A\. Louis\.Neural networks are a priori biased towards boolean functions with low entropy, 2020\.
- \[43\]Vladimir N\. Vapnik and Alexey Ya\. Chervonenkis\.On the uniform convergence of relative frequencies of events to their probabilities\.Theory of Probability and Its Applications, 16\(2\):264–280, 1971\.
- \[44\]Vladimir N\. Vapnik\.Statistical Learning Theory\.Wiley, 1998\.
- \[45\]Peter L\. Bartlett\.The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network\.IEEE Transactions on Information Theory, 44\(2\):525–536, 1998\.
- \[46\]Peter L\. Bartlett and Shahar Mendelson\.Rademacher and gaussian complexities: Risk bounds and structural results\.Journal of Machine Learning Research, 3:463–482, 2002\.
- \[47\]Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro\.Norm\-based capacity control in neural networks\.InProceedings of The 28th Conference on Learning Theory, volume 40 ofProceedings of Machine Learning Research, pages 1376–1401\. PMLR, 2015\.
- \[48\]Peter L\. Bartlett, Dylan J\. Foster, and Matus J\. Telgarsky\.Spectrally\-normalized margin bounds for neural networks\.InAdvances in Neural Information Processing Systems, volume 30, 2017\.
- \[49\]Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals\.Understanding deep learning requires rethinking generalization\.InInternational Conference on Learning Representations, 2017\.
- \[50\]Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S\. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste\-Julien\.A closer look at memorization in deep networks\.InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 233–242\. PMLR, 2017\.
- \[51\]Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L\. Edelman, Fred Zhang, and Boaz Barak\.SGD on neural networks learns functions of increasing complexity\.InAdvances in Neural Information Processing Systems, volume 32, 2019\.
- \[52\]Moritz Hardt, Benjamin Recht, and Yoram Singer\.Train faster, generalize better: Stability of stochastic gradient descent\.InProceedings of the 33rd International Conference on Machine Learning, volume 48 ofProceedings of Machine Learning Research, pages 1225–1234\. PMLR, 2016\.
- \[53\]Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro\.Exploring generalization in deep learning\.InAdvances in Neural Information Processing Systems, volume 30, 2017\.
- \[54\]Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro\.A PAC\-bayesian approach to spectrally\-normalized margin bounds for neural networks\.InInternational Conference on Learning Representations, 2018\.
- \[55\]Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio\.Predicting the generalization gap in deep networks with margin distributions\.InInternational Conference on Learning Representations, 2019\.
- \[56\]Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio\.Fantastic generalization measures and where to find them\.InInternational Conference on Learning Representations, 2020\.
- \[57\]Sepp Hochreiter and Jürgen Schmidhuber\.Flat minima\.Neural Computation, 9\(1\):1–42, 1997\.
- \[58\]Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang\.On large\-batch training for deep learning: Generalization gap and sharp minima\.InInternational Conference on Learning Representations, 2017\.
- \[59\]Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio\.Sharp minima can generalize for deep nets\.InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 1019–1028\. PMLR, 2017\.
- \[60\]Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur\.Sharpness\-aware minimization for efficiently improving generalization\.InInternational Conference on Learning Representations, 2021\.
- \[61\]David A\. McAllester\.PAC\-bayesian model averaging\.InProceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164–170, 1999\.
- \[62\]Gintare Karolina Dziugaite and Daniel M\. Roy\.Computing nonvacuous generalization bounds for deep \(stochastic\) neural networks with many more parameters than training data\.InProceedings of the Thirty\-Third Conference on Uncertainty in Artificial Intelligence, 2017\.
- \[63\]Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P\. Adams, and Peter Orbanz\.Non\-vacuous generalization bounds at the ImageNet scale: A PAC\-bayesian compression approach\.InInternational Conference on Learning Representations, 2019\.
- \[64\]Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang\.Stronger generalization bounds for deep nets via a compression approach\.InInternational Conference on Learning Representations, 2018\.
- \[65\]Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal\.Reconciling modern machine\-learning practice and the classical bias–variance trade\-off\.Proceedings of the National Academy of Sciences, 116\(32\):15849–15854, 2019\.
- \[66\]Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever\.Deep double descent: Where bigger models and more data hurt\.InInternational Conference on Learning Representations, 2020\.
- \[67\]Peter L\. Bartlett, Philip M\. Long, Gábor Lugosi, and Alexander Tsigler\.Benign overfitting in linear regression\.Proceedings of the National Academy of Sciences, 117\(48\):30063–30070, 2020\.
- \[68\]Radford M\. Neal\.Bayesian Learning for Neural Networks, volume 118 ofLecture Notes in Statistics\.Springer, 1996\.
- \[69\]Christopher K\. I\. Williams\.Computing with infinite networks\.InAdvances in Neural Information Processing Systems, volume 9, 1996\.
- \[70\]Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S\. Schoenholz, Jeffrey Pennington, and Jascha Sohl\-Dickstein\.Deep neural networks as gaussian processes\.InInternational Conference on Learning Representations, 2018\.
- \[71\]Arthur Jacot, Franck Gabriel, and Clément Hongler\.Neural tangent kernel: Convergence and generalization in neural networks\.InAdvances in Neural Information Processing Systems, volume 31, 2018\.
- \[72\]Kamaludin Dingle, Chico Q\. Camargo, and Ard A\. Louis\.Input–output maps are strongly biased towards simple outputs\.Nature Communications, 9:761, 2018\.
- \[73\]Giacomo De Palma, Bobak Kiani, and Seth Lloyd\.Random deep neural networks are biased towards simple functions\.InAdvances in Neural Information Processing Systems, volume 32, 2019\.
- \[74\]Abraham Lempel and Jacob Ziv\.On the complexity of finite sequences\.IEEE Transactions on Information Theory, 22\(1\):75–81, 1976\.
- \[75\]Jacob Ziv and Abraham Lempel\.A universal algorithm for sequential data compression\.IEEE Transactions on Information Theory, 23\(3\):337–343, 1977\.
- \[76\]Ming Li and Paul M\. B\. Vitányi\.An Introduction to Kolmogorov Complexity and Its Applications\.Springer, 3 edition, 2008\.
- \[77\]Satwik Bhattamishra, Arkil Patel, Varun Kanade, and Phil Blunsom\.Simplicity bias in transformers and their ability to learn sparse Boolean functions\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pages 5767–5791, Toronto, Canada, 2023\. Association for Computational Linguistics\.
- \[78\]Michael Hahn and Mark Rofin\.Why are sensitive functions hard for transformers?InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pages 14973–15008, Bangkok, Thailand, 2024\. Association for Computational Linguistics\.
- \[79\]Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang, and Vatsal Sharan\.Transformers learn low sensitivity functions: Investigations and implications\.InInternational Conference on Learning Representations, 2025\.
- \[80\]Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred A\. Hamprecht, Yoshua Bengio, and Aaron Courville\.On the spectral bias of neural networks\.InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 5301–5310\. PMLR, 2019\.
- \[81\]Roman Novak, Yasaman Bahri, Daniel A\. Abolafia, Jeffrey Pennington, and Jascha Sohl\-Dickstein\.Sensitivity and generalization in neural networks: An empirical study\.InInternational Conference on Learning Representations, 2018\.
- \[82\]Boris Hanin and David Rolnick\.Complexity of linear regions in deep networks\.InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 2596–2604\. PMLR, 2019\.
- \[83\]Boris Hanin and David Rolnick\.Deep ReLU networks have surprisingly few activation patterns\.InAdvances in Neural Information Processing Systems, volume 32, pages 359–368, 2019\.
- \[84\]Benoit Dherin, Michael Munn, Mihaela Rosca, and David G\. T\. Barrett\.Why neural networks find simple solutions: The many regularizers of geometric complexity\.InAdvances in Neural Information Processing Systems, volume 35, 2022\.
- \[85\]Maria Refinetti, Alessandro Ingrosso, and Sebastian Goldt\.Neural networks trained with SGD learn distributions of increasing complexity\.InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 28843–28863\. PMLR, 2023\.
- \[86\]Etienne Boursier and Nicolas Flammarion\.Simplicity bias and optimization threshold in two\-layer ReLU networks, 2024\.
- \[87\]Yedi Zhang, Andrew M\. Saxe, and Peter E\. Latham\.Saddle\-to\-saddle dynamics explains a simplicity bias across neural network architectures\.InInternational Conference on Learning Representations, 2026\.Poster\.
- \[88\]Andrew M\. Saxe, James L\. McClelland, and Surya Ganguli\.Exact solutions to the nonlinear dynamics of learning in deep linear neural networks\.InInternational Conference on Learning Representations, 2014\.
- \[89\]Yaniv Blumenfeld, Dar Gilboa, and Daniel Soudry\.Beyond signal propagation: Is feature diversity necessary in deep neural network initialization?InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 960–969\. PMLR, 2020\.
- \[90\]Greg Yang\.Tensor programs I: Wide feedforward or recurrent neural networks of any architecture are gaussian processes, 2020\.
- \[91\]Sergey Ioffe and Christian Szegedy\.Batch normalization: Accelerating deep network training by reducing internal covariate shift\.InProceedings of the 32nd International Conference on Machine Learning, volume 37 ofProceedings of Machine Learning Research, pages 448–456\. PMLR, 2015\.
- \[92\]Sanjeev Arora, Kaifeng Lyu, and Zhiyuan Li\.Theoretical analysis of auto rate\-tuning by batch normalization\.InInternational Conference on Learning Representations, 2019\.
- \[93\]Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora\.Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate\.InAdvances in Neural Information Processing Systems, volume 33, 2020\.
- \[94\]James Hardy Wilkinson\.Rounding Errors in Algebraic Processes\.Number 32 in Notes on Applied Science\. Her Majesty’s Stationery Office, London, 1963\.
- \[95\]James Hardy Wilkinson\.The Algebraic Eigenvalue Problem\.Monographs on Numerical Analysis\. Clarendon Press, Oxford, 1965\.
- \[96\]Nicholas J\. Higham\.Accuracy and Stability of Numerical Algorithms\.Society for Industrial and Applied Mathematics, Philadelphia, PA, 2 edition, 2002\.
- \[97\]M\. P\. Calvo, Ander Murua, and J\. M\. Sanz\-Serna\.Modified equations for ODEs\.In Peter E\. Kloeden and Kenneth J\. Palmer, editors,Chaotic Numerics, volume 172 ofContemporary Mathematics, pages 63–74\. American Mathematical Society, Providence, RI, 1994\.
- \[98\]Tony Shardlow\.Modified equations for stochastic differential equations\.BIT Numerical Mathematics, 46\(1\):111–125, 2006\.
- \[99\]Konstantinos C\. Zygalakis\.On the existence and the applications of modified equations for stochastic differential equations\.SIAM Journal on Scientific Computing, 33\(1\):102–130, 2011\.
- \[100\]Arnaud Debussche and Erwan Faou\.Weak backward error analysis for SDEs\.SIAM Journal on Numerical Analysis, 50\(3\):1735–1752, 2012\.
- \[101\]Yuanyuan Feng, Tingran Gao, Lei Li, Jian\-Guo Liu, and Yulong Lu\.Uniform\-in\-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation\.Communications in Mathematical Sciences, 18\(1\):163–188, 2020\.
- \[102\]Qianxiao Li, Cheng Tai, and Weinan E\.Stochastic modified equations and adaptive stochastic gradient algorithms\.InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 2101–2110\. PMLR, 2017\.
- \[103\]Qianxiao Li, Cheng Tai, and Weinan E\.Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations\.Journal of Machine Learning Research, 20\(40\):1–47, 2019\.
- \[104\]Taiki Miyagawa\.Toward equation of motion for deep neural networks: Continuous\-time gradient descent and discretization error analysis\.InAdvances in Neural Information Processing Systems, volume 35, pages 37778–37791, 2022\.
- \[105\]Mihaela Rosca, Yan Wu, Chongli Qin, and Benoit Dherin\.On a continuous time model of gradient descent dynamics and instability in deep learning\.arXiv preprint arXiv:2302\.01952, 2023\.
- \[106\]Matias D Cattaneo and Boris Shigida\.Modified loss of momentum gradient descent: Fine\-grained analysis\.arXiv preprint arXiv:2509\.08483, 2025\.
- \[107\]Stefano Di Giovacchino, Desmond J\. Higham, and Konstantinos C\. Zygalakis\.Backward error analysis and the qualitative behaviour of stochastic optimization algorithms: Application to stochastic coordinate descent\.Journal of Computational Dynamics, 11\(4\):453–467, 2024\.
- \[108\]Matias Cattaneo and Boris Shigida\.How memory in optimization algorithms implicitly modifies the loss\.Advances in Neural Information Processing Systems, 38:156059–156096, 2026\.
- \[109\]Matias D Cattaneo and Boris Shigida\.The effect of mini\-batch noise on the implicit bias of Adam\.arXiv preprint arXiv:2602\.01642, 2026\.
- \[110\]Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro\.The implicit bias of gradient descent on separable data\.Journal of Machine Learning Research, 19\(70\):1–57, 2018\.
- \[111\]Suriya Gunasekar, Jason D\. Lee, Daniel Soudry, and Nathan Srebro\.Implicit bias of gradient descent on linear convolutional networks\.InAdvances in Neural Information Processing Systems, volume 31, 2018\.
- \[112\]Kaifeng Lyu and Jian Li\.Gradient descent maximizes the margin of homogeneous neural networks\.InInternational Conference on Learning Representations, 2020\.
- \[113\]Elad Hoffer, Itay Hubara, and Daniel Soudry\.Train longer, generalize better: Closing the generalization gap in large batch training of neural networks\.InAdvances in Neural Information Processing Systems, volume 30, 2017\.
- \[114\]Stephan Mandt, Matthew D\. Hoffman, and David M\. Blei\.Stochastic gradient descent as approximate bayesian inference\.Journal of Machine Learning Research, 18\(134\):1–35, 2017\.
- \[115\]Ashia C\. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht\.The marginal value of adaptive gradient methods in machine learning\.InAdvances in Neural Information Processing Systems, volume 30, 2017\.
## Appendix A Experimental Details
This appendix records the setups for all experiments\. All runs are implemented in Keras 3 / TensorFlow on a single NVIDIA GPU per run \(H100, H200, or L40S\)\. A summary is given in Table[1](https://arxiv.org/html/2605.29152#A1.T1); per\-experiment paragraphs follow\.
#### Common ingredients \(all experiments\)\.
- *Data.* CIFAR-10 with the standard $50{,}000$ training and $10{,}000$ test images. We hold out a fixed $10{,}000$-image validation set from the training partition using permutation seed $42$, giving a $40{,}000$/$10{,}000$/$10{,}000$ train/validation/test split that is identical across every experiment. Inputs are scaled to $[0,1]$.
- *Initialization.* Convolutional and dense kernels are drawn from $W_{ij}\sim\mathcal{N}\!\big(0,\sigma_w^2/\mathrm{fan}_{\mathrm{in}}\big)$, so $\sigma_w=1$ corresponds to a fan-in normal baseline. Biases are drawn from $\mathcal{N}(0,0.2^2)$ and are not rescaled by $\sigma_w$.
- *Loss.* Sparse categorical cross-entropy from logits.
- *Checkpoint selection.* For each run we save and report metrics at the validation-loss-minimizing epoch $\tau_{\mathrm{best}}$, and record $\tau_{\mathrm{interp}}=\min\{t:\mathrm{TrainAcc}_t\geq 99.5\%\}$ and the kernel Frobenius norm $\|W\|_F$.
- *Reproducibility.* Each run sets Python, NumPy, and TensorFlow random seeds to the same integer.
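The initialization rule in the list above can be sketched in NumPy. This is a minimal illustration, not the authors' Keras initializer: the function name `scaled_fan_in_normal` is hypothetical, and the kernel layouts (dense `(fan_in, units)`, convolutional `(kh, kw, in_channels, out_channels)`) are assumed to follow the usual Keras convention.

```python
import numpy as np

def scaled_fan_in_normal(shape, sigma_w, rng):
    """Sample a kernel with i.i.d. entries N(0, sigma_w^2 / fan_in).

    Assumes Keras-style layouts, where the last axis is the output axis
    and the product of the remaining axes is fan_in.
    """
    fan_in = int(np.prod(shape[:-1]))
    std = sigma_w / np.sqrt(fan_in)
    return rng.normal(loc=0.0, scale=std, size=shape)

rng = np.random.default_rng(0)
kernel = scaled_fan_in_normal((3, 3, 64, 128), sigma_w=2.5, rng=rng)
# Biases use a fixed standard deviation of 0.2 and are not rescaled by sigma_w.
bias = rng.normal(loc=0.0, scale=0.2, size=(128,))

# Rescaling the empirical std by sqrt(fan_in) should recover roughly sigma_w.
recovered_sigma = kernel.std() * np.sqrt(3 * 3 * 64)
print(recovered_sigma)
```

Under this convention, changing $\sigma_w$ rescales every kernel by the same factor while leaving the bias distribution untouched.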
#### Main ResNet\-9 grid \(§[3](https://arxiv.org/html/2605.29152#S3), App\.[B](https://arxiv.org/html/2605.29152#A2)\)\.
ResNet-9 with BatchNorm, no augmentation, $300$ epochs. The learning rate follows the cosine schedule $\eta_k=\eta_0\big[\alpha+(1-\alpha)\tfrac{1}{2}(1+\cos(\pi k/K))\big]$ with $\eta_0=10^{-3}$, $\alpha=0.01$, and total updates $K=300\,\lceil 40{,}000/b\rceil$. The grid varies three axes: $\sigma_w\in\{0.10,0.20,\ldots,2.50\}$ ($25$ values), optimizer in {SGD, SGD-momentum, Adam, AdamW, Muon} (Table [1](https://arxiv.org/html/2605.29152#A1.T1)), and batch size $b\in\{16,32,64,128,256\}$. Each of the $25\times 5\times 5=625$ configurations is run for $10$ seeds $\{0,1,42,123,456,789,999,2024,2025,2026\}$, for a total of $6{,}250$ trained models.
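The schedule and update count stated above can be written out directly. The sketch below is a plain-Python transcription of the formula; the helper names `cosine_lr` and `total_updates` are hypothetical and not taken from the authors' code.

```python
import math

def cosine_lr(k, total_steps, eta0=1e-3, alpha=0.01):
    """Learning rate at update k: eta0 * [alpha + (1 - alpha) * 0.5 * (1 + cos(pi k / K))]."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * k / total_steps))
    return eta0 * (alpha + (1.0 - alpha) * cos_term)

def total_updates(batch_size, epochs=300, n_train=40_000):
    """K = epochs * ceil(n_train / batch_size)."""
    return epochs * math.ceil(n_train / batch_size)

K = total_updates(128)
lr_start = cosine_lr(0, K)   # equals eta0
lr_end = cosine_lr(K, K)     # equals alpha * eta0, the floor of the schedule
n_models = 25 * 5 * 5 * 10   # sigma_w values x optimizers x batch sizes x seeds
print(K, lr_start, lr_end, n_models)
```

With $b=128$ an epoch has $\lceil 40{,}000/128\rceil = 313$ updates, so the schedule spans $93{,}900$ updates and decays from $\eta_0$ to $\alpha\eta_0$.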
#### $5{,}000$-epoch SGD (§[3](https://arxiv.org/html/2605.29152#S3), §[4](https://arxiv.org/html/2605.29152#S4)).
Vanilla SGD ($\eta=10^{-3}$, momentum $0$) trained for $5{,}000$ epochs *with constant learning rate* (no cosine decay), no augmentation, and no weight decay or $L_2$ regularization. Five scales $\sigma_w\in\{0.1,0.5,1.0,1.8,2.5\}$, batch sizes $b\in\{16,128\}$, single seed.
#### LR $\times$ $L_2$ sweep for SGD (§[3.2](https://arxiv.org/html/2605.29152#S3.SS2), Table [4](https://arxiv.org/html/2605.29152#A3.T4)).
Vanilla SGD (no momentum), $300$ epochs, cosine schedule. Twelve configurations: $\eta\in\{10^{-3},10^{-2},10^{-1}\}$ paired with explicit kernel $L_2$ regularization $\lambda\in\{0,10^{-3},5\times 10^{-3},10^{-2}\}$, at five scales $\sigma_w\in\{0.1,0.5,1.0,1.8,2.5\}$ and batch sizes $b\in\{16,128\}$.
#### Norm and augmentation (App. [E](https://arxiv.org/html/2605.29152#A5), Fig. [8](https://arxiv.org/html/2605.29152#A5.F8)).
Two normalization variants (BatchNorm, LayerNorm) under standard data augmentation (random horizontal flip + $4$-pixel reflective padding + random $32\times 32$ crop). $300$ epochs cosine, $\eta=10^{-3}$, no weight decay. Two optimizers (SGD without momentum, Adam), $\sigma_w\in\{0.1,0.5,1.0,1.8,2.5\}$, $b\in\{16,128\}$, $5$ seeds.
#### Best-procedure comparison (App. [F](https://arxiv.org/html/2605.29152#A6), Table [7](https://arxiv.org/html/2605.29152#A6.T7)).
$200$ epochs, cosine schedule, with augmentation. Two procedures: SGD ($\eta=0.1$, momentum $0.9$, weight decay $5\times 10^{-4}$) and Adam ($\eta=10^{-3}$, no weight decay), at the two extreme scales $\sigma_w\in\{0.1,2.5\}$ and batch sizes $b\in\{16,128,256\}$, $10$ seeds.
#### Depth (App. [D](https://arxiv.org/html/2605.29152#A4), Fig. [5](https://arxiv.org/html/2605.29152#S3.F5)).
ResNet-56, ResNet-110 (CIFAR-style architectures, $6n+2$ layers with $n=9$ and $n=18$ respectively), and an R9-AvgPool control that replaces ResNet-9's global max-pool with global average-pool. All use $300$ epochs cosine, $\eta_0=10^{-3}$, the five main-grid optimizers, $\sigma_w\in\{0.1,0.5,1.0,1.8,2.5\}$, and $b\in\{16,128\}$, with $10$ seeds. ResNet-56 and ResNet-110 use early stopping with patience $30$ (without altering the cosine schedule); R9-AvgPool does not.
#### Optimizer settings\.
Table [1](https://arxiv.org/html/2605.29152#A1.T1) lists the per-optimizer settings used in the main grid. All five optimizers share $\eta_0=10^{-3}$ with cosine decay. The SGD family is run with zero weight decay; AdamW and Muon retain their default decoupled weight decay $\lambda=10^{-4}$ as part of the optimizer definition. Adam, which has *no* weight decay, exhibits the same small spreads as AdamW and Muon, so the forgetting effect attributed to adaptive methods is not driven by AdamW's or Muon's $\lambda=10^{-4}$.
Table 1: Main-grid optimizer settings. All optimizers use $\eta_0=10^{-3}$ with cosine decay ($\alpha=0.01$) over $300$ epochs. Adam-family $\beta_1=0.9$ (momentum-like) and $\beta_2=0.999$ (variance) are the Keras defaults.
#### Reproducibility\.
For each run, Python, NumPy, and TensorFlow random seeds are set to a single integer, so initialization, minibatch order, and all stochastic operations are jointly determined by it\.
## Appendix B Supplementary Tables
Tables [2](https://arxiv.org/html/2605.29152#A2.T2) and [3](https://arxiv.org/html/2605.29152#A2.T3) report validation accuracy at three training checkpoints for the two extreme initialization scales, $\sigma_w=0.1$ and $\sigma_w=2.5$. The *repair gap* $\Delta_{\mathrm{repair}}\coloneqq\mathrm{ValAcc}_{\tau_{\mathrm{best}}}-\mathrm{ValAcc}_{\tau_{\mathrm{interp}}}$ quantifies how much validation accuracy changes between the interpolation epoch and the best-validation-loss epoch. Its sign, however, is largely determined by the *ordering* of $\tau_{\mathrm{best}}$ and $\tau_{\mathrm{interp}}$: because validation accuracy generally increases during training, configurations where $\tau_{\mathrm{best}}<\tau_{\mathrm{interp}}$ (i.e. the lowest validation loss occurs before the model memorizes the training set) yield a mechanically negative $\Delta_{\mathrm{repair}}$. This is the dominant regime for adaptive optimizers at small batch sizes, where the model achieves its best-calibrated predictions early ($\tau_{\mathrm{best}}\approx 5$–$7$) but does not interpolate until epoch 30–40; accuracy continues to improve even as cross-entropy loss rises. In contrast, at large batch sizes ($b\geq 128$), adaptive optimizers have $\tau_{\mathrm{best}}\gg\tau_{\mathrm{interp}}$, and the positive $\Delta_{\mathrm{repair}}$ reflects genuine post-interpolation restructuring that benefits generalization. SGD exhibits near-zero $\Delta_{\mathrm{repair}}$ at both extremes: its validation accuracy is essentially flat between $\tau_{\mathrm{best}}$ and $\tau_{\mathrm{interp}}$, even when both grow with $\sigma_w$, consistent with the stopping-time stagnation discussed in §[3](https://arxiv.org/html/2605.29152#S3).
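To make the bookkeeping concrete, the sketch below computes $\tau_{\mathrm{best}}$, $\tau_{\mathrm{interp}}$, and $\Delta_{\mathrm{repair}}$ from per-epoch histories. The function name `checkpoint_stats`, the 1-indexed epoch convention, and the toy numbers are assumptions made for illustration; they are not taken from the paper's logs.

```python
def checkpoint_stats(train_acc, val_loss, val_acc, threshold=0.995):
    """Return (tau_best, tau_interp, repair_gap) from per-epoch histories.

    Epochs are 1-indexed (an assumption of this sketch). tau_interp and the
    repair gap are None if training accuracy never reaches the threshold.
    """
    tau_best = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
    tau_interp = next(
        (t + 1 for t, acc in enumerate(train_acc) if acc >= threshold), None
    )
    if tau_interp is None:
        return tau_best, None, None
    gap = val_acc[tau_best - 1] - val_acc[tau_interp - 1]
    return tau_best, tau_interp, gap

# Toy history: validation loss bottoms out at epoch 2, interpolation at epoch 4,
# while validation accuracy keeps rising.
train_acc = [0.60, 0.90, 0.98, 0.996, 0.999]
val_loss = [1.20, 0.80, 0.85, 0.95, 1.10]
val_acc = [0.55, 0.70, 0.72, 0.74, 0.75]
tau_best, tau_interp, gap = checkpoint_stats(train_acc, val_loss, val_acc)
print(tau_best, tau_interp, gap)
```

In this toy history $\tau_{\mathrm{best}}<\tau_{\mathrm{interp}}$ and validation accuracy is increasing, so the gap comes out negative purely because of the ordering of the two epochs.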
Table 2: Validation accuracy at three checkpoints for $\sigma_w=0.1$. Values are mean $\pm$ std over $n=10$ seeds.

Table 3: Validation accuracy at three checkpoints for $\sigma_w=2.5$ (large initialization). At this extreme, SGD's test accuracy collapses to ${\sim}58\%$ at $b=128$ while adaptive optimizers retain ${\sim}85\%$; the contrast with Table [2](https://arxiv.org/html/2605.29152#A2.T2) directly quantifies initialization bias. Values are mean $\pm$ std over $n=10$ seeds.
## Appendix C Learning-Rate Ablation
Table 4: Full LR $\times$ $L_2$ sweep for SGD on ResNet-9 (no momentum). Test accuracy (%) at the best-validation-loss epoch for each $(\eta,\lambda_{\mathrm{L2}},\sigma_w)$ configuration. The “Spread” column reports $\max_{\sigma_w}\mathrm{acc}-\min_{\sigma_w}\mathrm{acc}$ (in percentage points); “Train” is the mean training accuracy across all $\sigma_w$ values. Entries with $\pm$ show mean $\pm$ std over $n=10$ seeds; entries marked † are from a single representative seed (seed 42). Bottom rows: Adam baseline and SGD long training (5000 ep, constant LR) for comparison.
## Appendix D Depth Comparison: Full Results
Figures [5](https://arxiv.org/html/2605.29152#S3.F5) and [7](https://arxiv.org/html/2605.29152#A4.F7) display test and train accuracy as a function of initialization scale $\sigma_w$ for four architectures: ResNet-9 (6.58M parameters), R9-AvgPool (6.58M), ResNet-56 (0.86M), and ResNet-110 (1.74M). Each panel fixes one optimizer (SGD or Adam) and one batch size ($b=16$ or $b=128$); lines show the mean over seeds and shaded regions indicate the 10th–90th percentile range.
Table [5](https://arxiv.org/html/2605.29152#A4.T5) reports the numerical values plotted in Figure [5](https://arxiv.org/html/2605.29152#S3.F5), together with train accuracy at the final epoch and the spread (difference between the best and worst mean test accuracy across $\sigma_w$ values). The lower part of the table records the interpolation epoch $\tau_{\mathrm{interp}}$ (first epoch where train accuracy $\geq 99.5\%$) and the best-validation-loss epoch for each configuration.
Figure 7: Train accuracy vs. initialization scale $\sigma_w$ (same layout as Figure [5](https://arxiv.org/html/2605.29152#S3.F5)). Under SGD at $b=128$, ResNet-56 and ResNet-110 fail to interpolate at large $\sigma_w$, while ResNet-9 maintains $>99\%$ train accuracy throughout.

Table 5: Depth comparison: training and generalization metrics across architectures. All runs use $\eta=10^{-3}$ with cosine decay, no weight decay, 300 epochs. Values are mean $\pm$ std over 10 seeds. $\tau_{\mathrm{interp}}$: first epoch where train accuracy $\geq 99.5\%$; a dash entry indicates interpolation not reached. Spread $=\max_{\sigma_w}\overline{\mathrm{test\_acc}}-\min_{\sigma_w}\overline{\mathrm{test\_acc}}$.
## Appendix E Normalization and Data Augmentation
The main experiments use BatchNorm without data augmentation to isolate the effect of initialization scale from other regularization mechanisms. A natural question is whether switching the normalization layer or adding standard data augmentation eliminates initialization memory. To test this, we train ResNet-9 on CIFAR-10 under three configurations: (i) BatchNorm with augmentation (random horizontal flip + random crop with 4-pixel padding), (ii) LayerNorm with the same augmentation, and (iii) the original BatchNorm baseline without augmentation. All other hyperparameters are identical to the main grid ($\eta=10^{-3}$, cosine decay, no weight decay, 300 epochs, $\sigma_w\in\{0.1,0.5,1.0,1.8,2.5\}$). Results are averaged over 5 seeds.
Figure [8](https://arxiv.org/html/2605.29152#A5.F8) and Table [6](https://arxiv.org/html/2605.29152#A5.T6) show the results. Data augmentation substantially reduces initialization memory under SGD (spread drops from 22.8 to 8.3 pp at $b=16$, and from 26.5 to 18.9 pp at $b=128$ for BatchNorm), but does not eliminate it. LayerNorm behaves similarly to BatchNorm: the SGD spread is 5.1 pp at $b=16$ and 15.2 pp at $b=128$. Under Adam, all three configurations show small spreads ($\leq 3.5$ pp), confirming that the forgetting property of Adam is robust to the choice of normalization and augmentation.
These results reinforce the main finding: neither the specific normalization scheme nor data augmentation is sufficient to erase the initialization memory under low-learning-rate SGD, though both reduce its magnitude.
Figure 8: Test accuracy vs. initialization scale $\sigma_w$ for ResNet-9 under BatchNorm + augmentation (blue), LayerNorm + augmentation (orange), and the BatchNorm baseline without augmentation (grey dashed). Lines: mean over seeds; shaded bands: 10th–90th percentile.

Table 6: Normalization comparison on ResNet-9 CIFAR-10. “BN + Aug” and “LN + Aug” use data augmentation (random flip + crop) with BatchNorm and LayerNorm respectively; “BN (no aug)” is the main-grid baseline without augmentation. All use $\eta=10^{-3}$, cosine decay, no weight decay, 300 epochs. Values: mean $\pm$ std over 5 seeds (aug) or 10 seeds (baseline). Spread $=\max_{\sigma_w}\overline{\mathrm{test\_acc}}-\min_{\sigma_w}\overline{\mathrm{test\_acc}}$ (pp).
## Appendix F Best-Training Pipeline Comparison
The preceding experiments deliberately use a minimal training configuration ($\eta=10^{-3}$, no momentum, no weight decay, no augmentation) to isolate the role of initialization scale. A natural concern is whether initialization memory persists under a standard, well-tuned training procedure. Table [7](https://arxiv.org/html/2605.29152#A6.T7) addresses this by comparing two configurations at the extreme initialization scales $\sigma_w\in\{0.1,2.5\}$: (i) SGD with lr $=0.1$, momentum $=0.9$, weight decay $=5\times 10^{-4}$, and data augmentation (the standard ResNet procedure), and (ii) Adam with lr $=10^{-3}$, no weight decay, and the same augmentation.
At $b=128$, SGD (best) achieves 94.2% test accuracy at both $\sigma_w=0.1$ and $\sigma_w=2.5$ ($|\Delta|=0.0$ pp), indicating that the standard training procedure erases initialization memory at this batch size. Adam also shows a small gap (0.6 pp). At $b=16$, the SGD procedure exhibits high seed-to-seed variance: with $\eta=0.1$ and only 16 samples per step, the effective per-sample step size is large enough that some seeds experience a near-divergence in the first epoch (loss $>50$ vs. $\sim 15$ for other seeds) and fail to recover fully within 200 epochs. This is a training-stability issue, addressable by learning-rate warmup, and not an initialization-memory effect; the gap between $\sigma_w$ values remains small even for the affected seeds.
Table 7: Best-recipe comparison. SGD (standard recipe: $\eta=0.1$, momentum $=0.9$, $L_2$ kernel reg. $\lambda=5\times 10^{-4}$) vs. Adam ($\eta=10^{-3}$, no WD), both with data augmentation and cosine decay, 200 epochs on ResNet-9 / CIFAR-10. Values: mean $\pm$ std over $n=10$ seeds. $\tau_{\mathrm{interp}}$: first epoch with train accuracy $\geq 99.5\%$ (a dash entry if not reached within 200 epochs); format is $\sigma_w=0.1$ / $\sigma_w=2.5$. $\tau_{\mathrm{best}}$: best-validation-loss epoch. $|\Delta|$: absolute difference between $\sigma_w=0.1$ and $\sigma_w=2.5$ test accuracy.
## Appendix G Minimal Linear Model for Initialization Memory
This appendix gives the calculations behind the timescale discussion in Section [4](https://arxiv.org/html/2605.29152#S4). The goal is not a complete theory of BatchNorm ResNets, but a minimal homogeneous model showing why gradient-flow-like dynamics can preserve initialization-dependent quantities and why explicit norm decay, finite steps, stochasticity, and adaptive preconditioning act on different clocks.
Throughout this appendix, we use $\phi$ for the loss as a function of the product represented by the linear network. This avoids overloading the layer index. All losses are assumed continuously differentiable, and all gradients are evaluated at the current product. In the two-factor case the product is $W=UV$. In the deep-linear case the product is
$$F(W_1,\ldots,W_L)=W_LW_{L-1}\cdots W_1.$$
All discrete updates below are simultaneous updates of all factors.
For stochastic statements, $\mathbb{E}_k[\cdot]$ denotes conditional expectation given the current iterate. We write a product-space minibatch gradient estimate as
$$G_k=\bar{G}_k+\xi_k,\qquad \mathbb{E}_k[\xi_k]=0,\qquad \mathbb{E}_k\|\xi_k\|_F^2=O(1/b),$$
where $\bar{G}_k$ is the full product-space gradient and $b$ is the minibatch size. In the deep-linear stochastic statements, the layerwise gradient estimates are induced from the same product-space estimate $G_k$ by the chain rule. This shared $G_k$ assumption is essential: the first-order cancellations below need not hold for arbitrary independent layerwise perturbations.
#### Gradient flow preserves imbalance\.
Consider a two-layer linear network $W=UV$, with
$$U\in\mathbb{R}^{d_{\mathrm{out}}\times r},\qquad V\in\mathbb{R}^{r\times d_{\mathrm{in}}},\qquad \mathcal{L}(U,V)=\phi(UV).$$
Let
$$G=\nabla_W\phi(W).$$
Then
$$\nabla_U\mathcal{L}=GV^\top,\qquad \nabla_V\mathcal{L}=U^\top G,$$
so Euclidean gradient flow is
$$\dot{U}=-GV^\top,\qquad \dot{V}=-U^\top G.$$
Define the imbalance
$$D(U,V)=U^\top U-VV^\top\in\mathbb{R}^{r\times r}.$$
###### Proposition 1 (Two-factor gradient flow preserves imbalance).
Along Euclidean gradient flow for the two-factor linear network,
$$\frac{d}{dt}D(U(t),V(t))=0.$$
###### Proof.
We compute
$$\begin{aligned}
\frac{d}{dt}U^\top U&=\dot{U}^\top U+U^\top\dot{U}\\
&=(-GV^\top)^\top U+U^\top(-GV^\top)\\
&=-VG^\top U-U^\top GV^\top.
\end{aligned}$$
Similarly,
$$\begin{aligned}
\frac{d}{dt}VV^\top&=\dot{V}V^\top+V\dot{V}^\top\\
&=(-U^\top G)V^\top+V(-U^\top G)^\top\\
&=-U^\top GV^\top-VG^\top U.
\end{aligned}$$
The two derivatives are identical, hence their difference is zero:
$$\frac{d}{dt}\left(U^\top U-VV^\top\right)=0.$$
∎
Thus gradient flow exactly preserves an initialization-dependent quantity: $D(U(t),V(t))=D(U(0),V(0))$. In this minimal homogeneous model, initialization memory is a conserved charge.
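The cancellation in the proof can be checked numerically. The NumPy sketch below draws random factors and a random matrix standing in for $G$ (any matrix of the right shape works, since the argument never uses the form of $\phi$) and evaluates $\frac{d}{dt}D$ by the product rule. The dimensions and seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, r, d_in = 4, 3, 5
U = rng.normal(size=(d_out, r))
V = rng.normal(size=(r, d_in))
G = rng.normal(size=(d_out, d_in))  # stand-in for the gradient of phi at W = U V

U_dot = -G @ V.T   # gradient-flow velocity of U
V_dot = -U.T @ G   # gradient-flow velocity of V

# Time derivative of D = U^T U - V V^T, expanded with the product rule.
D_dot = (U_dot.T @ U + U.T @ U_dot) - (V_dot @ V.T + V @ V_dot.T)
print(np.abs(D_dot).max())  # zero up to floating-point rounding
```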
#### The conserved imbalance fixes the final norm in the scalar model\.
The preceding invariant is not merely formal: in the scalar case it directly determines the factor norm of any converged solution\.
###### Proposition 2 (Scalar two-factor norm is fixed by conserved imbalance).
Consider scalar parameters $a,b\in\mathbb{R}$ with predictor $p=ab$. Suppose gradient flow converges to a point with product $p_\star$. Let
$$D_0=a_0^2-b_0^2.$$
Since $D=a^2-b^2$ is conserved, the final squared Euclidean norm satisfies
$$\|(a_\infty,b_\infty)\|_2^2=a_\infty^2+b_\infty^2=\sqrt{D_0^2+4p_\star^2}.$$
Therefore, among solutions with the same product $p_\star$, the final factor norm is a monotone function of $|D_0|$. The final norm thus retains initialization memory.
###### Proof.
Let
$$x=a_\infty^2,\qquad y=b_\infty^2.$$
Conservation of imbalance gives
$$x-y=a_\infty^2-b_\infty^2=D_0,$$
and convergence to product $p_\star$ gives
$$xy=(a_\infty b_\infty)^2=p_\star^2.$$
Therefore
$$(x+y)^2=(x-y)^2+4xy=D_0^2+4p_\star^2.$$
Since $x+y\geq 0$, we obtain
$$x+y=\sqrt{D_0^2+4p_\star^2}.$$
Substituting back $x=a_\infty^2$ and $y=b_\infty^2$ gives the claim. ∎
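As a rough numerical illustration of Proposition 2, the sketch below runs small-step simultaneous gradient descent on the scalar model and compares the final squared norm with $\sqrt{D_0^2+4p_\star^2}$. The squared loss $\tfrac12(ab-p_\star)^2$, the step size, and the starting point are assumptions chosen for this example; the discrete iteration only approximates gradient flow, so agreement is expected up to a small step-size-dependent error.

```python
import math

def run_scalar_flow(a, b, p_star, eta=1e-3, steps=20_000):
    """Small-step simultaneous gradient descent on 0.5 * (a*b - p_star)**2."""
    for _ in range(steps):
        g = a * b - p_star  # derivative of the loss with respect to the product
        a, b = a - eta * g * b, b - eta * g * a
    return a, b

a0, b0, p_star = 2.0, 0.5, 3.0
a_inf, b_inf = run_scalar_flow(a0, b0, p_star)

D0 = a0**2 - b0**2
predicted = math.sqrt(D0**2 + 4 * p_star**2)  # Proposition 2
final_norm_sq = a_inf**2 + b_inf**2
print(final_norm_sq, predicted)
```

Starting from a different $(a_0,b_0)$ with the same target $p_\star$ changes $D_0$ and therefore the final norm, which is the initialization memory the proposition describes.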
#### Continuous-time $L_2$ decay contracts imbalance.
Now add coupled continuous-time weight decay:
$$\dot{U}=-GV^\top-\lambda U,\qquad \dot{V}=-U^\top G-\lambda V,\qquad \lambda\geq 0.$$
###### Proposition 3 (Continuous-time $L_2$ decay erases imbalance).
Under the dynamics above,
$$\frac{d}{dt}D(t)=-2\lambda D(t),\qquad D(t)=e^{-2\lambda t}D(0).$$
###### Proof.
The gradient-dependent terms are the same as in the previous calculation and still cancel in the difference. The decay terms give
$$\begin{aligned}
\frac{d}{dt}U^\top U&=\text{gradient terms}-2\lambda U^\top U,\\
\frac{d}{dt}VV^\top&=\text{same gradient terms}-2\lambda VV^\top.
\end{aligned}$$
Subtracting yields
$$\dot{D}(t)=-2\lambda\left(U^\top U-VV^\top\right)=-2\lambda D(t).$$
Solving this linear differential equation gives
$$D(t)=e^{-2\lambda t}D(0).$$
∎
Thus explicit norm control contracts the imbalance on the continuous-time clock $\lambda t$. In discrete training, the gradient-flow time corresponding to a learning-rate schedule $(\eta_k)$ is approximately $\sum_k\eta_k$, so the natural $L_2$ clock is
$$\mathcal{T}_{L_2}=\lambda\sum_{k<K}\eta_k=K\eta\lambda\quad\text{for constant $\eta$}.$$
#### Coupled discrete weight decay\.
The corresponding discrete calculation makes the same clock explicit. Consider the two-factor update
$$U_{k+1}=(1-\eta_k\lambda)U_k-\eta_kG_kV_k^\top,\qquad V_{k+1}=(1-\eta_k\lambda)V_k-\eta_kU_k^\top G_k.$$
Let
$$c_k=1-\eta_k\lambda.$$
###### Proposition 4 (Discrete weight decay contracts imbalance up to finite-step leakage).
For the coupled discrete update above,
$$D_{k+1}=c_k^2D_k+\eta_k^2\left(V_kG_k^\top G_kV_k^\top-U_k^\top G_kG_k^\top U_k\right).$$
###### Proof.
Expand
$$U_{k+1}^\top U_{k+1}=c_k^2U_k^\top U_k-c_k\eta_k\left(V_kG_k^\top U_k+U_k^\top G_kV_k^\top\right)+\eta_k^2V_kG_k^\top G_kV_k^\top.$$
Similarly,
$$V_{k+1}V_{k+1}^\top=c_k^2V_kV_k^\top-c_k\eta_k\left(U_k^\top G_kV_k^\top+V_kG_k^\top U_k\right)+\eta_k^2U_k^\top G_kG_k^\top U_k.$$
The first-order terms are identical and cancel after subtraction. Hence
$$\begin{aligned}
D_{k+1}&=U_{k+1}^\top U_{k+1}-V_{k+1}V_{k+1}^\top\\
&=c_k^2\left(U_k^\top U_k-V_kV_k^\top\right)+\eta_k^2\left(V_kG_k^\top G_kV_k^\top-U_k^\top G_kG_k^\top U_k\right),
\end{aligned}$$
which is the stated identity. ∎
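The identity in Proposition 4 is exact for any matrices of compatible shape, so it can be verified on random inputs. In the NumPy sketch below, the dimensions, seed, step size, and decay coefficient are arbitrary illustrative choices, and `G` is a random stand-in for the product-space gradient estimate $G_k$.

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(size=(4, 3))
V = rng.normal(size=(3, 5))
G = rng.normal(size=(4, 5))   # arbitrary product-space gradient estimate G_k
eta, lam = 0.05, 0.1
c = 1.0 - eta * lam

def imbalance(U, V):
    """D(U, V) = U^T U - V V^T."""
    return U.T @ U - V @ V.T

# One coupled step; both factors are updated from the same old U, V, and G.
U_next = c * U - eta * G @ V.T
V_next = c * V - eta * U.T @ G

leak = eta**2 * (V @ G.T @ G @ V.T - U.T @ G @ G.T @ U)
identity_holds = np.allclose(
    imbalance(U_next, V_next), c**2 * imbalance(U, V) + leak
)
print(identity_holds)
```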
Ignoring the second-order finite-step leakage, the multiplicative decay over $K$ updates is
$$\prod_{k<K}(1-\eta_k\lambda)^2.$$
When $\eta_k\lambda\ll 1$, this is approximated by
$$\prod_{k<K}(1-\eta_k\lambda)^2\approx\exp\!\left(-2\lambda\sum_{k<K}\eta_k\right).$$
This is the discrete counterpart of the continuous-time clock $\mathcal{T}_{L_2}=\lambda\sum_k\eta_k$.
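The quality of this approximation can be checked numerically for a cosine schedule of the form used in Appendix A. The values of `K`, `eta0`, and `lam` below are illustrative assumptions and do not correspond to a specific run in the paper; the helper `cosine_lr` is a hypothetical name.

```python
import math

def cosine_lr(k, total_steps, eta0, alpha=0.01):
    """Cosine schedule eta0 * [alpha + (1 - alpha) * 0.5 * (1 + cos(pi k / K))]."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * k / total_steps))
    return eta0 * (alpha + (1.0 - alpha) * cos_term)

K, eta0, lam = 10_000, 0.1, 5e-4   # illustrative values only
etas = [cosine_lr(k, K, eta0) for k in range(K)]

exact = math.prod((1.0 - eta * lam) ** 2 for eta in etas)  # exact contraction factor
clock = lam * sum(etas)                                    # T_{L2} = lambda * sum_k eta_k
approx = math.exp(-2.0 * clock)                            # continuous-time prediction
print(clock, exact, approx)
```

Because $\eta_k\lambda$ is at most $5\times 10^{-5}$ here, the exact product and the exponential agree closely; the gap grows as $\eta_k\lambda$ approaches order one.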
#### Finite\-step Euclidean updates leak imbalance at second order\.
Without explicit $L_2$ decay, simultaneous Euclidean gradient updates do not preserve imbalance exactly. However, the failure of conservation starts only at second order in the step size.
###### Proposition 5 (Two-factor finite-step leakage is second order).
For the simultaneous update
$$U_{k+1}=U_k-\eta_kG_kV_k^\top,\qquad V_{k+1}=V_k-\eta_kU_k^\top G_k,$$
one has the exact identity
$$D_{k+1}-D_k=\eta_k^2\left(V_kG_k^\top G_kV_k^\top-U_k^\top G_kG_k^\top U_k\right).$$
In particular, all first-order terms in $\eta_k$ cancel exactly.
###### Proof.
This is the previous proposition with $\lambda=0$, equivalently $c_k=1$. Expanding directly,
$$U_{k+1}^\top U_{k+1}=U_k^\top U_k-\eta_k\left(V_kG_k^\top U_k+U_k^\top G_kV_k^\top\right)+\eta_k^2V_kG_k^\top G_kV_k^\top,$$
and
$$V_{k+1}V_{k+1}^\top=V_kV_k^\top-\eta_k\left(U_k^\top G_kV_k^\top+V_kG_k^\top U_k\right)+\eta_k^2U_k^\top G_kG_k^\top U_k.$$
The first-order terms are identical. Subtracting leaves exactly the second-order expression. ∎
This calculation proves an order statement, not a monotonicity statement: finite steps can move the conserved quantity at order $\eta_k^2$, but the sign of the movement depends on the current factors and gradient.
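One way to see the second-order scaling is to take the same step at two step sizes: since the change in $D$ is exactly $\eta_k^2$ times a matrix that does not depend on $\eta_k$, halving the step size should divide the change by four. The NumPy sketch below checks this on random matrices; the shapes, seed, and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.normal(size=(4, 3))
V = rng.normal(size=(3, 5))
G = rng.normal(size=(4, 5))   # arbitrary product-space gradient estimate

def imbalance_change(eta):
    """D_{k+1} - D_k after one simultaneous step of size eta (no weight decay)."""
    U_next = U - eta * G @ V.T
    V_next = V - eta * U.T @ G
    return (U_next.T @ U_next - V_next @ V_next.T) - (U.T @ U - V @ V.T)

delta_full = imbalance_change(0.02)
delta_half = imbalance_change(0.01)

# Pure second-order dependence: halving eta divides the change by four.
quadratic_scaling = np.allclose(delta_full, 4.0 * delta_half)
print(quadratic_scaling)
```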
#### Deep linear gradient flow preserves all adjacent imbalances\.
The same conservation law holds at every hidden layer of a deep linear network. Let
$$F(W_1,\ldots,W_L)=W_LW_{L-1}\cdots W_1,\qquad \mathcal{L}(W_1,\ldots,W_L)=\phi(F),$$
where
$$W_j\in\mathbb{R}^{d_j\times d_{j-1}},\qquad j=1,\ldots,L.$$
For $j=1,\ldots,L-1$, define the adjacent-layer imbalance
$$D_j=W_jW_j^\top-W_{j+1}^\top W_{j+1}\in\mathbb{R}^{d_j\times d_j}.$$
###### Proposition 6 (Deep linear gradient flow preserves adjacent imbalances).
Assume $\phi$ is differentiable. Along Euclidean gradient flow
$$\dot{W}_j=-\nabla_{W_j}\mathcal{L},\qquad j=1,\ldots,L,$$
one has
$$\frac{d}{dt}D_j(t)=0,\qquad j=1,\ldots,L-1.$$
With coupled continuous-time weight decay,
$$\dot{W}_j=-\nabla_{W_j}\mathcal{L}-\lambda W_j,$$
the imbalances satisfy
$$\frac{d}{dt}D_j(t)=-2\lambda D_j(t),\qquad D_j(t)=e^{-2\lambda t}D_j(0).$$
###### Proof.
For each layer $j$, define
$$A_j=W_LW_{L-1}\cdots W_{j+1},\qquad B_j=W_{j-1}\cdots W_1,$$
with the convention that empty products are identity matrices. Then
$$F=A_jW_jB_j.$$
Let
$$G=\nabla_F\phi(F).$$
By the chain rule,
$$\nabla_{W_j}\mathcal{L}=A_j^\top GB_j^\top.$$
We use the identities
$$A_j=A_{j+1}W_{j+1},\qquad B_{j+1}=W_jB_j.$$
Let
$$H_j=\nabla_{W_j}\mathcal{L}=A_j^\top GB_j^\top.$$
Then
$$\begin{aligned}
H_jW_j^\top&=A_j^\top GB_j^\top W_j^\top\\
&=W_{j+1}^\top A_{j+1}^\top GB_{j+1}^\top\\
&=W_{j+1}^\top H_{j+1},
\end{aligned}$$
and
$$\begin{aligned}
W_jH_j^\top&=W_jB_jG^\top A_j\\
&=B_{j+1}G^\top A_{j+1}W_{j+1}\\
&=H_{j+1}^\top W_{j+1}.
\end{aligned}$$
Therefore, under gradient flow,
$$\begin{aligned}
\frac{d}{dt}(W_jW_j^\top)&=-H_jW_j^\top-W_jH_j^\top\\
&=-W_{j+1}^\top H_{j+1}-H_{j+1}^\top W_{j+1}.
\end{aligned}$$
On the other hand,
$$\frac{d}{dt}(W_{j+1}^\top W_{j+1})=-H_{j+1}^\top W_{j+1}-W_{j+1}^\top H_{j+1}.$$
The two derivatives are identical, so
$$\frac{d}{dt}\left(W_jW_j^\top-W_{j+1}^\top W_{j+1}\right)=0.$$
This proves conservation.
With weight decay, the gradient-dependent terms are unchanged and still cancel. The additional terms are
$$-2\lambda W_jW_j^\top\qquad\text{and}\qquad-2\lambda W_{j+1}^\top W_{j+1},$$
so
$$\dot{D}_j=-2\lambda\left(W_jW_j^\top-W_{j+1}^\top W_{j+1}\right)=-2\lambda D_j.$$
Solving gives $D_j(t)=e^{-2\lambda t}D_j(0)$. ∎
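The layerwise cancellation can also be checked numerically. The NumPy sketch below builds a four-layer linear network with random weights, forms $H_j=A_j^\top GB_j^\top$ from a random stand-in for $G$, and evaluates the gradient-flow rate of change of every adjacent imbalance. The layer widths, seed, and helper names `chain` and `layer_grad` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
dims = [5, 4, 3, 4, 2]                      # d_0, ..., d_L with L = 4 layers
L = len(dims) - 1
W = [rng.normal(size=(dims[j + 1], dims[j])) for j in range(L)]  # W[0] is W_1
G = rng.normal(size=(dims[-1], dims[0]))    # stand-in for the gradient of phi at F

def chain(mats, size):
    """Product mats[-1] @ ... @ mats[0]; identity of the given size if empty."""
    out = np.eye(size)
    for M in mats:
        out = M @ out
    return out

def layer_grad(j):
    """H_j = A_j^T G B_j^T for the 1-indexed layer j."""
    A = chain(W[j:], dims[j])          # A_j = W_L ... W_{j+1}
    B = chain(W[:j - 1], dims[0])      # B_j = W_{j-1} ... W_1
    return A.T @ G @ B.T

H = [layer_grad(j) for j in range(1, L + 1)]  # H[0] is H_1

max_rate = 0.0
for i in range(L - 1):  # 0-based index i pairs layers i+1 and i+2
    d_lower = -(H[i] @ W[i].T + W[i] @ H[i].T)                  # d/dt of W_j W_j^T
    d_upper = -(H[i + 1].T @ W[i + 1] + W[i + 1].T @ H[i + 1])  # d/dt of W_{j+1}^T W_{j+1}
    max_rate = max(max_rate, float(np.abs(d_lower - d_upper).max()))
print(max_rate)  # zero up to floating-point rounding
```

The check relies on all layer gradients being induced by the same $G$; replacing each $H_j$ with an independent random matrix would generally break the cancellation.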
#### Finite\-step leakage in deep linear networks\.
The deep\-linear finite\-step calculation is the discrete analogue of the previous conservation law\. The key point is that the layerwise gradient estimates must be induced by the same product\-space gradient estimate\.
For a product\-space gradient estimateGkG\_\{k\}, define
Aj,k=WL,k⋯Wj\+1,k,Bj,k=Wj−1,k⋯W1,k,A\_\{j,k\}=W\_\{L,k\}\\cdots W\_\{j\+1,k\},\\qquad B\_\{j,k\}=W\_\{j\-1,k\}\\cdots W\_\{1,k\},and
Hj,k=Aj,k⊤GkBj,k⊤\.H\_\{j,k\}=A\_\{j,k\}^\{\\top\}G\_\{k\}B\_\{j,k\}^\{\\top\}\.ThusHj,kH\_\{j,k\}is the layer\-jjgradient estimate induced byGkG\_\{k\}\.
###### Proposition 7 \(Deep finite\-step leakage is second order\)\.
Consider simultaneous Euclidean updates
$$W_{j,k+1} = W_{j,k} - \eta_k H_{j,k}, \qquad j = 1,\ldots,L.$$
Then, for $j = 1,\ldots,L-1$,
$$D_{j,k+1} - D_{j,k} = \eta_k^2\left(H_{j,k}H_{j,k}^\top - H_{j+1,k}^\top H_{j+1,k}\right).$$
In particular, all first\-order terms in $\eta_k$ cancel exactly\.
###### Proof\.
Expanding the first term in $D_{j,k+1}$,
$$\begin{aligned}
W_{j,k+1}W_{j,k+1}^\top &= W_{j,k}W_{j,k}^\top - \eta_k\left(H_{j,k}W_{j,k}^\top + W_{j,k}H_{j,k}^\top\right) \\
&\qquad + \eta_k^2 H_{j,k}H_{j,k}^\top.
\end{aligned}$$
Expanding the second term,
$$\begin{aligned}
W_{j+1,k+1}^\top W_{j+1,k+1} &= W_{j+1,k}^\top W_{j+1,k} \\
&\quad - \eta_k\left(H_{j+1,k}^\top W_{j+1,k} + W_{j+1,k}^\top H_{j+1,k}\right) \\
&\quad + \eta_k^2 H_{j+1,k}^\top H_{j+1,k}.
\end{aligned}$$
It remains to verify that the first\-order terms are identical\. Since
$$H_{j,k} = A_{j,k}^\top G_k B_{j,k}^\top,$$
and
$$A_{j,k} = A_{j+1,k}W_{j+1,k}, \qquad B_{j+1,k} = W_{j,k}B_{j,k},$$
we have
$$\begin{aligned}
H_{j,k}W_{j,k}^\top &= A_{j,k}^\top G_k B_{j,k}^\top W_{j,k}^\top \\
&= W_{j+1,k}^\top A_{j+1,k}^\top G_k B_{j+1,k}^\top \\
&= W_{j+1,k}^\top H_{j+1,k},
\end{aligned}$$
and
$$\begin{aligned}
W_{j,k}H_{j,k}^\top &= W_{j,k}B_{j,k}G_k^\top A_{j,k} \\
&= B_{j+1,k}G_k^\top A_{j+1,k}W_{j+1,k} \\
&= H_{j+1,k}^\top W_{j+1,k}.
\end{aligned}$$
Therefore the entire first\-order contribution in the expansion of $W_{j,k+1}W_{j,k+1}^\top$ is identical to the first\-order contribution in the expansion of $W_{j+1,k+1}^\top W_{j+1,k+1}$\. Subtracting the two expansions cancels all first\-order terms and leaves exactly
$$D_{j,k+1} - D_{j,k} = \eta_k^2\left(H_{j,k}H_{j,k}^\top - H_{j+1,k}^\top H_{j+1,k}\right).$$
∎
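The identity in Proposition 7 is purely algebraic, so it can be checked numerically on a random deep linear network. The sketch below is an illustration under assumed layer widths, step size, and an arbitrary random matrix `G` standing in for the product-space gradient estimate; the helper names `layer_grads` and `imbalances` are hypothetical.

```python
import numpy as np

def layer_grads(Ws, G):
    """Return H_j = A_j^T G B_j^T with A_j = W_L...W_{j+1}, B_j = W_{j-1}...W_1."""
    Hs = []
    for j in range(len(Ws)):
        A = np.eye(Ws[-1].shape[0])
        for W in reversed(Ws[j + 1:]):   # W_L first, W_{j+1} last
            A = A @ W
        B = np.eye(Ws[0].shape[1])
        for W in Ws[:j]:                 # W_1 first, W_{j-1} last
            B = W @ B
        Hs.append(A.T @ G @ B.T)
    return Hs

def imbalances(Ws):
    """Return D_j = W_j W_j^T - W_{j+1}^T W_{j+1} for adjacent layers."""
    return [Ws[j] @ Ws[j].T - Ws[j + 1].T @ Ws[j + 1] for j in range(len(Ws) - 1)]

rng = np.random.default_rng(0)
dims = [4, 5, 3, 6, 2]                   # assumed widths d_0, ..., d_L
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
G = rng.standard_normal((dims[-1], dims[0]))  # arbitrary gradient estimate
eta = 0.05

Hs = layer_grads(Ws, G)
Ws_next = [W - eta * H for W, H in zip(Ws, Hs)]  # simultaneous Euclidean update

lhs = [Dn - D for Dn, D in zip(imbalances(Ws_next), imbalances(Ws))]
rhs = [eta**2 * (Hs[j] @ Hs[j].T - Hs[j + 1].T @ Hs[j + 1])
       for j in range(len(Ws) - 1)]
max_err = max(np.abs(l - r).max() for l, r in zip(lhs, rhs))
print(max_err)  # floating-point roundoff only
```

Because `G` is arbitrary here, the check does not depend on any particular loss, which mirrors the statement that the identity holds for any product-space gradient estimate.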
#### The minibatch\-dependent leakage clock\.
The previous proposition is deterministic: it holds for any product\-space gradient estimate $G_k$\. To isolate the batch\-size\-dependent part, write
$$G_k = \bar{G}_k + \xi_k, \qquad \mathbb{E}_k[\xi_k] = 0, \qquad \mathbb{E}_k\|\xi_k\|_F^2 = O(1/b).$$
Define
$$\bar{H}_{j,k} = A_{j,k}^\top \bar{G}_k B_{j,k}^\top, \qquad \Delta_{j,k} = A_{j,k}^\top \xi_k B_{j,k}^\top.$$
Then
$$H_{j,k} = \bar{H}_{j,k} + \Delta_{j,k}, \qquad \mathbb{E}_k[\Delta_{j,k}] = 0.$$
Taking conditional expectation in the finite\-step leakage identity gives
$$\begin{aligned}
\mathbb{E}_k[D_{j,k+1} - D_{j,k}] &= \eta_k^2\left(\bar{H}_{j,k}\bar{H}_{j,k}^\top - \bar{H}_{j+1,k}^\top \bar{H}_{j+1,k}\right) \\
&\quad + \eta_k^2\,\mathbb{E}_k\left[\Delta_{j,k}\Delta_{j,k}^\top - \Delta_{j+1,k}^\top \Delta_{j+1,k}\right].
\end{aligned}$$
The first line is the deterministic full\-batch finite\-step leakage\. It is second order in $\eta_k$, but it is not batch\-size dependent\. The second line is the minibatch\-dependent stochastic correction\.
On any portion of the trajectory where the operator norms of the factors are bounded, the assumption $\mathbb{E}_k\|\xi_k\|_F^2 = O(1/b)$ implies
$$\mathbb{E}_k\|\Delta_{j,k}\|_F^2 = O(1/b)$$
for every fixed layer $j$\. Therefore the batch\-dependent part of the conditional expected leakage is $O(\eta_k^2/b)$ per update\. Summing over updates gives the stochastic finite\-step clock
$$\mathcal{T}_{\mathrm{SGD}} = \frac{1}{b}\sum_{k<K}\eta_k^2 = \frac{K\eta^2}{b} \quad\text{for constant $\eta$}.$$
This clock measures the cumulative size of the minibatch\-dependent second\-order correction\. It is an order statement; it does not assert that the correction has a fixed sign\.
#### Generic preconditioning creates a first\-order imbalance term\.
Adaptive methods apply a preconditioned update rather than the Euclidean gradient update\. The following calculation shows exactly where the first\-order conservation cancellation can fail\.
Let
$$Q_{j,k} = P_{j,k}(H_{j,k}),$$
where $P_{j,k}$ is a layerwise or coordinatewise preconditioning map, possibly depending on the current iterate and optimizer state\. Consider updates
$$W_{j,k+1} = W_{j,k} - \eta_k Q_{j,k}.$$
###### Proposition 8 \(Preconditioning creates a first\-order imbalance term\)\.
For the preconditioned update above,
$$\begin{aligned}
D_{j,k+1} - D_{j,k} &= -\eta_k\Big[Q_{j,k}W_{j,k}^\top + W_{j,k}Q_{j,k}^\top - Q_{j+1,k}^\top W_{j+1,k} - W_{j+1,k}^\top Q_{j+1,k}\Big] \\
&\quad + \eta_k^2\left[Q_{j,k}Q_{j,k}^\top - Q_{j+1,k}^\top Q_{j+1,k}\right].
\end{aligned}$$
For Euclidean gradient descent, $Q_{j,k} = H_{j,k}$, and the first\-order bracket vanishes by the chain\-rule identity proved above\. For a general adaptive preconditioner, $Q_{j,k}$ need not satisfy that identity, so imbalance can change at order $\eta_k$ per update\.
###### Proof\.
Expanding the first adjacent Gram matrix,
$$\begin{aligned}
W_{j,k+1}W_{j,k+1}^\top &= W_{j,k}W_{j,k}^\top - \eta_k\left(Q_{j,k}W_{j,k}^\top + W_{j,k}Q_{j,k}^\top\right) \\
&\qquad + \eta_k^2 Q_{j,k}Q_{j,k}^\top.
\end{aligned}$$
Expanding the second adjacent Gram matrix,
$$\begin{aligned}
W_{j+1,k+1}^\top W_{j+1,k+1} &= W_{j+1,k}^\top W_{j+1,k} - \eta_k\left(Q_{j+1,k}^\top W_{j+1,k} + W_{j+1,k}^\top Q_{j+1,k}\right) \\
&\qquad + \eta_k^2 Q_{j+1,k}^\top Q_{j+1,k}.
\end{aligned}$$
Subtracting the second expansion from the first gives the stated identity\. If $Q_{j,k} = H_{j,k}$, the first\-order bracket is zero by the identities
$$H_{j,k}W_{j,k}^\top = W_{j+1,k}^\top H_{j+1,k}, \qquad W_{j,k}H_{j,k}^\top = H_{j+1,k}^\top W_{j+1,k}$$
proved in the finite\-step Euclidean case\. ∎
This calculation proves that the order of imbalance movement can change from second order to first order under preconditioning\. It does not claim that every adaptive optimizer monotonically contracts imbalance; rather, it identifies the algebraic cancellation that Euclidean gradient flow and Euclidean finite\-step updates enjoy, and shows that generic preconditioning need not preserve it\.
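To make the loss of cancellation concrete, the sketch below evaluates the first-order bracket of Proposition 8 on a random two-layer linear network, once with the Euclidean choice $Q = H$ and once with an elementwise sign map. The sign map is only an assumed, crude stand-in for coordinatewise adaptive scaling; it is not Adam, and the dimensions and random seed are arbitrary.

```python
import numpy as np

def first_order_bracket(W1, W2, Q1, Q2):
    """Bracket Q_1 W_1^T + W_1 Q_1^T - Q_2^T W_2 - W_2^T Q_2 for layers 1 and 2."""
    return Q1 @ W1.T + W1 @ Q1.T - Q2.T @ W2 - W2.T @ Q2

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 3))   # first layer, d_1 x d_0
W2 = rng.standard_normal((2, 4))   # second layer, d_2 x d_1
G = rng.standard_normal((2, 3))    # arbitrary product-space gradient estimate

# Layerwise Euclidean gradients for L = 2: H_1 = W_2^T G, H_2 = G W_1^T.
H1, H2 = W2.T @ G, G @ W1.T

# Euclidean updates: the bracket cancels up to roundoff.
euclid = np.linalg.norm(first_order_bracket(W1, W2, H1, H2))

# Sign preconditioning (assumption): the cancellation is generically lost.
signed = np.linalg.norm(first_order_bracket(W1, W2, np.sign(H1), np.sign(H2)))

print(euclid, signed)
```

The first printed norm should be at roundoff level while the second is generically of order one, which is the first-order imbalance term the proposition isolates. As in the text, this says nothing about the sign or monotonicity of the resulting drift.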
Figure 9: Optimizer trajectories on the scalar loss $(ab-1)^2$, initialization $(a_0,b_0)=(1,6)$\. All four panels use the same two learning rates $\eta\in\{0.01, 0.04\}$ \(red and orange, respectively\); only the optimizer changes\. The black curve is the manifold of global minima $ab=1$; magenta stars mark the minimum\-norm solutions $(\pm 1,\pm 1)$\. \(a\) Gradient descent \(20 steps\): the small\-$\eta$ trajectory stays near the initialization \($D\approx -35$, close to gradient\-flow conservation\), while the large\-$\eta$ trajectory moves toward lower norm \($D\approx -3$\), illustrating the $O(\eta^2 K)$ finite\-step leakage\. \(b\) SGD with cyclic minibatch targets $\{0,1,2\}$ \(up to $50{,}000$ steps\): both learning rates converge to the minimum\-norm solution \($D\approx 0$\), consistent with the $\mathcal{T}_{\mathrm{SGD}}=K\eta^2/b$ clock\. \(c\) Full\-batch Adam \($5{,}000$ steps\): the preconditioner moves $D$ from $-35$ toward $\approx -27$, confirming first\-order $O(\eta)$ movement of the imbalance\. However, convergence stalls because the full\-batch gradient $g=2(ab-1)\to 0$ at the manifold: the numerator of the Adam update goes to $0$ while its denominator is bounded below by $\epsilon>0$, so the $O(\eta)$ coefficient vanishes and $D$ freezes at a nonzero value\. \(d\) Minibatch Adam \(up to $200{,}000$ steps\): adding stochasticity restores forgetting, and both learning rates reach $D\approx 0$\. Individual minibatch gradients remain nonzero at $ab=1$, keeping the adaptive clock active\. In real BatchNorm networks, scale invariance prevents the vanishing\-gradient stalling seen in \(c\); see the experimental results in Section [4](https://arxiv.org/html/2605.29152#S4)\.
#### Scalar example of first\-order breaking\.
The first\-order effect is already visible in the scalar two\-factor model\. Let $p = ab$ and $g = \phi'(ab)$\. The Euclidean gradients are
$$h_a = g b, \qquad h_b = g a.$$
Consider preconditioned updates
$$a_{k+1} = a_k - \eta\alpha_k g b_k, \qquad b_{k+1} = b_k - \eta\beta_k g a_k,$$
where $\alpha_k, \beta_k > 0$\. Then
$$\begin{aligned}
D_{k+1} - D_k &= a_{k+1}^2 - b_{k+1}^2 - (a_k^2 - b_k^2) \\
&= -2\eta g a_k b_k(\alpha_k - \beta_k) + \eta^2 g^2\left(\alpha_k^2 b_k^2 - \beta_k^2 a_k^2\right).
\end{aligned}$$
Thus, whenever
$$g a_k b_k(\alpha_k - \beta_k) \neq 0,$$
the imbalance changes at first order in $\eta$\. Euclidean gradient descent corresponds to $\alpha_k = \beta_k = 1$, in which case the first\-order term vanishes\.
This scalar example justifies the adaptive clock
$$\mathcal{T}_{\mathrm{adapt}} = \sum_{k<K}\eta_k = K\eta \quad\text{for constant $\eta$}$$
as the natural scale on which preconditioned updates can move initialization\-dependent imbalance\.
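The scalar identity can be verified in a few lines. The sketch below takes one preconditioned step on the loss $(ab-1)^2$ from the initialization $(1,6)$ and compares the observed change in $D = a^2 - b^2$ with the first- and second-order terms of the formula. The scalings `alpha` and `beta` are hypothetical constants chosen for illustration, not values produced by any particular adaptive optimizer.

```python
def preconditioned_step(a, b, eta, alpha, beta):
    """One step of a_{k+1} = a - eta*alpha*g*b, b_{k+1} = b - eta*beta*g*a."""
    g = 2.0 * (a * b - 1.0)  # phi'(ab) for phi(p) = (p - 1)^2
    return a - eta * alpha * g * b, b - eta * beta * g * a, g

a, b, eta = 1.0, 6.0, 0.01
alpha, beta = 0.5, 2.0       # hypothetical per-coordinate scalings

a1, b1, g = preconditioned_step(a, b, eta, alpha, beta)
observed = (a1**2 - b1**2) - (a**2 - b**2)

# Terms of the closed-form expression for D_{k+1} - D_k.
first_order = -2 * eta * g * a * b * (alpha - beta)
second_order = eta**2 * g**2 * (alpha**2 * b**2 - beta**2 * a**2)

# With alpha = beta = 1 (Euclidean gradient descent) the first-order term is zero.
euclid_first_order = -2 * eta * g * a * b * (1.0 - 1.0)

print(observed, first_order, second_order)
```

For these assumed values the first-order term is 1.8 and the second-order term is 0.05, so the unequal scalings dominate the change in imbalance even at a small step size.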
#### Summary\.
In this minimal homogeneous model, Euclidean gradient flow exactly conserves initialization\-dependent imbalances\. In the scalar two\-factor case, the conserved imbalance directly fixes the final factor norm among solutions with the same product, giving a precise notion of radial initialization memory\. Coupled $L_2$ decay contracts the imbalance on the clock
$$\mathcal{T}_{L_2} = \lambda\sum_{k<K}\eta_k.$$
Euclidean finite\-step updates break the conservation law only at second order; the minibatch\-dependent part of the conditional expected leakage accumulates on the clock
$$\mathcal{T}_{\mathrm{SGD}} = \frac{1}{b}\sum_{k<K}\eta_k^2.$$
Generic adaptive preconditioning can break the conservation law at first order, giving the movement clock
$$\mathcal{T}_{\mathrm{adapt}} = \sum_{k<K}\eta_k.$$
These calculations are not a theorem for nonlinear BatchNorm ResNets: they do not prove monotone forgetting, nor do they determine test accuracy\. Rather, they identify the conservation law and the optimizer\-dependent orders at which different training mechanisms can move or contract initialization\-dependent quantities\. This mechanism matches the empirical hierarchy observed in Section [4](https://arxiv.org/html/2605.29152#S4): gradient\-flow\-like low\-LR SGD remembers initialization, increasing the learning rate or adding explicit $L_2$ can reduce memory by enlarging the relevant clocks, and adaptive methods erase initialization\-scale dependence much faster in the diagnostic grid\.
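The three clocks are simple functionals of the learning-rate schedule, so they can be compared side by side for a given training configuration. The helper below is a minimal sketch; the function name `forgetting_clocks` and the example hyperparameters are illustrative assumptions and are not taken from the experiments.

```python
def forgetting_clocks(etas, batch_size, weight_decay):
    """Compute the three clocks for a per-step learning-rate schedule `etas`."""
    t_adapt = sum(etas)                              # sum_k eta_k
    t_sgd = sum(e * e for e in etas) / batch_size    # (1/b) sum_k eta_k^2
    t_l2 = weight_decay * t_adapt                    # lambda * sum_k eta_k
    return {"sgd": t_sgd, "adapt": t_adapt, "l2": t_l2}

# Hypothetical run: K = 10,000 steps at constant eta = 0.01, b = 128, lambda = 5e-4.
clocks = forgetting_clocks([0.01] * 10_000, batch_size=128, weight_decay=5e-4)
print(clocks)
```

For a constant learning rate the ratio of the adaptive clock to the SGD clock is $b/\eta$, which gives a rough sense of why the first-order mechanism can act on a much shorter time scale. As the text stresses, these are order-of-magnitude scales, not predictions of test accuracy.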
## Appendix H Further Related Work
#### Purpose of this appendix\.
The main text cites only the papers needed to motivate the central definitions and claims\. This appendix gives the broader historical map\. The question “why do neural networks not overfit?” has been approached through several partially overlapping mechanisms: classical capacity control, margins and norms, flatness, PAC\-Bayes and compression, interpolation and benign overfitting, function\-space priors, spectral and geometric simplicity, stable signal propagation at initialization, and optimizer\-dependent implicit bias\. Our paper sits at the intersection of two views\. On one hand, random architectures and initialization schemes define nontrivial priors over functions\. On the other hand, the training pipeline transforms, preserves, or erases those priors\. Initialization memory is our proposed diagnostic for measuring how much of the initial prior survives training\.
#### The common thread\.
The literature can be read as a progression from static explanations to dynamical explanations:
$$\text{hypothesis class}\quad\to\quad\text{trained predictor}\quad\to\quad\text{training pipeline}\quad\to\quad\text{memory of initialization}.$$
Classical theory controls the size of the class; post\-hoc complexity measures study the final predictor; implicit\-bias work studies the optimizer’s selection rule; simplicity\-bias work studies the prior at initialization\. Our contribution is to connect the last two: we ask whether the initialization\-induced prior remains visible after optimization\.
### H\.1 From capacity to pipeline\-dependent generalization
#### Classical complexity control\.
Classical learning theory explains generalization by controlling effective capacity\. VC theory and statistical learning theory formalize uniform convergence in terms of hypothesis\-class complexity\[[43](https://arxiv.org/html/2605.29152#bib.bib43),[44](https://arxiv.org/html/2605.29152#bib.bib44)\]\. Later refinements replaced raw parameter counting by data\- and norm\-dependent quantities, including margins, Rademacher complexities, and weight norms\[[45](https://arxiv.org/html/2605.29152#bib.bib45),[46](https://arxiv.org/html/2605.29152#bib.bib46),[47](https://arxiv.org/html/2605.29152#bib.bib47),[48](https://arxiv.org/html/2605.29152#bib.bib48)\]\. This line is historically essential because it already separates the number of trainable parameters from the actual complexity of the learned predictor\. However, worst\-case capacity control alone does not explain why highly overparameterized networks can fit random labels yet generalize well on natural labels\.
#### The random\-label challenge\.
The modern crisis point was the observation that standard deep networks can interpolate both natural labels and random labels under essentially the same architecture and optimization machinery\[[49](https://arxiv.org/html/2605.29152#bib.bib49),[1](https://arxiv.org/html/2605.29152#bib.bib1)\]\. This shifted attention away from the hypothesis class alone and toward the complete pipeline that selects one interpolating solution among many\. Follow\-up work showed that deep networks tend to learn simple or structured patterns before memorizing idiosyncratic noise\[[50](https://arxiv.org/html/2605.29152#bib.bib50),[51](https://arxiv.org/html/2605.29152#bib.bib51)\]\. The relevant object is therefore not merely the set of functions representable by the architecture, but the algorithmic process by which training selects one of them\.
#### Stability and training time\.
Algorithmic stability gives one route from optimization to generalization\. \[[52](https://arxiv.org/html/2605.29152#bib.bib52)\] showed that stochastic gradient methods with controlled numbers of steps can be stable, giving a formal sense in which the trajectory itself participates in capacity control\. This perspective is especially relevant for our work because initialization memory is also trajectory\-dependent: a fixed architecture may either preserve or erase initialization scale depending on update count, batch size, learning rate, and regularization\.
### H\.2 Post\-hoc explanations: norms, margins, flatness, PAC\-Bayes, and compression
#### Norm and margin explanations\.
A large body of work attempts to explain generalization by measuring properties of the trained predictor rather than the raw hypothesis class\. Norm\-based and margin\-based bounds for neural networks include path\-norm, spectral\-norm, and margin\-normalized quantities\[[47](https://arxiv.org/html/2605.29152#bib.bib47),[53](https://arxiv.org/html/2605.29152#bib.bib53),[48](https://arxiv.org/html/2605.29152#bib.bib48),[54](https://arxiv.org/html/2605.29152#bib.bib54),[55](https://arxiv.org/html/2605.29152#bib.bib55)\]\. Large\-scale empirical studies have compared many proposed complexity measures and found that some correlate with generalization better than others, while many fail outside narrow settings\[[56](https://arxiv.org/html/2605.29152#bib.bib56)\]\. These works are complementary to ours: they ask how to measure the complexity of the final predictor, whereas we ask how much the final predictor still remembers initialization\.
#### Flatness and sharpness\.
The flat\-minima view dates back to \[[57](https://arxiv.org/html/2605.29152#bib.bib57)\] and became central again in the large\-batch generalization debate\[[58](https://arxiv.org/html/2605.29152#bib.bib58)\]\. Because sharpness is not invariant to parameter rescaling, later work showed that naive sharpness measures can be misleading in deep networks\[[59](https://arxiv.org/html/2605.29152#bib.bib59)\]\. Sharpness\-aware minimization was subsequently proposed as an explicit algorithmic route to flatter solutions\[[60](https://arxiv.org/html/2605.29152#bib.bib60)\]\. In our setting, BatchNorm scale invariance makes this issue particularly relevant: rescaling weights can leave the represented function nearly unchanged while substantially changing the optimization geometry\.
#### PAC\-Bayes and compression\.
PAC\-Bayes gives another way to turn algorithmic or posterior concentration into generalization statements\[[61](https://arxiv.org/html/2605.29152#bib.bib61),[62](https://arxiv.org/html/2605.29152#bib.bib62),[54](https://arxiv.org/html/2605.29152#bib.bib54),[63](https://arxiv.org/html/2605.29152#bib.bib63)\]\. Compression\-based analyses similarly argue that trained networks generalize when they can be compressed without a large loss of performance\[[64](https://arxiv.org/html/2605.29152#bib.bib64)\]\. These approaches support a broad message: overparameterization is not itself fatal if the training procedure selects a small or compressible subset of effective functions\. The initialization\-memory viewpoint adds a dynamical question: whether the selected subset is still determined by the initial function prior, or whether training has overwritten it\.
### H\.3 Interpolation, double descent, and benign overfitting
#### Interpolation is not necessarily overfitting\.
The double\-descent literature recast interpolation as a regime to be understood rather than automatically avoided\. \[[65](https://arxiv.org/html/2605.29152#bib.bib65)\] proposed that the classical bias–variance curve extends into a second descent beyond the interpolation threshold, and \[[66](https://arxiv.org/html/2605.29152#bib.bib66)\] demonstrated double descent as a function of model size, data size, and training time in modern deep learning systems\. Benign\-overfitting theory shows that, even in simple linear models, interpolating predictors can generalize under appropriate spectral conditions on the data distribution\[[67](https://arxiv.org/html/2605.29152#bib.bib67)\]\. Our experiments refine this picture for deep networks: interpolation of the training set is not the same as forgetting initialization\. Low\-learning\-rate SGD can interpolate while retaining large initialization memory\.
### H\.4 Function\-space priors and simplicity bias at initialization
#### Bayesian and infinite\-width priors\.
The idea that random neural networks define a nontrivial prior over functions predates the current simplicity\-bias literature\. \[[68](https://arxiv.org/html/2605.29152#bib.bib68)\] studied Bayesian neural networks and the connection between random networks and Gaussian processes; \[[69](https://arxiv.org/html/2605.29152#bib.bib69)\] analyzed computation with infinite neural networks; later work made Gaussian\-process and neural\-tangent\-kernel limits central tools for analyzing wide networks\[[70](https://arxiv.org/html/2605.29152#bib.bib70),[71](https://arxiv.org/html/2605.29152#bib.bib71)\]\. These function\-space limits show that architecture and initialization define a prior over functions before training\.
#### Algorithmic simplicity bias of parameter–function maps\.
A more recent line argues that many parameter–function maps are intrinsically biased toward simple outputs\. \[[72](https://arxiv.org/html/2605.29152#bib.bib72)\] showed that broad classes of input–output maps can be strongly biased toward low\-complexity outputs\. \[[6](https://arxiv.org/html/2605.29152#bib.bib6)\] applied this idea to neural networks, arguing that the parameter–function map of deep networks is biased toward simple functions and that this can yield PAC\-Bayesian explanations of generalization\. \[[73](https://arxiv.org/html/2605.29152#bib.bib73)\] proved simplicity\-bias results for random wide ReLU networks on Boolean inputs\. Related work studies a priori biases toward low\-entropy Boolean functions\[[42](https://arxiv.org/html/2605.29152#bib.bib42)\], and \[[7](https://arxiv.org/html/2605.29152#bib.bib7)\] sharpened the claim that deep networks have an inbuilt Occam\-like bias\. These works establish initialization and architecture as genuine sources of inductive bias\. They do not, however, determine how much of that bias survives optimization\. That survival question is the focus of our paper\.
#### Algorithmic complexity and Lempel–Ziv measures\.
Several simplicity\-bias papers operationalize simplicity through computable proxies for Kolmogorov complexity\. The relevant background includes Lempel–Ziv complexity and universal compression\[[74](https://arxiv.org/html/2605.29152#bib.bib74),[75](https://arxiv.org/html/2605.29152#bib.bib75)\], as well as the broader theory of Kolmogorov complexity\[[76](https://arxiv.org/html/2605.29152#bib.bib76)\]\. In the neural\-network setting, these measures are often used not because they are the only possible definition of simplicity, but because they provide a concrete way to compare the output complexity of functions induced by random parameter draws\. Our work is agnostic about which simplicity metric is definitive: the question we isolate is whether any initialization\-induced simplicity bias remains visible after training\.
#### Transformer simplicity bias\.
Simplicity bias is not restricted to convolutional or fully connected networks\. Work on Transformers has studied biases toward sparse or low\-sensitivity Boolean functions\[[77](https://arxiv.org/html/2605.29152#bib.bib77),[78](https://arxiv.org/html/2605.29152#bib.bib78),[79](https://arxiv.org/html/2605.29152#bib.bib79)\]\. Recent work further argues that Transformers already contain structural inductive biases at random initialization\[[28](https://arxiv.org/html/2605.29152#bib.bib28)\]\. These results are relevant to the broader question of whether large modern architectures prefer simple rules, but they do not by themselves determine whether training preserves the initialization\-induced prior\. Initialization memory is intended to measure this survival question directly\.
### H\.5 Technical notions of simplicity and their training dynamics
#### Frequency, sensitivity, regions, and geometry\.
The literature uses several inequivalent notions of simplicity\. Spectral\-bias or frequency\-principle results show that neural networks often learn low\-frequency components before high\-frequency components\[[80](https://arxiv.org/html/2605.29152#bib.bib80)\]\. Sensitivity\-based work measures input\-output Jacobians and local robustness around the data manifold\[[81](https://arxiv.org/html/2605.29152#bib.bib81)\]\. Region\-counting work studies the number of linear regions or activation patterns in ReLU networks, showing that typical trained or initialized networks use far fewer regions than worst\-case expressivity bounds allow\[[82](https://arxiv.org/html/2605.29152#bib.bib82),[83](https://arxiv.org/html/2605.29152#bib.bib83)\]\. Geometric\-complexity work measures variation of the learned function through a Dirichlet\-energy\-like quantity and shows that many regularizers control this geometry\[[84](https://arxiv.org/html/2605.29152#bib.bib84)\]\. These notions are not identical, but they share a common theme: practical deep networks tend to realize functions much simpler than the worst\-case architecture permits\.
#### Simplicity along training\.
Recent work has begun to study not only the simplicity of random networks or final predictors, but also the order in which different complexities are learned\. Neural networks trained with SGD can learn distributions of increasing complexity over time\[[85](https://arxiv.org/html/2605.29152#bib.bib85)\]\. Two\-layer ReLU models exhibit simplicity bias and optimization thresholds that can be analyzed in controlled settings\[[86](https://arxiv.org/html/2605.29152#bib.bib86)\]\. Saddle\-to\-saddle dynamics have also been proposed as a mechanism for simplicity bias across architectures\[[87](https://arxiv.org/html/2605.29152#bib.bib87)\]\. These works are close in spirit to our paper because they treat simplicity as a dynamical phenomenon rather than a static property of the architecture alone\.
#### Support learning as implicit regularization\.
A particularly relevant recent result is \[[35](https://arxiv.org/html/2605.29152#bib.bib35)\], who study how neural networks learn the support of the target function\. They show that mini\-batch SGD can shrink irrelevant input weights in the first layer, while full\-batch gradient descent requires explicit regularization to obtain the same effect\. Their mechanism is a second\-order implicit regularization effect of SGD that depends on step size and batch size\. This is directly aligned with our perspective: stochastic finite\-step effects can erase or restructure parts of the initialization\-dependent geometry, whereas gradient\-flow\-like dynamics may preserve them\.
### H\.6 Initialization as a dynamical boundary condition
#### Basins and classical dynamical systems\.
The informal intuition that initialization chooses a basin of attraction comes from classical dynamical systems: the long\-time behavior of an autonomous flow can depend strongly on its initial condition and on the basins of attraction of stable invariant sets\[[2](https://arxiv.org/html/2605.29152#bib.bib2),[3](https://arxiv.org/html/2605.29152#bib.bib3)\]\. In neural\-network training, this language is only an analogy: high\-dimensional nonconvex losses, stochastic updates, normalization layers, and time\-varying learning rates make the dynamics much richer than a simple gradient flow\. Nevertheless, the basin viewpoint motivates the empirical question we study: whether the final trained predictor remains measurably dependent on its initial condition\.
#### Initialization in implicit\-bias theory\.
Several theoretical works show that initialization can affect convergence and implicit bias, even in simplified linear or homogeneous models\. \[[4](https://arxiv.org/html/2605.29152#bib.bib4)\] analyzes the explicit role of initialization in overparameterized linear networks\. \[[5](https://arxiv.org/html/2605.29152#bib.bib5)\] studies how initialization affects the implicit bias of deep linear networks\. These works complement our minimal linear\-network calculation: they show that initialization is not merely a nuisance parameter, but can control the geometry of the solution selected by gradient\-based training\.
### H\.7 Initialization for trainability and signal propagation
#### Variance propagation and dynamical isometry\.
A historically distinct literature treats initialization primarily as a trainability device\. Xavier and Kaiming initializations were designed to stabilize forward and backward signal propagation through deep networks\[[9](https://arxiv.org/html/2605.29152#bib.bib9),[10](https://arxiv.org/html/2605.29152#bib.bib10)\]\. Deep\-linear\-network analyses showed how initialization affects learning dynamics and training plateaus\[[88](https://arxiv.org/html/2605.29152#bib.bib88)\]\. Mean\-field and random\-matrix analyses of signal propagation identified ordered and chaotic regimes in random deep networks\[[11](https://arxiv.org/html/2605.29152#bib.bib11),[12](https://arxiv.org/html/2605.29152#bib.bib12)\], while dynamical\-isometry results emphasized controlling the singular\-value distribution of the input\-output Jacobian at initialization\[[13](https://arxiv.org/html/2605.29152#bib.bib13),[17](https://arxiv.org/html/2605.29152#bib.bib17),[16](https://arxiv.org/html/2605.29152#bib.bib16)\]\. Hanin and collaborators studied how architecture and initialization determine trainability, including the onset of exploding and vanishing gradients\[[14](https://arxiv.org/html/2605.29152#bib.bib14),[15](https://arxiv.org/html/2605.29152#bib.bib15)\]\. This line explains why initialization can enable training without necessarily claiming that the final predictor should remain close to the initial prior\.
#### Beyond signal propagation\.
Signal propagation is not the only relevant initialization criterion\. \[[89](https://arxiv.org/html/2605.29152#bib.bib89)\] argue that feature diversity at initialization can matter beyond the stability of forward and backward signals\. This is conceptually close to our motivation: initialization can affect not only whether training is numerically stable, but also what information and geometry are available to the optimizer early in training\.
#### Dynamical mean\-field and kernel\-evolution perspectives\.
Mean\-field and dynamical field theories provide another route to analyzing training from random initialization\. Work on kernel evolution in wide neural network models describes how predictors move away from their initial random\-kernel description during feature learning\[[20](https://arxiv.org/html/2605.29152#bib.bib20),[21](https://arxiv.org/html/2605.29152#bib.bib21)\]\. Recent analyses of deep linear networks from random initialization connect data, width, depth, and hyperparameter transfer\[[22](https://arxiv.org/html/2605.29152#bib.bib22)\], while feature\-learning infinite limits yield adaptive kernel predictors\[[23](https://arxiv.org/html/2605.29152#bib.bib23)\]\. These works are part of the same broader shift from static initialization priors to the dynamics by which training modifies those priors\.
#### Parameterization and scaling\.
Modern scaling theory further emphasizes parameterization as part of the training pipeline\. Tensor Programs and maximal\-update parameterization show that stable feature learning and hyperparameter transfer require carefully chosen width scalings\[[90](https://arxiv.org/html/2605.29152#bib.bib90),[18](https://arxiv.org/html/2605.29152#bib.bib18),[19](https://arxiv.org/html/2605.29152#bib.bib19)\]\. In this view, initialization is tightly coupled to learning\-rate scales and update magnitudes\. Our experiments echo this philosophy in a finite\-network setting: initialization scale, learning rate, batch size, and regularization jointly determine whether the initial condition is remembered\.
### H\.8 Normalization, scale invariance, and effective learning rates
#### BatchNorm and scale\-invariant optimization\.
Batch normalization was introduced as a way to accelerate training and reduce sensitivity to initialization\[[91](https://arxiv.org/html/2605.29152#bib.bib91)\]\. Subsequent theory emphasized that normalized networks contain scale\-invariant parameter blocks, for which rescaling weights can leave the represented function nearly unchanged while altering the effective optimization dynamics\[[92](https://arxiv.org/html/2605.29152#bib.bib92),[93](https://arxiv.org/html/2605.29152#bib.bib93)\]\. The intrinsic\-learning\-rate view is particularly close to our mechanism: for scale\-invariant parameters, the effective angular step size scales inversely with the squared norm\. This makes initialization scale a useful tracer in BatchNorm networks, because it is partly hidden from the forward map while remaining visible to the optimizer\.
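The inverse\-squared\-norm scaling can be checked on a toy problem\. The following is a minimal sketch, not the setup used in our experiments: it assumes a hypothetical two\-dimensional loss that depends on the weight vector only through its direction, and the names `loss`, `grad`, and `angular_step` are illustrative\. Rescaling the weights leaves the loss unchanged, while one plain gradient step at a fixed learning rate rotates the weight direction by an angle roughly proportional to the inverse squared norm\.

```python
import math

TARGET = (1.0, 0.0)  # hypothetical target direction for the toy loss


def loss(w):
    """Toy scale-invariant loss: depends on w only through w / ||w||."""
    norm = math.hypot(w[0], w[1])
    return 1.0 - (w[0] * TARGET[0] + w[1] * TARGET[1]) / norm


def grad(w):
    """Analytic gradient of the toy loss; orthogonal to w, magnitude ~ 1/||w||."""
    norm = math.hypot(w[0], w[1])
    u = (w[0] / norm, w[1] / norm)
    dot = u[0] * TARGET[0] + u[1] * TARGET[1]
    return (-(TARGET[0] - dot * u[0]) / norm, -(TARGET[1] - dot * u[1]) / norm)


def angular_step(w, lr):
    """Angle (radians) by which one gradient-descent step rotates the direction of w."""
    g = grad(w)
    w_new = (w[0] - lr * g[0], w[1] - lr * g[1])
    return abs(math.atan2(w_new[1], w_new[0]) - math.atan2(w[1], w[0]))


def scaled_init(scale):
    """Same direction (45 degrees), different norm: a stand-in for initialization scale."""
    return (scale * math.cos(math.pi / 4), scale * math.sin(math.pi / 4))


if __name__ == "__main__":
    lr = 0.01  # illustrative value
    for scale in (1.0, 2.0, 4.0):
        w = scaled_init(scale)
        print(f"scale={scale}: loss={loss(w):.6f}, angular step={angular_step(w, lr):.3e}")
```

Doubling the scale leaves the printed loss unchanged and shrinks the angular step by about a factor of four, which is the sense in which the scale is hidden from the forward map but visible to the optimizer\.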
### H\.9 Backward error analysis and modified equations
#### Historical context\.
Backward error analysis is a classical tool in numerical analysis for understanding what a numerical algorithm actually solves\. Its linear\-algebra form is closely associated with Wilkinson’s work on roundoff error and eigenvalue computations: rather than measuring only the forward error with respect to the original problem, one asks whether the computed output is the exact solution of a nearby perturbed problem\[[94](https://arxiv.org/html/2605.29152#bib.bib94),[95](https://arxiv.org/html/2605.29152#bib.bib95),[96](https://arxiv.org/html/2605.29152#bib.bib96)\]\. For time\-stepping methods, the analogous question is whether a discrete integrator is the exact time\-$h$ map of a nearby differential equation\. This is the method of modified equations\. Foundational analyses by \[[36](https://arxiv.org/html/2605.29152#bib.bib36)\] and subsequent developments in geometric numerical integration made this viewpoint central for understanding the qualitative behavior of ODE solvers\[[97](https://arxiv.org/html/2605.29152#bib.bib97),[37](https://arxiv.org/html/2605.29152#bib.bib37)\]\. In Hamiltonian and structure\-preserving integration, backward error analysis explains why finite\-step methods may preserve modified invariants or modified Hamiltonians for long times, even when they do not exactly solve the original continuous system\.
#### Stochastic modified equations\.
A stochastic analogue was developed for numerical SDEs and weak approximation\. Modified\-equation methods for SDEs show that a discrete stochastic scheme can be interpreted, in a weak sense, as solving a nearby stochastic dynamics whose drift, diffusion, or invariant measure has been perturbed by the step size\[[98](https://arxiv.org/html/2605.29152#bib.bib98),[99](https://arxiv.org/html/2605.29152#bib.bib99),[100](https://arxiv.org/html/2605.29152#bib.bib100)\]\. This line is important for modern optimization because constant\-step stochastic algorithms are not merely noisy versions of gradient flow; their invariant behavior and local stability can depend on finite\-step corrections\. Uniform\-in\-time weak\-error analyses for SGD make this connection explicit by using tools motivated by backward error analysis for stochastic differential equations\[[101](https://arxiv.org/html/2605.29152#bib.bib101)\]\.
#### BEA in optimization and machine learning\.
The ML use of modified\-equation ideas began by treating stochastic gradient algorithms as discrete dynamical systems whose continuous\-time approximations should include step\-size\-dependent correction terms\. \[[102](https://arxiv.org/html/2605.29152#bib.bib102)\] introduced stochastic modified equations for adaptive stochastic gradient algorithms, and \[[103](https://arxiv.org/html/2605.29152#bib.bib103)\] developed the mathematical foundations for SGD, momentum SGD, and stochastic Nesterov methods\. In deep\-learning generalization, \[[32](https://arxiv.org/html/2605.29152#bib.bib32)\] used backward error analysis to show that finite learning rates in gradient descent induce an implicit gradient regularization term, while \[[33](https://arxiv.org/html/2605.29152#bib.bib33),[34](https://arxiv.org/html/2605.29152#bib.bib34)\] extended this perspective to SGD and derived minibatch\-dependent finite\-step regularization\. Subsequent work used related discretization\-error or modified\-dynamics viewpoints to study deep\-network gradient descent\[[104](https://arxiv.org/html/2605.29152#bib.bib104)\], momentum\[[105](https://arxiv.org/html/2605.29152#bib.bib105),[38](https://arxiv.org/html/2605.29152#bib.bib38),[106](https://arxiv.org/html/2605.29152#bib.bib106)\], stochastic coordinate descent\[[107](https://arxiv.org/html/2605.29152#bib.bib107)\], and the implicit bias of Adam and RMSProp\[[39](https://arxiv.org/html/2605.29152#bib.bib39),[108](https://arxiv.org/html/2605.29152#bib.bib108),[109](https://arxiv.org/html/2605.29152#bib.bib109)\]\. These works support the conceptual move made in our paper: raw epoch count is not the natural dynamical unit\. The effects that erase initialization memory are governed by accumulated step\-size, stochasticity, preconditioning, and regularization timescales, such as
$$\frac{1}{b}\sum_{k}\eta_{k}^{2},\qquad \lambda\sum_{k}\eta_{k},\qquad \sum_{k}\eta_{k}.$$
In this sense, backward error analysis provides the mathematical language for our “forgetting\-time” interpretation: finite\-step optimization follows a nearby dynamics, and the perturbation terms of that nearby dynamics can act as implicit regularizers that overwrite parts of the initialization\-induced geometry\.
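To make the three accumulated quantities concrete, the sketch below evaluates them for a given sequence of per\-step learning rates\. It is an illustration under stated assumptions: we read $b$ as the batch size, $\eta_k$ as the learning rate at step $k$, and $\lambda$ as the explicit decay coefficient; the function name, the dictionary keys, and the numerical values are hypothetical and are not taken from our experimental grid\.

```python
def forgetting_timescales(learning_rates, batch_size, weight_decay):
    """Accumulate the three clocks for a sequence of per-step learning rates.

    Keys are illustrative labels, not terminology from the paper:
      finite_step_noise = (1/b) * sum_k eta_k^2
      norm_decay        = lambda * sum_k eta_k
      flow_time         = sum_k eta_k
    """
    sum_lr = sum(learning_rates)
    sum_lr_sq = sum(lr * lr for lr in learning_rates)
    return {
        "finite_step_noise": sum_lr_sq / batch_size,
        "norm_decay": weight_decay * sum_lr,
        "flow_time": sum_lr,
    }


if __name__ == "__main__":
    steps = 1000  # same number of updates in both runs (illustrative)
    low = forgetting_timescales([0.01] * steps, batch_size=128, weight_decay=5e-4)
    high = forgetting_timescales([0.1] * steps, batch_size=128, weight_decay=5e-4)
    for name in low:
        ratio = high[name] / low[name]
        print(f"{name}: low-lr {low[name]:.3e}, high-lr {high[name]:.3e}, ratio {ratio:.1f}")
```

With the update count held fixed, a tenfold larger constant learning rate advances the first\-order sums tenfold but the second\-order sum a hundredfold, so two runs with identical epoch counts can sit at very different points on these clocks\.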
### H\.10 Optimization as solution selection and forgetting
#### Implicit bias of gradient descent\.
A large body of literature shows that optimization selects among interpolating solutions\. In separable linear classification, gradient descent on exponential\-type losses converges in direction to the max\-margin classifier\[[110](https://arxiv.org/html/2605.29152#bib.bib110)\]\. For linear convolutional networks, the implicit bias depends on architecture and depth\[[111](https://arxiv.org/html/2605.29152#bib.bib111)\]\. Deep matrix factorization exhibits an implicit tendency toward low\-rank solutions\[[40](https://arxiv.org/html/2605.29152#bib.bib40)\]\. For homogeneous neural networks, gradient descent can be related to margin maximization\[[112](https://arxiv.org/html/2605.29152#bib.bib112)\]\. These works support the view that optimization does not merely minimize loss; it chooses a particular geometry among many interpolating predictors\.
#### SGD noise, batch size, and finite\-step effects\.
The stochasticity and discretization of SGD also affect the selected solution\. Large\-batch training was linked to sharp minima and generalization gaps\[[58](https://arxiv.org/html/2605.29152#bib.bib58)\], while other work argued that the number of updates and high\-learning\-rate phase can be central to closing this gap\[[113](https://arxiv.org/html/2605.29152#bib.bib113)\]\. Constant\-step SGD can be viewed as a stochastic process with an approximate stationary distribution\[[114](https://arxiv.org/html/2605.29152#bib.bib114)\]\. As reviewed in Appendix[H\.9](https://arxiv.org/html/2605.29152#A8.SS9), backward error analysis gives a complementary view: finite\-step gradient methods approximately follow the gradient flow of a modified objective\[[32](https://arxiv.org/html/2605.29152#bib.bib32)\], and the SGD correction can be minibatch\-dependent\[[33](https://arxiv.org/html/2605.29152#bib.bib33)\]\. These works motivate our timescale language: raw epoch count is not the natural unit of forgetting; update count, learning rate, batch size, and explicit decay determine how quickly training leaves the initialization\-controlled regime\.
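The modified\-objective statement can be illustrated on a one\-dimensional quadratic\. The sketch below assumes the form of the correction commonly stated in the implicit\-gradient\-regularization literature, namely that gradient descent with step size $h$ on a loss $L$ tracks the gradient flow of $L + \frac{h}{4}\lVert\nabla L\rVert^{2}$; this form, the quadratic $L(\theta)=\frac{a}{2}\theta^{2}$, and all function names are assumptions made for illustration, not results derived in this paper\.

```python
import math


def gd_step(theta, a, h):
    """One gradient-descent step on L(theta) = (a / 2) * theta**2."""
    return theta - h * a * theta


def flow_original(theta, a, h):
    """Exact time-h gradient flow of the original loss L."""
    return theta * math.exp(-h * a)


def flow_modified(theta, a, h):
    """Exact time-h gradient flow of the assumed modified loss L + (h / 4) * (L')**2.

    For the quadratic, the modified gradient is a * (1 + h * a / 2) * theta.
    """
    return theta * math.exp(-h * a * (1.0 + 0.5 * h * a))


def one_step_errors(theta, a, h):
    """Distance between the discrete step and each continuous-time prediction."""
    discrete = gd_step(theta, a, h)
    return (abs(discrete - flow_original(theta, a, h)),
            abs(discrete - flow_modified(theta, a, h)))


if __name__ == "__main__":
    for h in (0.1, 0.05, 0.025):  # illustrative step sizes
        err_orig, err_mod = one_step_errors(theta=1.0, a=1.0, h=h)
        print(f"h={h}: error vs original flow {err_orig:.2e}, vs modified flow {err_mod:.2e}")
```

In this toy case the discrete step is closer to the modified flow than to the original one, and halving $h$ shrinks the modified\-flow error by roughly a factor of eight, consistent with a one\-step error that is third order in the step size\.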
#### Adaptive methods\.
Adaptive optimizers use a geometry different from Euclidean gradient descent\. \[[115](https://arxiv.org/html/2605.29152#bib.bib115)\] showed that adaptive methods can select different solutions from SGD and can generalize differently even when they optimize training loss well\. Our results are consistent with this broader message: Adam\-family methods do not merely train faster in our grid; they erase initialization\-scale dependence more readily\. We interpret this as a forgetting property of the optimizer geometry, not as a universal claim that Adam is always preferable to SGD\.
### H\.11 Random seeds and initialization effects in language models
#### Fine\-tuning and pretraining variability\.
Recent NLP work makes seed dependence concrete in large pretrained models\. \[[24](https://arxiv.org/html/2605.29152#bib.bib24)\] showed that fine\-tuning BERT can vary substantially across random seeds, with both weight initialization and data order contributing to downstream variance\. For decoder\-only pretraining, \[[25](https://arxiv.org/html/2605.29152#bib.bib25)\] introduced multiple Pythia\-style pretraining runs across seeds and model sizes, finding broadly stable dynamics but also identifiable outlier runs\. \[[26](https://arxiv.org/html/2605.29152#bib.bib26)\] studied convergence and divergence of language models across random seeds through token\-level distributional comparisons\. \[[27](https://arxiv.org/html/2605.29152#bib.bib27)\] go further, showing that models can retain fingerprints of their training seed\. These papers motivate the same broad question as ours in a different setting: which parts of final performance are due to the initial condition, and which are erased by training?
#### Controlled language\-model training and architecture design\.
The Physics of Language Models series studies controlled mechanisms in language\-model training, including knowledge storage and extraction\[[29](https://arxiv.org/html/2605.29152#bib.bib29)\]and architecture design choices in transformer training\[[30](https://arxiv.org/html/2605.29152#bib.bib30)\]\. These works are not direct substitutes for an initialization\-scale sweep, but they reinforce the broader point that large\-scale model behavior is shaped by a coupled training pipeline rather than by architecture alone\.
#### Why our setting is deliberately smaller\.
Our paper studies CIFAR\-10 BatchNorm ResNets rather than large language models because the goal is to sweep initialization scale, optimizer, batch size, learning rate, regularization, depth, and seeds extensively\. The resulting controlled study is not meant to replace LLM\-scale evidence\. Instead, it isolates the mechanism: initialization can be a function prior, a trainability device, and a boundary condition for optimization; the observed performance effect depends on whether the training pipeline remembers that boundary condition\.
### H\.12 Position of this work
#### The missing dynamical link\.
Prior work has studied architecture\-induced priors, simplicity bias, trainability at initialization, optimizer implicit bias, scale\-invariant optimization, and large\-batch effects\. We connect these themes through one measurable quantity,
$$\mathsf{Mem}_{\mathrm{acc}}(\mathcal{R},K),$$
the dependence of the returned predictor on the initialization scale under a fixed training procedure $\mathcal{R}$\. The resulting statement is neither that initialization always matters nor that it disappears universally\. Initialization matters when the training dynamics remember it\.
#### Summary\.
The literature above identifies many reasons why neural networks may avoid overfitting: classical capacity control, margins and norms, flat minima, compression, Bayesian priors, benign overfitting, function\-space simplicity, spectral/geometric bias, stable signal propagation, and optimizer\-induced implicit bias\. Our contribution is not to replace these explanations\. It is to add a dynamic measurement that connects them:
$$\text{initialization prior}\quad\xrightarrow{\text{training pipeline}}\quad\text{trained predictor}.$$
Initialization memory measures how much of the left\-hand side survives the arrow\. In this sense, initialization is neither universally decisive nor universally irrelevant\. It matters exactly when the training dynamics remember it\.