Edge of Stability Selectively Shapes Learning Across the Data Distribution


Summary

MIT researchers show that the edge of stability (EoS) in neural network training is not merely a global optimization phenomenon but selectively redistributes learning across subsets of the training distribution, amplifying progress on some data groups while suppressing others. They identify two key conditions governing this allocation: gradient alignment with the top Hessian eigenvector and sustained non-vanishing gradient magnitude.

arXiv:2606.04212v1 Announce Type: new Abstract: Existing analyses of the edge of stability (EoS) treat it as a global property of optimization. We show that it is also selective: the stability constraint redistributes learning across subsets of the training distribution, amplifying progress on some groups while suppressing progress on others. Using a branching intervention that enters or exits the EoS regime from the same training state, we causally demonstrate this trade-off and identify two necessary conditions for a group to benefit. First, its aggregate gradient must align with the top Hessian eigenvector. We isolate this mechanism with a controlled perturbation that preserves distance but randomizes direction, destroying alignment and eliminating the advantage. Second, the group must sustain non-vanishing gradient magnitude over time. Under cross-entropy loss, gradient saturation decouples confidently classified groups, shifting the advantage to output-outliers, whose gradients persist. Together, these results show that EoS functions not only as a stability boundary, but as a mechanism governing the allocation of learning across the data distribution.

Cached at: 06/05/26, 02:23 AM

# Edge of Stability Selectively Shapes Learning Across the Data Distribution
Source: [https://arxiv.org/html/2606.04212](https://arxiv.org/html/2606.04212)
Shauna Kwag\* (MIT, kwags@mit.edu), Anakha Ganesh\* (MIT, anakhag@mit.edu), Tomaso Poggio (MIT, tp@ai.mit.edu), Pierfrancesco Beneventano (MIT, pierb@mit.edu)

###### Abstract

Existing analyses of the edge of stability (EoS) treat it as a global property of optimization. We show that it is also selective: the stability constraint redistributes learning across subsets of the training distribution, amplifying progress on some groups while suppressing progress on others. Using a branching intervention that enters or exits the EoS regime from the same training state, we causally demonstrate this trade-off and identify two necessary conditions for a group to benefit. First, its aggregate gradient must align with the top Hessian eigenvector. We isolate this mechanism with a controlled perturbation that preserves distance but randomizes direction, destroying alignment and eliminating the advantage. Second, the group must sustain non-vanishing gradient magnitude over time. Under cross-entropy loss, gradient saturation decouples confidently classified groups, shifting the advantage to output-outliers, whose gradients persist. Together, these results show that EoS functions not only as a stability boundary, but as a mechanism governing the allocation of learning across the data distribution.

\* Equal contribution.

## 1 Introduction

Deep neural networks are highly sensitive to the choice of optimizer and hyperparameters. Training choices such as learning rate, batch size, and optimizer affect which solution is found \[[41](https://arxiv.org/html/2606.04212#bib.bib10),[23](https://arxiv.org/html/2606.04212#bib.bib20),[18](https://arxiv.org/html/2606.04212#bib.bib8)\], unlike in the classical convex setting where these choices do not affect which minimum is reached. Understanding the mechanisms underlying this implicit bias is a key objective in the theory of deep learning.

One structural explanation comes from the *edge of stability* (EoS) literature: under full-batch and large-batch gradient descent, the top Hessian eigenvalue self-stabilizes near a stability threshold that depends on the optimizer and hyperparameters \[[40](https://arxiv.org/html/2606.04212#bib.bib36),[19](https://arxiv.org/html/2606.04212#bib.bib37),[20](https://arxiv.org/html/2606.04212#bib.bib38),[9](https://arxiv.org/html/2606.04212#bib.bib12),[8](https://arxiv.org/html/2606.04212#bib.bib39),[4](https://arxiv.org/html/2606.04212#bib.bib29),[3](https://arxiv.org/html/2606.04212#bib.bib32),[16](https://arxiv.org/html/2606.04212#bib.bib40),[22](https://arxiv.org/html/2606.04212#bib.bib1),[35](https://arxiv.org/html/2606.04212#bib.bib2)\]. At this threshold, the optimizer operates at the boundary of discrete-time stability, constraining which regions of the loss landscape remain accessible during training.

While the EoS phenomenon is well established, far less is understood about its consequences. In particular, it remains unclear whether operating near the stability threshold provides any functional benefit, or how these stability constraints shape optimization across the data distribution. Prior work has primarily characterized EoS through curvature and optimization trajectories in parameter space, leaving open how these dynamics influence which examples are learned during training. We ask:

Does the occurrence of EoS have practical consequences? If so, which subsets of the training distribution benefit from EoS, and which do not? What governs this allocation?

We give a positive answer to the first question. Then, to study this allocation, we define four prototype groups from the geometry of $P_{X,Y}$ that vary in input typicality, label consistency, and boundary proximity, independent of the model or loss.

Message: We find that EoS induces a *selective* learning regime: the stability constraint distributes optimization effort unevenly, amplifying subsets whose gradients persistently align with the top Hessian eigendirection while suppressing others.

This counters classical intuition: although the Descent Lemma treats the sharpest direction as a stability boundary that constrains learning, alignment with it is precisely what determines how learning is allocated across the distribution.

Our findings connect to a longstanding debate on whether low curvature improves generalization \[[15](https://arxiv.org/html/2606.04212#bib.bib6),[23](https://arxiv.org/html/2606.04212#bib.bib20),[14](https://arxiv.org/html/2606.04212#bib.bib22),[11](https://arxiv.org/html/2606.04212#bib.bib26)\]. EoS is the regime in which sharpness is actively constrained near the stability threshold, making it a natural setting to test what low curvature actually confers. Our results suggest the answer is not global: the functional benefit depends on which subset dominates the top Hessian eigendirection and shifts with the geometric composition of the training distribution. Flatness, in this view, is a directional property determined by data geometry rather than a scalar property of the solution. More broadly, our results take a step toward connecting two largely separate lines of work: *implicit regularization* in parameter space and *inductive bias* over the data distribution.

Our paper makes the following contributions:

- *EoS is selective, not global* (§[3](https://arxiv.org/html/2606.04212#S3)). Using a branching intervention that enters or exits EoS from a shared training trajectory, we find that the stability constraint is not a uniform bottleneck: it selectively benefits some subsets of the training distribution while suppressing others. The trade-off qualitatively replicates across different architectures and optimizers (Appendix [C](https://arxiv.org/html/2606.04212#A3)).
- *Selectivity is governed by alignment $\times$ persistence* (§[4](https://arxiv.org/html/2606.04212#S4)). The advantage at EoS is captured by subsets whose aggregate gradient both aligns with the top Hessian eigenvector and remains non-vanishing throughout training. Two controlled counterfactuals isolate each factor: random-direction displacement removes alignment, and cross-entropy saturation removes persistence. In both cases the EoS advantage disappears or is transferred to the subsets that retain the missing factor. Both factors fall out of extending the self-stabilization framework \[[10](https://arxiv.org/html/2606.04212#bib.bib13)\] from the global loss to per-subset losses (Section [2.2](https://arxiv.org/html/2606.04212#S2.SS2); full derivation in Appendix [F](https://arxiv.org/html/2606.04212#A6)).
- *Geometry shifts the beneficiary* (§[5](https://arxiv.org/html/2606.04212#S5)). In our MLP experiments on CIFAR-10 (the standard setting in prior EoS work \[[9](https://arxiv.org/html/2606.04212#bib.bib12)\]), the subset that benefits from EoS is the one furthest from the class centroids in the input space. Varying the geometric composition of the training distribution continuously shifts which subset this is. Preliminary evidence suggests the resulting generalization behavior shifts accordingly, with improved adversarial robustness when geometrically distant examples lie near the decision boundary, and improved out-of-distribution generalization when they lie far from the training distribution.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x1.png)

Figure 1: Conceptual taxonomy of prototypes. Data samples are categorized based on geometric proximity in input space relative to class-specific cluster centroids ($\mu_0$, $\mu_1$).

![Refer to caption](https://arxiv.org/html/2606.04212v1/x2.png)

Figure 2: Input-space visualization of prototype groups for CIFAR-10. Three representative samples per class (automobile vs. truck) are shown for each group.

## 2 Preliminaries

### 2.1 Edge of stability

Gradient descent in deep networks often operates at the *edge of stability* (EoS), a regime in which training remains near the boundary between stable and unstable updates without diverging \[[9](https://arxiv.org/html/2606.04212#bib.bib12)\]. Consider full-batch gradient descent $\theta_{t+1}=\theta_t-\eta\nabla L(\theta_t)$ with Hessian $H_t=\nabla^2 L(\theta_t)$ and eigenpairs $(\lambda_i, v_i)$. A perturbation along $v_i$ is multiplied by $1-\eta\lambda_i$ at each step, so discrete-time stability requires $\eta\lambda_i<2$. EoS is the regime in which the top eigenvalue $\lambda_1$ (sharpness) saturates the bound:

$$\eta\lambda_1\approx 2.$$

The multiplier along the corresponding eigendirection $\mathbf{v}_1$ is then near $-1$, producing sign-alternating oscillations that are linearly unstable yet remain bounded throughout training \[[9](https://arxiv.org/html/2606.04212#bib.bib12)\].
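The stability condition can be checked numerically on a one-dimensional quadratic, where the curvature $\lambda$ is constant and each step multiplies the iterate by exactly $1-\eta\lambda$. The following is a minimal sketch under that assumption; the function name `gd_on_quadratic` and the chosen curvatures are illustrative, and a quadratic cannot reproduce the bounded oscillations of real EoS training, which rely on higher-order terms.

```python
import numpy as np

def gd_on_quadratic(lam, eta, x0=1.0, steps=50):
    """Run gradient descent on L(x) = 0.5 * lam * x**2 and return the iterates."""
    xs = [x0]
    for _ in range(steps):
        # The gradient is lam * x, so each step multiplies x by (1 - eta * lam).
        xs.append(xs[-1] - eta * lam * xs[-1])
    return np.array(xs)

eta = 0.01                                       # stability threshold is 2 / eta = 200
stable = gd_on_quadratic(lam=150.0, eta=eta)     # eta * lam = 1.5 < 2: iterates shrink
edge = gd_on_quadratic(lam=200.0, eta=eta)       # eta * lam = 2.0: multiplier is -1
unstable = gd_on_quadratic(lam=250.0, eta=eta)   # eta * lam = 2.5 > 2: iterates grow
```

At the threshold the iterate flips sign at every step without growing or shrinking, which is the sign-alternating behavior described above.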

### 2.2 Self-stabilization

The non-monotonic descent at EoS is explained by the *self-stabilization* mechanism of Damian et al. \[[10](https://arxiv.org/html/2606.04212#bib.bib13)\]. Once $\lambda_1$ exceeds $2/\eta$, gradient descent develops a fast period-2 oscillation along $\mathbf{v}_1$. Higher-order terms convert this oscillation into a slow corrective drift toward lower sharpness, so that the cycle-averaged dynamics remain near the active boundary $\{\lambda_1\approx 2/\eta\}$.

Under this cycle-averaged description, the EoS drift differs from ordinary gradient descent by an additional sharpness-reducing component,

$$-(\alpha/\beta)\,\nabla\lambda_1,\qquad \alpha=-\langle\nabla L,\nabla\lambda_1\rangle,\quad \beta=\|\nabla\lambda_1\|^2.$$

For a subset loss $\ell_k$, this extra drift contributes

$$-(\alpha/\beta)\,\langle\nabla\ell_k,\nabla\lambda_1\rangle.$$

We use this established framework to motivate our experimental design. The selector $\langle\nabla\ell_k,\nabla\lambda_1\rangle$ depends on both a directional factor (alignment between $\nabla\ell_k$ and $\nabla\lambda_1$) and a magnitude factor (persistence of $\|\nabla\ell_k\|$ over training). These two components motivate complementary controlled interventions in Section [4](https://arxiv.org/html/2606.04212#S4).

Empirically, we track the directional factor through alignment with the top Hessian eigendirection $\mathbf{v}_1$, measuring how strongly each data subset couples to the unstable mode. Appendix [F](https://arxiv.org/html/2606.04212#A6) derives the per-subset loss decomposition and shows that the measured $\cos^2\theta_k$ serves as an empirical proxy for the theoretical EoS selector $Q_k=\langle\nabla\ell_k,\nabla\lambda_1\rangle$ under the single-mode approximation. Together, this motivates our experiments: branching interventions test the effect of remaining at EoS, random-direction displacement isolates alignment, and cross-entropy saturation isolates gradient persistence.
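The split of the selector into a directional and a magnitude factor can be written out with the usual definition of the angle between two vectors. The symbol $\phi_k$ is introduced here only for illustration (it is the angle between $\nabla\ell_k$ and $\nabla\lambda_1$, which is distinct from the measured angle $\theta_k$ to $\mathbf{v}_1$) and is not notation from the paper:

```latex
Q_k \;=\; \langle \nabla\ell_k, \nabla\lambda_1 \rangle
    \;=\; \underbrace{\|\nabla\ell_k\|}_{\text{magnitude (persistence)}}
          \;\cdot\; \|\nabla\lambda_1\| \;\cdot\;
          \underbrace{\cos\phi_k}_{\text{direction (alignment)}}
```

Since $\|\nabla\lambda_1\|$ is the same for every group at a given step, groups can differ in $Q_k$ only through $\|\nabla\ell_k\|$ and $\cos\phi_k$. If either factor is near zero, the selector is near zero.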

### 2.3 Prototype Taxonomy

#### Definition.

We partition the training distribution into four groups defined by the joint geometry of $P_{X,Y}$, independent of any trained model (Figure [2](https://arxiv.org/html/2606.04212#S1.F2)). *Inliers* are high-density points near the class centroid $\mu_c$ with correct labels. *Boundary points* are in-distribution examples near the inter-class boundary, identified by high label ambiguity in their local neighborhood. *Input-outliers* are geometrically atypical inputs, far from $\mu_c$ in input space, that retain correct labels. *Output-outliers* are high-density inputs assigned an incorrect label. Representative examples from each group are shown in Figure [2](https://arxiv.org/html/2606.04212#S1.F2). Inliers and boundary points are identified from the existing training data by ranking centroid distance and $k$-NN label ambiguity respectively, while input-outliers and output-outliers are synthetically constructed to isolate the effects of input-space atypicality and label inconsistency.

#### Construction.

We instantiate the taxonomy on a binary CIFAR-10 task (automobile vs. truck, $n=10{,}000$) \[[26](https://arxiv.org/html/2606.04212#bib.bib4)\]. Inlier candidates are the $M=3m$ points per class with smallest centroid distance $\|x_i-\mu_c\|$; boundary candidates are the $m$ points per class whose $k$-NN label composition ($k=50$) is closest to uniform. From the inlier candidate pool we sample three disjoint subsets of size $m=25$ per class: the first retains original inputs and labels (inliers); the second is assigned flipped labels $1-c$ (output-outliers); the third is extrapolated by pushing each example away from the opposite class's centroid as $x_i\pm\alpha\,v_{\mathrm{diff}}$ (sign $+$ if $y_i=1$, $-$ if $y_i=0$), where $v_{\mathrm{diff}}=\mu_1-\mu_0$ is the unnormalized centroid difference (input-outliers). We set $\alpha=3$, chosen so that input-outliers are by construction the most distant group from their class centroid; resulting pixel values may lie outside the valid input range, but we do not clip in order to preserve the displacement magnitude. Boundary points are drawn directly from the ambiguity pool. The final training set contains $n=10{,}000$ examples, including 200 prototype-labeled points (50 per group, 25 per class) tracked throughout training.
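The construction can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the function name `build_prototypes`, the brute-force pairwise distances (suitable only for small toy data), and the Gaussian toy inputs are all hypothetical, and the sketch does not remove overlap between the boundary pool and the inlier pool.

```python
import numpy as np

def build_prototypes(X, y, m=25, k=50, alpha=3.0, seed=0):
    """Return {group: (inputs, labels)} for a binary dataset with labels in {0, 1}."""
    rng = np.random.default_rng(seed)
    mu = {c: X[y == c].mean(axis=0) for c in (0, 1)}
    v_diff = mu[1] - mu[0]                      # unnormalized centroid difference

    # k-NN label ambiguity: how far the neighbour label mix is from uniform (0.5).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]
    gap = np.abs((y[nn] != y[:, None]).mean(axis=1) - 0.5)

    groups = {}
    for c in (0, 1):
        idx = np.flatnonzero(y == c)
        # Inlier candidate pool: the M = 3m points closest to the class centroid.
        dist = np.linalg.norm(X[idx] - mu[c], axis=1)
        pool = rng.permutation(idx[np.argsort(dist)[:3 * m]])
        inl, flip, push = pool[:m], pool[m:2 * m], pool[2 * m:]
        bnd = idx[np.argsort(gap[idx])[:m]]     # most ambiguous points of class c
        sign = 1.0 if c == 1 else -1.0          # push away from the opposite centroid
        parts = {
            "inlier": (X[inl], np.full(m, c)),
            "output_outlier": (X[flip], np.full(m, 1 - c)),   # flipped labels
            "input_outlier": (X[push] + sign * alpha * v_diff, np.full(m, c)),
            "boundary": (X[bnd], np.full(m, c)),
        }
        for name, pair in parts.items():
            groups.setdefault(name, []).append(pair)
    return {name: (np.concatenate([p[0] for p in pairs]),
                   np.concatenate([p[1] for p in pairs]))
            for name, pairs in groups.items()}

# Toy stand-in for the image data: two Gaussian classes in 5 dimensions.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1.0, 1.0, (200, 5)), rng.normal(1.0, 1.0, (200, 5))])
y = np.repeat([0, 1], 200)
protos = build_prototypes(X, y, m=10, k=20)
```

As in the text, the displaced inputs are not clipped, so the displacement magnitude $\alpha\|v_{\mathrm{diff}}\|$ is preserved exactly.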

### 2.4 Metrics

#### Per-group loss $\ell_k$.

For each prototype group $k\in\{\text{inlier, boundary, input-outlier, output-outlier}\}$ with index set $P_k$, we define $\ell_k=\frac{1}{|P_k|}\sum_{i\in P_k}\ell(f(x_i),y_i)$ as the average loss over the examples in group $k$. Tracking $\ell_k$ over training reveals the learning order across the groups and which groups are differentially affected by the stability constraint at EoS.
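As a small illustration of this definition, the per-group loss is just a mean over each index set. The sketch below assumes a plain squared-error per-example loss and uses made-up predictions; `per_group_loss` is a hypothetical helper name.

```python
import numpy as np

def per_group_loss(preds, targets, groups):
    """Average per-example loss over each group's index set P_k."""
    per_example = (preds - targets) ** 2          # assumed loss: squared error
    return {k: float(per_example[idx].mean()) for k, idx in groups.items()}

preds = np.array([0.9, 0.8, 0.1, 0.6])            # illustrative model outputs
targets = np.array([1.0, 1.0, 0.0, 0.0])
groups = {"inlier": np.array([0, 1]), "output_outlier": np.array([2, 3])}
losses = per_group_loss(preds, targets, groups)
```

Evaluating this at every checkpoint gives one loss curve per group, which is what the per-group comparisons in the experiments rely on.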

#### Directional coupling.

Let $\nabla\ell_k$ denote the gradient of the loss restricted to prototype group $k$. We measure the directional coupling of group $k$ to the EoS-constrained mode by

$$\cos^2\theta_k=\frac{(\nabla\ell_k\cdot v_1)^2}{\|\nabla\ell_k\|^2}\in[0,1],\tag{1}$$

where $\mathbf{v}_1$ is the top Hessian eigenvector (assumed unit norm). When $\cos^2\theta_k\approx 1$, the group gradient is nearly aligned with $\mathbf{v}_1$, and when $\cos^2\theta_k\approx 0$, it is nearly orthogonal. This quantity measures how strongly a group's gradient aligns with the direction constrained by EoS dynamics.

The link to learning comes from self-stabilization. At EoS, the oscillation–stabilization cycle produces net parameter movement primarily along $\mathbf{v}_1$ \[[10](https://arxiv.org/html/2606.04212#bib.bib13)\]. Consequently, loss decreases predominantly for groups whose gradients are aligned with $\mathbf{v}_1$, while groups with orthogonal gradients make limited progress (Figure [3](https://arxiv.org/html/2606.04212#S2.F3)).
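Equation (1) is straightforward to compute once $\mathbf{v}_1$ and a group gradient are available. The sketch below uses a small explicit matrix as a stand-in for the Hessian and hand-picked gradient vectors. For a real network the Hessian is too large to form, so $\mathbf{v}_1$ would typically be estimated with Hessian-vector products and an iterative eigensolver. That is an assumption about practice, not a description of the authors' implementation.

```python
import numpy as np

def top_eigvec(H):
    """Unit-norm eigenvector for the largest eigenvalue of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(H)   # eigenvalues come back in ascending order
    return eigvecs[:, -1]

def cos2_alignment(grad_k, v1):
    """cos^2 of the angle between a group gradient and the unit vector v1."""
    return float(np.dot(grad_k, v1) ** 2 / np.dot(grad_k, grad_k))

H = np.diag([10.0, 1.0, 0.5])              # toy Hessian; top eigendirection is axis 0
v1 = top_eigvec(H)
aligned = cos2_alignment(np.array([2.0, 0.0, 0.0]), v1)      # gradient along v1
orthogonal = cos2_alignment(np.array([0.0, 3.0, 0.0]), v1)   # gradient orthogonal to v1
mixed = cos2_alignment(np.array([1.0, 1.0, 0.0]), v1)        # 45 degrees from v1
```

Because the projection is squared, the result does not depend on the arbitrary sign of the eigenvector returned by the solver.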

![Refer to caption](https://arxiv.org/html/2606.04212v1/x3.png)

Figure 3: Directional coupling at EoS. The optimizer oscillates along $\mathbf{v}_1$ (red zigzag). When a group's gradient $\nabla\ell_k$ aligns with $\mathbf{v}_1$ (left), self-stabilization reduces loss for that group. When $\nabla\ell_k$ is orthogonal to $\mathbf{v}_1$ (right), the group is decoupled from the oscillation and its loss does not benefit.

#### Curvature influence.

While $\cos^2\theta_k$ measures direction, it does not capture gradient magnitude. We report the squared projection

$$(\nabla\ell_k\cdot v_1)^2=\|\nabla\ell_k\|^2\cdot\cos^2\theta_k,\tag{2}$$

which quantifies a group's effective curvature influence along $\mathbf{v}_1$. This follows from the quadratic form

$$\nabla\ell_k^{\top}H\,\nabla\ell_k=\sum_i\lambda_i(\nabla\ell_k\cdot v_i)^2,$$

where the contribution of the top eigendirection is $\lambda_1(\nabla\ell_k\cdot v_1)^2$. Since $\lambda_1$ is shared across groups at a given step, $(\nabla\ell_k\cdot v_1)^2$ ranks groups by their curvature influence in the dominant direction. The metric can decrease either through rotation away from $\mathbf{v}_1$ or through shrinking gradient magnitude. We isolate each effect in Section [4](https://arxiv.org/html/2606.04212#S4).
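The spectral expansion of the quadratic form and the factorization in Equation (2) can be verified numerically. The sketch below uses a random symmetric positive semidefinite matrix as a stand-in for the Hessian and a random vector as a stand-in for a group gradient; both are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
H = A @ A.T                          # symmetric stand-in for a Hessian
g = rng.normal(size=6)               # stand-in for a group gradient

lam, V = np.linalg.eigh(H)           # ascending eigenvalues, eigenvectors as columns
proj_sq = (V.T @ g) ** 2             # (g . v_i)^2 for every eigendirection i

quad_form = float(g @ H @ g)                     # left-hand side of the identity
spectral_sum = float(np.sum(lam * proj_sq))      # sum_i lambda_i (g . v_i)^2

v1 = V[:, -1]                                    # top eigenvector
curvature_influence = float(np.dot(g, v1) ** 2)  # squared projection, Eq. (2)
cos2 = curvature_influence / float(g @ g)        # directional coupling, Eq. (1)
```

The two routes by which the metric can fall are visible here: `curvature_influence` shrinks if `cos2` drops at fixed gradient norm, or if the norm drops at fixed `cos2`.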

## 3 Selective Learning at the Edge of Stability

### 3.1 Setup and Intervention Design

We train a two-hidden-layer MLP (width 512, ReLU) with full-batch gradient descent on a binary CIFAR-10 task (automobile vs. truck, $n=10{,}000$) for 10,000 steps. Prototype groups are constructed as described in Section [2.3](https://arxiv.org/html/2606.04212#S2.SS3). The default learning rate is $\eta=0.01$ under mean squared error (MSE). All quantities are plotted against effective time $t_{\mathrm{eff}}=\eta\,t$, which normalizes for step size.

To test whether the stability constraint causally affects subset-level learning, we branch each run from a shared trajectory at time $t^{\ast}$, defined as the onset of EoS (detected as the first time $\lambda_1$ reaches $2/\eta$). The *baseline* branch continues at the original learning rate, remaining at EoS. The *exit* branch halves the learning rate ($\eta\to\eta/2$), which raises the stability threshold to $4/\eta$ (with $\eta$ the original learning rate) and allows training to promptly exit the EoS regime. We denote by $t^{\ast\ast}$ the time at which the exit branch reaches its new stability threshold $4/\eta$. The interval $[t^{\ast},t^{\ast\ast}]$ is the window over which the two branches occupy different stability regimes. Since architecture, data, and initialization are shared up to $t^{\ast}$, the branching intervention identifies the causal effect of continuing at the original learning rate versus dropping it at EoS onset. The intervention separates the branches into EoS and non-EoS regimes, so post-branch divergence in prototype-level loss provides evidence about the consequences of remaining at EoS.
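The branching protocol itself (follow one trajectory, detect the first step at which sharpness reaches $2/\eta$, then fork into a baseline at $\eta$ and an exit branch at $\eta/2$) can be sketched on a one-dimensional toy loss. Everything here is a hypothetical stand-in: the loss $L(\theta)=\tfrac12(\theta^2-1)^2$, the learning rate, and the function names are chosen only so that sharpness rises along the trajectory. The toy does not reproduce self-stabilization or any per-group effect.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * (theta**2 - 1)**2."""
    return 2.0 * theta * (theta ** 2 - 1.0)

def sharpness(theta):
    """Second derivative of the toy loss (the 1-D analogue of lambda_1)."""
    return 6.0 * theta ** 2 - 2.0

def run(theta, eta, steps):
    """Plain gradient descent; returns the list of iterates including the start."""
    traj = [theta]
    for _ in range(steps):
        theta = theta - eta * grad(theta)
        traj.append(theta)
    return traj

def branch_at_eos(theta0, eta, max_steps, branch_steps):
    """Follow one trajectory until sharpness first reaches 2/eta, then fork."""
    theta = theta0
    for t in range(max_steps):
        if sharpness(theta) >= 2.0 / eta:
            baseline = run(theta, eta, branch_steps)            # keeps eta
            exit_branch = run(theta, eta / 2.0, branch_steps)   # threshold rises to 4/eta
            return t, baseline, exit_branch
        theta = theta - eta * grad(theta)
    return None, None, None

t_star, baseline, exit_branch = branch_at_eos(
    theta0=0.1, eta=0.6, max_steps=100, branch_steps=20)
```

Both branches start from the identical state at the detected onset, so any later difference between them is attributable to the learning-rate change alone. In this toy the exit branch settles at the minimum $\theta=1$, whose sharpness of 4 lies below its new threshold $4/\eta$.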

All figures show medians across 5 seeds. Shading indicates the interquartile range where shown. Full experimental setup details can be found in Appendix [A](https://arxiv.org/html/2606.04212#A1).

### 3.2 The Selective Trade-off

Figure [4](https://arxiv.org/html/2606.04212#S3.F4) shows the effect of the branching intervention under MSE training. After the exit branch leaves the EoS regime at $t^{\ast}$, prototype losses begin to diverge between the two runs. The divergence is group-specific: input-outlier and output-outlier loss decrease faster under the baseline ($\Delta\ell_k>0$), while inlier and boundary loss decrease faster under the exit branch ($\Delta\ell_k<0$). The stability constraint does not uniformly slow or accelerate learning across the data subsets; instead, it redistributes optimization, favoring some groups at the expense of others.

The selective trade-off replicates across alternative architectures (CNN, ResNet), optimizers, and class pairs (Appendix [C](https://arxiv.org/html/2606.04212#A3)).

![Refer to caption](https://arxiv.org/html/2606.04212v1/x4.png)

Figure 4: EoS creates selective trade-offs across prototype groups. The baseline run (solid) enters EoS at $t^{\ast}$, while the exit branch (dashed) subsequently leaves. Left: sharpness confirms the two branches occupy distinct stability regimes. Middle: prototype losses diverge post-branch, with input-outliers and output-outliers benefiting from EoS while boundary and inlier progress is suppressed. Right: the intervention effect $\Delta\ell_k=\ell_k^{\text{exit}}-\ell_k^{\text{EoS}}$, where positive values indicate that EoS achieves lower loss for that group.

### 3.3 Alignment Resolves Under EoS, Persists Without It

Figure [5](https://arxiv.org/html/2606.04212#S3.F5) shows directional alignment $\cos^2\theta_k$ for input-outliers and boundary points. During progressive sharpening, alignment for input-outliers increases rapidly, approaching 1 near the onset of EoS, while boundary alignment remains below $0.2$. As a result, input-outlier gradients dominate the top eigendirection of the Hessian, and $\mathbf{v}_1$ aligns with them.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x5.png)

Figure 5: Alignment dynamics under EoS. Input-outlier $\cos^2\theta_k$ rises during progressive sharpening and dominates $\mathbf{v}_1$ at EoS onset. Under the baseline (solid), self-stabilization resolves this alignment and $\mathbf{v}_1$ rotates toward boundary points. Under the exit branch (dashed), alignment remains elevated until the new threshold is reached at $t^{\ast\ast}$.

Once EoS is reached at $t^{\ast}$, input-outlier $\cos^2\theta_k$ declines under the baseline as self-stabilization reduces their loss and gradient magnitude. As their curvature contribution weakens, boundary alignment rises and $\mathbf{v}_1$ rotates toward it. Under the exit branch, this transition does not occur: input-outlier alignment remains elevated through $t^{\ast\ast}$, indicating that the resolution is driven specifically by EoS dynamics. This motivates identifying the properties that determine which group dominates $\mathbf{v}_1$.

## 4 Mechanism: Alignment and Gradient Persistence

Two properties are jointly necessary to capture the EoS advantage: directional alignment and gradient persistence. We isolate each experimentally.

### 4.1 Directional alignment.

In the baseline construction, input-outliers are displaced along a shared direction $v_{\mathrm{diff}}$, yielding a directionally coherent group. To test whether this coherence drives their curvature dominance, we construct a counterfactual in which each input-outlier is displaced by the same distance but in a random direction orthogonal to $v_{\mathrm{diff}}$. The two conditions match in centroid distance, group size, and labels, differing only in directional structure (Appendix [D](https://arxiv.org/html/2606.04212#A4)).

Under coherent displacement along $v_{\mathrm{diff}}$, the input-outliers' $\cos^2\theta_k$ and curvature influence $(\nabla\ell_k\cdot v_1)^2$ dominate (Figure [6](https://arxiv.org/html/2606.04212#S4.F6)). The input-outliers benefit from EoS while the progress of other groups is suppressed. Under incoherent displacement, input-outlier alignment collapses, curvature influence falls by an order of magnitude, and the selective intervention effect dissolves. Geometric atypicality alone (large distance from a group's own class centroid) is not sufficient. The group must push the Hessian coherently along a shared direction to dominate $\mathbf{v}_1$ and benefit at the edge of stability.
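The incoherent counterfactual can be sketched as follows: draw a random direction per example, remove its component along $v_{\mathrm{diff}}$, normalize, and scale to the same displacement length $\alpha\|v_{\mathrm{diff}}\|$ used in the coherent case. Drawing an independent direction for each example is an assumption here (the exact sampling scheme is in the paper's Appendix D), and the function name and toy shapes are illustrative.

```python
import numpy as np

def random_orthogonal_displacement(X, v_diff, alpha, rng):
    """Displace each row by alpha * ||v_diff|| along a random direction orthogonal to v_diff."""
    u = v_diff / np.linalg.norm(v_diff)
    r = rng.normal(size=X.shape)                      # one random direction per example
    r -= np.outer(r @ u, u)                           # remove the component along v_diff
    r /= np.linalg.norm(r, axis=1, keepdims=True)     # unit-norm directions
    return X + alpha * np.linalg.norm(v_diff) * r

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))          # toy inputs
v_diff = rng.normal(size=16)          # toy centroid difference
X_rand = random_orthogonal_displacement(X, v_diff, alpha=3.0, rng=rng)
delta = X_rand - X
```

The displacement length matches the coherent construction by design, so only the directional structure differs between the two conditions.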

![Refer to caption](https://arxiv.org/html/2606.04212v1/x6.png)

Figure 6: Directional alignment is necessary for the selective EoS advantage. Identical seeds and configurations; only the input-outlier displacement direction differs. Top: Coherent displacement along $v_{\mathrm{diff}}$ yields high alignment and curvature influence for input-outliers, which capture the EoS advantage. Bottom: Random orthogonal displacement at equal distance reduces alignment and curvature influence, largely eliminating the trade-off.

### 4.2 Gradient persistence.

Directional coherence determines *which direction* a group pushes the Hessian, but the coupling also requires sustained gradient *magnitude*. We test this by comparing MSE and CE training on identical data (Figure [7](https://arxiv.org/html/2606.04212#S4.F7)).

![Refer to caption](https://arxiv.org/html/2606.04212v1/x7.png)

Figure 7: Gradient persistence determines which group retains curvature influence. Identical seeds and configurations; only the loss differs. Top: Input-outliers have elevated alignment and strong curvature influence, and capture the EoS advantage. Bottom: Alignment for input-outliers is high, but gradient saturation weakens their curvature influence; the EoS advantage shifts to output-outliers, whose gradients remain active.

Under MSE, per-example gradients scale with the residual and persist even for confidently classified points. Conversely, under CE, gradients vanish as points are learned and confidence grows. The experiment reveals that while the gradient for input-outliers points towards $\mathbf{v}_1$ under both loss functions, $(\nabla\ell_k\cdot v_1)^2$ collapses by orders of magnitude under CE as gradient norms shrink. The functional consequence is that output-outliers, with the most elevated curvature influence, become the sole beneficiary at the edge of stability.
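The contrast between the two losses is visible already at the level of the derivative with respect to the model output. The sketch below assumes a scalar output, a target of 1, a squared-error loss $\tfrac12(f-y)^2$, and binary cross-entropy on a sigmoid of the logit. These are simplifying assumptions for illustration rather than the paper's exact setup.

```python
import numpy as np

def mse_grad(f, y):
    """d/df of 0.5 * (f - y)**2, which equals the residual."""
    return f - y

def ce_grad(z, y):
    """d/dz of binary cross-entropy on sigmoid(z), which equals sigmoid(z) - y."""
    return 1.0 / (1.0 + np.exp(-z)) - y

# Outputs on the correct side for label y = 1, with growing confidence.
outputs = np.array([0.6, 2.0, 5.0, 10.0])
mse = np.abs(mse_grad(outputs, 1.0))   # tracks the residual |f - y|
ce = np.abs(ce_grad(outputs, 1.0))     # shrinks toward zero as confidence grows
```

Under squared error the derivative vanishes only when the output matches the target exactly, whereas under cross-entropy it decays toward zero as the logit grows. This is the saturation effect that removes persistence for confidently classified groups.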

### 4.3 Natural Outliers in the Data Distribution.

To verify that the coupling between centroid distance and $\mathbf{v}_1$ alignment is not an artifact of synthetic prototype construction, we measure per-example directional alignment directly on the natural CIFAR-10 training distribution. For each of the $10{,}000$ training examples, we compute centroid distance $\|x_i-\mu_c\|$ and per-example alignment $\cos^2(\nabla\ell_i,v_1)$ at checkpoints throughout training.

Figure [8](https://arxiv.org/html/2606.04212#S4.F8) shows the relationship at two timepoints. During progressive sharpening, centroid distance and $\cos^2$ are nearly uncorrelated (Spearman $\rho=-0.11$). At EoS, a positive correlation emerges ($\rho=0.39$): examples that are farther from their class centroid align more strongly with $\mathbf{v}_1$. Centroid distance, a purely distributional quantity computed before training, correlates with which examples will dominate the unstable direction at EoS.
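Spearman's $\rho$ is the Pearson correlation of the ranks, so it captures monotonic rather than strictly linear association. A minimal NumPy sketch follows; it does not average tied ranks, which is adequate for continuous quantities such as distances, and the toy arrays are illustrative. `scipy.stats.spearmanr` is a standard alternative that handles ties.

```python
import numpy as np

def rankdata(a):
    """Ranks 1..n in sorted order (ties are not averaged in this sketch)."""
    a = np.asarray(a)
    ranks = np.empty(len(a))
    ranks[np.argsort(a)] = np.arange(1, len(a) + 1)
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return float(np.corrcoef(rankdata(x), rankdata(y))[0, 1])

dist = np.array([0.1, 0.5, 0.9, 1.7, 2.4])          # toy centroid distances
rho_monotone = spearman_rho(dist, dist ** 3)        # nonlinear but monotone increasing
rho_reversed = spearman_rho(dist, -dist)            # monotone decreasing
```

In the setting above, `x` would hold the per-example centroid distances and `y` the per-example $\cos^2$ values at a given checkpoint.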

![Refer to caption](https://arxiv.org/html/2606.04212v1/x8.png)

Figure 8: Centroid distance predicts per-example alignment with $\mathbf{v}_1$ at EoS. Each dot is one training example; the line indicates the monotonic trend captured by Spearman $\rho$. Left: during progressive sharpening, centroid distance and $\cos^2(\nabla\ell_i,v_1)$ are nearly uncorrelated ($\rho=-0.11$). Right: at EoS, correlation emerges ($\rho=0.39$). The full trajectory of the correlation is shown in Appendix [E](https://arxiv.org/html/2606.04212#A5).

## 5 Generalizing the Alignment Principle

Sections [3](https://arxiv.org/html/2606.04212#S3) and [4](https://arxiv.org/html/2606.04212#S4) imply a clear prediction: if alignment $\times$ persistence drives the effect, then changing only the data geometry, while holding the model, optimizer, and loss fixed, should change the dominant group and, in turn, the functional benefit conferred by EoS.

### 5.1 Performance on Harder Class Boundaries.

We verify this prediction on a more challenging class pair (cat vs. dog, $n=10{,}000$), where the relative geometry of the prototype groups shifts. With $\alpha=3$, boundary points are the most distant group from the centroid, not the input-outliers. Correspondingly, boundary points dominate $\mathbf{v}_1$ and become the primary beneficiary of EoS. Increasing $\alpha$ to 10 restores input-outliers as the most distant group and transfers the advantage back to them (Appendix [C.3](https://arxiv.org/html/2606.04212#A3.SS3)).

### 5.2 Does Edge-of-Stability Improve Generalization or Robustness?

Sections [3](https://arxiv.org/html/2606.04212#S3)–[4](https://arxiv.org/html/2606.04212#S4) established when a subset benefits from EoS. We now ask whether this training-time selectivity transfers to test-time behavior and share preliminary results.

#### Adversarial robustness.

Figure [9](https://arxiv.org/html/2606.04212#S5.F9) reports PGD adversarial accuracy ($\varepsilon=0.03$, 10 steps, step size $\varepsilon/4$) on 50 test-set boundary points selected by $k$-NN ambiguity \[[29](https://arxiv.org/html/2606.04212#bib.bib30)\]. With $\alpha=3$, boundary points dominate $\mathbf{v}_1$, and the EoS branch maintains higher adversarial accuracy than the exit branch after $t^{\ast\ast}$. The stability constraint implicitly sharpens the decision boundary in the region most relevant to robust classification. With $\alpha=10$, where input-outliers instead dominate, the pattern reverses: the exit branch now achieves higher adversarial accuracy on boundary points, because the EoS branch in this configuration has been spending its optimization budget elsewhere.
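For readers unfamiliar with the attack, a projected gradient descent (PGD) loop with the stated budget can be sketched as below. The text above does not specify the norm, the attack loss, or whether a random start is used, so the $L_\infty$ ball, the cross-entropy objective, the deterministic start, and the logistic-regression stand-in for the MLP are all assumptions of this sketch. No clipping to a valid pixel range is applied.

```python
import numpy as np

def pgd_linf(x, y, w, b, eps=0.03, steps=10):
    """L-infinity PGD against a logistic model p = sigmoid(w . x + b), label y in {0, 1}."""
    step = eps / 4.0
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        grad_x = (p - y) * w                        # gradient of cross-entropy w.r.t. the input
        x_adv = x_adv + step * np.sign(grad_x)      # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project back into the eps-ball
    return x_adv

w = np.array([1.0, -2.0, 0.5])     # toy model weights
b = 0.0
x = np.array([0.2, -0.1, 0.3])     # toy input with true label 1
x_adv = pgd_linf(x, 1, w, b)
clean_logit = float(w @ x + b)
adv_logit = float(w @ x_adv + b)
```

Adversarial accuracy is then the fraction of evaluated points whose prediction on `x_adv` still matches the label.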

![Refer to caption](https://arxiv.org/html/2606.04212v1/x9.png)Figure 9:EoS improves adversarial robustness only when boundary points dominate𝐯1\\mathbf\{v\}\_\{1\}\.Left\(α=3\\alpha=3, boundary dominates𝐯1\\mathbf\{v\}\_\{1\}\): the EoS branch \(solid\) outperforms the exit branch \(dashed\) aftert∗∗t^\{\*\*\}\.Right\(α=10\\alpha=10, input\-outlier dominates𝐯1\\mathbf\{v\}\_\{1\}\): the pattern reverses, and the exit branch performs better\. Robustness gains appear only when EoS prioritizes the evaluated subset\. Single seed\.
#### Out\-of\-distribution generalization\.

Figure[10](https://arxiv.org/html/2606.04212#S5.F10)reports MSE loss on input\-outliers constructed at varyingαtest\\alpha\_\{\\text\{test\}\}, evaluated at the checkpoint immediately aftert∗∗t^\{\*\*\}in training\. Withα=3\\alpha=3, the exit branch achieves lower loss at largeαtest\\alpha\_\{\\text\{test\}\}, indicating no OOD advantage for input\-outliers\. Withα=10\\alpha=10, where input\-outliers dominate𝐯1\\mathbf\{v\}\_\{1\}, the EoS branch achieves lower loss at largeαtest\\alpha\_\{\\text\{test\}\}: optimization on input\-outliers at EoS during training appears to transfer to OOD generalization alongvdiffv\_\{\\mathrm\{diff\}\}\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x10.png) Figure 10: EoS steers generalization toward the dominant group. Test MSE on input-outliers across $\alpha_{\text{test}}$ (after $t^{**}$). Left ($\alpha=3$, boundary dominates $\mathbf{v}_1$): no OOD advantage. Right ($\alpha=10$, input-outlier dominates $\mathbf{v}_1$): EoS improves OOD performance at large $\alpha_{\text{test}}$. Single seed.

## 6 Discussion

We find that the EoS stability constraint acts as an inductive bias, not merely an implicit regularizer\. Rather than selecting among equivalent minimizers, it determines which subsets of the data are optimized, as a function of data geometry and loss\. Concretely, EoS optimizes on the subset with the largest curvature influence, and the functional consequence depends on which subset dominates\.

Comparing the $\alpha=3$ and $\alpha=10$ models illustrates how data geometry changes the functional effect of EoS. When boundary points dominate $\mathbf{v}_1$, EoS focuses optimization near the decision boundary and can improve adversarial robustness; when input-outliers dominate, it instead shifts optimization toward distributional tails and can improve extrapolation along the outlier direction (Appendix [C.3](https://arxiv.org/html/2606.04212#A3.SS3)). Thus, hyperparameters typically viewed as controlling convergence, such as learning rate or loss, may also influence which subset of the data, and hence which functional property of the model, is emphasized.

This perspective offers a possible explanation for divergent findings in the flatness literature. The benefit of low or controlled sharpness is not determined by scalar flatness alone, but by which subset captures the top curvature direction during training. Different empirical settings may therefore induce different functional outcomes, depending on whether boundary points, input-outliers, or other subgroups dominate $\mathbf{v}_1$. This yields a testable prediction: changing the subgroup that dominates $\mathbf{v}_1$ should change the direction of the resulting robustness or generalization effect.

#### Limitations and scope\.

Our experiments are restricted to relatively small models and datasets, primarily because tracking EoS dynamics requires repeated estimates of the top Hessian eigenvalue via Hessian-vector products, whose cost scales with both model and dataset size. This constraint is common in the EoS literature, including the settings of \[[9](https://arxiv.org/html/2606.04212#bib.bib12), [4](https://arxiv.org/html/2606.04212#bib.bib29)\], from which we adapt our experimental setup. Consequently, whether the observed selectivity persists for more classes, larger datasets, or higher-resolution inputs remains open. For the same reason, we focus on full-batch training: a major open direction for the field is to understand how the self-stabilization argument \[[10](https://arxiv.org/html/2606.04212#bib.bib13)\] could be generalized to mini-batch optimizers. Extending the corresponding subset-level analysis to mini-batch optimizers requires such theoretical foundations. Our prototype taxonomy is also defined in pixel space, which maps most transparently to gradient geometry in MLPs. In convolutional architectures, our experiments suggest that the same predictive quantity, $(\nabla\ell_k \cdot v_1)^2$, remains informative, but the identity of the dominant group can shift as learned features reshape the input-to-gradient mapping. Extending prototype construction to representation space is therefore a natural next step. Finally, our robustness and generalization results are preliminary; systematic evaluation across architectures, seeds, and broader distribution shifts remains future work.

## 7 Conclusion

The edge of stability is not a uniform optimization phenomenon, but a selective one: the stability constraint redistributes learning unevenly across the training distribution, accelerating progress on some subsets while suppressing others\. We find that two properties jointly determine which subsets benefit: directional alignment with the top Hessian eigendirection and persistence of gradient signal under the loss function\. Removing either component—by disrupting directional coherence or by inducing gradient saturation—eliminates or shifts the advantage to different subsets\. Importantly, by characterizing how a curvature constraint differentially shapes learning across the data distribution, this work provides a step toward connecting implicit regularization in parameter space with inductive bias over examples\.

## References

- \[1\]\(2023\)Learning threshold neurons via edge of stability\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/3e592c571de69a43d7a870ea89c7e33a-Abstract-Conference.html)Cited by:[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1)\.
- \[2\]K\. Ahn, J\. Zhang, and S\. Sra\(2022\)Understanding the unstable convergence of gradient descent\.InInternational Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.162,pp\. 247–257\.External Links:[Link](https://proceedings.mlr.press/v162/ahn22a.html)Cited by:[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1)\.
- \[3\]A\. Andreyev, A\. Ananthkumar, M\. Walden, T\. Poggio, and P\. Beneventano\(2026\)Momentum further constrains sharpness at the edge of stochastic stability\.arXiv preprint arXiv:2604\.14108\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2604.14108)Cited by:[§B\.1](https://arxiv.org/html/2606.04212#A2.SS1.SSS0.Px1.p1.3),[Figure 14](https://arxiv.org/html/2606.04212#A3.F14),[§1](https://arxiv.org/html/2606.04212#S1.p2.1)\.
- \[4\]A\. Andreyev and P\. Beneventano\(2024\)Edge of stochastic stability: revisiting the edge of stability for SGD\.arXiv preprint arXiv:2412\.20553\.External Links:2412\.20553,[Link](https://arxiv.org/abs/2412.20553)Cited by:[§A\.5](https://arxiv.org/html/2606.04212#A1.SS5.p2.1),[§B\.1](https://arxiv.org/html/2606.04212#A2.SS1.SSS0.Px1.p1.3),[§1](https://arxiv.org/html/2606.04212#S1.p2.1),[§6](https://arxiv.org/html/2606.04212#S6.SS0.SSS0.Px1.p1.1)\.
- \[5\]A\. Andreyev and P\. Beneventano\(2025\)Edge of stochastic stability\.Note:Software, Apache 2\.0 licenseExternal Links:[Link](https://github.com/arseniqum/edge-of-stochastic-stability)Cited by:[§A\.5](https://arxiv.org/html/2606.04212#A1.SS5.p2.1)\.
- \[6\]S\. Arora, Z\. Li, and A\. Panigrahi\(2022\)Understanding gradient descent on the edge of stability in deep learning\.InInternational Conference on Machine Learning,pp\. 948–1024\.Cited by:[§B\.1](https://arxiv.org/html/2606.04212#A2.SS1.SSS0.Px2.p1.2),[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1)\.
- \[7\]L\. Chen and J\. Bruna\(2023\)Beyond the edge of stability via two\-step gradient updates\.InInternational Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.202,pp\. 4330–4391\.External Links:[Link](https://proceedings.mlr.press/v202/chen23b.html)Cited by:[§A\.4](https://arxiv.org/html/2606.04212#A1.SS4.p3.7)\.
- \[8\]J\. Cohen, B\. Ghorbani, S\. Krishnan, N\. Agarwal, S\. Medapati, M\. Badura, D\. Suo, Z\. Nado, G\. E\. Dahl, and J\. Gilmer\(2023\)Adaptive gradient methods at the edge of stability\.InNeurIPS 2023 Workshop on Heavy Tails in Machine Learning: Structure, Stability, and Dynamics,External Links:[Link](https://openreview.net/forum?id=dHGNgkUcGd)Cited by:[§B\.1](https://arxiv.org/html/2606.04212#A2.SS1.SSS0.Px1.p1.3),[§1](https://arxiv.org/html/2606.04212#S1.p2.1)\.
- \[9\]J\. Cohen, S\. Kaur, Y\. Li, J\. Z\. Kolter, and A\. Talwalkar\(2021\)Gradient descent on neural networks typically occurs at the edge of stability\.InInternational Conference on Learning Representations,Cited by:[§B\.1](https://arxiv.org/html/2606.04212#A2.SS1.SSS0.Px1.p1.3),[3rd item](https://arxiv.org/html/2606.04212#S1.I1.i3.p1.1),[§1](https://arxiv.org/html/2606.04212#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.04212#S2.SS1.p1.7),[§2\.1](https://arxiv.org/html/2606.04212#S2.SS1.p1.9),[§6](https://arxiv.org/html/2606.04212#S6.SS0.SSS0.Px1.p1.1)\.
- \[10\]A\. Damian, E\. Nichani, and J\. D\. Lee\(2023\)Self\-stabilization: the implicit bias of gradient descent at the edge of stability\.InInternational Conference on Learning Representations,Cited by:[§B\.1](https://arxiv.org/html/2606.04212#A2.SS1.SSS0.Px2.p1.2),[§F\.1](https://arxiv.org/html/2606.04212#A6.SS1.SSS0.Px2.p1.1),[§F\.1](https://arxiv.org/html/2606.04212#A6.SS1.p1.17),[Appendix F](https://arxiv.org/html/2606.04212#A6.p1.1),[2nd item](https://arxiv.org/html/2606.04212#S1.I1.i2.p1.1),[§2\.2](https://arxiv.org/html/2606.04212#S2.SS2.p1.4),[§2\.4](https://arxiv.org/html/2606.04212#S2.SS4.SSS0.Px2.p2.2),[§6](https://arxiv.org/html/2606.04212#S6.SS0.SSS0.Px1.p1.1)\.
- \[11\]L\. Dinh, R\. Pascanu, S\. Bengio, and Y\. Bengio\(2017\)Sharp minima can generalize for deep nets\.InInternational Conference on Machine Learning,Vol\.70,pp\. 1019–1028\.Cited by:[§1](https://arxiv.org/html/2606.04212#S1.p7.1)\.
- \[12\]V\. Feldman and C\. Zhang\(2020\)What neural networks memorize and why: discovering the long tail via influence estimation\.InAdvances in Neural Information Processing Systems,Vol\.33\.External Links:[Link](https://proceedings.neurips.cc/paper/2020/hash/1e14bfe2714193e7af5abc64ecbd6b46-Abstract.html)Cited by:[§B\.2](https://arxiv.org/html/2606.04212#A2.SS2.p1.4)\.
- \[13\]V\. Feldman\(2020\)Does learning require memorization? A short tale about a long tail\.InProceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing,pp\. 954–959\.External Links:[Document](https://dx.doi.org/10.1145/3357713.3384290)Cited by:[§B\.2](https://arxiv.org/html/2606.04212#A2.SS2.p1.4)\.
- \[14\]P\. Foret, A\. Kleiner, H\. Mobahi, and B\. Neyshabur\(2021\)Sharpness\-aware minimization for efficiently improving generalization\.InInternational Conference on Learning Representations,Cited by:[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1),[§1](https://arxiv.org/html/2606.04212#S1.p7.1)\.
- \[15\]S\. Hochreiter and J\. Schmidhuber\(1997\)Flat minima\.Neural Computation9\(1\),pp\. 1–42\.Cited by:[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1),[§1](https://arxiv.org/html/2606.04212#S1.p7.1)\.
- \[16\]R\. Islamov, M\. Crawshaw, J\. Cohen, and R\. Gower\(2026\)Non\-euclidean gradient descent operates at the edge of stability\.arXiv preprint arXiv:2603\.05002\.External Links:2603\.05002,[Document](https://dx.doi.org/10.48550/arXiv.2603.05002),[Link](https://arxiv.org/abs/2603.05002)Cited by:[§1](https://arxiv.org/html/2606.04212#S1.p2.1)\.
- \[17\]P\. Izmailov, D\. Podoprikhin, T\. Garipov, D\. Vetrov, and A\. G\. Wilson\(2018\)Averaging weights leads to wider optima and better generalization\.InConference on Uncertainty in Artificial Intelligence,External Links:[Link](https://arxiv.org/abs/1803.05407)Cited by:[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1)\.
- \[18\]S\. Jastrzębski, Z\. Kenton, D\. Arpit, N\. Ballas, A\. Fischer, Y\. Bengio, and A\. Storkey\(2018\-09\)Three factors influencing minima in sgd\.arXiv preprint arXiv:1711\.04623\.Cited by:[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1),[§1](https://arxiv.org/html/2606.04212#S1.p1.1)\.
- \[19\]S\. Jastrzębski, Z\. Kenton, N\. Ballas, A\. Fischer, Y\. Bengio, and A\. Storkey\(2019\)On the relation between the sharpest directions of DNN loss and the SGD step length\.InInternational Conference on Learning Representations,Note:arXiv:1807\.05031External Links:[Link](https://openreview.net/forum?id=SkgEaj05t7)Cited by:[§B\.1](https://arxiv.org/html/2606.04212#A2.SS1.SSS0.Px1.p1.3),[§1](https://arxiv.org/html/2606.04212#S1.p2.1)\.
- \[20\]S\. Jastrzebski, M\. Szymczak, S\. Fort, D\. Arpit, J\. Tabor, K\. Cho, and K\. Geras\(2020\)The break\-even point on optimization trajectories of deep neural networks\.InInternational Conference on Learning Representations,Note:arXiv:2002\.09572External Links:[Link](https://openreview.net/forum?id=r1g87C4KwB)Cited by:[§B\.1](https://arxiv.org/html/2606.04212#A2.SS1.SSS0.Px1.p1.3),[§1](https://arxiv.org/html/2606.04212#S1.p2.1)\.
- \[21\]Y\. Jiang, B\. Neyshabur, H\. Mobahi, D\. Krishnan, and S\. Bengio\(2020\)Fantastic generalization measures and where to find them\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SJgIPJBFvH)Cited by:[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1)\.
- \[22\]D\. S\. Kalra, J\. Gagnon\-Audet, A\. Gromov, I\. Mediratta, K\. Niu, A\. H\. Miller, and M\. Shvartsman\(2026\)A scalable measure of loss landscape curvature for analyzing the training dynamics of llms\.arXiv preprint arXiv:2601\.16979\.Cited by:[§1](https://arxiv.org/html/2606.04212#S1.p2.1)\.
- \[23\]N\. S\. Keskar, D\. Mudigere, J\. Nocedal, M\. Smelyanskiy, and P\. T\. P\. Tang\(2017\)On large\-batch training for deep learning: generalization gap and sharp minima\.InInternational Conference on Learning Representations,Cited by:[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1),[§1](https://arxiv.org/html/2606.04212#S1.p1.1),[§1](https://arxiv.org/html/2606.04212#S1.p7.1)\.
- \[24\]D\. P\. Kingma and J\. Ba\(2014\)Adam: a method for stochastic optimization\.arXiv preprint arXiv:1412\.6980\.Cited by:[§C\.2](https://arxiv.org/html/2606.04212#A3.SS2.p1.1)\.
- \[25\]P\. W\. Koh and P\. Liang\(2017\)Understanding black\-box predictions via influence functions\.InInternational Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.70,pp\. 1885–1894\.External Links:[Link](https://proceedings.mlr.press/v70/koh17a.html)Cited by:[§B\.2](https://arxiv.org/html/2606.04212#A2.SS2.p1.4)\.
- \[26\]A\. Krizhevsky\(2009\)Learning multiple layers of features from tiny images\.Technical ReportUniversity of Toronto\.External Links:[Link](http://www.cs.toronto.edu/%CB%9Ckriz/cifar.html)Cited by:[§2\.3](https://arxiv.org/html/2606.04212#S2.SS3.SSS0.Px2.p1.16)\.
- \[27\]A\. Lewkowycz, Y\. Bahri, E\. Dyer, J\. Sohl\-Dickstein, and G\. Gur\-Ari\(2020\)The large learning rate phase of deep learning: the catapult mechanism\.arXiv preprint arXiv:2003\.02218\.Cited by:[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1)\.
- \[28\]K\. Lyu, Z\. Li, and S\. Arora\(2022\)Understanding the generalization benefit of normalization layers: sharpness reduction\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 34689–34708\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/dffd1c523512e557f4e75e8309049213-Abstract-Conference.html)Cited by:[§B\.1](https://arxiv.org/html/2606.04212#A2.SS1.SSS0.Px2.p1.2),[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1)\.
- \[29\]A\. Madry, A\. Makelov, L\. Schmidt, D\. Tsipras, and A\. Vladu\(2018\)Towards deep learning models resistant to adversarial attacks\.InInternational Conference on Learning Representations,Cited by:[§5\.2](https://arxiv.org/html/2606.04212#S5.SS2.SSS0.Px1.p1.7)\.
- \[30\]B\. Neyshabur, S\. Bhojanapalli, D\. McAllester, and N\. Srebro\(2017\)Exploring generalization in deep learning\.InAdvances in Neural Information Processing Systems,Vol\.30\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2017/hash/10ce03a1ed01077e3e289f3e53c72813-Abstract.html)Cited by:[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1)\.
- \[31\]M\. Paul, S\. Ganguli, and G\. K\. Dziugaite\(2021\)Deep learning on a data diet: finding important examples early in training\.InAdvances in Neural Information Processing Systems,Vol\.34\.Note:arXiv:2107\.07075External Links:[Document](https://dx.doi.org/10.48550/arXiv.2107.07075)Cited by:[§B\.2](https://arxiv.org/html/2606.04212#A2.SS2.p1.4)\.
- \[32\]M\. Pezeshki, S\. Kaba, Y\. Bengio, A\. Courville, D\. Precup, and G\. Lajoie\(2021\)Gradient starvation: a learning proclivity in neural networks\.InAdvances in Neural Information Processing Systems,Vol\.34\.External Links:[Link](https://proceedings.neurips.cc/paper/2021/hash/0987b8b338d6c90bbedd8631bc499221-Abstract.html)Cited by:[§B\.2](https://arxiv.org/html/2606.04212#A2.SS2.p1.4)\.
- \[33\]B\. T\. Polyak\(1964\)Some methods of speeding up the convergence of iteration methods\.USSR Computational Mathematics and Mathematical Physics4\(5\),pp\. 1–17\.External Links:[Document](https://dx.doi.org/10.1016/0041-5553%2864%2990137-5)Cited by:[Figure 14](https://arxiv.org/html/2606.04212#A3.F14)\.
- \[34\]E\. Rosenfeld and A\. Risteski\(2024\)Outliers with opposing signals have an outsized effect on neural network optimization\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=kIZ3S3tel6)Cited by:[§B\.2](https://arxiv.org/html/2606.04212#A2.SS2.p1.4),[§B\.4](https://arxiv.org/html/2606.04212#A2.SS4.p1.1)\.
- \[35\]M\. Saether, A\. Kolic, T\. Poggio, and P\. Beneventano\(2026\)Does weight decay enhance training stability?\.arXiv preprint arXiv:2605\.16622\.Cited by:[§1](https://arxiv.org/html/2606.04212#S1.p2.1)\.
- \[36\]B\. Sorscher, R\. Geirhos, S\. Shekhar, S\. Ganguli, and A\. S\. Morcos\(2022\)Beyond neural scaling laws: beating power law scaling via data pruning\.InAdvances in Neural Information Processing Systems,Vol\.35\.External Links:[Link](https://openreview.net/forum?id=UmvSlP-PyV)Cited by:[§B\.2](https://arxiv.org/html/2606.04212#A2.SS2.p1.4)\.
- \[37\]S\. Swayamdipta, R\. Schwartz, N\. Lourie, Y\. Wang, H\. Hajishirzi, N\. A\. Smith, and Y\. Choi\(2020\)Dataset cartography: mapping and diagnosing datasets with training dynamics\.InEmpirical Methods in Natural Language Processing,pp\. 9275–9293\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.746)Cited by:[§B\.2](https://arxiv.org/html/2606.04212#A2.SS2.p1.4)\.
- \[38\]M\. Toneva, A\. Sordoni, R\. Tachet des Combes, A\. Trischler, Y\. Bengio, and G\. J\. Gordon\(2019\)An empirical study of example forgetting during deep neural network learning\.InInternational Conference on Learning Representations,Cited by:[§B\.2](https://arxiv.org/html/2606.04212#A2.SS2.p1.4)\.
- \[39\]L\. Wu, Z\. Zhu, and W\. E\(2017\)Towards understanding generalization of deep learning: perspective of loss landscapes\.arXiv preprint arXiv:1706\.10239\.External Links:1706\.10239,[Link](https://arxiv.org/abs/1706.10239)Cited by:[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1)\.
- \[40\]C\. Xing, D\. Arpit, C\. Tsirigotis, and Y\. Bengio\(2018\)A walk with sgd\.arXiv preprint arXiv:1802\.08770\.External Links:1802\.08770,[Document](https://dx.doi.org/10.48550/arXiv.1802.08770),[Link](https://arxiv.org/abs/1802.08770)Cited by:[§1](https://arxiv.org/html/2606.04212#S1.p2.1)\.
- \[41\]C\. Zhang, S\. Bengio, M\. Hardt, B\. Recht, and O\. Vinyals\(2017\)Understanding deep learning requires rethinking generalization\.InInternational Conference on Learning Representations,Note:arXiv:1611\.03530External Links:[Document](https://dx.doi.org/10.48550/arXiv.1611.03530)Cited by:[§1](https://arxiv.org/html/2606.04212#S1.p1.1)\.
- \[42\]X\. Zhu, Z\. Wang, X\. Wang, M\. Zhou, and R\. Ge\(2023\)Understanding edge\-of\-stability training dynamics with a minimalist example\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=p7EagBsMAEO)Cited by:[§B\.1](https://arxiv.org/html/2606.04212#A2.SS1.SSS0.Px2.p1.2),[§B\.3](https://arxiv.org/html/2606.04212#A2.SS3.p1.1)\.

## Appendix A Experimental setup

### A.1 Architecture

We use a fully connected MLP that flattens each input image to a vector, then applies two hidden linear layers of width $512$ with ReLU activations, followed by a final linear classifier to 2 output classes.
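The architecture described above can be sketched in a few lines of NumPy. This is an illustrative forward pass only, assuming CIFAR-10-sized inputs ($3 \times 32 \times 32$, since the codebase in A.5 provides a CIFAR-10 pipeline); the function names `init_mlp` and `mlp_forward` and the Gaussian initialization are hypothetical and do not reproduce the paper's initialization scaling.

```python
import numpy as np

def init_mlp(input_dim=3 * 32 * 32, width=512, n_classes=2, seed=0):
    """Weights for flatten -> 512 -> 512 -> 2 (init scheme is an assumption)."""
    rng = np.random.default_rng(seed)
    dims = [input_dim, width, width, n_classes]
    params = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
        params.append((W, np.zeros(fan_out)))
    return params

def mlp_forward(params, images):
    """Flatten each image, apply two ReLU hidden layers, then a linear head."""
    h = images.reshape(images.shape[0], -1)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:  # no activation on the output logits
            h = np.maximum(h, 0.0)
    return h
```

Under the assumed input size, this network has 1,837,058 trainable parameters, which gives a sense of the dimension over which the Hessian-vector products are computed.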

### A.2 Training procedure and hyperparameters

Our primary experiments use full-batch GD, which is deterministic given a fixed initialization. We varied the initialization seed to obtain different optimization trajectories. The same five predetermined seeds were used for all plots shown in the main text. An initial scaling of 0.2 was applied to all runs.

### A.3 Branching Intervention Implementation

We isolate the effect of a learning\-rate drop at the edge of stability by running each configuration as a matched pair:

- a *baseline* trained at $\eta$ for the full step budget;
- a *fork* that follows the baseline trajectory up to a fork step $t^{*}$ and then applies a single scheduled drop to $\eta' = c\eta$ (we use $c=0.5$).

The fork step $t^{*}$ is defined as the first logged step at which $\lambda_{\max}(\nabla^{2}\mathcal{L})$ crosses $2/\eta$, the EoS threshold for full-batch GD. We use a learning rate of $0.01$ for all plots.
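As a rough illustration of this detection rule, the sketch below estimates the top Hessian eigenvalue by power iteration using only Hessian-vector products, then returns the first logged step at or above $2/\eta$. The helper names `top_eigenvalue` and `first_crossing` are hypothetical, and power iteration is an assumed estimator (the actual logging code may use a different eigensolver). Note that power iteration converges to the eigenvalue of largest magnitude, which matches $\lambda_{\max}$ only when the top eigenvalue dominates any negative curvature.

```python
import numpy as np

def top_eigenvalue(hvp, dim, n_iters=200, seed=0):
    """Power iteration on a Hessian accessed only through hvp(v) = H @ v."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iters):
        hv = hvp(v)
        lam = float(v @ hv)          # Rayleigh quotient at the current v
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return lam, v

def first_crossing(logged_steps, sharpness, eta):
    """First logged step whose sharpness reaches 2/eta, or None if none does."""
    for step, lam in zip(logged_steps, sharpness):
        if lam >= 2.0 / eta:
            return step
    return None
```

For example, with $\eta = 0.01$ the threshold is $200$, so a sharpness log of 150, 190, 201, 199 at steps 0, 32, 64, 96 would give a fork step of 64.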

Because GD is deterministic given a fixed initialization and has no optimizer state beyond the weights, the fork is implemented by resuming from a checkpoint of the baseline taken just before $t^{*}$ and continuing at $\eta$ up to $t^{*}$, at which point the drop is applied. All other settings are identical to the baseline: dataset and initialization seeds, architecture, loss, batch size, step budget, and the fixed prototype subset.

Both branches log all diagnostics on a uniform every-32-steps grid spanning the full training window, ensuring the baseline and fork are sampled at identical step indices so their per-prototype curves can be compared index-for-index without interpolation. The only quantity that differs between a baseline and its fork is the scheduled drop $\eta \to c\eta$ at step $t^{*}$. This makes the fork a minimal counterfactual for the learning-rate drop and supports the causal language in Section [3](https://arxiv.org/html/2606.04212#S3). Both the fork and the baseline are compared via distance traveled, $\eta \times \text{step}$, to provide a valid comparison of speed across different learning rates.
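The matched-pair design can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: `run_branches` and `grad_fn` are hypothetical names, the fork is obtained by replaying the deterministic trajectory rather than loading a checkpoint, and "distance traveled" is interpreted as the cumulative sum of learning rates, which equals $\eta \times \text{step}$ for the baseline.

```python
import numpy as np

def run_branches(grad_fn, w0, eta, fork_step, n_steps, c=0.5, log_every=32):
    """Run a baseline at eta and a fork that drops to c*eta at fork_step.

    Both branches log (step, distance, weights) on the same grid, so their
    curves can be compared index-for-index.
    """
    logs = {"baseline": [], "fork": []}
    for name in logs:
        w = np.array(w0, dtype=float)
        distance = 0.0  # cumulative learning rate (assumed meaning of eta * step)
        for step in range(n_steps + 1):
            if step % log_every == 0:
                logs[name].append((step, distance, w.copy()))
            if step == n_steps:
                break
            dropped = name == "fork" and step >= fork_step
            lr = c * eta if dropped else eta
            w = w - lr * grad_fn(w)  # full-batch GD has no other optimizer state
            distance += lr
    return logs
```

Because both branches share every update before `fork_step`, any difference in the logged quantities afterward is attributable to the learning-rate drop alone.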

### A.4 EoS detection and timing

The effect size depends on when $t^{*}$ falls relative to the onset of the second EoS regime. When $t^{*}$ lies well after the sharpness plateau is established, the post-fork branches converge to similar outcomes and the trade-off is attenuated. This is consistent with the mechanism: the intervention acts by redirecting a constraint-shaped trajectory, so once that shaping has occurred, it has less leverage.

The logged $t^{*}$ has two sources of small offset relative to the true EoS onset. First, the every-32-steps logging grid limits detection resolution: $t^{*}$ can only be flagged at a logged step, so the actual crossing may have occurred up to 32 steps earlier.

Second, the cubic term of the Taylor expansion around a point adds a jitter around $2/\eta$ \[[7](https://arxiv.org/html/2606.04212#bib.bib27)\]. Specifically, the period-2 orbit of GD exists for $\eta\in\left(2/f''(\bar{x}),\; 2/\big(f''(\bar{x})-\epsilon\cdot f^{(3)}(\bar{x})\big)\right)$, so the bifurcation is governed by the curvature at the orbit, $f''(\bar{x})-\epsilon\cdot f^{(3)}(\bar{x})$, rather than the curvature at the minimum, $f''(\bar{x})$, which our $2/\eta$ crossing criterion targets. Together, these two effects can cause the logged $t^{*}$ to lead or lag the visible onset of oscillation by a small number of steps. This is consistent with the small offsets visible in some figures and does not affect the branching design: both branches share the trajectory up to $t^{*}$, so the post-fork comparison is unaffected by a few-step mistiming in the detection of EoS onset.

### A.5 Compute and Codebase

Compute. All experiments ran on a single NVIDIA L40S GPU (48 GB) on an internal academic SLURM cluster, with 8 CPU cores and 64 GB of system RAM per job. A 10,000-step run takes approximately 10 minutes for the MLP and 30 minutes for CNN/ResNet models. The number of GPU-hours is on the order of 15 for all reported experiments.

Codebase. Our implementation builds on the open-source codebase ([https://github.com/arseniqum/edge-of-stochastic-stability](https://github.com/arseniqum/edge-of-stochastic-stability)) \[[4](https://arxiv.org/html/2606.04212#bib.bib29)\] (Apache 2.0; \[[5](https://arxiv.org/html/2606.04212#bib.bib28)\]), which provides the CIFAR-10 training pipeline and curvature/sharpness logging used in prior work on edge of stochastic stability. We extend this framework with (i) a prototype taxonomy and synthetic outlier construction, (ii) an EoS branching intervention, (iii) per-group alignment diagnostics (gradient norms and cosine alignment), and (iv) per-subset loss and sharpness metrics.

## Appendix B Related Work

Our work connects three threads: training dynamics at EoS, sample\-centric analyses of which examples drive learning, and curvature\-based implicit regularization\. We argue these are linked: the directional structure at EoS governs how optimization effort is distributed across the data, connecting where the optimizer moves in parameter space to what it learns\.

### B.1 Training Dynamics at the Edge of Stability

#### Onset of EoS\.

EoS occurs when the top Hessian eigenvalue grows along the optimization trajectory until it reaches the stability threshold $2/\eta$ \[[19](https://arxiv.org/html/2606.04212#bib.bib37), [20](https://arxiv.org/html/2606.04212#bib.bib38), [9](https://arxiv.org/html/2606.04212#bib.bib12)\]. Cohen et al. \[[9](https://arxiv.org/html/2606.04212#bib.bib12)\] termed this growth phase progressive sharpening. In full-batch gradient descent, this gives rise to the edge of stability, a regime where sharpness saturates near $2/\eta$ while the loss continues to decrease \[[9](https://arxiv.org/html/2606.04212#bib.bib12)\]. Recent work has shown that EoS-like phenomena extend beyond full-batch GD to adaptive optimizers, stochastic training, and momentum dynamics \[[8](https://arxiv.org/html/2606.04212#bib.bib39), [4](https://arxiv.org/html/2606.04212#bib.bib29), [3](https://arxiv.org/html/2606.04212#bib.bib32)\]. Our goal is complementary: we focus on the cleanest setting for EoS, full-batch gradient descent, and use this controlled regime to show that the global stability constraint is also a selective mechanism over the data distribution.

#### Self\-stabilization at EoS\.

A mechanistic account is provided by Damian et al., where oscillations along the top Hessian eigenvector $\mathbf{v}_1$ bound curvature via self-stabilization \[[10](https://arxiv.org/html/2606.04212#bib.bib13)\]. Specifically, these oscillations can feed back through higher-order derivatives to reduce sharpness, yielding an implicit projected dynamics near the stability boundary \[[10](https://arxiv.org/html/2606.04212#bib.bib13)\]. Prior work characterizes EoS through global dynamics: fast oscillation along $\mathbf{v}_1$ and slow sharpness-reducing drift along flat directions \[[42](https://arxiv.org/html/2606.04212#bib.bib41), [28](https://arxiv.org/html/2606.04212#bib.bib42), [6](https://arxiv.org/html/2606.04212#bib.bib11)\]. We instead ask which examples contribute to these dynamics, revealing a selective effect over the data distribution that scalar sharpness obscures.

### B.2 Sample-Centric Learning

A parallel literature studies which individual examples drive learning \[[25](https://arxiv.org/html/2606.04212#bib.bib44), [13](https://arxiv.org/html/2606.04212#bib.bib17), [12](https://arxiv.org/html/2606.04212#bib.bib18), [38](https://arxiv.org/html/2606.04212#bib.bib19), [37](https://arxiv.org/html/2606.04212#bib.bib43), [31](https://arxiv.org/html/2606.04212#bib.bib9), [32](https://arxiv.org/html/2606.04212#bib.bib14), [34](https://arxiv.org/html/2606.04212#bib.bib16), [36](https://arxiv.org/html/2606.04212#bib.bib45)\]. Rare or atypical examples can disproportionately influence generalization: memorizing them is necessary for near-optimal generalization on long-tailed distributions \[[13](https://arxiv.org/html/2606.04212#bib.bib17), [12](https://arxiv.org/html/2606.04212#bib.bib18)\]. Toneva et al. \[[38](https://arxiv.org/html/2606.04212#bib.bib19)\] track forgetting events and find that rare examples are repeatedly forgotten and relearned. Other work scores rare examples by difficulty throughout training and finds that ambiguous examples (those whose predictions flip throughout training) are key for out-of-distribution generalization \[[37](https://arxiv.org/html/2606.04212#bib.bib43)\]. Rare examples can also dominate gradient signal and sharpening dynamics \[[31](https://arxiv.org/html/2606.04212#bib.bib9), [32](https://arxiv.org/html/2606.04212#bib.bib14), [34](https://arxiv.org/html/2606.04212#bib.bib16)\]. In these approaches, example difficulty is defined relative to the model state. We instead define example groups directly from the data distribution and ask how optimization treats them, shifting the question from which examples are hard to which are structurally favored. Paul et al. \[[31](https://arxiv.org/html/2606.04212#bib.bib9)\] also introduce GraNd (per-example gradient norm) and EL2N (per-example error-vector norm) as scores that rank training examples by influence on optimization. Our curvature-influence score $(\nabla\ell_i \cdot \mathbf{v}_1)^2 = \|\nabla\ell_i\|^2 \cos^2\theta_i$ adds the $\cos^2\theta_i$ factor on top of GraNd's $\|\nabla\ell_i\|^2$. The $\cos^2$ factor is what makes the score predictive of the EoS beneficiary's identity, and it is invisible to GraNd and EL2N.
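The decomposition of the curvature-influence score can be checked numerically. The sketch below is illustrative only: `curvature_influence` is a hypothetical helper, and the per-example gradients and $\mathbf{v}_1$ are made-up toy vectors rather than quantities from a trained model.

```python
import numpy as np

def curvature_influence(per_example_grads, v1):
    """Return (score, squared gradient norm, cos^2) for each example.

    score_i = (grad_i . v1)^2 = ||grad_i||^2 * cos^2(theta_i), where theta_i
    is the angle between grad_i and the unit top Hessian eigenvector v1.
    """
    v1 = v1 / np.linalg.norm(v1)
    proj = per_example_grads @ v1
    score = proj ** 2
    sq_norm = np.sum(per_example_grads ** 2, axis=1)
    # cos^2 is defined as 0 for an exactly zero gradient
    cos2 = np.divide(score, sq_norm, out=np.zeros_like(score), where=sq_norm > 0)
    return score, sq_norm, cos2

# Toy example: the first two gradients have the same norm but opposite alignment.
grads = np.array([[3.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
score, sq_norm, cos2 = curvature_influence(grads, np.array([1.0, 0.0]))
```

In this toy case the first two examples have identical squared gradient norms (9 each), so a norm-based score cannot separate them, while their curvature-influence scores are 9 and 0 respectively.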

### B.3 Implicit regularization through curvature

Curvature has been widely linked to generalization: sharp minima correlate with worse performance\[[23](https://arxiv.org/html/2606.04212#bib.bib20)\], while optimization methods and hyperparameters implicitly bias toward flatter solutions\[[14](https://arxiv.org/html/2606.04212#bib.bib22),[27](https://arxiv.org/html/2606.04212#bib.bib23),[18](https://arxiv.org/html/2606.04212#bib.bib8),[15](https://arxiv.org/html/2606.04212#bib.bib6),[23](https://arxiv.org/html/2606.04212#bib.bib20),[21](https://arxiv.org/html/2606.04212#bib.bib46),[30](https://arxiv.org/html/2606.04212#bib.bib47),[39](https://arxiv.org/html/2606.04212#bib.bib48)\]\. Other methods make the flatness objective explicit or directly bias the final iterate toward flatter regions, including Sharpness\-Aware Minimization and stochastic weight averaging\[[14](https://arxiv.org/html/2606.04212#bib.bib22),[17](https://arxiv.org/html/2606.04212#bib.bib49)\]\. Much of the flatness literature treats curvature as a property of the final solution or of the algorithm’s implicit bias\. Work on EoS studies curvature dynamically along the training trajectory, including its convergence and implicit\-regularization effects\[[6](https://arxiv.org/html/2606.04212#bib.bib11),[2](https://arxiv.org/html/2606.04212#bib.bib50),[1](https://arxiv.org/html/2606.04212#bib.bib7),[42](https://arxiv.org/html/2606.04212#bib.bib41),[28](https://arxiv.org/html/2606.04212#bib.bib42)\]\. We add a data\-level perspective: at EoS, curvature acts selectively across examples, with functional consequences for which subsets generalize and which are robust\.

### B.4 Our positioning

This work draws on three lines of work: the dynamics of training at EoS, sample-centric analyses of which examples drive learning, and implicit regularization through curvature. The threads are usually treated separately, and this paper's contribution sits where they meet: the directional structure of EoS governs how optimization effort is distributed across the data distribution, connecting where the optimizer goes in parameter space to what it learns from the data. The closest prior work \[[34](https://arxiv.org/html/2606.04212#bib.bib16)\] observes that small groups of outliers with large-magnitude features have an outsized effect on sharpening and EoS dynamics; we extend this with a model-independent prototype taxonomy and an analysis based on the initial data distribution.

## Appendix C Architecture, Optimizer, Class Pair Robustness

Our primary results focus on an MLP trained with full\-batch gradient descent under both mean\-squared error and cross\-entropy losses\. In this section, we assess robustness across alternative architectures, optimizers, and class pairs, using 3 seeds and $\eta = 0.01$\. We examine \(i\) the divergence in sharpness between baseline and exit\-from\-EoS runs, \(ii\) the curvature profile of the baseline, and \(iii\) how this curvature predicts the intervention effect, defined as $\Delta\ell_k = \ell_k^{\text{exit}} - \ell_k^{\text{EoS}}$\.

### C\.1 Architecture robustness

The network’s architecture determines how input\-space geometry maps to gradient\-space geometry\. In an MLP, which processes raw pixel vectors, centroid distance translates directly into gradient atypicality: pixel\-space outliers produce distinctly oriented gradients\. Convolutional architectures transform this mapping through learned spatial features, pooling, and normalization, which can compress pixel\-space differences that the MLP preserves\. As a result, the group with the highest centroid distance need not be the group with the largest curvature influence\.

The predictive quantity $(\nabla\ell_k \cdot v_1)^2$ remains consistent across architectures; what changes is which group achieves the highest value\.

CNN\.

The CNN consists of three convolutional layers with channel widths $64, 64, 128$, all using $3\times 3$ kernels, stride $1$, no padding, and ReLU activations, with $2\times 2$ max\-pooling after the second and third convolutional layers\. The resulting feature map is flattened and passed through a fully connected hidden layer of width $512$ with ReLU activation, followed by a linear classifier to $C$ output classes\.
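As a rough sanity check on the CNN described above, the sketch below tracks the spatial size of the feature map through the three unpadded 3×3 convolutions and the two 2×2 max-pools. The helper name `cnn_flat_features`, the pooling stride of 2, and the example input resolutions are assumptions for illustration; the text here does not state the input size.

```python
def cnn_flat_features(height, width, channels=(64, 64, 128), pool_after=(1, 2)):
    """Flattened feature count for unpadded 3x3 convs (stride 1) with 2x2 max-pools.

    `pool_after` holds the zero-based indices of the conv layers that are
    followed by pooling (the second and third, as described in the text).
    A pooling stride of 2 is assumed.
    """
    for idx, _ in enumerate(channels):
        # A 3x3 kernel with stride 1 and no padding trims one pixel per border.
        height, width = height - 2, width - 2
        if idx in pool_after:
            # 2x2 max-pooling with stride 2 halves each dimension (floor).
            height, width = height // 2, width // 2
    return channels[-1] * height * width


# Hypothetical input resolutions, chosen only to exercise the helper.
features_28 = cnn_flat_features(28, 28)
features_32 = cnn_flat_features(32, 32)
```

Under these assumptions the flattened vector feeding the width-512 hidden layer has 128 channels times the remaining spatial area, so the size of that layer's weight matrix depends directly on the input resolution.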

![Refer to caption](https://arxiv.org/html/2606.04212v1/x11.png) Figure 11: GD CNN MSE\. Curvature influence is comparable for output\-outliers and input\-outliers, and both benefit\.

ResNet\.

We use a batch\-normalization\-free ResNet\-14 with an initial $3\times 3$ convolutional stem of width $16$, followed by three residual stages with channel widths $16, 32, 64$ and block counts $[2,2,2]$\. Each residual block contains two $3\times 3$ convolutions with ReLU activations and identity BatchNorm replacements; downsampling occurs in the first block of stages 2 and 3\. The network ends with global average pooling and a final linear classifier to $C$ output classes\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x12.png) Figure 12: GD ResNet MSE\. Curvature influence is highest for output\-outliers, which are the primary beneficiaries of EoS\.

### C\.2 Optimizer robustness

We focus on optimizers that leave the curvature landscape fixed\. We do not extend this analysis to adaptive optimizers such as Adam\[[24](https://arxiv.org/html/2606.04212#bib.bib31)\], because it reshapes the curvature landscape at each step\. Instead, we consider stochastic gradient descent \(SGD\) and full\-batch gradient descent with momentum, which add randomness and acceleration, respectively, to each step the optimizer takes down the loss landscape\. With large\-batch SGD \(Figure [13](https://arxiv.org/html/2606.04212#A3.F13)\), the group with the highest curvature influence \(input\-outliers\) is also the primary beneficiary of EoS\. Similarly, with momentum \(Figure [14](https://arxiv.org/html/2606.04212#A3.F14)\), input\-outliers have the highest curvature influence, and the intervention benefits them\.

SGD \(batch size = 128\)\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x13.png) Figure 13: SGD MLP MSE\. Curvature influence is highest for input\-outliers, which are the primary beneficiaries of EoS\.

GD with Momentum \($\beta = 0.9$\)\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x14.png) Figure 14: GD Momentum MLP MSE\. Curvature influence is highest for input\-outliers, which are the primary beneficiaries of EoS\. Momentum is implemented as in\[[33](https://arxiv.org/html/2606.04212#bib.bib33)\], and the EoS threshold is implemented as described for large\-batch momentum in\[[3](https://arxiv.org/html/2606.04212#bib.bib32)\]\.
### C\.3 Class pair robustness

A harder classification task\. On the closer pair \(3,5\), $\alpha = 3$ is no longer sufficient to make input\-outliers the most atypical subset; boundary points dominate atypicality instead \(Figure [15](https://arxiv.org/html/2606.04212#A3.F15)\)\. The alignment principle predicts that boundary points should therefore capture the EoS advantage on this pair, and they do \(Figure [16](https://arxiv.org/html/2606.04212#A3.F16)\)\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x15.png) Figure 15: Distribution of centroid distance by prototype subgroup on the \(3,5\) class pair\. Boundary points have the largest median distance under $\alpha=3$, but input\-outliers have the largest median distance under $\alpha=10$\.

Ablation on $\alpha$\. At $\alpha=3$, boundary points are the primary beneficiaries at EoS\. At $\alpha=10$, input\-outliers become the most distant from their class centroids and the EoS advantage transfers to them\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x16.png) Figure 16: Curvature influence under input\-outlier construction \($\alpha=3$\)\. The boundary group has the highest curvature influence and is the primary beneficiary of EoS\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x17.png) Figure 17: Curvature influence under input\-outlier construction \($\alpha=10$\)\. The input\-outlier group has the highest curvature influence and is the primary beneficiary of EoS\.

## Appendix D Input\-Outlier Construction Ablation

The conceptual schematic for the directional perturbation is shown in Figure [18](https://arxiv.org/html/2606.04212#A4.F18)\. Figure [19](https://arxiv.org/html/2606.04212#A4.F19) confirms that the geometric atypicality of input\-outliers is preserved under the random\-direction control\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x18.png) Figure 18: Schematic for coherent vs\. incoherent input\-outlier construction\. Left: input\-outliers displaced along a shared direction $v_{\mathrm{diff}}$; per\-example gradients reinforce, producing a dominant curvature contribution along one direction\. Right: same displacement distance but in random orthogonal directions; per\-example gradients partially cancel, producing a diffuse contribution across many eigenvectors\. Seed points, labels, and centroid distances are identical across conditions\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x19.png) Figure 19: Centroid distance is preserved under the random\-direction control \(median 69\.4 vs\. 70\.3 for along\-$v_{\text{diff}}$ and random\-direction outliers, respectively\)\. Geometric atypicality is conserved\.
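To make the coherent vs. random-direction contrast concrete, here is a minimal sketch of one way such a control could be built. It is not the authors' construction: the function name `displace`, the Gaussian seed points, the dimension, and the displacement scale are all hypothetical, and random directions are drawn isotropically rather than constrained to be mutually orthogonal. The sketch matches displacement norms exactly, which is a simplification of matching centroid distances.

```python
import numpy as np


def displace(seeds, v_diff, scale, coherent, rng):
    """Move each seed point by a vector of norm `scale`.

    coherent=True  -> every point moves along the shared unit direction v_diff.
    coherent=False -> each point moves along its own random unit direction.
    """
    n, d = seeds.shape
    if coherent:
        dirs = np.tile(v_diff / np.linalg.norm(v_diff), (n, 1))
    else:
        dirs = rng.standard_normal((n, d))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return seeds + scale * dirs


rng = np.random.default_rng(0)
seeds = rng.standard_normal((200, 50))   # hypothetical seed points
v_diff = rng.standard_normal(50)         # hypothetical shared direction

coh = displace(seeds, v_diff, 10.0, True, rng)
rnd = displace(seeds, v_diff, 10.0, False, rng)

# Per-point displacement norms are identical in both conditions ...
coh_norms = np.linalg.norm(coh - seeds, axis=1)
rnd_norms = np.linalg.norm(rnd - seeds, axis=1)
# ... but only the coherent condition has a large average displacement.
coh_mean_shift = np.linalg.norm((coh - seeds).mean(axis=0))
rnd_mean_shift = np.linalg.norm((rnd - seeds).mean(axis=0))
```

The per-point distances agree while the group-average shift nearly cancels in the random condition, which mirrors the intended design: distance is held fixed and only directional coherence changes.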
## Appendix E Natural points

We verify that the alignment principle holds on natural geometric outliers, with no synthetic displacement applied\. Figure [20](https://arxiv.org/html/2606.04212#A5.F20) plots the Spearman rank correlation coefficient between centroid distance and per\-example alignment $\cos^2(\nabla\ell_i, v_1)$ across training: the correlation is near zero during progressive sharpening and rises at EoS onset\. The Spearman coefficient measures rank agreement; lines show the line of best fit for visual clarity\.
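For readers who want to reproduce this kind of diagnostic, the sketch below computes a Spearman rank correlation with NumPy only, as the Pearson correlation of ranks. The function names and the synthetic `distance` and `alignment` arrays are hypothetical stand-ins for centroid distances and per-example $\cos^2$ alignments; ties are not averaged, which is adequate for continuous-valued data.

```python
import numpy as np


def rankdata(x):
    """Ranks 1..n by sorted order (no tie averaging; fine for continuous data)."""
    x = np.asarray(x, dtype=float)
    ranks = np.empty(len(x), dtype=float)
    ranks[np.argsort(x)] = np.arange(1, len(x) + 1)
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(rankdata(x), rankdata(y))[0, 1])


rng = np.random.default_rng(0)
distance = rng.uniform(0.0, 1.0, size=100)                  # hypothetical centroid distances
alignment = distance**3 + 0.01 * rng.standard_normal(100)   # hypothetical alignment values
rho = spearman(distance, alignment)
```

Because only ranks enter the statistic, a monotone but nonlinear relation (such as the cubic used here) still yields a high coefficient, which is why rank correlation suits a relationship described as monotonic in distance.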

Figure[21](https://arxiv.org/html/2606.04212#A5.F21)shows the underlying per\-example trajectories\. Atypical points \(red\) exhibit high alignment that peaks just before EoS onset and then declines, while typical points \(blue\) remain weakly aligned over the same interval\. This relationship is monotonic with respect to distance from the centroid near EoS onset\. Later in training, the ordering reverses, with typical points eventually exceeding atypical ones in alignment\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x20.png) Figure 20: Top: beginning of training; the correlation is initially negative and increases\. Middle: EoS onset; the correlation increases monotonically\. Bottom: the correlation peaks when large\-amplitude oscillations along $\mathbf{v}_1$ develop, then decreases monotonically to negative values\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x21.png) Figure 21: Top: correlation rises at EoS onset and peaks with sharpness oscillations\. Bottom: alignment vs\. centroid\-distance percentile over training\. Atypical points peak near EoS \(step 660\), then decline, and dominance shifts to typical points later in training\.
## Appendix F Theory: EoS Self\-stabilization as a Subset Selector

This appendix gives the local mathematical mechanism behind the subset\-level effects measured in the main text\. The starting point is the cubic self\-stabilization model of\[[10](https://arxiv.org/html/2606.04212#bib.bib13)\]: at the edge of stability, gradient descent admits a slow, cycle\-averaged description as projected gradient descent on the active sharpness constraint $S(\theta)=2/\eta$\. Under this assumption, we derive a first\-order decomposition of the branch differential into two distinct contributions and isolate the two mechanisms tested experimentally: directional alignment and gradient persistence\.

### F\.1 Notation and assumptions

Let $L:\mathbb{R}^d\to\mathbb{R}$ denote the full empirical training loss, and let

$$\ell_k(\theta) := \frac{1}{|P_k|}\sum_{i\in P_k}\ell(f_\theta(x_i),y_i)$$

be the loss restricted to prototype group $k$\. Let $H(\theta):=\nabla^2 L(\theta)$, $S(\theta):=\lambda_{\max}(H(\theta))$, and $\mathbf{v}_1(\theta)$ denote the Hessian, sharpness, and unit top Hessian eigenvector respectively \(assumed simple throughout\)\. We compare two branches forked from $\theta^* = \theta_{t^*}$ at EoS onset, where $\eta S(\theta^*)\approx 2$\. The baseline branch continues at learning rate $\eta$\. The exit branch uses learning rate $c\eta$ with $c\in(0,1)$, locally stable at the fork\. We use the baseline learning\-rate\-scaled time

$$\tau := \eta(t-t^*),$$

where $t$ denotes the gradient descent iteration index and $t^*$ denotes the EoS onset step at which the branches fork\. This scaling gives a continuous\-time comparison variable aligned to the baseline branch\. Define the cycle\-averaged \(two\-step average\) branch differential

$$\bar{\Delta}\ell_k(\tau) := \bar{\ell}_k(\theta_{\text{exit}}(\tau)) - \bar{\ell}_k(\theta_{\text{EoS}}(\tau)),$$

where the bar denotes the two\-step average that suppresses the $O(\delta)$ phase term along $\mathbf{v}_1$ generated by the period\-two EoS oscillation\. Define the sharpness\-gradient quantities

$$\alpha := -\langle\nabla L,\nabla S\rangle,\qquad \beta := \|\nabla S\|^2,\qquad \delta := \sqrt{2\alpha/\beta},$$

where $\delta$ measures the oscillation amplitude along $\mathbf{v}_1$ in the EoS regime\[[10](https://arxiv.org/html/2606.04212#bib.bib13)\], and progressive sharpening corresponds to $\alpha>0$\. We use the following standing assumptions\.
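The sharpness-gradient quantities are simple inner products once $\nabla L$ and $\nabla S$ are available as vectors. The sketch below evaluates $\alpha$, $\beta$, and $\delta$ for given arrays; the function name and the toy vectors are hypothetical, and obtaining $\nabla S$ in practice (for example by differentiating through a top-eigenvalue estimate) is outside the scope of this illustration.

```python
import numpy as np


def sharpness_gradient_quantities(grad_L, grad_S):
    """Return (alpha, beta, delta) from the loss gradient and sharpness gradient.

    alpha = -<grad_L, grad_S>, beta = ||grad_S||^2, delta = sqrt(2 * alpha / beta).
    delta is reported as NaN when alpha <= 0, i.e. outside progressive sharpening.
    """
    alpha = -float(grad_L @ grad_S)
    beta = float(grad_S @ grad_S)
    delta = float(np.sqrt(2.0 * alpha / beta)) if alpha > 0 else float("nan")
    return alpha, beta, delta


# Toy vectors (hypothetical): the descent direction -grad_L increases sharpness.
grad_L = np.array([-1.0, 0.5])
grad_S = np.array([2.0, 0.0])
alpha, beta, delta = sharpness_gradient_quantities(grad_L, grad_S)
```

In this toy case the negative gradient has a positive component along $\nabla S$, so $\alpha>0$ and the configuration corresponds to progressive sharpening.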

#### \(A1\) Smoothness and simple top eigenvalue\.

$L\in C^3$ \($L$ is three times continuously differentiable\) and each $\ell_k\in C^2$ \($\ell_k$ is twice continuously differentiable\) in a neighborhood of $\theta^*$, with a simple largest Hessian eigenvalue\.

#### \(A2\) Local EoS approximation\.

The baseline EoS trajectory admits a slow, cycle\-averaged description as projected gradient descent on the active sharpness constraint $S(\theta)=2/\eta$, in the sense of\[[10](https://arxiv.org/html/2606.04212#bib.bib13)\]\.

#### \(A3\) Stable exit\.

The exit branch is locally stable in the top direction at the fork: $c\eta S(\theta^*)<2$\.

#### \(A4\) Short\-window Taylor regime\.

Post\-fork comparisons are made in a window short enough that gradients and Hessians admit a Taylor expansion around $\theta^*$ with $O(\tau^2)$ remainder\.

#### \(A5\) Cycle averaging\.

Subset losses are compared via two\-step averaging or short\-window smoothing, suppressing the $O(\delta)$ instantaneous phase fluctuation along $\mathbf{v}_1$ at EoS\.

### F\.2 Projected drift at the sharpness boundary

###### Lemma 1 \(EoS slow drift\)\.

Under \(A1\)–\(A2\), the cycle\-averaged EoS drift at $\theta^*$ is

$$\dot{\theta}_{\text{EoS}} = -\nabla L - \frac{\alpha}{\beta}\nabla S.$$

Under \(A3\), the exit branch in baseline time $\tau$ has leading drift $\dot{\theta}_{\text{exit}} = -c\nabla L$\.

###### Proof\.

The tangent space of the active boundary $\mathcal{M} := \{\theta : S(\theta)=2/\eta\}$ at $\theta^*$ is $\{z : \langle\nabla S, z\rangle = 0\}$\. Removing the normal component of $\nabla L$ along $\nabla S$ gives

$$\nabla L|_{\mathcal{M}} = \nabla L - \frac{\langle\nabla L,\nabla S\rangle}{\|\nabla S\|^2}\nabla S = \nabla L + \frac{\alpha}{\beta}\nabla S,$$

so projected\-gradient descent yields drift $-\nabla L - (\alpha/\beta)\nabla S$\. The exit\-branch drift in baseline time follows from \(A3\): a step of size $c\eta$ in baseline time $\tau=\eta(t-t^*)$ has slope $-c\nabla L$ to leading order\. ∎

The key observation is that the EoS branch is not ordinary gradient descent with oscillations: its slow drift carries an additional component along $-\nabla S$\. Each subset feels this additional drift through the inner product $\langle\nabla\ell_k,\nabla S\rangle$\.
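A quick numerical check of Lemma 1 is that the projected drift has no component along $\nabla S$, so to first order it stays on the sharpness level set. The sketch below verifies this for random vectors; the function names and the random test vectors are assumptions for illustration, not part of the original analysis.

```python
import numpy as np


def eos_drift(grad_L, grad_S):
    """Cycle-averaged EoS drift of Lemma 1: -grad_L - (alpha / beta) * grad_S."""
    alpha = -(grad_L @ grad_S)
    beta = grad_S @ grad_S
    return -grad_L - (alpha / beta) * grad_S


def exit_drift(grad_L, c):
    """Leading drift of the exit branch in baseline time: -c * grad_L."""
    return -c * grad_L


rng = np.random.default_rng(0)
grad_L = rng.standard_normal(10)   # hypothetical loss gradient
grad_S = rng.standard_normal(10)   # hypothetical sharpness gradient

drift = eos_drift(grad_L, grad_S)
# Component of the drift normal to the level set {S = 2 / eta}; should vanish.
normal_component = float(drift @ grad_S)
```

The exit drift, by contrast, generally has a nonzero component along $\nabla S$, which is the geometric difference the branch comparison exploits.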

### F\.3 Branch decomposition

Define the two local subset scores

$$R_k := \langle\nabla\ell_k,\nabla L\rangle,\qquad Q_k := \langle\nabla\ell_k,\nabla S\rangle.$$

$R_k$ is the ordinary loss\-gradient alignment of subset $k$\. $Q_k$ is the projection of the subset gradient onto the sharpness\-control direction\.

###### Proposition 2 \(Per\-subset branch decomposition\)\.

Under \(A1\)–\(A5\), for short post\-branch times,

$$\bar{\Delta}\ell_k(\tau) \;=\; \underbrace{(1-c)\,R_k\,\tau}_{\text{learning-rate confounder}} \;+\; \underbrace{\frac{\alpha}{\beta}\,Q_k\,\tau}_{\text{EoS selector}} \;+\; O_k(\tau^2) + O_k(\delta^2).$$

###### Proof\.

All quantities below are evaluated locally near the branch point $\theta^*$ unless otherwise stated\. Recall that

$$R_k := \langle\nabla\ell_k(\theta^*),\nabla L(\theta^*)\rangle,\qquad Q_k := \langle\nabla\ell_k(\theta^*),\nabla S(\theta^*)\rangle.$$

By Lemma [1](https://arxiv.org/html/2606.04212#Thmtheorem1), the cycle\-averaged EoS branch has slow drift

$$\dot{\theta}_{\mathrm{EoS}} = -\nabla L - \frac{\alpha}{\beta}\nabla S,$$

while the exit branch, measured in baseline time $\tau$, has leading drift

$$\dot{\theta}_{\mathrm{exit}} = -c\nabla L.$$

We first compute the rate of change of the subset loss along the EoS branch\. By the chain rule,

$$\frac{d}{d\tau}\bar{\ell}_k(\theta_{\mathrm{EoS}}(\tau)) = \left\langle\nabla\ell_k(\theta_{\mathrm{EoS}}(\tau)),\dot{\theta}_{\mathrm{EoS}}(\tau)\right\rangle.$$

Using the local EoS drift from Lemma [1](https://arxiv.org/html/2606.04212#Thmtheorem1), this becomes

$$\frac{d}{d\tau}\bar{\ell}_k(\theta_{\mathrm{EoS}}(\tau)) = \left\langle\nabla\ell_k(\theta_{\mathrm{EoS}}(\tau)),\,-\nabla L(\theta_{\mathrm{EoS}}(\tau)) - \frac{\alpha(\tau)}{\beta(\tau)}\nabla S(\theta_{\mathrm{EoS}}(\tau))\right\rangle.$$

Expanding the inner product gives

$$\frac{d}{d\tau}\bar{\ell}_k(\theta_{\mathrm{EoS}}(\tau)) = -\left\langle\nabla\ell_k(\theta_{\mathrm{EoS}}(\tau)),\nabla L(\theta_{\mathrm{EoS}}(\tau))\right\rangle - \frac{\alpha(\tau)}{\beta(\tau)}\left\langle\nabla\ell_k(\theta_{\mathrm{EoS}}(\tau)),\nabla S(\theta_{\mathrm{EoS}}(\tau))\right\rangle.$$

For short post\-branch times, Assumption \(A4\) allows us to replace the quantities along the branch by their values at $\theta^*$ up to first\-order local errors:

$$\left\langle\nabla\ell_k(\theta_{\mathrm{EoS}}(\tau)),\nabla L(\theta_{\mathrm{EoS}}(\tau))\right\rangle = R_k + O_k(\tau),$$

and

$$\left\langle\nabla\ell_k(\theta_{\mathrm{EoS}}(\tau)),\nabla S(\theta_{\mathrm{EoS}}(\tau))\right\rangle = Q_k + O_k(\tau).$$

Likewise, $\alpha(\tau)/\beta(\tau) = \alpha/\beta + O(\tau)$ locally\. Therefore

$$\frac{d}{d\tau}\bar{\ell}_k(\theta_{\mathrm{EoS}}(\tau)) = -R_k - \frac{\alpha}{\beta}Q_k + O_k(\tau) + O_k(\delta^2).$$

The term $O_k(\delta^2)$ comes from replacing the instantaneous oscillatory EoS trajectory by its two\-step cycle average\. The leading $O(\delta)$ phase term cancels under the two\-step average, leaving only second\-order oscillation effects\.

Now consider the exit branch\. Again by the chain rule,

$$\frac{d}{d\tau}\bar{\ell}_k(\theta_{\mathrm{exit}}(\tau)) = \left\langle\nabla\ell_k(\theta_{\mathrm{exit}}(\tau)),\dot{\theta}_{\mathrm{exit}}(\tau)\right\rangle.$$

Using $\dot{\theta}_{\mathrm{exit}} = -c\nabla L$ gives

$$\frac{d}{d\tau}\bar{\ell}_k(\theta_{\mathrm{exit}}(\tau)) = -c\left\langle\nabla\ell_k(\theta_{\mathrm{exit}}(\tau)),\nabla L(\theta_{\mathrm{exit}}(\tau))\right\rangle.$$

Again Taylor\-expanding around $\theta^*$ over the short window,

$$\left\langle\nabla\ell_k(\theta_{\mathrm{exit}}(\tau)),\nabla L(\theta_{\mathrm{exit}}(\tau))\right\rangle = R_k + O_k(\tau),$$

so

$$\frac{d}{d\tau}\bar{\ell}_k(\theta_{\mathrm{exit}}(\tau)) = -cR_k + O_k(\tau).$$

We now differentiate the branch differential

$$\bar{\Delta}\ell_k(\tau) := \bar{\ell}_k(\theta_{\mathrm{exit}}(\tau)) - \bar{\ell}_k(\theta_{\mathrm{EoS}}(\tau)).$$

Taking a derivative with respect to $\tau$ yields

$$\frac{d}{d\tau}\bar{\Delta}\ell_k(\tau) = \frac{d}{d\tau}\bar{\ell}_k(\theta_{\mathrm{exit}}(\tau)) - \frac{d}{d\tau}\bar{\ell}_k(\theta_{\mathrm{EoS}}(\tau)).$$

Substituting the two expressions above,

$$\frac{d}{d\tau}\bar{\Delta}\ell_k(\tau) = \left[-cR_k + O_k(\tau)\right] - \left[-R_k - \frac{\alpha}{\beta}Q_k + O_k(\tau) + O_k(\delta^2)\right].$$

Simplifying,

$$\frac{d}{d\tau}\bar{\Delta}\ell_k(\tau) = (1-c)R_k + \frac{\alpha}{\beta}Q_k + O_k(\tau) + O_k(\delta^2).$$

At the branching time, both branches start from the same parameter value, so

$$\bar{\Delta}\ell_k(0) = 0.$$

Integrating from $0$ to $\tau$ gives

$$\bar{\Delta}\ell_k(\tau) = \int_0^\tau \frac{d}{ds}\bar{\Delta}\ell_k(s)\,ds.$$

Using the expression for the derivative,

$$\bar{\Delta}\ell_k(\tau) = \int_0^\tau\left[(1-c)R_k + \frac{\alpha}{\beta}Q_k + O_k(s) + O_k(\delta^2)\right]ds.$$

The leading terms are constant in this local expansion, so

$$\int_0^\tau\left[(1-c)R_k + \frac{\alpha}{\beta}Q_k\right]ds = \left[(1-c)R_k + \frac{\alpha}{\beta}Q_k\right]\tau.$$

The local Taylor error integrates as

$$\int_0^\tau O_k(s)\,ds = O_k(\tau^2),$$

and the cycle\-averaging residual contributes

$$\int_0^\tau O_k(\delta^2)\,ds = O_k(\delta^2\tau).$$

For short fixed post\-branch windows, we absorb this into the stated residual notation as $O_k(\delta^2)$\. Therefore

$$\bar{\Delta}\ell_k(\tau) = (1-c)R_k\tau + \frac{\alpha}{\beta}Q_k\tau + O_k(\tau^2) + O_k(\delta^2).$$

This proves the decomposition\. ∎

Proposition [2](https://arxiv.org/html/2606.04212#Thmtheorem2) isolates two distinct mechanisms in the branching intervention\. The first is an ordinary learning\-rate confounder: lowering $\eta$ by a factor $c$ changes first\-order progress on subset $k$ by $(1-c)R_k\tau$, which is a slowdown whenever $R_k>0$\. The second is the EoS\-specific selector: the baseline branch carries an additional $-(\alpha/\beta)\nabla S$ drift, contributing $(\alpha/\beta)Q_k\tau$ to the differential\. Whether subset $k$ benefits from EoS depends on the sign of $Q_k$ relative to the rate confounder\.
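The first-order decomposition can be checked in a toy setting where the subset loss is linear, $\ell_k(\theta)=\langle g_k,\theta\rangle$, so that the Taylor remainder vanishes and the two terms are exact. The sketch below is only an illustration under that assumption: the function name, the random vectors, and the values of `c` and `tau` are hypothetical.

```python
import numpy as np


def branch_terms(g_k, grad_L, grad_S, c, tau):
    """First-order terms of the branch differential for subset gradient g_k.

    Returns (confounder, selector) = ((1 - c) * R_k * tau, (alpha / beta) * Q_k * tau).
    """
    R_k = float(g_k @ grad_L)
    Q_k = float(g_k @ grad_S)
    alpha = -float(grad_L @ grad_S)
    beta = float(grad_S @ grad_S)
    return (1.0 - c) * R_k * tau, (alpha / beta) * Q_k * tau


rng = np.random.default_rng(1)
g_k, grad_L, grad_S = (rng.standard_normal(8) for _ in range(3))  # hypothetical vectors
c, tau = 0.5, 0.01

# Direct computation: move along each branch's constant drift for time tau,
# then evaluate the linear subset loss difference (exit minus EoS).
alpha = -(grad_L @ grad_S)
beta = grad_S @ grad_S
step_eos = tau * (-grad_L - (alpha / beta) * grad_S)
step_exit = tau * (-c * grad_L)
direct = float(g_k @ (step_exit - step_eos))

confounder, selector = branch_terms(g_k, grad_L, grad_S, c, tau)
```

Here `direct` and `confounder + selector` agree to floating-point precision, and the two returned terms show separately how much of the differential comes from the reduced learning rate and how much from the projection onto $\nabla S$.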

###### Corollary 3 \(Rate slowdown alone cannot produce mixed signs\)\.

If $R_k\geq 0$ for all prototype groups $k$, then the learning\-rate confounder $(1-c)R_k\tau$ is nonnegative for all $k$\. A mixed\-sign pattern across groups in $\bar{\Delta}\ell_k$ therefore cannot be explained by ordinary learning\-rate slowdown alone\.

###### Proof\.

Since $0<c<1$, we have $1-c>0$\. With $\tau\geq 0$, the rate term has the sign of $R_k$\. Nonnegativity of all $R_k$ then forces nonnegative rate contributions for all subsets\. ∎

This is the theoretical basis for interpreting the selective trade\-off in Figure 4\. Mixed\-sign $\bar{\Delta}\ell_k$ across groups requires either the EoS selector $Q_k$ or another subset\-dependent mechanism, not rate slowdown alone\. In our empirical comparisons, we instead align branches by learning\-rate\-normalized time\. This removes the leading\-order speed difference along $-\nabla L$, so that the remaining first\-order branch differential is governed by the EoS\-specific selector $(\alpha/\beta)Q_k$\. Higher\-order differences due to the changed trajectory and sharpness threshold are absorbed into the residual terms\.

### F\.4 Curvature influence as a single\-mode proxy

The main text measures

$$C_k := (\nabla\ell_k\cdot v_1)^2.$$

Proposition [2](https://arxiv.org/html/2606.04212#Thmtheorem2) states the exact local selector as $Q_k = \langle\nabla\ell_k,\nabla S\rangle$\. Under standard eigenvalue perturbation, $\nabla S = \nabla^3 L[v_1,v_1]$, so $\nabla S$ has a component along $\mathbf{v}_1$ of magnitude $\gamma := \nabla^3 L[v_1,v_1,v_1]$ plus an orthogonal residual\. When $\nabla S$ is dominated by its top\-mode component, $Q_k \approx \gamma\,\langle\nabla\ell_k, v_1\rangle$, so $|Q_k|^2 \approx \gamma^2 C_k \propto C_k$, and $C_k$ functions as a single\-mode proxy for the squared selector magnitude\. In what follows, $C_k$ is the measured statistic\. The factorization

$$C_k = \|\nabla\ell_k\|^2\cos^2\theta_k,\qquad \cos^2\theta_k := \frac{(\nabla\ell_k\cdot v_1)^2}{\|\nabla\ell_k\|^2},$$

separates direction \(alignment\) from magnitude \(persistence\)\. The next two subsections establish that both factors are necessary\.
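The factorization of $C_k$ into a norm term and an alignment term is easy to compute from a group gradient and the top eigenvector. The sketch below is a minimal illustration; the function name and the two toy gradients are hypothetical and chosen so that one group is well aligned but small, while the other is large but nearly orthogonal.

```python
import numpy as np


def curvature_influence(g_k, v1):
    """Return (C_k, ||g_k||^2, cos^2 theta_k) for group gradient g_k and direction v1."""
    v1 = v1 / np.linalg.norm(v1)          # ensure a unit direction
    C_k = float((g_k @ v1) ** 2)
    norm_sq = float(g_k @ g_k)
    cos2 = C_k / norm_sq if norm_sq > 0 else 0.0
    return C_k, norm_sq, cos2


v1 = np.array([1.0, 0.0])
aligned_small = np.array([0.3, 0.0])      # hypothetical: perfectly aligned, small norm
large_orthogonal = np.array([0.1, 5.0])   # hypothetical: large norm, weak alignment

C_a, n_a, cos_a = curvature_influence(aligned_small, v1)
C_b, n_b, cos_b = curvature_influence(large_orthogonal, v1)
```

In this toy pair the small aligned gradient has the larger $C_k$ despite its much smaller norm, which illustrates why neither magnitude nor alignment alone determines curvature influence.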

### F\.5 Alignment: coherent vs\. random gradients

The random\-direction outlier ablation tests whether large geometric distance is sufficient for curvature dominance\. The theory predicts that it is not: what matters is whether per\-example gradients add coherently in a shared direction\.

Let $g_k := \nabla\ell_k = m^{-1}\sum_{i=1}^m g_i$ be the group gradient, and let $u$ be a unit direction \(interpreted as the current top Hessian eigenvector\)\.

###### Lemma 4 \(Coherence amplifies curvature influence\)\.

Suppose first that per\-example gradients share a coherent component:

$$g_i = a_i q + \varepsilon_i,\qquad m^{-1}\sum_i\langle\varepsilon_i,u\rangle\approx 0.$$

Then $\langle g_k,u\rangle\approx\bar{a}\,\langle q,u\rangle$ with $\bar{a} := m^{-1}\sum_i a_i$\.

Suppose instead that per\-example directions are independent, mean\-zero, and isotropic in an effective $d_{\text{eff}}$\-dimensional subspace \(i\.e\., incoherent directions\):

$$g_i = a q_i,\qquad \mathbb{E}[q_i]=0,\qquad \mathbb{E}\langle q_i,u\rangle^2 = 1/d_{\text{eff}}.$$

Then

$$\mathbb{E}\!\left[\langle g_k,u\rangle^2\right] = \frac{a^2}{m\,d_{\text{eff}}}.$$

###### Proof\.

In the coherent case,

$$\langle g_k,u\rangle = \bar{a}\langle q,u\rangle + m^{-1}\sum_i\langle\varepsilon_i,u\rangle,$$

and the residual average is negligible by assumption\. In the random case, the terms $\langle q_i,u\rangle$ are independent with mean zero, so the cross terms vanish in expectation and

$$\mathbb{E}\!\left[\langle g_k,u\rangle^2\right] = \frac{a^2}{m^2}\sum_i\mathbb{E}\langle q_i,u\rangle^2 = \frac{a^2}{m\,d_{\text{eff}}}.$$

∎

Lemma [4](https://arxiv.org/html/2606.04212#Thmtheorem4) justifies the coherent\-vs\-random ablation in Section [3\.3](https://arxiv.org/html/2606.04212#S3.SS3) as a test of the alignment mechanism\. If EoS selectivity is driven by directional alignment rather than distance or gradient magnitude alone, then preserving displacement size while randomizing directions should collapse the group\-level projection onto $\mathbf{v}_1$ and eliminate the EoS advantage\. In the coherent case, the projection remains of order $\bar{a}\langle q,\mathbf{v}_1\rangle$, and hence $C_k$ remains of order $\bar{a}^2\langle q,\mathbf{v}_1\rangle^2$\. In the random\-direction case, the expected squared projection is only $a^2/(m\,d_{\mathrm{eff}})$, so curvature influence averages away with group size and effective dimension\. A norm\-only account predicts no difference between the two conditions; the observed collapse of the EoS advantage under random\-direction displacement is therefore evidence for alignment as the operative mechanism, not magnitude alone\.
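The random-direction case of Lemma 4 can be checked with a small Monte Carlo experiment. The sketch below draws per-example directions uniformly on the unit sphere (one concrete distribution satisfying the lemma's assumptions) and compares the empirical mean of $\langle g_k,u\rangle^2$ with $a^2/(m\,d_{\text{eff}})$. The function name and the values of `a`, `m`, `d_eff`, and the trial count are hypothetical.

```python
import numpy as np


def mean_sq_projection_random(a, m, d_eff, trials, rng):
    """Monte Carlo estimate of E[<g_k, u>^2] for incoherent gradients g_i = a * q_i."""
    # q_i uniform on the unit sphere in R^{d_eff}: E[q_i] = 0, E<q_i, u>^2 = 1 / d_eff.
    q = rng.standard_normal((trials, m, d_eff))
    q /= np.linalg.norm(q, axis=2, keepdims=True)
    g_k = a * q.mean(axis=1)              # group gradients, shape (trials, d_eff)
    u = np.zeros(d_eff)
    u[0] = 1.0                            # any fixed unit direction works by isotropy
    return float(np.mean((g_k @ u) ** 2))


rng = np.random.default_rng(0)
a, m, d_eff = 2.0, 25, 10
estimate = mean_sq_projection_random(a, m, d_eff, trials=10000, rng=rng)
predicted = a**2 / (m * d_eff)            # Lemma 4, random-direction case
coherent = a**2                           # coherent case with q = u and a_i = a
```

With these settings the coherent squared projection exceeds the random-direction prediction by a factor of $m\,d_{\text{eff}}$, even though every per-example gradient has the same norm in both cases.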

### F\.6 Persistence: gradient saturation removes curvature influence

The MSE\-vs\-CE comparison tests whether alignment is sufficient when gradient magnitude collapses\. Write $\ell_i(\theta) = \phi(f_\theta(x_i),y_i)$ and decompose

$$\nabla_\theta\ell_i = J_i^\top r_i,\qquad J_i := \nabla_\theta f_\theta(x_i),\qquad r_i := \nabla_f\phi(f_\theta(x_i),y_i).$$

###### Lemma 5 \(Saturation removes curvature influence\)\.

Suppose $\|r_i\|\to 0$ for all $i\in P_k$ and the Jacobians are uniformly bounded, $\|J_i\|_{\text{op}}\leq B$\. Then $\|\nabla\ell_k\|\to 0$, and consequently

$$(\nabla\ell_k\cdot u)^2\to 0$$

for every unit vector $u$, regardless of the alignment $\cos^2\theta_k$\.

###### Proof\.

Per example, $\|\nabla_\theta\ell_i\|\leq\|J_i\|_{\text{op}}\|r_i\|\leq B\|r_i\|$\. Hence

$$\|\nabla\ell_k\|\leq|P_k|^{-1}\sum_{i\in P_k}\|\nabla_\theta\ell_i\|\to 0,$$

and $|\langle\nabla\ell_k,u\rangle|\leq\|\nabla\ell_k\|$ then gives the second claim\. ∎

For softmax cross\-entropy, $r_i = p_i - y_i$, so confidently and correctly classified examples \($p_i\to y_i$\) drive $r_i\to 0$ and the corresponding subset gradient norm collapses\. Output\-outliers, whose assigned labels are inconsistent with their input, retain $\|p_i - y_i\|$ bounded away from zero as long as the model continues to predict the input\-consistent class, so their gradients persist\. Lemma [5](https://arxiv.org/html/2606.04212#Thmtheorem5) thus predicts that under CE, a confidently classified subset can retain high $\cos^2\theta_k$ while losing all curvature influence, and that the EoS advantage transfers to whichever subset retains non\-vanishing gradients, which empirically is the output\-outliers\.

The MSE comparison sharpens the test\. MSE residuals also vanish on perfectly fit examples, but in the observed regime the coherent input\-outliers retain large residuals throughout training, preserving gradient magnitude\. The MSE\-vs\-CE contrast therefore isolates persistence: a subset that remains aligned with $\mathbf{v}_1$ but loses gradient norm loses EoS influence, while a subset that retains both keeps it\.
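The saturation effect for softmax cross-entropy can be seen directly from the residual $r = p - y$. The sketch below evaluates $\|p - y\|$ for a two-class model as its logit margin toward class 0 grows, once with the input-consistent label and once with a flipped label standing in for an output-outlier. The function names, the two-class setup, and the margin values are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def ce_residual_norm(logits, label):
    """Norm of r = p - y for softmax cross-entropy with a one-hot label."""
    p = softmax(np.asarray(logits, dtype=float))
    y = np.zeros_like(p)
    y[label] = 1.0
    return float(np.linalg.norm(p - y))


margins = [1.0, 5.0, 10.0]   # increasing confidence in class 0
# Label agrees with the confident prediction: the residual saturates toward 0.
clean = [ce_residual_norm([m, 0.0], label=0) for m in margins]
# Label disagrees with the prediction (output-outlier): the residual persists.
flipped = [ce_residual_norm([m, 0.0], label=1) for m in margins]
```

As the margin grows, the clean residual shrinks toward zero while the flipped-label residual approaches $\sqrt{2}$, matching the claim that only examples the model keeps misclassifying retain non-vanishing gradients under CE.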

### F\.7 Scope of the results

The results above are local and conditional\. Under the Damian–Nichani–Lee self\-stabilization approximation \(A2\), Proposition [2](https://arxiv.org/html/2606.04212#Thmtheorem2) establishes that the cycle\-averaged branch differential decomposes additively into a learning\-rate confounder controlled by $R_k$ and an EoS\-specific selector controlled by $Q_k$\. Lemmas [4](https://arxiv.org/html/2606.04212#Thmtheorem4) and [5](https://arxiv.org/html/2606.04212#Thmtheorem5) explain why directional coherence and gradient persistence are each individually necessary for a subset to feel the selector\. The results do not claim that the decomposition holds globally, that the single\-mode proxy $C_k$ exactly equals $|Q_k|^2$, or that flatness has a universal functional meaning\. Empirically, $C_k$ tracks $Q_k^2$ across \(subgroup, checkpoint\) pairs in the EoS regime up to a roughly constant proportionality \(Figure [22](https://arxiv.org/html/2606.04212#A6.F22)\), validating its use as a single\-mode proxy\. The functional consequence of EoS depends on which subset dominates the selector at training time, a point developed empirically in Section [4](https://arxiv.org/html/2606.04212#S4)\.

![Refer to caption](https://arxiv.org/html/2606.04212v1/x22.png) Figure 22: Empirical validation of the single\-mode proxy\. Scatter of $Q_k^2 = \langle\nabla\ell_k,\nabla S\rangle^2$ versus $C_k = (\nabla\ell_k\cdot v_1)^2$ across \(subgroup, checkpoint\) pairs in training\. The dashed line shows the median proportionality $Q_k^2 = \alpha\,C_k$, $\alpha = 7.7\times 10^5$\. The relationship holds across the trajectory; for inliers \(nearly orthogonal to $v_1$\), $C_k$ is small and the proxy is loosest, in the regime where the selector predicts no EoS advantage\.
