The Stability of Singular Distribution: A Spectral Perspective on the Two-Phase Dynamics of Language Model Pre-training

arXiv cs.LG Papers

Summary

This paper identifies a spectral phenomenon called Stability of Singular Distribution (SoSD) in large language model pre-training, where the singular value spectrum stabilizes early while parameters continue to evolve. The authors prove that this stabilization marks the transition to the slow-descent phase of training, and they analyze how training strategies like WSD and Muon affect this behavior.

arXiv:2605.26489v1 Announce Type: new Abstract: Large language model pre-training typically exhibits a two-phase trajectory: a fast initial loss drop followed by a prolonged slow improvement. We identify an underlying spectral phenomenon, Stability of Singular Distribution (SoSD), where the trace-normalized singular value spectrum stabilizes early, even as parameter matrices continue to evolve. We demonstrate that synchronization between SoSD and the slow-descent regime is widely observed across diverse architectures (GPT-2, LLaMA) and settings, including various schedules (Step-wise, WSD, Cosine Decay), weight decays, and optimizers (AdamW, Muon). By analyzing a simplified Transformer, we prove that growing weight norms inevitably precipitate an early SoSD threshold, after which the rate of loss decrease becomes theoretically bounded by the variation in the singular distribution. We further interpret strategies like WSD and Muon through their ability to modulate the SoSD scale, offering a spectral lens for understanding efficient pre-training dynamics.

Cached at: 05/27/26, 09:11 AM

# A Spectral Perspective on the Two-Phase Dynamics of Language Model Pre-training
Source: [https://arxiv.org/html/2605.26489](https://arxiv.org/html/2605.26489)
## The Stability of Singular Distribution: A Spectral Perspective on the Two\-Phase Dynamics of Language Model Pre\-training

###### Abstract

Large language model pre\-training typically exhibits a two\-phase trajectory: a fast initial loss drop followed by a prolonged slow improvement\. We identify an underlying spectral phenomenon, Stability of Singular Distribution \(SoSD\), where the trace\-normalized singular value spectrum stabilizes early, even as parameter matrices continue to evolve\. We demonstrate that synchronization between SoSD and the slow\-descent regime is widely observed across diverse architectures \(GPT\-2, LLaMA\) and settings, including various schedules \(Step\-wise, WSD, Cosine Decay\), weight decays, and optimizers \(AdamW, Muon\)\. By analyzing a simplified Transformer, we prove that growing weight norms inevitably precipitate an early SoSD threshold, after which the rate of loss decrease becomes theoretically bounded by the variation in the singular distribution\. We further interpret strategies like WSD and Muon through their ability to modulate the SoSD scale, offering a spectral lens for understanding efficient pre\-training dynamics\.

Machine Learning, ICML

## 1 Introduction

Large Language Models (LLMs) have established themselves as the cornerstone of modern artificial intelligence (Brown et al., [2020](https://arxiv.org/html/2605.26489#bib.bib2); Achiam et al., [2023](https://arxiv.org/html/2605.26489#bib.bib3)), achieving unprecedented scalability and generalization by leveraging the Transformer architecture (Vaswani et al., [2017](https://arxiv.org/html/2605.26489#bib.bib1)) as their foundational backbone. However, the optimization dynamics governing their pre-training remain enigmatic, particularly regarding the temporal evolution of the training process.

A ubiquitous observation recorded across major technical reports (Touvron et al., [2023](https://arxiv.org/html/2605.26489#bib.bib6); Chowdhery et al., [2023](https://arxiv.org/html/2605.26489#bib.bib7); Zhang et al., [2022](https://arxiv.org/html/2605.26489#bib.bib8)) is the characteristic two-phase trajectory of the training loss: an initial regime of precipitous decay followed by a prolonged period of asymptotic, heavy-tailed improvement. While this “fast-then-slow” behavior is taken for granted as the standard convergence pattern of LLMs, the theoretical mechanisms driving the transition, and specifically the factors that dictate the onset of the slow-descent phase, remain under-explored.

![Refer to caption](https://arxiv.org/html/2605.26489v1/show_in_intro_4.png)

Figure 1: Identification of the Stability of Singular Distribution (SoSD) phenomenon on GPT-2 Small. (a) Evolution of cosine similarity between current states and final states. The singular distributions (dashed lines) stabilize significantly earlier than the parameter matrices themselves (solid lines). (b) Synchronization between Validation Loss (top) and Singular Distribution Variation (SD Variation, bottom). Vertical red dashed lines mark the approximate steps where singular value matrices stabilize, coinciding with the transition into the slow descent regime.

Existing analyses often rely on restricted task formulations, such as in-context learning (Olsson et al., [2022](https://arxiv.org/html/2605.26489#bib.bib10); Bietti et al., [2023](https://arxiv.org/html/2605.26489#bib.bib12); Zhang et al., [2025a](https://arxiv.org/html/2605.26489#bib.bib47)) or linear regression (Zhang et al., [2024](https://arxiv.org/html/2605.26489#bib.bib11)), to derive tractable convergence bounds. Recently, the focus has shifted towards more granular, multi-stage interpretations of Transformer optimization (Zhou et al., [2022](https://arxiv.org/html/2605.26489#bib.bib46); Yao et al., [2025](https://arxiv.org/html/2605.26489#bib.bib45); Zhang et al., [2025b](https://arxiv.org/html/2605.26489#bib.bib44)). Notably, recent work on Condensation to Rank Collapse (Chen and Luo, [2025](https://arxiv.org/html/2605.26489#bib.bib9)) employs a gradient flow framework on linearized attention, identifying a two-stage transition from parameter condensation to asymptotic rank collapse under small initialization. Motivated by these insights, we propose to shift the lens towards the temporal dynamics of the spectral evolution. Crucially, this perspective allows us to investigate the mechanistic synchronization between stability of singular distribution and the macroscopic saturation of the loss function, addressing a fundamental question:

What intrinsic mechanism governs the transition from fast learning to slow saturation, and how does the spectral evolution of the parameters dictate this regime shift?

To investigate the mechanism underlying this transition, we analyze the spectral dynamics of the parameter matrices throughout the pre-training process. We identify a phenomenon termed the Stability of Singular Distribution (SoSD), where the normalized singular value spectrum stabilizes significantly earlier than the parameter matrices themselves (Figure [1](https://arxiv.org/html/2605.26489#S1.F1)(a)), and the onset of this stability aligns with the validation loss entering a plateau, exhibiting a tight synchronization with optimization saturation (Figure [1](https://arxiv.org/html/2605.26489#S1.F1)(b)). Our analysis reveals that this spectral stabilization closely characterizes the shift from the fast to the slow descent phase. Our specific contributions are as follows:

1. Identification of SoSD: We identify the Stability of Singular Distribution (SoSD) phenomenon, observing that the singular value distribution enters a stable state significantly prior to the stability of parameter matrices (Figure [1](https://arxiv.org/html/2605.26489#S1.F1)(a)). Across GPT-2 and LLaMA families, we report that the onset of SoSD is synchronized with the loss function's transition to the slow descent regime (Figure [1](https://arxiv.org/html/2605.26489#S1.F1)(b)).
2. Theoretical Analysis of SoSD: We establish a theoretical framework to clarify the mechanism of SoSD. We first establish the emergence of SoSD (Theorem [4.5](https://arxiv.org/html/2605.26489#S4.Thmtheorem5)) contingent upon the Non-Degeneracy of parameters and Gradient Boundedness (Assumption [4.3](https://arxiv.org/html/2605.26489#S4.Thmtheorem3)). Building on this, we incorporate Smoothness and Margin Conditions (Assumption [4.7](https://arxiv.org/html/2605.26489#S4.Thmtheorem7)) to demonstrate that loss descent is dynamically coupled with singular distribution variation: while large variation is associated with rapid loss decay, the onset of SoSD strictly bounds the subsequent loss reduction (Theorem [4.9](https://arxiv.org/html/2605.26489#S4.Thmtheorem9)).
3. Interpreting Pre-training Strategies via SoSD: We interpret pre-training strategies through the lens of SoSD dynamics and the derived stability bound $\varepsilon \propto \eta / \|W\|$, where $\eta$ and $\|W\|$ denote the learning rate and the norm of parameters, respectively. We demonstrate that learning rate schedules (step-wise decay and continuous annealing) facilitate optimization by tightening this bound, thereby mitigating the SoSD-related constraint to enable further loss minimization. Conversely, we find that Weight Decay facilitates loss descent by suppressing the growth of the weight norm; this mechanism relaxes the stability constraint, thereby permitting larger updates to the singular distribution. Finally, we validate SoSD with the Muon optimizer, observing that the SoSD phenomenon persists alongside its superior training efficiency.

#### Conflict of Interest Disclosure\.

The authors declare no financial conflicts of interest related to this work\.

![Refer to caption](https://arxiv.org/html/2605.26489v1/cossim_3.png)

Figure 2: Evolution of cosine similarity between current parameters ($t$) and their final states ($T$) during pre-training. The plots analyze cosine similarity $\cos\langle W_t, W_T\rangle$ and $\cos\langle \Sigma_t, \Sigma_T\rangle$ across four models: (a) GPT-2 Small, (b) GPT-2 Medium, (c) LLaMA 0.5B, and (d) LLaMA 2B. Solid lines represent the weight matrices ($W$), while dashed lines represent the singular value matrices ($\Sigma$). Colors denote different projection layers. The vertical red dashed lines mark the approximate steps where the singular value matrices stabilize.

## 2 Related Work

#### Training Dynamics of Transformers

Understanding the optimization trajectory of Transformers remains a formidable challenge given the interplay between non-convex objectives and massive parameter scales. Early theoretical efforts primarily dissected the dynamics of simplified, single-layer attention mechanisms. For instance, Tian et al. ([2023](https://arxiv.org/html/2605.26489#bib.bib28)) and Snell et al. ([2021](https://arxiv.org/html/2605.26489#bib.bib16)) characterized how gradient descent drives attention heads to capture co-occurrence patterns or mimic Seq2Seq algorithms, while Li et al. ([2023](https://arxiv.org/html/2605.26489#bib.bib15)) demonstrated the learnability of topic models within a BERT-like framework (Devlin et al., [2019](https://arxiv.org/html/2605.26489#bib.bib70)). Building on these foundational settings, recent studies have extended dynamic analysis to more complex distributions. For instance, the staged learning phases in two-mixture linear classification have been detailed (Yang et al., [2025](https://arxiv.org/html/2605.26489#bib.bib60)). Similarly, in the realm of logical reasoning, theoretical proofs have established how attention and linear layers evolve to solve regular language tasks via Chain-of-Thought (Huang et al., [2025](https://arxiv.org/html/2605.26489#bib.bib61)). More recently, the focus has shifted towards In-Context Learning (ICL) as a testbed for understanding dynamics. A substantial body of work has adopted the linear regression setting to theoretically analyze how Transformers implement gradient descent-like algorithms during inference (Akyürek et al., [2022](https://arxiv.org/html/2605.26489#bib.bib20); Von Oswald et al., [2023](https://arxiv.org/html/2605.26489#bib.bib21); Mahankali et al., [2023](https://arxiv.org/html/2605.26489#bib.bib17); Zhang et al., [2024](https://arxiv.org/html/2605.26489#bib.bib11)).
Refining this perspective, recent research demonstrates that Transformers can transcend simple algorithms to learn latent representations, implicitly performing ridge regression to generalize to unseen tasks (Yang et al., [2024](https://arxiv.org/html/2605.26489#bib.bib62)). Parallel to this, mechanistic interpretability research has traced the emergence of specific structural components during training, such as induction heads (Olsson et al., [2022](https://arxiv.org/html/2605.26489#bib.bib10); Reddy, [2023](https://arxiv.org/html/2605.26489#bib.bib24)) and memory retrieval circuits (Bietti et al., [2023](https://arxiv.org/html/2605.26489#bib.bib12); Cabannes et al., [2024](https://arxiv.org/html/2605.26489#bib.bib26)). Complementing these granular analyses of specific capabilities and circuits, our work seeks to characterize the macroscopic temporal evolution of the loss landscape in general pre-training scenarios. We aim to bridge the mechanistic link between the spectral evolution of parameters and the global phenomenon of loss saturation (the two-phase transition), offering a unified spectral perspective that governs the training efficiency of standard language models.

#### Implicit Regularization and Structural Dynamics

Theoretical investigations into Transformer optimization have extensively characterized how gradient-based learning induces specific structural properties in weight matrices. A central theme is the emergence of low-complexity solutions, famously observed as “Rank Collapse” (Dong et al., [2021](https://arxiv.org/html/2605.26489#bib.bib14)), where the effective rank of parameters diminishes over time. This phenomenon is widely interpreted as a form of implicit regularization, where the optimizer naturally favors low-rank or max-margin solutions even without explicit constraints (Gunasekar et al., [2017](https://arxiv.org/html/2605.26489#bib.bib32); Soudry et al., [2018](https://arxiv.org/html/2605.26489#bib.bib48); Arora et al., [2019](https://arxiv.org/html/2605.26489#bib.bib31); Neyshabur, [2017](https://arxiv.org/html/2605.26489#bib.bib51)). Supporting this view, recent non-asymptotic analysis for next-token prediction establishes that both feed-forward and attention layers converge to such max-margin solutions with linear rates (Huang et al., [2024](https://arxiv.org/html/2605.26489#bib.bib63)). Complementary to this structural view, the dynamic interplay between optimizer sharpness and stability has been analyzed through the “Edge of Stability” framework (Cohen et al., [2021](https://arxiv.org/html/2605.26489#bib.bib33); Ahn et al., [2022](https://arxiv.org/html/2605.26489#bib.bib49); Damian et al., [2022](https://arxiv.org/html/2605.26489#bib.bib50)). However, a tension exists between the tendency towards low-rank structures and the empirical observation of monotonic norm growth during pre-training (Merrill et al., [2021](https://arxiv.org/html/2605.26489#bib.bib34)). Gradient flow theories attempt to reconcile these aspects by modeling training as a process of incremental rank accumulation or balanced flow (Saxe et al., [2019](https://arxiv.org/html/2605.26489#bib.bib35); Gidel et al., [2019](https://arxiv.org/html/2605.26489#bib.bib36)).
Most relevantly, Chen and Luo ([2025](https://arxiv.org/html/2605.26489#bib.bib9)) recently employed a gradient flow framework on linearized Transformers to identify a transition from parameter condensation to asymptotic rank collapse. Building on these foundational insights, we extend the analysis to the spectral dynamics of standard attention. We identify the Stability of Singular Distribution (SoSD) not as a final low-rank state, but as an early-onset kinetic bottleneck. This perspective provides a mechanistic ground for the “fast-then-slow” two-phase saturation observed in practical pre-training.

## 3 The Stability Phenomenon in Singular Distribution

### 3.1 Experimental Setting

Models and Datasets. We conduct pre-training experiments on two widely adopted decoder-only model families:

- GPT-2 on FineWeb: We train GPT-2 Small (124M) and Medium (355M) models on the FineWeb dataset using the highly optimized nano-gpt benchmark training recipe ([https://github.com/KellerJordan/modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt)), establishing a rigorous standard for small-to-medium scale language modeling (Radford et al., [2019](https://arxiv.org/html/2605.26489#bib.bib57)).
- LLaMA on C4: We train 0.5B and 2B parameter LLaMA models on the Colossal Clean Crawled Corpus (C4) to evaluate architectural scalability (Touvron et al., [2023](https://arxiv.org/html/2605.26489#bib.bib6)).

Optimization Settings. All models are trained using the AdamW optimizer with hyperparameters $\beta_1 = 0.9$ and $\beta_2 = 0.95$, with additional comparative runs using the Muon optimizer specifically for GPT-2 Small. We employ diverse learning rate schedules (Step Decay and Warmup-Stable-Decay for GPT-2, Cosine decay for LLaMA) and ablate weight decay on the LLaMA 0.5B model (Loshchilov and Hutter, [2017](https://arxiv.org/html/2605.26489#bib.bib65)). Full details are provided in Appendix [D](https://arxiv.org/html/2605.26489#A4).

### 3.2 The Emergence of Stability of Singular Distribution

We investigate the convergence trajectories of the parameter matrices $W$ and their associated singular value spectra $\Sigma$ by monitoring their cosine similarity to the final trained states ($T$) across GPT-2 and LLaMA architectures (Figure [2](https://arxiv.org/html/2605.26489#S1.F2)).

Spectral Decoupling. A distinct divergence in convergence rates is observed across all models. While the parameter alignment $\cos\langle W_t, W_T\rangle$ evolves gradually, indicating that weight entries remain distant from their final configuration, the spectral alignment $\cos\langle \Sigma_t, \Sigma_T\rangle$ rapidly saturates near $1$. This implies that the relative shape of the singular value spectrum stabilizes significantly earlier than the parameters themselves. To formalize this decoupling, we introduce the following definition:

###### Definition 3.1 (Singular Distribution, SD).

Let $\Sigma_t$ denote the diagonal matrix of singular values derived from $W_t$. The Singular Distribution is defined as the trace-normalized spectrum:

$$\hat{\Sigma}_t \triangleq \Sigma_t / \operatorname{tr}(\Sigma_t).$$

Since cosine similarity is scale-invariant, the saturation of $\cos\langle \Sigma_t, \Sigma_T\rangle$ is equivalent to the stabilization of $\hat{\Sigma}_t$. We term this phenomenon, where the singular distribution achieves stationarity prior to the parameter matrices, the Stability of Singular Distribution (SoSD).
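As a minimal sketch of Definition 3.1, the snippet below computes the trace-normalized spectrum with NumPy and checks that the cosine similarity between two spectra is unchanged by the normalization. The function names `singular_distribution` and `cosine_similarity` and the random stand-in matrices are illustrative assumptions, not the authors' code.

```python
import numpy as np

def singular_distribution(W):
    """Trace-normalized singular values of W, stored as a vector (the diagonal)."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values in descending order
    return s / s.sum()

def cosine_similarity(a, b):
    """Cosine of the angle between two arrays, after flattening."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for W_t and W_T (assumption: any real matrices work here).
rng = np.random.default_rng(0)
W_t = rng.normal(size=(8, 8))
W_T = rng.normal(size=(8, 8))

s_t = np.linalg.svd(W_t, compute_uv=False)
s_T = np.linalg.svd(W_T, compute_uv=False)

cos_raw = cosine_similarity(s_t, s_T)
cos_normalized = cosine_similarity(singular_distribution(W_t),
                                   singular_distribution(W_T))
```

Because dividing each spectrum by its own trace is a positive rescaling, `cos_raw` and `cos_normalized` agree up to floating-point error, which is the equivalence used in the text.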

### 3.3 Synchronized Dynamics of SoSD and Training Loss

![Refer to caption](https://arxiv.org/html/2605.26489v1/sync_loss_sosd_3.png)

Figure 3: Synchronization between Validation Loss and Singular Distribution Variation. The figures illustrate the training dynamics across four models: (a) GPT-2 Small, (b) GPT-2 Medium, (c) LLaMA 0.5B, and (d) LLaMA 2B. The top row displays the Validation Loss, while the bottom row shows the Singular Distribution Variation for various projection layers. Vertical red dashed lines mark the approximate steps where singular value matrices stabilize.

To quantify the temporal evolution of the singular spectrum, we define the Singular Distribution Variation (SD Variation) as the magnitude of the update to the normalized spectrum:

$$\Delta\Sigma_t := \left\|\hat{\Sigma}_{t+1} - \hat{\Sigma}_t\right\|_F,$$

where $\hat{\Sigma}_t$ is the singular distribution defined in Definition [3.1](https://arxiv.org/html/2605.26489#S3.Thmtheorem1). We investigate the correlation between this spectral variation and the validation loss trajectory in Figure [3](https://arxiv.org/html/2605.26489#S3.F3).
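The SD Variation can be computed directly from two consecutive checkpoints of a weight matrix. The sketch below is one plausible implementation under the assumption that singular values are paired by sorted order; `sd_variation` and the toy matrices are hypothetical names and data.

```python
import numpy as np

def singular_distribution(W):
    """Trace-normalized singular values of W (descending order)."""
    s = np.linalg.svd(W, compute_uv=False)
    return s / s.sum()

def sd_variation(W_t, W_next):
    """Frobenius norm of the change in the singular distribution.

    The singular distribution is diagonal, so the Frobenius norm of the
    difference equals the Euclidean norm of the difference of the diagonals.
    """
    diff = singular_distribution(W_next) - singular_distribution(W_t)
    return float(np.linalg.norm(diff))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
noise = rng.normal(size=(8, 8))

v_rescale = sd_variation(W, 3.0 * W)          # pure rescaling of the weights
v_perturb = sd_variation(W, W + 0.1 * noise)  # generic perturbation
```

A pure rescaling leaves the singular distribution unchanged, so `v_rescale` is zero up to rounding, while a generic perturbation gives a positive value. This is why the metric tracks changes in the shape of the spectrum and ignores norm growth.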

Our analysis reveals a tight synchronization between the stability of the singular distribution and the regime of loss descent. Specifically, the training dynamics (exemplified in Figure [3](https://arxiv.org/html/2605.26489#S3.F3)(a)) decouple into two distinct phases governed by the magnitude of $\Delta\Sigma_t$:

- Phase I: Singular Distribution Restructuring (Fast Descent). In the initial stage ($t < 1000$), $\Delta\Sigma_t$ exhibits a transient impulse, peaking at magnitudes of approximately $10^{-2}$. This period of active spectral restructuring coincides with the rapid decay of the validation loss (from $\sim 10$ to $\sim 4.0$). This suggests that the steepest descent in the loss landscape is mechanistically driven by significant shifts in the relative singular value distribution.
- Phase II: Singular Distribution Metastability (Slow Descent). For $t > 1000$, $\Delta\Sigma_t$ decays rapidly and settles into a metastable floor ($\approx 10^{-4}$), marking the onset of SoSD. Crucially, this spectral stabilization aligns with the transition of the loss function into the asymptotic, heavy-tailed regime. We note that MLP layers consistently exhibit lower Singular Distribution Variation than Attention layers, yet both components synchronize their entry into the SoSD regime with the saturation of the loss.

As illustrated in Figure [3](https://arxiv.org/html/2605.26489#S3.F3), this synchronization holds across architectures (GPT-2, LLaMA) and modules, indicating that SoSD is not merely an artifact of specific layers but a global kinetic constraint characterizing the saturation of Transformer pre-training.
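The text marks the SoSD onset only approximately and does not state a detection rule. As a hedged illustration, one simple rule is to take the first step after which the SD Variation stays below a chosen floor. The helper `sosd_onset`, the floor value, and the synthetic trace below are all assumptions made for this sketch.

```python
import numpy as np

def sosd_onset(variations, floor):
    """First index t such that every value in variations[t:] is below `floor`.

    Returns None if the trace never settles below the floor.
    This thresholding rule is a hypothetical choice, not the paper's procedure.
    """
    v = np.asarray(variations, dtype=float)
    above = np.nonzero(v >= floor)[0]      # indices still at or above the floor
    if above.size == 0:
        return 0                           # below the floor from the start
    onset = int(above[-1]) + 1             # one past the last violation
    return onset if onset < v.size else None

# Synthetic SD Variation trace: an early impulse followed by a low plateau.
trace = [1e-2, 8e-3, 3e-3, 5e-4, 9e-5, 8e-5, 7e-5]
onset = sosd_onset(trace, floor=1e-4)
```

With this synthetic trace the last value at or above the floor sits at index 3, so the detected onset is index 4. On real training logs the result depends on the logging interval and on the chosen floor.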

## 4 Theoretical Analysis

In this section, we will theoretically analyze the inevitability of SoSD within a single\-layer, single\-head Transformer setting and investigate the rationale for the synchronized dynamics of SoSD and the Transformer training loss\.

### 4.1 Setup

#### Task and Loss Function

We consider a sequence classification setting. Let $X \in \mathbb{R}^{n \times d}$ denote the input sequence, where $n$ is the sequence length and $d$ is the feature dimension (Tarzanagh et al., [2023](https://arxiv.org/html/2605.26489#bib.bib64); Tian et al., [2023](https://arxiv.org/html/2605.26489#bib.bib28)). The corresponding labels are represented by a matrix $Y \in \mathbb{R}^{n \times C}$, where each row $Y_i$ is a one-hot vector indicating the class label at position $i$. Formally, the elements of $Y$ are defined as:

$$Y_{i,c} = \mathbb{I}[c = y_i] = \begin{cases} 1, & \text{if } c = y_i, \\ 0, & \text{if } c \neq y_i, \end{cases} \tag{1}$$

where $y_i \in \{1, \ldots, C\}$ denotes the ground-truth class index for the $i$-th position.

We adopt the cross-entropy loss for training. The empirical risk over the sequence is defined as:

$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \ell_i = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} Y_{i,c} \log P_{i,c},$$

where $Y_{i,c}$ denotes the one-hot encoded ground-truth label. Since $Y_{i,c} = \mathbb{I}[c = y_i]$ according to ([1](https://arxiv.org/html/2605.26489#S4.E1)), the loss can be equivalently simplified to:

$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \log P_{i,y_i}. \tag{2}$$
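The equivalence between the one-hot form of the loss and the simplified form in (2) can be checked numerically. The sketch below uses random probabilities and labels as stand-ins; the sizes, the seed, and the 0-based class indices (the text uses $1, \ldots, C$) are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 5, 3

# Stand-in output probabilities P: a row-wise softmax of random logits.
logits = rng.normal(size=(n, C))
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

y = rng.integers(0, C, size=n)   # class index per position (0-based here)
Y = np.eye(C)[y]                 # one-hot label matrix, as in equation (1)

# Double-sum form with the one-hot matrix.
loss_one_hot = -(Y * np.log(P)).sum() / n
# Simplified form of equation (2): pick the probability of the true class.
loss_simplified = -np.log(P[np.arange(n), y]).mean()
```

Both expressions give the same value up to rounding, since the one-hot row zeroes out every term except the true class.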

#### Model Architecture

We employ a 1-layer single-head transformer architecture (Chen and Luo, [2025](https://arxiv.org/html/2605.26489#bib.bib9); Yang et al., [2024](https://arxiv.org/html/2605.26489#bib.bib62); Jiao et al., [2025](https://arxiv.org/html/2605.26489#bib.bib66)). Given the input sequence $X \in \mathbb{R}^{n \times d}$, we denote the query, key, and value matrices via linear transformations:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V,$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learnable weight matrices.

The attention mechanism is formalized by the score matrix $M$ and the attention matrix $A$:

$$M = \frac{Q K^{\top}}{\sqrt{d}}, \quad A = \mathrm{softmax}(M).$$

The resulting hidden representations are then computed as:

$$H = A V \in \mathbb{R}^{n \times d}.$$

To map these representations to the output space, we introduce a fixed (non-trainable) projection matrix $W_C \in \mathbb{R}^{d \times C}$. The logits $Z$ and output probabilities $P$ are given by:

$$Z = H W_C, \quad P = \mathrm{softmax}(Z),$$

where $Z, P \in \mathbb{R}^{n \times C}$.
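The forward pass of this simplified model fits in a few lines of NumPy. The sketch below assumes the softmax is applied row-wise and that no attention mask is used, since none is mentioned in the setup; the dimensions, the random inputs, and the weight scale are illustrative choices only.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max subtraction for numerical stability."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(X, W_Q, W_K, W_V, W_C):
    """Single-layer, single-head attention followed by a fixed projection."""
    d = X.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    M = Q @ K.T / np.sqrt(d)   # score matrix, shape (n, n)
    A = softmax(M)             # attention matrix, rows sum to 1
    H = A @ V                  # hidden representations, shape (n, d)
    Z = H @ W_C                # logits, shape (n, C)
    P = softmax(Z)             # output probabilities, shape (n, C)
    return M, A, H, Z, P

rng = np.random.default_rng(0)
n, d, C = 6, 4, 3                                  # toy sizes (assumption)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (0.01 * rng.normal(size=(d, d)) for _ in range(3))
W_C = rng.normal(size=(d, C))                      # fixed, non-trainable

M, A, H, Z, P = forward(X, W_Q, W_K, W_V, W_C)
```

Returning the intermediate matrices makes it easy to inspect the scores $M$ and logits $Z$ separately from the probabilities.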

#### Optimization

We optimize the model parameters using full-batch Gradient Descent (GD) with constant learning rate $\eta > 0$. Denote $\theta_t = \{W_Q(t), W_K(t), W_V(t)\}$. The parameters are updated according to the gradient descent rule:

$$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} \mathcal{L}(\theta_t).$$

#### Initialization

We adopt a small initialization strategy (Yao et al., [2025](https://arxiv.org/html/2605.26489#bib.bib45); Chen and Luo, [2025](https://arxiv.org/html/2605.26489#bib.bib9); Giorlandino and Goldt, [2025](https://arxiv.org/html/2605.26489#bib.bib67)). Specifically, the entries of the parameter matrices at $t = 0$ are sampled independently from a Gaussian distribution:

$$[W]_{i,j} \sim \mathcal{N}(0, \sigma^2), \quad W \in \{W_Q, W_K, W_V\},$$

where $\sigma \ll 1$ denotes the standard deviation.
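The initialization and the gradient descent rule can be illustrated on toy data. To stay dependency-free, the sketch below deviates from the setup in one labeled way: it updates only $W_V$, whose gradient has the closed form $(AX)^{\top}(P - Y)W_C^{\top}/n$ when $W_Q$ and $W_K$ are frozen, whereas the analysis updates all three matrices. The sizes, seed, $\sigma = 0.01$, $\eta = 0.05$, step count, and labels are hypothetical values chosen for illustration.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max subtraction for numerical stability."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, C = 6, 4, 3
sigma, eta, steps = 0.01, 0.05, 200          # hypothetical hyperparameters

X = rng.normal(size=(n, d))
y = np.array([0, 0, 0, 1, 1, 2])             # toy labels (0-based)
Y = np.eye(C)[y]
W_C = rng.normal(size=(d, C)) / np.sqrt(d)   # fixed projection

# Small Gaussian initialization with standard deviation sigma << 1.
W_Q, W_K, W_V = (sigma * rng.normal(size=(d, d)) for _ in range(3))

# Simplification for this sketch: W_Q and W_K stay frozen, so A is constant.
A = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d))
B = A @ X                                    # H = B @ W_V

def loss_and_grad(W_V):
    """Cross-entropy loss and its exact gradient with respect to W_V."""
    P = softmax(B @ W_V @ W_C)
    loss = -np.log(P[np.arange(n), y]).mean()
    grad = B.T @ (P - Y) @ W_C.T / n
    return loss, grad

losses = []
for _ in range(steps):
    loss, grad = loss_and_grad(W_V)
    losses.append(loss)
    W_V = W_V - eta * grad                   # full-batch gradient descent step
```

With $W_Q$ and $W_K$ frozen the loss is convex in $W_V$, so a small constant step size yields a non-increasing loss curve. Reproducing the dynamics analyzed in this section would also require the gradients for $W_Q$ and $W_K$, for example through an automatic differentiation library.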

### 4.2 Stability of Singular Distribution

To theoretically characterize the Stability of Singular Distribution \(SoSD\), we first introduce the necessary technical assumptions regarding the evolution of weight norms and the regularity of the gradients\.

###### Assumption 4.1 (Strictly Increasing Norms).

The norm of the weight matrices strictly increases over time, i.e. for all $W \in \{W_Q, W_K, W_V\}$ and all $t$, $\|W(t+1)\| > \|W(t)\|$.

###### Assumption 4.3 (Non-Degeneracy and Gradient Boundedness).

We impose the following regularity conditions on the model parameters and gradients:

1. Bounded Condition Number: For all $W \in \{W_Q, W_K, W_V\}$, the matrices are full-rank and their condition numbers are bounded, i.e. there exists a constant $\kappa_{\bullet} > 0$ such that:
   $$\kappa(W_{\bullet}) := \frac{\sigma_{\max}(W_{\bullet})}{\sigma_{\min}(W_{\bullet})} \leq \kappa_{\bullet}, \quad \bullet \in \{Q, K, V\}.$$
2. Non-vanishing Gradients: The backpropagated gradient with respect to the hidden representations $H$ is non-vanishing, i.e.
   $$\|G_H\| = \left\|\frac{\partial \mathcal{L}}{\partial H}\right\| > 0.$$
3. Bounded Gradient Norms: The gradients with respect to the weight matrices are bounded. There exists a constant $G > 0$ such that for all $W \in \{W_Q, W_K, W_V\}$:
   $$\left\|\frac{\partial \mathcal{L}}{\partial W}\right\| \leq G.$$

###### Theorem 4.5 (Stability of Singular Distribution).

Consider the model defined in Section [4.1](https://arxiv.org/html/2605.26489#S4.SS1.SSS0.Px2). Under Assumption [4.1](https://arxiv.org/html/2605.26489#S4.Thmtheorem1) and Assumption [4.3](https://arxiv.org/html/2605.26489#S4.Thmtheorem3), for the stability bound $\varepsilon(W) = O\left(\frac{\eta}{\|W\|}\right)$, $W \in \{W_Q, W_K, W_V\}$, the following hold:

1. For the Value Matrix ($W_V$): There exists a threshold time $T_V$ satisfying
   $$T_V = O\left(\frac{(1+\sqrt{d})\,\eta G - \varepsilon v(0)}{\varepsilon\, C_V \sqrt{d}}\right),$$
   such that for all $t \geq T_V$, the singular distribution stabilizes:
   $$\left\|\hat{\Sigma}_{t+1}(W_V) - \hat{\Sigma}_t(W_V)\right\|_F < \varepsilon(W_V).$$
2. For Query and Key Matrices ($W \in \{W_Q, W_K\}$): There exists a threshold time $T_{QK}$ which scales as
   $$T_{QK} = O\left(\left(C_0^2 + 2 C_0 \Lambda(\varepsilon)\, \eta G\right)^{1/2}\right),$$
   such that for all $t \geq T_{QK}$, the singular distribution stabilizes:
   $$\left\|\hat{\Sigma}_{t+1}(W) - \hat{\Sigma}_t(W)\right\|_F < \varepsilon(W),$$
   where $\Lambda(\varepsilon) = \ln(1+\sqrt{d}) - \ln(\varepsilon\, q(0) \sqrt{d})$.

Consequently, for a sufficiently large constant $C > 0$, we define the global threshold time $T^*$ as:

$$T^* = C \max\left\{\frac{(1+\sqrt{d})\,\eta G}{\varepsilon\, C_V \sqrt{d}},\ \sqrt{C_0^2 + 2 C_0 \Lambda(\varepsilon)\, \eta G}\right\}.$$

For all $t \geq T^*$, the SoSD holds for all matrices $W \in \{W_Q, W_K, W_V\}$.

The proof is provided in Appendix [B](https://arxiv.org/html/2605.26489#A2). Theorem [4.5](https://arxiv.org/html/2605.26489#S4.Thmtheorem5) establishes the mechanistic origin of SoSD. Under small initialization, the relatively small parameter norms allow for rapid SD Variation. However, as training progresses, the monotonic growth of weight norms (Assumption [4.1](https://arxiv.org/html/2605.26489#S4.Thmtheorem1)) progressively suppresses the relative magnitude of gradient updates. This scaling effect naturally enforces a tighter bound on SD variation ($\varepsilon \propto \eta / \|W\|$), effectively compelling the system into the SoSD regime.
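The scaling $\varepsilon \propto \eta / \|W\|$ can be illustrated numerically. Because trace normalization removes any overall scale, the singular distribution of $sW - \eta G$ equals that of $W - (\eta/s)G$, so applying the same update to a matrix with a larger norm acts like a smaller learning rate. The sketch below applies one fixed unit-norm update to weights of increasing norm; the random matrices, $\eta = 0.01$, and the scale values are assumptions for illustration, and the update direction is a random stand-in for a real gradient.

```python
import numpy as np

def singular_distribution(W):
    """Trace-normalized singular values of W (descending order)."""
    s = np.linalg.svd(W, compute_uv=False)
    return s / s.sum()

def sd_variation(W_t, W_next):
    """Norm of the change in the singular distribution between two matrices."""
    diff = singular_distribution(W_next) - singular_distribution(W_t)
    return float(np.linalg.norm(diff))

rng = np.random.default_rng(0)
d, eta = 16, 0.01

W_unit = rng.normal(size=(d, d))
W_unit /= np.linalg.norm(W_unit)   # Frobenius norm 1
G = rng.normal(size=(d, d))
G /= np.linalg.norm(G)             # stand-in "gradient" with norm 1

scales = [1.0, 10.0, 100.0]        # increasing weight norms
variations = [sd_variation(s * W_unit, s * W_unit - eta * G) for s in scales]
```

The entries of `variations` shrink as the weight norm grows, roughly in proportion to $1/\|W\|$ for small updates, matching the qualitative behavior described above.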

### 4.3 2-Phase Analysis of Training Loss

###### Definition 4.6 (Gap Function).

Let $u \in \mathbb{R}^m$. Define the index of the maximum element as $j^* := \arg\max_j u_j$. The gap function $\mathrm{gap}(\cdot)$ is defined as the difference between the largest and the second-largest entries:

$$\mathrm{gap}(u) := u_{j^*} - \max_{j \neq j^*} u_j \geq 0.$$
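A small helper makes the gap function concrete. The name `gap` mirrors the definition, while the use of `np.partition` and the input check are implementation choices of this sketch.

```python
import numpy as np

def gap(u):
    """Largest entry minus second-largest entry of a 1-D array with m >= 2."""
    u = np.asarray(u, dtype=float)
    if u.size < 2:
        raise ValueError("gap needs at least two entries")
    # After partitioning at index -2, the last two slots hold the
    # second-largest and the largest values, in that order.
    second, largest = np.partition(u, -2)[-2:]
    return float(largest - second)

g_separated = gap([1.0, 3.0, 2.0])   # largest 3, runner-up 2
g_tied = gap([2.0, 2.0, 0.0])        # tie for the maximum
```

A tie for the maximum gives a gap of zero, which is why margin conditions built on this function require strict positivity.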

###### Assumption 4.7 (Smoothness and Margin Conditions).

We posit the following regularity conditions regarding the loss landscape and asymptotic margins:

1. Smoothness. There exists $T_{\beta}$ such that for all $t > T_{\beta}$, the loss function is locally $\beta$-smooth, i.e., for all $W \in \{W_{Q}, W_{K}, W_{V}\}$: $\|\nabla_{W}\mathcal{L}(W(t)) - \nabla_{W}\mathcal{L}(W(t+1))\| \leq \beta \|W(t) - W(t+1)\|$.
2. Attention Margin. For each $t > \max\{T^{*}, T_{\beta}\}$, denote $\bar{M} := \frac{M}{\|W_{Q}\|_{*}\|W_{K}\|_{*}}$. We assume that the attention mechanism separates the tokens with a positive margin: $\gamma_{\min} := \min_{i} \operatorname{gap}\big(\bar{M}_{i,:}\big) > 0$.
3. Logit Margin. For each $t > \max\{T^{*}, T_{\beta}\}$, denote $\bar{Z} := \frac{Z}{\|W_{V}\|_{*}}$. For the $i$-th token, define its normalized logit margin as $\omega_{i} := \bar{Z}_{i,y_{i}} - \max_{c \neq y_{i}} \bar{Z}_{i,c}$. We assume that the minimum logit margin is strictly positive: $\omega_{\min} := \min_{i} \omega_{i} > 0$.
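The two margin quantities can be computed directly once the score and logit matrices are available. The sketch below is illustrative only: it treats `M` and `Z` simply as given matrices (rows indexed by token), normalizes them by nuclear norms as in the assumption, and returns $\gamma_{\min}$ and $\omega_{\min}$. The function names, the toy matrices, and the identity weight matrices are assumptions, not values from the paper.

```python
import numpy as np

def nuclear_norm(W):
    """Sum of singular values of W."""
    return float(np.linalg.svd(W, compute_uv=False).sum())

def min_attention_margin(M, W_Q, W_K):
    """gamma_min: smallest row-wise gap of the normalized score matrix."""
    M_bar = M / (nuclear_norm(W_Q) * nuclear_norm(W_K))
    top2 = np.sort(M_bar, axis=1)[:, -2:]   # columns: second-largest, largest
    return float((top2[:, 1] - top2[:, 0]).min())

def min_logit_margin(Z, y, W_V):
    """omega_min: smallest gap between the true-class logit and the best other logit."""
    Z_bar = Z / nuclear_norm(W_V)
    rows = np.arange(Z.shape[0])
    correct = Z_bar[rows, y]
    others = Z_bar.copy()
    others[rows, y] = -np.inf               # exclude the true class from the max
    return float((correct - others.max(axis=1)).min())

I2 = np.eye(2)                              # toy weights with nuclear norm 2
M = np.array([[4.0, 1.0], [0.0, 2.0]])      # toy attention scores
Z = np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]])  # toy logits, 2 tokens x 3 classes
y = np.array([0, 1])                        # toy labels

gamma_min = min_attention_margin(M, I2, I2)
omega_min = min_logit_margin(Z, y, I2)
```

A non-positive `omega_min` would mean at least one token is not classified with a margin, in which case the assumption does not hold for that configuration.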

###### Theorem 4.9 (Two-Phase Dynamics).

Denote the loss decrease as $\Delta\mathcal{L}(t) := \mathcal{L}_{t} - \mathcal{L}_{t+1}$. Consider the model architecture defined in Section [4.1](https://arxiv.org/html/2605.26489#S4.SS1.SSS0.Px2). Under Assumption [4.1](https://arxiv.org/html/2605.26489#S4.Thmtheorem1) and Assumption [4.7](https://arxiv.org/html/2605.26489#S4.Thmtheorem7), the training dynamics exhibit two distinct phases characterizing the relationship between the loss decrease and the singular distribution:

1. Phase I (Fast Descent): There exists a time $T_{f} \leq T^{*}$ such that for all $t \leq T_{f}$, the loss decrease satisfies $\Delta\mathcal{L}(t) \geq \frac{3D^{2}\eta}{2(1+\sqrt{d})} = O(1)$.
2. Phase II (Slow Descent): There exists $T_{s} := \max\{T^{*}, T_{\beta}\}$ and $p > 2$ such that for all $t > T_{s}$, the singular distribution stabilizes within an $\varepsilon$-neighborhood: $\|\hat{\Sigma}_{t+1} - \hat{\Sigma}_{t}\|_{F} < \varepsilon$. Then, the loss decrement is polynomially bounded by the magnitude of SoSD: $\Delta\mathcal{L}(t) \leq O\!\left(\|\hat{\Sigma}_{t+1} - \hat{\Sigma}_{t}\|_{F}^{\,p}\right) = O(\varepsilon^{p})$.

Here, $T^{*}$ is the global threshold time given by Theorem [4.5](https://arxiv.org/html/2605.26489#S4.Thmtheorem5).

The proof is provided in Appendix [C](https://arxiv.org/html/2605.26489#A3).

![Refer to caption](https://arxiv.org/html/2605.26489v1/lrdecay_wsd_cosine_1.png)

Figure 4: Singular Distribution Variation under different Learning Rate Schedules. In all subplots, the left y-axis represents the Validation Loss (dark blue line), and the right y-axis denotes the Singular Distribution Variation for various projection layers (colored lines). (a) Learning Rate Decay on GPT-2 Small. The plot displays the training dynamics over 20k steps. Vertical red dashed lines at steps 10.2k and 15.3k indicate the points where the learning rate decays by a factor of 10. The inset provides a magnified view of the trajectories around the first decay point. (b) Warmup Stable Decay on GPT-2 Small. A focused view of the interval between steps 3000 and 5100. The vertical red dashed line at step 3650 marks the beginning of the WSD decay phase. (c) Cosine Decay on LLaMA 0.5B. The plot spans 80k training steps, with a vertical red dashed line indicating the onset of the SoSD.

Theorem [4.9](https://arxiv.org/html/2605.26489#S4.Thmtheorem9) formalizes the Two-Phase Dynamics of pre-training by linking loss reduction to SD Variation:

- Fast Descent Phase: Initially, moderate parameter norms permit significant spectral shifts. In this regime, the loss decrease $\Delta\mathcal{L}(t)$ is lower-bounded by a constant-order term ($O(1)$), enabling rapid optimization and active feature learning.
- Slow Descent Phase: As the system transitions into SoSD (triggered by norm growth per Theorem [4.5](https://arxiv.org/html/2605.26489#S4.Thmtheorem5)), the loss reduction becomes kinetically constrained. The descent rate is no longer governed by gradient magnitude alone but is theoretically bounded by the variation of the singular distribution ($\Delta\mathcal{L} \leq O(\varepsilon^{p})$). Consequently, the stabilization of $\hat{\Sigma}_{t}$ acts as a bottleneck, forcing the loss into an asymptotic plateau.
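In practice, the boundary between the two phases has to be read off a logged sequence of SD variation values. One simple way to do this, sketched below under stated assumptions, is to find the first step from which the variation stays below a chosen tolerance $\varepsilon$ for the rest of the log. The function name `sosd_onset`, the tolerance, and the example sequence are hypothetical; the paper's $T_{s}$ is defined through the theory, not through this heuristic.

```python
def sosd_onset(variations, eps):
    """Smallest index T with variations[t] < eps for every t >= T, or None."""
    onset = None
    for t, value in enumerate(variations):
        if value < eps:
            if onset is None:
                onset = t          # candidate start of the stable suffix
        else:
            onset = None           # a later spike cancels the candidate
    return onset

# Made-up SD variation log: a fast phase, one late spike, then stability.
logged = [0.5, 0.2, 0.0008, 0.002, 0.0009, 0.0007, 0.0005]
onset_step = sosd_onset(logged, eps=1e-3)
never_stable = sosd_onset([0.5, 0.4, 0.3], eps=1e-3)
```

Requiring the whole suffix to stay below the tolerance avoids declaring SoSD at a single transient dip, such as the value at index 2 in the example.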

## 5 Interpreting Pre-training Strategies via SoSD Dynamics

In this section, we interpret the effectiveness of established pre-training strategies through the lens of SoSD. We posit that their empirical success stems from their ability to modulate the stability bound derived in Theorem [4.5](https://arxiv.org/html/2605.26489#S4.Thmtheorem5), thereby relaxing the kinetic constraints on loss reduction.

![Refer to caption](https://arxiv.org/html/2605.26489v1/wd_and_muon_1.png)

Figure 5: (a) Impact of Weight Decay on LLaMA 0.5B. Comparison of training dynamics with (solid lines) and without (dashed lines) weight decay. The model with weight decay achieves lower validation loss, yet exhibits consistently higher SoSD values compared to the baseline without weight decay. (b) Muon vs. Adam on GPT-2 Small. Comparison between Muon and Adam optimizers. Muon demonstrates faster convergence and lower loss, while maintaining significantly lower SoSD values compared to Adam throughout the training process. For visual clarity, only representative matrices are displayed; others exhibit consistent behaviors.

### 5.1 Learning Rate Schedule

We analyze three distinct learning rate (LR) schedules: Step-wise Decay, Warmup-Stable-Decay (WSD), and Cosine Decay, to demonstrate how scheduling $\eta_{t}$ directly controls the SoSD threshold.

Step-wise Learning Rate Decay. Figure [4](https://arxiv.org/html/2605.26489#S4.F4)(a) illustrates the dynamics of GPT-2 Small under a step-wise decay strategy. At steps 10.2k and 15.3k, the learning rate is reduced by a factor of 10. Coinciding with these discrete reductions, the SoSD metric ($\Delta\Sigma$) exhibits an immediate, sharp decline (see inset), synchronized with a renewed descent in validation loss.

Theoretically, this aligns with the stability bound $\varepsilon \propto \eta/\|W\|$ (Theorem [4.5](https://arxiv.org/html/2605.26489#S4.Thmtheorem5)). The abrupt reduction in $\eta$ tightens the permissible variation of the singular distribution. This forces the system out of its previous quasi-equilibrium state, allowing the model to resolve finer-grained features and settle into a new SoSD regime with a lower spectral noise floor, thereby minimizing the loss further.

Continuous Annealing (WSD and Cosine Decay). In contrast to the discrete shifts of step decay, WSD (decay phase) and Cosine Decay employ continuous annealing.

- Warmup-Stable-Decay (WSD): Figure [4](https://arxiv.org/html/2605.26489#S4.F4)(b) highlights the critical transition at step 3650. During the constant LR phase, the SD variation oscillates around a fixed magnitude, and the loss stagnates. However, once the linear decay commences, the SoSD metric tracks the reduction of $\eta$, decreasing monotonically. This continuous tightening of the stability bound prevents the model from locking into a static spectral configuration, enabling sustained loss improvement.
- Cosine Decay: Similarly, under Cosine Decay (Figure [4](https://arxiv.org/html/2605.26489#S4.F4)(c)), the smooth reduction of $\eta_{t}$ induces a gradual tightening of the stability bound $\varepsilon_{t}$. Consequently, the system maintains a dynamic stability rather than a static equilibrium, driving a steady and persistent reduction in loss without the abrupt phase transitions observed in step-wise schedules.
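The three schedules discussed above can be written as small functions of the step index, together with the scale $\eta_{t}/\|W\|$ that the stability bound is proportional to. This is a sketch under stated assumptions: the milestones 10.2k and 15.3k, the factor of 10, the decay start at step 3650, and the linear form of the WSD decay come from the text, while the base learning rate, warmup length, total step counts, and the weight norm used in the example are hypothetical.

```python
import math

def stepwise_lr(step, base_lr, milestones=(10_200, 15_300), factor=0.1):
    """Multiply the LR by `factor` at each milestone that has been reached."""
    passed = sum(step >= m for m in milestones)
    return base_lr * factor ** passed

def wsd_lr(step, base_lr, warmup, decay_start, total):
    """Linear warmup, constant plateau, then linear decay to zero at `total`."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    if step < decay_start:
        return base_lr
    remaining = max(total - step, 0)
    return base_lr * remaining / (total - decay_start)

def cosine_lr(step, base_lr, total, min_lr=0.0):
    """Cosine annealing from base_lr down to min_lr over `total` steps."""
    progress = min(step / total, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def bound_scale(lr, weight_norm):
    """Quantity the stability bound is proportional to: eta / ||W||."""
    return lr / weight_norm

base = 1e-3                                   # hypothetical peak learning rate
before = bound_scale(stepwise_lr(10_199, base), weight_norm=50.0)
after = bound_scale(stepwise_lr(10_200, base), weight_norm=50.0)
wsd_mid_decay = wsd_lr(4_375, base, warmup=100, decay_start=3_650, total=5_100)
cosine_half = cosine_lr(40_000, base, total=80_000)
```

With the weight norm held fixed, the step-wise schedule cuts the bound scale by the decay factor in a single step, whereas the WSD decay phase and the cosine schedule shrink it gradually.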

#### Takeaway\.

From a spectral perspective, learning rate scheduling is equivalent to scheduling the stability bound of SoSD. By reducing $\eta$, these strategies progressively tighten the constraint on singular value variation, thereby mitigating the saturation effects imposed by SoSD and extending the effective optimization window.

### 5.2 Weight Decay

As shown in Figure [5](https://arxiv.org/html/2605.26489#S5.F5)(a), the trajectory utilizing Weight Decay (WD) maintains consistently higher SoSD levels compared to the baseline. In terms of validation loss, both models exhibit similar performance during the initial 60k steps; however, beyond this phase, the WD model achieves a lower final loss compared to the non-WD baseline. This behavior aligns with our derived stability bound $\varepsilon \propto \eta/\|W\|$. By suppressing the growth of the weight norm $\|W\|$, WD relaxes the stability constraint (increasing $\varepsilon$) (Touvron et al., [2023](https://arxiv.org/html/2605.26489#bib.bib6)). According to Theorem [4.9](https://arxiv.org/html/2605.26489#S4.Thmtheorem9), this relaxed bound permits larger updates to the singular value distribution, effectively raising the theoretical upper bound for loss reduction. Consequently, by mitigating the growth rate of the weight norm, WD sustains the necessary optimization dynamics to achieve better convergence.
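The link between weight decay and the bound scale $\eta/\|W\|$ can be seen in a deliberately crude scalar toy model. The sketch below is not the paper's setting: it assumes the weight norm receives a constant push `grad_push` per step and is shrunk by a decoupled decay factor $(1 - \eta\lambda)$, so that without decay the norm grows without bound and with decay it saturates near `grad_push / weight_decay`. All names and numbers are hypothetical.

```python
def norm_trajectory(steps, lr, grad_push, weight_decay=0.0, init_norm=0.1):
    """Toy recursion: norm <- (1 - lr * wd) * norm + lr * grad_push."""
    norm = init_norm
    history = []
    for _ in range(steps):
        norm = (1.0 - lr * weight_decay) * norm + lr * grad_push
        history.append(norm)
    return history

lr = 0.01
plain = norm_trajectory(10_000, lr, grad_push=1.0)                     # no weight decay
decayed = norm_trajectory(10_000, lr, grad_push=1.0, weight_decay=0.1)

# Scale of the stability bound at the end of training, eta / ||W||.
eps_plain = lr / plain[-1]
eps_decayed = lr / decayed[-1]
```

In this toy, the run with weight decay ends with a smaller norm and therefore a larger value of $\eta/\|W\|$, matching the direction of the effect described above; it says nothing about the magnitude of the effect in a real model.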

#### Takeaway

Our findings indicate that a lower SoSD does not necessarily imply better performance. While the previous section showed that reducing the learning rate decreases SoSD alongside the loss, the Weight Decay experiment reveals the opposite: a higher SoSD coexists with a lower loss. Therefore, SoSD acts as a characterization of singular value dynamics rather than a direct performance metric. Its behavior serves as an indicator, signaling when the model has entered a regime of slow descent.

### 5.3 Muon/Adam Optimizer

Figure [5](https://arxiv.org/html/2605.26489#S5.F5)(b) presents the comparative training dynamics of Muon (solid lines) and Adam (dashed lines) under the Warmup-Stable-Decay (WSD) schedule (Jordan et al., [2024](https://arxiv.org/html/2605.26489#bib.bib69)). Regarding the Validation Loss, the Muon curve remains consistently below that of Adam, indicating that Muon achieves lower loss values at equivalent training steps. In terms of Singular Distribution Variation, both optimizers exhibit initial peaks; however, upon entering the SoSD phase, the variation produced by Muon is significantly lower than that of Adam, with a difference of approximately one order of magnitude ($10^{-4}$ vs. $10^{-3}$).

#### Takeaway

The emergence of SoSD in Muon\-trained models confirms that the transition to Singular Distribution stability is a fundamental characteristic of Transformer pre\-training, invariant to the choice of optimizer\. However, the distinct scales of variation suggest that different optimizers navigate the spectral landscape with varying degrees of efficiency\.

## 6 Conclusion and Discussion

In this work, we have investigated the intrinsic mechanisms governing the characteristic "fast-then-slow" convergence pattern in Large Language Model pre-training. By shifting the analytical lens to the temporal dynamics of spectral evolution, we identified the Stability of Singular Distribution (SoSD) phenomenon. Our empirical analysis across GPT-2 and LLaMA families demonstrates that the singular distribution stabilizes significantly earlier than the parameter matrices, with this onset of stability tightly synchronizing with the transition of the validation loss into the saturation regime.

Theoretically, we established a framework that proves the emergence of SoSD under conditions of non-degeneracy and gradient boundedness. We further revealed the dynamic coupling between loss descent and singular distribution variation, showing that the onset of SoSD theoretically bounds the subsequent loss reduction. Finally, we applied this spectral perspective to interpret standard pre-training strategies, including the learning rate schedule, weight decay, and the choice of optimizer. We showed that learning rate schedules and weight decay effectively facilitate optimization by manipulating the stability bound $\varepsilon \propto \eta/\|W\|$, thereby allowing the model to overcome the spectral barrier to achieve lower training loss. Our work offers a novel mechanistic understanding of the Transformer optimization trajectory, linking microscopic spectral behaviors to macroscopic training dynamics.

## Impact Statement

This study seeks to foster progress in deep learning research, with particular attention to deepening insights into and refining pre\-training methodologies for Large Language Models \(LLMs\)\. While the potential broader societal implications of this work are acknowledged, no specific societal impacts have been identified at present that warrant explicit emphasis\.

## Acknowledgements

This work was funded by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB0680101), the CAS Project for Young Scientists in Basic Research under Grant No. YSBR-034, the National Key Research and Development Program of China under Grant No. 2023YFA1011602, and the Xiaomi Young Talents Program.

## References

- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p1.1)\.
- K\. Ahn, J\. Zhang, and S\. Sra \(2022\)Understanding the unstable convergence of gradient descent\.InInternational conference on machine learning,pp\. 247–257\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- E\. Akyürek, D\. Schuurmans, J\. Andreas, T\. Ma, and D\. Zhou \(2022\)What learning algorithm is in\-context learning? investigations with linear models\.arXiv preprint arXiv:2211\.15661\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Andriushchenko, F\. D’Angelo, A\. Varre, and N\. Flammarion \(2023\)Why do we need weight decay in modern deep learning?\.Cited by:[Remark 4\.2](https://arxiv.org/html/2605.26489#S4.Thmtheorem2.p1.1)\.
- S\. Arora, N\. Cohen, W\. Hu, and Y\. Luo \(2019\)Implicit regularization in deep matrix factorization\.Advances in neural information processing systems32\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Bietti, V\. Cabannes, D\. Bouchacourt, H\. Jegou, and L\. Bottou \(2023\)Birth of a transformer: a memory viewpoint\.Advances in Neural Information Processing Systems36,pp\. 1560–1588\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p3.1),[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p1.1)\.
- V\. Cabannes, B\. Simsek, and A\. Bietti \(2024\)Learning associative memories with gradient descent\.arXiv preprint arXiv:2402\.18724\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Chen and T\. Luo \(2025\)From condensation to rank collapse: a two\-stage analysis of transformer training dynamics\.arXiv preprint arXiv:2510\.06954\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p3.1),[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.26489#S4.SS1.SSS0.Px2.p1.3),[§4\.1](https://arxiv.org/html/2605.26489#S4.SS1.SSS0.Px4.p1.3)\.
- A\. Chowdhery, S\. Narang, J\. Devlin, M\. Bosma, G\. Mishra, A\. Roberts, P\. Barham, H\. W\. Chung, C\. Sutton, S\. Gehrmann,et al\.\(2023\)Palm: scaling language modeling with pathways\.Journal of Machine Learning Research24\(240\),pp\. 1–113\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p2.1)\.
- J\. M\. Cohen, S\. Kaur, Y\. Li, J\. Z\. Kolter, and A\. Talwalkar \(2021\)Gradient descent on neural networks typically occurs at the edge of stability\.arXiv preprint arXiv:2103\.00065\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Damian, E\. Nichani, and J\. D\. Lee \(2022\)Self\-stabilization: the implicit bias of gradient descent at the edge of stability\.arXiv preprint arXiv:2209\.15594\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)Bert: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\),pp\. 4171–4186\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Dong, J\. Cordonnier, and A\. Loukas \(2021\)Attention is not all you need: pure attention loses rank doubly exponentially with depth\.InInternational conference on machine learning,pp\. 2793–2803\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Gidel, F\. Bach, and S\. Lacoste\-Julien \(2019\)Implicit regularization of discrete gradient dynamics in linear neural networks\.Advances in Neural Information Processing Systems32\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Giorlandino and S\. Goldt \(2025\)Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation\.arXiv preprint arXiv:2505\.24333\.Cited by:[§4\.1](https://arxiv.org/html/2605.26489#S4.SS1.SSS0.Px4.p1.3)\.
- S\. Gunasekar, B\. E\. Woodworth, S\. Bhojanapalli, B\. Neyshabur, and N\. Srebro \(2017\)Implicit regularization in matrix factorization\.Advances in neural information processing systems30\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Huang, Y\. Liang, and J\. Yang \(2024\)Non\-asymptotic convergence of training transformers for next\-token prediction\.Advances in Neural Information Processing Systems37,pp\. 80634–80673\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Huang, Y\. Liang, and J\. Yang \(2025\)How transformers learn regular language recognition: a theoretical study on training dynamics and implicit bias\.arXiv preprint arXiv:2505\.00926\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Jiao, Y\. Lai, Y\. Wang, and B\. Yan \(2025\)Transformers can overcome the curse of dimensionality: a theoretical study from an approximation perspective\.arXiv preprint arXiv:2504\.13558\.Cited by:[§4\.1](https://arxiv.org/html/2605.26489#S4.SS1.SSS0.Px2.p1.3)\.
- K\. Jordan, Y\. Jin, V\. Boza, J\. You, F\. Cesista, L\. Newhouse, and J\. Bernstein \(2024\)Muon: an optimizer for hidden layers in neural networks\.External Links:[Link](https://kellerjordan.github.io/posts/muon/)Cited by:[§5\.3](https://arxiv.org/html/2605.26489#S5.SS3.p1.2)\.
- Y\. Li, Y\. Li, and A\. Risteski \(2023\)How do transformers learn topic structure: towards a mechanistic understanding\.External Links:2303\.04245,[Link](https://arxiv.org/abs/2303.04245)Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- I\. Loshchilov and F\. Hutter \(2017\)Decoupled weight decay regularization\.arXiv preprint arXiv:1711\.05101\.Cited by:[§3\.1](https://arxiv.org/html/2605.26489#S3.SS1.p2.2)\.
- A\. Mahankali, T\. B\. Hashimoto, and T\. Ma \(2023\)One step of gradient descent is provably the optimal in\-context learner with one layer of linear self\-attention\.arXiv preprint arXiv:2307\.03576\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- W\. Merrill, V\. Ramanujan, Y\. Goldberg, R\. Schwartz, and N\. A\. Smith \(2021\)Effects of parameter norm growth during transformer training: inductive bias from gradient descent\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 1766–1781\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Neyshabur \(2017\)Implicit regularization in deep learning\.arXiv preprint arXiv:1709\.01953\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Olsson, N\. Elhage, N\. Nanda, N\. Joseph, N\. DasSarma, T\. Henighan, B\. Mann, A\. Askell, Y\. Bai, A\. Chen,et al\.\(2022\)In\-context learning and induction heads\.arXiv preprint arXiv:2209\.11895\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p3.1),[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Penedo, H\. Kydlíček, A\. Lozhkov, M\. Mitchell, C\. A\. Raffel, L\. Von Werra, T\. Wolf,et al\.\(2024\)The fineweb datasets: decanting the web for the finest text data at scale\.Advances in Neural Information Processing Systems37,pp\. 30811–30849\.Cited by:[§D\.1](https://arxiv.org/html/2605.26489#A4.SS1.SSS0.Px2.p1.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, and I\. Sutskever \(2019\)Language models are unsupervised multitask learners\.OpenAI\.Note:Accessed: 2024\-11\-15External Links:[Link](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)Cited by:[§D\.1](https://arxiv.org/html/2605.26489#A4.SS1.SSS0.Px1.p1.1),[1st item](https://arxiv.org/html/2605.26489#S3.I1.i1.p1.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of machine learning research21\(140\),pp\. 1–67\.Cited by:[§D\.2](https://arxiv.org/html/2605.26489#A4.SS2.SSS0.Px2.p1.2)\.
- G\. Reddy \(2023\)The mechanistic basis of data dependence and abrupt learning in an in\-context classification task\.arXiv preprint arXiv:2312\.03002\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- A\. M\. Saxe, J\. L\. McClelland, and S\. Ganguli \(2019\)A mathematical theory of semantic development in deep neural networks\.Proceedings of the National Academy of Sciences116\(23\),pp\. 11537–11546\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Snell, R\. Zhong, D\. Klein, and J\. Steinhardt \(2021\)Approximating how single head attention learns\.arXiv preprint arXiv:2103\.07601\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Soudry, E\. Hoffer, M\. S\. Nacson, S\. Gunasekar, and N\. Srebro \(2018\)The implicit bias of gradient descent on separable data\.Journal of Machine Learning Research19\(70\),pp\. 1–57\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Su, Y\. Lu, S\. Pan, A\. Murtadha, B\. Wen, and Y\. Liu \(2023\)RoFormer: enhanced transformer with rotary position embedding\.External Links:2104\.09864,[Link](https://arxiv.org/abs/2104.09864)Cited by:[§D\.1](https://arxiv.org/html/2605.26489#A4.SS1.SSS0.Px1.p1.1),[§D\.2](https://arxiv.org/html/2605.26489#A4.SS2.SSS0.Px1.p1.1)\.
- D\. A\. Tarzanagh, Y\. Li, C\. Thrampoulidis, and S\. Oymak \(2023\)Transformers as support vector machines\.arXiv preprint arXiv:2308\.16898\.Cited by:[§4\.1](https://arxiv.org/html/2605.26489#S4.SS1.SSS0.Px1.p1.7)\.
- Y\. Tian, Y\. Wang, B\. Chen, and S\. S\. Du \(2023\)Scan and snap: understanding training dynamics and token composition in 1\-layer transformer\.Advances in neural information processing systems36,pp\. 71911–71947\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.26489#S4.SS1.SSS0.Px1.p1.7)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023\)Llama: open and efficient foundation language models\.arXiv preprint arXiv:2302\.13971\.Cited by:[§D\.2](https://arxiv.org/html/2605.26489#A4.SS2.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.26489#S1.p2.1),[2nd item](https://arxiv.org/html/2605.26489#S3.I1.i2.p1.1),[§5\.2](https://arxiv.org/html/2605.26489#S5.SS2.p1.3)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.Advances in neural information processing systems30\.Cited by:[§D\.1](https://arxiv.org/html/2605.26489#A4.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.26489#S1.p1.1)\.
- J\. Von Oswald, E\. Niklasson, E\. Randazzo, J\. Sacramento, A\. Mordvintsev, A\. Zhmoginov, and M\. Vladymyrov \(2023\)Transformers learn in\-context by gradient descent\.InInternational Conference on Machine Learning,pp\. 35151–35174\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz,et al\.\(2020\)Transformers: state\-of\-the\-art natural language processing\.InProceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations,pp\. 38–45\.Cited by:[§D\.2](https://arxiv.org/html/2605.26489#A4.SS2.SSS0.Px1.p1.1)\.
- H\. Yang, Z\. Wang, J\. D\. Lee, and Y\. Liang \(2025\)Transformers provably learn two\-mixture of linear classification via gradient flow\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Yang, Y\. Huang, Y\. Liang, and Y\. Chi \(2024\)In\-context learning with representations: contextual generalization of trained transformers\.Advances in Neural Information Processing Systems37,pp\. 85867–85898\.Cited by:[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.26489#S4.SS1.SSS0.Px2.p1.3)\.
- J\. Yao, Z\. Zhang, and Z\. J\. Xu \(2025\)An analysis for reasoning bias of language models with small initialization\.arXiv preprint arXiv:2502\.04375\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.26489#S4.SS1.SSS0.Px4.p1.3)\.
- B\. Zhang and R\. Sennrich \(2019\)Root mean square layer normalization\.Advances in neural information processing systems32\.Cited by:[§D\.1](https://arxiv.org/html/2605.26489#A4.SS1.SSS0.Px1.p1.1)\.
- R\. Zhang, S\. Frei, and P\. L\. Bartlett \(2024\)Trained transformers learn linear models in\-context\.Journal of Machine Learning Research25\(49\),pp\. 1–55\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p3.1),[§2](https://arxiv.org/html/2605.26489#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Zhang, S\. Roller, N\. Goyal, M\. Artetxe, M\. Chen, S\. Chen, C\. Dewan, M\. Diab, X\. Li, X\. V\. Lin,et al\.\(2022\)Opt: open pre\-trained transformer language models\.arXiv preprint arXiv:2205\.01068\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p2.1)\.
- Y\. Zhang, A\. K\. Singh, P\. E\. Latham, and A\. Saxe \(2025a\)Training dynamics of in\-context learning in linear attention\.arXiv preprint arXiv:2501\.16265\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p3.1)\.
- Z\. Zhang, P\. Lin, Z\. Wang, Y\. Zhang, and Z\. J\. Xu \(2025b\)Complexity control facilitates reasoning\-based compositional generalization in transformers\.arXiv preprint arXiv:2501\.08537\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p3.1)\.
- J\. Zhao, Z\. Zhang, B\. Chen, Z\. Wang, A\. Anandkumar, and Y\. Tian \(2024\)Galore: memory\-efficient llm training by gradient low\-rank projection\.arXiv preprint arXiv:2403\.03507\.Cited by:[§D\.2](https://arxiv.org/html/2605.26489#A4.SS2.SSS0.Px3.p1.4)\.
- H\. Zhou, Z\. Qixuan, T\. Luo, Y\. Zhang, and Z\. Xu \(2022\)Towards understanding the condensation of neural networks at initial training\.Advances in Neural Information Processing Systems35,pp\. 2184–2196\.Cited by:[§1](https://arxiv.org/html/2605.26489#S1.p3.1)\.

## Appendix A Proofs of Auxiliary Lemmas

###### Lemma A\.1\.

Let $a, b \in \mathbb{R}^{n}$, and define $\alpha = \|a\|_{1}$, $\beta = \|b\|_{1}$, with $\alpha \neq \beta$. Define the mapping $f: \mathbb{R}^{n} \to \mathbb{R}^{n}$ by $f(x) = \frac{x}{\|x\|_{1}}$. Then,

$$\|f(a) - f(b)\|_{2} \leq \frac{1+\sqrt{n}}{\min\{\alpha, \beta\}}\,\|a - b\|_{2}. \tag{3}$$

###### Proof of Lemma [A.1](https://arxiv.org/html/2605.26489#A1.Thmtheorem1).

We have

$$\begin{aligned} \|f(a) - f(b)\|_{2} &= \left\|\frac{a}{\alpha} - \frac{b}{\beta}\right\|_{2} \\ &\leq \left\|\frac{a}{\alpha} - \frac{b}{\alpha}\right\|_{2} + \left\|\frac{b}{\alpha} - \frac{b}{\beta}\right\|_{2}, \end{aligned} \tag{4}$$

where $\alpha = \|a\|_{1}$ and $\beta = \|b\|_{1}$.

For the first term, we have

$$\left\|\frac{a}{\alpha} - \frac{b}{\alpha}\right\|_{2} = \frac{1}{\alpha}\|a - b\|_{2}. \tag{5}$$

For the second term, we obtain

$$\left\|\frac{b}{\alpha} - \frac{b}{\beta}\right\|_{2} = \frac{|\beta - \alpha|}{\alpha\beta}\|b\|_{2}. \tag{6}$$

Note that

$$|\beta - \alpha| = \big|\|b\|_{1} - \|a\|_{1}\big| \leq \|b - a\|_{1} \leq \sqrt{n}\,\|b - a\|_{2}, \tag{7}$$

and

$$\|b\|_{2} \leq \|b\|_{1} = \beta. \tag{8}$$

Combining the above inequalities, we have

$$\begin{aligned} \|f(a) - f(b)\|_{2} &\leq \frac{1}{\alpha}\|a - b\|_{2} + \frac{|\beta - \alpha|}{\alpha} \\ &\leq \frac{1}{\alpha}\|a - b\|_{2} + \frac{\sqrt{n}}{\alpha}\|b - a\|_{2} \\ &= \frac{1+\sqrt{n}}{\alpha}\|b - a\|_{2}. \end{aligned} \tag{9}$$

By symmetry (interchanging $a$ and $b$), we conclude that

$$\|f(a) - f(b)\|_{2} \leq \frac{1+\sqrt{n}}{\min\{\alpha, \beta\}}\|b - a\|_{2}. \tag{10}$$

∎
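The bound of Lemma A.1 is easy to sanity-check numerically. The sketch below draws random pairs of nearby vectors and compares the two sides of the inequality; it is a spot check on sampled inputs, not a proof, and the helper names, the seed, the dimensions, and the perturbation size are assumptions made for illustration.

```python
import numpy as np

def l1_normalize(x):
    """f(x) = x / ||x||_1 from Lemma A.1 (x must be nonzero)."""
    return x / np.abs(x).sum()

def lemma_a1_sides(a, b):
    """Return (left-hand side, right-hand side) of the Lipschitz bound."""
    alpha, beta = np.abs(a).sum(), np.abs(b).sum()
    lhs = np.linalg.norm(l1_normalize(a) - l1_normalize(b))
    rhs = (1.0 + np.sqrt(a.size)) / min(alpha, beta) * np.linalg.norm(a - b)
    return float(lhs), float(rhs)

rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    n = int(rng.integers(2, 10))
    a = rng.standard_normal(n)
    b = a + 0.1 * rng.standard_normal(n)   # nearby point, like two consecutive iterates
    lhs, rhs = lemma_a1_sides(a, b)
    if lhs > rhs + 1e-12:                  # small tolerance for floating-point error
        violations += 1
```

The factor $1/\min\{\alpha, \beta\}$ on the right-hand side is what later turns a growing trace of singular values into a shrinking bound on SD variation.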

###### Lemma A.2 (Mirsky’s inequality: stability of singular values).

Let $W_{t}, W_{t+1} \in \mathbb{R}^{m \times n}$, and let their singular value decompositions be $W_{t} = U_{t}\Sigma_{t}V_{t}^{*}$, $W_{t+1} = U_{t+1}\Sigma_{t+1}V_{t+1}^{*}$. Then,

$$\|\Sigma_{t+1} - \Sigma_{t}\|_{F} \leq \|W_{t+1} - W_{t}\|_{F}. \tag{11}$$

###### Proof of Lemma [A.2](https://arxiv.org/html/2605.26489#A1.Thmtheorem2).

We have

$$\begin{aligned} \|W_{t+1} - W_{t}\|_{F}^{2} &= \operatorname{tr}\!\left((W_{t+1} - W_{t})^{\top}(W_{t+1} - W_{t})\right) \\ &= \operatorname{tr}(W_{t+1}^{\top}W_{t+1}) - 2\operatorname{tr}(W_{t+1}^{\top}W_{t}) + \operatorname{tr}(W_{t}^{\top}W_{t}) \\ &= \sum_{i=1}^{n}\sigma_{i}^{2}(W_{t+1}) + \sum_{i=1}^{n}\sigma_{i}^{2}(W_{t}) - 2\operatorname{tr}(W_{t+1}^{\top}W_{t}). \end{aligned} \tag{12}$$

By the von Neumann trace inequality in Lemma [A.3](https://arxiv.org/html/2605.26489#A1.Thmtheorem3),

$$\operatorname{tr}(W_{t+1}^{\top}W_{t}) \leq \sum_{i=1}^{n}\sigma_{i}(W_{t+1})\sigma_{i}(W_{t}). \tag{13}$$

Therefore,

$$\begin{aligned} \|W_{t+1} - W_{t}\|_{F}^{2} &\geq \|\Sigma_{t+1}\|_{F}^{2} + \|\Sigma_{t}\|_{F}^{2} - 2\sum_{i=1}^{n}\sigma_{i}(W_{t+1})\sigma_{i}(W_{t}) \\ &= \sum_{i=1}^{n}\big(\sigma_{i}(W_{t+1}) - \sigma_{i}(W_{t})\big)^{2} \\ &= \|\Sigma_{t+1} - \Sigma_{t}\|_{F}^{2}. \end{aligned} \tag{14}$$

Hence,

$$\|\Sigma_{t+1} - \Sigma_{t}\|_{F} \leq \|W_{t+1} - W_{t}\|_{F}. \tag{15}$$

∎

###### Lemma A\.3\(Von Neumann’s inequality\)\.

For matrices $A, B \in \mathbb{R}^{m\times n}$, we have

$$\operatorname{tr}(A^{\top}B) \le \sum_{i=1}^{n}\sigma_i(A)\,\sigma_i(B), \tag{16}$$

where $\sigma_1(A) \ge \sigma_2(A) \ge \dots \ge \sigma_{\min\{m,n\}}(A) \ge 0$ and $\sigma_1(B) \ge \sigma_2(B) \ge \dots \ge \sigma_{\min\{m,n\}}(B) \ge 0$ are the singular values of $A$ and $B$, respectively, sorted in non-increasing order, with the convention $\sigma_i(\cdot) = 0$ for $i > \min\{m,n\}$.
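As a sanity check rather than a proof, the trace inequality (16) can be tested on random matrices. The sketch below assumes Gaussian test matrices of an arbitrary size; `von_neumann_gap` is a hypothetical helper name.

```python
import numpy as np

def von_neumann_gap(a, b):
    """Return sum_i sigma_i(A) sigma_i(B) - tr(A^T B); the lemma says this is >= 0."""
    s_a = np.linalg.svd(a, compute_uv=False)   # descending order
    s_b = np.linalg.svd(b, compute_uv=False)   # descending order
    return float(s_a @ s_b - np.trace(a.T @ b))

rng = np.random.default_rng(1)
gaps = [
    von_neumann_gap(rng.standard_normal((5, 3)), rng.standard_normal((5, 3)))
    for _ in range(100)
]
a = rng.standard_normal((5, 3))
equality_gap = von_neumann_gap(a, a)   # A = B attains equality up to rounding
print(min(gaps), equality_gap)
```

Pairing the singular values in the same sorted order is what makes the bound hold; the case $A = B$ shows that it is tight.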

###### Lemma A\.4\.

Let $\{W_t\}_{t=0}^{T}$, with $W_t \in \mathbb{R}^{d\times d}$, denote the trajectory generated by gradient descent:

$$W_{t+1} = W_t - \eta\nabla\mathcal{L}(W_t) = W_t - \eta G_t. \tag{17}$$

For each $t = 0, 1, \dots, T$, let $W_t = U_t\Sigma_tV_t^{\top}$ be the singular value decomposition of $W_t$, where $\Sigma_t$ denotes the diagonal matrix of singular values. Then the following inequality holds:

$$
\left\|\frac{\Sigma_{t+1}}{\operatorname{tr}(\Sigma_{t+1})}-\frac{\Sigma_t}{\operatorname{tr}(\Sigma_t)}\right\|_F
\le \frac{1+\sqrt{d}}{\min\{\operatorname{tr}(\Sigma_t),\,\operatorname{tr}(\Sigma_{t+1})\}}\,\|W_{t+1}-W_t\|_F
\le \frac{1+\sqrt{d}}{\min\{\operatorname{tr}(\Sigma_t),\,\operatorname{tr}(\Sigma_{t+1})\}}\,\eta\,\|G_t\|_F.
\tag{18}
$$

###### Proof of Lemma [A.4](https://arxiv.org/html/2605.26489#A1.Thmtheorem4).

For each $W_t$, let $W_t = U_t\Sigma_tV_t^{\top}$, and define the trace-normalized singular value matrix $\hat{\Sigma}_t = \dfrac{\Sigma_t}{\operatorname{tr}(\Sigma_t)}$.

By Lemma [A.1](https://arxiv.org/html/2605.26489#A1.Thmtheorem1), the normalization mapping $f(X) = \dfrac{X}{\operatorname{tr}(X)}$ is Lipschitz continuous. Therefore, we have

$$\left\|\hat{\Sigma}_{t+1}-\hat{\Sigma}_t\right\|_F \le \frac{1+\sqrt{d}}{\min\{\operatorname{tr}(\Sigma_t),\,\operatorname{tr}(\Sigma_{t+1})\}}\,\|\Sigma_{t+1}-\Sigma_t\|_F. \tag{19}$$

Using Lemma [A.2](https://arxiv.org/html/2605.26489#A1.Thmtheorem2), we further obtain $\|\Sigma_{t+1}-\Sigma_t\|_F \le \|W_{t+1}-W_t\|_F$.

Combining the above inequalities yields

$$\left\|\hat{\Sigma}_{t+1}-\hat{\Sigma}_t\right\|_F \le \frac{1+\sqrt{d}}{\min\{\operatorname{tr}(\Sigma_t),\,\operatorname{tr}(\Sigma_{t+1})\}}\,\|W_{t+1}-W_t\|_F.$$

Finally, since $W_{t+1}-W_t = -\eta G_t$, we obtain

$$\left\|\frac{\Sigma_{t+1}}{\operatorname{tr}(\Sigma_{t+1})}-\frac{\Sigma_t}{\operatorname{tr}(\Sigma_t)}\right\|_F \le \frac{1+\sqrt{d}}{\min\{\operatorname{tr}(\Sigma_t),\,\operatorname{tr}(\Sigma_{t+1})\}}\,\eta\,\|G_t\|_F.$$

∎
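Bound (18) says that a larger trace of $\Sigma_t$ leaves less room for the normalized spectrum to move in one step. The sketch below illustrates this on synthetic data; the random update direction `g`, the step size, and the scale factors are assumptions chosen for illustration, not the paper's experimental setup.

```python
import numpy as np

def normalized_spectrum(w):
    """Return the trace-normalized singular values of w and tr(Sigma)."""
    s = np.linalg.svd(w, compute_uv=False)
    return s / s.sum(), float(s.sum())

def sosd_step(w_old, w_new):
    """Return (drift of the normalized spectrum, first right-hand side of (18))."""
    p_old, tr_old = normalized_spectrum(w_old)
    p_new, tr_new = normalized_spectrum(w_new)
    d = w_old.shape[0]
    drift = float(np.linalg.norm(p_new - p_old))
    bound = (1 + np.sqrt(d)) / min(tr_old, tr_new) * float(np.linalg.norm(w_new - w_old))
    return drift, bound

rng = np.random.default_rng(2)
d, eta = 8, 0.01
base = rng.standard_normal((d, d))   # hypothetical weight direction
g = rng.standard_normal((d, d))      # hypothetical gradient
results = []
for scale in (1.0, 10.0, 100.0):     # growing weight norm, same update
    w = scale * base
    results.append(sosd_step(w, w - eta * g))
    print(scale, results[-1])
```

With the update held fixed, both the observed drift and the bound shrink as the weight scale grows, which is the mechanism the lemma isolates.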

###### Lemma A\.5\(Lipschitz Gradient and Gradient Descent\)\.

Assume the loss function $\mathcal{L}$ is $\beta$-smooth, i.e., there exists $\beta > 0$ such that for all $\theta_1, \theta_2 \in \mathbb{R}^{d\times d}$, $\|\nabla\mathcal{L}(\theta_1)-\nabla\mathcal{L}(\theta_2)\| \le \beta\|\theta_1-\theta_2\|$. Consider full-batch gradient descent with update rule $\theta_{t+1} = \theta_t - \eta\nabla\mathcal{L}(\theta_t)$. Then the following inequalities hold:

1. $$\eta\left(1-\frac{\eta\beta}{2}\right)\|\nabla\mathcal{L}(\theta_t)\|^2 \le \mathcal{L}(\theta_t)-\mathcal{L}(\theta_{t+1}), \tag{20}$$
2. $$\eta\left(1+\frac{\eta\beta}{2}\right)\|\nabla\mathcal{L}(\theta_t)\|^2 \ge \mathcal{L}(\theta_t)-\mathcal{L}(\theta_{t+1}). \tag{21}$$
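Both inequalities of Lemma A.5 can be checked on a simple smooth objective. The sketch below assumes a convex quadratic loss on a vector parameter (the lemma itself is stated for matrix parameters), for which $\beta$ is the largest Hessian eigenvalue; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
b = rng.standard_normal((5, 5))
hessian = b @ b.T                          # symmetric positive semidefinite
beta = np.linalg.eigvalsh(hessian)[-1]     # smoothness constant of the quadratic

def loss(theta):
    return 0.5 * theta @ hessian @ theta

def grad(theta):
    return hessian @ theta

theta = rng.standard_normal(5)
eta = 1.0 / beta
g = grad(theta)
decrease = loss(theta) - loss(theta - eta * g)      # L(theta_t) - L(theta_{t+1})
lower = eta * (1 - eta * beta / 2) * (g @ g)        # left-hand side of (20)
upper = eta * (1 + eta * beta / 2) * (g @ g)        # left-hand side of (21)
print(lower, decrease, upper)
```

The one-step decrease is sandwiched between two multiples of the squared gradient norm, so a bound on the gradient translates directly into a bound on the loss change.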

###### Lemma A\.6\.

For any vector $u \in \mathbb{R}^m$, let $j^* = \arg\max_j u_j$, and define the gap as $\operatorname{gap}(u) = u_{j^*} - \max_{j\ne j^*}u_j \ge 0$. Let $s = \operatorname{softmax}(u)$. Then the following hold:

1. $$1-s_{j^*} \le (m-1)\exp\bigl(-\operatorname{gap}(u)\bigr). \tag{22}$$
2. For all $j \ne j^*$, $$s_j \le \exp\bigl(-\operatorname{gap}(u)\bigr). \tag{23}$$

###### Proof of Lemma[A\.6](https://arxiv.org/html/2605.26489#A1.Thmtheorem6)\.

For any $j \ne j^*$, by the definition of the gap we have

$$u_j - u_{j^*} \le -\operatorname{gap}(u). \tag{24}$$

Therefore,

$$s_j = \frac{\exp(u_j)}{\sum_{i=1}^{m}\exp(u_i)} \le \frac{\exp(u_j)}{\exp(u_{j^*})} = \exp(u_j-u_{j^*}) \le \exp\bigl(-\operatorname{gap}(u)\bigr), \tag{25}$$

where the last inequality follows from ([24](https://arxiv.org/html/2605.26489#A1.E24)).

As a consequence,

$$1-s_{j^*} = \sum_{j\ne j^*}s_j \le (m-1)\exp\bigl(-\operatorname{gap}(u)\bigr). \tag{26}$$

∎
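A small numerical instance of Lemma A.6, written as a sketch: the logit vector `u` is an arbitrary example, and `softmax` and `gap` are helper names introduced here.

```python
import numpy as np

def softmax(u):
    z = np.exp(u - u.max())        # subtract the max for numerical stability
    return z / z.sum()

def gap(u):
    """Largest entry minus second-largest entry."""
    top_two = np.sort(u)[-2:]
    return float(top_two[1] - top_two[0])

u = np.array([4.0, 1.0, 0.5, -2.0])   # hypothetical logits, gap = 3
s = softmax(u)
j_star = int(np.argmax(u))
g = gap(u)
others = np.delete(s, j_star)          # probabilities of the non-maximal entries
print(others.max(), np.exp(-g))                     # compare with (23)
print(1 - s[j_star], (len(u) - 1) * np.exp(-g))     # compare with (22)
```

As the gap grows, the softmax output approaches a one-hot vector exponentially fast.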

###### Lemma A\.7\.

Let $a = \operatorname{softmax}(u)$ with $u \in \mathbb{R}^m$. The Jacobian matrix of $a$ is given by

$$J(a) = \operatorname{diag}(a) - aa^{\top}. \tag{27}$$

Then $J(a)$ is positive semidefinite, and

$$\|J(a)\|_2 \le \operatorname{tr}(J(a)) = 1-\|a\|_2^2 \le 2(1-a_{\max}), \tag{28}$$

where

$$a_{\max} = \max_j a_j. \tag{29}$$

###### Proof of Lemma [A.7](https://arxiv.org/html/2605.26489#A1.Thmtheorem7).

(1) We first show that $J(a) \succeq 0$.

For any $\mathbf{x} \in \mathbb{R}^m$, we have

$$
\begin{aligned}
\mathbf{x}^{\top}J(a)\,\mathbf{x} &= \mathbf{x}^{\top}\operatorname{diag}(a)\,\mathbf{x} - \mathbf{x}^{\top}(aa^{\top})\,\mathbf{x} \\
&= \sum_i a_i x_i^2 - \Bigl(\sum_i a_i x_i\Bigr)^2 \\
&\ge 0,
\end{aligned}
\tag{30}
$$

where the last step is Jensen's inequality (equivalently, the Cauchy-Schwarz inequality) applied to the probability weights $a_i \ge 0$ with $\sum_i a_i = 1$. Thus, $J(a) \succeq 0$.

(2) Since $J(a) \succeq 0$, we have $\|J(a)\|_2 = \lambda_{\max}(J(a)) \le \sum_i \lambda_i(J(a)) = \operatorname{tr}(J(a))$.

First, we calculate the trace of $J(a)$:

$$\operatorname{tr}(J(a)) = \sum_i a_i - \sum_i a_i^2 = 1-\|a\|_2^2. \tag{31}$$

Next, we use the inequality $\|a\|_2^2 \ge a_{\max}^2$, which gives

$$1-\|a\|_2^2 \le 1-a_{\max}^2 = (1+a_{\max})(1-a_{\max}) \le 2(1-a_{\max}). \tag{32}$$

Thus, we have $\|J(a)\|_2 \le \operatorname{tr}(J(a)) \le 2(1-a_{\max})$. ∎
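The three claims of Lemma A.7 (positive semidefiniteness, the trace identity, and the bound by $2(1-a_{\max})$) can be verified numerically. This is a sketch on one random softmax vector; the dimension and seed are arbitrary choices.

```python
import numpy as np

def softmax(u):
    z = np.exp(u - u.max())
    return z / z.sum()

def softmax_jacobian(a):
    """J(a) = diag(a) - a a^T for a softmax output a."""
    return np.diag(a) - np.outer(a, a)

rng = np.random.default_rng(5)
a = softmax(rng.standard_normal(6))
jac = softmax_jacobian(a)
eigenvalues = np.linalg.eigvalsh(jac)      # ascending order, jac is symmetric
spectral_norm = float(eigenvalues[-1])
trace = float(np.trace(jac))
print(eigenvalues.min(), spectral_norm, trace, 2 * (1 - a.max()))
```

When one entry of $a$ dominates, $1-a_{\max}$ is small and the Jacobian nearly vanishes, which is how a large logit gap suppresses gradients through the softmax.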

###### Lemma A\.8\.

For models of the form [4.1](https://arxiv.org/html/2605.26489#S4.SS1.SSS0.Px2), under small perturbations, the following approximation holds:

$$H(x) \doteq (A_0 + J_1M)\,WW^{\top}, \tag{33}$$

where

$$A_0 = \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}, \qquad J_1 = \frac{1}{n}\left(I-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right). \tag{34}$$

###### Proof of Lemma [A.8](https://arxiv.org/html/2605.26489#A1.Thmtheorem8).

Consider the softmax function

$$
\begin{aligned}
A(x) = \operatorname{softmax}(M) &= \left[\frac{\exp(M_{ij})}{\sum_{k=1}^{n}\exp(M_{ik})}\right]_{ij} \\
&\approx \left[\frac{1+M_{ij}}{n+\sum_{k=1}^{n}M_{ik}}\right]_{ij} \\
&\approx \left[\frac{1+M_{ij}}{n\left(1+\frac{1}{n}\sum_{k=1}^{n}M_{ik}\right)}\right]_{ij} \\
&\approx \frac{1}{n}\left[(1+M_{ij})\left(1-\frac{1}{n}\sum_{k=1}^{n}M_{ik}\right)\right]_{ij} \\
&\approx \left[\frac{1}{n}\left(1+M_{ij}-\frac{1}{n}\sum_{k=1}^{n}M_{ik}\right)+O(\sigma^4)\right]_{ij} \\
&\approx \left[\frac{1}{n}+\frac{1}{n}M_{ij}-\frac{1}{n^2}\sum_{k=1}^{n}M_{ik}\right]_{ij} \\
&\approx A_0 + J_1M,
\end{aligned}
\tag{35}
$$

where $A_0 = \dfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ and $J_1 = \dfrac{1}{n}\left(I-\dfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right)$. ∎
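To see how accurate the first-order expansion is, the sketch below compares the exact softmax with the entrywise form that appears in the derivation, $A_{ij} \approx \frac{1}{n}\bigl(1 + M_{ij} - \frac{1}{n}\sum_k M_{ik}\bigr)$, for logits of decreasing magnitude. The logit matrix and the two scales are assumptions made for illustration.

```python
import numpy as np

def row_softmax(m):
    z = np.exp(m - m.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def first_order_softmax(m):
    """Entrywise first-order expansion: (1 + M_ij - mean_k M_ik) / n."""
    n = m.shape[1]
    return (1.0 + m - m.mean(axis=1, keepdims=True)) / n

rng = np.random.default_rng(6)
base = rng.standard_normal((6, 6))     # hypothetical attention logits

def approximation_error(scale):
    m = scale * base
    return float(np.max(np.abs(row_softmax(m) - first_order_softmax(m))))

err_large = approximation_error(1e-2)
err_small = approximation_error(1e-3)
print(err_large, err_small, err_large / err_small)
```

Shrinking the logits by a factor of 10 reduces the error by roughly a factor of 100, consistent with a remainder that is second order in the entries of $M$.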

## Appendix BProof of Theorem[4\.5](https://arxiv.org/html/2605.26489#S4.Thmtheorem5)

###### Proof of Theorem [4.5](https://arxiv.org/html/2605.26489#S4.Thmtheorem5).

Part 1. First, we analyze $\|G_t^{V}\|_F^2$.

For each sample $i$, we define the softmax probability assigned to the ground-truth label $y_i$ as

$$P_{i,y_i} = \frac{\exp(Z_{i,y_i})}{\sum_{k=1}^{C}\exp(Z_{i,k})}. \tag{36}$$

Taking the logarithm of ([36](https://arxiv.org/html/2605.26489#A2.E36)), we obtain

$$\log P_{i,y_i} = Z_{i,y_i} - \log\sum_{k=1}^{C}\exp(Z_{i,k}). \tag{37}$$

Therefore, the empirical loss function can be written as

$$\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(\log\sum_{k=1}^{C}\exp(Z_{i,k}) - Z_{i,y_i}\right). \tag{38}$$
By direct computation, the gradient of the loss with respect to $z_i$ is given by

$$\frac{\partial\mathcal{L}}{\partial z_i} = \frac{1}{n}\left(p_i - y_i\right). \tag{39}$$

Stacking all samples together, the gradient of the loss with respect to the logit matrix $Z$ is given by

$$\frac{\partial\mathcal{L}}{\partial Z} = \frac{1}{n}(P-Y). \tag{40}$$

For convenience, we define

$$G_Z := \frac{\partial\mathcal{L}}{\partial Z} = \frac{1}{n}(P-Y). \tag{41}$$

The gradient with respect to the hidden representation $H$ then satisfies

$$G_H := \frac{\partial\mathcal{L}}{\partial H} = G_ZW_C^{\top}. \tag{42}$$

Recall that $H = AV$. Applying the chain rule, we obtain the gradients with respect to $V$ and $A$ as

$$G_V := \frac{\partial\mathcal{L}}{\partial V} = A^{\top}G_H, \qquad G_A := \frac{\partial\mathcal{L}}{\partial A} = G_HV^{\top}. \tag{43}$$

Finally, using $V = XW_V$, the gradient with respect to $W_V$ is given by

$$G_{W_V} := G^{V} = \frac{\partial\mathcal{L}}{\partial W_V} = X^{\top}G_V = X^{\top}A^{\top}G_H. \tag{44}$$

Under the small initialization regime, as stated in Lemma [A.8](https://arxiv.org/html/2605.26489#A1.Thmtheorem8), we have

$$
\begin{aligned}
A &\approx \frac{1}{n}\mathbf{1}\mathbf{1}^{\top} + \frac{1}{n}M - \frac{1}{n^2}\mathbf{1}\mathbf{1}^{\top}M \\
&\approx \frac{1}{n}\mathbf{1}\mathbf{1}^{\top} + \frac{1}{n}\left(I-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right)M \\
&\approx A_0 + J_1M.
\end{aligned}
\tag{45}
$$

Here,

$$A_0 := \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}, \qquad J_1 := \frac{1}{n}\left(I-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right). \tag{46}$$
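The chain-rule expression (44) for the value-projection gradient can be checked against finite differences. The sketch below assumes the forward pass implied by the relations above ($V = XW_V$, $H = AV$, $Z = HW_C$ with cross-entropy loss) and treats the attention matrix $A$ as a fixed row-stochastic matrix, since it does not depend on $W_V$; shapes and seeds are arbitrary.

```python
import numpy as np

def row_softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_and_value_grad(w_v, x, a, w_c, y_onehot):
    """Cross-entropy loss and the analytic gradient X^T A^T G_H from (44)."""
    n = x.shape[0]
    h = a @ (x @ w_v)                 # H = A V with V = X W_V
    p = row_softmax(h @ w_c)          # P = softmax(Z) with Z = H W_C
    loss = -np.mean(np.log(np.sum(p * y_onehot, axis=1)))
    g_h = ((p - y_onehot) / n) @ w_c.T    # G_H = G_Z W_C^T
    return float(loss), x.T @ a.T @ g_h

rng = np.random.default_rng(4)
n, d, c = 5, 3, 4                     # hypothetical sizes
x = rng.standard_normal((n, d))
a = row_softmax(rng.standard_normal((n, n)))   # fixed attention weights
w_v = rng.standard_normal((d, d))
w_c = rng.standard_normal((d, c))
y_onehot = np.eye(c)[rng.integers(0, c, size=n)]

_, analytic = loss_and_value_grad(w_v, x, a, w_c, y_onehot)
numeric = np.zeros_like(w_v)
eps = 1e-6
for i in range(d):
    for j in range(d):
        step = np.zeros_like(w_v)
        step[i, j] = eps
        plus, _ = loss_and_value_grad(w_v + step, x, a, w_c, y_onehot)
        minus, _ = loss_and_value_grad(w_v - step, x, a, w_c, y_onehot)
        numeric[i, j] = (plus - minus) / (2 * eps)   # central difference
max_error = float(np.max(np.abs(analytic - numeric)))
print(max_error)
```

Agreement to within finite-difference accuracy indicates that the factors $X^{\top}$, $A^{\top}$, and $G_H$ appear in the right order and with the right transposes.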
Furthermore, for $G^{V}$ we have

$$\|G^{V}\|_F = \|X^{\top}A^{\top}G_H\|_F = \left[\operatorname{tr}\!\left(G_H^{\top}AXX^{\top}A^{\top}G_H\right)\right]^{1/2}. \tag{47}$$

Assume $X \in \mathbb{R}^{n\times d}$ and that $XX^{\top} \in \mathbb{R}^{n\times n}$ is full-rank. Then, it holds that

$$\|G^{V}\|_F \ge \lambda_{\min}(X)\left[\operatorname{tr}\!\left(G_H^{\top}AA^{\top}G_H\right)\right]^{1/2}. \tag{48}$$
Taking the approximation $A \approx A_0$ simplifies the analysis, where

$$A_0 := \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}. \tag{49}$$

Moreover, we observe that

$$A_0A_0^{\top} = \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}\,\frac{1}{n}\mathbf{1}\mathbf{1}^{\top} = \frac{1}{n}\mathbf{1}\mathbf{1}^{\top} = A_0. \tag{50}$$

Thus, we obtain the following lower bound for the Frobenius norm of $G^{V}$:

$$\|G^{V}\|_F \gtrsim \lambda_{\min}(X)\left[\operatorname{tr}\!\left(G_H^{\top}A_0G_H\right)\right]^{1/2} \gtrsim \frac{1}{\sqrt{n}}\,\lambda_{\min}(X)\,\|G_H\|_F. \tag{51}$$
Finally, we define the constant

$$C_V := \frac{1}{\sqrt{n}}\,\lambda_{\min}(X)\,\|G_H\|_F. \tag{52}$$

For notational simplicity, we define

$$\|W_V\|_F = v. \tag{53}$$

From the gradient relations derived above, we obtain the following differential inequality:

$$\dot{v} \gtrsim C_V. \tag{54}$$

Solving the differential inequality in ([54](https://arxiv.org/html/2605.26489#A2.E54)), we obtain

$$v(t) \ge C_Vt + v(0). \tag{55}$$
The matrix $W_V$ admits a singular value decomposition, so Lemma [A.4](https://arxiv.org/html/2605.26489#A1.Thmtheorem4) applies. Together with Assumption [4.1](https://arxiv.org/html/2605.26489#S4.Thmtheorem1), we have

$$
\begin{aligned}
\left\|\frac{\Sigma_{t+1}}{\operatorname{tr}(\Sigma_{t+1})}-\frac{\Sigma_t}{\operatorname{tr}(\Sigma_t)}\right\|_F
&\le \frac{1+\sqrt{d}}{\min\{\operatorname{tr}(\Sigma_t),\,\operatorname{tr}(\Sigma_{t+1})\}}\,\|W_{t+1}-W_t\|_F \\
&\le \frac{1+\sqrt{d}}{\min\{\operatorname{tr}(\Sigma_t),\,\operatorname{tr}(\Sigma_{t+1})\}}\,\eta\,\|G_t\|_F \\
&\le \frac{1+\sqrt{d}}{\operatorname{tr}(\Sigma_t)}\,\eta G.
\end{aligned}
$$

For the stability bound $\varepsilon(W_V) = O\!\left(\frac{\eta}{\|W_V\|}\right)$, we have

$$\left\|\frac{\Sigma_{t+1}}{\operatorname{tr}(\Sigma_{t+1})}-\frac{\Sigma_t}{\operatorname{tr}(\Sigma_t)}\right\|_F \le \frac{1+\sqrt{d}}{\operatorname{tr}(\Sigma_t)}\,\eta G < \varepsilon(W_V). \tag{56}$$
Hence, using the relation between the Frobenius norm and the nuclear norm, it follows that

$$\|W_t\|_F = \|\Sigma_t\|_F \ge \frac{\|\Sigma_t\|_*}{\sqrt{d}} = \frac{\|W_t\|_*}{\sqrt{d}} > \frac{(1+\sqrt{d})\,\eta G}{\sqrt{d}\,\varepsilon}. \tag{57}$$

As a result, for the value projection matrix $W_V$, a sufficient condition ensuring ([56](https://arxiv.org/html/2605.26489#A2.E56)) is that its Frobenius norm satisfies

$$v(t) > \frac{(1+\sqrt{d})\,\eta G}{\sqrt{d}\,\varepsilon}. \tag{58}$$

Requiring the growth lower bound $C_Vt + v(0)$ of $v(t)$ from ([55](https://arxiv.org/html/2605.26489#A2.E55)) to exceed this threshold and solving for $t$, we obtain the corresponding hitting time

$$T_V = O\!\left(\frac{(1+\sqrt{d})\,\eta G - \varepsilon\sqrt{d}\,v(0)}{\varepsilon\,C_V\sqrt{d}}\right). \tag{59}$$
Part 2. Next, we separately analyze $\|G_t^{Q}\|_F^2$ and $\|G_t^{K}\|_F^2$.

Substituting ([45](https://arxiv.org/html/2605.26489#A2.E45)) into the gradients yields

$$
\begin{aligned}
G_{W_Q} := G^{Q} &= \frac{1}{\sqrt{d}}\,X^{\top}J_1^{\top}G_HV^{\top}K, \\
G_{W_K} := G^{K} &= \frac{1}{\sqrt{d}}\,X^{\top}VG_H^{\top}J_1Q.
\end{aligned}
\tag{60}
$$

Then we consider the Frobenius norms of $G^{Q}$ and $G^{K}$.

We first analyze $\|G^{Q}\|_F$. By definition, we have

$$
\begin{aligned}
\|G^{Q}\|_F &= \left\|\frac{1}{\sqrt{d}}X^{\top}J_1^{\top}G_HV^{\top}K\right\|_F \\
&= \frac{1}{\sqrt{d}}\left\|X^{\top}J_1^{\top}G_HV^{\top}K\right\|_F \\
&= \frac{1}{\sqrt{d}}\left[\operatorname{tr}\!\left(K^{\top}VG_H^{\top}J_1XX^{\top}J_1^{\top}G_HV^{\top}K\right)\right]^{1/2} \\
&\ge \frac{1}{\sqrt{d}}\,\lambda_{\min}(X)\left[\operatorname{tr}\!\left(K^{\top}VG_H^{\top}J_1J_1^{\top}G_HV^{\top}K\right)\right]^{1/2}.
\end{aligned}
\tag{61}
$$
Recall that

$$J_1 = \frac{1}{n}\left(I-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right). \tag{62}$$

Moreover, we have

$$
\begin{aligned}
J_1J_1^{\top} &= \frac{1}{n^2}\left(I-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right)\left(I-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right) \\
&= \frac{1}{n^2}\left(I-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right).
\end{aligned}
\tag{63}
$$
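Identities (50) and (63) both follow from the fact that $\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ and $I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ are complementary orthogonal projectors. A short numerical confirmation, with $n = 5$ chosen arbitrarily:

```python
import numpy as np

n = 5                                   # hypothetical sequence length
ones = np.ones((n, 1))
a0 = ones @ ones.T / n                  # A_0 = (1/n) 1 1^T, projector onto span{1}
centering = np.eye(n) - a0              # I - (1/n) 1 1^T, projector onto its complement
j1 = centering / n                      # J_1

a0_idempotent = np.allclose(a0 @ a0.T, a0)                 # identity (50)
j1_identity = np.allclose(j1 @ j1.T, centering / n**2)     # identity (63)
orthogonal = np.allclose(j1 @ a0, np.zeros((n, n)))        # J_1 annihilates A_0
print(a0_idempotent, j1_identity, orthogonal)
```

Because $J_1A_0 = 0$, the uniform part of the attention matrix contributes nothing once it is multiplied by $J_1$.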
Substituting this identity into the expression yields

$$
\begin{aligned}
\|G^{Q}\|_F &\ge \frac{1}{n\sqrt{d}}\,\lambda_{\min}(X)\,\underbrace{\left[\operatorname{tr}\!\left(K^{\top}VG_H^{\top}G_HV^{\top}K\right)\right]^{1/2}}_{\text{(I)}} \\
&\quad - \frac{1}{n^{3/2}\sqrt{d}}\,\lambda_{\min}(X)\,\underbrace{\left[\operatorname{tr}\!\left(K^{\top}VG_H^{\top}\mathbf{1}\mathbf{1}^{\top}G_HV^{\top}K\right)\right]^{1/2}}_{\text{(II)}}.
\end{aligned}
\tag{64}
$$
We decompose the right-hand side of ([64](https://arxiv.org/html/2605.26489#A2.E64)) into two components, denoted by (I) and (II), respectively.

We first bound term (I).

$$
\begin{aligned}
\text{(I)} &= \left[\operatorname{tr}\!\left(G_HV^{\top}KK^{\top}VG_H^{\top}\right)\right]^{1/2} \\
&\ge \lambda_{\min}(X)\,\sigma_{\min}(W_K)\left[\operatorname{tr}\!\left(G_HV^{\top}VG_H^{\top}\right)\right]^{1/2} \\
&\ge \lambda_{\min}^{2}(X)\,\sigma_{\min}(W_K)\,\sigma_{\min}(W_V)\,\|G_H\|_F.
\end{aligned}
\tag{65}
$$

We next bound term (II).

$$
\begin{aligned}
\text{(II)} &= \left[\operatorname{tr}\!\left(K^{\top}VG_H^{\top}\mathbf{1}\mathbf{1}^{\top}G_HV^{\top}K\right)\right]^{1/2} \\
&= \left[\operatorname{tr}\!\left(\mathbf{1}^{\top}G_HV^{\top}KK^{\top}VG_H^{\top}\mathbf{1}\right)\right]^{1/2} \\
&\ge \lambda_{\min}^{2}(X)\,\sigma_{\min}(W_K)\,\sigma_{\min}(W_V)\left[\operatorname{tr}\!\left(\mathbf{1}^{\top}G_HG_H^{\top}\mathbf{1}\right)\right]^{1/2} \\
&\ge \lambda_{\min}^{2}(X)\,\sigma_{\min}(W_K)\,\sigma_{\min}(W_V)\,\|G_H\|_F.
\end{aligned}
\tag{66}
$$

Combining the bounds on (I) and (II) and substituting them into ([64](https://arxiv.org/html/2605.26489#A2.E64)), we obtain

$$\|G^{Q}\|_F \ge \left(\frac{1}{n}-\frac{1}{n^{3/2}}\right)\frac{1}{\sqrt{d}}\,\lambda_{\min}^{3}(X)\,\|G_H\|_F\,\sigma_{\min}(W_K)\,\sigma_{\min}(W_V). \tag{67}$$
Under the numerical non-degeneracy assumption in Assumption [4.3](https://arxiv.org/html/2605.26489#S4.Thmtheorem3),

$$
\begin{aligned}
\|G^{Q}\|_F &\ge \frac{\sqrt{n}-1}{n^{3/2}\sqrt{d}}\,\lambda_{\min}^{3}(X)\,\|G_H\|_F\,\sigma_{\max}(W_K)\,\sigma_{\max}(W_V)\,\frac{1}{\kappa_K\kappa_V} \\
&\ge \frac{\sqrt{n}-1}{n^{3/2}d^{5/2}}\,\lambda_{\min}^{3}(X)\,\|G_H\|_F\,\frac{1}{\kappa_K\kappa_V}\,\|W_K\|_F\,\|W_V\|_F.
\end{aligned}
\tag{68}
$$

We define the constant

$$C_Q := \frac{(\sqrt{n}-1)\,\lambda_{\min}^{3}(X)\,\|G_H\|_F}{n^{3/2}d^{5/2}\kappa_K\kappa_V}. \tag{69}$$

Therefore, we arrive at the compact form

$$\|G^{Q}\|_F \ge C_Q\,\|W_K\|_F\,\|W_V\|_F. \tag{70}$$

By symmetry, an analogous bound holds for the key projection, namely,

$$\|G^{K}\|_F \ge C_K\,\|W_Q\|_F\,\|W_V\|_F, \tag{71}$$

where the constant $C_K$ is given by

$$C_K = \frac{(\sqrt{n}-1)\,\lambda_{\min}^{3}(X)\,\|G_H\|_F}{n^{3/2}d^{5/2}\kappa_Q\kappa_V}. \tag{72}$$
For notational simplicity, we define

$$\|W_Q\|_F = q, \qquad \|W_K\|_F = k. \tag{73}$$

From the gradient relations derived above, we obtain the following system of differential inequalities:

$$
\begin{cases}
\dot{q} \ge C_Q\,kv, \\
\dot{k} \ge C_K\,qv.
\end{cases}
\tag{74}
$$

Due to the symmetry between $q$ and $k$, without loss of generality we assume

$$k(0) = q(0), \qquad C_Q = C_K = C_M. \tag{75}$$

Solving the system of differential inequalities in ([74](https://arxiv.org/html/2605.26489#A2.E74)) under the assumption ([75](https://arxiv.org/html/2605.26489#A2.E75)) and substituting ([55](https://arxiv.org/html/2605.26489#A2.E55)) into the resulting expressions, we obtain

$$
\begin{aligned}
q(t) &\ge q(0)\exp\!\left(\tfrac{1}{2}C_MC_Vt^2 + C_Mv(0)\,t\right), \\
k(t) &\ge k(0)\exp\!\left(\tfrac{1}{2}C_MC_Vt^2 + C_Mv(0)\,t\right).
\end{aligned}
\tag{76}
$$
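The closed form in (76) is the exact solution of the boundary case in which (54) and (74) hold with equality. The sketch below integrates that boundary system with a hand-written fourth-order Runge-Kutta scheme and compares it with the formula; the constants, initial values, and time horizon are arbitrary illustrative choices.

```python
import numpy as np

def rhs(state, c_m, c_v):
    """Equality case: q' = C_M k v, k' = C_M q v, v' = C_V."""
    q, k, v = state
    return np.array([c_m * k * v, c_m * q * v, c_v])

def integrate(state0, t_end, steps, c_m, c_v):
    """Classical RK4 integration from t = 0 to t = t_end."""
    state = np.array(state0, dtype=float)
    dt = t_end / steps
    for _ in range(steps):
        k1 = rhs(state, c_m, c_v)
        k2 = rhs(state + 0.5 * dt * k1, c_m, c_v)
        k3 = rhs(state + 0.5 * dt * k2, c_m, c_v)
        k4 = rhs(state + dt * k3, c_m, c_v)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return state

c_m, c_v = 0.5, 0.3              # hypothetical constants C_M and C_V
q0, v0, t_end = 0.1, 1.0, 2.0    # hypothetical q(0) = k(0), v(0), horizon
q_num, k_num, v_num = integrate([q0, q0, v0], t_end, 1000, c_m, c_v)
q_closed = q0 * np.exp(0.5 * c_m * c_v * t_end**2 + c_m * v0 * t_end)
print(q_num, q_closed, v_num)
```

The query and key norms grow like the exponential of a quadratic in $t$, far faster than the linear growth of the value norm, which is why the two hitting times are treated separately.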
Hence, using the inequality ([56](https://arxiv.org/html/2605.26489#A2.E56)) and the relation between the Frobenius norm and the nuclear norm, for the stability bound $\varepsilon = \varepsilon(W) = O\!\left(\frac{\eta}{\|W\|}\right)$, $W \in \{W_Q, W_K\}$, it follows that

$$\|W_t\|_F = \|\Sigma_t\|_F \ge \frac{\|\Sigma_t\|_*}{\sqrt{d}} = \frac{\|W_t\|_*}{\sqrt{d}} > \frac{(1+\sqrt{d})\,\eta G}{\sqrt{d}\,\varepsilon}. \tag{77}$$

As a result, for the query, key, and value projection matrices $W_Q$, $W_K$, and $W_V$, a sufficient condition ensuring ([56](https://arxiv.org/html/2605.26489#A2.E56)) is that their Frobenius norms satisfy

$$
\begin{cases}
q(t) > \dfrac{(1+\sqrt{d})\,\eta G}{\sqrt{d}\,\varepsilon}, \\[2ex]
v(t) > \dfrac{(1+\sqrt{d})\,\eta G}{\sqrt{d}\,\varepsilon}.
\end{cases}
\tag{78}
$$

By solving the above inequalities using the growth lower bounds of $q(t)$ and $v(t)$ and setting $C_Mv(0) = C_0$, we obtain the corresponding hitting time

$$T_{QK} = \frac{1}{C_MC_V}\left(-C_Mv(0) + \sqrt{\bigl(C_Mv(0)\bigr)^2 + 2C_Mv(0)\Lambda(\varepsilon)}\right) = O\!\left(\bigl(C_0^2 + 2C_0\Lambda(\varepsilon)\,\eta G\bigr)^{1/2}\right), \tag{79}$$

where $\Lambda(\varepsilon) = \ln(1+\sqrt{d}) - \ln\bigl(\varepsilon\,q(0)\sqrt{d}\bigr)$.

Therefore, for the stability bound $\varepsilon(W) = O\!\left(\frac{1}{\|W\|}\right)$, $W \in \{W_Q, W_K, W_V\}$, there exists a constant $C > 0$ such that for

$$t > T^{*} = C\max\left\{\frac{(1+\sqrt{d})\,\eta G}{\varepsilon\,C_V\sqrt{d}},\; \sqrt{C_0^2 + 2C_0\Lambda(\varepsilon)\,\eta G}\right\},$$

the normalized singular value matrices satisfy

$$\left\|\frac{\Sigma_{t+1}}{\operatorname{tr}(\Sigma_{t+1})}-\frac{\Sigma_t}{\operatorname{tr}(\Sigma_t)}\right\|_F < \varepsilon. \tag{80}$$

∎

## Appendix CProof of Theorem[4\.9](https://arxiv.org/html/2605.26489#S4.Thmtheorem9)

###### Proof of Theorem [4.9](https://arxiv.org/html/2605.26489#S4.Thmtheorem9).

By Lemma [A.4](https://arxiv.org/html/2605.26489#A1.Thmtheorem4), for the parameter sequence $\{W_{\bullet}(t)\}_{t=0}^{T}$, where $\bullet \in \{Q, K, V\}$, and their corresponding singular value matrices $\Sigma_t$, the following bound holds:

$$\left\|\frac{\Sigma_{t+1}}{\operatorname{tr}(\Sigma_{t+1})}-\frac{\Sigma_t}{\operatorname{tr}(\Sigma_t)}\right\|_F \le \frac{1+\sqrt{d}}{\min\{\operatorname{tr}(\Sigma_t),\,\operatorname{tr}(\Sigma_{t+1})\}}\,\eta\,\|G_t\|_F. \tag{81}$$

For notational convenience, define

$$\tau_t := \operatorname{tr}(\Sigma_t), \tag{82}$$

and introduce the normalized singular distribution difference

$$\delta(t) := \left\|\frac{\Sigma_{t+1}}{\operatorname{tr}(\Sigma_{t+1})}-\frac{\Sigma_t}{\operatorname{tr}(\Sigma_t)}\right\|_F. \tag{83}$$
Applying ([81](https://arxiv.org/html/2605.26489#A3.E81)) to the query, key, and value projection matrices $W_Q$, $W_K$, and $W_V$, respectively, we obtain

$$\delta_Q(t) \le \frac{(1+\sqrt{d})\,\eta}{\tau_t^{Q}}\,\|G_t^{Q}\|_F, \tag{84}$$

$$\delta_K(t) \le \frac{(1+\sqrt{d})\,\eta}{\tau_t^{K}}\,\|G_t^{K}\|_F, \tag{85}$$

and

$$\delta_V(t) \le \frac{(1+\sqrt{d})\,\eta}{\tau_t^{V}}\,\|G_t^{V}\|_F. \tag{86}$$

Let

$$C_0 := (1+\sqrt{d})\,\eta. \tag{87}$$

Then, by rearranging the bounds in ([84](https://arxiv.org/html/2605.26489#A3.E84))–([86](https://arxiv.org/html/2605.26489#A3.E86)), we obtain

$$\|G_t^{Q}\|_F \ge \frac{1}{C_0}\,\tau_t^{Q}\,\delta_Q(t), \qquad \|G_t^{K}\|_F \ge \frac{1}{C_0}\,\tau_t^{K}\,\delta_K(t), \tag{88}$$

and

$$\|G_t^{V}\|_F \ge \frac{1}{C_0}\,\tau_t^{V}\,\delta_V(t). \tag{89}$$
Let

$$\theta := \{W_Q,\,W_K,\,W_V\} \tag{90}$$

denote the collection of model parameters. Then the squared Frobenius norm of the full gradient satisfies

$$\|\nabla_{\theta}\mathcal{L}(\theta)\|_F^2 = \|G_t^{Q}\|_F^2 + \|G_t^{K}\|_F^2 + \|G_t^{V}\|_F^2. \tag{91}$$

By Lemma [A.5](https://arxiv.org/html/2605.26489#A1.Thmtheorem5), the one-step decrease of the loss obeys

$$\Delta\mathcal{L}(t) \triangleq \mathcal{L}(t)-\mathcal{L}(t+1) \ge \eta\left(1-\frac{\eta\beta}{2}\right)\|\nabla_{\theta}\mathcal{L}(\theta)\|_F^2. \tag{92}$$

Substituting ([91](https://arxiv.org/html/2605.26489#A3.E91)) into ([92](https://arxiv.org/html/2605.26489#A3.E92)) yields

$$\Delta\mathcal{L}(t) \ge \eta\left(1-\frac{\eta\beta}{2}\right)\Bigl(\|G_t^{Q}\|_F^2 + \|G_t^{K}\|_F^2 + \|G_t^{V}\|_F^2\Bigr). \tag{93}$$
Finally, applying the lower bounds \([88](https://arxiv.org/html/2605.26489#A3.E88)\)–\([89](https://arxiv.org/html/2605.26489#A3.E89)\), we obtain

Δ​ℒ​\(t\)≥ηC0​\(1−η​β2\)​\(\(τtQ​δQ​\(t\)\)2\+\(τtK​δK​\(t\)\)2\+\(τtV​δV​\(t\)\)2\)\.\\Delta\\mathcal\{L\}\(t\)\\geq\\frac\{\\eta\}\{C\_\{0\}\}\\left\(1\-\\frac\{\\eta\\beta\}\{2\}\\right\)\\Bigl\(\\bigl\(\\tau\_\{t\}^\{Q\}\\delta\_\{Q\}\(t\)\\bigr\)^\{2\}\+\\bigl\(\\tau\_\{t\}^\{K\}\\delta\_\{K\}\(t\)\\bigr\)^\{2\}\+\\bigl\(\\tau\_\{t\}^\{V\}\\delta\_\{V\}\(t\)\\bigr\)^\{2\}\\Bigr\)\.\(94\)
In the early stage of training, there exists a constant $D>0$ such that

$$\max\bigl\{\tau_{t}^{\bullet}\,\delta_{\bullet}(t),\ \bullet\in\{Q,K,V\}\bigr\}\geq D.\tag{95}$$

Combining ([95](https://arxiv.org/html/2605.26489#A3.E95)) and ([87](https://arxiv.org/html/2605.26489#A3.E87)) with ([94](https://arxiv.org/html/2605.26489#A3.E94)), we obtain

$$\Delta\mathcal{L}(t)\geq\frac{3D^{2}}{1+\sqrt{d}}\left(1-\frac{\eta\beta}{2}\right).\tag{96}$$

Choosing the step size $\eta=\frac{1}{\beta}$, it follows that

$$\Delta\mathcal{L}(t)\geq\frac{3D^{2}}{2(1+\sqrt{d})}.\tag{97}$$

Consequently,

$$\Delta\mathcal{L}(t)=O(1),\tag{98}$$

that is, the loss decreases by a constant amount in the early stage of training.
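As a sanity check on the descent inequality ([92](https://arxiv.org/html/2605.26489#A3.E92)), the following sketch runs gradient descent on a toy $\beta$-smooth quadratic and compares the actual one-step decrease with the lower bound $\eta(1-\eta\beta/2)\|\nabla\mathcal{L}\|^{2}$. The quadratic loss, the curvature values, and the helper names `loss`, `grad`, and `descent_gap` are assumptions made for illustration; this is not the simplified Transformer analysed in the proof.

```python
# Toy check of  Delta L >= eta * (1 - eta*beta/2) * ||grad||^2  on a quadratic.
# Assumption: L(x) = 0.5 * sum(c_i * x_i^2), which is beta-smooth with beta = max(c_i).

curvatures = [0.5, 1.0, 2.0]   # hypothetical Hessian eigenvalues
beta = max(curvatures)         # smoothness constant of the toy loss
eta = 1.0 / beta               # the step size eta = 1/beta used in (97)


def loss(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(curvatures, x))


def grad(x):
    return [c * xi for c, xi in zip(curvatures, x)]


def descent_gap(x):
    """Return (actual one-step decrease, lower bound from the descent lemma)."""
    g = grad(x)
    x_next = [xi - eta * gi for xi, gi in zip(x, g)]
    decrease = loss(x) - loss(x_next)
    bound = eta * (1.0 - eta * beta / 2.0) * sum(gi * gi for gi in g)
    return decrease, bound


x = [1.0, -2.0, 0.5]
for step in range(5):
    decrease, bound = descent_gap(x)
    print(f"step {step}: decrease={decrease:.6f}  lower bound={bound:.6f}")
    x = [xi - eta * gi for xi, gi in zip(x, grad(x))]
```

On this toy problem the decrease dominates the bound at every step, and both shrink as the gradient norm shrinks, which is the mechanism the proof exploits.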

By Lemma [A.5](https://arxiv.org/html/2605.26489#A1.Thmtheorem5), the one-step decrease of the loss satisfies

$$\Delta\mathcal{L}(t)\triangleq\mathcal{L}(t)-\mathcal{L}(t+1)\leq\eta\!\left(1+\frac{\eta\beta}{2}\right)\|\nabla_{\theta}\mathcal{L}(\theta)\|_{F}^{2}.\tag{99}$$

By Theorem [4.5](https://arxiv.org/html/2605.26489#S4.Thmtheorem5), the gradient with respect to the value projection matrix admits the representation

$$G^{V}=X^{\top}A^{\top}G_{H}.\tag{100}$$

Recall that $A=\operatorname{softmax}(M)$. Let $a_{i}\in\mathbb{R}^{n}$ denote the $i$-th row of $A$, i.e.,

$$a_{i}=\operatorname{softmax}(M_{i}).\tag{101}$$

The Jacobian of $a_{i}$ with respect to $M_{i}$ is given by

$$J(a_{i})=\frac{\partial a_{i}}{\partial M_{i}}=\operatorname{diag}(a_{i})-a_{i}a_{i}^{\top}.\tag{102}$$

Therefore, the gradient with respect to the attention logits $M_{i}$ can be written as

$$G_{M,i}=\frac{\partial\mathcal{L}}{\partial M_{i}}=J(a_{i})^{\top}G_{A,i}.\tag{103}$$

Stacking all rows together, the gradient with respect to the attention logits $M$ can be written as

$$G_{M}=\frac{\partial\mathcal{L}}{\partial M}=\bigl[J(a_{i})^{\top}G_{A,i}\bigr]_{i=1}^{n}.\tag{104}$$
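The row-wise Jacobian in ([102](https://arxiv.org/html/2605.26489#A3.E102)) and the chain rule in ([103](https://arxiv.org/html/2605.26489#A3.E103)) can be checked numerically. The sketch below is a minimal pure-Python illustration rather than the authors' code; the helper names `softmax`, `softmax_jacobian`, `logits_gradient`, and `finite_difference_jacobian` are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)                                   # subtract the max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(logits):
    """J = diag(a) - a a^T for a = softmax(logits), as in (102)."""
    a = softmax(logits)
    n = len(a)
    return [[(a[i] if i == j else 0.0) - a[i] * a[j] for j in range(n)]
            for i in range(n)]


def logits_gradient(logits, grad_a):
    """Single-row version of (103): G_M = J^T G_A."""
    J = softmax_jacobian(logits)
    n = len(logits)
    return [sum(J[k][j] * grad_a[k] for k in range(n)) for j in range(n)]


def finite_difference_jacobian(logits, eps=1e-6):
    """Central-difference estimate of d a_i / d M_j, used only as a reference."""
    n = len(logits)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up = list(logits)
        dn = list(logits)
        up[j] += eps
        dn[j] -= eps
        a_up, a_dn = softmax(up), softmax(dn)
        for i in range(n):
            J[i][j] = (a_up[i] - a_dn[i]) / (2 * eps)
    return J


row = [0.3, -1.2, 2.0, 0.5]                           # hypothetical logits M_i
analytic = softmax_jacobian(row)
numeric = finite_difference_jacobian(row)
max_err = max(abs(analytic[i][j] - numeric[i][j])
              for i in range(len(row)) for j in range(len(row)))
print(f"max |analytic - finite difference| = {max_err:.2e}")
```

Because every row of $J(a_{i})$ sums to zero, an upstream gradient $G_{A,i}$ that is constant across entries produces a zero gradient on the logits, which `logits_gradient` reproduces.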
Recall that

$$M=\frac{QK^{\top}}{\sqrt{d}},\qquad Q=XW_{Q},\qquad K=XW_{K}.\tag{105}$$

Hence, the gradients with respect to the projection matrices $W_{Q}$ and $W_{K}$ are given by

$$G^{Q}=\frac{\partial\mathcal{L}}{\partial W_{Q}}=\frac{1}{\sqrt{d}}\,X^{\top}G_{M}K,\tag{106}$$

and

$$G^{K}=\frac{\partial\mathcal{L}}{\partial W_{K}}=\frac{1}{\sqrt{d}}\,X^{\top}G_{M}^{\top}Q.\tag{107}$$
Next, consider the gradient with respect to the logits $Z$. Using the per-sample bound ([133](https://arxiv.org/html/2605.26489#A3.E133)) established below, we have

$$\|G_{Z}\|_{F}=\sqrt{\sum_{i=1}^{n}\frac{1}{n^{2}}\,\|p_{i}-y_{i}\|_{2}^{2}}\leq\sqrt{\sum_{i=1}^{n}\frac{1}{n^{2}}}\,2(C-1)\exp\!\bigl(-\tau^{V}\omega_{\min}\bigr).\tag{108}$$

Therefore,

$$\|G_{Z}\|_{F}\leq\frac{2(C-1)}{\sqrt{n}}\exp\!\bigl(-\tau^{V}\omega_{\min}\bigr).\tag{109}$$

Let

$$C_{H}=\frac{2(C-1)}{\sqrt{n}}\,\|W_{C}\|_{F}.\tag{110}$$

Then, using $\|G_{H}\|_{F}\leq\|G_{Z}\|_{F}\,\|W_{C}\|_{F}$, it follows that

$$\|G_{H}\|_{F}\leq C_{H}\exp\!\bigl(-\tau^{V}\omega_{\min}\bigr).\tag{111}$$

Moreover, we have

$$\|G^{V}\|_{F}=\|X^{\top}A^{\top}G_{H}\|_{F}\leq\|X^{\top}\|_{2}\,\|A^{\top}\|_{2}\,\|G_{H}\|_{F}\leq\sqrt{n}\,\lambda_{\max}(X)\,\|G_{H}\|_{F}.\tag{112}$$

Define

$$C_{V}=2(C-1)\lambda_{\max}(X)\,\|W_{C}\|_{F}.\tag{113}$$

Then it follows that

$$\|G^{V}\|_{F}\leq C_{V}\exp\!\bigl(-\tau^{V}\omega_{\min}\bigr).\tag{114}$$
Next, we compute

$$\|G^{Q}\|_{F}=\frac{1}{\sqrt{d}}\,\|X^{\top}G_{M}K\|_{F}\leq\frac{1}{\sqrt{d}}\,\lambda_{\max}^{2}(X)\,\tau^{K}\|G_{M}\|_{F}.\tag{115}$$

Moreover, the Frobenius norm of $G_{M}$ satisfies

$$\begin{aligned}
\|G_{M}\|_{F}&=\sqrt{\sum_{i=1}^{n}\bigl\|J(a_{i})^{\top}G_{A,i}\bigr\|_{F}^{2}}\\
&\leq\sqrt{\sum_{i=1}^{n}\|J(a_{i})\|_{2}^{2}\,\|G_{A,i}\|_{F}^{2}}\\
&\leq\max_{i}\|J(a_{i})\|_{2}\,\sqrt{\sum_{i=1}^{n}\|G_{A,i}\|_{F}^{2}}\\
&=\max_{i}\|J(a_{i})\|_{2}\,\|G_{A}\|_{F}.
\end{aligned}\tag{116}$$

Substituting the bound $\|J(a_{i})\|_{2}\leq 2(n-1)\exp\!\left(-\frac{\gamma_{\min}}{\sqrt{d}}\,\tau^{Q}\tau^{K}\right)$ into ([116](https://arxiv.org/html/2605.26489#A3.E116)), we obtain

$$\|G_{M}\|_{F}\leq 2(n-1)\exp\!\left(-\frac{\gamma_{\min}}{\sqrt{d}}\,\tau^{Q}\tau^{K}\right)\|G_{A}\|_{F}.\tag{117}$$

Moreover, the gradient with respect to the attention matrix satisfies

$$\|G_{A}\|_{F}=\|G_{H}V\|_{F}\leq\|G_{H}\|_{F}\,\|V\|_{F}\leq C_{H}\lambda_{\max}(X)\,\tau^{V}\exp\!\bigl(-\tau^{V}\omega_{\min}\bigr).\tag{118}$$
Next, recall that

$$M=\frac{QK^{\top}}{\sqrt{d}}=\frac{XW_{Q}W_{K}^{\top}X^{\top}}{\sqrt{d}}=\frac{\tau^{Q}\tau^{K}}{\sqrt{d}}\,X\bar{W}_{Q}\bar{W}_{K}^{\top}X^{\top}=\frac{\tau^{Q}\tau^{K}}{\sqrt{d}}\,\bar{M}.\tag{119}$$

For the $i$-th row of $\bar{M}$, define the margin

$$\gamma_{i}:=\operatorname{gap}\!\bigl(\bar{M}_{i,:}\bigr),\qquad\gamma_{\min}:=\min_{i}\gamma_{i}.\tag{120}$$

Consequently, for the original logits $M$, we have

$$\operatorname{gap}\!\bigl(M_{i,:}\bigr)=\frac{\tau^{Q}\tau^{K}}{\sqrt{d}}\,\gamma_{i}\geq\frac{\tau^{Q}\tau^{K}}{\sqrt{d}}\,\gamma_{\min}.\tag{121}$$

From Lemma [A.7](https://arxiv.org/html/2605.26489#A1.Thmtheorem7), we obtain the bound for the Jacobian matrix

$$\|J(a_{i})\|_{2}\leq 2(1-a_{\max}),\tag{122}$$

where $a_{\max}$ denotes the largest entry of $a_{i}$. Combining this with Lemma [A.6](https://arxiv.org/html/2605.26489#A1.Thmtheorem6) gives

$$\|J(a_{i})\|_{2}\leq 2(1-a_{\max})\leq 2(n-1)\exp\!\left(-\operatorname{gap}(M_{i,:})\right).\tag{123}$$

Thus, by ([121](https://arxiv.org/html/2605.26489#A3.E121)), we can derive the following exponential decay bound:

$$\|J(a_{i})\|_{2}\leq 2(n-1)\exp\!\left(-\frac{\gamma_{\min}}{\sqrt{d}}\,\tau^{Q}\tau^{K}\right).\tag{124}$$
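The chain of bounds ([122](https://arxiv.org/html/2605.26489#A3.E122))–([124](https://arxiv.org/html/2605.26489#A3.E124)) can be illustrated on a single row. The sketch below assumes that $\operatorname{gap}(\cdot)$ is the largest entry minus the second-largest entry, and it replaces the spectral norm by the maximum absolute row sum $\max_{i}2a_{i}(1-a_{i})$, which upper-bounds the spectral norm of the symmetric matrix $\operatorname{diag}(a)-aa^{\top}$. The row `m_bar_row` and the scale values are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gap(row):
    """Assumed definition: largest entry minus second-largest entry."""
    top, second = sorted(row, reverse=True)[:2]
    return top - second


def jacobian_norm_upper(a):
    """Max absolute row sum of diag(a) - a a^T, equal to max_i 2*a_i*(1 - a_i).

    For a symmetric matrix this quantity upper-bounds the spectral norm.
    """
    return max(2.0 * ai * (1.0 - ai) for ai in a)


def bounds_for_scale(m_bar_row, scale):
    """Return (row-sum norm, 2(1 - a_max), 2(n-1)exp(-gap)) for logits scale * m_bar_row."""
    logits = [scale * v for v in m_bar_row]
    a = softmax(logits)
    n = len(a)
    return (jacobian_norm_upper(a),
            2.0 * (1.0 - max(a)),
            2.0 * (n - 1) * math.exp(-gap(logits)))


m_bar_row = [0.9, 0.4, -0.2, 0.1]     # hypothetical row of the normalized logits
for scale in (1.0, 4.0, 16.0):        # plays the role of tau_Q * tau_K / sqrt(d)
    norm, mid, outer = bounds_for_scale(m_bar_row, scale)
    print(f"scale={scale:5.1f}  norm<={norm:.3e}  2(1-a_max)={mid:.3e}  2(n-1)e^-gap={outer:.3e}")
```

All three quantities shrink as the scale grows, which mirrors how growing $\tau^{Q}\tau^{K}$ saturates the attention softmax and suppresses $G_{M}$.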
Similarly, we can handle $\tau^{V}$ using the same approach.

Let

$$W_{V}=\tau^{V}\bar{W}_{V}.\tag{125}$$

Then, the matrix $Z$ can be expressed as

$$Z=AXW_{V}W_{C}=\tau^{V}AX\bar{W}_{V}W_{C},\tag{126}$$

and let

$$\bar{Z}=AX\bar{W}_{V}W_{C}.\tag{127}$$

Analogously, we define the *logit gap* on $\bar{Z}$ as

$$\omega_{i}:=\bar{Z}_{i,y_{i}}-\max_{c\neq y_{i}}\bar{Z}_{i,c}.\tag{128}$$

We define the minimum logit gap as

$$\omega_{\min}:=\min_{i}\omega_{i}.\tag{129}$$

From Lemma [A.6](https://arxiv.org/html/2605.26489#A1.Thmtheorem6), we obtain the following bound on the probabilities:

$$1-p_{i,y_{i}}\leq(C-1)\exp\left(-\tau^{V}\omega_{i}\right)\leq(C-1)\exp\left(-\tau^{V}\omega_{\min}\right).\tag{130}$$

Consider the Frobenius norm of the gradient with respect to $Z$:

$$\|G_{Z}\|_{F}^{2}=\left\|\frac{\partial\ell}{\partial Z}\right\|_{F}^{2}=\sum_{i=1}^{n}\left\|\frac{\partial\ell}{\partial Z_{i}}\right\|_{F}^{2}=\sum_{i=1}^{n}\frac{1}{n^{2}}\,\|p_{i}-y_{i}\|_{2}^{2}.\tag{131}$$

For each sample $i$, we have

$$\begin{aligned}
\|p_{i}-y_{i}\|_{2}^{2}&=(1-p_{i,y_{i}})^{2}+\sum_{c\neq y_{i}}p_{i,c}^{2}\\
&\leq(1-p_{i,y_{i}})^{2}+\left(\sum_{c\neq y_{i}}p_{i,c}\right)^{2}\\
&=(1-p_{i,y_{i}})^{2}+(1-p_{i,y_{i}})^{2}\\
&=2(1-p_{i,y_{i}})^{2}.
\end{aligned}\tag{132}$$

Therefore,

$$\|p_{i}-y_{i}\|_{2}\leq\sqrt{2}\,(1-p_{i,y_{i}})\leq 2(C-1)\exp\!\bigl(-\tau^{V}\omega_{\min}\bigr).\tag{133}$$
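The per-sample chain ([130](https://arxiv.org/html/2605.26489#A3.E130))–([133](https://arxiv.org/html/2605.26489#A3.E133)) can be checked for one row of logits. The sketch below is illustrative only: the row `z_bar_row`, the label, and the scale `tau_v` are hypothetical values, and `check_sample` is a helper introduced here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def check_sample(z_bar_row, label, tau_v):
    """Return (||p - y||_2, sqrt(2)*(1 - p_y), 2*(C-1)*exp(-tau_v*omega)) for one sample.

    z_bar_row holds the normalized logits, so the actual logits are tau_v * z_bar_row.
    omega is the logit gap of (128): true-class entry minus the best competing entry.
    """
    p = softmax([tau_v * v for v in z_bar_row])
    C = len(p)
    err = math.sqrt(sum((pc - (1.0 if c == label else 0.0)) ** 2
                        for c, pc in enumerate(p)))
    omega = z_bar_row[label] - max(v for c, v in enumerate(z_bar_row) if c != label)
    middle = math.sqrt(2.0) * (1.0 - p[label])
    outer = 2.0 * (C - 1) * math.exp(-tau_v * omega)
    return err, middle, outer


z_bar_row = [1.2, 0.3, -0.5]          # hypothetical normalized logits, C = 3 classes
for tau_v in (1.0, 3.0, 9.0):
    err, middle, outer = check_sample(z_bar_row, 0, tau_v)
    print(f"tau_V={tau_v:4.1f}  ||p-y||={err:.3e}  sqrt2(1-p_y)={middle:.3e}  bound={outer:.3e}")
```

With a positive logit gap, increasing `tau_v` drives all three quantities toward zero exponentially fast, which is the decay that ([109](https://arxiv.org/html/2605.26489#A3.E109)) propagates to $G_{Z}$.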
Next, combining the previously established bounds, we obtain

$$\begin{aligned}
\|G^{Q}\|_{F}&\leq\frac{1}{\sqrt{d}}\,\lambda_{\max}^{2}(X)\,\tau^{K}\cdot 2(n-1)\exp\!\left(-\frac{\gamma_{\min}}{\sqrt{d}}\,\tau^{Q}\tau^{K}\right)\\
&\qquad\qquad\cdot C_{H}\lambda_{\max}(X)\,\tau^{V}\exp\!\bigl(-\tau^{V}\omega_{\min}\bigr)\\
&\leq\frac{2(n-1)}{\sqrt{d}}\,C_{H}\lambda_{\max}^{3}(X)\,\exp\!\bigl(-\omega_{\min}\tau^{V}\bigr)\exp\!\left(-\frac{\gamma_{\min}}{\sqrt{d}}\,\tau^{Q}\tau^{K}\right)\cdot\tau^{K}\tau^{V}.
\end{aligned}\tag{134}$$

Similarly, we have

$$\|G^{K}\|_{F}\leq\frac{2(n-1)}{\sqrt{d}}\,C_{H}\lambda_{\max}^{3}(X)\,\exp\!\bigl(-\omega_{\min}\tau^{V}\bigr)\exp\!\left(-\frac{\gamma_{\min}}{\sqrt{d}}\,\tau^{Q}\tau^{K}\right)\cdot\tau^{Q}\tau^{V}.\tag{135}$$

Then the following upper bounds hold:

$$\begin{cases}
\|G^{V}\|_{F}\leq C_{V}\exp\!\bigl(-\omega_{\min}\tau^{V}\bigr),\\[4pt]
\|G^{K}\|_{F}\leq\dfrac{2(n-1)}{\sqrt{d}}\,C_{H}\lambda_{\max}^{3}(X)\,\tau^{Q}\tau^{V}\exp\!\bigl(-\omega_{\min}\tau^{V}\bigr)\exp\!\left(-\dfrac{\gamma_{\min}}{\sqrt{d}}\tau^{Q}\tau^{K}\right),\\[4pt]
\|G^{Q}\|_{F}\leq\dfrac{2(n-1)}{\sqrt{d}}\,C_{H}\lambda_{\max}^{3}(X)\,\tau^{K}\tau^{V}\exp\!\bigl(-\omega_{\min}\tau^{V}\bigr)\exp\!\left(-\dfrac{\gamma_{\min}}{\sqrt{d}}\tau^{Q}\tau^{K}\right).
\end{cases}\tag{136}$$

The constants are defined as

$$C_{V}=2(C-1)\lambda_{\max}(X)\,\|W_{C}\|_{F},\qquad C_{H}=\dfrac{2(C-1)}{\sqrt{n}}\,\|W_{C}\|_{F}.\tag{137}$$

Let

$$C_{QK}:=\dfrac{2(n-1)}{\sqrt{d}}\,C_{H}\lambda_{\max}^{3}(X),\qquad\omega:=\omega_{\min},\qquad\gamma:=\dfrac{\gamma_{\min}}{\sqrt{d}}.\tag{138}$$

Then ([136](https://arxiv.org/html/2605.26489#A3.E136)) can be equivalently written as

$$\begin{cases}
\|G^{V}\|_{F}\leq C_{V}\exp(-\omega\tau^{V}),\\
\|G^{K}\|_{F}\leq C_{QK}\,\tau^{Q}\tau^{V}\exp(-\omega\tau^{V})\exp(-\gamma\tau^{Q}\tau^{K}),\\
\|G^{Q}\|_{F}\leq C_{QK}\,\tau^{K}\tau^{V}\exp(-\omega\tau^{V})\exp(-\gamma\tau^{Q}\tau^{K}).
\end{cases}\tag{139}$$
Therefore, the squared Frobenius norm of the full gradient satisfies

$$\begin{aligned}
\|\nabla_{\theta}\mathcal{L}(\theta)\|_{F}^{2}&\leq\|G^{V}\|_{F}^{2}+\|G^{K}\|_{F}^{2}+\|G^{Q}\|_{F}^{2}\\
&\leq C_{V}^{2}\exp(-2\omega\tau^{V})+C_{QK}^{2}\Big[(\tau^{Q}\tau^{V})^{2}+(\tau^{K}\tau^{V})^{2}\Big]\exp(-2\omega\tau^{V})\exp(-2\gamma\tau^{Q}\tau^{K}).
\end{aligned}\tag{140}$$

Since exponential decay dominates polynomial decay, for sufficiently large $\tau^{V}$ and $\tau^{Q}\tau^{K}$ there exists $p>2$ such that

$$\exp(-2\omega\tau^{V})\leq(2\omega\tau^{V})^{-p}\leq(2\omega)^{-p}(\tau^{V})^{-p},\tag{141}$$

and

$$\exp(-2\gamma\tau^{Q}\tau^{K})\leq(2\gamma\tau^{Q}\tau^{K})^{-p}=(2\gamma)^{-p}(\tau^{Q}\tau^{K})^{-p}.\tag{142}$$

From Theorem [4.5](https://arxiv.org/html/2605.26489#S4.Thmtheorem5), we have $\varepsilon=O(\eta/\tau)$. Consequently, by ([99](https://arxiv.org/html/2605.26489#A3.E99)), the one-step decrease of the loss satisfies

$$\begin{aligned}
\Delta\mathcal{L}(t)&=\mathcal{L}(t)-\mathcal{L}(t+1)\\
&\leq\eta\left(1+\frac{\eta\beta}{2}\right)\Big[C_{V}^{2}\exp(-2\omega\tau^{V})+C_{QK}^{2}\big((\tau^{Q})^{2}+(\tau^{K})^{2}\big)(\tau^{V})^{2}\exp(-2\omega\tau^{V})\exp(-2\gamma\tau^{Q}\tau^{K})\Big]\\
&\leq\eta\left(1+\frac{\eta\beta}{2}\right)\Bigg[\dfrac{C_{V}^{2}}{(2\omega)^{p}}\left(\dfrac{1}{\tau^{V}}\right)^{p}+\dfrac{2C_{QK}^{2}}{(4\omega\gamma)^{p}}\left(\dfrac{1}{\tau^{V}}\right)^{p-2}\left(\dfrac{1}{\tau^{K}}\right)^{2p-2}\Bigg]\\
&\leq O(\varepsilon^{\alpha}),
\end{aligned}\tag{143}$$

where

$$\alpha=\min\{\,p,\;3p-4\,\}=p.\tag{144}$$
∎
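The polynomial relaxations ([141](https://arxiv.org/html/2605.26489#A3.E141)) and ([142](https://arxiv.org/html/2605.26489#A3.E142)) both rest on the scalar inequality $\exp(-x)\leq x^{-p}$, which is equivalent to $x-p\log x\geq 0$. For $p\leq e$ this holds for every $x>0$; for $p>e$ it fails on an intermediate interval and holds again once $x$ exceeds the larger root of $x=p\log x$. The sketch below locates that root by bisection. The helper names `exp_below_power` and `threshold` are hypothetical, and the choice $p=3$ is only an example.

```python
import math


def exp_below_power(x, p):
    """True when exp(-x) <= x**(-p), i.e. when x - p*log(x) >= 0 (requires x > 0)."""
    return x - p * math.log(x) >= 0.0


def threshold(p, hi=1e6, iters=200):
    """Bisection for the larger root of x = p*log(x); assumes p > e.

    x - p*log(x) is negative at x = p (because p > e) and increasing for x > p,
    so the sign change on [p, hi] is unique.
    """
    lo = p
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if exp_below_power(mid, p):
            hi = mid
        else:
            lo = mid
    return hi


p = 3.0
x0 = threshold(p)
print(f"for p={p}, exp(-x) <= x^(-p) holds for all x >= {x0:.4f}")
print("holds at x=3.0:", exp_below_power(3.0, p))
print("holds at x=5.0:", exp_below_power(5.0, p))
```

In the proof the role of $x$ is played by $2\omega\tau^{V}$ and $2\gamma\tau^{Q}\tau^{K}$, so the relaxation applies once the weight norms have grown past the corresponding threshold.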

## Appendix D Experimental Setting

### D.1 GPT-2 on FineWeb

#### Models and Architecture\.

We train GPT-2 Small (124M) and Medium (355M) models. While rooted in the standard decoder-only Transformer architecture (Radford et al., [2019](https://arxiv.org/html/2605.26489#bib.bib57); Vaswani et al., [2017](https://arxiv.org/html/2605.26489#bib.bib1)), we incorporate modern architectural enhancements to improve training stability and performance. Specifically, we follow the modded-nanogpt benchmark ([https://github.com/KellerJordan/modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt)), which replaces LayerNorm with RMSNorm (Zhang and Sennrich, [2019](https://arxiv.org/html/2605.26489#bib.bib59)), adopts Rotary Positional Embeddings (RoPE) (Su et al., [2023](https://arxiv.org/html/2605.26489#bib.bib52)), and utilizes Squared ReLU activations.

The architectural configurations for the two scales are as follows:

- Small (124M): A modified configuration with $n_{\text{layer}}=12$, $d_{\text{model}}=768$, and $n_{\text{head}}=6$. Notably, this results in a head dimension of $d_{\text{head}}=128$, larger than the standard 64.
- Medium (355M): A standard configuration with $n_{\text{layer}}=24$, $d_{\text{model}}=1024$, and $n_{\text{head}}=16$, keeping the standard head dimension $d_{\text{head}}=64$.

#### Dataset and Tokenization\.

Training is performed on the FineWeb dataset (Penedo et al., [2024](https://arxiv.org/html/2605.26489#bib.bib58)), strictly adhering to the 10B token subset prescribed by the NanoGPT benchmark. This subset provides a high-quality, representative sample suitable for efficiency comparisons. We utilize the canonical GPT-2 Byte-Pair Encoding (BPE) tokenizer, processing input sequences with a context window of $L=1024$ tokens. To ensure consistent benchmarking, all evaluation metrics are reported on the standard 10M-token FineWeb validation partition.

#### Hyperparameters\.

To optimize computational efficiency, we follow the hyperparameter setting in the NanoGPT speedrun benchmark, which adopts specific settings for learning rates, weight decay, and scheduler phases for each model. We employ a split optimization strategy, applying distinct learning rates to the embeddings/head ($LR_{\text{head}}$) versus the transformer body ($LR_{\text{body}}$). The detailed configurations are summarized in Table [1](https://arxiv.org/html/2605.26489#A4.T1).

Table 1: Hyperparameters for Modded GPT-2 experiments. WSD denotes the Warmup-Stable-Decay schedule (linear warmup, constant hold, linear decay). Step Decay drops the learning rate by a factor of 10 at specified milestones.

#### Optimization Details.

- Small Long-Run: Trained for $\sim$10.7B tokens using a Step Decay schedule. To maximize convergence on the larger data budget, the learning rate is dropped by a factor of 10 at 50% and 75% of the total training steps.
- Small Short-Runs: We conduct a controlled comparison between AdamW and the Muon optimizer over a $\sim$2.6B token budget. Both runs utilize a WSD (Warmup-Stable-Decay) schedule with a linear warmdown. Weight decay is explicitly disabled ($\lambda=0$) for both optimizers to strictly isolate the algorithmic differences.
- Medium Experiment: Trained for $\sim$2.6B tokens using AdamW with a WSD schedule. Unlike the Small experiments, we apply specific regularization to the larger model by introducing weight decay ($\lambda=0.125$) to the Transformer body parameters, while keeping $\lambda=0$ for the embeddings and head.
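The two schedules used above can be sketched as plain functions of the step index. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the warmup and decay lengths, the decay-to-zero endpoint of WSD, and the function names `wsd_lr` and `step_decay_lr` are hypothetical, while the 50% and 75% milestones and the factor of 10 follow the description of the Small Long-Run.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup-Stable-Decay: linear warmup, constant hold, linear decay.

    Assumption: the final linear decay ends at zero.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


def step_decay_lr(step, total_steps, peak_lr, milestones=(0.5, 0.75), factor=0.1):
    """Step decay: multiply the rate by `factor` once each milestone fraction is reached."""
    lr = peak_lr
    for frac in milestones:
        if step >= frac * total_steps:
            lr *= factor
    return lr


total = 1000                                   # hypothetical run length
for s in (0, 99, 500, 900, 1000):
    print(s, wsd_lr(s, total, 1.0, warmup_steps=100, decay_steps=200),
          step_decay_lr(s, total, 1.0))
```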

### D.2 Pre-training Setup: LLaMA on C4

#### Model Architecture\.

We implement the LLaMA architecture (Touvron et al., [2023](https://arxiv.org/html/2605.26489#bib.bib6)), which features RMSNorm for pre-normalization, SwiGLU activations, and Rotary Positional Embeddings (RoPE) (Su et al., [2023](https://arxiv.org/html/2605.26489#bib.bib52)). The implementation is based on the HuggingFace Transformers library (Wolf et al., [2020](https://arxiv.org/html/2605.26489#bib.bib53)). In this study, we experiment with two distinct model scales: 0.5B and 2B parameters. The detailed hyperparameters and architectural specifications are summarized in Table [2](https://arxiv.org/html/2605.26489#A4.T2).

Table 2: Detailed architectural configurations and hyperparameters for the LLaMA models. $LR_{\max}$ indicates the maximum learning rate applied during the cosine schedule.

#### Dataset and Tokenization.

Our models are pre-trained on the C4 (Colossal Clean Crawled Corpus) dataset (Raffel et al., [2020](https://arxiv.org/html/2605.26489#bib.bib54)). For text processing, we employ a SentencePiece tokenizer configured with a vocabulary size of $V=32{,}100$. The input sequence length is fixed at $L=2048$ tokens. To assess model performance, we report both perplexity and loss metrics on the official C4 validation split.

#### Optimization and Training\.

Our training configuration aligns with Zhao et al. ([2024](https://arxiv.org/html/2605.26489#bib.bib56)). We utilize the AdamW optimizer with hyperparameters set to $\beta_{1}=0.9$, $\beta_{2}=0.95$, and $\epsilon=10^{-8}$. Gradient clipping is applied with a threshold of $1.0$.

Regarding regularization, we employ two distinct settings:

- Standard Setting: For the LLaMA-2B and the primary LLaMA-0.5B baseline, we apply a standard weight decay of $\lambda=0.1$.
- No-WD Variant: We train an additional LLaMA-0.5B variant with weight decay disabled ($\lambda=0$) while keeping all other hyperparameters identical, serving as a non-regularized baseline for comparison.

The learning rate is managed via a cosine decay schedule, which consists of:

1. A linear warmup phase for the first 1,000 steps;
2. A peak learning rate ($LR_{\max}$) as specified in Table [2](https://arxiv.org/html/2605.26489#A4.T2);
3. A cosine decay phase reducing the rate to a minimum of $LR_{\min}=0.1\times LR_{\max}$.
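The three phases listed above can be written as a single function of the step index. The sketch below is a hedged illustration: the 1,000 warmup steps, the 100,000 total steps, and the floor of $0.1\times LR_{\max}$ come from the text, whereas the exact warmup indexing, the example peak rate, and the name `cosine_lr` are assumptions.

```python
import math


def cosine_lr(step, total_steps, lr_max, warmup_steps=1000, min_ratio=0.1):
    """Linear warmup to lr_max, then cosine decay down to min_ratio * lr_max."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    lr_min = min_ratio * lr_max
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


lr_max = 1e-3                                  # hypothetical peak; see Table 2 for actual values
for s in (0, 999, 1000, 50_500, 100_000):
    print(s, f"{cosine_lr(s, 100_000, lr_max):.6e}")
```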

Training is conducted with a global batch size of 512 sequences for a total duration of 100,000 steps, processing approximately 105 billion tokens\.
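The quoted token budget follows directly from the batch size, sequence length, and step count given above; the short calculation below makes the arithmetic explicit (the variable names are illustrative).

```python
# Token budget implied by the stated training configuration.
global_batch_sequences = 512      # sequences per optimizer step
sequence_length = 2048            # tokens per sequence (L = 2048)
total_steps = 100_000

tokens_per_step = global_batch_sequences * sequence_length
total_tokens = tokens_per_step * total_steps

print(f"{tokens_per_step:,} tokens per step")
print(f"{total_tokens / 1e9:.1f}B tokens in total")
```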

#### System Implementation\.

We leverage PyTorch Fully Sharded Data Parallel (FSDP) for distributed training. Specifically, we adopt the HYBRID\_SHARD strategy combined with bfloat16 mixed precision. In this setup, model parameters and buffers are stored in FP32, while gradients and communication operations are performed in bfloat16. CPU offloading is disabled to maximize training efficiency.
