GQA-μP: The maximal parameterization update for grouped query attention
Summary
This paper extends the maximal update parameterization (μP) framework to grouped-query attention (GQA), deriving scaling laws for hyperparameter transfer across model architectures. It introduces spectral norm conditions for feature learning and addresses issues with low-rank weight matrices in GQA.
# The Maximal Parameterization Update for Grouped Query Attention
Source: [https://arxiv.org/html/2605.15290](https://arxiv.org/html/2605.15290)
Kyle R. Chickering (UC Davis & MBZUAI IFM), Huijuan Wang (USC & MBZUAI IFM), Mengxi Wu (USC), Alexander Moreno (MBZUAI IFM), Muhao Chen (UC Davis), Xuezhe Ma (USC & MBZUAI IFM), Daria Soboleva (Cerebras), Joel Hestness (Cerebras), Zhengzhong Liu (MBZUAI IFM), Eric Xing (Carnegie Mellon University & MBZUAI IFM)
###### Abstract
Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization (μP) ensures transfer through principled mathematical analysis but can be challenging to derive for new model architectures. Building on the spectral feature-learning view of Yang et al. ([2023a](https://arxiv.org/html/2605.15290#bib.bib47)), we make two advances. First, we promote spectral norm conditions on the weights from a heuristic to the definition of feature learning, and as a consequence arrive at the Complete-P depth and weight-decay scalings without recourse to lazy-learning. Second, we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables (to our knowledge, the first) derivation of μP scalings for grouped-query attention (GQA). We demonstrate the efficacy of our theoretical derivations by showing learning rate transfer across the GQA repetition hyperparameter as well as experiments regarding transfer over weight decay.
## 1 Introduction
The maximal update parameterization (μP) (Yang and Hu, [2021](https://arxiv.org/html/2605.15290#bib.bib44); Yang et al., [2022](https://arxiv.org/html/2605.15290#bib.bib45)) provides principled rules for zero-shot learning rate transfer across model widths. Thus, the hyperparameters of a large target model can be determined by sweeping a small proxy model. μP has been used to train models up to at least 13B parameters with zero-shot transfer (Blake et al., [2023](https://arxiv.org/html/2605.15290#bib.bib8); Dey et al., [2023](https://arxiv.org/html/2605.15290#bib.bib11); Narayan et al., [2025](https://arxiv.org/html/2605.15290#bib.bib28)). Its applicability, however, has been largely limited to learning rate transfer across model widths. To broaden this scope, Dey et al. ([2025](https://arxiv.org/html/2605.15290#bib.bib12)) introduced Complete-P, extending the original prescriptions to weight decay and model depth. However, many common architectures that are widely deployed in production still lack established μP scalings.
This paper seeks to close this gap by extending the spectral μP framework of Yang et al. ([2023a](https://arxiv.org/html/2605.15290#bib.bib47)) to be more practically useful in deriving μP prescriptions for novel architectures. As an example of the utility of our framework, we derive (to our knowledge, the first) μP scaling for grouped-query attention (GQA) (Ainslie et al., [2023](https://arxiv.org/html/2605.15290#bib.bib1)). Our analysis reveals that GQA surfaces several difficulties that prior work has left unaddressed. First, when using GQA the original μP implementation passes coordinate checks, i.e., the customary correctness tests for the implementation. However, empirical analysis shows that the original μP implementation fails to transfer learning rates, seemingly contradicting established theory (see Figures [1](https://arxiv.org/html/2605.15290#S1.F1) and [5](https://arxiv.org/html/2605.15290#S4.F5)). We resolve this by extending the spectral-norm version of μP introduced in Yang et al. ([2023a](https://arxiv.org/html/2605.15290#bib.bib47)), and showing that the original μP implementation does not pass a more rigorous spectral-norm coordinate check. Second, the intrinsic low rank of GQA weight matrices skews the expected size of layer outputs. To address this issue, we introduce a new norm, namely the expected operator norm, to replace the spectral norm in spectral μP theory and restore the desired scaling behavior.
Our primary contributions are threefold:
1. We extend the spectral μP theory of Yang et al. ([2023a](https://arxiv.org/html/2605.15290#bib.bib47)), which allows derivations of μP for more advanced settings such as weight decay, recursion blocks, and GQA. Our work provides, to our knowledge, the first derivation of μP scaling for GQA.
2. We perform empirical analysis to validate the theory and offer practical guidance for learning rate transfer across GQA settings. In particular, our results suggest that transferring across different numbers of GQA repetitions leads to noisy transfer dynamics, which calls for caution when attempting to transfer learning rates.
3. Our experiments show that, with the correct scalings, both weight decay and the training-time constant $\tau_{\text{epoch}}$ introduced in Wang and Aitchison ([2024](https://arxiv.org/html/2605.15290#bib.bib39)) appear to be transferable.
Figure 1: Comparison of the standard parameterization (left), the vanilla Adam-μP parameterization (middle), and our GQA-μP scaling (right). For a fixed model size, we vary the number of KV heads. The dashed lines indicate the mean optimal learning rates for each parameterization, and the shaded grey region denotes the standard deviation of the optimal learning rates. All models are trained to 10 tokens per parameter (TPP). Additional details can be found in Appendix [B.1.2](https://arxiv.org/html/2605.15290#A2.SS1.SSS2).
## 2 Related Work
Foundations of μP: μP builds on a series of works by Yang, developing the Tensor Programs framework (Yang, [2019](https://arxiv.org/html/2605.15290#bib.bib41); [2020a](https://arxiv.org/html/2605.15290#bib.bib42); [2020b](https://arxiv.org/html/2605.15290#bib.bib43); Yang and Hu, [2021](https://arxiv.org/html/2605.15290#bib.bib44); Yang et al., [2022](https://arxiv.org/html/2605.15290#bib.bib45); [2023b](https://arxiv.org/html/2605.15290#bib.bib46)). This line of work uses random matrix theory to carefully analyze the mathematical properties of neural networks during training, while also empirically demonstrating that these theoretical approaches remain valuable for real-world deep learning. Within the framework of Tensor Programs, Yang et al. ([2022](https://arxiv.org/html/2605.15290#bib.bib45)) derive the well-known μP scaling laws for width under SGD and Adam training. The final paper in the series, Yang et al. ([2023b](https://arxiv.org/html/2605.15290#bib.bib46)), attempts to extend μP to depth scalings. However, they were unable to extend to the case of residual blocks with standard configurations for the hidden layers. Finally, the mathematical framework presented in this work builds on Yang et al. ([2023a](https://arxiv.org/html/2605.15290#bib.bib47)), who show an alternative derivation of the results in Yang et al. ([2022](https://arxiv.org/html/2605.15290#bib.bib45)) based on spectral norms.
Models using GQA: Grouped-query attention (GQA) (Ainslie et al., [2023](https://arxiv.org/html/2605.15290#bib.bib1)) is an efficient attention mechanism that reduces memory usage by sharing key and value heads across groups of query heads. Due to its favorable trade-off between memory efficiency and model performance, GQA has been widely used in modern large language models, including Mistral 7B (Jiang et al., [2023](https://arxiv.org/html/2605.15290#bib.bib16)), LLaMA 3 (Grattafiori et al., [2024](https://arxiv.org/html/2605.15290#bib.bib53)), Qwen3 (Yang et al., [2025](https://arxiv.org/html/2605.15290#bib.bib51)) and K2-V2 (Liu et al., [2025](https://arxiv.org/html/2605.15290#bib.bib52)).
Extensions of μP: The original μP formulation presented in Yang et al. ([2022](https://arxiv.org/html/2605.15290#bib.bib45)) applies only to scaling the width of a fixed-depth, fixed-batch-size neural network. Although this is already a powerful tool, later authors have sought to extend the principles of μP to cases that the original formulation does not cover. Dey et al. ([2023](https://arxiv.org/html/2605.15290#bib.bib11)) run large-scale validation experiments using μP and find empirical evidence that learning rate can transfer across batch and dataset size. Dey et al. ([2023](https://arxiv.org/html/2605.15290#bib.bib11)) suggest μP-type scalings for weight decay, the Adam $\varepsilon$, and depth. Their contributions to depth scaling are most notable, as their empirical findings contradict the scaling presented in Yang et al. ([2023b](https://arxiv.org/html/2605.15290#bib.bib46)). However, their extensive empirical analysis suggests that the scaling they derive is correct. We arrive at the same scaling in Section [3.3](https://arxiv.org/html/2605.15290#S3.SS3) using the framework we outline in this paper. More recently, Mlodozeniec et al. ([2025](https://arxiv.org/html/2605.15290#bib.bib56)) built on the work of Dey et al. ([2025](https://arxiv.org/html/2605.15290#bib.bib12)) by extending the SDE parameterization to cover hyperparameter transfer for batch size, as well as showing the value of per-layer learning rate tuning.
Blake et al. ([2023](https://arxiv.org/html/2605.15290#bib.bib8)) apply μP in the context of large-scale, low-precision LLM training. They use ABC parameterizations to apply the μP scaling rules while maintaining unit variance for all layers in the network, which they refer to as unit scaling-μP. Additionally, they empirically validate that learning rate transfer persists across datasets, batch sizes, depths, and training iterations under controlled conditions. Narayan et al. ([2025](https://arxiv.org/html/2605.15290#bib.bib28)) suggest a different, simpler version of unit scaling-μP, which they also show works for training low-precision networks with μP.
Finally, a related subsequent work, Zheng et al. ([2026](https://arxiv.org/html/2605.15290#bib.bib55)), has proposed a theoretical framework similar to ours. Both our work and theirs share the perspective that the spectral norm provides a principled alternative to Tensor Programs (Yang and Hu, [2021](https://arxiv.org/html/2605.15290#bib.bib44)) for deriving μP. However, our works differ in scope and motivation. Zheng et al. ([2026](https://arxiv.org/html/2605.15290#bib.bib55)) systematically apply their framework to a broad class of optimizers under width and depth scaling, while our work identifies the expected operator norm as necessary to correctly handle rank-degenerate weights and uses this norm to provide the first derivation of μP for GQA.
## 3 Deriving Novel Maximal Update Parameterizations
Consider a collection of weight matrices $\bm{W}^{\ell} \in \mathbb{R}^{n_{\ell} \times m_{\ell}}$ in a neural network, indexed by layer $\ell$. Yang et al. ([2023a](https://arxiv.org/html/2605.15290#bib.bib47)) prove that conditions imposed upon the weight matrices of a network imply feature learning (and thus learning rate transfer) as defined in Yang et al. ([2022](https://arxiv.org/html/2605.15290#bib.bib45)) (see Equation [3](https://arxiv.org/html/2605.15290#S3.E3)). For initial weights $\bm{W}_{0}^{\ell}$ and iterates $\bm{W}_{t}^{\ell} = \bm{W}_{0}^{\ell} + \sum_{k=1}^{t} \Delta\bm{W}_{k}^{\ell}$, where $\Delta\bm{W}_{t}^{\ell} = \bm{W}_{t}^{\ell} - \bm{W}_{t-1}^{\ell}$, Yang et al. ([2023a](https://arxiv.org/html/2605.15290#bib.bib47)) suggest that both the initialization and the updates must satisfy:

$$\left\| \bm{W}^{\ell}_{0} \right\| = \Theta\left(\sqrt{n_{\ell}}/\sqrt{m_{\ell}}\right), \quad \left\| \Delta\bm{W}_{t}^{\ell} \right\| = \Theta\left(\sqrt{n_{\ell}}/\sqrt{m_{\ell}}\right), \tag{1}$$

where $\left\| \bm{W} \right\| := \sup_{\left\| x \right\|_{2} = 1} \left\| \bm{W}x \right\|_{2}$ is the usual spectral (or induced) norm. This spectral perspective on feature learning is powerful, and we introduce three minor but important modifications that enable us to extend the method of Yang et al. ([2023a](https://arxiv.org/html/2605.15290#bib.bib47)) to cover novel architectures like GQA.
Figure 2: Demonstration of the failure of the spectral norm to accurately capture the behavior for low-rank matrices when the inputs are randomly sampled i.i.d. from $\mathcal{N}(0,1)$. $r$ is the number of key-value head repetitions and $r=1$ corresponds to the setting without GQA. Each point is averaged over $1000$ independent draws of $\bm{A}$, with the shaded band showing $\pm 1$ standard deviation.

Analysis Under a New Norm: The spectral norm can be interpreted as the maximal deformation of an input vector induced by an operator $\varphi: \mathbb{R}^{m} \rightarrow \mathbb{R}^{n}$. For full-rank operators, such as dense feed-forward layers, random matrix theory shows that the quantitative value of the spectral norm is attained asymptotically. In the classical case of an $n \times n$ random matrix $A$, we have the sharp asymptotic relation $\left\| A \right\| = 2\sqrt{n}$ as $n \rightarrow \infty$.
However, for rank-degenerate matrices like those used in GQA, the spectral norm is not attained asymptotically in practice. The reason is that, as shown by Tensor Programs (Yang and Hu, [2021](https://arxiv.org/html/2605.15290#bib.bib44)), the inputs to a GQA layer during training are i.i.d., and therefore, for rank-degenerate matrices, the vectors that cause this “maximal deformation” occur with probability zero! A visualization of this discrepancy can be seen in Figure [2](https://arxiv.org/html/2605.15290#S3.F2). Instead, we should use a notion of size that reflects the actual deformation encountered during training.
To this end, let $\Omega$ be the probability distribution of the input vectors. We define the expected operator norm as¹

$$\left\| \bm{A} \right\|_{\mathbb{E},\Omega,p} := \mathbb{E}_{x\sim\Omega}\left[ \frac{\left\| \bm{A}x \right\|_{p}}{\left\| x \right\|_{p}} \right]. \tag{2}$$

Throughout this paper, we adopt the convention $\left\| \bm{A} \right\|_{\mathbb{E}} = \left\| \bm{A} \right\|_{\mathbb{E},\mathcal{N}(0,1),2}$, where $x \sim \mathcal{N}(0,1)$ has i.i.d. entries. Crucially, when $\bm{A}$ is square with i.i.d. entries, it has full rank with probability one, and we obtain the asymptotic relationship $\left\| \bm{A} \right\|_{\mathbb{E}} = \Theta(\left\| \bm{A} \right\|)$. A proof is provided in Lemma [2](https://arxiv.org/html/2605.15290#Thmlemma2).

¹ Technically, the object we define as $\left\| \bm{A} \right\|_{\mathbb{E},\Omega,p}$ is only a seminorm without further constraints on $\Omega$. In particular, if $\text{supp}\,\Omega \neq \mathbb{R}^{n}$ then it is possible for all random vectors $x \sim \Omega$ to lie in the nullspace of $\bm{A}$. This edge case does not occur in neural network training.
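As a rough numerical illustration of equation 2, the following NumPy sketch estimates the expected operator norm by Monte Carlo and compares it with the spectral norm, first for a square Gaussian matrix and then for a row-repeated (rank-degenerate) matrix of the kind that arises in GQA. The helper names, the sample count, the width `n = 256`, and the repetition count `r = 4` are illustrative assumptions and are not taken from the paper's experiments.

```python
import numpy as np


def expected_operator_norm(A, num_samples=2000, seed=0):
    """Monte Carlo estimate of E_x[ ||A x||_2 / ||x||_2 ] with x ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((A.shape[1], num_samples))  # one sample per column
    ratios = np.linalg.norm(A @ X, axis=0) / np.linalg.norm(X, axis=0)
    return float(ratios.mean())


def spectral_norm(A):
    """Largest singular value of A."""
    return float(np.linalg.norm(A, ord=2))


# Illustrative sizes (assumptions, not the paper's configuration).
n, r = 256, 4
sigma = n ** -0.5
rng = np.random.default_rng(1)

W_full = sigma * rng.standard_normal((n, n))        # full-rank square matrix
W_small = sigma * rng.standard_normal((n // r, n))  # one shared block
W_concat = np.tile(W_small, (r, 1))                 # block repeated r times, rank <= n / r

for name, W in [("full rank", W_full), ("repeated", W_concat)]:
    print(f"{name}: spectral={spectral_norm(W):.2f}, "
          f"expected={expected_operator_norm(W):.2f}")
```

With entries of standard deviation $n^{-1/2}$, both expected operator norms should come out close to 1, while the spectral norm of the repeated matrix is noticeably larger than that of the square matrix. This is the kind of gap that Figure 2 illustrates.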
Operator-Norm Focused Feature Learning: Yang et al. ([2023a](https://arxiv.org/html/2605.15290#bib.bib47)) show that constraining the spectral norm of the weight matrices implies feature learning in the sense of Yang and Hu ([2021](https://arxiv.org/html/2605.15290#bib.bib44)), where feature learning is defined to occur when

$$\left\| h^{\ell}_{0} \right\|_{2} = \Theta(\sqrt{n}), \qquad \left\| \Delta h^{\ell}_{t} \right\|_{2} = \Theta(\sqrt{n}), \tag{3}$$

for all pre-activations $h^{\ell}$. In particular, Yang et al. ([2023a](https://arxiv.org/html/2605.15290#bib.bib47)) prove that enforcing the condition in equation [1](https://arxiv.org/html/2605.15290#S3.E1) on spectral norms implies equation [3](https://arxiv.org/html/2605.15290#S3.E3). However, the converse does not hold: feature learning in the sense of equation [3](https://arxiv.org/html/2605.15290#S3.E3) may still occur even if the weight matrices do not scale according to equation [1](https://arxiv.org/html/2605.15290#S3.E1).
Consider a hidden layer $h(x) = \bm{W}x$ with trainable weights $\bm{W} \in \mathbb{R}^{n \times n}$ and an additional scaling parameter, the number of layers $L > 1$ independent of $n$, for which we want to ensure feature learning as $L \rightarrow \infty$. Under proper initialization, i.e. $\left\| \bm{W}_{0} \right\| = \Theta(1)$, we have $\left\| h(x) \right\|_{2} = \Theta(\sqrt{n})$ for $x \in \mathbb{R}^{n}$ with $\left\| x \right\|_{2} = \Theta(\sqrt{n})$, as required by feature learning. Now suppose the learning rate is set incorrectly, and $\left\| \Delta\bm{W}_{t} \right\| = \Theta(L^{-\alpha})$ for some $\alpha > 0$. Then the update to the pre-activations takes the form

$$\Delta h_{t} = \bm{W}_{t}x_{t} - \bm{W}_{t-1}x_{t-1} = \Delta\bm{W}_{t}x_{t} + \bm{W}_{t-1}\Delta x_{t}.$$

Assuming that these terms do not exactly cancel, and noting that $\left\| \Delta x_{t} \right\|_{2} = \Theta(\sqrt{n})$, we have that

$$\left\| \Delta h_{t} \right\|_{2} = \Theta\left(\sqrt{n}(1 + L^{-\alpha})\right) = \Theta(\sqrt{n}),$$

and thus this layer satisfies feature learning in the sense of equation [3](https://arxiv.org/html/2605.15290#S3.E3). For GQA, this precise situation arises, and the subtle failure of the terms $\Delta h_{t}$ to properly scale leads to a failure of learning rate transfer (see Figures [1](https://arxiv.org/html/2605.15290#S1.F1) and [5](https://arxiv.org/html/2605.15290#S4.F5)).
This analysis shows that the spectral condition of equation [1](https://arxiv.org/html/2605.15290#S3.E1) is a stronger notion of feature learning than equation [3](https://arxiv.org/html/2605.15290#S3.E3), and we propose using it as the definition of feature learning. This perspective has beneficial practical consequences. When doing coordinate checking to validate a μP implementation as shown in Yang et al. ([2022](https://arxiv.org/html/2605.15290#bib.bib45)), we found that directly analyzing the weight matrices proves more effective than analyzing only the activations (see Figure [4](https://arxiv.org/html/2605.15290#S4.F4)). This point is discussed further below.
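One way to make a weight-level check concrete is sketched below: for each layer, compare the spectral norms of $\bm{W}_0$ and $\Delta\bm{W}$ against the target $\sqrt{n_\ell/m_\ell}$ from equation 1 at several widths. The function `spectral_coordinate_check`, the dictionary layout, and the synthetic sign-based update standing in for an Adam step are all assumptions made for illustration; this is not the coordinate-check code used in the paper.

```python
import numpy as np


def spectral_coordinate_check(w_init, w_after):
    """Per-layer spectral norms of W_0 and Delta W, divided by sqrt(n_out / n_in).

    Under the spectral condition, both ratios should stay Theta(1) as width grows.
    """
    report = {}
    for name, w0 in w_init.items():
        n_out, n_in = w0.shape
        target = np.sqrt(n_out / n_in)
        delta = w_after[name] - w0
        report[name] = {
            "init": float(np.linalg.norm(w0, ord=2) / target),
            "update": float(np.linalg.norm(delta, ord=2) / target),
        }
    return report


# Toy sweep over widths with a synthetic update (assumption: a rank-one
# gradient passed through a sign function, mimicking an Adam-like step).
eta0 = 0.1
rng = np.random.default_rng(0)
results = {}
for n in (128, 256, 512):
    w0 = rng.standard_normal((n, n)) / np.sqrt(n)
    grad = np.outer(rng.standard_normal(n), rng.standard_normal(n))
    step = (eta0 / n) * np.sign(grad)  # hidden-layer learning rate eta0 / n
    results[n] = spectral_coordinate_check({"hidden": w0}, {"hidden": w0 - step})

for n, rep in results.items():
    print(n, {k: round(v, 3) for k, v in rep["hidden"].items()})
```

In this toy setting the normalized update norm stays at `eta0` for every width, which is the flat behaviour a spectral coordinate check looks for. A mis-scaled learning rate would instead show up as a trend in the `update` column across widths.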
A Functional Analytic View of Layer-Wise Computation: Modern machine learning architectures consist of more than dense feed-forward units. Thus, we propose focusing on the computational units of the network rather than specifically focusing on matrices. In the case of dense feed-forward layers, these notions coincide. But for residual layers our perspective offers a more unifying approach. Concretely, we regard a neural network not only as a compositional sequence of matrix multiplications, but as a compositional sequence of abstract, generally non-linear mappings $\varphi^{\ell}: \mathbb{R}^{m} \rightarrow \mathbb{R}^{n}$.
We suggest that the first part of the spectral condition in equation [1](https://arxiv.org/html/2605.15290#S3.E1) should be applied to each compositional unit, rather than the matrices themselves. Starting with the end-to-end computation of the network, we recursively apply this condition to all mappings $\varphi^{\ell}$. In conjunction with requiring that all trainable parameters satisfy both parts of equation [1](https://arxiv.org/html/2605.15290#S3.E1), this leads to a unified treatment of residual layers which we discuss below.
Table 1: Summary of the parameterization of Transformers with Grouped-Query Attention (GQA), where $n$ denotes the input dimension and $r$ is the number of key-value head repetitions. Modifications specific to GQA are highlighted in blue. The derivations of learning rate and weight decay follow the AdamW implementation in PyTorch.

### 3.1 Weight Decay
Weight decay is commonly applied in deep learning to stabilize model training dynamics (Loshchilov and Hutter, [2017](https://arxiv.org/html/2605.15290#bib.bib22); Andriushchenko et al., [2023](https://arxiv.org/html/2605.15290#bib.bib2)). For concreteness, we focus on AdamW (Loshchilov and Hutter, [2017](https://arxiv.org/html/2605.15290#bib.bib22)) in this section, although our framework extends well to other optimizers with weight decay, such as Muon (Jordan et al., [2024](https://arxiv.org/html/2605.15290#bib.bib23)). AdamW modifies the Adam weight update equation [12](https://arxiv.org/html/2605.15290#A1.E12) by including a weight decay term with the associated weight decay hyperparameter.²

² We focus on coupled weight decay, which is the type of weight decay included in PyTorch (Schaipp, [2024](https://arxiv.org/html/2605.15290#bib.bib30)). However, the weight decay introduced in Loshchilov and Hutter ([2017](https://arxiv.org/html/2605.15290#bib.bib22)) is decoupled and given by $\Delta\bm{W}_{t} = -\lambda\,\bm{W}_{t} - \eta\hat{\bm{r}}_{t}$. Our results still apply in this case and prescribe the scaling $\lambda = \lambda_{0}$, where $\lambda_{0}$ is the base model weight decay, to ensure the terms all have the same size in norm. In other words, when using decoupled weight decay, the base weight decay term should not scale with model size. See also Dey et al. ([2025](https://arxiv.org/html/2605.15290#bib.bib12)).
We begin by defining the update rule for AdamW. The total parameter update $\Delta\bm{W}_{t}$ is composed of two distinct terms: a weight decay term and the standard Adam gradient-based update $\hat{\bm{r}}_{t}$:

$$\Delta\bm{W}_{t} = -\lambda\eta\,\bm{W}_{t} - \eta\hat{\bm{r}}_{t}. \tag{4}$$

For the learning dynamics to remain consistent (transferable) as we scale the network width $n$, the spectral norms of these two contributions must scale identically. Specifically, the weight decay magnitude must match the gradient update magnitude, and both must remain stable relative to the weights themselves. Mathematically, this balance condition is expressed as:

$$\left\| \Delta\bm{W}_{t} \right\| = \Theta\left(\lambda\eta\left\| \bm{W}_{t} \right\|\right) = \Theta\left(\eta\left\| \hat{\bm{r}}_{t} \right\|\right) = \Theta(1).$$

Given that $\left\| \Delta\bm{W}_{t} \right\| \approx \left\| \bm{W}_{t} \right\|$ in the feature learning limit, it follows that the weight decay coefficient must satisfy $\lambda\eta = \Theta(1)$. Recall that μP prescribes specific learning rates $\eta$ depending on the layer type: $\eta = \Theta(1)$ for input layers, and $\eta = \Theta(1/n)$ for hidden and output layers. To maintain the balance described above, the base weight decay $\lambda_{0}$ must be scaled as follows, where superscripts index the layer. For input layers, since $\eta = \Theta(1)$, we require $\lambda^{0} = \Theta(1)$. For the hidden and output layers, since $\eta = \Theta(1/n)$, we require $\lambda^{\ell} = \Theta(n)$.
Let us analyze the dynamics of a hidden layer where we introduce a scaling error $\delta > 0$. Suppose we set $\lambda^{\ell} = \lambda_{0}n^{1+\delta}$. To maintain a stable update size, the learning rate needs to compensate, scaling as $\eta = \Theta(n^{-1-\delta})$. As $n \rightarrow \infty$, $\eta\hat{\bm{r}}_{t}$ becomes negligible and the weight decay term dominates: $\Delta\bm{W}_{t} \approx -\eta_{0}\lambda_{0}\,\bm{W}_{t}$. Consequently, the model ignores the data entirely, and the weights simply decay toward $\bm{0}$ without learning features. Conversely, suppose we set $\lambda = \lambda_{0}n^{1-\delta}$. Then the standard μP learning rate $\eta = \Theta(n^{-1})$ is large relative to the decay strength: the product $\lambda\eta = \Theta(n^{-\delta})$ vanishes as $n \rightarrow \infty$, so that $\Delta\bm{W}_{t} \approx -\eta\hat{\bm{r}}_{t}$. The weight decay is effectively ignored, and the algorithm collapses back to standard Adam, losing the regularization benefits of AdamW. Thus, the scaling $\lambda = \Theta(n)$ maintains the necessary equilibrium between regularization and feature learning at large widths.
Recent studies by Wang and Aitchison ([2024](https://arxiv.org/html/2605.15290#bib.bib39)) and Dey et al. ([2025](https://arxiv.org/html/2605.15290#bib.bib12)) have derived a similar relation between learning rate and weight decay through different methods. We experimentally validate this relationship in Figure [7](https://arxiv.org/html/2605.15290#A2.F7).
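The prescription above can be sketched as a small helper that returns per-layer AdamW settings for coupled weight decay. The function name `adamw_mup_hparams`, the use of a width multiplier relative to a proxy `base_width`, and the example values are assumptions for illustration; only the $\Theta(\cdot)$ scalings in $n$ come from the derivation.

```python
def adamw_mup_hparams(layer_type, width, base_width, eta0, lambda0):
    """Scale learning rate and coupled weight decay so that lr * wd is width-independent.

    Input layers:           eta = Theta(1),     lambda = Theta(1)
    Hidden / output layers: eta = Theta(1 / n), lambda = Theta(n)
    """
    mult = width / base_width  # assumed convention: scale relative to a proxy width
    if layer_type == "input":
        return {"lr": eta0, "weight_decay": lambda0}
    if layer_type in ("hidden", "output"):
        return {"lr": eta0 / mult, "weight_decay": lambda0 * mult}
    raise ValueError(f"unknown layer type: {layer_type!r}")


base = adamw_mup_hparams("hidden", width=256, base_width=256, eta0=1e-2, lambda0=0.1)
wide = adamw_mup_hparams("hidden", width=1024, base_width=256, eta0=1e-2, lambda0=0.1)
print(base, wide)
```

The product of learning rate and weight decay is the same for the proxy and the wider model, which is the $\lambda\eta = \Theta(1)$ condition. For decoupled weight decay, the footnote's prescription would instead keep `weight_decay` fixed at `lambda0`.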
### 3.2 Grouped Query Attention
Grouped query attention (GQA) reduces computational cost by repeating the key and value heads in the Transformer (Ainslie et al., [2023](https://arxiv.org/html/2605.15290#bib.bib1)). In a standard multi-headed attention layer, the key and value projections are given by weights $\bm{W}_{K} \in \mathbb{R}^{n \times n}$ and $\bm{W}_{V} \in \mathbb{R}^{n \times n}$, where $n$ is the embedding dimension. These matrices are partitioned into $H$ heads of size $n/H$ each and the $i$-th head is computed as $k_{i} = (\bm{W}_{K}x)_{i},\, v_{i} = (\bm{W}_{V}x)_{i}$. In GQA, the number of parameters is reduced by using only $p$ distinct key/value heads, where $H/p = r$ denotes the number of repetitions of each key/value head group. We then define matrices $\bm{W}_{p,K}, \bm{W}_{p,V} \in \mathbb{R}^{\frac{n}{r} \times n}$, and construct the full key and value weights by concatenating along the output dimension

$$\bm{W}_{K}^{\oplus} = \bigoplus_{m=1}^{r}\bm{W}_{p,K}, \quad \bm{W}_{V}^{\oplus} = \bigoplus_{m=1}^{r}\bm{W}_{p,V}, \tag{5}$$

where $\oplus$ denotes concatenation along the first dimension.³

³ Note that concatenation and matrix multiplication commute: if $\bm{A} \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^{n}$, we have $\bm{A}^{\oplus}x = (\bm{A}x)^{\oplus}$, which follows directly by writing the product in its index form.
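The claim that concatenation commutes with matrix multiplication is easy to check numerically. In this sketch `np.tile` plays the role of $\oplus$ (row-wise repetition of a shared block); the sizes and variable names are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 4
W = rng.standard_normal((n // r, n))   # shared key/value block of shape (n / r, n)
x = rng.standard_normal(n)

W_concat = np.tile(W, (r, 1))          # W^⊕: r copies stacked along the output dimension
lhs = W_concat @ x                     # W^⊕ x
rhs = np.tile(W @ x, r)                # (W x)^⊕

print(W_concat.shape, np.allclose(lhs, rhs))
```

In an attention implementation this is why the shared key/value projection can be applied once and its output repeated across query-head groups, rather than materializing the full $n \times n$ matrix.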
Consider the initial weight matrix $\bm{W}_{0}$ for either the key or value projections, and its concatenated version $\bm{W}^{\oplus}_{0}$, and let $\bm{W}_{t}$ and $\bm{W}_{t}^{\oplus}$ denote their corresponding weight updates. To begin, applying the law of large numbers and the central limit theorem to equation [2](https://arxiv.org/html/2605.15290#S3.E2), we obtain

$$\left\| \bm{W}^{\oplus}_{0} \right\|_{\mathbb{E}} = \mathbb{E}_{x}\left[ \Theta\left( \frac{\left( \sum\limits_{k=1}^{n} \left( \sum\limits_{j=1}^{n} (W_{0}^{\oplus})_{kj}x_{j} \right)^{2} \right)^{\frac{1}{2}}}{\left( \sum\limits_{j=1}^{n} x_{j}^{2} \right)^{\frac{1}{2}}} \right) \right] = \mathbb{E}_{x}\left[ \Theta\left( \frac{\left( r\sum\limits_{k=1}^{\frac{n}{r}} \left( \sum\limits_{j=1}^{n} (W_{0})_{kj}x_{j} \right)^{2} \right)^{\frac{1}{2}}}{\left( \sum\limits_{j=1}^{n} x_{j}^{2} \right)^{\frac{1}{2}}} \right) \right].$$

Therefore, writing $\sigma^{2}$ for the variance of the i.i.d. entries of $\bm{W}_{0}$,

$$\left\| \bm{W}^{\oplus}_{0} \right\|_{\mathbb{E}} = \Theta\left( \frac{\left( r \times \frac{n}{r} \times n \times \sigma^{2} \right)^{\frac{1}{2}}}{n^{\frac{1}{2}}} \right) = \Theta\left( \sigma n^{\frac{1}{2}} \right).$$
To satisfy the spectral condition in equation [1](https://arxiv.org/html/2605.15290#S3.E1), we require $\sigma = \Theta(n^{-1/2})$. Importantly, this corresponds to the expected operator norm for $\bm{W}^{\oplus}$, not the spectral norm of the constituent matrix $\bm{W}$. Because $\bm{W}_{0}$ has full rank with probability 1, its spectral norm can be computed directly using Bai-Yin (Bai and Yin, [1993](https://arxiv.org/html/2605.15290#bib.bib4); Yin et al., [1988](https://arxiv.org/html/2605.15290#bib.bib48))

$$\left\| \bm{W}_{0} \right\| = \Theta\left( \sigma\left( \sqrt{n} + \frac{\sqrt{n}}{\sqrt{r}} \right) \right) = \Theta\left( \frac{1+\sqrt{r}}{\sqrt{r}} \right). \tag{6}$$

Moreover, in terms of spectral norms, $\left\| \bm{W}_{0}^{\oplus} \right\| = \sqrt{r}\left\| \bm{W}_{0} \right\|$ (Lemma [1](https://arxiv.org/html/2605.15290#Thmlemma1)), so that the spectral norm and the expected operator norm do not agree in this setting (see Figure [2](https://arxiv.org/html/2605.15290#S3.F2)).
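For intuition, the identity cited from Lemma 1 and the resulting mismatch can be worked out directly. The short derivation below is a sketch that assumes the concatenation repeats the same block $\bm{W}_0$ exactly $r$ times, and it takes $\sigma = n^{-1/2}$ with unit constants inside the $\Theta(\cdot)$ of equation 6 purely for illustration; it is not the paper's proof.

```latex
\begin{aligned}
(\bm{W}_0^{\oplus})^{\top}\bm{W}_0^{\oplus}
  &= \sum_{m=1}^{r} \bm{W}_0^{\top}\bm{W}_0 = r\,\bm{W}_0^{\top}\bm{W}_0
  \;\Longrightarrow\;
  \left\|\bm{W}_0^{\oplus}\right\| = \sqrt{r}\,\left\|\bm{W}_0\right\|, \\
\left\|\bm{W}_0\right\| &\approx \sigma\Bigl(\sqrt{n} + \sqrt{n/r}\Bigr) = 1 + \frac{1}{\sqrt{r}},
\qquad
\left\|\bm{W}_0^{\oplus}\right\| \approx \sqrt{r} + 1, \\
\left\|\bm{W}_0^{\oplus}\right\|_{\mathbb{E}} &= \Theta\bigl(\sigma\sqrt{n}\bigr) = \Theta(1).
\end{aligned}
```

For example, with $r = 4$ the spectral norm of the concatenated matrix is roughly $3$ while its expected operator norm stays of order one, so the gap between the two grows like $\sqrt{r}$.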
The computation in equation [6](https://arxiv.org/html/2605.15290#S3.E6) is critical for determining the required learning rate, since we require $\left\| \Delta\bm{W}_{t} \right\| = \Theta(\left\| \bm{W}_{0} \right\|)$. To this end, we compute $\eta$ in the usual manner. Assuming the use of the Adam optimizer with update step $\hat{\bm{r}}_{t}$, we have

$$\left\| \Delta\bm{W}_{t} \right\| = \eta\left\| \hat{\bm{r}}_{t} \right\| = \Theta\left( \frac{\eta n}{\sqrt{r}} \right) = \Theta\left( \frac{1+\sqrt{r}}{\sqrt{r}} \right).$$

From this we easily deduce that $\eta = \Theta\left( \frac{1+\sqrt{r}}{n} \right)$. We normalize by a factor of two to ensure that when $r=1$ our scalings agree with the usual full-rank hidden layer scalings:

$$\sigma = \frac{1}{\sqrt{n}}\sigma_{0}, \qquad \eta = \frac{1+\sqrt{r}}{2n}\eta_{0}. \tag{7}$$
Through the above derivations, we arrive at the parameterization of Transformers with GQA as summarized in Table [1](https://arxiv.org/html/2605.15290#S3.T1).
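Equation 7 translates into a short rule for the key and value projections. The helper below is a minimal sketch; the function name `gqa_mup_kv_scaling` and its argument names are hypothetical, with `sigma0` and `eta0` standing for the base values $\sigma_0$ and $\eta_0$ that appear in equation 7, and the width `n=3072` chosen only as an example.

```python
import math


def gqa_mup_kv_scaling(n, r, sigma0=1.0, eta0=1.0):
    """Init std and Adam learning rate for GQA key/value weights, per equation 7.

    n: embedding dimension; r: number of key/value head repetitions (r = 1 means no GQA).
    """
    if r < 1 or n % r != 0:
        raise ValueError("r must be a positive divisor of n")
    sigma = sigma0 / math.sqrt(n)
    eta = eta0 * (1.0 + math.sqrt(r)) / (2.0 * n)
    return sigma, eta


for r in (1, 4, 8, 12):
    sigma, eta = gqa_mup_kv_scaling(n=3072, r=r, eta0=1e-2)
    print(f"r={r:2d}  sigma={sigma:.5f}  eta={eta:.3e}")
```

At $r = 1$ this reduces to the usual hidden-layer rule $\eta = \eta_0/n$, and the learning rate grows by the factor $(1+\sqrt{r})/2$ as more query heads share each key/value head, while the initialization scale is unchanged.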
Our GQA\-μ\\muP can adapt to any group size by following equation[7](https://arxiv.org/html/2605.15290#S3.E7)\. Empirically, we evaluater∈\{1,2,3,4,5,6,12\}r\\in\\\{1,2,3,4,5,6,12\\\}\. These values cover commonly used settings, includingr=4r=4for Llama\-3 8B and Mistral 7B,r=8r=8for Llama\-2/3 70B and Qwen\-2\.5 72B, andr=12r=12for Cohere Command R\+ \(e\.g\., 96 query heads and 8 key/value heads\)\.
To clarify, our mathematical framework is asymptotically valid with respect to $r$. In practice, however, $r$ is not scaled to infinity in the way that the network width $n$ is. Since $n \geq r$, the extreme limit $r = n$ would force the model to have single-parameter attention heads, so this particular infinite limit is not a realistic scenario. The typical range of $r$ in practice usually does not exceed 16, which may not be asymptotically large. We emphasize that our primary objective in scaling with $r$ is empirical: to prevent learning-rate drift when changing the number of repetitions $r$. The validity of our derivations in the asymptotic limit is a mathematical consequence of the theory rather than the practical setting we target.
### 3.3 Complete-P Depth Scaling
We now turn to the depth scaling of residual networks. Complete-P (Dey et al., [2025](https://arxiv.org/html/2605.15290#bib.bib12)) derives depth scalings by relying on an additional desideratum of “no lazy learning.” We show that this additional assumption is not needed: applying the standard $\mu$P desideratum to the spectral norm automatically rules out lazy learning, and our framework recovers exactly the same scaling as Complete-P.
Consider the stacked hidden layers, where the output of the $\ell$-th layer is given by the residual update:

$$G^{\ell}(x) = x + \beta g^{\ell}(x) \quad \text{for } 1 \leq \ell \leq L. \tag{8}$$

Here, $x \in \mathbb{R}^{n}$ is the input, $g^{\ell}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}$ represents the residual branch, and $\beta$ is a scaling constant independent of the layer index $\ell$. To formalize our analysis and address the network dynamics rigorously in the large-width ($n \to \infty$) and large-depth ($L \to \infty$) limits, we state the following assumptions explicitly:
- Assumption 1 (Input Scaling): The network inputs satisfy $\|x\|_2 = \Theta(\sqrt{n})$. This ensures that the base compositional units satisfy the spectral condition in equation [1](https://arxiv.org/html/2605.15290#S3.E1).
- Assumption 2 (Stable Operator Norms): $\|g^{\ell}\| = \Theta(1)$, $\|\Delta g^{\ell}\| = \Theta(1)$, $g^{\ell}$ is full-rank, and $\beta\|g^{\ell}\| < 1$.
- Assumption 3 (No Exact Cancellation): This is a standard assumption in the $\mu$P and Tensor Programs literature (Yang and Hu, [2021](https://arxiv.org/html/2605.15290#bib.bib44)). For more intuition about this, please refer to Appendix [A.3](https://arxiv.org/html/2605.15290#A1.SS3).
Let $\overline{G}_t^{\ell} = \bigcirc_{k=1}^{\ell} G_t^{k}$ represent the composition of the first $\ell$ layers at training step $t$. We can express the network recursively:

$$\overline{G}_t^{\ell} = \overline{G}_t^{\ell-1} + \beta g_t^{\ell} \circ \overline{G}_t^{\ell-1}. \tag{9}$$

Under Assumption 3, taking the norm yields the following asymptotic relation:

$$\|\overline{G}_t^{\ell}\| = \Theta\left(\|\overline{G}_t^{\ell-1}\| + \beta\,\|g_t^{\ell}\|\,\|\overline{G}_t^{\ell-1}\|\right). \tag{10}$$

Because Assumption 2 dictates $\|g_t^{\ell}\| = \Theta(1)$, every layer effectively adds a proportional factor of $\beta$ to the norm estimate. We can formalize the resulting depth-dependent bound via a recursive induction argument. For the base case, at the input, $\|\overline{G}_t^{0}\| = \|I\| = 1 \leq \Theta(1)$. Assume the bound holds for layer $\ell$, such that $\|\overline{G}_t^{\ell}\| \leq \Theta(1 + \ell\beta)$. For layer $\ell+1$, we substitute the hypothesis into our asymptotic relation:

$$\|\overline{G}_t^{\ell+1}\| \leq \Theta\left(\|\overline{G}_t^{\ell}\| + \beta\,\|\overline{G}_t^{\ell}\|\right) \leq \Theta\left((1 + \ell\beta) + \beta(1 + \ell\beta)\right) = \Theta\left(1 + \ell\beta + \beta + \ell\beta^{2}\right).$$

Because we are analyzing regimes where $\beta < 1$, the higher-order term $\ell\beta^{2}$ is dominated by the linear terms ($\beta^{2} < \beta$). Thus, the expression simplifies to:

$$\|\overline{G}_t^{\ell+1}\| \leq \Theta(1 + (\ell+1)\beta).$$

By induction, evaluating this at the final layer yields the total forward pass bound:

$$\|\overline{G}_t^{L}\| = \Theta(1 + L\beta). \tag{11}$$
To maintain stable representations and avoid exploding or vanishing signals, we require the final norm to be independent of depth, i.e., $\Theta(1 + L\beta) = \Theta(1)$. Setting this equality forces $L\beta = \Theta(1)$, which directly implies that $\beta = \Theta(L^{-1})$.
We single out a minor point of confusion: for a fixed, finite depth network (e.g., $L=2$), a constant value such as $\beta = 1/2$ naturally satisfies this requirement because $1/2 = O(1/L)$ for that specific $L$. However, when considering the asymptotic limit where $L \to \infty$, the parameter $\beta$ cannot be a static constant larger than $1/L$; it must shrink in exact proportion to the total depth to prevent the $L\beta$ term from diverging. Conversely, choosing a faster shrinking exponent, such as $\beta = \Theta(L^{-\alpha})$ for $\alpha > 1$ (Yang et al., [2023b](https://arxiv.org/html/2605.15290#bib.bib46)), causes the residual branches to vanish entirely, degenerating into trivial dynamics. Therefore, $\beta = \Theta(L^{-1})$ guarantees stability without sacrificing expressivity, arriving at the exact same bound as Dey et al. ([2025](https://arxiv.org/html/2605.15290#bib.bib12)) through an alternate derivation.
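The effect of the choice of $\beta$ can be illustrated with a toy linear residual stack. This is a minimal sketch under strong simplifying assumptions: each residual branch $g^{\ell}$ is replaced by a random matrix rescaled to unit spectral norm, and the function name, width, and seed are hypothetical. It compares $\beta = 1/L$ with the constant $\beta = 1/2$ mentioned above.

```python
import numpy as np

def residual_stack_norm(depth: int, beta: float, n: int = 64, seed: int = 0) -> float:
    """Spectral norm of prod_l (I + beta * W_l), with every ||W_l|| = 1.

    The linear maps W_l are stand-ins for the residual branches g^l (an assumption).
    """
    rng = np.random.default_rng(seed)
    total = np.eye(n)
    for _ in range(depth):
        w = rng.normal(size=(n, n))
        w /= np.linalg.norm(w, 2)               # unit spectral norm, so ||g^l|| = Theta(1)
        total = (np.eye(n) + beta * w) @ total  # one residual update G = x + beta * g(x)
    return float(np.linalg.norm(total, 2))

# beta = 1/L keeps the norm bounded in depth; a constant beta lets it grow with depth.
for depth in (4, 16, 64):
    print(depth,
          residual_stack_norm(depth, 1.0 / depth),
          residual_stack_norm(depth, 0.5))
```

In this linear setting the $\beta = 1/L$ product is bounded above by $(1 + 1/L)^{L} < e$ for every depth, which is the toy analog of $L\beta = \Theta(1)$.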
## 4 Empirical Results
In this section, we present our empirical results\. Details of model configurations and experimental setups are provided in Appendix[B\.1](https://arxiv.org/html/2605.15290#A2.SS1)\.
Figure 3: Voronoi interpolation for random sweeps over both learning rate and weight decay. The top row is standard parameterization. The middle row is the vanilla Adam-$\mu$P implementation suggested in Yang et al. ([2022](https://arxiv.org/html/2605.15290#bib.bib45)). The bottom row is our proposed implementation. Each column corresponds to a different size model, with the number of parameters increasing from left to right. For each model and implementation, we plot the best trial. Hidden dimension, depth, batch size, and training iterations are all scaled. Lighter colors indicate lower loss, darker colors indicate higher loss. The red crosses mark the average (learning rate, weight decay) pair, where each coordinate is averaged over the model sizes, while the black crosses are the optimal pair for each experiment.

Figure 4: Coordinate checks in the style of Yang et al. ([2022](https://arxiv.org/html/2605.15290#bib.bib45)) for the activation update norms $\|\Delta h^{\ell}_t\|_2$ under the vanilla Adam-$\mu$P implementation. Together, these coordinate checks indicate that the implementation is correct, and that feature learning and thus learning rate transfer should occur. However, Figure [1](https://arxiv.org/html/2605.15290#S1.F1) shows that learning rate transfer for this implementation does not occur.

Figure 5: Coordinate checks for $\|\Delta\bm{W}\|$ under the vanilla Adam-$\mu$P scalings. The model fails the coordinate checks when evaluated using the spectral feature learning condition in equation [1](https://arxiv.org/html/2605.15290#S3.E1). However, as shown in Figure [4](https://arxiv.org/html/2605.15290#S4.F4), it does pass when evaluated under Yang’s definition of feature learning in equation [3](https://arxiv.org/html/2605.15290#S3.E3).

Figure 6: Coordinate checks for $\|\Delta\bm{W}\|$ under our proposed GQA scalings. The model has eight hidden layers. Additional experimental details are provided in Appendix [B.1.1](https://arxiv.org/html/2605.15290#A2.SS1.SSS1).

##### Coordinate-Checks Demonstrate the Necessity for Spectral Feature Learning:
As discussed in Section [3](https://arxiv.org/html/2605.15290#S3), validating feature learning by measuring the norms of $h^{\ell}$ and $\Delta h_t^{\ell}$ can be misleading. Figure [4](https://arxiv.org/html/2605.15290#S4.F4) plots $\|\Delta h^{\ell}_t\|$ for the vanilla Adam-$\mu$P implementation. This coordinate check would suggest transferable learning rates, yet empirical results show otherwise (see Figure [1](https://arxiv.org/html/2605.15290#S1.F1), middle). By contrast, when we instead examine the spectral norm conditions in equation [1](https://arxiv.org/html/2605.15290#S3.E1) (Figure [5](https://arxiv.org/html/2605.15290#S4.F5)), the model fails the coordinate check: the update norms show a clear non-linear dependence on the number of KV heads of the model, which explains the lack of transferable dynamics.
##### Coordinate-Checks Demonstrate a Qualitative Dependency on $r$:
Because the vanilla Adam-$\mu$P implementation and our implementation share the same initialization scaling, we do not compare $\|\bm{W}\|$ directly. Instead, Figure [5](https://arxiv.org/html/2605.15290#S4.F5) presents the coordinate checks for the vanilla Adam-$\mu$P implementation, while Figure [6](https://arxiv.org/html/2605.15290#S4.F6) shows the corresponding coordinate checks for our proposed GQA scaling. Our method passes the coordinate check, thereby enabling $\mu$-transfer of learning rate. By contrast, the vanilla Adam-$\mu$P implementation shows a persistent dependency on the number of KV heads, explaining why the learning rate does not transfer in this case.
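A spectral coordinate check of this kind amounts to measuring $\|\Delta\bm{W}\|$ for each weight matrix at several widths. The sketch below is illustrative only: `update_spectral_norms` and `toy_hidden_update` are hypothetical helpers, and the rank-one sign matrix scaled by $\eta = 1/n$ is a toy stand-in for an Adam step on a hidden layer, not the authors' experimental setup.

```python
import numpy as np

def update_spectral_norms(before: dict, after: dict) -> dict:
    """Spectral norm of W_after - W_before for every named weight matrix."""
    return {name: float(np.linalg.norm(after[name] - before[name], 2))
            for name in before}

def toy_hidden_update(width: int, seed: int = 0) -> np.ndarray:
    """Toy update: rank-one matrix with +/-1 entries, scaled by eta = 1/width."""
    rng = np.random.default_rng(seed)
    u = rng.choice([-1.0, 1.0], size=width)
    v = rng.choice([-1.0, 1.0], size=width)
    return np.outer(u, v) / width  # entries of size 1/width, spectral norm exactly 1

# The check passes if the reported norm does not drift as the width changes.
for width in (64, 128, 256):
    w0 = np.zeros((width, width))
    w1 = w0 + toy_hidden_update(width)
    print(width, update_spectral_norms({"hidden": w0}, {"hidden": w1}))
```

In a real check the `before` and `after` dictionaries would hold weight snapshots taken around a training step, and the same measurement would be repeated across the number of KV heads.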
##### Learning Rate Transfer for Grouped Query Attention:
We perform an ablation study comparing the standard parameterization, the vanilla Adam-$\mu$P implementation (where the KV heads are initialized as hidden layers), and our proposed GQA-$\mu$P. The results of this ablation study are summarized in Figure [1](https://arxiv.org/html/2605.15290#S1.F1). We observe that the vanilla Adam-$\mu$P scaling does not account for the shift induced by using GQA, whereas our proposed scaling brings the optimal learning rates into a much narrower region. Noise inherent to GQA training is already evident in these plots and becomes more pronounced as the number of KV heads decreases. This noise is apparent in both the coordinate checks from Figure [6](https://arxiv.org/html/2605.15290#S4.F6) and in Figure [2](https://arxiv.org/html/2605.15290#S3.F2). We provide an explanation for this phenomenon in the following paragraph.
##### Expected Variance in GQA Transfer:
From the perspective of $\mu$P, the nature of GQA introduces a dichotomy: one may achieve feature learning in the sense of equation [1](https://arxiv.org/html/2605.15290#S3.E1) and thereby obtain learning rate transfer, but at the cost of increasingly noisy dynamics as the number of KV heads decreases; alternatively, one may constrain the variance as we decrease the number of KV heads to stabilize the training, but this leads to a shift in optimal learning rate. Consequently, we suggest that in scenarios where transferable dynamics are critical, it may be preferable to avoid using GQA altogether.
##### $\mu$P (Mostly) Decouples Coupled Weight Decay
To examine the transferability of optimal learning rate and optimal weight decay across model scales, we run a random search over (learning rate, weight decay) pairs at constant initial standard deviation. We plot our results in Figure [3](https://arxiv.org/html/2605.15290#S4.F3). We note that under the standard parameterization, neither the learning rate nor weight decay transfers, and that the qualitative properties of the Voronoi-interpolated loss landscape change markedly as the model size increases from 26M to 177M non-embedding parameters. By contrast, both the vanilla Adam-$\mu$P implementation and our proposed scaling preserve their qualitative properties across model sizes.
For the experiment in Figure [3](https://arxiv.org/html/2605.15290#S4.F3), we quantify the degree of transfer in Table [6](https://arxiv.org/html/2605.15290#A2.T6). We find that the variance of both the optimal learning rate and the optimal weight decay across model sizes is lower for our implementation than for the vanilla Adam-$\mu$P baseline. This suggests that our proposed implementation enables the transfer of both learning rate and weight decay across model scales, both qualitatively and quantitatively.
Table 2: Variance table comparing our implementations across model sizes for the $\tau_{\text{epoch}}$ experiment from Figure [7](https://arxiv.org/html/2605.15290#A2.F7).

Previous works have argued that the quantity $\tau_{\text{epoch}} = (\lambda_0 \times \eta_0 \times \text{iters})^{-1}$ should transfer instead of weight decay (Wang and Aitchison, [2024](https://arxiv.org/html/2605.15290#bib.bib39); Bergsma et al., [2025](https://arxiv.org/html/2605.15290#bib.bib5)). We found that both weight decay and $\tau_{\text{epoch}}$ transfer in our experimental setting. This is a non-trivial observation since we vary the number of iterations based on the model size. Figure [7](https://arxiv.org/html/2605.15290#A2.F7) presents the analog of our interpolation diagram in Figure [3](https://arxiv.org/html/2605.15290#S4.F3), and Table [2](https://arxiv.org/html/2605.15290#S4.T2) reports the quantitative variance results for $\tau_{\text{epoch}}$. We find that $\tau_{\text{epoch}}$ transfers slightly better than weight decay in our setting.
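For concreteness, the quantity $\tau_{\text{epoch}}$ can be computed directly from its definition. The sketch below uses hypothetical function names and made-up hyperparameter values; the second helper simply inverts the same formula to show which base weight decay a fixed $\tau_{\text{epoch}}$ would imply when the iteration count changes.

```python
def tau_epoch(weight_decay0: float, lr0: float, iters: int) -> float:
    """tau_epoch = (lambda_0 * eta_0 * iters)^(-1), as defined in the text."""
    return 1.0 / (weight_decay0 * lr0 * iters)

def implied_weight_decay(tau: float, lr0: float, iters: int) -> float:
    """Base weight decay lambda_0 that yields a given tau_epoch (inverse of the formula)."""
    return 1.0 / (tau * lr0 * iters)

# Illustrative values only (not taken from the paper's experiments).
tau = tau_epoch(weight_decay0=0.1, lr0=1e-3, iters=10_000)
print(tau)
# Doubling the iteration count at fixed tau_epoch halves the implied weight decay.
print(implied_weight_decay(tau, lr0=1e-3, iters=20_000))
```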
## 5 Conclusions
In this paper, we introduced a novel extension of the spectral $\mu$P framework originally developed by Yang et al. ([2023a](https://arxiv.org/html/2605.15290#bib.bib47)). We apply our framework to rederive the Complete-P weight decay and depth scalings from Dey et al. ([2025](https://arxiv.org/html/2605.15290#bib.bib12)). Additionally, we use our framework to derive, to our knowledge for the first time, the $\mu$P scalings for grouped query attention (Ainslie et al., [2023](https://arxiv.org/html/2605.15290#bib.bib1)). We perform empirical validation in two directions. First, we explore the empirical nature of learning rate transfer for GQA. We find that we can either do noisy learning rate transfer or fail to transfer the learning rate. This dichotomy is a consequence of the competing scalings between the spectral norm and the expected operator norm. Compared to the standard $\mu$P implementation, our method reduces the variance in optimal learning rate during learning rate transfer. Second, we explore the transferability of weight decay across model sizes. We demonstrate that with the standard $\mu$P implementation, we can nearly achieve transfer of weight decay. With our implementation, we are able to get much closer to true transfer across both learning rate and weight decay. Future work could extend these scalings to Mixture-of-Experts or other sparse models.
## References
- J\. Ainslie, J\. Lee\-Thorp, M\. De Jong, Y\. Zemlyanskiy, F\. Lebrón, and S\. Sanghai \(2023\)Gqa: training generalized multi\-query transformer models from multi\-head checkpoints\.arXiv preprint arXiv:2305\.13245\.Cited by:[§1](https://arxiv.org/html/2605.15290#S1.p2.8),[§2](https://arxiv.org/html/2605.15290#S2.p2.1),[§3\.2](https://arxiv.org/html/2605.15290#S3.SS2.p1.10),[§5](https://arxiv.org/html/2605.15290#S5.p1.4)\.
- M\. Andriushchenko, F\. D’Angelo, A\. Varre, and N\. Flammarion \(2023\)Why do we need weight decay in modern deep learning?\.arXiv preprint arXiv:2310\.04415\.Cited by:[§3\.1](https://arxiv.org/html/2605.15290#S3.SS1.p1.1)\.
- Z\. Bai and Y\. Yin \(1993\)Limit of the smallest eigenvalue of a large dimensional sample covariance matrix\.Ann\. Probab21\(3\),pp\. 1275–1294\.Cited by:[§A\.1](https://arxiv.org/html/2605.15290#A1.SS1.p1.7),[§3\.2](https://arxiv.org/html/2605.15290#S3.SS2.p4.4)\.
- S\. Bergsma, N\. Dey, G\. Gosal, G\. Gray, D\. Soboleva, and J\. Hestness \(2025\)Power lines: scaling laws for weight decay and batch size in llm pre\-training\.arXiv preprint arXiv:2505\.13738\.Cited by:[§B\.4](https://arxiv.org/html/2605.15290#A2.SS4.p1.3),[§4](https://arxiv.org/html/2605.15290#S4.SS0.SSS0.Px5.p3.4)\.
- S\. Bergsma, N\. S\. Dey, G\. Gosal, G\. Gray, D\. Soboleva, and J\. Hestness \(2026\)Power lines: scaling laws for weight decay and batch size in LLM pre\-training\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=bFXbLQzRoZ)Cited by:[§B\.1](https://arxiv.org/html/2605.15290#A2.SS1.p3.1)\.
- C\. Blake, D\. Orr, and C\. Luschi \(2023\)Unit scaling: out\-of\-the\-box low\-precision training\.InInternational Conference on Machine Learning,pp\. 2548–2576\.Cited by:[§1](https://arxiv.org/html/2605.15290#S1.p1.3),[§2](https://arxiv.org/html/2605.15290#S2.p4.5)\.
- N\. Dey, G\. Gosal, H\. Khachane, W\. Marshall, R\. Pathria, M\. Tom, J\. Hestness,et al\.\(2023\)Cerebras\-gpt: open compute\-optimal language models trained on the cerebras wafer\-scale cluster\.arXiv preprint arXiv:2304\.03208\.Cited by:[§A\.1](https://arxiv.org/html/2605.15290#A1.SS1.p1.24),[§1](https://arxiv.org/html/2605.15290#S1.p1.3),[§2](https://arxiv.org/html/2605.15290#S2.p3.6)\.
- N\. Dey, B\. C\. Zhang, L\. Noci, M\. Li, B\. Bordelon, S\. Bergsma, C\. Pehlevan, B\. Hanin, and J\. Hestness \(2025\)Don’t be lazy: completep enables compute\-efficient deep transformers\.arXiv preprint arXiv:2505\.01618\.Cited by:[§B\.1\.2](https://arxiv.org/html/2605.15290#A2.SS1.SSS2.p1.6),[§B\.1](https://arxiv.org/html/2605.15290#A2.SS1.p4.1),[§B\.4](https://arxiv.org/html/2605.15290#A2.SS4.p1.3),[§1](https://arxiv.org/html/2605.15290#S1.p1.3),[§2](https://arxiv.org/html/2605.15290#S2.p3.6),[§3\.1](https://arxiv.org/html/2605.15290#S3.SS1.p4.1),[§3\.3](https://arxiv.org/html/2605.15290#S3.SS3.p1.1),[§3\.3](https://arxiv.org/html/2605.15290#S3.SS3.p7.11),[§5](https://arxiv.org/html/2605.15290#S5.p1.4),[footnote 2](https://arxiv.org/html/2605.15290#footnote2)\.
- K\. Everett, L\. Xiao, M\. Wortsman, A\. A\. Alemi, R\. Novak, P\. J\. Liu, I\. Gur, J\. Sohl\-Dickstein, L\. P\. Kaelbling, J\. Lee,et al\.\(2024\)Scaling exponents across parameterizations and optimizers\.arXiv preprint arXiv:2407\.05872\.Cited by:[§A\.1](https://arxiv.org/html/2605.15290#A1.SS1.p1.24)\.
- A\. Gokaslan and V\. Cohen \(2019\)OpenWebText corpus\.Note:[http://Skylion007\.github\.io/OpenWebTextCorpus](http://skylion007.github.io/OpenWebTextCorpus)Cited by:[§B\.1](https://arxiv.org/html/2605.15290#A2.SS1.p6.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§2](https://arxiv.org/html/2605.15290#S2.p2.1)\.
- D\. Hendrycks and K\. Gimpel \(2023\)Gaussian error linear units \(gelus\)\.External Links:1606\.08415,[Link](https://arxiv.org/abs/1606.08415)Cited by:[§B\.1](https://arxiv.org/html/2605.15290#A2.SS1.p1.1)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. de Las Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark,et al\.\(2022\)An empirical analysis of compute\-optimal large language model training\.Advances in neural information processing systems35,pp\. 30016–30030\.Cited by:[§B\.1\.3](https://arxiv.org/html/2605.15290#A2.SS1.SSS3.p1.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de Las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\)Mistral 7b\.ArXivabs/2310\.06825\.External Links:[Link](https://api.semanticscholar.org/CorpusID:263830494)Cited by:[§2](https://arxiv.org/html/2605.15290#S2.p2.1)\.
- K\. Jordan, Y\. Jin, V\. Boza, J\. You, F\. Cesista, L\. Newhouse, and J\. Bernstein \(2024\)Muon: an optimizer for hidden layers in neural networks\.External Links:[Link](https://kellerjordan.github.io/posts/muon/)Cited by:[§3\.1](https://arxiv.org/html/2605.15290#S3.SS1.p1.1)\.
- D\. P\. Kingma and J\. Ba \(2014\)Adam: a method for stochastic optimization\.arXiv preprint arXiv:1412\.6980\.Cited by:[§A\.1](https://arxiv.org/html/2605.15290#A1.SS1.p1.5)\.
- Z\. Liu, L\. Tang, L\. Jin, H\. Li, N\. Ranjan, D\. Fan, S\. Rohatgi, R\. Fan, O\. Pangarkar, H\. Wang,et al\.\(2025\)K2\-v2: a 360\-open, reasoning\-enhanced llm\.arXiv preprint arXiv:2512\.06201\.Cited by:[§2](https://arxiv.org/html/2605.15290#S2.p2.1)\.
- I\. Loshchilov and F\. Hutter \(2017\)Decoupled weight decay regularization\.arXiv preprint arXiv:1711\.05101\.Cited by:[§3\.1](https://arxiv.org/html/2605.15290#S3.SS1.p1.1),[footnote 2](https://arxiv.org/html/2605.15290#footnote2)\.
- B\. Mlodozeniec, P\. Ablin, L\. Béthune, D\. Busbridge, M\. Klein, J\. Ramapuram, and M\. Cuturi \(2025\)Completed hyperparameter transfer across modules, width, depth, batch and duration\.arXiv preprint arXiv:2512\.22382\.Cited by:[§2](https://arxiv.org/html/2605.15290#S2.p3.6)\.
- S\. Narayan, A\. Gupta, M\. Paul, and D\. Blalock \(2025\)μnit Scaling: simple and scalable fp8 llm training\.arXiv preprint arXiv:2502\.05967\.Cited by:[§1](https://arxiv.org/html/2605.15290#S1.p1.3),[§2](https://arxiv.org/html/2605.15290#S2.p4.5)\.
- F\. Schaipp \(2024\)How to jointly tune learning rate and weight decay for AdamW\.Note:[https://fabian\-sp\.github\.io/posts/2024/02/decoupling/](https://fabian-sp.github.io/posts/2024/02/decoupling/)Cited by:[footnote 2](https://arxiv.org/html/2605.15290#footnote2)\.
- X\. Wang and L\. Aitchison \(2024\)How to set adamw’s weight decay as you scale model and dataset size\.arXiv preprint arXiv:2405\.13698\.Cited by:[Figure 7](https://arxiv.org/html/2605.15290#A2.F7),[§B\.4](https://arxiv.org/html/2605.15290#A2.SS4.p1.3),[item 3](https://arxiv.org/html/2605.15290#S1.I1.i3.p1.1),[§3\.1](https://arxiv.org/html/2605.15290#S3.SS1.p4.1),[§4](https://arxiv.org/html/2605.15290#S4.SS0.SSS0.Px5.p3.4)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§2](https://arxiv.org/html/2605.15290#S2.p2.1)\.
- G\. Yang, E\. J\. Hu, I\. Babuschkin, S\. Sidor, X\. Liu, D\. Farhi, N\. Ryder, J\. Pachocki, W\. Chen, and J\. Gao \(2022\)Tensor programs v: tuning large neural networks via zero\-shot hyperparameter transfer\.arXiv preprint arXiv:2203\.03466\.Cited by:[§A\.1](https://arxiv.org/html/2605.15290#A1.SS1.p1.16),[Figure 7](https://arxiv.org/html/2605.15290#A2.F7),[§B\.2](https://arxiv.org/html/2605.15290#A2.SS2.p1.5),[§B\.4](https://arxiv.org/html/2605.15290#A2.SS4.p1.3),[§1](https://arxiv.org/html/2605.15290#S1.p1.3),[§2](https://arxiv.org/html/2605.15290#S2.p1.4),[§2](https://arxiv.org/html/2605.15290#S2.p3.6),[§3](https://arxiv.org/html/2605.15290#S3.p1.5),[§3](https://arxiv.org/html/2605.15290#S3.p7.1),[Figure 3](https://arxiv.org/html/2605.15290#S4.F3),[Figure 4](https://arxiv.org/html/2605.15290#S4.F4)\.
- G\. Yang and E\. J\. Hu \(2021\)Tensor programs iv: feature learning in infinite\-width neural networks\.InInternational Conference on Machine Learning,pp\. 11727–11737\.Cited by:[§1](https://arxiv.org/html/2605.15290#S1.p1.3),[§2](https://arxiv.org/html/2605.15290#S2.p1.4),[§2](https://arxiv.org/html/2605.15290#S2.p5.2),[3rd item](https://arxiv.org/html/2605.15290#S3.I1.i3.p1.1),[§3](https://arxiv.org/html/2605.15290#S3.p3.1),[§3](https://arxiv.org/html/2605.15290#S3.p5.2)\.
- G\. Yang, J\. B\. Simon, and J\. Bernstein \(2023a\)A spectral condition for feature learning\.arXiv preprint arXiv:2310\.17813\.Cited by:[item 1](https://arxiv.org/html/2605.15290#S1.I1.i1.p1.3),[§1](https://arxiv.org/html/2605.15290#S1.p2.8),[§2](https://arxiv.org/html/2605.15290#S2.p1.4),[§3](https://arxiv.org/html/2605.15290#S3.p1.5),[§3](https://arxiv.org/html/2605.15290#S3.p1.6),[§3](https://arxiv.org/html/2605.15290#S3.p5.1),[§3](https://arxiv.org/html/2605.15290#S3.p5.2),[§5](https://arxiv.org/html/2605.15290#S5.p1.4)\.
- G\. Yang, D\. Yu, C\. Zhu, and S\. Hayou \(2023b\)Tensor programs vi: feature learning in infinite\-depth neural networks\.arXiv preprint arXiv:2310\.02244\.Cited by:[§2](https://arxiv.org/html/2605.15290#S2.p1.4),[§2](https://arxiv.org/html/2605.15290#S2.p3.6),[§3\.3](https://arxiv.org/html/2605.15290#S3.SS3.p7.11)\.
- G\. Yang \(2019\)Wide feedforward or recurrent neural networks of any architecture are gaussian processes\.Advances in Neural Information Processing Systems32\.Cited by:[§2](https://arxiv.org/html/2605.15290#S2.p1.4)\.
- G\. Yang \(2020a\)Tensor programs ii: neural tangent kernel for any architecture\.arXiv preprint arXiv:2006\.14548\.Cited by:[§2](https://arxiv.org/html/2605.15290#S2.p1.4)\.
- G\. Yang \(2020b\)Tensor programs iii: neural matrix laws\.arXiv preprint arXiv:2009\.10685\.Cited by:[§2](https://arxiv.org/html/2605.15290#S2.p1.4)\.
- Y\. Yin, Z\. Bai, and P\. R\. Krishnaiah \(1988\)On the limit of the largest eigenvalue of the large dimensional sample covariance matrix\.Probability theory and related fields78,pp\. 509–521\.Cited by:[§A\.1](https://arxiv.org/html/2605.15290#A1.SS1.p1.7),[§3\.2](https://arxiv.org/html/2605.15290#S3.SS2.p4.4)\.
- C\. Zheng, R\. Wang, X\. Zhang, and L\. Chongxuan \(2026\)Spectral condition for μp under width\-depth scaling\.arXiv preprint arXiv:2603\.00541v2\.Cited by:[§2](https://arxiv.org/html/2605.15290#S2.p5.2)\.
## Appendix A Additional Mathematical Details
### A.1 Derivation for Adam
We demonstrate the applicability of our framework by re-deriving the $\mu$P scalings for Adam. Recall that the Adam optimizer (Kingma and Ba, [2014](https://arxiv.org/html/2605.15290#bib.bib19)) uses hyperparameters $\beta_1$, $\beta_2$, $\varepsilon$, and $\eta$ and has its optimization steps given by the following components:
$$\begin{aligned}
g_t &= \nabla_{\bm{W}} f(\bm{W}_{t-1}), \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t, & v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^{2}, \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{t}}, & \hat{v}_t &= \frac{v_t}{1-\beta_2^{t}},
\end{aligned}$$

with the weight update

$$\bm{W}_t = \bm{W}_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}. \tag{12}$$

The key observation is that the term

$$\hat{\bm{r}}_t := \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} \tag{13}$$

will always have typical size $1$ (for $\varepsilon$ sufficiently small), and as such the spectral norm can be estimated using the Bai-Yin theorem (Yin et al., [1988](https://arxiv.org/html/2605.15290#bib.bib48); Bai and Yin, [1993](https://arxiv.org/html/2605.15290#bib.bib4)), depending on whether the layer is vector-like or matrix-like. Thus, we have the following reasoning. For an input layer, we have

$$\|\Delta\bm{W}^{0}_t\| = \underbrace{\Theta(\eta^{0}\sqrt{n})}_{\text{Bai-Yin}} = \underbrace{\Theta(\sqrt{n})}_{\text{equation 1}},$$

which implies that we must choose $\eta^{0} = \Theta(1)$. Next, for the hidden layers, we have

$$\|\Delta\bm{W}^{\ell}_t\| = \underbrace{\Theta(\eta^{\ell} n)}_{\text{Bai-Yin}} = \underbrace{\Theta(1)}_{\text{equation 1}},$$

which leads us to choose $\eta^{\ell} = \Theta(n^{-1})$. Finally, for the output layer, we have
$$\|\Delta\bm{W}^{L+1}_t\| = \underbrace{\Theta(\eta^{L+1}\sqrt{n})}_{\text{Bai-Yin}} = \underbrace{\Theta(n^{-1/2})}_{\text{equation 1}},$$

which leads us to choose $\eta^{L+1} = \Theta(n^{-1})$.

There is a subtle nuance in our derivation that is often overlooked in the literature. We have assumed that the Adam optimizer step is independent of the network width $n$, but this is not quite true. To see why, consider setting $\beta_1 = \beta_2 = 0$, so that $\hat{m}_t = g_t$ and $\hat{v}_t = g_t^{2}$, and the Adam optimizer step is given simply by

$$\hat{\bm{r}}_t = \frac{g_t}{|g_t| + \varepsilon},$$

where $g_t$ is the gradient. For concreteness, consider a hidden layer. Yang et al. ([2022](https://arxiv.org/html/2605.15290#bib.bib45)) show that the gradient will scale like $\Theta(1/n)$. Thus, letting $\overline{g}_t = n g_t$ be the size-$1$ normalized gradient, we have that

$$\hat{\bm{r}}_t = \frac{\overline{g}_t}{|\overline{g}_t| + n\varepsilon},$$

which is not actually $\Theta(1)$ in $n$, since for $n = \Omega(\varepsilon^{-1})$ the Adam updates decay like $n^{-1}$. Thus, to be pedantic and ensure actual feature learning, we must scale $\varepsilon = \varepsilon_0/n$. In practice, we find that this subtlety can be avoided by setting $\varepsilon = 10^{-12}$ instead of the usual default of $10^{-8}$; however, for a complete treatment, this scaling must be included. We note that Dey et al. ([2023](https://arxiv.org/html/2605.15290#bib.bib11)) and Everett et al. ([2024](https://arxiv.org/html/2605.15290#bib.bib38)) reach the same conclusion about scaling the Adam $\varepsilon$ parameter and perform an empirical study on its transferability.
### A.2 Additional Details for Grouped Query Attention
###### Lemma 1\.
Scaling Concatenated Spectral Norms. Let $A \in \mathbb{R}^{n \times \frac{n}{r}}$ for some integer $r \geq 1$, where $r$ is the number of repetitions of key and value heads, be a matrix with spectral norm $\|A\| > 0$. Then letting $A^{\oplus} := \bigoplus_{j=1}^{r} A$ we have

$$\|A^{\oplus}\| = \sqrt{r}\,\|A\|. \tag{14}$$
###### Proof of Lemma [1](https://arxiv.org/html/2605.15290#Thmlemma1).
We denote the $r$-times concatenation $A^{\oplus}$ by

$$A^{*} = [\, \underbrace{A \; A \; \cdots \; A}_{r \text{ times}} \,]. \tag{15}$$

The matrix $A$ has a singular value decomposition $U \Sigma V^{T}$ for $U, V$ unitary with $U \in \mathbb{R}^{n \times n}$, $V \in \mathbb{R}^{\frac{n}{r} \times \frac{n}{r}}$, and $\Sigma = \begin{bmatrix} \Lambda \\ 0 \end{bmatrix} \in \mathbb{R}^{n \times \frac{n}{r}}$ with $\Lambda$ a diagonal $\mathbb{R}^{\frac{n}{r} \times \frac{n}{r}}$ matrix. Substituting the SVD into equation [15](https://arxiv.org/html/2605.15290#A1.E15), we can factor out the unitary matrix $U$ and arrive at

$$A^{*} = U\, [\, \Sigma V^{T} \; \Sigma V^{T} \; \cdots \; \Sigma V^{T} \,] = U \begin{bmatrix} \Lambda V^{T} & \Lambda V^{T} & \cdots & \Lambda V^{T} \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}.$$

It remains to find the singular values of this matrix, which give us the spectral norm scaling. To this end, observe that by the unitarity of $V$ we have

$$A^{*} (A^{*})^{T} = U \begin{bmatrix} \Lambda V^{T} & \Lambda V^{T} & \cdots & \Lambda V^{T} \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \begin{bmatrix} V \Lambda & 0 & \cdots & 0 \\ V \Lambda & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ V \Lambda & 0 & \cdots & 0 \end{bmatrix} U^{T} = U \begin{bmatrix} r \Lambda^{2} & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} U^{T}.$$

Thus, the largest eigenvalue of $A^{*} (A^{*})^{T}$ is $r \lambda_{\max}^{2}$, with $\lambda_{\max}$ the largest singular value of $A$ (the largest diagonal entry of $\Lambda$), and the desired spectral norm scaling is immediate. ∎
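The $\sqrt{r}$ scaling can also be checked numerically. Below is a minimal pure-Python sketch, assuming a small Gaussian matrix $A$ and estimating spectral norms by power iteration on the Gram matrix; the helper names (`spectral_norm`, `concat_columns`, `lemma1_norms`) and the sizes are hypothetical choices for illustration only.

```python
import math
import random


def spectral_norm(A, iters=500):
    """Estimate the largest singular value of A by power iteration on A A^T."""
    n_rows = len(A)
    # Gram matrix G = A A^T, symmetric positive semi-definite.
    G = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(n_rows)]
         for i in range(n_rows)]
    v = [1.0 / math.sqrt(n_rows)] * n_rows
    for _ in range(iters):
        w = [sum(g * x for g, x in zip(row, v)) for row in G]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Gv = [sum(g * x for g, x in zip(row, v)) for row in G]
    top_eig = sum(x * y for x, y in zip(v, Gv))  # Rayleigh quotient
    return math.sqrt(top_eig)


def concat_columns(A, r):
    """Return [A A ... A] with r copies of A placed side by side."""
    return [row * r for row in A]


def lemma1_norms(n=8, r=4, seed=0):
    """Return (||[A ... A]||, ||A||) for a random n x (n/r) Gaussian matrix A."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(n // r)] for _ in range(n)]
    return spectral_norm(concat_columns(A, r)), spectral_norm(A)


if __name__ == "__main__":
    concat_norm, base_norm = lemma1_norms()
    print(concat_norm / base_norm, math.sqrt(4))
```

The ratio of the two norms should agree with $\sqrt{r}$ up to floating-point error, since the Gram matrix of the concatenation is exactly $r$ times that of $A$.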
###### Lemma 2\.
Let $\bm{A} \in \mathbb{R}^{n \times n}$ have i.i.d. entries. Then, for $x \in \mathbb{R}^{n}$ with i.i.d. $\mathcal{N}(0, 1)$ entries, we have that

$$\left\| \bm{A} \right\|_{\mathbb{E}} = \Theta(\left\| \bm{A} \right\|).$$

###### Proof of Lemma [2](https://arxiv.org/html/2605.15290#Thmlemma2).
First, note that $\left\| \bm{A} \right\| = \Theta(\sigma \sqrt{n})$, where $\sigma$ is the standard deviation of the i.i.d. entries of $\bm{A}$. Next, observe that we can use the law of large numbers and the central limit theorem to estimate

$$\mathbb{E} \left\| \bm{A} x \right\|_{2}^{2} = \mathbb{E} \sum_{i} \Big( \sum_{j} A_{ij} x_{j} \Big)^{2} = \Theta(\sigma^{2} n^{2}),$$

and the result follows since $\left\| x \right\|_{2} = \Theta(\sqrt{n})$. ∎
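For readers who want the intermediate step, the estimate in the proof can be expanded as follows. This is a sketch under two extra assumptions that are ours, for illustration: the entries of the matrix have zero mean and variance $\sigma^{2}$, and they are independent of $x$.

$$\mathbb{E} \left\| A x \right\|_{2}^{2} = \sum_{i=1}^{n} \sum_{j,k=1}^{n} \mathbb{E}[A_{ij} A_{ik}]\, \mathbb{E}[x_{j} x_{k}] = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{E}[A_{ij}^{2}]\, \mathbb{E}[x_{j}^{2}] = n \cdot n \cdot \sigma^{2}, \qquad \mathbb{E} \left\| x \right\|_{2}^{2} = n.$$

The cross terms with $j \neq k$ vanish because the entries are independent and zero mean. Taking square roots, the ratio of the two typical sizes is $\sigma n / \sqrt{n} = \sigma \sqrt{n}$, which is the same order as the spectral norm.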
### A.3 Intuition behind no exact cancellation
The intuition is that, in high-dimensional spaces, independently initialized weight matrices and their gradient updates typically act on inputs in weakly correlated, nearly orthogonal directions. Therefore, when we add two such high-dimensional operators, such as $W_{t-1}$ and $\Delta W_{t-1}$, their geometries are unlikely to align in a way that causes significant cancellation. As a result, the norm of the sum remains on the same order as the combined contribution of the two terms, rather than becoming artificially small.
This assumption captures a basic stability property of high\-dimensional neural networks: when an update is added to a weight matrix, the update and the existing weights should not systematically point in opposite directions\. If they did, the update could cancel the weights, causing the activations or backpropagated gradients to shrink across layers or training steps\. In the extreme case, repeated cancellation would drive internal signals toward zero, preventing the network from learning useful features from the data\.
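The near-orthogonality behind this intuition can be illustrated with a toy experiment. The sketch below is not the setting of our analysis: it uses independent Gaussian vectors as stand-ins for flattened weights and updates, and Euclidean rather than spectral norms; `cosine_and_norms`, `dim`, and `update_scale` are hypothetical names and values.

```python
import math
import random


def cosine_and_norms(dim=10_000, update_scale=0.5, seed=0):
    """Compare ||w + dw|| with the no-cancellation estimate sqrt(||w||^2 + ||dw||^2).

    w and dw are independent Gaussian vectors; this independence is an
    illustrative assumption, not a property proved for trained networks.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dw = [update_scale * rng.gauss(0.0, 1.0) for _ in range(dim)]

    norm_w = math.sqrt(sum(a * a for a in w))
    norm_dw = math.sqrt(sum(b * b for b in dw))
    cosine = sum(a * b for a, b in zip(w, dw)) / (norm_w * norm_dw)

    norm_sum = math.sqrt(sum((a + b) ** 2 for a, b in zip(w, dw)))
    estimate = math.sqrt(norm_w ** 2 + norm_dw ** 2)
    return cosine, norm_sum, estimate


if __name__ == "__main__":
    cosine, norm_sum, estimate = cosine_and_norms()
    print(f"cosine={cosine:.4f}  ||w+dw||={norm_sum:.2f}  estimate={estimate:.2f}")
```

In high dimension the cosine between the two vectors concentrates near zero, so the norm of the sum stays close to the combined size of the two terms.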
## Appendix B Experimental Details
### B.1 Model Configurations
In our experiments, we train Transformer language models with untied embeddings and the GELU nonlinearity of Hendrycks and Gimpel ([2023](https://arxiv.org/html/2605.15290#bib.bib50)). The batch size is set by the data-driven rule in equation [16](https://arxiv.org/html/2605.15290#A2.E16) as a function of the total number of training tokens $n_{\text{tokens}}$, where the corresponding sequence length is 8192.

$$B = 0.000733 \times \sqrt{n_{\text{tokens}}}. \tag{16}$$

Equation [16](https://arxiv.org/html/2605.15290#A2.E16) follows the isoloss sweep methodology of Bergsma et al. ([2026](https://arxiv.org/html/2605.15290#bib.bib54)) but uses a rounded exponent for tractability. Specifically, Bergsma et al. ([2026](https://arxiv.org/html/2605.15290#bib.bib54)) estimate a scaling exponent of 0.46 and recommend rounding to 0.5. Since we ran independent sweeps on our own data, equation [16](https://arxiv.org/html/2605.15290#A2.E16) is specific to our setup but aligns structurally with Bergsma et al. ([2026](https://arxiv.org/html/2605.15290#bib.bib54)).
We use a cosine learning rate schedule with warmup. The number of warmup steps follows equation [17](https://arxiv.org/html/2605.15290#A2.E17), from Dey et al. ([2025](https://arxiv.org/html/2605.15290#bib.bib12)):

$$n_{\text{warmup}} = \min\left( \text{int}(0.02 \cdot n_{\text{training}}),\ \text{int}\left( 375 \times 10^{6} / (B \times L) \right) \right), \tag{17}$$

where $B$ is the batch size and $L$ is the sequence length.
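Equations 16 and 17 translate directly into code. The following is a minimal sketch; the function names are hypothetical, the batch size is returned unrounded because no rounding rule is stated here, and $n_{\text{training}}$ is read as the total number of training steps (an assumption).

```python
import math


def optimal_batch_size(n_tokens: float) -> float:
    """Equation (16): B = 0.000733 * sqrt(n_tokens). Rounding is left to the caller."""
    return 0.000733 * math.sqrt(n_tokens)


def warmup_steps(n_training: int, batch_size: float, seq_len: int = 8192) -> int:
    """Equation (17): min(int(0.02 * n_training), int(375e6 / (B * L)))."""
    by_fraction = int(0.02 * n_training)             # 2% of the training steps
    by_cap = int(375e6 / (batch_size * seq_len))     # fixed cap in steps
    return min(by_fraction, by_cap)


if __name__ == "__main__":
    # Illustrative inputs only, not a configuration from the paper.
    print(optimal_batch_size(1e10))
    print(warmup_steps(n_training=1_000, batch_size=64))
    print(warmup_steps(n_training=100_000, batch_size=64))
```

For short runs the 2% rule determines the warmup length, while for long runs the second term caps it.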
All of our experiments are conducted using the OpenWebText dataset (Gokaslan and Cohen, [2019](https://arxiv.org/html/2605.15290#bib.bib13)).
Table 3: Model configurations for the coordinate check experiments from Figures [4](https://arxiv.org/html/2605.15290#S4.F4), [5](https://arxiv.org/html/2605.15290#S4.F5), [6](https://arxiv.org/html/2605.15290#S4.F6).

#### B.1.1 Coordinate Checks
For the coordinate checks, we verify that our norms remain stable as we vary the number of kv heads. The specific configurations used during the coordinate checks are listed in Table [3](https://arxiv.org/html/2605.15290#A2.T3). We use weight decay $0$ in our coordinate checking experiments and do all of the computation in float32. We use a fixed Adam $\varepsilon$ of $10^{-12}$ and an initial standard deviation of $0.02$. Other optimizer settings are left at the defaults of PyTorch's Adam implementation. We run our experiments on seeds $1$ through $10$ and plot the average and confidence interval. We use a batch size of $1$ and a sequence length of $1024$ to ensure quick computation.
#### B.1.2 GQA Ablation Experiment
We train our GQA ablation models to 10 tokens per parameter (TPP). The configurations used for this experiment can be found in Table [4](https://arxiv.org/html/2605.15290#A2.T4). We set the base weight decay to $\lambda_{0} = 0.1$. We use a base Adam $\varepsilon$ of $10^{-9}/n$, where $n$ is the embedding dimension, to match the predicted Adam $\varepsilon$ scaling of Dey et al. ([2025](https://arxiv.org/html/2605.15290#bib.bib12)). We take three runs for each data point, using seeds $42, 43, 44$ for reproducibility.

Table 4: Model configurations for the GQA transfer experiments from Figure [1](https://arxiv.org/html/2605.15290#S1.F1).

#### B.1.3 Weight Decay Transfer Experiment
Due to the high number of sampling points, we only trained our models in the weight decay experiments to 3 TPP, well below the compute-optimal horizon (Hoffmann et al., [2022](https://arxiv.org/html/2605.15290#bib.bib15)). The configurations used for this experiment can be found in Table [5](https://arxiv.org/html/2605.15290#A2.T5). We uniformly sample the grid in $\log$-$\log$ space. We were only able to run one trial per data point, but the high number of trials increases confidence. We sample 250 points on the grid for each implementation and model size.
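Uniform sampling in $\log$-$\log$ space can be sketched as follows. This is an illustrative example only: the function name, the seed, and in particular the learning rate and weight decay ranges are hypothetical, since the sweep bounds are not stated in this section.

```python
import math
import random


def sample_log_uniform_grid(n_points, lr_range, wd_range, seed=0):
    """Draw (learning rate, weight decay) pairs uniformly in log-log space."""
    rng = random.Random(seed)
    log_lr = (math.log(lr_range[0]), math.log(lr_range[1]))
    log_wd = (math.log(wd_range[0]), math.log(wd_range[1]))
    return [
        (math.exp(rng.uniform(*log_lr)), math.exp(rng.uniform(*log_wd)))
        for _ in range(n_points)
    ]


# Hypothetical bounds (assumptions), 250 points as in the text.
points = sample_log_uniform_grid(
    250, lr_range=(2 ** -12, 2 ** -4), wd_range=(1e-3, 1.0)
)
```

Sampling the logarithms uniformly spreads the trials evenly across orders of magnitude, which suits hyperparameters whose useful range spans several decades.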
Table 5: Model configurations for the weight decay experiments from Figures [3](https://arxiv.org/html/2605.15290#S4.F3) and [7](https://arxiv.org/html/2605.15290#A2.F7).

Figure 7: Voronoi interpolation for random sweeps over both learning rate and $\tau_{\text{epoch}}$ of Wang and Aitchison ([2024](https://arxiv.org/html/2605.15290#bib.bib39)). The top row is standard parameterization. The middle row is the vanilla Adam-$\mu$P implementation suggested in Yang et al. ([2022](https://arxiv.org/html/2605.15290#bib.bib45)). The bottom row is our proposed implementation. Each column is a different size model, increasing in number of parameters from left to right. For each model and implementation we plot the best trial. We scale the hidden dimension, depth, batch size, and training iterations. Lighter colors are lower loss, darker colors are higher loss. The red x is the average (learning rate, weight decay) pair, where each coordinate is averaged over the model sizes, while the black x is the optimal pair for each experiment.

Figure 8: Learning-rate transfer at 20 tokens-per-parameter (TPP) under vanilla Adam-$\mu$P (left) and GQA-$\mu$P (right). The 136M proxy is swept at $r \in \{1, 2, 4, 8\}$ on a half-power LR grid and the 1.2B target is swept at $r = 4$ on whole-power steps. Colour encodes $r$ and linestyle encodes model size; stars mark the optimal learning rate for each configuration. The 1.2B target optimum lands at $2^{-9}$ under both parameterizations. Under vanilla $\mu$P the 136M proxy optima drift to $2^{-8.5}$ for three of the four $r$ values, whereas under GQA-$\mu$P two of the four proxy optima coincide with the target at $2^{-9}$; the residual $r$-dependence is smaller but not fully eliminated at this training horizon.
### B.2 Failure of Yang-Type Coordinate Checking
Yang et al. ([2022](https://arxiv.org/html/2605.15290#bib.bib45)) suggest measuring $\left\| h_{t} \right\|_{2}$ and $\left\| \Delta h_{t} \right\|_{2}$ to verify that a $\mu$P implementation is correct by comparing these norms during training to the feature learning conditions in equation [3](https://arxiv.org/html/2605.15290#S3.E3). In Figure [4](https://arxiv.org/html/2605.15290#S4.F4) we plot $\left\| \Delta h_{t} \right\|_{2}$ for the vanilla Adam-$\mu$P implementation while varying only the number of kv heads. Note that the implementation passes a coordinate check, but as discussed theoretically in Section [3](https://arxiv.org/html/2605.15290#S3) and empirically in Section [4](https://arxiv.org/html/2605.15290#S4), the learning rate does not transfer for this implementation (see Figure [1](https://arxiv.org/html/2605.15290#S1.F1)).
Coordinate checks on our proposed spectral condition from equation [1](https://arxiv.org/html/2605.15290#S3.E1), however, capture the failure of feature learning (see Figure [5](https://arxiv.org/html/2605.15290#S4.F5)).
### B.3 Transfer at Compute-Optimal Training Horizons
To check that the behaviour observed at 10 TPP persists at more realistic training lengths, we repeat the learning-rate sweep at 20 TPP, scaling from a 136M proxy model to a 1.2B target model across $r \in \{1, 2, 4, 8\}$. Figure [8](https://arxiv.org/html/2605.15290#A2.F8) reports the results on a half-power LR grid for the 136M proxy and whole-power steps for the 1.2B target. The 1.2B target optimum lands at $2^{-9}$ under both parameterizations, so the transfer question reduces to whether the 136M proxy identifies the same learning rate. Under vanilla $\mu$P, three of the four 136M proxy optima ($r \in \{1, 2, 4\}$) drift to $2^{-8.5}$, half a power above the 1.2B target, while only $r = 8$ recovers $2^{-9}$. Under GQA-$\mu$P, two of the four proxy optima ($r \in \{4, 8\}$) coincide with the target at $2^{-9}$, and the remaining two ($r \in \{1, 2\}$) sit at $2^{-8.5}$. GQA-$\mu$P therefore reduces but does not eliminate the residual $r$-dependence at this horizon. Combined with the coordinate-check result in Figure [6](https://arxiv.org/html/2605.15290#S4.F6), this still favours GQA-$\mu$P as the theoretically correct choice, particularly at shorter training horizons where the bias under vanilla $\mu$P is larger, as shown in Figure [1](https://arxiv.org/html/2605.15290#S1.F1).
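The comparison above can be condensed into a single number per parameterization. The sketch below uses only the optima quoted in the text; the summary statistic (mean absolute gap in $\log_{2}$ learning rate between proxy and target) and all names are our illustrative choices, not a metric used elsewhere in the paper.

```python
# log2 of the optimal learning rate for each 136M proxy, keyed by r,
# as reported in the text above.
PROXY_OPTIMA = {
    "vanilla muP": {1: -8.5, 2: -8.5, 4: -8.5, 8: -9.0},
    "GQA-muP": {1: -8.5, 2: -8.5, 4: -9.0, 8: -9.0},
}
TARGET_OPTIMUM = -9.0  # 1.2B target at r = 4, same under both parameterizations


def mean_abs_log2_gap(proxy_optima, target):
    """Average |log2(proxy LR) - log2(target LR)| over the r values."""
    gaps = [abs(opt - target) for opt in proxy_optima.values()]
    return sum(gaps) / len(gaps)


summary = {
    name: mean_abs_log2_gap(optima, TARGET_OPTIMUM)
    for name, optima in PROXY_OPTIMA.items()
}
print(summary)
```

On these numbers the mean gap is 0.375 powers of two for vanilla $\mu$P and 0.25 for GQA-$\mu$P, which matches the statement that the residual $r$-dependence is reduced but not eliminated.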
### B.4 More Results about Weight Decay
Table 6: Variance table comparing our implementations across model sizes for the weight decay experiment from Figure [3](https://arxiv.org/html/2605.15290#S4.F3).

We used the same data that was collected for Figure [3](https://arxiv.org/html/2605.15290#S4.F3) to analyze whether our experimental testbed demonstrates transfer over $\tau_{\text{epoch}}$, as suggested by (Wang and Aitchison, [2024](https://arxiv.org/html/2605.15290#bib.bib39); Bergsma et al., [2025](https://arxiv.org/html/2605.15290#bib.bib5); Dey et al., [2025](https://arxiv.org/html/2605.15290#bib.bib12)). We find that we get slightly better transfer in $\tau_{\text{epoch}}$ than we do with weight decay, using the same data. We report the variance in our optimal configurations in Table [2](https://arxiv.org/html/2605.15290#S4.T2). As in the case of weight decay transfer (see Figure [3](https://arxiv.org/html/2605.15290#S4.F3)), we find that our suggested implementation outperforms both the standard parameterization and the vanilla Adam-$\mu$P implementation from Yang et al. ([2022](https://arxiv.org/html/2605.15290#bib.bib45)).
## Appendix C LLM Statement
We did not use LLMs in a significant way to aid our research during the completion of this work. Our LLM usage did not extend beyond code assistants such as Copilot and polishing the writing in our manuscript.