Unlocking Feature Learning in Gated Delta Networks at Scale

arXiv cs.LG Papers

Summary

This paper derives scaling rules for Gated Delta Networks using Maximal Update Parametrization (μP), enabling zero-shot hyperparameter transfer across model widths for efficient sub-quadratic LLM architectures. Experiments confirm stable learning-rate transfer under both AdamW and SGD, whereas standard parametrization fails.

arXiv:2606.04048v1 Announce Type: new Abstract: Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization ($\mu$P) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.

Cached at: 06/05/26, 02:18 AM

# Unlocking Feature Learning in Gated Delta Networks at Scale
Source: [https://arxiv.org/html/2606.04048](https://arxiv.org/html/2606.04048)
Los Angeles {liuyifeng,qgu}@cs.ucla.edu

###### Abstract

Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization ($\mu$P) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for the Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.

## 1 Introduction

Large Language Models (LLMs) have developed rapidly and demonstrate remarkable capabilities across a wide range of downstream tasks (brown2020language; touvron2023llama; radford2019language; vaswani2017attention). However, scaling these models to larger sizes introduces two challenges. First, empirical scaling laws show that optimal performance requires increased model size, while the computational budget required for training grows steeply with model scale (kaplan2020scaling; hoffmann2022training). Second, the efficiency of the standard Transformer architecture is limited by the quadratic complexity of softmax self-attention with respect to sequence length, making it increasingly costly for long-context inference and training (katharopoulos2020transformers).

Linear models have been proposed to address these issues. The original linear attention (katharopoulos2020transformers) rewrites softmax attention as a linear kernel, enabling recurrent-form inference at constant per-step cost. Structured state space models (SSMs) such as S4 (guefficiently), Mamba (gu2024mamba) and Mamba-2 (dao2024transformers) utilize recurrent state spaces to represent long-range dependencies within linear structures. A particularly promising family of linear recurrent models is based on the delta rule (widrow1960adaptive), which updates a fast-weight matrix by subtracting the prediction error of the current key-value pair. DeltaNet (yang2024deltanet) introduced a hardware-efficient parallel training algorithm for delta-rule Transformers, enabling scaling to large language models. Subsequently, Gated Delta Network (yang2025gated) augmented DeltaNet with the data-dependent gating mechanism of Mamba-2, achieving strong language modeling performance while maintaining linear-time inference.

At the same time, training deep networks requires careful selection of hyperparameters such as learning rates, which are expensive to tune through grid search (snoek2012practical; snoek2015scalable) and whose optimal values often change greatly with model scale. Meta-learning approaches have been explored to transfer hyperparameters across tasks and datasets (yogatama2014efficient; perrone2018scalable; horvath2021hyperparameter; akiba2019optuna). A more principled solution is offered by the Maximal Update Parametrization ($\mu$P) (yang2020feature), which identifies the valid parametrization of a neural network that supports feature learning in the infinite-width limit, as formalized through the Tensor Programs framework (yang2022tensor; yang2023tensor; TensorProgramVI). $\mu$P theory shows that hyperparameters tuned on small proxy models transfer zero-shot to large target models, with extensions to adaptive optimizers (yang2023tensor; ishikawaparameterization; everett2024scaling) and a spectral reformulation (yang2023spectral). Subsequent work has successfully applied $\mu$P to other fields (blakeu2025; dey2024sparse; hajjar2024training), and even to industrial models (meta2024llama4; team2025longcat).

Despite the development of efficient linear architectures, how to properly parametrize them for feature learning at scale has received very limited attention. The core challenge is that their recurrent state is updated along the sequence dimension, which does not fit the standard feedforward or attention-based derivations of $\mu$P. The only prior work on this is vankadara2024feature, which shows that vanilla $\mu$P and spectral scaling conditions both fail to support feature learning in diagonal SSMs like Mamba, and proposes a corrected scaling rule for them. However, the Gated Delta Network differs fundamentally from diagonal SSMs, since its recurrent state is a full matrix updated with additional data-dependent scalar gating through two separate weight matrices. These differences make the SSM-specific analysis of vankadara2024feature inapplicable, leaving the $\mu$P parametrization of Gated Delta Networks an open problem.

In this paper, we formally derive the complete $\mu$P formulation for Gated Delta Networks. Our main contributions are:

- We theoretically derive coordinate-size estimates through the full forward pass. We also derive principled initialization variances, forward multipliers, and learning-rate scalings for all weight classes. We discover that the gating weight matrices require a non-standard $\Theta(1/\sqrt{d})$ learning-rate scaling, and the scalar gating parameters require a $\Theta(\sqrt{d})$ scaling, both of which deviate from the standard $\mu$P setting.
- We pretrain Gated Delta Network language models across multiple widths and show that our $\mu$P formulation enables zero-shot learning-rate transfer under both AdamW and SGD optimizers, while standard parametrization fails to transfer, confirming both the theoretical derivation and its practical efficiency.

## 2 Related Work

Efficient Sequence Models. The standard Transformer (vaswani2017attention) and its variants (radford2019language; brown2020language; touvron2023llama) have become the dominant architecture for large-scale language modeling, but their quadratic attention complexity limits their application to much larger language models. Linear attention (katharopoulos2020transformers) replaces the softmax with a kernel function that permits rewriting attention as a linear RNN, enabling $O(1)$ per-step inference. Structured state space models (SSMs) improve on this by integrating recurrent state spaces. S4 (guefficiently) introduces HiPPO-based (gu2020hippo) structured matrices for long-range sequence modeling, Mamba (gu2024mamba) adds input-selective state transitions for improved performance, and Mamba-2 (dao2024transformers) unifies SSMs with structured matrix attention via state-space duality. Other notable architectures include RetNet (sun2023retentive), RWKV (peng2023rwkv), Gated Linear Attention (yang2024gla), HGRN (qin2023hgrn) and its extension HGRN2 (qin2024hgrn2). Recently, numerous models have applied the delta rule (widrow1960adaptive), which subtracts the prediction error rather than accumulating outer products. This was formalized in the fast-weight programmer framework (schlag2021linear), with recurrent extensions in irie2021going. DeltaNet (yang2024deltanet) further introduced a hardware-efficient parallel training algorithm, and Gated Delta Networks (yang2025gated) combined the delta-rule state update with the data-dependent gating of Mamba-2, improving performance. Our work focuses precisely on this architecture and derives its $\mu$P formulation.

Hyperparameter Transfer. Many researchers have explored approaches to accelerate the hyperparameter search process when training deep learning models (snoek2012practical; snoek2015scalable; jamieson2016non; akiba2019optuna). Some have also explored methods to transfer hyperparameters between different tasks or datasets (horvath2021hyperparameter; perrone2018scalable; yogatama2014efficient). Moreover, starting from standard parametrization (SP, such as Xavier initialization (glorot2010understanding) and Kaiming initialization (he2015delving)), yang2020feature proposed the Maximal Update Parametrization ($\mu$P) based on the abc-parameterization framework, which unifies previous parametrization methods such as SP, Neural Tangent Kernel (NTK) (jacot2018neural) and Mean Field parametrization (chizat2018global; mei2018mean; sirignano2020mean; rotskoff2022trainability). It enables feature learning that persists in the infinite-width limit. Building on $\mu$P, yang2022tensor proposed $\mu$Transfer, which transfers hyperparameters zero-shot from a much smaller proxy model to large models, and generalized it to different architectures and optimizers (yang2023tensor), such as SGD, Adagrad (duchi2011adaptive) and Adam (adam). Following that, they also reformulated $\mu$P from the perspective of the spectral norm (yang2023spectral). Recently, many researchers have further generalized $\mu$P to other fields or successfully scaled LLMs with $\mu$P (blakeu2025; meta2024llama4; haas2024effective; dey2024sparse; hajjar2024training; team2025longcat).

However, the question of how to properly parametrize efficient linear sequence models for feature learning at scale has received very little attention. The core challenge is that the original $\mu$P designs cannot be directly applied to architectures with recurrent state transitions and structured matrix operations. To our knowledge, the only work that formally addresses $\mu$P-style parametrization for this class of models is vankadara2024feature, which studies the scaling behavior of structured SSMs like Mamba. Their analysis reveals that both vanilla $\mu$P and spectral scaling conditions fail to support feature learning in SSMs, and they derive a scaling rule for SSMs that recovers feature learning. The parametrization of Gated Delta Net, however, differs greatly from that of diagonal SSMs, since its matrix-valued hidden state is updated via outer-product delta rules rather than scalar recurrences. This work is, to our knowledge, the first to derive and validate a $\mu$P-consistent parametrization for Gated Delta Network.

## 3 Preliminaries

### 3.1 Gated Delta Net

Proposed by yang2025gated, Gated Delta Net is a variant of the linear transformer (katharopoulos2020transformers), based on the Mamba-2 architecture (dao2024transformers). For the query, key and value vectors $\mathbf{q}_t$, $\mathbf{k}_t$ and $\mathbf{v}_t$, defined as in the original Transformer, the update rule of the latent state is:

$$\mathbf{S}_t = \mathbf{S}_{t-1}\bigl(\alpha_t(\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^{\top})\bigr) + \beta_t \mathbf{v}_t \mathbf{k}_t^{\top}, \tag{1}$$

where $\alpha_t \in (0,1)$ is the data-dependent gating scale and $\beta_t \in (0,1)$ is the “writing strength” of the current input at time $t$, as proposed in widrow1960adaptive; schlag2021linear. The output is a direct readout of the latent state with the query:

$$\mathbf{o}_t = \mathbf{S}_t \mathbf{q}_t. \tag{2}$$

Unlike the Transformer, Gated Delta Net adds a short convolution after the query, key and value projections, followed by a SiLU activation. The queries and keys are L2-normalized, and an RMSNorm layer precedes the output projection to stabilize training. As discussed in the original paper, these normalization layers are crucial to the performance of Gated Delta Net.
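As a minimal sketch of Eq. (1) and Eq. (2), the recurrence can be written in plain Python for a single head. The function name `gated_delta_step`, the list-of-rows matrix layout, and the toy values are assumptions made for illustration; the short convolution, SiLU and normalization layers are omitted, and this is not the hardware-efficient parallel algorithm of the original implementation.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """Apply Eq. (1) then Eq. (2) once; returns (S_t, o_t).

    S is a matrix stored as a list of rows (one row per value coordinate).
    """
    # S_{t-1} k_t: what the state currently returns for key k_t.
    Sk = [sum(s * kj for s, kj in zip(row, k)) for row in S]
    # S_t = alpha * (S_{t-1} - beta * (S_{t-1} k) k^T) + beta * v k^T
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]
    # o_t = S_t q_t
    o = [sum(s * qj for s, qj in zip(row, q)) for row in S_new]
    return S_new, o


# Toy run (hypothetical values): write v1 under a unit key, then overwrite with v2.
d = 4
key = [0.5, 0.5, 0.5, 0.5]            # unit L2 norm
S0 = [[0.0] * d for _ in range(d)]
v1 = [1.0, 2.0, 3.0, 4.0]
v2 = [-1.0, 0.0, 1.0, 2.0]
S1, o1 = gated_delta_step(S0, key, key, v1, alpha=1.0, beta=1.0)
S2, o2 = gated_delta_step(S1, key, key, v2, alpha=1.0, beta=1.0)
```

With $\alpha_t = \beta_t = 1$ and a unit-norm key, the second write replaces the value stored under that key instead of adding to it, which is the delta-rule behavior described above; choosing $\alpha_t < 1$ additionally decays the whole state at every step.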

### 3.2 $\mu$P theory

In deep learning, models are frequently scaled by increasing their hidden dimension or width $d$. Under the Standard Parameterization (SP), including He (he2015delving) or Xavier (glorot2010understanding) initialization, hidden weights are typically initialized with entries drawn from $\mathcal{N}(0, \sigma^{2}/d)$ and optimized using a uniform learning rate $\eta$ across all layers. However, as $d$ goes to infinity, SP encounters fundamental limitations. If the learning rate remains constant, the network’s activations and gradients diverge. To prevent this instability, $\eta$ must be scaled down by $\mathcal{O}(1/d)$, which forces the network into the Neural Tangent Kernel (NTK) or “lazy training” regime (yang2020feature), where the intermediate representations (features) barely evolve from their initialized state, meaning the network fails to perform real feature learning.

To resolve the trade-off between stability and feature learning in the infinite-width limit, yang2020feature proposed the Maximal Update Parameterization ($\mu$P) using the Tensor Programs framework. $\mu$P provides rigorous configurations for scaling weight initializations and learning rates as a function of the width $d$ (sometimes a width-dependent multiplier on the weight is required; refer to Tables [2](https://arxiv.org/html/2606.04048#A2.T2) and [1](https://arxiv.org/html/2606.04048#S5.T1) for AdamW and SGD configurations) to ensure feature learning. In this setting, feature updates at every layer remain bounded and non-vanishing (i.e., $\Delta h = \Theta(1)$) as the model width goes to infinity. To make this precise, we first introduce the notion of coordinate size:

###### Definition 3.1.

A vector $v \in \mathbb{R}^{d}$ has $\Theta(d^{a})$-sized coordinates if $\|v\|^{2}/d = \Theta(d^{2a})$, i.e., each entry of $v$ has variance $\Theta(d^{2a})$ as $d \to \infty$. When $d$ is large, the coordinates of the vectors being studied are regarded as roughly i.i.d. Gaussian.
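To make Definition 3.1 concrete, the exponent $a$ can be estimated numerically by comparing the root-mean-square entry size $\sqrt{\|v\|^{2}/d}$ at two widths. The sketch below is illustrative only: the helper names, the two widths, and the Gaussian test vectors are assumptions, not part of the analysis.

```python
import math
import random


def coordinate_scale(vec):
    """Root-mean-square entry size, i.e. sqrt(||v||^2 / d)."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


def estimated_exponent(make_vector, d_small=256, d_large=4096):
    """Estimate a in Theta(d^a) from the RMS entry size at two widths."""
    s_small = coordinate_scale(make_vector(d_small))
    s_large = coordinate_scale(make_vector(d_large))
    return math.log(s_large / s_small) / math.log(d_large / d_small)


rng = random.Random(0)
# Entries ~ N(0, 1): Theta(1)-sized coordinates, so a should be close to 0.
a_unit = estimated_exponent(lambda d: [rng.gauss(0.0, 1.0) for _ in range(d)])
# Entries ~ N(0, 1/d): Theta(d^{-1/2})-sized coordinates, so a should be close to -0.5.
a_shrunk = estimated_exponent(
    lambda d: [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
)
```

A two-point log-log slope is a rough estimate; a fit over more widths would be less noisy.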

Based on the definition above, $\mu$P theory proposes three desiderata. First, every (pre)activation vector should have $\Theta(1)$-sized coordinates. Second, the output of the network should be $O(1)$. Third, all parameters should be updated as much as possible without leading to divergence. Several consequences follow from these desiderata together with the assumption of feature learning. For example, the gradient with respect to a hidden state has $\Theta(1/d)$ coordinate size when training with SGD.

## 4 The $\mu$P Forward Analysis of Gated Delta Net

In this section, we review the architecture of Gated Delta Net and then derive its scaling rules.

We derive the Maximal Update Parametrization ($\mu$P) conditions for Gated Delta Net by propagating coordinate-size estimates through the forward pass and the gating mechanisms, then conclude with the implications for the AdamW optimizer.

#### Notation and standing assumptions.

Following yang2022tensor, we say a vector $\mathbf{z} \in \mathbb{R}^{d}$ has $\Theta(1)$ coordinate size if $\|\mathbf{z}\|_{2} = \Theta(\sqrt{d})$, i.e. each coordinate is of order $\Theta(1)$ in magnitude. Equivalently, the per-coordinate variance of $\mathbf{z}$ is $\Theta(1)$. For a matrix $\mathbf{A} \in \mathbb{R}^{d \times d}$, we say it has $\Theta(c)$ coordinate size if each entry is of order $\Theta(c)$ in magnitude.

We assume throughout that the hidden state $\mathbf{x}_t \in \mathbb{R}^{d}$ satisfies the $\mu$P feature-learning condition, namely

$$\|\mathbf{x}_t\|_{2} = \Theta(\sqrt{d}), \qquad \|\Delta\mathbf{x}_t\|_{2} = \Theta(\sqrt{d}),$$

so that $\mathbf{x}_t$ has $\Theta(1)$ coordinate size and its update is of the same order. To isolate the effect of the parametrization, we temporarily ignore the SiLU activations (see Remark [4.2](https://arxiv.org/html/2606.04048#S4.Thmtheorem2) below).

### 4.1 Coordinate sizes of the projected features

Let $\tilde{\mathbf{q}}_t = \text{ShortConv}(\mathbf{W}_q \mathbf{x}_t)$, $\tilde{\mathbf{k}}_t = \text{ShortConv}(\mathbf{W}_k \mathbf{x}_t)$, and $\mathbf{v}_t = \text{ShortConv}(\mathbf{W}_v \mathbf{x}_t)$, where $\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v \in \mathbb{R}^{d \times d}$ are the query, key, and value projection matrices, respectively. Under the $\mu$P initialization of hidden weights (yang2022tensor), the products $\mathbf{W}_q \mathbf{x}_t$, $\mathbf{W}_k \mathbf{x}_t$, $\mathbf{W}_v \mathbf{x}_t$ each have $\Theta(1)$ coordinate size; the short convolution preserves this order, so $\tilde{\mathbf{q}}_t$, $\tilde{\mathbf{k}}_t$ and $\mathbf{v}_t$ each have $\Theta(1)$ coordinate size. The L2-normalized query and key are

$$\mathbf{q}_t = \frac{\tilde{\mathbf{q}}_t}{\|\tilde{\mathbf{q}}_t\|_{2}}, \qquad \mathbf{k}_t = \frac{\tilde{\mathbf{k}}_t}{\|\tilde{\mathbf{k}}_t\|_{2}}.$$

Since $\|\tilde{\mathbf{q}}_t\|_{2} = \Theta(\sqrt{d})$ (and likewise for $\tilde{\mathbf{k}}_t$), each coordinate of $\mathbf{q}_t$ and $\mathbf{k}_t$ is of order $\Theta(1)/\Theta(\sqrt{d}) = \Theta(1/\sqrt{d})$, i.e., $\mathbf{q}_t$ and $\mathbf{k}_t$ both have $\Theta(1/\sqrt{d})$ coordinate size.
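The effect of L2 normalization on coordinate size can be checked directly: a unit vector in $\mathbb{R}^{d}$ has root-mean-square entry exactly $1/\sqrt{d}$. In the following sketch the helper names and the Gaussian stand-in for $\tilde{\mathbf{q}}_t$ are assumptions for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rms(vec):
    """Root-mean-square entry size."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


rng = random.Random(0)
scaled_rms = {}
for d in (64, 1024, 16384):
    q_tilde = [rng.gauss(0.0, 1.0) for _ in range(d)]   # Theta(1) coordinates
    q = l2_normalize(q_tilde)
    # The RMS entry of a unit vector is 1/sqrt(d), so this product stays at 1.
    scaled_rms[d] = rms(q) * math.sqrt(d)
```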

### 4.2 Coordinate size of the latent state

The rank-one write update in ([1](https://arxiv.org/html/2606.04048#S3.E1)) is $\mathbf{U}_t = \beta_t \mathbf{v}_t \mathbf{k}_t^{\top}$. Since $\beta_t \in (0,1)$ is a bounded scalar, combining the $\Theta(1)$ coordinate size of $\mathbf{v}_t$ with the $\Theta(1/\sqrt{d})$ coordinate size of $\mathbf{k}_t$ shows that each entry of $\mathbf{U}_t$ satisfies

$$(\mathbf{U}_t)_{ij} = \beta_t (\mathbf{v}_t)_i (\mathbf{k}_t)_j = \Theta(1) \cdot \Theta\Big(\frac{1}{\sqrt{d}}\Big) = \Theta\Big(\frac{1}{\sqrt{d}}\Big).$$

For the cumulative latent state $\mathbf{S}_t$, we apply the argument of vankadara2024feature (the detailed derivations can be found in Appendix [A.1](https://arxiv.org/html/2606.04048#A1.SS1)): unless the write update $\mathbf{U}_t$ perfectly cancels the residual term in ([1](https://arxiv.org/html/2606.04048#S3.E1)) at every step $t$, the steady-state variance of $\mathbf{S}_t$ matches that of $\mathbf{U}_t$. More precisely, the spectral contraction factor of the map $\mathbf{S} \mapsto \mathbf{S}\,\alpha_t(\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^{\top})$ is at most $\alpha_t$ (it equals $\alpha_t(1 - \beta_t \|\mathbf{k}_t\|_{2}^{2})$ along the direction $\mathbf{k}_t$ and $\alpha_t$ on its orthogonal complement), which is strictly less than $1$ when $\alpha_t, \beta_t \in (0,1)$ are both bounded away from $0$ and $1$. We assume this condition holds throughout the analysis. Under this assumption the geometric sum of write updates converges and $\mathbf{S}_t$ has $\Theta(1/\sqrt{d})$ coordinate size.

### 4.3 Coordinate size of the readout

The output is $\mathbf{o}_t = \mathbf{S}_t \mathbf{q}_t$, so the $j$-th coordinate is

$$(\mathbf{o}_t)_j = \sum_{i=1}^{d} (\mathbf{S}_t)_{ji} (\mathbf{q}_t)_i.$$

Treating the entries of $\mathbf{S}_t$ and $\mathbf{q}_t$ as approximately independent and zero-mean, each with variance $\Theta(1/d)$, the variance of the sum is

$$\mathrm{Var}\bigl[(\mathbf{o}_t)_j\bigr] = \sum_{i=1}^{d} \mathrm{Var}\bigl[(\mathbf{S}_t)_{ji}\bigr] \cdot \mathrm{Var}\bigl[(\mathbf{q}_t)_i\bigr] = d \cdot \Theta\Big(\frac{1}{d}\Big) \cdot \Theta\Big(\frac{1}{d}\Big) = \Theta\Big(\frac{1}{d}\Big),$$

so $\mathbf{o}_t$ has $\Theta(1/\sqrt{d})$ coordinate size. Although the subsequent RMSNorm forces its output to $\Theta(1)$ coordinate size, this implicit rescaling would disrupt the gradient scaling required by $\mu$P. We therefore recommend inserting a $\sqrt{d}$-multiplier before RMSNorm so that the input to RMSNorm is already $\Theta(1)$:

$$\mathbf{o}_t \mapsto \sqrt{d}\,\mathbf{o}_t, \qquad \|\sqrt{d}\,\mathbf{o}_t\|_{2} = \Theta(\sqrt{d}).$$

An equivalent alternative is to replace the L2 normalization on $\mathbf{q}_t$ with an RMSNorm layer, which absorbs the same $\sqrt{d}$-factor. With either modification, the standard $\mu$P formulation applies to all projection weights other than those governing $\alpha_t$ and $\beta_t$.
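The readout estimate can be sanity-checked with a small simulation. This is a sketch under simplifying assumptions: the state is replaced by a random matrix with i.i.d. $\mathcal{N}(0, 1/d)$ entries (rather than an accumulated sum of rank-one writes), the query is a random unit vector, and the helper names and widths are hypothetical.

```python
import math
import random


def rms(vec):
    """Root-mean-square entry size."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


def readout_rms(d, rng):
    """RMS entry of o = S q for a random S with N(0, 1/d) entries and unit q."""
    q_tilde = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in q_tilde))
    q = [x / norm for x in q_tilde]
    sigma = 1.0 / math.sqrt(d)          # Theta(1/sqrt(d)) state entries
    # Each row of S is sampled on the fly and contracted with q.
    o = [sum(rng.gauss(0.0, sigma) * qi for qi in q) for _ in range(d)]
    return rms(o)


rng = random.Random(0)
widths = (128, 512)
raw = {d: readout_rms(d, rng) for d in widths}
# After the sqrt(d) multiplier the coordinate size should be width-independent.
rescaled = {d: math.sqrt(d) * raw[d] for d in widths}
```

In this toy setting the raw readout shrinks as the width grows, while the rescaled readout stays near a constant, consistent with placing the $\sqrt{d}$-multiplier before RMSNorm.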

### 4.4 First-order analysis of the gating scalars

The gating parameters are defined as

$$\beta_t = \sigma(\mathbf{W}_{\beta}\,\mathbf{x}_t) \in (0,1), \qquad \mathbf{W}_{\beta} \in \mathbb{R}^{1 \times d},$$

and

$$\alpha_t = e^{g_t} \in (0,1), \qquad g_t = -e^{a_{\log}} \ln\bigl(1 + e^{\mathbf{W}_{\alpha}\mathbf{x}_t + b}\bigr),$$

with $\mathbf{W}_{\alpha} \in \mathbb{R}^{1 \times d}$ a trainable weight row and $a_{\log}, b \in \mathbb{R}$ scalar parameters shared within each head. Because both $\alpha_t$ and $\beta_t$ are nonlinear transformations of a Gaussian-distributed pre-activation, they are not themselves Gaussian, and traditional $\mu$P theory does not directly apply.

For $\alpha_t$, since it is bounded in $(0,1)$, it is naturally $\Theta(1)$. Define $z_{\alpha,t} = \mathbf{W}_{\alpha}\mathbf{x}_t + b$. When $z_{\alpha,t} + a_{\log} \ll 0$, $|\partial\alpha_t/\partial z_{\alpha,t}| = \alpha_t e^{z_{\alpha,t} + a_{\log}}/(1 + e^{z_{\alpha,t}})$ is also $\Theta(1)$. Under the original initialization of yang2025gated, namely $a_{\log} \sim \mathrm{Uniform}(0,16)$ and $b = b_0 + \ln(1 - e^{-b_0})$ with $b_0 = 10^{2\epsilon_b - 3}$, $\epsilon_b \sim \mathrm{Uniform}(0,1)$, the gradient $|\partial\alpha_t/\partial z_{\alpha,t}| = \alpha_t \cdot e^{z_{\alpha,t} + a_{\log}}/(1 + e^{z_{\alpha,t}}) = e^{z_{\alpha,t} + a_{\log}}/(1 + e^{z_{\alpha,t}})^{e^{a_{\log}} + 1}$ is bounded by $1$.

###### Proof 4.1.

Denote $k = e^{a_{\log}} > 0$, $p = e^{z_{\alpha,t}} > 0$, and $f(k,p) = \log\bigl(\frac{kp}{(1+p)^{k+1}}\bigr) = \log(kp) - (k+1)\log(1+p)$. Then we have $\frac{\partial f}{\partial p} = \frac{1}{p} - \frac{1+k}{1+p} = \frac{1 - pk}{p(1+p)}$, which is positive for $p < 1/k$ and negative for $p > 1/k$, so $f(k,\cdot)$ is maximized at $p = 1/k$. Therefore, $\forall k > 0$, $f(k,p) \leq f(k,1/k) = \log\bigl(\frac{1}{(1+1/k)^{k+1}}\bigr) < 0$, and $|\partial\alpha_t/\partial z_{\alpha,t}| = e^{f(k,p)} < 1$.
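The bound in the proof can also be checked numerically. The sketch below evaluates $|\partial\alpha_t/\partial z_{\alpha,t}| = kp/(1+p)^{k+1}$ in log-space over a grid; the function names and the grid ranges are assumptions chosen for illustration (the $a_{\log}$ range follows the initialization quoted above).

```python
import math


def alpha_gate(z, a_log):
    """alpha = exp(-exp(a_log) * ln(1 + exp(z)))."""
    return math.exp(-math.exp(a_log) * math.log1p(math.exp(z)))


def abs_dalpha_dz(z, a_log):
    """|d alpha / d z| = k p / (1 + p)^(k + 1), with k = e^{a_log}, p = e^z."""
    k = math.exp(a_log)
    p = math.exp(z)
    # log(k p) - (k + 1) log(1 + p), exponentiated at the end for stability.
    return math.exp(a_log + z - (k + 1.0) * math.log1p(p))


# Largest derivative magnitude over a grid of (a_log, z) values.
grid_max = max(
    abs_dalpha_dz(0.5 * zi, float(a))
    for a in range(0, 17)
    for zi in range(-40, 41)
)

# Value at the maximizer p = 1/k (z = -a_log) for one choice of a_log.
a_log_example = 2.0
k_example = math.exp(a_log_example)
peak = abs_dalpha_dz(-a_log_example, a_log_example)
closed_form = (1.0 + 1.0 / k_example) ** (-(k_example + 1.0))
```

Here `peak` and `closed_form` agree up to rounding, matching the maximizer $p = 1/k$ used in the proof, and `grid_max` stays below $1$.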

For $\beta_t$, since $z_{\beta,t} := \mathbf{W}_{\beta}\mathbf{x}_t$ is $\Theta(1)$, $\beta_t = \sigma(z_{\beta,t}) \in (0,1)$ does not saturate, and therefore $\frac{\partial\beta_t}{\partial z_{\beta,t}} = \beta_t(1 - \beta_t) = \Theta(1)$.

Combining the two analyses, the $\Theta(1)$ coordinate-size behavior of both $\alpha_t$ and $\beta_t$ is maintained under $\mu$P initialization. Consequently, $\mathbf{W}_{\alpha}$ and $\mathbf{W}_{\beta}$ may be treated as hidden weights with initial variance $1/d$, while the scalar parameters $a_{\log}$ and $b$ are assigned constant (i.e., width-independent) initial variance.

## 5 The $\mu$P Analysis of Gated Delta Net under SGD

In this section, we derive the backward pass under the $\mu$P configuration and the resulting learning rates for each module. We focus on Gated Delta Net under SGD and postpone the analysis for AdamW to Appendix [B](https://arxiv.org/html/2606.04048#A2). The AdamW optimizer benefits from a key simplification: it normalizes the gradient by its coordinate-wise second moment, so the effective update magnitude is $\Theta(\eta)$ per coordinate regardless of the raw gradient scale. Consequently, all weight classes share the same $\Theta(1)$ update magnitude, and the learning-rate schedule needs only to account for the number of terms in the activation sum (i.e. the fan-in $n_{\ell-1}$).

Under plain SGD this normalization is absent\. The update rule is

$$W^{\ell} \leftarrow W^{\ell} - \eta^{\ell}\nabla_{W^{\ell}}\mathcal{L},$$

so the magnitude of the change $\Delta z^{\ell} = (\Delta W^{\ell})h^{\ell-1}$ is proportional to both the learning rate and the raw gradient magnitude. Different weights in Gated Delta Net receive different gradient magnitudes, so they require different learning-rate scalings to satisfy the $\mu$P feature-learning condition $\|\Delta h^{\ell}\|_{2} = \Theta(\sqrt{d})$.

### 5.1 Notation and assumptions

We retain all conventions and assumptions from Section [4](https://arxiv.org/html/2606.04048#S4): the hidden state $\mathbf{x}_t$ satisfies $\|\mathbf{x}_t\|_{2} = \Theta(\sqrt{d})$, all forward-pass estimates carry over unchanged, and SiLU activations are suppressed (see Remark [4.2](https://arxiv.org/html/2606.04048#S4.Thmtheorem2)). We write $\eta^{\ell}$ for the per-layer SGD learning rate.

###### Assumption 5.1 (Short effective memory and domination of direct gradient).

The loss gradient with respect to the latent state satisfies

$$\frac{\partial\mathcal{L}}{\partial\mathbf{S}_t} = \mathbf{g}_t\mathbf{q}_t^{\top} + \underbrace{\sum_{\tau=t+1}^{T}\Bigl(\prod_{s=t+1}^{\tau}\alpha_s\Bigr)\mathbf{g}_{\tau}\mathbf{q}_{\tau}^{\top}\Bigl(\prod_{s=t+1}^{\tau}\bigl(\mathbf{I} - \beta_s\mathbf{k}_s\mathbf{k}_s^{\top}\bigr)\Bigr)}_{\text{BPTT tail}},$$

where the first term is the direct contribution from the readout at step $t$ and the second term accumulates gradients from all future readouts through the recurrent state chain. Each BPTT (backpropagation-through-time) term is individually of the same $d$-order as the direct term: its $(a,b)$-entry has magnitude $\Theta(1/d)$, identical to $(\mathbf{g}_t)_a(\mathbf{q}_t)_b$. However, the BPTT sum contains up to $T - t$ terms.

We assume throughout this section that the effective memory length $L_{\mathrm{eff}} := \sum_{\tau=t}^{T}\prod_{s=t+1}^{\tau}\alpha_s = O(1)$, i.e. the gating values $\alpha_s$ decay the past state sufficiently fast so that $\bar{\alpha} := \mathbb{E}[\alpha_t]$ satisfies $(1 - \bar{\alpha})^{-1} = O(1)$. Under this assumption the BPTT tail is $O(1)$ times the direct term in the $\Theta(\cdot)$ sense, and all gradient estimates below use only the direct term $\partial\mathcal{L}/\partial\mathbf{S}_t = \mathbf{g}_t\mathbf{q}_t^{\top}$ without loss of $d$-scaling accuracy.

When the model is used in a long-context scenario with slow forgetting ($\alpha_t \to 1$), $L_{\mathrm{eff}}$ can grow as $O(T)$. In this case, the $\mu$P learning-rate prescriptions should be scaled down by $1/L_{\mathrm{eff}}$ accordingly. We defer the detailed derivations of the BPTT term to Appendix [A](https://arxiv.org/html/2606.04048#A1).
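The effective memory length defined above is straightforward to evaluate for a given gate sequence. The following sketch is illustrative: the function name and the constant-gate examples are assumptions, and real gates are data-dependent rather than constant.

```python
def effective_memory_length(alphas):
    """L_eff = sum_{tau=t}^{T} prod_{s=t+1}^{tau} alpha_s.

    `alphas` holds [alpha_{t+1}, ..., alpha_T]; the tau = t term is the
    empty product and contributes 1.
    """
    total = 1.0
    running = 1.0
    for a in alphas:
        running *= a
        total += running
    return total


# Fast forgetting: a constant gate of 0.5 gives roughly 1 / (1 - 0.5) = 2.
fast = effective_memory_length([0.5] * 50)
# Slow forgetting over the same horizon: the sum keeps growing with the length.
slow_short = effective_memory_length([0.999] * 50)
slow_long = effective_memory_length([0.999] * 1000)
# No forgetting: every term equals 1, so L_eff is T - t + 1.
no_decay = effective_memory_length([1.0] * 100)
```

For a constant gate the sum is geometric and bounded by $1/(1-\alpha)$, which is why gates bounded away from $1$ keep $L_{\mathrm{eff}} = O(1)$, whereas gates close to $1$ let it grow with the sequence length.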

### 5.2 Backward error at the readout

The readout is $\mathbf{o}_t = \mathbf{S}_t\mathbf{q}_t$, followed by a $\sqrt{d}$-multiplier and RMSNorm. The loss gradient at $\mathbf{o}_t$ satisfies

$$\mathbf{g}_t := \frac{\partial\mathcal{L}}{\partial\mathbf{o}_t} = \sqrt{d} \cdot \frac{\partial\mathcal{L}}{\partial(\sqrt{d}\,\mathbf{o}_t)}.$$

If all components outside Gated Delta Net follow the $\mu$P rules, then $\frac{\partial\mathcal{L}}{\partial(\sqrt{d}\,\mathbf{o}_t)}$ has $\Theta(1/d)$ coordinate size and $\mathbf{g}_t$ has $\Theta(1/\sqrt{d})$ coordinate size.

### 5.3 Gradient of query, key and value projections

#### Gradient of the value projection $\mathbf{W}_v$.

Recall from ([1](https://arxiv.org/html/2606.04048#S3.E1)) and ([2](https://arxiv.org/html/2606.04048#S3.E2)) that $\mathbf{S}_t$ depends on $\mathbf{v}_t$ through the rank-one write update $\beta_t\mathbf{v}_t\mathbf{k}_t^{\top}$. Under Assumption [5.1](https://arxiv.org/html/2606.04048#S5.SS1), the gradient of $\mathcal{L}$ with respect to $(\mathbf{v}_t)_i$ receives a contribution from the direct readout at time $t$:

$$\frac{\partial\mathcal{L}}{\partial(\mathbf{v}_t)_i} = \beta_t\sum_{j}\frac{\partial\mathcal{L}}{\partial(\mathbf{S}_t)_{ij}} \cdot (\mathbf{k}_t)_j = \beta_t\sum_{j}(\mathbf{g}_t)_i(\mathbf{q}_t)_j \cdot (\mathbf{k}_t)_j = \beta_t\,(\mathbf{g}_t)_i\,\langle\mathbf{q}_t, \mathbf{k}_t\rangle, \tag{3}$$

where we used $\partial\mathcal{L}/\partial(\mathbf{S}_t)_{ij} = (\mathbf{g}_t)_i(\mathbf{q}_t)_j$ from the readout ([2](https://arxiv.org/html/2606.04048#S3.E2)) under the short-memory assumption (Assumption [5.1](https://arxiv.org/html/2606.04048#S5.SS1)). Note also that $\mathbf{v}_t$ does not propagate gradient back to earlier states $\mathbf{S}_{t'}$ with $t' < t$; those states depend on $\mathbf{v}_{t'}$, not $\mathbf{v}_t$. However, $\mathbf{S}_t$ itself contributes to future states $\mathbf{S}_{t''}$ for $t'' > t$, which is precisely the BPTT tail handled by Assumption [5.1](https://arxiv.org/html/2606.04048#S5.SS1).

Since\\qbt\\qb\_\{t\}and\\kbt\\kb\_\{t\}are L2\-normalized to unit vectors in\\RRd\\RR^\{d\}, with independent entries each of orderΘ​\(1/d\)\\Theta\(1/\\sqrt\{d\}\)at initialization, the inner product satisfies\|⟨\\qbt,\\kbt⟩\|=Θ​\(1d\)\\bigl\|\\langle\\qb\_\{t\},\\kb\_\{t\}\\rangle\\bigr\|=\\Theta\\big\(\\frac\{1\}\{\\sqrt\{d\}\}\\big\)\. In \([3](https://arxiv.org/html/2606.04048#S5.E3)\) the three factors are:βt=Θ​\(1\)\\beta\_\{t\}=\\Theta\(1\),\(\\gbt\)i=Θ​\(1/d\)\(\\gb\_\{t\}\)\_\{i\}=\\Theta\(1/\\sqrt\{d\}\)per coordinate, and⟨\\qbt,\\kbt⟩=Θ​\(1/d\)\\langle\\qb\_\{t\},\\kb\_\{t\}\\rangle=\\Theta\(1/\\sqrt\{d\}\)\. Hence:

$$\frac{\partial\mathcal{L}}{\partial(\mathbf{v}_t)_i}=\Theta(1)\cdot\Theta\Big(\frac{1}{\sqrt{d}}\Big)\cdot\Theta\Big(\frac{1}{\sqrt{d}}\Big)=\Theta\Big(\frac{1}{d}\Big).$$

By treating ShortConv as the identity for scale estimates, the gradient of the value projection $\mathbf{W}_v\in\mathbb{R}^{d\times d}$ is

$$\frac{\partial\mathcal{L}}{\partial(\mathbf{W}_v)_{ij}}=\frac{\partial\mathcal{L}}{\partial(\mathbf{v}_t)_i}\cdot(\mathbf{x}_t)_j=\Theta\Big(\frac{1}{d}\Big)\cdot\Theta(1)=\Theta\Big(\frac{1}{d}\Big).$$

The SGD update is $\Delta(\mathbf{W}_v)_{ij}=-\eta_v\Theta(1/d)$. The resulting change in $\mathbf{v}_t$ is

$$(\Delta\mathbf{v}_t)_i=\sum_{j=1}^{d}\Delta(\mathbf{W}_v)_{ij}(\mathbf{x}_t)_j=d\cdot\Theta\Big(\frac{\eta_v}{d}\Big)=\Theta(\eta_v).$$

The feature-learning condition requires $\Delta\mathbf{v}_t$ to have $\Theta(1)$ coordinate size; therefore,

$$\boxed{\eta_v=\Theta(1).}$$
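As a numerical sanity check on the two scale estimates used for the value projection, the following sketch samples independent random unit vectors and a single Gaussian input token. It is an illustration under simplifying assumptions, not the paper's code: the function names are hypothetical, the gradient coordinate is fixed to exactly $1/d$, and the inputs are i.i.d. standard Gaussians.

```python
import math
import random


def unit_vector(d, rng):
    """Sample a Gaussian vector in R^d and L2-normalize it."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def mean_abs_inner_product(d, trials=200, seed=0):
    """Average |<q, k>| over independent random unit vectors q and k."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        q = unit_vector(d, rng)
        k = unit_vector(d, rng)
        total += abs(sum(a * b for a, b in zip(q, k)))
    return total / trials


def value_update_size(d, eta_v, seed=0):
    """|(Delta v_t)_i| after one SGD step on W_v for a single token.

    Assumption: dL/d(v_t)_i equals 1/d exactly and x_t has Theta(1) entries.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    grad_v_i = 1.0 / d
    # Row i of the weight gradient is grad_v_i * x, so the SGD step gives
    # (Delta v)_i = -eta_v * grad_v_i * ||x||^2.
    delta_v_i = -eta_v * grad_v_i * sum(xj * xj for xj in x)
    return abs(delta_v_i)


if __name__ == "__main__":
    for d in (64, 256, 1024):
        scaled = mean_abs_inner_product(d) * math.sqrt(d)
        print(d, round(scaled, 3), round(value_update_size(d, eta_v=0.5), 3))
```

If the estimates hold, the first printed column (the inner product rescaled by $\sqrt{d}$) and the second (the update size at fixed $\eta_v$) should both stay roughly constant as $d$ grows.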

#### Gradient of the key projection\\Wbk\\Wb\_\{k\}\.

The key\\kbt=\\kb~t/‖\\kb~t‖2\\kb\_\{t\}=\\tilde\{\\kb\}\_\{t\}/\\\|\\tilde\{\\kb\}\_\{t\}\\\|\_\{2\}enters the latent\-state update both through the write term \(βt​\\vbt​\\kbt⊤\\beta\_\{t\}\\vb\_\{t\}\\kb\_\{t\}^\{\\top\}\) and the erase term \(αt​\\Sbbt−1​\(\\Ib−βt​\\kbt​\\kbt⊤\)\\alpha\_\{t\}\\Sbb\_\{t\-1\}\(\\Ib\-\\beta\_\{t\}\\kb\_\{t\}\\kb\_\{t\}^\{\\top\}\)\)\. The gradient ofℒ\\mathcal\{L\}with respect to the normalized\\kbt\\kb\_\{t\}from the write term alone is

\(∂ℒ∂\\kbt\)l\|write=βt​∑i∂ℒ∂\(\\Sbbt\)i​l⋅\(\\vbt\)i=βt​\(\\qbt\)l​∑i\(\\gbt\)i​\(\\vbt\)i=βt​\(\\qbt\)l​⟨\\gbt,\\vbt⟩\.\\displaystyle\\Big\(\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\kb\_\{t\}\}\\Big\)\_\{l\}\\Bigg\|\_\{\\text\{write\}\}=\\beta\_\{t\}\\sum\_\{i\}\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\(\\Sbb\_\{t\}\)\_\{il\}\}\\cdot\(\\vb\_\{t\}\)\_\{i\}=\\beta\_\{t\}\\,\(\\qb\_\{t\}\)\_\{l\}\\sum\_\{i\}\(\\gb\_\{t\}\)\_\{i\}\(\\vb\_\{t\}\)\_\{i\}=\\beta\_\{t\}\\,\(\\qb\_\{t\}\)\_\{l\}\\,\\langle\\gb\_\{t\},\\vb\_\{t\}\\rangle\.\(4\)Here\\vbt\\vb\_\{t\}hasΘ​\(1\)\\Theta\(1\)coordinate size while\\gbt\\gb\_\{t\}hasΘ​\(1/d\)\\Theta\(1/\\sqrt\{d\}\)coordinate size, so\|⟨\\gbt,\\vbt⟩\|=Θ​\(1\)\|\\langle\\gb\_\{t\},\\vb\_\{t\}\\rangle\|=\\Theta\(1\)\. Since\\qbt\\qb\_\{t\}and\\kbt\\kb\_\{t\}haveΘ​\(1/d\)\\Theta\(1/\\sqrt\{d\}\)coordinate size, and\\gbt\\gb\_\{t\}hasΘ​\(1/d\)\\Theta\(1/\\sqrt\{d\}\)coordinate size, according to \([4](https://arxiv.org/html/2606.04048#S5.E4)\), we have

\(∂ℒ∂\\kbt\)l\|write=Θ​\(1\)⋅Θ​\(1d\)⋅Θ​\(1\)=Θ​\(1d\)\.\\Big\(\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\kb\_\{t\}\}\\Big\)\_\{l\}\\Bigg\|\_\{\\text\{write\}\}=\\Theta\(1\)\\cdot\\Theta\\Big\(\\frac\{1\}\{\\sqrt\{d\}\}\\Big\)\\cdot\\Theta\(1\)=\\Theta\\Big\(\\frac\{1\}\{\\sqrt\{d\}\}\\Big\)\.We now compute the erase\-term contribution\. The erase term is\\Ebt=αt​\\Sbbt−1​\(\\Ib−βt​\\kbt​\\kbt⊤\)\\Eb\_\{t\}=\\alpha\_\{t\}\\Sbb\_\{t\-1\}\(\\Ib\-\\beta\_\{t\}\\kb\_\{t\}\\kb\_\{t\}^\{\\top\}\)\. Differentiating\(\\Sbbt\)a​l=\(\\Ebt\)a​l\+βt​\(\\vbt\)a​\(\\kbt\)l\(\\Sbb\_\{t\}\)\_\{al\}=\(\\Eb\_\{t\}\)\_\{al\}\+\\beta\_\{t\}\(\\vb\_\{t\}\)\_\{a\}\(\\kb\_\{t\}\)\_\{l\}with respect to\(\\kbt\)l\(\\kb\_\{t\}\)\_\{l\}gives two sub\-terms:

- •The column of\\Sbbt−1\\Sbb\_\{t\-1\}contracted with gradient: \(∂ℒ∂\\kbt\)l\|erase,1=−αt​βt​∑a∂ℒ∂\(\\Sbbt\)a​l​∑j\(\\Sbbt−1\)a​j​\(\\kbt\)j=−αt​βt​\(\\qbt\)l​⟨\\gbt,\\Sbbt−1​\\kbt⟩\.\\Big\(\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\kb\_\{t\}\}\\Big\)\_\{l\}\\Bigg\|\_\{\\text\{erase,1\}\}=\-\\alpha\_\{t\}\\beta\_\{t\}\\sum\_\{a\}\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\(\\Sbb\_\{t\}\)\_\{al\}\}\\sum\_\{j\}\(\\Sbb\_\{t\-1\}\)\_\{aj\}\(\\kb\_\{t\}\)\_\{j\}=\-\\alpha\_\{t\}\\beta\_\{t\}\(\\qb\_\{t\}\)\_\{l\}\\,\\langle\\gb\_\{t\},\\Sbb\_\{t\-1\}\\kb\_\{t\}\\rangle\.Since\(\\Sbbt−1​\\kbt\)i\(\\Sbb\_\{t\-1\}\\kb\_\{t\}\)\_\{i\}has varianced⋅Θ​\(1/d\)⋅Θ​\(1/d\)=Θ​\(1/d\)d\\cdot\\Theta\(1/d\)\\cdot\\Theta\(1/d\)=\\Theta\(1/d\), i\.e\.Θ​\(1/d\)\\Theta\(1/\\sqrt\{d\}\)per coordinate, the inner product⟨\\gbt,\\Sbbt−1​\\kbt⟩\\langle\\gb\_\{t\},\\Sbb\_\{t\-1\}\\kb\_\{t\}\\ranglehas varianced⋅Θ​\(1/d\)⋅Θ​\(1/d\)=Θ​\(1/d\)d\\cdot\\Theta\(1/d\)\\cdot\\Theta\(1/d\)=\\Theta\(1/d\), so it isΘ​\(1/d\)\\Theta\(1/\\sqrt\{d\}\)\. Combined with\(\\qbt\)l=Θ​\(1/d\)\(\\qb\_\{t\}\)\_\{l\}=\\Theta\(1/\\sqrt\{d\}\), this sub\-term isΘ​\(1/d\)\\Theta\(1/d\)\.
- •The diagonal of\\Sbbt−1\\Sbb\_\{t\-1\}contracted with gradient\. The second sub\-term arises from the product rule applied to\(\\Sbbt−1​\\kbt\)a=∑m\(\\Sbbt−1\)a​m​\(\\kbt\)m\(\\Sbb\_\{t\-1\}\\kb\_\{t\}\)\_\{a\}=\\sum\_\{m\}\(\\Sbb\_\{t\-1\}\)\_\{am\}\(\\kb\_\{t\}\)\_\{m\}; differentiating the factor\(\\kbt\)l\(\\kb\_\{t\}\)\_\{l\}inside the sum gives \(∂ℒ∂\\kbt\)l\|erase,2=−αt​βt​∑a,c∂ℒ∂\(\\Sbbt\)a​c⋅\(\\Sbbt−1\)a​l⋅\(\\kbt\)c=−αt​βt​⟨\\qbt,\\kbt⟩​\(\\Sbbt−1⊤​\\gbt\)l\.\\Big\(\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\kb\_\{t\}\}\\Big\)\_\{l\}\\Bigg\|\_\{\\text\{erase,2\}\}=\-\\alpha\_\{t\}\\beta\_\{t\}\\sum\_\{a,c\}\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\(\\Sbb\_\{t\}\)\_\{ac\}\}\\cdot\(\\Sbb\_\{t\-1\}\)\_\{al\}\\cdot\(\\kb\_\{t\}\)\_\{c\}=\-\\alpha\_\{t\}\\beta\_\{t\}\\langle\\qb\_\{t\},\\kb\_\{t\}\\rangle\\,\(\\Sbb\_\{t\-1\}^\{\\top\}\\gb\_\{t\}\)\_\{l\}\.Since\(\\Sbbt−1⊤​\\gbt\)l=∑a\(\\Sbbt−1\)a​l​\(\\gbt\)a\(\\Sbb\_\{t\-1\}^\{\\top\}\\gb\_\{t\}\)\_\{l\}=\\sum\_\{a\}\(\\Sbb\_\{t\-1\}\)\_\{al\}\(\\gb\_\{t\}\)\_\{a\}hasΘ​\(1/d\)\\Theta\(1/\\sqrt\{d\}\)coordinate size, and⟨\\qbt,\\kbt⟩=Θ​\(1/d\)\\langle\\qb\_\{t\},\\kb\_\{t\}\\rangle=\\Theta\(1/\\sqrt\{d\}\), this sub\-term isΘ​\(1/d\)\\Theta\(1/d\)per coordinate\.

Both erase sub-terms are at most $\Theta(1/d)$ per coordinate, smaller than the write term, which is $\Theta(1/\sqrt{d})$. The write term therefore dominates, so $\left\|\frac{\partial\mathcal{L}}{\partial\mathbf{k}_t}\right\|_2=\Theta(1)$, i.e., $\Theta\big(\frac{1}{\sqrt{d}}\big)$ coordinate size. Moreover, $\mathbf{k}_t$ affects $\mathbf{S}_t$ and, through the recurrent BPTT chain, all future states $\mathbf{S}_{t''}$ for $t''>t$. Each such future contribution has the same order in $d$ as the direct term above, and Assumption [5.1](https://arxiv.org/html/2606.04048#S5.SS1) ensures that their sum is $O(L_{\mathrm{eff}})$ times the direct term, which is an $O(1)$ factor under the short-memory assumption.

We now focus on the L2\-normalization\\kbt=\\kb~t/‖\\kb~t‖2\\kb\_\{t\}=\\tilde\{\\kb\}\_\{t\}/\\\|\\tilde\{\\kb\}\_\{t\}\\\|\_\{2\}\. The Jacobian of L2\-normalization is\\Jbk=\(\\Ib−\\kbt​\\kbt⊤\)/‖\\kb~t‖2\\Jb\_\{k\}=\(\\Ib\-\\kb\_\{t\}\\kb\_\{t\}^\{\\top\}\)/\\\|\\tilde\{\\kb\}\_\{t\}\\\|\_\{2\}, which projects onto the hyperplane orthogonal to\\kbt\\kb\_\{t\}and scales by1/‖\\kb~t‖2=Θ​\(1/d\)1/\\\|\\tilde\{\\kb\}\_\{t\}\\\|\_\{2\}=\\Theta\(1/\\sqrt\{d\}\)\. Since∂ℒ/∂\\kbt\\partial\\mathcal\{L\}/\\partial\\kb\_\{t\}and\\kbt\\kb\_\{t\}are approximately independent at initialization, the projection loses negligible magnitude:

$$\left\|\frac{\partial\mathcal{L}}{\partial\tilde{\mathbf{k}}_t}\right\|_2=\Theta\Big(\frac{1}{\sqrt{d}}\Big)\cdot\Theta(1)=\Theta\Big(\frac{1}{\sqrt{d}}\Big),$$

so each coordinate of $\partial\mathcal{L}/\partial\tilde{\mathbf{k}}_t$ is $\Theta(1/d)$. The gradient of $\mathbf{W}_k$ is therefore

$$\frac{\partial\mathcal{L}}{\partial(\mathbf{W}_k)_{ij}}=\frac{\partial\mathcal{L}}{\partial(\tilde{\mathbf{k}}_t)_i}\cdot(\mathbf{x}_t)_j=\Theta\Big(\frac{1}{d}\Big)\cdot\Theta(1)=\Theta\Big(\frac{1}{d}\Big),$$

and the same feature-learning analysis as for $\mathbf{W}_v$ gives

$$\boxed{\eta_k=\Theta(1).}$$
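The effect of the L2-normalization Jacobian on the gradient norm can be illustrated with a small sketch. It assumes a pre-normalized key with i.i.d. standard Gaussian ($\Theta(1)$) coordinates and an independent upstream gradient of norm $\Theta(1)$; the helper names are hypothetical and this is not the authors' implementation.

```python
import math
import random


def l2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def normalize_backward(k_tilde, grad_k):
    """Backpropagate grad_k = dL/dk through k = k_tilde / ||k_tilde||_2.

    Applies the symmetric Jacobian J = (I - k k^T) / ||k_tilde||_2.
    """
    norm = l2(k_tilde)
    k = [x / norm for x in k_tilde]
    radial = sum(a * b for a, b in zip(k, grad_k))  # component along k
    return [(g - radial * ki) / norm for g, ki in zip(grad_k, k)]


def backward_norm_ratio(d, seed=0):
    """Return ||dL/dk_tilde|| * sqrt(d) / ||dL/dk|| for independent draws."""
    rng = random.Random(seed)
    # Assumption: Theta(1) coordinates before normalization.
    k_tilde = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # Assumption: upstream gradient with Theta(1/sqrt(d)) coordinates.
    grad_k = [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
    grad_k_tilde = normalize_backward(k_tilde, grad_k)
    return l2(grad_k_tilde) * math.sqrt(d) / l2(grad_k)


if __name__ == "__main__":
    for d in (64, 256, 1024):
        print(d, round(backward_norm_ratio(d), 3))
```

A ratio close to 1 for every width is what the argument predicts: the projection removes only a negligible part of the gradient, and the remaining factor is the $1/\|\tilde{\mathbf{k}}_t\|_2=\Theta(1/\sqrt{d})$ rescaling.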

#### Gradient of the query projection\\Wbq\\Wb\_\{q\}\.

The gradient ofℒ\\mathcal\{L\}with respect to the normalized\\qbt\\qb\_\{t\}is

$$\frac{\partial\mathcal{L}}{\partial(\mathbf{q}_t)_j}=(\mathbf{S}_t^{\top}\mathbf{g}_t)_j=\sum_i(\mathbf{S}_t)_{ij}(\mathbf{g}_t)_i.$$

With $(\mathbf{S}_t)_{ij}=\Theta(1/\sqrt{d})$ and $(\mathbf{g}_t)_i=\Theta(1/\sqrt{d})$, each entry of $\mathbf{S}_t$ is approximately independent and zero-mean with variance $\Theta(1/d)$. Thus, the product $(\mathbf{S}_t)_{ij}(\mathbf{g}_t)_i$ has variance $\Theta(1/d^{2})$. Summing over $d$ terms gives $\mathrm{Var}[(\mathbf{S}_t^{\top}\mathbf{g}_t)_j]=d\cdot\Theta(1/d)\cdot\Theta(1/d)=\Theta(1/d)$, i.e., $\partial\mathcal{L}/\partial\mathbf{q}_t=\Theta(1/\sqrt{d})$ per coordinate. The L2-normalization Jacobian for $\mathbf{q}_t$ has the same spectral norm $\Theta(1/\sqrt{d})$ as for $\mathbf{k}_t$; therefore,

$$\frac{\partial\mathcal{L}}{\partial(\tilde{\mathbf{q}}_t)_i}=\Theta\Big(\frac{1}{d}\Big),\qquad\frac{\partial\mathcal{L}}{\partial(\mathbf{W}_q)_{ij}}=\Theta\Big(\frac{1}{d}\Big).$$

The feature-learning condition yields

$$\boxed{\eta_q=\Theta(1),}$$

confirming that all three $d\times d$ projection matrices share the same SGD learning rate scaling $\Theta(1)$.
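The variance count for the query gradient can be checked with a short Monte Carlo sketch. It assumes, as the derivation does at initialization, that the entries of $\mathbf{S}_t$ and $\mathbf{g}_t$ are independent and zero-mean with variance $1/d$; Gaussian entries and the function name are choices made only for this illustration.

```python
import math
import random


def query_grad_rms(d, trials=400, seed=0):
    """RMS of (S_t^T g_t)_j = sum_i (S_t)_ij (g_t)_i over random draws.

    Only column j of S_t enters this coordinate, so each trial samples
    that column and g_t with independent N(0, 1/d) entries.
    """
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(d)
    total_sq = 0.0
    for _ in range(trials):
        column = [rng.gauss(0.0, std) for _ in range(d)]
        g = [rng.gauss(0.0, std) for _ in range(d)]
        coord = sum(s * gi for s, gi in zip(column, g))
        total_sq += coord * coord
    return math.sqrt(total_sq / trials)


if __name__ == "__main__":
    for d in (64, 256):
        # Rescaling by sqrt(d) should give a value near 1 at every width.
        print(d, round(query_grad_rms(d) * math.sqrt(d), 3))
```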

### 5\.4Gradient of the Gating

We now show that the gating weights require a different learning rate scaling from the projection matrices, which is a distinctive feature of the SGD formulation absent in the AdamW case\.

#### Gating weight\\Wbα\\Wb\_\{\\alpha\}\.

Note thatαt\\alpha\_\{t\}affects not only\\Sbbt\\Sbb\_\{t\}but also all future states\\Sbbt′\\Sbb\_\{t^\{\\prime\}\}fort′\>tt^\{\\prime\}\>tthrough the recurrence chain \([1](https://arxiv.org/html/2606.04048#S3.E1)\)\. However, under Assumption[5\.1](https://arxiv.org/html/2606.04048#S5.SS1)those future contributions areO​\(Leff\)=O​\(1\)O\(L\_\{\\mathrm\{eff\}\}\)=O\(1\)times the direct contribution and do not change thedd\-scaling of the gradient\.

By the chain rule via the latent state𝐒t\\mathbf\{S\}\_\{t\}:

∂ℒ∂αt=Tr​\(\(∂ℒ∂𝐒t\)⊤​∂𝐒t∂αt\)\\displaystyle\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\alpha\_\{t\}\}=\\text\{Tr\}\\left\(\\left\(\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\mathbf\{S\}\_\{t\}\}\\right\)^\{\\top\}\\frac\{\\partial\\mathbf\{S\}\_\{t\}\}\{\\partial\\alpha\_\{t\}\}\\right\)Since∂ℒ∂𝐒t=𝐠t​𝐪t⊤\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\mathbf\{S\}\_\{t\}\}=\\mathbf\{g\}\_\{t\}\\mathbf\{q\}\_\{t\}^\{\\top\}and∂𝐒t∂αt=𝐒t−1​\(𝐈−βt​𝐤t​𝐤t⊤\)\\frac\{\\partial\\mathbf\{S\}\_\{t\}\}\{\\partial\\alpha\_\{t\}\}=\\mathbf\{S\}\_\{t\-1\}\(\\mathbf\{I\}\-\\beta\_\{t\}\\mathbf\{k\}\_\{t\}\\mathbf\{k\}\_\{t\}^\{\\top\}\), we have

∂ℒ∂αt=Tr​\(𝐪t​𝐠t⊤​𝐒t−1​\(𝐈−βt​𝐤t​𝐤t⊤\)\)=𝐠t⊤​𝐒t−1​𝐪t⏟Term 1−βt​\(𝐠t⊤​𝐒t−1​𝐤t\)​⟨𝐤t,𝐪t⟩⏟Term 2\.\\displaystyle\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\alpha\_\{t\}\}=\\text\{Tr\}\\left\(\\mathbf\{q\}\_\{t\}\\mathbf\{g\}\_\{t\}^\{\\top\}\\mathbf\{S\}\_\{t\-1\}\(\\mathbf\{I\}\-\\beta\_\{t\}\\mathbf\{k\}\_\{t\}\\mathbf\{k\}\_\{t\}^\{\\top\}\)\\right\)=\\underbrace\{\\mathbf\{g\}\_\{t\}^\{\\top\}\\mathbf\{S\}\_\{t\-1\}\\mathbf\{q\}\_\{t\}\}\_\{\\text\{Term 1\}\}\-\\underbrace\{\\beta\_\{t\}\(\\mathbf\{g\}\_\{t\}^\{\\top\}\\mathbf\{S\}\_\{t\-1\}\\mathbf\{k\}\_\{t\}\)\\langle\\mathbf\{k\}\_\{t\},\\mathbf\{q\}\_\{t\}\\rangle\}\_\{\\text\{Term 2\}\}\.
For Term 1, $(\mathbf{S}_{t-1}\mathbf{q}_t)_i=\sum_j(\mathbf{S}_{t-1})_{ij}(\mathbf{q}_t)_j$. Since the entries of $\mathbf{S}_{t-1}$ and $\mathbf{q}_t$ are zero-mean, independent, and of order $\Theta(1/\sqrt{d})$, the sum over $d$ terms has variance $d\cdot\Theta(1/d)\cdot\Theta(1/d)=\Theta(1/d)$, so $\mathbf{S}_{t-1}\mathbf{q}_t=\Theta(1/\sqrt{d})$ per coordinate. Since $\mathbf{g}_t$ is also $\Theta(1/\sqrt{d})$ per coordinate, the same counting applied to $\mathbf{g}_t^{\top}(\mathbf{S}_{t-1}\mathbf{q}_t)$ shows that Term 1 is $\Theta(1/\sqrt{d})$. For Term 2, we similarly have $\mathbf{g}_t^{\top}\mathbf{S}_{t-1}\mathbf{k}_t=\Theta(1/\sqrt{d})$; multiplying by $\langle\mathbf{k}_t,\mathbf{q}_t\rangle=\Theta(1/\sqrt{d})$ yields $\Theta(1/d)$. Term 1 therefore dominates, and

$$\frac{\partial\mathcal{L}}{\partial\alpha_t}=\Theta\left(\frac{1}{\sqrt{d}}\right).$$

Since $\partial\alpha_t/\partial z_{\alpha,t}=\Theta(1)$, we have

$$\frac{\partial\mathcal{L}}{\partial z_{\alpha,t}}=\frac{\partial\mathcal{L}}{\partial\alpha_t}\cdot\frac{\partial\alpha_t}{\partial z_{\alpha,t}}=\Theta\left(\frac{1}{\sqrt{d}}\right)\cdot\Theta(1)=\Theta\left(\frac{1}{\sqrt{d}}\right).\tag{5}$$

Then the gradient per entry of $\mathbf{W}_{\alpha}$ is

$$\frac{\partial\mathcal{L}}{\partial(\mathbf{W}_{\alpha})_j}=\frac{\partial\mathcal{L}}{\partial z_{\alpha,t}}\cdot(\mathbf{x}_t)_j=\Theta\Big(\frac{1}{\sqrt{d}}\Big)\cdot\Theta(1)=\Theta\Big(\frac{1}{\sqrt{d}}\Big).$$

To enable feature learning, the scalar pre-activation $z_{\alpha}$ must move by $\Theta(1)$ after one gradient step. The change in $z_{\alpha}$ is

$$\Delta z_{\alpha}=\Delta\mathbf{W}_{\alpha}\mathbf{x}_t=-\eta_{\alpha}\sum_{j=1}^{d}\frac{\partial\mathcal{L}}{\partial(\mathbf{W}_{\alpha})_j}(\mathbf{x}_t)_j=-\eta_{\alpha}\frac{\partial\mathcal{L}}{\partial z_{\alpha,t}}\sum_{j=1}^{d}(\mathbf{x}_t)_j^{2}.$$

Because $\sum_j(\mathbf{x}_t)_j^{2}=\|\mathbf{x}_t\|_2^{2}=\Theta(d)$, we have

$$\Delta z_{\alpha}=\eta_{\alpha}\cdot\Theta\left(\frac{1}{\sqrt{d}}\right)\cdot\Theta(d)=\eta_{\alpha}\,\Theta(\sqrt{d}).$$

For $\Delta z_{\alpha}=\Theta(1)$, the learning rate must be set to

$$\boxed{\eta_{\alpha}=\Theta\Big(\frac{1}{\sqrt{d}}\Big)=\Theta\Big(\frac{1}{\sqrt{n_{\ell-1}}}\Big).}$$
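The contrast with the projection matrices can be made concrete with a one-step SGD sketch for the row vector $\mathbf{W}_{\alpha}$. It assumes $\partial\mathcal{L}/\partial z_{\alpha,t}=1/\sqrt{d}$ exactly and i.i.d. standard Gaussian inputs; `gate_preactivation_shift` is a hypothetical helper written for this illustration.

```python
import math
import random


def gate_preactivation_shift(d, eta_alpha, seed=0):
    """|Delta z_alpha| after one SGD step on W_alpha for a single token.

    Assumption: dL/dz_alpha equals 1/sqrt(d) and x_t has Theta(1) entries.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    grad_z = 1.0 / math.sqrt(d)
    grad_w = [grad_z * xj for xj in x]            # dL/d(W_alpha)_j
    delta_w = [-eta_alpha * gw for gw in grad_w]  # plain SGD step
    return abs(sum(dw * xj for dw, xj in zip(delta_w, x)))


if __name__ == "__main__":
    base_lr = 0.1
    for d in (64, 256, 1024):
        fixed = gate_preactivation_shift(d, base_lr)
        scaled = gate_preactivation_shift(d, base_lr / math.sqrt(d))
        print(d, round(fixed, 3), round(scaled, 3))
```

With a width-independent learning rate the shift grows like $\sqrt{d}$, while the $1/\sqrt{d}$ scaling keeps it near `base_lr` at every width.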

#### Gating weight\\Wbβ\\Wb\_\{\\beta\}\.

Similarly,βt\\beta\_\{t\}affects all future states through the recurrence, but under Assumption[5\.1](https://arxiv.org/html/2606.04048#S5.SS1)we need only the direct contribution to\\Sbbt\\Sbb\_\{t\}\. The gradient ofβt\\beta\_\{t\}combines contributions from both the write term and the erase term of \([1](https://arxiv.org/html/2606.04048#S3.E1)\):

$$\frac{\partial\mathcal{L}}{\partial\beta_t}=\sum_{a,b}\frac{\partial\mathcal{L}}{\partial(\mathbf{S}_t)_{ab}}\cdot\frac{\partial(\mathbf{S}_t)_{ab}}{\partial\beta_t}=\underbrace{\langle\mathbf{g}_t,\mathbf{v}_t\rangle\langle\mathbf{q}_t,\mathbf{k}_t\rangle}_{\text{Write Term}}-\underbrace{\alpha_t\langle\mathbf{g}_t,\mathbf{S}_{t-1}\mathbf{k}_t\rangle\langle\mathbf{q}_t,\mathbf{k}_t\rangle}_{\text{Erase Term}}.$$

The write contribution comes from $\partial(\mathbf{S}_t)_{ab}/\partial\beta_t\big|_{\text{write}}=(\mathbf{v}_t)_a(\mathbf{k}_t)_b$; contracting with $\partial\mathcal{L}/\partial(\mathbf{S}_t)_{ab}=(\mathbf{g}_t)_a(\mathbf{q}_t)_b$ gives $\langle\mathbf{g}_t,\mathbf{v}_t\rangle\langle\mathbf{q}_t,\mathbf{k}_t\rangle=\Theta(1)\cdot\Theta(1/\sqrt{d})=\Theta(1/\sqrt{d})$. The erase contribution comes from $\partial(\mathbf{S}_t)_{ab}/\partial\beta_t\big|_{\text{erase}}=-\alpha_t\sum_c(\mathbf{S}_{t-1})_{ac}(\mathbf{k}_t)_c(\mathbf{k}_t)_b=-\alpha_t(\mathbf{S}_{t-1}\mathbf{k}_t)_a(\mathbf{k}_t)_b$, whose contraction has magnitude $\langle\mathbf{g}_t,\mathbf{S}_{t-1}\mathbf{k}_t\rangle\langle\mathbf{q}_t,\mathbf{k}_t\rangle=\Theta(1/\sqrt{d})\cdot\Theta(1/\sqrt{d})=\Theta(1/d)$. The write term therefore dominates and

$$\frac{\partial\mathcal{L}}{\partial\beta_t}=\Theta\Big(\frac{1}{\sqrt{d}}\Big).$$

For $z_{\beta,t}:=\mathbf{W}_{\beta}\mathbf{x}_t$, since $\frac{\partial\beta_t}{\partial z_{\beta,t}}=\Theta(1)$, the gradient of the pre-activation is

$$\frac{\partial\mathcal{L}}{\partial z_{\beta,t}}=\frac{\partial\beta_t}{\partial z_{\beta,t}}\cdot\frac{\partial\mathcal{L}}{\partial\beta_t}=\Theta(1)\cdot\Theta\Big(\frac{1}{\sqrt{d}}\Big)=\Theta\Big(\frac{1}{\sqrt{d}}\Big).$$

Finally, the gradient per entry of $\mathbf{W}_{\beta}$ is

$$\frac{\partial\mathcal{L}}{\partial(\mathbf{W}_{\beta})_j}=\frac{\partial\mathcal{L}}{\partial z_{\beta,t}}\cdot(\mathbf{x}_t)_j=\Theta\Big(\frac{1}{\sqrt{d}}\Big)\cdot\Theta(1)=\Theta\Big(\frac{1}{\sqrt{d}}\Big),$$

and the same feature-learning analysis as for $\mathbf{W}_{\alpha}$ gives

$$\boxed{\eta_{\beta}=\Theta\Big(\frac{1}{\sqrt{d}}\Big)=\Theta\Big(\frac{1}{\sqrt{n_{\ell-1}}}\Big).}$$

#### Scalar parameters\.

Bothaloga\_\{\\log\}andbbenter through the gating pre\-activations as additive scalars\. Using the exact chain rule onαt=egt\\alpha\_\{t\}=e^\{g\_\{t\}\},

∂αt∂alog=∂egt∂alog=egt​∂gt∂alog=αt​gt=Θ​\(1\),\\frac\{\\partial\\alpha\_\{t\}\}\{\\partial a\_\{\\log\}\}=\\frac\{\\partial e^\{g\_\{t\}\}\}\{\\partial a\_\{\\log\}\}=e^\{g\_\{t\}\}\\frac\{\\partial g\_\{t\}\}\{\\partial a\_\{\\log\}\}=\\alpha\_\{t\}g\_\{t\}=\\Theta\(1\),so∂ℒ/∂alog=\(∂ℒ/∂αt\)⋅Θ​\(1\)\\partial\\mathcal\{L\}/\\partial a\_\{\\log\}=\(\\partial\\mathcal\{L\}/\\partial\\alpha\_\{t\}\)\\cdot\\Theta\(1\)\. From the argument in \([5](https://arxiv.org/html/2606.04048#S5.E5)\),∂ℒ/∂αt=Θ​\(1/d\)\\partial\\mathcal\{L\}/\\partial\\alpha\_\{t\}=\\Theta\(1/\\sqrt\{d\}\)\. Hence∂ℒ/∂alog=Θ​\(1/d\)\\partial\\mathcal\{L\}/\\partial a\_\{\\log\}=\\Theta\(1/\\sqrt\{d\}\), and an SGD step changesaloga\_\{\\log\}byηscal⋅Θ​\(1/d\)\\eta\_\{\\mathrm\{scal\}\}\\cdot\\Theta\(1/\\sqrt\{d\}\)\. Sincealoga\_\{\\log\}entersαt\\alpha\_\{t\}multiplicatively, aΘ​\(1\)\\Theta\(1\)change inαt\\alpha\_\{t\}viaaloga\_\{\\log\}requiresηscal=Θ​\(d\)\\eta\_\{\\mathrm\{scal\}\}=\\Theta\(\\sqrt\{d\}\)\.

### 5\.5Summary

Under the $\mu$P assumptions, the learning rates for all other parameters can be set in the same way as in Table 8 of yang2022tensor. The complete $\mu$P formulation for Gated Delta Net with SGD is summarized in Table [1](https://arxiv.org/html/2606.04048#S5.T1).

Table 1: $\mu$P formulation of Gated Delta Net under SGD. Weights have shape $\mathbb{R}^{n_{\ell}\times n_{\ell-1}}$; for input weights and biases, $n_{\ell-1}=\Theta(1)$. All initialization variances and forward multipliers are identical to those in the AdamW table (Table [2](https://arxiv.org/html/2606.04048#A2.T2)); only the learning-rate row differs. The gray factors mark the differences between the original scaling law in yang2022tensor and our proposed formulation.
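The width exponents derived in Sections 5.3 and 5.4 can be collected into a small helper. This is a sketch rather than the authors' code: the function name, the group names, and the choice to tie all groups to a single learning rate at a base width of 256 are assumptions made for illustration, and parameter groups that follow Table 8 of yang2022tensor are not covered.

```python
import math


def sgd_mup_learning_rates(base_lr, width, base_width=256):
    """Per-group SGD learning rates implied by the derived width exponents.

    Exponents in d: q/k/v projections 0, gating weights -1/2,
    scalar gating parameters +1/2. All groups share base_lr at
    width == base_width (an assumption of this sketch).
    """
    ratio = width / base_width
    return {
        "qkv_projections": base_lr,                    # Theta(1)
        "gating_weights": base_lr / math.sqrt(ratio),  # Theta(1/sqrt(d))
        "gating_scalars": base_lr * math.sqrt(ratio),  # Theta(sqrt(d))
    }


if __name__ == "__main__":
    for width in (256, 512, 1024):
        print(width, sgd_mup_learning_rates(0.4, width))
```

In a training script, each entry would typically become the learning rate of one optimizer parameter group, so that a learning rate tuned at the base width can be reused at larger widths.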

## 6Experiments

### 6\.1Experiment details

We run LLM pre-training experiments to validate our $\mu$P derivation. All models use 8 layers and 6 attention heads. We test four model widths $d\in\{256,512,1024,1536\}$ for AdamW (loshchilov2017decoupled) and $d\in\{256,512,768,1024\}$ for SGD, which correspond to parameter counts ranging from approximately 21M to 342M (non-embedding).

#### Architectural parameters\.

We follow (yang2025gated) and its official repository for the implementation of GDN and re-implement it on the nanoGPT training framework (Karpathy2022). Specifically, we set the head dimension of queries and keys to $d/8$ and that of values to $d/4$, and we set the kernel size of the short convolutions in queries and keys to 4. Additionally, the intermediate size of the MLP is set to $4d$, and we tie the input and output embeddings.

#### Initialization and optimizer\.

For the base model with $d_0=256$, the embedding layer and all input projections are initialized with standard deviation 0.02, which also applies to larger models under SP. In contrast, we initialize larger models under the original $\mu$P and our proposed $\mu$P according to Tables [2](https://arxiv.org/html/2606.04048#A2.T2) and [1](https://arxiv.org/html/2606.04048#S5.T1). The scalar gating parameters $a_{\log}$ and $b$ follow the scheme of yang2025gated: $a_{\log}\sim\mathrm{Uniform}(0,16)$, and $b$ is set as described in Section [4](https://arxiv.org/html/2606.04048#S4). We apply gradient clipping at 1.0 and set the dropout ratio (srivastava2014dropout) to 0.0. The minimum learning rate is fixed to 5e-5 throughout. All runs use a cosine learning-rate schedule with 2,000 warmup steps.

For AdamW experiments, we use a weight decay of 0.1 and set $(\beta_1,\beta_2)=(0.9,0.95)$. For SGD experiments, we use SGD with Nesterov momentum (nesterov1983method) with a momentum of 0.98, since we observed substantial instability with the original SGD optimizer. For both optimizers, we use the same learning rate for all modules in models with $d_0=256$ and in all models under SP, and apply different learning rates according to the scaling laws in Tables [2](https://arxiv.org/html/2606.04048#A2.T2) and [1](https://arxiv.org/html/2606.04048#S5.T1) in the $\mu$P experiments.

For each width, we train models at 5 to 7 learning rates, log-spaced with increased density near the optimal learning rate. The learning-rate search grid ranges from 1e-3 to 2e-2 for AdamW and from 1e-1 to 1 for SGD. The training seed is fixed to 42.

#### Data and compute\.

We train on the FineWeb-Edu 100B dataset (lozhkov2024fineweb-edu) for 20k steps with a global batch size of 480 sequences and a sequence length of 1024 (approximately 9.83B tokens in total). All experiments use a single NVIDIA H100 80GB HBM3 GPU.
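The quoted token budget follows directly from the step count, batch size, and sequence length. The short check below reproduces the arithmetic; the function name is ours.

```python
def total_training_tokens(steps, batch_sequences, sequence_length):
    """Tokens processed over a run: steps x sequences per step x tokens per sequence."""
    return steps * batch_sequences * sequence_length


tokens = total_training_tokens(steps=20_000, batch_sequences=480, sequence_length=1024)
print(f"{tokens:,} tokens = {tokens / 1e9:.2f}B")
# prints "9,830,400,000 tokens = 9.83B"
```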

### 6\.2Experiment results

The final validation losses for models with different widths and peak learning rates under theμ\\muP and SP configurations are shown in Figures[3](https://arxiv.org/html/2606.04048#S6.F3)and[7](https://arxiv.org/html/2606.04048#S6.F7)for AdamW and SGD, respectively\. To remove the trivial width\-dependence of the absolute loss, we report*shifted*validation loss, defined as the difference from the optimal loss value among all the experiments on the models with the same width but different learning rates\.
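The shifted validation loss defined above is a per-width normalization. A minimal sketch follows; the function names are hypothetical and the loss values in the demo are made-up placeholders, not results from the paper.

```python
def shifted_validation_loss(final_losses):
    """Map {learning_rate: final_loss} for one width to {learning_rate: loss - best}."""
    best = min(final_losses.values())
    return {lr: loss - best for lr, loss in final_losses.items()}


def optimal_learning_rate(final_losses):
    """Learning rate with the lowest final loss (shifted loss 0)."""
    return min(final_losses, key=final_losses.get)


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    toy_sweep = {0.1: 3.5, 0.3: 3.25, 1.0: 3.75}
    print(shifted_validation_loss(toy_sweep))
    print(optimal_learning_rate(toy_sweep))
```

Learning-rate transfer then corresponds to `optimal_learning_rate` returning the same value for the sweeps at every width.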

For AdamW, the optimal learning rate is the same across all four model widths under $\mu$P, demonstrating zero-shot learning-rate transfer. Under SP, in contrast, the optimal learning rate shifts substantially with width, confirming that SP fails to support feature learning at scale. SGD experiments show the same qualitative pattern: under SP the optimal learning rate does not transfer across widths, and under the original $\mu$P configuration it still varies considerably, whereas under our $\mu$P configuration it transfers across all tested widths. These results indicate that our theory holds in practice.

![Figure 1: Standard Parametrization (SP)](https://arxiv.org/html/2606.04048v1/x1.png) ![Figure 2: μP configuration](https://arxiv.org/html/2606.04048v1/x2.png)

Figure 3: Shifted validation loss for Gated Delta Network trained with AdamW under varying peak learning rates and model widths. Panels: Figure 1, Standard Parametrization (SP); Figure 2, $\mu$P configuration.

![Figure 4: Standard Parametrization (SP)](https://arxiv.org/html/2606.04048v1/x3.png) ![Figure 5: Original μP configuration](https://arxiv.org/html/2606.04048v1/x4.png) ![Figure 6: Our μP configuration](https://arxiv.org/html/2606.04048v1/x5.png)

Figure 7: Shifted validation loss (loss minus the best-achieved loss at width $d=1024$) for Gated Delta Network trained with SGD under varying peak learning rates and model widths. Panels: Figure 4, Standard Parametrization (SP); Figure 5, original $\mu$P configuration; Figure 6, our $\mu$P configuration.

## 7Conclusion

We have derived a $\mu$P-style parametrization for Gated Delta Networks. Our analysis reveals that, under SGD, the learning-rate scalings of the gating weight matrices and the scalar gating parameters differ from the standard $\mu$P law. LLM pre-training experiments confirm that our $\mu$P formulation achieves zero-shot learning-rate transfer under both AdamW and SGD, while standard parametrization fails to transfer, empirically validating our theoretical derivation. We hope our derivations will inform further research on the scaling laws of other linear or hybrid architectures.

## Code Availability

The code for this paper is available at https://github.com/lauyikfung/gated_delta_net_mup.

## Acknowledgement

We thank the Fetch Compute program for its support of compute resources, Songlin Yang for discussion, and the Amazon Trainium scholarship project for its funding support. The code specific to Trainium chips is available at https://github.com/lauyikfung/Amazon_Trainium_Optimizer/tree/main/gdn_mup_code.

## References


## Appendix AAdditional derivations in the backward process for SGD

In this section, we provide a detailed analysis of the BPTT term.

### A\.1Derivation of the cumulative latent space

In Section [4.2](https://arxiv.org/html/2606.04048#S4.SS2), we use the argument of vankadara2024feature to directly conclude that $\mathbf{S}_t$ has $\Theta(1/\sqrt{d})$ coordinate size. Here we give the detailed derivation of this conclusion. The state update rule is:

𝐒t=αt​𝐒t−1−αt​βt​𝐒t−1​𝐤t​𝐤t⊤\+𝐔t,\\displaystyle\\mathbf\{S\}\_\{t\}=\\alpha\_\{t\}\\mathbf\{S\}\_\{t\-1\}\-\\alpha\_\{t\}\\beta\_\{t\}\\mathbf\{S\}\_\{t\-1\}\\mathbf\{k\}\_\{t\}\\mathbf\{k\}\_\{t\}^\{\\top\}\+\\mathbf\{U\}\_\{t\},where the write matrix is𝐔t=βt​𝐯t​𝐤t⊤\\mathbf\{U\}\_\{t\}=\\beta\_\{t\}\\mathbf\{v\}\_\{t\}\\mathbf\{k\}\_\{t\}^\{\\top\}\. From theμ\\muP initialization, we know that\(𝐯t\)i=Θ​\(1\)\(\\mathbf\{v\}\_\{t\}\)\_\{i\}=\\Theta\(1\),\(𝐤t\)j=Θ​\(1/d\)\(\\mathbf\{k\}\_\{t\}\)\_\{j\}=\\Theta\(1/\\sqrt\{d\}\)andαt,βt∈\(0,1\)\\alpha\_\{t\},\\beta\_\{t\}\\in\(0,1\)are bounded scalars, which can be treated asΘ​\(1\)\\Theta\(1\)independent variables\. And both\(𝐯t\)i\(\\mathbf\{v\}\_\{t\}\)\_\{i\}and\(𝐤t\)j\(\\mathbf\{k\}\_\{t\}\)\_\{j\}are with zero mean values\. Therefore,

𝔼​\[\(𝐔t\)i​j2\]=𝔼​\[βt2​\(𝐯t\)i2​\(𝐤t\)j2\]=Θ​\(1\)⋅Θ​\(1\)⋅Θ​\(1d\)=Θ​\(1d\)\.\\displaystyle\\mathbb\{E\}\[\(\\mathbf\{U\}\_\{t\}\)\_\{ij\}^\{2\}\]=\\mathbb\{E\}\[\\beta\_\{t\}^\{2\}\(\\mathbf\{v\}\_\{t\}\)\_\{i\}^\{2\}\(\\mathbf\{k\}\_\{t\}\)\_\{j\}^\{2\}\]=\\Theta\(1\)\\cdot\\Theta\(1\)\\cdot\\Theta\\left\(\\frac\{1\}\{d\}\\right\)=\\Theta\\left\(\\frac\{1\}\{d\}\\right\)\.Therefore,𝐔t\\mathbf\{U\}\_\{t\}has a coordinate size ofΘ​\(1/d\)\\Theta\(1/\\sqrt\{d\}\)\.

LetVt=𝔼​\[\(𝐒t\)i​j2\]V\_\{t\}=\\mathbb\{E\}\[\(\\mathbf\{S\}\_\{t\}\)\_\{ij\}^\{2\}\]be the element\-wise variance of the state matrix\. As a standard mean\-field assumption at initialization, we assume𝐒t−1\\mathbf\{S\}\_\{t\-1\}is statistically independent of the current inputs𝐯t\\mathbf\{v\}\_\{t\}and𝐤t\\mathbf\{k\}\_\{t\}\. Then,

\(𝐒t\)i​j=αt​\(𝐒t−1\)i​j−αt​βt​∑l\(𝐒t−1\)i​l​\(𝐤t\)l​\(𝐤t\)j\+\(𝐔t\)i​j\.\\displaystyle\(\\mathbf\{S\}\_\{t\}\)\_\{ij\}=\\alpha\_\{t\}\(\\mathbf\{S\}\_\{t\-1\}\)\_\{ij\}\-\\alpha\_\{t\}\\beta\_\{t\}\\sum\_\{l\}\(\\mathbf\{S\}\_\{t\-1\}\)\_\{il\}\(\\mathbf\{k\}\_\{t\}\)\_\{l\}\(\\mathbf\{k\}\_\{t\}\)\_\{j\}\+\(\\mathbf\{U\}\_\{t\}\)\_\{ij\}\.
Since𝐔t\\mathbf\{U\}\_\{t\}depends on𝐯t\\mathbf\{v\}\_\{t\}, which has zero mean and is independent of𝐒t−1\\mathbf\{S\}\_\{t\-1\}and𝐤t\\mathbf\{k\}\_\{t\}, the cross\-terms between𝐔t\\mathbf\{U\}\_\{t\}and the other terms vanish when we take the expected square\. Therefore,

Vt=𝔼​\[\(αt​\(𝐒t−1\)i​j−αt​βt​∑l\(𝐒t−1\)i​l​\(𝐤t\)l​\(𝐤t\)j\)2\]\+𝔼​\[\(𝐔t\)i​j2\]\.\\displaystyle V\_\{t\}=\\mathbb\{E\}\\left\[\\left\(\\alpha\_\{t\}\(\\mathbf\{S\}\_\{t\-1\}\)\_\{ij\}\-\\alpha\_\{t\}\\beta\_\{t\}\\sum\_\{l\}\(\\mathbf\{S\}\_\{t\-1\}\)\_\{il\}\(\\mathbf\{k\}\_\{t\}\)\_\{l\}\(\\mathbf\{k\}\_\{t\}\)\_\{j\}\\right\)^\{2\}\\right\]\+\\mathbb\{E\}\[\(\\mathbf\{U\}\_\{t\}\)\_\{ij\}^\{2\}\]\.For the squared expectation term, we have

𝔼​\[αt2​\(𝐒t−1\)i​j2\]=𝔼​\[αt2\]​Vt−1,\\displaystyle\\mathbb\{E\}\[\\alpha\_\{t\}^\{2\}\(\\mathbf\{S\}\_\{t\-1\}\)\_\{ij\}^\{2\}\]=\\mathbb\{E\}\[\\alpha\_\{t\}^\{2\}\]V\_\{t\-1\},and

−2​𝔼​\[αt2​βt​\(𝐒t−1\)i​j​∑l\(𝐒t−1\)i​l​\(𝐤t\)l​\(𝐤t\)j\]=−2​𝔼​\[αt2​βt\]​𝔼​\[\(𝐒t−1\)i​j2\]​1d=−2​𝔼​\[αt2​βt\]​Vt−1​1d,\\displaystyle\-2\\mathbb\{E\}\\left\[\\alpha\_\{t\}^\{2\}\\beta\_\{t\}\(\\mathbf\{S\}\_\{t\-1\}\)\_\{ij\}\\sum\_\{l\}\(\\mathbf\{S\}\_\{t\-1\}\)\_\{il\}\(\\mathbf\{k\}\_\{t\}\)\_\{l\}\(\\mathbf\{k\}\_\{t\}\)\_\{j\}\\right\]=\-2\\mathbb\{E\}\[\\alpha\_\{t\}^\{2\}\\beta\_\{t\}\]\\mathbb\{E\}\[\(\\mathbf\{S\}\_\{t\-1\}\)\_\{ij\}^\{2\}\]\\frac\{1\}\{d\}=\-2\\mathbb\{E\}\[\\alpha\_\{t\}^\{2\}\\beta\_\{t\}\]V\_\{t\-1\}\\frac\{1\}\{d\},where the second equality holds since𝐤t\\mathbf\{k\}\_\{t\}has independent zero\-mean entries, and then the expectation over𝐤t\\mathbf\{k\}\_\{t\}is zero unlessl=jl=j\. And for the remaining term, we have

𝔼​\[αt2​βt2​\(∑l\(𝐒t−1\)i​l​\(𝐤t\)l\)2​\(𝐤t\)j2\]\\displaystyle\\mathbb\{E\}\\left\[\\alpha\_\{t\}^\{2\}\\beta\_\{t\}^\{2\}\\left\(\\sum\_\{l\}\(\\mathbf\{S\}\_\{t\-1\}\)\_\{il\}\(\\mathbf\{k\}\_\{t\}\)\_\{l\}\\right\)^\{2\}\(\\mathbf\{k\}\_\{t\}\)\_\{j\}^\{2\}\\right\]=𝔼​\[αt2​βt2​\(∑l,m\(𝐒t−1\)i​l​\(𝐒t−1\)i​m​\(𝐤t\)l​\(𝐤t\)m\)​\(𝐤t\)j2\]\\displaystyle=\\mathbb\{E\}\\left\[\\alpha\_\{t\}^\{2\}\\beta\_\{t\}^\{2\}\\big\(\\sum\_\{l,m\}\(\\mathbf\{S\}\_\{t\-1\}\)\_\{il\}\(\\mathbf\{S\}\_\{t\-1\}\)\_\{im\}\(\\mathbf\{k\}\_\{t\}\)\_\{l\}\(\\mathbf\{k\}\_\{t\}\)\_\{m\}\\big\)\(\\mathbf\{k\}\_\{t\}\)\_\{j\}^\{2\}\\right\]=𝔼​\[αt2​βt2​\(∑l\(𝐒t−1\)i​l2​\(𝐤t\)l2\)​\(𝐤t\)j2\]\\displaystyle=\\mathbb\{E\}\[\\alpha\_\{t\}^\{2\}\\beta\_\{t\}^\{2\}\(\\sum\_\{l\}\(\\mathbf\{S\}\_\{t\-1\}\)\_\{il\}^\{2\}\(\\mathbf\{k\}\_\{t\}\)\_\{l\}^\{2\}\)\(\\mathbf\{k\}\_\{t\}\)\_\{j\}^\{2\}\]=𝔼​\[αt2​βt2\]​\(d⋅Vt−1⋅1d2\)\\displaystyle=\\mathbb\{E\}\[\\alpha\_\{t\}^\{2\}\\beta\_\{t\}^\{2\}\]\\left\(d\\cdot V\_\{t\-1\}\\cdot\\frac\{1\}\{d^\{2\}\}\\right\)=𝔼​\[αt2​βt2\]​Vt−1​1d,\\displaystyle=\\mathbb\{E\}\[\\alpha\_\{t\}^\{2\}\\beta\_\{t\}^\{2\}\]V\_\{t\-1\}\\frac\{1\}\{d\},where the second equality holds by the independence within the entries in\\kbt\\kb\_\{t\}\. Finally we have

$$\begin{aligned}
V_t &= \mathbb{E}[\alpha_t^2]\,V_{t-1} - \frac{2\,\mathbb{E}[\alpha_t^2\beta_t]-\mathbb{E}[\alpha_t^2\beta_t^2]}{d}\,V_{t-1} + \Theta\left(\frac{1}{d}\right)\\
&= V_{t-1}\cdot\underbrace{\mathbb{E}\left[\alpha_t^2\left(1-\frac{2\beta_t-\beta_t^2}{d}\right)\right]}_{\gamma} + \Theta\left(\frac{1}{d}\right).
\end{aligned}$$

Assume $\mathbb{E}[\alpha_t^2]=c^2<1$. Because $d$ is relatively large, the $(2\beta_t-\beta_t^2)/d$ term is small. Therefore, the contraction factor $\gamma$ is determined mainly by $\alpha_t$, and we have $\gamma\approx\mathbb{E}[\alpha_t^2]<1$. The variance $V_t$ then follows a convergent geometric recursion:

$$V_t = \gamma V_{t-1} + \Theta\left(\frac{1}{d}\right).$$

As $t\to\infty$, it converges to the sum of the geometric series:

$$V_\infty = \frac{\Theta(1/d)}{1-\gamma},$$

which is $\Theta(1/d)$ since $\gamma$ is $\Theta(1)$ and strictly less than $1$, so that $1-\gamma$ is also $\Theta(1)$. Therefore, the element-wise variance of the latent state $\mathbf{S}_t$ is $\Theta(1/d)$, and the coordinate size is $\Theta(1/\sqrt{d})$.
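The fixed point of this recursion can be checked numerically. The sketch below iterates $V_t=\gamma V_{t-1}+\text{drive}/d$ and compares it with the closed form $(\text{drive}/d)/(1-\gamma)$. The constant `drive` standing in for the $\Theta(1/d)$ term, the value $\gamma=0.81$, and the function names are illustrative assumptions, not quantities taken from the paper.

```python
def iterate_variance(gamma, d, steps, drive=1.0, v0=0.0):
    """Iterate V_t = gamma * V_{t-1} + drive / d for a fixed number of steps."""
    v = v0
    for _ in range(steps):
        v = gamma * v + drive / d
    return v


def limit_variance(gamma, d, drive=1.0):
    """Closed-form limit (drive / d) / (1 - gamma); requires 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1) for the series to converge")
    return (drive / d) / (1.0 - gamma)


if __name__ == "__main__":
    gamma = 0.81  # hypothetical E[alpha_t^2], i.e. c = 0.9
    for d in (256, 512, 1024):
        v_inf = iterate_variance(gamma, d, steps=500)
        # d * V_inf is the same constant at every width, so V_inf = Theta(1/d).
        print(d, v_inf, d * v_inf)
```

Multiplying the limit by $d$ gives the same value at every width, which is the $\Theta(1/d)$ statement in numerical form.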

## Appendix B Compatibility with AdamW

Table 2: $\mu$P formulation of Gated Delta Net under AdamW optimization. Weights have shape $\mathbb{R}^{n_\ell\times n_{\ell-1}}$; for input weights and biases, $n_{\ell-1}=\Theta(1)$. Under AdamW, Adam's coordinate-wise gradient normalization equalizes the effective update magnitude across all weight classes, so the gating weight matrices $\mathbf{W}_\alpha$ and $\mathbf{W}_\beta$ are subsumed into the "Hidden weights" column and require no special treatment.

Under the AdamW [loshchilov2017decoupled] optimizer, the parameter update at step $t$ is proportional to the coordinate-wise exponential moving average of the gradient, normalized by the square root of its second-moment estimate, so the effective coordinate-wise update magnitude is $\Theta(1)$ regardless of the gradient scale. Hence the $\mu$P learning-rate scaling is similar to that for standard architectures and needs no modification. The complete $\mu$P formulation for Gated Delta Net trained with AdamW is summarized in Table [2](https://arxiv.org/html/2606.04048#A2.T2). To validate this intuition rigorously, we derive the scaling rules of GDN optimized by Adam(W) in the following.

### B.1 Adam(W) in the Scale-Invariant Regime

Consider the pre-activation of the gating mechanism, $z_{\alpha,t}=\mathbf{W}_\alpha\mathbf{x}_t+b$. From our first-order analysis, the gradient of the loss with respect to this pre-activation is $\delta_{\alpha,t}:=\frac{\partial\mathcal{L}}{\partial z_{\alpha,t}}=\Theta(1/\sqrt{d})$.

For any parameter $\theta$ with gradient $g=\partial\mathcal{L}/\partial\theta$, the Adam(W) update rule normalizes the gradient by $\sqrt{v}$, the square root of its running second-moment estimate $v$. Since the gradients in our model are of order $\Theta(1/\sqrt{d})$, which strictly dominates the standard Adam(W) $\epsilon$ parameter (e.g., $10^{-8}$) for practically sized $d$, Adam(W) operates in a scale-invariant regime. The update effectively reduces to a coordinate-wise sign descent:

$$\Delta\theta_j = -\eta_\theta\frac{g_j}{\sqrt{v_j}+\epsilon} \approx -\eta_\theta\,\mathrm{sign}(g_j), \tag{6}$$

where $\eta_\theta$ is the learning rate assigned to parameter $\theta$.
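The scale-invariant regime can be illustrated with a small sketch of the standard bias-corrected Adam step for one scalar parameter. It assumes a constant gradient of size $1/\sqrt{d}$ and the common defaults $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$; the function name and the test values are hypothetical and only meant to show that the step size does not depend on the gradient scale unless the gradient falls below $\epsilon$.

```python
import math


def adam_update(grads, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the Adam parameter change at each step for one scalar parameter.

    grads is the sequence of gradients seen at steps 1, 2, ...; the usual
    bias correction of the first and second moments is applied.
    """
    m = v = 0.0
    deltas = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        deltas.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return deltas


if __name__ == "__main__":
    lr = 0.01
    for d in (256, 1024, 4096):
        g = 1.0 / math.sqrt(d)  # gradient of order 1/sqrt(d)
        # Each step is close to -lr * sign(g), independent of d.
        print(d, adam_update([g] * 3, lr))
    # A gradient far below eps leaves the sign-descent regime.
    print(adam_update([1e-12], lr))
```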

### B.2 Derivation for Main Projection Weights ($\mathbf{W}_q,\mathbf{W}_k,\mathbf{W}_v,\mathbf{W}_o$)

Let $\mathbf{W}\in\mathbb{R}^{d\times d}$ denote any of the main linear projection matrices, yielding a pre-activation $\mathbf{z}=\mathbf{W}\mathbf{x}_t\in\mathbb{R}^{d}$. Let $\delta_i=\frac{\partial\mathcal{L}}{\partial z_i}$ be the gradient arriving at the $i$-th coordinate of the output. The gradient with respect to the weight coordinate $W_{ij}$ is:

$$g_{ij}=\frac{\partial\mathcal{L}}{\partial W_{ij}}=\delta_i(\mathbf{x}_t)_j.$$

Applying the Adam update rule ([6](https://arxiv.org/html/2606.04048#A2.E6)), the parameter change is:

$$\Delta W_{ij}\approx-\eta_W\,\mathrm{sign}\big(\delta_i(\mathbf{x}_t)_j\big)=-\eta_W\,\mathrm{sign}(\delta_i)\,\mathrm{sign}\big((\mathbf{x}_t)_j\big).$$

We now compute the shift $\Delta z_i$ in the $i$-th coordinate of the feature caused by this update:

$$\begin{aligned}
\Delta z_i &= \sum_{j=1}^{d}\Delta W_{ij}(\mathbf{x}_t)_j\\
&= \sum_{j=1}^{d}\left(-\eta_W\,\mathrm{sign}(\delta_i)\,\mathrm{sign}\big((\mathbf{x}_t)_j\big)\right)(\mathbf{x}_t)_j\\
&= -\eta_W\,\mathrm{sign}(\delta_i)\sum_{j=1}^{d}\big|(\mathbf{x}_t)_j\big|.
\end{aligned}\tag{7}$$

Under the $\mu$P initialization, each coordinate of the hidden state $\mathbf{x}_t$ has magnitude $\Theta(1)$. Consequently, the sum of absolute values in ([7](https://arxiv.org/html/2606.04048#A2.E7)) (i.e., the $\ell_1$ norm of $\mathbf{x}_t$) scales as $\Theta(d)$. Thus, the magnitude of the feature shift is:

$$|\Delta z_i|=\eta_W\cdot\Theta(d).$$

To enforce the feature-learning requirement $|\Delta z_i|=\Theta(1)$, the learning rate for the main projection matrices must be scaled as:

$$\eta_W=\Theta\left(\frac{1}{d}\right).$$
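The identity in (7) and the resulting $1/d$ rule can be reproduced with a few lines of code. The sketch below applies one sign-descent step to a single row of a weight matrix and measures the induced feature shift. It assumes standard Gaussian inputs as a stand-in for $\Theta(1)$ hidden-state coordinates; `feature_shift`, `l1_norm`, and `base` are hypothetical names introduced for illustration.

```python
import random


def sign(x):
    """Sign of x as -1, 0 or 1."""
    return (x > 0) - (x < 0)


def l1_norm(x):
    return sum(abs(xj) for xj in x)


def feature_shift(x, delta_i, lr):
    """Change in z_i = sum_j W_ij x_j after one sign-descent step on row i."""
    # Delta W_ij = -lr * sign(delta_i) * sign(x_j)
    delta_w = [-lr * sign(delta_i) * sign(xj) for xj in x]
    return sum(dw * xj for dw, xj in zip(delta_w, x))


if __name__ == "__main__":
    rng = random.Random(0)
    base = 1.0  # hypothetical width-independent constant
    for d in (256, 1024, 4096):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # Theta(1) coordinates
        fixed = abs(feature_shift(x, delta_i=1.0, lr=base))
        scaled = abs(feature_shift(x, delta_i=1.0, lr=base / d))
        # fixed grows linearly with d; scaled stays roughly constant.
        print(d, fixed, scaled)
```

With a width-independent learning rate the shift grows in proportion to $d$, while the $1/d$ learning rate keeps it at a roughly constant value.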

### B.3 Derivation for Gating Weights ($\mathbf{W}_\alpha,\mathbf{W}_\beta$)

For the gating weight matrix $\mathbf{W}_\alpha\in\mathbb{R}^{1\times d}$ (the derivation for $\mathbf{W}_\beta$ is identical), the gradient with respect to its $j$-th coordinate is:

$$g_{\alpha,j}=\frac{\partial\mathcal{L}}{\partial(\mathbf{W}_\alpha)_j}=\delta_{\alpha,t}(\mathbf{x}_t)_j.$$

Substituting this into the Adam update rule ([6](https://arxiv.org/html/2606.04048#A2.E6)), we obtain the parameter update:

$$\Delta(\mathbf{W}_\alpha)_j=-\eta_\alpha\,\mathrm{sign}\big(\delta_{\alpha,t}(\mathbf{x}_t)_j\big)=-\eta_\alpha\,\mathrm{sign}(\delta_{\alpha,t})\,\mathrm{sign}\big((\mathbf{x}_t)_j\big).$$

To satisfy the fundamental $\mu$P feature-learning condition, a single optimization step must induce a $\Theta(1)$ shift in the pre-activation $z_{\alpha,t}$. We compute the shift $\Delta z_{\alpha,t}$ caused by $\Delta\mathbf{W}_\alpha$:

$$\begin{aligned}
\Delta z_{\alpha,t} &= \sum_{j=1}^{d}\Delta(\mathbf{W}_\alpha)_j(\mathbf{x}_t)_j\\
&= \sum_{j=1}^{d}\left(-\eta_\alpha\,\mathrm{sign}(\delta_{\alpha,t})\,\mathrm{sign}\big((\mathbf{x}_t)_j\big)\right)(\mathbf{x}_t)_j\\
&= -\eta_\alpha\,\mathrm{sign}(\delta_{\alpha,t})\sum_{j=1}^{d}\big|(\mathbf{x}_t)_j\big|.
\end{aligned}\tag{8}$$

Under the $\mu$P assumptions, each coordinate of the hidden state $\mathbf{x}_t$ has magnitude $\Theta(1)$. Consequently, the sum of absolute values in ([8](https://arxiv.org/html/2606.04048#A2.E8)) (i.e., the $\ell_1$ norm of $\mathbf{x}_t$) scales as $\Theta(d)$. Thus, the magnitude of the feature shift is:

$$|\Delta z_{\alpha,t}|=\eta_\alpha\cdot\Theta(d).$$

To enforce the feature-learning requirement $|\Delta z_{\alpha,t}|=\Theta(1)$, the learning rate must be scaled as:

$$\eta_\alpha=\Theta\left(\frac{1}{d}\right).$$

This demonstrates a core property of AdamW under $\mu$P: regardless of the output dimension of the projection, the learning-rate scaling is determined by the fan-in dimension ($d$). Therefore, both matrix types can be summarized as one category with a $1/n_{\ell-1}$ learning-rate multiplier, consistent with Table [2](https://arxiv.org/html/2606.04048#A2.T2).

### B.4 Derivation for Scalar Parameters ($a_{\log}$ and $b$)

We now analyze the scalar bias $b$ (and equivalently, the additive scalar $a_{\log}$). The gradient with respect to $b$ is simply the pre-activation gradient:

$$g_b=\frac{\partial\mathcal{L}}{\partial b}=\delta_{\alpha,t}=\Theta\left(\frac{1}{\sqrt{d}}\right).$$

Applying the scale-invariant Adam update, the parameter change is:

$$\Delta b\approx-\eta_b\,\mathrm{sign}(\delta_{\alpha,t}).$$

Because $b$ acts as a direct additive scalar to the pre-activation ($z_{\text{new}}=\mathbf{W}_\alpha\mathbf{x}_t+b_{\text{new}}$), the resulting shift in the feature is exactly the parameter update itself:

$$|\Delta z_{\alpha,t,\text{(from bias)}}|=|\Delta b|=\eta_b\cdot\Theta(1).$$

To achieve the requisite $\Theta(1)$ feature drift, the learning rate for the scalar must be independent of the width:

$$\eta_b=\Theta(1).$$
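Taken together, Sections B.2 to B.4 give two learning-rate classes under Adam(W): matrices whose fan-in is the hidden width $d$ scale as $1/d$, and additive scalars stay width-independent. A minimal sketch of that assignment follows. It assumes a learning rate `base_lr` tuned at a reference width `base_width`; both values, the `kind` labels, and the parameter names in the usage loop are hypothetical, and the rule for input weights from Table 2 is deliberately not covered.

```python
def adam_lr(kind, d, base_lr, base_width):
    """Learning rate for one parameter class at hidden width d.

    kind == "matrix": fan-in equals the hidden width (W_q, W_k, W_v, W_o,
                      and the 1 x d gating rows W_alpha, W_beta).
    kind == "scalar": additive scalars such as b and a_log.
    """
    if kind == "scalar":
        return base_lr  # Theta(1): independent of width
    if kind == "matrix":
        return base_lr * base_width / d  # Theta(1/d): fan-in rule
    raise ValueError(f"unknown parameter kind: {kind!r}")


if __name__ == "__main__":
    base_lr, base_width = 1e-2, 256  # illustrative values only
    params = {"W_q": "matrix", "W_alpha": "matrix", "b": "scalar", "a_log": "scalar"}
    for d in (256, 512, 1024):
        print(d, {name: adam_lr(kind, d, base_lr, base_width) for name, kind in params.items()})
```

In a training script these values would typically be passed as per-group learning rates to the optimizer, so that a rate tuned at the reference width carries over to wider models.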

![Refer to caption](https://arxiv.org/html/2606.04048v1/x6.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x7.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x8.png)

Figure 8: Standard Parametrization (SP). Figure 9: Original $\mu$P configuration. Figure 10: Our $\mu$P configuration.

Figure 11: The curves for $\mathrm{RMS}(\tilde{\mathbf{q}})/\sqrt{d_{\mathrm{head}}}$ for SP, the original $\mu$P configuration, and our $\mu$P configuration.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x9.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x10.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x11.png)

Figure 12: Standard Parametrization (SP). Figure 13: Original $\mu$P configuration. Figure 14: Our $\mu$P configuration.

Figure 15: The curves for $\mathrm{RMS}(\tilde{\mathbf{k}})/\sqrt{d_{\mathrm{head}}}$ for SP, the original $\mu$P configuration, and our $\mu$P configuration.

## Appendix C Dynamics analysis of $\mu$P for SGD

In Sections [4](https://arxiv.org/html/2606.04048#S4) and [5](https://arxiv.org/html/2606.04048#S5), we derived the $\mu$P formulation for the Gated Delta Network optimized by SGD by propagating coordinate-size estimates through the forward and backward passes. To corroborate these theoretical derivations, we embed diagnostic probes into our training framework. We track the internal activations, gradients, and parameter updates of models across different widths ($d\in\{256,512,768,1024\}$) during the pre-training phase, all trained with the same optimal learning rate of 0.4.

### C.1 Verification of Forward Pass Coordinate Sizes

In Section [4](https://arxiv.org/html/2606.04048#S4), we established that the query and key vectors before the L2-normalization layer, $\tilde{\mathbf{q}}_t$ and $\tilde{\mathbf{k}}_t$, have $\Theta(1/\sqrt{d})$ coordinate sizes. Figures [11](https://arxiv.org/html/2606.04048#A2.F11) and [15](https://arxiv.org/html/2606.04048#A2.F15) plot the empirical quantities $\tilde{\mathbf{q}}/\sqrt{d_{\mathrm{head}}}\times\sqrt{d}$ and $\tilde{\mathbf{k}}/\sqrt{d_{\mathrm{head}}}\times\sqrt{d}$ measured across the sampled layers, where we omit $d=768$ due to instability across all configurations.

As shown, the scaled quantities are approximately constant across widths under our $\mu$P configuration, which does not hold for SP or the original $\mu$P. This directly supports our derivations and validates that the recurrent state $\mathbf{S}_t$ operates under stable variance conditions.
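The idea behind such a probe can be sketched in a few lines: compute the RMS of an activation vector and rescale it by $\sqrt{d}$, so that a $\Theta(1/\sqrt{d})$ coordinate size shows up as a curve that is flat in $d$. This is a simplified stand-in that omits the per-head normalization by $\sqrt{d_{\mathrm{head}}}$ and uses synthetic Gaussian vectors rather than real activations; the function names are hypothetical.

```python
import math
import random


def rms(values):
    """Root-mean-square of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def scaled_rms(values, d):
    """RMS multiplied by sqrt(d); roughly width-independent when the
    coordinates are of size Theta(1/sqrt(d))."""
    return rms(values) * math.sqrt(d)


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (256, 512, 1024):
        # Synthetic activations with standard deviation 1/sqrt(d).
        q = [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(4096)]
        print(d, rms(q), scaled_rms(q, d))
```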

![Refer to caption](https://arxiv.org/html/2606.04048v1/x12.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x13.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x14.png)

Figure 16: Standard Parametrization (SP). Figure 17: Original $\mu$P configuration. Figure 18: Our $\mu$P configuration.

Figure 19: The curves for $\mathrm{RMS}(\partial\mathcal{L}/\partial\mathbf{h})\times d$ for SP, the original $\mu$P configuration, and our $\mu$P configuration, where $\mathbf{h}$ is the hidden state before each layer.

### C.2 Verification of Backward Pass Gradient Scaling

We also inspect gradient scaling in the backward pass. Figure [19](https://arxiv.org/html/2606.04048#A3.F19) shows that under our formulation and the original $\mu$P configuration, the gradient for the hidden states $\partial\mathcal{L}/\partial\mathbf{h}$ follows a $\Theta(1/d)$ coordinate size, whereas standard parametrization forces it to $\Theta(1/\sqrt{d})$, leading to scaling instability.

Moreover, in Figures [23](https://arxiv.org/html/2606.04048#A3.F23) and [27](https://arxiv.org/html/2606.04048#A3.F27), we observe that both $\partial\mathcal{L}/\partial\tilde{\mathbf{q}}$ and $\partial\mathcal{L}/\partial\tilde{\mathbf{k}}$ also follow a $\Theta(1/d)$ coordinate size in our configuration, while the original $\mu$P deviates from it, which makes training less stable and hyperparameter transfer imperfect.
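One simple way to read a scaling exponent off such measurements is a least-squares fit of $\log(\text{RMS})$ against $\log d$: a slope near $-1$ corresponds to $\Theta(1/d)$ and a slope near $-1/2$ to $\Theta(1/\sqrt{d})$. The sketch below is an illustrative assumption about how one might do this, not the procedure used for the figures, and the RMS values in the usage example are synthetic.

```python
import math


def scaling_exponent(widths, values):
    """Least-squares slope of log(value) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    widths = (256, 512, 768, 1024)
    like_one_over_d = [3.0 / w for w in widths]             # synthetic Theta(1/d) data
    like_inv_sqrt = [3.0 / math.sqrt(w) for w in widths]    # synthetic Theta(1/sqrt(d)) data
    print(scaling_exponent(widths, like_one_over_d))
    print(scaling_exponent(widths, like_inv_sqrt))
```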

![Refer to caption](https://arxiv.org/html/2606.04048v1/x15.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x16.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x17.png)

Figure 20: Standard Parametrization (SP). Figure 21: Original $\mu$P configuration. Figure 22: Our $\mu$P configuration.

Figure 23: The curves for $\mathrm{RMS}(\partial\mathcal{L}/\partial\tilde{\mathbf{q}})\times d$ for SP, the original $\mu$P configuration, and our $\mu$P configuration.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x18.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x19.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x20.png)

Figure 24: Standard Parametrization (SP). Figure 25: Original $\mu$P configuration. Figure 26: Our $\mu$P configuration.

Figure 27: The curves for $\mathrm{RMS}(\partial\mathcal{L}/\partial\tilde{\mathbf{k}})\times d$ for SP, the original $\mu$P configuration, and our $\mu$P configuration.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x21.png)

Figure 28: The average of $\beta_t$ and the ratio of strong-writing $\beta_t$ ($\beta_t>0.5$) for GDN models trained with SGD in our $\mu$P configuration with different widths.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x22.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x23.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x24.png)

Figure 29: Standard Parametrization (SP). Figure 30: Original $\mu$P configuration. Figure 31: Our $\mu$P configuration.

Figure 32: The curves for $\mathrm{RMS}(z_{\alpha,t})$ for SP, the original $\mu$P configuration, and our $\mu$P configuration.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x25.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x26.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x27.png)

Figure 33: Standard Parametrization (SP). Figure 34: Original $\mu$P configuration. Figure 35: Our $\mu$P configuration.

Figure 36: The curves for the average standard deviation of $\mathrm{RMS}(z_{\beta,t})$ for SP, the original $\mu$P configuration, and our $\mu$P configuration.

### C.3 Stability of the Gating Dynamics

Our theoretical approximations in Section [4](https://arxiv.org/html/2606.04048#S4) assume that the data-dependent gating scalars ($\alpha_t$ and $\beta_t$) do not saturate into trivial states (e.g., vanishing completely or remaining at exactly 1.0).

By logging the runtime statistics of the write strength $\beta_t=\sigma(\mathbf{W}_\beta\mathbf{x}_t)$ in Figure [28](https://arxiv.org/html/2606.04048#A3.F28), we observe that its expected value remains centered around 0.5 across all model widths, with a significant proportion of tokens actively triggering strong writes. Additionally, in Figures [32](https://arxiv.org/html/2606.04048#A3.F32) and [36](https://arxiv.org/html/2606.04048#A3.F36) ($d=768$ is omitted for $z_{\beta,t}$ because of instability across all configurations), the standard deviation of the pre-activations $z_{\alpha,t}=\mathbf{W}_\alpha\mathbf{x}_t+b$ and $z_{\beta,t}=\mathbf{W}_\beta\mathbf{x}_t$ remains $\Theta(1)$. These observations validate our first-order approximations and ensure that the recurrent memory updates effectively capture long-range dependencies without collapsing as the model scales.
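The two gate statistics discussed here, the mean of $\beta_t$ and the fraction of strong writes with $\beta_t>0.5$, are straightforward to compute from logged pre-activations. The sketch below assumes $\beta_t=\sigma(z_{\beta,t})$ and uses synthetic $\Theta(1)$ Gaussian pre-activations in place of real logs; `gate_stats` and its arguments are hypothetical names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate_stats(pre_activations, threshold=0.5):
    """Return (mean beta, fraction of tokens with beta > threshold)."""
    betas = [sigmoid(z) for z in pre_activations]
    mean_beta = sum(betas) / len(betas)
    strong_ratio = sum(b > threshold for b in betas) / len(betas)
    return mean_beta, strong_ratio


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic pre-activations with Theta(1) standard deviation.
    z_beta = [rng.gauss(0.0, 1.0) for _ in range(10000)]
    print(gate_stats(z_beta))
```

A mean far from 0.5 or a strong-write ratio near 0 or 1 would indicate that the gate is saturating, which is the failure mode the non-saturation assumption rules out.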

![Refer to caption](https://arxiv.org/html/2606.04048v1/x28.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x29.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x30.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x31.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x32.png)

Figure 37: $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ at $t=0$. Figure 38: $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ at $t=256$. Figure 39: $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ at $t=512$. Figure 40: $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ at $t=768$. Figure 41: $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ at $t=1023$.

Figure 42: The curves for $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ in our $\mu$P configuration for SGD at different $t$ values at the 4-th layer.

### C.4 Detecting the dynamics of state spaces

Finally, we depict the dynamics of the state space of GDN as $t$ increases. We plot $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ for a given input in our $\mu$P configuration at different $t$ values at the 4-th layer in Figure [42](https://arxiv.org/html/2606.04048#A3.F42). All the values are around 0, showing that the independence assumption on $\mathbf{q}_t$ and $\mathbf{k}_t$ still holds for GDN. During training, the value also converges, showing that in our configuration the model does converge to a stable state.
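For reference, the cosine probe itself is easy to state in code. The sketch below computes $\cos\langle\mathbf{q},\mathbf{k}\rangle$ and shows that for independent random vectors it concentrates near 0, with fluctuations of order $1/\sqrt{d}$. The Gaussian vectors are a synthetic assumption standing in for real queries and keys, and `cosine` is a hypothetical helper name.

```python
import math
import random


def cosine(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (256, 1024, 4096):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Independent vectors: cosine close to 0, typical size about 1/sqrt(d).
        print(d, cosine(q, k))
```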

## Appendix D Dynamic analysis of AdamW

In the previous section, we showed plots of various dynamic probes of the model trained with SGD. Here we provide additional analysis of the dynamics of Gated Delta Net trained with the AdamW optimizer across widths $d\in\{256,512,1024,1536\}$, all with the same optimal learning rate of $8\times10^{-3}$.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x33.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x34.png)

Figure 43: Standard Parametrization (SP). Figure 44: Original $\mu$P configuration.

Figure 45: The curves for $\mathrm{RMS}(\tilde{\mathbf{q}})/\sqrt{d_{\mathrm{head}}}$ for SP and the $\mu$P configuration with the AdamW optimizer.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x35.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x36.png)

Figure 46: Standard Parametrization (SP). Figure 47: $\mu$P configuration.

Figure 48: The curves for $\mathrm{RMS}(\tilde{\mathbf{k}})/\sqrt{d_{\mathrm{head}}}$ for SP and the $\mu$P configuration with the AdamW optimizer.

### D.1 Verification of Forward Pass Coordinate Sizes

In Section [4](https://arxiv.org/html/2606.04048#S4), we demonstrated that $\tilde{\mathbf{q}}_t$ and $\tilde{\mathbf{k}}_t$ have $\Theta(1/\sqrt{d})$ coordinate sizes. We display the empirical quantities $\tilde{\mathbf{q}}/\sqrt{d_{\mathrm{head}}}\times\sqrt{d}$ and $\tilde{\mathbf{k}}/\sqrt{d_{\mathrm{head}}}\times\sqrt{d}$ measured across the sampled layers in Figures [45](https://arxiv.org/html/2606.04048#A4.F45) and [48](https://arxiv.org/html/2606.04048#A4.F48).

We observe that the scaled quantities are approximately constant across widths under our $\mu$P configuration, which does not hold for SP. This directly supports our derivations and validates that the recurrent state $\mathbf{S}_t$ operates under stable variance conditions.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x37.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x38.png)

Figure 49: Standard Parametrization (SP). Figure 50: $\mu$P configuration.

Figure 51: The curves for $\mathrm{RMS}(\partial\mathcal{L}/\partial\mathbf{h})\times d$ for SP and the $\mu$P configuration with the AdamW optimizer, where $\mathbf{h}$ is the hidden state before each layer.

### D.2 Inspection of Backward Pass Gradient Scaling

We also inspect gradient scaling in the backward pass. Although the normalization within Adam(W) under the $\mu$P configuration ensures a $\Theta(1)$ update of the hidden state, here we instead inspect the stability of the gradient across different layers. In Figure [51](https://arxiv.org/html/2606.04048#A4.F51), we notice that under the $\mu$P configuration, the gradient for the hidden states $\partial\mathcal{L}/\partial\mathbf{h}$ is much more stable than under the SP configuration.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x39.png)

Figure 52: The average of $\beta_t$ and the ratio of strong-writing $\beta_t$ ($\beta_t>0.5$) for GDN models trained with AdamW in the $\mu$P configuration with different widths.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x40.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x41.png)

Figure 53: Standard Parametrization (SP). Figure 54: $\mu$P configuration.

Figure 55: The curves for $\mathrm{RMS}(z_{\alpha,t})$ for SP and the $\mu$P configuration with the AdamW optimizer.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x42.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x43.png)

Figure 56: Standard Parametrization (SP). Figure 57: $\mu$P configuration.

Figure 58: The curves for the average standard deviation of $\mathrm{RMS}(z_{\beta,t})$ for SP, the original $\mu$P configuration, and our $\mu$P configuration with the AdamW optimizer.

### D.3 Stability of the Gating Dynamics

In Section [4](https://arxiv.org/html/2606.04048#S4), we assume that the data-dependent gating scalars ($\alpha_t$ and $\beta_t$) do not saturate into trivial states. In Figure [52](https://arxiv.org/html/2606.04048#A4.F52), we observe that the expected value of $\beta_t$ remains around 0.5 across all model widths, again with a significant proportion of tokens actively triggering strong writes. Additionally, in Figures [55](https://arxiv.org/html/2606.04048#A4.F55) and [58](https://arxiv.org/html/2606.04048#A4.F58), the standard deviation of the pre-activations $z_{\alpha,t}=\mathbf{W}_\alpha\mathbf{x}_t+b$ and $z_{\beta,t}=\mathbf{W}_\beta\mathbf{x}_t$ remains $\Theta(1)$. These observations validate our first-order approximations and ensure that the recurrent memory updates effectively capture long-range dependencies without collapsing as the model scales under AdamW.

![Refer to caption](https://arxiv.org/html/2606.04048v1/x44.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x45.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x46.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x47.png) ![Refer to caption](https://arxiv.org/html/2606.04048v1/x48.png)

Figure 59: $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ at $t=0$. Figure 60: $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ at $t=256$. Figure 61: $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ at $t=512$. Figure 62: $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ at $t=768$. Figure 63: $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ at $t=1023$.

Figure 64: The curves for $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ in the $\mu$P configuration for AdamW at different $t$ values at the 4-th layer.

### D.4 Detecting the dynamics of state spaces

Finally, we depict the dynamics of the state space of GDN as $t$ increases. We plot $\cos\langle\mathbf{q}_t,\mathbf{k}_t\rangle$ for a given input in our $\mu$P configuration at different $t$ values at the 4-th layer in Figure [64](https://arxiv.org/html/2606.04048#A4.F64). As with SGD, all the values are around 0, showing that the independence assumption on $\mathbf{q}_t$ and $\mathbf{k}_t$ still holds for GDN optimized with AdamW. During training, the value also converges, showing that in our configuration the model does converge to a stable state.
