@maximelabonne: Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate (first screenshot, Kalra and Ba…
Summary
This paper introduces a framework to quantify hyperparameter transfer in LLMs and finds that the benefit of μP over SP in AdamW training largely comes from increasing the embedding layer learning rate. It also explores the impact of weight decay and other factors.
Cached at: 05/23/26, 02:08 PM
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate (first screenshot, Kalra and Barkeshli): https://arxiv.org/abs/2605.21486 Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size (Hayou and Liu): https://arxiv.org/abs/2506.15025
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
Source: https://arxiv.org/html/2605.21486
Abstract
Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update (μP), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why μP appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of μP relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match μP dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.
1 Department of Physics, University of Maryland, College Park
2 Department of Computer Science, University of Maryland, College Park
3 Joint Quantum Institute, University of Maryland, College Park
4 Meta Superintelligence Labs, Fundamental AI Research

## 1 Introduction
Training large neural networks requires carefully tuning numerous hyperparameters, including the learning rate, weight decay, and batch size, among others [28,7,38,30,34,2]. As models scale into trillions of parameters, the cost of hyperparameter tuning at scale becomes prohibitive. The solution to this is hyperparameter transfer: one finds the optimal hyperparameters at small scales, along with a scaling law, which can then be used to extrapolate optimal hyperparameters at large scales. There are two main approaches in practice. The first approach fits functional forms to predict how the optimal learning rate $\eta^{*}$ scales as the model and/or data is scaled [7,36,3]. While effective, these fitted relationships may not generalize beyond their fitted domain and themselves require expensive hyperparameter searches to obtain a robust functional form.
The second approach is more structural: it parameterizes the model such that the activations and their updates remain independent of width [48], keeping the training dynamics invariant across scales [6,46]. Yang and Hu [48] derived Maximal Update Parameterization (μP) under this desideratum, originally to study feature learning in the infinite width limit. Empirically, μP has a desirable practical property: it maintains fairly consistent optimal learning rates across width. This property, termed learning rate transfer, allows practitioners to utilize optimal learning rates from a smaller proxy model to train larger ones [47], reducing the need for expensive hyperparameter searches at scale. Several works have since extended this framework to depth scaling [5,49,9], alternative architectures [4,8,43,małaśnicki2025muparametrizationmixtureexperts,19], and optimizers [29,13,37], suggesting the underlying principle generalizes to the design space of modern networks.
The current practice raises several questions. First, after fitting a scaling law to the hyperparameters, they can always be reparameterized to be (approximately) scale-invariant. In this case, is there any distinction from μP and its variants? This suggests an urgent need to quantify the quality of hyperparameter transfer in order to compare across different schemes.
Secondly, the theoretical derivation of μP and its variants makes several assumptions that do not hold in practice. It assumes a finite number of training steps in the infinite width limit, whereas practical training runs are in the opposite limit, with a large number of training steps compared to width. It also assumes full alignment between weight updates and the activations [48], yet this alignment is low early in training and never reaches full alignment even late in training [10]. It does not take into account aspects of the learning rate schedule, like learning rate warmup, which are crucial to observe transfer, as we show in Appendix F. Finally, it is derived in the setting of fixed dataset size, whereas typical training runs scale data and number of parameters in tandem, often in the setting of a fixed number of tokens per parameter (TPP) [17]. Surprisingly, despite these assumptions not holding, μP still exhibits high-quality learning rate transfer, raising the question of why.
In this paper, we begin by developing a framework to quantify hyperparameter transfer. We develop three key metrics. The first is the quality of the scaling law fit itself. The second is a robustness metric: whether the choice of optimal hyperparameter is more robust to perturbations as the model scales. The third regards the asymptotic performance of the loss: it is possible that some parameterizations exhibit higher quality transfer at the cost of degradation in the absolute value of the loss, and this also needs to be taken into account. The three metrics can trade off against each other, as we will see.
Armed with a quantitative framework to measure the quality of hyperparameter transfer, we then investigate in more detail why μP appears to exhibit high-quality transfer. We show that in standard decoder-only Transformers [44] trained with AdamW [31], μP has four key changes compared to SP. By examining all 16 ablations, we isolate the embedding layer learning rate as the key factor. μP has a much larger embedding layer learning rate compared to SP. By making this one alteration to SP, the hyperparameter transfer quality of SP essentially matches that of μP. We further study training dynamics, finding that the embedding layer learning rate is important throughout training, especially early in training, and if it is not sufficiently large, there are training instabilities. Therefore, one interesting source of training instabilities is that the learning can be bottlenecked by the embedding layer learning rate.
These results highlight the fundamental importance of the embedding layer learning rate. It is known that the embedding layer learns a significant amount of structure from the data[32,16,42,24], and as such, perhaps it is not surprising that keeping it too low can severely throttle training, while maximizing it helps stabilize the learning process.
Our contributions. Our paper contains the following contributions:
- A quantitative framework for measuring the quality of hyperparameter transfer. This involves three metrics: (1) the quality of the scaling law fits in terms of an error $\mathcal{E}$, (2) a transfer robustness exponent $\kappa$ which measures sensitivity to errors in extrapolation of a hyperparameter from small to large scales, and (3) the asymptotic loss degradation $\mathcal{R}(\infty)$, which measures performance at scale relative to the optimal parameterization.
- Focusing on GPT pre-training with AdamW, we express SP and μP in a common form (Table 1). The two parameterizations differ in four key ways (SP $\to$ μP): (1) embedding learning rate $\Theta(1/n) \to \Theta(1)$, (2) last layer initialization variance $\Theta(1/n) \to \Theta(1/n^{2})$, (3) LayerNorm learning rate $\Theta(1/n) \to \Theta(1)$, (4) attention scaling $1/\sqrt{d} \to 1/d$.
- We examine all 16 ablations that distinguish μP from SP and isolate the embedding layer learning rate as the primary factor: SP with an appropriately scaled embedding layer learning rate essentially matches μP in quality of hyperparameter transfer.
- We find that the embedding layer learning rate, if not large enough, can cause training instabilities.
- We study the effect of weight decay to show that (1) it can improve the scaling law fit quality (smaller $\mathcal{E}$), and (2) in the fixed token per parameter setting, it hurts the robustness of the extrapolation.
2 Preliminaries: Neural Network Parameterizations
Table 1: Comparison of Standard (SP) and Maximal Update Parameterization (μP) for AdamW. Weight decay is scaled with the learning rate so that its contribution to the weight update, $\eta \cdot \lambda \cdot \theta$, remains $\Theta(1)$ across widths.

We consider a neural network $f_{\theta}$ with parameters $\theta$ trained by minimizing a loss $L(\theta)$ using the AdamW optimizer [31] with learning rate $\eta$ and weight decay $\lambda$. The performance of a model can be improved by scaling along different dimensions, such as width, depth, context length and data. A parameterization is then a set of rules that specify how key network and optimizer hyperparameters must be adjusted as the model is scaled to maintain stable training dynamics. In this work, we focus on the width $n$ as the scaling dimension and the learning rate $\eta$ as the hyperparameter to be scaled.
Following Yang and Hu [48], we parameterize the network and optimizer using four scalar exponents $\{a_l, b_l, c_l, d_l\}$ per layer $l$, which control different aspects of width scaling. The exponent $a_l$ controls the forward pass scaling $\mathbf{h}^{(l+1)} = n^{-a_l} W^{(l+1)} \phi(\mathbf{h}^{(l)})$, where $\mathbf{h}^{(l)}$ and $W^{(l)}$ denote the pre-activations and weights in layer $l$, and $\phi(\cdot)$ is the activation function. The exponent $b_l$ scales the initialization variance $W^{(l)} \sim \mathcal{N}(0, n^{-2b_l})$. The exponent $c_l$ scales the layer-wise learning rate $\eta^{(l)} = \eta \cdot n^{-c_l}$, where $\eta$ is the global learning rate. Finally, the exponent $d_l$ scales the weight decay strength $\lambda^{(l)} = \lambda \cdot n^{-d_l}$ such that the product $\eta^{(l)} \cdot \lambda^{(l)}$ is width-independent. For Transformers, the usual scaled dot-product attention comes with a per-head scaling by the head dimension $d$, with the standard choice being $1/\sqrt{d}$ [45].
Given this general setup, specific parameterizations are obtained by imposing stability conditions on the training dynamics. Two canonical examples of $abcd$ parameterization are SP [41] and μP [48]. SP requires that activations at initialization do not blow up or vanish as the width grows, imposing one constraint per layer; any additional freedom is used to make the simplest choices, such as a uniform learning rate for all layers. μP imposes a stronger condition: both the activations and their updates must be width-independent, resulting in two constraints per layer. For analytical tractability, these additional constraints are derived under three additional assumptions: (1) a finite number of training steps as the width $n \to \infty$, (2) full alignment between weight updates and the activations they act on [48], and (3) a fixed dataset as $n$ scales. We provide a self-contained derivation of μP for both SGD and Adam optimizers in Appendix H.
Since one of our goals is to understand which aspects of μP are essential for transfer, we need to express SP and μP in a common form. To this end, SP is defined with a weight initialization variance of $\Theta(1/\text{fan-in})$, and a global learning rate $\eta_l = \eta \cdot n^{-1}$ for all layers. Note that we choose a convention where we peel off a $1/n$ factor in the learning rate, which differs from other conventions. To make the comparison between SP and μP more transparent, we use the symmetry of $abcd$ parameterization [48] to set the multipliers $a = 0$ in our description of μP. As shown in Table 1, the two differ in four key ways: (1) embedding layer learning rate, (2) last layer initialization variance, (3) LayerNorm learning rate, and (4) attention scaling. Weight decay strength $\lambda$ is scaled with learning rate so that its contribution to the weight update, $\eta \lambda \theta$, is $\Theta(1)$ in both parameterizations.
The intuitive reason for these changes is as follows. SP applies a uniform $\Theta(1/n)$ learning rate to all layers. This scaling is needed for the hidden layers, which compute their output as a sum of $n$ input terms, $h^{(l+1)}_{i} = \sum_{j=1}^{n} W^{(l+1)}_{ij} h^{(l)}_{j}$; without a $\Theta(1/n)$ learning rate, the activation updates $\Delta h^{(l+1)}_{i}$ would scale as $\Theta(n)$. By comparison, the embedding (a per-token lookup) and LayerNorm (an elementwise operation) do not involve a sum over the width dimension, so a $\Theta(1)$ learning rate is a natural choice for these layers. Next, SP's larger last layer initialization variance leads to higher Hessian sharpness at initialization, making training unstable at large learning rates [22]. However, in practice, learning rate warmup mitigates this by gradually reducing sharpness to a level determined by the peak learning rate [21], making the initialization difference between SP and μP negligible. Finally, μP replaces SP's $1/\sqrt{d}$ attention scaling with $1/d$, motivated by the observation that key and query vectors are projections of the same input and therefore would be more aligned than independent random vectors during training. In this work, we show that the embedding layer learning rate is the primary driver of μP's advantage, with the remaining modifications contributing little.
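The learning-rate part of this comparison can be sketched in a few lines of Python. This is a minimal illustration under the convention stated above (hidden layers use $\eta \cdot n^{-1}$ in both parameterizations); the function name `layer_learning_rates`, its flags, and the group labels are hypothetical and not taken from the paper's code. Only the three parameter groups whose rates are stated in the text are covered.

```python
def layer_learning_rates(eta, n, fast_embedding=False, fast_layernorm=False):
    """Per-group learning rates at width n (illustrative names, not the paper's code).

    eta: global learning rate in the convention where a 1/n factor is peeled off.
    fast_embedding / fast_layernorm: use a Theta(1) rate (muP-style) instead of
    the Theta(1/n) rate that SP applies uniformly.
    """
    slow = eta / n  # Theta(1/n): needed for layers that sum over the width
    return {
        "hidden": slow,
        "embedding": eta if fast_embedding else slow,
        "layernorm": eta if fast_layernorm else slow,
    }


# SP: uniform Theta(1/n) rate for every group.
sp = layer_learning_rates(eta=2.0, n=4)
# muP: Theta(1) rate for the embedding and LayerNorm groups.
mup = layer_learning_rates(eta=2.0, n=4, fast_embedding=True, fast_layernorm=True)
```

Setting only `fast_embedding=True` gives SP with a $\Theta(1)$ embedding learning rate, in which the embedding rate is larger than the SP value by a factor of the width $n$.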
3 Quantifying Hyperparameter Transfer
In this section, we focus on the learning rate $\eta$ as the hyperparameter of interest and consider scaling the width $n$. In principle, the discussion can be extended to include additional hyperparameters (e.g., batch size, weight decay) and to scaling other quantities such as depth, context length, and training steps.
A parameterization is effective for learning rate transfer if the optimal learning rate can be reliably extrapolated from small to large widths without sacrificing performance at scale. Reliability requires two conditions. First, the loss at the end of training must be predictable, for example, by satisfying a simple functional scaling law form. Second, even if the loss is predictable, the transferred learning rate will inevitably carry some residual error (e.g., from finite sampling at the small scale). The loss at scale must be robust to such errors: if a small error in predicting the learning rate induces an increasingly large loss gap at larger widths, transfer becomes sensitive despite predictability in the loss. Accordingly, we introduce three metrics: Loss Predictability Error $\mathcal{E}$, Transfer Robustness Exponent $\kappa$, and Asymptotic Loss Degradation $\mathcal{R}(\infty)$, which capture loss predictability, robustness to prediction errors, and performance at scale, respectively.
To formalize these metrics, we first need to model how the loss landscape changes with width and learning rate. Following standard practice in neural scaling laws [23,17,1], we model the optimal loss $L^{*}(n)$ as a power law in width:

$$L^{*}(n) = L^{*}(\infty) + A n^{-\alpha}, \tag{1}$$

where $L^{*}(\infty)$ is the irreducible loss, $A$ is the scaling coefficient, and $\alpha$ is the scaling exponent. Here we can consider the dataset size to be held fixed or increasing along with $n$ as in compute-optimal training. We model the optimal learning rate using a scaling law with an irreducible term as well:

$$\eta^{*}(n) = \eta^{*}(\infty) + B' n^{-\beta}, \tag{2}$$

where $0 < \eta^{*}(\infty) < \infty$ is the asymptotic optimal learning rate, $B'$ is the scaling coefficient, and $\beta > 0$ controls the rate of convergence. Note that this differs from the form $\eta^{*}(n) \sim n^{-\beta'}$ used elsewhere [10,28], which implicitly assumes that the optimal learning rate vanishes ($\beta' > 0$) or diverges ($\beta' < 0$) at large widths. In Equation 2, we assume the dominant scaling law has been peeled off (as in our convention for SP), leaving the residual correction $B' n^{-\beta}$.


Figure 1: Computing the three transfer metrics for μP. (a) Loss vs. log learning rate $\nu$, with a star marking the optimum $\nu^{*}(n)$. (b) Joint fit of the loss model (Equation 6, dashed lines), with a low predictability error $\mathcal{E} = 0.0034$. (c) Loss curves in the normalized coordinates (Equation 8), with $\kappa = -2.640$ indicating robust transfer. (d-f) Scaling laws for optimal loss $L^{*}(n)$, optimal log-learning-rate $\nu^{*}(n)$, and curvature $H(n)$. In (d), the orange curve shows the best loss across parameterizations at each width, used for estimating the asymptotic loss gap $\mathcal{R}(\infty)$.

The scaling law above is expressed in terms of the learning rate $\eta$. However, the loss as a function of $\eta$ is typically asymmetric around its optimum $\eta^{*}$: beyond it training becomes unstable and the loss increases sharply, while below it the loss degrades gradually. We therefore work in log-learning-rate space $\nu := \log_{2} \eta$ for the rest of the paper, where the loss landscape is more symmetric around its optimum (see Figure 1(a)). Equation 2 then takes the form:

$$\nu^{*}(n) = \nu^{*}(\infty) + \log\left(1 + \frac{B' n^{-\beta}}{\eta^{*}(\infty)}\right) \approx \nu^{*}(\infty) + B n^{-\beta},$$

where $B = B'/\eta^{*}(\infty)$. Since Equation 2 captures the residual width dependence after the dominant scaling has been removed, the term $B' n^{-\beta}/\eta^{*}(\infty)$ is small, justifying the first-order Taylor expansion above.
The above scaling laws hold for each parameterization separately; different parameterizations may, for example, exhibit different irreducible loss. As such, a parameterization with high-quality transfer might sacrifice loss performance. This motivates the following definition.
Definition 3.1 (Loss Degradation $\mathcal{R}(\infty)$).
For a parameterization, let $L^{*}(\infty)$ denote its irreducible loss, and let $L^{*}_{\text{best}}(\infty)$ be the best possible irreducible loss across hyperparameters and parameterizations. We define loss degradation as the asymptotic loss gap:

$$\mathcal{R}(\infty) = L^{*}(\infty) - L^{*}_{\text{best}}(\infty) \geq 0. \tag{3}$$

A parameterization with $\mathcal{R}(\infty) = 0$ achieves best-in-class loss at scale, while $\mathcal{R}(\infty) > 0$ indicates a performance gap at scale. In practice, $L^{*}_{\text{best}}$ is computed using a finite set of parameterizations under consideration.
Next, to capture the reliability of transfer, we model the loss around $\nu^{*}(n)$ as a local quadratic form:

$$L(\nu; n) = L^{*}(n) + \frac{1}{2} H(n) \cdot (\nu - \nu^{*}(n))^{2} + \mathcal{O}(\nu^{3}), \tag{4}$$

where $H(n) = \nabla_{\nu}^{2} L(\nu^{*}(n); n)$ is the loss Hessian evaluated at $\nu^{*}(n)$. In addition to the loss and log-learning rate, we model the Hessian scaling as a power law (footnote 1: empirically, we find this form well approximates the Hessian scaling for many parameterizations we considered):

$$H(n) = C n^{\gamma}, \tag{5}$$

where $C$ is the scaling coefficient and $\gamma$ is the scaling exponent. Substituting the scaling laws into Equation 4, we obtain:

$$L(\nu; n) = L^{*}(\infty) + A n^{-\alpha} + \frac{1}{2} C n^{\gamma} \cdot \left(\nu - \nu^{*}(\infty) - B n^{-\beta}\right)^{2}. \tag{6}$$

This functional form serves as our scaling ansatz for learning rate transfer.
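As a concrete reading of Equation 6, the ansatz can be written as a small Python function. This is a sketch for illustration only: the function name `ansatz_loss` and the argument names are hypothetical, and the parameter values used in the example are made up rather than fitted values from the paper.

```python
def ansatz_loss(nu, n, L_inf, A, alpha, nu_inf, B, beta, C, gamma):
    """Loss ansatz of Equation 6 at log-learning-rate nu and width n."""
    L_star = L_inf + A * n ** (-alpha)     # Equation 1: optimal loss at width n
    nu_star = nu_inf + B * n ** (-beta)    # optimal log-learning-rate at width n
    H = C * n ** gamma                     # Equation 5: curvature at the optimum
    return L_star + 0.5 * H * (nu - nu_star) ** 2


# Made-up parameters, chosen only so the arithmetic is easy to follow.
params = dict(L_inf=2.0, A=1.0, alpha=0.5, nu_inf=-10.0, B=4.0, beta=1.0,
              C=2.0, gamma=0.0)
loss_at_optimum = ansatz_loss(nu=-9.0, n=4, **params)    # nu equals nu*(4) = -9
loss_off_optimum = ansatz_loss(nu=-8.0, n=4, **params)   # one unit of nu away
```

At the optimum the quadratic term vanishes and the function returns $L^{*}(n)$; away from it, the penalty grows with the squared distance in $\nu$, weighted by the curvature $H(n)$.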
Definition 3.2 (Loss Predictability Error $\mathcal{E}$).
Given loss observations $\{L(\nu_{i}; n_{j})\}$ at $N_{\nu}$ log-learning-rates $\{\nu_{i}\}_{i=1}^{N_{\nu}}$ and $N_{n}$ widths $\{n_{j}\}_{j=1}^{N_{n}}$, we define loss predictability as the normalized mean squared error between the observed loss and the predicted loss $\{\hat{L}(\nu_{i}; n_{j})\}$ from Equation 6 with fitted parameters:

$$\mathcal{E} = \frac{1}{N_{\nu} N_{n}} \sum_{i=1}^{N_{\nu}} \sum_{j=1}^{N_{n}} \left[L(\nu_{i}; n_{j}) - \hat{L}(\nu_{i}; n_{j})\right]^{2}. \tag{7}$$

When $\mathcal{E} \approx 0$, the loss is well captured by Equation 6, suggesting reliable extrapolation across widths is possible. High $\mathcal{E}$, in contrast, suggests a complex landscape arising from various sources, such as training instabilities, finite-size effects, or phase transitions, making an extrapolation unreliable.
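Equation 7 amounts to a mean of squared residuals over the grid of learning rates and widths. A minimal Python sketch follows; the function name `predictability_error` and the nested-list layout (one row per width, one entry per log-learning-rate) are assumptions made for illustration, not the paper's implementation.

```python
def predictability_error(observed, predicted):
    """Mean squared residual between observed and predicted losses (Equation 7).

    observed, predicted: lists of rows, one row per width, each row holding
    the losses at the swept log-learning-rates. Rows may differ in length,
    for example after filtering out diverged runs.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must cover the same widths")
    total, count = 0.0, 0
    for obs_row, pred_row in zip(observed, predicted):
        if len(obs_row) != len(pred_row):
            raise ValueError("each width needs matching observed/predicted rows")
        for obs, pred in zip(obs_row, pred_row):
            total += (obs - pred) ** 2
            count += 1
    if count == 0:
        raise ValueError("no loss observations given")
    return total / count
```

A perfect fit gives an error of zero, and any run whose loss deviates from the fitted surface, for example because of a training instability, raises the error quadratically in the size of the deviation.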
While $\mathcal{E}$ captures the predictability of the landscape, it does not reveal how errors in extrapolating the optimal learning rate impact performance at large scales. To analyze this sensitivity, we normalize both loss and log-learning-rate, focusing on the scale-invariant landscape defined by the normalized coordinates:

$$\tilde{L}(n) = \frac{L(\nu; n) - L^{*}(\infty)}{A n^{-\alpha}}, \qquad \tilde{\nu}(n) = \frac{\nu - \nu^{*}(\infty)}{B n^{-\beta}}. \tag{8}$$

In these normalized coordinates, Equation 6 becomes:
$$\tilde{L}(n) = 1 + \frac{C B^{2}}{2A} n^{\alpha - 2\beta + \gamma} \left(\tilde{\nu}(n) - 1\right)^{2}. \tag{9}$$

Here the factor $B^{2} n^{-2\beta}$ comes from writing $\nu - \nu^{*}(\infty) - B n^{-\beta} = B n^{-\beta} (\tilde{\nu}(n) - 1)$, and dividing by $A n^{-\alpha}$ contributes $n^{\alpha}/A$. The normalized loss curvature therefore scales as $n^{\kappa}$, where the exponent $\kappa = \alpha - 2\beta + \gamma$ determines whether the landscape flattens ($\kappa < 0$) or sharpens ($\kappa > 0$) as width increases. The sign of $\kappa$ thus determines the sensitivity of the loss to learning rate prediction errors at scale.
Definition 3.3 (Transfer Robustness Exponent $\kappa$).
We say that a parameterization under our ansatz (Equation 6) exhibits robust transfer if:

$$\kappa = \alpha - 2\beta + \gamma \leq 0. \tag{10}$$
The sign of $\kappa$ controls how prediction errors propagate to loss at scale. Negative $\kappa$ means the landscape flattens, so errors in extrapolating $\nu^{*}(n)$ from small widths result in diminishing loss penalties at scale. By comparison, positive $\kappa$ amplifies these errors and degrades transfer reliability, as the landscape sharpens with width. When $\gamma = 0$, the condition $\alpha \leq 2\beta$ coincides with the fast transfer criterion of Ghosh et al. [11].
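The robustness criterion is simple enough to check numerically. The Python sketch below computes $\kappa$ from the three fitted exponents and the factor by which the normalized curvature changes between two widths, which is $(n_{\text{large}}/n_{\text{small}})^{\kappa}$ because any width-independent prefactor cancels in the ratio. The function names and the example exponents are hypothetical.

```python
def robustness_exponent(alpha, beta, gamma):
    """Transfer robustness exponent kappa = alpha - 2*beta + gamma (Equation 10)."""
    return alpha - 2.0 * beta + gamma


def is_robust(kappa):
    """Robust transfer in the sense of Definition 3.3: kappa <= 0."""
    return kappa <= 0.0


def curvature_ratio(kappa, n_small, n_large):
    """Factor by which the normalized curvature changes from n_small to n_large."""
    return (n_large / n_small) ** kappa


# Made-up exponents for illustration (not fitted values from the paper).
kappa = robustness_exponent(alpha=0.5, beta=1.0, gamma=0.25)
shrink = curvature_ratio(kappa=-2.0, n_small=128, n_large=2048)
```

With $\kappa = -2$, going from width 128 to width 2048 shrinks the normalized curvature by a factor of $16^{2} = 256$, so the same normalized learning-rate error costs far less at the larger width.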
Taken together, these metrics capture three complementary axes of transfer. $\mathcal{R}(\infty)$ captures performance at scale, $\mathcal{E}$ quantifies whether the loss is predictable, and $\kappa$ measures whether prediction errors amplify or dampen at large widths. A parameterization can excel on one while failing on others, and as we show in later sections, the metrics can even be at odds with each other.
4 Examining μP and SP Through the Lens of the Three Transfer Metrics
Figure 2: *Embedding layer learning rate is the critical difference between SP and μP.* Loss vs. log learning rate $\nu$ across widths for: (a) SP with $\Theta(1/n)$ embedding learning rate; (b) SP modified to use $\Theta(1)$ embedding learning rate (SP+Embd); (c) μP modified to use $\Theta(1/n)$ embedding learning rate (μP-Embd). Speeding up the embedding in SP eliminates training instabilities and yields smooth, μP-like curves, while slowing it down in μP reintroduces SP-like instabilities. These interventions isolate the embedding learning rate as the primary driver of μP's advantage, and show that training it fast enough is critical for stable training.

We pre-train GPT-style Transformers on FineWeb-Edu [35] using AdamW for a fixed number of steps ($T = 10{,}000$) using a Warmup-Stable-Decay (WSD) schedule [18] (20% warmup, 60% stable, 20% decay) with batch size 1024 ($\approx 1$M tokens per step). We scale the embedding dimension (width) $n \in [128, 2048]$ by increasing the number of heads, while keeping the head dimension fixed at $d = 64$. For each width, we sweep the peak learning rate and weight decay strength $\lambda$. For full experimental details, see Appendix A. We consider compute-optimal (fixed TPP) training in Section 6.
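For readers unfamiliar with the Warmup-Stable-Decay schedule, the following Python sketch shows one way to realize the 20%/60%/20% split over $T = 10{,}000$ steps. The linear shapes of the warmup and decay phases, the decay target `final_lr`, and the function name `wsd_lr` are assumptions made here for illustration; the text above specifies only the phase fractions.

```python
def wsd_lr(step, peak_lr, total_steps=10_000, warmup_pct=20, decay_pct=20,
           final_lr=0.0):
    """Learning rate at a 0-indexed step under a Warmup-Stable-Decay schedule.

    Assumed shapes (not specified in the text): linear warmup to peak_lr,
    constant plateau, then linear decay towards final_lr.
    """
    warmup_steps = total_steps * warmup_pct // 100
    decay_steps = total_steps * decay_pct // 100
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / decay_steps
    return peak_lr + (final_lr - peak_lr) * progress
```

With the defaults, the rate reaches `peak_lr` at step 1,999, stays there through step 7,999, and falls to half of `peak_lr` at step 9,000.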
4.1 Methodology
Filtering and Interpolation. For each parameterization and width, we retain runs with loss within $1.35\times$ the per-width optimal loss to focus on the landscape around the minimum (Figure 1(a)). Since discrete sampling introduces noise in the estimated $\nu^{*}(n)$, we interpolate the resulting per-width curves using a cubic spline to obtain smoother, denser curves (see Figure 6(a), Appendix B), enabling reliable estimation of $\nu^{*}(n)$ and $H(n)$. For the optimal loss, we use the raw per-width minimum directly, as it is already reliable.
Estimating the Three Transfer Metrics. By fitting scaling laws to $L^{*}(n)$ and $\nu^{*}(n)$ (Figure 1(d, e)), we obtain the exponents $\alpha$, $\beta$ and the asymptotic loss $L^{*}(\infty)$, from which we compute $\mathcal{R}(\infty)$. Since $\mathcal{R}(\infty)$ is non-negative by definition (Definition 3.1), we clamp fitted values near zero when they are negative due to finite-size fitting artifacts. For the curvature $H(n)$, we first fit a quadratic centered at $\nu^{*}(n)$ to the per-width loss curves (see Figure 6(b) in Appendix B), and then fit the scaling law $H(n) = C n^{\gamma}$ (Figure 1(f)) to obtain the exponent $\gamma$. Combined with $\alpha$ and $\beta$ from above, this gives $\kappa = \alpha - 2\beta + \gamma$, whose sign can be directly read off from the normalized transfer curves (Figure 1(c)), with a flattening landscape indicating robust transfer. Finally, to compute the loss predictability error $\mathcal{E}$, we jointly fit all the parameters using the interpolated curves, then measure $\mathcal{E}$ using the fit evaluated on the raw filtered data points. Fitting on the interpolated data gives a smoother fit, while evaluating on the raw data measures the fit on the observed data. Throughout, we cap all scaling exponents at 2.0 to prevent spuriously large values, which would otherwise dominate comparisons across parameterizations.
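The two-stage curvature estimate described above can be sketched in plain Python. This is a simplified stand-in rather than the paper's fitting code (which is detailed in its Appendix B): the quadratic fit below fixes both the center $\nu^{*}(n)$ and the minimum $L^{*}(n)$ and solves for $H(n)$ in closed form, and the power law is fitted by ordinary least squares in log-log space. Function names are hypothetical.

```python
import math


def fit_curvature(nus, losses, nu_star, loss_star):
    """Least-squares H in L = loss_star + 0.5 * H * (nu - nu_star)**2.

    Assumes the center and minimum are already known, so H is the only
    free parameter and has the closed form 2 * sum(dL * x2) / sum(x2**2).
    """
    num, den = 0.0, 0.0
    for nu, loss in zip(nus, losses):
        x2 = (nu - nu_star) ** 2
        num += (loss - loss_star) * x2
        den += x2 * x2
    if den == 0.0:
        raise ValueError("need at least one point away from nu_star")
    return 2.0 * num / den


def fit_power_law(widths, values):
    """Fit values = C * width**gamma by linear regression on the logs."""
    xs = [math.log(n) for n in widths]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("need at least two distinct widths")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    C = math.exp(mean_y - gamma * mean_x)
    return C, gamma
```

In use, `fit_curvature` would be applied once per width to the filtered loss curve, and the resulting curvatures passed to `fit_power_law` together with the widths to obtain $C$ and $\gamma$. The log-log regression requires strictly positive curvatures.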
When $\nu^{*}(n)$ is nearly constant across widths, its scaling law admits two degenerate solutions: $\beta \to 0$ (constant) and $\beta \to \infty$ (rapid convergence). We develop a procedure to distinguish between these two cases, and prefer the $\beta \to \infty$ solution as it is consistent with our framework's convergence-based interpretation. We provide full details on the fitting procedures and degeneracy resolution in Appendix B.
4.2 What Exactly is μP's Advantage over SP?
To find the essential elements required for reliable transfer, we compare μP (Figure 1) and SP (Figure 2(a) and Figure 9 in Appendix C) using the three metrics. SP exhibits visibly noisier loss curves than μP due to training instabilities. Despite this, the two parameterizations are surprisingly similar on most metrics. The asymptotic loss gap $\mathcal{R}(\infty)$ of SP is slightly worse but still comparable to μP. For both parameterizations, $\nu^{*}(n)$ empirically converges to a finite asymptotic value. Finally, the normalized loss flattens for both parameterizations, with large negative robustness exponents, indicating transfer is robust in both cases. Where SP falls short is in the loss predictability: the loss vs. $\nu$ curves are visibly noisier due to training instabilities. As a result, the predictability error $\mathcal{E}$ is roughly $3\times$ larger than for μP, suggesting that the loss is poorly described by our ansatz. While SP can exhibit transfer in principle, training instabilities make it unreliable in practice.
4.3 A Step by Step Journey from SP to μP
Figure 3: *Transfer metrics for parameterizations interpolating between SP and μP.* Parameterizations with ‘+’ denote incremental changes from SP towards μP, while ‘−’ denotes changes from μP towards SP. Green and red regions indicate desirable and undesirable regimes, respectively. The orange arrow highlights SP+Embd (SP with $\Theta(1)$ embedding learning rate), which matches μP across all three metrics, suggesting the embedding layer learning rate is the primary driver of μP's advantage.

The results above suggest that SP has the right ingredients for reliable learning rate transfer, but is held back by training instability. Since SP and μP differ in only four ways (Section 2), a natural question is which change, if any, is most significant. To answer this, we perform systematic modifications, starting from SP and making one change at a time, sweeping the peak learning rate and weight decay across widths. We caution that these modifications interact non-linearly, so the effect of one change may depend on the scaling of other layers. Figure 3 shows the three transfer metrics for selected ablations and weight decay values. We defer the full results to Figure 18 in Appendix C and summarize the key findings here.
The embedding layer learning rate emerges as the most critical modification (Figure 2): training SP with a $\Theta(1)$ embedding layer learning rate (SP+Embd) matches μP across metrics, while training μP with a $\Theta(1/n)$ embedding layer learning rate (μP-Embd) degrades it. Attention scale has a more subtle effect: both adding $1/d$ scaling to SP (SP+Attn) and removing it from μP (μP-Attn) result in large positive $\kappa$, making transfer brittle. Increasing the LayerNorm learning rate to $\Theta(1)$ in SP worsens instability, while decreasing it to $\Theta(1/n)$ in μP has a negligible effect, suggesting that LayerNorm parameters can be trained slowly without hurting performance. Finally, the last layer initialization variance has a negligible effect, though μP with a $1/n$ initialization (μP-Last) exhibits instabilities at small widths. We leave a detailed understanding of these observations to future work.
5 The Importance of Embedding Layer Learning Rate
The critical role of the embedding layer learning rate in stabilizing SP's training is surprising. Since the embedding layer performs a width-independent lookup, one would expect its learning rate to scale as $\Theta(1)$. But what is unexpected is that a smaller learning rate results in noisy loss curves in SP (Figure 2), as one might expect later layers to compensate for a poorly trained embedding. (Footnote 2: These instabilities appear near the optimal learning rate; at smaller learning rates, SP trains stably but converges to worse final loss.) In this section, we dig deeper into why training the embedding layer fast enough is critical.
To better understand the role of training the embedding layer, we examine when it matters most by switching its learning rate at various points during training. We perform two experiments: in μP, we slow down the embedding learning rate from $\Theta(1)$ to $\Theta(1/n)$ at step $t_{\text{switch}}$, and in SP, we speed it up from $\Theta(1/n)$ to $\Theta(1)$. Together, these experiments reveal that a small embedding layer learning rate not only slows down training but also causes instabilities, with early training being the most critical. In the μP case (Figure 4a), switching to a $\Theta(1/n)$ embedding learning rate early in training drastically slows down training. While further training eventually closes much of this gap, a residual loss difference of $\sim 0.1$ to $0.2$ persists at the end of training, with a larger gap for earlier switches. By comparison, in the SP case (Figure 4b), switching to a $\Theta(1)$ embedding learning rate at an early stage simultaneously eliminates the training instabilities and improves performance. We additionally perform two complementary experiments to better understand the role of different layers. First, in Appendix G, we completely freeze the embedding layer at initialization, finding that this hurts both parameterizations, but μP much more than SP. This is surprising, as one might expect later layers to learn useful representations even with a randomly initialized embedding; however, the persistent gap indicates that they do not compensate for an untrained embedding. Second, we test whether the importance of training the first layer fast is specific to Transformers. In Appendix D, we show that for CNNs trained on CIFAR-100 [27], SP with a $\Theta(1)$ input-layer learning rate matches μP's transfer quality, while changing only the last-layer initialization has little effect.
Together, these results suggest that the first and last layers play a special role across architectures, likely because they sit at the network boundary with no upstream or downstream processing to compensate for poor training. Therefore, extra care should be taken in setting their hyperparameters.
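The switching experiments described above can be sketched as a step-dependent embedding learning rate. This is a minimal illustration, not the authors' implementation: the function name, the `base_width` normalization of the $\Theta(1/n)$ rate, and the mode labels are assumptions introduced here.

```python
def embedding_lr(step, t_switch, base_lr, width, base_width=128, mode="sp_speedup"):
    """Embedding learning rate at `step` for the two switch experiments.

    mode="mup_slowdown": Theta(1) before t_switch, Theta(1/n) afterwards.
    mode="sp_speedup":   Theta(1/n) before t_switch, Theta(1) afterwards.
    `base_width` is a hypothetical reference width fixing the constant
    in the Theta(1/n) scaling; the paper's constants may differ.
    """
    width_independent = base_lr                   # Theta(1) in width
    width_scaled = base_lr * base_width / width   # Theta(1/n) in width
    if mode == "mup_slowdown":
        return width_independent if step < t_switch else width_scaled
    if mode == "sp_speedup":
        return width_scaled if step < t_switch else width_independent
    raise ValueError(f"unknown mode: {mode}")
```

In a training loop, this value would be written into the embedding's optimizer parameter group at every step, while all other layers keep their usual schedule.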

Figure 4: (a) Reducing the embedding learning rate from $\Theta(1)$ to $\Theta(1/n)$ in $\mu$P at step $t_{\text{switch}}$ causes a persistent loss gap that grows larger for earlier switches (inset). (b) Increasing it from $\Theta(1/n)$ to $\Theta(1)$ in SP at step $t_{\text{switch}}$ eliminates training instabilities and improves performance over SP ($t_{\text{switch}}=10^{4}$).

A natural question is: why does the embedding layer learning rate matter so much while the last-layer initialization does not? First, the two are not on equal footing in $\mu$P's derivation. The $\Theta(1)$ embedding learning rate is required for the embedding's activation updates to be $\Theta(1)$. This first-layer update has only a single term contributing to it, so its scaling has to be correct. By contrast, the last-layer initialization contributes to only one of the three terms of the function update $\Delta f_{1}$, and its specific value depends on whether the last-layer weights are fully aligned with their activation updates during training (see Assumption H.6 in Appendix H). The constraint is therefore weaker and depends on an empirical alignment value that may not hold in practice; indeed, Everett et al. [10] showed that relaxing this assumption still yields transfer. Second, learning rate warmup steers training away from the higher sharpness induced by a larger last-layer initialization, into flatter regions of the landscape [21], compensating for the difference in initialization scale.
6 The Effect of Weight Decay and Compute Optimal Training
Most studies examining transfer focus on the fixed-step setting, often with little or no weight decay [47, 10]. While recent work has begun examining the effect of weight decay [26] and the compute-optimal scaling regime [2], their effect on transfer quality remains unclear. We find that weight decay improves the loss predictability error $\mathcal{E}$ but consistently hurts asymptotic performance in the fixed-step setting. By comparison, this performance penalty disappears in the compute-optimal scaling regime (20 tokens per parameter, following Hoffmann et al. [17]), but transfer robustness degrades with weight decay, likely because the appropriate weight decay scaling in this regime is not well understood.
Weight Decay. In the fixed-step setting, weight decay consistently hurts asymptotic performance: $\mathcal{R}(\infty)$ monotonically increases with $\lambda$ across parameterizations, from $\mathcal{R}(\infty)\approx 0.01$ to $\mathcal{R}(\infty)\approx 1$ at large $\lambda$ (Figure 5a). The effect on the loss predictability error $\mathcal{E}$ is more nuanced and depends on the baseline stability of the parameterization without weight decay (Figure 5b). For $\mu$P and SP+Embd, which already exhibit low $\mathcal{E}$ at $\lambda=0$, small weight decay further improves predictability before worsening at large $\lambda$. For SP and SP+LN, which suffer from training instabilities at $\lambda=0$, weight decay steadily reduces $\mathcal{E}$ but is never sufficient to match the stable parameterizations. Interestingly, at large $\lambda$, $\mathcal{E}$ converges to $\approx 0.01$ across parameterizations, suggesting that strong weight decay regularizes the landscape to a similar level regardless of parameterization. The improvement in the loss predictability error $\mathcal{E}$ with weight decay has a natural interpretation: weight decay regularizes the loss landscape, reducing its complexity and making it better captured by our loss ansatz (Equation 6). The transfer robustness exponent $\kappa$ remains largely negative and shows no clear dependence on $\lambda$ (Figure 5c).
Figure 5: *Effect of weight decay on the three transfer metrics in the fixed-step (top) and fixed TPP (bottom) settings.* In the fixed-step setting, weight decay improves the predictability error $\mathcal{E}$ at the cost of asymptotic performance $\mathcal{R}(\infty)$. In the TPP setting, stable parameterizations achieve near-zero $\mathcal{R}(\infty)$, and landscape predictability trends are similar, but robustness $\kappa$ degrades as weight decay increases.

Compute Optimal Training. In the compute-optimal setting, most parameterizations achieve near-zero $\mathcal{R}(\infty)$ (except for SP+LN and $\mu$P-Embd, which exhibit training instabilities), suggesting that the loss is not sensitive to the choice of parameterization. The predictability error $\mathcal{E}$ follows similar trends to the fixed-step setting, with minor differences. A notable difference from the fixed-step setting lies in $\kappa$: at $\lambda=0$, most parameterizations exhibit negative $\kappa$, but this robustness degrades sharply with increasing weight decay, with most parameterizations converging to $\kappa\approx 0$. A natural explanation is that $\mu$P assumes $\Theta(1)$ training steps relative to width, but in the compute-optimal regime, the number of steps $T$ scales as $n^{2}$, violating this assumption and making the current convention $\eta\cdot\lambda=\Theta(1)$ inadequate. In Appendix E, we find that scaling weight decay as $\eta\cdot\lambda=\Theta(1/n^{2})$ reduces the shift in optimal learning rate, but the shape of the loss curves around the minimum changes across widths. We leave a detailed analysis of the appropriate weight decay scaling in this regime to future work.
7 Related Works
Our work is closely related to several recent works on hyperparameter transfer [10, 11, 26, 2]. Kosson et al. [26] argue that weight decay stabilizes feature learning and is central to learning rate transfer. Our three transfer metrics provide complementary insights into the role of weight decay: its primary effect is to improve the loss predictability error $\mathcal{E}$, but this comes at the cost of increasing the asymptotic loss gap $\mathcal{R}(\infty)$ in the fixed-step setting. Furthermore, very stable parameterizations such as SP+Embd+Attn achieve reliable transfer even without weight decay, and weight decay alone is insufficient to stabilize unstable parameterizations such as SP+LN. The analysis of Bergsma et al. [2] implies that $\eta\cdot\lambda$ should scale as $\Theta(1/n^{2})$ in the compute-optimal setting. Our preliminary experiments in this regime suggest that additional scaling considerations may be needed to fully resolve the transfer robustness degradation we observe, leaving the appropriate weight decay scaling as an open question.
8 Discussion and Conclusion
In this work, we introduced a quantitative framework for evaluating hyperparameter transfer, with three metrics $\mathcal{R}(\infty)$, $\mathcal{E}$, and $\kappa$ that together serve as a diagnostic lens for identifying what a given transfer setup is lacking. Our framework generalizes beyond learning rate transfer with width and can be applied to any hyperparameter and scaling dimension. For instance, it can help determine which depth scaling strategy yields more reliable transfer, whether learning rate transfer across tokens is as robust as across width, and whether current batch size and expert scaling conventions are brittle.
Using this framework, we find that under AdamW, the primary advantage of $\mu$P over SP comes from training the embedding layer at a sufficiently fast learning rate. This suggests that the full set of $\mu$P conditions is excessive, and practitioners using SP can recover comparable transfer performance by simply correcting the embedding layer learning rate. Interestingly, training the embedding layer too slowly not only slows down learning but can also cause training instabilities, which is counterintuitive, as one does not expect a layer trained too slowly to destabilize training. Training the embedding layer slowly may therefore be an overlooked source of training instabilities observed in practice. While our analysis is specific to AdamW, it would be interesting to extend it to other optimizers, such as SGD and Muon [20], whose different update geometries may yield different minimal variants analogous to SP+Embd. In Appendix H, we also examine the weight-tied embedding case, where the embedding and last layer share parameters, and show that a naive SP would require a $1/n$ output multiplier in addition to a $\Theta(1)$ embedding layer learning rate for reliable transfer. Our results also show that the weight decay scaling convention $\eta\cdot\lambda=\Theta(1)$, derived under $\mu$P's fixed-step assumption, is inadequate in the compute-optimal regime, where the training horizon itself scales as $\Theta(n^{2})$. We leave finding the correct weight decay scaling in this regime to future work.
Limitations. Our experiments are limited to decoder-only Transformers with fixed depth, scaled to $\sim 1$B parameters, trained with AdamW on a single dataset (FineWeb-Edu). Due to the large hyperparameter sweep, each configuration is run with a single random seed. We leave the analysis of other architectures, optimizers, depth scaling, and datasets to future work.
Acknowledgments
We thank Tianyu He, Darshil Doshi, Sean McLeish, John Kirchenbauer, and Tom Goldstein for helpful discussions. MB and DSK thank the Simons Collaboration on Physics of Learning and Neural Computation (SFI-MPS-POL-00012574-09). The authors acknowledge the University of Maryland supercomputing resources (http://hpcc.umd.edu) made available for conducting the research reported in this paper.
References
- [1]M. Barkeshli, A. Alfarano, and A. Gromov(2026)On the origin of neural scaling laws: from random graphs to natural language.External Links:2601.10684,LinkCited by:§3.
- [2]S. Bergsma, N. S. Dey, G. Gosal, G. Gray, D. Soboleva, and J. Hestness(2025)Power lines: scaling laws for weight decay and batch size in LLM pre-training.InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,External Links:LinkCited by:Appendix E,§1,§6,§7.
- [3]J. Bjorck, A. Benhaim, V. Chaudhary, F. Wei, and X. Song(2025)Scaling optimal LR across token horizons.InThe Thirteenth International Conference on Learning Representations,External Links:LinkCited by:§1.
- [4]B. Bordelon, H. T. Chaudhry, and C. Pehlevan(2024)Infinite limits of multi-head transformer dynamics.InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,External Links:LinkCited by:§1.
- [5]B. Bordelon, L. Noci, M. B. Li, B. Hanin, and C. Pehlevan(2024)Depthwise hyperparameter transfer in residual networks: dynamics and scaling limit.InThe Twelfth International Conference on Learning Representations,External Links:LinkCited by:§1.
- [6]B. Bordelon and C. Pehlevan(2022)Self-consistent dynamical field theory of kernel evolution in wide neural networks.InAdvances in Neural Information Processing Systems,A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho (Eds.),External Links:LinkCited by:§1.
- [7]DeepSeek-AI, :, X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, H. Gao, K. Gao, W. Gao, R. Ge, K. Guan, D. Guo, J. Guo, G. Hao, Z. Hao, Y. He, W. Hu, P. Huang, E. Li, G. Li, J. Li, Y. Li, Y. K. Li, W. Liang, F. Lin, A. X. Liu, B. Liu, W. Liu, X. Liu, X. Liu, Y. Liu, H. Lu, S. Lu, F. Luo, S. Ma, X. Nie, T. Pei, Y. Piao, J. Qiu, H. Qu, T. Ren, Z. Ren, C. Ruan, Z. Sha, Z. Shao, J. Song, X. Su, J. Sun, Y. Sun, M. Tang, B. Wang, P. Wang, S. Wang, Y. Wang, Y. Wang, T. Wu, Y. Wu, X. Xie, Z. Xie, Z. Xie, Y. Xiong, H. Xu, R. X. Xu, Y. Xu, D. Yang, Y. You, S. Yu, X. Yu, B. Zhang, H. Zhang, L. Zhang, L. Zhang, M. Zhang, M. Zhang, W. Zhang, Y. Zhang, C. Zhao, Y. Zhao, S. Zhou, S. Zhou, Q. Zhu, and Y. Zou(2024)DeepSeek llm: scaling open-source language models with longtermism.External Links:2401.02954,LinkCited by:Appendix F,§1.
- [8]N. S. Dey, S. Bergsma, and J. Hestness(2024)Sparse maximal update parameterization: a holistic approach to sparse training dynamics.InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,External Links:LinkCited by:§1.
- [9]N. Dey, B. C. Zhang, L. Noci, M. Li, B. Bordelon, S. Bergsma, C. Pehlevan, B. Hanin, and J. Hestness(2025)Don’t be lazy: completep enables compute-efficient deep transformers.External Links:2505.01618,LinkCited by:§H.14,Table 4,§1.
- [10]K. E. Everett, L. Xiao, M. Wortsman, A. A. Alemi, P. J. Liu, I. Gur, J. Sohl-Dickstein, L. P. Kaelbling, J. Lee, and J. Pennington(2024)Scaling exponents across parameterizations and optimizers.InForty-first International Conference on Machine Learning,External Links:LinkCited by:§H.11,§H.11,§H.6,Appendix H,§1,§3,§5,§6,§7.
- [11]N. Ghosh, D. Wu, and A. Bietti(2025)Understanding the mechanisms of fast hyperparameter transfer.External Links:2512.22768,LinkCited by:§3,§7.
- [12]J. Gilmer, B. Ghorbani, A. Garg, S. Kudugunta, B. Neyshabur, D. Cardoze, G. E. Dahl, Z. Nado, and O. Firat(2022)A loss curvature perspective on training instabilities of deep learning models.InInternational Conference on Learning Representations,External Links:LinkCited by:Appendix F.
- [13]M. Haas, J. Xu, V. Cevher, and L. C. Vankadara(2024)$\mu$P$^2$: effective sharpness aware minimization requires layerwise perturbation scaling.InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,External Links:LinkCited by:§1.
- [14]S. Hayou and L. Liu(2025)Optimal embedding learning rate in llms: the effect of vocabulary size.External Links:2506.15025,LinkCited by:Table 4.
- [15]S. Hayou(2026)A proof of learning rate transfer under $\mu$P.InThe 29th International Conference on Artificial Intelligence and Statistics,External Links:LinkCited by:§H.3.
- [16]T. He, D. Doshi, A. Das, and A. Gromov(2024)Learning to grok: emergence of in-context learning and skill composition in modular arithmetic tasks.InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,External Links:LinkCited by:§1.
- [17]J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre(2022)An empirical analysis of compute-optimal large language model training.InAdvances in Neural Information Processing Systems,A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho (Eds.),External Links:LinkCited by:Appendix A,§1,§3,§6.
- [18]S. Hu, Y. Tu, X. Han, G. Cui, C. He, W. Zhao, X. Long, Z. Zheng, Y. Fang, Y. Huang, X. Zhang, Z. L. Thai, C. Wang, Y. Yao, C. Zhao, J. Zhou, J. Cai, Z. Zhai, N. Ding, C. Jia, G. Zeng, dahai li, Z. Liu, and M. Sun(2024)MiniCPM: unveiling the potential of small language models with scalable training strategies.InFirst Conference on Language Modeling,External Links:LinkCited by:Appendix A,§4.
- [19]T. Jiang, B. Bordelon, C. Pehlevan, and B. Hanin(2026)Hyperparameter transfer with mixture-of-expert layers.External Links:2601.20205,LinkCited by:§1.
- [20]K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, and J. Bernstein(2024)Muon: an optimizer for hidden layers in neural networks.External Links:LinkCited by:§8.
- [21]D. S. Kalra and M. Barkeshli(2024)Why warmup the learning rate? underlying mechanisms and improvements.InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,External Links:LinkCited by:§C.1,Appendix F,§2,§5.
- [22]D. S. Kalra, T. He, and M. Barkeshli(2025)Universal sharpness dynamics in neural network training: fixed point analysis, edge of stability, and route to chaos.InThe Thirteenth International Conference on Learning Representations,External Links:LinkCited by:§H.3,§2.
- [23]J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei(2020)Scaling laws for neural language models.External Links:2001.08361,LinkCited by:§3.
- [24]D. Karkada, D. J. Korchinski, A. Nava, M. Wyart, and Y. Bahri(2026)Symmetry in language statistics shapes the geometry of model representations.External Links:2602.15029,LinkCited by:§1.
- [25]A. Karpathy(2022)nanoGPT: the simplest, fastest repository for training/finetuning medium-sized gpts.GitHub.Note:https://github.com/karpathy/nanoGPTCited by:Appendix A.
- [26]A. Kosson, J. Welborn, Y. Liu, M. Jaggi, and X. Chen(2026)Weight decay may matter more than µp for learning rate transfer in practice.InThe Fourteenth International Conference on Learning Representations,External Links:LinkCited by:§6,§7.
- [27]A. Krizhevsky, V. Nair, and G. Hinton()CIFAR-100 (canadian institute for advanced research)..External Links:LinkCited by:§5.
- [28]H. Li, W. Zheng, J. Hu, Q. Wang, H. Zhang, Z. Wang, S. Xuyang, Y. Fan, S. Zhou, X. Zhang, and D. Jiang(2025)Predictable scale: part i – optimal hyperparameter scaling law in large language model pretraining.External Links:2503.04715,LinkCited by:§1,§3.
- [29]E. Littwin and G. Yang(2023)Adaptive optimization in the $\infty$-width limit.InThe Eleventh International Conference on Learning Representations,External Links:LinkCited by:§1.
- [30]T. Llama(2024)The llama 3 herd of models.External Links:2407.21783,LinkCited by:§1.
- [31]I. Loshchilov and F. Hutter(2019)Decoupled weight decay regularization.InInternational Conference on Learning Representations,External Links:LinkCited by:§1,§2.
- [32]N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt(2023)Progress measures for grokking via mechanistic interpretability.InThe Eleventh International Conference on Learning Representations,External Links:LinkCited by:§1.
- [33]L. Noci, A. Meterez, T. Hofmann, and A. Orvieto(2024)Super consistency of neural network landscapes and learning rate transfer.InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,External Links:LinkCited by:§H.3.
- [34]T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M. Jordan,et al.(2025)2 olmo 2 furious.arXiv preprint arXiv:2501.00656.Cited by:Appendix F,§1.
- [35]G. Penedo, H. Kydlíček, L. B. allal, A. Lozhkov, M. Mitchell, C. Raffel, L. V. Werra, and T. Wolf(2024)The fineweb datasets: decanting the web for the finest text data at scale.InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:LinkCited by:Appendix A,§4.
- [36]T. Porian, M. Wortsman, J. Jitsev, L. Schmidt, and Y. Carmon(2024)Resolving discrepancies in compute-optimal scaling of language models.InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,External Links:LinkCited by:§1.
- [37]S. Qiu, Z. Chen, H. Phan, Q. Lei, and A. G. Wilson(2025)Hyperparameter transfer enables consistent gains of matrix-preconditioned optimizers across scales.External Links:2512.05620,LinkCited by:§1.
- [38]Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lina, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu(2025)Qwen2.5 technical report.External Links:2412.15115,LinkCited by:§1.
- [39]A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever(2019)Language models are unsupervised multitask learners.External Links:LinkCited by:Appendix A.
- [40]D. A. Roberts, S. Yaida, and B. Hanin(2022)Frontmatter.InThe Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks,pp. i–iv.Cited by:§H.3.
- [41]J. Sohl-Dickstein, R. Novak, S. S. Schoenholz, and J. Lee(2020)On the infinite width limit of neural networks with a standard parameterization.arXiv preprint arXiv:2001.07301.Cited by:§2.
- [42]T. Tao, D. Doshi, D. S. Kalra, T. He, and M. Barkeshli(2025)(How) can transformers predict pseudo-random numbers?.InForty-second International Conference on Machine Learning,External Links:LinkCited by:§1.
- [43]L. C. Vankadara, J. Xu, M. Haas, and V. Cevher(2024)On feature learning in structured state space models.InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,External Links:LinkCited by:§1.
- [44]A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin(2017)Attention is all you need.InAdvances in Neural Information Processing Systems,I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.),Vol.30,pp..External Links:LinkCited by:§1.
- [45]A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin(2017)Attention is all you need.InAdvances in Neural Information Processing Systems,I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.),Vol.30,pp..External Links:LinkCited by:§H.14,§2.
- [46]S. Yaida(2022)Meta-principled family of hyperparameter scaling strategies.External Links:2210.04909,LinkCited by:§H.3,§H.3,§1.
- [47]G. Yang, E. J. Hu, I. Babuschkin, S. Sidor, X. Liu, D. Farhi, N. Ryder, J. Pachocki, W. Chen, and J. Gao(2021)Tuning large neural networks via zero-shot hyperparameter transfer.InAdvances in Neural Information Processing Systems,A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.),External Links:LinkCited by:§1,§6.
- [48]G. Yang and E. J. Hu(2021-18–24 Jul)Tensor programs iv: feature learning in infinite-width neural networks.InProceedings of the 38th International Conference on Machine Learning,M. Meila and T. Zhang (Eds.),Proceedings of Machine Learning Research, Vol.139,pp. 11727–11737.External Links:LinkCited by:§H.11,§H.12,§H.14,§H.14,§H.2,§H.3,Table 4,§1,§1,§2,§2,§2.
- [49]G. Yang, D. Yu, C. Zhu, and S. Hayou(2024)Tensor programs VI: feature learning in infinite depth neural networks.InThe Twelfth International Conference on Learning Representations,External Links:LinkCited by:§1.
Appendix A Experimental Details
We pre-trained GPT-style Transformers on FineWeb-Edu [35], building on the nanoGPT codebase [25]. All experiments use a depth of 12 Transformer blocks without biases and a context length of 1024, with the data tokenized using the GPT-2 tokenizer [39], resulting in a vocabulary size of 50,304. We scale the embedding dimension (width) $n\in[128,2048]$ by increasing the number of heads, while keeping the head dimension fixed at $d=64$. We train the models using AdamW ($\beta_{1}=0.9$, $\beta_{2}=0.95$, $\epsilon=10^{-8}$) with a Warmup-Stable-Decay (WSD) schedule [18] (20% warmup, 60% stable, 20% decay). For each width, we sweep the peak learning rate and weight decay strength $\lambda$.
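As an illustration of the 20%/60%/20% Warmup-Stable-Decay split, the following sketch computes the learning rate at a given step. The linear warmup and the linear decay to zero are assumptions made here; the text specifies only the phase fractions, and the function name is hypothetical.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.2, decay_frac=0.2):
    """Warmup-Stable-Decay learning rate at a 0-indexed `step`.

    Assumes linear warmup and linear decay towards zero; only the
    20/60/20 phase split is taken from the experimental setup.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_steps = max(1, int(decay_frac * total_steps))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Ramp up so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: shrink linearly with the number of remaining steps.
    return peak_lr * (total_steps - step) / decay_steps
```

With `total_steps=10_000`, the warmup covers steps 0 to 1,999, the stable phase steps 2,000 to 7,999, and the decay the final 2,000 steps.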
In the fixed-step setting, we train the models for 10,000 steps with a batch size of 1024 ($\approx 1$M tokens per step), corresponding to a total of approximately 10B tokens. By comparison, in the compute-optimal setting (fixed token-per-parameter), we scale training tokens proportional to the number of parameters, with a ratio of 20 tokens per parameter following [17]. To ensure that small models are trained for a sufficient number of steps, we reduce the batch size to 256 ($\approx 0.25$M tokens).
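The token budgets above can be checked with a few lines of arithmetic. The sketch below also shows why the number of steps grows as $n^{2}$ at fixed depth in the compute-optimal setting. The parameter count $12 \cdot \text{depth} \cdot n^{2}$ is a common approximation for the non-embedding parameters of a Transformer with a 4x MLP expansion; it is an assumption here and may differ from the count the authors used.

```python
import math

SEQ_LEN = 1024  # context length from the experimental setup


def fixed_step_tokens(steps=10_000, batch_size=1024, seq_len=SEQ_LEN):
    """Total tokens seen in the fixed-step setting."""
    return steps * batch_size * seq_len


def approx_block_params(width, depth=12):
    """Assumed non-embedding parameter count: 4n^2 (attention) + 8n^2 (MLP) per block."""
    return 12 * depth * width ** 2


def compute_optimal_steps(width, depth=12, tokens_per_param=20,
                          batch_size=256, seq_len=SEQ_LEN):
    """Steps needed to reach 20 tokens per (assumed) parameter at batch size 256."""
    tokens = tokens_per_param * approx_block_params(width, depth)
    return math.ceil(tokens / (batch_size * seq_len))
```

Under this approximation, `fixed_step_tokens()` gives about 10.5B tokens, and going from width 128 to width 2048 multiplies the number of compute-optimal steps by $16^{2}=256$.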
Compute Usage.
All experiments were run on H100 GPUs. The total sweep covers 20 learning rates $\times$ 8 weight decay values $\times$ 8 widths $\times$ 16 parameterizations $\times$ 2 training regimes, with each run taking approximately 2 hours on average, for an estimated total of $\sim$160,000 H100 GPU hours.
Appendix B Estimating the Three Transfer Metrics
This appendix describes the filtering, interpolation, and fitting procedures used to compute the transfer framework metrics $\mathcal{R}(\infty)$, $\mathcal{E}$, and $\kappa$.
Figure 6: (a) Interpolated loss curves for $\mu$P across widths with loss filtering threshold $f=1.35$. Raw observed points (circles) and the fitted smoothing spline (solid lines) are shown for each width. (b) Per-width curvature fits for $\mu$P. For each width, we show the interpolated loss curve (solid line) and the fitted centered quadratic $L(\nu)=L_{\min}+\frac{1}{2}H(n)(\nu-\nu^{*})^{2}$ (dashed line).

Filtering, Smoothing and Interpolation.
For each width, we retain runs with loss within $f=1.35\times$ the per-width optimal loss $L^{*}(n)$ to focus the analysis on the landscape around the minimum. The threshold $f$ is chosen to avoid unstable and divergent runs while retaining as many data points as possible. We then fit a cubic spline (UnivariateSpline, degree $k=3$) to the filtered per-width curves with smoothing parameter $S=s\cdot N\cdot\mathrm{Var}(L)$, where $s=0.1$ is the base smoothing coefficient, $N$ is the number of points, and $\mathrm{Var}(L)$ is the variance of the loss values. Each spline is evaluated on a uniform grid of 400 points spanning the observed $\nu$ range, resulting in smooth, dense curves for the subsequent analysis (see Figure 6(a) for an example). Out of 128 total combinations (16 parameterizations $\times$ 8 weight decay values), 4 failed this procedure, either because too few points remained after the filtering step or because the interpolated loss curve exhibited excessive noise resulting in negative loss values, and were therefore excluded from the analysis.
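The filtering step and the smoothing parameter $S=s\cdot N\cdot\mathrm{Var}(L)$ can be sketched as follows. The function names are hypothetical, and using the population variance for $\mathrm{Var}(L)$ is an assumption, since the text does not say which estimator is used. The resulting value would be passed as the `s` argument of `scipy.interpolate.UnivariateSpline` together with `k=3`.

```python
import math
from statistics import pvariance


def filter_runs(nus, losses, f=1.35):
    """Keep (nu, loss) pairs whose loss is finite and within f times the best loss."""
    finite = [(nu, loss) for nu, loss in zip(nus, losses) if math.isfinite(loss)]
    best = min(loss for _, loss in finite)
    return [(nu, loss) for nu, loss in finite if loss <= f * best]


def smoothing_parameter(losses, s=0.1):
    """S = s * N * Var(L), using the population variance (an assumption)."""
    return s * len(losses) * pvariance(losses)
```

For example, with losses `[3.0, 2.0, 2.5, 4.0]` the threshold is $1.35 \times 2.0 = 2.7$, so only the runs with losses 2.0 and 2.5 survive the filter.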
Estimating the Three Transfer Metrics.
By fitting scaling laws to $L^{*}(n)$ and $\nu^{*}(n)$, we obtain the exponents $\alpha$, $\beta$ and the asymptotic loss $L^{*}(\infty)$, from which we compute $\mathcal{R}(\infty)$. The loss scaling law is fit in log space with constraints $L^{*}(\infty), A, \alpha\geq 0$, ensuring the fitted scaling law strictly decreases with width and converges to a finite irreducible value. The log-learning-rate scaling law is fit in linear space, since $\nu$ is already in log space, with $\beta\geq 0$ enforced to ensure convergence to the asymptotic value. Since $\mathcal{R}(\infty)$ is non-negative by definition (Definition 3.1), we clamp fitted values to zero when they are negative due to finite-size fitting artifacts. For the curvature $H(n)$, we first fit a centered quadratic $L(\nu)=L^{*}+\frac{1}{2}H(n)(\nu-\nu^{*}(n))^{2}$ to the interpolated per-width curves (see Figure 6(b)), with the center fixed at $\nu^{*}(n)$ to prevent noise in the loss curve from shifting the fitted center away from the true optimum. We then fit the scaling law $H(n)=Cn^{\gamma}$ (Figure 1(f)) to obtain the exponent $\gamma$. Combined with $\alpha$ and $\beta$ from above, this gives $\kappa=\alpha-2\beta+\gamma$. Finally, we jointly fit all the parameters of Equation 6 using the interpolated curves, and evaluate the fit on the raw filtered data to obtain the loss predictability error $\mathcal{E}$.
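The curvature and robustness-exponent steps can be sketched with closed-form least squares. This is a simplified stand-in, not the fitting procedure used in the paper: it uses ordinary least squares rather than a Huber objective, treats $L^{*}$ as known rather than fitted, and fits the power law by linear regression in log-log space. All function names are hypothetical.

```python
import math


def fit_centered_curvature(nus, losses, nu_star, loss_star):
    """Least-squares H in L = L* + 0.5 * H * (nu - nu*)^2 with the center fixed.

    Setting the derivative of the squared error to zero gives
    H = 2 * sum(d^2 * (L - L*)) / sum(d^4), where d = nu - nu*.
    """
    num = sum((nu - nu_star) ** 2 * (loss - loss_star) for nu, loss in zip(nus, losses))
    den = sum((nu - nu_star) ** 4 for nu in nus)
    return 2.0 * num / den


def fit_power_law_exponent(widths, values):
    """Exponent gamma of values = C * n^gamma via the slope in log-log space."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def robustness_exponent(alpha, beta, gamma):
    """kappa = alpha - 2 * beta + gamma."""
    return alpha - 2.0 * beta + gamma
```

Fixing the center leaves a single free parameter per width, which is why the curvature estimate has a closed form in this simplified setting.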
We fit all scaling laws using a Huber loss objective with $\delta=10^{-3}$ to improve robustness to outliers and 200 random initializations to avoid local minima in the non-convex scaling law fits. We additionally cap all scaling exponents at 2.0, which serves as a practical proxy for exponents running to infinity and prevents spuriously large values that can make comparisons across parameterizations unreliable.
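For reference, the standard Huber loss with threshold $\delta$ and the exponent cap can be written as below. The function names and the mean-based aggregation are illustrative choices; how the objective is minimized over the 200 random initializations is not shown.

```python
def huber(residual, delta=1e-3):
    """Standard Huber loss: quadratic for |r| <= delta, linear beyond it."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual * residual
    return delta * (magnitude - 0.5 * delta)


def huber_objective(predictions, targets, delta=1e-3):
    """Mean Huber loss over paired predictions and targets."""
    losses = [huber(p - t, delta) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)


def cap_exponent(exponent, cap=2.0):
    """Cap a fitted scaling exponent, as a proxy for exponents running to infinity."""
    return min(exponent, cap)
```

Because large residuals are penalized linearly rather than quadratically, a single outlying width has limited influence on the fitted exponents.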
Figure 7: Fitted $\beta$ as a function of the lower bound $\beta_{\min}$ for two cases. Left: a degenerate case ($\mu$P) where the step function better fits the observed trend, and we adopt the optimal solution with $\beta>\beta_{\min}^{*}$. Right: a genuine case ($\mu$P-Attn) where $\beta$ monotonically increases with $\beta_{\min}$, suggesting that the fitted small $\beta$ can be trusted.
Degeneracy in $\nu^{*}$ Scaling.
A subtle issue arises when $\nu^{*}(n)$ is nearly constant across widths. In this regime, the scaling law $\nu^{*}(n)=\nu^{*}(\infty)+Bn^{-\beta}$ admits two degenerate solutions: (1) $\beta\to 0$, reducing the model to a constant $\nu^{*}(\infty)+B$, and (2) $\beta\to\infty$, indicating rapid convergence to $\nu^{*}(\infty)$. We prefer the $\beta\to\infty$ solution, as $\beta\to 0$ reduces the scaling law to a constant and is inconsistent with the convergence-based interpretation of our framework. Since $\beta$ may not be exactly zero numerically, distinguishing a genuinely small $\beta$ from the degenerate $\beta\to 0$ solution is non-trivial. To resolve this, we repeatedly fit the scaling law with an increasing lower bound $\beta_{\min}$ on $\beta$ and track how the fitted $\beta$ changes. If the true solution is degenerate, the fitted $\beta$ will exhibit a sudden jump at some $\beta^{*}_{\min}$, beyond which random initializations begin to prefer the $\beta\to\infty$ solution. In contrast, if the solution is not degenerate, the fitted $\beta$ will linearly increase from $\beta_{\min}$ (see Figure 7). We then fit both a step function and a linear function to the resulting $\beta$ vs. $\beta_{\min}$ trend. If the step function fits better, we have identified the small $\beta$ as a degenerate solution and select the best solution with $\beta>\beta^{*}_{\min}$. Otherwise, we treat the small observed $\beta$ as genuine. We apply this procedure only to the $\nu^{*}(n)$ scaling law fits to estimate $\beta$ and hence $\kappa$.
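The step-versus-linear comparison described above can be sketched as follows. This is one possible reading of the procedure: the step function is taken to be a two-level piecewise constant with a single breakpoint, both fits are compared by raw sum of squared errors, and the function names are hypothetical. The paper's exact fitting and comparison criteria may differ.

```python
def _sse_about_mean(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def linear_sse(xs, ys):
    """Squared error of the ordinary least-squares line; needs two distinct xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def best_step_sse(xs, ys):
    """Best two-level step fit over all breakpoints; returns (sse, jump location)."""
    best_err, jump_at = float("inf"), None
    for k in range(1, len(xs)):
        err = _sse_about_mean(ys[:k]) + _sse_about_mean(ys[k:])
        if err < best_err:
            best_err, jump_at = err, xs[k]
    return best_err, jump_at


def classify_beta_trend(beta_mins, fitted_betas):
    """Label the fitted-beta trend as 'degenerate' (step wins) or 'genuine' (line wins)."""
    pairs = sorted(zip(beta_mins, fitted_betas))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    step_err, jump_at = best_step_sse(xs, ys)
    if step_err < linear_sse(xs, ys):
        return "degenerate", jump_at
    return "genuine", None
```

In the degenerate case, the returned jump location plays the role of $\beta^{*}_{\min}$, above which the best remaining fit would be selected.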
Degeneracy in the Full Model.
The full loss model (Equation 6) introduces additional degeneracies beyond those in the individual $\nu^{*}$ scaling law. Specifically, the term $\frac{1}{2}Cn^{\gamma}(\nu-\nu^{*}(\infty)-Bn^{-\beta})^{2}$ couples the curvature scaling parameters $C$ and $\gamma$ with the $\nu$ scaling parameters $B$ and $\beta$ through the product $Cn^{\gamma}\cdot B^{2}n^{-2\beta}$, making individual parameter estimates unreliable even when the overall fit is good. We therefore do not use the joint fit for estimating $\beta$, $\kappa$, and $\mathcal{R}(\infty)$, and instead rely on the individually fitted scaling laws. The parameters of the full model are constrained in the same way as the individual scaling laws, with $L^{*}(\infty), A, \alpha, C, \beta\geq 0$ and $B$ unconstrained. Nevertheless, the joint fit consistently yields smaller fit errors than substituting the individually fitted parameters directly into Equation 6, and we therefore use it for reporting $\mathcal{E}$.
Appendix C Transfer Metrics for Additional Parameterizations
In this section, we analyze the three transfer metrics for all parameterizations interpolating between SP and $\mu$P in further detail. Figures 8–17 show the loss and scaling law curves for each parameterization. We use '$+$' to denote incremental changes from SP towards $\mu$P and '$-$' to denote changes from $\mu$P towards SP. For each parameterization, we select the weight decay value that best represents its typical behavior.
C.1 Effect of Different Layer Types on Transfer Metrics
SP vs. $\mu$P.
As described in Section 4, $\mu$P and SP are surprisingly similar across most metrics (Figures 8 and 9). Despite being more unstable (a $3\times$ larger loss predictability error), SP exhibits a large negative robustness exponent $\kappa=-3.505$. SP thus has the right ingredients for reliable transfer but is held back by training instability.
Embedding Layer Learning Rate.
The embedding layer learning rate has the most pronounced effect on stability. Training SP with a $\Theta(1)$ embedding learning rate (SP+Embd) eliminates the instabilities, resulting in smooth loss curves (Figure 10). Conversely, training $\mu$P with a $\Theta(1/n)$ embedding learning rate ($\mu$P-Embd) destabilizes training entirely (Figure 11).
Attention Scaling.
The effect of $1/d$ attention scaling is more subtle. Training SP with $1/d$ attention scaling (SP+Attn) does not improve stability (Figure 12): the loss predictability error $\mathcal{E}$ remains similar to SP, $\mathcal{R}(\infty)$ is slightly worse, and instabilities worsen at large widths. Most notably, $\kappa$ becomes positive ($\kappa=0.509$), indicating that transfer becomes brittle. A similar degradation is observed when training $\mu$P with $1/\sqrt{d}$ attention scaling ($\mu$P-Attn) (Figure 13): $\mathcal{E}$ is similar to SP, the optimal learning rate decreases with width, and $\kappa=0.881$, making transfer unreliable. These results suggest that the effect of attention scaling depends on the scaling of other layers.
LayerNorm Learning Rate.
Increasing the LayerNorm learning rate to $\Theta(1)$ in SP (SP+LN) severely destabilizes training (Figure 14), with $\mathcal{E}$ roughly $10\times$ larger than SP. Interestingly, while the loss at small widths is poor, SP+LN achieves the best asymptotic loss across all parameterizations at large widths. This highlights a sharp tradeoff: SP+LN can in principle reach excellent performance at scale, but its unpredictable loss landscape makes transfer unreliable, making it a high-reward but high-risk parameterization. By comparison, decreasing the LayerNorm learning rate to $\Theta(1/n)$ in μP (μP-LN) has, surprisingly, almost no effect. The loss predictability error $\mathcal{E}$ is slightly better than μP, while the asymptotic loss is slightly worse. In both comparisons (SP vs. SP+LN and μP vs. μP-LN), the variant with the slower LayerNorm learning rate has the better loss predictability, which suggests that LayerNorm parameters are best trained slowly. This again reflects a tradeoff: slower LayerNorm training stabilizes the training dynamics at the cost of a small performance penalty.
Last Layer Initialization.
Finally, reducing the variance of the last-layer initialization to $\Theta(1/n^{2})$ in SP (SP+Last) has a negligible effect, with all three metrics remaining comparable to SP. By comparison, increasing it to $1/n$ in μP (μP-Last) causes training instabilities at small widths, but transfer remains robust at larger widths. The minimal role of last-layer initialization is consistent with the observation that a larger initialization variance leads to higher Hessian sharpness, but learning rate warmup gradually reduces this sharpness early in training [21], effectively mitigating any initialization differences.


Figure 8: Transfer metrics for μP with weight decay $\lambda=0.006$. Repeats Figure 1 for completeness.

Figure 9: Transfer metrics for SP with weight decay $\lambda=0.001$.

Figure 10: Transfer metrics for SP+Embd with weight decay $\lambda=0.001$.

Figure 11: Transfer metrics for μP-Embd with weight decay $\lambda=0.006$.

Figure 12: Transfer metrics for SP+Attn with weight decay $\lambda=0.001$.

Figure 13: Transfer metrics for μP-Attn with weight decay $\lambda=0.006$.

Figure 14: Transfer metrics for SP+LN with weight decay $\lambda=0.001$.

Figure 15: Transfer metrics for μP-LN with weight decay $\lambda=0.001$.

Figure 16: Transfer metrics for SP+Last with weight decay $\lambda=0.001$.

Figure 17: Transfer metrics for μP-Last with weight decay $\lambda=0.006$.
C.2 Transfer Metric Phase Diagrams
Figure 18: *Transfer metrics for parameterizations interpolating between SP and μP.* Parameterizations with ‘+’ denote incremental changes from SP towards μP, while ‘−’ denotes changes from μP towards SP. Green and red regions indicate desirable and undesirable regimes, respectively.
Figure 18 shows the three transfer metrics for all 16 parameterizations and 8 weight decay values. The metrics naturally separate parameterizations into clusters. For example, in panel (b), parameterizations that train the embedding well (μP, SP+Embd, μP-LN, SP+Attn+Embd) cluster in the desirable low-$\mathcal{E}$, low-$\mathcal{R}(\infty)$ region, while unstable parameterizations (SP, SP+LN, μP-Embd) achieve near-zero $\mathcal{R}(\infty)$ but at the cost of high $\mathcal{E}$. This illustrates the tradeoff between the metrics noted in Section 4: a parameterization can achieve excellent asymptotic performance while remaining unreliable for transfer. Large weight decay values are located in the top right, improving $\mathcal{E}$ relative to the unstable cluster but at an asymptotic performance cost, never reaching the stable cluster. Similar clustering patterns emerge in panels (a) and (c), with stable parameterizations consistently occupying the desirable regions across all three metrics.
Appendix D Importance of the First Layer Learning Rate in CNN Image Classification
Figure 19: *CNN experiments on CIFAR with Adam.* Training loss as a function of log learning rate $\nu$ for four parameterizations across widths. In SP, the optimal learning rate drifts with width, while μP shows substantially less drift. Increasing the learning rate of the input-facing layer in SP (SP+Embd) largely removes this drift and places the optimum in a similar region to μP. By contrast, changing only the last-layer initialization (SP+Last) leaves both the learning-rate drift and the optimum location closer to SP. This suggests that the role of the embedding layer learning rate in Transformers reflects a more general importance of the input-facing layer learning rate.
In Section 5, we demonstrated that the embedding layer learning rate is the primary factor in explaining the difference between SP and μP in Transformers trained with Adam. A natural question is whether this result is specific to Transformers with an embedding layer or generalizes to other architectures and tasks. To test this, we consider CNNs trained on CIFAR with Adam, where the first layer is a simple convolutional layer, and there is no LayerNorm or attention.
As there are only two differences in this setting (first and last layer), there are only four variants: μP, SP, SP+Embd, and SP+Last. Figure 19 shows the training loss as a function of log learning rate $\nu$ across widths. We observe that the optimal learning rate increases with width in SP, whereas in μP it remains fairly constant. Training SP with a $\Theta(1)$ first-layer learning rate largely removes this drift: SP+Embd behaves similarly to μP both in the location of the optimal learning rate and in its reduced drift. By contrast, changing only the last-layer initialization has little effect: SP+Last remains closer to SP, both in the location of the optimum and in the observed learning-rate drift.
These results suggest that the role of the embedding layer learning rate is not specific to Transformers, and that training the input layer sufficiently fast is important for learning-rate transfer under Adam.
Appendix E Weight Decay Scaling in Compute-Optimal Regime
Figure 20: Loss curves for μP in the compute-optimal setting (20 TPP) under two weight decay scalings. (a) Standard μP convention $\eta\cdot\lambda=\Theta(1)$: $\nu^{*}(n)$ shifts to the left with increasing width. (b) Corrected convention $\eta\cdot\lambda=\Theta(1/n^{2})$: $\nu^{*}(n)$ does not vary much, but the shape of the loss curves around the minimum changes across widths.
In the compute-optimal (fixed TPP) setting, the number of training steps scales as $T\propto n^{2}$, which violates μP’s assumption of $\Theta(1)$ training steps compared to width. This makes the standard weight decay convention $\eta\cdot\lambda=\Theta(1)$ inadequate, as we observed in Section 6, where transfer robustness $\kappa$ degrades with increasing weight decay. A natural scaling, motivated by [2], is to scale the weight decay strength as $\lambda=\Theta(1/n^{2})$ so that weight decay’s cumulative contribution over $T=\Theta(n^{2})$ steps, $\eta\lambda T$, remains $\Theta(1)$ across widths. That being said, we caution that this choice is not well motivated. There is no first-principles reason why the cumulative weight decay must be $\Theta(1)$, as different widths may benefit from different total contributions.
Figure 20 compares the loss curves for μP under two scaling conventions. Under $\eta\cdot\lambda=\Theta(1)$ (panel a), the optimal log learning rate $\nu^{*}(n)$ drifts noticeably to the left with increasing width, suggesting slow convergence (small $\beta$). As a result, $\kappa=\alpha-2\beta+\gamma$ becomes large, resulting in the brittle transfer observed in Section 6. By comparison, under $\eta\cdot\lambda=\Theta(1/n^{2})$ (panel b), this drift is visually reduced, suggesting an improvement in transfer robustness. However, the shape of the loss curves around the minimum changes across widths. We leave a systematic analysis of the appropriate weight decay scaling in the TPP regime to future work.
Appendix F The Effect of Learning Rate Warmup on Learning Rate Transfer
Figure 21: Learning rate warmup is essential for observing reliable learning rate transfer in μP.
Learning rate warmup is standard practice in large-scale training [7,34]. Its primary effect is gradually reducing the sharpness of the loss Hessian (or pre-conditioned sharpness for Adam), effectively steering optimization towards well-conditioned regions of the loss landscape where the model can be trained at large learning rates [12,21]. Despite its widespread use, its interaction with learning rate transfer has not been systematically studied. Figure 21 shows that short warmup durations (1–5%) result in training instabilities that make transfer unreliable: the loss curves are noisy across widths, making extrapolation meaningless. At 10% warmup, the loss curves become noticeably smoother and $\nu^{*}(n)$ aligns more consistently across widths. This result confirms that warmup is crucial for observing reliable learning rate transfer, which is not accounted for in μP’s theoretical derivation. We use a conservative warmup duration of 20% throughout our experiments to observe reliable transfer.
Appendix G Effect of Freezing the Embedding Layer



Figure 22: Effect of freezing the embedding layer in μP and SP. (top) Loss vs. log learning rate curves. (middle) Optimal loss vs. total parameters. (bottom) Optimal loss vs. trainable parameters, where trainable parameters exclude the embedding when frozen.
The switch experiments in Section 5 show that training the embedding layer too slowly can both slow down learning and cause training instabilities. Here, we take this further by studying the effect of completely freezing the embedding layer in both μP and SP.
Figure 22 (top) shows the loss vs. log learning rate curves. Freezing the embedding layer significantly affects μP, causing pronounced training instabilities and a shift in the optimal learning rate. SP is comparatively more robust: freezing causes instabilities at small widths, but the curves match the unfrozen case at large widths.
Figure 22 (middle) shows the optimal loss scaling laws against total parameters. For μP, freezing the embedding causes a large performance gap that narrows with width. For SP, the effect is more modest, with frozen and unfrozen performance converging at large widths.
However, comparing by total parameters is unfair, since at small widths the embedding dominates the parameter count (vocabulary size $\times$ width, with vocabulary size $50{,}304$). We therefore also compare against trainable parameters, which exclude frozen embedding parameters (Figure 22, bottom). Even after this correction, freezing μP’s embedding remains worse across all widths, though the gap narrows considerably. For SP, the frozen variant achieves lower loss at small widths, suggesting that non-embedding parameters might be more parameter-efficient at small scales, though the unfrozen case catches up at large widths. We leave a detailed understanding of this phenomenon to future work.
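To make the parameter-count argument concrete, the sketch below compares the embedding size (vocabulary size times width, as stated above) against a rough non-embedding count. The non-embedding estimate of 12·width² weights per Transformer block and the depth of 4 blocks are assumptions introduced only for illustration; the actual architecture is not specified here.

```python
VOCAB_SIZE = 50_304  # vocabulary size stated in the text


def embedding_params(width, vocab_size=VOCAB_SIZE):
    """Embedding table size: vocabulary size x width."""
    return vocab_size * width


def non_embedding_params(width, n_layers, per_layer_coeff=12):
    """Hypothetical estimate: per_layer_coeff * width^2 weights per block."""
    return per_layer_coeff * n_layers * width ** 2


def embedding_fraction(width, n_layers):
    """Share of all parameters that sit in the embedding table."""
    emb = embedding_params(width)
    return emb / (emb + non_embedding_params(width, n_layers))


# Assumed depth of 4 blocks; widths chosen only to show the trend.
fractions = {w: embedding_fraction(w, n_layers=4) for w in (128, 512, 2048)}
for w, frac in fractions.items():
    print(f"width={w:5d}  embedding share={frac:.3f}")
```

Under these assumed counts the embedding share falls as width grows, because the embedding is linear in width while the remaining weights are quadratic. This is why a comparison by total parameters penalizes the frozen-embedding variants most at small widths.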
Appendix H Minimal Derivation of μP
In this section, we present a self-contained derivation of scaling conditions for neural networks trained at large widths and show that μP arises as one particular solution to these conditions. We consider a minimal three-layer linear network. The choice is minimal for two reasons. First, three layers suffice to capture all distinct layer types in a neural network: input-to-hidden, hidden-to-hidden, and hidden-to-output. Second, standard activation functions used in practice (ReLU, GELU, SiLU, tanh) act element-wise, so they contribute only $\Theta_{n}(1)$ factors to both the forward and backward passes and leave the width-scaling exponents unchanged. We build upon the analysis and notation of Everett et al. [10], and extend the derivation to cover weight decay, LayerNorm, and attention scaling. Our goal here is to provide a simple, transparent derivation of the scaling rules, making every assumption explicit at the step where it is used.
H.1 Notation and Norms
In this derivation, we examine how every quantity (weights, activations, gradients, and updates) scales with the width $n$, using its RMS norm:
Definition H.1 (RMS norm).
For a vector $\mathbf{v}\in\mathbb{R}^{n}$, the RMS norm is defined as:
$$\|\mathbf{v}\|_{\mathrm{RMS}}:=\sqrt{\frac{1}{n}\sum_{i=1}^{n}v_{i}^{2}}.$$
For a matrix $M\in\mathbb{R}^{m\times n}$, we use the entrywise extension:
$$\|M\|_{\mathrm{RMS}}:=\sqrt{\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}M_{ij}^{2}}.$$
Unless we state otherwise, we use $\|\cdot\|$ to denote the RMS norm throughout this derivation.
Definition H.2 (Width-scaling).
Let $\{\mathbf{v}_{n}\}$ be a family of vectors with $\mathbf{v}_{n}\in\mathbb{R}^{n}$. We say $\mathbf{v}_{n}=\Theta_{n}(n^{\alpha})$ if there exist $C_{1},C_{2}>0$ and $N$ such that $C_{1}n^{\alpha}\leq\|\mathbf{v}_{n}\|\leq C_{2}n^{\alpha}$ for all $n\geq N$. The one-sided versions $\mathcal{O}_{n}$ and $\Omega_{n}$ keep only the upper or lower bound. Furthermore, for two families $\{\mathbf{u}_{n}\}$ and $\{\mathbf{v}_{n}\}$, we write $\mathbf{v}_{n}\sim\mathbf{u}_{n}$ to indicate that they share the same width-scaling, i.e., $\|\mathbf{v}_{n}\|=\Theta_{n}(\|\mathbf{u}_{n}\|)$. The subscript $n$ specifies width as the scaling variable, and we will drop it whenever it is clear from context.
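The two definitions above can be probed numerically. The following is a minimal sketch, assuming NumPy; the helper names `rms_norm` and `width_exponent` are illustrative and not part of the derivation. It computes the entrywise RMS norm and estimates the exponent $\alpha$ of a family as the slope of log-norm against log-width over a finite range of widths, which only approximates the asymptotic statement in Definition H.2.

```python
import numpy as np


def rms_norm(x):
    """Entrywise RMS norm of a vector or matrix (Definition H.1)."""
    x = np.asarray(x, dtype=float)
    return float(np.sqrt(np.mean(x ** 2)))


def width_exponent(widths, norms):
    """Least-squares slope of log(norm) against log(width).

    For a family whose norm is Theta(n^alpha), the slope approaches alpha.
    """
    log_n = np.log(np.asarray(widths, dtype=float))
    log_norm = np.log(np.asarray(norms, dtype=float))
    slope, _intercept = np.polyfit(log_n, log_norm, deg=1)
    return float(slope)


widths = [64, 128, 256, 512, 1024]
rng = np.random.default_rng(0)
# Entries with standard deviation n^{-1/2}: the RMS norm should scale as n^{-1/2}.
norms = [rms_norm(rng.normal(0.0, n ** -0.5, size=n)) for n in widths]
alpha_hat = width_exponent(widths, norms)
print(f"estimated exponent: {alpha_hat:.3f} (expected about -0.5)")
```

The fitted slope is a finite-width estimate, so small deviations from the nominal exponent are expected at these widths.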
H.2 Model Architecture
Consider a three-layer linear network with trainable parameters $\boldsymbol{\theta}:=\{U,W,V\}$, width $n$, input dimension $d_{\mathrm{in}}$, and output dimension $d_{\mathrm{out}}$. For a training example $(\mathbf{x},\mathbf{y})$, the forward pass is:
$$h(\mathbf{x})=n^{-a_{u}}U\mathbf{x},\qquad z(\mathbf{x})=n^{-a_{w}}Wh(\mathbf{x}),\qquad f(\mathbf{x})=n^{-a_{v}}Vz(\mathbf{x}),\tag{11}$$
with $U\in\mathbb{R}^{n\times d_{\mathrm{in}}}$, $W\in\mathbb{R}^{n\times n}$, and $V\in\mathbb{R}^{d_{\mathrm{out}}\times n}$. Here, the exponents $a:=(a_{u},a_{w},a_{v})$ scale the forward pass in each layer. The initialization variances are controlled by the exponents $b:=(b_{u},b_{w},b_{v})$:
$$U_{ij}\sim\mathcal{N}(0,n^{-2b_{u}}),\qquad W_{ij}\sim\mathcal{N}(0,n^{-2b_{w}}),\qquad V_{ij}\sim\mathcal{N}(0,n^{-2b_{v}}).\tag{12}$$
The per-layer learning rates are scaled by the exponents $c:=(c_{u},c_{w},c_{v})$:
$$\eta_{u}=\eta\,n^{-c_{u}},\qquad\eta_{w}=\eta\,n^{-c_{w}},\qquad\eta_{v}=\eta\,n^{-c_{v}},\tag{13}$$
where $\eta$ is the global learning rate. Finally, the per-layer weight decay strengths are scaled by the exponents $d:=(d_{u},d_{w},d_{v})$:
$$\lambda_{u}=\lambda\,n^{-d_{u}},\qquad\lambda_{w}=\lambda\,n^{-d_{w}},\qquad\lambda_{v}=\lambda\,n^{-d_{v}},\tag{14}$$
where $\lambda$ is the weight decay strength. The exponents $\{a,b,c,d\}$ collectively define our $abcd$ parameterization, which extends the $abc$ parameterization of [48] with weight decay exponents $d$.
We use the subscript $t$ to denote the training step. Thus, $U_{t}$, $W_{t}$, and $V_{t}$ denote the parameters after $t$ training steps, and $h_{t}(\mathbf{x})$, $z_{t}(\mathbf{x})$, and $f_{t}(\mathbf{x})$ denote the corresponding activations and network output. Initialization corresponds to $t=0$. For any quantity $q_{t}$, we write $\Delta q_{t}:=q_{t}-q_{t-1}$ to denote its change at step $t$.
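The model of Eqs. (11)–(14) is small enough to write out directly. The sketch below assumes NumPy; the function names (`init_params`, `forward`, `per_layer_scale`) and the example exponent values are illustrative choices, not taken from the text.

```python
import numpy as np


def init_params(n, d_in, d_out, b, rng):
    """Sample U, W, V with entry variances n^{-2 b_l}, as in Eq. (12)."""
    b_u, b_w, b_v = b
    n, nf = int(n), float(n)
    U = rng.normal(0.0, nf ** -b_u, size=(n, d_in))
    W = rng.normal(0.0, nf ** -b_w, size=(n, n))
    V = rng.normal(0.0, nf ** -b_v, size=(d_out, n))
    return U, W, V


def forward(params, x, a):
    """Forward pass of Eq. (11) with per-layer multipliers n^{-a_l}."""
    U, W, V = params
    a_u, a_w, a_v = a
    nf = float(W.shape[0])
    h = nf ** -a_u * (U @ x)
    z = nf ** -a_w * (W @ h)
    f = nf ** -a_v * (V @ z)
    return h, z, f


def per_layer_scale(base, n, exponents):
    """Per-layer values base * n^{-e_l}; covers Eq. (13) and Eq. (14)."""
    return tuple(base * float(n) ** -e for e in exponents)


rng = np.random.default_rng(0)
# Illustrative exponents (assumed for this example only).
a, b, c = (0.0, 0.0, 0.0), (0.0, 0.5, 1.0), (-1.0, 0.0, 1.0)
params = init_params(n=256, d_in=16, d_out=8, b=b, rng=rng)
h, z, f = forward(params, np.ones(16), a)
eta_u, eta_w, eta_v = per_layer_scale(0.1, 256, c)
```

The same `per_layer_scale` helper serves for weight decay by passing $\lambda$ and the exponents $d$ in place of $\eta$ and $c$.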
H.3 Design Principles for Scaling Analysis
For training to remain stable, the activations and their updates must not vanish or explode as the width grows. If the activation updates vanish, the corresponding layer remains frozen at initialization; on the other hand, if they explode, training diverges. We formalize stability through the following two conditions [48].
Desideratum 1 (Stable Initialization).
The hidden activations and the network output remain stable with width at initialization:
$$\|h_{0}\|=\Theta_{n}(1),\qquad\|z_{0}\|=\Theta_{n}(1),\qquad\|f_{0}\|=\mathcal{O}_{n}(1).\tag{15}$$
The hidden activations $h_{0}$ and $z_{0}$ are required to be $\Theta_{n}(1)$ for stable signal propagation across layers [40]. The network output $f_{0}$, in contrast, has a looser $\mathcal{O}_{n}(1)$ constraint because initializing the last layer with small weights does not affect signal propagation through the network [46].
Desideratum 2 (Stable Updates at the First Step).
At the first gradient step $t=1$, the hidden activation and network output updates remain stable with width:
$$\|\Delta h_{1}\|=\Theta_{n}(1),\qquad\|\Delta z_{1}\|=\Theta_{n}(1),\qquad\|\Delta f_{1}\|=\Theta_{n}(1).\tag{16}$$
Here, we impose this condition at $t=1$ so that the relevant computations are tractable. The same condition is desirable at every training step; however, extending it to a general training step $t$ requires further assumptions on weights, activations, and their alignment, which we introduce in Section H.13.
It is widely observed that parameterizations satisfying Desiderata 1 and 2 (e.g., μP) exhibit learning rate transfer: optimal learning rates at small width remain optimal at large widths. The implication, however, is non-trivial: ensuring that activations and their updates remain stable does not by itself imply that the optimal hyperparameters are width-independent, particularly in the limit where the number of training steps is much larger than the width. Several works support transfer under μP, though in restricted settings. The simplest setting is that of Kalra et al. [22], who showed that, for a two-layer linear network trained on a single example, the dynamical equations under μP have no width dependence whatsoever, directly implying hyperparameter transfer without any assumptions. A recent work by Hayou [15] provided a formal proof of learning-rate transfer under μP in restricted settings of linear networks with input and output layers fixed at initialization. At the first gradient step, they provide an explicit limit and convergence rate; by comparison, at any fixed step $t$, they establish transfer under the additional assumption that the loss has a unique minimizer in $\eta$, but do not provide convergence rates as in the first-step case. Beyond linear networks, Yaida [46] showed that the leading-order terms in the function-space update equations are width-independent under μP, suggesting that the dynamics, and by extension the optimal hyperparameters, may also be width-stable in general neural networks. The subleading term, however, can accumulate over long training horizons and contribute to the optimal learning rate. Finally, Noci et al. [33] empirically observed that the sharpness (largest Hessian eigenvalue) does not vary with width under μP, providing complementary evidence that the dominant loss landscape features remain invariant in μP. Proving learning rate transfer in generic settings remains an open question.
H.4 Conditions for Stable Initialization
In this section, we derive the conditions on $\{a,b\}$ required to satisfy Desideratum 1; the conditions on exponents $\{c,d\}$ will be imposed by the training dynamics in the later sections. At initialization, the weights and their inputs are independent, so the scale of each layer’s output can be easily computed by averaging over the weight distribution while treating the input as fixed. Since all the quantities in this section are at $t=0$, we suppress the time index for brevity.
First layer.
The forward pass of the first layer is:
$$h(\mathbf{x})=n^{-a_{u}}U\mathbf{x}.$$
Let $\langle\cdot\rangle$ denote expectation over the weight distribution. The expected squared norm of $h(\mathbf{x})$ is:
$$\begin{aligned}
\left\langle\|h(\mathbf{x})\|^{2}\right\rangle
&=\left\langle\frac{1}{n}\sum_{i=1}^{n}h_{i}^{2}(\mathbf{x})\right\rangle
=\frac{1}{n}\sum_{i=1}^{n}n^{-2a_{u}}\sum_{j,k=1}^{d_{\mathrm{in}}}\langle U_{ij}U_{ik}\rangle\,x_{j}x_{k}\\
&=n^{-2a_{u}-1}\sum_{i=1}^{n}\sum_{j=1}^{d_{\mathrm{in}}}n^{-2b_{u}}\,x_{j}^{2}\qquad\left(\text{using }\langle U_{ij}U_{ik}\rangle=n^{-2b_{u}}\delta_{jk}\right)\\
&=n^{-2a_{u}-2b_{u}}\sum_{j=1}^{d_{\mathrm{in}}}x_{j}^{2}.
\end{aligned}\tag{17}$$
To handle the input term, we assume:
Assumption H.1 (Input scaling).
The inputs are normalized so that $\|\mathbf{x}\|=\Theta_{n}(1)$.
Under Assumption H.1, $\sum_{j=1}^{d_{\mathrm{in}}}x_{j}^{2}=\Theta_{n}(1)$, so
$$\left\langle\|h(\mathbf{x})\|^{2}\right\rangle=\Theta_{n}(n^{-2a_{u}-2b_{u}}).\tag{18}$$
Requiring $\|h(\mathbf{x})\|=\Theta_{n}(1)$ yields:
$$\boxed{a_{u}+b_{u}=0.}\tag{19}$$
Middle layer.
The forward pass of the middle layer is:
$$z(\mathbf{x})=n^{-a_{w}}Wh(\mathbf{x}).$$
The expected squared norm of $z(\mathbf{x})$, with the expectation taken over $W$ at fixed $h(\mathbf{x})$, is:
$$\begin{aligned}
\left\langle\|z(\mathbf{x})\|^{2}\right\rangle
&=\left\langle\frac{1}{n}\sum_{i=1}^{n}z_{i}^{2}(\mathbf{x})\right\rangle
=\frac{1}{n}\sum_{i=1}^{n}n^{-2a_{w}}\sum_{j,k=1}^{n}\langle W_{ij}W_{ik}\rangle\,h_{j}(\mathbf{x})h_{k}(\mathbf{x})\\
&=n^{-2a_{w}-1}\sum_{i=1}^{n}\sum_{j=1}^{n}n^{-2b_{w}}\,h_{j}^{2}(\mathbf{x})\qquad\left(\text{using }\langle W_{ij}W_{ik}\rangle=n^{-2b_{w}}\delta_{jk}\right)\\
&=n^{-2a_{w}-2b_{w}+1}\,\|h(\mathbf{x})\|^{2}.
\end{aligned}\tag{20}$$
Using $\|h(\mathbf{x})\|=\Theta_{n}(1)$ from the first layer and requiring $\|z(\mathbf{x})\|=\Theta_{n}(1)$, we have:
$$\boxed{a_{w}+b_{w}=\tfrac{1}{2}.}\tag{21}$$
Last layer.
The forward pass of the last layer is:
$$f(\mathbf{x})=n^{-a_{v}}Vz(\mathbf{x}).$$
The calculation follows as in the middle layer, with $V$ in place of $W$ and the output dimension $d_{\mathrm{out}}$ replacing $n$ in the output index:
$$\left\langle\|f(\mathbf{x})\|^{2}\right\rangle=n^{-2a_{v}-2b_{v}+1}\,\|z(\mathbf{x})\|^{2}.\tag{22}$$
Using $\|z(\mathbf{x})\|=\Theta_{n}(1)$ and the one-sided requirement $\|f(\mathbf{x})\|=\mathcal{O}_{n}(1)$:
$$\boxed{a_{v}+b_{v}\geq\tfrac{1}{2}.}\tag{23}$$
Unlike the previous two layers, the last-layer initialization scale is not fixed by stability alone, as any $a_{v}+b_{v}\geq 1/2$ satisfies the constraint.
The conditions above fix the weight and activation scales only at initialization, and the weights and activations can evolve in a width-dependent manner during training. We will discuss the implications in the later sections.
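The three initialization conditions can be sanity-checked by simulation. The sketch below assumes NumPy and one illustrative exponent choice, $a=(0,0,0)$ and $b=(0,\tfrac12,1)$, which satisfies Eqs. (19), (21), and (23); the input and output dimensions, widths, and number of trials are likewise arbitrary choices for the example. For this choice the derivation predicts width exponents of roughly $0$, $0$, and $-\tfrac12$ for $\|h\|$, $\|z\|$, and $\|f\|$.

```python
import numpy as np


def mean_square(x):
    return float(np.mean(np.square(x)))


def init_scales(n, a, b, d_in=16, d_out=64, trials=8, seed=0):
    """RMS norms of h, z, f at initialization, averaged over `trials` draws."""
    rng = np.random.default_rng(seed)
    (a_u, a_w, a_v), (b_u, b_w, b_v) = a, b
    n, nf = int(n), float(n)
    x = np.ones(d_in)  # fixed input with RMS norm 1 (Assumption H.1)
    sq = np.zeros(3)
    for _ in range(trials):
        U = rng.normal(0.0, nf ** -b_u, size=(n, d_in))
        W = rng.normal(0.0, nf ** -b_w, size=(n, n))
        V = rng.normal(0.0, nf ** -b_v, size=(d_out, n))
        h = nf ** -a_u * (U @ x)
        z = nf ** -a_w * (W @ h)
        f = nf ** -a_v * (V @ z)
        sq += [mean_square(h), mean_square(z), mean_square(f)]
    return np.sqrt(sq / trials)


def fitted_exponents(widths, a, b):
    """Slope of log-norm vs. log-width for each of h, z, f."""
    norms = np.array([init_scales(n, a, b) for n in widths])
    log_n = np.log(np.asarray(widths, dtype=float))
    return [float(np.polyfit(log_n, np.log(norms[:, k]), 1)[0]) for k in range(3)]


widths = [64, 128, 256, 512, 1024]
exps = fitted_exponents(widths, a=(0.0, 0.0, 0.0), b=(0.0, 0.5, 1.0))
print("fitted exponents for h, z, f:", [round(e, 3) for e in exps])
```

Changing `b` so that a condition is violated (for example $b_w=0$ with $a_w=0$) should make the corresponding fitted exponent move away from zero, in line with Eq. (20).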
H.5 Gradient Scaling at Initialization
Given a dataset $\mathcal{D}=\{(\mathbf{x}^{\mu},\mathbf{y}^{\mu})\}_{\mu=1}^{D}$ and a per-example loss function $\ell:\mathbb{R}^{d_{\mathrm{out}}}\times\mathbb{R}^{d_{\mathrm{out}}}\to\mathbb{R}$, we define the training loss as the empirical average:
$$L(\boldsymbol{\theta})=\frac{1}{D}\sum_{\mu=1}^{D}\ell\!\left(f(\mathbf{x}^{\mu}),\mathbf{y}^{\mu}\right).\tag{24}$$
The per-layer gradients are:
$$\begin{aligned}
g_{v}:=\nabla_{v}L(\boldsymbol{\theta})&=\frac{1}{D}\sum_{\mu=1}^{D}n^{-a_{v}}\nabla_{f}\ell^{\mu}\cdot z(\mathbf{x}^{\mu})^{\top},\\
g_{w}:=\nabla_{w}L(\boldsymbol{\theta})&=\frac{1}{D}\sum_{\mu=1}^{D}n^{-a_{w}-a_{v}}V^{\top}\nabla_{f}\ell^{\mu}\cdot h(\mathbf{x}^{\mu})^{\top},\\
g_{u}:=\nabla_{u}L(\boldsymbol{\theta})&=\frac{1}{D}\sum_{\mu=1}^{D}n^{-a_{u}-a_{w}-a_{v}}W^{\top}V^{\top}\nabla_{f}\ell^{\mu}\cdot\mathbf{x}^{\mu\top},
\end{aligned}\tag{25}$$
where $\nabla_{f}\ell^{\mu}:=\nabla_{f}\ell(f(\mathbf{x}^{\mu}),\mathbf{y}^{\mu})$. Next, we assume that the loss derivative does not scale with width:
Assumption H.2 (Loss derivative scaling).
For every example $(\mathbf{x}^{\mu},\mathbf{y}^{\mu})\in\mathcal{D}$, the per-example loss derivative satisfies $\|\nabla_{f}\ell^{\mu}\|=\Theta_{n}(1)$.
We first analyze the per-example terms. For $g_{v}$, the per-example term $\nabla_{f}\ell^{\mu}\cdot z(\mathbf{x}^{\mu})^{\top}$ is an outer product of two vectors with $\Theta_{n}(1)$ entries, so:
$$\|n^{-a_{v}}\nabla_{f}\ell^{\mu}z(\mathbf{x}^{\mu})^{\top}\|=n^{-a_{v}}\|\nabla_{f}\ell^{\mu}\|\|z(\mathbf{x}^{\mu})\|=\Theta_{n}(n^{-a_{v}}).\tag{26}$$
For $g_{w}$, the per-example term is the outer product $(V^{\top}\nabla_{f}\ell^{\mu})h(\mathbf{x}^{\mu})^{\top}$. The left factor is a matrix-vector product along the output dimension $d_{\mathrm{out}}$ and does not pick up a width scaling factor. Therefore, we can write:
$$\langle\|V^{\top}\nabla_{f}\ell^{\mu}\|^{2}\rangle\sim n^{-2b_{v}}\,\|\nabla_{f}\ell^{\mu}\|^{2}=\Theta_{n}(n^{-2b_{v}}).$$
This gives $\|V^{\top}\nabla_{f}\ell^{\mu}\|=\Theta_{n}(n^{-b_{v}})$. The right factor satisfies $\|h(\mathbf{x}^{\mu})\|=\Theta_{n}(1)$, so:
$$\|n^{-a_{w}-a_{v}}V^{\top}\nabla_{f}\ell^{\mu}h(\mathbf{x}^{\mu})^{\top}\|=n^{-a_{w}-a_{v}}\|V^{\top}\nabla_{f}\ell^{\mu}\|\|h(\mathbf{x}^{\mu})\|=\Theta_{n}(n^{-a_{w}-a_{v}-b_{v}}).\tag{27}$$
For $g_{u}$, the per-example term is the outer product $(W^{\top}V^{\top}\nabla_{f}\ell^{\mu})\mathbf{x}^{\mu\top}$. The left factor is a matrix-vector product, whose expected squared RMS norm at initialization is:
$$\langle\|W^{\top}V^{\top}\nabla_{f}\ell^{\mu}\|^{2}\rangle\sim n^{-2b_{w}+1}\,\|V^{\top}\nabla_{f}\ell^{\mu}\|^{2}=\Theta_{n}(n^{-2b_{w}-2b_{v}+1}).$$
Thus, $\|W^{\top}V^{\top}\nabla_{f}\ell^{\mu}\|=\Theta_{n}(n^{-b_{w}-b_{v}+\frac{1}{2}})$. Using $a_{w}+b_{w}=\frac{1}{2}$, this scaling becomes $\Theta_{n}(n^{a_{w}-b_{v}})$. The right factor satisfies $\|\mathbf{x}^{\mu}\|=\Theta_{n}(1)$, so:
$$\|n^{-a_{u}-a_{w}-a_{v}}W^{\top}V^{\top}\nabla_{f}\ell^{\mu}\mathbf{x}^{\mu\top}\|=n^{-a_{u}-a_{w}-a_{v}}\|W^{\top}V^{\top}\nabla_{f}\ell^{\mu}\|\|\mathbf{x}^{\mu}\|=\Theta_{n}(n^{-a_{u}-a_{v}-b_{v}}).\tag{28}$$
In summary, the per-example contributions scale as:
$$\|g_{v}^{\mu}\|=\Theta_{n}(n^{-a_{v}}),\qquad\|g_{w}^{\mu}\|=\Theta_{n}(n^{-a_{w}-a_{v}-b_{v}}),\qquad\|g_{u}^{\mu}\|=\Theta_{n}(n^{-a_{u}-a_{v}-b_{v}}).\tag{29}$$
The full gradient $g_{l}$ is obtained by averaging the per-example contributions:
$$g_{l}=\frac{1}{D}\sum_{\mu=1}^{D}g_{l}^{\mu}.\tag{30}$$
We make the following assumption on this aggregation:
Assumption H.3 (Aggregation does not induce width-scaling).
The average $\frac{1}{D}\sum_{\mu=1}^{D}g_{l}^{\mu}$ inherits the width-scaling of the per-example terms.
Under Assumption H.3, the layerwise gradients scale as:
$$\|g_{v}\|=\Theta_{n}(n^{-a_{v}}),\qquad\|g_{w}\|=\Theta_{n}(n^{-a_{w}-a_{v}-b_{v}}),\qquad\|g_{u}\|=\Theta_{n}(n^{-a_{u}-a_{v}-b_{v}}).\tag{31}$$
The assumption rules out the gradient acquiring extra width scaling from aggregation. This is a benign assumption, as we do not expect a width dependence from aggregating. Without it, the triangle inequality provides only $\mathcal{O}_{n}$ upper bounds.
Similar to the activations, the gradient scales can also evolve in a width-dependent manner during training. We will discuss the implications in the later sections.
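The per-example gradient scalings of Eq. (29) can be checked numerically at initialization. The sketch below assumes NumPy, a single example, a fixed all-ones stand-in for the loss derivative (so that Assumption H.2 holds by construction), and the illustrative exponents $a=(0,0,0)$, $b=(0,\tfrac12,1)$; none of these specifics come from the text. For this choice Eq. (29) predicts exponents $0$, $-1$, and $-1$ for $g_v$, $g_w$, and $g_u$.

```python
import numpy as np


def mean_square(x):
    return float(np.mean(np.square(x)))


def grad_scales(n, a, b, d_in=16, d_out=64, trials=4, seed=0):
    """RMS norms of the per-example gradients g_v, g_w, g_u at initialization."""
    rng = np.random.default_rng(seed)
    (a_u, a_w, a_v), (b_u, b_w, b_v) = a, b
    n, nf = int(n), float(n)
    x = np.ones(d_in)    # input with RMS norm 1 (Assumption H.1)
    dl = np.ones(d_out)  # stand-in loss derivative with RMS norm 1 (Assumption H.2)
    sq = np.zeros(3)
    for _ in range(trials):
        U = rng.normal(0.0, nf ** -b_u, size=(n, d_in))
        W = rng.normal(0.0, nf ** -b_w, size=(n, n))
        V = rng.normal(0.0, nf ** -b_v, size=(d_out, n))
        h = nf ** -a_u * (U @ x)
        z = nf ** -a_w * (W @ h)
        # Per-example terms of Eq. (25).
        g_v = nf ** -a_v * np.outer(dl, z)
        g_w = nf ** -(a_w + a_v) * np.outer(V.T @ dl, h)
        g_u = nf ** -(a_u + a_w + a_v) * np.outer(W.T @ (V.T @ dl), x)
        sq += [mean_square(g_v), mean_square(g_w), mean_square(g_u)]
    return np.sqrt(sq / trials)


widths = [64, 128, 256, 512, 1024]
a, b = (0.0, 0.0, 0.0), (0.0, 0.5, 1.0)
norms = np.array([grad_scales(n, a, b) for n in widths])
log_n = np.log(np.asarray(widths, dtype=float))
grad_exps = [float(np.polyfit(log_n, np.log(norms[:, k]), 1)[0]) for k in range(3)]
print("fitted exponents for g_v, g_w, g_u:", [round(e, 3) for e in grad_exps])
```

Because only one example is used, the check does not exercise Assumption H.3; it covers the per-example scalings only.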
H.6 Alignment and Feature Learning Exponents
Following [10], we define three alignment exponents $\rho,\omega,\sigma\in[0,1]$ per layer that quantify the correlations between weight matrices, their updates, and activation vectors at the first gradient step.
Definition H.3 (Alignment exponents).
For each layer, we define:
$$\begin{aligned}
\rho_{v,1}&:=\lim_{n\to\infty}\log_{n}\frac{\|\Delta V_{1}\,z_{0}\|}{\|\Delta V_{1}\|\|z_{0}\|},&
\rho_{w,1}&:=\lim_{n\to\infty}\log_{n}\frac{\|\Delta W_{1}\,h_{0}\|}{\|\Delta W_{1}\|\|h_{0}\|},\\
\omega_{v,1}&:=\lim_{n\to\infty}\log_{n}\frac{\|V_{0}\,\Delta z_{1}\|}{\|V_{0}\|\|\Delta z_{1}\|},&
\omega_{w,1}&:=\lim_{n\to\infty}\log_{n}\frac{\|W_{0}\,\Delta h_{1}\|}{\|W_{0}\|\|\Delta h_{1}\|},\\
\sigma_{v,1}&:=\lim_{n\to\infty}\log_{n}\frac{\|\Delta V_{1}\,\Delta z_{1}\|}{\|\Delta V_{1}\|\|\Delta z_{1}\|},&
\sigma_{w,1}&:=\lim_{n\to\infty}\log_{n}\frac{\|\Delta W_{1}\,\Delta h_{1}\|}{\|\Delta W_{1}\|\|\Delta h_{1}\|}.
\end{aligned}\tag{32}$$
We do not define alignment exponents for the first layer because the matrix-vector product $U\mathbf{x}$ does not involve the width, and thus no alignment exponent is needed.
Definition H.4 (Activation Update exponents).
Let $\Delta h_{1}$, $\Delta z_{1}$, and $\Delta f_{1}$ denote the activation updates at the first gradient step. We define:
$$r_{u,1}:=\lim_{n\to\infty}\log_{n}\|\Delta h_{1}\|,\quad r_{w,1}:=\lim_{n\to\infty}\log_{n}\|\Delta z_{1}\|,\quad r_{v,1}:=\lim_{n\to\infty}\log_{n}\|\Delta f_{1}\|.\tag{33}$$
These definitions generalize to any step $t$ by replacing the subscript $1$ with $t$ throughout. We work at $t=1$ in the following sections because the scaling of weights, updates, and activations is tractable at the first step.
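To build intuition for Definition H.3, the sketch below evaluates the ratio inside the limit at one finite width for two synthetic cases. It assumes NumPy; the matrices are constructed by hand rather than obtained from training, and `alignment_exponent` is an illustrative helper name. A matrix with i.i.d. entries that is independent of the vector should give an exponent near $\tfrac12$, while a rank-one matrix whose rows are parallel to the vector gives exactly $1$.

```python
import numpy as np


def rms(x):
    return float(np.sqrt(np.mean(np.square(x))))


def alignment_exponent(M, v):
    """Finite-width estimate of log_n(||M v|| / (||M|| ||v||)) with RMS norms.

    n is the contracted dimension, i.e. the length of v.
    """
    n = v.shape[0]
    ratio = rms(M @ v) / (rms(M) * rms(v))
    return float(np.log(ratio) / np.log(n))


n = 1024
rng = np.random.default_rng(0)
h = rng.normal(size=n)

W_random = rng.normal(size=(n, n))            # independent of h
dW_aligned = np.outer(rng.normal(size=n), h)  # rows parallel to h

exp_random = alignment_exponent(W_random, h)
exp_aligned = alignment_exponent(dW_aligned, h)
print(f"independent matrix: {exp_random:.3f}, aligned rank-one matrix: {exp_aligned:.3f}")
```

The rank-one case mirrors the structure of a single-example gradient update, which is an outer product with the layer input; this is a heuristic illustration and not a substitute for the limits in Eq. (32).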
H.7 Stable Update Conditions for SGD
We now derive the conditions on $\{a,b,c\}$ that satisfy Desideratum 2 for SGD, assuming Desideratum 1 holds. We analyze the weight decay scaling in Section H.9 for SGD and Adam together, as it follows the same logic for both optimizers.
SGD update.
For SGD with per-layer learning rate $\eta_{l}=\eta\,n^{-c_{l}}$, the parameter update at step $t$ is:
$$U_{t+1}=U_{t}-\eta_{u}\,g_{u}(\boldsymbol{\theta}_{t}),\qquad W_{t+1}=W_{t}-\eta_{w}\,g_{w}(\boldsymbol{\theta}_{t}),\qquad V_{t+1}=V_{t}-\eta_{v}\,g_{v}(\boldsymbol{\theta}_{t}).\tag{34}$$
At the first step, $\|\Delta U_{1}\|=\eta_{u}\|g_{u}\|$, $\|\Delta W_{1}\|=\eta_{w}\|g_{w}\|$, and $\|\Delta V_{1}\|=\eta_{v}\|g_{v}\|$. Using the gradient scalings at initialization from Section H.5:
$$\|\Delta U_{1}\|=\Theta_{n}(n^{-c_{u}-a_{u}-a_{v}-b_{v}}),\quad\|\Delta W_{1}\|=\Theta_{n}(n^{-c_{w}-a_{w}-a_{v}-b_{v}}),\quad\|\Delta V_{1}\|=\Theta_{n}(n^{-c_{v}-a_{v}}).\tag{35}$$
Desideratum 2 requires $\|\Delta h_{1}\|,\|\Delta z_{1}\|,\|\Delta f_{1}\|=\Theta_{n}(1)$, which we enforce by setting the activation update exponents to zero: $r_{u,1}=r_{w,1}=r_{v,1}=0$.
First layer.
The activation update is $\Delta h_{1}(\mathbf{x})=n^{-a_{u}}\Delta U_{1}\,\mathbf{x}$. Since the matrix-vector product $U\mathbf{x}$ does not scale with width, we can write $\|\Delta U_{1}\mathbf{x}\|=C\|\Delta U_{1}\|\|\mathbf{x}\|$ for some width-independent constant $C$. As a result, the activation update scales as:
$$\|\Delta h_{1}(\mathbf{x})\|=Cn^{-a_{u}}\|\Delta U_{1}\|\|\mathbf{x}\|=\Theta_{n}(n^{-2a_{u}-c_{u}-a_{v}-b_{v}}).\tag{36}$$
Setting $r_{u,1}=0$:
$$\boxed{2a_{u}+c_{u}+a_{v}+b_{v}=0.}\tag{37}$$
Middle layer.
The activation $z(\mathbf{x})$ at step $t=1$ is:
$$z_{1}(\mathbf{x})=n^{-a_{w}}(W_{0}+\Delta W_{1})(h_{0}(\mathbf{x})+\Delta h_{1}(\mathbf{x})). \tag{38}$$
After expanding and subtracting $z_{0}(\mathbf{x})=n^{-a_{w}}W_{0}h_{0}(\mathbf{x})$, the activation update decomposes into three terms:
$$\Delta z_{1}(\mathbf{x})=\underbrace{n^{-a_{w}}\Delta W_{1}\,h_{0}(\mathbf{x})}_{\text{weight-update term}}\;+\;\underbrace{n^{-a_{w}}W_{0}\,\Delta h_{1}(\mathbf{x})}_{\text{activation-update term}}\;+\;\underbrace{n^{-a_{w}}\Delta W_{1}\,\Delta h_{1}(\mathbf{x})}_{\text{second-order term}}. \tag{39}$$
By the triangle inequality:
$$\|\Delta z_{1}(\mathbf{x})\|\;\leq\;n^{-a_{w}}\!\left(\|\Delta W_{1}h_{0}(\mathbf{x})\|+\|W_{0}\Delta h_{1}(\mathbf{x})\|+\|\Delta W_{1}\Delta h_{1}(\mathbf{x})\|\right). \tag{40}$$
We make the following assumption on the superposition of the three terms:
Assumption H.4 (Superposition does not induce width-scaling).
The superposition of the three terms in $\Delta z_{1}(\mathbf{x})$ does not introduce additional width-scaling: if each term is $\Theta_{n}(n^{\alpha})$, then their sum is $\Theta_{n}(n^{\alpha})$.
Under Assumption H.4, it suffices to analyze the width-scaling of each term individually. Throughout this derivation, we impose the condition $r_{u,1}=0$, i.e., $\|\Delta h_{1}\|=\Theta_{n}(1)$.
*Weight-update term.* By the definition of the alignment exponent (Definition H.3), $\|\Delta W_{1}\,h_{0}\|=n^{\rho_{w,1}}\,\|\Delta W_{1}\|\,\|h_{0}\|$. Using $\|\Delta W_{1}\|=\Theta_{n}(n^{-c_{w}-a_{w}-a_{v}-b_{v}})$ from the SGD update and $\|h_{0}\|=\Theta_{n}(1)$:
$$n^{-a_{w}}\|\Delta W_{1}\,h_{0}(\mathbf{x})\|=\Theta_{n}(n^{\rho_{w,1}-2a_{w}-c_{w}-a_{v}-b_{v}}). \tag{41}$$
Requiring this to be $\Theta_{n}(1)$:
$$\boxed{\rho_{w,1}-2a_{w}-c_{w}-a_{v}-b_{v}=0.} \tag{42}$$
*Activation-update term.* Similarly, $\|W_{0}\,\Delta h_{1}\|=n^{\omega_{w,1}}\,\|W_{0}\|\,\|\Delta h_{1}\|$. Using $\|W_{0}\|=\Theta_{n}(n^{-b_{w}})$ at initialization, $\|\Delta h_{1}\|=\Theta_{n}(1)$, and $a_{w}+b_{w}=1/2$:
$$n^{-a_{w}}\|W_{0}\,\Delta h_{1}(\mathbf{x})\|=\Theta_{n}(n^{\omega_{w,1}-a_{w}-b_{w}})=\Theta_{n}(n^{\omega_{w,1}-1/2}). \tag{43}$$
Requiring this to be $\Theta_{n}(1)$ fixes the alignment exponent itself:
$$\boxed{\omega_{w,1}-\tfrac{1}{2}=0.} \tag{44}$$
Unlike $\rho_{w,1}$ (and $\sigma_{w,1}$, as we will see below), $\omega_{w,1}$ is not a free variable if we demand stability and non-vanishing updates ($r_{w,1}=0$); its value is pinned to $1/2$. This scaling $\omega=1/2$ is expected to hold at the first gradient step, but as training progresses, $\omega_{w,t}$ is free to evolve, and the activation-update term contribution may vanish or diverge as width is scaled.
*Second-order term.* By the definition of the alignment exponent (Definition H.3), $\|\Delta W_{1}\,\Delta h_{1}\|=n^{\sigma_{w,1}}\,\|\Delta W_{1}\|\,\|\Delta h_{1}\|$. Using $\|\Delta W_{1}\|=\Theta_{n}(n^{-c_{w}-a_{w}-a_{v}-b_{v}})$ and $\|\Delta h_{1}\|=\Theta_{n}(1)$:
$$n^{-a_{w}}\|\Delta W_{1}\,\Delta h_{1}(\mathbf{x})\|=\Theta_{n}(n^{\sigma_{w,1}-2a_{w}-c_{w}-a_{v}-b_{v}}). \tag{45}$$
Requiring this to be $\Theta_{n}(1)$:
$$\boxed{\sigma_{w,1}-2a_{w}-c_{w}-a_{v}-b_{v}=0.} \tag{46}$$
The boxed conditions Equations 42, 44 and 46 correspond to requiring *all three* terms in $\Delta z_{1}$ to be $\Theta_{n}(1)$. The minimal requirement for $\|\Delta z_{1}\|=\Theta_{n}(1)$ is weaker: each term must be $\mathcal{O}_{n}(1)$ and at least one must be $\Theta_{n}(1)$, i.e.,
$$\boxed{\max\!\big(\rho_{w,1}-2a_{w}-c_{w}-a_{v}-b_{v},\;\;\omega_{w,1}-\tfrac{1}{2},\;\;\sigma_{w,1}-2a_{w}-c_{w}-a_{v}-b_{v}\big)=0.} \tag{47}$$
Last layer.
The output update decomposes similarly:
$$\Delta f_{1}(\mathbf{x})=\underbrace{n^{-a_{v}}\Delta V_{1}\,z_{0}(\mathbf{x})}_{\text{weight-update term}}+\underbrace{n^{-a_{v}}V_{0}\,\Delta z_{1}(\mathbf{x})}_{\text{activation-update term}}+\underbrace{n^{-a_{v}}\Delta V_{1}\,\Delta z_{1}(\mathbf{x})}_{\text{second-order term}}. \tag{48}$$
Under Assumption H.4, it suffices to analyze each term individually. We impose $r_{w,1}=0$, i.e., $\|\Delta z_{1}\|=\Theta_{n}(1)$.
*Weight-update term.* By the definition of the alignment exponent (Definition H.3), $\|\Delta V_{1}\,z_{0}\|=n^{\rho_{v,1}}\,\|\Delta V_{1}\|\,\|z_{0}\|$. Using $\|\Delta V_{1}\|=\Theta_{n}(n^{-c_{v}-a_{v}})$ from the SGD update and $\|z_{0}\|=\Theta_{n}(1)$:
$$n^{-a_{v}}\|\Delta V_{1}\,z_{0}(\mathbf{x})\|=\Theta_{n}(n^{\rho_{v,1}-2a_{v}-c_{v}}). \tag{49}$$
Requiring this to be $\Theta_{n}(1)$:
$$\boxed{\rho_{v,1}-2a_{v}-c_{v}=0.} \tag{50}$$
*Activation-update term.* Similarly, $\|V_{0}\,\Delta z_{1}\|=n^{\omega_{v,1}}\,\|V_{0}\|\,\|\Delta z_{1}\|$. Using $\|V_{0}\|=\Theta_{n}(n^{-b_{v}})$ at initialization and $\|\Delta z_{1}\|=\Theta_{n}(1)$:
$$n^{-a_{v}}\|V_{0}\,\Delta z_{1}(\mathbf{x})\|=\Theta_{n}(n^{\omega_{v,1}-a_{v}-b_{v}}). \tag{51}$$
Requiring this to be $\Theta_{n}(1)$:
$$\boxed{\omega_{v,1}-a_{v}-b_{v}=0.} \tag{52}$$
Unlike the middle layer, demanding $r_{v,1}=0$ does not pin $\omega_{v,1}$ to a fixed value. It is constrained only through the relation $\omega_{v,1}=a_{v}+b_{v}$, where $a_{v}+b_{v}$ is itself underdetermined due to the weaker initialization constraint (Desideratum 1).
*Second-order term.* By the definition of the alignment exponent (Definition H.3), $\|\Delta V_{1}\,\Delta z_{1}\|=n^{\sigma_{v,1}}\,\|\Delta V_{1}\|\,\|\Delta z_{1}\|$. Using the same scalings:
$$n^{-a_{v}}\|\Delta V_{1}\,\Delta z_{1}(\mathbf{x})\|=\Theta_{n}(n^{\sigma_{v,1}-2a_{v}-c_{v}}). \tag{53}$$
Requiring this to be $\Theta_{n}(1)$:
$$\boxed{\sigma_{v,1}-2a_{v}-c_{v}=0.} \tag{54}$$
As in the middle layer, the boxed conditions Equations 50, 52 and 54 correspond to requiring all three terms in $\Delta f_{1}$ to be $\Theta_{n}(1)$. The minimal requirement for $\|\Delta f_{1}\|=\Theta_{n}(1)$ is weaker:
$$\boxed{\max\!\big(\rho_{v,1}-2a_{v}-c_{v},\;\;\omega_{v,1}-a_{v}-b_{v},\;\;\sigma_{v,1}-2a_{v}-c_{v}\big)=0.} \tag{55}$$
Gauge symmetry for SGD.
The boxed conditions Equations 47 and 55 are invariant under the per-layer transformation:
$$a_{l}\to a_{l}+\Delta_{l},\qquad b_{l}\to b_{l}-\Delta_{l},\qquad c_{l}\to c_{l}-2\Delta_{l},$$
for any $\Delta_{l}\in\mathbb{R}$ and each layer $l\in\{u,w,v\}$. The forward-pass exponent $a_{l}+b_{l}$ is unchanged, and the shift in $c_{l}$ compensates so that the update equations keep the same width scalings. As a consequence, different choices of $(a,b,c)$ describe the same training dynamics. We will later use this freedom to write down several equivalent forms of $\mu$P (Table 4).
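The invariance can be checked mechanically. The sketch below is illustrative only: the function names, the dictionary layout, and the numerical exponent values are assumptions chosen for the example, not a parameterization from the paper. It encodes the SGD width exponents derived in this section (the powers of $n$ in Equations 36, 41, 43, 45, 49, 51 and 53) and compares them before and after a gauge shift.

```python
from fractions import Fraction as F

def sgd_update_exponents(a, b, c, rho, omega, sigma):
    """Width exponents of the first-step activation-update terms under SGD.

    a, b, c map layer names ("u", "w", "v") to exponents; rho, omega, sigma
    map "w" and "v" to alignment exponents. Each value is the power of n in
    the corresponding Theta_n(.) expression of the SGD derivation.
    """
    back = a["v"] + b["v"]  # factor carried by the U and W gradients
    return {
        "first": -(2 * a["u"] + c["u"] + back),
        "mid_weight": rho["w"] - 2 * a["w"] - c["w"] - back,
        "mid_activation": omega["w"] - a["w"] - b["w"],
        "mid_second": sigma["w"] - 2 * a["w"] - c["w"] - back,
        "last_weight": rho["v"] - 2 * a["v"] - c["v"],
        "last_activation": omega["v"] - a["v"] - b["v"],
        "last_second": sigma["v"] - 2 * a["v"] - c["v"],
    }

def gauge_shift(a, b, c, delta, zeta):
    """Apply a_l += delta_l, b_l -= delta_l, c_l -= zeta * delta_l per layer."""
    a2 = {l: a[l] + delta[l] for l in a}
    b2 = {l: b[l] - delta[l] for l in b}
    c2 = {l: c[l] - zeta * delta[l] for l in c}
    return a2, b2, c2

# Arbitrary illustrative exponents (hypothetical, not a recommended setup).
a = {"u": F(0), "w": F(1, 2), "v": F(1)}
b = {"u": F(0), "w": F(0), "v": F(1, 4)}
c = {"u": F(-1), "w": F(1, 3), "v": F(2)}
rho = {"w": F(1), "v": F(1)}
omega = {"w": F(1, 2), "v": F(1)}
sigma = {"w": F(1), "v": F(1)}
delta = {"u": F(3, 7), "w": F(-2), "v": F(5, 4)}

before = sgd_update_exponents(a, b, c, rho, omega, sigma)
a2, b2, c2 = gauge_shift(a, b, c, delta, zeta=2)
after = sgd_update_exponents(a2, b2, c2, rho, omega, sigma)
print(before == after)  # every exponent is unchanged by the zeta=2 shift
```

Exact rational arithmetic is used so that equality is tested without floating-point tolerance. Because every individual exponent is unchanged, the max conditions built from them are unchanged as well.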
H.8 Stable Update Conditions for Adam
We now derive the conditions on $\{a,b,c\}$ that satisfy Desideratum 2 for Adam, assuming Desideratum 1 holds. The only change from SGD is in the scaling of the parameter updates; the activation decompositions and alignment-exponent calculations are otherwise identical. Throughout this section, we use Assumption H.4 on the superposition of the three-term decomposition, and apply the alignment-exponent identities from Definition H.3 (e.g., $\|\Delta Wh\|=n^{\rho_{w}}\|\Delta W\|\|h\|$) without further citation. Weight decay is analyzed in Section H.9 for SGD and Adam together.
Adam update.
Adam computes the update as $\Delta\theta_{l}=-\eta_{l}\cdot m_{l}/(\sqrt{v_{l}}+\epsilon)$, where $m_{l}$ and $v_{l}$ are the first and second moment estimates of $g_{l}$. We assume:
Assumption H.5 (Adam normalized update).
The normalized update $m_{l}/(\sqrt{v_{l}}+\epsilon)$ has $\Theta_{n}(1)$ entries. This is a reasonable assumption as $m_{l}$ scales as $g_{l}$ and $v_{l}$ scales as $g_{l}^{2}$ entrywise, so $m_{l}/\sqrt{v_{l}}$ behaves like a sign-magnitude normalization independent of the gradient width scaling.
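A minimal sketch of why this assumption is plausible at the first step, assuming the standard bias-corrected Adam moments started from zero (the function name and the sample gradient values are hypothetical): rescaling the gradient by several orders of magnitude leaves the normalized update close to the sign of the gradient, as long as the gradient entries stay well above $\epsilon$.

```python
import math

def adam_first_step_direction(grad, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return m_hat / (sqrt(v_hat) + eps) after one Adam step from zero moments."""
    direction = []
    for g in grad:
        m_hat = (1 - beta1) * g / (1 - beta1)      # bias-corrected first moment
        v_hat = (1 - beta2) * g * g / (1 - beta2)  # bias-corrected second moment
        direction.append(m_hat / (math.sqrt(v_hat) + eps))
    return direction

# Hypothetical gradient entries, rescaled to mimic different width scalings.
base_grad = [0.3, -1.2, 0.05, -0.7]
for scale in (1.0, 1e-2, 1e-4):
    scaled = [scale * g for g in base_grad]
    print(scale, [round(d, 3) for d in adam_first_step_direction(scaled)])
```

Once gradient entries become comparable to `eps`, the normalized update shrinks below unit magnitude, so the scale independence holds only away from that regime.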
Under Assumption H.5, the parameter update norms at the first step depend only on the layerwise learning rate:
$$\|\Delta U_{1}\|=\Theta_{n}(n^{-c_{u}}),\qquad\|\Delta W_{1}\|=\Theta_{n}(n^{-c_{w}}),\qquad\|\Delta V_{1}\|=\Theta_{n}(n^{-c_{v}}). \tag{56}$$
Desideratum 2 requires $\|\Delta h_{1}\|,\|\Delta z_{1}\|,\|\Delta f_{1}\|=\Theta_{n}(1)$, which we enforce by setting the activation update exponents to zero: $r_{u,1}=r_{w,1}=r_{v,1}=0$.
First layer.
The activation update is $\Delta h_{1}(\mathbf{x})=n^{-a_{u}}\Delta U_{1}\,\mathbf{x}$. Since the matrix-vector product $U\mathbf{x}$ does not scale with width, we can write $\|\Delta U_{1}\mathbf{x}\|=C\|\Delta U_{1}\|\|\mathbf{x}\|$ for some width-independent constant $C$. As a result:
$$\|\Delta h_{1}(\mathbf{x})\|=Cn^{-a_{u}}\|\Delta U_{1}\|\|\mathbf{x}\|=\Theta_{n}(n^{-a_{u}-c_{u}}). \tag{57}$$
Setting $r_{u,1}=0$:
$$\boxed{a_{u}+c_{u}=0.} \tag{58}$$
Middle layer.
As in the SGD case, the activation update decomposes into three terms:
$$\Delta z_{1}(\mathbf{x})=\underbrace{n^{-a_{w}}\Delta W_{1}\,h_{0}(\mathbf{x})}_{\text{weight-update term}}\;+\;\underbrace{n^{-a_{w}}W_{0}\,\Delta h_{1}(\mathbf{x})}_{\text{activation-update term}}\;+\;\underbrace{n^{-a_{w}}\Delta W_{1}\,\Delta h_{1}(\mathbf{x})}_{\text{second-order term}}. \tag{59}$$
*Weight-update term.* Using $\|\Delta W_{1}\|=\Theta_{n}(n^{-c_{w}})$ and $\|h_{0}\|=\Theta_{n}(1)$:
$$n^{-a_{w}}\|\Delta W_{1}\,h_{0}(\mathbf{x})\|=n^{-a_{w}}\,n^{\rho_{w,1}}\,\|\Delta W_{1}\|\,\|h_{0}\|=\Theta_{n}(n^{\rho_{w,1}-a_{w}-c_{w}}). \tag{60}$$
Requiring this to be $\Theta_{n}(1)$:
$$\boxed{\rho_{w,1}-a_{w}-c_{w}=0.} \tag{61}$$
*Activation-update term.* Using $\|W_{0}\|=\Theta_{n}(n^{-b_{w}})$ at initialization, $\|\Delta h_{1}\|=\Theta_{n}(1)$, and $a_{w}+b_{w}=1/2$:
$$n^{-a_{w}}\|W_{0}\,\Delta h_{1}(\mathbf{x})\|=n^{-a_{w}}\,n^{\omega_{w,1}}\,\|W_{0}\|\,\|\Delta h_{1}\|=\Theta_{n}(n^{\omega_{w,1}-1/2}). \tag{62}$$
Requiring this to be $\Theta_{n}(1)$:
$$\boxed{\omega_{w,1}-\tfrac{1}{2}=0.} \tag{63}$$
As in the SGD case, $\omega_{w,1}$ is not a free variable if we demand stability and non-vanishing updates ($r_{w,1}=0$), and its value gets pinned to $1/2$.
*Second-order term.* Using $\|\Delta W_{1}\|=\Theta_{n}(n^{-c_{w}})$ and $\|\Delta h_{1}\|=\Theta_{n}(1)$:
$$n^{-a_{w}}\|\Delta W_{1}\,\Delta h_{1}(\mathbf{x})\|=n^{-a_{w}}\,n^{\sigma_{w,1}}\,\|\Delta W_{1}\|\,\|\Delta h_{1}\|=\Theta_{n}(n^{\sigma_{w,1}-a_{w}-c_{w}}). \tag{64}$$
Requiring this to be $\Theta_{n}(1)$:
$$\boxed{\sigma_{w,1}-a_{w}-c_{w}=0.} \tag{65}$$
The boxed conditions Equations 61, 63 and 65 correspond to requiring all three terms to be $\Theta_{n}(1)$. The minimal requirement for $\|\Delta z_{1}\|=\Theta_{n}(1)$ is weaker:
$$\boxed{\max\!\big(\rho_{w,1}-a_{w}-c_{w},\;\;\omega_{w,1}-\tfrac{1}{2},\;\;\sigma_{w,1}-a_{w}-c_{w}\big)=0.} \tag{66}$$
Last layer.
The output update decomposes similarly:
$$\Delta f_{1}(\mathbf{x})=\underbrace{n^{-a_{v}}\Delta V_{1}\,z_{0}(\mathbf{x})}_{\text{weight-update term}}+\underbrace{n^{-a_{v}}V_{0}\,\Delta z_{1}(\mathbf{x})}_{\text{activation-update term}}+\underbrace{n^{-a_{v}}\Delta V_{1}\,\Delta z_{1}(\mathbf{x})}_{\text{second-order term}}. \tag{67}$$
*Weight-update term.* Using $\|\Delta V_{1}\|=\Theta_{n}(n^{-c_{v}})$ and $\|z_{0}\|=\Theta_{n}(1)$:
$$n^{-a_{v}}\|\Delta V_{1}\,z_{0}(\mathbf{x})\|=n^{-a_{v}}\,n^{\rho_{v,1}}\,\|\Delta V_{1}\|\,\|z_{0}\|=\Theta_{n}(n^{\rho_{v,1}-a_{v}-c_{v}}). \tag{68}$$
Requiring this to be $\Theta_{n}(1)$:
$$\boxed{\rho_{v,1}-a_{v}-c_{v}=0.} \tag{69}$$
*Activation-update term.* Using $\|V_{0}\|=\Theta_{n}(n^{-b_{v}})$ at initialization and $\|\Delta z_{1}\|=\Theta_{n}(1)$:
$$n^{-a_{v}}\|V_{0}\,\Delta z_{1}(\mathbf{x})\|=n^{-a_{v}}\,n^{\omega_{v,1}}\,\|V_{0}\|\,\|\Delta z_{1}\|=\Theta_{n}(n^{\omega_{v,1}-a_{v}-b_{v}}). \tag{70}$$
Requiring this to be $\Theta_{n}(1)$:
$$\boxed{\omega_{v,1}-a_{v}-b_{v}=0.} \tag{71}$$
As in the SGD case, demanding $r_{v,1}=0$ does not pin $\omega_{v,1}$ to a fixed value; it is constrained only through the relation $\omega_{v,1}=a_{v}+b_{v}$, where $a_{v}+b_{v}$ is itself underdetermined due to the weaker initialization constraint (Desideratum 1).
*Second-order term.* Using $\|\Delta V_{1}\|=\Theta_{n}(n^{-c_{v}})$ and $\|\Delta z_{1}\|=\Theta_{n}(1)$:
$$n^{-a_{v}}\|\Delta V_{1}\,\Delta z_{1}(\mathbf{x})\|=n^{-a_{v}}\,n^{\sigma_{v,1}}\,\|\Delta V_{1}\|\,\|\Delta z_{1}\|=\Theta_{n}(n^{\sigma_{v,1}-a_{v}-c_{v}}). \tag{72}$$
Requiring this to be $\Theta_{n}(1)$:
$$\boxed{\sigma_{v,1}-a_{v}-c_{v}=0.} \tag{73}$$
As in the middle layer, the boxed conditions Equations 69, 71 and 73 correspond to requiring all three terms in $\Delta f_{1}$ to be $\Theta_{n}(1)$. The minimal requirement for $\|\Delta f_{1}\|=\Theta_{n}(1)$ is weaker:
$$\boxed{\max\!\big(\rho_{v,1}-a_{v}-c_{v},\;\;\omega_{v,1}-a_{v}-b_{v},\;\;\sigma_{v,1}-a_{v}-c_{v}\big)=0.} \tag{74}$$
Gauge symmetry for Adam.
The boxed conditions Equations 66 and 74 are invariant under the per-layer transformation:
$$a_{l}\to a_{l}+\Delta_{l},\qquad b_{l}\to b_{l}-\Delta_{l},\qquad c_{l}\to c_{l}-\Delta_{l},$$
for any $\Delta_{l}\in\mathbb{R}$ and each layer $l\in\{u,w,v\}$. The shift in $c_{l}$ differs from the SGD case ($-2\Delta_{l}$ there) because Adam's update norm $\Theta_{n}(n^{-c_{l}})$ depends on $c_{l}$ alone, whereas SGD's $\eta_{l}\|g_{l}\|$ inherits $c_{l}$ explicitly and an additional dependence on $a,b$ implicitly through the gradient.
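The difference between the two gauge shifts can be seen numerically. The following sketch is illustrative: the names and exponent values are hypothetical. It encodes the Adam width exponents from this section (the powers of $n$ in Equations 57, 60, 62, 64, 68, 70 and 72) and applies both the Adam shift ($c_l\to c_l-\Delta_l$) and the SGD-style shift ($c_l\to c_l-2\Delta_l$).

```python
from fractions import Fraction as F

def adam_update_exponents(a, b, c, rho, omega, sigma):
    """Width exponents of the first-step activation-update terms under Adam."""
    return {
        "first": -(a["u"] + c["u"]),
        "mid_weight": rho["w"] - a["w"] - c["w"],
        "mid_activation": omega["w"] - a["w"] - b["w"],
        "mid_second": sigma["w"] - a["w"] - c["w"],
        "last_weight": rho["v"] - a["v"] - c["v"],
        "last_activation": omega["v"] - a["v"] - b["v"],
        "last_second": sigma["v"] - a["v"] - c["v"],
    }

def gauge_shift(a, b, c, delta, zeta):
    """Apply a_l += delta_l, b_l -= delta_l, c_l -= zeta * delta_l per layer."""
    a2 = {l: a[l] + delta[l] for l in a}
    b2 = {l: b[l] - delta[l] for l in b}
    c2 = {l: c[l] - zeta * delta[l] for l in c}
    return a2, b2, c2

# Arbitrary illustrative exponents (hypothetical, not a recommended setup).
a = {"u": F(0), "w": F(1, 2), "v": F(1)}
b = {"u": F(0), "w": F(0), "v": F(1, 4)}
c = {"u": F(0), "w": F(1, 2), "v": F(0)}
rho = {"w": F(1), "v": F(1)}
omega = {"w": F(1, 2), "v": F(1)}
sigma = {"w": F(1), "v": F(1)}
delta = {"u": F(1, 2), "w": F(-1), "v": F(3, 4)}

before = adam_update_exponents(a, b, c, rho, omega, sigma)
adam_shifted = adam_update_exponents(*gauge_shift(a, b, c, delta, zeta=1), rho, omega, sigma)
sgd_style = adam_update_exponents(*gauge_shift(a, b, c, delta, zeta=2), rho, omega, sigma)
print(adam_shifted == before)  # the Adam shift leaves every exponent unchanged
print(sgd_style == before)     # the SGD-style shift does not
```

The second comparison fails because the extra $-\Delta_l$ in $c_l$ is no longer cancelled by a gradient factor, which is the point made in the paragraph above.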
H.9 Weight Decay Scaling
With weight decay, the parameter update at the first step becomes:
$$\Delta\theta_{l}=-\eta_{l}\,\phi(g_{l})-\eta_{l}\lambda_{l}\,\theta_{l}, \tag{75}$$
where $\phi(g_{l})$ is the optimizer's transformation of the gradient: $\phi(g_{l})=g_{l}$ for SGD and $\phi(g_{l})=m_{l}/(\sqrt{v_{l}}+\epsilon)$ for Adam. For the weight decay term to contribute at the same scale as the parameters, we require $\eta_{l}\lambda_{l}=\Theta_{n}(1)$. Using $\eta_{l}=\eta\,n^{-c_{l}}$ and $\lambda_{l}=\lambda\,n^{-d_{l}}$:
$$\boxed{c_{l}+d_{l}=0.} \tag{76}$$
This holds identically for SGD and Adam, since the weight decay term does not depend on $\phi$.
Extended gauge symmetry.
Including weight decay extends the $abc$ gauge symmetry to $abcd$. Since the gauge transformation shifts $c_{l}\to c_{l}-\zeta\Delta_{l}$, the condition $c_{l}+d_{l}=0$ is preserved when the weight decay exponent shifts by the opposite amount:
$$d_{l}\to d_{l}+\zeta\Delta_{l}, \tag{77}$$
where $\zeta=2$ for SGD and $\zeta=1$ for Adam.
H.10 Summary of Stability Conditions
Table 2: Summary of stability conditions for SGD and Adam. Each $r_{l,1}=0$ condition is a max over the three terms (weight-update, activation-update, second-order) in the activation decomposition.
Table 2 summarizes the conditions on the $\{a,b,c,d\}$ exponents that satisfy Desiderata 1 and 2. The activation update conditions $r_{l,1}=0$ at each layer are expressed as a max over the three decomposition terms (weight-update, activation-update, second-order), reflecting the minimal requirement for $\|\Delta h_{1}\|,\|\Delta z_{1}\|,\|\Delta f_{1}\|=\Theta_{n}(1)$. In the next section, we discuss the conditions on the alignment exponents.
H.11 Alignment Exponent Values
The stability conditions constrain the alignment exponents. However, the alignment exponents are not free design choices; they are empirical properties that quantify the correlations between weights, activations, and their updates.
At initialization, the weights and activations are all random and their products behave like products of independent Gaussian vectors (same logic as in Desideratum 1). As a result, all alignment exponents have the value $1/2$:
$$\rho_{l,1}=\omega_{l,1}=\sigma_{l,1}=\tfrac{1}{2}\qquad\text{at initialization}. \tag{78}$$
As training progresses, the weights, activations, and their updates become correlated. As a result, the alignment exponents increase above their random-alignment value of $1/2$ during training, although they typically remain far below the fully aligned value of $1$ [10].
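The two reference values, $1/2$ for random alignment and $1$ for full alignment, can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions: it takes $\|\cdot\|$ to be the root-mean-square (entrywise) norm, which is consistent with $\|W_0\|=\Theta_n(n^{-b_w})$ for entries drawn from $\mathcal{N}(0,n^{-2b_w})$, and it uses a rank-one matrix as a hypothetical stand-in for a fully aligned update. Function names are made up for the example.

```python
import math
import random

def rms(values):
    """Root-mean-square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def alignment_exponent(rows, vec):
    """Finite-width estimate of log_n(||M x|| / (||M|| ||x||)) with RMS norms."""
    n = len(vec)
    product = [sum(m * x for m, x in zip(row, vec)) for row in rows]
    flat = [m for row in rows for m in row]
    return math.log(rms(product) / (rms(flat) * rms(vec)), n)

rng = random.Random(0)
n = 256
h = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Independent Gaussian matrix: no correlation with h.
random_rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

# Rank-one matrix whose right factor equals h: maximal correlation with h.
left = [rng.gauss(0.0, 1.0) for _ in range(n)]
aligned_rows = [[l * x for x in h] for l in left]

rho_random = alignment_exponent(random_rows, h)
rho_aligned = alignment_exponent(aligned_rows, h)
print(f"random: {rho_random:.3f}, aligned: {rho_aligned:.3f}")
```

At a single finite width the random estimate only approximates $1/2$; the definition itself is a limit in $n$, so a more careful estimate would fit a slope across several widths.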
To derive $\mu$P, Yang and Hu [48] first impose $r_{u,1}=r_{w,1}=r_{v,1}=0$, so that all activation updates are $\Theta_{n}(1)$. This constrains the hidden layer exponent $\omega_{w}=1/2$. Since weights, activations, and updates become correlated during training, Yang and Hu [48] assume that the remaining free exponents attain their maximum value:
Assumption H.6 ($\mu$P Full Alignment Assumption).
After imposing $r_{u,1}=r_{w,1}=r_{v,1}=0$, all remaining free alignment exponents are set to $1$:
$$\rho_{w,1}=\rho_{v,1}=\sigma_{w,1}=\sigma_{v,1}=\omega_{v,1}=1. \tag{79}$$
Table 3 shows the conditions on $abcd$ after imposing the above assumption. Note that the assumption $\omega_{v}=1$ tightens the condition from $a_{v}+b_{v}\geq 1/2$ to $a_{v}+b_{v}=1$. Everett et al. [10] relax this assumption to $\omega_{v,1}=1/2$, and observe that hyperparameter transfer still holds empirically, which suggests that $\mu$P's full alignment assumption may be excessive.
Table 3: Conditions on exponents $(a,b,c,d)$ for SGD and Adam obtained after imposing Assumption H.6.
H.12 Gauge Symmetry and Parameterization Choice
Even with these constraints, the system remains underdetermined due to the gauge symmetry of the $abcd$ parameterization. One exponent per layer must be fixed to obtain a specific parameterization. These choices are typically made based on implementation convenience. For instance, setting $a=0$ removes the explicit width factors in the forward pass, while setting all learning rate exponents $c=0$ allows using a single global learning rate. The most common choice is the canonical $\mu$P proposed in Yang and Hu [48]. Table 4 summarizes several such implementations, all of which are gauge-equivalent and yield identical training dynamics under SGD and Adam. This symmetry, however, can be broken by operations that modify the gradients, such as gradient clipping or Adam's $\epsilon$.
H.13 Multi-Step Scaling
So far we have analyzed the scaling of activation updates at the first training step. At later steps, the weights, activations, gradients, and alignment exponents can evolve in a width-dependent way. To extend the first-step scaling calculation to multiple steps, we assume that the weight and activation scalings persist throughout training.
Assumption H.7 (Persistent weight and activation scaling).
For any training step $t$, we assume:
$$\|U_{t}\|=\Theta_{n}(n^{-b_{u}}),\qquad\|W_{t}\|=\Theta_{n}(n^{-b_{w}}),\qquad\|V_{t}\|=\Theta_{n}(n^{-b_{v}}), \tag{80}$$
and
$$\|h_{t}\|=\Theta_{n}(1),\qquad\|z_{t}\|=\Theta_{n}(1),\qquad\|f_{t}\|=\mathcal{O}_{n}(1). \tag{81}$$
For SGD, we additionally assume that the gradient scalings remain the same as at initialization.
Assumption H.8 (Persistent gradient scaling for SGD).
For all training steps $t\leq T$,
$$\|g_{v,t}\|=\Theta_{n}(n^{-a_{v}}),\qquad\|g_{w,t}\|=\Theta_{n}(n^{-a_{w}-a_{v}-b_{v}}),\qquad\|g_{u,t}\|=\Theta_{n}(n^{-a_{u}-a_{v}-b_{v}}). \tag{82}$$
Adam does not require this gradient-scaling assumption, because under Assumption H.5 the normalized update has $\Theta_{n}(1)$ entries.
Under these assumptions, the first-step scaling analysis can be repeated at any fixed step $t$. However, this still does not control accumulation over a number of steps that grows with width. We therefore also assume a width-independent training horizon.
Assumption H.9 (Width-independent training horizon).
The number of training steps does not scale with width:
$$T=\Theta_{n}(1). \tag{83}$$
This assumption is quite restrictive and does not hold in realistic training regimes where the number of optimization steps is much larger than the width, or where the number of steps itself scales with width. This occurs, for example, in compute-optimal training. For a width-$n$ model with $\Theta(n^{2})$ parameters, the token budget scales with the parameter count, and as a result the number of optimization steps scales as $T(n)=\Theta_{n}(n^{2})$. In such regimes, different-width models are trained for different numbers of steps and therefore accumulate different total update magnitudes. Fixed-step scaling arguments thus do not directly guarantee that training remains consistent across widths.
Table 4: Comparison of different $\mu$P implementations for SGD and Adam. All implementations are gauge-equivalent and yield identical training dynamics. Weight decay exponents satisfy $c_{l}+d_{l}=0$.
H.14 Attention and LayerNorm scaling
In standard transformers, two additional layer types are commonly used beyond the linear layers analyzed so far: attention and LayerNorm. These two have been analyzed in Yang and Hu [48] and Dey et al. [9]. We treat them in turn, starting with attention.
Attention Scaling.
For queries, keys, and values $Q,K,V\in\mathbb{R}^{T\times d}$, where $T$ is the context length and $d$ is the head dimension, the attention mechanism computes:
$$A=\mathrm{softmax}\!\left(\frac{QK^{\top}}{d^{\alpha}}\right)V, \tag{84}$$
where $d^{\alpha}$ is the attention scaling factor. If the logits are too large, the softmax concentrates all mass on a single entry. To avoid this, we require the logit norm to satisfy $\|QK^{\top}/d^{\alpha}\|=\Theta_{d}(1)$. Assuming $\|Q\|,\|K\|=\Theta_{d}(1)$, the dot product norm scales according to the alignment exponent:
$$\rho_{a}:=\lim_{d\to\infty}\log_{d}\frac{\|QK^{\top}\|}{\|Q\|\|K\|}, \tag{85}$$
giving $\|QK^{\top}\|=\Theta_{d}(d^{\rho_{a}})$. At initialization, $W_{Q}$ and $W_{K}$ are independent Gaussian matrices, so $Q$ and $K$ behave as independent random matrices and $\|QK^{\top}\|=\Theta_{d}(\sqrt{d})$, giving $\alpha=1/2$, which is the scaling used in the original Transformer paper [45]. During training, $Q$ and $K$ should develop correlations between them. Under the full alignment assumption (analogous to Assumption H.6), $\|QK^{\top}\|=\Theta_{d}(d)$, giving $\alpha=1$, which is the $\mu$P prescription [48].
The alignment exponent $\rho_{a}$ itself can depend on width $n$ and evolve during training, so neither prescription is *correct* in general. That said, $\alpha=1$ acts as an upper bound, since $\rho_{a}\leq 1$.
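The two limiting cases can be illustrated with a single attention logit $q\cdot k$. The sketch below is a toy simulation under stated assumptions: query and key entries are standard Gaussians, "fully aligned" is modeled by setting $k=q$, and the exponent is estimated as a log-log slope between two head dimensions. The function names and the chosen dimensions are hypothetical.

```python
import math
import random

def logit_rms(d, aligned, trials, rng):
    """RMS over trials of one pre-softmax logit q . k with Theta(1) entries."""
    total = 0.0
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = q if aligned else [rng.gauss(0.0, 1.0) for _ in range(d)]
        logit = sum(qi * ki for qi, ki in zip(q, k))
        total += logit * logit
    return math.sqrt(total / trials)

def fitted_exponent(aligned, d_small=64, d_large=512, trials=400, seed=0):
    """Slope of log(logit RMS) against log(d) between two head dimensions."""
    rng = random.Random(seed)
    small = logit_rms(d_small, aligned, trials, rng)
    large = logit_rms(d_large, aligned, trials, rng)
    return math.log(large / small) / math.log(d_large / d_small)

print("independent q, k:", round(fitted_exponent(aligned=False), 2))
print("fully aligned k = q:", round(fitted_exponent(aligned=True), 2))
```

The first estimate should land near $1/2$ and the second near $1$, matching the $1/\sqrt{d}$ and $1/d$ scalings. Logits in a trained model would sit somewhere in between, which is why neither prescription is exact.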
LayerNorm.
For $\mathbf{x}\in\mathbb{R}^{n}$ with trainable gain $\boldsymbol{\gamma}\in\mathbb{R}^{n}$ and bias $\boldsymbol{\beta}\in\mathbb{R}^{n}$, the output of LayerNorm is given by:
$$\mathrm{LN}(\mathbf{x})=\hat{\mathbf{x}}\odot\boldsymbol{\gamma}+\boldsymbol{\beta},\qquad\hat{\mathbf{x}}:=\frac{\mathbf{x}-\mu_{x}\mathbf{1}}{\sqrt{\sigma_{x}^{2}+\epsilon}}, \tag{86}$$
where $\|\hat{\mathbf{x}}\|=\Theta_{n}(1)$.
The LayerNorm parameters are deterministically initialized as $\gamma_{i}=1,\beta_{i}=0$, which gives $\|\mathrm{LN}(\mathbf{x})\|_{0}=\Theta_{n}(1)$ at initialization. There is no forward-pass multiplier $a_{\mathrm{LN}}$, and weight decay is typically not applied to LayerNorm parameters, so only the learning rate exponent $c_{\mathrm{LN}}$ remains to be determined.
The gradients with respect to $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are:
$$(g_{\boldsymbol{\gamma}})_{i}=\tfrac{1}{D}\sum_{\mu}(\nabla_{\mathrm{LN}}\ell^{\mu})_{i}\,\hat{x}^{\mu}_{i},\qquad(g_{\boldsymbol{\beta}})_{i}=\tfrac{1}{D}\sum_{\mu}(\nabla_{\mathrm{LN}}\ell^{\mu})_{i}, \tag{87}$$
where $\nabla_{\mathrm{LN}}\ell^{\mu}:=\nabla_{\mathrm{LN}(\mathbf{x}^{\mu})}\ell(f(\mathbf{x}^{\mu}),\mathbf{y}^{\mu})$ denotes the gradient of the per-example loss with respect to the LayerNorm output. Assuming $\|\nabla_{\mathrm{LN}}\ell^{\mu}\|,\|\hat{\mathbf{x}}^{\mu}\|=\Theta_{n}(1)$ entrywise, we get $\|g_{\boldsymbol{\gamma}}\|,\|g_{\boldsymbol{\beta}}\|=\Theta_{n}(1)$.
Since the gradients are $\Theta_{n}(1)$, for both SGD and Adam, the parameter updates at the first step satisfy $\|\Delta\boldsymbol{\gamma}_{1}\|=\|\Delta\boldsymbol{\beta}_{1}\|=\Theta_{n}(n^{-c_{\mathrm{LN}}})$. The resulting output update $\Delta\mathrm{LN}_{1}(\mathbf{x})=\hat{\mathbf{x}}\odot\Delta\boldsymbol{\gamma}_{1}+\Delta\boldsymbol{\beta}_{1}$ inherits this scaling. Requiring $\|\Delta\mathrm{LN}_{1}(\mathbf{x})\|=\Theta_{n}(1)$ yields $c_{\mathrm{LN}}=0$.
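A rough numerical illustration of $c_{\mathrm{LN}}=0$, under assumptions that go beyond the text: the norm is taken to be the root-mean-square norm, and the first-step parameter updates are modeled as sign-like steps of size `lr` per entry (as an Adam-style update would produce). All names below are hypothetical. With the same learning rate at two widths, the RMS of the output update stays at the same order.

```python
import math
import random

def layernorm_hat(x, eps=1e-5):
    """Normalize x to zero mean and unit variance (the x-hat in LayerNorm)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def output_update_rms(n, lr, rng):
    """RMS of x_hat * d_gamma + d_beta when every parameter moves by +-lr."""
    x_hat = layernorm_hat([rng.gauss(0.0, 1.0) for _ in range(n)])
    d_gamma = [-lr * rng.choice((-1.0, 1.0)) for _ in range(n)]
    d_beta = [-lr * rng.choice((-1.0, 1.0)) for _ in range(n)]
    update = [xh * dg + db for xh, dg, db in zip(x_hat, d_gamma, d_beta)]
    return math.sqrt(sum(u * u for u in update) / n)

rng = random.Random(0)
lr = 1e-3  # same learning rate at every width, i.e. c_LN = 0
ratios = {n: output_update_rms(n, lr, rng) / lr for n in (256, 4096)}
print(ratios)
```

In this toy model the ratio of output-update RMS to learning rate is of order one at both widths, so no width-dependent factor is needed in the LayerNorm learning rate.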
H.15 Weight Tying
In language models, the input embedding and output unembedding matrices are often tied, sharing parameters to reduce the parameter count. In our three-layer setup, this corresponds to setting $V=U^{\top}$. The forward pass in this case becomes:
$$h(\mathbf{x})=n^{-a_{u}}U\mathbf{x},\qquad z(\mathbf{x})=n^{-a_{w}}Wh(\mathbf{x}),\qquad f(\mathbf{x})=n^{-a_{v}}U^{\top}z(\mathbf{x}), \tag{88}$$
with $U\in\mathbb{R}^{n\times d_{\mathrm{in}}}$ and $W\in\mathbb{R}^{n\times n}$. The tied parameter $U$ appears with two different forward-pass exponents $a_{u}$ and $a_{v}$. The initialization variance is controlled by the exponents $b_{u},b_{w}$:
$$U_{ij}\sim\mathcal{N}(0,n^{-2b_{u}}),\qquad W_{ij}\sim\mathcal{N}(0,n^{-2b_{w}}). \tag{89}$$
The per-layer learning rates are scaled by $c_{u}$ and $c_{w}$:
$$\eta_{u}=\eta\,n^{-c_{u}},\qquad\eta_{w}=\eta\,n^{-c_{w}}. \tag{90}$$
In contrast to the untied case, the last layer no longer has its own $b_{v}$ and $c_{v}$ exponents, since these are inherited from $U$ as $b_{u}$ and $c_{u}$. Only the forward-pass exponent $a_{v}$ remains as a free parameter for the last layer. Performing a similar analysis to the untied case gives the stability conditions summarized in Table 5.
Table 5: Stability conditions under weight tying ($U=V^{\top}$).
Next, applying Assumption H.6 yields Table 6, which compares the untied and tied exponents side by side for both SGD and Adam. For Adam, the first-layer and middle-layer conditions are identical in the tied and untied cases, since Adam's normalized update is independent of the gradient magnitude.
Table 6: Conditions on exponents $(a,b,c)$ under $\mu$P, with and without weight tying.
#### Why $a_{u}$ and $a_{v}$ must remain separate.
Collapsing them to a single exponent ($a_{v}=a_{u}$) would force $a_{u}+b_{u}=0$ from the first-layer init and $a_{u}+b_{u}=1$ from the last-layer condition, resulting in a contradiction. The two forward-pass exponents thus provide the only remaining degree of freedom to satisfy both layer constraints simultaneously.
A weight-tied $\mu$P example.
Unlike the untied case, we cannot set $a_{u}=a_{v}=0$ simultaneously: substituting into the conditions yields $b_{u}=0$ from the first layer and $b_{u}=1$ from the last layer. We must therefore fix one of $a_{u},a_{v}$ to zero and let the other absorb the layer mismatch. Choosing $a_{u}=0$ for Adam yields the parameterization in Table 7. In this case, a naive SP implementation with $1/\text{fan-in}$ initialization and a global learning rate of $1/n$ would not only train the embedding layer slowly, but also have $\Theta(\sqrt{n})$ logits at initialization.
Table 7: A weight-tied $\mu$P example.