Small Initialization Matters for Large Language Models

arXiv cs.AI 06/17/26, 04:00 AM Papers
parameter-initialization pretraining reasoning scaling large-language-models compression initialization-scale
Summary
This paper shows that reducing parameter initialization scale consistently improves pretraining of large language models, with the largest gains on reasoning-demanding tasks. It uncovers a critical initialization that balances reasoning and training, and proposes a simple γ-initialization rule.
arXiv:2606.17945v1 Announce Type: new Abstract: Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data and architecture, we show that parameter initialization is a gene-like determinant of training and, in particular, of model capacity. Reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks. We identify two widely used empirical settings that restrain the advantage of small initialization, and show how relaxing them restores favorable scaling. We further uncover a critical initialization that balances the reasoning and training. Mechanistically, small initialization drives a distinct developmental trajectory: parameters first condense into low-complexity structures and later expand into richer representations, giving concrete form to the idea that compression is intelligence. Token-level analyses show that the gains concentrate on non-trivial, context-constrained predictions rather than all tokens uniformly. These results motivate a simple $\gamma$-initialization rule: expose initialization rate as an explicit knob and use small initialization by default, an almost cost-free intervention that improves pretraining and strengthens reasoning across model scales.
Cached at: 06/17/26, 05:40 AM
# Small Initialization Matters for Large Language Models
Source: [https://arxiv.org/html/2606.17945](https://arxiv.org/html/2606.17945)
Zhi-Qin John Xu [1,2]

These authors contributed equally to this work.

[1] School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai 200240, China

[2] Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China

[3] MemTensor (Shanghai) Technology Co., Ltd.

[4] Institute for Advanced Algorithms Research, Shanghai, China

###### Abstract

Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data and architecture, we show that parameter initialization is a gene-like determinant of training and, in particular, of model capacity. Reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks. We identify two widely used empirical settings that restrain the advantage of small initialization, and show how relaxing them restores favorable scaling. We further uncover a critical initialization that balances the reasoning and training. Mechanistically, small initialization drives a distinct developmental trajectory: parameters first condense into low-complexity structures and later expand into richer representations, giving concrete form to the idea that compression is intelligence. Token-level analyses show that the gains concentrate on non-trivial, context-constrained predictions rather than all tokens uniformly. These results motivate a simple $\gamma$-initialization rule: expose initialization rate as an explicit knob and use small initialization by default, an almost cost-free intervention that improves pretraining and strengthens reasoning across model scales.

## 1 Introduction

Beyond their practical role as engineered systems, large language models (LLMs) offer an experimental window into how intelligence can emerge from scale, data, optimization and architecture. Their recent advances have largely come from increasing scale [brown2020language,kaplan2020scaling], refining data and optimization [NEURIPS2022_c1e2faff,hu2022lora,liu2024sophia,guo2025deepseek], or modifying architectures [vaswani2017attention,devlin2019bert,fedus2022switch,gu2023mamba]. Yet parameter initialization remains a critical design choice underpinning modern deep learning [lecun2015deep]. Once considered settled by handcrafted heuristics such as Xavier [glorot2010understanding], LeCun [lecun2012efficient] and Kaiming [he2015delving] initialization, it has resurfaced in the large-model era, where a tailored scheme now demands costly trial and error. In this work, we demonstrate that the scale of parameter initialization profoundly influences the learning process of LLMs, offering concrete support for the view that compression essentially embodies intelligence.

Large initialization makes networks behave like kernel methods\[jacot2018neural,chizat2019lazy,woodworth2020kernel\], whereas small initialization drives them into a nonlinear feature\-learning, or condensed, regime, in which the weight vectors within a layer first align along a few shared directions and later develop richer structure\[luo2021phase,zhou2022towards,kunin2024get\]\. Small initialization has been shown to bias models toward reasoning and improve generalization, though largely on simplified architectures and synthetic tasks\[zhang2024initialization,yao2025analysis,zhang2025complexity\]\. Whether these benefits carry over to realistic LLM pretraining, and persist as models scale up, remains largely unexplored\. Here we systematically examine whether, when, and why small initialization improves LLM performance\.

We parameterize each weight matrix as $W_{ij}\sim\mathcal{N}(0,\sigma^{2})$, $\sigma=d_{\mathrm{in}}^{-\gamma}$, where the initialization rate $\gamma=1/2$ recovers the standard, Xavier-like scale [glorot2010understanding] and larger $\gamma$ yields smaller initialization. Training LLMs across a range of $\gamma$, we find that simply reducing the initialization scale lowers the pretraining loss, though this advantage fades as models grow. This weakening is not an intrinsic limitation of small initialization but a consequence of specific architectural components: layer normalization [ba2016layer] obscures differences in scale through its constant $\epsilon$, while small initialization simultaneously intensifies the attention sink [xiao2024streamingllm]. Reducing $\epsilon$ and introducing gated attention [qiu2026gated] unlock its latent benefit, yielding a markedly better scaling law with model size.

Our results further indicate that an equal balance between the identity and residual pathways, achieved at $\gamma=1$, yields the best performance. Finally, we find that small-initialization models follow a low-to-high complexity trajectory: their weight matrices first condense into low-dimensional structures and later expand into a richer space, a condensation phenomenon [xu2025overview] shared by models from multilayer perceptrons to Transformers in condensed regimes. A token-level analysis of the loss further shows that the gains are not spread evenly across tokens but concentrate on a subset of non-trivial ones.

Together, these results establish the scale of initialization as both a practical and a mechanistic design axis for LLMs: small initialization substantially improves pretraining, $\gamma=1$ marks a critical balance point, and the resulting training dynamics follow a clear condensed pattern. More broadly, we argue for treating $\gamma$ as an explicit initialization parameter, and for adopting $\gamma$-initialization as a built-in initializer in mainstream deep learning frameworks, with $\gamma=1$ as the default.

## 2 Result

### 2.1 Why small initialization fails to scale, and how to fix it

We first examine whether reducing the initialization scale benefits LLM pretraining, training models of different sizes under the standard scale $\gamma=0.5$ and a smaller scale $\gamma=1$. Small initialization consistently lowers the validation loss across all model scales (Figure [1](https://arxiv.org/html/2606.17945#S2.F1)a,b). The gain, however, shrinks with size: moving from $\gamma=0.5$ to $\gamma=1$ reduces the loss by about $0.05$ for the 0.1B model but only $0.003$ for the 1.5B model. This raises a central question: why does the benefit of small initialization vanish at scale?

![Refer to caption](https://arxiv.org/html/2606.17945v1/pic/combined_three_figures.png)

Figure 1: (a) Validation loss across different model sizes under two initialization scales, $\gamma=0.5$ and $\gamma=1$. (b) Absolute validation loss reduction between $\gamma=0.5$ and $\gamma=1$ across model scales. (c) Effective RMSNorm scaling factor as a function of $\gamma$ under two different $\epsilon$ settings, where $d=2048$. (d) Training loss curves for 1.5B models trained under $\gamma=1$ with different $\epsilon$ values. (e) Final validation loss comparison under different combinations of initialization scale $\gamma$ and RMSNorm $\epsilon$. (f) Layer-wise attention sink scores under different initialization scales in 1.5B models. (g) Layer-wise attention sink scores with and without gated attention under $\gamma=1$. (h) Final validation loss comparison for models trained with and without gated attention under different initialization scales.

We show that this shrinkage is not a failure of small initialization but a consequence of architectural components that suppress its effect. We identify two. First, layer normalization enters an $\epsilon$-dominated regime once the hidden-state variance becomes small, masking the scale difference. Second, small initialization aggravates the attention sink, concentrating attention on the first token. Removing both barriers restores the gain, as we show below.

#### Adjusting the RMSNorm constant

RMSNorm rescales inputs $\bm{h}$ by a factor proportional to $\left(\sigma^{2}(\bm{h})+\varepsilon\right)^{-1/2}$, where $\sigma^{2}(\bm{h})=\frac{1}{d}\sum_{i=1}^{d}\bm{h}_{i}^{2}$ and $\varepsilon$ is the stability constant. Small initialization shrinks $\sigma^{2}(\bm{h})$, and once $\sigma^{2}(\bm{h})\lesssim\varepsilon$ the factor is governed by $\varepsilon$ rather than by $\bm{h}$; further reducing the scale then no longer changes the normalization, masking the effect of small initialization. Take $\bm{h}_{i}\sim\mathcal{N}(0,d^{-2\gamma})$, so that $\sigma^{2}(\bm{h})=d^{-2\gamma}$ in expectation. With $d=2048$, $\sigma^{2}(\bm{h})\approx 4.9\times 10^{-4}$ when $\gamma=0.5$ and $\approx 2.4\times 10^{-7}$ when $\gamma=1$; the latter is much smaller than the common $\varepsilon=10^{-5}$. This indicates that the common value $\varepsilon=10^{-5}$ saturates the factor for $\gamma>0.6$, whereas $\varepsilon=10^{-12}$ keeps it sensitive to $\gamma$ over a far wider range (Figure [1](https://arxiv.org/html/2606.17945#S2.F1)c).

The prediction holds in training. For 1.5B models at $\gamma=1$, lowering $\varepsilon$ from $10^{-5}$ to $10^{-12}$ markedly decreases the loss (Figure [1](https://arxiv.org/html/2606.17945#S2.F1)d, $\sim 0.038$), while at $\gamma=0.5$ the same reduction barely helps (Figure [1](https://arxiv.org/html/2606.17945#S2.F1)e, $\sim 0.001$).
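The saturation argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the idealized variance $\sigma^{2}(\bm{h})=d^{-2\gamma}$ used in the text, and the helper names `rms_factor` and `crossover_gamma` are hypothetical, not taken from the paper's code. It reports how much of the ideal ($\varepsilon=0$) scaling factor survives for each $\varepsilon$, and the value of $\gamma$ at which the variance equals $\varepsilon$.

```python
import math

def rms_factor(d, gamma, eps):
    """RMSNorm scaling factor (sigma^2 + eps)^(-1/2), assuming sigma^2 = d^(-2*gamma)."""
    var = d ** (-2.0 * gamma)
    return (var + eps) ** -0.5

def crossover_gamma(d, eps):
    """Value of gamma at which the assumed variance d^(-2*gamma) equals eps."""
    return -math.log(eps) / (2.0 * math.log(d))

d = 2048
for gamma in (0.5, 0.75, 1.0, 1.5):
    ideal = d ** gamma  # scaling factor in the limit eps -> 0
    for eps in (1e-5, 1e-12):
        kept = rms_factor(d, gamma, eps) / ideal
        print(f"gamma={gamma:<4} eps={eps:g}  fraction of ideal factor kept: {kept:.4f}")

print("variance equals eps=1e-5 at gamma =", round(crossover_gamma(d, 1e-5), 3))
```

A fraction close to 1 means the normalization still tracks the hidden-state scale; a fraction well below 1 means $\varepsilon$ dominates the denominator.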

#### Mitigating the attention sink

The second barrier is the attention sink, the tendency of LLMs to place disproportionate attention on the first token [xiao2024streamingllm,barbero2025why,ICLR2025_f1b04fac,Yu2024unveiling]. Measuring the sink score, defined in Eq. ([5](https://arxiv.org/html/2606.17945#S4.E5)) as the average attention mass on the first token, we find that small initialization strengthens it: the per-layer average under $\gamma=1$ exceeds that under $\gamma=0.5$ by about $0.11$ in 1.5B models (Figure [1](https://arxiv.org/html/2606.17945#S2.F1)f).

To remove it, we apply gated attention, which gates the output of each attention head [qiu2026gated]. Gating sharply reduces the sink score under $\gamma=1$ (Figure [1](https://arxiv.org/html/2606.17945#S2.F1)g), and this translates into performance: at $\gamma=0.5$ gating barely lowers the loss ($\sim 0.004$), whereas at $\gamma=1$ it produces a much larger decrease (Figure [1](https://arxiv.org/html/2606.17945#S2.F1)h, $\sim 0.047$). Mitigating the attention sink therefore substantially strengthens the effect of small initialization. Note that this ablation uses $\varepsilon=10^{-12}$.

### 2.2 Unleashing the benefit of small initialization

Once the two architectural barriers are removed, the benefit of small initialization emerges in full. Combining the adjustments, reducing $\varepsilon$ to $10^{-12}$ and adding gated attention, leaves the loss essentially unchanged at $\gamma=0.5$ but markedly amplifies the gain at $\gamma=1$, and this gain persists as model size grows (Figure [2](https://arxiv.org/html/2606.17945#S2.F2)a). The comparison indicates that small initialization can improve the effective model size by approximately 44%. The adjustments thus matter only under small initialization: they do not improve training on their own, but release a benefit of small initialization that is otherwise hidden.

This advantage extends to downstream capability. On 1.5B models with the adjustments, $\gamma=1$ outperforms $\gamma=0.5$ on benchmarks spanning knowledge, commonsense reasoning, and math (Table [1](https://arxiv.org/html/2606.17945#S2.T1)), with absolute gains exceeding 4% on TriviaQA, HellaSwag, GSM8K, and MATH500. Small initialization, properly supported by the architecture, therefore yields both lower loss and stronger task performance.

Table 1: Evaluation results of the 1.5B models under standard initialization ($\gamma=0.5$) and small initialization ($\gamma=1$). The gain column reports the absolute improvement of $\gamma=1$ over $\gamma=0.5$. Gains smaller than 4 are shown in red, while gains greater than or equal to 4 are shown in green.

### 2.3 Extension to mixture-of-experts models

To test whether the effect is specific to dense models, we repeat the experiments on mixture-of-experts (MoE) models in two configurations: 1.5B total (0.25B active) and 3B total (0.5B active) parameters, each trained at $\gamma=0.5$ and $\gamma=1$. The MoE results mirror the dense case (Figure [2](https://arxiv.org/html/2606.17945#S2.F2)b,c): small initialization clearly lowers the loss at the smaller size, the gain weakens at the larger size under the standard architecture, and reducing $\varepsilon$ to $10^{-12}$ together with gated attention recovers and amplifies it. Both the limitation and its remedy therefore carry over, establishing small initialization as a broadly useful strategy across architectures.

![Refer to caption](https://arxiv.org/html/2606.17945v1/pic/dense_scaling_moe_final_loss_clean.png)

Figure 2: (a) Validation loss across different model sizes under $\gamma=0.5$ and $\gamma=1$, with and without the architectural adjustments. (b) Validation loss of MoE models without the adjustments under $\gamma=0.5$ and $\gamma=1$. (c) Validation loss of MoE models with the adjustments under $\gamma=0.5$ and $\gamma=1$.

### 2.4 How small should initialization be?

Given these gains, should the scale be pushed as small as possible, that is, $\gamma$ made arbitrarily large? We find that it should not, for a reason rooted in the residual structure of a Transformer. Each layer adds a residual update to an identity pathway; when the weights are too small, these updates vanish and the identity pathway dominates, leaving the network unable to transform its input during early training. A useful initialization must keep the residual updates comparable to the identity pathway.

This balance can be made precise. For a pre-norm Transformer of $L$ layers with final hidden state $\bm{h}_{L}$ and embedded input $\bm{e}$, we define the residual flow as $\bm{h}_{L}-\bm{e}$. Under small initialization, its relative scale satisfies $\|\bm{h}_{L}-\bm{e}\|_{2}/\|\bm{e}\|_{2}\asymp d^{1-\gamma}$ (derivation in Appendix [A](https://arxiv.org/html/2606.17945#A1)). The residual flow thus dominates the embedding for $\gamma<1$, matches it at $\gamma=1$, and becomes negligible for $\gamma>1$, where the network is initialized close to an identity mapping.
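To get a feel for the three regimes, the predicted order of magnitude $d^{1-\gamma}$ can be evaluated directly. This is only a worked example of the scaling stated above: the relation holds up to a constant factor, the choice $d=2048$ is borrowed from the RMSNorm example, and the function name is hypothetical.

```python
def residual_to_embedding_order(d, gamma):
    """Predicted order of ||h_L - e|| / ||e||, up to a constant factor: d^(1 - gamma)."""
    return d ** (1.0 - gamma)

d = 2048  # illustrative hidden size
for gamma in (0.5, 0.75, 1.0, 1.25, 1.5):
    print(f"gamma={gamma:<5} d^(1-gamma) = {residual_to_embedding_order(d, gamma):.4g}")
```

For this $d$, the predicted ratio is roughly 45 at $\gamma=0.5$, exactly 1 at $\gamma=1$, and roughly 0.02 at $\gamma=1.5$.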

Both the scaling and its consequence are confirmed in experiments: the measured norm ratio follows $d^{1-\gamma}$ (Figure [3](https://arxiv.org/html/2606.17945#S2.F3)a), and the pretraining loss improves from $\gamma=0.5$ to $\gamma=1$ but deteriorates beyond it (Figure [3](https://arxiv.org/html/2606.17945#S2.F3)b). The optimal scale therefore sits at the balance point $\gamma=1$, small enough to induce condensation yet large enough to keep the network trainable.

![Refer to caption](https://arxiv.org/html/2606.17945v1/pic/gamma_residualflow_loss.png)

Figure 3: (a) Relative residual-flow strength $\|\bm{h}_{L}-\bm{e}\|_{2}/\|\bm{e}\|_{2}$ as a function of initialization scale $\gamma$. (b) Final validation loss as a function of initialization scale $\gamma$.

### 2.5 Where the gains come from: a token-level analysis

Validation loss and benchmark scores show that small initialization helps on average, but not where the improvement originates. We therefore ask whether the gain is spread uniformly across tokens or concentrated on specific ones. For each context $\bm{x}_{<t}$ and label token $y_{t}$, we measure the symmetric probability gap between the small- and standard-initialization models,

$$\Delta^{\mathrm{sym}}p_{t}=\frac{2\left(p_{\mathrm{small}}(y_{t}\mid\bm{x}_{<t})-p_{\mathrm{standard}}(y_{t}\mid\bm{x}_{<t})\right)}{p_{\mathrm{small}}(y_{t}\mid\bm{x}_{<t})+p_{\mathrm{standard}}(y_{t}\mid\bm{x}_{<t})},$$

which is positive when small initialization assigns higher probability to the correct token.
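The gap is straightforward to compute from per-token probabilities. The following is a minimal NumPy sketch of the formula above; the function name and the toy probabilities are illustrative assumptions, not data from the paper.

```python
import numpy as np

def symmetric_prob_gap(p_small, p_standard):
    """Symmetric probability gap 2 * (p_small - p_standard) / (p_small + p_standard)."""
    p_small = np.asarray(p_small, dtype=float)
    p_standard = np.asarray(p_standard, dtype=float)
    return 2.0 * (p_small - p_standard) / (p_small + p_standard)

# Toy probabilities that each model assigns to the correct token (made-up values).
p_small = [0.60, 0.20, 0.05]
p_standard = [0.20, 0.20, 0.15]
print(symmetric_prob_gap(p_small, p_standard))  # positive, zero, negative
```

Because the difference is normalized by the sum, the gap is bounded between $-2$ and $2$ and treats a gain and an equal-ratio loss symmetrically.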

![Refer to caption](https://arxiv.org/html/2606.17945v1/pic/token_loss_and_parameter_dynamics.png)

Figure 4: (a) Distribution of $\Delta^{\mathrm{sym}}p$ between $\gamma=1$ and $\gamma=0.5$. Vertical dashed lines indicate the mean and median values. (b) Mean and median $\Delta^{\mathrm{sym}}p$ across token difficulty quantile bins. (c) Representative token-level example with color-coded token-wise improvements according to $\Delta^{\mathrm{sym}}p$. (d) Training loss curve and parameter cosine similarity snapshots for the $\gamma=1$ model. Insets visualize the first-layer $W^{Q}$ and $W^{down}$ at different training stages. (e) Training loss curve and parameter cosine similarity snapshots for the $\gamma=0.5$ model. Insets visualize the first-layer $W^{Q}$ and $W^{down}$ at different training stages. (f) Stable rank evolution of the first-layer $W^{Q}$ during training under different initialization scales. (g) Stable rank evolution of the first-layer $W^{down}$ during training under different initialization scales.

Over a validation set of 4B tokens, the distribution of $\Delta^{\mathrm{sym}}p$ is far from uniform (Figure [4](https://arxiv.org/html/2606.17945#S2.F4)a): most tokens change little or even slightly worsen, while a subset improves substantially. The aggregate loss reduction is thus driven by large gains on a minority of tokens rather than a uniform shift across the vocabulary.

To identify these tokens, we rank them by difficulty, defined as the average loss of the two models, $d_{t}=\tfrac{1}{2}\left(\ell_{\mathrm{small}}(y_{t})+\ell_{\mathrm{std}}(y_{t})\right)$, and bin them into ten equal-sized groups. The gain peaks on moderately-to-highly difficult tokens, not the hardest ones (Figure [4](https://arxiv.org/html/2606.17945#S2.F4)b): the hardest tokens are typically noisy fragments or rare symbols that neither model can predict, whereas intermediate-difficulty tokens demand contextual integration and reasoning, precisely where small initialization helps. A representative example confirms this (Figure [4](https://arxiv.org/html/2606.17945#S2.F4)c): in a short derivation about a quadratic function, the tokens that must be inferred from context, the derivative, the slope, the zero-gradient condition, and the resulting critical point, all show clear gains.
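The binning procedure described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' analysis code: it assumes the per-token loss is the negative log-probability of the correct token (so that $p=e^{-\ell}$), and the function name and synthetic losses are made up for the example.

```python
import numpy as np

def gap_by_difficulty(loss_small, loss_std, n_bins=10):
    """Return (mean, median) of the symmetric probability gap per difficulty bin.

    Difficulty is the average of the two per-token losses; tokens are sorted by
    difficulty and split into n_bins groups of (nearly) equal size.
    """
    loss_small = np.asarray(loss_small, dtype=float)
    loss_std = np.asarray(loss_std, dtype=float)
    difficulty = 0.5 * (loss_small + loss_std)
    # Assumption: loss = -log p, so the probability of the label token is exp(-loss).
    p_small, p_std = np.exp(-loss_small), np.exp(-loss_std)
    gap = 2.0 * (p_small - p_std) / (p_small + p_std)
    order = np.argsort(difficulty)            # easiest tokens first
    bins = np.array_split(order, n_bins)      # near-equal-sized index groups
    return [(float(gap[idx].mean()), float(np.median(gap[idx]))) for idx in bins]

# Synthetic losses for illustration only (not results from the paper).
rng = np.random.default_rng(0)
loss_std = rng.uniform(0.1, 8.0, size=1000)
loss_small = loss_std - rng.uniform(-0.1, 0.3, size=1000)
for i, (mean, median) in enumerate(gap_by_difficulty(loss_small, loss_std)):
    print(f"bin {i}: mean={mean:+.3f} median={median:+.3f}")
```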

The benefit of small initialization is therefore not a uniform drop in loss but a sharpening of predictions that depend on earlier context\. It strengthens the model’s use of local dependencies and reasoning, and this systematic advantage on non\-trivial yet learnable tokens accounts for the aggregate improvement\.

### 2.6 Mechanism: condensation drives a low-to-high complexity trajectory

A large body of work has shown that, under small initialization, parameters within a layer first align into a few directions and later diversify, a pattern known as condensation [luo2021phase,zhou2022towards,chen2026from]. We ask whether LLMs follow the same path, tracking two measures of each parameter matrix during training: the row-wise cosine similarity and the stable rank, defined as $\frac{\|\bm{W}\|_{F}^{2}}{\|\bm{W}\|_{2}^{2}}=\frac{\sum_{i=1}^{d}\sigma_{i}^{2}}{\sigma_{\max}^{2}}$, where $\|\bm{W}\|_{F}$ is the Frobenius norm, $\|\bm{W}\|_{2}$ is the spectral norm, $\sigma_{i}$ denotes the $i$-th singular value and $\sigma_{\max}$ the largest one. The stable rank is a continuous relaxation of the matrix rank.
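Both measures are easy to compute for a single weight matrix. The sketch below is a plain NumPy illustration under the definitions above; the function names, the `tiny` guard against zero-norm rows, and the test matrices are assumptions made for the example.

```python
import numpy as np

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2, computed from the singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, largest first
    return float(np.sum(s ** 2) / s[0] ** 2)

def row_cosine_similarity(W, tiny=1e-12):
    """Matrix of cosine similarities between all pairs of rows of W."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    U = W / np.maximum(norms, tiny)         # unit-normalize each row
    return U @ U.T

rng = np.random.default_rng(0)
random_W = rng.normal(size=(256, 256))                         # unstructured matrix
condensed_W = np.outer(rng.uniform(1, 2, 256), rng.normal(size=256))  # all rows parallel

print("stable rank, random matrix:   ", round(stable_rank(random_W), 1))
print("stable rank, condensed matrix:", round(stable_rank(condensed_W), 3))
print("min row cosine, condensed:    ", row_cosine_similarity(condensed_W).min().round(6))
```

A fully condensed matrix, whose rows all point along one direction, has stable rank 1 and pairwise row cosine similarity 1, whereas a random Gaussian matrix has a much larger stable rank.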

The two scales diverge sharply (Figure [4](https://arxiv.org/html/2606.17945#S2.F4)d,e). At initialization, the stable rank is high under both settings, as expected: the parameters are still random, and random matrices typically have relatively high effective rank before training. Under $\gamma=1$, the similarity heatmaps reveal pronounced condensation early in training, with rows aligned into a few coherent blocks that gradually weaken as the matrix develops richer directions. Under $\gamma=0.5$, no such low-complexity phase appears: the structure is diffuse from the outset and evolves only slowly. The stable rank makes this quantitative (Figure [4](https://arxiv.org/html/2606.17945#S2.F4)f,g, Figures [5](https://arxiv.org/html/2606.17945#A3.F5)-[8](https://arxiv.org/html/2606.17945#A3.F8)): under $\gamma=1$ it drops steeply at the start and then climbs, tracing a clear low-to-high complexity trajectory, whereas under $\gamma=0.5$ it begins high and declines slowly.

Small initialization therefore reshapes how representations form: the model first compresses into simple, low\-rank structures and only later expands into more complex ones\. This condensation\-driven trajectory, rather than a mere change in final weights, distinguishes small initialization from the standard regime\.

This low\-to\-high complexity trajectory also offers an explanation for the improved reasoning observed earlier\. Because complexity grows only gradually, the model is driven to fit the data with the lowest complexity at each stage, seeking the fewest rules that account for the observations before resorting to more intricate ones\. Such a bias toward minimal, parsimonious explanations favors genuine underlying reasoning over surface memorization, consistent with the larger token\-level gains on predictions that require contextual inference\.

## 3 Discussion

Our work establishes initialization scale as a meaningful design axis for LLM training. We show that reducing the initialization scale can consistently improve pretraining loss and downstream performance in both dense and MoE LLMs. These results challenge the common view that initialization is merely a low-level implementation detail whose effect disappears during large-scale training.

A key implication is that, at larger model sizes, the benefit of small initialization may be limited by the architecture. The reduced gain observed at larger scales does not reflect an intrinsic limitation of small initialization. Rather, it reveals a mismatch between small initialization and standard LLM components. Normalization constants can mask the intended scale difference, while the attention sink can become more severe and limit the effect of small initialization. Correcting these factors releases the hidden gain of small initialization, showing that initialization and architecture should be considered jointly rather than independently.

Our results further show that the initialization scale must be chosen with a residual-flow balance in mind. Making the weights arbitrarily small is not beneficial: when $\gamma>1$, residual updates become too weak relative to the embedding stream, and training degrades. The point $\gamma=1$ therefore plays a special role, balancing reduced initial complexity with active residual updates. This provides a practical default for small-initialization training and a theoretical guide for tuning initialization scale.

We uncover that the benefits of small initialization stem from a change in the model’s learning dynamics\. Rather than producing a uniform improvement over all tokens, it mainly improves non\-trivial, context\-constrained predictions\. At the parameter level, it induces a low\-to\-high complexity trajectory, where weight matrices first condense into simpler structures and later expand toward richer representations\. These findings suggest that initialization affects the final loss via reshaping the formation process of internal representations\.

We therefore advocate treating $\gamma$-controlled initialization as a native component of large language model design. Future training frameworks should expose $\gamma$ as an explicit hyperparameter rather than hiding initialization scale inside default implementations. In particular, $\gamma=1$ offers a principled default that combines small-scale regularization with residual-flow balance. More broadly, our findings suggest that effective LLM pretraining depends not only on scale, data, optimization, and architecture, but also on how representational complexity is initialized and allowed to emerge during learning.

## 4 Method

### 4.1 Model Architecture

Dense model: We adopt a standard decoder-only Transformer with a dense, pre-norm design. The model sizes are based on the four parameter scales of GPT-3 [brown2020language], but we replace LayerNorm and the MLP with the more commonly used RMSNorm and SwiGLU. Each Transformer block consists of the following components: RMSNorm, a multi-head attention layer, and a feed-forward network (MLP) with SwiGLU activation. Detailed parameters are shown in Table [2](https://arxiv.org/html/2606.17945#A2.T2).

MoE: For the MoE architecture, we evaluated two attention mechanisms: Multi\-head Latent Attention and conventional Multi\-Head Attention\. For the sparse expert layers, we followed the design practice by substantially increasing the number of experts while reducing the parameter scale of each individual expert, and by incorporating shared experts\[liu2024deepseek\]\. Detailed parameters are shown in Table[3](https://arxiv.org/html/2606.17945#A2.T3)\.

### 4.2 Initialization Scheme

The main variable in our study is the initialization scale of model parameters. For each weight matrix $\bm{W}$, we initialize its entries independently from a zero-mean Gaussian distribution:

$$\bm{W}_{i,j}\sim\mathcal{N}\left(0,d_{\mathrm{in}}^{-2\gamma}\right),$$

where $d_{\mathrm{in}}$ is the input dimension of $\bm{W}$, and $\gamma$ controls the initialization scale. In particular, when $\gamma=\frac{1}{2}$, the variance becomes $\mathrm{Var}(\bm{W}_{i,j})=d_{\mathrm{in}}^{-1}$, which corresponds to the conventional Xavier-like scaling. When $\gamma$ is increased, the weights are initialized with a smaller magnitude.
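As a concrete illustration, the rule can be written as a small initializer. This is a minimal NumPy sketch, not the Megatron-LM implementation used for the experiments; the function name `gamma_init` and the `(d_in, d_out)` storage convention are assumptions made for the example.

```python
import numpy as np

def gamma_init(d_in, d_out, gamma=1.0, rng=None):
    """Sample a (d_in, d_out) weight matrix with i.i.d. entries N(0, d_in^(-2*gamma)).

    gamma = 0.5 gives the Xavier-like variance 1/d_in; larger gamma gives smaller weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = d_in ** (-gamma)  # standard deviation d_in^(-gamma)
    return rng.normal(loc=0.0, scale=sigma, size=(d_in, d_out))

rng = np.random.default_rng(0)
W_standard = gamma_init(2048, 2048, gamma=0.5, rng=rng)
W_small = gamma_init(2048, 2048, gamma=1.0, rng=rng)
print("empirical std, gamma=0.5:", W_standard.std(), "target:", 2048 ** -0.5)
print("empirical std, gamma=1.0:", W_small.std(), "target:", 2048 ** -1.0)
```

Exposing `gamma` as an argument, rather than fixing the exponent inside the initializer, is the design choice the paper argues for.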

### 4.3 Training Configuration

We train all models using the standard next-token prediction objective. Training is conducted with Megatron-LM [megatron-lm], using AdamW [loshchilov2017decoupled] as the optimizer. Unless otherwise specified, each model is trained for one epoch. The detailed training hyperparameters are summarized in Tables [2](https://arxiv.org/html/2606.17945#A2.T2) and [3](https://arxiv.org/html/2606.17945#A2.T3).

Training is performed with bfloat16 mixed precision, while gradient accumulation is conducted in fp32 precision\. We disable both attention dropout and hidden dropout\.

### 4.4 Data

The experiments were conducted on a high\-quality bilingual \(Chinese–English\) corpus containing 1 trillion tokens and spanning multiple domains, including web data, mathematics, code, and books\. The Dense model was trained on a 36B\-token subset comprising 13\.5B web tokens, 9B Wikipedia tokens, 4\.5B code tokens, and 9B mathematics tokens\. The MoE model was trained on 100B tokens sampled from the same corpus, with the original domain distribution preserved\.

### 4.5 Evaluation Protocol

All downstream evaluations are conducted using the lm-evaluation-harness [eval-harness]. The evaluation suite covers knowledge and question answering benchmarks, including ARC-C [clark2018think], TriviaQA [2017arXivtriviaqa], and MMLU [hendrycks2021ethics,hendryckstest2021]; commonsense and general reasoning benchmarks, including HellaSwag [zellers2019hellaswag], BBH [suzgun2023challenging], and SocialIQA [sap2019social]; and mathematical reasoning benchmarks, including GSM8K [cobbe2021training] and MATH500 [hendrycks2021measuring].

### 4.6 Architectural Adjustments

To examine whether the benefit of small initialization is limited by architectural components at larger scale, we consider two simple architectural adjustments: reducing the normalization constant $\varepsilon$ and introducing gated attention.

#### LayerNorm/RMSNorm constant

Normalization layers rescale hidden states and can therefore interact with initialization scale. For a hidden state $\bm{h}\in\mathbb{R}^{d}$, LayerNorm is defined as

$$\mathrm{LN}(\bm{h})=\bm{\gamma}\odot\frac{\bm{h}-\bm{\mu}(\bm{h})}{\sqrt{\sigma^{2}(\bm{h})+\varepsilon}}+\bm{\beta},\tag{1}$$

where $\bm{\mu}(\bm{h})$ denotes the coordinate-wise mean, $\sigma^{2}(\bm{h})=d^{-1}\sum_{i}(\bm{h}_{i}-\mu(\bm{h}))^{2}$ is the hidden-state variance, $\varepsilon$ is the numerical constant for stability, and $\bm{\gamma},\bm{\beta}$ are learnable affine parameters. In our implementation, the Transformer blocks use RMSNorm, which is defined as

$$\mathrm{RMSNorm}(\bm{h})=\bm{\gamma}\odot\frac{\bm{h}}{\sqrt{d^{-1}\sum_{i}\bm{h}_{i}^{2}+\varepsilon}}.\tag{2}$$

When the hidden-state variance becomes smaller than $\varepsilon$, the normalization factor becomes dominated by $\varepsilon$ rather than by the hidden-state scale. We compare two values of $\varepsilon$: the standard setting $\varepsilon=10^{-5}$ and a smaller setting $\varepsilon=10^{-12}$.
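Equations (1) and (2) translate directly into code. The sketch below is a plain NumPy illustration, not the training implementation; the function names are hypothetical, and the input is a synthetic hidden state drawn with variance $d^{-2\gamma}$ at $\gamma=1$, so that its variance lies below $10^{-5}$ but far above $10^{-12}$.

```python
import numpy as np

def layer_norm(h, gain, bias, eps=1e-5):
    """Eq. (1): center, divide by sqrt(variance + eps), then apply gain and bias."""
    mu = h.mean(axis=-1, keepdims=True)
    var = ((h - mu) ** 2).mean(axis=-1, keepdims=True)
    return gain * (h - mu) / np.sqrt(var + eps) + bias

def rms_norm(h, gain, eps=1e-5):
    """Eq. (2): divide by sqrt(mean square + eps), then apply gain."""
    mean_sq = (h ** 2).mean(axis=-1, keepdims=True)
    return gain * h / np.sqrt(mean_sq + eps)

d, gamma = 2048, 1.0
rng = np.random.default_rng(0)
h = rng.normal(scale=d ** (-gamma), size=d)  # synthetic small-scale hidden state
gain = np.ones(d)

for eps in (1e-5, 1e-12):
    out = rms_norm(h, gain, eps=eps)
    print(f"eps={eps:g}  RMS of normalized output = {np.sqrt(np.mean(out ** 2)):.4f}")
```

With the smaller constant the output has RMS close to 1, as normalization intends; with $\varepsilon=10^{-5}$ the output stays far below unit scale because $\varepsilon$ dominates the denominator.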

#### Gated attention

We introduce gated attention to mitigate the attention sink phenomenon. Following recent work on gated softmax attention [qiu2026gated], we apply a query-dependent gate to the output of each attention head. Given the standard attention heads $\mathrm{head}_{1},\ldots,\mathrm{head}_{n_{h}}$, the gated attention output is written as

$$\mathrm{GAttn}(\bm{x})=\mathrm{Concat}\left(g_{1}(\bm{x})\odot\mathrm{head}_{1},\ldots,g_{n_{h}}(\bm{x})\odot\mathrm{head}_{n_{h}}\right)W_{O},\tag{3}$$

where $n_{h}$ is the number of attention heads, $W_{O}$ is the output projection matrix, and $g_{h}(\bm{x})$ is a head-specific sigmoid gate.

In practice, the gate is computed from the normalized hidden state before the attention module. For each head $h$, we use a linear projection followed by a sigmoid activation:

$$g_{h}(\bm{x})=\sigma(\bm{x}W_{h}^{g}),\tag{4}$$

where $W_{h}^{g}$ is the gate projection for head $h$, and $\sigma(\cdot)$ denotes the sigmoid function. The gate modulates the magnitude of the corresponding attention head after the softmax attention weights have been computed. Thus, the gating mechanism does not change the causal attention mask or the softmax normalization itself; it only controls how strongly each head contributes to the residual update.
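Equations (3) and (4) can be sketched for a single sequence as follows. This is an illustrative NumPy version, not the paper's code: the function name `gated_attention`, the tensor layouts, and in particular the choice of an elementwise gate with $W_{h}^{g}\in\mathbb{R}^{d\times d_{\mathrm{head}}}$ are assumptions, since the text does not state the gate's output dimension.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(x, wq, wk, wv, wg, wo):
    """Causal multi-head attention with a sigmoid gate on each head output.

    x: (T, d) normalized hidden states.
    wq, wk, wv, wg: (n_heads, d, d_head); wg is the assumed elementwise gate projection.
    wo: (n_heads * d_head, d) output projection.
    Returns the gated output (T, d) and the attention weights (n_heads, T, T).
    """
    T = x.shape[0]
    n_heads, _, d_head = wq.shape
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True for key positions after the query
    heads, weights = [], []
    for h in range(n_heads):
        q, k, v = x @ wq[h], x @ wk[h], x @ wv[h]
        scores = np.where(future, -np.inf, q @ k.T / np.sqrt(d_head))
        a = softmax(scores)            # the softmax is computed without any reference to the gate
        gate = sigmoid(x @ wg[h])      # Eq. (4): query-dependent gate in (0, 1)
        heads.append(gate * (a @ v))   # Eq. (3): gate applied to the head output
        weights.append(a)
    return np.concatenate(heads, axis=-1) @ wo, np.stack(weights)
```

Because the gate multiplies the head output after the softmax, changing `wg` alters the returned output but leaves the returned attention weights unchanged.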

### 4.7 Mechanistic Analysis Metrics

#### Attention sink score

For layer $\ell$ and head $h$, let $A_{h}^{(\ell)}(t,i,j)$ denote the attention weight from query position $i$ to key position $j$ in sequence $t$. We define the sink score as

$$S_{\mathrm{sink}}^{(\ell,h)}(t)=\frac{1}{T}\sum_{i=1}^{T}A_{h}^{(\ell)}(t,i,1),\tag{5}$$

where $T$ is the sequence length. This quantity measures the average attention mass assigned to the first token by all query positions in a sequence. A larger $S_{\mathrm{sink}}^{(\ell,h)}(t)$ indicates stronger attention concentration on the first token. In our experiments, we report layer-wise sink scores by averaging $S_{\mathrm{sink}}^{(\ell,h)}(t)$ over validation sequences and attention heads.
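A minimal sketch of Eq. (5) and the layer-wise average, assuming attention weights are stored as a NumPy array whose last two axes are (query, key). The function names and the uniform causal attention used as a reference case are illustrative assumptions.

```python
import numpy as np

def sink_score(attn):
    """Eq. (5): mean attention mass on the first key, averaged over query positions.

    attn: (..., T, T) with rows as queries and columns as keys.
    Key position 1 in the text corresponds to index 0 here.
    """
    return attn[..., :, 0].mean(axis=-1)

def layerwise_sink_score(attn):
    """Average of Eq. (5) over sequences and heads for one layer.

    attn: (n_sequences, n_heads, T, T).
    """
    return float(sink_score(attn).mean())

def uniform_causal_attention(T):
    """Reference case: each query spreads its attention evenly over visible keys."""
    mask = np.tril(np.ones((T, T)))
    return mask / mask.sum(axis=-1, keepdims=True)
```

For uniform causal attention with $T=4$, query $i$ puts weight $1/i$ on the first token, so the score is $(1+\tfrac12+\tfrac13+\tfrac14)/4=25/48$; a head that sends all attention to the first token scores 1.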

#### Token\-level loss analysis

To analyze where the gain of small initialization comes from, we compare the prediction behavior of the small-initialization model and the standard-initialization model at the token level. For each validation token $y_{t}$, we denote the probability assigned to the ground-truth token by the small-initialization model and the standard-initialization model as

$$p_{\mathrm{small}}(y_{t}\mid x_{<t})\quad\text{and}\quad p_{\mathrm{std}}(y_{t}\mid x_{<t}),$$

respectively. The corresponding token-level losses are defined as

$$\ell_{\mathrm{small}}(y_{t})=-\log p_{\mathrm{small}}(y_{t}\mid x_{<t}),\qquad\ell_{\mathrm{std}}(y_{t})=-\log p_{\mathrm{std}}(y_{t}\mid x_{<t}).$$
A direct comparison of token\-level losses shows whether small initialization reduces the cross\-entropy loss on each token\. However, raw loss differences can be strongly affected by the intrinsic difficulty of individual tokens\. Therefore, in addition to token\-level loss, we analyze the probability assigned to the correct token\.

We first define the absolute correct\-token probability gap as

$$\Delta p_{t}=p_{\mathrm{small}}(y_{t}\mid x_{<t})-p_{\mathrm{std}}(y_{t}\mid x_{<t}).$$

To normalize the scale of the probability difference, we further use the symmetric probability gap:

$$\Delta^{\mathrm{sym}}p_{t}=\frac{2\left(p_{\mathrm{small}}(y_{t}\mid x_{<t})-p_{\mathrm{std}}(y_{t}\mid x_{<t})\right)}{p_{\mathrm{small}}(y_{t}\mid x_{<t})+p_{\mathrm{std}}(y_{t}\mid x_{<t})}.$$

This metric measures the relative improvement in the correct-token probability. It is positive when the small-initialization model assigns a higher probability to the ground-truth token, and negative when the standard-initialization model assigns a higher probability.

To examine how token\-level improvement depends on prediction difficulty, we define the difficulty of each token as the average token loss of the two models:

$$d_{t}=\frac{1}{2}\left(\ell_{\mathrm{small}}(y_{t})+\ell_{\mathrm{std}}(y_{t})\right).$$

We sort all validation tokens according to $d_{t}$ and divide them into ten equal-sized difficulty bins. For each bin, we compute the mean and median values of $\Delta^{\mathrm{sym}}p_{t}$.
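The token-level quantities above can be computed in a few lines. The sketch below assumes the per-token correct-token probabilities of the two models are already available as arrays; the function names and the use of `np.array_split` (which yields exactly equal bins only when the token count is divisible by the number of bins) are illustrative choices, not details taken from the paper.

```python
import numpy as np

def token_level_gaps(p_small, p_std):
    """Return (absolute gap, symmetric gap, difficulty) per token.

    p_small, p_std: probabilities assigned to the ground-truth token by the
    small-initialization and standard-initialization models, both in (0, 1].
    """
    p_small = np.asarray(p_small, dtype=float)
    p_std = np.asarray(p_std, dtype=float)
    loss_small = -np.log(p_small)
    loss_std = -np.log(p_std)
    gap = p_small - p_std                       # absolute probability gap
    sym_gap = 2.0 * gap / (p_small + p_std)     # symmetric gap, bounded in (-2, 2)
    difficulty = 0.5 * (loss_small + loss_std)  # average loss of the two models
    return gap, sym_gap, difficulty

def binned_sym_gap(sym_gap, difficulty, n_bins=10):
    """Sort tokens by difficulty, split into n_bins groups, return (mean, median) per bin."""
    order = np.argsort(difficulty, kind="stable")
    bins = np.array_split(sym_gap[order], n_bins)
    return [(float(b.mean()), float(np.median(b))) for b in bins]
```

For example, a token with $p_{\mathrm{small}}=0.6$ and $p_{\mathrm{std}}=0.4$ has an absolute gap of 0.2 and a symmetric gap of $2\cdot 0.2/1.0=0.4$.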

#### Cosine similarity and stable rank

To characterize the training dynamics induced by different initialization scales, we analyze the structure of model parameters during training\. We focus on two complementary measurements: row\-wise cosine similarity and stable rank\.

For a weight matrix $\bm{W}\in\mathbb{R}^{m\times n}$, let $\bm{w}_{i}\in\mathbb{R}^{n}$ denote its $i$-th row vector. We compute the pairwise cosine similarity between rows as

$$C_{ij}(\bm{W})=\frac{\langle\bm{w}_{i},\bm{w}_{j}\rangle}{\|\bm{w}_{i}\|_{2}\|\bm{w}_{j}\|_{2}},\tag{6}$$

where $C_{ij}(\bm{W})\in[-1,1]$. The resulting matrix $C(\bm{W})\in\mathbb{R}^{m\times m}$ describes the angular similarity among the row vectors of $\bm{W}$.

To quantify the effective dimensionality of $\bm{W}$, we further compute its stable rank, defined as

$$\mathrm{StableRank}(\bm{W})=\frac{\|\bm{W}\|_{F}^{2}}{\|\bm{W}\|_{2}^{2}}=\frac{\sum_{i=1}^{\min(m,n)}\sigma_{i}^{2}}{\sigma_{\max}^{2}},\tag{7}$$

where $\sigma_{i}$ denotes the $i$-th singular value of $\bm{W}$ and $\sigma_{\max}$ is the largest one. The stable rank is a continuous relaxation of the matrix rank.
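Both measurements in Eqs. (6) and (7) are short to compute. The following NumPy sketch uses hypothetical function names and assumes that no row of the matrix is all zeros (otherwise the cosine similarity is undefined).

```python
import numpy as np

def row_cosine_similarity(w):
    """Eq. (6): pairwise cosine similarity between the rows of w, shape (m, m)."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    unit_rows = w / norms  # assumes every row has a nonzero norm
    return unit_rows @ unit_rows.T

def stable_rank(w):
    """Eq. (7): squared Frobenius norm divided by squared spectral norm."""
    singular_values = np.linalg.svd(w, compute_uv=False)  # sorted in descending order
    return float(np.sum(singular_values ** 2) / singular_values[0] ** 2)
```

Two reference cases help with reading the resulting curves: an identity matrix of size $k$ has stable rank $k$ and mutually orthogonal rows, whereas a rank-one matrix whose rows are positive multiples of one vector has stable rank 1 and all pairwise cosine similarities equal to 1.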

## Acknowledgements

This work is sponsored by the National Key R&D Program of China Grant No. 2022YFA1008200 (Z. X.), the National Natural Science Foundation of China Grant Nos. 92570001 (Z. X.), 12371511 (Z. X.), and 12422119 (Z. X.), and the 2025 Key Technology R&D Program “New Generation Information Technology” Project 25511103100 of the Shanghai Municipal Science and Technology Commission (Z. X.).

## Appendix A Detailed analysis

We analyze the relative scale of the accumulated residual update $\bm{h}_{L}-\bm{e}$ compared with the initial embedding stream $\bm{e}$. Consider a residual network of the form

$$\bm{h}_{0}=\bm{e},\qquad\bm{h}_{\ell+1}=\bm{h}_{\ell}+\bm{F}_{\ell}(\bm{h}_{\ell}),\qquad\ell=0,\dots,L-1.$$

Unrolling the residual recursion gives

$$\bm{h}_{L}=\bm{e}+\sum_{\ell=0}^{L-1}\bm{F}_{\ell}(\bm{h}_{\ell}),$$

and therefore

$$\bm{h}_{L}-\bm{e}=\sum_{\ell=0}^{L-1}\bm{F}_{\ell}(\bm{h}_{\ell}).$$

Under small initialization, the residual branches are initially weak. Thus, in the leading-order scale analysis, the output of each block can be approximated by applying the corresponding residual branch to the initial embedding stream:

$$\bm{F}_{\ell}(\bm{h}_{\ell})\approx\bm{F}_{\ell}(\bm{e}).$$

Hence,

$$\bm{h}_{L}-\bm{e}\approx\sum_{\ell=0}^{L-1}\bm{F}_{\ell}(\bm{e}).$$
We assume that the embedding vector and the weight matrices are initialized as

$$\bm{e}_{i}\sim\mathcal{N}(0,d^{-2\gamma}),\qquad\bm{W}_{ij}\sim\mathcal{N}(0,d^{-2\gamma}).$$

Then the squared norm of the embedding satisfies

$$\mathbb{E}\lVert\bm{e}\rVert_{2}^{2}=\sum_{i=1}^{d}\mathbb{E}\bm{e}_{i}^{2}=d\cdot d^{-2\gamma}=d^{1-2\gamma}.$$

Thus the typical scale of the embedding norm is

$$\lVert\bm{e}\rVert_{2}=\Theta(d^{1/2-\gamma}).$$
Now we compute the scale of one residual module applied to $\bm{e}$. We take the module to be an RMSNorm followed by two linear transformations:

$$\bm{F}_{\ell}(\bm{e})=\bm{W}_{2,\ell}\,\phi\big(\bm{W}_{1,\ell}\operatorname{RMSNorm}(\bm{e})\big).$$

Ignoring the constant affine weight and assuming that the RMS term dominates the numerical $\varepsilon$, RMSNorm can be written as

$$\operatorname{RMSNorm}(\bm{e})=\frac{\bm{e}}{\operatorname{RMS}(\bm{e})}.$$

Here

$$\operatorname{RMS}(\bm{e})=\left(\frac{1}{d}\sum_{i=1}^{d}e_{i}^{2}\right)^{1/2}=\frac{\lVert\bm{e}\rVert_{2}}{\sqrt{d}}.$$

Since $\lVert\bm{e}\rVert_{2}=\Theta(d^{1/2-\gamma})$, we have

$$\operatorname{RMS}(\bm{e})=\Theta(d^{-\gamma}).$$

Therefore,

$$\lVert\operatorname{RMSNorm}(\bm{e})\rVert_{2}=\left\lVert\frac{\bm{e}}{\operatorname{RMS}(\bm{e})}\right\rVert_{2}=\Theta\left(\frac{d^{1/2-\gamma}}{d^{-\gamma}}\right)=\Theta(d^{1/2}).$$

Thus RMSNorm removes the initialization scale of $\bm{e}$ and maps it to a vector with coordinate scale $O(1)$ and Euclidean norm $\Theta(d^{1/2})$.
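As an illustrative check of the assumption that the RMS term dominates $\varepsilon$, consider a hypothetical width $d=1024$ with $\gamma=1$ (values chosen only for the arithmetic, not taken from the paper's configurations). Under the initialization above, the expected mean square of the embedding is:

```latex
\mathbb{E}\,\operatorname{RMS}(\bm{e})^{2}
  = \frac{1}{d}\sum_{i=1}^{d}\mathbb{E}\,e_{i}^{2}
  = d^{-2\gamma}
  = 1024^{-2}
  = 2^{-20}
  \approx 9.5\times 10^{-7}.
```

For these assumed values the mean square lies below the standard $\varepsilon=10^{-5}$ but far above $\varepsilon=10^{-12}$, so the simplification $\operatorname{RMSNorm}(\bm{e})=\bm{e}/\operatorname{RMS}(\bm{e})$ would be accurate only with the smaller constant.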

For a random matrix $\bm{W}\in\mathbb{R}^{d\times d}$ with entries initialized as $\mathcal{N}(0,d^{-2\gamma})$, its spectral norm has the typical scale

$$\lVert\bm{W}\rVert_{2}=\Theta(d^{1/2-\gamma}).$$

Therefore, after the first linear transformation,

$$\left\lVert\bm{W}_{1,\ell}\operatorname{RMSNorm}(\bm{e})\right\rVert_{2}=\Theta(d^{1/2-\gamma})\cdot\Theta(d^{1/2})=\Theta(d^{1-\gamma}).$$

We assume that the activation function in the FFN block contributes only an $O(1)$ factor to the leading-order scale, so that

$$\left\lVert\phi\big(\bm{W}_{1,\ell}\operatorname{RMSNorm}(\bm{e})\big)\right\rVert_{2}=\Theta(d^{1-\gamma}).$$

Applying the second linear transformation gives

$$\lVert\bm{F}_{\ell}(\bm{e})\rVert_{2}=\left\lVert\bm{W}_{2,\ell}\,\phi\big(\bm{W}_{1,\ell}\operatorname{RMSNorm}(\bm{e})\big)\right\rVert_{2}.$$

Using again $\lVert\bm{W}_{2,\ell}\rVert_{2}=\Theta(d^{1/2-\gamma})$, we obtain

$$\lVert\bm{F}_{\ell}(\bm{e})\rVert_{2}=\Theta(d^{1/2-\gamma})\cdot\Theta(d^{1-\gamma})=\Theta(d^{3/2-2\gamma}).$$

Using this per-module scale together with

$$\bm{h}_{L}-\bm{e}\approx\sum_{\ell=0}^{L-1}\bm{F}_{\ell}(\bm{e}),$$

and treating $L$ as fixed with respect to $d$, the accumulation over layers does not change the leading exponent in $d$. Thus,

$$\lVert\bm{h}_{L}-\bm{e}\rVert_{2}=\Theta(d^{3/2-2\gamma}).$$
Finally, comparing this with the initial embedding scale

$$\lVert\bm{e}\rVert_{2}=\Theta(d^{1/2-\gamma}),$$

we obtain

$$\frac{\lVert\bm{h}_{L}-\bm{e}\rVert_{2}}{\lVert\bm{e}\rVert_{2}}=\Theta\left(\frac{d^{3/2-2\gamma}}{d^{1/2-\gamma}}\right)=\Theta(d^{1-\gamma}).$$

Therefore, the relative scale between the accumulated residual update and the initial embedding stream is

$$\frac{\lVert\bm{h}_{L}-\bm{e}\rVert_{2}}{\lVert\bm{e}\rVert_{2}}\sim d^{1-\gamma}.$$
This gives three regimes as $d\to\infty$. When $\gamma<1$,

$$\frac{\lVert\bm{h}_{L}-\bm{e}\rVert_{2}}{\lVert\bm{e}\rVert_{2}}\to\infty,$$

so the residual update becomes larger than the embedding stream. When $\gamma=1$,

$$\frac{\lVert\bm{h}_{L}-\bm{e}\rVert_{2}}{\lVert\bm{e}\rVert_{2}}=\Theta(1),$$

so the residual update and the embedding stream remain comparable. When $\gamma>1$,

$$\frac{\lVert\bm{h}_{L}-\bm{e}\rVert_{2}}{\lVert\bm{e}\rVert_{2}}\to 0,$$

so the residual update becomes smaller than the embedding stream.

Thus, under the small-initialization approximation and for a pre-normalized FFN block with two linear transformations, $\gamma=1$ is the balance point at which the accumulated residual update $\bm{h}_{L}-\bm{e}$ and the initial embedding stream $\bm{e}$ have the same leading-order scale.
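The $d^{1-\gamma}$ scaling can be checked numerically with a small simulation. The sketch below is an illustration under stated assumptions rather than the paper's experiment: it uses ReLU for $\phi$, sets $\varepsilon=0$ in RMSNorm (matching the assumption that the RMS term dominates), runs the exact residual recursion with two layers, and fits the exponent over a few small widths. All function names and default values are hypothetical.

```python
import numpy as np

def rms_norm(h):
    """RMSNorm with unit gain and eps = 0 (assumption: RMS dominates eps)."""
    return h / np.sqrt(np.mean(h ** 2))

def residual_ratio(d, gamma, n_layers, rng):
    """||h_L - e|| / ||e|| for pre-norm FFN blocks with N(0, d^(-2*gamma)) entries."""
    std = d ** (-gamma)
    e = rng.normal(0.0, std, size=d)
    h = e.copy()
    for _ in range(n_layers):
        w1 = rng.normal(0.0, std, size=(d, d))
        w2 = rng.normal(0.0, std, size=(d, d))
        h = h + w2 @ np.maximum(w1 @ rms_norm(h), 0.0)  # ReLU stands in for phi
    return np.linalg.norm(h - e) / np.linalg.norm(e)

def fitted_exponent(gamma, dims=(64, 128, 256, 512), n_layers=2, trials=8, seed=0):
    """Least-squares slope of log(ratio) against log(d); the analysis predicts 1 - gamma."""
    rng = np.random.default_rng(seed)
    log_ratio = [
        np.log(np.mean([residual_ratio(d, gamma, n_layers, rng) for _ in range(trials)]))
        for d in dims
    ]
    slope, _intercept = np.polyfit(np.log(dims), log_ratio, 1)
    return float(slope)

for gamma in (0.5, 1.0, 1.5):
    print(f"gamma={gamma}: fitted exponent {fitted_exponent(gamma):+.3f}, predicted {1 - gamma:+.3f}")
```

A positive, near-zero, or negative fitted exponent corresponds to the three regimes $\gamma<1$, $\gamma=1$, and $\gamma>1$ respectively. The constant prefactor (which depends on the activation and on $L$) does not affect the slope.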

## Appendix B Model Architecture

Table 2: Architecture and training hyperparameters of the dense decoder-only Transformer models.

Table 3: Architecture and training hyperparameters of the MoE decoder-only Transformer models.

## Appendix C Complete results of stable rank

![Refer to caption](https://arxiv.org/html/2606.17945v1/pic/stable_rank_appendix_fig1_L1_L6.png)

Figure 5: Stable-rank dynamics of linear modules in layers 1–6. Rows correspond to different matrix types, including query, key, value, FFN up-projection, and FFN down-projection matrices, while columns correspond to layers. Each subplot compares the stable-rank trajectory of the same matrix under $\gamma=0.5$ and $\gamma=1$, with the vertical axis shown on a logarithmic scale.

![Refer to caption](https://arxiv.org/html/2606.17945v1/pic/stable_rank_appendix_fig2_L7_L12.png)

Figure 6: Stable-rank dynamics of linear modules in layers 7–12. Rows correspond to different matrix types, including query, key, value, FFN up-projection, and FFN down-projection matrices, while columns correspond to layers. Each subplot compares the stable-rank trajectory of the same matrix under $\gamma=0.5$ and $\gamma=1$, with the vertical axis shown on a logarithmic scale.

![Refer to caption](https://arxiv.org/html/2606.17945v1/pic/stable_rank_appendix_fig3_L13_L18.png)

Figure 7: Stable-rank dynamics of linear modules in layers 13–18. Rows correspond to different matrix types, including query, key, value, FFN up-projection, and FFN down-projection matrices, while columns correspond to layers. Each subplot compares the stable-rank trajectory of the same matrix under $\gamma=0.5$ and $\gamma=1$, with the vertical axis shown on a logarithmic scale.

![Refer to caption](https://arxiv.org/html/2606.17945v1/pic/stable_rank_appendix_fig4_L19_L24.png)

Figure 8: Stable-rank dynamics of linear modules in layers 19–24. Rows correspond to different matrix types, including query, key, value, FFN up-projection, and FFN down-projection matrices, while columns correspond to layers. Each subplot compares the stable-rank trajectory of the same matrix under $\gamma=0.5$ and $\gamma=1$, with the vertical axis shown on a logarithmic scale.
## References