On the Residual Scaling of Looped Transformers: Stability and Transferability

arXiv cs.LG 06/18/26, 04:00 AM Papers
Summary
This paper analyzes residual scaling in looped (weight-tied) transformers, showing that weight sharing requires stronger scaling (1/N) than standard residual networks, and derives a factored parameterization that enables hyperparameter transfer across loop counts without retuning.
arXiv:2606.18524v1 Announce Type: new Abstract: Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\!\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.
Cached at: 06/18/26, 05:44 AM
# On the Residual Scaling of Looped Transformers: Stability and Transferability
Source: [https://arxiv.org/html/2606.18524](https://arxiv.org/html/2606.18524)
\[1\] Tsinghua University \[2\] ByteDance Seed \[3\] M-A-P

(June 16, 2026)

###### Abstract

Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\sqrt{N}$ scaling across loop counts.

## 1 Introduction

Looped (weight-tied) Transformers reuse a single block $f$ for $N$ iterations ($h \leftarrow h + f(h)$, same $f$ at each step), increasing effective depth without adding parameters. This design appears in Universal Transformers \[[5](https://arxiv.org/html/2606.18524#bib.bib5)\], ALBERT \[[11](https://arxiv.org/html/2606.18524#bib.bib11)\], and recent work on algorithmic reasoning and latent computation \[[27](https://arxiv.org/html/2606.18524#bib.bib27),[7](https://arxiv.org/html/2606.18524#bib.bib7),[20](https://arxiv.org/html/2606.18524#bib.bib20),[8](https://arxiv.org/html/2606.18524#bib.bib8),[17](https://arxiv.org/html/2606.18524#bib.bib17),[30](https://arxiv.org/html/2606.18524#bib.bib30)\].

Figure 1 panels. (a) Deep network, $L$ distinct weight matrices $W_0, W_1, W_2, \cdots, W_{L-1}$: $h_{\ell+1} = h_\ell + \varepsilon\, r_\ell$, where $\varepsilon$ is the residual rescale and $r_\ell = W_\ell\,\phi(h_\ell)$; $\lVert \sum_\ell r_\ell \rVert = \Theta(\sqrt{L})$ and $\lVert \sum_\ell r_\ell \rVert^2 = \Theta(L)$; random-walk norm growth, standard scaling suffices: $\varepsilon = \lambda/\sqrt{L}$. (b) Looped network, single shared $W$ reused $N$ times: $h_{n+1} = h_n + \varepsilon\, r_n$, where $\varepsilon$ is the residual rescale and $r_n = W\,\phi(h_n)$; $\lVert \sum_n r_n \rVert = \Theta(N)$ and $\lVert \sum_n r_n \rVert^2 = \Theta(N^2)$; linear norm growth, linear scaling required: $\varepsilon = \lambda/N$.

Figure 1: Weight sharing changes how residuals accumulate. (a) In a deep network with independent weights, residual updates point in different directions and accumulate like a random walk, with norm $\Theta(\sqrt{L})$. Standard scaling $\varepsilon = 1/\sqrt{L}$ keeps the output bounded. (b) In a looped network, a single block is reused $N$ times. The shared weights make successive updates align, so their sum grows as $\Theta(N)$, requiring the stronger scaling $\varepsilon = 1/N$.

In practice, increasing $N$ often leads to training instability such as exploding hidden states and high sensitivity to the learning rate \[[30](https://arxiv.org/html/2606.18524#bib.bib30)\]. A standard remedy is to scale each residual branch by a factor $\varepsilon$ that shrinks with depth, giving $h \leftarrow h + \varepsilon\,f(h)$. Prior depth-scaling analyses prescribe $\varepsilon = 1/\sqrt{N}$ for deep residual networks \[[2](https://arxiv.org/html/2606.18524#bib.bib2),[6](https://arxiv.org/html/2606.18524#bib.bib6)\], but whether this rule transfers to looped architectures, where the same $f$ is reused at every step, has not been established.

We find that $1/\sqrt{N}$ scaling is indeed insufficient for looped models. Consider the residual-stream norm $\lVert h_N \rVert$ after $N$ iterations. For standard (non-shared) deep networks, $\varepsilon = 1/\sqrt{L}$ successfully keeps $\lVert h_L \rVert$ bounded as depth $L$ grows (Figure [2](https://arxiv.org/html/2606.18524#S1.F2), top row). For looped networks, however, $\varepsilon = 1/\sqrt{N}$ fails to control $\lVert h_N \rVert$, which grows rapidly with $N$; in contrast, $\varepsilon = 1/N$ keeps it bounded (Figure [2](https://arxiv.org/html/2606.18524#S1.F2), bottom row). Our theoretical analysis (Section [3](https://arxiv.org/html/2606.18524#S3)) explains this discrepancy: the $1/\sqrt{N}$ rule relies on the assumption that each layer has independent weights, but weight sharing makes successive updates correlated, amplifying residual-stream norm growth from $\Theta(\sqrt{N})$ to $\Theta(N)$.

![Refer to caption](https://arxiv.org/html/2606.18524v1/x1.png)

Figure 2: Linear scaling stabilizes looped networks; sqrt scaling does not. Normalized residual-stream norm $R = d^{-1/2}\lVert h \rVert_2$ (log scale) vs. depth $L$ (top) or loop count $N$ (bottom) during the first 10 training steps in the Llama-style pre-norm Transformer diagnostic; lines colored by step. $1/\sqrt{L}$ scaling stabilizes deep networks (panel b), but $1/\sqrt{N}$ fails for loops (panel e). Linear scaling $\varepsilon = 1/N$ keeps the residual-stream norm bounded across loop counts (panel f).

Beyond stabilizing the forward pass, the $1/N$ scaling also fixes the learning rate. The same constructive accumulation that drives $\Theta(N^2)$ squared-norm growth also amplifies weight updates: the output change from one optimizer step scales as $\eta\varepsilon N$, so setting $\varepsilon = 1/N$ makes the stable learning rate constant in $N$ (Section [3](https://arxiv.org/html/2606.18524#S3)). This enables hyperparameter transfer: a learning rate tuned at $N = 1$ remains near-optimal at larger $N$ without re-tuning.

We further extend the analysis to practical multi-layer blocks, where $L$ distinct layers are each reused $N$ times (Section [4](https://arxiv.org/html/2606.18524#S4)). This introduces a second variance source: across independent layers, updates accumulate as a random walk in $L$, just as in standard deep networks. The factorized parameterization $\varepsilon = \lambda/(N\sqrt{L})$ handles both sources independently: $1/N$ cancels the within-layer quadratic growth, while $1/\sqrt{L}$ controls the across-layer random walk. The resulting learning-rate law, $\eta \lesssim 1/(\lambda\sqrt{L})$, depends only on the unique depth $L$, not on $N$, so hyperparameter transfer continues to hold for multi-layer blocks.

Experiments on decoder-only Transformers trained on FineWeb-Edu \[[18](https://arxiv.org/html/2606.18524#bib.bib18)\] confirm these predictions (Section [5](https://arxiv.org/html/2606.18524#S5)). The pairwise correlation structure underlying quadratic accumulation (dense positive cosine similarity between loop-step updates) persists well beyond initialization through full training, confirming that the $\Theta(N^2)$ growth mechanism is not an initialization artifact. Linear residual scaling yields improved trainability and consistent learning-rate transfer across $N \in \{1,2,4,8\}$: the optimal learning rate remains nearly invariant, unlike under $1/\sqrt{N}$ scaling where it shifts with $N$. The factorized parameterization $\varepsilon = \lambda/(N\sqrt{L})$ further extends this transfer across depths $L \in \{12,24,48\}$, with a single learning rate remaining near-optimal over all tested $(N, L)$ combinations.

Taken together, our work extends the depth-scaling framework \[[26](https://arxiv.org/html/2606.18524#bib.bib26),[6](https://arxiv.org/html/2606.18524#bib.bib6)\] to weight-shared architectures, showing that the standard independence assumptions break down under parameter reuse and deriving the corrected scaling rules. Beyond the theoretical contribution, the resulting parameterization directly solves two practical problems: it improves trainability at large loop counts and eliminates the need to re-tune hyperparameters when varying $N$. By making the loop count stable and tunable, these results establish reuse as a practical scaling axis for weight-tied Transformers.

## 2 Related Work

##### Looped and parameter\-shared Transformers\.

Using one block repeatedly has been explored as recurrent depth in Universal Transformers \[[5](https://arxiv.org/html/2606.18524#bib.bib5)\] and as cross-layer parameter sharing in ALBERT \[[11](https://arxiv.org/html/2606.18524#bib.bib11)\]. More recent looped models show strong behavior on iterative algorithm learning and multi-step in-context procedures \[[27](https://arxiv.org/html/2606.18524#bib.bib27),[8](https://arxiv.org/html/2606.18524#bib.bib8)\], length generalization \[[7](https://arxiv.org/html/2606.18524#bib.bib7)\], and latent-reasoning style test-time compute scaling \[[20](https://arxiv.org/html/2606.18524#bib.bib20),[30](https://arxiv.org/html/2606.18524#bib.bib30)\]. Related parameter-sharing formulations also appear in looped neural networks \[[17](https://arxiv.org/html/2606.18524#bib.bib17)\]. Our work addresses the complementary question of how to parameterize the residual scaling so that training remains stable and hyperparameters transfer.

##### Stability in deep residual stacks and deep Transformers\.

Large-depth training has a long line of stabilization techniques, including residual reparameterizations and initialization rules such as Fixup and ReZero \[[29](https://arxiv.org/html/2606.18524#bib.bib29),[1](https://arxiv.org/html/2606.18524#bib.bib1)\], and Transformer-specific stabilizers such as DeepNet/DeepNorm \[[23](https://arxiv.org/html/2606.18524#bib.bib23)\]. On the theory side, prior analyses of deep non-shared residual networks characterize when residual scaling keeps signals controlled in the large-depth limit \[[15](https://arxiv.org/html/2606.18524#bib.bib15)\]. A complementary line studies generalization for continuous-depth models and their ResNet analogues \[[4](https://arxiv.org/html/2606.18524#bib.bib4)\]: Marion \[[14](https://arxiv.org/html/2606.18524#bib.bib14)\] derives a Lipschitz-based bound whose complexity term depends on differences between successive weight matrices. These works mostly treat depth as a stack of distinct layers, whereas in our setting loop steps reuse the same parameters.

##### Hyperparameter transfer and parameterization\.

Tensor-program and $\mu$P-style analyses establish transfer principles across model scales and motivate systematic parameterization choices \[[26](https://arxiv.org/html/2606.18524#bib.bib26)\]. Recent depth-transfer analyses in residual networks study how learning-rate and initialization choices change with depth under non-shared assumptions \[[2](https://arxiv.org/html/2606.18524#bib.bib2),[10](https://arxiv.org/html/2606.18524#bib.bib10)\]. CompleteP and follow-up work extend this direction for deep Transformers and broader axes of transfer \[[6](https://arxiv.org/html/2606.18524#bib.bib6),[16](https://arxiv.org/html/2606.18524#bib.bib16)\]. Our work is complementary, targeting the *loop axis* and showing that shared weights create cross-step correlations that change the stability threshold and transfer regime.

## 3 Loop Scaling for Shared Layers

We analyze initialization-time scaling of a single shared MLP to isolate the effect of weight sharing; Section [4](https://arxiv.org/html/2606.18524#S4) extends to multi-layer blocks.

### 3.1 Setup

Consider the following simplified residual model, which abstracts one residual branch of a Transformer block:

$$
\begin{aligned}
h_{n+1} &= h_n + \varepsilon\, W \phi(h_n), \\
\varepsilon &= N^{-\alpha}, \qquad n = 0, \dots, N-1.
\end{aligned}
\tag{1}
$$

Here $N$ is the loop count (the number of times the shared layer is applied), and $\alpha > 0$ is the scaling exponent that controls how aggressively the residual branch is down-scaled with $N$. The hidden state $h_n \in \mathbb{R}^d$ is initialized from a given input $h_0$ with $\lVert h_0 \rVert_2^2 / d = \Theta(1)$; the model output is $h_N$. The shared weight matrix $W \in \mathbb{R}^{d \times d}$ is drawn with i.i.d. entries $W_{ij} \sim \mathcal{N}(0, 1/d)$, and $\phi$ is the ReLU activation. Our goal is to determine the minimum $\alpha$ that keeps $h_N$ bounded as $N$ grows, and the induced scaling law for the learning rate.

We write $u_n \triangleq \phi(h_n)$ for the post-activation vector, $r_n \triangleq W u_n$ for the per-step residual, and $R_n \triangleq d^{-1/2} \lVert h_n \rVert_2$ for the normalized residual-stream norm.
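The setup above can be simulated directly. The following is a minimal NumPy sketch of recurrence (1), not the authors' code: the function name `final_norm`, the width `d=128`, the seed, and the non-shared baseline (a fresh Gaussian matrix at every step) are illustrative assumptions.

```python
import numpy as np

def final_norm(N, alpha, d=128, shared=True, seed=0):
    """Return R_N = ||h_N||_2 / sqrt(d) after N steps with eps = N**(-alpha)."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(d)                    # ||h_0||^2 / d is close to 1
    W = rng.standard_normal((d, d)) / np.sqrt(d)  # W_ij ~ N(0, 1/d)
    eps = float(N) ** (-alpha)
    for _ in range(N):
        if not shared:
            # Assumed baseline: an independent weight matrix at every step.
            W = rng.standard_normal((d, d)) / np.sqrt(d)
        h = h + eps * (W @ np.maximum(h, 0.0))    # phi = ReLU
    return np.linalg.norm(h) / np.sqrt(d)

if __name__ == "__main__":
    for N in (4, 16, 64, 256):
        print(N,
              final_norm(N, 0.5, shared=False),   # independent weights, 1/sqrt(N)
              final_norm(N, 0.5, shared=True),    # shared weights, 1/sqrt(N)
              final_norm(N, 1.0, shared=True))    # shared weights, 1/N
```

In this toy model one would expect the second column to grow with $N$ while the other two stay of order one. It is only a qualitative check of the scaling argument, not a reproduction of the Transformer diagnostic in Figure 2.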

![Refer to caption](https://arxiv.org/html/2606.18524v1/x2.png)

Figure 3: Weight sharing induces persistent cross-step correlations. Pairwise cosine similarity between block-level increments $\delta_i = h_i - h_{i-1}$, comparing a non-shared deep stack (panel a; $64$ independent copies of a 12-layer block, effective depth $12 \times 64$) with a looped network (panel b; $L = 12$, $N = 64$, $d = 768$). Both configurations have the same effective depth; the only difference is whether block weights are shared. Both are measured without residual scaling, after 10 training steps. In the non-shared case, off-diagonal correlations are negligible (range $[-0.037, 0.034]$). In the looped case, the shared weights produce dense positive alignment (range $[0.027, 0.995]$), consistent with $\Theta(N^2)$ accumulation (Theorem [1](https://arxiv.org/html/2606.18524#Thmtheorem1)).

### 3.2 Quadratic variance accumulation

Unrolling ([1](https://arxiv.org/html/2606.18524#S3.E1)) gives $h_N = h_0 + \varepsilon \sum_{n=0}^{N-1} r_n$. To see how $R_N$ scales with $N$, we expand $R_N^2$:

$$
R_N^2 = R_0^2 + \underbrace{\frac{2\varepsilon}{d} \sum_{n=0}^{N-1} \langle h_0, r_n \rangle}_{B_N} + \underbrace{\frac{\varepsilon^2}{d} \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} \langle r_n, r_m \rangle}_{C_N}.
$$

The cross term $B_N$ sums $N$ inner products between the fixed input $h_0$ and the residuals $r_n$, so $B_N = O(\varepsilon N)$. The quadratic term $C_N$, however, sums $N^2$ pairwise interactions and equals $\frac{\varepsilon^2}{d} \lVert \sum_n r_n \rVert_2^2$. Whether $C_N$ grows as $\Theta(\varepsilon^2 N^2)$ or merely $\Theta(\varepsilon^2 N)$ depends on whether the residual updates reinforce or cancel.

Because $W$ is shared across all iterations, we can factor it out:

$$
\Bigl\lVert \sum_{n=0}^{N-1} r_n \Bigr\rVert_2^2 = \Bigl\lVert W \sum_{n=0}^{N-1} u_n \Bigr\rVert_2^2.
$$

This reduces the question to: how does $\lVert \sum_n u_n \rVert$ scale with $N$? Since all $u_n = \phi(h_n)$ are produced by the same recurrence with shared $W$, successive activations are correlated. When this correlation is strong enough that the activations accumulate constructively, the norm of their sum grows as $\Theta(N)$ rather than $\Theta(\sqrt{N})$, giving $\lVert \sum_n r_n \rVert_2^2 = \Theta(N^2)$ and thus $C_N = \Theta(\varepsilon^2 N^2)$.

In a standard (non-shared) deep residual network, this does not happen: each step uses an independent weight matrix $W_n$, so the cross terms $\langle r_n, r_m \rangle$ have zero mean for $n \neq m$ and the sum grows only as $\Theta(\sqrt{N})$ \[[2](https://arxiv.org/html/2606.18524#bib.bib2),[6](https://arxiv.org/html/2606.18524#bib.bib6)\]. Weight sharing breaks this independence and changes the squared-norm accumulation from $\Theta(N)$ (random walk) to $\Theta(N^2)$ (linear norm growth from constructive accumulation). Figure [3](https://arxiv.org/html/2606.18524#S3.F3) confirms this empirically: pairwise cosine similarities between loop-step increments $\delta_n = h_n - h_{n-1}$ are near-zero off-diagonal in non-shared stacks but densely positive in looped networks.
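The contrast between shared and independent weights can be illustrated on the toy recurrence (1). The sketch below is a hypothetical stand-in for the Transformer measurement in Figure 3: it uses a single ReLU layer, and, unlike the figure (which is measured without residual scaling), it assumes $\varepsilon = 1/N$ to keep the toy model well-behaved. The function name and default sizes are illustrative.

```python
import numpy as np

def offdiag_cosines(N=32, d=256, shared=True, seed=0):
    """Off-diagonal cosine similarities between increments delta_n = h_{n+1} - h_n."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(d)
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    eps = 1.0 / N  # assumption: linear scaling, unlike the unscaled runs in Figure 3
    deltas = []
    for _ in range(N):
        if not shared:
            W = rng.standard_normal((d, d)) / np.sqrt(d)  # fresh weights per step
        delta = eps * (W @ np.maximum(h, 0.0))
        deltas.append(delta)
        h = h + delta
    D = np.stack(deltas)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)      # unit-normalize each increment
    C = D @ D.T                                           # N x N cosine-similarity matrix
    return C[~np.eye(N, dtype=bool)]                      # drop the diagonal

if __name__ == "__main__":
    for shared in (False, True):
        c = offdiag_cosines(shared=shared)
        print("shared" if shared else "independent", c.min(), c.mean(), c.max())
```

With independent weights each increment is a fresh random direction, so the off-diagonal entries should scatter around zero; with a shared $W$ they should be predominantly positive.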

The following theorem formalizes this. The proof has two parts: ReLU non-negativity places $\sum_n u_n$ in the non-negative orthant $\mathcal{C}_+ = \{x \in \mathbb{R}^d : x_i \geq 0\}$, ensuring constructive accumulation; a Gaussian matrix argument \[[9](https://arxiv.org/html/2606.18524#bib.bib9)\] then shows $W$ preserves this norm.

###### Theorem 1 (Looped residual-stream norm scaling).

Assume:

1. (Nondegenerate activation mass) $\frac{1}{N} \sum_{n=0}^{N-1} \frac{1}{d} \mathbf{1}^\top u_n \geq m_- > 0$, where $m_-$ is independent of $N$.
2. (Bounded activation scale) $\frac{1}{N} \sum_{n=0}^{N-1} \frac{1}{d} \lVert u_n \rVert_2^2 \leq q_+ < \infty$, where $q_+$ is independent of $N$.
3. (Positive-cone gain) $c_W \lVert x \rVert_2 \leq \lVert W x \rVert_2 \leq C_W \lVert x \rVert_2$ for every $x \in \mathcal{C}_+$, with $c_W, C_W > 0$.

Then

$$
\frac{1}{d} \Bigl\lVert \sum_{n=0}^{N-1} r_n \Bigr\rVert_2^2 = \Theta(N^2),
$$

and bounded residual-stream norm requires $\varepsilon N = O(1)$, i.e. $\alpha \geq 1$. For Gaussian $W$, the positive-cone gain holds with high probability (Appendix [7.2](https://arxiv.org/html/2606.18524#S7.SS2)).

Proof. See Appendix [7.2](https://arxiv.org/html/2606.18524#S7.SS2).

In words, the residual branch must be scaled as $1/N$, not $1/\sqrt{N}$; Figure [2](https://arxiv.org/html/2606.18524#S1.F2) confirms this.
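One way to see how the three assumptions of Theorem 1 combine is the following sketch. It is a reading of the theorem statement using only the Cauchy–Schwarz inequality, and it may differ from the actual argument in Appendix 7.2. Assumption 1 gives a lower bound and assumption 2 an upper bound on the norm of the summed activations:

$$
\Bigl\lVert \sum_{n=0}^{N-1} u_n \Bigr\rVert_2 \;\geq\; \frac{1}{\sqrt{d}}\, \mathbf{1}^\top \sum_{n=0}^{N-1} u_n \;\geq\; \sqrt{d}\, N\, m_-,
\qquad
\Bigl\lVert \sum_{n=0}^{N-1} u_n \Bigr\rVert_2 \;\leq\; \sum_{n=0}^{N-1} \lVert u_n \rVert_2 \;\leq\; \sqrt{N \sum_{n=0}^{N-1} \lVert u_n \rVert_2^2} \;\leq\; N \sqrt{d\, q_+}.
$$

Since every $u_n$ is a ReLU output, $\sum_n u_n \in \mathcal{C}_+$, so assumption 3 applies to it and transfers both bounds through $W$:

$$
c_W^2\, m_-^2\, N^2 \;\leq\; \frac{1}{d} \Bigl\lVert W \sum_{n=0}^{N-1} u_n \Bigr\rVert_2^2 \;=\; \frac{1}{d} \Bigl\lVert \sum_{n=0}^{N-1} r_n \Bigr\rVert_2^2 \;\leq\; C_W^2\, q_+\, N^2 .
$$

Both sides scale as $N^2$, which matches the $\Theta(N^2)$ statement; multiplying by $\varepsilon^2$ shows that $C_N$ stays bounded only when $\varepsilon N = O(1)$.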

Remark (beyond ReLU). ReLU non-negativity is a *sufficient* condition used in the proof; Figures [3](https://arxiv.org/html/2606.18524#S3.F3)(b) and [6](https://arxiv.org/html/2606.18524#S4.F6) suggest that constructive accumulation also occurs in SwiGLU-based models, where the positive-cone argument does not directly apply.

### 3.3 Learning-rate scaling

Theorem [1](https://arxiv.org/html/2606.18524#Thmtheorem1) controls the forward pass: setting $\varepsilon = 1/N$ keeps the residual-stream norm bounded at initialization. But stable initialization alone does not guarantee stable training: the learning rate $\eta$ must also be chosen so that each optimizer step produces an $O(1)$ change in the output $h_N$, independent of $N$. This is the maximal-update principle of Yang et al. \[[26](https://arxiv.org/html/2606.18524#bib.bib26)\]; following Dey et al. \[[6](https://arxiv.org/html/2606.18524#bib.bib6)\], we apply it to the loop axis. Let $\Delta W$ denote the one-step weight update and $\Delta h_N$ the resulting change in output.

###### Theorem 2 (Loop-wise learning-rate scaling).

Assume:

1. (Update scale) $\Delta W_{ij} = \frac{\eta}{\sqrt{d}} S_{ij}$ with $\lVert \Delta W \rVert = O(\eta)$, modeling Adam-style sign updates.
2. (Stable activations) $\lVert u_n \rVert_2^2 / d = \Theta(1)$, ensured by Theorem [1](https://arxiv.org/html/2606.18524#Thmtheorem1).
3. (Stable forward scaling) $\varepsilon N = O(1)$.

Then

$$
\frac{1}{\sqrt{d}} \lVert \Delta h_N \rVert_2 = O(\eta \varepsilon N),
$$

so $\eta\,\varepsilon\,N = O(1)$ is sufficient for an $O(1)$ width-normalized one-step output perturbation.

Proof. See Appendix [7.3](https://arxiv.org/html/2606.18524#S7.SS3).

At the critical scaling $\alpha = 1$, the sharpness statement gives $\eta = \Theta(1)$: the optimal learning rate is constant in $N$, enabling hyperparameter transfer from small to large loop counts (Section [5](https://arxiv.org/html/2606.18524#S5)). Figure [4](https://arxiv.org/html/2606.18524#S3.F4) corroborates the $\eta \varepsilon N$ upper-bound scaling: the per-step output update grows as $N$ and $\sqrt{N}$ under the unscaled and square-root scaling rules, respectively (where the leading $\eta \varepsilon N$ term is empirically sharp as a scaling law, though outside the $\varepsilon N = O(1)$ premise of Theorem [2](https://arxiv.org/html/2606.18524#Thmtheorem2)), while linear scaling keeps it bounded within a small range.
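The $\eta\varepsilon N$ dependence can be probed on the toy model by perturbing the shared matrix once and re-running the loop. This is a sketch under stated assumptions, not the paper's experiment: the update $\Delta W = \frac{\eta}{\sqrt{d}} S$ uses a random sign matrix $S$ as a stand-in for an Adam-style sign update (a real update would be gradient-aligned), and the names `run_loop` and `output_update_rms` are hypothetical.

```python
import numpy as np

def run_loop(W, h0, N, eps):
    """Apply h <- h + eps * W relu(h) for N steps."""
    h = h0.copy()
    for _ in range(N):
        h = h + eps * (W @ np.maximum(h, 0.0))
    return h

def output_update_rms(N, alpha, eta=1e-3, d=256, seed=0):
    """RMS change in h_N after one assumed sign-like update of the shared W."""
    rng = np.random.default_rng(seed)
    h0 = rng.standard_normal(d)
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    S = rng.choice([-1.0, 1.0], size=(d, d))   # assumption: random signs, not real gradients
    dW = eta / np.sqrt(d) * S                  # Delta W_ij = eta / sqrt(d) * S_ij
    eps = float(N) ** (-alpha)
    dh = run_loop(W + dW, h0, N, eps) - run_loop(W, h0, N, eps)
    return np.linalg.norm(dh) / np.sqrt(d)

if __name__ == "__main__":
    for N in (4, 16, 64):
        print(N, output_update_rms(N, 0.5), output_update_rms(N, 1.0))
```

Under $\varepsilon = 1/N$ the printed update should stay at a similar magnitude as $N$ changes, while under $\varepsilon = 1/\sqrt{N}$ it should increase with $N$.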

![Refer to caption](https://arxiv.org/html/2606.18524v1/x3.png)

Figure 4: Constant-LR per-step output updates track the $\eta\,\varepsilon\,N$ upper-bound scaling. RMS change in the final pre-RMSNorm residual hidden state after one optimizer step with fixed $\eta = \Theta(1)$ vs. loop count $N$.

## 4 Extension to Multi-Layer Blocks

![Refer to caption](https://arxiv.org/html/2606.18524v1/x4.png)

Figure 5: Learning-rate transfer across scaling rules and depths. Curves sweep learning rate for loop counts $N \in \{1,2,4,8\}$; stars mark per-$N$ optima. The $x$-axis is the base learning rate $\eta_0$; the actual repeated-block learning rate is $\eta_0\, m_L^{-1/2} = \eta_0 (L/12)^{-1/2}$, so for the $L = 12$ panels (a,b) plotted and actual coincide, while panels (c,d) at $L = 24, 48$ apply the depth correction implicitly. At $L = 12$, sqrt scaling shifts the optimum with $N$ (panel a), while linear scaling aligns optima and improves large-$N$ loss (panel b). With $\varepsilon = 1/(N\sqrt{L})$, a single base learning rate remains near-optimal across both $L$ and $N$. Diverged runs (final validation loss $> 4$) are omitted.

![Refer to caption](https://arxiv.org/html/2606.18524v1/x5.png)

Figure 6: Cosine similarity between loop-step increments $\delta_n = h_n - h_{n-1}$ after full training (20,000 steps, $N = 8$). Most early and middle adjacent loop steps remain positively correlated, and the overall pattern is qualitatively consistent across depths; pairs involving the final loop step (step 8) are weaker and can be negative.

![Refer to caption](https://arxiv.org/html/2606.18524v1/x6.png)

Figure 7: Residual-stream norm trajectories across loop steps under linear residual scaling. Each line traces the residual-stream L2 norm from loop step 0 (post-embedding) through the final step, for $N \in \{2,4,8\}$. Although the per-step increments differ across $N$, the final norms converge to approximately the same value within each depth group, confirming that the $1/N$ rescaling keeps the residual stream bounded regardless of $N$.

Section [3](https://arxiv.org/html/2606.18524#S3) analyzed a single shared layer; practical looped architectures repeat a block of $L$ distinct layers $N$ times. We now extend the scaling analysis to this setting.

### 4.1 Setup

Each layer $\ell$ has its own weight matrix $W_\ell$, but the same $W_\ell$ is reused across all $N$ iterations. We index hidden states as $h_{n,\ell}$, where $n \in \{0, \dots, N-1\}$ is the loop iteration and $\ell \in \{0, \dots, L-1\}$ is the layer index within the block. The recursion is:

$$
\begin{aligned}
h_{n,\ell+1} &= h_{n,\ell} + \varepsilon\, W_\ell\, \phi(h_{n,\ell}), \\
\ell &= 0, \dots, L-1, \qquad n = 0, \dots, N-1,
\end{aligned}
\tag{2}
$$

where the output of the last layer feeds into the next iteration: $h_{n+1,0} = h_{n,L}$, starting from $h_{0,0} = h_0$. The model output is $h_{N-1,L}$, equivalently $h_{N,0}$ after the final loop pass; our goal is to determine $\varepsilon$ as a function of both $N$ and $L$.

### 4.2 Two-source variance accumulation

Telescoping the residual additions gives:

$$
h_{\text{out}} = h_0 + \varepsilon \sum_{\ell=0}^{L-1} G_\ell, \qquad G_\ell \triangleq W_\ell \sum_{n=0}^{N-1} \phi(h_{n,\ell}).
$$

Unlike the single-layer case, the output variance now receives contributions from two distinct sources. Within each layer $\ell$, weight sharing produces the same quadratic accumulation as in Section [3](https://arxiv.org/html/2606.18524#S3): successive activations $\phi(h_{n,\ell})$ are correlated across loop steps, giving $\lVert G_\ell \rVert^2 = \Theta(N^2 d)$. Across the $L$ layers, the independent weight matrices contribute a standard depth-wise random walk, the same mechanism studied for non-shared deep networks \[[6](https://arxiv.org/html/2606.18524#bib.bib6)\].

The only additional issue is that the $L$ layers are not literally independent: they communicate through the same residual stream. The factorized scaling is therefore valid when this communication remains local in strength. Informally, after the within-layer loop accumulation has been normalized by $1/N$, changing one unique layer should perturb the other normalized branch outputs only through the total residual size $\varepsilon N$, rather than dragging all layers in a single coherent direction. Under this weak cross-layer-coupling picture, the remaining accumulation over $\ell$ is the ordinary depth-wise random walk, giving the extra $1/\sqrt{L}$ factor. The post-training hidden-norm and cosine diagnostics in Figures [7](https://arxiv.org/html/2606.18524#S4.F7) and [6](https://arxiv.org/html/2606.18524#S4.F6) are consistent with this picture: the residual stream remains controlled across loop counts, while correlations persist without producing a global layer-wise collapse. Writing $\varepsilon = \lambda/(N\sqrt{L})$ with a tunable $O(1)$ constant $\lambda$, the following theorem formalizes this intuition.

###### Theorem 3 (Multi-layer block scaling; informal).

Assume that:

1. (i) each layer’s branch output has nondegenerate normalized norm;
2. (ii) the conditional mean of each layer’s branch output, given the other layers’ weights, has normalized norm at most $O(\varepsilon N)$;
3. (iii) replacing any single weight matrix changes other branches’ outputs by at most $O(\varepsilon N)$.

Then $\varepsilon = \lambda/(N\sqrt{L})$ yields a residual-stream norm bounded by a function of $\lambda$ alone, independent of both $N$ and $L$. For sufficiently small $\lambda$, the bound is tight: the unscaled total branch variance is $\Theta(LN^2)$. The formal statement and proof are in Appendix [7.4](https://arxiv.org/html/2606.18524#S7.SS4), Theorem [6](https://arxiv.org/html/2606.18524#Thmtheorem6).
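The multi-layer recursion (2) and the factored scaling can be checked on a toy model. The sketch below is illustrative only: it replaces Transformer blocks with single ReLU layers, and the function names, width `d=128`, seed, and the comparison rule $\varepsilon = 1/\sqrt{NL}$ (which treats the looped stack as an ordinary depth-$NL$ network) are assumptions introduced here.

```python
import numpy as np

def block_loop_norm(N, L, eps, d=128, seed=0):
    """Normalized output norm of L distinct ReLU layers looped N times (recursion (2))."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(d)
    Ws = rng.standard_normal((L, d, d)) / np.sqrt(d)  # one matrix per unique layer
    for _ in range(N):        # each loop pass reuses the same L matrices
        for W in Ws:          # L distinct layers inside the block
            h = h + eps * (W @ np.maximum(h, 0.0))
    return np.linalg.norm(h) / np.sqrt(d)

def factored_eps(N, L, lam=1.0):
    """Factored residual scale eps = lambda / (N * sqrt(L))."""
    return lam / (N * np.sqrt(L))

if __name__ == "__main__":
    for N, L in [(1, 1), (4, 4), (32, 4), (8, 16)]:
        naive = 1.0 / np.sqrt(N * L)  # assumed baseline: depth-(N*L) non-shared rule
        print(N, L,
              block_loop_norm(N, L, factored_eps(N, L)),
              block_loop_norm(N, L, naive))
```

In this sketch the factored rule should keep the output norm of order one across the listed $(N, L)$ pairs, while the naive rule should let it grow once $N$ is large.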

### 4.3 Learning-rate scaling

Under the factorized scaling, an update $\Delta W_\ell$ to layer $\ell$ is applied at all $N$ loop steps, contributing $O(\varepsilon \eta N) = O(\eta\lambda/\sqrt{L})$ to the output change. In the worst case, the $L$ layer updates add coherently with norm $O(L)$, giving total output change $O(\eta\lambda\sqrt{L})$. Requiring $O(1)$ output change yields:

$$
\eta \lesssim \frac{1}{\lambda\sqrt{L}}.
$$

Proposition [7](https://arxiv.org/html/2606.18524#Thmtheorem7) (Appendix [7.5](https://arxiv.org/html/2606.18524#S7.SS5)) formalizes this as a linearized maximal-update bound, sharp under a coherence condition on the layer updates. Crucially, $N$ does not appear: once $1/N$ scaling is applied to the residual branch, the learning rate depends only on the unique depth $L$, enabling hyperparameter transfer. Table [1](https://arxiv.org/html/2606.18524#S4.T1) summarizes the resulting scaling recipe; the full per-component parameterization is in Appendix [8.2](https://arxiv.org/html/2606.18524#S8.SS2).

Table 1: Executive scaling recipe for a looped Transformer with $L$ unique layers repeated $N$ times. $m_L = L/L_{\mathrm{ref}}$ is the depth multiplier relative to a reference model. $\lambda$ is a tunable $O(1)$ constant.
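The two rules stated in the text, $\varepsilon = \lambda/(N\sqrt{L})$ and a repeated-block learning rate of $\eta_0\, m_L^{-1/2}$, can be collected in a small helper. This is a sketch, not the full per-component recipe of Table 1 or Appendix 8.2: the function name `looped_scaling`, the default `base_lr`, and the default `L_ref=12` (taken from the Figure 5 caption) are illustrative assumptions.

```python
import math

def looped_scaling(N, L, lam=1.0, base_lr=1e-3, L_ref=12):
    """Return (eps, lr) for L unique layers looped N times.

    eps = lam / (N * sqrt(L))   residual-branch scale
    lr  = base_lr * m_L**-0.5   with depth multiplier m_L = L / L_ref
    base_lr is a hypothetical placeholder value, not a tuned setting.
    """
    eps = lam / (N * math.sqrt(L))
    m_L = L / L_ref
    lr = base_lr * m_L ** -0.5
    return eps, lr

if __name__ == "__main__":
    for N in (1, 2, 4, 8):
        for L in (12, 24, 48):
            print(N, L, looped_scaling(N, L))
```

The learning rate returned here does not depend on `N`, which is the transfer property the section describes: only the residual scale changes when the loop count is varied.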

## 5 Experiments

We test the scaling predictions on looped Transformers trained for language modeling, progressing from a controlled initialization diagnostic (Section [5.1](https://arxiv.org/html/2606.18524#S5.SS1)) to full-scale training with learning-rate transfer (Sections [5.2](https://arxiv.org/html/2606.18524#S5.SS2)–[5.3](https://arxiv.org/html/2606.18524#S5.SS3)) and validation that the theoretical assumptions hold beyond initialization (Section [5.4](https://arxiv.org/html/2606.18524#S5.SS4)).

##### Model structure\.

All experiments use decoder-only Transformers with a Llama-style pre-norm architecture \[[22](https://arxiv.org/html/2606.18524#bib.bib22),[21](https://arxiv.org/html/2606.18524#bib.bib21)\]. The token embedding and unembedding head sit outside the loop; the looped stack contains $L$ unique Transformer blocks, each with two residual branches (attention and MLP), and the entire $L$-block sequence is repeated for $N$ passes. The residual scaling factor $\varepsilon$ is applied to each branch; in the single-layer theory (Section [3](https://arxiv.org/html/2606.18524#S3)), one “layer” corresponds to one branch, so the constant factor of two per block is absorbed into $\lambda$.

### 5.1 Residual scaling at initialization

##### Protocol\.

We run a controlled experiment that measures the residual-stream norm during the first $10$ optimizer steps. We vary the number of loop iterations $N\in\{1,2,4,8,16,32,64\}$ and compare three residual scalings: *none* ($\varepsilon{=}1$), *sqrt* ($\varepsilon{=}1/\sqrt{N}$), and *linear* ($\varepsilon{=}1/N$). Training uses AdamW for 10 steps, batch size 1, sequence length 128, and fixed random token inputs. Figure [2](https://arxiv.org/html/2606.18524#S1.F2) plots the residual-stream norm $R=d^{-1/2}\lVert h\rVert_{2}$ at the end of the last loop iteration, averaged over 10 seeds. To directly observe the correlation structure, we also measure pairwise cosine similarities between block-level increments $\delta_{n}=h_{n}-h_{n-1}$ for a non-shared deep stack ($64$ independent copies of a 12-layer block, effective depth $12\times 64$) and a looped network ($L{=}12$, $N{=}64$), both without residual scaling (Figure [3](https://arxiv.org/html/2606.18524#S3.F3)). Using the same setup, we measure the per-step output update, defined as the root-mean-square (RMS) change in $h_{N}$ after one optimizer step, to directly test the $\eta\varepsilon N$ prediction of Theorem [2](https://arxiv.org/html/2606.18524#Thmtheorem2) (Figure [4](https://arxiv.org/html/2606.18524#S3.F4)).

##### Findings\.

(i) *linear scaling controls forward-pass growth*: $1/\sqrt{N}$ scaling is insufficient; the residual-stream norm grows rapidly with $N$ and can explode during early optimization (Figure [2](https://arxiv.org/html/2606.18524#S1.F2), bottom row), while linear scaling keeps it bounded and approximately invariant across $N$ (Figure [2](https://arxiv.org/html/2606.18524#S1.F2), panel f). (ii) *the stronger scaling is specific to weight sharing*: for non-shared deep stacks, $1/\sqrt{L}$ scaling suffices (Figure [2](https://arxiv.org/html/2606.18524#S1.F2), top row). (iii) *the mechanism is constructive accumulation*: in non-shared stacks, pairwise cosine similarities between residual updates are near-zero off-diagonal (Figure [3](https://arxiv.org/html/2606.18524#S3.F3), panel a); in looped networks, the shared weights produce dense positive correlations, so updates reinforce rather than cancel (panel b), matching the prediction of Theorem [1](https://arxiv.org/html/2606.18524#Thmtheorem1). (iv) *per-step output update tracks the leading $\eta\varepsilon N$ trend*: with a shared base learning rate, the output update grows as $N$ for $\varepsilon{=}1$ and as $\sqrt{N}$ for $\varepsilon{=}1/\sqrt{N}$, tracking the leading $\eta\varepsilon N$ trend (Figure [4](https://arxiv.org/html/2606.18524#S3.F4)). Theorem [2](https://arxiv.org/html/2606.18524#Thmtheorem2) proves a uniform sufficient bound in the stable regime $\varepsilon N=O(1)$; the unscaled and square-root curves are empirical extrapolations of the same leading term outside the theorem’s assumptions. For $\varepsilon{=}1/N$, the update is not strictly constant (it grows mildly before saturating near $N{=}32$) but stays within a small range across the full $N{=}1$–$64$ sweep, consistent with the theorem providing an $O(\cdot)$ upper bound rather than a tight scaling (Appendix [7.3](https://arxiv.org/html/2606.18524#S7.SS3)).

### 5.2 Learning-rate transfer across loop counts

##### Protocol\.

We train $L{=}12$ looped Transformers on FineWeb-Edu \[[18](https://arxiv.org/html/2606.18524#bib.bib18)\] (10B tokens) with loop counts $N\in\{1,2,4,8\}$. For each $N$, we sweep learning rates around a base value tuned at $N{=}1$ and compare sqrt vs. linear residual scaling. Optimizer, schedule, and full model configuration are in Appendix Table [2](https://arxiv.org/html/2606.18524#S8.T2). We report the validation loss on the FineWeb-Edu held-out split, measured at the end of training.

##### Findings\.

(i) *improved performance*: at $N=8$, linear scaling achieves a lower minimum loss than sqrt scaling (a reduction of $0.025$ nats), indicating better trainability at larger loop counts. (ii) *predictable transfer*: the optimal learning rate for linear scaling remains nearly invariant across $N$ (Figure [5](https://arxiv.org/html/2606.18524#S4.F5)b), consistent with our theoretical prediction ($\eta\propto N^{0}$); under sqrt scaling, the optimum shifts with $N$ (Figure [5](https://arxiv.org/html/2606.18524#S4.F5)a), requiring separate tuning for each loop count. Note that Theorem [2](https://arxiv.org/html/2606.18524#Thmtheorem2) requires $\varepsilon N=O(1)$ (Assumption 3); sqrt scaling ($\alpha{=}1/2$) violates this premise, since forward-pass norms still grow with $N$ (Figure [2](https://arxiv.org/html/2606.18524#S1.F2)e), so the theorem does not predict the direction or magnitude of the optimum shift in that regime.

### 5.3 Joint transfer across depth and loop count

##### Protocol\.

We train the $L{=}24$ and $L{=}48$ members of the same model family under the same setup as Section [5.2](https://arxiv.org/html/2606.18524#S5.SS2), sweeping over learning rates and loop counts $N\in\{1,2,4,8\}$. Following the theoretical prediction $\eta\lesssim 1/(\lambda\sqrt{L})$, we scale the learning rate by $m_{L}^{-1/2}=(L/12)^{-1/2}$ relative to the $L{=}12$ baseline (Appendix [8.2](https://arxiv.org/html/2606.18524#S8.SS2)).
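As a worked instance of the depth correction, the sketch below evaluates the multiplier $(L/12)^{-1/2}$ for the three depths used here. The function name `depth_corrected_lr` and the base learning rate `1e-3` are hypothetical placeholders; the tuned base value from the paper is not reproduced.

```python
def depth_corrected_lr(base_lr: float, n_layers: int, ref_layers: int = 12) -> float:
    """Scale a base learning rate by m_L^(-1/2), where m_L = L / L_ref."""
    m_l = n_layers / ref_layers
    return base_lr * m_l ** -0.5


base_lr = 1e-3  # hypothetical base value, for illustration only
for n_layers in (12, 24, 48):
    lr = depth_corrected_lr(base_lr, n_layers)
    print(f"L={n_layers}: multiplier={lr / base_lr:.4f}, lr={lr:.2e}")
```

The loop count $N$ is deliberately absent from the function signature: under the factorized parameterization, only the unique depth $L$ enters the correction.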

##### Findings\.

(i) *loop transfer at larger depth*: at both $L{=}24$ and $L{=}48$, the optimal learning rate remains nearly invariant across $N$ (Figure [5](https://arxiv.org/html/2606.18524#S4.F5)c,d). (ii) *depth transfer*: after applying the depth correction, a single base learning rate is near-optimal across all three depths, confirming that the factorized parameterization decouples the loop and depth axes. For $L{=}48$ with $N{=}8$, base learning rates above $2\times 10^{-3}$ cause divergence, consistent with the tighter stability margin at large effective depth.

### 5.4 Residual scaling beyond initialization

##### Protocol\.

The theoretical analysis characterizes residual-stream behavior at initialization. We test whether two empirical signatures of the mechanism persist after full training (step 20,000) for the $L\in\{12,24,48\}$ models with $N{=}8$. First, we compute the pairwise cosine similarity between all pairs of loop-step increments $\delta_{n}=h_{n}-h_{n-1}$, forming an $N\times N$ correlation matrix (Figure [6](https://arxiv.org/html/2606.18524#S4.F6)). Second, we trace the L2 norm of the residual stream at every loop step from step 0 (post-embedding) through step $N$, for each $N\in\{2,4,8\}$ (Figure [7](https://arxiv.org/html/2606.18524#S4.F7)).
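The first diagnostic can be sketched as follows. This is a minimal pure-Python illustration, not the authors' measurement code: `increment_cosines` is a hypothetical helper that takes a list of residual-stream vectors $h_{0},\ldots,h_{N}$ (one flat vector per loop step) and assumes every increment is non-zero.

```python
import math


def increment_cosines(states):
    """Return the N x N matrix of cosine similarities between the
    loop-step increments delta_n = h_n - h_{n-1}, given h_0..h_N."""
    deltas = [[b - a for a, b in zip(prev, cur)]
              for prev, cur in zip(states, states[1:])]
    norms = [math.sqrt(sum(x * x for x in d)) for d in deltas]  # assumed non-zero
    return [[sum(x * y for x, y in zip(di, dj)) / (ni * nj)
             for dj, nj in zip(deltas, norms)]
            for di, ni in zip(deltas, norms)]


# Toy trajectory: two aligned increments followed by an orthogonal one.
toy_states = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0]]
for row in increment_cosines(toy_states):
    print([round(v, 3) for v in row])
```

In practice the vectors would be flattened hidden states collected at each loop boundary; how tokens and batch elements are aggregated is a choice this sketch leaves open.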

##### Findings\.

(i) *correlation weakens but remains positive*: compared to initialization (Figure [3](https://arxiv.org/html/2606.18524#S3.F3)), the pairwise cosine similarities between loop-step increments decrease after full training, but the correlation matrix remains positive for most loop-step pairs across all three depths, with early and middle iterations concentrated in the 0.3–0.6 range; only the latest steps show near-zero or mildly negative values (Figure [6](https://arxiv.org/html/2606.18524#S4.F6)). A similar early/middle-step positive-correlation pattern holds at $N{=}4$, while pairs involving the final step can again be weak or negative; see Appendix [8.4](https://arxiv.org/html/2606.18524#S8.SS4). (ii) *consistent final norm across loop counts*: although different $N$ values lead to different per-step increments in the residual stream, the final norm after the last loop step converges to roughly the same value within each depth group, with only modest variation across $N$ (Figure [7](https://arxiv.org/html/2606.18524#S4.F7)), indicating that $1/N$ scaling produces similar training dynamics regardless of loop count.

## 6 Conclusion

We showed that weight sharing in looped Transformers requires $1/N$ residual scaling, and derived a parameterization that makes the stable learning rate independent of the loop count. Experiments on language modeling confirm that this recipe transfers hyperparameters across both loop counts and depths without additional tuning. By decoupling the loop count from optimization, our results open the possibility of treating weight reuse as a freely adjustable scaling axis. Natural next steps include scaling up to production-sized models and extending the theory to Post-Norm architectures \[[25](https://arxiv.org/html/2606.18524#bib.bib25)\], where the interaction between normalization placement and weight sharing may yield a different scaling regime.

## Limitations

Our theoretical analysis models the looped block as a shared-weight MLP, abstracting away the multi-head attention mechanism present in Transformers. The formal results further assume ReLU activations, whereas our experiments use SwiGLU; the predicted $\Theta(N^{2})$ growth and $1/N$ scaling nonetheless hold empirically (Figures [3](https://arxiv.org/html/2606.18524#S3.F3), [6](https://arxiv.org/html/2606.18524#S4.F6)). More broadly, the analysis does not account for optimizer state dynamics, normalization variants, or data-dependent feature learning.

On the empirical side, our experiments cover a finite set of model sizes, loop counts, and training budgets\. Validation on larger models, longer training horizons, and heterogeneous multi\-stage looped architectures remains future work\. In addition, each reported language\-modeling result uses a single random seed; multi\-seed replicas of the full configuration sweep were not conducted due to computational constraints\.

## Ethical Considerations

This work is a methodological study of residual scaling for weight\-tied Transformers\. It involves no human subjects and no personal or sensitive data\. All language\-modeling experiments use the publicly released FineWeb\-Edu corpus\[[18](https://arxiv.org/html/2606.18524#bib.bib18)\], which is derived from open web text and inherits the well\-known biases, factual inaccuracies, and content distribution of large\-scale web crawls; any model trained on such data carries those risks\. We do not release new model weights, datasets, or downstream applications\.

The largest model trained in this work has $438$M parameters and is trained for $10$B tokens, which is small relative to current frontier systems but still consumes meaningful compute. We report all hyperparameters and the exact sweep configuration (Appendix [8.3](https://arxiv.org/html/2606.18524#S8.SS3)) so that other researchers can replicate the experiments without redundant tuning.

## References

- Bachlechner et al\. \[2021\]Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Gary Cottrell, and Julian J\. McAuley\.ReZero is all you need: fast convergence at large depth\.In Cassio P\. de Campos, Marloes H\. Maathuis, and Erik Quaeghebeur, editors,*Proceedings of the Thirty\-Seventh Conference on Uncertainty in Artificial Intelligence, UAI 2021, Virtual Event, 27\-30 July 2021*, Proceedings of Machine Learning Research, pages 1352–1361\. AUAI Press, 2021\.URL[https://proceedings\.mlr\.press/v161/bachlechner21a\.html](https://proceedings.mlr.press/v161/bachlechner21a.html)\.
- Bordelon et al\. \[2024\]Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, and Cengiz Pehlevan\.Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit\.In*The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024*\. OpenReview\.net, 2024\.URL[https://openreview\.net/forum?id=KZJehvRKGD](https://openreview.net/forum?id=KZJehvRKGD)\.
- Boucheron et al\. \[2013\]Stéphane Boucheron, Gábor Lugosi, and Pascal Massart\.*Concentration Inequalities \- A Nonasymptotic Theory of Independence*\.Oxford University Press, 2013\.ISBN 978\-0\-19\-953525\-5\.[10\.1093/ACPROF:OSO/9780199535255\.001\.0001](https://arxiv.org/doi.org/10.1093/ACPROF:OSO/9780199535255.001.0001)\.URL[https://doi\.org/10\.1093/acprof:oso/9780199535255\.001\.0001](https://doi.org/10.1093/acprof:oso/9780199535255.001.0001)\.
- Chen et al\. \[2018\]Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud\.Neural ordinary differential equations\.In Samy Bengio, Hanna M\. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa\-Bianchi, and Roman Garnett, editors,*Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3\-8, 2018, Montréal, Canada*, pages 6572–6583, 2018\.URL[https://proceedings\.neurips\.cc/paper/2018/hash/69386f6bb1dfed68692a24c8686939b9\-Abstract\.html](https://proceedings.neurips.cc/paper/2018/hash/69386f6bb1dfed68692a24c8686939b9-Abstract.html)\.
- Dehghani et al\. \[2019\]Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser\.Universal transformers\.In*7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\-9, 2019*\. OpenReview\.net, 2019\.URL[https://openreview\.net/forum?id=HyzdRiR9Y7](https://openreview.net/forum?id=HyzdRiR9Y7)\.
- Dey et al\. \[2025\]Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness\.Don’t be lazy: CompleteP enables compute\-efficient deep transformers\.2025\.[10\.48550/ARXIV\.2505\.01618](https://arxiv.org/doi.org/10.48550/ARXIV.2505.01618)\.URL[https://arxiv\.org/abs/2505\.01618](https://arxiv.org/abs/2505.01618)\.
- Fan et al\. \[2025\]Ying Fan, Yilun Du, Kannan Ramchandran, and Kangwook Lee\.Looped transformers for length generalization\.In*The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025*\. OpenReview\.net, 2025\.URL[https://openreview\.net/forum?id=2edigk8yoU](https://openreview.net/forum?id=2edigk8yoU)\.
- Gatmiry et al\. \[2024\]Khashayar Gatmiry, Nikunj Saunshi, Sashank J\. Reddi, Stefanie Jegelka, and Sanjiv Kumar\.Can looped transformers learn to implement multi\-step gradient descent for in\-context learning?In Ruslan Salakhutdinov, Zico Kolter, Katherine A\. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,*Forty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024*, Proceedings of Machine Learning Research, pages 15130–15152\. PMLR / OpenReview\.net, 2024\.URL[https://proceedings\.mlr\.press/v235/gatmiry24b\.html](https://proceedings.mlr.press/v235/gatmiry24b.html)\.
- Gordon \[1988\]Yehoram Gordon\.On Milman’s inequality and random subspaces which escape through a mesh in $\mathbb{R}^{n}$\.In*Geometric Aspects of Functional Analysis*, volume 1317 of*Lecture Notes in Mathematics*, pages 84–106\. Springer, 1988\.[10\.1007/BFb0081737](https://arxiv.org/doi.org/10.1007/BFb0081737)\.
- Hayou and Yang \[2023\]Soufiane Hayou and Greg Yang\.Width and depth limits commute in residual networks\.In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,*International Conference on Machine Learning, ICML 2023, 23\-29 July 2023, Honolulu, Hawaii, USA*, Proceedings of Machine Learning Research, pages 12700–12723\. PMLR, 2023\.URL[https://proceedings\.mlr\.press/v202/hayou23a\.html](https://proceedings.mlr.press/v202/hayou23a.html)\.
- Lan et al\. \[2020\]Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut\.ALBERT: A lite BERT for self\-supervised learning of language representations\.In*8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26\-30, 2020*\. OpenReview\.net, 2020\.URL[https://openreview\.net/forum?id=H1eA7AEtvS](https://openreview.net/forum?id=H1eA7AEtvS)\.
- Llama Team \[2024\]Llama Team\.The llama 3 herd of models\.*CoRR*, abs/2407\.21783, 2024\.[10\.48550/ARXIV\.2407\.21783](https://arxiv.org/doi.org/10.48550/ARXIV.2407.21783)\.URL[https://doi\.org/10\.48550/arXiv\.2407\.21783](https://doi.org/10.48550/arXiv.2407.21783)\.
- Loshchilov and Hutter \[2019\]Ilya Loshchilov and Frank Hutter\.Decoupled weight decay regularization\.In*7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\-9, 2019*\. OpenReview\.net, 2019\.URL[https://openreview\.net/forum?id=Bkg6RiCqY7](https://openreview.net/forum?id=Bkg6RiCqY7)\.
- Marion \[2023\]Pierre Marion\.Generalization bounds for neural ordinary differential equations and deep residual networks\.In*Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023*, 2023\.URL[http://papers\.nips\.cc/paper\_files/paper/2023/hash/98ed250b203d1ac6b24bbcf263e3d4a7\-Abstract\-Conference\.html](http://papers.nips.cc/paper_files/paper/2023/hash/98ed250b203d1ac6b24bbcf263e3d4a7-Abstract-Conference.html)\.
- Marion et al\. \[2025\]Pierre Marion, Adeline Fermanian, Gérard Biau, and Jean\-Philippe Vert\.Scaling resnets in the large\-depth regime\.*J\. Mach\. Learn\. Res\.*, 26:56:1–56:48, 2025\.URL[https://jmlr\.org/papers/v26/22\-0664\.html](https://jmlr.org/papers/v26/22-0664.html)\.
- Mlodozeniec et al\. \[2025\]Bruno Mlodozeniec, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Ramapuram, and Marco Cuturi\.Completed hyperparameter transfer across modules, width, depth, batch and duration\.2025\.URL[https://arxiv\.org/abs/2512\.22382](https://arxiv.org/abs/2512.22382)\.
- Ng and Wang \[2024\]Kei\-Sing Ng and Qingchen Wang\.Loop neural networks for parameter sharing\.2024\.URL[https://arxiv\.org/abs/2409\.14199](https://arxiv.org/abs/2409.14199)\.
- Penedo et al\. \[2024\]Guilherme Penedo, Hynek Kydlícek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin A\. Raffel, Leandro von Werra, and Thomas Wolf\.The FineWeb datasets: Decanting the web for the finest text data at scale\.In*Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024*, 2024\.URL[http://papers\.nips\.cc/paper\_files/paper/2024/hash/370df50ccfdf8bde18f8f9c2d9151bda\-Abstract\-Datasets\_and\_Benchmarks\_Track\.html](http://papers.nips.cc/paper_files/paper/2024/hash/370df50ccfdf8bde18f8f9c2d9151bda-Abstract-Datasets_and_Benchmarks_Track.html)\.
- Press and Wolf \[2017\]Ofir Press and Lior Wolf\.Using the output embedding to improve language models\.In Mirella Lapata, Phil Blunsom, and Alexander Koller, editors,*Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3\-7, 2017, Volume 2: Short Papers*, pages 157–163\. Association for Computational Linguistics, 2017\.[10\.18653/V1/E17\-2025](https://arxiv.org/doi.org/10.18653/V1/E17-2025)\.URL[https://doi\.org/10\.18653/v1/e17\-2025](https://doi.org/10.18653/v1/e17-2025)\.
- Saunshi et al\. \[2025\]Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J\. Reddi\.Reasoning with latent thoughts: On the power of looped transformers\.In*The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025*\. OpenReview\.net, 2025\.URL[https://openreview\.net/forum?id=din0lGfZFd](https://openreview.net/forum?id=din0lGfZFd)\.
- Touvron et al\. \[2023\]Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie\-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample\.LLaMA: Open and efficient foundation language models\.*CoRR*, abs/2302\.13971, 2023\.[10\.48550/ARXIV\.2302\.13971](https://arxiv.org/doi.org/10.48550/ARXIV.2302.13971)\.URL[https://doi\.org/10\.48550/arXiv\.2302\.13971](https://doi.org/10.48550/arXiv.2302.13971)\.
- Vaswani et al\. \[2017\]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N\. Gomez, Lukasz Kaiser, and Illia Polosukhin\.Attention is all you need\.In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M\. Wallach, Rob Fergus, S\. V\. N\. Vishwanathan, and Roman Garnett, editors,*Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4\-9, 2017, Long Beach, CA, USA*, pages 5998–6008, 2017\.URL[https://proceedings\.neurips\.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa\-Abstract\.html](https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html)\.
- Wang et al\. \[2024\]Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei\.DeepNet: Scaling transformers to 1,000 layers\.*IEEE Trans\. Pattern Anal\. Mach\. Intell\.*, 46\(10\):6761–6774, 2024\.[10\.1109/TPAMI\.2024\.3386927](https://arxiv.org/doi.org/10.1109/TPAMI.2024.3386927)\.URL[https://doi\.org/10\.1109/TPAMI\.2024\.3386927](https://doi.org/10.1109/TPAMI.2024.3386927)\.
- Wen et al\. \[2024\]Kaiyue Wen, Zhiyuan Li, Jason S\. Wang, David Hall, Percy Liang, and Tengyu Ma\.Understanding warmup\-stable\-decay learning rates: A river valley loss landscape perspective\.*CoRR*, abs/2410\.05192, 2024\.[10\.48550/ARXIV\.2410\.05192](https://arxiv.org/doi.org/10.48550/ARXIV.2410.05192)\.URL[https://doi\.org/10\.48550/arXiv\.2410\.05192](https://doi.org/10.48550/arXiv.2410.05192)\.
- Xiong et al\. \[2020\]Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie\-Yan Liu\.On layer normalization in the transformer architecture\.In*Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13\-18 July 2020, Virtual Event*, Proceedings of Machine Learning Research, pages 10524–10533\. PMLR, 2020\.URL[http://proceedings\.mlr\.press/v119/xiong20b\.html](http://proceedings.mlr.press/v119/xiong20b.html)\.
- Yang et al\. \[2022\]Greg Yang, Edward J\. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao\.Tensor programs v: Tuning large neural networks via zero\-shot hyperparameter transfer\.2022\.[10\.48550/ARXIV\.2203\.03466](https://arxiv.org/doi.org/10.48550/ARXIV.2203.03466)\.URL[https://arxiv\.org/abs/2203\.03466](https://arxiv.org/abs/2203.03466)\.
- Yang et al\. \[2024\]Liu Yang, Kangwook Lee, Robert D\. Nowak, and Dimitris Papailiopoulos\.Looped transformers are better at learning learning algorithms\.In*The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024*\. OpenReview\.net, 2024\.URL[https://openreview\.net/forum?id=HHbRxoDTxE](https://openreview.net/forum?id=HHbRxoDTxE)\.
- Zhang and Sennrich \[2019\]Biao Zhang and Rico Sennrich\.Root mean square layer normalization\.In Hanna M\. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché\-Buc, Emily B\. Fox, and Roman Garnett, editors,*Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\-14, 2019, Vancouver, BC, Canada*, pages 12360–12371, 2019\.URL[https://proceedings\.neurips\.cc/paper/2019/hash/1e8a19426224ca89e83cef47f1e7f53b\-Abstract\.html](https://proceedings.neurips.cc/paper/2019/hash/1e8a19426224ca89e83cef47f1e7f53b-Abstract.html)\.
- Zhang et al\. \[2019\]Hongyi Zhang, Yann N\. Dauphin, and Tengyu Ma\.Fixup initialization: Residual learning without normalization\.In*7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\-9, 2019*\. OpenReview\.net, 2019\.URL[https://openreview\.net/forum?id=H1gsz30cKX](https://openreview.net/forum?id=H1gsz30cKX)\.
- Zhu et al\. \[2025\]Rui\-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang, Yoshua Bengio, and Jason Eshraghian\.Scaling latent reasoning via looped language models\.2025\.[10\.48550/ARXIV\.2510\.25741](https://arxiv.org/doi.org/10.48550/ARXIV.2510.25741)\.URL[https://arxiv\.org/abs/2510\.25741](https://arxiv.org/abs/2510.25741)\.


## 7 Proofs of Theoretical Results

### 7.1 Deep (non-shared) residual stacks

For comparison, consider

$$h_{\ell+1}=h_{\ell}+\varepsilon W_{\ell}\phi(h_{\ell}),\qquad W_{\ell}\ \text{independent across }\ell.$$

Let $R_{\ell}=d^{-1/2}\lVert h_{\ell}\rVert_{2}$. Expanding one step gives

$$R_{\ell+1}^{2}=R_{\ell}^{2}+\frac{2\varepsilon}{d}\left\langle h_{\ell},W_{\ell}\phi(h_{\ell})\right\rangle+\frac{\varepsilon^{2}}{d}\left\lVert W_{\ell}\phi(h_{\ell})\right\rVert_{2}^{2}.$$

Because $h_{\ell}$ depends on $W_{<\ell}$ but not on $W_{\ell}$, the cross term has zero conditional mean given the past. To make the scale of the last term explicit, take the standard mean-field parameterization $W_{\ell}=(\sigma_{W}/\sqrt{d})G_{\ell}$, with the entries of $G_{\ell}$ independent standard Gaussians. Conditional on $h_{\ell}$,

$$\mathbb{E}\left[\frac{1}{d}\left\lVert W_{\ell}\phi(h_{\ell})\right\rVert_{2}^{2}\,\middle|\,h_{\ell}\right]=\sigma_{W}^{2}\frac{1}{d}\left\lVert\phi(h_{\ell})\right\rVert_{2}^{2}.$$

Under the usual ReLU mean-field scaling, $d^{-1}\lVert\phi(h_{\ell})\rVert_{2}^{2}=\Theta(R_{\ell}^{2})$, so the last term contributes $\Theta(\varepsilon^{2}R_{\ell}^{2})$ and

$$\mathbb{E}[R_{\ell+1}^{2}]\approx(1+c\varepsilon^{2})\,\mathbb{E}[R_{\ell}^{2}],\qquad c=\Theta(1).$$

Iterating this recursion:

$$\mathbb{E}[R_{L}^{2}]\approx(1+c\varepsilon^{2})^{L}R_{0}^{2}\approx\exp(cL\varepsilon^{2})R_{0}^{2}.$$

With $\varepsilon=L^{-\alpha}$:

$$\mathbb{E}[R_{L}^{2}]\approx\exp(cL^{1-2\alpha})R_{0}^{2}.$$

Hence a bounded residual-stream norm for large $L$ requires $\alpha\geq\tfrac{1}{2}$.
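The threshold $\alpha=\tfrac{1}{2}$ can be seen numerically by evaluating the growth factor $(1+c\varepsilon^{2})^{L}$ from the recursion above. The sketch below assumes $c=1$ purely for illustration (the derivation only gives $c=\Theta(1)$), and `norm_growth` is a hypothetical helper name.

```python
def norm_growth(depth: int, alpha: float, c: float = 1.0) -> float:
    """Growth factor (1 + c * eps^2)^L of E[R_L^2] / R_0^2 with eps = L^(-alpha)."""
    eps = depth ** (-alpha)
    return (1.0 + c * eps ** 2) ** depth


for alpha in (0.25, 0.5, 1.0):
    factors = [norm_growth(depth, alpha) for depth in (16, 256, 4096)]
    print(f"alpha={alpha}: " + ", ".join(f"{f:.4g}" for f in factors))
```

With $\alpha<\tfrac{1}{2}$ the factor grows without bound as the depth increases, at $\alpha=\tfrac{1}{2}$ it stays below $e^{c}$, and for larger $\alpha$ it approaches one.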

### 7.2 Proof of looped residual-stream norm scaling

For $x\in\mathbb{R}^{d}$, write

$$q(x):=\frac{1}{d}\left\lVert x\right\rVert_{2}^{2},\qquad m(x):=\frac{1}{d}\mathbf{1}^{\top}x\quad\text{when }x\in\mathcal{C}_{+}.$$

The key point is that ReLU activations are nonnegative, so loop reuse creates an exact factorization through a vector in the positive cone.

###### Lemma 4 (Positive-cone mass).

Let $u_{0},\ldots,u_{N-1}\in\mathcal{C}_{+}$ and $U_{N}=\sum_{n=0}^{N-1}u_{n}$. If

$$\frac{1}{N}\sum_{n=0}^{N-1}m(u_{n})\geq m_{-}>0,\qquad\frac{1}{N}\sum_{n=0}^{N-1}q(u_{n})\leq q_{+}<\infty,$$

then

$$m_{-}N\leq\frac{1}{\sqrt{d}}\left\lVert U_{N}\right\rVert_{2}\leq\sqrt{q_{+}}\,N.\tag{3}$$

###### Proof\.

For the lower bound, since $U_{N}\in\mathcal{C}_{+}$,

$$\left\lVert U_{N}\right\rVert_{2}^{2}\geq\frac{1}{d}(\mathbf{1}^{\top}U_{N})^{2}.$$

Dividing by $d$ gives $d^{-1}\lVert U_{N}\rVert_{2}^{2}\geq\bigl(\sum_{n}m(u_{n})\bigr)^{2}\geq m_{-}^{2}N^{2}$. For the upper bound,

$$\frac{1}{\sqrt{d}}\left\lVert U_{N}\right\rVert_{2}\leq\sum_{n=0}^{N-1}\sqrt{q(u_{n})}\leq\sqrt{N\sum_{n=0}^{N-1}q(u_{n})}\leq\sqrt{q_{+}}\,N.$$

∎
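Both inequalities in (3) are deterministic, so they can be checked on any collection of nonnegative vectors. The sketch below uses ReLU-rectified Gaussian samples as stand-ins for the activations $u_{n}$ and takes $m_{-}$ and $q_{+}$ to be the exact averages. The helper name `cone_mass_bounds`, the sizes, and the seed are arbitrary illustrative choices.

```python
import math
import random


def cone_mass_bounds(us):
    """Return (m_- * N, d^{-1/2} * ||U_N||_2, sqrt(q_+) * N) for nonnegative
    vectors us, with m_- and q_+ set to the exact averages of m and q."""
    n, d = len(us), len(us[0])
    m_minus = sum(sum(u) / d for u in us) / n                   # mean of m(u_n)
    q_plus = sum(sum(x * x for x in u) / d for u in us) / n     # mean of q(u_n)
    total = [sum(col) for col in zip(*us)]                      # U_N = sum_n u_n
    scaled_norm = math.sqrt(sum(x * x for x in total) / d)
    return m_minus * n, scaled_norm, math.sqrt(q_plus) * n


rng = random.Random(0)
us = [[max(0.0, rng.gauss(0.0, 1.0)) for _ in range(32)] for _ in range(16)]
lower, value, upper = cone_mass_bounds(us)
print(f"{lower:.3f} <= {value:.3f} <= {upper:.3f}")
```

Because all entries are nonnegative, the lower bound is strictly positive, so the scaled norm of the sum grows linearly in $N$ rather than as $\sqrt{N}$.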

###### Lemma 5 (Gaussian gain on the positive cone).

Let $G\in\mathbb{R}^{d\times d}$ have independent $\mathcal{N}(0,1)$ entries and set $W=(\sigma_{W}/\sqrt{d})G$. For any $0<\gamma<1-1/\sqrt{2}$, define $a_{\gamma}=1-1/\sqrt{2}-\gamma>0$. With probability at least

$$1-\exp(-\gamma^{2}d/2)-2\exp(-d/2),$$

the following holds for every $x\in\mathcal{C}_{+}$:

$$a_{\gamma}\sigma_{W}\left\lVert x\right\rVert_{2}\leq\left\lVert Wx\right\rVert_{2}\leq 3\sigma_{W}\left\lVert x\right\rVert_{2}.\tag{4}$$

###### Proof\.

The upper bound follows from the standard Gaussian operator-norm estimate $\mathbb{P}\{\lVert G\rVert_{\mathrm{op}}>3\sqrt{d}\}\leq 2e^{-d/2}$. For the lower bound, Gordon’s escape-through-a-mesh theorem \[[9](https://arxiv.org/html/2606.18524#bib.bib9)\] gives, with probability at least $1-e^{-t^{2}/2}$,

$$\inf_{x\in\mathcal{C}_{+}\cap S^{d-1}}\left\lVert Gx\right\rVert_{2}\geq\sqrt{d}-w(\mathcal{C}_{+}\cap S^{d-1})-t,$$

where $w(\cdot)$ is Gaussian width. For the positive cone,

$$w(\mathcal{C}_{+}\cap S^{d-1})=\mathbb{E}\left\lVert g_{+}\right\rVert_{2}\leq\sqrt{\mathbb{E}\left\lVert g_{+}\right\rVert_{2}^{2}}=\sqrt{d/2}.$$

Taking $t=\gamma\sqrt{d}$ and multiplying by $\sigma_{W}/\sqrt{d}$ gives the lower bound. ∎
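The width bound used in this proof, $\mathbb{E}\lVert g_{+}\rVert_{2}\leq\sqrt{d/2}$, can be sanity-checked by Monte Carlo. The sketch below is illustrative only: `mean_positive_part_norm`, the dimension $d=16$, the sample count, and the seed are arbitrary assumptions, and the estimate carries sampling error.

```python
import math
import random


def mean_positive_part_norm(d: int, n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E||g_+||_2 for g ~ N(0, I_d), g_+ = max(g, 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += math.sqrt(
            sum(max(0.0, rng.gauss(0.0, 1.0)) ** 2 for _ in range(d))
        )
    return total / n_samples


d = 16
estimate = mean_positive_part_norm(d, n_samples=5000)
print(f"E||g_+||_2 ~ {estimate:.3f}, sqrt(d/2) = {math.sqrt(d / 2):.3f}")
```

The estimate should land slightly below $\sqrt{d/2}$, reflecting the Jensen step in the displayed chain. Since the width is roughly $\sqrt{d}/\sqrt{2}$, the gap $\sqrt{d}-w$ remains a constant fraction of $\sqrt{d}$, which is what makes $a_{\gamma}$ positive.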

#### 7.2.1 Proof of Theorem [1](https://arxiv.org/html/2606.18524#Thmtheorem1)

###### Proof\.

Let $u_{n}=\phi(h_{n})$, $U_{N}=\sum_{n=0}^{N-1}u_{n}$, and $S_{N}=\sum_{n=0}^{N-1}r_{n}$. Since $W$ is tied across loop steps,

$$S_{N}=\sum_{n=0}^{N-1}Wu_{n}=WU_{N}.\tag{5}$$

By Lemma [4](https://arxiv.org/html/2606.18524#Thmtheorem4), $d^{-1/2}\lVert U_{N}\rVert_{2}=\Theta(N)$. Applying the cone-gain condition to $U_{N}\in\mathcal{C}_{+}$ gives

$$\frac{1}{d}\left\lVert S_{N}\right\rVert_{2}^{2}=\frac{1}{d}\left\lVert WU_{N}\right\rVert_{2}^{2}=\Theta(N^{2}).$$

Since $h_{N}=h_{0}+\varepsilon S_{N}$ and $R_{N}=d^{-1/2}\lVert h_{N}\rVert_{2}$,

$$\left(\varepsilon cN-R_{0}\right)_{+}\leq R_{N}\leq R_{0}+C\varepsilon N$$

for constants $c,C>0$ independent of $N$. Thus $R_{N}=O(1)$ when $R_{0}=O(1)$ and $\varepsilon N=O(1)$, while $R_{N}$ diverges if $R_{0}=O(1)$ and $\varepsilon N\to\infty$. For $\varepsilon=N^{-\alpha}$, a bounded residual-stream norm therefore requires $\alpha\geq 1$.

The same argument also explains the empirical positive\-overlap pattern:

1N2∑n,m=0N−11d⟨un,um⟩\\displaystyle\\frac\{1\}\{N^\{2\}\}\\sum\_\{n,m=0\}^\{N\-1\}\\frac\{1\}\{d\}\\left\\langle u\_\{n\},u\_\{m\}\\right\\rangle=1N21d‖UN‖22\\displaystyle=\\frac\{1\}\{N^\{2\}\}\\frac\{1\}\{d\}\\left\\lVert U\_\{N\}\\right\\rVert\_\{2\}^\{2\}=Θ\(1\),\\displaystyle=\\Theta\(1\),so positive average overlap is a consequence of the positive\-cone trajectory condition rather than a separate primitive assumption\. ∎
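
The two algebraic identities used in this proof can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes $\phi$ is ReLU and $W$ is a Gaussian matrix with entry variance $\sigma_W^2/d$, and the toy sizes, the seed, and the helper names (`tied_loop`, `matvec`, `dot`) are illustrative choices. It verifies Equation (5), the reconstruction $h_N=h_0+\varepsilon S_N$, and the overlap identity; it does not test the $\Theta(N)$ growth claim, which depends on the lemma's conditions.

```python
import math
import random


def relu(v):
    """Elementwise ReLU on a plain Python list."""
    return [max(x, 0.0) for x in v]


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tied_loop(W, h0, eps, n_loops):
    """Run h <- h + eps * W relu(h) with the same W at every step.

    Returns the final state h_N and the list of activations u_0..u_{N-1}.
    """
    h = list(h0)
    activations = []
    for _ in range(n_loops):
        u = relu(h)
        activations.append(u)
        h = [a + eps * b for a, b in zip(h, matvec(W, u))]
    return h, activations


rng = random.Random(0)
d, n_loops = 16, 32   # toy sizes, chosen only for illustration
sigma_w = 1.0         # assumed weight scale
W = [[rng.gauss(0.0, sigma_w / math.sqrt(d)) for _ in range(d)] for _ in range(d)]
h0 = [rng.gauss(0.0, 1.0) for _ in range(d)]
eps = 1.0 / n_loops

h_N, us = tied_loop(W, h0, eps, n_loops)
U_N = [sum(col) for col in zip(*us)]                          # loop-summed activation
S_N = [sum(col) for col in zip(*[matvec(W, u) for u in us])]  # sum of branch outputs

# Equation (5): the tied weight factors out of the sum, S_N = W U_N.
tied_gap = max(abs(s - t) for s, t in zip(S_N, matvec(W, U_N)))
# Reconstruction of the final state: h_N = h_0 + eps * S_N.
state_gap = max(abs(a - (b + eps * s)) for a, b, s in zip(h_N, h0, S_N))
# Average pairwise overlap equals ||U_N||^2 / (N^2 d).
avg_overlap = sum(dot(u, v) for u in us for v in us) / (n_loops**2 * d)
norm_form = dot(U_N, U_N) / (n_loops**2 * d)

print(f"tied_gap={tied_gap:.2e} state_gap={state_gap:.2e}")
print(f"avg_overlap={avg_overlap:.6f} norm_form={norm_form:.6f}")
```

Because ReLU outputs are entrywise nonnegative, every pairwise inner product in the overlap sum is nonnegative, which is the mechanism behind the positive average overlap.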

### 7.3 Proof of loop-wise learning-rate scaling

Fix one update $t\to t+1$ and write $W^{+}=W+\Delta W$. Let $h_{n}^{+}$ denote the post-update trajectory initialized from the same $h_{0}$, and let $\Delta h_{n}=h_{n}^{+}-h_{n}$. Subtracting the two tied recursions gives

$$\Delta h_{n+1}=\Delta h_{n}+\varepsilon\Delta W\phi(h_{n}^{+})+\varepsilon W\bigl(\phi(h_{n}^{+})-\phi(h_{n})\bigr). \tag{6}$$

#### 7.3.1 Sufficient upper bound

Assume $\lVert W\rVert_{\mathrm{op}}\leq C_{W}$, $\lVert\Delta W\rVert_{\mathrm{op}}\leq C_{\Delta}\eta$, $\max_{n<N}d^{-1/2}\lVert\phi(h_{n}^{+})\rVert_{2}\leq Q_{+}$, and $\varepsilon C_{W}N\leq M$. Since ReLU is $1$-Lipschitz, ([6](https://arxiv.org/html/2606.18524#S7.E6)) implies, with $a_{n}=\lVert\Delta h_{n}\rVert_{2}$,

$$a_{n+1}\leq(1+\varepsilon C_{W})a_{n}+\varepsilon C_{\Delta}\eta Q_{+}\sqrt{d}.$$

Iterating from $a_{0}=0$ yields

$$\frac{1}{\sqrt{d}}\lVert\Delta h_{N}\rVert_{2}\leq C_{\Delta}Q_{+}e^{M}\eta\varepsilon N.$$

Therefore $\eta\varepsilon N=O(1)$ is sufficient for an $O(1)$ width-normalized one-step output perturbation.
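
The iteration step can be made concrete with a short numerical check. This is a sketch under stated assumptions: the constants `c_w` and `forcing` are illustrative placeholders (the latter standing in for $C_{\Delta}\eta Q_{+}\sqrt{d}$), and $M$ is taken to be exactly $\varepsilon C_{W}N$. It iterates the recursion with equality, which is the largest value the inequality allows, and compares the result against the closed-form bound, using $\sum_{k<N}(1+\varepsilon C_{W})^{k}\leq N e^{\varepsilon C_{W}N}$.

```python
import math


def iterate_recursion(eps, c_w, forcing, n_loops):
    """Iterate a_{n+1} = (1 + eps*c_w) * a_n + eps*forcing from a_0 = 0.

    This runs the recursion with equality, i.e. the worst case it permits.
    """
    a = 0.0
    for _ in range(n_loops):
        a = (1.0 + eps * c_w) * a + eps * forcing
    return a


def closed_form_bound(eps, c_w, forcing, n_loops):
    """forcing * exp(M) * eps * N with M = eps * c_w * N."""
    return forcing * math.exp(eps * c_w * n_loops) * eps * n_loops


c_w = 2.0       # illustrative operator-norm bound C_W
forcing = 0.5   # illustrative stand-in for C_Delta * eta * Q_+ * sqrt(d)

rows = []
for n_loops in (1, 4, 16, 64):
    for alpha in (0.0, 0.5, 1.0):
        eps = n_loops ** -alpha   # eps = N^(-alpha)
        a_n = iterate_recursion(eps, c_w, forcing, n_loops)
        bound = closed_form_bound(eps, c_w, forcing, n_loops)
        rows.append((n_loops, alpha, a_n, bound))
        print(f"N={n_loops:3d} alpha={alpha:.1f} a_N={a_n:.3e} bound={bound:.3e}")
```

Only for $\alpha=1$ does $M=\varepsilon C_{W}N$ stay constant in $N$, so only then does the factor $e^{M}$ remain an $N$-independent constant, as the assumption $\varepsilon C_{W}N\leq M$ requires.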

#### 7.3.2 Trajectory-direction sharpness

The sufficient bound becomes sharp only when the actual optimizer update has nondegenerate gain on the loop-summed trajectory direction. Empirically, this $\eta\,\varepsilon\,N$ scaling persists across the first 10 optimizer steps for $\varepsilon\in\{1,1/\sqrt{N},1/N\}$; see Appendix [8.5](https://arxiv.org/html/2606.18524#S8.SS5) for the multi-step verification.

Formally, let

$$U_{N}=\sum_{n=0}^{N-1}\phi(h_{n}),\qquad\zeta_{N}:=\frac{\lVert\Delta WU_{N}\rVert_{2}}{\eta\lVert U_{N}\rVert_{2}},\qquad U_{N}\neq 0,$$

and decompose

$$\Delta h_{N}=\varepsilon\Delta WU_{N}+E_{N}^{\mathrm{nl}}.$$

If the activation-mass condition of Lemma [4](https://arxiv.org/html/2606.18524#Thmtheorem4) holds and

$$\lVert E_{N}^{\mathrm{nl}}\rVert_{2}\leq\theta\lVert\varepsilon\Delta WU_{N}\rVert_{2}\qquad\text{for some }\theta<1,$$

then

$$\frac{1}{d}\lVert\Delta h_{N}\rVert_{2}^{2}=\Theta\!\left(\zeta_{N}^{2}(\eta\varepsilon N)^{2}\right).$$

In particular, if $\zeta_{N}=\Theta(1)$ over the update range of interest, the stable scale is sharp and $\varepsilon=N^{-\alpha}$ gives $\eta\propto N^{\alpha-1}$.
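
The rule $\eta\propto N^{\alpha-1}$ is simply the condition that $\eta\varepsilon N$ stays fixed when $\varepsilon=N^{-\alpha}$. A small sketch, assuming the sharp regime $\zeta_{N}=\Theta(1)$; the function name `lr_for_loop_count` and the reference learning rate are hypothetical and not taken from the paper:

```python
def lr_for_loop_count(eta_ref, n_loops, alpha, n_ref=1):
    """Learning rate that keeps eta * eps * N fixed when eps = N ** (-alpha)."""
    return eta_ref * (n_loops / n_ref) ** (alpha - 1.0)


eta_ref = 1e-3  # illustrative reference learning rate at N = n_ref
for alpha in (0.0, 0.5, 1.0):
    for n_loops in (1, 4, 16, 64):
        eta = lr_for_loop_count(eta_ref, n_loops, alpha)
        product = eta * n_loops ** (-alpha) * n_loops  # eta * eps * N
        print(f"alpha={alpha:.1f} N={n_loops:3d} eta={eta:.3e} eta*eps*N={product:.3e}")
```

With $\alpha=1$ the exponent vanishes and the learning rate does not depend on $N$, which is the transfer property the paper emphasizes; with $\alpha=1/2$ the learning rate has to shrink like $N^{-1/2}$.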

### 7.4 Multi-layer block scaling proof (Theorem [3](https://arxiv.org/html/2606.18524#Thmtheorem3))

Recall the recursion from Equation ([2](https://arxiv.org/html/2606.18524#S4.E2)):

$$h_{n,\ell+1}=h_{n,\ell}+\varepsilon W_{\ell}\phi(h_{n,\ell}),\qquad\ell=0,\dots,L-1,\qquad n=0,\dots,N-1,$$

with $h_{n+1,0}=h_{n,L}$ and $h_{0,0}=h_{0}$. The final output is

$$h_{\text{out}}=h_{0}+\varepsilon\sum_{\ell=0}^{L-1}\left[W_{\ell}\sum_{n=0}^{N-1}\phi(h_{n,\ell})\right].$$

Define the normalized loop-averaged activation and branch output

$$\bar{U}_{\ell}:=\frac{1}{N}\sum_{n=0}^{N-1}\phi(h_{n,\ell}),\qquad Y_{\ell}:=W_{\ell}\bar{U}_{\ell},\qquad G_{\ell}:=NY_{\ell}.$$

Let $\beta:=\varepsilon N$.
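
The recursion and the definitions of $\bar{U}_{\ell}$, $Y_{\ell}$, and $G_{\ell}$ can be sketched in a few lines. The code below is illustrative only: it assumes $\phi$ is ReLU, draws Gaussian weights with entry variance $1/d$, and uses toy sizes; the helper `looped_block` is a hypothetical name. It checks that the layer-by-layer recursion agrees with the closed form $h_{\text{out}}=h_{0}+\varepsilon\sum_{\ell}G_{\ell}$.

```python
import math
import random


def relu(v):
    return [max(x, 0.0) for x in v]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def looped_block(weights, h0, eps, n_loops):
    """Apply the L unique layers in order, n_loops times, with tied weights.

    Returns h_out and, per unique layer, the sum over loop steps of relu(h_{n,l}).
    """
    h = list(h0)
    act_sums = [[0.0] * len(h0) for _ in weights]
    for _ in range(n_loops):
        for layer, W in enumerate(weights):
            u = relu(h)
            act_sums[layer] = [a + b for a, b in zip(act_sums[layer], u)]
            h = [a + eps * b for a, b in zip(h, matvec(W, u))]
    return h, act_sums


rng = random.Random(0)
d, n_layers, n_loops, lam = 16, 3, 8, 1.0   # toy sizes
weights = [[[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for _ in range(d)]
           for _ in range(n_layers)]
h0 = [rng.gauss(0.0, 1.0) for _ in range(d)]
eps = lam / (n_loops * math.sqrt(n_layers))   # factored parameterization
beta = eps * n_loops

h_out, act_sums = looped_block(weights, h0, eps, n_loops)

# U_bar_l = (1/N) sum_n relu(h_{n,l}),  Y_l = W_l U_bar_l,  G_l = N Y_l.
U_bar = [[a / n_loops for a in s] for s in act_sums]
Y = [matvec(W, u) for W, u in zip(weights, U_bar)]
G = [[n_loops * y for y in y_l] for y_l in Y]

# Closed form: h_out = h_0 + eps * sum_l G_l.
closed_form = [h + eps * sum(g[i] for g in G) for i, h in enumerate(h0)]
gap = max(abs(a - b) for a, b in zip(h_out, closed_form))
print(f"beta={beta:.4f} max |h_out - closed form| = {gap:.2e}")
```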

##### Assumptions\.

First, assume normalized branch squared norms are nondegenerate: for constants $0<a_{-}\leq a_{+}<\infty$ independent of $N,L,d$,

$$a_{-}\leq\mathbb{E}\left[\frac{1}{d}\lVert Y_{\ell}\rVert_{2}^{2}\right]\leq a_{+}\qquad\text{for all }\ell. \tag{7}$$

This is the $N^{-2}$-normalized analogue of the single-layer loop bound.

Second, assume single-layer replacement stability. Let

$$\mathcal{F}_{-\ell}:=\sigma(W_{s}:s\neq\ell),\qquad B_{\ell}:=\mathbb{E}[Y_{\ell}\mid\mathcal{F}_{-\ell}],\qquad M_{\ell}:=Y_{\ell}-B_{\ell}.$$

For $s\neq\ell$, let $Y_{\ell}^{(s)}$ denote the same normalized branch output after replacing $W_{s}$ by an independent copy and recomputing the full trajectory. We assume that for constants $C_{0},C_{1}$ independent of $N,L,d$,

$$\mathbb{E}\left[\frac{1}{d}\lVert B_{\ell}\rVert_{2}^{2}\right]\leq C_{0}\beta^{2}, \tag{8}$$

$$\frac{1}{2}\mathbb{E}\left[\frac{1}{d}\lVert Y_{\ell}-Y_{\ell}^{(s)}\rVert_{2}^{2}\right]\leq C_{1}\beta^{2}. \tag{9}$$

This condition says that one unique layer can affect a normalized branch output only through its total residual strength $\varepsilon N=\beta$. It is a local influence condition, not a direct assumption that cross-layer inner products are small. Together, these hypotheses rule out a strong layer-wise coupling mode in which the shared residual stream forces all normalized branch outputs to move coherently, which would reintroduce an $L^{2}$ contribution after the within-layer $N^{2}$ growth has been normalized.

###### Theorem 6 (Multi-layer block scaling; formal restatement of Theorem [3](https://arxiv.org/html/2606.18524#Thmtheorem3)).

Under assumptions ([7](https://arxiv.org/html/2606.18524#S7.E7))–([9](https://arxiv.org/html/2606.18524#S7.E9)):

$$\mathbb{E}\left[\frac{1}{d}\left\|\sum_{\ell=0}^{L-1}G_{\ell}\right\|_{2}^{2}\right]\leq CN^{2}\left(L+\beta^{2}L^{2}\right),\qquad\beta=\varepsilon N.$$

The factorized parameterization $\varepsilon=\lambda/(N\sqrt{L})$ yields

$$\varepsilon^{2}\,\mathbb{E}\left[d^{-1}\left\|\sum_{\ell}G_{\ell}\right\|_{2}^{2}\right]=O(\lambda^{2}+\lambda^{4}),$$

with constants independent of $N$ and $L$. With the matching nondegeneracy condition and sufficiently small fixed $\lambda$, the unscaled branch squared norm is $\Theta(LN^{2})$.
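
The cancellation of $N$ and $L$ in the scaled bound is pure algebra and can be checked directly. A minimal sketch, with the constant $C$ set to an arbitrary illustrative value of 1 and the function name `scaled_upper_bound` chosen here for illustration:

```python
import math


def scaled_upper_bound(lam, n_loops, n_layers, c=1.0):
    """eps^2 * C * N^2 * (L + beta^2 * L^2) under eps = lam / (N * sqrt(L))."""
    eps = lam / (n_loops * math.sqrt(n_layers))
    beta = eps * n_loops  # equals lam / sqrt(L)
    return eps**2 * c * n_loops**2 * (n_layers + beta**2 * n_layers**2)


lam = 0.5
for n_loops in (1, 4, 32):
    for n_layers in (1, 12, 48):
        value = scaled_upper_bound(lam, n_loops, n_layers)
        print(f"N={n_loops:2d} L={n_layers:2d} bound={value:.6f} "
              f"lam^2+lam^4={lam**2 + lam**4:.6f}")
```

Every row prints the same value, $C(\lambda^{2}+\lambda^{4})$, regardless of the loop count and the number of unique layers.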

###### Proof\.

Upper bound. We prove

$$\mathbb{E}\left[\frac{1}{d}\left\|\sum_{\ell=0}^{L-1}Y_{\ell}\right\|_{2}^{2}\right]\leq C\left(L+\beta^{2}L^{2}\right). \tag{10}$$

Multiplying by $N^{2}$ gives the theorem's bound for $\sum_{\ell}G_{\ell}$.

The drift part is controlled directly by ([8](https://arxiv.org/html/2606.18524#S7.E8)), using the triangle inequality in $L^{2}$:

$$\mathbb{E}\left[\frac{1}{d}\left\|\sum_{\ell=0}^{L-1}B_{\ell}\right\|_{2}^{2}\right]\leq\left(\sum_{\ell=0}^{L-1}\sqrt{\mathbb{E}\left[d^{-1}\lVert B_{\ell}\rVert_{2}^{2}\right]}\right)^{2}\leq C_{0}\beta^{2}L^{2}.$$

It remains to bound the centered part $\sum_{\ell}M_{\ell}$. Use the Hoeffding–ANOVA decomposition with respect to the independent weights $W_{0},\dots,W_{L-1}$ \[[3](https://arxiv.org/html/2606.18524#bib.bib3)\]:

$$M_{\ell}=\sum_{A\subseteq[L]}(M_{\ell})_{A},$$

where different index sets are orthogonal in $L^{2}$. Since $\mathbb{E}[M_{\ell}\mid\mathcal{F}_{-\ell}]=0$, every nonzero component of $M_{\ell}$ contains index $\ell$. Therefore, for $\ell\neq s$, only components containing both $\ell$ and $s$ contribute to $\mathbb{E}\langle M_{\ell},M_{s}\rangle$. Define

$$V_{\ell,s}:=\sum_{A\ni s}\mathbb{E}\left[\frac{1}{d}\lVert(M_{\ell})_{A}\rVert_{2}^{2}\right].$$

By Cauchy–Schwarz and the Efron–Stein identity,

$$\left|\mathbb{E}\left[\frac{1}{d}\langle M_{\ell},M_{s}\rangle\right]\right|\leq V_{\ell,s}^{1/2}V_{s,\ell}^{1/2}.$$

Replacing one coordinate controls the ANOVA mass of all components containing that coordinate. Since conditional expectation is a contraction in $L^{2}$, ([9](https://arxiv.org/html/2606.18524#S7.E9)) also gives, up to a universal constant,

$$V_{\ell,s}\leq C\beta^{2}\qquad(s\neq\ell).$$

Thus each centered cross term is $O(\beta^{2})$:

$$\left|\mathbb{E}\left[\frac{1}{d}\langle M_{\ell},M_{s}\rangle\right]\right|\leq C\beta^{2}\qquad(\ell\neq s). \tag{11}$$

The diagonal terms obey

$$\mathbb{E}\left[d^{-1}\lVert M_{\ell}\rVert_{2}^{2}\right]\leq 2\,\mathbb{E}\left[d^{-1}\lVert Y_{\ell}\rVert_{2}^{2}\right]+2\,\mathbb{E}\left[d^{-1}\lVert B_{\ell}\rVert_{2}^{2}\right]\leq C.$$

Consequently, summing the $L$ diagonal terms and the $L(L-1)$ off-diagonal cross terms,

$$\mathbb{E}\left[\frac{1}{d}\left\|\sum_{\ell=0}^{L-1}M_{\ell}\right\|_{2}^{2}\right]\leq C\left(L+\beta^{2}L^{2}\right).$$

Combining the centered and drift bounds with $\lVert\sum_{\ell}Y_{\ell}\rVert_{2}^{2}\leq 2\lVert\sum_{\ell}M_{\ell}\rVert_{2}^{2}+2\lVert\sum_{\ell}B_{\ell}\rVert_{2}^{2}$ proves ([10](https://arxiv.org/html/2606.18524#S7.E10)).

Consequences for factorized scaling. Since $G_{\ell}=NY_{\ell}$,

$$\mathbb{E}\left[\frac{1}{d}\left\|\sum_{\ell=0}^{L-1}G_{\ell}\right\|_{2}^{2}\right]\leq CN^{2}\left(L+\beta^{2}L^{2}\right).$$

With $\varepsilon=\lambda/(N\sqrt{L})$, we have $\beta=\lambda/\sqrt{L}$, so

$$\varepsilon^{2}\,\mathbb{E}\left[\frac{1}{d}\left\|\sum_{\ell=0}^{L-1}G_{\ell}\right\|_{2}^{2}\right]\leq\frac{\lambda^{2}}{N^{2}L}\,CN^{2}(L+\lambda^{2}L)=O(\lambda^{2}+\lambda^{4}).$$

For the matching lower bound, define

$$S_{M}:=\sum_{\ell=0}^{L-1}M_{\ell},\qquad S_{B}:=\sum_{\ell=0}^{L-1}B_{\ell},\qquad S_{Y}:=\sum_{\ell=0}^{L-1}Y_{\ell}=S_{M}+S_{B}.$$

The drift bound above gives

$$\mathbb{E}\left[d^{-1}\lVert S_{B}\rVert_{2}^{2}\right]\leq C\beta^{2}L^{2}.$$

For the centered diagonal terms, the inequality $\lVert Y_{\ell}-B_{\ell}\rVert_{2}^{2}\geq\frac{1}{2}\lVert Y_{\ell}\rVert_{2}^{2}-\lVert B_{\ell}\rVert_{2}^{2}$ and assumptions ([7](https://arxiv.org/html/2606.18524#S7.E7))–([8](https://arxiv.org/html/2606.18524#S7.E8)) imply

$$\sum_{\ell=0}^{L-1}\mathbb{E}\left[d^{-1}\lVert M_{\ell}\rVert_{2}^{2}\right]\geq\frac{a_{-}}{2}L-C\beta^{2}L.$$

Together with the centered cross bound ([11](https://arxiv.org/html/2606.18524#S7.E11)), this yields

$$\mathbb{E}\left[d^{-1}\lVert S_{M}\rVert_{2}^{2}\right]\geq cL-C\beta^{2}L^{2}$$

for constants $c,C>0$. Finally, $\lVert S_{M}+S_{B}\rVert_{2}^{2}\geq\frac{1}{2}\lVert S_{M}\rVert_{2}^{2}-\lVert S_{B}\rVert_{2}^{2}$, so

$$\mathbb{E}\left[d^{-1}\lVert S_{Y}\rVert_{2}^{2}\right]\geq cL-C\beta^{2}L^{2}.$$

Hence, when $\beta=\lambda/\sqrt{L}$ and $\lambda$ is below a constant threshold depending on $c$ and $C$,

$$\mathbb{E}\left[\frac{1}{d}\left\|\sum_{\ell=0}^{L-1}Y_{\ell}\right\|_{2}^{2}\right]=\Theta(L),$$

and therefore

$$\mathbb{E}\left[\frac{1}{d}\left\|\sum_{\ell=0}^{L-1}G_{\ell}\right\|_{2}^{2}\right]=\Theta(LN^{2}).$$

∎

### 7.5 Multi-layer learning-rate scaling proof

Theorem [6](https://arxiv.org/html/2606.18524#Thmtheorem6) controls the forward residual-stream norm. The learning-rate rule in Section [4](https://arxiv.org/html/2606.18524#S4) concerns a different object: the one-step sensitivity of the final output to simultaneous updates of the reused matrices. We formalize that calculation in the linearized maximal-update sense.

Fix one optimizer step and write $W_{\ell}^{+}=W_{\ell}+\Delta W_{\ell}$. For each occurrence $(n,\ell)$, let $\mathcal{J}_{n,\ell}$ denote the Jacobian, along the pre-update trajectory, of the map from an additive perturbation inserted after the residual branch at $h_{n,\ell+1}$ to the final output $h_{\mathrm{out}}$. The first-order output perturbation is

$$\Delta h_{\mathrm{out}}^{\mathrm{lin}}=\varepsilon\sum_{\ell=0}^{L-1}\sum_{n=0}^{N-1}\mathcal{J}_{n,\ell}\Delta W_{\ell}\phi(h_{n,\ell}). \tag{12}$$

###### Proposition 7 (Multi-layer learning-rate scaling).

Assume the factorized residual scaling $\varepsilon=\lambda/(N\sqrt{L})$ and the following tangent-stability conditions hold for constants independent of $N,L,d$:

$$\max_{n,\ell}\frac{1}{\sqrt{d}}\lVert\Delta W_{\ell}\phi(h_{n,\ell})\rVert_{2}\leq C_{\Delta}\eta, \tag{13}$$

$$\max_{n,\ell}\lVert\mathcal{J}_{n,\ell}\rVert_{\mathrm{op}}\leq C_{J}. \tag{14}$$

Then

$$\frac{1}{\sqrt{d}}\lVert\Delta h_{\mathrm{out}}^{\mathrm{lin}}\rVert_{2}\leq C_{J}C_{\Delta}\,\eta\varepsilon NL=C_{J}C_{\Delta}\,\eta\lambda\sqrt{L}.$$

Consequently, $\eta=O((\lambda\sqrt{L})^{-1})$ is sufficient for an $O(1)$ width-normalized linearized output perturbation. If, in addition, the layer-update contributions are coherent in the sense that for some $c_{\mathrm{coh}}>0$,

$$\frac{1}{\sqrt{d}}\left\|\sum_{\ell=0}^{L-1}\sum_{n=0}^{N-1}\mathcal{J}_{n,\ell}\Delta W_{\ell}\phi(h_{n,\ell})\right\|_{2}\geq c_{\mathrm{coh}}\,\eta NL, \tag{15}$$

then the bound is sharp up to constants:

$$\frac{1}{\sqrt{d}}\lVert\Delta h_{\mathrm{out}}^{\mathrm{lin}}\rVert_{2}=\Omega(\eta\lambda\sqrt{L}).$$

###### Proof\.

Using ([12](https://arxiv.org/html/2606.18524#S7.E12)), write $v_{n,\ell}:=\mathcal{J}_{n,\ell}\Delta W_{\ell}\phi(h_{n,\ell})$. The triangle inequality and the two tangent-stability assumptions give

$$\frac{1}{\sqrt{d}}\lVert\Delta h_{\mathrm{out}}^{\mathrm{lin}}\rVert_{2}\leq\varepsilon\sum_{\ell=0}^{L-1}\sum_{n=0}^{N-1}\frac{1}{\sqrt{d}}\lVert v_{n,\ell}\rVert_{2}\leq\varepsilon\sum_{\ell=0}^{L-1}\sum_{n=0}^{N-1}C_{J}C_{\Delta}\eta=C_{J}C_{\Delta}\,\eta\varepsilon NL.$$

Substituting $\varepsilon=\lambda/(N\sqrt{L})$ gives the claimed $O(\eta\lambda\sqrt{L})$ upper bound. Thus choosing $\eta=O((\lambda\sqrt{L})^{-1})$ keeps the width-normalized linearized output update bounded.

For sharpness, combine ([12](https://arxiv.org/html/2606.18524#S7.E12)) with the coherence condition ([15](https://arxiv.org/html/2606.18524#S7.E15)):

$$\frac{1}{\sqrt{d}}\lVert\Delta h_{\mathrm{out}}^{\mathrm{lin}}\rVert_{2}\geq c_{\mathrm{coh}}\,\eta\varepsilon NL=c_{\mathrm{coh}}\,\eta\lambda\sqrt{L}.$$

Therefore, in coherent regimes, taking $\eta\lambda\sqrt{L}\to\infty$ produces a diverging linearized output perturbation, so the stable scale is $\eta=\Theta((\lambda\sqrt{L})^{-1})$. ∎

For finite optimizer steps, the same scaling applies to the actual output difference whenever the nonlinear Taylor remainder is smaller than a constant multiple of the linearized term, matching the trajectory-direction sharpness condition used in Appendix [7.3](https://arxiv.org/html/2606.18524#S7.SS3) for the single-layer loop. If the layer contributions are less coherent than ([15](https://arxiv.org/html/2606.18524#S7.E15)), the upper bound can be loose; this is the sense in which less coherent regimes can tolerate larger learning rates.
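
The gap between the coherent and less coherent regimes can be illustrated with synthetic vectors. This is a toy sketch, not a measurement on a trained model: the vectors $v_{n,\ell}$ are drawn at random rather than computed from Jacobians and weight updates, every vector is scaled to saturate the per-term bound $C_{J}C_{\Delta}\eta\sqrt{d}$, and the constant `c` is an illustrative stand-in for $C_{J}C_{\Delta}$.

```python
import math
import random


def rms_of_sum(vectors, eps):
    """Width-normalized norm d^{-1/2} * || eps * sum_i v_i ||_2."""
    d = len(vectors[0])
    total = [sum(col) for col in zip(*vectors)]
    return eps * math.sqrt(sum(x * x for x in total)) / math.sqrt(d)


def random_vector_with_norm(rng, d, norm):
    """Random Gaussian direction rescaled to the requested Euclidean norm."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    scale = norm / math.sqrt(sum(x * x for x in g))
    return [scale * x for x in g]


rng = random.Random(0)
d, n_loops, n_layers = 32, 8, 4
lam, eta = 1.0, 1e-2
c = 1.0                                     # stands in for C_J * C_Delta
eps = lam / (n_loops * math.sqrt(n_layers))
per_term_norm = c * eta * math.sqrt(d)      # each ||v_{n,l}||_2 saturates the bound

upper = c * eta * lam * math.sqrt(n_layers)  # C_J C_Delta eta lam sqrt(L)

# Coherent case: every (n, l) occurrence pushes in the same direction.
v = random_vector_with_norm(rng, d, per_term_norm)
coherent = rms_of_sum([v] * (n_loops * n_layers), eps)

# Incoherent case: independent random directions with the same norms.
vs = [random_vector_with_norm(rng, d, per_term_norm)
      for _ in range(n_loops * n_layers)]
incoherent = rms_of_sum(vs, eps)

print(f"upper={upper:.4e} coherent={coherent:.4e} incoherent={incoherent:.4e}")
```

In the coherent case the triangle inequality is tight and the update meets the upper bound; with independent directions the sum grows only like the square root of the number of terms, so the same learning rate produces a much smaller output perturbation.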

## 8 Experimental Setup and Implementation Details

### 8.1 Language modeling experiment configuration

Table [2](https://arxiv.org/html/2606.18524#S8.T2) lists the model configurations and training hyperparameters for the language modeling depth–loop transfer experiments reported in the main text.

Table 2: Model configurations for the main depth–loop transfer experiments. “KV” denotes key/value attention heads (each row uses 12 attention heads and 12 key/value heads, i.e. no GQA grouping). All models are decoder-only Llama-style pre-norm Transformers with RMSNorm and the Llama 3 tokenizer vocabulary \[[22](https://arxiv.org/html/2606.18524#bib.bib22), [21](https://arxiv.org/html/2606.18524#bib.bib21), [12](https://arxiv.org/html/2606.18524#bib.bib12), [25](https://arxiv.org/html/2606.18524#bib.bib25), [28](https://arxiv.org/html/2606.18524#bib.bib28)\]. The token embedding and tied LM head are outside the loop \[[19](https://arxiv.org/html/2606.18524#bib.bib19)\], while the whole Transformer stack is repeated $N$ times. Training uses FineWeb-Edu for 20,000 optimizer steps (10B tokens, 0.5M-token global batch), AdamW \[[13](https://arxiv.org/html/2606.18524#bib.bib13)\] with $(\beta_{1},\beta_{2})=(0.9,0.95)$ and weight decay $\omega_{0}$ (see Appendix [8.3](https://arxiv.org/html/2606.18524#S8.SS3)), and a warmup-stable-linear-decay schedule \[[24](https://arxiv.org/html/2606.18524#bib.bib24)\] with 500 warmup steps and a final 1,000-step decay.

### 8.2 Depth–loop parameterization

This subsection spells out the implementation-facing parameterization corresponding to the single-stage case of the theory. We consider an $L$-layer pre-LN Transformer block looped $N$ times: there are $L$ unique Transformer blocks, each with two residual branches, and every unique layer is reused $N$ times. Let

$$m_{L}:=\frac{L}{12},$$

where 12 is the reference unique depth used in our baseline tuning. Width scaling terms are set to one, so the table below isolates the depth and loop axes. We write $\eta_{0}$ for the base learning rate at $L=12$, $\sigma_{0}^{2}$ for the base initialization variance, $\omega_{0}$ for the base AdamW weight decay, and $\epsilon_{0}$ for the base AdamW numerical epsilon.

Table 3: Depth–loop parameterization for an $L$-layer block looped $N$ times. The local loop correction is the factor $N^{-1}$ in each attention and MLP residual branch. The global depth correction is the factor $m_{L}^{-1/2}$, with $m_{L}=L/12$. Learning rates depend on the unique depth $L$ but not on the loop count $N$ once the local residual correction has been applied.
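
As a rough illustration of the caption, the two corrections can be combined into a single residual-branch multiplier. This is a sketch under assumptions: the body of Table 3 is not reproduced here, so multiplying the loop factor $N^{-1}$, the depth factor $m_{L}^{-1/2}$, and the branch constant $\lambda$ is a reading of the caption together with $\varepsilon=\lambda/(N\sqrt{L})$, and the function name `residual_multiplier` is hypothetical.

```python
import math

REFERENCE_DEPTH = 12  # reference unique depth used for baseline tuning


def residual_multiplier(n_loops, n_unique_layers, lam=1.0):
    """Assumed branch multiplier: lam * N^{-1} * m_L^{-1/2}, with m_L = L / 12."""
    m_l = n_unique_layers / REFERENCE_DEPTH
    return lam / (n_loops * math.sqrt(m_l))


for n_unique_layers in (12, 24, 48):
    for n_loops in (1, 4, 8):
        mult = residual_multiplier(n_loops, n_unique_layers)
        print(f"L={n_unique_layers:2d} N={n_loops} multiplier={mult:.4f}")
```

At the reference depth $L=12$ the depth factor is one, so the multiplier reduces to $\lambda/N$.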

### 8.3 Hyperparameter values and sweep configuration

Table [4](https://arxiv.org/html/2606.18524#S8.T4) lists the numerical values used to instantiate the parameterization of Appendix [8.2](https://arxiv.org/html/2606.18524#S8.SS2) for the main language-modeling experiments (Sections [5.2](https://arxiv.org/html/2606.18524#S5.SS2)–[5.3](https://arxiv.org/html/2606.18524#S5.SS3)) and the learning-rate sweep protocol behind Figure [5](https://arxiv.org/html/2606.18524#S4.F5).

| Quantity | Symbol | Value |
| --- | --- | --- |
| Residual branch constant | $\lambda$ | $1$ |
| Base learning rate at $L=12$ | $\eta_{0}$ | $1.25\times 10^{-3}$ |
| Initialization standard deviation | $\sigma_{0}$ | $0.02$ |
| AdamW weight decay | $\omega_{0}$ | $0.1$ |
| AdamW numerical epsilon | $\epsilon_{0}$ | $10^{-8}$ |
| LR sweep grid (base $\eta_{0}$) | n/a | $\{5,\,7.5,\,10,\,12.5,\,15,\,20,\,30,\,40\}\times 10^{-4}$ |
| Divergence criterion | n/a | final validation loss $>4$ |
| Random seed | n/a | 0 (single run) |
| Validation split | n/a | FineWeb-Edu held-out |

Table 4: Numerical values used in the language-modeling experiments. Symbols refer to the parameterization in Table [3](https://arxiv.org/html/2606.18524#S8.T3); per-component scaling rules apply as listed there. Initialization uses a truncated normal with standard deviation $\sigma_{0}$.
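
The sweep grid and divergence criterion from Table 4 can be written down as a small configuration sketch. The selection rule below (lowest final validation loss among non-diverged runs) and the treatment of NaN losses as diverged are assumptions made for illustration; the source states only the grid and the loss threshold. The names `LR_SWEEP_GRID`, `is_diverged`, and `best_learning_rate` are hypothetical.

```python
import math

# Base learning rates swept, from Table 4.
LR_SWEEP_GRID = [x * 1e-4 for x in (5, 7.5, 10, 12.5, 15, 20, 30, 40)]
DIVERGENCE_LOSS = 4.0  # a run counts as diverged if its final validation loss exceeds this


def is_diverged(final_val_loss):
    """Table 4 criterion; treating NaN as diverged is an extra assumption."""
    return math.isnan(final_val_loss) or final_val_loss > DIVERGENCE_LOSS


def best_learning_rate(final_losses):
    """Pick the swept learning rate with the lowest final validation loss.

    final_losses maps learning rate -> final validation loss.
    Returns None when every run diverged.
    """
    stable = {lr: loss for lr, loss in final_losses.items() if not is_diverged(loss)}
    if not stable:
        return None
    return min(stable, key=stable.get)
```

Note that the base learning rate $\eta_{0}=1.25\times 10^{-3}$ is itself one of the grid points.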

### 8.4 Post-training cosine similarity at smaller loop counts

Figure [8](https://arxiv.org/html/2606.18524#S8.F8) extends the post-training cosine-similarity diagnostic of Figure [6](https://arxiv.org/html/2606.18524#S4.F6) in the main text from $N=8$ to $N=4$, using the corresponding $L\in\{12,24,48\}$ models trained at $N=4$. The qualitative pattern matches the $N=8$ case: early and middle loop steps remain positively correlated across all three depths, while final-step pairs are weaker and can be near zero or negative, especially at $L=12$.

![Refer to caption](https://arxiv.org/html/2606.18524v1/x7.png)

Figure 8: Cosine similarity between loop-step increments after full training ($N=4$). Companion to Figure [6](https://arxiv.org/html/2606.18524#S4.F6) ($N=8$). Early and middle loop steps remain positively correlated across all three depths. Final-step pairs are weaker and can be near zero or negative, especially at $L=12$, matching the pattern in Figure [6](https://arxiv.org/html/2606.18524#S4.F6).

### 8.5 Multi-step learning-rate transfer

Figure [9](https://arxiv.org/html/2606.18524#S8.F9) extends the one-step learning-rate analysis of Figure [4](https://arxiv.org/html/2606.18524#S3.F4) in the main text to the first 10 optimizer steps. The $\eta\,\varepsilon\,N$ scaling identified in Theorem [2](https://arxiv.org/html/2606.18524#Thmtheorem2) persists across all measured steps for $\varepsilon\in\{1,1/\sqrt{N},1/N\}$: the linear-scaled update stays within a small constant range, while the sqrt- and unscaled updates track the predicted power laws.

![Refer to caption](https://arxiv.org/html/2606.18524v1/x8.png)

Figure 9: Multi-step companion to Figure [4](https://arxiv.org/html/2606.18524#S3.F4): the $\eta\,\varepsilon\,N$ scaling persists across all 10 optimizer steps. Each panel fixes one $\varepsilon$ rule and plots the per-step RMS hidden-state update $\lVert\Delta h_{N}\rVert_{\mathrm{RMS}}$ against the loop count $N$, with one curve per optimizer step $t=1,\dots,10$ (viridis ramp, mean $\pm$ standard error of the mean (SEM) over 10 seeds). All settings share the same base learning rate $10^{-4}$ (no per-rule LR rescaling). Dashed gray lines repeat the reference scalings from Figure [4](https://arxiv.org/html/2606.18524#S3.F4), anchored at $(N=1,\,t=1)$. In panels (a) and (b), the empirical curves run essentially parallel to their respective dashed references at every optimizer step, so the $\propto N$ and $\propto\sqrt{N}$ scalings hold throughout early training. In panel (c), the linear-scaled update stays within roughly a $4\times$ range over $N=1$ to $64$ and saturates near $N=32$, much weaker than the $\sqrt{N}$ or $N$ growth of the other rules; the residual growth reflects $\eta\,\varepsilon\,N$ being an asymptotic upper bound rather than a sharp scaling. Loss values decrease across optimizer steps, which compresses the absolute update magnitude over time without changing the cross-$N$ scaling.