Why Do Accumulated Transformations Extrapolate?
Summary
This paper investigates why accumulated token-dependent orthogonal transformations, such as those used in PaTH Attention and a simplified variant with SO(2) rotations, enable length extrapolation in transformers. It proves that such transformations become incoherent after a finite number of steps, suppressing attention to distant tokens, and shows both theoretically and experimentally that this mechanism improves extrapolation but eventually degrades at extreme context lengths.
# Why Do Accumulated Transformations Extrapolate?
Source: [https://arxiv.org/html/2606.24975](https://arxiv.org/html/2606.24975)
###### Abstract
PaTH Attention showed that replacing RoPE’s position-indexed rotations with accumulated data-dependent Householder reflections yields strong length extrapolation, though performance eventually degrades at extreme context lengths. This paper asks whether that behavior depends on Householder-specific structure or reflects a more general property of accumulated transformations along source-to-query paths. We study a simpler variant that keeps RoPE’s block-diagonal $\mathrm{SO}(2)$ rotations but replaces position-indexed angles with accumulated token-dependent angles. This simpler mechanism exhibits the same qualitative pattern: improved extrapolation followed by degradation at sufficiently long contexts. We prove that the result extends to accumulated orthogonal transformations satisfying certain regularity conditions: their products become incoherent after a finite number of steps, suppressing attention to distant tokens. We analyze the mechanism through a stylized model of attention that explains both behaviors. Accumulated rotations of queries and keys create a finite mixing window independent of context length; the per-token suppression learned during training transfers unchanged to any evaluation length, and high-dimensional concentration produces a score gap that suppresses far-token attention while near-route transport preserves the target signal. This suggests a concrete mechanism linking accumulated transport to length extrapolation. On the other hand, a lower bound shows that accumulated rotations must eventually degrade: as the far set grows with context length, no choice of rotations can guarantee preservation of the near target signal without explicit far-mass control. Furthermore, we show that for $\mathrm{SO}(2)$ rotations, also rotating values makes residual far contributions combine incoherently, extending the extrapolation range. Controlled transformer experiments support these predictions.
Random accumulated rotations substantially improve extrapolation over RoPE, learned token\-dependent rotations maintain near\-training\-length perplexity far beyond the training context, and rotating queries, keys, and values improves over rotating queries and keys alone\. Rotation\-only models still degrade at extreme lengths, while ALiBi remains approximately length\-stable, consistent with the need for explicit far\-mass control\.
## 1 Introduction
PaTH Attention (Yang et al., [2025](https://arxiv.org/html/2606.24975#bib.bib13)) replaces RoPE’s (Su et al., [2024](https://arxiv.org/html/2606.24975#bib.bib7)) position-indexed rotations with products of data-dependent Householder reflections and demonstrates strong extrapolation at 760M parameters: each token contributes a transformation, and the source-query relationship is determined by the product of intervening steps rather than by absolute position. The mechanism behind this established extrapolation phenomenon is not well understood. This paper seeks to shed light on *why* accumulated transformations help length extrapolation.
We also ask whether the mechanism requires Householder’s full generality. We find that even RoPE’s commuting block-diagonal $\mathrm{SO}(2)$ rotations, with accumulated token-dependent angles instead of position-indexed ones, extrapolate and then degrade in the same qualitative pattern. This suggests the mechanism is not specific to any particular orthogonal structure, and indeed we prove that accumulated products of orthogonal transformations satisfying certain regularity conditions become incoherent, suppressing attention to distant tokens. We verify that both $\mathrm{SO}(2)$ rotations and Householder reflections satisfy these conditions.
We call the ordered source-to-query path from token $j$ to query $i$ a *route*. The intervening tokens on this route generate per-token rotations whose product gives the source-query rotation. In RoPE, this rotation is determined entirely by the distance $i-j$; in accumulated transport, it depends on which tokens lie along the route. Distant routes pass through many independent token-dependent steps and become incoherent, while nearby routes can remain approximately aligned. The boundary between these two regimes is independent of context length, so the model sees the same near/far structure at any evaluation length. This train/test distributional match is a necessary condition for extrapolation, but it is not sufficient by itself. NoPE (Kazemnejad et al., [2023](https://arxiv.org/html/2606.24975#bib.bib10)) provides an instructive counterexample: identity transport is length-stable, but it does not create a score gap that suppresses far tokens. Distributional match alone does not quantify how strongly far tokens are suppressed, whether that suppression can be overwhelmed as context grows, or what happens to far values that survive score selection. The analysis below addresses these questions.
We develop a stylized model of attention that explains both the extrapolation and the eventual degradation:
1. *Score-side decoherence (mechanism for extrapolation).* Accumulated rotations of queries and keys create a finite mixing window whose boundary is independent of context length. Once training length covers this window, the near/far route regime is the same at training and evaluation within the stylized model. Within this regime, high-dimensional concentration produces a score gap that suppresses far-token attention mass, while near-route transport preserves the target signal. We prove that the result holds for any accumulated orthogonal transformation with a spectral gap (Appendix [B.5](https://arxiv.org/html/2606.24975#A2.SS5)).
2. *Far-mass lower bound (mechanism for eventual degradation).* Per-token suppression is stable across lengths, but the far set grows with context. A lower bound proves that this growth is fundamental: without explicit far-mass control, no choice of rotations can guarantee that the target signal is preserved at unbounded lengths. This bound applies to any orthogonal transport. This is consistent with the flat extrapolation of distance-bias methods such as ALiBi (Press et al., [2022](https://arxiv.org/html/2606.24975#bib.bib8)), which directly control the total far mass that rotations alone cannot bound.
In addition, we show that for $\mathrm{SO}(2)$ rotations, rotating values as well as queries and keys extends the extrapolation range, since far values that still receive attention mass combine incoherently, bounding the far contribution’s covariance.
We do not aim to reproduce PaTH at scale. Instead, we train small decoder-only transformers with accumulated $\mathrm{SO}(2)$ rotations to isolate the mechanism; the results support each prediction. Random accumulated rotations instantiate the independence and spectral-gap assumptions most directly among our experimental variants and strongly improve extrapolation over RoPE. Rotation-only models degrade gradually at extreme lengths while ALiBi remains flat, consistent with the far-mass requirement. Adding value rotation further reduces long-context degradation, consistent with the theoretical prediction. Learned token-dependent rotations match RoPE at training length and maintain near-training-length perplexity through $16\times$ the training context.
The remainder of the paper follows this progression. Section [2](https://arxiv.org/html/2606.24975#S2) defines the stylized attention model and the route notation used throughout. Section [3](https://arxiv.org/html/2606.24975#S3) shows that accumulated rotations produce a content-dependent mixing window and a score gap that forces small total far attention mass, establishing an upper bound on far-token interference. Section [4](https://arxiv.org/html/2606.24975#S4) proves a lower bound showing that even stable per-token suppression cannot eliminate far-mass leakage as the far set grows; explicit far-mass control (e.g., distance bias) is structurally necessary. Section [5](https://arxiv.org/html/2606.24975#S5) shows that, for the $\mathrm{SO}(2)$ variant, rotating values as well as queries and keys tightens the upper bound on far-token interference by making the surviving far contributions combine incoherently. Section [6](https://arxiv.org/html/2606.24975#S6) establishes near-signal preservation and summarizes the combined picture. Section [7](https://arxiv.org/html/2606.24975#S7) tests the resulting predictions in controlled $\mathrm{SO}(2)$ experiments. Section [8](https://arxiv.org/html/2606.24975#S8) discusses scope and relation to other positional methods; Section [9](https://arxiv.org/html/2606.24975#S9) concludes.
## 2 Preliminaries and Stylized Model
### 2.1 Transformer attention and the aggregation model
For a single attention head, let $x_j \in \mathbb{R}^{d_{\rm model}}$ be the residual representation at position $j$. Standard scaled dot-product attention forms

$$q_i = W_Q x_i, \qquad k_j = W_K x_j, \qquad v_j = W_V x_j, \tag{1}$$

then computes scores and softmax weights. With an explicit logit scale $\lambda > 0$,

$$s_{ij} = \frac{q_i^{\top} k_j}{\sqrt{d_k}} + b_{ij}, \qquad \alpha_{ij} = \frac{\exp(\lambda s_{ij})}{\sum_{\ell} \exp(\lambda s_{i\ell})}, \tag{2}$$

where $b_{ij}$ may include a causal mask or positional bias. The usual normalization is recovered by taking $\lambda = 1$; larger $\lambda$ makes the softmax more selective. The attention-head output before the output projection is

$$o_i = \sum_j \alpha_{ij}\, v_j. \tag{3}$$

Thus attention has a score side, which produces the weights $\alpha_{ij}$, and a value side, which forms the weighted value sum. From this point on, $d$ denotes the value/transport dimension, assumed even; it may be the value-projection dimension rather than the full residual dimension $d_{\rm model}$.
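As a concrete reading of Eqs. (2) and (3), the following minimal Python sketch computes the softmax weights and the weighted value sum for one query, already projected. The function names `attention_weights` and `attention_output` and the toy vectors are illustrative assumptions, not part of the paper.

```python
import math

def attention_weights(q, keys, lam=1.0, bias=None):
    """Softmax weights alpha_ij of Eq. (2) for a single query vector q."""
    d_k = len(q)
    scores = []
    for j, k in enumerate(keys):
        s = sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
        if bias is not None:
            s += bias[j]  # optional b_ij (causal mask or positional bias)
        scores.append(s)
    top = max(lam * s for s in scores)  # subtract the max for numerical stability
    exps = [math.exp(lam * s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(weights, values):
    """Weighted value sum o_i of Eq. (3)."""
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]

# Toy inputs (hypothetical): one query, three keys and values in dimension 2.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

w1 = attention_weights(q, keys, lam=1.0)  # usual normalization
w8 = attention_weights(q, keys, lam=8.0)  # larger logit scale, more selective
o8 = attention_output(w8, values)
```

Comparing `w1` and `w8` shows the role of the logit scale: the best-matching key receives a larger share of the mass as $\lambda$ grows.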
This section defines the score/value abstraction used to study far interference after the score side has selected the weights. The value-side aggregation is

$$c_i = \sum_j \alpha_{ij}\, P_{j\to i}\, v_j. \tag{4}$$

The ordinary transformer value sum is recovered by taking $P_{j\to i} = I_d$, so that $c_i = o_i$. This covers both the identity-rotation baseline and the standard RoPE-style baseline in which Q/K uses position-dependent rotation but V is summed directly. In the position-dependent Q/K/V comparison, $P_{j\to i}$ is chosen from the source-query offset. In the content-dependent Q/K/V comparison, $P_{j\to i}$ is accumulated from the intervening tokens. The Q/K use of the same route-level near/far split is analyzed in Theorem [2](https://arxiv.org/html/2606.24975#Thmtheorem2).
In the stylized model, near values carry a latent target component; far values are background\. The question is whether the latent component remains recoverable as more far terms are added\.
### 2.2 Route transport
A transport operator is an orthogonal route operator $P_{j\to i}$ associated with the source-to-query interval. It may be content-independent, as in position-based rotations, or content-dependent, as in accumulated rotations generated from intervening token representations. On the value path, it rotates each selected value vector before summation when value transport is enabled. On the score path, the same route geometry can be used to compare transported query and key features. Variants that rotate only queries and keys, including standard RoPE-style attention, have no V-side transport: $P_{j\to i} = I_d$.
Let $c_t$ denote the token at position $t$. Each position carries a per-token orthogonal step

$$M_t \in O(d). \tag{5}$$

For a source $j < i$, the *route* from $j$ to $i$ is the ordered source-to-query interval. Define the prefix product in left-to-right (increasing-$t$) order: $A_i = M_0 M_1 \cdots M_{i-1}$, with $A_0 = I$. The route transport is

$$P_{j\to i} = A_i^{-1} A_j = M_{i-1}^{-1} M_{i-2}^{-1} \cdots M_j^{-1}, \tag{6}$$

the product of inverse steps in query-to-source (decreasing-$t$) order.
*Content-dependent route transport* uses

$$M_t = M(c_t), \tag{7}$$

where $M(\cdot)$ maps each token representation to a learned orthogonal matrix. The route from $j$ to $i$ then depends on all intervening content. All operators remain norm-preserving.
*Position-only transport* uses a constant step $M_t = M_0$ for all $t$, giving $P_{j\to i} = M_0^{-(i-j)}$. The transport depends on the offset $i-j$ but not on intervening content.
Routes to the same query are nested interval products. For example, $P_{i-5\to i}$ and $P_{i-2\to i}$ share the final two per-token steps, so far routes to a fixed query are not independent. The covariance analysis below works with this nested dependence directly.
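The prefix-product identity in Eq. (6) and the nesting of routes can be checked numerically. The sketch below is illustrative only: it uses random Householder reflections in dimension 3 as non-commuting orthogonal steps, and all helper names (`householder`, `route_from_prefix`, `route_from_steps`) are hypothetical.

```python
import math
import random

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)] for r in range(n)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def householder(v):
    """I - 2 v v^T for a unit vector v: orthogonal, so inverse = transpose."""
    n = len(v)
    return [[(1.0 if r == c else 0.0) - 2.0 * v[r] * v[c] for c in range(n)] for r in range(n)]

def random_unit(n, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

rng = random.Random(0)
d, T = 3, 8
steps = [householder(random_unit(d, rng)) for _ in range(T)]  # M_0 .. M_{T-1}

# Prefix products: A_0 = I and A_i = M_0 M_1 ... M_{i-1}.
prefix = [identity(d)]
for m in steps:
    prefix.append(matmul(prefix[-1], m))

def route_from_prefix(j, i):
    """P_{j->i} = A_i^{-1} A_j."""
    return matmul(transpose(prefix[i]), prefix[j])

def route_from_steps(j, i):
    """P_{j->i} = M_{i-1}^{-1} M_{i-2}^{-1} ... M_j^{-1} (decreasing t)."""
    p = identity(d)
    for t in range(i - 1, j - 1, -1):
        p = matmul(p, transpose(steps[t]))
    return p

identity_gap = max_abs_diff(route_from_prefix(2, 7), route_from_steps(2, 7))
# Nesting: the long route factors through the short one, P_{2->7} = P_{5->7} P_{2->5}.
nested_gap = max_abs_diff(
    matmul(route_from_steps(5, 7), route_from_steps(2, 5)),
    route_from_steps(2, 7),
)
```

Both gaps should be at the level of floating-point round-off. The second check makes the dependence between routes to the same query explicit: every longer route contains the shorter one as a left factor.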
A transported aggregation layer computes convex weights $\alpha_{ij} \geq 0$, $\sum_j \alpha_{ij} = 1$, and produces the output in ([4](https://arxiv.org/html/2606.24975#S2.E4)). The score side determines the weights; when value transport is used, the value side rotates the selected values before summation. The block-diagonal $\mathrm{SO}(2)$ specialization and its implementation convention are introduced in Section [5](https://arxiv.org/html/2606.24975#S5), where the commutative structure is used.
The following table collects the terminology and notation used in the stylized model and proof chain\.
### 2.3 Near and Far Route Regimes
Fix a query position $i$ and consider causal source positions $j < i$. The route length is $n = i-j$. When we use route length $k$ instead of source index $j$, the notation $P_{i-k\to i}$ means the same source-to-query transport as $P_{j\to i}$ with $j = i-k$. For a window width $w$, define the near and far index sets

$$\mathcal{N}_i(w) = \{j : 0 < i-j < w\}, \qquad \mathcal{F}_i(w) = \{j : i-j \geq w\}.$$

These are route-level sets. They are used by the Q/K score analysis and the value-side decomposition

$$c_i = c_i^{\rm near}(w) + c_i^{\rm far}(w), \tag{8}$$

where

$$c_i^{\rm near}(w) = \sum_{j\in\mathcal{N}_i(w)} \alpha_{ij}\, P_{j\to i}\, v_j \tag{9}$$

is the near-window contribution and

$$c_i^{\rm far}(w) = \sum_{j\in\mathcal{F}_i(w)} \alpha_{ij}\, P_{j\to i}\, v_j \tag{10}$$

is the far contribution. The near term contains the target-bearing signal. Section [3](https://arxiv.org/html/2606.24975#S3) gives conditions under which content-dependent transport supplies a finite mixing window for this split, and Section [5](https://arxiv.org/html/2606.24975#S5) later bounds the covariance of the selected far contribution. On the score path, the same near and far route sets index the Q/K comparisons. The transported Q/K score $S_{j\to i}$ between source $j$ and query $i$ depends on the route transport $P_{j\to i}$: near routes have transport close to the identity and yield high scores; far routes with approximately Haar-random transport yield low scores. Using a logit-scale convention, the corresponding softmax weights over a finite active source set $\mathcal{A}_i$ are

$$\alpha_{ij} = \frac{\exp(\lambda S_{j\to i})}{\sum_{m\in\mathcal{A}_i} \exp(\lambda S_{m\to i})}, \qquad \lambda > 0. \tag{11}$$

The explicit score formula (a cosine average over block phases) is given in Section [5](https://arxiv.org/html/2606.24975#S5) when the $\mathrm{SO}(2)$ specialization is introduced.
### 2.4 Signal-interference decomposition
The active source set for query $i$ is $\mathcal{A}_i = \mathcal{S}_i \,\dot{\cup}\, \mathcal{D}_i$, where $\mathcal{S}_i$ is a near-window target-bearing set and $\mathcal{D}_i$ is a far set. The transported near-signal coefficient is

$$B_{\mathcal{S},i} = \sum_{j\in\mathcal{S}_i} \alpha_{ij}\, P_{j\to i}, \tag{12}$$

and the far contribution is

$$e_i = \sum_{j\in\mathcal{D}_i} \alpha_{ij}\, P_{j\to i}\, v_j. \tag{13}$$

Let $\mathcal{E}_{i,L}$ denote the *aggregation environment*: the near and far index sets, the weights $\{\alpha_{ij}^{(L)}\}$, and the route transports $P_{j\to i}^{(L)}$. The far covariance after conditioning on a realization $\mathcal{E}_{i,L} = e$ is

$$\Delta_{\mathcal{D}}(e) = \mathrm{Cov}(e_i \mid \mathcal{E}_i = e). \tag{14}$$

The analysis focuses on the setting where far values share structure that creates nonzero cross-covariances after transport. The canonical specialization is the *shared-background model* (Appendix [B](https://arxiv.org/html/2606.24975#A2), Definition [1](https://arxiv.org/html/2606.24975#Thmdefinition1)), in which far tokens share a zero-mean Gaussian component. Ordinary value summation leaves this component coherent; value transport can make the weighted sum small.
## 3 How Accumulated Rotations Suppress Far Attention
Accumulated content-dependent orthogonal transformations create a content-dependent mixing window. The argument proceeds in three steps: a spectral-gap mixing result, a score gap from high-dimensional concentration, and far-weight bounds after softmax.
The starting point is a finite mixing window (Theorem [1](https://arxiv.org/html/2606.24975#Thmtheorem1)). For any i.i.d. random orthogonal step matrices $M_t \in O(d)$ with $\|\mathbb{E}[M_t]\|_{\mathrm{op}} \leq \beta < 1$ (a *spectral gap*), the accumulated product $P_n = M_1 \cdots M_n$ satisfies $\|\mathbb{E}[P_n]\|_{\mathrm{op}} \leq \beta^n$. The first moment of the route transport therefore decays geometrically, so that after a finite number of steps $w_{\varepsilon_{\rm mix}}$ (depending on $\beta$ and the tolerance, not on the total context length) the accumulated transport is decorrelated. The operator-norm condition $\|\mathbb{E}[M_t]\|_{\mathrm{op}} < 1$ is a convenient sufficient contraction condition used throughout this paper; the exact necessary and sufficient condition for first-moment decorrelation is $\rho(\mathbb{E}[M_t]) < 1$, where $\rho$ denotes the spectral radius. (For i.i.d. steps, $\mathbb{E}[P_n] = (\mathbb{E}[M])^n \to 0$ if and only if $\rho(\mathbb{E}[M]) < 1$.) When $\rho(\mathbb{E}[M_t]) = 1$, in particular whenever $M_t$ is deterministic, the accumulated product retains a non-decaying component and no decorrelation occurs.
###### Example 1 ($\mathrm{SO}(2)$: uniform step angles).
If the step angle is uniform on $[-a, a]$ with $a > 0$, then $\beta = |\sin(a)/a| < 1$. The random-rotation experiments (Section [7](https://arxiv.org/html/2606.24975#S7)) use this distribution.
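For a single $\mathrm{SO}(2)$ block, the geometric decay of the first moment can be checked by simulation. The sketch below is a minimal illustration: the tolerance `eps`, the choice $a = 1$, and the helper names are assumptions made here, and the mixing window is taken to be the smallest $n$ with $\beta^n \leq \varepsilon$, which is one natural reading of $w_{\varepsilon_{\rm mix}}$ rather than the paper's formal definition.

```python
import math
import random

def beta_uniform(a):
    """|E[cos(psi)]| for psi ~ Uniform[-a, a], which equals |sin(a) / a|."""
    return abs(math.sin(a) / a)

def mixing_window(beta, eps):
    """Smallest n with beta**n <= eps (requires 0 < beta < 1 and 0 < eps < 1)."""
    return math.ceil(math.log(eps) / math.log(beta))

def mean_route_cosine(a, n, trials, rng):
    """Monte Carlo estimate of E[cos(psi_1 + ... + psi_n)] for i.i.d. uniform steps."""
    total = 0.0
    for _ in range(trials):
        theta = sum(rng.uniform(-a, a) for _ in range(n))
        total += math.cos(theta)
    return total / trials

a = 1.0
beta = beta_uniform(a)
window = mixing_window(beta, eps=0.01)  # depends on beta and eps only, not on context length
rng = random.Random(0)
estimate = mean_route_cosine(a, n=5, trials=20000, rng=rng)  # should be close to beta**5
```

The window depends only on the step distribution and the tolerance, which is the sense in which the near/far boundary does not move as the context grows.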
###### Example 2 (Householder reflections).
Let $d \geq 2$. For Householder reflections $H_t = I - 2 v_t v_t^{\top}$ with normal vector $v_t$ drawn from a distribution $\nu$ on $S^{d-1}$, the spectral gap is $\|\mathbb{E}[H_t]\|_{\mathrm{op}} = \|I - 2\Sigma_\nu\|_{\mathrm{op}}$, where $\Sigma_\nu = \mathbb{E}[v_t v_t^{\top}]$. This is less than one whenever the support of $\nu$ spans $\mathbb{R}^d$. Idealized PaTH-like distributions satisfying this condition are an instance.
###### Example 3 (Givens rotations in a random plane).
A Givens rotation $G_{pq}(\theta)$ rotates the $(p,q)$ coordinate plane by angle $\theta$ and acts as the identity on all other coordinates. If the plane $(p_t, q_t)$ is drawn uniformly from all $\binom{d}{2}$ coordinate pairs and $\theta_t$ is uniform on $[-a, a]$ independently, then $\mathbb{E}[M_t] = \bigl(1 - \tfrac{2(1-\sin(a)/a)}{d}\bigr)\, I$, so $\beta = \bigl|1 - \tfrac{2(1-\sin(a)/a)}{d}\bigr| < 1$ for $d \geq 2$.
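The expectation in Example 3 can be recovered in two averaging steps. In the sketch below, the shorthand $s(a) = \sin(a)/a$ and the standard basis vectors $e_p$ are notation introduced here for illustration. Averaging over the angle first removes the off-diagonal sine terms (the angle distribution is symmetric) and scales the two in-plane diagonal entries by $s(a)$; averaging over the plane then uses the fact that each coordinate lies in the chosen pair with probability $(d-1)/\binom{d}{2} = 2/d$.

```latex
\begin{aligned}
\mathbb{E}_{\theta}\bigl[G_{pq}(\theta)\bigr]
  &= I - \bigl(1 - s(a)\bigr)\bigl(e_p e_p^{\top} + e_q e_q^{\top}\bigr),
  \qquad s(a) := \mathbb{E}[\cos\theta] = \frac{\sin a}{a}, \\
\mathbb{E}[M_t]
  &= I - \bigl(1 - s(a)\bigr)\,\mathbb{E}_{p,q}\bigl[e_p e_p^{\top} + e_q e_q^{\top}\bigr]
   = \Bigl(1 - \frac{2\,(1 - s(a))}{d}\Bigr) I .
\end{aligned}
```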
###### Example 4 (Block-diagonal $\mathrm{SO}(3)$ rotations).
Assume $d = 3B$. In each 3D block, rotate by a fixed angle $\theta \in (0, \pi)$ around an axis $\hat{n}_t$ drawn uniformly from $S^2$. Then $\mathbb{E}[M_t] = \tfrac{2\cos\theta + 1}{3}\, I_3$ per block, giving $\beta = |2\cos\theta + 1|/3 < 1$.
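The per-block expectation in Example 4 follows from Rodrigues' rotation formula, sketched below. Here $[\hat{n}]_{\times}$ denotes the cross-product matrix of the axis, a notation introduced for this illustration. For an axis uniform on $S^2$, the linear term averages to zero and the outer product averages to $I_3/3$.

```latex
\begin{aligned}
R(\hat{n},\theta) &= \cos\theta\, I_3 + \sin\theta\,[\hat{n}]_{\times}
  + (1-\cos\theta)\,\hat{n}\hat{n}^{\top}, \\
\mathbb{E}\bigl[[\hat{n}]_{\times}\bigr] &= 0, \qquad
\mathbb{E}\bigl[\hat{n}\hat{n}^{\top}\bigr] = \tfrac{1}{3} I_3, \\
\mathbb{E}\bigl[R(\hat{n},\theta)\bigr]
  &= \Bigl(\cos\theta + \frac{1-\cos\theta}{3}\Bigr) I_3
   = \frac{2\cos\theta + 1}{3}\, I_3 .
\end{aligned}
```

For $\theta \in (0, \pi)$ the numerator $2\cos\theta + 1$ lies strictly between $-1$ and $3$, which gives $\beta < 1$.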
###### Example 5 (Torus rotation via a fixed skew-symmetric matrix).
Assume $d$ is even. Let $A \in \mathfrak{so}(d)$ be a fixed nonsingular skew-symmetric matrix with eigenvalue magnitudes $\lambda_1, \ldots, \lambda_{d/2}$, all positive. For a content-dependent scalar $\epsilon_t \sim \mathrm{Uniform}[-a, a]$ with $a > 0$, set $M_t = \exp(\epsilon_t A)$. Then $\beta = \max_j |\sin(a\lambda_j)/(a\lambda_j)| < 1$.
Identity and position-only transports have $\|\mathbb{E}[M_t]\|_{\mathrm{op}} = 1$, so neither RoPE nor NoPE creates this content-random mixing. Once training length covers $w_{\varepsilon_{\rm mix}}$, longer contexts add only routes from the already-present decorrelated regime. This distributional match between training and evaluation is a necessary, but not sufficient, condition for extrapolation: it ensures that the per-token suppression the model learns during training transfers unchanged to any evaluation length. The remainder of this section quantifies how strong that per-token suppression is (the score gap), and Section [4](https://arxiv.org/html/2606.24975#S4) shows why per-token suppression alone is not enough when the far set grows.
The mixing window becomes a score gap through high-dimensional concentration (Theorem [2](https://arxiv.org/html/2606.24975#Thmtheorem2)). Near routes have transport close to the identity and therefore yield high transported Q/K scores. Far routes in the decorrelated regime have approximately Haar-random transport; sphere concentration (Lévy’s lemma on $S^{d-1}$) then makes the probability of a large far score decay exponentially in the dimension $d$. Under an additional TV-mixing condition (spectral gap of the random walk on $O(d)$), the product distribution converges to Haar measure on the appropriate limiting space: $\mathrm{SO}(d)$ for determinant-$+1$ steps, the parity-determined determinant component for fixed determinant-$(-1)$ steps, and full $O(d)$ for mixed-sign steps. Once the product is Haar-like, it sends a fixed key vector to an approximately uniform point on the sphere, so its dot product with a fixed query is unlikely to be large. For the $\mathrm{SO}(2)$ specialization (Section [5](https://arxiv.org/html/2606.24975#S5)), the same mechanism admits a scalar analysis: each route accumulates scalar block phases whose independence across the $d/2$ blocks lets Hoeffding’s inequality yield a score-gap rate of $d/4$, compared with the general Lévy rate of $(d-1)/2$. RoPE-style blocks share the same $2\times 2$ rotation geometry but have deterministic phases, so they do not satisfy the content-random condition that drives the tail bound.
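The scalar tail bound can be illustrated with a small simulation. The sketch below assumes an idealized far route whose $d/2$ block phases are independent and uniform on $[-\pi, \pi]$, and takes the score to be the block-averaged cosine. Under that assumption, the textbook Hoeffding inequality for $d/2$ independent terms in $[-1, 1]$ gives $\Pr(S \geq t) \leq \exp(-d\,t^2/4)$, which appears to correspond to the $d/4$ rate quoted above; the function names and parameter values are hypothetical.

```python
import math
import random

def far_score(num_blocks, rng):
    """Block-averaged cosine score with independent, uniform block phases."""
    total = sum(math.cos(rng.uniform(-math.pi, math.pi)) for _ in range(num_blocks))
    return total / num_blocks

def hoeffding_tail(d, t):
    """Hoeffding bound on P(S >= t) for an average of d/2 independent terms in [-1, 1]."""
    return math.exp(-d * t * t / 4.0)

def empirical_tail(d, t, trials, rng):
    """Fraction of simulated far scores that reach the threshold t."""
    hits = sum(1 for _ in range(trials) if far_score(d // 2, rng) >= t)
    return hits / trials

rng = random.Random(0)
d, t = 64, 0.3
bound = hoeffding_tail(d, t)
observed = empirical_tail(d, t, trials=5000, rng=rng)
```

The observed frequency should sit well below the bound, since Hoeffding's inequality is not tight here; the point of the sketch is only the exponential dependence on $d$.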
Finally, for a finite source set, the score gap plus a sufficiently large logit scale $\lambda$ forces small total far mass and small far-weight $\ell_2$ norm (Proposition [1](https://arxiv.org/html/2606.24975#Thmproposition1)). The required scale depends logarithmically on the far-candidate bound $M_{\max}$; full softmax with length-independent logits over an unbounded far set cannot provide these bounds by itself. Formal statements and proofs for the $\mathrm{SO}(2)$ case are in Appendix [B](https://arxiv.org/html/2606.24975#A2).
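The logarithmic dependence on the number of far candidates can be seen in a simple worst-case calculation. This is an illustrative toy computation, not the statement of Proposition 1: it assumes every near score is at least some level, every far score is at most a level lower by `gap`, and all names and numbers are hypothetical. Under those assumptions the far mass is at most $1 / (1 + (K/M)\,e^{\lambda\,\mathrm{gap}})$ for $K$ near and $M$ far sources.

```python
import math

def far_mass_upper_bound(num_near, num_far, gap, lam):
    """Worst-case far mass when all far scores sit exactly `gap` below all near scores."""
    return 1.0 / (1.0 + (num_near / num_far) * math.exp(lam * gap))

def required_logit_scale(num_near, num_far, gap, eps):
    """Smallest lam for which the worst-case far mass is at most eps."""
    return math.log(num_far * (1.0 - eps) / (num_near * eps)) / gap

lam_small = required_logit_scale(1, 10**3, gap=0.5, eps=0.01)
lam_large = required_logit_scale(1, 10**6, gap=0.5, eps=0.01)
# A 1000-fold larger far set raises the required scale by log(1000) / gap.
extra_scale = lam_large - lam_small
```

In this toy setting the required scale grows only logarithmically with the far-set size, but it does grow without bound, so no fixed $\lambda$ covers an unbounded far set.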
Idealized Householder-step distributions satisfying the spectral-gap conditions include PaTH-like transformations (Example [2](https://arxiv.org/html/2606.24975#Thmexample2)); verifying the TV-mixing condition for learned PaTH transformations remains future work. The general orthogonal formal statements and proofs are in Appendix [B.5](https://arxiv.org/html/2606.24975#A2.SS5). Proposition [2](https://arxiv.org/html/2606.24975#Thmproposition2) shows that Householder reflections whose normal vector has a smooth density bounded below on $S^{d-1}$ satisfy the stronger component-TV mixing condition (Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2)), and Example [6](https://arxiv.org/html/2606.24975#Thmexample6) gives a heat-kernel-dithered Householder-compatible family satisfying the same condition.
## 4 Why Rotations Alone Are Not Enough
Under full softmax with bounded logits, the total far attention mass satisfies $\rho_{\mathcal{D}} \geq M_L/(M_L + K e^{2\lambda})$, where $K$ is the near-set size, $M_L$ is the far-set size, and $\lambda$ is the logit bound (Proposition [3](https://arxiv.org/html/2606.24975#Thmproposition3)). As $M_L \to \infty$, this forces $\rho_{\mathcal{D}} \to 1$, regardless of the rotation structure. The near-signal coefficient is then bounded by $1 - \rho_{\mathcal{D}}$, which vanishes at unbounded lengths (Proposition [4](https://arxiv.org/html/2606.24975#Thmproposition4)). Within this full-softmax, bounded-logit, rotation-only setting, no choice of orthogonal transport ($\mathrm{SO}(2)$ blocks or Householder products alike) can prevent this without explicit far-mass control.
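The lower bound can be evaluated directly. The sketch below assumes that "logit bound" means every logit lies in $[-\lambda, \lambda]$, so the least favorable configuration for the far set puts all near logits at $+\lambda$ and all far logits at $-\lambda$; that configuration attains the bound exactly. The function names and the values of $K$, $\lambda$, and the far-set sizes are illustrative.

```python
import math

def far_mass_lower_bound(num_near, num_far, lam):
    """Lower bound M / (M + K * exp(2 * lam)) on the total far attention mass."""
    return num_far / (num_far + num_near * math.exp(2.0 * lam))

def worst_case_far_mass(num_near, num_far, lam):
    """Softmax far mass when near logits sit at +lam and far logits at -lam."""
    near = num_near * math.exp(lam)
    far = num_far * math.exp(-lam)
    return far / (near + far)

# The bound climbs toward 1 as the far set grows with K and lam held fixed.
bounds = [far_mass_lower_bound(4, m, 3.0) for m in (10**2, 10**4, 10**6, 10**9)]
```

With $K = 4$ and $\lambda = 3$, the bound is small at a hundred far tokens and close to one at a billion, which is the sense in which the near share $1 - \rho_{\mathcal{D}}$ is eventually squeezed out.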
This is why rotation\-only models, despite the score gap from Section[3](https://arxiv.org/html/2606.24975#S3), still degrade at extreme extrapolation lengths: without an additional mechanism that directly suppresses far attention mass, the growing far set eventually erodes the near\-signal coefficient\. ALiBi’s distance\-dependent score bias and FoX’s data\-dependent forget gates are two existing designs that provide this control; their flat extrapolation is consistent with this prediction\.
## 5 How Rotating Values Provides Additional Protection: an $\mathrm{SO}(2)$-Specific Result
Section [3](https://arxiv.org/html/2606.24975#S3) controls how much far content is selected by the score side. Some far mass can nevertheless remain. If far values contain a shared background component, ordinary value summation can preserve that component coherently even when the far weights are small. The $\mathrm{SO}(2)$ Q/K/V variant targets this residual error: by rotating values along the same accumulated route, far values that survive score selection are made to combine incoherently. The value-side analysis exploits the commutativity of accumulated $\mathrm{SO}(2)$ rotations on the value path.
#### $\mathrm{SO}(2)$ notation.
This section uses the block-diagonal $\mathrm{SO}(2)$ specialization of the general orthogonal framework introduced in Section [2](https://arxiv.org/html/2606.24975#S2). Let $B = d/2$ be the number of two-dimensional rotation blocks. For $\psi \in \mathbb{R}^B$, let $R(\psi)$ be the block-diagonal matrix with a $2\times 2$ rotation $R(\psi_b)$ on block $b$. Each position carries a block-diagonal step rotation $R_t = R(\psi_t) \in \mathrm{SO}(2)^B$, which is a special case of the per-token step $M_t$ from ([5](https://arxiv.org/html/2606.24975#S2.E5)). Because block rotations commute, accumulated angles $\Theta_i = \sum_{t<i} \psi_t$ give $A_i = R(\Theta_i)$ and the route transport takes the simple form $P_{j\to i} = R(\Theta_j - \Theta_i)$. Write the accumulated phase in block $b$ as $\Theta_{j\to i,b}$.
*Content-dependent transport* uses the step-angle

$$\psi_t = \omega + g(c_t), \tag{15}$$

where $g$ maps each token to a learned angle vector in $\mathbb{R}^B$.
In the matched-score idealization, the block-normalized transported Q/K score is

$$S_{j\to i} = \frac{1}{B} \sum_{b=1}^{B} \cos\Theta_{j\to i,b}. \tag{16}$$

Near routes have small accumulated phases in most blocks; far routes have approximately uniform phases.
*Position-only transport* uses $\psi_t = \omega$ (a constant frequency vector), giving $P_{j\to i} = R(\omega(j-i))$.
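The notation above reduces to prefix sums of angles. The following sketch computes $\Theta_i$, the route phases $\Theta_j - \Theta_i$, and the score of Eq. (16) for a toy sequence. The two-block frequency vector `omega`, the lookup table `g` standing in for the learned map, and the token string are all hypothetical.

```python
import math

def prefix_angles(step_angles):
    """Theta_i = sum over t < i of psi_t, per block; entry 0 is Theta_0 = 0."""
    num_blocks = len(step_angles[0])
    thetas = [[0.0] * num_blocks]
    for psi in step_angles:
        thetas.append([th + p for th, p in zip(thetas[-1], psi)])
    return thetas

def route_phase(thetas, j, i):
    """Per-block phase of P_{j->i} = R(Theta_j - Theta_i)."""
    return [tj - ti for tj, ti in zip(thetas[j], thetas[i])]

def route_score(thetas, j, i):
    """Block-averaged cosine score of Eq. (16)."""
    phases = route_phase(thetas, j, i)
    return sum(math.cos(p) for p in phases) / len(phases)

omega = [0.5, 0.1]                               # constant frequency vector, B = 2
g = {"a": [0.0, 0.0], "b": [0.3, -0.2]}          # toy stand-in for the learned angle map
tokens = list("abbaab")

content_steps = [[w + gb for w, gb in zip(omega, g[c])] for c in tokens]
content = prefix_angles(content_steps)           # psi_t = omega + g(c_t)
position_only = prefix_angles([omega] * len(tokens))  # psi_t = omega

# Two routes with the same offset (3) but different intervening tokens.
same_offset_position = (route_score(position_only, 1, 4), route_score(position_only, 3, 6))
same_offset_content = (route_score(content, 1, 4), route_score(content, 3, 6))
```

With position-only steps the two routes receive the same score because only the offset matters; with content-dependent steps the scores differ because the routes pass through different tokens.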
#### Implementation convention.
The accumulated rotations are applied before the attention call. Let $A_i = R(\Theta_i)$ be the prefix-product rotation at position $i$. For Q/K-only transport, queries are pre-rotated by $A_i^{-1}$ and keys by $A_j^{-1}$, so that the dot product $q_i^{\top} k_j$ contains $\cos(\Theta_j - \Theta_i)$ terms as in RoPE. (The cosine terms are invariant to the sign convention because $\cos(\Theta_j - \Theta_i) = \cos(\Theta_i - \Theta_j)$. The sine/antisymmetric terms do change sign, so $q^{\top} R(\theta) k \neq q^{\top} R(-\theta) k$ for general $q, k$; however, models trained from scratch with a fixed convention learn queries and keys adapted to that convention, so the choice is absorbed during training.) For Q/K/V transport, values are additionally pre-rotated by $A_j$ before the attention call, and the attention output is post-rotated by $A_i^{-1}$: this gives $\sum_j \alpha_{ij}\, A_i^{-1} A_j\, v_j = \sum_j \alpha_{ij}\, P_{j\to i}\, v_j$ as required.
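The pre- and post-rotation convention can be checked on a single $2\times 2$ block by representing each two-dimensional vector as a complex number, so that $R(\theta)$ acts as multiplication by $e^{i\theta}$. The sketch below is a toy check with made-up prefix angles, vectors, and logit scale; it omits the $\sqrt{d_k}$ normalization and is not the authors' implementation.

```python
import cmath
import math

def rot(theta):
    """A 2x2 rotation by theta, acting on a 2D vector stored as a complex number."""
    return cmath.exp(1j * theta)

def dot(u, v):
    """Euclidean dot product of two 2D vectors stored as complex numbers."""
    return (u.conjugate() * v).real

thetas = [0.0, 0.4, 1.1, 1.3]  # hypothetical prefix angles Theta_0 .. Theta_3
i = 3                          # query position; sources are j = 0, 1, 2
q_i = complex(0.6, -0.8)
keys = [complex(1.0, 0.0), complex(0.3, 0.5), complex(-0.2, 0.9)]
values = [complex(0.5, 0.1), complex(-0.4, 0.7), complex(0.2, -0.3)]
lam = 2.0

# Q/K: pre-rotate the query by A_i^{-1} and each key by A_j^{-1}.
q_rot = rot(-thetas[i]) * q_i
k_rot = [rot(-thetas[j]) * k for j, k in enumerate(keys)]
scores = [dot(q_rot, k) for k in k_rot]
# Same scores written with an explicit relative rotation R(Theta_i - Theta_j) on the key.
direct_scores = [dot(q_i, rot(thetas[i] - thetas[j]) * k) for j, k in enumerate(keys)]

exps = [math.exp(lam * s) for s in scores]
alpha = [e / sum(exps) for e in exps]

# Q/K/V: pre-rotate values by A_j, run the plain weighted sum, post-rotate by A_i^{-1}.
pre_rotated = [rot(thetas[j]) * v for j, v in enumerate(values)]
fused = rot(-thetas[i]) * sum(a * v for a, v in zip(alpha, pre_rotated))
# Reference: apply P_{j->i} = R(Theta_j - Theta_i) to each value, then sum.
direct = sum(a * rot(thetas[j] - thetas[i]) * v for j, (a, v) in enumerate(zip(alpha, values)))
```

The design point is that the attention call itself is unchanged: all route dependence is absorbed into per-position rotations applied before and after it.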
#### Why commutativity matters.
The commutativity of $\mathrm{SO}(2)$ blocks is what makes accumulated route angles into sums of per-token angles. This additive structure enables per-block Fourier analysis and the leave-one-block decoupling used in the value-side proof below: by removing one block’s phase contribution from the score, the remaining $B-1$ blocks provide proxy weights that are independent of the removed block’s transport. The resulting adaptivity penalty is controlled by one block’s influence on the logit, of order $e^{2\lambda/B} - 1$.
#### Value-side decoherence.
The central difficulty is that the score-selected weights depend on the same route phases whose cancellation we want to prove. The proof of Theorem [6](https://arxiv.org/html/2606.24975#Thmtheorem6) handles this by removing one block’s contribution from the score, applying a prefix-product cancellation bound to that block with the resulting proxy weights, and then comparing the proxy and true softmax weights. The cost is an adaptivity penalty of order $e^{2\lambda/B} - 1$, reflecting one block’s influence on the logit. Given the score-side far-mass and far-weight $\ell_2$ bounds as input, this leave-one-block decoupling yields a spectral covariance bound in the shared-background model. On a high-probability route-phase event, the far covariance satisfies $\Delta_{\mathcal{D}}(e) \preceq \bar{\delta}^2 I_d$, where $\bar{\delta}^2$ depends on the spectral gap, the score-side far-mass bound, the logit scale, and the number of blocks. The formal statement and proof are in Appendix [B](https://arxiv.org/html/2606.24975#A2), Theorem [6](https://arxiv.org/html/2606.24975#Thmtheorem6).
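The cancellation that value transport provides can be illustrated in a deliberately simplified setting. The sketch below assumes one $2\times 2$ block, uniform far weights (so it ignores the dependence of the weights on the route phases that the leave-one-block argument handles), a fixed unit vector as the shared background, and route phases that grow by one uniform step per additional token. All names and parameter values are hypothetical.

```python
import cmath
import random

def far_background_norms(num_far, a, rng):
    """Norm of the weighted shared background without and with value transport."""
    background = complex(1.0, 0.0)  # shared far component in one 2D block
    weight = 1.0 / num_far          # uniform far weights (simplifying assumption)
    theta = 0.0
    plain = 0j
    transported = 0j
    for _ in range(num_far):
        theta += rng.uniform(-a, a)  # each extra step on the route adds one random angle
        plain += weight * background                               # ordinary value sum
        transported += weight * cmath.exp(1j * theta) * background  # rotated along the route
    return abs(plain), abs(transported)

rng = random.Random(0)
plain_norm, transported_norm = far_background_norms(num_far=2000, a=1.0, rng=rng)
```

Without transport the shared component adds up coherently and keeps its full norm; with transport the nested random phases make the terms partially cancel, so the norm of the sum is much smaller.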
#### Householder products\.
PaTH applies accumulated Householder reflections only to queries and keys, not to values, so its extrapolation depends entirely on score\-side mechanisms\. Householder transformations could also be applied to values, but proving value\-side decoherence for noncommuting products would require a decorrelation argument for prefix products of random matrices, which remains open\.
## 6 Near-Signal Preservation and Summary
Sections [3](https://arxiv.org/html/2606.24975#S3)–[5](https://arxiv.org/html/2606.24975#S5) control far interference. The remaining question is whether the near-window contribution still carries the target signal after transport. If every target-bearing near route acts approximately as the identity on the latent signal subspace, then their weighted combination preserves the signal and the near-signal gain is bounded away from zero (Lemma [3](https://arxiv.org/html/2606.24975#Thmlemma3); Appendix [E](https://arxiv.org/html/2606.24975#A5) gives a probabilistic sufficient condition under sub-Gaussian angle fluctuations). Combining this with the score-side far-mass bound yields the overall near-signal gain condition (Corollary [2](https://arxiv.org/html/2606.24975#Thmcorollary2)).
Taken together, these three results give a unified picture within the stylized model: score-side decoherence bounds the total far attention mass (Section [3](https://arxiv.org/html/2606.24975#S3)); the far-mass lower bound shows this control is eventually overwhelmed without explicit distance bias (Section [4](https://arxiv.org/html/2606.24975#S4)); and value-side decoherence makes the surviving far contribution combine incoherently (Section [5](https://arxiv.org/html/2606.24975#S5)). Near-signal preservation ensures that the useful component is not destroyed by the same transport. Conditioned on fixed far-mass and alignment quantities, the per-route bounds do not depend on evaluation length (Section [4](https://arxiv.org/html/2606.24975#S4) shows why those quantities can nevertheless degrade as the far set grows), so the near/far structure seen during training transfers unchanged to longer sequences.
## 7 Controlled Experiments
### 7.1 Experimental setup
The experiments below serve as controlled demonstrations of the analysis’s predictions (score-side mixing, value-side cancellation, the Q/K versus Q/K/V distinction, and the need for far-mass control) in trained transformers. All models are decoder-only causal transformers trained on OpenWebText at context length 512 and evaluated up to 65,536 tokens (a 128-fold increase). The comparison isolates two factors: whether rotations are position-indexed or accumulated, and whether the accumulated transport is applied only to Q/K or also to V. Full architecture, optimizer, sampling, and evaluation details are given in Appendix [C](https://arxiv.org/html/2606.24975#A3).
### 7.2 Main rotation comparison
Table [1](https://arxiv.org/html/2606.24975#S7.T1) reports length-extrapolation perplexity for the main rotation variants.
Table 1: Length extrapolation perplexity for the main trained rotation variants. Models are trained at context length 512 and evaluated up to length 65,536. The ratio columns report perplexity at the indicated length divided by perplexity at 512.
### 7.3 Far-mass control comparison
ALiBi tests the far-mass-control prediction: its distance-dependent score bias directly suppresses far attention mass, providing a reference for the rotation-only models.
Table 2: ALiBi distance-bias baseline, included to test the far-mass-control prediction of the stylized-model analysis.
### 7.4 Takeaways
The experiments support three predictions of the stylized-model analysis. First, accumulated rotations substantially improve extrapolation over position-indexed RoPE: learned token rotations achieve 1.17x at 8K/512 versus RoPE’s 16.2x. The random accumulated-rotation control provides the most direct test since it satisfies the model assumptions by construction. Second, rotating values in addition to queries and keys consistently improves long-context behavior, especially for random rotations (1.59x at 65K/512 versus 7.49x for queries and keys only). Third, rotation-only models still degrade at extreme lengths while ALiBi remains flat (0.96x at 65K/512), consistent with the far-mass lower bound. Residual degradation reflects both the absence of explicit far-mass control and multi-layer dynamics beyond the single-layer stylized model.
## 8 Discussion and Future Work
### 8.1 What the stylized model does and does not claim
The analysis in this paper sheds light on why accumulated transformations extrapolate. Score-side decoherence holds for any accumulated orthogonal transformation with a spectral gap; the $\mathrm{SO}(2)$ case is developed in detail because it additionally enables value-side analysis. The theoretical results analyze the attention mechanism in isolation: the score/value subcomputation at a single layer with given inputs. The finite-window result does not say that each far contribution becomes small; every rotation remains norm-preserving. It says that route transport creates a finite regime split whose boundary does not move with evaluated context length. Tokens beyond the mixing window may still participate, but they arrive from an already-present decorrelated regime rather than from a new deterministic offset regime.
The finite-window argument rests on the per-token transformations having enough variation across content. The general condition is a spectral gap: $\|\mathbb{E}[M_t]\|_{\mathrm{op}} < 1$, ensuring that accumulated products mix. Examples [1](https://arxiv.org/html/2606.24975#Thmexample1)–[5](https://arxiv.org/html/2606.24975#Thmexample5) verify this condition for five families of orthogonal steps, including the $\mathrm{SO}(2)$ uniform-angle case used in the experiments and the Householder case relevant to PaTH. The per-token transformation is also role-blind: it is applied at position $t$, independently of which query later reads from that position. Since the same token can be relevant for one query and irrelevant for another, the transformation cannot be defined by setting it to the identity on signal routes and randomizing it on far routes. Signal preservation enters through the realized near-signal gain condition; role-blind diversity is used for residual far suppression. The model also separates content used for transformation generation, score-side information used for weights, value vectors, and the latent signal; this separation is part of the model assumption.
#### Beyond the train/test match intuition\.
Accumulated content-dependent transport creates a finite mixing window whose boundary does not move with context length, so the model sees the same near/far regime at training and evaluation. However, distributional match is a *necessary* condition for extrapolation, not a sufficient one. The score-gap analysis (Section [3](https://arxiv.org/html/2606.24975#S3)) shows that mixing actually translates into small far attention mass, not just distributional similarity. The far-mass lower bound (Section [4](https://arxiv.org/html/2606.24975#S4)) shows that this suppression is eventually overwhelmed, explaining the degradation that distributional match alone does not predict. The value-side analysis (Section [5](https://arxiv.org/html/2606.24975#S5)) shows that rotating values provides additional protection beyond score-side suppression. Together these move from an intuition to quantitative bounds and structural limitations.
The random\-rotation experiments are useful because they instantiate the independence and spectral\-gap assumptions directly\. Their behavior therefore provides a controlled check of the stylized model, while the learned token\-rotation experiments test whether a trained model can exploit a related mechanism\.
### 8.2 Relation to other position methods
Position Interpolation (Chen et al., [2024](https://arxiv.org/html/2606.24975#bib.bib11)), NoPE (Kazemnejad et al., [2023](https://arxiv.org/html/2606.24975#bib.bib10)), Selective RoPE (Movahedi et al., [2026](https://arxiv.org/html/2606.24975#bib.bib17)), and standard RoPE all have deterministic per-step transport, so $\|\mathbb{E}[M_t]\|_{\mathrm{op}} = 1$ and the spectral-gap condition does not hold. Randomized RoPE (Ruoss et al., [2023](https://arxiv.org/html/2606.24975#bib.bib12)) also extrapolates; it randomizes position indices during training, exposing the model to large offsets. At inference time, however, positions are sequential and the per-step rotation is a deterministic function of $t$, so $\|\mathbb{E}[M_t]\|_{\mathrm{op}} = 1$: there is no inference-time mixing window. The RoPE blocks are also deterministic functions of the same position offset, so there is no cross-block independence of the kind used in the score concentration argument. The separate-path result in Section [D.1](https://arxiv.org/html/2606.24975#A4.SS1) applies to an explicitly random phase-perturbation model with independent value-side phases after score selection, not to randomized RoPE obtained only by sampling position indices.
Commutativity has an expressivity consequence: because the $\mathrm{SO}(2)$ blocks commute, the accumulated route angle is a *sum* of the intervening step angles. For the learned per-token variant, the route relation between source $j$ and query $i$ therefore depends on the multiset (bag) of intervening token identities, not on their order. This is strictly less expressive than PaTH’s accumulated Householder products, which generate all of $O(d)$ and can represent order-sensitive transformations. Commutativity is what enables the value-side decoherence analysis of Section [5](https://arxiv.org/html/2606.24975#S5): the additive angle structure allows per-block Fourier analysis and the leave-one-block decoupling that bounds far covariance.
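The order-insensitivity described above can be illustrated with a small sketch. It assumes a hypothetical lookup table `angle_of` that maps each token to a step angle, and it reuses the same numbers as reflection directions for a 2-D Householder comparison; none of these names or values come from the paper.

```python
import math


def route_angle(tokens, angle_of):
    """Accumulated SO(2) route angle: the sum of per-token step angles."""
    return sum(angle_of[t] for t in tokens)


def householder_2d(phi):
    """2x2 Householder reflection I - 2 v v^T with v = (cos phi, sin phi)."""
    c, s = math.cos(phi), math.sin(phi)
    return [[1 - 2 * c * c, -2 * c * s], [-2 * c * s, 1 - 2 * s * s]]


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Hypothetical per-token angles, for illustration only.
angle_of = {"a": 0.3, "b": 1.1, "c": -0.4}

# Same bag of tokens in two different orders: the route angles agree.
same_bag_1 = route_angle(["a", "b", "c"], angle_of)
same_bag_2 = route_angle(["c", "a", "b"], angle_of)

# Two Householder reflections composed in both orders: the products differ.
h_ab = matmul(householder_2d(angle_of["a"]), householder_2d(angle_of["b"]))
h_ba = matmul(householder_2d(angle_of["b"]), householder_2d(angle_of["a"]))
```

In two dimensions the product of two reflections is a rotation by twice the angle between their directions, so swapping the order flips the sign of that rotation; the summed $\mathrm{SO}(2)$ angle has no such dependence.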
The Forgetting Transformer (Lin et al., [2025](https://arxiv.org/html/2606.24975#bib.bib16)) adds data-dependent forget gates to the attention logits and also reports strong length extrapolation; like ALiBi, it operates on the score computation. Because accumulated rotations and score-side mechanisms operate on different parts of the attention computation, they are complementary and could in principle be combined. This is consistent with PaTH’s own strongest results, which combine accumulated Householder transformations with the Forgetting Transformer’s data-dependent score gates (PaTH-FoX) to achieve flat extrapolation.
#### Non\-normalizing activations\.
The far-mass lower bound in Section [4](https://arxiv.org/html/2606.24975#S4) is specific to full softmax normalization. It relies on the constraint $\sum_j \alpha_{ij} = 1$: with bounded logits, a growing far set contributes an increasingly large share of the denominator, eventually diluting the near-token mass. A non-normalizing activation such as $w_{ij} = \log(1 + e^{s_{ij}})$ or $w_{ij} = \sigma(s_{ij})$ removes this particular denominator effect, since each token’s raw weight is computed independently and adding far tokens does not change the raw weights of near tokens.
However, this does not by itself solve long-context degradation. It replaces mass dilution by a scale-control problem: many far tokens with individually small but nonzero weights can still produce a large aggregate far contribution. Thus the relevant quantities become the unnormalized analogues of total far weight and far-weight $\ell_2$ norm, $\sum_{j\in\mathcal{D}} w_{ij}$ and $\bigl(\sum_{j\in\mathcal{D}} w_{ij}^2\bigr)^{1/2}$, rather than softmax mass. Score-side decoherence would still push far scores downward, and value-side decoherence can still make surviving far contributions combine incoherently, but a separate argument would be needed to show that these unnormalized far-weight sums remain bounded as context grows. In practice, non-normalized attention therefore requires explicit scale control, such as normalization, learned thresholds, temperature/gating, sparsification, or other mechanisms that make the effective far-weight tail summable.
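The contrast between softmax dilution and the unnormalized scale problem can be made concrete with a stylized two-level score profile. The sketch below is illustrative only: the logit gap, the sigmoid scores, and the function names are assumptions chosen for the example, not quantities from the paper.

```python
import math


def softmax_near_mass(logit_gap, n_near, n_far):
    """Total softmax mass on near tokens when every near logit exceeds
    every far logit by exactly logit_gap (a stylized two-level profile)."""
    near = n_near * math.exp(logit_gap)
    return near / (near + n_far)


def sigmoid_weights(s_near, s_far, n_far):
    """Raw sigmoid weight of one near token and the aggregate far weight.
    The near weight does not depend on n_far; the far total grows with it."""
    def sigma(s):
        return 1.0 / (1.0 + math.exp(-s))
    return sigma(s_near), n_far * sigma(s_far)


for n_far in (10**3, 10**5, 10**7):
    near_mass = softmax_near_mass(logit_gap=8.0, n_near=1, n_far=n_far)
    w_near, far_total = sigmoid_weights(s_near=4.0, s_far=-4.0, n_far=n_far)
    print(f"M={n_far:>8}  softmax near mass={near_mass:.4f}  "
          f"sigmoid near weight={w_near:.4f}  aggregate far weight={far_total:.1f}")
```

With a fixed gap, the softmax near mass shrinks as the far set grows, while the sigmoid near weight stays fixed but the aggregate far weight grows linearly in the number of far tokens, which is the scale-control problem described above.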
### 8.3 Limitations and future work
The main limitation is that rotation-only score gaps do not control total far mass under full softmax over a growing context. The far-mass lower bound (Section [4](https://arxiv.org/html/2606.24975#S4)) makes this structural: without explicit far-mass control (sparsity, masking, distance bias, or a growing score gap), the near-signal coefficient eventually degrades. The experiments confirm this: rotation-only models degrade gradually at extreme lengths.
#### Future work\.
Several directions would extend the present analysis. *Learned noncommuting transformations:* Appendix [B.5](https://arxiv.org/html/2606.24975#A2.SS5) establishes score-side decoherence for any accumulated orthogonal transformation under spectral-gap assumptions. A direct analysis of learned PaTH transformations (verifying that the spectral-gap condition holds in practice) and, especially, value-side decoherence for noncommuting products remain open. *Mechanistic probes:* direct measurements of spectral gaps (empirical decay of $\|\mathbb{E}[P_n]\|_{\mathrm{op}}$ as a function of route length), attention-mass profiles as a function of distance, and value-side cancellation statistics in trained models would connect the model assumptions to observed transformer behavior. *Multi-layer theory:* extending the single-layer analysis to account for residual-stream position leakage across layers would close the gap between the theory’s length-stable predictions and the gradual degradation observed at extreme extrapolation.
## 9 Conclusion
PaTH Attention established that accumulated data-dependent transformations yield strong length extrapolation; this paper offers a mechanistic explanation. A stylized model of attention shows that accumulated rotations suppress far-token attention while preserving the near-window target signal. Score-side decoherence (Section [3](https://arxiv.org/html/2606.24975#S3)) holds for any accumulated orthogonal transformation with a spectral gap; idealized Householder-step distributions, including PaTH-like transformations, satisfy this condition. A far-mass lower bound (Section [4](https://arxiv.org/html/2606.24975#S4)) shows this suppression is eventually overwhelmed without explicit distance bias. In addition, we show that for $\mathrm{SO}(2)$ rotations, accumulated V rotations bound the residual far contribution. Controlled experiments support all three predictions: accumulated rotations substantially improve extrapolation over RoPE, rotating values improves over rotating only queries and keys, and rotation-only models degrade at extreme lengths while ALiBi remains flat (Section [7](https://arxiv.org/html/2606.24975#S7)).
## References
- K. Azuma (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal 19(3), pp. 357–367. Cited by: [§B.10](https://arxiv.org/html/2606.24975#A2.SS10.SSS0.Px1.2.p2.18).
- E. Bierstone and P. D. Milman (1988). Semianalytic and subanalytic sets. Publications Mathématiques de l’IHÉS 67, pp. 5–42. Cited by: [§B.5](https://arxiv.org/html/2606.24975#A2.SS5.4.p1.9).
- S. Boucheron, G. Lugosi, and P. Massart (2013). Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, Oxford, UK. Cited by: [§B.10](https://arxiv.org/html/2606.24975#A2.SS10.SSS0.Px1.2.p2.18), [§D.3](https://arxiv.org/html/2606.24975#A4.SS3.1.p1.6).
- S. Chen, S. Wong, L. Chen, and Y. Tian (2024). Extending context window of large language models via positional interpolation. In Conference on Empirical Methods in Natural Language Processing. Cited by: [§8.2](https://arxiv.org/html/2606.24975#S8.SS2.p1.3).
- P. Diaconis (1988). Group representations in probability and statistics. IMS Lecture Notes-Monograph Series, Vol. 11, Institute of Mathematical Statistics, Hayward, CA. Cited by: [§B.5](https://arxiv.org/html/2606.24975#A2.SS5.19.p7.7).
- I. Glazer, Y. I. Hendel, and S. Sodin (2026). Integrability of pushforward measures by analytic maps. Algebraic Geometry 13(2), pp. 154–192. Note: arXiv:2202.12446. Cited by: [§B.5](https://arxiv.org/html/2606.24975#A2.SS5.4.p1.9).
- W. Hoeffding (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301), pp. 13–30. Cited by: [§B.3](https://arxiv.org/html/2606.24975#A2.SS3.2.p2.3).
- A. Kazemnejad, I. Padhi, K. N. Ramamurthy, P. Das, and S. Reddy (2023). The impact of positional encoding on length generalization in transformers. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2606.24975#S1.p3.3), [§8.2](https://arxiv.org/html/2606.24975#S8.SS2.p1.3).
- M. Ledoux (2001). The concentration of measure phenomenon. Mathematical Surveys and Monographs, Vol. 89, American Mathematical Society, Providence, RI. Cited by: [§B.5](https://arxiv.org/html/2606.24975#A2.SS5.21.p2.16).
- Z. Lin, E. Nikishin, X. O. He, and A. Courville (2025). Forgetting transformer: softmax attention with a forget gate. In International Conference on Learning Representations. Cited by: [§8.2](https://arxiv.org/html/2606.24975#S8.SS2.p3.1).
- S. Movahedi, T. Carstensen, A. Afzal, F. Hutter, A. Orvieto, and V. Cevher (2026). Selective rotary position embedding. In International Conference on Learning Representations. Cited by: [§8.2](https://arxiv.org/html/2606.24975#S8.SS2.p1.3).
- O. Press, N. A. Smith, and M. Lewis (2022). Train short, test long: attention with linear biases enables input length extrapolation. In International Conference on Learning Representations. Cited by: [item 2](https://arxiv.org/html/2606.24975#S1.I1.i2.p1.1).
- A. Ruoss, G. Delétang, T. Genewein, J. Grau-Moya, R. Csordás, M. Bennani, S. Legg, and J. Veness (2023). Randomized positional encodings boost length generalization of transformers. In Annual Meeting of the Association for Computational Linguistics. Cited by: [§8.2](https://arxiv.org/html/2606.24975#S8.SS2.p1.3).
- J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu (2024). RoFormer: enhanced transformer with rotary position embedding. Neurocomputing 568, pp. 127063. Cited by: [§1](https://arxiv.org/html/2606.24975#S1.p1.1).
- R. Vershynin (2018). High-dimensional probability: an introduction with applications in data science. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, Cambridge, UK. Cited by: [§B.5](https://arxiv.org/html/2606.24975#A2.SS5.21.p2.16).
- S. Yang, Y. Shen, K. Wen, S. Tan, M. Mishra, L. Ren, R. Panda, and Y. Kim (2025). PaTH attention: position encoding via accumulating householder transformations. In Advances in Neural Information Processing Systems. External Links: 2505.16381. Cited by: [§1](https://arxiv.org/html/2606.24975#S1.p1.1).
## Appendix A Guide to the Formal Results
Figure [1](https://arxiv.org/html/2606.24975#A1.F1) summarizes the dependency structure of the formal results. Table [3](https://arxiv.org/html/2606.24975#A1.T3) gives the role and downstream use of each result. The proof chains mirror the main-text progression: Theorem 1 gives the $\mathrm{SO}(2)$ first-harmonic mixing window, Theorem 2 converts that window into a many-block score gap, and Proposition 1 converts the score gap into softmax far-mass and far-weight bounds over a finite candidate set. Lemmas 1–2 and Proposition 2 verify the stronger component-TV mixing condition needed for general orthogonal products, including idealized Householder reflections. Theorem 3 records first-moment decay for arbitrary orthogonal products, while Theorem 4 gives the stronger TV convergence required by Theorem 5; Corollary 1 then packages the general orthogonal score-gap and softmax-scaling result. Propositions 3–4 prove the complementary lower bound showing that bounded-logit full softmax eventually assigns asymptotically all mass to the growing far regime, forcing near-signal degradation without explicit far-mass control. Theorem 6 proves the same-path $\mathrm{SO}(2)$ value-side covariance bound, while Theorem 7 and Corollary 3 give the simpler separate-path value-side analogue. Finally, Lemma 3 and Corollary 2 establish near-signal preservation when near routes remain close to identity, and Lemma 4 gives a probabilistic sufficient condition for this alignment.
[Figure 1 diagram not reproduced. Node labels: Examples 1–5 (verify spectral gap); Lemmas 1–2 (TV-mixing criteria); Examples 6–8 (TV-mixing families); Thm. 1 ($\mathrm{SO}(2)$ mixing window); Prop. 2 (Householder TV mixing); Assump. 1 (first-moment gap); Thm. 3 (first-moment decay, conceptual); Assump. 2 (component-TV mixing); Thm. 2 ($\mathrm{SO}(2)$ score gap); Thm. 4 (TV convergence); Prop. 1 (softmax far-mass and $\ell_2$ bounds); Thm. 5 (orthogonal score separation); Cor. 1 (orthogonal softmax bounds); Score-side bounds ($\rho_{\mathcal{D}} \leq \rho_\star$, $\|\bm{\alpha}_{\mathcal{D}}\|_2 \leq a_\star$); Prop. 3 (far-mass lower bound); Prop. 4 (near-signal upper bound); Limitation (rotations alone eventually fail); Assump. 3 (same-path coupling); Def. 1 (shared-background far values); Thm. 6 (same-path Q/K/V covariance); Assump. 4 (separate V path); Thm. 7 (separate-path covariance); Cor. 3 (far-weight covariance bound); Lemma 4 (sub-Gaussian angles); Lemma 3 (near-route gain); Cor. 2 (near-signal preservation); Ex. 9 (value coherence); Ex. 10 (adaptive selection).]
Figure 1: Dependency graph for the formal results. Solid arrows denote formal use; dashed arrows denote motivation or conceptual support. The top half contains the score-side chains ($\mathrm{SO}(2)$ on the left, general orthogonal in the center) and the negative lower-bound chain on the right. The bottom half contains the value-side and near-signal preservation chains, all fed by the shared score-side bounds. Theorem 3 is shown as a conceptual branch; Theorem 4 supplies the TV mixing used by Theorem 5.
Table 3: Role and downstream use of each formal result.
## Appendix B Score-Side, Value-Side, and Signal-Preservation Proofs
This section gives the formal statements and proofs for the score-side ($\mathrm{SO}(2)$ and general orthogonal), far-mass lower bound, value-side, and signal-preservation results stated in the main text.
### B.1 Verification of Examples [1](https://arxiv.org/html/2606.24975#Thmexample1)–[5](https://arxiv.org/html/2606.24975#Thmexample5)
The following examples verify the first-moment gap used for finite-window decay (Theorem [1](https://arxiv.org/html/2606.24975#Thmtheorem1)).
###### Example [1](https://arxiv.org/html/2606.24975#Thmexample1).
$\bigl|\mathbb{E}[e^{i\theta}]\bigr| = \bigl|\frac{1}{2a}\int_{-a}^{a} e^{i\theta}\,d\theta\bigr| = |\sin(a)/a| < 1$ for $a > 0$. ∎
###### Example [2](https://arxiv.org/html/2606.24975#Thmexample2).
$\mathbb{E}[H_t] = \mathbb{E}[I - 2 v_t v_t^\top] = I - 2\,\mathbb{E}[v_t v_t^\top] = I - 2\Sigma_\nu$. The eigenvalues of $\Sigma_\nu$ lie in $[0,1]$ (it is the second-moment matrix of unit vectors, so it is positive semidefinite with $\operatorname{tr}(\Sigma_\nu) = 1$), and therefore $\|I - 2\Sigma_\nu\|_{\mathrm{op}} = \max_j |1 - 2\mu_j|$, where $\mu_j$ are the eigenvalues of $\Sigma_\nu$. This is less than one if and only if every $\mu_j \in (0,1)$. For $d \geq 2$, the upper constraint $\mu_j < 1$ follows from $\operatorname{tr}(\Sigma_\nu) = 1$ once all eigenvalues are positive, so the condition is equivalent to $\nu$ not being supported on a proper subspace. ∎
###### Example [3](https://arxiv.org/html/2606.24975#Thmexample3).
For a fixed plane $(p,q)$ and $\theta \sim \mathrm{Uniform}[-a,a]$, $\mathbb{E}[G_{pq}(\theta)] = I + \bigl(\tfrac{\sin a}{a} - 1\bigr)(e_p e_p^\top + e_q e_q^\top)$, since $\mathbb{E}[\cos\theta] = \sin(a)/a$ and $\mathbb{E}[\sin\theta] = 0$. Averaging over all $\binom{d}{2}$ planes: each basis vector $e_j$ appears in $d-1$ pairs, so $\mathbb{E}_{p,q}[e_p e_p^\top + e_q e_q^\top] = \frac{d-1}{\binom{d}{2}}\, I = \frac{2}{d}\, I$. Therefore $\mathbb{E}[M_t] = \bigl(1 - \tfrac{2(1 - \sin(a)/a)}{d}\bigr)\, I$. ∎
###### Example [4](https://arxiv.org/html/2606.24975#Thmexample4).
By the Rodrigues formula, a rotation by angle $\theta$ around axis $\hat{n}$ is $R = \cos\theta\, I_3 + (1 - \cos\theta)\, \hat{n}\hat{n}^\top + \sin\theta\, [\hat{n}]_\times$. For $\hat{n}$ uniform on $S^2$: $\mathbb{E}[\hat{n}\hat{n}^\top] = \tfrac{1}{3}\, I_3$ and $\mathbb{E}[[\hat{n}]_\times] = 0$. Hence $\mathbb{E}[R] = \tfrac{2\cos\theta + 1}{3}\, I_3$. ∎
###### Example [5](https://arxiv.org/html/2606.24975#Thmexample5).
Since $d$ is even and $A$ is nonsingular, $A$ has eigenvalues $\pm i\lambda_1, \ldots, \pm i\lambda_{d/2}$ with all $\lambda_j > 0$. In the eigenbasis, $\exp(\epsilon A)$ is block-diagonal with $2 \times 2$ rotation blocks of angle $\epsilon\lambda_j$. Applying Example [1](https://arxiv.org/html/2606.24975#Thmexample1) per block: $\|\mathbb{E}[\exp(\epsilon A)]\|_{\mathrm{op}} = \max_j |\sin(a\lambda_j)/(a\lambda_j)| < 1$. ∎
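Two of the closed forms above lend themselves to a quick numerical sanity check. The sketch below is an illustration with hypothetical function names: it estimates the first-harmonic magnitude of Example 1 by a midpoint rule, and it checks the plane-averaging step of Example 3 by summing the per-plane mean (taken as given from the example) over all $\binom{d}{2}$ planes.

```python
import math
from itertools import combinations


def uniform_phase_gap(a, n=20_000):
    """Midpoint-rule estimate of |E[exp(i*theta)]| for theta ~ Uniform[-a, a]."""
    h = 2.0 * a / n
    nodes = [-a + (m + 0.5) * h for m in range(n)]
    re = sum(math.cos(t) for t in nodes) / n
    im = sum(math.sin(t) for t in nodes) / n
    return math.hypot(re, im)


def mean_givens_diagonal(d, a):
    """Diagonal of E[M_t] when the plane (p, q) is uniform over all pairs.

    Uses the per-plane mean E[G_pq] = I + (sin(a)/a - 1)(e_p e_p^T + e_q e_q^T),
    which is diagonal, and averages it over the d-choose-2 planes.
    """
    s = math.sin(a) / a
    pairs = list(combinations(range(d), 2))
    diag = [0.0] * d
    for p, q in pairs:
        for j in range(d):
            diag[j] += (s if j in (p, q) else 1.0) / len(pairs)
    return diag
```

For any $a > 0$ the first function should agree with $|\sin(a)/a|$ up to quadrature error, and every entry returned by the second should match $1 - 2(1 - \sin(a)/a)/d$.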
### B.2 Finite Transport Window
###### Theorem 1 (Finite Stable Far Regime from Content-Dependent Accumulated Rotations).
Fix a query position $i$ and consider route lengths $n = i - j \geq 1$. In one rotation block, define the step phasor
$$H_t = \exp\{-i\psi_t\}, \qquad \psi_t = \omega + g(c_t),$$
using the value-side sign convention for the transport definition above. Assume the step phasors along the route are independent and satisfy the uniform first-harmonic bound
$$|\mathbb{E}H_t| \leq \beta < 1$$
for every position $t$. For any tolerance $\varepsilon_{\rm mix} \in (0,1)$, if $\beta = 0$, set $w_{\varepsilon_{\rm mix}} = 1$; every route of length at least one is already mixed. Otherwise assume $0 < \beta < 1$ and define
$$w_{\varepsilon_{\rm mix}} = \left\lceil \frac{\log(1/\varepsilon_{\rm mix})}{\log(1/\beta)} \right\rceil.$$
Let
$$e^{i\Theta_n} = \prod_{t=i-n}^{i-1} H_t.$$
Then every route of length $n \geq w_{\varepsilon_{\rm mix}}$ is in the far/decorrelated first-harmonic regime:
$$\left|\mathbb{E}e^{i\Theta_n}\right| \leq \varepsilon_{\rm mix}.$$
Routes with $n < w_{\varepsilon_{\rm mix}}$ are near routes. The boundary $w_{\varepsilon_{\rm mix}}$ depends only on the uniform first-harmonic gap and the tolerance, not on total context length. The i.i.d. content model is a special stationary case.
###### Proof.
By independence of the step phasors,
$$\mathbb{E}e^{i\Theta_n} = \prod_{t=i-n}^{i-1} \mathbb{E}H_t.$$
Taking absolute values gives
$$\left|\mathbb{E}e^{i\Theta_n}\right| \leq \prod_{t=i-n}^{i-1} |\mathbb{E}H_t| \leq \beta^n.$$
If $\beta = 0$, then $w_{\varepsilon_{\rm mix}} = 1$ and $\beta^n = 0$ for every route length $n \geq 1$. If $0 < \beta < 1$ and $n \geq w_{\varepsilon_{\rm mix}}$, then $n \log(1/\beta) \geq \log(1/\varepsilon_{\rm mix})$, so $\beta^n \leq \varepsilon_{\rm mix}$. Thus every route of length at least $w_{\varepsilon_{\rm mix}}$ has first-harmonic magnitude at most $\varepsilon_{\rm mix}$. Since $w_{\varepsilon_{\rm mix}}$ depends only on $\beta$ and $\varepsilon_{\rm mix}$, the boundary is independent of total context length. ∎
*Finite-harmonic extension.* The same argument bounds any fixed finite set of harmonics. If $|\mathbb{E}H_t^m| \leq \beta_m < 1$ uniformly for $m = 1, \ldots, q$, then
$$|\mathbb{E}e^{im\Theta_n}| \leq \beta_m^n.$$
A single finite window bounds all harmonics $m = 1, \ldots, q$ by taking the maximum of the corresponding windows.
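The window formula of Theorem 1 and its finite-harmonic extension are simple enough to compute directly. The following sketch uses hypothetical function names and, for the illustration only, assumes step angles uniform on $[-a, a]$ so that $\beta = |\sin(a)/a|$ as in Example 1.

```python
import math


def mixing_window(beta, eps_mix):
    """Window w from Theorem 1 for first-harmonic bound beta and tolerance eps_mix."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    if not 0.0 < eps_mix < 1.0:
        raise ValueError("eps_mix must lie in (0, 1)")
    if beta == 0.0:
        return 1  # every route of length at least one is already mixed
    return math.ceil(math.log(1.0 / eps_mix) / math.log(1.0 / beta))


def multi_harmonic_window(betas, eps_mix):
    """Single window covering several harmonics: the maximum per-harmonic window."""
    return max(mixing_window(b, eps_mix) for b in betas)


# Illustration (assumed setup): uniform step angles on [-a, a] with a = 1.
a = 1.0
beta = abs(math.sin(a) / a)
w = mixing_window(beta, eps_mix=0.01)
```

Neither function takes the context length as an argument, which is the point of the theorem: the near/far boundary is determined by $\beta$ and $\varepsilon_{\rm mix}$ alone.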
### B.3 Score-Side Gap
###### Theorem 2 (Many-Block Score Gap).
Fix a query position $i$. Suppose there is one near target route $r \to i$ such that, for every block $b$,
$$|\Theta_{r\to i,b}| \leq \delta, \qquad 0 \leq \delta < \pi/2. \tag{17}$$
Then
$$S_{r\to i} \geq \cos\delta. \tag{18}$$
Now let $j \to i$ be a far route. Assume that the block phases $\{\Theta_{j\to i,b}\}_{b=1}^{B}$ are independent across $b$, and that each block is in the $\varepsilon_{\rm sc}$-far first-harmonic regime:
$$\bigl|\mathbb{E}e^{i\Theta_{j\to i,b}}\bigr| \leq \varepsilon_{\rm sc}. \tag{19}$$
The block phases need not be identically distributed; the proof only uses independence across blocks and the uniform first-harmonic bound. Then, for every score threshold $s > \varepsilon_{\rm sc}$,
$$\Pr\bigl[S_{j\to i} \geq s\bigr] \leq \exp\!\biggl(-\frac{B(s - \varepsilon_{\rm sc})^2}{2}\biggr). \tag{20}$$
If $\mathcal{D}_i^{\rm cand}$ is a finite candidate set of far routes with $|\mathcal{D}_i^{\rm cand}| = M$, then
$$\Pr\Bigl[\max_{j\in\mathcal{D}_i^{\rm cand}} S_{j\to i} \geq s\Bigr] \leq M \exp\!\biggl(-\frac{B(s - \varepsilon_{\rm sc})^2}{2}\biggr). \tag{21}$$
Consequently, if $s < \cos\delta$, then with probability at least
$$1 - M \exp\!\biggl(-\frac{B(s - \varepsilon_{\rm sc})^2}{2}\biggr),$$
the near route has a normalized score gap of at least $\cos\delta - s$ over every route in the finite far candidate set.
###### Proof.
The near-route bound follows immediately from $|\Theta_{r\to i,b}| \leq \delta$, since $\cos\Theta_{r\to i,b} \geq \cos\delta$ for every block. Averaging over blocks gives $S_{r\to i} \geq \cos\delta$.
For a far route, define $X_b = \cos\Theta_{j\to i,b}$. Then $X_b \in [-1,1]$, and condition ([19](https://arxiv.org/html/2606.24975#A2.E19)) implies
$$\mathbb{E}X_b = \mathrm{Re}\,\mathbb{E}e^{i\Theta_{j\to i,b}} \leq \varepsilon_{\rm sc}.$$
The score is the block average $S_{j\to i} = \frac{1}{B}\sum_{b=1}^{B} X_b$, so $\mathbb{E}S_{j\to i} \leq \varepsilon_{\rm sc}$. Because the $X_b$ are independent across blocks and bounded in $[-1,1]$, Hoeffding’s inequality (Hoeffding, [1963](https://arxiv.org/html/2606.24975#bib.bib1)) gives
$$\Pr\bigl[S_{j\to i} - \mathbb{E}S_{j\to i} \geq s - \varepsilon_{\rm sc}\bigr] \leq \exp\!\biggl(-\frac{B(s - \varepsilon_{\rm sc})^2}{2}\biggr).$$
Since $\mathbb{E}S_{j\to i} \leq \varepsilon_{\rm sc}$, the event $\{S_{j\to i} \geq s\}$ is contained in the event on the left-hand side, which proves ([20](https://arxiv.org/html/2606.24975#A2.E20)). The maximum-over-$M$-routes bound ([21](https://arxiv.org/html/2606.24975#A2.E21)) follows by the union bound. The score-gap claim follows by combining ([18](https://arxiv.org/html/2606.24975#A2.E18)) and ([21](https://arxiv.org/html/2606.24975#A2.E21)). ∎
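The tail bounds (20) and (21) can be evaluated and compared against a small simulation. The sketch below is an illustration, not part of the paper: the function names are hypothetical, and the simulation assumes block phases that are i.i.d. uniform on $[-\pi, \pi]$, for which $\varepsilon_{\rm sc} = 0$.

```python
import math
import random


def far_score_tail_bound(B, s, eps_sc):
    """Hoeffding bound (20) on Pr[S >= s] for one far route, valid for s > eps_sc."""
    if s <= eps_sc:
        raise ValueError("the bound requires s > eps_sc")
    return math.exp(-B * (s - eps_sc) ** 2 / 2.0)


def max_far_score_bound(M, B, s, eps_sc):
    """Union bound (21) over M far candidates (vacuous whenever it exceeds 1)."""
    return M * far_score_tail_bound(B, s, eps_sc)


def simulate_far_score_tail(B, s, trials, seed=0):
    """Empirical Pr[S >= s] under the assumed uniform-phase model (eps_sc = 0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        score = sum(math.cos(rng.uniform(-math.pi, math.pi)) for _ in range(B)) / B
        hits += score >= s
    return hits / trials
```

Hoeffding's inequality is distribution-free, so the empirical frequency under a specific phase model is typically well below the bound; the comparison is meant as a consistency check rather than a tightness claim.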
### B.4 Score Gap and Softmax Scale
###### Proposition 1 (Score Gap and Softmax Scale Give Far-Mass and Far-Weight Bounds).
Let $\mathcal{S}$ be a near target-bearing set with $|\mathcal{S}|=K\geq 1$, and let $\mathcal{D}$ be a far-regime candidate set with $|\mathcal{D}|=M\leq M_{\max}$ and $M_{\max}\geq 1$. Suppose the normalized scores satisfy $S_{j\to i}\geq c_{\rm near}$ for every $j\in\mathcal{S}$, and $S_{k\to i}\leq c_{\rm far}$ for every $k\in\mathcal{D}$, with score gap $g=c_{\rm near}-c_{\rm far}>0$. Let the softmax logits be $\ell_{j\to i}=\lambda\,S_{j\to i}$, $\lambda>0$, and let $\alpha_{j}$ be the resulting full-softmax weights over $\mathcal{S}\cup\mathcal{D}$. Fix target bound levels $\rho_{\star}\in(0,1)$, $a_{\star}>0$. If

$$\lambda g\geq\max\!\left\{0,\;\log\!\left(\frac{M_{\max}(1-\rho_{\star})}{K\rho_{\star}}\right),\;\log\!\left(\frac{\sqrt{M_{\max}}}{Ka_{\star}}\right)\right\},\tag{22}$$

then $\rho_{\mathcal{D}}:=\sum_{k\in\mathcal{D}}\alpha_{k}\leq\rho_{\star}$ and $\|\bm{\alpha}_{\mathcal{D}}\|_{2}\leq a_{\star}$.

For the one-near-route specialization of Theorem [2](https://arxiv.org/html/2606.24975#Thmtheorem2), take $K=1$, $c_{\rm near}=\cos\delta$, and $c_{\rm far}=s$.
###### Proof\.
Let $\Delta=\lambda g$. Since $\ell_{j}-\ell_{k}\geq\Delta$ for every $j\in\mathcal{S}$, $k\in\mathcal{D}$, the far-to-near denominator ratio is at most $Me^{-\Delta}/K$. Because $x\mapsto x/(K+x)$ is increasing,

$$\rho_{\mathcal{D}}\leq\frac{Me^{-\Delta}}{K+Me^{-\Delta}}\leq\frac{M_{\max}e^{-\Delta}}{K+M_{\max}e^{-\Delta}}.$$

The first logarithmic condition in ([22](https://arxiv.org/html/2606.24975#A2.E22)) gives $\rho_{\mathcal{D}}\leq\rho_{\star}$. Also, for each $k\in\mathcal{D}$,

$$\alpha_{k}\leq\frac{e^{\lambda c_{\rm far}}}{Ke^{\lambda c_{\rm near}}}=\frac{e^{-\Delta}}{K}.$$

Thus

$$\|\bm{\alpha}_{\mathcal{D}}\|_{2}\leq\sqrt{M}\,e^{-\Delta}/K\leq\sqrt{M_{\max}}\,e^{-\Delta}/K\leq a_{\star},$$

where the last inequality is the second logarithmic condition in ([22](https://arxiv.org/html/2606.24975#A2.E22)). ∎
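Condition (22) can be read as a recipe for the softmax scale $\lambda$. The following sketch, with hypothetical function names and illustrative parameter values, computes the smallest admissible $\lambda g$ and checks both conclusions in the worst case allowed by the hypotheses (every near score exactly $c_{\rm near}$, every far score exactly $c_{\rm far}$).

```python
import math


def required_lambda_g(K, M_max, rho_star, a_star):
    """Smallest value of lambda * g allowed by condition (22)."""
    return max(
        0.0,
        math.log(M_max * (1.0 - rho_star) / (K * rho_star)),
        math.log(math.sqrt(M_max) / (K * a_star)),
    )


def far_mass_and_norm(near_scores, far_scores, lam):
    """Full-softmax far mass rho_D and far-weight norm ||alpha_D||_2."""
    logits = [lam * s for s in near_scores + far_scores]
    top = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    far = [w / total for w in weights[len(near_scores):]]
    return sum(far), math.sqrt(sum(a * a for a in far))


# Illustrative values (not from the paper).
K, M = 1, 1000
c_near, c_far = 0.9, 0.3
rho_star, a_star = 0.05, 0.01

g = c_near - c_far
lam = required_lambda_g(K, M, rho_star, a_star) / g
rho_D, alpha_norm = far_mass_and_norm([c_near] * K, [c_far] * M, lam)
print(f"lambda = {lam:.3f}, far mass = {rho_D:.4f}, far norm = {alpha_norm:.5f}")
```

In this worst case the far-mass condition is the binding one, so the far mass meets $\rho_{\star}$ with equality up to rounding while the far-weight norm has slack.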
### B.5 Score-Side Decoherence for General Orthogonal Products
This subsection proves score-side decoherence for accumulated products of i.i.d. random orthogonal matrices. The result applies to any step distribution on $O(d)$ satisfying a first-moment gap and a TV-mixing condition; idealized Householder-step distributions, including PaTH-like transformations, satisfy these conditions. The logical structure parallels the $\mathrm{SO}(2)$ proof: spectral gap $\to$ mixing window $\to$ concentration $\to$ score gap $\to$ softmax scaling. The algebraic machinery differs: $\mathrm{SO}(2)$ uses scalar Fourier analysis and Hoeffding concentration on $B=d/2$ independent blocks, while the general case uses random walks on $O(d)$ and Lévy concentration on $S^{d-1}$, giving a slightly tighter concentration rate of $(d-1)/2$ versus $d/4$ in the exponent.
The proofs below cover three determinant cases. If all steps have determinant $+1$, all products lie in $\mathrm{SO}(d)$ and converge to Haar on $\mathrm{SO}(d)$. If all steps have fixed determinant $-1$ (as in PaTH's Householder reflections), $\det(P_{n})=(-1)^{n}$ and the accumulated product is confined to a single determinant component of $O(d)$; the TV convergence target is Haar measure on that component. If the step law assigns positive probability to both determinant components (mixed sign), the products visit both components and converge to Haar on full $O(d)$. In all three cases, the score-side conclusion is the same, because every component of $O(d)$ acts transitively on the sphere.
###### Assumption 1 (First-Moment Orthogonal Gap).
Let $M_{t}\in O(d)$ be i.i.d. random orthogonal matrices. Define the first-moment parameter

$$\beta\;=\;\bigl\|\mathbb{E}[M_{t}]\bigr\|_{\mathrm{op}}.$$

Assume $\beta<1$.
###### Assumption 2 (Component TV Mixing).
Let $\mu$ be the law of $M_{t}$.
- (i) Fixed determinant $+1$. If $M_{t}\in\mathrm{SO}(d)$ a.s., assume that $\mu$ itself satisfies an $L^{2}$-smoothing and spectral-gap condition on $\mathrm{SO}(d)$: there exist an integer $r\geq 1$ and a constant $\rho\in(0,1)$ such that $\mu^{\ast r}$ has a density $h_{r}\in L^{2}(\mathrm{SO}(d))$ with respect to Haar measure on $\mathrm{SO}(d)$, and the Markov operator $T_{\mu}f(x)=\int_{\mathrm{SO}(d)}f(xg)\,d\mu(g)$ satisfies $\|T_{\mu}f\|_{L^{2}(\mathrm{SO}(d))}\leq\rho\,\|f\|_{L^{2}(\mathrm{SO}(d))}$ for every mean-zero $f\in L^{2}(\mathrm{SO}(d))$.
- (ii) Fixed determinant $-1$. If all steps share $\det M_{t}=-1$ (e.g., Householder reflections), let $\mu_{2}=\mu\ast\mu$ be the law of the even-step product $M_{1}M_{2}\in\mathrm{SO}(d)$. Assume that $\mu_{2}$ satisfies the $L^{2}$-smoothing and spectral-gap condition on $\mathrm{SO}(d)$ as in (i), with the Markov operator $T_{\mu_{2}}f(x)=\int_{\mathrm{SO}(d)}f(xg)\,d\mu_{2}(g)$.
- (iii) Mixed sign. If $\mu$ assigns positive probability to both $\mathrm{SO}(d)$ and $O^{-}(d)$, assume that $\mu$ satisfies the $L^{2}$-smoothing and spectral-gap condition on $O(d)$: there exist $r\geq 1$ and $\rho\in(0,1)$ such that $\mu^{\ast r}$ has a density $h_{r}\in L^{2}(O(d))$ with respect to Haar measure on $O(d)$, and $T_{\mu}f(x)=\int_{O(d)}f(xg)\,d\mu(g)$ satisfies $\|T_{\mu}f\|_{L^{2}(O(d))}\leq\rho\,\|f\|_{L^{2}(O(d))}$ for every mean-zero $f\in L^{2}(O(d))$.
The following two lemmas are used by both the Householder proposition and the heat\-kernel example below\.
###### Lemma 1 (Spread-out support criterion).
Let $\eta$ be a probability measure on $\mathrm{SO}(d)$ with an $L^{2}$ Haar density: $d\eta(g)=h(g)\,dg$, $h\in L^{2}(\mathrm{SO}(d))$. Suppose $I\in\mathrm{supp}(\eta)$ and $\mathrm{supp}(\eta)$ generates $\mathrm{SO}(d)$. Then $\|T_{\eta}\|_{L^{2}_{0}\to L^{2}_{0}}<1$, and $\eta$ satisfies Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2) with $r=1$.
Moreover, if $h(g)\geq m>0$ for Haar-a.e. $g$, the explicit bound $\|T_{\eta}\|_{L^{2}_{0}\to L^{2}_{0}}\leq 1-m$ holds.
###### Proof\.
Write $G=\mathrm{SO}(d)$. Since $h\in L^{2}(G)$, the operator $T_{\eta}$ acts on $L^{2}(G)$ with kernel $K(x,y)=h(x^{-1}y)$, and $\|K\|_{L^{2}(G\times G)}^{2}=\|h\|_{2}^{2}<\infty$, so $T_{\eta}$ is Hilbert–Schmidt, hence compact.
Assume for contradiction that $\|T_{\eta}\|_{L^{2}_{0}\to L^{2}_{0}}=1$. Since $T_{\eta}$ is compact, its norm is attained: there exists nonzero $f\in L^{2}_{0}(G)$ with $\|f\|_{2}=1$ and $\|T_{\eta}f\|_{2}=1$. Writing $R_{g}f(x)=f(xg)$, each $R_{g}$ is unitary and $T_{\eta}f=\int_{G}R_{g}f\,d\eta(g)$, so

$$1=\|T_{\eta}f\|_{2}=\Bigl\|\int_{G}R_{g}f\,d\eta(g)\Bigr\|_{2}\leq\int_{G}\|R_{g}f\|_{2}\,d\eta(g)=1.$$

Equality in the Hilbert-space triangle inequality forces $R_{g}f$ to be a common vector for $\eta$-a.e. $g$. By strong continuity of $g\mapsto R_{g}f$, this extends to every $g\in\mathrm{supp}(\eta)$. Since $I\in\mathrm{supp}(\eta)$, the common vector is $R_{I}f=f$, so $R_{g}f=f$ for all $g\in\mathrm{supp}(\eta)$. Because $\mathrm{supp}(\eta)$ generates $\mathrm{SO}(d)$, $f$ is right-invariant under all of $\mathrm{SO}(d)$, hence constant a.e. But $f\in L^{2}_{0}$ forces $f=0$, contradicting $\|f\|_{2}=1$.
For the explicit bound when $h\geq m>0$: if $m=1$, then $\eta$ is Haar measure and $T_{\eta}f=0$ for every $f\in L^{2}_{0}$. Otherwise write $\eta=m\,\mathrm{Haar}+(1-m)\,\nu^{\prime}$, where $d\nu^{\prime}(g)=(h(g)-m)\,dg/(1-m)$ is a probability measure. For $f\in L^{2}_{0}$, Haar-invariance gives $\int_{G}f(xg)\,dg=0$, so $T_{\eta}f=(1-m)\,T_{\nu^{\prime}}f$. Since $T_{\nu^{\prime}}$ is an average of unitaries, $\|T_{\nu^{\prime}}f\|_{2}\leq\|f\|_{2}$, and therefore $\|T_{\eta}f\|_{2}\leq(1-m)\|f\|_{2}$. ∎
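The explicit $1-m$ bound has a simple finite analogue that can be checked directly. The sketch below swaps $\mathrm{SO}(d)$ for the cyclic group $\mathbb{Z}_{n}$ (an assumption made purely for illustration; it is not the setting of the lemma), where the convolution operator is diagonalized by the discrete Fourier transform and its norm on mean-zero functions is the largest modulus among the nontrivial Fourier coefficients of $\eta$.

```python
import cmath
import random


def nontrivial_eigen_moduli(eta):
    """Moduli of the Fourier coefficients of eta on Z_n for k = 1..n-1.

    These are the eigenvalue moduli of f -> sum_g f(x + g) eta(g)
    restricted to mean-zero functions on Z_n.
    """
    n = len(eta)
    return [
        abs(sum(p * cmath.exp(2j * cmath.pi * k * g / n) for g, p in enumerate(eta)))
        for k in range(1, n)
    ]


def random_law_with_floor(n, m, seed=0):
    """eta = m * uniform + (1 - m) * nu', with nu' a random probability vector.

    The density of eta with respect to the uniform law is n * eta(g) >= m.
    """
    rng = random.Random(seed)
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [m / n + (1.0 - m) * r / total for r in raw]


n, m = 12, 0.3                      # hypothetical group size and density floor
eta = random_law_with_floor(n, m)
worst = max(nontrivial_eigen_moduli(eta))
print(f"largest nontrivial eigenvalue modulus {worst:.4f} <= 1 - m = {1.0 - m:.4f}")
```

The same decomposition used in the proof is visible here: the uniform part contributes nothing to the nontrivial Fourier coefficients, leaving at most the $(1-m)$ weight of the remainder.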
###### Lemma 2 (Analytic pushforward).
Let $M,N$ be compact real-analytic manifolds with $\dim N=n$, and let $F\colon M\to N$ be real-analytic. Let $a$ be a smooth density on $M$. Assume that on every connected component of $M$ that intersects $\mathrm{supp}(a)$, the differential $DF$ has rank $n$ at at least one point. Then the pushforward $F_{*}(a\,dm)$ is absolutely continuous with respect to smooth volume on $N$, with density in $L^{p}(N)$ for some $p>1$.
###### Proof.
On each connected component meeting $\mathrm{supp}(a)$, at least one $n\times n$ minor of $DF$ is a nonzero real-analytic function. Hence $F$ is generically a submersion on that component, outside a proper analytic set. By rectilinearization/monomialization of subanalytic maps (Bierstone and Milman, [1988](https://arxiv.org/html/2606.24975#bib.bib25)), the density of the pushforward is locally a finite sum of monomial-type singularities, each lying in $L^{1+\varepsilon}$ for some $\varepsilon>0$ (see also Glazer et al., [2026](https://arxiv.org/html/2606.24975#bib.bib26)). Compactness of $M$ gives a finite cover and therefore a uniform positive $\varepsilon$. Taking $p=1+\varepsilon$ proves the claim. ∎
###### Proposition 2 (Householder reflections satisfy component-TV mixing).
Let $d\geq 2$ and let $\nu$ be a probability distribution on $S^{d-1}$ with a smooth density bounded below by a positive constant. Let $H(v)=I-2vv^{\top}$ and let $\mu$ be the law of $H(v)$ for $v\sim\nu$. Then the Householder walk satisfies Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2).
###### Proof\.
Each $H(v)$ has determinant $-1$, so we work with the even-step law $\mu_{2}=\mu\ast\mu$ on $G=\mathrm{SO}(d)$. Let $n=\dim G=d(d-1)/2$.
*Step 1: Generation and aperiodicity.* By the Cartan–Dieudonné theorem, every element of $O(d)$ is a product of at most $d$ Householder reflections, and every element of $\mathrm{SO}(d)$ is a product of an even number. Since $\nu$ has full support on $S^{d-1}$, the semigroup generated by $\{H(u)H(v):u,v\in S^{d-1}\}$ is all of $\mathrm{SO}(d)$. Moreover, $I=H(v)H(v)\in\mathrm{supp}(\mu_{2})$ for every $v$, so $\mathrm{supp}(\mu_{2})$ is not contained in any coset of a proper closed subgroup.
*Step 2: $L^{p}$-smoothing via a full-rank product map.* Let $N=n=d(d-1)/2$ and consider the product map

$$\Phi_{N}\colon(S^{d-1})^{2N}\to\mathrm{SO}(d),\quad\Phi_{N}(v_{1},\ldots,v_{2N})=H(v_{1})H(v_{2})\cdots H(v_{2N}).$$

Then $\mu_{2}^{\ast N}$ is the pushforward of $\nu^{\otimes 2N}$ under $\Phi_{N}$. We exhibit a point where $D\Phi_{N}$ is surjective.
Enumerate the coordinate pairs $1\leq i<j\leq d$ as $(i_{1},j_{1}),\ldots,(i_{N},j_{N})$. For the $\ell$-th two-reflection block, set $v_{2\ell-1}=v_{2\ell}=e_{i_{\ell}}$. At this base point, every block is the identity: $H(e_{i_{\ell}})H(e_{i_{\ell}})=I$. Now vary only $v_{2\ell}$ in the tangent direction $e_{j_{\ell}}\in T_{e_{i_{\ell}}}S^{d-1}$. Using $DH_{a}[\xi]=-2(\xi a^{\top}+a\xi^{\top})$ for $\xi\perp a$, and the fact that all other blocks remain at the identity, the resulting tangent vector at $I\in\mathrm{SO}(d)$ is

$$H(e_{i_{\ell}})\,DH_{e_{i_{\ell}}}[e_{j_{\ell}}]=2(e_{i_{\ell}}e_{j_{\ell}}^{\top}-e_{j_{\ell}}e_{i_{\ell}}^{\top})=2A_{i_{\ell}j_{\ell}},$$

where $A_{ij}=e_{i}e_{j}^{\top}-e_{j}e_{i}^{\top}$ is the standard basis element of $\mathfrak{so}(d)$. Ranging over all $N$ pairs gives all $N$ basis elements, so $D\Phi_{N}$ is surjective at this point.
The domain $(S^{d-1})^{2N}$ is compact and connected for $d\geq 2$, and $\Phi_{N}$ is real-analytic. Since $D\Phi_{N}$ is surjective at the point above, Lemma [2](https://arxiv.org/html/2606.24975#Thmlemma2) applies to the smooth density of $\nu^{\otimes 2N}$. Therefore $\mu_{2}^{\ast N}$ has a density $h_{N}\in L^{p}(\mathrm{SO}(d))$ for some $p>1$.
*Step 3: Upgrade to $L^{2}$.* By Young's convolution inequality on the compact group $G$, if $h_{N}\in L^{p}$ with $p>1$, then the $m$-fold self-convolution $h_{N}^{\ast m}$ lies in $L^{p_{m}}$, where the integrability exponent satisfies $1/p_{m}=1-m(1-1/p)$ (as long as $1/p_{m}>0$). Choose the smallest $m$ with $m(1-1/p)\geq 1/2$; this keeps $1/p_{m}>0$ and gives $p_{m}\geq 2$. Since $G$ has finite Haar measure, $L^{p_{m}}(G)\subseteq L^{2}(G)$. Setting $r_{0}=Nm$, we obtain $\mu_{2}^{\ast r_{0}}=h_{r_{0}}\,dg$ with $h_{r_{0}}\in L^{2}(\mathrm{SO}(d))$.
*Step 4: Spectral gap.* The measure $\mu_{2}^{\ast r_{0}}$ has an $L^{2}$ Haar density (Step 3). By Step 1, $I\in\mathrm{supp}(\mu_{2})$ and $\mathrm{supp}(\mu_{2})$ generates $\mathrm{SO}(d)$; since $\mathrm{supp}(\mu_{2}^{\ast r_{0}})\supseteq\mathrm{supp}(\mu_{2})$, the same holds for $\mu_{2}^{\ast r_{0}}$. Lemma [1](https://arxiv.org/html/2606.24975#Thmlemma1) applied to $\eta=\mu_{2}^{\ast r_{0}}$ gives $\|T_{\mu_{2}}^{r_{0}}\|_{L^{2}_{0}\to L^{2}_{0}}<1$. Since $(H(v_{1})H(v_{2}))^{-1}=H(v_{2})H(v_{1})$ and $v_{1},v_{2}$ are i.i.d., $\mu_{2}$ is symmetric, so $T_{\mu_{2}}$ is self-adjoint on $L^{2}(G)$. For a self-adjoint operator, $\|T^{n}\|=\|T\|^{n}$, hence $\|T_{\mu_{2}}\|_{L^{2}_{0}\to L^{2}_{0}}=\|T_{\mu_{2}}^{r_{0}}\|_{L^{2}_{0}\to L^{2}_{0}}^{1/r_{0}}<1$, proving Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2). ∎
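The tangent-vector identity used in Step 2 of this proof, $H(e_{i})\,DH_{e_{i}}[e_{j}]=2A_{ij}$, can be verified by a central finite difference. The sketch below is a numerical check under illustrative choices (dimension $d=4$, step size $t=10^{-4}$, and helper names that are hypothetical); it perturbs the second reflection of one block along the sphere and compares the derivative with $2A_{ij}$ for every pair $i<j$.

```python
import math


def householder(v):
    """H(v) = I - 2 v v^T as a list-of-rows matrix."""
    d = len(v)
    return [[(1.0 if r == c else 0.0) - 2.0 * v[r] * v[c] for c in range(d)]
            for r in range(d)]


def matmul(X, Y):
    d = len(X)
    return [[sum(X[r][k] * Y[k][c] for k in range(d)) for c in range(d)]
            for r in range(d)]


def block(i, j, t, d):
    """H(e_i) H(v(t)) with v(t) = (e_i + t e_j) / |e_i + t e_j| on the sphere."""
    e_i = [1.0 if k == i else 0.0 for k in range(d)]
    norm = math.sqrt(1.0 + t * t)
    v = [(e_i[k] + (t if k == j else 0.0)) / norm for k in range(d)]
    return matmul(householder(e_i), householder(v))


def max_error(d, t=1e-4):
    """Largest entrywise gap between the finite-difference derivative and 2 A_ij."""
    worst = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            plus, minus = block(i, j, t, d), block(i, j, -t, d)
            for r in range(d):
                for c in range(d):
                    deriv = (plus[r][c] - minus[r][c]) / (2.0 * t)
                    target = 2.0 * ((r == i and c == j) - (r == j and c == i))
                    worst = max(worst, abs(deriv - target))
    return worst


err = max_error(4)
print(f"max deviation from 2*A_ij over all pairs: {err:.2e}")
```

Because the $d(d-1)/2$ matrices $A_{ij}$ are distinct basis elements of $\mathfrak{so}(d)$, matching each derivative to $2A_{ij}$ is exactly the surjectivity statement of Step 2 at the chosen base point.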
###### Example 6 (Heat-kernel-dithered orthogonal steps).
Let $G_{\tau,t}\sim q_{\tau}$ be independent draws from the heat kernel on $\mathrm{SO}(d)$ at time $\tau>0$, and let $L_{t}$ be any i.i.d. orthogonal steps (with a fixed determinant $\sigma\in\{+1,-1\}$ and law $\nu_{L}$), independent of $G_{\tau,t}$. Define $M_{t}=L_{t}\,G_{\tau,t}$. The heat kernel $q_{\tau}$ is smooth and strictly positive on compact $\mathrm{SO}(d)$, so $m_{\tau}:=\min_{g\in\mathrm{SO}(d)}q_{\tau}(g)>0$. If $\sigma=+1$, the one-step density satisfies $h(g)=\int q_{\tau}(\ell^{-1}g)\,d\nu_{L}(\ell)\geq m_{\tau}$, so Lemma [1](https://arxiv.org/html/2606.24975#Thmlemma1) gives Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2) with $\rho\leq 1-m_{\tau}$. If $\sigma=-1$, the one-step density on $O^{-}(d)$ is bounded below by $m_{\tau}$, so Remark [4](https://arxiv.org/html/2606.24975#Thmremark4) gives $\rho\leq 1-m_{\tau}^{2}$. The first-moment gap also holds: $\mathbb{E}[G_{\tau,t}]=a_{\tau}I$ with $0<a_{\tau}<1$ by conjugation-invariance of $q_{\tau}$, so $\|\mathbb{E}[M_{t}]\|_{\mathrm{op}}\leq a_{\tau}<1$.
In particular, taking $L_{t}=H(v_{t})$ (a Householder reflection) gives a smoothed Householder example with determinant $-1$. This does not assert that learned PaTH steps satisfy Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2); it shows that the assumption is satisfiable within a Householder-compatible family, and that the dither can be made arbitrarily small by taking $\tau\to 0$.
###### Example 7 (Matrix-Fisher law on $\mathrm{SO}(d)$).
For $A\in\mathbb{R}^{d\times d}$, the matrix-Fisher density on $\mathrm{SO}(d)$ is $h_{A}(Q)=Z_{A}^{-1}\exp(\mathrm{tr}(A^{\top}Q))$, where $Z_{A}=\int_{\mathrm{SO}(d)}\exp(\mathrm{tr}(A^{\top}Q))\,dQ$. Since $h_{A}$ is smooth and strictly positive on compact $\mathrm{SO}(d)$, it is bounded below by some $m_{A}>0$, so Lemma [1](https://arxiv.org/html/2606.24975#Thmlemma1) applies. The crude bound $|\mathrm{tr}(A^{\top}Q)|\leq\|A\|_{*}$ (nuclear norm) gives $h_{A}(Q)\geq e^{-2\|A\|_{*}}$, and hence, if $A\neq 0$, Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2) holds with $r=1$ and $\rho\leq 1-e^{-2\|A\|_{*}}<1$. If $A=0$, then $h_{A}\equiv 1$ is Haar measure, so $T_{\mu}$ maps every mean-zero function to zero; Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2) holds with any $\rho\in(0,1)$. The first-moment gap also holds. If $A=0$, then $\mathbb{E}[Q]=0$. If $A\neq 0$, the full support of $h_{A}$ implies that $Qx$ is not almost surely constant for any unit vector $x$, so $\|\mathbb{E}[Q]x\|<1$ by strict Jensen's inequality; compactness of the unit sphere gives $\|\mathbb{E}[Q]\|_{\mathrm{op}}<1$.
###### Example 8 (Local Lie-algebra noise).
Let $\mathfrak{so}(d)$ be the Lie algebra of $\mathrm{SO}(d)$. Choose $r>0$ small enough that the exponential map $\exp\colon B_{\mathfrak{so}(d)}(0,r)\to\mathrm{SO}(d)$ is a diffeomorphism onto its image. Let $X_{t}\in\mathfrak{so}(d)$ have a smooth density supported in $B(0,r)$ that is bounded and positive on a smaller ball $B(0,r_{0})$, and define $M_{t}=\exp(X_{t})$. Then the law $\mu$ of $M_{t}$ has an $L^{2}$ density with respect to Haar measure. Its support contains the open neighborhood $\exp(B(0,r_{0}))$ of the identity, so $I\in\mathrm{supp}(\mu)$ and, since $\mathrm{SO}(d)$ is connected, $\mathrm{supp}(\mu)$ generates all of $\mathrm{SO}(d)$. Lemma [1](https://arxiv.org/html/2606.24975#Thmlemma1) gives $\|T_{\mu}\|_{L^{2}_{0}\to L^{2}_{0}}<1$. Unlike the previous examples, the density need not be bounded below on all of $\mathrm{SO}(d)$; the non-explicit part of Lemma [1](https://arxiv.org/html/2606.24975#Thmlemma1) provides the spectral gap. The first-moment gap follows by the same Jensen argument as the matrix-Fisher case: the support of $\mu$ contains an open neighborhood of $I$, so $M_{t}x$ is not a.s. constant for any unit $x$, giving $\|\mathbb{E}[M_{t}]\|_{\mathrm{op}}<1$.
###### Theorem 3 (First-Moment Decay for Orthogonal Products).
Under Assumption [1](https://arxiv.org/html/2606.24975#Thmassumption1), for the accumulated product $P_{n}=M_{1}\cdots M_{n}$ of $n$ i.i.d. steps,

$$\bigl\|\mathbb{E}[P_{n}]\bigr\|_{\mathrm{op}}\;\leq\;\beta^{n}.$$

For any tolerance $\varepsilon\in(0,1)$, if $\beta=0$, set $w_{\varepsilon}^{(1)}=1$; every product of at least one step already has zero first moment. Otherwise ($0<\beta<1$), the first-moment mixing window is $w_{\varepsilon}^{(1)}=\lceil\log(1/\varepsilon)/\log(1/\beta)\rceil$.
###### Proof\.
By independence, $\mathbb{E}[P_{n}]=\prod_{t=1}^{n}\mathbb{E}[M_{t}]=(\mathbb{E}[M])^{n}$. By submultiplicativity of the operator norm, $\|(\mathbb{E}[M])^{n}\|_{\mathrm{op}}\leq\|\mathbb{E}[M]\|_{\mathrm{op}}^{n}=\beta^{n}$. If $\beta=0$, then $\beta^{n}=0$ for every $n\geq 1$. If $0<\beta<1$, setting $\beta^{n}\leq\varepsilon$ and solving gives $n\geq\log(1/\varepsilon)/\log(1/\beta)$. ∎
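As a concrete reading of the first-moment window, the sketch below evaluates $w_{\varepsilon}^{(1)}$ for an assumed step law: Householder reflections $H(v)=I-2vv^{\top}$ with $v$ uniform on $S^{d-1}$. For that law $\mathbb{E}[vv^{\top}]=I/d$, so $\mathbb{E}[H(v)]=(1-2/d)I$ and $\beta=|1-2/d|$. The uniform law and the function names are illustrative assumptions; learned steps need not behave this way.

```python
import math


def first_moment_window(eps, beta):
    """w_eps^(1) from Theorem 3: 1 if beta == 0, else ceil(log(1/eps)/log(1/beta))."""
    if not (0.0 < eps < 1.0):
        raise ValueError("eps must lie in (0, 1)")
    if not (0.0 <= beta < 1.0):
        raise ValueError("beta must lie in [0, 1)")
    if beta == 0.0:
        return 1
    return math.ceil(math.log(1.0 / eps) / math.log(1.0 / beta))


def uniform_householder_beta(d):
    """beta = ||E[I - 2 v v^T]||_op = |1 - 2/d| for v uniform on S^{d-1} (assumed law)."""
    return abs(1.0 - 2.0 / d)


eps = 0.01
for d in (2, 8, 64):
    beta = uniform_householder_beta(d)
    w = first_moment_window(eps, beta)
    print(f"d = {d:3d}: beta = {beta:.5f}, first-moment window = {w}")
```

The window depends only on $\varepsilon$ and $\beta$, never on the context length, which is the point of the theorem. Under this assumed law it grows roughly linearly in $d$, since $\log(1/\beta)\approx 2/d$ for large $d$.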
###### Theorem 4 (Total Variation Convergence).
Under Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2), let $O^{+}(d)=\mathrm{SO}(d)$ and $O^{-}(d)=\{Q\in O(d):\det Q=-1\}$.
- (i) Fixed determinant. If every step matrix has $\det M_{t}=\sigma\in\{+1,-1\}$, write $\pi_{n}=\sigma^{n}$ and let $\mathrm{Haar}_{\pi_{n}}$ denote normalized Haar measure on $O^{\pi_{n}}(d)$. Then there exist $C>0$ and $\beta_{\mathrm{TV}}\in(0,1)$ such that

  $$d_{\mathrm{TV}}\!\bigl(\mathrm{law}(P_{n}),\;\mathrm{Haar}_{\pi_{n}}\bigr)\;\leq\;C\,\beta_{\mathrm{TV}}^{n}\qquad\forall\,n\geq 0.$$
- (ii) Mixed sign. If $\mu$ assigns positive probability to both $\mathrm{SO}(d)$ and $O^{-}(d)$, let $\mathrm{Haar}_{O(d)}$ denote normalized Haar measure on $O(d)$. Then there exist $C>0$ and $\beta_{\mathrm{TV}}\in(0,1)$ such that

  $$d_{\mathrm{TV}}\!\bigl(\mathrm{law}(P_{n}),\;\mathrm{Haar}_{O(d)}\bigr)\;\leq\;C\,\beta_{\mathrm{TV}}^{n}\qquad\forall\,n\geq 0.$$

In both cases, the TV mixing window $w_{\varepsilon}^{\mathrm{TV}}=\lceil\log(C/\varepsilon)/\log(1/\beta_{\mathrm{TV}})\rceil$ is finite and independent of context length.
###### Proof\.
*Density-evolution convention.* For a probability measure $\nu$ on a compact group $G$, define the *density-evolution operator* $K_{\nu}f(x)=\int_{G}f(xg^{-1})\,d\nu(g)$. If $\eta$ has Haar density $h$, then $\eta\ast\nu$ has Haar density $K_{\nu}h$. Equivalently $K_{\nu}=T_{\nu}^{*}=T_{\check{\nu}}$, where $\check{\nu}(A)=\nu(A^{-1})$. Since $T_{\nu}^{*}$ and $T_{\nu}$ have the same operator norm, $\|K_{\nu}\|_{L^{2}_{0}\to L^{2}_{0}}=\|T_{\nu}\|_{L^{2}_{0}\to L^{2}_{0}}$. Thus the spectral-gap assumption on $T_{\nu}$ gives the same bound for the density-evolution operator $K_{\nu}$.
*Case (i): fixed $\sigma=-1$.* Each step has determinant $-1$, so $\det(P_{n})=(-1)^{n}$ almost surely and $P_{n}\in O^{\pi_{n}}(d)$. The only possible Haar limit is therefore $\mathrm{Haar}_{\pi_{n}}$.
*Even steps.* For even $n=2m$, $\mathrm{law}(P_{2m})=\mu_{2}^{\ast m}$ as a measure on $\mathrm{SO}(d)$. Let $h_{r}=d\mu_{2}^{\ast r}/d\mathrm{Haar}_{+}$. By the $L^{2}$-smoothing and spectral-gap condition (Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2)(ii)), for $m\geq r$ the density of $\mu_{2}^{\ast m}$ with respect to $\mathrm{Haar}_{+}$ satisfies

$$\Bigl\|\frac{d\mu_{2}^{\ast m}}{d\mathrm{Haar}_{+}}-1\Bigr\|_{L^{2}}=\|K_{\mu_{2}}^{m-r}(h_{r}-1)\|_{L^{2}}\leq\rho^{m-r}\,\|h_{r}-1\|_{L^{2}}.$$

(The bound uses $\|K_{\mu_{2}}\|_{L^{2}_{0}}=\|T_{\mu_{2}}\|_{L^{2}_{0}}$ from the density-evolution convention above.) By Cauchy–Schwarz, $d_{\mathrm{TV}}(\mu_{2}^{\ast m},\,\mathrm{Haar}_{+})\leq\tfrac{1}{2}\,\rho^{m-r}\,\|h_{r}-1\|_{L^{2}}$. Increasing the constant to cover $m<r$ gives $d_{\mathrm{TV}}(\mathrm{law}(P_{2m}),\,\mathrm{Haar}_{+})\leq C_{2}\,\rho^{m}$.
*Odd steps.* For odd $n=2m+1$, $\mathrm{law}(P_{2m+1})=\mu\ast\mu_{2}^{\ast m}$. Convolution contracts total variation: $d_{\mathrm{TV}}(\mu\ast\mu_{2}^{\ast m},\;\mu\ast\mathrm{Haar}_{+})\leq d_{\mathrm{TV}}(\mu_{2}^{\ast m},\,\mathrm{Haar}_{+})$. Since every element in the support of $\mu$ has determinant $-1$, left multiplication of $\mathrm{Haar}_{+}$ by such an element gives $\mathrm{Haar}_{-}$. Hence $\mu\ast\mathrm{Haar}_{+}=\mathrm{Haar}_{-}$, and $d_{\mathrm{TV}}(\mathrm{law}(P_{2m+1}),\,\mathrm{Haar}_{-})\leq C_{2}\,\rho^{m}$.
Taking $\beta_{\mathrm{TV}}=\sqrt{\rho}$ and adjusting the constant gives $d_{\mathrm{TV}}(\mathrm{law}(P_{n}),\,\mathrm{Haar}_{\pi_{n}})\leq C\,\beta_{\mathrm{TV}}^{n}$ for all $n$.
*Case (i): fixed $\sigma=+1$.* All products lie in $\mathrm{SO}(d)$. The argument above simplifies: no parity splitting is needed, and Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2)(i) gives the spectral gap for $T_{\mu}$, hence the same bound for $K_{\mu}$.
*Case (ii): mixed sign.* Assumption [2](https://arxiv.org/html/2606.24975#Thmassumption2)(iii) gives the $L^{2}$-smoothing and spectral-gap condition for $T_{\mu}$ on all of $O(d)$. Using the density-evolution operator $K_{\mu}$ with $\|K_{\mu}\|_{L^{2}_{0}\to L^{2}_{0}}=\|T_{\mu}\|_{L^{2}_{0}\to L^{2}_{0}}\leq\rho$, for $n\geq r$,

$$\Bigl\|\frac{d\mu^{\ast n}}{d\mathrm{Haar}_{O(d)}}-1\Bigr\|_{L^{2}}=\|K_{\mu}^{n-r}(h_{r}-1)\|_{L^{2}}\;\leq\;\rho^{n-r}\,\|h_{r}-1\|_{L^{2}},$$

and Cauchy–Schwarz converts this to the stated TV bound (cf. Diaconis, [1988](https://arxiv.org/html/2606.24975#bib.bib22), Chapter 3). ∎
###### Theorem 5 (Orthogonal Score Separation).
Let $d\geq 2$. Fix deterministic unit vectors $q,k\in\mathbb{R}^{d}$ with $\|q\|=\|k\|=1$. Equivalently, the same bounds hold conditionally on any sigma-field independent of $P_{n}$ with respect to which $q$ and $k$ are measurable.
- (a) Near route. If $\|P_{j\to i}-I_{d}\|_{\mathrm{op}}\leq\gamma$ with $\gamma<1$, then

  $$q^{\top}P_{j\to i}\,k\;\geq\;q^{\top}k\;-\;\gamma.$$
- (b) Far route. For every $s\geq 0$ and every $n\geq w_{\varepsilon}^{\mathrm{TV}}$ (the TV mixing window of Theorem [4](https://arxiv.org/html/2606.24975#Thmtheorem4)),

  $$\Pr\!\bigl[q^{\top}P_{n}\,k\geq s\bigr]\;\leq\;2\exp\!\Bigl(-\frac{(d-1)\,s^{2}}{2}\Bigr)+\varepsilon.$$
- (c) Union bound. For $M$ far candidates,

  $$\Pr\!\Bigl[\max_{j}\,q^{\top}P_{j\to i}\,k\geq s\Bigr]\;\leq\;2M\exp\!\Bigl(-\frac{(d-1)\,s^{2}}{2}\Bigr)+M\varepsilon.$$
###### Proof\.
Part (a). $q^{\top}P_{j\to i}\,k=q^{\top}k+q^{\top}(P_{j\to i}-I_{d})\,k\geq q^{\top}k-\|P_{j\to i}-I_{d}\|_{\mathrm{op}}\,\|q\|\,\|k\|\geq q^{\top}k-\gamma$.
Part (b). Fix $s\geq 0$. By Theorem [4](https://arxiv.org/html/2606.24975#Thmtheorem4) and the coupling characterization of total variation distance, there exists a joint distribution $(P_{n},U_{n})$ where $U_{n}$ follows the appropriate Haar limit (Haar on $O^{\pi_{n}}(d)$ in the fixed-determinant case, or Haar on $O(d)$ in the mixed-sign case) and $\Pr[P_{n}\neq U_{n}]\leq\varepsilon$ (using $n\geq w_{\varepsilon}^{\mathrm{TV}}$). In every case, $U_{n}k$ is uniformly distributed on $S^{d-1}$: $\mathrm{SO}(d)$, $O^{-}(d)$, and $O(d)$ all act transitively on the sphere and preserve surface measure. The function $f(x)=q^{\top}x$ is $1$-Lipschitz on $S^{d-1}$ with $\mathbb{E}[q^{\top}U_{n}k]=0$. By Lévy's lemma (concentration on the sphere; see Ledoux ([2001](https://arxiv.org/html/2606.24975#bib.bib23), Chapter 3) or Vershynin ([2018](https://arxiv.org/html/2606.24975#bib.bib24), Theorem 5.1.4)),

$$\Pr\!\bigl[q^{\top}U_{n}\,k\geq s\bigr]\;\leq\;2\exp\!\Bigl(-\frac{(d-1)\,s^{2}}{2}\Bigr).$$

Combining with the coupling bound:

$$\begin{aligned}\Pr\!\bigl[q^{\top}P_{n}\,k\geq s\bigr]&\leq\Pr\!\bigl[q^{\top}U_{n}\,k\geq s\bigr]+\Pr[P_{n}\neq U_{n}]\\&\leq 2\exp\!\Bigl(-\frac{(d-1)\,s^{2}}{2}\Bigr)+\varepsilon.\end{aligned}$$

Part (c). Apply a union bound over $M$ far routes, each satisfying the bound in part (b). ∎
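The far-route bound in part (b) can be illustrated by simulation. The sketch below assumes Householder steps with reflection vectors drawn uniformly from $S^{d-1}$ (an idealized law chosen for illustration; the helper names and the values of $d$, $n$, $s$ are hypothetical) and starts from the worst alignment $q=k$, where the untransported score equals $1$. It compares the empirical tail of $q^{\top}P_{n}k$ with the Lévy term $2\exp(-(d-1)s^{2}/2)$; the additional $\varepsilon$ term of the theorem only makes the true bound looser.

```python
import math
import random


def random_unit_vector(rng, d):
    """Uniform point on S^{d-1} via a normalized Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def transported_score(rng, q, k, n):
    """q^T P_n k for a product P_n of n i.i.d. Householder reflections.

    Each reflection acts on the running vector as x <- x - 2 (v . x) v.
    """
    x = list(k)
    for _ in range(n):
        v = random_unit_vector(rng, len(x))
        dot = sum(a * b for a, b in zip(v, x))
        x = [a - 2.0 * dot * b for a, b in zip(x, v)]
    return sum(a * b for a, b in zip(q, x))


def levy_bound(d, s):
    """Haar-limit tail term from part (b): 2 exp(-(d - 1) s^2 / 2)."""
    return 2.0 * math.exp(-(d - 1) * s * s / 2.0)


def empirical_tail(d, n, s, trials=1000, seed=0):
    rng = random.Random(seed)
    e1 = [1.0] + [0.0] * (d - 1)      # q = k = e_1: score 1 before any transport
    hits = sum(transported_score(rng, e1, e1, n) >= s for _ in range(trials))
    return hits / trials


d, n, s = 16, 60, 0.8
tail = empirical_tail(d, n, s)
bound = levy_bound(d, s)
print(f"empirical Pr[q^T P_n k >= {s}] = {tail:.4f}, Levy term = {bound:.4f}")
```

Setting `n = 0` in `transported_score` returns the aligned score $1$, which makes the contrast with the transported case explicit: after enough steps the score of an initially perfectly aligned pair is almost never large.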
###### Corollary 1 (Orthogonal Score Gap and Softmax Scaling).
Fix deterministic unit vectors $q,k\in\mathbb{R}^{d}$. Let $\mathcal{S}$ be a near target-bearing set with $|\mathcal{S}|=K\geq 1$, and let $\mathcal{D}$ be a finite far candidate set with $|\mathcal{D}|=M\leq M_{\max}$. Assume every near route $j\in\mathcal{S}$ satisfies $\|P_{j\to i}-I_{d}\|_{\mathrm{op}}\leq\gamma$ for some $\gamma<1$. Choose $s\geq 0$ and $\varepsilon\in(0,1)$ such that every far route has length at least $w_{\varepsilon}^{\mathrm{TV}}$, and define

$$c_{\rm near}=q^{\top}k-\gamma,\quad c_{\rm far}=s,\quad g=c_{\rm near}-c_{\rm far}.$$

Assume $g>0$. Then, with probability at least $1-\delta_{\rm orth}$, where

$$\delta_{\rm orth}=2M\exp\!\Bigl(-\frac{(d-1)\,s^{2}}{2}\Bigr)+M\varepsilon,$$

all near scores are at least $c_{\rm near}$ and all far scores are at most $c_{\rm far}$. Consequently, if the softmax logits are $\ell_{j}=\lambda S_{j}$ and

$$\lambda g\geq\max\!\left\{0,\;\log\!\left(\frac{M_{\max}(1-\rho_{\star})}{K\rho_{\star}}\right),\;\log\!\left(\frac{\sqrt{M_{\max}}}{Ka_{\star}}\right)\right\},$$

then, on the same event, $\rho_{\mathcal{D}}\leq\rho_{\star}$ and $\|\bm{\alpha}_{\mathcal{D}}\|_{2}\leq a_{\star}$.
###### Proof\.
The near-route lower bound follows from Theorem [5](https://arxiv.org/html/2606.24975#Thmtheorem5)(a), giving $S_j \geq q^{\top}k - \gamma$ for $j \in \mathcal{S}$. The far-route upper bound follows from Theorem [5](https://arxiv.org/html/2606.24975#Thmtheorem5)(c), with failure probability $\delta_{\rm orth}$. On the resulting event, the score gap is at least $g = c_{\rm near} - c_{\rm far} > 0$. Proposition [1](https://arxiv.org/html/2606.24975#Thmproposition1) then gives the stated far-mass and far-weight bounds. ∎
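To make the scaling condition of Corollary 1 concrete, the sketch below evaluates $\delta_{\rm orth}$ and the smallest softmax scale $\lambda$ that satisfies the displayed inequality, then evaluates the far mass and far $\ell_2$ weight in one boundary configuration (every near score exactly $c_{\rm near}$, every far score exactly $c_{\rm far}$). The helper names and the boundary configuration are our own illustrative assumptions; they are not taken from the paper.

```python
import math

def orthogonal_failure_prob(M, d, s, eps):
    """delta_orth = 2 M exp(-(d-1) s^2 / 2) + M eps, as in Corollary 1."""
    return 2.0 * M * math.exp(-(d - 1) * s**2 / 2.0) + M * eps

def required_lambda(g, M_max, K, rho_star, a_star):
    """Smallest lambda with lambda * g >= the maximum in Corollary 1."""
    if g <= 0:
        raise ValueError("the score gap g must be positive")
    need = max(
        0.0,
        math.log(M_max * (1.0 - rho_star) / (K * rho_star)),
        math.log(math.sqrt(M_max) / (K * a_star)),
    )
    return need / g

def boundary_far_stats(lam, g, M, K):
    """Far mass and far l2 weight when all K near logits equal lam * c_near
    and all M far logits equal lam * c_far (a boundary case, assumed)."""
    z = K * math.exp(lam * g) + M      # normalizer after factoring out exp(lam * c_far)
    return M / z, math.sqrt(M) / z
```

For example, with `g=0.5`, `M_max=1000`, `K=2`, `rho_star=0.05`, and `a_star=0.01`, the far-mass term is the binding one, and `boundary_far_stats` evaluated at the resulting `lam` returns a far mass equal to `rho_star` up to rounding.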
### B.6 Full Softmax Lower Bound
The bounded-logit assumption in the next proposition is a length-independent-logit abstraction. In the score model used in Theorem [2](https://arxiv.org/html/2606.24975#Thmtheorem2), it follows immediately from the normalized cosine score:
$$S_{j\to i} = \frac{1}{B}\sum_{b=1}^{B}\cos\Theta_{j\to i,b} \in [-1,1], \qquad \ell_{j\to i} = \lambda S_{j\to i} \in [-\lambda,\lambda].$$
For an ordinary attention head, the same kind of bound follows under explicit norm assumptions. If
$$\ell_{ij} = \frac{q_i^{\top}k_j}{\sqrt{d_k}} + b_{ij}, \qquad \|q_i\| \leq R_q, \quad \|k_j\| \leq R_k, \quad |b_{ij}| \leq B_b,$$
then
$$|\ell_{ij}| \leq \frac{R_q R_k}{\sqrt{d_k}} + B_b.$$
Orthogonal Q/K transport preserves the query and key norms, so it does not invalidate this bound. The proposition below is therefore not a theorem about all trained transformer logits; it is the full-softmax case in which the near/far logit gap does not grow with the number of far candidates.
###### Proposition 3 (Full Bounded Softmax Assigns Asymptotically All Mass to the Far Regime).
Let $\mathcal{S}$ be a near target-bearing set with $|\mathcal{S}| = K$, and let $\mathcal{D}_L$ be a far-regime set with $|\mathcal{D}_L| = M_L$. Suppose attention weights are computed by full vanilla softmax over $\mathcal{S} \cup \mathcal{D}_L$:
$$\alpha_j = \frac{e^{\ell_j}}{\sum_{m \in \mathcal{S} \cup \mathcal{D}_L} e^{\ell_m}}.$$
Assume that the logits are uniformly bounded: $-\lambda \leq \ell_j \leq \lambda$ for all $j \in \mathcal{S} \cup \mathcal{D}_L$, where $\lambda < \infty$ is independent of $L$. Define the total far-regime mass $\rho_{\mathcal{D}} = \sum_{j \in \mathcal{D}_L}\alpha_j$. Then
$$\rho_{\mathcal{D}} \geq \frac{M_L\,e^{-\lambda}}{K e^{\lambda} + M_L\,e^{-\lambda}} = \frac{M_L}{K e^{2\lambda} + M_L}.$$
Consequently, if $K$ and $\lambda$ are fixed and $M_L \to \infty$, then $\rho_{\mathcal{D}} \to 1$.
###### Proof\.
Let $F = \sum_{j \in \mathcal{D}_L} e^{\ell_j}$ and $N = \sum_{j \in \mathcal{S}} e^{\ell_j}$. Then $F \geq M_L e^{-\lambda}$ and $N \leq K e^{\lambda}$. Since $F/(F+N)$ is increasing in $F$ and decreasing in $N$,
$$\rho_{\mathcal{D}} = \frac{F}{F+N} \geq \frac{M_L\,e^{-\lambda}}{M_L\,e^{-\lambda} + K e^{\lambda}} = \frac{M_L}{M_L + K e^{2\lambda}}.$$
As $M_L \to \infty$ with $K$ and $\lambda$ fixed, $M_L/(M_L + K e^{2\lambda}) \to 1$. ∎
The same denominator effect holds under any fixed finite near/far logit gap: if $\ell_j \leq \ell_{\star}$ on $\mathcal{S}$ and $\ell_k \geq \ell_{\star} - \Delta$ on a far set of size $M_L$, then
$$\rho_{\mathcal{D}} \geq \frac{M_L e^{-\Delta}}{K + M_L e^{-\Delta}} \to 1 \quad \text{as } M_L \to \infty.$$
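The denominator effect in Proposition 3 is easy to see numerically. The sketch below is an illustration under our own assumptions (near logits pinned at the upper bound $\lambda$, far logits drawn uniformly from $[-\lambda,\lambda]$, hypothetical function names): it computes the realized softmax far mass and the lower bound $M_L/(Ke^{2\lambda}+M_L)$ for growing far sets.

```python
import numpy as np

def far_mass_lower_bound(M, K, lam):
    """Lower bound M / (K e^{2 lam} + M) from Proposition 3."""
    return M / (K * np.exp(2.0 * lam) + M)

def softmax_far_mass(near_logits, far_logits):
    """Total softmax mass assigned to the far logits."""
    logits = np.concatenate([near_logits, far_logits])
    w = np.exp(logits - logits.max())      # numerically stable softmax
    w /= w.sum()
    return float(w[len(near_logits):].sum())

def demo(K=4, lam=3.0, sizes=(10, 1_000, 100_000), seed=0):
    rng = np.random.default_rng(seed)
    rows = []
    for M in sizes:
        near = np.full(K, lam)             # near logits at the upper bound (assumed)
        far = rng.uniform(-lam, lam, M)    # arbitrary bounded far logits (assumed)
        rows.append((M, softmax_far_mass(near, far),
                     float(far_mass_lower_bound(M, K, lam))))
    return rows
```

Each row returned by `demo()` is `(M_L, realized far mass, lower bound)`; with $K$ and $\lambda$ held fixed, both quantities move toward 1 as $M_L$ grows.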
### B.7 Far-Mass Near-Signal Upper Bound
###### Proposition 4 (Universal Near-Signal Upper Bound from Far-Mass Leakage).
Let $\rho_{\mathcal{D}} = \sum_{j \in \mathcal{D}}\alpha_j$ be the total far-regime mass, so that $\sum_{j \in \mathcal{S}}\alpha_j = 1 - \rho_{\mathcal{D}}$. For arbitrary orthogonal value transports $P_{j\to i} \in O(d)$,
$$B_{\mathcal{S},i}^{\top}B_{\mathcal{S},i} \preceq (1-\rho_{\mathcal{D}})^{2} I_d. \tag{23}$$
If, in addition, the weights are produced by full softmax with $K$ near target-bearing candidates, $M_L$ far candidates, and logits bounded in $[-\lambda,\lambda]$, then
$$B_{\mathcal{S},i}^{\top}B_{\mathcal{S},i} \preceq \Bigl(\frac{K e^{2\lambda}}{K e^{2\lambda} + M_L}\Bigr)^{2} I_d. \tag{24}$$
For fixed $K$ and $\lambda$, the right-hand side tends to zero as $M_L \to \infty$.
###### Proof\.
For any $x$, $\|B_{\mathcal{S},i}\,x\| \leq \sum_{j \in \mathcal{S}}\alpha_j\,\|P_{j\to i}\,x\| = (1-\rho_{\mathcal{D}})\,\|x\|$. Hence
$$x^{\top}B_{\mathcal{S},i}^{\top}B_{\mathcal{S},i}\,x \leq (1-\rho_{\mathcal{D}})^{2}\|x\|^{2},$$
which proves ([23](https://arxiv.org/html/2606.24975#A2.E23)). For full softmax with bounded logits, Proposition [3](https://arxiv.org/html/2606.24975#Thmproposition3) gives $\rho_{\mathcal{D}} \geq M_L/(K e^{2\lambda} + M_L)$, that is, $1 - \rho_{\mathcal{D}} \leq K e^{2\lambda}/(K e^{2\lambda} + M_L)$. Substituting this into the first bound gives the second bound. ∎
### B.8 Position-Only Value Coherence
###### Example 9 (RoPE-Style Deterministic Value Transport and Arithmetic Coherence).
Assume $d = 2B$. Consider deterministic RoPE-style value transport with route length $k = i - j$:
$$P_{i-k\to i} = \operatorname{diag}\bigl(R(-\omega_1 k),\ldots,R(-\omega_B k)\bigr),$$
where each $R(\theta)$ is the two-dimensional rotation by angle $\theta$. In the shared-background far-value model, $v_j = c_0 G_{\rm com} + w_j$, with $G_{\rm com} \sim \mathcal{N}(0, I_d)$, $w_j \sim \mathcal{N}(0, \sigma_w^{2} I_d)$, and all variables independent, define
$$\bm{T}_{\mathcal{D}} = \sum_{k \in \mathcal{D}_L}\alpha_k\,P_{i-k\to i}.$$
For each block $b$, define the deterministic far-coherence coefficient
$$q_{b,L} = \Bigl|\sum_{k \in \mathcal{D}_L}\alpha_k\,e^{-i\omega_b k}\Bigr|,$$
and define $q_L = \min_{1 \leq b \leq B} q_{b,L}$. Then
$$\Delta_{\mathcal{D}} \succeq \bigl(c_0^{2}q_L^{2} + \sigma_w^{2}\|\bm{\alpha}_{\mathcal{D}}\|_2^{2}\bigr) I_d. \tag{25}$$
###### Proof\.
In block $b$, the far transport contribution is $\sum_{k \in \mathcal{D}_L}\alpha_k\,R(-\omega_b k)$. With $\zeta_{b,L} = \sum_{k \in \mathcal{D}_L}\alpha_k e^{-i\omega_b k}$, this weighted sum of planar rotations is $|\zeta_{b,L}|$ times a rotation, so the $b$-th block of $\bm{T}_{\mathcal{D}}$ has operator norm $q_{b,L} = |\zeta_{b,L}|$, and $\bm{T}_{\mathcal{D}}\bm{T}_{\mathcal{D}}^{\top}$ has $b$-th block equal to $q_{b,L}^{2} I_2$. Thus $\bm{T}_{\mathcal{D}}\bm{T}_{\mathcal{D}}^{\top} \succeq q_L^{2} I_d$. In the shared-background model, $\Delta_{\mathcal{D}} = c_0^{2}\,\bm{T}_{\mathcal{D}}\bm{T}_{\mathcal{D}}^{\top} + \sigma_w^{2}\|\bm{\alpha}_{\mathcal{D}}\|_2^{2} I_d$. Hence $\Delta_{\mathcal{D}} \succeq (c_0^{2}q_L^{2} + \sigma_w^{2}\|\bm{\alpha}_{\mathcal{D}}\|_2^{2})\,I_d$. ∎
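The coherence coefficient $q_{b,L}$ can be computed directly for a given set of far weights. The following sketch assumes RoPE-style frequencies and weights supplied by the caller; the function names are hypothetical. It also forms the $2\times 2$ block of $\bm{T}_{\mathcal{D}}\bm{T}_{\mathcal{D}}^{\top}$ explicitly, so the identity "block equals $q_{b,L}^{2} I_2$" used in the proof can be verified numerically.

```python
import numpy as np

def far_coherence(alpha, routes, omega):
    """q_{b,L} = | sum_k alpha_k exp(-i omega_b k) | for every block b."""
    phases = np.exp(-1j * np.outer(omega, routes))   # shape (B, number of far routes)
    return np.abs(phases @ alpha)

def rotation(theta):
    """Two-dimensional rotation matrix R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def block_gram(alpha, routes, omega_b):
    """b-th 2x2 block of T_D T_D^T for the transport R(-omega_b k)."""
    T = sum(a * rotation(-omega_b * k) for a, k in zip(alpha, routes))
    return T @ T.T
```

As a usage example, take `omega = 10000.0 ** (-2 * np.arange(4) / 8)` (a 0-based block index, which is an assumption), `routes = np.arange(64, 512)`, and uniform far weights summing to 0.2. A block with $\omega_b = 0$ gives $q_{b,L} = \rho_{\mathcal{D}}$ exactly, which is the fully coherent case.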
### B.9 Shared-Background Distant-Value Model
###### Definition 1 (Shared-Background Distant-Value Model).
This model explains why the analysis uses a spectral covariance condition and why value transport can improve the ordinary weighted-sum aggregation.
Suppose far-regime values have the form
$$v_j = c_0\,G_{\rm com} + w_j, \qquad j \in \mathcal{D}, \tag{26}$$
where $G_{\rm com} \sim \mathcal{N}(0, I_d)$ is a shared zero-mean component and the $w_j \sim \mathcal{N}(0, \sigma_w^{2} I_d)$ are idiosyncratic components. Conditional on the realized aggregation environment, $G_{\rm com}$ and the $w_j$'s are independent of the realized score variables, route phasors, weights, and transports, with the $w_j$'s conditionally independent across $j$. Equivalently, the conditional covariance identity displayed below is assumed to hold. That covariance identity is part of the shared-background model assumption; it is not derived merely from conditioning on the aggregation environment. Then
$$e_{\mathcal{D}} = \sum_{j \in \mathcal{D}}\alpha_{ij}\,P_{j\to i}\,v_j = c_0\Bigl(\sum_{j \in \mathcal{D}}\alpha_{ij}\,P_{j\to i}\Bigr)G_{\rm com} + \sum_{j \in \mathcal{D}}\alpha_{ij}\,P_{j\to i}\,w_j.$$
Define
$$\bm{T}_{\mathcal{D}} = \sum_{j \in \mathcal{D}}\alpha_{ij}\,P_{j\to i}. \tag{27}$$
Conditioned on the realized transports and weights,
$$\Delta_{\mathcal{D}} = c_0^{2}\,\bm{T}_{\mathcal{D}}\,\bm{T}_{\mathcal{D}}^{\top} + \sigma_w^{2}\,\|\bm{\alpha}_{\mathcal{D}}\|_2^{2}\,I_d.$$
The crude deterministic bound is
$$\|\bm{T}_{\mathcal{D}}\|_{\mathrm{op}} \leq \sum_{j \in \mathcal{D}}\alpha_{ij} = \rho_{\mathcal{D}},$$
and hence
$$\|\Delta_{\mathcal{D}}\|_{\mathrm{op}} \leq c_0^{2}\rho_{\mathcal{D}}^{2} + \sigma_w^{2}\,\|\bm{\alpha}_{\mathcal{D}}\|_2^{2}.$$
This bound does not use phase mixing; it holds for arbitrary orthogonal transports and therefore also for identity transport. For identity value transport, $P_{j\to i} = I_d$ for every far token, so
$$\bm{T}_{\mathcal{D}} = \rho_{\mathcal{D}}\,I_d, \qquad \Delta_{\mathcal{D}} = c_0^{2}\rho_{\mathcal{D}}^{2}\,I_d + \sigma_w^{2}\|\bm{\alpha}_{\mathcal{D}}\|_2^{2}\,I_d.$$
Thus ordinary value summation leaves the shared far component fully coherent.
The value-side covariance theorem below gives a sharper transport-specific bound for the actual nested interval products generated by route transport. The price of the shared suffix dependence is the factor $1/(1-\beta)$ in the fluctuation term.
*Prefix-product radius.* To keep the following statements readable, define
$$\mathcal{R}_{\rm pp}(\mu,\rho,a;\eta) = \mu\rho + \frac{4a}{1-\beta}\sqrt{\log\frac{4B}{\eta}}, \tag{28}$$
where $\mu$ is the remaining first-harmonic mean term. In the raw prefix-product bound $\mu = \beta^{w}$; after choosing the mixing window one may use $\mu = \varepsilon_{\rm mix}$.
### B.10 Nested Q/K/V Route Phases
###### Assumption 3 (Route-Phase Score/Value Coupling).
Fix a query and a finite active route set $\mathcal{K} = \mathcal{K}_{\mathcal{S}} \dot{\cup} \mathcal{K}_{\mathcal{D}}$, where $\mathcal{K}_{\mathcal{D}} \subseteq \{k : k \geq w\}$ is the far route set. The set $\mathcal{K}$ and its partition into $\mathcal{K}_{\mathcal{S}}, \mathcal{K}_{\mathcal{D}}$ are either deterministic or measurable with respect to a sigma-field independent of the route phasors $\{H_{\ell,b}\}_{\ell,b}$ (see Example [10](https://arxiv.org/html/2606.24975#Thmexample10) for why adaptive selection can invalidate the cancellation argument). In each block $b$, let
$$\Pi_{k,b} = \prod_{\ell=1}^{k} H_{\ell,b}, \qquad |H_{\ell,b}| = 1,$$
and assume that the same route phasors enter the matched Q/K score and the value rotation. The phasors $\{H_{\ell,b}\}_{\ell,b}$ are independent over positions and blocks and satisfy
$$|\mathbb{E}H_{\ell,b}| \leq \beta < 1.$$
The logits are
$$\ell_k = \frac{\lambda}{B}\sum_{b=1}^{B}\mathrm{Re}\,\Pi_{k,b}, \qquad k \in \mathcal{K}.$$
Let $\alpha_k$ be the softmax weights over $\mathcal{K}$, and define
$$\rho_{\mathcal{D}} = \sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k, \qquad \|\bm{\alpha}_{\mathcal{D}}\|_2 = \Bigl(\sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k^{2}\Bigr)^{1/2}.$$
Let $\mathcal{S}_{\rm sc}$ be the score-bound event
$$\mathcal{S}_{\rm sc} = \{\rho_{\mathcal{D}} \leq \rho_{\star},\; \|\bm{\alpha}_{\mathcal{D}}\|_2 \leq a_{\star}\}.$$
Define the far value operator
$$\bm{T}_{\mathcal{D}}^{\rm same} = \sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k P_k,$$
where block $b$ of $P_k$ acts as multiplication by $\Pi_{k,b}$.
###### Theorem 6 (Far-Covariance Bound for Nested Q/K/V Rotations).
Assume the route-phase score/value coupling of Assumption [3](https://arxiv.org/html/2606.24975#Thmassumption3). For any $\eta \in (0,1)$, define
$$R_{\rm same} = e^{2\lambda/B}\,\mathcal{R}_{\rm pp}(\beta^{w},\rho_{\star},a_{\star};\eta) + (e^{2\lambda/B} - 1)\rho_{\star}. \tag{29}$$
Then there is an event $\mathcal{S}_{\rm val}$ such that
$$\Pr(\mathcal{S}_{\rm val}) \geq 1 - \eta$$
and, on $\mathcal{S}_{\rm sc} \cap \mathcal{S}_{\rm val}$,
$$\|\bm{T}_{\mathcal{D}}^{\rm same}\|_{\mathrm{op}} \leq R_{\rm same}.$$
Consequently, in the shared-background far-value model satisfying the conditional covariance identity in Definition [1](https://arxiv.org/html/2606.24975#Thmdefinition1),
$$\Delta_{\mathcal{D}} = c_0^{2}\,\bm{T}_{\mathcal{D}}^{\rm same}(\bm{T}_{\mathcal{D}}^{\rm same})^{\top} + \sigma_w^{2}\|\bm{\alpha}_{\mathcal{D}}\|_2^{2} I_d$$
satisfies
$$\Delta_{\mathcal{D}} \preceq \bar{\delta}_{\rm same}^{2} I_d, \qquad \bar{\delta}_{\rm same}^{2} = c_0^{2}R_{\rm same}^{2} + \sigma_w^{2}a_{\star}^{2}. \tag{30}$$
If $\mathcal{S}_{\rm sc}$ itself holds with probability at least $1 - \delta_{\rm sc}$, then the covariance bound holds with probability at least $1 - \delta_{\rm sc} - \eta$.
#### Proof idea\.
Section [5](https://arxiv.org/html/2606.24975#S5) describes the leave-one-block decoupling strategy. The operator norm of the block-diagonal far-value transport equals the maximum over blockwise phasor sums. For each block $b$, the proxy weights (formed by removing block $b$ from every logit) are independent of the block-$b$ value phases, so a martingale/Azuma bound controls the block-$b$ phasor sum. A union bound over blocks and a comparison between proxy and true weights gives the operator-norm bound. This argument is specific to block-diagonal rotations; extending it to noncommuting orthogonal transports would require a different concentration approach.
###### Proof\.
Set $h = \lambda/B$. For block $b$, define the leave-one-block logit
$$\ell_k^{(-b)} = h\sum_{q \neq b}\mathrm{Re}\,\Pi_{k,q},$$
and let $\alpha_k^{(-b)}$ be its softmax weights. The softmax defining $\bm{\alpha}^{(-b)}$ is taken over the same full active route set $\mathcal{K} = \mathcal{K}_{\mathcal{S}} \dot{\cup} \mathcal{K}_{\mathcal{D}}$ as the original softmax; only block $b$'s contribution to each logit is removed. Since $\mathrm{Re}\,\Pi_{k,b} \in [-1,1]$, restoring block $b$ multiplies both the numerator and the denominator of the softmax by a factor in $[e^{-h}, e^{h}]$. Thus, for every $k$,
$$e^{-2h}\alpha_k^{(-b)} \leq \alpha_k \leq e^{2h}\alpha_k^{(-b)}. \tag{31}$$
Condition on all blocks except $b$. Then $\alpha_k^{(-b)}$ is fixed, while $\{\Pi_{k,b} : k \in \mathcal{K}_{\mathcal{D}}\}$ is a nested prefix-product family. We now prove the needed one-block bound. All probabilities in this paragraph are conditional on the fixed external information and on the other blocks. Suppress the block index and write $H_{\ell} = H_{\ell,b}$, $m_{\ell} = \mathbb{E}[H_{\ell}]$, and $\Pi_k = \prod_{\ell=1}^{k} H_{\ell}$. Let
$$u_b = \sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k^{(-b)}\,\Pi_{k,b}.$$
For the filtration $\mathcal{F}_{\ell} = \sigma(H_1,\ldots,H_{\ell})$, define $D_{\ell} = \Pi_{\ell} - m_{\ell}\Pi_{\ell-1}$ with $\Pi_0 = 1$. Then $\mathbb{E}[D_{\ell} \mid \mathcal{F}_{\ell-1}] = 0$ and $|D_{\ell}| \leq 2$. Iterating $\Pi_{\ell} = m_{\ell}\Pi_{\ell-1} + D_{\ell}$ gives
$$\Pi_k = \prod_{q=1}^{k} m_q + \sum_{\ell=1}^{k}\Bigl(\prod_{q=\ell+1}^{k} m_q\Bigr)D_{\ell}.$$
Therefore
$$u_b - \mathbb{E}u_b = \sum_{\ell \geq 1} c_{\ell}D_{\ell}, \qquad c_{\ell} = \sum_{\substack{k \in \mathcal{K}_{\mathcal{D}} \\ k \geq \ell}}\alpha_k^{(-b)}\prod_{q=\ell+1}^{k} m_q.$$
Since $|m_q| \leq \beta$,
$$|c_{\ell}| \leq \sum_{\substack{k \in \mathcal{K}_{\mathcal{D}} \\ k \geq \ell}}\alpha_k^{(-b)}\beta^{k-\ell}.$$
Thus the sequence $(|c_{\ell}|)_{\ell}$ is bounded by the convolution of $(\alpha_k^{(-b)})_k$ with the one-sided kernel $(\beta^{n})_{n \geq 0}$. Young's convolution inequality gives
$$\Bigl(\sum_{\ell \geq 1}|c_{\ell}|^{2}\Bigr)^{1/2} \leq \Bigl(\sum_{k \in \mathcal{K}_{\mathcal{D}}}(\alpha_k^{(-b)})^{2}\Bigr)^{1/2}\sum_{n \geq 0}\beta^{n} = \frac{a_{\mathcal{D}}^{(-b)}}{1-\beta},$$
where
$$a_{\mathcal{D}}^{(-b)} = \Bigl(\sum_{k \in \mathcal{K}_{\mathcal{D}}}(\alpha_k^{(-b)})^{2}\Bigr)^{1/2}.$$
The complex martingale sum $\sum_{\ell \geq 1} c_{\ell}D_{\ell}$ has real and imaginary parts with increments bounded by $2|c_{\ell}|$. The Azuma–Hoeffding inequality (Azuma ([1967](https://arxiv.org/html/2606.24975#bib.bib2)); Boucheron et al. ([2013](https://arxiv.org/html/2606.24975#bib.bib3))) gives
$$\Pr\{|u_b - \mathbb{E}u_b| \geq r\} \leq 4\exp\biggl(-\frac{r^{2}(1-\beta)^{2}}{16\,(a_{\mathcal{D}}^{(-b)})^{2}}\biggr).$$
With $r = \frac{4a_{\mathcal{D}}^{(-b)}}{1-\beta}\sqrt{\log(4B/\eta)}$, this probability is at most $\eta/B$. Also,
$$|\mathbb{E}u_b| \leq \sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k^{(-b)}\beta^{k} \leq \beta^{w}\rho_{\mathcal{D}}^{(-b)},$$
where
$$\rho_{\mathcal{D}}^{(-b)} = \sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k^{(-b)}.$$
Thus, with probability at least $1 - \eta/B$,
$$|u_b| \leq \mathcal{R}_{\rm pp}(\beta^{w},\rho_{\mathcal{D}}^{(-b)},a_{\mathcal{D}}^{(-b)};\eta).$$
Define
$$E_b = \left\{\left|\sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k^{(-b)}\,\Pi_{k,b}\right| \leq \mathcal{R}_{\rm pp}(\beta^{w},\rho_{\mathcal{D}}^{(-b)},a_{\mathcal{D}}^{(-b)};\eta)\right\},$$
and let
$$\mathcal{S}_{\rm val} = \bigcap_{b=1}^{B} E_b.$$
For this fixed block $b$, the bound above is conditional on all blocks $q \neq b$. Since the conditional failure probability is at most $\eta/B$ for every realization of those conditioned variables, the tower property gives $\Pr(E_b^{c}) \leq \eta/B$ unconditionally. A union bound over $b = 1,\ldots,B$ gives $\Pr(\mathcal{S}_{\rm val}) \geq 1 - \eta$.
On $\mathcal{S}_{\rm sc}$, ([31](https://arxiv.org/html/2606.24975#A2.E31)) gives $\rho_{\mathcal{D}}^{(-b)} \leq e^{2h}\rho_{\star}$ and $a_{\mathcal{D}}^{(-b)} \leq e^{2h}a_{\star}$. Since $\mathcal{R}_{\rm pp}$ is increasing and jointly linear in its arguments $(\rho, a)$, the leave-one-block contribution is at most $e^{2h}\mathcal{R}_{\rm pp}(\beta^{w},\rho_{\star},a_{\star};\eta)$. Let
$$t_b = \sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k\,\Pi_{k,b}.$$
The same comparison also gives $|\alpha_k - \alpha_k^{(-b)}| \leq (e^{2h} - 1)\alpha_k$, and hence
$$|t_b - u_b| \leq \sum_{k \in \mathcal{K}_{\mathcal{D}}}|\alpha_k - \alpha_k^{(-b)}| \leq (e^{2h} - 1)\rho_{\mathcal{D}} \leq (e^{2h} - 1)\rho_{\star}.$$
Combining the bounds yields $|t_b| \leq R_{\rm same}$ for every block $b$ on $\mathcal{S}_{\rm sc} \cap \mathcal{S}_{\rm val}$. Since the value operator is block diagonal, $\|\bm{T}_{\mathcal{D}}^{\rm same}\|_{\mathrm{op}} = \max_b |t_b|$. The covariance bound follows from $\bm{T}_{\mathcal{D}}^{\rm same}(\bm{T}_{\mathcal{D}}^{\rm same})^{\top} \preceq R_{\rm same}^{2} I_d$ and $\|\bm{\alpha}_{\mathcal{D}}\|_2 \leq a_{\star}$ on $\mathcal{S}_{\rm sc}$. The final probability statement is the union bound with the score-bound event. ∎
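The weight comparison (31) is a deterministic inequality, so it can be checked on any realization of the route phasors. The sketch below is an illustration only: it draws unit-modulus step phasors with angles from $\mathrm{Uniform}(-1,1)$ (our assumption), builds the nested prefix products $\Pi_{k,b}$, and measures the largest value of $|\log\alpha_k - \log\alpha_k^{(-b)}|$ over all routes and blocks, which should not exceed $2h = 2\lambda/B$. The function names are hypothetical.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def leave_one_block_ratio(lam=4.0, B=8, n_routes=200, seed=0):
    """Return (max |log alpha_k - log alpha_k^{(-b)}|, 2h) for random phasors."""
    rng = np.random.default_rng(seed)
    h = lam / B
    # Unit-modulus step phasors H_{l,b}; the angle law is an assumption.
    H = np.exp(1j * rng.uniform(-1.0, 1.0, size=(n_routes, B)))
    Pi = np.cumprod(H, axis=0)                   # row k-1 holds Pi_{k,b}
    cos_terms = Pi.real                          # Re Pi_{k,b} lies in [-1, 1]
    total = cos_terms.sum(axis=1)
    alpha = softmax(h * total)                   # true softmax weights
    worst = 0.0
    for b in range(B):
        proxy = softmax(h * (total - cos_terms[:, b]))   # block b removed
        gap = np.abs(np.log(alpha) - np.log(proxy)).max()
        worst = max(worst, float(gap))
    return worst, 2.0 * h
```

The first returned value staying below the second is exactly the two-sided bound $e^{-2h}\alpha_k^{(-b)} \leq \alpha_k \leq e^{2h}\alpha_k^{(-b)}$ on this realization.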
### B.11 Near-Signal Gain
###### Lemma 3 (Near-Route Closeness Gives Near-Signal Gain).
Let $\Pi_U$ denote the orthogonal projector onto $\mathrm{col}(U)$, and let $\rho_{\mathcal{D}} = \sum_{j \in \mathcal{D}_i}\alpha_{ij}$, so that $\sum_{j \in \mathcal{S}_i}\alpha_{ij} = 1 - \rho_{\mathcal{D}}$. Suppose that, for every $j \in \mathcal{S}_i$,
$$\|(P_{j\to i} - I_d)\,\Pi_U\|_{\mathrm{op}} \leq \gamma, \qquad 0 \leq \gamma < 1. \tag{32}$$
Then, for all $a \in \mathbb{R}^{r}$,
$$\|B_{\mathcal{S},i}\,Ua\| \geq (1-\rho_{\mathcal{D}})(1-\gamma)\,\|Ua\|. \tag{33}$$
Consequently,
$$U^{\top}B_{\mathcal{S},i}^{\top}\,B_{\mathcal{S},i}\,U \succeq (1-\rho_{\mathcal{D}})^{2}(1-\gamma)^{2}\,U^{\top}U.$$
###### Proof\.
For each $j \in \mathcal{S}_i$,
$$P_{j\to i}\,Ua = Ua + (P_{j\to i} - I_d)\,Ua,$$
and, since $\Pi_U Ua = Ua$, condition (32) gives $\|(P_{j\to i} - I_d)\,Ua\| \leq \gamma\,\|Ua\|$. Therefore
$$B_{\mathcal{S},i}\,Ua = \sum_{j \in \mathcal{S}_i}\alpha_{ij}\,P_{j\to i}\,Ua = (1-\rho_{\mathcal{D}})\,Ua + \sum_{j \in \mathcal{S}_i}\alpha_{ij}\,(P_{j\to i} - I_d)\,Ua.$$
The error term has norm at most $(1-\rho_{\mathcal{D}})\,\gamma\,\|Ua\|$. The reverse triangle inequality gives
$$\|B_{\mathcal{S},i}\,Ua\| \geq (1-\rho_{\mathcal{D}})(1-\gamma)\,\|Ua\|.$$
Squaring both sides gives the stated Loewner bound. ∎
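A small numerical check of Lemma 3 is sketched below, under assumptions of our own: the near transports are block-diagonal rotations with small random angles, $U$ has orthonormal columns, and the near weights are random but normalized to total mass $1-\rho_{\mathcal{D}}$. It reports the lower bound from (33), the realized norm $\|B_{\mathcal{S},i}Ua\|$, and the upper bound $(1-\rho_{\mathcal{D}})\|Ua\|$ implied by Proposition 4. All function names are hypothetical.

```python
import numpy as np

def block_rotation(angles):
    """Block-diagonal matrix of 2x2 rotations, one block per angle."""
    d = 2 * len(angles)
    P = np.zeros((d, d))
    for b, t in enumerate(angles):
        c, s = np.cos(t), np.sin(t)
        P[2 * b:2 * b + 2, 2 * b:2 * b + 2] = [[c, -s], [s, c]]
    return P

def near_signal_check(n_blocks=4, K=5, r=3, rho_far=0.3, max_angle=0.2, seed=0):
    rng = np.random.default_rng(seed)
    d = 2 * n_blocks
    U, _ = np.linalg.qr(rng.standard_normal((d, r)))   # orthonormal columns (assumed)
    Pi_U = U @ U.T                                     # projector onto col(U)
    alpha = rng.random(K)
    alpha *= (1.0 - rho_far) / alpha.sum()             # near mass equals 1 - rho_D
    Ps = [block_rotation(rng.uniform(-max_angle, max_angle, n_blocks))
          for _ in range(K)]
    # gamma is the realized worst-case closeness over the near routes.
    gamma = max(np.linalg.norm((P - np.eye(d)) @ Pi_U, 2) for P in Ps)
    B_S = sum(a * P for a, P in zip(alpha, Ps))        # near aggregation operator
    x = U @ rng.standard_normal(r)                     # a vector of the form U a
    norm_x = np.linalg.norm(x)
    lower = (1.0 - rho_far) * (1.0 - gamma) * norm_x
    upper = (1.0 - rho_far) * norm_x
    return float(lower), float(np.linalg.norm(B_S @ x)), float(upper), float(gamma)
```

With the default arguments the realized closeness satisfies $\gamma < 1$, and the realized norm lies between the two bounds.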
### B.12 Score Bounds plus Near Alignment
###### Corollary 2 (Score Far-Mass Bound plus Near Alignment Gives Near-Signal Gain).
Suppose the score side supplies $\rho_{\mathcal{D}} \leq \rho_{\star}$ for some $\rho_{\star} \in (0,1)$. Suppose also that, for every target-bearing near token $j \in \mathcal{S}_i$,
$$\|(P_{j\to i} - I_d)\,\Pi_U\|_{\mathrm{op}} \leq \gamma, \qquad 0 \leq \gamma < 1.$$
Then the direct near-signal gain condition holds with $\kappa_{\star} = (1-\rho_{\star})^{2}(1-\gamma)^{2}$. That is,
$$U^{\top}B_{\mathcal{S},i}^{\top}\,B_{\mathcal{S},i}\,U \succeq (1-\rho_{\star})^{2}(1-\gamma)^{2}\,U^{\top}U.$$
###### Proof\.
By Lemma [3](https://arxiv.org/html/2606.24975#Thmlemma3),
$$U^{\top}B_{\mathcal{S},i}^{\top}\,B_{\mathcal{S},i}\,U \succeq (1-\rho_{\mathcal{D}})^{2}(1-\gamma)^{2}\,U^{\top}U.$$
Since $\rho_{\mathcal{D}} \leq \rho_{\star}$, $1 - \rho_{\mathcal{D}} \geq 1 - \rho_{\star}$. Therefore $(1-\rho_{\mathcal{D}})^{2}(1-\gamma)^{2} \geq (1-\rho_{\star})^{2}(1-\gamma)^{2}$. ∎
## Appendix C Experimental Details
Table [4](https://arxiv.org/html/2606.24975#A3.T4) summarizes the architecture, training, and evaluation setup. Training data is a continuous token stream; attention crosses document boundaries. For random-rotation models, step angles are resampled at each forward pass; the deterministic seed ensures that evaluation batches are identical across models, but the random angles differ across iterations.
Table 4: Architecture, training, and evaluation hyperparameters.
#### Model variants.
RoFormer/RoPE uses the standard position-indexed rotary map on Q/K only. The fixed RoPE Q/K/V baseline uses the same RoPE angles but also applies the position-indexed rotation to values before aggregation; it tests whether value transport alone solves extrapolation when the phase rule is still position-indexed. Random variants use accumulated random angle increments and instantiate the spectral-gap assumptions most directly, testing whether incoherent accumulated phase can protect long-context evaluation. Specifically, each step angle is drawn independently per position and per dimension from $\mathrm{Uniform}(-f_b, f_b)$, where $f_b = 1/10000^{2b/d}$ is the RoPE log-spaced frequency for block $b$. Angles are sampled independently across layers and resampled at each forward pass (not fixed at initialization). The same angle vector is shared across all heads within a layer. Learned token-rotation variants use one learned per-token angle embedding per layer; angles depend on token identity and are accumulated along the sequence, so the source-query relation is composed from the intervening tokens rather than from absolute position. Variants that also rotate values test the value-side transport predicted by the signal-interference decomposition.
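The random variant described above can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' training code: the 0-based block index, the convention that the route phase from source $j$ to query $i$ is the difference of cumulative angles, and the function names are all assumptions made for the example. It draws step angles from $\mathrm{Uniform}(-f_b, f_b)$, accumulates them along the sequence, and evaluates the normalized cosine score $S_{j\to i} = \frac{1}{B}\sum_b \cos\Theta_{j\to i,b}$ from Appendix B.6.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """f_b = 1 / base^(2b/d) for blocks b = 0, ..., d/2 - 1 (0-based, assumed)."""
    b = np.arange(d // 2)
    return 1.0 / base ** (2.0 * b / d)

def accumulated_angles(seq_len, d, rng):
    """Cumulative sum of step angles drawn from Uniform(-f_b, f_b)."""
    f = rope_frequencies(d)
    # One independent draw per position and per block.
    steps = rng.uniform(-f, f, size=(seq_len, d // 2))
    return np.cumsum(steps, axis=0)          # phi[t, b]

def route_scores(phi, i):
    """Normalized cosine score for every source j <= i and query i."""
    theta = phi[i] - phi[: i + 1]            # phase accumulated over steps j+1..i
    return np.cos(theta).mean(axis=1)        # average over the B blocks
```

Because the route phase is a difference of cumulative sums, it depends only on the angles of the intervening steps, and it composes additively along the route. Resampling at each forward pass would correspond to calling `accumulated_angles` again with fresh random draws.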
## Appendix D Additional Supporting Results
### D.1 Separate-Path Conditional Mixing Assumption
This subsection records the separate\-path random\-phase case\. It applies when the far weights have already been chosen by the score path and, conditional on that score\-side information, the value\-side step phases still have independent first\-harmonic mixing\. For example, it applies to a construction in which each value\-side block receives independent random step phases that are not reused by the Q/K score path\. It should not be read as a statement about randomized RoPE obtained only by sampling position indices: in that case all RoPE frequency blocks are functions of the same sampled source\-query offset, so the block\-independence condition below is not supplied by the positional randomization alone\.
###### Assumption 4 (Conditional V-Path First-Harmonic Gap After Score Selection).
Fix a query position $i$. Let $\mathcal{G}_{i,L}$ be the sigma-field generated by the score-side information used to select the attention weights for the evaluated context length $L$. The far weights $\alpha_k = \alpha_{i,i-k}$ are assumed to be $\mathcal{G}_{i,L}$-measurable.
In each rotation block $b$, define the value-side step phasor
$$H_{\ell,b}^{V} = e^{-i\psi_{\ell,b}^{V}},$$
where $\psi_{\ell,b}^{V}$ is the $b$-th coordinate of the value-side step angle $\psi_{\ell}^{V} = \omega + g(c_{\ell})$ from ([15](https://arxiv.org/html/2606.24975#S5.E15)). Conditional on $\mathcal{G}_{i,L}$, the value-side step phasors $\{H_{\ell,b}^{V}\}_{\ell}$ are independent over positions $\ell$, and their conditional means
$$m_{\ell,b} = \mathbb{E}[H_{\ell,b}^{V} \mid \mathcal{G}_{i,L}]$$
satisfy
$$|m_{\ell,b}| \leq \beta < 1$$
for every $\ell$ and $b$.
No independence is assumed among the route products themselves; the route products are nested interval products and share suffix rotations\. The assumption rules out choosing the far weights by observing the same value\-side phase realizations whose cancellation is later claimed\.
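For intuition about the gap parameter $\beta$, consider the illustrative case in which the random part of a step angle is $\mathrm{Uniform}(-f, f)$, as for the random variants in Appendix C. Then $|\mathbb{E}\,e^{-i\psi}| = \sin(f)/f$, and adding a deterministic offset such as $\omega$ only multiplies the phasor by a unit-modulus constant, so the modulus of the mean is unchanged. The sketch below computes this value and the smallest window $w$ with $\beta^{w} \leq \varepsilon$. The uniform law and the function names are assumptions for the example; the assumption above only requires $|m_{\ell,b}| \leq \beta < 1$.

```python
import numpy as np

def uniform_phase_mean(f):
    """|E exp(-i psi)| for psi ~ Uniform(-f, f), which equals sin(f) / f."""
    f = np.asarray(f, dtype=float)
    return np.sinc(f / np.pi)          # np.sinc(x) = sin(pi x) / (pi x)

def mixing_window(beta, eps):
    """Smallest integer w with beta**w <= eps, for 0 < beta < 1 and 0 < eps < 1."""
    return int(np.ceil(np.log(eps) / np.log(beta)))
```

For example, `uniform_phase_mean(1.0)` equals $\sin(1) \approx 0.84$, while very small values of `f` give a mean close to 1 and therefore a much longer window from `mixing_window`.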
### D.2 Auxiliary Separate-Path Prefix-Product Bound
###### Theorem 7 (Nested Prefix-Product Value-Covariance Bound for Score-Selected Weights).
Fix a query position $i$, a far window $w$, and a finite evaluated far set of route lengths $\mathcal{K}_{\mathcal{D}} \subseteq \{k : k \geq w\}$. Let $\alpha_k \geq 0$ be the score-selected far weights. In this route-length notation, $P_{i-k\to i}^{V}$ denotes the value-side transport from source position $i-k$ to query position $i$. Write
$$\rho = \sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k, \qquad a = \Bigl(\sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k^{2}\Bigr)^{1/2}.$$
In block $b$, define
$$\Pi_{k,b}^{V} = \prod_{\ell=1}^{k} H_{i-\ell,b}^{V},$$
and
$$t_b = \sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k\,\Pi_{k,b}^{V}.$$
Under Assumption [4](https://arxiv.org/html/2606.24975#Thmassumption4), for every $\eta \in (0,1)$, with conditional probability at least $1 - \eta$ given $\mathcal{G}_{i,L}$,
$$\max_{1 \leq b \leq B}|t_b| \leq \mathcal{R}_{\rm pp}(\beta^{w},\rho,a;\eta). \tag{34}$$
Equivalently, for the block-diagonal weighted far rotation sum,
$$\bm{T}_{\mathcal{D}}^{V} = \sum_{k \in \mathcal{K}_{\mathcal{D}}}\alpha_k\,P_{i-k\to i}^{V},$$
one has, with conditional probability at least $1 - \eta$,
$$\|\bm{T}_{\mathcal{D}}^{V}\|_{\mathrm{op}} \leq \mathcal{R}_{\rm pp}(\beta^{w},\rho,\|\bm{\alpha}_{\mathcal{D}}\|_2;\eta). \tag{35}$$
In the shared-background far-value model satisfying the conditional covariance identity of Definition [1](https://arxiv.org/html/2606.24975#Thmdefinition1),
$$\Delta_{\mathcal{D}} = c_0^{2}\,\bm{T}_{\mathcal{D}}^{V}(\bm{T}_{\mathcal{D}}^{V})^{\top} + \sigma_w^{2}\,\|\bm{\alpha}_{\mathcal{D}}\|_2^{2}\,I_d,$$
and therefore, on the same event,
$$\Delta_{\mathcal{D}} \preceq \biggl[c_0^{2}\,\mathcal{R}_{\rm pp}(\beta^{w},\rho,\|\bm{\alpha}_{\mathcal{D}}\|_2;\eta)^{2} + \sigma_w^{2}\,\|\bm{\alpha}_{\mathcal{D}}\|_2^{2}\biggr] I_d. \tag{36}$$
###### Proof.
Condition on $\mathcal{G}_{i,L}$, so the weights are deterministic and the value-side phasors satisfy ordinary independence and mean bounds. For each block, the one-block martingale calculation in the proof of Theorem [6](https://arxiv.org/html/2606.24975#Thmtheorem6), applied with fixed coefficients $\alpha_{k}$, gives
$$|t_{b}|\leq\beta^{w}\rho+\frac{4a}{1-\beta}\sqrt{\log\frac{4B}{\eta}}$$
with conditional failure probability at most $\eta/B$. A union bound over the $B$ blocks gives ([34](https://arxiv.org/html/2606.24975#A4.E34)). In each two-dimensional block the matrix $\sum_{k\in\mathcal{K}_{\mathcal{D}}}\alpha_{k}P_{i-k\to i}^{V}$ acts as multiplication by $t_{b}$; hence $\|\bm{T}_{\mathcal{D}}^{V}\|_{\mathrm{op}}=\max_{1\leq b\leq B}|t_{b}|$. The covariance bound follows from $\bm{T}_{\mathcal{D}}^{V}(\bm{T}_{\mathcal{D}}^{V})^{\top}\preceq\|\bm{T}_{\mathcal{D}}^{V}\|_{\mathrm{op}}^{2}I_{d}$. ∎
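The per-block bound used in this proof can be sanity-checked numerically. The sketch below is a minimal illustration under stated assumptions, not the paper's experimental code. The value-side step angle is modeled as $\omega$ plus independent Gaussian noise, a hypothetical stand-in for $g(c_{\ell})$, so that $|m_{\ell,b}|=e^{-\sigma^{2}/2}=\beta$. The far weights are fixed and uniform, hence deterministic as the proof requires. The helper `r_pp` assumes that $\mathcal{R}_{\rm pp}$ has the form $\beta^{w}\rho+\frac{4a}{1-\beta}\sqrt{\log(4B/\eta)}$ shown in the proof display. All function and variable names are illustrative.

```python
import numpy as np

def far_phasor_sums(step_angles, weights, w):
    """Return t_b = sum_k alpha_k * Pi_{k,b} for route lengths k = w, ..., w+K-1.

    step_angles[l-1, b] is the angle psi_{i-l,b} of the l-th step back from the
    query, so the cumulative product along axis 0 yields the nested prefix
    products Pi_{k,b} = prod_{l=1..k} exp(-1j * psi_{i-l,b}).
    """
    prefix = np.cumprod(np.exp(-1j * step_angles), axis=0)  # prefix[k-1] = Pi_k
    far = prefix[w - 1 : w - 1 + len(weights)]               # rows for k >= w
    return weights @ far                                      # shape (B,)

def r_pp(beta, w, rho, a, num_blocks, eta):
    """Assumed form of R_pp, taken from the display in the proof of Theorem 7."""
    fluctuation = 4.0 * a / (1.0 - beta) * np.sqrt(np.log(4.0 * num_blocks / eta))
    return beta**w * rho + fluctuation

rng = np.random.default_rng(0)
B, w, K = 32, 8, 4096            # blocks, far window, number of far routes
omega, sigma, eta = 0.3, 1.0, 0.05
beta = np.exp(-sigma**2 / 2)     # |E exp(-i psi)| for Gaussian angle noise

# Fixed (non-adaptive) uniform far weights with total far mass rho = 0.2.
weights = np.full(K, 0.2 / K)
rho, a = weights.sum(), np.linalg.norm(weights)

# Assumption: angle = omega + Gaussian noise, independent over steps and blocks.
angles = omega + sigma * rng.standard_normal((w + K - 1, B))
t = far_phasor_sums(angles, weights, w)
bound = r_pp(beta, w, rho, a, B, eta)
print(f"max_b |t_b| = {np.abs(t).max():.4f}, bound = {bound:.4f}, rho = {rho:.4f}")
```

With uniform weights over $K$ routes, $a=\rho/\sqrt{K}$, so the fluctuation term falls below the trivial estimate $|t_{b}|\leq\rho$ only once $K$ is large; this is why the sketch uses several thousand far routes.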
### D.3 Adaptive Selection Counterexample
###### Example 10 (Diffuse Weights Alone Do Not Ensure Cancellation).
Let $\zeta_{1},\ldots,\zeta_{M}$ be independent phasors uniformly distributed on the unit circle. Fix $\varepsilon\in(0,\pi/2]$, and define the selected index set
$$A_{\varepsilon}=\{k:|\arg\zeta_{k}|\leq\varepsilon\},$$
where $\arg\zeta_{k}\in[-\pi,\pi)$. If $A_{\varepsilon}\neq\emptyset$, define adaptive weights
$$\alpha_{k}=\begin{cases}|A_{\varepsilon}|^{-1},&k\in A_{\varepsilon},\\ 0,&k\notin A_{\varepsilon}.\end{cases}$$
Then, with probability at least $1-\exp(-M\varepsilon/(8\pi))$, one has
$$\|\bm{\alpha}\|_{2}\leq\sqrt{\frac{2\pi}{M\varepsilon}},$$
and
$$\Bigl|\sum_{k=1}^{M}\alpha_{k}\zeta_{k}\Bigr|\geq\cos\varepsilon.$$
Thus $\|\bm{\alpha}\|_{2}$ can tend to zero while the weighted phasor sum remains bounded away from zero.
###### Proof.
For each $k$, the event $|\arg\zeta_{k}|\leq\varepsilon$ has probability $p=\varepsilon/\pi$. Hence $N_{\varepsilon}=|A_{\varepsilon}|$ has the binomial distribution $\mathrm{Bin}(M,p)$ and $\mathbb{E}N_{\varepsilon}=Mp=M\varepsilon/\pi$. By the standard multiplicative Chernoff bound (see, e.g., Boucheron et al. ([2013](https://arxiv.org/html/2606.24975#bib.bib3))),
$$\Pr\bigl(N_{\varepsilon}<\tfrac{1}{2}Mp\bigr)\leq\exp(-Mp/8)=\exp(-M\varepsilon/(8\pi)).$$
On the complementary event, $N_{\varepsilon}\geq M\varepsilon/(2\pi)$. The weights are uniform on $A_{\varepsilon}$, so $\|\bm{\alpha}\|_{2}^{2}=1/N_{\varepsilon}$ and $\|\bm{\alpha}\|_{2}\leq\sqrt{2\pi/(M\varepsilon)}$. For every $k\in A_{\varepsilon}$, $\mathrm{Re}\,\zeta_{k}=\cos(\arg\zeta_{k})\geq\cos\varepsilon$. Therefore
$$\mathrm{Re}\Bigl(\sum_{k=1}^{M}\alpha_{k}\zeta_{k}\Bigr)=\frac{1}{N_{\varepsilon}}\sum_{k\in A_{\varepsilon}}\mathrm{Re}\,\zeta_{k}\geq\cos\varepsilon.$$
Since the magnitude of a complex number is at least its real part when the real part is at least zero, $|\sum_{k}\alpha_{k}\zeta_{k}|\geq\cos\varepsilon$. ∎
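The counterexample is easy to reproduce in simulation. The following is a minimal sketch, with illustrative names and parameter values that are not from the paper: it draws uniform phasors, selects those within angle $\varepsilon$ of the positive real axis, assigns uniform weights on the selection, and reports $\|\bm{\alpha}\|_{2}$ alongside the weighted phasor sum.

```python
import numpy as np

def adaptive_selection(zeta, eps):
    """Uniform weights on A_eps = {k : |arg zeta_k| <= eps}; zero elsewhere."""
    selected = np.abs(np.angle(zeta)) <= eps
    n_selected = int(selected.sum())
    if n_selected == 0:
        raise ValueError("A_eps is empty; the example requires a nonempty selection")
    return selected / n_selected

rng = np.random.default_rng(0)
M, eps = 100_000, 0.3  # illustrative values
zeta = np.exp(1j * rng.uniform(-np.pi, np.pi, size=M))  # uniform on the unit circle

alpha = adaptive_selection(zeta, eps)
l2_norm = np.linalg.norm(alpha)
weighted_sum = abs(alpha @ zeta)

print(f"||alpha||_2   = {l2_norm:.4f}  (bound {np.sqrt(2 * np.pi / (M * eps)):.4f})")
print(f"|sum a_k z_k| = {weighted_sum:.4f}  (lower bound cos(eps) = {np.cos(eps):.4f})")
```

Because the weights are chosen after observing the phasors, the selected terms all point in nearly the same direction, so a small $\ell_{2}$ norm does not force cancellation. This is the adaptivity that Assumption 4 excludes.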
### D.4 Separate-Path Far-Weight to Covariance Corollary
###### Corollary 3 (Far-Mass and Far-Weight Bounds Give Spectral Far-Covariance Bound).
Let $w=w_{\varepsilon_{\rm mix}}$, so that $\beta^{w}\leq\varepsilon_{\rm mix}$. Suppose the far weights satisfy
$$\rho_{\mathcal{D}}\leq\rho_{\star},\qquad\|\bm{\alpha}_{\mathcal{D}}\|_{2}\leq a_{\star}.$$
For each finite evaluated far set, the bound has no explicit dependence on the raw number of far routes; only total far mass and far-weight $\ell_{2}$ norm enter. Under the shared-background far-value model satisfying the conditional covariance identity of Definition [1](https://arxiv.org/html/2606.24975#Thmdefinition1), and under Assumption [4](https://arxiv.org/html/2606.24975#Thmassumption4), Theorem [7](https://arxiv.org/html/2606.24975#Thmtheorem7) implies that, with probability at least $1-\eta$,
$$\Delta_{\mathcal{D}}\preceq\bar{\delta}_{\rm pp}^{2}\,I_{d},$$
where
$$\bar{\delta}_{\rm pp}^{2}=c_{0}^{2}\,\mathcal{R}_{\rm pp}(\varepsilon_{\rm mix},\rho_{\star},a_{\star};\eta)^{2}+\sigma_{w}^{2}\,a_{\star}^{2}.\tag{37}$$
###### Proof.
By choice of $w$, the mean term in Theorem [7](https://arxiv.org/html/2606.24975#Thmtheorem7) satisfies $\beta^{w}\rho\leq\varepsilon_{\rm mix}\,\rho_{\star}$. The fluctuation term uses $\|\bm{\alpha}_{\mathcal{D}}\|_{2}\leq a_{\star}$. Substituting into ([36](https://arxiv.org/html/2606.24975#A4.E36)) gives the stated bound. ∎
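To make the dependence on $\rho_{\star}$ and $a_{\star}$ concrete, the sketch below evaluates (37) for hypothetical parameter values. Two assumptions are made and should be checked against the main text: `mixing_window` takes $w_{\varepsilon_{\rm mix}}$ to be the smallest integer $w$ with $\beta^{w}\leq\varepsilon_{\rm mix}$, and `r_pp` takes $\mathcal{R}_{\rm pp}$ to have the form appearing in the proof of Theorem 7, with $\beta^{w}$ replaced by $\varepsilon_{\rm mix}$. The function names are illustrative.

```python
import math

def mixing_window(beta, eps_mix):
    """Smallest integer w >= 0 with beta**w <= eps_mix (assumed definition)."""
    if not (0.0 < beta < 1.0 and eps_mix > 0.0):
        raise ValueError("need 0 < beta < 1 and eps_mix > 0")
    w = 0
    while beta**w > eps_mix:
        w += 1
    return w

def r_pp(eps_mix, rho, a, beta, num_blocks, eta):
    """Assumed form of R_pp: mean term plus martingale fluctuation term."""
    fluctuation = 4.0 * a / (1.0 - beta) * math.sqrt(math.log(4.0 * num_blocks / eta))
    return eps_mix * rho + fluctuation

def far_covariance_bound(c0, sigma_w, eps_mix, rho_star, a_star, beta, num_blocks, eta):
    """delta_bar_pp^2 from (37): c0^2 * R_pp^2 + sigma_w^2 * a_star^2."""
    r = r_pp(eps_mix, rho_star, a_star, beta, num_blocks, eta)
    return c0**2 * r**2 + sigma_w**2 * a_star**2

# Hypothetical values: the bound shrinks as the far-weight l2 norm a_star shrinks,
# regardless of how many far routes carry that weight.
w = mixing_window(beta=0.6, eps_mix=0.01)
for a_star in (0.1, 0.01):
    d2 = far_covariance_bound(1.0, 1.0, 0.01, 0.2, a_star, 0.6, 32, 0.05)
    print(f"w = {w}, a_star = {a_star}, delta_bar_pp^2 = {d2:.5f}")
```

The number of far routes never appears as an argument, which mirrors the remark in the corollary that only the total far mass and the far-weight $\ell_{2}$ norm enter.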
## Appendix E A Probabilistic Condition for Near-Route Alignment
###### Lemma 4 (Sub-Gaussian Short-Route Angles Imply Near-Signal Gain).
Assume $d=2B$. For each target-bearing near route $j\in\mathcal{S}_{i}$, let the route length be $n_{j}=i-j$, and suppose $n_{j}\leq n_{\rm sig}$ for every $j\in\mathcal{S}_{i}$. In block $b$, write the step angle as $\psi_{t,b}=\bar{\psi}_{t,b}+\varepsilon_{t,b}$, and assume that, for each fixed block $b$, the sequence $\{\varepsilon_{t,b}\}_{t}$ is a martingale difference sequence with respect to a filtration $\{\mathcal{F}_{t,b}\}_{t}$, and is conditionally sub-Gaussian with variance proxy $\sigma_{\psi}^{2}$:
$$\mathbb{E}[\varepsilon_{t,b}\mid\mathcal{F}_{t-1,b}]=0,\qquad\mathbb{E}[\exp(\lambda\,\varepsilon_{t,b})\mid\mathcal{F}_{t-1,b}]\leq\exp(\lambda^{2}\sigma_{\psi}^{2}/2)$$
for every $\lambda\in\mathbb{R}$. The independent mean-zero $\sigma_{\psi}^{2}$-sub-Gaussian case is a special case. Assume the deterministic drift over every target-bearing near route is bounded:
$$\Bigl|\sum_{t=j}^{i-1}\bar{\psi}_{t,b}\Bigr|\leq\mu_{\rm sig}$$
for every $j\in\mathcal{S}_{i}$ and every block $b$. Then, with probability at least $1-\eta$,
$$\max_{j\in\mathcal{S}_{i}}\,\max_{1\leq b\leq B}\,|\Theta_{j\to i,b}|\leq\mu_{\rm sig}+\sigma_{\psi}\sqrt{2n_{\rm sig}\log\frac{2B|\mathcal{S}_{i}|}{\eta}},$$
where $\Theta_{j\to i,b}=\sum_{t=j}^{i-1}\psi_{t,b}$. Consequently, on this event, defining
$$\gamma_{\rm sig}=\mu_{\rm sig}+\sigma_{\psi}\sqrt{2n_{\rm sig}\log\frac{2B|\mathcal{S}_{i}|}{\eta}},$$
one has $\|(P_{j\to i}-I_{d})\,\Pi_{U}\|_{\mathrm{op}}\leq\gamma_{\rm sig}$ for every $j\in\mathcal{S}_{i}$. If $\gamma_{\rm sig}<1$, then
$$U^{\top}B_{\mathcal{S},i}^{\top}\,B_{\mathcal{S},i}\,U\succeq(1-\rho_{\mathcal{D}})^{2}(1-\gamma_{\rm sig})^{2}\,U^{\top}U,\qquad\rho_{\mathcal{D}}=\sum_{j\in\mathcal{D}_{i}}\alpha_{ij}.$$
###### Proof.
Fix $j\in\mathcal{S}_{i}$ and $b\in\{1,\ldots,B\}$. Write
$$\Theta_{j\to i,b}=\sum_{t=j}^{i-1}\bar{\psi}_{t,b}+\sum_{t=j}^{i-1}\varepsilon_{t,b}.$$
By assumption, $|\sum_{t=j}^{i-1}\bar{\psi}_{t,b}|\leq\mu_{\rm sig}$. By the conditional sub-Gaussian martingale assumption, the noise sum $E_{j,b}=\sum_{t=j}^{i-1}\varepsilon_{t,b}$ is sub-Gaussian with variance proxy at most $n_{\rm sig}\sigma_{\psi}^{2}$. Hence, for every $u>0$,
$$\Pr(|E_{j,b}|\geq u)\leq 2\exp\!\Bigl(-\frac{u^{2}}{2n_{\rm sig}\,\sigma_{\psi}^{2}}\Bigr).$$
Choose $u=\sigma_{\psi}\sqrt{2n_{\rm sig}\log(2B|\mathcal{S}_{i}|/\eta)}$. A union bound over all $B|\mathcal{S}_{i}|$ pairs gives, with probability at least $1-\eta$,
$$|\Theta_{j\to i,b}|\leq\mu_{\rm sig}+\sigma_{\psi}\sqrt{2n_{\rm sig}\log\frac{2B|\mathcal{S}_{i}|}{\eta}}.$$
For a two-dimensional rotation, $\|R(\theta)-I_{2}\|_{\mathrm{op}}=2|\sin(\theta/2)|\leq|\theta|$. Thus, by block diagonality, $\|P_{j\to i}-I_{d}\|_{\mathrm{op}}\leq\gamma_{\rm sig}$ and $\|(P_{j\to i}-I_{d})\Pi_{U}\|_{\mathrm{op}}\leq\gamma_{\rm sig}$. If $\gamma_{\rm sig}<1$, Lemma [3](https://arxiv.org/html/2606.24975#Thmlemma3) applies with $\gamma=\gamma_{\rm sig}$ and $\rho_{\mathcal{D}}=\sum_{j\in\mathcal{D}_{i}}\alpha_{ij}$. ∎
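The two ingredients of this proof, the union-bounded angle estimate and the identity $\|R(\theta)-I_{2}\|_{\mathrm{op}}=2|\sin(\theta/2)|$, can be checked with a short Monte Carlo sketch. It covers only the independent Gaussian special case of the lemma, with a constant per-step drift $\mu_{\rm sig}/n_{\rm sig}$ and with $\mathcal{S}_{i}$ taken to be all route lengths $1,\ldots,n_{\rm sig}$. These choices, the parameter values, and the function names are assumptions made for illustration.

```python
import numpy as np

def gamma_sig(mu_sig, sigma_psi, n_sig, num_blocks, num_routes, eta):
    """Right-hand side of the accumulated-angle bound in Lemma 4."""
    log_term = np.log(2 * num_blocks * num_routes / eta)
    return mu_sig + sigma_psi * np.sqrt(2 * n_sig * log_term)

def route_angles(step_angles):
    """Theta for route lengths n = 1..n_sig: the sum of the last n step angles.

    step_angles has shape (..., n_sig, B), ordered t = i - n_sig, ..., i - 1.
    """
    return np.cumsum(step_angles[..., ::-1, :], axis=-2)

def rotation(theta):
    """2x2 rotation matrix R(theta)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])

rng = np.random.default_rng(0)
B, n_sig, eta = 32, 4, 0.05
mu_sig, sigma_psi, trials = 0.05, 0.02, 2000

# Constant drift mu_sig / n_sig per step keeps every route's drift within mu_sig.
drift = mu_sig / n_sig
noise = sigma_psi * rng.standard_normal((trials, n_sig, B))
theta = route_angles(drift + noise)                  # shape (trials, n_sig, B)

g = gamma_sig(mu_sig, sigma_psi, n_sig, B, n_sig, eta)
violation_rate = np.mean(np.abs(theta).max(axis=(1, 2)) > g)
print(f"gamma_sig = {g:.4f}, empirical violation rate = {violation_rate:.4f}, eta = {eta}")

# Per-block operator norm: ||R(theta) - I_2||_op = 2|sin(theta/2)| <= |theta|.
th = 0.7
op_norm = np.linalg.norm(rotation(th) - np.eye(2), 2)
print(f"op norm = {op_norm:.6f}, 2|sin(th/2)| = {2 * abs(np.sin(th / 2)):.6f}")
```

The union bound is conservative, so the empirical violation rate should sit well below $\eta$; the lemma is informative only when the chosen $n_{\rm sig}$ and $\sigma_{\psi}$ keep $\gamma_{\rm sig}<1$.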