
# RT-Transformer: The Transformer Block as a Spherical State Estimator
Source: [https://arxiv.org/html/2605.11007](https://arxiv.org/html/2605.11007)
###### Abstract

We show that the core components of the Transformer block — attention, residual connections, and normalization — arise naturally from a single geometric estimation problem. Modeling the latent state as a direction on the hypersphere, with noise defined in the tangent plane at the current estimate, yields a precision-weighted directional inference procedure in which attention aggregates evidence, residual connections implement incremental state updates, and normalization retracts the updated state back onto the hypersphere. Together, these components follow from the geometry of the estimation problem rather than being introduced as independent architectural choices.

## 1 Introduction

Despite its empirical success, the Transformer block lacks a unified interpretation: attention, residual connections, and normalization are typically introduced as separate design choices. This raises a basic question: what underlying principle ties these operations together, and why does this particular structure work so well?

A useful perspective is to view attention as a filter that aggregates multiple predictions of a latent linear stochastic differential equation (SDE). In this interpretation, each token provides a candidate estimate of a shared latent state, and attention combines them according to their reliability. Crucially, this aggregation must remain computationally tractable, requiring that uncertainty can be propagated and inverted in closed form while preserving the fully parallel $\mathcal{O}(d)$ structure of attention.

These requirements impose strong constraints on the class of admissible latent dynamical models. Under the assumption of linear dynamics and isotropic noise, covariance propagation reduces to a scalar function of time, which preserves the computational tractability of attention. This yields a Euclidean filtering model in which uncertainty is identical in every direction.

Although isotropic noise yields a tractable form of attention, it is a strong restriction: it rules out any model in which uncertainty depends on the state. However, generic anisotropic noise destroys the tractable structure required for parallel attention, since covariance propagation and inversion become fully dense and state-dependent. The central challenge is therefore to identify the most general anisotropic uncertainty model that preserves closed-form state and covariance propagation. The hypersphere provides the simplest geometry satisfying these requirements, since uncertainty decomposes naturally into radial and tangential components that co-rotate with the latent state under the dynamics.

This motivates the *Radial–Tangential SDE* (RT-SDE), in which process and measurement noise decompose into radial and tangential components aligned with the instantaneous state direction on the hypersphere. The key structural property of the RT-SDE is that the noise co-rotates with the latent state, causing the rotational terms to cancel inside the covariance integral. As a result, in the regime of small angular diffusion the propagated covariance remains analytically tractable despite the state dependence — the same regime in which directional inference is well-posed.

The RT-SDE gives rise to a tractable *RT-Filter*, under which each token is normalized to lie on the sphere and transported to the query position under rotational dynamics. A precision-weighted aggregation (attention) produces a directional estimate, and the state is updated by taking a small step toward this estimate in the tangent space (the residual connection), followed by retraction onto the sphere (normalization). This yields the familiar “add and norm” operation, which provides a first-order approximation to a geodesic step on the sphere. In this view, normalization is not an auxiliary stabilization mechanism, but a geometric consequence of directional state estimation.

We show that the Transformer with rotary positional encodings closely approximates the structure of the RT-Filter, excluding the feedforward network, which is not derived by the present filtering formulation. A consequence of this formulation is that token magnitude encodes directional confidence, with angular uncertainty scaling as $1/m^2$.

The RT-Filter makes concrete architectural predictions on modifications to the Transformer. Attention logits should incorporate magnitude-dependent precision, weighting keys by the confidence of their directional estimates. Queries, keys, and values should be normalized after projection to ensure that attention operates on unit directions, as required by the spherical state space. Finally, the geodesic step is more faithfully implemented by a tangent-space correction that removes the component of the attention output aligned with the current state before the residual connection. These modifications arise directly from the underlying model rather than as independent design choices.

Our main contributions are as follows:

1. *Radial–Tangential SDE (RT-SDE):* A structured stochastic model in which noise is confined to the tangent plane of the current state, preserving closed-form covariance propagation and tractable precision computation.
2. *Directional Interpretation of Attention:* A precision-weighted estimator of latent directions on the hypersphere, with token magnitude encoding directional confidence.
3. *Unified Derivation of the Transformer Block:* A derivation of attention, residual connections, and normalization as components of a single filtering update — a tangent-space step toward the new estimate followed by retraction onto the sphere.
4. *Architectural Modifications:* Three concrete departures from the standard Transformer: magnitude-dependent attention precision, QKV normalization, and a tangent-space residual correction that removes the radial component of the attention output before the residual connection.

This work focuses on the theoretical formulation and geometric interpretation of RT filtering. A comprehensive empirical evaluation and scaling study will be presented in future work.

## 2 Related Work

### 2.1 Attention as Estimation and Filtering

The Transformer architecture (Vaswani et al., [2017](https://arxiv.org/html/2605.11007#bib.bib15)) has been analyzed from several theoretical perspectives, including kernel smoothing (Tsai et al., [2019](https://arxiv.org/html/2605.11007#bib.bib23)), associative memory models such as modern Hopfield networks (Ramsauer et al., [2021](https://arxiv.org/html/2605.11007#bib.bib73)), and probabilistic interpretations of attention (Gabbur et al., [2021](https://arxiv.org/html/2605.11007#bib.bib82); Bianchessi et al., [2026](https://arxiv.org/html/2605.11007#bib.bib161)). These works primarily reinterpret attention weights or attention kernels, rather than deriving the broader Transformer block from an underlying dynamical estimation framework.

Robust Filter Attention (RFA) (Racioppo, [2026](https://arxiv.org/html/2605.11007#bib.bib209)) derives attention as an approximate maximum likelihood estimator for a latent state evolving under a linear stochastic differential equation (SDE). In this formulation, tractability is achieved through an isotropic Euclidean noise model, reducing covariance propagation to a scalar precision per query–key pair.

Our work builds on this filtering perspective by extending it to anisotropic settings while preserving tractability. In particular, we introduce a Radial–Tangential SDE in which uncertainty decomposes into components aligned with and orthogonal to the state direction. This enables structured anisotropy without breaking the closed-form, $\mathcal{O}(d)$ computation required for attention, and allows the filtering interpretation to extend beyond attention to the full Transformer block (excluding the FFN).

### 2.2 Geometric Perspectives and Normalization

Several works study Transformers from a geometric or dynamical perspective. Molina interprets token embeddings as trajectories on a hypersphere maintained by LayerNorm (Molina, [2024](https://arxiv.org/html/2605.11007#bib.bib165)), while Geshkovski et al. analyze self-attention as interacting particle dynamics on the sphere (Geshkovski et al., [2025](https://arxiv.org/html/2605.11007#bib.bib120)). Related work has shown that LayerNorm substantially alters the long-term dynamics of self-attention, enabling stable higher-rank equilibria and mitigating representation collapse (Wu et al., [2024](https://arxiv.org/html/2605.11007#bib.bib217)).

A complementary line of work studies normalization as a geometric operation. Brody et al. show that LayerNorm projects representations onto a hyperplane and increases attention expressivity (Brody et al., [2023](https://arxiv.org/html/2605.11007#bib.bib213)), while normalization schemes such as QKNorm (Henry et al., [2020](https://arxiv.org/html/2605.11007#bib.bib215)) empirically explore query/key normalization in attention mechanisms. More recent architectures explicitly constrain representations to hyperspherical manifolds (Loshchilov et al., [2025](https://arxiv.org/html/2605.11007#bib.bib198)) or introduce geodesic-inspired update rules and normalization schemes on the sphere (Zheng et al., [2026](https://arxiv.org/html/2605.11007#bib.bib210)).

These works primarily interpret or impose spherical structure geometrically. In contrast, we derive hyperspherical dynamics from an underlying stochastic filtering model. In our formulation, spherical geometry arises from state-dependent anisotropic uncertainty in the RT-SDE, while normalization corresponds to a retraction associated with directional filtering. This links attention, residual updates, and normalization within a unified probabilistic framework.

## 3 Methods

We extend isotropic filtering to model direction-dependent uncertainty while preserving closed-form covariance propagation and tractable precision-weighted estimation.

We first review isotropic filtering under linear stochastic dynamics, then introduce the RT-SDE and derive its closed-form covariance propagation. The radial–tangential decomposition yields a factorized estimator in which directional inference reduces to precision-weighted aggregation on the hypersphere. Implementing this as an incremental update results in the RT-Filter, which is closely approximated by the Transformer.

### 3.1 Background: Robust Filter Attention

Robust Filter Attention (RFA) (Racioppo, [2026](https://arxiv.org/html/2605.11007#bib.bib209)) interprets attention as approximate Bayesian filtering under linear dynamical transport. Past tokens are propagated to the query position through linear dynamics, and latent states are estimated through a robust precision-weighted M-estimator:

$$
\bar{\boldsymbol{z}}_i = \Big(\sum_{j\le i} w_{ij}\,\boldsymbol{P}_{ij}\Big)^{-1}\sum_{j\le i} w_{ij}\,\boldsymbol{P}_{ij}\,\hat{\boldsymbol{z}}_{ij}, \tag{1}
$$

where $\hat{\boldsymbol{z}}_{ij} = e^{\boldsymbol{A}\Delta t_{ij}}\boldsymbol{z}_j$ are transported observations and $w_{ij}(d_{ij}^2)$ downweights inconsistent predictions as a function of Mahalanobis distance.

Under diagonalizable dynamics and isotropic process noise, covariance propagation reduces to a scalar function of temporal lag, yielding a tractable attention mechanism with $\mathcal{O}(N^2 d)$ complexity. The RT-SDE developed below generalizes isotropic RFA by replacing scalar uncertainty with radial–tangential covariance structure on the hypersphere. Full derivations and background are provided in Appendix [A](https://arxiv.org/html/2605.11007#A1).

### 3.2 The Radial–Tangential SDE (RT-SDE)

To preserve analytic covariance propagation while allowing directional uncertainty, we introduce the *Radial–Tangential SDE* (RT-SDE), in which process and measurement noise co-rotate with the latent state direction in the eigenbasis of the dynamics.

We consider the linear stochastic differential equation:

$$
d\boldsymbol{x}(t) = \boldsymbol{A}(t)\,\boldsymbol{x}(t)\,dt + \boldsymbol{G}(t)\,d\boldsymbol{w}(t), \qquad \boldsymbol{z}(t_k) = \boldsymbol{x}(t_k) + \boldsymbol{v}(t_k). \tag{2}
$$

We assume diagonalizable dynamics, $\boldsymbol{A}(t) = \boldsymbol{S}\boldsymbol{\Lambda}(t)\boldsymbol{S}^{-1}$, and perform filtering in the eigenbasis:

$$
\boldsymbol{x}_s(t) = \boldsymbol{S}^{-1}\boldsymbol{x}(t), \qquad \boldsymbol{z}_s(t_k) = \boldsymbol{S}^{-1}\boldsymbol{z}(t_k).
$$

The state is decomposed into magnitude and direction:

$$
\boldsymbol{x}_s(t) = m(t)\,\boldsymbol{u}(t), \qquad m(t) = \|\boldsymbol{x}_s(t)\|_2, \qquad \boldsymbol{u}(t) = \frac{\boldsymbol{x}_s(t)}{\|\boldsymbol{x}_s(t)\|_2}.
$$

We assume that decay and process noise act independently in the radial and tangential directions of the latent state. Defining the radial and tangential projectors

$$
\boldsymbol{P}_R(\boldsymbol{u}) = \boldsymbol{u}\boldsymbol{u}^\dagger, \qquad \boldsymbol{P}_T(\boldsymbol{u}) = \boldsymbol{I} - \boldsymbol{u}\boldsymbol{u}^\dagger,
$$

we define the state-dependent dynamics in the eigenbasis as:

$$
\boldsymbol{\Lambda}(\boldsymbol{u}(t)) = -\mu_r\,\boldsymbol{P}_R(\boldsymbol{u}(t)) - \mu_t\,\boldsymbol{P}_T(\boldsymbol{u}(t)) + \boldsymbol{\Lambda}_\Omega,
$$

where $\boldsymbol{\Lambda}_\Omega \in i\,\mathbb{R}^{d\times d}$ is diagonal and generates rotational transport.

Likewise, radial and tangential diffusion are modeled independently through projected Wiener increments

$$
d\boldsymbol{w}_r = \boldsymbol{P}_R(\boldsymbol{u})\,d\boldsymbol{w}, \qquad d\boldsymbol{w}_t = \boldsymbol{P}_T(\boldsymbol{u})\,d\boldsymbol{w}.
$$

The resulting RT-SDE is:

$$
d\boldsymbol{x}_s = \big(-\mu_r\,\boldsymbol{P}_R(\boldsymbol{u}) - \mu_t\,\boldsymbol{P}_T(\boldsymbol{u}) + \boldsymbol{\Lambda}_\Omega\big)\,\boldsymbol{x}_s\,dt + \sigma_r\,d\boldsymbol{w}_r + \sigma_t\,d\boldsymbol{w}_t. \tag{3}
$$

The projected Wiener increments induce the process covariance

$$
\boldsymbol{\Lambda}_Q(\boldsymbol{u}(t)) = \sigma_r^2\,\boldsymbol{P}_R(\boldsymbol{u}(t)) + \sigma_t^2\,\boldsymbol{P}_T(\boldsymbol{u}(t))
$$

in the eigenbasis. The ambient process covariance is therefore $\boldsymbol{Q}(t) = \boldsymbol{S}\boldsymbol{\Lambda}_Q(\boldsymbol{u}(t))\boldsymbol{S}^\dagger$, where $\boldsymbol{Q}(t) = \boldsymbol{G}(t)\boldsymbol{G}(t)^\top$.

Measurement noise is likewise decomposed into radial and tangential components:

$$
\boldsymbol{v}(t_k) \sim \mathcal{N}\big(\boldsymbol{0},\; \eta_r^2\,\boldsymbol{P}_R(\boldsymbol{u}) + \eta_t^2\,\boldsymbol{P}_T(\boldsymbol{u})\big).
$$

Normalizing the measurement yields the observed direction $\boldsymbol{u}_z(t_k) = \boldsymbol{z}_s(t_k)/\|\boldsymbol{z}_s(t_k)\|$. Linearizing the normalization map around the latent state $m\boldsymbol{u}$ gives:

$$
\boldsymbol{u}_z(t_k) \approx \boldsymbol{u}(t_k) + \frac{1}{m(t_k)}\,\boldsymbol{P}_T(\boldsymbol{u}(t_k))\,\boldsymbol{v}_t(t_k). \tag{4}
$$

Only tangential noise directly perturbs the observed direction, while radial noise affects only the magnitude of the estimate.
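As a quick numerical check on the linearization in Eq. (4), the following minimal sketch (not from the paper; it assumes NumPy and real-valued coordinates) verifies that tangential measurement noise with per-coordinate variance $\eta_t^2$ on a state of magnitude $m$ induces a squared angular error of roughly $(d-1)\,\eta_t^2/m^2$, i.e. $\eta_t^2/m^2$ per tangential degree of freedom:

```python
import numpy as np

# Sketch: Monte Carlo check of the linearized normalization map (Eq. 4).
rng = np.random.default_rng(0)
d, m, eta_t = 64, 4.0, 0.1
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                        # latent direction on the sphere

P_T = np.eye(d) - np.outer(u, u)              # tangential projector
noise = (eta_t * rng.standard_normal((100_000, d))) @ P_T.T
z = m * u + noise                             # measurements, tangential noise only
u_z = z / np.linalg.norm(z, axis=1, keepdims=True)

theta = np.arccos(np.clip(u_z @ u, -1.0, 1.0))
print(np.mean(theta**2))                      # empirical E[theta^2]
print((d - 1) * eta_t**2 / m**2)              # prediction from the linearization
```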

The RT-SDE induces stochastic trajectories on the hypersphere in which direction evolves under tangential diffusion and rotational dynamics. Figure [1](https://arxiv.org/html/2605.11007#S3.F1) illustrates typical realizations of this process.

![Refer to caption](https://arxiv.org/html/2605.11007v1/x1.png)
(a) Pure tangential noise.
![Refer to caption](https://arxiv.org/html/2605.11007v1/x2.png)
(b) RT-SDE with rotational dynamics.

Figure 1: Illustration of stochastic trajectories induced by the RT-SDE on the hypersphere in the eigenbasis for $d=3$. True trajectories are shown as black solid lines, with noisy measurements shown as red dots. (a) Pure tangential diffusion produces a random walk on the sphere. (b) The full RT-SDE, combining rotational dynamics ($\boldsymbol{\Lambda}_\Omega$) with tangential diffusion, induces rotational transport while preserving the radial–tangential covariance structure.
### 3.3 Closed-form covariance propagation

The directional estimator requires knowing how reliably a past token predicts the current latent direction after transport under the RT-SDE. This depends on the propagated measurement covariance between times $t_j$ and $t_i$.

The accumulated process covariance satisfies

$$
\boldsymbol{V}(t_i, t_j) = \int_0^{\Delta t_{ij}} e^{\boldsymbol{A}\tau}\,\boldsymbol{Q}(t_i - \tau)\,e^{\boldsymbol{A}^\top\tau}\,d\tau. \tag{5}
$$

Since the dynamics are diagonalizable, $\boldsymbol{V}(t_i,t_j) = \boldsymbol{S}\boldsymbol{\Lambda}_V(t_i,t_j)\boldsymbol{S}^{-1}$, so covariance propagation may be carried out in the eigenbasis. The propagated measurement covariance is then

$$
\boldsymbol{\Lambda}_{\hat V}(t_i,t_j) = \boldsymbol{\Lambda}_V(t_i,t_j) + e^{\boldsymbol{\Lambda}\Delta t_{ij}}\,\boldsymbol{\Lambda}_R(t_j)\,e^{\boldsymbol{\Lambda}^\dagger\Delta t_{ij}}, \tag{6}
$$

where the second term transports the measurement covariance from time $t_j$ to the query frame at $t_i$.

In the regime of small angular diffusion ($\sigma_t\sqrt{\Delta t_{ij}} \ll 1$), the direction $\boldsymbol{u}(t)$ is well-approximated by pure rotational transport, and the covariance admits a closed-form expression despite the state dependence.

#### Proposition 1: Closed-form covariance propagation under the RT-SDE.

In the eigenbasis, the propagated measurement covariance is well-approximated by:

$$
\boldsymbol{\Lambda}_{\hat V}(t_i,t_j) = \sigma_{Vr}^2(|\Delta t_{ij}|)\,\boldsymbol{P}_R(\boldsymbol{u}(t_i)) + \sigma_{Vt}^2(|\Delta t_{ij}|)\,\boldsymbol{P}_T(\boldsymbol{u}(t_i)), \tag{7}
$$

where

$$
\begin{aligned}
\sigma_{Vr}^2(|\Delta t_{ij}|) &= \varphi(\mu_r, |\Delta t_{ij}|)\,\sigma_r^2 + e^{-2\mu_r|\Delta t_{ij}|}\,\eta_r^2, \\
\sigma_{Vt}^2(|\Delta t_{ij}|) &= \varphi(\mu_t, |\Delta t_{ij}|)\,\sigma_t^2 + e^{-2\mu_t|\Delta t_{ij}|}\,\eta_t^2,
\end{aligned}
\qquad
\varphi(\mu, \Delta t) =
\begin{cases}
\dfrac{1 - e^{-2\mu\,\Delta t}}{2\mu}, & \mu > 0, \\[6pt]
\Delta t, & \mu = 0.
\end{cases}
$$

The corresponding precision matrix is:

$$
\boldsymbol{\Lambda}_{\hat V}^{-1}(t_i,t_j) = \frac{1}{\sigma_{Vr}^2}\,\boldsymbol{P}_R + \frac{1}{\sigma_{Vt}^2}\,\boldsymbol{P}_T. \tag{8}
$$
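For concreteness, here is a minimal sketch of Proposition 1 (assuming NumPy; the function and variable names are ours, not the paper's) that evaluates the propagated radial and tangential variances; the precisions in Eq. (8) are their reciprocals:

```python
import numpy as np

# Sketch: closed-form variance propagation of Eqs. (7)-(8).
def phi(mu: float, dt):
    """Accumulation factor (1 - exp(-2 mu dt)) / (2 mu), with limit dt as mu -> 0."""
    if mu > 0:
        return (1.0 - np.exp(-2.0 * mu * dt)) / (2.0 * mu)
    return dt

def propagated_variances(dt, mu_r, mu_t, sigma_r, sigma_t, eta_r, eta_t):
    dt = np.abs(dt)
    var_r = phi(mu_r, dt) * sigma_r**2 + np.exp(-2.0 * mu_r * dt) * eta_r**2
    var_t = phi(mu_t, dt) * sigma_t**2 + np.exp(-2.0 * mu_t * dt) * eta_t**2
    return var_r, var_t        # sigma_Vr^2, sigma_Vt^2; precisions are 1 / var
```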

#### Proof sketch.

In the small angular diffusion regime, the direction $\boldsymbol{u}(t)$ is well-approximated by pure rotational transport, so the radial and tangential projectors satisfy

$$
\boldsymbol{\Phi}(\tau)\,\boldsymbol{P}_{R/T}(t_i - \tau)\,\boldsymbol{\Phi}(\tau)^\dagger \approx \boldsymbol{P}_{R/T}(t_i),
$$

and the rotational terms cancel approximately inside the covariance integral.

Propagation therefore reduces to independent scalar exponential integrals in the radial and tangential subspaces.

Since

$$
\boldsymbol{\Lambda}_{\hat V} = \sigma_{Vt}^2\,\boldsymbol{I} + \big(\sigma_{Vr}^2 - \sigma_{Vt}^2\big)\,\boldsymbol{u}\boldsymbol{u}^\dagger,
$$

the covariance remains a rank-1 correction of the identity and admits analytic inversion through the Sherman–Morrison formula. Thus, the RT-SDE introduces state-dependent uncertainty while preserving the $\mathcal{O}(d)$ structure required for scalable attention. The full proof is provided in Appendix [B](https://arxiv.org/html/2605.11007#A2).
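To make the $\mathcal{O}(d)$ claim concrete, a minimal sketch (assuming NumPy and real-valued vectors; not from the paper) applies the inverse of the rank-1-corrected covariance to a vector without forming any $d\times d$ matrix, and checks it against a dense solve:

```python
import numpy as np

# Sketch: O(d) application of the precision matrix in Eq. (8) via the
# projector decomposition (equivalently, Sherman-Morrison on a rank-1 update).
def apply_precision(v, u, var_r, var_t):
    radial = np.vdot(u, v) * u            # P_R v = (u^† v) u
    tangential = v - radial               # P_T v
    return radial / var_r + tangential / var_t

rng = np.random.default_rng(0)
d = 8
u = rng.standard_normal(d); u /= np.linalg.norm(u)
v = rng.standard_normal(d)
var_r, var_t = 0.5, 2.0

Lam = var_t * np.eye(d) + (var_r - var_t) * np.outer(u, u)
print(np.allclose(np.linalg.solve(Lam, v), apply_precision(v, u, var_r, var_t)))  # True
```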

### 3.4 Directional Filtering

Under the RT-SDE, radial and tangential uncertainties decouple, and inference is performed over the unit direction $\boldsymbol{u}_i \in \mathcal{S}^{d-1}$.

#### Transported directions.

Each past token $j \le i$ provides a directional observation at time $t_i$ by transporting its direction under the rotational dynamics:

$$
\hat{\boldsymbol{u}}_{ij} = \boldsymbol{\Phi}(\Delta t_{ij})\,\boldsymbol{u}_{z,j}, \qquad \boldsymbol{\Phi}(\tau) = e^{\boldsymbol{\Lambda}_\Omega\tau}. \tag{9}
$$

Thus, attention aggregates a set of transported directions $\{\hat{\boldsymbol{u}}_{ij}\}$ into a consensus estimate of $\boldsymbol{u}_i$.

#### Residual Covariance

The directional residual between the query direction and a transported key direction is:

$$
\boldsymbol{r}_{ij}^{(\mathrm{dir})} = \boldsymbol{u}_{z,i} - \hat{\boldsymbol{u}}_{ij}.
$$

Directional similarity depends on the uncertainty of this residual. In addition to the propagated covariance of the transported token, the query itself introduces directional uncertainty through its own measurement noise.

We model query-side uncertainty through a radial–tangential covariance of the form

$$
\boldsymbol{\Lambda}_\Gamma(t_i) = \gamma_r^2\,\boldsymbol{P}_R(\boldsymbol{u}(t_i)) + \gamma_t^2\,\boldsymbol{P}_T(\boldsymbol{u}(t_i)).
$$

The total residual covariance is therefore

$$
\boldsymbol{\Lambda}_\Sigma(t_i,t_j) = \boldsymbol{\Lambda}_{\hat V}(t_i,t_j) + \boldsymbol{\Lambda}_\Gamma(t_i),
$$

yielding

$$
\boldsymbol{\Lambda}_\Sigma(t_i,t_j) = \sigma_{\Sigma r}^2\,\boldsymbol{P}_R(\boldsymbol{u}(t_i)) + \sigma_{\Sigma t}^2\,\boldsymbol{P}_T(\boldsymbol{u}(t_i)),
$$

where

$$
\sigma_{\Sigma r}^2 = \sigma_{Vr}^2 + \gamma_r^2, \qquad \sigma_{\Sigma t}^2 = \sigma_{Vt}^2 + \gamma_t^2.
$$

Thus, the residual covariance retains the same radial–tangential structure and remains analytically invertible.

#### Directional uncertainty and precision.

Linearizing the normalization map shows that tangential noise with variance $\sigma_{\Sigma t}^2$ induces angular variance:

$$
\sigma_\theta^2 \sim \frac{\sigma_{\Sigma t}^2}{m^2}.
$$

This yields directional precision:

$$
\kappa_{ij} = \bigg(\frac{\sigma_{\Sigma t,i}^2}{m_i^2 + \epsilon} + \frac{\sigma_{\Sigma t,j}^2(\Delta t_{ij})}{\hat m_{ij}^2 + \epsilon} + \tau_\theta^2\bigg)^{-1}, \tag{10}
$$

where $\tau_\theta$ is an angular noise floor and $\epsilon$ is a stability constant, both introduced to prevent $\kappa_{ij}$ from diverging.

The whitened squared angular distance is:

$$
d_{ij}^2 = 2\,\kappa_{ij}\big(1 - \boldsymbol{u}_i^\dagger\hat{\boldsymbol{u}}_{ij}\big). \tag{11}
$$
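A minimal sketch of Eqs. (10)-(11) (assuming NumPy; the argument names and the default values of $\tau_\theta$ and $\epsilon$ are illustrative, not the paper's):

```python
import numpy as np

# Sketch: directional precision (Eq. 10) and whitened angular distance (Eq. 11).
def directional_precision(var_t_i, m_i, var_t_ij, m_hat_ij,
                          tau_theta=1e-2, eps=1e-6):
    return 1.0 / (var_t_i / (m_i**2 + eps)
                  + var_t_ij / (m_hat_ij**2 + eps)
                  + tau_theta**2)

def angular_distance_sq(u_i, u_hat_ij, kappa_ij):
    # 2 kappa (1 - <u_i, u_hat_ij>); real part taken for complex eigenbasis vectors
    return 2.0 * kappa_ij * (1.0 - np.real(np.vdot(u_i, u_hat_ij)))
```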

#### Exact directional estimator.

The RT-SDE and its radial–tangential geometry are defined in the eigenbasis, so inference is performed in the coordinates $\boldsymbol{z}_{s,i} = \boldsymbol{S}^{-1}\boldsymbol{z}_i$. Since $\boldsymbol{S}$ is invertible, optimization in the eigenbasis is equivalent to the corresponding problem in ambient coordinates.

Within the eigenbasis, the linearized normalization map (Eq. [4](https://arxiv.org/html/2605.11007#S3.E4)) separates radial and tangential perturbations to first order. The observed token norm therefore provides a first-order estimate of the latent magnitude, allowing inference to condition on the observed magnitudes $m_j = \|\boldsymbol{z}_{s,j}\|$ and reduce the problem to latent directional estimation alone.

Under this conditioning, the remaining uncertainty affects only the directional degrees of freedom and lies in the tangent plane of the sphere, with variance $\sigma_{\Sigma t}^2/m^2$. Locally, the hypersphere is approximated by its tangent plane, so the directional likelihood becomes Gaussian in tangent-space coordinates. The resulting negative log-likelihood is therefore quadratic in the directional residual:

$$
\min_{\|\boldsymbol{u}_i\|=1} \sum_{j\le i} \kappa_{ij}\,\|\boldsymbol{u}_i - \hat{\boldsymbol{u}}_{ij}\|^2,
$$

which is equivalent to:

$$
\max_{\|\boldsymbol{u}_i\|=1} \boldsymbol{u}_i^\dagger\bigg(\sum_{j\le i} \kappa_{ij}\,\hat{\boldsymbol{u}}_{ij}\bigg),
$$

with solution:

$$
\boldsymbol{u}_i^{*} = \mathrm{Norm}\bigg(\sum_{j\le i} \kappa_{ij}\,\hat{\boldsymbol{u}}_{ij}\bigg). \tag{12}
$$

This corresponds to a normalized precision-weighted consensus on the sphere. Attention instead takes an incremental step toward this consensus direction:

$$
\bar{\boldsymbol{u}}_i = \sum_{j\le i} A_{ij}\,\hat{\boldsymbol{u}}_{ij}, \qquad A_{ij} = \frac{\kappa_{ij}}{\sum_{j'} \kappa_{ij'}}. \tag{13}
$$

Thus, $\boldsymbol{u}_i^{*} = \mathrm{Norm}(\bar{\boldsymbol{u}}_i)$, so attention computes the unnormalized version of the exact MLE.
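In code, the consensus step of Eqs. (12)-(13) is a normalized weighted average; a minimal sketch (assuming NumPy and real-valued stand-ins, with transported directions stacked row-wise):

```python
import numpy as np

# Sketch: precision-weighted directional consensus (Eqs. 12-13).
def consensus(u_hat: np.ndarray, kappa: np.ndarray):
    A = kappa / kappa.sum()                    # attention weights A_ij (Eq. 13)
    u_bar = A @ u_hat                          # incremental estimate \bar{u}_i
    u_star = u_bar / np.linalg.norm(u_bar)     # exact directional MLE (Eq. 12)
    return u_bar, u_star
```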

#### Robust reweighting.

As in isotropic RFA, we introduce robust M-estimation weights to downweight inconsistent observations. Here, robustness is applied to directional disagreement on the hypersphere through the angular distance $d_{ij}^2$:

$$
w_{ij} = \bigg(1 + \frac{d_{ij}^2}{\nu}\bigg)^{-\kappa}, \qquad \tilde{\kappa}_{ij} = w_{ij}\,\kappa_{ij}. \tag{14}
$$
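A one-function sketch of the reweighting (plain Python; the defaults for $\nu$ and the exponent are illustrative, not the paper's):

```python
# Sketch: robust reweighting (Eq. 14) applied to the directional precisions.
def robust_precision(kappa_ij, d_sq, nu=4.0, kappa_exp=1.0):
    w = (1.0 + d_sq / nu) ** (-kappa_exp)   # power-law (Student-t style) weight
    return w * kappa_ij                     # \tilde{kappa}_{ij}
```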

#### Geometric filtering update.

We represent the RT filter state in eigenbasis coordinates $\boldsymbol{z}_{s,i} = m_i\,\boldsymbol{u}_{z,i}$, where the spherical geometry is exact. The precision-weighted consensus $\bar{\boldsymbol{u}}_i$ defines a local directional update on the hypersphere, while its norm $\|\bar{\boldsymbol{u}}_i\|$ encodes the concentration of the directional evidence.

The exact RT-filter update corresponds to geodesic motion on the hypersphere toward the consensus direction. To obtain a tractable update compatible with additive residual dynamics, we instead perform a local first-order approximation in the tangent space. Removing the component parallel to the current state yields the projected update

$$
\boldsymbol{z}_{s,i}^{+} = \boldsymbol{z}_{s,i} + r\,\Pi_{\boldsymbol{z}_{s,i}}(\bar{\boldsymbol{u}}_i), \qquad \boldsymbol{u}_i^{+} = \mathrm{Norm}\big(\boldsymbol{z}_{s,i}^{+}\big), \tag{15}
$$

where $\Pi_{\boldsymbol{z}_{s,i}}$ denotes projection onto the tangent space at $\boldsymbol{z}_{s,i}$, and $r > 0$ controls the update scale.

The induced angular update scales as

$$
\|\Delta\boldsymbol{u}_i\| \sim \frac{r\,\|\bar{\boldsymbol{u}}_i\|}{m_i}, \tag{16}
$$

so large-magnitude states are more stable, while diffuse directional evidence produces smaller updates. Magnitude $m_i$ therefore acts as directional inertia, while $\|\bar{\boldsymbol{u}}_i\|$ controls the adaptive step size.
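The full update of Eq. (15) is then a tangent-space step followed by renormalization; a minimal sketch (assuming NumPy and real-valued stand-ins for the eigenbasis coordinates):

```python
import numpy as np

# Sketch: tangent step plus retraction (Eq. 15), the geometric "add & norm".
def rt_filter_update(z_s: np.ndarray, u_bar: np.ndarray, r: float):
    u = z_s / np.linalg.norm(z_s)
    tangent = u_bar - (u @ u_bar) * u          # Pi_{z_s}(u_bar): drop radial part
    z_plus = z_s + r * tangent                 # residual connection (tangent step)
    u_plus = z_plus / np.linalg.norm(z_plus)   # normalization = retraction
    return z_plus, u_plus
```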

![Refer to caption](https://arxiv.org/html/2605.11007v1/images/explanations/pulled_forward_sphere3.png)
(a) Sequential measurements are mapped forward to the current time $t_i$ via rotational dynamics $e^{\boldsymbol{\Lambda}_\Omega\Delta t}$, forming a precision-weighted consensus cloud $\{\hat{\boldsymbol{u}}_{s,ij}\}_{j\le i}$ on the hypersphere.
![Refer to caption](https://arxiv.org/html/2605.11007v1/images/explanations/tangent_step3.png)
(b) A tangent-space filtering update moves the state toward the consensus direction, while normalization retracts the updated state back to the hypersphere.

Figure 2: Illustration of the RT-Filter. Transported directional observations form a precision-weighted consensus on the hypersphere, followed by a local tangent-space filtering update and retraction back onto the sphere.

### 3.5 The Transformer as an RT-Filter

We now show how the Transformer attention block emerges as a first-order approximation to the RT-Filter. Attention, residual connections, and normalization arise naturally as components of directional state estimation on the hypersphere.

#### Transformer implementation.

As in Isotropic RFA, the projection matrices $\boldsymbol{W}_q, \boldsymbol{W}_k, \boldsymbol{W}_v$ absorb the diagonalizing matrix $\boldsymbol{S}^{-1}$, while the output projection $\boldsymbol{W}_o$ absorbs the mapping $\boldsymbol{S}$ back to the original basis.

The RT filter update derived above is naturally defined in the eigenbasis, where the radial–tangential geometry is exact:

$$
\boldsymbol{z}_{s,i}^{+} \approx \boldsymbol{z}_{s,i} + r\,\bar{\boldsymbol{u}}_i.
$$

Hence,

$$
\boldsymbol{z}_i^{+} \approx \boldsymbol{W}_o\big(\boldsymbol{z}_{s,i} + r\,\bar{\boldsymbol{u}}_i\big) = \boldsymbol{W}_o\boldsymbol{W}_v\,\boldsymbol{z}_i + r\,\boldsymbol{W}_o\bar{\boldsymbol{u}}_i.
$$

For exact preservation of the latent state under transport through the value space, one would ideally have $\boldsymbol{W}_o\boldsymbol{W}_v \approx \boldsymbol{I}$. The Transformer residual structure enforces this identity pathway explicitly, yielding the additive update

$$
\boldsymbol{z}_i^{+} = \boldsymbol{z}_i + r\,\boldsymbol{W}_o\bar{\boldsymbol{u}}_i. \tag{17}
$$

Thus, the residual connection preserves the original representation while attention contributes only the directional filtering correction.

In high-dimensional embeddings with approximately isotropic coordinates, $\|\boldsymbol{z}_i\|^2 \sim d$, so dimension-independent angular updates require $r \propto \sqrt{d}$. Writing $r = \gamma\sqrt{d}$, the scale $\gamma$ corresponds naturally to the learned normalization gain used in RMSNorm-like architectures.

In summary, attention computes a directional consensus estimate, the residual connection applies the corresponding filtering correction, and normalization approximately retracts the updated state onto the hypersphere. The Transformer block therefore implements a first-order directional filtering step under the RT-SDE.

### 3.6 The RT-Transformer

The preceding derivation motivates a geometrically consistent Transformer variant, which we term the *RT-Transformer*.

The RT-Transformer modifies the standard Transformer block in three ways:

1. Attention weights incorporate magnitude-dependent directional precision derived from the RT-SDE;
2. Queries, keys, and values are normalized in the learned eigenbasis so that attention operates on hyperspherical directional states;
3. Residual updates use tangent-space filtering corrections that remove residual components parallel to the current state.

#### Tangent-space residual updates.

The additive update includes a component parallel to the current state that does not contribute to directional change after normalization, but does affect magnitude and introduces a bias toward self-reinforcement.

Since the spherical geometry is defined in the eigenbasis coordinates, the tangent projection is naturally performed before mapping back to the ambient residual space:

$$
\boldsymbol{z}_i^{+} = \boldsymbol{z}_i + r\,\boldsymbol{W}_o\,\Pi_{\boldsymbol{z}_{s,i}}(\bar{\boldsymbol{u}}_i), \qquad \boldsymbol{z}_{s,i} = \boldsymbol{W}_v\,\boldsymbol{z}_i. \tag{18}
$$

This yields a tangent-space update that more faithfully matches the underlying spherical filtering geometry while preserving the additive residual structure.
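A minimal sketch of this modified residual path (assuming NumPy and real-valued stand-ins for the complex projections; `W_v`, `W_o`, and `r` play the roles of $\boldsymbol{W}_v$, $\boldsymbol{W}_o$, and the step scale):

```python
import numpy as np

# Sketch: RT-Transformer tangent-space residual update (Eq. 18).
def rt_residual_update(z, u_bar, W_v, W_o, r):
    z_s = W_v @ z                            # current state in the eigenbasis
    u = z_s / np.linalg.norm(z_s)
    tangent = u_bar - (u @ u_bar) * u        # remove component parallel to z_s
    return z + r * (W_o @ tangent)           # additive residual in ambient space
```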

#### Riemannian iterative estimator.

The resulting architecture admits an interpretation as a stacked Riemannian iterative state estimator, in which layers repeatedly transport, reweight, and refine directional estimates through precision-weighted consensus updates (Appendix [D.4](https://arxiv.org/html/2605.11007#A4.SS4)).

#### Implementation.

As in Isotropic RFA, the model is implemented in $\mathbb{R}^{2d}$ using paired real and imaginary channels. Complex rotations reduce to standard sine–cosine RoPE operations, and all computations can be performed using ordinary real-valued Transformer primitives. The full algorithm is provided in Algorithm [2](https://arxiv.org/html/2605.11007#algorithm2).

The present work focuses on the theoretical foundations and geometric interpretation of RT filtering. Comprehensive empirical evaluation and large-scale implementation studies will be reported in future work.

## 4 Conclusion

We presented a theoretical framework that interprets the Transformer block as an approximate filtering step for latent states evolving under stochastic dynamics. We introduced the Radial–Tangential SDE (RT-SDE), a directional stochastic process whose covariance structure separates radial and tangential uncertainty while preserving the tractable precision structure required for scalable attention. Under this model, attention emerges as a precision-weighted directional estimator on the hypersphere, while residual addition and normalization correspond to a first-order geometric filtering update.

This perspective provides a generative interpretation of Transformer dynamics and connects attention mechanisms to classical ideas from stochastic filtering, robust estimation, and dynamical systems. It also suggests several directions for future work, including architectures that more explicitly align attention updates with the geometry of the underlying dynamics, and broader investigations of structured dynamical priors for sequence modeling.

## References

- A. S. Bianchessi, Y. C. Aguirre, R. C. Barros, and L. S. Kupssinskü (2026). Bayesian attention mechanism: a probabilistic framework for positional encoding and context length extrapolation. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=dXJB9O8fLd)
- S. Brody, U. Alon, and E. Yahav (2023). On the expressivity role of LayerNorm in Transformers' attention. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, pp. 14211–14221. [Link](https://aclanthology.org/2023.findings-acl.895/), [DOI](https://dx.doi.org/10.18653/v1/2023.findings-acl.895)
- P. Gabbur, M. Bilkhu, and J. Movellan (2021). Probabilistic attention for interactive segmentation. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS '21, Red Hook, NY, USA. ISBN 9781713845393.
- B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet (2025). A mathematical perspective on Transformers. Bulletin of the American Mathematical Society 62, pp. 427–479. [DOI](https://dx.doi.org/10.1090/bull/1863)
- A. Henry, P. R. Dachapally, S. S. Pawar, and Y. Chen (2020). Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 4246–4253. [Link](https://aclanthology.org/2020.findings-emnlp.379/), [DOI](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.379)
- I. Loshchilov, C. Hsieh, S. Sun, and B. Ginsburg (2025). nGPT: normalized transformer with representation learning on the hypersphere. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=se4vjm7h4E)
- R. Molina (2024). Traveling words: a geometric interpretation of transformers. [Link](https://openreview.net/forum?id=cSSHiLnjsJ)
- P. Racioppo (2026). Robust filter attention: self-attention as precision-weighted state estimation. arXiv:2509.04154. [Link](https://arxiv.org/abs/2509.04154)
- H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler, L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, V. Greiff, D. Kreil, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter (2021). Hopfield networks is all you need. arXiv:2008.02217. [Link](https://arxiv.org/abs/2008.02217)
- Y. H. Tsai, S. Bai, M. Yamada, L. Morency, and R. Salakhutdinov (2019). Transformer dissection: a unified understanding for Transformer's attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4344–4353. [Link](https://aclanthology.org/D19-1443/), [DOI](https://dx.doi.org/10.18653/v1/D19-1443)
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30. [Link](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)
- X. Wu, A. Ajorlou, Y. Wang, S. Jegelka, and A. Jadbabaie (2024). On the role of attention masks and layernorm in transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=lIH6oCdppg)
- C. Zheng, J. Sun, Y. Gao, C. Wang, Y. Wang, J. Xiong, L. Ren, B. Peng, Q. Wang, X. Shang, M. Schwager, A. Schneider, Y. Nevmyvaka, and X. Liu (2026). GeoNorm: unify pre-norm and post-norm with geodesic optimization. arXiv:2601.22095. [Link](https://arxiv.org/abs/2601.22095)

## Appendix Table of Contents

1. Appendix [A](https://arxiv.org/html/2605.11007#A1): Background: Isotropic Filtering and Attention. Reviews the isotropic RFA formulation, deriving attention as a precision-weighted state estimator under linear stochastic dynamics.
2. Appendix [B](https://arxiv.org/html/2605.11007#A2): Radial–Tangential SDE Model. Introduces the RT-SDE with state-dependent radial and tangential dynamics, and shows that covariance propagation can be carried out in closed form.
3. Appendix [C](https://arxiv.org/html/2605.11007#A3): Directional Filtering under the RT-SDE. Derives directional state estimation on the sphere, including angular uncertainty, precision, and the resulting attention mechanism.
4. Appendix [D](https://arxiv.org/html/2605.11007#A4): Connection to the Transformer. Shows how the Transformer block arises as an approximation of the RT-Filter.

## Appendix A: Background: Isotropic Filtering and Attention

Here, we summarize the formulation of Robust Filter Attention (RFA) from (Racioppo, [2026](https://arxiv.org/html/2605.11007#bib.bib209)), which provides the foundation for the RT-Transformer developed in this paper.

### A.1 Linear SDE and State Transport

In RFA, queries and keys are taken to be noisy measurements of a latent process. In particular, we model latent representations with a linear time-invariant SDE:

$$
d\boldsymbol{x}(t) = \boldsymbol{A}\boldsymbol{x}(t)\,dt + \boldsymbol{G}\,d\boldsymbol{w}(t), \qquad \boldsymbol{z}_i = \boldsymbol{x}(t_i) + \boldsymbol{v}_i,
$$

where $\boldsymbol{v}(t_i) \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{R})$ is Gaussian measurement noise, $d\boldsymbol{w}(t)$ is a standard Wiener process, and $\boldsymbol{Q} = \boldsymbol{G}\boldsymbol{G}^\top$ is the process noise covariance.

Under these dynamics, a past observation (key) at time $t_j$ induces a prediction of the latent state at time $t_i$:

$$
\hat{\boldsymbol{z}}_{ij} = e^{\boldsymbol{A}\Delta t}\,\boldsymbol{z}_j.
$$

### A.2 Covariance Propagation

Uncertainty accumulates under the dynamics according to the differential Lyapunov equation (DLE):

$$
\frac{d}{ds}\boldsymbol{V}(s) = \boldsymbol{A}\boldsymbol{V}(s) + \boldsymbol{V}(s)\boldsymbol{A}^\top + \boldsymbol{Q}, \qquad \boldsymbol{V}(0) = 0,
$$

whose solution is:

$$
\boldsymbol{V}(\Delta t) = \int_0^{\Delta t} e^{\boldsymbol{A}s}\,\boldsymbol{Q}\,e^{\boldsymbol{A}^\top s}\,ds.
$$

The total covariance of a transported observation must account for uncertainty in both the transported key and the query, since the estimator operates on their difference. Specifically, the residual $\boldsymbol{r}_{ij} = \boldsymbol{z}_i - \hat{\boldsymbol{z}}_{ij}$ combines independent noise contributions from the transported observation and the local measurement. The residual is distributed as:

$$
\boldsymbol{r}_{ij} \sim \mathcal{N}\big(\boldsymbol{0},\; \boldsymbol{\Sigma}_{ij}\big),
$$

where the covariance is:

$$
\boldsymbol{\Sigma}_{ij} = \boldsymbol{V}(\Delta t_{ij}) + e^{\boldsymbol{A}\Delta t_{ij}}\,\boldsymbol{R}\,e^{\boldsymbol{A}^\top\Delta t_{ij}} + \boldsymbol{R}_\Gamma,
$$

where the first two terms capture process and measurement uncertainty associated with the transported key, and $\boldsymbol{R}_\Gamma$ represents the measurement noise of the query $\boldsymbol{z}_i$.

The corresponding precision is:

$$
\boldsymbol{P}_{ij} = \boldsymbol{\Sigma}_{ij}^{-1}.
$$

We measure consistency using the Mahalanobis distance:

$$
d_{ij}^2 = \boldsymbol{r}_{ij}^\top\,\boldsymbol{P}_{ij}\,\boldsymbol{r}_{ij}.
$$

### A.3 Diagonalization and Closed-Form Solution

To obtain a tractable form, we assume the system is simultaneously diagonalizable:

$$
\boldsymbol{A} = \boldsymbol{S}\boldsymbol{\Lambda}\boldsymbol{S}^{-1}, \qquad \boldsymbol{Q} = \boldsymbol{S}\boldsymbol{\Lambda}_Q\boldsymbol{S}^\dagger, \qquad \boldsymbol{R} = \boldsymbol{S}\boldsymbol{\Lambda}_R\boldsymbol{S}^\dagger, \qquad \boldsymbol{R}_\Gamma = \boldsymbol{S}\boldsymbol{\Lambda}_\Gamma\boldsymbol{S}^\dagger,
$$

where $\boldsymbol{\Lambda}, \boldsymbol{\Lambda}_Q, \boldsymbol{\Lambda}_R, \boldsymbol{\Lambda}_\Gamma$ are diagonal, with $k$th entries $\lambda_k, \lambda_{Q,k}, \lambda_{R,k}, \lambda_{\Gamma,k}$, respectively.

In this basis, the dynamics decouple into independent modes, and the covariance admits a closed-form solution:

$$
\boldsymbol{V}(\Delta t) = \boldsymbol{S}\boldsymbol{\Lambda}_V(\Delta t)\boldsymbol{S}^\dagger,
$$

where $\boldsymbol{\Lambda}_V(\Delta t)$ is diagonal with $k$th entry:

$$
\lambda_{V,k}(\Delta t) = \lambda_{Q,k}\,\frac{1 - e^{2\mathrm{Re}(\lambda_k)\Delta t}}{-2\,\mathrm{Re}(\lambda_k)}.
$$

Thus, uncertainty propagation reduces to independent scalar processes along each mode.
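Per mode, this is a one-line computation; a minimal sketch (assuming NumPy; the names are ours):

```python
import numpy as np

# Sketch: per-mode accumulated variance lambda_{V,k}(dt) from the formula above.
def lambda_V(lam_k: complex, lam_Q_k: float, dt: float) -> float:
    re = np.real(lam_k)
    if re == 0.0:
        return lam_Q_k * dt   # limit of the expression as Re(lambda_k) -> 0
    return lam_Q_k * (1.0 - np.exp(2.0 * re * dt)) / (-2.0 * re)
```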

### A.4 Precision-Weighted Estimation

Since transported observations are jointly dependent under the SDE, exact inference requires sequential filtering. We instead adopt a mean-field approximation, treating them as conditionally independent given the latent state. Under this assumption, the latent state at position $i$ may be estimated by minimizing a precision-weighted least-squares objective:

$$
\bar{\boldsymbol{z}}_i = \arg\min_{\boldsymbol{x}} \sum_{j\le i} (\boldsymbol{x} - \hat{\boldsymbol{z}}_{ij})^\top\,\boldsymbol{P}_{ij}\,(\boldsymbol{x} - \hat{\boldsymbol{z}}_{ij}).
$$

This yields the closed-form estimator:

$$
\bar{\boldsymbol{z}}_i = \Big(\sum_{j\le i} \boldsymbol{P}_{ij}\Big)^{-1} \sum_{j\le i} \boldsymbol{P}_{ij}\,\hat{\boldsymbol{z}}_{ij}.
$$

### A.5 Robust Reweighting & Attention Form

To account for model mismatch and outliers, we introduce data-dependent weights based on residual consistency:

$$
w_{ij} = w(d_{ij}^2).
$$

This yields a robust M-estimator:

$$
\bar{\boldsymbol{z}}_i = \Big(\sum_{j\le i} w_{ij}\,\boldsymbol{P}_{ij}\Big)^{-1} \sum_{j\le i} w_{ij}\,\boldsymbol{P}_{ij}\,\hat{\boldsymbol{z}}_{ij},
$$

which can be expressed in the diagonalized basis as:

$$
\bar{\boldsymbol{z}}_{s,i} = \Big(\sum_{j\le i} w_{ij}\,\boldsymbol{\Lambda}_{P,ij}\Big)^{-1} \sum_{j\le i} w_{ij}\,\boldsymbol{\Lambda}_{P,ij}\,\hat{\boldsymbol{z}}_{s,ij},
$$

where:

$$
\bar{\boldsymbol{z}}_{s,i} := \boldsymbol{S}^{-1}\bar{\boldsymbol{z}}_i, \qquad \hat{\boldsymbol{z}}_{s,ij} := e^{\boldsymbol{\Lambda}\Delta t_{ij}}\,\boldsymbol{z}_{s,j}.
$$

For exponential weighting, $w_{ij} \propto \exp(-d_{ij}^2)$, this reduces to a Softmax over pairwise scores. Alternatively, one may adopt a power-law family $w_{ij} \propto \big(1 + \tfrac{d_{ij}^2}{\nu}\big)^{-\kappa}$ for resistance to outliers, where $\nu$ and $\kappa$ are scalar robustness parameters.

The estimator can be written as a normalized weighted sum:

$$
\bar{\boldsymbol{z}}_{s,i} = \sum_{j\le i} \mathcal{A}_{ij}\,\hat{\boldsymbol{z}}_{s,ij}, \qquad \mathcal{A}_{ij} = \frac{w_{ij}\,\boldsymbol{\Lambda}_{P,ij}}{\sum_{k\le i} w_{ik}\,\boldsymbol{\Lambda}_{P,ik}}.
$$

Thus, attention arises as a precision-weighted aggregation of transported predictions, with weights determined by both dynamical reliability and data-dependent consistency.
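In the eigenbasis the precisions are diagonal, so the estimator is a coordinate-wise normalized weighted sum; a minimal sketch (assuming NumPy; `z_hat` stacks transported predictions row-wise, `prec` the diagonal precisions, `w` the robustness weights):

```python
import numpy as np

# Sketch: robust precision-weighted estimate in the diagonalized basis.
def rfa_estimate(z_hat: np.ndarray, prec: np.ndarray, w: np.ndarray):
    num = (w[:, None] * prec * z_hat).sum(axis=0)  # sum_j w_ij Lambda_P z_hat_ij
    den = (w[:, None] * prec).sum(axis=0)          # sum_j w_ij Lambda_P,ij
    return num / den                               # \bar{z}_{s,i}, per coordinate
```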

### A.6 Isotropic RFA Mechanism

The transformation to the decoupled eigenbasis is learned via complex-valued projections. We define:

$$
\boldsymbol{W}_q,\, \boldsymbol{W}_k,\, \boldsymbol{W}_v,\, \boldsymbol{W}_o \in \mathbb{C}^{d\times d},
$$

where $d$ is the embedding dimension. The input projections $\{\boldsymbol{W}_q, \boldsymbol{W}_k, \boldsymbol{W}_v\}$ parameterize the learned diagonalizing basis $\boldsymbol{S}^{-1}$, mapping inputs into the eigenbasis where the DLE admits a closed-form solution. The output projection $\boldsymbol{W}_o$ parameterizes $\boldsymbol{S}$, mapping the filtered state estimate back to the original embedding space.

In the general case of anisotropic $\boldsymbol{\Lambda}, \boldsymbol{\Lambda}_Q, \boldsymbol{\Lambda}_R, \boldsymbol{\Lambda}_\Gamma$, storing the full attention tensor requires $\mathcal{O}(N^2 d)$ memory. To obtain a scalable formulation, we assume a shared decay rate and isotropic noise within each head:

$$
\boldsymbol{A} = -\mu\boldsymbol{I} + \boldsymbol{\Omega}, \qquad \lambda_k = -\mu + i\,\omega_k,
$$

with scalar noise parameters:

$$
\boldsymbol{\Lambda}_Q = \sigma^2\boldsymbol{I}, \qquad \boldsymbol{\Lambda}_R = \eta^2\boldsymbol{I}, \qquad \boldsymbol{\Lambda}_\Gamma = \gamma^2\boldsymbol{I}.
$$

Under these assumptions, the propagated covariance reduces to a scalar kernel depending only on the time lag $\tau = |i - j|$:

$$
\Sigma^2(\tau) = \tilde{\sigma}^2\big(1 - e^{-2\mu\tau}\big) + \eta^2 e^{-2\mu\tau} + \gamma^2.
$$

The corresponding precision is:

$$
\boldsymbol{P}_{\Delta t}[i,j] = \Sigma^{-2}(|i-j|).
$$

The isotropic constraint allows the dynamics to be factored into a stable decay term and complex forward/backward rotations:

$$
\boldsymbol{E}[i,j] = e^{-\mu|t_i - t_j|}, \qquad \tilde{\boldsymbol{\Phi}}^{+}[k,i] := e^{i\omega_k t_i}, \qquad \tilde{\boldsymbol{\Phi}}^{-}[k,i] := e^{-i\omega_k t_i}.
$$

We define backward-rotated queries, keys, and values:

$$
\tilde{\boldsymbol{Q}} := \tilde{\boldsymbol{\Phi}}^{-}\odot\boldsymbol{Q}, \qquad \tilde{\boldsymbol{K}} := \tilde{\boldsymbol{\Phi}}^{-}\odot\boldsymbol{K}, \qquad \tilde{\boldsymbol{V}} := \tilde{\boldsymbol{\Phi}}^{-}\odot\boldsymbol{V}.
$$

The squared residual norm may then be written as:

$$
\|\boldsymbol{R}_{ij}\|^2 = \|\boldsymbol{Q}_i\|^2 + \boldsymbol{E}[i,j]^2\,\|\boldsymbol{K}_j\|^2 - 2\,\boldsymbol{E}[i,j]\,\mathrm{Re}\big(\tilde{\boldsymbol{Q}}_i^\dagger\tilde{\boldsymbol{K}}_j\big).
$$

The Mahalanobis distance is then:

$$
\boldsymbol{D}^2[i,j] = \boldsymbol{P}_{\Delta t}[i,j]\cdot\|\boldsymbol{R}_{ij}\|^2.
$$

We define logits using a robust influence function:

$$
\boldsymbol{L}[i,j] = \log\big(\boldsymbol{P}_{\Delta t}[i,j]\big) - \kappa\log\Big(1 + \frac{1}{\nu}\boldsymbol{D}^2[i,j]\Big), \qquad \kappa = \frac{\nu + d}{d}.
$$
The attention weights are obtained via masked Softmax:

$$
\boldsymbol{A}[i,j] = \mathrm{Softmax}_j\big(\beta_s\,\boldsymbol{L}[i,j] + \boldsymbol{M}_{\text{causal}}[i,j]\big),
$$

where $\beta_s$ is an inverse temperature parameter and $\boldsymbol{M}_{\text{causal}}$ is a causal mask.

Define the decayed attention matrix:

$$
\hat{\boldsymbol{A}}[i,j] = \boldsymbol{A}[i,j]\cdot\boldsymbol{E}[i,j].
$$

The output is computed as:

$$
\bar{\boldsymbol{V}} = \tilde{\boldsymbol{\Phi}}^{+}\odot\big(\tilde{\boldsymbol{V}}\,\hat{\boldsymbol{A}}^\top\big).
$$

This yields a rotate–aggregate–counter-rotate structure, as values must be aggregated in the stationary eigenbasis and then counter-rotated to restore the output to the value frame, ensuring dynamical consistency across the sequence.
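Putting the pieces of this appendix together, here is a compact sketch of the isotropic RFA pass (assuming complex NumPy arrays with channels along rows and positions along columns; this is our reading of the equations above, not the authors' reference implementation, and the defaults for $\nu$ and $\beta_s$ are illustrative):

```python
import numpy as np

# Sketch: isotropic RFA attention (rotate, score, softmax, aggregate, counter-rotate).
def rfa_attention(Q, K, V, t, omega, mu, sigma2_tilde, eta2, gamma2,
                  nu=4.0, beta_s=1.0):
    d, N = Q.shape
    phi_minus = np.exp(-1j * np.outer(omega, t))          # backward rotation Phi^-[k,i]
    Qt, Kt, Vt = phi_minus * Q, phi_minus * K, phi_minus * V

    E = np.exp(-mu * np.abs(t[:, None] - t[None, :]))     # decay factor E[i,j]
    P = 1.0 / (sigma2_tilde * (1 - E**2) + eta2 * E**2 + gamma2)  # scalar precision

    q2 = np.sum(np.abs(Q)**2, axis=0)                     # ||Q_i||^2
    k2 = np.sum(np.abs(K)**2, axis=0)                     # ||K_j||^2
    R2 = q2[:, None] + E**2 * k2[None, :] \
         - 2.0 * E * np.real(Qt.conj().T @ Kt)            # squared residual ||R_ij||^2

    kappa = (nu + d) / d
    L = np.log(P) - kappa * np.log1p(P * R2 / nu)         # robust logits
    L = np.where(np.tril(np.ones((N, N), dtype=bool)), L, -np.inf)  # causal mask

    logits = beta_s * L
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                     # softmax over keys j

    A_hat = A * E                                         # decayed attention matrix
    return np.exp(1j * np.outer(omega, t)) * (Vt @ A_hat.T)  # counter-rotate output
```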

### A.7 Directional Case: Removal of Temporal Decay

We consider the case in which queries and keys represent directions with fixed norm:

$$
\|\boldsymbol{Q}_i\| = \|\boldsymbol{K}_j\| = r.
$$

In this case, the exponential decay factor $\boldsymbol{E}[i,j]$ disappears. The residual becomes:

$$
\|\boldsymbol{R}_{qk}[i,j]\|^2 = \|\boldsymbol{Q}_i\|^2 + \|\boldsymbol{K}_j\|^2 - 2\,\mathrm{Re}\big(\tilde{\boldsymbol{Q}}_i^\dagger\tilde{\boldsymbol{K}}_j\big) = 2r^2 - 2\,\mathrm{Re}\big(\tilde{\boldsymbol{Q}}_i^\dagger\tilde{\boldsymbol{K}}_j\big).
$$

Value aggregation becomes:

𝑽¯=𝚽~\+⊙\(𝑽~​𝑨⊤\),\\boldsymbol\{\\bar\{V\}\}=\\boldsymbol\{\\tilde\{\\Phi\}\}^\{\+\}\\odot\\big\(\\boldsymbol\{\\tilde\{V\}\}\\boldsymbol\{A\}^\{\\top\}\\big\),The decayed attention matrix𝑨^=𝑨⊙𝑬\\boldsymbol\{\\hat\{A\}\}=\\boldsymbol\{A\}\\odot\\boldsymbol\{E\}does not appear\.

## Appendix B The Radial–Tangential SDE Model

We now extend the RFA framework to state-dependent uncertainty by introducing a Radial–Tangential SDE (RT-SDE), in which the process and measurement noise co-rotate with the latent state direction, distinguishing radial from tangential variability.

For generic state-dependent diffusion, covariance propagation typically depends on the full state trajectory and does not admit a closed-form solution. The key property of the RT-SDE is that this co-rotation causes the time-dependent rotation terms to cancel inside the covariance integral, preserving closed-form covariance propagation despite the state dependence.

### B.1 Radial–Tangential SDE

We consider the linear Itô stochastic differential equation

$$d\boldsymbol{x}(t) = \boldsymbol{A}(t)\,\boldsymbol{x}(t)\,dt + \boldsymbol{G}(t)\,d\boldsymbol{w}(t), \qquad \boldsymbol{z}(t_k) = \boldsymbol{x}(t_k) + \boldsymbol{v}(t_k).$$

Let the dynamics be diagonalizable as $\boldsymbol{A}(t) = \boldsymbol{S}\boldsymbol{\Lambda}(t)\boldsymbol{S}^{-1}$. Transforming into the eigenbasis,

$$\boldsymbol{x}_s(t) = \boldsymbol{S}^{-1}\boldsymbol{x}(t), \qquad \boldsymbol{z}_s(t_k) = \boldsymbol{S}^{-1}\boldsymbol{z}(t_k),$$

we decompose the state into magnitude and direction:

$$\boldsymbol{x}_s(t) = m(t)\,\boldsymbol{u}(t), \qquad m(t) = \|\boldsymbol{x}_s(t)\|_2, \qquad \boldsymbol{u}(t) = \frac{\boldsymbol{x}_s(t)}{\|\boldsymbol{x}_s(t)\|_2}.$$

We assume that decay and process noise act independently in the radial and tangential directions of each token. Defining a projection matrix onto the tangent space $\boldsymbol{P}_T(\boldsymbol{u}) = \boldsymbol{I} - \boldsymbol{u}\boldsymbol{u}^{\dagger}$ and onto the radial direction $\boldsymbol{P}_R(\boldsymbol{u}) = \boldsymbol{u}\boldsymbol{u}^{\dagger}$, we define state-dependent dynamics and process noise as:

$$\boldsymbol{\Lambda}(\boldsymbol{u}(t)) = -\mu_r\,\boldsymbol{P}_R(\boldsymbol{u}(t)) - \mu_t\,\boldsymbol{P}_T(\boldsymbol{u}(t)) + \boldsymbol{\Lambda}_{\Omega}, \qquad \boldsymbol{\Lambda}_G(\boldsymbol{u}(t)) = \sigma_r\,\boldsymbol{P}_R(\boldsymbol{u}(t)) + \sigma_t\,\boldsymbol{P}_T(\boldsymbol{u}(t)),$$

where $\mu_r, \mu_t, \sigma_r, \sigma_t \in \mathbb{R}^{+}$, and $\boldsymbol{\Lambda}_{\Omega} \in i\mathbb{R}^{d\times d}$ is diagonal.

The induced process covariance in the eigenbasis is therefore

$$\boldsymbol{\Lambda}_Q(\boldsymbol{u}(t)) = \boldsymbol{\Lambda}_G(\boldsymbol{u}(t))\,\boldsymbol{\Lambda}_G(\boldsymbol{u}(t))^{\dagger} = \sigma_r^2\,\boldsymbol{P}_R(\boldsymbol{u}(t)) + \sigma_t^2\,\boldsymbol{P}_T(\boldsymbol{u}(t)),$$

using the orthogonality and idempotence of $\boldsymbol{P}_R$ and $\boldsymbol{P}_T$.
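The projector identities used here are easy to verify numerically; a minimal sketch (dimensions and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
u = rng.normal(size=d); u /= np.linalg.norm(u)   # unit direction

P_R = np.outer(u, u)          # radial projector u u^†
P_T = np.eye(d) - P_R         # tangential projector I - u u^†

# Idempotence and mutual orthogonality
assert np.allclose(P_R @ P_R, P_R)
assert np.allclose(P_T @ P_T, P_T)
assert np.allclose(P_R @ P_T, 0)

# Λ_G Λ_G^† = σ_r² P_R + σ_t² P_T
sigma_r, sigma_t = 0.5, 1.5
Lam_G = sigma_r * P_R + sigma_t * P_T
assert np.allclose(Lam_G @ Lam_G.T, sigma_r**2 * P_R + sigma_t**2 * P_T)
print("projector identities verified")
```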

Hence, the SDE becomes:

$$d\boldsymbol{x}_s = \big(-\mu_r\,\boldsymbol{P}_R(\boldsymbol{u}) - \mu_t\,\boldsymbol{P}_T(\boldsymbol{u}) + \boldsymbol{\Lambda}_{\Omega}\big)\,\boldsymbol{x}_s\,dt + \sigma_r\,d\boldsymbol{w}_r + \sigma_t\,d\boldsymbol{w}_t,$$

where $d\boldsymbol{w}_t = \boldsymbol{P}_T(\boldsymbol{u})\,d\boldsymbol{w}$ and $d\boldsymbol{w}_r = \boldsymbol{P}_R(\boldsymbol{u})\,d\boldsymbol{w} = \boldsymbol{u}\,dw_r$, with $dw_r = \boldsymbol{u}^{\dagger}d\boldsymbol{w}$.

#### Polar decomposition of the dynamics.

Applying Itô’s lemma to $m = \|\boldsymbol{x}_s\|$ and the product rule to $\boldsymbol{u} = \boldsymbol{x}_s/m$ yields the coupled system:

$$dm = -\tilde{\mu}_r\,m\,dt + \sigma_r\,dw_r, \qquad d\boldsymbol{u} = \big(-\tilde{\mu}_t + \boldsymbol{\Lambda}_{\Omega}\big)\,\boldsymbol{u}\,dt + \frac{\sigma_t}{m}\,d\boldsymbol{w}_t,$$

where $\tilde{\mu}_t := \frac{\sigma_t^2(d-1)}{2m^2}$ and $\tilde{\mu}_r := \mu_r - \tilde{\mu}_t$. The term $\tilde{\mu}_t$ arises from the quadratic variation of the tangential noise, which contributes $\sigma_t^2(d-1)\,dt$ to $d\langle\boldsymbol{x}_s, \boldsymbol{x}_s\rangle$; the radial Itô correction cancels exactly. Notably, even when $\sigma_r = 0$, tangential diffusion induces a positive drift in magnitude. This polar decomposition is nonlinear due to the normalization constraint, but we do not perform inference in these coordinates; all propagation remains in Cartesian space, where the dynamics are linear.
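The positive magnitude drift induced by tangential diffusion can be checked with a small Euler–Maruyama simulation; a sketch with $\sigma_r = 0$ and $\mu_r = \mu_t = 0$ to isolate the Itô correction (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, steps, dt = 16, 2000, 1e-3
sigma_t = 1.0                      # tangential diffusion only: sigma_r = 0, mu_r = mu_t = 0

x = np.zeros(d); x[0] = 1.0        # start on the unit sphere
for _ in range(steps):
    u = x / np.linalg.norm(x)
    dw = rng.normal(scale=np.sqrt(dt), size=d)
    dw_t = dw - u * (u @ dw)       # P_T(u) dw: purely tangential noise increment
    x = x + sigma_t * dw_t

# Predicted growth: m^2(t) = m^2(0) + sigma_t^2 (d-1) t, from the quadratic variation
print("simulated final magnitude:", np.linalg.norm(x))
print("predicted final magnitude:",
      np.sqrt(1.0 + sigma_t**2 * (d - 1) * steps * dt))
```

In this configuration $m^2(t)$ grows deterministically in the continuous limit, since the martingale contribution to $m^2$ vanishes when the noise is purely tangential.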

The tangential decay term $-\mu_t\,\boldsymbol{P}_T(\boldsymbol{u})\,\boldsymbol{x}$ vanishes identically when expressed in polar coordinates, since it acts orthogonally to the state direction. As a result, $\mu_t$ does not contribute to the deterministic evolution of $\boldsymbol{u}$, and directional transport is governed solely by the rotational component $\boldsymbol{\Lambda}_{\Omega}$. However, $\mu_t$ remains essential in the stochastic dynamics: it controls the rate of tangential diffusion and therefore enters the propagated covariance.

### B.2 Radial–Tangential Measurement Model

We assume that measurement noise in the eigenbasis also follows a radial–tangential decomposition aligned with the instantaneous state direction:

$$\boldsymbol{v}(t_k) \sim \mathcal{N}\big(\boldsymbol{0},\ \boldsymbol{\Lambda}_R(t_k)\big), \qquad \boldsymbol{\Lambda}_R(t) = \eta_r^2\,\boldsymbol{P}_R(\boldsymbol{u}) + \eta_t^2\,\boldsymbol{P}_T(\boldsymbol{u}).$$

The measurement therefore decomposes as:

$$\boldsymbol{z}_s = (m + v_r)\,\boldsymbol{u} + \boldsymbol{P}_T(\boldsymbol{u})\,\boldsymbol{v}_t,$$

where

$$v_r(t_k) \sim \mathcal{N}(0, \eta_r^2), \qquad \boldsymbol{v}_t(t_k) \sim \mathcal{N}\big(\boldsymbol{0},\ \eta_t^2\,\boldsymbol{P}_T(\boldsymbol{u}(t_k))\big).$$
#### Magnitude.

We treat the token magnitude as directly observable from the token norm:

$$m(t_k) \approx \|\boldsymbol{z}_s(t_k)\|.$$

This holds to first order when noise is small relative to the signal ($\eta_r, \eta_t \ll m$), under which $\|\boldsymbol{z}_s\| \approx m + v_r \approx m$. We therefore condition on the observed magnitude rather than treating it as a latent variable to be inferred.

#### Direction.

The unit-direction measurement is obtained by normalization:

$$\boldsymbol{u}_z = \frac{\boldsymbol{z}_s}{\|\boldsymbol{z}_s\|}.$$

Linearizing around the mean state $m\boldsymbol{u}$:

$$\boldsymbol{u}_z \approx \boldsymbol{u} + \frac{1}{m}\,\boldsymbol{P}_T(\boldsymbol{u})\,\boldsymbol{v}_t.$$

Only tangential noise perturbs the directional measurement, which is distributed as:

$$\boldsymbol{u}_z(t_k) \sim \mathcal{N}\!\left(\boldsymbol{u}(t_k),\ \frac{\eta_t^2}{m(t_k)^2}\,\boldsymbol{P}_T(\boldsymbol{u}(t_k))\right).$$

Directional uncertainty scales inversely with $m^2$. The magnitude therefore acts as directional inertia, entering the precision of the directional estimator through the $1/m^2$ scaling of the observation noise.

#### Summary.

The full model in polar coordinates is:

$$
\begin{aligned}
d\boldsymbol{u} &= \big(-\tilde{\mu}_t + \boldsymbol{\Lambda}_{\Omega}\big)\,\boldsymbol{u}\,dt + \frac{\sigma_t}{m}\,d\boldsymbol{w}_t, &\qquad \tilde{\mu}_t &:= \frac{\sigma_t^2(d-1)}{2m^2}, \\
dm &= -\tilde{\mu}_r\,m\,dt + \sigma_r\,dw_r, &\qquad \tilde{\mu}_r &:= \mu_r - \tilde{\mu}_t, \\
m(t_k) &\approx \|\boldsymbol{z}_s(t_k)\|, \\
\boldsymbol{u}_z(t_k) &\approx \boldsymbol{u}(t_k) + \frac{1}{m(t_k)}\,\boldsymbol{P}_T(\boldsymbol{u}(t_k))\,\boldsymbol{v}_t(t_k), &\qquad \boldsymbol{v}_t(t_k) &\sim \mathcal{N}\big(\boldsymbol{0},\ \eta_t^2\,\boldsymbol{P}_T(\boldsymbol{u}(t_k))\big).
\end{aligned}
$$

#### Remark: Cartesian inference with radial–tangential geometry.

Although the RT-SDE admits a magnitude–direction decomposition, we do *not* reparameterize the likelihood in explicit angular coordinates. Such a parameterization would render the dynamics and noise state-dependent in coordinate form, destroy the quadratic structure of the Gaussian likelihood, and require iterative nonlinear optimization. All propagation and weighted least-squares inference are instead performed in Cartesian coordinates, where the dynamics remain linear and Gaussian and the covariance admits a closed-form solution.

### B.3 Propagation of Directions and Magnitudes

As in the Euclidean case, the estimator requires predictions of the latent state at the query position, obtained by transporting past observations under the deterministic dynamics. Under the RT-SDE, the deterministic part of the state transition factorizes into a unitary rotation and an exponential magnitude decay:

$$\hat{\boldsymbol{z}}_{s,ij} = e^{-\mu_r\Delta t_{ij}}\,e^{\boldsymbol{\Lambda}_{\Omega}\Delta t_{ij}}\,\boldsymbol{z}_{s,j} = \hat{m}_{ij}\,\hat{\boldsymbol{u}}_{z,ij},$$

where $\hat{m}_{ij} = m_j\,e^{-\mu_r\Delta t_{ij}}$ and $\hat{\boldsymbol{u}}_{z,ij} = e^{\boldsymbol{\Lambda}_{\Omega}\Delta t_{ij}}\,\boldsymbol{u}_{z,j}$.
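In the diagonal eigenbasis, $e^{\boldsymbol{\Lambda}_{\Omega}\Delta t}$ is a per-coordinate phase rotation, so the transport step is elementwise; a minimal sketch (frequencies and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, mu_r, dt_ij = 8, 0.1, 3.0
omega = rng.normal(size=d)                           # diag(Λ_Ω) = i * omega

z_j = rng.normal(size=d) + 1j * rng.normal(size=d)   # eigenbasis token at time t_j
m_j = np.linalg.norm(z_j)
u_j = z_j / m_j

# Deterministic transport: unitary rotation times scalar decay
z_hat = np.exp(-mu_r * dt_ij) * np.exp(1j * omega * dt_ij) * z_j

m_hat = m_j * np.exp(-mu_r * dt_ij)                  # decayed magnitude
u_hat = np.exp(1j * omega * dt_ij) * u_j             # rotated direction, still unit norm
assert np.allclose(z_hat, m_hat * u_hat)
assert np.isclose(np.linalg.norm(u_hat), 1.0)        # rotation preserves the sphere
print("transported magnitude:", m_hat)
```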

The directional component propagates unitarily under $e^{\boldsymbol{\Lambda}_{\Omega}\Delta t_{ij}}$, preserving the norm and keeping propagated directions on the sphere. This follows because the scalar decay terms $-\mu_r\,\boldsymbol{P}_R(\boldsymbol{u})$ and $-\mu_t\,\boldsymbol{P}_T(\boldsymbol{u})$ act only along or orthogonal to the current direction and do not induce rotation; only $\boldsymbol{\Lambda}_{\Omega}$ transports direction.

The magnitude decay $\hat{m}_{ij} = m_j\,e^{-\mu_r\Delta t_{ij}}$ is an approximation. The exact magnitude dynamics,

$$dm = -\tilde{\mu}_r\,m\,dt + \sigma_r\,dw_r, \qquad \tilde{\mu}_r = \mu_r - \frac{\sigma_t^2(d-1)}{2m^2},$$

include a state-dependent centrifugal correction from the quadratic variation of the tangential noise, which prevents a closed-form solution in general. The simple exponential is valid when $m \gg \sigma_t\sqrt{(d-1)/\mu_r}$, i.e. when the centrifugal correction is negligible; this is the same regime in which magnitude is meaningful as a confidence measure and the angular variance satisfies $\sigma_t^2/m^2 \ll 1$.

### B.4 Propagation of Uncertainty through the RT-SDE

The directional estimator derived in the next section requires knowing how reliably each past token predicts the current latent direction. This reliability depends on how much uncertainty accumulates as a token is transported from its original time to the query frame under the RT-SDE dynamics. We therefore derive the propagated measurement covariance in closed form. The key result is that the co-rotating radial–tangential structure of the RT-SDE causes the rotation terms to cancel inside the covariance integral, yielding an analytic expression that retains the same radial–tangential structure as the model itself.

###### Proposition 1 (Closed-form Propagated Covariance under RT-SDE).

Consider the RT-SDE model defined above with process noise covariance:

$$\boldsymbol{\Lambda}_Q(t) = \sigma_r^2\,\boldsymbol{P}_R(\boldsymbol{u}(t)) + \sigma_t^2\,\boldsymbol{P}_T(\boldsymbol{u}(t))$$

and measurement noise covariance:

$$\boldsymbol{\Lambda}_R(t) = \eta_r^2\,\boldsymbol{P}_R(\boldsymbol{u}(t)) + \eta_t^2\,\boldsymbol{P}_T(\boldsymbol{u}(t)).$$

Let $t_i > t_j$ with $\Delta t_{ij} = t_i - t_j$. Then, in the regime of small angular diffusion ($\sigma_t\sqrt{\Delta t_{ij}} \ll 1$), the covariance of a measurement at time $t_j$ propagated to time $t_i$ is well approximated by the closed-form expression:

$$\boldsymbol{\Lambda}_{\hat{V}}(t_i, t_j) = \sigma_{Vr}^2(|\Delta t_{ij}|)\,\boldsymbol{P}_R(\boldsymbol{u}(t_i)) + \sigma_{Vt}^2(|\Delta t_{ij}|)\,\boldsymbol{P}_T(\boldsymbol{u}(t_i)),$$

where

$$
\begin{aligned}
\sigma_{Vr}^2(|\Delta t_{ij}|) &= \varphi(\mu_r, |\Delta t_{ij}|)\,\sigma_r^2 + e^{-2\mu_r\Delta t_{ij}}\,\eta_r^2, \\
\sigma_{Vt}^2(|\Delta t_{ij}|) &= \varphi(\mu_t, |\Delta t_{ij}|)\,\sigma_t^2 + e^{-2\mu_t\Delta t_{ij}}\,\eta_t^2,
\end{aligned}
$$

and

$$\varphi(\mu, \Delta t) = \begin{cases} \dfrac{1 - e^{-2\mu\Delta t}}{2\mu}, & \mu \neq 0, \\[6pt] \Delta t, & \mu = 0. \end{cases}$$

Consequently, the propagated precision matrix is also available in closed form and retains the same radial–tangential structure:

$$\boldsymbol{\Lambda}_{\hat{V}}^{-1}(t_i, t_j) = \frac{1}{\sigma_{Vr}^2(|\Delta t_{ij}|)}\,\boldsymbol{P}_R(\boldsymbol{u}(t_i)) + \frac{1}{\sigma_{Vt}^2(|\Delta t_{ij}|)}\,\boldsymbol{P}_T(\boldsymbol{u}(t_i)).$$

In the small angular diffusion regime, the covariance therefore admits a closed-form analytic expression and retains radial–tangential rank-1 structure for all lags.
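The closed-form kernels translate directly into code; a sketch (parameter values arbitrary):

```python
import numpy as np

def phi(mu: float, dt: np.ndarray) -> np.ndarray:
    """phi(mu, dt) = (1 - exp(-2 mu dt)) / (2 mu), with the limit dt at mu = 0."""
    if mu == 0.0:
        return dt
    return (1.0 - np.exp(-2.0 * mu * dt)) / (2.0 * mu)

def propagated_variances(dt, mu_r, mu_t, sigma_r, sigma_t, eta_r, eta_t):
    """Radial and tangential variances of a measurement propagated over lag dt."""
    var_r = phi(mu_r, dt) * sigma_r**2 + np.exp(-2.0 * mu_r * dt) * eta_r**2
    var_t = phi(mu_t, dt) * sigma_t**2 + np.exp(-2.0 * mu_t * dt) * eta_t**2
    return var_r, var_t

lags = np.linspace(0.0, 10.0, 5)
var_r, var_t = propagated_variances(lags, mu_r=0.2, mu_t=0.05,
                                    sigma_r=0.5, sigma_t=1.0, eta_r=0.1, eta_t=0.3)
print(var_r)   # at dt = 0 this reduces to eta_r^2 (pure measurement noise)
print(var_t)   # grows with lag, saturating toward sigma_t^2 / (2 mu_t)
```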

###### Proof.

If $\boldsymbol{Q}$ is a function of $t$, the state evolution (for the causal case) is governed by:

$$\boldsymbol{x}(t_i) = e^{\boldsymbol{A}(t_i - t_j)}\,\boldsymbol{x}(t_j) + \int_{t_j}^{t_i} e^{\boldsymbol{A}(t_i - s)}\,\boldsymbol{G}(s)\,d\boldsymbol{w}(s),$$

so the propagated covariance is:

$$\boldsymbol{V}(t_i, t_j) = \int_{t_j}^{t_i} e^{\boldsymbol{A}(t_i - s)}\,\boldsymbol{Q}(s)\,e^{\boldsymbol{A}^{\top}(t_i - s)}\,ds.$$

Letting $\tau = t_i - s$,

$$\boldsymbol{V}(t_i, t_j) = \int_0^{\Delta t_{ij}} e^{\boldsymbol{A}\tau}\,\boldsymbol{Q}(t_i - \tau)\,e^{\boldsymbol{A}^{\top}\tau}\,d\tau.$$

Plugging in $\boldsymbol{A}(t) = \boldsymbol{S}\big(-\mu_r\,\boldsymbol{P}_R(t) - \mu_t\,\boldsymbol{P}_T(t) + \boldsymbol{\Lambda}_{\Omega}\big)\boldsymbol{S}^{-1}$, the state transition matrix factorizes as:

$$\boldsymbol{\Phi}(\tau) = e^{\boldsymbol{\Lambda}_{\Omega}\tau}\left(e^{-\mu_r\tau}\,\boldsymbol{P}_R(\boldsymbol{u}(t+\tau)) + e^{-\mu_t\tau}\,\boldsymbol{P}_T(\boldsymbol{u}(t+\tau))\right),$$

so the covariance becomes:

$$\boldsymbol{V}(t_i, t_j) = \boldsymbol{S}\,\boldsymbol{\Lambda}_V(t_i, t_j)\,\boldsymbol{S}^{-1},$$

where:

$$
\begin{aligned}
\boldsymbol{\Lambda}_V(t_i, t_j) ={}& \int_0^{|\Delta t_{ij}|} e^{\boldsymbol{\Lambda}_{\Omega}\tau}\Big(e^{-\mu_r\tau}\boldsymbol{P}_R(t_i - \tau) + e^{-\mu_t\tau}\boldsymbol{P}_T(t_i - \tau)\Big) \\
&\times \Big(\sigma_r^2\,\boldsymbol{P}_R(t_i - \tau) + \sigma_t^2\,\boldsymbol{P}_T(t_i - \tau)\Big) \\
&\times \Big(e^{-\mu_r\tau}\boldsymbol{P}_R(t_i - \tau) + e^{-\mu_t\tau}\boldsymbol{P}_T(t_i - \tau)\Big)\,e^{\boldsymbol{\Lambda}_{\Omega}^{\dagger}\tau}\,d\tau \\
={}& \int_0^{|\Delta t_{ij}|} e^{\boldsymbol{\Lambda}_{\Omega}\tau}\Big(\sigma_r^2\,e^{-2\mu_r\tau}\boldsymbol{P}_R(t_i - \tau) + \sigma_t^2\,e^{-2\mu_t\tau}\boldsymbol{P}_T(t_i - \tau)\Big)\,e^{\boldsymbol{\Lambda}_{\Omega}^{\dagger}\tau}\,d\tau.
\end{aligned}
$$

In the regime of small angular diffusion ($\sigma_t\sqrt{\Delta t_{ij}} \ll 1$), the direction $\boldsymbol{u}(t)$ is well approximated by pure rotational transport, so that

$$e^{\boldsymbol{\Lambda}_{\Omega}\tau}\,\boldsymbol{P}_R(t_i - \tau)\,e^{\boldsymbol{\Lambda}_{\Omega}^{\dagger}\tau} \approx \boldsymbol{P}_R(t_i),$$

and likewise for $\boldsymbol{P}_T$. Thus, the rotation cancels inside the integral:

$$
\begin{aligned}
\boldsymbol{\Lambda}_V(t_i, t_j) &= \sigma_r^2\,\boldsymbol{P}_R(t_i)\int_0^{|\Delta t_{ij}|} e^{-2\mu_r\tau}\,d\tau + \sigma_t^2\,\boldsymbol{P}_T(t_i)\int_0^{|\Delta t_{ij}|} e^{-2\mu_t\tau}\,d\tau \\
&= \sigma_r^2\,\varphi(\mu_r, |\Delta t_{ij}|)\,\boldsymbol{P}_R(t_i) + \sigma_t^2\,\varphi(\mu_t, |\Delta t_{ij}|)\,\boldsymbol{P}_T(t_i),
\end{aligned}
$$

where:

$$\varphi(\mu, \Delta t) = e^{-2\mu\Delta t}\,\kappa_B(\Delta t) = \begin{cases} \dfrac{1 - e^{-2\mu\Delta t}}{2\mu}, & \mu \neq 0, \\[6pt] \Delta t, & \mu = 0. \end{cases}$$

Finally,

$$
\begin{aligned}
\boldsymbol{\Lambda}_{\hat{V}}(t_i, t_j) &= \boldsymbol{\Lambda}_V(t_i, t_j) + e^{\boldsymbol{\Lambda}\Delta t_{ij}}\,\boldsymbol{\Lambda}_R(t_j)\,e^{\boldsymbol{\Lambda}^{\dagger}\Delta t_{ij}} \\
&= \boldsymbol{\Lambda}_V(t_i, t_j) + e^{-2\mu_r\Delta t_{ij}}\,\eta_r^2\,\boldsymbol{P}_R(t_i) + e^{-2\mu_t\Delta t_{ij}}\,\eta_t^2\,\boldsymbol{P}_T(t_i).
\end{aligned}
$$

Altogether, the propagated measurement covariance is:

$$\boldsymbol{\Lambda}_{\hat{V}}(t_i, t_j) = \sigma_{Vr}^2(|\Delta t_{ij}|)\,\boldsymbol{P}_R(t_i) + \sigma_{Vt}^2(|\Delta t_{ij}|)\,\boldsymbol{P}_T(t_i),$$

where:

$$
\begin{aligned}
\sigma_{Vr}^2(|\Delta t_{ij}|) &= \varphi(\mu_r, |\Delta t_{ij}|)\,\sigma_r^2 + e^{-2\mu_r\Delta t_{ij}}\,\eta_r^2, \\
\sigma_{Vt}^2(|\Delta t_{ij}|) &= \varphi(\mu_t, |\Delta t_{ij}|)\,\sigma_t^2 + e^{-2\mu_t\Delta t_{ij}}\,\eta_t^2.
\end{aligned}
$$

Since the propagated covariance is of the form $\boldsymbol{\Lambda}_{\hat{V}}(t_i, t_j) = a\boldsymbol{I} + b\,\boldsymbol{u}(t_i)\boldsymbol{u}(t_i)^{\dagger}$, where $a = \sigma_{Vt}^2$ and $b = \sigma_{Vr}^2 - \sigma_{Vt}^2$, i.e. a rank-1 correction of the identity, we can invert it with the Sherman–Morrison formula:

$$
\begin{aligned}
\boldsymbol{\Lambda}_{\hat{V}}^{-1} &= \frac{1}{a}\Big(\boldsymbol{I} + \frac{b}{a}\boldsymbol{u}\boldsymbol{u}^{\dagger}\Big)^{-1} = \frac{1}{a}\left(\boldsymbol{I} - \frac{b}{a}\,\frac{\boldsymbol{u}\boldsymbol{u}^{\dagger}}{1 + \frac{b}{a}\,\boldsymbol{u}^{\dagger}\boldsymbol{u}}\right) \\
&= \frac{1}{a}\big(\boldsymbol{I} - \boldsymbol{u}\boldsymbol{u}^{\dagger}\big) + \Big(\frac{1}{a} - \frac{b}{a(a+b)}\Big)\boldsymbol{u}\boldsymbol{u}^{\dagger} \\
&= \frac{1}{a}\big(\boldsymbol{I} - \boldsymbol{u}\boldsymbol{u}^{\dagger}\big) + \frac{1}{a+b}\,\boldsymbol{u}\boldsymbol{u}^{\dagger}.
\end{aligned}
$$

Hence, since the radial and tangential components lie in orthogonal subspaces, we can invert $\boldsymbol{\Lambda}_{\hat{V}}(t_i, t_j)$ by inverting each component:

$$\boldsymbol{\Lambda}_{\hat{V}}^{-1}(t_i, t_j) = \frac{1}{\sigma_{Vr}^2(|\Delta t_{ij}|)}\,\boldsymbol{P}_R(t_i) + \frac{1}{\sigma_{Vt}^2(|\Delta t_{ij}|)}\,\boldsymbol{P}_T(t_i).$$
∎
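The componentwise inversion at the end of the proof can be cross-checked against a dense matrix inverse; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
u = rng.normal(size=d); u /= np.linalg.norm(u)
P_R, P_T = np.outer(u, u), np.eye(d) - np.outer(u, u)

var_r, var_t = 0.7, 0.2                       # sigma_Vr^2, sigma_Vt^2
Lam = var_r * P_R + var_t * P_T               # propagated covariance

# Componentwise inverse along the orthogonal radial/tangential subspaces
Lam_inv = (1.0 / var_r) * P_R + (1.0 / var_t) * P_T
assert np.allclose(Lam_inv, np.linalg.inv(Lam))
print("componentwise inverse matches dense inverse")
```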

### B.5 Derivation of the Directional Precision

The propagated covariance describes uncertainty in the ambient Euclidean space, whereas the estimator operates over unit directions on the hypersphere. After normalization, the relevant uncertainty is therefore the angular variance induced by the propagated covariance after projection onto the tangent plane. This yields a scalar directional precision $\kappa_{ij}$, which replaces the Euclidean precision matrix in the directional estimator.

The Euclidean formulation of RFA weights observations according to the Mahalanobis norm of the residual:

$$\boldsymbol{r}_{ij} = \boldsymbol{z}_i - \hat{\boldsymbol{z}}_{ij}, \qquad d_{ij}^2 = \boldsymbol{r}_{ij}^{\top}\,\boldsymbol{P}_{ij}\,\boldsymbol{r}_{ij},$$

where $\boldsymbol{P}_{ij} = \boldsymbol{\Sigma}_{ij}^{-1}$ is the analytic precision. The essential quantity is therefore the variance of the residual vector.

On the sphere, the latent variable of interest is the unit direction $\boldsymbol{u}_i \in \mathcal{S}^{d-1}$. The natural directional residual is the vector difference:

$$\boldsymbol{r}_{ij}^{(\mathrm{dir})} = \boldsymbol{u}_{z,i} - \hat{\boldsymbol{u}}_{z,ij}.$$

Accordingly, the spherical analogue of the Euclidean Mahalanobis norm requires the covariance of this directional residual.

#### Directional perturbations under the RT-SDE.

Let

$$\boldsymbol{u}_{z,i} = \boldsymbol{u}_i + \boldsymbol{\delta}_i, \qquad \hat{\boldsymbol{u}}_{z,ij} = \boldsymbol{u}_i + \boldsymbol{\delta}_j,$$

where $\boldsymbol{u}_i$ denotes the latent direction and $\boldsymbol{\delta}_i, \boldsymbol{\delta}_j$ are independent perturbations lying in the tangent space at $\boldsymbol{u}_i$.

To see how Cartesian noise projects onto the sphere, we linearize the normalization mapping $\pi(\boldsymbol{z}) = \frac{\boldsymbol{z}}{\|\boldsymbol{z}\|}$ around the state $\boldsymbol{z} = m\boldsymbol{u}$. The Jacobian $\boldsymbol{J}_{\pi}$ of this mapping is:

$$\boldsymbol{J}_{\pi}(\boldsymbol{z}) = \frac{\partial}{\partial\boldsymbol{z}}\left(\frac{\boldsymbol{z}}{\|\boldsymbol{z}\|}\right) = \frac{1}{\|\boldsymbol{z}\|}\boldsymbol{I} - \frac{\boldsymbol{z}\boldsymbol{z}^{\top}}{\|\boldsymbol{z}\|^3}.$$

Evaluating this at the latent state $\boldsymbol{x} = m\boldsymbol{u}$ (where $m = \|\boldsymbol{x}\|$):

$$\boldsymbol{J}_{\pi}(m\boldsymbol{u}) = \frac{1}{m}\boldsymbol{I} - \frac{m^2\,\boldsymbol{u}\boldsymbol{u}^{\top}}{m^3} = \frac{1}{m}\big(\boldsymbol{I} - \boldsymbol{u}\boldsymbol{u}^{\top}\big) = \frac{1}{m}\boldsymbol{P}_T(\boldsymbol{u}).$$

Now we apply the standard covariance propagation rule $\mathrm{Var}(\pi(\boldsymbol{z})) \approx \boldsymbol{J}_{\pi}\,\mathrm{Var}(\boldsymbol{\delta})\,\boldsymbol{J}_{\pi}^{\top}$. Given the RT-SDE assumption that the noise is already tangential, i.e., $\mathrm{Var}(\boldsymbol{\delta}) = \sigma_{\Sigma t}^2\,\boldsymbol{P}_T(\boldsymbol{u})$, we have:

$$\mathrm{Var}(\boldsymbol{\delta}_{\theta}) = \left(\frac{1}{m}\boldsymbol{P}_T(\boldsymbol{u})\right)\big(\sigma_{\Sigma t}^2\,\boldsymbol{P}_T(\boldsymbol{u})\big)\left(\frac{1}{m}\boldsymbol{P}_T(\boldsymbol{u})\right)^{\top}.$$

Because $\boldsymbol{P}_T$ is an orthogonal projection matrix, it is idempotent ($\boldsymbol{P}_T^2 = \boldsymbol{P}_T$) and symmetric ($\boldsymbol{P}_T^{\top} = \boldsymbol{P}_T$). The expression simplifies directly:

$$\mathrm{Var}(\boldsymbol{\delta}_{\theta}) = \frac{\sigma_{\Sigma t}^2}{m^2}\,\boldsymbol{P}_T(\boldsymbol{u}).$$

Therefore, at time $t_i$,

$$\mathrm{Var}(\boldsymbol{\delta}_i) = \frac{\sigma_{\Sigma t}^2(0)}{m_i^2}\,\boldsymbol{P}_T(\boldsymbol{u}_i),$$

and for the transported key,

$$\mathrm{Var}(\boldsymbol{\delta}_j) = \frac{\sigma_{\Sigma t}^2(|\Delta t_{ij}|)}{\hat{m}_{ij}^2}\,\boldsymbol{P}_T(\boldsymbol{u}_i),$$

where $\sigma_{\Sigma t}^2(0) = \eta_t^2 + \gamma_t^2$, and $\gamma_t^2 \geq 0$ is an additional query-side noise floor, included for the same reason as the analogous term in isotropic RFA: the propagated process noise $\sigma_{Vt}^2(\Delta t)$ vanishes as $\Delta t \to 0$, but the query token still has irreducible directional uncertainty that should not be treated as zero even at zero lag.

#### Variance of the directional residual.

Since

$$\boldsymbol{r}_{ij}^{(\mathrm{dir})} = \boldsymbol{\delta}_i - \boldsymbol{\delta}_j,$$

and the perturbations are independent, the covariance of the residual is

$$\mathrm{Var}\big(\boldsymbol{r}_{ij}^{(\mathrm{dir})}\big) = \mathrm{Var}(\boldsymbol{\delta}_i) + \mathrm{Var}(\boldsymbol{\delta}_j).$$

Substituting the expressions above yields

$$\mathrm{Var}\big(\boldsymbol{r}_{ij}^{(\mathrm{dir})}\big) = \Sigma_{\theta,ij}\,\boldsymbol{P}_T(\boldsymbol{u}_i),$$

where

$$\Sigma_{\theta,ij} = \frac{\sigma_{\Sigma t,i}^2}{m_i^2} + \frac{\sigma_{\Sigma t,j}^2(|\Delta t_{ij}|)}{\hat{m}_{ij}^2}.$$

Thus the directional residual is isotropic in the tangent space, with scalar variance $\Sigma_{\theta,ij}$.

#### Noise floor and limiting case.

To this we must add a noise floor $\tau_{\theta}^2$, representing a fixed angular resolution that captures irreducible uncertainty. We also include a stabilization term $\epsilon$:

$$\Sigma_{\theta,ij} = \sigma_{\theta,i}^2 + \sigma_{\theta,j}^2 + \tau_{\theta}^2 = \frac{\sigma_{\Sigma t,i}^2}{m_i^2 + \epsilon} + \frac{\sigma_{\Sigma t,j}^2(|\Delta t_{ij}|)}{\hat{m}_{ij}^2 + \epsilon} + \tau_{\theta}^2.$$

The precision required by the directional M-estimator is:

$$\kappa_{ij} = \Sigma_{\theta,ij}^{-1}.$$

If $\sigma_{\Sigma t,i}^2$ and $\sigma_{\Sigma t,j}^2$ are zero, we recover a flat prior, as in standard attention:

$$\kappa_{ij} = \frac{1}{\tau_{\theta}^2}.$$
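Putting these pieces together, the analytic directional precision costs only a few scalar operations per query–key pair. A sketch under the assumptions above (the helper name `directional_precision` and all parameter values are illustrative):

```python
import numpy as np

def directional_precision(m_i, m_j, dt_ij, mu_r, mu_t,
                          sigma_t, eta_t, gamma_t, tau_theta, eps=1e-6):
    """kappa_ij = 1 / Sigma_theta,ij for a single query-key pair (a sketch)."""
    # Query-side tangential variance at zero lag: eta_t^2 + gamma_t^2
    var_t0 = eta_t**2 + gamma_t**2
    # Key-side tangential variance propagated over the lag (Proposition 1)
    phi = dt_ij if mu_t == 0 else (1.0 - np.exp(-2.0 * mu_t * dt_ij)) / (2.0 * mu_t)
    var_t = phi * sigma_t**2 + np.exp(-2.0 * mu_t * dt_ij) * eta_t**2

    m_hat = m_j * np.exp(-mu_r * dt_ij)          # transported key magnitude
    Sigma_theta = var_t0 / (m_i**2 + eps) + var_t / (m_hat**2 + eps) + tau_theta**2
    return 1.0 / Sigma_theta

# High-magnitude, recent key vs. the same key at a large lag
print(directional_precision(4.0, 4.0, 0.5, 0.1, 0.05, 1.0, 0.3, 0.1, 0.05))
print(directional_precision(4.0, 4.0, 8.0, 0.1, 0.05, 1.0, 0.3, 0.1, 0.05))
```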

#### Whitened spherical residual.

The spherical analogue of the Euclidean Mahalanobis distance is therefore

$$d_{ij}^2 = \boldsymbol{r}_{ij}^{(\mathrm{dir})\top}\big(\Sigma_{\theta,ij}^{-1}\,\boldsymbol{P}_T\big)\,\boldsymbol{r}_{ij}^{(\mathrm{dir})}.$$

Since $\boldsymbol{r}_{ij}^{(\mathrm{dir})}$ lies in the tangent plane, $\boldsymbol{P}_T$ acts as the identity and we obtain

$$d_{ij}^2 = \frac{\|\boldsymbol{u}_{z,i} - \hat{\boldsymbol{u}}_{z,ij}\|^2}{\Sigma_{\theta,ij}}.$$

Using the exact identity for unit vectors,

$$\|\boldsymbol{u}_{z,i} - \hat{\boldsymbol{u}}_{z,ij}\|^2 = 2\big(1 - \boldsymbol{u}_{z,i}^{\dagger}\hat{\boldsymbol{u}}_{z,ij}\big),$$

the whitened squared residual becomes

$$d_{ij}^2 = 2\,\kappa_{ij}\big(1 - \boldsymbol{u}_{z,i}^{\dagger}\hat{\boldsymbol{u}}_{z,ij}\big),$$

where $\kappa_{ij} = \Sigma_{\theta,ij}^{-1}$ is the analytic directional precision.

#### Comparison to Euclidean RFA.

The scalar $\kappa_{ij}$ plays the role of precision in the directional setting, weighting each key according to the reliability of its angular residual. Since angular variance scales as $\sigma_t^2/m^2$, magnitude acts as directional inertia: high-magnitude states resist reorientation, while low-magnitude states are more easily perturbed.

For small angular deviations,

$$1 - \boldsymbol{u}_{z,i}^{\dagger}\hat{\boldsymbol{u}}_{z,ij} \approx \tfrac{1}{2}\theta_{ij}^2, \qquad d_{ij}^2 \approx \kappa_{ij}\,\theta_{ij}^2,$$

so $\kappa_{ij} = \Sigma_{\theta,ij}^{-1}$ coincides with the inverse angular variance. In this regime, the estimator corresponds to maximum likelihood under a local von Mises–Fisher-type model, with $\kappa_{ij}$ playing the role of concentration.

Unlike Euclidean RFA, where temporal structure appears through both explicit decay and propagated precision, normalization removes the explicit decay from the directional residuals. The temporal dynamics are instead absorbed entirely into the directional precision through the transported magnitude $\hat{m}_{ij} = m_j\,e^{-\mu\Delta t_{ij}}$:

$$\kappa_{ij} \propto \frac{\hat{m}_{ij}^2}{\sigma_{\Sigma t}^2(\Delta t_{ij})} = \frac{m_j^2\,e^{-2\mu\Delta t_{ij}}}{\sigma_{\Sigma t}^2(\Delta t_{ij})}.$$

Thus, the directional formulation preserves the same temporal filtering structure as Euclidean RFA, but expresses it entirely through angular precision.

## Appendix C Directional Filtering under the RT-SDE

Under the RT-SDE, filtering reduces to inference over latent directions on the hypersphere. Transported observations provide noisy directional evidence whose uncertainty depends on both temporal propagation and the radial–tangential covariance structure derived in the previous section. This yields a precision-weighted directional filtering problem whose solution recovers attention as tangent-space consensus estimation.

### C.1 Directional estimation under the RT-SDE

Under the RT-SDE measurement model, normalization decouples radial and tangential uncertainty to first order: radial noise perturbs only token magnitude, while tangential noise perturbs only direction (Appendix [B](https://arxiv.org/html/2605.11007#A2)). Conditioning on the observed magnitudes $m_j = \|\boldsymbol{z}_{s,j}\|$, the transported direction $\hat{\boldsymbol{u}}_{z,ij}$ becomes a noisy observation of the latent query direction $\boldsymbol{u}_i$, with tangent-plane variance $\sigma_{Vt,ij}^2/\hat{m}_{ij}^2$.

For nearby directions, the hypersphere is locally approximated by its tangent plane, and the squared Euclidean distance between unit vectors agrees with the squared geodesic distance to second order:

$$\|\hat{\boldsymbol{u}}_{z,ij} - \boldsymbol{u}_i\|^2 = 2\big(1 - \boldsymbol{u}_i^{\top}\hat{\boldsymbol{u}}_{z,ij}\big).$$

The local directional likelihood is therefore Gaussian in tangent-space coordinates, yielding the directional negative log-likelihood:

$$\mathcal{L}_T(\boldsymbol{u}_i) = \sum_{j\leq i}\kappa_{ij}\big(1 - \boldsymbol{u}_i^{\top}\hat{\boldsymbol{u}}_{z,ij}\big).$$
#### Exact directional estimator.

Minimizing $\mathcal{L}_T$ subject to $\|\boldsymbol{u}_i\| = 1$ is equivalent to:

$$\max_{\|\boldsymbol{u}_i\|=1}\ \boldsymbol{u}_i^{\top}\Big(\sum_{j\leq i}\kappa_{ij}\,\hat{\boldsymbol{u}}_{z,ij}\Big),$$

whose unique solution is the normalized precision-weighted mean:

$$\boldsymbol{u}_i^{*} = \mathrm{Norm}\bigg(\sum_{j\leq i}\kappa_{ij}\,\hat{\boldsymbol{u}}_{z,ij}\bigg).$$

This recovers the maximum-likelihood direction but discards information about the concentration of the directional evidence, since $\boldsymbol{u}_i^{*}$ is always unit norm.

#### Tangent-space form.

A more informative representation retains the evidence concentration. Linearizing $\mathcal{L}_T$ around the current estimate $\boldsymbol{u}_i$ in its tangent space yields:

$$\min_{\Delta\boldsymbol{u}\in T_{\boldsymbol{u}_i}\mathcal{S}^{d-1}}\ \sum_{j\leq i}\kappa_{ij}\big\|\Delta\boldsymbol{u} - (\hat{\boldsymbol{u}}_{z,ij} - \boldsymbol{u}_i)\big\|^2,$$

with solution

$$\Delta\boldsymbol{u}_i = \Big(\sum_j \kappa_{ij}\Big)^{-1}\sum_j \kappa_{ij}\,\big(\hat{\boldsymbol{u}}_{z,ij} - \boldsymbol{u}_i\big) = \bar{\boldsymbol{u}}_i - \boldsymbol{u}_i,$$

where

$$\bar{\boldsymbol{u}}_i = \sum_j A_{ij}\,\hat{\boldsymbol{u}}_{z,ij}, \qquad A_{ij} = \frac{\kappa_{ij}}{\sum_{j'}\kappa_{ij'}}.$$

Since $\bar{\boldsymbol{u}}_i$ is a convex combination of unit vectors, $\boldsymbol{u}^{*} = \mathrm{Norm}(\bar{\boldsymbol{u}}_i)$ and the two forms point in the same direction. However, $\|\bar{\boldsymbol{u}}_i\| \in [0,1]$ encodes the *concentration* of the directional consensus: it is close to 1 when all keys agree on a direction, and close to 0 when they are spread diffusely. This is the circular mean resultant length, a standard measure of directional concentration.
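A sketch of the precision-weighted consensus and its resultant length, on synthetic directions, showing that tight evidence yields concentration near 1 and diffuse evidence a smaller value:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_keys = 8, 12

def consensus(dirs: np.ndarray, kappa: np.ndarray):
    """Precision-weighted mean of unit vectors; returns (u_bar, concentration)."""
    A = kappa / kappa.sum()                  # normalized weights A_ij
    u_bar = dirs @ A                         # convex combination, shape (d,)
    return u_bar, np.linalg.norm(u_bar)      # ||u_bar|| in [0, 1]

base = rng.normal(size=d); base /= np.linalg.norm(base)
kappa = rng.uniform(0.5, 2.0, size=n_keys)   # synthetic precisions

for spread in (0.05, 1.0):                   # tight vs. diffuse directional evidence
    dirs = base[:, None] + spread * rng.normal(size=(d, n_keys))
    dirs /= np.linalg.norm(dirs, axis=0)     # project keys onto the sphere
    _, conc = consensus(dirs, kappa)
    print(f"spread={spread}: concentration={conc:.3f}")
```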

#### Robust reweighting.

The quadratic objective is sensitive to model mis-specification. We introduce data-dependent reweighting via a robust M-estimator with squared whitened angular residual:

$$d_{ij}^2 = 2\,\kappa_{ij}\big(1 - \boldsymbol{u}_i^{\top}\hat{\boldsymbol{u}}_{z,ij}\big),$$

and robust weight:

$$w_{ij} = \psi(d_{ij}^2), \qquad \psi(x) = \left(1 + \frac{x}{\nu}\right)^{-\kappa},$$

corresponding to a Student-$t$ M-estimator. The effective precision $\tilde{\kappa}_{ij} = w_{ij}\,\kappa_{ij}$ replaces $\kappa_{ij}$ throughout, down-weighting keys with large directional residuals. The normalized weights

$$A_{ij} = \frac{\tilde{\kappa}_{ij}}{\sum_{j'}\tilde{\kappa}_{ij'}}$$

are now functions of both the dynamical precisions and the angular residuals, and $\bar{\boldsymbol{u}}_i = \sum_j A_{ij}\,\hat{\boldsymbol{u}}_{z,ij}$ is the attention output.
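The robust reweighting adds only a few lines on top of the consensus computation; a sketch with synthetic directions and precisions ($\nu$ and the exponent $\kappa$ follow the Student-$t$ form above):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, nu = 8, 10, 5.0
kap_exp = (nu + d) / d                        # Student-t exponent kappa

u_i = rng.normal(size=d); u_i /= np.linalg.norm(u_i)
dirs = rng.normal(size=(d, n)); dirs /= np.linalg.norm(dirs, axis=0)
kappa = rng.uniform(0.5, 2.0, size=n)         # analytic precisions kappa_ij

d2 = 2.0 * kappa * (1.0 - u_i @ dirs)         # whitened angular residuals d_ij^2
w = (1.0 + d2 / nu) ** (-kap_exp)             # robust M-estimator weights
kappa_eff = w * kappa                         # effective precision
A = kappa_eff / kappa_eff.sum()               # attention weights
u_bar = dirs @ A                              # attention output
print(A.round(3), np.linalg.norm(u_bar).round(3))
```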

### C.2 Eigenbasis filtering update

The RT-Filter update is naturally defined in eigenbasis coordinates $\boldsymbol{z}_{s,i} = m_i\,\boldsymbol{u}_i$, where the spherical geometry and radial–tangential covariance decomposition are exact.

A geodesic update on the hypersphere would move along the great-circle path toward $\boldsymbol{u}_i^{*}$. A local geodesic step of size $\alpha_i$ corresponds to adding a perturbation $\alpha_i m_i \bar{\boldsymbol{u}}_i$ followed by normalization.

Requiring the update to remain invariant under $\boldsymbol{z}_{s,i} \to c\,\boldsymbol{z}_{s,i}$ implies $\alpha_i \propto 1/m_i$. Writing

$$\alpha_i = \frac{r}{m_i},$$

the ambient update becomes

$$\boldsymbol{z}_{s,i}^{+} = \boldsymbol{z}_{s,i} + r\,\bar{\boldsymbol{u}}_i, \qquad \boldsymbol{u}_i^{+} = \mathrm{Norm}\big(\boldsymbol{z}_{s,i}^{+}\big),$$

where $r > 0$ controls the filtering step size.

Using $\bar{\boldsymbol{u}}_i$ rather than only the normalized direction $\boldsymbol{u}_i^{*}$ preserves concentration information in the update magnitude, yielding larger angular updates when directional consensus is sharp and smaller updates when evidence is diffuse.

The corresponding magnitude update is:

$$m_i^{+} = \|m_i\,\boldsymbol{u}_{z,i} + r\,\bar{\boldsymbol{u}}_i\| \approx m_i + r\,\boldsymbol{u}_{z,i}^{\top}\bar{\boldsymbol{u}}_i \qquad (m_i \gg r).$$

The increment is proportional to the directional agreement $\boldsymbol{u}_{z,i}^{\top}\bar{\boldsymbol{u}}_i$, so consistent evidence increases magnitude and stabilizes the state, while diffuse or contradictory evidence leaves it more plastic.

The exact geodesic update corresponds to spherical interpolation toward the consensus direction,

$$\boldsymbol{u}_i^{+} = \mathrm{slerp}\big(\boldsymbol{u}_i, \boldsymbol{u}_i^{*}, \alpha_i\big), \qquad \alpha_i = \frac{r\,\|\bar{\boldsymbol{u}}_i\|}{m_i},$$

where $\alpha_i$ incorporates directional concentration. The ambient update preserves the additive structure of Transformer residual dynamics while coinciding with the geodesic update to first order in $\alpha_i$.
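The claimed first-order agreement between the additive update and the geodesic step can be checked numerically; a sketch (synthetic consensus vector, arbitrary values):

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, alpha: float) -> np.ndarray:
    """Spherical interpolation from unit vector a toward unit vector b."""
    theta = np.arccos(np.clip(a @ b, -1.0, 1.0))
    if theta < 1e-12:
        return a
    return (np.sin((1 - alpha) * theta) * a + np.sin(alpha * theta) * b) / np.sin(theta)

rng = np.random.default_rng(7)
d, m_i, r = 8, 5.0, 0.1
u_i = rng.normal(size=d); u_i /= np.linalg.norm(u_i)
u_bar = u_i + 0.3 * rng.normal(size=d)            # unnormalized consensus
u_star = u_bar / np.linalg.norm(u_bar)

# Ambient residual update followed by normalization
z_plus = m_i * u_i + r * u_bar
u_ambient = z_plus / np.linalg.norm(z_plus)

# Exact geodesic step with concentration-scaled step size
alpha = r * np.linalg.norm(u_bar) / m_i
u_geo = slerp(u_i, u_star, alpha)

print("ambient vs geodesic difference:", np.linalg.norm(u_ambient - u_geo))
```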

## Appendix D The Transformer as an RT Filter

The RT filter update derived in the previous section is:

$$\boldsymbol{z}_{s,i}^{+} = \boldsymbol{z}_{s,i} + r\,\bar{\boldsymbol{u}}_i, \qquad \boldsymbol{u}_i^{+} = \mathrm{Norm}\big(\boldsymbol{z}_{s,i}^{+}\big),$$

where

$$\bar{\boldsymbol{u}}_i = \sum_j A_{ij}\,\hat{\boldsymbol{u}}_{z,ij}$$

is the precision-weighted directional consensus and $\|\bar{\boldsymbol{u}}_i\| \in [0,1]$ encodes its concentration.

In high-dimensional embeddings with approximately isotropic coordinates, $\|\boldsymbol{z}_i\|^2 \sim d$, so dimension-independent angular updates require $r \propto \sqrt{d}$. Writing $r = \gamma\sqrt{d}$, the learned gain $\gamma$ corresponds naturally to the scaling implemented by RMSNorm-like normalization layers.

This update is defined in the eigenbasis

$$\boldsymbol{z}_{s,i} = \boldsymbol{W}_v\,\boldsymbol{z}_i,$$

where the RT geometry is exact. Transformers, however, apply additive residual updates in the original representation space $\boldsymbol{z}_i$.

The eigenbasis filtering update

$$\boldsymbol{z}_{s,i}^{+} = \boldsymbol{z}_{s,i} + r\,\bar{\boldsymbol{u}}_i$$

therefore corresponds in ambient coordinates to

$$\boldsymbol{z}_i^{+} = \boldsymbol{z}_i + r\,\boldsymbol{W}_o\,\bar{\boldsymbol{u}}_i,$$

where $\boldsymbol{W}_o$ maps the directional consensus back to the residual stream.

When $\boldsymbol{W}_v\boldsymbol{W}_o = \boldsymbol{I}$, the ambient residual update exactly recovers the eigenbasis RT filter. In practice, independently learned projections introduce a basis mismatch between the filtering geometry and the residual update.

Thus, attention computes the eigenbasis directional consensus $\bar{\boldsymbol{u}}_i$, the output projection maps it back to ambient coordinates, residual addition performs the filtering step, and normalization implements the retraction onto the hypersphere.

Although softmax normalization is geometrically unnecessary for the direction itself,

$$\mathrm{Norm}(\bar{\boldsymbol{u}}_i) = \mathrm{Norm}\bigg(\sum_j \tilde{\kappa}_{ij}\,\hat{\boldsymbol{u}}_{z,ij}\bigg),$$

it remains important for the update magnitude. Softmax normalization ensures that $\|\bar{\boldsymbol{u}}_i\|$ reflects the concentration of the directional evidence rather than the total precision $\sum_j \tilde{\kappa}_{ij}$, thereby controlling the adaptive step size.

### D.1 Tangent-Space Residual Updates

The RT filter geometry is defined in the eigenbasis

$$\boldsymbol{z}_{s,i} = \boldsymbol{W}_v\,\boldsymbol{z}_i,$$

where directional states lie on the hypersphere. The residual filtering update

$$\boldsymbol{z}_{s,i}^{+} = \boldsymbol{z}_{s,i} + r\,\bar{\boldsymbol{u}}_i$$

is a first-order approximation to a geodesic (slerp) step toward the consensus direction. However, $\bar{\boldsymbol{u}}_i$ generally contains a component parallel to $\boldsymbol{z}_{s,i}$, which affects magnitude but not direction after normalization.

A more geometrically faithful update therefore projects the consensus direction onto the tangent space of the sphere:

$$\Pi_{\boldsymbol{z}_{s,i}}(\bar{\boldsymbol{u}}_i) = \bar{\boldsymbol{u}}_i - \frac{\boldsymbol{z}_{s,i}^{\top}\bar{\boldsymbol{u}}_i}{\|\boldsymbol{z}_{s,i}\|^2}\,\boldsymbol{z}_{s,i}.$$

The tangent-space filtering update becomes

$$\boldsymbol{z}_{s,i}^{+} = \boldsymbol{z}_{s,i} + r\,\Pi_{\boldsymbol{z}_{s,i}}(\bar{\boldsymbol{u}}_i), \qquad \boldsymbol{u}_i^{+} = \mathrm{Norm}\big(\boldsymbol{z}_{s,i}^{+}\big).$$
Mapping this update back to the residual stream yields the ambient Transformer update

$$\boldsymbol{z}_i^{+} = \boldsymbol{z}_i + r\,\boldsymbol{W}_o\,\Pi_{\boldsymbol{z}_{s,i}}(\bar{\boldsymbol{u}}_i).$$

This compensates for the basis mismatch introduced when $\boldsymbol{W}_o\boldsymbol{W}_v \neq \boldsymbol{I}$, while preserving the additive residual structure of the Transformer block.
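A sketch of the tangent-projected residual step, with $\boldsymbol{W}_v$ and $\boldsymbol{W}_o$ as random stand-ins for the learned projections:

```python
import numpy as np

rng = np.random.default_rng(8)
d, r = 8, 0.5
W_v = rng.normal(size=(d, d)) / np.sqrt(d)     # value projection (stand-in)
W_o = rng.normal(size=(d, d)) / np.sqrt(d)     # output projection (stand-in)

z = rng.normal(size=d)                         # residual-stream token
z_s = W_v @ z                                  # eigenbasis coordinates
u_bar = rng.normal(size=d); u_bar /= np.linalg.norm(u_bar)   # consensus direction

# Project the consensus onto the tangent space at z_s
proj = u_bar - (z_s @ u_bar) / (z_s @ z_s) * z_s
assert np.isclose(proj @ z_s, 0.0)             # tangential: orthogonal to z_s

z_plus = z + r * (W_o @ proj)                  # ambient Transformer update
# Re-project to the eigenbasis to apply the retraction (exact only when W_v W_o = I)
z_s_plus = W_v @ z_plus
u_plus = z_s_plus / np.linalg.norm(z_s_plus)
print(np.linalg.norm(u_plus))                  # 1.0: back on the sphere
```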

### D.2 Pre-Norm vs. Post-Norm

Under the RT-SDE interpretation, token magnitude is a meaningful state variable: directional uncertainty scales as $1/m^2$, so magnitude controls directional stability.

In a Pre-Norm Transformer,

$$\boldsymbol{x}_{\mathrm{norm}} = \mathrm{Norm}(\boldsymbol{x}), \qquad \boldsymbol{x} \leftarrow \boldsymbol{x} + \mathrm{Attn}(\boldsymbol{x}_{\mathrm{norm}}),$$

normalization is applied only within the attention branch, allowing the residual-stream magnitude to accumulate across layers. In a Post-Norm Transformer,

$$\boldsymbol{x} \leftarrow \mathrm{Norm}\big(\boldsymbol{x} + \mathrm{Attn}(\boldsymbol{x})\big),$$

the residual stream is renormalized after every block, resetting the magnitude channel between layers.

Both architectures remain locally consistent with the RT filter. The difference is whether magnitude information persists across depth. Unlike isotropic RFA, where precision depends only on temporal lag, the RT-SDE makes magnitude a load-bearing quantity through $\kappa_{ij} \propto m^2$. From this perspective, Pre-Norm naturally preserves accumulated directional confidence, whereas Post-Norm discards it between layers.

This suggests a concrete empirical prediction: in trained Pre-Norm models, token norms should correlate with directional stability across layers.

### D.3 Multi-Head Structure as Block-Diagonal Dynamics

Multi-head attention arises from a block-diagonal parameterization of the RT-SDE, in which the eigenbasis coordinates are partitioned into $H$ disjoint index sets

$$\{1, \dots, d\} = \bigcup_{h=1}^{H}\mathcal{I}_h,$$

with shared dynamical parameters within each block. The RT-SDE then decouples across blocks, so each head defines an independent filtering problem with its own directional precision $\kappa_{ij}^{(h)}$.

The directional precision in each head still depends on the global token magnitudes $m_i, m_j$ through the $1/m^2$ scaling induced by normalization. Each head therefore operates on a slice of the globally normalized direction rather than an independently normalized subvector.

Let $\boldsymbol{P}_h$ denote the projection onto the coordinates associated with head $h$. The per-head tangent updates combine linearly:

$$\Delta\boldsymbol{u}_i = \sum_{h=1}^{H}\boldsymbol{P}_h^{\top}\,\Delta\boldsymbol{u}_i^{(h)}, \qquad \boldsymbol{z}_{s,i}^{+} = \boldsymbol{z}_{s,i} + \boldsymbol{W}_o\,\Delta\boldsymbol{u}_i.$$

Global normalization then retracts the combined update back onto the hypersphere.

### D.4 Stacked Transformer Layers as a Riemannian Iterative State Estimator

The directional estimator derived in Section [C](https://arxiv.org/html/2605.11007#A3) is implicit: the precision weights depend on directional agreement with the current state estimate. As in Euclidean robust estimation, this induces an iterative refinement procedure across layers.

Working in the eigenbasis, let $\boldsymbol{z}_{s,i}^{(k)} = m_{s,i}^{(k)}\boldsymbol{u}_{s,i}^{(k)}$ denote the state estimate at layer $k$, initialized from the local token embedding. At each iteration, transported predictions are recomputed from the current estimates:

$$\hat{\boldsymbol{u}}_{s,ij}^{(k)} = \mathrm{Norm}\!\left(e^{\boldsymbol{\Lambda}\Delta t_{ij}}\,\boldsymbol{z}_{s,j}^{(k)}\right),$$

yielding the precision-weighted directional consensus

$$\bar{\boldsymbol{u}}_{s,i}^{(k)} = \sum_j A_{ij}^{(k)}\,\hat{\boldsymbol{u}}_{s,ij}^{(k)}, \qquad A_{ij}^{(k)} = \frac{\kappa_{ij}^{(k)}}{\sum_{j'}\kappa_{ij'}^{(k)}}.$$

Under the corresponding ambient residual update,

$$\boldsymbol{z}_{s,i}^{(k+1)} = \boldsymbol{z}_{s,i}^{(k)} + r\,\bar{\boldsymbol{u}}_{s,i}^{(k)},$$

the magnitude evolves according to the agreement between the current direction and the consensus update, accumulating confidence across layers.

Stacking Transformer layers therefore unrolls a Riemannian analogue of Iteratively Reweighted Least Squares, where each layer recomputes directional agreement, updates the precision weights, and refines the latent directional state estimate.
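A compact sketch of the unrolled iteration, with transport omitted ($\boldsymbol{\Lambda} = 0$, so predictions are just the current normalized states) and a simplified precision that keeps only the key-side magnitude term plus an angular floor:

```python
import numpy as np

rng = np.random.default_rng(9)
d, N, layers, r, tau2 = 8, 6, 4, 0.5, 0.1

z = rng.normal(size=(d, N)) * 2.0              # eigenbasis states at layer 0

for k in range(layers):
    m = np.linalg.norm(z, axis=0)              # magnitudes
    u = z / m                                  # directions (predictions, Λ = 0)
    # Simplified precision: kappa_ij = 1 / (1/m_j^2 + tau_theta^2)
    kappa = 1.0 / (1.0 / m[None, :]**2 + tau2)
    kappa = np.tril(np.ones((N, N))) * kappa   # causal: keys j <= i only
    A = kappa / kappa.sum(axis=1, keepdims=True)
    u_bar = u @ A.T                            # per-query consensus, shape (d, N)
    z = z + r * u_bar                          # residual filtering update
    u_bar_dir = u_bar / np.maximum(np.linalg.norm(u_bar, axis=0), 1e-9)
    agree = np.sum(u * u_bar_dir, axis=0)      # directional agreement per token
    print(f"layer {k}: mean magnitude {np.linalg.norm(z, axis=0).mean():.3f}, "
          f"mean agreement {agree.mean():.3f}")
```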

### D.5 Algorithm

Algorithm [1](https://arxiv.org/html/2605.11007#algorithm1) details the implementation of Radial–Tangential RFA. The full RT-Transformer is then shown in Algorithm [2](https://arxiv.org/html/2605.11007#algorithm2).

**Algorithm 1: Radial–Tangential Robust Filter Attention (RT-RFA)**

**Input:** $\boldsymbol{Z} \in \mathbb{R}^{d\times N}$.

**Definitions:** real-to-complex ($d \to 2d$) linear layers $\mathcal{L}_q, \mathcal{L}_k, \mathcal{L}_v$; complex-to-real ($2d \to d$) linear layer $\mathcal{L}_o$; angular frequencies $\boldsymbol{\omega}$; decay rate $\mu \in \mathbb{R}^{+}$; noise variance parameters $\sigma^2, \eta^2, \gamma^2 \in \mathbb{R}^{+}$; robustness parameter $\nu$; Softmax inverse temperature $\beta_s$; causal mask $\boldsymbol{M}_{\text{causal}} \in \{0, -\infty\}^{N\times N}$.

1. **QKV projection and normalization:**
$$\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V} \leftarrow \mathcal{L}_{q,k,v}(\boldsymbol{Z}), \qquad \boldsymbol{M} = \|\boldsymbol{V}\|_{\text{col}}, \qquad \boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V} \leftarrow \mathrm{Norm}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}).$$

2. **QKV rotation (RoPE):**
$$\boldsymbol{E}[i,j] = e^{-\mu|t_i - t_j|}, \qquad \tilde{\boldsymbol{\Phi}}^{+}[k,i] = e^{i\omega_k t_i}, \qquad \tilde{\boldsymbol{\Phi}}^{-}[k,i] = e^{-i\omega_k t_i},$$
$$\tilde{\boldsymbol{Q}}, \tilde{\boldsymbol{K}}, \tilde{\boldsymbol{V}} \leftarrow \tilde{\boldsymbol{\Phi}}^{-} \odot (\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}).$$

3. **Analytic precision kernels (exact DLE solutions):**
$$\boldsymbol{\Sigma}_{\Delta t}[i,j] = \tilde{\sigma}^2\big(1 - \boldsymbol{E}^2[|t_i - t_j|]\big) + \eta^2\,\boldsymbol{E}^2[|t_i - t_j|] + \gamma^2,$$
$$\hat{\boldsymbol{M}} \leftarrow \boldsymbol{M}\cdot e^{-\mu|t_i - t_j|}, \qquad \boldsymbol{P}_{\Delta t}[i,j] \leftarrow \left(\boldsymbol{\Sigma}_{\Delta t}[0,0]/\boldsymbol{M}[i]^2 + \boldsymbol{\Sigma}_{\Delta t}[i,j]/\hat{\boldsymbol{M}}[j]^2 + \tau_{\theta}^2\right)^{-1}.$$

4. **Spherical attention:**
$$\|\boldsymbol{R}_{qk}[i,j]\|^2 = \|\boldsymbol{Q}_i\|^2 + \|\boldsymbol{K}_j\|^2 - 2\,\mathrm{Re}\big(\tilde{\boldsymbol{Q}}_i^{\dagger}\tilde{\boldsymbol{K}}_j\big),$$
$$\boldsymbol{L} = \log(\boldsymbol{P}_{\Delta t}) - \frac{\nu+d}{d}\log\!\left(1 + \frac{1}{\nu}\,\boldsymbol{P}_{\Delta t}\odot\|\boldsymbol{R}_{qk}\|^2\right), \qquad \boldsymbol{A} = \mathrm{Softmax}_j\big(\beta_s\,\boldsymbol{L} + \boldsymbol{M}_{\text{causal}}\big).$$

5. **Aggregation and counter-rotation:**
$$\bar{\boldsymbol{V}} \leftarrow \tilde{\boldsymbol{\Phi}}^{+}\odot\big(\tilde{\boldsymbol{V}}\boldsymbol{A}^{\top}\big).$$

6. **Tangent-space residual update:**
$$\bar{\boldsymbol{U}}[:,i] \leftarrow \bar{\boldsymbol{V}}[:,i] - \frac{\boldsymbol{V}[:,i]^{\top}\bar{\boldsymbol{V}}[:,i]}{\|\boldsymbol{V}[:,i]\|^2}\,\boldsymbol{V}[:,i], \qquad \bar{\boldsymbol{Z}} \leftarrow \mathcal{L}_o(\bar{\boldsymbol{U}}), \qquad \boldsymbol{Z}^{+} \leftarrow \boldsymbol{Z} + \bar{\boldsymbol{Z}}.$$

7. **Output normalization:**
$$\boldsymbol{U}^{+} \leftarrow \mathrm{Norm}(\boldsymbol{Z}^{+}).$$

**Return:** $\boldsymbol{Z}^{+}, \boldsymbol{U}^{+}$.
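For reference, a self-contained single-head NumPy sketch of Algorithm 1. The complex projection layers $\mathcal{L}_{q,k,v}$ and $\mathcal{L}_o$ are replaced by random real matrices, the output simply takes the real part where the paper maps complex values back to real, and all parameter values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(10)
d, N = 8, 12
mu, nu, beta_s, tau2 = 0.05, 5.0, 1.0, 0.05
sigma2, eta2, gamma2 = 1.0, 0.1, 0.05          # noise variances (placeholders)
t = np.arange(N, dtype=float)
omega = rng.normal(size=d)

W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
W_o = rng.normal(size=(d, d)) / np.sqrt(d)

def rt_rfa(Z: np.ndarray):
    # 1. QKV projection; record magnitudes, then normalize columns
    Q, K, V = W_q @ Z, W_k @ Z, W_v @ Z
    M = np.linalg.norm(V, axis=0)                        # token magnitudes
    Q, K, V = (X / np.linalg.norm(X, axis=0) for X in (Q, K, V))

    # 2. RoPE-style rotation and decay kernels
    dt = np.abs(t[:, None] - t[None, :])
    E = np.exp(-mu * dt)
    Phi_m = np.exp(-1j * omega[:, None] * t[None, :])
    Qt, Kt, Vt = Phi_m * Q, Phi_m * K, Phi_m * V

    # 3. Analytic precision kernel (scalar per query-key pair)
    Sig = sigma2 * (1 - E**2) + eta2 * E**2 + gamma2
    M_hat = M[None, :] * E                               # transported magnitudes
    P = 1.0 / (Sig[0, 0] / M[:, None]**2 + Sig / M_hat**2 + tau2)

    # 4. Spherical attention logits (robust Student-t influence)
    qq = np.sum(np.abs(Q)**2, axis=0)
    kk = np.sum(np.abs(K)**2, axis=0)
    R2 = qq[:, None] + kk[None, :] - 2 * np.real(Qt.conj().T @ Kt)
    L = np.log(P) - (nu + d) / d * np.log1p(P * R2 / nu)
    mask = np.where(np.tril(np.ones((N, N))) > 0, 0.0, -np.inf)
    logits = beta_s * L + mask
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)

    # 5. Aggregation and counter-rotation
    V_bar = np.conj(Phi_m) * (Vt @ A.T)
    V_bar = np.real(V_bar)           # stand-in for the complex-to-real layer L_o input

    # 6. Tangent-space residual update in the value frame
    coeff = np.sum(V * V_bar, axis=0) / np.sum(V * V, axis=0)
    U_bar = V_bar - coeff * V
    Z_plus = Z + W_o @ U_bar

    # 7. Output normalization (retraction onto the sphere)
    U_plus = Z_plus / np.linalg.norm(Z_plus, axis=0)
    return Z_plus, U_plus

Z = rng.normal(size=(d, N))
Z_plus, U_plus = rt_rfa(Z)
print(Z_plus.shape, np.linalg.norm(U_plus, axis=0))      # unit columns
```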

**Algorithm 2: RT-Transformer**

**Input:** $\boldsymbol{Z} \in \mathbb{R}^{d\times N}$.

$$\boldsymbol{Z}^{+}, \boldsymbol{U}^{+} \leftarrow \text{RT-RFA}(\boldsymbol{Z}), \qquad \boldsymbol{Z}_{\text{out}} = \boldsymbol{Z}^{+} + \mathrm{FFN}(\boldsymbol{U}^{+}).$$

**Return:** $\boldsymbol{Z}_{\text{out}}$.
