Reachability and asymptotics of Gaussian Transformer dynamics

arXiv cs.LG 06/09/26, 04:00 AM Papers
Summary
This paper presents a mathematical framework for Transformer dynamics as a nonlinear control system on probability measures, proving that Gaussian distributions remain Gaussian under the flow, reducing to finite-dimensional bilinear control, and establishing reachability conditions and asymptotic stability results.
arXiv:2606.07600v1 Announce Type: new
# Reachability and Asymptotics of Gaussian Transformer Dynamics

Funding: AA was funded by the European Union’s Horizon Europe MSCA project ModConFlex (grant number 101073558). EZ was funded by the Alexander von Humboldt-Professorship program, the ERC Advanced Grant CoDeFeL, the Grants PID2020-112617GB-C22 KiLearn and TED2021-131390B-I00-DasEl of MINECO and PID2023-146872OB-I00-DyCMaMod of MICIU (Spain), the European Union’s Horizon Europe MSCA project ModConFlex (grant number 101073558), the Transregio 154 Project “Mathematical Modelling, Simulation and Optimization Using the Example of Gas Networks” of the DFG, the AFOSR 24IOE027 project, and the SURE-AI Centre grant 357482, Research Council of Norway.
Source: [https://arxiv.org/html/2606.07600](https://arxiv.org/html/2606.07600)
A. Alcalde, Z. Ji, and E. Zuazua

Albert Alcalde: Chair for Dynamics, Control, Machine Learning & Numerics (Alexander von Humboldt Professorship), Department of Mathematics, Friedrich–Alexander-Universität Erlangen–Nürnberg, 91058 Erlangen, Germany.

Enrique Zuazua: Departamento de Matemáticas, Universidad Autónoma de Madrid, 28049 Madrid, Spain. Chair of Computational Mathematics, Fundación Deusto, Av. de las Universidades, 24, 48007 Bilbao, Basque Country, Spain.

###### Abstract

We formulate data propagation through the Transformer, the machine learning architecture powering large language models, as a nonlinear control system on the space of probability measures\. For the mean\-field Transformer model with self\-attention and affine feed\-forward layers, we prove that Gaussian distributions remain exactly Gaussian along the induced flow\. This invariance reduces the infinite\-dimensional measure dynamics to a finite\-dimensional bilinear control system governing the evolution of the mean and covariance, reformulates the expressive capacity of Transformers as a reachability problem for prescribed Gaussian moments, and reveals a novel connection with Riccati\-type equations from classical filtering and control\.

For time\-varying controls, we prove exact finite\-time reachability of any target Gaussian distribution whose covariance matrix has the same rank as the initial one, this rank constraint being an intrinsic invariant of the dynamics\. For time\-invariant parameters, we derive explicit spectral conditions leading either to asymptotic stability toward positive\-definite equilibria or to finite\-time blow\-up of the covariance\.

Numerical experiments complement the theory by showing that practical Transformers with Gaussian inputs remain close to moment\-matched Gaussian distributions through early and intermediate layers, while Transformers with prescribed attention matrices reproduce the predicted covariance regimes: bounded evolution in stabilizing configurations and blow\-up in destabilizing ones\.

###### Keywords

deep learning, mean-field transformers, self-attention, Riccati differential equations, covariance control

###### MSC codes

68T07, 93B03, 93D20

## 1 Introduction

Transformers [bahdanau2014neural,vaswani2017attention] have become the dominant architecture in modern machine learning, achieving state-of-the-art performance in natural language processing [achiam2023gpt4,devlin2019bert], computer vision [carion2020end,liu2021swin], genomics [abramson2024accurate,jumper2021highly], and scientific machine learning [bodnar2025foundation,price2025probabilistic]. Despite this empirical success, a rigorous mathematical framework characterizing exactly when and how Transformers reliably represent and propagate information remains elusive. This gap has motivated a growing theoretical effort to understand them through continuous-depth and mean-field limits, using tools from nonlinear control [bruno2025emergence,burger2025analysis,castin2025unified,geshkovski2025mathematical,sander2022sinkformers]. When the number of layers and inputs becomes large, the Transformer model can be viewed as a controlled nonlinear flow acting on probability measures [peyre2025optimal], in which the controlled state is the density of the input distribution, the layer index is interpreted as a continuous time variable, and the layer-varying parameters of the Transformer serve as controls.

A natural and analytically tractable setting for studying this measure flow is the invariant manifold of Gaussian input distributions [castin2025unified]. This setting is widely adopted in modeling independently sampled data (see [Section 1.2](https://arxiv.org/html/2606.07600#S1.SS2)) and, crucially, the Gaussian invariant manifold reduces the infinite-dimensional controlled evolution characterizing the information-propagation properties of Transformers to a finite-dimensional control system for the mean and covariance (see [Section 2.1](https://arxiv.org/html/2606.07600#S2.SS1)). The Gaussian framework provides a rigorous control-theoretic paradigm to investigate the training dynamics and expressivity of Transformers. Specifically, the approximation capacity of a Transformer translates directly into a reachability problem: does there exist a sequence of time-varying controls (weight parameters) capable of steering an initial Gaussian measure along the nonlinear flow to match an arbitrary target Gaussian? Furthermore, understanding the forward-pass dynamics naturally raises fundamental questions of asymptotic behavior: under what control parameter regimes does the Transformer flow stabilize to a stationary distribution, and when does it diverge? In the language of control theory, these properties correspond precisely to the system’s controllability and stability [tabuada2022universal].

In this paper, we address the key questions of reachability and asymptotic behavior of the nonlinear control system arising from the mean-field Transformer model in the Gaussian setting. The Transformer architecture poses distinct control-theoretic challenges: (i) due to the strongly nonlinear nature of self-attention, the well-posedness of the mean-field evolution is not guaranteed when the initial measure is not compactly supported; (ii) the Gaussian reduction leads to coupled nonlinear dynamics of the mean and covariance, complicating the use of classical geometric control techniques [bianchini2003needle]; (iii) the non-commutativity of the control matrices arising from the self-attention mechanism presents significant obstacles to establishing the existence of equilibria, and the nonlinearities in the covariance flow complicate global stability analysis. We overcome these difficulties using resolvent techniques and perturbation analysis, explicitly bridging the Transformer dynamics with classical Riccati theory.

### 1\.1 Our contributions

The main contributions of this paper are as follows:

- We study the mean-field Transformer model with affine feed-forward layers and self-attention, and prove that the class of Gaussian measures remains invariant under the resulting system. This yields a Riccati-type ODE system for the evolution of mean and covariance ([Proposition 2.1](https://arxiv.org/html/2606.07600#S2.Thmtheorem1)), resembling a classical formulation in optimal filtering and control theory:

  $$\begin{cases}\dot{\mu}=(A+V)\mu+V\Sigma B\mu+b,\\ \dot{\Sigma}=A\Sigma+\Sigma A^{\top}+V\Sigma B\Sigma+\Sigma B^{\top}\Sigma V^{\top},\end{cases}\tag{1}$$

  where $\mu(t)\in\mathbb{R}^{d}$ is the mean and $\Sigma(t)\in\mathbb{R}^{d\times d}$ is the covariance of the Gaussian at time $t$, while $A(t),B(t),V(t)\in\mathbb{R}^{d\times d}$ and $b(t)\in\mathbb{R}^{d}$ are trainable parameters acting as controls. Further, when the feed-forward layer is constructed using the ReLU activation, we derive quantitative estimates on the discrepancy between measure flows and the Gaussian evolution ([1](https://arxiv.org/html/2606.07600#S1.E1)) ([Proposition 2.4](https://arxiv.org/html/2606.07600#S2.Thmtheorem4)). The argument is not specific to ReLU and indicates how analogous estimates may be obtained for more general Lipschitz activations.
- For time-varying parameter matrices, we show that the rank of the covariance matrix $\Sigma$ is preserved along the flow of ([1](https://arxiv.org/html/2606.07600#S1.E1)) ([Lemma 3.1](https://arxiv.org/html/2606.07600#S3.Thmtheorem1)). Leveraging matrix congruence transformations, we construct explicit time-varying control paths that achieve exact finite-time reachability of any target Gaussian state sharing this initial covariance rank ([Theorem 3.3](https://arxiv.org/html/2606.07600#S3.Thmtheorem3)).
- For time-invariant parameters, we characterize the long-time behavior of the mean/covariance dynamics ([1](https://arxiv.org/html/2606.07600#S1.E1)). For stabilizing parameter regimes, we show existence of positive-definite equilibria and derive sufficient conditions for local stability ([Theorems 4.1](https://arxiv.org/html/2606.07600#S4.Thmtheorem1) and [4.7](https://arxiv.org/html/2606.07600#S4.Thmtheorem7)). Conversely, under destabilizing parameter configurations, the covariance blows up in finite time ([Theorem 4.8](https://arxiv.org/html/2606.07600#S4.Thmtheorem8)). Moreover, one can asymptotically match an arbitrary target mean by static choices of parameters under stabilizing conditions ([Theorem 4.11](https://arxiv.org/html/2606.07600#S4.Thmtheorem11)). Interestingly, the resulting stability conditions are consistent with empirical observations in pretrained vision Transformers [trockman2023mimetic].
- We perform numerical experiments serving two complementary purposes beyond the exact Gaussian theory. First, we show that pretrained Transformers, although outside the assumptions of the affine mean-field model, preserve an approximately Gaussian moment structure over early and intermediate layers when initialized with Gaussian inputs. Second, we test the robustness of the covariance predictions in more realistic discrete architectures with nonlinear feed-forward blocks, showing bounded covariance growth in stabilizing sign configurations and blow-up in destabilizing ones.

Taken together, these results provide a rigorous framework for understanding data propagation in trained Transformers, traditionally studied via static approximation theory, from a dynamical and control-theoretic perspective.

### 1\.2 Motivating the Gaussian setting

The restriction to Gaussian input distributions is not merely an analytical convenience, but also a standard paradigm in theoretical studies of the in-context learning (ICL) capabilities of Transformers [garg2022can,goel2026training,von2023transformers,zhang2024trained]. Broadly speaking, ICL investigates how Transformers can solve families of supervised learning tasks directly from contextual examples provided at inference time. In this setting, the input to the Transformer typically consists of a sequence $\{(x_{1},y_{1}),(x_{2},y_{2}),\dots,(x_{m},y_{m}),(x_{\text{query}},0)\}$, where each feature $x_{i}\in\mathbb{R}^{d}$ (including the query $x_{\text{query}}$) is independently sampled from a Gaussian distribution (often $x_{i}\sim\mathcal{N}(0,\Sigma)$), and the labels $y_{i}$ are generated by some unknown task-specific rule (for instance, $y_{i}=w^{\top}x_{i}$ in a linear regression task, with $w\sim\mathcal{N}(0,I_{d})$). The Transformer is trained over many such tasks to predict the missing label $y_{\text{query}}$ from the contextual examples. An important aspect of this framework is that Gaussian inputs are not intended as realistic models of data distributions. Rather, they provide an analytically tractable ensemble allowing one to isolate and study the mechanisms by which attention layers aggregate and propagate information.
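The prompt construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `sample_icl_prompt`, the row layout `(x_i, y_i)`, and the choice of NumPy are ours and do not come from the paper or the cited ICL works.

```python
import numpy as np


def sample_icl_prompt(m, d, Sigma, rng):
    """Sample one linear-regression ICL prompt (hypothetical helper).

    Returns the (m + 1, d + 1) prompt whose rows are (x_i, y_i), with the
    query row last and its label slot set to 0, together with the task
    vector w and the hidden query label.
    """
    w = rng.standard_normal(d)                                   # w ~ N(0, I_d)
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=m + 1)  # x_i ~ N(0, Sigma)
    y = X @ w                                                    # y_i = w^T x_i
    labels = y.copy()
    labels[-1] = 0.0                                             # the query label is masked
    prompt = np.column_stack([X, labels])
    return prompt, w, y[-1]


rng = np.random.default_rng(0)
prompt, w, y_query = sample_icl_prompt(m=8, d=3, Sigma=np.eye(3), rng=rng)
print(prompt.shape)  # one row per context example plus the query row
```

Each call draws a fresh task vector `w`, which mirrors the idea that the model is trained over many tasks rather than a single regression problem.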

### 1\.3 Related work

Our work builds on recent efforts to understand Transformer architectures through mean\-field limits and controlled measure flows\. We position our contribution with respect to three closely related directions: asymptotic analyses of attention dynamics, controllability of Transformers, and Riccati\-type covariance dynamics\.

#### Asymptotics and mean\-field perspective on Transformer dynamics

A recurring theme in theoretical studies is that repeated application of attention layers can drive collapse phenomena: inputs cluster and attention matrices can become effectively low-rank, limiting expressivity. At the particle level, the asymptotic dynamics of Transformers with only self-attention layers have been studied in [alcalde2025clustering,11494448,geshkovski2023emergence,pham2025dynamical].

These observations are further clarified in the mean-field regime, where the dynamics can be described by partial differential equations on probability measures in $\mathbb{R}^{d}$. Early contributions identify well-posedness [sander2022sinkformers], clustering [burger2025analysis,geshkovski2025mathematical], metastability [alcalde2026quantifying,bruno2025emergence,bruno2025a,geshkovski2024dynamic], and phase transitions relevant to long-context attention [chen2025critical]. Closest to our work is [castin2025unified], which proves Gaussian invariance for self-attention-only Transformers and derives corresponding mean and covariance equations. Their asymptotic results rely on a strong commutativity condition between the control matrices and the initial covariance. Our contribution is to move beyond the self-attention-only setting by incorporating affine feed-forward layers, and to analyze the resulting Gaussian system relying on weaker sign assumptions on the control matrices.

#### Reachability and simultaneous controllability of Transformers

The question of controllability and target reachability is central to Transformers. At the discrete level, a first approximate simultaneous controllability result is proved in [yun2019transformers], and extended to the exact setting in [alcalde2025exact,kim2023provable]. From a mean-field perspective, the simultaneous controllability of Transformers including normalization layers has been established in [geshkovski2024measure], while [akman2026optimal] takes an optimal control approach to study the training dynamics of Transformers. Although related, our work takes the different perspective of studying how a target Gaussian can be reached with minimal assumptions on the controls.

#### Bilinear systems and covariance control

As established in [Proposition 2.1](https://arxiv.org/html/2606.07600#S2.Thmtheorem1), ([1](https://arxiv.org/html/2606.07600#S1.E1)) forms a bilinear system, revealing a surprising link between Transformer models and linear quadratic regulator theory, as the covariance matrix $\Sigma$ can be viewed as a “feedback gain” of the mean $\mu$ (see [Section 4.1](https://arxiv.org/html/2606.07600#S4.SS1)). It is also related to the optimal estimation problem which aims at designing feedback controllers for the closed-loop system to reach a specified state covariance [hotz1987covariance]. Unlike in optimal control, where the Riccati equation is a tool to find optimal gains with guaranteed well-posedness, in our analysis the Riccati/Bernoulli-type equation is part of the state dynamics, the stability of which is under question. The coupled structure ([1](https://arxiv.org/html/2606.07600#S1.E1)) is hence fundamentally new due to the specific structure of Transformers, whose behavior is complicated by the inherent lack of commutativity between control parameters and the covariance. Our controllability analysis of $\mu$ and $\Sigma$ based on congruence transformations avoids the complexity of bracket computation in traditional nonlinear control theory [elliott2009bilinear].

### 1\.4 Organization of the paper

The remainder of the paper is organized as follows. In [Section 2](https://arxiv.org/html/2606.07600#S2), we introduce the Gaussian Transformer model. [Section 3](https://arxiv.org/html/2606.07600#S3) is devoted to finite-time reachability of the model, while [Section 4](https://arxiv.org/html/2606.07600#S4) addresses its asymptotic behavior for time-invariant parameters. In [Section 5](https://arxiv.org/html/2606.07600#S5), we present our numerical experiments, and [Section 6](https://arxiv.org/html/2606.07600#S6) concludes the paper, identifying future perspectives.

## 2 Transformer dynamics with Gaussian initial conditions

We study a class of nonlinear transport equations arising as mean-field limits of Transformer architectures [vaswani2017attention]. Following recent developments at the interface of machine learning, optimal transport, and control theory [castin2025unified,geshkovski2024measure,sander2022sinkformers], these models describe the evolution of probability measures driven by parameterized, nonlocal vector fields encoding the architecture of the network. We briefly recall the modeling pathway leading to the equations considered in this work, referring to [geshkovski2025mathematical] for detailed derivations from the discrete architecture.

The Transformer is a deep neural network model operating on a finite collection of particles $\{x_{i}^{\ell}\}_{i=1}^{n}\subset\mathbb{R}^{d}$, called *tokens*, which represent, for instance, words in a sentence or pixels in a picture. In the infinite-depth limit, their evolution is modeled as a continuous-time interacting particle system governed by

$$\dot{x}_{i}(t)=\sigma(A(t)x_{i}(t)+b(t))+\sum_{j=1}^{n}\frac{e^{x_{j}(t)^{\top}B(t)x_{i}(t)}}{\sum_{\ell=1}^{n}e^{x_{\ell}(t)^{\top}B(t)x_{i}(t)}}\,V(t)x_{j}(t).\tag{2}$$

In this system, $\sigma:\mathbb{R}^{d}\to\mathbb{R}^{d}$ denotes a nonlinearity applied componentwise known as the *activation function*, while $A(t),B(t),V(t)\in\mathbb{R}^{d\times d}$ and $b(t)\in\mathbb{R}^{d}$ are time-dependent parameters inherited from the trained network. Motivated by the increasing context lengths of modern Transformers [press2022train], we are interested in studying the limit as the number of particles $n\to\infty$. Introducing the empirical measure $\rho_{t}^{n}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\delta_{x_{i}(t)}$, the right-hand side of ([2](https://arxiv.org/html/2606.07600#S2.E2)) can be written as a functional of $\rho_{t}^{n}$. As $n\to\infty$, for compactly supported input measures, this system converges (in the sense made precise in [castin2025unified]) to a well-posed continuity equation

$$\begin{cases}\partial_{t}\rho_{t}+\nabla_{x}\cdot\left(\rho_{t}\Gamma[\rho_{t}](t,x)\right)=0,\\ \rho_{t=0}=\rho_{0},\end{cases}\tag{3}$$

where $\rho_{t}\in\mathcal{P}(\mathbb{R}^{d})$ and, for any $\rho\in\mathcal{P}(\mathbb{R}^{d})$, the velocity field $\Gamma[\rho]$ is defined as

$$\Gamma[\rho](t,x)=\sigma\left(A(t)x+b(t)\right)+\frac{\int e^{y^{\top}B(t)x}\,V(t)y\,\mathrm{d}\rho(y)}{\int e^{y^{\top}B(t)x}\,\mathrm{d}\rho(y)}.\tag{4}$$

We refer to ([3](https://arxiv.org/html/2606.07600#S2.E3)) as the *Transformer PDE*. The vector field ([4](https://arxiv.org/html/2606.07600#S2.E4)) naturally decomposes into a pointwise drift

$$\mathcal{F}(t,x)\coloneqq\sigma(A(t)x+b(t)),$$

corresponding to the so-called *feed-forward* layers, and a nonlocal interaction term

$$\mathcal{A}[\rho](t,x)\coloneqq\frac{\int e^{y^{\top}B(t)x}\,V(t)y\,\mathrm{d}\rho(y)}{\int e^{y^{\top}B(t)x}\,\mathrm{d}\rho(y)},\tag{5}$$

which corresponds to the *self-attention* mechanism. From a control-theoretic viewpoint, ([3](https://arxiv.org/html/2606.07600#S2.E3)) is a controlled continuity equation with measure-dependent drift, where the time-dependent parameters $A(t),B(t),V(t),b(t)$ act as control variables.
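To make the feed-forward/attention decomposition concrete, the particle system (2) can be evaluated for all tokens at once. The following is a minimal sketch, assuming parameters frozen at a single time and tokens stored as rows of an array. The function names and the explicit Euler step are illustrative choices of ours, not the discretisation used in the paper.

```python
import numpy as np


def particle_rhs(X, A, B, V, b, sigma=lambda z: z):
    """Right-hand side of the particle system (2) for every token.

    X has shape (n, d), one token per row. `sigma` defaults to the identity
    (affine feed-forward); pass lambda z: np.maximum(z, 0.0) for ReLU.
    """
    feed_forward = sigma(X @ A.T + b)                     # sigma(A x_i + b), row-wise
    logits = X @ B.T @ X.T                                # logits[i, j] = x_j^T B x_i
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilise the exponentials
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)         # softmax over j for each i
    attention = weights @ (X @ V.T)                       # sum_j weights[i, j] V x_j
    return feed_forward + attention


def euler_step(X, A, B, V, b, h, sigma=lambda z: z):
    """One explicit Euler step of size h (an illustrative time discretisation)."""
    return X + h * particle_rhs(X, A, B, V, b, sigma)
```

With `B = 0` the softmax weights are uniform, so the attention term reduces to `V` applied to the empirical mean of the tokens, which gives a quick sanity check on the indexing.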

In this work, we focus on the dynamics of \([3](https://arxiv.org/html/2606.07600#S2.E3)\) for Gaussian input measures\.

As we will show below in [Proposition 2.1](https://arxiv.org/html/2606.07600#S2.Thmtheorem1), when the initial measure $\rho_{0}$ is Gaussian and the activation function $\sigma$ is the identity, the solution $\rho_{t}$ remains Gaussian for all $t$, and the infinite-dimensional dynamics ([3](https://arxiv.org/html/2606.07600#S2.E3)) reduce to an ODE system governing the evolution of the mean and covariance. We refer to the resulting reduced-order model as the *Gaussian Transformer*. Thus, the question of well-posedness of the model can be studied equivalently in this finite-dimensional setting. Alternatively, for the case of a Gaussian input measure and $\sigma$ being the ReLU activation function, we will show that the solution remains sub-Gaussian for short times (see [Lemma 2.3](https://arxiv.org/html/2606.07600#S2.Thmtheorem3)). This allows us to extend the well-posedness guarantee to this setting.

Throughout the paper, for a symmetric matrix $A\in\mathbb{R}^{d\times d}$, we write $A\succ 0$ (resp. $A\succeq 0$, $A\prec 0$, $A\preceq 0$) when $A$ is positive definite (resp. positive semi-definite, negative definite, negative semi-definite).

### 2\.1 The Gaussian Transformer

For Gaussian measures, the self-attention operator admits an explicit affine representation. If $\rho=\mathcal{N}(\mu,\Sigma)$, then the *attention-only* Transformer PDE ([3](https://arxiv.org/html/2606.07600#S2.E3)) with $\sigma=0$ induces the vector field

$$\mathcal{A}[\rho](t,x)=V(t)(\mu+\Sigma B(t)x)\tag{6}$$

as shown in [castin2025unified]. In particular, Gaussian measures are invariant under the self-attention-only flow, and their evolution reduces to a closed system of ODEs for $(\mu,\Sigma)$. We extend this structure to the full Transformer dynamics with an affine feed-forward term, i.e., $\sigma=\mathrm{id}$ in ([4](https://arxiv.org/html/2606.07600#S2.E4)).
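The closed form (6) can be checked numerically by comparing it with the softmax-weighted average in (5) evaluated on Gaussian samples. The sketch below is a Monte Carlo sanity check under our own assumptions (function names, sample size, and the specific small test matrices are hypothetical). It is not a proof, and its accuracy degrades when $B x$ is large because the exponential weights become heavy-tailed.

```python
import numpy as np


def attention_field_empirical(x, samples, B, V):
    """Self-attention term (5) with rho replaced by the empirical measure of `samples`."""
    logits = samples @ (B @ x)                  # y_j^T B x for every sample y_j
    weights = np.exp(logits - logits.max())     # shift for numerical stability
    weights /= weights.sum()
    return V @ (weights @ samples)              # V times the tilted sample mean


def attention_field_gaussian(x, mu, Sigma, B, V):
    """Closed form (6) for rho = N(mu, Sigma)."""
    return V @ (mu + Sigma @ B @ x)


rng = np.random.default_rng(0)
mu = np.array([0.3, -0.2])
Sigma = np.array([[0.5, 0.1], [0.1, 0.4]])
B = np.array([[0.4, 0.1], [-0.2, 0.3]])
V = np.array([[1.0, 0.5], [0.0, 1.0]])
x = np.array([0.7, -0.5])
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
print(attention_field_empirical(x, samples, B, V))
print(attention_field_gaussian(x, mu, Sigma, B, V))
```

The agreement reflects a standard fact: reweighting a Gaussian by $e^{\theta^{\top}y}$ shifts its mean from $\mu$ to $\mu+\Sigma\theta$, here with $\theta=Bx$.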

###### Proposition 2\.1 \(Gaussian Transformer\)\.

Let $\rho_{t}$ be the solution of ([3](https://arxiv.org/html/2606.07600#S2.E3)) with $\sigma=\mathrm{id}$ and initial condition $\rho_{0}=\mathcal{N}(\mu_{0},\Sigma_{0})$, where $\Sigma_{0}\succeq 0$. Then, there exists $T_{\max}>0$ such that for all $t\in[0,T_{\max})$, $\rho_{t}$ remains Gaussian, with its mean $\mu(t)$ and covariance $\Sigma(t)$ satisfying

$$\begin{cases}\dot{\mu}=\left(A(t)+V(t)+V(t)\Sigma B(t)\right)\mu+b(t), & \mu(0)=\mu_{0},\\ \dot{\Sigma}=A(t)\Sigma+\Sigma A(t)^{\top}+V(t)\Sigma B(t)\Sigma+\Sigma B(t)^{\top}\Sigma V(t)^{\top}, & \Sigma(0)=\Sigma_{0}.\end{cases}\tag{7}$$

###### Proof 2\.2\.

Here and below, unless otherwise stated, $A,B,V,b$ are evaluated at time $t$. By ([6](https://arxiv.org/html/2606.07600#S2.E6)), the velocity field is given by

$$\Gamma[\rho_{t}](t,x)=Ax+b+V\mu+V\Sigma Bx=\big(A+V\Sigma B\big)x+\big(V\mu+b\big),$$

which is affine in $x$. The pushforward of a Gaussian by an affine map is again Gaussian, so $\rho_{t}$ stays Gaussian. To derive dynamics for the mean and the covariance, we use standard moment identities: for any smooth $\phi:\mathbb{R}^{d}\to\mathbb{R}$, it holds that

$$\frac{\mathrm{d}}{\mathrm{d}t}\int\phi(x)\,\mathrm{d}\rho_{t}(x)=\int\nabla\phi(x)\cdot\Gamma[\rho_{t}](t,x)\,\mathrm{d}\rho_{t}(x).\tag{8}$$

Next, we use the definition $\mu(t)=\int x\,\mathrm{d}\rho_{t}(x)$ and ([8](https://arxiv.org/html/2606.07600#S2.E8)) with $\phi(x)=x$ componentwise to obtain

$$\dot{\mu}=\int\Gamma[\rho_{t}](t,x)\,\mathrm{d}\rho_{t}(x)=\int\big(Ax+b+V\mu+V\Sigma Bx\big)\,\mathrm{d}\rho_{t}(x)=A\mu+b+V\mu+V\Sigma B\mu,$$

which gives the desired mean dynamics. For the covariance dynamics, let $M(t):=\mathbb{E}_{\rho_{t}}[xx^{\top}]$, so $\Sigma=M-\mu\mu^{\top}$. Differentiating $M$ gives

$$\dot{M}=\mathbb{E}_{\rho_{t}}\big[x\,\Gamma[\rho_{t}](t,x)^{\top}+\Gamma[\rho_{t}](t,x)\,x^{\top}\big].$$

Since $\Gamma[\rho_{t}](t,x)=Ax+b+V\mu+V\Sigma Bx$, linearity, the symmetry of $\Sigma$, and the identity $\mathbb{E}_{\rho_{t}}[x]=\mu$ yield

$$\begin{aligned}\dot{M}&=\mathbb{E}_{\rho_{t}}[x(Ax)^{\top}]+\mathbb{E}_{\rho_{t}}[x(V\Sigma Bx)^{\top}]+\mathbb{E}_{\rho_{t}}[xb^{\top}]+\mathbb{E}_{\rho_{t}}[x(V\mu)^{\top}]\\&\quad+\mathbb{E}_{\rho_{t}}[(Ax)x^{\top}]+\mathbb{E}_{\rho_{t}}[(V\Sigma Bx)x^{\top}]+\mathbb{E}_{\rho_{t}}[bx^{\top}]+\mathbb{E}_{\rho_{t}}[(V\mu)x^{\top}]\\&=MA^{\top}+MB^{\top}\Sigma V^{\top}+\mu b^{\top}+\mu\mu^{\top}V^{\top}+AM+V\Sigma BM+b\mu^{\top}+V\mu\mu^{\top}.\end{aligned}$$

Differentiating $\Sigma=M-\mu\mu^{\top}$ and substituting $\dot{\mu}$ gives

$$\dot{\Sigma}=\dot{M}-\dot{\mu}\,\mu^{\top}-\mu\,\dot{\mu}^{\top}=A\Sigma+\Sigma A^{\top}+V\Sigma B\Sigma+\Sigma B^{\top}\Sigma V^{\top},$$

which completes the proof.
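The reduced system (7) is straightforward to integrate numerically. Below is a minimal sketch using a hand-written fourth-order Runge-Kutta scheme, assuming time-invariant parameters and a horizon short enough that the covariance does not blow up. The function names, step count, and test matrices are our own illustrative choices.

```python
import numpy as np


def gaussian_transformer_rhs(mu, Sigma, A, B, V, b):
    """Right-hand side of (7) for constant parameters A, B, V, b."""
    F = A + V @ Sigma @ B                 # linear part of the affine velocity field
    dmu = (F + V) @ mu + b                # (A + V + V Sigma B) mu + b
    dSigma = F @ Sigma + Sigma @ F.T      # A S + S A^T + V S B S + S B^T S V^T
    return dmu, dSigma


def integrate_rk4(mu0, Sigma0, A, B, V, b, T, n_steps=1000):
    """Integrate (7) on [0, T] with classical RK4 and return (mu(T), Sigma(T))."""
    h = T / n_steps
    mu, Sigma = mu0.astype(float), Sigma0.astype(float)
    for _ in range(n_steps):
        k1m, k1s = gaussian_transformer_rhs(mu, Sigma, A, B, V, b)
        k2m, k2s = gaussian_transformer_rhs(mu + 0.5 * h * k1m, Sigma + 0.5 * h * k1s, A, B, V, b)
        k3m, k3s = gaussian_transformer_rhs(mu + 0.5 * h * k2m, Sigma + 0.5 * h * k2s, A, B, V, b)
        k4m, k4s = gaussian_transformer_rhs(mu + h * k3m, Sigma + h * k3s, A, B, V, b)
        mu = mu + (h / 6.0) * (k1m + 2 * k2m + 2 * k3m + k4m)
        Sigma = Sigma + (h / 6.0) * (k1s + 2 * k2s + 2 * k3s + k4s)
    return mu, Sigma


# Example: a rank-one initial covariance under a contracting feed-forward matrix.
d = 3
A = -np.eye(d)
V = 0.2 * np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.5, 0.0, 1.0]])
B = 0.3 * np.eye(d)
b = 0.1 * np.ones(d)
u = np.array([1.0, 2.0, 0.5])
mu_T, Sigma_T = integrate_rk4(np.zeros(d), np.outer(u, u), A, B, V, b, T=1.0)
print(np.linalg.matrix_rank(Sigma_T, tol=1e-8))
```

Writing the covariance equation as $\dot{\Sigma}=F\Sigma+\Sigma F^{\top}$ with $F=A+V\Sigma B$ makes the congruence structure visible, which is why the numerical rank of the covariance is expected to stay equal to its initial value up to integration error.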

### 2\.2 Transformer dynamics with ReLU feed\-forward layers

[Proposition 2.1](https://arxiv.org/html/2606.07600#S2.Thmtheorem1) shows that Gaussian distributions are invariant under the Transformer PDE ([3](https://arxiv.org/html/2606.07600#S2.E3)) when the activation function in the feed-forward layers is set to the identity. In this section, we consider the dynamics of the distribution $\rho_{t}$ in the presence of the ReLU activation $\sigma(x)=\max(0,x)$ (componentwise), and quantitatively show that it remains close to the Gaussian evolution ([7](https://arxiv.org/html/2606.07600#S2.E7)), hence justifying the well-posedness of the Gaussian Transformer under ReLU activations.

First, we prove the following result about the sub-Gaussian behavior of the dynamics ([3](https://arxiv.org/html/2606.07600#S2.E3)) for short times. Throughout, $\lambda_{\max}(M)$ (resp. $\lambda_{\min}(M)$) denotes the largest (resp. smallest) eigenvalue of a matrix $M$.

###### Lemma 2\.3\.

Let $A$, $b$, $B$, $V$ be fixed parameters. Denote by $\rho_{t}$ the solution of ([3](https://arxiv.org/html/2606.07600#S2.E3)) with *ReLU* activation $\sigma(x)=\max(0,x)$ and initial condition $\rho_{0}=\mathcal{N}(\mu_{0},\Sigma_{0})$. Assume $\kappa_{0}\in\mathbb{R}$ satisfies $4\|B\|\leq\kappa_{0}<\frac{1}{2\lambda_{\max}(\Sigma_{0})}$ and define

$$E_{t}:=\int_{\mathbb{R}^{d}}e^{\kappa_{0}|x|^{2}}\,\mathrm{d}\rho_{t}(x).$$

Then, $\rho_{t}$ is sub-Gaussian in short time, i.e., there exists $T^{*}>0$ such that $E_{t}\leq 2E_{0}$ for all $t\in[0,T^{*}]$.
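As a reader's aid, and not as part of the source's proof, the upper bound on $\kappa_{0}$ can be motivated by the standard formula for the exponential moment of a Gaussian quadratic form. Assuming $\Sigma_{0}\succeq 0$, the initial value $E_{0}$ has the closed form:

```latex
E_{0}
  = \int_{\mathbb{R}^{d}} e^{\kappa_{0}|x|^{2}}\,\mathrm{d}\mathcal{N}(\mu_{0},\Sigma_{0})(x)
  = \det\!\big(I-2\kappa_{0}\Sigma_{0}\big)^{-1/2}
    \exp\!\Big(\kappa_{0}\,\mu_{0}^{\top}\big(I-2\kappa_{0}\Sigma_{0}\big)^{-1}\mu_{0}\Big),
\qquad\text{valid when } I-2\kappa_{0}\Sigma_{0}\succ 0 .
```

The condition $I-2\kappa_{0}\Sigma_{0}\succ 0$ is the same as $\kappa_{0}<\frac{1}{2\lambda_{\max}(\Sigma_{0})}$, and $E_{0}$ is infinite otherwise. When $\|B\|>0$ and $\lambda_{\max}(\Sigma_{0})>0$, a value $\kappa_{0}$ satisfying both inequalities of the lemma exists exactly when $4\|B\|<\frac{1}{2\lambda_{\max}(\Sigma_{0})}$, that is, when $\lambda_{\max}(\Sigma_{0})<\frac{1}{8\|B\|}$.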

The proof is relegated to [Section A.1](https://arxiv.org/html/2606.07600#A1.SS1). Thanks to [Lemma 2.3](https://arxiv.org/html/2606.07600#S2.Thmtheorem3), we have the following quantitative estimate on the short-time preservation of Gaussianity in ([3](https://arxiv.org/html/2606.07600#S2.E3)) in the presence of ReLU activation.

###### Proposition 2\.4\.

Let $A$, $B$, $V$ and $b$ be fixed parameters. Denote by $\rho_{t}$ the solution of ([3](https://arxiv.org/html/2606.07600#S2.E3)) with *ReLU* activation $\sigma(x)=\max(0,x)$ and initial condition $\rho_{0}=\mathcal{N}(\mu_{0},\Sigma_{0})$, where $\Sigma_{0}\succeq 0$. If $\lambda_{\max}(\Sigma_{0})<\frac{1}{8\|B\|}$, then for $t\in[0,T^{*}]$,

$$W_{2}(\rho_{t},\nu_{t})\leq t\cdot\left\|\max(0,-(Ax+b))\right\|_{L^{2}(\rho_{0})}+\mathcal{O}(t^{2}),$$

where $W_{2}(\cdot,\cdot)$ is the Wasserstein-2 distance between distributions, and $\nu_{t}$ is the Gaussian evolution solving ([3](https://arxiv.org/html/2606.07600#S2.E3)) with $\sigma=\mathrm{id}$ and $\nu_{0}=\rho_{0}$.

The proof is postponed to [Section A.2](https://arxiv.org/html/2606.07600#A1.SS2). We nonetheless note that the argument for the short-time preservation mechanism in [Proposition 2.4](https://arxiv.org/html/2606.07600#S2.Thmtheorem4) is not specific to ReLU. More generally, for other Lipschitz nonlinear activations $\sigma$, under suitable assumptions on the initial covariance matrix, one may expect the existence of a time $T_\sigma>0$ such that, for $t\leq T_\sigma$,

$$W_2(\rho_t,\nu_t)\leq t\cdot\left\|\sigma(AX_0+b)-(AX_0+b)\right\|_{L^2(\rho_0)}+\mathcal{O}(t^2).$$

Thus, if $\sigma(AX_0+b)$ is close to $AX_0+b$ under the initial law, the nonlinear evolution remains close to the corresponding Gaussian evolution for short times.
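As a rough illustration of the leading coefficient in this bound, the following sketch estimates $\|\max(0,-(Ax+b))\|_{L^2(\rho_0)}$ by Monte Carlo sampling from $\rho_0=\mathcal{N}(\mu_0,\Sigma_0)$. The function name `relu_gap_l2`, the sample size, and the one-dimensional test data are assumptions made for illustration only; the maximum is taken componentwise and the norm is the Euclidean one.

```python
import numpy as np

def relu_gap_l2(A, b, mu0, Sigma0, n_samples=200_000, seed=0):
    """Monte Carlo estimate of || max(0, -(A x + b)) ||_{L^2(rho_0)}.

    Hypothetical helper: rho_0 = N(mu0, Sigma0), max is componentwise.
    Note that max(0, -z) = relu(z) - z, the gap between ReLU and identity.
    """
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(mu0, Sigma0, size=n_samples)  # shape (n, d)
    z = x @ A.T + b
    gap = np.maximum(0.0, -z)
    return float(np.sqrt(np.mean(np.sum(gap**2, axis=1))))

# Assumed 1D test case: A = 1, b = 0, X_0 ~ N(0, s^2) with s^2 = 0.04.
# By symmetry, E[max(0, -X_0)^2] = s^2 / 2.
A = np.array([[1.0]])
b = np.array([0.0])
mu0 = np.array([0.0])
Sigma0 = np.array([[0.04]])

estimate = relu_gap_l2(A, b, mu0, Sigma0)
exact = np.sqrt(0.04 / 2.0)
```

In this symmetric test case the estimate can be compared against the closed-form value $\sqrt{s^2/2}$; for general $A$, $b$ no closed form is assumed here.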

[Proposition 2.4](https://arxiv.org/html/2606.07600#S2.Thmtheorem4) implies that the Gaussian model ([7](https://arxiv.org/html/2606.07600#S2.E7)) provides a quantifiable short-time approximation of the nonlinear Transformer PDE ([3](https://arxiv.org/html/2606.07600#S2.E3)). Although Gaussianity is not preserved for nonlinear activations, the result quantifies a tube of validity around the Gaussian dynamics. This justifies focusing the subsequent analysis on the Gaussian setting, where the evolution is finite-dimensional, guaranteeing that the analysis on the Gaussian Transformer ([7](https://arxiv.org/html/2606.07600#S2.E7)) is informative for real models in practice (see [Section 5](https://arxiv.org/html/2606.07600#S5) for empirical validation with pretrained Transformer models).

## 3 Finite-time reachability

Our first set of results concerns finite-time behavior and the reachability set of the Gaussian Transformer ([7](https://arxiv.org/html/2606.07600#S2.E7)) under all possible choices of the time-varying parameters $A(t)$, $V(t)$ and $B(t)$.

The following result establishes that the rank of $\Sigma(t)$ along the flow ([7](https://arxiv.org/html/2606.07600#S2.E7)) cannot change as a function of $t$, so it is a natural invariant of the flow ([7](https://arxiv.org/html/2606.07600#S2.E7)).

###### Lemma 3.1 (Rank preservation).

Let $A,B,V:[0,\infty)\rightarrow\mathbb{R}^{d\times d}$ be time-varying matrices. For any finite time interval $[0,T_{\max})$ on which the solution of ([7](https://arxiv.org/html/2606.07600#S2.E7)) exists, we have $\operatorname{rank}(\Sigma(t))=\operatorname{rank}(\Sigma_0)$, $\forall t\in[0,T_{\max})$.

###### Proof 3\.2\.

We define a time-varying matrix $M(t)\coloneqq A(t)+V(t)\Sigma(t)B(t)$ for $t\in[0,T_{\max})$. By symmetry of the equation we have $\dot{\Sigma}(t)=M(t)\Sigma(t)+\Sigma(t)M(t)^\top$. Along a given solution, $M(t)$ can be treated as a known time-varying coefficient, so this is a linear time-varying equation whose solution is a congruence transformation of the initial datum. Let $\Psi(t,0)$ be the state transition matrix from time $0$ to time $t$ for the linear system $\dot{x}(t)=M(t)x(t)$, meaning that it is the unique solution of

$$\frac{\mathrm{d}}{\mathrm{d}t}\Psi(t,0)=M(t)\Psi(t,0),\quad\Psi(0,0)=I.$$

Then the unique solution for $\Sigma(t)$ is

$$\Sigma(t)=\Psi(t,0)\Sigma(0)\Psi(t,0)^\top.\tag{9}$$

The matrix $\Psi(t,0)$, as the state transition matrix of the equation $\dot{x}(t)=M(t)x(t)$, is invertible for every $t\in[0,T_{\max})$. This is because its determinant is given by Liouville's formula (see, for example, [Teschl2012ODE, Lemma 3.11]) as

$$\det(\Psi(t,0))=\exp\left(\int_0^t\operatorname{tr}(M(\tau))\,\mathrm{d}\tau\right)\neq 0.$$

Therefore, ([9](https://arxiv.org/html/2606.07600#S3.E9)) and the non-singularity of $\Psi(t,0)$ imply that the rank of $\Sigma(t)$ in ([7](https://arxiv.org/html/2606.07600#S2.E7)) is invariant and equal to the rank of $\Sigma(0)$ for all $t$ where the solution exists.
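The rank invariance can be checked numerically on a small example. The sketch below integrates the covariance equation in the form $\dot{\Sigma}=M\Sigma+\Sigma M^\top$ with $M=A+V\Sigma B$ (the form used in the proof above), starting from a rank-one $\Sigma_0$. The constant matrices, the horizon, and the fixed-step Runge-Kutta integrator are assumptions chosen for illustration; the lemma itself also covers time-varying parameters.

```python
import numpy as np

# Assumed constant parameters, chosen only for illustration.
A = np.array([[0.2, -0.5, 0.0],
              [0.5, 0.1, 0.3],
              [0.0, -0.3, -0.4]])
V = -0.5 * np.eye(3)
B = np.eye(3)
u = np.array([1.0, 2.0, -1.0])
Sigma0 = np.outer(u, u)  # rank one, positive semidefinite

def sigma_rhs(S):
    """Covariance dynamics: dS/dt = M S + S M^T with M = A + V S B."""
    M = A + V @ S @ B
    return M @ S + S @ M.T

def rk4(f, y, T, steps):
    """Classical fixed-step Runge-Kutta 4 integrator on [0, T]."""
    h = T / steps
    for _ in range(steps):
        k1 = f(y)
        k2 = f(y + 0.5 * h * k1)
        k3 = f(y + 0.5 * h * k2)
        k4 = f(y + h * k3)
        y = y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

Sigma_T = rk4(sigma_rhs, Sigma0, T=1.0, steps=2000)

# Numerical rank with an absolute singular-value threshold.
rank_0 = np.linalg.matrix_rank(Sigma0, tol=1e-8)
rank_T = np.linalg.matrix_rank(Sigma_T, tol=1e-8)
```

The discretization does not preserve the rank exactly, so the comparison uses a singular-value threshold rather than an exact test.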

###### Theorem 3\.3\.

Suppose $\Sigma_0\succeq 0$. For any $T>0$, $\hat{\mu}\in\mathbb{R}^d$ and $\hat{\Sigma}\succeq 0$ satisfying $\operatorname{rank}(\hat{\Sigma})=\operatorname{rank}(\Sigma_0)$, there exist continuous $A,B,V:[0,T]\rightarrow\mathbb{R}^{d\times d}$ and $b:[0,T]\rightarrow\mathbb{R}^d$ such that $\mu(t)$, $\Sigma(t)$ solving ([7](https://arxiv.org/html/2606.07600#S2.E7)) satisfy $\mu(T)=\hat{\mu}$, $\Sigma(T)=\hat{\Sigma}$.

###### Proof 3\.4\.

Let $A(t)=-V(t)-V(t)\Sigma(t)B(t)$, and let $B(t)=\Sigma(t)^\dagger$ be the Moore–Penrose pseudoinverse of $\Sigma(t)$, acting as smooth feedback controls. Then the dynamics of $\mu$ and $\Sigma$ become

$$\begin{cases}\dot{\mu}(t)=b(t),\\ \dot{\Sigma}(t)=-V(t)\Sigma(t)-\Sigma(t)V(t)^\top,\end{cases}\quad\begin{aligned}\mu(0)&=\mu_0,\\ \Sigma(0)&=\Sigma_0.\end{aligned}\tag{10}$$

We shall first prove that there exists $V:[0,T]\rightarrow\mathbb{R}^{d\times d}$ such that $\Sigma(t)$ satisfies $\Sigma(0)=\Sigma_0$ and $\Sigma(T)=\hat{\Sigma}$. By the same argument as in [Lemma 3.1](https://arxiv.org/html/2606.07600#S3.Thmtheorem1), the solution of ([7](https://arxiv.org/html/2606.07600#S2.E7)) becomes $\Sigma(t)=\Psi_V(t)\Sigma_0\Psi_V(t)^\top$, where the transition matrix $\Psi_V(t)$ is the unique solution of

$$\dot{\Psi}_V(t)=-V(t)\Psi_V(t),\quad\Psi_V(0)=I_d.\tag{11}$$

Thus, it suffices to show there exists $V(t)$ such that $\hat{\Sigma}=\Sigma(T)=\Psi_V(T)\Sigma_0\Psi_V(T)^\top$.

We start by constructing $\Psi_V(T)$. Since $\Sigma_0\succeq 0$ and $\hat{\Sigma}\succeq 0$, we can write their spectral decompositions as $\Sigma_0=U_0\Lambda_0U_0^\top$, $\hat{\Sigma}=U_1\Lambda_1U_1^\top$, where $U_0,U_1\in\mathrm{SO}(d)$. Without loss of generality, assume that $\operatorname{rank}(\Sigma_0)=\operatorname{rank}(\hat{\Sigma})=r$, and that the eigenvalues are arranged so that the positive entries occupy the top-left $r\times r$ block of the diagonal matrices $\Lambda_0$, $\Lambda_1$, while the remaining blocks are zero.

Let $\lambda_{0,i}$ and $\lambda_{1,i}$ be the $i$-th diagonal elements of $\Lambda_0$ and $\Lambda_1$, $i=1,\ldots,d$. Construct a nonsingular diagonal matrix $D$ such that $D_{ii}=\sqrt{\tfrac{\lambda_{1,i}}{\lambda_{0,i}}}$ for $1\leq i\leq r$, and $D_{ii}=1$ for $r<i\leq d$.

Let $M=U_1DU_0^\top$. Then $\det(M)>0$. Since $U_0$, $U_1$ are orthogonal matrices, it is easy to verify that $M\Sigma_0M^\top=U_1\Lambda_1U_1^\top=\hat{\Sigma}$, and hence $M$ is the matrix $\Psi_V(T)$ required. Next we construct the time-varying matrix $V$.

Since $\det(M)>0$ and the space of $d$-dimensional nonsingular real matrices with strictly positive determinants is path-connected, there exists a differentiable path $\Phi:[0,T]\rightarrow\mathrm{GL}^+(d,\mathbb{R})$ such that $\Phi(0)=I_d$, $\Phi(T)=M$.

Let $V(t):=-\dot{\Phi}(t)\Phi(t)^{-1}$. Then $V(t)$ is the time-varying matrix allowing the matrix $\Psi_V(t)$ in ([11](https://arxiv.org/html/2606.07600#S3.E11)) to satisfy $\hat{\Sigma}=\Sigma(T)=\Psi_V(T)\Sigma_0\Psi_V(T)^\top$.

Meanwhile, by choosing $b(t)=\frac{1}{T}(\hat{\mu}-\mu_0)$, one obtains the finite-time reachability of $\mu(t)$ in ([10](https://arxiv.org/html/2606.07600#S3.E10)). This proves the result.
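The endpoint matrix $M=U_1DU_0^\top$ from the proof can be assembled numerically. The sketch below is a minimal illustration: the helper names `sorted_eig_so` and `endpoint_map`, the rank tolerance, and the rank-two test matrices are assumptions, and only the endpoint $M$ is built, not the path $\Phi(t)$ or the control $V(t)$.

```python
import numpy as np

def sorted_eig_so(S):
    """Return (lam, U) with S = U diag(lam) U^T, lam descending, det(U) = +1."""
    lam, U = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    if np.linalg.det(U) < 0:
        U[:, -1] *= -1.0  # flipping one eigenvector keeps the decomposition valid
    return np.clip(lam, 0.0, None), U

def endpoint_map(Sigma0, Sigma_hat, tol=1e-10):
    """Build M = U1 D U0^T with M Sigma0 M^T = Sigma_hat (equal ranks required)."""
    lam0, U0 = sorted_eig_so(Sigma0)
    lam1, U1 = sorted_eig_so(Sigma_hat)
    r = int(np.sum(lam0 > tol))
    if r != int(np.sum(lam1 > tol)):
        raise ValueError("Sigma0 and Sigma_hat must have the same rank")
    D = np.ones_like(lam0)
    D[:r] = np.sqrt(lam1[:r] / lam0[:r])
    return U1 @ np.diag(D) @ U0.T

# Assumed rank-two test data in dimension three.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
Y = np.array([[2.0, 1.0], [0.0, 1.0], [1.0, -1.0]])
Sigma0, Sigma_hat = X @ X.T, Y @ Y.T

M = endpoint_map(Sigma0, Sigma_hat)
residual = np.max(np.abs(M @ Sigma0 @ M.T - Sigma_hat))
det_M = np.linalg.det(M)
```

Normalizing both eigenvector matrices to determinant $+1$ is what makes $\det(M)>0$, which the proof needs in order to connect $I_d$ to $M$ inside $\mathrm{GL}^+(d,\mathbb{R})$.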

## 4 Asymptotic dynamics

Table 1: Asymptotics of ([7](https://arxiv.org/html/2606.07600#S2.E7)) with $\Sigma(0)=\Sigma_0\succ 0$. The first row follows from [Theorems 4.1](https://arxiv.org/html/2606.07600#S4.Thmtheorem1) and [4.11](https://arxiv.org/html/2606.07600#S4.Thmtheorem11), the second from [Theorem 4.7](https://arxiv.org/html/2606.07600#S4.Thmtheorem7), and the third from [Theorem 4.8](https://arxiv.org/html/2606.07600#S4.Thmtheorem8).

| $V,B$ regime | Extra condition | Covariance $\Sigma(t)$ | Mean $\mu(t)$ |
| --- | --- | --- | --- |
| $V=\eta I_d$, $\eta(B+B^\top)\preceq 0$ | $\eta<0$ | $\Sigma(t)\to U\left(\begin{smallmatrix}\Sigma^a_\infty&0\\0&0\end{smallmatrix}\right)U^\top$ | $\mu(t)\to\mu_\infty\in\mathbb{R}^d$ |
| | $\eta>0$ | $\Sigma(t)\to U\left(\begin{smallmatrix}\Sigma^a_\infty&0\\0&0\end{smallmatrix}\right)U^\top$ | $\lVert\mu(t)\rVert\to\infty$ |
| $V\prec 0$, $B\succ 0$ | $\Re(\operatorname{spec}(A))>0$ | $\Sigma(t)\to\Sigma_\infty\succ 0$ | Depends on $\operatorname{spec}(A+V+V\Sigma_\infty B)$ |
| | $\Re(\operatorname{spec}(A))\leq 0$ | $\Sigma(t)\to 0$ | Depends on $\operatorname{spec}(A+V)$ |
| $V\succ 0$, $B\succ 0$ | $\lambda_{\min}(A+A^\top)\geq 0$ | $\lVert\Sigma(t)\rVert\to\infty$ | $\lVert\mu(t)\rVert\to\infty$ |
| | $\lambda_{\min}(A+A^\top)<0$ | Depends on $\Sigma_0$ | Depends on $\Sigma_0$ |

To characterize the asymptotics of ([7](https://arxiv.org/html/2606.07600#S2.E7)), we couple the system with suitable initial conditions $\mu(0)=\mu_0\in\mathbb{R}^d$ and $\Sigma(0)=\Sigma_0\succeq 0$, and restrict the study to constant-in-time parameters $A,V,B\in\mathbb{R}^{d\times d}$ and bias $b\in\mathbb{R}^d$.

The asymptotic analysis of ([7](https://arxiv.org/html/2606.07600#S2.E7)) is difficult due to the non-commutativity of $\Sigma$ with the parameter matrices. As the product of positive definite matrices is not necessarily positive definite, it is not easy to derive the existence of the equilibrium of the nonlinear matrix differential equation based on the positive-definiteness of $A$, $B$ and $V$. On the other hand, the asymptotics of the mean depend on the spectral properties of the limiting covariance matrix, and can be interpreted as the problem of characterizing stabilizing feedback gains for a linear system with static feedback.

Given these difficulties, the results in this section are split in two. In the first part, we consider the easier case of $V=\eta I_d$, $\eta\in\mathbb{R}$, in which the analysis simplifies thanks to the connections to Riccati theory, and weaker assumptions are needed on the other matrix parameters. In the second part, we tackle the more general yet involved case of definite $V\in\mathbb{R}^{d\times d}$. We provide a summary of results in [Table 1](https://arxiv.org/html/2606.07600#S4.T1).

### 4.1 Asymptotic dynamics for $V=\eta I_d$

Before introducing our results regarding the asymptotic dynamics of ([7](https://arxiv.org/html/2606.07600#S2.E7)) for $V=\eta I_d$, we make the connection to Riccati theory rigorous. This observation, although restricted to a certain parameter range, will prove to be useful in illustrating the dynamics of the Gaussian Transformer.

Throughout this subsection, we will fix $V=\eta I_d$, $\eta\in\mathbb{R}$, and assume

$$\eta(B+B^\top)\preceq 0.\tag{12}$$

If ([12](https://arxiv.org/html/2606.07600#S4.E12)) holds, we can write $-H^\top H\coloneqq\eta(B+B^\top)$ for some $H\in\mathbb{R}^{m\times d}$ and $m\leq d$. This yields

$$\dot{\Sigma}=A\Sigma+\Sigma A^\top-\Sigma H^\top H\Sigma,\quad\Sigma(0)=\Sigma_0,\tag{13}$$

which is a particular instance of the differential Riccati equation. For any initial condition $\Sigma_0\succeq 0$, standard Riccati theory [AbouKandil2003Riccati, Theorem 4.1.6] proves the existence of a unique solution of ([13](https://arxiv.org/html/2606.07600#S4.E13)) for all times only for the case when $\eta(B+B^\top)\preceq 0$. Otherwise, finite-time blow-up of the solution may occur. In turn, the mean dynamics becomes

$$\dot{\mu}=(A+\eta I_d)\mu+\Sigma(\eta B)\mu+b=(A-\Sigma H^\top H)\mu+\eta(I_d-\Sigma B^\top)\mu+b.$$

The first term has the same structure as the Kalman–Bucy error dynamics with covariance ([13](https://arxiv.org/html/2606.07600#S4.E13)), for which exponential decay is known under suitable assumptions [ruymgaart2013mathematics]. However, the additional term $\eta(I_d-\Sigma B^\top)\mu$ has no direct analogue in the standard Kalman–Bucy setting and prevents us from inferring the asymptotic behavior of $\mu$ from Riccati theory alone.

Thus, in the case $V=\eta I_d$, the covariance dynamics admit a precise Riccati interpretation under ([12](https://arxiv.org/html/2606.07600#S4.E12)). This analogy is useful but incomplete: it does not by itself characterize the mean dynamics, and it does not extend directly to general value matrices. A separate analysis is therefore required.

In what follows, we will denote the spectrum of a matrix $A\in\mathbb{R}^{d\times d}$ by $\operatorname{spec}(A)\subset\mathbb{C}$, and by $\Re(\operatorname{spec}(A))$ the real part of its elements. The next theorem characterizes the asymptotic behavior of $\Sigma(t)$ in terms of the real part of the spectrum of $A$.

###### Theorem 4\.1\.

Let $\Sigma(t)$ be the solution of ([13](https://arxiv.org/html/2606.07600#S4.E13)) with $\Sigma_0\succ 0$. Suppose that $A$ has the real Schur decomposition

$$A=U\tilde{A}U^\top=U\begin{pmatrix}A_a&A_{12}\\0&A_n\end{pmatrix}U^\top,\tag{14}$$

where $U$ is an orthogonal matrix, $\Re(\operatorname{spec}(A_a))>0$ and $\Re(\operatorname{spec}(A_n))\leq 0$. Define the negative semidefinite matrix $\tilde{B}\coloneqq\left(\begin{smallmatrix}B_{11}&B_{12}\\B_{21}&B_{22}\end{smallmatrix}\right)=U^\top\eta(B+B^\top)U$, in which $B_{11}$ has the same dimension as $A_a$. Then, as $t\rightarrow\infty$, we have

$$\Sigma(t)\rightarrow U\begin{pmatrix}\Sigma^a_\infty&0\\0&0\end{pmatrix}U^\top,$$

where $\Sigma^a_\infty$ is the unique positive definite solution of the algebraic Riccati equation

$$0=A_a\Sigma^a_\infty+\Sigma^a_\infty A_a^\top+\Sigma^a_\infty B_{11}\Sigma^a_\infty.\tag{15}$$

###### Proof 4\.2\.

As $\Sigma_0$ is invertible, we consider the decomposition $\Sigma^{-1}(t)=U\tilde{P}(t)U^\top$ and study the dynamics of $\tilde{P}(t)$ for $t\geq 0$. Using that $U\frac{\mathrm{d}\tilde{P}}{\mathrm{d}t}U^\top=\frac{\mathrm{d}\Sigma^{-1}}{\mathrm{d}t}=-\Sigma^{-1}\dot{\Sigma}\Sigma^{-1}$, from ([13](https://arxiv.org/html/2606.07600#S4.E13)) we have

$$\dot{\tilde{P}}=-\tilde{A}^\top\tilde{P}-\tilde{P}\tilde{A}-\tilde{B}.$$

Writing $\tilde{P}=\begin{pmatrix}P_{11}&P_{12}\\P_{21}&P_{22}\end{pmatrix}$ corresponding to the block form of $A$ and $\tilde{B}$, we have

$$\begin{aligned}\dot{P}_{11}&=-A_a^\top P_{11}-P_{11}A_a-B_{11},\\ \dot{P}_{12}&=-A_a^\top P_{12}-P_{11}A_{12}-P_{12}A_n-B_{12},\\ \dot{P}_{21}&=-A_{12}^\top P_{11}-A_n^\top P_{21}-P_{21}A_a-B_{21},\\ \dot{P}_{22}&=-A_n^\top P_{22}-P_{22}A_n-(A_{12}^\top P_{12}+P_{21}A_{12})-B_{22}.\end{aligned}$$

Recalling the classical results of Lyapunov differential equations (for example, see [antsaklis2006linear, Section 6.7, Theorem 7.5]), one sees that $P_{11}$ will converge to the unique positive definite solution of $0=-A_a^\top P_{11}-P_{11}A_a-B_{11}$, denoted by $P^\infty_{11}$, due to the fact that $-A_a$ is stabilizing.

As for $P_{12}(t)$, since $P_{11}(t)$ remains bounded for all $t\geq 0$, the limit of $P_{12}(t)$ as $t\rightarrow\infty$ depends on the eigenvalues of $A_a$ and $A_n$; more specifically, its growth rate is

$$\lambda_{12}=\max\Re(\operatorname{spec}(-A_a^\top))+\max\Re(\operatorname{spec}(-A_n)),\tag{16}$$

while for $P_{22}(t)$ we have the growth rate of its homogeneous part as

$$\lambda_{22}=2\max\Re(\operatorname{spec}(-A_n)).\tag{17}$$

By definition, $\Re(\operatorname{spec}(-A_n))\geq 0$ and $\Re(\operatorname{spec}(-A_a))<0$, hence $\lambda_{22}>\lambda_{12}$ and $\lambda_{22}\geq 0$. As the growth rate of $P_{22}$ is determined by $\lambda_{22}-\lambda_{12}$ and $B_{22}\succ 0$, we have $P_{22}(t)\to\infty$ as $t\rightarrow\infty$. Next, we study the limit of $\tilde{P}$ to derive the limit of $\Sigma$. Let

$$\Sigma(t)=U\begin{pmatrix}\Sigma_{11}(t)&\Sigma_{12}(t)\\ \Sigma_{21}(t)&\Sigma_{22}(t)\end{pmatrix}U^\top.$$

By definition of the inverse matrix, we have $P_{21}(t)\Sigma_{12}(t)+P_{22}(t)\Sigma_{22}(t)=I$. As $\lim\limits_{t\rightarrow\infty}P_{22}(t)=\infty$, it follows that $\lim\limits_{t\rightarrow\infty}\Sigma_{22}(t)=0$; on the other hand, we have

$$P_{11}(t)\Sigma_{12}(t)+P_{12}(t)\Sigma_{22}(t)=0,$$

which implies that $\lim\limits_{t\rightarrow\infty}\Sigma_{12}(t)=0$ (because $P_{11}(t)\succ 0$ for all $t\geq 0$ and $P^\infty_{11}\succ 0$). Finally, we have

$$\Sigma_{11}=(P_{11}-P_{12}P_{22}^{-1}P_{21})^{-1}.$$

Then, by the estimates of $\lambda_{12}$ and $\lambda_{22}$ in ([16](https://arxiv.org/html/2606.07600#S4.E16)) and ([17](https://arxiv.org/html/2606.07600#S4.E17)), the growth rate of $P_{12}P_{22}^{-1}P_{21}$ is controlled by $-\min(\Re(\operatorname{spec}(A_a)))<0$. Hence we have $\lim\limits_{t\rightarrow\infty}\Sigma_{11}(t)=(P^\infty_{11})^{-1}\succ 0$. Defining $\Sigma_\infty^a=(P_{11}^\infty)^{-1}$, one sees that $\Sigma_\infty^a$ solves the equation ([15](https://arxiv.org/html/2606.07600#S4.E15)). By the symmetry of $\Sigma$, we also have $\lim\limits_{t\rightarrow\infty}\Sigma_{21}(t)=0$. The conclusion follows.
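The limit described in Theorem 4.1 can be observed on a two-dimensional toy problem. The sketch below assumes $U=I$, $A=\operatorname{diag}(1,-1)$ (so $A_a=1$, $A_n=-1$), $\eta=-1$ and $B=\tfrac12 I$, which gives $\eta(B+B^\top)=-I$ and $B_{11}=-1$; these values, the horizon, and the fixed-step Runge-Kutta integrator are illustrative assumptions. For this data the scalar algebraic Riccati equation (15) reads $0=2s-s^2$, so $\Sigma^a_\infty=2$.

```python
import numpy as np

# Assumed toy data (not from the paper): U = I, A_a = 1, A_n = -1.
A = np.diag([1.0, -1.0])
eta = -1.0
B = 0.5 * np.eye(2)
Q = eta * (B + B.T)  # equals -H^T H; here Q = -I
Sigma0 = np.array([[0.4, 0.2], [0.2, 0.2]])

def riccati_rhs(S):
    """Right-hand side of (13), written as A S + S A^T + S Q S."""
    return A @ S + S @ A.T + S @ Q @ S

def rk4(f, y, T, steps):
    """Classical fixed-step Runge-Kutta 4 integrator on [0, T]."""
    h = T / steps
    for _ in range(steps):
        k1 = f(y)
        k2 = f(y + 0.5 * h * k1)
        k3 = f(y + 0.5 * h * k2)
        k4 = f(y + h * k3)
        y = y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

Sigma_T = rk4(riccati_rhs, Sigma0, T=15.0, steps=1500)

# Predicted limit: Sigma^a_inf = 2 in the unstable block, zero elsewhere.
Sigma_pred = np.diag([2.0, 0.0])
err = float(np.max(np.abs(Sigma_T - Sigma_pred)))
```

The block associated with $A_n$ collapses to zero while the block associated with $A_a$ settles at the Riccati solution, matching the block structure in the statement.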

### 4.2 Asymptotic dynamics for definite $V$

To extend our results for $V$ beyond scalar multiples of the identity, we study the case when $B$ and $V$ are negative or positive definite, and derive asymptotic results based on their sign.

Our first result concerns the spectrum of a matrix arising in the dynamics of $\mu$ in ([7](https://arxiv.org/html/2606.07600#S2.E7)) at the equilibrium of $\Sigma$, which applies to general time-invariant $A$, $B$ and $V$. This result will be used later for the asymptotic analysis of the mean in [Section 4.3](https://arxiv.org/html/2606.07600#S4.SS3).

###### Lemma 4\.4\.

Let $A,B,V\in\mathbb{R}^{d\times d}$ be arbitrary real matrices. Assume $\Sigma_\infty\succ 0$ satisfies the algebraic Bernoulli equation

$$A\Sigma_\infty+\Sigma_\infty A^\top+V\Sigma_\infty B\Sigma_\infty+\Sigma_\infty B^\top\Sigma_\infty V^\top=0.\tag{18}$$

Then all eigenvalues of $M_\infty\coloneqq A+V\Sigma_\infty B$ are purely imaginary, i.e., $\Re\bigl(\operatorname{spec}(M_\infty)\bigr)=0$.

###### Proof 4\.5\.

By definition, ([18](https://arxiv.org/html/2606.07600#S4.E18)) implies

$$M_\infty\Sigma_\infty+\Sigma_\infty M_\infty^\top=0.\tag{19}$$

Let $\lambda\in\mathbb{C}$ be an eigenvalue of $M_\infty$ with corresponding left eigenvector $v\in\mathbb{C}^{1\times d}\setminus\{0\}$. Multiplying ([19](https://arxiv.org/html/2606.07600#S4.E19)) by $v$ and $v^*$ yields $v(M_\infty\Sigma_\infty+\Sigma_\infty M_\infty^\top)v^*=0$.

Next, we use $vM_\infty=\lambda v$ and $M_\infty^\top v^*=\overline{\lambda}v^*$ to compute $(\lambda+\overline{\lambda})\,v\Sigma_\infty v^*=0$.

Because $\Sigma_\infty\succ 0$, we have $v\Sigma_\infty v^*>0$, and hence $\lambda+\overline{\lambda}=0$, which implies $\Re(\lambda)=0$. Since $\lambda$ is an arbitrary eigenvalue of $M_\infty$, all eigenvalues of $M_\infty$ lie on the imaginary axis.

Turning to the covariance dynamics, the following result characterizes its asymptotic behavior under opposite definiteness conditions on $V$ and $B$.
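A quick numerical sanity check of Lemma 4.4 is possible by manufacturing a solution of (18). The construction below is an assumption made purely for testing: pick any $\Sigma_\infty\succ 0$, any $V$, $B$, and a skew-symmetric $S$, then set $A=S\Sigma_\infty^{-1}-V\Sigma_\infty B$, so that $M_\infty=S\Sigma_\infty^{-1}$ and $M_\infty\Sigma_\infty+\Sigma_\infty M_\infty^\top=S+S^\top=0$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Assumed random test data; only the structure matters.
G = rng.standard_normal((d, d))
Sigma_inf = G @ G.T + np.eye(d)        # symmetric positive definite
V = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))
K = rng.standard_normal((d, d))
S = K - K.T                            # skew-symmetric

# Choose A so that (18) holds by construction.
A = S @ np.linalg.inv(Sigma_inf) - V @ Sigma_inf @ B

# Residual of the algebraic Bernoulli equation (18).
residual = (A @ Sigma_inf + Sigma_inf @ A.T
            + V @ Sigma_inf @ B @ Sigma_inf
            + Sigma_inf @ B.T @ Sigma_inf @ V.T)

M_inf = A + V @ Sigma_inf @ B
eigs = np.linalg.eigvals(M_inf)
max_real_part = float(np.max(np.abs(eigs.real)))
max_residual = float(np.max(np.abs(residual)))
```

Up to floating-point error, the real parts of the eigenvalues of `M_inf` vanish, as the lemma predicts for any positive definite solution of (18).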

###### Theorem 4\.7\.

Let $\Sigma(t)$ be the solution of ([7](https://arxiv.org/html/2606.07600#S2.E7)) with $\Sigma(0)=\Sigma_0\succ 0$. Let $V\prec 0$ and $B\succ 0$. Then $\Sigma(t)$ is bounded for all $t\in[0,+\infty)$. Moreover:

1. If $\Re(\operatorname{spec}(A))>0$, then there exists an equilibrium $\Sigma_\infty\succ 0$ solving the algebraic Bernoulli equation ([18](https://arxiv.org/html/2606.07600#S4.E18)). Further, if $\Sigma_\infty^{-1}V+V\Sigma_\infty^{-1}\prec 0$, then $\Sigma_\infty$ is a locally stable equilibrium of $\Sigma(t)$.
2. If $\Re(\operatorname{spec}(A))\leq 0$ and $\lambda_{\max}(A+A^\top)\leq 0$, then $\lim\limits_{t\to\infty}\Sigma(t)=0$ for all $\Sigma_0$.
3. If $\Re(\operatorname{spec}(A))<0$ and $\lambda_{\max}(A+A^\top)>0$, then $\Sigma=0$ is a locally stable equilibrium of the $\Sigma$ equation in ([7](https://arxiv.org/html/2606.07600#S2.E7)).

The proof is technical and thus relegated to [Section A.3](https://arxiv.org/html/2606.07600#A1.SS3). Finally, when the signs of definiteness of $V$ and $B$ coincide, we have the following result showing the finite-time blow-up of the evolution of $\Sigma$.

###### Theorem 4\.8\.

Let $\Sigma(t)$ be the solution of ([7](https://arxiv.org/html/2606.07600#S2.E7)) with $\Sigma(0)=\Sigma_0\succ 0$, where $V\succ 0$ and $B\succ 0$. Then

1. If $\lambda_{\min}(A+A^\top)\geq 0$, then $\Sigma(t)$ blows up in finite time $T_f$, and for $\lambda_{\min}(A+A^\top)>0$ we have

   $$T_f\leq\frac{1}{\lambda_{\min}(A+A^\top)}\log\left(1+\frac{d\,\lambda_{\min}(A+A^\top)}{2\lambda_{\min}(V)\lambda_{\min}(B)\operatorname{tr}(\Sigma_0)}\right).\tag{20}$$
2. If $\lambda_{\min}(A+A^\top)<0$, there exists $C>0$ such that when $\|\Sigma_0\|\geq C$, $\Sigma(t)$ blows up in finite time.

###### Proof 4\.9\.

First, consider the case $\lambda_{\min}(A+A^\top)\geq 0$. Differentiating $\operatorname{tr}(\Sigma)$ yields

$$\begin{aligned}\operatorname{tr}(\dot{\Sigma})&=2\operatorname{tr}(A\Sigma)+2\operatorname{tr}(\Sigma V\Sigma B)\\ &\geq\lambda_{\min}(A+A^\top)\operatorname{tr}(\Sigma)+2\lambda_{\min}(V)\lambda_{\min}(B)\operatorname{tr}(\Sigma^2)\\ &\geq\lambda_{\min}(A+A^\top)\operatorname{tr}(\Sigma)+\frac{2\lambda_{\min}(V)\lambda_{\min}(B)}{d}(\operatorname{tr}(\Sigma))^2,\end{aligned}\tag{21}$$

where the inequalities are due to the fact that $V$, $B$ and $\Sigma$ are all positive definite, and that $\operatorname{tr}(\Sigma^2)\geq\frac{1}{d}(\operatorname{tr}(\Sigma))^2$. Therefore, as $\operatorname{tr}(\Sigma)>0$ and $\lambda_{\min}(A+A^\top)\geq 0$, it suffices to consider the following differential equation

$$\dot{y}=\lambda_{\min}(A+A^\top)y+\frac{2\lambda_{\min}(V)\lambda_{\min}(B)}{d}y^2,\quad y(0)=\operatorname{tr}(\Sigma_0),\tag{22}$$

and when $\lambda_{\min}(A+A^\top)>0$ solving it yields the explicit solution

$$y(t)=\frac{\alpha y_0e^{\alpha t}}{\alpha-\beta y_0(e^{\alpha t}-1)},$$

where $\alpha=\lambda_{\min}(A+A^\top)$, $\beta=\frac{2\lambda_{\min}(V)\lambda_{\min}(B)}{d}$ and $y_0=\operatorname{tr}(\Sigma_0)$. Since $\operatorname{tr}(\Sigma)$ dominates $y$ by comparison, the blow-up time of $\operatorname{tr}(\Sigma)$ is no later than that of $y$, namely, $T_f\leq\frac{1}{\alpha}\log\left(1+\frac{\alpha}{\beta y_0}\right)$, which is the estimate ([20](https://arxiv.org/html/2606.07600#S4.E20)). On the other hand, if $\alpha=0$, then solving ([22](https://arxiv.org/html/2606.07600#S4.E22)) yields $y(t)=\frac{y_0}{1-y_0\beta t}$, confirming the finite-time blow-up of $\Sigma$.

Next, consider the case when $\lambda_{\min}(A+A^\top)<0$. From ([21](https://arxiv.org/html/2606.07600#S4.E21)) and ([22](https://arxiv.org/html/2606.07600#S4.E22)) one can see that when $\operatorname{tr}(\Sigma)>-\frac{\lambda_{\min}(A+A^\top)d}{2\lambda_{\min}(V)\lambda_{\min}(B)}$, $\operatorname{tr}(\Sigma)$ becomes monotonic with an increasing speed bounded from below by a positive constant, and hence blows up in finite time by the aforementioned analysis of the case $\lambda_{\min}(A+A^\top)>0$.
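The bound (20) is simple to evaluate. The sketch below is a minimal helper under the assumption that $V$ and $B$ are symmetric positive definite, so that their smallest eigenvalues can be read off with a symmetric eigensolver; the function name `blowup_time_bound` and the scalar test case are hypothetical. In dimension one the inequalities in (21) are equalities, so the bound coincides with the blow-up time of the comparison equation (22).

```python
import math
import numpy as np

def blowup_time_bound(A, V, B, Sigma0):
    """Upper bound on the blow-up time T_f from (20).

    Assumes V and B are symmetric positive definite (hypothetical helper).
    Uses alpha = lambda_min(A + A^T), beta = 2 lambda_min(V) lambda_min(B) / d.
    """
    d = A.shape[0]
    alpha = float(np.linalg.eigvalsh(A + A.T).min())
    beta = 2.0 * float(np.linalg.eigvalsh(V).min()) * float(np.linalg.eigvalsh(B).min()) / d
    y0 = float(np.trace(Sigma0))
    if alpha > 0:
        return math.log1p(alpha / (beta * y0)) / alpha
    if alpha == 0:
        return 1.0 / (beta * y0)  # blow-up time of y(t) = y0 / (1 - y0 beta t)
    raise ValueError("bound (20) requires lambda_min(A + A^T) >= 0")

# Assumed scalar test case: alpha = 1, beta = 2, y0 = 1.
A = np.array([[0.5]])
V = np.array([[1.0]])
B = np.array([[1.0]])
Sigma0 = np.array([[1.0]])

T_bound = blowup_time_bound(A, V, B, Sigma0)

# Denominator of the explicit solution y(t); it vanishes at the blow-up time.
alpha, beta, y0 = 1.0, 2.0, 1.0
denominator = alpha - beta * y0 * (math.exp(alpha * T_bound) - 1.0)
```

For this scalar data the bound equals $\log(3/2)$, and the denominator of the explicit solution vanishes there up to rounding.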

### 4.3 Asymptotic mean matching

We now study the ability of the Gaussian Transformer ([7](https://arxiv.org/html/2606.07600#S2.E7)) to asymptotically match a prescribed mean. For $V=\eta I_d$, using the tools developed in [Section 4.1](https://arxiv.org/html/2606.07600#S4.SS1) we show that, once the covariance converges, the mean can be steered to any desired target by suitable time-invariant parameters.

###### Theorem 4\.11\.

Consider ([7](https://arxiv.org/html/2606.07600#S2.E7)) with $A$ having the decomposition ([14](https://arxiv.org/html/2606.07600#S4.E14)), and suppose $V=\eta I_d$ and $B+B^\top\succ 0$.

1. If $\eta<0$, then $\lim\limits_{t\rightarrow\infty}\mu(t)=-(A+\eta\Sigma_\infty B+\eta I_d)^{-1}b$, where $\Sigma_\infty$ solves ([18](https://arxiv.org/html/2606.07600#S4.E18)).
2. If $\eta>0$ and $A_a$ exists, then $\|\mu(t)\|\to\infty$.

###### Proof 4\.12\.

DefineMη\(t\)≔A\+ηΣ\(t\)BM^\{\\eta\}\(t\)\\coloneqq A\+\\eta\\Sigma\(t\)B\. Then, the resulting dynamics ofμ\\muis

\(23\)μ˙=\(Mη\(t\)\+ηId\)μ\+b,\\displaystyle\\dot\{\\mu\}=\(M^\{\\eta\}\(t\)\+\\eta I\_\{d\}\)\\mu\+b,wherelimt→∞ℜ⁡\(spec⁡\(Mη\(t\)\)\)≤0\\lim\\limits\_\{t\\rightarrow\\infty\}\\Re\(\\operatorname\{spec\}\(M^\{\\eta\}\(t\)\)\)\\leq 0\. Indeed, by[Theorem4\.1](https://arxiv.org/html/2606.07600#S4.Thmtheorem1), we have

M∞η≔limt→∞Mη\(t\)=limt→∞A\+ηΣ\(t\)B=U\(Aa\+Σ∞aB11A120An\)U⊤,M\_\{\\infty\}^\{\\eta\}\\coloneqq\\lim\\limits\_\{t\\rightarrow\\infty\}M^\{\\eta\}\(t\)=\\lim\\limits\_\{t\\rightarrow\\infty\}A\+\\eta\\Sigma\(t\)B=U\\begin\{pmatrix\}A\_\{a\}\+\\Sigma\_\{\\infty\}^\{a\}B\_\{11\}&A\_\{12\}\\\\ 0&A\_\{n\}\\end\{pmatrix\}U^\{\\top\},whereUUis an orthogonal matrix,ℜ⁡\(spec⁡\(Aa\)\)\>0\\Re\(\\operatorname\{spec\}\(A\_\{a\}\)\)\>0andℜ⁡\(spec⁡\(An\)\)≤0\\Re\(\\operatorname\{spec\}\(A\_\{n\}\)\)\\leq 0\. On the other hand, by[Lemma4\.4](https://arxiv.org/html/2606.07600#S4.Thmtheorem4),ℜ⁡\(spec⁡\(Aa\+Σ∞aB11\)\)=0\\Re\(\\operatorname\{spec\}\(A\_\{a\}\+\\Sigma^\{a\}\_\{\\infty\}B\_\{11\}\)\)=0due to the fact thatΣ∞a\\Sigma\_\{\\infty\}^\{a\}solves \([15](https://arxiv.org/html/2606.07600#S4.E15)\) which is a special case of \([18](https://arxiv.org/html/2606.07600#S4.E18)\)\. Hence, we haveℜ⁡\(spec⁡\(M∞η\)\)≤0\\Re\(\\operatorname\{spec\}\(M\_\{\\infty\}^\{\\eta\}\)\)\\leq 0\.

Defineμ∞:=−\(M∞η\+ηId\)−1b\\mu\_\{\\infty\}:=\-\(M\_\{\\infty\}^\{\\eta\}\+\\eta I\_\{d\}\)^\{\-1\}b\. Calculating the error dynamics yields

\(24\)ddt\(μ\(t\)−μ∞\)=\(Mη\(t\)\+ηId\)\(μ\(t\)−μ∞\)\+\(Mη\(t\)−M∞η\)μ∞\.\\displaystyle\\frac\{\\mathrm\{d\}\}\{\\mathrm\{d\}t\}\(\\mu\(t\)\-\\mu\_\{\\infty\}\)=\(M^\{\\eta\}\(t\)\+\\eta I\_\{d\}\)\(\\mu\(t\)\-\\mu\_\{\\infty\}\)\+\(M^\{\\eta\}\(t\)\-M^\{\\eta\}\_\{\\infty\}\)\\mu\_\{\\infty\}\.Whenη<0\\eta<0, choose the Lyapunov function as𝒱\(x\):=x⊤Px\\mathcal\{V\}\(x\):=x^\{\\top\}PxwhereP≻0P\\succ 0solves the Lyapunov equation\(M∞η\+ηId\)⊤P\+P\(M∞η\+ηId\)=−Id\(M\_\{\\infty\}^\{\\eta\}\+\\eta I\_\{d\}\)^\{\\top\}P\+P\(M\_\{\\infty\}^\{\\eta\}\+\\eta I\_\{d\}\)=\-I\_\{d\}, then one can use the fact thatM∞η=limt→∞Mη\(t\)M\_\{\\infty\}^\{\\eta\}=\\lim\\limits\_\{t\\rightarrow\\infty\}M^\{\\eta\}\(t\)to show that the driftless dynamicsx˙=\(Mη\(t\)\+ηId\)x\\dot\{x\}=\(M^\{\\eta\}\(t\)\+\\eta I\_\{d\}\)xexponentially decay to zero\. Finally by applying Duhamel’s principle to \([24](https://arxiv.org/html/2606.07600#S4.E24)\) and using the fact thatlimt→∞Mη\(t\)−M∞η=0\\lim\\limits\_\{t\\rightarrow\\infty\}M^\{\\eta\}\(t\)\-M\_\{\\infty\}^\{\\eta\}=0, we havelimt→∞μ\(t\)=μ∞\\lim\\limits\_\{t\\rightarrow\\infty\}\\mu\(t\)=\\mu\_\{\\infty\}\.

On the other hand, if $\eta>0$ and there exists an eigenvalue of $A$ with positive real part, then $\eta I_{d}$ plus the purely oscillatory component $A_{a}+\Sigma_{\infty}^{a}B_{11}$ in $M_{\infty}^{\eta}$ contributes an unstable force to the dynamics ([23](https://arxiv.org/html/2606.07600#S4.E23)), driving $\mu(t)$ to infinity exponentially.

## 5 Numerical experiments

In this section, we examine two aspects of the Gaussian Transformer dynamics ([7](https://arxiv.org/html/2606.07600#S2.E7)). First, we validate the asymptotic mean and covariance behavior predicted in [Section 4](https://arxiv.org/html/2606.07600#S4) in a controlled two-dimensional setting. Second, we test whether an analogous Gaussian moment structure is visible in pretrained Transformer models when their inputs are sampled from a Gaussian distribution. Code to reproduce our results is available at [https://github.com/DCN-FAU-AvH/gaussianTransformers](https://github.com/DCN-FAU-AvH/gaussianTransformers).

### 5.1 Validation of asymptotic results

We begin with a two-dimensional example illustrating the asymptotic behavior of [Equation 7](https://arxiv.org/html/2606.07600#S2.E7) described in [Section 4](https://arxiv.org/html/2606.07600#S4). We fix $B\succ 0$ and $V=-I_{d}$, and consider three choices of $A$: one with $\Re(\operatorname{spec}(A))>0$, one with $\Re(\operatorname{spec}(A))\leq 0$, and one whose eigenvalues have real parts of mixed sign. We set $\mu_{0}=(-1,-1)^{\top}$ and $\Sigma_{0}=\left(\begin{smallmatrix}0.4&0.2\\ 0.2&0.2\end{smallmatrix}\right)$, and prescribe the limiting mean $\mu_{\infty}=(1,2)^{\top}$. In each case, the bias term $b$ is chosen according to the mean-matching condition of [Theorem 4.11](https://arxiv.org/html/2606.07600#S4.Thmtheorem11), so that $\mu_{\infty}$ is an equilibrium for the mean dynamics.
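The setup above can be sketched numerically as follows. This is a minimal sketch, not the authors' code: it assumes the moment system takes the form $\dot{\mu}=(A+V\Sigma B+V)\mu+b$, $\dot{\Sigma}=(A+V\Sigma B)\Sigma+\Sigma(A+V\Sigma B)^{\top}$ (consistent with the formulas used in Section 4 and Appendix A.3), and it uses the hypothetical choices $A=B=I_{2}$, a horizon $T=20$, and a fixed-step RK4 integrator. Only $V$, $\mu_{0}$, $\Sigma_{0}$ and $\mu_{\infty}$ come from the text.

```python
import numpy as np

def moment_rhs(mu, Sigma, A, B, V, b):
    """Right-hand side of the assumed Gaussian moment system."""
    M = A + V @ Sigma @ B
    return (M + V) @ mu + b, M @ Sigma + Sigma @ M.T

def rk4_step(mu, Sigma, h, *params):
    """One classical Runge-Kutta step for the pair (mu, Sigma)."""
    k1 = moment_rhs(mu, Sigma, *params)
    k2 = moment_rhs(mu + 0.5 * h * k1[0], Sigma + 0.5 * h * k1[1], *params)
    k3 = moment_rhs(mu + 0.5 * h * k2[0], Sigma + 0.5 * h * k2[1], *params)
    k4 = moment_rhs(mu + h * k3[0], Sigma + h * k3[1], *params)
    mu_new = mu + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    Sigma_new = Sigma + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return mu_new, Sigma_new

# Values from the text: V = -I, mu_0, Sigma_0, mu_inf.
V = -np.eye(2)
mu = np.array([-1.0, -1.0])
Sigma = np.array([[0.4, 0.2], [0.2, 0.2]])
mu_inf = np.array([1.0, 2.0])

# Assumed for illustration: A = B = I, so Re(spec(A)) > 0 and Sigma_inf = I.
A, B = np.eye(2), np.eye(2)
Sigma_inf = np.eye(2)
M_inf = A + V @ Sigma_inf @ B
b = -(M_inf + V) @ mu_inf          # makes mu_inf an equilibrium of the mean equation

h, T = 0.01, 20.0                  # assumed step size and horizon
for _ in range(int(round(T / h))):
    mu, Sigma = rk4_step(mu, Sigma, h, A, B, V, b)
```

With these choices the covariance equation reduces to $\dot{\Sigma}=2\Sigma-2\Sigma^{2}$, whose eigenvalues follow a logistic equation and tend to $1$, so the run should end near $\Sigma=I_{2}$ and $\mu=\mu_{\infty}$.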

![Refer to caption](https://arxiv.org/html/2606.07600v1/x1.png)

![Refer to caption](https://arxiv.org/html/2606.07600v1/x2.png) (a) $\Re(\operatorname{spec}(A))>0$
![Refer to caption](https://arxiv.org/html/2606.07600v1/x3.png) (b) $\Re(\operatorname{spec}(A))\leq 0$
![Refer to caption](https://arxiv.org/html/2606.07600v1/x4.png) (c) mixed sign

Figure 1: Asymptotic dynamics of [Equation 7](https://arxiv.org/html/2606.07600#S2.E7) with $B\succ 0$ and $V=-I_{d}$. The bias is chosen so that $\mu_{\infty}=(1,2)^{\top}$ is the limiting mean.

The resulting trajectories are shown in [Figure 1](https://arxiv.org/html/2606.07600#S5.F1). In all three regimes, the mean converges to the prescribed target by time $T=100$. The covariance behavior depends on the spectral regime of $A$, as predicted by [Theorem 4.1](https://arxiv.org/html/2606.07600#S4.Thmtheorem1). For $\Re(\operatorname{spec}(A))>0$, the covariance remains nondegenerate and converges toward a finite limiting shape. When $\Re(\operatorname{spec}(A))\leq 0$, the covariance collapses. For the mixed-sign case, the evolution is anisotropic: contraction occurs only along the stable directions.

### 5.2 Gaussian structure in pretrained Transformers

We next examine whether Gaussian moment structure persists in pretrained Transformers. For a model with embedding dimension $d$, we draw independent input tokens $x_{i}^{(0)}\sim\mathcal{N}(0,I_{d})$, propagate them through the layers, and denote by $\rho^{(\ell)}$ their empirical distribution at layer $\ell$. Further, let $\gamma^{(\ell)}=\mathcal{N}\left(\mu^{(\ell)},\Sigma^{(\ell)}\right)$ be the Gaussian distribution with the same empirical mean and covariance as $\rho^{(\ell)}$.

![Refer to caption](https://arxiv.org/html/2606.07600v1/x5.png) (a) ModernBERT
![Refer to caption](https://arxiv.org/html/2606.07600v1/x6.png) (b) Tiny-DeiT

Figure 2: Distances between the empirical distribution $\rho^{(\ell)}$ and its moment-matched Gaussian approximation $\gamma^{(\ell)}$. Curves show the mean across $20$ batches of $n=8192$ tokens, and shaded regions indicate the empirical $0.1$–$0.9$ quantile band.

![Refer to caption](https://arxiv.org/html/2606.07600v1/x7.png)

![Refer to caption](https://arxiv.org/html/2606.07600v1/x8.png)

Figure 3: Marginal distributions across layers of ModernBERT (top row) and Tiny-DeiT (bottom row). Each panel shows the coordinate whose variance is closest to the median at that layer, using $n=8\,192$ Gaussian input tokens.

We compare $\rho^{(\ell)}$ with $\gamma^{(\ell)}$ using the sliced Wasserstein distance (SW-2):

$$
\mathrm{SW}_{2}(\rho^{(\ell)},\gamma^{(\ell)}) = \left(\mathbb{E}_{\theta}\,W_{2}^{2}\!\left(\langle\theta,x^{(\ell)}\rangle,\langle\theta,z^{(\ell)}\rangle\right)\right)^{\frac{1}{2}},
$$

where $x^{(\ell)}\sim\rho^{(\ell)}$ and $z^{(\ell)}\sim\gamma^{(\ell)}$, and the maximum mean discrepancy (MMD):

$$
\mathrm{MMD}^{2}(\rho^{(\ell)},\gamma^{(\ell)}) = \mathbb{E}_{x,x'\sim\rho^{(\ell)}}[k(x,x')] + \mathbb{E}_{z,z'\sim\gamma^{(\ell)}}[k(z,z')] - 2\,\mathbb{E}_{x\sim\rho^{(\ell)},\,z\sim\gamma^{(\ell)}}[k(x,z)],
$$

where $k$ is the Gaussian kernel. For our experiments, we use ModernBERT [warner2025smarter], a pretrained language encoder with $L=22$ layers and $d=768$, and Tiny-DeiT [pmlr-v139-touvron21a], a compact vision Transformer with $L=12$ layers and $d=192$. The resulting distances across layers are shown in [Figure 2](https://arxiv.org/html/2606.07600#S5.F2). For both architectures, the distance to the moment-matched Gaussian remains comparatively small through the early and intermediate layers. The discrepancies increase in deeper layers, indicating that non-Gaussian behavior abruptly emerges at later stages of the network.
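The two discrepancies above can be estimated from samples along the following lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the number of random projections, the kernel bandwidth, and the use of the biased (V-statistic) MMD estimator are all illustrative choices, and the sliced estimator assumes both samples have the same size.

```python
import numpy as np

def sliced_w2(x, z, n_proj=256, rng=None):
    """Monte Carlo SW-2 between two equal-size samples (rows are tokens)."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta = rng.standard_normal((n_proj, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # random unit directions
    px = np.sort(x @ theta.T, axis=0)                       # sorted 1D projections
    pz = np.sort(z @ theta.T, axis=0)
    # For equal-size samples, 1D W_2^2 is the mean squared gap of sorted values.
    return np.sqrt(np.mean((px - pz) ** 2))

def mmd2_gaussian(x, z, bandwidth=1.0):
    """Biased estimate of MMD^2 with kernel exp(-|a-c|^2 / (2 bandwidth^2))."""
    def gram(a, c):
        sq = np.sum(a**2, 1)[:, None] + np.sum(c**2, 1)[None, :] - 2 * a @ c.T
        return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth**2))
    return gram(x, x).mean() + gram(z, z).mean() - 2 * gram(x, z).mean()

def moment_matched_gaussian(x, rng):
    """Draw as many samples as x from N(mean(x), cov(x))."""
    mu, Sigma = x.mean(axis=0), np.cov(x, rowvar=False)
    return rng.multivariate_normal(mu, Sigma, size=x.shape[0])

# Toy usage with synthetic "tokens" (assumed sizes, far smaller than in the paper).
rng = np.random.default_rng(0)
x = rng.standard_normal((512, 8))
z = moment_matched_gaussian(x, rng)
sw = sliced_w2(x, z, rng=rng)
mmd2 = mmd2_gaussian(x, z, bandwidth=np.sqrt(8))
```

In practice `x` would hold the activations of one layer; the bandwidth in particular should be tuned to the scale of the data, which the text does not specify.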

We further assess Gaussian preservation in pretrained models by visualizing one-dimensional marginals. At each selected layer, we choose the coordinate whose marginal variance is closest to the median. As shown in [Figure 3](https://arxiv.org/html/2606.07600#S5.F3), the two architectures behave qualitatively differently. In ModernBERT, the marginal broadens with depth and becomes progressively flatter. In contrast, Tiny-DeiT maintains a more Gaussian-like profile across layers, with only modest changes in spread.

Finally, we test whether the covariance regimes predicted by [Theorem 4.7](https://arxiv.org/html/2606.07600#S4.Thmtheorem7) appear in a realistic setting. Starting from Tiny-DeiT, we remove normalization layers and replace self-attention with residual single-head attention maps using prescribed matrices $V$ and $B$, while leaving the nonlinear feed-forward blocks unchanged. For Gaussian inputs, we track the empirical mean and covariance trace.

[Figure 4](https://arxiv.org/html/2606.07600#S5.F4) shows two distinct regimes: opposing signs of $V$ and $B$ keep the covariance trace bounded, whereas jointly positive signs cause rapid growth. Thus, the sign structure in [Theorem 4.7](https://arxiv.org/html/2606.07600#S4.Thmtheorem7) predicts qualitative covariance behavior even in a discrete residual architecture with nonlinear feed-forward blocks.
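A one-dimensional caricature makes this sign dichotomy explicit. The following worked computation is only a heuristic for the continuous-time model, not a statement about the modified Tiny-DeiT blocks. It assumes the covariance equation has the form $\dot{\Sigma}=(A+V\Sigma B)\Sigma+\Sigma(A+V\Sigma B)^{\top}$, and the scalars $a>0$, $v$, $q>0$ and $s(t)>0$ (standing in for $A$, $V$, $B$ and $\Sigma$) are notation introduced here for illustration.

$$
\dot{s}=2as+2vq\,s^{2},\qquad u:=\frac{1}{s}\ \Longrightarrow\ \dot{u}=-2au-2vq,\qquad u(t)=\frac{e^{-2at}}{s_{0}}-\frac{vq}{a}\bigl(1-e^{-2at}\bigr).
$$

If $v<0$, then $u(t)>0$ for all $t\geq 0$ and $s(t)\to\frac{a}{|v|q}$, so the variance stays bounded. If $v>0$, then $u$ reaches zero, and hence $s$ blows up, at the finite time

$$
T^{*}=\frac{1}{2a}\log\!\left(1+\frac{a}{vq\,s_{0}}\right).
$$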

![Refer to caption](https://arxiv.org/html/2606.07600v1/x9.png) (a) $V\prec 0$, $B\succ 0$
![Refer to caption](https://arxiv.org/html/2606.07600v1/x10.png) (b) $V\succ 0$, $B\succ 0$

Figure 4: Evolution of the empirical mean and covariance trace in modified Tiny-DeiT blocks. Each curve reports the average over 20 independent batches, each containing $n=8192$ tokens initialized from a standard Gaussian distribution. Shaded regions indicate the empirical $0.1$–$0.9$ quantile band.

## 6 Conclusions

In this work, we connected modern Transformer architectures with classical control theory by analyzing a mean\-field formulation in the Gaussian regime\. We proved that, for self\-attention with affine feed\-forward layers, Gaussian measures are preserved by the induced flow\. This invariance reduces the nonlocal transport PDE on probability measures to a finite\-dimensional bilinear control system for the mean and covariance\.

Our characterization of mean and covariance dynamics provides a control theoretic interpretation of neural network expressivity\. Finite\-time reachability yields a minimal interpolation property within the invariant class of Gaussian measures with fixed covariance rank, while the asymptotic analysis identifies parameter regimes leading either to stable covariance dynamics or to finite\-time blow\-up\.

At the model level, covariance instabilities correspond to divergence of token values during forward propagation and therefore suggest possible numerical failure modes\. The experiments support this picture: although exact Gaussian invariance is not expected in trained encoder Transformers, Gaussian moment structure remains sufficiently persistent in early and intermediate layers for the reduced dynamics to capture relevant qualitative behavior\.

Several questions remain open. First, the long-time analysis relies on structural assumptions on the Transformer parameter matrices, including sign and symmetry conditions on $B$ and $V$. Extending the theory beyond these regimes is challenging because the resulting Bernoulli-type matrix equations may exhibit non-normal behavior. Second, the present framework treats a simplified architecture: it focuses on single Gaussian input distributions, omits layer normalization, and uses affine feed-forward layers to preserve Gaussian closure. The restriction to a single Gaussian is essential: for Gaussian mixtures, the exponential factors produced by the different components no longer cancel, so the attention field is no longer affine in $x$, and the closed Bernoulli system for the mean and covariance no longer applies. Consequently, finite Gaussian mixtures do not form an invariant class under the Transformer flow, and no exact finite-dimensional closure analogous to [Proposition 2.1](https://arxiv.org/html/2606.07600#S2.Thmtheorem1) should be expected.

A further perspective is to study the infinite inverse-temperature limit within the Gaussian setting, obtained by replacing $B$ in the attention weights with $\beta B$ and letting $\beta\to\infty$. For every finite $\beta$, the mean and covariance still satisfy a Bernoulli-type system, with the quadratic terms scaled by $\beta$. The limit formally corresponds to hardmax attention, but this is not directly well posed on the unbounded support of Gaussians. Whether a meaningful interpretation survives after truncation, renormalization, or projection onto Gaussian moments remains an interesting open problem. More generally, extensions to Gaussian mixtures, multi-head attention, normalization mechanisms, nonlinear feed-forward layers, and singular attention limits would require a more delicate analysis, likely based on projected or approximate moment-closure methods rather than exact Gaussian closure.

Overall, the Gaussian Transformer shows that expressivity and forward\-pass stability can be studied jointly through controlled mean–covariance dynamics\. This perspective offers a principled route toward the design and analysis of stable, controllable, and mathematically grounded Transformer architectures\.

## Appendix A Proofs of technical results

This appendix contains proofs of auxiliary results. Throughout, we omit the domains of integration to ease readability.

### A.1 Proof of [Lemma 2.3](https://arxiv.org/html/2606.07600#S2.Thmtheorem3)

###### Proof A.1.

Let $T^{*} := \sup\{T\geq 0:\ E_{t}\leq 2E_{0}\text{ for all }t\in[0,T]\}$. We shall show $T^{*}>0$.

Define $\Lambda_{t}(\xi)=\log\int e^{\xi^{\top}y}\,\mathrm{d}\rho_{t}(y)$ as the cumulant generating function of $\rho_{t}$. Then, it holds that $\mathcal{A}_{\rho_{t}}(x)=V\nabla_{\xi}\Lambda_{t}(Bx)$. As $E_{t}\leq 2E_{0}$, by Young's inequality we have

$$
\Lambda_{t}(\xi)\leq\log\left(e^{\frac{1}{4\kappa_{0}}|\xi|^{2}}\int e^{\kappa_{0}|y|^{2}}\,\mathrm{d}\rho_{t}(y)\right)\leq\log\left(e^{\frac{1}{4\kappa_{0}}|\xi|^{2}}\cdot 2E_{0}\right)\leq\frac{1}{4\kappa_{0}}|\xi|^{2}+\log(2E_{0}) \tag{25}
$$

for all $t\in[0,T^{*}]$. Note that ([25](https://arxiv.org/html/2606.07600#A1.E25)) implies that the moment generating function of $\rho_{t}$ is finite on all of $\mathbb{R}^{d}$, hence $\Lambda_{t}$ is well defined and smooth in $\xi$, and

$$
\nabla\Lambda_{t}(\xi)=\frac{\int y\,e^{\xi^{\top}y}\,\mathrm{d}\rho_{t}(y)}{\int e^{\xi^{\top}y}\,\mathrm{d}\rho_{t}(y)}.
$$

Since $\Lambda_{t}$ is convex and has at most quadratic growth, there exist constants $C_{1},C_{2}>0$ such that $\|\nabla\Lambda_{t}(\xi)\|\leq C_{1}|\xi|+C_{2}$ for all $t\in[0,T^{*}]$. It follows that

$$
\|\mathcal{A}[\rho_{t}](t,x)\|=\|V\nabla\Lambda_{t}(Bx)\|\leq\|V\|\bigl(C_{1}\|B\||x|+C_{2}\bigr).
$$

Since $\sigma=\mathrm{ReLU}$ is globally Lipschitz and has linear growth, the velocity field $u_{t}(x):=\sigma(Ax+b)+\mathcal{A}[\rho_{t}](t,x)$ is locally Lipschitz in $x$ and satisfies

$$
\|u_{t}(x)\|\leq K_{1}|x|+K_{2} \tag{26}
$$

for all $t\in[0,T^{*}]$ and some constants $K_{1},K_{2}>0$. Therefore, by the Picard–Lindelöf existence-uniqueness theorem for ODEs [Teschl2012ODE], for each $x_{0}\in\mathbb{R}^{d}$ the characteristic equation

$$
\dot{\Phi}_{t}(x_{0})=u_{t}(\Phi_{t}(x_{0})),\qquad\Phi_{0}(x_{0})=x_{0},
$$

admits a unique solution on $[0,T^{*}]$, and the linear-growth bound ([26](https://arxiv.org/html/2606.07600#A1.E26)) prevents finite-time blow-up. Since $\rho_{t}$ solves the continuity equation ([3](https://arxiv.org/html/2606.07600#S2.E3)) with velocity field $u_{t}$, the method of characteristics yields $\rho_{t}=(\Phi_{t})_{\#}\rho_{0}$, where $(\Phi_{t})_{\#}$ is the pushforward of the flow map $\Phi_{t}$. Consequently,

$$
E_{t}=\int e^{\kappa_{0}|\Phi_{t}(x_{0})|^{2}}\,\mathrm{d}\rho_{0}(x_{0}). \tag{27}
$$

Thus, for a.e. $t\in[0,T^{*}]$,

$$
\frac{\mathrm{d}}{\mathrm{d}t}|\Phi_{t}(x_{0})|\leq\|\sigma(A\Phi_{t}(x_{0})+b)+\mathcal{A}[\rho_{t}](t,\Phi_{t}(x_{0}))\|\leq K_{1}|\Phi_{t}(x_{0})|+K_{2}. \tag{28}
$$

Multiplying by $e^{-K_{1}t}$ we have $\frac{\mathrm{d}}{\mathrm{d}t}\left(|\Phi_{t}(x_{0})|e^{-K_{1}t}\right)\leq K_{2}e^{-K_{1}t}$, and integrating both sides from $0$ to $t$ yields $|\Phi_{t}(x_{0})|\leq|x_{0}|e^{K_{1}t}+\frac{K_{2}}{K_{1}}\left(e^{K_{1}t}-1\right)$. Therefore, for any $\epsilon>0$,

$$
|\Phi_{t}(x_{0})|^{2}\leq(1+\epsilon)e^{2K_{1}t}|x_{0}|^{2}+\left(1+\frac{1}{\epsilon}\right)\left(\frac{K_{2}}{K_{1}}(e^{K_{1}t}-1)\right)^{2}.
$$

Substituting this into ([27](https://arxiv.org/html/2606.07600#A1.E27)), we have

$$
E_{t}\leq\exp\left(\kappa_{0}\left(1+\tfrac{1}{\epsilon}\right)\left(\tfrac{K_{2}}{K_{1}}(e^{K_{1}t}-1)\right)^{2}\right)\int\exp\left(\kappa_{0}(1+\epsilon)e^{2K_{1}t}|x_{0}|^{2}\right)\mathrm{d}\rho_{0}(x_{0}). \tag{29}
$$

Since $\rho_{0}=\mathcal{N}(\mu_{0},\Sigma_{0})$ and $\kappa_{0}<\frac{1}{2\lambda_{\max}(\Sigma_{0})}$, the integral $\int e^{\alpha|x_{0}|^{2}}\,\mathrm{d}\rho_{0}(x_{0})$ is finite for every $\alpha<\frac{1}{2\lambda_{\max}(\Sigma_{0})}$, and depends continuously on $\alpha$. Hence one can choose $\epsilon>0$ and then $\tau>0$ sufficiently small so that $\kappa_{0}(1+\epsilon)e^{2K_{1}t}<\frac{1}{2\lambda_{\max}(\Sigma_{0})}$ for all $t\in[0,\tau]$, and the right-hand side of ([29](https://arxiv.org/html/2606.07600#A1.E29)) is bounded by $\frac{3}{2}E_{0}$ for all $t\in[0,\tau]$. Therefore $E_{t}\leq 2E_{0}$ on $[0,\tau]$, so $T^{*}\geq\tau>0$.
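The finiteness and continuity claim used in the last step can be made explicit. The following closed form is a standard Gaussian computation that the proof does not spell out; it is included here only as a supporting sketch. Diagonalizing $\Sigma_{0}$ and integrating coordinate by coordinate gives, for $\rho_{0}=\mathcal{N}(\mu_{0},\Sigma_{0})$ and every $\alpha<\frac{1}{2\lambda_{\max}(\Sigma_{0})}$,

$$
\int e^{\alpha|x_{0}|^{2}}\,\mathrm{d}\rho_{0}(x_{0})=\det\bigl(I_{d}-2\alpha\Sigma_{0}\bigr)^{-\frac{1}{2}}\exp\Bigl(\alpha\,\mu_{0}^{\top}\bigl(I_{d}-2\alpha\Sigma_{0}\bigr)^{-1}\mu_{0}\Bigr).
$$

The right-hand side is continuous in $\alpha$ on this range and equals $E_{0}$ at $\alpha=\kappa_{0}$, which is why the integral in ([29](https://arxiv.org/html/2606.07600#A1.E29)) can be kept close to $E_{0}$ for small $\epsilon$ and small $t$.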

### A.2 Proof of [Proposition 2.4](https://arxiv.org/html/2606.07600#S2.Thmtheorem4)

###### Proof A.2.

Consider $X_{t}\sim\rho_{t}$, $Y_{t}\sim\nu_{t}$ on $\mathbb{R}^{d}$. Then we have

$$
\dot{X}_{t}=\sigma(AX_{t}+b)+\mathcal{A}[\rho_{t}](t,X_{t}),\qquad\dot{Y}_{t}=AY_{t}+b+\mathcal{A}[\nu_{t}](t,Y_{t}).
$$

Let $e_{t}=X_{t}-Y_{t}$ be the discrepancy and use the shorthand notation $\mathcal{A}[\rho_{t}](t,X_{t})=\mathcal{A}_{\rho_{t}}(X_{t})$ for readability. Differentiation yields

$$
\frac{\mathrm{d}}{\mathrm{d}t}e_{t}=\eta(AX_{t}+b)+Ae_{t}+\left(\mathcal{A}_{\rho_{t}}(X_{t})-\mathcal{A}_{\nu_{t}}(Y_{t})\right), \tag{30}
$$

where $\eta(x)=\sigma(x)-x=\max\{0,-x\}$. We shall bound the term

$$
\mathcal{A}_{\rho_{t}}(X_{t})-\mathcal{A}_{\nu_{t}}(Y_{t})=\bigl(\mathcal{A}_{\rho_{t}}(X_{t})-\mathcal{A}_{\nu_{t}}(X_{t})\bigr)+\bigl(\mathcal{A}_{\nu_{t}}(X_{t})-\mathcal{A}_{\nu_{t}}(Y_{t})\bigr)
$$

by calculating the measure and spatial variation, respectively. For the spatial variation we use that $\nu_{t}$ is Gaussian and the expression ([6](https://arxiv.org/html/2606.07600#S2.E6)), hence $\mathcal{A}_{\nu_{t}}(X_{t})-\mathcal{A}_{\nu_{t}}(Y_{t})=V\Sigma_{t}B(X_{t}-Y_{t})$. Taking the $L^{2}$ norm directly yields the spatial bound:

$$
\|\mathcal{A}_{\nu_{t}}(X_{t})-\mathcal{A}_{\nu_{t}}(Y_{t})\|_{L^{2}}\leq\|V\Sigma_{t}B\|\,\|X_{t}-Y_{t}\|_{L^{2}}.
$$

As for the measure variation, we define for any density $\varrho$ on $\mathbb{R}^{d}$ two functions $N_{\varrho}(x):=\int y\,e^{y^{\top}Bx}\,\mathrm{d}\varrho(y)$ and $Z_{\varrho}(x):=\int e^{y^{\top}Bx}\,\mathrm{d}\varrho(y)$. Then $\mathcal{A}_{\varrho}(x)=V\frac{N_{\varrho}(x)}{Z_{\varrho}(x)}$. We have

$$
\mathcal{A}_{\rho_{t}}(x)-\mathcal{A}_{\nu_{t}}(x)=\frac{V}{Z_{\rho_{t}}(x)}\left(N_{\rho_{t}}(x)-N_{\nu_{t}}(x)\right)+\mathcal{A}_{\nu_{t}}(x)\,\frac{Z_{\nu_{t}}(x)-Z_{\rho_{t}}(x)}{Z_{\rho_{t}}(x)}. \tag{31}
$$

Let $\gamma_{t}$ be the optimal coupling of $\rho_{t}$ and $\nu_{t}$ [villani2009optimal], which is a probability distribution on $\mathbb{R}^{2d}$ satisfying $\iint\|y-z\|^{2}\,\mathrm{d}\gamma_{t}=W_{2}^{2}(\rho_{t},\nu_{t})$. Then we have

$$
\begin{aligned}
|Z_{\rho_{t}}(x)-Z_{\nu_{t}}(x)| &\leq \iint\bigl(\|B\|\|x\|\|y-z\|\bigr)\bigl(e^{y^{\top}Bx}+e^{z^{\top}Bx}\bigr)\,\mathrm{d}\gamma_{t}(y,z)\\
&\leq \|B\|\|x\|\left(\iint\|y-z\|^{2}\,\mathrm{d}\gamma_{t}(y,z)\right)^{\frac{1}{2}}\left(\iint\left(e^{y^{\top}Bx}+e^{z^{\top}Bx}\right)^{2}\mathrm{d}\gamma_{t}(y,z)\right)^{\frac{1}{2}}\\
&\leq \|B\|\|x\|\,W_{2}(\rho_{t},\nu_{t})\left(2\int e^{2y^{\top}Bx}\,\mathrm{d}\rho_{t}(y)+2\int e^{2z^{\top}Bx}\,\mathrm{d}\nu_{t}(z)\right)^{\frac{1}{2}}.
\end{aligned}
$$

On the other hand, we have

$$
\begin{aligned}
\|N_{\rho_{t}}(x)-N_{\nu_{t}}(x)\| &\leq \iint\|y-z\|\,e^{y^{\top}Bx}\,\mathrm{d}\gamma_{t}+\iint\|z\|\left|e^{y^{\top}Bx}-e^{z^{\top}Bx}\right|\mathrm{d}\gamma_{t}\\
&\leq \left(\iint\|y-z\|^{2}\,\mathrm{d}\gamma_{t}\right)^{\frac{1}{2}}\left(\iint e^{2y^{\top}Bx}\,\mathrm{d}\gamma_{t}\right)^{\frac{1}{2}}\\
&\quad+\iint\|z\|\bigl(\|B\|\|x\|\|y-z\|\bigr)\bigl(e^{y^{\top}Bx}+e^{z^{\top}Bx}\bigr)\,\mathrm{d}\gamma_{t}\\
&\leq W_{2}(\rho_{t},\nu_{t})\left(\int e^{2y^{\top}Bx}\,\mathrm{d}\rho_{t}(y)\right)^{\frac{1}{2}}\\
&\quad+\|B\|\|x\|\,W_{2}(\rho_{t},\nu_{t})\left(\iint 2\|z\|^{2}e^{2y^{\top}Bx}\,\mathrm{d}\gamma_{t}+\iint 2\|z\|^{2}e^{2z^{\top}Bx}\,\mathrm{d}\gamma_{t}\right)^{\frac{1}{2}}.
\end{aligned}
$$

Additionally, by Jensen's inequality,

$$
Z_{\rho_{t}}(x)\geq\exp\left(\int y^{\top}Bx\,\mathrm{d}\rho_{t}(y)\right)\geq\exp\left(\mu_{\rho}^{\top}Bx\right),
$$

where $\mu_{\rho}=\int y\,\mathrm{d}\rho_{t}(y)$. Substituting the above estimates of $Z_{\rho_{t}}(x)-Z_{\nu_{t}}(x)$, $N_{\rho_{t}}(x)-N_{\nu_{t}}(x)$ and $\frac{1}{Z_{\rho_{t}}(x)}\leq\exp\left(-\mu_{\rho}^{\top}Bx\right)$ into ([31](https://arxiv.org/html/2606.07600#A1.E31)), we obtain

$$
\begin{aligned}
\|\mathcal{A}_{\rho_{t}}(x)-\mathcal{A}_{\nu_{t}}(x)\| &\leq \frac{\|V\|}{Z_{\rho_{t}}(x)}\left(\|N_{\rho_{t}}(x)-N_{\nu_{t}}(x)\|+\frac{\|N_{\nu_{t}}(x)\|}{Z_{\nu_{t}}(x)}\,|Z_{\rho_{t}}(x)-Z_{\nu_{t}}(x)|\right)\\
&\leq \|V\|\Bigl[I_{t}^{1}+I_{t}^{2}+I_{t}^{3}\Bigr]W_{2}(\rho_{t},\nu_{t}),
\end{aligned}
$$

where

$$
\begin{aligned}
I_{t}^{1} &= \left(\int e^{2(y-\mu_{\rho})^{\top}Bx}\,\mathrm{d}\rho_{t}(y)\right)^{\frac{1}{2}},\\
I_{t}^{2} &= \|B\|\|x\|\left(\iint 2\|z\|^{2}e^{2(y-\mu_{\rho})^{\top}Bx}\,\mathrm{d}\gamma_{t}+\iint 2\|z\|^{2}e^{2(z-\mu_{\rho})^{\top}Bx}\,\mathrm{d}\gamma_{t}\right)^{\frac{1}{2}},\\
I_{t}^{3} &= \|B\|\|x\|\left(2\int e^{2(y-\mu_{\rho})^{\top}Bx}\,\mathrm{d}\rho_{t}+2\int e^{2(z-\mu_{\rho})^{\top}Bx}\,\mathrm{d}\nu_{t}\right)^{\frac{1}{2}}\bigl(\|\mu_{t}\|+\|\Sigma_{t}B\|\|x\|\bigr).
\end{aligned}
$$

We shall show that the $L^{2}(\rho_{t})$ norm of $I_{t}^{1}+I_{t}^{2}+I_{t}^{3}$ is finite within $[0,T^{*}]$. Since $z\sim\nu_{t}$, terms involving $\int\|z\|^{2}e^{c\|z\|}\,\mathrm{d}\nu_{t}(z)$, for $c>0$, are bounded over $t\in[0,T^{*}]$, and polynomials and exponentials of $x$ integrated against $\rho_{t}(x)$ are also finite over $t\in[0,T^{*}]$ due to the sub-Gaussianity proved in [Lemma 2.3](https://arxiv.org/html/2606.07600#S2.Thmtheorem3). Thus, by the algebraic inequality $(a+b+c)^{2}\leq 3(a^{2}+b^{2}+c^{2})$, it suffices to consider the worst-case integrals

$$
\begin{aligned}
J_{t}^{1} &= \int\|x\|^{2}\iint\|z\|^{2}e^{2(y-\mu_{\rho})^{\top}Bx}\,\mathrm{d}\gamma_{t}(y,z)\,\mathrm{d}\rho_{t}(x) = \iiint\|x\|^{2}\|z\|^{2}e^{2(y-\mu_{\rho})^{\top}Bx}\,\mathrm{d}\gamma_{t}(y,z)\,\mathrm{d}\rho_{t}(x),\\
J_{t}^{2} &= \int\|x\|^{4}\int e^{2(y-\mu_{\rho})^{\top}Bx}\,\mathrm{d}\rho_{t}(y)\,\mathrm{d}\rho_{t}(x) = \iint\|x\|^{4}e^{2(y-\mu_{\rho})^{\top}Bx}\,\mathrm{d}\rho_{t}(y)\,\mathrm{d}\rho_{t}(x),
\end{aligned}
$$

and show that they are both bounded within a short time. By Fubini's theorem and the Cauchy–Schwarz inequality,

$$
\begin{aligned}
J_{t}^{1} &\leq \left(\int\|x\|^{2}e^{\|B\|\|x\|^{2}}\,\mathrm{d}\rho_{t}(x)\right)\left(\iint\|z\|^{2}e^{\|B\|\|y-\mu_{\rho}\|^{2}}\,\mathrm{d}\gamma_{t}(y,z)\right),\\
J_{t}^{2} &\leq \left(\int\|x\|^{4}e^{\|B\|\|x\|^{2}}\,\mathrm{d}\rho_{t}(x)\right)\left(\int e^{\|B\|\|y-\mu_{\rho}\|^{2}}\,\mathrm{d}\rho_{t}(y)\right).
\end{aligned}
$$

Since $\|B\|<\kappa_{0}$, by [Lemma 2.3](https://arxiv.org/html/2606.07600#S2.Thmtheorem3), there exists a constant $c_{0}$ such that

$$
\max\left\{\int\|x\|^{2}e^{\|B\|\|x\|^{2}}\,\mathrm{d}\rho_{t}(x),\ \int\|x\|^{4}e^{\|B\|\|x\|^{2}}\,\mathrm{d}\rho_{t}(x),\ \int e^{\|B\|\|y-\mu_{\rho}\|^{2}}\,\mathrm{d}\rho_{t}(y)\right\}\leq c_{0}
$$

for all $t\in[0,T^{*}]$. As for the second part of $J_{t}^{1}$, we have

$$
\begin{aligned}
\iint\|z\|^{2}e^{\|B\|\|y-\mu_{\rho}\|^{2}}\,\mathrm{d}\gamma_{t}(y,z) &\leq \left(\int\|z\|^{4}\,\mathrm{d}\nu_{t}(z)\right)^{\frac{1}{2}}\left(\int e^{2\|B\|\|y-\mu_{\rho}\|^{2}}\,\mathrm{d}\rho_{t}(y)\right)^{\frac{1}{2}}\\
&\leq \left(\int\|z\|^{4}\,\mathrm{d}\nu_{t}(z)\right)^{\frac{1}{2}}\left(e^{4\|B\|\|\mu_{\rho}\|^{2}}\int e^{4\|B\|\|y\|^{2}}\,\mathrm{d}\rho_{t}(y)\right)^{\frac{1}{2}},
\end{aligned}
$$

which is finite over $t\in[0,T^{*}]$ thanks to [Lemma 2.3](https://arxiv.org/html/2606.07600#S2.Thmtheorem3). Summarizing the above argument, we have $\|\mathcal{A}_{\rho_{t}}(x)-\mathcal{A}_{\nu_{t}}(x)\|\leq L_{\rho}(t)W_{2}(\rho_{t},\nu_{t})$, where $L_{\rho}(t)$ is bounded over $t\in[0,T^{*}]$. Hence by ([30](https://arxiv.org/html/2606.07600#A1.E30)), we have

$$
\frac{\mathrm{d}}{\mathrm{d}t}\|e_{t}\|_{L^{2}}\leq\Bigl(\|A\|+\|V\Sigma_{t}B\|+L_{\rho}(t)\Bigr)\|e_{t}\|_{L^{2}}+\|\eta(AX_{t}+b)\|_{L^{2}(\rho_{t})}.
$$

Since $L_{\rho}(t)$ and $\|V\Sigma_{t}B\|$ are uniformly bounded over $t\in[0,T^{*}]$, by Grönwall's lemma, there exists $K>0$ such that

$$
W_{2}(\rho_{t},\nu_{t})\leq\|e_{t}\|_{L^{2}}\leq\int_{0}^{t}\|\eta(AX_{s}+b)\|_{L^{2}(\rho_{s})}\,e^{K(t-s)}\,\mathrm{d}s. \tag{32}
$$

By the definition of the pushforward measure, we have

$$
\begin{aligned}
\|\eta(AX_{s}+b)\|_{L^{2}(\rho_{s})} &= \|\eta(A\Phi_{s}(X_{0})+b)\|_{L^{2}(\rho_{0})}\\
&\leq \|\eta(AX_{0}+b)\|_{L^{2}(\rho_{0})}+\|A\|\,\|\Phi_{s}(X_{0})-X_{0}\|_{L^{2}(\rho_{0})}.
\end{aligned}
$$

As we have established in ([28](https://arxiv.org/html/2606.07600#A1.E28)) that $\frac{\mathrm{d}}{\mathrm{d}t}|\Phi_{t}(x_{0})|\leq K_{1}|\Phi_{t}(x_{0})|+K_{2}$, it follows that

$$
|\Phi_{s}(X_{0})-X_{0}|\leq(K_{1}|X_{0}|+K_{2})\int_{0}^{s}e^{K_{1}r}\,\mathrm{d}r\leq(K_{1}|X_{0}|+K_{2})\,s\,e^{K_{1}s}.
$$

Therefore we have $\|\eta(AX_{s}+b)\|_{L^{2}(\rho_{s})}\leq\|\eta(AX_{0}+b)\|_{L^{2}(\rho_{0})}+C_{0}\,s\,e^{K_{1}s}$, where $C_{0}=\|A\|\bigl(K_{1}\sqrt{\|\mu_{0}\|^{2}+\operatorname{tr}(\Sigma_{0})}+K_{2}\bigr)$. Substituting into ([32](https://arxiv.org/html/2606.07600#A1.E32)) yields

$$
\begin{aligned}
W_{2}(\rho_{t},\nu_{t}) &\leq \int_{0}^{t}\Bigl[\|\eta(AX_{0}+b)\|_{L^{2}(\rho_{0})}+C_{0}\,s\,e^{K_{1}s}\Bigr]e^{K(t-s)}\,\mathrm{d}s\\
&\leq \|\eta(AX_{0}+b)\|_{L^{2}(\rho_{0})}\left(\frac{e^{Kt}-1}{K}\right)+C_{0}\,e^{Kt}\int_{0}^{t}s\,e^{(K_{1}-K)s}\,\mathrm{d}s\\
&\leq t\cdot\left\|\eta(AX_{0}+b)\right\|_{L^{2}(\rho_{0})}+\mathcal{O}(t^{2})
\end{aligned}
$$

for all $t\in[0,T^{*}]$, which completes the proof.

### A.3 Proof of [Theorem 4.7](https://arxiv.org/html/2606.07600#S4.Thmtheorem7)

###### Proof A\.3\.

We split the proof into two cases, depending on the sign of $\Re(\operatorname{spec}(A))$.

#### Case $\Re(\operatorname{spec}(A))>0$

We first show the existence of an equilibrium of the $\Sigma$ equation in ([7](https://arxiv.org/html/2606.07600#S2.E7)) (i.e., a solution of ([18](https://arxiv.org/html/2606.07600#S4.E18))) using the Poincaré-Hopf theorem (see [milnor1997topology, Chapter 7]). More precisely, we construct a domain $\mathcal{K}\subset\mathbb{R}^{d\times d}$ such that the vector field defining the $\Sigma$ equation in ([7](https://arxiv.org/html/2606.07600#S2.E7)) is transversal to $\partial\mathcal{K}$, and then show the existence of equilibria inside $\mathcal{K}$ by a topological argument.

We begin by constructing the domain $\mathcal{K}$ using hyperplanes defined via Lyapunov functions. Calculating the trace of the derivative of $\Sigma(t)$ yields

$$\operatorname{tr}(\dot{\Sigma}) = \operatorname{tr}(A\Sigma+\Sigma A^{\top}) + \operatorname{tr}(V\Sigma B\Sigma+\Sigma B^{\top}\Sigma V^{\top}) = 2\operatorname{tr}(A\Sigma) + 2\operatorname{tr}(\Sigma V\Sigma B).$$

We shall compare the growth of the two terms in this equation to give a sign to $\operatorname{tr}(\dot{\Sigma})$ when the norm of $\Sigma$ is large. On the one hand, there exists $C_1\geq 0$ such that $\operatorname{tr}(A\Sigma)\leq C_1\|\Sigma\|_F$, where $\|\cdot\|_F$ is the Frobenius norm of square matrices. Meanwhile, since $\Sigma V\Sigma\prec 0$ (by Sylvester's law of inertia) and $B\succ 0$, we have $\operatorname{tr}(\Sigma V\Sigma B)<0$, and further by applying von Neumann's trace inequality (refer to, for example, [horn2012matrix, Theorem 4.3.53]), we have

$$\operatorname{tr}(\Sigma V\Sigma B) \leq \lambda_{\min}(B)\operatorname{tr}(\Sigma V\Sigma) \leq \lambda_{\min}(B)\lambda_{\max}(V)\operatorname{tr}(\Sigma^{2}) \leq -C_2\|\Sigma\|_F^{2},$$

where $C_2>0$. Hence $\operatorname{tr}(\dot{\Sigma}) \leq 2C_1\|\Sigma\|_F - 2C_2\|\Sigma\|_F^{2}$, which is negative once $\|\Sigma\|_F > C_1/C_2$. Therefore, since $\Sigma\succ 0$ by [Lemma 3.1](https://arxiv.org/html/2606.07600#S3.Thmtheorem1) (so that $\|\Sigma\|_F \geq \operatorname{tr}(\Sigma)/\sqrt{d}$), there exists a constant $R>0$ such that if $\operatorname{tr}(\Sigma)=R$, then $\operatorname{tr}(\dot{\Sigma})<0$. Next, denote $x\coloneqq\mathrm{vec}(\Sigma)$. Consider a boundary function $U(\Sigma)=x^{\top}Px=\mathrm{vec}(\Sigma)^{\top}P\,\mathrm{vec}(\Sigma)$, where $P\succ 0$ solves the Lyapunov equation

$$P(I_d\otimes A+A\otimes I_d) + (I_d\otimes A+A\otimes I_d)^{\top}P = I_{d^{2}}.$$

The existence of $P$ is guaranteed by the fact that $\Re(\operatorname{spec}(A))>0$: the eigenvalues of $I_d\otimes A+A\otimes I_d$ are the pairwise sums $\lambda_i+\lambda_j$ of eigenvalues of $A$, which then all have positive real part. Then, differentiating the boundary function yields

$$\dot{U} = x^{\top}\Bigl[P(I_d\otimes A+A\otimes I_d) + (I_d\otimes A+A\otimes I_d)^{\top}P\Bigr]x + 2x^{\top}P\,\mathrm{vec}(\mathcal{N}(\Sigma)) = \|x\|^{2} + 2x^{\top}P\,\mathrm{vec}(\mathcal{N}(\Sigma)),$$

where $\mathcal{N}(\Sigma) := V\Sigma B\Sigma+\Sigma B^{\top}\Sigma V^{\top}$ is the quadratic nonlinearity. Since $\|x\|^{2}$ is quadratic in $\Sigma$ while the second term is cubic, there exists $\varepsilon>0$ such that when $U(\Sigma)=\varepsilon$, we have $\dot{U}=\langle\nabla U,\mathrm{vec}(\dot{\Sigma})\rangle>0$.

Summarizing the above arguments, we construct the set $\mathcal{K}\coloneqq\{\Sigma\succ 0 \mid U(\Sigma)\geq\varepsilon,\ \operatorname{tr}(\Sigma)\leq R\}$. Note that $\mathcal{K}$ is a compact convex set in the set of $d\times d$ positive definite matrices (and hence in $\mathbb{R}^{d(d+1)/2}$) with nonempty interior. Moreover, the $\Sigma$ vector field $\mathcal{V}(\Sigma)\coloneqq A\Sigma+\Sigma A^{\top}+V\Sigma B\Sigma+\Sigma B^{\top}\Sigma V^{\top}$ is transversal to the boundary $\partial\mathcal{K}$ and points towards the interior of $\mathcal{K}$ at every point of the boundary, due to the above calculation of the derivatives of $U(\Sigma)$ and $\operatorname{tr}(\Sigma)$. Therefore, by the Poincaré-Hopf theorem, we have

$$\sum_{\{\Sigma^{*}\in\mathcal{K}\,\mid\,\mathcal{V}(\Sigma^{*})=0\}}\mathrm{ind}(\Sigma^{*}) = (-1)^{\frac{d(d+1)}{2}}\chi(\mathcal{K}) = (-1)^{\frac{d(d+1)}{2}}, \tag{33}$$

where the last equality is due to the fact that any compact, convex subset of $\mathbb{R}^{N}$ ($N>0$) with non-empty interior is homeomorphic to the closed unit ball $\mathbb{D}^{N}$, whose Euler characteristic $\chi$ is $1$.

The non-zeroness of ([33](https://arxiv.org/html/2606.07600#A1.E33)) implies that inside $\mathcal{K}$ there exists at least one equilibrium of the vector field $\mathcal{V}$, guaranteeing that ([18](https://arxiv.org/html/2606.07600#S4.E18)) has at least one positive definite solution.
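The Lyapunov-based part of this construction can be checked numerically. The sketch below is illustrative only: the matrices `A`, `V`, `B` and the helper names are hypothetical choices of ours, not taken from the paper. It solves the Lyapunov equation for $P$ by vectorization (column-major `vec`, NumPy only) and compares the closed form $\dot U=\|x\|^{2}+2x^{\top}P\,\mathrm{vec}(\mathcal{N}(\Sigma))$ with a direct evaluation of $2x^{\top}P\,\mathrm{vec}(\dot\Sigma)$.

```python
import numpy as np

def vec(M):
    """Column-major vectorization, so that vec(A X B) = kron(B.T, A) @ vec(X)."""
    return M.reshape(-1, order="F")

def lyapunov_P(A):
    """Solve P L + L^T P = I, where L = I (x) A + A (x) I, via a linear system."""
    d = A.shape[0]
    I_d = np.eye(d)
    L = np.kron(I_d, A) + np.kron(A, I_d)        # vec(A S + S A^T) = L vec(S)
    n = d * d
    I_n = np.eye(n)
    K = np.kron(L.T, I_n) + np.kron(I_n, L.T)    # vec(P L + L^T P) = K vec(P)
    P = np.linalg.solve(K, vec(I_n)).reshape(n, n, order="F")
    return P, L

def nonlinearity(S, V, B):
    """Quadratic part N(S) = V S B S + S B^T S V^T of the covariance vector field."""
    return V @ S @ B @ S + S @ B.T @ S @ V.T

def U_dot_direct(S, A, V, B, P):
    """dU/dt = 2 x^T P vec(dS/dt) for U = x^T P x with symmetric P."""
    rhs = A @ S + S @ A.T + nonlinearity(S, V, B)
    return 2.0 * vec(S) @ P @ vec(rhs)

# Hypothetical test matrices: spec(A) = {1, 0.5}, V negative definite, B positive definite.
A = np.array([[1.0, 2.0], [0.0, 0.5]])
V = np.array([[-1.0, 0.2], [0.2, -2.0]])
B = np.array([[2.0, 0.3], [0.3, 1.0]])

P, L = lyapunov_P(A)
residual = np.linalg.norm(P @ L + L.T @ P - np.eye(4))

# Closed form versus direct evaluation at an arbitrary covariance.
S = np.array([[1.0, 0.2], [0.2, 0.5]])
x = vec(S)
U_dot_formula = x @ x + 2.0 * x @ P @ vec(nonlinearity(S, V, B))
U_dot_check = U_dot_direct(S, A, V, B, P)

# Near the origin the quadratic term ||x||^2 dominates the cubic one.
U_dot_small = U_dot_direct(1e-3 * np.eye(2), A, V, B, P)

print("Lyapunov residual:", residual)
print("U_dot (formula, direct):", U_dot_formula, U_dot_check)
print("U_dot at a small covariance:", U_dot_small)
```

The vectorized solve has size $d^{4}$, so this is only sensible for small $d$; for larger problems a dedicated Lyapunov solver would be the natural substitute.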

Let $\Sigma_{\infty}\succ 0$ be one solution of ([18](https://arxiv.org/html/2606.07600#S4.E18)). We shall prove the local stability of $\Sigma_{\infty}$ under the condition that $\Sigma_{\infty}^{-1}V+V\Sigma_{\infty}^{-1}\prec 0$, by showing that, locally, the flow of the $\Sigma$ equation in ([7](https://arxiv.org/html/2606.07600#S2.E7)) converges to this solution of ([18](https://arxiv.org/html/2606.07600#S4.E18)). Define $\Delta(t):=\Sigma(t)-\Sigma_{\infty}$, then

$$\dot{\Delta} = M_{\infty}\Delta+\Delta M_{\infty}^{\top}+V\Delta B\Sigma_{\infty}+\Sigma_{\infty}B^{\top}\Delta V^{\top}+V\Delta B\Delta+\Delta B^{\top}\Delta V^{\top}, \tag{34}$$

where $M_{\infty}:=A+V\Sigma_{\infty}B$. Write $P:=\Sigma_{\infty}^{-1}\succ 0$, and define the quadratic functional $\Phi(\Delta):=\frac{1}{2}\operatorname{tr}(\Delta P\Delta P)\equiv\frac{1}{2}\|P^{\frac{1}{2}}\Delta P^{\frac{1}{2}}\|_F^{2}$. It holds that

$$\frac{1}{2}\lambda_{\min}(P)^{2}\|\Delta\|_F^{2} \;\leq\; \Phi(\Delta) \;\leq\; \frac{1}{2}\lambda_{\max}(P)^{2}\|\Delta\|_F^{2}. \tag{35}$$

We compute the derivative of $\Phi$ along solutions of ([34](https://arxiv.org/html/2606.07600#A1.E34)). Using the cyclic property of the trace, we obtain $\dot{\Phi}(\Delta)=\operatorname{tr}(\Delta P\dot{\Delta}P)$. Substituting ([34](https://arxiv.org/html/2606.07600#A1.E34)), we decompose $\dot{\Phi}=\mathrm{I}+\mathrm{II}+\mathrm{III}$, where

$$\begin{aligned}
\mathrm{I} &:= \operatorname{tr}\bigl(\Delta P(M_{\infty}\Delta+\Delta M_{\infty}^{\top})P\bigr),\\
\mathrm{II} &:= \operatorname{tr}\bigl(\Delta P(V\Delta B\Sigma_{\infty}+\Sigma_{\infty}B\Delta V)P\bigr),\\
\mathrm{III} &:= \operatorname{tr}\bigl(\Delta P(V\Delta B\Delta+\Delta B\Delta V)P\bigr).
\end{aligned}$$

Here the transposes appearing in ([34](https://arxiv.org/html/2606.07600#A1.E34)) are dropped because $B\succ 0$ and $V\prec 0$ are symmetric. Using cyclicity of the trace, we have $\mathrm{I}=\operatorname{tr}\bigl(\Delta P\Delta(PM_{\infty}+M_{\infty}^{\top}P)\bigr)$. Multiplying ([18](https://arxiv.org/html/2606.07600#S4.E18)) on the left and right by $P$ yields $PM_{\infty}+M_{\infty}^{\top}P=0$, hence $\mathrm{I}=0$. Further, using $\Sigma_{\infty}P=P\Sigma_{\infty}=I_d$, we obtain

$$\mathrm{II} = \operatorname{tr}(\Delta PV\Delta B)+\operatorname{tr}(\Delta B\Delta VP) = \operatorname{tr}\bigl(B\Delta(PV+VP)\Delta\bigr).$$

By assumption, we have that $PV+VP\prec 0$. Set $S:=-(PV+VP)\succ 0$. Then $\mathrm{II}=-\operatorname{tr}(B\Delta S\Delta)$. Since $B\succ 0$ and $S\succ 0$, we have

$$\operatorname{tr}(B\Delta S\Delta) = \|B^{\frac{1}{2}}\Delta S^{\frac{1}{2}}\|_F^{2} \;\geq\; \lambda_{\min}(B)\lambda_{\min}(S)\|\Delta\|_F^{2}.$$

Therefore by ([35](https://arxiv.org/html/2606.07600#A1.E35)), there exists $c>0$ such that $\mathrm{II}\leq -c\,\Phi(\Delta)$. As for the cubic term $\mathrm{III}$, by ([35](https://arxiv.org/html/2606.07600#A1.E35)), there exists $C_1>0$ such that $|\mathrm{III}|\leq C_1\Phi(\Delta)^{\frac{3}{2}}$. Combining the above estimates, we obtain $\dot{\Phi}(\Delta)\leq -c\,\Phi(\Delta)+C_1\Phi(\Delta)^{\frac{3}{2}}$. Hence, there exists $r>0$ such that if $\Phi(\Delta)\leq r$, then $\dot{\Phi}(\Delta)\leq -\frac{c}{2}\Phi(\Delta)$. This implies exponential decay of $\Phi(\Delta(t))$ for all initial data sufficiently close to $\Sigma_{\infty}$. Using ([35](https://arxiv.org/html/2606.07600#A1.E35)) yields $\|\Sigma(t)-\Sigma_{\infty}\|_F\leq C_2e^{-\gamma t}\|\Sigma(0)-\Sigma_{\infty}\|_F$ for some $C_2,\gamma>0$, proving this case.
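As a sanity check of this local stability statement, the following sketch integrates the covariance equation with a hand-written RK4 scheme in a deliberately simple toy configuration of our own choosing, $A=I$, $V=-I$, $B=I$, for which $\Sigma_{\infty}=I$ solves the equilibrium equation and $\Sigma_{\infty}^{-1}V+V\Sigma_{\infty}^{-1}=-2I\prec 0$. The function names, step size and initial covariance are assumptions made for illustration.

```python
import numpy as np

def sigma_rhs(S, A, V, B):
    """Right-hand side of the covariance equation: A S + S A^T + V S B S + S B^T S V^T."""
    return A @ S + S @ A.T + V @ S @ B @ S + S @ B.T @ S @ V.T

def rk4_path(S0, A, V, B, dt, steps):
    """Classical fourth-order Runge-Kutta integration; returns all iterates."""
    S = S0.copy()
    path = [S.copy()]
    for _ in range(steps):
        k1 = sigma_rhs(S, A, V, B)
        k2 = sigma_rhs(S + 0.5 * dt * k1, A, V, B)
        k3 = sigma_rhs(S + 0.5 * dt * k2, A, V, B)
        k4 = sigma_rhs(S + dt * k3, A, V, B)
        S = S + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        path.append(S.copy())
    return path

def lyapunov_functional(Delta, P):
    """Phi(Delta) = 0.5 * tr(Delta P Delta P), as in the proof."""
    return 0.5 * np.trace(Delta @ P @ Delta @ P)

# Toy configuration (our choice): the dynamics reduce to dS/dt = 2 S - 2 S^2.
I2 = np.eye(2)
A, V, B = I2, -I2, I2
Sigma_inf = I2
P = np.linalg.inv(Sigma_inf)

# Stability condition of the proof: P V + V P must be negative definite.
stability_margin = np.linalg.eigvalsh(P @ V + V @ P).max()

S0 = np.array([[1.5, 0.3], [0.3, 0.8]])      # positive definite initial covariance
path = rk4_path(S0, A, V, B, dt=0.01, steps=1000)

errors = [np.linalg.norm(S - Sigma_inf) for S in path]
phis = [lyapunov_functional(S - Sigma_inf, P) for S in path]

print("equilibrium residual:", np.linalg.norm(sigma_rhs(Sigma_inf, A, V, B)))
print("largest eigenvalue of PV + VP:", stability_margin)
print("initial / final Frobenius error:", errors[0], errors[-1])
```

In this commuting special case each eigenvalue of $\Sigma$ follows a scalar logistic equation, so the example only illustrates the mechanism and says nothing about general, non-commuting parameters.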

#### Case $\Re(\operatorname{spec}(A))\leq 0$

We note that $\Sigma_{\infty}=0$ is always a solution of ([18](https://arxiv.org/html/2606.07600#S4.E18)). Next, we show its global or local stability depending on the sign of $\lambda_{\max}(A+A^{\top})$. Calculating the derivative of $\operatorname{tr}(\Sigma)$ yields

$$\operatorname{tr}(\dot{\Sigma}) \leq \lambda_{\max}(A+A^{\top})\operatorname{tr}(\Sigma)+2\lambda_{\max}(V)\lambda_{\max}(B)\operatorname{tr}(\Sigma^{2}). \tag{36}$$

Hence if $\lambda_{\max}(A+A^{\top})<0$ and $V\prec 0$, $B\succ 0$, then $\operatorname{tr}(\dot{\Sigma})\leq\lambda_{\max}(A+A^{\top})\operatorname{tr}(\Sigma)$, and by Grönwall's Lemma, $\Sigma(t)$ converges to $0$ exponentially. If $\lambda_{\max}(A+A^{\top})=0$, then ([36](https://arxiv.org/html/2606.07600#A1.E36)) yields $\operatorname{tr}(\dot{\Sigma})\leq 2\lambda_{\max}(V)\lambda_{\max}(B)\operatorname{tr}(\Sigma^{2})$, which is strictly negative as long as $\Sigma\neq 0$, hence assuring that $\lim_{t\rightarrow\infty}\Sigma(t)=0$.
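The Grönwall bound $\operatorname{tr}(\Sigma(t))\leq e^{\lambda_{\max}(A+A^{\top})t}\operatorname{tr}(\Sigma(0))$ in the sub-case $\lambda_{\max}(A+A^{\top})<0$ can be observed numerically. The sketch below uses hypothetical matrices of our own choosing (with $V\prec 0$, $B\succ 0$ symmetric) and a hand-written RK4 integrator; it is an illustration under these assumptions, not the paper's experimental setup.

```python
import numpy as np

def sigma_rhs(S, A, V, B):
    """Right-hand side of the covariance equation: A S + S A^T + V S B S + S B^T S V^T."""
    return A @ S + S @ A.T + V @ S @ B @ S + S @ B.T @ S @ V.T

def rk4_traces(S0, A, V, B, dt, steps):
    """Integrate with classical RK4 and record tr(Sigma) after every step."""
    S = S0.copy()
    traces = [np.trace(S)]
    for _ in range(steps):
        k1 = sigma_rhs(S, A, V, B)
        k2 = sigma_rhs(S + 0.5 * dt * k1, A, V, B)
        k3 = sigma_rhs(S + 0.5 * dt * k2, A, V, B)
        k4 = sigma_rhs(S + dt * k3, A, V, B)
        S = S + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        traces.append(np.trace(S))
    return np.array(traces)

# Hypothetical parameters: spec(A) = {-1}, A + A^T negative definite,
# V negative definite, B positive definite.
A = np.array([[-1.0, 0.5], [0.0, -1.0]])
V = np.array([[-1.0, 0.2], [0.2, -2.0]])
B = np.array([[2.0, 0.3], [0.3, 1.0]])
S0 = np.array([[1.0, 0.2], [0.2, 0.5]])

lam = np.linalg.eigvalsh(A + A.T).max()      # here lam = -1.5

dt, steps = 0.01, 500
times = dt * np.arange(steps + 1)
traces = rk4_traces(S0, A, V, B, dt, steps)
gronwall_bound = np.trace(S0) * np.exp(lam * times)

print("lambda_max(A + A^T):", lam)
print("final trace vs. Gronwall bound:", traces[-1], gronwall_bound[-1])
```

The computed trace stays below the exponential envelope because the quadratic term $2\operatorname{tr}(\Sigma V\Sigma B)$ is nonpositive along the flow and only accelerates the decay.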

If, instead, $\Re(\operatorname{spec}(A))<0$ and $\lambda_{\max}(A+A^{\top})>0$, by first-order linearization of the dynamics of $\Sigma$ in ([7](https://arxiv.org/html/2606.07600#S2.E7)), $0$ is a locally stable equilibrium. Further, $\Sigma$ remains uniformly bounded over $t\in(0,+\infty)$ as we have the estimate

$$\operatorname{tr}(\dot{\Sigma}) \leq \lambda_{\max}(A+A^{\top})\operatorname{tr}(\Sigma)+\frac{2}{d}\lambda_{\max}(V)\lambda_{\max}(B)\operatorname{tr}(\Sigma)^{2} < 0$$

when $\operatorname{tr}(\Sigma) > -\frac{\lambda_{\max}(A+A^{\top})}{\frac{2}{d}\lambda_{\max}(V)\lambda_{\max}(B)} > 0$. Here the quadratic term follows from ([36](https://arxiv.org/html/2606.07600#A1.E36)) together with $\operatorname{tr}(\Sigma^{2})\geq\operatorname{tr}(\Sigma)^{2}/d$ and $\lambda_{\max}(V)\lambda_{\max}(B)<0$. This completes the proof.

## Acknowledgments

The authors thank the Speinshart Scientific Center for AI and SuperTech for its hospitality, where part of this research was conducted\.

## References