Kuramoto Attention: Synchronizing Self-Attention on the Torus

arXiv cs.LG Papers

Summary

Introduces Kuramoto attention, a self-attention layer where hidden states are phase angles on a torus, enabling synchronization through gated cosine similarity and circular mean updates. The layer performs comparably to standard transformers on character-level language modeling.

arXiv:2606.11585v1 Announce Type: new Abstract: We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Because the values are the raw phase states, this update is exactly the Kuramoto coupling term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$, with the attention matrix acting as an adaptive, content-dependent coupling kernel. Equivalently, the gated score is a learned metric on the torus that selects which tokens couple, and the update pulls each token toward the circular mean of the tokens it selects, tightening their phase agreement. The same two ingredients, an invariant similarity score and an on-manifold mean, define such a layer on any compact group; the torus is the abelian case, where both are closed-form. The softmax weights solve an entropy-regularized phase-retrieval problem, and rotary position enters as a position-dependent phase drift in the score. On enwiki8 character-level language modeling, the layer trains as a functional language model whose bits-per-character stays close to a strong matched RoPE+SwiGLU transformer: within $0.02$ BPC at one million parameters ($1.637\pm0.010$ versus $1.616\pm0.004$) and level on the median at five million ($1.448$ versus $1.452$ over five seeds) with the transformer ahead on the mean ($1.468$ versus $1.456$). These experiments establish that the constrained geometric structure is a viable language model at this scale; the structure itself, and its synchronization reading, is the contribution. Ablations isolate the load-bearing components, and the result gives a compact bridge between self-attention and phase synchronization.

Cached at: 06/11/26, 01:49 PM

# Kuramoto Attention: Synchronizing Self-Attention on the Torus
Source: [https://arxiv.org/html/2606.11585](https://arxiv.org/html/2606.11585)
Joshua Nunley (joshnunl@iu.edu), Department of Informatics, Luddy School of Informatics, Computing, and Engineering; Cognitive Science Program; Indiana University Bloomington

###### Abstract

We introduce *Kuramoto attention*, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Because the values are the raw phase states, this update is exactly the Kuramoto coupling term $\sum_{u}A_{t,u}\sin(\theta_{u}-\theta_{t})$, with the attention matrix acting as an adaptive, content-dependent coupling kernel. Equivalently, the gated score is a learned metric on the torus that selects which tokens couple, and the update pulls each token toward the circular mean of the tokens it selects, tightening their phase agreement. The same two ingredients, an invariant similarity score and an on-manifold mean, define such a layer on any compact group; the torus is the abelian case, where both are closed-form. The softmax weights solve an entropy-regularized phase-retrieval problem, and rotary position enters as a position-dependent phase drift in the score. On enwiki8 character-level language modeling, the layer trains as a functional language model whose bits-per-character stays close to a strong matched RoPE+SwiGLU transformer: within $0.02$ BPC at one million parameters ($1.637\pm 0.010$ versus $1.616\pm 0.004$) and level on the median at five million ($1.448$ versus $1.452$ over five seeds) with the transformer ahead on the mean ($1.468$ versus $1.456$). These experiments establish that the constrained geometric structure is a viable language model at this scale; the structure itself, and its synchronization reading, is the contribution. Ablations isolate the load-bearing components, and the result gives a compact bridge between self-attention and phase synchronization.

## 1 Introduction

Self-attention is conventionally described as content-based retrieval (Vaswani et al., [2017](https://arxiv.org/html/2606.11585#bib.bib3)): queries and keys produce a softmax weighting, and values are averaged. We introduce a self-attention layer whose hidden state is phase-valued, a bank of phase oscillators on a torus, and we design its value update so that the layer computes an adaptive Kuramoto synchronization step. In this step, each oscillator is pulled toward the attention-weighted circular mean of the oscillators it attends to, and the softmax attention matrix acts as a content-dependent coupling kernel.

Because the values are the raw phases, the value update equals the Kuramoto coupling term $\sum_{u}A_{t,u}\sin(\theta_{u}-\theta_{t})$ exactly ([Proposition 1](https://arxiv.org/html/2606.11585#Thmtheorem1)); the softmax weights solve an entropy-regularized phase-retrieval problem ([Proposition 2](https://arxiv.org/html/2606.11585#Thmtheorem2)). These weights are the assignment that best matches each token’s phases to its neighbors’ while staying as spread out as an entropy penalty allows, which means that the layer first selects neighbors and then synchronizes with them. Rotary position enters as a position-dependent phase drift in the score ([Section 3](https://arxiv.org/html/2606.11585#S3)), and for this layer that drift reads as the natural frequencies of the Kuramoto model; as in a transformer, it is a positional choice and not part of the core layer. On enwiki8 the layer trains as a functional character-level language model whose validation BPC stays close to a matched transformer: level on the median at five million parameters ($1.448$ versus $1.452$ over five seeds) with the transformer ahead on the mean ($1.468$ versus $1.456$), and within $0.02$ BPC at one million ($1.637\pm 0.010$ versus $1.616\pm 0.004$). Ablations isolate the load-bearing components. The contribution is the structure and its synchronization reading; these experiments show the structure is a viable language model at a non-trivial scale and task.

#### Contributions.

1. A new attention layer. We introduce Kuramoto attention, a self-attention layer whose hidden state is phase-valued and lives on a torus. Its values are geometry-native, since the values are the raw phase states themselves and are not learned projections, and its increment is bounded so that every update stays on the torus ([Section 2](https://arxiv.org/html/2606.11585#S2)).
2. The Kuramoto identity. Because its values are the raw phases, its value update is exactly Kuramoto coupling, at every step, with the softmax attention matrix as the coupling kernel ([Proposition 1](https://arxiv.org/html/2606.11585#Thmtheorem1)).
3. Selection and synchronization are one coherence. The softmax weights solve an entropy-regularized phase-retrieval problem ([Proposition 2](https://arxiv.org/html/2606.11585#Thmtheorem2)), so the layer selects a soft set of neighbors. The score reads this coherence under a learned metric, and the value update follows the same coherence to pull each token toward its selected neighbors, so selection and synchronization are two aspects of one object ([Lemma 3](https://arxiv.org/html/2606.11585#Thmtheorem3), which holds for fixed $A$; because $A$ depends on the phases, the full layer is not a single gradient flow).
4. Rotary position as phase drift. Rotary embedding, the positional encoding we use, enters as a position-dependent phase drift in the score, and this drift reads as the natural frequencies of the Kuramoto model. As in a transformer it is an optional positional choice, separate from the core layer, and its score-only placement is the natural one ([Remark 1](https://arxiv.org/html/2606.11585#Thmremark1)).
5. A multiplicative value path. Kuramoto attention has no dense value or output projection, and it carries no coordinate’s value additively into another’s update: the values mix only multiplicatively, through the gates and the per-coordinate value gate, with additive value mixing left to the feed-forward block. The gates that set those per-coordinate scalars are themselves dense readouts of all phases; they act on the update multiplicatively. This is a marked departure from standard attention ([Section 3.5](https://arxiv.org/html/2606.11585#S3.SS5)).
6. A functional layer, with ablations. As a character-level language model on enwiki8 the layer reaches BPC close to a matched transformer (level on the median at five million, within $0.02$ BPC at one million), which shows the constrained structure is a viable language model, and ablations isolate the contribution of each component ([Sections 4](https://arxiv.org/html/2606.11585#S4) and [4.3](https://arxiv.org/html/2606.11585#S4.SS3)).

Geshkovski et al. ([2023](https://arxiv.org/html/2606.11585#bib.bib5); [2024](https://arxiv.org/html/2606.11585#bib.bib6)) analyze standard attention and prove that it clusters only asymptotically, and oscillator-attention models such as AKOrN (Miyato et al., [2025](https://arxiv.org/html/2606.11585#bib.bib7)) add Kuramoto dynamics to attention by design. We design a phase-state attention layer whose value update is *exactly* an adaptive Kuramoto step at every layer, and we show that this layer is a functional language model performing close to a standard transformer on a character-level task. The synchronization is the layer’s own value update, so it requires no added oscillator machinery.

## 2 Kuramoto Attention

We model each token’s hidden state as a vector of phases $\theta\in\mathbb{R}^{k}/(2\pi\mathbb{Z})^{k}$, i.e. a bank of $k$ unit oscillators $e^{i\theta_{j}}$. A layer maps a sequence of phase states to phase increments through three stages: a gated similarity score, a circular-mean value update, and a bounded feed-forward block. Tokens are embedded as learned phases and logits are read out by phase alignment.

#### The layer at a glance.

Writing $\psi(\theta)=(\cos\theta,\sin\theta)$ for the lift and $\rho$ for a positive gate, the layer maps the phases of token $t$, coordinate $j$, to updated phases by

$$
\begin{aligned}
g^{q}_{t} &= \rho\big(W_{q}\,\psi(\theta_{t})\big),\quad g^{k}_{t}=\rho\big(W_{k}\,\psi(\theta_{t})\big),\\
s_{t,u} &= \tfrac{\tau}{\sqrt{k}}\textstyle\sum_{j}g^{q}_{t,j}\,g^{k}_{u,j}\cos\!\big(\theta_{t,j}-\theta_{u,j}+\omega_{j}(t-u)\big),\quad A_{t,\cdot}=\mathrm{softmax}_{u\leq t}\,s_{t,\cdot},\\
G_{t,j} &= \textstyle\sum_{u\leq t}A_{t,u}\,e^{i\theta_{u,j}},\quad a_{t,j}=-\sin\theta_{t,j}\,\mathrm{Re}\,G_{t,j}+\cos\theta_{t,j}\,\mathrm{Im}\,G_{t,j},\\
\theta_{t} &\leftarrow \theta_{t}+\mathrm{bound}\big(v(\theta_{t})\odot a_{t}\big),\\
\theta_{t} &\leftarrow \theta_{t}+\mathrm{bound}\big(\mathrm{SwiGLU}(\theta_{t})\big).
\end{aligned}
\tag{1}
$$

The two residual updates are applied in turn. The subsections below define each line, and [Section 3](https://arxiv.org/html/2606.11585#S3) reads them as synchronization.

### 2.1 Gated phase similarity

Each token produces positive query and key gates from its phase features $\psi(\theta)=(\cos\theta,\sin\theta)$ (the $(\cos,\sin)$ lift used throughout),

$$
g^{q}_{t}=\rho\big(W_{q}\,\psi(\theta_{t})\big),\qquad g^{k}_{t}=\rho\big(W_{k}\,\psi(\theta_{t})\big),
$$

with $\rho=\mathrm{softplus}$ followed by mean normalization ($\bar{g}=g/\mathrm{mean}(g)$). The score combines the gates with the native torus cosine kernel and a rotary phase advance $\omega_{j}(t-u)$ with learned per-coordinate rates $\omega_{j}$:

$$
s_{t,u}=\frac{\tau}{\sqrt{k}}\sum_{j}g^{q}_{t,j}\,g^{k}_{u,j}\,\cos\!\big(\theta_{t,j}-\theta_{u,j}+\omega_{j}(t-u)\big),\qquad A=\mathrm{softmax}_{u}\!\big(s_{t,u}\big)
$$

with causal masking. Position enters only here, through the score, as the per-coordinate phase drift $\omega_{j}(t-u)$; the rest of the layer, and all of its synchronization structure, is independent of position. We use rotary position for this drift, with learned per-coordinate rates $\omega_{j}$, and setting $\omega\equiv 0$ recovers the bare torus cosine kernel ([Section 3](https://arxiv.org/html/2606.11585#S3)).
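
The score and the causal softmax above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the list-of-lists layout, and the ordering of the lift (all cosines, then all sines) are assumptions made here, and the gate matrices are taken to have shape $k\times 2k$.

```python
import math

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def gates(W, theta):
    """Positive, mean-normalized gate for one token.

    W is a k x 2k list of rows (assumed shape); the lift ordering
    (cosines first, then sines) is an assumption of this sketch.
    """
    lift = [math.cos(a) for a in theta] + [math.sin(a) for a in theta]
    raw = [softplus(sum(w * f for w, f in zip(row, lift))) for row in W]
    mean = sum(raw) / len(raw)
    return [g / mean for g in raw]

def attention_row(t, thetas, Wq, Wk, omega, tau=1.0):
    """Causal attention weights A[t, u] for u = 0..t."""
    k = len(thetas[t])
    gq = gates(Wq, thetas[t])
    scores = []
    for u in range(t + 1):  # causal mask: only u <= t is scored
        gk = gates(Wk, thetas[u])
        s = sum(
            gq[j] * gk[j]
            * math.cos(thetas[t][j] - thetas[u][j] + omega[j] * (t - u))
            for j in range(k)
        )
        scores.append(tau / math.sqrt(k) * s)
    m = max(scores)  # subtract the max for a stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

With all-zero gate matrices every gate normalizes to one, so the sketch reduces to the bare cosine kernel; setting `omega` to zeros additionally removes the positional drift.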

#### Score and value update share one coherence.

The score and the value update are both built from a single quantity, namely the pairwise phase coherence $\sum_{j}\cos(\theta_{t,j}-\theta_{u,j})$. The score reads this coherence under a learned metric, and in doing so it selects which tokens couple. The gates $g^{q}g^{k}$ supply that metric, because they are a position-dependent, per-coordinate weighting that forms a diagonal metric on $\mathbb{T}^{k}$ ([Appendix B](https://arxiv.org/html/2606.11585#A2)), and this metric is shared across layers. The value update below follows the same coherence: it pulls each token toward the attention-weighted circular mean of the tokens it selects, and so synchronizes their phases. [Section 3](https://arxiv.org/html/2606.11585#S3) develops this shared structure ([Lemma 3](https://arxiv.org/html/2606.11585#Thmtheorem3)).

### 2.2 Circular-mean value update

Values are the raw phase states, and they are not learned projections. We write $G_{t,j}=\sum_{u}A_{t,u}e^{i\theta_{u,j}}$ for the attention-weighted resultant. This resultant represents the circular mean of the attended phases: viewed as a point in the complex plane, its argument is the mean phase and its modulus measures how closely the attended phases agree. The update then moves $\theta_{t,j}$ toward this mean. The increment is the tangent component of $G_{t,j}$ at the current phase, which we obtain as follows. The unit tangent to the circle at $\theta_{t,j}$ is $(-\sin\theta_{t,j},\cos\theta_{t,j})$, and projecting $G_{t,j}=(\mathrm{Re}\,G_{t,j},\mathrm{Im}\,G_{t,j})$ onto this tangent gives

$$
a_{t,j}=-\sin\theta_{t,j}\,\mathrm{Re}\,G_{t,j}+\cos\theta_{t,j}\,\mathrm{Im}\,G_{t,j}.
$$

Equivalently, writing the phase as the unit phasor $z_{t,j}=e^{i\theta_{t,j}}$ so that $G_{t,j}=\sum_{u}A_{t,u}z_{u,j}$, this projection is the imaginary part of a single complex product, $a_{t,j}=\mathrm{Im}\big(\overline{z_{t,j}}\,G_{t,j}\big)$, from which $|a_{t,j}|\leq|G_{t,j}|\leq 1$ follows at once. This increment keeps the motion on the torus, because it advances the angle along the circle, and [Proposition 1](https://arxiv.org/html/2606.11585#Thmtheorem1) shows that this increment equals the Kuramoto coupling term.

A per-coordinate value gate $v=v(\theta_{t})$ scales $a$. The gate is a signed, linear readout of the phase features, so a positive entry pulls a coordinate toward the attended circular mean and a negative entry pushes it away; it is a learned, content-dependent coupling strength, positive for synchronization and negative for repulsion. A $\tanh$ bound then keeps the phase increment within a learned radius. The bound is norm-matched, which means that small increments are scaled by the learned radius and otherwise left in proportion, while large increments are compressed toward this radius, so the saturating nonlinearity caps the step size without distorting the relative sizes of typical updates. The bounded increment is added to the state, $\theta\leftarrow\theta+\mathrm{bound}(v\odot a)$.
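
The two forms of the increment, the phasor product $\mathrm{Im}(\overline{z_{t,j}}\,G_{t,j})$ and the Kuramoto sum, can be compared numerically. The sketch below uses only the Python standard library; the function names and the example data are hypothetical, chosen here for illustration, and the attention weights are simply given as a fixed list that sums to one.

```python
import cmath
import math

def kuramoto_direction(theta_t, attended, weights):
    """Tangent component a[j] = Im(conj(z_t[j]) * G[j]) per coordinate."""
    out = []
    for j, th in enumerate(theta_t):
        # Attention-weighted resultant of the attended unit phasors.
        G = sum(w * cmath.exp(1j * phases[j])
                for w, phases in zip(weights, attended))
        # exp(-i * th) is the conjugate of the unit phasor z_t[j].
        out.append((cmath.exp(-1j * th) * G).imag)
    return out

def kuramoto_sum(theta_t, attended, weights):
    """Direct Kuramoto coupling: sum_u A[u] * sin(theta_u[j] - theta_t[j])."""
    return [
        sum(w * math.sin(phases[j] - th)
            for w, phases in zip(weights, attended))
        for j, th in enumerate(theta_t)
    ]

# Hypothetical example: one query token attending over three tokens.
theta_t = [0.3, -1.1]
attended = [[0.0, 2.0], [1.5, -0.4], [0.3, -1.1]]
weights = [0.2, 0.3, 0.5]
a_phasor = kuramoto_direction(theta_t, attended, weights)
a_direct = kuramoto_sum(theta_t, attended, weights)
```

The two lists agree up to floating-point error, and every entry has magnitude at most one because the weights are a convex combination of unit phasors.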

### 2.3 Feed-forward block

A SwiGLU feed-forward block produces a second, bounded increment that relocates the state between layers. The block acts in local phase coordinates, since the angles $\theta$ are its input features and its output is a per-coordinate angular increment, $\theta\leftarrow\theta+\mathrm{bound}\big(\mathrm{SwiGLU}(\theta)\big)$. Layers are residual in the phase state. This block is the standard transformer feed-forward network in phase coordinates, and it is the one component not built from the geometry. It is non-geometric in a precise sense: it reads the raw angle $\theta$ as its input, where every other component reads the $(\cos,\sin)$ lift, so its output depends on the representative chosen for a phase ($\theta$ versus $\theta+2\pi$) and it is not a function on the torus. The feed-forward block alone breaks the periodicity that the rest of the layer respects, and a variant that acts on the lift is the natural geometry-native replacement. [Section 3.5](https://arxiv.org/html/2606.11585#S3.SS5) returns to it as the only block that mixes the embedding values additively.
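
The representative dependence described above is easy to see on a toy example. The sketch below is an assumption-laden illustration, not the paper's block: it uses a single-hidden-unit SwiGLU with hand-picked weights, and `ffn_lift` is a hypothetical stand-in for the lift-based variant the text mentions.

```python
import math

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def swiglu_scalar(features, w_gate, w_up, w_down):
    """One-hidden-unit SwiGLU: w_down * silu(w_gate . x) * (w_up . x)."""
    gate = sum(w * f for w, f in zip(w_gate, features))
    up = sum(w * f for w, f in zip(w_up, features))
    return w_down * silu(gate) * up

def ffn_raw(theta):
    # Reads the raw angle, so theta and theta + 2*pi give different outputs.
    return swiglu_scalar([theta], [0.5], [1.0], 0.1)

def ffn_lift(theta):
    # Reads the (cos, sin) lift, so the output is 2*pi-periodic in theta.
    lift = [math.cos(theta), math.sin(theta)]
    return swiglu_scalar(lift, [0.5, -0.3], [1.0, 0.7], 0.1)

theta = 0.4
shifted = theta + 2.0 * math.pi  # same point on the circle
raw_gap = abs(ffn_raw(theta) - ffn_raw(shifted))
lift_gap = abs(ffn_lift(theta) - ffn_lift(shifted))
```

Here `raw_gap` is clearly nonzero while `lift_gap` vanishes up to rounding, which is the sense in which only the lift-based map is a function on the torus.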

#### No normalization layer.

A transformer interleaves attention with a normalization layer that rescales the residual stream (Ba et al., [2016](https://arxiv.org/html/2606.11585#bib.bib12); Zhang and Sennrich, [2019](https://arxiv.org/html/2606.11585#bib.bib13)); RMSNorm projects each token onto a sphere of fixed radius, a constraint the normalized transformer makes explicit (Loshchilov et al., [2024](https://arxiv.org/html/2606.11585#bib.bib14)). Kuramoto attention places the hidden state on $\mathbb{T}^{k}$ by construction, because every coordinate keeps unit modulus at every layer, and each update adds a bounded increment to the angles. On the flat torus the exponential map at $\theta$ is the addition $\theta\mapsto\theta+X$ of a tangent vector $X$, so adding the increment is exactly this exponential map, a retraction onto the torus for whatever tangent vector the update forms, here the gated and bounded Kuramoto step $\mathrm{bound}(v\odot a)$. The torus is the per-coordinate analogue of that sphere, and it is built into the state itself, so the network uses no normalization layer. The increment bound supplies the scale control that a normalization gain would otherwise provide.

### 2.4 Phase-alignment readout

Logits score each vocabulary item by phase alignment against a learned prototype phase $\phi_{v}$, $\ell_{v}\propto\sum_{j}\cos(\theta_{j}-\phi_{v,j})$.
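
A minimal sketch of this readout follows. The proportionality constant is not specified in the text, so the `scale` argument is an assumption of this sketch, as are the function name and the toy prototypes.

```python
import math

def readout_logits(theta, prototypes, scale=1.0):
    """Logit for each vocabulary item: scale * sum_j cos(theta_j - phi_vj).

    `scale` stands in for the unspecified proportionality constant.
    """
    return [
        scale * sum(math.cos(t - p) for t, p in zip(theta, phi))
        for phi in prototypes
    ]

# Hypothetical two-item vocabulary: one prototype aligned with the state,
# one in antiphase with it on every coordinate.
theta = [0.3, 1.2]
prototypes = [[0.3, 1.2], [0.3 + math.pi, 1.2 + math.pi]]
logits = readout_logits(theta, prototypes)
```

A state that matches a prototype on every coordinate reaches the maximum logit of `scale * k`, and an antiphase prototype reaches the minimum.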

#### Design summary.

The model has four components. The gated similarity produces the coupling weights, the phase-valued circular-mean update revises the phases, the increment bounding supplies stability, and the feed-forward block performs the between-layer mixing. [Section 3](https://arxiv.org/html/2606.11585#S3) reads these components as synchronization, and [Section 4.3](https://arxiv.org/html/2606.11585#S4.SS3) quantifies every component.

## 3 The Synchronization View

The layer has a simple reading. The softmax selects which past tokens a token attends to, and the value update then synchronizes the token’s phases with the selected ones. We make three pieces of this reading precise. We show that the value update is exactly Kuramoto coupling, that the softmax is entropy-regularized retrieval, and that rotary position acts as a phase drift. Throughout, we write $G_{t,j}=\sum_{u}A_{t,u}e^{i\theta_{u,j}}$ for the attention-weighted resultant. This resultant is the weighted sum of the attended unit vectors $e^{i\theta_{u,j}}$, read as a planar vector. Its argument is the circular mean of those phases, and its length $\lVert G_{t,j}\rVert\in[0,1]$ measures how closely the phases agree, since the length is near $1$ when the attended phases align and near $0$ when they cancel.

### 3.1 The value update is Kuramoto coupling

###### Proposition 1 (Kuramoto coupling).

The circular-mean value update of [Section 2](https://arxiv.org/html/2606.11585#S2) satisfies, for every coordinate $j$,

$$
a_{t,j}=-\sin\theta_{t,j}\,\mathrm{Re}\,G_{t,j}+\cos\theta_{t,j}\,\mathrm{Im}\,G_{t,j}=\sum_{u}A_{t,u}\,\sin\!\big(\theta_{u,j}-\theta_{t,j}\big).
$$

That is, the update is the Kuramoto coupling term with coupling matrix $A$.

###### Proof.

Expand $\sin(\theta_{u,j}-\theta_{t,j})=\sin\theta_{u,j}\cos\theta_{t,j}-\cos\theta_{u,j}\sin\theta_{t,j}$ and sum against $A_{t,u}$. ∎
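
Written out in full, the expansion in the proof reads as follows. The only extra fact used is that the weights $A_{t,u}$ are real, so $\mathrm{Re}\,G_{t,j}=\sum_{u}A_{t,u}\cos\theta_{u,j}$ and $\mathrm{Im}\,G_{t,j}=\sum_{u}A_{t,u}\sin\theta_{u,j}$:

$$
\begin{aligned}
\sum_{u}A_{t,u}\sin\!\big(\theta_{u,j}-\theta_{t,j}\big)
&=\cos\theta_{t,j}\sum_{u}A_{t,u}\sin\theta_{u,j}-\sin\theta_{t,j}\sum_{u}A_{t,u}\cos\theta_{u,j}\\
&=\cos\theta_{t,j}\,\mathrm{Im}\,G_{t,j}-\sin\theta_{t,j}\,\mathrm{Re}\,G_{t,j}\\
&=a_{t,j}.
\end{aligned}
$$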

Unlike standard Kuramoto, the coupling $A$ is computed from the states by a softmax of the cosine-similarity score, and is causal. Each token is pulled toward the attention-weighted circular mean of the tokens it attends to, with pull strength set by the resultant length $\lVert G_{t,j}\rVert$.

The direction $a_{t}$ is exactly the Kuramoto term. The implemented layer scales this direction by a learned per-coordinate value gate and a norm bound before adding it to the state ([Section 2](https://arxiv.org/html/2606.11585#S2)). The attention matrix $A$ sets which tokens pull a phase and toward where. The value gate is signed, so it sets how strongly, and in which direction, each coordinate responds: a positive gate gives Kuramoto attraction toward the attended mean, and a negative gate gives repulsion, so a single layer can both synchronize and desynchronize a coordinate against the tokens it selects. The norm bound then rescales the whole increment by one positive scalar.
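
The gated, bounded step can be sketched as below. The exact form of the norm-matched bound is not given in this section, so the formula used here (rescale the increment by one positive scalar so that its norm becomes `radius * tanh(norm)`) is an assumption chosen to match the verbal description: small increments are scaled by the radius, large ones saturate at it. The function names are likewise hypothetical.

```python
import math

def norm_matched_bound(x, radius):
    """ASSUMED form of the bound: one positive scalar rescales all of x
    so that the output norm equals radius * tanh(||x||)."""
    n = math.sqrt(sum(v * v for v in x))
    if n == 0.0:
        return [0.0 for _ in x]
    scale = radius * math.tanh(n) / n
    return [scale * v for v in x]

def gated_step(theta, a, v, radius):
    """theta + bound(v * a), with a signed per-coordinate gate v."""
    increment = norm_matched_bound(
        [vj * aj for vj, aj in zip(v, a)], radius
    )
    # The angles are left unwrapped, since the feed-forward block in the
    # paper reads the raw representative.
    return [th + d for th, d in zip(theta, increment)]

# A positive gate moves the phase along a; a negative gate moves it away.
attract = gated_step([1.0], [0.5], [1.0], radius=0.5)
repel = gated_step([1.0], [0.5], [-1.0], radius=0.5)
```

Because the bound multiplies the whole increment by a single positive number, it never flips the sign the value gate has chosen for a coordinate.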

### 3.2 Softmax selects neighbors

###### Proposition 2 (Entropy-regularized retrieval).

Write the score as $s_{t,u}=\tau\,\bar{s}_{t,u}$, where $\bar{s}_{t,u}$ is the $\tau$-free cosine affinity and $\tau>0$ is the learned score scale. With retrieval cost $c_{t}(u)=-\bar{s}_{t,u}$ and entropy temperature $\lambda=1/\tau$, the attention weights are the unique minimizer

$$
A_{t,\cdot}=\arg\min_{\pi\in\Delta}\Big\{\textstyle\sum_{u}\pi(u)\,c_{t}(u)+\lambda\,\mathrm{KL}(\pi\,\|\,\mathrm{unif})\Big\}=\mathrm{softmax}_{u}\!\big(\bar{s}_{t,u}/\lambda\big)=\mathrm{softmax}_{u}\!\big(s_{t,u}\big).
$$

###### Proof.

The Gibbs variational identity (Boyd and Vandenberghe, [2004](https://arxiv.org/html/2606.11585#bib.bib16)): a Lagrange multiplier on the simplex gives $\pi^{\ast}\propto\exp(-c_{t}/\lambda)$, and the uniform reference contributes only a constant. ∎

The learned scale $\tau=1/\lambda$ is the inverse temperature of the retrieval: large $\tau$ sharpens attention toward the closest keys, while $\tau\to 0$ makes it uniform. So the softmax chooses a soft set of neighbors, and the update of [Proposition 1](https://arxiv.org/html/2606.11585#Thmtheorem1) then synchronizes the token’s phases with those neighbors. In this way the softmax first selects, and the update then synchronizes.
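
Proposition 2 can be checked numerically on a small example. The sketch below, with hypothetical affinities and function names chosen for illustration, evaluates the entropy-regularized objective at the softmax weights and at a few other distributions, and compares the minimum with the closed-form value $-\lambda\log\big(\tfrac{1}{n}\sum_{u}e^{\bar{s}_{t,u}/\lambda}\big)$ that the Gibbs variational identity gives for a uniform reference.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def objective(pi, cost, lam):
    """Expected cost plus lam * KL(pi || uniform)."""
    n = len(pi)
    kl = sum(p * math.log(p * n) for p in pi if p > 0.0)
    return sum(p * c for p, c in zip(pi, cost)) + lam * kl

# Hypothetical tau-free affinities for one query token.
s_bar = [0.9, -0.2, 0.4]
tau = 2.0
lam = 1.0 / tau
cost = [-s for s in s_bar]

A = softmax([tau * s for s in s_bar])
best = objective(A, cost, lam)
closed_form = -lam * math.log(
    sum(math.exp(s / lam) for s in s_bar) / len(s_bar)
)
uniform_value = objective([1 / 3, 1 / 3, 1 / 3], cost, lam)
peaked_value = objective([0.98, 0.01, 0.01], cost, lam)
```

Both the uniform and the sharply peaked alternatives give a strictly larger objective than the softmax weights, as the uniqueness claim requires.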

### 3.3 A single geometry that plays two roles

The two propositions describe one cosine coherence acting twice. The score is that coherence under the learned metric $\mathcal{G}(\theta)=\mathrm{diag}(g^{q}g^{k})$ of [Appendix B](https://arxiv.org/html/2606.11585#A2). By [Proposition 2](https://arxiv.org/html/2606.11585#Thmtheorem2), the attention matrix $A$ is entropy-regularized retrieval under this metric $\mathcal{G}$, so the metric sets which tokens couple. The update, in turn, is the gradient of the same coherence taken without the gate:

###### Lemma 3 (Synchronization is coherence descent).

For fixed $A$, the Kuramoto update of [Proposition 1](https://arxiv.org/html/2606.11585#Thmtheorem1) is the negative gradient of the attention-weighted coherence energy $E_{t}(\theta_{t})=-\sum_{u}A_{t,u}\sum_{j}\cos(\theta_{t,j}-\theta_{u,j})$:

$$
a_{t,j}=-\,\partial E_{t}/\partial\theta_{t,j}=\sum_{u}A_{t,u}\sin(\theta_{u,j}-\theta_{t,j}).
$$

###### Proof.

$\partial_{\theta_{t,j}}\cos(\theta_{t,j}-\theta_{u,j})=-\sin(\theta_{t,j}-\theta_{u,j})$; sum against $A_{t,u}$ and negate. ∎
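
Lemma 3 can be sanity-checked with a central finite difference, holding the attention weights fixed as the lemma requires. The names and the example data below are hypothetical and serve only as illustration.

```python
import math

def energy(theta_t, attended, weights):
    """E_t = -sum_u A[u] * sum_j cos(theta_t[j] - theta_u[j])."""
    return -sum(
        w * sum(math.cos(a - b) for a, b in zip(theta_t, phases))
        for w, phases in zip(weights, attended)
    )

def coupling(theta_t, attended, weights):
    """Kuramoto term: sum_u A[u] * sin(theta_u[j] - theta_t[j])."""
    return [
        sum(w * math.sin(phases[j] - th)
            for w, phases in zip(weights, attended))
        for j, th in enumerate(theta_t)
    ]

def numeric_neg_grad(theta_t, attended, weights, h=1e-6):
    """Central-difference estimate of -dE/dtheta_t, one coordinate at a time."""
    out = []
    for j in range(len(theta_t)):
        up = list(theta_t)
        dn = list(theta_t)
        up[j] += h
        dn[j] -= h
        diff = energy(up, attended, weights) - energy(dn, attended, weights)
        out.append(-diff / (2.0 * h))
    return out

theta_t = [0.3, -1.1]
attended = [[0.0, 2.0], [1.5, -0.4], [0.3, -1.1]]
weights = [0.2, 0.3, 0.5]  # fixed A, as in the lemma
a = coupling(theta_t, attended, weights)
a_numeric = numeric_neg_grad(theta_t, attended, weights)

# A small step along a lowers the energy when A is held fixed.
stepped = [th + 0.1 * aj for th, aj in zip(theta_t, a)]
```

The check says nothing about the full layer, where the weights themselves move with the phases.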

So a single coherence plays both roles. When it is metric-gated in the score, this coherence selects neighbors, and it does so through the coupling kernel of [Proposition 2](https://arxiv.org/html/2606.11585#Thmtheorem2). When it appears ungated in $E_{t}$, the negative gradient of this coherence is the synchronization step, which is the coupling term of [Proposition 1](https://arxiv.org/html/2606.11585#Thmtheorem1). The metric therefore sets which tokens are selected, while the descent runs on the flat coherence, so that the model selects under the learned metric and then synchronizes. This gradient reading holds for the update at a single step, with $A$ held fixed; because $A$ is itself computed from the phases, the full layer is not the gradient flow of one fixed energy, and we make no such claim. The descent identity is exact, as shown in [Proposition 1](https://arxiv.org/html/2606.11585#Thmtheorem1), and the metric is the equal-position geometry of the score, as shown in [Appendix B](https://arxiv.org/html/2606.11585#A2). This gives the geometric reading of Kuramoto on the torus. The same two ingredients, namely an invariant similarity score and an on-manifold mean, can be defined on any compact group, an outlook that we develop in [Section 6](https://arxiv.org/html/2606.11585#S6).

### 3.4 Rotary position as phase drift

Position is not part of the base layer of [Section 2](https://arxiv.org/html/2606.11585#S2); we add it the way a transformer does, through a positional encoding. We use rotary position, which enters the score as a per-coordinate phase drift, $s_{t,u}=\tfrac{\tau}{\sqrt{k}}\sum_{j}g^{q}_{t,j}g^{k}_{u,j}\cos(\theta_{t,j}-\theta_{u,j}+\omega_{j}(t-u))$, and only there. For this layer the drift has a natural reading: the rates $\omega_{j}$ are learned per coordinate (initialized at the geometric rotary schedule) and act as natural frequencies in the Kuramoto model, since each oscillator advances at its own rate and the score compares tokens after that advance. This reading is a bonus of the geometry, and the synchronization results above hold with or without the drift. Both parts of the score shape selection: the gates are the learned metric ([Appendix B](https://arxiv.org/html/2606.11585#A2)) and the drift $\omega_{j}(t-u)$ is the natural-frequency advance applied before tokens are compared. Metric and frequencies set which tokens couple; the value update then synchronizes with them ([Lemma 3](https://arxiv.org/html/2606.11585#Thmtheorem3)).
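
The "advance, then compare" reading follows from regrouping the argument of the cosine. This short derivation uses only the score as defined above and the identity $\cos x=\mathrm{Re}\,e^{ix}$:

$$
\begin{aligned}
\cos\!\big(\theta_{t,j}-\theta_{u,j}+\omega_{j}(t-u)\big)
&=\cos\!\big((\theta_{t,j}+\omega_{j}t)-(\theta_{u,j}+\omega_{j}u)\big)\\
&=\mathrm{Re}\Big(e^{i(\theta_{t,j}+\omega_{j}t)}\,\overline{e^{i(\theta_{u,j}+\omega_{j}u)}}\Big).
\end{aligned}
$$

Each token's phase is therefore advanced by $\omega_{j}$ times its own position before the two phasors are compared, and the result depends on position only through the offset $t-u$.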

### 3.5 A multiplicative value path

A standard attention layer mixes the embedding coordinates additively. Its value and output projections $W_{V}$ and $W_{O}$ are dense linear maps, so each coordinate of the output is a weighted sum of all coordinates of the input. This additive mixing across coordinates is a core part of what a standard attention layer computes.

Kuramoto attention has no such dense projection, but its value path is not empty. The value that the layer aggregates is the raw phase state $e^{i\theta_{u}}$ itself, with no $W_{V}$ applied. From the aggregate $G_{t}$ the layer forms the Kuramoto direction by the tangent retraction (Absil et al., [2008](https://arxiv.org/html/2606.11585#bib.bib15)), which also stands in for the output projection $W_{O}$ and ambient residual of standard attention, and it then scales that direction with a learned, content-dependent per-coordinate value gate $v(\theta_{t})$ before bounding it (lines 3–4 of [Equation 1](https://arxiv.org/html/2606.11585#S2.E1)). So the layer does transform its values. What makes this transform different from a usual value map is that every step on the value path is multiplicative. The value gate rescales each coordinate’s update on its own, with a learned sign, and the norm-match bound applies a single positive scalar to the whole increment. Neither step forms a linear combination across coordinates of the values.

Additive cross-coordinate computation does enter the layer, in one place: the gates. Each gate entry is a dense linear readout of all phases, $g^{q}_{t,j}=\rho\big((W_{q}\psi(\theta_{t}))_{j}\big)$ with $W_{q}$ dense, so coordinate $j$’s gate depends on every phase, and the value gate is the same kind of dense readout. Those readouts set the per-coordinate scalars, the metric of the score and the value-gate strengths, which then enter the update multiplicatively through the single shared attention weight $A_{t,u}$. What the layer never does is carry one coordinate’s value additively into another’s update; that additive mixing of the values is left to the feed-forward block, a dense linear map. The ablations make this split visible. Making the value path additive across coordinates, by replacing the value gate with a dense linear value projection, costs $+0.25$ BPC ([Remark 2](https://arxiv.org/html/2606.11585#Thmremark2)), and removing the feed-forward block, the only component that mixes coordinates additively, costs $+0.27$ BPC. The two ablations show that the multiplicative value path and the additive feed-forward block each carry distinct load.

## 4 Experiments

### 4.1 Setup

We train at $\approx 10^{6}$ and $\approx 5\times 10^{6}$ parameters on enwiki8 (Hutter, [2006](https://arxiv.org/html/2606.11585#bib.bib9)), sequence length $256$, $4$ layers, $50$ epochs, with total parameters matched to the transformer baseline at each scale. Optimizer, learning rate, architecture, exact parameter counts, and the matching protocol are in [Appendix A](https://arxiv.org/html/2606.11585#A1).

### 4.2 Language modeling

We test the layer as a character-level language model on enwiki8. It trains stably and reaches validation and test BPC close to a matched RoPE+SwiGLU transformer: within $0.02$ BPC at one million parameters, and at five million it is level on the median over five seeds while the transformer leads on the mean, with held-out test BPC tracking validation at both scales ([Table 1](https://arxiv.org/html/2606.11585#S4.T1)). These experiments show that the constrained geometric structure is a viable language model at this scale, with the matched transformer as the reference point. One Kuramoto-attention seed lagged from the first epoch onward and converged to val BPC $1.55$, against the others' ${\sim}1.45$; an actively monitored run would have restarted it.

Table 1: Validation and test BPC over five seeds, as median and mean $\pm$ std. At 1M the transformer leads on both splits by about $0.02$. At 5M the two are level on the median (Kuramoto attention ahead by $0.004$ on validation and $0.002$ on test) while the transformer leads on the mean; the higher Kuramoto-attention mean and variance come from one seed that lagged from the first epoch and converged to val BPC $1.55$ against the others' ${\sim}1.45$.
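
Why a single lagging seed moves the mean but not the median can be seen in a short sketch. The per-seed values below are hypothetical, chosen only to mimic the pattern described (four seeds near $1.45$ and one near $1.55$); they are not the paper's per-seed results.

```python
import numpy as np

# Hypothetical per-seed validation BPC values (NOT the reported runs):
# four seeds near 1.45 and one lagging seed near 1.55.
seeds = np.array([1.44, 1.45, 1.45, 1.46, 1.55])

median = np.median(seeds)
mean = seeds.mean()
std = seeds.std(ddof=1)  # sample standard deviation

print(f"median={median:.3f} mean={mean:.3f} std={std:.3f}")

# Dropping the lagging seed pulls the mean back to the median.
without_outlier = seeds[:-1]
print(f"mean without lagging seed={without_outlier.mean():.3f}")
```

In this toy setting the median is unaffected by the outlier while the mean and standard deviation both rise, which is the reason for reporting both statistics.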
### 4.3 Ablation suite

The metric reading of the score organizes the ablations. The construction has two *geometry-native* ingredients. The first is a learned metric on the torus, which is the score, and the second is an on-manifold mean, which is the value update. These two ingredients, together with the positional drift in the score, which is the geometry's form of a positional encoding, are the parts that would carry over to a different compact group. The construction also has one *non-geometric* component, namely the feed-forward block. Finally, it admits a family of optional *geometric corrections* that the metric naturally suggests. We ablate one axis at a time from the reference configuration ([Table 2](https://arxiv.org/html/2606.11585#S4.T2)).

Table 2: Single-axis ablations, mean $\pm$ std over five seeds, $\Delta$ from the reference (five-seed mean $1.637$); total parameters are re-matched per row.

#### Geometry-native core (required).

These axes are the synchronization machinery of [Section 3](https://arxiv.org/html/2606.11585#S3). The on-manifold value update is the Kuramoto coupling step ([Proposition 1](https://arxiv.org/html/2606.11585#Thmtheorem1); circular-mean vs. a learned value projection, value gate and bound). The metric is the coupling kernel ([Proposition 2](https://arxiv.org/html/2606.11585#Thmtheorem2); gate activation, mean-normalization, shared vs. per-layer). The natural frequencies are the rotary drift ([Remark 1](https://arxiv.org/html/2606.11585#Thmremark1); score-only vs. gate, value, or no rotary); as in a transformer the drift is an optional positional choice, and like any positional encoding it carries performance on sequence modeling. These three axes are the portable ingredients, and they carry the performance. Replacing the circular mean with a learned value projection costs $+0.25$ BPC, which is the largest geometry-native effect. Removing the metric gates costs $+0.09$, and untying them across layers costs a further $+0.05$, so the single shared metric is itself load-bearing.

#### Feed-forward block (the non-geometric residual).

This is the one piece that is neither the coupling kernel nor the synchronization step. Because it lies outside the geometry, it is implemented as the same SwiGLU that the matched transformer uses, and it is the layer's only block that mixes the embedding values additively across coordinates ([Section 3.5](https://arxiv.org/html/2606.11585#S3.SS5)). Removing it costs $+0.27$ BPC, which is the largest effect in the suite, and sharing it across layers in place of per-layer learning costs $+0.04$. Replacing this generic map with one built from the group is the remaining step toward a fully geometry-native layer.

#### Minor components.

Gate normalization, the value bound, and a phase-alignment readout each move val BPC by at most $0.02$ relative to their simpler alternatives.

#### Geometric corrections (optional).

Once the score is read as a metric ([Appendix B](https://arxiv.org/html/2606.11585#A2)), a frame-field connection and richer metric parameterizations become the natural geometric refinements to test next.

### 4.4 Phase dynamics

Two diagnostics make the synchronization visible in the reference run ([Figures 1](https://arxiv.org/html/2606.11585#S4.F1) and [2](https://arxiv.org/html/2606.11585#S4.F2)). The per-coordinate attention-weighted order parameter $R_{t,j}=|G_{t,j}|=\lvert\sum_{u}A_{t,u}e^{i\theta_{u,j}}\rvert$, averaged over coordinates, stays near $1$ at every layer, which means that each token synchronizes tightly with the neighbors it selects. Yet the *global* order parameter $R^{\mathrm{glob}}_{j}=\lvert\frac{1}{N}\sum_{u}e^{i\theta_{u,j}}\rvert$ falls with depth; it is the same coherence with the attention weights $A_{t,u}$ replaced by a flat average over all $N$ tokens. Tokens lock tightly to the few neighbors that each one selects, while the population of phases as a whole spreads out. The model therefore synchronizes locally without collapsing globally, which is the select-then-synchronize behavior of [Section 3](https://arxiv.org/html/2606.11585#S3).

Separately, the learned natural frequencies $\omega_{j}$ depart from their geometric initialization into a depth-to-timescale trend: early layers carry higher frequencies, later layers lower frequencies. A larger $\omega_{j}$ advances the phase faster per position step, so the cosine kernel $\cos(\theta_{t,j}-\theta_{u,j}+\omega_{j}(t-u))$ de-correlates over a shorter token distance. A high frequency therefore corresponds to short-range coupling, while a low frequency corresponds to long-range coupling, and so the depth trend reads as early layers mixing locally and later layers reaching far.
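
The two order parameters can be computed directly from their definitions. The sketch below uses hypothetical toy data (two tight phase clusters half a turn apart, with each token attending uniformly within its own cluster, and no causal mask) purely to show how local coherence can stay high while global coherence is low; it is not the reference run.

```python
import numpy as np

def local_order(A, theta):
    # A: (N, N) row-stochastic attention weights; theta: (N, k) phases.
    # R[t, j] = | sum_u A[t, u] * exp(i * theta[u, j]) |
    G = A @ np.exp(1j * theta)
    return np.abs(G)

def global_order(theta):
    # R_glob[j] = | (1/N) * sum_u exp(i * theta[u, j]) |
    return np.abs(np.exp(1j * theta).mean(axis=0))

rng = np.random.default_rng(0)
N, k = 64, 16

# Toy phases: even tokens cluster near 0, odd tokens near pi.
centers = np.where(np.arange(N) % 2 == 0, 0.0, np.pi)
theta = centers[:, None] + 0.05 * rng.normal(size=(N, k))

# Toy attention: each token attends uniformly to its own cluster.
same = (np.arange(N)[:, None] % 2) == (np.arange(N)[None, :] % 2)
A = same / same.sum(axis=1, keepdims=True)

R_local = local_order(A, theta).mean()   # averaged over tokens and coordinates
R_global = global_order(theta).mean()    # averaged over coordinates
print(R_local, R_global)
```

Here the local order parameter is close to one because each token only averages over phases near its own, while the flat average over both clusters nearly cancels.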

![Refer to caption](https://arxiv.org/html/2606.11585v1/x1.png)

Figure 1: Local synchronization without global collapse (reference 1M run, validation split). (a) The local order parameter $R_{t}=\langle\lvert\sum_{u}A_{t,u}e^{i\theta_{u,j}}\rvert\rangle_{j}$, the coherence of each token with the neighbors it attends to, stays near one at every layer and sequence position. (b) The global order parameter $R^{\mathrm{glob}}=\langle\lvert\frac{1}{N}\sum_{u}e^{i\theta_{u,j}}\rvert\rangle_{j}$, the same coherence with the attention weights replaced by a flat average over all $N$ tokens, falls with depth while the local one stays high: each token locks to its selected neighbors while the population of phases spreads out.

![Refer to caption](https://arxiv.org/html/2606.11585v1/x2.png)

Figure 2: Learned natural frequencies follow a depth-to-timescale trend (reference 1M run). (a) The per-layer spectrum of learned rates $\lvert\omega_{j}\rvert$ departs from the geometric rotary initialization (dashed). (b) The mean rate falls with depth; a high frequency de-correlates the score over a short token distance, so early layers couple short-range and later layers long-range.

## 5 Related Work

#### Transformers as synchronizing dynamics.

Geshkovski et al. ([2023](https://arxiv.org/html/2606.11585#bib.bib5); [2024](https://arxiv.org/html/2606.11585#bib.bib6)) analyze standard self-attention as an interacting particle system on the sphere and prove asymptotic clustering. That line of work *analyzes* the limiting behavior of vanilla attention. By contrast, we design a trained, functional language-model layer that carries an *exact, per-step* Kuramoto identity. When the values are taken to be the identity, the aggregate becomes a circular mean, so its tangent projection is exactly the Kuramoto step. A general value map, on the other hand, averages transformed features and therefore breaks the identity ([Remark 2](https://arxiv.org/html/2606.11585#Thmremark2)).
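
The per-step identity can be checked numerically. The sketch below compares two routes to the same quantity: projecting the attention-weighted aggregate $\sum_u A_{t,u}e^{i\theta_u}$ onto the tangent direction at $e^{i\theta_t}$, and evaluating the coupling term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$ directly. The scores are random stand-ins for the gated score, and the causal mask is an assumption based on the layer attending over previous states.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 6, 4
theta = rng.uniform(-np.pi, np.pi, size=(N, k))

# Causal softmax weights from arbitrary scores (stand-in for the gated score).
scores = rng.normal(size=(N, N))
causal = np.tril(np.ones((N, N), dtype=bool))
scores = np.where(causal, scores, -np.inf)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# Route 1: weighted mean on the unit circle, projected onto the tangent
# direction i*exp(i*theta_t) at each token's own phase.
z = np.exp(1j * theta)
G = A @ z
tangent = np.imag(G * np.conj(z))

# Route 2: the Kuramoto coupling term sum_u A[t, u] * sin(theta_u - theta_t).
diff = theta[None, :, :] - theta[:, None, :]   # diff[t, u, j] = theta_u - theta_t
kuramoto = np.einsum("tu,tuj->tj", A, np.sin(diff))

print(np.max(np.abs(tangent - kuramoto)))
```

The two routes agree to floating-point precision because $\mathrm{Im}\big(e^{i\theta_u}e^{-i\theta_t}\big)=\sin(\theta_u-\theta_t)$; inserting any non-identity value map before the average would break this equality.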

#### Oscillator neurons.

Miyato et al. ([2025](https://arxiv.org/html/2606.11585#bib.bib7)) build Kuramoto oscillatory neurons bottom-up and relate them to attention, primarily for vision, binding, and reasoning, and Muzellec et al. ([2025](https://arxiv.org/html/2606.11585#bib.bib18)) add complex-valued representations and Kuramoto synchronization dynamics to deep networks, with gains on object binding. Both *add* oscillator dynamics to a network. Our layer is instead a phase-state causal attention layer whose value update already *is* adaptive Kuramoto coupling, with its natural frequencies supplied by rotary position ([Section 3](https://arxiv.org/html/2606.11585#S3)) and its coupling kernel an entropy-regularized soft assignment ([Proposition 2](https://arxiv.org/html/2606.11585#Thmtheorem2)), evaluated on language modeling.

#### Kuramoto dynamics.

The classical model (Kuramoto, [1975](https://arxiv.org/html/2606.11585#bib.bib1); Acebrón et al., [2005](https://arxiv.org/html/2606.11585#bib.bib2)) couples phase oscillators with heterogeneous natural frequencies. Our layer is the same coupling made content-adaptive, in the sense that the coupling matrix is a softmax of token similarities and the frequencies are learned.
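
For reference, the classical all-to-all model $\dot\theta_i=\omega_i+\frac{K}{N}\sum_j\sin(\theta_j-\theta_i)$ can be simulated in a few lines. This is a minimal forward-Euler sketch with a uniform coupling constant; the population size, coupling strength, frequency spread, and step size are illustrative choices, not settings from the paper.

```python
import numpy as np

def kuramoto_step(theta, omega, K, dt):
    # dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)
    diff = theta[None, :] - theta[:, None]   # diff[i, j] = theta_j - theta_i
    return theta + dt * (omega + K * np.sin(diff).mean(axis=1))

def order_parameter(theta):
    # Global coherence r = | (1/N) * sum_j exp(i * theta_j) |
    return np.abs(np.exp(1j * theta).mean())

rng = np.random.default_rng(0)
N = 100
theta = rng.uniform(-np.pi, np.pi, size=N)
omega = 0.1 * rng.normal(size=N)   # heterogeneous natural frequencies
r_init = order_parameter(theta)

for _ in range(2000):              # forward Euler, total time 20
    theta = kuramoto_step(theta, omega, K=2.0, dt=0.01)
r_final = order_parameter(theta)
print(r_init, r_final)
```

In the layer described here, the uniform weight $K/N$ is replaced by the softmax attention weight $A_{t,u}$, so the coupling depends on content rather than being constant across pairs.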

#### Attention as associative memory.

Ramsauer et al. ([2021](https://arxiv.org/html/2606.11585#bib.bib8)) connect attention to modern Hopfield retrieval, reading a step as energy descent toward stored patterns. We give the complementary reading, in which the same step is understood as synchronization dynamics on phase states. In this reading, the same softmax weights act as a coupling kernel, and the value update is the Kuramoto step that pulls each token's phases toward the phases of the tokens it attends to ([Proposition 1](https://arxiv.org/html/2606.11585#Thmtheorem1)).

#### Rotary position.

We read rotary embeddings (Su et al., [2021](https://arxiv.org/html/2606.11585#bib.bib4)) in the phase-state representation as a position-dependent phase drift in the score ([Section 3](https://arxiv.org/html/2606.11585#S3)). This phase drift plays the role of the natural-frequency term in the Kuramoto model.
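
The equivalence between rotating lifted phases by position and adding a drift inside the cosine kernel can be verified numerically. The sketch below assumes the convention that a token at position $p$ has each phase advanced by $\omega_j p$ before the $(\cos,\sin)$ lift; the rates and positions are hypothetical values for illustration.

```python
import numpy as np

def lift(phi):
    # Two-component (cos, sin) lift of each phase coordinate.
    return np.stack([np.cos(phi), np.sin(phi)], axis=-1)

rng = np.random.default_rng(0)
k = 5
theta_t = rng.uniform(-np.pi, np.pi, size=k)   # query phases
theta_u = rng.uniform(-np.pi, np.pi, size=k)   # key phases
omega = rng.uniform(0.01, 1.0, size=k)         # per-coordinate rates (illustrative)
t, u = 17, 11

# Rotary view: advance each phase by omega * position, then take the
# per-coordinate inner product of the lifted query and key.
rotary = np.sum(lift(theta_t + omega * t) * lift(theta_u + omega * u), axis=-1)

# Drift view: cosine kernel with a position-dependent phase drift.
drift = np.cos(theta_t - theta_u + omega * (t - u))

# Shifting both positions by the same offset leaves the kernel unchanged.
shifted = np.sum(
    lift(theta_t + omega * (t + 100)) * lift(theta_u + omega * (u + 100)), axis=-1
)
print(np.max(np.abs(rotary - drift)), np.max(np.abs(shifted - drift)))
```

Both differences vanish up to rounding, since $\cos a\cos b+\sin a\sin b=\cos(a-b)$ and the positions enter only through $t-u$.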

#### Unitary/group-valued recurrence.

Arjovsky et al. ([2016](https://arxiv.org/html/2606.11585#bib.bib10)) and follow-up work place the recurrent *parameters* on a group, whereas in our setting it is the *state* that lives on the group.

## 6 Conclusion

We introduced a phase-valued self-attention layer that functions as a character-level language model at a level close to a matched transformer, and showed that its value update is an adaptive Kuramoto synchronization step. In this step, the value update is exactly the Kuramoto coupling term, and the softmax attention matrix plays the role of the coupling kernel ([Proposition 1](https://arxiv.org/html/2606.11585#Thmtheorem1)). The softmax itself selects neighbors, and it does so as an entropy-regularized form of phase retrieval ([Proposition 2](https://arxiv.org/html/2606.11585#Thmtheorem2)). Rotary position enters as a phase drift, and for this layer that drift can be read as the natural frequencies of the Kuramoto model ([Remark 1](https://arxiv.org/html/2606.11585#Thmremark1)). When we read these results geometrically, the score and the update are two aspects of a single object. The score is a learned metric on the torus, and the update follows the same coherence, pulling each token toward the neighbors that the metric selects ([Lemma 3](https://arxiv.org/html/2606.11585#Thmtheorem3)); Kuramoto is the abelian case. Our ablations isolate the components that carry the performance. Taken together, these results give a compact bridge between self-attention and phase synchronization.

#### Beyond the torus.

The construction uses only two group-geometric ingredients. The first ingredient is a group-invariant similarity score, and the second is a value update that stays on the manifold. The torus is the abelian case, in which both of these ingredients have closed forms. Supplying the invariant inner product and the on-manifold mean of any compact subgroup of the unitary group defines an attention layer whose hidden state stays on that group; this is the Lohe model of non-abelian synchronization (Lohe, [2009](https://arxiv.org/html/2606.11585#bib.bib11)), with the torus as its abelian (Kuramoto) case, and we develop the general construction separately (Nunley, [2026](https://arxiv.org/html/2606.11585#bib.bib17)). Our ablations show that the metric and the on-manifold mean carry the performance, and they also show that the feed-forward block is the one generic map that has still to be made geometry-native. The motivating targets are manifold-valued domains, such as articulated motion on $\mathrm{SO}(3)^{J}$ or unitary-valued states in quantum machine learning. We leave to future work the task of instantiating the non-abelian cases and the task of making the feed-forward block geometry-native.

#### Limitations.

The identity is exact for this phase-state layer; standard dot-product attention is known only to synchronize asymptotically (Geshkovski et al., [2023](https://arxiv.org/html/2606.11585#bib.bib5)). The layer trails a matched transformer at one million parameters and performs comparably at five million (level on the median over five seeds, the transformer leading on the mean), and it offers no speed advantage. Our evidence comes from character-level enwiki8 at one-to-five-million parameters, so broader scales and tasks are natural next steps.

## Appendix A Experimental details

#### Data.

enwiki8 (Hutter, [2006](https://arxiv.org/html/2606.11585#bib.bib9)), the first $10^{8}$ characters of English Wikipedia, split $90/5/5\%$ into train/validation/test by character. Inputs are character sequences of length $256$ (vocabulary size $205$).

#### Training.

AdamW, learning rate $10^{-3}$, weight decay $0.01$, batch size $64$, gradient clipping at $1.0$, dropout $0.1$, $50$ epochs. We use the identical training recipe for both models at both scales and report validation BPC at the best epoch.
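
As a reminder of the metric, bits-per-character is the average per-character negative log-likelihood expressed in bits. The sketch below assumes the loss is computed in nats (the usual cross-entropy convention) and shows the conversion together with best-epoch selection; the validation curve is a hypothetical example, not data from the paper.

```python
import math
import numpy as np

def bits_per_character(nll_nats):
    # Average per-character negative log-likelihood (nats), converted to bits.
    return float(np.mean(nll_nats)) / math.log(2.0)

# Sanity check: a uniform guess over the 205-symbol vocabulary costs log2(205) bits.
vocab_size = 205
uniform_bpc = bits_per_character(np.full(1000, math.log(vocab_size)))

# Hypothetical validation curve (illustrative numbers only).
val_bpc = np.array([2.10, 1.80, 1.66, 1.64, 1.65])
best_epoch = int(np.argmin(val_bpc))   # epoch with the lowest validation BPC
print(uniform_bpc, best_epoch, val_bpc[best_epoch])
```

The uniform baseline gives a useful upper reference point: any trained model should land well below $\log_2 205$ bits per character.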

#### Architecture and parameter matching.

Both models use $4$ layers, a SwiGLU feed-forward block, and RoPE position; the transformer baseline is single-head dot-product attention, and Kuramoto attention is likewise single-head (one shared bank of $k$ oscillators per layer), so the two differ only in the attention mechanism. We fix the depth and the parameter budget and then vary the width so that the total number of parameters is equalized at each scale. For Kuramoto attention the width is the torus dimension $k$, and for the transformer the width is $d_{\mathrm{model}}$. The two widths are not directly comparable: $k$ counts phase coordinates, each carried by a two-component $(\cos,\sin)$ lift, whereas $d_{\mathrm{model}}$ counts model channels directly, so the Kuramoto-attention model reaches the same parameter budget at a larger nominal width ($k=176$ vs. $d_{\mathrm{model}}=120$ at 1M). We hold the total parameter count fixed across the two models, and the widths are allowed to differ. [Table 3](https://arxiv.org/html/2606.11585#A1.T3) reports the achieved widths and counts (matched to within $\sim 3\%$).

Table 3: Achieved widths and parameter counts at each scale.

All settings needed to reproduce the runs are given above; the exact configuration files will accompany the code release.

## Appendix B The score as a position-varying metric

The score splits into an equal-position part and the rotary drift $\omega_{j}(t-u)$, where the rotary drift plays the role of the natural frequencies ([Section 3](https://arxiv.org/html/2606.11585#S3)). The metric we study here is the equal-position part of this split. To extract it, fix a query at $\theta$ and a key at $\theta+\delta$ with small $\delta$ and $t=u$. At equal position the rotary drift $\omega_{j}(t-u)$ vanishes, so the score's kernel $\cos(\theta_{t,j}-\theta_{u,j}+\omega_{j}(t-u))$ reduces to $\cos\delta_{j}$, with $\delta_{j}$ the $j$-th component of the offset $\delta$. For gates that vary slowly across the torus we treat them as locally constant, $g^{k}_{j}(\theta+\delta)=g^{k}_{j}(\theta)+O(\delta\nabla g)$, so the score's leading $\delta$-dependence is the cosine kernel. Expanding it,

$$s\;=\;\frac{\tau}{\sqrt{k}}\sum_{j}g^{q}_{j}(\theta)\,g^{k}_{j}(\theta)\,\cos\delta_{j}\;=\;\frac{\tau}{\sqrt{k}}\sum_{j}g^{q}_{j}g^{k}_{j}\;-\;\frac{\tau}{2\sqrt{k}}\sum_{j}g^{q}_{j}g^{k}_{j}\,\delta_{j}^{2}\;+\;O\!\big(\delta^{4},\ \delta\nabla g\big),$$

so similarity falls off quadratically in $\delta$ at a rate set by the gate product. The symmetric quadratic form that measures this second-order decay assigns a squared length $\sum_{j}\mathcal{G}_{jj}(\theta)\,\delta_{j}^{2}$ to a small displacement $\delta$, with the conventional factor $1/2$ absorbed. Because the gates $g^{q},g^{k}$ are strictly positive (softplus outputs), this form is positive-definite and diagonal on $\mathbb{T}^{k}$, and it varies smoothly with $\theta$, so it is a position-dependent metric on the flat torus, which we call the score's metric and read lengths under,

$$\mathcal{G}(\theta)\;=\;\frac{\tau}{\sqrt{k}}\,\mathrm{diag}\!\big(g^{q}(\theta)\,g^{k}(\theta)\big),$$

position-dependent through the gates. Here *diagonal* describes the quadratic form, whose second-order expansion has no $\delta_{i}\delta_{j}$ cross-term, so $\mathcal{G}$ is diagonal at each point. As a field over the torus it is not separable, because each entry $\mathcal{G}_{jj}(\theta)$ is a dense readout of all phases (the gates depend on every coordinate, [Section 3.5](https://arxiv.org/html/2606.11585#S3.SS5)), so $\partial_{\theta_{i}}\mathcal{G}_{jj}\neq 0$ for $i\neq j$. Sharing the gates across layers makes $\mathcal{G}$ a *single* learned geometry under which every layer synchronizes. This is the content of [Lemma 3](https://arxiv.org/html/2606.11585#Thmtheorem3), in which the value update follows the coherence that this metric gates, pulling each token toward its selected neighbors.

A metric fixes lengths and angles; a separate, optional structure fixes how a tangent vector is carried between nearby points. The frame-field correction of [Section 4.3](https://arxiv.org/html/2606.11585#S4.SS3) is one such carrying rule, a candidate refinement beyond the metric. Because the metric field is non-separable, the connection it induces couples coordinates even though the metric form does not, so a correction that keeps only the per-coordinate part is exact in the separable case and approximate here. The on-manifold value update is the circular mean of the selected tokens, where the selection is the entropy-regularized retrieval that the score induces ([Proposition 2](https://arxiv.org/html/2606.11585#Thmtheorem2)). Only the feed-forward block lies outside this geometry.
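
The second-order expansion can be checked numerically. The sketch below holds the gates constant (the locally-constant assumption used in the derivation), draws them as softplus outputs of random pre-activations, and compares the exact drop in score $s(0)-s(\delta)$ with the quadratic prediction $\tfrac{1}{2}\sum_j\mathcal{G}_{jj}\delta_j^{2}$. The values of $k$, $\tau$, and the gate pre-activations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
k, tau = 12, 1.0
g_q = np.logaddexp(0.0, rng.normal(size=k))   # softplus outputs, strictly positive
g_k = np.logaddexp(0.0, rng.normal(size=k))

def score(delta):
    # Equal-position score with gates held locally constant.
    return tau / np.sqrt(k) * np.sum(g_q * g_k * np.cos(delta))

G_diag = tau / np.sqrt(k) * g_q * g_k          # diagonal entries of the metric

direction = rng.normal(size=k)
rel_err = {}
for eps in (1e-1, 1e-2):
    delta = eps * direction
    exact_drop = score(np.zeros(k)) - score(delta)
    quadratic = 0.5 * np.sum(G_diag * delta**2)
    rel_err[eps] = abs(exact_drop - quadratic) / quadratic
    print(eps, exact_drop, quadratic, rel_err[eps])
```

The relative gap shrinks roughly a hundredfold when $\delta$ shrinks tenfold, consistent with the $O(\delta^{4})$ remainder of the cosine expansion.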

## References

- Optimization algorithms on matrix manifolds. Princeton University Press. Cited by: [§3.5](https://arxiv.org/html/2606.11585#S3.SS5.p2.5).
- J. A. Acebrón, L. L. Bonilla, C. J. Pérez Vicente, F. Ritort, and R. Spigler (2005) The Kuramoto model: a simple paradigm for synchronization phenomena. Reviews of Modern Physics 77(1), pp. 137–185. Cited by: [§5](https://arxiv.org/html/2606.11585#S5.SS0.SSS0.Px3.p1.1).
- M. Arjovsky, A. Shah, and Y. Bengio (2016) Unitary evolution recurrent neural networks. In International Conference on Machine Learning. Cited by: [§5](https://arxiv.org/html/2606.11585#S5.SS0.SSS0.Px6.p1.1).
- J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: [§2.3](https://arxiv.org/html/2606.11585#S2.SS3.SSS0.Px1.p1.5).
- S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge University Press. Cited by: [§3.2](https://arxiv.org/html/2606.11585#S3.SS2.1.p1.1).
- B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet (2023) The emergence of clusters in self-attention dynamics. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2606.11585#S1.SS0.SSS0.Px1.p2.1), [§5](https://arxiv.org/html/2606.11585#S5.SS0.SSS0.Px1.p1.1), [§6](https://arxiv.org/html/2606.11585#S6.SS0.SSS0.Px2.p1.1).
- B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet (2024) A mathematical perspective on transformers. arXiv preprint arXiv:2312.10794. Cited by: [§1](https://arxiv.org/html/2606.11585#S1.SS0.SSS0.Px1.p2.1), [§5](https://arxiv.org/html/2606.11585#S5.SS0.SSS0.Px1.p1.1).
- M. Hutter (2006) The Hutter prize for lossless compression of human knowledge (enwik8). Note: [http://prize.hutter1.net/](http://prize.hutter1.net/). Cited by: [Appendix A](https://arxiv.org/html/2606.11585#A1.SS0.SSS0.Px1.p1.4), [§4.1](https://arxiv.org/html/2606.11585#S4.SS1.p1.5).
- Y. Kuramoto (1975) Self-entrainment of a population of coupled non-linear oscillators. International Symposium on Mathematical Problems in Theoretical Physics, pp. 420–422. Cited by: [§5](https://arxiv.org/html/2606.11585#S5.SS0.SSS0.Px3.p1.1).
- M. A. Lohe (2009) Non-Abelian Kuramoto models and synchronization. Journal of Physics A: Mathematical and Theoretical 42(39), pp. 395101. Cited by: [§6](https://arxiv.org/html/2606.11585#S6.SS0.SSS0.Px1.p1.1).
- I. Loshchilov, C. Hsieh, S. Sun, and B. Ginsburg (2024) nGPT: normalized transformer with representation learning on the hypersphere. arXiv preprint arXiv:2410.01131. Cited by: [§2.3](https://arxiv.org/html/2606.11585#S2.SS3.SSS0.Px1.p1.5).
- T. Miyato, S. Löwe, A. Geiger, and M. Welling (2025) Artificial Kuramoto oscillatory neurons. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2606.11585#S1.SS0.SSS0.Px1.p2.1), [§5](https://arxiv.org/html/2606.11585#S5.SS0.SSS0.Px2.p1.1).
- S. Muzellec, A. Alamia, T. Serre, and R. VanRullen (2025) Enhancing deep neural networks through complex-valued representations and Kuramoto synchronization dynamics. Transactions on Machine Learning Research. Cited by: [§5](https://arxiv.org/html/2606.11585#S5.SS0.SSS0.Px2.p1.1).
- J. Nunley (2026) Subgroups of $U(d)$ induce natural RNN and transformer architectures. arXiv preprint arXiv:2602.18417. Cited by: [§6](https://arxiv.org/html/2606.11585#S6.SS0.SSS0.Px1.p1.1).
- H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler, L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, et al. (2021) Hopfield networks is all you need. In International Conference on Learning Representations. Cited by: [§5](https://arxiv.org/html/2606.11585#S5.SS0.SSS0.Px4.p1.1).
- J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu (2021) RoFormer: enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864. Cited by: [§5](https://arxiv.org/html/2606.11585#S5.SS0.SSS0.Px5.p1.1).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2606.11585#S1.p1.1).
- B. Zhang and R. Sennrich (2019) Root mean square layer normalization. In Advances in Neural Information Processing Systems. Cited by: [§2.3](https://arxiv.org/html/2606.11585#S2.SS3.SSS0.Px1.p1.5).
