The E$\Delta$-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality

arXiv cs.LG Papers

Summary

The paper introduces the EΔ-MHC-Geo Transformer, a novel architecture using adaptive geodesic operations with guaranteed orthogonality via Cayley rotations and Householder reflections. It demonstrates improved long-horizon stability and norm preservation compared to existing baselines like Deep Delta Learning.

arXiv:2605.06729v1 Announce Type: new Abstract: We present the E$\Delta$-MHC-Geo Transformer, a novel architecture that unifies Manifold-Constrained Hyper-Connections (mHC), Deep Delta Learning (DDL), and the Cayley transform to obtain input-adaptive, unconditionally orthogonal residual connections. Unlike DDL, whose Householder operator is orthogonal only at $\beta \in \{0,2\}$, our Data-Dependent Cayley rotation $Q(x)=(I+(\beta/2)A(x))^{-1}(I-(\beta/2)A(x))$ preserves orthogonality for all $\beta$ and all inputs. To handle negation, an eigenvalue $-1$ case that Cayley provably excludes, we introduce the E$\Delta$-MHC-Geo Hybrid, which combines Cayley rotation with Householder reflection via a learned operator-selection gate $X'=\gamma(X)Q(X)X+(1-\gamma(X))H_2(X)X$. A midpoint-collapse regularizer, $4\gamma(1-\gamma)$, encourages boundary gate decisions, where each selected component is orthogonal. In matched-parameter comparisons, with approximately 1.79M parameters per model and mean +/- standard deviation over 3 seeds, against four baselines including the concurrent JPmHC, E$\Delta$-MHC-Geo achieves the best long-horizon stability, 1.9x over JPmHC and 3.8x over GPT; the best near-$\pi$ rotation loss, 4.5x over JPmHC on single-plane; strong norm preservation, with 0.001 mean deviation; and 0.96 negation cosine alignment in a diagnostic reflection probe, all with 33% fewer layers. While JPmHC's wider representation excels on pure rotation, its finite Cayley residual mixer excludes an exact $\lambda=-1$ operator and has no reflection branch, motivating our hybrid approach for accessing both connected components of $O(n)$.

Cached at: 05/11/26, 06:43 AM

# Adaptive Geodesic Operations with Guaranteed Orthogonality
Source: [https://arxiv.org/html/2605.06729](https://arxiv.org/html/2605.06729)
## The E$\Delta$-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality

###### Abstract

We present the E$\Delta$-MHC-Geo Transformer, a novel architecture that unifies Manifold-Constrained Hyper-Connections (mHC), Deep Delta Learning (DDL), and the Cayley transform to obtain *input-adaptive, unconditionally orthogonal* residual connections. Unlike DDL, whose Householder operator is orthogonal only at $\beta \in \{0,2\}$, our Data-Dependent Cayley rotation $\mathbf{Q}(\mathbf{x}) = (\mathbf{I} + \tfrac{\beta}{2}\mathbf{A}(\mathbf{x}))^{-1}(\mathbf{I} - \tfrac{\beta}{2}\mathbf{A}(\mathbf{x}))$ preserves orthogonality for *all* $\beta$ and all inputs. To handle negation (eigenvalue $-1$, which Cayley provably excludes), we introduce the E$\Delta$-MHC-Geo Hybrid that combines Cayley rotation with Householder reflection via a learned operator-selection gate $\mathbf{X}' = \gamma(\mathbf{X})\,\mathbf{Q}(\mathbf{X})\mathbf{X} + (1-\gamma(\mathbf{X}))\,\mathbf{H}_2(\mathbf{X})\mathbf{X}$. A midpoint-collapse regularizer $\mathcal{L}_{\mathrm{gate}} = 4\gamma(1-\gamma)$ encourages boundary gate decisions, where each selected component is orthogonal. In matched-parameter comparisons (${\sim}1.79$M parameters each, mean $\pm$ std over 3 seeds) against four baselines including the concurrent JPmHC (Sengupta et al., [2026](https://arxiv.org/html/2605.06729#bib.bib10)), E$\Delta$-MHC-Geo achieves the best long-horizon stability ($1.9\times$ over JPmHC, $3.8\times$ over GPT), the best near-$\pi$ rotation loss ($4.5\times$ over JPmHC on single-plane), strong norm preservation ($0.001$ mean deviation), and $0.96$ negation cosine alignment in a diagnostic reflection probe, all with $33\%$ fewer layers.
While JPmHC’s wider representation ($n_{\mathrm{embd}} = 512$) excels on pure rotation (gyroscope), its finite Cayley residual mixer excludes an exact $\lambda = -1$ operator and has no reflection branch, motivating our hybrid approach for accessing both connected components of $\mathrm{O}(n)$. Code is available at [https://github.com/arash-shahmansoori/edelta](https://github.com/arash-shahmansoori/edelta).

## 1 Introduction

Residual connections (He et al., [2016](https://arxiv.org/html/2605.06729#bib.bib4)) are a cornerstone of modern deep learning, enabling gradient flow through very deep networks via the shortcut $\mathbf{X}_{l+1} = \mathbf{X}_l + F(\mathbf{X}_l)$. The standard additive residual, however, provides no geometric guarantees: norms can drift, and the identity shortcut limits the expressivity of inter-layer mixing.

Two recent lines of work address this limitation from complementary angles. Manifold-Constrained Hyper-Connections (mHC) (DeepSeek AI, [2024](https://arxiv.org/html/2605.06729#bib.bib3)) replace the identity with a multi-stream residual mixing matrix $\mathbf{H}_{\mathrm{res}}$ (projected to doubly stochastic via Sinkhorn–Knopp), together with pre/post mappings $\mathbf{H}_{\mathrm{pre}}, \mathbf{H}_{\mathrm{post}}$ for stream aggregation. Deep Delta Learning (DDL) (Yang et al., [2024](https://arxiv.org/html/2605.06729#bib.bib14)) replaces the identity shortcut with a Householder operator $\mathbf{H}_\beta = \mathbf{I} - \beta\mathbf{k}\mathbf{k}^\top$ whose direction $\mathbf{k}$ and magnitude $\beta$ are input-dependent, yielding an *input-adaptive* residual.

Limitations. DDL is orthogonal *only* when $\beta \in \{0,2\}$; during training $\beta$ varies continuously, breaking isometry and causing gradient instability. mHC’s Sinkhorn projection is only approximately orthogonal, and errors accumulate over long sequences. Neither architecture provides *unconditional* orthogonality.

Our contribution. We propose E$\Delta$-MHC-Geo, which replaces the residual mixing matrix with a *Data-Dependent Cayley rotation* $\mathbf{Q}(\mathbf{x}) \in \mathrm{SO}(n)$. The key insight is that the skew-symmetry of the Cayley generator $\mathbf{A} = \mathbf{u}\mathbf{v}^\top - \mathbf{v}\mathbf{u}^\top$ depends only on its algebraic form, not on how $\mathbf{u}, \mathbf{v}$ are obtained. Making $\mathbf{u}(\mathbf{x}), \mathbf{v}(\mathbf{x})$ input-dependent therefore preserves *all* Cayley properties (orthogonality, isometry, $\det = +1$) for every input and every $\beta$.

Since Cayley provably cannot produce eigenvalue $-1$ (Theorem [4.6](https://arxiv.org/html/2605.06729#S4.Thmtheorem6)), we introduce the E$\Delta$-MHC-Geo Hybrid, combining Cayley rotation with Householder reflection ($\beta = 2$ fixed) through a learned gate $\gamma$. A midpoint-collapse regularizer encourages $\gamma \to \{0,1\}$, so the model can “jump” between the two disconnected components of $\mathrm{O}(n)$ rather than linger in the non-orthogonal interior.

Figure 1: Residual connection paradigms. (a) Standard additive residual with identity shortcut, $\mathbf{X}_l + F(\mathbf{X}_l)$. (b) DDL with Householder operator $\mathbf{H}_\beta = \mathbf{I} - \beta\mathbf{k}\mathbf{k}^\top$ and output $\mathbf{H}_\beta\mathbf{X} + \beta\mathbf{k}\mathbf{v}^\top$; orthogonal only at $\beta \in \{0,2\}$. (c) JPmHC with iterative Cayley retraction $\widetilde{\mathbf{Q}} \approx \text{Cayley}$ and output $F(\mathbf{X}) + \widetilde{\mathbf{Q}}\mathbf{X}$: parallel routing, approximate orthogonality ($\det \approx +1$), $\mathrm{SO}(n)$ only. (d) Proposed E$\Delta$-MHC-Geo Hybrid, $\gamma\mathbf{Q}\mathbf{X} + (1-\gamma)\mathbf{H}_2\mathbf{X}$, combining exact Cayley rotation with Householder reflection via learned gate $\gamma(\mathbf{X})$, enabling boundary access to both components of $\mathrm{O}(n)$ ($\det \in \{-1,+1\}$, exact orthogonality).

Summary of contributions:

1. Data-Dependent Cayley rotation with provably unconditional orthogonality for *all* inputs and *all* $\beta$ (Theorem [4.1](https://arxiv.org/html/2605.06729#S4.Thmtheorem1)). Unlike prior Cayley-based methods (Helfrich et al., [2018](https://arxiv.org/html/2605.06729#bib.bib5); Lezcano-Casado & Martínez-Rubio, [2019](https://arxiv.org/html/2605.06729#bib.bib6)), the rotation plane itself is input-dependent, enabling adaptive geometric transformations without sacrificing any algebraic guarantees.
2. E$\Delta$-MHC-Geo Hybrid combining Cayley rotation and Householder reflection via a learned gate, accessing both connected components of the orthogonal group $\mathrm{O}(n)$ at the gate boundaries. The gate learns to select the appropriate operator *automatically* based on task structure.
3. Midpoint collapse regularization that encourages binary gate decisions, with a universal zero-gradient theorem (Theorem [7.3](https://arxiv.org/html/2605.06729#S7.Thmtheorem3)) explaining when this regularization succeeds, when it stalls, and what escape mechanisms are available.
4. Rigorous experimental validation on four benchmarks with fair parameter matching (${\sim}1.79$M each) against four baselines including the concurrent JPmHC, with results averaged over 3 random seeds. E$\Delta$-MHC-Geo is the only evaluated architecture with a direct mechanism for both $\det = +1$ and $\det = -1$ operators: best stability ($1.9\times$ over JPmHC), best near-$\pi$ loss ($4.5\times$ over JPmHC), $0.96$ negation cosine alignment in a diagnostic operator probe, and $33\%$ fewer layers. The experiments validate the main algebraic predictions while also exposing limitations of the current hybrid parameterization.

## 2 Related Work

Orthogonal parameterizations. Maintaining orthogonality in neural networks has a rich history. Arjovsky et al. ([2016](https://arxiv.org/html/2605.06729#bib.bib1)) proposed unitary RNNs with unitary weight matrices to mitigate vanishing gradients, but the parameterization is *fixed* per layer and does not adapt to input. Helfrich et al. ([2018](https://arxiv.org/html/2605.06729#bib.bib5)) introduced the Cayley transform for orthogonal RNN parameterization, demonstrating improved gradient flow; however, the rotation planes are fixed at initialization and remain constant throughout inference. Lezcano-Casado & Martínez-Rubio ([2019](https://arxiv.org/html/2605.06729#bib.bib6)) provided efficient parameterizations of the orthogonal and unitary groups via the matrix exponential and Cayley map, achieving $O(n)$ cost for structured matrices, but again with static (weight-dependent, not input-dependent) orthogonal operators. Vorontsov et al. ([2017](https://arxiv.org/html/2605.06729#bib.bib13)) and Bansal et al. ([2018](https://arxiv.org/html/2605.06729#bib.bib2)) studied orthogonality as a *regularization* objective rather than an architectural guarantee, meaning orthogonality is approximate and degrades as regularization strength is reduced. In contrast, our work makes the Cayley rotation *input-adaptive* (the rotation plane changes with every input) while preserving *exact* orthogonality algebraically, without any soft penalty or projection.

Residual connections. He et al. ([2016](https://arxiv.org/html/2605.06729#bib.bib4)) introduced the foundational skip connection $\mathbf{X}_{l+1} = \mathbf{X}_l + F(\mathbf{X}_l)$, enabling training of very deep networks but providing no geometric structure on the residual path. DeepSeek AI ([2024](https://arxiv.org/html/2605.06729#bib.bib3)) proposed mHC with multi-stream residual and Sinkhorn-based doubly stochastic mixing, achieving approximately orthogonal residual mixing; however, the Sinkhorn projection introduces approximation error that accumulates over long sequences (typically requiring $20+$ iterations per layer). Yang et al. ([2024](https://arxiv.org/html/2605.06729#bib.bib14)) introduced DDL with input-dependent Householder operators, providing input-adaptivity but sacrificing orthogonality at all training points where $\beta \notin \{0,2\}$, which is the vast majority of training. Concurrently, Sengupta et al. ([2026](https://arxiv.org/html/2605.06729#bib.bib10)) (JPmHC, arXiv:2602.18308v2, March 2026) independently identify the gradient instability of Sinkhorn-based projections via operator-valued free probability, and propose replacing the Birkhoff constraint with an iterative Cayley retraction. However, JPmHC uses a *parallel routing* topology (separate Cayley residual path and softmax compute path), an *iterative fixed-point approximation* ($s = 2$ steps, $\alpha = 0.1$) that yields only approximate orthogonality ($\|Y^\top Y - I\|_{\max} < 10^{-3}$), and restricts to $\mathrm{SO}(n)$ through a finite Cayley residual parameterization with no reflection mechanism, leaving an exact eigenvalue $\lambda = -1$ inaccessible to that residual path.
JPmHC’s complementary strength is substantial: its operator-valued free-probability analysis gives a principled spectral explanation of why Birkhoff/Sinkhorn mixers lose dynamical isometry, and its March v2 experiments provide large-scale ARC-AGI evidence for orthogonal hyper-connections. Our work unifies the strengths of all prior approaches: mHC’s multi-stream framework with DDL’s input adaptivity, while adding *unconditional* (exact, non-iterative) orthogonality that holds for every input, every $\beta$, and at every training step, together with boundary access to both components of $\mathrm{O}(n)$ via the Householder gate.

Geometric deep learning. Saxe et al. ([2014](https://arxiv.org/html/2605.06729#bib.bib9)) showed that orthogonal initialization enables exact gradient solutions in deep linear networks, establishing that orthogonality at initialization accelerates convergence. The Cayley transform (Shepard et al., [2015](https://arxiv.org/html/2605.06729#bib.bib11)) provides a smooth bijection between skew-symmetric matrices and the elements of the special orthogonal group $\mathrm{SO}(n)$ that have no eigenvalue $-1$, and is well-studied in Lie group theory. Our contribution is making this map *data-dependent* (the skew-symmetric generator is computed from the input via neural networks), creating a bridge between geometric group theory and adaptive neural architectures. This is distinct from prior Cayley-based methods (Helfrich et al., [2018](https://arxiv.org/html/2605.06729#bib.bib5); Lezcano-Casado & Martínez-Rubio, [2019](https://arxiv.org/html/2605.06729#bib.bib6)) in that the *rotation itself* (not just its parameters) varies with each input.

## 3 Mathematical Framework

### 3.1 Notation

Let $\mathbf{x} \in \mathbb{R}^{B\times S\times D}$ denote the input tensor (batch, sequence, dimension), $\bar{\mathbf{x}} \in \mathbb{R}^{B\times D}$ the mean-pooled representation, $n$ the number of streams (typically 4), and $d = D/n$ the per-stream dimension.

### 3.2 Data-Dependent Generator Networks

###### Definition 3.1 (Generator Networks).

The rotation generators are computed by neural networks:

$$\mathbf{u}(\mathbf{x}) = \mathbf{W}_u \cdot \bar{\mathbf{x}} + \mathbf{b}_u \in \mathbb{R}^n, \qquad \mathbf{v}(\mathbf{x}) = \mathbf{W}_v \cdot \bar{\mathbf{x}} + \mathbf{b}_v \in \mathbb{R}^n, \tag{1}$$

where $\mathbf{W}_u, \mathbf{W}_v \in \mathbb{R}^{n\times D}$ are learnable. For increased expressivity, we use two-layer MLPs with GELU activation.

###### Definition 3.2 (Data-Dependent Skew-Symmetric Generator).

$$\mathbf{A}(\mathbf{x}) = \mathbf{u}(\mathbf{x})\mathbf{v}(\mathbf{x})^\top - \mathbf{v}(\mathbf{x})\mathbf{u}(\mathbf{x})^\top. \tag{2}$$

###### Proposition 3.3 (Skew-Symmetry Preservation).

For any $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$, the matrix $\mathbf{A} = \mathbf{u}\mathbf{v}^\top - \mathbf{v}\mathbf{u}^\top$ satisfies $\mathbf{A}^\top = -\mathbf{A}$.

###### Proof.

$\mathbf{A}^\top = (\mathbf{u}\mathbf{v}^\top - \mathbf{v}\mathbf{u}^\top)^\top = \mathbf{v}\mathbf{u}^\top - \mathbf{u}\mathbf{v}^\top = -\mathbf{A}$. ∎

###### Corollary 3.4.

The skew-symmetry of $\mathbf{A}(\mathbf{x})$ holds regardless of how $\mathbf{u}(\mathbf{x})$ and $\mathbf{v}(\mathbf{x})$ are computed, whether by fixed parameters, linear layers, or deep neural networks.
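Corollary 3.4 can be checked numerically. The following is a minimal NumPy sketch, not the paper's code: the layer sizes, the random weights, and the helper names `gelu`, `mlp`, and `skew_generator` are illustrative assumptions. It builds $\mathbf{u}$ and $\mathbf{v}$ from two-layer MLPs and confirms that the resulting generator is skew-symmetric.

```python
import numpy as np

def gelu(h):
    # tanh approximation of GELU
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

def mlp(x_bar, W1, W2):
    # two-layer MLP: R^D -> R^hidden -> R^n (biases omitted for brevity)
    return W2 @ gelu(W1 @ x_bar)

def skew_generator(u, v):
    # A = u v^T - v u^T (Definition 3.2)
    return np.outer(u, v) - np.outer(v, u)

rng = np.random.default_rng(0)
D, hidden, n = 16, 8, 4                      # hypothetical sizes
x_bar = rng.normal(size=D)                   # stand-in for a mean-pooled input
u = mlp(x_bar, rng.normal(size=(hidden, D)), rng.normal(size=(n, hidden)))
v = mlp(x_bar, rng.normal(size=(hidden, D)), rng.normal(size=(n, hidden)))

A = skew_generator(u, v)
skew_error = np.abs(A + A.T).max()           # zero up to rounding
print(f"max |A + A^T| = {skew_error:.2e}")
```

Swapping `mlp` for any other map from $\bar{\mathbf{x}}$ to $\mathbb{R}^n$ should leave `skew_error` at rounding level, since the property comes from the algebraic form of $\mathbf{A}$ alone.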

###### Definition 3.5 (Data-Dependent Cayley Transform).

$$\mathbf{Q}(\mathbf{x}) = \bigl(\mathbf{I} + \tfrac{\beta(\mathbf{x})}{2}\mathbf{A}(\mathbf{x})\bigr)^{-1}\bigl(\mathbf{I} - \tfrac{\beta(\mathbf{x})}{2}\mathbf{A}(\mathbf{x})\bigr), \tag{3}$$

where $\beta(\mathbf{x}) \in \mathbb{R}^{+}$ is a data-dependent rotation magnitude.

## 4 Main Theoretical Results

###### Theorem 4.1 (Unconditional Orthogonality).

For any differentiable functions $\mathbf{u}: \mathbb{R}^D \to \mathbb{R}^n$ and $\mathbf{v}: \mathbb{R}^D \to \mathbb{R}^n$, and any $\beta(\mathbf{x}) \in \mathbb{R}$, the Data-Dependent Cayley transform satisfies:

$$\mathbf{Q}(\mathbf{x})^\top\mathbf{Q}(\mathbf{x}) = \mathbf{I}_n. \tag{4}$$

###### Proof.

Let $\mathbf{M} = \tfrac{\beta}{2}\mathbf{A}(\mathbf{x})$. Since $\mathbf{A}(\mathbf{x})$ is skew-symmetric (Proposition [3.3](https://arxiv.org/html/2605.06729#S3.Thmtheorem3)), so is $\mathbf{M}$: $\mathbf{M}^\top = -\mathbf{M}$. Define $\mathbf{Q} = (\mathbf{I}+\mathbf{M})^{-1}(\mathbf{I}-\mathbf{M})$.

Step 1. $\mathbf{Q}^\top = (\mathbf{I}-\mathbf{M})^\top((\mathbf{I}+\mathbf{M})^{-1})^\top = (\mathbf{I}+\mathbf{M})(\mathbf{I}-\mathbf{M})^{-1}$.

Step 2. $\mathbf{Q}^\top\mathbf{Q} = (\mathbf{I}+\mathbf{M})\underbrace{(\mathbf{I}-\mathbf{M})^{-1}(\mathbf{I}+\mathbf{M})^{-1}}_{\text{commute}}(\mathbf{I}-\mathbf{M})$.

Step 3. Since $(\mathbf{I}-\mathbf{M})$ and $(\mathbf{I}+\mathbf{M})$ are polynomials in $\mathbf{M}$, they commute, so $(\mathbf{I}-\mathbf{M})^{-1}(\mathbf{I}+\mathbf{M})^{-1} = (\mathbf{I}+\mathbf{M})^{-1}(\mathbf{I}-\mathbf{M})^{-1}$.

Step 4. $\mathbf{Q}^\top\mathbf{Q} = (\mathbf{I}+\mathbf{M})(\mathbf{I}+\mathbf{M})^{-1}(\mathbf{I}-\mathbf{M})^{-1}(\mathbf{I}-\mathbf{M}) = \mathbf{I}\cdot\mathbf{I} = \mathbf{I}$. ∎

###### Corollary 4.2 (Unconditional).

The Data-Dependent Cayley transform is orthogonal for *all* inputs, without any constraint on $\beta(\mathbf{x})$.
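As a sanity check on Theorem 4.1 and Corollary 4.2, the sketch below evaluates the Cayley transform for several values of $\beta$, including negative and large ones, and measures $\|\mathbf{Q}^\top\mathbf{Q} - \mathbf{I}\|_{\max}$ together with the norm drift of a test vector. It uses NumPy's `np.linalg.solve`; the helper name `cayley`, the random vectors, and the chosen $\beta$ values are assumptions made for illustration.

```python
import numpy as np

def cayley(u, v, beta):
    """Q = (I + (beta/2) A)^{-1} (I - (beta/2) A) with A = u v^T - v u^T."""
    n = u.shape[0]
    M = 0.5 * beta * (np.outer(u, v) - np.outer(v, u))
    I = np.eye(n)
    return np.linalg.solve(I + M, I - M)     # solves (I + M) Q = I - M

rng = np.random.default_rng(0)
n = 4
u, v, y = rng.normal(size=(3, n))            # two generators and a test vector
betas = [-3.0, 0.0, 0.5, 2.0, 50.0]          # includes negative and large values

orth_errors, norm_errors = [], []
for beta in betas:
    Q = cayley(u, v, beta)
    orth_errors.append(np.abs(Q.T @ Q - np.eye(n)).max())
    norm_errors.append(abs(np.linalg.norm(Q @ y) - np.linalg.norm(y)))
    print(f"beta={beta:6.1f}  |Q^T Q - I|={orth_errors[-1]:.1e}  "
          f"norm drift={norm_errors[-1]:.1e}")
```

Both error columns should stay at floating-point rounding level for every $\beta$. In finite precision the attainable accuracy degrades slowly as $|\beta|$ grows, because $\mathbf{I} + \mathbf{M}$ becomes less well conditioned.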

###### Theorem 4.3 (Isometry / Norm Preservation).

For any input $\mathbf{x}$ and any vector $\mathbf{y} \in \mathbb{R}^n$: $\|\mathbf{Q}(\mathbf{x})\mathbf{y}\|_2 = \|\mathbf{y}\|_2$.

###### Proof.

$\|\mathbf{Q}\mathbf{y}\|_2^2 = \mathbf{y}^\top\mathbf{Q}^\top\mathbf{Q}\mathbf{y} = \mathbf{y}^\top\mathbf{I}\mathbf{y} = \|\mathbf{y}\|_2^2$. ∎

###### Theorem 4.4 (Proper Rotation).

For any input $\mathbf{x}$: $\det(\mathbf{Q}(\mathbf{x})) = +1$.

###### Proof.

$\det(\mathbf{Q}) = \det(\mathbf{I}-\mathbf{M})/\det(\mathbf{I}+\mathbf{M})$. Skew-symmetric $\mathbf{M}$ has purely imaginary eigenvalues $\pm i\mu_k$ (paired by conjugacy), plus a zero eigenvalue when $n$ is odd, which contributes a factor of $1$ to each determinant. Thus $\det(\mathbf{I}+\mathbf{M}) = \prod_k (1+i\mu_k)(1-i\mu_k) = \prod_k (1+\mu_k^2)$, and identically $\det(\mathbf{I}-\mathbf{M}) = \prod_k (1+\mu_k^2)$. Hence $\det(\mathbf{Q}) = 1$. ∎

###### Theorem 4.5 (Non-Singularity).

The matrix $(\mathbf{I} + \tfrac{\beta}{2}\mathbf{A}(\mathbf{x}))$ is always invertible for any finite $\beta$ and any input $\mathbf{x}$.

###### Proof.

The eigenvalues of $\mathbf{I} + \tfrac{\beta}{2}\mathbf{A}$ are $1 + i\tfrac{\beta\mu_k}{2}$, where $i\mu_k$ ranges over the purely imaginary eigenvalues of $\mathbf{A}$. Each has modulus $\sqrt{1+\beta^2\mu_k^2/4} \geq 1 > 0$, so none is zero. ∎
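Theorems 4.4 and 4.5 can be probed numerically as well. The sketch below is an illustration under assumed random inputs, not the paper's test suite. It uses `np.linalg.slogdet` to read off the sign and log-magnitude of $\det(\mathbf{Q})$, and the smallest singular value of $\mathbf{I} + \mathbf{M}$ as a measure of distance from singularity. For skew-symmetric $\mathbf{M}$, $(\mathbf{I}+\mathbf{M})^\top(\mathbf{I}+\mathbf{M}) = \mathbf{I} + \mathbf{M}^\top\mathbf{M}$, so every singular value is at least $1$.

```python
import numpy as np

rng = np.random.default_rng(1)
results = []
for n in (4, 5):                                    # even and odd stream counts
    u, v = rng.normal(size=(2, n))
    for beta in (0.1, 2.0, 1e3):
        M = 0.5 * beta * (np.outer(u, v) - np.outer(v, u))
        I = np.eye(n)
        Q = np.linalg.solve(I + M, I - M)
        sign, logabsdet = np.linalg.slogdet(Q)      # det(Q) = sign * exp(logabsdet)
        sigma_min = np.linalg.svd(I + M, compute_uv=False).min()
        results.append((n, beta, sign, logabsdet, sigma_min))
        print(f"n={n} beta={beta:7.1f}  det sign={sign:+.0f}  "
              f"log|det|={logabsdet:+.1e}  sigma_min(I+M)={sigma_min:.6f}")
```

The odd case $n = 5$ exercises the unpaired zero eigenvalue of $\mathbf{M}$, which does not affect either determinant.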

###### Theorem 4.6 (Eigenvalue Exclusion).

The Data-Dependent Cayley transform cannot produce eigenvalue $\lambda = -1$ for any finite parameters.

###### Proof.

The eigenvalues of $\mathbf{Q}(\mathbf{x})$ are $\lambda_k = e^{-2i\arctan(\beta\mu_k/2)}$. Since $\arctan: \mathbb{R} \to (-\tfrac{\pi}{2}, \tfrac{\pi}{2})$, the argument lies in $(-\pi, \pi)$, strictly excluding $\pm\pi$. Therefore $\lambda = -1 = e^{i\pi}$ is impossible. ∎
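To make Theorem 4.6 concrete, the sketch below compares the largest rotation angle of $\mathbf{Q}$, read from its eigenvalues, with the predicted angle $2\arctan(\beta\mu/2)$. It relies on a standard fact for the rank-two generator $\mathbf{A} = \mathbf{u}\mathbf{v}^\top - \mathbf{v}\mathbf{u}^\top$: its nonzero eigenvalues are $\pm i\mu$ with $\mu^2 = \|\mathbf{u}\|^2\|\mathbf{v}\|^2 - (\mathbf{u}^\top\mathbf{v})^2$. The random vectors and the $\beta$ grid are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
u, v = rng.normal(size=(2, n))
A = np.outer(u, v) - np.outer(v, u)
# Nonzero eigenvalues of A are +/- i*mu, with mu^2 given by the Gram determinant.
mu = np.sqrt((u @ u) * (v @ v) - (u @ v) ** 2)

angles = []
for beta in (0.1, 1.0, 10.0, 1e3):
    M = 0.5 * beta * A
    Q = np.linalg.solve(np.eye(n) + M, np.eye(n) - M)
    measured = np.abs(np.angle(np.linalg.eigvals(Q))).max()   # largest rotation angle
    predicted = 2.0 * np.arctan(0.5 * beta * mu)
    angles.append((beta, measured, predicted))
    print(f"beta={beta:7.1f}  measured={measured:.6f}  predicted={predicted:.6f}  "
          f"gap to pi={np.pi - predicted:.2e}")
```

The gap to $\pi$ shrinks roughly like $4/(\beta\mu)$ as $\beta$ grows but never closes for finite $\beta$, which is the sense in which a half-turn (and hence eigenvalue $-1$) is only approached asymptotically.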

## 5 Comparison with Deep Delta Learning

###### Proposition 5.1 (DDL Orthogonality Condition).

The Householder operator $\mathbf{H} = \mathbf{I} - \beta\mathbf{k}\mathbf{k}^\top$ with $\|\mathbf{k}\| = 1$ is orthogonal if and only if $\beta \in \{0,2\}$.

###### Proof.

$\mathbf{H}^\top\mathbf{H} = \mathbf{I} + (\beta^2 - 2\beta)\mathbf{k}\mathbf{k}^\top$. For $\mathbf{H}^\top\mathbf{H} = \mathbf{I}$: $\beta^2 - 2\beta = 0$, i.e. $\beta(\beta-2) = 0$. ∎

###### Corollary 5.2 (DDL Norm Distortion).

For $\beta \notin \{0,2\}$: $\|\mathbf{H}\mathbf{x}\|^2 = \|\mathbf{x}\|^2 + (\beta^2 - 2\beta)(\mathbf{k}^\top\mathbf{x})^2$. Whenever $\mathbf{k}^\top\mathbf{x} \neq 0$, norms shrink if $\beta \in (0,2)$ and grow if $\beta > 2$ or $\beta < 0$.
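The norm identity in Corollary 5.2 is easy to verify numerically. The sketch below uses a random unit direction $\mathbf{k}$ and a random input $\mathbf{x}$ (both assumptions for illustration) and reports the ratio $\|\mathbf{H}\mathbf{x}\| / \|\mathbf{x}\|$ for several values of $\beta$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
k = rng.normal(size=n)
k /= np.linalg.norm(k)                                   # unit direction
x = rng.normal(size=n)

ratios = {}
max_identity_error = 0.0
for beta in (-1.0, 0.0, 1.0, 2.0, 3.0):
    H = np.eye(n) - beta * np.outer(k, k)
    lhs = np.linalg.norm(H @ x) ** 2
    rhs = x @ x + (beta**2 - 2.0 * beta) * (k @ x) ** 2  # Corollary 5.2
    max_identity_error = max(max_identity_error, abs(lhs - rhs))
    ratios[beta] = np.sqrt(lhs / (x @ x))                # ||Hx|| / ||x||
    print(f"beta={beta:+.1f}  ||Hx||/||x|| = {ratios[beta]:.4f}")
```

Only $\beta = 0$ and $\beta = 2$ should give a ratio of one; values inside $(0,2)$ contract and values outside expand, matching the sign of $\beta^2 - 2\beta$.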

Table [1](https://arxiv.org/html/2605.06729#S5.T1) summarizes the comparison.

Table 1: Comprehensive comparison of residual connection architectures.

## 6 The Negation Problem and Hybrid Solution

### 6.1 Why Negation Matters

For correction tasks (e.g., “Actually, no” or “Wait, I meant”), the model must rapidly negate previous information, requiring eigenvalue $\lambda = -1$: $\mathbf{T}\mathbf{x} = -\mathbf{x}$ along some direction. Theorem [4.6](https://arxiv.org/html/2605.06729#S4.Thmtheorem6) proves that Cayley *cannot* achieve this for any finite parameters.

### 6.2 Householder Reflection

###### Definition 6.1 (Householder Reflection).

$\mathbf{H}_\beta(\mathbf{k}) = \mathbf{I} - \beta\,\mathbf{k}\mathbf{k}^\top$, with $\|\mathbf{k}\| = 1$.

###### Theorem 6.2 (Householder Eigenvalue Structure).

$\mathbf{H}_\beta(\mathbf{k})$ has eigenvalues: $\lambda = 1$ with multiplicity $(n-1)$ for $\mathbf{v} \perp \mathbf{k}$, and $\lambda = 1-\beta$ along $\mathbf{k}$.

###### Corollary 6.3 (Negation at $\beta = 2$).

When $\beta = 2$: $\mathbf{H}_2(\mathbf{k})\mathbf{k} = -\mathbf{k}$. This is the eigenvalue $\lambda = -1$ that Cayley cannot achieve.
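The eigenvalue structure of Theorem 6.2 and the negation property of Corollary 6.3 can be confirmed with a few lines of NumPy. The helper name `householder` and the random direction are illustrative assumptions.

```python
import numpy as np

def householder(k, beta):
    """H_beta(k) = I - beta * k k^T, with k normalised to unit length."""
    k = k / np.linalg.norm(k)
    return np.eye(k.size) - beta * np.outer(k, k)

rng = np.random.default_rng(4)
n = 4
k = rng.normal(size=n)
k_hat = k / np.linalg.norm(k)

H2 = householder(k, 2.0)
eigs_H2 = np.linalg.eigvalsh(H2)                    # H is symmetric; ascending order
det_H2 = np.linalg.det(H2)
negation_error = np.abs(H2 @ k_hat + k_hat).max()   # checks H_2 k = -k
eigs_H_half = np.linalg.eigvalsh(householder(k, 0.5))

print("eigenvalues at beta=2  :", np.round(eigs_H2, 6))
print("eigenvalues at beta=0.5:", np.round(eigs_H_half, 6))
print("det(H_2) =", round(det_H2, 6))
```

At $\beta = 2$ the spectrum is one eigenvalue $-1$ and $n-1$ eigenvalues $+1$, so the determinant is $-1$; at $\beta = 0.5$ the eigenvalue along $\mathbf{k}$ is $1 - \beta = 0.5$, which is why intermediate $\beta$ contracts that direction.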

###### Theorem 6.4 (Householder Orthogonality).

$\mathbf{H}_\beta$ is orthogonal if and only if $\beta \in \{0,2\}$.

### 6.3 The E$\Delta$-MHC-Geo Hybrid Architecture

###### Definition 6.5 (E$\Delta$-MHC-Geo Hybrid Operator).

$$\mathcal{G}_\gamma(\mathbf{X}) = \gamma(\mathbf{X}) \cdot \underbrace{\mathbf{Q}(\mathbf{X})\mathbf{X}}_{\text{Cayley rotation}} + (1-\gamma(\mathbf{X})) \cdot \underbrace{\mathbf{H}_2(\mathbf{k}(\mathbf{X}))\mathbf{X}}_{\text{Householder reflection}}, \tag{5}$$

where $\gamma(\mathbf{X}) = \sigma(\mathbf{W}_\gamma \cdot \bar{\mathbf{X}} + b_\gamma) \in (0,1)$ is the learned gate, and $\mathbf{k}(\mathbf{X}) = \mathrm{normalize}(f_k(\bar{\mathbf{X}}))$.

The full E$\Delta$-MHC-Geo Hybrid layer transition with mHC pre/post mappings is:

$$\mathbf{X}_{l+1} = \mathcal{G}_\gamma(\mathbf{X}_l) + \mathbf{H}_{\mathrm{post}}^\top F\bigl(\mathbf{H}_{\mathrm{pre}} \cdot \mathrm{LN}(\mathcal{G}_\gamma(\mathbf{X}_l))\bigr). \tag{6}$$

###### Theorem 6.6 (Boundary Access to Both $\mathrm{O}(n)$ Components).

The E$\Delta$-MHC-Geo Hybrid accesses both connected components of $\mathrm{O}(n)$ at the gate boundaries: when $\gamma \to 1$, $\mathcal{G}_\gamma \to \mathbf{Q} \in \mathrm{SO}(n)$ ($\det = +1$); when $\gamma \to 0$, $\mathcal{G}_\gamma \to \mathbf{H}_2 \in \mathrm{O}(n) \setminus \mathrm{SO}(n)$ ($\det = -1$). The blended operator at intermediate $\gamma \in (0,1)$ is *not* itself orthogonal (Theorem [7.1](https://arxiv.org/html/2605.06729#S7.Thmtheorem1)), but the midpoint collapse regularizer drives $\gamma$ toward the boundaries, ensuring the effective operator is near-orthogonal.
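Equations (5) and (6) can be sketched for a single sequence as follows. This is a simplified illustration rather than the released implementation: it uses linear generators without biases, a toy `tanh` in place of the attention/MLP function $F$, and hypothetical parameter names (`Wu`, `Wv`, `Wk`, `wb`, `wg`, `bg`, `Hpre`, `Hpost`) and sizes.

```python
import numpy as np

def layer_norm(Z, eps=1e-5):
    mean = Z.mean(axis=-1, keepdims=True)
    var = Z.var(axis=-1, keepdims=True)
    return (Z - mean) / np.sqrt(var + eps)

def geo_operator(X, p, n):
    """G_gamma(X) from Eq. (5) for one sequence X of shape (S, D)."""
    S, D = X.shape
    x_bar = X.mean(axis=0)                              # mean-pool over the sequence
    u, v = p["Wu"] @ x_bar, p["Wv"] @ x_bar             # linear generators (biases omitted)
    k = p["Wk"] @ x_bar
    k = k / np.linalg.norm(k)                           # unit reflection direction
    beta = np.log1p(np.exp(p["wb"] @ x_bar))            # softplus keeps beta > 0
    gamma = 1.0 / (1.0 + np.exp(-(p["wg"] @ x_bar + p["bg"])))
    M = 0.5 * beta * (np.outer(u, v) - np.outer(v, u))
    Q = np.linalg.solve(np.eye(n) + M, np.eye(n) - M)   # Cayley rotation
    H2 = np.eye(n) - 2.0 * np.outer(k, k)               # Householder reflection
    Xs = X.reshape(S, n, D // n)                        # n streams of width d = D/n
    G = gamma * (Q @ Xs) + (1.0 - gamma) * (H2 @ Xs)
    return G.reshape(S, D), gamma

def hybrid_layer(X, p, n, F):
    """X_{l+1} = G + H_post^T F(H_pre LN(G)) from Eq. (6), tokens stored as rows."""
    G, gamma = geo_operator(X, p, n)
    Z = layer_norm(G) @ p["Hpre"].T
    return G + F(Z) @ p["Hpost"], gamma

rng = np.random.default_rng(5)
S, D, n = 10, 16, 4                                     # hypothetical sizes
p = {name: 0.1 * rng.normal(size=(n, D)) for name in ("Wu", "Wv", "Wk")}
p.update(wb=0.1 * rng.normal(size=D), wg=0.1 * rng.normal(size=D), bg=0.0,
         Hpre=np.eye(D), Hpost=np.eye(D))               # identity-initialised projections
X = rng.normal(size=(S, D))
X_next, gamma = hybrid_layer(X, p, n, F=np.tanh)        # tanh is a toy stand-in for F
print("gate value:", float(gamma), " output shape:", X_next.shape)
```

Pushing the gate bias `bg` to a large positive or negative value saturates $\gamma$ at a boundary, in which case `geo_operator` reduces to a pure rotation or a pure reflection of the stream axis and preserves the Frobenius norm of `X`.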

Table 2: Gate behavior: how the learned gate selects operators.

## 7 Midpoint Collapse Regularization

### 7.1 The Topological Gap

The orthogonal group $\mathrm{O}(n)$ has two disconnected components: $\mathrm{SO}(n)$ (rotations, $\det = +1$) and $\mathrm{O}(n) \setminus \mathrm{SO}(n)$ (reflections, $\det = -1$). There is no continuous path between them that stays on the orthogonal manifold.

###### Theorem 7.1 (Non-Orthogonality at Midpoint).

Let $\mathbf{Q} \in \mathrm{SO}(n)$ and $\mathbf{H} \in \mathrm{O}(n) \setminus \mathrm{SO}(n)$. The linear combination $\mathbf{M} = \gamma\mathbf{Q} + (1-\gamma)\mathbf{H}$ satisfies $\mathbf{M}^\top\mathbf{M} \neq \mathbf{I}$ for $\gamma \in (0,1)$.

###### Proof.

$\mathbf{M}^\top\mathbf{M} = \gamma^2\mathbf{I} + (1-\gamma)^2\mathbf{I} + \gamma(1-\gamma)(\mathbf{Q}^\top\mathbf{H} + \mathbf{H}^\top\mathbf{Q})$. Writing $\mathbf{R} = \mathbf{Q}^\top\mathbf{H}$ and using $\gamma^2 + (1-\gamma)^2 - 1 = -2\gamma(1-\gamma)$ gives $\mathbf{M}^\top\mathbf{M} - \mathbf{I} = \gamma(1-\gamma)(\mathbf{R} + \mathbf{R}^\top - 2\mathbf{I})$. Since $\mathbf{R}$ is orthogonal with $\det\mathbf{R} = -1$, it has a unit eigenvector $\mathbf{w}$ with $\mathbf{R}\mathbf{w} = -\mathbf{w}$, so $\mathbf{w}^\top(\mathbf{R} + \mathbf{R}^\top - 2\mathbf{I})\mathbf{w} = -4 \neq 0$. Hence $\mathbf{M}^\top\mathbf{M} \neq \mathbf{I}$ for every $\gamma \in (0,1)$. ∎
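The size of the orthogonality defect in Theorem 7.1 can be measured directly. The sketch below blends a Cayley rotation with a Householder reflection over a grid of gate values and records the spectral norm of $\mathbf{M}^\top\mathbf{M} - \mathbf{I}$. As a side calculation that is not stated in the paper, $\mathbf{M}^\top\mathbf{M} - \mathbf{I} = \gamma(1-\gamma)(\mathbf{Q}^\top\mathbf{H} + \mathbf{H}^\top\mathbf{Q} - 2\mathbf{I})$ and $\mathbf{Q}^\top\mathbf{H}$ always has an eigenvalue $-1$, so this norm should equal $4\gamma(1-\gamma)$; the code checks that prediction. The random inputs and the value $\beta = 1.3$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
u, v, k = rng.normal(size=(3, n))
k /= np.linalg.norm(k)

M_skew = 0.5 * 1.3 * (np.outer(u, v) - np.outer(v, u))         # beta = 1.3, arbitrary
Q = np.linalg.solve(np.eye(n) + M_skew, np.eye(n) - M_skew)    # rotation, det = +1
H = np.eye(n) - 2.0 * np.outer(k, k)                           # reflection, det = -1

gammas = np.linspace(0.0, 1.0, 11)
defects = []
for g in gammas:
    B = g * Q + (1.0 - g) * H                                  # blended operator
    defects.append(np.linalg.norm(B.T @ B - np.eye(n), ord=2)) # spectral norm
defects = np.array(defects)

for g, d in zip(gammas, defects):
    print(f"gamma={g:.1f}  defect={d:.4f}  4*gamma*(1-gamma)={4 * g * (1 - g):.4f}")
```

The defect vanishes only at the two boundaries and peaks at $\gamma = 0.5$, which is the behavior the gate regularizer is meant to discourage.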

### 7.2 The “Jump, Don’t Swim” Strategy

###### Definition 7.2 (Midpoint Collapse Regularization).

$$\mathcal{L}_{\mathrm{gate}} = \lambda_{\mathrm{gate}} \cdot 4\gamma(1-\gamma). \tag{7}$$

With $\lambda_{\mathrm{gate}} = 1$, this function equals $0$ at $\gamma \in \{0,1\}$ (pure operators) and $1$ at $\gamma = 0.5$ (maximum penalty). Its gradient $\partial\mathcal{L}/\partial\gamma = 4(1-2\gamma)$ is positive for $\gamma < 0.5$ and negative for $\gamma > 0.5$, so gradient descent pushes $\gamma$ toward the nearer boundary.
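A minimal sketch of the regularizer and its gradient follows, with a finite-difference check. The function names `gate_penalty` and `gate_penalty_grad` and the default `lam=1.0` are assumptions for illustration.

```python
import numpy as np

def gate_penalty(gamma, lam=1.0):
    # L_gate = lam * 4 * gamma * (1 - gamma), Eq. (7)
    return lam * 4.0 * gamma * (1.0 - gamma)

def gate_penalty_grad(gamma, lam=1.0):
    # dL_gate / dgamma = lam * 4 * (1 - 2 * gamma)
    return lam * 4.0 * (1.0 - 2.0 * gamma)

gammas = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
values = gate_penalty(gammas)            # [0, 0.75, 1, 0.75, 0]
grads = gate_penalty_grad(gammas)        # [4, 2, 0, -2, -4]

eps = 1e-6                               # central finite difference as a cross-check
fd = (gate_penalty(gammas + eps) - gate_penalty(gammas - eps)) / (2.0 * eps)

for g, val, gr, f in zip(gammas, values, grads, fd):
    print(f"gamma={g:.2f}  penalty={val:.2f}  grad={gr:+.1f}  finite-diff={f:+.4f}")
```

Because a descent step moves against the gradient, a positive gradient below $0.5$ lowers $\gamma$ toward $0$ and a negative gradient above $0.5$ raises it toward $1$.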

Figure 2: Gradient flow of midpoint collapse regularization. The gradient $\partial\mathcal{L}/\partial\gamma = 4(1-2\gamma)$ is positive for $\gamma < 0.5$ and negative for $\gamma > 0.5$, but exactly zero at $\gamma = 0.5$. Marked points: $\gamma = 0$ (Householder, $\nabla = +4$), $\gamma = 0.25$ ($\nabla = +2$), $\gamma = 0.5$ (critical point, $\nabla = 0$), $\gamma = 0.75$ ($\nabla = -2$), $\gamma = 1$ (Cayley, $\nabla = -4$). Escape requires external forces: (1) task loss with $\nabla\mathcal{L}_{\mathrm{task}} \neq 0$, (2) input variation, (3) biased initialization $b \neq 0$ (Theorem [7.3](https://arxiv.org/html/2605.06729#S7.Thmtheorem3)).

###### Theorem 7.3 (Universal Zero-Gradient at Midpoint).

Any smooth, symmetric regularization $f: [0,1] \to \mathbb{R}$ with $f(\gamma) = f(1-\gamma)$ has zero gradient at $\gamma = 0.5$.

###### Proof.

Differentiating $f(\gamma) = f(1-\gamma)$: $f'(\gamma) = -f'(1-\gamma)$. At $\gamma = 0.5$: $f'(0.5) = -f'(0.5)$, hence $f'(0.5) = 0$. ∎
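The stall predicted by Theorem 7.3, and the biased-initialization escape, can be illustrated with a toy experiment. The sketch below runs plain gradient descent on the regularizer alone (no task loss), with the gate parameterized as $\gamma = \sigma(b)$ for a scalar logit $b$. The step count, learning rate, and starting logits are arbitrary assumptions.

```python
import numpy as np

def sigmoid(b):
    return 1.0 / (1.0 + np.exp(-b))

def descend_gate(b0, steps=200, lr=1.0):
    """Gradient descent on 4*gamma*(1-gamma) with gamma = sigmoid(b)."""
    b = b0
    for _ in range(steps):
        gamma = sigmoid(b)
        # chain rule: dL/db = dL/dgamma * dgamma/db
        grad_b = 4.0 * (1.0 - 2.0 * gamma) * gamma * (1.0 - gamma)
        b -= lr * grad_b
    return sigmoid(b)

gamma_mid = descend_gate(0.0)     # unbiased init: starts exactly at the midpoint
gamma_pos = descend_gate(+0.1)    # slightly biased toward the Cayley branch
gamma_neg = descend_gate(-0.1)    # slightly biased toward the Householder branch
print(f"b0= 0.0 -> gamma={gamma_mid:.4f}")
print(f"b0=+0.1 -> gamma={gamma_pos:.4f}")
print(f"b0=-0.1 -> gamma={gamma_neg:.4f}")
```

With $b_0 = 0$ the gradient is exactly zero and the gate never moves, while a small bias in either direction is enough for the regularizer to drive the gate to the corresponding boundary. In a real model the task loss and input variation would supply similar perturbations.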

The total loss becomes:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{task}} + \sum_{\text{layers}} \mathcal{L}_{\mathrm{gate}}. \tag{8}$$

## 8 Architecture Details

### 8.1 E$\Delta$-MHC-Geo Operator (Data-Dependent Cayley Rotation)

The E$\Delta$-MHC-Geo operator computes a data-dependent rotation as follows:

1. Pool input: $\bar{\mathbf{x}} = \mathrm{MeanPool}(\mathbf{X})$.
2. Compute generators: $\mathbf{u} = f_u(\bar{\mathbf{x}})$, $\mathbf{v} = f_v(\bar{\mathbf{x}})$, $\beta = \mathrm{Softplus}(f_\beta(\bar{\mathbf{x}}))$.
3. Build skew-symmetric: $\mathbf{A} = \mathbf{u}\mathbf{v}^\top - \mathbf{v}\mathbf{u}^\top$.
4. Cayley transform: $\mathbf{Q} = (\mathbf{I} + \tfrac{\beta}{2}\mathbf{A})^{-1}(\mathbf{I} - \tfrac{\beta}{2}\mathbf{A})$ via `torch.linalg.solve`.
5. Apply: reshape $\mathbf{X}$ into $n$ streams, rotate via $\mathbf{Q}$.
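The five steps above can be sketched for a whole batch as follows. To keep the example dependency-light it uses NumPy's batched `np.linalg.solve` in place of `torch.linalg.solve`, and linear generators in place of the two-layer MLPs; the function name `cayley_streams`, the weight names, and all sizes are assumptions for illustration.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def cayley_streams(X, Wu, Wv, wb, n):
    """Steps 1-5 for a batch X of shape (B, S, D); returns rotated X and Q."""
    B, S, D = X.shape
    x_bar = X.mean(axis=1)                                   # 1. pool: (B, D)
    u, v = x_bar @ Wu.T, x_bar @ Wv.T                        # 2. generators: (B, n)
    beta = softplus(x_bar @ wb)                              #    magnitude: (B,)
    A = (u[:, :, None] * v[:, None, :]
         - v[:, :, None] * u[:, None, :])                    # 3. skew-symmetric: (B, n, n)
    M = 0.5 * beta[:, None, None] * A
    I = np.eye(n)
    Q = np.linalg.solve(I + M, I - M)                        # 4. batched Cayley solve
    Xs = X.reshape(B, S, n, D // n)                          # 5. split D into n streams
    Y = np.einsum("bij,bsjd->bsid", Q, Xs).reshape(B, S, D)  #    rotate the stream axis
    return Y, Q

rng = np.random.default_rng(7)
B, S, D, n = 2, 5, 16, 4                                     # hypothetical sizes
X = rng.normal(size=(B, S, D))
Wu, Wv = 0.5 * rng.normal(size=(2, n, D))
wb = 0.5 * rng.normal(size=D)

Y, Q = cayley_streams(X, Wu, Wv, wb, n)
orth_error = np.abs(np.einsum("bji,bjk->bik", Q, Q) - np.eye(n)).max()
norm_error = np.abs(np.linalg.norm(Y, axis=(1, 2)) - np.linalg.norm(X, axis=(1, 2))).max()
print(f"|Q^T Q - I| = {orth_error:.1e}, per-sample norm drift = {norm_error:.1e}")
```

Solving the linear system avoids forming an explicit inverse, and because each sample gets its own $n \times n$ matrix with small $n$, the cost of this step is independent of the sequence length.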

[Figure 3 diagram, node labels: $\mathbf{X}_l\in\mathbb{R}^{B\times S\times D}$; $\bar{\mathbf{x}}=\mathrm{MeanPool}(\mathbf{X}_l)$; $\mathbf{u}=f_u(\bar{\mathbf{x}})$, $\mathbf{v}=f_v(\bar{\mathbf{x}})$, $\mathbf{k}=f_k(\bar{\mathbf{x}})$, $\beta=f_\beta(\bar{\mathbf{x}})$; Cayley branch ($\det=+1$): $\mathbf{Q}=(\mathbf{I}+\tfrac{\beta}{2}\mathbf{A})^{-1}(\mathbf{I}-\tfrac{\beta}{2}\mathbf{A})$, $Y_C=\mathbf{Q}\mathbf{X}_l$; Householder branch ($\det=-1$): $\mathbf{H}_2=\mathbf{I}-2\mathbf{k}\mathbf{k}^\top$, $Y_H=\mathbf{H}_2\mathbf{X}_l$; gate: $\gamma=\sigma(w^\top\bar{\mathbf{x}}+b)$, $\mathbf{X}_{\mathrm{geo}}=\gamma\,Y_C+(1-\gamma)\,Y_H$; sub-layer path: $\mathrm{LN}\to H_{\mathrm{pre}}\to F(\cdot)\to H_{\mathrm{post}}^\top$; output: $\mathbf{X}_{l+1}=\mathbf{X}_{\mathrm{geo}}+H_{\mathrm{post}}^\top F(\cdot)$.]

Figure 3: E$\Delta$-MHC-Geo Hybrid block architecture. Input $\mathbf{X}_l$ is processed through parallel branches: Cayley rotation ($\mathbf{Q}\in\mathrm{SO}(n)$, unconditionally orthogonal) and Householder reflection ($\mathbf{H}_2$, $\beta=2$ fixed). The learned gate $\gamma(\mathbf{X})$ blends both branches. In the main reported implementation, $H_{\mathrm{pre}}$ and $H_{\mathrm{post}}$ are learned full-dimensional pre/post projections initialized as identity.

### 8.2 Integration with mHC Framework

Following DeepSeek AI ([2024](https://arxiv.org/html/2605.06729#bib.bib3)), each E$\Delta$-MHC-Geo block retains learned pre/post mappings around the attention/MLP function. In the main reported model these are full-dimensional linear projections, initialized as identity and shared across inputs. This differs from strict mHC stream aggregation/broadcast; a separate `edelta_stream` variant implements dynamic stream routing with per-stream compute. Our main contribution replaces mHC’s doubly stochastic residual mixer with the orthogonal $\mathbf{Q}(\mathbf{X})$ (or $\mathcal{G}_\gamma(\mathbf{X})$ for the Hybrid), gaining exact orthogonality, input adaptivity, and a direct reflection branch (Table [3](https://arxiv.org/html/2605.06729#S8.T3)).

Table 3: Comparison with mHC (DeepSeek AI, [2024](https://arxiv.org/html/2605.06729#bib.bib3)) and JPmHC (Sengupta et al., [2026](https://arxiv.org/html/2605.06729#bib.bib10)).

### 8.3 Full Transformer Architecture

The E$\Delta$-MHC-Geo Transformer stacks $L$ blocks, each applying the geometric operator twice (before attention and before MLP):

[Figure 4 diagram, node labels in reading order: Embedding + Positional; within each of the $\times L$ blocks: $\mathcal{G}_\gamma$, LayerNorm, $H_{\mathrm{pre}}$, Multi-Head Attn, $H_{\mathrm{post}}^\top$, $+$, then $\mathcal{G}_\gamma$, LayerNorm, $H_{\mathrm{pre}}$, Feed-Forward, $H_{\mathrm{post}}^\top$, $+$; finally LayerNorm and Output Head.]

Figure 4: Full E$\Delta$-MHC-Geo Transformer. The geometric operator $\mathcal{G}_\gamma$ (green) replaces identity shortcuts. In the main model, $H_{\mathrm{pre}}/H_{\mathrm{post}}$ are learned full-dimensional pre/post projections; the stream-routed variant is reported separately. $L=6$ layers for our model.

### 8.4 Properties of the Hybrid Operator

The following results characterize the geometric properties of the E$\Delta$-MHC-Geo Hybrid and are summarized in Table [4](https://arxiv.org/html/2605.06729#S8.T4).

###### Theorem 8.1 (Component-wise Orthogonality).

Both components of the E$\Delta$-MHC-Geo Hybrid are individually orthogonal: $\mathbf{Q}(\mathbf{X})^\top\mathbf{Q}(\mathbf{X})=\mathbf{I}$ (Cayley, always) and $\mathbf{H}_2(\mathbf{X})^\top\mathbf{H}_2(\mathbf{X})=\mathbf{I}$ (Householder at $\beta=2$).

###### Proposition 8.2 (Approximate Isometry).

Let $\mathbf{X}'=\gamma\mathbf{Q}\mathbf{X}+(1-\gamma)\mathbf{H}\mathbf{X}$. Then $\|\mathbf{X}'\|^2=\gamma^2\|\mathbf{X}\|^2+(1-\gamma)^2\|\mathbf{X}\|^2+2\gamma(1-\gamma)\langle\mathbf{Q}\mathbf{X},\mathbf{H}\mathbf{X}\rangle$. Exact isometry holds at $\gamma\in\{0,1\}$. In between, since $\gamma^2+(1-\gamma)^2=1-2\gamma(1-\gamma)$, the deviation from isometry is $\|\mathbf{X}'\|^2-\|\mathbf{X}\|^2=-2\gamma(1-\gamma)\big(\|\mathbf{X}\|^2-\langle\mathbf{Q}\mathbf{X},\mathbf{H}\mathbf{X}\rangle\big)$: it is controlled by the factor $2\gamma(1-\gamma)$ together with the cross-term, and its magnitude is at most $4\gamma(1-\gamma)\|\mathbf{X}\|^2$ by Cauchy–Schwarz. The midpoint collapse regularizer minimizes it by driving $\gamma$ toward the boundaries.
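The norm identity in Proposition 8.2 can be checked numerically. The sketch below is illustrative only: the sizes, the random generator vectors, and the choice $\beta=1$ are assumptions, and $\|\cdot\|$ is taken to be the Frobenius norm. It also evaluates the rearranged form of the deviation, $-2\gamma(1-\gamma)(\|\mathbf{X}\|^2-\langle\mathbf{Q}\mathbf{X},\mathbf{H}\mathbf{X}\rangle)$, which follows from $\gamma^2+(1-\gamma)^2=1-2\gamma(1-\gamma)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 8                                   # n streams, m columns (illustrative sizes)
X = rng.normal(size=(n, m))
I = np.eye(n)

# Cayley rotation from a rank-2 skew-symmetric generator (beta = 1 assumed).
u, v = rng.normal(size=n), rng.normal(size=n)
A = np.outer(u, v) - np.outer(v, u)
Q = np.linalg.solve(I + 0.5 * A, I - 0.5 * A)

# Householder reflection at beta = 2 with a unit vector k.
k = rng.normal(size=n)
k /= np.linalg.norm(k)
H = I - 2.0 * np.outer(k, k)

def blend_norm_sq(gamma):
    """Squared Frobenius norm of gamma*Q X + (1 - gamma)*H X."""
    Y = gamma * (Q @ X) + (1.0 - gamma) * (H @ X)
    return np.sum(Y * Y)

norm_sq = np.sum(X * X)
cross = np.sum((Q @ X) * (H @ X))             # Frobenius inner product <QX, HX>

gamma = 0.3
lhs = blend_norm_sq(gamma)
rhs = gamma**2 * norm_sq + (1 - gamma)**2 * norm_sq + 2 * gamma * (1 - gamma) * cross
deviation = lhs - norm_sq
closed_form = -2 * gamma * (1 - gamma) * (norm_sq - cross)
```

At $\gamma=0$ and $\gamma=1$ the blend reduces to a single orthogonal operator, so `blend_norm_sq` returns `norm_sq` up to round-off; for intermediate $\gamma$ the blend can only shrink the norm, never grow it.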

###### Proposition 8.3 (Determinant Structure).

$\det(\mathcal{G}_\gamma)=+1$ if $\gamma=1$ (rotation), $-1$ if $\gamma=0$ (reflection), and varies for $\gamma\in(0,1)$. This enables the model to select the appropriate connected component of $\mathrm{O}(n)$ via $\gamma$ alone.
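A small numerical check of the determinant structure, under assumed sizes and random generator vectors (none of which come from the paper's experiments), is sketched below. It treats $\mathcal{G}_\gamma$ as the matrix $\gamma\mathbf{Q}+(1-\gamma)\mathbf{H}_2$ acting on the stream axis.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
I = np.eye(n)
u, v, k = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
k /= np.linalg.norm(k)

A = np.outer(u, v) - np.outer(v, u)           # rank-2 skew-symmetric generator
Q = np.linalg.solve(I + A, I - A)             # Cayley rotation with beta = 2 (assumed)
H = I - 2.0 * np.outer(k, k)                  # Householder reflection

def det_blend(gamma):
    """Determinant of the blended operator gamma*Q + (1 - gamma)*H."""
    return np.linalg.det(gamma * Q + (1.0 - gamma) * H)

det_rotation = det_blend(1.0)                 # gamma = 1 selects Q
det_reflection = det_blend(0.0)               # gamma = 0 selects H
dets = [det_blend(g) for g in np.linspace(0.0, 1.0, 11)]
```

Since the determinant is continuous in $\gamma$ and changes sign between the two endpoints, the blended matrix must be singular for at least one intermediate $\gamma$. This is another way to see why boundary gate values are preferred.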

Table 4: Capability summary of geometric operators.

## 9 Experimental Validation

### 9.1 Setup

All models are configured for fair comparison with matched parameter counts ($\sim 1.79$M parameters). Rather than reducing E$\Delta$-MHC-Geo’s capacity, we scale up baseline layer counts to match (Table [5](https://arxiv.org/html/2605.06729#S9.T5)).

Table 5: Model configurations with matched parameters ($\sim 1.79$M).

Training uses AdamW (Loshchilov & Hutter, [2017](https://arxiv.org/html/2605.06729#bib.bib7)) with $\mathrm{lr}=10^{-3}$ (cosine decay to $10^{-4}$), weight decay $0.1$, gradient clipping at $1.0$, batch size $64$, and $2000$ iterations. The JPmHC baseline follows the March 2026 v2 architecture: per-token generation of $H_{\mathrm{pre}}$, $H_{\mathrm{post}}$, and $H_{\mathrm{res}}$, row/column softmax constraints for the pre/post mixers, and the fixed-point Cayley residual retraction with $\alpha=0.1$ and $s=2$ iterations.
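For concreteness, the learning-rate schedule described above could look like the following sketch. The exact form is an assumption: the text states only the endpoints ($10^{-3}$ to $10^{-4}$) and the iteration count, so the absence of warmup and the half-cosine shape are guesses, and `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(step, total_steps=2000, lr_max=1e-3, lr_min=1e-4):
    """Half-cosine decay from lr_max at step 0 to lr_min at total_steps."""
    t = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

# Learning rate at the start, midpoint, and end of training.
schedule = [cosine_lr(s) for s in (0, 1000, 2000)]
```

Under these assumptions the midpoint learning rate is the average of the two endpoints.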

### 9.2 Benchmark Datasets

Gyroscope (manifold precision): predict continuous rotation trajectories on $\mathrm{SO}(n)$ ($d=16$, seq. len. $255$, $9000$ train). Tests whether models maintain manifold constraints.

Stability (long-horizon isometry): predict echo sequences with $\|\mathbf{x}_{t+1}\|=\|\mathbf{x}_t\|$ ($d=64$, seq. len. $127$, $900$ train). Tests norm preservation over extended sequences.

Reflection (negation): learn pure negation $\mathbf{y}=-\mathbf{x}$ ($d=64$, sample sizes $[10\text{--}500]$, $2000$ iter). Validates geometric operator behavior following the “Illusion of Insight” methodology (Shojaee et al., [2025](https://arxiv.org/html/2605.06729#bib.bib12)).

Near-$\pi$ rotation: controlled experiments with $\theta=177.6^\circ$ (single-plane) and $\theta=179.9^\circ$ (all 32 planes) to probe the boundary between rotation and reflection.

### 9.3 Main Results

Table 6: Validation loss (mean $\pm$ std over 3 seeds; lower is better). All models have $\sim 1.79$M parameters. $L$ = number of layers.

Table [6](https://arxiv.org/html/2605.06729#S9.T6) presents the core results (mean $\pm$ std over 3 seeds). On the gyroscope benchmark, which requires maintaining manifold constraints over 255-step sequences, E$\Delta$-MHC-Geo achieves $3.7\times$ and $3.3\times$ improvement over GPT and DDL respectively, with 33% fewer layers and a $4\times$ narrower representation ($n_{\mathrm{embd}}=128$ vs. $512$). JPmHC achieves the overall lowest loss ($3.08\times 10^{-4}$), benefiting from its wider representation ($n_{\mathrm{embd}}=512$, $d_{\mathrm{stream}}=128$), which provides richer per-stream expressivity for this rotation-heavy task. Both geometric models (JPmHC and E$\Delta$-MHC-Geo) dramatically outperform all non-geometric baselines, confirming the value of orthogonal residual connections.

On the stability benchmark, the ranking reverses: E$\Delta$-MHC-Geo achieves the lowest loss ($4.35\times 10^{-6}$), outperforming JPmHC ($8.19\times 10^{-6}$) by $1.9\times$ and GPT ($1.64\times 10^{-5}$) by $3.8\times$. DDL reaches $1.53\times 10^{-5}$, while mHC fails catastrophically ($8.65\times 10^{-3}$). The reversal is revealing: stability requires precise norm preservation over 127-step sequences, where E$\Delta$-MHC-Geo’s *exact* analytical orthogonality outperforms JPmHC’s iterative approximation. E$\Delta$-MHC-Geo maintains excellent norm preservation with mean deviation of just $0.001$ (Figure [5(b)](https://arxiv.org/html/2605.06729#S9.F5.sf2)), validating Theorem [4.3](https://arxiv.org/html/2605.06729#S4.Thmtheorem3).
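The quoted improvement factors follow directly from the stability losses stated in this paragraph. The short sketch below recomputes them; the dictionary keys are only labels chosen for the example.

```python
# Stability validation losses as quoted in the text (mean over 3 seeds).
stability_loss = {
    "EDelta-MHC-Geo": 4.35e-6,
    "JPmHC": 8.19e-6,
    "DDL": 1.53e-5,
    "GPT": 1.64e-5,
    "mHC": 8.65e-3,
}

ours = stability_loss["EDelta-MHC-Geo"]
# Ratio of each model's loss to the EDelta-MHC-Geo loss.
ratios = {name: loss / ours for name, loss in stability_loss.items()}
```

Rounded to one decimal place, the JPmHC and GPT ratios reproduce the reported $1.9\times$ and $3.8\times$ factors.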

### 9.4 Detailed Comparison with JPmHC v2

The comparison with JPmHC should be interpreted as a tradeoff rather than a uniform dominance claim. JPmHC uses a full $n\times n$ skew-symmetric generator for its residual mixer, so a single residual operator can rotate several independent stream planes at once. In contrast, E$\Delta$-MHC-Geo uses the rank-2 generator

$$\mathbf{A}(\mathbf{X})=\mathbf{u}(\mathbf{X})\mathbf{v}(\mathbf{X})^\top-\mathbf{v}(\mathbf{X})\mathbf{u}(\mathbf{X})^\top, \tag{9}$$

which applies one data-dependent planar rotation per geometric operator. This is less expressive per operator, but it is also more structured, cheaper to generate, and solved exactly rather than iteratively. Because the E$\Delta$-MHC-Geo block applies the geometric operator before both attention and MLP, and because multiple planar rotations compose across depth, the model can trade depth for rotational expressivity while preserving exact orthogonality at each Cayley step.
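To make the "one planar rotation per operator" statement concrete, the sketch below builds the rank-2 generator of Eq. (9) from random vectors and inspects the resulting Cayley matrix. The dimension, the seed, and $\beta$ are arbitrary assumptions. The angle formula uses the fact that the non-zero eigenvalues of $\mathbf{A}$ are $\pm i\mu$ with $\mu=\sqrt{\|\mathbf{u}\|^2\|\mathbf{v}\|^2-(\mathbf{u}^\top\mathbf{v})^2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 6, 0.8
u, v = rng.normal(size=n), rng.normal(size=n)

A = np.outer(u, v) - np.outer(v, u)           # Eq. (9): rank-2, skew-symmetric
rank_A = np.linalg.matrix_rank(A)

I = np.eye(n)
Q = np.linalg.solve(I + 0.5 * beta * A, I - 0.5 * beta * A)

# mu is the magnitude of the non-zero eigenvalues (+/- i*mu) of A.
mu = np.sqrt((u @ u) * (v @ v) - (u @ v) ** 2)
theta = 2.0 * np.arctan(0.5 * beta * mu)      # magnitude of the rotation angle

# A rotation in a single plane satisfies trace(Q) = (n - 2) + 2*cos(theta).
cos_from_trace = 0.5 * (np.trace(Q) - (n - 2))

# Any vector orthogonal to both u and v is left unchanged by Q.
basis, _ = np.linalg.qr(np.column_stack([u, v, rng.normal(size=n)]))
w = basis[:, 2]                               # orthogonal to span{u, v}
fixed_err = np.abs(Q @ w - w).max()
```

The check confirms that $\mathbf{Q}$ rotates only within $\mathrm{span}\{\mathbf{u},\mathbf{v}\}$ and acts as the identity on the remaining $n-2$ directions.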

This tradeoff matches the observed results. JPmHC performs best on the gyroscope benchmark, where the task is dominated by pure rotation and its wider $n_{\mathrm{embd}}=512$ stream representation plus full-rank mixer are beneficial. E$\Delta$-MHC-Geo performs best on long-horizon stability and near-$\pi$ rotation, where exact analytical orthogonality and conditioning near the Cayley boundary matter more. The reflection diagnostic tests a capability outside finite Cayley residual mixers: exact $\lambda=-1$ selection. There, E$\Delta$-MHC-Geo’s gate moves toward the Householder branch and reaches $0.96$ cosine alignment, while the JPmHC-style finite Cayley diagnostic remains negative. Thus JPmHC v2 currently has stronger large-scale evidence for orthogonal hyper-connections, while E$\Delta$-MHC-Geo contributes an exact, reflection-capable extension whose advantages are clearest on controlled geometric and stability tests.

![Refer to caption](https://arxiv.org/html/2605.06729v1/figures/journal_fig1_training.png) (a) Training loss and gradient dynamics.
![Refer to caption](https://arxiv.org/html/2605.06729v1/figures/journal_fig2_stability.png) (b) Stability analysis (norm preservation).

Figure 5: Training dynamics and stability. (a) E$\Delta$-MHC-Geo (green) shows smooth, stable loss decrease without the oscillations seen in DDL. (b) Norm preservation over 100 positions: E$\Delta$-MHC-Geo maintains norm $\approx 1.0$ (deviation $0.001$), JPmHC $0.004$, while GPT ($0.474$), DDL ($0.506$), and mHC ($0.543$) drift to $0.45$–$0.55$.

### 9.5 Reflection Experiment: Parameter Convergence

Table 7: Parameter convergence and cosine alignment on the negation task ($\mathbf{y}=-\mathbf{x}$). 500-sample row reports mean $\pm$ std over 3 seeds. JPmHC’s finite Cayley diagnostic has no direct reflection branch; its alignment remains negative at all sample sizes, validating the need for the Householder branch.

![Refer to caption](https://arxiv.org/html/2605.06729v1/figures/reflection_comprehensive.png)

Figure 6: Reflection experiment: negation cosine-alignment comparison (following Shojaee et al. ([2025](https://arxiv.org/html/2605.06729#bib.bib12))). (a) DDL’s $\beta\to 2.0$ and E$\Delta$’s $\gamma\to 0.0$ with increasing samples. (b) DDL and E$\Delta$-MHC-Geo reach $0.96$ cosine alignment at 500 samples; JPmHC remains negative at all sample sizes under this finite Cayley diagnostic. (c–e) Training dynamics at 500 samples: DDL discovers $\beta=2$, JPmHC is stuck with negative alignment, E$\Delta$-MHC-Geo selects $\gamma\to 0$ (Householder).

Table [7](https://arxiv.org/html/2605.06729#S9.T7) and Figure [6](https://arxiv.org/html/2605.06729#S9.F6) show that:

- DDL’s $\beta$ converges to $1.995\pm 0.001$ (within $0.25\%$ of the theoretical target $\beta=2$), validating Theorem [6.4](https://arxiv.org/html/2605.06729#S6.Thmtheorem4).
- E$\Delta$-MHC-Geo’s $\gamma$ converges to $0.051\pm 0.005$ (within $0.051$ of the target $\gamma=0$, i.e. $5.1\%$ of the gate range $[0,1]$), demonstrating *automatic operator selection*.
- JPmHC fails in this diagnostic: its cosine alignment remains negative even at 500 samples ($-0.25$), consistent with the fact that the finite Cayley map used in the residual mixer excludes exact $\lambda=-1$ (Theorem [4.6](https://arxiv.org/html/2605.06729#S4.Thmtheorem6)).
- Parameter convergence *precedes* alignment gains (the “Aha!” moment), confirming the geometric nature of learning.

### 9.6 Near-$\pi$ Rotation Analysis

To probe the boundary between rotation and reflection, we design controlled experiments with eigenvalues that *approach* but never exactly equal $-1$. This tests whether the learned gate can discriminate between tasks requiring pure operators versus those amenable to blended operators.

![Refer to caption](https://arxiv.org/html/2605.06729v1/figures/near_pi_rotation_comparison.png)

Figure 7: Near-$\pi$ rotation analysis. (a–b) Training curves on single-plane ($\theta=177.6^\circ$) and multi-plane ($\theta=179.9^\circ$) tasks. E$\Delta$-MHC-Geo and JPmHC converge to $\sim 10^{-6}$ loss, dramatically outperforming GPT, DDL, and mHC. Summary table reports all five models’ final losses. (c–e) Per-layer gate evolution ($\gamma$, layers L0–L5) on single-plane with three initializations: Cayley-biased ($\gamma_0\approx 0.82$), neutral ($\gamma_0=0.50$), and Householder-biased ($\gamma_0\approx 0.18$). (f–h) Same for multi-plane. All initializations converge to $\sim 10^{-6}$ loss, demonstrating robustness; the gate adapts per-layer to the task geometry.

###### Proposition 9.1 (Near-$\pi$ Cayley Boundary).

For any generated Cayley rotation plane and any angle $\theta\in(-\pi,\pi)$, there exist finite Cayley parameters realizing that angle. The exact endpoint $\theta=\pi$ is excluded and is approached only in the limit $\beta\mu\to\infty$.

###### Proof sketch.

The Cayley eigenvalues are $\lambda_k^C=e^{-2i\arctan(\beta\mu_k/2)}$. Solving $\theta=-2\arctan(\beta\mu/2)$ gives $\beta\mu=-2\tan(\theta/2)$, which is finite for every $\theta\in(-\pi,\pi)$. As $\theta\to\pm\pi$, the tangent diverges, so finite parameters cannot attain the endpoint. Thus near-$\pi$ rotations test optimization and conditioning near the Cayley boundary, whereas exact reflection requires a separate Householder branch. ∎
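The relation $\beta\mu=-2\tan(\theta/2)$ from the proof sketch can be evaluated directly to see how quickly the required parameter grows near the boundary. The helper names below are hypothetical, and the sign convention follows the eigenvalue formula quoted above.

```python
import math

def cayley_param_for_angle(theta):
    """Return beta*mu such that -2*arctan(beta*mu/2) == theta, for theta in (-pi, pi)."""
    if not -math.pi < theta < math.pi:
        raise ValueError("theta must lie strictly inside (-pi, pi)")
    return -2.0 * math.tan(theta / 2.0)

def cayley_angle(beta_mu):
    """Rotation angle realized by a finite Cayley parameter beta*mu."""
    return -2.0 * math.atan(beta_mu / 2.0)

# Angles used in the near-pi benchmarks, plus 90 degrees for comparison.
for degrees in (90.0, 177.6, 179.9):
    theta = math.radians(degrees)
    beta_mu = cayley_param_for_angle(theta)
    round_trip = math.degrees(cayley_angle(beta_mu))
    print(f"{degrees:6.1f} deg -> beta*mu = {beta_mu:.1f}, round trip = {round_trip:.4f} deg")
```

Every angle strictly inside $(-\pi,\pi)$ round-trips through a finite parameter, while the magnitude of $\beta\mu$ grows without bound as $\theta$ approaches $\pi$. This matches the reading of near-$\pi$ tasks as a conditioning problem rather than a topological obstruction.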

Figure [7](https://arxiv.org/html/2605.06729#S9.F7) and Table [8](https://arxiv.org/html/2605.06729#S9.T8) reveal a key insight: near-$\pi$ rotations are not topologically impossible for Cayley; they are difficult boundary cases. Multiple gate regimes can fit these tasks well: biased initializations often polarize toward the nearest boundary, while neutral initialization can retain intermediate or layer-specialized gates with comparable loss. In contrast, exact negation ($\mathbf{y}=-\mathbf{x}$) creates a direct need for the Householder branch, and the diagnostic probe drives $\gamma\to 0$. We therefore interpret the near-$\pi$ results as evidence of robustness near the Cayley boundary, not as proof that intermediate gate values are themselves orthogonal operators.

### 9.7 Initialization Robustness

A critical question for practical deployment is whether gate initialization affects final performance. We systematically vary `init_gate_bias` across $\{-1.5, 0.0, +1.5\}$, corresponding to initial $\gamma_0\in\{0.18, 0.50, 0.82\}$.
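The mapping from bias to initial gate value is just the logistic sigmoid, assuming the gate has the form $\gamma=\sigma(w^\top\bar{\mathbf{x}}+b)$ shown in Figure 3 and that the input-dependent term is approximately zero at initialization (an assumption made for this sketch):

```python
import math

def sigmoid(z):
    """Logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

init_gate_biases = (-1.5, 0.0, 1.5)
# Initial gate values, rounded to two decimals as in the text.
gamma0 = [round(sigmoid(b), 2) for b in init_gate_biases]
```

The three rounded values match the quoted $\gamma_0\in\{0.18, 0.50, 0.82\}$, and the symmetry $\sigma(-b)=1-\sigma(b)$ explains why the two biased initializations sit at equal distance from the midpoint.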

Table 8: Initialization robustness (mean $\pm$ std over 3 seeds): all initializations achieve comparable loss (within $4\times$), demonstrating that practitioners need not tune the gate initialization hyperparameter.

All six configurations (Table [8](https://arxiv.org/html/2605.06729#S9.T8)) achieve mean validation loss in the $10^{-6}$ to $10^{-7}$ range, varying by only $\sim 4\times$. The per-layer gate trajectories in Figure [7](https://arxiv.org/html/2605.06729#S9.F7)(c–h) reveal two findings: (i) biased initializations collapse to their nearest boundary ($\gamma_0>0.5\Rightarrow\gamma\to 1$; $\gamma_0<0.5\Rightarrow\gamma\to 0$), visible as all six layer curves converging to the same pole, consistent with the regularization basin structure; (ii) neutral initialization ($\gamma_0=0.5$) allows per-layer specialization to emerge, with different layers choosing different operators (panels d, g). Both regimes achieve equivalent performance, confirming that the loss landscape contains multiple good local optima.

### 9.8 Ablation Studies

Regularization weight. Table [9](https://arxiv.org/html/2605.06729#S9.T9) shows the effect of the midpoint collapse penalty $\lambda_{\mathrm{gate}}$. Without regularization ($\lambda=0$), the gate lingers in the non-orthogonal region $\gamma\in[0.3,0.7]$, degrading performance by $44\%$. At $\lambda\geq 0.1$, binary polarization occurs and performance saturates. The sweet spot at $\lambda=0.1$ balances task flexibility with orthogonality enforcement.

Table 9: Ablation: effect of regularization weight $\lambda$ on the gyroscope benchmark. Binary polarization ($\gamma\in\{0,1\}$) coincides with optimal loss.

### 9.9 Computational Efficiency

The primary computational cost of E$\Delta$-MHC-Geo is the $n\times n$ matrix solve $(\mathbf{I}+\mathbf{M})^{-1}(\mathbf{I}-\mathbf{M})$ in the Cayley transform, which is $O(n^3)$ in the number of streams. With $n=4$ streams, this is a $4\times 4$ linear system, negligible compared to the $O(T^2 d)$ attention cost. The additional overhead comes from the five generator networks ($f_u, f_v, f_\beta, f_k, f_\gamma$), each a 2-layer MLP. Together, these add approximately two extra “Linear” layers per block compared to a standard residual connection.

However, E$\Delta$-MHC-Geo uses only 6 layers versus 8–9 for the baselines (Table [5](https://arxiv.org/html/2605.06729#S9.T5)), which offsets the per-layer overhead. The total parameter count is matched across all models ($\sim 1.79$M), so the comparison is fair in terms of model capacity. In practice, the $33\%$ depth reduction makes E$\Delta$-MHC-Geo’s total wall-clock training time comparable to the baselines despite the per-layer geometric operator cost.

### 9.10 Summary of Theoretical Validation

Table [10](https://arxiv.org/html/2605.06729#S9.T10) consolidates the correspondence between theoretical predictions and experimental outcomes. The strongest claims are the algebraic ones: exact Cayley orthogonality, finite-Cayley exclusion of $\lambda=-1$, and Householder reflection at $\beta=2$. The experiments test whether the implemented models exploit these structures in practice.

Table 10: Theoretical predictions and empirical diagnostics.

## 10 Discussion

### 10.1 When Does Geometric Bias Help Most?

Our results suggest that E$\Delta$-MHC-Geo’s advantages are most pronounced on tasks with strong geometric structure (rotation prediction, negation, and operator selection), where the inductive bias of guaranteed orthogonality directly matches the task’s mathematical requirements. The comparison with JPmHC v2 is instructive: JPmHC’s wider representation ($n_{\mathrm{embd}}=512$, $d_{\mathrm{stream}}=128$) excels on the gyroscope task where rich per-stream expressivity aids pure rotation, but E$\Delta$-MHC-Geo achieves better long-horizon stability ($1.9\times$ gap) and best near-$\pi$ loss, while being the only evaluated architecture with a direct reflection branch. This supports the hybrid design as a practical way to access both $\det=+1$ and $\det=-1$ regimes.

Representation width ablation. A natural question is whether JPmHC’s gyroscope advantage stems from its $4\times$ wider internal representation ($n_{\mathrm{embd}}=512$ vs. $128$) rather than its architectural innovations. To test this, we widened E$\Delta$-MHC-Geo to $n_{\mathrm{embd}}=184$ ($3$ layers, $1.85$M) and $224$ ($2$ layers, $1.83$M) at matched parameters. Both performed *worse* on the gyroscope ($3.97\times 10^{-3}$ and $4.01\times 10^{-3}$ vs. $1.02\times 10^{-3}$): the forced depth reduction hurts more than width helps. JPmHC achieves both width *and* depth simultaneously because its sub-layer $F$ operates at $d_{\mathrm{stream}}=128$ (per their Section 3.2), consuming far fewer parameters per layer. This architectural efficiency, not the Cayley retraction itself, is the source of JPmHC’s gyroscope edge. Crucially, E$\Delta$-MHC-Geo remains highly competitive on gyroscope ($3.7\times$ over GPT) despite operating at a $4\times$ narrower representation and with $33\%$ fewer layers, while being the only model in this comparison that adds a direct reflection branch. Adapting the geometric operator for efficient per-stream architectures at wider representations is a promising direction for future work. We provide an E$\Delta$-MHC-Geo-Stream implementation in the codebase (`src/models/edelta_stream.py`) that combines stream-level Cayley rotation with stream-axis Householder reflection and JPmHC-style dynamic routing. This variant demonstrates the compatibility of the hybrid operator with per-stream compute, but it is not the source of the headline results in Table [6](https://arxiv.org/html/2605.06729#S9.T6).

The automatic discoveries of $\beta\to 2$ and $\gamma\to 0$ are not incremental gains but qualitative differences that emerge from the algebraic structure of the Cayley transform.

### 10.2 Limitations

We identify several limitations that inform future work:

Benchmark scope. All experiments use synthetic benchmarks designed to isolate geometric properties. While this is appropriate for validating theoretical claims, the generalization to natural language processing, computer vision, and other large-scale tasks remains to be demonstrated. We view the current work as establishing the theoretical and empirical foundations upon which such extensions can be built.

Scale. Experiments are conducted at $\sim 1.79$M parameters. The $O(n^3)$ cost of the Cayley matrix solve (where $n$ is the number of streams) is negligible at $n=4$ but could become a bottleneck if many streams are needed at larger scales. Efficient approximations (e.g., truncated Neumann series) may be required for $n\gg 4$.

Multi-seed scope. Main results are reported as mean $\pm$ standard deviation over three random seeds (42, 123, 456), and the resulting confidence intervals are narrow for most benchmarks. However, three seeds remain a modest sample; larger-scale variance studies across more seeds and hyperparameter settings would further strengthen the claims.

Hybrid orthogonality gap. The convex combination $\gamma\mathbf{Q}\mathbf{X}+(1-\gamma)\mathbf{H}\mathbf{X}$ is orthogonal only at the extremes $\gamma\in\{0,1\}$, not in between (Theorem [7.1](https://arxiv.org/html/2605.06729#S7.Thmtheorem1)). Although the midpoint collapse regularizer drives $\gamma$ toward the boundaries, the transient non-orthogonality during early training is a theoretical limitation. Investigating geodesic interpolation on $\mathrm{O}(n)$ as an alternative to linear blending is a promising direction.

## 11 Conclusion

We have presented the E$\Delta$-MHC-Geo Transformer, which introduces input-adaptive, unconditionally orthogonal residual connections through the Data-Dependent Cayley transform. The key theoretical insight is that the Cayley transform’s orthogonality guarantee derives from the algebraic structure of skew-symmetry, which is preserved regardless of how the generator vectors $\mathbf{u}(\mathbf{x}),\mathbf{v}(\mathbf{x})$ are computed. This allows us to make the rotation plane input-dependent without sacrificing any geometric guarantees, a property that neither DDL (conditional on $\beta=2$) nor mHC (approximate via Sinkhorn) can match.

The E$\Delta$-MHC-Geo Hybrid extends this by combining Cayley rotation ($\det=+1$) with Householder reflection ($\det=-1$, $\beta=2$ fixed) through a learned gate, giving boundary access to both connected components of $\mathrm{O}(n)$. Midpoint collapse regularization encourages binary operator selection. The universal zero-gradient theorem (Theorem [7.3](https://arxiv.org/html/2605.06729#S7.Thmtheorem3)) explains when and why this regularization succeeds or fails, providing principled deployment guidance.

Empirically, the main algebraic predictions are supported across 3 random seeds: unconditional orthogonality yields excellent norm preservation ($0.001$ mean deviation), automatic operator selection drives $\gamma\to 0.051\pm 0.005$ on the diagnostic negation probe, and DDL independently discovers $\beta\to 1.995\pm 0.001$ via gradient descent. The comparison with the concurrent JPmHC v2 (Sengupta et al., [2026](https://arxiv.org/html/2605.06729#bib.bib10)) is particularly informative: JPmHC’s wider representation ($n_{\mathrm{embd}}=512$) excels on pure rotation (gyroscope), while E$\Delta$-MHC-Geo achieves best stability ($1.9\times$ over JPmHC), best near-$\pi$ loss ($4.5\times$ on single-plane), and, crucially, is the only evaluated model with a direct reflection branch, validating the hybrid design for tasks that require $\det=-1$ operators. This does not imply uniform superiority over JPmHC: its full-rank mixer and the March v2 ARC-AGI evidence are stronger for broad rotation/routing workloads. Rather, the controlled experiments show that E$\Delta$-MHC-Geo trades per-operator rotation rank for exactness, stable composition, and a reflection-capable branch. Geometric inductive bias allows E$\Delta$-MHC-Geo to achieve these results with $33\%$ fewer layers than baselines at matched parameter count, suggesting that the *structure* of inter-layer transformations matters as much as their *quantity*.

Future work includes scaling to large language models, exploring geodesic interpolation to replace convex blending, and extending the framework to unitary groups for complex-valued architectures.

#### Broader Impact Statement

This work advances the theoretical foundations of orthogonal transformations in neural networks. We do not foresee significant negative societal impacts, as the work is primarily theoretical and validated on controlled benchmarks. However, improved training stability could accelerate the development of larger models, which carries the usual concerns about AI safety and misuse.

#### Acknowledgment

This manuscript received limited editorial and grammatical refinement assistance from Claude Sonnet 4.6 and GPT 5.5 during the writing process. All research ideas, methodologies, experimental designs, and scientific claims were developed and verified by the author, who reviewed and edited all AI-assisted output and takes full responsibility for the accuracy and integrity of the final manuscript.

#### Reproducibility

All code and experimental scripts are publicly available at [https://github.com/arash-shahmansoori/edelta](https://github.com/arash-shahmansoori/edelta). Continuous benchmarks are reproduced with `bash scripts/run_matched_params.sh`; near-$\pi$ initialization robustness with `bash scripts/run_near_pi.sh`; and reflection diagnostics with `bash scripts/run_reflection.sh`.

## References

- Arjovsky et al. (2016) Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In *International Conference on Machine Learning*, pp. 1120–1128. PMLR, 2016.
- Bansal et al. (2018) Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regularizations in training deep networks? In *Advances in Neural Information Processing Systems*, volume 31, 2018.
- DeepSeek AI (2024) DeepSeek AI. Hyper-connections. *arXiv preprint arXiv:2512.24880*, 2024.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In *IEEE Conference on Computer Vision and Pattern Recognition*, pp. 770–778, 2016.
- Helfrich et al. (2018) Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal recurrent neural networks with scaled Cayley transform. In *International Conference on Machine Learning*, pp. 1969–1978. PMLR, 2018.
- Lezcano-Casado & Martínez-Rubio (2019) Mario Lezcano-Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In *International Conference on Machine Learning*, pp. 3794–3803. PMLR, 2019.
- Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. *arXiv preprint arXiv:1711.05101*, 2017.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
- Saxe et al. (2014) Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In *International Conference on Learning Representations*, 2014.
- Sengupta et al. (2026) Biswa Sengupta, Jinhua Wang, and Leo Brunswic. JPmHC dynamical isometry via orthogonal hyper-connections. *arXiv preprint arXiv:2602.18308v2*, March 2026. Version 2, updated March 4, 2026.
- Shepard et al. (2015) Ron Shepard, Michael Minkoff, et al. Representation of the rotation reflection group. *Journal of Mathematical Chemistry*, 53(1):382–401, 2015.
- Shojaee et al. (2025) Parshin Shojaee, Jamshid Mirzakhalov, Sophia Ananiadou, and Marti A. Hearst. Illusion of insight: When reasoning models appear smarter than they are. *arXiv preprint arXiv:2601.00514*, 2025.
- Vorontsov et al. (2017) Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and learning recurrent networks with long term dependencies. In *International Conference on Machine Learning*, pp. 3570–3578. PMLR, 2017.
- Yang et al. (2024) Liu Yang, Zhiwei Xu, et al. Deep delta learning. *arXiv preprint arXiv:2406.17550*, 2024.

## Appendix A Detailed Proofs

### A.1 Commutativity in Cayley Orthogonality Proof

###### Lemma A.1.

For any matrix $\mathbf{M}$, the matrices $(\mathbf{I}+\mathbf{M})$ and $(\mathbf{I}-\mathbf{M})$ commute.

###### Proof.

$(\mathbf{I}+\mathbf{M})(\mathbf{I}-\mathbf{M})=\mathbf{I}-\mathbf{M}^2=(\mathbf{I}-\mathbf{M})(\mathbf{I}+\mathbf{M})$. ∎
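As a sketch of how Lemma A.1 enters the Cayley orthogonality argument (the main-text proof is not reproduced in this appendix, so the steps below are a reconstruction under that assumption), write $\mathbf{M}=\tfrac{\beta}{2}\mathbf{A}$ with $\mathbf{M}^\top=-\mathbf{M}$ and $\mathbf{Q}=(\mathbf{I}+\mathbf{M})^{-1}(\mathbf{I}-\mathbf{M})$:

```latex
\begin{align*}
\mathbf{Q}^\top
  &= (\mathbf{I}-\mathbf{M})^\top\big((\mathbf{I}+\mathbf{M})^{-1}\big)^\top
   = (\mathbf{I}+\mathbf{M})(\mathbf{I}-\mathbf{M})^{-1}, \\
\mathbf{Q}^\top\mathbf{Q}
  &= (\mathbf{I}+\mathbf{M})(\mathbf{I}-\mathbf{M})^{-1}(\mathbf{I}+\mathbf{M})^{-1}(\mathbf{I}-\mathbf{M}) \\
  &= (\mathbf{I}+\mathbf{M})(\mathbf{I}+\mathbf{M})^{-1}(\mathbf{I}-\mathbf{M})^{-1}(\mathbf{I}-\mathbf{M})
   = \mathbf{I}.
\end{align*}
```

The swap in the last line uses the lemma: since $(\mathbf{I}+\mathbf{M})$ and $(\mathbf{I}-\mathbf{M})$ commute, so do their inverses. Both inverses exist because a real skew-symmetric $\mathbf{M}$ has purely imaginary eigenvalues, so $\mathbf{I}\pm\mathbf{M}$ has no zero eigenvalue.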

### A.2 Gradient Flow Through Data-Dependent Cayley

The gradient of loss $\mathcal{L}$ with respect to parameters $\theta$ (weights of $\mathbf{W}_u$, $\mathbf{W}_v$) flows through:

$$\frac{\partial\mathcal{L}}{\partial\theta}=\frac{\partial\mathcal{L}}{\partial\mathbf{Q}}\cdot\frac{\partial\mathbf{Q}}{\partial\mathbf{A}}\cdot\frac{\partial\mathbf{A}}{\partial(\mathbf{u},\mathbf{v})}\cdot\frac{\partial(\mathbf{u},\mathbf{v})}{\partial\theta}. \tag{10}$$

All operations are differentiable, and `torch.linalg.solve` supports autograd.

### A.3 Midpoint Collapse Regularization Derivation

The regularization $\mathcal{L}_{\mathrm{gate}}=4\gamma(1-\gamma)$ has: domain $\gamma\in[0,1]$; range $\mathcal{L}\in[0,1]$; $\frac{d}{d\gamma}[4\gamma(1-\gamma)]=4-8\gamma=0$ at $\gamma=0.5$; $\frac{d^2}{d\gamma^2}=-8<0$ (maximum at $\gamma=0.5$); boundary values $\mathcal{L}(0)=\mathcal{L}(1)=0$ (minima).
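The properties listed above can be seen in a toy gradient-descent run on the regularizer alone. This sketch updates $\gamma$ directly and clips it to $[0,1]$, which is a simplification: in the model the gate is produced by a sigmoid and trained jointly with the task loss. The step size, step count, and function names are arbitrary choices for illustration.

```python
def midpoint_collapse(gamma):
    """L_gate = 4*gamma*(1 - gamma): equals 1 at gamma = 0.5 and 0 at the boundaries."""
    return 4.0 * gamma * (1.0 - gamma)

def midpoint_collapse_grad(gamma):
    """dL_gate/dgamma = 4 - 8*gamma, which vanishes at gamma = 0.5."""
    return 4.0 - 8.0 * gamma

def descend(gamma, lr=0.01, steps=200):
    """Plain gradient descent on L_gate alone, with gamma clipped to [0, 1]."""
    for _ in range(steps):
        gamma = min(1.0, max(0.0, gamma - lr * midpoint_collapse_grad(gamma)))
    return gamma

# Start slightly below, exactly at, and slightly above the midpoint.
low, mid, high = descend(0.45), descend(0.5), descend(0.55)
```

Starting points on either side of the midpoint are pushed to the nearest boundary, while $\gamma=0.5$ is an unstable stationary point where the regularizer provides no gradient signal. A gate initialized exactly there therefore has to be moved by the task loss.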

### A.4 Regularization Function Comparison

Table 11: Comparison of regularization functions for gate polarization.

## Appendix B Near-$\pi$ Rotation Experiments

![Refer to caption](https://arxiv.org/html/2605.06729v1/figures/regularization_analysis.png)

Figure 8: Regularization analysis. All smooth symmetric regularizations have zero gradient at $\gamma=0.5$ (Theorem [7.3](https://arxiv.org/html/2605.06729#S7.Thmtheorem3)). The current $4\gamma(1-\gamma)$ has the strongest boundary gradient among quadratic alternatives.

### B.1 Experimental Protocol

Near-$\pi$ rotation datasets probe the boundary between rotation and reflection:

- Single-plane ($\theta=3.10$ rad $=177.6^{\circ}$): 2 eigenvalues near $-1$, 62 at $+1$. Cayley sufficient.
- Multi-plane ($\theta=3.14$ rad $=179.9^{\circ}$): all 64 eigenvalues near $-1$. Extreme test.
- Exact negation ($\theta=\pi$): eigenvalues exactly $-1$. Must use Householder.
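To illustrate why the first two cases remain reachable by a Cayley rotation while exact negation does not, the following sketch reduces the problem to a single 2D plane. It is an illustrative simplification, not the paper's 64-dimensional, data-dependent operator: with $\mathbf{M}=\begin{pmatrix}0&-t\\ t&0\end{pmatrix}$ standing in for $(\beta/2)\mathbf{A}$, the map $\mathbf{Q}=(\mathbf{I}+\mathbf{M})^{-1}(\mathbf{I}-\mathbf{M})$ has a closed form, and the parameter $t$ needed for a rotation by $\theta$ is $\tan(\theta/2)$. All function names are hypothetical.

```python
import math
from typing import List


def cayley_2d(t: float) -> List[List[float]]:
    """Closed form of Q = (I + M)^{-1} (I - M) for M = [[0, -t], [t, 0]]."""
    d = 1.0 + t * t
    c = (1.0 - t * t) / d   # cosine of the realised rotation angle
    s = 2.0 * t / d         # sine of the realised rotation angle
    return [[c, s], [-s, c]]


def required_parameter(theta: float) -> float:
    """Skew parameter t that yields a rotation by theta, for theta in [0, pi)."""
    if not 0.0 <= theta < math.pi:
        raise ValueError("a finite Cayley parameter exists only for theta in [0, pi)")
    return math.tan(theta / 2.0)


if __name__ == "__main__":
    # 3.10 and 3.14 rad are the single-plane and multi-plane angles from the list above.
    for theta in (1.0, 3.10, 3.14):
        t = required_parameter(theta)
        q = cayley_2d(t)
        print(f"theta={theta:.2f} rad  t={t:10.2f}  cos={q[0][0]:+.6f}")
```

The required parameter grows without bound as $\theta\to\pi$, so the near-$\pi$ datasets stress the regime where the skew-symmetric generator must become very large, and $\theta=\pi$ itself has no finite solution.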

For a single-plane rotation by angle $\theta$ in the $(i,j)$-plane:

$$\mathbf{R}_{ij}(\theta)=\mathbf{I}+(\cos\theta-1)(\mathbf{e}_{i}\mathbf{e}_{i}^{\top}+\mathbf{e}_{j}\mathbf{e}_{j}^{\top})+\sin\theta(\mathbf{e}_{i}\mathbf{e}_{j}^{\top}-\mathbf{e}_{j}\mathbf{e}_{i}^{\top}). \tag{11}$$
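Eq. (11) can be turned into a target matrix directly. The sketch below is a dependency-free illustration with hypothetical helper names; it builds $\mathbf{R}_{ij}(\theta)$ term by term and checks two consequences of the formula: orthogonality, and a trace of $(n-2)+2\cos\theta$, which matches two eigenvalues $e^{\pm i\theta}$ and $n-2$ eigenvalues at $+1$.

```python
import math
from typing import List

Matrix = List[List[float]]


def single_plane_rotation(n: int, i: int, j: int, theta: float) -> Matrix:
    """R_ij(theta) following Eq. (11): the identity outside the (i, j)-plane."""
    if i == j or not (0 <= i < n and 0 <= j < n):
        raise ValueError("i and j must be distinct indices in range(n)")
    r = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    c, s = math.cos(theta), math.sin(theta)
    r[i][i] += c - 1.0   # (cos(theta) - 1) * e_i e_i^T
    r[j][j] += c - 1.0   # (cos(theta) - 1) * e_j e_j^T
    r[i][j] += s         # +sin(theta) * e_i e_j^T
    r[j][i] -= s         # -sin(theta) * e_j e_i^T
    return r


def gram(r: Matrix) -> Matrix:
    """R^T R, which equals the identity when R is orthogonal."""
    n = len(r)
    return [[sum(r[k][a] * r[k][b] for k in range(n)) for b in range(n)]
            for a in range(n)]


def trace(r: Matrix) -> float:
    """Sum of the diagonal entries."""
    return sum(r[k][k] for k in range(len(r)))


if __name__ == "__main__":
    n, theta = 64, 3.10  # dimension and angle of the single-plane dataset
    r = single_plane_rotation(n, 0, 1, theta)
    g = gram(r)
    err = max(abs(g[a][b] - (1.0 if a == b else 0.0))
              for a in range(n) for b in range(n))
    print(f"max |R^T R - I| = {err:.2e}")
    print(f"trace = {trace(r):.6f}, expected = {(n - 2) + 2 * math.cos(theta):.6f}")
```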

### B.2 Baseline Comparison

Table 12: Near-$\pi$ rotation: baseline comparison (mean $\pm$ std over 3 seeds; all $\sim 1.8$M params).

Table [12](https://arxiv.org/html/2605.06729#A2.T12) shows that both E$\Delta$-MHC-Geo and JPmHC dramatically outperform GPT and DDL on near-$\pi$ tasks. E$\Delta$-MHC-Geo achieves the lowest mean loss on both benchmarks: $1.44\times 10^{-6}$ on single-plane ($4.5\times$ better than JPmHC) and $1.24\times 10^{-6}$ on multi-plane ($1.4\times$ better than JPmHC), demonstrating that exact analytical Cayley computation outperforms iterative approximation even on pure $\mathrm{SO}(n)$ tasks near the rotation–reflection boundary. The initialization robustness analysis (Table [8](https://arxiv.org/html/2605.06729#S9.T8)) confirms that all gate initializations converge to comparable performance.

mHC fails catastrophically ($\sim 6{,}700\times$ worse than E$\Delta$-MHC-Geo) because doubly stochastic matrices have only nonnegative entries and therefore cannot represent the negative entries, close to $-1$, that near-$180^{\circ}$ rotations require. This is a fundamental architectural limitation, not a training failure: even with perfect optimization, mHC's representation class excludes the target operator.
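The representational gap can be made concrete with a small Sinkhorn sketch. This is an illustration under stated assumptions rather than the mHC baseline's code: it assumes the mixing matrix is obtained by exponentiating unconstrained logits and alternating row and column normalization (20 iterations, as in Table 15), on a hypothetical $4\times 4$ example matching `n_streams = 4`. Whatever the logits, every entry stays positive, so no entry can approach $\cos(3.10)\approx -1$.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn(logits: Matrix, iterations: int = 20) -> Matrix:
    """Alternate row and column normalization of exp(logits)."""
    m = [[math.exp(x) for x in row] for row in logits]
    n = len(m)
    for _ in range(iterations):
        for row in m:                          # make every row sum to 1
            total = sum(row)
            for b in range(n):
                row[b] /= total
        for b in range(n):                     # make every column sum to 1
            total = sum(m[a][b] for a in range(n))
            for a in range(n):
                m[a][b] /= total
    return m


if __name__ == "__main__":
    # Hypothetical logits that push the diagonal as low as the parameterization allows.
    logits = [
        [-3.0, 0.0, 1.0, 0.5],
        [0.2, -3.0, 0.0, 1.0],
        [1.0, 0.3, -3.0, 0.0],
        [0.0, 1.0, 0.2, -3.0],
    ]
    p = sinkhorn(logits)
    smallest = min(min(row) for row in p)
    target = math.cos(3.10)  # diagonal entry of a near-pi rotation block
    print(f"smallest entry = {smallest:.4f}, target entry = {target:.4f}")
    print(f"gap to target  = {smallest - target:.4f}")
```

Driving the diagonal logits further down only moves those entries toward zero, never below it, which is the sense in which the target operator lies outside the representation class.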

## Appendix C Full Hyperparameters

Table 13: Optimizer configuration.

| Parameter | Value | Description |
|---|---|---|
| Optimizer | AdamW | Decoupled weight decay |
| $\beta_{1}$ | 0.9 | First moment |
| $\beta_{2}$ | 0.95 | Second moment |
| $\epsilon$ | $10^{-8}$ | Numerical stability |
| Weight decay | 0.1 | L2 regularization |
| Gradient clip | 1.0 | Global norm |

Table 14: Learning rate schedule.

Table 15: E$\Delta$-MHC-Geo-specific parameters.

| Parameter | Value | Description |
|---|---|---|
| `geo_hidden_ratio` | 4 | Hidden dim $=n_{\mathrm{embd}}/4=32$ |
| `n_streams` | 4 | Parallel mHC streams |
| Householder $\beta$ | 2.0 (fixed) | Theorem [6.4](https://arxiv.org/html/2605.06729#S6.Thmtheorem4) |
| Gate init (continuous) | 0.0 | Neutral |
| Gate init (reflection) | $-1.5$ | Symmetry-breaking |
| $\lambda_{\mathrm{gate}}$ | 0.1–1.0 | Midpoint collapse weight |
| Sinkhorn iterations | 20 | For mHC baseline |

Table 16: Dataset configuration.
## Appendix D Implementation Alignment

Table [17](https://arxiv.org/html/2605.06729#A4.T17) maps the mathematical objects used in the paper to the active implementation. The main reported E$\Delta$-MHC-Geo model is `src/models/edelta_hybrid.py`; the reflection experiment is a separate low-dimensional operator diagnostic implemented in `src/training/train_reflection.py`.

Table 17: Paper–code alignment for the main mathematical components.
## Appendix E Effect Sizes and Robustness

We provide a quantitative assessment of the magnitude and consistency of the reported improvements\.

Norm preservation (stability dataset). Table [18](https://arxiv.org/html/2605.06729#A5.T18) reports the mean norm deviation across all sequence positions for each model. E$\Delta$-MHC-Geo's deviation of $0.001$ is a direct consequence of the algebraic guarantee $\|\mathbf{Q}\mathbf{y}\|=\|\mathbf{y}\|$ (Theorem [4.3](https://arxiv.org/html/2605.06729#S4.Thmtheorem3)); the small residual arises from the mHC pre/post mappings and the MLP branch, not from the geometric operator itself. The effect size is $474\times$ relative to GPT, placing it well beyond any plausible noise floor. Notably, JPmHC ($0.004$) also achieves near-perfect norm preservation thanks to its Cayley retraction, but E$\Delta$-MHC-Geo's exact analytical solve yields $4\times$ lower deviation.

Table 18: Norm preservation analysis (stability dataset).

Initialization robustness and multi-seed validation. Main results are reported as mean $\pm$ std over three random seeds (42, 123, 456). The initialization robustness analysis (Table [8](https://arxiv.org/html/2605.06729#S9.T8), six configurations per dataset) further confirms stability: all configurations achieve mean loss in the $10^{-6}$–$10^{-7}$ range despite varying $\gamma_{0}\in\{0.18,0.50,0.82\}$, with a worst-to-best ratio of $\sim 4\times$. This is substantially smaller than the $474\times$ improvement over GPT, confirming that the reported gains are robust to both seed and initialization variation.

Parameter convergence consistency. On the reflection task (Table [7](https://arxiv.org/html/2605.06729#S9.T7)), both DDL's $\beta\to 1.995\pm 0.001$ and E$\Delta$-MHC-Geo's $\gamma\to 0.051\pm 0.005$ converge toward their theoretical targets ($\beta=2$ and $\gamma=0$, respectively) with increasing sample size, and the tight standard deviations across seeds confirm reproducibility. DDL's $\beta$ shows monotonic convergence, while E$\Delta$-MHC-Geo's $\gamma$ reaches its target at 500 samples. The convergence pattern, in which parameter alignment precedes cosine-alignment gains, reproduces reliably, providing evidence of systematic rather than stochastic behavior.

## Appendix F Code Availability

Table 19: Repository structure.

Reproduction:

    curl -LsSf https://astral.sh/uv/install.sh | sh
    uv sync
    bash scripts/prepare_data.sh
    bash scripts/run_matched_params.sh
    bash scripts/run_reflection.sh
