Three-Phase Transformer

Hugging Face Daily Papers 04/15/26, 12:00 AM Papers

Summary

This paper introduces the Three-Phase Transformer (3PT), which applies Tesla's polyphase geometry to Transformer architectures by organizing the residual stream into three phases offset by 120°. At 123M parameters on WikiText-103, it reports a 7.20% perplexity reduction over a matched RoPE-only baseline with 1,536 added parameters (0.00124% overhead) and a 1.93× step-count convergence speedup.

We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable. The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.

Original Article

Cached at: 04/20/26, 08:27 AM

Source: https://huggingface.co/papers/2604.14430

In 1888, Nikola Tesla presented the world with the three-phase motor. I did the same, but for Transformers. 😊

I just published “Three-Phase Transformer (3PT)”

A research paper that takes Tesla’s polyphase geometry and drops it into a Transformer’s residual stream.

Tesla’s three currents sum to zero at every instant. Three is the unique small integer where you get the zero-sum identity and no anti-correlated pair. Three is the sweet spot - it’s why every electric grid on Earth runs on three phases. 😎
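The zero-sum and "no anti-correlated pair" claims can be checked numerically. The following is a minimal NumPy sketch, not code from the paper; the helper names `phase_offsets`, `phase_signals`, and `min_pairwise_cosine` are hypothetical. It relies on the fact that the correlation between cos(t + a) and cos(t + b) over a full period equals cos(a - b).

```python
import numpy as np

def phase_offsets(n):
    """Angles of n equally spaced phases: i * 2*pi/n for i = 0..n-1."""
    return 2 * np.pi * np.arange(n) / n

def phase_signals(n, t):
    """Return an (n, len(t)) array of sinusoids cos(t + offset_i)."""
    return np.cos(t[None, :] + phase_offsets(n)[:, None])

def min_pairwise_cosine(n):
    """Smallest cosine of the angle between any two distinct phases.

    A value of -1 means some pair is exactly anti-correlated.
    """
    offs = phase_offsets(n)
    cos = np.cos(offs[:, None] - offs[None, :])
    mask = ~np.eye(n, dtype=bool)  # ignore each phase paired with itself
    return cos[mask].min()

t = np.linspace(0.0, 2 * np.pi, 1000)
for n in (2, 3, 4):
    worst_sum = np.abs(phase_signals(n, t).sum(axis=0)).max()
    print(n, worst_sum, min_pairwise_cosine(n))
```

For N = 2 and N = 4 the phases also sum to zero, but each contains a pair 180° apart (minimum cosine -1). For N = 3 the sum is zero and the minimum cosine is -0.5, which is the property the post singles out.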

Here’s the thing. Networks already arrive at this geometry on their own. Anthropic’s Toy Models of Superposition (2022) shows networks can organize features into geometric structures, including 120° triangles. In neural collapse theory, the optimal arrangement for three classes is three vectors at 120° mutual separation. The networks are stumbling into three-phase structure by accident, paying for it in convergence time.

So instead of letting them stumble into the house, I built the house first. 🤗

Split the hidden vector into three equal stripes at 120° offsets. Add four phase-respecting ops per block: per-phase RMSNorm, a 2D rotation between attention and FFN using Tesla’s 120° offsets, phase-aligned GQA heads, and a fixed signal in the 1D subspace orthogonal to the three phases. The stripes spin like motor windings. Attention and FFN scramble across boundaries. The phase ops pull it back. Equilibrium, not a bolted-on module.
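Two of these ops, the per-phase RMSNorm and the phase-offset rotation, can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the released implementation: the stripes are assumed to be contiguous slices of the hidden vector, the 2D rotation is assumed to act on adjacent coordinate pairs inside each stripe, and the names `split_phases`, `per_phase_rmsnorm`, and `phase_rotation` are hypothetical. The rotation angle follows the abstract's formula theta + i*(2*pi/N).

```python
import numpy as np

def split_phases(x, n_phases=3):
    """View a hidden vector of size d as (n_phases, d // n_phases) stripes.

    Assumption: stripes are contiguous slices of the last axis.
    """
    assert x.shape[-1] % n_phases == 0
    return x.reshape(*x.shape[:-1], n_phases, x.shape[-1] // n_phases)

def per_phase_rmsnorm(x, gain, n_phases=3, eps=1e-6):
    """RMS-normalize each stripe separately, then apply a per-coordinate gain."""
    p = split_phases(x, n_phases)
    rms = np.sqrt(np.mean(p ** 2, axis=-1, keepdims=True) + eps)
    return (p / rms).reshape(x.shape) * gain

def phase_rotation(x, theta, n_phases=3):
    """Rotate coordinate pairs in stripe i by theta + i * 2*pi/n_phases.

    Assumption: the 2D rotation pairs up adjacent coordinates in a stripe.
    """
    p = split_phases(x, n_phases)
    assert p.shape[-1] % 2 == 0
    pairs = p.reshape(*p.shape[:-1], -1, 2)       # (..., n_phases, m/2, 2)
    angles = theta + 2 * np.pi * np.arange(n_phases) / n_phases
    c = np.cos(angles)[:, None]                   # broadcast over pairs
    s = np.sin(angles)[:, None]
    a, b = pairs[..., 0], pairs[..., 1]
    rotated = np.stack([c * a - s * b, s * a + c * b], axis=-1)
    return rotated.reshape(x.shape)

# Usage: a batch of 4 hidden vectors with d = 12 (three stripes of 4).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 12))
y = phase_rotation(per_phase_rmsnorm(x, gain=np.ones(12)), theta=0.1)
print(y.shape)
```

Because each 2D rotation is orthogonal, `phase_rotation` preserves the norm of the hidden vector; only the relative orientation of the stripes changes. In the paper theta is presumably a learned per-block parameter (the post mentions rotation-angle drift across depth), which this sketch treats as a plain float.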

But the architecture isn’t the headline. What it revealed is. 🤖

The three-phase balance leaves one direction in channel space empty by construction - the DC direction, orthogonal to all three phases. I filled it with Gabriel’s horn from 1641. The cross-phase residual measures at exactly the analytic horn value to floating-point precision. Every seed, every run. RoPE handles relative position; the horn handles absolute position. They never collide. Mathematics, not optimization.
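A toy version of this argument can be written down directly. The sketch below is an assumption-heavy illustration, not the paper's code: it models a single coordinate as three perfectly balanced phase components, treats the DC direction as "add the same value to every phase", and uses hypothetical helper names. Under those assumptions, the cross-phase mean reads back r(p) = 1/(p+1) no matter what the phase components carry.

```python
import numpy as np

def horn_profile(positions):
    """Gabriel's horn radius r(p) = 1 / (p + 1) for positions p = 0, 1, 2, ..."""
    return 1.0 / (np.asarray(positions, dtype=np.float64) + 1.0)

def balanced_phases(amplitude, phi, n_phases=3):
    """Components amplitude * cos(phi + i * 2*pi/n); they sum to zero over i."""
    offsets = 2 * np.pi * np.arange(n_phases) / n_phases
    return amplitude[..., None] * np.cos(phi[..., None] + offsets)

def inject_dc(phases, positions):
    """Add the horn value equally to every phase (assumed DC direction)."""
    return phases + horn_profile(positions)[..., None]

def dc_component(phases):
    """Cross-phase mean, i.e. the projection onto the DC direction up to scale."""
    return phases.mean(axis=-1)

# Usage: 8 token positions with arbitrary balanced phase content.
rng = np.random.default_rng(0)
positions = np.arange(8)
phases = balanced_phases(rng.normal(size=8), rng.uniform(0, 2 * np.pi, size=8))
recovered = dc_component(inject_dc(phases, positions))
print(np.max(np.abs(recovered - horn_profile(positions))))
```

The printed error is at the level of floating-point rounding, because the balanced components cancel in the mean and only the injected profile survives. How the trained model keeps its phases balanced enough for this to hold is the self-stabilization claim of the post, which this toy example does not test.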

The geometry self-stabilizes. No auxiliary loss, no constraint, no enforcement. The phases settle into balance within 1,000 steps and hold for 29,000 more. The same principle Tesla relied on - balanced loads maintain themselves without active correction. A novel instance of the conservation-law framework for neural networks.

The result at 123M on WikiText-103: −7.20% perplexity. Parameters added: 1,536. That’s 0.00124% of the model. 1.93× convergence speedup!
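The reported figures can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the post and the abstract (the 1.64× wall-clock speedup comes from the abstract); the derived quantities are back-of-the-envelope inferences, not values reported by the author.

```python
# Figures quoted in the post and abstract.
added_params = 1_536
reported_overhead_pct = 0.00124
step_speedup = 1.93
wallclock_speedup = 1.64

# Total parameter count implied by the overhead percentage.
implied_total = added_params / (reported_overhead_pct / 100)
print(f"implied total parameters: {implied_total / 1e6:.1f}M")

# If both speedups are measured to the same target, their ratio estimates
# how much more wall-clock time one 3PT step costs than a baseline step.
per_step_cost_ratio = step_speedup / wallclock_speedup
print(f"per-step cost ratio: {per_step_cost_ratio:.3f}")
```

The implied total is consistent with the stated 123M model size, and the gap between the two speedups suggests that each 3PT step is somewhat slower than a baseline step, assuming both speedups refer to the same convergence target.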

A 17th-century painter’s paradox riding through the 1D tunnel that a 19th-century motor geometry carves out, dropped into a 2017 Transformer. None of it should compose. All of it does in 2026. 👽

Tesla might never have imagined his polyphase system would run anything other than rotating machinery. 138 years later, it’s running the Transformer geometry. 😇

Paper: https://arxiv.org/abs/2604.14430
Code: https://github.com/achelousace/three-phase-transformer

Mohammad R. Abu Ayyash, Brains Build Research, Ramallah, Palestine.
