Three-Phase Transformer

Hugging Face Daily Papers

Summary

A research paper introducing Three-Phase Transformer (3PT), which applies Tesla's polyphase geometry to Transformer architectures by partitioning the residual stream into three phases offset by 120°. At 123M parameters on WikiText-103, it reports a 7.20% perplexity improvement over a matched RoPE-only baseline for 1,536 added parameters (0.00124% of the total), with a 1.93× step-count convergence speedup.

We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable. The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
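
The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below is not from the paper. It only derives two implied quantities from the reported numbers, assuming the overhead percentage is taken against the full parameter count and that both speedups are measured to the same convergence target.

```python
# Figures reported in the abstract.
added_params = 1_536
overhead_percent = 0.00124      # share of total parameters, in percent
step_speedup = 1.93             # fewer optimizer steps to converge
wall_clock_speedup = 1.64       # less wall-clock time to converge

# Total parameter count implied by the overhead figure.
implied_total = added_params / (overhead_percent / 100)

# time = steps * cost_per_step, so the per-step cost of 3PT relative to the
# baseline is the ratio of the two speedups (assumption: same target for both).
per_step_cost_ratio = step_speedup / wall_clock_speedup

print(f"implied total parameters: {implied_total / 1e6:.1f}M")
print(f"implied per-step cost ratio: {per_step_cost_ratio:.3f}")
```

Under those assumptions the implied total is close to 124M parameters, consistent with the stated 123M scale, and each 3PT step costs roughly 18% more wall-clock time than a baseline step.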

Cached at: 04/20/26, 08:27 AM

Source: https://huggingface.co/papers/2604.14430

In 1888, Nikola Tesla presented the world with the three-phase motor. I did the same, but for Transformers. 😊

I just published “Three-Phase Transformer (3PT)”, a research paper that takes Tesla’s polyphase geometry and drops it into a Transformer’s residual stream.

Tesla’s three currents sum to zero at every instant. Three is the unique small integer where you get the zero-sum identity and no anti-correlated pair. Three is the sweet spot - it’s why every electric grid on Earth runs on three phases. 😎
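
The zero-sum identity and the "no anti-correlated pair" property are easy to check numerically. This is a minimal sketch, not code from the paper: the function names are hypothetical, and "anti-correlated" is read here as a pair of phases whose correlation is exactly -1 (for unit sinusoids, the correlation of two phases equals the cosine of their offset). It only compares N = 2, 3 and 4 and does not test the uniqueness claim for larger N.

```python
import math

def phase_values(t, n=3):
    """Instantaneous values of n unit sinusoids spaced 2*pi/n apart."""
    return [math.sin(t + i * 2 * math.pi / n) for i in range(n)]

def pairwise_correlations(n=3):
    """Correlation of each pair of phases, i.e. the cosine of their offset."""
    return [
        math.cos((j - i) * 2 * math.pi / n)
        for i in range(n)
        for j in range(i + 1, n)
    ]

# The sum vanishes at any instant t, not just on average.
for t in (0.0, 0.7, 2.5):
    assert abs(sum(phase_values(t, 3))) < 1e-12

for n in (2, 3, 4):
    worst = min(pairwise_correlations(n))
    print(f"N={n}: most negative pairwise correlation = {worst:.2f}")
```

With N=2 the two phases are exact opposites, and with N=4 phases 0 and 2 are exact opposites, while every pair at N=3 sits at a correlation of -0.5.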

Here’s the thing. Networks already arrive at this geometry on their own. Anthropic’s Toy Models of Superposition (2022) shows networks naturally organize features into 120° triangles. Neural collapse theory proves three vectors at 120° mutual separation is the globally optimal representation geometry. The networks are stumbling into three-phase structure by accident, paying for it in convergence time.

So instead of letting them stumble into the house, I built the house first. 🤗

Split the hidden vector into three equal stripes at 120° offsets. Add four phase-respecting ops per block: per-phase RMSNorm, a 2D rotation between attention and FFN using Tesla’s 120° offsets, phase-aligned GQA heads, and a fixed signal in the 1D subspace orthogonal to the three phases. The stripes spin like motor windings. Attention and FFN scramble across boundaries. The phase ops pull it back. Equilibrium, not a bolted-on module.
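
Two of the four ops described above, the per-phase RMSNorm and the 120°-offset rotation, can be sketched on a plain list of floats. This is an illustration under stated assumptions, not the released implementation: it assumes the hidden vector is split into contiguous stripes, that the rotation pairs adjacent dimensions inside each stripe (as RoPE does), and it omits learned gains and the GQA head alignment. All function names are hypothetical.

```python
import math

def split_phases(x, n=3):
    """Split a hidden vector into n equal contiguous stripes (assumed layout)."""
    if len(x) % n != 0:
        raise ValueError("hidden size must be divisible by the number of phases")
    size = len(x) // n
    return [x[i * size:(i + 1) * size] for i in range(n)]

def rms_norm(v, eps=1e-6):
    """RMSNorm over one stripe, without a learned gain."""
    rms = math.sqrt(sum(a * a for a in v) / len(v) + eps)
    return [a / rms for a in v]

def givens_rotate(v, angle):
    """Rotate each adjacent pair of dimensions by the same angle (assumed pairing)."""
    if len(v) % 2 != 0:
        raise ValueError("stripe size must be even to form 2D pairs")
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for k in range(0, len(v), 2):
        a, b = v[k], v[k + 1]
        out += [c * a - s * b, s * a + c * b]
    return out

def phase_ops(x, theta, n=3):
    """Normalize each stripe, then rotate stripe i by theta + i * 2*pi/n."""
    out = []
    for i, stripe in enumerate(split_phases(x, n)):
        out += givens_rotate(rms_norm(stripe), theta + i * 2 * math.pi / n)
    return out

x = [float(k + 1) for k in range(12)]   # toy hidden vector, 3 stripes of 4
y = phase_ops(x, theta=0.3)
```

Because a Givens rotation is orthogonal, each stripe of `y` keeps the unit RMS that the normalization gave it; only its orientation within each 2D pair changes.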

But the architecture isn’t the headline. What it revealed is. 🤖

The three-phase balance leaves one direction in channel space empty by construction - the DC direction, orthogonal to all three phases. I filled it with Gabriel’s horn from 1641. The cross-phase residual measures at exactly the analytic horn value to floating-point precision. Every seed, every run. RoPE handles relative position; the horn handles absolute position. They never collide. Mathematics, not optimization.
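
The orthogonality argument can be illustrated on a single triple of phase values. The sketch below reads the DC direction as the common-mode (mean) component across the three phases, which is an interpretation of the post and not a detail it confirms. The helper names are hypothetical, and only the profile r(p) = 1/(p+1) is taken from the abstract.

```python
import math

def horn(p):
    """Gabriel's horn profile r(p) = 1 / (p + 1) at token position p."""
    return 1.0 / (p + 1)

def dc_component(phases):
    """Common-mode part of a phase triple: its mean across phases."""
    return sum(phases) / len(phases)

def balanced_part(phases):
    """What remains after removing the common-mode part."""
    dc = dc_component(phases)
    return [v - dc for v in phases]

def inject_horn(phases, p):
    """Set the DC component to horn(p) while leaving the balanced part untouched."""
    return [v + horn(p) for v in balanced_part(phases)]

# A balanced three-phase triple has (numerically) zero DC to begin with.
triple = [math.sin(0.7 + i * 2 * math.pi / 3) for i in range(3)]
injected = inject_horn(triple, p=4)

recovered = dc_component(injected)      # reads back horn(4) = 0.2
residual = [a - b for a, b in zip(balanced_part(injected), balanced_part(triple))]
```

In this toy setting the DC readout returns the horn value up to floating-point error and the balanced part is unchanged, which is the sense in which the two signals do not interfere.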

The geometry self-stabilizes. No auxiliary loss, no constraint, no enforcement. The phases settle into balance within 1,000 steps and hold for 29,000 more. The same principle Tesla relied on - balanced loads maintain themselves without active correction. A novel instance of the conservation-law framework for neural networks.

The result at 123M parameters on WikiText-103: −7.20% perplexity. Parameters added: 1,536. That’s 0.00124% of the model. 1.93× convergence speedup!

A 17th-century painter’s paradox riding through the 1D tunnel that a 19th-century motor geometry carves out, dropped into a 2017 Transformer. None of it should compose. All of it does in 2026. 👽

Tesla might never have imagined his polyphase system would run anything other than rotating machinery. 138 years later, it’s running the Transformer geometry. 😇

Paper: https://arxiv.org/abs/2604.14430
Code: https://github.com/achelousace/three-phase-transformer

Mohammad R. Abu Ayyash, Brains Build Research, Ramallah, Palestine.
