Three-Phase Transformer
Summary
A research paper introducing the Three-Phase Transformer (3PT), which applies Tesla's polyphase geometry to the Transformer architecture by organizing the residual stream into three phases at 120° offsets. At 123M parameters, the approach reports a 7.2% perplexity improvement on WikiText-103 with 1,536 added parameters (0.00124% overhead) and a 1.93× convergence speedup.
Cached at: 04/20/26, 08:27 AM
Source: https://huggingface.co/papers/2604.14430
In 1888, Nikola Tesla presented the world with the three-phase motor. I did the same, but for Transformers. 😊
I just published “Three-Phase Transformer (3PT)”, a research paper that takes Tesla’s polyphase geometry and drops it into a Transformer’s residual stream.
Tesla’s three currents sum to zero at every instant. Three is the unique small integer where you get the zero-sum identity and no anti-correlated pair. Three is the sweet spot - it’s why every electric grid on Earth runs on three phases. 😎
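The zero-sum identity is easy to check numerically. The following is a minimal sketch, not code from the paper: the function names `phase_values` and `pairwise_cosines` are made up for illustration, and the phases are modeled as plain cosines at equal angular offsets.

```python
import math

def phase_values(theta, n=3):
    """Instantaneous values of n balanced phases, cos(theta - 2*pi*k/n)."""
    return [math.cos(theta - 2 * math.pi * k / n) for k in range(n)]

def pairwise_cosines(n):
    """Cosine of the angle between phase 0 and every other phase."""
    return [math.cos(2 * math.pi * k / n) for k in range(1, n)]

# Zero-sum identity: the three values cancel at every sampled angle,
# up to floating-point rounding.
max_abs_sum = max(abs(sum(phase_values(0.1 * t))) for t in range(100))
print(f"largest |a + b + c| over 100 angles: {max_abs_sum:.1e}")

# Compare neighbouring phase counts. With n = 2 and n = 4 some pair sits at
# cosine -1 (exactly anti-correlated); with n = 3 every pair sits at -0.5.
for n in (2, 3, 4):
    print(n, [round(c, 3) for c in pairwise_cosines(n)])
```

The comparison only covers n = 2, 3, 4, which is the range the "sweet spot" argument above is about.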
Here’s the thing. Networks already arrive at this geometry on their own. Anthropic’s Toy Models of Superposition (2022) shows networks naturally organize features into 120° triangles. Neural collapse theory proves three vectors at 120° mutual separation is the globally optimal representation geometry. The networks are stumbling into three-phase structure by accident, paying for it in convergence time.
So instead of letting them stumble into the house, I built the house first. 🤗
Split the hidden vector into three equal stripes at 120° offsets. Add four phase-respecting ops per block: per-phase RMSNorm, a 2D rotation between attention and FFN using Tesla’s 120° offsets, phase-aligned GQA heads, and a fixed signal in the 1D subspace orthogonal to the three phases. The stripes spin like motor windings. Attention and FFN scramble across boundaries. The phase ops pull it back. Equilibrium, not a bolted-on module.
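To make the first two phase ops concrete, here is a rough sketch in plain Python. It is an illustration under assumptions, not the paper's implementation: the stripes are taken as contiguous thirds of the hidden vector, the RMSNorm has no learned gain, and the rotation is applied to consecutive channel pairs with phase k rotated by `base_angle + k * 120°`. All function names and the `base_angle` parameter are hypothetical; the phase-aligned GQA heads and the DC signal are left out.

```python
import math

def split_phases(h):
    """Split a hidden vector into three equal contiguous stripes (assumed layout)."""
    if len(h) % 3 != 0:
        raise ValueError("hidden size must be divisible by 3")
    w = len(h) // 3
    return [h[0:w], h[w:2 * w], h[2 * w:3 * w]]

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def rotate_pairs(x, angle):
    """Rotate consecutive channel pairs (x[0], x[1]), (x[2], x[3]), ... by angle."""
    if len(x) % 2 != 0:
        raise ValueError("stripe width must be even to form channel pairs")
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        out.extend([c * a - s * b, s * a + c * b])
    return out

def phase_ops(h, base_angle=0.0):
    """Per-phase RMSNorm, then a 2D rotation offset by 120 degrees per phase."""
    out = []
    for k, stripe in enumerate(split_phases(h)):
        normed = rms_norm(stripe)
        out.extend(rotate_pairs(normed, base_angle + k * 2 * math.pi / 3))
    return out

h = [float(i + 1) for i in range(12)]  # toy hidden vector, width 12
print(phase_ops(h))
```

Because a rotation preserves length, each output stripe keeps the unit RMS that the per-phase norm gave it.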
But the architecture isn’t the headline. What it revealed is. 🤖
The three-phase balance leaves one direction in channel space empty by construction - the DC direction, orthogonal to all three phases. I filled it with Gabriel’s horn from 1641. The cross-phase residual measures at exactly the analytic horn value to floating-point precision. Every seed, every run. RoPE handles relative position; the horn handles absolute position. They never collide. Mathematics, not optimization.
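The "empty direction" idea can be illustrated on a single three-phase triple. This is a sketch under assumptions, not the paper's code: the DC direction is modeled as the unit vector along (1, 1, 1), as in the zero-sequence component from power engineering, and the horn signal is taken as Gabriel's horn radius 1/x evaluated at x = position + 1. That position mapping, and every function name below, is hypothetical.

```python
import math

# Unit vector along (1, 1, 1): orthogonal to every balanced triple (a + b + c = 0).
DC = (1 / math.sqrt(3),) * 3

def dc_component(triple):
    """Scalar projection of a three-phase triple onto the DC direction."""
    return sum(t * d for t, d in zip(triple, DC))

def balanced_part(triple):
    """Remove the DC component, leaving the part in the plane a + b + c = 0."""
    mean = sum(triple) / 3
    return [t - mean for t in triple]

def horn_profile(position):
    """Gabriel's horn radius 1/x at x = position + 1 (assumed mapping)."""
    return 1.0 / (position + 1)

def inject_dc(triple, position):
    """Add the horn value along the DC direction only."""
    amount = horn_profile(position)
    return [t + amount * d for t, d in zip(triple, DC)]

balanced = [math.cos(0.3 - 2 * math.pi * k / 3) for k in range(3)]
shifted = inject_dc(balanced, position=4)

print(dc_component(balanced))  # zero up to rounding: balanced phases carry no DC
print(dc_component(shifted))   # equals horn_profile(4) up to rounding
```

In this toy setting the injected value can be read back from the DC projection without disturbing the balanced part, which is the property the paragraph above relies on.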
The geometry self-stabilizes. No auxiliary loss, no constraint, no enforcement. The phases settle into balance within 1,000 steps and hold for 29,000 more. The same principle Tesla relied on - balanced loads maintain themselves without active correction. A novel instance of the conservation-law framework for neural networks.
The result at 123M parameters on WikiText-103: −7.20% perplexity. Parameters added: 1,536. That’s 0.00124% of the model. A 1.93× convergence speedup!
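The numbers above can be sanity-checked against each other. This is a small sketch with hypothetical helper names. It reads the 1.93× figure as "the baseline's final quality is reached in 1/1.93 of the steps", which is an assumption about how the speedup is defined.

```python
def implied_total_params(added_params, overhead_percent):
    """Model size implied by an added-parameter count and its overhead in percent."""
    return added_params / (overhead_percent / 100.0)

def fraction_of_steps(speedup):
    """Share of baseline steps needed under the assumed reading of the speedup."""
    return 1.0 / speedup

total = implied_total_params(1_536, 0.00124)
fraction = fraction_of_steps(1.93)

print(f"implied model size: {total / 1e6:.1f}M parameters")
print(f"steps needed relative to baseline: {fraction:.1%}")
```

The implied size comes out at roughly 124 million parameters, in line with the stated 123M scale, and the speedup corresponds to about 52% of the baseline's steps under that reading.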
A 17th-century painter’s paradox riding through the 1D tunnel that a 19th-century motor geometry carves out, dropped into a 2017 Transformer. None of it should compose. All of it does in 2026. 👽
Tesla might never have imagined that his polyphase system would run anything other than rotating machinery. 138 years later, it’s running the Transformer geometry. 😇
Paper: https://arxiv.org/abs/2604.14430
Code: https://github.com/achelousace/three-phase-transformer
Mohammad R. Abu Ayyash, Brains Build Research, Ramallah, Palestine.