Emergence of Frontier Superposition: Möbius attractor and Cascade Supervision

arXiv cs.LG 05/20/26, 04:00 AM Papers

Summary

This paper identifies a Möbius attractor and Cascade Supervision as key mechanisms for the emergence of superposition reasoning in transformers, closing a theoretical gap on gradient descent convergence for graph reachability tasks.

arXiv:2605.18820v1 Announce Type: new

Abstract: Superposition allows Transformers to reason in depth, carrying an entire reasoning frontier in parallel through a bounded-depth forward pass instead of unrolling serial chain-of-thought tokens. While Zhu et al. (2025) hand-crafted an equal-weight breadth-first frontier in a single residual stream for graph reachability, it remained open whether gradient descent could ever find this target amidst permutation-symmetric saddles. We close this gap on Reachability-by-Superposition over Erdős-Rényi graphs by isolating architectural and supervisional contributions. Architecturally, we identify a Möbius attractor: under $S_n$-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. On the supervision side, we identify Cascade Supervision: a loss class whose backward pass simultaneously delivers (A) selectivity bootstrap, (B) gradient persistence across depth, and (C) per-step discrimination (e.g., $\mathcal{L}_{\mathrm{sup}}$ and $\mathcal{L}_{\mathrm{node}}$). End-to-end supervision fails condition (B) and is provably insufficient: internal gradients at layer $c$ decay as $(np)^{-(D-c-2)/2}$ in the graph fan-out and stall before the manifold is reached. Our thesis: Möbius attractor + Cascade Supervision = emergence of superposition reasoning. The parameter-free decay law predicts a final-step cosine of 0.35 vs. 0.71 (end-to-end vs. cascade) at depth $D=3$; experiments confirm 0.37 vs. 0.69, matching within 0.02 at every step.
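To make the decay law in the abstract concrete, the sketch below evaluates the factor $(np)^{-(D-c-2)/2}$ for each layer $c$. It is only an illustration of the quoted formula, not the paper's code: the function name `gradient_decay`, the layer range $c = 0, \dots, D-2$, and the example values of $n$, $p$, and $D$ are assumptions made here, and any constant prefactor is omitted. It does not reproduce the cosine predictions (0.35 vs. 0.71), which depend on details not given in the abstract.

```python
def gradient_decay(n: int, p: float, depth: int, layer: int) -> float:
    """Relative gradient scale (np)^{-(D-c-2)/2} as quoted in the abstract.

    n, p   : Erdős-Rényi graph size and edge probability (np is the fan-out)
    depth  : reasoning depth D
    layer  : layer index c (the indexing convention is an assumption)
    """
    fan_out = n * p
    if fan_out <= 0:
        raise ValueError("fan-out n*p must be positive")
    return fan_out ** (-(depth - layer - 2) / 2)


if __name__ == "__main__":
    # Hypothetical setting chosen for round numbers: fan-out np = 4, depth D = 6.
    n, p, depth = 64, 0.0625, 6
    for layer in range(depth - 1):  # c = 0 .. D-2 (assumed range)
        scale = gradient_decay(n, p, depth, layer)
        print(f"layer c={layer}: relative gradient scale = {scale:.4f}")
```

With a fan-out above 1, the factor shrinks geometrically as the layer sits further from the output, which is the sense in which end-to-end supervision fails gradient persistence across depth (condition B).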


Similar Articles

A Robust Foundation Model for Conservation Laws: Injecting Context into Flux Neural Operators via Recurrent Vision Transformers

arXiv cs.LG

This paper proposes a new architecture that augments Flux Neural Operators with recurrent Vision Transformers to solve conservation laws as a foundation model. It demonstrates robust generalization and long-time prediction capabilities across diverse conservative systems without explicit access to governing equations.

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

arXiv cs.AI

This paper presents a unified geometric framework for understanding transformer memory failures, distinguishing between conflict arbitration and hallucination through hidden-state attractor basins. It demonstrates that geometric margin is a superior diagnostic for detecting these failures compared to output entropy, particularly as model scale increases.

Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology

arXiv cs.LG

This paper performs full Jacobian eigendecomposition across production-scale LLMs, revealing a learned spectral gradient from rotation-dominated early layers to symmetric late layers, along with a low-rank bottleneck that compresses perturbations. The results link perturbation propagation and compression to network functional topology.

Transformers Linearly Represent Highly Structured World Models

arXiv cs.LG

This paper demonstrates that transformers trained on Sudoku solving traces build structured world models organized by domain constraints, and identifies a sparse, monosemantic circuit responsible for the naked-single decision rule. The work provides a fully interpretable algorithmic account of transformer reasoning on a combinatorial task.

@askalphaxiv: Another cool research on Looped Transformers They ask the question: "Can we loop a frozen, off-the-shelf checkpoint dir…

X AI KOLs Timeline

This research introduces a technique to loop frozen, off-the-shelf transformer checkpoints at inference time by using damped Runge-Kutta substeps, treating transformer layers as Euler steps in a residual ODE. This allows extra latent compute without fine-tuning, architecture changes, or new weights, showing gains on knowledge tasks like MMLU-Pro, GPQA, and ARC.
