Emergence of Frontier Superposition: M\"obius attractor and Cascade Supervision

arXiv cs.LG Papers

Summary

This paper identifies a Möbius attractor and Cascade Supervision as key mechanisms for the emergence of superposition reasoning in transformers, closing a theoretical gap on gradient descent convergence for graph reachability tasks.

arXiv:2605.18820v1 Announce Type: new Abstract: Superposition allows Transformers to reason in depth, carrying an entire reasoning frontier in parallel through a bounded-depth forward pass instead of unrolling serial chain-of-thought tokens. While Zhu et al. (2025) hand-crafted an equal-weight breadth-first frontier in a single residual stream for graph reachability, it remained open whether gradient descent could ever find this target amidst permutation-symmetric saddles. We close this gap on Reachability-by-Superposition over Erd\H{o}s-R\'enyi graphs by isolating architectural and supervisional contributions. Architecturally, we identify a M\"obius attractor: under $S_n$-symmetry in the tree regime, layerwise dynamics reduce to a 1D M\"obius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. On the supervision side, we identify Cascade Supervision: a loss class whose backward pass simultaneously delivers (A) selectivity bootstrap, (B) gradient persistence across depth, and (C) per-step discrimination (e.g., \mathcal{L}_{sup} and \mathcal{L}_{node}). End-to-end supervision fails condition (B) and is provably insufficient: internal gradients at layer c decay as (np)^{-(D-c-2)/2} in the graph fan-out and stall before the manifold is reached. Our thesis: M\"obius attractor + Cascade Supervision = emergence of superposition reasoning. The parameter-free decay law predicts a final-step cosine of 0.35 vs. 0.71 (end-to-end vs. cascade) at depth D=3; experiments confirm 0.37 vs. 0.69, matching within 0.02 at every step.
Original Article

Similar Articles

Transformers Linearly Represent Highly Structured World Models

arXiv cs.LG

This paper demonstrates that transformers trained on Sudoku solving traces build structured world models organized by domain constraints, and identifies a sparse, monosemantic circuit responsible for the naked-single decision rule. The work provides a fully interpretable algorithmic account of transformer reasoning on a combinatorial task.

@askalphaxiv: Another cool research on Looped Transformers They ask the question: "Can we loop a frozen, off-the-shelf checkpoint dir…

X AI KOLs Timeline

This research introduces a technique to loop frozen, off-the-shelf transformer checkpoints at inference time by using damped Runge-Kutta substeps, treating transformer layers as Euler steps in a residual ODE. This allows extra latent compute without fine-tuning, architecture changes, or new weights, showing gains on knowledge tasks like MMLU-Pro, GPQA, and ARC.