transformer-interpretability

#transformer-interpretability

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Hugging Face Daily Papers · 2026-06-17

Proposes the Bag of Dims framework showing that the standard basis of transformer hidden states provides a training-free, architecture-general feature representation where dimensions encode semantic content via sign patterns; validated across language, vision, and audio models, achieving high accuracy with no learned rotations.

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

arXiv cs.LG · 2026-05-27

Proposes MechRL, a reinforcement learning approach to automate circuit discovery in transformer language models. A PPO agent trained on multiple tasks discovers attention head circuits that match known canonical circuits and generalizes to a held-out task.

Emergence of Frontier Superposition: Möbius attractor and Cascade Supervision

arXiv cs.LG · 2026-05-20

This paper identifies a Möbius attractor and Cascade Supervision as key mechanisms for the emergence of superposition reasoning in transformers, closing a theoretical gap on gradient descent convergence for graph reachability tasks.

@Propriocetive: New preprint: Mathematics is All You Need 2 — Sign-Stabilized Behavioral Fibers in Transformer Residual Streams. Headli…

X AI KOLs Following · 2026-05-10

A new preprint titled 'Mathematics is All You Need 2' presents the 'Two-Channel theorem,' demonstrating that behavioral fibers in transformer residual streams are sign-stabilized and causally steerable across different architectures (Qwen to Llama). The study claims high reproducibility and shows that the behavioral substrate is near-one-dimensional, separating generation from latent structure.

Large Vision-Language Models Get Lost in Attention

arXiv cs.AI · 2026-05-08

This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.
