Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Hugging Face Daily Papers 06/17/26, 12:00 AM Papers

Summary

Proposes the Bag of Dims framework, which shows that the standard basis of transformer hidden states provides a training-free, architecture-general feature representation in which dimensions encode semantic content via sign patterns. Validated across seven language, vision, and audio models; sign agreement alone detects 175 categories at AUC 0.97-0.99 with no learned rotations.

We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
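The feature-reading rule described above (a feature is a subset of dimensions with a fixed sign pattern, scored by counting sign agreements) can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the helper names `sign` and `sign_agreement`, the toy feature, and the toy hidden states are all hypothetical, and treating a value of exactly zero as positive is an assumption.

```python
from typing import Dict, Sequence


def sign(x: float) -> int:
    """Return +1 for non-negative values and -1 for negative ones.

    Mapping exactly-zero to +1 is an assumption made for this sketch.
    """
    return 1 if x >= 0 else -1


def sign_agreement(hidden: Sequence[float], pattern: Dict[int, int]) -> float:
    """Fraction of a feature's dimensions whose sign matches the pattern.

    `pattern` maps a dimension index to its expected sign (+1 or -1).
    Magnitudes are ignored; only signs are compared.
    """
    if not pattern:
        raise ValueError("pattern must contain at least one dimension")
    matches = sum(
        1 for dim, expected in pattern.items() if sign(hidden[dim]) == expected
    )
    return matches / len(pattern)


# Hypothetical feature: three dimensions with a fixed sign pattern.
feature = {0: +1, 2: -1, 5: +1}

# Toy hidden states (not taken from any real model).
h_match = [0.7, -1.2, -0.3, 0.1, 2.0, 0.4]
h_other = [-0.7, 0.5, 0.9, 0.1, -2.0, 0.4]

score_match = sign_agreement(h_match, feature)  # all three dimensions agree
score_other = sign_agreement(h_other, feature)  # only dimension 5 agrees
```

Thresholding a score like this gives a detector with nothing to train, which is the sense in which the reading is "training-free"; how the paper selects the dimensions and signs for each feature is not specified in the abstract.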

Original Article

Cached at: 06/18/26, 03:58 PM

Source: https://huggingface.co/papers/2606.12629

Abstract

The standard basis of transformer hidden states serves as a training-free, architecture-general feature representation where individual dimensions encode semantic content through signs and confidence through magnitudes, functioning as independent binary registers without requiring learned rotations or optimization.
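To make the sign/magnitude split concrete, the sketch below replaces a hidden state with its unit-magnitude sign pattern and checks whether a linear output head still ranks the same token first. Everything here is a toy assumption: the 4-token, 3-dimension `head` matrix, the `hidden` vector, and the helper names are hypothetical, and the paper's exact top-k evaluation protocol is not given in the abstract.

```python
from typing import List, Sequence


def to_unit_signs(hidden: Sequence[float]) -> List[float]:
    """Discard magnitudes: keep only +1.0 / -1.0 per dimension."""
    return [1.0 if v >= 0 else -1.0 for v in hidden]


def logits(head: Sequence[Sequence[float]], hidden: Sequence[float]) -> List[float]:
    """Apply a linear head (one weight row per token) to a hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy output head: 4 tokens x 3 hidden dimensions (hypothetical values).
head = [
    [1.0, 0.5, -1.0],
    [-1.0, 1.0, 0.0],
    [0.5, -1.0, 1.0],
    [0.0, 0.2, 0.3],
]
hidden = [2.0, -0.1, -1.5]

full_top = top_k(logits(head, hidden), 2)
sign_top = top_k(logits(head, to_unit_signs(hidden)), 2)

# One plausible reading of "preserved top-k accuracy": the token ranked
# first with full magnitudes still appears in the sign-only top-k.
preserved = full_top[0] in sign_top
```

In a real model the head would be the LM head applied after the final normalization; this sketch omits normalization entirely.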

We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
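The independence claim is stated in bits of pairwise mutual information between dimensions. A minimal sketch of how such a number could be computed from sign bits follows, using the standard plug-in estimator I(A;B) = sum over (a,b) of p(a,b) log2[p(a,b) / (p(a) p(b))]. The function names and toy sequences are hypothetical, and whether the paper uses this estimator or applies a bias correction is not stated in the abstract.

```python
import math
from collections import Counter
from typing import List, Sequence


def sign_bits(values: Sequence[float]) -> List[int]:
    """Binarize one dimension's activations: 1 if non-negative, else 0."""
    return [1 if v >= 0 else 0 for v in values]


def mutual_information_bits(a: Sequence[int], b: Sequence[int]) -> float:
    """Plug-in estimate of I(A;B) in bits from paired discrete samples."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    joint = Counter(zip(a, b))
    count_a = Counter(a)
    count_b = Counter(b)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_a[x] * count_b[y]).
        mi += (c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
    return mi


# Toy activations for two dimensions across four tokens (hypothetical).
dim_i = sign_bits([0.3, 1.1, -0.2, -0.9])  # -> [1, 1, 0, 0]
dim_j = sign_bits([0.5, -0.4, 0.8, -0.1])  # -> [1, 0, 1, 0]

mi_independent = mutual_information_bits(dim_i, dim_j)
mi_identical = mutual_information_bits(dim_i, dim_i)
```

Two perfectly independent balanced bits give 0 bits and two identical balanced bits give 1 bit, so a value below 0.006 bits sits very close to the independent end of that range. The plug-in estimator is biased upward on finite samples, so small values are best interpreted with a large token sample.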

Similar Articles

An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars

Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models

SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space

Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models

The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason
