Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
Summary
This paper studies seed dependence in sparse autoencoders, finding that stable features carry most predictive signal while unstable features reflect reproducible low-dimensional subspaces.
Cached at: 06/16/26, 03:32 PM
Source: https://huggingface.co/papers/2606.12138
Abstract
Sparse autoencoders exhibit feature stability patterns where stable features carry most predictive signal while unstable features reflect reproducible low-dimensional structure despite individual non-reproducibility.
Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
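To make the two central quantities in the abstract concrete, here is a minimal NumPy sketch on a toy construction. It is not the paper's implementation: the abstract does not specify the similarity measure, the matching threshold, or the estimator, so the cosine-similarity matching, the 0.95 threshold, the seed-frequency estimate of stability, and the helper names (`make_dictionary`, `feature_stability`, `subspace_overlap`) are all assumptions chosen for illustration. Each toy "seed" has two features that are identical across seeds and two features that form a differently rotated basis of the same 2-D plane.

```python
import numpy as np

def unit_rows(W):
    """Normalize each row (one feature direction per row) to unit length."""
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def feature_stability(ref, others, threshold=0.95):
    """Fraction of other-seed dictionaries that contain a feature whose
    cosine similarity to each reference feature is >= threshold.
    (Assumed estimator; the paper's exact definition may differ.)"""
    ref = unit_rows(ref)
    hits = np.zeros(ref.shape[0])
    for W in others:
        sims = ref @ unit_rows(W).T           # pairwise cosine similarities
        hits += sims.max(axis=1) >= threshold  # best match per reference feature
    return hits / len(others)

def subspace_overlap(A, B, rank):
    """Mean squared cosine of the principal angles between the top-`rank`
    row-space subspaces of A and B (1.0 means identical subspaces)."""
    _, _, Va = np.linalg.svd(A, full_matrices=False)
    _, _, Vb = np.linalg.svd(B, full_matrices=False)
    s = np.linalg.svd(Va[:rank] @ Vb[:rank].T, compute_uv=False)
    return float(np.mean(s ** 2))

def make_dictionary(theta, d=8):
    """Hypothetical toy 'seed': two fixed features plus an orthonormal pair
    rotated by `theta` inside the plane spanned by axes 2 and 3."""
    W = np.zeros((4, d))
    W[0, 0] = 1.0
    W[1, 1] = 1.0
    W[2, 2], W[2, 3] = np.cos(theta), np.sin(theta)
    W[3, 2], W[3, 3] = -np.sin(theta), np.cos(theta)
    return W

# Three toy seeds whose rotated pairs differ by 30 degrees each.
dicts = [make_dictionary(t) for t in (0.0, np.pi / 6, np.pi / 3)]

# Per-feature stability of each seed against the remaining seeds.
stab = [
    feature_stability(W, dicts[:i] + dicts[i + 1:])
    for i, W in enumerate(dicts)
]

# Keep only the features that rarely reappear, then compare their spans.
unstable = [W[s < 0.5] for W, s in zip(dicts, stab)]
overlap = subspace_overlap(unstable[0], unstable[1], rank=2)

print("stability (seed 0):", stab[0])
print("unstable-subspace overlap (seed 0 vs 1):", overlap)
```

In this toy setting the two fixed features are matched in every other seed, while the rotated pair is never matched individually, yet the subspaces spanned by the unmatched features coincide. That is the basis-ambiguity picture the abstract describes, shown here only under the stated assumptions.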
Similar Articles
Feature Starvation as Geometric Instability in Sparse Autoencoders
This paper identifies feature starvation in sparse autoencoders as a geometric instability and proposes adaptive elastic net SAEs (AEN-SAEs) to mitigate it without heuristics.
Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs
This research paper introduces 'Feature Rivalry' in Sparse Autoencoder representations as a mechanistic signature of uncertainty in LLMs. Using Gemma-2-2B, the study demonstrates that negatively correlated feature pairs localize uncertainty to specific layers and causally influence model outputs.
Structural Instability of Feature Composition
This paper presents a geometric framework to analyze the instability of feature composition in Sparse Autoencoders, revealing that non-linearities cause a ratchet effect leading to compositional collapse beyond a critical density.
Effects of sparsity and superposition on loss in simple autoencoders
This paper provides a mathematical analysis of superposition in neural networks, deriving upper and lower bounds on L2 reconstruction loss for simple autoencoders with power activation functions, corroborating empirical findings by Elhage et al.
A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders
This paper proposes a unified geometric framework for understanding concept learning and neuron interpretation in sparse autoencoders, formalizing concepts as sets and defining detection, separation, and approximation. It provides error bounds, capacity constraints, and links to formal concept analysis, with experiments on synthetic data.