Tag
This paper proposes a post-hoc certification framework for sparse autoencoder (SAE) based interpretability, deriving an upper bound on the frozen language model's risk using measurable quantities. The framework is validated on GPT-2 Small, Gemma-2B, and Llama-3-8B, showing non-vacuous bounds and revealing depth-dependent behavior.
This paper demonstrates that interventions on Sparse Autoencoder (SAE) features can be unreliable because suppressed behavior can recover through residual-space optimization, even while the intervention remains active. It reveals a critical gap between feature-level control and actual behavioral completeness in language models.
This paper proposes replacing the inner product scoring in sparse autoencoders with a learned combination of cosine similarity and input magnitude, showing that the resulting features are more interpretable and concept-aligned, with the optimizer consistently preferring cosine over inner product.
This paper introduces the Rational Sparse Autoencoder (RSAE), which replaces fixed encoder activations with trainable rational functions, improving the reconstruction-sparsity trade-off on residual-stream activations of open-weight language models across multiple baseline families.
The paper hypothesizes that language model activations contain a low-rank dense component that is inefficiently represented by sparse autoencoders (SAEs). By adding a linear bottleneck to absorb dense structure, the authors reduce dense latents and improve sparse probing performance on Gemma-2-2B.
This paper studies seed dependence in sparse autoencoders, finding that stable features carry most predictive signal while unstable features reflect reproducible low-dimensional subspaces.
This paper identifies a shared latent mechanism across diverse backdoor behaviors in LLMs, using sparse autoencoders to detect and causally suppress these features, enabling unified backdoor detection and mitigation across models and attack types.
Query Lens extends Logit Lens to interpret sparse autoencoder features by jointly considering encoder-side key features and decoder-side value features, and accounting for indirect effects from downstream modules. The paper also introduces the Subspace Channel Hypothesis, suggesting downstream modules read features through layer-specific subspaces.
This paper applies sparse autoencoders to the CosyVoice3 text-to-speech language model, discovering interpretable features that can be steered to control attributes like laughter, speaker gender, and speech rate while preserving content.
This paper proposes a unified geometric framework for understanding concept learning and neuron interpretation in sparse autoencoders, formalizing concepts as sets and defining detection, separation, and approximation. It provides error bounds, capacity constraints, and links to formal concept analysis, with experiments on synthetic data.
This paper investigates whether auto-generated labels for sparse autoencoder features generalize across languages and scripts, using Serbian digraphia as a controlled testbed. It finds that while feature sets show substantial overlap across languages, the labels often fail to track the same concept in non-English inputs, particularly in less-represented scripts.
Researchers extract indexable, BM25-ready sparse features from frozen dense retrievers using reconstruction-trained sparse autoencoders.
This paper presents a single-stage sparse coding method using unsupervised sparse autoencoders and natural inverted indexing to accelerate multi-vector retrieval, outperforming traditional k-means-based approaches.
The paper proposes Latent Terms, a method using Sparse Autoencoders to extract BM25-ready sparse features from frozen dense retrievers, achieving competitive performance without retrieval-specific training.
This paper demonstrates that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing scalability concerns for dictionary learning. The features are multilingual, multimodal, and include safety-relevant concepts like deception and sycophancy, with causal influence on model outputs.
This paper introduces the readout-mediator angle to demonstrate that linear probes can decode information from language model activations that is orthogonal to the model's actual causal computation, undermining probe-based interpretability. The finding replicates across model scales and families, revealing a fundamental failure mode in using probes for mechanistic understanding or safety monitoring.
This paper uses Sparse Autoencoders to analyze the geometry of LoRA-induced representations in language models, finding that LoRA updates occupy partially distinct feature structures not fully captured by pretrained interpretability dictionaries.
This paper investigates the Platonic Representation Hypothesis, proposing that alignment arises from linear structure in representations, and introduces a statistical framework of signal, bias, and noise.
MechELK is a three-stage framework combining mechanistic interpretability tools (SAE, activation patching, causal probing) with representation engineering to elicit latent knowledge from LLMs, achieving 84.7% accuracy and outperforming existing methods like CCS and linear probing.
This paper introduces a bifurcation theory of representation dynamics to detect when neural networks acquire structured representations during training, using a Hessian analysis of a GMM probe. The resulting ratio β/β_c serves as a label-free phase coordinate that predicts the onset of usable structure and can forecast feature interpretability in sparse autoencoders early in training.
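Most of the summaries above build on the same basic object, a sparse autoencoder trained to reconstruct model activations through a sparse latent code. As a rough illustration only, the sketch below shows one common formulation: a ReLU encoder, a linear decoder, and a squared-error loss with an L1 sparsity penalty. None of the listed papers is confirmed to use exactly this form; the function names, the toy weights, and the `l1_coeff` parameter are hypothetical and chosen for illustration.

```python
def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation vector x into latents z, then decode to x_hat.

    W_enc and W_dec each hold one row per latent (assumed layout).
    """
    # Encoder: ReLU(W_enc @ x + b_enc)
    z = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    # Decoder: sum of latent activations times their decoder directions, plus bias
    x_hat = [
        b_dec[j] + sum(z[i] * W_dec[i][j] for i in range(len(z)))
        for j in range(len(x))
    ]
    return z, x_hat


def sae_loss(x, z, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on the latents."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in z)
    return recon + l1_coeff * sparsity


# Toy example: 2-dimensional activation, 3 latents (weights are made up).
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, z, x_hat, l1_coeff=0.1)
```

In this toy setup only the first latent fires, so the negative second coordinate of `x` is not reconstructed. That kind of reconstruction residual is what several of the summaries above are concerned with.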