sparse-autoencoders

#sparse-autoencoders

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

arXiv cs.LG · 2026-06-18

This paper proposes a post-hoc certification framework for sparse autoencoder (SAE) based interpretability, deriving an upper bound on the frozen language model's risk using measurable quantities. The framework is validated on GPT-2 Small, Gemma-2B, and Llama-3-8B, showing non-vacuous bounds and revealing depth-dependent behavior.

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

arXiv cs.LG · 2026-06-18

This paper demonstrates that interventions on Sparse Autoencoder (SAE) features can be unreliable because suppressed behavior can recover through residual-space optimization, even while the intervention remains active. It reveals a critical gap between feature-level control and actual behavioral completeness in language models.

Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

arXiv cs.LG · 2026-06-16

This paper proposes replacing the inner product scoring in sparse autoencoders with a learned combination of cosine similarity and input magnitude, showing that the resulting features are more interpretable and concept-aligned, with the optimizer consistently preferring cosine over inner product.

Rational Sparse Autoencoder

arXiv cs.LG · 2026-06-16

This paper introduces the Rational Sparse Autoencoder (RSAE), which replaces fixed encoder activations with trainable rational functions, improving the reconstruction-sparsity trade-off on residual-stream activations of open-weight language models across multiple baseline families.

Decompose Sparsely Where You Should, Absorb Densely Where You Should Not

arXiv cs.LG · 2026-06-15

The paper hypothesizes that language model activations contain a low-rank dense component that is inefficiently represented by sparse autoencoders (SAEs). By adding a linear bottleneck to absorb dense structure, the authors reduce dense latents and improve sparse probing performance on Gemma-2-2B.

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Hugging Face Daily Papers · 2026-06-10

This paper studies seed dependence in sparse autoencoders, finding that stable features carry most predictive signal while unstable features reflect reproducible low-dimensional subspaces.

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

arXiv cs.AI · 2026-06-09

This paper identifies a shared latent mechanism across diverse backdoor behaviors in LLMs, using sparse autoencoders to detect and causally suppress these features, enabling unified backdoor detection and mitigation across models and attack types.

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

arXiv cs.LG · 2026-06-09

Query Lens extends Logit Lens to interpret sparse autoencoder features by jointly considering encoder-side key features and decoder-side value features, and accounting for indirect effects from downstream modules. The paper also introduces the Subspace Channel Hypothesis, suggesting downstream modules read features through layer-specific subspaces.

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Hugging Face Daily Papers · 2026-06-08

This paper applies sparse autoencoders to the CosyVoice3 text-to-speech language model, discovering interpretable features that can be steered to control attributes like laughter, speaker gender, and speech rate while preserving content.

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

arXiv cs.LG · 2026-06-08

This paper proposes a unified geometric framework for understanding concept learning and neuron interpretation in sparse autoencoders, formalizing concepts as sets and defining detection, separation, and approximation. It provides error bounds, capacity constraints, and links to formal concept analysis, with experiments on synthetic data.

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

arXiv cs.CL · 2026-06-02

This paper investigates whether auto-generated labels for sparse autoencoder features generalize across languages and scripts, using Serbian digraphia as a controlled testbed. It finds that although feature sets overlap substantially across languages, the labels often fail to track the same concept in non-English inputs, particularly in less-represented scripts.

@bclavie: Very excited to finally share this one after sitting on it for far too long! It's very topical now. Blog post coming ve…

X AI KOLs Timeline · 2026-05-30

Researchers extract indexable, BM25-ready sparse features from frozen dense retrievers using reconstruction-trained sparse autoencoders.

@lateinteraction: Late-interaction sparse retrieval? With neuron-level inverted indexing, on top of unsupervised sparse autoencoders. Wor…

X AI KOLs Timeline · 2026-05-30

This paper presents a single-stage sparse coding method using unsupervised sparse autoencoders and natural inverted indexing to accelerate multi-vector retrieval, outperforming traditional k-means based approaches.

@_reachsumit: Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies @bclavie et al. extract in…

X AI KOLs Following · 2026-05-29

The paper proposes Latent Terms, a method using Sparse Autoencoders to extract BM25-ready sparse features from frozen dense retrievers, achieving competitive performance without retrieval-specific training.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

arXiv cs.AI · 2026-05-29

This paper demonstrates that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing scalability concerns for dictionary learning. The features are multilingual, multimodal, and include safety-relevant concepts like deception and sycophancy, with causal influence on model outputs.

When and How Long? The Readout-Mediator Angle in Temporal Reasoning

arXiv cs.LG · 2026-05-29

This paper introduces the readout-mediator angle to demonstrate that linear probes can decode information from language model activations that is orthogonal to the model's actual causal computation, undermining probe-based interpretability. The finding replicates across model scales and families, revealing a fundamental failure mode in using probes for mechanistic understanding or safety monitoring.

Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models

arXiv cs.LG · 2026-05-29

This paper uses Sparse Autoencoders to analyze the geometry of LoRA-induced representations in language models, finding that LoRA updates occupy partially distinct feature structures not fully captured by pretrained interpretability dictionaries.

Representation Alignment Rests on Linear Structure

arXiv cs.LG · 2026-05-29

This paper investigates the Platonic Representation Hypothesis, proposing that alignment arises from linear structure in representations, and introduces a statistical framework of signal, bias, and noise.

MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models

arXiv cs.CL · 2026-05-29

MechELK is a three-stage framework that combines mechanistic interpretability tools (SAEs, activation patching, causal probing) with representation engineering to elicit latent knowledge from LLMs. It reports 84.7% accuracy and outperforms existing methods such as CCS and linear probing.

Feature Lottery? A Bifurcation Theory of Concept Emergence

arXiv cs.LG · 2026-05-26

This paper introduces a bifurcation theory of representation dynamics to detect when neural networks acquire structured representations during training, using a Hessian analysis of a GMM probe. The resulting ratio β/β_c serves as a label-free phase coordinate that predicts the onset of usable structure and can forecast feature interpretability in sparse autoencoders early in training.
