sparse-autoencoders

#sparse-autoencoders

From Weights to Features: SAE-Guided Activation Regularization for LLM Continual Learning

arXiv cs.LG · 23h ago

This paper proposes a continual learning method for LLMs that uses pretrained sparse autoencoders (SAEs) to regularize in activation space rather than weight space. It reports better memory efficiency and stronger benchmark performance, avoiding catastrophic forgetting without storing previous data.

Discovering Millions of Interpretable Features with Sparse Autoencoders

arXiv cs.LG · 23h ago

This paper introduces Qwen3-Instruct SAE, a suite of sparse autoencoders trained on Qwen3 instruction-tuned models, enabling the discovery of millions of interpretable features and demonstrating refusal steering capabilities.

Localizing RL-Induced Tool Use to a Single Crosscoder Feature

arXiv cs.LG · 23h ago

This paper uses Dedicated Feature Crosscoders to localize RL-induced tool-use capability in Qwen2.5-3B to a single steerable feature, achieving +65pp tool-correctness via feature steering and demonstrating capability spillover to frozen base models.

At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization

arXiv cs.LG · 23h ago

This paper proposes using sparse autoencoders to detect out-of-distribution inputs for transformers, including typos and jailbreak prompts, by analyzing spurious concept activations. The method enables a mechanistically grounded fine-tuning strategy to improve LLM robustness.

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

arXiv cs.LG · 2026-06-18

This paper proposes a post-hoc certification framework for sparse autoencoder (SAE) based interpretability, deriving an upper bound on the frozen language model's risk using measurable quantities. The framework is validated on GPT-2 Small, Gemma-2B, and Llama-3-8B, showing non-vacuous bounds and revealing depth-dependent behavior.

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

arXiv cs.LG · 2026-06-18

This paper demonstrates that interventions on Sparse Autoencoder (SAE) features can be unreliable because suppressed behavior can recover through residual-space optimization, even while the intervention remains active. It reveals a critical gap between feature-level control and actual behavioral completeness in language models.

Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

arXiv cs.LG · 2026-06-16

This paper proposes replacing the inner product scoring in sparse autoencoders with a learned combination of cosine similarity and input magnitude, showing that the resulting features are more interpretable and concept-aligned, with the optimizer consistently preferring cosine over inner product.

Rational Sparse Autoencoder

arXiv cs.LG · 2026-06-16

This paper introduces the Rational Sparse Autoencoder (RSAE), which replaces fixed encoder activations with trainable rational functions, improving the reconstruction-sparsity trade-off on residual-stream activations of open-weight language models across multiple baseline families.

Decompose Sparsely Where You Should, Absorb Densely Where You Should No

arXiv cs.LG · 2026-06-15

The paper hypothesizes that language model activations contain a low-rank dense component that is inefficiently represented by sparse autoencoders (SAEs). By adding a linear bottleneck to absorb dense structure, the authors reduce dense latents and improve sparse probing performance on Gemma-2-2B.

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Hugging Face Daily Papers · 2026-06-10

This paper studies seed dependence in sparse autoencoders, finding that stable features carry most predictive signal while unstable features reflect reproducible low-dimensional subspaces.

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

arXiv cs.AI · 2026-06-09

This paper identifies a shared latent mechanism across diverse backdoor behaviors in LLMs, using sparse autoencoders to detect and causally suppress these features, enabling unified backdoor detection and mitigation across models and attack types.

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

arXiv cs.LG · 2026-06-09

Query Lens extends Logit Lens to interpret sparse autoencoder features by jointly considering encoder-side key features and decoder-side value features, and accounting for indirect effects from downstream modules. The paper also introduces the Subspace Channel Hypothesis, suggesting downstream modules read features through layer-specific subspaces.

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Hugging Face Daily Papers · 2026-06-08

This paper applies sparse autoencoders to the CosyVoice3 text-to-speech language model, discovering interpretable features that can be steered to control attributes like laughter, speaker gender, and speech rate while preserving content.

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

arXiv cs.LG · 2026-06-08

This paper proposes a unified geometric framework for understanding concept learning and neuron interpretation in sparse autoencoders, formalizing concepts as sets and defining detection, separation, and approximation. It provides error bounds, capacity constraints, and links to formal concept analysis, with experiments on synthetic data.

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

arXiv cs.CL · 2026-06-02

This paper investigates whether auto-generated labels for sparse autoencoder features generalize across languages and scripts, using Serbian digraphia as a controlled testbed. It finds that while feature sets show substantial overlap across languages, the labels often fail to track the same concept in non-English inputs, particularly in less-represented scripts.

@bclavie: Very excited to finally share this one after sitting on it for far too long! It's very topical now. Blog post coming ve…

X AI KOLs Timeline · 2026-05-30

Researchers extract indexable, BM25-ready sparse features from frozen dense retrievers using reconstruction-trained sparse autoencoders.

@lateinteraction: Late-interaction sparse retrieval? With neuron-level inverted indexing, on top of unsupervised sparse autoencoders. Wor…

X AI KOLs Timeline · 2026-05-30

This paper presents a single-stage sparse coding method using unsupervised sparse autoencoders and natural inverted indexing to accelerate multi-vector retrieval, outperforming traditional k-means based approaches.

@_reachsumit: Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies @bclavie et al. extract in…

X AI KOLs Following · 2026-05-29

The paper proposes Latent Terms, a method using Sparse Autoencoders to extract BM25-ready sparse features from frozen dense retrievers, achieving competitive performance without retrieval-specific training.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

arXiv cs.AI · 2026-05-29

This paper demonstrates that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing scalability concerns for dictionary learning. The features are multilingual, multimodal, and include safety-relevant concepts like deception and sycophancy, with causal influence on model outputs.

When and How Long? The Readout-Mediator Angle in Temporal Reasoning

arXiv cs.LG · 2026-05-29

This paper introduces the readout-mediator angle to demonstrate that linear probes can decode information from language model activations that is orthogonal to the model's actual causal computation, undermining probe-based interpretability. The finding replicates across model scales and families, revealing a fundamental failure mode in using probes for mechanistic understanding or safety monitoring.
