sparse-autoencoders

#sparse-autoencoders

Steered Generation via Gradient-Based Optimization on Sparse Query Features

arXiv cs.LG · 2026-05-25

This paper introduces Prototype-Based Sparse Steering, a method that applies sparse autoencoders to attention query activations in LLMs, then uses gradient-based optimization during inference to steer generation toward target behaviors. The approach is validated on both a logical planning task and a stylistic educational domain, demonstrating interpretable and disentangled control.

Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

arXiv cs.CL · 2026-05-25

This paper introduces a principled approach to multilingual language steering that combines sparse autoencoders (SAEs) trained on multilingual data with a novel layer selection rule based on the intersection of multilingual alignment and language separability. The approach is evaluated on LLaMA-3.1-8B and Gemma-2-9B for machine translation and cross-lingual summarization.

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

arXiv cs.CL · 2026-05-25

This paper uses sparse autoencoders to decompose LLMs into interpretable features and shows that semantic features explain brain alignment with cortical semantic topography, generalizing across English, Chinese, and French.

Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

arXiv cs.LG · 2026-05-25

This paper presents a function-centric framework using Transcoders to trace computational pathways in vision-language models, demonstrating stronger attribution of visual grounding and the ability to predict hallucinations via graph-based features.

From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

arXiv cs.CL · 2026-05-22

This paper proposes a five-stage methodology for causal feature analysis in transformer language models, demonstrated on GPT-2 small for the IOI task. It finds that features are specifically causal but not necessary, and exposes a gap between detection and causal robustness.

Can SAEs Capture Neural Geometry? (6 minute read)

TLDR AI · 2026-05-22

This article explores how sparse autoencoders (SAEs) can capture curved neural geometry, revealing three distinct ways SAE features represent manifolds, and presents an unsupervised pipeline to uncover geometric structure in neural representations.

Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect

arXiv cs.LG · 2026-05-20

This paper characterizes compositional literary primitives in instruction-tuned LLMs using sparse autoencoders, discovering feature classes for self, style, and affect that enable emotion steering across two architectures.

Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning

arXiv cs.LG · 2026-05-19

This paper introduces a diagnostic framework using Sparse Autoencoders to analyze concept-level forgetting in continual learning, finding that much forgetting is due to representational inaccessibility rather than erasure.

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

arXiv cs.LG · 2026-05-19

This paper investigates preference instability in reward models for LLMs, where subtle input variations cause contradictory preference assignments. The authors propose two SAE-based mitigation strategies—SAE Feature Steering and SAE Residual Correction—to reduce incorrect preference assignments without retraining.

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

arXiv cs.LG · 2026-05-15

This paper applies TopK Sparse Autoencoders to three EEG foundation models (SleepFM, REVE, LaBraM) to extract interpretable feature dictionaries and introduces a framework for concept steering, revealing representational failures and clinical entanglements.

Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

Hugging Face Daily Papers · 2026-05-15

SAE-FT introduces a novel fine-tuning method for CLIP models that uses sparse autoencoder constraints to regularize visual representations, improving robustness against distribution shifts while maintaining performance and enabling interpretability.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

arXiv cs.LG · 2026-05-14

This paper introduces a framework for token-level influence attribution in large language models by learning orthogonal latent spaces with sparse autoencoders, enabling precise identification of training data tokens that jointly influence predictions, with applications in high-stakes domains like healthcare.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

arXiv cs.LG · 2026-05-13

This paper introduces a novel adaptive scheduler for steering discrete diffusion language models using sparse autoencoders, demonstrating that targeting interventions based on when specific attributes commit improves control quality and strength over uniform methods.

The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection

arXiv cs.AI · 2026-05-12

This preprint introduces a method to inject emotion vectors into language models to simulate somatic markers, aiming to bridge the gap between semantic and episodic memory. The authors demonstrate that combining emotional echoes with semantic knowledge improves decision-making capabilities, replicating findings from human cognitive science.

Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models

arXiv cs.LG · 2026-05-12

This paper introduces Latent Visualization by Optimization (LVO), a mechanistic interpretability technique that uses sparse autoencoders to visualize monosemantic features in diffusion models like Stable Diffusion 1.5.

The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

arXiv cs.CL · 2026-05-12

This study investigates how LLMs ground abstract concepts compared to humans, finding a significant 'grounding gap' where models rely heavily on word associations rather than emotional or internal states. Using sparse autoencoders, the authors identify internal features related to grounding dimensions, suggesting LLMs possess this information but do not recruit it naturally during generation.

Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs

arXiv cs.LG · 2026-05-12

This research paper introduces 'Feature Rivalry' in Sparse Autoencoder representations as a mechanistic signature of uncertainty in LLMs. Using Gemma-2-2B, the study demonstrates that negatively correlated feature pairs localize uncertainty to specific layers and causally influence model outputs.

Beyond the Black Box: Interpretability of Agentic AI Tool Use

arXiv cs.AI · 2026-05-11

This paper introduces a mechanistic interpretability toolkit using Sparse Autoencoders and linear probes to monitor internal model states before AI agents invoke tools, aiming to improve diagnostics and safety in enterprise workflows.

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

arXiv cs.CL · 2026-05-11

This paper compares Crosscoders and Differential SAEs for detecting backdoors in fine-tuned LLMs, finding that Diff-SAE significantly outperforms Crosscoders by isolating directional activation shifts.

Disillusionment with mechanistic interpretability research [D]

Reddit r/MachineLearning · 2026-05-08

An undergraduate researcher expresses disillusionment with recent mechanistic interpretability research from Anthropic, specifically criticizing their new natural language autoencoder approach as a black-box technique that lacks rigorous metric comparisons against sparse autoencoder baselines.
