sparse-autoencoders

#sparse-autoencoders

Feature Starvation as Geometric Instability in Sparse Autoencoders

arXiv cs.LG · 2026-05-08

This paper identifies feature starvation in sparse autoencoders as a geometric instability and proposes adaptive elastic net SAEs (AEN-SAEs) to mitigate it without heuristics.

Structural Instability of Feature Composition

arXiv cs.LG · 2026-05-08

This paper presents a geometric framework for analyzing the instability of feature composition in sparse autoencoders. It finds that non-linearities cause a ratchet effect that leads to compositional collapse beyond a critical density.

SLAM: Structural Linguistic Activation Marking for Language Models

arXiv cs.CL · 2026-05-08

SLAM is a white-box watermarking scheme that uses sparse autoencoders to embed marks in the structural geometry of LLM residual streams. It reports 100% detection accuracy with minimal quality loss on Gemma-2 models and avoids the token-distribution biasing of prior methods.

HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders

arXiv cs.CL · 2026-04-21

Researchers from Beihang University and other institutions propose HalluSAE, a framework that combines sparse autoencoders with phase transition theory to detect hallucinations in LLMs. It models generation as a trajectory through a potential energy landscape and identifies critical transition zones where factual errors occur.

Toward understanding and preventing misalignment generalization

OpenAI Blog · 2025-06-18

OpenAI researchers investigate 'emergent misalignment', where fine-tuning a model on narrow incorrect behavior causes broadly unethical responses. They find a 'misaligned persona' feature in GPT-4o's activations that mediates this phenomenon, which could enable detection and mitigation strategies.

Extracting Concepts from GPT-4

OpenAI Blog · 2024-06-06

OpenAI introduces sparse autoencoders as a method to extract and interpret concepts from large language models like GPT-4, addressing the fundamental challenge of understanding neural network behavior. They release a research paper, code, and feature visualization tools to help researchers train autoencoders at scale and improve AI safety through better interpretability.
