sparse-autoencoders

Tag

Cards List
#sparse-autoencoders

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

arXiv cs.LG · 14h ago Cached

This paper introduces an adaptive scheduler for steering discrete diffusion language models with sparse autoencoders, showing that timing interventions to when specific attributes commit improves both control quality and strength over uniform schedules.

0 favorites 0 likes
#sparse-autoencoders

The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection

arXiv cs.AI · yesterday Cached

This preprint introduces a method to inject emotion vectors into language models to simulate somatic markers, aiming to bridge the gap between semantic and episodic memory. The authors demonstrate that combining emotional echoes with semantic knowledge improves decision-making capabilities, replicating findings from human cognitive science.

0 favorites 0 likes
#sparse-autoencoders

Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models

arXiv cs.LG · yesterday Cached

This paper introduces Latent Visualization by Optimization (LVO), a mechanistic interpretability technique that uses sparse autoencoders to visualize monosemantic features in diffusion models like Stable Diffusion 1.5.

0 favorites 0 likes
#sparse-autoencoders

The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

arXiv cs.CL · yesterday Cached

This study investigates how LLMs ground abstract concepts compared to humans, finding a significant 'grounding gap' where models rely heavily on word associations rather than emotional or internal states. Using sparse autoencoders, the authors identify internal features related to grounding dimensions, suggesting LLMs possess this information but do not recruit it naturally during generation.

0 favorites 0 likes
#sparse-autoencoders

Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs

arXiv cs.LG · yesterday Cached

This research paper introduces 'Feature Rivalry' in Sparse Autoencoder representations as a mechanistic signature of uncertainty in LLMs. Using Gemma-2-2B, the study demonstrates that negatively correlated feature pairs localize uncertainty to specific layers and causally influence model outputs.

0 favorites 0 likes
#sparse-autoencoders

Beyond the Black Box: Interpretability of Agentic AI Tool Use

arXiv cs.AI · 2d ago Cached

This paper introduces a mechanistic interpretability toolkit using Sparse Autoencoders and linear probes to monitor internal model states before AI agents invoke tools, aiming to improve diagnostics and safety in enterprise workflows.

0 favorites 0 likes
#sparse-autoencoders

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

arXiv cs.CL · 2d ago Cached

This paper compares Crosscoders and Differential SAEs for detecting backdoors in fine-tuned LLMs, finding that Diff-SAE significantly outperforms Crosscoders by isolating directional activation shifts.

0 favorites 0 likes
#sparse-autoencoders

Disillusionment with mechanistic interpretability research [D]

Reddit r/MachineLearning · 5d ago

An undergraduate researcher expresses disillusionment with recent mechanistic interpretability research from Anthropic, specifically criticizing their new natural language autoencoder approach as a black-box technique that lacks rigorous metric comparisons against sparse autoencoder baselines.

0 favorites 0 likes
#sparse-autoencoders

Feature Starvation as Geometric Instability in Sparse Autoencoders

arXiv cs.LG · 5d ago Cached

This paper identifies feature starvation in sparse autoencoders as a geometric instability and proposes adaptive elastic net SAEs (AEN-SAEs) to mitigate it without heuristics.

0 favorites 0 likes
#sparse-autoencoders

Structural Instability of Feature Composition

arXiv cs.LG · 5d ago Cached

This paper presents a geometric framework to analyze the instability of feature composition in Sparse Autoencoders, revealing that non-linearities cause a ratchet effect leading to compositional collapse beyond a critical density.

0 favorites 0 likes
#sparse-autoencoders

SLAM: Structural Linguistic Activation Marking for Language Models

arXiv cs.CL · 5d ago Cached

SLAM is a white-box watermarking scheme that embeds marks into the structural geometry of LLM residual streams using sparse autoencoders, achieving 100% detection accuracy with minimal quality loss on Gemma-2 models while avoiding the token-distribution biasing of prior methods.

0 favorites 1 like
#sparse-autoencoders

HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders

arXiv cs.CL · 2026-04-21 Cached

Researchers from Beihang University and other institutions propose HalluSAE, a framework using sparse autoencoders and phase transition theory to detect hallucinations in LLMs by modeling generation as trajectories through a potential energy landscape and identifying critical transition zones where factual errors occur.

0 favorites 0 likes
#sparse-autoencoders

Toward understanding and preventing misalignment generalization

OpenAI Blog · 2025-06-18 Cached

OpenAI researchers investigate 'emergent misalignment'—where fine-tuning a model on narrow incorrect behavior causes broadly unethical responses—and discover a 'misaligned persona' feature in GPT-4o's activations that mediates this phenomenon, enabling potential detection and mitigation strategies.

0 favorites 0 likes
#sparse-autoencoders

Extracting Concepts from GPT-4

OpenAI Blog · 2024-06-06 Cached

OpenAI introduces sparse autoencoders as a method to extract and interpret concepts from large language models like GPT-4, addressing the fundamental challenge of understanding neural network behavior. They release a research paper, code, and feature visualization tools to help researchers train autoencoders at scale and improve AI safety through better interpretability.

0 favorites 0 likes
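The sparse-autoencoder recipe recurring throughout the cards above can be sketched minimally: encode a model activation into a wide, ReLU-sparse feature vector, decode it back, and train against reconstruction error plus an L1 sparsity penalty. This is a toy illustration with untrained random weights; the sizes `d_model`, `d_sae`, and the L1 coefficient are illustrative assumptions, not any particular paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 16, 64  # toy sizes; real SAEs use a much larger d_sae

# Randomly initialized toy weights (illustrative only, not trained).
W_enc = rng.normal(0.0, 0.1, (d_sae, d_model))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_model, d_sae))
b_dec = np.zeros(d_model)

def encode(x):
    """Map an activation to sparse features; ReLU zeroes most of them."""
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)

def decode(f):
    """Reconstruct the original activation from the sparse features."""
    return W_dec @ f + b_dec

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty encouraging sparsity."""
    f = encode(x)
    x_hat = decode(f)
    return np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))

x = rng.normal(size=d_model)  # stand-in for a residual-stream activation
f = encode(x)
print(f.shape, sae_loss(x))
```

In practice the decoder columns, once trained, are the interpretable "feature directions" that the steering, watermarking, and hallucination-detection work above manipulates or monitors.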