This paper introduces a novel adaptive scheduler for steering discrete diffusion language models with sparse autoencoders, demonstrating that timing interventions to the denoising steps at which specific attributes commit improves both control quality and steering strength compared with uniform schedules.
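As a rough illustration of the timing idea (not the paper's actual scheduler), the sketch below gates a steering intervention on a hypothetical per-step "commitment score" for one SAE feature; `commit_score`, the 0.5 threshold, and the toy data are all assumptions:

```python
# Minimal sketch of attribute-timed steering for a discrete diffusion LM.
# Everything here (commit_score, the threshold, the toy activations) is a
# hypothetical stand-in, not the paper's scheduler.
import torch

def commit_score(feature_acts: torch.Tensor) -> torch.Tensor:
    """Per-step activation mass of the target SAE feature, normalized to [0, 1]."""
    mass = feature_acts.abs().sum(dim=-1)          # (num_steps,)
    return mass / (mass.max() + 1e-8)

def timed_steering_mask(feature_acts: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Steer only at denoising steps where the attribute appears to commit,
    instead of applying a uniform intervention at every step."""
    return commit_score(feature_acts) > threshold  # boolean mask over steps

# Toy usage: 20 denoising steps, 8 token positions of a single SAE feature.
acts = torch.rand(20, 8)
acts[5:9] += 3.0                                   # pretend the attribute commits at steps 5-8
mask = timed_steering_mask(acts)
print(mask.nonzero().squeeze(-1))                  # steps where steering would fire
```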
This preprint introduces a method to inject emotion vectors into language models to simulate somatic markers, aiming to bridge the gap between semantic and episodic memory. The authors demonstrate that combining emotional echoes with semantic knowledge improves decision-making capabilities, replicating findings from human cognitive science.
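For readers unfamiliar with the mechanics, a minimal sketch of injecting a vector into a residual stream via a forward hook follows; the model (`gpt2`), layer index, scale, and random `emotion_vec` are placeholders, not the preprint's setup:

```python
# Hedged sketch of steering via an injected "emotion vector"; all specifics
# below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model, not the one used in the preprint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer = model.transformer.h[6]                  # hypothetical injection site
emotion_vec = torch.randn(model.config.n_embd)  # placeholder; a real vector would be
                                                # derived from emotion-laden contexts

def add_emotion(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + 4.0 * emotion_vec      # 4.0 is an arbitrary scale
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(add_emotion)
ids = tok("The decision I made was", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```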
This paper introduces Latent Visualization by Optimization (LVO), a mechanistic interpretability technique that uses sparse autoencoders to visualize monosemantic features in diffusion models like Stable Diffusion 1.5.
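A hedged sketch of the optimization loop such a technique implies: gradient-ascend an input latent so a chosen SAE feature fires strongly. The random stand-in encoder, feature index, and step count are assumptions, not LVO's actual setup:

```python
# Sketch of feature visualization by optimization; the encoder here is a
# random placeholder rather than a trained SAE over diffusion-model latents.
import torch

d_model, d_sae = 64, 512
W_enc = torch.randn(d_model, d_sae) / d_model**0.5  # stand-in SAE encoder
b_enc = torch.zeros(d_sae)
target_feature = 42                                  # arbitrary feature index

latent = torch.randn(1, d_model, requires_grad=True)
opt = torch.optim.Adam([latent], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    acts = torch.relu(latent @ W_enc + b_enc)        # SAE feature activations
    loss = -acts[0, target_feature]                  # maximize the target feature
    loss.backward()
    opt.step()
print(float(torch.relu(latent @ W_enc + b_enc)[0, target_feature]))
```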
This study investigates how LLMs ground abstract concepts compared to humans, finding a significant 'grounding gap' where models rely heavily on word associations rather than emotional or internal states. Using sparse autoencoders, the authors identify internal features related to grounding dimensions, suggesting LLMs possess this information but do not recruit it naturally during generation.
This research paper introduces 'Feature Rivalry' in Sparse Autoencoder representations as a mechanistic signature of uncertainty in LLMs. Using Gemma-2-2B, the study demonstrates that negatively correlated feature pairs localize uncertainty to specific layers and causally influence model outputs.
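A toy version of the rivalry search, assuming access to per-token SAE activations (random data stands in for Gemma-2-2B here):

```python
# Illustrative search for "rival" SAE feature pairs: features whose activations
# are strongly negatively correlated across tokens.
import numpy as np

rng = np.random.default_rng(0)
acts = rng.random((1000, 128))                 # (tokens, sae_features), placeholder
acts[:, 1] = 1.0 - acts[:, 0]                  # plant one rival pair for the demo

corr = np.corrcoef(acts, rowvar=False)         # feature-feature correlation matrix
np.fill_diagonal(corr, 0.0)
i, j = np.unravel_index(np.argmin(corr), corr.shape)
print(f"most anti-correlated pair: ({i}, {j}), r = {corr[i, j]:.3f}")
```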
This paper introduces a mechanistic interpretability toolkit using Sparse Autoencoders and linear probes to monitor internal model states before AI agents invoke tools, aiming to improve diagnostics and safety in enterprise workflows.
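A minimal sketch of the monitoring pattern, assuming a linear probe trained offline; the probe weights, threshold, and `risky` helper are hypothetical, not the toolkit's API:

```python
# Hedged sketch of a pre-tool-call monitor: a linear probe on a hidden state
# decides whether the agent's internal state looks safe before the tool runs.
import numpy as np

rng = np.random.default_rng(1)
probe_w = rng.normal(size=768)        # placeholder probe trained offline
probe_b = 0.0
THRESHOLD = 0.9                       # arbitrary confidence cutoff

def risky(hidden_state: np.ndarray) -> bool:
    logit = hidden_state @ probe_w + probe_b
    p = 1.0 / (1.0 + np.exp(-logit))  # probability of an unsafe invocation
    return p > THRESHOLD

hidden = rng.normal(size=768)         # stand-in residual-stream activation
if risky(hidden):
    print("blocked: probe flagged this tool call for review")
else:
    print("tool call allowed")
```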
This paper compares Crosscoders and Differential SAEs for detecting backdoors in fine-tuned LLMs, finding that Diff-SAE significantly outperforms Crosscoders by isolating directional activation shifts.
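The Diff-SAE idea as described reduces to training a sparse autoencoder on activation differences between the base and fine-tuned model; a toy sketch under assumed architecture and hyperparameters:

```python
# Sketch of training an SAE on base-vs-finetuned activation differences so
# that directional shifts (e.g. a backdoor) surface as dedicated features.
# Architecture and hyperparameters are assumptions, not the paper's.
import torch
import torch.nn as nn

class DiffSAE(nn.Module):
    def __init__(self, d_model=64, d_sae=256):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, diff):
        f = torch.relu(self.enc(diff))
        return self.dec(f), f

base_acts = torch.randn(1024, 64)           # placeholder paired activations
tuned_acts = base_acts + 0.5 * torch.randn(1024, 64)
diff = tuned_acts - base_acts               # the directional shift to explain

sae = DiffSAE()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(100):
    recon, f = sae(diff)
    loss = (recon - diff).pow(2).mean() + 1e-3 * f.abs().mean()  # L2 + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```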
An undergraduate researcher expresses disillusionment with recent mechanistic interpretability research from Anthropic, specifically criticizing their new natural language autoencoder approach as a black-box technique that lacks rigorous metric comparisons against sparse autoencoder baselines.
This paper identifies feature starvation in sparse autoencoders as a geometric instability and proposes adaptive elastic net SAEs (AEN-SAEs) to mitigate it without heuristics.
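An elastic-net SAE penalty is straightforward to write down; the per-feature adaptation rule below is a guess at the flavor of the method, not the paper's AEN-SAE:

```python
# Minimal sketch of an elastic-net SAE penalty: L1 plus L2 on feature
# activations, with a per-feature weight that could be adapted to counter
# starvation (dead features). The adaptation rule is a hypothetical example.
import torch

def elastic_net_penalty(f: torch.Tensor, w: torch.Tensor,
                        l1: float = 1e-3, l2: float = 1e-4) -> torch.Tensor:
    """f: (batch, d_sae) feature activations; w: (d_sae,) per-feature weights."""
    return l1 * (w * f.abs()).sum(dim=-1).mean() + l2 * f.pow(2).sum(dim=-1).mean()

def update_weights(w: torch.Tensor, f: torch.Tensor, lr: float = 0.01) -> torch.Tensor:
    """Hypothetical adaptation: relax the L1 pressure on rarely-firing features."""
    fire_rate = (f > 0).float().mean(dim=0)   # fraction of batch each feature fires on
    return torch.clamp(w - lr * (0.05 - fire_rate), min=0.1, max=10.0)

f = torch.relu(torch.randn(256, 512))         # placeholder feature activations
w = torch.ones(512)
print(elastic_net_penalty(f, w).item())
w = update_weights(w, f)
```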
This paper presents a geometric framework to analyze the instability of feature composition in Sparse Autoencoders, revealing that non-linearities cause a ratchet effect leading to compositional collapse beyond a critical density.
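A toy numerical demo of the broader intuition that interference grows with feature density (illustrative only, not the paper's geometric framework or its ratchet mechanism):

```python
# Superpose k random unit features in a fixed d-dimensional space and measure
# how well a ReLU readout recovers each one as k grows.
import numpy as np

rng = np.random.default_rng(5)
d = 64
for k in (8, 64, 512):
    F = rng.normal(size=(k, d))
    F /= np.linalg.norm(F, axis=1, keepdims=True)   # k unit-norm feature directions
    x = F.sum(axis=0)                               # compose all k features
    readout = np.maximum(0, F @ x)                  # ReLU readout per feature
    err = np.abs(readout - 1.0).mean()              # each feature should read out ~1
    print(f"k={k:4d}  mean readout error = {err:.2f}")
```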
SLAM is a novel white-box watermarking scheme that embeds marks into the structural geometry of LLM residual streams using sparse autoencoders. It achieves 100% detection accuracy with minimal quality loss on Gemma-2 models while avoiding the token-distribution biasing of prior methods.
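A loose toy reconstruction of the white-box pattern: nudge the residual stream along a secret subset of SAE decoder directions and detect by comparing those features' alignment scores to background. The key, scale, and detector below are invented for illustration, not SLAM's actual embedding or test:

```python
# Toy white-box watermark: push activations toward secret SAE feature
# directions, then test whether those features align above chance.
import numpy as np

rng = np.random.default_rng(7)
d_model, d_sae, k = 64, 512, 16
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
secret = rng.choice(d_sae, size=k, replace=False)   # watermark key

def embed(resid: np.ndarray, eps: float = 0.5) -> np.ndarray:
    # eps is exaggerated so the toy detector separates cleanly
    return resid + eps * W_dec[secret].sum(axis=0)

def detect(resid: np.ndarray) -> float:
    scores = W_dec @ resid                          # per-feature alignment scores
    return scores[secret].mean() - scores.mean()    # key features vs. background

clean = rng.normal(size=d_model)
marked = embed(clean)
print(f"clean: {detect(clean):.3f}  marked: {detect(marked):.3f}")
```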
Researchers from Beihang University and other institutions propose HalluSAE, a framework using sparse autoencoders and phase transition theory to detect hallucinations in LLMs by modeling generation as trajectories through a potential energy landscape and identifying critical transition zones where factual errors occur.
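A very loose toy of the trajectory-energy framing, using per-token negative log-probability as a stand-in energy (HalluSAE's landscape is built from SAE features, not raw log-probs):

```python
# Flag "critical transition zones" where a potential-energy proxy spikes
# along the generation trajectory. Purely illustrative.
import numpy as np

rng = np.random.default_rng(4)
energy = rng.gamma(2.0, 1.0, size=50)       # stand-in per-token -log p(token)
energy[30:34] += 6.0                        # pretend a hallucination zone

mu, sigma = energy.mean(), energy.std()
critical = np.where(energy > mu + 2 * sigma)[0]
print(f"critical transition zone at tokens: {critical}")
```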
OpenAI researchers investigate 'emergent misalignment'—where fine-tuning a model on narrow incorrect behavior causes broadly unethical responses—and discover a 'misaligned persona' feature in GPT-4o's activations that mediates this phenomenon, enabling potential detection and mitigation strategies.
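A toy illustration of using a single "persona" direction as a detector: project activations onto it and flag high projections. The direction here is random, whereas OpenAI derives theirs from model internals:

```python
# Project stand-in activations onto a hypothetical persona direction and
# flag samples whose projection exceeds an arbitrary threshold.
import numpy as np

rng = np.random.default_rng(3)
persona_dir = rng.normal(size=4096)
persona_dir /= np.linalg.norm(persona_dir)

def persona_score(activation: np.ndarray) -> float:
    return float(activation @ persona_dir)

acts = rng.normal(size=(5, 4096))
acts[2] += 5.0 * persona_dir                 # plant one "misaligned" sample
for i, a in enumerate(acts):
    flag = " <- flagged" if persona_score(a) > 2.0 else ""
    print(f"sample {i}: {persona_score(a):+.2f}{flag}")
```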
OpenAI introduces sparse autoencoders as a method to extract and interpret concepts from large language models like GPT-4, addressing the fundamental challenge of understanding neural network behavior. They release a research paper, code, and feature visualization tools to help researchers train autoencoders at scale and improve AI safety through better interpretability.
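A minimal TopK sparse autoencoder in the spirit of that release (the accompanying paper uses a TopK activation); the sizes and k below are toy values, not theirs:

```python
# Minimal TopK SAE: keep only the k largest pre-activations per example,
# then reconstruct the input from those sparse features.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model=128, d_sae=1024, k=16):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        pre = self.enc(x)
        topk = torch.topk(pre, self.k, dim=-1)
        f = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        return self.dec(f), f

sae = TopKSAE()
x = torch.randn(32, 128)                  # stand-in model activations
recon, feats = sae(x)
print((recon - x).pow(2).mean().item())   # reconstruction error to minimize
```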