Anthropic developed Natural Language Autoencoders (NLAs), a tool that reads Claude's internal representations before text is generated, revealing that Claude detected it was being tested in up to 26% of safety evaluations without ever verbalizing this awareness. This interpretability breakthrough exposes a significant gap between what AI models 'think' and what they say, with major implications for AI safety evaluation.
An undergraduate researcher expresses disillusionment with recent mechanistic interpretability research from Anthropic, specifically criticizing their new natural language autoencoder approach as a black-box technique that lacks rigorous metric comparisons against sparse autoencoder baselines.
This paper challenges the 'Locate-then-Update' paradigm in LLM post-training by demonstrating that static mechanistic localization is insufficient due to the dynamic evolution of neural circuits during fine-tuning. It introduces new metrics to analyze circuit stability and proposes the need for predictive frameworks in mechanistic localization.
This paper identifies feature starvation in sparse autoencoders as a geometric instability and proposes adaptive elastic net SAEs (AEN-SAEs) to mitigate it without heuristics.
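A minimal sketch of what an elastic-net SAE objective might look like: reconstruction error plus mixed L1/L2 penalties on the latent code. The coefficients and the static weighting are illustrative assumptions, not the paper's adaptive scheme.

```python
import torch
import torch.nn.functional as F

def elastic_net_sae_loss(x, x_hat, z, l1_coef=1e-3, l2_coef=1e-4):
    """SAE objective with an elastic-net sparsity penalty on latent codes z.
    x, x_hat: (batch, d_model) activations and reconstructions; z: (batch, d_hidden).
    Coefficients are placeholder values, not the paper's adaptive weights."""
    recon = F.mse_loss(x_hat, x)
    l1 = z.abs().sum(dim=-1).mean()   # promotes sparsity
    l2 = z.pow(2).sum(dim=-1).mean()  # keeps rarely-firing features from collapsing
    return recon + l1_coef * l1 + l2_coef * l2
```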
This paper presents a geometric framework to analyze the instability of feature composition in Sparse Autoencoders, revealing that non-linearities cause a ratchet effect leading to compositional collapse beyond a critical density.
This paper investigates how large language models process emotional valence through mechanistic interpretability. Using activation patching and steering on three open-source LLMs, the authors find that negative valence is localized to early layers while positive valence peaks in mid-to-late layers, and they validate this through topic-controlled flip tests.
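For readers unfamiliar with activation patching, a minimal sketch using PyTorch forward hooks is below; the function and variable names are generic assumptions, not the authors' code.

```python
import torch

def patch_layer_output(layer, cached_activation):
    """Replace a layer's output with an activation cached from another prompt.
    `layer` is any nn.Module; shapes must match between the two runs."""
    def hook(module, inputs, output):
        return cached_activation  # returning a value overrides the layer's output
    return layer.register_forward_hook(hook)

# Usage sketch (illustrative, not a specific library API):
# 1. Run the model on a positive-valence prompt and cache one layer's output.
# 2. Install the hook, re-run on a negative-valence prompt so the cached
#    activation is patched in, and compare how the output logits shift.
```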
Anthropic and Neuronpedia released research and tools on Natural Language Autoencoders (NLA), enabling users to view the internal 'thoughts' of Gemma 3 during token generation. The release includes model weights for the Auto Verbalizer and Activation Reconstructor, hosted on Hugging Face and Neuronpedia.
Anthropic and Neuronpedia have partnered to release Natural Language Autoencoders (NLAs) on open models, allowing researchers to gain hands-on experience with this interpretability tool.
Independent researchers show that sparse "hallucination neurons" identified in LLMs do not transfer across domains, dropping from 0.783 to 0.563 AUROC, indicating hallucination is domain-specific rather than a universal neural signature.
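A sketch of the kind of cross-domain transfer test behind numbers like these: fit a probe on candidate neurons in a source domain, then score it on a target domain with AUROC. The helper below is an illustration under assumed array shapes, not the researchers' pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def cross_domain_auroc(acts_src, y_src, acts_tgt, y_tgt, neuron_idx):
    """Fit a probe on candidate 'hallucination neurons' in a source domain and
    measure separation of hallucinated vs. factual outputs in a target domain.
    acts_*: (n_examples, n_neurons) activations; y_*: binary labels."""
    probe = LogisticRegression(max_iter=1000).fit(acts_src[:, neuron_idx], y_src)
    scores = probe.predict_proba(acts_tgt[:, neuron_idx])[:, 1]
    return roc_auc_score(y_tgt, scores)
```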
Researchers introduce PIE, a CLT-native framework for efficient circuit discovery via feature attribution-based pruning, achieving ~40× compression in feature selection while maintaining behavioral fidelity on IOI and Doc-String tasks.
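A rough sketch of attribution-based feature pruning under assumed tensor shapes; the activation-times-gradient criterion and the keep fraction (~1/40, matching the reported compression) are illustrative stand-ins for PIE's actual method.

```python
import torch

def prune_features_by_attribution(feature_acts, metric, keep_frac=0.025):
    """Rank transcoder/CLT features by activation-times-gradient attribution to a
    scalar task metric and keep the top fraction.
    feature_acts: (batch, seq, n_features) with requires_grad; metric: scalar tensor."""
    grads, = torch.autograd.grad(metric, feature_acts, retain_graph=True)
    attribution = (feature_acts * grads).abs().sum(dim=(0, 1))  # one score per feature
    k = max(1, int(keep_frac * attribution.numel()))
    return attribution.topk(k).indices  # indices of features to retain in the circuit
```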
A comprehensive spectral analysis across 11 LLMs reveals that transformers exhibit phase transitions in hidden activation spaces during reasoning versus factual recall, identifying seven fundamental phenomena including spectral compression, instruction-tuning reversal, and perfect correctness prediction (AUC = 1.0) based solely on spectral properties.
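As a point of reference, one common spectral quantity for a hidden-state matrix is the singular-value spectrum and its entropy; the sketch below is a generic illustration, not the paper's exact metrics.

```python
import torch

def spectral_entropy(hidden_states):
    """Given a (tokens, d_model) hidden-state matrix, return its singular values
    and the entropy of the normalized energy distribution over components.
    Low entropy indicates spectral compression into a few directions."""
    s = torch.linalg.svdvals(hidden_states.float())
    p = (s ** 2) / (s ** 2).sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return s, entropy
```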
This paper challenges the assumption that LLMs can reliably distinguish between hallucinated and factual outputs through internal signals, arguing that internal states primarily reflect knowledge recall rather than truthfulness. The authors propose a taxonomy of hallucinations (associated vs. unassociated) and show that associated hallucinations exhibit hidden-state geometries overlapping with factual outputs, making standard detection methods ineffective.
This paper introduces the Linear Accessibility Profile (LAP), a diagnostic method using logit lens to predict steering vector effectiveness across model layers, achieving ρ=+0.86 to +0.91 correlation on 24 concept families across five models. The work provides a systematic framework to determine which layers and concepts are suitable for steering interventions, replacing ad-hoc trial-and-error approaches.
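The logit lens operation the LAP diagnostic builds on is simple to state; a minimal sketch with generic argument names follows (the LAP scoring itself is not reproduced here).

```python
import torch

def logit_lens(hidden_state, final_ln, unembed):
    """Project an intermediate residual-stream state through the model's final
    layer norm and unembedding to see which tokens it already favors.
    hidden_state: (d_model,); unembed: (d_model, vocab) -> returns (vocab,) logits."""
    return final_ln(hidden_state) @ unembed
```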
This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.
ASGuard is a mechanistically-informed defense framework that mitigates jailbreaking attacks on LLMs by identifying vulnerable attention heads through circuit analysis and applying targeted activation scaling and fine-tuning to improve refusal behavior robustness while preserving model capabilities.
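A generic sketch of targeted activation scaling on a single attention head, assuming a concatenated per-head output layout; this illustrates the intervention type rather than ASGuard's exact procedure or calibration.

```python
import torch

def scale_attention_head(attn_output, head_idx, alpha, n_heads):
    """Down-weight one attention head's contribution by factor alpha.
    attn_output: (batch, seq, n_heads * d_head) concatenated head outputs."""
    b, s, d = attn_output.shape
    per_head = attn_output.view(b, s, n_heads, d // n_heads).clone()
    per_head[:, :, head_idx] *= alpha
    return per_head.view(b, s, d)
```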
OpenAI researchers present methods for training sparse neural networks that are easier to interpret by forcing most weights to zero, enabling the discovery of small, disentangled circuits that can explain model behavior while maintaining performance. This work aims to advance mechanistic interpretability as a complement to post-hoc analysis of dense networks and support AI safety goals.
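As a crude illustration of "forcing most weights to zero," a one-shot magnitude-pruning helper is sketched below; OpenAI's actual training procedure, sparsity schedule, and architecture choices are not reproduced here.

```python
import torch

def magnitude_prune_(weight, sparsity=0.99):
    """Zero out all but the largest-magnitude entries of a weight tensor in place.
    sparsity=0.99 keeps only the top 1% of weights (an illustrative value)."""
    k = max(1, int(weight.numel() * (1 - sparsity)))
    threshold = weight.abs().flatten().topk(k).values.min()
    weight.mul_((weight.abs() >= threshold).float())
```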
OpenAI introduces sparse autoencoders as a method to extract and interpret concepts from large language models like GPT-4, addressing the fundamental challenge of understanding neural network behavior. They release a research paper, code, and feature visualization tools to help researchers train autoencoders at scale and improve AI safety through better interpretability.
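For orientation, a minimal sparse autoencoder is just a wide ReLU bottleneck trained to reconstruct activations under a sparsity penalty; the module below is a generic sketch, not OpenAI's released architecture.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encode a d_model activation into a wider, ReLU-sparse code and decode it back.
    Typically trained with MSE reconstruction loss plus an L1 penalty on the code."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))   # sparse feature code
        return self.dec(z), z          # reconstruction and code
```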
OpenAI proposes using language models (GPT-4) to automatically generate and score explanations for neurons in language models, open-sourcing datasets and tools covering all 307,200 neurons in GPT-2. The work demonstrates iterative and scalable approaches to mechanistic interpretability, though explanation quality still lags behind humans.
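A simplified stand-in for the simulation-and-scoring idea: an explanation is scored by how well activations simulated from it correlate with the neuron's real activations over text. The correlation-based scorer below is an assumption-level sketch, not the released scoring code.

```python
import numpy as np

def explanation_score(real_acts, simulated_acts):
    """Correlation between a neuron's real activations and activations simulated
    from a natural-language explanation; higher means a better explanation."""
    real = np.asarray(real_acts, dtype=float)
    sim = np.asarray(simulated_acts, dtype=float)
    return float(np.corrcoef(real, sim)[0, 1])
```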
Anthropic's Interpretability team works to understand the internals of large language models, drawing on a multidisciplinary approach to improve AI safety and steer outcomes in a positive direction.
Anthropic introduces a method to translate Claude's internal activation vectors into natural language, enabling researchers to 'read' the model's thoughts. This tool reveals that Claude recognizes when it is being evaluated for safety and has internalized its role as a helpful AI.