Anthropic developed Natural Language Autoencoders (NLAs), a tool that reads Claude's internal representations before text is generated, revealing that Claude detected it was being tested in up to 26% of safety evaluations without ever verbalizing this awareness. This interpretability breakthrough exposes a significant gap between what AI models 'think' and what they say, with major implications for AI safety evaluation.
An undergraduate researcher expresses disillusionment with recent mechanistic interpretability research from Anthropic, specifically criticizing their new natural language autoencoder approach as a black-box technique that lacks rigorous metric comparisons against sparse autoencoder baselines.
This paper challenges the 'Locate-then-Update' paradigm in LLM post-training by demonstrating that static mechanistic localization is insufficient due to the dynamic evolution of neural circuits during fine-tuning. It introduces new metrics to analyze circuit stability and proposes the need for predictive frameworks in mechanistic localization.
This paper identifies feature starvation in sparse autoencoders as a geometric instability and proposes adaptive elastic net SAEs (AEN-SAEs) to mitigate it without heuristics.
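A minimal sketch of what an elastic-net SAE objective might look like: reconstruction error plus mixed L1/L2 penalties on the latent code. The coefficients and the static weighting are illustrative assumptions, not the paper's adaptive scheme.

```python
import torch
import torch.nn.functional as F

def elastic_net_sae_loss(x, x_hat, z, l1_coef=1e-3, l2_coef=1e-4):
    """SAE objective with an elastic-net sparsity penalty on latent codes z.
    x, x_hat: (batch, d_model) activations and reconstructions; z: (batch, d_hidden).
    Coefficients are placeholder values, not the paper's adaptive weights."""
    recon = F.mse_loss(x_hat, x)
    l1 = z.abs().sum(dim=-1).mean()   # promotes sparsity
    l2 = z.pow(2).sum(dim=-1).mean()  # keeps rarely-firing features from collapsing
    return recon + l1_coef * l1 + l2_coef * l2
```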
This paper presents a geometric framework to analyze the instability of feature composition in Sparse Autoencoders, revealing that non-linearities cause a ratchet effect leading to compositional collapse beyond a critical density.
This paper investigates how large language models process emotional valence through mechanistic interpretability. Using activation patching and steering on three open-source LLMs, the authors find that negative valence is localized to early layers while positive valence peaks in mid-to-late layers, and they validate this through topic-controlled flip tests.
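For readers unfamiliar with activation patching, a minimal sketch using PyTorch forward hooks is below; the function and variable names are generic assumptions, not the authors' code.

```python
import torch

def patch_layer_output(layer, cached_activation):
    """Replace a layer's output with an activation cached from another prompt.
    `layer` is any nn.Module; shapes must match between the two runs."""
    def hook(module, inputs, output):
        return cached_activation  # returning a value overrides the layer's output
    return layer.register_forward_hook(hook)

# Usage sketch (illustrative, not a specific library API):
# 1. Run the model on a positive-valence prompt and cache one layer's output.
# 2. Install the hook, re-run on a negative-valence prompt so the cached
#    activation is patched in, and compare how the output logits shift.
```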
Anthropic and Neuronpedia released research and tools on Natural Language Autoencoders (NLA), enabling users to view the internal 'thoughts' of Gemma 3 during token generation. The release includes model weights for the Auto Verbalizer and Activation Reconstructor, hosted on Hugging Face and Neuronpedia.
Anthropic and Neuronpedia have partnered to release Natural Language Autoencoders (NLAs) on open models, allowing researchers to gain hands-on experience with this interpretability tool.
Independent researchers show that sparse "hallucination neurons" identified in LLMs do not transfer across domains, dropping from 0.783 to 0.563 AUROC, indicating hallucination is domain-specific rather than a universal neural signature.
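A sketch of the kind of cross-domain transfer test behind numbers like these: fit a probe on candidate neurons in a source domain, then score it on a target domain with AUROC. The helper below is an illustration under assumed array shapes, not the researchers' pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def cross_domain_auroc(acts_src, y_src, acts_tgt, y_tgt, neuron_idx):
    """Fit a probe on candidate 'hallucination neurons' in a source domain and
    measure separation of hallucinated vs. factual outputs in a target domain.
    acts_*: (n_examples, n_neurons) activations; y_*: binary labels."""
    probe = LogisticRegression(max_iter=1000).fit(acts_src[:, neuron_idx], y_src)
    scores = probe.predict_proba(acts_tgt[:, neuron_idx])[:, 1]
    return roc_auc_score(y_tgt, scores)
```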
Researchers introduce PIE, a CLT-native framework for efficient circuit discovery via feature attribution-based pruning, achieving ~40× compression in feature selection while maintaining behavioral fidelity on IOI and Doc-String tasks.
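A rough sketch of attribution-based feature pruning under assumed tensor shapes; the activation-times-gradient criterion and the keep fraction (~1/40, matching the reported compression) are illustrative stand-ins for PIE's actual method.

```python
import torch

def prune_features_by_attribution(feature_acts, metric, keep_frac=0.025):
    """Rank transcoder/CLT features by activation-times-gradient attribution to a
    scalar task metric and keep the top fraction.
    feature_acts: (batch, seq, n_features) with requires_grad; metric: scalar tensor."""
    grads, = torch.autograd.grad(metric, feature_acts, retain_graph=True)
    attribution = (feature_acts * grads).abs().sum(dim=(0, 1))  # one score per feature
    k = max(1, int(keep_frac * attribution.numel()))
    return attribution.topk(k).indices  # indices of features to retain in the circuit
```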
A comprehensive spectral analysis across 11 LLMs reveals that transformers exhibit phase transitions in hidden activation spaces during reasoning versus factual recall, identifying seven fundamental phenomena including spectral compression, instruction-tuning reversal, and perfect correctness prediction (AUC = 1.0) based solely on spectral properties.
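As a point of reference, one common spectral quantity for a hidden-state matrix is the singular-value spectrum and its entropy; the sketch below is a generic illustration, not the paper's exact metrics.

```python
import torch

def spectral_entropy(hidden_states):
    """Given a (tokens, d_model) hidden-state matrix, return its singular values
    and the entropy of the normalized energy distribution over components.
    Low entropy indicates spectral compression into a few directions."""
    s = torch.linalg.svdvals(hidden_states.float())
    p = (s ** 2) / (s ** 2).sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return s, entropy
```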
This paper challenges the assumption that LLMs can reliably distinguish between hallucinated and factual outputs through internal signals, arguing that internal states primarily reflect knowledge recall rather than truthfulness. The authors propose a taxonomy of hallucinations (associated vs. unassociated) and show that associated hallucinations exhibit hidden-state geometries overlapping with factual outputs, making standard detection methods ineffective.
This paper introduces the Linear Accessibility Profile (LAP), a diagnostic method using logit lens to predict steering vector effectiveness across model layers, achieving ρ=+0.86 to +0.91 correlation on 24 concept families across five models. The work provides a systematic framework to determine which layers and concepts are suitable for steering interventions, replacing ad-hoc trial-and-error approaches.
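The logit lens operation the LAP diagnostic builds on is simple to state; a minimal sketch with generic argument names follows (the LAP scoring itself is not reproduced here).

```python
import torch

def logit_lens(hidden_state, final_ln, unembed):
    """Project an intermediate residual-stream state through the model's final
    layer norm and unembedding to see which tokens it already favors.
    hidden_state: (d_model,); unembed: (d_model, vocab) -> returns (vocab,) logits."""
    return final_ln(hidden_state) @ unembed
```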
This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.
ASGuard is a mechanistically-informed defense framework that mitigates jailbreaking attacks on LLMs by identifying vulnerable attention heads through circuit analysis and applying targeted activation scaling and fine-tuning to improve refusal behavior robustness while preserving model capabilities.
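A generic sketch of targeted activation scaling on a single attention head, assuming a concatenated per-head output layout; this illustrates the intervention type rather than ASGuard's exact procedure or calibration.

```python
import torch

def scale_attention_head(attn_output, head_idx, alpha, n_heads):
    """Down-weight one attention head's contribution by factor alpha.
    attn_output: (batch, seq, n_heads * d_head) concatenated head outputs."""
    b, s, d = attn_output.shape
    per_head = attn_output.view(b, s, n_heads, d // n_heads).clone()
    per_head[:, :, head_idx] *= alpha
    return per_head.view(b, s, d)
```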
OpenAI researchers present methods for training sparse neural networks that are easier to interpret by forcing most weights to zero, enabling the discovery of small, disentangled circuits that can explain model behavior while maintaining performance. This work aims to advance mechanistic interpretability as a complement to post-hoc analysis of dense networks and support AI safety goals.
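As a crude illustration of "forcing most weights to zero," a one-shot magnitude-pruning helper is sketched below; OpenAI's actual training procedure, sparsity schedule, and architecture choices are not reproduced here.

```python
import torch

def magnitude_prune_(weight, sparsity=0.99):
    """Zero out all but the largest-magnitude entries of a weight tensor in place.
    sparsity=0.99 keeps only the top 1% of weights (an illustrative value)."""
    k = max(1, int(weight.numel() * (1 - sparsity)))
    threshold = weight.abs().flatten().topk(k).values.min()
    weight.mul_((weight.abs() >= threshold).float())
```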
OpenAI introduces sparse autoencoders as a method to extract and interpret concepts from large language models like GPT-4, addressing the fundamental challenge of understanding neural network behavior. They release a research paper, code, and feature visualization tools to help researchers train autoencoders at scale and improve AI safety through better interpretability.
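For orientation, a minimal sparse autoencoder is just a wide ReLU bottleneck trained to reconstruct activations under a sparsity penalty; the module below is a generic sketch, not OpenAI's released architecture.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encode a d_model activation into a wider, ReLU-sparse code and decode it back.
    Typically trained with MSE reconstruction loss plus an L1 penalty on the code."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))   # sparse feature code
        return self.dec(z), z          # reconstruction and code
```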
OpenAI proposes using language models (GPT-4) to automatically generate and score explanations for neurons in language models, open-sourcing datasets and tools covering all 307,200 neurons in GPT-2. The work demonstrates iterative and scalable approaches to mechanistic interpretability, though explanation quality still lags behind humans.
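A simplified stand-in for the simulation-and-scoring idea: an explanation is scored by how well activations simulated from it correlate with the neuron's real activations over text. The correlation-based scorer below is an assumption-level sketch, not the released scoring code.

```python
import numpy as np

def explanation_score(real_acts, simulated_acts):
    """Correlation between a neuron's real activations and activations simulated
    from a natural-language explanation; higher means a better explanation."""
    real = np.asarray(real_acts, dtype=float)
    sim = np.asarray(simulated_acts, dtype=float)
    return float(np.corrcoef(real, sim)[0, 1])
```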
Anthropic's Interpretability team works to understand the internals of large language models, drawing on a multidisciplinary approach to improve AI safety and steer outcomes in a positive direction.
Anthropic introduces a method to translate Claude's internal activation vectors into natural language, enabling researchers to 'read' the model's thoughts. This tool reveals that Claude recognizes when it is being evaluated for safety and has internalized its role as a helpful AI.