circuit-discovery

#circuit-discovery

Contrastive targeted SFT as a mechinterp method - has anyone mapped causal dependency interactions this way? [D]

Reddit r/MachineLearning · 2d ago

A researcher shares an experimental plan for identifying causal dependencies between capability dimensions in a 31B model using contrastive targeted SFT and circuit tracing, seeking feedback on methodology and related work.

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

arXiv cs.LG · 2026-06-10

This paper diagnoses systematic errors in attribution patching, a gradient-based approximation used for causal localization in language models, and proposes a second-order correction using Hessian-vector products that improves reliability with minimal additional computational cost.

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

arXiv cs.LG · 2026-05-27

Proposes MechRL, a reinforcement learning approach to automate circuit discovery in transformer language models. A PPO agent trained on multiple tasks discovers attention head circuits that match known canonical circuits and generalizes to a held-out task.

Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers

arXiv cs.LG · 2026-05-26

Introduces a three-step recipe for identifying attention-head circuits in pretrained transformers using a spectral signal and task-pattern screen without requiring labels, validated across 51M to 1B parameter models and multiple architectures.

Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

arXiv cs.CL · 2026-04-21

Researchers introduce PIE, a CLT-native framework for efficient circuit discovery via feature attribution-based pruning, achieving ~40× compression in feature selection while maintaining behavioral fidelity on IOI and Doc-String tasks.
