Tag
A researcher shares an experimental plan for identifying causal dependencies between capability dimensions in a 31B model using contrastive targeted SFT and circuit tracing, seeking feedback on methodology and related work.
This paper diagnoses systematic errors in attribution patching, a gradient-based approximation used for causal localization in language models, and proposes a second-order correction using Hessian-vector products that improves reliability with minimal additional computational cost.
Proposes MechRL, a reinforcement learning approach to automate circuit discovery in transformer language models. A PPO agent trained on multiple tasks discovers attention head circuits that match known canonical circuits and generalizes to a held-out task.
Introduces a three-step recipe for identifying attention-head circuits in pretrained transformers using a spectral signal and task-pattern screen without requiring labels, validated across 51M to 1B parameter models and multiple architectures.
Researchers introduce PIE, a CLT-native framework for efficient circuit discovery via feature attribution-based pruning, achieving ~40× compression in feature selection while maintaining behavioral fidelity on IOI and Doc-String tasks.