Tag
This paper investigates whether language model agents can automate the explanation phase of mechanistic interpretability by introducing AgenticInterpBench, a benchmark with 84 semi-synthetic circuits, and HyVE, an agentic explainer that iteratively hypothesizes, validates, and explains circuit components. Experiments show promise but identify reliable validation as a key obstacle.