lm-agents

Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?

arXiv cs.AI · 4d ago

This paper investigates whether language model agents can automate the explanation phase of mechanistic interpretability by introducing AgenticInterpBench, a benchmark with 84 semi-synthetic circuits, and HyVE, an agentic explainer that iteratively hypothesizes, validates, and explains circuit components. Experiments show promise but identify reliable validation as a key obstacle.
