Tag
This paper analyzes latent reasoning models (LRMs) and demonstrates that observable patterns in latent states are not causal explanations of reasoning; it advocates for matched controls and causal tests in interpretability research.
ORCA is a copilot for end-to-end causal analysis that uses agents to guide users through workflows including causal discovery, effect estimation, and root cause analysis, and produces structured reports.
This paper proposes a five-stage methodology for causal feature analysis in transformer language models, demonstrated on GPT-2 small for the indirect object identification (IOI) task. It finds that features are specifically causal but not necessary, and exposes a gap between detection and causal robustness.
This paper identifies a 'Diagnostic Paradox' in multi-module LLM agents: the module most causally responsible for failures (the routing module) is not the best place to intervene, and patching it can harm performance. The authors propose the 'Linguistic Contract' hypothesis and present empirical evidence across three agent families.
This paper investigates the internal mechanisms of LLM-as-a-judge. It finds a shared Latent Evaluator sub-graph in mid-to-late MLPs across models that handles abstract judging, while format-specific terminal branches map the judgment to output tokens, revealing the cause of format-induced inconsistency.
This paper presents causal evidence that hallucination in autoregressive language models results from early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation and activation patching experiments on Qwen2.5-1.5B, it shows that hallucinated trajectories diverge at the first token and exhibit strong causal asymmetry across model layers.
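
For readers unfamiliar with activation patching, the technique named in the last summary, the following is a minimal sketch on a toy network. It is not the paper's experimental setup: the two-layer tanh network, its random weights, the inputs, and the helper names `forward` and `patch_effect` are all assumptions made for illustration. The idea is to cache activations from one run (here called "clean"), overwrite the corresponding activations in a second run ("corrupt"), and measure how far the output moves back toward the clean output.

```python
import numpy as np

# Toy stand-in for a model: two tanh hidden layers and a linear readout.
# All weights and inputs are random and purely illustrative.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 8))
W3 = rng.normal(size=(8, 1))


def forward(x, patch=None):
    """Run the toy network.

    `patch` maps a hidden-layer index to (unit_indices, values); those
    units are overwritten right after the layer is computed.
    Returns the scalar output and the list of hidden activations.
    """
    patch = patch or {}
    acts = []
    h = x
    for i, W in enumerate((W1, W2)):
        h = np.tanh(h @ W)
        if i in patch:
            idx, values = patch[i]
            h = h.copy()
            h[idx] = values
        acts.append(h)
    return (h @ W3).item(), acts


clean_x = rng.normal(size=4)
corrupt_x = rng.normal(size=4)
clean_out, clean_acts = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)


def patch_effect(layer, units):
    """Fraction of the clean-vs-corrupt output gap recovered by copying
    the clean activations of `units` at `layer` into the corrupt run."""
    patched_out, _ = forward(
        corrupt_x, {layer: (units, clean_acts[layer][units])}
    )
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)


# Per-unit effects at each hidden layer.
for layer in (0, 1):
    effects = [patch_effect(layer, [u]) for u in range(8)]
    print(f"layer {layer}:", np.round(effects, 3))
```

In this toy setting, patching every unit of a hidden layer recovers the clean output exactly, because nothing downstream depends on the input except through that layer. Per-unit effects at the last hidden layer also sum to one, since the readout is linear. Neither property should be expected to hold in a transformer with residual connections, where patching is usually done per position, head, or layer.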