Tag
This article introduces ReplaySCM, a benchmark designed to evaluate language models' ability to induce executable causal mechanisms from interventional evidence, focusing on semantic replay behavior rather than syntactic matches.