ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions

arXiv cs.LG Papers

Summary

This article introduces ReplaySCM, a benchmark designed to evaluate language models' ability to induce executable causal mechanisms from interventional evidence, focusing on semantic replay behavior rather than syntactic matches.

arXiv:2605.08197v1 Announce Type: new Abstract: Most causal benchmarks for language models score local answers or graph structure. We introduce ReplaySCM, a 1,300-item benchmark for executable causal mechanism induction from finite interventional evidence. Each item contains binary worlds generated by a latent fully observed acyclic Boolean structural causal model (SCM). A system must output a mechanism map in a restricted Boolean DSL; the submission is parsed, checked for legality and acyclicity, and replayed on training and held-out intervention worlds. Scoring uses replay behavior rather than formula strings, so syntactically different mechanisms receive credit when they behave correctly. ReplaySCM varies the structural information disclosed to the model through Ordered, Block-order, Hidden-order, and Hidden-roots settings, and includes Alternative-SCM tasks that supply a valid reference SCM and ask for a semantically distinct alternative that fits the training worlds, together with a separating intervention and witness. Frontier LLMs infer parts of the functional-parent structure, but held-out replay drops sharply when order or root structure is hidden. We also evaluate a matched support-audit ladder, consisting of Original, Extra Worlds, and Counterexample Audit (CEx), that raises mean local predecessor-pattern coverage from 0.8949 to 0.9815 to 1.0; under the audited searches, no discovered semantic alternative remains consistent with the training worlds. The Ordered/Hidden-order gap persists under this stronger evidence. ReplaySCM complements answer-level causal reasoning and graph-discovery benchmarks by evaluating executable replay generalization from finite interventional evidence, without claiming unique identification of the latent SCM.

Cached at: 05/12/26, 07:02 AM

# ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions
Source: [https://arxiv.org/html/2605.08197](https://arxiv.org/html/2605.08197)
###### Abstract

Most causal benchmarks for language models score local answers or graph structure\. We introduce ReplaySCM, a 1,300\-item benchmark for executable causal mechanism induction from finite interventional evidence\. Each item contains binary worlds generated by a latent fully observed acyclic Boolean structural causal model \(SCM\)\. A system must output a mechanism map in a restricted Boolean DSL; the submission is parsed, checked for legality and acyclicity, and replayed on training and held\-out intervention worlds\. Scoring uses replay behavior rather than formula strings, so syntactically different mechanisms receive credit when they behave correctly\. ReplaySCM varies the structural information disclosed to the model through Ordered, Block\-order, Hidden\-order, and Hidden\-roots settings, and includes Alternative\-SCM tasks that supply a valid reference SCM and ask for a semantically distinct alternative that fits the training worlds, together with a separating intervention and witness\. Frontier LLMs infer parts of the functional\-parent structure, but held\-out replay drops sharply when order or root structure is hidden\. We also evaluate a matched support\-audit ladder—Original, Extra Worlds, and Counterexample Audit \(CEx\)—that raises mean local predecessor\-pattern coverage from 0\.8949 to 0\.9815 to 1\.0; under the audited searches, no discovered semantic alternative remains consistent with the training worlds\. The Ordered/Hidden\-order gap persists under this stronger evidence\. ReplaySCM complements answer\-level causal reasoning and graph\-discovery benchmarks by evaluating executable replay generalization from finite interventional evidence, without claiming unique identification of the latent SCM\.

Serafim Batzoglou serafim\.batzoglou@gmail\.com

## 1 Introduction

Recent causal benchmarks for language models usually score local outputs: an answer to a causal question, a predicted counterfactual value, a graph edge, or a written explanation\. These tasks are useful, but they leave open whether a system can output a causal model that can be reused under new interventions\. We introduce ReplaySCM, a 1,300\-item benchmark for executable causal mechanism induction from finite interventional evidence in a controlled Boolean setting\.

Structural causal models \(SCMs\) are a natural target for this evaluation because they represent mechanisms and interventions in one formal object \(Pearl, [2009](https://arxiv.org/html/2605.08197#bib.bib25)\)\. ReplaySCM uses small, finite, binary, fully observed, acyclic SCMs whose endogenous mechanisms are Boolean expressions, and asks for a mechanism map in a restricted Boolean DSL\. The evaluator parses the map, checks legality and acyclicity, and replays it on both training and held\-out intervention worlds\. Scoring is semantic: formulas receive credit for their replay behavior, independent of their textual form\.

The benchmark is organized around matched versions of the same latent SCMs\. A revealed\-structure ladder varies what the model is told: Ordered gives the full topological order, Block\-order gives only coarse precedence blocks, Hidden\-order hides the endogenous order, and Hidden\-roots also hides the root set\. Two Alternative\-SCM tasks, Ordered and Hidden\-order, supply a valid reference SCM and ask for a semantically distinct alternative with a separating intervention and witness\. Extra Worlds and Counterexample Audit \(CEx\) tasks \(Ordered and Hidden\-order\) progressively add training worlds to reduce finite\-evidence ambiguity\. In CEx, no discovered semantic alternative from LLM outputs, symbolic exact\-search, or 50\-seed bnlearn\+DSL searches fits the training worlds\. The matched revealed\-structure settings isolate disclosure on the same latent SCMs; CEx preserves latent SCMs and held\-out worlds while auditing ambiguity\.

We evaluate frontier LLMs and include two non\-LLM calibration rows\. The empirical picture is consistent across these comparisons\. Frontier LLMs often infer some functional\-parent relationships, but exact executable replay remains hard, especially when order or roots are hidden\. Held\-out replay is much higher among responses that exactly fit all training worlds, while responses that fail on the training worlds rarely replay all held\-out worlds\. Supplying a valid SCM in Alternative\-SCM substantially raises performance, showing that local editing with a supplied causal object is easier than inferring that object from intervention worlds\. The Ordered/Hidden\-order gap persists in the Extra Worlds and CEx settings\.

Contributions\. We make four contributions\. First, we introduce executable\-SCM induction from interventions as a publicly released benchmark evaluated by exact replay on training and held\-out worlds\. Second, we organize the benchmark as a matched revealed\-structure ladder \(Ordered, Block\-order, Hidden\-order, and Hidden\-roots\) together with Alternative\-SCM, which separates reasoning with a supplied SCM from hidden\-structure inference\. Third, we build a generator that filters out trivial shortcuts and tracks residual ambiguity through support filters, shortcut checks, bounded audits, and a three\-level matched support\-audit ladder: Original, Extra Worlds, and Counterexample Audit \(CEx\)\. Fourth, we benchmark frontier LLMs under a shared evaluator and use two non\-LLM baselines to calibrate the difficulty of executable mechanism induction\.

We release a public artifact of benchmark instances, prompt\-export and scoring code, replay and validation scripts, and documentation for reproducing the reported evaluations\.

## 2 Related Work

Causal reasoning benchmarks for language models mostly evaluate local outputs: commonsense causal judgments, intervention questions, counterfactual answers, or graph\-level predictions\. Representative examples include COPA, WIQA, Corr2Cause, CLadder, CounterBench, ExpliCa, CausalFlip, CausalGraphBench, and CausalGraph2LLM \(Roemmele et al\., [2011](https://arxiv.org/html/2605.08197#bib.bib27); Tandon et al\., [2019](https://arxiv.org/html/2605.08197#bib.bib33); Jin et al\., [2024](https://arxiv.org/html/2605.08197#bib.bib16), [2023](https://arxiv.org/html/2605.08197#bib.bib15); Chen et al\., [2025](https://arxiv.org/html/2605.08197#bib.bib5); Miliani et al\., [2025](https://arxiv.org/html/2605.08197#bib.bib18); Wang et al\., [2026](https://arxiv.org/html/2605.08197#bib.bib35); Babakov et al\., [2025](https://arxiv.org/html/2605.08197#bib.bib4); Sheth et al\., [2025](https://arxiv.org/html/2605.08197#bib.bib29)\)\. ReplaySCM differs in the target object: it scores a full executable SCM by replay under interventions\.

ReplaySCM is directly connected to classical causal discovery, interventional structure learning, Boolean network inference, inductive logic programming, and program synthesis\. Constraint\-based, score\-based, interventional, functional\-causal\-model, and continuous\-optimization methods estimate graphs or equivalence classes from observational and interventional data \(Spirtes et al\., [2000](https://arxiv.org/html/2605.08197#bib.bib32); Chickering, [2002](https://arxiv.org/html/2605.08197#bib.bib6); Hauser and Bühlmann, [2012](https://arxiv.org/html/2605.08197#bib.bib14); Shimizu et al\., [2006](https://arxiv.org/html/2605.08197#bib.bib30); Zheng et al\., [2018](https://arxiv.org/html/2605.08197#bib.bib39); Glymour et al\., [2019](https://arxiv.org/html/2605.08197#bib.bib11)\)\. Solver\-checkable symbolic induction has long been studied in Boolean network inference, logic\-model inference, ILP, and synthesis \(Liang et al\., [1998](https://arxiv.org/html/2605.08197#bib.bib17); Quinlan, [1990](https://arxiv.org/html/2605.08197#bib.bib26); Muggleton, [1991](https://arxiv.org/html/2605.08197#bib.bib20); Muggleton and De Raedt, [1994](https://arxiv.org/html/2605.08197#bib.bib21); Solar\-Lezama, [2008](https://arxiv.org/html/2605.08197#bib.bib31); Alur et al\., [2013](https://arxiv.org/html/2605.08197#bib.bib1); Torlak and Bodik, [2014](https://arxiv.org/html/2605.08197#bib.bib34)\)\. For ReplaySCM, the closest prior work is exact symbolic induction from finite interventional evidence, because the benchmark rewards acyclic structure search together with exact Boolean mechanism synthesis under a shared semantic evaluator\. ReplaySCM does not introduce a new causal\-discovery algorithm; it formulates this discrete mechanism\-induction problem as an LLM benchmark with a fixed final\-object contract: output one executable causal mechanism and evaluate it by intervention replay\.

## 3 Benchmark Definition

Each benchmark instance consists of multiple interventional worlds generated by a small binary acyclic SCM\. The required output is one executable Boolean mechanism map\. Credit is assigned by semantic replay on training and held\-out worlds; the evaluator does not compare formula strings\. Any executable SCM that induces the correct replay behavior on the scored worlds is counted as correct, even if its Boolean formulas differ syntactically from the latent gold mechanisms\.

Latent SCM and intervention worlds\. Each problem is generated from a small binary acyclic SCM with observed roots and endogenous variables\. The benchmark provides multiple intervention worlds, each with hard intervention targets, row\-level assignments, and observed rows produced by executing the latent SCM under that intervention\. In Ordered, Block\-order, and Hidden\-order, a submission is an executable mechanism map in the benchmark Boolean DSL\. In Hidden\-roots, the submission must also predict the root set\. In Alternative\-SCM, the model is given a valid reference SCM and must return a semantically distinct alternative that fits the training worlds, together with a separating intervention and witness\.

Replay and metrics\. Replay is semantic: intervened variables are clamped to their assigned values, non\-intervened roots are copied from the observed row, and non\-intervened endogenous variables are computed in a valid topological order of the submitted SCM\. Only non\-intervened endogenous cells are scored\. TrainExact requires exact replay of all scored training cells\. TrainWorldExact and HeldoutWorldExact average exact replay over training and held\-out worlds\. HeldoutExact is stricter: it requires exact training replay and exact replay of every held\-out world\. Appendix [B\.1](https://arxiv.org/html/2605.08197#A2.SS1) gives the formal replay definitions\.
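
As a rough illustration, the four replay metrics can be read as aggregations over per-world exactness flags for a single submission. The sketch below assumes that a world counts as exact when all of its scored cells are replayed correctly and that the world-level metrics are fractions of exactly replayed worlds; the container `ReplayOutcome` and the function `summarize` are hypothetical names, not part of the released evaluator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReplayOutcome:
    """Per-world exactness flags for one submission (hypothetical container)."""
    train_world_exact: List[bool]    # one flag per training world
    heldout_world_exact: List[bool]  # one flag per held-out world


def summarize(outcome: ReplayOutcome) -> Dict[str, float]:
    """Aggregate per-world flags into the four replay metrics for one problem."""
    train = outcome.train_world_exact
    held = outcome.heldout_world_exact
    # Every scored training cell is correct exactly when every training world is exact.
    train_exact = all(train)
    return {
        "TrainExact": float(train_exact),
        "TrainWorldExact": sum(train) / len(train),
        "HeldoutWorldExact": sum(held) / len(held),
        # Strict metric: exact training replay and every held-out world exact.
        "HeldoutExact": float(train_exact and all(held)),
    }
```

Under this reading, a submission that fits every training world but misses one of two held-out worlds would score TrainExact 1.0, HeldoutWorldExact 0.5, and HeldoutExact 0.0.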

Replay example\. With one root variable $R$ and one endogenous variable $Y$ with gold mechanism $Y=\mathrm{not}\ R$, the candidate $\hat{f}_{Y}=\mathrm{not}\ R$ exactly replays both an observational row $(R,Y)=(0,1)$ and an intervention row with $R:=1$ and $(R,Y)=(1,0)$, whereas $\hat{f}_{Y}=R$ fails\.
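
The replay rule and the example above can be sketched in a few lines of Python. This is a minimal sketch, not the released evaluator: mechanisms are represented as plain Python callables rather than parsed DSL formulas, and `replay_row`, `row_exact`, and the `parents` map are hypothetical names introduced for illustration.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

Row = Dict[str, int]


def replay_row(mechs: Dict[str, Callable[[Row], int]],
               parents: Dict[str, List[str]],
               intervention: Row,
               observed: Row) -> Row:
    """Replay one row of an intervention world under a submitted mechanism map."""
    # Non-intervened roots (variables without a mechanism) are copied from the row.
    values = {v: x for v, x in observed.items() if v not in mechs}
    # Intervened variables, roots or endogenous, are clamped to their assigned values.
    values.update(intervention)
    # Endogenous-only dependency graph; iteration raises graphlib.CycleError if cyclic.
    graph = {v: {p for p in parents[v] if p in mechs} for v in mechs}
    for v in TopologicalSorter(graph).static_order():
        if v not in intervention:
            values[v] = mechs[v](values)
    return values


def row_exact(mechs, parents, intervention: Row, observed: Row) -> bool:
    """Score only the non-intervened endogenous cells of the row."""
    replayed = replay_row(mechs, parents, intervention, observed)
    return all(replayed[v] == observed[v] for v in mechs if v not in intervention)


parents = {"Y": ["R"]}
cand_not = {"Y": lambda row: 1 - row["R"]}  # candidate: Y = not R
cand_id = {"Y": lambda row: row["R"]}       # candidate: Y = R

print(row_exact(cand_not, parents, {}, {"R": 0, "Y": 1}))        # observational row
print(row_exact(cand_not, parents, {"R": 1}, {"R": 1, "Y": 0}))  # intervention R := 1
print(row_exact(cand_id, parents, {}, {"R": 0, "Y": 1}))         # the failing candidate
```

The three calls print `True`, `True`, and `False`, matching the example: the negation candidate replays both rows, while the identity candidate already fails on the observational row.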

Semantic structure metrics\. A variable $U$ is a functional parent of a local mechanism $\hat{f}_{V}$ if flipping $U$ can change the value of $\hat{f}_{V}$ under at least one assignment of the remaining variables; self\-loops are excluded\. This yields a directed functional\-parent graph $G(\widehat{M})$ with edges $U\to V$\. Parent F1 is the edge\-level F1 of $G(\widehat{M})$ against the gold functional\-parent graph\. Exact parent map requires every endogenous variable’s functional\-parent set to match the gold set\. Parent SHD is the structural Hamming distance between the directed functional\-parent graphs, with additions/deletions costing 1 and reversals also costing 1\.
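
A minimal sketch of these structure metrics follows, assuming mechanisms are Python callables over a dictionary of variable values and that edges are `(parent, child)` tuples. The function names are hypothetical, and the brute-force truth-table scan is only practical for the small mechanisms the benchmark targets.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Set, Tuple

Edge = Tuple[str, str]


def functional_parents(f: Callable[[Dict[str, int]], int],
                       mentioned: Sequence[str], target: str) -> Set[str]:
    """Variables whose flip changes f under at least one assignment."""
    names = [v for v in mentioned if v != target]  # self-loops are excluded
    found: Set[str] = set()
    for bits in product((0, 1), repeat=len(names)):
        row = dict(zip(names, bits))
        for u in names:
            if f(row) != f({**row, u: 1 - row[u]}):
                found.add(u)
    return found


def parent_f1(pred: Set[Edge], gold: Set[Edge]) -> float:
    """Edge-level F1; returns 0.0 when there is no true-positive edge."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def parent_shd(pred: Set[Edge], gold: Set[Edge]) -> int:
    """Structural Hamming distance with additions, deletions, and reversals costing 1."""
    # A reversal shows up as one extra edge plus one missing edge; count the pair once.
    reversals = sum(1 for (u, v) in pred - gold if (v, u) in gold - pred)
    return len(pred ^ gold) - reversals


# C appears in the formula but never affects its value, so it is not a functional parent.
f_y = lambda r: (r["A"] & r["B"]) | (r["C"] & (1 - r["C"]))
print(functional_parents(f_y, ["A", "B", "C"], "Y"))
```

Because parenthood is defined on behavior rather than syntax, a variable that is mentioned in a formula but cancels out (like `C` here) contributes no edge to the functional-parent graph.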

Benchmark settings and pools\. ReplaySCM uses four revealed\-structure settings \(Ordered, Block\-order, Hidden\-order, and Hidden\-roots\) and the supplementary Alternative\-SCM family\. Ordered \(Ord\-Full\) and Hidden\-order \(Hid\-Full\) are the two 250\-problem pools\. Within these pools, 100 problems are matched with the same latent SCM in each pair\. These 100 problems form the pool for matched Ordered \(Ord\-Match\), Block\-order \(Block\), matched Hidden\-order \(Hid\-Match\), Hidden\-roots \(Hid\-Roots\), and Alternative\-SCM \(Alt\-Ord, Alt\-Hid\)\. The same 100 problems also form a three\-level support\-audit ladder: Original \(Ord\-Match/Hid\-Match\), Extra Worlds \(Ord\-Ext/Hid\-Ext, with additional training worlds and unchanged held\-out worlds\), and Counterexample Audit \(Ord\-CEx/Hid\-CEx, with further worlds that complete local predecessor\-pattern coverage and add counterexamples against discovered train\-consistent alternatives\)\. The benchmark pools, sizes, and relations are listed in Appendix Table [A\.1](https://arxiv.org/html/2605.08197#A1.T1)\.

## 4 Benchmark Construction

Naively sampling latent SCMs and intervention worlds yields many under\-constrained problems: simple shortcut formulas can fit the observed training worlds, and some local mechanisms may never be queried on the predecessor assignments needed to determine them\. ReplaySCM therefore generates latent SCMs and world sets jointly\. Gold mechanisms must depend semantically on every declared parent, attain both Boolean outputs, and appear only in instances that satisfy local support, intervention coverage, distribution shift, and shortcut\-resistance checks\.
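
The first two requirements on gold mechanisms (semantic dependence on every declared parent and attainment of both Boolean outputs) can be checked by enumerating the local truth table. The sketch below is an illustration under the assumption that a mechanism is a Python callable over its declared parents; `is_admissible` is a hypothetical name, and the support, coverage, shift, and shortcut-resistance checks are not modeled here.

```python
from itertools import product
from typing import Callable, Dict, Sequence


def is_admissible(f: Callable[[Dict[str, int]], int],
                  declared_parents: Sequence[str]) -> bool:
    """Check that f attains both outputs and depends on every declared parent."""
    rows = [dict(zip(declared_parents, bits))
            for bits in product((0, 1), repeat=len(declared_parents))]
    # The mechanism must produce both 0 and 1 somewhere in its truth table.
    if {f(r) for r in rows} != {0, 1}:
        return False
    # Each declared parent must change the output under at least one assignment.
    for p in declared_parents:
        if not any(f(r) != f({**r, p: 1 - r[p]}) for r in rows):
            return False
    return True


print(is_admissible(lambda r: r["A"] ^ r["B"], ["A", "B"]))  # XOR uses both parents
print(is_admissible(lambda r: r["A"], ["A", "B"]))           # B is a spurious parent
```

The first call prints `True` and the second `False`: a mechanism that ignores one of its declared parents would be rejected under this check.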

The generator then applies two ambiguity\-reduction stages\. First, a bounded survivor\-reduction loop keeps a pool of shortcut candidates that fit the training worlds and adds new worlds that rule out as many as possible\. Second, a targeted disambiguation stage searches for local semantic alternatives to each endogenous mechanism and adds compact worlds that rule out many alternatives at once\. After generation, bounded ambiguity audits enumerate local alternatives and coordinated upstream/downstream alternative pairs under fixed search budgets\. These audits quantify residual ambiguity under bounded searches; they do not prove uniqueness\.
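
One plausible reading of the survivor-reduction loop is a greedy set-cover style procedure: at each step, add the candidate world that refutes the most surviving shortcut candidates, and stop when the budget is exhausted or no world helps. The sketch below follows that reading and is an assumption, not the generator's documented algorithm; `reduce_survivors`, the `refutes` predicate, and the toy table are hypothetical.

```python
from typing import Callable, Hashable, Iterable, List, Set, Tuple


def reduce_survivors(survivors: Iterable[Hashable],
                     candidate_worlds: Iterable[Hashable],
                     refutes: Callable[[Hashable, Hashable], bool],
                     max_new_worlds: int) -> Tuple[List[Hashable], Set[Hashable]]:
    """Greedily add worlds that rule out the most surviving shortcut candidates."""
    alive = set(survivors)
    remaining = list(candidate_worlds)
    added: List[Hashable] = []
    while alive and remaining and len(added) < max_new_worlds:
        # Pick the world that eliminates the largest number of current survivors.
        best = max(remaining, key=lambda w: sum(refutes(w, s) for s in alive))
        killed = {s for s in alive if refutes(best, s)}
        if not killed:
            break  # no remaining world separates any survivor
        alive -= killed
        added.append(best)
        remaining.remove(best)
    return added, alive


# Toy table: which shortcut candidates each world would rule out.
table = {"w1": {"s1", "s2"}, "w2": {"s3"}, "w3": {"s1"}}
added, alive = reduce_survivors(["s1", "s2", "s3"], ["w1", "w2", "w3"],
                                lambda w, s: s in table[w], max_new_worlds=2)
print(added, alive)
```

In the toy run, `w1` is chosen first because it removes two survivors, then `w2` removes the last one. A bounded budget like `max_new_worlds` is what keeps such a loop from proving uniqueness: candidates outside the enumerated pool are never tested.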

The benchmark balances finite\-evidence support with structural diversity\. Instances are first generated in Matched Hidden\-order form and then converted into matched Ordered, Block\-order, and Hidden\-roots variants of the same latent SCM\. Alternative\-SCM is constructed from valid alternatives that fit the training worlds in paired result pools, deduplicated by semantic signature, and retained only when a single\-variable separating intervention and witness exist\.

Support\-audit ladder\. The three\-level support\-audit ladder uses the same 100 Matched Ordered/Hidden\-order SCMs to ask whether the Ordered/Hidden\-order gap remains after adding more evidence\. The Original level is Ord\-Match/Hid\-Match\. The Extra Worlds level adds the same 3–4 gold\-simulated training worlds per problem to both disclosure settings, excludes held\-out intervention signatures, and raises mean local predecessor\-pattern coverage from 0\.8949 to 0\.9815 while preserving the held\-out worlds\. The Counterexample Audit \(CEx\) level starts from Extra Worlds, completes local predecessor\-pattern coverage with gold\-simulated worlds, and then adds separating worlds until no discovered semantic alternative from LLM outputs, symbolic exact\-search, or 50\-seed bnlearn\+DSL searches still fits the training worlds\. These variants add evidence, but they also make the prompts longer\. The complete generator specification is in Appendix B\.2\.
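
To make the coverage statistic concrete, the sketch below computes one possible version of local predecessor-pattern coverage. It assumes that the predecessors of a variable are its gold functional parents and that a pattern counts as covered when it appears in a training row where the variable is not intervened on; the formal definition is in the generator specification and may differ. The name `pattern_coverage` and the row format are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Each training row: (set of intervened variables, full variable assignment).
TrainRow = Tuple[Set[str], Dict[str, int]]


def pattern_coverage(parents: Dict[str, List[str]], rows: List[TrainRow]) -> float:
    """Mean fraction of parent-assignment patterns observed per endogenous variable."""
    per_variable = []
    for var, ps in parents.items():
        # Only rows where `var` is computed by its mechanism inform that mechanism.
        seen = {tuple(values[p] for p in ps)
                for targets, values in rows if var not in targets}
        per_variable.append(len(seen) / 2 ** len(ps))
    return sum(per_variable) / len(per_variable)


rows = [
    (set(), {"A": 0, "B": 0, "Y": 0}),
    (set(), {"A": 1, "B": 0, "Y": 0}),
    ({"Y"}, {"A": 1, "B": 1, "Y": 0}),  # Y is clamped, so this pattern is not counted
]
print(pattern_coverage({"Y": ["A", "B"]}, rows))
```

Here two of the four `(A, B)` patterns are observed with `Y` free, so the coverage is 0.5. Under this reading, a coverage of 1.0 means every local truth-table row of every mechanism is exercised at least once by the training worlds.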

## 5 Experimental Setup

All experiments use the same fixed benchmark snapshot\. The two full pools are Ordered \(full\) and Hidden\-order \(full\), each with 250 problems\. All matched, Alternative\-SCM, Extra Worlds, and CEx settings are derived from the same 100\-problem same\-latent subpool\. Support\-audit variants share latent SCMs and held\-out worlds\. Original and Extra Worlds also share training worlds, while CEx may add setting\-specific counterexample training worlds\. Every system output is parsed, checked, and replayed by the same evaluator\. Appendix Table [A\.1](https://arxiv.org/html/2605.08197#A1.T1) summarizes the inventory and naming convention\.

### 5\.1 Systems and shared evaluator

Figure [A\.1](https://arxiv.org/html/2605.08197#A1.F1) shows the shared evaluation contract\. For each instance, every system receives the same benchmark record: structured training worlds, intervention annotations, observed variables, task disclosure, allowed operators, and the required output schema\. Systems differ only in candidate generation; all scored objects are executable SCMs in the benchmark DSL, with any extra fields required by supplementary settings\.

We include two fixed\-protocol non\-LLM baselines\. bnlearn\+DSL uses the bnlearn structure\-learning toolkit \(Scutari, [2010](https://arxiv.org/html/2605.08197#bib.bib28)\) to propose candidate parent structure and then fits executable Boolean mechanisms in the benchmark DSL\. The symbolic exact\-search baseline searches directly for Boolean mechanisms that replay all training worlds exactly under fixed staged budgets\.

LLM prompting\. Each LLM receives the same benchmark record rendered as a structured prompt: task metadata, variable roles, revealed structural information, DSL grammar, intervention modes, and tabular training worlds\. The model is asked for one schema\-compatible answer object\. The evaluation uses a direct\-generation protocol: no tool use, self\-consistency voting, evaluator\-guided revision, or semantic repair is allowed\. Appendix [A\.2](https://arxiv.org/html/2605.08197#A1.SS2) lists the model identifiers, run dates, decoding settings, response\-extraction rule, stored\-answer selection policy, and provider/model references, and notes that the final snapshot does not store a uniform per\-item retry log\. Appendix [B\.3](https://arxiv.org/html/2605.08197#A2.SS3) shows a prompt excerpt\.

Non\-LLM baselines\. The two non\-LLM calibration rows receive the same training worlds and revealed\-structure fields as the LLM prompt\. The symbolic exact\-search baseline searches directly for train\-exact Boolean DSL mechanisms under a fixed staged budget\. The bnlearn\+DSL baseline uses bnlearn only for structure proposal; a shared exact Boolean fitter then synthesizes executable DSL mechanisms over the proposed parents\. Both baselines are scored by the same parser, legality checks, acyclicity check, and replay evaluator as the LLM submissions\. These rows are benchmark\-specific induction procedures, distinct from off\-the\-shelf end\-to\-end causal discovery systems\. They show how much of the task remains difficult even for fixed symbolic or hybrid search pipelines under the same evaluator\. Appendix [H](https://arxiv.org/html/2605.08197#A8) gives the fixed procedures and budgets\.

We report TrainExact, TrainWorldExact, HeldoutWorldExact, and HeldoutExact to distinguish exact replay on the exposed worlds from replay under new interventions\. Coverage is the fraction of benchmark problems with a scored result\. Conditional held\-out metrics, reported mainly in the appendix, compute held\-out replay only for responses that are train\-exact\. Alternative\-SCM and Hidden\-roots use task\-specific summaries: joint success for Alternative\-SCM, and root\-set exactness together with downstream mechanism induction for Hidden\-roots\. Any reported conditional rate with denominator 1–5 is suppressed and shown as \*, while – indicates that the quantity is undefined because the conditioning event is empty\. Figure [1](https://arxiv.org/html/2605.08197#S6.F1) bootstrap intervals resample the 100 matched latent problem IDs, preserve same\-latent pairing, and treat stored model outputs as fixed\.
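
The reporting conventions above can be sketched as follows. This is an illustrative sketch under two assumptions that the text does not state: the intervals are percentile intervals, and the rate is printed to three decimals. `conditional_rate` and `bootstrap_intervals` are hypothetical names; pairing is preserved by reusing each resample of problem IDs across all settings.

```python
import random
from typing import Dict, Tuple


def conditional_rate(successes: int, denominator: int) -> str:
    """Format a conditional rate with the suppression rules used in the tables."""
    if denominator == 0:
        return "–"  # undefined: the conditioning event is empty
    if denominator <= 5:
        return "*"  # defined but suppressed: denominator between 1 and 5
    return f"{successes / denominator:.3f}"


def bootstrap_intervals(scores: Dict[str, Dict[int, float]],
                        n_boot: int = 2000,
                        seed: int = 0) -> Dict[str, Tuple[float, float]]:
    """95% percentile intervals of per-setting means over resampled problem IDs."""
    rng = random.Random(seed)
    ids = sorted(next(iter(scores.values())))
    means = {setting: [] for setting in scores}
    for _ in range(n_boot):
        # One resample of problem IDs is shared by every setting (same-latent pairing).
        sample = rng.choices(ids, k=len(ids))
        for setting, by_id in scores.items():
            means[setting].append(sum(by_id[i] for i in sample) / len(sample))
    intervals = {}
    for setting, values in means.items():
        values.sort()
        intervals[setting] = (values[int(0.025 * n_boot)],
                              values[int(0.975 * n_boot) - 1])
    return intervals
```

Because stored model outputs are treated as fixed, the only randomness in such a procedure is the choice of problem IDs; the intervals therefore describe variation across latent SCMs, not across repeated model samples.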

## 6 Results

We evaluate training replay, held\-out replay, semantic parent structure, and held\-out replay among train\-exact responses \(Tables 1 and 2\)\. TrainExact and HeldoutWorldExact generally decline as structural information is withheld along the same\-latent disclosure ladder \(Figure 1\)\.

### 6\.1 Exact executable induction remains unsolved

On Ordered, Block\-order, and Hidden\-order, frontier LLMs remain well below perfect exact replay\. TrainExact, HeldoutWorldExact, and HeldoutExact are all far from 1\.0, with the largest difficulty on Hidden\-order\. Among the LLM rows, GPT\-5\.4 is strongest on Hidden\-order, followed by Claude Opus 4\.6 and Gemini 3\.1 Pro\.

Held\-out replay is much higher among responses that replay all training worlds exactly\. Responses that fail on the training worlds also tend to fail on held\-out worlds\. The non\-LLM calibration rows in Table 1 show that many items admit high\-scoring executable solutions under the same parser and replay evaluator\. This gap therefore reflects the difficulty of direct executable mechanism induction, rather than a problem with the DSL or scorer\.

### 6\.2 Analysis of semantic structure and local mechanisms

For frontier LLMs, schema, parsing, and executability failures are uncommon\. The main errors come after parsing: models often infer some parent relationships but miss the exact parent map \(Table 2\)\.

We measure semantic parent structure with parent recall, parent F1, structural Hamming distance \(SHD\), per\-variable parent exactness, and exact parent\-map matching\. A variable counts as a parent only if flipping it can change the parsed local Boolean function\. On Hidden\-order, GPT\-5\.4 achieves Parent F1 0\.891 but Exact parent map only 0\.280, showing that frontier models often infer informative dependency structure without matching the exact structure and local mechanisms needed for full replay\. This pattern holds across the reported models: partial parent\-structure inference is common, while exact parent\-map and mechanism matching are much harder\.

Once an exact parent map is matched, models usually produce mechanisms that are exact on the exposed worlds and on the held\-out worlds, with most conditional accuracies ranging from 0\.8 to 1\.0 \(Table 2, TrainExact and HeldoutExact \| exact parent map\)\.

Table 1: ReplaySCM core benchmark performance\. Ord\-Full and Hid\-Full contain 250 problems; Block contains the matched 100\-problem block\-order pool\. The central comparison is among LLMs; non\-LLM rows calibrate task difficulty\. Valid denotes executable SCM submissions\. TrainExact is exact replay on all scored training cells\. HeldoutWorld denotes HeldoutWorldExact, the unconditioned mean exact held\-out world replay rate\. HeldoutExact is strict: it requires exact training replay and exact replay of every held\-out world\. Here and below, \* denotes a defined conditional rate suppressed because the denominator is between 1 and 5, and – denotes an undefined conditional rate with zero denominator\. Boldface, where present, marks the best LLM value within the setting and column; non\-LLM calibration rows are excluded from this highlighting\.

Table 2: Structural and local\-semantic diagnostics\. Parent Recall, Parent F1, Parent SHD, Per\-var\. parent exact, and Exact parent map compare semantic dependency structure against the gold SCM among executable responses\. Mean local match is the fraction of endogenous variables whose Boolean mechanisms match semantically\. The two conditional columns are conditioned on exact parent\-map matching among executable responses\. The \* and – notation follows Table [1](https://arxiv.org/html/2605.08197#S6.T1)\. Boldface, where present, marks the best LLM value within the setting and column\.

### 6\.3 Hiding structure drives the largest losses

Across the structural\-information ladder from Matched Ordered to Hidden\-roots, TrainExact and HeldoutWorldExact generally fall as less structure is revealed \(Figure 1\)\.

The comparison uses the same latent SCMs across disclosure conditions, so score changes reflect how much structural information is revealed\. The largest and most consistent drops occur on Matched Ordered→Matched Hidden\-order and Matched Hidden\-order→Hidden\-roots; paired deltas are reported in Appendix D \(Table D\.1\)\. Some individual model/metric curves are not monotone\. Block\-order falls between the two extremes: coarse precedence information helps, but hidden structure remains difficult\.

![[Uncaptioned image]](https://arxiv.org/html/2605.08197v1/x1.png)

Figure 1: Same\-latent disclosure ladder for LLMs\. Left: TrainExact; right: HeldoutWorldExact, labeled HeldoutWorld in the panel title\. Markers are means over the 100 matched latent SCMs; vertical bars are 95% bootstrap intervals over matched problem IDs\.

Alternative\-SCM supplies a valid reference SCM, so the model does not have to infer the structure from the intervention worlds\. The task is to construct a semantically distinct alternative that fits the training worlds, along with a separating intervention and witness\. Joint success remains high under the leave\-model\-out restriction \(Table 3\)\. Successful alternatives typically preserve most of the supplied structure and make local edits\. Alternative\-SCM therefore tests local editing with a supplied SCM; broad alternative discovery is outside its scope \(Appendix G\)\.

Hidden\-roots separates root\-set prediction from mechanism induction\. Once the root set is hidden, frontier LLMs struggle with both parts of the task\. Table 3 reports root\-set exactness separately from mechanism replay\. Exact mechanism replay remains weak even when the root set is correct \(Appendix Tables G\.6–G\.8\)\.

Stronger disambiguation during benchmark generation improves held\-out replay on Ordered and reduces train\-to\-held\-out gaps on Hidden\-order\. The Ordered/Hidden\-order gap remains in the support\-audit ladder \(Appendix E\.2; Appendix F\)\.

Table 3: Alternative\-SCM and Hidden\-roots results\. Panel A compares paired\-source induction with Alternative\-SCM on the same matched source problems\. Paired TrainCorrect is correctness on the paired source task\. Alt\-SCM joint requires a train\-exact semantically distinct alternative together with a successful separating experiment and witness\. Joint, leave\-model\-out excludes alternatives sourced from the evaluated model; because its eligible\-case denominator can differ, this column need not be numerically below the all\-source joint column\. Panel B reports Hidden\-roots RootExact and replay\. RootExact is exact root\-set prediction, and the RootExact\-conditioned replay columns condition on exact root\-set prediction\. HeldoutWorld denotes HeldoutWorldExact\. The \* and – notation follows Table [1](https://arxiv.org/html/2605.08197#S6.T1)\. Boldface, where present, marks the best LLM value within the setting and column; non\-LLM calibration rows are excluded from this highlighting\.

### 6\.4 Additional Analyses

Appendix C breaks errors down into validity, parent structure, mechanism matching, and held\-out replay\. Frontier models often infer informative dependencies before they match the exact parent map or the exact local mechanisms needed for full replay\. Held\-out replay is much higher among train\-exact responses, so many failed submissions already fail on the training worlds \(see the appendix tables on the mechanism validity funnel and conditional retention\)\.

The matched support\-audit ladder—Original, Extra Worlds, and CEx—asks whether ambiguity in the finite worlds explains the Ordered/Hidden\-order gap\. Extra Worlds raises mean local predecessor\-pattern coverage from 0\.8949 to 0\.9815 and sharply reduces discovered alternatives that fit the training worlds\. CEx completes bounded local predecessor\-pattern coverage \(mean 1\.0, with all 100 problems fully covered\), preserves the same held\-out worlds, and adds counterexamples until no discovered semantic alternative from LLM outputs, symbolic exact\-search, or 50\-seed bnlearn\+DSL searches still fits the training worlds\. In the CEx rows, HeldoutExact equals TrainExact for every LLM row with TrainExact \> 0, so train\-exact CEx outputs also replay all held\-out worlds exactly\. Thus, in the most heavily audited matched setting, the main remaining difference is whether the model can find an executable SCM that fits the exposed worlds in the first place\. Hidden\-order nevertheless remains below Ordered on the same latent SCMs and held\-out worlds despite the added evidence \(Appendix E\.2, Table E\.1; Appendix F, Tables F\.2 and F\.7–F\.14\)\.

The supplementary settings show where models fail\. Alternative\-SCM performance remains strong, including under leave\-model\-out restrictions, showing that local edits are easier when a valid SCM is supplied\. Hidden\-roots adds a separate root\-set prediction problem before mechanism induction\. Together, these analyses show that inferring hidden structure is harder than producing a valid answer object or fitting isolated local mechanisms on the training worlds \(Appendix G, Tables G\.1–G\.8\)\.

## 7 Discussion and Limitations

ReplaySCM evaluates causal reasoning through an executable object: a reusable mechanism map that must continue to work under new interventions\. This exposes failures that can remain hidden behind local answers and graph predictions\. A model can produce a plausible explanation or a plausible edge set, yet still fail when its proposed mechanisms are replayed\. The benchmark therefore separates two capabilities that are often conflated: describing causal structure and constructing a causal object that can be executed\.

The main empirical pattern is that hidden structure is hard\. Frontier LLMs often infer many functional\-parent relationships, but they rarely assemble the exact parent map and Boolean mechanisms needed to replay all worlds\. Alternative\-SCM sharpens this point: when a valid SCM is supplied, models are much better at making local semantic edits and proposing separating interventions\. Hidden\-roots shows the complementary failure mode: hiding the root set adds a hard root\-set prediction problem before mechanism induction even begins\.

The support\-audit results strengthen this interpretation\. Extra Worlds and CEx add evidence, improve local predecessor\-pattern coverage, and reduce discovered train\-consistent alternatives\. CEx reaches full bounded local predecessor\-pattern coverage, and no discovered alternative from LLM outputs, symbolic exact\-search, or 50\-seed bnlearn\+DSL searches remains train\-consistent\. Even under this stronger finite evidence, Hidden\-order remains harder than Ordered on the same latent SCMs and held\-out worlds\.

The benchmark is intentionally narrow: small, finite, binary, fully observed Boolean SCMs\. That restriction is what makes exact replay, solver\-checkable scoring, same\-latent comparisons, and bounded ambiguity audits possible\. The audits rule out many easy finite\-sample shortcuts within their search budgets, but finite worlds still do not establish global uniqueness\.

ReplaySCM therefore provides a controlled testbed for executable causal mechanism induction\. Natural next steps include noisy or partially observed SCMs, larger hidden test sets, natural\-language problem statements, and interactive protocols in which models propose interventions, observe outcomes, and revise candidate mechanisms\.

## References

- Alur et al\. \(2013\) Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo M\. K\. Martin, Mukund Raghothaman, Sanjit A\. Seshia, Rishabh Singh, Armando Solar\-Lezama, Emina Torlak, and Abhishek Udupa\. Syntax\-guided synthesis\. In *2013 Formal Methods in Computer\-Aided Design*, pages 1–8\. IEEE, 2013\. doi:10\.1109/FMCAD\.2013\.6679385\.
- Anthropic \(2026a\) Anthropic\. Introducing Claude Opus 4\.6\. [https://www\.anthropic\.com/news/claude\-opus\-4\-6](https://www.anthropic.com/news/claude-opus-4-6), 2026a\. Accessed May 4, 2026\.
- Anthropic \(2026b\) Anthropic\. Claude Opus 4\.6 System Card\. [https://www\-cdn\.anthropic\.com/14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5\.pdf](https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf), 2026b\. Accessed May 4, 2026\.
- Babakov et al\. \(2025\) Nikolay Babakov, Ehud Reiter, and Alberto Bugarín\-Diz\. CausalGraphBench: a benchmark for evaluating language models’ capabilities of causal graph discovery\. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 4: Student Research Workshop\)*, pages 240–258\. Association for Computational Linguistics, 2025\. doi:10\.18653/v1/2025\.acl\-srw\.16\. URL [https://aclanthology\.org/2025\.acl\-srw\.16/](https://aclanthology.org/2025.acl-srw.16/)\.
- Chen et al\. \(2025\) Yuefei Chen, Vivek K\. Singh, Jing Ma, and Ruxiang Tang\. CounterBench: a benchmark for counterfactuals reasoning in large language models, 2025\. URL [https://arxiv\.org/abs/2502\.11008](https://arxiv.org/abs/2502.11008)\.
- Chickering \(2002\) David Maxwell Chickering\. Optimal structure identification with greedy search\. *Journal of Machine Learning Research*, 3:507–554, 2002\.
- DeepSeek \(2026a\) DeepSeek\. Models and Pricing\. [https://api\-docs\.deepseek\.com/quick\_start/pricing](https://api-docs.deepseek.com/quick_start/pricing), 2026a\. DeepSeek API documentation\. Accessed May 4, 2026\.
- DeepSeek \(2026b\) DeepSeek\. Reasoning Model \(deepseek\-reasoner\)\. [https://api\-docs\.deepseek\.com/guides/reasoning\_model](https://api-docs.deepseek.com/guides/reasoning_model), 2026b\. DeepSeek API documentation\. Accessed May 4, 2026\.
- DeepSeek\-AI \(2026a\) DeepSeek\-AI\. DeepSeek V4 Technical Documentation: Model Card\. [https://fe\-static\.deepseek\.com/chat/transparency/deepseek\-V4\-model\-card\-EN\.pdf](https://fe-static.deepseek.com/chat/transparency/deepseek-V4-model-card-EN.pdf), 2026a\. Accessed May 4, 2026\.
- DeepSeek\-AI \(2026b\) DeepSeek\-AI\. DeepSeek\-V4\-Pro\. [https://huggingface\.co/deepseek\-ai/DeepSeek\-V4\-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro), 2026b\. Hugging Face model card\. Accessed May 4, 2026\.
- Glymour et al\. \(2019\) Clark Glymour, Kun Zhang, and Peter Spirtes\. Review of causal discovery methods based on graphical models\. *Frontiers in Genetics*, 10, 2019\. doi:10\.3389/fgene\.2019\.00524\.
- Google \(2026\) Google\. Gemini 3\.1 Pro Preview\. [https://ai\.google\.dev/gemini\-api/docs/models/gemini\-3\.1\-pro\-preview](https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview), 2026\. Google AI for Developers documentation\. Accessed May 4, 2026\.
- Google DeepMind \(2026\) Google DeepMind\. Gemini 3\.1 Pro Model Card\. [https://deepmind\.google/models/model\-cards/gemini\-3\-1\-pro/](https://deepmind.google/models/model-cards/gemini-3-1-pro/), 2026\. Accessed May 4, 2026\.
- Hauser and Bühlmann \(2012\) Alain Hauser and Peter Bühlmann\. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs\. *Journal of Machine Learning Research*, 13:2409–2464, 2012\.
- Jin et al\. \(2023\) Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman\-Weiner, Mrinmaya Sachan, and Bernhard Schölkopf\. CLadder: assessing causal reasoning in language models\. In *Advances in Neural Information Processing Systems*, 2023\. URL [https://openreview\.net/forum?id=e2wtjx0Yqu](https://openreview.net/forum?id=e2wtjx0Yqu)\. Poster\.
- Jin et al\. \(2024\) Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona T\. Diab, and Bernhard Schölkopf\. Corr2Cause: can large language models infer causation from correlation? In *The Twelfth International Conference on Learning Representations*, 2024\. URL [https://openreview\.net/forum?id=vqIH0ObdqL](https://openreview.net/forum?id=vqIH0ObdqL)\. Poster\.
- Liang et al\. \(1998\) Shoudan Liang, Shuwen Fuhrman, and Roland Somogyi\. REVEAL: a general reverse engineering algorithm for inference of genetic network architectures\. In *Pacific Symposium on Biocomputing*, volume 3, pages 18–29, 1998\.
- Miliani et al\. \(2025\) Martina Miliani, Serena Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, and Alessandro Lenci\. ExpliCa: evaluating explicit causal reasoning in large language models\. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 17335–17355\. Association for Computational Linguistics, 2025\. doi:10\.18653/v1/2025\.findings\-acl\.891\. URL [https://aclanthology\.org/2025\.findings\-acl\.891/](https://aclanthology.org/2025.findings-acl.891/)\.
- Moonshot AI \(2026\) Moonshot AI\. Kimi K2 Thinking\. [https://huggingface\.co/moonshotai/Kimi\-K2\-Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking), 2026\. Hugging Face model card\. Accessed May 4, 2026\.
- Muggleton \(1991\) Stephen Muggleton\. Inductive logic programming\. *New Generation Computing*, 8\(4\):295–318, 1991\. doi:10\.1007/BF03037089\.
- Muggleton and De Raedt \(1994\) Stephen H\. Muggleton and Luc De Raedt\. Inductive logic programming: theory and methods\. *Journal of Logic Programming*, 19–20:629–679, 1994\.
- OpenAI \(2026a\) OpenAI\. Introducing GPT\-5\.4\. [https://openai\.com/index/introducing\-gpt\-5\-4/](https://openai.com/index/introducing-gpt-5-4/), 2026a\. Accessed May 4, 2026\.
- OpenAI \(2026b\) OpenAI\. GPT\-5\.4 Model\. [https://developers\.openai\.com/api/docs/models/gpt\-5\.4](https://developers.openai.com/api/docs/models/gpt-5.4), 2026b\. OpenAI API documentation\. Accessed May 4, 2026\.
- OpenRouter \(2025\) OpenRouter\. Kimi K2 Thinking\. [https://openrouter\.ai/moonshotai/kimi\-k2\-thinking](https://openrouter.ai/moonshotai/kimi-k2-thinking), 2025\. Access\-route documentation for the evaluated provider string\. Accessed May 4, 2026\.
- Pearl \(2009\) Judea Pearl\. *Causality: models, reasoning, and inference*\. Cambridge University Press, 2 edition, 2009\.
- Quinlan \(1990\) J\. Ross Quinlan\. Learning logical definitions from relations\. *Machine Learning*, 5\(3\):239–266, 1990\.
- Roemmele et al\. \(2011\) Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S\. Gordon\. COPA: choice of plausible alternatives for commonsense causal reasoning\. In *Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium*\. AAAI Press, 2011\. URL [http://www\.aaai\.org/ocs/index\.php/SSS/SSS11/paper/view/2418](http://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2418)\.
- Scutari \(2010\) Marco Scutari\. Learning Bayesian networks with the bnlearn R package\. *Journal of Statistical Software*, 35\(3\):1–22, 2010\.
- Sheth et al\. \(2025\) Ivaxi Sheth, Bahare Fatemi, and Mario Fritz\. CausalGraph2LLM: evaluating LLMs for causal queries\. In *Findings of the Association for Computational Linguistics: NAACL 2025*, pages 2076–2098\. Association for Computational Linguistics, 2025\. doi:10\.18653/v1/2025\.findings\-naacl\.110\. URL [https://aclanthology\.org/2025\.findings\-naacl\.110/](https://aclanthology.org/2025.findings-naacl.110/)\.
- Shimizu et al\. \(2006\) Shohei Shimizu, Patrik O\. Hoyer, Aapo Hyvärinen, and Antti Kerminen\. A linear non\-Gaussian acyclic model for causal discovery\. *Journal of Machine Learning Research*, 7:2003–2030, 2006\.
- Solar\-Lezama \(2008\) Armando Solar\-Lezama\. *Program synthesis by sketching*\. PhD thesis, University of California, Berkeley, 2008\.
- Spirtes et al\. \(2000\) Peter Spirtes, Clark Glymour, and Richard Scheines\. *Causation, prediction, and search*\. MIT Press, 2 edition, 2000\.
- Tandon et al\. \(2019\) Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut\. WIQA: a dataset for “what if…” reasoning over procedural text\. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing*, pages 6076–6085\. Association for Computational Linguistics, 2019\. doi:10\.18653/v1/D19\-1629\. URL [https://aclanthology\.org/D19\-1629/](https://aclanthology.org/D19-1629/)\.
- Torlak and Bodik \(2014\) Emina Torlak and Rastislav Bodik\. A lightweight symbolic virtual machine for solver\-aided host languages\. *ACM SIGPLAN Notices*, 49\(6\):530–541, 2014\. doi:10\.1145/2666356\.2594340\.
- Wang et al\. \(2026\) Yuzhe Wang, Yaochen Zhu, and Jundong Li\. CausalFlip: a benchmark for LLM causal judgment beyond semantic matching, 2026\. URL [https://arxiv\.org/abs/2602\.20094](https://arxiv.org/abs/2602.20094)\.
- xAI \(2025\) xAI\. Grok 4 Model Card\. [https://data\.x\.ai/2025\-08\-20\-grok\-4\-model\-card\.pdf](https://data.x.ai/2025-08-20-grok-4-model-card.pdf), 2025\. Accessed May 4, 2026\.
- xAI \(2026a\) xAI\. Grok 4 0709\. [https://docs\.x\.ai/docs/models/grok\-4\-0709](https://docs.x.ai/docs/models/grok-4-0709), 2026a\. xAI developer documentation\. Accessed May 4, 2026\.
- xAI \(2026b\) xAI\. Grok 4\.20 0309 Reasoning\. [https://docs\.x\.ai/developers/models/grok\-4\.20\-0309\-reasoning](https://docs.x.ai/developers/models/grok-4.20-0309-reasoning), 2026b\. xAI developer documentation\. Accessed May 4, 2026\.
- Zheng et al\. \(2018\) Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P\. Xing\. DAGs with NO TEARS: continuous optimization for structure learning\. In *Advances in Neural Information Processing Systems*, pages 9472–9483, 2018\.

## Appendix A Evaluator and protocol details

The appendices follow the structure of the paper\. Appendix A specifies the evaluator and model\-call protocol\. Appendix B gives the formal replay definitions, benchmark\-generation specification, and prompt excerpt\. Appendix C breaks failures down into validity, structure, and replay errors\. Appendix D supports the same\-latent disclosure ladder\. Appendix E summarizes robustness and support\-audit results, with detailed construction and support\-audit analyses in Appendix F\. Appendix G analyzes Alternative\-SCM and Hidden\-roots settings, Appendix H specifies the non\-LLM calibration rows, and Appendix I gives illustrative case studies\.

### A\.1 Benchmark inventory and evaluator contract

Benchmark card\.

- Primary and supplementary benchmark inventory: See Table [A\.1](https://arxiv.org/html/2605.08197#A1.T1) for benchmark names, sizes, and relations\.
- Three\-level support\-audit matched subsets: Original matched source problems, Ordered/Hidden\-order \+ Extra Worlds, and Ordered/Hidden\-order \+ Counterexample Audit \(CEx\)\.
- Study size: 1300 scored prompt records: the two full pools plus Block, Hid\-Roots, Alt\-Ord/Alt\-Hid, Ord\-Ext/Hid\-Ext, and Ord\-CEx/Hid\-CEx\. Ord\-Match and Hid\-Match are subsets of the full pools, not additional records\.
- SCM class: Fully observed, finite, binary, acyclic SCMs with observed roots and endogenous variables; revealed structure varies by setting\.
- Mechanism language: Boolean DSL over `not`, `and`, `or`, `xor`, and `iff`; submissions are executable mechanism maps with full replay semantics\.
- Evidence: Multiple training and held\-out intervention worlds per instance; world\-level replay averages exact replay over worlds\.
- Interventions and scoring: Hard interventions clamp targeted variables; non\-intervened endogenous cells are scored by replay\. Roots provide observed per\-row context unless intervened\.
- Design emphasis: High\-precision solver\-checkable evaluation, same\-latent revealed\-structure comparisons, and bounded ambiguity audits; large corpus scale is outside the design goal\.

Table A\.1: Benchmark inventory and naming convention\. Full names are used in explanatory prose; short labels are used in tables, figure axes, and compact references to matched variants\. The benchmark is organized around two 250\-problem full pools \(`Ord-Full` and `Hid-Full`\), a matched 100\-problem same\-latent pool \(`Ord-Match`, `Block`, `Hid-Match`, and `Hid-Roots`\), and derived settings built from that matched pool \(`Alt-Ord`/`Alt-Hid`, `Ord-Ext`/`Hid-Ext`, and `Ord-CEx`/`Hid-CEx`\)\. Extra Worlds uses the same added training worlds across disclosure settings; CEx preserves the same latent SCMs and held\-out worlds, but counterexample training additions may be setting\-specific\. Here `Hid` abbreviates Hidden\-order; Hidden\-roots is always written `Hid-Roots`\.

The evaluator first checks that a submission has the required schema, mentions only observed variables, assigns mechanisms to exactly the required endogenous variables, and induces an acyclic dependency graph\. Each mechanism must parse in the benchmark Boolean DSL and may use only the variables allowed by the task disclosure\. Valid SCM submissions are then replayed by the formal definitions in Appendix [B\.1](https://arxiv.org/html/2605.08197#A2.SS1) on every training and held\-out world\. For the Alternative\-SCM setting, the scorer additionally checks that the proposed alternative fits the training worlds, is semantically distinct from the supplied reference SCM on the bounded signature support, and is separated by the submitted single\-variable intervention and witness\.

##### LLM response handling\.

The prompt asks for exactly one schema\-compatible answer object and nothing else\. Strict one\-line JSON compliance is reported separately\. Replay metrics, however, are computed from the selected candidate object under the fixed extraction and selection policy in Appendix [A\.2](https://arxiv.org/html/2605.08197#A1.SS2); extracted mechanism strings are never rewritten, simplified, or semantically repaired\.

![Refer to caption](https://arxiv.org/html/2605.08197v1/x2.png)

Figure A\.1: Shared evaluator and proposal pipeline\. Systems differ only in how they propose a candidate SCM; every candidate then passes through the same parsing, legality, acyclicity, and replay evaluator\.

### A\.2 LLM evaluation protocol

All LLM rows use a direct\-generation protocol\. For each benchmark item, the rendered prompt is sent to the model without tool access, external computation, evaluator feedback, or iterative repair\. Calls use the decoding settings in Table [A\.2](https://arxiv.org/html/2605.08197#A1.T2)\. For each model/problem pair, the evaluation snapshot contains one selected response\. When a pair was rerun, later calls used the same prompt and decoding settings and stopped once the stored response reached the validity criterion; however, the final snapshot does not store a uniform per\-item retry count\. Validation\-stage rates are computed on the selected stored responses under the response\-handling policy in Table [A\.3](https://arxiv.org/html/2605.08197#A1.T3)\. They are not one\-call compliance rates\. The evaluator parses the selected object and replays it exactly\. It never edits mechanism strings or selects candidates using training or held\-out replay scores\.

Table A\.2: LLM model identifiers and decoding settings\. Dates are taken from logged rows in the result snapshot\. “Not logged” indicates absence from the final result records, and “not set in request” indicates that the provider request omitted the parameter\. Provider/model references are listed immediately below the table\.

Provider/model references\. Table [A\.2](https://arxiv.org/html/2605.08197#A1.T2) is the source of truth for the exact queried model identifiers and run windows\. We cite official provider model cards or API documentation for the corresponding model families and identifiers where available: GPT\-5\.4 \(OpenAI, [2026b](https://arxiv.org/html/2605.08197#bib.bib23), [a](https://arxiv.org/html/2605.08197#bib.bib22)\), Claude Opus 4\.6 \(Anthropic, [2026b](https://arxiv.org/html/2605.08197#bib.bib3), [a](https://arxiv.org/html/2605.08197#bib.bib2)\), DeepSeek V4 Pro \(DeepSeek\-AI, [2026a](https://arxiv.org/html/2605.08197#bib.bib9), [b](https://arxiv.org/html/2605.08197#bib.bib10)\), Gemini 3\.1 Pro \(Google DeepMind, [2026](https://arxiv.org/html/2605.08197#bib.bib13); Google, [2026](https://arxiv.org/html/2605.08197#bib.bib12)\), Grok 4\.20 and Grok 4 \(xAI, [2026b](https://arxiv.org/html/2605.08197#bib.bib38), [2025](https://arxiv.org/html/2605.08197#bib.bib36), [a](https://arxiv.org/html/2605.08197#bib.bib37)\), Kimi K2 Thinking \(Moonshot AI, [2026](https://arxiv.org/html/2605.08197#bib.bib19); OpenRouter, [2025](https://arxiv.org/html/2605.08197#bib.bib24)\), and DeepSeek Reasoner \(DeepSeek, [2026b](https://arxiv.org/html/2605.08197#bib.bib8), [a](https://arxiv.org/html/2605.08197#bib.bib7)\)\.

Table A\.3: Response extraction and final\-answer policy\. The policy distinguishes literal prompt compliance from deterministic extraction of a candidate object for replay evaluation\.

The protocol evaluates direct candidate generation from stored responses\. Repeated invocations, when present in the stored snapshot, were used only to obtain one extractable or executable candidate under the response\-handling rule; they were never used to search over training or held\-out replay scores\.

## Appendix B Formal replay and benchmark generation

### B\.1 Formal replay definitions

##### Latent SCM and intervention worlds\.

Let $\mathcal{V}=\mathcal{R}\sqcup\mathcal{E}$ be the observed variables, partitioned into roots $\mathcal{R}$ and endogenous variables $\mathcal{E}$\. The latent SCM is

$$M^{\star}=(\mathcal{R},\mathcal{E},F^{\star}),\qquad F^{\star}=\{f_{V}^{\star}:V\in\mathcal{E}\},$$

where each $f_{V}^{\star}$ is a Boolean expression in the benchmark DSL and the endogenous dependency graph is acyclic\. A world $w$ specifies a hard intervention target set $I_{w}\subseteq\mathcal{V}$ and row\-level assigned values $a_{w}^{(i)}(V)$ for $V\in I_{w}$\. It contains rows $i\in[n_{w}]$, each with an observed binary assignment $x^{(w,i)}\in\{0,1\}^{\mathcal{V}}$ generated by executing $M^{\star}$ under that intervention\. The problem exposes training worlds $\mathcal{W}_{\mathrm{tr}}$ and withholds held\-out worlds $\mathcal{W}_{\mathrm{ho}}$\.

##### Submitted objects\.

For the primary induction settings \(“Ordered,” “Block\-order,” and “Hidden\-order”\), a submission is an executable mechanism map

$$\widehat{F}=\{\hat{f}_{V}:V\in\widehat{\mathcal{E}}\}$$

in the restricted Boolean DSL over `not`, `and`, `or`, `xor`, and `iff`\. The task disclosure determines the candidate root/endogenous partition except in “Hidden\-roots,” where the submission must also predict the root set and hence the endogenous set\. The evaluator converts the submitted object into a candidate SCM

$$\widehat{M}=(\widehat{\mathcal{R}},\widehat{\mathcal{E}},\widehat{F}),$$

checks schema validity, variable use, DSL parseability, and acyclicity, and then executes $\widehat{M}$\. In the “Alternative\-SCM” setting, the answer object is different: the system receives a valid reference SCM and must return a semantically distinct alternative SCM that fits the training worlds, a separating hard intervention, and a witness assignment\.

##### Semantic replay\.

Replay executes the submitted SCM row by row\. For each world\-row, intervened variables are clamped to their assigned values; non\-intervened roots are copied from the observed row; and non\-intervened endogenous variables are computed in a valid topological order of $\widehat{M}$\. For a valid candidate $\widehat{M}$, replay produces

$$\tilde{x}_{\widehat{M}}^{(w,i)}(V)=\begin{cases}a_{w}^{(i)}(V),&V\in I_{w},\\ x^{(w,i)}(V),&V\in\widehat{\mathcal{R}}\setminus I_{w},\\ \hat{f}_{V}\!\left(\tilde{x}_{\widehat{M}}^{(w,i)}\!\restriction_{\mathrm{dep}(\hat{f}_{V})}\right),&V\in\widehat{\mathcal{E}}\setminus I_{w}.\end{cases}$$

Only non\-intervened endogenous cells are scored\. For a world $w$, define the scored cells

$$S_{w}(\widehat{M})=\{(i,V):i\in[n_{w}],\ V\in\widehat{\mathcal{E}}\setminus I_{w}\}$$

and the world replay indicator

$$E_{w}(\widehat{M})=\mathbf{1}\!\left[\forall(i,V)\in S_{w}(\widehat{M}),\ \tilde{x}_{\widehat{M}}^{(w,i)}(V)=x^{(w,i)}(V)\right].$$

Invalid submissions receive zero on replay metrics\.
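The replay rule and the world indicator can be sketched in Python\. This is a minimal illustration under stated assumptions, not the benchmark evaluator: mechanisms are plain Python callables standing in for parsed DSL expressions, and the names `replay_row`, `world_exact`, and the dictionary\-based row layout are hypothetical\.

```python
from graphlib import TopologicalSorter


def replay_row(roots, mechanisms, deps, observed, assigned):
    """Replay one world-row of a candidate SCM.

    roots:      set of candidate root names
    mechanisms: {V: callable(values) -> 0/1} for each endogenous V
                (hypothetical stand-in for parsed DSL expressions)
    deps:       {V: set of variables read by mechanisms[V]}
    observed:   {V: 0/1}, the observed row
    assigned:   {V: 0/1}, hard-intervention values for this row
    """
    values = dict(assigned)  # intervened variables are clamped
    for r in roots:
        if r not in values:
            values[r] = observed[r]  # non-intervened roots are copied
    # Non-intervened endogenous variables follow a valid topological order.
    for v in TopologicalSorter(deps).static_order():
        if v in mechanisms and v not in values:
            values[v] = int(mechanisms[v](values))
    return values


def world_exact(roots, mechanisms, deps, world):
    """E_w: 1 if every non-intervened endogenous cell replays exactly.

    world is a list of (observed, assigned) pairs, one per row.
    """
    for observed, assigned in world:
        replayed = replay_row(roots, mechanisms, deps, observed, assigned)
        for v in mechanisms:
            if v not in assigned and replayed[v] != observed[v]:
                return 0
    return 1


# Toy candidate (illustrative only): C = A and B, D = C xor A.
mechanisms = {"C": lambda x: x["A"] & x["B"], "D": lambda x: x["C"] ^ x["A"]}
deps = {"C": {"A", "B"}, "D": {"C", "A"}}
row = {"A": 1, "B": 1, "C": 0, "D": 1}  # a row observed under do(C = 0)
print(world_exact({"A", "B"}, mechanisms, deps, [(row, {"C": 0})]))
```

The sketch assumes the candidate has already passed the validity checks; a cyclic `deps` would raise `graphlib.CycleError` instead of being scored as zero\.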

##### Replay metrics\.

Strict training exactness requires exact replay of every scored cell in every training world:

$$\operatorname{TrainExact}(\widehat{M})=\mathbf{1}\!\left[\forall w\in\mathcal{W}_{\mathrm{tr}},\ E_{w}(\widehat{M})=1\right].$$

World\-level training and held\-out replay rates average exact replay over worlds:

$$\mathrm{TrainWorldExact}(\widehat{M})=\frac{1}{|\mathcal{W}_{\mathrm{tr}}|}\sum_{w\in\mathcal{W}_{\mathrm{tr}}}E_{w}(\widehat{M}),\qquad\mathrm{HeldoutWorldExact}(\widehat{M})=\frac{1}{|\mathcal{W}_{\mathrm{ho}}|}\sum_{w\in\mathcal{W}_{\mathrm{ho}}}E_{w}(\widehat{M}).$$

Strict held\-out replay requires both exact training fit and exact replay of all held\-out worlds:

$$\operatorname{HeldoutExact}(\widehat{M})=\operatorname{TrainExact}(\widehat{M})\,\mathbf{1}\!\left[\forall w\in\mathcal{W}_{\mathrm{ho}},\ E_{w}(\widehat{M})=1\right].$$
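As a small worked illustration of these four definitions, the following sketch aggregates per\-world indicators $E_{w}$ into the metrics\. The function name `replay_metrics` and the list\-of\-indicators input are assumptions made for the example; an invalid submission would correspond to all\-zero indicator lists\.

```python
def replay_metrics(train_indicators, heldout_indicators):
    """Aggregate per-world indicators E_w (0 or 1) into the replay metrics."""
    train_exact = int(all(train_indicators))
    heldout_all = int(all(heldout_indicators))
    return {
        "TrainExact": train_exact,
        "TrainWorldExact": sum(train_indicators) / len(train_indicators),
        "HeldoutWorldExact": sum(heldout_indicators) / len(heldout_indicators),
        # Strict held-out replay also requires an exact training fit.
        "HeldoutExact": train_exact * heldout_all,
    }


# One failed training world: the candidate is not train-exact, so HeldoutExact
# is 0 even though every held-out world replays exactly.
metrics = replay_metrics([1, 1, 0, 1], [1, 1])
print(metrics)
```

In this example HeldoutWorldExact is 1\.0 while TrainExact is 0, which shows why the unconditioned per\-world held\-out average can exceed the strict training metric\.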

##### Sanity checks\.

Invalid submissions receive zero on replay metrics\. The aggregate tables must therefore satisfy $\operatorname{TrainExact}\leq\mathrm{TrainWorldExact}$ and $\operatorname{HeldoutExact}\leq\operatorname{TrainExact}$\. When $\operatorname{HeldoutExact}\mid\operatorname{TrainExact}$ is reported and not suppressed, $\operatorname{HeldoutExact}$ equals $\operatorname{TrainExact}$ times $\operatorname{HeldoutExact}\mid\operatorname{TrainExact}$ up to rounding\. $\mathrm{HeldoutWorldExact}$ can exceed $\operatorname{TrainExact}$ because it is an unconditioned per\-world held\-out average\. Conditional rates with denominators 1–5 are suppressed as “\*”; zero\-denominator cases are shown as “–”\.

Retention is the ratio $\mathrm{HeldoutWorldExact}/\mathrm{TrainWorldExact}$ when the denominator is nonzero\. Conditional held\-out metrics compute held\-out replay only when $\operatorname{TrainExact}=1$\. For the primary induction settings, TrainExact asks whether one executable SCM explains the exposed worlds exactly\. HeldoutWorldExact and HeldoutExact are the stricter mechanistic criteria, because they ask whether that same executable object survives new interventions\.

### B\.2 Benchmark generation specification

ReplaySCM instances are generated programmatically\. The generator first samples a small acyclic Boolean SCM, then simulates observational and interventional worlds from that SCM\. Candidate instances are rejected or strengthened until they satisfy support, intervention\-coverage, shortcut\-resistance, and bounded ambiguity checks\. These procedures reduce obvious finite\-sample shortcuts while keeping the task finite: the evidence still does not guarantee identification\.

Table B\.1: Latent SCM and intervention\-world sampling\.

Table B\.2: Construction filters and bounded audit budgets\.

#### B\.2\.1 Latent SCM sampling

Each latent SCM has binary observed variables $X1,\ldots,Xn$ with $n\in\{6,\ldots,10\}$\. Exactly three latent slots are roots, and all remaining variables are endogenous\. Visible labels are randomly permuted over the latent slots, so label order contains no causal\-order information\.

Parent sets are sampled only from earlier latent variables, which enforces acyclicity by construction\. The predecessor window is bounded by MaxPredecessorsPerMechanism, and the parent count is sampled uniformly from the admissible range, using a lower bound of two whenever at least two predecessors are available\. Local mechanisms are sampled as Boolean DSL expressions over the declared parents using `not`, `and`, `or`, `xor`, and `iff`; constants are not allowed\. A mechanism is rejected if any declared parent is semantically inactive or if its truth table is constant\.

Algorithm 1: Sample a latent SCM

Input: size stratum, predecessor bound, operator set

1. Sample n in \{6,\.\.\.,10\}; create n latent slots\.
2. Mark the first three latent slots as roots and the remaining slots as endogenous\.
3. Randomly permute visible labels X1,\.\.\.,Xn over the latent slots\.
4. For each endogenous variable V in latent order:
   - a\. Let C\(V\) be the bounded set of earlier latent variables\.
   - b\. Sample a parent count uniformly from the admissible range\.
   - c\. Sample that many parents uniformly from C\(V\)\.
   - d\. Sample a Boolean DSL expression over those parents\.
   - e\. Reject and resample unless every declared parent is semantically active and the truth table is nonconstant\.
5. Return the acyclic Boolean SCM\.
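Algorithm 1 can be sketched in Python as follows\. This is an illustrative sketch, not the benchmark generator: the nested\-tuple expression encoding, the way `random_expr` builds formulas, the reading of the predecessor window as the most recent earlier slots, and the default values `max_predecessors=4` and `max_parents=3` are all assumptions made for the example\.

```python
import itertools
import random

BINARY = {
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "iff": lambda a, b: 1 - (a ^ b),
}


def evaluate(expr, env):
    """Evaluate a nested-tuple expression on a {name: 0/1} environment."""
    op = expr[0]
    if op == "var":
        return env[expr[1]]
    if op == "not":
        return 1 - evaluate(expr[1], env)
    return BINARY[op](evaluate(expr[1], env), evaluate(expr[2], env))


def truth_table(expr, parents):
    return tuple(evaluate(expr, dict(zip(parents, bits)))
                 for bits in itertools.product((0, 1), repeat=len(parents)))


def is_admissible(expr, parents):
    """Nonconstant truth table and every declared parent semantically active."""
    table = truth_table(expr, parents)
    if len(set(table)) < 2:
        return False
    k = len(parents)
    for j in range(k):
        stride = 1 << (k - 1 - j)  # index offset of flipping parent j
        if not any(table[i] != table[i + stride]
                   for i in range(len(table)) if not i & stride):
            return False
    return True


def random_expr(parents, rng):
    """Random expression whose leaves include every parent at least once."""
    leaves = list(parents) + [rng.choice(parents) for _ in range(rng.randint(0, 2))]
    rng.shuffle(leaves)
    nodes = [("var", p) for p in leaves]
    while len(nodes) > 1:
        right, left = nodes.pop(), nodes.pop()
        node = (rng.choice(sorted(BINARY)), left, right)
        if rng.random() < 0.3:
            node = ("not", node)
        nodes.insert(rng.randrange(len(nodes) + 1), node)
    return nodes[0]


def sample_latent_scm(rng, max_predecessors=4, max_parents=3):
    n = rng.randint(6, 10)
    labels = [f"X{i}" for i in range(1, n + 1)]
    rng.shuffle(labels)  # labels[i] names latent slot i; label order is uninformative
    roots, mechanisms = labels[:3], {}
    for slot in range(3, n):
        window = labels[max(0, slot - max_predecessors):slot]  # assumed C(V)
        low = 2 if len(window) >= 2 else 1
        parents = rng.sample(window, rng.randint(low, min(max_parents, len(window))))
        expr = random_expr(parents, rng)
        while not is_admissible(expr, parents):  # reject and resample
            expr = random_expr(parents, rng)
        mechanisms[labels[slot]] = (parents, expr)
    return roots, mechanisms, labels


roots, mechanisms, labels = sample_latent_scm(random.Random(0))
```

Because parents are drawn only from earlier latent slots, the returned mechanism map is acyclic by construction, matching the argument in the text\.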

#### B\.2\.2 Intervention\-world construction

A world is a table of unit rows under one intervention\. A unit is a row identifier with latent root thresholds\. Ordinary generated worlds contain 10–12 rows\. Unit IDs share latent root thresholds across worlds, but non\-intervened root values can change across worlds because the world\-level environment changes\. Specifically, each unit/root threshold is sampled uniformly from $[0,1]$, each world/root environment level lies in $\{0.2,0.35,0.5,0.65,0.8\}$, and the non\-intervened root value is $1[\mathrm{threshold}<\mathrm{level}]$\.

The intervention mode `none` leaves all variables unaltered\. In `hard_constant`, each target variable is clamped to one uniformly sampled Boolean value for the entire world\. In `hard_assigned`, each target receives row\-level Boolean assignments sampled with bias in $\{0.3,0.5,0.7\}$; all\-equal multi\-row assignments are repaired to include both Boolean values\. Roots and endogenous variables can both be targets, and a world may intervene on multiple variables\.

Simulation clamps intervened variables first\. Non\-intervened roots are generated from unit thresholds and world environments\. Endogenous variables are then evaluated in latent topological order, unless they are themselves intervention targets\. The generator starts from an initial target of eight training worlds; final core records contain 8–11 training worlds after disambiguation additions, plus 8 held\-out worlds\.

Algorithm 2: Simulate an intervention world

Input: latent SCM M, units U, world intervention I

For each unit u in U:

1. For each root R: if R is intervened on, set R to its intervention value; otherwise set R from the unit threshold and world environment\.
2. For each endogenous variable V in latent order: if V is intervened on, set V to its intervention value; otherwise evaluate f\_V on the already assigned parent values\.
3. Record the complete row over observed variables\.

Return the world table and intervention metadata\.
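A minimal Python sketch of Algorithm 2 follows, using the threshold rule from the text \(a non\-intervened root is 1 when its unit threshold is below the world environment level\)\. The function name `simulate_world`, the callable mechanisms, and the dictionary layouts are hypothetical choices for illustration; the toy thresholds are fixed by hand rather than sampled so the behavior is easy to follow\.

```python
def simulate_world(roots, order, mechanisms, thresholds, levels, intervention):
    """Simulate one world table.

    roots:        list of root names
    order:        endogenous names in latent topological order
    mechanisms:   {V: callable(row) -> 0/1} (hypothetical encoding)
    thresholds:   {unit: {root: float in [0, 1]}}, shared across worlds
    levels:       {root: environment level for this world}
    intervention: {V: {unit: 0/1}} row-level values for intervened variables
    """
    table = {}
    for unit in thresholds:
        row = {}
        for r in roots:
            if r in intervention:
                row[r] = intervention[r][unit]
            else:
                row[r] = int(thresholds[unit][r] < levels[r])
        for v in order:
            if v in intervention:
                row[v] = intervention[v][unit]
            else:
                row[v] = int(mechanisms[v](row))
        table[unit] = row
    return table


# Toy SCM (illustrative only): C = A and B, D = not C.
roots = ["A", "B"]
mechanisms = {"C": lambda row: row["A"] & row["B"], "D": lambda row: 1 - row["C"]}
thresholds = {"u0": {"A": 0.1, "B": 0.6}, "u1": {"A": 0.4, "B": 0.3}}
levels = {"A": 0.5, "B": 0.5}

observational = simulate_world(roots, ["C", "D"], mechanisms, thresholds, levels, {})
# A hard_assigned-style world: C receives row-level values with both Booleans present.
do_c = simulate_world(roots, ["C", "D"], mechanisms, thresholds, levels,
                      {"C": {"u0": 1, "u1": 0}})
```

A `hard_constant` world is the special case in which every unit receives the same value for a target\.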

Training worlds are selected and then strengthened by the filters in Section [B\.2\.3](https://arxiv.org/html/2605.08197#A2.SS2.SSS3)\. Eight held\-out worlds are simulated from the same latent SCM under intervention signatures withheld from the prompt\. These held\-out worlds are withheld from the model and used only for evaluation\.

#### B\.2\.3 Filters and bounded ambiguity reduction

Local support and scored exposure\. For an endogenous variable $V$, a local predecessor pattern is an assignment to a bounded subset of variables that precede $V$ in the latent order\. Patterns are counted only on rows where $V$ is not intervened\. The local\-support probes use subset size 3 for nearly all records and subset size 4 for a small number of records\. In stricter generation strata, each endogenous variable must also have 3–4 scored training worlds and 33/40/44 scored cells\.
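One possible reading of local predecessor\-pattern coverage is sketched below\. The aggregation is an assumption, since this section does not spell out the exact formula: for each predecessor subset of the probe size, the sketch takes the fraction of Boolean assignments observed on rows where the variable is not intervened, then averages over subsets\. The name `pattern_coverage` and its arguments are hypothetical\.

```python
from itertools import combinations


def pattern_coverage(rows, targets_per_row, predecessors, variable, subset_size=3):
    """Mean fraction of observed assignments over predecessor subsets.

    rows:            list of {name: 0/1} training rows
    targets_per_row: list of intervention target sets, aligned with rows
    predecessors:    variables preceding `variable` in the latent order
    """
    # Patterns are counted only on rows where the variable is not intervened.
    usable = [row for row, targets in zip(rows, targets_per_row)
              if variable not in targets]
    size = min(subset_size, len(predecessors))
    fractions = []
    for subset in combinations(predecessors, size):
        seen = {tuple(row[p] for p in subset) for row in usable}
        fractions.append(len(seen) / 2 ** size)
    return sum(fractions) / len(fractions)


rows = [
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 1, "C": 0},
    {"A": 1, "B": 1, "C": 1},
]
all_scored = pattern_coverage(rows, [set(), set(), set()], ["A", "B"], "C", subset_size=2)
one_clamped = pattern_coverage(rows, [set(), set(), {"C"}], ["A", "B"], "C", subset_size=2)
```

Under this reading, a value of 1\.0 means that every probed predecessor pattern appears on at least one scored row\.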

Intervention coverage and held\-out balance\. The intervention\-coverage checks require enough direct intervention variation without making variables nearly always clamped\. In stricter strata, records require 3–4 `hard_assigned` worlds, 1–2 `hard_constant` worlds, and at most 4–5 intervened worlds per endogenous variable\. Held\-out target novelty is constrained between lower bounds 0\.20–0\.25 and upper bounds 0\.65–0\.72, depending on the generation stratum\.

Shortcut resistance\. A shortcut is a bounded Boolean DSL formula over an admissible predecessor set that fits the training rows while differing from the intended local mechanism\. The shortcut class uses AST cap 5 and floor 2–3\. Survivor reduction maintains a set of shortcut formulas that fit the training worlds, proposes 170–340 candidate worlds, and uses 8–17 iterations or restarts to add worlds that remove survivors\. Accepted records must reduce the survivor pool by a fraction between 0\.75 and 0\.95, depending on the stratum\.

Targeted disambiguation\. A local semantic alternative is a Boolean DSL mechanism that is consistent with the observed rows but has a different local truth table from the gold mechanism\. The targeted search uses AST slack 2, maximum cap 8, 50,000 states per size, and 2\.5 seconds per variable\. When compact intervention worlds separate many such alternatives, they are added to the training set; accepted records contain 0–3 such added worlds\.

Post\-generation audits and interpretation\. Stronger audits measure remaining ambiguity under larger bounded searches\. The stronger local audit uses AST slack 4, cap 10, 80,000 states per size, and 4 seconds per variable\. The coordinated pair audit uses AST slack 3, cap 9, up to 5 upstream alternatives per variable, and 120 seconds per problem\. These searches do not establish uniqueness\.

Algorithm 3: Accept or strengthen a candidate instance

Input: latent SCM M, initial training worlds W

1. Reject M if any mechanism has inactive parents or a constant truth table\.
2. Simulate candidate training worlds from M\.
3. Check local support, scored exposure, intervention coverage, and held\-out balance\.
4. Enumerate bounded shortcut formulas that fit W\.
5. While shortcut survivors remain above threshold and the world\-addition budget remains:
   - a\. propose candidate intervention worlds;
   - b\. simulate each candidate from M;
   - c\. add the world that rules out the most surviving shortcuts\.
6. Enumerate bounded local semantic alternatives\.
7. Add compact separating worlds when they rule out many alternatives\.
8. Run bounded local and coordinated ambiguity audits\.
9. Accept the instance if all checks pass; otherwise reject or retry\.
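Step 5 of Algorithm 3 is a greedy selection loop, which can be sketched as follows\. This is an assumption\-laden illustration: `greedy_world_additions` is a hypothetical name, candidate worlds are treated as an already simulated pool, and `rules_out(world, formula)` is a caller\-supplied predicate standing in for the check that a shortcut formula no longer fits once the world is added\.

```python
def greedy_world_additions(survivors, candidate_worlds, rules_out, threshold, budget):
    """Repeatedly add the candidate world that eliminates the most survivors.

    survivors:        iterable of shortcut formulas that fit the training worlds
    candidate_worlds: iterable of already simulated candidate worlds
    rules_out:        callable(world, formula) -> bool (hypothetical predicate)
    threshold:        stop once at most this many survivors remain
    budget:           maximum number of worlds to add
    """
    survivors = set(survivors)
    pool = list(candidate_worlds)
    added = []
    while len(survivors) > threshold and len(added) < budget and pool:
        best = max(pool, key=lambda w: sum(rules_out(w, s) for s in survivors))
        removed = {s for s in survivors if rules_out(best, s)}
        if not removed:
            break  # no remaining candidate separates any survivor
        survivors -= removed
        added.append(best)
        pool.remove(best)
    return added, survivors


# Toy data (illustrative only): which survivors each candidate world eliminates.
kills = {"w1": {"s1"}, "w2": {"s1", "s2"}, "w3": {"s3"}}
added, remaining = greedy_world_additions(
    {"s1", "s2", "s3"}, ["w1", "w2", "w3"],
    lambda w, s: s in kills[w], threshold=0, budget=5)
```

The greedy rule does not guarantee a minimum number of added worlds; it only mirrors the "add the world that rules out the most surviving shortcuts" step\.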

#### B\.2\.4 Benchmark variants

Ordered reveals the root/endogenous partition and a full topological order, so mechanisms may reference only earlier variables\. Block\-order reveals the roots and coarse precedence blocks; within\-block order is hidden, but submitted dependencies must remain compatible with block precedence and acyclicity\. Hidden\-order reveals the root/endogenous partition but hides the endogenous order, so the evaluator must check that the submitted dependency graph is acyclic before replay\.
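For Hidden\-order submissions, the acyclicity check amounts to a topological sort of the submitted dependency graph. The sketch below uses the standard-library `graphlib` module; the helper names `referenced_variables` and `replay_order` are hypothetical, and extracting variables with a regular expression is a simplification of real DSL parsing.

```python
import re
from graphlib import TopologicalSorter, CycleError

OPERATORS = {"not", "and", "or", "xor", "iff"}

def referenced_variables(expr):
    """Variable names appearing in a DSL string, ignoring operator keywords."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expr)
    return {tok for tok in tokens if tok not in OPERATORS}

def replay_order(mechanisms, roots):
    """Return an evaluation order for the endogenous variables, or None if the
    submission references an unknown variable or its graph has a cycle."""
    known = set(roots) | set(mechanisms)
    graph = {}
    for var, expr in mechanisms.items():
        parents = referenced_variables(expr)
        if not parents <= known:
            return None
        graph[var] = parents  # node -> predecessors
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None
    return [v for v in order if v in mechanisms]

acyclic = {"X3": "(and X1 X2)", "X4": "(not X3)", "X5": "(xor X3 X4)"}
cyclic = {"X3": "(not X4)", "X4": "(not X3)"}
print(replay_order(acyclic, ["X1", "X2"]))
print(replay_order(cyclic, ["X1", "X2"]))
```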

Hidden\-roots hides the root set and requires the model to predict it\. The predicted root set determines which variables require mechanisms\. Because an incorrect root set can still fit some finite observed worlds, root\-set exactness is reported separately from mechanism replay in the results\.

Matched pools share latent SCMs and worlds, changing only disclosed structure and output schema\. The 100\-problem matched same\-latent subset yields Matched Ordered, Block\-order, Matched Hidden\-order, and Hidden\-roots variants\. The full Ordered and Hidden\-order pools each contain 250 problems, with the matched pool serving as a controlled same\-latent subset\.

#### B\.2\.5 Support\-audit and Alternative\-SCM derived settings

Support\-audit variants\. The Extra Worlds level starts from the 100 matched Ordered/Hidden\-order pairs\. The same added worlds are used for Ordered \+ Extra Worlds and Hidden\-order \+ Extra Worlds, and the held\-out worlds are unchanged\. Candidate additions are observational worlds, single\-variable hard\_constant worlds, and single\-variable hard\_assigned worlds, with held\-out intervention signatures excluded\. The selector adds 3–4 worlds per problem, mean 3\.09\. Mean local predecessor\-pattern coverage rises from 0\.8949 to 0\.9815, and full local coverage rises from 2/100 to 42/100\. The Counterexample Audit \(CEx\) level starts from these Extra Worlds records and adds further gold\-simulated worlds until mean local predecessor\-pattern coverage is 1\.0\. It then tests every semantically distinct LLM or symbolic exact\-search answer found on the Extra Worlds records that fits the training worlds; if the answer still fits after the coverage additions, CEx adds a separating world\. A 50\-seed bnlearn\+DSL sweep is then run for each problem and disclosure setting, and semantically distinct bnlearn\+DSL alternatives that fit the training worlds are separated in the same way\. After these additions, no discovered alternative in the audited pools still fits the training worlds\. Because discovered alternatives can differ by disclosure, CEx may add setting\-specific counterexample worlds while preserving the same latent SCMs and held\-out worlds\. The added worlds increase both the evidence supplied to the model and the prompt length, so CEx does not isolate prompt length from evidence\.
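A coverage statistic of this kind could be computed as in the sketch below. The definition used here is an assumption, since this section does not spell one out: for each endogenous variable, coverage is the fraction of Boolean patterns over its predecessor set that appear in at least one training row where the variable itself is not intervened on, and the reported number is the mean over variables. The function name and data layout are hypothetical.

```python
def local_pattern_coverage(predecessors, worlds):
    """Mean fraction of predecessor-value patterns observed per variable.

    predecessors: dict var -> tuple of predecessor names
    worlds:       list of (intervened_vars, rows); each row maps var -> 0/1
    """
    per_var = {}
    for var, preds in predecessors.items():
        seen = set()
        for intervened, rows in worlds:
            if var in intervened:
                continue  # the mechanism is overridden in this world
            for row in rows:
                seen.add(tuple(row[p] for p in preds))
        per_var[var] = len(seen) / (2 ** len(preds))
    mean = sum(per_var.values()) / len(per_var)
    return mean, per_var

# Hypothetical data: X3 depends on (X1, X2).
predecessors = {"X3": ("X1", "X2")}
original = [(set(), [{"X1": 0, "X2": 1, "X3": 0}, {"X1": 1, "X2": 0, "X3": 0}])]
extra = [(set(), [{"X1": 1, "X2": 1, "X3": 1}, {"X1": 0, "X2": 0, "X3": 0}])]

print(local_pattern_coverage(predecessors, original)[0])          # 0.5
print(local_pattern_coverage(predecessors, original + extra)[0])  # 1.0
```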

Alternative\-SCM\. Alternative\-SCM candidates are valid train\-exact alternatives discovered in paired result pools, with sources including both LLMs and non\-LLM systems\. Semantic signatures deduplicate syntactic rewrites by recording effective parents and local truth\-table behavior\. Retained items must be semantically distinct from the supplied reference SCM and admit a single\-variable separating intervention with a witness\. Leave\-model\-out analyses exclude alternatives sourced from the evaluated model\. Because Alternative\-SCM is built from discovered alternatives that fit the training worlds, it tests local editing from a supplied SCM\. It does not provide an independent proof of latent\-SCM non\-identifiability\.
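One way to realize a semantic signature is to compute, for each mechanism, the variables that actually affect its output and the truth table over just those variables. The sketch below assumes mechanisms are available as Python callables over a dict of variable values; `semantic_signature` is a hypothetical name, and the released evaluator may compute signatures differently.

```python
from itertools import product

def semantic_signature(variables, fn):
    """Return (effective parents, truth table over those parents) for a
    Boolean function `fn` that takes a dict of variable values."""
    assignments = [dict(zip(variables, bits))
                   for bits in product((0, 1), repeat=len(variables))]
    effective = []
    for v in variables:
        # v is effective if flipping it changes the output somewhere.
        if any(fn(a) != fn({**a, v: 1 - a[v]}) for a in assignments):
            effective.append(v)
    table = []
    for bits in product((0, 1), repeat=len(effective)):
        a = {v: 0 for v in variables}  # ineffective variables can take any value
        a.update(zip(effective, bits))
        table.append(int(fn(a)))
    return tuple(effective), tuple(table)

variables = ("X1", "X2", "X3")
conj = lambda a: a["X1"] and a["X2"]
de_morgan = lambda a: not (not a["X1"] or not a["X2"])           # same function
padded = lambda a: (a["X1"] and a["X2"]) or (a["X3"] and not a["X3"])  # X3 inactive
disj = lambda a: a["X1"] or a["X2"]

print(semantic_signature(variables, conj))
print(semantic_signature(variables, conj) == semantic_signature(variables, de_morgan))
print(semantic_signature(variables, conj) == semantic_signature(variables, disj))
```

Syntactic rewrites and formulas with inactive parents collapse to the same signature, while a semantically distinct mechanism does not.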

#### B\.2\.6 Snapshot and artifact

All experiments use one frozen benchmark snapshot\. The released artifact will include the generator, configuration manifest, fixed benchmark records, rendered prompts, evaluator, and audit summaries\. The manifest records the exact pool names, seeds, and configuration files needed to reproduce the snapshot\. If a hidden\-evaluation split is used, the public artifact will distinguish public prompt records from evaluator\-only held\-out data\.

The filters and audits use bounded searches\. Passing them means that no alternative was found within the specified candidate classes and budgets, and that the released worlds satisfy the stated support and shortcut\-resistance criteria\. ReplaySCM therefore evaluates executable replay from finite interventional evidence; it does not guarantee identification of a unique latent SCM\.

### B\.3 Prompt excerpt

All systems receive a structured prompt record containing task metadata, the DSL grammar, intervention\-world tables, and the required JSON schema\. The excerpt below shows the common input contract for an Ordered item\. Other task variants preserve the same serialization style while changing the disclosed structural fields and output schema\. The released artifact contains the full rendered prompts for every benchmark item\.

##### System instruction excerpt\.

You are solving a formal causal mechanism induction task
over finite interventional worlds\.
Treat the input as machine\-checkable structure\.
Output exactly one JSON object matching the required schema\.

##### Task metadata excerpt\.

Task: CIND\_A\_SCM
ObservedVariables: X1, X2, X3, X4, X5
RootVariables: X1, X2
EndogenousVariables: X3, X4, X5
TopologicalOrder: X1, X2, X3, X4, X5
AllowedOperators: not, and, or, xor, iff
InterventionModes: none, hard\_constant, hard\_assigned

##### Scoring excerpt\.

For each world and row, intervened variables are clamped to their intervention values, non\-intervened roots are copied from the row, and non\-intervened endogenous variables are evaluated in topological order from the submitted mechanisms\. Only non\-intervened endogenous cells are scored\. The same mechanism map is replayed on all training and held\-out worlds\.
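The replay rule above can be sketched in a few lines. This is an illustrative reading, not the released evaluator: mechanisms are Python callables rather than parsed DSL strings, the mechanisms and row are hypothetical, and an intervention value of `None` is used here to mean "copy the assigned value from the row", which is an assumption about how `hard_assigned` worlds are serialized.

```python
def replay_world(mechanisms, order, roots, interventions, rows):
    """Replay one world; return (matched, scored) over non-intervened
    endogenous cells.

    mechanisms:    dict var -> function(values dict) -> 0/1
    order:         endogenous variables in topological order
    interventions: dict var -> clamped value (None: take the value from the row)
    """
    matched = scored = 0
    for row in rows:
        values = {}
        for var in list(roots) + list(order):
            if var in interventions:
                clamp = interventions[var]
                values[var] = row[var] if clamp is None else clamp
            elif var in roots:
                values[var] = row[var]       # non-intervened roots are copied
            else:
                values[var] = int(mechanisms[var](values))
                scored += 1                  # only these cells are scored
                matched += values[var] == row[var]
    return matched, scored

# Hypothetical submission and one hard_constant world with X3 clamped to 1.
submitted = {
    "X3": lambda v: v["X1"] and v["X2"],
    "X4": lambda v: not v["X3"],
    "X5": lambda v: v["X3"] != v["X4"],
}
rows = [{"X1": 0, "X2": 1, "X3": 1, "X4": 0, "X5": 1}]
print(replay_world(submitted, ["X3", "X4", "X5"], ["X1", "X2"], {"X3": 1}, rows))
```

Because X3 is clamped, its submitted mechanism is never evaluated in this world, and only the X4 and X5 cells contribute to the score.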

##### DSL excerpt\.

expr ::= VAR
  \| \(not expr\)
  \| \(and expr expr \.\.\.\)
  \| \(or expr expr \.\.\.\)
  \| \(xor expr expr \.\.\.\)
  \| \(iff expr expr \.\.\.\)

Constants are disallowed\. Operators and variable names must match the metadata exactly\.
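A grammar of this shape is straightforward to parse and evaluate. The sketch below is a minimal illustration, not the benchmark's parser. The n-ary semantics are assumptions, because the excerpt does not define them: `xor` is taken as odd parity and `iff` as "all arguments equal". The names `parse` and `evaluate` are hypothetical.

```python
import re

def parse(text):
    """Parse a DSL string into nested lists, e.g. ['and', 'X1', ['not', 'X2']]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            pos += 1  # consume ')'
            return items
        if tok == ")":
            raise ValueError("unexpected ')'")
        return tok

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return tree

def evaluate(tree, values):
    """Evaluate a parsed expression on a dict var -> 0/1."""
    if isinstance(tree, str):
        return values[tree]  # KeyError for unknown names, including constants
    if not tree:
        raise ValueError("empty expression")
    op, args = tree[0], [evaluate(t, values) for t in tree[1:]]
    if op == "not":
        if len(args) != 1:
            raise ValueError("not takes exactly one argument")
        return 1 - args[0]
    if len(args) < 2:
        raise ValueError("operator needs at least two arguments")
    if op == "and":
        return int(all(args))
    if op == "or":
        return int(any(args))
    if op == "xor":
        return sum(args) % 2            # assumption: odd parity
    if op == "iff":
        return int(len(set(args)) == 1)  # assumption: all arguments equal
    raise ValueError(f"unknown operator {op!r}")

print(parse("(and X1 (not X2))"))
print(evaluate(parse("(xor X1 (not X2))"), {"X1": 1, "X2": 1}))
```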

##### Training\-world excerpt\.

WorldId: train\_00
InterventionMode: none
InterventionTargetsAssigned: \[\]
InterventionTargetsConstant: \{\}
Rows:
\- u00: X1=0 X2=1 X3=1 X4=0 X5=1
\- u01: X1=1 X2=0 X3=0 X4=1 X5=0

WorldId: train\_01
InterventionMode: hard\_constant
InterventionTargetsConstant: \{"X3": 1\}
Rows:
\- u00: X1=0 X2=1 X3=1 X4=1 X5=0
\- u01: X1=1 X2=0 X3=1 X4=0 X5=1

##### Output schema excerpt\.

\{"mechanisms":\{"X3":"\.\.\.","X4":"\.\.\.","X5":"\.\.\."\}\}

The literal prompt also includes validity conditions, a formatting\-only example, and the complete set of training worlds for the instance\.
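A strict reading of the Ordered output schema can be checked as sketched below. The exact validity conditions live in the full prompt and are not reproduced here, so the requirements encoded in this check (a single top-level `mechanisms` key, exactly one string per endogenous variable) are assumptions, and `load_mechanism_map` is a hypothetical helper.

```python
import json

def load_mechanism_map(raw, endogenous):
    """Parse a response string and return the mechanism map, or raise
    ValueError if it does not match the assumed schema."""
    obj = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(obj, dict) or set(obj) != {"mechanisms"}:
        raise ValueError("expected a single top-level 'mechanisms' key")
    mechanisms = obj["mechanisms"]
    if not isinstance(mechanisms, dict) or set(mechanisms) != set(endogenous):
        raise ValueError("mechanism keys must match the endogenous variables")
    if not all(isinstance(expr, str) for expr in mechanisms.values()):
        raise ValueError("every mechanism must be a DSL string")
    return mechanisms

raw = '{"mechanisms": {"X3": "(and X1 X2)", "X4": "(not X3)", "X5": "(xor X3 X4)"}}'
print(load_mechanism_map(raw, ["X3", "X4", "X5"]))
```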

## Appendix C Failure analysis support

Appendix C separates failures in validity, semantic parent structure, and held\-out replay among train\-exact responses\. Across Tables C\.1–C\.4, frontier models often infer some dependencies before producing a fully correct executable mechanism\.

Table C\.1: Validation\-stage breakdown\. The columns separate successive evaluator stages before final executable validity\. Mechanism strings are never semantically repaired\. Strict JSON, Extracted JSON, and the later validation stages follow the response\-handling policy in Appendix [A\.2](https://arxiv.org/html/2605.08197#A1.SS2)\. Appendix [A\.2](https://arxiv.org/html/2605.08197#A1.SS2) reports selected stored responses instead of retry\-normalized one\-call attempts, so Strict JSON and Valid should be interpreted under that stored\-response policy\.

| Setting | Model | Strict JSON | Extracted JSON | Schema | Keys | Parse | Legal | Acyclic | Valid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ord\-Full | GPT\-5\.4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-Full | Opus 4\.6 | 0\.944 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 |
| Ord\-Full | DeepSeek4Pro | 0\.720 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-Full | Gemini3\.1 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-Full | Grok 4\.20 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-Full | Grok 4 | 0\.984 | 0\.984 | 0\.984 | 0\.984 | 0\.980 | 0\.980 | 0\.980 | 0\.980 |
| Ord\-Full | Grok 4\.3 | 0\.988 | 0\.988 | 0\.988 | 0\.988 | 0\.988 | 0\.988 | 0\.988 | 0\.988 |
| Ord\-Full | KimiK2t | 0\.052 | 0\.976 | 0\.976 | 0\.976 | 0\.976 | 0\.972 | 0\.972 | 0\.972 |
| Ord\-Full | DSReasoner | 0\.152 | 0\.988 | 0\.988 | 0\.988 | 0\.988 | 0\.988 | 0\.988 | 0\.988 |
| Block | GPT\-5\.4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Block | Opus 4\.6 | 0\.600 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 |
| Block | DeepSeek4Pro | 0\.440 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Block | Gemini3\.1 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Block | Grok 4\.20 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Block | Grok 4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.990 | 0\.990 |
| Block | Grok 4\.3 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.970 | 0\.970 |
| Block | KimiK2t | 0\.050 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.980 | 0\.980 |
| Block | DSReasoner | 0\.160 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.990 | 0\.990 |
| Hid\-Full | GPT\-5\.4 | 0\.992 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-Full | Opus 4\.6 | 0\.760 | 0\.972 | 0\.972 | 0\.972 | 0\.972 | 0\.972 | 0\.972 | 0\.972 |
| Hid\-Full | DeepSeek4Pro | 0\.788 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.980 | 0\.980 |
| Hid\-Full | Gemini3\.1 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-Full | Grok 4\.20 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.984 | 0\.984 |
| Hid\-Full | Grok 4 | 0\.996 | 0\.996 | 0\.996 | 0\.996 | 0\.996 | 0\.992 | 0\.976 | 0\.976 |
| Hid\-Full | Grok 4\.3 | 0\.984 | 0\.984 | 0\.984 | 0\.984 | 0\.984 | 0\.984 | 0\.972 | 0\.972 |
| Hid\-Full | KimiK2t | 0\.072 | 0\.992 | 0\.992 | 0\.992 | 0\.992 | 0\.992 | 0\.984 | 0\.984 |
| Hid\-Full | DSReasoner | 0\.116 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.992 | 0\.992 |
| Hid\-Roots | GPT\-5\.4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.990 | 0\.990 |
| Hid\-Roots | Opus 4\.6 | 0\.590 | 0\.930 | 0\.930 | 0\.930 | 0\.930 | 0\.930 | 0\.930 | 0\.930 |
| Hid\-Roots | DeepSeek4Pro | 0\.730 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.960 | 0\.960 |
| Hid\-Roots | Gemini3\.1 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-Roots | Grok 4\.20 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.990 | 0\.990 |
| Hid\-Roots | Grok 4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-Roots | Grok 4\.3 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.970 | 0\.970 |
| Hid\-Roots | KimiK2t | 0\.020 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.980 | 0\.980 |
| Hid\-Roots | DSReasoner | 0\.160 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Ord | GPT\-5\.4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Ord | Opus 4\.6 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Ord | DeepSeek4Pro | 0\.170 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Ord | Gemini3\.1 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Ord | Grok 4\.20 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Ord | Grok 4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Ord | Grok 4\.3 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Ord | KimiK2t | 0\.050 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Ord | DSReasoner | 0\.190 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Hid | GPT\-5\.4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Hid | Opus 4\.6 | 0\.990 | 1\.000 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 |
| Alt\-Hid | DeepSeek4Pro | 0\.210 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Hid | Gemini3\.1 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Hid | Grok 4\.20 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Hid | Grok 4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Hid | Grok 4\.3 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Alt\-Hid | KimiK2t | 0\.010 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 |
| Alt\-Hid | DSReasoner | 0\.160 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-Ext | GPT\-5\.4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-Ext | Opus 4\.6 | 0\.950 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 |
| Ord\-Ext | DeepSeek4Pro | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-Ext | Gemini3\.1 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-Ext | Grok 4\.20 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-Ext | Grok 4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.990 | 0\.990 | 0\.990 |
| Ord\-Ext | Grok 4\.3 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-Ext | KimiK2t | 0\.040 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 |
| Ord\-Ext | DSReasoner | 0\.070 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-Ext | GPT\-5\.4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.990 | 0\.990 |
| Hid\-Ext | Opus 4\.6 | 0\.730 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 |
| Hid\-Ext | DeepSeek4Pro | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-Ext | Gemini3\.1 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-Ext | Grok 4\.20 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.970 | 0\.970 |
| Hid\-Ext | Grok 4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.990 | 0\.990 | 0\.990 | 0\.990 |
| Hid\-Ext | Grok 4\.3 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 |
| Hid\-Ext | KimiK2t | 0\.020 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.970 | 0\.970 |
| Hid\-Ext | DSReasoner | 0\.100 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.990 | 0\.990 |
| Ord\-CEx | GPT\-5\.4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-CEx | Opus 4\.6 | 0\.940 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 |
| Ord\-CEx | DeepSeek4Pro | 0\.600 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-CEx | Gemini3\.1 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-CEx | Grok 4\.20 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Ord\-CEx | Grok 4 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 |
| Ord\-CEx | Grok 4\.3 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 | 0\.980 |
| Ord\-CEx | KimiK2t | 0\.080 | 0\.970 | 0\.970 | 0\.970 | 0\.970 | 0\.960 | 0\.960 | 0\.960 |
| Ord\-CEx | DSReasoner | 0\.160 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-CEx | GPT\-5\.4 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-CEx | Opus 4\.6 | 0\.670 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.950 |
| Hid\-CEx | DeepSeek4Pro | 0\.800 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-CEx | Gemini3\.1 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-CEx | Grok 4\.20 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 |
| Hid\-CEx | Grok 4 | 0\.950 | 0\.950 | 0\.950 | 0\.950 | 0\.940 | 0\.940 | 0\.930 | 0\.930 |
| Hid\-CEx | Grok 4\.3 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.990 | 0\.980 | 0\.980 |
| Hid\-CEx | KimiK2t | 0\.060 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.980 | 0\.980 |
| Hid\-CEx | DSReasoner | 0\.350 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 1\.000 | 0\.990 | 0\.990 |

For the strongest frontier models, failures at the wrapper\-text or schema stages are already uncommon\. Deterministic extraction absorbs most wrapper variation without altering mechanism formulas, so the consequential losses arise later in the evaluator, after a candidate executable object has already been formed \(Table C\.1\)\.

Table C\.2: Semantic parent\-structure details\. Parent metrics ignore formula spelling and depend only on semantic functional\-parent structure\. For Hidden\-roots, structural scoring is conditioned on exact root\-set prediction, so the n column counts root\-exact executable submissions; for other benchmarks it counts executable submissions\. This is why Hidden\-roots structural counts can be much lower than Valid in Table C\.3\.

| Setting | Model | n \(structure scored\) | Recall \| scored | Parent F1 \| scored | Per\-var\. parent exact \| scored | Parent map exact \| scored | Mean local match \| scored | Full local match \| scored | TrainExact \| map exact |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ord\-Full | GPT\-5\.4 | 250 | 0\.956 | 0\.945 | 0\.798 | 0\.388 | 0\.715 | 0\.272 | 0\.835 |
| Ord\-Full | Opus 4\.6 | 245 | 0\.958 | 0\.944 | 0\.793 | 0\.335 | 0\.717 | 0\.220 | 0\.854 |
| Ord\-Full | DeepSeek4Pro | 250 | 0\.906 | 0\.880 | 0\.645 | 0\.168 | 0\.578 | 0\.136 | 0\.976 |
| Ord\-Full | Gemini3\.1 | 250 | 0\.914 | 0\.919 | 0\.764 | 0\.332 | 0\.686 | 0\.236 | 0\.807 |
| Ord\-Full | Grok 4\.20 | 250 | 0\.904 | 0\.894 | 0\.672 | 0\.180 | 0\.606 | 0\.144 | 0\.889 |
| Ord\-Full | Grok 4 | 245 | 0\.917 | 0\.881 | 0\.638 | 0\.147 | 0\.583 | 0\.118 | 0\.917 |
| Ord\-Full | Grok 4\.3 | 247 | 0\.750 | 0\.764 | 0\.473 | 0\.073 | 0\.419 | 0\.057 | 0\.944 |
| Ord\-Full | KimiK2t | 243 | 0\.746 | 0\.722 | 0\.393 | 0\.029 | 0\.321 | 0\.021 | 0\.857 |
| Ord\-Full | DSReasoner | 247 | 0\.638 | 0\.683 | 0\.354 | 0\.036 | 0\.310 | 0\.036 | 1\.000 |
| Block | GPT\-5\.4 | 100 | 0\.958 | 0\.952 | 0\.838 | 0\.460 | 0\.778 | 0\.360 | 0\.891 |
| Block | Opus 4\.6 | 95 | 0\.927 | 0\.917 | 0\.773 | 0\.316 | 0\.704 | 0\.221 | 0\.800 |
| Block | DeepSeek4Pro | 100 | 0\.822 | 0\.820 | 0\.625 | 0\.170 | 0\.587 | 0\.150 | 0\.941 |
| Block | Gemini3\.1 | 100 | 0\.878 | 0\.889 | 0\.717 | 0\.260 | 0\.669 | 0\.220 | 0\.846 |
| Block | Grok 4\.20 | 100 | 0\.849 | 0\.860 | 0\.677 | 0\.150 | 0\.631 | 0\.120 | 0\.800 |
| Block | Grok 4 | 99 | 0\.837 | 0\.830 | 0\.615 | 0\.121 | 0\.567 | 0\.071 | 0\.667 |
| Block | Grok 4\.3 | 97 | 0\.654 | 0\.680 | 0\.425 | 0\.051 | 0\.398 | 0\.051 | \* |
| Block | KimiK2t | 98 | 0\.637 | 0\.633 | 0\.334 | 0\.020 | 0\.272 | 0\.010 | \* |
| Block | DSReasoner | 99 | 0\.605 | 0\.649 | 0\.355 | 0\.010 | 0\.341 | 0\.010 | \* |
| Hid\-Full | GPT\-5\.4 | 250 | 0\.903 | 0\.891 | 0\.742 | 0\.280 | 0\.689 | 0\.192 | 0\.957 |
| Hid\-Full | Opus 4\.6 | 243 | 0\.852 | 0\.844 | 0\.658 | 0\.189 | 0\.611 | 0\.123 | 0\.913 |
| Hid\-Full | DeepSeek4Pro | 245 | 0\.745 | 0\.740 | 0\.495 | 0\.086 | 0\.463 | 0\.065 | 1\.000 |
| Hid\-Full | Gemini3\.1 | 250 | 0\.786 | 0\.805 | 0\.619 | 0\.156 | 0\.571 | 0\.100 | 0\.795 |
| Hid\-Full | Grok 4\.20 | 246 | 0\.768 | 0\.774 | 0\.560 | 0\.122 | 0\.524 | 0\.069 | 0\.733 |
| Hid\-Full | Grok 4 | 244 | 0\.770 | 0\.752 | 0\.514 | 0\.066 | 0\.476 | 0\.049 | 1\.000 |
| Hid\-Full | Grok 4\.3 | 243 | 0\.520 | 0\.543 | 0\.285 | 0\.029 | 0\.258 | 0\.012 | 0\.857 |
| Hid\-Full | KimiK2t | 246 | 0\.457 | 0\.456 | 0\.153 | 0\.000 | 0\.119 | 0\.000 | – |
| Hid\-Full | DSReasoner | 248 | 0\.351 | 0\.391 | 0\.123 | 0\.000 | 0\.107 | 0\.000 | – |
| Hid\-Roots | GPT\-5\.4 | 24 | 0\.676 | 0\.690 | 0\.494 | 0\.083 | 0\.469 | 0\.083 | \* |
| Hid\-Roots | Opus 4\.6 | 24 | 0\.651 | 0\.653 | 0\.440 | 0\.000 | 0\.394 | 0\.000 | – |
| Hid\-Roots | DeepSeek4Pro | 34 | 0\.573 | 0\.589 | 0\.388 | 0\.029 | 0\.362 | 0\.029 | \* |
| Hid\-Roots | Gemini3\.1 | 34 | 0\.632 | 0\.661 | 0\.479 | 0\.118 | 0\.458 | 0\.088 | \* |
| Hid\-Roots | Grok 4\.20 | 39 | 0\.630 | 0\.657 | 0\.434 | 0\.128 | 0\.420 | 0\.128 | \* |
| Hid\-Roots | Grok 4 | 30 | 0\.562 | 0\.565 | 0\.316 | 0\.100 | 0\.316 | 0\.100 | \* |
| Hid\-Roots | Grok 4\.3 | 17 | 0\.490 | 0\.503 | 0\.302 | 0\.176 | 0\.269 | 0\.176 | \* |
| Hid\-Roots | KimiK2t | 6 | 0\.329 | 0\.346 | 0\.167 | 0\.000 | 0\.133 | 0\.000 | – |
| Hid\-Roots | DSReasoner | 4 | \* | \* | \* | \* | \* | \* | – |
| Ord\-Ext | GPT\-5\.4 | 100 | 0\.979 | 0\.975 | 0\.903 | 0\.610 | 0\.866 | 0\.500 | 0\.836 |
| Ord\-Ext | Opus 4\.6 | 98 | 0\.953 | 0\.944 | 0\.838 | 0\.418 | 0\.785 | 0\.306 | 0\.756 |
| Ord\-Ext | DeepSeek4Pro | 100 | 0\.898 | 0\.873 | 0\.644 | 0\.110 | 0\.619 | 0\.080 | 0\.727 |
| Ord\-Ext | Gemini3\.1 | 100 | 0\.915 | 0\.923 | 0\.784 | 0\.360 | 0\.746 | 0\.290 | 0\.806 |
| Ord\-Ext | Grok 4\.20 | 100 | 0\.901 | 0\.898 | 0\.737 | 0\.270 | 0\.704 | 0\.200 | 0\.741 |
| Ord\-Ext | Grok 4 | 99 | 0\.855 | 0\.839 | 0\.612 | 0\.162 | 0\.580 | 0\.111 | 0\.688 |
| Ord\-Ext | Grok 4\.3 | 100 | 0\.682 | 0\.714 | 0\.416 | 0\.020 | 0\.388 | 0\.020 | \* |
| Ord\-Ext | KimiK2t | 95 | 0\.665 | 0\.675 | 0\.393 | 0\.021 | 0\.341 | 0\.021 | \* |
| Ord\-Ext | DSReasoner | 100 | 0\.641 | 0\.686 | 0\.376 | 0\.020 | 0\.347 | 0\.020 | \* |
| Hid\-Ext | GPT\-5\.4 | 99 | 0\.918 | 0\.920 | 0\.793 | 0\.313 | 0\.774 | 0\.263 | 0\.839 |
| Hid\-Ext | Opus 4\.6 | 95 | 0\.871 | 0\.874 | 0\.708 | 0\.242 | 0\.682 | 0\.210 | 0\.870 |
| Hid\-Ext | DeepSeek4Pro | 100 | 0\.719 | 0\.728 | 0\.531 | 0\.120 | 0\.529 | 0\.120 | 1\.000 |
| Hid\-Ext | Gemini3\.1 | 100 | 0\.820 | 0\.846 | 0\.662 | 0\.190 | 0\.629 | 0\.160 | 0\.842 |
| Hid\-Ext | Grok 4\.20 | 97 | 0\.774 | 0\.790 | 0\.598 | 0\.134 | 0\.577 | 0\.103 | 0\.769 |
| Hid\-Ext | Grok 4 | 99 | 0\.708 | 0\.717 | 0\.509 | 0\.091 | 0\.491 | 0\.081 | 0\.889 |
| Hid\-Ext | Grok 4\.3 | 96 | 0\.512 | 0\.541 | 0\.298 | 0\.052 | 0\.286 | 0\.052 | \* |
| Hid\-Ext | KimiK2t | 97 | 0\.466 | 0\.463 | 0\.161 | 0\.000 | 0\.133 | 0\.000 | – |
| Hid\-Ext | DSReasoner | 99 | 0\.373 | 0\.403 | 0\.155 | 0\.000 | 0\.142 | 0\.000 | – |

Semantic parent\-structure estimates are much stronger than exact parent\-map or local\-semantic matching\. This gap is especially clear on Hidden\-order, where edge\-level dependency estimates remain informative while exact structural and mechanism matching are much lower \(Table C\.2\)\.

Table C\.3: Structural\-distance view of parent structure\. Valid is the fraction of benchmark items for which the model’s selected final answer is an executable SCM: schema\-compatible, parseable in the DSL, legal under the disclosure rules, and acyclic\. The denominator is all benchmark items for that model and benchmark under the selected stored\-response policy in Appendix [A\.2](https://arxiv.org/html/2605.08197#A1.SS2); missing or non\-scorable final answers count as non\-valid\. Parent SHD is directed semantic functional\-parent structural Hamming distance; lower values are better\. For Hidden\-roots, parent and local\-semantic columns are additionally conditioned on exact root\-set prediction\.

| Setting | Model | Valid | Parent F1 \| scored | Parent SHD ↓ \| scored | Parent map exact \| scored | Mean local match \| scored | Full local match \| scored | TrainExact | HeldoutWorld |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Block | GPT\-5\.4 | 1\.000 | 0\.952 | 1\.63 | 0\.460 | 0\.778 | 0\.360 | 0\.600 | 0\.774 |
| Block | Opus 4\.6 | 0\.950 | 0\.917 | 2\.81 | 0\.316 | 0\.704 | 0\.221 | 0\.420 | 0\.671 |
| Block | DeepSeek4Pro | 1\.000 | 0\.820 | 5\.09 | 0\.170 | 0\.587 | 0\.150 | 0\.350 | 0\.410 |
| Block | Gemini3\.1 | 1\.000 | 0\.889 | 3\.46 | 0\.260 | 0\.669 | 0\.220 | 0\.280 | 0\.559 |
| Block | Grok 4\.20 | 1\.000 | 0\.860 | 4\.38 | 0\.150 | 0\.631 | 0\.120 | 0\.150 | 0\.455 |
| Block | Grok 4 | 0\.990 | 0\.830 | 5\.25 | 0\.121 | 0\.567 | 0\.071 | 0\.240 | 0\.437 |
| Block | Grok 4\.3 | 0\.950 | 0\.680 | 8\.14 | 0\.053 | 0\.399 | 0\.053 | 0\.110 | 0\.210 |
| Block | KimiK2t | 0\.980 | 0\.633 | 10\.16 | 0\.020 | 0\.272 | 0\.010 | 0\.010 | 0\.079 |
| Block | DSReasoner | 0\.990 | 0\.649 | 9\.12 | 0\.010 | 0\.341 | 0\.010 | 0\.010 | 0\.090 |
| Hid\-Roots | GPT\-5\.4 | 0\.990 | 0\.690 | 7\.83 | 0\.083 | 0\.469 | 0\.083 | 0\.080 | 0\.120 |
| Hid\-Roots | Opus 4\.6 | 0\.930 | 0\.653 | 9\.21 | 0\.000 | 0\.394 | 0\.000 | 0\.020 | 0\.138 |
| Hid\-Roots | DeepSeek4Pro | 0\.960 | 0\.589 | 10\.50 | 0\.029 | 0\.362 | 0\.029 | 0\.020 | 0\.081 |
| Hid\-Roots | Gemini3\.1 | 1\.000 | 0\.661 | 8\.44 | 0\.118 | 0\.458 | 0\.088 | 0\.040 | 0\.116 |
| Hid\-Roots | Grok 4\.20 | 0\.990 | 0\.657 | 9\.28 | 0\.128 | 0\.420 | 0\.128 | 0\.060 | 0\.143 |
| Hid\-Roots | Grok 4 | 1\.000 | 0\.565 | 12\.60 | 0\.100 | 0\.316 | 0\.100 | 0\.050 | 0\.104 |
| Hid\-Roots | Grok 4\.3 | 0\.960 | 0\.503 | 11\.47 | 0\.176 | 0\.269 | 0\.176 | 0\.030 | 0\.060 |
| Hid\-Roots | KimiK2t | 0\.980 | 0\.346 | 15\.67 | 0\.000 | 0\.133 | 0\.000 | 0\.000 | 0\.014 |
| Hid\-Roots | DSReasoner | 1\.000 | \* | \* | \* | \* | \* | 0\.000 | 0\.001 |

The same contrast appears in structural\-distance form\. Parent SHD remains far lower than chance\-level structure would suggest, even when exact parent\-map matching is still weak\. Frontier models therefore often infer substantial dependency information without producing the full exact executable mechanism \(Table C\.3\)\.

To avoid overinterpreting unstable conditional estimates, the appendix suppresses any defined rate whose denominator is between 1 and 5 inclusive and marks it with \*; only zero\-denominator cases are shown as –\.
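The suppression rule can be written as a small formatting helper. The sketch below is illustrative; the function name `format_rate`, the three-decimal format, and the example counts are assumptions chosen to mirror the tables' appearance.

```python
def format_rate(numerator, denominator):
    """Format a conditional rate under the appendix's suppression rule."""
    if denominator == 0:
        return "–"   # undefined rate
    if 1 <= denominator <= 5:
        return "*"   # defined but suppressed as unstable
    return f"{numerator / denominator:.3f}"

print(format_rate(2, 8))   # 0.250
print(format_rate(1, 4))   # *
print(format_rate(0, 0))   # –
```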

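Table C\.4: Held\-out replay among train\-exact responses\. These conditional rates distinguish submissions that fail to fit all training worlds from held\-out replay errors among responses that are already TrainExact\.

| Setting | Model | TrainExact | HeldoutWorld | HeldoutWorld \| TrainExact | HeldoutExact \| TrainExact | n \(train\-exact solutions\) |
| --- | --- | --- | --- | --- | --- | --- |
| Ord\-Full | GPT\-5\.4 | 0\.612 | 0\.731 | 0\.849 | 0\.562 | 153 |
| Ord\-Full | Opus 4\.6 | 0\.640 | 0\.725 | 0\.816 | 0\.456 | 160 |
| Ord\-Full | DeepSeek4Pro | 0\.360 | 0\.518 | 0\.783 | 0\.422 | 90 |
| Ord\-Full | Gemini3\.1 | 0\.392 | 0\.640 | 0\.885 | 0\.663 | 98 |
| Ord\-Full | Grok 4\.20 | 0\.332 | 0\.532 | 0\.825 | 0\.494 | 83 |
| Ord\-Full | Grok 4 | 0\.296 | 0\.519 | 0\.818 | 0\.500 | 74 |
| Ord\-Full | Grok 4\.3 | 0\.120 | 0\.243 | 0\.892 | 0\.567 | 30 |
| Ord\-Full | KimiK2t | 0\.084 | 0\.164 | 0\.786 | 0\.286 | 21 |
| Ord\-Full | DSReasoner | 0\.048 | 0\.116 | 0\.917 | 0\.750 | 12 |
| Hid\-Full | GPT\-5\.4 | 0\.628 | 0\.697 | 0\.843 | 0\.465 | 157 |
| Hid\-Full | Opus 4\.6 | 0\.416 | 0\.595 | 0\.827 | 0\.394 | 104 |
| Hid\-Full | DeepSeek4Pro | 0\.296 | 0\.375 | 0\.772 | 0\.324 | 74 |
| Hid\-Full | Gemini3\.1 | 0\.184 | 0\.431 | 0\.891 | 0\.652 | 46 |
| Hid\-Full | Grok 4\.20 | 0\.228 | 0\.422 | 0\.838 | 0\.456 | 57 |
| Hid\-Full | Grok 4 | 0\.208 | 0\.370 | 0\.815 | 0\.404 | 52 |
| Hid\-Full | Grok 4\.3 | 0\.084 | 0\.191 | 0\.833 | 0\.381 | 21 |
| Hid\-Full | KimiK2t | 0\.008 | 0\.049 | \* | \* | 2 |
| Hid\-Full | DSReasoner | 0\.000 | 0\.027 | – | – | 0 |
| Ord\-Match | GPT\-5\.4 | 0\.580 | 0\.748 | 0\.909 | 0\.690 | 58 |
| Ord\-Match | Opus 4\.6 | 0\.580 | 0\.757 | 0\.888 | 0\.621 | 58 |
| Ord\-Match | DeepSeek4Pro | 0\.280 | 0\.499 | 0\.888 | 0\.571 | 28 |
| Ord\-Match | Gemini3\.1 | 0\.370 | 0\.656 | 0\.943 | 0\.784 | 37 |
| Ord\-Match | Grok 4\.20 | 0\.300 | 0\.539 | 0\.879 | 0\.600 | 30 |
| Ord\-Match | Grok 4 | 0\.310 | 0\.504 | 0\.831 | 0\.484 | 31 |
| Ord\-Match | Grok 4\.3 | 0\.100 | 0\.207 | 0\.900 | 0\.600 | 10 |
| Ord\-Match | KimiK2t | 0\.010 | 0\.098 | \* | \* | 1 |
| Ord\-Match | DSReasoner | 0\.010 | 0\.062 | \* | \* | 1 |
| Hid\-Match | GPT\-5\.4 | 0\.620 | 0\.723 | 0\.891 | 0\.500 | 62 |
| Hid\-Match | Opus 4\.6 | 0\.280 | 0\.489 | 0\.879 | 0\.536 | 28 |
| Hid\-Match | DeepSeek4Pro | 0\.210 | 0\.321 | 0\.815 | 0\.429 | 21 |
| Hid\-Match | Gemini3\.1 | 0\.110 | 0\.384 | 1\.000 | 1\.000 | 11 |
| Hid\-Match | Grok 4\.20 | 0\.170 | 0\.358 | 0\.897 | 0\.706 | 17 |
| Hid\-Match | Grok 4 | 0\.150 | 0\.293 | 0\.808 | 0\.400 | 15 |
| Hid\-Match | Grok 4\.3 | 0\.070 | 0\.168 | 0\.839 | 0\.571 | 7 |
| Hid\-Match | KimiK2t | 0\.000 | 0\.027 | – | – | 0 |
| Hid\-Match | DSReasoner | 0\.000 | 0\.027 | – | – | 0 |
| Block | GPT\-5\.4 | 0\.600 | 0\.774 | 0\.912 | 0\.683 | 60 |
| Block | Opus 4\.6 | 0\.420 | 0\.671 | 0\.884 | 0\.571 | 42 |
| Block | DeepSeek4Pro | 0\.350 | 0\.410 | 0\.861 | 0\.486 | 35 |
| Block | Gemini3\.1 | 0\.280 | 0\.559 | 0\.942 | 0\.786 | 28 |
| Block | Grok 4\.20 | 0\.150 | 0\.455 | 0\.967 | 0\.867 | 15 |
| Block | Grok 4 | 0\.240 | 0\.437 | 0\.745 | 0\.292 | 24 |
| Block | Grok 4\.3 | 0\.110 | 0\.206 | 0\.875 | 0\.545 | 11 |
| Block | KimiK2t | 0\.010 | 0\.079 | \* | \* | 1 |
| Block | DSReasoner | 0\.010 | 0\.090 | \* | \* | 1 |
| Hid\-Roots | GPT\-5\.4 | 0\.080 | 0\.120 | 0\.734 | 0\.250 | 8 |
| Hid\-Roots | Opus 4\.6 | 0\.020 | 0\.138 | \* | \* | 2 |
| Hid\-Roots | DeepSeek4Pro | 0\.020 | 0\.081 | \* | \* | 2 |
| Hid\-Roots | Gemini3\.1 | 0\.040 | 0\.116 | \* | \* | 4 |
| Hid\-Roots | Grok 4\.20 | 0\.060 | 0\.143 | 0\.917 | 0\.833 | 6 |
| Hid\-Roots | Grok 4 | 0\.050 | 0\.104 | \* | \* | 5 |
| Hid\-Roots | Grok 4\.3 | 0\.030 | 0\.061 | \* | \* | 3 |
| Hid\-Roots | KimiK2t | 0\.000 | 0\.014 | – | – | 0 |
| Hid\-Roots | DSReasoner | 0\.000 | 0\.001 | – | – | 0 |
| Ord\-Ext | GPT\-5\.4 | 0\.630 | 0\.880 | 0\.958 | 0\.841 | 63 |
| Ord\-Ext | Opus 4\.6 | 0\.380 | 0\.783 | 0\.977 | 0\.895 | 38 |
| Ord\-Ext | DeepSeek4Pro | 0\.140 | 0\.465 | 0\.938 | 0\.714 | 14 |
| Ord\-Ext | Gemini3\.1 | 0\.300 | 0\.709 | 0\.992 | 0\.967 | 30 |
| Ord\-Ext | Grok 4\.20 | 0\.230 | 0\.571 | 0\.967 | 0\.913 | 23 |
| Ord\-Ext | Grok 4 | 0\.120 | 0\.460 | 1\.000 | 1\.000 | 12 |
| Ord\-Ext | Grok 4\.3 | 0\.020 | 0\.146 | \* | \* | 2 |
| Ord\-Ext | KimiK2t | 0\.020 | 0\.066 | \* | \* | 2 |
| Ord\-Ext | DSReasoner | 0\.020 | 0\.100 | \* | \* | 2 |
| Hid\-Ext | GPT\-5\.4 | 0\.430 | 0\.783 | 0\.962 | 0\.767 | 43 |
| Hid\-Ext | Opus 4\.6 | 0\.270 | 0\.670 | 0\.972 | 0\.852 | 27 |
| Hid\-Ext | DeepSeek4Pro | 0\.170 | 0\.343 | 0\.941 | 0\.824 | 17 |
| Hid\-Ext | Gemini3\.1 | 0\.180 | 0\.507 | 0\.979 | 0\.889 | 18 |
| Hid\-Ext | Grok 4\.20 | 0\.130 | 0\.466 | 0\.971 | 0\.846 | 13 |
| Hid\-Ext | Grok 4 | 0\.090 | 0\.335 | 0\.972 | 0\.889 | 9 |
| Hid\-Ext | Grok 4\.3 | 0\.050 | 0\.137 | \* | \* | 5 |
| Hid\-Ext | KimiK2t | 0\.000 | 0\.032 | – | – | 0 |
| Hid\-Ext | DSReasoner | 0\.000 | 0\.034 | – | – | 0 |

Held\-out replay is much stronger among train\-exact responses\. The main error for frontier models is failing to produce an executable SCM that fits all training worlds \(Table C\.4\)\.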

## Appendix D Disclosure\-ladder support

Appendix D supports the disclosure\-ladder analysis in Section 6\.3\. The paired same\-latent deltas hold the latent SCM fixed while varying only the information revealed to the model\. Matched\-pool rates and train\-versus\-held\-out comparisons show the same pattern\.

The common 100\-problem matched pool follows the same TrainExact ordering, although several structural facts change across that ladder at once\. The train\-versus\-held\-out comparison shows that executable mechanism induction cannot be reduced to fitting the exposed worlds alone \(Table[D\.2](https://arxiv.org/html/2605.08197#A4.T2)and Figure[D\.1](https://arxiv.org/html/2605.08197#A4.F1)\)\.

Table D\.1: Paired disclosure deltas\. Each row compares paired versions of the same latent SCM under two disclosure settings\. Deltas are computed as first setting minus second setting; positive values therefore indicate lower performance in the less\-disclosed setting\.

Most transitions toward less structural information reduce both TrainExact and HeldoutWorld, with the largest losses on Ordered→Hidden\-order and Hidden\-order→Hidden\-roots\. GPT\-5\.4 is the main exception on Ordered→Hidden\-order TrainExact, but even there held\-out replay still declines \(Table [D\.1](https://arxiv.org/html/2605.08197#A4.T1)\)\.
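The sign convention (first setting minus second) can be illustrated with a few lines of arithmetic. The inputs below are the GPT\-5\.4 Ord\-Match and Hid\-Match rates from Table C\.4, used only as illustrative inputs; the outputs are not claimed to equal the entries of Table D\.1, whose exact computation is not reproduced in this section. The helper name `paired_delta` is hypothetical.

```python
def paired_delta(first, second):
    """Per-metric delta, first setting minus second setting."""
    return {metric: round(first[metric] - second[metric], 3) for metric in first}

# GPT-5.4 rates on the matched pools (Table C.4).
ordered = {"TrainExact": 0.580, "HeldoutWorld": 0.748}
hidden_order = {"TrainExact": 0.620, "HeldoutWorld": 0.723}

delta = paired_delta(ordered, hidden_order)
print(delta)
```

With these inputs the TrainExact delta is negative while the HeldoutWorld delta is positive, matching the pattern described for GPT\-5\.4: TrainExact does not drop when the order is hidden, but held\-out replay still does.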

Table D\.2: TrainExact on the common matched pool\. All entries are TrainExact on the same matched source problems or on stronger\-support variants derived from them\. These rates support the paired\-delta analysis, although several structural facts change across the ladder at once\.

TrainExact on the common matched pool is consistent with the paired\-delta analysis, without isolating one disclosure change at a time \(Table [D\.2](https://arxiv.org/html/2605.08197#A4.T2)\)\.

![Refer to caption](https://arxiv.org/html/2605.08197v1/x3.png)

Figure D\.1: Training versus held\-out replay for representative LLM rows\. Even among the strongest high\-coverage LLM rows, executable replay degrades on held\-out interventions\.

## Appendix E Robustness and support\-audit summaries

### E\.1 Generation\-time disambiguation

Adding disambiguation during generation produces better\-constrained cohorts\. On Ordered \(full\), the disambiguated cohort preserves training replay while improving held\-out replay, shrinking the train\-minus\-held\-out gap, and raising retention\. Cohort summaries, pooled uncertainty, per\-model deltas, and shortcut\-pressure bins all show this pattern \(Tables[F\.1](https://arxiv.org/html/2605.08197#A6.T1)–[F\.6](https://arxiv.org/html/2605.08197#A6.T6)\)\.

Hidden\-order behaves differently\. There, the disambiguated cohort also reduces the train\-to\-held\-out gap, while absolute held\-out replay remains lower\. This pattern is consistent with stronger finite\-evidence support without making the task trivially easier\. The ambiguity measures are consistent with this interpretation \(Table[F\.6](https://arxiv.org/html/2605.08197#A6.T6)\)\.

### E\.2 Three\-level support\-audit matched subsets

The three\-level matched ladder—Original, Extra Worlds, and CEx—asks whether residual ambiguity explains the Ordered/Hidden\-order gap\. All three levels use the same 100 matched latent SCMs and preserve the held\-out worlds\. Original and Extra Worlds are pure disclosure pairs: Ordered and Hidden\-order share the same training worlds and differ only in whether the topological order is revealed\. CEx adds counterexample worlds against discovered alternatives\. Its counterexample worlds may be setting\-specific because discovered alternatives can differ across disclosure settings\.

Extra Worlds raises mean local predecessor\-pattern coverage from 0\.8949 to 0\.9815 and increases the number of fully covered problems from 2/100 to 42/100\. CEx then raises mean local predecessor\-pattern coverage to 1\.0, reaches full local coverage on all 100 problems, and adds separating worlds until no discovered semantic alternative from LLM outputs, symbolic exact\-search, or 50\-seed bnlearn\+DSL searches still fits the training worlds\. These additions materially weaken the local\-support and discovered\-alternative objections while leaving uniqueness and prompt\-length control outside the claim\.

Table E\.1: LLM results on the support\-audit matched subsets\. n is the number of stored scored responses\. Ord\-Ext/Hid\-Ext denote Extra Worlds; Ord\-CEx/Hid\-CEx denote Counterexample Audit \(CEx\)\. CEx preserves the same held\-out worlds and adds training worlds to complete bounded local predecessor\-pattern coverage and separate discovered alternatives under the audited searches; these counterexample additions may be setting\-specific\. In the CEx rows, HeldoutExact equals TrainExact whenever TrainExact is nonzero, so train\-exact CEx outputs also replay all held\-out worlds exactly\. HeldoutWorld is the unconditioned mean held\-out world replay rate\.

Replay remains separated by disclosure under stronger evidence \(Table E\.1\)\. In the CEx rows, train\-exact LLM outputs also replay all held\-out worlds exactly, yet Hidden\-order still reaches TrainExact less often than Ordered\. Structural measures show the same pattern\. Adding worlds changes both evidence and prompt length, and CEx counterexample additions may be setting\-specific; before/after comparisons therefore mix support, prompt\-length, and counterexample\-audit effects\. Original and Extra Worlds are the pure disclosure comparisons; CEx is a same\-latent, same\-held\-out condition with fewer discovered alternatives\.

## Appendix F Construction and support\-audit analyses

Appendix F gives the detailed construction and support\-audit evidence behind Appendix E\. Tables[F\.1](https://arxiv.org/html/2605.08197#A6.T1)–[F\.6](https://arxiv.org/html/2605.08197#A6.T6)evaluate generation\-time disambiguation and construction quality\. Tables[F\.7](https://arxiv.org/html/2605.08197#A6.T7)–[F\.14](https://arxiv.org/html/2605.08197#A6.T14)follow the same 100 matched SCMs through Original, Extra Worlds, and Counterexample Audit \(CEx\), testing whether Hidden\-order remains harder than Ordered after adding stronger local support and counterexamples against discovered alternatives\.

Table F\.1: Pooled LLM cohort deltas with 95% bootstrap CIs\. Each row compares a later cohort with the corresponding base cohort for the fixed LLM set\. Positive ΔHeldoutWorld and ΔRetention indicate improvement; negative ΔTrain\-minus\-heldout gap indicates a smaller overfitting gap\.

The pooled deltas show the cohort effect most clearly on Ordered: held\-out replay improves, retention rises, and the train\-minus\-held\-out gap shrinks\. On Hidden\-order, the main effect is gap shrinkage rather than a comparable gain in absolute held\-out replay \(Table [F\.1](https://arxiv.org/html/2605.08197#A6.T1)\)\.

Table F\.2: Three\-level support\-audit construction summary\. The same 100 latent SCMs underlie the Ordered and Hidden\-order variants\. Extra Worlds raises mean local predecessor\-pattern coverage from 0\.8949 to 0\.9815\. Counterexample Audit \(CEx\) starts from those records, completes bounded local predecessor\-pattern coverage to 1\.0, and adds worlds that separate discovered alternatives that fit the training worlds while preserving the original held\-out worlds\. The comparison strengthens finite evidence but also increases prompt length\.

The three\-level support\-audit ladder materially strengthens finite evidence\. Mean local predecessor\-pattern coverage rises from 0\.8949 \(Original\) to 0\.9815 \(Extra Worlds\) to 1\.0 \(CEx\); full local coverage rises from 2/100 to 42/100 to 100/100\. CEx also adds separating worlds until no discovered alternative from LLM outputs, symbolic exact\-search, or 50\-seed bnlearn\+DSL searches still fits the training worlds\. Together, these steps leave no discovered alternative train\-consistent under the implemented audits\. They do not prove uniqueness or isolate prompt length from evidence \(Table [F\.2](https://arxiv.org/html/2605.08197#A6.T2)\)\.

Table F.3: Cohort-level construction summary. Ordered shows the clearest held-out improvement under stronger generation-time disambiguation, while Hidden-order shows a narrower train-to-held-out gap without a comparable gain in absolute held-out replay.

On Ordered, the disambiguated cohort preserves training replay while improving held-out replay and retention. Hidden-order behaves differently: the train-to-held-out gap narrows, so the newer Hidden-order cohort is better constrained, but absolute held-out replay remains comparatively low (Table [F.3](https://arxiv.org/html/2605.08197#A6.T3)).

Table F.4: Bootstrap uncertainty for cohort deltas. All rows use the fixed LLM set (m = 5). Ordered disambiguated minus base cohort improves HeldoutWorld and retention while shrinking the train-minus-held-out gap. Hidden-order shrinks the gap but has lower absolute HeldoutWorld.

Bootstrap intervals show the same contrast. On Ordered, stronger disambiguation yields held-out gains after uncertainty is quantified. On Hidden-order, the main change is a smaller train-to-held-out gap rather than a clear gain in absolute held-out replay (Table [F.4](https://arxiv.org/html/2605.08197#A6.T4)).
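For readers who want to reproduce intervals of this kind, the following is a minimal sketch of a percentile bootstrap for a difference in cohort means. It assumes per-item scores are available for each cohort and that the two cohorts are resampled independently. The appendix does not state the exact resampling scheme, so the function name, the number of resamples, and the independence assumption are illustrative choices.

```python
import random


def bootstrap_delta_ci(base_scores, later_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(later_scores) - mean(base_scores).

    Each cohort is resampled with replacement, independently of the other
    (an assumption; a paired scheme would resample item-level differences).
    """
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    deltas = []
    for _ in range(n_boot):
        base = rng.choices(base_scores, k=len(base_scores))
        later = rng.choices(later_scores, k=len(later_scores))
        deltas.append(mean(later) - mean(base))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return mean(later_scores) - mean(base_scores), lo, hi


# Invented per-item held-out replay scores for two small cohorts.
base = [0, 1, 0, 1, 1, 0, 0, 1]
later = [1, 1, 0, 1, 1, 1, 0, 1]
point, lo, hi = bootstrap_delta_ci(base, later)
print(f"delta={point:+.3f}  95% CI=[{lo:+.3f}, {hi:+.3f}]")
```

An interval that excludes zero corresponds to the "gains after uncertainty is quantified" reading used for the Ordered rows; an interval that straddles zero corresponds to the weaker Hidden-order reading.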

Table F.5: Per-model cohort deltas for the fixed LLM set used in Table [F.4](https://arxiv.org/html/2605.08197#A6.T4).

| Setting | Comparison | Model | ΔHeldoutWorld | ΔTrain-minus-held-out gap | ΔRetention |
|---|---|---|---|---|---|
| Ord-Match | disambiguated - base cohort | GPT-5.4 | +0.114 | -0.124 | +0.135 |
| Ord-Match | disambiguated - base cohort | Opus 4.6 | +0.065 | -0.123 | +0.121 |
| Ord-Match | disambiguated - base cohort | Gemini3.1 | +0.100 | -0.102 | +0.135 |
| Ord-Match | disambiguated - base cohort | Grok 4.20 | +0.069 | -0.037 | +0.067 |
| Ord-Match | disambiguated - base cohort | Grok 4 | +0.068 | -0.054 | +0.087 |
| Ord-Ext | extension - base cohort | GPT-5.4 | -0.139 | -0.215 | +0.209 |
| Ord-Ext | extension - base cohort | Opus 4.6 | +0.044 | -0.096 | +0.092 |
| Ord-Ext | extension - base cohort | Gemini3.1 | +0.116 | -0.159 | +0.211 |
| Ord-Ext | extension - base cohort | Grok 4.20 | +0.120 | -0.119 | +0.189 |
| Ord-Ext | extension - base cohort | Grok 4 | -0.254 | +0.010 | -0.182 |
| Hid-Match | disambiguated - base cohort | GPT-5.4 | +0.057 | -0.037 | +0.047 |
| Hid-Match | disambiguated - base cohort | Opus 4.6 | -0.172 | -0.038 | -0.003 |
| Hid-Match | disambiguated - base cohort | Gemini3.1 | -0.082 | -0.044 | +0.050 |
| Hid-Match | disambiguated - base cohort | Grok 4.20 | -0.105 | -0.019 | -0.005 |
| Hid-Match | disambiguated - base cohort | Grok 4 | -0.145 | -0.003 | -0.086 |
| Hid-Ext | extension - base cohort | GPT-5.4 | -0.180 | -0.049 | +0.001 |
| Hid-Ext | extension - base cohort | Opus 4.6 | -0.187 | +0.020 | -0.079 |
| Hid-Ext | extension - base cohort | Gemini3.1 | -0.025 | -0.063 | +0.101 |
| Hid-Ext | extension - base cohort | Grok 4.20 | -0.110 | -0.015 | -0.014 |
| Hid-Ext | extension - base cohort | Grok 4 | +0.120 | -0.037 | +0.090 |

Per-model deltas show the same pattern across the fixed LLM set: Ordered has the clearest held-out gains from stronger disambiguation, while Hidden-order mainly shows narrower train-to-held-out gaps (Table F.5).

Table F.6: Ambiguity-proxy bins. Stronger shortcut pressure is associated with weaker retention and held-out replay, especially on Ordered.

The ambiguity-proxy bins connect the cohort effect back to construction-time metadata. On Ordered, stronger shortcut pressure is associated with weaker held-out replay and weaker retention, suggesting that the generator’s ambiguity measures track real difficulty (Table [F.6](https://arxiv.org/html/2605.08197#A6.T6)).

### F.1 Support-audit matched subsets

The support-audit subsets address the residual-ambiguity objection by analyzing replay, structure, and discovered alternatives. Table [F.7](https://arxiv.org/html/2605.08197#A6.T7) tracks replay across Original, Extra Worlds, and CEx. Tables [F.8](https://arxiv.org/html/2605.08197#A6.T8)–F.10 track semantic structure for Original and Extra Worlds. Tables [F.11](https://arxiv.org/html/2605.08197#A6.T11)–[F.14](https://arxiv.org/html/2605.08197#A6.T14) track discovered alternatives that fit the training worlds before CEx construction and show how Extra Worlds reduces them. Original and Extra Worlds are pure disclosure comparisons because Ordered and Hidden-order share latent SCMs, training worlds, and held-out worlds. CEx uses the same latent SCMs and held-out worlds and leaves zero surviving discovered alternatives under the audited pools; its counterexample additions may be setting-specific.

Table F.7: Replay across the three support-audit levels. Original, Extra Worlds, and Counterexample Audit (CEx) compare the same matched source problems under increasingly strong finite-evidence support; Held-out world exact denotes HeldoutWorldExact, and the held-out worlds are preserved. Before/after differences combine stronger local evidence with longer prompts, and CEx can add setting-specific counterexample worlds; these rows therefore include prompt-length and audit-construction effects rather than a pure disclosure ablation.

Within each support level, the Ordered/Hidden-order separation remains visible in replay. Original and Extra Worlds provide pure disclosure comparisons; CEx keeps the same latent SCMs and held-out worlds while reducing discovered alternatives (Table [F.7](https://arxiv.org/html/2605.08197#A6.T7)).

In the extra-worlds benchmarks, the Hidden-order gap remains visible in semantic parent structure and exact parent-map matching on the same latent problems (Table [F.8](https://arxiv.org/html/2605.08197#A6.T8)).

Panel A: Validity and parent-graph structure

| Model | Valid (Ord-Ext) | Valid (Hid-Ext) | Parent F1 (Ord-Ext) | Parent F1 (Hid-Ext) | Parent SHD (Ord-Ext) | Parent SHD (Hid-Ext) |
|---|---|---|---|---|---|---|
| GPT-5.4 | 1.000 | 0.990 | 0.975 | 0.920 | 0.88 | 2.52 |
| Opus 4.6 | 0.980 | 0.950 | 0.944 | 0.874 | 1.79 | 3.71 |
| DeepSeek4Pro | 1.000 | 1.000 | 0.873 | 0.728 | 4.01 | 7.77 |
| Gemini3.1 | 1.000 | 1.000 | 0.923 | 0.846 | 2.62 | 4.74 |
| Grok 4.20 | 1.000 | 0.970 | 0.898 | 0.790 | 3.38 | 6.28 |
| Grok 4 | 0.990 | 0.990 | 0.839 | 0.717 | 4.90 | 7.93 |
| Grok 4.3 | 1.000 | 0.960 | 0.714 | 0.541 | 7.31 | 11.18 |
| KimiK2t | 0.950 | 0.970 | 0.675 | 0.463 | 8.76 | 13.34 |
| DSReasoner | 1.000 | 0.990 | 0.686 | 0.403 | 7.86 | 13.75 |

Panel B: Exact structure and replay

| Model | Exact parent map (Ord-Ext) | Exact parent map (Hid-Ext) | TrainExact (Ord-Ext) | TrainExact (Hid-Ext) | HeldoutWorld (Ord-Ext) | HeldoutWorld (Hid-Ext) |
|---|---|---|---|---|---|---|
| GPT-5.4 | 0.610 | 0.313 | 0.630 | 0.430 | 0.880 | 0.783 |
| Opus 4.6 | 0.418 | 0.242 | 0.380 | 0.270 | 0.783 | 0.670 |
| DeepSeek4Pro | 0.110 | 0.120 | 0.140 | 0.170 | 0.465 | 0.343 |
| Gemini3.1 | 0.360 | 0.190 | 0.300 | 0.180 | 0.709 | 0.507 |
| Grok 4.20 | 0.270 | 0.134 | 0.230 | 0.130 | 0.571 | 0.467 |
| Grok 4 | 0.162 | 0.091 | 0.120 | 0.090 | 0.460 | 0.335 |
| Grok 4.3 | 0.020 | 0.052 | 0.020 | 0.050 | 0.146 | 0.137 |
| KimiK2t | 0.021 | 0.000 | 0.020 | 0.000 | 0.066 | 0.032 |
| DSReasoner | 0.020 | 0.000 | 0.020 | 0.000 | 0.100 | 0.034 |

Table F.8: Structural comparison on the Extra Worlds subset. Ordered + Extra Worlds and Hidden-order + Extra Worlds use the same latent SCMs, the same additional training worlds, and the same held-out worlds; only revealed order differs. Panel A reports validity and parent-graph structure, and Panel B reports exact parent maps and replay.

Table F.9: Structure before and after added worlds. The same source problems and held-out worlds are compared before and after additional training worlds are added. Panel A reports parent-graph changes, and Panel B reports local semantic match and replay.

Panel A: Parent-graph structure

| Setting | Model | Parent F1 (Original) | Parent F1 (Extra Worlds) | Parent SHD (Original) | Parent SHD (Extra Worlds) | Exact parent map (Original) | Exact parent map (Extra Worlds) |
|---|---|---|---|---|---|---|---|
| Ord-Match | GPT-5.4 | 0.952 | 0.975 | 1.70 | 0.88 | 0.460 | 0.610 |
| Ord-Match | Opus 4.6 | 0.956 | 0.944 | 1.59 | 1.79 | 0.380 | 0.410 |
| Ord-Match | DeepSeek4Pro | 0.872 | 0.873 | 4.11 | 4.01 | 0.160 | 0.110 |
| Ord-Match | Gemini3.1 | 0.927 | 0.923 | 2.44 | 2.62 | 0.390 | 0.360 |
| Ord-Match | Grok 4.20 | 0.903 | 0.898 | 3.34 | 3.38 | 0.210 | 0.270 |
| Ord-Match | Grok 4 | 0.877 | 0.839 | 4.24 | 4.90 | 0.146 | 0.162 |
| Ord-Match | Grok 4.3 | 0.749 | 0.714 | 6.71 | 7.31 | 0.071 | 0.020 |
| Ord-Match | KimiK2t | 0.681 | 0.675 | 8.84 | 8.76 | 0.010 | 0.021 |
| Ord-Match | DSReasoner | 0.674 | 0.686 | 8.14 | 7.86 | 0.010 | 0.020 |
| Hid-Match | GPT-5.4 | 0.896 | 0.920 | 3.23 | 2.52 | 0.310 | 0.313 |
| Hid-Match | Opus 4.6 | 0.795 | 0.874 | 5.85 | 3.71 | 0.180 | 0.230 |
| Hid-Match | DeepSeek4Pro | 0.732 | 0.728 | 7.47 | 7.77 | 0.093 | 0.120 |
| Hid-Match | Gemini3.1 | 0.797 | 0.846 | 5.68 | 4.74 | 0.170 | 0.190 |
| Hid-Match | Grok 4.20 | 0.756 | 0.790 | 6.85 | 6.28 | 0.150 | 0.130 |
| Hid-Match | Grok 4 | 0.745 | 0.717 | 7.52 | 7.93 | 0.061 | 0.091 |
| Hid-Match | Grok 4.3 | 0.525 | 0.541 | 11.86 | 11.18 | 0.038 | 0.052 |
| Hid-Match | KimiK2t | 0.431 | 0.463 | 13.87 | 13.34 | 0.000 | 0.000 |
| Hid-Match | DSReasoner | 0.376 | 0.403 | 13.76 | 13.75 | 0.000 | 0.000 |

Panel B: Local semantics and replay

| Setting | Model | Local semantic match (Original) | Local semantic match (Extra Worlds) | TrainExact (Original) | TrainExact (Extra Worlds) | HeldoutWorld (Original) | HeldoutWorld (Extra Worlds) |
|---|---|---|---|---|---|---|---|
| Ord-Match | GPT-5.4 | 0.763 | 0.866 | 0.580 | 0.630 | 0.748 | 0.880 |
| Ord-Match | Opus 4.6 | 0.769 | 0.785 | 0.580 | 0.380 | 0.757 | 0.783 |
| Ord-Match | DeepSeek4Pro | 0.616 | 0.619 | 0.280 | 0.140 | 0.499 | 0.465 |
| Ord-Match | Gemini3.1 | 0.733 | 0.746 | 0.370 | 0.300 | 0.656 | 0.709 |
| Ord-Match | Grok 4.20 | 0.656 | 0.704 | 0.300 | 0.230 | 0.539 | 0.571 |
| Ord-Match | Grok 4 | 0.621 | 0.580 | 0.310 | 0.120 | 0.504 | 0.460 |
| Ord-Match | Grok 4.3 | 0.429 | 0.388 | 0.100 | 0.020 | 0.207 | 0.146 |
| Ord-Match | KimiK2t | 0.295 | 0.341 | 0.010 | 0.020 | 0.098 | 0.066 |
| Ord-Match | DSReasoner | 0.308 | 0.347 | 0.010 | 0.020 | 0.062 | 0.100 |
| Hid-Match | GPT-5.4 | 0.710 | 0.774 | 0.620 | 0.430 | 0.723 | 0.783 |
| Hid-Match | Opus 4.6 | 0.575 | 0.682 | 0.280 | 0.270 | 0.489 | 0.670 |
| Hid-Match | DeepSeek4Pro | 0.468 | 0.529 | 0.210 | 0.170 | 0.321 | 0.343 |
| Hid-Match | Gemini3.1 | 0.576 | 0.629 | 0.110 | 0.180 | 0.384 | 0.507 |
| Hid-Match | Grok 4.20 | 0.519 | 0.577 | 0.170 | 0.130 | 0.358 | 0.467 |
| Hid-Match | Grok 4 | 0.492 | 0.491 | 0.150 | 0.090 | 0.293 | 0.335 |
| Hid-Match | Grok 4.3 | 0.259 | 0.286 | 0.070 | 0.050 | 0.168 | 0.137 |
| Hid-Match | KimiK2t | 0.124 | 0.133 | 0.000 | 0.000 | 0.027 | 0.032 |
| Hid-Match | DSReasoner | 0.120 | 0.142 | 0.000 | 0.000 | 0.027 | 0.034 |

The comparison of original versus additional training worlds mixes two changes at once: increased intervention coverage, which reduces local mechanism ambiguity, and increased prompt length. Even so, the same held-out worlds become harder to explain with spurious alternatives once stronger local support is present (Table F.9).

Table F.10: Bootstrap intervals for added-world analyses. The Matched Ordered/Matched Hidden-order rows isolate revealed-order effects on the matched subset, whereas the original-to-extra-worlds rows quantify how much the added evidence changes the same source problems.

Panel A: Parent-graph intervals

| Comparison | Model | ΔParent SHD ↓ | ΔParent F1 | ΔExact parent map |
|---|---|---|---|---|
| Hid-Ext - Ord-Ext | GPT-5.4 | +1.66 [+1.00, +2.36] | -0.055 [-0.080, -0.033] | -0.303 [-0.404, -0.192] |
| Hid-Ext - Ord-Ext | Opus 4.6 | +1.96 [+1.18, +2.72] | -0.071 [-0.102, -0.040] | -0.172 [-0.290, -0.043] |
| Hid-Ext - Ord-Ext | DeepSeek4Pro | +3.76 [+2.63, +4.87] | -0.145 [-0.191, -0.100] | +0.010 [-0.070, +0.090] |
| Hid-Ext - Ord-Ext | Gemini3.1 | +2.12 [+1.38, +2.89] | -0.076 [-0.102, -0.049] | -0.170 [-0.260, -0.090] |
| Hid-Ext - Ord-Ext | Grok 4.20 | +2.79 [+1.79, +3.85] | -0.106 [-0.148, -0.067] | -0.113 [-0.196, -0.041] |
| Hid-Ext - Ord-Ext | Grok 4 | +3.05 [+2.06, +4.09] | -0.122 [-0.164, -0.075] | -0.071 [-0.153, +0.010] |
| Hid-Ext - Ord-Ext | Grok 4.3 | +3.96 [+2.84, +5.00] | -0.178 [-0.228, -0.125] | +0.031 [-0.010, +0.083] |
| Hid-Ext - Ord-Ext | KimiK2t | +4.87 [+3.92, +5.79] | -0.219 [-0.260, -0.180] | -0.022 [-0.054, +0.000] |
| Hid-Ext - Ord-Ext | DSReasoner | +5.89 [+5.05, +6.73] | -0.282 [-0.325, -0.241] | -0.020 [-0.051, +0.000] |
| Ord-Ext - Ord-Match | GPT-5.4 | -0.82 [-1.27, -0.40] | +0.023 [+0.009, +0.038] | +0.150 [+0.040, +0.260] |
| Ord-Ext - Ord-Match | Opus 4.6 | +0.14 [-0.33, +0.60] | -0.011 [-0.031, +0.006] | +0.021 [-0.107, +0.150] |
| Ord-Ext - Ord-Match | DeepSeek4Pro | -0.10 [-0.83, +0.68] | +0.002 [-0.025, +0.031] | -0.050 [-0.130, +0.030] |
| Ord-Ext - Ord-Match | Gemini3.1 | +0.18 [-0.49, +0.83] | -0.005 [-0.025, +0.015] | -0.030 [-0.120, +0.060] |
| Ord-Ext - Ord-Match | Grok 4.20 | +0.04 [-0.62, +0.74] | -0.004 [-0.026, +0.018] | +0.060 [-0.020, +0.130] |
| Ord-Ext - Ord-Match | Grok 4 | +0.63 [-0.15, +1.41] | -0.036 [-0.070, -0.007] | +0.021 [-0.063, +0.116] |
| Ord-Ext - Ord-Match | Grok 4.3 | +0.57 [-0.12, +1.31] | -0.034 [-0.071, +0.001] | -0.051 [-0.102, +0.000] |
| Ord-Ext - Ord-Match | KimiK2t | -0.08 [-0.98, +0.88] | -0.005 [-0.044, +0.034] | +0.011 [-0.022, +0.054] |
| Ord-Ext - Ord-Match | DSReasoner | -0.23 [-0.88, +0.41] | +0.010 [-0.019, +0.040] | +0.000 [+0.000, +0.000] |
| Hid-Ext - Hid-Match | GPT-5.4 | -0.69 [-1.41, +0.03] | +0.024 [-0.001, +0.050] | +0.000 [-0.101, +0.091] |
| Hid-Ext - Hid-Match | Opus 4.6 | -1.89 [-2.99, -0.89] | +0.076 [+0.033, +0.118] | +0.056 [-0.033, +0.144] |
| Hid-Ext - Hid-Match | DeepSeek4Pro | +0.38 [-0.69, +1.47] | -0.008 [-0.054, +0.036] | +0.031 [-0.031, +0.093] |
| Hid-Ext - Hid-Match | Gemini3.1 | -0.94 [-1.72, -0.13] | +0.049 [+0.017, +0.079] | +0.020 [-0.060, +0.090] |
| Hid-Ext - Hid-Match | Grok 4.20 | -0.64 [-1.79, +0.53] | +0.037 [-0.013, +0.085] | -0.011 [-0.096, +0.074] |
| Hid-Ext - Hid-Match | Grok 4 | +0.28 [-0.80, +1.38] | -0.023 [-0.069, +0.023] | +0.031 [-0.031, +0.102] |
| Hid-Ext - Hid-Match | Grok 4.3 | -0.37 [-1.35, +0.68] | +0.005 [-0.045, +0.055] | +0.027 [-0.013, +0.080] |
| Hid-Ext - Hid-Match | KimiK2t | -0.40 [-1.36, +0.51] | +0.030 [-0.012, +0.073] | +0.000 [+0.000, +0.000] |
| Hid-Ext - Hid-Match | DSReasoner | -0.04 [-0.87, +0.75] | +0.027 [-0.013, +0.067] | +0.000 [+0.000, +0.000] |

Panel B: Local semantics and held-out replay intervals

| Comparison | Model | ΔLocal semantic match | ΔHeldoutWorld |
|---|---|---|---|
| Hid-Ext - Ord-Ext | GPT-5.4 | -0.095 [-0.131, -0.060] | -0.100 [-0.155, -0.043] |
| Hid-Ext - Ord-Ext | Opus 4.6 | -0.106 [-0.161, -0.054] | -0.125 [-0.185, -0.060] |
| Hid-Ext - Ord-Ext | DeepSeek4Pro | -0.091 [-0.150, -0.035] | -0.122 [-0.214, -0.031] |
| Hid-Ext - Ord-Ext | Gemini3.1 | -0.117 [-0.162, -0.077] | -0.201 [-0.263, -0.139] |
| Hid-Ext - Ord-Ext | Grok 4.20 | -0.124 [-0.174, -0.077] | -0.095 [-0.166, -0.022] |
| Hid-Ext - Ord-Ext | Grok 4 | -0.093 [-0.148, -0.031] | -0.124 [-0.195, -0.050] |
| Hid-Ext - Ord-Ext | Grok 4.3 | -0.109 [-0.165, -0.056] | -0.016 [-0.074, +0.046] |
| Hid-Ext - Ord-Ext | KimiK2t | -0.212 [-0.261, -0.167] | -0.035 [-0.077, +0.004] |
| Hid-Ext - Ord-Ext | DSReasoner | -0.204 [-0.254, -0.156] | -0.067 [-0.115, -0.027] |
| Ord-Ext - Ord-Match | GPT-5.4 | +0.103 [+0.061, +0.147] | +0.133 [+0.077, +0.189] |
| Ord-Ext - Ord-Match | Opus 4.6 | +0.022 [-0.019, +0.062] | +0.025 [-0.032, +0.085] |
| Ord-Ext - Ord-Match | DeepSeek4Pro | +0.003 [-0.056, +0.058] | -0.034 [-0.120, +0.049] |
| Ord-Ext - Ord-Match | Gemini3.1 | +0.013 [-0.030, +0.057] | +0.052 [-0.022, +0.125] |
| Ord-Ext - Ord-Match | Grok 4.20 | +0.048 [+0.012, +0.087] | +0.033 [-0.030, +0.096] |
| Ord-Ext - Ord-Match | Grok 4 | -0.039 [-0.092, +0.012] | -0.034 [-0.120, +0.045] |
| Ord-Ext - Ord-Match | Grok 4.3 | -0.039 [-0.086, +0.006] | -0.057 [-0.117, +0.001] |
| Ord-Ext - Ord-Match | KimiK2t | +0.042 [-0.005, +0.093] | -0.026 [-0.069, +0.024] |
| Ord-Ext - Ord-Match | DSReasoner | +0.031 [-0.005, +0.063] | +0.028 [-0.008, +0.064] |
| Hid-Ext - Hid-Match | GPT-5.4 | +0.064 [+0.021, +0.109] | +0.056 [-0.008, +0.117] |
| Hid-Ext - Hid-Match | Opus 4.6 | +0.099 [+0.044, +0.154] | +0.165 [+0.089, +0.240] |
| Hid-Ext - Hid-Match | DeepSeek4Pro | +0.056 [-0.000, +0.110] | +0.026 [-0.043, +0.088] |
| Hid-Ext - Hid-Match | Gemini3.1 | +0.053 [+0.016, +0.089] | +0.124 [+0.064, +0.185] |
| Hid-Ext - Hid-Match | Grok 4.20 | +0.061 [+0.005, +0.119] | +0.116 [+0.036, +0.196] |
| Hid-Ext - Hid-Match | Grok 4 | +0.003 [-0.049, +0.052] | +0.045 [-0.028, +0.117] |
| Hid-Ext - Hid-Match | Grok 4.3 | +0.030 [-0.017, +0.077] | -0.005 [-0.058, +0.047] |
| Hid-Ext - Hid-Match | KimiK2t | +0.011 [-0.027, +0.050] | +0.000 [-0.025, +0.024] |
| Hid-Ext - Hid-Match | DSReasoner | +0.023 [-0.015, +0.063] | +0.006 [-0.022, +0.035] |

Bootstrap intervals keep these effects separate. The matched Ordered-versus-Hidden-order rows quantify the residual disclosure penalty after additional worlds are added, whereas the original-to-extra-worlds rows quantify how much stronger local support changes the same source problems (Table F.10).

Additional training worlds make TrainExact stricter, so before-versus-after comparisons couple support gains with task changes. At the same time, when the comparison is made on matched latent SCMs with matched held-out worlds, Hidden-order remains structurally and behaviorally harder than Ordered (Table F.10).

### F.2 Alternative discovery under stronger support

Because the search is over the available result pool, the numbers in this subsection are discovery counts rather than absence proofs\. Even so, they provide a useful proxy for how much stronger local support narrows the space of discovered alternatives\. These tables report discovery on Original and Extra Worlds\. CEx adds separating worlds against discovered alternatives that survive Extra Worlds\. Because none of those audited alternatives remain, CEx is summarized through its construction and replay results rather than as another discovery row\.

Table F.11: Problem-level alternative discovery across support-audit levels. The first three numeric columns are problem counts out of 100; Any alternative rate is the fraction of problems with at least one discovered semantic alternative that fits the training worlds. Original and Extra Worlds are discovery counts and rates in the available result pools. CEx rows are post-audit checks after adding counterexamples against surviving discovered alternatives; zero rates mean no discovered alternative remains under the implemented audited searches, not that uniqueness is proven.

From Original to Extra Worlds, the fraction of problems with any discovered alternative that fits the training worlds drops sharply for both Ordered and Hidden-order. CEx then adds a stronger empirical filter by explicitly separating discovered alternatives that survive the Extra Worlds records until the audited discovered-alternative count is zero. This pattern is consistent with the added evidence providing stronger finite-evidence support (Table [F.11](https://arxiv.org/html/2605.08197#A6.T11)).

Table F.12: Per-model alternative-discovery rates across support-audit levels. An alternative identification is an executable SCM that is TrainExact and semantically distinct from the gold SCM under local Boolean mechanism truth-table equality; syntactic rewrites of the gold mechanisms are excluded. The two rate columns divide alternative identifications by attempted submissions and by train-exact solutions. Original and Extra Worlds report discovered alternatives in the available result pools. CEx rows report the post-audit check after counterexample additions, where no discovered alternative remains under the implemented audited searches. The * and – notation follows Table [1](https://arxiv.org/html/2605.08197#S6.T1).

| Setting | System | Attempted | Train-exact solutions | Alternative identifications | Alt. rate / attempt | Alt. rate / train-exact |
|---|---|---|---|---|---|---|
| Ord-Match | GPT-5.4 | 100 | 58 | 27 | 0.270 | 0.466 |
| Ord-Match | Opus 4.6 | 100 | 58 | 31 | 0.310 | 0.534 |
| Ord-Match | DeepSeek4Pro | 100 | 28 | 13 | 0.130 | 0.464 |
| Ord-Match | Gemini3.1 | 100 | 37 | 8 | 0.080 | 0.216 |
| Ord-Match | Grok 4.20 | 100 | 30 | 14 | 0.140 | 0.467 |
| Ord-Match | Grok 4 | 100 | 31 | 20 | 0.200 | 0.645 |
| Ord-Match | Grok 4.3 | 100 | 10 | 5 | 0.050 | 0.500 |
| Ord-Match | KimiK2t | 100 | 1 | 1 | 0.010 | * |
| Ord-Match | DSReasoner | 100 | 1 | 0 | 0.000 | * |
| Ord-Match | bnlearn+DSL | 100 | 100 | 38 | 0.380 | 0.380 |
| Ord-Match | symbolic exact-search | 100 | 97 | 32 | 0.320 | 0.330 |
| Hid-Match | GPT-5.4 | 100 | 62 | 36 | 0.360 | 0.581 |
| Hid-Match | Opus 4.6 | 100 | 28 | 15 | 0.150 | 0.536 |
| Hid-Match | DeepSeek4Pro | 100 | 21 | 12 | 0.120 | 0.571 |
| Hid-Match | Gemini3.1 | 100 | 11 | 0 | 0.000 | 0.000 |
| Hid-Match | Grok 4.20 | 100 | 17 | 6 | 0.060 | 0.353 |
| Hid-Match | Grok 4 | 100 | 15 | 9 | 0.090 | 0.600 |
| Hid-Match | Grok 4.3 | 100 | 7 | 5 | 0.050 | 0.714 |
| Hid-Match | KimiK2t | 100 | 0 | 0 | 0.000 | – |
| Hid-Match | DSReasoner | 100 | 0 | 0 | 0.000 | – |
| Hid-Match | bnlearn+DSL | 100 | 100 | 45 | 0.450 | 0.450 |
| Hid-Match | symbolic exact-search | 100 | 77 | 36 | 0.360 | 0.468 |
| Ord-Ext | GPT-5.4 | 100 | 63 | 13 | 0.130 | 0.206 |
| Ord-Ext | Opus 4.6 | 100 | 38 | 8 | 0.080 | 0.210 |
| Ord-Ext | DeepSeek4Pro | 100 | 14 | 6 | 0.060 | 0.429 |
| Ord-Ext | Gemini3.1 | 100 | 30 | 1 | 0.010 | 0.033 |
| Ord-Ext | Grok 4.20 | 100 | 23 | 3 | 0.030 | 0.130 |
| Ord-Ext | Grok 4 | 100 | 12 | 1 | 0.010 | 0.083 |
| Ord-Ext | Grok 4.3 | 100 | 2 | 0 | 0.000 | * |
| Ord-Ext | KimiK2t | 100 | 2 | 0 | 0.000 | * |
| Ord-Ext | DSReasoner | 100 | 2 | 0 | 0.000 | * |
| Ord-Ext | bnlearn+DSL | 100 | 100 | 18 | 0.180 | 0.180 |
| Ord-Ext | symbolic exact-search | 100 | 94 | 14 | 0.140 | 0.149 |
| Hid-Ext | GPT-5.4 | 100 | 43 | 17 | 0.170 | 0.395 |
| Hid-Ext | Opus 4.6 | 100 | 27 | 7 | 0.070 | 0.259 |
| Hid-Ext | DeepSeek4Pro | 100 | 17 | 5 | 0.050 | 0.294 |
| Hid-Ext | Gemini3.1 | 100 | 18 | 2 | 0.020 | 0.111 |
| Hid-Ext | Grok 4.20 | 100 | 13 | 3 | 0.030 | 0.231 |
| Hid-Ext | Grok 4 | 100 | 9 | 1 | 0.010 | 0.111 |
| Hid-Ext | Grok 4.3 | 100 | 5 | 0 | 0.000 | * |
| Hid-Ext | KimiK2t | 100 | 0 | 0 | 0.000 | – |
| Hid-Ext | DSReasoner | 100 | 0 | 0 | 0.000 | – |
| Hid-Ext | bnlearn+DSL | 100 | 99 | 24 | 0.240 | 0.242 |
| Hid-Ext | symbolic exact-search | 100 | 63 | 14 | 0.140 | 0.222 |
| Ord-CEx | GPT-5.4 | 100 | 53 | 0 | 0.000 | 0.000 |
| Ord-CEx | Opus 4.6 | 100 | 46 | 0 | 0.000 | 0.000 |
| Ord-CEx | DeepSeek4Pro | 100 | 21 | 0 | 0.000 | 0.000 |
| Ord-CEx | Gemini3.1 | 100 | 30 | 0 | 0.000 | 0.000 |
| Ord-CEx | Grok 4.20 | 100 | 23 | 0 | 0.000 | 0.000 |
| Ord-CEx | Grok 4 | 100 | 10 | 0 | 0.000 | 0.000 |
| Ord-CEx | Grok 4.3 | 100 | 2 | 0 | 0.000 | * |
| Ord-CEx | KimiK2t | 100 | 0 | 0 | 0.000 | – |
| Ord-CEx | DSReasoner | 100 | 6 | 0 | 0.000 | 0.000 |
| Ord-CEx | bnlearn+DSL | 100 | 100 | 0 | 0.000 | 0.000 |
| Ord-CEx | symbolic exact-search | 100 | 95 | 0 | 0.000 | 0.000 |
| Ord-CEx | bnlearn+DSL 50-seed audit | 50 | 50 | 0 | 0.000 | 0.000 |
| Hid-CEx | GPT-5.4 | 99 | 43 | 0 | 0.000 | 0.000 |
| Hid-CEx | Opus 4.6 | 100 | 30 | 0 | 0.000 | 0.000 |
| Hid-CEx | DeepSeek4Pro | 100 | 12 | 0 | 0.000 | 0.000 |
| Hid-CEx | Gemini3.1 | 100 | 21 | 0 | 0.000 | 0.000 |
| Hid-CEx | Grok 4.20 | 100 | 17 | 0 | 0.000 | 0.000 |
| Hid-CEx | Grok 4 | 100 | 8 | 0 | 0.000 | 0.000 |
| Hid-CEx | Grok 4.3 | 100 | 6 | 0 | 0.000 | 0.000 |
| Hid-CEx | KimiK2t | 100 | 0 | 0 | 0.000 | – |
| Hid-CEx | DSReasoner | 100 | 0 | 0 | 0.000 | – |
| Hid-CEx | bnlearn+DSL | 100 | 100 | 0 | 0.000 | 0.000 |
| Hid-CEx | symbolic exact-search | 100 | 95 | 0 | 0.000 | 0.000 |
| Hid-CEx | bnlearn+DSL 50-seed audit | 50 | 50 | 0 | 0.000 | 0.000 |

The reduction also appears at the system level. Systems that solve many original matched problems contribute most heavily to the available alternative pool, and they also show some of the clearest drops after augmentation (Table F.12).
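The definition of an alternative identification in Table F.12 rests on truth-table equality of local mechanisms, which separates syntactic rewrites from genuine semantic alternatives. The sketch below shows one way such a check could look, together with the two rate columns. The representation of a mechanism as a `(parents, function)` pair and all function names are hypothetical; the benchmark's own DSL and evaluator are not reproduced here, and the TrainExact requirement is assumed to be checked separately.

```python
from itertools import product


def same_truth_table(mech_a, mech_b):
    """Compare two local Boolean mechanisms over the union of their parents."""
    parents_a, fn_a = mech_a
    parents_b, fn_b = mech_b
    variables = sorted(set(parents_a) | set(parents_b))
    for bits in product((0, 1), repeat=len(variables)):
        env = dict(zip(variables, bits))
        if bool(fn_a(env)) != bool(fn_b(env)):
            return False
    return True


def is_semantic_alternative(candidate, gold):
    """True when at least one variable's mechanism differs in truth table."""
    return any(not same_truth_table(candidate[v], gold[v]) for v in gold)


def alt_rates(n_attempted, n_train_exact, n_alternatives):
    """Alternative rate per attempt and per train-exact solution."""
    per_attempt = n_alternatives / n_attempted
    per_train_exact = n_alternatives / n_train_exact if n_train_exact else None
    return per_attempt, per_train_exact


gold = {"C": (("A", "B"), lambda e: e["A"] and e["B"])}
# De Morgan rewrite of the same function: a syntactic change only.
rewrite = {"C": (("A", "B"), lambda e: not (not e["A"] or not e["B"]))}
# Different function of a different parent set: a semantic alternative.
alternative = {"C": (("A",), lambda e: e["A"])}

print(is_semantic_alternative(rewrite, gold))      # False
print(is_semantic_alternative(alternative, gold))  # True
# Counts from the Ord-Match GPT-5.4 row of Table F.12.
print(tuple(round(r, 3) for r in alt_rates(100, 58, 27)))  # (0.27, 0.466)
```

Comparing over the union of parent sets means that a candidate which merely lists an irrelevant extra parent still counts as semantically equal, which matches the intent of excluding syntactic rewrites.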

Table F.13: Alternative discovery before and after Extra Worlds. Original and Extra Worlds columns separate train-exact solution counts, alternative-identification rates among all attempts, and alternative-identification rates among train-exact solutions. The delta column is Extra Worlds minus Original for the rate among attempts. These before/after rates summarize discovered alternatives in the available result pool; they do not prove that no other alternatives exist. CEx then adds separating worlds against surviving discovered alternatives. The * and – notation follows Table [1](https://arxiv.org/html/2605.08197#S6.T1).

Alternative discovery falls both in aggregate and within train-exact responses, so the drop cannot be attributed only to fewer systems solving the training worlds (Table [F.13](https://arxiv.org/html/2605.08197#A6.T13)).

Table F.14: Solved-intersection control for alternative discovery before CEx construction. Restricting to problems solved on both sides tests whether the reduction in discovered alternatives is a solved-set artifact. The drop from Original to Extra Worlds persists within this solved intersection. The * and – notation follows Table [1](https://arxiv.org/html/2605.08197#S6.T1).

The drop in discovered alternatives persists after restricting attention to the solved intersection. The reduction reflects more than a solved-set artifact (Table [F.14](https://arxiv.org/html/2605.08197#A6.T14)).
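The solved-intersection control can be sketched as a simple filter: keep only problems that are train-exact at both support levels, then compare alternative rates on that fixed set. The record layout, function name, and toy data below are invented for illustration and do not reproduce the values in Table F.14.

```python
def solved_intersection_rates(original, extra):
    """Alternative rates restricted to problems solved at both support levels.

    original, extra: dict problem_id -> (train_exact, is_alternative),
    both booleans (hypothetical record layout).
    """
    both = sorted(p for p in original
                  if p in extra and original[p][0] and extra[p][0])
    if not both:
        return None  # nothing to compare on an empty intersection
    rate_original = sum(original[p][1] for p in both) / len(both)
    rate_extra = sum(extra[p][1] for p in both) / len(both)
    return len(both), rate_original, rate_extra


# Toy records: p3 is lost after Extra Worlds, p4 is only solved afterwards.
original = {"p1": (True, True), "p2": (True, False),
            "p3": (True, True), "p4": (False, False)}
extra = {"p1": (True, False), "p2": (True, False),
         "p3": (False, False), "p4": (True, False)}

print(solved_intersection_rates(original, extra))  # (2, 0.5, 0.0)
```

Because the denominator is the same problem set on both sides, a drop in the rate cannot come from a change in which problems were solved.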

Extra Worlds and CEx improve local support and reduce discovered alternatives, while Hidden\-order remains harder than Ordered on matched latent SCMs\.

## Appendix G Supplementary-setting analyses

Appendix G analyzes the two supplementary settings\. Alternative\-SCM tests local editing when a valid SCM is supplied \(Tables G\.1–G\.5\)\. Hidden\-roots separates root\-set prediction from downstream mechanism induction \(Tables G\.6–G\.8\)\.

Table G.1: Alternative-SCM versus paired hidden induction. The comparison between Paired hidden-task correctness and Alternative-SCM joint success quantifies how much performance rises once a valid SCM is supplied. The final column conditions Alternative-SCM success on paired hidden-task failure.

Supplying a valid reference SCM substantially raises performance. Joint success on Alternative-SCM is much higher than performance on the paired hidden-induction task, especially for Hidden-order source problems. Models can often edit a supplied SCM on problems where they failed to infer that SCM from the intervention worlds (Table [G.1](https://arxiv.org/html/2605.08197#A7.T1)).

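Table G.2: Alternative-SCM leave-model-out results. KP-all uses all known-pair alternatives; OM-src excludes alternatives mined from the evaluated model; SM-src uses alternatives mined from the same model; OE uses open-ended cases. Bold marks the best LLM value within each benchmark and source-relation group.

| Benchmark | Model | Source relation | n | Joint success | Alternative success | Experiment valid | Separates | Witness |
|---|---|---|---|---|---|---|---|---|
| Alt-Ord | GPT-5.4 | KP-all | 71 | 0.972 | 0.972 | 1.000 | 0.972 | 0.972 |
| Alt-Ord | GPT-5.4 | OM-src | 54 | 0.963 | 0.963 | 1.000 | 0.963 | 0.963 |
| Alt-Ord | GPT-5.4 | SM-src | 17 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Alt-Ord | GPT-5.4 | OE | 29 | 0.931 | 0.931 | 1.000 | 0.931 | 0.931 |
| Alt-Ord | Opus 4.6 | KP-all | 71 | 0.761 | 0.761 | 1.000 | 0.761 | 0.761 |
| Alt-Ord | Opus 4.6 | OM-src | 51 | 0.765 | 0.765 | 1.000 | 0.765 | 0.765 |
| Alt-Ord | Opus 4.6 | SM-src | 20 | 0.750 | 0.750 | 1.000 | 0.750 | 0.750 |
| Alt-Ord | Opus 4.6 | OE | 29 | 0.655 | 0.655 | 1.000 | 0.655 | 0.655 |
| Alt-Ord | DeepSeek4Pro | KP-all | 71 | 0.789 | 0.789 | 1.000 | 0.789 | 0.789 |
| Alt-Ord | DeepSeek4Pro | OM-src | 71 | 0.789 | 0.789 | 1.000 | 0.789 | 0.789 |
| Alt-Ord | DeepSeek4Pro | OE | 29 | 0.828 | 0.828 | 1.000 | 0.828 | 0.828 |
| Alt-Ord | Gemini3.1 | KP-all | 71 | 0.944 | 0.944 | 1.000 | 0.944 | 0.944 |
| Alt-Ord | Gemini3.1 | OM-src | 68 | 0.941 | 0.941 | 1.000 | 0.941 | 0.941 |
| Alt-Ord | Gemini3.1 | SM-src | 3 | * | * | * | * | * |
| Alt-Ord | Gemini3.1 | OE | 29 | 0.862 | 0.862 | 1.000 | 0.862 | 0.862 |
| Alt-Ord | Grok 4.20 | KP-all | 71 | 0.803 | 0.803 | 1.000 | 0.803 | 0.803 |
| Alt-Ord | Grok 4.20 | OM-src | 70 | 0.800 | 0.800 | 1.000 | 0.800 | 0.800 |
| Alt-Ord | Grok 4.20 | SM-src | 1 | * | * | * | * | * |
| Alt-Ord | Grok 4.20 | OE | 29 | 0.793 | 0.793 | 1.000 | 0.793 | 0.793 |
| Alt-Ord | Grok 4 | KP-all | 71 | 0.930 | 0.930 | 1.000 | 0.930 | 0.930 |
| Alt-Ord | Grok 4 | OM-src | 64 | 0.922 | 0.922 | 1.000 | 0.922 | 0.922 |
| Alt-Ord | Grok 4 | SM-src | 7 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Alt-Ord | Grok 4 | OE | 29 | 0.793 | 0.828 | 1.000 | 0.793 | 0.793 |
| Alt-Ord | Grok 4.3 | KP-all | 71 | 0.479 | 0.479 | 1.000 | 0.479 | 0.479 |
| Alt-Ord | Grok 4.3 | OM-src | 71 | 0.479 | 0.479 | 1.000 | 0.479 | 0.479 |
| Alt-Ord | Grok 4.3 | OE | 29 | 0.276 | 0.310 | 1.000 | 0.276 | 0.276 |
| Alt-Ord | KimiK2t | KP-all | 71 | 0.296 | 0.380 | 0.901 | 0.296 | 0.296 |
| Alt-Ord | KimiK2t | OM-src | 71 | 0.296 | 0.380 | 0.901 | 0.296 | 0.296 |
| Alt-Ord | KimiK2t | OE | 29 | 0.103 | 0.138 | 0.862 | 0.103 | 0.103 |
| Alt-Ord | DSReasoner | KP-all | 71 | 0.042 | 0.070 | 1.000 | 0.042 | 0.042 |
| Alt-Ord | DSReasoner | OM-src | 71 | 0.042 | 0.070 | 1.000 | 0.042 | 0.042 |
| Alt-Ord | DSReasoner | OE | 29 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |
| Alt-Hid | GPT-5.4 | KP-all | 71 | 0.831 | 0.831 | 1.000 | 0.831 | 0.831 |
| Alt-Hid | GPT-5.4 | OM-src | 54 | 0.833 | 0.833 | 1.000 | 0.833 | 0.833 |
| Alt-Hid | GPT-5.4 | SM-src | 17 | 0.824 | 0.824 | 1.000 | 0.824 | 0.824 |
| Alt-Hid | GPT-5.4 | OE | 29 | 0.862 | 0.862 | 1.000 | 0.862 | 0.862 |
| Alt-Hid | Opus 4.6 | KP-all | 71 | 0.704 | 0.704 | 1.000 | 0.704 | 0.704 |
| Alt-Hid | Opus 4.6 | OM-src | 51 | 0.745 | 0.745 | 1.000 | 0.745 | 0.745 |
| Alt-Hid | Opus 4.6 | SM-src | 20 | 0.600 | 0.600 | 1.000 | 0.600 | 0.600 |
| Alt-Hid | Opus 4.6 | OE | 29 | 0.690 | 0.690 | 0.966 | 0.690 | 0.690 |
| Alt-Hid | DeepSeek4Pro | KP-all | 71 | 0.704 | 0.704 | 1.000 | 0.704 | 0.704 |
| Alt-Hid | DeepSeek4Pro | OM-src | 71 | 0.704 | 0.704 | 1.000 | 0.704 | 0.704 |
| Alt-Hid | DeepSeek4Pro | OE | 29 | 0.586 | 0.621 | 1.000 | 0.586 | 0.586 |
| Alt-Hid | Gemini3.1 | KP-all | 71 | 0.930 | 0.958 | 1.000 | 0.944 | 0.930 |
| Alt-Hid | Gemini3.1 | OM-src | 68 | 0.926 | 0.956 | 1.000 | 0.941 | 0.926 |
| Alt-Hid | Gemini3.1 | SM-src | 3 | * | * | * | * | * |
| Alt-Hid | Gemini3.1 | OE | 29 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Alt-Hid | Grok 4.20 | KP-all | 71 | 0.747 | 0.747 | 1.000 | 0.747 | 0.747 |
| Alt-Hid | Grok 4.20 | OM-src | 70 | 0.757 | 0.757 | 1.000 | 0.757 | 0.757 |
| Alt-Hid | Grok 4.20 | SM-src | 1 | * | * | * | * | * |
| Alt-Hid | Grok 4.20 | OE | 29 | 0.690 | 0.690 | 1.000 | 0.690 | 0.690 |
| Alt-Hid | Grok 4 | KP-all | 71 | 0.915 | 0.915 | 1.000 | 0.915 | 0.915 |
| Alt-Hid | Grok 4 | OM-src | 64 | 0.922 | 0.922 | 1.000 | 0.922 | 0.922 |
| Alt-Hid | Grok 4 | SM-src | 7 | 0.857 | 0.857 | 1.000 | 0.857 | 0.857 |
| Alt-Hid | Grok 4 | OE | 29 | 0.759 | 0.759 | 1.000 | 0.759 | 0.759 |
| Alt-Hid | Grok 4.3 | KP-all | 71 | 0.408 | 0.422 | 1.000 | 0.408 | 0.408 |
| Alt-Hid | Grok 4.3 | OM-src | 71 | 0.408 | 0.422 | 1.000 | 0.408 | 0.408 |
| Alt-Hid | Grok 4.3 | OE | 29 | 0.241 | 0.310 | 1.000 | 0.241 | 0.241 |
| Alt-Hid | KimiK2t | KP-all | 71 | 0.211 | 0.239 | 0.915 | 0.211 | 0.211 |
| Alt-Hid | KimiK2t | OM-src | 71 | 0.211 | 0.239 | 0.915 | 0.211 | 0.211 |
| Alt-Hid | KimiK2t | OE | 29 | 0.103 | 0.103 | 0.828 | 0.103 | 0.103 |
| Alt-Hid | DSReasoner | KP-all | 71 | 0.042 | 0.042 | 1.000 | 0.042 | 0.042 |
| Alt-Hid | DSReasoner | OM-src | 71 | 0.042 | 0.042 | 1.000 | 0.042 | 0.042 |
| Alt-Hid | DSReasoner | OE | 29 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |

This increase extends beyond models recognizing alternatives sourced from the same model. The leave-model-out rows remain high for the strongest frontier models, so local editing with a supplied SCM remains strong even when same-model sourced alternatives are excluded (Table G.2).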

Table G.3: Alternative-SCM intervention quality. High joint success is usually paired with strong separating interventions, indicating that successful models often provide structured witnesses.

Successful Alternative-SCM outputs are usually paired with strong separating interventions. The setting tests more than the existence of an alternative: it also tests whether the model can separate that alternative efficiently once an SCM is supplied (Table [G.3](https://arxiv.org/html/2605.08197#A7.T3)).

The overlap analysis shows how much the supplied SCM helps. Successful alternatives usually remain close to the reference SCM in both semantic parent structure and local semantics, and they are overwhelmingly one-variable edits rather than global rewrites. Supplying the SCM removes most of the structure search (Table [G.4](https://arxiv.org/html/2605.08197#A7.T4)).

Table G.4: Alternative-SCM reference overlap and edit locality. Parent-overlap and local-semantic columns compare the model-supplied alternative with the provided reference SCM over valid executable alternatives. The edit-type columns condition on alternative success and separate one-variable same-parent edits from one-variable parent-changing edits.

Table G.5: Alternative-SCM edit depth relative to the provided reference SCM. Panel A reports per-variable change rates on the alternative-success subset. Panel B conditions on changed variables and shows where successful edits concentrate by reference depth.

The local edits cluster by depth in the supplied SCM. Early endogenous layers remain the most stable part of the supplied SCM, whereas most successful changes occur at depth 2 and deeper. In Alternative-SCM (Hidden-order), the remaining flexibility shifts further downstream and more often preserves the parent set while changing only the local rule. In Alternative-SCM (Ordered), successful edits are somewhat shallower and more often change the parent set itself (Table [G.5](https://arxiv.org/html/2605.08197#A7.T5)).

Hidden\-roots separates two kinds of error\. Once the root set is hidden, the model must first identify the correct roots and then produce executable mechanisms for the remaining variables\.

Table G.6: Hidden-roots breakdown. RootExact, mechanism train exactness, and mechanism held-out replay are reported separately so that root-set prediction is not conflated with downstream mechanism induction. Mechanism HeldoutExact is strict: it requires exact mechanism training replay and exact replay of every held-out world.

Root-set prediction is already hard, and exact mechanism induction remains weak even on the subset where the root set is correct (Table [G.6](https://arxiv.org/html/2605.08197#A7.T6)).

Panel A: Overall rates and RootExact-conditioned replay

| Model | n | RootExact | n (RootExact) | TrainExact \| RootExact | HeldoutWorld \| RootExact | HeldoutExact \| RootExact |
|---|---|---|---|---|---|---|
| GPT-5.4 | 100 | 0.240 | 24 | 0.333 | 0.307 | 0.083 |
| Opus 4.6 | 100 | 0.240 | 24 | 0.083 | 0.260 | 0.000 |
| DeepSeek4Pro | 100 | 0.360 | 36 | 0.056 | 0.195 | 0.028 |
| Gemini3.1 | 100 | 0.340 | 34 | 0.118 | 0.290 | 0.088 |
| Grok 4.20 | 100 | 0.390 | 39 | 0.128 | 0.215 | 0.128 |
| Grok 4 | 100 | 0.300 | 30 | 0.133 | 0.192 | 0.100 |
| Grok 4.3 | 100 | 0.170 | 17 | 0.176 | 0.243 | 0.176 |
| KimiK2t | 100 | 0.060 | 6 | 0.000 | 0.062 | 0.000 |
| DSReasoner | 100 | 0.040 | 4 | * | * | * |

Panel B: Root-wrong cases and residual failures

| Model | TrainExact \| root-wrong | HeldoutWorld \| root-wrong | Train failures | Held-out failures |
|---|---|---|---|---|
| GPT-5.4 | 0.000 | 0.060 | 16 | 6 |
| Opus 4.6 | 0.000 | 0.096 | 22 | 2 |
| DeepSeek4Pro | 0.000 | 0.018 | 34 | 1 |
| Gemini3.1 | 0.000 | 0.026 | 30 | 1 |
| Grok 4.20 | 0.016 | 0.096 | 34 | 0 |
| Grok 4 | 0.014 | 0.066 | 26 | 1 |
| Grok 4.3 | 0.000 | 0.022 | 14 | 0 |
| KimiK2t | 0.000 | 0.011 | 6 | 0 |
| DSReasoner | 0.000 | 0.001 | 4 | 0 |

Table G.7: Mechanism replay after exact root-set prediction. The all-problems columns give the total number of problems and RootExact. The RootExact-conditioned columns condition on exact root-set prediction, and the root-wrong columns condition on an incorrect root set. Failure counts report RootExact cases that fail mechanism training fit and RootExact TrainExact cases that fail strict held-out exactness. The * and – notation follows Table [1](https://arxiv.org/html/2605.08197#S6.T1).

Even when the root set is correct, many submissions still make downstream mechanism errors (Table [G.7](https://arxiv.org/html/2605.08197#A7.T7)).
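The nesting behind these columns can be made concrete with a short sketch. It assumes per-problem boolean flags for exact root-set prediction, exact training replay, and strict held-out exactness; the record layout and function name are hypothetical. The synthetic counts (24 root-exact, 8 train-exact, 2 held-out-exact out of 100) are chosen to be consistent with the rates and failure counts reported for GPT-5.4 in Table G.7 and are not taken from released per-item data.

```python
def hidden_roots_breakdown(records):
    """Nested Hidden-roots rates from per-problem boolean flags.

    records: list of dicts with keys root_exact, train_exact, heldout_exact
    (hypothetical layout). Strict held-out exactness presupposes exact
    training replay, which presupposes an exact root set.
    """
    n = len(records)
    root_ok = [r for r in records if r["root_exact"]]
    train_ok = [r for r in root_ok if r["train_exact"]]
    heldout_ok = [r for r in train_ok if r["heldout_exact"]]
    return {
        "root_exact": len(root_ok) / n,
        "train_exact_given_root": len(train_ok) / len(root_ok) if root_ok else None,
        "heldout_exact_given_root": len(heldout_ok) / len(root_ok) if root_ok else None,
        # RootExact cases that fail the mechanism training fit.
        "train_failures": len(root_ok) - len(train_ok),
        # RootExact and TrainExact cases that fail strict held-out exactness.
        "heldout_failures": len(train_ok) - len(heldout_ok),
    }


def rec(root, train, heldout):
    return {"root_exact": root, "train_exact": train, "heldout_exact": heldout}


records = ([rec(True, True, True)] * 2 + [rec(True, True, False)] * 6
           + [rec(True, False, False)] * 16 + [rec(False, False, False)] * 76)

summary = hidden_roots_breakdown(records)
print({k: round(v, 3) for k, v in summary.items()})
# {'root_exact': 0.24, 'train_exact_given_root': 0.333,
#  'heldout_exact_given_root': 0.083, 'train_failures': 16, 'heldout_failures': 6}
```

Reading the two failure counts together shows where submissions are lost: most root-exact cases already fail the training fit, and a further share of the train-exact cases fails strict held-out replay.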

Panel A: Root-set exactness and structural decomposition

| Model | n | Valid | RootExact | n (RootExact + valid) | Parent F1 \| RootExact + valid | Exact parent map \| RootExact + valid |
|---|---|---|---|---|---|---|
| GPT-5.4 | 100 | 0.990 | 0.240 | 24 | 0.690 | 0.083 |
| Opus 4.6 | 100 | 0.930 | 0.240 | 24 | 0.653 | 0.000 |
| DeepSeek4Pro | 100 | 0.960 | 0.360 | 34 | 0.589 | 0.029 |
| Gemini3.1 | 100 | 1.000 | 0.340 | 34 | 0.661 | 0.118 |
| Grok 4.20 | 100 | 0.990 | 0.390 | 39 | 0.657 | 0.128 |
| Grok 4 | 100 | 1.000 | 0.300 | 30 | 0.565 | 0.100 |
| Grok 4.3 | 100 | 0.970 | 0.170 | 17 | 0.503 | 0.176 |
| KimiK2t | 100 | 0.980 | 0.060 | 6 | 0.346 | 0.000 |
| DSReasoner | 100 | 1.000 | 0.040 | 4 | * | * |

Panel B: Mechanism replay after exact root-set prediction

| Model | TrainExact \| RootExact + valid | HeldoutWorld \| RootExact + TrainExact |
|---|---|---|
| GPT-5.4 | 0.333 | 0.734 |
| Opus 4.6 | 0.083 | * |
| DeepSeek4Pro | 0.059 | * |
| Gemini3.1 | 0.118 | * |
| Grok 4.20 | 0.128 | * |
| Grok 4 | 0.133 | * |
| Grok 4.3 | 0.176 | * |
| KimiK2t | 0.000 | – |
| DSReasoner | * | – |

Table G\.8: Nested Hidden\-roots breakdown\. RootExact separates root\-set prediction from downstream mechanism induction\. Parent F1, Exact parent map, and TrainExact are conditioned on submissions that are both RootExact and valid; HeldoutWorld is further conditioned on exact training fit\. The \* and – notation follows Table [1](https://arxiv.org/html/2605.08197#S6.T1)\.

Nested conditioning on root\-exact and valid submissions leads to the same qualitative conclusion: residual structural and mechanism errors remain even after the root set is correct \(Table [G\.8](https://arxiv.org/html/2605.08197#A7.T8)\)\.

Taken together, root\-set prediction and downstream mechanism induction are separate sources of error in Hidden\-roots \(Tables [G\.6](https://arxiv.org/html/2605.08197#A7.T6)–[G\.8](https://arxiv.org/html/2605.08197#A7.T8)\)\.

## Appendix H Non\-LLM calibration rows

The non\-LLM rows are fixed benchmark\-specific procedures, distinct from off\-the\-shelf end\-to\-end causal discovery systems\. The symbolic exact\-search baseline searches directly in the benchmark SCM language\. The bnlearn\+DSL baseline uses bnlearn for structure proposal and then invokes the same exact Boolean fitter used by the symbolic pipeline\. Both procedures receive only the training worlds and the same revealed\-structure fields available to the corresponding LLM prompt, and both are scored by the same evaluator as the LLM submissions\.

These rows show which parts of exact executable induction remain difficult under fixed symbolic or hybrid search budgets\. They are not a comprehensive comparison to all causal\-discovery or program\-synthesis methods\.

Table H\.1: Non\-LLM baseline protocol summary\. Both procedures are fixed calibration rows evaluated under the same final\-object contract as the LLM submissions; they are not budget\-matched direct\-generation LLM baselines\.

##### Symbolic exact\-search baseline\.

The symbolic exact\-search baseline treats each item as an exact finite\-world synthesis problem in the benchmark DSL\. A candidate consists of a root set, an acyclic functional\-parent graph, and one Boolean mechanism for every predicted endogenous variable\. In Ordered, parent sets are restricted to earlier variables in the disclosed order\. In Block\-order, the search respects block precedence and searches admissible within\-block dependencies\. In Hidden\-order, it searches over acyclic orders and parent assignments\. In Hidden\-roots, it also enumerates root\-set hypotheses before mechanism fitting\.
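A minimal sketch of the Ordered case may help make the search concrete. It is a simplification under stated assumptions, not the paper's baseline: it fits a raw truth table instead of synthesizing a DSL formula, it picks the smallest consistent parent set, and the world format (`"do"` for the intervened assignment, `"observed"` for the full binary world) and the `max_parents` cap are hypothetical.

```python
from itertools import combinations


def fit_truth_table(var, parents, worlds):
    """Return a parent-pattern -> value table, or None if the rows conflict."""
    table = {}
    for world in worlds:
        if var in world["do"]:
            continue  # an intervened variable says nothing about its mechanism
        key = tuple(world["observed"][p] for p in parents)
        value = world["observed"][var]
        if table.setdefault(key, value) != value:
            return None  # same parent pattern, different outputs
    return table


def search_ordered(order, roots, worlds, max_parents=3):
    """For each endogenous variable, find the smallest consistent parent set
    drawn only from variables that come earlier in the disclosed order."""
    solution = {}
    for i, var in enumerate(order):
        if var in roots:
            continue
        solution[var] = None
        for k in range(min(i, max_parents) + 1):
            fits = ((ps, fit_truth_table(var, ps, worlds))
                    for ps in combinations(order[:i], k))
            first = next((f for f in fits if f[1] is not None), None)
            if first is not None:
                solution[var] = first
                break
    return solution


# Toy item (assumption): X3 = X1 xor X2, four observational worlds.
worlds = [
    {"do": {}, "observed": {"X1": a, "X2": b, "X3": a ^ b}}
    for a in (0, 1) for b in (0, 1)
]
found = search_ordered(["X1", "X2", "X3"], {"X1", "X2"}, worlds)
```

In the toy item no constant and no single parent explains `X3`, so the search settles on the pair `("X1", "X2")`. The Hidden-order and Hidden-roots settings would wrap this loop in an outer enumeration over orders or root-set hypotheses.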

The search is run as a staged portfolio with the budgets shown in Table [H\.1](https://arxiv.org/html/2605.08197#A8.T1)\. Each accepted candidate must replay all training worlds exactly under the benchmark evaluator\. When multiple candidates satisfy the training worlds, the baseline selects the first train\-exact executable candidate found by the fixed staged ordering\. If no exact candidate is found within budget, it emits a schema\-correct failure object, which is then scored by the same evaluator\.
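The selection rule can be sketched as a loop over stages. Everything concrete here is an assumption made for illustration: budgets are expressed as a maximum number of candidates per stage, and the keys of the returned dictionaries (including the failure object) are hypothetical, since the benchmark's actual schema is not given in this section.

```python
from itertools import islice


def run_portfolio(stages, is_train_exact):
    """stages: list of (name, candidate_iterable, budget) in fixed order.
    Returns the first train-exact candidate, or a failure object."""
    for name, candidates, budget in stages:
        for candidate in islice(candidates, budget):
            if is_train_exact(candidate):
                return {"status": "solved", "stage": name, "candidate": candidate}
    # Hypothetical failure object; it would still be passed to the evaluator.
    return {"status": "no_exact_candidate", "stage": None, "candidate": None}


# Toy usage with integers standing in for candidate SCMs.
stages = [("small", [1, 2, 3], 2), ("large", [4, 5, 6, 7], 3)]
hit = run_portfolio(stages, lambda c: c % 5 == 0)
miss = run_portfolio(stages, lambda c: c == 3)
```

In the toy run, candidate `3` is never examined because the first stage's budget is exhausted after two candidates, which is why the second call returns the failure object.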

##### bnlearn\+DSL baseline\.

The bnlearn\+DSL baseline separates structure proposal from mechanism fitting\. It first converts the training worlds into the data representation used by bnlearn and runs mbde\-scored tabu search and hill climbing under the budgets in Table [H\.1](https://arxiv.org/html/2605.08197#A8.T1)\. The structural search respects the disclosure condition: full order in Ordered, block constraints in Block\-order, no disclosed endogenous order in Hidden\-order, and explicit root\-set hypotheses in Hidden\-roots\.

The learned graph is an intermediate structure proposal\. Each learned graph, or the parent proposals derived from the graph ensemble, is passed to the shared exact Boolean fitter, which searches for benchmark\-DSL mechanisms over the proposed parents\. The fitter uses the same operator set and fitting budgets shown in Table [H\.1](https://arxiv.org/html/2605.08197#A8.T1)\. The final candidate is selected by the fixed stage order and then scored by the same evaluator as every other system\.

Table H\.2 separates full task correctness, training replay, and held\-out replay\.

Table H\.2: Non\-LLM calibration\-row breakdown\. The rows separate full task correctness, training replay, and held\-out replay for the fixed symbolic exact\-search and bnlearn\+DSL procedures\. TaskCorrect is full task correctness under the benchmark evaluator; on Hidden\-roots it differs from TrainExact because root\-set prediction is part of the task\. HeldoutWorld is the unconditioned mean held\-out world replay rate\. HeldoutExact is strict: it requires exact training replay and exact replay of every held\-out world\.
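The relationship between the three replay quantities can be illustrated per item with a small sketch. It assumes a hypothetical representation that is not taken from the benchmark code: mechanisms are Python callables keyed by variable, and each world records its root values, its intervention (`"do"`), and the fully observed outcome. The per-item `HeldoutWorld` value below would be averaged over items to give the reported rate.

```python
def replay(mechanisms, topo_order, roots, do):
    """Evaluate an SCM in topological order; interventions override everything."""
    values = {}
    for var in topo_order:
        if var in do:
            values[var] = do[var]
        elif var in roots:
            values[var] = roots[var]
        else:
            parents, fn = mechanisms[var]
            values[var] = fn(*(values[p] for p in parents))
    return values


def score(mechanisms, topo_order, train, heldout):
    def exact(world):
        predicted = replay(mechanisms, topo_order, world["roots"], world["do"])
        return predicted == world["observed"]

    train_exact = all(exact(w) for w in train)
    heldout_world = sum(exact(w) for w in heldout) / len(heldout)
    return {
        "TrainExact": train_exact,
        "HeldoutWorld": heldout_world,
        # Strict: exact training replay AND every held-out world exact.
        "HeldoutExact": train_exact and heldout_world == 1.0,
    }


# Toy item (assumption): the truth is X3 = X1 and X2, the submission says X3 = X2.
submitted = {"X3": (("X2",), lambda x2: x2)}
order = ["X1", "X2", "X3"]
train = [
    {"roots": {"X1": 1, "X2": 1}, "do": {}, "observed": {"X1": 1, "X2": 1, "X3": 1}},
    {"roots": {"X1": 1, "X2": 0}, "do": {}, "observed": {"X1": 1, "X2": 0, "X3": 0}},
]
heldout = [
    {"roots": {"X1": 1, "X2": 1}, "do": {"X1": 0}, "observed": {"X1": 0, "X2": 1, "X3": 0}},
    {"roots": {"X1": 0, "X2": 0}, "do": {}, "observed": {"X1": 0, "X2": 0, "X3": 0}},
]
result = score(submitted, order, train, heldout)
```

The toy submission is train-exact yet replays only one of the two held-out worlds, so it earns partial `HeldoutWorld` credit but fails `HeldoutExact`.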

## Appendix I Illustrative case studies

These case studies illustrate three recurrent failure modes already visible in the aggregate results\. The first shows a hidden\-structure shortcut: revealing order removes a plausible but incorrect endogenous\-parent substitution\. The second shows a train\-exact but mechanistically wrong solution that fails only when a new parent configuration is reached\. The third shows that the same model can edit a supplied SCM successfully on a paired source problem where it failed to infer that SCM from interventions alone\.

##### Case 1: Revealed order rules out a downstream surrogate\.

In one paired item, GPT\-5\.4 solves the Ordered version exactly, with TrainWorldExact and HeldoutWorldExact both 1\.000\. On the corresponding Hidden\-order version, the submitted SCM is still executable but no longer train\-exact: TrainWorldExact falls to 0\.556 and HeldoutWorldExact to 0\.625\. The key difference is causal admissibility\. When the true order is revealed, X1 is known to be downstream of X2, so it cannot be used as an input to X2\. When order is hidden, the model uses X1 as a plausible surrogate for the root/input signal X3\. The surrogate is correlated enough to look reasonable on many rows, but it is causally backwards\.
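The admissibility constraint that a disclosed order imposes is easy to state in code. The sketch below is illustrative only: the helper name is hypothetical, and the three-variable order is an assumed arrangement consistent with the description (X3 as a root, X1 downstream of X2), not the item's actual order.

```python
def inadmissible_parents(order, parent_map):
    """Return (parent, child) pairs where the parent does not precede the child."""
    position = {var: i for i, var in enumerate(order)}
    return [
        (parent, child)
        for child, parents in parent_map.items()
        for parent in parents
        if position[parent] >= position[child]
    ]


order = ["X3", "X2", "X1"]                    # assumed disclosed order
surrogate = {"X2": ["X1"], "X1": ["X2"]}      # X2 reads its own descendant
admissible = {"X2": ["X3"], "X1": ["X2"]}     # X2 reads the root signal

violations = inadmissible_parents(order, surrogate)
clean = inadmissible_parents(order, admissible)
```

With the order disclosed, the surrogate parent map is rejected before any fitting happens; with the order hidden, no such filter is available and the backwards edge can only be exposed by the evidence.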

##### Case 2: Train\-exact replay can still miss a truth\-table corner\.

In a Hidden\-order item, GPT\-5\.4 returns a valid SCM that is fully train\-exact: TrainWorldExact is 1\.000 and cell\-level training accuracy is 1\.000\. It nevertheless fails strict held\-out exactness, with HeldoutWorldExact 0\.750 and cell\-level held\-out accuracy 0\.970\. The failure is concentrated in one local mechanism\. The submitted X5 agrees with every scored training row for X5, but the held\-out intervention reaches a parent assignment that the training worlds never query\.
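One way to see this kind of gap is to list which parent assignments of a variable the training worlds actually exercise. The sketch below is an assumption-laden illustration: the coverage definition (fraction of the 2^k parent patterns seen in rows where the variable is not intervened on) is a plausible reading and may differ from the paper's exact statistic, and the parents chosen for X5 and the training rows are invented.

```python
from itertools import product


def seen_patterns(var, parents, worlds):
    """Parent assignments observed in worlds where var follows its mechanism."""
    return {
        tuple(w["observed"][p] for p in parents)
        for w in worlds
        if var not in w["do"]
    }


def coverage(var, parents, worlds):
    return len(seen_patterns(var, parents, worlds)) / 2 ** len(parents)


def unseen_patterns(var, parents, worlds):
    seen = seen_patterns(var, parents, worlds)
    return [bits for bits in product((0, 1), repeat=len(parents)) if bits not in seen]


# Toy training worlds (assumption): X5 has parents X2 and X4.
train = [
    {"do": {}, "observed": {"X2": 0, "X4": 0, "X5": 0}},
    {"do": {}, "observed": {"X2": 0, "X4": 1, "X5": 1}},
    {"do": {"X2": 1}, "observed": {"X2": 1, "X4": 1, "X5": 1}},
    {"do": {"X5": 0}, "observed": {"X2": 1, "X4": 0, "X5": 0}},  # X5 forced: not evidence
]
cov = coverage("X5", ("X2", "X4"), train)
gaps = unseen_patterns("X5", ("X2", "X4"), train)
```

Any mechanism for `X5` is unconstrained on the patterns in `gaps`, so two train-exact submissions can disagree exactly there. The last toy world is excluded because `X5` itself is intervened on.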

##### Case 3: A supplied SCM changes the task\.

For a linked Hidden\-order and Alternative\-SCM pair, GPT\-5\.4 fails the discovery version but succeeds once a valid reference SCM is supplied\. On Hidden\-order, the model returns an executable but non\-exact SCM: TrainWorldExact is 0\.444 and HeldoutWorldExact is 0\.250\. In Alternative\-SCM, the same model constructs a valid, train\-exact, semantically distinct alternative together with a separating intervention and witness\. This mirrors the main\-table pattern: editing an executable causal object is much easier once the object is already supplied\.

##### Case 4: Short formulas can still fail\.

AST size is useful for diagnosing behavior, but smaller is not automatically better\. In an Ordered item, GPT\-5\.4 returns a valid train\-exact SCM whose total AST size is 18, compared with 28 for the gold mechanism map\. The submission is almost a compressed version of the gold map: X1 is equivalent up to reordering, X5 is a Boolean simplification of the gold expression, and X7 agrees on the training worlds\. The problem is X2\. The model replaces a three\-parent mechanism with the much shorter \(or X6 X7\), fits every training world, and then fails on a held\-out intervention\.
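For readers who want to reproduce a size measurement of this kind, the sketch below parses prefix expressions such as `(or X6 X7)` and counts nodes. The counting convention (one unit per operator and one per variable occurrence) is an assumption, since the paper's exact AST-size definition is not stated in this section, and the three-parent formula is a hypothetical stand-in, not the gold mechanism.

```python
def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one prefix expression into nested lists: [op, arg, arg, ...]."""
    tok = tokens.pop(0)
    if tok != "(":
        return tok  # a variable name or an operator symbol
    node = []
    while tokens[0] != ")":
        node.append(parse(tokens))
    tokens.pop(0)  # consume ")"
    return node


def ast_size(node):
    """Assumed convention: 1 per operator node plus 1 per leaf."""
    if isinstance(node, str):
        return 1
    return 1 + sum(ast_size(child) for child in node[1:])


short = parse(tokenize("(or X6 X7)"))
longer = parse(tokenize("(or (and X3 X6) (not X7))"))  # hypothetical three-parent formula
sizes = (ast_size(short), ast_size(longer))
```

A total AST size for a mechanism map would then be the sum of `ast_size` over all endogenous variables.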

##### Case 5: Train\-exact bloat can overfit observed worlds\.

The opposite failure also appears\. In another Ordered item, GPT\-5\.4 returns a valid train\-exact SCM with total AST size 95, while the gold map has total AST size 23\. The submission keeps some simple equations exactly, but inflates others into DNF\-like clauses with many negated guards and extra parents\. This achieves training fit at the expense of robustness: HeldoutWorldExact is 0\.375 and cell\-level held\-out accuracy is 0\.848\. The first held\-out mismatch already shows the problem\. The gold X1 is the simple mechanism \(xor X2 X5\), while the submitted X1 adds a spurious dependence on X6\.

##### Case 6: Long formulas can still replay correctly\.

A final case illustrates why the evaluator prioritizes replay behavior over formula style\. In one Ordered item, Grok 4 returns a much larger valid SCM: total AST 100 versus 39 for the gold map\. Several local mechanisms are expanded into longer case\-like formulas, and the submitted map uses 22 parent references, compared with 19 in the gold map\. Unlike the previous case, however, the submitted SCM is exact on both training and held\-out worlds: TrainWorldExact, HeldoutWorldExact, and cell\-level held\-out accuracy are all 1\.000\. This leaves global equivalence beyond the benchmark worlds unresolved, but it shows the intended scoring principle\. The benchmark requires an executable causal object that survives replay, independent of gold\-string matching or formula length\.
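The difference between agreeing on the replayed worlds and being globally equivalent can be made concrete with a small truth-table check. This is a sketch under assumptions: the operator set (`and`, `or`, `xor`, `not`) is assumed, since only `or` and `xor` appear in the examples above, and both alternative formulas are hypothetical. Only `(xor X2 X5)` comes from the text.

```python
from itertools import product

OPS = {
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "not": lambda a: 1 - a,
}


def parse(src):
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()

    def walk():
        tok = tokens.pop(0)
        if tok != "(":
            return tok
        node = []
        while tokens[0] != ")":
            node.append(walk())
        tokens.pop(0)
        return node

    return walk()


def evaluate(node, env):
    if isinstance(node, str):
        return env[node]
    return OPS[node[0]](*(evaluate(child, env) for child in node[1:]))


def variables(node):
    if isinstance(node, str):
        return {node}
    return set().union(*(variables(child) for child in node[1:]))


def agree_on(f, g, envs):
    """Behavioral agreement on a finite set of assignments (what replay checks)."""
    return all(evaluate(f, env) == evaluate(g, env) for env in envs)


def equivalent(f, g):
    """Global equivalence: agreement on every assignment of the joint variables."""
    names = sorted(variables(f) | variables(g))
    envs = [dict(zip(names, bits)) for bits in product((0, 1), repeat=len(names))]
    return agree_on(f, g, envs)


gold = parse("(xor X2 X5)")
verbose = parse("(or (and X2 (not X5)) (and (not X2) X5))")      # longer, same behavior
bloated = parse("(or (and (xor X2 X5) (not X6)) (and X2 X6))")   # spurious X6 guard

# Worlds in which X6 happens to be 0 cannot expose the spurious dependence.
x6_off = [{"X2": a, "X5": b, "X6": 0} for a in (0, 1) for b in (0, 1)]
checks = (
    equivalent(gold, verbose),
    agree_on(gold, bloated, x6_off),
    equivalent(gold, bloated),
)
```

The `verbose` formula is longer than the gold one but globally equivalent to it, whereas `bloated` matches the gold mechanism on every world with `X6 = 0` and diverges once `X6 = 1` is reached. Replay on finite worlds corresponds to `agree_on`, which is why exact replay alone does not settle `equivalent`.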
