Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning
Summary
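This paper proposes and evaluates a class of world models for causal reasoning called event-graph substrates. The models answer counterfactual queries by deterministic replay over typed RDF event logs, outperform baseline models on several benchmarks, and guarantee inspectability and replay consistency.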
# Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning
Source: [https://arxiv.org/html/2605.15967](https://arxiv.org/html/2605.15967)
###### Abstract
We study a class of world models for agentic systems that represent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log at a chosen tick under a structured intervention vocabulary. We refer to this class as *event-graph substrates*. Substrates are inspectable at the level of individual triples, support exact counterfactuals over arbitrary interventions on the typed state, and transfer across domains without learned components.
We make three contributions. First, we present a formal definition of event-graph substrates with deterministic replay and intervention semantics, and characterize the conditions under which counterfactual queries reduce to graph-theoretic operations on the observed event log. We prove a duality between explanatory queries (“which observed event caused $E$?”) and counterfactual queries (“which observed events would not occur if object $X$ were absent?”), showing that under closed-event assumptions both are answered by the same causal-ancestor traversal. Second, we evaluate a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime on CLEVRER (Yi et al., [2020](https://arxiv.org/html/2605.15967#bib.bib1)), the canonical video causal-reasoning benchmark, at full validation scale (n=75,618 questions). The substrate exceeds the published symbolic-oracle baseline (NS-DR; Yi et al., [2020](https://arxiv.org/html/2605.15967#bib.bib1)) by 9.89 percentage points on descriptive, 20.26 on explanatory per-question, 17.65 on counterfactual per-question, and 0.80 on predictive per-question. Against the parametric attention-based baseline ALOE (Ding et al., [2021](https://arxiv.org/html/2605.15967#bib.bib9)), the substrate is ahead on descriptive and explanatory per-question accuracy but lags on predictive and counterfactual per-question accuracy, where ALOE’s learned dynamics distribution provides an advantage that the substrate’s closed-form kinematic projection does not match. Third, we introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark for evaluating agent memory consistency under intervention.
On this benchmark the substrate exceeds Llama-3.1-8B prompted with full context by 18.80 percentage points on joint accuracy and exceeds a Park/Concordia-style LLM-driven simulator (Park et al., [2023](https://arxiv.org/html/2605.15967#bib.bib5); Vezhnevets et al., [2023](https://arxiv.org/html/2605.15967#bib.bib6)) by 65 percentage points.
Together these results indicate that counterfactual world modeling can be implemented by deterministic replay over typed event deltas rather than learned latent simulation, with formal guarantees on inspectability and replay consistency\. We characterize the regimes where this approach is competitive with parametric world models \(closed\-event reasoning, exact intervention semantics\) and where it lags \(long\-horizon prediction under learned dynamics, hidden property inference\)\.
## 1 Introduction
Figure 1: Substrate world model vs parametric world model. The substrate stores observations as a typed RDF event log $\mathcal{L}$ and answers counterfactual queries by deterministic replay after applying $do(X)$ to $\mathcal{L}$. Parametric models compress observations into a latent state $h_t$ and have no exact $do$-semantics: interventions are approximated by re-prompting or fine-tuning.

Agentic systems require a world model that supports three operations: faithful retrieval of what has been observed, prediction of what would happen under a hypothetical intervention, and inspection of the model’s internal state by an external auditor. Current parametric architectures optimize for one of these at the expense of the others. Latent dynamics models such as Dreamer-V3 (Hafner et al., [2023](https://arxiv.org/html/2605.15967#bib.bib7)) learn imagination rollouts for reinforcement learning, but the rollouts are sampled from a learned distribution and cannot be replayed deterministically. Video joint-embedding predictors such as V-JEPA-2 (Assran et al., [2025](https://arxiv.org/html/2605.15967#bib.bib8)) learn high-quality video features but do not expose a queryable state. Generative agents (Park et al., [2023](https://arxiv.org/html/2605.15967#bib.bib5)) expose a queryable state through a language-model summary, but the state is not replayable and the summary drifts under repeated queries.
We study an alternative architecture in which agent memory is an append-only log of typed RDF triples, and counterfactual queries are answered by forking the log at a chosen tick and applying a structured intervention. We refer to this class of architectures as event-graph substrates. The class is not new in spirit; it inherits from structural causal models (Pearl, [2009](https://arxiv.org/html/2605.15967#bib.bib20)), neuro-symbolic visual reasoning (Yi et al., [2018](https://arxiv.org/html/2605.15967#bib.bib11), [2020](https://arxiv.org/html/2605.15967#bib.bib1); Mao et al., [2019](https://arxiv.org/html/2605.15967#bib.bib12)), and typed knowledge graphs. Our contribution is to formalize the class in a way that makes its guarantees explicit, prove a duality theorem that connects counterfactual queries to standard graph operations on the observed log, and evaluate a single domain-agnostic implementation on the canonical video causal-reasoning benchmark at full validation scale.
We organize the paper as follows. Section [2](https://arxiv.org/html/2605.15967#S2) formalizes event-graph substrates with deterministic replay and intervention semantics. Section [3](https://arxiv.org/html/2605.15967#S3) proves the ancestor-duality theorem and characterizes its complexity. Section [4](https://arxiv.org/html/2605.15967#S4) reports the substrate’s performance on CLEVRER. Section [5](https://arxiv.org/html/2605.15967#S5) demonstrates cross-domain transfer on ComPhy (Chen et al., [2022](https://arxiv.org/html/2605.15967#bib.bib2)), GQA (Hudson and Manning, [2019](https://arxiv.org/html/2605.15967#bib.bib4)), and the new twin-EventLog benchmark. Section [6](https://arxiv.org/html/2605.15967#S6) presents ablation studies on each algorithmic component. Section [7](https://arxiv.org/html/2605.15967#S7) positions the work against prior neuro-symbolic reasoners, structured world models, and agent memory architectures. Section [8](https://arxiv.org/html/2605.15967#S8) discusses limitations and Section [9](https://arxiv.org/html/2605.15967#S9) concludes.
#### Summary of empirical findings\.
On CLEVRER at full validation scale, a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime exceeds the published symbolic-oracle baseline NS-DR (Yi et al., [2020](https://arxiv.org/html/2605.15967#bib.bib1)) on every per-question subset: descriptive by 9.89 percentage points (97.99 versus 88.1), explanatory by 20.26 percentage points (99.86 versus 79.6), counterfactual by 17.65 percentage points (59.85 versus 42.2), and predictive by 0.80 percentage points (69.50 versus 68.7). Against the parametric ALOE baseline (Ding et al., [2021](https://arxiv.org/html/2605.15967#bib.bib9)), the substrate is ahead on descriptive (97.99 versus 94.0) and explanatory (99.86 versus 96.0) per-question accuracy but lags on predictive (69.50 versus 87.5) and counterfactual (59.85 versus 75.6) per-question accuracy. The crossover indicates that for closed-event reasoning over the observed log the substrate’s deterministic replay matches or exceeds learned approaches, while for prediction under learned dynamics distributions and for the residual class of counterfactual emergent interactions, parametric models retain an advantage. In a controlled comparison on a 300-question descriptive subset, Llama-3.1-8B receives the same event log serialized as natural language and answers with grammar-constrained output; it reaches 51.00 percent (Wilson 95 percent CI [45.37, 56.61]), which is 46.99 percentage points below the substrate’s 97.99 percent on the full validation split. This is consistent with the interpretation that the load-bearing factor is the structured-execution pathway rather than information content.
## 2 Substrate definition
We define an event-graph substrate as a tuple

$$\mathcal{S} = (\mathcal{T}, \mathcal{A}_0, \mathcal{L}, \rho, \mathcal{I})$$

where $\mathcal{T}$ is a TBox of typed axioms over a fixed vocabulary, $\mathcal{A}_0$ is an initial ABox of triples, $\mathcal{L}$ is an ordered append-only log of typed deltas, $\rho$ is a deterministic replay function, and $\mathcal{I}$ is an intervention vocabulary.
#### State and deltas\.
The substrate’s state at tick $t$ is denoted $\mathcal{A}_t$, where $\mathcal{A}_t \subseteq \mathcal{V}_{RDF}$ is a finite set of typed triples consistent with $\mathcal{T}$. Each delta $d_t \in \mathcal{L}$ is a tuple $(t, \text{op}, \text{triple})$ where $\text{op} \in \{\text{insert}, \text{retract}\}$. Replay $\rho$ is defined by $\mathcal{A}_{t+1} = \rho(\mathcal{A}_t, d_t)$, applying $d_t$ to $\mathcal{A}_t$ as a set operation. The state at any tick $t$ is therefore recoverable in $O(t)$ delta applications from $\mathcal{A}_0$ and the prefix $d_0, \ldots, d_{t-1}$ of the log.
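The replay semantics above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the `Delta` dataclass, the string-tuple triple encoding, and the `ex:` names are assumptions made for the example, and the log is assumed to be sorted by tick.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

# Illustrative encoding: a triple is a (subject, predicate, object) string tuple.
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Delta:
    tick: int
    op: str          # "insert" or "retract"
    triple: Triple


def rho(state: FrozenSet[Triple], delta: Delta) -> FrozenSet[Triple]:
    """Apply one delta to the ABox as a set operation."""
    if delta.op == "insert":
        return state | {delta.triple}
    if delta.op == "retract":
        return state - {delta.triple}
    raise ValueError(f"unknown op: {delta.op!r}")


def state_at(a0: FrozenSet[Triple], log: Iterable[Delta], t: int) -> FrozenSet[Triple]:
    """Recover A_t from A_0 by replaying the prefix d_0 .. d_{t-1}."""
    state = a0
    for delta in log:
        if delta.tick >= t:   # log is assumed sorted by tick
            break
        state = rho(state, delta)
    return state


# Hypothetical two-delta log: alice leaves the cafe, then arrives at the park.
a0 = frozenset({("ex:alice", "ex:locatedAt", "ex:cafe")})
log = [
    Delta(0, "retract", ("ex:alice", "ex:locatedAt", "ex:cafe")),
    Delta(1, "insert", ("ex:alice", "ex:locatedAt", "ex:park")),
]
```

Because `rho` is a pure function on immutable sets, calling `state_at` twice with the same arguments always yields the same state, which is the property the replay guarantee relies on.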
#### Interventions\.
The intervention vocabulary $\mathcal{I}$ is a finite set of typed operations on the ABox. Our implementation uses five interventions: `Assert`, `Retract`, `OverrideLocation`, `AssertAwareness`, and `RetractAwareness`. Each intervention $\iota \in \mathcal{I}$ is a function on the substrate state. A counterfactual query is parameterized by a branch tick $t^{\ast}$ and an intervention $\iota$. The counterfactual log is defined by

$$\mathcal{L}^{\iota}_{t^{\ast}} = d_0, \ldots, d_{t^{\ast}-1}, \iota, d^{\iota}_{t^{\ast}}, d^{\iota}_{t^{\ast}+1}, \ldots$$

where the deltas after $t^{\ast}$ are produced by the same replay function applied to the intervened state. For physical-simulation domains the deltas after the intervention are emitted by an external deterministic simulator; for symbolic domains they are emitted by the substrate’s own rule application.
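A fork for a symbolic domain might look like the following sketch. It tracks states rather than materialized deltas, and the `gossip_rule` step function, the agent names, and `move_bob_to_park` are hypothetical stand-ins for the substrate's rule application and an `OverrideLocation`-style intervention.

```python
from typing import Callable, FrozenSet, Tuple

Triple = Tuple[str, str, str]
State = FrozenSet[Triple]
Rule = Callable[[State, int], State]       # hypothetical deterministic step rule
Intervention = Callable[[State], State]


def run(state: State, rule: Rule, start: int, end: int) -> State:
    """Roll the rule forward deterministically from tick `start` to tick `end`."""
    for t in range(start, end):
        state = rule(state, t)
    return state


def fork(a0: State, rule: Rule, branch_tick: int, horizon: int,
         intervention: Intervention) -> Tuple[State, State]:
    """Return (factual, counterfactual) states at `horizon`.

    Both arms share the prefix before `branch_tick`; the counterfactual arm
    applies the intervention once and then re-derives everything after it.
    """
    prefix = run(a0, rule, 0, branch_tick)
    factual = run(prefix, rule, branch_tick, horizon)
    counterfactual = run(intervention(prefix), rule, branch_tick, horizon)
    return factual, counterfactual


def gossip_rule(state: State, t: int) -> State:
    """Toy rule: an agent learns a fact from any co-located agent who knows it."""
    loc = {s: o for (s, p, o) in state if p == "locatedAt"}
    knows = {(s, o) for (s, p, o) in state if p == "knows"}
    learned = {
        (b, "knows", fact)
        for (a, fact) in knows
        for b in loc
        if b != a and loc.get(a) == loc[b]
    }
    return state | learned


def move_bob_to_park(state: State) -> State:
    """Replace bob's location triple, in the spirit of OverrideLocation."""
    kept = {tr for tr in state if not (tr[0] == "bob" and tr[1] == "locatedAt")}
    return frozenset(kept | {("bob", "locatedAt", "park")})


a0 = frozenset({
    ("alice", "locatedAt", "cafe"),
    ("bob", "locatedAt", "cafe"),
    ("alice", "knows", "secret"),
})
factual, counterfactual = fork(a0, gossip_rule, 0, 2, move_bob_to_park)
```

In the factual arm bob shares the cafe with alice and learns the fact; in the counterfactual arm he is moved before the rule fires and never does. The cost matches the description in the text: one intervention application plus a forward replay from the branch tick.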
#### Inspectability\.
Every triple in $\mathcal{A}_t$ is addressable by its typed IRI. Every delta in $\mathcal{L}$ is addressable by its tick. SPARQL queries on $\mathcal{A}_t$ return a deterministic result set. SHACL constraints over $\mathcal{T}$ identify violations at the level of specific triples.
#### Cost\.
Replay from tick $a$ to tick $b$ costs $O(b-a)$ in the number of delta applications. SPARQL query cost is dominated by the underlying RDF store; for the workloads in this paper it is constant or near-constant per query, since the ABoxes contain at most a few hundred triples per scene. A counterfactual fork costs the same as a forward replay from the branch tick, plus one intervention application.
#### Concrete instantiation\.
Our implementation uses Oxigraph (Pellissier Tanon, [2020](https://arxiv.org/html/2605.15967#bib.bib19)) as the RDF store and approximately 1,400 lines of Python for the CLEVRER interpreter (four modules: descriptive, explanatory, counterfactual, predictive), with an additional 1,800 lines for the ComPhy, GQA, and bAbI modules. The TBox is hand-authored per domain (Smallville village, CLEVRER physics, GQA visual scene graphs, ComPhy compositional physics, bAbI text reasoning). All numbers reported in the paper use this implementation.
## 3 Ancestor duality and complexity
We now characterize the conditions under which counterfactual queries on an event\-graph substrate reduce to standard graph operations on the observed event log\.
### 3.1 Causal-ancestor graph
Let $\mathcal{L}$ be an event log on a finite set of objects $\mathcal{O}$. Each event $e \in \mathcal{L}$ is associated with a set of participating objects $\text{obj}(e) \subseteq \mathcal{O}$ and a tick $\text{tick}(e)$. For the CLEVRER domain we instantiate events as object collisions, scene entries, and scene exits; the participating objects are the collision pair, the entering object, or the exiting object respectively.
###### Definition 1 (Causal-ancestor set).
For an event $e \in \mathcal{L}$, the causal-ancestor set of $e$, denoted $\textsc{Anc}(e)$, is the smallest set of events such that: (i) every event $e'$ with $\text{tick}(e') < \text{tick}(e)$ and $\text{obj}(e') \cap \text{obj}(e) \neq \emptyset$ is in $\textsc{Anc}(e)$; (ii) for every $e' \in \textsc{Anc}(e)$ and every event $e''$ with $\text{tick}(e'') < \text{tick}(e')$ and $\text{obj}(e'') \cap \text{obj}(e') \neq \emptyset$, $e'' \in \textsc{Anc}(e)$.
Equivalently, $\textsc{Anc}(e)$ is the set of events reachable from $e$ by backward breadth-first traversal over the bipartite event-object incidence graph, with each step restricted to ticks strictly less than the tick of the event being expanded. We write $\textsc{AncObj}(e) = \bigcup_{e' \in \textsc{Anc}(e)} \text{obj}(e')$ for the set of objects appearing anywhere in $e$’s causal history. Algorithm [1](https://arxiv.org/html/2605.15967#alg1) computes $\textsc{Anc}(e)$ and $\textsc{AncObj}(e)$ in a single backward pass over the log.
Algorithm 1: Causal-ancestor traversal

1. **function** Ancestors($e$, $\mathcal{L}$)
2. $A \leftarrow \emptyset$
3. $\textsc{Obj} \leftarrow \text{obj}(e)$
4. $Q \leftarrow \{(o, \text{tick}(e)) : o \in \text{obj}(e)\}$ $\triangleright$ queue of (object, reference tick)
5. $\textsc{Visited} \leftarrow \text{obj}(e)$
6. **while** $Q$ is not empty **do**
7. pop $(o, \tau)$ from $Q$
8. **for each** $e' \in \mathcal{L}$ with $\text{tick}(e') < \tau$ and $o \in \text{obj}(e')$ **do**
9. $A \leftarrow A \cup \{e'\}$
10. **for each** $o' \in \text{obj}(e') \setminus \textsc{Visited}$ **do**
11. $\textsc{Visited} \leftarrow \textsc{Visited} \cup \{o'\}$
12. $\textsc{Obj} \leftarrow \textsc{Obj} \cup \{o'\}$
13. push $(o', \text{tick}(e'))$ onto $Q$
14. **end for**
15. **end for**
16. **end while**
17. **return** $(A, \textsc{Obj})$
18. **end function**
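As a runnable companion, the sketch below computes the ancestor set directly from Definition 1 with an event-level worklist, so the reference tick is always the tick of the event being expanded. It is not a line-by-line transcription of Algorithm 1: the `Event` dataclass and the example log are assumptions, and, as in Algorithm 1, the returned object set is seeded with $\text{obj}(e)$.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Event:
    tick: int
    objs: FrozenSet[str]
    kind: str = "collision"


def ancestors(e: Event, log: List[Event]) -> Tuple[Set[Event], Set[str]]:
    """Return (Anc(e), objects of e and of all its ancestors)."""
    anc: Set[Event] = set()
    frontier = [e]
    while frontier:
        cur = frontier.pop()
        for prev in log:
            if prev in anc:
                continue
            # Per-event reference tick: compare against cur, not against e.
            if prev.tick < cur.tick and prev.objs & cur.objs:
                anc.add(prev)
                frontier.append(prev)
    anc_obj = set(e.objs)
    for prev in anc:
        anc_obj |= prev.objs
    return anc, anc_obj


# Hypothetical scene: a chain C-D, then B-C, then A-B, plus an unrelated E-F collision.
c1 = Event(2, frozenset({"C", "D"}))
u = Event(3, frozenset({"E", "F"}))
c2 = Event(5, frozenset({"B", "C"}))
c3 = Event(10, frozenset({"A", "B"}))
log = [c1, u, c2, c3]

anc, anc_obj = ancestors(c3, log)
```

Here `anc` contains `c1` and `c2` but not the unrelated event `u`, and `anc_obj` is the set of objects A through D. The inner scan over the whole log keeps the sketch short; an index from object to its events would give the per-object bound discussed in the complexity analysis.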
The per-event reference tick is essential. A naive variant that uses $\text{tick}(e)$ for every BFS step admits spurious long transitive paths and produces an over-approximation of $\textsc{Anc}(e)$; the empirical impact of this distinction is reported in Section [6](https://arxiv.org/html/2605.15967#S6).
### 3.2 Duality theorem
###### Proposition 1 (Ancestor duality, informal).
Let $\mathcal{L}$ be an event log on objects $\mathcal{O}$, and let $X \in \mathcal{O}$. Suppose the following conditions hold:
1. C1 (*Closed events.*) Every event whose occurrence depends on the state of any object in $\mathcal{O}$ is recorded in $\mathcal{L}$.
2. C2 (*Exogeneity of non-ancestors.*) For every event $e' \in \mathcal{L}$ with $X \notin \textsc{AncObj}(e')$, the occurrence and timing of $e'$ are not state-dependent on $X$.
3. C3 (*No emergent interactions.*) Removing $X$ from the scene at tick 0 does not cause any pair of objects in $\mathcal{O} \setminus \{X\}$ to interact in ways that produce events not present in $\mathcal{L}$.

Then for any observed event $e \in \mathcal{L}$,

$$e \text{ does not occur in the counterfactual world where } X \text{ is absent} \iff X \in \textsc{AncObj}(e).$$
###### Proof sketch\.
*($\Leftarrow$)* If $X \in \textsc{AncObj}(e)$, then $X$ is involved in at least one event $e'$ on the BFS path from $e$ back to its earliest ancestor. By C1, the occurrence of $e'$ depends on $X$. Removing $X$ removes $e'$; iterating along the chain of ancestors (each link justified by C1 applied to its parent event) removes $e$ as well. *($\Rightarrow$)* If $X \notin \textsc{AncObj}(e)$, then by C2 the occurrence of $e$ is not state-dependent on $X$, and by C3 no new events outside $\mathcal{L}$ are introduced by removing $X$; hence $e$ is unchanged. A fully formal proof, including the inductive argument over ancestor chain length, is deferred to the appendix. ∎
The proposition reduces counterfactual queries of the form “would event $e$ still occur if $X$ were removed?” to a deterministic membership test in $\textsc{AncObj}(e)$, which is computed by a single backward BFS over $\mathcal{L}$.
### 3.3 Failure mode: emergent interactions
Condition C3 fails when removing $X$ creates new interactions among the remaining objects. In CLEVRER this happens when $X$’s trajectory pre-empted a collision that two other objects would otherwise have had; removing $X$ allows that collision to occur. The proposition in its pure form cannot predict these emergent events, since they are by definition absent from the observed log.
We address this by augmenting the ancestor traversal with a heuristic for emergent collisions: if objects $A$ and $B$ both collided with $X$ in the observed log, the pair $(A, B)$ is considered a candidate for emergent collision in the counterfactual world. The justification is structural: a collision with $X$ is the most common reason for a non-trivial trajectory deflection in CLEVRER, so the set of objects whose trajectories after colliding with $X$ diverge from their trajectories before it is upper-bounded by the set of $X$’s collision partners. Algorithm [2](https://arxiv.org/html/2605.15967#alg2) presents the full counterfactual answer procedure. Section [6](https://arxiv.org/html/2605.15967#S6) reports the empirical contribution of the emergent-collision heuristic.
Algorithm 2: Counterfactual answer via ancestor duality. The procedure returns True when the candidate event occurs in the counterfactual world where $X$ is absent.

1. **function** AnswerCF($e_{\text{candidate}}$, $X$, $\mathcal{L}$)
2. **if** $e_{\text{candidate}} \in \mathcal{L}$ **then**
3. $(\_, \textsc{Obj}) \leftarrow \textsc{Ancestors}(e_{\text{candidate}}, \mathcal{L})$
4. **return** $X \notin \textsc{Obj}$ $\triangleright$ Proposition [1](https://arxiv.org/html/2605.15967#Thmproposition1): an observed event survives iff $X$ is not among its ancestor objects
5. **end if**
6. **if** $e_{\text{candidate}}$ is a collision $(A, B)$ with $X \notin \{A, B\}$ **then**
7. $\textsc{Partners} \leftarrow \{o : (X, o) \text{ is a collision in } \mathcal{L}\}$
8. **if** $A \in \textsc{Partners}$ and $B \in \textsc{Partners}$ **then**
9. **return** True $\triangleright$ common-removed-partner emergent collision
10. **end if**
11. **end if**
12. **return** False
13. **end function**
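The procedure can be sketched in Python as follows. The return convention follows Section 4.2, where the boolean indicates whether the described event occurs in the counterfactual world, so an observed event is reported as occurring exactly when $X$ lies outside its ancestor objects. The `Event` type, the matching of a candidate to an observed event by kind and participants, and the example log are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Event:
    tick: int
    objs: FrozenSet[str]
    kind: str = "collision"


def ancestor_objects(e: Event, log: List[Event]) -> Set[str]:
    """Objects of e plus objects of every causal ancestor of e (Definition 1)."""
    anc: Set[Event] = set()
    frontier = [e]
    while frontier:
        cur = frontier.pop()
        for prev in log:
            if prev not in anc and prev.tick < cur.tick and prev.objs & cur.objs:
                anc.add(prev)
                frontier.append(prev)
    return set(e.objs).union(*(a.objs for a in anc))


def occurs_without(kind: str, objs: FrozenSet[str], x: str, log: List[Event]) -> bool:
    """True if the described event is predicted to occur once x is removed."""
    # Assumption: a candidate matches an observed event by kind and participants.
    observed = next((ev for ev in log if ev.kind == kind and ev.objs == objs), None)
    if observed is not None:
        return x not in ancestor_objects(observed, log)
    if kind == "collision" and x not in objs:
        # Common-removed-partner heuristic: both participants collided with x.
        partners = {o for ev in log if ev.kind == "collision" and x in ev.objs
                    for o in ev.objs if o != x}
        return objs <= partners
    return False


# Hypothetical log: X hits A, then an unrelated D-E collision, X hits B, A hits C.
log = [
    Event(3, frozenset({"X", "A"})),
    Event(4, frozenset({"D", "E"})),
    Event(6, frozenset({"X", "B"})),
    Event(9, frozenset({"A", "C"})),
]
```

With this log, removing `X` cancels the observed A-C collision (it descends from the X-A collision), leaves the D-E collision intact, and flags A-B as a candidate emergent collision because both objects were partners of `X`.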
### 3.4 Complexity
Let $|E|$ denote the number of events in $\mathcal{L}$ and $|O|$ the number of objects. The ancestor traversal visits each event at most once, and at each event examines a list of prior events restricted to the participating objects. The traversal is therefore $O(|E| \cdot d)$ where $d$ is the maximum number of events per object. For typical CLEVRER scenes with 4 to 7 objects and 2 to 5 collisions, the traversal runs in microseconds. Pre-computing $\textsc{AncObj}(e)$ for all events in a scene is $O(|E|^2 \cdot d)$ worst-case and $O(|E|)$ for the bounded scene sizes encountered in practice.
## 4 Empirical evaluation on CLEVRER
CLEVRER (Yi et al., [2020](https://arxiv.org/html/2605.15967#bib.bib1)) is the canonical video benchmark for causal reasoning over physical events. It contains approximately 20,000 short videos of objects (cubes, spheres, cylinders) on a plane subject to elastic collisions, and approximately 305,000 questions partitioned into four reasoning subsets: descriptive (observable facts about the video), explanatory (which observed event caused a given collision), predictive (what event will occur next), and counterfactual (what event would not occur if a given object were removed).
We evaluate the substrate on the full validation split of CLEVRER. The substrate consumes the per-scene annotation file containing `object_property`, `motion_trajectory`, and `collision` records, and does not consume video pixels. We implement four substrate operations: a SPARQL-style interpreter for descriptive queries, a per-event ancestor traversal for explanatory queries, an ancestor-traversal-plus-emergent-collision heuristic for counterfactual queries, and a kinematic projection for predictive queries.
### 4.1 Results
Table [1](https://arxiv.org/html/2605.15967#S4.T1) compares the substrate to NS-DR (Yi et al., [2020](https://arxiv.org/html/2605.15967#bib.bib1)), the strongest published symbolic-oracle baseline. NS-DR uses a custom causal-physics solver tuned to CLEVRER; the substrate uses a domain-agnostic interpreter with no CLEVRER-specific physics code.
Table 1: Substrate performance on the CLEVRER validation set, compared to the published symbolic-oracle baseline NS-DR (Yi et al., [2020](https://arxiv.org/html/2605.15967#bib.bib1), Table 3) and the parametric attention baseline ALOE (Ding et al., [2021](https://arxiv.org/html/2605.15967#bib.bib9), Table 1, per-question accuracy); both baseline rows are reported on the same CLEVRER validation split. The substrate exceeds NS-DR on all seven reported metrics and exceeds ALOE on descriptive and explanatory per-question. ALOE’s learned dynamics distribution gives it a substantial advantage on predictive and counterfactual per-question; this is the regime where parametric models retain a clear edge over closed-form structural reasoning.
### 4.2 Implementation per subset
#### Descriptive\.
Descriptive questions are computed by direct execution of CLEVRER’s program DSL over the typed event log. The interpreter implements roughly 25 opcodes covering object and event selection (`objects`, `events`, `all_events`), attribute and motion filters (`filter_color/material/shape`, `filter_moving/stationary`, `filter_in/out`, `filter_collision`, `filter_before/after`, `filter_order`), event endpoints (`start`, `end`, `first`, `last`, `get_frame`), aggregation (`count`, `exist`, `unique`, `belong_to`), attribute query (`query_color/material/shape`), counterfactual lookup (`get_counterfact`, `get_col_partner`), and logical negation. Execution is stack-based with reverse Polish notation (RPN) semantics matching the CLEVRER program format.
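To make the stack-based execution concrete, here is a small sketch covering a handful of the opcodes named above. The opcode names come from the list in the text, but their argument order, the treatment of unknown tokens as literal operands, and the dictionary scene encoding are assumptions for illustration and may differ from the released interpreter.

```python
from typing import Any, Dict, List

Obj = Dict[str, str]
ATTRS = ("color", "material", "shape")


def run_program(program: List[str], scene: List[Obj]) -> Any:
    """Execute a postfix (RPN) token list against a list of object records."""
    stack: List[Any] = []
    for tok in program:
        op, _, arg = tok.partition("_")
        if tok == "objects":
            stack.append(list(scene))
        elif op == "filter" and arg in ATTRS:
            # Assumed operand order: object list pushed first, value on top.
            value, objs = stack.pop(), stack.pop()
            stack.append([o for o in objs if o[arg] == value])
        elif op == "query" and arg in ATTRS:
            stack.append(stack.pop()[arg])
        elif tok == "count":
            stack.append(len(stack.pop()))
        elif tok == "exist":
            stack.append(len(stack.pop()) > 0)
        elif tok == "unique":
            objs = stack.pop()
            if len(objs) != 1:
                raise ValueError(f"unique expects one object, got {len(objs)}")
            stack.append(objs[0])
        else:
            stack.append(tok)   # any other token is treated as a literal operand
    if len(stack) != 1:
        raise ValueError("program did not reduce to a single result")
    return stack[0]


# Hypothetical three-object scene.
scene = [
    {"color": "red", "shape": "cube", "material": "metal"},
    {"color": "red", "shape": "sphere", "material": "rubber"},
    {"color": "blue", "shape": "cylinder", "material": "metal"},
]

n_red = run_program(["objects", "red", "filter_color", "count"], scene)
cube_color = run_program(["objects", "cube", "filter_shape", "unique", "query_color"], scene)
```

Because each opcode only pops its operands and pushes one result, the same loop extends to event opcodes by adding branches; no parse tree is needed.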
#### Explanatory\.
Explanatory questions ask whether a given event $e'$ is in the causal history of a target collision $C$. The substrate computes $\textsc{Anc}(C)$ by the BFS of Definition [1](https://arxiv.org/html/2605.15967#Thmdefinition1) and returns $e' \in \textsc{Anc}(C)$. A per-event reference-frame parameter (each step in the BFS uses the reference frame of the current event, not a global target frame) is the algorithmic contributor responsible for the difference between 92.28 percent and 99.86 percent per-question accuracy; see Section [6](https://arxiv.org/html/2605.15967#S6).
#### Counterfactual\.
Counterfactual questions are answered by the ancestor duality of Proposition [1](https://arxiv.org/html/2605.15967#Thmproposition1) augmented with the common-removed-partner heuristic for emergent collisions. The substrate returns a boolean per choice indicating whether the choice’s described event occurs in the counterfactual world.
#### Predictive\.
Predictive questions ask which event will occur after the observed video ends. Algorithm [3](https://arxiv.org/html/2605.15967#alg3) samples each object’s velocity averaged over the final five visible frames, projects all objects forward in straight-line motion, and predicts a collision for any pair whose closest-approach distance falls below a fixed threshold $\tau = 1.7$ (in CLEVRER scene units, approximately $2.4\times$ the canonical object radius of $0.7$) within a 300-frame future horizon. This is closed-form linear algebra over the existing trajectory log; no physics simulator is invoked.
Algorithm 3: Kinematic projection for predictive queries

1. **function** PredictCollisions($\mathcal{L}$, $\tau$, $T$) $\triangleright$ $\tau$ = absolute collision distance threshold, $T$ = future horizon (frames)
2. $\textsc{Pairs} \leftarrow \emptyset$
3. **for each** object $o$ visible in $\mathcal{L}$ **do**
4. $\mathbf{v}_o \leftarrow \text{mean}(\text{velocity}(o, t))$ over last 5 frames
5. $\mathbf{p}_o \leftarrow \text{position}(o, t_{\text{last}})$
6. **end for**
7. **for each** pair $(A, B)$ of visible objects **do**
8. $\Delta\mathbf{p} \leftarrow \mathbf{p}_A - \mathbf{p}_B$
9. $\Delta\mathbf{v} \leftarrow \mathbf{v}_A - \mathbf{v}_B$
10. $t^{\ast} \leftarrow -\langle \Delta\mathbf{p}, \Delta\mathbf{v} \rangle / \langle \Delta\mathbf{v}, \Delta\mathbf{v} \rangle$ $\triangleright$ closest-approach time
11. **if** $0 < t^{\ast} < T$ **then**
12. $d_{\min} \leftarrow \|\Delta\mathbf{p} + t^{\ast} \Delta\mathbf{v}\|$
13. **if** $d_{\min} < \tau$ **then**
14. $\textsc{Pairs} \leftarrow \textsc{Pairs} \cup \{(A, B)\}$
15. **end if**
16. **end if**
17. **end for**
18. **return** $\textsc{Pairs}$
19. **end function**
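A plain-Python rendering of the projection is sketched below. Several details are assumptions: positions are 2D tuples sampled once per frame, velocity is the mean of the last five frame-to-frame displacements, and pairs with no relative motion are skipped, since the closest-approach time is undefined when $\Delta\mathbf{v} = 0$ (a case the pseudocode leaves implicit).

```python
import math
from itertools import combinations
from typing import Dict, List, Set, Tuple

Vec = Tuple[float, float]


def predict_collisions(tracks: Dict[str, List[Vec]], tau: float = 1.7,
                       horizon: float = 300.0, window: int = 5) -> Set[Tuple[str, str]]:
    """Predict colliding pairs by straight-line projection of final velocities."""
    state: Dict[str, Tuple[Vec, Vec]] = {}
    for name, pts in tracks.items():
        recent = pts[-(window + 1):]
        steps = len(recent) - 1
        if steps < 1:
            continue  # need at least two positions to estimate a velocity
        # Mean of per-frame displacements telescopes to (last - first) / steps.
        v = ((recent[-1][0] - recent[0][0]) / steps,
             (recent[-1][1] - recent[0][1]) / steps)
        state[name] = (recent[-1], v)

    pairs: Set[Tuple[str, str]] = set()
    for a, b in combinations(sorted(state), 2):
        (pa, va), (pb, vb) = state[a], state[b]
        dp = (pa[0] - pb[0], pa[1] - pb[1])
        dv = (va[0] - vb[0], va[1] - vb[1])
        dv2 = dv[0] ** 2 + dv[1] ** 2
        if dv2 == 0.0:
            continue  # no relative motion: the separation never changes
        t_star = -(dp[0] * dv[0] + dp[1] * dv[1]) / dv2   # closest-approach time
        if 0 < t_star < horizon:
            d_min = math.hypot(dp[0] + t_star * dv[0], dp[1] + t_star * dv[1])
            if d_min < tau:
                pairs.add((a, b))
    return pairs


# Hypothetical scene: A moves right along y=0; B sits near its path; C is far away.
tracks = {
    "A": [(float(i), 0.0) for i in range(6)],
    "B": [(20.0, 0.5)] * 6,
    "C": [(20.0, 10.0)] * 6,
}
predicted = predict_collisions(tracks)
```

In this example A reaches its closest approach to B after 15 frames at a distance of 0.5 scene units, below the 1.7 threshold, so only the pair (A, B) is predicted.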
## 5 Cross-domain transfer
Figure 2: Cross-domain transfer. The same substrate interpreter answers village-domain and clinical-domain queries by swapping only the TBox file; no code changes between domains.

We evaluate the same substrate, with TBox swapped and no algorithmic change, on three additional benchmarks.
#### ComPhy (Chen et al., [2022](https://arxiv.org/html/2605.15967#bib.bib2)).
ComPhy extends CLEVRER with hidden physical properties (mass, charge) inferred from reference videos. We evaluate on the full factual subset (n=5,882). The substrate achieves 73.10 percent (Wilson 95 percent CI [71.94, 74.23]); PCR (Chen et al., [2024](https://arxiv.org/html/2605.15967#bib.bib3), Table 5, test split), the 2024 published state of the art on ComPhy, reports 62.0 percent on the factual subset. The substrate exceeds PCR by 11.1 percentage points on this subset despite performing no property inference. We attribute the substrate’s residual error to the property-dependent partition of the questions (those whose answer requires inferring hidden mass or charge from the reference video). A dedicated property-inference module trainable on the reference videos is a natural extension that we discuss in Section [8](https://arxiv.org/html/2605.15967#S8).
#### GQA (Hudson and Manning, [2019](https://arxiv.org/html/2605.15967#bib.bib4)).
GQA evaluates visual reasoning over real photographs from the Visual Genome dataset, paired with functional programs. The substrate consumes the Stanford GQA validation scene graphs and executes GQA’s `semantic` program DSL. On the full validation split (n=132,062), the substrate achieves 95.27 percent (Wilson 95 percent CI [95.16, 95.39]). On 100 of these questions answered by Qwen2.5-VL-3B-Instruct directly from the photographs, the vision-language model achieves 69.0 percent. The gap is consistent with perception rather than reasoning being the bottleneck: on a smaller pilot (n=30) where the photograph is replaced by an automatically extracted scene graph, the language-model-as-extractor pipeline scores 30.0 percent (Wilson 95 percent CI [16.7, 47.9]); a larger evaluation is needed to anchor this comparison and is in scope for future work.
Figure 3: Twin-EventLog evaluation. A shared event log $t_1, \dots, t_5$ runs up to the branch tick $T_{\mathrm{branch}}$. The log then forks. Arm A continues without intervention, producing events $t_6, \dots, t_{10}$. Arm B applies a $do(X)$ intervention at the branch (orange star) and continues, producing $t_6', \dots, t_{10}'$. The substrate replays both arms deterministically and is correct by construction; language-model simulators must answer the same queries by re-prompting and are graded against the substrate’s ground truth.
#### Twin\-EventLog \(this work\)\.
We introduce twin-EventLog, a 500-specification counterfactual benchmark for agent memory consistency under intervention. Each specification fixes an intervention at a chosen branch tick $T_{\mathrm{branch}}$ on a Park-canonical Smallville environment (Park et al., [2023](https://arxiv.org/html/2605.15967#bib.bib5)), asks a binary query about the divergence between the two arms, and grades against the substrate’s deterministic replay. The benchmark covers three linkage types (control, direct, propagation) and three query types (`did_meet`, `learned_fact`, `visited_location`).
Table [2](https://arxiv.org/html/2605.15967#S5.T2) reports the substrate’s performance against two language-model baselines: Llama-3.1-8B prompted with the full Smallville context, and a Park/Concordia-style deployment (Park et al., [2023](https://arxiv.org/html/2605.15967#bib.bib5); Vezhnevets et al., [2023](https://arxiv.org/html/2605.15967#bib.bib6)) that uses Concordia’s `LanguageModel` interface with Llama-3.1-8B as backend (see footnote 1).

Footnote 1: The Concordia-style baseline preserves Concordia’s agent-step API and per-persona prompting but restricts the action space to `sample_choice` over a fixed location set and a binary talk-or-not decision per co-located pair; it does not invoke Concordia’s reflection, planning, or hierarchical-memory modules. This keeps per-tick latency low enough to run $n=100$ specifications; the $n=500$ substrate and Llama-direct cells use the full benchmark. The `visited_location` query in this baseline is graded against end-of-simulation location only, as full per-tick location history was not logged.

Table 2: Twin-EventLog results. The substrate is correct by construction. Llama-3.1-8B is prompted with the full Smallville context; the Concordia-style baseline uses Concordia’s `LanguageModel` agent-step interface with a constrained action space (see footnote in main text). All per-cell CIs are Wilson. Substrate exceeds Llama-direct by 18.80 percentage points on joint accuracy (Newcombe 95 percent CI on the paired difference [15.53, 22.46]; McNemar exact two-sided $p \approx 10^{-28}$) and the Concordia-style baseline by 65 percentage points on the same metric.
#### Controlled comparison: substrate versus language model on identical event log\.
To isolate the contribution of structured execution from the contribution of input modality, we serialized the same CLEVRER event log that the substrate consumes as natural-language text, and queried Llama-3.1-8B with grammar-constrained output. On n=300 descriptive validation questions, the language model achieves 51.00 percent (Wilson 95 percent CI [45.37, 56.61]), while the substrate achieves 97.99 percent on the full validation split. The gap of 46.99 percentage points, measured between the language model’s accuracy on this subset and the substrate’s full-split accuracy with both systems given the same event-log content, is consistent with the interpretation that structured execution over the typed event log is the load-bearing pathway, not the input modality.
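For readers who want to check the interval quoted above, the Wilson score interval can be computed with the standard closed form. The sketch assumes 153 correct answers out of 300, which is what 51.00 percent of 300 questions implies; the function name and the hard-coded normal quantile are choices made for this example.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int,
                    z: float = 1.959963984540054) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z defaults to the 95% quantile)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 153 of 300 correct, i.e. 51.00 percent (count inferred from the reported rate).
lo, hi = wilson_interval(153, 300)
print(f"[{100 * lo:.2f}, {100 * hi:.2f}]")
```

Rounded to two decimals, the bounds agree with the reported [45.37, 56.61]. Unlike the normal-approximation interval, the Wilson interval is not centered on the observed proportion, which matters more for small samples such as the n=30 pilot in the GQA paragraph.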
## 6Ablations
Each of the four algorithmic operations in Section [4](https://arxiv.org/html/2605.15967#S4) has a measurable impact on its corresponding subset\. We report ablations on the same full validation splits\.
#### Per\-event ancestor reference frame\.
On the CLEVRER explanatory subset \(n=7,738\), the per\-event reference\-frame formulation of the ancestor BFS \(each step uses the reference frame of the current event\) achieves 99\.86 percent per\-question\. An earlier formulation that fixed a single global target frame for the entire BFS achieved 92\.28 percent on the same split, a difference of 7\.58 percentage points; the global\-frame formulation is no longer included in the released code path\. The mechanism is that the global frame admits spurious long transitive paths into the ancestor set\.
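The difference between the two formulations can be illustrated with a toy traversal. The sketch below is one plausible reading of the description above, not the released implementation. It assumes that an event is a collision with a frame index and a set of participating objects, and that a candidate is a predecessor when it shares an object with the current event and occurs before the cutoff frame. The `Event` type, the `ancestors` function, and the predecessor rule are all hypothetical.

```python
from collections import deque
from typing import NamedTuple


class Event(NamedTuple):
    eid: str
    frame: int
    objects: frozenset


def ancestors(events, target_id, per_event_frame=True):
    """Backward BFS from the target event over a shared-object predecessor rule.

    per_event_frame=True: a predecessor must precede the event being expanded.
    per_event_frame=False: a predecessor only has to precede the target event.
    """
    by_id = {e.eid: e for e in events}
    target = by_id[target_id]
    found = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        cutoff = current.frame if per_event_frame else target.frame
        for cand in events:
            if cand.eid == target_id or cand.eid in found:
                continue
            if cand.frame < cutoff and cand.objects & current.objects:
                found.add(cand.eid)
                queue.append(cand)
    return found


log = [
    Event("e1", 20, frozenset({"A", "C"})),
    Event("e2", 60, frozenset({"C", "D"})),  # happens after e1, so it cannot cause e1
    Event("target", 100, frozenset({"A", "B"})),
]
print(ancestors(log, "target", per_event_frame=True))
print(sorted(ancestors(log, "target", per_event_frame=False)))
```

In this toy log the global-frame variant pulls `e2` into the ancestor set through `e1`, even though `e2` occurs after `e1`. This is the kind of spurious transitive path the paragraph describes.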
#### Common\-removed\-partner heuristic\.
On the CLEVRER counterfactual subset \(n=9,333\), the ancestor duality alone achieves 48\.01 percent per\-question and 82\.65 percent per\-choice\. Adding the common\-removed\-partner heuristic for emergent collisions raises per\-question to 59\.85 percent \(a difference of 11\.84 percentage points\) and per\-choice to 86\.69 percent \(a difference of 4\.04 percentage points\)\. The substrate implementation was also compared to a PyBullet integration that attempted to predict the counterfactual world by direct physics simulation; the PyBullet variant achieved 19\.6 percent per\-question, well below the ancestor\-based variant\.
#### Kinematic\-projection parameters\.
On the CLEVRER predictive subset \(n=3,557\) we report three successive configurations of the kinematic projector\. Configuration A \(single\-frame final velocity\) yields 51\.98 percent per\-question\. Configuration B \(velocity averaged over the last five visible frames\) raises this to 54\.40 percent\. Configuration C \(five\-frame velocity averaging, absolute collision threshold τ=1\.7 scene units, future horizon T=300 frames\) raises this further to 69\.50 percent and is the configuration reported in Table [1](https://arxiv.org/html/2605.15967#S4.T1)\.
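A constant-velocity projector with these three parameters can be sketched as follows. This is a hedged illustration rather than the released projector. It assumes that each object track is a list of 2D positions, one per visible frame, that velocity is the mean finite difference over the last five positions, and that a collision is predicted when the closest approach within the horizon falls at or below τ. The names `mean_velocity` and `will_collide` are hypothetical.

```python
import math


def mean_velocity(track, window=5):
    """Mean per-frame displacement over the last `window` positions of a track."""
    recent = track[-window:]
    steps = len(recent) - 1
    if steps < 1:
        return (0.0, 0.0)
    return ((recent[-1][0] - recent[0][0]) / steps,
            (recent[-1][1] - recent[0][1]) / steps)


def will_collide(track_a, track_b, tau=1.7, horizon=300, window=5):
    """Predict a collision if two linearly extrapolated objects come within tau."""
    (ax, ay), (bx, by) = track_a[-1], track_b[-1]
    vax, vay = mean_velocity(track_a, window)
    vbx, vby = mean_velocity(track_b, window)
    dx, dy = ax - bx, ay - by          # relative position at the last frame
    dvx, dvy = vax - vbx, vay - vby    # relative velocity per frame
    speed_sq = dvx * dvx + dvy * dvy
    if speed_sq == 0.0:
        t_star = 0.0                   # no relative motion: distance is constant
    else:
        # Closed-form time of closest approach, clamped to the future horizon.
        t_star = min(max(-(dx * dvx + dy * dvy) / speed_sq, 0.0), float(horizon))
    closest = math.hypot(dx + dvx * t_star, dy + dvy * t_star)
    return closest <= tau


moving = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
still = [(50.0, 0.0)] * 5
print(will_collide(moving, still))              # True
print(will_collide(moving, still, horizon=10))  # False
```

Setting `window=2` would approximate the single-frame velocity of Configuration A under the same assumptions.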
#### Substrate operations are deterministic\.
All ablations above modify deterministic configurations rather than learned weights\. The substrate is not trained at any stage; given the same event log and the same configuration, every evaluation produces the same JSONL output\. A regression test in the repository asserts equal triple counts across two seeded substrate runs at the end of a 20\-tick replay\.
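The shape of such a regression test can be sketched with a toy stand-in for the substrate. Everything here is hypothetical: `run_replay`, the agent names, and the predicates are placeholders, and the real test presumably exercises the actual runtime. The sketch only shows the pattern of running the same seeded replay twice and comparing the resulting logs.

```python
import random


def run_replay(seed: int, ticks: int = 20):
    """Toy seeded replay that appends (subject, predicate, object, tick) rows."""
    rng = random.Random(seed)  # local generator, so no global state leaks in
    places = ("cafe", "park", "library")
    log = []
    for tick in range(ticks):
        where = {agent: rng.choice(places) for agent in ("alice", "bob")}
        for agent, place in where.items():
            log.append((agent, "locatedAt", place, tick))
        if where["alice"] == where["bob"]:
            # The row count depends on the random draws, so equal counts are informative.
            log.append(("alice", "talksTo", "bob", tick))
    return log


def test_replay_is_deterministic():
    first = run_replay(seed=7)
    second = run_replay(seed=7)
    assert len(first) == len(second)  # equal row counts, as in the described test
    assert first == second            # stricter: identical rows in identical order


test_replay_is_deterministic()
```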
## 7Related work
#### Symbolic and neuro\-symbolic visual reasoning\.
NS\-VQA \(Yi et al\., [2018](https://arxiv.org/html/2605.15967#bib.bib11)\) and NS\-CL \(Mao et al\., [2019](https://arxiv.org/html/2605.15967#bib.bib12)\) parse images into scene graphs and execute symbolic programs over them\. NS\-DR \(Yi et al\., [2020](https://arxiv.org/html/2605.15967#bib.bib1)\) extends this approach to CLEVRER with a custom causal\-physics solver\. PCR \(Chen et al\., [2024](https://arxiv.org/html/2605.15967#bib.bib3)\) extends it to ComPhy with property inference\. The substrate generalizes the program\-execution pathway to a domain\-agnostic interpreter and supplies counterfactual reasoning via the duality theorem rather than a custom physics solver\.
#### Parametric world models\.
Dreamer\-V3 \(Hafner et al\., [2023](https://arxiv.org/html/2605.15967#bib.bib7)\) and successors learn latent dynamics for reinforcement\-learning action policies\. V\-JEPA\-2 \(Assran et al\., [2025](https://arxiv.org/html/2605.15967#bib.bib8)\) learns video feature representations via joint embedding prediction\. ALOE \(Ding et al\., [2021](https://arxiv.org/html/2605.15967#bib.bib9)\) attends over learned object embeddings for video question answering\. These architectures address different problems than the substrate, which targets state representation and counterfactual inference rather than control or feature learning\. Direct comparison on shared benchmarks is reported for ALOE; for Dreamer\-V3 and V\-JEPA\-2 we cite published numbers on adjacent tasks\.
#### Structured world models\.
C\-SWM \(Kipf et al\., [2020](https://arxiv.org/html/2605.15967#bib.bib16)\) learns object\-centric latent representations and transitions via contrastive prediction; later extensions combine this with slot\-attention object discovery\. Both are close in spirit to the substrate but use learned components; neither is evaluated on the CLEVRER causal\-reasoning subsets at full validation scale\.
#### Causal reasoning and structural causal models\.
Pearl’s structural causal model framework \(Pearl, [2009](https://arxiv.org/html/2605.15967#bib.bib20)\) underlies the intervention semantics in Section [2](https://arxiv.org/html/2605.15967#S2)\. The substrate’s deterministic\-replay construction is a finite\-state instantiation of Pearl’s twin\-network formulation \(Pearl, [2009](https://arxiv.org/html/2605.15967#bib.bib20)\); the probabilistic counterfactual evaluation underlying that framework is due to Balke and Pearl \([1994](https://arxiv.org/html/2605.15967#bib.bib10)\)\. In the substrate, counterfactual evaluation reduces to replay over a forked log\.
#### Agent memory and language\-model simulators\.
Concordia \(Vezhnevets et al\., [2023](https://arxiv.org/html/2605.15967#bib.bib6)\) runs LLM\-driven Smallville simulations and represents agent memory as a language\-model summary\. ReAct \(Yao et al\., [2023](https://arxiv.org/html/2605.15967#bib.bib14)\) and Voyager \(Wang et al\., [2023](https://arxiv.org/html/2605.15967#bib.bib15)\) similarly use language\-model summaries as memory\. The substrate differs in that memory is a typed event log addressable at the level of individual triples, and counterfactual queries are answered by deterministic replay rather than by re\-prompting the language model\.
#### Memory networks\.
MemN2N \(Sukhbaatar et al\., [2015](https://arxiv.org/html/2605.15967#bib.bib18)\) and successors learn a soft addressable memory through attention\. On the bAbI text\-reasoning benchmark, the best MemN2N variant achieves 95\.8 percent mean accuracy with 10,000 training examples per task\. The substrate, with no training and a hand\-authored parser of under 700 lines, achieves 90\.58 percent mean accuracy and 100 percent on 10 of 20 tasks\. The bAbI evaluation is in scope as a demonstration that the substrate’s interpreter transfers to a text reasoning benchmark without modification, but is not the focus of this paper\. Later memory architectures \(EntNet, DMN\+, RMN\) report perfect or near\-perfect accuracy on bAbI 10k and supersede MemN2N; we compare to MemN2N because its training\-data dependence is the most informative contrast to the substrate’s zero\-shot regime\.
## 8Limitations
#### Hidden\-property inference\.
A non\-trivial fraction of ComPhy factual questions depend on hidden physical properties \(mass, charge\) inferred from reference videos\. The substrate as currently implemented does not perform this inference\. Despite this, it exceeds the published property\-aware PCR baseline on the factual subset \(Section [5](https://arxiv.org/html/2605.15967#S5)\)\. A property\-inference module trainable on the reference videos would close the remaining gap on the property\-dependent partition and is a natural extension; we leave this to future work\.
#### Emergent interactions beyond the heuristic\.
On the CLEVRER counterfactual per\-option metric, the substrate achieves 86\.69 percent, exceeding NS\-DR’s 74\.1 percent by 12\.59 percentage points \(Table [1](https://arxiv.org/html/2605.15967#S4.T1)\)\. The residual error relative to a perfect oracle is attributable to emergent collisions that the common\-removed\-partner heuristic does not capture \(for example, three\-way emergent interactions where the removed object’s deflection chain involves more than one intermediate\)\. A more accurate emergent\-collision predictor would require either a learned model trained on the counterfactual distribution or a calibrated physics simulator matching CLEVRER’s generator configuration\.
#### TBox authorship\.
The TBox is hand\-authored per domain\. Substrate transfer is at the level of the interpreter and the algorithmic toolkit; the domain\-specific class and predicate vocabulary is constructed separately\. Automated TBox induction from data is an active research direction \(Meilicke et al\., [2019](https://arxiv.org/html/2605.15967#bib.bib13)\) and is out of scope for the present paper\.
#### Scope of the substrate\.
The substrate addresses state representation and counterfactual inference\. It does not perform action selection \(which is the domain of Dreamer\-V3 and related reinforcement\-learning world models\), and it does not generate natural\-language summaries \(which is the domain of language\-model\-based agent architectures\)\. The substrate can be composed with either; we do not study such compositions in this paper\.
## 9Conclusion
We have presented a class of world models for agentic systems in which agent memory is a typed event log, counterfactual queries are answered by deterministic replay under a structured intervention vocabulary, and inspection is supported at the level of individual triples and individual deltas\. We have shown that under closed\-event assumptions, counterfactual queries on this class of models reduce to a graph\-theoretic duality with explanatory queries, both of which are answered by a single causal\-ancestor traversal of the observed event log\. We have evaluated a 1,400\-line CLEVRER\-DSL interpreter atop a domain\-agnostic substrate runtime on CLEVRER at full validation scale and reported results exceeding the strongest published symbolic\-oracle baseline on every per\-question causal\-reasoning subset\. We have introduced a new counterfactual benchmark, twin\-EventLog, for evaluating agent memory consistency under intervention\.
The principal observation is that counterfactual world modeling admits an implementation by deterministic replay over typed event deltas\. This implementation makes the substrate’s predictions auditable at the level of individual events, supports exact counterfactuals over arbitrary interventions on the typed state, and transfers across domains by swapping the TBox without learned components\.
#### Reproducibility\.
The substrate has no learned parameters: every reported number is a deterministic function of an input dataset, the interpreter code, and a fixed configuration of the algorithmic toolkit \(per\-event reference frame for explanatory; common\-removed\-partner heuristic for counterfactual; five\-frame velocity averaging with τ=1\.7 and T=300 for predictive\)\. Each entry of Tables [1](https://arxiv.org/html/2605.15967#S4.T1) and [2](https://arxiv.org/html/2605.15967#S5.T2) is derived from per\-question labels in a JSONL artifact bundled with the released repository, and the accuracy in the table is a count of correct rows divided by total rows\. Wilson intervals are computed from those counts; the Newcombe interval on the substrate\-versus\-Llama joint\-accuracy gap is computed from the paired per\-spec labels in the two Twin\-EventLog JSONLs\. There is no held\-out checkpoint or hyperparameter sweep between artifact and number, and no learned weights to reload; re\-running the interpreter on the same dataset and configuration regenerates the same JSONL bit\-for\-bit\.
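The path from artifact to number can be sketched in a few lines. The example below is illustrative only. It assumes that each JSONL row carries a boolean field named `correct`, which is a hypothetical schema and may not match the released files. The helper names `accuracy_from_jsonl` and `digest` are likewise hypothetical. A SHA-256 digest is one simple way to check that two runs produced identical bytes.

```python
import hashlib
import json


def accuracy_from_jsonl(text: str):
    """Return (correct, total, accuracy) from JSONL text with a boolean `correct` field."""
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("no rows found")
    correct = sum(1 for row in rows if row["correct"])  # `correct` is an assumed field name
    return correct, len(rows), correct / len(rows)


def digest(text: str) -> str:
    """SHA-256 of the artifact contents, for bit-for-bit comparison across runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


sample = (
    '{"qid": 1, "correct": true}\n'
    '{"qid": 2, "correct": false}\n'
    '{"qid": 3, "correct": true}\n'
)
print(accuracy_from_jsonl(sample))
print(digest(sample) == digest(sample))  # True
```

Two runs of the interpreter on the same dataset and configuration would be expected to yield equal digests under the determinism claim above.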
## References
- M\. Assran, A\. Bardes, et al\. \(2025\) V\-JEPA 2: self\-supervised video models enable understanding, prediction and planning\. arXiv:2506\.09985\. Cited by: [§1](https://arxiv.org/html/2605.15967#S1.p1.1), [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px2.p1.1)\.
- A\. Balke and J\. Pearl \(1994\) Probabilistic evaluation of counterfactual queries\. In AAAI\. Cited by: [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px4.p1.1)\.
- Z\. Chen et al\. \(2024\) Compositional physical reasoning of objects and events from videos\. arXiv:2408\.02687\. Cited by: [§5](https://arxiv.org/html/2605.15967#S5.SS0.SSS0.Px1.p1.1), [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px1.p1.1)\.
- Z\. Chen, K\. Yi, Y\. Li, M\. Ding, A\. Torralba, J\. B\. Tenenbaum, and C\. Gan \(2022\) ComPhy: compositional physical reasoning of objects and events from videos\. In ICLR\. Cited by: [§1](https://arxiv.org/html/2605.15967#S1.p3.1), [§5](https://arxiv.org/html/2605.15967#S5.SS0.SSS0.Px1)\.
- D\. Ding, F\. Hill, A\. Santoro, M\. Reynolds, and M\. Botvinick \(2021\) Attention over learned object embeddings enables complex visual reasoning\. In NeurIPS\. Cited by: [§1](https://arxiv.org/html/2605.15967#S1.SS0.SSS0.Px1.p1.1), [Table 1](https://arxiv.org/html/2605.15967#S4.T1), [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px2.p1.1)\.
- D\. Hafner, J\. Pasukonis, J\. Ba, and T\. Lillicrap \(2023\) Mastering diverse domains through world models\. arXiv:2301\.04104\. Cited by: [§1](https://arxiv.org/html/2605.15967#S1.p1.1), [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px2.p1.1)\.
- D\. A\. Hudson and C\. D\. Manning \(2019\) GQA: a new dataset for real\-world visual reasoning and compositional question answering\. In CVPR\. Cited by: [§1](https://arxiv.org/html/2605.15967#S1.p3.1), [§5](https://arxiv.org/html/2605.15967#S5.SS0.SSS0.Px2)\.
- T\. Kipf, E\. van der Pol, and M\. Welling \(2020\) Contrastive learning of structured world models\. In ICLR\. Cited by: [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px3.p1.1)\.
- J\. Mao, C\. Gan, P\. Kohli, J\. B\. Tenenbaum, and J\. Wu \(2019\) The neuro\-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision\. In ICLR\. Cited by: [§1](https://arxiv.org/html/2605.15967#S1.p2.1), [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px1.p1.1)\.
- C\. Meilicke, M\. W\. Chekol, D\. Ruffinelli, and H\. Stuckenschmidt \(2019\) Anytime bottom\-up rule learning for knowledge graph completion\. In IJCAI\. Cited by: [§8](https://arxiv.org/html/2605.15967#S8.SS0.SSS0.Px3.p1.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In Proc\. 36th Annual ACM Symposium on User Interface Software and Technology \(UIST\)\. Cited by: [§1](https://arxiv.org/html/2605.15967#S1.p1.1), [§5](https://arxiv.org/html/2605.15967#S5.SS0.SSS0.Px3.p1.1), [§5](https://arxiv.org/html/2605.15967#S5.SS0.SSS0.Px3.p2.1)\.
- J\. Pearl \(2009\) Causality: models, reasoning, and inference\. 2nd edition, Cambridge University Press\. Cited by: [§1](https://arxiv.org/html/2605.15967#S1.p2.1), [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px4.p1.1)\.
- T\. Pellissier Tanon \(2020\) Oxigraph: a graph database implementing the SPARQL standard and the RDF data model\. [https://github\.com/oxigraph/oxigraph](https://github.com/oxigraph/oxigraph)\. Cited by: [§2](https://arxiv.org/html/2605.15967#S2.SS0.SSS0.Px5.p1.1)\.
- S\. Sukhbaatar, A\. Szlam, J\. Weston, and R\. Fergus \(2015\) End\-to\-end memory networks\. In NeurIPS\. Cited by: [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px6.p1.1)\.
- A\. S\. Vezhnevets et al\. \(2023\) Generative agent\-based modeling with actions grounded in physical, social, or digital space using Concordia\. arXiv:2312\.03664\. Cited by: [§5](https://arxiv.org/html/2605.15967#S5.SS0.SSS0.Px3.p2.1), [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px5.p1.1)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2023\) Voyager: an open\-ended embodied agent with large language models\. arXiv:2305\.16291\. Cited by: [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px5.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\) ReAct: synergizing reasoning and acting in language models\. In ICLR\. Cited by: [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px5.p1.1)\.
- K\. Yi, C\. Gan, Y\. Li, P\. Kohli, J\. Wu, A\. Torralba, and J\. B\. Tenenbaum \(2020\) CLEVRER: collision events for video representation and reasoning\. In ICLR\. Cited by: [§1](https://arxiv.org/html/2605.15967#S1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.15967#S1.p2.1), [§4\.1](https://arxiv.org/html/2605.15967#S4.SS1.p1.1), [Table 1](https://arxiv.org/html/2605.15967#S4.T1), [§4](https://arxiv.org/html/2605.15967#S4.p1.1), [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px1.p1.1)\.
- K\. Yi, J\. Wu, C\. Gan, A\. Torralba, P\. Kohli, and J\. B\. Tenenbaum \(2018\) Neural\-symbolic VQA: disentangling reasoning from vision and language understanding\. In NeurIPS\. Cited by: [§1](https://arxiv.org/html/2605.15967#S1.p2.1), [§7](https://arxiv.org/html/2605.15967#S7.SS0.SSS0.Px1.p1.1)\.