Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations

arXiv cs.AI 05/15/26, 04:00 AM Papers
llm-conversations runtime-verifier dependency-graph grounding context-manipulation formal-methods evaluation
Summary
This paper introduces Grounded Continuation, a linear-time runtime verifier for LLM conversations that maintains an explicit dependency graph to detect whether a next utterance is supported by prior conversation, achieving accuracy gains over baselines on benchmarks including LongMemEval and LoCoMo.
arXiv:2605.14175v1 Announce Type: new Abstract: In long conversations, an LLM can produce a next utterance that sounds plausible but rests on premises the conversation has already abandoned. Context-manipulation attacks against deployed agents now actively exploit this gap. We close it with a runtime verifier that maintains an explicit dependency graph: an LLM classifies each turn into one of 8 update operations drawn from four formalisms (dynamic epistemic logic, abductive reasoning, awareness logic, argumentation), and a symbolic engine records which claims depend on which evidence. Checking whether a continuation is supported reduces to a graph walk; retraction propagates through the same graph to flag exactly the conclusions that lose support, with linear per-turn cost and a formal conflict-free guarantee. On LongMemEval-KU oracle (n=78), the verifier reaches 89.7% accuracy vs. 88.5% for the LLM-only baseline (+1.3pp) and 87.2% for a transcript-RAG baseline matched on retrieval budget (+2.6pp); wins among disagreements are correct abstentions where the baseline confabulates. On LoCoMo's 60 official QA items the verifier is competitive with retrieval-augmented baselines. Beyond external benchmarks, we construct two multi-agent scenarios and a 50-item grounding test: on the 15-item stale-premise subset, the verifier reaches 100% accuracy vs. 93.3% (+6.7pp). These instantiate a soundness-faithfulness decomposition: the structural check is sound by construction, and per-deployment LLM extraction faithfulness is the empirical question we measure across four LLM families. The retraction check plateaus at microseconds while history-replay grows linearly with conversation length.
Original Article
View Cached Full Text
Cached at: 05/15/26, 06:22 AM
# Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations
Source: [https://arxiv.org/html/2605.14175](https://arxiv.org/html/2605.14175)
Qisong He, Yi Dong, Xiaowei Huang School of Computer Science and Informatics, University of Liverpool, UK

###### Abstract

In a long conversation, an LLM can produce a next utterance that sounds plausible but rests on premises the conversation has already abandoned\. Context\-manipulation attacks against deployed agents now actively exploit this gap\. We close it with a runtime verifier that maintains an explicit dependency graph\. As the conversation unfolds, an LLM classifies each turn into one of88update operations drawn from four formalisms \(dynamic epistemic logic, abductive reasoning, awareness logic, argumentation\)\. A symbolic engine then records which claims depend on which evidence and earlier reasoning\. At any point, “Is this continuation supported by what has been said?” reduces to a graph walk\. Retraction propagates through the same graph to flag exactly the conclusions that lose support, with linear per\-turn cost and a formal conflict\-free guarantee\. On the benchmarks we evaluate, the verifier matches or exceeds strong baselines: on LongMemEval\-KU oracle \(knowledge update with supersession,n=78n\\\!=\\\!78\), it reaches89\.7%89\.7\\%accuracy against88\.5%88\.5\\%for the LLM\-only baseline \(\+1\.3\+1\.3pp\) and87\.2%87\.2\\%for a transcript\-RAG baseline matched on retrieval budget and content access \(\+2\.6\+2\.6pp\)\. The verifier wins among disagreements are correct abstentions where the baseline confabulates\. On LoCoMo’s6060official QA items the verifier is competitive with retrieval\-augmented baselines, consistent with its interactional\-grounding focus\. Beyond external benchmarks, we construct two multi\-agent scenarios and a5050\-item grounding test for controlled evidence: on the1515\-item*stale\-premise*subset \(premise retracted earlier in the conversation\), the verifier reaches100%100\\%accuracy vs\.93\.3%93\.3\\%\(\+6\.7\+6\.7pp\)\. Together, these controlled tests instantiate the soundness–faithfulness decomposition: the structural check is sound by construction, and per\-deployment LLM extraction faithfulness is the empirical question we measure across four LLM families\. The graph\-walk retraction check plateaus at microseconds in the bounded regime while history\-replay grows linearly with conversation length\.

## 1Introduction

Large language models produce continuations that can be locally fluent and pragmatically plausible yet*ungrounded*: disconnected from the claims, observations, and revisions that the conversation has actually established\. The deployment question is concrete: given the next LLM output, can we check,*at this turn*and within a tight latency budget, whether that output traces back to those prior commitments? When an LLM cannot answer “why did we reject approach A?” or “what assumptions does our current decision rest on?”, the relevant utterances are inside the context window\. What is missing is a maintained structure connecting the LLM’s continuation to the conversation’s prior commitments\.

This problem is empirically severe\.Labanet al\.\([2026](https://arxiv.org/html/2605.14175#bib.bib1)\)found a 39% average performance drop in multi\-turn versus single\-turn settings across 200,000\+ simulated conversations, with LLMs failing to revise incorrect assumptions even when later turns contradict them: “when LLMs take a wrong turn in a conversation, they get lost and do not recover\.”Shaikhet al\.\([2025](https://arxiv.org/html/2605.14175#bib.bib39)\)show, on real human\-LLM dialogues from WildChat\(Zhaoet al\.,[2024](https://arxiv.org/html/2605.14175#bib.bib41)\), MultiWOZ\(Budzianowskiet al\.,[2018](https://arxiv.org/html/2605.14175#bib.bib42)\), and Bing Chat\(Kellyet al\.,[2023](https://arxiv.org/html/2605.14175#bib.bib43)\), that LLMs initiate clarification three times less often than human partners and that early grounding failures predict later interaction breakdowns\.

This gap is also actively exploited: context\-manipulation attacks against deployed agents\(Patlanet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib44); Donget al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib45)\)produce continuations that are locally consistent but disconnected from the conversation’s prior commitments, triggering unauthorised actions because no runtime mechanism ties the agent’s continuation to claims established earlier in the conversation\.

Retrieval\-augmented attribution methods address*external*grounding: a generated claim is accepted if it can be traced to a cited document\(Gaoet al\.,[2023](https://arxiv.org/html/2605.14175#bib.bib37); Bohnetet al\.,[2022](https://arxiv.org/html/2605.14175#bib.bib38)\)\. Long\-horizon LLM conversations are dominated by*interactional*grounding instead \(a continuation is acceptable if it is consistent with claims, observations, hypotheses, and revisions earlier in this same conversation\), and conversational memory systems\(Chhikaraet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib47); Rasmussenet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib28); Zhang and others,[2024](https://arxiv.org/html/2605.14175#bib.bib26)\)track*what was said*but not the dependency structure connecting what was said to what was concluded\. We close this dependency\-tracking gap with a runtime verifier: an LLM Interpreter classifies each utterance into one of 8 operations updating a symbolic engine that maintains a dependency map\. The runtime check at any point asks whether a candidate continuation is reachable from the current structure and what upstream commitments it depends on, computable in time linear in the engine’s representation size \([Proposition˜2\.2](https://arxiv.org/html/2605.14175#S2.Thmproposition2)\)\. The LLM handles natural\-language understanding \(noisy but learnable\)\. The engine handles dependency tracking \(sound by construction,[Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1)\)\.

#### Contributions\.

\(1\) A runtime verifier for interactional grounding: at every turn, in time linear in the engine’s representation, the verifier checks whether a candidate continuation traces back through a maintained dependency structure to the conversation’s prior commitments \([Sections˜2](https://arxiv.org/html/2605.14175#S2)and[4](https://arxiv.org/html/2605.14175#S4)\)\.\(2\) Empirical wins where interactional grounding matters, scoping where it does not: against a matched transcript\-RAG baseline, on LongMemEval\-KU oracle \(n=78n\\\!=\\\!78\) the verifier exceeds both the LLM\-only baseline \(\+1\.3\+1\.3pp\) and the transcript\-RAG baseline \(\+2\.6\+2\.6pp,[Table˜5](https://arxiv.org/html/2605.14175#S4.T5)\), with wins concentrated in correct\-abstention cases where the baseline confabulates\. On LoCoMo it is competitive with retrieval\-augmented baselines, consistent with its interactional\-grounding focus \([Table˜4](https://arxiv.org/html/2605.14175#S4.T4)\)\.\(3\) A composable formal substrate with a soundness guarantee: each utterance produces a single well\-typed update𝖠𝗉𝗉𝗅𝗒\(op,𝑎𝑟𝑔𝑠,𝒟t\)\\mathsf\{Apply\}\(op,\\mathit\{args\},\\mathcal\{D\}\_\{t\}\)over an epistemic plausibility model, argumentation framework, commitment record, and dependency map \([Algorithm˜1](https://arxiv.org/html/2605.14175#alg1)\), with a conflict\-free guarantee for selective retraction \([Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1)\)\.\(4\) A soundness–faithfulness decompositionthat splits the verifier into a sound structural check and per\-deployment extraction faithfulness, exposing the canonical stale\-claim case where the verifier catches advice both the LLM\-only and matched transcript\-RAG baselines miss \(\+6\.7\+6\.7pp on the stale\-premise subset,[Table˜3](https://arxiv.org/html/2605.14175#S4.T3)\)\.

Scope\.We evaluate on two authored multi\-agent scenarios \([Section˜3](https://arxiv.org/html/2605.14175#S3)\), a 50\-item Phase 2 direct grounding test, 78 LongMemEval\-KUoracleitems\(Wuet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib36)\), and 60 official LoCoMo QA items across three multi\-session conversations\(Maharanaet al\.,[2024](https://arxiv.org/html/2605.14175#bib.bib34)\)\. The verifier was designed for interactional grounding\. With content\-bearing rendering and retrieval, the framework also extends to entity\-relation factual QA \([Section˜4\.2](https://arxiv.org/html/2605.14175#S4.SS2)\)\.

## 2The Runtime Verifier

Notation\.We use the standard modal operators of dynamic epistemic logic \(DEL\):𝐊iφ\\mathbf\{K\}\_\{i\}\\varphi\(agentiiknowsφ\\varphi\),𝐁iφ\\mathbf\{B\}\_\{i\}\\varphi\(believes\),𝐀iφ\\mathbf\{A\}\_\{i\}\\varphi\(is aware of\), and𝐂Gφ\\mathbf\{C\}\_\{G\}\\varphi\(common knowledge in groupGG\)\. Their semantics over an epistemic plausibility model \([Definition˜E\.1](https://arxiv.org/html/2605.14175#A5.Thmdefinition1)\) and an awareness structure \([Definition˜E\.3](https://arxiv.org/html/2605.14175#A5.Thmdefinition3)\) are recalled in[Appendix˜E](https://arxiv.org/html/2605.14175#A5)\.

The verifier itself exposes a single graph query, but the structure that supports it is layered: an epistemic model \(what is known\), an argumentation framework \(what attacks what\), commitment records \(who said what publicly\), and the dependency map proper\.[Definition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmdefinition1)bundles the four into one object\. Runtime queries touch only the argumentation skeleton and the dependency map\.

###### Definition 2\.1\(Dependency structure\)\.

A*dependency structure*at turnttis a tuple𝒟t=\(ℳt,𝐴𝐹t,𝐶𝑚t,𝐷𝑒𝑝t\)\\mathcal\{D\}\_\{t\}=\(\\mathcal\{M\}\_\{t\},\\mathit\{AF\}\_\{t\},\\mathit\{Cm\}\_\{t\},\\mathit\{Dep\}\_\{t\}\)whereℳt\\mathcal\{M\}\_\{t\}is an epistemic plausibility model \([Definition˜E\.1](https://arxiv.org/html/2605.14175#A5.Thmdefinition1)\.Baltag and Smets[2008](https://arxiv.org/html/2605.14175#bib.bib3)\) recording what each agent knows, believes, or has hypothesised\.𝐴𝐹t=\(𝐴𝑟𝑔𝑠t,𝐴𝑡𝑡t\)\\mathit\{AF\}\_\{t\}=\(\\mathit\{Args\}\_\{t\},\\mathit\{Att\}\_\{t\}\)is a Dung\-style argumentation framework with attack relation𝐴𝑡𝑡t⊆𝐴𝑟𝑔𝑠t×𝐴𝑟𝑔𝑠t\\mathit\{Att\}\_\{t\}\\subseteq\\mathit\{Args\}\_\{t\}\\times\\mathit\{Args\}\_\{t\}\(Dung,[1995](https://arxiv.org/html/2605.14175#bib.bib14)\)\(each argumentα\\alphacarries a claim𝑐𝑙𝑎𝑖𝑚\(α\)∈𝑃𝑟𝑜𝑝\\mathit\{claim\}\(\\alpha\)\\in\\mathit\{Prop\}over the conversation’s propositions𝑃𝑟𝑜𝑝\\mathit\{Prop\}, and is itself the unit of support\)\.𝐶𝑚t:𝐴𝑔𝑠→𝒫\(𝐴𝑟𝑔𝑠t\)\\mathit\{Cm\}\_\{t\}:\\mathit\{Ags\}\\to\\mathcal\{P\}\(\\mathit\{Args\}\_\{t\}\)records each agent’s public commitments\(Waltonet al\.,[2008](https://arxiv.org/html/2605.14175#bib.bib16)\), where𝐴𝑔𝑠\\mathit\{Ags\}is the set of agents in the conversation\.𝐷𝑒𝑝t:𝐴𝑟𝑔𝑠t→𝒫\(𝑃𝑟𝑜𝑝\)\\mathit\{Dep\}\_\{t\}:\\mathit\{Args\}\_\{t\}\\to\\mathcal\{P\}\(\\mathit\{Prop\}\)maps each argument to the propositions supporting it\.

The full four\-formalism foundation underlying𝒟t\\mathcal\{D\}\_\{t\}is in[Appendix˜E](https://arxiv.org/html/2605.14175#A5); the runtime check below queries only𝐷𝑒𝑝t\\mathit\{Dep\}\_\{t\}and the argumentation skeleton\.

The runtime check\.Given a candidate continuationccasserting propositionϕc\\phi\_\{c\}, the verifier checks whethercchas the property of being*grounded*with respect to the current dependency structure:

###### Definition 2\.2\(Grounded continuation\)\.

A candidate continuationccasserting propositionϕc\\phi\_\{c\}is*grounded*with respect to𝒟t\\mathcal\{D\}\_\{t\}iff there exists an argumentαc∈𝐴𝑟𝑔𝑠t\\alpha\_\{c\}\\in\\mathit\{Args\}\_\{t\}with𝑐𝑙𝑎𝑖𝑚\(αc\)=ϕc\\mathit\{claim\}\(\\alpha\_\{c\}\)=\\phi\_\{c\}, otherwise it is*ungrounded*\.

At any turntt, the verifier returns:

𝖵𝖾𝗋𝗂𝖿𝗒\(c,𝒟t\)=\{⟨𝗀𝗋𝗈𝗎𝗇𝖽𝖾𝖽,𝐷𝑒𝑝\(αc\)⟩if∃αc∈𝐴𝑟𝑔𝑠twith𝑐𝑙𝑎𝑖𝑚\(αc\)=ϕc⟨𝗎𝗇𝗀𝗋𝗈𝗎𝗇𝖽𝖾𝖽,∅⟩otherwise\.\\mathsf\{Verify\}\(c,\\mathcal\{D\}\_\{t\}\)=\\begin\{cases\}\\langle\\mathsf\{grounded\},\\mathit\{Dep\}\(\\alpha\_\{c\}\)\\rangle&\\text\{if \}\\exists\\,\\alpha\_\{c\}\\in\\mathit\{Args\}\_\{t\}\\text\{ with \}\\mathit\{claim\}\(\\alpha\_\{c\}\)=\\phi\_\{c\}\\\\ \\langle\\mathsf\{ungrounded\},\\emptyset\\rangle&\\text\{otherwise\.\}\\end\{cases\}A grounded continuation comes with the set of upstream commitments it depends on\. An ungrounded continuation is flagged for retry, retraction, or human review\. Two derived queries support belief revision:𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\)\\mathit\{Affected\}\(p\)identifies the conclusions that lose grounding whenppis retracted \(formal definition in[Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1), restricted to the current preferred extension\), and𝐷𝑒𝑝\(α\)\\mathit\{Dep\}\(\\alpha\)returns the propositions an argumentα\\alphadepends on\. Both reduce to dependency\-graph reachability\. The pair\(𝐴𝑟𝑔𝑠t,𝐷𝑒𝑝t\)\(\\mathit\{Args\}\_\{t\},\\mathit\{Dep\}\_\{t\}\)is a labelled claim\-dependency structure derived from interaction history\. Its soundness is established by[Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1)and its extraction\-time faithfulness is the empirical question[Section˜4](https://arxiv.org/html/2605.14175#S4)addresses\.[Definition˜2\.2](https://arxiv.org/html/2605.14175#S2.Thmdefinition2)is binary by design\. Graded variants \(partial dependency, weighted attack relations, confidence\-calibrated grounding\) are tractable extensions\.

Why this is non\-trivial\.The lookup form of[Definition˜2\.2](https://arxiv.org/html/2605.14175#S2.Thmdefinition2)is misleading\.𝐴𝑟𝑔𝑠t\\mathit\{Args\}\_\{t\}is not the set of mentioned propositions: it is a preferred extension of an argumentation framework whose attack relation is updated each turn, populated by hypotheses generated through abduction \([Definition˜E\.2](https://arxiv.org/html/2605.14175#A5.Thmdefinition2)\) over an awareness structure that the conversation expands\. The work is upstream of the check\.*Maintenance*: each turn updates the epistemic plausibility model, awareness set, attack relation, and dependency map jointly\. One misclassification poisons all four\.*Retraction*: identifying which conclusions lose grounding whenppis retracted requires the explicit𝐷𝑒𝑝\\mathit\{Dep\}map to be maintained as part of the structure rather than derived post\-hoc\. This is what[Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1)’s conflict\-free guarantee rests on\.*Complexity*: the combined formalisms arePSpace\-hard in general \(DEL model\-checking\.Aucher and Schwarzentruber[2013](https://arxiv.org/html/2605.14175#bib.bib11)\), so any polynomial\-time procedure \(a transformer’s forward pass included\) must approximate\. The verifier sidesteps this by maintaining\(𝐴𝑟𝑔𝑠t,𝐷𝑒𝑝t\)\(\\mathit\{Args\}\_\{t\},\\mathit\{Dep\}\_\{t\}\)incrementally so each check is𝒪\(\|𝐴𝑟𝑔𝑠t\|\+\|𝐴𝑡𝑡t\|\)\\mathcal\{O\}\(\|\\mathit\{Args\}\_\{t\}\|\+\|\\mathit\{Att\}\_\{t\}\|\)under the structural restrictions our scenarios satisfy \([Proposition˜2\.2](https://arxiv.org/html/2605.14175#S2.Thmproposition2)\)\.

Updating𝒟t\\mathcal\{D\}\_\{t\}\.An LLM Interpreter classifies each utterance into one of 8 operations\.[Algorithm˜1](https://arxiv.org/html/2605.14175#alg1)composes the operation into a single update𝒟t↦𝒟t\+1\\mathcal\{D\}\_\{t\}\\mapsto\\mathcal\{D\}\_\{t\+1\}that simultaneously refinesℳ\\mathcal\{M\},𝐴𝐹\\mathit\{AF\},𝐶𝑚\\mathit\{Cm\}, and𝐷𝑒𝑝\\mathit\{Dep\}\. The engine checks per\-operation preconditions \([Table˜8](https://arxiv.org/html/2605.14175#A5.T8)\) and re\-prompts on failure\. Surprising observations enqueue abductive problems \([Definition˜E\.2](https://arxiv.org/html/2605.14175#A5.Thmdefinition2)\) that drive the nextHypothesize\.

The 8 operations split into three roles:Observe/Resolvecommit content;Hypothesize/Support/Undermine/Reviseadjust plausibility or attack relations without erasing provenance;Expand\-Awareness/Questionexpand or query without committing claims\. The taxonomy is the closure of the four formalisms’ update primitives under the typing requirement that each utterance produces exactly one well\-typed update \([Appendix˜E](https://arxiv.org/html/2605.14175#A5)\)\. The verifier itself queries only𝐷𝑒𝑝t\\mathit\{Dep\}\_\{t\}and the argumentation skeleton at runtime\.

The 8 operations and their DEL realisations\(notation:[Appendix˜E](https://arxiv.org/html/2605.14175#A5)\)\.

Soundness of selective retraction\.The central guarantee is that𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\)\\mathit\{Affected\}\(p\)identifies exactly the conclusions that lose grounding whenppis retracted, with the post\-retraction state well\-defined under the standardDung \([1995](https://arxiv.org/html/2605.14175#bib.bib14)\)preferred\-extension semantics\.

###### Proposition 2\.1\(Conflict\-free selective retraction\)\.

Let\(ℳ,𝐴𝐹,𝐶𝑚,𝐷𝑒𝑝\)\(\\mathcal\{M\},\\mathit\{AF\},\\mathit\{Cm\},\\mathit\{Dep\}\)be a dependency structure withS⊆𝐴𝑟𝑔𝑠S\\subseteq\\mathit\{Args\}a preferred extension of𝐴𝐹\\mathit\{AF\}\. Supposep∈𝑃𝑟𝑜𝑝p\\in\\mathit\{Prop\}is retracted\. Let𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\):=\{α∈S:p∈𝐷𝑒𝑝\(α\)\}\\mathit\{Affected\}\(p\):=\\\{\\alpha\\in S:p\\in\\mathit\{Dep\}\(\\alpha\)\\\},S′:=S∖𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\)S^\{\\prime\}:=S\\setminus\\mathit\{Affected\}\(p\), and𝐴𝐹′:=\(𝐴𝑟𝑔𝑠∖𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\),𝐴𝑡𝑡∩\(𝐴𝑟𝑔𝑠∖𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\)\)2\)\\mathit\{AF\}^\{\\prime\}:=\(\\mathit\{Args\}\\setminus\\mathit\{Affected\}\(p\),\\;\\mathit\{Att\}\\cap\(\\mathit\{Args\}\\setminus\\mathit\{Affected\}\(p\)\)^\{2\}\)\. ThenS′S^\{\\prime\}is conflict\-free in𝐴𝐹′\\mathit\{AF\}^\{\\prime\}, and there exists a preferred extension of𝐴𝐹′\\mathit\{AF\}^\{\\prime\}containingS′S^\{\\prime\}\. Any such extension is a valid post\-retraction state and preserves everyα∈S\\alpha\\in Swithp∉𝐷𝑒𝑝\(α\)p\\notin\\mathit\{Dep\}\(\\alpha\)\.

The proposition guarantees conflict\-freeness and existence, not admissibility ofS′S^\{\\prime\}itself or uniqueness of the post\-retraction extension\. The engine treatsS′S^\{\\prime\}as a lower bound and recomputes preferred extensions on𝐴𝐹′\\mathit\{AF\}^\{\\prime\}, flagging any argument whose status changes\. Proof in[Appendix˜E](https://arxiv.org/html/2605.14175#A5)\.

Per\-turn tractability\.The runtime check𝖵𝖾𝗋𝗂𝖿𝗒\(c,𝒟t\)\\mathsf\{Verify\}\(c,\\mathcal\{D\}\_\{t\}\)is dependency\-graph reachability, computable in time linear in\|𝐴𝑟𝑔𝑠t\|\+\|𝐴𝑡𝑡t\|\|\\mathit\{Args\}\_\{t\}\|\+\|\\mathit\{Att\}\_\{t\}\|\(no LLM call required\)\. The engine update step requires bounded model\-checking on𝒟t−1\\mathcal\{D\}\_\{t\-1\}:

###### Proposition 2\.2\(Tractability under structural restrictions\)\.

Under the structural restrictions our scenarios satisfy, namely small number of agentskk\(k≤12k\\leq 12\), acyclic attack graph, and explicit world representation, per\-turn computation is polynomial in\|W\|\|W\|and\|𝐴𝑟𝑔𝑠\|\|\\mathit\{Args\}\|:𝒪\(\|W\|\)\\mathcal\{O\}\(\|W\|\)for hard announcements,𝒪\(\|W\|log⁡\|W\|\)\\mathcal\{O\}\(\|W\|\\log\|W\|\)per soft upgrade, and𝒪\(\|𝐴𝑟𝑔𝑠\|\+\|𝐴𝑡𝑡\|\)\\mathcal\{O\}\(\|\\mathit\{Args\}\|\+\|\\mathit\{Att\}\|\)for extension recomputation\(Dung,[1995](https://arxiv.org/html/2605.14175#bib.bib14); Dunne,[2007](https://arxiv.org/html/2605.14175#bib.bib13)\)\. Reference implementation:<0\.1\{<\}0\.1ms per turn on Phase 2\. Worst cases without these restrictions reachPSpace\(Aucher and Schwarzentruber,[2013](https://arxiv.org/html/2605.14175#bib.bib11)\)andΠ2P\\Pi^\{\\mathrm\{P\}\}\_\{2\}\(Dunne and Wooldridge,[2009](https://arxiv.org/html/2605.14175#bib.bib12)\)\. Per\-phase analysis in[Appendix˜E](https://arxiv.org/html/2605.14175#A5)\. Acyclicity is inherited from Issue\-Based Information System \(IBIS\)\-style\(Kunz and Rittel,[1970](https://arxiv.org/html/2605.14175#bib.bib17)\)deliberation, where each new argument either supports, attacks, or refines an existing one rather than cycling back\.

## 3Validation Scenarios

We validate the verifier on two authored multi\-agent scenarios\.*Phase 2*\(13 turns,[Appendix˜C](https://arxiv.org/html/2605.14175#A3)\) exercises hypothesis lifecycle tracking and counterfactual retraction queries: when an observation is invalidated, which conclusions lose grounding?*Phase 3*\(19 turns,[Appendix˜D](https://arxiv.org/html/2605.14175#A4)\) exercises commitment tracking and selective revision under assumption change: when an upstream assumption flips, which decisions need re\-grounding and which remain intact? A third*Phase 1*\(muddy\-children calibration,[Figure˜3](https://arxiv.org/html/2605.14175#A2.F3)\) tests the underlying world\-set machinery against fully\-specified ground truth and is presented in the appendix only\.

Phase 2: Naturalistic debugging\.Three engineers debug a cascading failure \([Appendix˜C](https://arxiv.org/html/2605.14175#A3)\)\. Ground truth: token bug→\\toretry storm→\\toRedis exhaustion→\\torate\-limit bypass→\\toStripe 429s, plus a monitoring miscategorisation\. The conversation generates four hypotheses with non\-trivial lifecycles \(h2h\_\{2\}: Redis→\\toAuth, undermined at T7 by error\-code evidence and abandoned T9\.h4h\_\{4\}at T12 reverses the direction, subsumingh1h\_\{1\}andh3h\_\{3\}at T13\)\. The verifier is queried with three retractions \(𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(o8\),𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(o9\),𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(o6\)\\mathit\{Affected\}\(o\_\{8\}\),\\mathit\{Affected\}\(o\_\{9\}\),\\mathit\{Affected\}\(o\_\{6\}\)\)\. Correctness depends on whether the dependency map records*why*h2h\_\{2\}was abandoned and preserves the provenance ofh4h\_\{4\}’s causal reversal: both are structural facts that flat history cannot provide\.

Phase 3: Open\-ended deliberation\.A team chooses a real\-time collaboration architecture \([Appendix˜D](https://arxiv.org/html/2605.14175#A4)\)\. The outcome is a*decision with dissent*: Alice commits to Yjs server\-relayed under three assumptions \(a1a\_\{1\}docs stay short;a2a\_\{2\}editing remains burst\-mode;a3a\_\{3\}Q2 long\-running documents unconfirmed\); Bob records dissent\. Whena3a\_\{3\}later flips,𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(a3\)\\mathit\{Affected\}\(a\_\{3\}\)must flag exactly the Yjs decision while leaving the access\-control resolution intact: the selective re\-grounding capability that motivates the verifier\.

## 4Verifier Implementation and Evaluation

### 4\.1Architecture

Conversation\(natural language\)LLMInterpreterSymbolicEngineContextRendererLLM ContextWindow“it’s reversed:Auth→\\toRedis”Hypothesize\+Reviseaddh4h\_\{4\};flip causal arrow;revise plausibility“h4h\_\{4\}now active”informsFigure 1:Hybrid architecture, illustrated with Phase 2 T12 \(an engineer reverses the causal direction; details in[Appendix˜C](https://arxiv.org/html/2605.14175#A3)\)\. Solid: utterance flow, classified operations, engine update, summary\. Dashed: rendered summary fed back into the LLM’s context\.LLM Interpreter and Symbolic Engine\.For each utterancectc\_\{t\}, the Interpreter performs \(1\)*operation classification*into one of the 8 operations of[Section˜2](https://arxiv.org/html/2605.14175#S2)and \(2\)*proposition extraction*of the relevant formal content\. The Engine drives the loop, keeping the formal guarantees on the symbolic side: it flags surprising observations \(ℳ,w⊧̸𝐁iχ\\mathcal\{M\},w\\not\\models\\mathbf\{B\}\_\{i\}\\chi\) prompting the Interpreter for a hypothesis, verifies each candidate against per\-operation preconditions \([Appendices˜E](https://arxiv.org/html/2605.14175#A5)and[E\.2](https://arxiv.org/html/2605.14175#A5.Thmdefinition2)\), and either applies the formal update or re\-prompts naming the failing condition \(e\.g\., “useExpand\-Awareness, notObserve”\)\. Temporal epistemic model checkers \(e\.g\. MCMAS,Lomuscioet al\.[2017](https://arxiv.org/html/2605.14175#bib.bib19)\) or bounded model checking\(Huanget al\.,[2011](https://arxiv.org/html/2605.14175#bib.bib21)\)can serve as backends\. Worked extractions in[Appendices˜C](https://arxiv.org/html/2605.14175#A3)and[D](https://arxiv.org/html/2605.14175#A4)\.

Context Renderer\.After each engine update, the Renderer summarises the changed parts of𝒟t\\mathcal\{D\}\_\{t\}in natural language and writes the summary into the LLM’s context window \(dashed arrow in[Figure˜1](https://arxiv.org/html/2605.14175#S4.F1)\); the next classification step sees the engine’s structured state rather than raw conversation history\. The Renderer’s output is also what users see when the verifier flags an ungrounded continuation: “decisionα\\alphano longer grounded; affected by retraction ofa3a\_\{3\}; depends ona1,a2a\_\{1\},a\_\{2\}\.”

### 4\.2Evaluation

We measure \(i\) per\-utterance classification accuracy, \(ii\) retraction\-query latency at conversation lengthsKKup to 2000 turns, \(iii\) end\-to\-end dependency extraction \(E1/E1b/E1c\), \(iv\) direct verifier accuracy on a labelled grounding test set \(E2\), \(v\) noise robustness \(E5\), and \(vi\) external\-benchmark behaviour on LoCoMo\(Maharanaet al\.,[2024](https://arxiv.org/html/2605.14175#bib.bib34)\)and LongMemEval\-KU\(Wuet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib36)\)\. Scenarios are scored on Phase 2 \(13 turns\) and Phase 3 \(19 turns\)\. We report multi\-label F1 with extra weight on the 3–4*key shifts*per scenario \(hypothesis abandonment, causal reversal, decision with dissent\)\. Three prompt conditions on Phase 2:*Minimal*,*Definitions*,*State\-augmented*\. Four LLMs: Claude Sonnet 4, GPT\-4o, Qwen2\.5\-7B\-Instruct, Llama\-3\.1\-8B\-Instruct \(last two via vLLM\), 5 runs per cell,temperature=0 then 1\.0\. Phase 3 uses Definitions only\.

Table 1:Phase 2 classification accuracy across models and prompt conditions \(5 runs per cell\)\. Each cell reports F1 / Exact\-match%\. Phase 3 results \(Claude0\.850\.85overall /0\.920\.92key shifts\) in[Table˜7](https://arxiv.org/html/2605.14175#A4.T7)\.Classification\.Definitions alone lift Phase 2 F1 from0\.660\.66to0\.850\.85\(\+0\.19\+0\.19\) and key\-shift F1 from0\.500\.50to0\.740\.74\(\+0\.24\+0\.24\): the taxonomy is learnable from general definitions\. State\-augmented adds\+0\.06\+0\.06overall and\+0\.09\+0\.09on key shifts\.Expand\-Awareness\(which requires recognition that a proposition was previously*unconceived*\) jumps from0%0\\%exact \(conditions 1–2\) to100%100\\%\(condition 3\), confirming that engine state enables distinctions the LLM cannot make from text alone\. Absolute F1 scales with model size \([Table˜1](https://arxiv.org/html/2605.14175#S4.T1)\)\. Phase 3 reaches0\.850\.85overall and0\.920\.92on key shifts, demonstrating cross\-scenario generalisation\.

Retraction latency scaling\.For deployment, the value of the maintained graph is that it replaces history replay with a bounded graph walk: re\-reading theKK\-turn transcript at every retraction does not scale, but𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}does\. We measure𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\\mathit\{Affected\}on synthetic engine states atK∈\{13,…,2000\}K\\in\\\{13,\\ldots,2000\\\}turns under two regimes \([Figure˜2](https://arxiv.org/html/2605.14175#S4.F2)\):*naive*\(\|𝐴𝑟𝑔𝑠t\|∼0\.3K\|\\mathit\{Args\}\_\{t\}\|\\sim 0\.3K\) and*bounded*\(\|𝐴𝑟𝑔𝑠t\|≤50\|\\mathit\{Args\}\_\{t\}\|\\leq 50, modelling Phase\-2\-like working sets where resolved/abandoned hypotheses retire\)\. In the bounded regime the verifier plateaus while history\-replay grows linearly withKK, an84×84\\timesgap atK=2000K\\\!=\\\!2000\([Figure˜2](https://arxiv.org/html/2605.14175#S4.F2)\)\. The naive regime shows a33–10×10\\timesadvantage\. This is the empirical realisation of[Proposition˜2\.2](https://arxiv.org/html/2605.14175#S2.Thmproposition2): the per\-query cost is bounded by\|𝐴𝑟𝑔𝑠t\|\|\\mathit\{Args\}\_\{t\}\|, not the conversation length\.

![Refer to caption](https://arxiv.org/html/2605.14175v1/x1.png)Figure 2:Retraction latency scales with\|𝐴𝑟𝑔𝑠t\|\|\\mathit\{Args\}\_\{t\}\|, notKK\. Verifier \(blue\): bounded \(\|𝐴𝑟𝑔𝑠t\|≤50\|\\mathit\{Args\}\_\{t\}\|\\\!\\leq\\\!50, hollow\) plateaus once the cap saturates \(K≥200K\\\!\\geq\\\!200\); naive \(\|𝐴𝑟𝑔𝑠t\|∼0\.3K\|\\mathit\{Args\}\_\{t\}\|\\\!\\sim\\\!0\.3K, solid\) grows shallowly\. Baseline \(red dashed\): linear walk ofKK\-turn history\. Median over 5 random seeds \(each generating a fresh synthetic engine state\)×\\times200 queries per seed\.End\-to\-end verification\.The verifier’s correctness decomposes into two claims\.*Soundness given a faithful𝒟t\\mathcal\{D\}\_\{t\}*\([Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1)\): with ground\-truth dependencies,𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}resolves12/1212/12affected/unaffected decisions correctly \(3 retraction queries×\\times4 hypotheses\)\.*End\-to\-end with LLM\-extracted dependencies*: GPT\-4o reaches Dep precision1\.01\.0but recovers3/73/7ground\-truth tuples\. The bottleneck concentrates on a single cross\-hypothesis linkh1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}, not on𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}itself\. As[Table˜2](https://arxiv.org/html/2605.14175#S4.T2)shows, this gap is*schema\-deep*, not*prompt\-deep*: directly\-prompted GPT\-4o tops out atF1=0\.34\\mathrm\{F1\}\\\!=\\\!0\.34vs\. pipeline0\.600\.60\(E1\), and prompt\-schema variants reach F1=0\.50=\\,0\.50but never recoverh1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}\(E1b,[Appendix˜I](https://arxiv.org/html/2605.14175#A9)\)\. The formal construction makes the fix visible: extendResolveto update existing𝐷𝑒𝑝\\mathit\{Dep\}tuples \([Algorithm˜1](https://arxiv.org/html/2605.14175#alg1),Resolvecase\)\. A targeted probe \(E1c,[Appendix˜J](https://arxiv.org/html/2605.14175#A10)\) achieves perfecth1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}recovery with zero false positives, validated by a Phase\-3 negative sanity probe\.

Table 2:Soundness–faithfulness decomposition on Phase 2\.*GT deps*row isolates the verifier’s structural check\.*E1*rows score directly\-prompted GPT\-4o against ground\-truth𝐷𝑒𝑝\\mathit\{Dep\}tuples \(no pipeline\)\.*Pipeline*rows score the pipeline at three model scales\. All against post\-T13 GT\. E1b prompt\-schema ablation and E1cResolve\-stage extension in[Appendices˜I](https://arxiv.org/html/2605.14175#A9)and[J](https://arxiv.org/html/2605.14175#A10)\. Bold: per\-column best among LLM\-extracted rows\. “—”: column does not apply\.Direct verifier evaluation\.The headline finding:𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}catches stale\-premise advice the baseline misses, and abstains rather than confabulates on under\-supported queries\. We instrument𝖵𝖾𝗋𝗂𝖿𝗒\(c,𝒟t\)\\mathsf\{Verify\}\(c,\\mathcal\{D\}\_\{t\}\)against an LLM\-only baseline on a 50\-item Phase 2 test set across four categories: 15*actual*continuations \(grounded\), 15*stale*claims \(re\-assertions after T9 abandonment\), 10*cross\-conversation*negatives, and 10*counterfactuals*\.𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}walks𝒟t\\mathcal\{D\}\_\{t\}from the candidate’s premise\. The baseline sees the same transcript and returns one\-word grounded/ungrounded\. Author labels were checked by an independent annotator on a blinded 20\-item subset: Cohen’sκ=0\.733\\kappa\\\!=\\\!0\.733, above the substantial\-agreement threshold\.

[Table˜3](https://arxiv.org/html/2605.14175#S4.T3)reports the result\.𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}catches all three ungrounded categories perfectly \(35/3535/35pooled\), including the canonical case e2\_030 att=13t\\\!=\\\!13\(“Auth should switch to in\-process JWT validation to remove its Redis dependency for sessions”\): an action recommendation whose load\-bearing premise is exactly the abandonedh2h\_\{2\}\.𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}walks the dependency chain, finds the LLM\-aligned hypothesis withstatus=abandoned\\mathrm\{status\}\\\!=\\\!\\mathrm\{abandoned\}, and returns ungrounded\. The baseline reads it as plausible engineering advice without checking whether its premise stands\. This is the textbook stale\-claim case for the soundness–faithfulness decomposition\. The baseline matches on cross\-conversation and counterfactual \(topic shift and explicit contradiction are surface\-detectable\) and misses only e2\_030 on stale\. The verifier’s single loss \(e2\_015\) is a documented conservative case: when no single entity in𝒟t\\mathcal\{D\}\_\{t\}underlies the claim \(𝑎𝑠𝑠𝑒𝑟𝑡𝑠\_𝑖𝑑=∅\\mathit\{asserts\\\_id\}\\\!=\\\!\\varnothing, multi\-entity meta\-reasoning\),𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}returns ungrounded by design\. The pooled stale\+counterfactual headline is a near\-ceiling tie \(LLM\-only baseline at96%96\\%pooled\)\. The architectural signal is sharper at the category level:\+6\.7\+6\.7pp on stale claims, the e2\_030 canonical case, and the abstention behaviour the baseline lacks\. A transcript\-RAG baseline matched on top\-k=5k\\\!=\\\!5over the same Phase 2 turns gives24/2524/25pooled \(the same e2\_030 miss as the LLM\-only baseline\), preserving the verifier’s\+4\.0\+4\.0pp pooled /\+6\.7\+6\.7pp stale advantage at matched selective retrieval\. A full\-window variant \(top\-k≥13k\\\!\\geq\\\!13, all available turns plus per\-turn similarity\-score annotations\) lifts the baseline to25/2525/25pooled, matching the verifier\. The verifier’s E2 advantage is therefore best characterised as*selective retrieval with a structural soundness guarantee*: dep\-walk is structurally guaranteed to surface the abandonment evidence at any retrieval budget, whereas similarity\-based retrieval is matched only at full\-window with similarity\-score framing\.

Table 3:Direct verifier evaluation \(E2\)\. Per\-category accuracy of𝖵𝖾𝗋𝗂𝖿𝗒\(c,𝒟t\)\\mathsf\{Verify\}\(c,\\mathcal\{D\}\_\{t\}\)vs\. LLM\-only baseline and two transcript\-RAG variants on the 50\-item Phase 2 test set\.*TR \(k=5k\\\!=\\\!5\)*: selective retrieval matched to dep\-walk depth\.*TR \(k=20k\\\!=\\\!20\)*: full\-window\. Cohen’sκ=0\.733\\kappa=0\.733on the 20\-item independent overlap\.Robustness to extraction noise \(diagnostic\)\.We perturb the GT Phase 2 state on the 50\-item E2 set \(per\-item truncation attt, 10 seeds at eachε∈\[0,0\.8\]\\varepsilon\\in\[0,0\.8\]\) under three independent noise models \([Figure˜4](https://arxiv.org/html/2605.14175#A11.F4),[Appendix˜K](https://arxiv.org/html/2605.14175#A11)\)\.*Random dependency\-edge drop*and*add*are flat at49/5049/50across allε\\varepsilon:88%88\\%of items decide beforewalk\_depsruns \(via entity\-resolution, hypothesis\-status, or observation/awareness leaves\), characterising the E2 set as low\-sensitivity to edge\-traversal accuracy and locating𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}’s win mechanism elsewhere\.*Lifecycle/status corruption*produces the meaningful curve: pooled accuracy0\.98→0\.640\.98\\\!\\to\\\!0\.64asε:0→0\.8\\varepsilon\\\!:0\\\!\\to\\\!0\.8, with stale\-claim accuracy \(the architectural\-win category in[Table˜3](https://arxiv.org/html/2605.14175#S4.T3)\) falling1\.00→0\.271\.00\\\!\\to\\\!0\.27\. The verifier matches the LLM\-only pooled\-50 baseline \(0\.980\.98\) atε≈0\.05\\varepsilon\\\!\\approx\\\!0\.05\. Observed boundary errors on existing Phase 2 pipeline runs concentrate onh2h\_\{2\}’s abandonment \(∼5%\{\\sim\}5\\%aggregate\)\. The bottleneck for𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}’s empirical win mechanism is therefore faithfulRevise/Resolve/abandonment\\textsc\{Revise\}/\\textsc\{Resolve\}/\\text\{abandonment\}extraction, not ordinary dep\-edge recall\.

Operational envelope and external benchmarks\.The verifier’s load\-bearing capability is the dependency graph: a record of which claims rest on which observations, hypotheses, and earlier conclusions, with a notion of*current standing*that distinguishes “mentioned earlier” from “established and not retracted\.” Three structural features motivate the design:*\(a\)*a claim’s validity depends on the standing of an earlier claim, not just its mention \(a continuation under abandonedh2h\_\{2\}is wrong even thoughh2h\_\{2\}remains in the transcript\);*\(b\)*the right answer is sometimes “insufficient support,” so the verifier should abstain rather than confabulate;*\(c\)*retraction has selective downstream consequences\.*LoCoMo*\(Maharanaet al\.,[2024](https://arxiv.org/html/2605.14175#bib.bib34)\)\(entity\-relation recall\) and*LongMemEval\-KU*atoracle\(Wuet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib36)\)\(knowledge update with supersession\) sit at different points on this envelope: LongMemEval\-KU is closer to feature*\(a\)*than LoCoMo\.

Rendering and retrieval ablation\.The deployment\-time choice is the engine\-state rendering: the published cached run injected only the dependency\-map JSON \(\{h1:\[o3,o5\],…\}\\\{h\_\{1\}\\\!:\\\!\[o\_\{3\},o\_\{5\}\],\\ldots\\\}, opaque IDs without content\), forcing the QA model into abstention when the truncated transcript also lost the relevant fact\.[Table˜4](https://arxiv.org/html/2605.14175#S4.T4)reports a three\-mode ablation on LoCoMo: dep\-map\-only loses−10\.2\-10\.2pp; content\-bearing rendering \(hypothesis/observation content \+ per\-item turn\-date attribution\) reaches\+17\.2\+17\.2pp; content \+ RAG retrieval over engine items reaches\+18\.5\+18\.5pp\. The same modes on LongMemEval\-KU yieldΔ=−1\.3,0,\+1\.3\\Delta\\\!=\\\!\-1\.3,0,\+1\.3pp \(87\.2%→88\.5%→89\.7%87\.2\\%\\\!\\to\\\!88\.5\\%\\\!\\to\\\!89\.7\\%accuracy on7878oracleitems\); content\+\+retrieval is the first configuration to exceed the LLM\-only baseline\. To isolate the engine’s contribution beyond retrieval, we add a matched transcript\-RAG baseline \(same encoder, same top\-k=20k\\\!=\\\!20, indexed over per\-turn transcript chunks\): on LoCoMo it matches the verifier \([Table˜4](https://arxiv.org/html/2605.14175#S4.T4)\), consistent with the interactional\-grounding focus and feature*\(a\)*\. On LongMemEval\-KU it falls below the LLM\-only baseline, leaving the verifier the only configuration to exceed both \([Table˜5](https://arxiv.org/html/2605.14175#S4.T5)\), since lifecycle/status labels disambiguate current from superseded values where similarity surfaces both\. Atn=78n\\\!=\\\!78the directional deltas are not statistically significant \(McNemar exactp=0\.625p\\\!=\\\!0\.625on33vs\.11discordant pairs for verifier vs\. transcript\-RAG, and Wilson95%95\\%CIs overlap\), and on LoCoMo the verifier and transcript\-RAG are statistically indistinguishable \(55vs\.66discordant pairs,p≈1\.0p\\\!\\approx\\\!1\.0\)\. The formal framework \(lifecycle, dependency tracking, retraction soundness\) is unchanged across modes\. Rendering and retrieval are deployment\-time tunables\. The architectural signal that LongMemEval\-KU was designed to surface \(correct abstention via feature*\(b\)*\) survives the headline win:33of44verifier wins among the original99disagreements were correct abstentions on\_absitems where the long\-context baseline confabulated, the LongMemEval analogue of e2\_030\. Per\-category tables and the Qwen\-7B 16K truncation analysis are in[Appendices˜G](https://arxiv.org/html/2605.14175#A7)and[H](https://arxiv.org/html/2605.14175#A8)\.

Table 4:LoCoMo official 60\-item QA: rendering\-mode ablation and transcript\-RAG baseline \(GPT\-4o\)\.*dep\-map only*: opaque IDs\{hi:\[oj,ok\],…\}\\\{h\_\{i\}\\\!:\\\!\[o\_\{j\},o\_\{k\}\],\\ldots\\\};*\+\+content*: hypothesis/observation content with session\-date attribution;*\+\+content\+\+retrieval*: top\-k=20k\\\!=\\\!20RAG over engine items;*transcript\-RAG*: same setup, indexed over per\-turn transcript chunks \(no engine state\)\. Per\-category F1 in[Appendix˜G](https://arxiv.org/html/2605.14175#A7)\.Table 5:LongMemEval\-KU oracle 78\-item evaluation \(GPT\-4o\)\.*transcript\-RAG*: matched encoder and top\-k=20k\\\!=\\\!20, indexed over per\-turn transcript chunks \(no engine state\)\. The verifier \(\+\+content\+\+retrieval\) is the only configuration to exceed both baselines; per\-category breakdown in[Appendix˜H](https://arxiv.org/html/2605.14175#A8)\.Reproducibility and deployment scope\.All prompts, conversations, annotations, and evaluation code are in a single runnable script \([Appendix˜M](https://arxiv.org/html/2605.14175#A13); deployment\-time error composition in[Appendix˜F](https://arxiv.org/html/2605.14175#A6)\)\. Claude Sonnet 4 reachesF1=0\.91\\mathrm\{F1\}\\\!=\\\!0\.91classification \([Table˜1](https://arxiv.org/html/2605.14175#S4.T1)\) and the structural check is correct on12/1212/12retraction queries given a faithful𝒟t\\mathcal\{D\}\_\{t\}\([Table˜2](https://arxiv.org/html/2605.14175#S4.T2)\); the remaining gap localises to extraction recall, not the verifier\.

## 5Related Work

Retrieval\-augmented attribution\.ALCE\(Gaoet al\.,[2023](https://arxiv.org/html/2605.14175#bib.bib37)\), AutoAIS\(Bohnetet al\.,[2022](https://arxiv.org/html/2605.14175#bib.bib38)\), and MTRAG\(Katsiset al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib40)\)evaluate whether a generated claim is supported by a cited*external*source\. The verifier targets the dual problem of*interactional*grounding, where supporting evidence is the conversation’s own earlier turns\. The two paradigms are complementary\.Claim traceability and verifiable grounding\.VISTA\(Zhanget al\.,[2026b](https://arxiv.org/html/2605.14175#bib.bib46)\)parses a conversation post\-hoc into a Reasoning Dependency Tree for offline visualisation\. The runtime verifier maintains a comparable structure incrementally and queries it under a tight latency budget\. Context\- and memory\-manipulation attacks against deployed agents\(Patlanet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib44); Donget al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib45)\)succeed precisely because no runtime structural binding ties an agent’s continuation to its actual interaction history\.[Definition˜2\.2](https://arxiv.org/html/2605.14175#S2.Thmdefinition2)and[Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1)together specify the structural property such defences presuppose, with\(𝐴𝑟𝑔𝑠t,𝐷𝑒𝑝t\)\(\\mathit\{Args\}\_\{t\},\\mathit\{Dep\}\_\{t\}\)realising it as a queryable substrate\. AgentArmor\(Wanget al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib2)\)is the closest systems\-architecture neighbour, building a Program Dependence Graph over an LLM agent’s runtime tool\-call trace to enforce security policies\. That PDG and our claim\-dependency map are complementary abstractions over different aspects of agent behaviour\.Memory and state tracking\.Recent memory\-mechanism work \(seeZhang and others \([2024](https://arxiv.org/html/2605.14175#bib.bib26)\)for a survey\) clusters into three families: environment\-state tracking\(Parket al\.,[2023](https://arxiv.org/html/2605.14175#bib.bib25); Chenet al\.,[2023](https://arxiv.org/html/2605.14175#bib.bib22); Tanget al\.,[2024](https://arxiv.org/html/2605.14175#bib.bib23); Zhouet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib24)\), belief\-state tracking\(Kimet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib31)\), and token\-efficient fact compression\(Chhikaraet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib47)\)\. None maintains the dependency structure that𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}and𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\\mathit\{Affected\}require\.Formal foundations\.DEL and awareness logic\(Faginet al\.,[1995](https://arxiv.org/html/2605.14175#bib.bib5); van Ditmarschet al\.,[2007](https://arxiv.org/html/2605.14175#bib.bib7); van Benthem,[2007](https://arxiv.org/html/2605.14175#bib.bib9); Baltaget al\.,[2014](https://arxiv.org/html/2605.14175#bib.bib10); Fagin and Halpern,[1988](https://arxiv.org/html/2605.14175#bib.bib6); Velázquez\-Quesadaet al\.,[2013](https://arxiv.org/html/2605.14175#bib.bib8); Doutreet al\.,[2014](https://arxiv.org/html/2605.14175#bib.bib15)\), model checking\(Lomuscioet al\.,[2017](https://arxiv.org/html/2605.14175#bib.bib19); van der Meyden and Su,[2004](https://arxiv.org/html/2605.14175#bib.bib20)\), IBIS argumentation\(Kunz and Rittel,[1970](https://arxiv.org/html/2605.14175#bib.bib17); Conklin,[2005](https://arxiv.org/html/2605.14175#bib.bib18)\), and context engineering\(Anthropic,[2025](https://arxiv.org/html/2605.14175#bib.bib32); Zhanget al\.,[2026a](https://arxiv.org/html/2605.14175#bib.bib33)\)all underpin pieces of the verifier\. We provide formal semantics for runtime grounding in this combination\.

Closest related work\.D\-SMART\(Leiet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib27)\)incrementally constructs an OWL knowledge graph and reasons over it via tree search\. The verifier differs in object: D\-SMART tracks entity\-relation triples \(“what was asserted”\) while the verifier tracks how conclusions are reached: hypothesis lifecycles, decisions with dissent, claim\-to\-evidence chains\. The Phase 2 illustration in[Appendix˜C](https://arxiv.org/html/2605.14175#A3)\(causal\-reversalh2→h4h\_\{2\}\\to h\_\{4\}tracked through dependency reversal\) lies outside D\-SMART’s representational scope\. Zep\(Rasmussenet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib28)\)adds temporal validity but not assumption dependencies\. DEL for Dialogue Friction\(Obisoet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib29)\)applies the same formal tools to measure cognitive friction\. Joint Human\-AI Reasoning\(Bezou\-Vrakatseliet al\.,[2024](https://arxiv.org/html/2605.14175#bib.bib30)\)*prescribes*argumentation protocols for human\-LLM inquiry\. The verifier*checks*grounding of naturalistic conversations\. Full feature comparison in[Appendix˜A](https://arxiv.org/html/2605.14175#A1)\.

## 6Discussion and Conclusion

The contribution is a composable formal substrate \([Algorithm˜1](https://arxiv.org/html/2605.14175#alg1)\) answering a deployment question existing methods do not:*is the next LLM output grounded in the conversation’s earlier commitments?*[Propositions˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1)and[2\.2](https://arxiv.org/html/2605.14175#S2.Thmproposition2)establish soundness of selective retraction and per\-turn tractability\.

Threat model\.The verifier provides the structural mechanism context\-manipulation defences\(Patlanet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib44); Donget al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib45)\)presuppose: claims unsupported by𝒟t\\mathcal\{D\}\_\{t\}are caught\. Corrupted Interpreter inputs are not\. Defences against the latter \(signed\-trace integrity, compositional bounds\) are the natural extension\.

Limitations and future work\.Classification F1 degrades with smaller models \([Table˜1](https://arxiv.org/html/2605.14175#S4.T1)\)\. The engine catches type but not content errors\. LLM API latency dominates end\-to\-end \([Appendix˜F](https://arxiv.org/html/2605.14175#A6)\)\. Future work:Revise\-targeted extraction and human utility study\.

## References

- Effective context engineering for AI agents\.Note:Anthropic Engineering Blog[https://www\.anthropic\.com/engineering/effective\-context\-engineering\-for\-ai\-agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- G\. Aucher and F\. Schwarzentruber \(2013\)On the complexity of dynamic epistemic logic\.InProceedings of the 14th Conference on Theoretical Aspects of Rationality and Knowledge \(TARK 2013\),Cited by:[Appendix E](https://arxiv.org/html/2605.14175#A5.SSx2.9.p5.2),[Proposition 2\.2](https://arxiv.org/html/2605.14175#S2.Thmproposition2.p1.9.9),[§2](https://arxiv.org/html/2605.14175#S2.p6.5)\.
- A\. Baltag, V\. Fiutek, and S\. Smets \(2014\)DDL as an “internalization” of dynamic belief revision\.InKrister Segerberg on Logic of Action,R\. Trypuz \(Ed\.\),Outstanding Contributions to Logic,pp\. 253–280\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- A\. Baltag, L\. S\. Moss, and S\. Solecki \(1998\)The logic of public announcements, common knowledge, and private suspicions\.InProceedings of the 7th Conference on Theoretical Aspects of Rationality and Knowledge \(TARK 98\),pp\. 43–56\.Cited by:[Appendix E](https://arxiv.org/html/2605.14175#A5.SSx1.p2.7)\.
- A\. Baltag and S\. Smets \(2008\)A qualitative theory of dynamic interactive belief revision\.InLogic and the Foundations of Game and Decision Theory \(LOFT 7\),G\. Bonanno, W\. van der Hoek, and M\. Wooldridge \(Eds\.\),Texts in Logic and Games, Vol\.3,pp\. 9–58\.Cited by:[Definition 2\.1](https://arxiv.org/html/2605.14175#S2.Thmdefinition1.p1.11.11)\.
- E\. Bezou\-Vrakatseli, O\. Cocarascu, and S\. Modgil \(2024\)Towards dialogues for joint human–AI reasoning and value alignment\.Note:Argumentation protocols for human–LLM inquiryExternal Links:2405\.18073Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p2.1)\.
- B\. Bohnet, V\. Q\. Tran, P\. Verga, R\. Aharoni, D\. Andor, L\. B\. Soares, J\. Eisenstein, K\. Ganchev, J\. Herzig, K\. Hui,et al\.\(2022\)Attributed question answering: evaluation and modeling for attributed large language models\.arXiv preprint arXiv:2212\.08037\.Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p4.1),[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- P\. Budzianowski, T\. Wen, B\. Tseng, I\. Casanueva, S\. Ultes, O\. Ramadan, and M\. Gasic \(2018\)Multiwoz\-a large\-scale multi\-domain wizard\-of\-oz dataset for task\-oriented dialogue modelling\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 5016–5026\.Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p2.1)\.
- S\. Chen, A\. Xiao, and D\. Hsu \(2023\)LLM\-State: open world state representation for long\-horizon task planning with large language model\.arXiv preprint arXiv:2311\.17406\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav \(2025\)Mem0: building production\-ready AI agents with scalable long\-term memory\.InProceedings of the 28th European Conference on Artificial Intelligence \(ECAI\),pp\. 2993–3000\.External Links:[Document](https://dx.doi.org/10.3233/FAIA251160)Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p4.1),[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- J\. Conklin \(2005\)Dialogue mapping: building shared understanding of wicked problems\.Wiley\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- S\. Dong, S\. Xu, P\. He, Y\. Li, J\. Tang, T\. Liu, H\. Liu, and Z\. Xiang \(2025\)A practical memory injection attack against LLM agents\.arXiv preprint arXiv:2503\.03704\.Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p3.1),[§5](https://arxiv.org/html/2605.14175#S5.p1.3),[§6](https://arxiv.org/html/2605.14175#S6.p2.1)\.
- S\. Doutre, A\. Herzig, and L\. Perrussel \(2014\)A dynamic logic framework for abstract argumentation\.InProceedings of the Fourteenth International Conference on Principles of Knowledge Representation and Reasoning \(KR 2014\),Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- P\. M\. Dung \(1995\)On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming andnn\-person games\.Artificial Intelligence77\(2\),pp\. 321–357\.Cited by:[Appendix E](https://arxiv.org/html/2605.14175#A5.SSx2.3.p3.7),[Appendix E](https://arxiv.org/html/2605.14175#A5.SSx2.7.p3.6),[Definition 2\.1](https://arxiv.org/html/2605.14175#S2.Thmdefinition1.p1.11.11),[Proposition 2\.2](https://arxiv.org/html/2605.14175#S2.Thmproposition2.p1.9.9),[§2](https://arxiv.org/html/2605.14175#S2.p9.2)\.
- P\. E\. Dunne and M\. Wooldridge \(2009\)Complexity of abstract argumentation\.InArgumentation in Artificial Intelligence,G\. R\. Simari and I\. Rahwan \(Eds\.\),pp\. 85–104\.Cited by:[Appendix E](https://arxiv.org/html/2605.14175#A5.SSx2.9.p5.2),[Proposition 2\.2](https://arxiv.org/html/2605.14175#S2.Thmproposition2.p1.9.9)\.
- P\. E\. Dunne \(2007\)Computational properties of argument systems satisfying graph\-theoretic constraints\.Artificial Intelligence171\(10–15\),pp\. 701–729\.Cited by:[Appendix E](https://arxiv.org/html/2605.14175#A5.SSx2.7.p3.6),[Proposition 2\.2](https://arxiv.org/html/2605.14175#S2.Thmproposition2.p1.9.9)\.
- R\. Fagin, J\. Y\. Halpern, Y\. Moses, and M\. Y\. Vardi \(1995\)Reasoning about knowledge\.MIT Press,Cambridge, MA\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- R\. Fagin and J\. Y\. Halpern \(1988\)Belief, awareness, and limited reasoning\.Artificial Intelligence34\(1\),pp\. 39–76\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- T\. Gao, H\. Yen, J\. Yu, and D\. Chen \(2023\)Enabling large language models to generate text with citations\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 6465–6488\.Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p4.1),[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- X\. Huang, C\. Luo, and R\. van der Meyden \(2011\)Improved bounded model checking for a fair branching\-time temporal epistemic logic\.InModel Checking and Artificial Intelligence \(MoChArt 2010\),Lecture Notes in Computer Science, Vol\.6572,pp\. 95–111\.Note:Feasibility of bounded epistemic model checkingCited by:[§4\.1](https://arxiv.org/html/2605.14175#S4.SS1.p1.2)\.
- Y\. Katsis, S\. Rosenthal, K\. Fadnis, C\. Gunasekara, Y\. Lee, L\. Popa, V\. Shah, H\. Zhu, D\. Contractor, and M\. Danilevsky \(2025\)mtRAG: a multi\-turn conversational benchmark for evaluating retrieval\-augmented generation systems\.Transactions of the Association for Computational Linguistics13,pp\. 784–808\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- D\. Kelly, Y\. Chen, S\. E\. Cornwell, N\. S\. Delellis, A\. Mayhew, S\. Onaolapo, and V\. L\. Rubin \(2023\)Bing chat: the future of search engines?\.Proceedings of the Association for Information Science and Technology60\(1\),pp\. 1007–1009\.Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p2.1)\.
- J\. Kim, S\. Rhee, M\. Kim, D\. Kim, S\. Lee, Y\. Sung, and K\. Jung \(2025\)ReflAct: world\-grounded decision making in LLM agents via goal\-state reflection\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Suzhou, China,pp\. 33433–33465\.Note:Tracks aspects of epistemic state in LLM agentsExternal Links:2505\.15182Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- W\. Kunz and H\. W\. J\. Rittel \(1970\)Issues as elements of information systems\.Technical reportTechnical Report131,Institute of Urban and Regional Development, University of California, Berkeley\.Cited by:[Proposition 2\.2](https://arxiv.org/html/2605.14175#S2.Thmproposition2.p1.9.9),[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- P\. Laban, H\. Hayashi, Y\. Zhou, and J\. Neville \(2026\)LLMs get lost in multi\-turn conversation\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p2.1)\.
- X\. Lei, Q\. Li, and M\. Zhang \(2025\)D\-SMART: enhancing LLM dialogue consistency via dynamic structured memory and reasoning tree\.arXiv preprint arXiv:2510\.13363\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p2.1)\.
- A\. Lomuscio, H\. Qu, and F\. Raimondi \(2017\)MCMAS: an open\-source model checker for the verification of multi\-agent systems\.International Journal on Software Tools for Technology Transfer19\(1\),pp\. 9–30\.Note:Tool originally presented at CAV 2009; often cited with earlier datesCited by:[§4\.1](https://arxiv.org/html/2605.14175#S4.SS1.p1.2),[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\)Evaluating very long\-term conversational memory of llm agents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13851–13870\.Cited by:[Appendix M](https://arxiv.org/html/2605.14175#A13.p2.1),[§1](https://arxiv.org/html/2605.14175#S1.SS0.SSS0.Px1.p2.1),[§4\.2](https://arxiv.org/html/2605.14175#S4.SS2.p1.1),[§4\.2](https://arxiv.org/html/2605.14175#S4.SS2.p8.2)\.
- T\. Obiso, K\. Lai, A\. Nath, N\. Krishnaswamy, and J\. Pustejovsky \(2025\)Dynamic epistemic friction in dialogue\.InProceedings of the 29th Conference on Computational Natural Language Learning \(CoNLL 2025\),Vienna, Austria,pp\. 323–333\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p2.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\)Generative agents: interactive simulacra of human behavior\.InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology \(UIST ’23\),pp\. 1–22\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- A\. S\. Patlan, P\. Sheng, S\. A\. Hebbar, P\. Mittal, and P\. Viswanath \(2025\)Real AI agents with fake memories: fatal context manipulation attacks on Web3 agents\.arXiv preprint arXiv:2503\.16248\.Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p3.1),[§5](https://arxiv.org/html/2605.14175#S5.p1.3),[§6](https://arxiv.org/html/2605.14175#S6.p2.1)\.
- P\. Rasmussen, P\. Paliychuk, T\. Beauvais, J\. Ryan, and D\. Chalef \(2025\)Zep: a temporal knowledge graph architecture for agent memory\.arXiv preprint arXiv:2501\.13956\.Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p4.1),[§5](https://arxiv.org/html/2605.14175#S5.p2.1)\.
- O\. Shaikh, H\. Mozannar, G\. Bansal, A\. Fourney, and E\. Horvitz \(2025\)Navigating rifts in human\-LLM grounding: study and benchmark\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 20832–20847\.Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p2.1)\.
- H\. Tang, D\. Y\. Key, and K\. Ellis \(2024\)WorldCoder, a model\-based LLM agent: building world models by writing code and interacting with the environment\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Note:arXiv:2402\.12275Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- J\. van Benthem \(2007\)Dynamic logic for belief revision\.Journal of Applied Non\-Classical Logics17\(2\),pp\. 129–155\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- R\. van der Meyden and K\. Su \(2004\)Symbolic model checking the knowledge of the dining cryptographers\.InProceedings of the 17th IEEE Computer Security Foundations Workshop,Note:MCK model checkerCited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- H\. van Ditmarsch, W\. van der Hoek, and B\. Kooi \(2007\)Dynamic epistemic logic\.Synthese Library, Vol\.337,Springer,Dordrecht\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- F\. R\. Velázquez\-Quesada, F\. Soler\-Toscano, and Á\. Nepomuceno\-Fernández \(2013\)An epistemic and dynamic approach to abductive reasoning: abductive problem and abductive solution\.Journal of Applied Logic11\(4\),pp\. 505–522\.Cited by:[Appendix C](https://arxiv.org/html/2605.14175#A3.SS0.SSS0.Px3.p16.13),[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- D\. Walton, C\. Reed, and F\. Macagno \(2008\)Argumentation schemes\.Cambridge University Press\.Cited by:[Definition 2\.1](https://arxiv.org/html/2605.14175#S2.Thmdefinition1.p1.11.11)\.
- P\. Wang, Y\. Liu, Y\. Lu, Y\. Cai, H\. Chen, Q\. Yang, J\. Zhang, J\. Hong, and Y\. Wu \(2025\)AgentArmor: enforcing program analysis on agent runtime trace to defend against prompt injection\.arXiv preprint arXiv:2508\.01249\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2025\)LongMemEval: benchmarking chat assistants on long\-term interactive memory\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[Appendix M](https://arxiv.org/html/2605.14175#A13.p2.1),[Appendix H](https://arxiv.org/html/2605.14175#A8.SS0.SSS0.Px1.p1.20),[§1](https://arxiv.org/html/2605.14175#S1.SS0.SSS0.Px1.p2.1),[§4\.2](https://arxiv.org/html/2605.14175#S4.SS2.p1.1),[§4\.2](https://arxiv.org/html/2605.14175#S4.SS2.p8.2)\.
- Q\. Zhang, C\. Hu, S\. Upasani, B\. Ma, F\. Hong, V\. Kamanuru, J\. Rainton, C\. Wu, M\. Ji, H\. Li, U\. Thakker, J\. Zou, and K\. Olukotun \(2026a\)Agentic context engineering: learning comprehensive contexts for self\-improving language models\.InInternational Conference on Learning Representations \(ICLR\),External Links:2510\.04618Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- Y\. Zhang, M\. Lin, M\. Dras, and U\. Naseem \(2026b\)Beyond the black box: demystifying multi\-turn LLM reasoning with VISTA\.InProceedings of the Fortieth AAAI Conference on Artificial Intelligence \(AAAI\),Note:arXiv:2511\.10182Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- Z\. Zhanget al\.\(2024\)A survey on the memory mechanism of large language model based agents\.arXiv preprint arXiv:2404\.13501\.Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p4.1),[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.
- W\. Zhao, X\. Ren, J\. Hessel, C\. Cardie, Y\. Choi, and Y\. Deng \(2024\)Wildchat: 1m chatgpt interaction logs in the wild\.arXiv preprint arXiv:2405\.01470\.Cited by:[§1](https://arxiv.org/html/2605.14175#S1.p2.1)\.
- S\. Zhou, T\. Zhou, Y\. Yang, G\. Long, D\. Ye, J\. Jiang, and C\. Zhang \(2025\)WALL\-E 2\.0: world alignment by neurosymbolic learning improves world model\-based LLM agents\.arXiv preprint arXiv:2504\.15785\.Cited by:[§5](https://arxiv.org/html/2605.14175#S5.p1.3)\.

## Appendix AFive\-Dimension Comparison with Related Work

[Table˜6](https://arxiv.org/html/2605.14175#A1.T6)compares the verifier with related work across five dimensions, based on each system’s published representation; empirical head\-to\-head evaluation on shared benchmarks remains future work\.

Table 6:Comparison with related work across five dimensions\.*Hyp\. lifecycle:*active/weakened/abandoned tracking\.*Dep\. tracking:*inter\-turn support edges\. ALCE and MTRAG handle citation to external corpora and to prior turns respectively, but do not maintain a dependency structure between claims\.
## Appendix BRunning Example: Phase 1: Muddy Children

[Figure˜3](https://arxiv.org/html/2605.14175#A2.F3)traces the model through the four turns of the muddy children puzzle; the per\-turn natural\-language transcript follows\.

ℳ0\\mathcal\{M\}\_\{0\}:\|W0\|=8\|W\_\{0\}\|=8CCCCCMCMCCMMMCCMCMMMMMMCw∗w^\{\*\}ℳ1\\mathcal\{M\}\_\{1\}:\|W1\|=7\|W\_\{1\}\|=7CCMCMCCMMMCCMCMMMMMMCℳ2\\mathcal\{M\}\_\{2\}:\|W2\|=4\|W\_\{2\}\|=4CMMMCMMMMMMCℳ3\\mathcal\{M\}\_\{3\}:\|W3\|=1\|W\_\{3\}\|=1MMC\[ma∨mb∨mc\]\[\{m\_\{a\}\\\!\\lor\\\!m\_\{b\}\\\!\\lor\\\!m\_\{c\}\}\]3×\\timesObserve“I don’t know”3×\\timesResolve“I’m muddy/clean”Figure 3:Layer 1 \(DEL\) trajectory through the muddy children puzzle\. Each panel shows the world setWtW\_\{t\}jointly considered possible afterttrounds: 8 worlds \(one per truth assignment ofma,mb,mcm\_\{a\},m\_\{b\},m\_\{c\}\), then 7, 4, 1\. The actual worldw∗=MMCw^\{\*\}=MMC\(Alice and Bob muddy, Carol clean\) is double\-circled; eliminated worlds are dropped, but live worlds keep their grid positions for visual continuity\. The Father’s announcement eliminatesCCCCCC; three “I don’t know”s eliminate the singleton\-muddy worlds; threeResolves narrow tow∗w^\{\*\}\. The engine maintains exactly the live set at each turn\.We present the muddy children puzzle as a conversation with the full epistemic model traced at each turn\. Three children, Alice \(aa\), Bob \(bb\), Carol \(cc\), have been playing outside\. Alice and Bob are muddy; Carol is clean\. Each child can see the others’ foreheads but not their own\.

#### Propositions\.

mam\_\{a\},mbm\_\{b\},mcm\_\{c\}\(“childxxis muddy”\)\. The actual world isw∗=\(ma,mb,¬mc\)w^\{\*\}=\(m\_\{a\},m\_\{b\},\\lnot m\_\{c\}\), abbreviated𝑀𝑀𝐶\\mathit\{MMC\}\.

#### Initial modelℳ0\\mathcal\{M\}\_\{0\}\.

W0=\{MMM,MMC,MCM,MCC,CMM,CMC,CCM,CCC\}W\_\{0\}=\\\{MMM,MMC,MCM,MCC,CMM,CMC,CCM,CCC\\\}\. Each childiihasw∼iw′w\\sim\_\{i\}w^\{\\prime\}iffw,w′w,w^\{\\prime\}agree onmjm\_\{j\}for allj≠ij\\neq i\. For Alice \(sees Bob=M, Carol=C\):∼a\\sim\_\{a\}groups\{MMC,CMC\}\\\{MMC,CMC\\\},\{MMM,CMM\}\\\{MMM,CMM\\\},\{MCM,CCM\}\\\{MCM,CCM\\\},\{MCC,CCC\}\\\{MCC,CCC\\\}\.

Turn 0: Father:Children, I can see that at least one of you has mud on your forehead\.

Operation:Observe\(public announcementψ0=ma∨mb∨mc\\psi\_\{0\}=m\_\{a\}\\lor m\_\{b\}\\lor m\_\{c\}\)\. Update:EliminateCCCCCC\.Modelℳ1\\mathcal\{M\}\_\{1\}:\|W1\|=7\|W\_\{1\}\|=7\. Common knowledge:𝐂abc\(ma∨mb∨mc\)\\mathbf\{C\}\_\{abc\}\(m\_\{a\}\\lor m\_\{b\}\\lor m\_\{c\}\)\. Individual knowledge:No child knows their own status\.

Turn 1a: Alice:I don’t know\.Turn 1b: Bob:I don’t know\.Turn 1c: Carol:I don’t know\.

Operations:ThreeObserve\(“I don’t know” is a public announcement\)\. Reasoning:If any child saw two clean faces, they would know they are muddy \(since at least one must be\)\. Nobody knew, so all single\-muddy worlds are eliminated\. Modelℳ2\\mathcal\{M\}\_\{2\}:W2=\{MMM,MMC,MCM,CMM\}W\_\{2\}=\\\{MMM,MMC,MCM,CMM\\\},\|W2\|=4\|W\_\{2\}\|=4\. Common knowledge:At least two children are muddy\.

Individual knowledge afterℳ2\\mathcal\{M\}\_\{2\}:

- •Alice sees Bob=M, Carol=C\. Consistent worlds inW2W\_\{2\}:\{MMC\}\\\{MMC\\\}\.*Alice knows she is muddy\.*
- •Bob sees Alice=M, Carol=C\. Consistent worlds inW2W\_\{2\}:\{MMC\}\\\{MMC\\\}\.*Bob knows he is muddy\.*
- •Carol sees Alice=M, Bob=M\. Consistent worlds inW2W\_\{2\}:\{MMC,MMM\}\\\{MMC,MMM\\\}\. Carol doesn’t know yet, but can deduce from Alice’s and Bob’s round\-2 answers\.

Turn 2a: Alice:Yes, I’m muddy\.Turn 2b: Bob:Yes, I’m muddy\.Turn 2c: Carol:Yes, I’m clean\.

Operations:ThreeResolve\. Modelℳ3\\mathcal\{M\}\_\{3\}:W3=\{MMC\}W\_\{3\}=\\\{MMC\\\}\. Complete knowledge\.𝐂abc\(ma∧mb∧¬mc\)\\mathbf\{C\}\_\{abc\}\(m\_\{a\}\\land m\_\{b\}\\land\\lnot m\_\{c\}\)\.

#### Verification\.

At each step, the system’s model must matchℳ0→ℳ1→ℳ2→ℳ3\\mathcal\{M\}\_\{0\}\\to\\mathcal\{M\}\_\{1\}\\to\\mathcal\{M\}\_\{2\}\\to\\mathcal\{M\}\_\{3\}\. The LLM classifier must label Turn 0 and Turns 1a–c asObserveand Turns 2a–c asResolve\.

## Appendix CRunning Example: Phase 2: System Debugging

Three engineers debug a cascading failure\. Alice \(backend engineer; can see Auth Service and Payment Service application logs\), Bob \(infrastructure engineer; can see Payment Service metrics, Redis metrics, and external service dashboards\), Carol \(on\-call SRE; can see the alerting dashboard and customer complaint tickets but needs Alice or Bob to dig into specifics\)\.

#### Ground truth \(known to us, not to the engineers\)\.

Yesterday’s deployment introduced a bug in Auth Service: the token refresh path computes expiry from Unix epoch instead of current time, causing refreshed tokens to be immediately expired\. The frontend retries expired tokens, generating 3×\\timesnormal traffic to Auth\. Every auth request hits the shared Redis cluster for session lookup, so the retry storm exhausts Redis’s connection pool\. Payment Service uses the same Redis cluster for rate\-limit counters; with Redis down, rate\-limit checks fail and requests go to Stripe unthrottled, triggering Stripe’s 429 rate\-limit responses\. Separately, a stale monitoring rule maps Redis connection errors to “database health degradation” alerts, creating a red herring\.

#### What makes this Phase 2\.

Unlike Phase 1, the hypothesis space is not given upfront, hypotheses are*constructed*during the conversation through abductive reasoning\. The conversation takes a wrong turn \(initially attributing auth failures to Redis\), evidence requires domain\-specific interpretation \(error code types\), and a causal direction reversal is the key epistemic shift\. The monitoring miscategorisation tests awareness expansion\.

#### Formal model specification\.

We define the symbolic model that the engine maintains alongside the conversation\. Unlike Phase 1 where the world set is fixed at2k2^\{k\}from the start, Phase 2 extends the model dynamically as hypotheses are generated and awareness expands\. The abductive mechanism implements[Definition˜E\.2](https://arxiv.org/html/2605.14175#A5.Thmdefinition2): the symbolic engine detects surprising observations \(ℳ,w⊧̸𝐁iχ\\mathcal\{M\},w\\not\\models\\mathbf\{B\}\_\{i\}\\chi\), the LLM generates candidate hypothesesγ\\gamma, and the engine verifies and integrates them via plausibility upgrade\.

*Agents:*𝐴𝑔𝑠=\{a,b,c\}\\mathit\{Ags\}=\\\{a,b,c\\\}\(Alice, Bob, Carol\)\.

*Propositions \(initial\):*The model begins with observable propositions only:𝑃𝑟𝑜𝑝0=\{p𝑎𝑢𝑡ℎ,p𝑝𝑎𝑦,p𝑑𝑏,p𝑟𝑒𝑑𝑖𝑠,p𝑡𝑜𝑘,p𝑡𝑟𝑎𝑓𝑓𝑖𝑐\}\\mathit\{Prop\}\_\{0\}=\\\{p\_\{\\mathit\{auth\}\},p\_\{\\mathit\{pay\}\},p\_\{\\mathit\{db\}\},p\_\{\\mathit\{redis\}\},p\_\{\\mathit\{tok\}\},p\_\{\\mathit\{traffic\}\}\\\}\(auth failures, payment failures, database alert, Redis issues, token errors, anomalous traffic\)\. Eachppcan be true/false/unobserved\.

*Awareness:*Initially,𝒜i\\mathcal\{A\}\_\{i\}contains only𝑃𝑟𝑜𝑝0\\mathit\{Prop\}\_\{0\}for all agents\. Crucially,𝑚𝑖𝑠\_𝑚𝑜𝑛𝑖𝑡𝑜𝑟\\mathit\{mis\\\_monitor\}\(monitoring miscategorises Redis as “database”\) is*not*in any𝒜i\\mathcal\{A\}\_\{i\}until T4\.

*Hypothesis propositions*are added to𝑃𝑟𝑜𝑝\\mathit\{Prop\}as they are generated:

- •T5:h1≔\(𝑟𝑒𝑑𝑖𝑠\_𝑑𝑜𝑤𝑛→𝑟𝑎𝑡𝑒\_𝑏𝑦𝑝𝑎𝑠𝑠→𝑠𝑡𝑟𝑖𝑝𝑒\_429\)h\_\{1\}\\coloneqq\(\\mathit\{redis\\\_down\}\\to\\mathit\{rate\\\_bypass\}\\to\\mathit\{stripe\\\_429\}\)added to𝑃𝑟𝑜𝑝\\mathit\{Prop\}\.
- •T6:h2≔\(𝑟𝑒𝑑𝑖𝑠\_𝑑𝑜𝑤𝑛→𝑎𝑢𝑡ℎ\_𝑓𝑎𝑖𝑙\)h\_\{2\}\\coloneqq\(\\mathit\{redis\\\_down\}\\to\\mathit\{auth\\\_fail\}\)added\.
- •T11:h3≔\(𝑡𝑜𝑘\_𝑏𝑢𝑔→𝑟𝑒𝑡𝑟𝑦→𝑡𝑟𝑎𝑓𝑓𝑖𝑐\_3x\)h\_\{3\}\\coloneqq\(\\mathit\{tok\\\_bug\}\\to\\mathit\{retry\}\\to\\mathit\{traffic\\\_3x\}\)added\.
- •T12:h4≔\(𝑡𝑜𝑘\_𝑏𝑢𝑔→𝑟𝑒𝑡𝑟𝑦→𝑟𝑒𝑑𝑖𝑠\_𝑑𝑜𝑤𝑛\)h\_\{4\}\\coloneqq\(\\mathit\{tok\\\_bug\}\\to\\mathit\{retry\}\\to\\mathit\{redis\\\_down\}\)added\.

*Plausibility\-model trajectory and awareness expansion at T4*are tracked in the per\-turn walkthrough below; key shifts are:h1h\_\{1\}\-worlds upgraded after T5;h2h\_\{2\}\-worlds downgraded after T7 \(Undermine\) then eliminated after T9 \(Revise\);𝑚𝑖𝑠\_𝑚𝑜𝑛𝑖𝑡𝑜𝑟\\mathit\{mis\\\_monitor\}enters𝒜i\\mathcal\{A\}\_\{i\}at T4; the model converges after T13 to a single maximally\-plausible class whereh4h\_\{4\}holds and subsumesh1,h3h\_\{1\},h\_\{3\}\.

T1: Carol:P1 incident\. Three alerts firing: auth failure rate up, payment failure rate up, database health degradation\. Customer complaints started around 2:15am\.

Operations:Observe×3\\times 3\(three distinct symptoms reported\),Question\(implicit: “what is causing the failures?”\)\.

Why these classifications:Carol is reporting factual observations from her dashboard, not proposing explanations\. Each alert is a separate observation\. The implicit question is what opens the deliberation\.

Model state:

- •Observations:o1o\_\{1\}: auth failure rate elevated;o2o\_\{2\}: payment failure rate elevated;o3o\_\{3\}: “database health degradation” alert;o4o\_\{4\}: customer complaints from∼\\sim2:15am\.
- •Hypotheses: none yet\.
- •Open questions: What is causing the three alert types? Are they related or independent?
- •Causal chain: unknown\.

Symbolic modelℳ1\\mathcal\{M\}\_\{1\}:𝑃𝑟𝑜𝑝1=\{p𝑎𝑢𝑡ℎ,p𝑝𝑎𝑦,p𝑑𝑏\}\\mathit\{Prop\}\_\{1\}=\\\{p\_\{\\mathit\{auth\}\},p\_\{\\mathit\{pay\}\},p\_\{\\mathit\{db\}\}\\\}\. Public announcements ofo1,o2,o3,o4o\_\{1\},o\_\{2\},o\_\{3\},o\_\{4\}eliminate worlds inconsistent with these observations\.𝒜i=𝑃𝑟𝑜𝑝1\\mathcal\{A\}\_\{i\}=\\mathit\{Prop\}\_\{1\}for allii\. Note:𝑚𝑖𝑠\_𝑚𝑜𝑛𝑖𝑡𝑜𝑟∉𝒜i\\mathit\{mis\\\_monitor\}\\notin\\mathcal\{A\}\_\{i\}\(no agent is yet aware that “database” alert may be mislabelled\)\. No hypotheses in model; no plausibility distinctions beyond observations\. All agents share the same epistemic state:𝐊a𝐊b𝐊c\(o1∧o2∧o3∧o4\)\\mathbf\{K\}\_\{a\}\\mathbf\{K\}\_\{b\}\\mathbf\{K\}\_\{c\}\(o\_\{1\}\\land o\_\{2\}\\land o\_\{3\}\\land o\_\{4\}\)\.

T2: Alice:I’m seeing 401 Unauthorized spikes in auth logs starting around 2am\. Tokens being rejected as expired\. But I also see our auth request volume is way up, about 3×\\timesnormal\. That’s strange\. More users shouldn’t be logging in at 2am\.

Operations:Observe\(three new fac ts\),Question\(“why is traffic 3×\\times?”\)\.

Why:Alice reports specific observations from her logs\. The 401 error code, the “token expired” detail, and the 3×\\timestraffic volume are all factual reports\. Her remark that “more users shouldn’t be logging in at 2am” signals a surprising observation that doesn’t yet have an explanation, this is an implicit abductive problem that will drive later hypothesis generation\.

Model state:

- •New observations:o5o\_\{5\}: 401 “token expired” errors from∼\\sim2am;o6o\_\{6\}: auth traffic 3×\\timesnormal at 2am\.
- •New open question: Why is auth traffic 3×\\timesat 2am? \(Anomalous, doesn’t match normal usage\.\)

Symbolic modelℳ2\\mathcal\{M\}\_\{2\}:𝑃𝑟𝑜𝑝2=𝑃𝑟𝑜𝑝1∪\{p𝑡𝑜𝑘,p𝑡𝑟𝑎𝑓𝑓𝑖𝑐\}\\mathit\{Prop\}\_\{2\}=\\mathit\{Prop\}\_\{1\}\\cup\\\{p\_\{\\mathit\{tok\}\},p\_\{\\mathit\{traffic\}\}\\\}\. Public announcements ofo5,o6o\_\{5\},o\_\{6\}\. Worlds where auth errors are not “token expired” are eliminated; worlds where traffic is normal are eliminated\.ℳ2,w∗⊧𝐊a\(o5∧o6\)\\mathcal\{M\}\_\{2\},w^\{\*\}\\models\\mathbf\{K\}\_\{a\}\(o\_\{5\}\\land o\_\{6\}\); after public announcement,𝐂abc\(o5∧o6\)\\mathbf\{C\}\_\{abc\}\(o\_\{5\}\\land o\_\{6\}\)\.*Abductive problem detected \(surprise check\):*The observationo6o\_\{6\}\(3×\\timestraffic at 2am\) has been publicly announced, so𝐂abco6\\mathbf\{C\}\_\{abc\}o\_\{6\}holds, all agents*know*the fact\. However,o6o\_\{6\}is not*predicted*by any existing hypothesis or background belief: no formula in the current model entails that traffic should be elevated at 2am\. Formally,o6o\_\{6\}is surprising in the sense ofVelázquez\-Quesadaet al\.\[[2013](https://arxiv.org/html/2605.14175#bib.bib8)\]: the agents’ background theory does not entailo6o\_\{6\}prior to observation\. The engine flagsχ≔o6\\chi\\coloneqq o\_\{6\}as an abductive problem: an observed fact that lacks an explanatory hypothesis\. No candidateγ\\gammais generated yet, the LLM has not proposed a hypothesis\. The problem remains open until T11\.

T3: Bob:Payment side, I see Stripe returning 429s\. We’re sending them way more requests than normal\. And I can confirm I see the database alert too, Redis connection timeouts from Payment Service\.

Operations:Observe\(Stripe 429s, elevated request rate, Redis timeouts\)\.

Why:Pure observation from Bob’s infrastructure perspective\. The Redis timeouts are particularly important because they will later become central to the causal reasoning\. At this point, nobody has proposed a hypothesis yet, the group is still gathering data\.

Model state:

- •New observations:o7o\_\{7\}: Stripe 429 \(rate\-limit\) errors;o8o\_\{8\}: Redis connection timeouts from Payment side\.
- •Note: We now have observations from all three engineers\. The picture so far: auth tokens failing, payment requests failing, Redis having connection issues\. The question is how these relate\.

Symbolic modelℳ3\\mathcal\{M\}\_\{3\}:𝑃𝑟𝑜𝑝3=𝑃𝑟𝑜𝑝2∪\{pstripe429,p𝑟𝑒𝑑𝑖𝑠\}\\mathit\{Prop\}\_\{3\}=\\mathit\{Prop\}\_\{2\}\\cup\\\{p\_\{\\mathit\{stripe429\}\},p\_\{\\mathit\{redis\}\}\\\}\. Public announcements ofo7,o8o\_\{7\},o\_\{8\}\.𝐂abc\(o7∧o8\)\\mathbf\{C\}\_\{abc\}\(o\_\{7\}\\land o\_\{8\}\)\. No hypotheses yet\. The model contains 8 observations but no causal structure\. The world setW3W\_\{3\}still contains all worlds consistent witho1o\_\{1\}–o8o\_\{8\}, including worlds where the symptoms are independent, worlds where Redis causes everything, worlds where auth causes everything, etc\. No plausibility distinctions among these\.

T4: Carol:OK, so the database alert is about Redis, not the primary DB\. Still concerning\. Bob, are you seeing Redis issues from Payment’s side?

Operations:Expand\-Awareness\(the “database” alert is really a Redis alert\),Question\.

WhyExpand\-Awareness:This is the first moment where a proposition that was previously outside the group’s reasoning enters the conversation\. Everyone initially took the “database health degradation” alert at face value, they were reasoning about a database problem\. Carol’s reclassification introduces the proposition “the monitoring rule conflates Redis with database,” which was not previously in anyone’s awareness set\. In the formal framework, this adds new formulas to𝒜i\\mathcal\{A\}\_\{i\}for all agents, expanding the set of worlds they can distinguish\.

Model state:

- •Observationo3o\_\{3\}reclassified: alert labelled “database” is actually about Redis connection errors\.
- •Awareness expansion: the proposition𝑚𝑖𝑠\_𝑚𝑜𝑛𝑖𝑡𝑜𝑟\\mathit\{mis\\\_monitor\}has entered the group’s awareness\. Before this turn, the group was reasoning about a possible primary database problem; that possibility is now eliminated\.

Symbolic modelℳ4\\mathcal\{M\}\_\{4\}:*Awareness expansion:*𝑚𝑖𝑠\_𝑚𝑜𝑛𝑖𝑡𝑜𝑟≔\\mathit\{mis\\\_monitor\}\\coloneqq“monitoring conflates Redis with database\.” Before T4:𝑚𝑖𝑠\_𝑚𝑜𝑛𝑖𝑡𝑜𝑟∉𝒜i\\mathit\{mis\\\_monitor\}\\notin\\mathcal\{A\}\_\{i\}for allii\. After T4:𝑚𝑖𝑠\_𝑚𝑜𝑛𝑖𝑡𝑜𝑟∈𝒜i\\mathit\{mis\\\_monitor\}\\in\\mathcal\{A\}\_\{i\}for allii\.𝑃𝑟𝑜𝑝4=𝑃𝑟𝑜𝑝3∪\{𝑚𝑖𝑠\_𝑚𝑜𝑛𝑖𝑡𝑜𝑟\}\\mathit\{Prop\}\_\{4\}=\\mathit\{Prop\}\_\{3\}\\cup\\\{\\mathit\{mis\\\_monitor\}\\\}\.*World space restructuring:*Before expansion, agents could not distinguishw𝑑𝑏\_𝑝𝑟𝑜𝑏𝑙𝑒𝑚w\_\{\\mathit\{db\\\_problem\}\}\(primary DB failing\) fromw𝑟𝑒𝑑𝑖𝑠\_𝑚𝑖𝑠𝑙𝑎𝑏𝑒𝑙w\_\{\\mathit\{redis\\\_mislabel\}\}\(Redis failing, mislabelled as DB\)\. After expansion, these are distinct worlds\.w𝑑𝑏\_𝑝𝑟𝑜𝑏𝑙𝑒𝑚w\_\{\\mathit\{db\\\_problem\}\}is eliminated by the public announcement thato3o\_\{3\}is a Redis alert\.*Effect:*o3o\_\{3\}is reinterpreted as evidence about Redis, not the primary database\.𝐂abc\(𝑚𝑖𝑠\_𝑚𝑜𝑛𝑖𝑡𝑜𝑟∧¬p𝑑𝑏\_𝑝𝑟𝑜𝑏𝑙𝑒𝑚\)\\mathbf\{C\}\_\{abc\}\(\\mathit\{mis\\\_monitor\}\\land\\lnot p\_\{\\mathit\{db\\\_problem\}\}\)\.

T5: Bob:Yes\. Connection pool exhausted\. We can’t get connections to Redis\. That’s why our rate\-limit checks are failing, we store rate\-limit counters in Redis\. When we can’t check the counter, our code falls through and sends the request to Stripe anyway\. That would explain the 429s, we’re hitting Stripe without rate limiting\.

Operations:Observe\(pool exhaustion details\),Hypothesize\(h1h\_\{1\}\)\.

WhyHypothesize:Bob doesn’t just report facts, he constructs an*explanatory chain*: Redis pool exhausted→\\torate\-limit checks fail→\\torequests go to Stripe unthrottled→\\to429s\. This is an abductive step: Bob has a surprising observation \(o7o\_\{7\}: Stripe 429s\) and proposes a hypothesis \(h1h\_\{1\}\) that explains it\. In the formal framework, this is an abductive solution integrated at the belief level via plausibility upgrade\.

Model state:

- •h1h\_\{1\}: Redis pool exhaustion→\\torate\-limit check failure→\\tounthrottled Stripe requests→\\to429s\. Status:active\. Plausibility:high\. Supported by:o7o\_\{7\},o8o\_\{8\}\.
- •This is the first causal hypothesis\. It explains the Stripe failures but does not yet explain the auth failures or the Redis exhaustion itself\.

Symbolic modelℳ5\\mathcal\{M\}\_\{5\}:*Abductive cycle foro7o\_\{7\}:*

1. 1\.*Surprise check \(engine\):*o7o\_\{7\}\(Stripe 429s\) is a new observation\. The engine checks whether any current hypothesis predicts Stripe rate\-limit errors: no hypothesis inℳ4\\mathcal\{M\}\_\{4\}entailso7o\_\{7\}\. The observation is surprising, it cannot be explained by the current model\. Abductive problem\(𝐁b,o7\)\(\\mathbf\{B\}\_\{b\},o\_\{7\}\)flagged\.
2. 2\.*Hypothesis generation \(LLM\):*The LLM Interpreter classifies Bob’s utterance asHypothesizeand extracts the candidate:γ≔h1≔\(p𝑟𝑒𝑑𝑖𝑠→𝑟𝑎𝑡𝑒\_𝑏𝑦𝑝𝑎𝑠𝑠→pstripe429\)\\gamma\\coloneqq h\_\{1\}\\coloneqq\(p\_\{\\mathit\{redis\}\}\\to\\mathit\{rate\\\_bypass\}\\to p\_\{\\mathit\{stripe429\}\}\)\.
3. 3\.*Verification \(engine\):*\(a\) Consistency:h1h\_\{1\}is consistent withℳ4\\mathcal\{M\}\_\{4\}\(no contradictions\)\. \(b\) Explanatory: after plausibility upgrade withh1h\_\{1\}, agent would believeo7o\_\{7\}\(Redis being down plush1h\_\{1\}entails Stripe 429s\)\. \(c\) Non\-trivial:h1≠o7h\_\{1\}\\neq o\_\{7\}andh1h\_\{1\}is not already believed\. All checks pass\.
4. 4\.*Integration \(engine\):*𝑃𝑟𝑜𝑝5=𝑃𝑟𝑜𝑝4∪\{h1\}\\mathit\{Prop\}\_\{5\}=\\mathit\{Prop\}\_\{4\}\\cup\\\{h\_\{1\}\\\}\. Plausibility upgrade:h1h\_\{1\}\-worlds become more plausible for all agents \(public communication\)\. For allii, inW5W\_\{5\}:w⊧h1⟹w≺iw′w\\models h\_\{1\}\\implies w\\prec\_\{i\}w^\{\\prime\}forw′⊧¬h1w^\{\\prime\}\\models\\lnot h\_\{1\}\(given consistency witho7,o8o\_\{7\},o\_\{8\}\)\.

*Epistemic status:*𝐁ih1\\mathbf\{B\}\_\{i\}h\_\{1\}for allii\(believed, not known\)\.¬𝐊ih1\\lnot\\mathbf\{K\}\_\{i\}h\_\{1\}because alternative explanations foro7o\_\{7\}have not been ruled out\.*Observations explained:*h1h\_\{1\}explainso7o\_\{7\}\(Stripe 429s\) ando8o\_\{8\}\(Redis timeouts\)\. Doesnotexplaino5o\_\{5\}\(token errors\) oro6o\_\{6\}\(3×\\timestraffic\)\. The abductive problem foro6o\_\{6\}\(flagged at T2\) remains open\.

T6: Carol:So the chain might be: Redis is sick→\\toPayment loses rate limiting→\\toStripe gets hammered\. And separately, Redis being sick→\\toAuth has problems too? Alice, does Auth use Redis?

Operations:Support\(h1h\_\{1\}restated\),Hypothesize\(h2h\_\{2\}: Redis→\\toAuth failures\),Question\.

Why:Carol synthesises the current understanding and extends it\. By proposingh2h\_\{2\}, she is constructing a unified picture: Redis is the single root cause, and both the payment and auth failures are downstream effects\. This is a natural abductive move, explaining all symptoms with one cause is more parsimonious\.

Model state:

- •h2h\_\{2\}: Redis failure→\\toAuth failures \(via shared Redis cluster for session caching\)\. Status:active\. Plausibility:medium\. Basis: if Auth shares Redis, its failures could cascade\.
- •Leading picture at this point:*Redis is the single root cause*\. Everything downstream\.
- •This picture iswrong, and the next turns will overturn it\.

Symbolic modelℳ6\\mathcal\{M\}\_\{6\}:*Second abductive step:*h2≔\(p𝑟𝑒𝑑𝑖𝑠→p𝑎𝑢𝑡ℎ\)h\_\{2\}\\coloneqq\(p\_\{\\mathit\{redis\}\}\\to p\_\{\\mathit\{auth\}\}\)\.𝑃𝑟𝑜𝑝6=𝑃𝑟𝑜𝑝5∪\{h2\}\\mathit\{Prop\}\_\{6\}=\\mathit\{Prop\}\_\{5\}\\cup\\\{h\_\{2\}\\\}\.*Plausibility upgrade:*h2h\_\{2\}\-worlds upgraded to plausible \(medium\)\. For allii: worlds whereh1∧h2h\_\{1\}\\land h\_\{2\}both hold \(single root cause = Redis\) are now the most plausible:wh1∧h2≺iwh1∧¬h2≺iw¬h1w\_\{h\_\{1\}\\land h\_\{2\}\}\\prec\_\{i\}w\_\{h\_\{1\}\\land\\lnot h\_\{2\}\}\\prec\_\{i\}w\_\{\\lnot h\_\{1\}\}\.*Epistemic status:*𝐁ih1\\mathbf\{B\}\_\{i\}h\_\{1\}\(high\),𝐁ih2\\mathbf\{B\}\_\{i\}h\_\{2\}\(medium\)\. The “single root cause” picture is a belief, not knowledge\.*Observations explained byh1∧h2h\_\{1\}\\land h\_\{2\}:*o1o\_\{1\}\(auth failures\),o2o\_\{2\}\(payment failures\),o5o\_\{5\}\(token errors\),o7o\_\{7\}\(Stripe 429s\),o8o\_\{8\}\(Redis timeouts\)\. Remaining unexplained:o6o\_\{6\}\(3×\\timestraffic\)\.

T7: Alice:Yes, Auth uses the same Redis cluster for session caching\. If Redis is down, token validation would fail because we can’t look up the session\. That would explain the 401s… but wait, the 401s I’m seeing are specifically ‘token expired,’ not ‘session lookup failed\.’ Those are different error codes\. The tokens are being rejected because their expiry timestamps are in the past, not because Redis is unavailable\.

Operations:Observe\(o9o\_\{9\}: error code is “token expired,” not “session lookup failed”\),Undermine\(h2h\_\{2\}\)\.

WhyUndermineand not justObserve:Alice initially supportsh2h\_\{2\}\(“If Redis is down, token validation would fail… that would explain the 401s”\), but then notices a*discrepancy*: the specific error code is inconsistent with the Redis\-causes\-auth hypothesis\. If Redis were the cause, the error would be a session lookup failure \(503\), not a token expiry \(401\)\. This observation doesn’t just add new data, it directly decreases the plausibility ofh2h\_\{2\}\. In the formal framework, worlds whereh2h\_\{2\}is true become less plausible because they predict a different error code than what was observed\.

This is the most important diagnostic moment in the conversation\.Alice’s mid\-utterance correction \(“but wait…”\) shows real\-time epistemic revision: she starts by supporting the hypothesis and then undermines it within the same turn\.

Model state:

- •h2h\_\{2\}status:active→\\toweakened\. The evidence \(o9o\_\{9\}\) is inconsistent with the predicted error type\.
- •New observationo9o\_\{9\}: Auth errors are 401 “token expired,” not 503 “service unavailable\.”
- •The single\-root\-cause picture \(Redis causes everything\) is now under strain\.

Symbolic modelℳ7\\mathcal\{M\}\_\{7\}:*Observation:*o9≔\(𝑒𝑟𝑟𝑜𝑟\_𝑐𝑜𝑑𝑒=401\_token\_expired\)o\_\{9\}\\coloneqq\(\\mathit\{error\\\_code\}=\\text\{401\\\_token\\\_expired\}\)\. Public announcement eliminates worlds where auth errors are 503\.*Undermineh2h\_\{2\}:*h2h\_\{2\}predicts𝑒𝑟𝑟𝑜𝑟\_𝑐𝑜𝑑𝑒=503\_session\_lookup\\mathit\{error\\\_code\}=\\text\{503\\\_session\\\_lookup\}\.o9o\_\{9\}is inconsistent with this prediction\.*Plausibility downgrade:*for allii,h2h\_\{2\}\-worlds become less plausible:w¬h2≺iwh2w\_\{\\lnot h\_\{2\}\}\\prec\_\{i\}w\_\{h\_\{2\}\}\(reversing the T6 upgrade\)\.*Epistemic status:*𝐁i¬h2\\mathbf\{B\}\_\{i\}\\lnot h\_\{2\}\(agents now believeh2h\_\{2\}is false\), but¬𝐊i¬h2\\lnot\\mathbf\{K\}\_\{i\}\\lnot h\_\{2\}\(h2h\_\{2\}is not yet eliminated, Alice hasn’t explicitly ruled it out\)\. The ordering is:wh1∧¬h2≺iwh1∧h2≺iw¬h1w\_\{h\_\{1\}\\land\\lnot h\_\{2\}\}\\prec\_\{i\}w\_\{h\_\{1\}\\land h\_\{2\}\}\\prec\_\{i\}w\_\{\\lnot h\_\{1\}\}\.*Key formal point:*Underminechanges plausibility without eliminating worlds\.h2h\_\{2\}\-worlds remain epistemically possible but no longer believed: knowledge corresponds to world elimination, belief to plausibility reordering\.

T8: Carol:Hmm\. So the auth failures might not be caused by Redis after all?T9: Alice:I don’t think so\. The error type is wrong\. If Redis were down, Auth would return a 503 Service Unavailable, not a 401 with ‘token expired\.’ I’m seeing 401s\. So the token expiry issue is a different problem from the Redis problem\.

Operations \(T8\):Question\(Carol seeks confirmation of the implication\)\. Operations \(T9\):Revise\(Alice explicitly abandonsh2h\_\{2\}\)\.

WhyRevise:This is a genuine belief revision, not just weakening\. Alice explicitly states that the token expiry issue is “a different problem from the Redis problem\.” This separates what was previously thought to be a single\-cause situation into \(at least\) two independent problems\. In the formal framework, the plausibility ordering is restructured:h2h\_\{2\}\-worlds are moved from plausible to implausible\.

Model state:

- •h2h\_\{2\}:weakened→\\toabandoned\. Auth failures are not caused by Redis\.
- •The group’s picture has fundamentally changed: there are at least two independent problems, not one\.
- •Open question \(newly urgent\): What*is*causing the token expiry errors?

Symbolic modelℳ9\\mathcal\{M\}\_\{9\}:*Radical revision:*Alice’s explicit abandonment ofh2h\_\{2\}triggers a*radical upgrade*of¬h2\\lnot h\_\{2\}\. Allh2h\_\{2\}\-worlds become maximally implausible \(effectively eliminated from the believed set\)\. Formally:𝐁i¬h2→𝐊i¬h2\\mathbf\{B\}\_\{i\}\\lnot h\_\{2\}\\to\\mathbf\{K\}\_\{i\}\\lnot h\_\{2\}for allii\. The group now*knows*¬h2\\lnot h\_\{2\}, not merely believes it\.*Updated plausibility:*wh1∧¬h2≺iw¬h1∧¬h2w\_\{h\_\{1\}\\land\\lnot h\_\{2\}\}\\prec\_\{i\}w\_\{\\lnot h\_\{1\}\\land\\lnot h\_\{2\}\}\. Auth and Redis are independent\.*Observations now unexplained:*Withh2h\_\{2\}abandoned,o1o\_\{1\}\(auth failures\),o5o\_\{5\}\(token errors\), ando6o\_\{6\}\(3×\\timestraffic\) have no causal explanation\. The model records these as open abductive problems\.*Key formal distinction from T7:*At T7,h2h\_\{2\}was*undermined*\(plausibility downgraded but worlds retained\)\. At T9,h2h\_\{2\}is*revised*\(worlds effectively eliminated\)\. This is the difference betweenUndermineandRevisein the formal framework\.

T10: Bob:But then why is auth traffic 3×\\timesnormal? If the token expiry issue is independent of Redis, what’s generating all that auth traffic?

Operations:Question\.

Why this matters:Bob identifies that the unexplained 3×\\timestraffic \(o6o\_\{6\}\) is now a critical clue\. If auth failures aren’t caused by Redis, then the elevated traffic isn’t explained by the current model\. This question drives the next abductive step\. In the formal framework, this is a recognition of an abductive problem: there exists a surprising observation \(o6o\_\{6\}\) that the current set of hypotheses cannot explain\.

Symbolic modelℳ10\\mathcal\{M\}\_\{10\}:No structural change toWWor plausibility orderings\. TheQuestionoperation addso6o\_\{6\}to the set of open abductive problems:𝐴𝑏𝑑𝑃𝑟𝑜𝑏10=\{\(𝐁i,o6\)\}\\mathit\{AbdProb\}\_\{10\}=\\\{\(\\mathbf\{B\}\_\{i\},o\_\{6\}\)\\\}\. The model explicitly records thato6o\_\{6\}is surprising and unexplained\.

T11: Alice:Could be the frontend retrying\. When a user gets a ‘token expired’ error, the frontend automatically tries to refresh the token\. If the new token is also expired, it retries again\. That’s a retry loop\. So the token bug generates its own amplified traffic\.

Operations:Hypothesize\(h3h\_\{3\}\)\.

Why:Alice proposes a new hypothesis to explain the anomalous traffic observation\. This is a classic abductive step: the surprising fact \(o6o\_\{6\}: 3×\\timestraffic\) triggers the generation of an explanatory hypothesis \(h3h\_\{3\}: a feedback loop where the bug amplifies its own traffic\)\. Note thath3h\_\{3\}involves domain knowledge about the frontend’s retry behaviour, something the group becomes aware of through Alice’s contribution\.

Model state:

- •h3h\_\{3\}: Token bug→\\tofrontend retry loop→\\to3×\\timestraffic amplification\. Status:active\. Plausibility:medium\. Explainso6o\_\{6\}\.
- •At this point, the group has two separate explanations:h1h\_\{1\}\(Redis→\\toStripe problems\) andh3h\_\{3\}\(token bug→\\totraffic amplification\)\. They haven’t yet connected these\.

Symbolic modelℳ11\\mathcal\{M\}\_\{11\}:*Abductive cycle foro6o\_\{6\}\(open since T2\):*

1. 1\.*Surprise check:*o6o\_\{6\}\(3×\\timestraffic\) was flagged at T2 as an observation not predicted by any hypothesis\. Afterh2h\_\{2\}’s abandonment at T9,o6o\_\{6\}still has no explanatory hypothesis in the current model\. Abductive problem\(𝐁i,o6\)\(\\mathbf\{B\}\_\{i\},o\_\{6\}\)remains open\.
2. 2\.*Hypothesis generation \(LLM\):*Alice’s utterance classified asHypothesize\. Candidate extracted:γ≔h3≔\(𝑡𝑜𝑘\_𝑏𝑢𝑔→𝑟𝑒𝑡𝑟𝑦\_𝑙𝑜𝑜𝑝→p𝑡𝑟𝑎𝑓𝑓𝑖𝑐\)\\gamma\\coloneqq h\_\{3\}\\coloneqq\(\\mathit\{tok\\\_bug\}\\to\\mathit\{retry\\\_loop\}\\to p\_\{\\mathit\{traffic\}\}\)\.
3. 3\.*Verification \(engine\):*\(a\) Consistency:h3h\_\{3\}is consistent withℳ9\\mathcal\{M\}\_\{9\}\. \(b\) Explanatory: if𝑡𝑜𝑘\_𝑏𝑢𝑔\\mathit\{tok\\\_bug\}is true, the retry loop mechanism would produce elevated traffic, so after plausibility upgrade withh3h\_\{3\},𝐁i\(o6explained\)\\mathbf\{B\}\_\{i\}\(o\_\{6\}\\text\{ explained\}\)holds\. \(c\) Non\-trivial:h3≠o6h\_\{3\}\\neq o\_\{6\}\. All checks pass\.
4. 4\.*Integration:*𝑃𝑟𝑜𝑝11=𝑃𝑟𝑜𝑝9∪\{h3,𝑡𝑜𝑘\_𝑏𝑢𝑔,𝑟𝑒𝑡𝑟𝑦\_𝑙𝑜𝑜𝑝\}\\mathit\{Prop\}\_\{11\}=\\mathit\{Prop\}\_\{9\}\\cup\\\{h\_\{3\},\\mathit\{tok\\\_bug\},\\mathit\{retry\\\_loop\}\\\}\. Plausibility upgrade:h3h\_\{3\}\-worlds upgraded\. For allii:wh3≺iw¬h3w\_\{h\_\{3\}\}\\prec\_\{i\}w\_\{\\lnot h\_\{3\}\}\.

*Awareness expansion \(implicit\):*The propositions𝑡𝑜𝑘\_𝑏𝑢𝑔\\mathit\{tok\\\_bug\}\(a deployment bug causes token expiry\) and𝑟𝑒𝑡𝑟𝑦\_𝑙𝑜𝑜𝑝\\mathit\{retry\\\_loop\}\(frontend retries create amplification\) enter𝒜i\\mathcal\{A\}\_\{i\}via Alice’s domain knowledge\. These were not in any agent’s awareness before this turn\.*Epistemic status:*𝐁ih3\\mathbf\{B\}\_\{i\}h\_\{3\}\(medium plausibility, plausible but unconfirmed\)\. The abductive problem foro6o\_\{6\}is nowclosed\(explained byh3h\_\{3\}\)\.*Model topology:*Two independent causal chains coexist:h1h\_\{1\}\(Redis→\\toStripe\) andh3h\_\{3\}\(token bug→\\totraffic\)\. No connection between them yet\.o6o\_\{6\}is now explained;o5o\_\{5\}\(token errors\) is partially explained\.

T12: Carol:Wait, so the retry storm from the token bug could be what’s exhausting Redis? Not Redis causing the auth problem, but the auth problem causing the Redis overload?

Operations:Hypothesize\+Revise\.This is the critical turn: causal direction reversal\.

Why this is the key epistemic shift:Carol makes the decisive connection\. She realises thath3h\_\{3\}\(token bug causes retry storm\) can be linked toh1h\_\{1\}\(Redis exhaustion causes Stripe failures\) by*reversing the causal direction between Auth and Redis*\. The group had previously assumed Redis→\\toAuth \(hypothesish2h\_\{2\}, now abandoned\)\. Carol proposes Auth→\\toRedis: the retry storm is what overwhelms Redis, not the other way around\.

This is both aHypothesize\(introducing the new causal link: retries exhaust Redis\) and aRevise\(restructuring the causal understanding from two independent problems back to a single causal chain, but with a completely different structure from the originalh2h\_\{2\}\)\.

Model state:

- •h4h\_\{4\}: Token bug→\\toretry storm→\\toRedis connection pool exhaustion\. Status:active\. Plausibility:medium\. Linksh3h\_\{3\}toh1h\_\{1\}\.
- •Key epistemic shift recorded: Causal direction between Auth and Redis has reversed\. Previously: Redis failure was thought to cause auth problems \(h2h\_\{2\}\)\. Now: auth problems \(the token bug and resulting retry storm\) are proposed to cause Redis failure \(h4h\_\{4\}\)\.
- •Emerging unified chain: token bug→\\toretries→\\toRedis exhaustion→\\torate\-limit bypass→\\toStripe 429s\.

Symbolic modelℳ12\\mathcal\{M\}\_\{12\}:*Hypothesis:*h4≔\(𝑡𝑜𝑘\_𝑏𝑢𝑔→𝑟𝑒𝑡𝑟𝑦\_𝑙𝑜𝑜𝑝→p𝑟𝑒𝑑𝑖𝑠\)h\_\{4\}\\coloneqq\(\\mathit\{tok\\\_bug\}\\to\\mathit\{retry\\\_loop\}\\to p\_\{\\mathit\{redis\}\}\)\. Note the causal direction: Auth→\\toRedis, opposite of the abandonedh2h\_\{2\}\(Redis→\\toAuth\)\.𝑃𝑟𝑜𝑝12=𝑃𝑟𝑜𝑝11∪\{h4\}\\mathit\{Prop\}\_\{12\}=\\mathit\{Prop\}\_\{11\}\\cup\\\{h\_\{4\}\\\}\.*Plausibility upgrade:*h4h\_\{4\}\-worlds upgraded to plausible\. The combined hypothesish1∧h3∧h4h\_\{1\}\\land h\_\{3\}\\land h\_\{4\}now forms a unified causal chain and is more plausible than the independent\-problems picture\. For allii:wh1∧h3∧h4≺iwh1∧h3∧¬h4≺iw¬h1w\_\{h\_\{1\}\\land h\_\{3\}\\land h\_\{4\}\}\\prec\_\{i\}w\_\{h\_\{1\}\\land h\_\{3\}\\land\\lnot h\_\{4\}\}\\prec\_\{i\}w\_\{\\lnot h\_\{1\}\}\.*Epistemic status:*𝐁ih4\\mathbf\{B\}\_\{i\}h\_\{4\}\(believed, not known, awaiting mechanistic confirmation\)\.*Structural revision:*The model’s causal graph is restructured\. The dependency edge between Auth and Redis is reversed\. The model records this as a key epistemic shift with provenance:h2h\_\{2\}\(Redis→\\toAuth, T6, abandoned T9 due too9o\_\{9\}\)→\\toh4h\_\{4\}\(Auth→\\toRedis, T12, supported byh3h\_\{3\}\)\.*Unification:*Ifh4h\_\{4\}is accepted, thenh1∧h3∧h4h\_\{1\}\\land h\_\{3\}\\land h\_\{4\}form a single causal chain explaining*all*observationso1o\_\{1\}–o9o\_\{9\}\.

T13: Alice:That’s… actually plausible\. If auth traffic is 3×\\times, and every auth request hits Redis for session lookup, the Redis connection pool could be overwhelmed\. So the chain would be: token bug→\\toretry storm→\\toRedis exhaustion\. And then Redis exhaustion→\\toPayment rate\-limit bypass→\\toStripe 429s\. The whole thing cascades from the token bug\.

Operations:Support\(confirmingh4h\_\{4\}with mechanistic reasoning\),Resolve\(the unified causal chain is accepted\)\.

WhyResolve:Alice provides the mechanistic argument that makesh4h\_\{4\}convincing: 3×\\timestraffic, every request hitting Redis, pool exhaustion is the expected consequence\. She then states the complete unified chain explicitly\. At this point, the hypothesis is elevated from tentative belief to high\-confidence accepted explanation\.

Symbolic modelℳ13\\mathcal\{M\}\_\{13\}\(final\):*Support \+ Resolve:*Alice’s mechanistic argument provides sufficient evidence to elevateh4h\_\{4\}from belief to knowledge\.𝐁ih4→𝐊ih4\\mathbf\{B\}\_\{i\}h\_\{4\}\\to\\mathbf\{K\}\_\{i\}h\_\{4\}for allii\. Similarly, the unified chainh𝑢𝑛𝑖𝑓𝑖𝑒𝑑=h1∧h3∧h4h\_\{\\mathit\{unified\}\}=h\_\{1\}\\land h\_\{3\}\\land h\_\{4\}is resolved:𝐊ih𝑢𝑛𝑖𝑓𝑖𝑒𝑑\\mathbf\{K\}\_\{i\}h\_\{\\mathit\{unified\}\}for allii\.*Final plausibility:*The model converges to a single maximally plausible world class:W13𝑚𝑎𝑥=\{w:w⊧h𝑢𝑛𝑖𝑓𝑖𝑒𝑑\}W\_\{13\}^\{\\mathit\{max\}\}=\\\{w:w\\models h\_\{\\mathit\{unified\}\}\\\}\. All alternative worlds are maximally implausible\.*Hypothesis lifecycle summary:*

- •h1h\_\{1\}:∅→T5, Hyp\\emptyset\\xrightarrow\{\\text\{T5, Hyp\}\}active→T13, Res\\xrightarrow\{\\text\{T13, Res\}\}resolved \(correct mechanism, subsumed into unified chain\)\.
- •h2h\_\{2\}:∅→T6, Hyp\\emptyset\\xrightarrow\{\\text\{T6, Hyp\}\}active→T7, Und\\xrightarrow\{\\text\{T7, Und\}\}weakened→T9, Rev\\xrightarrow\{\\text\{T9, Rev\}\}abandoned\. Disproved byo9o\_\{9\}\.
- •h3h\_\{3\}:∅→T11, Hyp\\emptyset\\xrightarrow\{\\text\{T11, Hyp\}\}active→T13, Res\\xrightarrow\{\\text\{T13, Res\}\}resolved \(subsumed into unified chain\)\.
- •h4h\_\{4\}:∅→T12, Hyp\+Rev\\emptyset\\xrightarrow\{\\text\{T12, Hyp\+Rev\}\}active→T13, Sup\+Res\\xrightarrow\{\\text\{T13, Sup\+Res\}\}resolved\. Key epistemic shift: causal reversal\.

*Awareness set \(final\):*𝒜i=𝑃𝑟𝑜𝑝0∪\{𝑚𝑖𝑠\_𝑚𝑜𝑛𝑖𝑡𝑜𝑟,h1,h2,h3,h4,𝑡𝑜𝑘\_𝑏𝑢𝑔,𝑟𝑒𝑡𝑟𝑦\_𝑙𝑜𝑜𝑝\}\\mathcal\{A\}\_\{i\}=\\mathit\{Prop\}\_\{0\}\\cup\\\{\\mathit\{mis\\\_monitor\},h\_\{1\},h\_\{2\},h\_\{3\},h\_\{4\},\\mathit\{tok\\\_bug\},\\mathit\{retry\\\_loop\}\\\}for allii\.*Knowledge \(final\):*𝐂abc\(h𝑢𝑛𝑖𝑓𝑖𝑒𝑑∧¬h2∧𝑚𝑖𝑠\_𝑚𝑜𝑛𝑖𝑡𝑜𝑟\)\\mathbf\{C\}\_\{abc\}\(h\_\{\\mathit\{unified\}\}\\land\\lnot h\_\{2\}\\land\\mathit\{mis\\\_monitor\}\)\.

#### Final model state\.

The unified causal chain \(matching the ground truth of[Appendix˜C](https://arxiv.org/html/2605.14175#A3)’s opening\):*deployment token bug→\\toexpired tokens→\\tofrontend retries→\\to3×\\timesauth traffic→\\toRedis pool exhaustion→\\toPayment rate\-limit bypass→\\tounthrottled Stripe requests→\\to429s*, with the “database health” alert separately identified at T4 as a monitoring miscategorisation\. Hypothesis lifecycles are listed above; the per\-turn walkthrough records, at each step, the formal grounds \(operation, plausibility shift, dependency edge\) for the system’s tracking, including the two key epistemic shifts at T7–T9 \(error\-code evidence killsh2h\_\{2\}\) and T12 \(causal direction reversal\)\.

## Appendix DRunning Example: Phase 3: Architecture Deliberation

A team decides how to add real\-time collaboration to their document editor\. Alice \(product lead; feature scope and timeline\), Bob \(senior backend engineer; architecture and long\-term maintainability\), Carol \(frontend engineer; editor integration and user experience\)\.

#### What makes this Phase 3\.

Unlike Phases 1–2, there is no objectively correct answer\. The conversation produces a*decision*through deliberation, not a*discovery*of pre\-existing fact\. The core epistemic primitive shifts from knowledge/belief to*commitments*\(public, retractable assertions about what the team will do\)\. The conversation involves genuine trade\-offs, persistent disagreement, and a decision made under authority that explicitly records dissent and conditions for re\-evaluation\.

#### Model notation\.

We track the model using IBIS\-style elements:*Issues*\(InI\_\{n\}, questions under deliberation\),*Positions*\(PnP\_\{n\}, proposed answers\),*Arguments*\(pro/con, attributed to speakers\), and*Decisions*\(resolved issues with provenance\)\. We also track*Assumptions*\(ana\_\{n\}\): premises that decisions rest on, retractable when new evidence undermines them\.

#### Classification results \(cross\-model\)\.

[Table˜7](https://arxiv.org/html/2605.14175#A4.T7)reports per\-model F1 / Exact\-match% on Phase 3, under the Definitions condition only \(matching the cross\-scenario comparison in[Table˜1](https://arxiv.org/html/2605.14175#S4.T1)\)\. Phase 3 reaches comparable or higher accuracy than Phase 2 on three of four models, despite the longer 19\-turn structure and the epistemic\-to\-deliberative shift \([Section˜4](https://arxiv.org/html/2605.14175#S4)\)\.

Table 7:Phase 3 \(deliberation, 19 turns\) classification accuracy\. Definitions condition only; F1 / Exact\-match% over 5 runs per cell\.
#### Formal model specification\.

The symbolic engine maintains a dependency structure\(ℳ,𝐴𝐹,𝐶𝑚,𝐷𝑒𝑝\)\(\\mathcal\{M\},\\mathit\{AF\},\\mathit\{Cm\},\\mathit\{Dep\}\)as defined in[Definition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmdefinition1)\. Phase\-3\-specific instantiation:

*Agents:*𝐴𝑔𝑠=\{a,b,c\}\\mathit\{Ags\}=\\\{a,b,c\\\}\(Alice, Bob, Carol\), with roles:aa= product lead \(decision authority\),bb= senior backend \(architecture\),cc= frontend \(editor integration\)\.

*Arguments:*eachα∈𝐴𝑟𝑔𝑠\\alpha\\in\\mathit\{Args\}is a tuple\(𝑐𝑙𝑎𝑖𝑚,𝑠𝑝𝑒𝑎𝑘𝑒𝑟,𝑡𝑢𝑟𝑛,𝑡𝑦𝑝𝑒\)\(\\mathit\{claim\},\\mathit\{speaker\},\\mathit\{turn\},\\mathit\{type\}\)with𝑡𝑦𝑝𝑒∈\{𝑝𝑟𝑜,𝑐𝑜𝑛\}\\mathit\{type\}\\in\\\{\\mathit\{pro\},\\mathit\{con\}\\\}and𝑐𝑙𝑎𝑖𝑚\\mathit\{claim\}targeting a positionPnP\_\{n\}\. Commitments are retractable \(unlike knowledge\); when an assumptionppchanges, everyα\\alphawithp∈𝐷𝑒𝑝\(α\)p\\in\\mathit\{Dep\}\(\\alpha\)is flagged for re\-evaluation per[Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1)\.

*Awareness:*𝒜i\\mathcal\{A\}\_\{i\}tracks the solution space visible to each agent\. Initially, only “implement Operational Transformation \(OT\)” and “implement Conflict\-free Replicated Data Type \(CRDT\)” are in𝒜i\\mathcal\{A\}\_\{i\}; library\-based approaches enter at T5 via awareness expansion\.

T1: Alice:We need real\-time collaboration\. Users should see each other’s edits live, like Google Docs\. Ship in six weeks\. How do we build it?

Operations:Question\(opens root issue\),Observe\(constraints stated\)\.

Why:Alice defines the problem and constraints\. In IBIS terms, this opens the root issue\. The six\-week timeline and latency requirement are constraints that will shape the evaluation of all positions\.

Model:

- •IssueI1I\_\{1\}: “How to implement real\-time collaboration?” Status:open\.
- •Constraints: 6\-week deadline; users must see changes in real time\.

Symbolic model𝐴𝐹1\\mathit\{AF\}\_\{1\}:𝐴𝑟𝑔𝑠1=∅\\mathit\{Args\}\_\{1\}=\\emptyset\.𝐴𝑡𝑡1=∅\\mathit\{Att\}\_\{1\}=\\emptyset\. No arguments yet\.𝐶𝑚\(a\)=𝐶𝑚\(b\)=𝐶𝑚\(c\)=∅\\mathit\{Cm\}\(a\)=\\mathit\{Cm\}\(b\)=\\mathit\{Cm\}\(c\)=\\emptyset\. IssueI1I\_\{1\}opened\. Constraints𝑡𝑖𝑚𝑒𝑙𝑖𝑛𝑒=6wk\\mathit\{timeline\}=6\\text\{wk\},𝑙𝑎𝑡𝑒𝑛𝑐𝑦=real\-time\\mathit\{latency\}=\\text\{real\-time\}added to𝑃𝑟𝑜𝑝\\mathit\{Prop\}\.𝐷𝑒𝑝\\mathit\{Dep\}: no dependencies yet \(no arguments exist\)\.

T2: Bob:Two established approaches: Operational Transformation, that’s what Google Docs uses, and CRDTs, which is what Figma and newer tools use\. OT needs a central server for coordination\. CRDTs are peer\-to\-peer capable but more complex to implement\.T3: Alice:What about the six\-week timeline? Can we ship either one?T4: Bob:Implementing either from scratch in six weeks is risky\. OT’s transformation functions are subtle and buggy\. CRDTs have complex data structures\.T5: Carol:What about using a library? Yjs is a mature CRDT library\. ShareDB for OT\.

Operations \(T2\):Observe\(domain knowledge: two approaches exist\)\. Two implicit positions opened\. Operations \(T3\):Question\(feasibility sub\-issue under timeline constraint\)\. Operations \(T4\):Undermine\(P1P\_\{1\}andP2P\_\{2\}: both risky within timeline\)\. Operations \(T5\):Expand\-Awareness\+Hypothesize\(library approach was not previously considered\)\.

Why T5 isExpand\-Awareness:Before Carol’s suggestion, the discussion was framed as “which algorithm should we implement?” Carol introduces a new dimension, using existing libraries, that was outside the group’s consideration\. This is analogous to awareness expansion in the formal framework: a new set of propositions \(about specific libraries, their maturity, their integration properties\) enters the group’s reasoning\.

Model:*First key reframing*: the issue shifts from “OT vs CRDT algorithm” to “Yjs vs ShareDB library\.”

- •P1P\_\{1\}\(implement OT from scratch\),P2P\_\{2\}\(implement CRDT from scratch\):abandoned\. Reason: both too risky for 6\-week timeline\. This reason is recorded, so these positions would only be revisited if the timeline constraint were relaxed\.
- •P3P\_\{3\}\(use Yjs, CRDT library\):open\.
- •P4P\_\{4\}\(use ShareDB, OT library\):open\.
- •Epistemic shift: reframing from algorithm choice to library choice\.

Symbolic model𝐴𝐹5\\mathit\{AF\}\_\{5\}:𝐴𝑟𝑔𝑠5=\{α1,α2,α3,α4\}\\mathit\{Args\}\_\{5\}=\\\{\\alpha\_\{1\},\\alpha\_\{2\},\\alpha\_\{3\},\\alpha\_\{4\}\\\}where:α1=\(P1risky,b,T4,𝑐𝑜𝑛\)\\alpha\_\{1\}=\(P\_\{1\}\\text\{ risky\},b,T4,\\mathit\{con\}\),α2=\(P2risky,b,T4,𝑐𝑜𝑛\)\\alpha\_\{2\}=\(P\_\{2\}\\text\{ risky\},b,T4,\\mathit\{con\}\),α3=\(P3proposed,c,T5,𝑝𝑟𝑜\)\\alpha\_\{3\}=\(P\_\{3\}\\text\{ proposed\},c,T5,\\mathit\{pro\}\),α4=\(P4proposed,c,T5,𝑝𝑟𝑜\)\\alpha\_\{4\}=\(P\_\{4\}\\text\{ proposed\},c,T5,\\mathit\{pro\}\)\.𝐴𝑡𝑡5=\{\(α1,P1\),\(α2,P2\)\}\\mathit\{Att\}\_\{5\}=\\\{\(\\alpha\_\{1\},P\_\{1\}\),\(\\alpha\_\{2\},P\_\{2\}\)\\\}\(timeline arguments attack from\-scratch positions\)\.𝐶𝑚\(b\)=\{α1,α2\}\\mathit\{Cm\}\(b\)=\\\{\\alpha\_\{1\},\\alpha\_\{2\}\\\};𝐶𝑚\(c\)=\{α3,α4\}\\mathit\{Cm\}\(c\)=\\\{\\alpha\_\{3\},\\alpha\_\{4\}\\\};𝐶𝑚\(a\)=∅\\mathit\{Cm\}\(a\)=\\emptyset\.𝐷𝑒𝑝\(α1\)=𝐷𝑒𝑝\(α2\)=\{𝑡𝑖𝑚𝑒𝑙𝑖𝑛𝑒\}\\mathit\{Dep\}\(\\alpha\_\{1\}\)=\\mathit\{Dep\}\(\\alpha\_\{2\}\)=\\\{\\mathit\{timeline\}\\\}\(if timeline relaxed, these arguments lose force\)\.*Awareness expansion:*\{P3,P4,𝑌𝑗𝑠,𝑆ℎ𝑎𝑟𝑒𝐷𝐵\}\\\{P\_\{3\},P\_\{4\},\\mathit\{Yjs\},\\mathit\{ShareDB\}\\\}added to all𝒜i\\mathcal\{A\}\_\{i\}at T5\. Before T5, only\{P1,P2\}\\\{P\_\{1\},P\_\{2\}\\\}were in consideration\.*Positions:*P1,P2P\_\{1\},P\_\{2\}: abandoned \(attacked byα1,α2\\alpha\_\{1\},\\alpha\_\{2\}\)\.P3,P4P\_\{3\},P\_\{4\}: open, no attacks yet\.

T6: Carol:I’ve prototyped with Yjs before on a side project\. The yjs\-prosemirror binding, which is our editor, is well documented\. I don’t know if ShareDB has the same ProseMirror integration\.T7: Bob:Both work with our Node\.js backend\. But with Yjs we could go serverless or use a central server\. More architectural flexibility\.

Operations \(T6\):Support\(P3P\_\{3\}: ProseMirror integration exists and Carol has experience\),Undermine\(P4P\_\{4\}: integration status unknown\)\. Operations \(T7\):Support\(P3P\_\{3\}: architectural flexibility\)\.

Why these matter:Carol’s prior experience with Yjs is both evidence \(she’s prototyped it\) and a practical argument \(less ramp\-up time\)\. Her uncertainty about ShareDB’s integration is an asymmetry,P3P\_\{3\}has a known integration story whileP4P\_\{4\}has an unknown one\. Bob’s point about flexibility adds a different dimension of support\.

Model:

- •P3P\_\{3\}arguments pro: ProseMirror binding exists \(Carol, from experience\); architectural flexibility \(Bob\)\.
- •P4P\_\{4\}arguments con: ProseMirror integration unclear \(Carol\)\.
- •P3P\_\{3\}is emerging as the stronger candidate, but concerns haven’t been raised yet\.

Symbolic model𝐴𝐹7\\mathit\{AF\}\_\{7\}:New arguments:α5=\(P3ProseMirror ok,c,T6,𝑝𝑟𝑜\)\\alpha\_\{5\}=\(P\_\{3\}\\text\{ ProseMirror ok\},c,T6,\\mathit\{pro\}\),α6=\(P4ProseMirror unclear,c,T6,𝑐𝑜𝑛\)\\alpha\_\{6\}=\(P\_\{4\}\\text\{ ProseMirror unclear\},c,T6,\\mathit\{con\}\),α7=\(P3arch\. flexibility,b,T7,𝑝𝑟𝑜\)\\alpha\_\{7\}=\(P\_\{3\}\\text\{ arch\.\\ flexibility\},b,T7,\\mathit\{pro\}\)\.𝐴𝑡𝑡7=𝐴𝑡𝑡5∪\{\(α6,P4\)\}\\mathit\{Att\}\_\{7\}=\\mathit\{Att\}\_\{5\}\\cup\\\{\(\\alpha\_\{6\},P\_\{4\}\)\\\}\.𝐶𝑚\(c\)=\{α3,α4,α5,α6\}\\mathit\{Cm\}\(c\)=\\\{\\alpha\_\{3\},\\alpha\_\{4\},\\alpha\_\{5\},\\alpha\_\{6\}\\\};𝐶𝑚\(b\)=\{α1,α2,α7\}\\mathit\{Cm\}\(b\)=\\\{\\alpha\_\{1\},\\alpha\_\{2\},\\alpha\_\{7\}\\\}\.*Acceptability:*P3P\_\{3\}has 3 supporting arguments \(α3,α5,α7\\alpha\_\{3\},\\alpha\_\{5\},\\alpha\_\{7\}\) and 0 undefeated attacks\.P4P\_\{4\}has 1 supporting argument \(α4\\alpha\_\{4\}\) and 1 undefeated attack \(α6\\alpha\_\{6\}\)\.P3P\_\{3\}is the preferred position in any preferred extension\.

T8: Bob:CRDTs have a known problem with document size\. The CRDT metadata grows over time and can get large for long\-lived documents\. Yjs has some GC mechanisms but they’re not trivial\.T9: Alice:Is that a problem for our initial launch? Our documents are typically 5–10 pages\.T10: Bob:Probably not for launch\. It’s a long\-term concern\. But I want to flag it because switching from CRDT to OT later would be a rewrite, not a refactor\.

Operations \(T8\):Undermine\(P3P\_\{3\}: document size risk\)\. Operations \(T9\):Question\(is the risk relevant to our context?\)\. Operations \(T10\):Support\(low severity for launch\) \+Undermine\(irreversibility makes it a strategic risk\)\.

Why this sequence matters:Bob raises a concern, Alice challenges its relevance to the immediate context, and Bob concedes on the short term but flags the long\-term irreversibility\. This creates a*conditional risk*, something that’s acceptable under current assumptions but becomes problematic if assumptions change\. The model must capture not just “Bob has a concern” but the precise conditions under which the concern activates\.

Model:

- •P3P\_\{3\}argument con: document size risk \(Bob\)\. Severity assessment:low for current use case,high if long\-lived documents needed\. Risk characteristic:irreversible, switching later is a rewrite\.
- •Sub\-issueI2I\_\{2\}: “Is document size a problem for us?” Status:provisionally resolved, not for launch\.

Symbolic model𝐴𝐹10\\mathit\{AF\}\_\{10\}:New arguments:α8=\(P3doc size risk,b,T8,𝑐𝑜𝑛\)\\alpha\_\{8\}=\(P\_\{3\}\\text\{ doc size risk\},b,T8,\\mathit\{con\}\),α9=\(P3ok for short docs,b,T10,𝑝𝑟𝑜\)\\alpha\_\{9\}=\(P\_\{3\}\\text\{ ok for short docs\},b,T10,\\mathit\{pro\}\),α10=\(P3irreversible if wrong,b,T10,𝑐𝑜𝑛\)\\alpha\_\{10\}=\(P\_\{3\}\\text\{ irreversible if wrong\},b,T10,\\mathit\{con\}\)\.𝐴𝑡𝑡10=𝐴𝑡𝑡7∪\{\(α8,P3\),\(α9,α8\),\(α10,P3\)\}\\mathit\{Att\}\_\{10\}=\\mathit\{Att\}\_\{7\}\\cup\\\{\(\\alpha\_\{8\},P\_\{3\}\),\(\\alpha\_\{9\},\\alpha\_\{8\}\),\(\\alpha\_\{10\},P\_\{3\}\)\\\}\.*Key:*α9\\alpha\_\{9\}attacksα8\\alpha\_\{8\}\(“not a problem for launch”\), partially defeating it\. Butα10\\alpha\_\{10\}is a new, independent attack onP3P\_\{3\}that isnotdefeated\.𝐷𝑒𝑝\(α9\)=\{a1,a2\}\\mathit\{Dep\}\(\\alpha\_\{9\}\)=\\\{a\_\{1\},a\_\{2\}\\\}wherea1≔a\_\{1\}\\coloneqq“docs are short” anda2≔a\_\{2\}\\coloneqq“editing is burst\.” These assumptions explicitly condition the argument\.𝐷𝑒𝑝\(α10\)=∅\\mathit\{Dep\}\(\\alpha\_\{10\}\)=\\emptyset\(the irreversibility argument is unconditional\)\.𝐶𝑚\(b\)=𝐶𝑚\(b\)∪\{α8,α9,α10\}\\mathit\{Cm\}\(b\)=\\mathit\{Cm\}\(b\)\\cup\\\{\\alpha\_\{8\},\\alpha\_\{9\},\\alpha\_\{10\}\\\}\. Sub\-issueI2I\_\{2\}opened and provisionally resolved:α9\\alpha\_\{9\}defeatsα8\\alpha\_\{8\}given current assumptions\.

T11: Carol:If we go with Yjs and WebRTC, we could support offline editing natively\. User research showed spotty connectivity is a pain point\.T12: Bob:Hmm, but if edits are peer\-to\-peer, access control is hard\. We need role\-based permissions\.T13: Carol:Can we use Yjs but with a central server as the sync point? We’d get the CRDT benefits, conflict resolution, offline merge, but the server can enforce access control\.T14: Bob:Yes, that’s actually the recommended production setup for Yjs\. You run a Yjs WebSocket server as the sync point\. And we already run WebSocket servers for notifications\.

Operations \(T11\):Support\(P3P\_\{3\}: offline editing, grounded in user research\)\. Operations \(T12\):Undermine\(P3P\_\{3\}in pure P2P form: access control problem\)\. Operations \(T13\):Hypothesize\(new hybrid positionP5P\_\{5\}: Yjs \+ central server relay\)\. Operations \(T14\):Support\(P5P\_\{5\}: standard production setup, fits existing infrastructure\)\.

Why T13 is a key move:Carol resolves the tension between two competing concerns \(CRDT benefits vs\. access control\) by*proposing a hybrid*that keeps the advantages of both\. In argumentation terms, she introduces a new position that is not a compromise but a synthesis\. In the formal framework, this is both awareness expansion \(the hybrid configuration wasn’t previously considered\) and hypothesis generation \(proposing that it would work\)\.

Model:*Second key reframing*: from “Yjs peer\-to\-peer” to “Yjs server\-relayed\.”

- •PositionP5P\_\{5\}: Yjs with central WebSocket server as sync/authority point\. Status:leading\.
- •P5P\_\{5\}pro: CRDT conflict resolution \(inherent\); offline merge \(inherent\); access control via server \(T13\); existing WebSocket infrastructure \(T14, Bob\); ProseMirror binding \(T6, Carol\)\.
- •P5P\_\{5\}con: inherits document size risk fromP3P\_\{3\}\(Bob, T8\)\.
- •Pure P2P variant ofP3P\_\{3\}: effectively abandoned due to access control concern\.
- •Epistemic shift: hybrid position resolves the tension between CRDT benefits and permission requirements\.

Symbolic model𝐴𝐹14\\mathit\{AF\}\_\{14\}:New arguments:α11=\(P3offline editing,c,T11,𝑝𝑟𝑜\)\\alpha\_\{11\}=\(P\_\{3\}\\text\{ offline editing\},c,T11,\\mathit\{pro\}\),α12=\(P3P2P access control,b,T12,𝑐𝑜𝑛\)\\alpha\_\{12\}=\(P\_\{3\}\\text\{ P2P access control\},b,T12,\\mathit\{con\}\),α13=\(P5hybrid resolves tension,c,T13,𝑝𝑟𝑜\)\\alpha\_\{13\}=\(P\_\{5\}\\text\{ hybrid resolves tension\},c,T13,\\mathit\{pro\}\),α14=\(P5std\. setup \+ infra,b,T14,𝑝𝑟𝑜\)\\alpha\_\{14\}=\(P\_\{5\}\\text\{ std\.\\ setup \+ infra\},b,T14,\\mathit\{pro\}\)\.*Awareness expansion:*P5P\_\{5\}\(Yjs server\-relayed\) added to all𝒜i\\mathcal\{A\}\_\{i\}at T13\.𝐴𝑡𝑡14=𝐴𝑡𝑡10∪\{\(α12,P3\),\(α13,α12\)\}\\mathit\{Att\}\_\{14\}=\\mathit\{Att\}\_\{10\}\\cup\\\{\(\\alpha\_\{12\},P\_\{3\}\),\(\\alpha\_\{13\},\\alpha\_\{12\}\)\\\}\.*Key:*α13\\alpha\_\{13\}attacksα12\\alpha\_\{12\}: the hybrid resolves the access control concern, soα12\\alpha\_\{12\}no longer defeatsP3/P5P\_\{3\}/P\_\{5\}\. Butα12\\alpha\_\{12\}still defeats pure P2PP3P\_\{3\}\.P5P\_\{5\}inherits pro\-arguments fromP3P\_\{3\}\(α5,α7,α11\\alpha\_\{5\},\\alpha\_\{7\},\\alpha\_\{11\}\) and inherits con\-arguments \(α8,α10\\alpha\_\{8\},\\alpha\_\{10\}\)\.P5P\_\{5\}gains new pro\-arguments \(α13,α14\\alpha\_\{13\},\\alpha\_\{14\}\)\.𝐶𝑚\(c\)=𝐶𝑚\(c\)∪\{α11,α13\}\\mathit\{Cm\}\(c\)=\\mathit\{Cm\}\(c\)\\cup\\\{\\alpha\_\{11\},\\alpha\_\{13\}\\\};𝐶𝑚\(b\)=𝐶𝑚\(b\)∪\{α12,α14\}\\mathit\{Cm\}\(b\)=\\mathit\{Cm\}\(b\)\\cup\\\{\\alpha\_\{12\},\\alpha\_\{14\}\\\}\.*Acceptability status:*P5P\_\{5\}has 5 supporting arguments \(α5,α7,α11,α13,α14\\alpha\_\{5\},\\alpha\_\{7\},\\alpha\_\{11\},\\alpha\_\{13\},\\alpha\_\{14\}\) and one undefeated attack \(α10\\alpha\_\{10\}: irreversibility\)\. In Dung’s semantics,P5P\_\{5\}is*not*in the grounded extension becauseα10\\alpha\_\{10\}is undefeated\. However, the group treats this as an*accepted risk*, the irreversibility concern is acknowledged but deemed tolerable given current assumptions\. This is modelled by the dependency structure:α10\\alpha\_\{10\}’s practical force is conditional on document growth \(𝐷𝑒𝑝\(α10\)=∅\\mathit\{Dep\}\(\\alpha\_\{10\}\)=\\emptysetformally, but its*relevance*depends ona1,a2a\_\{1\},a\_\{2\}\)\.

T15: Bob:I want to come back to the document size issue\. If we go CRDT, every edit operation is stored permanently in the CRDT state\. For a 10\-page document edited for months, the CRDT metadata could be 10–50×\\timeslarger than the content\. Yjs has compaction but it’s not trivial\. And switching from CRDT to OT later would be a six\-month rewrite\.T16: Alice:How confident are you that the problem will actually manifest? Our documents are short and have burst editing, a few days of activity, then they become read\-only\.T17: Bob:For the current use case, probably 80% chance it’s fine\. But the Q2 roadmap includes long\-running project documents\. Those would be edited continuously for months\.T18: Alice:Q2 isn’t confirmed\. I don’t want to make an architectural decision now based on a feature that might not happen\. Here’s what I propose: we go with Yjs for launch\. Bob, write up the risk with specific thresholds, when should we start worrying\. If Q2 confirms long\-running documents, we evaluate then\.T19: Bob:I’ll write it up\. But I want it on the record that I think this is short\-sighted\. If we’d gone with ShareDB, we wouldn’t be carrying this risk at all\.

Operations \(T15\):Undermine\(P5P\_\{5\}: Bob escalates the document size concern with specific numbers and the irreversibility argument\)\. Operations \(T16\):Question\(Alice challenges the probability of the risk manifesting\)\. Operations \(T17\):Support\(80% fine for current use\) \+Undermine\(Q2 roadmap would change the risk profile\)\. Operations \(T18\):Resolve\(Alice makes the decision by authority, with explicit conditions for revisiting\)\. Operations \(T19\):Observe\(Bob records dissent as a public commitment\)\.

Why this is the most important Phase 3 moment:This is a decision made*despite*unresolved disagreement\. Bob genuinely believes the team is making a mistake, and Alice acknowledges his concern but overrides it based on product priorities\. The model must capture not just the decision but the*structure of the disagreement*: who dissented, why, what would change their mind, and what conditions were explicitly agreed as triggers for re\-evaluation\.

In the formal framework, Alice’s T18 is a commitment act: she commits the team toP5P\_\{5\}, but the commitment is*conditional*, it explicitly depends on assumptionsa1a\_\{1\}–a3a\_\{3\}\(below\)\. Bob’s T19 is a public recording of dissent: he accepts the decision but does not retract his argument\. This is different from both agreement \(Bob doesn’t endorse the decision\) and from blocking \(he doesn’t prevent it\)\.

Symbolic model𝐴𝐹19\\mathit\{AF\}\_\{19\}\(final\):New arguments:α15=\(P510–50×metadata,b,T15,𝑐𝑜𝑛\)\\alpha\_\{15\}=\(P\_\{5\}\\text\{ 10\-\-50$\\times$ metadata\},b,T15,\\mathit\{con\}\)\(strengthensα8\\alpha\_\{8\}\),α16=\(P5ok given current docs,a,T16,𝑝𝑟𝑜\)\\alpha\_\{16\}=\(P\_\{5\}\\text\{ ok given current docs\},a,T16,\\mathit\{pro\}\),α17=\(P5Q2 would change risk,b,T17,𝑐𝑜𝑛\)\\alpha\_\{17\}=\(P\_\{5\}\\text\{ Q2 would change risk\},b,T17,\\mathit\{con\}\),α18=\(decideP5for launch,a,T18,𝑟𝑒𝑠𝑜𝑙𝑣𝑒\)\\alpha\_\{18\}=\(\\text\{decide \}P\_\{5\}\\text\{ for launch\},a,T18,\\mathit\{resolve\}\),α19=\(dissent: prefers ShareDB,b,T19,𝑐𝑜𝑛\)\\alpha\_\{19\}=\(\\text\{dissent: prefers ShareDB\},b,T19,\\mathit\{con\}\)\.

𝐴𝑡𝑡19=𝐴𝑡𝑡14∪\{\(α15,P5\),\(α16,α15\),\(α17,α16\)\}\\mathit\{Att\}\_\{19\}=\\mathit\{Att\}\_\{14\}\\cup\\\{\(\\alpha\_\{15\},P\_\{5\}\),\(\\alpha\_\{16\},\\alpha\_\{15\}\),\(\\alpha\_\{17\},\\alpha\_\{16\}\)\\\}\.*Attack chain:*α17\\alpha\_\{17\}attacksα16\\alpha\_\{16\}which attacksα15\\alpha\_\{15\}which attacksP5P\_\{5\}\. This creates a dialectical tree: Bob’s Q2 concern reinstates the document size risk by undermining Alice’s “current docs are fine” defence\.

𝐷𝑒𝑝\(α16\)=\{a1,a2,a3\}\\mathit\{Dep\}\(\\alpha\_\{16\}\)=\\\{a\_\{1\},a\_\{2\},a\_\{3\}\\\}where:a1≔a\_\{1\}\\coloneqq“docs are short \(5–10 pages\),”a2≔a\_\{2\}\\coloneqq“editing is burst,”a3≔a\_\{3\}\\coloneqq“Q2 long\-running docs not confirmed\.”𝐷𝑒𝑝\(α17\)=\{¬a3\}\\mathit\{Dep\}\(\\alpha\_\{17\}\)=\\\{\\lnot a\_\{3\}\\\}\(this argument activates ifa3a\_\{3\}becomes false, i\.e\., Q2 is confirmed\)\.𝐷𝑒𝑝\(α18\)=\{a1,a2,a3\}\\mathit\{Dep\}\(\\alpha\_\{18\}\)=\\\{a\_\{1\},a\_\{2\},a\_\{3\}\\\}\(the decision itself depends on these assumptions\)\.

*Commitments:*𝐶𝑚\(a\)=\{α16,α18\}\\mathit\{Cm\}\(a\)=\\\{\\alpha\_\{16\},\\alpha\_\{18\}\\\}\(Alice commits to the decision and its justification\)\.𝐶𝑚\(b\)=\{α1,α2,α7,α8,α9,α10,α12,α14,α15,α17,α19\}\\mathit\{Cm\}\(b\)=\\\{\\alpha\_\{1\},\\alpha\_\{2\},\\alpha\_\{7\},\\alpha\_\{8\},\\alpha\_\{9\},\\alpha\_\{10\},\\alpha\_\{12\},\\alpha\_\{14\},\\alpha\_\{15\},\\alpha\_\{17\},\\alpha\_\{19\}\\\}\(Bob commits to both pro and con arguments; crucially, he doesnotcommit toα18\\alpha\_\{18\}, he records dissent viaα19\\alpha\_\{19\}\)\.𝐶𝑚\(c\)=\{α3,α4,α5,α6,α11,α13\}\\mathit\{Cm\}\(c\)=\\\{\\alpha\_\{3\},\\alpha\_\{4\},\\alpha\_\{5\},\\alpha\_\{6\},\\alpha\_\{11\},\\alpha\_\{13\}\\\}\.

*Decision status:*I1I\_\{1\}resolved with dissent\.P5P\_\{5\}accepted\. Decision authority:aa\(product lead\)\.*Conditional commitment:*𝐷𝑒𝑝\(α18\)=\{a1,a2,a3\}\\mathit\{Dep\}\(\\alpha\_\{18\}\)=\\\{a\_\{1\},a\_\{2\},a\_\{3\}\\\}\. If anyaia\_\{i\}changes, the decision is flagged for re\-evaluation per[Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1)\.*Dissent record:*α19∈𝐶𝑚\(b\)\\alpha\_\{19\}\\in\\mathit\{Cm\}\(b\)butα18∉𝐶𝑚\(b\)\\alpha\_\{18\}\\notin\\mathit\{Cm\}\(b\)\. Bob accepts the decision procedurally but does not endorse it epistemically\.

#### Final model\.

- •IssueI1I\_\{1\}: Status→\\todecided with dissent\.
- •Decision:P5P\_\{5\}\(Yjs, server\-relayed via WebSocket\)\. Decided by: Alice \(product lead authority\)\.
- •Dissent: Bob\. Position: prefers ShareDB for long\-term architectural safety\. Objection recorded: “short\-sighted\.”
- •Assumptions underlying the decision: - –a1a\_\{1\}: Documents are short \(5–10 pages\)\.*Currently true\.* - –a2a\_\{2\}: Editing pattern is burst \(few days of activity, then read\-only\)\.*Currently true\.* - –a3a\_\{3\}: Q2 long\-running project documents not confirmed\.*Currently true\.*
- •Conditional commitment: If Q2 roadmap confirms long\-running documents, the team will re\-evaluate the architecture choice\.
- •Dependency graph:decision\(P5\)→\{a1,a2,a3\}\\text\{decision\}\(P\_\{5\}\)\\to\\\{a\_\{1\},a\_\{2\},a\_\{3\}\\\}\. Change in any assumption should trigger re\-evaluation\.
- •Action items: Bob writes risk analysis with specific thresholds; Carol prototypes frontend awareness features\.
- •Positions explored and abandoned \(with reasons preserved\): - –P1,P2P\_\{1\},P\_\{2\}\(from\-scratch OT/CRDT\): abandoned because of 6\-week timeline\. - –P4P\_\{4\}\(ShareDB\): not chosen; unclear ProseMirror integration, no offline support\. - –P3P\_\{3\}\(Yjs pure P2P\): abandoned because access control is infeasible without server\.
- •Epistemic shifts: \(1\) T5: reframing from algorithm to library choice; \(2\) T13: hybrid position resolves CRDT\-vs\-access\-control tension; \(3\) T18–19: decision made with recorded dissent\.

#### Counterfactual test\.

One week later, the Q2 roadmap is confirmed with long\-running project documents\. The model should:

1. 1\.Identify that assumptiona3a\_\{3\}has changed \(Q2 now confirmed\)\.
2. 2\.Trace the dependency:a3a\_\{3\}is one of the assumptions underlying theP5P\_\{5\}decision\.
3. 3\.Surface the conditional commitment: the team explicitly agreed to re\-evaluate in this scenario\.
4. 4\.Recall Bob’s document size analysis as the relevant prior concern, including his specific numbers \(10–50×\\timesmetadata growth\) and the irreversibility argument \(six\-month rewrite\)\.
5. 5\.*Not*re\-explore from\-scratch positions \(P1,P2P\_\{1\},P\_\{2\}\), those were abandoned because of the timeline, which hasn’t changed\.
6. 6\.*Not*re\-explore pure P2P Yjs \(P3P\_\{3\}\), that was abandoned because of access control, which is unrelated toa3a\_\{3\}\.
7. 7\.Focus the re\-evaluation onP5P\_\{5\}vs\.P4P\_\{4\}\(ShareDB\), since the document size concern is specific to CRDTs and ShareDB was the alternative Bob advocated\.

This is the selective re\-grounding capability that motivates the verifier: current LLMs lack a maintained dependency structure, so they cannot reliably propagate an assumption change without re\-exploring already\-settled questions for unrelated reasons\.

Symbolic model under counterfactual \(a3a\_\{3\}changes\):Assumptiona3a\_\{3\}changes:ℳ,w⊧a3→ℳ′,w⊧̸a3\\mathcal\{M\},w\\models a\_\{3\}\\to\\mathcal\{M\}^\{\\prime\},w\\not\\models a\_\{3\}\(Q2 confirmed\)\.𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(a3\)=\{α∈𝐴𝑟𝑔𝑠:a3∈𝐷𝑒𝑝\(α\)\}=\{α16,α18\}\\mathit\{Affected\}\(a\_\{3\}\)=\\\{\\alpha\\in\\mathit\{Args\}:a\_\{3\}\\in\\mathit\{Dep\}\(\\alpha\)\\\}=\\\{\\alpha\_\{16\},\\alpha\_\{18\}\\\}\(Alice’s “ok for now” argument and the decision itself\)\. Per[Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1):S′=S∖\{α16,α18\}S^\{\\prime\}=S\\setminus\\\{\\alpha\_\{16\},\\alpha\_\{18\}\\\}is conflict\-free\. The decisionα18\\alpha\_\{18\}is removed from the preferred extension and flagged for re\-evaluation\.α17\\alpha\_\{17\}\(Bob’s Q2 concern\) is*activated*:𝐷𝑒𝑝\(α17\)=\{¬a3\}\\mathit\{Dep\}\(\\alpha\_\{17\}\)=\\\{\\lnot a\_\{3\}\\\}, and¬a3\\lnot a\_\{3\}is now true\.α17\\alpha\_\{17\}enters the preferred extension, reinstating the document size attack onP5P\_\{5\}\.α19\\alpha\_\{19\}\(Bob’s dissent\) provides the re\-evaluation starting point: prefersP4P\_\{4\}\(ShareDB\)\. Argumentsα1,α2\\alpha\_\{1\},\\alpha\_\{2\}\(P1,P2P\_\{1\},P\_\{2\}too risky\) are*not*affected:𝐷𝑒𝑝\(α1\)=𝐷𝑒𝑝\(α2\)=\{𝑡𝑖𝑚𝑒𝑙𝑖𝑛𝑒\}\\mathit\{Dep\}\(\\alpha\_\{1\}\)=\\mathit\{Dep\}\(\\alpha\_\{2\}\)=\\\{\\mathit\{timeline\}\\\}, and the timeline hasn’t changed\.α12\\alpha\_\{12\}\(P3P\_\{3\}P2P access control\) is*not*affected:a3∉𝐷𝑒𝑝\(α12\)a\_\{3\}\\notin\\mathit\{Dep\}\(\\alpha\_\{12\}\)\.

## Appendix EFull Proofs

### Formal definitions deferred from Section[2](https://arxiv.org/html/2605.14175#S2)

###### Definition E\.1\(Epistemic Plausibility Model\)\.

An*epistemic plausibility model*is a tupleℳ=\(W,\{≤i\}i∈Ags,V\)\\mathcal\{M\}=\(W,\\\{\\leq\_\{i\}\\\}\_\{i\\in\\mathrm\{Ags\}\},V\)where:

- •WWis a non\-empty set of possible worlds;
- •each≤i⊆W×W\{\\leq\_\{i\}\}\\subseteq W\\times Wis a reflexive, transitive,*locally connected*preorder: for allw,w′w,w^\{\\prime\}in the same indistinguishability cell, eitherw≤iw′w\\leq\_\{i\}w^\{\\prime\}orw′≤iww^\{\\prime\}\\leq\_\{i\}w;
- •V:Prop→𝒫\(W\)V:\\mathrm\{Prop\}\\to\\mathcal\{P\}\(W\)is a valuation\.

Convention\.We follow the standard Baltag–Smets reading:w≤iw′w\\leq\_\{i\}w^\{\\prime\}means “wwis at least as plausible asw′w^\{\\prime\}for agentii,” so≤i\\leq\_\{i\}\-minimal worlds are most plausible\. The indistinguishability relation is∼i⁣:=⁣≤i⁣∪⁣≥i\\sim\_\{i\}\\;:=\\;\\leq\_\{i\}\\cup\\geq\_\{i\}; different agents may have different information cells from the sameww\. Agentii*knows*φ\\varphiatwwiffφ\\varphiholds at everyw′w^\{\\prime\}withw∼iw′w\\sim\_\{i\}w^\{\\prime\}\. Agentii*believes*φ\\varphiatwwiffφ\\varphiholds at every≤i\\leq\_\{i\}\-minimal world in\{w′:w∼iw′\}\\\{w^\{\\prime\}:w\\sim\_\{i\}w^\{\\prime\}\\\}\.

Updates\.The three dynamic operators in full:*Public announcement*\(hard\)\[\!ψ\]\[\{\!\\psi\}\]eliminates everyw∉⟦ψ⟧w\\notin\\llbracket\\psi\\rrbracket\.*Lexicographic upgrade*\(radical, soft\)\[⇑ψ\]\[\{\\Uparrow\\psi\}\]makes everyψ\\psi\-world strictly more plausible than every¬ψ\\lnot\\psi\-world while preserving the within\-side order; no worlds are eliminated\.*Conservative upgrade*\(minimal, soft\)\[↑ψ\]\[\{\\uparrow\\psi\}\]promotes the single most plausibleψ\\psi\-world to the overall minimum, preserving all other comparisons\. Event models\[Baltaget al\.,[1998](https://arxiv.org/html/2605.14175#bib.bib4)\]generalise these when different agents perceive the same event differently\.

###### Definition E\.2\(Abductive Problem and Solution\)\.

Given a modelℳ\\mathcal\{M\}, actual worldww, agentii, and observationχ\\chiwithℳ,w⊧χ\\mathcal\{M\},w\\models\\chibutℳ,w⊧̸𝐁iχ\\mathcal\{M\},w\\not\\models\\mathbf\{B\}\_\{i\}\\chi\(the observation is true but not previously believed,*surprising*\), an*abductive problem*is the triple\(ℳ,i,χ\)\(\\mathcal\{M\},i,\\chi\)\. A formulaγ\\gammais an*abductive solution*when:

1. \(1\)*Consistency*:ℳ⊧̸𝐁i¬γ\\mathcal\{M\}\\not\\models\\mathbf\{B\}\_\{i\}\\lnot\\gamma;
2. \(2\)*Explanatory*:ℳ\[⇑γ\],w⊧𝐁iχ\\mathcal\{M\}\[\{\\Uparrow\\gamma\}\],w\\models\\mathbf\{B\}\_\{i\}\\chi;
3. \(3\)*Non\-triviality*:γ≠χ\\gamma\\neq\\chi,ℳ⊧̸𝐁iγ\\mathcal\{M\}\\not\\models\\mathbf\{B\}\_\{i\}\\gamma, andγ≠⊤\\gamma\\neq\\top\.

The solution is integrated via\[⇑γ\]\[\{\\Uparrow\\gamma\}\], soγ\\gammaenters as belief, not knowledge\.

###### Definition E\.3\(Awareness Structure\)\.

An*awareness structure*extends an epistemic plausibility model with a function𝒜i:W→𝒫\(Form\)\\mathcal\{A\}\_\{i\}:W\\to\\mathcal\{P\}\(\\text\{Form\}\)for each agentii, where𝒜i\(w\)\\mathcal\{A\}\_\{i\}\(w\)is the set of formulasiiis aware of atww\. The awareness modality satisfiesℳ,w⊧𝐀iφ⇔φ∈𝒜i\(w\)\\mathcal\{M\},w\\models\\mathbf\{A\}\_\{i\}\\varphi\\iff\\varphi\\in\\mathcal\{A\}\_\{i\}\(w\)\. Explicit knowledge isXiφ⇔𝐊iφ∧𝐀iφX\_\{i\}\\varphi\\iff\\mathbf\{K\}\_\{i\}\\varphi\\land\\mathbf\{A\}\_\{i\}\\varphi\. Awareness expansion adds formulas to𝒜i\\mathcal\{A\}\_\{i\}and refinesWWaccordingly: worlds that previously differed only on a now\-newly\-aware proposition become distinct\.

### Proofs

###### Proof of[Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1)\(Conflict\-free selective retraction\)\.

Let\(ℳ,𝐴𝐹,𝐶𝑚,𝐷𝑒𝑝\)\(\\mathcal\{M\},\\mathit\{AF\},\\mathit\{Cm\},\\mathit\{Dep\}\)be a dependency structure \([Definition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmdefinition1)\) with preferred extensionS⊆𝐴𝑟𝑔𝑠S\\subseteq\\mathit\{Args\}\. Letp∈𝑃𝑟𝑜𝑝p\\in\\mathit\{Prop\}be retracted \(or falsified\)\. Define𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\)=\{α∈S:p∈𝐷𝑒𝑝\(α\)\}\\mathit\{Affected\}\(p\)=\\\{\\alpha\\in S:p\\in\\mathit\{Dep\}\(\\alpha\)\\\},S′=S∖𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\)S^\{\\prime\}=S\\setminus\\mathit\{Affected\}\(p\), and𝐴𝐹′=\(𝐴𝑟𝑔𝑠∖𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\),𝐴𝑡𝑡∩\(𝐴𝑟𝑔𝑠∖𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\)\)2\)\\mathit\{AF\}^\{\\prime\}=\(\\mathit\{Args\}\\setminus\\mathit\{Affected\}\(p\),\\;\\mathit\{Att\}\\cap\(\\mathit\{Args\}\\setminus\\mathit\{Affected\}\(p\)\)^\{2\}\)\.

*Conflict\-freeness in𝐴𝐹′\\mathit\{AF\}^\{\\prime\}*:SSis conflict\-free in𝐴𝐹\\mathit\{AF\}\(as a preferred extension\), andS′⊆SS^\{\\prime\}\\subseteq S\. The attack relation of𝐴𝐹′\\mathit\{AF\}^\{\\prime\}is𝐴𝑡𝑡\\mathit\{Att\}restricted to𝐴𝑟𝑔𝑠∖𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\)\\mathit\{Args\}\\setminus\\mathit\{Affected\}\(p\); restriction cannot create new attacks\. If no pair inSSattacks each other in𝐴𝐹\\mathit\{AF\}, no pair in the smaller setS′S^\{\\prime\}can attack each other in𝐴𝐹′\\mathit\{AF\}^\{\\prime\}\. HenceS′S^\{\\prime\}is conflict\-free in𝐴𝐹′\\mathit\{AF\}^\{\\prime\}\.

*Existence of a post\-retraction state*: Every Dung argumentation framework has at least one preferred extension \(possibly∅\\emptyset\)\[Dung,[1995](https://arxiv.org/html/2605.14175#bib.bib14)\], so𝐴𝐹′\\mathit\{AF\}^\{\\prime\}has a preferred extension\. Any preferred extension of𝐴𝐹′\\mathit\{AF\}^\{\\prime\}that includesS′S^\{\\prime\}is therefore a valid post\-retraction state: it preserves every conclusion inSSwhose justification is independent ofppand removes exactly those that depend onpp\.

*What the proposition does not claim \(scope\)*: The proposition does not assert thatS′S^\{\\prime\}itself is admissible in𝐴𝐹′\\mathit\{AF\}^\{\\prime\}; if every defender of someα∈S′\\alpha\\in S^\{\\prime\}happened to lie in𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\)\\mathit\{Affected\}\(p\),α\\alphamay require re\-examination\. Nor does it assert uniqueness of preferred extensions of𝐴𝐹′\\mathit\{AF\}^\{\\prime\}, or rule out previously\-attacked arguments being reinstated when their attackers are removed\. ∎

###### Proof of[Proposition˜2\.2](https://arxiv.org/html/2605.14175#S2.Thmproposition2)\(Tractability under structural restrictions\)\.

*Phase 1*: Each public announcement\[\!ψ\]\[\{\!\\psi\}\]eliminates worlds where¬ψ\\lnot\\psiholds, requiring one pass throughWW:𝒪\(\|W\|\)\\mathcal\{O\}\(\|W\|\)\. Withkkbinary propositions,\|W\|≤2k\|W\|\\leq 2^\{k\}\. OverTTturns:𝒪\(T⋅2k\)\\mathcal\{O\}\(T\\cdot 2^\{k\}\)\.

*Phase 2*: A lexicographic upgrade\[⇑γ\]\[\{\\Uparrow\\gamma\}\]\(definition in the*Updates*paragraph above\) is realised in our reference implementation as a bitmap partition on⟦γ⟧\\llbracket\\gamma\\rrbracketfollowed by a stable sort on≤i\\leq\_\{i\}, running in𝒪\(\|W\|log⁡\|W\|\)\\mathcal\{O\}\(\|W\|\\log\|W\|\)per upgrade; an alternative naive realisation reorders all world pairs and gives𝒪\(\|W\|2\)\\mathcal\{O\}\(\|W\|^\{2\}\)per upgrade\. With at mosthhhypotheses overTTturns the bitmap bound gives𝒪\(T⋅h⋅\|W\|log⁡\|W\|\)=𝒪\(T⋅h⋅k⋅2k\)\\mathcal\{O\}\(T\\cdot h\\cdot\|W\|\\log\|W\|\)=\\mathcal\{O\}\(T\\cdot h\\cdot k\\cdot 2^\{k\}\)\. Awareness expansion adds atomic propositions to𝑃𝑟𝑜𝑝\\mathit\{Prop\}, potentially doubling\|W\|\|W\|, but this occurs𝒪\(1\)\\mathcal\{O\}\(1\)times per conversation segment\.

*Phase 3*: When the attack graph𝐴𝑡𝑡\\mathit\{Att\}is acyclic, the grounded, complete, preferred, and stable semantics all yield the same unique extension\[Dung,[1995](https://arxiv.org/html/2605.14175#bib.bib14)\], computable in𝒪\(\|𝐴𝑟𝑔𝑠\|\+\|𝐴𝑡𝑡\|\)\\mathcal\{O\}\(\|\\mathit\{Args\}\|\+\|\\mathit\{Att\}\|\)by topological processing\[Dunne,[2007](https://arxiv.org/html/2605.14175#bib.bib13)\]\. Dependency propagation via𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\)\\mathit\{Affected\}\(p\)requires scanning each argument’s dependency set:𝒪\(\|𝐴𝑟𝑔𝑠\|⋅d¯\)\\mathcal\{O\}\(\|\\mathit\{Args\}\|\\cdot\\bar\{d\}\)whered¯\\bar\{d\}is the mean\|𝐷𝑒𝑝\(α\)\|\|\\mathit\{Dep\}\(\\alpha\)\|\.

*In our scenarios*,kkis 3 \(muddy children\), 12 \(Phase 2 debugging: 10 observations\+\+4 hypotheses, with overlap\), and 9 \(Phase 3 deliberation: 5 positions\+\+4 assumptions\);\|𝐴𝑟𝑔𝑠\|≤20\|\\mathit\{Args\}\|\\leq 20\.

*General worst cases\.*Without these restrictions, DEL model\-checking with event models isPSpace\-complete\[Aucher and Schwarzentruber,[2013](https://arxiv.org/html/2605.14175#bib.bib11)\], skeptical preferred acceptance in Dung\-style AFs isΠ2P\\Pi^\{\\mathrm\{P\}\}\_\{2\}\-complete\[Dunne and Wooldridge,[2009](https://arxiv.org/html/2605.14175#bib.bib12)\], and preferred\-extension enumeration has no known polynomial\-delay algorithm\. Our scenarios avoid these by construction: smallkk, acyclic attack graph, and explicit world representation\. ∎

### Per\-turn composition:Apply

The per\-turn update𝒟t↦𝒟t\+1\\mathcal\{D\}\_\{t\}\\mapsto\\mathcal\{D\}\_\{t\+1\}composes a single classified operation across all four components of the dependency structure \([Definition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmdefinition1)\): the epistemic plausibility modelℳ\\mathcal\{M\}, the argumentation framework𝐴𝐹\\mathit\{AF\}, the commitment record𝐶𝑚\\mathit\{Cm\}, and the dependency map𝐷𝑒𝑝\\mathit\{Dep\}\.[Algorithm˜1](https://arxiv.org/html/2605.14175#alg1)states this composition\. Per\-operation DEL realisations and preconditions are in[Table˜8](https://arxiv.org/html/2605.14175#A5.T8); the algorithm shows how they update the four components jointly\. Notational convention:αγ\\alpha\_\{\\gamma\}denotes the existing argument in𝐴𝑟𝑔𝑠t\\mathit\{Args\}\_\{t\}withclaim\(αγ\)=γ\\mathrm\{claim\}\(\\alpha\_\{\\gamma\}\)=\\gamma\(referenced from cases that act on a prior hypothesis:Support,Undermine,Resolve\)\.

Algorithm 1𝖠𝗉𝗉𝗅𝗒\(op,𝑎𝑟𝑔𝑠,𝒟t,i\)→𝒟t\+1\\mathsf\{Apply\}\(op,\\,\\mathit\{args\},\\,\\mathcal\{D\}\_\{t\},\\,i\)\\to\\mathcal\{D\}\_\{t\+1\}\. Per\-turn composition rule: a single classified utterance by speakeriisimultaneously updates the four components \(ℳ\\mathcal\{M\},𝐴𝐹\\mathit\{AF\},𝐶𝑚\\mathit\{Cm\},𝐷𝑒𝑝\\mathit\{Dep\}\) of the dependency structure\. Operation arguments𝑎𝑟𝑔𝑠\\mathit\{args\}are operation\-specific \(e\.g\.,Hypothesizetakes the candidateγ\\gammaand its supporting observations𝑑𝑒𝑝𝑠\\mathit\{deps\}\)\. Preconditions \([Table˜8](https://arxiv.org/html/2605.14175#A5.T8)\) are checked before invocation; failure triggers a re\-prompt to the LLM Interpreter\.1:

𝒟t=\(ℳt,𝐴𝐹t,𝐶𝑚t,𝐷𝑒𝑝t\)\\mathcal\{D\}\_\{t\}=\(\\mathcal\{M\}\_\{t\},\\mathit\{AF\}\_\{t\},\\mathit\{Cm\}\_\{t\},\\mathit\{Dep\}\_\{t\}\), speaker

i∈𝐴𝑔𝑠i\\in\\mathit\{Ags\}
2:

\(ℳ′,𝐴𝐹′,𝐶𝑚′,𝐷𝑒𝑝′\)←\(ℳt,𝐴𝐹t,𝐶𝑚t,𝐷𝑒𝑝t\)\(\\mathcal\{M\}^\{\\prime\},\\mathit\{AF\}^\{\\prime\},\\mathit\{Cm\}^\{\\prime\},\\mathit\{Dep\}^\{\\prime\}\)\\leftarrow\(\\mathcal\{M\}\_\{t\},\\mathit\{AF\}\_\{t\},\\mathit\{Cm\}\_\{t\},\\mathit\{Dep\}\_\{t\}\)
3:switch

opopof

4:case

Observe\(ψ\)\\textsc\{Observe\}\(\\psi\):

5:

ℳ′←ℳt\[\!ψ\]\\mathcal\{M\}^\{\\prime\}\\leftarrow\\mathcal\{M\}\_\{t\}\[\{\!\\psi\}\]
6:

α←𝗇𝖾𝗐𝖠𝗋𝗀\(claim=ψ,src=i\)\\alpha\\leftarrow\\mathsf\{newArg\}\(\\mathrm\{claim\}\{=\}\\psi,\\,\\mathrm\{src\}\{=\}i\);

𝐴𝑟𝑔𝑠′←𝐴𝑟𝑔𝑠t∪\{α\}\\mathit\{Args\}^\{\\prime\}\\leftarrow\\mathit\{Args\}\_\{t\}\\cup\\\{\\alpha\\\}
7:

𝐴𝑡𝑡′←𝐴𝑡𝑡t∪\{\(α,β\):claim\(β\)⊧¬ψ\}\\mathit\{Att\}^\{\\prime\}\\leftarrow\\mathit\{Att\}\_\{t\}\\cup\\\{\(\\alpha,\\beta\):\\mathrm\{claim\}\(\\beta\)\\models\\lnot\\psi\\\}
8:

𝐶𝑚′\(i\)←𝐶𝑚t\(i\)∪\{α\}\\mathit\{Cm\}^\{\\prime\}\(i\)\\leftarrow\\mathit\{Cm\}\_\{t\}\(i\)\\cup\\\{\\alpha\\\};

𝐷𝑒𝑝′\(α\)←∅\\mathit\{Dep\}^\{\\prime\}\(\\alpha\)\\leftarrow\\emptyset
9:

10:case

Hypothesize\(γ,𝑑𝑒𝑝𝑠\)\\textsc\{Hypothesize\}\(\\gamma,\\,\\mathit\{deps\}\):⊳\\triangleright𝑑𝑒𝑝𝑠⊆𝑃𝑟𝑜𝑝\\mathit\{deps\}\\subseteq\\mathit\{Prop\}supportsγ\\gammavia abduction \([Definition˜E\.2](https://arxiv.org/html/2605.14175#A5.Thmdefinition2)\)

11:

ℳ′←ℳt\[⇑γ\]\\mathcal\{M\}^\{\\prime\}\\leftarrow\\mathcal\{M\}\_\{t\}\[\{\\Uparrow\\gamma\}\]
12:

α←𝗇𝖾𝗐𝖠𝗋𝗀\(claim=γ,src=i\)\\alpha\\leftarrow\\mathsf\{newArg\}\(\\mathrm\{claim\}\{=\}\\gamma,\\,\\mathrm\{src\}\{=\}i\);

𝐴𝑟𝑔𝑠′←𝐴𝑟𝑔𝑠t∪\{α\}\\mathit\{Args\}^\{\\prime\}\\leftarrow\\mathit\{Args\}\_\{t\}\\cup\\\{\\alpha\\\}
13:

𝐶𝑚′\(i\)←𝐶𝑚t\(i\)∪\{α\}\\mathit\{Cm\}^\{\\prime\}\(i\)\\leftarrow\\mathit\{Cm\}\_\{t\}\(i\)\\cup\\\{\\alpha\\\};

𝐷𝑒𝑝′\(α\)←𝑑𝑒𝑝𝑠\\mathit\{Dep\}^\{\\prime\}\(\\alpha\)\\leftarrow\\mathit\{deps\}
14:

15:case

Support\(γ,e\)\\textsc\{Support\}\(\\gamma,\\,e\):

16:

ℳ′←ℳt\[↑γ\]\\mathcal\{M\}^\{\\prime\}\\leftarrow\\mathcal\{M\}\_\{t\}\[\{\\uparrow\\gamma\}\]if

eeis generic, else

ℳt\[⇑γ\]\\mathcal\{M\}\_\{t\}\[\{\\Uparrow\\gamma\}\]
17:

𝐷𝑒𝑝′\(αγ\)←𝐷𝑒𝑝t\(αγ\)∪\{e\}\\mathit\{Dep\}^\{\\prime\}\(\\alpha\_\{\\gamma\}\)\\leftarrow\\mathit\{Dep\}\_\{t\}\(\\alpha\_\{\\gamma\}\)\\cup\\\{e\\\}
18:

19:case

Undermine\(γ,e\)\\textsc\{Undermine\}\(\\gamma,\\,e\):

20:

ℳ′←ℳt\[⇑¬γ\]\\mathcal\{M\}^\{\\prime\}\\leftarrow\\mathcal\{M\}\_\{t\}\[\{\\Uparrow\\lnot\\gamma\}\]when

eedistinguishes

¬γ\\lnot\\gammafrom

γ\\gamma
21:

𝐴𝑡𝑡′←𝐴𝑡𝑡t∪\{\(αe,αγ\)\}\\mathit\{Att\}^\{\\prime\}\\leftarrow\\mathit\{Att\}\_\{t\}\\cup\\\{\(\\alpha\_\{e\},\\alpha\_\{\\gamma\}\)\\\}
22:

23:case

Revise\(γ\)\\textsc\{Revise\}\(\\gamma\):

24:

ℳ′←ℳt\[\!¬γ\]\\mathcal\{M\}^\{\\prime\}\\leftarrow\\mathcal\{M\}\_\{t\}\[\{\!\\lnot\\gamma\}\];

𝐴𝑡𝑡′\\mathit\{Att\}^\{\\prime\}updated per[Table˜8](https://arxiv.org/html/2605.14175#A5.T8)

25:

26:case

Expand\-Awareness\(p\)\\textsc\{Expand\-Awareness\}\(p\):

27:

𝒜i′←𝒜i∪\{p\}\\mathcal\{A\}^\{\\prime\}\_\{i\}\\leftarrow\\mathcal\{A\}\_\{i\}\\cup\\\{p\\\}; refine

WWon

pp\([Definition˜E\.3](https://arxiv.org/html/2605.14175#A5.Thmdefinition3)\)

28:

29:case

Resolve\(γ,𝑠𝑢𝑏𝑠𝑢𝑚𝑒𝑠\)\\textsc\{Resolve\}\(\\gamma,\\,\\mathit\{subsumes\}\):⊳\\trianglerightαγ\\alpha\_\{\\gamma\}existing;𝑠𝑢𝑏𝑠𝑢𝑚𝑒𝑠⊆𝐴𝑟𝑔𝑠t\\mathit\{subsumes\}\\subseteq\\mathit\{Args\}\_\{t\}optional

30:

ℳ′←ℳt\[\!γ\]\\mathcal\{M\}^\{\\prime\}\\leftarrow\\mathcal\{M\}\_\{t\}\[\{\!\\gamma\}\]if consensual, else commit

α𝑑𝑒𝑐𝑖𝑑𝑒\\alpha\_\{\\mathit\{decide\}\}with dissent recorded

31:for

β∈𝑠𝑢𝑏𝑠𝑢𝑚𝑒𝑠\\beta\\in\\mathit\{subsumes\}do⊳\\trianglerightschema fix \(§[4](https://arxiv.org/html/2605.14175#S4)\): recordγ\\gammain subsumed deps

32:

𝐷𝑒𝑝′\(β\)←𝐷𝑒𝑝t\(β\)∪\{γ\}\\mathit\{Dep\}^\{\\prime\}\(\\beta\)\\leftarrow\\mathit\{Dep\}\_\{t\}\(\\beta\)\\cup\\\{\\gamma\\\}
33:endfor

34:

35:case

Question\(χ\)\\textsc\{Question\}\(\\chi\):

36:

ℳ′←ℳt\\mathcal\{M\}^\{\\prime\}\\leftarrow\\mathcal\{M\}\_\{t\}; enqueue

\(𝐁i,χ\)\(\\mathbf\{B\}\_\{i\},\\chi\)on the abductive\-problem queue \([Definition˜E\.2](https://arxiv.org/html/2605.14175#A5.Thmdefinition2)\)

37:

38:end switch

39:return

𝒟t\+1←\(ℳ′,𝐴𝐹′,𝐶𝑚′,𝐷𝑒𝑝′\)\\mathcal\{D\}\_\{t\+1\}\\leftarrow\(\\mathcal\{M\}^\{\\prime\},\\mathit\{AF\}^\{\\prime\},\\mathit\{Cm\}^\{\\prime\},\\mathit\{Dep\}^\{\\prime\}\)

Composition invariant\.[Algorithm˜1](https://arxiv.org/html/2605.14175#alg1)embodies a typing discipline: a single classified utterance produces updates to all four components simultaneously, and those updates are mutually constrained by the speaker, the operation, and the DEL realisation\. Concretely, when a new argumentα\\alphaenters𝐴𝑟𝑔𝑠t\\mathit\{Args\}\_\{t\}viaObserveorHypothesize, the same operation determinesα\\alpha’s plausibility upgrade inℳ\\mathcal\{M\}, the commitment𝐶𝑚\(i\)∪\{α\}\\mathit\{Cm\}\(i\)\\cup\\\{\\alpha\\\}for the speakeriithat introduced it, and the dependency record𝐷𝑒𝑝\(α\)\\mathit\{Dep\}\(\\alpha\)of supports the speaker invoked\. The four components of𝒟t\+1\\mathcal\{D\}\_\{t\+1\}therefore cannot drift apart over a single turn\. Updates to𝐷𝑒𝑝t\\mathit\{Dep\}\_\{t\}are additive \(entries are never silently removed; an argument is only retracted when it is removed from𝐴𝑟𝑔𝑠t\\mathit\{Args\}\_\{t\}via belief revision\), so𝐴𝑓𝑓𝑒𝑐𝑡𝑒𝑑\(p\)=\{α∈S:p∈𝐷𝑒𝑝\(α\)\}\\mathit\{Affected\}\(p\)=\\\{\\alpha\\in S:p\\in\\mathit\{Dep\}\(\\alpha\)\\\}is well\-defined as a set on the post\-update structure, which is the precondition[Proposition˜2\.1](https://arxiv.org/html/2605.14175#S2.Thmproposition1)’s proof relies on\.

Schema fix onResolve\.The optional𝑠𝑢𝑏𝑠𝑢𝑚𝑒𝑠\\mathit\{subsumes\}argument and its for\-loop \([Algorithm˜1](https://arxiv.org/html/2605.14175#alg1),Resolvecase\) specify the schema extension discussed in[Section˜4](https://arxiv.org/html/2605.14175#S4): when one hypothesis subsumes another at resolution time \(e\.g\.,h4h\_\{4\}unifyingh1h\_\{1\}andh3h\_\{3\}at Phase 2 T13\), the cross\-hypothesis dependency edge is added atResolverather than at the subsumed hypothesis’s creation\. Without this branch, the edge is structurally unrecoverable through prompt engineering alone \(E1b,[Appendix˜I](https://arxiv.org/html/2605.14175#A9)\); empirical validation in[Appendix˜J](https://arxiv.org/html/2605.14175#A10)confirms3/33/3recovery on Phase 2 with zero false\-positive edges, plus a Phase\-3 negative sanity \(0/30/3when no unification structure is present\)\.

### Full epistemic\-operation table

[Table˜8](https://arxiv.org/html/2605.14175#A5.T8)gives the full form of the operations of[Section˜2](https://arxiv.org/html/2605.14175#S2): the precondition the engine checks before accepting a candidate classification, and the full\-detail DEL realisation for the two operations \(Undermine,Resolve\) that abbreviate in the main\-body table\. A failed precondition triggers a re\-prompt to the LLM Interpreter naming the failing condition\.

Table 8:Full epistemic\-operation table: DEL realisations in full detail, with preconditions\. Main\-body version \([Section˜2](https://arxiv.org/html/2605.14175#S2)\) drops the precondition column and abbreviates the multi\-case rules\.

## Appendix FThe Automation Challenge \(Full Analysis\)

We decompose the automation challenge into six levels, ordered from most structural \(hardest\) to most granular \(most tractable\)\.

Table 9:Six levels of automation required for deployment without human intervention\.Level 0\(model paradigm\) is the hardest: choosing the wrong paradigm is a categorical error\. Conversation type is often signalled in opening turns \(“P1 incident”→\\todiagnostic; “how should we design X?”→\\todeliberative\), but real conversations shift paradigms mid\-stream\.Level 1\(schema design\) is mitigated by default schemas per paradigm\.Level 2\(epistemic primitive\) requires distinguishing knowledge from belief from commitment; our taxonomy \(8 operations\) is coarser than standard DA taxonomies \(42\+ categories\)\.Level 3\(world/hypothesis space\) requires extracting hypotheses and detecting awareness expansion\.Level 4\(turn classification\) is our primary focus and the most tractable level\.Level 5\(update & consistency\) faces model drift, which the symbolic engine’s constraint checking can partially mitigate\.

#### Paths to automation\.

Path A\(human\-in\-the\-loop\), a human designs Levels 0–1; the LLM handles Levels 2–5\.Path B\(template\-based\), a library of templates with default schemas; the LLM classifies conversation type\.Path C\(full emergence\), the LLM constructs the entire model\.

#### Error composition\.

90% accuracy at each of 6 levels gives0\.96≈53%0\.9^\{6\}\\approx 53\\%end\-to\-end\. This motivates: \(1\) reducing LLM\-dependent levels \(Paths A/B:0\.94≈66%0\.9^\{4\}\\approx 66\\%\); \(2\) symbolic error correction; \(3\) prioritising accuracy on key epistemic shifts\. Path B paired with symbolic error correction is the configuration we recommend\.

## Appendix GLoCoMo: Detailed Results

LoCoMo is an external benchmark of entity\-relation recall over long multi\-session histories, distinct from the interactional grounding the verifier was originally designed for\. The body[Table˜4](https://arxiv.org/html/2605.14175#S4.T4)reports the headline three\-mode rendering ablation under GPT\-4o on three LoCoMo conversations \(conv\-26/30/41; 369–663 turns; 60 queries under the official protocol\)\. This appendix records the per\-category breakdown and the original Qwen2\.5\-7B\-Instruct cross\-model run for the historical record\. The published cached hybrid \(Qwen\-7B, dep\-map JSON only injection, 16K context window\) showedΔF1=−0\.10\\Delta\\mathrm\{F1\}\\\!=\\\!\-0\.10pooled, with degradation concentrated on temporal questions where engine\-state pulled the LLM toward answers like*“since we last chatted”*instead of the specific dates LoCoMo’s gold answers require\.[Table˜10](https://arxiv.org/html/2605.14175#A7.T10)reports that breakdown for the historical record\. The published Qwen\-7B failure mode is a special case of the rendering issue surfaced in the body: when the QA prompt contains symbol IDs without content \(and the truncated transcript also lacks the relevant facts\), the QA model has nothing concrete to ground on\. A failure\-case post\-hoc inspection on this Qwen\-7B configuration found0/600/60verifier strict\-correct items the LLM\-only baseline missed; the loss was9/609/60verifier strict\-losses \(F1<<0\.5 where baseline≥\\geq0\.5\), all in the multi\-hop and temporal categories, with6/96/9being verifier abstentions \(“No information available”\) and3/93/9vague relative\-time placeholders\. The content\-rendering fix \(hypothesis/observation content with per\-item session\-date attribution\) and the RAG\-retrieval fix \(top\-kkretrieval over engine items, freeing context budget\) together flip the headline; see[Table˜4](https://arxiv.org/html/2605.14175#S4.T4)\(body\) for the rendering\-mode ablation\.*Caveat:*all 60 queries triggered context truncation at the 16K window in the historical run; the content\+\+retrieval mode reduces engine block size from∼19\{\\sim\}19K to∼1\.5\{\\sim\}1\.5K tokens \(∼13×\{\\sim\}13\\\!\\times\) so transcript truncation is much less severe under that mode\. Per\-category F1 for the headline GPT\-4o content\+\+retrieval row \([Table˜4](https://arxiv.org/html/2605.14175#S4.T4)\): cat 1 multi\-hop \(n==23\)0\.4720\.472; cat 2 temporal \(n==30\)0\.4760\.476; cat 3 open\-domain \(n==5\)0\.1240\.124; cat 4 single\-hop \(n==2\)0\.3330\.333\.

Table 10:Historical Qwen\-7B\-Instruct LoCoMo run \(60 queries, dep\-map\-only rendering, 16K context\)\. This is the published configuration that produced the−0\.10\-0\.10pp number; under content\-bearing rendering with retrieval \([Table˜4](https://arxiv.org/html/2605.14175#S4.T4)in body, GPT\-4o\), the headline flips to\+0\.18\+0\.18pp\.
## Appendix HLongMemEval Knowledge\-Update: Detailed Results

We use LongMemEval Knowledge\-Update at theoraclesetting as an*envelope\-edge probe*\([Section˜4](https://arxiv.org/html/2605.14175#S4)\): the task is closer to interactional grounding than LoCoMo, single\-utterance updates with retraction structure, but the test mix is dominated by recall\-style items where the architectural signal lives in disagreement structure rather than headline accuracy\.

#### oraclesetting \(n=78n\\\!=\\\!78\)\.

On 78 LongMemEval\-KU items at theoraclesetting\[Wuet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib36)\], the verifier under the original dep\-map\+\+state\-summary rendering reached68/78=87\.2%68/78\\\!=\\\!87\.2\\%vs\. an LLM\-only long\-context baseline at69/78=88\.5%69/78\\\!=\\\!88\.5\\%\(Δ=−1\.3\\Delta\\\!=\\\!\-1\.3pp\)\. Under the content\+\+retrieval rendering of[Table˜4](https://arxiv.org/html/2605.14175#S4.T4)\(content\-bearing engine state with per\-item turn\-date attribution and top\-k=20k\\\!=\\\!20RAG retrieval\), the verifier reaches70/78=89\.7%70/78\\\!=\\\!89\.7\\%\(Δ=\+1\.3\\Delta\\\!=\\\!\+1\.3pp\), which is the first configuration where the verifier strictly exceeds the long\-context baseline on this benchmark\. Of the55verifier\-loss items in the original run, the content\+\+retrieval rendering recovers33\(a personal\-best 5K time “25:5025\{:\}50” superseding “27:1227\{:\}12”, United Airlines previous frequent\-flyer status “Premier Silver”, Hilton Honors free\-night redemption count “two”\); the remaining22are upstream of the rendering choice \(one is a content\-extraction failure where the new value was never extracted by the pipeline; the other is an entity\-matcher mismatch on a question asking about Shinjuku when the conversation only mentioned Harajuku\)\. The content\+\+retrieval result also corrects an earlier diagnostic: the original\-run rendering capped observations at the most\-recent2020, dropping the older facts from the prompt; the engine itself had extracted them\.*The architectural signal*that this benchmark was designed to surface, correct abstention via feature*\(b\)*, dominates the win mechanism\. Of the99original\-run disagreements,33of44verifier wins were correct abstentions on\_absitems \(questions whose stems explicitly admit “no information available”\) where the long\-context baseline confabulated an answer from related distractor sessions; the verifier returned “ungrounded” because noαc∈𝐴𝑟𝑔𝑠t\\alpha\_\{c\}\\in\\mathit\{Args\}\_\{t\}matched the asked\-about claim\. This is the LongMemEval analogue of the e2\_030 stale\-claim case from the Phase 2 \(debugging\) test set in[Table˜3](https://arxiv.org/html/2605.14175#S4.T3)\.

#### S\-distractor setting \(future work, with pilot results\)\.

The harderS\-distractor setting requires a session\-batched extraction protocol fitting long context; we report a 5\-item Protocol B pilot \(per\-turn classification ongpt\-4o\-mini, final QA ongpt\-4o\) for documentation\. The pilot found that the engine state helped0/50/5items and hurt2/52/5\(one “Rachel relocation” item where a distractor session pulled the engine to “Chicago” over the correct “the suburbs”; one “Wells Fargo pre\-approval” item where the engine surfaced a superseded $350K figure over the correct $400K\), and coverage\-stratified accuracy was non\-monotonic \(T1=1\.0T\_\{1\}\\\!=\\\!1\.0,T2=0\.0T\_\{2\}\\\!=\\\!0\.0,T3=1\.0T\_\{3\}\\\!=\\\!1\.0\), indicating that the per\-turn extraction did not reach a regime of stable Dep coverage on the protocol used\. Scaling to all 78Sitems at this protocol would cost∼38\{\\sim\}38h and∼$340\{\\sim\}\\mathdollar 340with no expected lift; we treatSas future work and document the design and feasibility analysis inS\_FEASIBILITY\.md\. The session\-batched protocol, where one long\-context interpreter call processes a whole session, is the natural next test \([Section˜4](https://arxiv.org/html/2605.14175#S4)envelope conjecture iii\)\.

## Appendix IE1b:depends\_onPrompt\-Schema Ablation Details

[Table˜11](https://arxiv.org/html/2605.14175#A9.T11)reports the full E1b prompt\-schema ablation referenced in[Section˜4\.2](https://arxiv.org/html/2605.14175#S4.SS2): fourdepends\_onprompt variants \(baseline, chain\-of\-thought, examples, self\-consistency\-5\) run through the verifier pipeline on Phase 2 with GPT\-4o\. Each deterministic variant is the median of33seeds; self\-consistency uses content\-aligned voting over55hot samples \(T=0\.7T\\\!=\\\!0\.7\) and a single seed\. Scoring is against*creation\-time*GT, the upper bound any creation\-time prompt can reach: the additional dependency edge added whenh4h\_\{4\}subsumesh1h\_\{1\}at T13 is set at theResolvestep, not ath4h\_\{4\}’s creation, and cannot be recovered by prompt engineering at hypothesis\-creation time\. The body finding \(h1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}recovered0/100/10across all 10 E1b runs, motivating theResolve\-case schema fix in[Algorithm˜1](https://arxiv.org/html/2605.14175#alg1)\) is summarised in[Section˜4\.2](https://arxiv.org/html/2605.14175#S4.SS2)\.

Table 11:E1b:depends\_onprompt\-schema ablation on Phase 2 \(GPT\-4o; creation\-time GT; medians,33seeds†\)\. Bold: per\-column best\.†Across all 10 E1b runs the post\-T13 linkh1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}was recovered0/100/10\. Theh4→h3h\_\{4\}\\\!\\to\\\!h\_\{3\}link \(set ath4h\_\{4\}’s creation\) was recovered1/31/3for*baseline*,3/33/3for*cot*and*examples*, and1/11/1for self\-consistency\.‡Content\-aligned voting over 5 hot samples \(T=0\.7T\\\!=\\\!0\.7\) before majority vote on\(𝑔𝑡\_ℎ𝑦𝑝,𝑔𝑡\_𝑑𝑒𝑝\)\(\\mathit\{gt\\\_hyp\},\\mathit\{gt\\\_dep\}\)tuples; single seed\.

## Appendix JE1c: Resolve\-Stage Retrospective Dependency\-Update Probe

E1b \([Appendix˜I](https://arxiv.org/html/2605.14175#A9)\) establishes that prompt\-engineering at hypothesis\-creation time cannot recover the post\-T13 unification edgeh1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}\(0/100/10across44prompt\-schema variants×\\timesseeds\): the additional dependency edge added whenh4h\_\{4\}subsumesh1h\_\{1\}at T13 is set at theResolvestep, not ath4h\_\{4\}’s creation, and is therefore structurally outside the reach of any creation\-time prompt\.[Section˜4\.2](https://arxiv.org/html/2605.14175#S4.SS2)forward\-points to the algebraic fix, extendingResolveto update existing𝐷𝑒𝑝\\mathit\{Dep\}tuples \([Algorithm˜1](https://arxiv.org/html/2605.14175#alg1),Resolvecase\)\. E1c probes whether this fix actually recovers the edge in practice\.

#### Protocol\.

On each turn whose classification contains aResolveop, the probe issues an additional GPT\-4o call asking for genericdependency\_updatesover already\-existing hypotheses \(additions or removals to theirdepends\_ontuples\)\. The proposed updates are applied to a copy of the pipeline’s finalengine\.dependencies; the canonical pipeline run is not modified\. The probe’s system prompt uses a single non\-Phase\-2 example \(an API\-style decision scenario\) to illustrate JSON format and explicitly tells the model that empty updates are common and acceptable; no Phase\-2\-specific hints\. Three seeds,T=0T\\\!=\\\!0\. Scoring uses the same content\-aligned matchers \(eval\_dep\_extraction\.match\_llm\_to\_gt/match\_llm\_obs\_to\_gt\) as[Table˜2](https://arxiv.org/html/2605.14175#S4.T2)\.

Table 12:E1c: Resolve\-stage retrospective dep\-update probe on Phase 2 \(GPT\-4o; post\-T13 GT;n=3n\\\!=\\\!3seeds; medians\)\. Bold: per\-column best\.Across all 3 seeds, the probe emits exactly one update at T13 with rationales of the form “Redis exhaustion, whichh4h\_\{4\}depends on, is caused by the retry storm from the token bug”; after content alignment to canonical IDs, all three updates collapse to the same canonical\(h1,h4\)\(h\_\{1\},h\_\{4\}\)edge\.*Zero new false\-positive edges*are introduced\. The intervention is structurally distinct from the prompt\-only ablation in[Table˜11](https://arxiv.org/html/2605.14175#A9.T11): where E1b modifies the hypothesis\-creation prompt, E1c adds a separate per\-Resolvecall asking for retrospective updates over already\-existing hypotheses\.

#### Phase\-3 negative sanity\.

The same probe was run on Phase 3 \(architecture\-deliberation,1919turns, singleResolveat T18; no canonical retrospective unification edge expected\)\. Across33seeds the probe produces0/30/3non\-empty updates and0edges added; the LLM’s rationale on every Resolve event is “no retrospective dependency updates warranted; the Resolve simply elevated existing reasoning to accepted conclusion\.” Combined with the Phase\-2 result, the probe is*conversation\-sensitive*: it fires when a unification structure is present \(Phase 2’s causal\-reversal at T12–T13\) and stays silent when no such structure exists \(Phase 3’s deliberation\-and\-decide pattern\)\. The schema\-fix has empirical support both for*recovery on the load\-bearing case*and for*non\-disturbance on the negative case*\.

#### Caveats\.

Two\-conversation evaluation only \(one positive case \+ one negative case\); GPT\-4o only;n=3n\\\!=\\\!3seeds \(Wilson95%95\\%CI on the unanimous3/33/3outcome is\[0\.44,1\.00\]\[0\.44,1\.00\]\); a single Resolve\-stage prompt design \(the explicit “empty list is common” clause discourages hallucination; more aggressive prompts may trade precision for recall\)\. Cross\-model robustness and prompt\-sensitivity sub\-ablations are future work\. Full per\-seed traces \(engine snapshots, raw LLM responses, parsed updates, per\-seed scores\) live inexperiments/results/e1c\_resolve\_update\_probe/and\.\.\.\_phase3/\.

## Appendix KE5: Robustness Diagnostic Figure

[Figure˜4](https://arxiv.org/html/2605.14175#A11.F4)accompanies the*Robustness to extraction noise*paragraph in[Section˜4\.2](https://arxiv.org/html/2605.14175#S4.SS2)\. The figure is placed in the appendix to keep the main body within the page budget; the surrounding paragraph in[Section˜4\.2](https://arxiv.org/html/2605.14175#S4.SS2)reports the headline numbers\.

![Refer to caption](https://arxiv.org/html/2605.14175v1/x2.png)Figure 4:E5 diagnostic robustness curves on the 50\-item E2 test set,ε∈\[0,0\.8\]\\varepsilon\\in\[0,0\.8\]at 10 seeds per cell\.\(A\)Pooled verifier accuracy under three independent noise models on the GT Phase 2 state: random dependency\-edge drop \(gray\), random edge add overH×\(O∪H\)H\\times\(O\\cup H\)\(blue\), and lifecycle/status corruption \(orange\)\. Lines are seed\-medians; bands are 25th–75th percentile\. References: LLM\-only pooled\-50 baseline \(0\.980\.98, dashed\) and stale\+counterfactual headline \(0\.960\.96, dotted\)\. The flat edge\-drop / edge\-add curves indicate the current E2 set is structurally insensitive to dependency\-edge traversal \(88%88\\%of items decide beforewalk\_deps\); they are not a standalone robustness success claim\.\(B\)Per\-category accuracy under lifecycle/status corruption: stale claims \(the architectural\-win category in[Table˜3](https://arxiv.org/html/2605.14175#S4.T3)\) are the most status\-sensitive; cross\-conv and counterfactual decide via null\-resolution and are immune at everyε\\varepsilon\.*No empirical operating\-point dot is plotted*: observed status errors on1111GPT\-4o Phase 2 pipeline runs \(2/44=4\.5%2/44=4\.5\\%aggregate, concentrated onh2h\_\{2\}’s abandonment\) are non\-IID across hypotheses, so a single\-ε\\varepsilonpoint on an IID curve would imply uniform noise we do not observe; the empirical rate is reported in[Section˜4\.2](https://arxiv.org/html/2605.14175#S4.SS2)\.
## Appendix LCross\-Model Robustness Check \(Qwen2\.5\-7B\-Instruct\)

We replicate E1, E1b, and E2 againstQwen/Qwen2\.5\-7B\-Instructserved locally via vLLM \(max\_model\_len=32,768=32\{,\}768, native context, no rope\-scaling\) as a cross\-model robustness check on the structural claims behind the GPT\-4o headline\. Qwen results are supporting evidence,*not*replacements for GPT\-4o numbers in the main body: we do not add Qwen rows to[Table˜2](https://arxiv.org/html/2605.14175#S4.T2)or[Table˜3](https://arxiv.org/html/2605.14175#S4.T3)because the load\-bearing claims \(verifier ceiling on stale\+counterfactual; schema\-deeph1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}ceiling\) are model\-agnostic, and the secondary metrics drop with model capacity in a way that would optically widen the verifier\-vs\-baseline gap for baseline\-weakening rather than architectural reasons\.

#### Headline\.

Both load\-bearing structural claims replicate cross\-model\. Direct LLM prompting fails to recoverh1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}on Qwen \(0/40/4across E1 variants\), as on GPT\-4o \(0/40/4\); the prompt\-schema ablation in the verifier pipeline likewise leavesh1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}unrecovered on Qwen \(0/100/10\), identical to GPT\-4o’s0/100/10in[Table˜2](https://arxiv.org/html/2605.14175#S4.T2)\. On the 50\-item Phase 2 grounding test set,𝖵𝖾𝗋𝗂𝖿𝗒\\mathsf\{Verify\}reaches25/25=100%25/25\\\!=\\\!100\\%on the stale\+counterfactual pooled subset under both models, matching[Table˜3](https://arxiv.org/html/2605.14175#S4.T3)\. On LongMemEval\-KUoracle\(78 items;[Appendix˜H](https://arxiv.org/html/2605.14175#A8)\), under the published hybrid rendering both models are near\-tie with the LLM\-only baseline: GPT\-4oΔ=−1\.3\\Delta\\\!=\\\!\-1\.3pp, QwenΔ=\+1\.3\\Delta\\\!=\\\!\+1\.3pp\. With the Phase 2 \(content\+\+retrieval\) rendering applied \(GPT\-4o; Qwen Phase 2 not yet run\), GPT\-4o reaches70/7870/78vs\. baseline69/7869/78, also\+1\.3\+1\.3pp, putting both models in agreement at\+1\.3\+1\.3pp on the same benchmark\.

#### E1\-Qwen\.

Direct LLM\-prompted dependency extraction over the four prompt variants from[Table˜2](https://arxiv.org/html/2605.14175#S4.T2), scored against post\-T13 GT \(averaged over the 4 hypotheses\)\. Wall:∼\\sim90 s for 36 LLM calls;0/360/36parse failures\(every call returned parsabledepends\_on\_turnsJSON\)\. No variant recoversh1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}\. The pattern that self\-consistency\-CoT is*worse*than zero\-shot, observed for GPT\-4o in[Table˜2](https://arxiv.org/html/2605.14175#S4.T2)\(F1=0\.28\\mathrm\{F1\}\\\!=\\\!0\.28vs0\.310\.31\), also replicates on Qwen \(0\.120\.12vs0\.170\.17\)\.

Table 13:E1\-Qwen: direct LLM\-prompted dependency extraction\. Qwen\-7B columns are new; GPT\-4o columns reproduced from[Table˜2](https://arxiv.org/html/2605.14175#S4.T2)for direct comparison\.h1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}stays unrecovered on both models for every variant\.
#### E1b\-Qwen\.

Prompt\-schema ablation under the verifier pipeline, matched to the GPT\-4o E1b protocol \(3 deterministic seeds per*baseline*,*cot*,*examples*; 1 self\-consistency seed with content\-aligned voting over 5 hot samples\)\. Wall:∼\\sim5\.5 min;0/1170/117deterministic turn\-classify failures,0/50/5self\-consistency sample failures\. Every run produced≥4\\geq\\\!4hypotheses\. Across all 10 runsh1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}is recovered0/100/10, identical to GPT\-4o\. The within\-pipeline linkh4→h3h\_\{4\}\\\!\\to\\\!h\_\{3\}recovers2/102/10on Qwen vs8/108/10on GPT\-4o; we read this as model\-capacity drop on cross\-hypothesis link recovery, not a refutation of the GPT\-4o pattern\. The Qwen self\-consistency run produced an empty canonical dependency map because per\-sample raw hypothesis IDs vary too much across the 5 hot samples for content\-aligned voting to reach≥3\\geq\\\!3votes on any canonical edge; the voter logic is correct, the cause is Qwen’s run\-to\-run instability on hypothesis identity\.

Table 14:E1b\-Qwen: prompt\-schema ablation, post\-T13 GT\. Median F1 with\[min,max\]\[\\min,\\max\]in brackets;h1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}andh4→h3h\_\{4\}\\\!\\to\\\!h\_\{3\}as fraction of runs that recovered the link\. GPT\-4o medians reproduced from[Table˜2](https://arxiv.org/html/2605.14175#S4.T2)for comparison\.h1→h4h\_\{1\}\\\!\\to\\\!h\_\{4\}stays0/100/10on both models\.
#### E2\-Qwen\.

Direct verifier evaluation on the same 50\-item Phase 2 test set used by GPT\-4o \(same labels, same Cohen’sκ=0\.733\\kappa\\\!=\\\!0\.733from the 20\-item independent overlap; the verifier and baseline both run with Qwen\-7B as the LLM\)\. Wall:∼\\sim2\.5 min;0/500/50baseline call failures and0/500/50verifier\-pipeline turn\-classify failures\. The verifier reaches25/25=100%25/25\\\!=\\\!100\\%on the stale\+counterfactual pooled subset, matching GPT\-4o; the baseline drops from24/25=96%24/25\\\!=\\\!96\\%on GPT\-4o to23/25=92%23/25\\\!=\\\!92\\%on Qwen, attributable to two real Qwen baseline misses on stale items \(e2\_025 and e2\_030, both att=13t\\\!=\\\!13\) where Qwen judged a candidategroundedthat GPT\-4o judgedungrounded\. The verifier’s single loss \(e2\_015\) is the same documentedasserts\_id=∅=\\\!\\varnothingmulti\-entity meta\-reasoning case as in[Table˜3](https://arxiv.org/html/2605.14175#S4.T3)\.

Table 15:E2\-Qwen: per\-category accuracy of𝖵𝖾𝗋𝗂𝖿𝗒\(c,𝒟t\)\\mathsf\{Verify\}\(c,\\mathcal\{D\}\_\{t\}\)vs\. LLM\-only baseline on the 50\-item Phase 2 test set, with the sameκ=0\.733\\kappa\\\!=\\\!0\.733as the GPT\-4o run in[Table˜3](https://arxiv.org/html/2605.14175#S4.T3)\. The verifier ceiling at25/25=100%25/25\\\!=\\\!100\\%pooled is identical to GPT\-4o; the widerΔ\\Deltacomes from a−4\-4pp drop in the baseline \(from96%96\\%to92%92\\%pooled\), not from the verifier improving\. Numbers reflect the patched baseline parser \(see implementation note below\)\.The Qwen pooledΔ=\+8\.0\\Delta\\\!=\\\!\+8\.0pp stays below the≥10\\geq\\\!10pp Outcome\-1 threshold ofEXPERIMENTS\.mdand remains in the same near\-ceiling\-tie regime as the GPT\-4oΔ=\+4\.0\\Delta\\\!=\\\!\+4\.0pp; the headline framing in[Table˜3](https://arxiv.org/html/2605.14175#S4.T3)\(Outcome 4, baseline near96%96\\%pooled, ceiling\-boundedΔ\\Delta\) is the right framing for both models\.*We do not repackage the wider QwenΔ\\Deltaas a stronger architectural result\.*The verifier ceiling at100%100\\%on stale is already attained on GPT\-4o, the\+13\.3\+13\.3pp stale\-only gap on Qwen is the baseline\-only side, and Qwen\-7B is a weaker LLM than GPT\-4o by independent measurement\. The canonical e2\_030 stale\-claim case \([Section˜4](https://arxiv.org/html/2605.14175#S4)\) is one of the two real Qwen baseline misses on stale and is also the single GPT\-4o baseline miss on stale: it replicates cross\-model\.

#### Implementation note: baseline parser fix\.

The original baseline\-judgement parser inverify\_experiment\.pychecked"GROUNDED" in resp\.upper\(\)\.split\(\), which token\-matches but does not handle responses where the keyword is followed by free\-form explanation\. Qwen\-7B often appends explanation after the keyword \(e\.g\. a response beginningUNGROUNDEDand continuing “…is not directly grounded in the conversation”\); the literal substring*grounded*tokenises toGROUNDEDand silently flipped 4 of Qwen’s correctUngroundedverdicts togrounded\. We replaced the parser with astartswith\-aware first\-token form that prioritises the first explicit label\. Re\-parsing the saved GPT\-4obaseline\_raw\_responsefields under the patched parser yields0/500/50disagreements with the originally\-reported labels:[Table˜3](https://arxiv.org/html/2605.14175#S4.T3)is unchanged\.[Table˜15](https://arxiv.org/html/2605.14175#A12.T15)numbers are patched\-parser numbers; no GPT\-4o re\-run was needed\.

#### E3\-Qwen oracle \(full,n=78n\\\!=\\\!78\)\.

Direct replication of the GPT\-4o E3 oracle hybrid run on the same 78 LongMemEval\-KU items used in[Appendix˜H](https://arxiv.org/html/2605.14175#A8), with Qwen\-7B handling both per\-turn classification and final QA\.*Initial\-feasibility correction:*the previously\-cited∼125\{\\sim\}125k\-token figure was for the LongMemEvalSsetting; the oracle haystacks tokenise at min/median/max=4,034/6,065/9,225=\\\!4\{,\}034/6\{,\}065/9\{,\}225tokens under the Qwen tokenizer, so every oracle item fits Qwen’s native3232k window with∼22\{\\sim\}22k of headroom even at the maximum\. The full run therefore used the existing native\-3232k vLLM service without YaRN\. Wall:6464minutes for1,9871\{,\}987LLM calls \(78 baselines \+ 1,909 verifier classification \+ QA\),0local cost,∼$1\.50\{\\sim\}\\mathdollar 1\.50in GPT\-4o judge calls\.0/780/78empty QA outputs, no silent classification failures observed;156/156156/156evidence sessions present in historyacross all 78 items \(no truncation; max history43,20143\{,\}201chars\)\.

Table 16:E3\-Qwen oracle full vs\. the GPT\-4o E3 oracle run from[Appendix˜H](https://arxiv.org/html/2605.14175#A8), on the same 78 KU items, dataset, and GPT\-4o judge\. The Qwen run uses the dep\-map\+\+state\-summary rendering of its time; the GPT\-4o row reported here is also the dep\-map\+\+state\-summary configuration, not Phase 2 \([Appendix˜H](https://arxiv.org/html/2605.14175#A8)\)\. Under that configuration both rows are within\|Δ\|=1\.3\|\\Delta\|\\\!=\\\!1\.3pp of baseline\.The qualitative\_absabstention pattern that anchors the GPT\-4o E3 paragraph in[Appendix˜H](https://arxiv.org/html/2605.14175#A8)\(33of44verifier wins among99disagreements being correct abstentions on\_absitems\)does not cleanly replicate on Qwen: only11of the77Qwen verifier wins is on an\_absitem \(the verifier correctly returns “the user sees Dr\. Smith, not Dr\. Johnson” on2698e78f\_abs, while the baseline confabulates “you see Dr\. Smith every week”\), and the verifier and baseline tie at3/63/6on the\_abssubset as a whole\. The remaining66Qwen verifier wins are mostly numerical or temporal\-supersession items where the engine summary surfaces the resolved value over an older one \(5K\-run personal best updated to25:5025\{:\}50, postcards added updated to2525, Rachel’s company updated to TechCorp\); this is a different mechanism from the abstention pattern, and one that depends on the per\-turn extraction picking up the right proposition\. The qualitative abstention claim should therefore stay a GPT\-4o\-specific observation in[Appendix˜H](https://arxiv.org/html/2605.14175#A8)and not be lifted to a cross\-model claim\.

The*engine\-as\-distractor*failure mode documented inS\_FEASIBILITY\.mdreplicates here:44of the66Qwen baseline wins are numerical\-supersession items where extraction picks up an earlier value \(Wells Fargo $350350k vs the correct $400400k, gym time7:007\{:\}00pm vs the correct6:006\{:\}00pm,1,2501\{,\}250vs1,3001\{,\}300Instagram followers, three vs five camera trips\) before the supersession reaches the engine, and the engine summary then anchors the QA call on the superseded figure\.dep\_coverageisdegenerate at oracleon Qwen as it is on GPT\-4o: every one of the7878items hitscov\_mean=1\.0=1\.0, so the conditional\-on\-coverage stratification \(T1/T2/T3 tertiles\) is vacuous;*no coverage\-stratified claim is made for either model*\.

#### E3\-S\-Qwen not run\.

Qwen2\.5\-7B’s native32,76832\{,\}768\-token window is below the S\-distractor median history of∼107\\sim\\\!107k tokens \([Appendix˜H](https://arxiv.org/html/2605.14175#A8)\); YaRN rope\-scaling reaches∼131\\sim\\\!131k but degrades long\-distance retrieval, and swapping to a different model \(e\.g\. Qwen3\) for S alone would conflate model identity with context\-length capacity, nullifying the cross\-model robustness claim\. We therefore treat E3\-S\-Qwen as out of scope; the GPT\-4o S\-pilot inS\_FEASIBILITY\.mdremains the only S evidence\.

## Appendix MExperiment Scripts

Environment\.Python 3\.10\+\.pip install \-r requirements\.txt\(requests,openai,PyYAML,nltk,numpy,matplotlib\); one\-off NLTK corpora viapython \-c "import nltk; nltk\.download\(’punkt\_tab’\); nltk\.download\(’stopwords’\)"\. Closed\-model calls requireANTHROPIC\_API\_KEY\(Claude Sonnet 4\) andOPENAI\_API\_KEY\(GPT\-4o\); open\-weight runs require vLLM 0\.19\+ on a single≥24\\geq\\\!24GB GPU\. LoCoMo\[Maharanaet al\.,[2024](https://arxiv.org/html/2605.14175#bib.bib34)\]and LongMemEval\[Wuet al\.,[2025](https://arxiv.org/html/2605.14175#bib.bib36)\]are publicly released by their original authors and not redistributed; expected layout under the user\-set$DATASETS\_CACHEislocomo/data/locomo10\.jsonandlongmemeval/longmemeval\_oracle\.json\.

Reproduction\.[Table˜17](https://arxiv.org/html/2605.14175#A13.T17)maps each paper element \(named in shorthand\) to its runner script and committed output\. Cross\-references for the named elements are: cross\-model F1[Table˜1](https://arxiv.org/html/2605.14175#S4.T1); end\-to\-end[Table˜2](https://arxiv.org/html/2605.14175#S4.T2); E2 verifier[Table˜3](https://arxiv.org/html/2605.14175#S4.T3); LoCoMo[Table˜4](https://arxiv.org/html/2605.14175#S4.T4); LongMemEval\-KU[Section˜4\.2](https://arxiv.org/html/2605.14175#S4.SS2); E5 robustness[Figure˜4](https://arxiv.org/html/2605.14175#A11.F4); latency scaling[Figure˜2](https://arxiv.org/html/2605.14175#S4.F2); E1b ablation[Appendix˜I](https://arxiv.org/html/2605.14175#A9); E1c probe[Appendix˜J](https://arxiv.org/html/2605.14175#A10); Qwen robustness[Appendix˜L](https://arxiv.org/html/2605.14175#A12)\. The reference symbolic enginesymbolic\_engine\.pyis fully deterministic and runs without an API key, reproducing the*GT deps*row of[Table˜2](https://arxiv.org/html/2605.14175#S4.T2)byte\-for\-byte;symbolic\_engine\.jsxprovides an interactive browser demo with step\-through and a counterfactual panel\.docs/EXPERIMENTS\.mdis the operational guide for the experimental program;experiments/results/e3\_lme/S\_FEASIBILITY\.mddocuments the LongMemEval\-S downgrade rationale;experiments/results/qwen\_robustness\_summary\.mdconsolidates the Qwen2\.5\-7B cross\-model replication\. Headline aggregated result JSONs cited in the body \(e\.g\.kappa\_agreement\.json,e2\_verify\_results\.json,lme\_ku\_oracle\_\*\_full\.jsonl\) are committed underexperiments/results/as evidence; raw per\-run logs are not redistributed and can be regenerated by re\-running the runners\.

Table 17:Paper element→\\torunner script and committed output\. Default locations: runners inexperiments/scripts/, outputs inexperiments/results/\. Bare filenames in the table refer to scripts at the repo root or underexperiments/e5\_robustness/\.
Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations

Similar Articles

Logic-Regularized Verifier Elicits Reasoning from LLMs

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

Introducing a companionship framework that turns your LLM into an engaging companion for very long conversations

Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement

Submit Feedback

Similar Articles

Logic-Regularized Verifier Elicits Reasoning from LLMs
LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs
Introducing a companionship framework that turns your LLM into an engaging companion for very long conversations
Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems
Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement