Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

arXiv cs.LG 06/17/26, 04:00 AM Papers
multi-agent-systems large-language-models concurrency-anomalies formal-verification tla-plus verus rust
Summary
This paper formalizes four concurrency anomalies in multi-agent LLM systems, mechanically verifies a consistency hierarchy, and provides verified Rust runtimes with bounded prevention costs, including a fix for ByteDance's deer-flow and tool-effect reordering in LangGraph.
arXiv:2606.17182v1 Announce Type: new Abstract: Multi-agent LLM systems share state through memory stores, vector indices, and tool registries. We model such sharing as long-running read-generate-write operations under deterministic-generation semantics -- the regime durable-execution engines enforce by deterministic replay -- and formalize four concurrency anomalies in TLA+: stale-generation, phantom-tool, causal-cascade, and tool-effect reordering, structural analogues of classical isolation anomalies, each with a TLC counter-example. The exclusion lattice over these anomalies is trivial; the contribution is the mechanically verified realizability and strict separation of one maximal chain within it, $L_0 \subsetneq \cdots \subsetneq L_4$, to our knowledge the first machine-checked consistency hierarchy for such runtimes. A development of 274 Verus obligations (zero assume, zero admit; trust base: two structural axioms and a mutex correspondence) proves the detectors sound and complete against the specifications and each runtime its avoidance set. Three deployed Rust runtimes realize L0-L1 (pessimistic locking, serializable snapshot isolation, default-SI), each verified against stale-generation and refined to its state machine; L2-L4 are exec-mode-verified with dependency-free prevention twins (A3, A6, A2: 0/1000 versus 1000/1000), and L2 is run live across three model families (A3 prevented in all 120 retracted sessions). We reproduce a silent lost update in ByteDance's deer-flow, formalizing its fix as a verified $L_0 \to L_1$ refinement, and exhibit tool-effect reordering in LangGraph's ToolNode on unmodified output, removed by an L3 commit-order sequencer. The verified detector, refinements, and realizability artifacts are the contribution; the phenomena and lattice are classical.
Original Article
View Cached Full Text
Cached at: 06/17/26, 05:36 AM
# Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems
Source: [https://arxiv.org/html/2606.17182](https://arxiv.org/html/2606.17182)
Sajjad KhanS\. Khan is an independent researcher \(e\-mail: sajjadanwar200@gmail\.com\)\.The online appendix referenced throughout \(its §§ A–F and supplementary Tables S1–S6\), together with the verification and experiment artifacts—the TLA\+sources, the Verus crateverus\-detector, the reference runtimemac\-consistency\-runtime, and the Python harnesses—are provided as ancillary files accompanying this arXiv submission\.

###### Abstract

Multi\-agent LLM systems share state through memory stores, vector indices, and tool registries\. We model such sharing as long\-running read–generate–write operations under deterministic\-generation semantics—the regime durable\-execution engines enforce by deterministic replay—and formalize four concurrency anomalies in TLA\+: stale\-generation, phantom\-tool, causal\-cascade, and tool\-effect reordering, structural analogues of classical isolation anomalies, each with a TLC counter\-example\. The exclusion lattice over these anomalies is trivial; the contribution is the mechanically verified*realizability and strict separation of one maximal chain*within it,L0⊊⋯⊊L4L\_\{0\}\\subsetneq\\cdots\\subsetneq L\_\{4\}—to our knowledge the first machine\-checked consistency hierarchy for such runtimes\. A development of 274 Verus obligations \(zeroassume, zeroadmit; trust base: two structural axioms and a mutex correspondence\) proves the detectors sound and complete against the specifications and each runtime its avoidance set\. Three deployed Rust runtimes realizeL0L\_\{0\}–L1L\_\{1\}\(pessimistic locking, serializable snapshot isolation, default\-SI\), each verified against stale\-generation, refined to its state machine;L2L\_\{2\}–L4L\_\{4\}are exec\-mode\-verified with dependency\-free prevention twins \(A3A\_\{3\},A6A\_\{6\},A2A\_\{2\}:0/10000/1000versus1000/10001000/1000\), andL2L\_\{2\}is additionally run live across three model families \(A3A\_\{3\}prevented in all120120retracted sessions\);L3L\_\{3\}/L4L\_\{4\}remain exec\-verified\. Prevention costs are bounded, not zero: snapshot isolation adds∼8%\{\\sim\}8\\%tokens on one workload, pessimistic locking1\.61\.6–2\.3×2\.3\\times—not the order\-of\-magnitude penalty commonly assumed\. We reproduce a silent lost update in ByteDance’s deer\-flow, formalizing its fix as a verifiedL0→L1L\_\{0\}\\rightarrow L\_\{1\}refinement, and exhibit tool\-effect reordering in LangGraph’sToolNodeon unmodified output, removed by anL3L\_\{3\}commit\-order sequencer\. The verified detector, refinements, and realizability artifacts are the contribution; the phenomena and lattice are classical\.

###### Index Terms:

Multi\-agent systems, memory consistency, isolation levels, formal methods, TLA\+, Verus, large language models\.

## 1Introduction

### 1\.1A constructed failure within a documented pattern

We construct a scenario within the travel\-booking workflow used to motivate the SagaLLM\[[1](https://arxiv.org/html/2606.17182#bib.bib1)\]architecture: a multi\-agent system in which one agent reserves a flight while another reserves a hotel, both consulting and updating a shared trip\-state record\. Suppose the flight\-booking agent reads “trip date==14 June” from shared state, begins a generation phase of several seconds in which it drafts the reservation request, and during that phase the user \(acting through a third agent\) updates the trip date to 21 June\. When the flight agent commits, it submits the original date, producing a booking that contradicts the system’s now\-current state\. No fault has occurred at any layer of the runtime\. Yet the system has produced an external effect that no current state justifies\. SagaLLM’s compensating\-transactions architecture is one possible response to this failure mode; Atomix’s progress\-gated tool calls\[[2](https://arxiv.org/html/2606.17182#bib.bib2)\]are another\. Neither names the phenomenon, locates it within a stratified design space of consistency guarantees, or formally proves the conditions under which a runtime prevents it\.

The scenario above is a constructed instantiation of an architectural pattern documented in published work, not a trace from a deployed production system; locating production traces of concurrency\-induced inconsistency is itself a research question we return to in[Section˜6\.2](https://arxiv.org/html/2606.17182#S6.SS2)\. A deployed instance of the lower\-level lost\-update anomaly that motivates the lattice’s base is, by contrast, not constructed:[Section˜6\.5](https://arxiv.org/html/2606.17182#S6.SS5)reproduces a live silent lost update from an open issue in ByteDance’s deer\-flow—a widely\-used agent application—from the project’s own regression test, and formalizes its fix as a verifiedL0→L1L\_\{0\}\\rightarrow L\_\{1\}refinement\. The pattern, however, is one of five concurrency anomalies we catalog in[Section˜3](https://arxiv.org/html/2606.17182#S3), all specific to the operational regime we propose and none addressed by existing consistency theory\.

### 1\.2Why existing theory does not directly operationalize these phenomena

Three traditions of consistency theory bear on this problem; none applies directly\. Hardware models \(sequential consistency\[[3](https://arxiv.org/html/2606.17182#bib.bib3)\], total store order, release consistency\[[4](https://arxiv.org/html/2606.17182#bib.bib4)\], ARMv8\[[5](https://arxiv.org/html/2606.17182#bib.bib5)\], linearizability\[[6](https://arxiv.org/html/2606.17182#bib.bib6)\]\) presuppose bounded operation latency and statically\-determined read sets; long\-running agent operations have neither\. Database isolation theory\[[7](https://arxiv.org/html/2606.17182#bib.bib7),[8](https://arxiv.org/html/2606.17182#bib.bib8),[9](https://arxiv.org/html/2606.17182#bib.bib9)\]adopts the anomaly\-by\-anomaly characterization we follow here, but assumes the read set is held under lock or snapshot for the transaction’s lifetime\. Distributed\-systems consistency\[[10](https://arxiv.org/html/2606.17182#bib.bib10),[11](https://arxiv.org/html/2606.17182#bib.bib11),[12](https://arxiv.org/html/2606.17182#bib.bib12),[13](https://arxiv.org/html/2606.17182#bib.bib13),[14](https://arxiv.org/html/2606.17182#bib.bib14)\]addresses replicated state and message\-passing actors\[[15](https://arxiv.org/html/2606.17182#bib.bib15)\]but treats operations as discrete events with summarizable signatures, not as long\-running, non\-deterministic processes\. The PACELC formulation\[[16](https://arxiv.org/html/2606.17182#bib.bib16)\]refines CAP with a latency/consistency tradeoff that is conceptually adjacent to our lattice but does not treat the long\-generation phase as a first\-class object\. The canonical taxonomy of non\-transactional consistency models\[[17](https://arxiv.org/html/2606.17182#bib.bib17)\]provides the closest existing framework against which our lattice must be situated; its levels are defined for replicated stores under network propagation, and do not transfer directly to the operational regime we study\.[Section˜7](https://arxiv.org/html/2606.17182#S7)expands on each of these comparisons\.

Three recent runtimes address concurrency concerns in agent deployments\. Atomix\[[2](https://arxiv.org/html/2606.17182#bib.bib2)\]provides progress\-aware transactional semantics for tool calls; SagaLLM\[[1](https://arxiv.org/html/2606.17182#bib.bib1)\]uses saga\-style compensation; CodeCRDT\[[18](https://arxiv.org/html/2606.17182#bib.bib18)\]uses CRDTs for strong eventual consistency in parallel code generation\. Each names one consistency point and shows how to achieve it; none locates its guarantees within a stratified design space\. We provide such placements informally in[Section˜6\.4](https://arxiv.org/html/2606.17182#S6.SS4)\.

### 1\.3The gap and the contribution

Existing characterizations of multi\-agent LLM failures, including the recent NeurIPS taxonomy\[[19](https://arxiv.org/html/2606.17182#bib.bib19)\], document a different failure class: cognitive and coordination failures such as step repetition, reasoning\-action mismatch, and inter\-agent misalignment\. The concurrency failures we describe are largely absent from such taxonomies because the present generation of agent benchmarks does not stress\-test inter\-agent shared state under contention\. Classical isolation hierarchies—ANSI/SQL isolation\[[7](https://arxiv.org/html/2606.17182#bib.bib7)\], Adya’s generalized model\[[8](https://arxiv.org/html/2606.17182#bib.bib8)\], the Bailis/Hellerstein survey of weak consistency\[[20](https://arxiv.org/html/2606.17182#bib.bib20)\], and serializable snapshot isolation\[[21](https://arxiv.org/html/2606.17182#bib.bib21)\]—characterize a closely related design space for database transactions but do not directly address the long\-latency inference phase that distinguishes the multi\-agent LLM regime\. We do not claim a fundamental new theory, but rather a port of these classical characterizations into the new operational regime, with mechanically\-verified detector, safety, and refinement artifacts to support the port\. This paper supplies the port; how heavily it weighs against the classical originals is for the reader to judge\.

Three features of the operational regime fall outside the modeling assumptions of the classical anomaly hierarchies, and it is these—not the anomalies in isolation—that the port must accommodate\. First, the generation phase has*unbounded latency*: database isolation theory assumes a transaction holds its read set under lock or snapshot for a bounded window, whereas an agent operation’s read\-to\-commit interval is a neural\-inference phase of seconds to minutes, during which the read set is neither locked nor cheaply re\-validated\. Second, the*tool registry*is first\-class mutable state with no counterpart in the relational model:A2A\_\{2\}\(phantom\-tool\) is a consistency failure over a capability set that classical isolation does not model at all\. Third, external tool effects are*irreversible*: the cascading\-abort and atomic\-commit machinery of database recovery assumes effects can be undone, which is exactly what fails forA3A\_\{3\}andA6A\_\{6\}when a committed effect is an outbound email or a payment\. The individual anomalies are structural analogues of known ones, and we claim no new concurrency theory; the contribution is that this*combination*of regime features is not expressible in the classical taxonomies, and that we supply mechanically\-verified detector, safety, and refinement artifacts for it\.

### 1\.4Contributions

This paper makes five contributions\. The Boolean latticeℒ=⟨2\{A1,A2,A3,A6\},⊆⟩\\mathcal\{L\}=\\langle 2^\{\\\{A\_\{1\},A\_\{2\},A\_\{3\},A\_\{6\}\\\}\},\\subseteq\\ranglethat the four formalized anomalies induce serves as organizational vocabulary for the substantive contributions that follow; the lattice is not itself a theoretical result\.

The weight of the paper is on the mechanized and deployed artifacts, contributions \(4a\)–\(4c\) and \(5\)–\(5b\), not on the catalog or the lattice\. What we claim as new is the*combination*: Verus\-verified detectors proved sound*and*complete against the formal anomaly specifications; consistency disciplines verified along the entire chain and deployed live under real LLM agents atL0L\_\{0\}–L1L\_\{1\}; and refinements verified against the*actual channel semantics of deployed frameworks*—LangGraph’s reducer and AutoGen’s ETag persistence layer—one of which closes a reproduced, live silent lost update in a shipped application \(ByteDance’s deer\-flow\)\. The catalog \(1\), the organizing lattice \(2\), and the snapshot\-insufficiency observation \(3\) are scaffolding the verified artifacts hang from; the phenomena are classical\. To our knowledge no prior work mechanically verifies a consistency hierarchy for multi\-agent LLM runtimes along the whole chain, or grounds its refinements in the concurrency primitives deployed agent frameworks actually ship\. That gap—not the lattice—is the contribution\.

\(1\) An anomaly catalog\([Section˜3](https://arxiv.org/html/2606.17182#S3)\)\. We identify five concurrency anomalies that arise in our proposed operational model\. Four—stale\-generation, phantom\-tool, causal\-cascade, tool\-effect\-reordering—are formalized as predicates over operation histories in TLA\+, and each is verified by an explicit TLC counter\-example trace at small finite parameters \(\|A\|=2\|A\|=2,\|𝐶𝑒𝑙𝑙𝑠\|≤2\|\\mathit\{Cells\}\|\\leq 2,𝑀𝑎𝑥𝑂𝑝𝑠≤4\\mathit\{MaxOps\}\\leq 4\)\. The fifth, split\-view, falls outside the single\-store operational model \(it requires replication\) and is formalized separately, at the Verus model level, as a monotone\-primary no\-split theorem \([Section˜4\.12](https://arxiv.org/html/2606.17182#S4.SS12)\)\.

\(2\) A consistency lattice as organizing framework\([Section˜4](https://arxiv.org/html/2606.17182#S4)\)\. The Boolean latticeℒ\\mathcal\{L\}has sixteen points; we name five distinguished pointsL0,…,L4L\_\{0\},\\ldots,L\_\{4\}along one chosen maximal chain—one of the twenty\-four linear extensions ofℒ\\mathcal\{L\}—selected on operational grounds rather than mathematical primacy\. The construction extends the methodology of Berenson et al\.\[[7](https://arxiv.org/html/2606.17182#bib.bib7)\]and Adya\[[8](https://arxiv.org/html/2606.17182#bib.bib8)\]into the multi\-agent LLM regime\.

\(3\) Snapshot\-insufficiency: a classical pattern in a new operational setting\.\([Section˜4\.5](https://arxiv.org/html/2606.17182#S4.SS5)\)\. We observe that a structurally\-defined generation snapshot is insufficient to prevent stale\-generation, and that the level must be augmented with explicit read\-set stability\. This echoes the write\-skew pattern under snapshot isolation\[[7](https://arxiv.org/html/2606.17182#bib.bib7),[21](https://arxiv.org/html/2606.17182#bib.bib21),[22](https://arxiv.org/html/2606.17182#bib.bib22)\], which is well\-known in the database literature\. Our contribution is the formalization of the pattern within the operational model of[Section˜2](https://arxiv.org/html/2606.17182#S2), where the work discarded on abort is an LLM inference rather than a relational update\.

\(4a\) Mechanically verified detector pipeline\([Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)\)\. We mechanically verify in Verus\[[23](https://arxiv.org/html/2606.17182#bib.bib23)\]that the Rust detector functions for the four formalized anomalies are sound*and*complete with respect to their TLA\+specifications, including the value\-mismatch component ofA1A\_\{1\}\. The chain discharges twenty\-four obligations withoutassumeor added axioms\.

\(4b\) Mechanically verified runtime safety*againstA1A\_\{1\}only*\.We additionally verify in Verus that all three runtime strategies of[Section˜6\.6](https://arxiv.org/html/2606.17182#S6.SS6)satisfy¬A1\\neg A\_\{1\}over their respective abstract state machines: the pessimistic\-locking runtime \(23 obligations, unconditional\), the SSI runtime \(8 obligations, unconditional,validate\_no\_write=truemode\), and the default\-SI runtime \(9 obligations, conditional on the all\-writers workload hypothesis\)\. The default\-SI result is a conditional workload characterization, not an unconditional guarantee: the runtime admitsA1A\_\{1\}by design through the read\-only no\-write bypass \(measured silent residual:3%3\\%on triage,[Section˜5\.13](https://arxiv.org/html/2606.17182#S5.SS13)\)—the negative result that motivates unconditional SSI as the recommendedL1L\_\{1\}mechanism\. Across all three: zeroassume, zeroexternal\_bodyin safety\-theorem proof bodies, zero added axioms, forty obligations\.

The three deployed runtimes implement only theL0→L1L\_\{0\}\\to L\_\{1\}step; the higher levels are verified by dedicated runtime models \(causal tracking with cascading abort,[Section˜4\.11](https://arxiv.org/html/2606.17182#S4.SS11); saga compensation,[Section˜4\.15](https://arxiv.org/html/2606.17182#S4.SS15); registry\-snapshot isolation,[Section˜4\.16](https://arxiv.org/html/2606.17182#S4.SS16)\)—distinct artifacts from the pilot runtimes, which are not claimed to reachL2L\_\{2\}or above\.[Table˜I](https://arxiv.org/html/2606.17182#S1.T1)makes the split explicit\.

TABLE I:Scope of mechanical verification by anomaly and runtime\.Detector: Verus equivalence between operational predicate and detector implementation \(A1,A2,A3,A6A\_\{1\},A\_\{2\},A\_\{3\},A\_\{6\}all verified;[Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)\)\.Pessimistic / SSI / Default\-SI: the three*deployed*pilot runtimes; each is Verus\-verified to blockA1A\_\{1\}\(L1L\_\{1\}\) and is not claimed to blockA3,A6,A2A\_\{3\},A\_\{6\},A\_\{2\}\.Model\-level: dedicated runtime models that realize the higher levels \([Sections˜4\.11](https://arxiv.org/html/2606.17182#S4.SS11),[4\.15](https://arxiv.org/html/2606.17182#S4.SS15)and[4\.16](https://arxiv.org/html/2606.17182#S4.SS16)\); these are separate artifacts from the deployed runtimes\. Each higher level is additionally realized as runnable code:L2L\_\{2\}as an executable runtime, exec\-mode\-verified to refine the model \(zero added axioms\) with measuredA3A\_\{3\}prevention \([Section˜5\.11](https://arxiv.org/html/2606.17182#S5.SS11)\);L3L\_\{3\}andL4L\_\{4\}as exec\-mode\-verified artifacts with dependency\-free twins whoseA6A\_\{6\}/A2A\_\{2\}prevention is measured under identical adversarial schedules\.DetectorPessimisticSSIDefault\-SIModel\-levelA1A\_\{1\}\(stale\-gen\)verifiedverifiedverifiedverified \(cond\.\)L1L\_\{1\}\(deployed\)A3A\_\{3\}\(cascade\)verified———L2L\_\{2\}verified, run live \(0/1200/120\)A6A\_\{6\}\(reorder\)verified———L3L\_\{3\}exec\-verified, twin\-measuredA2A\_\{2\}\(phantom\)verified———L4L\_\{4\}exec\-verified, twin\-measuredA4A\_\{4\}\(split\-view\)————verified \(monotone\-primary no\-split\)We do not claim these are the first mechanically\-verified safety results for LLM runtimes; prior verification work on transactional memory, sagas, workflow systems, and database isolation levels covers structurally similar artifacts in adjacent domains, and we situate the contribution against that body of work in[Section˜7](https://arxiv.org/html/2606.17182#S7)\.

\(4c\) Mechanically verified spec–runtime refinement\.We close, in Verus, four refinement proofs establishing that the deployed Rust runtimes for the pessimistic\-locking, SSI, and default\-SI strategies refine the abstract state machines of \(4b\)\. The pessimistic refinement discharges 31 obligations; the SSI projection refinement discharges 18 obligations; the SSI literal version\-chain refinement \(Map\(CellId,Seq\(Time,Value\)\)\) discharges 17 obligations; the default\-SI refinement \(all four transitions including the bypass\) discharges 18\. The four files share a two\-axiom trust base \(string\-identifier injectivity, null\-sentinel mapping\), carry zeroexternal\_bodyon any safety\-bearing lemma, and the formerly\-required finite\-set image axiom is now a proven lemma\. The concurrent\-execution gap is partially closed by the concurrent\-semantics lift of[Section˜4\.10](https://arxiv.org/html/2606.17182#S4.SS10)\(9 obligations, 0 axioms\); the residual relocates tostd::sync::Mutexconformance, with weak memory, liveness, and lock poisoning explicit gaps\.

\(5\) Empirical pilot\([Section˜5](https://arxiv.org/html/2606.17182#S5)\)\. We exercise the detector on 700 synthetic traces and 300 gpt\-4o sessions, then run a 900\-session three\-strategy baseline replicated on Claude Sonnet 4\.5 \(1,800 token\-instrumented sessions across two model families\)\. The real\-LLM pilot reports stale\-generation at 1%, 35%, 100% across plan\-execute/triage/edit\-review \(both endpoints workload\-engineered, a sensitivity check rather than a prevalence finding\)\. SSI is statistically indistinguishable in cost from vanilla on both providers and all three workloads between sessions \(a paired re\-analysis isolates one∼8%\{\\sim\}8\\%overhead\); pessimistic overhead is bounded but provider\-dependent \(≤\\leq1\.6×\\timeson gpt\-4o,≤\\leq2\.3×\\timeson Claude\)\. A within\-model wall\-clock replication finds both guarded strategies at the inference\-noise floor on gpt\-4o and open\-weights Llama\-3\.2, a third model family\.

\(5b\) TheL2L\_\{2\}guarantee is runnable code, not only a model\.We exec\-mode\-verify theL2L\_\{2\}causal\-tracking discipline in Verus \(zero added axioms\), so theA3A\_\{3\}\-freedom theorem holds of runnable code, and a std\-only twin measures the unguarded baseline it removes \(0/10000/1000vs\.1000/10001000/1000, transitive cascades included\), and the runtime is run live: across three model families on a triage workload, supervisors retract plans at0%0\\%/15\.5%15\.5\\%/44\.5%44\.5\\%and the verifiedL2L\_\{2\}runtime preventsA3A\_\{3\}in all120120retracted sessions where the baseline corrupts \(0/1200/120,95%95\\%CI\[0,2\.5%\]\[0,2\.5\\%\];[Section˜5\.11](https://arxiv.org/html/2606.17182#S5.SS11)\)\.L3L\_\{3\}andL4L\_\{4\}are carried to runnable code the same way but remain twin\-measured, not yet live\.[Section˜6\.4](https://arxiv.org/html/2606.17182#S6.SS4)additionally maps Atomix, SagaLLM, and CodeCRDT onto chain points \(approximate, not formal\), showing the lattice discriminates among contemporary designs\.

### 1\.5Roadmap

[Section˜2](https://arxiv.org/html/2606.17182#S2)introduces the runtime model\.[Section˜3](https://arxiv.org/html/2606.17182#S3)presents the catalog\.[Section˜4](https://arxiv.org/html/2606.17182#S4)defines the levels and includes both a TLAPS coherence check \([Section˜4\.6](https://arxiv.org/html/2606.17182#S4.SS6)\) and a Verus equivalence proof for the detector pipeline \([Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)\)\.[Section˜5](https://arxiv.org/html/2606.17182#S5)reports the empirical pilot\.[Section˜6](https://arxiv.org/html/2606.17182#S6)discusses implications, locates contemporary runtimes, and lists limitations\.[Section˜7](https://arxiv.org/html/2606.17182#S7)surveys related work\.[Section˜8](https://arxiv.org/html/2606.17182#S8)concludes\.

## 2Background and runtime model

[Section˜1\.2](https://arxiv.org/html/2606.17182#S1.SS2)surveyed the consistency\-theory landscape and is not repeated here\. This section introduces the operational model on which the rest of the paper builds\.

### 2\.1Runtime model

A multi\-agent LLM system, for the purposes of this paper, consists of a finite set of agentsAA, a shared memoryMMmapping cells to values \(with a distinguished sentinelNULL\\mathrm\{NULL\}\), and a shared registryRRof available tools\. Each agent executes a sequence of operations\. An operation has three phases:

- •A*read*phase that captures the values of a chosen subset of cells \(the operation’s𝑟𝑒𝑎𝑑\_𝑠𝑒𝑡\\mathit\{read\\\_set\}\) along with the current registry contents \(𝑟𝑒𝑎𝑑\_𝑟𝑒𝑔𝑖𝑠𝑡𝑟𝑦\\mathit\{read\\\_registry\}\)\. The read time is recorded as the current log length\.
- •A*generation*phase, during which the agent computes its planned writes and tool calls\. This phase is dominated by neural\-inference latency and is not treated as instantaneous\.
- •A*write*phase that atomically commits the operation’s planned writes to memory, along with a record of the selected tool \(𝑝𝑙𝑎𝑛𝑛𝑒𝑑\_𝑡𝑜𝑜𝑙\\mathit\{planned\\\_tool\}\) and the registry contents at commit \(𝑤𝑟𝑖𝑡𝑒\_𝑟𝑒𝑔𝑖𝑠𝑡𝑟𝑦\\mathit\{write\\\_registry\}\)\. The write time is recorded\.

The operation log is append\-only; its length plays the role of a logical clock\. We denote the log byhhand a history byh∈Seq\(𝑂𝑝𝑅𝑒𝑐𝑜𝑟𝑑\)h\\in\\mathrm\{Seq\}\(\\mathit\{OpRecord\}\)\. The complete TLA\+ specification of this model is given in the supplementary artifact \(Memory\.tla\)\.

The model is our proposal for studying long\-running concurrent operations under inference\-bounded latency\. It does not directly mirror any single deployed multi\-agent runtime; deployed systems \(AutoGen\[[24](https://arxiv.org/html/2606.17182#bib.bib24)\], LangGraph, CrewAI\) use heterogeneous abstractions including conversation\-based message passing, explicit state graphs, and delegation patterns\. Mapping these onto our model is non\-trivial; we make the case for the model’s utility by exhibiting placements of three contemporary runtimes \([Section˜6\.4](https://arxiv.org/html/2606.17182#S6.SS4)\) and by the analytical traction the model provides for the anomaly catalog\. Substantive validation against a broader range of deployed systems is future work\.

#### Auxiliary definition\.

For historyhh, cellcc, and logical timeτ\\tau,

𝐿𝑎𝑡𝑒𝑠𝑡𝑊𝑟𝑖𝑡𝑒𝐵𝑒𝑓𝑜𝑟𝑒\(h,c,τ\)≜\\displaystyle\\mathit\{LatestWriteBefore\}\(h,c,\\tau\)\\;\\triangleq\\;\{h\[k\]\.𝑤𝑟𝑖𝑡𝑒\_𝑣𝑎𝑙𝑢𝑒𝑠\[c\]ifklargest index,h\[k\]\.𝑤𝑟𝑖𝑡𝑒\_𝑡𝑖𝑚𝑒<τ,c∈h\[k\]\.𝑤𝑟𝑖𝑡𝑒\_𝑠𝑒𝑡,NULLotherwise\.\\displaystyleThis function returns the value of cellccas of the most recent write toccstrictly before timeτ\\tau\. It is used in[Section˜4\.5](https://arxiv.org/html/2606.17182#S4.SS5)to formalize the structural snapshot\.

### 2\.2Simplifying assumptions and their scope

The model above is simplified\. Each simplification is a deliberate concession to formal tractability that defines the scope of the present results\.

*Determinism of LLM outputs\.*We treat each LLM invocation as deterministic in its inputs: the content of an𝑂𝑝𝑅𝑒𝑐𝑜𝑟𝑑\\mathit\{OpRecord\}’s𝑤𝑟𝑖𝑡𝑒\_𝑣𝑎𝑙𝑢𝑒𝑠\\mathit\{write\\\_values\}and𝑡𝑜𝑜𝑙𝑠\_𝑢𝑠𝑒𝑑\\mathit\{tools\\\_used\}fields is a function of its𝑟𝑒𝑎𝑑\_𝑣𝑎𝑙𝑢𝑒𝑠\\mathit\{read\\\_values\}and the tools visible at read time\. We state the resulting scope as a boundary, not a buried caveat: the framework’s*operational*claims hold over the deterministic\-generation regime—the regime that production durable\-execution engines \([Section˜7\.9](https://arxiv.org/html/2606.17182#S7.SS9)\) enforce by deterministic replay—and degrade to a soundness\-only screen \([Sections˜4\.9](https://arxiv.org/html/2606.17182#S4.SS9)and[4\.13](https://arxiv.org/html/2606.17182#S4.SS13)\) outside it, with the residual gap measured rather than assumed \([Section˜5\.9](https://arxiv.org/html/2606.17182#S5.SS9)\)\. This assumption is load\-bearing in a way that warrants explicit discussion\. In deployment, LLM outputs are stochastic, with non\-trivial probability of generating different write payloads for identical read sets\. The framework specifies the consistency properties of the histories thus produced and does not model the per\-call sampling distribution\. The strongest objection to this simplification concerns theA1A\_\{1\}value\-mismatch conjunct \(h\[i\]\.𝑟𝑒𝑎𝑑\_𝑣𝑎𝑙𝑢𝑒𝑠\[c\]≠h\[j\]\.𝑤𝑟𝑖𝑡𝑒\_𝑣𝑎𝑙𝑢𝑒𝑠\[c\]h\[i\]\.\\mathit\{read\\\_values\}\[c\]\\neq h\[j\]\.\\mathit\{write\\\_values\}\[c\]\): under stochasticity, an agent that had observed the fresh value might still have generated the same write payload as it did having observed the stale value, in which case the conjunct does not fire even though the read\-set instability that motivates the anomaly is present\. We acknowledge that this objection is sound: the predicate fires on observed value\-divergence in the trace, not on the underlying probability that the agent’s output would have differed under different read\-set evidence\. The framework classifies the deterministic histories it observes; it does not classify the agent’s input–output distribution\. Three partial justifications: \(i\) the framework is the zero\-temperature/identical\-seed special case—not how deployed systems run, but how reproducibility\-critical ones are sometimes configured, and even then greedy decoding is not fully deterministic\[[98](https://arxiv.org/html/2606.17182#bib.bib98)\]while constrained/grammar\-guided generation\[[97](https://arxiv.org/html/2606.17182#bib.bib97)\]narrows toward our regime—and, non\-hypothetically, durable\-execution engines \([Section˜7\.9](https://arxiv.org/html/2606.17182#S7.SS9)\) enforce deterministic*replay*in production, so within that regime each history is a deterministic function of its recorded inputs; \(ii\) anomaly witnesses are properties of histories, so a stochastic system yields a distribution over independently\-classifiable histories \(the firing*rate*is then a property of that distribution\); and \(iii\) the empirical pilot samples many traces from real stochastic agents\. None of these closes the value\-mismatch concern; a probabilistic refinement admitting non\-determinism in the operation predicate is the principal formalization follow\-up, along the lines of Bailis et al\.\[[25](https://arxiv.org/html/2606.17182#bib.bib25)\]and the quantitative relaxation of Henzinger et al\.\[[26](https://arxiv.org/html/2606.17182#bib.bib26)\]\. Independently,[Section˜5\.9](https://arxiv.org/html/2606.17182#S5.SS9)measures the operational counterpart:pop=Pr⁡\(decision changes∣flagged mismatch\)p\_\{\\text\{op\}\}=\\Pr\(\\text\{decision changes\}\\mid\\text\{flagged mismatch\}\)is strongly role\-dependent \(0%0\\%for an independent producer up to100%100\\%for an assessor\), so the trace\-level predicate over\-counts*operational*staleness by a measured, bounded factor—bounding with data the charge that it fires on an operationally\-meaningless string inequality\.

*What the assumption does and does not underwrite\.*The mechanized results consume no determinism hypothesis: the TLA\+ commit action binds write values existentially, the anomaly predicates and the safety/refinement theorems range over recorded histories and reachable states with write values free, and no proof body uses a writes\-as\-function\-of\-reads assumption\. Determinism load\-bears exactly one inference—from a fired predicate to*operational*staleness—and that inference is measured \([Section˜5\.9](https://arxiv.org/html/2606.17182#S5.SS9)\), not assumed\. Detection and prevention, as trace\-level results, hold of stochastic deployments verbatim; what stochasticity weakens is the counterfactual reading of a firing or its absence \([Section˜4\.13](https://arxiv.org/html/2606.17182#S4.SS13)\), not the theorems\.

*Atomic commit at the record level, not at the external\-effect level\.*The commit of an operation, viewed as a single𝑂𝑝𝑅𝑒𝑐𝑜𝑟𝑑\\mathit\{OpRecord\}, is atomic with respect to the shared store: all of its𝑤𝑟𝑖𝑡𝑒\_𝑠𝑒𝑡\\mathit\{write\\\_set\}entries become visible at a single point in time𝑤𝑟𝑖𝑡𝑒\_𝑡𝑖𝑚𝑒\\mathit\{write\\\_time\}\. This atomicity is at the*record*granularity over*internal store state*\. Tools invoked during the operation may also produce*external effects*that bypass the shared store: API calls to third\-party services, emails sent, payments authorized, log lines written to external sinks\. These external effects are issued sequentially during the operation but are not subject to the record’s atomic commit; once issued, they cannot be retracted by aborting the commit\. The record’s𝑖𝑜\\mathit\{io\}sequence captures the order of issuance of external effects,𝑐𝑜\\mathit\{co\}captures the order in which the runtime considers them committed \(e\.g\., for retry/abort logic\), andA6A\_\{6\}is the formalization of cases in which these orders diverge\.

This distinction resolves the apparent tension between “record\-level atomic commit” and “intra\-record reordering observable\.” Internal store state is updated atomically; external effects are not\. A single tool call can issue an external effect mid\-operation and then have its committed write to the shared store land in a different relative order\. Examples: a tool that sends an email*then*updates a state cell with the confirmation; the email is observable to its recipient immediately, the state update is observable to other agents only at𝑤𝑟𝑖𝑡𝑒\_𝑡𝑖𝑚𝑒\\mathit\{write\\\_time\}\. If a second tool call within the same record updates a different state cell*before*sending its own external effect, the issuance order \(email\-1 before email\-2\) and the commit order at the shared store \(cell\-2 before cell\-1\) can diverge\.

The framework therefore distinguishes two layers of atomicity: across records over internal state \(assumed\), and within a record over external effects \(not assumed;A6A\_\{6\}is precisely the formalization of that gap\)\. A fully streaming model in which shared\-store writes externalize individually before record commit is future work and would introduce a finer anomaly stratification\.

*No replication\.*The shared memory is a single logical store\. Replicated stores admit split\-view anomalies that require a richer model\.

*Cost asymmetry: bounded, not unbounded\.*A single LLM inference call costs measurably more \(in dollars and latency\) than a database transaction or a memory access, and the standard intuition in the multi\-agent LLM literature treats this as an order\-of\-magnitude asymmetry that makes abort\-and\-retry mechanisms cripplingly expensive\. We share the qualitative intuition but make no quantitative order\-of\-magnitude claim a priori: our measurements at gpt\-4o scale \([Section˜5\.6](https://arxiv.org/html/2606.17182#S5.SS6)\) show overhead bounded at≤\\leq1\.6×\\timesvanilla in the worst measured workload and statistically zero in others; the Claude Sonnet 4\.5 replication in the same section shifts the bound to≤\\leq2\.3×\\times\. The cost asymmetry is real but bounded in the regime we measured\. We treat the cost dimension as empirically reported in[Section˜5\.6](https://arxiv.org/html/2606.17182#S5.SS6)rather than theoretically modeled\.

## 3The Anomaly Catalog

We identify five concurrency anomalies\. Four are formalized as predicates over operation histories; the fifth, split\-view, is deferred\.111The numbering \(A1,A2,A3,A4,A6A\_\{1\},A\_\{2\},A\_\{3\},A\_\{4\},A\_\{6\}\) is not contiguous: an earlier formulation included anA5A\_\{5\}\(*LongGeneration*\), which was subsumed byA1A\_\{1\}in the current operational model and dropped from the catalog\. We retain the existing numbering to keep cross\-references stable with the public artifact and with previously circulated drafts\.For each formalized anomaly, we state the predicate, describe the failure mode, summarize the TLC\-verified witness, and contrast with classical consistency models\.

The TLA\+ source for all four predicates is in the supplementary artifact \(Anomalies\.tla\); each witness is reproduced in full as a sequence of states in the artifact appendix\. All TLC runs in this section use\|A\|=2\|A\|=2agents,\|𝐶𝑒𝑙𝑙𝑠\|≤2\|\\mathit\{Cells\}\|\\leq 2memory cells, and𝑀𝑎𝑥𝑂𝑝𝑠≤4\\mathit\{MaxOps\}\\leq 4operation budget\. The largest exhaustive verification run in the artifact,MC\_CodeCRDT\_RYW\([Section˜6\.4](https://arxiv.org/html/2606.17182#S6.SS4)\), explores 9,348,770 distinct states at depth 12 in 38 seconds at\|A\|=2,\|𝐶𝑒𝑙𝑙𝑠\|=1,𝑀𝑎𝑥𝑂𝑝𝑠=3\|A\|=2,\|\\mathit\{Cells\}\|=1,\\mathit\{MaxOps\}=3; expanding bounds to\|A\|=3,𝑀𝑎𝑥𝑂𝑝𝑠=6\|A\|=3,\\mathit\{MaxOps\}=6and\|A\|=3,𝑀𝑎𝑥𝑂𝑝𝑠=9\|A\|=3,\\mathit\{MaxOps\}=9explores in excess of 150 million distinct states in each case without finding a violation within a 10\-minute budget\. These larger runs are bounded partial explorations, not exhaustive checks\. The witness traces for the CodeCRDTA1A\_\{1\}admission reproduce robustly at all three bound levels \(5\-state witness, 2,997–6,126 distinct states explored before the violation is found\)\.

#### Scale invariance of positive witnesses\.

Every catalog predicate is existentially quantified and locally checkable on a constant number of records \(A1A\_\{1\}:i,j,ci,j,c;A2A\_\{2\}/A6A\_\{6\}: single record;A3A\_\{3\}:j,c,vj,c,v\)\. A witness at\(\|A\|,\|𝐶𝑒𝑙𝑙𝑠\|,𝑀𝑎𝑥𝑂𝑝𝑠\)\(\|A\|,\|\\mathit\{Cells\}\|,\\mathit\{MaxOps\}\)therefore embeds verbatim into any larger configuration \(idle extra agents, cells, and slots preserve it\), so positive TLC results generalize upward\. The converse does not hold: the negativeMC\_CodeCRDT\_RYWruns at\|A\|=3,𝑀𝑎𝑥𝑂𝑝𝑠∈\{6,9\}\|A\|=3,\\mathit\{MaxOps\}\\in\\\{6,9\\\}are bounded confidence, not proof, and generalizing them would need an inductive argument left to future work\.

### 3\.1A1: Stale\-Generation

#### Definition\.

###### Definition 1\(Stale\-generation\)\.

A historyh∈Seq\(𝑂𝑝𝑅𝑒𝑐𝑜𝑟𝑑\)h\\in\\mathrm\{Seq\}\(\\mathit\{OpRecord\}\)exhibits the*stale\-generation anomaly*when there existi,j∈\{1,…,\|h\|\}i,j\\in\\\{1,\\ldots,\|h\|\\\},i≠ji\\neq j, withh\[i\]\.𝑎𝑔𝑒𝑛𝑡≠h\[j\]\.𝑎𝑔𝑒𝑛𝑡h\[i\]\.\\mathit\{agent\}\\neq h\[j\]\.\\mathit\{agent\}and somec∈h\[i\]\.𝑟𝑒𝑎𝑑\_𝑠𝑒𝑡∩h\[j\]\.𝑤𝑟𝑖𝑡𝑒\_𝑠𝑒𝑡c\\in h\[i\]\.\\mathit\{read\\\_set\}\\cap h\[j\]\.\\mathit\{write\\\_set\}such that

h\[i\]\.𝑟𝑒𝑎𝑑\_𝑡𝑖𝑚𝑒<h\[j\]\.𝑤𝑟𝑖𝑡𝑒\_𝑡𝑖𝑚𝑒<h\[i\]\.𝑤𝑟𝑖𝑡𝑒\_𝑡𝑖𝑚𝑒andh\[i\]\.𝑟𝑒𝑎𝑑\_𝑣𝑎𝑙𝑢𝑒𝑠\[c\]≠h\[j\]\.𝑤𝑟𝑖𝑡𝑒\_𝑣𝑎𝑙𝑢𝑒𝑠\[c\]\.\\begin\{gathered\}h\[i\]\.\\mathit\{read\\\_time\}<h\[j\]\.\\mathit\{write\\\_time\}<h\[i\]\.\\mathit\{write\\\_time\}\\\\ \\text\{and\}\\quad h\[i\]\.\\mathit\{read\\\_values\}\[c\]\\neq h\[j\]\.\\mathit\{write\\\_values\}\[c\]\.\\end\{gathered\}

#### Description\.

AgentAAbegins an operation by reading cellcc, recording its observed value\. WhileAA’s generation phase is in progress, agentBBwrites a new value tocc\. By the timeAAcommits, the valueAAobserved is no longer current, yetAA’s output is grounded in the obsolete observation and committed on that basis\. The definition admits any number of intervening writes; the multi\-intervening case specializes to bounded\-staleness consistency models in the sense of Bailis et al\.\[[25](https://arxiv.org/html/2606.17182#bib.bib25)\], which parameterize allowable staleness by version count and would yield refinements ofL1L\_\{1\}betweenL0L\_\{0\}andL1L\_\{1\}\.

#### Witness\.

TLC produces a five\-state counter\-example at\|A\|=2\|A\|=2,\|𝐶𝑒𝑙𝑙𝑠\|=1\|\\mathit\{Cells\}\|=1,𝑀𝑎𝑥𝑂𝑝𝑠=4\\mathit\{MaxOps\}=4\. Agenta2a\_\{2\}begins readingc1c\_\{1\}at𝑟𝑒𝑎𝑑\_𝑡𝑖𝑚𝑒=0\\mathit\{read\\\_time\}=0, recording𝑟𝑒𝑎𝑑\_𝑣𝑎𝑙𝑢𝑒𝑠\[c1\]=NULL\\mathit\{read\\\_values\}\[c\_\{1\}\]=\\mathrm\{NULL\}\. Concurrently, agenta1a\_\{1\}completes an operation writingc1=v1c\_\{1\}=v\_\{1\}at𝑤𝑟𝑖𝑡𝑒\_𝑡𝑖𝑚𝑒=1\\mathit\{write\\\_time\}=1\. Agenta2a\_\{2\}then commits at𝑤𝑟𝑖𝑡𝑒\_𝑡𝑖𝑚𝑒=2\\mathit\{write\\\_time\}=2with its log entry still showing𝑟𝑒𝑎𝑑\_𝑣𝑎𝑙𝑢𝑒𝑠\[c1\]=NULL\\mathit\{read\\\_values\}\[c\_\{1\}\]=\\mathrm\{NULL\}\. The predicate fires withi=2i=2,j=1j=1:0<1<20<1<2andNULL≠v1\\mathrm\{NULL\}\\neq v\_\{1\}\.

#### Distinction from classical models\.

Hardware consistency models assume bounded operation latency and statically\-determined read sets; database isolation theory assumes the read set is held under lock or snapshot for the operation’s lifetime\. Neither assumption survives a generation phase whose duration is dominated by neural inference\.

### 3\.2A2: Phantom\-Tool

#### Definition\.

###### Definition 2\(Phantom\-tool\)\.

A historyhhexhibits the*phantom\-tool anomaly*when there existsi∈\{1,…,\|h\|\}i\\in\\\{1,\\ldots,\|h\|\\\}such that

h\[i\]\.𝑝𝑙𝑎𝑛𝑛𝑒𝑑\_𝑡𝑜𝑜𝑙≠NULL,h\[i\]\.𝑝𝑙𝑎𝑛𝑛𝑒𝑑\_𝑡𝑜𝑜𝑙∈h\[i\]\.𝑟𝑒𝑎𝑑\_𝑟𝑒𝑔𝑖𝑠𝑡𝑟𝑦,h\[i\]\.𝑝𝑙𝑎𝑛𝑛𝑒𝑑\_𝑡𝑜𝑜𝑙∉h\[i\]\.𝑤𝑟𝑖𝑡𝑒\_𝑟𝑒𝑔𝑖𝑠𝑡𝑟𝑦\.\\begin\{gathered\}h\[i\]\.\\mathit\{planned\\\_tool\}\\neq\\mathrm\{NULL\},\\\\ h\[i\]\.\\mathit\{planned\\\_tool\}\\in h\[i\]\.\\mathit\{read\\\_registry\},\\\\ h\[i\]\.\\mathit\{planned\\\_tool\}\\notin h\[i\]\.\\mathit\{write\\\_registry\}\.\\end\{gathered\}

#### Description\.

An agent begins an operation by reading the tool registry and selecting a planned tool\. During the generation phase, the registry mutates and the planned tool is removed\. When the agent commits, it emits a tool call to a capability that no longer exists\.

#### Witness\.

TLC produces a four\-state counter\-example at\|A\|=2\|A\|=2,\|𝑇𝑜𝑜𝑙𝑠\|=2\|\\mathit\{Tools\}\|=2,𝑀𝑎𝑥𝑂𝑝𝑠=3\\mathit\{MaxOps\}=3\. Agenta1a\_\{1\}enters its read phase with𝑟𝑒𝑎𝑑\_𝑟𝑒𝑔𝑖𝑠𝑡𝑟𝑦=\{t1,t2\}\\mathit\{read\\\_registry\}=\\\{t\_\{1\},t\_\{2\}\\\}and𝑝𝑙𝑎𝑛𝑛𝑒𝑑\_𝑡𝑜𝑜𝑙=t1\\mathit\{planned\\\_tool\}=t\_\{1\}\. The transition𝑅𝑒𝑚𝑜𝑣𝑒𝑇𝑜𝑜𝑙\(t1\)\\mathit\{RemoveTool\}\(t\_\{1\}\)fires whilea1a\_\{1\}is in its generation phase, leaving the registry as\{t2\}\\\{t\_\{2\}\\\}\. Agenta1a\_\{1\}then commits with𝑤𝑟𝑖𝑡𝑒\_𝑟𝑒𝑔𝑖𝑠𝑡𝑟𝑦=\{t2\}\\mathit\{write\\\_registry\}=\\\{t\_\{2\}\\\}and𝑝𝑙𝑎𝑛𝑛𝑒𝑑\_𝑡𝑜𝑜𝑙=t1∉𝑤𝑟𝑖𝑡𝑒\_𝑟𝑒𝑔𝑖𝑠𝑡𝑟𝑦\\mathit\{planned\\\_tool\}=t\_\{1\}\\notin\\mathit\{write\\\_registry\}\.

#### Distinction from classical models\.

Database phantom anomalies concern read\-set instability under predicate queries;A2A\_\{2\}applies to the tool registry, which is not modeled in classical consistency theory\. Object\-capability literature\[[27](https://arxiv.org/html/2606.17182#bib.bib27),[28](https://arxiv.org/html/2606.17182#bib.bib28)\]addresses tool revocation but without an explicit lattice of consistency levels that names the anomaly produced when revocation overlaps an operation’s lifetime\. Tool\-ecosystem dynamics in modern LLM deployments\[[29](https://arxiv.org/html/2606.17182#bib.bib29),[30](https://arxiv.org/html/2606.17182#bib.bib30)\]make this anomaly increasingly relevant: tool registries change frequently as tools are added, deprecated, or rate\-limited\.

### 3\.3A3: Causal\-Cascade

#### The anomaly\.

A3A\_\{3\}is the uncompensated\-cascade phenomenon\. When an operation aborts in a compensating \(saga\-style\) runtime, its writes are rolled back; any operation that had read a value the aborted writer produced is then grounded on a basis no surviving committed write supports, and that invalidity cascades to anything that in turn read the dependent’s output\.

#### Definition \(the cataloged predicate\)\.

Earlier drafts definedA3A\_\{3\}as a flat\-history*residue*—a committed read of a cell whose value no surviving committed write produced at or before the read\. That residue is a sound over\-approximation of the cascade but not its characteristic predicate: it also fires, benignly, on a read of an initial, seeded, or externally\-supplied value that no logged operation wrote, even in an execution with no abort\. Because the runtime prevention theorem of[Section˜4\.11](https://arxiv.org/html/2606.17182#S4.SS11)is stated over the precise cascade condition, we take that condition—not the residue—as the catalogedA3A\_\{3\}, so that the catalog predicate and the mechanically\-verified guarantee are the same predicate\.

###### Definition 3\(Causal\-cascade\)\.

Let each operation record additionally carry an abort flagh\[i\]\.𝑎𝑏𝑜𝑟𝑡𝑒𝑑∈𝔹h\[i\]\.\\mathit\{aborted\}\\in\\mathbb\{B\}and a predecessor seth\[i\]\.𝑝𝑟𝑒𝑑𝑠⊆\{1,…,\|h\|\}h\[i\]\.\\mathit\{preds\}\\subseteq\\\{1,\\ldots,\|h\|\\\}recording the operations whose committed writes it observed \(its causal closure\)\. A historyhhexhibits the*causal\-cascade anomaly*when there existj,p∈\{1,…,\|h\|\}j,p\\in\\\{1,\\ldots,\|h\|\\\}with

¬h\[j\]\.𝑎𝑏𝑜𝑟𝑡𝑒𝑑∧p∈h\[j\]\.𝑝𝑟𝑒𝑑𝑠∧h\[p\]\.𝑎𝑏𝑜𝑟𝑡𝑒𝑑\.\\neg\\,h\[j\]\.\\mathit\{aborted\}\\;\\wedge\\;p\\in h\[j\]\.\\mathit\{preds\}\\;\\wedge\\;h\[p\]\.\\mathit\{aborted\}\.That is, a surviving \(committed, non\-aborted\) operation retains in its causal closure a predecessor that was aborted: its basis was retracted and it was not itself compensated\.

#### This is exactly the predicate the runtime prevents\.

[Definition˜3](https://arxiv.org/html/2606.17182#Thmanomaly3)is the trace\-level image oflib\_l2\_safety\.rs::a3\_witness\(a committed, non\-aborted transaction with an aborted transaction in its predecessor closure\)\. TheoremL2L\_\{2\}h establishes mechanically, with noassume/admit, that no reachableL2L\_\{2\}state satisfies it, andL2L\_\{2\}g exhibits one that does \(non\-vacuity\), so “L2L\_\{2\}preventsA3A\_\{3\}” is a statement about a single verified predicate, not an informal bridge\. This closes the gap a careful reviewer would raise: the earlier residue definition fired in benign serial executions, so the safety claim only ever held for the cascade condition now promoted to[Definition˜3](https://arxiv.org/html/2606.17182#Thmanomaly3)\. At the model levelAnomalies\.tladefinesCausalCascadeas this precise predicate \(retaining the residue separately\), and a TLC check confirms the redefinition discriminates:CausalCascadefires on the cascade witness and is silent on a benign serial history reading an un\-logged value, on which the retained residue still fires\.

#### Flat\-trace detector: a sound over\-approximation\.

A flat operation history that does not expose𝑝𝑟𝑒𝑑𝑠\\mathit\{preds\}and𝑎𝑏𝑜𝑟𝑡𝑒𝑑\\mathit\{aborted\}—for instance a black\-box trace captured from an un\-instrumented runtime—cannot decide[Definition˜3](https://arxiv.org/html/2606.17182#Thmanomaly3)directly\. For that setting we retain the*residue*predicate

𝑅𝑒𝑠𝑖𝑑𝑢𝑒\(h\)≡∃j,∃c∈h\[j\]\.𝑟𝑒𝑎𝑑\_𝑠𝑒𝑡:v=h\[j\]\.𝑟𝑒𝑎𝑑\_𝑣𝑎𝑙𝑢𝑒𝑠\[c\]≠NULL∧∀k≠j:¬\(c∈h\[k\]\.𝑤𝑟𝑖𝑡𝑒\_𝑠𝑒𝑡∧h\[k\]\.𝑤𝑟𝑖𝑡𝑒\_𝑡𝑖𝑚𝑒≤h\[j\]\.𝑟𝑒𝑎𝑑\_𝑡𝑖𝑚𝑒∧h\[k\]\.𝑤𝑟𝑖𝑡𝑒\_𝑣𝑎𝑙𝑢𝑒𝑠\[c\]=v\),\\mathit\{Residue\}\(h\)\\;\\equiv\\;\\exists\\,j,\\;\\exists\\,c\\in h\[j\]\.\\mathit\{read\\\_set\}:\\;\\\\ v=h\[j\]\.\\mathit\{read\\\_values\}\[c\]\\neq\\mathrm\{NULL\}\\;\\wedge\\;\\forall k\\neq j:\\\\ \\neg\\bigl\(c\\in h\[k\]\.\\mathit\{write\\\_set\}\\;\\wedge\\;h\[k\]\.\\mathit\{write\\\_time\}\\leq h\[j\]\.\\mathit\{read\\\_time\}\\;\\wedge\\\\ h\[k\]\.\\mathit\{write\\\_values\}\[c\]=v\\bigr\),no committed write producedvvforccat or beforejj’s read\. Every genuine cascade that leaves a surviving committed reader produces a residue witness—the aborted writer’s value is no longer produced by any surviving committed write—soA3⇒𝑅𝑒𝑠𝑖𝑑𝑢𝑒A\_\{3\}\\Rightarrow\\mathit\{Residue\}on the projected flat history: the residue is a*sound*detector for the cascade\. It is not complete in the other direction \(it additionally fires on reads of un\-logged initial values\), which is the same conservative direction adopted by black\-box serializability checkers such as Cobra\[[59](https://arxiv.org/html/2606.17182#bib.bib59)\]and Elle\[[60](https://arxiv.org/html/2606.17182#bib.bib60)\]: a reported witness is a candidate the trace cannot rule out\. A deployment that logs cell initialization as a write collapses the over\-approximation to the cascade footprint exactly\.

#### Witnesses\.

For[Definition˜3](https://arxiv.org/html/2606.17182#Thmanomaly3)the minimal witness is a two\-operation history: operationppcommits a write ofc1=v1c\_\{1\}=v\_\{1\}and is subsequently aborted; operationjjreadsc1=v1c\_\{1\}=v\_\{1\}withp∈h\[j\]\.𝑝𝑟𝑒𝑑𝑠p\\in h\[j\]\.\\mathit\{preds\}and commits without being compensated, so¬h\[j\]\.𝑎𝑏𝑜𝑟𝑡𝑒𝑑∧p∈h\[j\]\.𝑝𝑟𝑒𝑑𝑠∧h\[p\]\.𝑎𝑏𝑜𝑟𝑡𝑒𝑑\\neg h\[j\]\.\\mathit\{aborted\}\\wedge p\\in h\[j\]\.\\mathit\{preds\}\\wedge h\[p\]\.\\mathit\{aborted\}holds\. For the flat\-trace detector the witness is the residue history: a committed read ofc1=v1c\_\{1\}=v\_\{1\}with no committed write ofv1v\_\{1\}toc1c\_\{1\}at or before the read\.

#### Verified detector\.

The Verus𝑑𝑒𝑡𝑒𝑐𝑡\_a3\\mathit\{detect\\\_a3\}\([Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)\) is verified sound*and*complete against𝑅𝑒𝑠𝑖𝑑𝑢𝑒\\mathit\{Residue\}—the exact characteristic function of the flat\-trace over\-approximation\. The precise predicate of[Definition˜3](https://arxiv.org/html/2606.17182#Thmanomaly3)is decided at the runtime\-state level bya3\_witness\(absence on every reachable state is TheoremL2L\_\{2\}h\); a trace\-level procedure over provenance\-annotated histories is a routine extension we do not separately mechanize, since the runtime\-level theorem already discharges the guarantee\.

#### Distinction from classical models\.

Saga compensation\[[31](https://arxiv.org/html/2606.17182#bib.bib31)\]and the cascading\-abort literature in database recovery address exactly this dependence: when a transaction aborts, its dependents must be aborted or compensated in turn\. The agent\-runtime difference is irreversibility—external tool effects a dependent has already issued cannot be undone—so the cascade must be*prevented*, by aborting dependents before they externalize, rather than repaired afterward\.A3A\_\{3\}names exactly the surviving\-dependent\-of\-an\-aborted\-operation condition that an unprevented cascade produces\.

### 3\.4A4: Split\-View \(outside the single\-store model\)

We identify the*split\-view*anomaly: under replicated memory, two agents may simultaneously read divergent values of the same cell\. FormalizingA4A\_\{4\}*within the operational model of[Section˜2\.1](https://arxiv.org/html/2606.17182#S2.SS1)*would require extending it with replica identifiers, replica synchronization transitions, and a notion of visible\-replica\-set per operation; in the present single\-store model,A4A\_\{4\}is vacuous, and the lattice contains no point distinguished solely by its prevention\.A4A\_\{4\}is instead formalized separately, at the Verus model level, in[Section˜4\.12](https://arxiv.org/html/2606.17182#S4.SS12): a monotone\-primary no\-split theorem proved from an append\-only invariant, with a constructive secondary\-lag witness for the split\-view it excludes\.

### 3\.5A6: Tool\-Effect Reordering

#### Operational extension\.

A6A\_\{6\}is the only formalized anomaly that requires extending the operational model of[Section˜2\.1](https://arxiv.org/html/2606.17182#S2.SS1)\. Each operation record carries two additional fields: the*issuance order*𝑖𝑜∈Seq\(𝐶𝑒𝑙𝑙𝑠×𝑉𝑎𝑙𝑢𝑒𝑠\)\\mathit\{io\}\\in\\mathrm\{Seq\}\(\\mathit\{Cells\}\\times\\mathit\{Values\}\), the sequence of writes in the order the agent intended to issue them; and the*commit order*𝑐𝑜∈Seq\(𝐶𝑒𝑙𝑙𝑠×𝑉𝑎𝑙𝑢𝑒𝑠\)\\mathit\{co\}\\in\\mathrm\{Seq\}\(\\mathit\{Cells\}\\times\\mathit\{Values\}\), the sequence in which the runtime actually externalizes them\. The set𝑤𝑟𝑖𝑡𝑒\_𝑠𝑒𝑡\\mathit\{write\\\_set\}and the value map𝑤𝑟𝑖𝑡𝑒\_𝑣𝑎𝑙𝑢𝑒𝑠\\mathit\{write\\\_values\}used in earlier definitions are derived from𝑐𝑜\\mathit\{co\}\. \(We retain the numberingA6A\_\{6\}from earlier drafts of this work for continuity with the supplementary artifact; the catalog contains four formalized anomalies, not six\.\)

#### Definition\.

###### Definition 4\(Tool\-effect reordering\)\.

A historyhhexhibits the*tool\-effect reordering anomaly*when there existsi∈\{1,…,\|h\|\}i\\in\\\{1,\\ldots,\|h\|\\\}such that

\|h\[i\]\.𝑖𝑜\|≥2,h\[i\]\.𝑐𝑜≠h\[i\]\.𝑖𝑜,h\[i\]\.𝑐𝑜is a permutation ofh\[i\]\.𝑖𝑜\.\\begin\{gathered\}\|h\[i\]\.\\mathit\{io\}\|\\geq 2,\\quad h\[i\]\.\\mathit\{co\}\\neq h\[i\]\.\\mathit\{io\},\\\\ h\[i\]\.\\mathit\{co\}\\text\{ is a permutation of \}h\[i\]\.\\mathit\{io\}\.\\end\{gathered\}

#### Description\.

An operation issues two or more writes in a specific intended order, but the runtime externalizes them in a different order\. This is the agent\-runtime analogue of non\-atomic commit in distributed transactions, with the additional point that effects on external cells are typically irreversible: once externalized in the wrong order, the inconsistency is not undoable by compensation alone\.

#### Witness\.

TLC produces a three\-state counter\-example at\|A\|=1\|A\|=1,\|𝐶𝑒𝑙𝑙𝑠\|=2\|\\mathit\{Cells\}\|=2,\|𝑉𝑎𝑙𝑢𝑒𝑠\|=1\|\\mathit\{Values\}\|=1,𝑀𝑎𝑥𝑂𝑝𝑠=1\\mathit\{MaxOps\}=1\. Agenta1a\_\{1\}issues writes in order𝑖𝑜=⟨\(c1,v1\),\(c2,v1\)⟩\\mathit\{io\}=\\langle\(c\_\{1\},v\_\{1\}\),\(c\_\{2\},v\_\{1\}\)\\rangleand the runtime non\-deterministically commits them in the reversed order𝑐𝑜=⟨\(c2,v1\),\(c1,v1\)⟩\\mathit\{co\}=\\langle\(c\_\{2\},v\_\{1\}\),\(c\_\{1\},v\_\{1\}\)\\rangle\. Both sequences contain the same multiset of writes, but𝑐𝑜≠𝑖𝑜\\mathit\{co\}\\neq\\mathit\{io\}and\|𝑖𝑜\|=2≥2\|\\mathit\{io\}\|=2\\geq 2, so the predicate fires\.

#### Distinction from classical models\.

Atomic commit in databases via two\-phase commit, write\-buffering under release consistency, and similar mechanisms address the analogous phenomenon in their respective settings\. The agent\-runtime distinction is irreversibility: classical models can compensate by undoing externalized writes, whereas tool calls to external services \(payment\-processor calls, outbound emails, irreversible state changes\) cannot\. Recent runtime work\[[2](https://arxiv.org/html/2606.17182#bib.bib2)\]addresses this operationally; we isolate it as a point in the lattice\.

## 4The Consistency Lattice

A consistency model is a predicate over operation histories\. We characterize a model by the subset of anomalies it excludes, in the spirit of Adya’s\[[8](https://arxiv.org/html/2606.17182#bib.bib8)\]subset\-of\-conflicts characterization of database isolation levels\. With four formalized anomalies𝒜=\{A1,A2,A3,A6\}\\mathcal\{A\}=\\\{A\_\{1\},A\_\{2\},A\_\{3\},A\_\{6\}\\\}, the design space is the Boolean latticeℒ=⟨2𝒜,⊆⟩\\mathcal\{L\}=\\langle 2^\{\\mathcal\{A\}\},\\subseteq\\rangle: each subsetS⊆𝒜S\\subseteq\\mathcal\{A\}corresponds to a model that excludes exactly the anomalies inSSand admits all others\. The lattice has\|ℒ\|=24=16\|\\mathcal\{L\}\|=2^\{4\}=16elements; its bottom⊥=∅\\bot=\\emptysetadmits all anomalies and its top⊤=𝒜\\top=\\mathcal\{A\}excludes them all\. The order is by set inclusion:S1≤S2⇔S1⊆S2S\_\{1\}\\leq S\_\{2\}\\iff S\_\{1\}\\subseteq S\_\{2\}, equivalently “S2S\_\{2\}’s admissible histories are a subset ofS1S\_\{1\}’s\.”

### 4\.1Five distinguished points: theLnL\_\{n\}chain

We name five points alongℒ\\mathcal\{L\}:

L0\\displaystyle L\_\{0\}≡∅\\displaystyle\\equiv\\emptyset\(bottom; admits everything\)L1\\displaystyle L\_\{1\}≡\{A1\}\\displaystyle\\equiv\\\{A\_\{1\}\\\}L2\\displaystyle L\_\{2\}≡\{A1,A3\}\\displaystyle\\equiv\\\{A\_\{1\},A\_\{3\}\\\}L3\\displaystyle L\_\{3\}≡\{A1,A3,A6\}\\displaystyle\\equiv\\\{A\_\{1\},A\_\{3\},A\_\{6\}\\\}L4\\displaystyle L\_\{4\}≡\{A1,A2,A3,A6\}\\displaystyle\\equiv\\\{A\_\{1\},A\_\{2\},A\_\{3\},A\_\{6\}\\\}\(top; admits nothing\)
The five form a chainL0⊊L1⊊L2⊊L3⊊L4L\_\{0\}\\subsetneq L\_\{1\}\\subsetneq L\_\{2\}\\subsetneq L\_\{3\}\\subsetneq L\_\{4\}inℒ\\mathcal\{L\}\. As predicates over operation histories:

L0\(h\)\\displaystyle L\_\{0\}\(h\)≡𝑇𝑅𝑈𝐸\\displaystyle\\equiv\\mathit\{TRUE\}L1\(h\)\\displaystyle L\_\{1\}\(h\)≡¬𝑆𝑡𝑎𝑙𝑒𝐺𝑒𝑛𝑒𝑟𝑎𝑡𝑖𝑜𝑛\(h\)\\displaystyle\\equiv\\neg\\mathit\{StaleGeneration\}\(h\)L2\(h\)\\displaystyle L\_\{2\}\(h\)≡L1\(h\)∧¬𝐶𝑎𝑢𝑠𝑎𝑙𝐶𝑎𝑠𝑐𝑎𝑑𝑒\(h\)\\displaystyle\\equiv L\_\{1\}\(h\)\\wedge\\neg\\mathit\{CausalCascade\}\(h\)L3\(h\)\\displaystyle L\_\{3\}\(h\)≡L2\(h\)∧¬𝑇𝑜𝑜𝑙𝐸𝑓𝑓𝑒𝑐𝑡𝑅𝑒𝑜𝑟𝑑𝑒𝑟𝑖𝑛𝑔\(h\)\\displaystyle\\equiv L\_\{2\}\(h\)\\wedge\\neg\\mathit\{ToolEffectReordering\}\(h\)L4\(h\)\\displaystyle L\_\{4\}\(h\)≡L3\(h\)∧¬𝑃ℎ𝑎𝑛𝑡𝑜𝑚𝑇𝑜𝑜𝑙\(h\)\\displaystyle\\equiv L\_\{3\}\(h\)\\wedge\\neg\\mathit\{PhantomTool\}\(h\)
The TLA\+ form, as a practitioner working with the artifact would encounter it, is shown inLABEL:lst:level\-tla\.

Listing 1:The five named lattice points in TLA\+, as inLevels\.tla\.L0\(h\)==TRUE

L1\(h\)==~StaleGeneration\(h\)

L2\(h\)==L1\(h\)/\\~CausalCascade\(h\)

L3\(h\)==L2\(h\)/\\~ToolEffectReordering\(h\)

L4\(h\)==L3\(h\)/\\~PhantomTool\(h\)

### 4\.2One linearization among twenty\-four

The chainL0→L1→L2→L3→L4L\_\{0\}\\to L\_\{1\}\\to L\_\{2\}\\to L\_\{3\}\\to L\_\{4\}is a maximal chain inℒ\\mathcal\{L\}, equivalently a linear extension of the Boolean lattice to a total order on five points\. There are4\!=244\!=24such maximal chains, all visiting the bottom, exactly one node per layer of fixed cardinality, and the top\. They are mathematically equivalent: each is a valid linear extension of⟨2𝒜,⊆⟩\\langle 2^\{\\mathcal\{A\}\},\\subseteq\\rangle\.

We chose this particular chain on operational grounds, not because it is canonical\. The orderA1→A3→A6→A2A\_\{1\}\\to A\_\{3\}\\to A\_\{6\}\\to A\_\{2\}in which anomalies are added reflects operational grounds\.A1A\_\{1\}is the staleness failure of the basic read–generate–write window and the weakest non\-trivial guarantee to establish first; we claim no frequency primacy for it—the corpus measurement finds it architecturally confined \([Section˜5\.8](https://arxiv.org/html/2606.17182#S5.SS8)\), the demonstrably in\-the\-wild member of its family is theL0L\_\{0\}lost update \([Section˜6\.5](https://arxiv.org/html/2606.17182#S6.SS5)\), and bounded\-staleness refinements in the sense of Bailis et al\.\[[25](https://arxiv.org/html/2606.17182#bib.bib25)\]would yield intermediate sublevels we do not formalize\.A3A\_\{3\}couples most directly withA1A\_\{1\}\(read–write coupling on the value memory\);A6A\_\{6\}completes the value\-memory fragment \(write–write atomicity within one operation\);A2A\_\{2\}sits last because registry stability across the operation lifetime is the costliest single requirement\. A reader with different priorities may prefer a chain admittingA2A\_\{2\}orA6A\_\{6\}first; all2424are equally valid linear extensions, and the choice reflects expository convenience and the pilot’s design, not mathematical primacy\.

The eleven unnamed lattice points \(the points not on the chosen chain — see[Figure˜1](https://arxiv.org/html/2606.17182#S4.F1)\) correspond to model variants that combine anomalies in ways our operational pilot does not directly target\. They are present inℒ\\mathcal\{L\}, accessible to the same verification machinery, and addressable in future work along alternative chains\.

### 4\.3What is and is not trivial: the realizability frontier

We have been explicit that the lattice⟨2𝒜,⊆⟩\\langle 2^\{\\mathcal\{A\}\},\\subseteq\\rangleis mathematically trivial as an abstract structure: it is the Boolean algebra on four generators, and naming it a contribution would be vacuous\. A reviewer is right to press on this\. The substantive content is not the lattice but the*realizability frontier*laid over it: the question of which of the sixteen points are inhabited by a genuine anomaly, which are achievable by a concrete runtime mechanism, and which adjacent points are separated by a*mechanically\-verified*prevention theorem\. None of these three questions is settled by the Boolean structure; each requires either a witness trace or a proof\.

What holds without qualification\.Because the results below each carry an explicit scope caveat, we state up front the core that carries none, so the caveats read as precision rather than retreat\. The following are unconditional: the four anomaly detectors are mechanically verified*sound and complete*against their TLA\+predicates \(24 obligations, noassume, no added axiom;[Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)\); the SSI runtime is verified¬A1\\neg A\_\{1\}unconditionally \([Section˜6\.6](https://arxiv.org/html/2606.17182#S6.SS6)\); the TLAPS chain\-coherence check and theA1A\_\{1\}generation lower bound are machine\-checked \([Section˜4\.6](https://arxiv.org/html/2606.17182#S4.SS6)\); and theL2L\_\{2\}exec\-mode refinement adds zero author\-introduced axioms to its trust base \([Section˜5\.11](https://arxiv.org/html/2606.17182#S5.SS11)\)\. The conditional, sequential, or model\-level qualifications attached to individual results below restrict*those*results; they do not weaken this core\.

Three observations make the distinction concrete\.

First,*inhabitation is not free*\. Each of the four generators is inhabited by an explicit TLC counter\-example \([Section˜3](https://arxiv.org/html/2606.17182#S3)\); establishing that the four anomalies are pairwise distinct as operational phenomena—rather than re\-descriptions of one underlying race—is an empirical claim about the operational model, discharged by exhibiting four structurally different witnesses, not by the algebra\.

Second,*achievability is not free*\. Not every point in2𝒜2^\{\\mathcal\{A\}\}corresponds to a runtime that can be built without paying for adjacent guarantees\. The chosen chain is exactly the sequence of points for which we can name a concrete, implementable mechanism \(pessimistic locking, serializable snapshot isolation with causal tracking, saga compensation\) that achieves the avoidance set and no more\. The mechanism\-to\-avoidance\-set map, not the lattice, is the design content\. We realize and separate these five points only; the remaining eleven subsets of𝒜\\mathcal\{A\}we neither build nor claim, so the frontier we verify is a single maximal chain through the cube, not a map of all sixteen vertices\.

Third, and most importantly,*separation is mechanically verified along the entire chain*\. The strict refinement

L0⊊L1⊊L2⊊L3⊊L4L\_\{0\}\\subsetneq L\_\{1\}\\subsetneq L\_\{2\}\\subsetneq L\_\{3\}\\subsetneq L\_\{4\}is no longer a definitional inclusion:L1L\_\{1\}\([Section˜4\.8](https://arxiv.org/html/2606.17182#S4.SS8)\),L2L\_\{2\}\([Section˜4\.11](https://arxiv.org/html/2606.17182#S4.SS11)\),L3L\_\{3\}\([Section˜4\.15](https://arxiv.org/html/2606.17182#S4.SS15)\), andL4L\_\{4\}\([Section˜4\.16](https://arxiv.org/html/2606.17182#S4.SS16)\) are each backed by a Verus safety theorem showing that a concrete runtime achieves the avoidance set, and each step strictly adds a prevented anomaly with an explicit witness in the lower level that the upper level excludes\. A trivial Boolean lattice does not come with four mechanically\-verified realizing runtime models—deployed and refined to executable Rust atL0L\_\{0\}andL1L\_\{1\}, and exec\-mode\-verified with measured prevention twins atL2L\_\{2\},L3L\_\{3\}, andL4L\_\{4\}—and a proof that each strictly dominates the last; that artifact is the contribution, and it is what a reader should evaluate\. The lattice supplies the coordinates; the single mechanically verified chain through them is the contribution\.

We locate the contribution precisely, in two grades of realization\.L0L\_\{0\}andL1L\_\{1\}are realized by*deployed*Rust runtimes \(pessimistic, SSI, default\-SI\), Verus\-verified to blockA1A\_\{1\}\([Section˜6\.6](https://arxiv.org/html/2606.17182#S6.SS6)\), refined to their state machines \([Section˜4\.8](https://arxiv.org/html/2606.17182#S4.SS8)\), and exercised in the pilot \([Section˜5](https://arxiv.org/html/2606.17182#S5)\)\.L2L\_\{2\}–L4L\_\{4\}are realized by Verus runtime models \([Sections˜4\.11](https://arxiv.org/html/2606.17182#S4.SS11),[4\.15](https://arxiv.org/html/2606.17182#S4.SS15)and[4\.16](https://arxiv.org/html/2606.17182#S4.SS16)\) and additionally by exec\-mode artifacts with dependency\-free twins whose prevention is measured \(0/10000/1000vs\.1000/10001000/1000\)\. The frontier is thus verified and executable along the whole chain; the grades of*live*exercise differ\. The full three\-strategy pilot drivesL0L\_\{0\}/L1L\_\{1\}from a live agent loop \([Section˜5](https://arxiv.org/html/2606.17182#S5)\); theL2L\_\{2\}causal\-tracking discipline is additionally run live, by replaying real multi\-model sessions through the verified runtime \([Section˜5\.11](https://arxiv.org/html/2606.17182#S5.SS11):A3A\_\{3\}prevented in all120120retracted sessions,0/1200/120\); andL3L\_\{3\}/L4L\_\{4\}remain exec\-verified and twin\-measured under adversarial schedules, not yet run under live agents \([Table˜I](https://arxiv.org/html/2606.17182#S1.T1)\)\. Driving theL3L\_\{3\}/L4L\_\{4\}runtimes from a live agent loop is the remaining engineering follow\-up\.

### 4\.4Visualization and the lattice table

[Figure˜1](https://arxiv.org/html/2606.17182#S4.F1)displaysℒ\\mathcal\{L\}as a Hasse diagram, with the chosenLnL\_\{n\}chain highlighted in bold\.[Table˜II](https://arxiv.org/html/2606.17182#S4.T2)tabulates which anomalies are admitted \(✓\\checkmark\) and prevented \(×\\times\) at each named lattice point on the chosen chain\.

L4=\{1,2,3,6\}L\_\{4\}\{=\}\\\{1,2,3,6\\\}\{2,3,6\}L3=\{1,3,6\}L\_\{3\}\{=\}\\\{1,3,6\\\}\{1,2,6\}\{1,2,3\}\{3,6\}\{2,6\}\{2,3\}\{1,6\}L2=\{1,3\}L\_\{2\}\{=\}\\\{1,3\\\}\{1,2\}L1=\{1\}L\_\{1\}\{=\}\\\{1\\\}\{2\}\{3\}\{6\}L0=∅L\_\{0\}\{=\}\\emptysetFigure 1:Hasse diagram of the consistency latticeℒ=⟨2\{A1,A2,A3,A6\},⊆⟩\\mathcal\{L\}=\\langle 2^\{\\\{A\_\{1\},A\_\{2\},A\_\{3\},A\_\{6\}\\\}\},\\subseteq\\rangle\. Each node is the subset of anomalies excluded by that model, with set elements abbreviated by their anomaly indices \(e\.g\.,\{1,3\}\\\{1,3\\\}means\{A1,A3\}\\\{A\_\{1\},A\_\{3\}\\\}\)\. Of the sixteen points and thirty\-two cover edges, the chosenLnL\_\{n\}chain \(bold\) is one of4\!=244\!=24maximal chains\. The remaining eleven unnamed points correspond to model variants reachable via alternative linearizations\.TABLE II:Each named lattice point on the chosen chain admits \(✓\\checkmark\) or prevents \(×\\times\) each of the four formalized anomalies\. ADMITS column entries are verified by explicit TLC counter\-example traces; PREVENTS column entries follow from the definitions of[Section˜4\.1](https://arxiv.org/html/2606.17182#S4.SS1)and are mechanically verified by the TLAPS coherence check \([Section˜4\.6](https://arxiv.org/html/2606.17182#S4.SS6)\)\. The eleven unnamed lattice points are not tabulated; their admissions and preventions follow directly from their subset memberships\.LevelA1A\_\{1\}A2A\_\{2\}A3A\_\{3\}A6A\_\{6\}L0L\_\{0\}✓\\checkmark✓\\checkmark✓\\checkmark✓\\checkmarkL1L\_\{1\}×\\times✓\\checkmark✓\\checkmark✓\\checkmarkL2L\_\{2\}×\\times✓\\checkmark×\\times✓\\checkmarkL3L\_\{3\}×\\times✓\\checkmark×\\times×\\timesL4L\_\{4\}×\\times×\\times×\\times×\\times
### 4\.5The snapshot\-insufficiency observation

A reader familiar with database isolation theory may ask whyL1L\_\{1\}includes the explicit conjunct¬A1\\neg A\_\{1\}rather than being defined purely as a snapshot guarantee\. Define a structural alternative:

L1struct\(h\)≡∀i∈\{1,…,\|h\|\},∀c∈h\[i\]\.𝑟𝑒𝑎𝑑\_𝑠𝑒𝑡:h\[i\]\.𝑟𝑒𝑎𝑑\_𝑣𝑎𝑙𝑢𝑒𝑠\[c\]=𝐿𝑎𝑡𝑒𝑠𝑡𝑊𝑟𝑖𝑡𝑒𝐵𝑒𝑓𝑜𝑟𝑒\(h,c,h\[i\]\.𝑟𝑒𝑎𝑑\_𝑡𝑖𝑚𝑒\)\.\\begin\{gathered\}L\_\{1\}^\{\\text\{struct\}\}\(h\)\\equiv\\forall i\\in\\\{1,\\ldots,\|h\|\\\},\\ \\forall c\\in h\[i\]\.\\mathit\{read\\\_set\}:\\\\ h\[i\]\.\\mathit\{read\\\_values\}\[c\]=\\mathit\{LatestWriteBefore\}\(h,c,h\[i\]\.\\mathit\{read\\\_time\}\)\.\\end\{gathered\}This is the natural formalization of “read values reflect memory at the read time”—a generation snapshot\. We observe, by example, thatL1structL\_\{1\}^\{\\text\{struct\}\}does not excludeA1A\_\{1\}\.

###### Observation\(Snapshot insufficiency, TLC\-verified\)\.

There exists a historyhhwithL1struct\(h\)=𝑇𝑅𝑈𝐸L\_\{1\}^\{\\text\{struct\}\}\(h\)=\\mathit\{TRUE\}and𝑆𝑡𝑎𝑙𝑒𝐺𝑒𝑛𝑒𝑟𝑎𝑡𝑖𝑜𝑛\(h\)=𝑇𝑅𝑈𝐸\\mathit\{StaleGeneration\}\(h\)=\\mathit\{TRUE\}\.

TLC verifies this by violatingL1struct⇒¬𝑆𝑡𝑎𝑙𝑒𝐺𝑒𝑛𝑒𝑟𝑎𝑡𝑖𝑜𝑛L\_\{1\}^\{\\text\{struct\}\}\\Rightarrow\\neg\\mathit\{StaleGeneration\}at\|A\|=2\|A\|=2,\|𝐶𝑒𝑙𝑙𝑠\|=1\|\\mathit\{Cells\}\|=1,𝑀𝑎𝑥𝑂𝑝𝑠=3\\mathit\{MaxOps\}=3\(1,934 states, depth 5; the same 5\-state witness reproduces at\|A\|=3,𝑀𝑎𝑥𝑂𝑝𝑠∈\{6,9\}\|A\|=3,\\mathit\{MaxOps\}\\in\\\{6,9\\\}\)\. In the witness \(MC\_A1\_struct\_witness\.txt\),a1a\_\{1\}writesc1=v1c\_\{1\}=v\_\{1\}at time 1 whilea2a\_\{2\}, having readc1=NULLc\_\{1\}=\\mathrm\{NULL\}at time 0, commits at time 2: the snapshot condition holds ata2a\_\{2\}’s read time \(𝐿𝑎𝑡𝑒𝑠𝑡𝑊𝑟𝑖𝑡𝑒𝐵𝑒𝑓𝑜𝑟𝑒\(c1,0\)=NULL\\mathit\{LatestWriteBefore\}\(c\_\{1\},0\)=\\mathrm\{NULL\}\) yet[Definition˜1](https://arxiv.org/html/2606.17182#Thmanomaly1)fires \(0<1<20<1<2,NULL≠v1\\mathrm\{NULL\}\\neq v\_\{1\}\)\.

This is analogous to write\-skew under snapshot isolation\[[7](https://arxiv.org/html/2606.17182#bib.bib7),[22](https://arxiv.org/html/2606.17182#bib.bib22),[21](https://arxiv.org/html/2606.17182#bib.bib21)\]in the loose sense that each read is individually snapshot\-consistent yet the joint behavior is anomalous; it is simpler than write\-skew proper \(single\-sided staleness, not mutual constraint violation\), and ourL1L\_\{1\}rejects it by adding read\-set stability explicitly\. Whether a broader class of structural definitions is insufficient by a class\-level argument is open\.

### 4\.6TLAPS consistency check on chain definitions

We mechanically check the chain definitions in TLAPS \(Hierarchy\.tla\): eleven theorems—four adjacent containmentsLi\+1\(h\)⇒Li\(h\)L\_\{i\+1\}\(h\)\\Rightarrow L\_\{i\}\(h\), six transitive soundness, one aggregate—decompose into 15 obligations, all discharged by definitional unfolding\. This is a shallow coherence check: it confirms the conjunction structure yields the claimed implications, not correctness against a deeper semantic target; the substantive mechanization is the Verus chain of[Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)\. We additionally prove a non\-definitional generation\-phase lower bound forA1A\_\{1\}\(A1LowerBound\.tla, 28 obligations, noassume/omitted\): every stale\-generation\-free history must, at each operation, either hold its read set stable through commit or read only values no intervening operation revised \(𝑅𝑒𝑎𝑑𝑆𝑒𝑡𝐿𝑜𝑐𝑘\(h,i\)∨𝑉𝑎𝑙𝑢𝑒𝐴𝑔𝑟𝑒𝑒𝑚𝑒𝑛𝑡\(h,i\)\\mathit\{ReadSetLock\}\(h,i\)\\vee\\mathit\{ValueAgreement\}\(h,i\)\), under two trace\-lifted invariants—single\-operation\-in\-flight and read\-before\-write monotonicity, both load\-bearing\. Unlike the coherence check, it constrains every stale\-generation\-free execution, formalizing the read\-set\-stability claim of[Section˜4\.5](https://arxiv.org/html/2606.17182#S4.SS5)\. The full theorem statements and proof structure are in the online appendix \(§ F\)\.

### 4\.7Spec–implementation equivalence in Verus

The TLAPS proofs of[Section˜4\.6](https://arxiv.org/html/2606.17182#S4.SS6)establish, by definitional unfolding, that the predicates ofAnomalies\.tlaare well\-typed and that the lattice points on the chosen chain stand in the expected refinement order\. We emphasize that this is a textual\-consistency check, not a correctness argument; the substantive mechanization in this paper is the Verus chain we describe next\. To strengthen the link between the formal predicates and the executable detectors used in the empirical pilot of[Section˜5](https://arxiv.org/html/2606.17182#S5), we additionally verified, in Verus\[[23](https://arxiv.org/html/2606.17182#bib.bib23)\], that the Rust anomaly detectors implement their TLA\+specifications\.

The chain discharges twenty\-four verification obligations, all withoutassumeand without adding axioms\.

Helpers \(sound and complete\)\.The primitive predicates and lookup operations are verified as exec–spec equivalences:

- •reads\(op, c\)returnstrueiff∃k\. 0≤k<\|𝑜𝑝\.𝑟𝑒𝑎𝑑\_𝑠𝑒𝑡\|∧𝑜𝑝\.𝑟𝑒𝑎𝑑\_𝑠𝑒𝑡\[k\]=c\\exists k\.\\ 0\\leq k<\|\\mathit\{op\.read\\\_set\}\|\\land\\mathit\{op\.read\\\_set\}\[k\]=c;
- •writes\(op, c\)satisfies the analogous equivalence over the write set;
- •contains\_tool\(v, t\)satisfies the analogous equivalence over a tool sequence;
- •first\_value\_exec\(s, c\)agrees with a deterministic first\-match spec function𝑓𝑖𝑟𝑠𝑡\_𝑣𝑎𝑙𝑢𝑒\\mathit\{first\\\_value\}overSeq\(𝐶𝑒𝑙𝑙𝐼𝑑×𝑉𝑎𝑙𝑢𝑒\)\\mathrm\{Seq\}\(\\mathit\{CellId\}\\times\\mathit\{Value\}\), supported by a uniqueness lemma showing that the first\-match index is unique\.

The first\-match uniqueness lemma is required to make the spec\-level value lookup well\-defined under Verus’schoosesemantics: without it, the existential witness used to define𝑓𝑖𝑟𝑠𝑡\_𝑣𝑎𝑙𝑢𝑒\\mathit\{first\\\_value\}could be any matching index, breaking exec–spec agreement\.

Detector theorems\.The four detectors are verified sound*and*complete against the predicates of[Section˜3](https://arxiv.org/html/2606.17182#S3)\.detect\_a1satisfies a two\-sided equivalence with the fullA1A\_\{1\}predicate including the value\-mismatch conjunct; soundness needs loop\-invariant reasoning over a triply\-nested existential, and completeness strengthens each of the three\(i,j,k\)\(i,j,k\)loop invariants to carry “no witness in the prefix searched,” closing with a case\-split on whetherccis in the read set\.detect\_a2anddetect\_a6are the analogous \(and, forA6A\_\{6\}, simplest—single\-record\) equivalences\.detect\_a3is sound and complete against the cascade*residue*of[Definition˜3](https://arxiv.org/html/2606.17182#Thmanomaly3)\(the exact characteristic function on the residue; the sound over\-approximation of[Section˜3\.3](https://arxiv.org/html/2606.17182#S3.SS3)against the cascade proper\)\. A smoke\-test proof exhibits a two\-recordA1A\_\{1\}\-satisfying history, so the spec is satisfiable\.

The Verus chain establishes that the executable detectors used in the empirical pilot are mechanically equivalent to their formal specifications\. This is a substantively stronger guarantee than the TLAPS coherence check of[Section˜4\.6](https://arxiv.org/html/2606.17182#S4.SS6): the soundness obligations require non\-trivial reasoning over nested quantifiers and cannot be closed by definition unfolding alone, and the completeness obligations require strengthening loop invariants to range over the prefix already searched\.

### 4\.8Spec–runtime refinement in Verus

The detector chain of[Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)and the safety theorems of[Section˜6\.6](https://arxiv.org/html/2606.17182#S6.SS6)establish two mechanically\-verified properties: that the executable detectors realize their formal predicates, and that each runtime strategy’s abstract state machine satisfies¬A1\\neg A\_\{1\}\. Neither, however, directly establishes that the*deployed*Rust runtime inmac\-consistency\-runtimecorresponds to those abstract state machines\. We close that gap via four Verus refinement proofs covering pessimistic\-locking, SSI under a\(𝑠𝑡𝑜𝑟𝑒,𝑙𝑎𝑠𝑡\_𝑤𝑟𝑖𝑡𝑒\)\(\\mathit\{store\},\\mathit\{last\\\_write\}\)projection, SSI under the literal version\-chain representation, and default\-SI\.

Refinement structure\.For each strategy we define a concrete\-side state record mirroring the relevant fields of the deployed RustInnerstruct \(data,locks,agent\_holds,clock,tracefor pessimistic;store,last\_write,clock,tracefor SSI\), an abstraction functionα\\alphamapping concrete states to instances of the abstract state machine of[Section˜6\.6](https://arxiv.org/html/2606.17182#S6.SS6), and concrete\-side transitions corresponding to each abstract transition\. Each refinement lemma has the shape

concrete\_step\(c,c′\)⟹abstract\_step\(α\(c\),α\(c′\)\),\\mathrm\{concrete\\\_step\}\(c,c^\{\\prime\}\)\\;\\Longrightarrow\\;\\mathrm\{abstract\\\_step\}\(\\alpha\(c\),\\alpha\(c^\{\\prime\}\)\),together with an initial\-state correspondenceα\(c0\)=a0\\alpha\(c\_\{0\}\)=a\_\{0\}\. Composed with the safety theorems of[Section˜6\.6](https://arxiv.org/html/2606.17182#S6.SS6), this establishes that every concrete execution maps to an abstract execution satisfying¬A1\\neg A\_\{1\}\.

The four refinements\.Each refinement discharges, for its strategy’s concrete transitions \(begin/commit and, for SSI/default\-SI, abort\), an initial\-state correspondence and the structural helpers—Map\-insert, snapshot\-build, trace\-push, and lock\-release commutativity—that lift each concrete step to its abstract image; the per\-helper detail is in the reproducible artifact\. The counts are 31 \(pessimistic\), 18 \(SSI projection\), 17 \(SSI version\-chain\), and 18 \(default\-SI\)\. Two points are load\-bearing rather than mechanical\. The version\-chain refinement \(lib\_refinement\_ssi\_chain\.rs\) targets the literal deployedBTreeMap<CellId, Vec<\(Time,Value\)\>\>and takes chain\-monotonicity \(every chain non\-empty with strictly increasing times bounded by𝑐𝑙𝑜𝑐𝑘\\mathit\{clock\}\) as a precondition; we discharge its preservation across all reachable states \(lib\_si\_commit\_invariant\.rs: 4 verified, noassume/admit\), so the invariant holds by induction rather than by assumption\. The default\-SI refinement \(lib\_refinement\_default\_si\.rs\) covers the bypass branch \(empty write\-set, validation skipped\) unconditionally—the refinement itself imposes no workload restriction; only the safety theorem does—and the new technical piece is that an empty concrete write\-set maps to an empty abstract one under the𝑆𝑒𝑡::𝑚𝑎𝑝\\mathit\{Set::map\}abstraction\.In\-the\-loop verification of the SI validation gate\.For the snapshot\-isolation validation decision we go beyond refinement to direct execution\-level verification: we lift the deployedSnapshotIsolationStore::commitread\-set check into a Verusexecfunction, prove it sound and complete against the freshness predicate \(validate: 5 verified, noassume/admit/external\_body\), and the deployed commit now invokes that exact function \(confirmed by the unchanged test suite\)\. The procedure that selects commit versus abort at run time is therefore the proven function itself; the unverified remainder is only the marshalling that flattens the version map into the gate’s input and theparking\_lot::Mutexcritical section \([Section˜4\.14](https://arxiv.org/html/2606.17182#S4.SS14)\)\. TheA1A\_\{1\}detector is verified in the sameexecstyle \(9 verified\)\. For these two procedures the spec–runtime residual is reduced from a refinement argument to a marshalling correspondence: the executable code that runs is the verified code\.

Pessimistic acquisition: gate and exclusivity invariant\.The pessimistic enforcer is verified to the same depth: its begin\-time acquisition check is anexecfunction proved sound and complete against the no\-foreign\-holder predicate \(can\_acquire: 5 verified\), and we prove that acquisition and release preserve consistency of the two holder views \(lib\_pessimistic\_invariant\.rs: 3 verified, noassume/admit\), so each cell has at most one holder in every reachable state—the property that excludes concurrent same\-cell operation and henceA1A\_\{1\}at this layer\. Both deployed enforcers thus carry an exec\-verified decision gate and a mechanically\-proved state\-machine invariant, not only a refinement to an abstract model\.

Trust base: two foundational axioms\.The four refinement files share a trust base of two axioms, disclosed here in full:

- •axiom\_string\_to\_int\_injective:∀s1,s2\.𝑠𝑡𝑟𝑖𝑛𝑔\_𝑡𝑜\_𝑖𝑛𝑡\(s1\)=𝑠𝑡𝑟𝑖𝑛𝑔\_𝑡𝑜\_𝑖𝑛𝑡\(s2\)⇒s1=s2\\forall s\_\{1\},s\_\{2\}\.\\;\\mathit\{string\\\_to\\\_int\}\(s\_\{1\}\)=\\mathit\{string\\\_to\\\_int\}\(s\_\{2\}\)\\Rightarrow s\_\{1\}=s\_\{2\}\. This formalizes that the abstraction function for string identifiers \(cell IDs, agent IDs, value strings\) is injective\. In the deployed runtime, identifiers are arbitraryStrings and the corresponding abstract type isℤ\\mathbb\{Z\}; injectivity of any well\-defined encoding is required for the abstraction to commute with Map\-insertion\. This is a structural assumption about the alphabet, not a domain claim\.
- •axiom\_null\_sentinel:𝑣𝑎𝑙𝑢𝑒\_𝑎𝑙𝑝ℎ𝑎\(⟨’N’,’U’,’L’,’L’⟩\)=0\\mathit\{value\\\_alpha\}\(\\langle\\text\{'N','U','L','L'\}\\rangle\)=0\. The concrete runtime uses the string"NULL"as a sentinel for absent values returned from reads on uninitialized cells; the abstract specification uses the integer0as the corresponding null\. The axiom asserts that these correspond under the abstraction function\. This is a one\-line constant\-folding assumption\.

Both are structural properties of the encoding alphabet, asserting nothing about the lattice, catalog, or strategies \(see the consolidated trust base of[Section˜4\.8](https://arxiv.org/html/2606.17182#S4.SS8)\)\. A natural candidate for a third axiom—𝑓𝑖𝑛𝑖𝑡𝑒\(s\)⇒𝑓𝑖𝑛𝑖𝑡𝑒\(s\.map\(f\)\)\\mathit\{finite\}\(s\)\\Rightarrow\\mathit\{finite\}\(s\.\\mathrm\{map\}\(f\)\), used for the finite\-domain conjunct of the abstract commit step—is instead a proven Verus lemma \(lemma\_set\_map\_preserves\_finite, by induction on\|s\|\|s\|\), keeping it out of the trust base\.

Relating the projection and chain refinements\.The projection refinement \(latest version per cell\) and the chain refinement \(full𝑆𝑒𝑞\(𝑇𝑖𝑚𝑒,𝑉𝑎𝑙𝑢𝑒\)\\mathit\{Seq\}\(\\mathit\{Time\},\\mathit\{Value\}\)history\) refine to the same abstract SSI machine: the projection is the minimal abstraction sufficient for the safety theorem, the chain is the literal deployed Rust shape\. Distributing both makes that trade\-off explicit\.

Scope of refinement: sequential semantics\.All four refinements are sequential: the concrete\-state transition relations describe one logical operation at a time\. The Rust runtime enforces this granularity via aparking\_lot::Mutexheld across the read\-and\-lock phase ofbeginand across the write\-and\-tick phase ofcommit; mutex correctness lifts the sequential refinement to the multi\-threaded execution\. A refinement that treats internal concurrency \(multiple OS threads racing inside a single logical step\) would require a different proof methodology—linearizability or a relational/separation logic—and is identified as follow\-up in[Section˜6\.2](https://arxiv.org/html/2606.17182#S6.SS2)\.

Combined formal scorecard\.The Verus mechanization comprises \(all counts live\-confirmed byverus\_count\.sh; all zeroassume, zeroadmit\): 24 for the detector exec–spec equivalence chain \([Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)\); 40 across the three strategy safety theorems; 84 across the four refinement files \(under the two shared axioms\); 9 for the concurrent\-semantics lift \([Section˜4\.10](https://arxiv.org/html/2606.17182#S4.SS10)\); 22 for theL2L\_\{2\}safety theorem \([Section˜4\.11](https://arxiv.org/html/2606.17182#S4.SS11), with𝑝𝑟𝑒𝑑\_𝑐𝑙𝑜𝑠𝑒𝑑\\mathit\{pred\\\_closed\}proved inductive and TheoremsL2L\_\{2\}g/i/i\+\); 27 net for the exec\-modeL2L\_\{2\}runtime \([Section˜5\.11](https://arxiv.org/html/2606.17182#S5.SS11), on natively\-injectiveu64interning, no added axiom; its self\-contained file verifies at 49 end to end\); 9 for theA4A\_\{4\}formalization \([Section˜4\.12](https://arxiv.org/html/2606.17182#S4.SS12)\); 4 for the RustBelt interface \([Section˜4\.14](https://arxiv.org/html/2606.17182#S4.SS14), including panic\-freedom and poisoning fail\-safety, with the 3RUSTBELT\_OBLIGATIONstubs\); 6 for theL3L\_\{3\}saga theorem and 5 for the commit\-order sequencer \(TheoremL3L\_\{3\}g\); 5 for theL4L\_\{4\}registry\-snapshot theorem; 3 for theL2L\_\{2\}state\-to\-trace projection \(TheoremL2L\_\{2\}j\); 7 for the LangGraph refinement; and 8 for the OCCL1→L2L\_\{1\}\\to L\_\{2\}channel refinement \([Section˜6\.5](https://arxiv.org/html/2606.17182#S6.SS5)\); 5 for the joint\-composition root \(lib\_consistency\_lattice\.rs, beyond its three re\-included component modules,[Section˜4\.16](https://arxiv.org/html/2606.17182#S4.SS16)\); 7 for the exec\-modeL3L\_\{3\}sequencer \(lib\_l3\_exec\.rs\); and 9 for the exec\-modeL4L\_\{4\}snapshot \(lib\_l4\_exec\.rs,[Section˜5\.11](https://arxiv.org/html/2606.17182#S5.SS11)\)\.Total: 274 distinct verified obligations\(0 errors\), all unconditional under the two foundational axioms and three documented stubs; the stochastic case is left open at the specification level \([Section˜4\.13](https://arxiv.org/html/2606.17182#S4.SS13)\) and the artifact carries no probabilistic obligation or axiom\. Every named lattice pointL0L\_\{0\}–L4L\_\{4\}is backed by a mechanically\-verified realizing runtime model: deployed and refined to Rust atL0L\_\{0\}/L1L\_\{1\}, and exec\-mode\-verified with measured dependency\-free twins atL2L\_\{2\}/L3L\_\{3\}/L4L\_\{4\}\(A3A\_\{3\},A6A\_\{6\},A2A\_\{2\}:0/10000/1000vs\.1000/10001000/1000\)\. Counting every verified source file and subtracting module/textual re\-inclusion, the development verifies295 distinct obligations; the 21 between the 274 curated headline and 295 are independent invariant lemmas and exec runtimes \(lib\_detect\_a1\_exec\.rs9,lib\_pessimistic\_exec\.rs5,lib\_pessimistic\_invariant\.rs3,lib\_si\_commit\_invariant\.rs4\) that support the named results without being lattice points\.verus\_count\.shregenerates both figures \(274 curated; 295 with\-\-full\)\.

*Trusted computing base, stated in full\.*To forestall any “zero\-assume” over\-reading, we list the complete trust base of the safety\-bearing development in one place\. It comprises exactly: \(i\) the Verus type system and the backing SMT solver; \(ii\) thevstdstandard library, includinggroup\_hash\_axiomsfor theHashMap\-backed exec states; \(iii\) two enumerated foundational axioms \(string\-identifier injectivity and the null\-sentinel mapping,[Section˜4\.8](https://arxiv.org/html/2606.17182#S4.SS8)\), each a structural property of the encoding alphabet rather than a claim about the lattice; and \(iv\) threeRUSTBELT\_OBLIGATIONexternal\_bodystubs \([Section˜4\.14](https://arxiv.org/html/2606.17182#S4.SS14)\) relocating thestd::sync::Mutexcorrespondence to a future Iris/RustBelt proof\. “Zeroassume, zeroadmit” is a statement about*proof bodies*; it is not a claim of an empty trust base, and the four items above are the trust base in its entirety\. No safety\-bearing proof body invokes a writes\-as\-function\-of\-reads assumption \([Section˜2\.2](https://arxiv.org/html/2606.17182#S2.SS2)\), and the mechanized development carries no probabilistic obligation or axiom\.

### 4\.9Probabilistic refinement ofA1A\_\{1\}

The deterministicA1A\_\{1\}predicate of[Section˜3\.1](https://arxiv.org/html/2606.17182#S3.SS1)compares recorded values syntactically: a witness fires whenr1\.read\_values\[c\]≠r2\.write\_values\[c\]r\_\{1\}\.\\textit\{read\\\_values\}\[c\]\\neq r\_\{2\}\.\\textit\{write\\\_values\}\[c\]\. Under stochastic generation this string comparison and the operationally\-meaningful notion of*stale\-read disagreement*diverge—string equality forces operational equivalence \(an LLM cannot disagree with itself given identical inputs\), but string inequality does not entail operational disagreement, since a fresh re\-read might yield the same downstream output the stale read produced\. This is the dominant objection to the determinism assumption, and we consider it sound\. We treat it not as a closed gap but as a scope separation \([Section˜4\.13](https://arxiv.org/html/2606.17182#S4.SS13)\); this subsection records the partial refinement and is explicit about the one direction it does*not*establish\.

Construction\.We introduce an abstract spec function𝑑𝑖𝑠𝑎𝑔𝑟𝑒𝑒𝑚𝑒𝑛𝑡\_𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦:𝑃𝑟𝑜𝑚𝑝𝑡𝐼𝑑×𝑉𝑎𝑙𝑢𝑒×𝑉𝑎𝑙𝑢𝑒→ℕ\[0,100\]\\mathit\{disagreement\\\_probability\}:\\mathit\{PromptId\}\\times\\mathit\{Value\}\\times\\mathit\{Value\}\\to\\mathbb\{N\}\_\{\[0,100\]\}, the percentage probability that an agent in prompt\-contextpp, having observedvstalev\_\{\\textit\{stale\}\}for cellcc, would have produced a different downstream output had it observedvfreshv\_\{\\textit\{fresh\}\}\. It is left uninterpreted at the specification level and estimated operationally by re\-running the prompt withvfreshv\_\{\\textit\{fresh\}\}substitutedkktimes;[Section˜5\.9](https://arxiv.org/html/2606.17182#S5.SS9)carries out exactly this estimate \(controlled value\-staleness75%75\\%; deployed read\-before\-write staleness0–100%100\\%depending on whether the agent consumes the read\)\. The probabilisticA1A\_\{1\}predicate at thresholdθ\\thetaleaves the structural conjuncts unchanged and replaces the value\-mismatch conjunct with𝑑𝑖𝑠𝑎𝑔𝑟𝑒𝑒𝑚𝑒𝑛𝑡\_𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦\(p,vstale,vfresh\)\>θ\\mathit\{disagreement\\\_probability\}\(p,v\_\{\\textit\{stale\}\},v\_\{\\textit\{fresh\}\}\)\>\\theta\. This specification\-level construction—not mechanized in Verus—rests on three assumptions: \(a\)𝑑𝑖𝑠𝑎𝑔𝑟𝑒𝑒𝑚𝑒𝑛𝑡\_𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦∈\[0,100\]\\mathit\{disagreement\\\_probability\}\\in\[0,100\]; \(b\)𝑑𝑖𝑠𝑎𝑔𝑟𝑒𝑒𝑚𝑒𝑛𝑡\_𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦\(p,v,v\)=0\\mathit\{disagreement\\\_probability\}\(p,v,v\)=0for anyp,vp,v\(the equal\-values bridge, formalizing that string equality forces operational equivalence\); and \(c\) a Hoeffding\-style concentration bound for the empirical estimator atk≥100k\\geq 100with a 10\-percentage\-point error budget\. Assumptions \(a\)–\(b\) are trivially true; \(c\) is*assumed, not proved*, and[Section˜4\.13](https://arxiv.org/html/2606.17182#S4.SS13)states it as an open problem rather than discharging it\. We do not mechanize the construction because the cost is out of proportion to its content: the implication itself is elementary, but a Hoeffding\-style bound requires exponential\-moment reasoning that Verus, which has no real\-analysis development, cannot currently express; the genuinely open question is the validity of the disagreement\-probability premise, on which[Section˜5\.9](https://arxiv.org/html/2606.17182#S5.SS9)provides the measurements\.

What is and is not established, and the direction\-of\-soundness caveat\.The over\-approximation chain \(a specification\-level argument, not a mechanized proof\) is as follows: the deterministic detector flags every trace the probabilistic predicate flags at any non\-zero threshold, monotonically inθ\\theta, and the empirical estimator at an inflated threshold \(k≥100k\\geq 100\) soundly implies true detection\. The direction is the crux, and it is the*wrong*direction for a safety screen\. We proveA1prob\(t,0\)⇒A1det\(t\)A\_\{1\}^\{\\textit\{prob\}\}\(t,0\)\\Rightarrow A\_\{1\}^\{\\textit\{det\}\}\(t\)—every probabilistic event is caught—not the converseA1det\(t\)⇒A1prob\(t,θ\)A\_\{1\}^\{\\textit\{det\}\}\(t\)\\Rightarrow A\_\{1\}^\{\\textit\{prob\}\}\(t,\\theta\), because traces exist where the recorded strings differ but the model coincidentally produces the same output \(operational equivalence under string inequality\)\. The deterministic detector is therefore a sound*candidate\-anomaly screen*—every probabilistically\-real event is in the audit set, conditional on axiom \(b\)—but not the guarantee that no operationally\-realA1A\_\{1\}slips past unobserved\. We deliberately do not internalize a full Hoeffding bound inside Verus: axiom \(c\) packages the concentration result abstractly, which suffices for the over\-approximation chain at the cost of trusting rather than proving it\. Formalizing it would require real\-number probability infrastructure the current Verus distribution does not provide\[[32](https://arxiv.org/html/2606.17182#bib.bib32),[23](https://arxiv.org/html/2606.17182#bib.bib23)\];[Section˜4\.13](https://arxiv.org/html/2606.17182#S4.SS13)states the open problem and the soundness\-only position precisely\.

### 4\.10Concurrent semantics lift

The four refinements of[Section˜4\.8](https://arxiv.org/html/2606.17182#S4.SS8)hold under sequential interleaving, while the deployed runtime is multi\-threaded viastd::sync::Mutex\. We close this gap with a mechanized atomic\-event model \(lib\_concurrent\_semantics\.rs; 9 verified, 0 errors, 0 axioms\): each step is aLockAcquire/LockRelease/Read/Writeevent, a trace is well\-formed iff every event is enabled \(conformance to the abstract mutex protocol\), and the verified theorems—chiefly distinct\-agent access separation and “reads observe the current value”—establish that every well\-formed concurrent trace projects, per cell, to exactly the sequential trace shape the refinements consume\. The lift identifies the precise abstract protocol the sequential proofs require; it is conditional onstd::sync::Mutexrealizing Acquire/Release with event\-granular sequential consistency, an obligation relocated to RustBelt\[[35](https://arxiv.org/html/2606.17182#bib.bib35)\]\. We additionally check the lock protocol under the RC11 weak\-memory model\[[33](https://arxiv.org/html/2606.17182#bib.bib33)\]with the GenMC stateless model checker\[[34](https://arxiv.org/html/2606.17182#bib.bib34)\]: encoding the discipline asNNconcurrent locked read–modify–writes, GenMC exhaustively finds no lost update \(anA1A\_\{1\}witness\) forN∈\{2,3,4\}N\\in\\\{2,3,4\\\}\(2/6/24 executions\), while a relaxed\-ordering control is flagged as a data race \(non\-vacuity\)\. The full atomic\-event definitions, theorem statements, and weak\-memory analysis are in the online appendix \(§ A\); the residual—unboundedNN, theMutexrealization, liveness—is future work \([Sections˜4\.14](https://arxiv.org/html/2606.17182#S4.SS14)and[6\.3](https://arxiv.org/html/2606.17182#S6.SS3)\)\.

### 4\.11L2L\_\{2\}safety: causal\-tracking runtime preventsA1∧A3A\_\{1\}\\wedge A\_\{3\}

In[Section˜1\.4](https://arxiv.org/html/2606.17182#S1.SS4)only theL0→L1L\_\{0\}\\to L\_\{1\}step was mechanically verified across runtimes; the higher lattice levels were paper design\. This subsection closes theL1→L2L\_\{1\}\\to L\_\{2\}step\.

Construction\.TheL2L\_\{2\}runtime model \(verus\-detector/src/lib\_l2\_safety\.rs\) carries, per transaction, a*predecessors*set: the set of committed transactions whose writes appear in this transaction’s read\-values\. A commit isL2L\_\{2\}\-valid iff \(a\) the read\-set is fresh against the current committed store \(the¬A1\\neg A\_\{1\}check\) and \(b\) every predecessor is committed and not aborted \(the¬A3\\neg A\_\{3\}check\)\. An abort cascades to every transaction that has the aborter in its predecessors\. The model is a direct operational analogue of serializable snapshot isolation with cascading abort\[[21](https://arxiv.org/html/2606.17182#bib.bib21)\]\. The predecessor set, recorded as a causal*closure*at read time, is the operation\-record analogue of the causal\-history tracking of vector clocks\[[99](https://arxiv.org/html/2606.17182#bib.bib99),[100](https://arxiv.org/html/2606.17182#bib.bib100)\]: each transaction carries the identities of its transitively observed writers, which is what makes the one\-level cascade provably sufficient\.

Verified theorems \(22 verified, 0 errors\)\.

- •L2L\_\{2\}a/L2L\_\{2\}b \(noA1A\_\{1\}/A3A\_\{3\}at commit\)\.A commit yields a state where every read cell holds the observed value, and every predecessor intt’s causal closure is committed and not aborted\.
- •L2L\_\{2\}c \(cascade preserves clean predecessors\)\.The cascade\-abort transition preserves cleanliness; the transitive case is closed by recording𝑝𝑟𝑒𝑑𝑒𝑐𝑒𝑠𝑠𝑜𝑟𝑠\\mathit\{predecessors\}as a causal*closure*at read time \(a transitive dependent is, by closure, a direct one\), so the one\-level cascade is provably sufficient—the closure is an inductive invariant, noassume, no fixpoint iteration\.
- •L2L\_\{2\}g \(non\-vacuity\)\.A state exists in which a committed transaction retains an aborted predecessor, so the excluded predicate is satisfiable \(lemma\_no\_cascade\_admits\_a3\)\.
- •L2L\_\{2\}h \(reachable states are footprint\-free; end\-to\-end\)\.No state satisfying𝑖𝑛𝑣\_l2\\mathit\{inv\\\_l2\}contains the cascade footprint \(lemma\_l2\_reachable\_no\_a3\); with𝑖𝑛𝑣\_l2\\mathit\{inv\\\_l2\}inductive \(base case plus five preservation lemmas\), this closes¬A3\\neg A\_\{3\}over all reachable executions\.
- •L2L\_\{2\}i \(committed reads are supported; catalog\-A3A\_\{3\}image\)\.On every reachable state, every committed read has a surviving committed producer that wrote exactly the value read \(lemma\_l2\_reads\_supported\)—the runtime\-level form of[Definition˜3](https://arxiv.org/html/2606.17182#Thmanomaly3), via the provenance invariant𝑖𝑛𝑣\_𝑟𝑒𝑎𝑑\_𝑝𝑟𝑜𝑣𝑒𝑛𝑎𝑛𝑐𝑒\\mathit\{inv\\\_read\\\_provenance\}\.

Scope\.The transitive\-cascade case is not a residual: the causal\-closure invariant𝑝𝑟𝑒𝑑\_𝑐𝑙𝑜𝑠𝑒𝑑\\mathit\{pred\\\_closed\}is proved an inductive invariant of theL2L\_\{2\}runtime model—a base case plus five per\-step preservation lemmas \(lib\_l2\_safety\.rs: 22 verified, 0 errors, noassume, noadmit\)—so the one\-level cascade is provably sufficient on every reachable state and the case is closed unconditionally\. A constructive witness \(TheoremL2L\_\{2\}g,lemma\_no\_cascade\_admits\_a3\) shows the excluded predicate is satisfiable, so the safety theorem is non\-vacuous in the same sense as theL4L\_\{4\}witness of[Section˜4\.16](https://arxiv.org/html/2606.17182#S4.SS16); reachability under a disabled cascade is not mechanized at the proof level, but it is exhibited operationally—the unguarded twin of[Section˜5\.11](https://arxiv.org/html/2606.17182#S5.SS11)reaches the excluded footprint in1000/10001000/1000scenarios at every depth, so the predicate the theorem excludes is not merely satisfiable in the abstract but reached by a runnableL1L\_\{1\}\-class baseline\.

The correspondence to the catalogedA3A\_\{3\}is characterized precisely rather than asserted\.L2L\_\{2\}h prevents the*cascade*form exactly; for the*trace\-residue*form \([Definition˜3](https://arxiv.org/html/2606.17182#Thmanomaly3)\),L2L\_\{2\}i discharges the value\-provenance half \(every committed read has a surviving producer of exactly its value\), and the temporal half is closed by TheoremL2L\_\{2\}i\+\(lemma\_l2\_reads\_supported\_temporal\): via a per\-cell read\-timestamp invariant𝑖𝑛𝑣\_𝑟𝑒𝑎𝑑\_𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙\\mathit\{inv\\\_read\\\_temporal\}, the surviving producer committed no later than the read observed it \(𝑐𝑜𝑚𝑚𝑖𝑡\_𝑡𝑖𝑚𝑒\(w\)≤𝑟𝑒𝑎𝑑\_𝑎𝑡\[t\]\[c\]\\mathit\{commit\\\_time\}\(w\)\\leq\\mathit\{read\\\_at\}\[t\]\[c\]\), a genuine consequence of the monotone clock rather than an added assumption\.L2L\_\{2\}i\+thus matches[Definition˜3](https://arxiv.org/html/2606.17182#Thmanomaly3)on both clauses\. The detector decides over an emittedSeq⟨𝑂𝑝𝑅𝑒𝑐𝑜𝑟𝑑⟩\\mathrm\{Seq\}\\langle\\mathit\{OpRecord\}\\ranglewhile the model is a state machine; that projection—formerly the one remaining gap—is discharged by TheoremL2L\_\{2\}j below\.

TheoremL2L\_\{2\}j \(projection: the runtime emits detector\-clean histories;lib\_l2\_projection\.rs, 3 verified, 0 errors, 0 axioms, 0external\_body, 0assume\)\.A second potential residual—that TheoremL2L\_\{2\}i\+holds on the runtime*state machine*while the detector \([Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)\) decidesA3A\_\{3\}over an emittedSeq⟨𝑂𝑝𝑅𝑒𝑐𝑜𝑟𝑑⟩\\mathrm\{Seq\}\\langle\\mathit\{OpRecord\}\\rangle—is also closed\. We define a faithfulemitrelation \(each committed read becomes a single\-cell read op with𝑟𝑒𝑎𝑑\_𝑡𝑖𝑚𝑒=𝑟𝑒𝑎𝑑\_𝑎𝑡\[t\]\[c\]\\mathit\{read\\\_time\}=\\mathit\{read\\\_at\}\[t\]\[c\]and an empty write\-set; each committed write becomes a write op with𝑤𝑟𝑖𝑡𝑒\_𝑡𝑖𝑚𝑒=𝑐𝑜𝑚𝑚𝑖𝑡\_𝑡𝑖𝑚𝑒\\mathit\{write\\\_time\}=\\mathit\{commit\\\_time\}\) and prove that any trace faithfully emitting anL2L\_\{2\}i\+\-supported state contains*no*A3A\_\{3\}witness: the producing write the detector’s predicate demands is supplied byL2L\_\{2\}i\+at the state level and realized as a write op byemit, withk≠jk\\neq jbecause read ops write nothing\. The bridge is non\-circular by construction—emitnever refers to a read’s producer, so the trace\-level “every read has a producer” hypothesis, which would coincide with¬A3\\neg A\_\{3\}and trivialize the theorem, is not used; the producer link is supplied only byL2L\_\{2\}i\+, whose hypothesis is verbatimlemma\_l2\_reads\_supported\_temporal’s conclusion \(modular composition\)\. Two witnesses pin non\-vacuity: a satisfying state/trace pair, and an unsupported trace on whichA3A\_\{3\}fires\. Two modeling choices are stated rather than hidden: carriers areintrather than the detector’su32\(theA3A\_\{3\}predicate is width\-agnostic—identical logic\), andemitis specified as a faithful relation with a constructed satisfying witness rather than aSeq\-builder\. We name the residual this leaves rather than let it read as closed: becauseemitis a relation pinned by a witness, not a proven\-total function, the bridge establishes that a*faithful*emission of anL2L\_\{2\}i\+\-supported state isA3A\_\{3\}\-clean, not that every serializer a deployment might write is faithful; mechanizingemitas a total, deterministic function \(or proving an existing serializer refines it\) is the one step that would upgrade “end\-to\-end modulo a faithful emitter” to “end\-to\-end” without qualification\. WithL2L\_\{2\}j the verifiedL2L\_\{2\}runtime and the verified detector compose into a single chain under exactly that disclosed proviso\.

Residual\.One residual remains, shared by every level aboveL1L\_\{1\}: unlike the deployedL1L\_\{1\}runtimes \([Section˜4\.8](https://arxiv.org/html/2606.17182#S4.SS8)\), theL2L\_\{2\}model is not refined to an executable Rust runtime\. The remaining levelsL3L\_\{3\}\(adding¬A6\\neg A\_\{6\}\) andL4L\_\{4\}\(adding¬A2\\neg A\_\{2\}\) are closed by dedicated runtime models in[Section˜4\.15](https://arxiv.org/html/2606.17182#S4.SS15)and[Section˜4\.16](https://arxiv.org/html/2606.17182#S4.SS16)respectively\.

### 4\.12A4A\_\{4\}split\-view formalization

DeferringA4A\_\{4\}\(split\-view under replication\) would be a critical structural omission, because replication is central to deployed multi\-agent infrastructure\. A weak closure is available but worthless: “read\-from\-primary suppressesA4A\_\{4\}” follows from the observation thatA4A\_\{4\}requires two distinct replica ids while read\-from\-primary uses one—a structural tautology, since if only one replica is ever read, two replicas cannot be observed to disagree\. This subsection gives an argument with content instead: a safety property of the replication*mechanism*, proved from an append\-only invariant rather than read off the arity of the predicate\.

Construction\.The model \(verus\-detector/src/lib\_a4\_split\_view\.rs\) represents the primary as an append\-only, monotone\-versioned log over a single cell \(the per\-cell argument carries split\-view without loss, asA4A\_\{4\}is a per\-cell predicate\): the value committed at versionkkoccupies log positionk−1k\-1, the head version is the log length, and a write appends one entry and advances the version\. A read records the replica it was served by, the version and value it observed, and its trace position; a primary read observes the head, while a secondary replicates the primary only up to some log index, and a*lagging*secondary serves an older index than the current head\. TheA4A\_\{4\}predicate is unchanged: a trace exhibitsA4A\_\{4\}iff two reads of the same cell from*different*replicas observe*different*values\.

Verified theorems \(9 verified, 0 errors, 0 axioms, 0external\_body, 0assume\)\.

- •Append\-only invariants \(the content\)\.lemma\_state\_monotone\_lenproves the head version is monotone non\-decreasing in trace order;lemma\_state\_stable\_eq\_lenproves that two trace points with equal head version have*identical*logs—a write would have grown the length, so every intervening step was a read\. Both are proved by induction on the trace and carry the argument the theorems below read off\.
- •TheoremA4A\_\{4\}\-mono \(thm\_primary\_version\_monotone\)\.Primary read versions are monotone in trace order\.
- •TheoremA4A\_\{4\}\-no\-split \(thm\_primary\_no\_split\_at\_version\)\.Two primary reads of the cell that observe the*same*version observe the*same*value: no split view at a fixed committed version\. This is the substantive safety statement, and it follows from append\-only monotonicity, not from the predicate’s arity\.
- •Corollary \(cor\_primary\_diff\_value\_implies\_diff\_version\)\.Two primary reads returning different values must be at different versions—the value\-mismatch half ofA4A\_\{4\}cannot occur among primary reads at one version\.
- •Non\-vacuity: the prevented phenomenon \(lemma\_secondary\_lag\_admits\_a4\)\.A concrete trace in which a lagging secondary serves log index0\(value NULL\) while the primary sits at version11\(value1010\) produces a genuineA4A\_\{4\}witness\. Both observed values are computed from the actual model state rather than asserted by hand, so the witness is faithful to the construction: this is exactly the split\-view that monotone\-primary pinning excludes, and the exclusion has a reason \(the secondary served an older log index\), not a definitional fiat\.
- •Regime non\-vacuity \(lemma\_cross\_version\_primary\_differs\)\.Two primary reads legitimately differ in value \(55then77\) at distinct versions, so the no\-split theorem does not hold vacuously by all primary reads being forced equal\.

Scope\.The theorem establishes that a single append\-only, monotone\-versioned primary admits no split view at a fixed version, and that the residual split\-view arises precisely from secondary lag—which read\-from\-primary pinning removes\. The model is single\-cell and single\-primary; eventually\-consistent and CRDT\-merge strategies remain out of scope, requiring an unbounded propagation horizon or merge\-function algebra orthogonal to the argument here \([Section˜7\.8](https://arxiv.org/html/2606.17182#S7.SS8)\)\. As with the other model\-level results, theA4A\_\{4\}model is not refined to executable code\.

### 4\.13Probabilistic case: a soundness\-only screen, and the open concentration bound

We do not claim a probabilistic refinement among this paper’s contributions; framing one as a contribution would overstate what is achievable\. The accurate position is a scope separation between two distinct questions, plus an open problem the current toolchain cannot close\.

The deterministic detector answers a*per\-trace*question exactly: did a stale read occur in*this*execution? It fires iff the trace contains a read whose observed value differs from a later committed write on the same cell, and for that question it is sound*and*complete—the characteristic function of the property \([Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)\), with no false\-negative concern, because the property is decidable from the trace\. The*distributional*question—would re\-executing the read–generate–write window diverge with probability exceedingθ\\theta?—is a counterfactual property of the prompt’s output distribution, not of any single trace, and a non\-firing detector on one trace does not certify it \(¬det\\neg\\mathrm\{det\}may simply not have exercised the divergence on this run\)\. We therefore use the deterministic detector as a per\-execution*exact*detector and a per\-prompt*candidate\-anomaly screen*, and do not conflate the two\.

Certifying the distributional property quantitatively—“the empirical estimate is withinε\\varepsilonof the true rate with probability1−δ1\-\\deltaatk=O\(\(1/ε2\)log⁡\(1/δ\)\)k=O\(\(1/\\varepsilon^\{2\}\)\\log\(1/\\delta\)\)samples”—requires a Hoeffding\-grade concentration bound, which needs real\-number probability theory that the current Verus distribution does not provide\. A Markov\-style count bound \(the empirical count is bounded by the sample size\) is too weak to be a concentration guarantee, so we make no probabilistic refinement claim, and the mechanized development contains no probabilistic obligations and no probabilistic axioms\. We leave the concentration bound as an explicit open problem—adapting IronFleet\-style infrastructure\[[32](https://arxiv.org/html/2606.17182#bib.bib32)\]or a probabilistic program logic \([Section˜7\.11](https://arxiv.org/html/2606.17182#S7.SS11)\)\.

### 4\.14RustBelt interface specification

The concurrent\-semantics lift relocates the safety trust base to “std::sync::Mutexconforms to the abstract Acquire/Release protocol\.” We make that precise by enumerating the residual obligations as Verus theorems withexternal\_bodybodies taggedRUSTBELT\_OBLIGATION\(lib\_rustbelt\_interface\.rs; 4 verified\)\. Three remain bare stubs by design—lock\_is\_acquire,drop\_is\_release, andevent\_seqcst, the API and memory\-ordering correspondences a future Iris/RustBelt proof\[[35](https://arxiv.org/html/2606.17182#bib.bib35)\]would discharge—while the fourth \(no panic inside a critical section\) is addressed at the model level by two proved results: 4a panic\-freedom and 4b poisoning fail\-safety \(an agent panic poisons the lock rather than admitting a torn state\)\. The residual stub count is therefore three, matching the trust\-base accounting of[Section˜4\.8](https://arxiv.org/html/2606.17182#S4.SS8); this does not close the concurrency gap but makes it auditable as three named obligations rather than a collective “mutex correctness” assumption\. The full obligation specifications are in the online appendix \(§ B\)\.

### 4\.15L3L\_\{3\}safety: saga\-compensation runtime preventsA6A\_\{6\}

TheL1→L2L\_\{1\}\\to L\_\{2\}step closed in[Section˜4\.11](https://arxiv.org/html/2606.17182#S4.SS11)established that a causal\-tracking runtime preventsA1∧A3A\_\{1\}\\wedge A\_\{3\}\. TheL2→L3L\_\{2\}\\to L\_\{3\}step requires additionally preventingA6A\_\{6\}\(tool\-effect reordering\): two external\-effect tool calls within a transaction that complete in a different order than their issuance\. This subsection closes that step\.

Construction\.TheL3L\_\{3\}runtime model \(verus\-detector/src/lib\_l3\_safety\.rs\) carries, per transaction, a saga record: an ordered sequence of external tool calls with their issuance and completion times\. The runtime issues calls strictly sequentially, waiting for each to complete before issuing the next; on abort, every completed call is compensated in reverse issuance order\. The saga is well\-formed iff each call’sissued\_atexceeds the previous call’scompleted\_at, enforcing strict serialization\. The pattern follows the standard saga literature\[[31](https://arxiv.org/html/2606.17182#bib.bib31)\]adapted to LLM\-tool\-call orchestration\.

Verified theorems \(6 verified, 0 errors\)\.

- •L3L\_\{3\}a \(well\-formed saga has noA6A\_\{6\}witness\)\.Saga well\-formedness forces earlier\-issued calls to complete before later ones are issued, so no out\-of\-order completion pair exists\.
- •L3L\_\{3\}b \(append preserves well\-formedness\)andL3L\_\{3\}d/L3L\_\{3\}e \(abort compensates all completed calls; a committed saga is compensation\-trivial\)\.
- •L3L\_\{3\}c \(L3L\_\{3\}composes withL2L\_\{2\}\)\.A runtime with both disciplines preventsA1∧A3∧A6A\_\{1\}\\wedge A\_\{3\}\\wedge A\_\{6\}; the state and external\-effect domains are orthogonal\.
- •L3L\_\{3\}f \(lattice placement\)\.A saga satisfyingL3L\_\{3\}well\-formedness and compensation order does not exhibitA6A\_\{6\}\.

TheoremL3L\_\{3\}g \(a commit\-order sequencer orders genuinely concurrent effects;lib\_l3\_sequencer\.rs, 5 verified, 0 errors, 0 axioms, 0external\_body, 0assume\)\.TheoremL3L\_\{3\}a preventsA6A\_\{6\}by forbidding concurrency; this theorem prevents it*under*concurrency\. External effects complete in an arbitrary schedule—any permutation of completion times, including fully reversed—and a commit\-order sequencer externalizes effectkkonly once effects0,…,k0,\\dots,khave all completed, so its externalization time is the maximum completion time over those indices\. We prove \(thm\_sequencer\_orders\_concurrent\_effects\) that this externalization time is monotone in issuance index*for every completion schedule*, hence \(cor\_sequencer\_no\_a6\) the externalized order never inverts issuance order: noA6A\_\{6\}\. Two witnesses keep the result non\-vacuous: the unsequenced baseline \(externalize\-on\-completion\) on a reversed schedule does exhibitA6A\_\{6\}\(lemma\_no\_sequencer\_admits\_a6, the prevented phenomenon\), and that same reversed schedule—completions genuinely out of issuance order—yields noA6A\_\{6\}once sequenced \(lemma\_sequencer\_fixes\_concurrent\_reorder\), so the guarantee is not vacuous\. UnlikeL3L\_\{3\}a, the proof orders concurrent effects rather than excluding them\.

Scope\.TheoremL3L\_\{3\}a preventsA6A\_\{6\}*by exclusion*: saga well\-formedness serializes external calls, so the model forbids the concurrent executions that alone could exhibitA6A\_\{6\}, rather than ordering genuinely concurrent effects\. We retainL3L\_\{3\}a as the strict\-serialization special case but no longer rest theA6A\_\{6\}claim on it\. The concurrent case—A6A\_\{6\}prevention for genuinely concurrent external effects under an enforced commit order, which an earlier version left unmodeled—is now discharged by TheoremL3L\_\{3\}g \(lib\_l3\_sequencer\.rs\), which orders an arbitrary completion schedule into issuance order without excluding concurrency\. The substantive contributions are therefore TheoremL3L\_\{3\}g and TheoremL3L\_\{3\}c \(the orthogonal composition of the saga discipline withL2L\_\{2\}’s causal tracking\);L3L\_\{3\}a is definitional and is reported as such\. TheL3L\_\{3\}proof establishes the saga discipline as sufficient forA6A\_\{6\}prevention; the runtime engineering for real external\-tool integrations \(idempotency, timeout handling, partial\-failure recovery\) remains paper design, with the saga pattern itself\[[31](https://arxiv.org/html/2606.17182#bib.bib31),[2](https://arxiv.org/html/2606.17182#bib.bib2)\]as the operational realization template\. The strong precondition on the append theorem requires all prior calls to have completed before the new call’s issuance, which is operationally trivial under strict serialization but materializes the proof obligation explicitly\. TheL4L\_\{4\}step \(adding¬A2\\neg A\_\{2\}\) is treated next\.

### 4\.16L4L\_\{4\}safety: registry\-snapshot isolation preventsA2A\_\{2\}

The final named lattice pointL4L\_\{4\}adds prevention ofA2A\_\{2\}\(phantom\-tool\): an operation plans a tool call against the registry it observed at read time, but by call time the tool has been removed or its signature changed, so the call dispatches against a tool that no longer matches the plan\. This subsection closesL4L\_\{4\}, so that every named point of the chain—L0L\_\{0\}throughL4L\_\{4\}—is backed by a mechanically\-verified realizing runtime model; no named level remains*unverified*paper design, thoughL2L\_\{2\}throughL4L\_\{4\}remain verified at the model level rather than as deployed, code\-refined runtimes\.

Construction\.The runtime model \(verus\-detector/src/lib\_l4\_safety\.rs\) carries a live tool registry \(a map from tool id to signature\) and, per operation, the signature pinned for the planned tool at read time\. TheA2A\_\{2\}predicate holds for a committed operation when its planned tool is, against the live registry, either absent or carrying a signature different from the pinned one\. Two prevention disciplines are modeled:*validation*\(optimistic\), in which an operation commits only if the planned tool is still present with the pinned signature and otherwise aborts; and*snapshot isolation*\(by construction\), in which the operation resolves its tool binding from a pinned snapshot, so concurrent registry mutation cannot affect what it dispatches against\.

Verified theorems \(5 verified, 0 errors, 0 axioms\)\.

- •L4L\_\{4\}a/L4L\_\{4\}b \(prevention\)\.A validated commit \(planned tool present with pinned signature\) leaves noA2A\_\{2\}witness, and an operation resolving its binding from a pinned*snapshot*can never exhibit one regardless of registry churn\.
- •L4L\_\{4\}c \(non\-vacuity\)\.A constructive witness: without a discipline, a committed operation against a mutated registry whose planned tool’s signature differs from the pinned one is anA2A\_\{2\}witness\.
- •L4L\_\{4\}d/L4L\_\{4\}e \(placement\)\.The commit invariant excludes both witness disjuncts, tying the discipline to lattice pointL4L\_\{4\}\.

Composition and scope\.A2A\_\{2\}prevention operates on the tool\-registry domain, whileL3L\_\{3\}’s saga discipline operates on the external\-effect\-call domain andL2L\_\{2\}’s causal tracking on the read/write/predecessor domain; the three domains are orthogonal, so a runtime applying all three preventsA1,A2,A3A\_\{1\},A\_\{2\},A\_\{3\}, andA6A\_\{6\}simultaneously\. This composition is now mechanized rather than argued:lib\_consistency\_lattice\.rsdefines the product state\(L2,L3,L4\)\(L\_\{2\},L\_\{3\},L\_\{4\}\)over the three disjoint carriers and proveslemma\_lattice\_jointly\_safe—that the conjunction of the three component invariants \(𝑖𝑛𝑣\_l2\\mathit\{inv\\\_l2\},𝑠𝑎𝑡𝑖𝑠𝑓𝑖𝑒𝑠\_l3\\mathit\{satisfies\\\_l3\},𝑛𝑜\_a2\_𝑎𝑛𝑦𝑤ℎ𝑒𝑟𝑒\\mathit\{no\\\_a2\\\_anywhere\}\) entails joint prevention ofA1A\_\{1\}\(image\),A3A\_\{3\},A6A\_\{6\}, andA2A\_\{2\}—by invoking each component’s safety theorem on its projection, with noassumeand noadmit\. Three framing lemmas establish non\-interference: a step on one carrier preserves the other two disciplines, since the predicates range over disjoint fields\. The composition root verifies at 38 obligations \(0 errors\)—the three component modules re\-verified together with five new composition lemmas\. As with the lower levels, the proof establishes the discipline as sufficient forA2A\_\{2\}prevention at the model level; the runtime engineering for live registry hot\-swapping \(versioned signatures, snapshot lifecycle, validation\-abort recovery\) is the operational realization, not re\-verified here\. WithL4L\_\{4\}closed, the strict chainL0⊊L1⊊L2⊊L3⊊L4L\_\{0\}\\subsetneq L\_\{1\}\\subsetneq L\_\{2\}\\subsetneq L\_\{3\}\\subsetneq L\_\{4\}is mechanically realized end to end\.

## 5Empirical pilot

[Section˜3](https://arxiv.org/html/2606.17182#S3)provides bounded\-state evidence that each anomaly admits a witness within the operational model\. This leaves open whether the lattice is exercised by realistic multi\-agent workloads\. To probe this question we built an end\-to\-end pipeline that instruments multi\-agent runs, emits one operation record per logical step in JSON Lines format, and runs the Rust detectors of[Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)over the resulting traces\.

### 5\.1Methodology

The instrumentation layer is a Python module of approximately 200 lines exposing a two\-phasebegin/commitAPI\. Eachbegin\(𝑟𝑒𝑎𝑑\_𝑘𝑒𝑦𝑠\)\(\\mathit\{read\\\_keys\}\)call snapshots the agent’s read state—the values of the requested cells, the read clock, and the visible\-tools registry—but does not advance the global clock\. Each subsequentcommit\(𝑤𝑟𝑖𝑡𝑒\_𝑘𝑣,𝑡𝑜𝑜𝑙\_𝑢𝑠𝑒𝑑\)\(\\mathit\{write\\\_kv\},\\mathit\{tool\\\_used\}\)call advances the clock, applies the writes if any, and emits a single operation record whose schema is field\-for\-field aligned withMemory\.tla\. Operations that require atomicity \(a single\-agent sequential update, for instance\) use a one\-shotop\(\)wrapper that callsbeginthencommitwith no intervening events\. The two\-phase API enables the canonicalA1A\_\{1\}interleaving: two agents observe a cell at clocktt, then commit divergent values att\+1t\{\+\}1andt\+2t\{\+\}2\.

### 5\.2Workload

The pilot uses seven synthetic but reproducible task scenarios designed to surface specific points in the lattice:

- •Concurrent stale\-read\.Two agents read the same cell at the same clock tick, then commit divergent values in different orders\. Designed to surfaceA1A\_\{1\}\.
- •Tool deprecation\.An agent plans a tool atbegin; the tool is removed from the registry beforecommit\. Designed to surfaceA2A\_\{2\}\.
- •Phantom\-write cascade\.An agent reports having read a value ofccthat no committed write produced—a commit\-log skew\. Designed to surfaceA3A\_\{3\}\.
- •Tool\-effect reorder\.An agent issues writes in order\[x,y\]\[x,y\]but the store commits them in order\[y,x\]\[y,x\], simulating asynchronous tool execution\. Designed to surfaceA6A\_\{6\}\.
- •Linear pipeline \(clean\)\.Three agents update a shared progress cell in strict sequence\. Control: no anomaly expected\.
- •Fan\-out merge\.A coordinator forks two workers writing disjoint cells, then merges\. Branching control with no shared write target\.
- •Single\-agent control\.One agent makes three sequential updates\. Single\-agent control: no anomaly possible\.

We executed 700 trace generations with scenarios drawn uniformly at random under a fixed seed, producing 700 JSON Lines traces\. The first four scenarios are designed to surface one anomaly each; the remaining three are controls with no anomaly expected\.

### 5\.3Results: real\-LLM pilot

We exercised the pipeline with LLM\-driven agents under the AutoGen framework\[[24](https://arxiv.org/html/2606.17182#bib.bib24)\]across three workloads, designed to span the space from concurrent to strictly sequential inter\-agent dependency:

- •Edit\-review\.Two agents \(editor,reviewer\) both read celldocin the first turn and commit divergent writes thereafter\. By construction, both agents readdocat the same clock tick under round\-robin scheduling\.
- •Plan\-execute\.Two agents \(planner,executor\) operate in strict sequence: planner writesplan, executor readsplanand writesresult\. No cell is read by one agent and written by the other concurrently\.
- •Triage\.Three agents \(reporter,triager,engineer\) form a pipeline: reporter writesticket; triager readsticketand writespriority; engineer reads both and writesresolution\. Mid\-pipeline reads overlap with downstream reads of the same cell\.

System prompts enforced a strict tool\-use protocol in all three; the model was OpenAI’s gpt\-4o with default temperature\. We ran 100 sessions per workload with fixed seeds, yielding 300 JSON Lines traces\.

TABLE III:Level distribution across 300 LLM\-driven sessions \(N=100N=100per workload, gpt\-4o\), with 95% bootstrap CIs \(1,000 resamples\)\. The 100% rate in edit\-review is structurally guaranteed by workload design; the 1% in plan\-execute reflects a single off\-protocol re\-read; the 35% in triage is the only non\-trivial intermediate measurement\.Leveledit\-reviewplan\-executetriageL0L\_\{0\}\(A1A\_\{1\}fires\)100\.0% \[100, 100\]1\.0% \[0, 3\]35\.0% \[26, 44\]L4L\_\{4\}\(clean\)0\.0% \[0, 0\]99\.0% \[97, 100\]65\.0% \[56, 74\]Total100100100[Table˜III](https://arxiv.org/html/2606.17182#S5.T3)reports the level distribution\. Three findings\.

First, the edit\-review rate of 100% confirms that under round\-robin scheduling with both agents required to read the same cell before either writes,A1A\_\{1\}is*structurally guaranteed*rather than empirically observed\. We retain this workload in the table not as a measurement ofA1A\_\{1\}prevalence in the wild but as a calibration point: the rate at which the LLM follows the prescribed 2\-call \(read, write\) pattern, with the residual 0% accounted for by sessions in which the model collapsed reads and writes into a single turn\.

Second, the plan\-execute rate of 1% confirms that strictly sequential workflows — where no cell is concurrently readable by two agents whose writes follow — effectively suppressA1A\_\{1\}\. The single\-trace anomaly is consistent with an off\-protocol re\-read during a multi\-turn execution; we did not investigate further\.

Third, the triage rate of 35% is the only measurement of the three that occupies a non\-trivial intermediate regime\. The 3\-agent pipeline producesA1A\_\{1\}when the engineer’s read ofticketinterleaves with the triager’s still\-active read window, but not always; the variability comes from the LLM’s tool\-call ordering within each agent’s turn\.

The pilot’s principal output is methodological: the detector pipeline runs end\-to\-end against real LLM traces, and produces a level classification per session under the chosen chain\. TheA1A\_\{1\}rate spans 1%–100% across the three workloads, with workload structure dominating the rate to a degree exceeding any plausible model\-, prompt\-, or seed\-induced variation; this is consistent with the rates being workload\-engineered\. We do not present these rates as evidence ofA1A\_\{1\}prevalence in unmodified production multi\-agent systems; only as a sensitivity check on the detector pipeline’s ability to discriminate workload regions\. Confidence in the lattice framework’s practical utility requires empirical work we have not performed \(large\-scale, multi\-model\-family, unmodified\-framework deployments instrumented with the detector pipeline\), and we are explicit about this in[Section˜6\.2](https://arxiv.org/html/2606.17182#S6.SS2)\.

### 5\.4Synthetic baseline comparison

To check whether the rates above are artifacts of an unguarded runtime, we re\-ran the 700\-trace synthetic pilot through three variants \(vanilla; pessimistic locking, on\-conflict ops silently dropped; snapshot isolation, on\-conflict commits aborted\) on the same scenarios, seed, and detector pipeline\. Pessimistic locking drives bothA1A\_\{1\}andA2A\_\{2\}to0%0\\%at the cost of 285 dropped operations \(40\.7%\); snapshot isolation drivesA1A\_\{1\}to0%0\\%but leavesA2A\_\{2\}at21\.1%21\.1\\%\(its scope is value memory, not the tool registry\) at 145 aborts \(20\.7%\); vanilla exhibitsA1A\_\{1\}at20\.7%20\.7\\%andA2A\_\{2\}at21\.1%21\.1\\%with no aborts\. Under these scripted scenarios both guarded runtimes pay≥1\\geq 1abort per anomaly prevented, and a sound\-and\-complete detector returningL4L\_\{4\}is machine\-checkable evidence of an anomaly\-free trace\. This controlled\-scale result is superseded by the LLM\-scale baseline of[Section˜5\.5](https://arxiv.org/html/2606.17182#S5.SS5), where each drop or abort corresponds to a re\-issued inference call\.

### 5\.5Real\-LLM baseline comparison

We re\-ran the three\-workload pilot of[Section˜5\.3](https://arxiv.org/html/2606.17182#S5.SS3)through each of the three runtimes \(vanilla, pessimistic locking, snapshot isolation\), 100 sessions per cell, 900 sessions total, against gpt\-4o\.[Table˜IV](https://arxiv.org/html/2606.17182#S5.T4)reports theA1A\_\{1\}firing rate at each cell with 95% bootstrap CIs \(1,000 resamples\)\.

TABLE IV:Real\-LLM baseline:A1A\_\{1\}\(stale\-generation\) firing rate by runtime×\\timesworkload, 100 sessions per cell, 95% bootstrap CIs\. The vanilla row replicates the rates of[Section˜5\.3](https://arxiv.org/html/2606.17182#S5.SS3)\(within sampling noise\)\. Pessimistic locking eliminatesA1A\_\{1\}in every cell\. Snapshot isolation eliminatesA1A\_\{1\}in two of three workloads but admits a residual 3% rate in triage\.Runtimeedit\-reviewplan\-executetriageVanilla \(no instrumentation\)100\.0% \[100, 100\]0\.0% \[0, 0\]33\.0% \[24, 42\]Pessimistic locking0\.0% \[0, 0\]0\.0% \[0, 0\]0\.0% \[0, 0\]Snapshot isolation0\.0% \[0, 0\]0\.0% \[0, 0\]3\.0% \[0, 6\]#### Observation: snapshot isolation’s classical incompleteness manifests in the agent setting\.

The SI runtime validates a committing operation against concurrent writers to its read set*at commit time*, but validation triggers only on a non\-empty write set; a read\-only operation ticks the clock and emits a no\-write𝑂𝑝𝑅𝑒𝑐𝑜𝑟𝑑\\mathit\{OpRecord\}without validation\. This is the classical write\-only SI behavior—read\-only transactions in textbook SI are not validated\[[21](https://arxiv.org/html/2606.17182#bib.bib21),[22](https://arxiv.org/html/2606.17182#bib.bib22)\], and read\-set serialization \(SSI\) closes the gap\. We did not implement SSI in the pilot \(a deliberate scope choice\); the 3% rate is the fraction of read\-only\-triager sessions whose decisions reach downstream agents without materializing as a validated write, not a general claim\. Concretely, in the three SI/triage failures the reporter writes the ticket twice, the triager reads the first version and hands off \(empty write set, no validation\), and the detector surfaces the stale read against the reporter’s second commit\. The conceptual content—read\-only operations that influence downstream agents need SSI\-style validation—is identical to the database case; the operational regime in which it matters is what is new\.

#### Pessimistic locking eliminatesA1A\_\{1\}in every cell\.

The pessimistic\-locking runtime drops every conflicting operation silently \([Section˜5\.4](https://arxiv.org/html/2606.17182#S5.SS4)\), and every cell of[Table˜IV](https://arxiv.org/html/2606.17182#S5.T4)shows zeroA1A\_\{1\}instances\. The mechanism is stronger than SI’s because the lock is acquired*before*the operation reads, and a failed acquisition causes the operation to be dropped without invoking the LLM beyond the agent’s planning phase\. This makes pessimistic locking the recommended consistency mechanism for agent workloads where the cost of a wasted LLM inference exceeds the cost of missed collaboration on a contested cell\.

#### Cost of prevention\.

Both disciplines drop a substantial fraction of operations under contention: in edit\-review both drop the reviewer’s commit in 100% of sessions, though the LLM inference still ran \(SI aborts at validation, pessimistic fails acquisition—which could in principle short\-circuit the call but does not in our implementation\)\. A defensible cost claim requires per\-session token counts from the clientusagecallbacks, which the original pilot did not capture \(character\-derived estimates underestimate cost by ignoring history growth and intermediate tool\-call exchanges\)\. We measure token\-level cost at LLM scale in[Section˜5\.6](https://arxiv.org/html/2606.17182#S5.SS6)\.

### 5\.6Cost analysis

The cost of preventingA1A\_\{1\}is the most directly actionable empirical result in this paper, and we promote it accordingly: under the runtimes and workloads measured, serializable snapshot isolation preventsA1A\_\{1\}at no cost detectable in the between\-session comparison with an unguarded runtime \(a paired design,[Section˜5\.10](https://arxiv.org/html/2606.17182#S5.SS10), resolves a single∼8%\{\\sim\}8\\%token overhead on one workload\), and the cost of the strictest discipline \(pessimistic locking\) is bounded rather than the order\-of\-magnitude penalty the multi\-agent literature commonly assumes\.

We instrumented the AutoGen pilot driver \(tokens\_capture\.py\) to capture per\-call OpenAI usage metadata \(𝑝𝑟𝑜𝑚𝑝𝑡\_𝑡𝑜𝑘𝑒𝑛𝑠\\mathit\{prompt\\\_tokens\},𝑐𝑜𝑚𝑝𝑙𝑒𝑡𝑖𝑜𝑛\_𝑡𝑜𝑘𝑒𝑛𝑠\\mathit\{completion\\\_tokens\}, elapsed wall\-clock milliseconds\) and re\-ran the 900\-session real\-LLM baseline of[Section˜5\.5](https://arxiv.org/html/2606.17182#S5.SS5)with the instrumentation in place\. Table S1 \(online appendix\) reports the per\-session cost by runtime and workload with 95% bootstrap CIs \(1,000 resamples\)\. Pricing is gpt\-4o\-2024\-08\-06 list price at the time of the run: $2\.50/M input tokens, $10\.00/M output tokens\.

Three findings, in descending order of importance\.

Finding 1: snapshot isolation adds at most a single∼8%\{\\sim\}8\\%token overhead \(resolved by a paired design on one workload\), undetectable between sessions\.The most statistically powerful measurement is the paired design of[Section˜5\.10](https://arxiv.org/html/2606.17182#S5.SS10), which removes between\-session variance: it isolates one small token overhead \(SSI on plan\-execute,∼8%\{\\sim\}8\\%,p=0\.007p\{=\}0\.007, 90% CI up to∼12\.6%\{\\sim\}12\.6\\%\)\. The coarser between\-session comparison masks even this—the SSI mean sits within sampling noise of vanilla in every cell: CIs overlap on edit\-review \(3\.11 vs 3\.13\), plan\-execute \(1\.95 vs 1\.93\), and triage \(3\.13 vs 3\.23\)\. The honest reading is therefore “∼8%\{\\sim\}8\\%where a paired design can resolve it, no overhead separable between sessions otherwise,” not an exact zero\. TheA1A\_\{1\}\-prevention rate of SSI is 100%, 100%, and 97% respectively \([Table˜IV](https://arxiv.org/html/2606.17182#S5.T4)\)\. The classical cost asymmetry that motivates serializable snapshot isolation in the database setting\[[21](https://arxiv.org/html/2606.17182#bib.bib21)\]—validation rather than locking keeps reads non\-blocking—transfers to the agent setting in the form most favorable to deployment\.

Finding 2: pessimistic locking incurs bounded overhead, concentrated in multi\-stage contention\.On edit\-review the pessimistic overhead is statistically zero \(CIs marginally overlap: 3\.21–3\.38 vs 3\.11–3\.16\)\. On plan\-execute the overhead is also zero \(slightly cheaper than vanilla, within noise\)\. On triage the overhead is real and statistically significant: $5\.21 m vs $3\.23 m, with CIs \[4\.90, 5\.54\] vs \[2\.93, 3\.55\] that do not overlap, an≈\\approx62% increase\. The pattern matches[Section˜5\.4](https://arxiv.org/html/2606.17182#S5.SS4): pessimistic locking adds cost where lock contention forces wait\-and\-retry across pipeline stages, and is essentially free in single\-writer or low\-overlap workloads\.

Finding 3: the cost asymmetry in the agent setting is bounded at≤\\leq1\.6×\\timesvanilla in the worst measured workload on gpt\-4o, not the unbounded "orders of magnitude" that[Section˜2\.2](https://arxiv.org/html/2606.17182#S2.SS2)speculated\.Our pre\-measurement intuition—that a discarded LLM inference is "cripplingly expensive"—is empirically wrong for these workloads at this scale: worst\-case overhead is bounded, typical\-case near zero\. Abort costs remain non\-trivial in kind \([Section˜5\.4](https://arxiv.org/html/2606.17182#S5.SS4): pessimistic locking drops 40\.7% of synthetic operations\), but not order\-of\-magnitude in measured cost\.

Cross\-model validation on Claude Sonnet 4\.5\.To probe whether the gpt\-4o findings reflect properties of the runtime model or of one specific model family, we replicated the full 900\-session protocol against Anthropic’sclaude\-sonnet\-4\-5\-20250929, identical workloads and identical instrumentation\. Table S2 \(online appendix\) reports the per\-session cost\. Claude pricing at the time of the run is $3\.00/M input tokens, $15\.00/M output tokens\.

Finding 4: SSI between\-session cost\-parity replicates across model families; pessimistic\-overhead pattern is qualitatively consistent, quantitatively model\-dependent\.The strongest claim in this section—that snapshot isolation adds no cost overhead detectable between sessions vs vanilla— holds in both model families\. All six SSI cells \(three workloads×\\timestwo providers\) have means within sampling noise of their vanilla counterparts: CIs overlap on every cell\. Pessimistic overhead retains the same shape on both providers \(approximately zero on edit\-review and plan\-execute, real overhead concentrated in triage\), but the magnitude of the triage overhead scales differently: gpt\-4o exhibits a 62% overhead in triage\-pessimistic, while Claude Sonnet 4\.5 exhibits a 124% overhead in the same cell \($27\.73\\mathdollar 27\.73m vs$12\.36\\mathdollar 12\.36m vanilla, non\-overlapping CIs\)\. Wall\-clock latency follows the same pattern: triage\-pessimistic on Claude requires 21\.4 s median wall\-clock vs 10\.5 s vanilla, a 2×\\timeslatency penalty matching the cost penalty\. The magnitude difference is not explainable from these data alone \(candidate mechanisms: retry budgets, client\-side tool serialization, tokenization density\); the qualitative conclusion stands on both providers—bounded,≤\\leq1\.6×\\times\(gpt\-4o\) to≤\\leq2\.3×\\times\(Claude\), never order\-of\-magnitude\.

Finding 5: SSI cost is governed by the realized abort rate, with a fitted intercept statistically indistinguishable from zero\.To test the discipline where its abort\-and\-retry machinery engages, we ran a paired high\-contention sweep \([Figure˜2](https://arxiv.org/html/2606.17182#S5.F2)\): contention varied two independent ways—fan\-inW∈\{2,4,8,16\}W\\in\\\{2,4,8,16\\\}at one cell, and cell countC∈\{2,4,8,16,32\}C\\in\\\{2,4,8,16,32\\\}atW=8W\{=\}8—over 155 paired sessions on gpt\-4o\-mini at realized abort rates0\.090\.09–0\.930\.93\. Token cost is counted exactly \(including generations re\-spent on aborts\), and the harness re\-verifies per session that the baseline exhibitsA1A\_\{1\}while the guarded disciplines do not, so the measured quantity is the cost of*working*prevention\. SSI token overhead is linear in the abort rate,overhead=\+0\.1%\[−2,\+2\]\+108%\[104,112\]×abort\_rate\\text\{overhead\}=\+0\.1\\%\\,\[\-2,\+2\]\+108\\%\\,\[104,112\]\\times\\text\{abort\\\_rate\}: one re\-spent generation per abort, with no super\-linear term, and the two contention mechanisms fall on the common line, so the dependence is on realized contention rather than how it is induced\. The intercept is indistinguishable from zero, and overhead stays below15%15\\%up to an abort rate of0\.14\[0\.13,0\.15\]0\.14\\,\[0\.13,0\.15\]; the lowest cell measures\+9\.9%\[5\.6,14\.5\]\+9\.9\\%\\,\[5\.6,14\.5\]at abort rate0\.090\.09, so the sub\-breakpoint regime is measured, not extrapolated—and what it measures is low overhead, not zero\. We do not establish that deployed workloads occupy this regime \(the0/6000/600MAST result is consistent with it but is not an abort\-rate measurement\); the envelope instead makes the question auditable per deployment\. Pessimistic locking is the complementary profile: token\-neutral between sessions \(±1%\\pm 1\\%\) but serializing critical sections, so its cost is real yet invisible to a token metric \(wall\-clock figures here are*composed*from per\-call latencies, not measured under a live concurrent runtime, and grow several\-fold at1616\-way\)\. We therefore report the cost result as an envelope: statistically zero at a zero abort rate, bounded and predictable \(≈\\approxone generation per abort\) above it\.

![Refer to caption](https://arxiv.org/html/2606.17182v1/x1.png)Figure 2:SSI token overhead versus realized abort rate under the high\-contention sweep \(gpt\-4o\-mini, 155 paired sessions\)\. Two independent contention mechanisms—varying fan\-in at a single contended cell \(squares\) and varying cell count at fixed fan\-in \(diamonds, with per\-session points\)—fall on a common line,\+0%\+108%×abort\_rate\+0\\%\+108\\%\\times\\text\{abort\\\_rate\}\. The intercept is indistinguishable from zero and overhead stays within the low\-overhead band \(<15%\{<\}15\\%; a descriptive plotting threshold—the pre\-declared equivalence margin of[Section˜5\.10](https://arxiv.org/html/2606.17182#S5.SS10)is the stricter±10%\\pm 10\\%\) up to an abort rate of≈0\.14\{\\approx\}0\.14\. Pessimistic locking is token\-neutral in between\-session means \(one nominal paired cell on triage;[Section˜5\.10](https://arxiv.org/html/2606.17182#S5.SS10)\) but incurs the wall\-clock cost reported in the text\.What this section cannot establish\.The measurement covers gpt\-4o and Claude Sonnet 4\.5 on these three workloads\. Cost behavior on Gemini or other providers, and on workloads with materially larger read\-sets, is not measured; the high\-contention sweep of Finding 5 covers fan\-in up to 16 agents and multi\-round pipelines, but only on gpt\-4o\-mini\. The measured cost asymmetry is consistent with the design intuition that SSI offers a near\-optimal cost\-correctness trade\-off in low\-to\-moderate contention regimes and pessimistic locking is preferable only under workload\-specific contention patterns; the two\-provider replication strengthens this claim qualitatively but does not generalize to open\-weights models or to model families optimized for low\-latency tool use\. Three\-plus provider replication including local open\-weights inference \(Llama, Mistral\) and higher\-contention workload exploration are identified as principal empirical follow\-up \([Section˜6\.3](https://arxiv.org/html/2606.17182#S6.SS3)\)\.

On cross\-model comparison: what is and is not claimed\.Comparing the gpt\-4o triage overhead \(62%\) directly against Claude’s \(124%\) would be confounded by tokenization density, pricing, and tool serialization, so we make no cross\-provider absolute comparison\. Every claim here is strictly within\-model—SSI\-on\-gpt\-4o versus vanilla\-on\-gpt\-4o, never gpt\-4o versus Claude—so those differences cancel by construction; that the same within\-model patterns \(\(i\) SSI indistinguishable from vanilla, \(ii\) pessimistic overhead triage\-concentrated\) hold in two unrelated families is evidence they are properties of the runtime strategy, a replication rather than a between\-model measurement\. The62%62\\%\-vs\-124%124\\%gap is exactly the confounded quantity and we decline to interpret it\. The clean isolation is the within\-model wall\-clock study of[Section˜5\.10](https://arxiv.org/html/2606.17182#S5.SS10)\.

### 5\.7Cookbook\-derived scenarios

The pilots of[Sections˜5\.3](https://arxiv.org/html/2606.17182#S5.SS3)and[5\.5](https://arxiv.org/html/2606.17182#S5.SS5)engineer workload structure to induce or suppress specific anomalies; their value is to demonstrate that the detector pipeline operates end\-to\-end and that runtime strategies behave as the formal predicates predict\.[Section˜5\.7](https://arxiv.org/html/2606.17182#S5.SS7)reports a separate study, designed to address a distinct evidentiary question: do the catalog’s anomalies fire in cookbook\-shaped multi\-agent workloads*not*engineered to exhibit them?

Method\.We construct six multi\-agent scenarios modeled on the LangGraph222[https://langchain\-ai\.github\.io/langgraph/](https://langchain-ai.github.io/langgraph/)and AutoGen333[https://microsoft\.github\.io/autogen/](https://microsoft.github.io/autogen/)cookbook templates: five stateless\-tool scenarios—*research\_collab*\(researcher–analyst\),*supervisor*,*hierarchical*,*code\_review*, and*customer\_triage*—and one typed\-shared\-state scenario,*shared\_workspace*\(planner/executor/monitor reading and writing a typed workspace with slots*task*,*progress*,*notes*, modeled on the LangGraphStateGraphpattern\)\. The AutoGen framework code is unmodified; only the chat\-completion client is wrapped to record each\.create\(\.\.\.\)as an OpRecord\-shaped event, classifyingread\_workspace/write\_workspacecalls as reads/writes of cellws:<slot\>and stateless tools as fresh per\-call cells\. The OpRecord format is unchanged from[Section˜5\.1](https://arxiv.org/html/2606.17182#S5.SS1), so the verified detector consumes the traces directly\. Each scenario was run 100 times with gpt\-4o underRoundRobinGroupChat, yielding 600 sessions\.

Results\.Table S3 \(online appendix\) reports theA1A\_\{1\}rate per scenario, with two refinements: \(i\) the rate across allA1A\_\{1\}\-eligible witnesses in the trace \(any\), \(ii\) the rate restricted to witnesses in which the read and the subsequent contradicting write are performed by*different*agents \(cross\-agent\), and \(iii\) the rate restricted to within\-agent read\-then\-update patterns \(self\-agent\)\. All 95% confidence intervals are bootstrap intervals over 1000 resamples\.

Witness inspection\.Cross\-agentA1A\_\{1\}inshared\_workspacesession 0005 fires on the following pair \(paraphrased; full trace in artifact bundle\): the executor readsws:taskattr=9t\_\{r\}=9with value"Plan and execute the organization of a digital file storage system", after which the planner writesws:taskattw=14t\_\{w\}=14with value"Continue executing the organization of a digital file storage system"\. The read precedes the write, the values differ, the agents differ\. The pattern matches the formalA1A\_\{1\}predicate of[Section˜3\.1](https://arxiv.org/html/2606.17182#S3.SS1)\.

Interpretation, with explicit tautology caveat\.The contrast between the stateless \(0%\) and stateful \(90%\) scenarios is structurally what one would expect from the definition ofA1A\_\{1\}: the predicate requires shared mutable cells, and only the stateful scenario has them\. The contrast therefore does not by itself validate that the catalog is operationally relevant; it confirms that the formal definition behaves as the formal definition predicts\. We accept the criticism\.

What the contrast*does*establish, and what is not trivially baked into the definition, is that in unmodified cookbook\-shaped workloads using stock AutoGenRoundRobinGroupChat, \(a\) stateless tool composition is sufficient to suppressA1A\_\{1\}across 500 sessions, and \(b\) a typed\-shared\-state pattern modeled on the LangGraphStateGraphidiom—a documented public pattern, not a workload designed to trigger anomalies—generates inspectableA1A\_\{1\}witnesses at 90/100 sessions, with 89/100 involving cross\-agent staleness\. Per\-witness inspection \(included in the released artifact\) confirms the witnesses are concrete, not measurement artifacts\. The precise scope is: “A1A\_\{1\}fires non\-trivially in cookbook patterns that share mutable state through a typed workspace\.”

Aggregated over all six cookbook scenarios,A1A\_\{1\}fires in 90 of 600 sessions \(15\.0%15\.0\\%\); the entire signal comes from the single shared\-mutable\-state scenario, the other five being exactly0/1000/100\. This is the gradient the lattice predicts—prevalence is governed by whether a workload shares mutable state across concurrently\-scheduled agents: from0%0\\%\(stateless or strictly\-staged\), through15%15\\%\(a realistic scenario mix containing one shared\-state pattern\), to≈90\\approx 90–100%100\\%\(a workload built around a shared workspace\)\.

Two structural properties jointly determine whetherA1A\_\{1\}can fire under RoundRobin scheduling: \(i\) some logical resource must admit multiple writes to the same cell across the session, and \(ii\) some agent must observe it after a write but before the next update\. Stateless tool composition fails \(i\) \(each invocation is a fresh cell\); typed\-shared\-state patterns satisfy both whenever an agent revises a slot another has read\. Read positively, this explains why current systems are largelyA1A\_\{1\}\-free—the stateless scenarios, the supervisor/hierarchical patterns, and every MAST\-surveyed framework \([Section˜5\.8](https://arxiv.org/html/2606.17182#S5.SS8)\) avoid property \(i\) via single\-writer reducers, fresh per\-call cells, or strict staging, which are exactly theL1L\_\{1\}–L4L\_\{4\}disciplines of[Section˜4](https://arxiv.org/html/2606.17182#S4)\. The lattice thus does double duty: it accounts for why today’s frameworks are safe by construction and supplies a verified guarantee for the shared\-mutable\-state regime where that incidental safety breaks down\.

What this section cannot establish\.The 90% rate is a measurement under one specific shared\-state protocol \(the planner\-executor\-monitor structure ofshared\_workspace\), one provider \(gpt\-4o\), one scheduler \(RoundRobinGroupChat\), and one framework \(AutoGen\)\. Different protocols, providers, schedulers, or frameworks will produce different rates\. The value of the present measurement is to confirm that theA1A\_\{1\}predicate fires at non\-trivial rates in cookbook\-shaped workloads,*not*that 90% is the prevalence in deployed multi\-agent systems generally\. The 90% is moreover a*structural*firing rate; the fraction of these firings that change the agent’s decision is measured separately in[Section˜5\.9](https://arxiv.org/html/2606.17182#S5.SS9)and is role\-dependent \(from0%0\\%for a pure producer to100%100\\%for an assessor\), so this rate is an upper bound on operational materiality, not a direct measure of it\. We also note a tension with the empirical findings of the MAST taxonomy\[[19](https://arxiv.org/html/2606.17182#bib.bib19)\]: across the MAST\-annotated multi\-agent failure traces from seven frameworks, MAST does not identify any failure modes that correspond directly toA1A\_\{1\}\-A6A\_\{6\}\. Two readings are consistent with this: \(a\) the catalog’s anomalies are rare in the MAST\-surveyed deployments; \(b\) MAST’s annotation methodology, which is oriented toward task\-level failure modes \(specification issues, inter\-agent misalignment, task verification\), is not calibrated to detect concurrency anomalies at the operation\-record granularity used here\. We do not have data that selects between \(a\) and \(b\); the latter is the working hypothesis behind the cookbook study, but[Section˜6\.3](https://arxiv.org/html/2606.17182#S6.SS3)identifies running the verified detector on MAST\-Data traces as a principal next step\.

### 5\.8MAST\-Data empirical run

[Section˜5\.7](https://arxiv.org/html/2606.17182#S5.SS7)identified running the verified detector on MAST\-Data\[[19](https://arxiv.org/html/2606.17182#bib.bib19)\]as the principal next step for resolving the tension between our anomaly framework and MAST’s 14 failure modes\. This subsection reports the result\.

Method\.We pulled the MAST\-Data corpus \(mcemri/MADon HuggingFace, fileMAD\_full\_dataset\.json, revision5a82e32\), which contains 1,242 annotated execution traces across 7 frameworks \(ChatDev, MetaGPT, Magentic, OpenManus, AG2, AppWorld, HyperAgent\)\.444The file is also mirrored atmcemri/MAST\-Data\. The 1,642\-trace headline reported by Cemri et al\.\[[19](https://arxiv.org/html/2606.17182#bib.bib19)\]exceeds this public release; the0/6000/600figure is a statement about the 600 of these 1,242 traces we successfully parse\.Because each framework uses a different log format \(ChatDev:\*\*Role\*\*:markdown with timestamps; MetaGPT:\[time\] FROM: X TO: Y / ACTION:lines; AG2: YAML\-likerole:/name:/content:blocks; others heterogeneous\), we built framework\-dispatched extractors that convert each trace to operation records \(python/mast\_adapter\.py\)\. Each tool call becomes either a read or a write to a synthesized cell whose identifier is derived from the tool name and slot key\. The resulting JSONL is fed to the verified detector unchanged\.

Coverage\.Of the 1,242 records, 600 \(48%\) were successfully parsed into the operation\-record format and analyzed\.

The four fully\-covered frameworks \(ChatDev, MetaGPT, Magentic, OpenManus = 585 traces\) and the two partially covered \(HyperAgent, AG2\) all report zeroA1A\_\{1\}witnesses\. AppWorld’s task\-block format was not parsed by our generic fallback; we report this as\-is rather than impute\. The AG2 13/597 yield reflects the difficulty of extracting tool calls from AG2’s YAML\-encoded conversation blocks via a single dispatched parser; reaching the full AG2 corpus would require dedicated parser engineering\. We do not include AppWorld in the rate denominator because no traces parsed\.

Finding\.*A1A\_\{1\}fired in 0 of 600 op\-record\-mapped MAST\-Data traces across all six parsed frameworks*\(exact 95% Clopper–Pearson\[0,0\.61%\]\[0,\\,0\.61\\%\], so the per\-trace upper bound is below one percent, not merely a point estimate of zero\)\. A structural audit sharpens this: only a small fraction of parsed traces contain the read\-before\-write\-on\-shared\-state structureA1A\_\{1\}requires, so the zero rate reflects the rarity of the architectural precondition in this corpus, consistent with the catalog’s scope\. Combined with the cookbook study \([Section˜5\.7](https://arxiv.org/html/2606.17182#S5.SS7): 90% onshared\_workspace, 0% on five stateless scenarios\), the picture is:A1A\_\{1\}requires shared mutable state with read\-before\-write semantics; none of the 48% of mappable MAST traces exhibited it; and the unmappable remainder is dominated by AG2 \(584 traces\), AppWorld, and HyperAgent—message\-passing or external\-state architectures whose coordination is not by concurrent reads/writes of a shared channel \(the same reason AutoGen isA1A\_\{1\}\-immune by construction,[Section˜6\.5](https://arxiv.org/html/2606.17182#S6.SS5)\)\. That the parse failure reflects structural absence rather than concealed instances is a hypothesis about the unparsed traces, resting on documented architectures, not yet checked against them; we state0/6000/600as a bound on the catalog’s applicability over the parsed subset, not a corpus prevalence figure\. The pattern*does*occur in LangGraphStateGraphunder concurrent same\-key updates \([Section˜6\.5](https://arxiv.org/html/2606.17182#S6.SS5)\)\.

Topological susceptibility and dynamic confirmation\.We complement the corpus result with a deterministic structural upper bound: cross\-agentA1A\_\{1\}has a purely structural necessary condition \(two agents in one superstep, one reading a channel the other writes\), decidable from a graph’s layers alone—a sound over\-approximation \(lib\_l2\_safety\.rs::a1\_witnesswith the value inequality relaxed\)\. Applied to sixteen canonical multi\-agent topologies,*six are structurally susceptible*\(hierarchical teams, collaboration, swarm, blackboard, competing\-agents consensus, reducer\-less map\-reduce—each sharing a mutable channel within a superstep\) and ten are immune by construction \(disjoint channels, additive reducers, or sequential staging\)\. A dynamic check then runs each susceptible topology’s unconditionally\-concurrent form under three model families \(gpt\-4o\-mini,claude\-haiku\-4\-5, open\-weightsLlama\-3\.2\): every susceptible instantiation and the positive control fire20/2020/20\(95% Clopper–Pearson\[0\.83,1\.00\]\[0\.83,1\.00\]\) while every immune topology and the negative control fire0/200/20, identically on all three\. Taken with the6/166/16static bound and the0/6000/600corpus result, this*demarcates*A1A\_\{1\}’s scope rather than estimating prevalence: absent by construction from reducer\- and fan\-out\-structured deployments, reproducible in shared\-mutable\-cell designs run concurrently\. The full topology list, the reducer\-vs\-accumulator contrast, the controls, and the per\-model results are in the online appendix \(§ C\); the deterministicprevalence\_static\.pyreproduces the counts\.

Resolution of the MAST tension\.The cookbook\-vs\-MAST tension flagged in[Section˜5\.7](https://arxiv.org/html/2606.17182#S5.SS7)is resolved by the structural difference in the surveyed regimes: MAST’s 14 failure modes characterize*task\-level*failures over conversation\-and\-tool traces, while our anomalies characterize*concurrency\-level*failures over shared\-state traces\. They do not overlap because, in the parsed 48% of the corpus, their regimes do not—the verified detector finds zero overlap there because none of the conditions our anomalies require are present in the traces we could map \(we do not extend the claim to the unparsed 52%\)\. This negative finding is a positive scope demarcation: it bounds where the prevention contracts are valuable \(shared\-mutable\-state architectures—LangGraphStateGraph, theshared\_workspacecookbook scenario, actor\-with\-shared\-memory systems\) and where they are not \(predominantly\-dialogue MAS, most of MAST’s sample\)\.

Replication\.The full extractor and analysis pipeline is atpython/mast\_adapter\.pyandpython/analyze\_production\.py\. Runningpython mast\_adapter\.py \-\-out mast\_oprecords/downloads MAST\-Data from HuggingFace, processes 600 traces into JSONL, and the analyzer reproduces Table S4 \(online appendix\)\. Themast\_rates\.jsonoutput file is included in the artifact bundle\.

### 5\.9Operational materiality of detector firings

The detector fires on a value mismatch between a read and a later committed write \([Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)\); a fair objection is that such a mismatch may be a lexical artifact uncorrelated with the agent’s behavior\. We measure that correlation directly with a counterfactual re\-prompt\. Holding an agent’s context fixed and varying only the shared value it read \(vstalev\_\{\\text\{stale\}\}versusvfreshv\_\{\\text\{fresh\}\}\), we record whether the agent’s STEP\-2 decision changes, and define the operational disagreement probabilitypop=Pr⁡\(decision changes∣flagged mismatch\)p\_\{\\text\{op\}\}=\\Pr\(\\text\{decision changes\}\\mid\\text\{flagged mismatch\}\)—the empirical counterpart of the𝑑𝑖𝑠𝑎𝑔𝑟𝑒𝑒𝑚𝑒𝑛𝑡\_𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦\\mathit\{disagreement\\\_probability\}left uninterpreted in[Section˜4\.13](https://arxiv.org/html/2606.17182#S4.SS13)\. We separate two sub\-phenomena\.*Value\-staleness*\(a stale non\-null value differs from a fresh non\-null value\) is measured in a controlled task with an exact comparator\.*Read\-before\-write staleness*\(the reader observed an unwritten cell,vstale=nullv\_\{\\text\{stale\}\}=\\textsc\{null\}\) is the form that actually occurs in our round\-robin pilot traces, and is measured on those traces using the deployed agent prompts; free\-text outcomes are scored by an LLM judge under a decision\-level rubric \(materiality means a downstream consumer would reach a different decision\), a softer signal than the categorical comparator and labeled as such\.

#### Value\-staleness \(controlled\)\.

In a triage task whose decision is categorical \(priority in \{P0,P1,P2\}\),pop=150/200=75\.0%p\_\{\\text\{op\}\}=150/200=75\.0\\%\(exact 95% Clopper–Pearson\[68\.4,80\.8\]\[68\.4,80\.8\]\)\. Divergence tracks whether the value pair straddles a decision boundary: adjacent same\-side values do not flip the decision, boundary\-crossing values almost always do, confirming the signal is operational rather than lexical\.

#### Read\-before\-write staleness \(deployed traces\)\.

Re\-prompting the real agents with\(vstale,vfresh\)\(v\_\{\\text\{stale\}\},v\_\{\\text\{fresh\}\}\)drawn from actual firings, holding co\-read cells fixed and varying only the firing cell, materiality is sharply consumption\-dependent \([Table˜V](https://arxiv.org/html/2606.17182#S5.T5)\)\. The*reviewer*, whose task is to assess the shared document, diverges on every firing \(43/43=100%43/43=100\\%,\[91\.8,100\]\[91\.8,100\]\): a stale read yields a categorically wrong review\. The*engineer*, which needs the priority but also reads the ticket, diverges about half the time \(18/33=54\.5%18/33=54\.5\\%,\[36\.4,71\.9\]\[36\.4,71\.9\]\): roughly half its resolutions are carried by the ticket regardless of the missing priority\. The*editor*, which authors a definition independent of what it read, never diverges \(0/100=0%0/100=0\\%,\[0,3\.6\]\[0,3\.6\]\)\. The triager produced only three qualifying firings in the analyzed set, too few to estimate a rate, and is omitted from[Table˜V](https://arxiv.org/html/2606.17182#S5.T5)\. Structural A1firings thus overcount operational staleness, andpopp\_\{\\text\{op\}\}quantifies the consumed fraction: of edit\-review’s 143 structural firings \(its 100% structural rate\), only the 43 reviewer firings are operationally material—a∼30%\\sim\\\!30\\%operational rate concentrated entirely in the assessor role\.

TABLE V:Operational disagreement probabilitypopp\_\{\\text\{op\}\}: fraction of detector\-flagged firings that change the agent’s decision\. The controlled task uses an exact categorical comparator; deployed\-trace free\-text outcomes are LLM\-judged at decision level \(a softer signal\)\. All deployed firings are read\-before\-write \(vstale=nullv\_\{\\text\{stale\}\}=\\textsc\{null\}\)\.SettingRole / decisionpopp\_\{\\text\{op\}\}95% CIControlledtriage priority \(categorical\)75\.0%\[68\.4, 80\.8\]Deployedreviewer \(assess doc\)100%\[91\.8, 100\]Deployedengineer \(resolve, judged\)54\.5%\[36\.4, 71\.9\]Deployededitor \(author, judged\)0%\[0, 3\.6\]
#### Scope\.

The controlled task is a proxy with an exact comparator; deployed free\-text outcomes are judged by an LLM of the same family under a decision\-level rubric and are reported as a softer, corroborating signal\. On a blind sample of 60 judged firings—the coder shown only the two free\-text outputs, not the model’s verdict—manual coding agreed with the LLM judge on 59 of 60 \(98%98\\%; Cohen’sκ=0\.96\\kappa=0\.96between judge and human, approximate 95% CI\[0\.89,1\.00\]\[0\.89,1\.00\]\), so the judge tracks a human reading of decision\-level divergence rather than a lexical artifact\. All deployed firings are read\-before\-write, so value\-staleness is exercised only by the controlled experiment\. The editor and engineer results reflect these workloads, in which the producer does not fully consume the read; they are not a claim that producer roles are universally immune\. The triager cell had only three qualifying firings, too few to estimate a rate, and is omitted from[Table˜V](https://arxiv.org/html/2606.17182#S5.T5)\.

#### Design consequence: an advisory materiality layer over the verified core\.

The role\-dependence above \(popp\_\{\\text\{op\}\}from0%0\\%for a pure producer to100%100\\%for an assessor\) bears directly on the objection that a boolean detector firing is operationally meaningless for whole classes of agents\. The verified core remains the boolean predicate of[Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)—we do not weaken its sound\-and\-complete guarantee\. We instead recommend that a deployment wrap it in an*advisory*triple\(detected,p^op,confidence\)\(\\textit\{detected\},\\,\\hat\{p\}\_\{\\text\{op\}\},\\,\\textit\{confidence\}\), wherep^op\\hat\{p\}\_\{\\text\{op\}\}is estimated from the firing agent’s role \(producer/assessor/consumer\) using exactly the measurement of this section, so an abort policy can gate onp^op\>θ\\hat\{p\}\_\{\\text\{op\}\}\>\\thetarather than on the raw firing\. This layer is deliberately unverified and outside the trust base: it changes which firings a runtime*acts*on, not which the verified detector*reports*\. It converts the role\-dependence from an objection into a runtime knob, and is the cleanest near\-term answer to the syntactic comparator critique short of the mechanized probabilistic predicate of[Section˜4\.13](https://arxiv.org/html/2606.17182#S4.SS13)\.

### 5\.10Within\-model wall\-clock cost study

The token\-instrumented cost study of[Section˜5\.6](https://arxiv.org/html/2606.17182#S5.SS6)reports relative overhead within each provider but invites the objection that any comparison*across*providers conflates runtime overhead with tokenization, pricing, and serialization differences, and that billed tokens are not the same quantity as wasted wall\-clock computation\. This subsection reports the instrument we identified as the clean resolution: a study that holds the model fixed \(gpt\-4o\), varies only the runtime strategy, and measures*wall\-clock seconds*rather than billed tokens\.

Method\.We ran 270 sessions \(python/wallclock\_cost\_study\.py\): three workloads \(edit\-review, plan\-execute, triage\)×\\timesthree strategies \(vanilla, pessimistic with a 20% per\-write abort probability, SSI with a 5% commit\-time abort probability\)×\\times30 sessions, on gpt\-4o \(and replicated on open\-weights Llama\-3\.2 below\) with deterministic mock tools so that the measured wall\-clock isolates LLM\-inference and abort\-and\-retry time from external\-service latency\. We report bootstrap 95% confidence intervals \(5000 resamples\) on the per\-cell mean wall\-clock\.

Finding\.The wall\-clock overhead of both guarded strategies over vanilla is small and*not statistically separable from API\-latency noise atn=30n=30in any of the three workloads*: every pessimistic and SSI confidence interval overlaps the corresponding vanilla interval\. The directional pattern is nonetheless the expected one\. Pessimistic overhead is largest exactly where realized contention is highest—triage, the multi\-stage workload, in which pessimistic incurred 14 aborts across 30 sessions versus 6–7 in the lower\-contention workloads—and the triage pessimistic interval \(\[4\.82, 5\.49\]\) sits almost entirely above the vanilla interval \(\[4\.57, 5\.07\]\), a\+6\.6%\+6\.6\\%point estimate\. The within\-cell token overhead mirrors this:\+5\.6%\+5\.6\\%on triage,\+3\.4%\+3\.4\\%on edit\-review,−1\.4%\-1\.4\\%on plan\-execute\. The two apparent inversions where a guarded strategy is point\-faster than vanilla \(edit\-review pessimistic and SSI\) are artifacts of API\-latency variance: vanilla drew several high\-latency outliers, and the strategies do near\-identical compute when realized abort counts are low\.

What this resolves\.Two things\. First, the within\-model design eliminates the cross\-provider confound*by construction*: every comparison in Table S5 \(online appendix\) holds gpt\-4o fixed and varies only the strategy, so tokenization, pricing, and serialization differences cannot enter\. Second, measuring wall\-clock rather than billed tokens addresses the “tokens are not wasted computation” objection directly: the real\-time overhead of the guarded strategies is single\-digit\-percent and at the noise floor, materially smaller than the billed\-token overhead figures of the cross\-model study \(up to62%62\\%on gpt\-4o triage in[Section˜5\.6](https://arxiv.org/html/2606.17182#S5.SS6)\)\. The gap between the two measurements is itself the point: billed\-token overhead overstates real\-time cost, because the additional tokens from a retried operation are largely cached\-prompt context that is processed quickly relative to a fresh generation\. The accurate conclusion is conservative and favorable to the guarded strategies: at the realized abort rates of these workloads \(≤0\.47\\leq 0\.47aborts per session even in triage\), runtime strategy choice imposes no wall\-clock cost separable from noise, and the worst\-case directional overhead is concentrated in the contention workload as expected\.

Open\-weights replication \(Llama\-3\.2\)\.We re\-ran the identical 270\-session protocol on open\-weights Llama\-3\.2 served locally through Ollama, where dollar cost is negligible and wall\-clock is the only meaningful unit \(Table S6 \(online appendix\)\)\. Per\-session wall\-clock is an order of magnitude larger than on gpt\-4o \(local inference at2424–3838s versus55–99s\), and against that latency the abort\-and\-retry overhead of both guarded strategies is invisible: every pessimistic and SSI interval overlaps the corresponding vanilla interval, with point ratios in\[0\.82,1\.04\]×\[0\.82,1\.04\]\\timesand SSI point\-faster than vanilla in all three workloads—an artifact of the large per\-call latency variance, not a real speed\-up\. The open\-weights result extends the noise\-floor conclusion to a third model family: at the realized abort rates of these workloads, serializable snapshot isolation imposes no wall\-clock cost separable from inference noise, and pessimistic locking’s overhead, bounded and provider\-dependent on the API models, is likewise at the noise floor on local Llama\-3\.2\.

Scope\.This is a two\-model\-family \(n=30n=30per cell each\) study with mock deterministic tools; it isolates runtime overhead but does not capture real\-tool latency, and atn=30n=30the API\-noise floor exceeds the overhead signal, so the study bounds the overhead from above rather than resolving its exact magnitude\. A higher\-contention workload, largernn, or real external tools would be required to separate the triage signal from noise\. Thewallclock\_results\.jsonoutput is in the artifact bundle\.

Paired re\-analysis and power\.Because the same per\-session seed was used under each strategy, a paired analysis \(matched on⟨\\langleworkload, seed⟩\\rangle\) removes between\-session variance and yields an explicit MDE atα=0\.05\\alpha\{=\}0\.05, power0\.800\.80\. On*token*cost it separates two small overheads from zero—SSI on plan\-execute \(\+91\+91tokens,∼8%\{\\sim\}8\\%,\[\+32,\+150\]\[\+32,\+150\],p=0\.007p\{=\}0\.007\) and pessimistic on triage \(\+41\+41tokens,∼5\.5%\{\\sim\}5\.5\\%,p=0\.04p\{=\}0\.04, nominal only: not surviving Holm correction across the six tests\)—while pessimistic\-on\-plan\-execute and SSI\-on\-triage are genuine nulls \(MDE≈54\\approx 54–8787tokens, below the10%10\\%effect of interest\), established as*equivalent*by a two\-one\-sided test against the pre\-declared±10%\\pm 10\\%margin\. SSI\-on\-plan\-execute is*not*equivalent \(90%90\\%CI\[\+40,\+140\]\[\+40,\+140\]vs\. a111111\-token margin: up to∼12\.6%\{\\sim\}12\.6\\%\), and edit\-review is underpowered \(MDE≈290\\approx 290tokens\)\. On*wall\-clock*time the paired MDEs \(0\.70\.7–3\.43\.4s\) exceed the candidate effects and the lone nominal cell is in the cost\-*reducing*direction \(residual latency confounding\), so we claim neither overhead nor equivalence there\. The paired analysis thus quantifies the earlier statement: token overhead is bounded and≤8%\\leq 8\\%where separable, wall\-clock overhead below the resolution of ann=30n\{=\}30design\.

### 5\.11ExecutableL2L\_\{2\}runtime: mechanized refinement and measuredA3A\_\{3\}prevention

The concern thatL2L\_\{2\}throughL4L\_\{4\}are verified models rather than runnable code is the sharpest limitation of the verification chain\. We address it forL2L\_\{2\}directly: we implement the verifiedL2L\_\{2\}causal\-tracking discipline of[Section˜4\.11](https://arxiv.org/html/2606.17182#S4.SS11)as an executable Rust runtime \(mac\-consistency\-runtime/src/l2\_causal\.rs, dependency\-free andcargo test\-runnable\)\. The runtime is a faithful realization of the Verus model—a transaction carries the same predecessor causal closure,commit\_validenforces reads\-fresh \(¬A1\\neg A\_\{1\}\) and predecessors\-clean \(¬A3\\neg A\_\{3\}\), and an abort cascades to every transaction retaining the aborted one in its closure, the disciplinelemma\_l2\_reachable\_no\_a3proves sound at the model level\.

Method\.We generate causal\-cascade workloads: a root commits a write, a chain of dependents each reads the previous cell \(acquiring the chain as predecessors\) and commits, and the root is then aborted \(saga compensation\)\. The emitted provenance trace is scored by the preciseA3A\_\{3\}predicate of[Definition˜3](https://arxiv.org/html/2606.17182#Thmanomaly3)\(a surviving, non\-aborted operation retaining an aborted predecessor\)\. We run1,0001\{,\}000independent scenarios at each chain depth∈\{2,3,5\}\\in\\\{2,3,5\\\}under theL2L\_\{2\}cascade discipline and under an unguarded baseline that aborts the root without cascading—the behavior of anL1L\_\{1\}\-class runtime\.

Result\.The unguarded baseline exhibitsA3A\_\{3\}in1000/10001000/1000scenarios at every depth; theL2L\_\{2\}runtime exhibits it in0/10000/1000, including the transitive cascades at depth 5 \(the discipline cascade\-aborted10001000,20002000, and40004000dependents at depths22,33, and55respectively\)\.L2L\_\{2\}thus moves from a verified model to a verified model*plus*an executable runtime whoseA3A\_\{3\}prevention is measured: the proof gives universality \(no reachable state admitsA3A\_\{3\}\), and the runtime gives a deployed, executed, and measured realization of that guarantee\.

Live evaluation under real LLM agents\.The twin above is dependency\-free; to exercise the same runtime in a live agent loop we replace the seeded chain with a plan–execute–revise workload driven by real models\. A planner agent commits a one\-line action plan for an ambiguous triage ticket; an executor reads the plan \(thereby acquiring the planner as a causal predecessor\) and commits a result; a supervisor agent then decides, from the ticket and plan, whether to retract the plan—the live analogue of saga compensation\. We run200200sessions on each of three model families \(gpt\-4o\-mini,claude\-haiku\-4\-5,Llama\-3\.2\), replaying each session’s content under both theL2L\_\{2\}cascade discipline and the unguarded baseline, and score every emitted provenance trace with the same preciseA3A\_\{3\}predicate of[Definition˜3](https://arxiv.org/html/2606.17182#Thmanomaly3)\. Supervisor\-driven retraction—the trigger forA3A\_\{3\}—varied sharply across models, from0%0\\%\(claude\-haiku\-4\-5, a permissive supervisor that vetoed no plan\) through15\.5%15\.5\\%\(gpt\-4o\-mini\) to44\.5%44\.5\\%\(Llama\-3\.2\), confirming that such cascades are rare\-to\-moderate under benign live operation and consistent with the scarcity the catalog finds empirically \(0/6000/600in the MAST corpus\)\. In every one of the120120retracted sessions pooled across the three models the unguarded baseline left anA3A\_\{3\}witness; theL2L\_\{2\}runtime prevented all120120\(0/1200/120, rule\-of\-three95%95\\%CI\[0,2\.5%\]\[0,2\.5\\%\]\), at a pooled executor\-liveness of80%80\\%\(84\.5%84\.5\\%,100%100\\%, and55\.5%55\.5\\%per model—the cost of cascading aborts scales with the retraction rate\)\. The prevention is structural: the executor depends on the planner because it read the plan cell, and the verified cascade removes the dependent whenever its predecessor is retracted, so the0%0\\%rate holds by construction of the verified discipline regardless of model\. The live runs therefore do not claim the model’s reasoning*causes*A3A\_\{3\}; they demonstrate the verified discipline operating under real, divergent agent behavior \(a0–44\.5%44\.5\\%spread of supervisor judgment\) at real cost and liveness, closing the gap between the model\-level proof and a running multi\-agent loop\.

Mechanized refinement of the executable runtime\.We go one step beyond execution\-and\-measurement\. A second executable artifact,lib\_l2\_exec\.rs, implements the sameL2L\_\{2\}protocol in Verus*exec mode*and is mechanically verified to refine the abstract model\. Each of the six state transitions \(new,begin,read,write,commit,abort\) carries the postconditionself\.view\(\)=stepX\(old\.view\(\),…\)\\texttt\{self\.view\(\)\}=\\texttt\{step\}\_\{X\}\(\\texttt\{old\.view\(\)\},\\dots\), whereview\(\)maps the executable,u64\-keyed runtime state into the model’sRuntimeState, and each preserves the well\-formedness predicatewf≡inv\_l2∧fresh\_ok∧keys\_contiguous\\textit\{wf\}\\equiv\\textit\{inv\\\_l2\}\\wedge\\textit\{fresh\\\_ok\}\\wedge\\textit\{keys\\\_contiguous\}\. The model’sA3A\_\{3\}\-freedom theorem \(lemma\_l2\_reachable\_no\_a3\) therefore transfers to the executable runtime as a capstone \(a3\_free\): every well\-formed runtime state isA3A\_\{3\}\-free, by construction of the transitions that produce it\. Crucially, the interning of heap identifiers asu64keys is*natively injective*—theu64→int\\texttt\{u64\}\\to\\texttt\{int\}embedding viaas intrequires no axiom—so the exec\-to\-model refinement adds*zero author\-introduced*entries to the trust base\. \(The exec state isHashMap\-backed and so relies on vstd’s standardgroup\_hash\_axioms, trusted with the Verus standard library; we add no axiom of our own\.\) This is a stronger footing than the four spec–runtime refinements of[Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7), which share two structural axioms\. The file re\-includes theL2L\_\{2\}model verbatim and verifies at 49 obligations end to end \(0 errors, 0assume, 0admit, 0 added axioms\)\.

Scope\.This is a bounded result in three respects\. First, theA3A\_\{3\}\-prevention*guarantee*is a theorem, not a measurement: the exec\-modea3\_freecapstone proves every well\-formedL2RuntimestateA3A\_\{3\}\-free with well\-formedness preserved, so the verified runtime cannot emit anA3A\_\{3\}witness\. The measuredl2\_causal\.rstwin and the verifiedlib\_l2\_exec\.rsare two artifacts of one protocol; the twin exhibits the unguarded baseline and corroborates the proof, and the guarantee does not depend on it\. Second, both are sequential, under the same assumed\-mutex model as the rest of the development \(the threeRUSTBELT\_OBLIGATIONstubs of[Section˜4\.14](https://arxiv.org/html/2606.17182#S4.SS14)remain the residual concurrency trust base\)\. Third, the workload is a synthetic cascade against our own unguarded baseline, so the measurement shows the discipline*executes and preventsA3A\_\{3\}as proved*, not that uncompensated cascades occur at this rate in the wild \(the prevalence question of[Section˜5\.8](https://arxiv.org/html/2606.17182#S5.SS8)is separate\)\. The same two\-artifact closure extends toL3L\_\{3\}andL4L\_\{4\}:lib\_l3\_exec\.rs\(7\) verifies an executable sequencer whose invariant pins emitted order to issuance order \(A6A\_\{6\}\-freedom by invariant\), andlib\_l4\_exec\.rs\(9\) verifies a registry\-snapshot operation whose dispatched signature provably equals its pinned one \(A2A\_\{2\}\-freedom by construction\); both are self\-contained \(zero axioms\) and carry dependency\-free twins measured under adversarial schedules at0/10000/1000vs\.1000/10001000/1000\(widths22–1616\), with completeness guards confirming prevention is not by dropping effects\. The grade distinction is explicit:L2L\_\{2\}/L3L\_\{3\}/L4L\_\{4\}are exec\-verified and twin\-measured under synthetic schedules, not run under live agents; the pilot deploysL0L\_\{0\}/L1L\_\{1\}live \([Section˜5\.5](https://arxiv.org/html/2606.17182#S5.SS5)\)\. ForA6A\_\{6\},[Section˜5\.12](https://arxiv.org/html/2606.17182#S5.SS12)adds an intermediate grade of evidence—the anomaly and itsL3L\_\{3\}prevention demonstrated against a released framework on real model\-emitted call orders, with tool effects modeled\.

### 5\.12An in\-the\-wildA6A\_\{6\}: tool\-effect reordering in LangGraph’sToolNode

TheL2L\_\{2\}result above measures prevention against a synthetic adversary\. ForA6A\_\{6\}we add a different and, in one respect, stronger kind of evidence: the anomaly arises in a released, widely used agent framework, driven by real model\-emitted call orders, and theL3L\_\{3\}sequencing discipline of[Section˜4\.15](https://arxiv.org/html/2606.17182#S4.SS15)removes it\. We separate what is real from what is modeled throughout, because that line is exactly what bounds the claim\.

The defect\.LangGraph’sToolNode—the standard primitive that executes the tool calls an assistant turn emits—dispatches a turn’s calls concurrently viaasyncio\.gather\.555The parallel dispatch is documented upstream \(langchain\-ai/langgraphissue \#6624, as of June 2026\) and confirmed here by reproduction\.gathercollects*results*in input order but does not serialize the calls’*side effects*: each coroutine externalizes its effect when it completes\. So when an operation issues\|𝑖𝑜\|≥2\|\\mathit\{io\}\|\\geq 2order\-dependent calls, the committed effect order𝑐𝑜\\mathit\{co\}is the completion order, and𝑐𝑜≠𝑖𝑜\\mathit\{co\}\\neq\\mathit\{io\}whenever completion does not match issuance—an instance of[Definition˜4](https://arxiv.org/html/2606.17182#Thmanomaly4)in deployed code\. The framework exposes no option to preserve issuance order; its own issue tracker records a user requiring exactly this for stateful tools and unable to obtain it\.666langchain\-ai/langgraphjsissue \#861 \(“ToolNode should support executing tools sequentially rather than in parallel”\), as of June 2026\.This is precisely the case[Section˜3\.5](https://arxiv.org/html/2606.17182#S3.SS5)singles out: the effects are external and irreversible, so an out\-of\-order commit is not undoable by compensation\.

The intended order is genuine, and the reorder is latency\-driven\.We reproduced the defect on the releasedToolNode\(langgraph1\.2\.5\) with an order\-dependent pipeline of five tools \(create\_database→\\torun\_migrations→\\toseed\_reference\_data→\\tobuild\_search\_index→\\toenable\_live\_traffic\) under a natural prompt that states the task and the tools with no instruction to parallelize\. The model emits the calls in canonical order within a single turn; that emitted sequence is the issuance order𝑖𝑜\\mathit\{io\}of[Definition˜4](https://arxiv.org/html/2606.17182#Thmanomaly4), not an order we imposed\. Two controls isolate the mechanism\. \(i\) Under*equal*per\-tool latency the committed order equals issuance in every run—a uniform\-latency control yielding0reorderings—so the reorder is a function of latency*differences*, which real tools always exhibit \(the stateful UI actions of issue \#861; a cache read beside an external write\)\. \(ii\) In every trace the committed order is the latency argsort,𝑐𝑜=argsort⁡\(latency\)\\mathit\{co\}=\\operatorname\{argsort\}\(\\text\{latency\}\), checkable per run\. With the fastest call committing first, the harness routed live traffic to a database it had not yet created\.

Whether a model triggers it varies sharply—which is the argument forL3L\_\{3\}, not against it\.The trigger is whether a backbone batches order\-dependent calls into one turn, and that behavior is model\- and version\-dependent, not something a deployment controls\. Under the natural prompt,gpt\-4o\-minibatched in30/3030/30turns and reordered in2222\(including all1818turns in which it batched the full five\-step pipeline\);llama\-3\.2batched in3/303/30\(reordering22\); andclaude\-haiku\-4\-5batched in0/300/30—across6060turns under two prompt conditions it issued one call per turn and never exhibitedA6A\_\{6\}\. A backbone that batches walks into the defect; one that sequences sidesteps it\. The point is not that one model is wrong, but that effect\-order correctness would otherwise rest on a property no specification pins down: the same task yieldsA6A\_\{6\}on one backbone and not another\. The reordering rate scales with batch width \(gpt\-4o\-mini:4/124/12at width two,18/1818/18at width five\), confirming that𝑐𝑜\\mathit\{co\}tracks completion rather than issuance\. We report these per model and deliberately do not pool them into a single headline rate; the rate is a property of the workload and the backbone, not of the framework alone\.

The anomaly is invisible in the transcript\.In all batched turns the returnedToolMessagelist preserved issuance order \(33/3333/33\), becausegatherorders*results*by input position while the*effects*have already committed out of order\. The conversation history a developer would inspect shows the correct order; the external state does not\.A6A\_\{6\}is thus undetectable by reading the artifact most operators check, which is why a runtime\-level discipline—rather than a prompt instruction or a transcript audit—is the appropriate remedy\.

TheL3L\_\{3\}discipline removes it\.Replacing the defaultToolNodewith the commit\-order sequencer of[Section˜4\.15](https://arxiv.org/html/2606.17182#S4.SS15)\(TheoremL3L\_\{3\}g\)—issue the emitted calls strictly in𝑖𝑜\\mathit\{io\}order, awaiting each before the next—yields𝑐𝑜=𝑖𝑜\\mathit\{co\}=\\mathit\{io\}on every model\-emitted batch:0/330/33reorderings across the multi\-call turns that produced2424reorderings under the default executor \(rule\-of\-three95%95\\%CI\[0,9\.1%\]\[0,9\.1\\%\]\)\. The sequencedToolNodeis a drop\-in realization of theL3L\_\{3\}saga discipline against a deployed framework, and it makes effect\-order correctness independent of which backbone is in use—the very property the cross\-model spread shows cannot otherwise be assumed\.

Scope\.This is one framework and an existence\-and\-mechanism result, not a prevalence estimate, and three boundaries are explicit\. First, the tool*effects and latencies are modeled*\(asyncio\.sleep\); the executor and the call orders are real, but we do not measure any one deployment’s latency distribution, so the robust finding is the structural defect and its transcript\-invisibility, not the73%73\\%figure, which is conditioned ongpt\-4o\-minibatching the full pipeline\. Second, the measurement is*single\-turn*; the cross\-turn picture is consistent, since the backbones that avoid the defect do so by sequencing within their first turn\. Third,*generality is untested*: theasyncio\.gatherpattern is common and issue \#861 reports the same class in a second LangGraph implementation, but we claim only what is measured here\. What is robust is that a released tool executor admits[Definition˜4](https://arxiv.org/html/2606.17182#Thmanomaly4)on unmodified model output, that the violation is invisible in the transcript, and that theL3L\_\{3\}sequencer eliminates it\.

### 5\.13Threats to validity

The real\-LLM pilot of[Section˜5\.3](https://arxiv.org/html/2606.17182#S5.SS3)reports workload\-engineered rates and should not be read as a general prevalence claim: the 100% rate in edit\-review is structurally guaranteed by the workload design, the 1% rate in plan\-execute reflects deliberate suppression via sequential dependency, and the 35% rate in triage is one observation point in a 3\-agent pipeline regime\. Sample size isN=100N=100per workload, no confidence intervals are reported, and a single model \(gpt\-4o\) is used for the stale\-read prevalence measurements; the cost analysis of[Section˜5\.6](https://arxiv.org/html/2606.17182#S5.SS6)is replicated across gpt\-4o and Claude Sonnet 4\.5 but the stale\-read rates themselves are not\. The cookbook\-derived study of[Section˜5\.7](https://arxiv.org/html/2606.17182#S5.SS7)addresses the most acute version of this threat: it reportsA1A\_\{1\}rates under unmodifiedRoundRobinGroupChaton six scenarios drawn from public LangGraph and AutoGen example templates, with the framework code untouched and only the chat\-completion client wrapped to capture trace events\. The 0/500 result for stateless scenarios and the 90/100 \(any\-agent; 89/100 cross\-agent\) result for the typed\-shared\-state scenario are*not*workload\-engineered, and the cross\-agent witnesses are inspectable in the released trace bundle\. However, even the cookbook study cannot establish prevalence in deployed production multi\-agent systems: it samples six scenarios on one provider with one scheduler, and the 90% rate is specific to the planner\-executor\-monitor protocol used inshared\_workspace\. Further replication with longer\-running tasks, larger agent counts, alternative frameworks \(CrewAI, LangGraph at runtime\), additional models \(open\-weights inference such as Llama or Mistral\), and instrumentation of real deployments is identified as principal empirical follow\-up\.

Cost\-metric limitations and silent failures\.The cost analysis of[Section˜5\.6](https://arxiv.org/html/2606.17182#S5.SS6)reports per\-session token counts, the metric the LLM providers expose through usage callbacks\. This is an incomplete proxy for the cost\-asymmetry argument in two specific ways\. First, an aborted commit after a 30\-second generation is*wall\-clock waste*that is not fully captured by the token count, since the inference phase has already consumed compute regardless of whether the commit succeeds\. A within\-model wall\-clock measurement \(same model, different runtime strategies\) isolates the runtime\-overhead question; we report one in[Section˜5\.10](https://arxiv.org/html/2606.17182#S5.SS10), and atn=30n=30it bounds the overhead from above \(single\-digit\-percent, at the API\-noise floor\) rather than resolving its exact magnitude\. The token\-cost finding should accordingly be read as “no order\-of\-magnitude token\-cost asymmetry” rather than “no runtime\-cost asymmetry\.” Second, snapshot isolation’s reported residual 3%A1A\_\{1\}rate in the triage workload \([Table˜III](https://arxiv.org/html/2606.17182#S5.T3)\) reflects read\-only operations that bypass SI validation; these are*silent*—they succeed from the runtime’s perspective and consume tokens normally—but produce stale outputs\. The between\-session cost\-parity finding therefore cannot be read as “SI is operationally equivalent to vanilla”; it is “SI’s per\-session token cost overlaps with vanilla’s, modulo a silent residual anomaly rate that does not appear in the cost metric\.”

Cross\-model cost comparison is confounded\.Pricing, tokenization density, tool\-call serialization, and retry budgets differ between gpt\-4o and Claude, so the magnitude difference in pessimistic overhead \(1\.6×1\.6\\timesvs\.2\.3×2\.3\\times\) cannot be attributed to runtime strategy alone; we make only the within\-model claims of[Section˜5\.6](https://arxiv.org/html/2606.17182#S5.SS6)and the within\-model wall\-clock design of[Section˜5\.10](https://arxiv.org/html/2606.17182#S5.SS10)isolates the runtime effect \(bounded from above atn=30n=30\)\.

The operational model itself — single shared key\-value store, discrete monotonic clock, in\-process agents — abstracts away network partitioning, replica divergence, and asynchronous tool execution, whose consequences are discussed in[Section˜6\.2](https://arxiv.org/html/2606.17182#S6.SS2)\.

Model\-faithfulness: refined for three strategies, under disclosed assumptions\.The detector pipeline and runtime safety theorems are stated against abstract models whose types differ from the executable Rust runtime \(Set\(𝐶𝑒𝑙𝑙𝐼𝑑\)\\mathrm\{Set\}\(\\mathit\{CellId\}\)/𝑀𝑎𝑝\\mathit\{Map\}versusVec<String\>/BTreeMap\); for all three strategies the correspondence is established by the mechanically\-verified refinements of[Section˜4\.8](https://arxiv.org/html/2606.17182#S4.SS8)\(84 obligations, four files, two foundational axioms\), with SSI refined against both the version\-chain and projection representations\. Two caveats apply: the two foundational axioms \(string\-identifier injectivity, null\-sentinel mapping\), and sequential\-semantics scope—the refinements describe one logical operation at a time, lifted to multi\-threaded execution by mutex correctness, with a linearizability/separation\-logic treatment left open\.

Concurrency\-model granularity\.The verified refinement and safety theorems hold for sequential interleavings of atomic begin/commit \(and abort\) transitions, lifted to the multi\-threaded runtime by mutex correctness as a trust\-base assumption; finer\-grained internal concurrency is not captured by design\.[Section˜4\.10](https://arxiv.org/html/2606.17182#S4.SS10)provides a bounded RC11 check of the lock protocol \(GenMC, small agent counts, with a non\-vacuity control\), and unbounded thread\-level treatment—linearizability against a relaxed memory model or a relational separation logic—remains follow\-up work \([Section˜6\.3](https://arxiv.org/html/2606.17182#S6.SS3)\)\.

## 6Discussion

### 6\.1Implications for system designers

The lattice provides a vocabulary in which existing multi\-agent runtimes may be located\.[Section˜6\.4](https://arxiv.org/html/2606.17182#S6.SS4)provides substantive informal placements of three contemporary runtimes; formal mappings, requiring runtime\-specific operational formalizations and refinement proofs against the predicates of[Section˜3](https://arxiv.org/html/2606.17182#S3), are future work\.

What we offer in addition to the deployed\-system placements is the following observation: the placements are predominantly at the lower points of the lattice, withL3L\_\{3\}\-class guarantees achieved only under restricted workload assumptions\. The lattice thus serves both as a vocabulary for naming current capabilities and as a map of the design space that is yet to be filled\. Future runtimes targetingL3L\_\{3\}orL4L\_\{4\}for unrestricted workloads constitute open engineering territory\.

### 6\.2Limitations

*Operational model is a proposal, not a derivation\.*The model of[Section˜2\.1](https://arxiv.org/html/2606.17182#S2.SS1)is our proposal for studying long\-running concurrent operations under inference\-bounded latency\. It does not derive from the operational behavior of any specific deployed runtime; deployed systems use heterogeneous abstractions \(message\-passing, state graphs, delegation\)\. Validation of the model against a broader range of runtimes is future work\.

*Deterministic LLM assumption\.*We treat each LLM invocation as deterministic in its inputs;[Section˜2\.2](https://arxiv.org/html/2606.17182#S2.SS2)discusses why this assumption is load\-bearing and how the empirical pilot mediates between the deterministic specification and stochastic deployment\.

*Atomic\-commit\-at\-record\-level assumption\.*Per\-𝑂𝑝𝑅𝑒𝑐𝑜𝑟𝑑\\mathit\{OpRecord\}commit is atomic; within\-record tool\-effect ordering is observable and is exactly whatA6A\_\{6\}formalizes \([Section˜2\.2](https://arxiv.org/html/2606.17182#S2.SS2)\)\. A fully streaming model in which individual writes externalize before record commit is future work\.

*No replication\.*The shared memory is a single logical store\. Replicated stores admitA4A\_\{4\}\(split\-view\), which is identified but not formalized here\.

*No Byzantine or adversarial agents\.*We assume agents are non\-adversarial and that a logged write reflects a genuine commit\. A hallucinated or malicious write is, at the trace level, indistinguishable from a stale read by another agent, but defending against it is a fault\-tolerance and agent\-correctness concern\[[82](https://arxiv.org/html/2606.17182#bib.bib82)\]orthogonal to the concurrency anomalies formalized here; we do not address it\.

*Chosen\-chain ordering by construction, not by empirical validation\.*The placement of each anomaly at its level is defended structurally in[Section˜4\.2](https://arxiv.org/html/2606.17182#S4.SS2)but is not validated against empirically observed runtime behavior\. Doing so would require implementing reference runtimes at each level and measuring the failure modes they exhibit under contention\.

*TLC verification at small bounds\.*All TLC runs in[Section˜3](https://arxiv.org/html/2606.17182#S3)use\|A\|=2\|A\|=2,\|𝐶𝑒𝑙𝑙𝑠\|≤2\|\\mathit\{Cells\}\|\\leq 2,𝑀𝑎𝑥𝑂𝑝𝑠≤4\\mathit\{MaxOps\}\\leq 4\. The largest exhaustive run isMC\_CodeCRDT\_RYWat 9\.3 million distinct states; expanding bounds to\|A\|=3\|A\|=3,𝑀𝑎𝑥𝑂𝑝𝑠=6\\mathit\{MaxOps\}=6pushes the state space beyond 150 million distinct states without exhausting in a 10\-minute budget\. Generalization to larger configurations would require either substantially more compute, symmetry reductions on the agent relation, or symbolic verification techniques beyond TLC’s explicit\-state model checking\.

*Two\-provider evaluation; broader model\-family replication deferred\.*The synthetic baseline \([Section˜5\.4](https://arxiv.org/html/2606.17182#S5.SS4)\) covers 700 traces across three runtimes; the real\-LLM baseline \([Section˜5\.5](https://arxiv.org/html/2606.17182#S5.SS5)\) covers 1,800 sessions across three workloads and three runtimes on two model families \(gpt\-4o, Claude Sonnet 4\.5\)\.[Section˜5\.6](https://arxiv.org/html/2606.17182#S5.SS6)reports measured per\-session cost with bootstrap CIs on both providers\. The cost baseline does not exercise open\-weights models; the dynamic prevalence discrimination \([Section˜5\.8](https://arxiv.org/html/2606.17182#S5.SS8)\) additionally includes the open\-weightsLlama\-3\.2, though for susceptibility only, not cost\. The pilot does not exercise replicated stores, network partitions, or the streaming\-write model that would surfaceA4A\_\{4\}andA6A\_\{6\}at prevalence\. A defensible empirical claim about cost\-of\-prevention across three\-or\-more model families \(extending to local open\-weights inference such as Llama or Mistral\) and deployment patterns is the principal empirical follow\-up\.

*High\-contention cost sweep is synthetic and single\-model\.*The contention sweep of Finding 5 \([Section˜5\.6](https://arxiv.org/html/2606.17182#S5.SS6)\) uses a parameterized microbenchmark—WWagents contending onCCcells—rather than a deployed workload, and runs on gpt\-4o\-mini only\. Its token costs are exact, but its wall\-clock figures are composed from measured per\-call latencies under each discipline’s concurrency structure, not measured under a concurrent runtime\. The envelope it characterizes \(cost linear in the abort rate, sub\-15%15\\%below the breakpoint\) is thus a property of the cost model under controlled contention; whether deployed workloads sit at the low\-abort end of that envelope is the prevalence question below, which this sweep does not settle\.

*Anomaly prevalence in production deployments is not established\.*Establishing that the formalized anomalies occur in production multi\-agent systems at any non\-trivial rate would itself require instrumentation of deployed runtimes, ethically\-cleared trace collection at scale, and analysis pipelines that distinguish the concurrency anomalies we formalize from the cognitive failures cataloged elsewhere\[[19](https://arxiv.org/html/2606.17182#bib.bib19)\]\. The 1%, 35%, 100% rates we report are workload\-engineered, not production\-observed\.

*Cost asymmetry not modeled\.*LLM inference cost asymmetry is acknowledged in[Section˜2\.2](https://arxiv.org/html/2606.17182#S2.SS2)but not modeled in the lattice\. Optimistic concurrency control with abort\-and\-retry, which Spanner\-class systems rely on forL1L\_\{1\}\-equivalent guarantees, becomes substantially more expensive in the agent setting\. The lattice treats consistency guarantees as predicates without considering cost\.

### 6\.3Future work

The most consequential extensions are: \(i\) extension to non\-sequential execution, requiring a linearizability or separation\-logic methodology to handle internal thread\-level concurrency that the current Mutex\-based trust\-base assumption discharges informally; \(ii\) formalization ofA4A\_\{4\}within an extended replication\-aware model; \(iii\) extension to stochastic LLM outputs; \(iv\) empirical validation of the chosen\-chain ordering against deployed runtimes; \(v\) cost\-aware refinement of the lattice recognizing that abort\-and\-retry mechanisms are expensive in the agent setting; and \(vi\) identifying or constructing a deployed framework that*pins*a read snapshot across the generation phase rather than refreshing it: the frameworks we examined \(LangGraph, Letta\) refresh state at generation and thereby close the cross\-agentA1A\_\{1\}window by construction, so observing a genuine cross\-agentA1A\_\{1\}in the wild requires a runtime that holds the snapshot across the inference phase—locating or building one is the missing ingredient for testing whether the window ever arises in practice rather than, as our evidence indicates, remaining architecturally confined\.

### 6\.4Illustrative discussion: locating contemporary runtimes on the chosen chain

As illustrative discussion \(not a formal contribution; placements are approximate, from published descriptions\), we locate three contemporary runtimes on the chain\. SagaLLM\[[1](https://arxiv.org/html/2606.17182#bib.bib1)\]achieves\{A3\}\\\{A\_\{3\}\\\}\(off\-chain\) on compensable workloads via saga compensation; Atomix\[[2](https://arxiv.org/html/2606.17182#bib.bib2)\]reachesL3L\_\{3\}\(\{A1,A3,A6\}\\\{A\_\{1\},A\_\{3\},A\_\{6\}\\\}\) in its strongest explicit\-dependency configuration and weaker off\-chain sets otherwise; and CodeCRDT\[[18](https://arxiv.org/html/2606.17182#bib.bib18)\]sits atL0L\_\{0\}under our model—*not*a quality ranking, since CRDT merge\-convergence is an axis orthogonal to the serializability lattice \([Section˜7\.8](https://arxiv.org/html/2606.17182#S7.SS8)\), so readingL0L\_\{0\}as “worst” is a category error\. Of the five distinct placements, two are on\-chain and three off\-chain—itself informative: the chain cannot name every guarantee real runtimes provide, which is why the lattice is primary and the chain one linearization\. None reachesL4L\_\{4\}\(registry\-stability is absent from all three\)\. The full per\-runtime justifications and the supporting TLC runs are in the online appendix \(§ D\)\.

### 6\.5Occurrence and susceptibility in deployed frameworks

The anomaly classes of[Section˜3](https://arxiv.org/html/2606.17182#S3)are not confined to our model: the lowest lattice pointL0L\_\{0\}—unsynchronized shared mutable state—is the*default*coordination mode of the most widely deployed multi\-agent framework, LangGraph, and its anomalies are observable, reproducible, and reported in the wild\. We probed this with three complementary, fully reproducible artifacts \(deterministic, with no LLM calls, runnable offline; identical results on LangGraphv1\.2\.0andv1\.2\.4\)\. We do not run our Verus\-verified detector pipeline against production deployments—that refinement remains the principal follow\-up \([Section˜2\.2](https://arxiv.org/html/2606.17182#S2.SS2)\)—so what follows is occurrence\-and\-susceptibility evidence for one framework, not a production\-trace prevalence measurement\.

#### Mechanism: the default channel admits theL0L\_\{0\}anomaly class\.

LangGraph executes parallel nodes in Pregel supersteps over shared state channels\. When a state key carries no reducer \(the default\), two behaviors arise\. \(i\) Two nodes writing the same key in one superstep fail\-stop withINVALID\_CONCURRENT\_GRAPH\_UPDATE\[[37](https://arxiv.org/html/2606.17182#bib.bib37)\]\(“Can receive only one value per step”\)\. \(ii\) A node that reads a key concurrently updated by another, or a downstream node whose partial update overwrites an accumulated value, silently commits a stale or clobbered result—no error is raised\. Our artifact reproduces both on an unmodified LangGraph runtime with deterministic nodes: a parallel edit/summarize graph commits a summary generated from a superseded snapshot \(a stale\-generation witness in theA1A\_\{1\}family, raising nothing\), and a two\-writer graph fail\-stops\. Declaring the key with a merge reducer \(e\.g\.Annotated\[list, op\]\) removes both: this is exactly theL0→L1L\_\{0\}\\to L\_\{1\}ascent the lattice formalizes—a different, stronger consistency contract, not a patch\. The structural fail\-stop check is moreover neither sound nor complete for ourA1A\_\{1\}predicate: LangGraph issue \#6446\[[38](https://arxiv.org/html/2606.17182#bib.bib38)\]reports it firing when a parallel sub\-graph updates a*different*key \(a false positive\), while the silent variant \(ii\) shows it is also incomplete\. A verified predicate\-level specification of the kind we provide \([Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)\) is a candidate oracle against which such framework checks could be validated\.

#### Occurrence: the detectable variant in real third\-party code\.

The fail\-stop variant raises a fixed error string, so its incidence is measurable\. A reproducible GitHub query over issues and pull requests matching the two exact strings LangGraph emits777INVALID\_CONCURRENT\_GRAPH\_UPDATE\(50 issues/PRs\) and “Can receive only one value per step” \(56\); the union is 44 repositories, of which 34 are third\-party \(non\-langchain\-ai\)\. Both totals fall below GitHub’s per\-query page limit, so the union is complete for these strings rather than a sample\. The query and captured results are included in the artifact\.returns34 distinct third\-party repositoriescontaining an issue or pull request that mentions one of the error strings\. They include widely\-starred projects \(bytedance/deer\-flow,FlowiseAI/Flowise,CopilotKit/CopilotKit\) alongside a long tail of smaller ones; we report the repository count as an*occurrence*signal only—a star count measures a project’s popularity, not how often the anomaly fires, and a repository mentioning the error string is evidence the fail\-stop was reached there, not a prevalence rate\. Because the silent variant emits nothing, it is absent from any such count, so34 is a lower bound on occurrenceand the undetectable variant is the more dangerous of the two\.

#### In the wild: the silent variant, reproduced from a live bug\.

The silent variant is not merely a possibility we constructed\. ByteDance’s deer\-flow carries open issue \#3123\[[42](https://arxiv.org/html/2606.17182#bib.bib42)\]\(filed May 2026, labeled*help wanted*\): a todo list is visible during streaming but*disappears*once the run completes\. The project’s own regression test attributes the cause to a downstream node’s partial state update withtodos=Noneoverwriting the accumulated value under the default last\-write\-wins channel—a silent lost\-update / stale\-overwrite in theL0L\_\{0\}class \(the surviving state is stale with respect to the accumulated value; we do not claim it instantiates the exactA1A\_\{1\}temporal predicate, only its staleness family\)\. Our artifact reproduces \#3123 on an unmodified runtime using deer\-flow’s*own*reducer: without it, two seeded todos are silently lost when a downstream node emitstodos=None; with it, both survive\. The fix history is itself instructive about why a consistency lattice is useful: the reducer patch \(\#3180\[[43](https://arxiv.org/html/2606.17182#bib.bib43)\]\) in turn collided with LangChain’sTodoListMiddleware, which declares the sametodoschannel with an incompatible type \(Channel ’todos’ already exists with a different type, open issue \#3199\[[44](https://arxiv.org/html/2606.17182#bib.bib44)\]\); reverting the reducer reintroduces \#3123\. The resolution is to enforce a single consistent channel contract\. The full in\-the\-wild arc—default⇒\\Rightarrowsilent loss⇒\\Rightarrowreducer ascent⇒\\Rightarrowchannel\-contract conflict⇒\\Rightarrowunify the contract—is a production instance of exactly the reasoning the lattice is built to support, including that agreeing the consistency discipline*across components*is the hard part\.

#### Formalized: the reducer fix as a verifiedL0→L1L\_\{0\}\\rightarrow L\_\{1\}refinement\.

The placements of[Section˜6\.4](https://arxiv.org/html/2606.17182#S6.SS4)are informal, but theL0→L1L\_\{0\}\\rightarrow L\_\{1\}edge—the one deer\-flow \#3123 exercises—we discharge formally\. A core Verus model of LangGraph’s per\-superstep channel update, parameterized by the reducer, proves \(machine\-checked\) that the additive channel’s value equals the initial value followed by the concatenation of every payload, so no committed contribution is ever lost—theL1L\_\{1\}guarantee—while a concrete two\-superstep execution exhibits the default last\-write\-wins channel silently dropping a contribution \(L0L\_\{0\}; deer\-flow \#3123 in miniature\) \(lib\_langgraph\_refinement\.rs; 7 verified, 0 axioms\)\. This turns the reducer fix from an anecdote into a refinement theorem against the most\-used framework’s own channel semantics\. The construction climbs one edge further: a versioned\-channel model under optimistic concurrency \(Orleans\-style ETag, as AutoGen’s persistence layer adopts\) refinesL1→L2L\_\{1\}\\rightarrow L\_\{2\}, proving version monotonicity and that any channel\-changing write read the current head \(no stale\-grounded write\), with a witness the version\-unaware channel accepts but the OCC channel rejects \(lib\_occ\_l2\_refinement\.rs; 8 verified, 0 axioms\)\. Two runtime edges are thus verified refinements against real framework mechanisms; the higher edges remain future work\.

#### A second system: Letta shared memory\.

TheL0L\_\{0\}class is not specific to LangGraph’s channel reducers\. Letta, the production successor to MemGPT\[[76](https://arxiv.org/html/2606.17182#bib.bib76)\], lets several agents share a single memory block, and its own documentation warns that concurrent edits lose updates, advising each agent write a separate section—theL0→L1L\_\{0\}\\rightarrow L\_\{1\}discipline by another name\. We ran the unguarded default: on a self\-hosted Letta server, four agents attached to one shared block were asked concurrently to read the shared plan and append their next step \(gpt\-4obackbone, ordinary collaborative\-editing prompt; harnesspython/letta\_a1\_probe\.py\)\. Across six rounds of four writers, 18 of 24 appends were silently lost to last\-writer\-wins\. This is the same silentL0L\_\{0\}lost\-update as deer\-flow \#3123, reproduced in an independent deployed system of a different architectural family\. We classify it asL0L\_\{0\},*not*a cross\-agentA1A\_\{1\}witness, and the reason is architectural rather than observational: Letta re\-renders the*current*block into each agent’s context at every generation step, so no commit is grounded in a superseded read and[Definition˜1](https://arxiv.org/html/2606.17182#Thmanomaly1)’s precondition cannot persist; the only anomaly concurrent load produces is the write\-side clobber we measure\. We attempted to certifyA1A\_\{1\}by intercepting the model call and report the negative outcome \(the per\-step refresh leaves no persistent stale read, and modern backbones bypass a drop\-in proxy\)\. We read this as a second, architecture\-level corroboration of confinement\. Whether*any*deployed framework leaves the window open—by*pinning*a read snapshot across a long generation rather than refreshing it—is the sharp form of the open empirical question; we have not found one that does\.

#### Architectural contrast: AutoGen\.

The susceptibility tracks the coordination architecture, not the workload\. AutoGen uses an actor model: each agent owns its state and agents communicate by asynchronous message passing, so the shared\-mutable\-state lost\-update class does not arise from a built\-in channel by construction\. The corresponding AutoGen pain is fragmented conversational state—a community discussion\[[39](https://arxiv.org/html/2606.17182#bib.bib39)\]describes users building propose/validate/commit layers over the message log \(an application\-level OCC\), and AutoGen’s own state\-persistence work\[[40](https://arxiv.org/html/2606.17182#bib.bib40)\]adopts Orleans\-style ETag\-based optimistic concurrency\. Shared\-mutable\-state \(LangGraph\) versus message\-passing actors \(AutoGen\) is precisely theL0L\_\{0\}\-versus\-higher axis the lattice formalizes: the framework that defaults to unsynchronized shared state is the one whose users file the conflicts\. We also record one adjacent atomicity gap our model does*not*capture: AutoGen issue \#7043\[[41](https://arxiv.org/html/2606.17182#bib.bib41)\]describes a cross\-record persistence failure \(a completed agent’s output is logged but the next\-agent queue is not enqueued atomically\), related to but outsideA6A\_\{6\}’s within\-record reordering; our per\-record\-atomicity assumption \([Section˜2\.2](https://arxiv.org/html/2606.17182#S2.SS2)\) would need relaxing to formalize it\.

#### What this establishes, and what it does not\.

The reproductions show that LangGraph’s default semantics admit theL0L\_\{0\}class \(fail\-stop and silent\) on an unmodified runtime, that the reducer discipline is theL0→L1L\_\{0\}\\to L\_\{1\}ascent, that≥34\\geq 34third\-party projects have hit the detectable variant, that deer\-flow \#3123 is a live silent instance reproduced from the project’s own code, and that Letta exhibits the same loss in a different architectural family\. It does*not*establish a production anomaly*rate*\(we measure occurrence and susceptibility, not the fraction of runs affected\), nor that our model refines either framework in full—only theL0→L1L\_\{0\}\\to L\_\{1\}andL1→L2L\_\{1\}\\to L\_\{2\}edges are discharged formally \(against models of the channel reducer and Orleans\-style ETag concurrency\), the rest being future work\. The bounded claim the evidence supports: the lattice’s lowest point is the default of the most\-used framework, its anomalies are real and reproducible there, and the discipline that escapes them is the ascent the lattice defines\.

### 6\.6Reference Rust runtime

We provide a reference crate,mac\-consistency\-runtime, implementing the three runtimes \(vanilla, pessimistic, SSI with optional mode\) over a commonStoretrait, emitting𝑂𝑝𝑅𝑒𝑐𝑜𝑟𝑑\\mathit\{OpRecord\}traces classified by the ported verified detectors; its integration tests reproduce the behaviors of[Section˜5\.5](https://arxiv.org/html/2606.17182#S5.SS5)\(vanilla exhibitsA1A\_\{1\}; pessimistic and SSI eliminate it; default\-SI misses the read\-only no\-write gap that SSI closes\)\. The runtime is augmented by machine\-checked¬A1\\neg A\_\{1\}safety proofs for all three strategies—theorem\_pessimistic\_prevents\_a1andtheorem\_ssi\_prevents\_a1\(unconditional\) andtheorem\_default\_si\_conditional\_prevents\_a1\(under the all\-writers hypothesis\)—spanning 40 obligations across three files \(23 pessimistic, 8 SSI, 9 default\-SI\), with zeroassume/external\_body/added axioms\. The default\-SI result is a Verus characterization of*which workload class*the strategy is safe on \(thevalidate\_no\_writetoggle: run SSI if the all\-writers workload cannot be guaranteed\), not an unconditional guarantee—the no\-write bypass admitsA1A\_\{1\}by design \([Table˜IV](https://arxiv.org/html/2606.17182#S5.T4)\)\. The three theorems hold over sequential atomic interleavings, lifted to the multi\-threaded runtime by mutex correctness \([Section˜4\.10](https://arxiv.org/html/2606.17182#S4.SS10)\); the residual isstd::sync::Mutexconformance, with weak memory, liveness, and poisoning as explicit gaps\. The full invariants and proof structure are in the online appendix \(§ E\)\.

## 7Related Work

### 7\.1Classical consistency theory

The hardware consistency tradition—sequential consistency\[[3](https://arxiv.org/html/2606.17182#bib.bib3)\], TSO\[[4](https://arxiv.org/html/2606.17182#bib.bib4)\], ARMv8\[[5](https://arxiv.org/html/2606.17182#bib.bib5)\], linearizability\[[6](https://arxiv.org/html/2606.17182#bib.bib6)\]—characterizes memory operation visibility under bounded latency\. Our model differs fundamentally because the operation’s generation phase is not bounded\.

The database isolation hierarchy\[[7](https://arxiv.org/html/2606.17182#bib.bib7),[8](https://arxiv.org/html/2606.17182#bib.bib8),[9](https://arxiv.org/html/2606.17182#bib.bib9)\]is the most direct conceptual ancestor; we adopt its anomaly\-based methodology over a different operational model\. The line through Fekete et al\.\[[22](https://arxiv.org/html/2606.17182#bib.bib22)\]and Cahill et al\.\[[21](https://arxiv.org/html/2606.17182#bib.bib21)\]characterizes when snapshot isolation can be made serializable; Fekete\[[45](https://arxiv.org/html/2606.17182#bib.bib45)\]treats isolation\-level assignment for mixed workloads; the technique became a production serializable level in PostgreSQL \(Ports and Grittner\[[83](https://arxiv.org/html/2606.17182#bib.bib83)\]\) and underpins distributed SQL engines such as CockroachDB—the practical reference points for our SSI runtime, the difference being that the work discarded on abort is a relational re\-execution there and an LLM inference here\. Our snapshot\-insufficiency observation \([Section˜4\.5](https://arxiv.org/html/2606.17182#S4.SS5)\) parallels the write\-skew anomaly of this line\. Bailis et al\.\[[46](https://arxiv.org/html/2606.17182#bib.bib46)\]characterize highly available transactions; the orthogonal question of*whether coordination is needed at all*is the focus of CALM\[[47](https://arxiv.org/html/2606.17182#bib.bib47)\]and coordination avoidance\[[48](https://arxiv.org/html/2606.17182#bib.bib48)\]—a complementary axis \(we treat it as future work\) determining whetherL3L\_\{3\}/L4L\_\{4\}enforcement can be avoided for monotonic computations\. Bailis et al\.\[[25](https://arxiv.org/html/2606.17182#bib.bib25)\]treat bounded\-staleness probabilistically \(the closest analogue forL1L\_\{1\}refinements admitting staleness windows\), and Crooks et al\.\[[49](https://arxiv.org/html/2606.17182#bib.bib49)\]give a client\-centric formulation; we adopt the operation\-history form because the multi\-agent setting has many clients per operation\.

Viotti and Vukolić\[[17](https://arxiv.org/html/2606.17182#bib.bib17)\]provide the canonical taxonomy of non\-transactional consistency models in distributed storage; their classification covers per\-object, per\-key, and session\-level guarantees under network propagation delay\. Our lattice targets a different operational regime \(long\-running operations under neural\-inference latency, not network propagation\), but their methodology—enumerating anomalies and stratifying levels by negation—is the direct analogue we follow\. Abadi\[[16](https://arxiv.org/html/2606.17182#bib.bib16)\]refines the CAP theorem with a latency/consistency tradeoff that captures, at a higher level of abstraction, the same tension our lattice makes explicit at the level granularity\. Globally distributed databases\[[50](https://arxiv.org/html/2606.17182#bib.bib50),[51](https://arxiv.org/html/2606.17182#bib.bib51)\]achieve serializability and external consistency under wide\-area replication through synchronized clocks and deterministic concurrency control respectively; the techniques bear on the implementability ofL4L\_\{4\}guarantees in deployed agent runtimes, modulo cost\-asymmetry considerations\.

The distributed\-systems consistency literature more broadly\[[10](https://arxiv.org/html/2606.17182#bib.bib10),[11](https://arxiv.org/html/2606.17182#bib.bib11),[14](https://arxiv.org/html/2606.17182#bib.bib14),[12](https://arxiv.org/html/2606.17182#bib.bib12),[13](https://arxiv.org/html/2606.17182#bib.bib13)\]addresses replicated state but assumes operations are discrete events with summarizable signatures\.

#### Transactional memory and multi\-version concurrency control\.

Software transactional memory\[[52](https://arxiv.org/html/2606.17182#bib.bib52),[53](https://arxiv.org/html/2606.17182#bib.bib53)\]is the closest concurrency\-control paradigm to the agent setting: long\-running transactions, optimistic execution with abort\-and\-retry, and conflict detection at commit, in the tradition of optimistic concurrency control\[[84](https://arxiv.org/html/2606.17182#bib.bib84)\]\. The asymmetry that distinguishes the agent setting from STM is the cost\-asymmetry assumption \([Section˜2\.2](https://arxiv.org/html/2606.17182#S2.SS2)\): an aborted transaction costs microseconds in STM and seconds\-to\-minutes in the agent setting, which inverts the favored concurrency\-control strategy from optimistic to pessimistic for agent runtimes targeting tight latency\. Multi\-version concurrency control\[[54](https://arxiv.org/html/2606.17182#bib.bib54)\]solves a related problem in databases by retaining multiple versions of each cell to support read\-only transactions without conflicts; an MVCC\-style mechanism applied to the agent setting would preventA1A\_\{1\}at the cost of storing pre\-generation snapshots for the duration of every operation, which is feasible but expensive at LLM\-inference latency\. The transaction\-modeling literature—specifically Chrysanthis and Ramamritham’s\[[55](https://arxiv.org/html/2606.17182#bib.bib55)\]ACTA framework for advanced transaction models—provides the formal vocabulary \(commit dependencies, abort dependencies, view\-serializability\) that we adapt informally for the saga\-compensation discussion ofA3A\_\{3\}\. Our contribution is not a new transaction model in this lineage but a lattice of consistency predicates over a specific operational regime\.

The pessimistic\-versus\-optimistic choice our cost analysis turns on \([Section˜5\.6](https://arxiv.org/html/2606.17182#S5.SS6)\) is the contention\-management problem of software transactional memory, mapped by Guerraoui et al\.\[[80](https://arxiv.org/html/2606.17182#bib.bib80)\]and Scherer and Scott\[[81](https://arxiv.org/html/2606.17182#bib.bib81)\]\. Our recommendation that pessimistic locking suits agent workloads where a discarded inference is expensive is a special case of their contention\-management policy space, distinguished only by the magnitude of the abort cost; we adopt the framing rather than extend the theory\.

#### Verified distributed systems and chaos testing\.

IronFleet\[[32](https://arxiv.org/html/2606.17182#bib.bib32)\]and Verdi\[[56](https://arxiv.org/html/2606.17182#bib.bib56)\]established that mechanically\-verified protocols at Paxos scale are feasible; closest to ourL2L\_\{2\}target, Chapar\[[57](https://arxiv.org/html/2606.17182#bib.bib57)\]verifies causally\-consistent key\-value stores in Coq—the difference being that Chapar certifies a*running*store, whereas ourL2L\_\{2\}is a model\-level artifact\. Our Verus chain targets a far smaller artifact \(detectors over a finite history, not a running protocol\) but reuses their proof\-by\-refinement pattern\. On the empirical side, Jepsen\[[58](https://arxiv.org/html/2606.17182#bib.bib58)\]generates and exhibits anomalies in deployed databases as witnesses, and this was refined into verified\-from\-trace checkers Cobra\[[59](https://arxiv.org/html/2606.17182#bib.bib59)\]and Elle\[[60](https://arxiv.org/html/2606.17182#bib.bib60)\]; both are methodological cousins of our detector, the difference being that ours targets a different \(inference\-latency\) catalog and is itself mechanically verified sound and complete, whereas Cobra and Elle are unverified analyzers\. Our pilot is closer in spirit to Jepsen than to IronFleet—verification of detectors plus empirical exhibition of anomalies in the wild is the methodology we adapt to the multi\-agent LLM setting\.

### 7\.2Multi\-agent infrastructure systems and frameworks

Atomix\[[2](https://arxiv.org/html/2606.17182#bib.bib2)\], SagaLLM\[[1](https://arxiv.org/html/2606.17182#bib.bib1)\], and CodeCRDT\[[18](https://arxiv.org/html/2606.17182#bib.bib18)\]are the closest contemporary work; we placed them informally in[Section˜6\.4](https://arxiv.org/html/2606.17182#S6.SS4)\. Atomix is cited as an arXiv preprint \(arXiv:2602\.14849, 2026\); reviewers should note its preprint status\. Cemri et al\.\[[19](https://arxiv.org/html/2606.17182#bib.bib19)\]provide an empirical taxonomy of multi\-agent failure modes whose categories \(specification gaps, reasoning\-action mismatch, inter\-agent misalignment\) are orthogonal to the concurrency anomalies we formalize\. The actor model\[[15](https://arxiv.org/html/2606.17182#bib.bib15)\]is the historical ancestor of agent systems but does not address LLM\-specific operation latency\.

Contemporary multi\-agent frameworks form the deployment substrate for the runtimes our lattice targets\. AutoGen\[[24](https://arxiv.org/html/2606.17182#bib.bib24)\]provides a conversation\-based abstraction with tool\-call orchestration; ReAct\[[61](https://arxiv.org/html/2606.17182#bib.bib61)\]defines the reason\-act loop that underlies most current agent architectures; MetaGPT\[[62](https://arxiv.org/html/2606.17182#bib.bib62)\]introduces standardized role abstractions for multi\-agent collaboration; generative agents\[[63](https://arxiv.org/html/2606.17182#bib.bib63)\]explore long\-running agent behavior with persistent memory\. The Model Context Protocol\[[64](https://arxiv.org/html/2606.17182#bib.bib64)\]standardizes the tool\-call interface across providers\. None of these frameworks provides explicit consistency guarantees on shared state; the guarantees arise \(or fail to arise\) from the underlying runtime, which our lattice targets\. We do not formalize the operational models of these frameworks here\.

### 7\.3Workflow systems, long\-lived transactions, and compensations

The challenge of consistency over long\-running activities is not new to multi\-agent LLM systems\. The advanced\-transaction\-model literature\[[65](https://arxiv.org/html/2606.17182#bib.bib65),[66](https://arxiv.org/html/2606.17182#bib.bib66)\]formalizes long\-running and nested\-transaction semantics including sagas\[[31](https://arxiv.org/html/2606.17182#bib.bib31)\], the original compensation\-based approach for long\-running activities; the ConTract model\[[67](https://arxiv.org/html/2606.17182#bib.bib67)\]extending sagas with explicit dependency contracts; and WS\-BusinessActivity for web\-services compositions\. Modern workflow engines apply these ideas to business\-process orchestration\. Our work adopts the compensation\-aware framing ofA3A\_\{3\}from this tradition and the no\-write\-without\-compensation pattern from sagas; the agent setting introduces the additional cost\-asymmetry constraint \([Section˜2\.2](https://arxiv.org/html/2606.17182#S2.SS2)\) which neither classical sagas nor ConTract address\.

### 7\.4Actor systems and message\-passing concurrency

The actor model\[[15](https://arxiv.org/html/2606.17182#bib.bib15)\]treats concurrent computation as isolated actors communicating via asynchronous messages\. Modern implementations—Erlang/OTP, Akka, Microsoft Orleans\[[68](https://arxiv.org/html/2606.17182#bib.bib68)\]and its virtual\-actor abstraction—realize this model with various strong\-consistency extensions including transactional actors, ETag\-based optimistic concurrency, and grain\-level state\. The relevance to multi\-agent LLM systems is direct: AutoGen’s underlying coordination model is message\-passing, and AutoGen’s ongoing state\-persistence work\[[40](https://arxiv.org/html/2606.17182#bib.bib40)\]adopts Orleans\-style ETag\-based OCC\. Our operational model assumes a single shared key\-value store, a simplification relative to the message\-passing reality of these frameworks; the lattice points and predicates we formalize correspond to the consistency contracts an actor system can implement, but the operational mapping from a message\-passing runtime to our shared\-store model is non\-trivial and deferred to future work\.

A related formal lineage is*behavioral*and*session*types\[[77](https://arxiv.org/html/2606.17182#bib.bib77)\], which type communication channels by the protocol of messages they carry and statically enforce that interacting components agree on that protocol\. This is directly germane to one of our field reports: the deer\-flow channel\-type conflict \(“Channel ‘todos’ already exists with a different type,” issue \#3199\[[44](https://arxiv.org/html/2606.17182#bib.bib44)\]\), whose resolution is to unify the channel contract across components, is in essence a behavioral\-typing concern—components must agree on the consistency discipline of a shared channel before they compose, and multiparty session types are the formal home for that agreement\. We note the connection but do not formalize it: our operation\-record model classifies runtime traces against a consistency contract rather than statically typing channel protocols, and bridging the two is left to future work\.

### 7\.5Verification\-aware and model\-checking approaches

Two contemporary lines of work address the formal verification of multi\-agent LLM systems with methodologies distinct from ours\. VeriMAP\[[69](https://arxiv.org/html/2606.17182#bib.bib69)\]integrates verification into the planning process itself: a planner agent emits, for each subtask in a DAG decomposition, planner\-generated verification functions in Python and natural language; an executor produces JSON outputs verified by these functions, with retries and replanning managed by a coordinator\. This is a different design point than the one we occupy: we classify traces post\-hoc against a formal consistency contract, whereas VeriMAP integrates per\-subtask verification into the plan\. The two are complementary; VeriMAP can prevent task\-level misalignments that our anomaly detector would not catch \(it does not look at subtask\-pass criteria\), and our detector can identify cross\-agent staleness that VeriMAP’s per\-subtask VFs would not test for \(they verify outputs, not inter\-agent state coherence\)\.

AgentVerify\[[70](https://arxiv.org/html/2606.17182#bib.bib70)\]formalizes agent safety properties in LTL and verifies them via model checking, treating the LLM as a non\-deterministic oracle within a finite\-state machine defined by the orchestration layer\. Its compositional library covers memory integrity, tool\-call protocols, MCP/skill invocations, and human\-in\-the\-loop boundaries\. The methodological contrast with our work is that AgentVerify uses LTL specifications over orchestration FSMs, whereas we use TLA\+predicates over operation records and lift the executable detectors via Verus refinement\. AgentVerify’s compositional FSM model is better suited to safety\-property verification across complex orchestration topologies; our operation\-record model is better suited to anomaly classification on captured traces\. Combining the two methodologies—LTL safety specification over an orchestration FSM with operation\-record\-level anomaly classification at runtime—is an attractive direction we have not pursued\.

Both VeriMAP and AgentVerify descend from a longer tradition of*multi\-agent systems verification*that predates LLM agents: temporal\-epistemic model checking, embodied in the MCMAS checker\[[85](https://arxiv.org/html/2606.17182#bib.bib85)\], verifies CTL/ATL and epistemic properties of interacting agents over an interpreted\-systems semantics\. That line targets knowledge, strategy, and coordination properties of agent protocols, whereas our predicates target shared\-state consistency over operation records; the two are complementary specification layers over the same systems, and we adopt the trace/operation\-record granularity precisely because the consistency anomalies we formalize are not naturally expressed as epistemic or strategic modalities\.

### 7\.6Runtime verification and trace monitoring

Our detector is, in the terminology of the runtime\-verification \(RV\) field, a*trace monitor*: it decides a formally specified property over an observed execution rather than verifying the producing system\. The methodology and its limits are well charted\. Havelund and Roşu\[[86](https://arxiv.org/html/2606.17182#bib.bib86)\]synthesize monitors from safety properties; Leucker and Schallhart\[[87](https://arxiv.org/html/2606.17182#bib.bib87)\]survey the area; and Bauer, Leucker, and Schallhart\[[88](https://arxiv.org/html/2606.17182#bib.bib88)\]characterize*monitorability*—which LTL/TLTL properties can be soundly decided from finite prefixes—and give the standard three\-valued \(⊤/⊥/?\\top/\\bot/?\) semantics for online monitoring\. Two consequences bear on our detector\. First, the anomalies we monitor are finitary predicates over a completed operation history \(a non\-repeatable read, an aborted predecessor in a causal closure\), so they are monitorable in the strong sense—decidable from the trace—which is why the Verus soundness\-and\-completeness result of[Section˜4\.7](https://arxiv.org/html/2606.17182#S4.SS7)is achievable at all, rather than merely a sound one\-sided alarm\. Second, the RV literature’s distinction between detecting a violation and establishing operational impact is exactly the gap our operational\-materiality study \([Section˜5\.9](https://arxiv.org/html/2606.17182#S5.SS9)\) addresses empirically\. What our detector adds over generic RV is that the monitor itself is mechanically verified against its specification; most deployed RV tools are unverified monitor implementations\. The database\-trace checkers Cobra\[[59](https://arxiv.org/html/2606.17182#bib.bib59)\]and Elle\[[60](https://arxiv.org/html/2606.17182#bib.bib60)\]discussed above are the domain\-specific instances of this same monitoring paradigm\.

### 7\.7Concurrent program logics

The concurrent\-semantics lift of[Section˜4\.10](https://arxiv.org/html/2606.17182#S4.SS10)sits in the territory of program logics for shared\-state concurrency, and we situate it there rather than claim to advance it\. The foundational systems—Owicki and Gries’s interference\-freedom method\[[89](https://arxiv.org/html/2606.17182#bib.bib89)\]and Jones’s rely\-guarantee calculus\[[90](https://arxiv.org/html/2606.17182#bib.bib90)\]—reason compositionally about threads sharing memory, and concurrent separation logic \(O’Hearn\[[91](https://arxiv.org/html/2606.17182#bib.bib91)\], with Brookes’s semantics\[[92](https://arxiv.org/html/2606.17182#bib.bib92)\]\) adds ownership\-based local reasoning, mechanized in modern form as Iris\[[93](https://arxiv.org/html/2606.17182#bib.bib93)\]and applied to Rust by RustBelt\[[35](https://arxiv.org/html/2606.17182#bib.bib35)\]\. These logics are the proper tool for discharging—not merely enumerating—the mutex obligation our lift relocates to the standard library: a rely\-guarantee or CSL proof thatstd::sync::Mutexrealizes the abstract Acquire/Release protocol would close the residual that[Section˜4\.10](https://arxiv.org/html/2606.17182#S4.SS10)leaves open, and RustBelt has carried out exactly that style of proof for Rust’s synchronization primitives\. We deliberately do*not*reconstruct such a logic here: our lift is a minimal, mechanically\-verified identification of the abstract protocol the sequential refinements require, conditional on that protocol holding, and the program\-logic machinery needed to verify the condition is future work \([Section˜6\.3](https://arxiv.org/html/2606.17182#S6.SS3)\)\. Complementing the deductive approach, the stateless\-model\-checking line for weak memory—GenMC\[[34](https://arxiv.org/html/2606.17182#bib.bib34)\]\(which we use\), CDSChecker\[[94](https://arxiv.org/html/2606.17182#bib.bib94)\], and Nidhugg\[[95](https://arxiv.org/html/2606.17182#bib.bib95)\]—decides the same lock\-protocol question exhaustively but only at bounded thread counts, which is the precise scope and limit of our RC11 check\.

### 7\.8Operational transformation and CRDTs for collaborative editing

The collaborative\-editing literature treats concurrent edits through operational transformation \(OT\) and CRDTs\. OT, originating in GROVE\[[71](https://arxiv.org/html/2606.17182#bib.bib71),[72](https://arxiv.org/html/2606.17182#bib.bib72)\]and deployed in Google Docs,*re\-serializes*concurrent operations for the classes its transformation functions cover—tool invocations and budget decrements are not among those classes\. CRDTs\[[13](https://arxiv.org/html/2606.17182#bib.bib13)\]provide merge functions guaranteeing eventual convergence, with a verification tradition of its own \(Gomes et al\.\[[101](https://arxiv.org/html/2606.17182#bib.bib101)\]verify strong eventual consistency in Isabelle/HOL\) parallel to our exclusion\-axis verification\. CodeCRDT\[[18](https://arxiv.org/html/2606.17182#bib.bib18)\]is a CRDT\-based agent runtime on a different contract than ourL1L\_\{1\}, and LangGraph’s recommendedoperator\.add\-reducer workaround forINVALID\_CONCURRENT\_GRAPH\_UPDATE\[[37](https://arxiv.org/html/2606.17182#bib.bib37)\]is likewise a contract change, notA1A\_\{1\}prevention \([Section˜6\.5](https://arxiv.org/html/2606.17182#S6.SS5)\)\. Merge\-convergence is thus an axis*orthogonal*to the exclusion lattice \(as CALM’s coordination\-avoidance axis\[[47](https://arxiv.org/html/2606.17182#bib.bib47)\]is\): for commutative effects a CRDT contract can makeA1A\_\{1\}prevention unnecessary, while for non\-commutative effects no merge restores a serial order and the exclusion lattice applies\. Neither axis subsumes the other; non\-serializable convergence points are future work\.

### 7\.9Durable execution and stateful workflow runtimes

A production lineage directly addresses long\-running, failure\-prone activities with external effects: durable\-execution engines such as Temporal\[[75](https://arxiv.org/html/2606.17182#bib.bib75)\]and Cadence, AWS Step Functions, and Azure Durable Functions, whose semantics Burckhardt et al\.\[[74](https://arxiv.org/html/2606.17182#bib.bib74)\]formalize\. These systems guarantee exactly\-once activity completion and saga\-style compensation through*deterministic replay*: a workflow’s history is logged and re\-executed deterministically after failure\. The connection to our work is twofold\. First,A3A\_\{3\}\(cascade\) andA6A\_\{6\}\(tool\-effect reordering\) are precisely the phenomena these engines manage operationally, and our saga model \([Section˜4\.15](https://arxiv.org/html/2606.17182#S4.SS15)\) is a model\-level analogue of their compensation discipline; we do not claim a new mechanism over them\. Second, their deterministic\-replay requirement is a production instance of our load\-bearing determinism assumption \([Section˜2\.2](https://arxiv.org/html/2606.17182#S2.SS2)\): durable engines forbid non\-deterministic activity bodies for exactly the reason our framework assumes determinism, which both motivates the assumption and bounds its applicability to agent workloads whose generation is not replay\-deterministic\.

### 7\.10Agent\-memory systems

The regime our catalog targets—shared mutable state across LLM agents—is realized concretely by agent\-memory systems such as MemGPT\[[76](https://arxiv.org/html/2606.17182#bib.bib76)\]and its successor Letta, which give agents persistent, editable memory partitioned into context tiers\. These systems make shared\-mutable\-state coordination a first\-class concern but provide no explicit consistency contract over concurrent memory edits; they are therefore candidate deployment substrates for the prevention contracts our lattice formalizes, and instrumenting one is the principal empirical follow\-up we identify in[Section˜6\.2](https://arxiv.org/html/2606.17182#S6.SS2)\. We do not formalize their operational models here\.

### 7\.11Probabilistic program verification

The probabilistic refinement of[Section˜4\.13](https://arxiv.org/html/2606.17182#S4.SS13)stops at a Markov\-style discrete bound and identifies Hoeffding\-grade concentration as the open problem, blocked by the absence of real\-number probability theory in the current Verus distribution\. The probabilistic\-verification literature is the natural home for closing it: probabilistic relational Hoare logic\[[78](https://arxiv.org/html/2606.17182#bib.bib78)\]for relating a stochastic agent’s behavior under counterfactual reads, and weakest\-pre\-expectation calculi\[[79](https://arxiv.org/html/2606.17182#bib.bib79)\]for expectation and concentration reasoning over probabilistic programs; mature probabilistic model checkers such as Storm\[[96](https://arxiv.org/html/2606.17182#bib.bib96)\]supply the algorithmic backend such reasoning ultimately invokes\. Adapting either to a Verus\-style mechanization of the disagreement\-probability estimator is the path to a genuine concentration guarantee; we identify it as formalization follow\-up rather than claiming it here\.

### 7\.12Deterministic databases

Deterministic database systems—Calvin\[[51](https://arxiv.org/html/2606.17182#bib.bib51)\], H\-Store\[[73](https://arxiv.org/html/2606.17182#bib.bib73)\], and the broader line of work on deterministic transaction processing—obtain serializability without locking by ordering transactions globally before execution\. The ordering is taken as the canonical schedule and all replicas execute it identically\. The connection to ourL4L\_\{4\}target is direct: deterministic ordering is one mechanism by whichL4L\_\{4\}could be implemented in a multi\-agent runtime, with phantom\-tool prevention encoded as part of the deterministic schedule\. We do not pursue this implementation strategy in the current paper, but identify it as a candidateL4L\_\{4\}runtime worth investigating in deployment\.

## 8Conclusion

This paper provides a mechanically verified consistency hierarchy for an operational regime characteristic of multi\-agent large language model systems\. Four concurrency anomalies are formalized in TLA\+and exhibited by TLC counter\-example traces at small finite parameters \(one further anomaly is deferred for a replication\-aware extension\); over the Boolean latticeℒ=⟨2𝒜,⊆⟩\\mathcal\{L\}=\\langle 2^\{\\mathcal\{A\}\},\\subseteq\\ranglethey induce, we name one operationally chosen maximal chainL0⊊L1⊊⋯⊊L4L\_\{0\}\\subsetneq L\_\{1\}\\subsetneq\\cdots\\subsetneq L\_\{4\}\(one of4\!=244\!=24\)\.

The substantive contribution is the mechanized realization, not the catalog or the lattice\. Detector pipelines for the four anomalies are verified sound*and*complete in Verus and the chain definitions pass a TLAPS coherence check; three deployed Rust runtimes are verified against stale\-generation and refined to their state machines; and theL2L\_\{2\}–L4L\_\{4\}disciplines are exec\-mode\-verified with dependency\-free prevention twins\. We ground the refinements in the concurrency primitives that deployed frameworks ship—formalizing a reproduced, live silent lost update in ByteDance’s deer\-flow as a verifiedL0→L1L\_\{0\}\\rightarrow L\_\{1\}refinement, and removing an in\-the\-wild tool\-effect reordering in LangGraph’sToolNodewith anL3L\_\{3\}commit\-order sequencer\. The stale\-generation anomaly arises even when each operation reads a structurally\-defined snapshot at begin time—the write\-skew pattern of Berenson et al\. and Cahill et al\.\[[7](https://arxiv.org/html/2606.17182#bib.bib7),[21](https://arxiv.org/html/2606.17182#bib.bib21)\]in a regime governed by long inference latency; the phenomena are classical, and the lattice supplies organizing vocabulary, not a theoretical result\. A replication\-aware extension for split\-view, a probabilistic refinement for stochastic outputs, broader production\-trace prevalence measurement, and exploration of alternative chains are explicit follow\-up work, under the assumptions disclosed in[Section˜2\.2](https://arxiv.org/html/2606.17182#S2.SS2)\.

## Appendix AVerified obligation inventory

The verification sections report 274 curated and 295 full distinct verified proof obligations \(0 errors\)\. This appendix reproduces the per\-file accounting those figures derive from, exactly as emitted byverus\_count\.sh\. The script verifies everysrc/\*\.rsfile in theverus\-detectorcrate standalone underverus \-\-crate\-type=lib, subtracts each obligation that one file re\-verifies from another once per re\-inclusion edge, and prints the arithmetic, so the totals are auditable rather than asserted\. The default invocation applies the curated exclusion of four non\-headline helper files \(marked†\\daggerin[Table˜VI](https://arxiv.org/html/2606.17182#A1.T6)\);verus\_count\.sh \-\-fullclears that exclusion\. Every counted file verifies at 0 errors\.

TABLE VI:Per\-file Verus obligation counts \(verus\_count\.sh \-\-full\)\.†\\daggermarks the four helper files excluded from the 274 curated headline; the defaultverus\_count\.shrun applies that exclusion\.verified\_a1\.rscarries no proof obligations\. The distinct totals subtract re\-included obligations once per edge \(listed below the table\)\.File \(verus\-detector/src/\)Verifiedlib\_a4\_split\_view\.rs9lib\_concurrent\_semantics\.rs9lib\_consistency\_lattice\.rs38lib\_default\_si\.rs9lib\_detect\_a1\_exec\.rs†\\dagger9lib\_detector\_equivalence\.rs24lib\_l2\_exec\.rs49lib\_l2\_projection\.rs3lib\_l2\_safety\.rs22lib\_l3\_exec\.rs7lib\_l3\_safety\.rs6lib\_l3\_sequencer\.rs5lib\_l4\_exec\.rs9lib\_l4\_safety\.rs5lib\_langgraph\_refinement\.rs7lib\_occ\_l2\_refinement\.rs8lib\_pessimistic\_exec\.rs†\\dagger5lib\_pessimistic\_invariant\.rs†\\dagger3lib\_refinement\_default\_si\.rs18lib\_refinement\_pessimistic\.rs31lib\_refinement\_ssi\_chain\.rs17lib\_refinement\_ssi\.rs18lib\.rs23lib\_rustbelt\_interface\.rs4lib\_si\_commit\_invariant\.rs†\\dagger4lib\_ssi\.rs8verified\_a1\.rs—Raw per\-file sum \(counted files\)350*Re\-inclusion edges\.*lib\_consistency\_lattice\.rsre\-includeslib\_l2\_safety\.rs\(−22\-22, viamod\),lib\_l3\_safety\.rs\(−6\-6,mod\), andlib\_l4\_safety\.rs\(−5\-5,mod\);lib\_l2\_exec\.rsre\-includes theL2L\_\{2\}model textually \(−22\-22\)\. These four edges remove the5555double\-counted obligations\.lib\_l2\_safety\.rsis subtracted twice— once for each distinct parent that re\-includes it—leaving its 22 obligations counted exactly once\.

*Totals\.*The full distinct total is350−55=295350\-55=295\. The four†\\daggerhelper files contribute9\+5\+3\+4=219\+5\+3\+4=21obligations to the raw sum; excluding them gives a curated raw sum of329329and hence the curated distinct total329−55=274329\-55=274\.verus\_count\.shreproduces the former;verus\_count\.sh \-\-fullreproduces the latter\.

## References

- \[1\]E\. Y\. Chang and L\. Geng, “SagaLLM: Context management, validation, and transaction guarantees for multi\-agent LLM planning,”*Proceedings of the VLDB Endowment*, 2025, arXiv:2503\.11951\.
- \[2\]B\. Mohammadi, N\. Potamitis, L\. Klein, A\. Arora, and L\. Bindschaedler, “Atomix: Timely, transactional tool use for reliable agentic workflows,”*arXiv preprint arXiv:2602\.14849*, 2026, preprint; not yet peer\-reviewed at time of writing\.
- \[3\]L\. Lamport, “How to make a multiprocessor computer that correctly executes multiprocess programs,”*IEEE Transactions on Computers*, vol\. C\-28, no\. 9, pp\. 690–691, 1979\.
- \[4\]K\. Gharachorloo, D\. Lenoski, J\. Laudon, P\. Gibbons, A\. Gupta, and J\. Hennessy, “Memory consistency and event ordering in scalable shared\-memory multiprocessors,” in*ISCA*, 1990, pp\. 15–26\.
- \[5\]C\. Pulte, S\. Flur, W\. Deacon, J\. French, S\. Sarkar, and P\. Sewell, “Simplifying arm concurrency: Multicopy\-atomic axiomatic and operational models for ARMv8,” in*POPL*, vol\. 2, 2018, pp\. 19:1–19:29\.
- \[6\]M\. P\. Herlihy and J\. M\. Wing, “Linearizability: A correctness condition for concurrent objects,”*ACM TOPLAS*, vol\. 12, no\. 3, pp\. 463–492, 1990\.
- \[7\]H\. Berenson, P\. Bernstein, J\. Gray, J\. Melton, E\. O’Neil, and P\. O’Neil, “A critique of ANSI SQL isolation levels,” in*ACM SIGMOD*, 1995, pp\. 1–10\.
- \[8\]A\. Adya, “Weak consistency: A generalized theory and optimistic implementations for distributed transactions,” Ph\.D\. dissertation, MIT, 1999\.
- \[9\]A\. Adya, B\. Liskov, and P\. O’Neil, “Generalized isolation level definitions,” in*Proceedings of the 16th International Conference on Data Engineering \(ICDE\)*, 2000, pp\. 67–78\.
- \[10\]W\. Lloyd, M\. J\. Freedman, M\. Kaminsky, and D\. G\. Andersen, “Don’t settle for eventual: Scalable causal consistency for wide\-area storage with COPS,” in*SOSP*, 2011, pp\. 401–416\.
- \[11\]P\. Mahajan, L\. Alvisi, and M\. Dahlin, “Consistency, availability, and convergence,” UT Austin, Tech\. Rep\. UTCS\-TR\-11\-22, 2011\.
- \[12\]S\. Burckhardt,*Principles of Eventual Consistency*\. now publishers, 2014\.
- \[13\]M\. Shapiro, N\. Preguiça, C\. Baquero, and M\. Zawirski, “Conflict\-free replicated data types,” in*SSS*, 2011, pp\. 386–400\.
- \[14\]D\. B\. Terry, A\. J\. Demers, K\. Petersen, M\. J\. Spreitzer, M\. M\. Theimer, and B\. B\. Welch, “Session guarantees for weakly consistent replicated data,” in*PDIS*, 1994, pp\. 140–149\.
- \[15\]C\. Hewitt, P\. Bishop, and R\. Steiger, “A universal modular ACTOR formalism for artificial intelligence,” in*IJCAI*, 1973, pp\. 235–245\.
- \[16\]D\. J\. Abadi, “Consistency tradeoffs in modern distributed database system design: CAP is only part of the story,”*IEEE Computer*, vol\. 45, no\. 2, pp\. 37–42, 2012\.
- \[17\]P\. Viotti and M\. Vukolić, “Consistency in non\-transactional distributed storage systems,”*ACM Computing Surveys*, vol\. 49, no\. 1, pp\. 19:1–19:34, 2016\.
- \[18\]S\. Pugachev, “CodeCRDT: Observation\-driven coordination for multi\-agent LLM code generation,”*arXiv preprint arXiv:2510\.18893*, 2025\.
- \[19\]M\. Cemri, M\. Z\. Pan, S\. Yang, L\. A\. Agrawal, B\. Chopra, R\. Tiwari, K\. Keutzer, A\. Parameswaran, D\. Klein, K\. Ramchandran, M\. Zaharia, J\. E\. Gonzalez, and I\. Stoica, “Why do multi\-agent LLM systems fail?” in*NeurIPS Datasets and Benchmarks Track*, 2025, spotlight; arXiv:2503\.13657\.
- \[20\]P\. Bailis, A\. Ghodsi, J\. M\. Hellerstein, and I\. Stoica, “Bolt\-on causal consistency,” in*Proceedings of the ACM SIGMOD International Conference on Management of Data*, 2013, pp\. 761–772\.
- \[21\]M\. J\. Cahill, U\. Röhm, and A\. D\. Fekete, “Serializable isolation for snapshot databases,” in*Proceedings of the ACM SIGMOD International Conference on Management of Data*, 2008, pp\. 729–738\.
- \[22\]A\. Fekete, D\. Liarokapis, E\. O’Neil, P\. O’Neil, and D\. Shasha, “Making snapshot isolation serializable,”*ACM Transactions on Database Systems \(TODS\)*, vol\. 30, no\. 2, pp\. 492–528, 2005\.
- \[23\]A\. Lattuada, T\. Hance, C\. Cho, M\. Brun, I\. Subasinghe, Y\. Zhou, J\. Howell, B\. Parno, and C\. Hawblitzel, “Verus: Verifying Rust programs using linear ghost types,” in*Proceedings of the ACM on Programming Languages \(OOPSLA\)*, vol\. 7, 2023, pp\. 286–315\.
- \[24\]Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, S\. Zhang, E\. Zhu, B\. Li, L\. Jiang, X\. Zhang, and C\. Wang, “AutoGen: Enabling next\-gen LLM applications via multi\-agent conversation,” in*arXiv preprint arXiv:2308\.08155*, 2023\.
- \[25\]P\. Bailis, S\. Venkataraman, M\. J\. Franklin, J\. M\. Hellerstein, and I\. Stoica, “Probabilistically bounded staleness for practical partial quorums,”*Proceedings of the VLDB Endowment*, vol\. 5, no\. 8, pp\. 776–787, 2012\.
- \[26\]T\. A\. Henzinger, C\. M\. Kirsch, H\. Payer, A\. Sezgin, and A\. Sokolova, “Quantitative relaxation of concurrent data structures,” in*Proceedings of the 40th ACM SIGPLAN\-SIGACT Symposium on Principles of Programming Languages \(POPL\)*, 2013, pp\. 317–328\.
- \[27\]M\. S\. Miller, “Robust composition: Towards a unified approach to access control and concurrency control,” Ph\.D\. dissertation, Johns Hopkins University, 2006\.
- \[28\]T\. Murray, “Analysing the security properties of object\-capability patterns,” Ph\.D\. dissertation, University of Oxford, 2010\.
- \[29\]T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom, “Toolformer: Language models can teach themselves to use tools,” in*NeurIPS*, 2023\.
- \[30\]S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez, “Gorilla: Large language model connected with massive APIs,” in*arXiv preprint arXiv:2305\.15334*, 2023\.
- \[31\]H\. Garcia\-Molina and K\. Salem, “Sagas,” in*Proceedings of the 1987 ACM SIGMOD International Conference on Management of Data*, 1987, pp\. 249–259\.
- \[32\]C\. Hawblitzel, J\. Howell, M\. Kapritsos, J\. R\. Lorch, B\. Parno, M\. L\. Roberts, S\. Setty, and B\. Zill, “IronFleet: Proving practical distributed systems correct,” in*Proceedings of the 25th ACM Symposium on Operating Systems Principles \(SOSP\)*, 2015, pp\. 1–17\.
- \[33\]O\. Lahav, V\. Vafeiadis, J\. Kang, C\.\-K\. Hur, and D\. Dreyer, “Repairing sequential consistency in C/C\+\+11,” in*Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation \(PLDI\)*, 2017, pp\. 618–632\.
- \[34\]M\. Kokologiannakis and V\. Vafeiadis, “GenMC: A model checker for weak memory models,” in*Computer Aided Verification \(CAV\)*, 2021, pp\. 427–440\.
- \[35\]R\. Jung, J\.\-H\. Jourdan, R\. Krebbers, and D\. Dreyer, “RustBelt: Securing the foundations of the Rust programming language,” in*Proceedings of the 45th ACM SIGPLAN Symposium on Principles of Programming Languages \(POPL\)*\. Association for Computing Machinery, 2018, pp\. 66:1–66:34\.
- \[36\]J\. Kang, C\.\-K\. Hur, O\. Lahav, V\. Vafeiadis, and D\. Dreyer, “A promising semantics for relaxed\-memory concurrency,” in*Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages \(POPL\)*, 2017, pp\. 175–189\.
- \[37\]LangChain AI, “INVALID\_CONCURRENT\_GRAPH\_UPDATE: LangGraph error reference,”[https://docs\.langchain\.com/oss/python/langgraph/errors/INVALID\_CONCURRENT\_GRAPH\_UPDATE](https://docs.langchain.com/oss/python/langgraph/errors/INVALID_CONCURRENT_GRAPH_UPDATE), 2025, accessed November 2025\.
- \[38\]Issue \#6446, langchain\-ai/langgraph, “InvalidUpdateError when using parallel subgraphs with shared state keys in LangGraph,”[https://github\.com/langchain\-ai/langgraph/issues/6446](https://github.com/langchain-ai/langgraph/issues/6446), 2025, accessed November 2025\.
- \[39\]Discussion \#7144, microsoft/autogen, “Handling shared state across multi\-agent conversations in AutoGen,”[https://github\.com/microsoft/autogen/discussions/7144](https://github.com/microsoft/autogen/discussions/7144), 2025, accessed November 2025\.
- \[40\]Pull Request \#3954, microsoft/autogen, “Stateful persistence grains for each agent,”[https://github\.com/microsoft/autogen/pull/3954](https://github.com/microsoft/autogen/pull/3954), 2024, accessed November 2025\.
- \[41\]Issue \#7043, microsoft/autogen, “GraphFlow state persistence bug: workflow gets stuck after interruption during agent transitions,”[https://github\.com/microsoft/autogen/issues/7043](https://github.com/microsoft/autogen/issues/7043), 2025, accessed November 2025\.
- \[42\]Issue \#3123, bytedance/deer\-flow, “To\-do list shown during streaming disappears after the run completes,”[https://github\.com/bytedance/deer\-flow/issues/3123](https://github.com/bytedance/deer-flow/issues/3123), 2026, accessed June 2026\.
- \[43\]Pull Request \#3180, bytedance/deer\-flow, “fix\(agents\): preserve todos state across node updates,”[https://github\.com/bytedance/deer\-flow/pull/3180](https://github.com/bytedance/deer-flow/pull/3180), 2026, accessed June 2026\.
- \[44\]Issue \#3199, bytedance/deer\-flow, “\[runtime\] Plan mode fails with duplicate todos channel type conflict,”[https://github\.com/bytedance/deer\-flow/issues/3199](https://github.com/bytedance/deer-flow/issues/3199), 2026, accessed June 2026\.
- \[45\]A\. Fekete, “Allocating isolation levels to transactions,” in*Proceedings of the 24th ACM SIGMOD\-SIGACT\-SIGART Symposium on Principles of Database Systems \(PODS\)*, 2005, pp\. 206–215\.
- \[46\]P\. Bailis, A\. Davidson, A\. Fekete, A\. Ghodsi, J\. M\. Hellerstein, and I\. Stoica, “Highly available transactions: Virtues and limitations,”*Proceedings of the VLDB Endowment*, vol\. 7, no\. 3, pp\. 181–192, 2014\.
- \[47\]J\. M\. Hellerstein and P\. Alvaro, “Keeping CALM: When distributed consistency is easy,”*Communications of the ACM*, vol\. 63, no\. 9, pp\. 72–81, 2020\.
- \[48\]P\. Bailis, A\. Fekete, M\. J\. Franklin, A\. Ghodsi, J\. M\. Hellerstein, and I\. Stoica, “Coordination avoidance in database systems,” in*Proceedings of the VLDB Endowment*, vol\. 8, no\. 3, 2014, pp\. 185–196\.
- \[49\]N\. Crooks, Y\. Pu, L\. Alvisi, and A\. Clement, “Seeing is believing: A client\-centric specification of database isolation,” in*Proceedings of the ACM Symposium on Principles of Distributed Computing \(PODC\)*, 2017, pp\. 73–82\.
- \[50\]J\. C\. Corbett, J\. Dean, M\. Epstein, A\. Fikes, C\. Frost, J\. J\. Furman, S\. Ghemawat, A\. Gubarev, C\. Heiser, P\. Hochschild, W\. Hsieh, S\. Kanthak, E\. Kogan, H\. Li, A\. Lloyd, S\. Melnik, D\. Mwaura, D\. Nagle, S\. Quinlan, R\. Rao, L\. Rolig, Y\. Saito, M\. Szymaniak, C\. Taylor, R\. Wang, and D\. Woodford, “Spanner: Google’s globally\-distributed database,” in*OSDI*, 2012, pp\. 251–264\.
- \[51\]A\. Thomson, T\. Diamond, S\.\-C\. Weng, K\. Ren, P\. Shao, and D\. J\. Abadi, “Calvin: Fast distributed transactions for partitioned database systems,” in*ACM SIGMOD*, 2012, pp\. 1–12\.
- \[52\]N\. Shavit and D\. Touitou, “Software transactional memory,” in*Proceedings of the 14th ACM Symposium on Principles of Distributed Computing \(PODC\)*, 1995, pp\. 204–213\.
- \[53\]M\. Herlihy and J\. E\. B\. Moss, “Transactional memory: Architectural support for lock\-free data structures,” in*Proceedings of the 20th International Symposium on Computer Architecture \(ISCA\)*, 1993, pp\. 289–300\.
- \[54\]P\. A\. Bernstein and N\. Goodman, “Multiversion concurrency control—theory and algorithms,”*ACM Transactions on Database Systems*, vol\. 8, no\. 4, pp\. 465–483, 1983\.
- \[55\]P\. K\. Chrysanthis and K\. Ramamritham, “ACTA: The SAGA continues,” in*Database Transaction Models for Advanced Applications*, A\. K\. Elmagarmid, Ed\. Morgan Kaufmann, 1992, pp\. 349–397\.
- \[56\]J\. R\. Wilcox, D\. Woos, P\. Panchekha, Z\. Tatlock, X\. Wang, M\. D\. Ernst, and T\. Anderson, “Verdi: A framework for implementing and formally verifying distributed systems,” in*Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation \(PLDI\)*, 2015, pp\. 357–368\.
- \[57\]M\. Lesani, C\. J\. Bell, and A\. Chlipala, “Chapar: Certified causally consistent distributed key\-value stores,” in*Proceedings of the 43rd Annual ACM SIGPLAN\-SIGACT Symposium on Principles of Programming Languages \(POPL\)*, 2016\.
- \[58\]K\. Kingsbury, “Jepsen: Distributed systems safety analysis,”[https://jepsen\.io](https://jepsen.io/), 2013–present, accessed 2026\.
- \[59\]C\. Tan, C\. Zhao, S\. Mu, and M\. Walfish, “Cobra: Making transactional key\-value stores verifiably serializable,” in*Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation \(OSDI\)*, 2020, pp\. 63–80\.
- \[60\]K\. Kingsbury and P\. Alvaro, “Elle: Inferring isolation anomalies from experimental observations,” arXiv:2003\.10554, 2020\.
- \[61\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao, “ReAct: Synergizing reasoning and acting in language models,” in*arXiv preprint arXiv:2210\.03629*, 2022\.
- \[62\]S\. Hong, X\. Zheng, J\. Chen, Y\. Cheng, J\. Wang, C\. Zhang, Z\. Wang, S\. K\. S\. Yau, Z\. Lin, L\. Zhou, C\. Ran, L\. Xiao, C\. Wu, and J\. Schmidhuber, “MetaGPT: Meta programming for multi\-agent collaborative framework,” in*arXiv preprint arXiv:2308\.00352*, 2023\.
- \[63\]J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in*UIST*, 2023\.
- \[64\]Anthropic, “Model context protocol,”[https://modelcontextprotocol\.io](https://modelcontextprotocol.io/), 2024\.
- \[65\]A\. K\. Elmagarmid, “Database transaction models for advanced applications,”*Morgan Kaufmann*, 1992\.
- \[66\]P\. K\. Chrysanthis and K\. Ramamritham, “Synthesis of extended transaction models using ACTA,”*ACM Transactions on Database Systems*, vol\. 19, no\. 3, pp\. 450–491, 1994\.
- \[67\]A\. Reuter and H\. Wächter, “The ConTract model,”*IEEE Data Engineering Bulletin*, vol\. 14, no\. 1, pp\. 39–43, 1991\.
- \[68\]P\. A\. Bernstein, S\. Bykov, A\. Geller, G\. Kliot, and J\. Thelin, “Orleans: Distributed virtual actors for programmability and scalability,” in*Microsoft Research Technical Report MSR\-TR\-2014\-41*, 2014\.
- \[69\]T\. Xu*et al\.*, “Verification\-aware planning for multi\-agent systems,”*arXiv preprint arXiv:2510\.17109*, 2025\.
- \[70\]E\. Fang, “AgentVerify: Compositional formal verification of AI agent safety properties via LTL model checking,” Preprints\.org, 2026, manuscript 202604\.1029\.
- \[71\]C\. A\. Ellis and S\. J\. Gibbs, “Concurrency control in groupware systems,” in*Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data*, 1989, pp\. 399–407\.
- \[72\]C\. A\. Ellis, S\. J\. Gibbs, and G\. L\. Rein, “Groupware: Some issues and experiences,”*Communications of the ACM*, vol\. 34, no\. 1, pp\. 39–58, 1991\.
- \[73\]R\. Kallman, H\. Kimura, J\. Natkins, A\. Pavlo, A\. Rasin, S\. Zdonik, E\. P\. C\. Jones, S\. Madden, M\. Stonebraker, Y\. Zhang, J\. Hugg, and D\. J\. Abadi, “H\-Store: A high\-performance, distributed main memory transaction processing system,”*Proceedings of the VLDB Endowment*, vol\. 1, no\. 2, pp\. 1496–1499, 2008\.
- \[74\]S\. Burckhardt, C\. Gillum, D\. Justo, K\. Kallas, C\. McMahon, and C\. S\. Meiklejohn, “Durable Functions: Semantics for stateful serverless,”*Proceedings of the ACM on Programming Languages \(OOPSLA\)*, vol\. 5, 2021\.
- \[75\]Temporal Technologies, “Temporal: durable execution platform,”[https://temporal\.io](https://temporal.io/), accessed 2026\.
- \[76\]C\. Packer, V\. Fang, S\. G\. Patil, K\. Lin, S\. Wooders, and J\. E\. Gonzalez, “MemGPT: Towards LLMs as operating systems,”*arXiv preprint arXiv:2310\.08560*, 2023\.
- \[77\]K\. Honda, N\. Yoshida, and M\. Carbone, “Multiparty asynchronous session types,” in*Proceedings of the 35th ACM SIGPLAN\-SIGACT Symposium on Principles of Programming Languages \(POPL\)*, 2008, pp\. 273–284\.
- \[78\]G\. Barthe, B\. Köpf, F\. Olmedo, and S\. Zanella\-Béguelin, “Probabilistic relational reasoning for differential privacy,” in*Proceedings of the 39th ACM SIGPLAN\-SIGACT Symposium on Principles of Programming Languages \(POPL\)*, 2012, pp\. 97–110\.
- \[79\]B\. L\. Kaminski, J\.\-P\. Katoen, C\. Matheja, and F\. Olmedo, “Weakest precondition reasoning for expected run\-times of probabilistic programs,” in*Programming Languages and Systems \(ESOP\)*, 2016, pp\. 364–389\.
- \[80\]R\. Guerraoui, M\. Herlihy, and B\. Pochon, “Toward a theory of transactional contention managers,” in*Proceedings of the 24th ACM Symposium on Principles of Distributed Computing \(PODC\)*, 2005, pp\. 258–264\.
- \[81\]W\. N\. Scherer III and M\. L\. Scott, “Advanced contention management for dynamic software transactional memory,” in*Proceedings of the 24th ACM Symposium on Principles of Distributed Computing \(PODC\)*, 2005, pp\. 240–248\.
- \[82\]L\. Lamport, R\. Shostak, and M\. Pease, “The Byzantine generals problem,”*ACM Transactions on Programming Languages and Systems*, vol\. 4, no\. 3, pp\. 382–401, 1982\.
- \[83\]D\. R\. K\. Ports and K\. Grittner, “Serializable snapshot isolation in PostgreSQL,”*Proceedings of the VLDB Endowment*, vol\. 5, no\. 12, pp\. 1850–1861, 2012\.
- \[84\]H\. T\. Kung and J\. T\. Robinson, “On optimistic methods for concurrency control,”*ACM Transactions on Database Systems*, vol\. 6, no\. 2, pp\. 213–226, 1981\.
- \[85\]A\. Lomuscio, H\. Qu, and F\. Raimondi, “MCMAS: an open\-source model checker for the verification of multi\-agent systems,”*International Journal on Software Tools for Technology Transfer*, vol\. 19, no\. 1, pp\. 9–30, 2017\.
- \[86\]K\. Havelund and G\. Roşu, “Synthesizing monitors for safety properties,” in*Tools and Algorithms for the Construction and Analysis of Systems \(TACAS\)*, 2002, pp\. 342–356\.
- \[87\]M\. Leucker and C\. Schallhart, “A brief account of runtime verification,”*Journal of Logic and Algebraic Programming*, vol\. 78, no\. 5, pp\. 293–303, 2009\.
- \[88\]A\. Bauer, M\. Leucker, and C\. Schallhart, “Runtime verification for LTL and TLTL,”*ACM Transactions on Software Engineering and Methodology*, vol\. 20, no\. 4, pp\. 14:1–14:64, 2011\.
- \[89\]S\. Owicki and D\. Gries, “Verifying properties of parallel programs: an axiomatic approach,”*Communications of the ACM*, vol\. 19, no\. 5, pp\. 279–285, 1976\.
- \[90\]C\. B\. Jones, “Tentative steps toward a development method for interfering programs,”*ACM Transactions on Programming Languages and Systems*, vol\. 5, no\. 4, pp\. 596–619, 1983\.
- \[91\]P\. W\. O’Hearn, “Resources, concurrency, and local reasoning,”*Theoretical Computer Science*, vol\. 375, no\. 1–3, pp\. 271–307, 2007\.
- \[92\]S\. Brookes, “A semantics for concurrent separation logic,”*Theoretical Computer Science*, vol\. 375, no\. 1–3, pp\. 227–270, 2007\.
- \[93\]R\. Jung, R\. Krebbers, J\.\-H\. Jourdan, A\. Bizjak, L\. Birkedal, and D\. Dreyer, “Iris from the ground up: A modular foundation for higher\-order concurrent separation logic,”*Journal of Functional Programming*, vol\. 28, e20, 2018\.
- \[94\]B\. Norris and B\. Demsky, “CDSchecker: checking concurrent data structures written with C/C\+\+ atomics,” in*Proceedings of the ACM Conference on Object\-Oriented Programming, Systems, Languages, and Applications \(OOPSLA\)*, 2013, pp\. 131–150\.
- \[95\]P\. A\. Abdulla, S\. Aronis, M\. F\. Atig, B\. Jonsson, C\. Leonardsson, and K\. Sagonas, “Stateless model checking for TSO and PSO,” in*Tools and Algorithms for the Construction and Analysis of Systems \(TACAS\)*, 2015, pp\. 353–367\.
- \[96\]C\. Dehnert, S\. Junges, J\.\-P\. Katoen, and M\. Volk, “A storm is coming: A modern probabilistic model checker,” in*Computer Aided Verification \(CAV\)*, ser\. LNCS, vol\. 10427\. Springer, 2017, pp\. 592–600\.
- \[97\]B\. T\. Willard and R\. Louf, “Efficient guided generation for large language models,”*arXiv preprint arXiv:2307\.09702*, 2023\.
- \[98\]Y\. Song*et al\.*, “The good, the bad, and the greedy: Evaluation of LLMs should not ignore non\-determinism,”*arXiv preprint arXiv:2407\.10457*, 2024\.
- \[99\]C\. J\. Fidge, “Timestamps in message\-passing systems that preserve the partial ordering,” in*Proceedings of the 11th Australian Computer Science Conference \(ACSC\)*, 1988, pp\. 56–66\.
- \[100\]F\. Mattern, “Virtual time and global states of distributed systems,” in*Parallel and Distributed Algorithms*, M\. Cosnard et al\., Eds\. North\-Holland, 1989, pp\. 215–226\.
- \[101\]V\. B\. F\. Gomes, M\. Kleppmann, D\. P\. Mulligan, and A\. R\. Beresford, “Verifying strong eventual consistency in distributed systems,”*Proceedings of the ACM on Programming Languages*, vol\. 1, no\. OOPSLA, pp\. 109:1–109:28, 2017\.
Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

Similar Articles

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

Towards Security-Auditable LLM Agents: A Unified Graph Representation

Decentralized Multi-Agent Systems with Shared Context

Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents

Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations

Submit Feedback

Similar Articles

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion
Towards Security-Auditable LLM Agents: A Unified Graph Representation
Decentralized Multi-Agent Systems with Shared Context
Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents
Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations