PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration

arXiv cs.CL Papers

Summary

PatchBoard replaces natural-language dialogue in LLM multi-agent systems with validated JSON Patch mutations over a shared structured state, achieving higher success rates and significantly lower token usage on ALFWorld benchmarks.

arXiv:2605.29313v1

# PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration
Source: [https://arxiv.org/html/2605.29313](https://arxiv.org/html/2605.29313)
Shuyu Zhang, Yaqi Shi, Lu Wang. School of Computer Science and Technology, Xidian University, Xi’an, China. wanglu@xidian.edu.cn

###### Abstract

LLM multi\-agent systems often coordinate through natural\-language dialogue or loosely structured shared memory, making intermediate state difficult to validate, attribute, and audit\. We introduce PatchBoard, a schema\-grounded collaboration architecture that replaces inter\-agent dialogue with validated JSON Patch mutations over a shared structured state\. An Architect agent constructs a task\-specific schema and workflow rules, while a deterministic kernel validates each proposed state mutation against schema constraints, role\-specific write contracts, and runtime invariants before committing it transactionally\. On 630 matched ALFWorld episodes, PatchBoard achieves an 84\.6% success rate, compared with 30\.8% for LangGraph and 61\.6% for Flock, while reducing tokens per successful task to 45\.5k, compared with 368\.3k and 64\.2k, respectively\.

Corresponding author: Lu Wang.

## 1 Introduction

Large language models are increasingly used as autonomous agents that plan, reason, call tools, interact with environments, and revise their behavior through feedback (Yao et al., [2023b](https://arxiv.org/html/2605.29313#bib.bib34); Schick et al., [2023](https://arxiv.org/html/2605.29313#bib.bib22); Shinn et al., [2023](https://arxiv.org/html/2605.29313#bib.bib23); Yao et al., [2023a](https://arxiv.org/html/2605.29313#bib.bib35); Wang et al., [2024a](https://arxiv.org/html/2605.29313#bib.bib27)). As tasks become longer and more compositional, a natural extension is to organize multiple agents into role-specialized teams. In such systems, different agents coordinate through multi-turn interaction. Representative systems such as AutoGen, CAMEL, ChatDev, MetaGPT, and AgentVerse show that multi-agent collaboration can improve task decomposition and support complex workflows across reasoning, software engineering, simulation, and tool-use settings (Wu et al., [2024a](https://arxiv.org/html/2605.29313#bib.bib1); Li et al., [2023](https://arxiv.org/html/2605.29313#bib.bib16); Qian et al., [2024](https://arxiv.org/html/2605.29313#bib.bib19); Hong et al., [2024](https://arxiv.org/html/2605.29313#bib.bib11); Chen et al., [2024](https://arxiv.org/html/2605.29313#bib.bib6); Wang et al., [2024b](https://arxiv.org/html/2605.29313#bib.bib28)). The dominant coordination interface in these systems is natural language, which is attractive because it matches the native input-output format of LLMs and makes agent communication flexible and expressive.

However, natural\-language communication becomes a fragile substrate for long\-horizon and stateful collaboration\. Dialogue histories grow with the number of turns, mix task facts with meta\-discussion and repair attempts, and often leave unclear which intermediate outputs should be treated as committed state\. A downstream agent may read an unverified observation, stale plan, malformed intermediate claim, or failed repair attempt as if it were reliable task state\. Once such information enters the shared context, later agents can amplify the error through additional reasoning, tool calls, or environment actions\. This problem is especially harmful in collaborative settings because failure is not localized to one model call\. A polluted intermediate state can silently affect all subsequent agents\.

Existing work attempts to address these issues by making agent coordination more explicit. Some systems use workflow graphs, planners, or verification functions to constrain execution (LangChain, [2024](https://arxiv.org/html/2605.29313#bib.bib15); Zhang et al., [2025](https://arxiv.org/html/2605.29313#bib.bib36)). Others ask agents to generate or reuse executable programs and skills, enabling compact and compositional control policies (Wang et al., [2024a](https://arxiv.org/html/2605.29313#bib.bib27); Yang et al., [2025b](https://arxiv.org/html/2605.29313#bib.bib33)). Blackboard-style systems instead coordinate agents through shared memory, allowing independent workers to read and update a common state (Hayes-Roth, [1985](https://arxiv.org/html/2605.29313#bib.bib10); Salemi et al., [2025](https://arxiv.org/html/2605.29313#bib.bib21)). Structured generation methods such as LMQL and Outlines further help models produce outputs that follow specified formats (Beurer-Kellner et al., [2023](https://arxiv.org/html/2605.29313#bib.bib3); Willard and Louf, [2023](https://arxiv.org/html/2605.29313#bib.bib29)). These approaches improve over unconstrained dialogue, but they leave an important gap. Workflow and code-based methods still require the runtime to trust generated procedures or control logic, while blackboard memory does not by itself determine whether an update is well typed, authorized, non-stale, or safe to commit. Structured output helps with formatting, but formatting alone does not define a system-level boundary between a model suggestion and committed shared state.

This paper proposes PatchBoard, a schema-grounded communication substrate for reliable and auditable LLM multi-agent collaboration. PatchBoard replaces open-ended inter-agent dialogue with validated JSON Patch mutations over a shared JSON state (Bryan and Nottingham, [2013](https://arxiv.org/html/2605.29313#bib.bib20)). An Architect agent defines the task schema, worker contracts, context budgets, and workflow rules (Bourhis et al., [2017](https://arxiv.org/html/2605.29313#bib.bib12)), while a deterministic kernel validates and transactionally commits only authorized state updates. This makes collaboration explicit, attributable, and replayable, preventing malformed or unauthorized outputs from silently entering shared memory. On ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.29313#bib.bib24)), PatchBoard achieves an 84.6% success rate over 630 matched episodes, compared with 30.8% for LangGraph and 61.6% for Flock. It also achieves the lowest normalized cost, requiring 45.5k tokens per successful task compared with 368.3k for LangGraph and 64.2k for Flock.

We make the following contributions:

- We formulate LLM multi-agent collaboration as validated mutation over a shared structured state, using a restricted JSON Patch interface to make inter-agent communication explicit, typed, and auditable.
- We design a deterministic kernel that validates proposed updates, enforces schema and role-specific write constraints, constructs budgeted context views, commits accepted patches transactionally, and records replayable transaction logs.
- We build a full PatchBoard prototype and evaluate it on long-horizon interaction tasks, with blackboard controls, ablations, sensitivity analyses, fault injection, and a diagnostic QA study that clarifies the boundary between structural validation and semantic support.

## 2 Related Work

#### LLM multi-agent coordination.

Recent LLM multi-agent systems decompose complex tasks into role-specialized agents that communicate, critique, and coordinate through explicit interaction protocols. AutoGen and AgentScope provide general-purpose infrastructures for composing conversational or message-passing agents (Wu et al., [2024a](https://arxiv.org/html/2605.29313#bib.bib1); Gao et al., [2024](https://arxiv.org/html/2605.29313#bib.bib2)), while ChatDev and MetaGPT instantiate role-based collaboration for software development through chat chains or SOP-style workflows (Qian et al., [2024](https://arxiv.org/html/2605.29313#bib.bib19); Hong et al., [2024](https://arxiv.org/html/2605.29313#bib.bib11)). Recent analyses further show that coordination failures, ambiguous handoffs, and redundant communication remain central challenges in LLM-based multi-agent systems (Cemri et al., [2025](https://arxiv.org/html/2605.29313#bib.bib4)). This line of work motivates treating the communication substrate itself as a first-class object in multi-agent design.

#### Structured workflows and verification\-aware orchestration\.

A growing body of work makes agent execution more explicit through graphs, state machines, language-model programs, or automatically optimized workflows. LangGraph represents agent applications as stateful graphs (LangChain, [2024](https://arxiv.org/html/2605.29313#bib.bib15)); StateFlow formulates LLM task solving as state-driven workflows (Wu et al., [2024b](https://arxiv.org/html/2605.29313#bib.bib30)); DSPy abstracts LM pipelines as declarative programs that can be optimized (Khattab et al., [2024](https://arxiv.org/html/2605.29313#bib.bib14)); and SGLang targets efficient execution of structured language-model programs (Zheng et al., [2024](https://arxiv.org/html/2605.29313#bib.bib37)). Recent systems such as AFlow and VeriMAP further explore automated workflow generation and verification-aware multi-agent planning (Zhang et al., [2025](https://arxiv.org/html/2605.29313#bib.bib36); Xu et al., [2026](https://arxiv.org/html/2605.29313#bib.bib38)). These works show the value of moving agent coordination from free-form chat toward explicit control structures.

#### Shared memory, blackboards, and agent memory\.

Blackboard architectures provide a classical mechanism for coordinating independent knowledge sources through a shared state (Hayes-Roth, [1985](https://arxiv.org/html/2605.29313#bib.bib10); Penny, [1986](https://arxiv.org/html/2605.29313#bib.bib17)). This idea has recently reappeared in LLM multi-agent systems, where blackboard-style memory supports dynamic agent selection, shared information discovery, and event-driven collaboration (Han and Zhang, [2025](https://arxiv.org/html/2605.29313#bib.bib9); Salemi et al., [2025](https://arxiv.org/html/2605.29313#bib.bib21)). In parallel, agent memory systems study how long-term observations can be stored, linked, and retrieved to support persistent behavior across interactions (Packer et al., [2023](https://arxiv.org/html/2605.29313#bib.bib18); Xu et al., [2025](https://arxiv.org/html/2605.29313#bib.bib31)). These works highlight the importance of persistent shared state, while leaving open how such state should be updated, authorized, and audited during long-horizon collaboration.

#### Structured outputs, transactions, and semantic verification\.

Structured generation techniques constrain LLM outputs with grammars, schemas, or programming interfaces, reducing format errors in machine-consumed outputs (Beurer-Kellner et al., [2023](https://arxiv.org/html/2605.29313#bib.bib3); Willard and Louf, [2023](https://arxiv.org/html/2605.29313#bib.bib29); Zheng et al., [2024](https://arxiv.org/html/2605.29313#bib.bib37); Geng et al., [2025](https://arxiv.org/html/2605.29313#bib.bib7)). Related systems work brings stronger runtime guarantees into LLM agents: SagaLLM studies context management, validation, and transaction guarantees for multi-agent planning (Chang and Geng, [2025](https://arxiv.org/html/2605.29313#bib.bib5)), while recent runtime-governance work emphasizes path-dependent policy enforcement for autonomous agents (Kaptein et al., [2026](https://arxiv.org/html/2605.29313#bib.bib13)). Finally, evidence-grounded QA and fact-checking benchmarks such as HotpotQA, FEVER, and MuSiQue evaluate whether generated claims are supported by evidence (Yang et al., [2018](https://arxiv.org/html/2605.29313#bib.bib32); Thorne et al., [2018](https://arxiv.org/html/2605.29313#bib.bib25); Trivedi et al., [2022](https://arxiv.org/html/2605.29313#bib.bib26)). Together, these lines connect structured output control, transactional execution, and semantic verification.

Overall, PatchBoard advances this direction by making collaboration a sequence of schema\-grounded, role\-authorized, replayable state mutations, giving multi\-agent systems a tighter runtime boundary for reliable and auditable coordination\.

## 3 Methodology

![Refer to caption](https://arxiv.org/html/2605.29313v1/x1.png)

Figure 1: PatchBoard architecture. The Architect compiles a user request into a task blueprint containing the global state schema, worker contracts, workflow rules, and context budgets. The deterministic kernel maintains the global state tree, constructs bounded state views, validates worker-proposed JSON Patches, commits accepted updates transactionally, and schedules subsequent worker invocations. Workers interact through schema-validated mutations to the shared state.

### 3.1 Method Overview

PatchBoard formulates multi-agent collaboration as a closed state-transition loop over a shared structured state. As shown in Figure [1](https://arxiv.org/html/2605.29313#S3.F1), an Architect first converts the user request into a task blueprint specifying the global state schema, worker contracts, context budgets, and workflow rules. After the blueprint is validated, runtime coordination is handled by a deterministic kernel, which initializes the global state, constructs bounded worker views, validates JSON Patch proposals, commits accepted patches, records transaction logs, and schedules future workers from committed state events.

Let $\mathcal{S}_t$ denote the global state tree at step $t$. A hand-crafted blueprint meta-schema $\Sigma_{\mathrm{meta}}$ defines the legal structure of Architect-produced blueprints. An accepted blueprint $\mathcal{B}$ instantiates a task-specific schema $\Sigma$, a set of workers $\mathcal{A}$, and workflow rules $\mathcal{R}$. The schema $\Sigma$ defines the valid structure and invariants of $\mathcal{S}_t$, while $\mathcal{R}$ maps committed state events to future worker invocations. For a worker $a \in \mathcal{A}$, the kernel materializes a bounded view $\mathcal{V}_t^a$ and receives a candidate patch $\Delta_t^a$.

The kernel is the only component that can mutate the committed state. Let $W_a$ denote the write contract of worker $a$, and let $\bot$ denote failed patch application. The kernel first applies the candidate patch to a temporary copy of the current state,

$$
\hat{\mathcal{S}}_{t+1} = \mathsf{Apply}(\mathcal{S}_t, \Delta_t^a), \tag{1}
$$

$$
\begin{aligned}
\mathsf{Accept}_t^a ={}& \mathsf{Syntax}(\Delta_t^a) \land \mathsf{Auth}(\Delta_t^a, W_a) \\
& \land\; (\hat{\mathcal{S}}_{t+1} \neq \bot) \land \mathsf{Valid}_{\Sigma}(\hat{\mathcal{S}}_{t+1}) \\
& \land\; \mathsf{Inv}_{\mathcal{B}}(\mathcal{S}_t, \Delta_t^a, \hat{\mathcal{S}}_{t+1}).
\end{aligned}
$$

Here $\mathsf{Accept}_t^a$ indicates whether worker $a$’s patch is accepted at step $t$, and $\mathsf{Inv}_{\mathcal{B}}$ denotes the runtime invariants registered by blueprint $\mathcal{B}$. Accepted patches are committed as transactions; rejected patches are logged without changing the committed state. This separates model-generated proposals from accepted system state. The full kernel pseudocode is provided in Appendix [A](https://arxiv.org/html/2605.29313#A1).
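The acceptance predicate can be read as a short-circuit conjunction over a temporary copy of the state. The following is a minimal Python sketch under stated assumptions, not the authors' kernel (their pseudocode is in Appendix A). The names `apply_patch` and `accept_and_commit`, the `valid_schema` and `invariant` callables, the glob-style write contracts, and the choice to exempt `test` operations from write authorization are all hypothetical. JSON Pointer escaping is omitted and array handling is simplified.

```python
import copy
from fnmatch import fnmatchcase
from typing import Any, Callable, Optional

Patch = list[dict[str, Any]]
ALLOWED_OPS = {"add", "replace", "test"}  # restricted subset; "remove" is disabled


def apply_patch(state: dict, patch: Patch) -> Optional[dict]:
    """Apply the patch to a temporary copy; None plays the role of the bottom symbol."""
    new = copy.deepcopy(state)
    try:
        for op in patch:
            *parents, leaf = op["path"].lstrip("/").split("/")
            node: Any = new
            for key in parents:
                node = node[int(key)] if isinstance(node, list) else node[key]
            if isinstance(node, list):
                if op["op"] == "add":
                    if leaf == "-":
                        node.append(op["value"])  # "/-" appends to an array
                    else:
                        node.insert(int(leaf), op["value"])
                    continue
                leaf = int(leaf)
            if op["op"] == "test":
                if node[leaf] != op["value"]:
                    return None  # stale-view precondition failed
            else:
                if op["op"] == "replace":
                    _ = node[leaf]  # raises if the target does not exist
                node[leaf] = op["value"]
        return new
    except (KeyError, IndexError, ValueError, TypeError):
        return None


def accept_and_commit(
    state: dict,
    patch: Patch,
    write_paths: list[str],
    valid_schema: Callable[[dict], bool],
    invariant: Callable[[dict, Patch, dict], bool],
) -> tuple[bool, dict]:
    """Return (accepted, committed_state); a rejection returns the old state object."""
    syntax = isinstance(patch, list) and all(
        isinstance(op, dict)
        and op.get("op") in ALLOWED_OPS
        and isinstance(op.get("path"), str)
        and "value" in op
        for op in patch
    )
    if not syntax:
        return False, state
    # Assumption: only mutating ops need write authorization; "test" is a read.
    authorized = all(
        any(fnmatchcase(op["path"], pattern) for pattern in write_paths)
        for op in patch
        if op["op"] != "test"
    )
    if not authorized:
        return False, state
    candidate = apply_patch(state, patch)
    if candidate is None or not valid_schema(candidate):
        return False, state
    if not invariant(state, patch, candidate):
        return False, state
    return True, candidate
```

Because every check runs against `candidate` rather than `state`, a rejected patch can never leave a partially applied edit behind, which is the transactional property the text describes.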

### 3.2 Architect and Task Blueprint

The Architect is invoked once at task initialization. Given a user request, it produces a blueprint $\mathcal{B}$ that defines the collaboration structure before any worker is called. The blueprint contains the task schema $\Sigma$, worker specifications, context budgets, and workflow rules $\mathcal{R}$. These fields determine the layout of the shared state, the roles that may participate in the task, the state regions each role may observe or modify, and the events that trigger future worker invocations.

Before execution, the kernel validates the Architect output against the blueprint meta-schema $\Sigma_{\mathrm{meta}}$. The meta-schema constrains the format of the generated blueprint. Worker names must be declared, read and write paths must refer to valid schema locations, context budgets must be finite, and workflow rules must use the restricted trigger-condition-action format supported by the scheduler. A structurally invalid blueprint is rejected before runtime execution begins.

After validation, the accepted blueprint becomes the runtime contract for the task. Each worker specification contains a role instruction, authorized read paths, authorized write paths, a view budget, and patch-format constraints. For example, an evidence collector may read the task query and unresolved claims, append evidence records under `/evidence/-`, and have no permission to replace verifier-controlled fields such as `/claims/*/status`. Workflow rules connect committed state changes to future computation. Adding a source may wake an extractor, while adding an unverified claim may wake a verifier.
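To make the blueprint checks concrete, here is a minimal sketch of what a blueprint and its structural validation might look like. The paper does not publish the blueprint format in this section, so every field name (`schema_paths`, `workers`, `view_budget`, `rules`, `wake`) and the function `validate_blueprint` are assumptions for illustration; only the four checks themselves come from the text.

```python
import math

# Hypothetical blueprint mirroring the evidence-collector / verifier example.
blueprint = {
    "schema_paths": ["/query", "/sources/-", "/claims/-", "/claims/*/status", "/evidence/-"],
    "workers": {
        "evidence_collector": {
            "read": ["/query", "/claims/-"],
            "write": ["/evidence/-"],
            "view_budget": 2000,
        },
        "verifier": {
            "read": ["/claims/-", "/evidence/-"],
            "write": ["/claims/*/status"],
            "view_budget": 1500,
        },
    },
    "rules": [
        {"trigger": "add", "condition": {"path": "/claims/-"}, "action": {"wake": "verifier"}},
    ],
}


def validate_blueprint(bp: dict) -> list[str]:
    """Return structural errors; an empty list means the blueprint is accepted."""
    errors = []
    schema_paths = set(bp["schema_paths"])
    workers = bp["workers"]
    for name, spec in workers.items():
        # Read and write paths must refer to declared schema locations.
        for path in spec["read"] + spec["write"]:
            if path not in schema_paths:
                errors.append(f"{name}: unknown path {path}")
        # Context budgets must be finite numbers.
        budget = spec["view_budget"]
        if not (isinstance(budget, (int, float)) and math.isfinite(budget)):
            errors.append(f"{name}: view budget must be finite")
    for rule in bp["rules"]:
        # Rules must use the restricted trigger-condition-action shape.
        if set(rule) != {"trigger", "condition", "action"}:
            errors.append("rule is not in trigger-condition-action form")
        elif rule["action"].get("wake") not in workers:
            errors.append(f"rule wakes undeclared worker {rule['action'].get('wake')}")
    return errors
```

Returning a list of errors rather than raising on the first one is a design choice of this sketch; it lets a rejected blueprint be logged with all of its structural problems at once.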

This design makes the Architect a setup\-time planning component\. Its output defines the initial collaboration structure, including the state schema, worker roles, context budgets, and workflow rules\. After the blueprint is accepted, runtime coordination is fully mediated by the deterministic kernel through patch validation, transactional commits, and event\-based scheduling\.

### 3.3 Schema-Grounded Patch Interface

Workers interact with the shared state through a restricted JSON Patch interface (Bryan and Nottingham, [2013](https://arxiv.org/html/2605.29313#bib.bib20)). At each invocation, a worker receives a bounded view $\mathcal{V}_t^a$ and returns a candidate patch $\Delta_t^a$ over the global state tree. Each operation is path-addressed, so the intended edit is explicit at the field level. This makes worker outputs easier to validate, attribute, and replay than free-form messages.

The patch interface is grounded in the accepted schema $\Sigma$. Each path in a candidate patch must refer to a valid schema location, and each value must satisfy the type and field constraints associated with that location. The allowed operation subset is intentionally small. Workers may add newly produced objects, replace fields assigned to their role, and use `test` operations to express stale-view preconditions. Destructive operations such as `remove` are disabled by default and can be enabled only for privileged roles.
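As an illustration of a stale-view precondition, the sketch below pairs a `test` operation with a `replace`, following the JSON Patch operation format of RFC 6902. The patch contents, the state layout, and the helper names `lookup` and `preconditions_hold` are hypothetical; the paper does not show a concrete worker patch in this section.

```python
import json

# Hypothetical verifier patch: the "test" op pins the value the worker saw in
# its view, so the "replace" should only apply if the claim is still a draft.
patch_text = """
[
  {"op": "test",    "path": "/claims/0/status", "value": "draft"},
  {"op": "replace", "path": "/claims/0/status", "value": "verified"}
]
"""
patch = json.loads(patch_text)


def lookup(state, pointer):
    """Resolve a simple JSON Pointer (no ~0/~1 escapes) in nested dicts and lists."""
    node = state
    for key in pointer.lstrip("/").split("/"):
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node


def preconditions_hold(state, patch):
    """True when every "test" op still matches the current committed state."""
    return all(
        lookup(state, op["path"]) == op["value"]
        for op in patch
        if op["op"] == "test"
    )


fresh = {"claims": [{"text": "c1", "status": "draft"}]}
stale = {"claims": [{"text": "c1", "status": "rejected"}]}  # changed since the view
```

Against `fresh` the precondition holds and the replacement may proceed; against `stale` it fails, so the whole patch would be rejected rather than overwriting a newer value.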

Role-specific write contracts further constrain the interface. A worker can only propose edits to paths granted by its blueprint specification. For example, an extractor may append draft claims under `/claims/-`, while a verifier may replace verification fields under `/claims/*/status`. This path-level separation prevents one role from silently overwriting intermediate products or decisions assigned to another role.
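A path-level write contract can be checked with a small matcher. This sketch assumes, for illustration only, that `*` in a contract stands for exactly one path segment; the paper does not specify its wildcard semantics. The `WRITE_CONTRACTS` table and both function names are hypothetical.

```python
def path_matches(pattern: str, path: str) -> bool:
    """Segment-wise match in which "*" stands for exactly one path segment."""
    p_parts, t_parts = pattern.split("/"), path.split("/")
    return len(p_parts) == len(t_parts) and all(
        p == "*" or p == t for p, t in zip(p_parts, t_parts)
    )


# Hypothetical contracts mirroring the extractor / verifier example.
WRITE_CONTRACTS = {
    "extractor": ["/claims/-"],
    "verifier": ["/claims/*/status"],
}


def unauthorized_ops(worker: str, patch: list[dict]) -> list[dict]:
    """Return the operations whose target path the worker may not write."""
    allowed = WRITE_CONTRACTS.get(worker, [])
    return [
        op
        for op in patch
        if not any(path_matches(pattern, op["path"]) for pattern in allowed)
    ]
```

Matching whole segments keeps a contract such as `/claims/*/status` from also covering deeper paths, which a plain glob match would allow.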

Patch validation happens before any committed state is modified. The kernel parses the candidate patch, checks operation syntax, verifies path authorization, applies the patch to a temporary copy of $\mathcal{S}_t$, and validates the resulting candidate state against $\Sigma$ and registered invariants. If all checks pass, the patch is committed as a transaction and produces $\mathcal{S}_{t+1}$. If any check fails, the patch is rejected and logged with its rejection reason.

This interface provides a narrow coordination surface for LLM workers\. Workers still generate semantic content, and schema validity alone cannot guarantee factual correctness\. However, every accepted update has an explicit writer, path, operation, and post\-state validation result\. This gives downstream workers and auditors a concrete record of how the shared state evolved\.

### 3.4 Deterministic Kernel

The deterministic kernel is the runtime controller of PatchBoard\. After a blueprint has been accepted, the kernel maintains the committed global state, the event queue, worker budgets, and the transaction log\. For each scheduled worker invocation, it constructs the worker input from the current state, calls the worker, receives a candidate patch, and decides whether the proposed update can become part of the committed trajectory\.

The kernel validates every worker output before state mutation. Given $\mathcal{S}_t$, worker $a$, and candidate patch $\Delta_t^a$, the kernel checks the allowed JSON Patch operation subset and verifies that every target path is covered by the worker’s write contract. It then applies the patch to a temporary copy of $\mathcal{S}_t$ and validates the candidate next state against $\Sigma$ and any registered invariants. The committed state advances to $\mathcal{S}_{t+1}$ only when all validation checks succeed.

Accepted patches are committed transactionally\. A committed transaction records the worker id, triggering event, worker\-view hash, accepted patch, and resulting state hash\. Rejected patches are also recorded with the failed validation stage and rejection reason\. Since rejected patches never modify the committed state, malformed outputs, unauthorized writes, and schema\-violating edits remain visible in the log without contaminating the shared state\.
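One way to picture the transaction record is as an append-only log of small immutable entries. The sketch below is an assumption-laden illustration: the class and field names, the use of SHA-256, and the canonical-JSON hashing are choices made here, not details given in the paper. Only the recorded items (worker id, triggering event, view hash, patch, resulting state hash, failed stage, rejection reason) come from the text.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Optional


def state_hash(obj: Any) -> str:
    """Hash a canonical JSON encoding so that equal states hash equally."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Transaction:
    worker_id: str
    event: str
    view_hash: str
    patch: list
    accepted: bool
    result_hash: Optional[str] = None   # set only for accepted patches
    failed_stage: Optional[str] = None  # e.g. "syntax", "auth", "schema"
    reason: Optional[str] = None


@dataclass
class TransactionLog:
    entries: list = field(default_factory=list)

    def commit(self, worker_id, event, view, patch, new_state) -> None:
        """Record an accepted patch together with the resulting state hash."""
        self.entries.append(
            Transaction(worker_id, event, state_hash(view), patch, True,
                        result_hash=state_hash(new_state))
        )

    def reject(self, worker_id, event, view, patch, stage, reason) -> None:
        """Record a rejected patch; the committed state is not touched."""
        self.entries.append(
            Transaction(worker_id, event, state_hash(view), patch, False,
                        failed_stage=stage, reason=reason)
        )
```

Sorting keys before hashing means two states that differ only in key order produce the same hash, which matters if the hashes are later compared during replay or cycle detection.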

The scheduler consumes events emitted by committed transactions and matches them against $\mathcal{R}$. When a rule condition is satisfied, the corresponding worker invocation is added to the event queue. The runtime trajectory therefore depends on accepted state changes rather than free-form inter-worker messages. The same committed state and transaction log provide a replayable account of how the system reached its final state.
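The event-to-worker matching can be sketched as a lookup over a small rule table. The rule fields (`trigger`, `path_prefix`, `wake`) and the function names are hypothetical simplifications of the trigger-condition-action format, and the two rules encode the earlier example in which a new source wakes an extractor and a new claim wakes a verifier.

```python
from collections import deque

# Hypothetical rule table for the source -> extractor, claim -> verifier example.
RULES = [
    {"trigger": "add", "path_prefix": "/sources/", "wake": "extractor"},
    {"trigger": "add", "path_prefix": "/claims/", "wake": "verifier"},
]


def events_from(committed_patch):
    """Emit one (op, path) event per mutating op of a committed patch."""
    return [(op["op"], op["path"]) for op in committed_patch if op["op"] != "test"]


def schedule(queue: deque, committed_patch, rules=RULES) -> deque:
    """Append a worker invocation for every rule matched by a committed event."""
    for op_name, path in events_from(committed_patch):
        for rule in rules:
            if rule["trigger"] == op_name and path.startswith(rule["path_prefix"]):
                queue.append(rule["wake"])
    return queue
```

Because `schedule` is only ever called with committed patches, a rejected proposal cannot wake any downstream worker in this sketch.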

The kernel also monitors simple failure signals during execution\. It tracks consecutive invalid patches, repeated no\-op edits, exhausted worker budgets, and repeated state hashes\. When a configured threshold is reached, the kernel may stop the branch, wake a verifier, reduce the available view, or terminate the task\. These policies are deterministic functions of the transaction log and the accepted blueprint\.
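A deterministic halting policy over the log might look like the following sketch. The thresholds, the return labels, and the simplified log shape (a list of `(accepted, state_hash)` pairs) are assumptions; the paper names the signals but does not give the concrete policy here.

```python
from collections import Counter


def monitor(log, max_invalid=3, max_repeats=2):
    """Decide from (accepted, state_hash) entries whether to halt.

    Returns "halt_invalid" after max_invalid consecutive rejections,
    "halt_cycle" once an accepted state hash has been seen more than
    max_repeats times, and "continue" otherwise.
    """
    streak = 0
    seen = Counter()
    for accepted, digest in log:
        if not accepted:
            streak += 1
            if streak >= max_invalid:
                return "halt_invalid"
            continue
        streak = 0  # an accepted patch resets the invalid streak
        seen[digest] += 1
        if seen[digest] > max_repeats:
            return "halt_cycle"
    return "continue"
```

Since the function reads nothing but the log and its fixed thresholds, rerunning it on the same log always yields the same decision.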

### 3.5 Budgeted Context Views

The global state tree may grow as workers add sources, claims, plans, evidence records, and verification results. Passing the full state to every worker increases context cost and exposes irrelevant fields to roles that do not need them. PatchBoard therefore makes view construction a kernel responsibility. Before invoking worker $a$, the kernel materializes a bounded view $\mathcal{V}_t^a$ from $\mathcal{S}_t$ according to the worker’s read contract and context budget.

A view contains the state fields required by the worker’s role, the relevant schema fragment, unresolved dependencies, and recent rejection feedback associated with the same worker or state region\. Large collections are represented through compact summaries and stable handles\. For example, a verifier may receive the task query, a small set of unresolved claims, their linked evidence handles, and the schema fragment for claim\-status updates, while unrelated worker outputs remain outside its view\.

The context budget limits the size of the materialized view\. When the authorized state region exceeds this budget, the kernel prioritizes active task fields, required schema fields, and recently changed objects\. Older or lower\-priority collections are compressed into summaries that preserve identifiers and provenance\. A worker that needs additional information can propose a typed expansion request through the same patch interface, allowing the kernel to page in specific handles in a later invocation\.
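A greedy version of this view construction can be sketched as follows. This is an illustration under several assumptions: the budget is measured in JSON characters as a stand-in for tokens, read contracts are simplified to top-level field names, collection items carry an `id` field, and the function name `build_view` and the summary layout are hypothetical.

```python
import json


def build_view(state, read_fields, budget_chars, priority):
    """Greedy bounded view over authorized top-level fields.

    Fields are considered in priority order. A collection that does not fit
    is replaced by a summary that keeps item identifiers as handles; a field
    that still does not fit is left out of the view.
    """
    view = {}
    used = 0
    for key in sorted(read_fields, key=priority.index):
        value = state.get(key)
        size = len(json.dumps(value))
        if used + size > budget_chars and isinstance(value, list):
            value = {
                "summary": f"{len(value)} items",
                "handles": [item["id"] for item in value],
            }
            size = len(json.dumps(value))
        if used + size <= budget_chars:
            view[key] = value
            used += size
    return view
```

Fields outside `read_fields` never enter the view, so the same function enforces the read contract and the budget in one pass.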

This design keeps context selection explicit and auditable\. The transaction log records the view hash used for each worker call, so accepted and rejected patches can be traced back to the state slice that produced them\. Budgeted views also reduce accidental role leakage, since workers only observe paths allowed by their read contracts\. As a result, PatchBoard controls both what a worker may modify and what information the worker may condition on during patch generation\.

### 3.6 Structural Properties

The preceding components give PatchBoard several structural properties, conditioned on a valid blueprint and a correct kernel implementation. Since every runtime update passes through the patch validator inside the deterministic kernel, committed states preserve the accepted task schema. If $\mathcal{S}_t$ satisfies $\Sigma$, the kernel commits a transition only after applying the candidate patch to a temporary copy and validating the resulting state against $\Sigma$ and registered invariants. Malformed patches, type errors, missing required fields, and schema-violating updates are rejected before they can modify the committed state.

Worker effects are isolated by path\-level contracts defined in the blueprint\. Each worker can propose patches only to paths authorized by its role, and the kernel checks these paths before applying the patch\. This prevents one role from silently overwriting fields assigned to another role\. It also makes each accepted update attributable to a specific worker invocation, since the transaction records the worker id, the viewed state slice, the proposed patch, and the validation outcome\.

Accepted trajectories are replayable at the patch level\. The transaction manager records accepted and rejected patches, input view hashes, rejection reasons, and resulting state hashes\. Given the initial state, the accepted blueprint, and the committed transaction log, the sequence of accepted state transitions can be reconstructed without resampling worker outputs\. This supports debugging and auditability, while the event scheduler makes downstream worker invocations deterministic with respect to committed state events and workflow rules\.
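Patch-level replay can be sketched as a fold over the accepted log entries, checking each recorded state hash along the way. The log entry layout (`accepted`, `patch`, `result_hash`), the helper names, and the hashing scheme are assumptions for illustration, and `apply_ops` handles only the simplified `add` and `replace` cases used here.

```python
import copy
import hashlib
import json


def state_hash(state) -> str:
    """Hash a canonical JSON encoding of the state."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def apply_ops(state, patch):
    """Apply add/replace ops to a copy; "test" ops are skipped because an
    accepted patch already passed them when it was committed."""
    new = copy.deepcopy(state)
    for op in patch:
        if op["op"] == "test":
            continue
        *parents, leaf = op["path"].lstrip("/").split("/")
        node = new
        for key in parents:
            node = node[int(key)] if isinstance(node, list) else node[key]
        if isinstance(node, list) and leaf == "-":
            node.append(op["value"])
        elif isinstance(node, list):
            node[int(leaf)] = op["value"]
        else:
            node[leaf] = op["value"]
    return new


def replay(initial_state, log):
    """Rebuild the final state from accepted entries, verifying each hash."""
    state = copy.deepcopy(initial_state)
    for index, entry in enumerate(log):
        if not entry["accepted"]:
            continue  # rejected patches never changed the committed state
        state = apply_ops(state, entry["patch"])
        if state_hash(state) != entry["result_hash"]:
            raise ValueError(f"log entry {index} does not reproduce its state hash")
    return state
```

No worker or model call appears anywhere in `replay`, which is the sense in which the trajectory can be reconstructed without resampling worker outputs.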

## 4 Experimental Setup

### 4.1 Evaluation Goals

The experiments evaluate whether schema\-grounded state mutation improves task success and normalized cost, whether the gains can be explained by shared memory alone, and which PatchBoard components contribute most to the observed behavior\. We also include controlled fault injection and a diagnostic QA setting to distinguish structural state validity from semantic support\.

### 4.2 Benchmarks and Systems

The primary benchmark is ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.29313#bib.bib24)). We use 126 matched gamefiles stratified across six ALFWorld task types and run each gamefile with 5 independent execution seeds, yielding 630 episodes per system. All systems use the same gamefiles, seeds, model, decoding configuration, environment step budget, and timeout. An episode is counted as successful only when the environment reaches the target terminal condition within 20 environment steps.

The main ALFWorld comparison includes PatchBoard, LangGraph (LangChain, [2024](https://arxiv.org/html/2605.29313#bib.bib15)), and Flock (white duck GmbH, [2025](https://arxiv.org/html/2605.29313#bib.bib40)). LangGraph serves as a graph-based workflow baseline with explicit nodes and shared state, while Flock serves as a blackboard-based multi-agent baseline following its public repository and documentation. All systems use the same ALFWorld observation-action interface, model, decoding setting, step budget, and timeout. Detailed baseline configurations are provided in Appendix [B.1](https://arxiv.org/html/2605.29313#A2.SS1).

The secondary benchmark is a HotpotQA diagnostic (Yang et al., [2018](https://arxiv.org/html/2605.29313#bib.bib32)) with 240 matched prepared validation examples. This setting evaluates evidence-grounded claim propagation rather than long-horizon environment interaction. It is used to diagnose whether structurally valid workflows reduce unsupported factual claims. Full HotpotQA results are reported in Appendix [C](https://arxiv.org/html/2605.29313#A3).

Unless otherwise stated, all experiments use Qwen-plus (Yang et al., [2025a](https://arxiv.org/html/2605.29313#bib.bib39)) with temperature 0 and a 60-second timeout. Full configuration details are summarized in Appendix [B.2](https://arxiv.org/html/2605.29313#A2.SS2).

### 4.3 Metrics and Estimation

For ALFWorld, we measure task success, environment steps, total token usage, and tokens per success. Tokens per success is computed as mean total tokens divided by success rate. Token totals include prompt and completion tokens from all LLM calls, covering blueprint generation, worker calls, schema/context views, and patch-format instructions for PatchBoard, as well as the corresponding planner, controller, worker, setup, and repair calls for the baselines. Appendix [D](https://arxiv.org/html/2605.29313#A4) gives a concrete PatchBoard running example and a component-level token cost breakdown.
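As a worked reading of the tokens-per-success definition (a back-of-the-envelope check derived here, not a figure reported by the authors), write $\bar{T}$ for mean total tokens per episode and $p$ for the success rate. Both symbols are introduced only for this illustration.

$$
\text{tokens per success} = \frac{\bar{T}}{p}
\quad\Longrightarrow\quad
\bar{T} = p \cdot \text{tokens per success}.
$$

With the rounded PatchBoard values quoted in the abstract, $p \approx 0.846$ and 45.5k tokens per success, this implies $\bar{T} \approx 0.846 \times 45.5\text{k} \approx 38.5\text{k}$ tokens per episode, averaged over successful and failed episodes alike. The metric therefore charges the cost of failed episodes to the successful ones.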

For HotpotQA, we use answer accuracy, unsupported claim rate, evidence coverage, and verified claim precision as diagnostic metrics\. Unsupported claim rate measures schema\-valid claims that lack sufficient evidence, while invalid state rate measures malformed or unauthorized intermediate updates that enter committed state\. Where applicable, proportion metrics use Wilson intervals and continuous metrics use paired bootstrap intervals over matched examples\.

![Refer to caption](https://arxiv.org/html/2605.29313v1/x2.png)

Figure 2: Main ALFWorld comparison under matched gamefiles and execution seeds. Tokens per successful episode use a log scale.

We report 95% confidence intervals for all main metrics. Proportion metrics use Wilson intervals, including success rate, answer accuracy, unsupported claim rate, invalid state rate, evidence coverage, verified claim precision, fault contamination rates, and cycle halt rate. Continuous metrics use paired bootstrap intervals over matched task and seed identifiers, including mean steps, mean solved steps, mean total tokens, tokens per success, and tokens per answer.
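For readers unfamiliar with the Wilson score interval, the sketch below computes it with the standard formula for a binomial proportion. The function name is hypothetical, and the example applies it to the 533 successes out of 630 episodes reported for PatchBoard; the resulting bounds are computed here and are not the intervals printed in the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# PatchBoard's reported success count on the matched ALFWorld episodes.
low, high = wilson_interval(533, 630)
```

Unlike the simple normal approximation, the Wilson interval stays inside the range from 0 to 1 and is not centered exactly on the observed proportion, which makes it better behaved for rates near the extremes.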

### 4.4 Fault Injection

We inject 200 instances of each fault type into each system\. The injected faults cover Invalid JSON, Bad Path/Type, Unauthorized Write, False Claim, and Cycle Halt\. The first three faults test whether malformed or unauthorized updates contaminate committed state\. False Claim tests whether schema\-valid but unsupported content is accepted\. Cycle Halt tests whether repeated no\-op or oscillatory trajectories are stopped\.

## 5Results and Analysis

### 5\.1Main ALFWorld Results

Figure [2](https://arxiv.org/html/2605.29313#S4.F2) summarizes the main ALFWorld comparison in terms of task success, environment steps, and tokens per success\.

PatchBoard solves 533/630 episodes, compared with 194/630 for LangGraph and 388/630 for Flock\. It also achieves the lowest tokens per success, indicating that validated state mutation improves both task success and normalized cost in this matched ALFWorld setting\.

### 5\.2Blackboard Controls

Blackboard controls test whether the improvement comes merely from adding shared memory\. Figure [3](https://arxiv.org/html/2605.29313#S5.F3) compares PatchBoard with plain and structured blackboards under the same matched ALFWorld setting\.

![Refer to caption](https://arxiv.org/html/2605.29313v1/x3.png)

Figure 3: Blackboard controls under matched ALFWorld episodes\.

Both blackboard controls solve fewer episodes than PatchBoard\. The plain blackboard also incurs a much higher cost per successful task, while the structured JSON blackboard narrows the cost gap but still trails in success\. These results suggest that structured shared state is helpful, and that transactional validation and write contracts provide additional gains beyond shared memory\.

### 5\.3Ablation Study

The strongest ablation effects come from removing the patch/schema interface and bounded context views\. Figure [4](https://arxiv.org/html/2605.29313#S5.F4) reports the change in success relative to full PatchBoard and the relative tokens per success\.

![Refer to caption](https://arxiv.org/html/2605.29313v1/x4.png)

Figure 4: Ablation impact relative to full PatchBoard\. The left panel reports success\-rate change relative to the full system\. The right panel reports relative tokens per success, with 1\.0 marking full PatchBoard\.

C1 and C3 produce the largest drops, while other components show smaller but consistent effects\. Removing the patch/schema interface causes the clearest cost failure, more than doubling tokens per success\. Removing context slicing produces a comparable success drop, which supports the view\-construction mechanism\. The remaining ablations move in the expected direction, but their effects are smaller on this evaluation\.

### 5\.4Sensitivity Analyses

We further examine two sensitivity factors: context budget measured in characters and schema source\. Figure [5](https://arxiv.org/html/2605.29313#S5.F5) reports the context\-budget sensitivity, and Figure [6](https://arxiv.org/html/2605.29313#S5.F6) reports the schema\-source sensitivity\.

![Refer to caption](https://arxiv.org/html/2605.29313v1/x5.png)

Figure 5: Context budget sensitivity on ALFWorld\.

The smallest tested context budget achieves the best success\-cost profile\. Increasing the budget does not improve success and leads to higher normalized cost\. This result supports the bounded\-view design: exposing more state to workers can introduce irrelevant context without improving local decision quality\.
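To make the idea of a character-bounded view concrete, the sketch below builds a worker view by adding readable fields in priority order while the serialized view stays within a character budget. It is a hypothetical illustration, not the paper's Slice operation: the function name, the flat top-level state, the priority ordering, and the example fields are all assumptions.

```python
import json


def slice_view(state, readable_keys, budget_chars):
    """Greedily add readable top-level fields while the JSON view fits the budget."""
    view = {}
    for key in readable_keys:  # keys are assumed to be listed in priority order
        if key not in state:
            continue
        candidate = {**view, key: state[key]}
        if len(json.dumps(candidate, sort_keys=True)) <= budget_chars:
            view = candidate  # keep the field only if the view still fits
    return view


# Hypothetical shared state with a long action history.
example_state = {
    "goal": "put a clean apple on the dining table",
    "status": "cleaning",
    "history": ["look"] * 50,
}
small_view = slice_view(example_state, ["goal", "status", "history"], budget_chars=120)
large_view = slice_view(example_state, ["goal", "status", "history"], budget_chars=2000)
```

With the small budget the worker sees the goal and status but not the long history, which mirrors the observation that extra state can add cost without improving local decisions.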

![Refer to caption](https://arxiv.org/html/2605.29313v1/x6.png)

Figure 6: Schema source sensitivity on ALFWorld\.

Generated task\-specific schemas outperform fixed schemas in both success and normalized cost\. This result supports using the Architect to construct task\-specific blueprints in the current setting, while also showing that schema construction quality affects the reliability of the overall system\.

### 5\.5Fault Isolation and Termination

Fault injection evaluates how each system handles malformed, unauthorized, semantic, and cyclic updates\. For Invalid JSON, Bad Path/Type, Unauthorized Write, and False Claim, lower rates indicate less contamination or unsupported accepted content\. For Cycle Halt, higher rates indicate more successful termination of repeated no\-op or oscillatory trajectories\.

Table 1: Fault injection results over 200 injections per fault type\.

PatchBoard has zero observed contamination for invalid JSON, bad paths/types, and unauthorized writes\. False claims remain possible because a false claim can satisfy the schema\. The high cycle\-halt rate shows that the deterministic kernel can stop most repeated no\-op or oscillatory trajectories\.

## 6Conclusion

We presented PatchBoard, a schema\-grounded architecture for reliable and auditable LLM multi\-agent collaboration\. PatchBoard replaces open\-ended inter\-agent dialogue with validated JSON Patch mutations over a shared structured state, where an Architect defines the task schema, workflow rules, worker contracts, and context budgets, and a deterministic kernel validates proposed mutations, commits accepted updates transactionally, and records replayable logs\. Across 630 matched ALFWorld episodes, PatchBoard achieves the strongest success\-cost profile among the compared systems, solving 84\.6% of tasks and reducing tokens per successful task to 45\.5k\. Blackboard controls, ablations, and sensitivity analyses indicate that the gains come from the patch/schema interface, bounded context views, and transactional validation beyond shared memory alone\. The diagnostic HotpotQA study further clarifies that schema\-grounded mutation improves structural validity and auditability, while factual correctness still depends on evidence selection, verifier design, and task\-specific semantic checks\.

## Limitations

PatchBoard provides structural reliability, but it does not guarantee semantic correctness\. The deterministic kernel can reject malformed patches, unauthorized writes, and schema\-violating state transitions, yet a schema\-valid claim may still be false, incomplete, or unsupported\. This limitation appears in the HotpotQA diagnostic, where structurally valid workflows can still produce unsupported claims\. Factual tasks therefore require stronger evidence retrieval, verifier design, and human review for high\-impact decisions\.

The system also depends on blueprint quality, model behavior, and task setting\. An overly sparse schema may block useful progress, while an overly permissive schema weakens role isolation\. Our strongest evidence comes from ALFWorld, where task state and success conditions are clearly defined; more open\-ended settings such as software engineering, scientific discovery, or long\-form generation may require richer schemas and more complex validation\. PatchBoard also introduces engineering overhead through schema construction, context slicing, patch validation, and transaction logging, making it most suitable when auditability, attribution, and state integrity are central requirements\.

## Ethical considerations

Beyond the technical limitations above, auditable state transitions can make multi\-agent systems easier to inspect, but they may also create false confidence\. Users must not treat schema validity as factual correctness\. Applications that affect people require evidence, provenance, and human review for high\-impact state transitions\. Because transaction logs can contain sensitive intermediate information, deployments require access control, retention policies, and redaction mechanisms\.

Our experiments use public research benchmarks, tools, and model APIs only for evaluation: ALFWorld, HotpotQA, LangGraph, Flock, and the Alibaba Cloud Bailian API\. We cite the corresponding creators or providers and use these artifacts according to their public documentation, licenses, or access terms\. We do not collect new human\-subject data\.

## References

- L\. Beurer\-Kellner, M\. Fischer, and M\. Vechev \(2023\)Prompting is programming: a query language for large language models\.Proceedings of the ACM on Programming Languages7\(PLDI\),pp\. 1946–1969\.Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p3.1),[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px4.p1.1)\.
- P\. Bourhis, J\. L\. Reutter, F\. Suárez, and D\. Vrgoč \(2017\)JSON: data model, query languages and schema specification\.InProceedings of the 36th ACM SIGMOD\-SIGACT\-SIGAI symposium on principles of database systems,pp\. 123–135\.Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p4.1)\.
- P\. Bryan and M\. Nottingham \(2013\)RFC 6902: javascript object notation \(json\) patch\.RFC Editor\.Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p4.1),[§3\.3](https://arxiv.org/html/2605.29313#S3.SS3.p1.2)\.
- M\. Cemri, M\. Z\. Pan, S\. Yang, L\. A\. Agrawal, B\. Chopra, R\. Tiwari, K\. Keutzer, A\. Parameswaran, D\. Klein, K\. Ramchandran, M\. Zaharia, J\. E\. Gonzalez, and I\. Stoica \(2025\)Why do multi\-agent LLM systems fail?\.InAdvances in Neural Information Processing Systems,Vol\.38\.External Links:[Link](https://nips.cc/virtual/2025/poster/121528)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px1.p1.1)\.
- E\. Y\. Chang and L\. Geng \(2025\)SagaLLM: context management, validation, and transaction guarantees for multi\-agent LLM planning\.Proceedings of the VLDB Endowment18\(12\),pp\. 4874–4886\.External Links:[Document](https://dx.doi.org/10.14778/3750601.3750611),[Link](https://www.vldb.org/pvldb/vol18/p4874-chang.pdf)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px4.p1.1)\.
- W\. Chen, Y\. Su, J\. Zuo, C\. Yang, C\. Yuan, C\. Chan, H\. Yu, Y\. Lu, Y\. Hung, C\. Qian, Y\. Qin, X\. Cong, R\. Xie, Z\. Liu, M\. Sun, and J\. Zhou \(2024\)AgentVerse: facilitating multi\-agent collaboration and exploring emergent behaviors\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=EHg5GDnyq1)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p1.1)\.
- D\. Gao, Z\. Li, X\. Pan, W\. Kuang, Z\. Ma, B\. Qian, F\. Wei, W\. Zhang, Y\. Xie, D\. Chen, L\. Yao, H\. Peng, Z\. Zhang, L\. Zhu, C\. Cheng, H\. Shi, Y\. Li, B\. Ding, and J\. Zhou \(2024\)AgentScope: a flexible yet robust multi\-agent platform\.External Links:2402\.14034,[Link](https://arxiv.org/abs/2402.14034)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Geng, H\. Cooper, M\. Moskal, S\. Jenkins, J\. Berman, N\. Ranchin, R\. West, E\. Horvitz, and H\. Nori \(2025\)JSONSchemaBench: a rigorous benchmark of structured outputs for language models\.External Links:2501\.10868,[Document](https://dx.doi.org/10.48550/arXiv.2501.10868),[Link](https://arxiv.org/abs/2501.10868)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px4.p1.1)\.
- B\. Han and S\. Zhang \(2025\)Exploring advanced LLM multi\-agent systems based on blackboard architecture\.External Links:2507\.01701,[Document](https://dx.doi.org/10.48550/arXiv.2507.01701),[Link](https://arxiv.org/abs/2507.01701)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px3.p1.1)\.
- B\. Hayes\-Roth \(1985\)A blackboard architecture for control\.Artificial Intelligence26\(3\),pp\. 251–321\.External Links:[Document](https://dx.doi.org/10.1016/0004-3702%2885%2990063-3)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p3.1),[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Hong, M\. Zhuge, J\. Chen, X\. Zheng, Y\. Cheng, C\. Zhang, J\. Wang, Z\. Wang, S\. K\. S\. Yau, Z\. Lin, L\. Zhou, C\. Ran, L\. Xiao, C\. Wu, and J\. Schmidhuber \(2024\)MetaGPT: meta programming for a multi\-agent collaborative framework\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=VtmBAGCN7o)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p1.1),[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Kaptein, V\. Khan, and A\. Podstavnychy \(2026\)Runtime governance for AI agents: policies on paths\.External Links:2603\.16586,[Document](https://dx.doi.org/10.48550/arXiv.2603.16586),[Link](https://arxiv.org/abs/2603.16586)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px4.p1.1)\.
- O\. Khattab, A\. Singhvi, P\. Maheshwari, Z\. Zhang, K\. Santhanam, S\. Vardhamanan, S\. Haq, A\. Sharma, T\. T\. Joshi, H\. Moazam, H\. Miller, M\. Zaharia, and C\. Potts \(2024\)DSPy: compiling declarative language model calls into self\-improving pipelines\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=sY5N0zY5Od)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px2.p1.1)\.
- LangChain \(2024\)LangGraph: low\-level orchestration framework for controllable agents\.Note:[https://langchain\-ai\.github\.io/langgraph/](https://langchain-ai.github.io/langgraph/)Accessed 2026\-05\-06Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p3.1),[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2605.29313#S4.SS2.p2.1)\.
- G\. Li, H\. Hammoud, H\. Itani, D\. Khizbullin, and B\. Ghanem \(2023\)Camel: communicative agents for "mind" exploration of large language model society\.Advances in neural information processing systems36,pp\. 51991–52008\.Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p1.1)\.
- C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2023\)MemGPT: towards LLMs as operating systems\.External Links:2310\.08560,[Document](https://dx.doi.org/10.48550/arXiv.2310.08560),[Link](https://arxiv.org/abs/2310.08560)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px3.p1.1)\.
- H\. P\. Nii \(1986\)Blackboard systems: the blackboard model of problem solving and the evolution of blackboard architectures\.The AI Magazine\.Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Qian, W\. Liu, H\. Liu, N\. Chen, Y\. Dang, J\. Li, C\. Yang, W\. Chen, Y\. Su, X\. Cong, J\. Xu, D\. Li, Z\. Liu, and M\. Sun \(2024\)ChatDev: communicative agents for software development\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Bangkok, Thailand,pp\. 15174–15186\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.810),[Link](https://aclanthology.org/2024.acl-long.810/)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p1.1),[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Salemi, M\. Parmar, P\. Goyal, Y\. Song, J\. Yoon, H\. Zamani, T\. Pfister, and H\. Palangi \(2025\)LLM\-based multi\-agent blackboard system for information discovery in data science\.External Links:2510\.01285,[Document](https://dx.doi.org/10.48550/arXiv.2510.01285),[Link](https://arxiv.org/abs/2510.01285)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p3.1),[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 68539–68551\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://arxiv.org/abs/2303.11366)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p1.1)\.
- M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2021\)ALFWorld: aligning text and embodied environments for interactive learning\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=0IOX0YcCdTn)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p4.1),[§4\.2](https://arxiv.org/html/2605.29313#S4.SS2.p1.1)\.
- J\. Thorne, A\. Vlachos, C\. Christodoulopoulos, and A\. Mittal \(2018\)FEVER: a large\-scale dataset for fact extraction and VERification\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,New Orleans, Louisiana,pp\. 809–819\.External Links:[Document](https://dx.doi.org/10.18653/v1/N18-1074),[Link](https://aclanthology.org/N18-1074/)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px4.p1.1)\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\)MuSiQue: multihop questions via single\-hop question composition\.Transactions of the Association for Computational Linguistics10,pp\. 539–554\.External Links:[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00475),[Link](https://aclanthology.org/2022.tacl-1.31/)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px4.p1.1)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2024a\)Voyager: an open\-ended embodied agent with large language models\.Transactions on Machine Learning Research\.External Links:[Link](https://openreview.net/forum?id=ehfRiF0R3a)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p1.1),[§1](https://arxiv.org/html/2605.29313#S1.p3.1)\.
- X\. Wang, J\. Chen, N\. Li, L\. Chen, X\. Yuan, W\. Shi, X\. Ge, R\. Xu, and Y\. Xiao \(2024b\)SurveyAgent: a conversational system for personalized and efficient research survey\.External Links:2404\.06364,[Link](https://arxiv.org/abs/2404.06364)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p1.1)\.
- white duck GmbH \(2025\)Flock: declarative blackboard multi\-agent orchestration\.Note:[https://whiteducksoftware\.github\.io/flock/](https://whiteducksoftware.github.io/flock/)Documentation\. Accessed: 2026\-05\-25Cited by:[§4\.2](https://arxiv.org/html/2605.29313#S4.SS2.p2.1)\.
- B\. T\. Willard and R\. Louf \(2023\)Efficient guided generation for large language models\.External Links:2307\.09702,[Link](https://arxiv.org/abs/2307.09702)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p3.1),[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px4.p1.1)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, A\. H\. Awadallah, R\. W\. White, D\. Burger, and C\. Wang \(2024a\)AutoGen: enabling next\-gen LLM applications via multi\-agent conversations\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=BAakY1hNKS)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p1.1),[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Wu, T\. Yue, S\. Zhang, C\. Wang, and Q\. Wu \(2024b\)StateFlow: enhancing LLM task\-solving through state\-driven workflows\.InNeurIPS 2024 Workshop on Open\-World Agents,External Links:[Link](https://openreview.net/forum?id=CZAs3WFw5r)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Xu, D\. Zhang, K\. Mitra, and E\. Hruschka \(2026\)Verification\-aware planning for multi\-agent systems\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 7528–7546\.Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025\)A\-MEM: agentic memory for LLM agents\.InAdvances in Neural Information Processing Systems,Vol\.38\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2025/hash/19909c36f51abc4856b4560aff3d36d6-Abstract-Conference.html)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025a\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4\.2](https://arxiv.org/html/2605.29313#S4.SS2.p4.1)\.
- B\. Yang, X\. He, H\. Gao, Y\. Cao, X\. Li, and D\. Hsu \(2025b\)CodeAgents: a token\-efficient framework for codified multi\-agent reasoning in LLMs\.External Links:2507\.03254,[Document](https://dx.doi.org/10.48550/arXiv.2507.03254),[Link](https://arxiv.org/abs/2507.03254)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p3.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,Brussels, Belgium,pp\. 2369–2380\.External Links:[Document](https://dx.doi.org/10.18653/v1/D18-1259),[Link](https://aclanthology.org/D18-1259/)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px4.p1.1),[§4\.2](https://arxiv.org/html/2605.29313#S4.SS2.p3.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023a\)Tree of thoughts: deliberate problem solving with large language models\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 11809–11822\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/271db9922b8d1f4dd7aaef84ed5ac703-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023b\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=WE_vluYUL-X)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p1.1)\.
- J\. Zhang, J\. Xiang, Z\. Yu, F\. Teng, X\. Chen, J\. Chen, M\. Zhuge, X\. Cheng, S\. Hong, J\. Wang, B\. Zheng, B\. Liu, Y\. Luo, and C\. Wu \(2025\)AFlow: automating agentic workflow generation\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=z5uVAKwmjf)Cited by:[§1](https://arxiv.org/html/2605.29313#S1.p3.1),[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Zheng, L\. Yin, Z\. Xie, C\. Sun, J\. Huang, C\. H\. Yu, S\. Cao, C\. Kozyrakis, I\. Stoica, J\. E\. Gonzalez, C\. Barrett, and Y\. Sheng \(2024\)SGLang: efficient execution of structured language model programs\.InAdvances in Neural Information Processing Systems,Vol\.37\.External Links:[Document](https://dx.doi.org/10.52202/079017-2000),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/hash/724be4472168f31ba1c9ac630f15dec8-Abstract-Conference.html)Cited by:[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.29313#S2.SS0.SSS0.Px4.p1.1)\.

## Appendix APatchBoard Kernel Pseudocode

Algorithm [1](https://arxiv.org/html/2605.29313#alg1) summarizes the runtime loop\. Worker calls can vary with the underlying model, but conditional on an accepted blueprint and proposed worker patches, validation, commit, logging, scheduling, and circuit decisions are deterministic\.

The pseudocode makes explicit where the reliability boundary sits\. A worker returns only a candidate patch, and the committed state changes only after ValidPatch checks syntax, authorization, patch application, post\-state schema validity, and registered invariants\. Rejected patches are still logged, preserving an audit trail without allowing invalid outputs to enter the shared state\.
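The order of checks described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the PatchBoard kernel: it supports only `add`, `replace`, and `remove` on nested dictionaries, treats write contracts as path prefixes per role, and reduces the schema to a map from top-level field to Python type. All names and reason strings are hypothetical.

```python
import copy
import json


def _split(path):
    # Convert a JSON Pointer such as "/plan/subgoal" into keys ("~1" and "~0" escapes).
    if not path.startswith("/"):
        raise ValueError(f"bad path: {path!r}")
    return [p.replace("~1", "/").replace("~0", "~") for p in path[1:].split("/")]


def apply_op(state, op):
    keys = _split(op["path"])
    parent = state
    for key in keys[:-1]:
        parent = parent[key]  # raises KeyError/TypeError if the path does not exist
    last = keys[-1]
    if op["op"] == "add":
        parent[last] = op["value"]
    elif op["op"] == "replace":
        if last not in parent:
            raise KeyError(last)
        parent[last] = op["value"]
    elif op["op"] == "remove":
        del parent[last]
    else:
        raise ValueError(f"unsupported op: {op['op']!r}")


def _authorized(path, prefixes):
    return any(path == p or path.startswith(p + "/") for p in prefixes)


def valid_patch(state, raw_patch, role, contracts, field_types, invariants=()):
    """Return (ok, state_after, reason); the input state is never mutated."""
    try:  # 1. syntax
        ops = json.loads(raw_patch)
    except json.JSONDecodeError:
        return False, state, "invalid_json"
    if not isinstance(ops, list) or not all(
        isinstance(o, dict) and isinstance(o.get("path"), str) and "op" in o
        for o in ops
    ):
        return False, state, "malformed_patch"
    allowed = contracts.get(role, ())  # 2. authorization against the write contract
    if not all(_authorized(o["path"], allowed) for o in ops):
        return False, state, "unauthorized_write"
    candidate = copy.deepcopy(state)  # 3. apply to a copy, so failure leaves no trace
    try:
        for o in ops:
            apply_op(candidate, o)
    except (KeyError, TypeError, ValueError):
        return False, state, "bad_path"
    for key, expected in field_types.items():  # 4. post-state schema validity
        if not isinstance(candidate.get(key), expected):
            return False, state, "schema_violation"
    if not all(check(candidate) for check in invariants):  # 5. registered invariants
        return False, state, "invariant_violation"
    return True, candidate, "ok"


# Hypothetical state, contract, and schema for a planner role.
board = {"plan": {"subgoal": ""}, "status": "init"}
contracts = {"planner": ("/plan",)}
field_types = {"plan": dict, "status": str}
good = '[{"op": "replace", "path": "/plan/subgoal", "value": "clean apple"}]'
ok, new_board, reason = valid_patch(board, good, "planner", contracts, field_types)
```

Applying the patch to a deep copy and swapping it in only on success is one simple way to obtain the all-or-nothing behavior that the text calls transactional.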

Algorithm 1: PatchBoard runtime kernel
1: Input: user request $x$, blueprint meta\-schema $\Sigma_{\mathrm{meta}}$
2: Output: committed state $\mathcal{S}$, transaction log $\mathcal{L}$
3: $b \leftarrow \textsc{Architect}(x)$
4: if not $\textsc{ValidBlueprint}(b, \Sigma_{\mathrm{meta}})$ then
5:   return $\textsc{RejectBlueprint}(b)$
6: end if
7: $(\Sigma, \mathcal{R}, \mathcal{C}) \leftarrow \textsc{Unpack}(b)$
8: $\mathcal{S} \leftarrow \textsc{InitialState}(x, \Sigma)$
9: $\mathcal{L} \leftarrow [\,]$; $Q \leftarrow \textsc{InitialQueue}(\mathcal{R}, \mathcal{S})$
10: while $Q \neq \emptyset$ and not $\textsc{BudgetExceeded}$ do
11:   $(a, e) \leftarrow \textsc{Pop}(Q)$
12:   $\mathcal{V}^{a} \leftarrow \textsc{Slice}(\mathcal{S}, a, \mathcal{C}_{a})$
13:   $\Delta^{a} \leftarrow \textsc{Worker}(a, \mathcal{V}^{a}, e)$
14:   $(ok, \mathcal{S}^{\prime}, r) \leftarrow \textsc{ValidPatch}(\mathcal{S}, \Delta^{a}, a, \Sigma)$
15:   if $ok$ then
16:     $\mathcal{S} \leftarrow \mathcal{S}^{\prime}$
17:     $\mathcal{L} \leftarrow \mathcal{L} \circ \textsc{Commit}(a, e, \mathcal{V}^{a}, \Delta^{a}, \mathcal{S})$
18:     $E \leftarrow \textsc{Events}(\Delta^{a})$
19:     $Q \leftarrow Q \circ \textsc{Schedule}(E, \mathcal{R}, \mathcal{S})$
20:   else
21:     $\mathcal{L} \leftarrow \mathcal{L} \circ \textsc{RejectPatch}(a, e, \mathcal{V}^{a}, \Delta^{a}, r)$
22:   end if
23:   $u \leftarrow \textsc{CircuitPolicy}(\mathcal{L}, \mathcal{S}, Q)$
24:   $(Q, \mathcal{S}) \leftarrow \textsc{ApplyPolicy}(u, Q, \mathcal{S})$
25: end while
26: return $(\mathcal{S}, \mathcal{L})$

Table [2](https://arxiv.org/html/2605.29313#A1.T2) defines the deterministic operations used in Algorithm [1](https://arxiv.org/html/2605.29313#alg1)\. These operations mediate between model\-generated patch proposals and committed shared state\.

Table 2: Deterministic kernel operations used in Algorithm [1](https://arxiv.org/html/2605.29313#alg1)\.

![Refer to caption](https://arxiv.org/html/2605.29313v1/x7.png)

Figure 7: Diagnostic HotpotQA results\. All systems have similar answer accuracy, while unsupported\-claim rates differ\.

The scheduler operates only on events emitted by committed patches\. Downstream worker calls are therefore driven by accepted state transitions and workflow rules\. The circuit policy is evaluated after each proposal so that repeated invalid edits, no\-op loops, or short state cycles can be handled deterministically\.
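One way a deterministic circuit policy could detect no-op loops and short state cycles is to fingerprint the committed state after each proposal and look for repeats in a small window. The sketch below is hypothetical: the thresholds, the function names, and the "halt"/"continue" return values are assumptions, and the paper's CircuitPolicy may use different signals.

```python
import hashlib
import json


def state_fingerprint(state):
    """Stable hash of a JSON-serializable state (key order does not matter)."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def circuit_policy(fingerprints, rejected_streak, max_rejects=3, window=4):
    """Return "halt" on repeated rejections or a recently revisited state.

    fingerprints: state fingerprints after each proposal, oldest first.
    rejected_streak: number of consecutive rejected patches so far.
    """
    if rejected_streak >= max_rejects:
        return "halt"
    if fingerprints:
        current = fingerprints[-1]
        recent = fingerprints[-1 - window:-1]  # up to `window` earlier states
        if current in recent:  # covers no-ops (period 1) and short oscillations
            return "halt"
    return "continue"


# Hypothetical oscillation: the agent keeps toggling between two locations.
fp_a = state_fingerprint({"agent_location": "fridge"})
fp_b = state_fingerprint({"agent_location": "sink"})
fp_c = state_fingerprint({"agent_location": "diningtable"})
oscillating = circuit_policy([fp_a, fp_b, fp_a], rejected_streak=0)
progressing = circuit_policy([fp_a, fp_b, fp_c], rejected_streak=0)
```

Because the decision depends only on the log-derived fingerprints and counters, the same trajectory always produces the same halt decision. A real policy would likely tolerate some legitimate revisits, for example by requiring a state to recur more than once.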

## Appendix BImplementation and Reproducibility Details

### B\.1Baseline System Settings

The LangGraph baseline uses the pure LangGraph runner in a supervisor–subagent configuration\. Each environment turn is represented as a LangGraph state update\. A supervisor node receives the current ALFWorld turn context and coordinates three tool\-exposed subagents: a planner subagent, an action subagent, and a critic subagent\. The planner produces or revises a local plan, the action subagent selects an admissible environment action, and the critic checks the candidate action before the supervisor finalizes the turn\. This gives LangGraph a structured subagent workflow while leaving communication as tool\-mediated message passing rather than schema\-validated patch transactions\.

The Flock baseline is run through a turn\-level bridge that maps each ALFWorld state into a Flock orchestration call\. This is a blackboard\-family baseline in the sense that Flock agents communicate through typed artifacts in an orchestration context\. In the reported comparison, the bridge uses a planner\-and\-executor workflow: the planner publishes a plan artifact from the local task state, and the executor consumes that artifact with the current turn context to return one admissible environment action\. Flock receives the same observations, admissible\-action interface, model, decoding configuration, step budget, and timeout as the other systems, but it does not use PatchBoard’s deterministic patch validator, role\-specific write contracts, or transaction log\.

### B\.2Run Configuration

Table [3](https://arxiv.org/html/2605.29313#A2.T3) summarizes the run configuration used in the reported results, including the model backend, runtime hardware, benchmark scale, execution limits, and estimation settings\. The main text describes the evaluated systems, controls, and ablation variants; this appendix records the concrete settings used for reproducibility\.

Table 3:Run configuration used for the reported results\.

## Appendix CDiagnostic HotpotQA Results

The HotpotQA diagnostic is included to mark a limitation, not to claim transfer improvement\. The three systems have similar answer accuracy, and Flock has the lowest unsupported\-claim rate\. This result supports the paper’s boundary claim: schema\-valid state transitions make coordination more auditable, while factual support still depends on evidence selection and verifier quality\.

Figure [7](https://arxiv.org/html/2605.29313#A1.F7) groups the semantic diagnostic metrics that are most relevant to this boundary\. Answer accuracy shows whether the final answer is correct, while unsupported\-claim rate measures whether the system introduced claims that were not backed by the prepared evidence\. Evidence coverage and verified precision indicate how well the evidence\-tracking fields support factual checking\. The figure is therefore read as a semantic\-support diagnostic rather than as an end\-task win for PatchBoard\.

## Appendix DRunning Example and Token Cost Breakdown

![Refer to caption](https://arxiv.org/html/2605.29313v1/x8.png)

Figure 8: Component\-level token cost breakdown for a representative PatchBoard trajectory\.

Figure [8](https://arxiv.org/html/2605.29313#A4.F8) decomposes token usage for a representative PatchBoard trajectory on an ALFWorld clean\-and\-place task\. The total accounted token usage is 40\.3k\. Actor worker calls are the largest component, accounting for 26\.1k tokens, or 64\.6% of the total\. This is expected because action selection is invoked repeatedly across environment turns and must condition on the current observation, admissible actions, recent state, and task objective\.

The remaining costs span setup, planning, verification, and repair\. Architect blueprint generation uses 2\.8k tokens for one\-time schema, contracts, workflow rules, and initial state construction\. Planner calls consume 6\.2k tokens to generate and revise local plans as observations change\. Verifier calls use 2\.4k tokens to check action admissibility and progress, while repair/retry calls use 2\.3k tokens to handle inadmissible placement attempts\.

Schema and context overhead accounts for only 0\.6k tokens, or 1\.5% of the total\. This category includes schema fragments, bounded state views, patch\-format instructions, and state handles passed to workers\. The breakdown supports an implementation\-level interpretation: in this representative workflow, the main cost center is repeated action generation, while the schema and context machinery adds comparatively little token overhead\. Future efficiency improvements should primarily target the number, size, or timing of actor calls, while preserving the validation and auditability benefits of schema\-grounded state mutation\.
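The shares quoted in this appendix can be recomputed from the component values given in the text. The snippet below is only an arithmetic check with illustrative variable names. The rounded components sum to 40.4k rather than the reported 40.3k total, presumably because each component is rounded separately, and shares taken relative to that sum match the quoted 64.6% and 1.5%.

```python
# Component token costs in thousands, as reported in this appendix.
components_k = {
    "actor worker calls": 26.1,
    "planner calls": 6.2,
    "architect blueprint": 2.8,
    "verifier calls": 2.4,
    "repair/retry calls": 2.3,
    "schema and context overhead": 0.6,
}

# Sum of the rounded components (differs slightly from the reported 40.3k total).
component_sum_k = sum(components_k.values())

# Percentage share of each component relative to the component sum.
shares_pct = {
    name: 100 * value / component_sum_k for name, value in components_k.items()
}

for name, share in sorted(shares_pct.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {components_k[name]:5.1f}k  {share:5.1f}%")
```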

![Refer to caption](https://arxiv.org/html/2605.29313v1/x9.png)

Figure 9: Running example of PatchBoard on the ALFWorld clean\-and\-place task analyzed in Figure [8](https://arxiv.org/html/2605.29313#A4.F8)\.

Figure [9](https://arxiv.org/html/2605.29313#A4.F9) gives the concrete execution trace behind the representative trajectory analyzed above\. The task goal is to put a clean apple on the dining table\. The task begins with the initial observation, admissible actions, and target terminal condition\. The Architect then constructs a task blueprint containing the state schema, worker contracts, workflow rules, and context budgets\. This blueprint initializes the shared state and defines which workers may read or modify each state region\.

The planner first proposes a patch that fills the task subgoal, target object, target receptacle, and workflow status\. The kernel accepts the patch only after checking that the edited paths are authorized and that the resulting state satisfies the task schema\. During the main execution loop, the actor proposes environment actions, the verifier checks admissibility and task progress, and accepted actions are executed in the ALFWorld environment\. The trace illustrates how environment feedback is converted into committed state updates, allowing later worker calls to condition on the validated state trajectory rather than on an unstructured dialogue history\.

The repair step highlights the validation boundary enforced by PatchBoard\. After cleaning the apple, the actor initially proposes placing it on the dining table before the agent has navigated there\. The verifier marks this candidate action as inadmissible and requests repair\. The rejected action is logged, while the repair hint is accepted as a separate state update\. The subsequent retry navigates to the dining table, after which the final placement action completes the task\. This example shows how invalid intermediate proposals can remain auditable without contaminating the committed shared state\.
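The repair episode can be illustrated with a toy transaction log in which rejected proposals are recorded but never applied. Everything here is hypothetical: the entry fields, the flat `writes` maps standing in for JSON Patch operations, and the state keys are assumptions chosen to mirror the clean-apple example, not the paper's log format.

```python
import copy


def replay(initial_state, log):
    """Rebuild committed state from the log; rejected entries are audit-only."""
    state = copy.deepcopy(initial_state)
    for entry in log:
        if entry["status"] == "committed":
            state.update(entry["writes"])
    return state


initial = {"agent_location": "sinkbasin", "apple_location": "sinkbasin",
           "holding": None, "apple_clean": False, "repair_hint": None}

log = [
    {"worker": "actor", "status": "committed",
     "writes": {"holding": "apple", "apple_clean": True}},
    # Placement proposed before navigating: logged, but not applied.
    {"worker": "actor", "status": "rejected",
     "reason": "inadmissible: agent is not at the dining table",
     "writes": {"apple_location": "diningtable", "holding": None}},
    {"worker": "verifier", "status": "committed",
     "writes": {"repair_hint": "go to the dining table first"}},
    {"worker": "actor", "status": "committed",
     "writes": {"agent_location": "diningtable"}},
    {"worker": "actor", "status": "committed",
     "writes": {"apple_location": "diningtable", "holding": None}},
]

after_rejection = replay(initial, log[:3])  # state right after the repair hint
final_state = replay(initial, log)
rejected = [e for e in log if e["status"] == "rejected"]
```

Replaying a prefix of the log shows that the rejected placement left the committed state untouched, while the entry itself remains available for later inspection.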
