Discusses two failure modes in multi-agent systems with shared state—concurrent lost updates and zombie writers—and presents a solution with fenced writers and model-checked guarantees.
If you run an orchestrator with parallel workers, or long-running agents that share state (a memory store, a decisions doc, a plan file), here are two failure modes I keep finding in multi-agent setups. They look identical from the outside: the run finishes clean, and somewhere downstream the system acts like an update never happened. Both get blamed on the model first. Neither is the model.

**Failure 1: the concurrent lost update.** A planner dispatches six workers, each writes its result to a shared key. Two finish in the same instant. Both writes return success. One of them isn't there afterward. Classic last-write-wins, except agents make it worse than ordinary services: nobody re-reads the doc with suspicion. The next prompt just inherits whatever survived, reasons over it fluently, and the gap surfaces three steps later as "the agent forgot X."

**Failure 2: the zombie writer.** A long-running agent stalls mid-task while holding the write grant. Recovery (correctly) reclaims the grant so the rest of the fleet isn't blocked. An hour later the stalled process wakes up and completes its write. Here's the trap: if nothing else touched that artifact in between, the version number still matches. Every version check passes. The stale commit lands on top of state the system moved past long ago.

What I ended up wanting from the write path, and eventually built:

* Concurrent same-key writers resolve to exactly one winner. The loser gets a typed, retryable conflict (read fresh, recompute, commit again) instead of a silent drop.
* Reclaimed writers get fenced. Every reclamation bumps an ownership epoch, every claim records the epoch it was made under, and commit checks both atomically with the version persist. The zombie write is rejected even though the version number never moved.

The guarantees are model-checked in TLA+. The checker runs in CI, and each spec carries a documented mutant (delete the guard) that has to turn the checker red. If removing a guard doesn't fail the model, the invariant wasn't load-bearing.

Scope, so nobody installs this expecting more: one coordinator, one host, and only writers that go through it. Cross-host isn't built. I'm gating it on someone actually needing it. It runs over plain files shared across processes, with adapters for LangGraph, CrewAI, AutoGen, and the OpenAI Agents SDK. There's a deterministic, no-keys repro of the lost-update shape in the repo.

How are people handling concurrent writes to shared agent state in production today? Retries and hoping? Single-writer by architecture? Or just not bitten yet? Also curious what failure modes you've hit that this wouldn't catch.
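
For anyone who wants the two write-path rules from the post in concrete form, here is a minimal in-memory sketch. It is not the project's code: `Coordinator`, `Claim`, `WriteConflict`, and `FencedWriter` are hypothetical names, claims are assumed to be optimistic snapshots rather than exclusive grants, and a `threading.Lock` stands in for whatever the real coordinator uses to make the check-and-persist step atomic over shared files.

```python
import threading
from dataclasses import dataclass


class WriteConflict(Exception):
    """Retryable: the key changed after this claim was taken."""


class FencedWriter(Exception):
    """The claim's ownership epoch is stale; the grant was reclaimed."""


@dataclass(frozen=True)
class Claim:
    key: str
    version: int  # version observed when the claim was taken
    epoch: int    # ownership epoch the claim was made under


class Coordinator:
    """Hypothetical single-process stand-in for the single-host coordinator."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}
        self._versions = {}
        self._epochs = {}

    def claim(self, key):
        # Snapshot the current version and epoch for this key.
        with self._lock:
            return Claim(key, self._versions.get(key, 0), self._epochs.get(key, 0))

    def read(self, key):
        with self._lock:
            return self._values.get(key)

    def reclaim(self, key):
        # Recovery path: bump the epoch so every outstanding claim is fenced.
        with self._lock:
            self._epochs[key] = self._epochs.get(key, 0) + 1

    def commit(self, claim, value):
        # Epoch check, version check, and persist all happen under one lock.
        with self._lock:
            if claim.epoch != self._epochs.get(claim.key, 0):
                raise FencedWriter(claim.key)
            if claim.version != self._versions.get(claim.key, 0):
                raise WriteConflict(claim.key)
            self._values[claim.key] = value
            self._versions[claim.key] = claim.version + 1
            return self._versions[claim.key]
```

With this shape, two workers that claim the same key at the same version get one successful `commit` and one `WriteConflict`, and the loser can claim again, re-read, and retry. A writer whose grant was reclaimed gets `FencedWriter` even when the version number has not moved, which is exactly the case a version check alone lets through.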
Two different multi-agent system teams experienced the same silent failure caused by agents writing to the same key in different formats, leading to phantom corruption. The article discusses solutions including schema validation, read-after-write validation, and introducing an 'unconfirmed' state for unverifiable actions.
Describes the LAC-Protocol for handling concurrent write collisions in local-first multi-agent systems, using lock-state separation and avoidance caching to prevent data loss and token waste.
This paper studies failure modes in shared-state collaborative reasoning for resource-constrained visual agents, introducing CoSee, an auditing framework that formalizes read-write-verify loops. It finds that naive shared workspaces can amplify hallucinations and identifies noise reinforcement and policy collapse as dominant failure modes.
Building multi-agent systems reveals that managing shared memory and context consistency is more challenging than orchestration. The author's experiment using Statewave treats memory as an evolving lifecycle rather than a retrieval problem.
The author describes a pattern where worker agents emit structured memory events instead of writing directly to shared memory, using a Memory Curator to validate, deduplicate, and route them to appropriate scopes, aiming to prevent memory pollution in multi-agent systems. They compare this approach to existing frameworks and solicit community feedback.