Two workers wrote the same key at the same moment. Both writes "succeeded." One is gone.

Reddit r/AI_Agents 06/10/26, 04:20 PM Tools

Summary

Discusses two failure modes in multi-agent systems with shared state—concurrent lost updates and zombie writers—and presents a solution with fenced writers and model-checked guarantees.

If you run an orchestrator with parallel workers, or long-running agents that share state (a memory store, a decisions doc, a plan file), here are two failure modes I keep finding in multi-agent setups. They look identical from the outside: the run finishes clean, and somewhere downstream the system acts like an update never happened. Both get blamed on the model first. Neither is the model. **Failure 1: the concurrent lost update.** A planner dispatches six workers, each writes its result to a shared key. Two finish in the same instant. Both writes return success. One of them isn't there afterward. Classic last-write-wins, except agents make it worse than ordinary services: nobody re-reads the doc with suspicion. The next prompt just inherits whatever survived, reasons over it fluently, and the gap surfaces three steps later as "the agent forgot X." **Failure 2: the zombie writer.** A long-running agent stalls mid-task while holding the write grant. Recovery (correctly) reclaims the grant so the rest of the fleet isn't blocked. An hour later the stalled process wakes up and completes its write. Here's the trap: if nothing else touched that artifact in between, the version number still matches. Every version check passes. The stale commit lands on top of state the system moved past long ago. What I ended up wanting from the write path, and eventually built: * Concurrent same-key writers resolve to exactly one winner. The loser gets a typed, retryable conflict (read fresh, recompute, commit again) instead of a silent drop. * Reclaimed writers get fenced. Every reclamation bumps an ownership epoch, every claim records the epoch it was made under, and commit checks both atomically with the version persist. The zombie write is rejected even though the version number never moved. The guarantees are model-checked in TLA+. The checker runs in CI, and each spec carries a documented mutant (delete the guard) that has to turn the checker red. If removing a guard doesn't fail the model, the invariant wasn't load-bearing. Scope, so nobody installs this expecting more: one coordinator, one host, and only writers that go through it. Cross-host isn't built. I'm gating it on someone actually needing it. It runs over plain files shared across processes, with adapters for LangGraph, CrewAI, AutoGen, and the OpenAI Agents SDK. There's a deterministic, no-keys repro of the lost-update shape in the repo. How are people handling concurrent writes to shared agent state in production today? Retries and hoping? Single-writer by architecture? Or just not bitten yet? Also curious what failure modes you've hit that this wouldn't catch.

Original Article

Two workers wrote the same key at the same moment. Both writes "succeeded." One is gone.

Similar Articles

The silent failure that wrecked two different multi-agent teams in exactly the same way

How we solved concurrent write collisions and data loss in local-first multi-agent systems (LAC-Protocol)

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Building multi-agent systems made me realize memory is harder than orchestration

Should worker agents write memory directly? A curator-agent pattern I am testing

Submit Feedback

Similar Articles

The silent failure that wrecked two different multi-agent teams in exactly the same way

How we solved concurrent write collisions and data loss in local-first multi-agent systems (LAC-Protocol)

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Building multi-agent systems made me realize memory is harder than orchestration

Should worker agents write memory directly? A curator-agent pattern I am testing