Tag
This paper introduces MemClaw, a governed shared memory architecture for multi-agent LLM systems, formalizing failure modes like unauthorized leakage and stale propagation, and evaluating the system via the ArgusFleet harness.
A tweet recommends learning graph and networking theory as a high-ROI investment, listing key books, courses, and tools.
RollArt presents a disaggregated architecture for large-scale reinforcement learning, demonstrating significant improvements in efficiency and scalability.
This paper proposes a layered architecture for distributed general-purpose agent networks, enabling heterogeneous AI agents to discover, trust, and cooperate on open-ended tasks across personal devices and edge nodes.
American Express describes its cell-based architecture for its core payments ecosystem that isolates failures, reduces latency, and scales capacity. The approach groups microservices and databases into independent cells to contain blast radius.
This paper describes Scuba, a distributed in-memory database system developed at Facebook for real-time analytics and data exploration.
A deep dive into Antithesis, a multiverse debugger for large distributed systems that offers deterministic replay and fault injection. The article is now free to read.
The article argues that an AI agent is defined by its durable event log, not the runtime or model, enabling fault-tolerant resumption and simplified reasoning about agent state.
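As a rough illustration of the event-log idea in the entry above, the sketch below rebuilds an agent's state purely by replaying an append-only log. It is not taken from the article: the event types (`message`, `step_completed`), the `AgentState` fields, and the JSON-lines encoding are all assumptions chosen for the example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class AgentState:
    # Hypothetical state shape; a real agent would track much more.
    messages: list = field(default_factory=list)
    completed_steps: int = 0


def apply_event(state: AgentState, event: dict) -> AgentState:
    """Fold a single logged event into the state (event types are assumed)."""
    if event["type"] == "message":
        state.messages.append(event["text"])
    elif event["type"] == "step_completed":
        state.completed_steps += 1
    return state


def replay(log_lines) -> AgentState:
    """Rebuild state from scratch by replaying every event in order."""
    state = AgentState()
    for line in log_lines:
        state = apply_event(state, json.loads(line))
    return state


# The log stands in for durable storage (for example, a JSON-lines file).
log = [
    json.dumps({"type": "message", "text": "plan the task"}),
    json.dumps({"type": "step_completed"}),
    json.dumps({"type": "message", "text": "run the tests"}),
]

# After a crash, a fresh process needs only the log to resume.
resumed = replay(log)
```

Because the state is a pure function of the log, any runtime or model that can read the log can pick up where the previous one stopped.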
Discusses two failure modes in multi-agent systems with shared state (concurrent lost updates and zombie writers) and presents a solution based on fenced writers with model-checked guarantees.
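To make the fenced-writer idea concrete, here is a minimal sketch of the general fencing-token technique. It is not the linked author's implementation: `LeaseService`, `FencedStore`, and `StaleWriterError` are hypothetical names, and the whole thing runs in one process, whereas a real system would need the token check to be atomic in the storage layer.

```python
class StaleWriterError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class LeaseService:
    """Hypothetical lease service that issues strictly increasing tokens."""

    def __init__(self):
        self._next_token = 1

    def acquire(self, agent_id: str) -> int:
        token = self._next_token
        self._next_token += 1
        return token


class FencedStore:
    """Shared state that rejects writes from holders of superseded leases."""

    def __init__(self):
        self._data = {}
        self._highest_token = 0

    def write(self, key: str, value: str, token: int) -> None:
        # A zombie writer still holds an old token, so its write is refused.
        if token < self._highest_token:
            raise StaleWriterError(f"token {token} < {self._highest_token}")
        self._highest_token = token
        self._data[key] = value

    def read(self, key: str):
        return self._data.get(key)


leases = LeaseService()
store = FencedStore()

old_token = leases.acquire("agent-a")  # agent-a stalls after getting its lease
new_token = leases.acquire("agent-b")  # agent-b takes over with a newer token

store.write("plan", "v2 from agent-b", new_token)

try:
    store.write("plan", "v1 from agent-a", old_token)  # zombie write
    zombie_rejected = False
except StaleWriterError:
    zombie_rejected = True
```

The store, not the writer, enforces the check, so a paused agent that wakes up with an expired lease cannot overwrite newer state.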
The author rebuilt their private AI dev team as an open-source substrate with addressable agents, reliable messaging, expertise discovery, memory, and isolated runtimes, so that team behavior emerges from natural-language instructions. They share insights on coordination challenges such as deadlocks and self-healing, and ask how agent teams can best collaborate through natural-language instructions.
Explores extending Conflict-Free Replicated Data Types (CRDTs) to handle concurrent creation, beyond their traditional ability to merge concurrent edits.
A Twitter thread listing 35 essential system design concepts with links to detailed explanations, aimed at helping developers learn and review key topics.
The author discusses the unglamorous but critical aspects of engineering reliable AI agents in production, including monitoring mid-flight runs, resuming failed runs, and providing UI status, and asks the community about common pain points and off-the-shelf solutions.
A curated reading list of foundational and modern resources for understanding agentic architecture, blending classic distributed systems concepts with current AI agent patterns.
A developer shares a curated list of software engineering book recommendations, including titles on AI engineering, distributed systems, and refactoring, and promotes their own book.
A comprehensive system design master tree covering fundamentals through real-world applications, including architecture patterns, databases, caching, messaging systems, API design, and deployment strategies. Intended as a structured learning guide for software engineers.
This paper proposes directly mapping mature architectural patterns from distributed systems (such as publish-subscribe and message queues) to multi-agent systems to lower the development barrier. It was validated in a course: even students with no distributed systems experience could get started with gRPC and RabbitMQ, achieving an average score above 80%.
This paper investigates whether restructuring communication among robots yields larger gains than increasing onboard model size in a multi-robot transport-and-mapping task. Results show that switching to modular hierarchical interactions improves normalized performance by 47 points, while doubling neural network hidden size yields at most 9 points.
A Databricks tech lead argues that multi-agent AI systems fail not due to model intelligence but due to lack of coordination, framing 50+ agents as a distributed systems problem where parallelism is easy but shared coherence is difficult.
Agyn is an open-source, Kubernetes-native agent runtime that brings AI agents like Claude Code and Codex into production with full credential isolation and pre-built harnesses. It addresses security concerns by running MCP servers in sidecars and using mTLS for internal services, which prevents credential leaks caused by prompt injection.