Tag
This thesis from Aalto University presents a taxonomy of synchronization architectures, analyzing trade-offs and decision factors to guide the design of generalized sync engines.
An interview with Pierre Zemb, staff engineer at Clever Cloud, discussing his work building data layers on FoundationDB and his previous experience at OVHcloud.
This paper introduces MemClaw, a governed shared memory architecture for multi-agent LLM systems, formalizing failure modes like unauthorized leakage and stale propagation, and evaluating the system via the ArgusFleet harness.
A recommendation of Nancy Lynch's book 'Distributed Algorithms' as a valuable resource for distributed systems professionals.
A tweet recommends learning graph and networking theory as a high-ROI investment, listing key books, courses, and tools.
RollArt presents a disaggregated architecture for large-scale reinforcement learning, demonstrating significant improvements in efficiency and scalability.
This paper proposes a layered architecture for distributed general-purpose agent networks, enabling heterogeneous AI agents to discover, trust, and cooperate on open-ended tasks across personal devices and edge nodes.
American Express describes the cell-based architecture of its core payments ecosystem, which isolates failures, reduces latency, and scales capacity. The approach groups microservices and databases into independent cells to contain the blast radius of failures.
This paper describes Scuba, a distributed in-memory database system developed at Facebook for real-time analytics and data exploration.
A deep dive, now available as a free article, on Antithesis, a multiverse debugger for large distributed systems that offers deterministic replay and fault injection.
The article argues that an AI agent is defined by its durable event log rather than by its runtime or model, and that this view enables fault-tolerant resumption and simpler reasoning about agent state.
Discusses two failure modes in multi-agent systems with shared state—concurrent lost updates and zombie writers—and presents a solution with fenced writers and model-checked guarantees.
The author rebuilt their private AI dev team as an open-source substrate with addressable agents, reliable messaging, expertise discovery, memory, and isolated runtimes, allowing team behavior to emerge from natural-language instructions. They share insights on coordination challenges such as deadlocks and self-healing, and ask how agent teams can collaborate using natural-language instructions.
Explores extending Conflict-Free Replicated Data Types (CRDTs) to handle concurrent creation, beyond their traditional ability to merge concurrent edits.
A Twitter thread listing 35 essential system design concepts with links to detailed explanations, aimed at helping developers learn and review key topics.
The author discusses the unglamorous but critical aspects of engineering reliable AI agents in production, including monitoring mid-flight runs, resuming failed runs, and providing UI status, and asks the community about common pain points and off-the-shelf solutions.
A curated reading list of foundational and modern resources for understanding agentic architecture, blending classic distributed systems concepts with current AI agent patterns.
A developer shares a curated list of software engineering book recommendations, including titles on AI engineering, distributed systems, and refactoring, and promotes their own book.
A comprehensive system design master tree covering fundamentals through real-world applications, including architecture patterns, databases, caching, messaging systems, API design, and deployment strategies. Intended as a structured learning guide for software engineers.
This paper proposes directly mapping mature architectural patterns from distributed systems (such as publish-subscribe and message queues) onto multi-agent systems to lower the barrier to development. The approach was validated in a course: students with no prior distributed systems experience were able to get started with gRPC and RabbitMQ and achieved an average score above 80%.