@dotey: Q: Our company has a dozen microservices, and we want developers to use AI Agents for system design and coding. The problem is that a user story often requires collaboration among multiple microservices, and the Agent must understand each service's responsibility boundaries and business concepts to make reasonable designs. We plan to put all microservices into a single …
Summary
The article discusses in a Q&A format how to enable AI Agents to perform system design and coding in a multi-microservice scenario, focusing on practical experiences with context quality (via monorepo, layered documentation) and validation loops (via contract testing, mock servers).
Cached at: 06/30/26, 03:43 PM
Q: Our company has a dozen or so microservices, and we now want developers to use AI agents for system design and coding. The problem is that a user story often requires collaboration among multiple microservices, and the agent must understand the responsibility boundaries and business concepts of each service to make a reasonable design. We plan to put all microservices under one workspace, each with its own documentation, and let the AI handle it itself. Is this approach reasonable? Are there better practices?
A: The key to using an agent well lies in two points: the quality of context, and a closed loop for verification. Let’s talk about context quality first.
Putting everything under one workspace is currently a recommended practice in the community. Monorepos are naturally well-suited for working with AI because the agent can see schema definitions, API protocols, and implementation code for all services in one place. If historical reasons make it impractical to merge everything into a monorepo, a compromise is a virtual monorepo: clone the individual repositories into the same local directory.
In addition to co-location, documentation is also a great way for the agent to obtain context. It’s best to give the agent a map with on-demand loading:
- Place a master AGENTS.md (or CLAUDE.md) in the root directory as an index, listing all services, their responsibilities, and which directory to read for a given service.
- Each microservice’s own directory should have its own document describing its responsibility boundaries and business concepts – this is essentially DDD’s bounded context.
- Let the agent first look at the root index, locate the relevant services, and then load their details.
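The index-then-load flow in the list above can be sketched as follows. This is a minimal illustration, not a standard: the `- name | directory | responsibility` row format and the helper names `parse_index` and `load_context` are assumptions made for this example, as is the choice to name each per-service document `AGENTS.md`.

```python
from pathlib import Path


def parse_index(index_text: str) -> dict[str, str]:
    """Map service name -> directory from rows like '- name | dir | responsibility'.

    The row format is hypothetical; any index layout the team agrees on works.
    """
    services: dict[str, str] = {}
    for line in index_text.splitlines():
        if not line.startswith("- "):
            continue  # skip headings and free text in the index
        fields = [field.strip() for field in line[2:].split("|")]
        if len(fields) >= 2:
            services[fields[0]] = fields[1]
    return services


def load_context(root: Path, wanted: list[str]) -> str:
    """Read the root index first, then only the docs of the services that matter."""
    index_text = (root / "AGENTS.md").read_text(encoding="utf-8")
    services = parse_index(index_text)
    parts = [index_text]
    for name in wanted:
        service_doc = root / services[name] / "AGENTS.md"
        parts.append(service_doc.read_text(encoding="utf-8"))
    return "\n\n".join(parts)
```

Calling `load_context(root, ["orders"])` returns the index plus the orders document only, so the documents of unrelated services never enter the context.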
However, documentation must be kept up to date, especially when microservice protocols change; otherwise, it can mislead the agent. Anything that can be generated automatically from code or specifications should not be written by hand, because hand-written documentation eventually drifts out of sync with the code. Machine-readable interface specifications such as OpenAPI serve as documentation and can also be used to generate mocks and tests.
Beyond documentation, there is another source of context that many overlook: protocol test code. High-quality contract tests are themselves the most accurate living documentation, precisely describing the actual interaction protocols between services, and are less likely to become outdated than human-written documentation because if they are wrong, the tests will fail. If you already have OpenAPI specs or Pact contract files, these are very valuable for the agent to understand service boundaries.
Now regarding verification. Verification is the trickiest part in a microservices scenario because a user story might involve collaboration across several services. You can’t ask the agent to run the entire system for end-to-end testing every time it changes a line of code. A practical approach is for each microservice to provide a mock server, or a simulated service generated automatically from its OpenAPI spec. After writing code, the agent can run contract tests locally to verify whether its changes break the protocol agreements with other services, without relying on live APIs or a full integration environment. This way, the agent forms a closed loop of ‘write code → run tests → self-correct’, without requiring frequent human intervention.
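As a rough sketch of the mock idea described above, the snippet below serves the example payload declared in an OpenAPI 3 document instead of calling a live service. The spec content, the `/orders/{orderId}` path, and the names `make_mock_client` and `is_order_paid` are all hypothetical; a real setup would more likely use an existing mock-server tool, and this in-process version ignores path parameters entirely.

```python
# Hypothetical OpenAPI 3 fragment for an "orders" service (illustration only).
SPEC = {
    "openapi": "3.0.3",
    "paths": {
        "/orders/{orderId}": {
            "get": {
                "responses": {
                    "200": {
                        "content": {
                            "application/json": {
                                "example": {"id": "o-1", "status": "PAID", "total_cents": 1250}
                            }
                        }
                    }
                }
            }
        }
    },
}


def make_mock_client(spec: dict):
    """Return a callable that answers requests with the spec's declared example."""

    def request(method: str, path_template: str, status: str = "200") -> dict:
        operation = spec["paths"][path_template][method.lower()]
        media = operation["responses"][status]["content"]["application/json"]
        return media["example"]

    return request


def is_order_paid(request) -> bool:
    """Consumer-side logic under test; it only sees the injected request function."""
    order = request("GET", "/orders/{orderId}")
    return order["status"] == "PAID"


def test_is_order_paid_against_mock() -> None:
    # Runs locally with no network and no integration environment.
    assert is_order_paid(make_mock_client(SPEC)) is True
```

Because the consumer code receives its client as a parameter, the agent can run this test after every change and correct itself when it fails.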
To go further, consider learning about consumer-driven contract testing, commonly done with Pact. The idea is that the caller records the actual interface shape it uses and generates a contract file, and the callee then verifies whether it can satisfy that contract.
In short: the workspace provides a unified global view; layered documentation + protocol tests provide precise context; mock server + contract tests provide a verification closed loop. With these three layers in place, the agent can handle cross-microservice system design more reliably.
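The consumer-driven contract idea can be illustrated with a small, library-free sketch. This is deliberately not Pact's file format or API: the JSON layout, the `fields` type map, and the functions `record_contract` and `verify_contract` are assumptions invented for this example.

```python
import json

# Type names used in this hypothetical contract format.
TYPES = {"string": str, "integer": int}


def record_contract(consumer: str, provider: str, interactions: list[dict]) -> str:
    """Consumer side: write down the requests it makes and the shape it relies on."""
    return json.dumps(
        {"consumer": consumer, "provider": provider, "interactions": interactions},
        indent=2,
    )


def verify_contract(contract_json: str, handler) -> list[str]:
    """Provider side: replay each recorded request and collect any mismatches."""
    failures = []
    for interaction in json.loads(contract_json)["interactions"]:
        req, expected = interaction["request"], interaction["response"]
        label = f'{req["method"]} {req["path"]}'
        status, body = handler(req["method"], req["path"])
        if status != expected["status"]:
            failures.append(f"{label}: got status {status}, expected {expected['status']}")
            continue
        for field, type_name in expected["fields"].items():
            if not isinstance(body.get(field), TYPES[type_name]):
                failures.append(f"{label}: field '{field}' should be of type {type_name}")
    return failures


contract = record_contract("checkout", "orders", [{
    "request": {"method": "GET", "path": "/orders/o-1"},
    "response": {"status": 200, "fields": {"id": "string", "total_cents": "integer"}},
}])


def orders_handler(method: str, path: str):
    """A provider implementation that satisfies the contract (extra fields are fine)."""
    return 200, {"id": "o-1", "total_cents": 1250, "status": "PAID"}


def renamed_field_handler(method: str, path: str):
    """A provider change that silently renames a field the consumer depends on."""
    return 200, {"id": "o-1", "total": "12.50"}
```

`verify_contract(contract, orders_handler)` returns an empty list, while the renamed-field handler produces one failure mentioning `total_cents`, which is the signal an agent needs to notice that it broke a caller.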
Some references
- Anthropic’s Effective context engineering for AI agents, discusses how to treat context as a scarce resource and load on demand: https://anthropic.com/engineering/effective-context-engineering-for-ai-agents…
- Anthropic’s Effective harnesses for long-running agents, discusses how to scaffold agents for long tasks (e.g., using progress files with git records for cross-context window handoff): https://anthropic.com/engineering/effective-harnesses-for-long-running-agents…
- How to organize AGENTS.md in a monorepo for agents, see this post on dev.to: Steering AI Agents in Monorepos with AGENTS.md: https://dev.to/datadog-frontend-dev/steering-ai-agents-in-monorepos-with-agentsmd-13g0…
- For an introduction to contract testing, search for Pact and consumer-driven contract testing guides.
Effective context engineering for AI agents
Source: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
After a few years of prompt engineering being the focus of attention in applied AI, a new term has come to prominence: context engineering. Building with language models is becoming less about finding the right words and phrases for your prompts, and more about answering the broader question of “what configuration of context is most likely to generate our model’s desired behavior?”
Context refers to the set of tokens included when sampling from a large language model (LLM). The engineering problem at hand is optimizing the utility of those tokens against the inherent constraints of LLMs in order to consistently achieve a desired outcome. Effectively wrangling LLMs often requires thinking in context, in other words: considering the holistic state available to the LLM at any given time and what potential behaviors that state might yield.
In this post, we’ll explore the emerging art of context engineering and offer a refined mental model for building steerable, effective agents. At Anthropic, we view context engineering as the natural progression of prompt engineering. Prompt engineering refers to methods for writing and organizing LLM instructions for optimal outcomes (see our docs for an overview and useful prompt engineering strategies). Context engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts.
In the early days of engineering with LLMs, prompting was the biggest component of AI engineering work, as the majority of use cases outside of everyday chat interactions required prompts optimized for one-shot classification or text generation tasks. As the term implies, the primary focus of prompt engineering is how to write effective prompts, particularly system prompts. However, as we move towards engineering more capable agents that operate over multiple turns of inference and longer time horizons, we need strategies for managing the entire context state (system instructions, tools, Model Context Protocol (MCP), external data, message history, etc.). An agent running in a loop generates more and more data that could be relevant for the next turn of inference, and this information must be cyclically refined. Context engineering is the art and science of curating what will go into the limited context window from that constantly evolving universe of possible information.
Figure: Prompt engineering vs. context engineering. In contrast to the discrete task of writing a prompt, context engineering is iterative, and the curation phase happens each time we decide what to pass to the model.
Why context engineering is important to building capable agents
Despite their speed and ability to manage larger and larger volumes of data, we’ve observed that LLMs, like humans, lose focus or experience confusion at a certain point. Studies on needle-in-a-haystack style benchmarking have uncovered the concept of context rot: as the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases. While some models exhibit more gentle degradation than others, this characteristic emerges across all models. Context, therefore, must be treated as a finite resource with diminishing marginal returns. Like humans, who have limited working memory capacity, LLMs have an “attention budget” that they draw on when parsing large volumes of context. Every new token introduced depletes this budget by some amount, increasing the need to carefully curate the tokens available to the LLM.
This attention scarcity stems from architectural constraints of LLMs. LLMs are based on the transformer architecture, which enables every token to attend to every other token across the entire context. This results in n² pairwise relationships for n tokens. As its context length increases, a model’s ability to capture these pairwise relationships gets stretched thin, creating a natural tension between context size and attention focus. Additionally, models develop their attention patterns from training data distributions where shorter sequences are typically more common than longer ones. This means models have less experience with, and fewer specialized parameters for, context-wide dependencies. Techniques like position encoding interpolation allow models to handle longer sequences by adapting them to the originally trained smaller context, though with some degradation in token position understanding. These factors create a performance gradient rather than a hard cliff: models remain highly capable at longer contexts but may show reduced precision for information retrieval and long-range reasoning compared to their performance on shorter contexts.
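To make the n² growth concrete, here is a small arithmetic sketch. It only counts token pairs as the passage describes them; the function name `pairwise_relationships` is ours, and the count is not a measure of any particular model's compute or memory use.

```python
def pairwise_relationships(n_tokens: int) -> int:
    """Number of (query token, attended token) pairs when every token attends to every token."""
    return n_tokens * n_tokens


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {pairwise_relationships(n):>14,} pairs")
```

Each tenfold increase in context length multiplies the number of pairs by one hundred, which is one way to see why attention gets "stretched thin" as the window fills.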
These realities mean that thoughtful context engineering is essential for building capable agents.
The anatomy of effective context
Given that LLMs are constrained by a finite attention budget, good context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome. Implementing this practice is much easier said than done, but in the following section, we outline what this guiding principle means in practice across the different components of context.
System prompts should be extremely clear and use simple, direct language that presents ideas at the right altitude for the agent. The right altitude is the Goldilocks zone between two common failure modes. At one extreme, we see engineers hardcoding complex, brittle logic in their prompts to elicit exact agentic behavior. This approach creates fragility and increases maintenance complexity over time. At the other extreme, engineers sometimes provide vague, high-level guidance that fails to give the LLM concrete signals for desired outputs or falsely assumes shared context. The optimal altitude strikes a balance: specific enough to guide behavior effectively, yet flexible enough to provide the model with strong heuristics to guide behavior.
Calibrating the system prompt in the process of context engineering. At one end of the spectrum, we see brittle if-else hardcoded prompts, and at the other end we see prompts that are overly general or falsely assume shared context.
We recommend organizing prompts into distinct sections (like `## Behavior`, `## Tool guidance`, `## Output description`, etc.) and using techniques like XML tagging or Markdown headers to delineate these sections, although the exact formatting of prompts is likely becoming less important as models become more capable. Regardless of how you decide to structure your system prompt, you should be striving for the minimal set of information that fully outlines your expected behavior. (Note that minimal does not necessarily mean short; you still need to give the agent sufficient information up front to ensure it adheres to the desired behavior.) It’s best to start by testing a minimal prompt with the best model available to see how it performs on your task, and then add clear instructions and examples to improve performance based on failure modes found during initial testing.
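One way to keep a sectioned prompt maintainable is to assemble it from named parts. The sketch below uses Markdown headers with the section names mentioned above; the section bodies and the helper `build_system_prompt` are placeholders invented for this example.

```python
def build_system_prompt(sections: dict[str, str]) -> str:
    """Join named sections under Markdown headers, in insertion order."""
    blocks = [f"## {title}\n{body.strip()}" for title, body in sections.items()]
    return "\n\n".join(blocks)


prompt = build_system_prompt({
    "Behavior": "Answer questions about the orders service. Ask before changing shared schemas.",
    "Tool guidance": "Search the repository before opening files; prefer targeted reads.",
    "Output description": "Reply with a short plan followed by a unified diff.",
})
print(prompt)
```

Keeping each section as a separate value makes it straightforward to start from a minimal prompt and extend one section at a time as failure modes show up in testing.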
Tools allow agents to operate with their environment and pull in new, additional context as they work. Because tools define the contract between agents and their information/action space, it’s extremely important that tools promote efficiency, both by returning information that is token efficient and by encouraging efficient agent behaviors. In Writing tools for AI agents – with AI agents, we discussed building tools that are well understood by LLMs and have minimal overlap in functionality. Similar to the functions of a well-designed codebase, tools should be self-contained, robust to error, and extremely clear with respect to their intended use. Input parameters should similarly be descriptive, unambiguous, and play to the inherent strengths of the model. One of the most common failure modes we see is bloated tool sets that cover too much functionality or lead to ambiguous decision points about which tool to use. If a human engineer can’t definitively say which tool should be used in a given situation, an AI agent can’t be expected to do better. As we’ll discuss later, curating a minimal viable set of tools for the agent can also lead to more reliable maintenance and pruning of context over long interactions.
Providing examples, otherwise known as few-shot prompting, is a well known best practice that we continue to strongly advise. However, teams will often stuff a laundry list of edge cases into a prompt in an attempt to articulate every possible rule the LLM should follow for a particular task. We do not recommend this. Instead, we recommend working to curate a set of diverse, canonical examples that effectively portray the expected behavior of the agent. For an LLM, examples are the “pictures” worth a thousand words. Our overall guidance across the different components of context (system prompts, tools, examples, message history, etc) is to be thoughtful and keep your context informative, yet tight. Now let’s dive into dynamically retrieving context at runtime.
Context retrieval and agentic search
In Building effective AI agents, we highlighted the differences between LLM-based workflows and agents. Since we wrote that post, we’ve gravitated towards a simple definition for agents: LLMs autonomously using tools in a loop. Working alongside our customers, we’ve seen the field converging on this simple paradigm. As the underlying models become more capable, the level of autonomy of agents can scale: smarter models allow agents to independently navigate nuanced problem spaces and recover from errors.
We’re now seeing a shift in how engineers think about designing context for agents. Today, many AI-native applications employ some form of embedding-based pre-inference time retrieval to surface important context for the agent to reason over. As the field transitions to more agentic approaches, we increasingly see teams augmenting these retrieval systems with “just in time” context strategies. Rather than pre-processing all relevant data up front, agents built with the “just in time” approach maintain lightweight identifiers (file paths, stored queries, web links, etc.) and use these references to dynamically load data into context at runtime using tools. Anthropic’s agentic coding solution Claude Code uses this approach to perform complex data analysis over large databases. The model can write targeted queries, store results, and leverage Bash commands like head and tail to analyze large volumes of data without ever loading the full data objects into context. This approach mirrors human cognition: we generally don’t memorize entire corpuses of information, but rather introduce external organization and indexing systems like file systems, inboxes, and bookmarks to retrieve relevant information on demand.
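The "just in time" pattern can be approximated in a few lines: hold on to a file path, and read only a slice of the file when it is actually needed. The functions below are Python stand-ins for the Bash `head` and `tail` commands mentioned above, written for illustration; they are not how Claude Code is implemented.

```python
from collections import deque
from itertools import islice


def head(path: str, n: int = 5) -> list[str]:
    """Return the first n lines without reading the rest of the file."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


def tail(path: str, n: int = 5) -> list[str]:
    """Return the last n lines, keeping at most n lines in memory at a time."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def peek(path: str, n: int = 3) -> str:
    """A compact preview an agent could place in context instead of the whole file."""
    return "\n".join(head(path, n) + ["..."] + tail(path, n))
```

The agent's context then carries the path plus a six-line preview, and the full file is read only if the preview suggests it is relevant.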
Beyond storage efficiency, the metadata of these references provides a mechanism to efficiently refine behavior, whether explicitly provided or intuitive. To an agent operating in a file system, the presence of a file named `test_utils.py` in a `tests` folder implies a different purpose than a file with the same name located in `src/core_logic/`. Folder hierarchies, naming conventions, and timestamps all provide important signals that help both humans and agents understand how and when to utilize information.
Letting agents navigate and retrieve data autonomously also enables progressive disclosure—in other words, allows agents to incrementally discover relevant context through exploration. Each interaction yields context that informs the next decision: file sizes suggest complexity; naming conventions hint at purpose; timestamps can be a proxy for relevance. Agents can assemble understanding layer by layer, maintaining only what’s necessary in working memory and leveraging note-taking strategies for additional persistence. This self-managed context window keeps the agent focused on relevant subsets rather than drowning in exhaustive but potentially irrelevant information.
Of course, there’s a trade-off: runtime exploration is slower than retrieving pre-computed data. Not only that, but opinionated and thoughtful engineering is required to ensure that an LLM has the right tools and heuristics for effectively navigating its information landscape. Without proper guidance, an agent can waste context by misusing tools, chasing dead-ends, or failing to identify key information. In certain settings, the most effective agents might employ a hybrid strategy, retrieving some data up front for speed, and pursuing further autonomous exploration at its discretion. The decision boundary for the ‘right’ level of autonomy depends on the task. Claude Code is an agent that employs this hybrid model: CLAUDE.md files are naively dropped into context up front, while primitives like glob and grep allow it to navigate its environment and retrieve files just-in-time, effectively bypassing the need to load all files into the context window at once.
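A hybrid strategy of this kind might look like the sketch below: one small file is loaded eagerly, and everything else is located on demand with glob-and-grep style helpers. The function names `initial_context` and `grep` are hypothetical and the matching is plain substring search; this illustrates the pattern only and does not describe Claude Code's internals.

```python
from pathlib import Path


def initial_context(root: Path) -> str:
    """Up-front part: load the project guidance file if it exists."""
    guidance = root / "CLAUDE.md"
    return guidance.read_text(encoding="utf-8") if guidance.exists() else ""


def grep(root: Path, needle: str, pattern: str = "**/*.py") -> list[tuple[str, int, str]]:
    """Just-in-time part: return (relative path, line number, line) for each match."""
    hits = []
    for path in sorted(root.glob(pattern)):
        if not path.is_file():
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```

Only the guidance file and the matching lines reach the context; the agent can then decide which of the matched files, if any, are worth opening in full.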