@cyrilXBT: https://x.com/cyrilXBT/status/2070690243880116242

X AI KOLs Timeline 06/27/26, 02:06 AM News

multi-agent agent-architecture ai-agents handoffs verification failure-modes

Summary

A practical guide explaining why naive multi-agent systems fail and how to build coordinated AI agent teams using Builder, Judge, and Manager roles with clear handoffs and verification.

https://t.co/g6eYZoKg57

Original Article


Cached at: 06/27/26, 03:58 PM

How to Build a Team of AI Agents That Actually Work Together (Full Course)

Most multi-agent systems fail for the same boring reason. Someone spins up three or four agents, gives each one a vague job description, lets them all talk to each other freely, and then wonders why the output is worse than a single well-prompted agent would have produced alone.

More agents is not the upgrade people think it is. An uncoordinated group of agents is just one confused process wearing several costumes. The upgrade comes from structure. Clear roles. Clear handoffs. A real verification step. A way to stop.

This course covers the actual architecture behind multi-agent systems that hold up under real use, not demo conditions. By the end you will understand why most teams fail, the specific roles that make a team function, how to wire the handoffs between them, and the failure modes you need to test for before you trust the system with anything that matters.

Why Adding Agents Usually Makes Things Worse

Start with the uncomfortable truth. A second agent does not add intelligence to a system. It adds another point of failure, another place where context can get lost, and another opportunity for the system to confidently produce something wrong.

The naive version of multi-agent design treats agents like a brainstorm. Throw a few of them at a problem, let them all see everything, let them all contribute, and trust that more perspectives produce a better answer. This fails for a specific reason. Without role separation, every agent ends up doing a slightly worse version of the same generalist task, and nobody is actually checking anybody’s work. You have parallelized the guessing, not improved the accuracy.

The systems that actually work invert this completely. Instead of many agents doing the same job, you build a small number of agents each doing a distinct job, with a structure that catches mistakes before they compound. Multi-agent design is not about adding more minds to a problem. It is about separating concerns the same way a well-run team of humans separates concerns, with someone building, someone checking, and someone deciding when the work is actually done.

The Three Roles Every Working System Needs

Every multi-agent system that holds up in production reduces, at its core, to some version of three roles. Different teams name them differently, but the functions are constant.

The Builder. This agent does the actual work. Writes the code, drafts the content, researches the topic, executes the task. It has the most context about the specific problem and the fewest constraints on creativity, because its job is to produce a first attempt, not a perfect one.

The Judge. This agent does not build anything. Its only job is to evaluate what the Builder produced against a specific, written standard. Does the code pass the tests? Does the draft match the brief? Does the research actually answer the question asked? The Judge should ideally run on a different prompt, sometimes a different model entirely, because an agent grading its own homework with the same blind spots it wrote the homework with catches far less than an independent reviewer would.

The Manager. This agent does not build or judge. It routes. It decides what happens next based on what the Judge reported. Send it back to the Builder with specific feedback. Escalate to a human. Mark the task complete. The Manager is also where your stop conditions live, because without an explicit role responsible for deciding when to stop, a system will happily loop forever.

This Builder, Judge, Manager structure is intentionally minimal. You can add more specialized roles, a Researcher that only gathers facts, a Formatter that only handles output structure, but every addition should justify itself against this baseline. If a new agent is not clearly building, judging, or managing, it is probably duplicating a role that already exists, which adds cost and failure surface without adding capability.

Designing the Handoffs

The actual engineering work in a multi-agent system is not the prompts for each agent. It is the handoffs between them, and this is the part most builders skip.

A handoff needs three things to work reliably. A defined format for what gets passed between agents, so the receiving agent is not parsing free text and guessing at structure. A defined trigger for when the handoff happens, so the Builder is not deciding on its own when it is done. And a defined failure path, so that when a handoff does not go cleanly, the system has a specified next step instead of silently breaking.

In practice, this means your Builder agent should output something structured, not just prose. A JSON object with a status field, a content field, and a confidence or completion flag works well. The Judge then consumes that structure, evaluates it against a checklist you wrote in advance, and outputs its own structured verdict, pass, fail, or needs revision, along with the specific reason. The Manager reads the verdict, not the raw content, and decides the next action based on rules you defined ahead of time, not based on improvising in the moment.
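One way the structured handoff described above might look in Python. This is a minimal sketch, not a fixed schema: the field names, the allowed status and verdict strings, and the functions parse_builder_output and next_action are all assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass

# All names and allowed values below are illustrative assumptions.
VALID_STATUSES = {"draft_complete", "blocked"}
VALID_VERDICTS = {"pass", "fail", "needs_revision"}


@dataclass
class BuilderOutput:
    status: str        # e.g. "draft_complete"
    content: str       # the work product itself
    confidence: float  # Builder's self-reported confidence, 0.0 to 1.0


@dataclass
class JudgeVerdict:
    verdict: str  # "pass", "fail", or "needs_revision"
    reason: str   # the specific reason behind the verdict


def parse_builder_output(raw: str) -> BuilderOutput:
    """Parse the Builder's JSON and reject anything off-schema.

    Raising ValueError is the defined failure path: the caller can hand
    the error to the Manager instead of passing broken data downstream.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"handoff is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("handoff must be a JSON object")
    missing = {"status", "content", "confidence"} - data.keys()
    if missing:
        raise ValueError(f"handoff missing fields: {sorted(missing)}")
    if data["status"] not in VALID_STATUSES:
        raise ValueError(f"unknown status: {data['status']!r}")
    if not isinstance(data["confidence"], (int, float)):
        raise ValueError("confidence must be a number")
    return BuilderOutput(data["status"], str(data["content"]),
                         float(data["confidence"]))


def next_action(verdict: JudgeVerdict) -> str:
    """Manager rule table: decide from the verdict alone, not the content."""
    if verdict.verdict not in VALID_VERDICTS:
        return "escalate_to_human"  # an unreadable verdict is never retried blindly
    return {
        "pass": "mark_complete",
        "needs_revision": "return_to_builder",
        "fail": "return_to_builder",
    }[verdict.verdict]
```

The Manager function never touches the content field, which keeps routing decisions tied to rules written ahead of time rather than to whatever the draft happens to say.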

This sounds like more upfront work than just letting the agents talk to each other in natural language, and it is. That upfront cost is exactly what buys you a system that behaves the same way on the hundredth run as it did on the first one. Natural language handoffs between agents drift. Structured handoffs do not.

Stop Conditions Are Not Optional

The single most common reason multi-agent systems become expensive disasters is the absence of an explicit stop condition. Without one, a Builder and Judge can loop indefinitely, each revision triggering another evaluation, each evaluation triggering another revision, while your API bill climbs and nobody notices until the invoice arrives.

A real stop condition has three components. A maximum iteration count, a hard ceiling on how many times the Builder and Judge can cycle before the Manager is forced to escalate to a human regardless of whether the task is actually done. A quality threshold, a specific, measurable bar that counts as good enough, so the system is not chasing perfection it was never asked to deliver. And a time or cost ceiling, an absolute budget the task cannot exceed no matter what state it is in when that ceiling hits.

Write all three into the Manager’s logic explicitly, as code or as an unambiguous rule, not as a vague instruction inside a prompt. An instruction like “stop when the result is good enough” inside a prompt is not a stop condition. It is a suggestion the model can and eventually will ignore under the right conditions. A hard iteration counter that the Manager checks before allowing another loop is a stop condition.
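A minimal Python sketch of the three components as code the Manager checks before allowing another loop. The names StopConfig, manager_decision, and run_task, the default limits, and the flat per-cycle cost are assumptions for illustration; build and judge stand in for whatever actually calls your agents.

```python
from dataclasses import dataclass


@dataclass
class StopConfig:
    max_iterations: int = 3         # hard ceiling on Builder/Judge cycles
    quality_threshold: float = 0.9  # score that counts as good enough
    max_cost_usd: float = 2.00      # absolute budget for the whole task


def manager_decision(iteration: int, score: float, cost_usd: float,
                     cfg: StopConfig) -> str:
    """Return "complete", "escalate", or "revise".

    The ceilings are checked in plain code, so no prompt wording can
    talk the system into one more loop.
    """
    if score >= cfg.quality_threshold:
        return "complete"
    if cost_usd >= cfg.max_cost_usd:
        return "escalate"
    if iteration >= cfg.max_iterations:
        return "escalate"
    return "revise"


def run_task(build, judge, cost_per_cycle_usd: float, cfg: StopConfig):
    """Drive the loop. build(feedback) returns a draft and
    judge(draft) returns (score, feedback); both are hypothetical callables."""
    iteration, cost, feedback = 0, 0.0, None
    while True:
        draft = build(feedback)
        score, feedback = judge(draft)
        iteration += 1
        cost += cost_per_cycle_usd
        decision = manager_decision(iteration, score, cost, cfg)
        if decision != "revise":
            return decision, iteration
```

With a Judge that never scores above the threshold, this loop still ends after max_iterations cycles or when the budget is spent, whichever comes first.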

Memory Across Agents

A team of agents that forgets everything between tasks is not actually a team. It is the same conversation happening from scratch every time, with extra steps.

The systems that compound over time give the Manager access to a persistent memory layer that survives across tasks, not just within one. When the Judge flags a recurring failure pattern, that pattern gets written somewhere permanent, a file, a database, a vault, so the next time a similar task comes through, the Builder starts with that lesson already available instead of making the same mistake again.

The structure that works well here mirrors the write, consolidate, recall pattern. After each task, write down what happened and what the Judge found. Periodically consolidate those raw logs into a small number of distilled lessons rather than letting the raw log grow forever. Before starting a new task, have the Builder recall the relevant lessons first. This is the difference between a system that has the same competence on day ninety as it had on day one, and a system that quietly gets sharper every week because it is not relearning the same mistakes.
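The write, consolidate, recall cycle can be sketched with a small file-backed store. Everything here is an assumption for illustration: the LessonStore class, the JSON layout, and especially the consolidation rule, which simply promotes findings that recur and discards the rest. A real system might summarize the log with a model instead.

```python
import json
from collections import Counter
from pathlib import Path


class LessonStore:
    """File-backed memory that survives across tasks (illustrative layout)."""

    def __init__(self, path: Path):
        self.path = path
        self.state = {"raw_log": [], "lessons": {}}
        if path.exists():
            self.state = json.loads(path.read_text())

    def _save(self) -> None:
        self.path.write_text(json.dumps(self.state, indent=2))

    def write(self, task_type: str, judge_finding: str) -> None:
        """After each task, record what the Judge found."""
        self.state["raw_log"].append(
            {"task_type": task_type, "finding": judge_finding})
        self._save()

    def consolidate(self, min_count: int = 2) -> None:
        """Promote findings seen at least min_count times to lessons,
        then clear the raw log so it does not grow forever."""
        counts = Counter((e["task_type"], e["finding"])
                         for e in self.state["raw_log"])
        for (task_type, finding), n in counts.items():
            if n >= min_count:
                lessons = self.state["lessons"].setdefault(task_type, [])
                if finding not in lessons:
                    lessons.append(finding)
        self.state["raw_log"] = []
        self._save()

    def recall(self, task_type: str) -> list[str]:
        """Lessons to place in the Builder's context before it starts."""
        return list(self.state["lessons"].get(task_type, []))
```

Because the store reloads from disk on construction, a lesson written during one task is available to the Builder in a later, separate run.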

The Four Failure Modes You Must Test For

Before you trust a multi-agent system with anything that matters, run it through these four specific failure conditions deliberately. Most systems that fail in production would have failed one of these tests immediately, if anyone had bothered to run it.

The infinite loop test. Feed the system a task it genuinely cannot complete to the Judge’s standard, on purpose. Does the Manager correctly hit the iteration ceiling and escalate, or does it spin forever, burning tokens on a task it was never going to finish?

The malicious or malformed input test. Feed the Builder something deliberately broken or adversarial. Does the structured handoff format hold up, or does a bad input break the parsing between agents and silently corrupt everything downstream of it?

The Judge collusion test. If your Judge and Builder share the same underlying model, check directly whether the Judge is actually catching the Builder’s specific blind spots or just rubber stamping anything that looks superficially complete. Run a known bad output through the Judge manually and confirm it gets correctly rejected.

The cost runaway test. Simulate the worst case path through your system, maximum iterations, most expensive model calls, longest possible content, and calculate what that actually costs in dollars and time. If that number would alarm you on a real invoice, your stop conditions are not tight enough yet.

Running these four tests before deployment catches the overwhelming majority of the failures that otherwise show up for the first time in front of a client, a boss, or your own bank statement.
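The cost runaway test is mostly arithmetic, so it can be written down before any agent runs. A sketch under stated assumptions: CallProfile and worst_case are hypothetical names, every role is assumed to run once per iteration at its maximum size, and the token limits and prices you plug in must come from your own provider and configuration.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    """Upper bounds for one role's model call (all values supplied by you)."""
    max_input_tokens: int
    max_output_tokens: int
    usd_per_1k_input: float
    usd_per_1k_output: float
    max_seconds: float

    def worst_case_usd(self) -> float:
        return (self.max_input_tokens / 1000 * self.usd_per_1k_input
                + self.max_output_tokens / 1000 * self.usd_per_1k_output)


def worst_case(profiles: list[CallProfile], max_iterations: int):
    """Cost and wall-clock time if every iteration hits every ceiling."""
    usd = max_iterations * sum(p.worst_case_usd() for p in profiles)
    seconds = max_iterations * sum(p.max_seconds for p in profiles)
    return usd, seconds
```

Calling worst_case with one profile per role and your iteration ceiling gives the number to compare against the invoice you would be willing to see.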

A Worked Example: Content Production Team

To make this concrete, here is how the Builder, Judge, Manager structure looks for a content production pipeline, the kind of system you would use to turn a raw source into a finished, fact-checked piece of content without supervising every step.

The Builder agent receives a source, an article, a transcript, a press release, and drafts the content in the requested format. It has web search access if claims need verification, but its only job is producing a complete first draft, not deciding whether that draft is good enough to publish.

The Judge agent receives the draft and the original source side by side, never the draft alone. It checks for three things specifically. Does every factual claim in the draft trace back to something actually present in the source? Does the draft follow the required style rules: no AI clichés, correct formatting, correct length? Is the core argument or hook actually present and not diluted by filler? The Judge outputs a structured verdict with a pass or fail on each of the three checks individually, not just one overall score, because a single overall score hides exactly which part failed.

The Manager reads that structured verdict. A clean pass on all three sends the content to a final output queue. A failure on facts sends it back to the Builder with the specific unverified claim flagged, not a vague “double check your facts” instruction. A failure on style sends it back with the specific rule that was violated. After three failed cycles on the same check, the Manager stops looping and escalates to a human with the full history of what failed and why, rather than continuing to retry a problem the system has already demonstrated it cannot fix on its own.
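The routing rules in the paragraph above might be sketched like this. The check names, the action strings, and the ContentManager class are assumptions; the sketch also assumes "three failed cycles on the same check" means three failures in total for that check, not necessarily consecutive ones.

```python
from collections import Counter
from dataclasses import dataclass, field

CHECKS = ("facts", "style", "hook")  # illustrative names for the three checks
MAX_FAILS_PER_CHECK = 3              # three failed cycles on the same check


@dataclass
class CheckResult:
    passed: bool
    detail: str = ""  # the specific unverified claim or violated rule


@dataclass
class ContentManager:
    fail_counts: Counter = field(default_factory=Counter)
    history: list = field(default_factory=list)

    def route(self, verdict: dict[str, CheckResult]) -> dict:
        """Decide the next step from the per-check verdict alone."""
        self.history.append(
            {c: (verdict[c].passed, verdict[c].detail) for c in CHECKS})
        failed = [c for c in CHECKS if not verdict[c].passed]
        if not failed:
            return {"action": "publish_queue"}
        for check in failed:
            self.fail_counts[check] += 1
        exhausted = [c for c in failed
                     if self.fail_counts[c] >= MAX_FAILS_PER_CHECK]
        if exhausted:
            # Stop looping and hand over the full record of what failed and why.
            return {"action": "escalate_to_human",
                    "checks": exhausted, "history": self.history}
        # Send back only the specific failures, never a vague instruction.
        return {"action": "return_to_builder",
                "feedback": {c: verdict[c].detail for c in failed}}
```

One ContentManager instance is meant to live for the duration of a single task, so its counters and history describe that task only.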

This is a small system, three roles, one clear handoff format, one explicit stop condition. It is also a system you could actually run unattended overnight and trust the output of in the morning, because every failure mode has a defined path instead of an undefined one.

A Second Example: Code Review Team

The same structure looks different but follows identical logic when applied to code instead of content, which is worth walking through because the specific checks change even though the architecture does not.

The Builder agent here is whatever is writing or modifying code, a coding agent working through a defined task, a feature request, a bug fix. Its output is not just the code itself but the actual command output from running it, the test results, the lint output, the build status, all bundled into the structured handoff. An agent that produces code but never actually runs it is not a Builder you can trust, because syntactically correct code that fails its own test suite is worse than no code at all, since it looks finished without being finished.

The Judge here checks three different but structurally identical categories of correctness. Did the change pass the existing test suite without modification? A Builder quietly editing tests to make them pass is a specific and common failure worth checking for directly. Did static analysis and linting come back clean? Does the diff actually address the original task, not a related but different problem the Builder decided was more interesting along the way? Each of these gets its own pass or fail in the structured verdict, exactly the same shape as the content example, just with different specific criteria.

The Manager routes based on which specific check failed. A failing test suite goes back to the Builder with the exact failing test output attached, not a generic “fix the tests” instruction. A scope mismatch, where the diff solved the wrong problem, escalates to a human immediately rather than looping, because that is a judgment failure about what the task actually was, not a mechanical defect the Builder can iterate its way out of on its own.
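A sketch of the same skeleton for the code review case. The CodeHandoff fields, the tests/ directory convention, and the action strings are assumptions for illustration. The scope check is taken as an input because in practice it would come from a separate review step, and retrying lint failures is also an assumption, since only the test and scope paths are spelled out above.

```python
from dataclasses import dataclass


@dataclass
class CodeHandoff:
    """What the Builder submits; field names are illustrative."""
    changed_files: list[str]
    test_exit_code: int   # 0 means the suite passed
    test_output: str      # raw runner output, forwarded on failure
    lint_exit_code: int   # 0 means lint and static analysis were clean
    addresses_task: bool  # result of a separate scope review, assumed given


def modified_tests(handoff: CodeHandoff, test_dir: str = "tests/") -> list[str]:
    """Changed files that live under the test directory."""
    return [f for f in handoff.changed_files if f.startswith(test_dir)]


def judge(handoff: CodeHandoff) -> dict[str, bool]:
    """One pass/fail per check, the same shape as the content example."""
    return {
        "tests": handoff.test_exit_code == 0 and not modified_tests(handoff),
        "static_analysis": handoff.lint_exit_code == 0,
        "scope": handoff.addresses_task,
    }


def route(handoff: CodeHandoff) -> dict[str, str]:
    verdict = judge(handoff)
    if not verdict["scope"]:
        # Wrong problem solved: a judgment failure, so no retry loop.
        return {"action": "escalate_to_human",
                "reason": "diff does not address the original task"}
    if not verdict["tests"]:
        edited = modified_tests(handoff)
        feedback = (f"existing tests were modified: {edited}" if edited
                    else handoff.test_output)
        return {"action": "return_to_builder", "feedback": feedback}
    if not verdict["static_analysis"]:
        # Assumption: lint failures are retried like test failures.
        return {"action": "return_to_builder",
                "feedback": "static analysis or lint reported issues"}
    return {"action": "mark_complete"}
```

A green test run still fails the tests check if any file under the test directory changed, which is the direct guard against a Builder editing tests to make them pass.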

Notice that the underlying skeleton is identical across both examples: the Builder produces and submits, the Judge checks against specific named criteria with a structured per-check verdict, and the Manager routes based on which specific check failed. This is the actual lesson worth taking from seeing two domains side by side. Once you have the skeleton right, applying it to a new task is mostly a matter of writing a new, specific checklist for the Judge, not redesigning the architecture from scratch.

Common Mistakes That Break Otherwise Good Designs

Most multi-agent failures in the wild are not architecture failures. The Builder, Judge, Manager skeleton above is sound and well tested across many real systems. The failures come from implementation details that quietly undermine an otherwise correct design, and the same handful of mistakes show up again and again across completely different domains.

Letting the Judge see only the Builder’s output and not the original source or requirements. A Judge with no ground truth to compare against can only check internal consistency, not correctness, which means it will happily approve confident, well-formatted, completely wrong output.

Giving every agent the same model and the same base prompt with a thin role instruction layered on top. This often means the Judge inherits the exact same blind spots the Builder has, because they are fundamentally the same reasoning process wearing a different hat. Where the budget allows, using a genuinely different model or a meaningfully different prompting approach for the Judge produces real independence instead of theatrical independence.

Treating the Manager as a simple if-else router instead of giving it real decision logic and a memory of what has already been tried. A Manager that cannot see the history of previous attempts on this specific task will happily send the same failed feedback back to the Builder a second and third time, producing the exact same failed result each time.

Skipping the four failure mode tests because the system worked fine in the one demo run everyone watched closely. A system that works under observation and a system that works unattended are different claims, and only the second one is the actual goal of building a multi-agent team in the first place.
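For the third mistake above, a Manager with no memory of previous attempts, one possible guard is to fingerprint each piece of feedback and refuse to send the same feedback twice. This is a sketch only: AttemptHistory and the normalization rule are assumptions, and exact-match hashing will miss feedback that is reworded but equivalent.

```python
import hashlib
from dataclasses import dataclass, field


def _fingerprint(feedback: str) -> str:
    """Hash of the feedback with case and whitespace differences removed."""
    normalized = " ".join(feedback.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass
class AttemptHistory:
    """Per-task record of feedback already sent to the Builder."""
    seen: set = field(default_factory=set)

    def decide(self, feedback: str) -> str:
        fp = _fingerprint(feedback)
        if fp in self.seen:
            # The Builder already failed with this exact feedback once.
            return "escalate_to_human"
        self.seen.add(fp)
        return "return_to_builder"
```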

When to Add a Fourth Role

The three-role structure covers the majority of real use cases, but some tasks genuinely need a fourth role, and it is worth knowing when that addition is justified versus when it is just complexity for its own sake.

A Researcher role earns its place when the Builder’s job requires gathering information from multiple sources before it can produce anything, and that gathering step has its own distinct failure modes separate from the writing or building step itself. Separating research from building means the Builder works only from verified, already gathered material, which prevents it from quietly inventing facts under the pressure of also having to find them in the same pass. The cost of this separation is an extra handoff, so it is worth it specifically when fact accuracy is the dimension most likely to fail, which is true for content production, technical documentation, and anything making specific claims.

A Formatter role earns its place when output structure is rigid and important enough that a content-focused Builder consistently gets it wrong while focusing on substance. Rather than asking the Builder to be excellent at both ideas and formatting simultaneously, a dedicated Formatter takes a structurally loose but substantively correct draft and reshapes it into the exact required format, whether that is a specific JSON schema, a specific document template, or platform-specific length and structure rules.

The test for adding any new role is the same every time. Can you point to a specific, recurring failure that the new role would catch and that the existing three roles structurally cannot, because it falls outside what each of them is responsible for checking? If you cannot point to that specific gap, the new role is not solving a real problem. It is adding a fourth point of failure to chase a feeling of thoroughness.

Choosing Models for Each Role

Not every role in your system needs the same model, and treating model selection as a single decision applied uniformly across the whole pipeline usually means overpaying for the easy roles and underpowering the hard ones.

The Builder role generally benefits the most from your most capable model, since this is where the actual creative or technical difficulty of the task lives, and a weaker model here produces a worse first draft that costs more revision cycles to fix than it would have cost to generate well the first time.

The Judge role has a more interesting calculus. It does not need to be creative; it needs to be consistent and rule-following, which means a smaller, faster, cheaper model checking against an extremely well-specified checklist often performs just as reliably as your largest model, at a fraction of the cost and latency. The exception is tasks where judgment itself is genuinely hard, such as nuanced writing quality or subtle factual disputes, where you may want your most capable model in this seat specifically because the evaluation task is as difficult as the generation task.

The Manager role almost never needs your most expensive model, because its job is routing logic against rules you have already written down, not open-ended reasoning. A fast, cheap model handling Manager duties keeps your per-task overhead low across every single cycle, since the Manager runs at least once per iteration regardless of how the Builder and Judge are performing.

This tiered approach, expensive model for building, cheap and consistent model for judging routine checks, cheap model for routing, is usually where the real cost savings in a multi-agent system come from. Most people assume cost control means using fewer agents. It actually comes from matching model cost to the actual difficulty of each specific role.
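The effect of tiering can be estimated with a few lines of arithmetic. The tier names and per-call prices below are placeholders, not real rates, and the sketch assumes each role makes one call per iteration.

```python
# Placeholder prices per call in USD; substitute your provider's real rates.
PRICE_PER_CALL_USD = {"large": 0.30, "small": 0.02}

# Two hypothetical assignments of model tier to role.
UNIFORM = {"builder": "large", "judge": "large", "manager": "large"}
TIERED = {"builder": "large", "judge": "small", "manager": "small"}


def cost_per_task(assignment: dict[str, str], iterations: int) -> float:
    """Total cost when every role runs once per iteration."""
    per_iteration = sum(PRICE_PER_CALL_USD[tier]
                        for tier in assignment.values())
    return iterations * per_iteration


def savings(iterations: int) -> float:
    """Absolute saving of the tiered assignment over the uniform one."""
    return cost_per_task(UNIFORM, iterations) - cost_per_task(TIERED, iterations)
```

In this model the saving grows linearly with the iteration count, because the Judge and Manager run on every cycle while the Builder's cost is the same in both assignments.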

Where to Go From Here

Start small. Build the Builder, Judge, Manager structure for one narrow, well-defined task before you try to generalize it. Write the Judge’s checklist down explicitly before you write a single line of the Builder’s prompt, because the standard you are checking against should shape the work, not the other way around.

Run the four failure mode tests before you trust the system with anything real, not after something has already gone wrong in production. Add the memory layer once the basic loop is reliable, not before, because a system that compounds mistakes faster than it compounds lessons is worse than a system with no memory at all.

The teams that get real, compounding value out of multi-agent systems are not the ones running the most agents at once. They are the ones who took the time to define exactly what each agent is responsible for, exactly how work moves between them, and exactly what happens when something goes wrong. That discipline, not agent count, is the entire difference between a system that works while you sleep and one that quietly racks up a bill while producing nothing you can actually use.

