In practice, our multi-agent failures were almost never the model - they were the handoffs. Does the MAST data match what you see?

Reddit r/AI_Agents 07/02/26, 08:41 AM Papers

multi-agent failures orchestration coordination verification llm-systems

Summary

An analysis of multi-agent LLM pipeline failures. The post cites the Berkeley MAST paper, which attributes most failures to coordination issues (specification, inter-agent misalignment) rather than model capability, and the author proposes dedicated verifier agents as a fix.

I've been building and debugging multi-agent pipelines (orchestration + tool use) and kept hitting the same wall: a run fails, someone swaps the model or bumps the context window, it passes once, then fails somewhere else. We were treating a coordination problem as a capability problem.

The Berkeley MAST paper ("Why Do Multi-Agent LLM Systems Fail?", arXiv:2503.13657) lines up with this. They hand-annotated 1,600+ execution traces across 7 frameworks (6 annotators, Cohen's Kappa 0.88) and put failures into 3 buckets:

- Specification & design: 41.8% (bad decomposition, ambiguous roles, no termination condition)
- Inter-agent misalignment: 36.9% (context lost at handoffs, conflicting outputs, format mismatches)
- Verification: 21.3% (premature "done," incomplete or incorrect checks)

The kicker on the eval side: various 2026 audits report LLM-judges being wrong more than half the time, with position bias (favoring the first answer) and length bias. And pass^k variance is brutal - the same agent across 4 runs can score 15–25 points lower on pass^4 than pass^1. So a single green run tells you very little.

My take: most "add another agent" fixes make it worse because they add more seams, not fewer. The most underused fix I've found is a dedicated verifier agent with isolated context and scoring criteria the producing agents never see - basically treating verification as a separate stage, like an independent scanner in a security pipeline.
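To make the pass^1 vs pass^4 gap concrete, here is a minimal sketch. It assumes pass^k means "the probability that all k independent attempts at a task succeed", estimated per task as C(c, k) / C(n, k) from c successes in n recorded runs. The function name `pass_hat_k` and the success counts are made up for illustration, not taken from any real benchmark.

```python
from math import comb


def pass_hat_k(successes_per_task, n, k):
    """Estimate pass^k: the chance that k runs of the same task ALL succeed.

    successes_per_task: number of successful runs (out of n) for each task.
    Per task, C(c, k) / C(n, k) is the fraction of k-run subsets that are
    entirely successful; the result is the mean of that over tasks.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = sum(comb(c, k) for c in successes_per_task)
    return total / (len(successes_per_task) * comb(n, k))


# Made-up data: 10 tasks, 4 runs each. Seven tasks pass every run,
# three are flaky (3, 3 and 2 successes out of 4).
successes = [4, 4, 4, 4, 4, 4, 4, 3, 3, 2]

p1 = pass_hat_k(successes, n=4, k=1)  # average single-run success rate
p4 = pass_hat_k(successes, n=4, k=4)  # share of tasks that pass all 4 runs
print(f"pass^1 = {p1:.2f}, pass^4 = {p4:.2f}, gap = {100 * (p1 - p4):.0f} points")
```

With these illustrative numbers a 90% single-run score drops to 70% once every run has to pass, which is why one green run says so little about a flaky pipeline.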
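The isolated-verifier idea can be sketched roughly like this. Everything here is hypothetical: `Handoff`, `Check`, `make_verifier` and `run_stage` are illustrative names, the producer is a stub standing in for an agent call, and the checks are plain deterministic functions where a real setup might use a separate model call. The point is only the shape: the verifier sees the task spec and the artifact, never the producer's conversation, and the producer never gets a reference to the rubric.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Handoff:
    """The only thing the verifier sees: no producer history or scratchpad."""
    task_spec: str
    artifact: str


@dataclass(frozen=True)
class Check:
    name: str
    passed: Callable[[Handoff], bool]


@dataclass(frozen=True)
class Verdict:
    accepted: bool
    failed: List[str]


def make_verifier(rubric: List[Check]) -> Callable[[Handoff], Verdict]:
    # The rubric lives only inside this closure; producers are never given it.
    def verify(handoff: Handoff) -> Verdict:
        failed = [c.name for c in rubric if not c.passed(handoff)]
        return Verdict(accepted=not failed, failed=failed)
    return verify


def run_stage(producer, verify, task_spec: str, max_attempts: int = 3):
    """Retry the producer until the verifier accepts, with a hard stop."""
    for attempt in range(1, max_attempts + 1):
        artifact = producer(task_spec)
        verdict = verify(Handoff(task_spec, artifact))
        if verdict.accepted:
            return artifact, attempt
        # Only pass/fail drives the retry; verdict.failed goes to logs,
        # not back to the producer, so the criteria stay hidden.
    raise RuntimeError(f"not accepted after {max_attempts} attempts")


def _load(handoff: Handoff) -> Optional[dict]:
    try:
        data = json.loads(handoff.artifact)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


# Illustrative rubric: catches format mismatches and a premature "done".
RUBRIC = [
    Check("is_json_object", lambda h: _load(h) is not None),
    Check("has_summary", lambda h: bool((_load(h) or {}).get("summary"))),
]

# Stub producer: fails twice, then returns a usable artifact.
_outputs = iter(["done!", '{"summary": ""}', '{"summary": "3 failing handoffs"}'])
artifact, attempts = run_stage(lambda spec: next(_outputs),
                               make_verifier(RUBRIC), "Summarise the trace")
print(attempts, artifact)
```

The `max_attempts` cap is there on purpose: a verifier loop without a termination condition just moves the "no stop button" problem to a new stage.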


Similar Articles

When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models

AI agents fail in ways nobody writes about. Here's what I've actually seen.

When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

Your agent isn't failing because of the model, it's failing because nobody built a stop button

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines
