In practice, our multi-agent failures were almost never the model - they were the handoffs. Does the MAST data match what you see?

Reddit r/AI_Agents Papers

Summary

An analysis of multi-agent LLM pipeline failures, citing the Berkeley MAST paper, which attributes most failures to coordination issues (specification, inter-agent misalignment) rather than model capability. The post suggests dedicated verifier agents as a fix.

I've been building and debugging multi-agent pipelines (orchestration + tool use) and kept hitting the same wall: a run fails, someone swaps the model or bumps the context window, it passes once, then fails somewhere else. We were treating a coordination problem as a capability problem.

The Berkeley MAST paper ("Why Do Multi-Agent LLM Systems Fail?", arXiv:2503.13657) lines up with this. They hand-annotated 1,600+ execution traces across 7 frameworks (6 annotators, Cohen's Kappa 0.88) and put failures into 3 buckets:

- Specification & design: 41.8% (bad decomposition, ambiguous roles, no termination condition)
- Inter-agent misalignment: 36.9% (context lost at handoffs, conflicting outputs, format mismatches)
- Verification: 21.3% (premature "done," incomplete or incorrect checks)

The kicker on the eval side: various 2026 audits report LLM-judges being wrong more than half the time, with position bias (favoring the first answer) and length bias. And pass^k variance is brutal - the same agent across 4 runs can score 15–25 points lower on pass^4 than pass^1. So a single green run tells you very little.

My take: most "add another agent" fixes make it worse because they add more seams, not fewer. The most underused fix I've found is a dedicated verifier agent with isolated context and scoring criteria the producing agents never see - basically treating verification as a separate stage, like an independent scanner in a security pipeline.
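To make the pass^1 vs pass^4 gap above concrete, here is a rough sketch of the arithmetic. It assumes the usual definition of pass^k (the chance that k independent runs of the same task all succeed) and estimates it per task as C(c, k) / C(n, k), where c of n recorded runs passed. The function names and the run data are made up for illustration; they are not from the MAST paper or any specific eval harness.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task: the chance that k runs drawn without
    replacement from n recorded runs (c of which passed) all pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb(c, k) is 0 when k > c, so any task with fewer than k passes scores 0.
    return comb(c, k) / comb(n, k)


def suite_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task pass^k estimate over a suite.
    `results` maps a task id to its per-run pass/fail outcomes."""
    scores = [pass_hat_k(len(runs), sum(runs), k) for runs in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes: 4 tasks, 4 runs each.
runs = {
    "task_a": [True, True, True, True],
    "task_b": [True, True, True, False],
    "task_c": [True, False, True, True],
    "task_d": [False, False, False, False],
}

p1 = suite_pass_hat_k(runs, 1)
p4 = suite_pass_hat_k(runs, 4)
print(f"pass^1 = {p1:.3f}, pass^4 = {p4:.3f}")
# prints "pass^1 = 0.625, pass^4 = 0.250"
```

In this toy data, two tasks that each pass 3 out of 4 runs look fine on pass^1 but contribute nothing to pass^4, which is the same kind of gap a single green run hides.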