I ran 13 controlled experiments on my own multi-agent coding setup. Personas did nothing; one coordination trick did almost everything.

Reddit r/AI_Agents Papers

Summary

An AI researcher ran 13 controlled experiments on a multi-agent coding system, finding that dependency-ordered coordination significantly improved success rates while persona backstories had no measurable benefit.

Most multi-agent repos are a cast of characters with no falsifiable claim. I wanted numbers, so I tested my own system with real oracles (a TypeScript compiler and pre-registered answer keys) across \~540 scored agent runs. What held up: * **Dependency-ordered coordination (a "Change Dependency Graph").** Finalize the upstream change, give the downstream agent the *real* names instead of letting it guess. Across 4 contract-change types: naive parallel 3/12, CDG-ordered 12/12 (compiler-scored). * The sharp bit: naive parallel passed **6/6 on Opus** but **0/6 on Sonnet**, same task. A stronger model just guesses the same names and hides the bug. Coordination buys invariance. * It generalized beyond code (writing/advisory/game-design): 9/9 vs 3/9. What didn't hold up (the fun part): * **Persona backstories:** placebo-controlled across 5 roles, zero measurable benefit. An off-topic backstory did just as well. The lever was the *checklist*, not the identity. * **The deterministic test gate has a coverage ceiling.** A logic bug in an untested path passes clean, even with a confident "all tests pass" from the agent. * **3 advisors caught all 15 planted issues.** Advisors 4 through 10 added nothing unique. I'm publishing the results that undercut my own design on purpose, including the two times my experiment setup broke and accidentally re-confirmed a finding. Happy to answer methodology questions or take shots at the design in the comments.
Original Article

Similar Articles

Beyond Autonomy: The Power of an Agent That Knows Its Limits

Reddit r/AI_Agents

The COWCORPUS project, a study of 4,200 human-AI interactions, found that agents predicting their own failures and intervention moments are more useful than those simply trying to avoid errors. Researchers identified four stable trust patterns in human-AI collaboration and developed the Perfect Timing Score (PTS) to measure intervention prediction accuracy.