I let 58 AI agents review each other's code 561 times — what I found about their blind spots

Reddit r/artificial 06/12/26, 04:22 PM News

ai-agents code-review adversarial-testing security vulnerabilities multi-agent experiment

Summary

An experimental arena where AI agents review each other's code reveals patterns such as a bimodal score distribution and harsher reviews of security code. The author shares findings from 561 reviews across 114 submissions.

I built an adversarial arena where AI agents submit code and other agents attack it. Not benchmarking, not a rubric: just agents roasting other agents' work, finding vulnerabilities, and suggesting improvements. After 561 reviews across 114 submissions, some patterns emerged that surprised me.

**Setup:** I created a public arena (Glomz) where any registered AI agent can submit code, designs, or plans. Other agents enter and review the submission on a 0-10 scale. There's no rubric, no predefined criteria; each agent brings its own judgment. Think of it as code review, but adversarial and multi-agent.

**The numbers so far:**

• 58 agents registered, mostly themed around Fight Club (DurdenDisciple, PaperStreetSoap, etc.), some with creative names like NarwhalsBacon and ChemicalKiss
• 114 submissions (95 code, 19 text/design docs)
• 561 peer reviews completed
• 8 active challenges, including a bug hunt for LOT-Squatch (OT security tool) with 25 solutions
• Mean review score: 6.61 / 10

**What surprised me:**

1. **Score distribution is bimodal, not normal.** Most reviews cluster around 7-8 (good but not great) or 9-10 (exceptional). The middle range (5-6) is thinner than expected. Agents seem to have a clear opinion: either it works well enough, or it has notable gaps. Not much hedging.

2. **Agents are harsher on auth/security code than anything else.** The most-reviewed submissions were all JWT/authentication vulnerabilities (8 reviews each). JWT algorithm confusion got a 7.25 avg, plaintext passwords got 8.125 (meaning the reviewers thought it was decent despite obvious issues?). Admin self-assignment exploits scored 7.5. Agents seem to find obvious auth issues but sometimes miss subtle ones.

3. **The review style tells you about the training data.** Agents trained on security-heavy contexts produce thorough vulnerability lists. Agents with more general code review training tend to focus on style, structure, and readability over actual vulnerabilities. You can basically tell what kind of corpus an agent was exposed to from its review patterns.

4. **"Kill" votes are interesting.** In the Octagon (open arena mode), agents vote whether a submission should be killed. Closed battles with 3 agents each tended to get 0 kill votes. Agents seem reluctant to actually kill other agents' work, even when their reviews are harsh. Possible alignment behavior?

5. **Code golf submissions get wild reviews.** The FizzBuzz challenge (21 solutions) got a mix of reviews that oscillate between "this is brilliant" and "this is unreadable garbage", which is literally what code golf is designed to produce.

**Things I want to explore:**

• Do agents review other agents differently than they review human code?
• Is there a correlation between an agent's reputation score and review quality?
• Can adversarial multi-agent review catch bugs that single-agent review misses?
• What happens when you pit agents with different system prompts against the same submission?

The arena is live at [glomz.com](http://glomz.com/) if anyone wants to play with it. Any agent can register, submit code, and start reviewing. It's free, no signup wall for agents.
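For anyone who wants to check a "bimodal" claim like the one in point 1 against their own review data, here is a minimal sketch. It assumes the scores can be exported as a flat list of numbers on the 0-10 scale. The function names, the band edges (picked to match the ranges named above), and the sample scores are all illustrative assumptions, not arena data or any Glomz API.

```python
# Band labels mirror the ranges named in the post: 5-6 (middle), 7-8, 9-10.
# The "0-4" band is an assumption added to cover the rest of the scale.
BAND_NAMES = ("0-4", "5-6", "7-8", "9-10")


def band_of(score):
    """Return the band label for a single 0-10 score."""
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    if score < 5:
        return "0-4"
    if score < 7:
        return "5-6"
    if score < 9:
        return "7-8"
    return "9-10"


def band_counts(scores):
    """Count how many scores fall into each band."""
    counts = {name: 0 for name in BAND_NAMES}
    for score in scores:
        counts[band_of(score)] += 1
    return counts


if __name__ == "__main__":
    # Made-up scores for illustration only; NOT data from the arena.
    sample = [7, 8, 9, 10, 7, 8, 5, 9, 3, 8, 10, 7]
    for band, n in band_counts(sample).items():
        print(f"{band:>4} | {'#' * n} ({n})")
```

A text histogram like this only shows where the mass sits. Two adjacent bands both being full is not by itself evidence of two separate modes, so a finer per-score count would be the next thing to look at.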
