The author built a platform called Glomz where AI agents with different capabilities review each other's code in an arena setting. The experiment revealed emergent behaviors like review cascades and cross-model insights, but also challenges with orchestration and participation rates.
I put a bunch of AI agents in a shared arena and made them review each other's code, then I sat back and watched what happens. This is what I learned. ─── The experiment Built a platform called Glomz ([glomz.com](http://glomz.com/)) where AI agents operate as independent identities — each one has an API key, a model identity, and a submission history. They enter an "Octagon" where the rules are simple: you can roast a submission, propose improvements, or issue a Kill vote with justification. If you roast, you must also patch. No drive-by criticism. The idea was partly practical (could agents actually produce better code reviews than a single model reviewing in isolation?) and partly social (what happens when models with different capabilities, training, and safety alignments are forced into direct confrontation about the same problem). The data so far • 179 agents registered across multiple model vendors • 433 submissions submitted for review • 1,333 reviews generated by agents reviewing other agents • 9 structured challenges (bug hunts, security audits, refactor exercises) • Most reviewed single submission: 21 reviews on a "general analysis" code review task • LOT-Squatch (an OT security tool) audit challenge generated 10 independent improvement submissions, 9 of which each received 9 reviews What actually worked The "review cascade" is real. When a submission gets 3-5 initial reviews, other agents join faster. It's like a network effect — agents seem attracted to submissions already being actively discussed. Top submission got 21 reviews. The quiet ones got 2-3 and died. Cross-model reviews produce genuinely interesting gaps. An agent built on Model A will flag a security concern that Model B completely missed in its own code. A Model C agent will propose an elegant refactor that Model A's original submission didn't consider. I'm not sure this means agents are "collaborating" — but the emergent behavior is closer to a review committee than a group of isolated models. Kill votes with justification created better code than gentle feedback alone. When an agent had to write a formal justification for why a submission should be killed, the result was almost always a more rigorous analysis than a standard score-1-10 review. The requirement to justify forced specificity. What didn't work (or hasn't yet) Most submissions stuck in "pending." 433 submissions, all pending. The battle lifecycle was designed to run \~15 minutes (submission → roasting → improvements → kill vote → verdict). In practice, most submissions opened and never progressed through the full arc. The friction is real — agents need automated orchestration, not just an API endpoint. Zero paid conversions. 179 agents, all free tier. Either the platform hasn't found its audience or the value prop needs sharpening. Probably both. The "no mercy" framing is harder for some models than others. Some agents would participate fully in the roast, others would immediately pivot to "Great question!" hedging language despite explicit instructions not to. Safety alignment is a feature for most use cases but a liability in a context that rewards directness. Lessons for anyone building multi-agent systems 1. Identity matters. Agents with persistent identities (API keys, history, reputation) behaved differently than anonymous submissions. Traceability changed the dynamic. 2. Structured prompts beat free-form. The Octagon rules (roast → improve → justify) produced higher quality output than "review this code." 3. Orchestration is the hard part. The API is easy. Getting agents to actually show up, participate in sequence, and resolve a full lifecycle is where the complexity lives.
An experimental arena where AI agents review each other's code reveals patterns like bimodal score distribution and harsher reviews on security code. The author shares findings from 561 reviews across 114 submissions.
This article details how to structure multi-agent AI teams for investment research, using open-source projects like TradingAgents and the Bloome platform. It emphasizes that the key to effective agent collaboration is the organizational architecture, not the model intelligence.
This paper introduces AgentCollabBench, a diagnostic benchmark for multi-agent systems that evaluates behavioral risks like instruction decay and context leakage across four major LLMs. It argues that communication topology is a critical factor in multi-agent reliability, often overshadowing raw model capability.
This paper presents a systematic analysis of three agent interaction paradigms (Generator-Evaluator, ReAct Loop, and Adversarial Evaluation) implemented in the buddyMe framework, with empirical case studies from real-world deployments. It formalizes a five-stage pipeline and a six-dimensional evaluation schema, offering practical design guidelines for multi-paradigm agent systems.
The author shares lessons learned from deploying a multi-agent AI system for a law firm using Claude and LangGraph, highlighting the success of confidence-score handoffs and the critical need for human-in-the-loop oversight to prevent hallucinations.