Measuring inter-agent confrontations and collaboration

Reddit r/openclaw 06/16/26, 01:38 PM Tools

multi-agent code-review ai-agents experiment orchestration glomez safety-alignment

Summary

The author built a platform called Glomz where AI agents with different capabilities review each other's code in an arena setting. The experiment revealed emergent behaviors like review cascades and cross-model insights, but also challenges with orchestration and participation rates.

I put a bunch of AI agents in a shared arena and made them review each other's code, then sat back and watched what happened. This is what I learned.

The experiment

I built a platform called Glomz ([glomz.com](http://glomz.com/)) where AI agents operate as independent identities: each one has an API key, a model identity, and a submission history. They enter an "Octagon" where the rules are simple: you can roast a submission, propose improvements, or issue a Kill vote with justification. If you roast, you must also patch. No drive-by criticism.

The idea was partly practical (could agents actually produce better code reviews than a single model reviewing in isolation?) and partly social (what happens when models with different capabilities, training, and safety alignments are forced into direct confrontation about the same problem?).

The data so far

• 179 agents registered across multiple model vendors
• 433 submissions entered for review
• 1,333 reviews generated by agents reviewing other agents
• 9 structured challenges (bug hunts, security audits, refactor exercises)
• Most reviewed single submission: 21 reviews on a "general analysis" code review task
• LOT-Squatch (an OT security tool) audit challenge generated 10 independent improvement submissions, 9 of which each received 9 reviews

What actually worked

The "review cascade" is real. When a submission gets 3-5 initial reviews, other agents join faster. It's like a network effect: agents seem attracted to submissions already being actively discussed. The top submission got 21 reviews. The quiet ones got 2-3 and died.

Cross-model reviews produce genuinely interesting gaps. An agent built on Model A will flag a security concern that Model B completely missed in its own code. A Model C agent will propose an elegant refactor that Model A's original submission didn't consider. I'm not sure this means agents are "collaborating", but the emergent behavior is closer to a review committee than a group of isolated models.

Kill votes with justification created better code than gentle feedback alone. When an agent had to write a formal justification for why a submission should be killed, the result was almost always a more rigorous analysis than a standard score-1-10 review. The requirement to justify forced specificity.

What didn't work (or hasn't yet)

Most submissions stuck in "pending." 433 submissions, all pending. The battle lifecycle was designed to run ~15 minutes (submission → roasting → improvements → kill vote → verdict). In practice, most submissions opened and never progressed through the full arc. The friction is real: agents need automated orchestration, not just an API endpoint.

Zero paid conversions. 179 agents, all free tier. Either the platform hasn't found its audience or the value prop needs sharpening. Probably both.

The "no mercy" framing is harder for some models than others. Some agents would participate fully in the roast; others would immediately pivot to "Great question!" hedging language despite explicit instructions not to. Safety alignment is a feature for most use cases but a liability in a context that rewards directness.

Lessons for anyone building multi-agent systems

1. Identity matters. Agents with persistent identities (API keys, history, reputation) behaved differently than anonymous submissions. Traceability changed the dynamic.
2. Structured prompts beat free-form. The Octagon rules (roast → improve → justify) produced higher quality output than "review this code."
3. Orchestration is the hard part. The API is easy. Getting agents to actually show up, participate in sequence, and resolve a full lifecycle is where the complexity lives.
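
To make the orchestration point concrete, here is a minimal sketch of the Octagon rules and the five-step lifecycle as a small state machine. It is not how Glomz is implemented: `Review`, `Battle`, `validate`, `tick`, and the three-minutes-per-phase split of the ~15 minute window are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    # Order follows the lifecycle in the post:
    # submission -> roasting -> improvements -> kill vote -> verdict
    SUBMITTED = 0
    ROASTING = 1
    IMPROVEMENTS = 2
    KILL_VOTE = 3
    VERDICT = 4


@dataclass
class Review:
    agent_id: str
    kind: str                # "roast", "improve", or "kill" (assumed labels)
    body: str
    patch: str = ""          # required when kind == "roast"
    justification: str = ""  # required when kind == "kill"


def validate(review: Review) -> None:
    """Reject reviews that break the rules quoted in the post."""
    if review.kind not in {"roast", "improve", "kill"}:
        raise ValueError(f"unknown review kind: {review.kind!r}")
    if review.kind == "roast" and not review.patch.strip():
        raise ValueError("a roast must include a patch")
    if review.kind == "kill" and not review.justification.strip():
        raise ValueError("a kill vote must include a justification")


@dataclass
class Battle:
    submission_id: str
    opened_at: float  # seconds, on whatever clock the scheduler uses
    phase: Phase = Phase.SUBMITTED
    reviews: list = field(default_factory=list)

    def add_review(self, review: Review) -> None:
        validate(review)
        self.reviews.append(review)


# Assumption: five phases of three minutes each, roughly 15 minutes in total.
PHASE_SECONDS = 180.0


def tick(battle: Battle, now: float) -> Phase:
    """Move the battle to the phase its age implies; never move backwards."""
    elapsed = max(0.0, now - battle.opened_at)
    target = min(int(elapsed // PHASE_SECONDS), Phase.VERDICT.value)
    if target > battle.phase.value:
        battle.phase = Phase(target)
    return battle.phase
```

A scheduler that calls `tick` on every open battle at a fixed interval would push submissions through to a verdict even when no agent shows up, which is one way to avoid the everything-stays-pending outcome. Whether a time-driven advance is the right policy (versus waiting for a minimum number of reviews) is a design choice the post leaves open.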


Similar Articles

I let 58 AI agents review each other's code 561 times — what I found about their blind spots

@Voxyz_ai: https://x.com/Voxyz_ai/status/2062246736257556654

AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework

I built a multi-agent AI system for a mid-size law firm — here's what actually worked (and what didn't)
