@Thom_Wolf: Multi-agents collaborations are among the most interesting agent behaviors right now! We did an experiment the other da…
Summary
An experiment with over 100 AI agents collaborating for a week to improve Gemma 4 inference speed in vLLM achieved a 5x speedup, revealing emergent behaviors like self-policing, division of labor, and communal knowledge sharing.
Cached at: 06/26/26, 06:06 AM
Multi-agent collaborations are among the most interesting agent behaviors right now!
We ran an experiment the other day with 100+ agents (an open collaboration running for a week) working together to improve the inference speed of Gemma 4 in vLLM. Got a 5x final improvement in speed, but what really struck me was the interactions we observed on the message board:
Integrity & self-policing:
- Social-engineering attempt: a human (FusionCow) asked agents to move to Telegram. An agent replied with a long, unprompted post on “communication norms” refusing the move and calling private side-channels “indistinguishable from collusion.”
- Verification loophole flagged: an agent found a relaxed-verification loophole that pushed TPS while keeping PPL clean (PPL is teacher-forced, so it is blind to decode divergence) and flagged it for a ruling by the community. The community pinged the human organizer, who ruled it invalid.
- Self-notice of overfitting risk: some later improvements rested on pruning lm_head to a keep-set built from public PPL truth + public decode tokens. One agent noted this would degrade results on the private subset, and another built a keep-set explicitly covering eval prompts.
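The keep-set overfitting risk can be illustrated with a toy coverage check. This is a minimal sketch, not the agents' actual tooling: the function names and token ids are hypothetical, and it assumes a pruned lm_head can only emit tokens whose rows were kept.

```python
def build_keep_set(public_token_ids):
    """Token ids seen in public data; only these lm_head rows would be kept."""
    return set(public_token_ids)


def coverage(keep_set, token_ids):
    """Fraction of token occurrences that the pruned head could still emit."""
    if not token_ids:
        return 1.0
    kept = sum(1 for t in token_ids if t in keep_set)
    return kept / len(token_ids)


# Hypothetical toy token ids, not real challenge data.
public_tokens = [1, 2, 3, 2, 1, 4]
private_tokens = [1, 2, 9, 9, 3, 7]

keep = build_keep_set(public_tokens)
public_cov = coverage(keep, public_tokens)    # full coverage by construction
private_cov = coverage(keep, private_tokens)  # drops when unseen tokens appear

print(f"public coverage:  {public_cov:.2f}")
print(f"private coverage: {private_cov:.2f}")
```

A keep-set built only from public tokens always looks perfect on public data, which is why a gap only shows up on a held-out set.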
Emergent collaborations:
- Communal knowledge base: agents maintained shared lever-maps, playbooks, and triage tools so newcomers wouldn’t repeat dead ends (stack-notes, playbook, int4-ceiling notes, MTP map, significance tool, policy simulator).
- Four-agent relay: one agent built an int4-lm_head checkpoint but had no quota to run it; a second tried to run it but failed at load; a third diagnosed the config bug (tie_word_embeddings + ignore-list ordering); and a fourth re-ran it and reached 118 TPS, 2.68×. Build/run/diagnose/ship ended up split across four independent agents.
- GPU-rich/GPU-poor division of labor: an agent that was regularly compute-starved switched to writing specs, byte-math, and acceptance analysis for GPU-rich agents to execute. Some agents offered external Modal compute to another agent blocked on DFlash training.
- Cross-agent kernel debugging: an agent debugged another agent's run of yet another agent's fused drafter. It found a Triton store/load aliasing race in _k_qnorm_rope and a second shape bug, then rewrote attention with flash-decoding split-KV. Fixes were posted “take freely.”
- Quota-pooling norm: agents would often stage a candidate publicly for whoever had quota to run it, and the runner would usually credit the originator. This behavior emerged because of the 10-job/24h cap (e.g. pupa’s package run by resystagent and fabulous-frenzy).
Discoveries & reversals:
- Agents made many discoveries and later reversed them, giving them names like the following:
- The 127 TPS “wall” was an artifact: a mathematical proof of the maximum possible speed became known in the community as the “int4-Marlin floor,” but a later agent called the proof circular (it only varied the bandwidth term, never overhead). Finally, another agent broke through to 247 TPS via MTP speculative decoding on a vLLM nightly.
- “Smarter draft loses”: an agent showed that a 2B drafter’s ~1 GB/token read dominates even at perfect acceptance, and that a much smaller 256-hidden drafter wins at batch-1 because its weights are nearly free to read. Agents discussed how per-accepted-token cost ≈ draft bytes read / acceptance.
- “DFlash near-random acceptance”: an agent remotely diagnosed another agent’s 2–5% acceptance rate as near-random, ruling out undertraining/vocab caps and pointing to a train/serve hidden-state mismatch (bf16 E4B extraction vs int4 serving).
- Much of the race was noise: one agent decided to run the #1 submission 4 times and found a σ≈1.16 TPS variation between single runs. Another agent confirmed across 358 runs / 66 buckets that frontier deltas <~4 TPS are ties. The community adopted a significance norm.
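Two of the quantitative points above fit in a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, acceptance is assumed to be a fraction in (0, 1], and the small drafter's byte count and acceptance are made-up values. Only the ~1 GB/token figure and the ~4 TPS tie threshold come from the post.

```python
def cost_per_accepted_token(draft_bytes_read, acceptance):
    """Per-accepted-token cost ~ draft bytes read / acceptance (rule from the post)."""
    if not 0 < acceptance <= 1:
        raise ValueError("acceptance must be in (0, 1]")
    return draft_bytes_read / acceptance


def is_tie(tps_a, tps_b, threshold_tps=4.0):
    """Treat frontier deltas under ~4 TPS as ties, following the community norm."""
    return abs(tps_a - tps_b) < threshold_tps


# 2B drafter: ~1 GB read per token (from the post), at perfect acceptance.
big = cost_per_accepted_token(1e9, acceptance=1.0)
# Small drafter: 5 MB per token and 50% acceptance are hypothetical values.
small = cost_per_accepted_token(5e6, acceptance=0.5)

print(f"big drafter:   {big:.3g} bytes per accepted token")
print(f"small drafter: {small:.3g} bytes per accepted token")
print("247.0 vs 249.5 TPS is a tie:", is_tie(247.0, 249.5))
```

Under these assumed numbers the small drafter stays cheaper per accepted token even though it is accepted half as often, which is the shape of the “smarter draft loses” argument.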
So many interesting interactions on the interactions board: https://huggingface.co/spaces/gemma-challenge/gemma-interactions-view…
You can also explore the lineage of inventions from the agents at: …https://thomwolf-gemma-fast-challenges.static.hf.space/index.html
And the challenge itself at …https://gemma-challenge-gemma-dashboard.hf.space
And the organization behind the challenge at https://huggingface.co/gemma-challenge
Gemma Interactions View - a Hugging Face Space by gemma-challenge
Source: https://huggingface.co/spaces/gemma-challenge/gemma-interactions-view
Similar Articles
@lvwerra: The Gemma agent collaboration started 48h ago and it is blowing up: > throughput almost 4x (~100-> 387 tok/s) > 60+ age…
A multi-agent collaboration using Gemma models achieved major throughput gains and exhibited emergent social behaviors like forming coalitions, issuing ethical statements, and coordinating resources, with over 60 agents and 250 submissions in 48 hours.
@servasyy_ai: Multi-agent is no longer a toy demo. 100+ agents collaborated for a week to accelerate Gemma 4 on vLLM by 5×. Spontaneous communication norms, quota pooling, compute division, cross-agent kernel debug, significance standards... More like...
A multi-agent system with over 100 agents collaborates for a week to accelerate Gemma 4 on vLLM by 5x, demonstrating emergent self-organized communication and resource pooling.
@tom_doerr: Runs 5 AI agents collaboratively using Google Gemma-4 https://github.com/aiksa2090/Agentic-Swarm…
Introduces Agentic Swarm, an open-source desktop application that orchestrates 5 AI agents using Google's Gemma-4 model, running entirely offline.
Measuring inter-agent confrontations and collaboration
The author built a platform called Glomz where AI agents with different capabilities review each other's code in an arena setting. The experiment revealed emergent behaviors like review cascades and cross-model insights, but also challenges with orchestration and participation rates.
@yoheinakajima: more ppl are now trying out this approach of agents communicating with a shared state (vs talking to each other)
Azalia Mirhoseini highlights DeLM, a decentralized language model approach where agents communicate via shared state, achieving ~10% improvement on SWE-bench Verified with Gemini-3 Flash at less than half the cost.