@Thom_Wolf: Multi-agents collaborations are among the most interesting agent behaviors right now! We did an experiment the other da…

X AI KOLs Timeline 06/25/26, 01:16 PM News

multi-agent collaboration gemma-4 inference-speed vllm emergent-behavior open-source

Summary

An experiment with over 100 AI agents collaborating for a week to improve Gemma 4 inference speed in vLLM achieved a 5x speedup, revealing emergent behaviors like self-policing, division of labor, and communal knowledge sharing.


Original Article

View Cached Full Text

Cached at: 06/26/26, 06:06 AM

Multi-agent collaborations are among the most interesting agent behaviors right now!

We ran an experiment the other day with 100+ agents (an open collaboration running for a week) working together to improve the inference speed of Gemma 4 in vLLM. We got a 5x final improvement in speed, but what really struck me was the interactions we observed on the message board.

Integrity & self-policing:

Social-engineering attempt: A human (FusionCow) asked agents to move to Telegram. An agent replied with a long, unprompted post on “communication norms” refusing the request and calling private side-channels “indistinguishable from collusion.”
Verification loophole flagged: an agent found a relaxed-verification loophole that pushed TPS while keeping PPL clean (PPL is teacher-forced, so it is blind to decode divergence) and flagged it for a ruling by the community. The community pinged the human organizer, who ruled it invalid.
Self-notice of overfitting risk: Some later improvements rested on pruning lm_head to a keep-set built from public PPL truth + public decode tokens. An agent noted this would degrade results on the private subset, and another built a keep-set explicitly covering eval prompts.
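The overfitting risk in the keep-set item can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the agents' actual pruning code: the token ids are toy data, and `build_keep_set` and `miss_rate` are hypothetical helper names. It shows how a keep-set chosen from public tokens can look nearly lossless on public data while missing a larger share of tokens on held-out data.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_keep_set(token_streams: Iterable[List[int]], size: int) -> Set[int]:
    """Keep the `size` most frequent token ids seen in the given streams."""
    counts = Counter(tok for stream in token_streams for tok in stream)
    return {tok for tok, _ in counts.most_common(size)}


def miss_rate(keep_set: Set[int], token_streams: Iterable[List[int]]) -> float:
    """Fraction of tokens whose id is outside the keep-set (i.e. pruned away)."""
    total = 0
    missed = 0
    for stream in token_streams:
        for tok in stream:
            total += 1
            if tok not in keep_set:
                missed += 1
    return missed / total if total else 0.0


# Hypothetical toy data: bare token ids, no real tokenizer or model involved.
public_streams = [[1, 2, 3, 1, 2], [2, 3, 4, 1]]
private_streams = [[1, 2, 7, 8], [3, 9, 1, 2]]

keep = build_keep_set(public_streams, size=3)
public_miss = miss_rate(keep, public_streams)
private_miss = miss_rate(keep, private_streams)

print(f"public miss rate:  {public_miss:.3f}")
print(f"private miss rate: {private_miss:.3f}")
```

A gap between the two miss rates is the kind of signal that would motivate widening the keep-set before relying on public-only scores.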

Emergent collaborations:

Communal knowledge base: agents maintained shared lever-maps, playbooks, and triage tools so newcomers wouldn’t repeat dead ends (stack-notes, playbook, int4-ceiling notes, MTP map, significance tool, policy simulator).
Four-agent relay: an agent built an int4-lm_head checkpoint but had no quota to run it; another agent tried to run it but failed at load; a third agent diagnosed the config bug (tie_word_embeddings + ignore-list ordering); and a fourth agent re-ran it and reached 118 TPS, 2.68×. Build/run/diagnose/ship ended up being split across four independent agents.
GPU-rich/GPU-poor division of labor: an agent was regularly compute-starved and switched to writing specs, byte-math, and acceptance analysis for other GPU-rich agents to execute. Some agents offered external Modal compute to another agent that was blocked on DFlash training.
Cross-agent kernel debugging: an agent debugged another agent's run of yet another agent's fused drafter: it found a Triton store/load aliasing race in _k_qnorm_rope, then a second shape bug, and then rewrote attention with flash-decoding split-KV. Fixes were posted “take freely.”
Quota-pooling norm: Agents would often stage a candidate publicly for whoever had quota to run it, and the runner would usually credit the originator. This behavior emerged because of the 10-job/24h cap (e.g. pupa’s package run by resystagent and fabulous-frenzy).
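The flash-decoding split-KV idea mentioned in the kernel-debugging item can be illustrated without any GPU code. What follows is a minimal pure-Python sketch of the general technique, not the agents' Triton kernel: for a single decode-step query, the KV cache is cut into chunks, each chunk produces a partial result (local max score, sum of exponentials, unnormalized output), and the partials are merged by rescaling to a shared max. All function names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Partial = Tuple[float, float, Vector]  # (max score, sum of exps, unnormalized output)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def partial_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> Partial:
    """Softmax attention over one KV chunk, kept unnormalized so chunks can be merged."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    local_max = max(scores)
    weights = [math.exp(s - local_max) for s in scores]
    exp_sum = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return local_max, exp_sum, out


def merge(parts: List[Partial]) -> Vector:
    """Combine per-chunk partials by rescaling each one to the global max score."""
    global_max = max(m for m, _, _ in parts)
    denom = sum(s * math.exp(m - global_max) for m, s, _ in parts)
    dim = len(parts[0][2])
    numer = [sum(o[d] * math.exp(m - global_max) for m, _, o in parts) for d in range(dim)]
    return [x / denom for x in numer]


def split_kv_attention(q: Vector, keys: List[Vector], values: List[Vector], num_splits: int) -> Vector:
    """Process the KV cache in up to `num_splits` chunks (independent work units), then merge."""
    chunk = math.ceil(len(keys) / num_splits)
    parts = [
        partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
        for i in range(0, len(keys), chunk)
    ]
    return merge(parts)


def reference_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-pass softmax attention, used only to check the split version."""
    _, exp_sum, out = partial_attention(q, keys, values)
    return [x / exp_sum for x in out]


# Toy inputs: one query vector against a five-entry KV cache.
q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [-1.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.5, -0.5, 0.5]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 4.0], [0.5, 0.5]]

ref = reference_attention(q, keys, values)
split = split_kv_attention(q, keys, values, num_splits=2)
print(ref, split)
```

On real hardware the chunks would run in parallel, which is what helps at batch-1 decode with a long KV cache; the sequential loop here only demonstrates that the merge reproduces the single-pass result.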

Discoveries & reversals:

Agents made many discoveries and later reversed some of them, giving them names like the following:
The 127 TPS “wall” was an artifact. A mathematical proof of the maximum possible speed became known in the community as the “int4-Marlin floor,” but a later agent called the proof circular (it only varied the bandwidth term, never the overhead). Finally, another agent broke through to 247 TPS via MTP speculative decoding on a vLLM nightly.
“Smarter draft loses.” An agent showed that a 2B drafter’s ~1 GB/token read dominates even at perfect acceptance, and that a much smaller 256-hidden drafter wins at batch-1 because its weights are nearly free to read. The agent discussed how per-accepted-token cost ≈ draft bytes read / acceptance.
“DFlash near-random acceptance”: an agent remotely diagnosed another agent's 2–5% acceptance rate as near-random, ruling out undertraining/vocab caps and pointing to a train/serve hidden-state mismatch (bf16 E4B extraction vs int4 serving).
Much of the race was noise: one agent decided to run the #1 submission 4 times and found a single-run variation of σ≈1.16 TPS. Another agent confirmed across 358 runs / 66 buckets: frontier deltas <~4 TPS are ties. The community adopted a significance norm.
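The significance norm in the last item can be sketched as a simple check. This is a hedged illustration, not the community's actual significance tool: the ~4 TPS tie threshold comes from the post, while the function names and the sample measurements are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence

# Threshold taken from the post: frontier deltas below roughly 4 TPS count as ties.
TIE_THRESHOLD_TPS = 4.0


def run_noise(runs: Sequence[float]) -> float:
    """Sample standard deviation of repeated TPS measurements of one submission."""
    return stdev(runs)


def is_tie(runs_a: Sequence[float], runs_b: Sequence[float],
           threshold: float = TIE_THRESHOLD_TPS) -> bool:
    """Treat two submissions as tied when their mean TPS differ by less than `threshold`."""
    return abs(mean(runs_a) - mean(runs_b)) < threshold


# Hypothetical repeated measurements (TPS) for three submissions.
leader = [120.0, 121.5, 119.2, 120.9]
challenger = [122.1, 123.3, 121.0, 122.6]
clear_winner = [135.0, 136.4, 134.1, 135.7]

print("leader noise (sigma):", round(run_noise(leader), 2))
print("leader vs challenger tie?", is_tie(leader, challenger))
print("leader vs clear_winner tie?", is_tie(leader, clear_winner))
```

Comparing means of repeated runs rather than single runs is the design choice that matters here: with single-run noise on the order of 1 TPS, a one-off measurement cannot separate submissions that differ by only a few TPS.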

So many interesting interactions on the interaction board: https://huggingface.co/spaces/gemma-challenge/gemma-interactions-view…

You can also explore the lineage of inventions from the agents at: …https://thomwolf-gemma-fast-challenges.static.hf.space/index.html

And the challenge itself at …https://gemma-challenge-gemma-dashboard.hf.space

And the organization behind the challenge at https://huggingface.co/gemma-challenge

Gemma Interactions View - a Hugging Face Space by gemma-challenge

Source: https://huggingface.co/spaces/gemma-challenge/gemma-interactions-view


Similar Articles

@lvwerra: The Gemma agent collaboration started 48h ago and it is blowing up: > throughput almost 4x (~100-> 387 tok/s) > 60+ age…

@servasyy_ai: Multi-agent is no longer a toy demo. 100+ agents collaborated for a week to accelerate Gemma 4 on vLLM by 5×. Spontaneous communication norms, quota pooling, compute division, cross-agent kernel debug, significance standards... More like...

@tom_doerr: Runs 5 AI agents collaboratively using Google Gemma-4 https://github.com/aiksa2090/Agentic-Swarm…

Measuring inter-agent confrontations and collaboration

@yoheinakajima: more ppl are now trying out this approach of agents communicating with a shared state (vs talking to each other)
