
# TeamBench: Evaluating Agent Coordination under Enforced Role Separation
Source: [https://arxiv.org/html/2605.07073](https://arxiv.org/html/2605.07073)
Yubin Kim¹,², Chanwoo Park¹, Taehan Kim⁴, Eugene Park¹, Samuel Schmidgall³, Salman Rahman², Chunjong Park³, Cynthia Breazeal¹, Xin Liu², Hamid Palangi², Hae Won Park¹, Daniel McDuff²

¹MIT ²Google Research ³Google DeepMind ⁴Independent Researcher

[GitHub](https://github.com/ybkim95/TeamBench) · [Dataset](https://huggingface.co/datasets/ybkim95/teambench) · [Website](https://teambench.github.io/)

###### Abstract

Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role's work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coordination under operating-system-enforced role separation. TeamBench separates specification access, workspace editing, and final certification across Planner, Executor, and Verifier roles, so that no role can read the full requirements, modify the workspace, and certify the final answer. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the Verifier attempts to edit the Executor's code. Verifiers approve 49% of submissions that fail the deterministic grader, and removing the Verifier improves mean partial score in the ablation. Team value is also conditional. Teams benefit when single agents struggle, but hurt when single agents already perform well. A 40-session human study under the same role separation shows that our benchmark exposes interaction patterns that pass rate misses. Solo participants work through the task directly, human participants paired with agents often collapse into quick approval, and human teams spend more effort coordinating missing information across roles. Our dataset, code, and implementation details can be found at [https://teambench.github.io](https://teambench.github.io/).

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.07073v1/imgs/teambench_teaser_original_new.png)

Figure 1: TeamBench evaluates agents under enforced role separation. The Planner reads the full requirements, the Executor edits the workspace, and the Verifier issues the final verdict. Sandboxes restrict which files each role can read or write. The example shows a role failure, where the Verifier accepts an implementation that violates a Planner constraint.

Agent-based systems built on large language models (LLMs) often split tasks across different roles and measure gains over a single agent [8, 26, 20, 14, 30, 11]. Yet it is often unclear whether those gains come from coordination among roles or from one model effectively carrying several roles within the same run. A team that adds a verification step needs to know whether the Verifier is an independent quality gate or simply another editing pass. In many benchmarks, different roles are assigned through prompts to the same backbone model, and the harness does not prevent that model from planning, editing, and certifying the same solution. Recent works raise related concerns about role collapse, benchmark validity, and multi-agent failure modes [1, 2, 29, 15]. SWE-Bench [10] and Terminal-Bench [17] ask whether an agent can solve realistic tasks. TeamBench asks when a Planner-Executor-Verifier decomposition produces useful coordination, and when it only adds overhead without genuine coordination.

We study this question by enforcing role separation with operating-system permissions. Each role runs in a separate container with access only to the files and tools needed for that role. The Planner reads the full requirements but cannot modify the workspace. The Executor can edit and test the workspace, but only receives a summarized brief rather than the full specification. The Verifier reviews the requirements and the Executor's evidence before approving the final submission. No role can simultaneously read the full requirements, modify the workspace, and certify the final answer. This design makes coordination necessary because information must move across roles for the team to complete the task.

TeamBench contains 851 task templates that expand to 931 seeded evaluation instances. The tasks span 19 base categories, with 21 refined categories for the leaderboard. Each task has a generator that produces a deterministic workspace from a seed, allowing instances to refresh without retiring the benchmark. We evaluate models on a stratified 90-task subset and also report TeamBench-Verified. Beyond the leaderboard, the evaluation suite includes Planner and Verifier ablations, a 27-configuration cross-provider role-mixing ablation, a prompt-only versus enforced-role comparison, and a human study with the same role boundaries and deterministic graders.

The experiments yield several findings: (i) In the role-mixing ablation, Verifiers approve 49% of submissions that fail the deterministic grader, and removing the Verifier improves partial score in the reference ablation. (ii) Teams help most in the lowest Solo-score quintile, but hurt on easier tasks where Solo already performs well. (iii) Prompt-only and sandbox-enforced roles reach statistically indistinguishable pass rates, while prompt-only runs produce 3.6 times more cases where the Verifier rewrites the Executor's code. (iv) In the human study, Solo participants work through the task directly, Hybrid sessions (a participant paired with agents) often collapse into quick approval, and human teams spend more effort coordinating missing information across roles. These results indicate that the same pass rate can hide different role behavior, verification behavior, and coordination costs.

Our primary contributions are:

1. **Role-separated coordination benchmark.** We introduce TeamBench, which uses operating-system permissions to separate specification access, workspace editing, and final certification.
2. **Matched ablations for role value.** We run matched ablations under Solo, Restricted, Full Team, No-Plan, and No-Evaluate conditions, isolating when planning, verification, and missing specification access affect performance.
3. **Verifier failure as a bottleneck.** Verifiers approve 49% of grader-failing submissions, and removing the Verifier improves mean partial score in the reference ablation.
4. **Coordination behavior beyond pass rate.** Team value depends on Solo capability, prompt-only roles hide more Verifier code-edit attempts, and our human study suggests distinct Solo, Hybrid, and Team interaction patterns under the same role boundaries.

## 2 TeamBench

TeamBench isolates three roles that prompt-only evaluations blend. It measures the marginal contribution of each role, the effect of assigning different providers to different roles, and how team value changes with task difficulty. To measure these effects, the harness, rather than the prompt, enforces the role boundaries. Table 15 compares TeamBench with existing agent benchmarks.

### 2.1 Role Separation

Each role runs in a separate container with only the files and tools it needs. The Planner receives the full requirements but cannot edit or execute code. The Executor modifies the solution workspace and runs commands but never sees the full requirements, only a brief. The Verifier reads the full requirements and read-only evidence from the Executor, then issues the final pass-or-fail verdict. Because no role has all three permissions, a successful run requires information to move across roles. The three roles separate task understanding, implementation, and validation. Removing either the Verifier or the Planner gives an ablation condition (Table 1). Implementation details, including container boundaries, file paths, and attestation format, are in Appendix A.1.

### 2.2 Task Construction

A coordination benchmark needs tasks where the specification contains information that the workspace alone does not reveal. If the workspace alone contains enough information for an Executor-only agent to solve the task, the task does not require coordination. We curate 851 templates with this property. Each template includes a deterministic script grader and expands to one or more seeded evaluation instances, for 931 instances total. We author 161 tasks whose critical constraints are available only in the full requirements. We adapt 650 GitHub bug reports from active open-source repositories, including Flask, Django, NumPy, SciPy, and Keras. We also include 30 data-science tasks built on canonical public datasets [5] and 10 incident-response tasks adapted from public post-mortems. The pool spans 19 base categories covering security patching, data-pipeline repair, distributed-systems debugging, cryptographic correctness, adversarial specification traps, and more. The benchmark includes many challenging tasks because §3.4 tests whether teams help most when single agents struggle. Figure 2 shows the composition.

![Refer to caption](https://arxiv.org/html/2605.07073v1/x1.png)

Figure 2: TeamBench composition. (a) Source distribution by origin class. (b) Domain distribution across the 851 task templates. (c) Difficulty distribution over the 804 templates with grader-check counts. The remaining 47 templates are Unrated because the grader-check count was not available.

Tasks span five coordination failure modes. Relay tasks require the Planner to pass specification details that the Executor cannot infer from the workspace. Adversarial-trap tasks place plausible but incorrect content in the workspace. Open-ended tasks allow multiple valid implementations, but deterministic checks define which outputs pass. Discovery tasks hide data-quality or API issues that the Executor must surface through active exploration. Synthesis tasks require correlating evidence across multiple documents. In the 161 authored templates, critical constraints appear only in the full specification; they are absent from the brief and workspace, so the Executor needs information from the Planner. Per-origin counts and file layout are in Appendix A.2.

Every task ships with a generator that produces deterministic but distinct workspace files from different random seeds. Generators vary task-relevant parameters (configuration values, API field names, bug locations) while preserving structural complexity, which protects against value memorization.
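The generator interface can be pictured as a pure function of (template, seed). The sketch below is a minimal illustration of that contract, not the released implementation; the parameter names and file contents are invented for the example:

```python
import hashlib
import random

def generate_workspace(template_id: str, seed: int) -> dict:
    """Deterministic workspace content per (template, seed):
    surface parameters vary, structure stays fixed."""
    rng = random.Random(f"{template_id}:{seed}")
    api_field = rng.choice(["user_id", "account_id", "client_id"])
    timeout_s = rng.randint(5, 60)   # task-relevant config value
    bug_line = rng.randint(10, 40)   # where the seeded bug lands
    config = f"timeout = {timeout_s}\nfield = {api_field}\n"
    # A content digest lets the harness confirm determinism across runs.
    digest = hashlib.sha256(config.encode()).hexdigest()[:12]
    return {"config.ini": config, "bug_line": bug_line, "digest": digest}

# Same seed, same workspace; a new seed refreshes the instance.
assert generate_workspace("PIPE2", 7) == generate_workspace("PIPE2", 7)
assert generate_workspace("PIPE2", 7) != generate_workspace("PIPE2", 8)
```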

### 2.3 Ablation Conditions

We run the same task under five conditions (Table 1) that isolate the marginal contribution of each role. Solo is a single agent with the full specification, workspace, and four tools. Restricted is the same single agent without access to the full specification. The three team conditions (Full Team; Team, No Plan; and Team, No Evaluate) then add or remove the Planner and Verifier.

Table 1: Ablation conditions, with which agent holds each capability. Spec = read full specification, Edit = modify workspace and execute commands, Attest = write the closing attestation. †In Team, No Plan, the Executor sees only the brief, and the Verifier holds the full spec for compliance checking.

| Condition | Agents | Spec | Edit | Attest |
| --- | --- | --- | --- | --- |
| Solo | one agent (full access) | ✓ | ✓ | ✓ |
| Restricted | one agent (Executor-only tools) | – | ✓ | – |
| Full Team | Planner + Executor + Verifier | Planner | Executor | Verifier |
| Team, No Plan | Executor + Verifier | Verifier† | Executor | Verifier |
| Team, No Evaluate | Planner + Executor | Planner | Executor | – |

We define the Teamwork Necessity Index (TNI) as the fraction of the Solo-versus-Restricted gap recovered by the team. Intuitively, TNI asks how much performance is recovered when the missing requirements must be relayed through the team:

$$\textsc{TNI}=\frac{S_{\text{team}}-S_{\text{restricted}}}{\max\left(\epsilon,\,S_{\text{solo}}-S_{\text{restricted}}\right)}, \tag{1}$$

with $\epsilon = 0.05$ to avoid instability when the Solo–Restricted gap is near zero. $\textsc{TNI} = 1$ indicates that the team fully recovers the single-agent advantage, while $\textsc{TNI} > 1$ exceeds it. We summarize TNI only over tasks where $S_{\text{solo}} - S_{\text{restricted}} > \epsilon$, since smaller Solo–Restricted gaps do not provide a meaningful test of teamwork necessity. We additionally report Planning Value $= S_{\text{full}} - S_{\text{no\_plan}}$ and Verification Value $= S_{\text{full}} - S_{\text{no\_verify}}$, and classify tasks as HIGH-TNI, TEAM-HELPS, NEUTRAL, or TEAM-HURTS using a $\pm 0.05$ band.
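In code, the metric and the classification bands are a few lines. The sketch below follows Eq. (1) directly; the paper does not spell out the exact rule for assigning HIGH-TNI, so the `tni >= 1` cutoff here is our assumption:

```python
def tni(s_team: float, s_solo: float, s_restricted: float,
        eps: float = 0.05) -> float:
    """Teamwork Necessity Index, Eq. (1)."""
    return (s_team - s_restricted) / max(eps, s_solo - s_restricted)

def planning_value(s_full: float, s_no_plan: float) -> float:
    return s_full - s_no_plan

def verification_value(s_full: float, s_no_verify: float) -> float:
    return s_full - s_no_verify

def classify_task(s_team, s_solo, s_restricted,
                  band=0.05, eps=0.05) -> str:
    gap = s_solo - s_restricted
    # Assumed HIGH-TNI rule: a meaningful Solo-Restricted gap that the
    # team fully recovers (TNI >= 1); the paper's exact cutoff may differ.
    if gap > eps and tni(s_team, s_solo, s_restricted, eps) >= 1.0:
        return "HIGH-TNI"
    if s_team - s_solo > band:
        return "TEAM-HELPS"
    if s_team - s_solo < -band:
        return "TEAM-HURTS"
    return "NEUTRAL"
```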

### 2.4 Role Mixing

If Planner, Executor, and Verifier roles require distinct capabilities, then the best model for one role need not be the best model for another. We therefore treat model assignment as an experimental variable rather than fixing one model across all roles, as much prior agent-based work does. The protocol enumerates every assignment of three models to three roles, giving 27 configurations, restricted to three LLM families (Anthropic, Google, OpenAI) with one model per family to keep the grid tractable. Each configuration runs all 25 tasks of a stratified subset. We use a compact code for configurations, with one letter for the provider assigned to each role. For instance, PGEAVO runs the Google Planner, the Anthropic Executor, and the OpenAI Verifier.
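A minimal sketch of the grid and its naming scheme (the provider-letter mapping is the only assumption beyond what the text states):

```python
from itertools import product

PROVIDERS = {"A": "Anthropic", "G": "Google", "O": "OpenAI"}

def configurations():
    """All 3^3 = 27 assignments of one provider to each role, with the
    compact code used in the text, e.g. PGEAVO = Google Planner,
    Anthropic Executor, OpenAI Verifier."""
    for p, e, v in product(PROVIDERS, repeat=3):
        yield f"P{p}E{e}V{v}", {"Planner": PROVIDERS[p],
                                "Executor": PROVIDERS[e],
                                "Verifier": PROVIDERS[v]}

codes = [code for code, _ in configurations()]
assert len(codes) == 27 and "PGEAVO" in codes and "PGEAVA" in codes
```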

## 3 Experiments and Results

The experiments address three questions. 1) What is the marginal contribution of each role under structural enforcement? 2) Does pass rate depend on which provider fills which role? 3) In which task regimes does the team provide positive uplift, and where does it hurt? Appendix C.1 lists the studies and the count behind each one.

### 3.1 Setup

The leaderboard evaluates 13 models across four families (Anthropic, Google, OpenAI, and Alibaba). The cross-provider grid uses one compact frontier model from each commercial family. Every role is a tool-calling loop in its own sandbox with its own tools. Each task includes a deterministic script grader. A binary pass requires every check to pass, following SWE-Bench [10] and Terminal-Bench [17]. Held-out seeds are reserved for leaderboard refresh. Comparisons use paired bootstrap (10,000 iterations) with Wilson 95% CIs. The enforcement study uses McNemar tests with Holm-Bonferroni correction. Reproducibility details are listed in Appendix D.2.
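For concreteness, a sketch of the two statistical primitives named above. The Wilson interval is standard; the bootstrap follows one common two-sided convention (resample per-task differences, double the smaller tail), which is our reading rather than the paper's stated procedure:

```python
import math
import random

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson 95% interval for a binomial pass rate."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def paired_bootstrap_p(scores_a, scores_b, iters=10_000, seed=0):
    """Two-sided paired bootstrap on per-task score differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    le = sum(
        1 for _ in range(iters)
        if sum(rng.choice(diffs) for _ in diffs) <= 0
    )
    tail = min(le, iters - le) / iters
    return min(1.0, 2 * tail)
```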

### 3.2 Main Results

![Refer to caption](https://arxiv.org/html/2605.07073v1/x2.png)

Figure 3: TeamBench leaderboard. Rows are sorted by max(Solo, Team) descending, so the row order tracks each model's best demonstrated capability. The bar shows both Solo (solid family color) and Team (hatched), and the right-side label gives that maximum percentage with the parenthesized delta to the other condition in green (+, team helps) or red (−, team hurts). The shorter of the two bars is drawn on top so both endpoints are visible. Whiskers are Wilson 95% CIs on the Team rate.

Figure 3 shows that TeamBench remains difficult even for frontier models. Claude Opus 4.7 is the strongest model in both settings, with 37.8% on Full Team and 35.6% on Solo. Across models, however, the gap between Full Team and Solo changes direction with Solo performance. Models with low Solo scores often improve under the team setting, including Sonnet 4.6 (+20.0), Haiku 4.5 (+16.7), GPT-5.4 (+15.6), and Gemini-3 Flash (+12.2). Models with stronger Solo performance show little gain or lose accuracy, including Opus 4.7 (+2.2), GPT-5.4 Mini (−4.4), and Gemma 4 31B (−5.6). This pattern suggests that role separation helps when a single agent lacks enough planning context to make progress, but can add failure opportunities once the Solo agent already solves much of the task.

We follow the counting convention in Table 12. A run that passes all structural checks but fails only the final attestation check is counted as a pass, because the attestation file records metadata compliance rather than task quality. The Qwen 3 family and gpt-oss-20b score below 18% on Solo and fall further in the team setting, mainly because of tool-call errors and context-window overflow rather than task-reasoning failures (Appendix E.4).

### 3.3 The Contribution of Each Role

**Verifiers reduce average score in the reference ablation.** In the 155-task reference ablation with Gemini-3 Flash, which fixes the model across all five conditions, removing the Verifier from the Full Team raises mean partial score by 5.5 points, and the per-task verification value averages −5.8 points. This decrease could reflect either Verifier error or grader error. To separate the two, we use the role-mixing pool, where every Verifier verdict is paired with the deterministic grader. The Verifier approves 49.4% of submissions that the grader rejects (Wilson 95% CI [45.9, 52.9], n = 1,083), while the rate of Verifier rejection on grader-passing submissions stays below 19% in every cell (Figure 5(a)). The error is therefore asymmetric: Verifiers more often accept failing work than reject passing work. Section 3.7 analyzes these false-accept failures.
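The two error rates are simple conditional frequencies over (Verifier verdict, grader verdict) pairs; a sketch:

```python
def verifier_error_rates(records):
    """records: iterable of (verifier_approved, grader_passed) booleans.
    Returns (false_accept_rate, false_reject_rate)."""
    grader_fail = [v for v, g in records if not g]
    grader_pass = [v for v, g in records if g]
    if not grader_fail or not grader_pass:
        raise ValueError("need both grader-failing and grader-passing runs")
    false_accept = sum(grader_fail) / len(grader_fail)   # approved bad work
    false_reject = sum(not v for v in grader_pass) / len(grader_pass)
    return false_accept, false_reject
```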

The Planner has a smaller positive effect. Adding the Planner to the No-Plan team raises mean partial score by 2.4 points, concentrated on hard tasks where the requirements contain decision rules the Executor cannot infer. Full Team is not significantly better than Solo on average in the reference ablation. Mean team-vs-Solo uplift is +0.5 points (p = 0.20, paired bootstrap, n = 10,000), and the team wins on 68 of 155 tasks. On TeamBench Mini, two stronger single-agent baselines, Solo CoT and Solo 2Pass, do not close the team gap (Appendix E.8). The average therefore hides where teams help and where they hurt.

### 3.4 Where Teams Benefit

**Teams help when Solo performance is low, but hurt when it is high.** We stratify the 155-task reference ablation by per-task Solo score, with 31 tasks per quintile. Full Team outperforms Solo by 15.7 points in the lowest quintile (Solo 0.00 to 0.22; 95% CI [5.8, 25.7]) and by 8.8 points in the second quintile. In the remaining quintiles, Full Team trails Solo by 6.8 to 10.1 points (Figure 5(b)).

Appendix E.6 tests three explanations. The pattern is not explained by time spent in Solo runs, since Solo elapsed time is unrelated to Solo score (r = 0.00). It is not explained by the Planner acting as a second reasoning pass, since No-Plan and Restricted are indistinguishable on hard tasks (p = 0.66). It is also not explained by the amount of missing specification alone, since the Solo–Restricted gap is negatively correlated with the Full-Team-minus-Solo difference (r = −0.45, p < 10⁻²⁴). This suggests that the team helps when the Solo agent lacks enough structure to make progress. Once Solo already makes progress, Verifier intervention can redirect or overwrite work that would otherwise pass.

TNI classifies 15 tasks as HIGH-TNI, 39 as TEAM-HELPS, 62 as NEUTRAL, and 39 as TEAM-HURTS. The Planner-to-Executor transfer analysis shows why these gains are limited. Across 792 runs over 15 tasks, mean recall of spec-critical tokens is 0.79 in the Planner channel but only 0.24 in the Executor tool inputs and outputs. The Executor retains 0.21 of the Planner's spec-critical tokens on average and 0.13 at the median. Because this measure only counts tokens that appear in the Executor channel, it is a lower bound on how much Planner information reaches the Executor.
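The transfer measure is a set-overlap recall. A sketch, with the caveat that how spec-critical tokens are extracted and normalized is a detail we have assumed:

```python
def spec_token_recall(spec_tokens: set[str], channel_text: str) -> float:
    """Fraction of spec-critical tokens that literally appear in a
    role's channel. Paraphrases do not count, so this is a lower
    bound on information transfer."""
    if not spec_tokens:
        return float("nan")
    text = channel_text.lower()
    hits = sum(1 for tok in spec_tokens if tok.lower() in text)
    return hits / len(spec_tokens)
```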

### 3.5 Cross-Provider Role Mixing

![Refer to caption](https://arxiv.org/html/2605.07073v1/x3.png)

Figure 4: Cross-provider role-mixing on the 25-task subset. (a)-(c) report the marginal pass rate of each provider in each role, averaged over the 9 configurations holding that slot fixed. (d) plots all 27 configurations on the cost versus pass-rate plane (log x). The stair-step Pareto frontier runs from POEOVO (18.7%, $2.09) to PGEAVA (26.7%, $20.52). Marker color indicates the Executor family.

**Executor and Verifier choices matter.** Per-role family marginals (Figure 4(a) to (c), bootstrap 95% CIs in Appendix C.3) show that the Anthropic Executor is about 3 points higher than the alternatives and the OpenAI Verifier has the highest Verifier marginal. The Planner confidence intervals overlap across all three families. Mixed-provider teams improve the cost-performance tradeoff. Across three seeds, PGEAVA (Google Planner, Anthropic Executor, Anthropic Verifier) reaches 26.7% at $20.52, outperforming the all-Anthropic team (22.7%, $39.58) by 4 points at roughly half the cost. POEOVA matches the all-Anthropic rate at $10.98, and the all-OpenAI team hits 18.7% at $2.09. Fine-grained configuration ranks are unstable across seeds (Spearman 0.09 to 0.28; 28.9% of configuration-task pairs flip). We therefore focus on pooled role marginals rather than exact rank order.
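The stair-step frontier in Figure 4(d) is the usual Pareto filter over (cost, pass rate) points. A sketch, seeded with the four configurations quoted above:

```python
def pareto_frontier(points):
    """Keep configurations not dominated by a cheaper-or-equal,
    better-or-equal alternative. points: (cost_usd, pass_rate)."""
    frontier, best_rate = [], -1.0
    for cost, rate in sorted(points):   # cheapest first
        if rate > best_rate:            # beats everything cheaper
            frontier.append((cost, rate))
            best_rate = rate
    return frontier

grid = [(2.09, 0.187),   # all-OpenAI (POEOVO)
        (10.98, 0.227),  # POEOVA
        (20.52, 0.267),  # PGEAVA
        (39.58, 0.227)]  # all-Anthropic, dominated by POEOVA
assert pareto_frontier(grid) == [(2.09, 0.187), (10.98, 0.227), (20.52, 0.267)]
```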

### 3.6 Prompt-Only vs. Enforced

**Pass rate does not identify role behavior.** We pre-specified a 450-run study on a 25-task subset across three model families and two seeds. Prompt-only assignment shares tools and message history. Enforced assignment separates both execution environments and histories, while enforced-with-shared-history preserves the shared conversation only. After exclusions, 400 valid runs remain. We use paired McNemar tests with Holm-Bonferroni correction. No planned comparison remains significant after correction, with the strongest raw test at p = 0.038 and p_adj = 0.113. The shared-history condition is reported as inconclusive under the worst-case sensitivity analysis (Appendix E.7). Trace labels tell a different story. Enforcement reduces Verifier code-edit attempts from 256 to 72, while increasing executor-plans events from 261 to 416. These changes leave the final pass rate nearly unchanged. A pass-rate-only comparison would treat the two settings as equivalent, even though prompt-only assignment allows many more Verifier takeovers. Without enforced roles, a benchmark cannot tell whether agents coordinated or whether one model performed roles it was not assigned.
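For reference, exact McNemar on discordant pairs and Holm-Bonferroni step-down, as a sketch; the paper does not state which McNemar variant it uses, so the exact binomial form here is an assumption:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar test. b and c count discordant task
    pairs: pass under one condition only, pass under the other only."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_bonferroni(pvals):
    """Holm step-down adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted
```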

### 3.7 Error Analysis

![Refer to caption](https://arxiv.org/html/2605.07073v1/x4.png)

Figure 5: Verifier errors and conditional team value. (a) Verifier false-accept and false-reject rates on role-mixing runs. False-accept rates reach 60% for Haiku 4.5, 77% for Gemini-3 Flash, and 36% for GPT-5.4 Mini, while false-reject rates remain below 19%. (b) Mean team uplift over Solo by per-task Solo-score quintile on the 155-task reference ablation (n = 31 per quintile). Teams help on difficult tasks but hurt on easier ones. Whiskers show 95% confidence intervals.

**Verifier failure modes.** We inspect 384 false-accept events and identify three recurring patterns (Figure 5(a)). Optimistic verdicts approve the submission without inspecting Executor evidence, most often for Haiku 4.5. Echo verdicts repeat the Executor's claim of completion without checking it against the requirements, most often for Gemini-3 Flash. Verifier-rewrite failures occur when the Verifier attempts to edit workspace code instead of judging the Executor's output. This role takeover occurs 3.6 times more often under prompt-only assignment in the pre-specified comparison.

Counting the 942 missing verdicts as failures lowers the effective false-accept rate from 49.4% to 22.3% (Table 19), but leaves the conclusion unchanged: current Verifiers are unreliable quality gates in this protocol. The audited TeamBench-Verified subset shows the same pattern, with a 38.7% false-accept rate, suggesting that the result is not driven by one unaudited task pool. Because missing attestations are themselves verifier-side or system-side failures, the 49.4% rate should be read as a verdict-quality estimate conditional on a verdict, while the 22.3% sensitivity analysis is the conservative end-to-end estimate.

**Verifier errors explain why team value changes with Solo score.** Figure 5(b) connects the Verifier false-accept result to the Solo-score stratification. In the lowest Solo-score quintile (Solo 0.00 to 0.22), Planner guidance offsets Verifier errors and Full Team gains 15.7 points. In the highest quintile (Solo ≥ 0.90), the Executor is often already close to a passing solution, and Verifier edits reduce the score by about 10 points. This explains why the average team gain is small: planning helps in the lowest Solo-score quintiles, while Verifier edits hurt in the highest ones.

**Small open-weight models fail mainly in tool use.** The Qwen-3 family scores at most 5.6%, due to invalid tool calls and context-window exhaustion rather than task-reasoning failures (Appendix E.4). At the ≤30B open-weight tier, TeamBench primarily measures tool-call reliability rather than coordination.

### 3.8 Human Study

Appendix B reports a human study run under settings that mirror the agent experiments. Solo is a single human with full access. Hybrid is one human paired with two agents. Team is three humans under the same role separation as the agent harness. The study covers 40 completed, survey-confirmed sessions across 21 tasks from 18 distinct humans (Solo n = 12, Hybrid n = 17, Team n = 11). 13 of the 20 stratified target tasks have at least one session. The pilot is covered by an MIT COUHES exempt determination under 45 CFR 46.104(d)(2), Exempt Category 2.

**The value of the Verifier.** Across 32 role-level surveys from 11 Team sessions, the three CATME [19] ratings cluster tightly: Verifier value is 3.75, Executor efficiency is 3.72, and early planning is 3.69 on a 1-to-5 scale. In the Solo counterfactual survey (n = 12), participants rate per-role teammates similarly (3.00 to 3.17) and rate the time-only counterfactual lowest at 2.17, indicating that adding any teammate is perceived as more helpful than additional alone-time. The Verifier role is therefore not perceived as low-value in either instrument, in contrast to the role where LLMs fail most often in our agent experiments (§3.7). The contrast suggests that the Verifier slot is not inherently low value, but that current Verifiers struggle to execute it reliably in this protocol.

**Time pressure and missing information lead the failure factors.** Figure 6(c) plots the primary failure factors from the 32 Team surveys. Time pressure is the top endorsement (17 of 32), followed by information missing across roles (14). The next four are weak or late planning (7), unclear communication (7), implementation difficulty (6), and other (4), with missed verification tied at 4 (12.5%). Two observations follow. First, humans running the same single-pass, file-based protocol cite the same architectural property the benchmark imposes on agents, namely information structurally split between the specification and the workspace, as their dominant non-time failure cause. Second, human participants rarely select missed verification as the primary failure factor, which contrasts with the 49% Verifier false-accept rate in the agent experiments (§3.7), where Verifiers often approved grader-failing work.

**Three modes show different collaboration patterns.** *Solo* sessions are exploratory, with a median of 11 minutes wall-clock and a median of 6 explicit actions per session. *Hybrid* sessions collapse to a near-instant approve-and-submit pattern, with a median of 3 minutes and only 2 explicit actions per session, while 16 of 17 Hybrid Verifiers self-attested pass. *Team* sessions instead spread effort across roles, with a median of 26 minutes and a median of 39 explicit actions per session. Per-role chat turns (Figure 6(b)) place the Executor highest (median 10), the Planner middle (median 8), and the Verifier lowest (median 5), even though the Verifier role receives the highest perceived value.

## 4 Discussion

**Pass rate alone is not enough for agent evaluation.** The 40.5% Full Team pass rate in the prompt-only versus enforced study (Table 6) hides two effects that matter for coordination. Enforcement reduces Verifier code-edit attempts by 3.6 times, and the Full-Team-minus-Solo gap changes by 24.7 points between the lowest and highest Solo-score quintiles. Reporting only the final pass rate would miss both. We suggest two additions to multi-agent benchmark reports. First, reports should include role-violation rates from per-turn traces, so that teams with the same pass rate but different role behavior are not treated as equivalent. Second, reports should stratify team value by Solo score, so that where teams help and where they hurt is visible rather than averaged into a +0.5-point mean. TeamBench reports both quantities and the per-turn rubric needed to compute role violations.

**LLM judges may inherit the same false-accept problem.** In the role-mixing runs, Verifiers accept grader-failing submissions at rates from 36% for GPT-5.4 Mini to 77% for Gemini-3 Flash, with a pooled rate of 49.4% (Figure 5(a)). This matters for multi-agent benchmarks that use an LLM judge to mark milestones or task completion, including the MultiAgentBench Coordination Score [28]. Such metrics can over-credit outputs that satisfy the judge while failing deterministic checks. A simple safeguard is to anchor scores to deterministic graders or human-audited outcomes, and to report judge-grader disagreement. The 49% rate is a property of our sampled role-mixing distribution, not of any single model. Future systems may reduce it with tool-assisted checking or attestations that require evidence.

**Matched human runs make the process signal visible.** The human study under the same role boundaries gives three signals that agent traces alone cannot provide (Appendix B). First, humans report missing information across roles as their main non-time barrier, matching the split between specification access and workspace access imposed by the benchmark. Second, Hybrid sessions often reduce to quick approval, with a median of 3 minutes per session and a 94.1% self-attestation rate against a 79% structural-grader partial mean. This mirrors the over-acceptance seen in Verifiers and suggests that human-in-the-loop benchmarks should log override behavior. Third, the Verifier receives the highest stated value but the lowest activity count, even though Verifiers fail most often in the agent experiments. Together, these observations show that pass rate alone is not a sufficient signal for multi-agent evaluation, whether the final decision is made by a model or by a human in the loop. We release the benchmark and human-baseline platform together so that future systems can be evaluated under the same single-pass, file-based protocol.

**Cross-provider mixing changes the cost-performance frontier.** On the 25-task grid, the Anthropic-only team reaches 22.7% at $39.58. PGEAVA reaches 26.7% at $20.52, POEOVA matches the Anthropic-only rate at $10.98, and the all-OpenAI team reaches 18.7% at $2.09 (Appendix C.3). No single provider dominates the Pareto frontier across the 27 configurations. The per-role marginals suggest that Executor and Verifier choice matters more than Planner choice. This means that single-provider leaderboards may miss cheaper mixed-provider configurations with similar or better pass rates. We release the full 27-cell grid and cost ledger to support cost-aware multi-agent comparisons.

**Relation to prior work and remaining scope.** The Solo-score pattern in Section 3.4 matches Steiner's prediction [22] that disjunctive tasks yield little team gain over a capable individual. The Executor retains only 0.21 of the Planner's spec-critical tokens on average, consistent with transactive-memory findings on newly formed teams [24, 12]. The human study points in the same direction: participants rate the Verifier as useful, while LLM Verifiers fail frequently in the agent experiments. Compared with concurrent diagnostics that taxonomize failures [2], probe collective reasoning [15], or formalize benchmark validity [29], TeamBench adds a controlled comparison of prompt-assigned and enforced roles. The current harness uses capped turns and a file-based workflow. It does not test multi-round dialogue, dynamic role assignment, or within-provider model-size scaling. Verifier false acceptance should be a priority for future agent systems, and enforced role separation makes that target measurable: a stronger Verifier should reject failing outputs without taking over the Executor role.

## 5 Conclusion

TeamBench enforces role separation with sandbox permissions and evaluates the resulting teams across 13 models. We find that teams rarely outperform single agents on average, Verifiers falsely accept many failing submissions, team gains depend strongly on Solo capability, and a mixed-provider team can match the strongest single-provider team at roughly half the cost. Prompt-only and enforced assignments reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more Verifier code-edit attempts. We release the benchmark, harness, and human-study platform so that agent systems can be evaluated not only by final pass rate, but by whether each added agent performs distinct work.

## References

- [1] Anthropic (2025). How we built our multi-agent research system. Anthropic Engineering Blog, June 2025. https://www.anthropic.com/engineering/multi-agent-research-system
- [2] M. Cemri, M. Z. Pan, S. Yang, L. A. Agrawal, B. Chopra, R. Tiwari, K. Keutzer, A. Parameswaran, D. Klein, K. Ramchandran, M. A. Zaharia, J. E. Gonzalez, and I. Stoica (2025). Why do multi-agent LLM systems fail? In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track. arXiv:2503.13657.
- [3] J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, L. Weng, and A. Mądry (2024). MLE-Bench: evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095.
- [4] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch (2024). Improving factuality and reasoning in language models through multiagent debate. In International Conference on Machine Learning (ICML).
- [5] D. Dua and C. Graff (2019). UCI machine learning repository. University of California, Irvine. https://archive.ics.uci.edu/ml
- [6] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford (2021). Datasheets for datasets. Communications of the ACM 64(12), pp. 86–92.
- [7] J. R. Hackman (1987). The design of work teams. In Handbook of Organizational Behavior, J. W. Lorsch (Ed.), pp. 315–342.
- [8] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, et al. (2024). MetaGPT: meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR).
- [9] N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2024). LiveCodeBench: holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
- [10] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan (2024). SWE-bench: can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR).
- [11] Y. Kim, K. Gu, C. Park, C. Park, S. Schmidgall, A. A. Heydari, Y. Yan, Z. Zhang, Y. Zhuang, Y. Liu, et al. (2025). Towards a science of scaling agent systems. arXiv preprint arXiv:2512.08296.
- [12] K. Lewis (2003). Measuring transactive memory systems in the field: scale development and validation. Journal of Applied Psychology 88(4), pp. 587–604.
- [13] B. Li, W. Wu, Z. Tang, L. Shi, J. Yang, J. Li, S. Yao, C. Qian, X. Cong, X. He, et al. (2024). DevBench: a comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604.
- [14] G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem (2023). CAMEL: communicative agents for "mind" exploration of large language model society. In Advances in Neural Information Processing Systems (NeurIPS).
- [15] Y. Li, A. Naito, and H. Shirado (2025). HiddenBench: assessing collective reasoning in multi-agent LLMs via hidden profile tasks. arXiv preprint arXiv:2505.11556.
- [16] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, et al. (2024). AgentBench: evaluating LLMs as agents. In International Conference on Learning Representations (ICLR).
- [17] M. A. Merrill et al. (2026). Terminal-Bench: benchmarking agents on hard, realistic tasks in command line interfaces. arXiv:2601.11868.
- [18] G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom (2024). GAIA: a benchmark for general AI assistants. In International Conference on Learning Representations (ICLR).
- [19] L. O'Bryan, T. Oxendahl, X. Chen, D. McDuff, S. Segarra, M. Wettergreen, M. E. Beier, and A. Sabharwal (2024). Objective communication patterns associated with team member effectiveness in real-world virtual teams. Human Factors 66(5), pp. 1414–1430.
- [20] C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Cong, et al. (2024). ChatDev: communicative agents for software development. In Annual Meeting of the Association for Computational Linguistics (ACL).
- [21] G. Stasser and W. Titus (1985). Pooling of unshared information in group decision making: biased information sampling during discussion. Journal of Personality and Social Psychology 48(6), pp. 1467–1478.
- [22] I. D. Steiner (1972). Group Process and Productivity. Academic Press, New York.
- [23] X. Wang, Y. Li, Y. Song, F. F. Xu, H. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig (2024). OpenHands: an open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741.
- [24] D. M. Wegner (1987). Transactive memory: a contemporary analysis of the group mind. In Theories of Group Behavior, B. Mullen and G. R. Goethals (Eds.), pp. 185–208.
- [25] J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese (2025). BrowseComp: a simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516.
- [26] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, et al. (2023). AutoGen: enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155.
- [27] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Pei, O. Press, and K. Narasimhan (2024). SWE-agent: agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793.
- [28] K. Zhu, H. Du, Z. Wang, et al. (2025). MultiAgentBench: evaluating the collaboration and competition of LLM agents. In Annual Meeting of the Association for Computational Linguistics (ACL). arXiv:2503.01935.
- [29] Y. Zhu et al. (2025). Establishing best practices for building rigorous agentic benchmarks. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track. arXiv:2507.02825.
- [30] M. Zhuge, W. Wang, L. Kirsch, F. Faccio, D. Khizbullin, and J. Schmidhuber (2024). Language agents as optimizable graphs. arXiv preprint arXiv:2402.16823.

## Appendix A Benchmark Construction

### A.1 Role permissions

Table 2: Role permissions enforced by Docker bind mounts. No single role has simultaneous access to the full specification, workspace write access, and attestation write access.

| | Planner | Executor | Verifier |
| --- | --- | --- | --- |
| Read spec.md | ✓ | – | ✓ |
| Read brief.md | ✓ | ✓ | – |
| Read workspace/ | – | ✓ | ✓ (read-only) |
| Write workspace/ | – | ✓ | – |
| Read messages/ | ✓ | ✓ | ✓ |
| Write messages/ | ✓ | ✓ | ✓ |
| Read reports/ (Executor logs) | – | ✓ | ✓ (read-only) |
| Write reports/ | – | ✓ | – |
| Write attestation.json | – | – | ✓ |
| Execute commands | – | ✓ | – |
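A minimal launcher sketch of how bind mounts could realize Table 2. The image name, entrypoint convention, and network-isolation flag are our assumptions; only the mount matrix comes from the table:

```python
import subprocess

# Mount matrix mirroring Table 2: (path, mode) per role.
MOUNTS = {
    "planner":  [("spec.md", "ro"), ("brief.md", "ro"), ("messages", "rw")],
    "executor": [("brief.md", "ro"), ("workspace", "rw"),
                 ("messages", "rw"), ("reports", "rw")],
    "verifier": [("spec.md", "ro"), ("workspace", "ro"), ("messages", "rw"),
                 ("reports", "ro"), ("attestation.json", "rw")],
}

def run_role(role: str, task_dir: str, image: str = "teambench/agent"):
    """Launch one role container; anything not mounted is simply absent."""
    cmd = ["docker", "run", "--rm", "--network", "none"]  # isolation assumed
    for path, mode in MOUNTS[role]:
        cmd += ["-v", f"{task_dir}/{path}:/task/{path}:{mode}"]
    cmd += [image, role]  # entrypoint receives the role name
    subprocess.run(cmd, check=True)
```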
### A.2 Task construction

#### Originally-authored templates.

Information asymmetry must be introduced deliberately. In benchmarks derived from real GitHub issues, the issue description typically contains enough context for any single agent to proceed, rendering the Planner redundant by construction. Our 161 originally-authored templates enforce asymmetry at the authoring stage by placing critical constraints exclusively in spec.md, absent from the brief and workspace.

#### Real GitHub bug reports.

The benchmark includes 11 hand-curated GitHub bug tasks (GH1 through GH11) drawn from Flask, Click, httpx, Requests, Pydantic, Django, pytest, FastAPI, SQLAlchemy, Celery, and Werkzeug, plus 639 additional templates from a broader scrape across active open-source repositories such as NumPy, SciPy, Keras, and spaCy. The hand-curated set includes the full issue discussion in the specification and a user-facing symptom in the brief.

#### Task file layout.

Each task includes a specification (spec.md) with all requirements and edge cases, a brief (brief.md) for the Executor, a workspace with initial code, a setup script, and a deterministic grader (grade.sh) that produces a partial score in [0, 1].
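A sketch of how a harness might invoke the grader; the text fixes only the script name and the score range, so the stdout contract below is an assumption:

```python
import subprocess

def run_grader(task_dir: str) -> float:
    """Run the task's deterministic grade.sh and return the partial
    score; assumes the script prints the score as its last stdout line."""
    out = subprocess.run(["bash", "grade.sh"], cwd=task_dir,
                         capture_output=True, text=True, check=True)
    score = float(out.stdout.strip().splitlines()[-1])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score outside [0, 1]: {score}")
    return score
```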

#### Evaluation subsets.

The 28-task TeamBench-Mini subset is retained for backward-compatible quick validation. The new stratified TeamBench leaderboard (Appendix C.2) supersedes Mini as the primary cross-model evaluation set.

### A.3 Related work

#### Single-agent and agent-scaffolding benchmarks.

SWE-Bench [10], Terminal-Bench [17], MLE-Bench [3], DevBench [13], GAIA [18], BrowseComp [25], and AgentBench [16] drive progress on single-agent capability across software engineering, terminal use, ML engineering, multi-step reasoning, and open-web browsing. SWE-agent [27] and OpenHands [23] package the scaffolding side. None of these benchmarks include ablation conditions that measure multi-agent coordination directly. LiveCodeBench [9] uses temporal holdouts for contamination resistance. Our parameterized generators produce arbitrary unseen seeds at evaluation time, decoupling contamination resistance from the publication date.

#### Multi-agent frameworks and benchmarks.

MetaGPT [8], AutoGen [26], ChatDev [20], CAMEL [14], multi-agent debate [4], and GPTSwarm [30] assign roles through prompt instructions, which permits a single dominant agent to absorb every role and conflates coordination with prompt compliance. Our container-based harness enforces role separation directly rather than relying on prompt instructions. Table 15 compares TeamBench against this group across eight design axes. MultiAgentBench [28] is the closest concurrent benchmark in spirit, evaluating six interactive scenarios across four communication topologies (star, chain, tree, graph) with an LLM-based milestone detector that yields Task and Coordination Scores. Two design choices distinguish our work. The first is to fix a single Planner-Executor-Verifier topology and vary enforcement, rather than vary topology under prompt-assigned roles; this isolates whether coordination gains come from the team structure or from prompt compliance, an axis MultiAgentBench does not isolate. The second is to evaluate coordination against a deterministic grader rather than an LLM judge on milestone completion. Our Verifier false-acceptance measurement (49% on grader-failing runs) raises the possibility that LLM-judge coordination scores may over-credit teams unless calibrated against deterministic or human-audited outcomes, which we view as a calibration check rather than a competing benchmark.

#### Concurrent multi-agent diagnostics.

MAST [2] extracts a 14-mode failure taxonomy from 1,600 traces across seven prompt-orchestrated frameworks and concludes that prompted teams underperform single agents on a variety of tasks. The controlled experimental setting we report generates the kind of traces such a taxonomy needs, and isolates how much failure attribution depends on prompt-only role assignment. HiddenBench [15] ports the social-psychology hidden-profile paradigm [21] to multi-agent LLMs on 65 abstract-decision tasks. We measure coordination on production-grade software and data-analysis tasks instead. The Agentic Benchmark Checklist [29] documents validity failures in agent benchmarks and proposes task-validity, outcome-validity, and reporting standards, which we adopt by separating the deterministic grader from the Verifier attestation, by reporting the Verifier false-acceptance rate against the grader, and by reporting per-seed flip rates rather than headline ranks alone. Anthropic's multi-agent retrospective [1] reports that subagents in coding workloads spend more tokens on coordination than on actual work, matching our finding that the average team gain is small and that the Verifier is the binding constraint.

#### Connections to human teamwork research.

The capability-floor pattern in Section 3.4 aligns with Steiner's disjunctive-task prediction [22]. The Planner-to-Executor relay fidelity of 0.21 aligns with the transactive-memory line [24, 12] on newly formed teams. Hackman [7] treats process measurement and communication quality as prerequisites for explaining team performance differences, and the released traces give downstream work a reusable process-measurement substrate. The Discussion expands on the implications.

## Appendix B Human Study

![Refer to caption](https://arxiv.org/html/2605.07073v1/x5.png)

Figure 6: Human study results. Each colored dot is one session, and the short black horizontal bar marks the per-mode median. (a) Duration per session by mode, in minutes on a linear scale, capped at 40 minutes (n = 10 Solo, n = 17 Hybrid, n = 11 Team within the cap; two Solo sessions were excluded since they exceeded the cap). (b) Per-role chat turns in Team mode, counting only messages a participant sent in the chat panel. The per-session total is reported in the body text. (c) Failure factors from Team-mode participants on the post-task survey, multi-select per survey across 32 role-level responses from 11 Team sessions.

We deploy a web-based collection platform that runs TeamBench tasks for human participants under the same role separation. Figure [7](https://arxiv.org/html/2605.07073#A2.F7) shows the four screens of the flow. Participants sign in with their institution and primary expertise (a), pick a task from the stratified human-eval subset (b), choose Team Mode or Solo Mode and their role (c), and then work in a workspace view that mirrors the agent harness with identical file scoping, grading, and attestation flow (d). The same Docker grading script scores human and agent submissions, which allows direct human-versus-agent comparison on matched tasks. The current submission reports a pilot and releases the platform for larger follow-up studies. The platform logs participant behavior by task and role, and can be reused by other groups for matched human comparisons.

![Refer to caption](https://arxiv.org/html/2605.07073v1/x6.png)

Figure 7: Human study platform ([https://teambench.github.io/human-eval/](https://teambench.github.io/human-eval/)). Participants enter a profile (a), pick a task from the stratified human-eval subset (b), select Team or Solo Mode with their role (c), and work in a sandboxed workspace with the same graders and role-based file access that agents use (d). The platform issues the same attestation file that closes the agent run.

### B.1 Pilot Coverage

Coverage as of submission, derived from the released collection backend. A row is counted only if the session reached the completed phase, the participant submitted a survey for their role, and the identity passes a dev-pollution filter (real-shaped email, name length ≥ 2, no test/admin/probe pattern).
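For concreteness, a minimal sketch of such a filter, assuming illustrative patterns and names (the released backend defines the actual regexes):

```python
import re

# Minimal sketch of the dev-pollution filter described above. The email shape,
# the probe keywords, and the function name are illustrative assumptions; the
# released collection backend defines the actual patterns.
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
PROBE_PATTERN = re.compile(r"test|admin|probe", re.IGNORECASE)

def passes_pollution_filter(name: str, email: str) -> bool:
    """Return True if the identity looks like a real participant."""
    if len(name.strip()) < 2:            # name length >= 2
        return False
    if not EMAIL_PATTERN.match(email):   # real-shaped email
        return False
    if PROBE_PATTERN.search(name) or PROBE_PATTERN.search(email):
        return False                     # no test/admin/probe pattern
    return True
```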

Table 3: Human baseline pilot coverage as of submission, restricted to sessions that reached the completed phase with a submitted survey.

| mode | unique humans | unique emails | sessions | distinct task ids |
| --- | --- | --- | --- | --- |
| Solo | 6† | 7 | 12 | 10 |
| Hybrid | 10 | 10 | 17 | 11 |
| Team | 9† | 10 | 11 | 8 |
| total | 18† | 19 | 40 | 21‡ |
† Email-uniqueness lists Solo as 7 and Team as 10 because one participant used two emails in each mode. De-duplicated by person, Solo is 6 humans, Team is 9, and the total is 18. ‡ 21 distinct task ids across all modes, of which 13 overlap with the 20-task stratified target subset.

Of the 20 stratified target tasks, Solo has any data on 5 (API1_version_compat, CR4_api_review, D6_data_reconcile, TEST3_integration, TRAP1_spec_conflict), Hybrid on 5 (DIST1_queue_race, GH1002_scipy_24753, O6_perf_tuning, RDS10_survey_analysis, RDS13_smote_leakage), and Team on 5 (CR4_api_review, IR2_misinformation_trap, LH2_budgeted_workflow, PIPE2_data_pipeline, TEST3_integration).

### B.2 Findings

#### Hybrid wall\-clock distribution\.

Of the 17 Hybrid sessions with a recorded duration, 1 finished under 60 s, 8 between 1 and 3 minutes, 5 between 3 and 10 minutes, 2 between 10 and 30 minutes, and 1 between 30 and 60 minutes. The 12 Solo sessions distribute much more evenly across the same buckets (1, 1, 3, 4, 2), with one additional Solo session above 60 minutes excluded from the Figure [6](https://arxiv.org/html/2605.07073#A2.F6)(a) duration panel as an inactive browser session. The collapse of the Hybrid bottom buckets (9 of 17 under 3 minutes), combined with the 16/17 self-attestation rate against the deterministic-grader partial mean of 0.79 (§[3.8](https://arxiv.org/html/2605.07073#S3.SS8)), is the basis for the grader-override finding.

#### Survey instruments\.

Solo participants complete the TeamBench-Solo-Reflection instrument: an attention check, a counterfactual team-value Likert (per-role and time-only items), and an open-ended collaboration-challenge prompt. Hybrid participants complete a Hybrid-AI-Teammate instrument that asks how useful and trustworthy the AI Planner and Executor were, how often the participant overrode the grader (overrode_grader, 1 = never, 5 = always), and the perceived value of the Verifier role. Team participants complete a CATME + TeamBench Coordination instrument with role-separation, verifier-value, communication-overhead, early-plan, executor-efficiency, and information-held-by-other items, plus the same attention check.

#### Self\-attestation systematically overstates completion\.

The methodology, exclusions, and per-session results are in Appendix [B.3](https://arxiv.org/html/2605.07073#A2.SS3). We report the structural-checks partial score as the primary metric since the binary all-checks-pass column is sensitive to a single pytest invocation that depends on container-level environment parity, which our offline pipeline does not fully reproduce. Hybrid self-attested 16/17 pass and the grader confirms a structural mean partial of 0.79 (median 0.86). Team self-attested 11/11 pass and the grader confirms a structural mean partial of 0.60 (median 0.83, with 4 of 11 sessions reaching ≥ 0.86 and the strongest at 0.94 on CR4_api_review). Solo verdicts are not written by the current platform, but the 12 Solo sessions that re-graded yield a structural mean partial of 0.80 (median 0.90). Team-mode submissions are therefore produced at near-completion quality, while the human Verifier role accepts them as fully complete every time. The Hybrid post-task survey corroborates the override pattern. The mean self-reported overrode_grader score is 2.41 on a 1 to 5 scale, and 9 of 17 sessions report ≥ 3. We release the platform, survey instruments, and pre-specified analysis plan, so other groups can run matched human comparisons at scale.

#### Pilot survey aggregates (descriptive only, N small).

Table 4: Pilot survey aggregates by mode. Likert items are 1 to 5, mean reported with sample size in parentheses. For Solo and Hybrid, n counts sessions (n = 12 Solo, n = 17 Hybrid). For Team, n = 32 counts *role-level survey responses* pooled across team members from 11 Team sessions. Means within ∼0.1 of each other should be read as clustered, not ranked.

| mode | item | mean (n) |
| --- | --- | --- |
| Solo (counterfactual) | would a Planner teammate help (cf_planner) | 3.17 (12) |
| | would a Verifier teammate help (cf_verifier) | 3.08 (12) |
| | would an Executor teammate help (cf_executor) | 3.00 (12) |
| | would domain expertise help (cf_domain) | 3.00 (12) |
| | would more time alone help (cf_time_only) | 2.17 (12) |
| Hybrid (AI teammate) | AI Planner useful (ai_planner_useful) | 3.76 (17) |
| | AI Planner trust (ai_planner_trust) | 3.71 (17) |
| | AI Executor quality (ai_executor_quality) | 3.59 (17) |
| | AI Executor trust (ai_executor_trust) | 3.71 (17) |
| | Verifier role value (verifier_role_value) | 3.29 (17) |
| | self-reported grader-override (overrode_grader) | 2.41 (17) |
| Team (coordination) | verifier value | 3.75 (32) |
| | executor efficiency | 3.72 (32) |
| | early plan | 3.69 (32) |
| | role separation helped | 3.44 (32) |
| | information held by other | 3.38 (32) |
| | communication overhead | 3.25 (32) |
| Team (counterfactual “stronger X” items) | stronger Executor would change outcome | 3.72 (32) |
| | stronger Verifier would change outcome | 3.53 (32) |
| | stronger Planner would change outcome | 3.44 (32) |
#### Team\-mode primary failure factors\.

The Team coordination instrument includes a structured multi-select item (max 3 selections) asking which factor most affected the outcome on this task. Across the 32 team-role surveys, endorsement counts are: time_pressure 17, missing_info_across_roles 14, weak_or_late_planning 7, unclear_communication 7, implementation_difficulty 6, other 4, missed_verification 4.

### B.3 Re-graded human outcomes

The verdict field stored against each session is the participant’s self\-attestation written from the verdict UI, not a deterministic\-grader run\. We re\-grade each session offline by replaying the participant’s final workspace snapshot through the same shell\-script grader used by the agent harness\. The pipeline is deterministic and reproducible from the released collection backend export\.

For each session that reached the completed phase with a submitted survey, we replay the final workspace snapshot under the same deterministic shell\-script grader used by the agent harness, materializing each file into a fresh temporary workspace and using the generator’s seed\-0 expected outputs\. Support files outside the canonical workspace are written best\-effort\.
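A minimal sketch of this replay, assuming a hypothetical grader script name (`grade.sh`-style invocation) and a "passed/total" output convention; the released pipeline defines the actual contract:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def regrade(snapshot_dir: str, grader_script: str, expected_dir: str) -> dict:
    """Replay a final workspace snapshot through the deterministic grader."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        # Materialize the participant's final files into a fresh workspace.
        shutil.copytree(snapshot_dir, workspace)
        # Run the same shell-script grader the agent harness uses, pointed at
        # the generator's seed-0 expected outputs.
        result = subprocess.run(
            ["bash", grader_script, str(workspace), expected_dir],
            capture_output=True, text=True,
        )
        # Assumed convention: the grader prints "checks_passed/checks_total".
        passed, total = map(int, result.stdout.strip().split("/"))
        return {"partial": passed / total, "binary_pass": passed == total}
```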

Table 5: Re-graded human outcomes versus self-attestation. Self-attest counts the participant's pass verdict written to the attestation file. Grader counts a pass under the deterministic shell-script grader after replaying the final workspace snapshot. All 40 completed sessions regrade cleanly under the offline pipeline.

| mode | n | self-attest pass | binary grader pass† | mean partial | median partial | top partial |
| --- | --- | --- | --- | --- | --- | --- |
| Solo | 12 | not written | 1 (8.3%) | 0.80 | 0.90 | 1.00 |
| Hybrid | 17 | 16 (94.1%) | 4 (23.5%) | 0.79 | 0.86 | 1.00 |
| Team | 11 | 11 (100%) | 0 (0%)‡ | 0.60 | 0.83 | 0.94 |

† Partial score is the fraction of grader checks satisfied and is robust to the offline-pipeline limitation flagged below. The binary-pass column requires every check, including a containerized pytest invocation that our offline pipeline cannot fully reproduce. ‡ Team binary 0/10 is therefore a lower bound. Several Team sessions reach 0.86 to 0.88 partial (one failing test away from binary pass).

The interpretation is the self-attestation gap. Team mode produces structurally near-complete artifacts (median partial 0.83, top 0.94) and the human Verifier role accepts every one of those grader-failing submissions (10/10). On the role-mixing pool, the Verifier false-accept rate is 49% on grader-failing runs (§[3.7](https://arxiv.org/html/2605.07073#S3.SS7)). The pilot is small and the Team-mode tasks are hard-skewed (5 of 7 unique tasks are hard or expert), so we report this as a directional confirmation rather than a precise human-vs-LLM ranking. The eligibility filter and minimum-cell rules used for the reported aggregates are stated in §[B.4](https://arxiv.org/html/2605.07073#A2.SS4).

Table 6: Pre-specified prompt-only versus structural-enforcement comparison. The primary prompt-only vs enforced contrast has 148 paired observations. The shared-history condition has lower coverage due to infrastructure exclusions and is treated as inconclusive. Per-run violation rate is the per-run mean fraction of turns flagged by a deterministic role-compliance rubric. McNemar tests are paired at the (model, task, seed) level with Holm-Bonferroni correction.

| Condition | Pass rate (%) [95% CI] | Per-run violation rate (%) [95% CI] | n |
| --- | --- | --- | --- |
| prompt_only | 42.7 [34.7, 50.0] | 6.4 [5.3, 7.6] | 150 |
| enforced | 40.5 [32.4, 48.6] | 6.2 [5.3, 7.3] | 148 |
| enforced_shared_history | 48.0 [38.2, 57.8] | 8.9 [7.6, 10.3] | 102 |

Pre-specified McNemar tests (Holm-Bonferroni adjusted): T1 (compliance, prompt-only vs enforced) p_adj = 0.113 on 148 pairs; T2 (outcome, prompt-only vs enforced) p_adj = 0.907 on 148 pairs; T3 (outcome, shared-history vs enforced) p_adj = 0.907 on 100 pairs.
### B.4 Analysis Protocol

The analysis pipeline applies the following decisions, fixed before the human dataset was re\-opened for analysis\.

1. Eligibility filter. A Hybrid session is analyzable iff overrode_grader ≤ 2 on the 1 to 5 scale and wall-clock ≥ 5 minutes. Sessions failing either criterion are reported but excluded from outcome comparisons. Solo and Team sessions have no analogous eligibility filter beyond the existing attention check.
2. Minimum cell. A per-task pass rate is reported only at ≥ 10 participant-sessions in the relevant cell, and an aggregate is reported only if ≥ 8 tasks satisfy the per-task minimum-cell rule.
3. Pairing. Solo-vs-LLM-Solo and Team-vs-LLM-Team comparisons use McNemar's exact test on (task id, seed) pairs, with Holm-Bonferroni adjustment across the three pairings (Solo, Hybrid, Team); a sketch of these statistics follows at the end of this subsection.
4. Effect-size focus. Headline numbers are paired differences in pass rate with Wilson 95% confidence intervals. Significance tests are secondary. We do not chase p < .05 thresholds.
5. Open-ended responses are summarized by two annotators using the same role-collapse rubric as Section [3.3](https://arxiv.org/html/2605.07073#S3.SS3), and inter-annotator agreement is reported as Cohen's κ.
6. Demographics, recruitment, and IRB. As specified in the public protocol document accompanying the platform release.

Stop condition for collection. The target is ≥ 10 sessions per mode on each of the 20 stratified target tasks, with the eligibility filter applied.
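The statistics in items 3 and 4 are standard; a self-contained sketch under our own naming, with pairs given as (human_pass, llm_pass) booleans paired at the (task id, seed) level:

```python
import math

def mcnemar_exact(pairs: list[tuple[bool, bool]]) -> float:
    """Two-sided exact McNemar test on paired binary outcomes."""
    b = sum(1 for x, y in pairs if x and not y)   # first passes, second fails
    c = sum(1 for x, y in pairs if not x and y)   # first fails, second passes
    n = b + c
    if n == 0:
        return 1.0
    # Exact binomial tail on the discordant pairs at p = 0.5, doubled and capped.
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_bonferroni(pvals: list[float]) -> list[float]:
    """Step-down Holm adjustment, preserving the input order."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    adjusted, running = [0.0] * len(pvals), 0.0
    for rank, i in enumerate(order):
        running = max(running, (len(pvals) - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson 95% interval for a pass rate."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```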

## Appendix C Leaderboard

### C.1 Studies and counts

Table 7: The source of truth for counts and studies in the paper.

| Quantity | Count | Used in |
| --- | --- | --- |
| Templates | 851 | full pool |
| Seeded instances | 931 | full pool |
| Base / refined categories | 19 / 21 | taxonomy |
| Leaderboard subset | 90 | leaderboard (5 conditions, 13 models) |
| TeamBench-Verified subset | 57 | audited subset for grader-sensitive analyses |
| Reference ablation pool | 155 | 5-condition runs on gemini-3-flash |
| Cross-provider grid | 25 | 27 configs × 3 seeds, 3 families |
| Pre-specified enforcement | 25 | 3 conditions × 2 seeds, 3 families |
| Models evaluated in the leaderboard | 13 | panel includes models with valid runs |
### C.2 Leaderboard construction

The TeamBench leaderboard subset was selected by stratified sampling within each of the 21 refined categories. Quotas were set proportional to category size in the full pool, rounded to whole tasks, with a floor of one task per category. Within each category, tasks were sampled with priority on (a) presence of complete reference ablation data, then (b) difficulty-mix balance, then (c) authoring date as tie-breaker. The selection script and the selection JSON ship in the public release. Table [10](https://arxiv.org/html/2605.07073#A3.T10) gives the per-category quota and difficulty mix. The "difficulty mix" columns in Tables [9](https://arxiv.org/html/2605.07073#A3.T9) and [10](https://arxiv.org/html/2605.07073#A3.T10) use the per-template author label; the objective grader-check rubric in Figure [2](https://arxiv.org/html/2605.07073#S2.F2)(c) covers the full 851-template pool.
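A minimal sketch of that quota rule, assuming a largest-remainder style rounding and that the floors never exceed the budget (helper names are ours; the released selection script is authoritative):

```python
def category_quotas(pool_sizes: dict[str, int], budget: int) -> dict[str, int]:
    """Proportional quotas with a floor of one task per category."""
    total = sum(pool_sizes.values())
    raw = {c: budget * n / total for c, n in pool_sizes.items()}
    quotas = {c: max(1, int(r)) for c, r in raw.items()}  # round down, floor 1
    # Hand out any remaining slots by largest fractional remainder.
    while sum(quotas.values()) < budget:
        c = max(raw, key=lambda k: raw[k] - quotas[k])
        quotas[c] += 1
    return quotas

def select_within_category(tasks: list[dict], quota: int) -> list[dict]:
    """Priority: (a) complete ablation data, (b) difficulty mix, (c) date."""
    ranked = sorted(tasks, key=lambda t: (not t["has_ablation"],
                                          t["difficulty"],
                                          t["authored"]))
    return ranked[:quota]
```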

Table 8: Task distribution. The top block lists the difficulty breakdown for the 153 originally-authored templates with complete 5-condition ablation metadata. Two extension tasks with later-added ablation data join this core to form the 155-task reference ablation pool used in §[3.4](https://arxiv.org/html/2605.07073#S3.SS4). The lower block lists the remaining 698 templates by origin (650 GitHub bug reports, 30 real data-science, 10 real incident response, 8 originally-authored without ablation yet) for a total of 851 unique templates and 931 seeded evaluation instances. Difficulty in this table uses the per-template author label; the objective grader-check rubric in Figure [2](https://arxiv.org/html/2605.07073#S2.F2)(c) covers the full 851-template pool.

| Category (Original) | Easy | Medium | Hard | Expert | Total |
| --- | --- | --- | --- | --- | --- |
| Security (SEC/CRYPTO) | – | 1 | 10 | 4 | 15 |
| Data Engineering (D/SQL) | 1 | 4 | 5 | 1 | 11 |
| Incident Response (INC) | – | 1 | 9 | 1 | 11 |
| Software Eng. (SWE/GO/JS) | 1 | 1 | 8 | 1 | 11 |
| Testing (TEST) | 1 | 2 | 6 | – | 9 |
| Multi-language (MULTI) | – | – | 9 | – | 9 |
| Operations (OPS) | 1 | 2 | 5 | 1 | 9 |
| Pipeline / Integration (PIPE/INT) | – | 3 | 6 | – | 9 |
| Long-Horizon (LH/SCALE/SYNTH) | – | – | 4 | 5 | 9 |
| Adversarial (TRAP) | – | – | 7 | 1 | 8 |
| Information Retrieval (IR) | 1 | 3 | 4 | – | 8 |
| Policy (POL) | 1 | 3 | 4 | – | 8 |
| Cross-System (CROSS) | – | – | 7 | – | 7 |
| Specification (SPEC) | – | 2 | 5 | – | 7 |
| Code Review (CR) | 1 | 3 | 2 | – | 6 |
| Distributed (DIST) | – | – | 4 | 2 | 6 |
| Expertise Asymmetry (EA) | – | – | 5 | – | 5 |
| Negotiation (NEG) | – | 1 | 4 | – | 5 |
| Real-World (GH1–11) | – | – | 11 | – | 11 |
| Originally-authored core (153 with ablation) | 7 | 26 | 104 | 16 | 153 |

| Remaining templates | Difficulty | Total |
| --- | --- | --- |
| Originally-authored extension (no ablation yet) | difficulty pending | 8 |
| Real GitHub bug reports (GH) | medium-difficulty maintainer-reported bugs | 650 |
| Real data-science (RDS) | hard, parameterized over canonical public datasets | 30 |
| Real incident response (RINC) | hard, adapted from public post-mortems | 10 |
| Total unique templates | | 851 |
| Seeded evaluation instances | after multi-seed expansion at the released seeds | 931 |

Table 9: TeamBench leaderboard stratified by category tier. Quotas are proportional to category size in the full pool. ∗ Pool size counts candidate seeded entries after category expansion (eligible for leaderboard selection), not unique templates. Seeded instance counts are reported in Table [7](https://arxiv.org/html/2605.07073#A3.T7). The per-category breakdown, including which tasks already have complete reference ablation data, is in Appendix [C.2](https://arxiv.org/html/2605.07073#A3.SS2).

| Tier | Categories | Tasks | Pool size∗ | Difficulty mix |
| --- | --- | --- | --- | --- |
| Large (≥ 5 selected) | 8 | 55 | 854 | Easy 0 / Medium 15 / Hard 29 / Expert 11 |
| Medium (3 to 4 selected) | 10 | 31 | 72 | Easy 0 / Medium 1 / Hard 25 / Expert 5 |
| Small (1 to 2 selected) | 3 | 4 | 5 | Easy 0 / Medium 1 / Hard 3 / Expert 0 |

Table 10: TeamBench leaderboard per-category quotas. "With abl." is the number of selected tasks that have complete reference 5-condition ablation data. The rest are evaluated for the first time on the leaderboard.

| Refined category | Quota | Pool size | With abl. | Difficulty mix |
| --- | --- | --- | --- | --- |
| Other (Misc, broad GH scrape) | 7 | 422 | 1 | hard: 4, medium: 3 |
| GitHub Issues (curated) | 12 | 221 | 12 | medium: 12 |
| Real Data Science | 8 | 90 | 0 | hard: 8 |
| Incident Response | 6 | 26 | 6 | expert: 1, hard: 5 |
| Security | 6 | 32 | 6 | expert: 4, hard: 2 |
| Software Eng. | 6 | 31 | 6 | expert: 2, hard: 4 |
| Data Engineering | 5 | 15 | 5 | expert: 1, hard: 4 |
| Operations | 5 | 17 | 5 | expert: 3, hard: 2 |
| Testing | 4 | 12 | 4 | hard: 4 |
| Adversarial | 3 | 7 | 3 | hard: 3 |
| Code Review | 3 | 6 | 3 | hard: 2, medium: 1 |
| Cross-System Integration | 3 | 5 | 3 | hard: 3 |
| Distributed Systems | 3 | 7 | 3 | expert: 3 |
| Information Retrieval | 3 | 8 | 3 | hard: 3 |
| Long-Horizon | 3 | 6 | 3 | expert: 2, hard: 1 |
| Multi-language | 3 | 6 | 3 | hard: 3 |
| Pipeline | 3 | 6 | 3 | hard: 3 |
| Policy | 3 | 9 | 3 | hard: 3 |
| Specification | 2 | 3 | 2 | hard: 1, medium: 1 |
| Integration | 1 | 1 | 0 | hard: 1 |
| Negotiation | 1 | 1 | 1 | hard: 1 |
### C.3 Cross-provider role-mixing

Table [11](https://arxiv.org/html/2605.07073#A3.T11) reports all 27 cross-provider configurations on the 25-task stratified subset, three-seed pooled. The top three configurations cluster within 4 points of each other and their Wilson CIs overlap. The cost spread runs from $0.65 to $39.58 (all-Anthropic), a 60.9× spread for a 4-point pass-rate spread between the cheapest and most expensive 22.7%-tier configurations.

Table 11: All cross-provider configurations on the 25-task subset, sorted by pass rate. Each row pools three seeds at n = 75 runs per configuration. Cost is total USD across all 75 runs. Pass/$ is pass rate divided by total cost. LLM tool-call turns per session is the mean across 75 sessions of (Planner steps + Executor steps + Verifier steps + any Executor steps inside Verifier-triggered remediation rounds), where each "step" is one assistant message that issues a tool call or returns a verdict. Bold pass rate marks the top configuration, and bold pass/$ marks the most cost-efficient.

| Rank | Planner | Executor | Verifier | Pass | Partial | Cost $ | Pass/$ | Tool-call turns |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Gemini-3f | Haiku-4.5 | Haiku-4.5 | **26.7%** | 0.695 | 20.52 | 0.013 | 54.1 |
| 2 | GPT-5.4m | GPT-5.4m | Haiku-4.5 | 22.7% | 0.679 | 10.98 | 0.021 | 30.9 |
| 3 | Gemini-3f | Haiku-4.5 | GPT-5.4m | 22.7% | 0.672 | 11.77 | 0.019 | 49.8 |
| 4 | Haiku-4.5 | Haiku-4.5 | Haiku-4.5 | 22.7% | 0.639 | 39.58 | 0.006 | 54.3 |
| 5 | GPT-5.4m | Gemini-3f | Gemini-3f | 21.3% | 0.612 | 3.73 | 0.057 | 36.2 |
| 6 | Haiku-4.5 | Haiku-4.5 | GPT-5.4m | 21.3% | 0.650 | 29.88 | 0.007 | 45.0 |
| 7 | GPT-5.4m | Gemini-3f | GPT-5.4m | 20.0% | 0.600 | 2.99 | 0.067 | 31.3 |
| 8 | Haiku-4.5 | GPT-5.4m | GPT-5.4m | 20.0% | 0.639 | 9.53 | 0.021 | 37.2 |
| 9 | GPT-5.4m | GPT-5.4m | GPT-5.4m | 18.7% | 0.592 | 2.09 | **0.089** | 25.2 |
| 10 | Gemini-3f | GPT-5.4m | GPT-5.4m | 18.7% | 0.627 | 2.36 | 0.079 | 38.3 |
| 11 | Haiku-4.5 | Gemini-3f | GPT-5.4m | 18.7% | 0.583 | 12.45 | 0.015 | 46.7 |
| 12 | Haiku-4.5 | Gemini-3f | Haiku-4.5 | 18.7% | 0.618 | 21.62 | 0.009 | 51.7 |
| 13 | GPT-5.4m | Haiku-4.5 | GPT-5.4m | 17.3% | 0.448 | 6.15 | 0.028 | 19.9 |
| 14 | GPT-5.4m | Haiku-4.5 | Gemini-3f | 17.3% | 0.466 | 6.99 | 0.025 | 23.5 |
| 15 | Haiku-4.5 | Gemini-3f | Gemini-3f | 17.3% | 0.605 | 13.34 | 0.013 | 50.9 |
| 16 | Haiku-4.5 | Haiku-4.5 | Gemini-3f | 17.3% | 0.646 | 29.48 | 0.006 | 51.6 |
| 17 | Gemini-3f | Gemini-3f | GPT-5.4m | 16.0% | 0.557 | 3.67 | 0.044 | 45.3 |
| 18 | Haiku-4.5 | GPT-5.4m | Haiku-4.5 | 16.0% | 0.602 | 18.31 | 0.009 | 44.9 |
| 19 | Gemini-3f | GPT-5.4m | Gemini-3f | 14.7% | 0.559 | 3.33 | 0.044 | 45.7 |
| 20 | Gemini-3f | Haiku-4.5 | Gemini-3f | 14.7% | 0.496 | 8.13 | 0.018 | 35.6 |
| 21 | Haiku-4.5 | GPT-5.4m | Gemini-3f | 14.7% | 0.617 | 10.87 | 0.013 | 43.9 |
| 22 | GPT-5.4m | Haiku-4.5 | Haiku-4.5 | 14.7% | 0.437 | 12.41 | 0.012 | 24.2 |
| 23 | Gemini-3f | Gemini-3f | Gemini-3f | 13.3% | 0.570 | 4.98 | 0.027 | 54.6 |
| 24 | GPT-5.4m | Gemini-3f | Haiku-4.5 | 13.3% | 0.399 | 7.98 | 0.017 | 24.5 |
| 25 | GPT-5.4m | GPT-5.4m | Gemini-3f | 12.0% | 0.602 | 2.66 | 0.045 | 29.5 |
| 26 | Gemini-3f | GPT-5.4m | Haiku-4.5 | 10.7% | 0.317 | 7.43 | 0.014 | 32.0 |
| 27 | Gemini-3f | Gemini-3f | Haiku-4.5 | 10.7% | 0.371 | 8.11 | 0.013 | 37.0 |
### C.4 Leaderboard detail

Table [12](https://arxiv.org/html/2605.07073#A3.T12) reports the full per-condition pass-rate matrix that backs Figure [3](https://arxiv.org/html/2605.07073#S3.F3). The leaderboard panel distinguishes models with Full coverage at n ≥ 30 from those below threshold (rows below the divider). Aggregation deduplicates by (model, condition, task) across canonical and resumed-checkpoint result files, with later writes winning.
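The dedup rule is simple enough to state as code; a sketch assuming JSON-lines result files listed in canonical-then-checkpoint order (field names are ours):

```python
import json
from pathlib import Path

def deduplicate(result_files: list[str]) -> dict:
    """Merge result files by (model, condition, task); later writes win."""
    latest = {}
    # Canonical files first, resumed-checkpoint files after, so a resumed run
    # overwrites the earlier record for the same key.
    for path in result_files:
        for line in Path(path).read_text().splitlines():
            record = json.loads(line)
            key = (record["model"], record["condition"], record["task"])
            latest[key] = record
    return latest
```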

Table 12: TeamBench leaderboard per-condition pass rates (%). Models are sorted by max(Solo, Full) descending. Bold marks the highest cell across the row.

| Model | Solo | Restricted | No Plan | No Eval | Full |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.7 | 35.6 | 33.3 | 35.6 | 33.3 | **37.8** |
| GPT-5.4 Mini | **33.3** | 23.3 | 25.6 | 24.4 | 28.9 |
| Claude Haiku 4.5 | 12.2 | **31.1** | 18.9 | 1.1 | 28.9 |
| Gemini-3.1 Pro | 27.8 | 22.2 | 16.7 | 25.6 | **28.9** |
| Claude Sonnet 4.6 | 7.8 | **27.8** | 10.0 | 6.7 | **27.8** |
| GPT-5.4 | 12.2 | **35.6** | 23.3 | 34.4 | 27.8 |
| Gemma 4 31B | **27.8** | 25.6 | 24.4 | 20.0 | 22.2 |
| Gemini-3 Flash | 13.3 | 18.9 | 14.4 | **27.8** | 25.6 |
| Gemini-3.1 Flash Lite | 5.6 | **21.1** | 8.9 | 17.8 | 17.8 |
| gpt-oss-20b | **17.8** | **17.8** | 12.2 | 7.8 | 2.2 |
| Qwen 3 14B | **5.6** | 2.2 | 2.2 | 1.1 | 2.2 |
| Qwen 3 32B | **5.6** | 3.3 | 0.0 | **5.6** | 1.1 |
| Qwen 3 8B | 2.2 | **5.6** | 1.1 | 3.3 | 0.0 |

Footnote: The default grader marks a run as a failure whenever any check fails, including the attestation check that only verifies the agent wrote a valid attestation JSON at the end of the run. In Solo mode the attestation is the agent's self-confirmation and does not measure task quality. Several frontier models systematically forget the attestation file even when every structural check passes. The promotion rule used here counts a run as a pass if all failures are attestation-related and no other failure mode is present, while keeping the original verdict alongside the promoted verdict so the operation is fully reversible. The released aggregate records both verdicts so the headline numbers can be reproduced under either rule.

The Qwen 3 family scores near zero across all conditions and the gpt-oss-20b Full collapses to 2.2% despite a higher Solo of 17.8%. The dominant cause is malformed tool calls and context overflow rather than raw capability (Appendix [E.4](https://arxiv.org/html/2605.07073#A5.SS4)).

## Appendix D Verified Subset and Reproducibility

We audit whether the graders support the reported outcomes. The audit has four components. A canonical solution check, following the idea of SWE-Bench Verified, records whether a known correct solution passes the grader. Mutation testing applies AST-level mutations to a passing workspace and asks whether the grader catches them. Cross-model discrimination flags tasks where every leaderboard model produces the same outcome. A grader plausibility audit asks three independent LLM judges to label 285 stratified runs as PASS or FAIL. We run this audit in two variants. With the deterministic verdict shown to judges, Fleiss's κ = 0.74. With the verdict hidden (a leakage-free re-run on the same 285 tuples, Appendix [D.1](https://arxiv.org/html/2605.07073#A4.SS1)), Fleiss's κ = 0.07. The 0.74 value is therefore largely anchored by the verdict, and we report *neither* variant as independent grader validation. The leakage-free run surfaces a systematic disagreement: Gemini-3 Flash returns PASS on 53% of cases against the grader's 14%, mirroring its over-acceptance as a Verifier (§[3.3](https://arxiv.org/html/2605.07073#S3.SS3)). TeamBench-Verified requires the canonical-solution and discrimination checks, and applies mutation testing where source-level mutation is applicable. 57 tasks (63.3% of the leaderboard) clear these thresholds. "Verified" means *audited* on the pillars that apply rather than expert-verified per task. On the role-mixing pool restricted to the Verified subset, the Verifier false-accept rate is 38.7%. The same failure pattern therefore appears on the audited subset (Appendix [D.1](https://arxiv.org/html/2605.07073#A4.SS1)).

### D.1 Per-pillar detail

Table [13](https://arxiv.org/html/2605.07073#A4.T13) gives per-pillar eligibility, passing denominators, and the rule applied to non-eligible tasks. In this submission, "Verified" means that a task passes the applicable audit checks. It does not mean that every task has been expert adjudicated.

Table 13: Per-pillar audit denominators for the leaderboard subset.

| Pillar | Eligible / passing | Rule for non-eligible | Notes |
| --- | --- | --- | --- |
| Canonical solution | 90 / 58 | required | 58 pass the LLM-run evidence path |
| Mutation testing | 9 / 7 | exempt where AST mutation is inapplicable | 1 task at kill-rate 0.42 excluded |
| Cross-model discrim. | 90 / 58 | required | spread ≥ 0.10 across the leaderboard panel |
| Grader-plausibility | 285 / – | sampled audit only, not eligibility-gating | Fleiss's κ = 0.74 with verdict, 0.07 without |

#### Canonical-solution check.

A task is verified if some known-correct solution exists that the deterministic grader actually accepts. We accept three sources of proof, in order: the seed-0 workspace as-is, any historical agent run whose post-run workspace passed the grader, or the upstream PR diff applied to a GH-sourced task. 58 of 90 leaderboard tasks pass via the LLM-run evidence path. We label this an audited rather than expert-verified evidence path because canonical solutions are not always human-authored.

#### Mutation\-killing grader\.

For each task with a known-passing workspace, we apply small AST-level mutations (operator swaps, return-value flips, body deletions) and check whether the grader catches them. The threshold is a mutation kill rate ≥ 0.5. Of the leaderboard tasks with a known-passing workspace (n = 9), 7 exceed 0.5 and 4 exceed 0.7. O6_perf_tuning is exempt by design, since its grader scores a deliverable artifact via a simulator rather than running workspace source code. Tasks without a known-passing workspace are not eligible for source-level mutation testing.
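A minimal sketch of the operator-swap and comparison-flip mutations using only the standard library (the harness's real mutation set is broader):

```python
import ast

class SwapOperators(ast.NodeTransformer):
    """Create a mutant by replacing + with - and == with !=."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.NotEq() if isinstance(op, ast.Eq) else op
                    for op in node.ops]
        return node

def mutate(source: str) -> str:
    """Return mutated source. The kill rate is the fraction of such mutants
    that the deterministic grader fails (Verified threshold: >= 0.5)."""
    tree = SwapOperators().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+
```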

#### Cross\-model discrimination\.

Tasks where every leaderboard model produces the same outcome carry no comparative signal. We require a per-task pass-rate spread across the leaderboard panel of ≥ 0.10. 58 of 90 leaderboard tasks meet this bar.

#### Grader\-plausibility audit \(with\-verdict and leakage\-free\)\.

We run the same 285 stratified judgments under two prompt variants. *With-verdict*: judges see the spec, a workspace summary, the verifier attestation, and the deterministic-grader verdict. *No-verdict*: the same 285 tuples with the deterministic verdict removed from the prompt, so each judge forms its own opinion from spec and artifacts alone. Table [14](https://arxiv.org/html/2605.07073#A4.T14) reports both. Cross-judge agreement falls from substantial (κ = 0.74) under the with-verdict prompt to slight (κ = 0.07) under the leakage-free prompt, indicating that the original 0.74 figure was anchored by the verdict. We therefore use *neither* audit as independent grader validation. Two patterns survive both variants and align with the main paper. First, Haiku 4.5 and GPT-5.4 Mini track the deterministic grader's PASS rate (14% overall) reasonably well even without the verdict (PASS rates of 11% and 10% respectively, agreement 82% and 83%). Second, Gemini-3 Flash returns PASS on 53% of cases when the verdict is hidden, mirroring its 77% false-accept rate as a Verifier (§[3.3](https://arxiv.org/html/2605.07073#S3.SS3)). The honest reading is that the audit is informative about how each model judges pass/fail in isolation and consistent with the agent-side Verifier-fails finding, not that it independently validates the deterministic grader.

Table 14: LLM grader-plausibility audit on 285 stratified runs, with the deterministic verdict shown to judges (left) and hidden from judges (right). Substantial-looking agreement under the original protocol is largely anchored by the verdict, while the leakage-free protocol reveals systematic disagreement, particularly Gemini-3 Flash's over-acceptance.

| Quantity | With verdict (original) | No verdict (leakage-free) |
| --- | --- | --- |
| Three-way Fleiss's κ | 0.74 | 0.07 |
| Binary Krippendorff's α | 0.74 | 0.07 |
| Pairwise Cohen's κ, Haiku ↔ Gemini | 0.66 | 0.18 |
| Pairwise Cohen's κ, Haiku ↔ GPT | 0.91 | 0.27 |
| Pairwise Cohen's κ, Gemini ↔ GPT | 0.70 | 0.10 |
| Haiku 4.5 PASS rate | 14.0% | 10.9% |
| Gemini-3 Flash PASS rate | 14.7% | 53.3% |
| GPT-5.4 Mini PASS rate | 13.7% | 9.5% |
| Deterministic grader PASS rate | 14.4% | 14.4% |
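For reference, the agreement statistic in the first row of Table 14 reduces, for three judges and two labels, to the following sketch (input is the per-run count of PASS votes; names are ours):

```python
def fleiss_kappa(pass_counts: list[int], n_raters: int = 3) -> float:
    """Fleiss' kappa for binary PASS/FAIL labels from a fixed rater count."""
    N = len(pass_counts)
    # Observed agreement: fraction of rater pairs agreeing on each run.
    per_item = [(p * (p - 1) + (n_raters - p) * (n_raters - p - 1))
                / (n_raters * (n_raters - 1)) for p in pass_counts]
    p_bar = sum(per_item) / N
    # Chance agreement from the marginal PASS proportion.
    p_pass = sum(pass_counts) / (N * n_raters)
    p_e = p_pass ** 2 + (1 - p_pass) ** 2
    # Degenerate case: every vote identical makes chance agreement exactly 1.
    return 1.0 if p_e == 1.0 else (p_bar - p_e) / (1 - p_e)
```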
#### Triage of non\-Verified tasks\.

Of the non-Verified tasks, 11 are *near-miss-very-close* (best historical partial ≥ 0.9 but no run passes), 11 are *near-miss* (partial 0.7 to 0.9), 10 are *solvable-in-principle* (partial 0.4 to 0.7), and one (TRAP1_spec_conflict) sits at mutation kill rate 0.42, just below threshold. The per-task ledger is part of the public release.

#### Verifier false\-accept robustness on the audited subset\.

Table 15: Comparison with related benchmarks and benchmark ecosystems across eight design axes. SA = single-agent, MA = multi-agent. ✓ = supported, ✗ = not supported, ∼ = partial or planned. The first four axes describe what the benchmark measures. The last four describe whether the benchmark is versioned, holds out evaluation seeds, releases run logs, and is scheduled for periodic refresh.

| | Benchmark | Struct. enf. | Role abl. | Cross-prov. | Contam. res. | Vers./live | Verified | Human bsl. | Pub. traces |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SA | SWE-Bench [[10](https://arxiv.org/html/2605.07073#bib.bib1)] | – | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ∼ |
| | Terminal-Bench [[17](https://arxiv.org/html/2605.07073#bib.bib38)] | – | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ∼ |
| | LiveCodeBench [[9](https://arxiv.org/html/2605.07073#bib.bib4)] | – | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| | GAIA [[18](https://arxiv.org/html/2605.07073#bib.bib9)] | – | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| | MLE-Bench [[3](https://arxiv.org/html/2605.07073#bib.bib13)] | – | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ∼ |
| | BrowseComp [[25](https://arxiv.org/html/2605.07073#bib.bib14)] | – | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| | AgentBench [[16](https://arxiv.org/html/2605.07073#bib.bib10)] | – | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| MA | MultiAgentBench [[28](https://arxiv.org/html/2605.07073#bib.bib35)] | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| | DevBench [[13](https://arxiv.org/html/2605.07073#bib.bib5)] | ✗ | ✗ | ∼ | ✗ | ✗ | ✗ | ✗ | ✗ |
| | ChatDev [[20](https://arxiv.org/html/2605.07073#bib.bib17)] | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| | GPTSwarm [[30](https://arxiv.org/html/2605.07073#bib.bib21)] | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| | TeamBench (Ours) | ✓ | ✓ | ✓ | ✓ | ∼ | ✓ | ∼ | ✓ |

### D.2 Reproducibility

#### Code and data availability\.

The benchmark, all generators, the evaluation harness, the role-mixing protocol, and the 5-condition ablation runner are released under a permissive open-source license at [https://teambench.github.io/](https://teambench.github.io/). The leaderboard task selection JSON, the per-condition reference-ablation logs, the 2,025 deduplicated role-mixing per-task records, and the cost ledger are included in the release. The Docker images for the Planner, Executor, and Verifier roles are built with deterministic dependencies pinned by hash.

#### Determinism and replication\.

All graders are deterministic shell scripts. All API calls use temperature 0. Task generators take a seed and produce byte-identical workspaces across machines. Replication of the reference ablation requires 1,165 task runs under gemini-3-flash-preview. Replication of the cross-provider role-mixing study requires 2,025 task runs across three commercial providers (27 configurations × 25 tasks × 3 seeds), with measured total spend of $326.04.

#### Held\-out seeds and contamination\.

Seeds 0 through 2 are released with the public benchmark. Seeds 5 and above are reserved for the hidden leaderboard. Across 50 tasks, 72% of held-out seed workspaces differ on every file from their development counterparts and the mean Jaccard similarity is 0.71 ± 0.23. The leaderboard runner re-rolls task seeds at submission time using a leaderboard-only RNG state. Parameterized seeds reduce exact value memorization but do not eliminate semantic contamination, so for studies whose primary claim is memorization resistance we recommend evaluating on the held-out leaderboard seeds rather than on public seeds.
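A sketch of the file-level similarity check, assuming whole-file hashing over relative paths (the released check may weight files differently):

```python
import hashlib
from pathlib import Path

def jaccard_similarity(dev_dir: str, heldout_dir: str) -> float:
    """Jaccard similarity between two workspace snapshots at file granularity."""
    def fingerprints(root: Path) -> set:
        return {(str(p.relative_to(root)),
                 hashlib.sha256(p.read_bytes()).hexdigest())
                for p in root.rglob("*") if p.is_file()}
    a = fingerprints(Path(dev_dir))
    b = fingerprints(Path(heldout_dir))
    return len(a & b) / len(a | b) if a | b else 1.0
```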

#### Responsible use\.

The adversarial\-trap and security\-vulnerability tasks contain plausible\-but\-incorrect security patterns by design\. They are synthetic evaluation cases and are not designed to target real systems\. The cryptographic tasks use intentional weaknesses \(nonce reuse, low PBKDF2 iterations, truncated authentication tags\) for evaluation\. These patterns must not be deployed\. We include a per\-task usage note in the task metadata that flags adversarial content, and we recommend that hosted leaderboard submissions run in network\-isolated containers\.

#### Dataset card and metadata\.

A dataset card following Gebru et al. [[6](https://arxiv.org/html/2605.07073#bib.bib31)] accompanies the public release. It documents the collection process, the annotation procedure for difficulty and category labels, intended uses, known biases (English-language tasks only, Python-heavy), and a contact channel for issues. The dataset card is versioned with the release tags so that any update to a task triggers a card revision. The dataset is hosted on Hugging Face at [https://huggingface.co/datasets/ybkim95/teambench](https://huggingface.co/datasets/ybkim95/teambench) and ships with Croissant 1.1 machine-readable metadata, including Responsible AI fields. Both the dataset content and the code release are distributed under the MIT license.

#### Benchmark governance plan\.

Table[16](https://arxiv.org/html/2605.07073#A4.T16)summarizes the governance features that ship with the initial release and the items scheduled for the first follow\-up\.

Table 16: Benchmark governance plan. ✓ ships with the initial release. ∼ is scheduled for the first follow-up release.

| Feature | Status | Notes |
| --- | --- | --- |
| Public code, generators, harness | ✓ | MIT license. |
| Public per-task run records | ✓ | 1,165 reference-ablation runs and 2,025 role-mixing runs (27 configs × 25 tasks × 3 seeds). |
| Hidden held-out seeds | ✓ | Seeds 5 and above. |
| Versioned releases (semver) | ✓ | v1.0 ships with this paper. |
| Issue tracker for problematic tasks | ✓ | GitHub issues, with per-task usage notes. |
| Submission protocol | ✓ | JSON contract documented in the release. |
| Canary string in public docs | ∼ | To discourage training-data ingestion. |
| TeamBench-Verified-Human subset | ∼ | Future expert-adjudicated subset, distinct from the leaderboard. |
| TeamBench-Live monthly refresh | ∼ | Newly-generated GitHub issues. |
| Compare-by-cost leaderboard view | ∼ | Already collected. |
| Public trace download | ∼ | Per-run agent transcripts. |

Three of the planned-status rows of Table [16](https://arxiv.org/html/2605.07073#A4.T16) are the prioritized follow-up items: an expert-adjudicated TeamBench-Verified-Human subset, a canary string for contamination audits, and a public per-run trace download.

### D.3 Adherence to the Agentic Benchmark Checklist

We map each criterion of the Agentic Benchmark Checklist (ABC) of Zhu et al. [[29](https://arxiv.org/html/2605.07073#bib.bib34)] to the corresponding TeamBench design choice (Table [17](https://arxiv.org/html/2605.07073#A4.T17)). Items that do not apply to a deterministic shell-script grader (fuzzing) are marked N/A.

Table 17: TeamBench against the Agentic Benchmark Checklist. T = task validity, O = outcome validity, R = reporting.

| ID | Criterion | TeamBench design choice |
| --- | --- | --- |
| **Task validity** | | |
| T.1 | Specify tool / package versions | Docker images pin all dependencies by hash; per-task setup script lists required toolchain |
| T.2 | Manage service availability and rate limits | Harness retries on 429 / 503 with key rotation; infrastructure failures count as task failures under the n = 90 convention |
| T.3 | Detect API interruptions and terminate evaluation | Per-run logs capture HTTP status; failed runs are recorded with error code |
| T.4 | Clean up legacy state | Each run starts in a fresh container with the seed-0 workspace |
| T.5 | Isolate agents from ground truth | Grader runs in a separate container; expected outputs are written to a grader-only directory the agent cannot read |
| T.6 | Reproducible environment | Hash-pinned Docker images; deterministic shell-script graders |
| T.7 | Verify ground-truth annotation | Canonical-solution audit verifies a passing workspace exists for 58 of 90 LB90 tasks |
| T.8 | Verify task solvability | Same canonical-solution check; the 5-condition ablation pool achieves non-zero per-condition pass on every task |
| T.9 | Provide oracle solver | Canonical-via-LLM-run evidence path serves as the solver demonstration |
| T.10 | Inspect outliers in pilot | Pilot evidence ledger; outlier sessions documented |
| **Outcome validity** | | |
| O.a.1 | Semantically equivalent expressions | Graders use exact-string or deterministic equivalence on JSON / structured fields |
| O.a.2 | Handle redundant words | Graders normalize whitespace; output formats are documented in the spec |
| O.b.1 | Negation modifiers | Policy and spec tasks include negation tests in the grader |
| O.b.2 | Prevent listing-all-answers success | Multi-check graders fail most checks if every option is listed |
| O.b.3 | Avoid guessing success | Multiple grader checks per task; partial-score grader rewards specific outputs |
| O.c.1 | Pilot LLM-judge accuracy | 285-tuple LLM-judge audit (App D.1) |
| O.d.1 | Manually verify test cases | Test cases are maintainer-reported (GitHub-curated) or author-designed for originally-authored tasks |
| O.d.2 | Coverage as quality metric | Per-task graders use multiple specific checks (median 10 across the originally-authored set) |
| O.e.1–3 | Fuzzing comprehensiveness | N/A. Tasks are spec-driven, not fuzzed |
| O.f.1 | Cover all branches of user workflows | Multiple checks per task; full pass requires every check to pass |
| O.f.2 | Eliminate non-determinism | Deterministic graders; temperature 0; fixed seeds |
| O.g.1 | Ground-truth all outcomes | Specs enumerate accepted outputs; partial-score grader awards partial progress |
| O.g.2 | Relevant + irrelevant states | Workspaces include distractor files (Adversarial / Trap categories) |
| O.g.3 | Sufficiently complex state space | Workspaces span 0 to 40 files; multi-file edits required |
| O.h.1 | Explicit format assumptions | Specs document required output formats |
| O.h.2 | Avoid guessing success | Same as O.b.3 |
| O.i.1 | Metrics correlate with reasoning | Partial-score checks tied to specific behaviors documented in spec |
| **Benchmark reporting** | | |
| R.1 | Open-source dataset and harness | Public release on Hugging Face and GitHub |
| R.2 | Open evaluation harness | Same release |
| R.3 | Contamination prevention | Held-out seeds ≥ 5, parameterized generators, Croissant 1.1 metadata |
| R.4 | Plan for consistent updates | Versioned releases; governance plan in Table [16](https://arxiv.org/html/2605.07073#A4.T16) |
| R.5 | Specify capabilities | Planner / Executor / Verifier capabilities in Table [1](https://arxiv.org/html/2605.07073#S2.T1) |
| R.6 | Articulate construct validity | §3 ties each ablation to the role-marginal it measures |
| R.7 | Document mitigation | §4 and §5 list scope and limitations |
| R.8 | Quantitative limitation impact | TeamBench-Verified rate 38.7% vs full-pool 49.4% quantifies the grader-completeness impact |
| R.9 | Quantitative impact | Same as R.8 |
| R.10 | Statistical significance | Wilson 95% CIs, paired bootstrap (10,000 iterations), McNemar with Holm-Bonferroni |
| R.11 | Interpretation guidelines | §3.2 and §3.4 spell out interpretation |
| R.12 | Baseline comparisons | Solo / Restricted / No-Plan / No-Evaluate / Full Team |
| R.13 | Trivial agent baselines | Restricted is the trivial baseline |

## Appendix E Additional Results

### E.1 Per-category teamwork effect

![Refer to caption](https://arxiv.org/html/2605.07073v1/x7.png)

Figure 8: Per-category mean team uplift on the reference 5-condition ablation. Uplift is full-team partial score minus Solo partial score. Blue indicates positive uplift, red indicates negative. Bars include 95% bootstrap CIs (10,000 iterations). The figure includes only categories with three or more tasks for stable intervals. Per-category counts use the canonical task-to-category mapping in the released dataset.

The categories with the largest positive bootstrap means are Testing, Specification, and Policy. The categories with the largest negative means are Pipeline, Incident Response, and Multi-language. Categories with positive uplift share a structural property in which the specification carries decision rules that the Executor cannot independently derive from the workspace alone. The wide CIs on small-n categories (for example Specification at n = 3 and Cross-System at n = 3) overlap zero, so those category-level effects should be read as suggestive rather than conclusive.

### E.2 Verifier confusion matrix

We extract the Verifier verdict from each run's attestation file and the deterministic grader result from the grader's score record for every role-mixing run that produced both. Table [18](https://arxiv.org/html/2605.07073#A5.T18) reports the per-Verifier-provider breakdown on the complete pool of 1,083 attestation-bearing runs across all 27 configurations and three seeds (2,025 unique role-mixing runs minus 942 that lacked an attestation due to harness errors, run timeouts, or pending verifier turns). The pooled false-accept rate of 49.4% (Wilson 95% CI [45.9, 52.9]) matches the headline reported in the Discussion (Section [4](https://arxiv.org/html/2605.07073#S4)).

Table 18: Verifier confusion matrix on the 1,083 role-mixing runs with an attestation and a grader score, broken down by the Verifier's provider family. Actual pass is the deterministic grader result. Verdict is the Verifier's verdict (pass or fail). The Verifier accepts incomplete work between 36% and 77% of the time across providers and almost never rejects correct work.

| Verifier provider | TP | FP | FN | TN |
| --- | --- | --- | --- | --- |
| Haiku-4.5 (n = 331) | 98 | 140 | 0 | 93 |
| Gem-3f (n = 199) | 70 | 87 | 16 | 26 |
| GPT-5.4m (n = 553) | 117 | 157 | 4 | 275 |
| Pooled (n = 1,083) | 285 | 384 | 20 | 394 |

| Metric | Pooled | GPT (best) | Gemini (worst) |
| --- | --- | --- | --- |
| Accuracy | 62.7% | 70.9% | 48.2% |
| False-reject rate | 6.6% | 3.3% | 18.6% |
| False-accept rate | 49.4% | 36.3% | 77.0% |

The pattern across the three providers is consistent. The Verifier rarely rejects valid work, with the pooled false-reject rate at 6.6% and the per-provider rates between 0% and 19%. The Verifier accepts incomplete work between 36% and 77% of the time. GPT-5.4 Mini is the most discriminative Verifier in this slot. Gemini-3 Flash is the least discriminative, accepting 77% of grader-failing runs and producing the worst overall accuracy at 48%. The headline implication is that current Verifiers in this single-pass file-based protocol return pass by default rather than functioning as a quality gate.
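The Table 18 metrics follow directly from two boolean fields per run; a sketch with field names of our own choosing:

```python
def verifier_metrics(runs: list[dict]) -> dict:
    """Confusion-matrix rates for a Verifier against the deterministic grader."""
    tp = sum(r["grader_pass"] and r["verdict_pass"] for r in runs)
    fp = sum(not r["grader_pass"] and r["verdict_pass"] for r in runs)
    fn = sum(r["grader_pass"] and not r["verdict_pass"] for r in runs)
    tn = sum(not r["grader_pass"] and not r["verdict_pass"] for r in runs)
    return {
        "accuracy": (tp + tn) / len(runs),
        "false_accept": fp / (fp + tn),  # grader-failing runs the Verifier approves
        "false_reject": fn / (fn + tp),  # grader-passing runs the Verifier rejects
    }
```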

#### Missingness sensitivity\.

The pool of 1,083 attestation-bearing runs excludes 942 runs of the 2,025-cell role-mixing grid where the agent did not produce a valid attestation. Treating those 942 missing attestations as failures rather than excluding them (Table [19](https://arxiv.org/html/2605.07073#A5.T19)) lowers the effective false-accept rate but does not change the qualitative finding that Verifiers do not function as a quality gate, because the missing-attestation runs are themselves verifier-side or system-side failures.

Table 19: Verifier-failure rate under different treatments of the 942 runs that lacked a valid attestation.

| Treatment of missing attestations | Effective rate | Interpretation |
| --- | --- | --- |
| Exclude (attestation-bearing only) | 49.4% false-accept among grader-failing | verdict-quality bound |
| Treat as fail verdict | 22.3% false-accept among grader-failing | conservative bound |
| Count as verifier-side failure | 66.5% verifier failure overall | reliability bound |

### E.3 Mapping MAST failure modes to TeamBench detectors

Table [20](https://arxiv.org/html/2605.07073#A5.T20) maps the 14 failure modes of Cemri et al. [[2](https://arxiv.org/html/2605.07073#bib.bib32)] to the role-violation detectors and grader signals used in TeamBench. Modes that depend on multi-turn dialogue do not apply to the single-pass file-based protocol and are marked N/A.

Table 20: MAST failure modes mapped to TeamBench detectors. Quoted definitions are from Cemri et al. [[2](https://arxiv.org/html/2605.07073#bib.bib32)].

| MAST | Failure mode | TeamBench detector / signal |
| --- | --- | --- |
| **System design issues** | | |
| FM-1.1 | Disobey task specification | verifier-modifies-code event; deterministic-grader fail |
| FM-1.2 | Disobey role specification | verifier-modifies-code, planner-writes-code, executor-self-approves |
| FM-1.3 | Step repetition | per-run turn count; not a primary detector |
| FM-1.4 | Loss of conversation history | N/A (single-pass file-based protocol) |
| FM-1.5 | Unaware of termination conditions | missing-attestation rate |
| **Inter-agent misalignment** | | |
| FM-2.1 | Conversation reset | N/A (no multi-turn dialogue) |
| FM-2.2 | Fail to ask for clarification | N/A (single-turn) |
| FM-2.3 | Task derailment | Optimistic verdict failures (one-line approvals) |
| FM-2.4 | Information withholding | Planner-to-Executor relay fidelity 0.21 (§[3.4](https://arxiv.org/html/2605.07073#S3.SS4)) |
| FM-2.5 | Ignored other agent's input | verifier-modifies-code (Verifier ignores Executor evidence) |
| FM-2.6 | Reasoning-action mismatch | Echo verdicts (Verifier reasons completeness without checking) |
| **Task verification** | | |
| FM-3.1 | Premature termination | Optimistic-verdict subtype: one-line approvals |
| FM-3.2 | No or incomplete verification | 49.4% Verifier false-accept rate (§[3.3](https://arxiv.org/html/2605.07073#S3.SS3)) |
| FM-3.3 | Incorrect verification | False-reject rate 6.6% pooled; Gemini-3 Flash 18.6% (App F) |
### E.4 Open-source failures

The smallest open-source models on the TeamBench leaderboard score near zero across all conditions. Three failure modes account for nearly all the failures.

- **Malformed tool calls.** The model emits text that resembles a tool call (for example `<tool_call>...`) but does not conform to the expected JSON schema. The harness records a no-op turn.
- **Output-budget exhaustion.** Some reasoning models emit long reasoning blocks that exhaust the harness's per-turn output budget (8,192 tokens) before producing any tool calls.
- **Missing attestation.** The agent produces output but fails to write a valid `attestation.json`. The grader reports an automatic failure regardless of workspace changes.

Tool-use reliability is a binding constraint at the small open-weight tier rather than raw language-modeling capability. Malformed tool calls and context overflow are counted as task failures rather than errors excluded from the denominator (Table [12](https://arxiv.org/html/2605.07073#A3.T12)). We have implemented lenient parsing for several common malformed patterns. The aggregate impact is that every Qwen 3 row and the gpt-oss-20b Full cell collapse to under 3%, an order of magnitude below the smallest fully-functional commercial models. At this tier, TeamBench measures tool-call reliability more than coordination, which we report as a benchmark limitation.
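A sketch of what such lenient parsing can look like; the recovered shapes here (tag-wrapped and fence-wrapped JSON) are illustrative rather than the harness's actual rule set:

```python
import json
import re

def parse_tool_call(raw: str):
    """Best-effort recovery of a tool call from a model turn; None = no-op turn."""
    # 1. Well-formed JSON body.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. JSON wrapped in <tool_call> tags or a markdown code fence.
    match = (re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL)
             or re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL))
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass
    return None
```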

### E.5 Expertise asymmetry

Table 21: Expertise-Asymmetry results pooled across three Gemini models (N = 200 total runs). Analysis Value = Full − No-Analysis. All five tasks show positive analysis value; EA5 (dependency audit) benefits most.

| Task | Planner tool | Full | No-Anal. | Solo | Anal. value |
|---|---|---|---|---|---|
| EA1: Security Scan | bandit | 50.6% | 30.7% | 14.5% | +19.9 |
| EA2: Coverage Gap | pytest-cov | 78.8% | 71.2% | 66.2% | +7.7 |
| EA3: Type Safety | mypy | 58.8% | 50.4% | 48.2% | +8.5 |
| EA4: Code Quality | ruff/pylint | 50.0% | 34.6% | 30.8% | +15.4 |
| EA5: Dep. Audit | pip-audit | 64.8% | 37.8% | 23.7% | +27.0 |
| Pooled mean | | 60.6% | 44.9% | 36.7% | +15.7 |

The pooled TNI across three models is 0.265 (bootstrap 95% CI [0.170, 0.356]), entirely above zero. On this five-task exploratory subset, tool-augmented teamwork provides a statistically significant benefit. Gemini-3 Flash Preview achieves a TNI above 1.0 (1.25), meaning tool-specialized teams can exceed the full-access Solo baseline on this subset. The implication is that multi-agent systems can be designed around complementary capabilities, not just complementary information.
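The pooled interval above is a standard percentile bootstrap over per-run TNI values; a generic sketch follows. The values and resample count are placeholders, not the paper's data.

```python
import random

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    return (means[int(n_resamples * alpha / 2)],
            means[int(n_resamples * (1 - alpha / 2))])

# Placeholder per-run TNI values, NOT the paper's data.
tni = [0.10, 0.32, 0.41, 0.22, 0.35, 0.18, 0.30, 0.24]
print(bootstrap_ci(tni))  # an interval that excludes zero indicates reliable uplift
```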

### E.6 Equalizer mechanism

We test three candidate mechanisms for the conditional team-uplift pattern reported in Section [3.4](https://arxiv.org/html/2605.07073#S3.SS4), on a 472-pair cross-model panel.

#### H1: Specification relay.

If team benefit stems from the Planner relaying specification knowledge, tasks with larger necessity gaps (Solo minus Restricted) should show greater team uplift. The opposite holds: the necessity gap is negatively correlated with team uplift ($r=-0.446$, $p<10^{-24}$, $n=472$). Spec relay alone is not sufficient to explain the pattern.

#### H2: Step-limit exhaustion.

If single agents exhaust their step budget on hard tasks while teams distribute the budget across roles, Solo elapsed time should predict failure. Solo elapsed time shows zero correlation with Solo score ($r=0.00$, $p=0.99$), and weak models fail equally fast across difficulties (hard/easy elapsed ratio $=0.93$). Not supported.

#### H3: Implicit chain-of-thought.

If the Planner's structured output functions as implicit chain-of-thought for the Executor, the No-Plan condition should underperform Restricted on hard tasks. Empirically, No-Plan and Restricted are statistically indistinguishable on hard tasks (mean difference $-0.006$, $p=0.66$, $n=202$). Not supported.

#### Capability-conditional uplift.

A quadratic model of team uplift versus Solo score ($R^{2}=0.60$) outperforms a linear model ($R^{2}=0.40$) on the 472-pair panel, and the parametric vertex sits at a Solo score of $0.54$ with a fitted peak of $7.6$ points ($p=0.002$). The equal-size quintile analysis on the originally-authored reference ablation reported in Section [3.4](https://arxiv.org/html/2605.07073#S3.SS4) places the peak at Q1 (hardest). The two statistics answer different questions on different pools, and we report both.
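The vertex statistic comes from an ordinary least-squares quadratic fit; a minimal numpy sketch is below. The arrays are placeholders, not the 472-pair panel.

```python
import numpy as np

# Placeholder (solo score, team uplift) pairs, NOT the paper's panel.
solo = np.array([0.10, 0.20, 0.35, 0.50, 0.55, 0.70, 0.85, 0.95])
uplift = np.array([0.02, 0.05, 0.07, 0.08, 0.075, 0.04, -0.01, -0.05])

# Quadratic fit uplift ~ a*solo^2 + b*solo + c; vertex at -b/(2a).
a, b, c = np.polyfit(solo, uplift, deg=2)
vertex = -b / (2 * a)
peak = np.polyval([a, b, c], vertex)

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

quad_r2 = r2(uplift, np.polyval([a, b, c], solo))
lin_r2 = r2(uplift, np.polyval(np.polyfit(solo, uplift, deg=1), solo))
print(f"peak {peak:.3f} at solo={vertex:.2f}; R2 quad {quad_r2:.2f} vs lin {lin_r2:.2f}")
```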

### E.7 Enforcement sensitivity

We re-run the pre-specified outcome contrasts (T2, T3) on the 450 planned cells after substituting the 50 excluded runs with deterministic failure outcomes. The substitution barely changes prompt-only (150 of 150 already observed) and enforced (only 2 of 150 missing), but shifts enforced-shared-history substantially because the Gemini-3 Flash quota incident affected 48 of 150 cells: pass rate moves from 48.0% (observed) to 32.7% (with missing as fail). The pre-specified McNemar contrasts give T2 (prompt-only vs. enforced) $p_{\text{raw}}=0.45$ under sensitivity (vs. $0.45$ original) and T3 (shared-history vs. enforced) $p_{\text{raw}}=0.05$ under sensitivity (vs. $0.77$ original). Under Holm-Bonferroni over the three planned tests, neither contrast reaches significance ($p_{\text{adj}}>0.10$). The qualitative finding that prompt-only and enforced pass rates are statistically indistinguishable holds under both treatments. The directional interpretation of T3, however, flips under sensitivity: with the missing cells treated as failures, enforced-shared-history sits below enforced rather than above. We therefore report T3 as inconclusive rather than as evidence for the shared-history condition. The 3.6× reduction in *verifier-modifies-code* events under enforcement is computed on per-turn events rather than per-run outcomes and is unchanged.
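A minimal sketch of the sensitivity machinery follows: missing cells imputed as failures, an exact McNemar test on the paired outcomes, and Holm-Bonferroni over the planned contrasts. The outcome arrays are placeholders, and the actual pre-registered analysis code may differ.

```python
from math import comb

def mcnemar_exact(pairs):
    """Exact two-sided McNemar p-value over paired pass/fail outcomes."""
    n10 = sum(1 for a, b in pairs if a and not b)  # discordant: a passes only
    n01 = sum(1 for a, b in pairs if b and not a)  # discordant: b passes only
    n, k = n10 + n01, min(n10, n01)
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_bonferroni(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    adjusted, running = [0.0] * len(pvals), 0.0
    for rank, i in enumerate(order):
        running = max(running, (len(pvals) - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted

def impute_missing_as_fail(outcomes):
    """None marks a cell lost to e.g. a quota incident; count it as a fail."""
    return [bool(o) for o in outcomes]

# Placeholder per-cell outcomes for two conditions (1 pass, 0 fail, None missing).
enforced = impute_missing_as_fail([1, 1, 0, None, 1, 0, 1, 0])
shared = impute_missing_as_fail([1, 0, 0, 1, None, None, 1, 0])
p_t3 = mcnemar_exact(list(zip(shared, enforced)))
print(p_t3, holm_bonferroni([0.45, p_t3, 0.02]))  # raw p for T2, T3, and a third test
```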

### E.8 Enhanced Solo

Table 22: Enhanced solo baselines on TeamBench-Mini (28 tasks, gemini-3-flash-preview). Structured prompting does not close the team gap; Solo-CoT is worse than the standard solo.

| Condition | Avg. partial score | vs. Solo |
|---|---|---|
| Solo (standard) | 0.704 | baseline |
| Solo-CoT | 0.640 | −6.4 |
| Solo-2Pass | 0.698 | −0.6 |
| Full Team | 0.754 | +5.0 |

The Solo-CoT condition instructs the single agent to first read the full specification carefully, create a detailed plan, and then implement the solution. Solo-2Pass runs the agent in two sequential phases (specification comprehension, then implementation). Neither condition improves over the minimal solo prompt. On TeamBench-Mini, these two structured-prompting baselines do not close the gap to the full team, which suggests the measured benefit is not merely a consequence of asking a single model to plan more explicitly.

## Appendix F: Implementation Details and Examples

### F.1 Agent tools

Table [23](https://arxiv.org/html/2605.07073#A6.T23) lists the four tools available in TeamBench and their access permissions per role. All tools follow a unified interface: each accepts typed parameters and returns structured output (stdout, stderr, exit code). Tool access is enforced programmatically; attempts to use a disallowed tool return a permission-denied error.

Table 23: Tool definitions and per-role access permissions. ✓ = allowed, ✗ = denied. The Planner and Verifier cannot execute arbitrary commands or modify the workspace.

| Tool | Description | Planner | Executor | Verifier |
|---|---|---|---|---|
| read(path) | Read file contents from allowed directories | ✓ | ✓ | ✓ |
| write(path, content) | Write to a file in allowed directories | ✗ | ✓ | Attest. only |
| run(cmd) | Execute a shell command in the workspace | ✗ | ✓ | ✗ |
| send_message(to, content) | Send a message to another role | ✓ | ✓ | ✓ |
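A minimal sketch of programmatic tool gating consistent with Table 23 follows; the permission matrix mirrors the table, while the dispatcher structure and names are our own illustration.

```python
# Hedged sketch of the per-role tool gate implied by Table 23.
PERMISSIONS = {
    "planner":  {"read", "send_message"},
    "executor": {"read", "write", "run", "send_message"},
    "verifier": {"read", "send_message", "write_attestation"},
}

class PermissionDenied(Exception):
    pass

def check_tool(role, tool, path=None):
    """Raise PermissionDenied unless `role` may call `tool` on `path`."""
    effective = tool
    # Special case from the table: the Verifier's only legal write is
    # its attestation file ("Attest. only").
    if role == "verifier" and tool == "write" and path and path.endswith("attestation.json"):
        effective = "write_attestation"
    if effective not in PERMISSIONS[role]:
        raise PermissionDenied(f"{role} may not call {tool}({path or ''})")

check_tool("executor", "run")                        # allowed
check_tool("verifier", "write", "attestation.json")  # allowed
try:
    check_tool("verifier", "run")
except PermissionDenied as exc:
    print(exc)  # structured permission-denied error, as described above
```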

#### File access boundaries.

Each role has a distinct set of allowed read and write roots, enforced via absolute-path prefix checks. The boundaries here match the matrix in Appendix [A.1](https://arxiv.org/html/2605.07073#A1.SS1); a sketch of the prefix check follows the list below.

- **Planner.** Reads the full specification (spec.md) and the task directory. Reads and writes messages/. No read or write access to the workspace or to reports/.
- **Executor.** Reads brief.md, the workspace, and messages/; writes to the workspace, messages/, and reports/ (its own test logs). No read access to the full specification.
- **Verifier.** Reads the specification, the workspace (read-only), and reports/ (read-only, the Executor's test logs). Writes messages/ and attestation.json. No write access to the workspace or reports/, and no command execution. The Verifier therefore cannot run tests itself and must rely on the Executor's test logs for evidence.
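A minimal sketch of the absolute-path prefix check described above; the directory layout follows the list, while the base path and function names are illustrative assumptions.

```python
from pathlib import Path

# Per-role allowed read/write roots, following the list above. The
# /run/task base path is an illustrative assumption.
ROOTS = {
    "planner":  {"read": ["spec.md", "task/", "messages/"],
                 "write": ["messages/"]},
    "executor": {"read": ["brief.md", "workspace/", "messages/"],
                 "write": ["workspace/", "messages/", "reports/"]},
    "verifier": {"read": ["spec.md", "workspace/", "reports/", "messages/"],
                 "write": ["messages/", "attestation.json"]},
}

BASE = Path("/run/task")

def is_allowed(role, mode, path):
    """True iff `path` resolves under one of the role's allowed roots."""
    target = (BASE / path).resolve()  # resolving defeats ../ traversal
    for root in ROOTS[role][mode]:
        allowed = (BASE / root).resolve()
        if target == allowed or allowed in target.parents:
            return True
    return False

print(is_allowed("executor", "write", "workspace/app/main.py"))  # True
print(is_allowed("executor", "read", "spec.md"))                 # False
print(is_allowed("verifier", "write", "workspace/x.py"))         # False
```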

When running in Docker mode, these boundaries are enforced via container bind mounts: each role's container mounts only the directories it is permitted to access. Shell-escape attempts cannot circumvent the restrictions because the disallowed paths are simply not present in the container filesystem.
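A hedged sketch of what a per-role container launch might look like under this scheme; the image name, mount targets, and directory layout are illustrative assumptions, not the released configuration.

```python
import subprocess

def run_verifier_container(task_dir: str) -> None:
    """Launch a Verifier container that bind-mounts only permitted paths.

    Read-only access uses Docker's :ro mount suffix; paths the Verifier
    may not touch are never mounted, so a shell escape inside the
    container finds nothing to read or write.
    """
    cmd = [
        "docker", "run", "--rm", "--network", "none",
        "-v", f"{task_dir}/spec.md:/task/spec.md:ro",
        "-v", f"{task_dir}/workspace:/task/workspace:ro",
        "-v", f"{task_dir}/reports:/task/reports:ro",
        "-v", f"{task_dir}/messages:/task/messages",        # read-write
        "-v", f"{task_dir}/attestation:/task/attestation",  # verdict output (assumed layout)
        "teambench/agent:latest", "--role", "verifier",     # assumed image and args
    ]
    subprocess.run(cmd, check=True)
```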

### F.2 Core system prompts

**Planner System Prompt**

> You are the Planner. You have access to the full task specification. Your job is to understand the requirements, decompose the goal, and create a clear plan. You CANNOT execute commands or modify the workspace. You MUST communicate your plan to the Executor by calling the send_message tool. Highlight hidden constraints and edge cases the Executor might miss.

**Executor System Prompt**

> You are the Executor. You can run commands and edit files in the workspace. You only have access to a brief summary of the task. Follow the Planner's instructions carefully. For file reads and writes, use paths relative to the workspace (for example app/main.py). When done with your work, send a message to the Verifier and output DONE. Ask the Planner for clarification if requirements are unclear.

**Verifier System Prompt**

> You are the Verifier. You independently verify whether the task was completed correctly. You have read-only access to the workspace and reports. You have access to the full task specification for checking compliance. You CANNOT execute commands or modify the workspace. Your job is to check every requirement, identify violations, and produce attestation.json. If requirements are not met, send feedback to the Executor and set verdict='fail'. Only set verdict='pass' when ALL requirements are satisfied. When done, output DONE.

**Solo System Prompt**

> You are a software engineer. You have access to the full task specification and can execute any command. Complete the task to the best of your ability.

The Solo agent has access to all four tools with no restrictions on file paths or command execution. Prompts are intentionally concise: structural enforcement of tool access (Appendix [F.1](https://arxiv.org/html/2605.07073#A6.SS1)) carries most of the role-separation burden, rather than relying on prompt compliance.
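The attestation format is not reproduced in full here. A minimal illustrative shape consistent with the prompts above might look like the following, where only the pass/fail verdict field is implied by the prompts and the remaining fields are assumptions.

```python
import json

def write_attestation(path, verdict, checked, notes=""):
    """Write a minimal attestation; field names beyond 'verdict' are assumed."""
    assert verdict in ("pass", "fail")
    with open(path, "w") as f:
        json.dump({"verdict": verdict,
                   "requirements_checked": checked,
                   "notes": notes}, f, indent=2)

def is_valid_attestation(path):
    """Mirror the grader's gate: the file exists, parses, and carries a verdict."""
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        return False
    return data.get("verdict") in ("pass", "fail")

write_attestation("attestation.json", "fail",
                  ["REQ-1 output format", "REQ-2 empty-input handling"],
                  "REQ-2 not handled; sent feedback to the Executor")
print(is_valid_attestation("attestation.json"))  # True
```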

### F.3 Task examples: when teams help and when they hurt

Two contrasting tasks illustrate the conditional value of agent teams measured throughout the paper.

#### When the team is necessary (HIGH-TNI): MULTI1_polyglot_build.

The workspace contains a multi-language project (Python orchestration, a Go binary, a Node CLI) with a build that fails because the language toolchains expect mutually inconsistent file layouts. Only the specification (visible to the Planner) names which layout is canonical and which two are vestigial. A Restricted agent without specification access reads three plausible file trees and picks the wrong one, scoring 0.10. The full-access Solo agent does see the spec but spends most of its turn budget on the wrong sub-build before pivoting, scoring 0.45. The full team passes (1.00): the Planner extracts the canonical-layout instruction, the Executor follows it directly, and the Verifier rejects two intermediate attempts that broke the cross-language build. TNI = 2.00 on this task: the team *exceeds* what the single agent recovers from full access.

#### When the team hurts (TEAM-HURTS): GH6_queryset_union.

A real Django bug report on QuerySet union handling. The Solo agent scores 0.25 partial by returning a working but shallow fix, and the Restricted agent matches at 0.25. The Team-No-Eval condition reaches 0.33, while the full team scores 0.25, lower than No-Eval. Trace inspection shows the Verifier reading an Executor patch that would have improved the grader score, asking for an overly defensive guard, and the Executor over-correcting in a way that breaks one of the original tests. The task is small enough that the Executor would have submitted the right answer on its own, but the Verifier's intervention introduces rework that displaces correct work. This pattern of role-induced rework on tasks the single agent partially solved is exactly the capability-floor inversion measured at the aggregate level in §[3.4](https://arxiv.org/html/2605.07073#S3.SS4): structure helps when the Executor lacks any starting point and hurts when it would have succeeded alone.
