Multi-agent Framework for Time-Sensitive Complementary Collaboration in Minecraft
Summary
The paper introduces TickingCollabBench, a Minecraft-based multi-agent benchmark for time-sensitive complementary collaboration tasks with dynamic environments, and demonstrates that LLMs frequently fail under such conditions compared to a global-knowledge oracle.
# Multi-agent Framework for Time-Sensitive Complementary Collaboration in Minecraft
Source: [https://arxiv.org/html/2606.15684](https://arxiv.org/html/2606.15684)
Juheon Yi, Jinglu Wang, Xiaoyi Zhang, Yan Lu Microsoft Research Asia \{jyi,jinglwa,xiaoyizhang,yanlu\}@microsoft\.com
###### Abstract
We present TickingCollabBench, a Minecraft-based multi-agent benchmark for a novel class of *time-sensitive complementary collaboration tasks*. Our benchmark reflects four core characteristics of real-world collaboration: agent heterogeneity, mandatory collaboration, dynamic environments, and strict real-time constraints with failure risks. To enable this, we develop the TickingCollab framework, which supports the generation of diverse dynamic events and abstracts Minecraft’s primitive APIs to enable declarative YAML task specifications for composing these events. Building on this, we design a feasibility-aware automated benchmark generation pipeline, where an LLM drafts structurally diverse task configurations and a feasibility verifier filters out invalid ones using approximate constraints. Evaluations demonstrate that long inference latency and the inherent difficulty of coordinating under partial observability and agent heterogeneity cause LLMs to frequently fail in dynamic environments and fall significantly short of a global-knowledge oracle.
## 1 Introduction
Real-world multi-agent collaboration often requires agents with partial observability to synergistically combine heterogeneous capabilities and complete tasks under strict time constraints. For instance, a team of embodied robots with different tools and mobility may need to coordinate to respond to spreading hazards before a rescue deadline. Similarly, in collaborative work settings, personal agents running on different users’ devices may observe only local data and have heterogeneous computing resources, requiring them to jointly process distributed information to respond to users’ requests in a timely manner. However, composing such time-sensitive collaborative scenarios in the real world and evaluating agents at scale is difficult due to safety risks, deployment costs, and limited controllability over environment dynamics. As a result, many prior works focus on static tasks with shared context, homogeneous agents, and no explicit time-to-failure constraints Zhuge et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib29)); Chen et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib22)); Yu et al. ([2024a](https://arxiv.org/html/2606.15684#bib.bib25)).
To bridge this gap, Minecraft has emerged as a scalable testbed for composing complex tasks and dynamic environments, and for systematically controlling agent capabilities such as tools, mobility, and perception. However, existing Minecraft-based multi-agent collaboration benchmarks White et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib4)); Schipper et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib5)); Long et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib6)); Yu et al. ([2024b](https://arxiv.org/html/2606.15684#bib.bib2)); Dong et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib3)) still suffer from critical limitations:
Figure 1: Time-sensitive complementary collaboration tasks in TickingCollabBench. Panels: (a) Prepare for a crisis; (b) Mine vanishing blocks; (c) Raid a boss; (d) Agent observations and capabilities.

- Insufficient emphasis on real-time, dynamic collaboration. Existing tasks often feature homogeneous agents and are individually solvable, rendering genuine collaboration optional rather than mandatory. Furthermore, their static environments and best-effort objectives allow agents to rely on one-time offline planning, facing little to no failure risk from delayed decisions or runtime dynamic events (quantitative analysis in [Table 2](https://arxiv.org/html/2606.15684#S2.T2)).
- Limited framework support for dynamic tasks. Existing Minecraft frameworks are primarily designed for static environments and lack built-in support for runtime dynamic events (e.g., spreading floods, object/monster spawn and despawn). Consequently, to introduce such dynamics, agent developers must use low-level Minecraft APIs to build custom server plugins from scratch. This high technical barrier creates severe development overhead, thereby limiting the creation of diverse and complex collaboration scenarios.
We present TickingCollab, a benchmark suite and framework for evaluating LLM agents on a novel class of *time-sensitive complementary collaboration tasks* in Minecraft. TickingCollabBench ([Figure 1](https://arxiv.org/html/2606.15684#S1.F1)) targets scenarios where agents with heterogeneous capabilities and partial observability must tightly integrate their complementary skills. Crucially, the environments continuously change, and failure to rapidly adapt directly causes task failure. Supported by quantitative comparisons with prior benchmarks ([Tables 1](https://arxiv.org/html/2606.15684#S2.T1) and [2](https://arxiv.org/html/2606.15684#S2.T2)), we pose a fundamental question: *can LLMs orchestrate accurate and efficient collaboration across heterogeneous agents when faced with dynamic environments and real-time failure risks?*
To systematically construct and evaluate time-sensitive complementary collaboration tasks, our TickingCollab framework provides three key functionalities:
- Dynamic environment manager. Developers can declaratively inject complex runtime dynamics (e.g., lava waves, object/monster spawn/despawn) via Minecraft API-free YAML configurations (Listing 1), bypassing the severe overhead of custom plugin development.
- Feasibility-aware automated benchmark generation. To systematically explore a large and complex parameter space in composing time-sensitive complementary collaboration tasks, we design an automated pipeline where an LLM drafts diverse task configurations and a feasibility verifier filters out invalid ones via approximate constraints.
- Comprehensive evaluation. The framework isolates the LLM’s planning accuracy from inference latency via dual execution modes (*synchronous fixed-timestep* vs. *asynchronous real-time*), while supporting parallel simulation and fine-grained system cost logging.
We evaluate our benchmark using a baseline multi-agent collaboration scheme (TickingCollabAgent) with two distinct coordination policies (centralized and distributed) motivated by prior work Long et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib6)); White et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib4)). Our evaluation reveals that LLM inference latency poses a critical bottleneck in real-time asynchronous execution, frequently causing task failures due to time constraint violations. Furthermore, while centralized coordination outperforms distributed topologies by mitigating communication and inference overheads, it still underperforms an oracle, a non-LLM solution that leverages global, ground-truth access to the dynamic environment and human-crafted scheduling rules. These findings underscore the challenges of heterogeneous multi-agent planning under partial observability and call for efficient LLM inference and multi-agent coordination policies in dynamic environments.
## 2 TickingCollabBench Benchmark Suite
Table 1: Comparison of prior Minecraft agent benchmarks and TickingCollabBench. △: partially covered for the subset of tasks listed in parentheses.

| Category | Benchmark | Heterogeneous agent capabilities? | Mandatory collaboration? | Dynamic environment? | Real-time constraints (or failure risks)? |
|---|---|---|---|---|---|
| Single agent | MineRL Guss et al. ([2019](https://arxiv.org/html/2606.15684#bib.bib13)) | × | × | × | × |
| Single agent | MineDojo Fan et al. ([2022](https://arxiv.org/html/2606.15684#bib.bib1)) | × | × | △ (combat) | △ (combat, survive) |
| Single agent | Odyssey Liu et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib17)) | × | × | △ (combat) | △ (combat, survive) |
| Single agent | MCU Zheng et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib14)) | × | × | △ (combat) | △ (combat, survive) |
| Multi-agent | MineLand Yu et al. ([2024b](https://arxiv.org/html/2606.15684#bib.bib2)) | × | × | △ (combat) | △ (combat, survive) |
| Multi-agent | TeamCraft Long et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib6)) | ○ (different items) | ○ | × | × |
| Multi-agent | MineCollab White et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib4)) | ○ (different items) | △ (cook, craft) | × | × |
| Multi-agent | PillagerBench Schipper et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib5)) | × | × | ○ | ○ (vs. opponents) |
| Multi-agent | VillagerBench Dong et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib3)) | × | △ (escape room) | △ (harvest) | × |
| Multi-agent | TickingCollabBench | ○ | ○ | ○ | ○ |

TickingCollabBench consists of three representative *time-sensitive complementary collaboration tasks* ([Figure 1](https://arxiv.org/html/2606.15684#S1.F1)), where agents with heterogeneous capabilities must coordinate to achieve a global objective under partial observability of the dynamic environment. These tasks are designed as controlled analogues of the real-world collaboration scenarios discussed in [Section 1](https://arxiv.org/html/2606.15684#S1), emphasizing four key properties that are underrepresented in previous Minecraft multi-agent benchmarks (see [Table 1](https://arxiv.org/html/2606.15684#S2.T1) for a detailed comparison):
- Heterogeneous capabilities. Agents differ in action-relevant attributes and resources (e.g., perception range, speed, health, tools), necessitating complex complementary roles that cannot be captured by the simple inventory differences featured in prior benchmarks.
- Mandatory collaboration. Tasks are constructed so that success requires coordinating complementary capabilities, rather than simply scaling identical agents.
- Dynamic environments. Environments continuously change at runtime, invalidating one-shot offline plans and requiring online adaptation.
- Real-time constraints. Unlike prior penalty-free settings, delayed decision-making directly causes task failure, demanding timely execution.
The TickingCollab framework provides a declarative interface for specifying these tasks. This enables an automated benchmark generation pipeline: an LLM first generates diverse configurations incorporating the four key properties, which are then filtered based on feasibility criteria to yield 634 valid tasks for TickingCollabBench (details in [Section 3.1](https://arxiv.org/html/2606.15684#S3.SS1)).
### 2.1 Task Suite
Task #1: Prepare for a crisis ([Figure 1(a)](https://arxiv.org/html/2606.15684#S1.F1.sf1)). Agents must identify an approaching crisis (e.g., lava flood, avalanche) and collaboratively gather suitable materials scattered across the map to build an appropriate survival shelter (e.g., stone instead of flammable wood) before the crisis hits. Agents possess different mining tools (e.g., axes for wood, pickaxes for metal ores), perception ranges, and movement speeds. Survival depends on efficient role allocation, such as using long-perception agents as scouts while faster agents collect distant blocks. Unlike prior *construction tasks* that ignore time constraints and failure risks, this task demands timely execution and efficient coordination of heterogeneous agents.
Task #2: Mine vanishing blocks ([Figure 1(b)](https://arxiv.org/html/2606.15684#S1.F1.sf2)). Agents must mine target quotas for multiple block types that randomly appear and vanish after type-specific lifetimes. Given distinct movement speeds, perception ranges, and heterogeneous mining tools that dictate both block compatibility and mining efficiency (e.g., wood requires an axe, gold ore requires at least an iron-tier pickaxe, and higher-tier tools like diamond yield faster mining rates), agents must optimally assign targets by calculating travel and mining times against block lifetimes to avoid wasted effort. While previous *harvesting tasks* mostly feature static block placements and uniform agents, our task requires dynamic, capability-aware assignment.
Task #3: Raid a boss ([Figure 1(c)](https://arxiv.org/html/2606.15684#S1.F1.sf3)). Agents must defeat a boss monster that dynamically spawns various minions with different life points (HP) and damage values. Agents differ in base life points and in weapons with type-specific damage multipliers. Agents must jointly optimize target assignments based on type advantages and survivability, while strategically disengaging to consume health potions at scattered chests. Whereas prior *combat tasks* typically assume statically generated monsters and homogeneous agents, ours introduces dynamic enemy spawns and demands intricate combat coordination among diverse roles.
### 2.2 Collaboration Difficulty Metrics
Table 2: Comparison of multi-agent benchmark statistics. ↑ and ↓ indicate that higher and lower values, respectively, imply a more challenging task.

| Benchmark | Task | $\mathcal{H}\uparrow$ | $\mathcal{N}\uparrow$ | $\mathcal{D}\uparrow$ | $\tau\downarrow$ |
|---|---|---|---|---|---|
| MineLand Yu et al. ([2024b](https://arxiv.org/html/2606.15684#bib.bib2)) | Combat | 0 | 0 | 0 | 11.51s* |
| MineLand | Harvest | 0 | 0 | 0 | ∞ |
| MineLand | Build | 0 | 0 | 0 | ∞ |
| TeamCraft Long et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib6)) | Break | 0.72 | 0 | 0 | ∞ |
| TeamCraft | Build | 0.43 | 0.54 | 0 | ∞ |
| TeamCraft | Farm | 0.39 | 1.13 | 0 | ∞ |
| TeamCraft | Smelt | 0.68 | 0 | 0 | ∞ |
| MineCollab White et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib4)) | Build | 0 | 0 | 0 | ∞ |
| MineCollab | Cook | 0.43 | 1.87 | 0 | ∞ |
| MineCollab | Craft | 0.44 | 1.76 | 0 | ∞ |
| TickingCollabBench | Prepare | 0.78 | 1.11 | 1.79 | 44.03s |
| TickingCollabBench | Mine | 0.72 | 1.42 | 2.98 | 33.31s |
| TickingCollabBench | Raid | 0.31 | 1.39 | 0.38 | 25.37s |

\* Samples with $\tau=\infty$ are excluded.

We define four collaboration difficulty metrics to quantitatively evaluate how TickingCollabBench reflects time-sensitive, complementary collaboration.
- Agent heterogeneity ($\mathcal{H}$) measures the average pairwise normalized distance of attributes across all unique agent pairs $P$:

  $$\mathcal{H}=\frac{1}{|P|}\sum_{(i,j)\in P}\left(\frac{1}{|K|}\sum_{k\in K}\delta_{k}(a_{i},a_{j})\right).$$

  For each attribute $k\in K$, the distance $\delta_{k}$ handles continuous values (e.g., HP) via min-max normalization and sets (e.g., inventory items) via Jaccard distance:

  $$\delta_{k}(a_{i},a_{j})=\begin{cases}\frac{|v_{ik}-v_{jk}|}{v_{k}^{max}-v_{k}^{min}}&\text{if }k\text{ is continuous}\\ 1-\frac{|S_{ik}\cap S_{jk}|}{|S_{ik}\cup S_{jk}|}&\text{if }k\text{ is a set}\end{cases}$$

  where $v_{k}^{max}$ and $v_{k}^{min}$ denote predefined bounds of the parameter space. Thus, $\mathcal{H}\in[0,1]$, where $1$ implies maximal distinctiveness.
- Collaboration necessity ($\mathcal{N}$) estimates the ratio of the total task workload to the maximum single-agent capacity:

  $$\mathcal{N}=\min_{a\in A}\left(\sum_{k}\frac{\text{Workload}_{k}}{\text{Throughput}_{a,k}}\right)/T_{max}.$$

  The inner sum is the time agent $a\in A$ needs to process all targets $k$ sequentially, and the minimum selects the most capable single agent. Here, $\text{Workload}_{k}$ is the required block count (Tasks #1, #2) or total enemy HP (Task #3), while $\text{Throughput}_{a,k}$ is agent $a$’s mining speed or damage-per-second (DPS). Note that $\mathcal{N}$ is a conservative lower bound, as the time required for agent planning and movement is omitted; thus, $\mathcal{N}>1$ strictly guarantees that collaboration is mandatory.
- Environment dynamicity ($\mathcal{D}$) is quantified as:

  $$\mathcal{D}=\left(\text{Total Environment State Changes}\right)/T_{max}.$$

  A “state change” is any environment modification independent of agent actions, such as entity spawns/despawns (Tasks #2, #3) or crisis spread (Task #1). Higher $\mathcal{D}$ necessitates continuous online replanning.
- Time-to-failure ($\tau$) measures the time window before irreversible failure occurs, which is defined as: crisis arrival time (Task #1), minimum block lifespan (Task #2), or the time for spawned enemies to defeat all agents, calculated as $\frac{\text{Total Agent HP}}{\text{Total Enemy DPS}}$ averaged across spawn events (Task #3). Lower $\tau$ demands faster decision-making.
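The four metrics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the agent dictionaries, attribute names, and the choice to treat a target that an agent cannot process as infinite solo time are all hypothetical.

```python
from itertools import combinations

def attribute_distance(x, y, bounds=None):
    """delta_k: Jaccard distance for sets, min-max normalized gap for numbers."""
    if isinstance(x, (set, frozenset)):
        union = x | y
        return 1.0 - len(x & y) / len(union) if union else 0.0
    lo, hi = bounds  # predefined bounds of the parameter space
    return abs(x - y) / (hi - lo)

def heterogeneity(agents, bounds):
    """H: mean over unique agent pairs of the mean per-attribute distance."""
    pairs = list(combinations(agents, 2))
    keys = list(agents[0].keys())
    total = 0.0
    for a, b in pairs:
        total += sum(attribute_distance(a[k], b[k], bounds.get(k)) for k in keys) / len(keys)
    return total / len(pairs)

def necessity(workload, throughput, t_max):
    """N: fastest single-agent sequential completion time divided by T_max."""
    def solo_time(agent_tp):
        t = 0.0
        for k, w in workload.items():
            tp = agent_tp.get(k, 0.0)
            if tp <= 0:
                return float("inf")  # assumption: an unprocessable target is never finished
            t += w / tp
        return t
    return min(solo_time(tp) for tp in throughput.values()) / t_max

def dynamicity(num_state_changes, t_max):
    """D: agent-independent environment state changes per unit of simulation time."""
    return num_state_changes / t_max

def raid_time_to_failure(total_agent_hp, enemy_dps_per_spawn):
    """tau for Task #3: Total Agent HP / Total Enemy DPS, averaged over spawn events."""
    return sum(total_agent_hp / d for d in enemy_dps_per_spawn) / len(enemy_dps_per_spawn)

# Hypothetical two-agent team at opposite ends of every attribute range.
agents = [
    {"hp": 20, "speed": 3, "tools": {"axe"}},
    {"hp": 40, "speed": 6, "tools": {"pickaxe"}},
]
bounds = {"hp": (20, 40), "speed": (3, 6)}
print(heterogeneity(agents, bounds))
```

With these inputs every attribute distance is 1, so the two agents are maximally distinct and $\mathcal{H}=1$.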
As shown in [Table 2](https://arxiv.org/html/2606.15684#S2.T2), prior Minecraft benchmarks typically feature static environments, negligible failure risks, and limited agent heterogeneity (differing only in inventories), rarely necessitating true collaboration outside a few exceptions (e.g., crafting or cooking in MineCollab White et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib4)), combat in MineLand Yu et al. ([2024b](https://arxiv.org/html/2606.15684#bib.bib2))). In contrast, TickingCollabBench yields significantly higher metric scores, indicating far more challenging collaboration tasks. [Figure 2](https://arxiv.org/html/2606.15684#S2.F2) and [Table 3](https://arxiv.org/html/2606.15684#S2.T3) further show that the generated configurations cover broad ranges of difficulty and task parameters. Across the three tasks, these distributions reflect diverse collaboration challenges that vary in their focus (e.g., strict time-to-failure constraints in the “Raid a boss” task, and agent heterogeneity in the other two).

Figure 2: Distribution of collaboration difficulty metrics in TickingCollabBench. Internal lines show quartiles (Q1, median, Q3); overlaid dots show individual configurations. Panels: (a) Heterogeneity ($\mathcal{H}\uparrow$); (b) Necessity ($\mathcal{N}\uparrow$); (c) Dynamicity ($\mathcal{D}\uparrow$); (d) Time-to-failure ($\tau\downarrow$).

Table 3: Parameter distributions (min / mean ± std / max for continuous values).

| Task | Parameter | Distribution |
|---|---|---|
| Prepare for a crisis | # Agents | 2 / 3.88 ± 1.90 / 8 |
| Prepare for a crisis | Crisis type | lava 30%, snow 32%, water 38% |
| Prepare for a crisis | Crisis speed | 1 / 1.98 ± 0.82 / 3 |
| Prepare for a crisis | Agent speed | 3 / 4.49 ± 1.09 / 6 |
| Mine vanishing blocks | # Agents | 2 / 5.12 ± 2.25 / 8 |
| Mine vanishing blocks | Tool tier | golden 14%, iron 31%, stone 32%, wooden 23% |
| Mine vanishing blocks | # Target types | 2: 33%, 3: 36%, 4: 32% |
| Mine vanishing blocks | Block lifetime | 25 / 33.30 ± 5.36 / 40 |
| Mine vanishing blocks | # Blocks/wave | 8 / 9.03 ± 0.70 / 10 |
| Raid a boss | # Agents | 3 / 5.93 ± 1.54 / 8 |
| Raid a boss | Boss HP | 210 / 241.53 ± 18.73 / 280 |
| Raid a boss | Minion HP | 25 / 34.60 ± 3.87 / 40 |
| Raid a boss | # Minions/wave | 2 / 2.50 ± 0.56 / 4 |
| Raid a boss | Wave interval | 8 / 11.34 ± 1.88 / 16 |

## 3 TickingCollab Framework
[Figure 3](https://arxiv.org/html/2606.15684#S3.F3) illustrates the TickingCollab framework architecture.
- Task metadata generator ([Section 3.1](https://arxiv.org/html/2606.15684#S3.SS1)) provides descriptive task composition via declarative YAML without requiring Minecraft API expertise, enabling an automated generation pipeline to construct TickingCollabBench.
- Task orchestrator ([Section 3.2](https://arxiv.org/html/2606.15684#S3.SS2)) translates the metadata into dynamic runtime events via the *dynamic environment manager* and bridges agents to the Minecraft server using Mineflayer PrismarineJS ([2014](https://arxiv.org/html/2606.15684#bib.bib11)). It supports both synchronous (fixed-timestep) and asynchronous (real-time) execution to decouple the LLM agent’s reasoning accuracy from latency.
- Multi-agent runtime ([Section 3.3](https://arxiv.org/html/2606.15684#S3.SS3)) provides a modular abstraction for the *agent core* and *communication manager*, streamlining the development of custom collaborative agents.

Figure 3: TickingCollab framework architecture.

### 3.1 Task Metadata Generator and Automated Benchmark Generation
Listing 1: Example metadata (“prepare for a crisis”, full list in Supp. [Appendix A](https://arxiv.org/html/2606.15684#A1)).

```yaml
task:
  - goal: "Identify the origin and type of the crisis, and gather necessary blocks to build a shelter for survival."
environment:
  - {type: cobblestone, position: [-5, 64, 15], num_blocks: 10}
  - {type: oak_log, position: [-8, 64, 10], num_blocks: 8}
  # ...
events:
  - {id: lava_wave, trigger: {start: 5, end: 100}, actions: {type: progressive_fill, block: lava, area: {min: [-40, 64, -20], max: [40, 65, 20]}, direction: east, speed: 2}}
agents:
  - {name: MineflayerBot0, position: [8, 64, 18], inventory: {gold_pickaxe: 1}, capabilities: {perception range: 10, speed: 10}}
  # ...
```
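To make the `events` entry of the listing concrete, the sketch below computes where a `progressive_fill` front would be at a given time. The semantics are assumptions for illustration only: the fill is taken to start at the western edge of `area` when the trigger starts, to advance `speed` blocks per time unit toward +x (east in Minecraft), and to stop at the trigger end or the area boundary. The helper names `front_x` and `arrival_time` are hypothetical and not part of the framework.

```python
# The lava_wave event from Listing 1, written as a plain Python dict.
EVENT = {
    "id": "lava_wave",
    "trigger": {"start": 5, "end": 100},
    "actions": {
        "type": "progressive_fill",
        "block": "lava",
        "area": {"min": [-40, 64, -20], "max": [40, 65, 20]},
        "direction": "east",
        "speed": 2,
    },
}

def front_x(event, t):
    """x-coordinate of the leading edge at time t, or None before the trigger starts."""
    trig, act = event["trigger"], event["actions"]
    if act["direction"] != "east":
        raise NotImplementedError("this sketch only handles eastward fills")
    if t < trig["start"]:
        return None
    elapsed = min(t, trig["end"]) - trig["start"]
    x_min, x_max = act["area"]["min"][0], act["area"]["max"][0]
    return min(x_min + act["speed"] * elapsed, x_max)

def arrival_time(event, x):
    """Earliest time at which the front reaches column x, or None if it never does."""
    trig, act = event["trigger"], event["actions"]
    x_min, x_max = act["area"]["min"][0], act["area"]["max"][0]
    if not (x_min <= x <= x_max):
        return None
    t = trig["start"] + (x - x_min) / act["speed"]
    return t if t <= trig["end"] else None

# MineflayerBot0 stands at x = 8 in the listing.
print(arrival_time(EVENT, 8))
```

Under these assumptions, the arrival time at an agent's position is the kind of deadline that a shelter must beat in the "prepare for a crisis" task.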
Listing 1 illustrates the task metadata structure, comprising four key fields: (i) *task description*, (ii) *environment layout*, (iii) *events* (with dynamic triggers and patterns), and (iv) *agents*. While our declarative interface abstracts away complex Minecraft API programming, manual task design still remains challenging due to the massive parameter space (e.g., jointly aligning crisis type and speed, necessary survival block placements, and agents’ speed, perception range, and mining tools). Prior Minecraft multi-agent benchmarks typically perform naive parameter sweeping (e.g., merely adjusting $N$ identical agents vs. $M$ skeletons) under static environments, yielding trivial variations that fail to meaningfully emphasize collaboration.
To overcome this challenge, we design a *feasibility-aware automated generation pipeline* to construct TickingCollabBench at scale. First, users define task goals and parameter spaces via a task metadata template (example in Supp. [Appendix B](https://arxiv.org/html/2606.15684#A2)). Using this, the LLM navigates the parameter space to draft structurally diverse configurations spanning various environments, agent compositions, and difficulty levels. The LLM may occasionally generate unsolvable task configurations (e.g., requiring gold ore mining without pickaxes, or setting enemy HP beyond the agents’ maximum damage output); thus, a *feasibility verifier* screens out invalid configurations using constraints that model the task feasibility.
[Figure 4](https://arxiv.org/html/2606.15684#S3.F4.fig1) details the task parameter space (designed to reflect the four properties in [Table 1](https://arxiv.org/html/2606.15684#S2.T1)) and the feasibility verification criteria. Even setting aside runtime stochasticity (e.g., random spawn positions and agent trajectories), optimal target assignment and sequencing (e.g., which blocks to mine) remain intractable. With heterogeneous agent capabilities and strict time constraints, this reduces to an NP-hard routing and allocation problem (analogous to the multi-agent traveling salesman problem), precluding exact guarantees. We thus introduce margins $\alpha,\beta,\gamma$ to define approximate feasibility constraints that also control task difficulty. Generating 250 configurations per task with GPT-5.1 and filtering them with margin values of 2.0 yielded 634 valid configurations (225, 219, and 190 for Tasks #1, #2, and #3); see Supp. [Appendix D](https://arxiv.org/html/2606.15684#A4) for margin sensitivity analysis and [Appendix E](https://arxiv.org/html/2606.15684#A5) for generated examples.
Figure 4: TickingCollab’s automated benchmark generation: parameter space and feasibility verification criteria. Details on each variable in the verification criteria are in Supp. [Appendix C](https://arxiv.org/html/2606.15684#A3).

Task #1: Prepare for a crisis
- Parameter space
  - Crisis config: $\mathcal{C}=(Type,P,V,H)$. $Type$: crisis type (e.g., lava, water, snow); $P$, $V$, $H$: origin, speed, height.
  - Agent config: $\mathcal{A}_{n}=(V_{n},R_{n},E_{n})$. $V_{n},R_{n}$: movement speed and perception range; $E_{n}$: mining tool with tier $\tau_{E_{n}}$. Block type $i$ is mineable only if $\tau_{E_{n}}\geq\tau_{req}(i)$, and takes time $T_{mine}^{n,i}$.
  - Block placements: $\mathcal{I}=\{(Type_{j},Loc_{j})\}$, types and locations.
  - Simulation duration: $T_{max}$ steps.
- Verification criteria
  - Necessary mining items? $\max_{n}(\tau_{E_{n}})\geq\tau_{req}(b),\;\forall b\in\mathcal{B}_{target}$
  - Sufficient survival blocks? $N_{required}\geq(H+1)\cdot N_{agent}$
  - Sufficient preparation time? $T_{crisis}\geq\alpha\cdot(T_{gather}+T_{construct})$ ($\alpha$: time margin)

Task #2: Mine vanishing blocks
- Parameter space
  - Target blocks: $\mathcal{B}_{i}=(Type_{i},N_{goal}^{i})$, type and count.
  - Agent config: $\mathcal{A}_{n}=(V_{n},R_{n},E_{n})$, same as Task #1.
  - Spawn pattern: $\mathcal{S}_{i}=(T_{start}^{i},T_{end}^{i},T_{int}^{i},T_{life}^{i})$, start, end, interval, lifetime.
  - Simulation duration: $T_{max}$ steps.
- Verification criteria
  - Necessary mining items? $\max_{n}(\tau_{E_{n}})\geq\tau_{req}(b),\;\forall b\in\mathcal{B}_{target}$
  - Sufficient block spawns? $N_{spawn}^{i}\geq N_{goal}^{i},\;\forall i$
  - Sufficient block lifetimes? $\beta\cdot(\min_{n}(T_{move}+T_{mine}^{n,i}))\leq T_{life}^{i},\;\forall i$ ($\beta$: time margin)

Task #3: Raid a boss
- Parameter space
  - Boss config: $\mathcal{B}=(HP_{B},D_{B})$, boss life point and damage.
  - Minion spawns: $\mathcal{P}_{i}=(T_{start}^{i},T_{end}^{i},T_{int}^{i},N^{i},HP_{M}^{i},D_{M}^{i})$. Start, end, interval of the $i$-th minion spawn events; $N^{i}$: minion count per spawn; $HP_{M}^{i}$, $D_{M}^{i}$: minion life point, damage.
  - Agent config: $\mathcal{A}_{n}=(HP_{n},D_{n})$, life point and damage.
  - Simulation duration: $T_{max}$ steps.
- Verification criteria
  - Sufficient agent damage? $\gamma\cdot T_{max}\cdot\sum_{n}D_{n}\geq HP_{B}+\sum_{i}(N^{i}\cdot HP_{M}^{i})$ ($\gamma$: efficiency margin)

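The verification criteria for Tasks #2 and #3 translate almost directly into boolean checks. The following is a minimal sketch rather than the authors' verifier: tool tiers are encoded as hypothetical integers, and quantities that the criteria take as given (spawn counts, the best travel-plus-mining time per block type) are passed in as precomputed inputs because their derivation is deferred to the supplementary material.

```python
def task2_feasible(agent_tiers, required_tier, n_spawn, n_goal,
                   best_time, lifetime, beta=2.0):
    """Task #2 checks from Figure 4.

    agent_tiers:   tool tier of each agent (hypothetical integer encoding)
    required_tier: {block_type: minimum tier needed to mine it}
    n_spawn:       {block_type: number of blocks that will spawn}
    n_goal:        {block_type: target quota}
    best_time:     {block_type: min over agents of T_move + T_mine}
    lifetime:      {block_type: T_life}
    beta:          time margin
    """
    best_tier = max(agent_tiers)
    tools_ok = all(best_tier >= required_tier[b] for b in n_goal)
    spawns_ok = all(n_spawn[b] >= n_goal[b] for b in n_goal)
    lifetime_ok = all(beta * best_time[b] <= lifetime[b] for b in n_goal)
    return tools_ok and spawns_ok and lifetime_ok

def task3_feasible(agent_dps, boss_hp, minion_spawns, t_max, gamma=2.0):
    """Task #3 check: gamma * T_max * sum(D_n) >= HP_B + sum(N^i * HP_M^i).

    minion_spawns: list of (N^i, HP_M^i) pairs, one per spawn event
    gamma:         efficiency margin
    """
    total_enemy_hp = boss_hp + sum(n * hp for n, hp in minion_spawns)
    return gamma * t_max * sum(agent_dps) >= total_enemy_hp

# Hypothetical configuration: one stone-tier (1) and one iron-tier (2) agent.
ok = task2_feasible(
    agent_tiers=[1, 2],
    required_tier={"gold_ore": 2, "oak_log": 0},
    n_spawn={"gold_ore": 5, "oak_log": 9},
    n_goal={"gold_ore": 4, "oak_log": 6},
    best_time={"gold_ore": 12.0, "oak_log": 8.0},
    lifetime={"gold_ore": 30, "oak_log": 25},
)
print(ok)
```

Because the exact assignment problem is intractable, checks of this form can only approximate feasibility; the margins trade off how many borderline configurations are kept.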
### 3.2 Task Orchestrator
Component #1: Dynamic environment manager. We build a plugin abstracting Minecraft’s complex primitive APIs (e.g., /effect, /attribute) to automatically compose and manage the dynamic environment specified in the metadata. Unlike prior benchmarks reliant on static environments, this manager handles diverse, complex runtime dynamics. It is also highly extensible, allowing developers to inject custom interaction mechanisms not natively supported by Minecraft (e.g., an inventory weight system, region-specific debuffs).
Component #2: Communication manager. This module bridges the agent runtime and the Mineflayer-controlled bots PrismarineJS ([2014](https://arxiv.org/html/2606.15684#bib.bib11)) connected to the Minecraft server. It maps high-level agent actions to predefined JavaScript code executed on the Fabric server (Supp. [Appendix F](https://arxiv.org/html/2606.15684#A6)) and returns structured observations (agent status and sensor data) for the next decision step.
Component \#3: Multi\-agent collaboration evaluator\.To enable in\-depth analysis of the agent’s reasoning capabilities and system\-level costs \([Section˜4](https://arxiv.org/html/2606.15684#S4)\), our framework supports parallel simulation execution, comprehensive logging \(e\.g\., number of inferences and inference latency, token usage, communication overhead\), and two distinct evaluation modes:
- Synchronous (fixed-timestep) mode pauses the simulation during LLM inference. It isolates and evaluates pure decision-making accuracy by ignoring inference latency.
- Asynchronous (real-time) mode runs the simulation continuously. It evaluates both accuracy and latency, as prolonged LLM inference causes actions to become stale against the rapidly changing environment, directly impacting task success.
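The difference between the two modes can be illustrated with a single decide-then-act cycle. This is a toy model, not the framework's scheduler: it assumes the world clock stands still during inference in synchronous mode and advances by the full inference latency in asynchronous mode. The 20-second latency and the 33.3-unit block lifetime echo figures reported elsewhere in the paper; the 15-unit action time is a hypothetical value.

```python
def finish_time(mode, inference_latency, action_time):
    """World time at which the chosen action completes."""
    if mode == "sync":
        # Simulation is paused while the LLM is thinking.
        return action_time
    if mode == "async":
        # The world keeps running during inference.
        return inference_latency + action_time
    raise ValueError(f"unknown mode: {mode!r}")

def block_mined_in_time(mode, inference_latency, action_time, block_lifetime):
    """True if the action completes before the target block vanishes."""
    return finish_time(mode, inference_latency, action_time) <= block_lifetime

for mode in ("sync", "async"):
    print(mode, block_mined_in_time(mode, inference_latency=20.0,
                                    action_time=15.0, block_lifetime=33.3))
```

The same plan succeeds in synchronous mode and fails in asynchronous mode, which is the effect the dual-mode evaluation is designed to isolate.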
### 3.3 Multi-agent Runtime and TickingCollabAgent
Our modular runtime decouples the agent core (LLM/memory) from communication modules, enabling flexible logic customization. We implement a baseline TickingCollabAgent with two distinct coordination policies (system prompts and toolsets in Supp. [Appendices G](https://arxiv.org/html/2606.15684#A7) and [F](https://arxiv.org/html/2606.15684#A6)):
- TickingCollabAgent-centralized (motivated by TeamCraft Long et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib6))). A designated *master* agent aggregates all peer agents’ observations, performs joint planning, and dispatches actions. While centralizing global observations maximizes joint planning quality, it introduces synchronization bottlenecks; the master must wait for all agents to report back before replanning, leaving faster agents idle due to varying action latencies.
- TickingCollabAgent-distributed (motivated by MineCollab White et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib4))). Agents plan independently in parallel, resolving shared decisions (e.g., role allocation) via a propose–wait–act negotiation protocol. We relax MineCollab’s 1-to-1 messaging restriction to support selective multi-agent broadcasting, facilitating efficient coordination under time constraints.
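Two aspects of these policies lend themselves to a small sketch: the idle time created by the centralized synchronization barrier, and the conflict resolution step at the end of a propose–wait–act round. The paper does not spell out how conflicting proposals are resolved, so the rule used here (lowest estimated cost wins, ties broken by agent name, one target per agent) is purely an assumption, as are both function names.

```python
def centralized_idle_times(action_latency):
    """Idle time per agent when the master replans only after all agents report back.

    action_latency: {agent: time taken by its current action}
    """
    barrier = max(action_latency.values())
    return {agent: barrier - t for agent, t in action_latency.items()}

def resolve_proposals(proposals):
    """Resolve one round of proposals collected during the wait phase.

    proposals: list of (agent, target, estimated_cost)
    Returns {target: winning_agent}. Assumed rule: cheapest proposal wins,
    and each agent can win at most one target per round.
    """
    winners = {}
    for agent, target, _cost in sorted(proposals, key=lambda p: (p[2], p[0])):
        if target not in winners and agent not in winners.values():
            winners[target] = agent
    return winners

idle = centralized_idle_times({"a": 2.0, "b": 5.0, "c": 3.5})
roles = resolve_proposals([
    ("a", "gold", 4.0),
    ("b", "gold", 2.5),
    ("a", "oak", 6.0),
    ("c", "oak", 7.0),
])
print(idle, roles)
```

In the example, agent `a` sits idle for 3 time units under the centralized barrier, while in the distributed round `b` wins the gold target and `a` falls back to oak.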
## 4 Experiments
We host two LLMs (GPT-5.1 and DeepSeek-R1 Guo et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib7))) in Azure AI Foundry as the backbone of TickingCollabAgent, and evaluate their task success rates on our benchmark. We use an Ubuntu 22.04.5 machine with an AMD EPYC 9V84 CPU (80 logical CPUs), 629 GiB RAM. We implement our TickingCollab framework on top of Minecraft Java Edition 1.19, using a Fabric server (loader v0.14.18) and Mineflayer 4.14.0 (unofficial third-party modifications to Minecraft) as the bot control interface.

Table 4: Average task success rate of TickingCollabAgent.

| Task | GPT-5.1 Centralized Sync | GPT-5.1 Centralized Async | GPT-5.1 Distributed Sync | GPT-5.1 Distributed Async | DeepSeek-R1 Centralized Sync | DeepSeek-R1 Centralized Async | DeepSeek-R1 Distributed Sync | DeepSeek-R1 Distributed Async | Oracle |
|---|---|---|---|---|---|---|---|---|---|
| Prepare for a crisis | 0.42 | 0.15 | 0.24 | 0.02 | 0.26 | 0.02 | 0.03 | 0.01 | 0.91 |
| Mine vanishing blocks | 0.62 | 0.05 | 0.54 | 0.00 | 0.65 | 0.00 | 0.35 | 0.01 | 0.80 |
| Raid a boss | 0.28 | 0.06 | 0.40 | 0.03 | 0.37 | 0.01 | 0.29 | 0.02 | 0.59 |

### 4.1 Overall Task Success Rate
[Table 4](https://arxiv.org/html/2606.15684#S4.T4) shows the overall task success rates across various tasks, LLMs, and execution modes. To evaluate the LLM’s planning accuracy, we designed an oracle solution where a centralized agent has access to ground-truth task metadata and runtime events:
- Task #1 Prepare for a crisis: the oracle retrieves the crisis type, origin, and speed from the task metadata at the beginning of the simulation.
- Task #2 Mine vanishing blocks & Task #3 Raid a boss: the oracle receives real-time object spawn/despawn events from the task orchestrator ([Section 3.2](https://arxiv.org/html/2606.15684#S3.SS2)) during runtime.
Based on this information, the oracle plans the actions of individual agents\. Since scheduling heterogeneous agents is an NP\-hard problem \(analogous to the multi\-agent traveling salesman problem\), we designed an efficient, non\-LLM\-based task allocation algorithm using handcrafted heuristics\. Note that scheduling in Task \#2 is more challenging than in Task \#1 due to a greater variety and volume of target blocks\. Furthermore, moving enemies and agent deaths from attacks in Task \#3 make planning non\-trivial, even for the oracle\.
Overall, we observe the following trends. First, most tasks fail in async mode, as the average 20-second API delay frequently exceeds our benchmark’s time-to-failure constraints ([Figure 2](https://arxiv.org/html/2606.15684#S2.F2)(d)). Second, in sync mode, TickingCollabAgent-centralized generally outperforms TickingCollabAgent-distributed; the one exception in Table 4 is “raid a boss” with GPT-5.1 (0.28 vs. 0.40). In the distributed topology, inter-agent communication and inference overheads cause most of the time budget to be spent on planning, leaving insufficient time for actions (detailed in [Figures 6](https://arxiv.org/html/2606.15684#S4.F6) and [7](https://arxiv.org/html/2606.15684#S4.F7)). Finally, even in sync mode, the centralized baseline falls short of the oracle’s success rate. This highlights the difficulty of planning multi-agent collaboration in dynamic environments using partial observations rather than global ground-truth knowledge. Together, these findings point to the need for an efficient multi-agent coordination policy coupled with fast and accurate LLM planning. See [Appendix H](https://arxiv.org/html/2606.15684#A8) for operational timeline examples.
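The trends above can be read directly off Table 4. The following is a minimal sketch that averages the reported success rates per model and topology; the numbers are transcribed from Table 4, while the names `TABLE4`, `ORACLE`, and `summarize` are hypothetical and introduced only for illustration.

```python
# Success rates transcribed from Table 4, as (sync, async) pairs per task, in the
# order: prepare for a crisis, mine vanishing blocks, raid a boss.
TABLE4 = {
    ("GPT-5.1", "centralized"): [(0.42, 0.15), (0.62, 0.05), (0.28, 0.06)],
    ("GPT-5.1", "distributed"): [(0.24, 0.02), (0.54, 0.00), (0.40, 0.03)],
    ("DeepSeek-R1", "centralized"): [(0.26, 0.02), (0.65, 0.00), (0.37, 0.01)],
    ("DeepSeek-R1", "distributed"): [(0.03, 0.01), (0.35, 0.01), (0.29, 0.02)],
}
ORACLE = [0.91, 0.80, 0.59]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def summarize(table=TABLE4, oracle=ORACLE):
    """Average sync/async success per configuration and the gap to the oracle."""
    summary = {}
    for key, cells in table.items():
        sync = mean(s for s, _ in cells)
        asyn = mean(a for _, a in cells)
        summary[key] = {"sync": sync, "async": asyn, "oracle_gap": mean(oracle) - sync}
    return summary


if __name__ == "__main__":
    for (model, topology), row in summarize().items():
        print(f"{model:12s} {topology:12s} sync={row['sync']:.2f} "
              f"async={row['async']:.2f} gap_to_oracle={row['oracle_gap']:.2f}")
```

Averaging over the three tasks is a simplification chosen for this sketch; the paper reports per-task values only.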
### 4.2 Ablation Study
Figure 5: Task success rate across different numbers of agents.

We analyze the factors behind TickingCollabAgent’s low success rate. Given consistent trends across both evaluated LLMs, we present results for GPT-5.1.

Scaling with team size. [Figure 5](https://arxiv.org/html/2606.15684#S4.F5) shows the task success rate across different numbers of agents. For the “raid a boss” and “mine vanishing blocks” tasks, increasing the number of agents improves enemy-killing or mining throughput, leading to higher success rates. In contrast, in “prepare for a crisis”, increasing the number of agents increases both the number of survival blocks to mine and the shelter size. Since the success condition requires all agents to survive, the success rate tends to decrease. Across all tasks, asynchronous mode generally yields lower success rates due to LLM inference latency (typically around 20 seconds per API call).
Figure 6: System costs of TickingCollabAgent-distributed in synchronous mode.

System costs. Figure [6](https://arxiv.org/html/2606.15684#S4.F6) reports the communication and inference costs of the distributed baseline in synchronous mode. Because agents can broadcast messages to multiple peers, message volume and LLM inference calls increase rapidly as team size increases (Figures [6](https://arxiv.org/html/2606.15684#S4.F6)(a) and (b)). Consequently, token usage and wall-clock time spike, often approaching the 40-minute simulation timeout (Figures [6](https://arxiv.org/html/2606.15684#S4.F6)(c) and (d)). This highlights the need for efficient communication protocols and group formation strategies in distributed multi-agent systems.
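As a rough illustration of why broadcast traffic grows quickly, consider an upper bound under the assumption (ours, not stated in the paper) that every one of $n$ agents broadcasts once per negotiation round to all of its peers; selective broadcasting would send fewer messages:

$$M(n) \leq n(n-1).$$

Under this assumption, doubling a team from $n = 4$ to $n = 8$ raises the per-round bound from 12 to 56 messages, and each received message may in turn trigger an LLM inference call. The team sizes here are illustrative only.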
Figure 7: Step-level planning, action, and idle-time breakdown in asynchronous mode. (a) TickingCollabAgent-centralized timeline example. (b) TickingCollabAgent-centralized step breakdown. (c) TickingCollabAgent-distributed step breakdown.

Coordination overhead. Figure [7](https://arxiv.org/html/2606.15684#S4.F7) breaks down asynchronous mode execution into planning, action, and idle steps. In the centralized baseline, agents remain idle while the central planner waits for LLM inference (typically around 20 seconds per API call); additional idle time arises as agents’ assigned actions vary in duration (Figure [7](https://arxiv.org/html/2606.15684#S4.F7)(a)). This synchronization bottleneck becomes more pronounced as team size grows, leading to a larger idle fraction (Figure [7](https://arxiv.org/html/2606.15684#S4.F7)(b)). The distributed baseline avoids a single central planner, but spends more time in planning because agents must communicate and negotiate before acting (Figure [7](https://arxiv.org/html/2606.15684#S4.F7)(c)). Overall, these results highlight the need for agents to form appropriate coordination topologies and optimize both planning and action execution times.
## 5 Limitations and Future Work
Observation modality. We use distance-limited structured semantic sensors to focus on the collaborative planning capabilities of LLM agents. We plan to extend TickingCollabBench to multimodal inputs (e.g., first-person-view video and audio) to evaluate VLM agents.

Agent-environment dynamics. TickingCollab supports complex agent-conditioned interactions (e.g., inventory-based movement slowdowns). We will leverage these primitives to test collaboration under complex, cascading environmental changes driven by agent actions.
## 6 Related Work
Minecraft agents. While numerous works leverage Minecraft to develop autonomous agents Wang et al. ([2023](https://arxiv.org/html/2606.15684#bib.bib16)); Bolton et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib18)); Magne et al. ([2026](https://arxiv.org/html/2606.15684#bib.bib19)); Li et al. ([2025a](https://arxiv.org/html/2606.15684#bib.bib20)); Cai et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib15)); Gong et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib21)); Fan et al. ([2022](https://arxiv.org/html/2606.15684#bib.bib1)), most focus on single-agent scenarios. Recent multi-agent extensions Long et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib6)); Yu et al. ([2024b](https://arxiv.org/html/2606.15684#bib.bib2)); White et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib4)); Dong et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib3)); Schipper et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib5)) mostly feature static environments, homogeneous agents, and tasks where collaboration is optional or lacks real-time constraints. To bridge this gap to real-world collaboration, TickingCollabBench introduces time-sensitive, complementary tasks in which coordination is mandatory.

Multi-agent collaboration. Broader evaluations of multi-agent collaboration in domains like mathematics, science, or coding Zhuge et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib29)); Chen et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib22)); Yu et al. ([2024a](https://arxiv.org/html/2606.15684#bib.bib25)) often lack environmental dynamics and explicit time limits. TickingCollabBench demands both decision accuracy and low latency. Other orthogonal efforts to improve multi-agent efficiency, such as task plan search Zu et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib23)); Zhang et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib24)), communication topologies Zhu et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib26)); Li et al. ([2025b](https://arxiv.org/html/2606.15684#bib.bib27)); Qian et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib28)); Zhuge et al. ([2024](https://arxiv.org/html/2606.15684#bib.bib29)), and resource-aware planning Yang et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib30)); Cai et al. ([2025](https://arxiv.org/html/2606.15684#bib.bib31)), could be integrated to tackle the real-time challenges in our benchmark.
## 7 Conclusion
We presented TickingCollabBench, a Minecraft benchmark evaluating *time-sensitive complementary collaboration*. To capture the core characteristics of real-world collaboration (agent heterogeneity, mandatory coordination, dynamic environments, and real-time constraints), we developed an extensible orchestration framework and an automated task generation pipeline. Baseline evaluations reveal significant performance gaps in current LLMs, highlighting the critical need for more accurate and latency-efficient multi-agent systems.

Acknowledgement. This study is conducted for research only, not for an actual Minecraft product.
## References
- [1] A. Bolton, A. Lerchner, A. Cordell, A. Moufarek, A. Bolt, A. Lampinen, A. Mitenkova, A. O. Hallingstad, B. Vujatovic, B. Li, et al. (2025) SIMA 2: a generalist embodied agent for virtual worlds. arXiv preprint arXiv:2512.04797.
- [2] Cai et al. (2024) MineStudio: a streamlined package for minecraft ai agent development. arXiv preprint arXiv:2412.18293.
- [3] S. Cai, Y. Ning, and H. Liu (2025) AgentBalance: backbone-then-topology design for cost-effective multi-agent systems under budget constraints. arXiv preprint arXiv:2512.11426.
- [4] W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C. Chan, H. Yu, Y. Lu, Y. Hung, C. Qian, et al. (2024) AgentVerse: facilitating multi-agent collaboration and exploring emergent behaviors. In ICLR.
- [5] Y. Dong, X. Zhu, Z. Pan, L. Zhu, and Y. Yang (2024) Villageragent: a graph-based multi-agent framework for coordinating complex task dependencies in minecraft. arXiv preprint arXiv:2406.05720.
- [6] L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang, H. Zhu, A. Tang, D. Huang, Y. Zhu, and A. Anandkumar (2022) Minedojo: building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems 35, pp. 18343–18362.
- [7] R. Gong, Q. Huang, X. Ma, Y. Noda, Z. Durante, Z. Zheng, D. Terzopoulos, L. Fei-Fei, J. Gao, and H. Vo (2024) Mindagent: emergent gaming interaction. In Findings of the Association for Computational Linguistics: NAACL 2024, pp. 3154–3183.
- [8] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [9] W. H. Guss, B. Houghton, N. Topin, P. Wang, C. Codel, M. Veloso, and R. Salakhutdinov (2019) Minerl: a large-scale dataset of minecraft demonstrations. arXiv preprint arXiv:1907.13440.
- [10] M. Li, Z. Wang, K. He, X. Ma, and Y. Liang (2025) Jarvis-vla: post-training large-scale vision language models to play visual games with keyboards and mouse. arXiv preprint arXiv:2503.16365.
- [11] X. Li, X. Wang, C. Bai, and J. Zhang (2025) Exponential topology-enabled scalable communication in multi-agent reinforcement learning. arXiv preprint arXiv:2502.19717.
- [12] S. Liu, Y. Li, K. Zhang, Z. Cui, W. Fang, Y. Zheng, T. Zheng, and M. Song (2024) Odyssey: empowering minecraft agents with open-world skills. arXiv preprint arXiv:2407.15325.
- [13] Q. Long, Z. Li, R. Gong, Y. N. Wu, D. Terzopoulos, and X. Gao (2024) Teamcraft: a benchmark for multi-modal multi-agent systems in minecraft. arXiv preprint arXiv:2412.05255.
- [14] L. Magne, A. Awadalla, G. Wang, Y. Xu, J. Belofsky, F. Hu, J. Kim, L. Schmidt, G. Gkioxari, J. Kautz, et al. (2026) NitroGen: an open foundation model for generalist gaming agents. arXiv preprint arXiv:2601.02427.
- [15] PrismarineJS (2014) Mineflayer. Note: [https://github.com/PrismarineJS/mineflayer](https://github.com/PrismarineJS/mineflayer)
- [16] PrismarineJS (2025) Mineflayer-pvp. Note: [https://github.com/PrismarineJS/mineflayer-pvp](https://github.com/PrismarineJS/mineflayer-pvp)
- [17] C. Qian, Z. Xie, Y. Wang, W. Liu, K. Zhu, H. Xia, Y. Dang, Z. Du, W. Chen, C. Yang, et al. (2024) Scaling large language model-based multi-agent collaboration. arXiv preprint arXiv:2406.07155.
- [18] O. Schipper, Y. Zhang, Y. Du, M. Pechenizkiy, and M. Fang (2025) Pillagerbench: benchmarking llm-based agents in competitive minecraft team environments. In 2025 IEEE Conference on Games (CoG), pp. 1–15.
- [19] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023) Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- [20] I. White, K. Nottingham, A. Maniar, M. Robinson, H. Lillemark, M. Maheshwari, L. Qin, and P. Ammanabrolu (2025) Collaborating action by action: a multi-agent llm framework for embodied reasoning. arXiv preprint arXiv:2504.17950.
- [21] L. Yang, J. Luo, X. Liu, Y. Lou, and Z. Chen (2025) BAMAS: structuring budget-aware multi-agent systems. arXiv preprint arXiv:2511.21572.
- [22] H. Yu, Z. Hong, Z. Cheng, K. Zhu, K. Xuan, J. Yao, T. Feng, and J. You (2024) Researchtown: simulator of human research community. arXiv preprint arXiv:2412.17767.
- [23] X. Yu, J. Fu, R. Deng, and W. Han (2024) Mineland: simulating large-scale multi-agent interactions with limited multimodal senses and physical needs. arXiv preprint arXiv:2403.19267.
- [24] Y. Zhang, S. Yang, C. Bai, F. Wu, X. Li, Z. Wang, and X. Li (2025) Towards efficient llm grounding for embodied multi-agent collaboration. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 1663–1699.
- [25] X. Zheng, H. Lin, K. He, Z. Wang, Q. Fu, H. Fu, Z. Zheng, and Y. Liang (2025) MCU: an evaluation framework for open-ended game agents. In Forty-second International Conference on Machine Learning.
- [26] K. Zhu, H. Du, Z. Hong, X. Yang, S. Guo, D. Z. Wang, Z. Wang, C. Qian, R. Tang, H. Ji, et al. (2025) Multiagentbench: evaluating the collaboration and competition of llm agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8580–8622.
- [27] M. Zhuge, W. Wang, L. Kirsch, F. Faccio, D. Khizbullin, and J. Schmidhuber (2024) Gptswarm: language agents as optimizable graphs. In Forty-first International Conference on Machine Learning.
- [28] L. Zu, L. Lin, S. Fu, N. Zhao, and P. Zhou (2025) Collaborative tree search for enhancing embodied multi-agent collaboration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 29513–29522.
## Appendix A Task Metadata Schema
Listing 2 shows the full task metadata for the “prepare for a crisis” task. It is composed of four main sections: (i) *task description* to specify the task goal and guidance, (ii) *environment* to specify the simulation world layout (e.g., material, regions with specific interaction effects), (iii) *events* to specify each runtime dynamic event’s trigger, type, and pattern, and (iv) *agents* to specify the attributes of each agent. Overall, the interface is designed to be declarative and intuitive, making it easy for users to define custom tasks and for LLMs to generate valid outputs.
Listing 2: Example metadata for “prepare for a crisis” task.

```yaml
task:
  type: prepare_crisis
  goal: "Identify the type and origin of the crisis and build a shelter to survive"
  guidance:
    text: |
      Each agent has different tools, speeds, and perception ranges.

events:
  - id: lava_wave
    trigger:
      start: 30
      end: 200
    actions:
      - type: progressive_fill
        block: lava
        area:
          min: {x: -30, y: 64, z: 0}
          max: {x: 40, y: 64, z: 40}
        direction: east
        speed_bps: 1

agents:
  count: 4
  spawn:
    - name: MineflayerBot0
      position: [8, 64, 18]
      inventory:
        diamond_pickaxe:
          count: 1
          unbreakable: true
      capabilities:
        max_health: 40
        speed_bps: 4.3
        perception_range: 24
        effects:
          - fire_resistance
    - name: MineflayerBot1
      position: [12, 64, 18]
      inventory:
        diamond_axe:
          count: 1
          unbreakable: true
      capabilities:
        max_health: 30
        speed_bps: 6.5
        perception_range: 16
        effects:
          - fire_resistance
    - name: MineflayerBot2
      position: [8, 64, 22]
      inventory:
        iron_pickaxe:
          count: 1
          unbreakable: true
      capabilities:
        speed_bps: 3.0
        perception_range: 20
    - name: MineflayerBot3
      position: [12, 64, 22]
      inventory:
        golden_axe:
          count: 1
          unbreakable: true
      capabilities:
        speed_bps: 8.0
        perception_range: 10

environment:
  max_steps: 100

  world:
    world_type: flatgrass
    difficulty: hard
    time: day
    gamerules:
      doMobSpawning: false
      doImmediateRespawn: false
      doDaylightCycle: false

  chest:
    position: [0, 64, 0]

  entities:
    hostiles: []

  materials:
    grid:

      - block: cobblestone
        position: [20, 64, 30]
        width: 4
        height: 1
        depth: 5
      - block: cobblestone
        position: [25, 64, 35]
        width: 3
        height: 1
        depth: 4
      - block: cobblestone
        position: [10, 64, 40]
        width: 4
        height: 1
        depth: 5
      - block: stone
        position: [5, 64, 30]
        width: 3
        height: 1
        depth: 3
      - block: stone
        position: [10, 64, 25]
        width: 4
        height: 1
        depth: 3
      - block: stone_bricks
        position: [0, 64, 20]
        width: 3
        height: 1
        depth: 4
      - block: stone_bricks
        position: [25, 64, 5]
        width: 3
        height: 1
        depth: 3
      - block: bricks
        position: [20, 64, -12]
        width: 3
        height: 1
        depth: 3

      - block: oak_log
        position: [-10, 64, 10]
        width: 3
        height: 1
        depth: 4
      - block: oak_log
        position: [15, 64, 20]
        width: 3
        height: 1
        depth: 4
      - block: oak_planks
        position: [15, 64, 5]
        width: 4
        height: 1
        depth: 3
      - block: oak_planks
        position: [-5, 64, 10]
        width: 3
        height: 1
        depth: 4
```
## Appendix B User Specification Example for Automated Benchmark Generation
Listing 3 shows an example YAML interface for the user specification used to generate task metadata with LLMs. The configuration file is parsed and converted into a text prompt for the LLM. Users can specify the following components: (i) *task description* to define the task type, (ii) *variable parameters* to define the parameter space and their possible values (parameter names must match those in the task metadata), (iii) *constraints* to enforce formatting requirements for the generated metadata, (iv) *generation instructions* to guide the LLM to reflect the properties of our proposed time-sensitive complementary collaboration tasks, (v) *diversity* to encourage balanced sampling across varying parameters, and (vi) *references* to provide supporting documents (e.g., metadata schema, Minecraft physics rules, and interaction mechanisms).
Listing 3: Example user specification for “prepare for a crisis” task.

```yaml
task_type: prepare_crisis
description: "Multi-agent team prepares for an incoming crisis by gathering resources and building defenses"

variable_parameters:

  events.actions.type:
    description: "Crisis event type"
    allowed: [progressive_fill]

  events.actions.block:
    description: "Block type for progressive_fill events"
    allowed: [lava, water, powder_snow]

  events.actions.direction:
    description: "Fill direction (horizontal only)"
    allowed: [east, west, north, south]

  events.actions.area.min.y:
    description: "Area min Y coordinate (must be 64)"
    range: [64, 64]

  events.actions.area.max.y:
    description: "Area max Y coordinate (must be 64)"
    range: [64, 64]

  events.actions.speed:
    description: "Blocks filled per trigger"
    range: [1, 5]

  events.trigger.start:
    description: "Step when crisis begins"
    range: [3, 20]

  events.trigger.end:
    description: "Step when crisis ends"
    range: [20, 200]

  agents.count:
    description: "Team size"
    range: [1, 8]

  agents.spawn.inventory:
    description: "Agent starting tools and armor"
    allowed: [wooden_pickaxe, stone_pickaxe, iron_pickaxe, diamond_pickaxe, golden_pickaxe, netherite_pickaxe, wooden_axe, stone_axe, iron_axe, diamond_axe, netherite_axe, leather_boots]
    match: keys

  agents.spawn.capabilities.max_health:
    description: "Agent HP"
    range: [10, 60]

  agents.spawn.capabilities.speed_bps:
    description: "Agent movement speed multiplier"
    range: [0.5, 2.0]

  agents.spawn.capabilities.effects.type:
    description: "Potion effect types"
    allowed: [fire_resistance]

  environment.materials.grid.block:
    description: "Building blocks scattered in world"
    allowed: [stone, cobblestone, stone_bricks, bricks, deepslate, iron_block, gold_block, diamond_block, obsidian, crying_obsidian, netherite_block, oak_log, birch_log, spruce_log, dark_oak_log, oak_planks]

  environment.materials.grid.height:
    description: "Height of material piles"
    range: [1, 1]

  environment.max_steps:
    description: "Time limit for the scenario"
    range: [50, 200]

constraints: |
  CRITICAL YAML FORMATTING RULES

  1. Coordinates must be inline lists: position: [8, 64, 18]
  2. Area centers must be inline lists: center: [0, 64, 20]
  3. Area min/max must be inline lists
  4. Keep simple key-value pairs on one line
  5. Use quoted strings for descriptions
  6. Keep the same indentation and structure as the template
  7. The events list must contain exactly one event with one progressive_fill action
  8. Crisis area must use Y=64 for both min and max
  9. Direction must be east, west, north, or south

generation_instructions: |
  Generate diverse crisis preparation scenarios by varying the parameters listed in variable_parameters.

  Crisis type distribution:
  Use lava, water, or powder_snow. If not assigned, choose randomly.

  Agent count diversity:
  Vary agent count across the full range (2-8).

  Design guidelines:

  Agent heterogeneity:
  - Each agent must have different attributes
  - Mix tool types and movement speeds
  - For lava scenarios, give some agents fire_resistance

  Collaboration necessity:
  - N = (blocks_to_gather + blocks_to_place) / best_agent_throughput / T_max
  - N must be greater than 1.0 so a single agent cannot complete alone

  Environment dynamicity:
  - Increase fill speed or trigger frequency to increase dynamics

  Time-to-failure:
  - Lower distance or higher speed increase urgency

diversity_round_robin:
  - field: events.actions.block
    label: Crisis Type Assignment
    instruction: "Adapt all fields to match the assigned crisis type."

  - field: agents.count
    label: Agent Count Assignment
    values: [2, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8]
    instruction: "Use exactly the assigned number of agents."

references:
  - metadata_schema.md
  - mining_items.md
  - interaction_effects.md
```
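To illustrate how the `diversity_round_robin` section might drive balanced sampling, the following is a minimal sketch that cycles through each field's values, one assignment per generated task. The function name `round_robin_assignments` is hypothetical, and the fallback of `events.actions.block` to its `allowed` list (since it has no explicit `values` entry) is an assumption, not the documented behavior of the pipeline.

```python
from itertools import cycle


def round_robin_assignments(fields, num_tasks):
    """Return one {field: value} assignment per task, cycling each field's values."""
    iterators = {name: cycle(values) for name, values in fields.items()}
    return [{name: next(it) for name, it in iterators.items()} for _ in range(num_tasks)]


# Values taken from Listing 3; the block list is the field's `allowed` list (assumed fallback).
FIELDS = {
    "events.actions.block": ["lava", "water", "powder_snow"],
    "agents.count": [2, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8],
}

if __name__ == "__main__":
    for i, assignment in enumerate(round_robin_assignments(FIELDS, 6)):
        print(i, assignment)
```

Each assignment would then be injected into the LLM prompt together with the field's `instruction` string, so that consecutive drafts do not collapse onto the same crisis type or team size.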
## Appendix C Verification Criteria for TickingCollabBench Generation
In this section, we provide a detailed explanation of the verification criteria for the automated benchmark generation pipeline summarized in Table 3 of the main text.
### C.1 Task #1: Prepare for a Crisis
For this task, we first verify that at least one agent has the necessary item to mine each target survival block, by checking the following:

$$\max_{n}(\tau_{E_{n}}) \geq \tau_{req}(b), \quad \forall b \in \mathcal{B}_{target},$$

where $\tau_{E_{n}}$ is the mining tier of the tool of the $n$-th agent, $\tau_{req}(b)$ denotes the required mining tier for block type $b$, and $\mathcal{B}_{target}$ is the set of target survival block types. This condition ensures that at least one agent possesses a tool capable of mining the required survival blocks.
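The tier condition can be sketched as follows. The tier tables `TOOL_TIER` and `REQUIRED_TIER` are illustrative values chosen for this example rather than the verifier's actual tables, and counting only pickaxes toward the mining tier is a simplifying assumption; all function names are hypothetical.

```python
# Illustrative tier tables (assumed for this sketch, not taken from the paper).
TOOL_TIER = {"wooden": 0, "golden": 0, "stone": 1, "iron": 2, "diamond": 3, "netherite": 4}
REQUIRED_TIER = {"stone": 0, "cobblestone": 0, "iron_block": 1, "diamond_block": 2, "obsidian": 3}


def pickaxe_tier(inventory):
    """Highest pickaxe tier in an inventory, or -1 if the agent holds no pickaxe."""
    tiers = [TOOL_TIER[item.rsplit("_", 1)[0]] for item in inventory if item.endswith("_pickaxe")]
    return max(tiers, default=-1)


def tool_check(agent_inventories, target_blocks):
    """True if max_n tau_{E_n} >= tau_req(b) holds for every target block type b."""
    best = max((pickaxe_tier(inv) for inv in agent_inventories), default=-1)
    return all(best >= REQUIRED_TIER[block] for block in target_blocks)


if __name__ == "__main__":
    # Inventories of the four agents in Listing 2.
    team = [["diamond_pickaxe"], ["diamond_axe"], ["iron_pickaxe"], ["golden_axe"]]
    print(tool_check(team, ["stone", "cobblestone"]))
```

Because the condition takes the maximum over agents, a single well-equipped agent is enough to pass; whether the team can mine fast enough is left to the timing check.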
Next, we verify whether a sufficient number of survival blocks are placed in the map. Specifically, we check the following:

$$N_{required} \geq (H+1) \cdot N_{agent},$$

where $N_{required}$ denotes the number of survival blocks available in the environment (e.g., stone blocks when the crisis type is lava flood), $N_{agent}$ is the number of agents, and $H$ is the height of the crisis. Since agents must build a shelter higher than the crisis height and stand on top of it to survive, this condition verifies that a sufficient number of blocks are available to construct such a structure.
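As a hedged illustration, this check can be applied to the material piles of Listing 2. The pile dimensions below are transcribed from that listing; treating the stone-like blocks as the survival blocks for a lava crisis and using a crisis height of $H = 1$ are assumptions made for this sketch, and the function names are hypothetical.

```python
# (block, width, height, depth) for each pile in Listing 2.
LISTING2_PILES = [
    ("cobblestone", 4, 1, 5), ("cobblestone", 3, 1, 4), ("cobblestone", 4, 1, 5),
    ("stone", 3, 1, 3), ("stone", 4, 1, 3),
    ("stone_bricks", 3, 1, 4), ("stone_bricks", 3, 1, 3),
    ("bricks", 3, 1, 3),
    ("oak_log", 3, 1, 4), ("oak_log", 3, 1, 4),
    ("oak_planks", 4, 1, 3), ("oak_planks", 3, 1, 4),
]
# Assumed survival block types for a lava crisis (wooden blocks burn).
LAVA_SURVIVAL = {"cobblestone", "stone", "stone_bricks", "bricks"}


def blocks_available(piles, survival_types):
    """Count blocks of the survival types across all piles."""
    return sum(w * h * d for block, w, h, d in piles if block in survival_types)


def supply_check(piles, survival_types, crisis_height, num_agents):
    """True if the available survival blocks cover (H + 1) blocks per agent."""
    return blocks_available(piles, survival_types) >= (crisis_height + 1) * num_agents


if __name__ == "__main__":
    print(blocks_available(LISTING2_PILES, LAVA_SURVIVAL))
    print(supply_check(LISTING2_PILES, LAVA_SURVIVAL, crisis_height=1, num_agents=4))
```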
Finally, we verify whether the agents have sufficient time to build a shelter before the crisis hits by evaluating the following condition:

$$T_{crisis} \geq \alpha \cdot (T_{move} + T_{mine} + T_{construct}).$$

Here, $T_{crisis}$ is the time it takes for the crisis to reach the build site, computed as $T_{crisis} = t_{0} + L_{area} / v_{fill}$, where $t_{0}$ is the crisis start time, $L_{area}$ is the length of the crisis area along the spread direction, and $v_{fill}$ is the fill speed in blocks per step. We assume the build site is located at the farthest end of the crisis area (i.e., the last point the crisis reaches).
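As a worked example, the values from Listing 2 can be plugged into this definition. This assumes that $L_{area}$ is the x-extent of the fill area for an eastward fill and that `speed_bps: 1` corresponds to $v_{fill} = 1$ block per step; both are one reading of the schema rather than something the text states:

$$T_{crisis} = t_{0} + \frac{L_{area}}{v_{fill}} = 30 + \frac{40 - (-30)}{1} = 100 \text{ steps}.$$

Under this reading, a margin of $\alpha = 2.0$ accepts the configuration only if the estimated preparation time $T_{move} + T_{mine} + T_{construct}$ is at most 50 steps.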
The right-hand side of the inequality decomposes the total required preparation time into two phases:
- $T_{move} + T_{mine} = \max_{a} W_{a}$ is the makespan of the material gathering phase (moving to the blocks and mining them). Each agent $a$ is greedily assigned a subset of material block piles to visit. Piles are sorted by proximity to the build site and assigned to a compatible agent (i.e., one holding a tool of sufficient tier) that currently has the lowest accumulated workload. Agent $a$’s workload is defined as $$W_{a} = \sum_{i \in \mathcal{P}_{a}} \left( \frac{d(p^{a}_{i-1}, p^{a}_{i})}{v_{a}} + n^{a}_{i} \cdot 1.5 \cdot \frac{h_{b_{i}}}{s^{a}_{b_{i}}} \right) + \frac{d(p^{a}_{|\mathcal{P}_{a}|}, \mathbf{s})}{v_{a}},$$ where $\mathcal{P}_{a}$ is the ordered set of piles assigned to agent $a$, $p^{a}_{i}$ is the position of the $i$-th pile ($p^{a}_{0}$ being the agent’s spawn position), $v_{a}$ is the agent’s movement speed, $n^{a}_{i}$ is the number of blocks mined from that pile, $h_{b_{i}}$ is the block hardness, $s^{a}_{b_{i}}$ is the mining speed of agent $a$’s best tool for block type $b_{i}$, and $\mathbf{s}$ is the build site position. The bottleneck agent determines the overall gather time.
- $T_{construct} = \frac{0.5 \cdot N_{required}}{N^{\prime}_{agents}}$ estimates the parallel construction time, assuming $0.5$ seconds per block placement (the default value in Minecraft). Here, $N_{required}$ denotes the total number of blocks required for the shelter floor and access stairs, and $N^{\prime}_{agents}$ is the number of active agents equipped with a valid mining tool (and thus assumed to participate in the construction).

Since the above terms are estimates, we introduce $\alpha \geq 1$ as a time safety margin. In the context of task generation, a larger $\alpha$ enforces a longer delay before the crisis arrives, thereby creating an easier task for the agents.
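The greedy assignment and the resulting timing check can be sketched as below. This is a simplified reading of the description above, not the authors' implementation: each agent has a single scalar `mine_speed` instead of a per-block-type speed, all quantities are assumed to share one time unit, and the dictionary keys and function names are hypothetical.

```python
import math


def gather_makespan(agents, piles, site):
    """Greedy makespan max_a W_a.

    agents: dicts with keys pos, speed, tier, mine_speed.
    piles:  dicts with keys pos, n (blocks to mine), hardness, tier (required tier).
    """
    pos = [a["pos"] for a in agents]      # current position p^a_i of each agent
    work = [0.0] * len(agents)            # accumulated workload W_a
    # Piles closest to the build site are assigned first.
    for pile in sorted(piles, key=lambda p: math.dist(p["pos"], site)):
        capable = [i for i, a in enumerate(agents) if a["tier"] >= pile["tier"]]
        if not capable:
            return math.inf               # no agent can mine this pile
        i = min(capable, key=lambda k: work[k])   # least-loaded compatible agent
        a = agents[i]
        work[i] += math.dist(pos[i], pile["pos"]) / a["speed"]            # travel
        work[i] += pile["n"] * 1.5 * pile["hardness"] / a["mine_speed"]   # mining
        pos[i] = pile["pos"]
    # Add the final leg to the build site; the slowest agent sets the makespan.
    return max(w + math.dist(p, site) / a["speed"] for w, p, a in zip(work, pos, agents))


def time_check(t0, l_area, v_fill, t_gather, n_required, n_builders, alpha=2.0):
    """True if T_crisis >= alpha * (T_move + T_mine + T_construct)."""
    t_crisis = t0 + l_area / v_fill
    t_construct = 0.5 * n_required / n_builders
    return t_crisis >= alpha * (t_gather + t_construct)
```

A greedy least-loaded assignment does not guarantee an optimal makespan, which is one reason a safety margin $\alpha$ on top of the estimate is useful.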
### C.2 Task #2: Mine Vanishing Blocks
For this task, we first verify that at least one agent has the necessary item to mine the target blocks, by checking the following:

$$\max_{n}(\tau_{E_{n}}) \geq \tau_{req}(b), \quad \forall b \in \mathcal{B}_{target},$$

where the variable notations are the same as in the “Prepare for a crisis” task.

Next, we verify whether the block lifetimes are sufficiently long for agents to reach and mine them by evaluating the following condition:

$$\beta_{1} \cdot \left( \min_{n} \left( T_{move} + T_{mine}^{n,i} \right) \right) \leq T_{life}^{i}, \quad \forall i.$$

Here, for the $i$-th target block, $T_{move}$ is the estimated travel time from the mean position of the mining-capable agents to the spawn area. This is calculated using the distance to the center of the target block spawn area plus $\frac{2}{3}R$, which represents the expected offset within a spawn radius $R$. Furthermore, $T_{mine}^{n,i}$ is the time required for agent $n$ to mine one block of type $i$, and $T_{life}^{i}$ is the lifetime of block type $i$ before despawning. The minimization is computed over all agents $n$ capable of mining block type $i$. Since the actual positions of the agents and blocks are dynamic at runtime, we introduce a safety margin $\beta_{1} \geq 1$. Enforcing a larger $\beta_{1}$ during task generation requires longer block lifetimes, thereby making the task easier for the agents.
Finally, we verify whether the total number of spawned target blocks is sufficient to meet the target count by checking the following condition:

$$N_{spawn}^{i} \geq \beta_{2} \cdot N_{goal}^{i}, \quad \forall i,$$

where $N_{spawn}^{i}$ is the total number of spawned blocks of type $i$, and $N_{goal}^{i}$ is the required number of blocks of type $i$ as specified by the task goal. We introduce $\beta_{2} \geq 1$ as a supply safety margin to account for blocks that may despawn before the agents can reach and mine them.
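The lifetime and supply conditions for this task can be sketched together as follows. The function names are hypothetical, and the movement speed used to convert distance into travel time is an assumed parameter, since the text does not say which agent's speed is used.

```python
import math


def expected_move_time(mean_agent_pos, spawn_center, radius, speed):
    """Travel time over the distance to the spawn-area center plus the 2R/3 expected offset."""
    return (math.dist(mean_agent_pos, spawn_center) + 2 * radius / 3) / speed


def lifetime_check(move_time, mine_times, lifetime, beta1=1.0):
    """beta1 * min_n(T_move + T_mine^{n,i}) <= T_life^i for one block type.

    mine_times holds T_mine^{n,i} for each agent capable of mining the block type.
    """
    if not mine_times:
        return False  # no capable agent, so the block type can never be mined
    return beta1 * (move_time + min(mine_times)) <= lifetime


def supply_margin_check(n_spawn, n_goal, beta2=1.0):
    """N_spawn^i >= beta2 * N_goal^i for one block type."""
    return n_spawn >= beta2 * n_goal


if __name__ == "__main__":
    t_move = expected_move_time((0, 0), (3, 4), radius=3, speed=1.0)
    print(lifetime_check(t_move, [4.0, 2.0], lifetime=10.0, beta1=1.0))
    print(supply_margin_check(15, 10, beta2=1.5))
```

A configuration passes only if both functions return `True` for every target block type $i$.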
### C\.3Task \#3: Raid a Boss
For this task, we evaluate whether the agents have sufficient attack damage to defeat the boss and its spawned minions within the simulation duration\. Specifically, we verify the following conditions:
γ⋅Tmax⋅∑nDn≥HPB\+∑i\(Ni⋅HPMi\),\\gamma\\cdot T\_\{max\}\\cdot\\sum\_\{n\}D\_\{n\}\\geq HP\_\{B\}\+\\sum\_\{i\}\(N\_\{i\}\\cdot HP\_\{M\}^\{i\}\),
where $T_{max}$ denotes the total simulation duration, $D_{n}$ is the attack damage of the $n$-th agent, $HP_{B}$ and $HP_{M}^{i}$ are the life points of the boss and the $i$-th minion type, respectively, and $N_{i}$ is the total number of spawned minions of the $i$-th type. The term $T_{max}\cdot\sum_{n}D_{n}$ on the left-hand side of the inequality represents the maximum damage that the agents can theoretically inflict. In practice, however, both agents and enemies continuously move, making it impossible to achieve this maximum damage. To account for this, we introduce an efficiency margin $\gamma$ as a soft constraint. During the verification stage, the desired range of $\gamma$ can be specified to control the task difficulty, where a larger $\gamma$ indicates an easier task.
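The damage-budget condition can be evaluated directly from a task configuration. The sketch below is a minimal illustration, assuming that $D_{n}$ is expressed as damage per unit of simulation time (the text does not state the unit) and that minions are given as hypothetical `(count, hp)` pairs; the function names are not from the paper.

```python
def damage_budget(t_max, agent_damage):
    """Theoretical maximum damage T_max * sum_n D_n."""
    return t_max * sum(agent_damage)


def total_enemy_hp(boss_hp, minions):
    """HP_B + sum_i N_i * HP_M^i, with minions given as (count, hp) pairs."""
    return boss_hp + sum(count * hp for count, hp in minions)


def raid_feasible(t_max, agent_damage, boss_hp, minions, gamma):
    """Check gamma * T_max * sum_n D_n >= HP_B + sum_i N_i * HP_M^i."""
    return gamma * damage_budget(t_max, agent_damage) >= total_enemy_hp(boss_hp, minions)


def break_even_gamma(t_max, agent_damage, boss_hp, minions):
    """Value of gamma at which the two sides of the condition are equal."""
    return total_enemy_hp(boss_hp, minions) / damage_budget(t_max, agent_damage)
```

For example, with $T_{max}=300$, per-agent damage rates of 2, 3, and 5, a boss with 600 life points, and minions given as `[(10, 20), (2, 100)]`, the budget is 3000 against 1000 total life points, so the two sides balance at $\gamma=1/3$.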
## Appendix D Feasibility Margin Sensitivity
Figure [8](https://arxiv.org/html/2606.15684#A4.F8) shows how the acceptance rate of generated configurations varies as the feasibility margin increases. A higher margin imposes stricter constraints, requiring agents to complete objectives with a larger safety buffer, and thereby filters out configurations that are only marginally feasible. At our chosen threshold of $\alpha=2.0\times$, acceptance rates remain above 87% across all tasks, and degrade gracefully as $\alpha$ increases further. Notably, *Prepare for Crisis* does not reach 100% acceptance even at the lowest margin ($\alpha=0.5$), plateauing around 94%. This is because a small fraction of LLM-generated configurations assign crisis types (e.g., lava, water, powder snow) for which no agent possesses the appropriate tool to harvest the corresponding survival block, making the scenario structurally infeasible regardless of margin.
Figure 8: Acceptance rate vs. feasibility margin $\alpha$ for each task category. Higher margins impose stricter feasibility constraints, reducing the fraction of retained configurations.
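The plateau described above follows from how a margin sweep treats structurally infeasible configurations. The sketch below is a hypothetical abstraction, not the paper's pipeline: each configuration is reduced to a single feasibility ratio (available budget divided by estimated requirement), with `None` marking a configuration that no margin can rescue. The toy values are for illustration only and are not results from the paper.

```python
def acceptance_rate(ratios, alpha):
    """Fraction of configurations accepted at feasibility margin alpha.

    ratios: one entry per configuration, holding a hypothetical feasibility
    ratio, or None when the configuration is structurally infeasible
    (e.g., no agent has the tool needed for the assigned crisis type).
    """
    if not ratios:
        raise ValueError("no configurations to evaluate")
    accepted = sum(1 for r in ratios if r is not None and r >= alpha)
    return accepted / len(ratios)


# Toy data for illustration only; not taken from the benchmark.
toy_ratios = [3.0, 2.5, 2.0, 1.2, None]
rates = {alpha: acceptance_rate(toy_ratios, alpha) for alpha in (0.5, 2.0, 4.0)}
```

In this toy sweep the rate never exceeds 0.8 even at the lowest margin, because the `None` entry is rejected at every value of $\alpha$.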
## Appendix E TickingCollabBench Generated Dataset Examples

Figure 9: Examples in TickingCollabBench. Top row: “prepare for a crisis” task, middle row: “mine vanishing blocks” task, bottom row: “raid a boss” task.

[Figure 9](https://arxiv.org/html/2606.15684#A5.F9) shows example tasks from TickingCollabBench. Overall, the results demonstrate that the LLM generates diverse task variations. For the “prepare for a crisis” task, multiple crisis types with different effects are generated (e.g., a lava flood that burns wooden blocks, a water flood that spreads rapidly and can sweep agents away, and a snow avalanche that slows down agents). The intensity and frequency of the damage received when agents come into contact with the crisis are controlled by the TickingCollab framework’s dynamic environment manager. In addition, the origin and propagation speed of the crisis, the material types, quantities, and locations of the shelter blocks, and the agent configurations vary across tasks, resulting in different difficulty levels. For the “mine vanishing blocks” task, the target block types and target quantities vary, along with the types and numbers of blocks placed in the environment and the agent configurations. Finally, for the “raid a boss” task, the monster spawn patterns (e.g., a few strong monsters or many weaker ones) and the agent configurations differ across tasks. Since generating diverse task configurations requires jointly considering multiple parameters and their interactions, manually designing such variations is challenging. Our LLM-based automated benchmark generation approach therefore provides an efficient solution.
## Appendix F TickingCollabAgent Action Spaces
Table 5: Agent toolset for each task.

| Task | Tool name | Parameters | Description |
| --- | --- | --- | --- |
| Prepare for a crisis | *scout\_blocks\_at* | target\_pos, max\_distance | Move to location and search for blocks |
| | *mine\_blocks\_at* | block\_positions | Mine blocks at specific positions |
| | *build\_floor* | center\_pos, width, depth, height | Build a layered floor platform |
| | *move\_to* | target\_pos | Move to a location |
| | *deposit\_to\_chest* | chest\_pos, items, quantities | Deposit items into shared chest |
| | *get\_from\_chest* | chest\_pos, items, quantities | Get items from shared chest |
| Mine vanishing blocks | *scout\_blocks\_at* | target\_pos, max\_distance | Move to location and search for blocks |
| | *mine\_blocks\_at* | block\_positions | Mine blocks at specific positions |
| | *deposit\_to\_chest* | chest\_pos, items, quantities | Deposit items into shared chest |
| Raid a boss | *move\_to* | target\_pos | Move to a location |
| | *find\_entities* | entity\_types, count, max\_distance | Scan for nearby entities |
| | *attack* | entity\_type | Attack a target entity |
| | *equip\_item* | item | Equip a weapon |
| | *use\_item* | item | Use a consumable item |
| | *deposit\_to\_chest* | chest\_pos, items, quantities | Deposit items into shared chest |
| | *get\_from\_chest* | chest\_pos, items, quantities | Get items from shared chest |
[Table 5](https://arxiv.org/html/2606.15684#A6.T5) summarizes the agent toolset for each task. Each tool is implemented as a JavaScript function, which is generated by the LLM and executed by Mineflayer. For the “prepare for a crisis” task, agents can move and scout nearby blocks (including both crisis types and candidate survival blocks) using the *scout\_blocks\_at()* function, where *max\_distance* specifies the agent’s maximum perception range. Agents can mine target survival blocks using *mine\_blocks\_at()*, and construct a shelter using *build\_floor()*. Agents that do not participate in the mining process can retreat to the shelter via *move\_to()*. Item exchange is performed through a chest, whose location is predefined in the task metadata, using the *deposit\_to\_chest()* and *get\_from\_chest()* functions.
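Because each tool takes a fixed parameter list, a planned action can be checked against the toolset before it is executed. The sketch below encodes Table 5 as a lookup and reports missing parameters for a task entry in the `do`/`with` plan format used by the agent prompts. It is an illustrative sketch only: the task-type keys, the function name, and the assumption that the `with` keys match the table's parameter names are hypothetical and not confirmed by the paper.

```python
CHEST = ("chest_pos", "items", "quantities")

# Tool name -> required parameter names, transcribed from Table 5.
# The task-type keys below are hypothetical labels.
TOOLSETS = {
    "prepare_for_a_crisis": {
        "scout_blocks_at": ("target_pos", "max_distance"),
        "mine_blocks_at": ("block_positions",),
        "build_floor": ("center_pos", "width", "depth", "height"),
        "move_to": ("target_pos",),
        "deposit_to_chest": CHEST,
        "get_from_chest": CHEST,
    },
    "mine_vanishing_blocks": {
        "scout_blocks_at": ("target_pos", "max_distance"),
        "mine_blocks_at": ("block_positions",),
        "deposit_to_chest": CHEST,
    },
    "raid_a_boss": {
        "move_to": ("target_pos",),
        "find_entities": ("entity_types", "count", "max_distance"),
        "attack": ("entity_type",),
        "equip_item": ("item",),
        "use_item": ("item",),
        "deposit_to_chest": CHEST,
        "get_from_chest": CHEST,
    },
}


def missing_params(task_type, task):
    """Return the required parameters absent from a plan entry's "with" dict."""
    tools = TOOLSETS[task_type]
    action = task["do"]
    if action not in tools:
        raise KeyError(f"{action!r} is not available for {task_type!r}")
    provided = task.get("with", {})
    return [name for name in tools[action] if name not in provided]
```

For instance, an `attack` entry with an empty `with` dictionary would be flagged as missing `entity_type`, and `build_floor` would be rejected outright for the “mine vanishing blocks” task.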
For the “mine vanishing blocks” task, the action set is simpler, consisting of three functions for scouting and mining blocks, and depositing mined blocks into the chest\. The task is considered successful when the chest contains the target block type in the required quantity\.
For the “raid a boss” task, agents similarly scan nearby enemies, equip or use appropriate items, and exchange items through the chest. For the *attack()* function, the input is not a fixed position because monsters continuously move. Instead, the function searches for the nearest target entity within the vicinity and performs tracking and attacking based on the object reference of that entity. The attack process terminates when either a timeout (15 seconds) is reached or the target entity is defeated. This functionality is implemented using the Mineflayer-pvp [[16](https://arxiv.org/html/2606.15684#bib.bib9)] module.
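The termination logic of *attack()* can be sketched independently of the game client. The following Python sketch only mirrors the behavior described above (resolve the nearest entity once, act on its reference, stop on defeat or timeout); it is not the Mineflayer-pvp API, and the `find_nearest` and `hit` callbacks, the dictionary-based entity, and the return labels are all hypothetical.

```python
import time


def attack(find_nearest, hit, timeout_s=15.0, clock=time.monotonic):
    """Track and attack the nearest target until it is defeated or time runs out.

    find_nearest: hypothetical callback returning an entity dict or None.
    hit: hypothetical callback that performs one attack on the entity reference.
    """
    target = find_nearest()  # resolved once; later steps follow the reference
    if target is None:
        return "no_target"
    deadline = clock() + timeout_s
    while clock() < deadline:
        if target["hp"] <= 0:
            return "defeated"
        # The entity may have moved; we act on the object, not a stored position.
        hit(target)
    return "defeated" if target["hp"] <= 0 else "timeout"
```

Passing `clock` as a parameter keeps the loop testable with a simulated clock instead of real wall-clock time.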
## Appendix G Agent System Prompts
We provide the full system prompt templates for both baseline agent architectures evaluated in our experiments. Template variables in \{curly\_braces\} (e.g., \{agent\_name\}, \{task\_type\}) are populated at runtime with the corresponding agent state, task specification, and environment context. Listing 4 shows the centralized baseline prompt, where a single LLM call receives all agents’ states and produces a joint plan. Listing 5 shows the distributed baseline prompt, where each agent independently makes decisions and coordinates through an explicit negotiation protocol.
### G.1 Centralized Baseline
Listing 4: System prompt template for the centralized baseline\.

You are a centralized controller managing ALL \{num\_agents\} agents in a Minecraft team\.

You receive every agent’s state, capabilities, and action history\. You produce a
coordinated plan for ALL agents simultaneously\. Each agent will execute ONLY the
tasks you assign to it\. Agents do NOT communicate with each other \-\- you are the
single decision\-maker\.

Your plans will be executed IMMEDIATELY\. You will be called again when ALL agents
finish their assigned tasks\.

\#\# Strategy
Think holistically\. You see the full picture \-\- every agent’s position, health,
inventory, and capabilities\. Use this to:
\- Assign each agent to the sub\-task that best fits its capabilities
\- Avoid duplicate work \(two agents mining the same block\)
\- Balance workload so agents finish at roughly the same time
\- Anticipate coordination needs \(e\.g\., one agent scouts while another builds\)


Output a single JSON object:

\{
  "reasoning": "Brief explanation of your overall strategy \(2\-3 sentences\)",
  "agent\_plans": \{
    "\{agent\_name\_0\}": \[
      \{
        "id": "\{agent\_name\_0\}\_task\_1",
        "do": "action\_name",
        "with": \{\},
        "after": \[\],
        "note": "Human\-readable description"
      \}
    \],
    \.\.\. \(one entry per agent, same format\)
  \}
\}

\*\*Fields:\*\*
\- \*\*reasoning\*\*: Your overall coordination strategy \(2\-3 sentences\)\.
\- \*\*agent\_plans\*\*: A dictionary mapping each agent name to its task list\.
  \- Each task list follows DAG format \-\- tasks execute in dependency order\.
  \- The "after" field must ONLY reference task IDs within the SAME agent’s plan\.
  \- Task IDs must be prefixed with the agent’s name\.

CONSTRAINTS:
1\. You MUST assign tasks to each alive agent\.
2\. Use ONLY positions from the provided knowledge \-\- NEVER invent coordinates\.
3\. If no blocks are known, assign find\_blocks or scout\_blocks\_at first\.
4\. The "after" field must ONLY reference task IDs within the SAME agent’s plan
   \-\- never cross\-agent dependencies\.
5\. Output ONLY the JSON object, no additional text\.
6\. Every action requires its parameters\. Missing parameters will cause FAILURE\.
7\. Do NOT assign tasks to DEAD agents\. They are marked below\.
8\. BREAK REPETITION: If an agent’s actions failed 2\+ times, assign a
   different approach\.
9\. BALANCE WORKLOAD: Assign roughly equal work to each agent\. You will not be
   called again until ALL agents complete their plans\.


\- Task Type: \{task\_type\}
\- Goal: \{goal\}


\{guidance\_text\}


\{task\_content\}


\{env\_context\}
\{task\_specific\_context\}
\{available\_actions\}

\- wait: Pause for a specified number of steps\.
  \- with: \{duration: N\} \(1\-\-50 steps\)


\{all\_agents\_context\}


\{combined\_history\}


\{dynamic\_subgoals\}
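The plan-format rules stated in the centralized prompt (task IDs prefixed with the agent's name, `after` referencing only tasks in the same agent's plan, and execution in dependency order) can be checked mechanically before a plan is dispatched. The sketch below is one possible validator under those stated rules; it is not the framework's actual implementation, and the function names and the `bot0`/`bot1` agent names in the usage example are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def validate_agent_plans(agent_plans):
    """Return a list of rule violations found in an "agent_plans" dictionary."""
    errors = []
    for agent, tasks in agent_plans.items():
        ids = {task["id"] for task in tasks}
        for task in tasks:
            if not task["id"].startswith(agent):
                errors.append(f"{task['id']}: id is not prefixed with {agent}")
            for dep in task.get("after", []):
                if dep not in ids:
                    errors.append(
                        f"{task['id']}: 'after' references {dep} outside {agent}'s plan"
                    )
        # Only same-agent dependencies take part in the cycle check.
        graph = {t["id"]: [d for d in t.get("after", []) if d in ids] for t in tasks}
        try:
            tuple(TopologicalSorter(graph).static_order())
        except CycleError:
            errors.append(f"{agent}: dependency cycle")
    return errors


def execution_order(tasks):
    """Order one agent's tasks so that every task follows its "after" entries."""
    graph = {task["id"]: set(task.get("after", [])) for task in tasks}
    return list(TopologicalSorter(graph).static_order())


plans = {
    "bot0": [
        {"id": "bot0_task_1", "do": "scout_blocks_at", "with": {}, "after": []},
        {"id": "bot0_task_2", "do": "mine_blocks_at", "with": {},
         "after": ["bot0_task_1"]},
    ],
    "bot1": [
        # Violates the rule against cross-agent dependencies.
        {"id": "bot1_task_1", "do": "move_to", "with": {}, "after": ["bot0_task_1"]},
    ],
}
problems = validate_agent_plans(plans)
```

Here `problems` contains a single entry for `bot1_task_1`, whose `after` field points into another agent's plan.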
### G.2 Distributed Baseline
Listing 5: System prompt template for the distributed baseline\.

You are Agent \{agent\_name\} \(index \{agent\_idx\}\) in a fully distributed
\{num\_agents\}\-agent Minecraft team\.

You decide your own actions autonomously \-\- you plan, communicate, negotiate,
and execute on your own judgment\. By default there is no designated leader, but
you may PROPOSE or ACCEPT leadership, delegate tasks, or form any coordination
structure if the task benefits from it\. Your plan will be executed IMMEDIATELY\.


You are part of a distributed team\. You and your peers independently decide your
own actions, but effective teams NEGOTIATE before acting on shared decisions\.
When a decision affects multiple agents \-\- role assignment, build location, target
priority, retreat \-\- propose your idea, wait for responses, then act on the outcome\.


Use the propose \-\> wait \-\> read \-\> act pattern:

1\. \*\*Propose\*\*: Send a \[PROPOSAL\] message describing your idea\.
2\. \*\*Wait\*\*: Use the `wait` action to pause while peers read and respond\.
3\. \*\*Read\*\*: After the wait, Conversation History will contain peer responses\.
4\. \*\*Act\*\*: Incorporate feedback \-\- adjust, accept, or counter\-propose\.

Negotiate when: assigning roles, choosing build locations, dividing territory,
forming strategy\. Skip negotiation when: HP is critical, blocks are despawning,
or an attack is in progress \-\- act first and inform peers after\.


When you see a \[PROPOSAL\] or \[COUNTER\-PROPOSAL\] in Conversation History:
\- \*\*\[AGREE\]\*\*: You accept\. State what you will do under the agreed plan\.
\- \*\*\[DISAGREE\]\*\*: You reject\. Explain why and suggest an alternative\.
\- \*\*\[COUNTER\-PROPOSAL\]\*\*: You propose a modification\. Explain the change\.
Silence after a reasonable wait is treated as implicit agreement\.


Re\-negotiate when: action fails 2\+ times, your area runs dry, HP drops below
50%, task phase changes \(e\.g\., scouting \-\> building\), or a peer sends \[HELP\]\.
\.\.\. \(detailed re\-negotiation guidelines for each trigger\)


Your team can adopt any coordination structure \-\- you decide together:
\- \*\*Distributed\*\* \(default\): Each agent decides independently\.
\- \*\*Leader\-worker\*\*: One agent coordinates, others follow\.
\- \*\*Tree/Hierarchy\*\*: Sub\-leaders for sub\-teams\.
Topology is not fixed\. Any agent can propose a change at any time via \[PROPOSAL\]\.


Messages are free\-form natural language\. Tag messages for clarity:
\[PROPOSAL\] \-\- role, plan, topology, or strategy suggestions
\[AGREE\] \-\- accepting a peer’s proposal
\[DISAGREE\] \-\- rejecting a proposal \(include reason \+ alternative\)
\[COUNTER\-PROPOSAL\] \-\- proposing a modification to someone’s plan
\[STATUS\] \-\- action results or progress updates
\.\.\. \(and \[DISCOVERY\], \[HELP\], \[CONFLICT\], \[PLAN\_CHANGE\]\)


\- No\-repeat rule: Do NOT resend messages that already cover your situation\.
\- When sending a \[PROPOSAL\], follow it with a `wait` action to collect responses\.
\- When receiving a \[PROPOSAL\], respond with \[AGREE\], \[DISAGREE\], or
  \[COUNTER\-PROPOSAL\]\. Do not silently ignore it\.
\- Respect claims: If a peer claimed coordinates, pick different targets\.
\.\.\. \(additional collision\-avoidance and target\-claiming rules\)


Output a single JSON object:

\{
  "reasoning": "Brief explanation of your current decision",
  "my\_role": "Your current role \(e\.g\., ’Tank’, ’Miner’, ’Scout’\)",
  "plan": \[
    \{
      "id": "\{agent\_name\}\_task\_1",
      "do": "action\_name",
      "with": \{\},
      "after": \[\],
      "note": "Human\-readable description"
    \}
  \],
  "messages": \[
    \{"to": "MineflayerBot1", "content": "\[PROPOSAL\] I’ll mine cobblestone
      at \[5,64,10\]\. Can you mine stone at \[20,64,8\]?"\},
    \.\.\. \(one message per recipient\)
  \]
\}

\*\*Fields:\*\*
\- \*\*reasoning\*\*: Why you chose this action \(1\-2 sentences\)\.
\- \*\*my\_role\*\*: Your current team role\. Use "Undecided" if no role yet\.
\- \*\*plan\*\*: Task sequence in DAG format\. Can be empty \[\] if only messaging\.
\- \*\*messages\*\*: Messages to peers\. Can be empty \[\]\.

CONSTRAINTS:
1\. Plan ONLY tasks for YOURSELF \-\- do NOT include an "agent" field\.
2\. Use ONLY positions from your knowledge \-\- NEVER invent coordinates\.
3\. If no blocks are known, use find\_blocks or scout\_blocks\_at first\.
4\. The "after" field must ONLY reference task IDs within YOUR OWN plan\.
5\. Output ONLY the JSON object, no additional text\.
6\. Every action requires its parameters\. Missing parameters will cause FAILURE\.
7\. RESPECT CLAIMS: If a peer claimed coordinates, choose different targets\.
8\. BREAK REPETITION: If the same action fails 2\+ times, try a different
   approach\.



Agent proposes roles and waits for team input:
\{
  "reasoning": "First decision\. Coordinating roles before we start\.",
  "my\_role": "Undecided",
  "plan": \[
    \{"id": "\{agent\_name\}\_wait", "do": "wait",
     "with": \{"duration": 5\}, "after": \[\],
     "note": "Wait for team to respond to role proposal"\}
  \],
  "messages": \[
    \{"to": "MineflayerBot1", "content": "\[PROPOSAL\] I have a netherite
      sword and high HP \-\- I’ll tank\. You have a bow \-\- ranged DPS?"\},
    \{"to": "MineflayerBot2", "content": "\[PROPOSAL\] I’ll tank\. You have
      golden apples \-\- healer/support?"\}
  \]
\}
\.\.\. \(after wait, agent reads responses and executes the agreed plan\)


\- Task Type: \{task\_type\}
\- Goal: \{goal\}


\{guidance\_text\}


\{task\_content\}


\{env\_context\}
\{task\_specific\_context\}
\{available\_actions\}

\- wait: Pause for a specified number of steps\. Use after sending a \[PROPOSAL\]
  to give peers time to respond\.
  \- with: \{duration: N\} \(1\-\-50 steps\)


\{peer\_agents\}


\{history\}


\- Current Step: \{current\_step\}/\{max\_steps\}
\{agent\_state\}


\{current\_action\}


\{remaining\_plan\}
\{dynamic\_subgoals\}
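The message tags defined in the distributed prompt lend themselves to simple parsing on the receiving side. The sketch below extracts a leading tag from a free-form message and flags the tags that the prompt says must be answered. It is an illustrative sketch, not part of the released framework: the function names and the choice to treat untagged text as a plain message are assumptions.

```python
import re

# Tags listed in the distributed prompt.
TAGS = {
    "PROPOSAL", "AGREE", "DISAGREE", "COUNTER-PROPOSAL", "STATUS",
    "DISCOVERY", "HELP", "CONFLICT", "PLAN_CHANGE",
}
TAG_RE = re.compile(r"^\s*\[([A-Z_-]+)\]\s*(.*)$", re.DOTALL)


def parse_message(content):
    """Split a message into (tag, body); tag is None for untagged messages."""
    match = TAG_RE.match(content)
    if match and match.group(1) in TAGS:
        return match.group(1), match.group(2).strip()
    return None, content.strip()


def requires_response(tag):
    """Proposals and counter-proposals call for AGREE, DISAGREE, or a counter."""
    return tag in {"PROPOSAL", "COUNTER-PROPOSAL"}


tag, body = parse_message("[PROPOSAL] I'll mine cobblestone at [5,64,10].")
```

Only a tag at the start of the message is recognized, so coordinate lists such as `[5,64,10]` inside the body are left untouched.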
## Appendix H TickingCollabAgent Operation Timeline Examples
[Figure 10](https://arxiv.org/html/2606.15684#A8.F10) shows example operational timelines of TickingCollabAgent-centralized. [Figure 10(a)](https://arxiv.org/html/2606.15684#A8.F10.sf1) shows an example of the “prepare for a crisis” task, where four agents need to build a shelter to survive a lava flood. The central agent correctly selects stone blocks (instead of wood, which burns) and assigns two agents with pickaxes to mine them, while sending the other two to scout the flood. It also chooses a shelter location far from the crisis. However, it instructs only one agent to build the shelter, leaving the other mining agent idle. As a result, the shelter remains incomplete due to insufficient materials.
[Figure 10(b)](https://arxiv.org/html/2606.15684#A8.F10.sf2) shows another example from the “mine vanishing blocks” task, where five agents need to mine 2 oak logs and 15 gold blocks. The central agent correctly assigns agents with axes and pickaxes to the corresponding targets. However, it does not account for the lifetime of the blocks when scheduling the mining actions; as a result, the blocks often disappear by the time an agent reaches them. While the two agents repeatedly fail to mine the disappeared blocks, the remaining agents stay idle; a more effective strategy would have been to proactively search for newly spawned blocks at other locations.
(a) Prepare for a crisis.
(b) Mine vanishing blocks.
Figure 10: TickingCollabAgent-centralized operation timeline examples.