Terminal-World: Scaling Terminal-Agent Environments via Agent Skills
Summary
Terminal-World introduces a fully automated pipeline that uses agent skills to synthesize high-quality training data for terminal agents, enabling models to outperform baselines with only 1.2% of the training data. The method co-derives task instructions, environments, and teacher trajectories from skill primitives.
# Terminal-World: Scaling Terminal-Agent Environments via Agent Skills
Source: [https://arxiv.org/html/2605.20876](https://arxiv.org/html/2605.20876)
Zihao Cheng1,\*, Hongru Wang2,\*, Zeming Liu1,†, Xinyi Wang2, Xiangrong Zhu2, Yuhang Guo3, Wei Lin2, Jeff Z. Pan4, Yunhong Wang1

1 School of Computer Science and Engineering, Beihang University, Beijing, China; 2 Independent Researcher; 3 Beijing Institute of Technology; 4 University of Edinburgh

\*Equal contribution. †Corresponding author. Email: {zihaocheng, zmliu}@buaa.edu.cn
###### Abstract
Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources, such as human-defined seeds or GitHub repositories, to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive. Skills jointly encode *what* to accomplish, *when* to apply (preconditions and environment state), and *how* to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.
Figure 1: Overview of Terminal-World (left) and agent performance (right). Terminal-World uses agent skills as the synthesis primitive for terminal-agent data construction. Each skill encodes *what* the agent should accomplish, *when* the skill should be applied, and *how* the task should be executed. By decoding these three aspects, Terminal-World synthesizes task instructions, environments, and trajectories top-down in a unified process. With 85× less data than Nemotron-Terminal, Terminal-World achieves a 4.5% absolute improvement on Terminal-Bench 2.0.

## 1 Introduction
LLM-based agents are increasingly moving from predefined API calls (Patil et al., [2024](https://arxiv.org/html/2605.20876#bib.bib6); Liu et al., [2024](https://arxiv.org/html/2605.20876#bib.bib8); [Liu et al.](https://arxiv.org/html/2605.20876#bib.bib15); Jin et al., [2024](https://arxiv.org/html/2605.20876#bib.bib7); Prabhakar et al., [2025](https://arxiv.org/html/2605.20876#bib.bib19); Yang et al., [2025b](https://arxiv.org/html/2605.20876#bib.bib16); Li et al., [2025](https://arxiv.org/html/2605.20876#bib.bib17); Dong et al., [2026](https://arxiv.org/html/2605.20876#bib.bib41)) to direct terminal operation. Systems such as Claude Code (Anthropic, [2025](https://arxiv.org/html/2605.20876#bib.bib1)) and Codex (OpenAI, [2025](https://arxiv.org/html/2605.20876#bib.bib2)) issue shell commands inside real execution environments, replacing fixed tool schemas with a compositional action space that affords substantially greater generality (Meng et al., [2026](https://arxiv.org/html/2605.20876#bib.bib33); Bui, [2026](https://arxiv.org/html/2605.20876#bib.bib34)) and autonomy (Wang et al., [2025a](https://arxiv.org/html/2605.20876#bib.bib51)).
Despite this potential, the progress of terminal agents is fundamentally bottlenecked by the scarcity of high-quality training data. Unlike API-based agents, which simply select and parameterize predefined tools (Qu et al., [2025](https://arxiv.org/html/2605.20876#bib.bib50)), terminal agents operate within real file systems and runtime environments (Merrill et al., [2026](https://arxiv.org/html/2605.20876#bib.bib27); Gandhi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib25)). Each training example jointly specifies a task instruction, an executable environment with initial files, dependencies, and system configurations, and a high-quality multi-turn trajectory. The tight interdependence among these components makes manually curating such data prohibitively expensive and difficult to scale, motivating a growing line of work on automated terminal-agent data synthesis.
Existing methods synthesize terminal-agent data by starting from partial sources, such as human-defined seed data (i.e., manually specified keywords or short descriptors) (Gandhi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib25); Zhu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib23); Pi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib24)) or GitHub repositories (Wu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib32)), to instantiate one component of the data, and then rely on LLMs to complete the rest. Although this paradigm can synthesize terminal-agent data, it still suffers from three key limitations:

- (1) Limited Tasks: tasks are directly converted from human-defined seeds or repositories, resulting in a constrained distribution that fails to capture the diverse requirements and complexity of real-world tasks.
- (2) Environment Misalignment: task semantics and execution environments are not jointly specified from the beginning, so environments are retrofitted around tasks, producing configurations that are fragile or only loosely aligned with the intended task.
- (3) Trajectory Inefficiency: without explicit procedural guidance, teacher models often rely on autonomous exploration to solve each sandbox, producing trajectories with redundant exploration, suboptimal solution paths, and strong dependence on the teacher's intrinsic terminal-solving capability.
Our key observation is that a natural synthesis primitive for terminal-agent data already exists in open-source ecosystems: agent skills (Xia et al., [2026](https://arxiv.org/html/2605.20876#bib.bib39); Lu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib40)), such as those collected in ClawHub (ClawHub, [2026](https://arxiv.org/html/2605.20876#bib.bib44)) and SkillsMP (SkillsMP, [2026](https://arxiv.org/html/2605.20876#bib.bib43)). These are human-authored guidance packages that encapsulate authentic terminal workflows distilled from real practice. As illustrated in Figure [1](https://arxiv.org/html/2605.20876#S0.F1) (left), each skill jointly encodes three aspects of an end-to-end terminal task: ① what should be accomplished, ② when the skill should be applied (i.e., the preconditions, inputs, and environmental state required for execution), and ③ how it should be executed. An agent skill thus constitutes a pre-aligned specification of task semantics, environmental constraints, and execution procedure, directly addressing the three limitations mentioned above.
Building on this primitive, we introduce Terminal-World, a fully automated pipeline that orchestrates a multi-agent architecture to instantiate each agent skill as a unified *task instruction–executable environment–teacher trajectory* triple. To further scale the synthesis space, Terminal-World extends individual skills into agent skill teams and agent skill graphs, enabling more complex multi-role and cross-domain task synthesis. To broaden the usage scenarios of each skill, Terminal-World pairs skills with user personas (Chan et al., [2024](https://arxiv.org/html/2605.20876#bib.bib4)), enabling the same underlying ability to be instantiated across diverse user backgrounds, goals, and preferences. Using Terminal-World, we construct 5,723 high-fidelity terminal-agent training environments and collect teacher trajectories with DeepSeek-V3.2 at an average cost of only \$0.17, demonstrating the efficiency of our automated construction harness. We further train a family of models, Terminal-World-8B/14B/32B. Across 6 benchmarks, the Terminal-World series consistently outperforms existing terminal-agent baselines at comparable model scales. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B (Pi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib24)) on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3. It also exhibits more efficient task-execution behavior (Sec. [5.1](https://arxiv.org/html/2605.20876#S5.SS1)), requiring fewer steps and commands while maintaining lower command-failure rates. These results demonstrate that our pipeline produces diverse, high-quality terminal environments and effective trajectories at low cost. Overall, our contributions are summarized as follows:
- We propose Terminal-World, a fully automated synthesis pipeline that uses agent skills as the central synthesis primitive to jointly drive task instruction synthesis, environment construction, and teacher trajectory collection.
- Using Terminal-World, we construct 5,723 high-fidelity terminal-agent training environments, each paired with a skill-guided teacher trajectory, and train a family of models, Terminal-World-8B/14B/32B, on this data.
- Extensive experiments across 6 benchmarks show that the Terminal-World series outperforms existing terminal-agent baselines. Notably, with the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.
## 2 Related Work

Table 1: Comparison of existing datasets. The columns fall under four groups: Task, Environment, Tool, and Trajectory. Align. indicates whether task semantics and environments are jointly designed rather than post-hoc adapted. Open File Space indicates support for arbitrary file types in environments. Exec. Verif. indicates whether task completion can be verified by executing evaluation scripts. Sandbox is one of Web (web search), Python, SQL (SQL engine), or Terminal. Tool Space is Fixed for predefined toolsets and Open for extensible tool spaces. Teacher Source indicates whether trajectories are generated through self-solving or guided by additional structured guidelines.

| Dataset | Primitive | Human Free | Align. | No Pre-def. | Real World | Open File Space | Exec. Verif. | Sandbox | Tool Gen. | Tool Space | # Tools | Teacher Source | # Traj. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gorilla (Patil et al., [2024](https://arxiv.org/html/2605.20876#bib.bib6)) | API specs | ✗ | – | ✗ | ✗ | ✗ | ✗ | – | ✗ | Fixed | 1,645 | Self | 16,450 |
| ToolBridge (Jin et al., [2024](https://arxiv.org/html/2605.20876#bib.bib7)) | API specs | ✗ | – | ✗ | ✗ | ✗ | ✓ | Python | ✗ | Fixed | ∞ | Self | 178,023 |
| APIGen (Liu et al., [2024](https://arxiv.org/html/2605.20876#bib.bib8)) | API specs | ✓ | – | ✗ | ✓ | ✗ | ✓ | Python | ✗ | Fixed | 3,673 | Self | 60,000 |
| WebExplorer (Liu et al., [2025b](https://arxiv.org/html/2605.20876#bib.bib11)) | Web entities | ✓ | – | ✓ | ✓ | ✗ | ✗ | Web | ✗ | Fixed | 2 | Self | 13,000 |
| ProgSearch (Pandit et al., [2025](https://arxiv.org/html/2605.20876#bib.bib13)) | Web entities | ✓ | – | ✓ | ✓ | ✗ | ✗ | Web + Python | ✓ | Fixed | 3 | Guided | 5,500 |
| Aseacher ([Gao et al.](https://arxiv.org/html/2605.20876#bib.bib14)) | Web entities | ✗ | – | ✗ | ✓ | ✗ | ✗ | Web | ✗ | Fixed | 2 | Self | 35,000 |
| ToolACE ([Liu et al.](https://arxiv.org/html/2605.20876#bib.bib15)) | API specs | ✗ | – | ✗ | ✗ | ✗ | ✗ | – | ✗ | Fixed | 26,507 | Guided | 11,300 |
| ToolMind (Yang et al., [2025b](https://arxiv.org/html/2605.20876#bib.bib16)) | API specs | ✓ | – | ✗ | ✗ | ✗ | ✗ | – | ✗ | Fixed | 20,000 | Guided | 111,941 |
| InfTool (Li et al., [2025](https://arxiv.org/html/2605.20876#bib.bib17)) | API specs | ✓ | – | ✗ | ✗ | ✗ | ✗ | – | ✗ | Fixed | 3,059 | Self | 4,965 |
| DataMind (Qiao et al., [2025](https://arxiv.org/html/2605.20876#bib.bib18)) | Data files | ✔✗ | ✓ | ✓ | ✓ | ✗ | ✓ | Python + SQL | ✓ | Open | ∞ | Guided | 11,707 |
| APIGen-MT (Prabhakar et al., [2025](https://arxiv.org/html/2605.20876#bib.bib19)) | API specs | ✓ | – | ✗ | ✗ | ✗ | ✓ | – | ✗ | Fixed | 28 | Guided | 5,000 |
| TaskCraft (Shi et al., [2025](https://arxiv.org/html/2605.20876#bib.bib20)) | Seeds | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | Python | ✗ | Open | ∞ | Guided | 36,000 |
| GEM (Xu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib22)) | Raw text | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | – | ✗ | Open | ∞ | Guided | 10,000 |
| Endless Terminal (Gandhi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib25)) | Seeds | ✗ | ✗ | ✓ | ✓ | ✔✗ | ✓ | Terminal | ✓ | Open | 420 | – | – |
| TermiGen (Zhu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib23)) | Seeds | ✗ | ✗ | ✓ | ✓ | ✔✗ | ✓ | Terminal | ✓ | Open | 420 | Guided | 3,291 |
| Nemotron-Terminal (Pi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib24)) | Seeds | ✗ | ✗ | ✓ | ✓ | ✔✗ | ✓ | Terminal | ✓ | Open | ∞ | Self | 490,520 |
| Terminal-World (Ours) | Agent skills | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Terminal | ✓ | Open | ∞ | Guided | 5,723 |
#### Tool\-Using Agents
LLM-based agents interact with the external world through tool use, enabling them to execute actions beyond the limits of their parametric knowledge (Qu et al., [2025](https://arxiv.org/html/2605.20876#bib.bib50)). To strengthen this capability, early efforts synthesized training data for API selection and argument filling (Patil et al., [2024](https://arxiv.org/html/2605.20876#bib.bib6); Liu et al., [2024](https://arxiv.org/html/2605.20876#bib.bib8)). Subsequent work expanded API and tool coverage (Jin et al., [2024](https://arxiv.org/html/2605.20876#bib.bib7); [Liu et al.](https://arxiv.org/html/2605.20876#bib.bib15)), increased the interaction turns and orchestration complexity of tool use (Yang et al., [2025b](https://arxiv.org/html/2605.20876#bib.bib16); Prabhakar et al., [2025](https://arxiv.org/html/2605.20876#bib.bib19); Li et al., [2025](https://arxiv.org/html/2605.20876#bib.bib17)), and broadened workflow sources by mining latent tool-use patterns from raw text (Xu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib22)). A parallel line studies web-search agents and synthesizes search trajectories over web content (Zhang et al., [2025](https://arxiv.org/html/2605.20876#bib.bib9); Tao et al., [2025](https://arxiv.org/html/2605.20876#bib.bib10); Liu et al., [2025b](https://arxiv.org/html/2605.20876#bib.bib11); Sun et al., [2025](https://arxiv.org/html/2605.20876#bib.bib12); Pandit et al., [2025](https://arxiv.org/html/2605.20876#bib.bib13); [Gao et al.](https://arxiv.org/html/2605.20876#bib.bib14); Wang et al., [2025b](https://arxiv.org/html/2605.20876#bib.bib21)). Despite this progress, these datasets typically operate over predefined toolsets, yielding a closed action space that cannot fully capture the open-ended and compositional nature of real-world tasks. In contrast, Terminal-World grounds agents in a Bash terminal, where the action space is no longer bounded by a predefined toolset but instead spans the full spectrum of composable system commands within a real execution environment.
#### Terminal\-Based Agents
The rise of CLI-based coding agents such as Codex (OpenAI, [2025](https://arxiv.org/html/2605.20876#bib.bib2)) and Claude Code (Anthropic, [2025](https://arxiv.org/html/2605.20876#bib.bib1)) has shifted agent interaction toward direct operation in terminal environments, motivating recent efforts to synthesize terminal-agent training data. Endless Terminal (Gandhi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib25)), TermiGen (Zhu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib23)), and Nemotron-Terminal (Pi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib24)) start from human-defined seed data and use LLMs to synthesize terminal tasks, before constructing the corresponding environments, verification scripts, and teacher trajectories. TerminalTraj (Wu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib32)) instead starts from GitHub repositories and infers task instructions and validation logic from existing codebases. However, these methods still face limitations in task diversity, environment-task alignment, and trajectory efficiency. Terminal-World addresses these limitations by using agent skills as the synthesis primitive. Each skill specifies ① what should be accomplished, ② when the skill is applicable, and ③ how the task should be executed, providing a unified anchor from which task instructions, executable environments, verification criteria, and teacher trajectories can be co-derived.
## 3 Terminal-World

Figure 2: Overview of Terminal-World. We start from real-world agent skills, filter high-quality candidates via rule-based, LLM-based, and popularity-based screening, pair each skill with a user persona to synthesize diverse and verifiable task quadruples $(\mathcal{I}, \mathcal{E}, \mathcal{V}, \mathcal{G})$, construct executable sandboxes through iterative generate-verify-repair cycles, and collect efficient trajectories.

As illustrated in Figure [2](https://arxiv.org/html/2605.20876#S3.F2), Terminal-World uses agent skills as the synthesis primitive to jointly align task instructions, environments, and trajectories. The pipeline consists of four stages: (a) Agent Skill Collection (§[3.1](https://arxiv.org/html/2605.20876#S3.SS1)) constructs a terminal capability space from real-world agent skills; (b) Task Generation (§[3.2](https://arxiv.org/html/2605.20876#S3.SS2)) pairs each skill with user personas to diversify the usage scenarios of the underlying capability, and synthesizes instructions, environment blueprints, evaluation criteria, and execution guidelines; (c) Environment Building (§[3.3](https://arxiv.org/html/2605.20876#S3.SS3)) instantiates each blueprint into an executable sandbox with initial files, setup scripts, and pytest verifiers; and (d) Trajectory Collection (§[3.4](https://arxiv.org/html/2605.20876#S3.SS4)) collects efficient teacher trajectories under skill-derived guidance. We provide comprehensive statistics of Terminal-World in §[3.5](https://arxiv.org/html/2605.20876#S3.SS5).
### 3.1 Agent Skill Collection

To construct a high-quality and broadly distributed terminal capability space, we collect 10,000 agent skills from ClawHub and SkillsMP, and apply a three-stage filter to retain those that are relevant and informative. (1) Rule-based filtering removes skills with terminal-irrelevant names (e.g., "skill creator"), leaving 8,520 skills. (2) LLM-based filtering scores each skill on terminal applicability and content richness (1–3 each), retaining only those scoring the maximum on both, yielding 3,025 skills. (3) Popularity-based filtering ranks the remaining skills by their download counts and selects the top 1,000, spanning 12 categories and 63 subcategories. The full taxonomy is shown in Fig. [6](https://arxiv.org/html/2605.20876#A1.F6).
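The three-stage funnel above can be sketched as a chain of simple filters. This is a minimal illustration, not the authors' implementation: the `Skill` fields, the `IRRELEVANT_NAMES` blocklist, and the toy skills are all hypothetical, and the LLM scores are assumed to have been computed already.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    applicability: int  # assumed LLM score for terminal applicability (1-3)
    richness: int       # assumed LLM score for content richness (1-3)
    downloads: int


# Hypothetical blocklist; the paper only gives "skill creator" as an example.
IRRELEVANT_NAMES = {"skill creator"}


def filter_skills(skills, top_k=1000, max_score=3):
    # Stage 1: rule-based filtering on skill names.
    kept = [s for s in skills if s.name.lower() not in IRRELEVANT_NAMES]
    # Stage 2: keep only skills with the maximum score on both dimensions.
    kept = [s for s in kept
            if s.applicability == max_score and s.richness == max_score]
    # Stage 3: rank by download count and keep the top_k.
    kept.sort(key=lambda s: s.downloads, reverse=True)
    return kept[:top_k]


demo = [
    Skill("skill creator", 3, 3, 9000),
    Skill("git-bisect-helper", 3, 3, 120),
    Skill("csv-cleaner", 3, 2, 800),
    Skill("docker-debug", 3, 3, 450),
]
selected = filter_skills(demo, top_k=1)
print([s.name for s in selected])
```

Ordering the stages from cheapest (string rules) to most selective (popularity ranking) keeps the number of LLM scoring calls bounded by the output of the rule-based stage.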
Building on these single skills as atomic primitives, we further extend the synthesis space along two complementary dimensions: agent skill teams, which compose multiple skills *within the same subcategory* into a multi-role workflow as a depth extension, and agent skill graphs, which connect skills *across different subcategories* into an end-to-end pipeline as a breadth extension. Specifically, we use SkillNet (Liang et al., [2026](https://arxiv.org/html/2605.20876#bib.bib42)) to classify pairs of skills into 4 relations: *Compose with*, *Depends on*, *Similar to*, and *Belong to*. We treat *Compose with* and *Depends on* as composition signals, use *Similar to* as a deduplication signal, and discard *Belong to* as redundant with our subcategory taxonomy. Composition relations within the same subcategory are grouped into agent skill teams and synthesized into multi-role workflows via TeamSkill-Creator (openJiuwen-ai, [2026](https://arxiv.org/html/2605.20876#bib.bib45)) driven by Claude Code, yielding 76 skill teams. Composition relations spanning different subcategories are used to construct a cross-subcategory composition graph, from which we greedily extract maximal paths until all nodes are consumed, yielding 237 skill graphs. Both teams and graphs are flattened into the same `skill.md` format and, together with the 1,000 single skills, serve as the synthesis primitives $\mathcal{S} = \mathcal{S}_{single} \cup \mathcal{S}_{team} \cup \mathcal{S}_{graph}$ for the next stage.
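One way to read "greedily extract maximal paths until all nodes are consumed" is sketched below. The paper does not specify the traversal order, tie-breaking, or whether single-node paths are kept, so those choices (start from nodes with the fewest predecessors, follow the first unused successor, keep singletons) are assumptions, as are the toy skill names.

```python
def extract_paths(edges):
    """Greedily split a directed composition graph into node-disjoint paths."""
    succ, pred, nodes = {}, {}, []
    for u, v in edges:
        for n in (u, v):
            if n not in succ:
                succ[n], pred[n] = [], []
                nodes.append(n)
        succ[u].append(v)
        pred[v].append(u)

    used, paths = set(), []
    # Assumption: visit nodes with fewer predecessors first so that paths
    # tend to start at sources of the composition graph.
    for start in sorted(nodes, key=lambda n: len(pred[n])):
        if start in used:
            continue
        path = [start]
        used.add(start)
        while True:
            # Extend with the first successor not yet consumed by any path.
            nxt = next((v for v in succ[path[-1]] if v not in used), None)
            if nxt is None:
                break  # the path cannot be extended any further
            path.append(nxt)
            used.add(nxt)
        paths.append(path)
    return paths


edges = [
    ("fetch-data", "clean-csv"),
    ("clean-csv", "plot-report"),
    ("fetch-data", "load-sql"),
]
paths = extract_paths(edges)
print(paths)
```

Each extracted path would then correspond to one candidate skill graph; a real pipeline might additionally discard single-node paths, since they add nothing beyond the single-skill primitives.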
### 3.2 Task Generation

Given the synthesis primitives $\mathcal{S}$ from Sec. [3.1](https://arxiv.org/html/2605.20876#S3.SS1), this stage converts each into a unified task specification. These primitives are well-suited for this purpose because they explicitly describe what the agent should accomplish, when the skill should be applied, and how the task should be executed. These three aspects provide natural anchors for constructing the task instruction, execution context, and execution guideline, respectively. To diversify the usage scenarios of each capability, we pair each primitive $\mathcal{S}$ with a user persona $\mathcal{U}$ sampled from FinePersonas (Lozhkov et al., [2024](https://arxiv.org/html/2605.20876#bib.bib46); Chan et al., [2024](https://arxiv.org/html/2605.20876#bib.bib4)), which consists of short natural-language profiles describing potential users' backgrounds, roles, and preferences. During task synthesis, the LLM instantiates a task only when the sampled persona forms a coherent usage scenario for the given primitive, while irrelevant pairs are ignored. For each pair, we synthesize a quadruple $(\mathcal{I}, \mathcal{E}, \mathcal{V}, \mathcal{G})$ as follows:

$$\mathcal{I}, \mathcal{E}, \mathcal{V} = \mathrm{LLM}(\mathcal{P}_{s}, \mathcal{S}, \mathcal{U}), \quad \mathcal{G} = \mathrm{LLM}(\mathcal{P}_{g}, \mathcal{I}, \mathcal{S}), \tag{1}$$

where $\mathcal{P}_{s}$ and $\mathcal{P}_{g}$ denote the prompt templates for task synthesis and guideline generation, respectively (both in Appendix [D](https://arxiv.org/html/2605.20876#A4)). Here, $\mathcal{I}$ denotes the task instruction. $\mathcal{E}$ denotes the environment blueprint used for sandbox construction, consisting of initial files $\mathcal{E}_{\text{files}}$ and setup steps $\mathcal{E}_{\text{steps}}$. $\mathcal{V}$ specifies the evaluation criteria that translate the task goal into verifiable completion conditions, which are later used to generate pytest-based verifiers. $\mathcal{G}$ provides skill-derived execution guidance for collecting teacher trajectories.
To ensure task quality, we apply an LLM-as-a-Judge filter along five dimensions: ① instruction quality, ② closed-world solvability, ③ blueprint completeness, ④ guideline quality, and ⑤ evaluation-criteria quality (prompt in Appendix [D](https://arxiv.org/html/2605.20876#A4)). We retain only samples that receive a score of at least 4 on every dimension and pass them to the subsequent stages. An end-to-end example of this stage is provided in Appendix [E.1](https://arxiv.org/html/2605.20876#A5.SS1).
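The retention rule is a conjunction over the five dimensions: one low score rejects the sample. A minimal sketch follows, assuming the judge's scores arrive as a dictionary; the key names and the choice to treat a missing dimension as a failure are illustrative assumptions.

```python
# Hypothetical key names for the five judged dimensions.
DIMENSIONS = (
    "instruction_quality",
    "closed_world_solvability",
    "blueprint_completeness",
    "guideline_quality",
    "evaluation_criteria_quality",
)


def passes_judge(scores, threshold=4):
    """Keep a sample only if every dimension reaches the threshold."""
    # Assumption: a missing dimension counts as a failure, not a pass.
    return all(scores.get(dim, 0) >= threshold for dim in DIMENSIONS)


good = dict.fromkeys(DIMENSIONS, 5)
weak = {**good, "guideline_quality": 3}  # one weak dimension is enough to reject
candidates = {"good": good, "weak": weak}
kept = [name for name, s in candidates.items() if passes_judge(s)]
print(kept)
```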
### 3.3 Environment Building

Given the task specification $(\mathcal{I}, \mathcal{E}, \mathcal{V}, \mathcal{G})$, this stage instantiates the environment blueprint into an executable and verifiable sandbox. Each sandbox comprises three artifacts: initial files $\mathcal{F}$ that define the starting workspace, a setup script $\mathcal{B}_{\text{env}}$ that prepares runtime dependencies and services, and a pytest verifier $\mathcal{T}_{\text{test}}$ that provides automatic completion checking. To ensure the quality of all three artifacts, we construct them through a unified generate-verify-repair (GVR) mechanism:

$$x^{(0)} = \mathrm{Generate}(\cdot), \quad x^{(t+1)} = \mathrm{Repair}\bigl(x^{(t)},\; \mathrm{Verify}(x^{(t)})\bigr), \tag{2}$$

where $x$ denotes the artifact. Tasks that cannot be repaired within $T = 3$ iterations are discarded. We now describe the $\mathrm{Generate}(\cdot)$ and $\mathrm{Verify}(\cdot)$ procedures instantiated for each artifact.
(1) Initial Files: To support the generation of arbitrary file types, we adopt a multi-agent architecture that routes each file in $\mathcal{E}_{\text{files}}$ to a dedicated agent based on its generation mode, which is annotated during task generation in Section [3.2](https://arxiv.org/html/2605.20876#S3.SS2). Specifically, each file is handled by an LLM-synthesis agent $\mathcal{A}_{llm\_direct}$, a local-tool agent $\mathcal{A}_{local\_tool}$ equipped with file-creation tools, or a remote-fetch agent $\mathcal{A}_{remote\_fetch}$ with search tools, each conditioned on $\mathcal{I}$ and the file description (prompts in Appendix [D](https://arxiv.org/html/2605.20876#A4)). A file-verification agent (prompt in Appendix [D](https://arxiv.org/html/2605.20876#A4)) then inspects both *① internal correctness* (well-formedness and description alignment) and *② external consistency* (cross-file paths, references, and schemas), with failures triggering joint repair across dependent files.

(2) Setup Scripts: For scalability, rather than building a per-task Docker image, we use a shared general-purpose sandbox with task-specific setup scripts. An environment-building agent converts the natural-language steps $\mathcal{E}_{\text{steps}}$ into executable shell commands for dependency installation, service initialization, and runtime configuration (Appendix [D](https://arxiv.org/html/2605.20876#A4)). To confirm logical correctness rather than relying on script exit codes alone, an environment-verification agent then generates and executes diagnostic probing scripts that inspect whether required packages, services, and runtime states are properly established (Appendix [D](https://arxiv.org/html/2605.20876#A4)); detected issues are returned to the building agent for repair.

(3) Pytest Verifiers: A verifier-generation agent translates the evaluation criteria $\mathcal{V}$ into pytest scripts over the expected post-execution state, conditioned on $\mathcal{I}$, $\mathcal{V}$, $\mathcal{F}$, and $\mathcal{B}_{\text{env}}$ (Appendix [D](https://arxiv.org/html/2605.20876#A4)). A verifier-validation agent then checks two properties: *① executability*, requiring all test scripts to run without syntax or import errors, and *② reliability*, requiring all tests to fail on the pre-execution initial state to rule out vacuous passes. Tests violating either property are returned to the generation agent for repair. An example of this stage is provided in Appendix [E.2](https://arxiv.org/html/2605.20876#A5.SS2).
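The GVR loop of Eq. (2) and the reliability check on verifiers can be sketched together. This is an illustrative toy under stated assumptions, not the paper's agents: the verifier is modeled as a dictionary of plain Python check functions rather than pytest files, the executability check is omitted, and the toy repair step simply drops offending checks, whereas the paper returns them to the generation agent.

```python
import tempfile
from pathlib import Path


def generate_verify_repair(generate, verify, repair, max_iters=3):
    """Return a verified artifact, or None if max_iters repairs are not enough."""
    artifact = generate()
    for _ in range(max_iters):
        issues = verify(artifact)
        if not issues:
            return artifact
        artifact = repair(artifact, issues)
    # One final check after the last allowed repair; discard on failure.
    return artifact if not verify(artifact) else None


def vacuous_tests(tests, workspace):
    """Names of checks that already pass on the pre-execution workspace."""
    passing = []
    for name, check in tests.items():
        try:
            check(workspace)
        except AssertionError:
            continue  # expected: the task has not been solved yet
        passing.append(name)
    return passing


def check_report_exists(ws):
    assert (ws / "report.csv").exists()


def check_workspace_exists(ws):
    assert ws.exists()  # passes before any work is done, so it is vacuous


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    verified = generate_verify_repair(
        generate=lambda: {"report": check_report_exists,
                          "workspace": check_workspace_exists},
        verify=lambda tests: vacuous_tests(tests, ws),
        # Toy repair: drop the offending checks.
        repair=lambda tests, bad: {k: v for k, v in tests.items() if k not in bad},
    )
    unrepairable = generate_verify_repair(
        generate=lambda: {"workspace": check_workspace_exists},
        verify=lambda tests: vacuous_tests(tests, ws),
        repair=lambda tests, bad: tests,  # a repair step that changes nothing
    )

print(list(verified), unrepairable)
```

Returning `None` after the final check mirrors the rule that tasks not repaired within $T = 3$ iterations are discarded rather than kept in a partially valid state.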
### 3.4 Trajectory Collection

Given the synthesized tasks and their executable sandboxes, this stage collects efficient teacher trajectories. Specifically, we use the execution guideline $\mathcal{G}$ synthesized in Section [3.2](https://arxiv.org/html/2605.20876#S3.SS2) as skill-derived guidance for the teacher model, rather than letting it explore freely, which often results in lengthy and redundant trajectories. Following the setup of Pi et al. ([2026](https://arxiv.org/html/2605.20876#bib.bib24)) for fair comparison, we adopt DeepSeek-V3.2 (Liu et al., [2025a](https://arxiv.org/html/2605.20876#bib.bib3)) with the Terminus2 scaffolding (Harbor Framework Team, [2026](https://arxiv.org/html/2605.20876#bib.bib5)) as the teacher, which isolates the contribution of our data construction pipeline from differences in teacher capability. At each step, the teacher produces an action $\mathcal{A}_{t} = \pi_{\text{teacher}}(\mathcal{I}, \mathcal{G}, \mathcal{H}_{t-1})$ and receives an observation $\mathcal{O}_{t}$ from the sandbox, where $\mathcal{H}_{t-1} = (\mathcal{A}_{1}, \mathcal{O}_{1}, \ldots, \mathcal{A}_{t-1}, \mathcal{O}_{t-1})$ denotes the prior interaction history. After rollout, we run the verifier $\mathcal{T}_{\text{test}}$ against the resulting sandbox state to annotate each trajectory with its verification outcome, while retaining both successful and failed trajectories (Wu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib32)). A representative trajectory with per-step analysis, commands, and observations is shown in Appendix [E.3](https://arxiv.org/html/2605.20876#A5.SS3). Importantly, the guideline $\mathcal{G}$ is used only during trajectory collection: before SFT training, we remove $\mathcal{G}$ from the training input so that the student model learns from the verified terminal interaction itself rather than relying on auxiliary procedural hints.
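The rollout loop and the removal of $\mathcal{G}$ before SFT can be sketched as follows. Everything here is a stand-in: the scripted policy replays the guideline instead of calling a teacher model, the sandbox is a function that echoes the command, and the output dictionary layout is a hypothetical format, not the paper's.

```python
def collect_trajectory(policy, sandbox, instruction, guideline, max_steps=30):
    """Roll out A_t = policy(I, G, H_{t-1}) and record (action, observation) pairs."""
    history = []
    for _ in range(max_steps):
        action = policy(instruction, guideline, history)
        if action is None:  # the policy signals that it is done
            break
        history.append((action, sandbox(action)))
    return history


def to_sft_example(instruction, history, verified):
    # The guideline is deliberately not an argument: the student sees only
    # the instruction and the interaction, never the procedural hints.
    return {"instruction": instruction, "steps": history, "verified": verified}


def scripted_policy(instruction, guideline, history):
    # Stand-in for a teacher model: replay the guideline one step at a time.
    return guideline[len(history)] if len(history) < len(guideline) else None


instruction = "Copy data.csv to out/report.csv"
guideline = ["mkdir -p out", "cp data.csv out/report.csv"]
history = collect_trajectory(
    scripted_policy, lambda cmd: f"ran: {cmd}", instruction, guideline
)
example = to_sft_example(instruction, history, verified=True)
print(example)
```

Keeping the guideline out of the function signature, rather than filtering it afterwards, makes it structurally impossible for the hints to leak into the training input.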
### 3.5 Data Statistics

Table 2: Key statistics of Terminal-World.

| Statistic | Value |
|---|---|
| **Task** | |
| # Persona | 4,973 |
| # Single-skill Tasks | 3,723 |
| # Team-skill Tasks | 1,000 |
| # Graph-skill Tasks | 1,000 |
| # Total Tasks | 5,723 |
| **Environment** | |
| Avg. Init. Files | 2.25 |
| # File Types | 104 |
| Avg. pytest Tests | 4.27 |
| **Trajectory** | |
| Avg. Steps | 13.44 |
| Avg. Tokens | 18,176 |
| # Bash Cmd. | 1,939 |

Figure 3: Comprehensive statistics of Terminal-World.
To characterize the resulting dataset, we analyze Terminal-World from three dimensions: task coverage, environment diversity, and trajectory complexity, as shown in Table [2](https://arxiv.org/html/2605.20876#S3.F3) and Figure [3](https://arxiv.org/html/2605.20876#S3.F3). At the task level, Terminal-World uses 1,000 single skills together with 76 skill teams and 237 skill graphs to synthesize 5,723 terminal-agent tasks, covering diverse capability scenarios. At the environment level, each task contains an executable sandbox with an average of 2.25 initial files and 4.27 pytest tests. These environments span 104 file types, including both textual formats (e.g., JSON) and non-textual formats (e.g., SQL and Parquet), reflecting the diversity of realistic terminal workspaces. At the trajectory level, each verified demonstration contains 13.44 steps and 18,176 tokens on average, and the collected trajectories cover 1,939 distinct Bash commands. These statistics indicate that Terminal-World constructs not only diverse task scenarios, but also executable environments and long-horizon trajectories suitable for terminal-agent training.
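The headline data-efficiency figures can be cross-checked against the reported counts. The short calculation below assumes that the "1.2%" and "85×" figures compare the 5,723 Terminal-World tasks with the 490,520 trajectories listed for Nemotron-Terminal in Table 1; the paper does not spell out the formula, so this is a consistency check rather than the authors' derivation.

```python
# Task counts from Table 2.
single, team, graph = 3723, 1000, 1000
total_tasks = single + team + graph

# "# Traj." reported for Nemotron-Terminal in Table 1.
nemotron_trajectories = 490_520

# Share of Nemotron-Terminal's data volume, in percent.
data_fraction_pct = round(100 * total_tasks / nemotron_trajectories, 1)
# Whole-number reduction factor.
reduction_factor = nemotron_trajectories // total_tasks

print(total_tasks, data_fraction_pct, reduction_factor)
```

Under that assumption the three task types sum to the reported total, and the ratio reproduces both the 1.2% and the 85× figures.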
## 4Experiments
Table 3: Main results on 6 benchmarks: Terminal-Bench 2.0 (TB 2.0), AIME 24, AIME 25, DABench, TableBench, and BIRD. We report accuracy (%), including pass@1 (P@1) and pass@3 (P@3). In the Open-source Models (<100B) block, the best result is shown in bold, and the second-best result is underlined.

| Model | # Training Samples | TB 2.0 P@1 | TB 2.0 P@3 | AIME24 P@1 | AIME24 P@3 | AIME25 P@1 | AIME25 P@3 | DABench P@1 | DABench P@3 | TableBench P@1 | TableBench P@3 | BIRD P@1 | BIRD P@3 | Avg. P@1 | Avg. P@3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Frontier Proprietary Models* | | | | | | | | | | | | | | | |
| GPT-5.2 | – | 53.9 | 74.2 | – | – | – | – | – | – | – | – | – | – | – | – |
| Gemini-3-Flash | – | 51.7 | 66.3 | 93.3 | 100.0 | 90.0 | 96.7 | 87.0 | 92.0 | 73.5 | 77.5 | 49.5 | 58.0 | 74.2 | 81.8 |
| Gemini-3.1-Pro-Preview | – | 68.5 | 80.9 | 96.7 | 100.0 | 100.0 | 100.0 | 88.0 | 92.5 | 77.0 | 79.0 | 59.0 | 62.5 | 81.5 | 85.8 |
| Claude-Sonnet-4.5 | – | 42.7 | 53.9 | – | – | – | – | – | – | – | – | – | – | – | – |
| *Open-source Models (>100B)* | | | | | | | | | | | | | | | |
| GPT-OSS-120B (High) | – | 13.5 | 27.0 | 90.0 | 96.7 | 86.7 | 96.7 | 75.0 | 90.0 | 66.5 | 79.0 | 50.0 | 59.5 | 63.6 | 74.8 |
| MiniMax-M2.7 | – | 56.2 | 65.2 | 50.0 | 83.3 | 63.3 | 96.7 | 85.5 | 90.5 | 76.0 | 83.0 | 49.5 | 58.5 | 63.4 | 79.5 |
| Qwen3-Coder | – | 23.6 | 39.3 | 70.0 | 73.3 | 46.7 | 66.7 | 81.0 | 88.5 | 67.5 | 75.5 | 49.5 | 62.0 | 56.4 | 67.6 |
| DeepSeek-V3.2 | – | 38.2 | 52.8 | 83.3 | 96.7 | 93.3 | 96.7 | 87.0 | 92.5 | 72.5 | 79.5 | 50.0 | 56.5 | 70.7 | 79.1 |
| GLM-5 | – | 56.2 | 65.2 | 86.7 | 96.7 | 83.3 | 93.3 | 88.0 | 91.5 | 76.5 | 82.5 | 56.0 | 60.5 | 74.5 | 81.6 |
| Kimi-K2-Thinking | – | 36.0 | 49.4 | 76.7 | 83.3 | 70.0 | 86.7 | 85.0 | 91.5 | 75.0 | 82.0 | 52.5 | 58.5 | 65.9 | 75.2 |
| *Open-source Models (<100B)* | | | | | | | | | | | | | | | |
| Qwen3-8B | – | 2.2 | 3.37 | 63.3 | 76.7 | 53.3 | 63.3 | 16.5 | 32.5 | 24.5 | 39.0 | 4.0 | 9.0 | 27.3 | 37.3 |
| Qwen3-14B | – | 4.5 | 7.87 | 76.7 | 86.7 | 53.3 | 66.7 | 38.0 | 69.0 | 35.5 | 60.0 | 7.5 | 18.5 | 35.9 | 51.5 |
| Qwen3-32B | – | 4.5 | 9.0 | 60.0 | 86.7 | 43.3 | 66.7 | 44.5 | 75.0 | 26.5 | 53.5 | 6.5 | 16.0 | 30.9 | 51.2 |
| GPT-OSS-20B (High) | – | 3.4 | 9.0 | 93.3 | 96.7 | 83.3 | 93.3 | 59.5 | 82.5 | 38.5 | 64.0 | 0.0 | 6.0 | 46.3 | 58.6 |
| Endless Terminal-8B | 3.3k | 6.7 | 12.4 | 23.3 | 36.7 | 30.0 | 43.3 | 26.0 | 38.5 | 28.0 | 42.0 | 6.0 | 11.5 | 20.0 | 30.7 |
| TermiGen-32B | 3.3k | 19.1 | 27.0 | 46.7 | 56.7 | 33.3 | 43.3 | 81.0 | 91.5 | 68.0 | 79.0 | 50.5 | 62.5 | 49.8 | 60.0 |
| Terminal-Traj-32B | 50.7k | 22.0 | 27.0 | 46.7 | 60.0 | 26.7 | 46.7 | 76.0 | 88.0 | 62.0 | 76.0 | 49.5 | 61.5 | 47.2 | 59.9 |
| Nemotron-Terminal-8B | 490.5k | 13.5 | 21.3 | 83.3 | 83.3 | 63.3 | 90.0 | 80.5 | 92.0 | 66.5 | 73.5 | 44.0 | 55.5 | 58.5 | 69.3 |
| Nemotron-Terminal-14B | 490.5k | 20.2 | 24.7 | 90.0 | 96.7 | 90.0 | 93.3 | 80.5 | 90.5 | 70.0 | 75.0 | 50.5 | 59.0 | 66.9 | 73.2 |
| Nemotron-Terminal-32B | 490.5k | 27.0 | 37.1 | 93.3 | 100.0 | 86.7 | 93.3 | 81.5 | 93.0 | 72.5 | 75.5 | 52.5 | 57.0 | 68.9 | 76.0 |
| *Ours* | | | | | | | | | | | | | | | |
| Terminal-World-8B | 5.7k | 15.7 | 23.6 | 86.7 | 93.3 | 80.0 | 90.0 | 80.5 | 91.0 | 69.5 | 74.5 | 48.5 | 58.0 | 63.5 | 71.7 |
| Terminal-World-14B | 5.7k | 21.3 | 27.0 | 90.0 | 96.7 | 83.3 | 93.3 | 81.0 | 92.0 | 70.5 | 76.0 | 50.0 | 59.5 | 66.0 | 74.1 |
| Terminal-World-32B | 5.7k | 31.5 | 43.8 | 93.3 | 96.7 | 86.7 | 93.3 | 83.5 | 93.0 | 71.5 | 79.0 | 49.5 | 61.5 | 69.3 | 77.9 |
### 4.1 Experiment Setting
#### Baselines
Following prior work (Zhu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib23)), we compare Terminal-World against the following baselines. Frontier Proprietary Models: GPT-5.2, Gemini-3-Flash, Gemini-3.1-Pro-Preview, and Claude-Sonnet-4.5. Open-source Models (>100B): GPT-OSS-120B (Agarwal et al., [2025](https://arxiv.org/html/2605.20876#bib.bib48)), MiniMax-M2.7, Qwen3-Coder-480B (Yang et al., [2025a](https://arxiv.org/html/2605.20876#bib.bib26)), DeepSeek-V3.2 (Liu et al., [2025a](https://arxiv.org/html/2605.20876#bib.bib3)), GLM-5 (Zeng et al., [2026](https://arxiv.org/html/2605.20876#bib.bib37)), and Kimi-K2-Thinking (Team et al., [2025](https://arxiv.org/html/2605.20876#bib.bib49)). Open-source Models (<100B): Qwen3-8B/14B/32B (Yang et al., [2025a](https://arxiv.org/html/2605.20876#bib.bib26)), GPT-OSS-20B, Endless Terminal-8B (Gandhi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib25)), TermiGen-Qwen3-32B (Zhu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib23)), Terminal-Traj-32B (Wu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib32)), and Nemotron-Terminal-8B/14B/32B (Pi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib24)). All models are evaluated under the Terminus2 Agent scaffolding (Harbor Framework Team, [2026](https://arxiv.org/html/2605.20876#bib.bib5)), except Endless Terminal-8B and TermiGen-32B, which are evaluated in their native formats (i.e., EndlessAgent and BashAgent, respectively).
#### Benchmarks
Following prior work (Shi et al., [2025](https://arxiv.org/html/2605.20876#bib.bib20); Pi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib24); Zhu et al., [2026](https://arxiv.org/html/2605.20876#bib.bib23)), we evaluate models on six benchmarks. Terminal-Bench 2.0 (Merrill et al., [2026](https://arxiv.org/html/2605.20876#bib.bib27); Zeng et al., [2026](https://arxiv.org/html/2605.20876#bib.bib37)) evaluates terminal-agentic coding ability. The remaining benchmarks are converted into terminal-based tasks (Pi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib24)): AIME24/25 (Zhang and Math-AI, [2024](https://arxiv.org/html/2605.20876#bib.bib30), [2025](https://arxiv.org/html/2605.20876#bib.bib31)) evaluate mathematical reasoning, DABench (Hu et al., [2024](https://arxiv.org/html/2605.20876#bib.bib28)) and TableBench (Wu et al., [2025](https://arxiv.org/html/2605.20876#bib.bib29)) evaluate table-based data analysis, and BIRD (Li et al., [2023](https://arxiv.org/html/2605.20876#bib.bib38)) evaluates SQL-based data analysis. Details are in Appendix [B.1](https://arxiv.org/html/2605.20876#A2.SS1).
#### Implementation Details
Following prior work (Pi et al., [2026](https://arxiv.org/html/2605.20876#bib.bib24)), we conduct SFT using Swift (Zhao et al., [2024](https://arxiv.org/html/2605.20876#bib.bib35)). Specifically, we set the learning rate to 2e-5, with a warmup ratio of 0.1 and weight decay of 1e-4, and train for 2 epochs with a sequence length of 32,768. During inference, we adopt a context length of 40,960 tokens, temperature 0.6, top-p = 0.95, top-k = 20, and min-p = 0.0. All experiments are conducted on 4 nodes, each equipped with 8 H20-141GB GPUs. For all other baselines, we follow the officially recommended sampling parameters. Details are in Appendix [B.3](https://arxiv.org/html/2605.20876#A2.SS3).
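For quick reference, the reported settings can be collected into a small configuration sketch. The class and field names below are illustrative assumptions and do not correspond to Swift's actual argument names; only the numeric values come from the text.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SFTConfig:
    """Training hyperparameters as reported in the text (field names are hypothetical)."""
    learning_rate: float = 2e-5
    warmup_ratio: float = 0.1
    weight_decay: float = 1e-4
    num_epochs: int = 2
    max_seq_length: int = 32_768


@dataclass(frozen=True)
class SamplingConfig:
    """Inference-time sampling parameters as reported in the text."""
    max_context_tokens: int = 40_960
    temperature: float = 0.6
    top_p: float = 0.95
    top_k: int = 20
    min_p: float = 0.0


@dataclass(frozen=True)
class ClusterConfig:
    """Hardware layout: 4 nodes with 8 GPUs each."""
    num_nodes: int = 4
    gpus_per_node: int = 8

    @property
    def total_gpus(self) -> int:
        return self.num_nodes * self.gpus_per_node


sft = SFTConfig()
sampling = SamplingConfig()
cluster = ClusterConfig()

# Extra context available at inference beyond the training sequence length.
headroom_tokens = sampling.max_context_tokens - sft.max_seq_length

print(asdict(sft), asdict(sampling), cluster.total_gpus, headroom_tokens)
```

Under these values the setup uses 32 GPUs in total, and the inference context exceeds the training sequence length by 8,192 tokens.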
### 4.2 Main Results
Table [3](https://arxiv.org/html/2605.20876#S4.T3) presents the main results across six benchmarks. We summarize the key findings as follows:
#### Terminal\-World delivers strong performance with significantly higher sample efficiency\.
At all three model sizes, Terminal-World outperforms existing terminal-synthetic baselines across benchmarks. Terminal-World-32B achieves 69.3 Avg. Pass@1 and 77.9 Avg. Pass@3. On Terminal-Bench 2.0, Terminal-World-32B achieves 31.5 Pass@1 and 43.8 Pass@3, surpassing all <100B open-source models. Notably, these results are obtained with only 5.7K training trajectories (1.2% of the 490.5K used by Nemotron-Terminal), demonstrating that skill-grounded synthesis can achieve superior performance at a fraction of the data scale.
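The 1.2% figure follows directly from the two dataset sizes. The sketch below reproduces it, assuming that 490.5K denotes exactly 490,500 samples (the text gives only the rounded value) and using the 5,723 environments reported for Terminal-World.

```python
def data_fraction_percent(ours: int, baseline: int) -> float:
    """Size of our training set as a percentage of the baseline's."""
    return 100.0 * ours / baseline


TERMINAL_WORLD_TRAJECTORIES = 5_723   # reported in the paper
NEMOTRON_TERMINAL_SAMPLES = 490_500   # assumption: 490.5K taken as an exact count

fraction = data_fraction_percent(TERMINAL_WORLD_TRAJECTORIES, NEMOTRON_TERMINAL_SAMPLES)
reduction_factor = NEMOTRON_TERMINAL_SAMPLES / TERMINAL_WORLD_TRAJECTORIES

print(f"{fraction:.2f}% of the baseline data, about {reduction_factor:.0f}x fewer samples")
```

Rounded to one decimal place, the fraction is 1.2%, which corresponds to roughly 86 times fewer training samples.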
#### Skill\-grounded trajectories generalize beyond terminal\-native coding without auxiliary data\.
Unlike Nemotron-Terminal, which supplements terminal data with 226.3K auxiliary samples spanning math, code, and SWE tasks, Terminal-World is trained exclusively on 5.7K terminal-style trajectories. Despite this, Terminal-World-32B remains on par with Nemotron-Terminal-32B on the five non-terminal benchmarks in Table [3](https://arxiv.org/html/2605.20876#S4.T3): it ties on AIME25, leads on DABench Pass@1 and on TableBench and BIRD Pass@3, and trails on AIME24 Pass@3 and on TableBench and BIRD Pass@1. This indicates that skill-grounded trajectories develop transferable problem-solving capabilities that generalize to mathematical reasoning, table analysis, and SQL tasks without domain-specific training data.
#### The performance advantage is most pronounced at the 32B scale\.
On Terminal-Bench 2.0, Terminal-World-32B achieves the largest absolute margin over its counterpart, surpassing Nemotron-Terminal-32B by +4.5 Pass@1 and +6.7 Pass@3. This pronounced gain at the largest scale suggests that higher-capacity models can better leverage the structured supervision encoded in skill-grounded trajectories.
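The per-scale margins can be recomputed from the Terminal-Bench 2.0 columns of Table 3. The following sketch copies those values into a dictionary (the variable and function names are illustrative) and reports the Pass@1 and Pass@3 gaps at each model size.

```python
from typing import Dict, Tuple

# (Pass@1, Pass@3) on Terminal-Bench 2.0, copied from Table 3.
TB2_RESULTS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "8B": {"Terminal-World": (15.7, 23.6), "Nemotron-Terminal": (13.5, 21.3)},
    "14B": {"Terminal-World": (21.3, 27.0), "Nemotron-Terminal": (20.2, 24.7)},
    "32B": {"Terminal-World": (31.5, 43.8), "Nemotron-Terminal": (27.0, 37.1)},
}


def margins(results: Dict[str, Dict[str, Tuple[float, float]]]) -> Dict[str, Tuple[float, float]]:
    """Absolute (Pass@1, Pass@3) margin of Terminal-World over Nemotron-Terminal per scale."""
    out = {}
    for scale, rows in results.items():
        ours, baseline = rows["Terminal-World"], rows["Nemotron-Terminal"]
        out[scale] = (round(ours[0] - baseline[0], 1), round(ours[1] - baseline[1], 1))
    return out


per_scale = margins(TB2_RESULTS)
largest_scale = max(per_scale, key=lambda s: per_scale[s][0])

for scale, (p1, p3) in per_scale.items():
    print(f"{scale}: +{p1} Pass@1, +{p3} Pass@3")
print("largest Pass@1 margin at", largest_scale)
```

With these numbers the 32B margin is the largest on both metrics, although the Pass@1 margin is not monotonic in model size: the 14B gap is smaller than the 8B gap.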
## 5 Analysis
In this section, we conduct a comprehensive analysis to answer the following research questions:

- RQ1: Does Terminal-World learn efficient and reliable terminal execution behaviors? (§[5.1](https://arxiv.org/html/2605.20876#S5.SS1); Appendix [C.1](https://arxiv.org/html/2605.20876#A3.SS1))
- RQ2: Which SFT data construction choices are most important? (§[5.2](https://arxiv.org/html/2605.20876#S5.SS2); Appendix [C.2](https://arxiv.org/html/2605.20876#A3.SS2))
- RQ3: Can Terminal-World synthesize diverse, high-quality, and cost-effective terminal-agent training data? (§[5.3](https://arxiv.org/html/2605.20876#S5.SS3); Appendices [C.3](https://arxiv.org/html/2605.20876#A3.SS3), [C.4](https://arxiv.org/html/2605.20876#A3.SS4), and [C.5](https://arxiv.org/html/2605.20876#A3.SS5))
### 5.1 RQ1: Behavior Analysis
Figure 4: Behavior statistics on Terminal-Bench 2.0 with the Terminus2 scaffolding. Terminal-World-32B provides more efficient interactions and more reliable task execution.

In Table [3](https://arxiv.org/html/2605.20876#S4.T3), we report the accuracy of the Terminal-World series. In this section, we further examine their task-execution behavior. Specifically, we analyze four behavioral metrics: the average number of steps (Avg Steps), the command failure rate (Command Error Rate), and the average number of commands at both the step (Commands / Step) and trajectory (Avg Total Commands) levels. To avoid confounding effects caused by differences in task difficulty, we conduct the analysis on the intersection of tasks correctly solved by both Terminal-World-32B and Nemotron-Terminal-32B across three independent runs on Terminal-Bench 2.0. The results are shown in Figure [4](https://arxiv.org/html/2605.20876#S5.F4). Terminal-World-32B requires only 10.2 steps and 40.3 commands on average to complete each task, while also exhibiting a lower command failure rate. Notably, even without guideline assistance, Terminal-World-32B outperforms its teacher model, DeepSeek-V3.2, across all four behavioral metrics. This suggests that skill-guided trajectories in Terminal-World teach the model more concise and goal-directed execution strategies.
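The four behavioral metrics and the shared-task filter can be sketched as follows. The `Step` and `Trajectory` records are hypothetical data structures, and "solved across three runs" is interpreted here as "solved in every run", which is an assumption since the text does not spell out the exact rule.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class Step:
    # One flag per command issued in this step; True means the command failed.
    command_failed: List[bool]


@dataclass
class Trajectory:
    task_id: str
    solved: bool
    steps: List[Step]


def solved_in_every_run(trajs: Iterable[Trajectory]) -> Set[str]:
    """Task ids that were solved in all recorded runs (assumed intersection rule)."""
    by_task: Dict[str, List[bool]] = {}
    for t in trajs:
        by_task.setdefault(t.task_id, []).append(t.solved)
    return {task for task, flags in by_task.items() if all(flags)}


def behavior_metrics(trajs: List[Trajectory]) -> Dict[str, float]:
    """Avg Steps, Avg Total Commands, Commands / Step, and Command Error Rate."""
    if not trajs:
        raise ValueError("no trajectories to analyze")
    n_steps = sum(len(t.steps) for t in trajs)
    n_cmds = sum(len(s.command_failed) for t in trajs for s in t.steps)
    n_failed = sum(sum(s.command_failed) for t in trajs for s in t.steps)
    return {
        "avg_steps": n_steps / len(trajs),
        "avg_total_commands": n_cmds / len(trajs),
        "commands_per_step": n_cmds / n_steps,
        "command_error_rate": n_failed / n_cmds,
    }


def compare_on_shared_tasks(
    model_a: List[Trajectory], model_b: List[Trajectory]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Restrict both models to tasks each of them solved, then compute metrics."""
    shared = solved_in_every_run(model_a) & solved_in_every_run(model_b)
    a = [t for t in model_a if t.task_id in shared]
    b = [t for t in model_b if t.task_id in shared]
    return behavior_metrics(a), behavior_metrics(b)
```

Restricting both models to the same task set means that differences in step or command counts reflect execution style rather than which tasks each model happened to solve.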
### 5.2 RQ2: Impact of SFT Data Strategies
Table 4: Effect of SFT data strategies. Performance comparison under different data strategies. Values in parentheses are drops relative to the full data strategy.

| SFT Data Strategy | #Samples | Terminal-World-8B Pass@1 | Terminal-World-8B Pass@3 | Terminal-World-14B Pass@1 | Terminal-World-14B Pass@3 |
|---|---|---|---|---|---|
| Full data strategy | 5.7k | 15.7 | 23.6 | 21.3 | 27.0 |
| w/ 1k single-skill traj. | 1.0k | 9.0 (↓6.7) | 12.4 (↓11.2) | 13.5 (↓7.8) | 16.9 (↓10.1) |
| w/ 1k team-skill traj. | 1.0k | 10.1 (↓5.6) | 13.5 (↓10.1) | 14.6 (↓6.7) | 19.1 (↓7.9) |
| w/ 1k graph-skill traj. | 1.0k | 10.1 (↓5.6) | 14.6 (↓9.0) | 15.7 (↓5.6) | 20.2 (↓6.8) |
| w/o full data scale | 2.3k | 12.4 (↓3.3) | 18.0 (↓5.6) | 18.0 (↓3.3) | 22.5 (↓4.5) |
| w/o guideline removal | 5.7k | 13.5 (↓2.2) | 20.2 (↓3.4) | 19.1 (↓2.2) | 24.7 (↓2.3) |
| w/o failure trajectory | 2.3k | 10.1 (↓5.6) | 14.6 (↓9.0) | 15.7 (↓5.6) | 20.2 (↓6.8) |
| w/ failure-trajectory suppression | 5.7k | 9.0 (↓6.7) | 13.5 (↓10.1) | 14.6 (↓6.7) | 19.1 (↓7.9) |
Building on the SFT setup in Sec. [4](https://arxiv.org/html/2605.20876#S4), we further investigate the effects of different data construction strategies. Specifically, we conduct SFT at both the 8B and 14B scales under four ablation settings: retaining the guidelines in the instructions (i.e., w/o guideline removal), using a reduced dataset (i.e., w/o full data scale), training only on successful trajectories (i.e., w/o failure trajectory), and suppressing unsuccessful trajectories with a negative SFT loss while keeping successful trajectories unchanged (i.e., w/ failure-trajectory suppression). As shown in Tab. [4](https://arxiv.org/html/2605.20876#S5.T4), retaining guidelines lowers Pass@1 by 2.2 points at both scales, suggesting that prescriptive step-by-step instructions discourage the model from learning autonomous planning. Reducing the data size to 2.3k samples leads to a larger drop of 3.3 Pass@1 points at both scales, confirming the scaling benefit of Terminal-World. More importantly, the remaining two ablations reveal the critical role of failure trajectories. Removing them causes a larger decline than reducing the data scale to the same 2.3k samples (5.6 vs. 3.3 Pass@1 points), suggesting that they cover harder tasks and contain richer error-correction and recovery processes. Penalizing them with a negative SFT loss degrades performance further (6.7 Pass@1 points), because failure trajectories contain many correct intermediate steps; penalizing the entire trajectory inevitably suppresses correct behaviors alongside erroneous ones. We provide a detailed analysis in Appendix [C.2](https://arxiv.org/html/2605.20876#A3.SS2).
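To make the difference between these failure-trajectory settings concrete, the sketch below expresses them as a single trajectory-level weight on the standard negative log-likelihood. This is an illustrative formulation under stated assumptions, not the paper's exact loss: the names `sequence_nll`, `sft_loss`, and `failure_weight` are hypothetical, and the inputs are the probabilities the model assigns to the target tokens.

```python
import math
from typing import List, Sequence, Tuple


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the target tokens in one trajectory."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def sft_loss(batch: List[Tuple[Sequence[float], bool]], failure_weight: float) -> float:
    """Batch loss where each trajectory is weighted by its outcome.

    failure_weight = +1.0  imitates failed trajectories like successful ones
    failure_weight =  0.0  ignores failed trajectories (similar in spirit to dropping them)
    failure_weight = -1.0  penalizes failed trajectories (suppression)
    """
    total = 0.0
    for token_probs, success in batch:
        weight = 1.0 if success else failure_weight
        # The weight applies to every token of the trajectory, so with a negative
        # weight even correct intermediate steps of a failed run are pushed down.
        total += weight * sequence_nll(token_probs)
    return total / len(batch)


batch = [([0.5, 0.5], True), ([0.5, 0.5], False)]
print(sft_loss(batch, 1.0), sft_loss(batch, 0.0), sft_loss(batch, -1.0))
```

Because the sign is applied to the whole trajectory, the negative setting cannot distinguish the correct early commands of a failed run from the erroneous ones, which matches the explanation given above for why suppression hurts. A negatively weighted term is also unbounded below, so practical variants would likely need clipping or a bounded surrogate.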
### 5.3 RQ3: Task Diversity and Cost Analysis
Figure 5: Task diversity of Terminal-World.

Table 5: API cost for Terminal-World data collection.

| Stage | Model | Cost ($) | Output | Avg. Cost ($) |
|---|---|---|---|---|
| Task Generation | DeepSeek-V3.2 | 101.71 | 6,884 tasks | 0.015 |
| Env Building | Gemini-3-Flash | 476.61 | 5,723 envs | 0.083 |
| Trajectory | DeepSeek-V3.2 | 421.27 | 5,723 traj. | 0.074 |
| Total | | 999.59 | 5,723 traj. | 0.170 |
To evaluate the diversity and cost-efficiency of Terminal-World, we first examine task diversity by independently generating 250 tasks under each of three settings: 50 skills without persona pairing, 50 skills paired with 5 personas each, and 25 skills paired with 10 personas each. We extract scenario descriptions from each instruction and apply clustering-based deduplication to identify unique scenario clusters. As shown in Fig. [5](https://arxiv.org/html/2605.20876#S5.F5), persona grounding substantially improves scenario coverage. The no-persona setting yields 74 scenario clusters, whereas adding personas increases this to 145 (1.96×). Doubling the number of personas per skill further increases coverage to 153 clusters (2.07×), confirming that personas serve as effective context multipliers for a fixed set of skill primitives.
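The text does not specify the clustering method, so the following is only a stand-in: a greedy deduplication that merges a scenario description into an existing cluster when its token-level Jaccard similarity to that cluster's representative exceeds a threshold. The function names and the 0.5 threshold are assumptions chosen for illustration; the coverage multipliers at the end use the cluster counts reported above.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens of a scenario description."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def count_scenario_clusters(descriptions: List[str], threshold: float = 0.5) -> int:
    """Greedy clustering: a description starts a new cluster unless it is
    similar enough to the representative of an existing one."""
    representatives: List[Set[str]] = []
    for text in descriptions:
        toks = tokens(text)
        if not any(jaccard(toks, rep) >= threshold for rep in representatives):
            representatives.append(toks)
    return len(representatives)


def coverage_multiplier(clusters: int, baseline_clusters: int) -> float:
    return clusters / baseline_clusters


examples = [
    "clean server logs for an sre",
    "clean server logs for an sre team",
    "convert parquet sales data to csv",
]
print(count_scenario_clusters(examples))
print(round(coverage_multiplier(145, 74), 2), round(coverage_multiplier(153, 74), 2))
```

In practice an embedding-based similarity would likely separate scenarios better than token overlap; the greedy structure stays the same, with only the similarity function swapped.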
Table [5](https://arxiv.org/html/2605.20876#S5.F5) further shows that this increased diversity is achieved at low cost. Our pipeline uses DeepSeek-V3.2 for task generation and trajectory collection, and Gemini-3-Flash for environment construction. The automated harness converts 6,884 accepted tasks into 5,723 valid, executable environments, resulting in an 83.1% construction success rate. Overall, the full pipeline costs $999.59, equivalent to only $0.17 per trajectory. Taken together, these results demonstrate that Terminal-World produces diverse, challenging, and well-aligned terminal data at low cost, as further evidenced by task difficulty analysis (Appendix [C.3](https://arxiv.org/html/2605.20876#A3.SS3)) and multi-dimensional environment quality evaluation (Appendix [C.4](https://arxiv.org/html/2605.20876#A3.SS4)).
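The cost figures can be re-derived from the per-stage entries of Table 5. The sketch below uses only the costs and output counts reported there; the dictionary layout and variable names are illustrative.

```python
from typing import Dict

# Stage costs (USD) and output counts copied from Table 5.
STAGES: Dict[str, Dict[str, float]] = {
    "task_generation": {"cost_usd": 101.71, "outputs": 6_884},
    "env_building": {"cost_usd": 476.61, "outputs": 5_723},
    "trajectory": {"cost_usd": 421.27, "outputs": 5_723},
}

ACCEPTED_TASKS = 6_884
VALID_ENVIRONMENTS = 5_723

# Average cost per produced unit at each stage.
per_unit_cost = {name: s["cost_usd"] / s["outputs"] for name, s in STAGES.items()}

total_cost = sum(s["cost_usd"] for s in STAGES.values())
cost_per_trajectory = total_cost / VALID_ENVIRONMENTS
success_rate_percent = 100.0 * VALID_ENVIRONMENTS / ACCEPTED_TASKS

for name, cost in per_unit_cost.items():
    print(f"{name}: ${cost:.3f} per output")
print(f"total ${total_cost:.2f}, ${cost_per_trajectory:.2f} per trajectory, "
      f"{success_rate_percent:.1f}% construction success")
```

The stage costs sum to $999.59, which is about $0.17 per final trajectory, and 5,723 of 6,884 accepted tasks corresponds to the stated 83.1% success rate.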
## 6 Conclusion
In this paper, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive to jointly drive task, environment, and trajectory construction, and further extends individual skills into skill teams and skill graphs to scale the coverage of the synthesis space. Using this pipeline, we construct 5,723 terminal-agent training environments and train Terminal-World-8B/14B/32B. Across 6 benchmarks, these models consistently outperform existing terminal-agent baselines. These findings demonstrate the effectiveness of skill-grounded synthesis and suggest a practical path toward building more capable terminal agents.
## References
- [1] S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, et al. (2025) Gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.
- [2] (2025-04) Claude code: best practices for agentic coding. Note: [https://www.anthropic.com/engineering/claude-code-best-practices](https://www.anthropic.com/engineering/claude-code-best-practices).
- [3] N. D. Bui (2026) Building effective ai coding agents for the terminal: scaffolding, harness, context engineering, and lessons learned. arXiv preprint arXiv:2603.05344.
- [4] X. Chan, X. Wang, D. Yu, H. Mi, and D. Yu (2024) Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094.
- [5] ClawHub (2026) ClawHub. Note: [https://clawhub.ai/](https://clawhub.ai/).
- [6] G. Dong, J. Lu, J. Huang, W. Zhong, L. Liu, S. Huang, Z. Li, Y. Zhao, X. Song, X. Li, et al. (2026) Agent-world: scaling real-world environment synthesis for evolving general agent intelligence. arXiv preprint arXiv:2604.18292.
- [7] K. Gandhi, S. Garg, N. D. Goodman, and D. Papailiopoulos (2026) Endless terminals: scaling rl environments for terminal agents. arXiv preprint arXiv:2601.16443.
- [8] J. Gao, W. Fu, M. Xie, S. Xu, C. He, Z. Mei, B. Zhu, and Y. Wu. Beyond ten turns: unlocking long-horizon agentic search with large-scale asynchronous rl. In First Workshop on Multi-Turn Interactions in Large Language Models.
- [9] Harbor: A framework for evaluating and optimizing agents and models in container environments. External Links: [Link](https://github.com/harbor-framework/harbor).
- [10] X. Hu, Z. Zhao, S. Wei, Z. Chai, Q. Ma, G. Wang, X. Wang, J. Su, J. Xu, M. Zhu, et al. (2024) Infiagent-dabench: evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507.
- [11] Z. Jin, M. Liu, D. Chen, L. Zhu, Y. Li, and L. Yu (2024) Toolbridge: an open-source dataset to equip llms with external tool capabilities. arXiv preprint arXiv:2410.10872.
- [12] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, et al. (2023) Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems 36, pp. 42330–42357.
- [13] Y. Li, W. Zhang, Z. Huang, M. Yang, J. Wu, S. Guo, H. Hu, L. Sun, J. Yang, M. Tang, et al. (2025) Close the loop: synthesizing infinite tool-use data via multi-agent role-playing. arXiv preprint arXiv:2512.23611.
- [14] Y. Liang, R. Zhong, H. Xu, C. Jiang, Y. Zhong, R. Fang, J. Gu, S. Deng, Y. Yao, M. Wang, S. Qiao, X. Xu, T. Wu, K. Wang, Y. Liu, Z. Bi, J. Lou, Y. E. Jiang, H. Zhu, G. Yu, H. Hong, L. Huang, H. Xue, C. Wang, Y. Wang, Z. Shan, X. Chen, Z. Tu, F. Xiong, X. Xie, P. Zhang, Z. Gui, L. Liang, J. Zhou, C. Wu, J. Shang, Y. Gong, J. Lin, C. Xu, H. Deng, W. Zhang, K. Ding, Q. Zhang, F. Huang, N. Zhang, J. Z. Pan, G. Qi, H. Wang, and H. Chen (2026) SkillNet: create, evaluate, and connect ai skills. External Links: 2603.04448, [Link](https://arxiv.org/abs/2603.04448).
- [15] A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. (2025) Deepseek-v3.2: pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.
- [16] J. Liu, Y. Li, C. Zhang, J. Li, A. Chen, K. Ji, W. Cheng, Z. Wu, C. Du, Q. Xu, et al. (2025) Webexplorer: explore and evolve for training long-horizon web agents. arXiv preprint arXiv:2509.06501.
- [17] W. Liu, X. Huang, X. Zeng, S. Yu, D. Li, S. Wang, W. Gan, Z. Liu, Y. Yu, Z. WANG, et al. ToolACE: winning the points of llm function calling. In The Thirteenth International Conference on Learning Representations.
- [18] Z. Liu, T. Hoang, J. Zhang, M. Zhu, T. Lan, J. Tan, W. Yao, Z. Liu, Y. Feng, R. RN, et al. (2024) Apigen: automated pipeline for generating verifiable and diverse function-calling datasets. Advances in Neural Information Processing Systems 37, pp. 54463–54482.
- [19] FineWeb-edu. External Links: [Document](https://dx.doi.org/10.57967/hf/2497), [Link](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
- [20] Z. Lu, Z. Yao, J. Wu, C. Han, Q. Gu, X. Cai, W. Lu, J. Xiao, Y. Zhuang, and Y. Shen (2026) Skill0: in-context agentic reinforcement learning for skill internalization. arXiv preprint arXiv:2604.02268.
- [21] Q. Meng, Y. Wang, L. Chen, Q. Wang, C. Lu, W. Wu, Y. Gao, Y. Wu, and Y. Hu (2026-04) Agent harness for large language model agents: a survey. Preprints. External Links: [Document](https://dx.doi.org/10.20944/preprints202604.0428.v2), [Link](https://doi.org/10.20944/preprints202604.0428.v2).
- [22] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, et al. (2026) Terminal-bench: benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868.
- [23] OpenAI (2025-05) Introducing codex. Note: [https://openai.com/index/introducing-codex/](https://openai.com/index/introducing-codex/). Overview of the Codex coding agent accessible via ChatGPT and related clients.
- [24] openJiuwen-ai (2026) JiuwenClaw. Note: [https://github.com/openJiuwen-ai/jiuwenclaw](https://github.com/openJiuwen-ai/jiuwenclaw).
- [25] S. Pandit, X. Nguyen, Y. Ming, A. Xu, J. Wang, C. Xiong, and S. Joty (2025) Synthesizing agentic data for web agents with progressive difficulty enhancement mechanisms. arXiv preprint arXiv:2510.13913.
- [26] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez (2024) Gorilla: large language model connected with massive apis. Advances in Neural Information Processing Systems 37, pp. 126544–126565.
- [27] R. Pi, G. Lam, M. Shoeybi, P. Jannaty, B. Catanzaro, and W. Ping (2026) On data engineering for scaling llm terminal capabilities. arXiv preprint arXiv:2602.21193.
- [28] A. Prabhakar, Z. Liu, M. Zhu, J. Zhang, T. Awalgaonkar, S. Wang, Z. Liu, H. Chen, T. Hoang, J. C. Niebles, et al. (2025) Apigen-mt: agentic pipeline for multi-turn data generation via simulated agent-human interplay. arXiv preprint arXiv:2504.03601.
- [29] S. Qiao, Y. Zhao, Z. Qiu, X. Wang, J. Zhang, Z. Bin, N. Zhang, Y. Jiang, P. Xie, F. Huang, et al. (2025) Scaling generalist data-analytic agents. arXiv preprint arXiv:2509.25084.
- [30] C. Qu, S. Dai, X. Wei, H. Cai, S. Wang, D. Yin, J. Xu, and J. Wen (2025) Tool learning with large language models: a survey. Frontiers of Computer Science 19 (8), pp. 198343.
- [31] D. Shi, J. Cao, Q. Chen, W. Sun, W. Li, H. Lu, F. Dong, T. Qin, K. Zhu, M. Liu, et al. (2025) Taskcraft: automated generation of agentic tasks. arXiv preprint arXiv:2506.10055.
- [32] SkillsMP (2026) Agent skills marketplace. Note: [https://skillsmp.com/](https://skillsmp.com/).
- [33] S. Sun, H. Song, Y. Wang, R. Ren, J. Jiang, J. Zhang, F. Bai, J. Deng, W. X. Zhao, Z. Liu, et al. (2025) SimpleDeepSearcher: deep information seeking via web-powered reasoning trajectory synthesis. arXiv preprint arXiv:2505.16834.
- [34] Z. Tao, J. Wu, W. Yin, J. Zhang, B. Li, H. Shen, K. Li, L. Zhang, X. Wang, Y. Jiang, et al. (2025) Webshaper: agentically data synthesizing via information-seeking formalization. arXiv preprint arXiv:2507.15061.
- [35] K. Team, Y. Bai, Y. Bao, Y. Charles, C. Chen, G. Chen, H. Chen, H. Chen, J. Chen, N. Chen, et al. (2025) Kimi k2: open agentic intelligence. arXiv preprint arXiv:2507.20534.
- [36] H. Wang, C. Qian, M. Li, J. Qiu, B. Xue, M. Wang, H. Ji, and K. Wong (2025) Toward a theory of agents as tool-use decision-makers. arXiv preprint arXiv:2506.00886.
- [37] Z. Wang, Y. Liang, X. Zhang, Q. Wu, S. Han, A. Bastos, R. Wang, C. Bansal, B. Peng, J. Gao, et al. (2025) Adapting web agents with synthetic supervision. arXiv preprint arXiv:2511.06101.
- [38] S. Wu, Y. Li, Y. Song, W. Zhang, Y. Wang, R. Batista-Navarro, X. Yang, M. Tang, B. Dai, J. Yang, and C. Lin (2026) Large-scale terminal agentic trajectory generation from dockerized environments. External Links: 2602.01244, [Link](https://arxiv.org/abs/2602.01244).
- [39] X. Wu, J. Yang, L. Chai, G. Zhang, J. Liu, X. Du, D. Liang, D. Shu, X. Cheng, T. Sun, et al. (2025) Tablebench: a comprehensive and complex benchmark for table question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 25497–25506.
- [40] P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, et al. (2026) Skillrl: evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234.
- [41] Z. Xu, R. Li, J. Li, R. Weng, J. Wang, X. Cai, and X. Wang (2026) Unlocking implicit experience: synthesizing tool-use trajectories from text. arXiv preprint arXiv:2601.10355.
- [42] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [43] C. Yang, R. Le, Y. Xing, Z. An, Z. Chen, W. X. Zhao, Y. Song, and T. Zhang (2025) ToolMind technical report: a large-scale, reasoning-enhanced tool-use dataset. arXiv preprint arXiv:2511.15718.
- [44] A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, et al. (2026) Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763.
- [45] G. Zhang, J. Zhu, R. Yang, K. Qiu, M. Zhang, Z. Wu, Q. Dai, B. Liu, C. Luo, Z. Yang, et al. (2025) Infoagent: advancing autonomous information-seeking agents. arXiv preprint arXiv:2509.25189.
- [46] Y. Zhang and T. Math-AI (2024) American invitational mathematics examination (aime) 2024.
- [47] Y. Zhang and T. Math-AI (2025) American invitational mathematics examination (aime) 2025.
- [48] Y. Zhao, J. Huang, J. Hu, X. Wang, Y. Mao, D. Zhang, Z. Jiang, Z. Wu, B. Ai, A. Wang, W. Zhou, and Y. Chen (2024) SWIFT: a scalable lightweight infrastructure for fine-tuning. External Links: 2408.05517, [Link](https://arxiv.org/abs/2408.05517).
- [49] L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, et al. (2024) Sglang: efficient execution of structured language model programs. Advances in neural information processing systems 37, pp. 62557–62583.
- [50] K. Zhu, Y. Nie, Y. Li, Y. Huang, J. Wu, J. Liu, X. Sun, Z. Yin, L. Wang, Z. Liu, E. Barsoum, W. Y. Wang, and W. Guo (2026) TermiGen: high-fidelity environment and robust trajectory synthesis for terminal agents. arXiv preprint arXiv:2602.07274.
## Appendix A Terminal-World Details
### A.1 Skill Taxonomy
Figure [6](https://arxiv.org/html/2605.20876#A1.F6) presents the full skill taxonomy used in Terminal-World, organized into 12 major categories and 63 subcategories. Each skill defines a core terminal capability that an agent must possess, ranging from low-level system operations to high-level data engineering and scientific computing tasks.
Figure 6: Skill category taxonomy. The taxonomy covers 12 major categories and 63 subcategories.
### A.2 Skill Composition: Team and Graph Construction
Algorithm [1](https://arxiv.org/html/2605.20876#alg1) details how the 1,000 filtered single skills are extended into skill teams and skill graphs (introduced in §[3.1](https://arxiv.org/html/2605.20876#S3.SS1)). The key design choices are: (1) SkillNet \[[14](https://arxiv.org/html/2605.20876#bib.bib42)\] operates on the full skill set to produce pairwise relations, avoiding manual enumeration; (2) *Compose-with* relations within the same subcategory drive depth extension (teams), while *Depend-on* relations across subcategories drive breadth extension (graphs); (3) greedy longest-path extraction ensures every skill participates in at least one graph primitive without duplication.
Algorithm 1: Skill Composition: Team and Graph Construction
1: Input: $\mathcal{S}_{single}$: 1,000 filtered skills with subcategory attribute $\mathrm{sub}(\cdot)$
2: Output: skill teams $\mathcal{S}_{team}$ and skill graphs $\mathcal{S}_{graph}$ in `skill.md` format
3: $\mathcal{R} \leftarrow \textsc{SkillNet}(\mathcal{S}_{single})$ // label each pair with 1 of 4 relations
4: // Relation filtering (*Similar-to*: dedup; *Belong-to*: discarded)
5: $\mathcal{R}_{team} \leftarrow \{(s_i, s_j) \in \mathcal{R}_{\textit{Compose-with}} \mid \mathrm{sub}(s_i) = \mathrm{sub}(s_j)\}$
6: $\mathcal{R}_{graph} \leftarrow \{(s_i, s_j) \in \mathcal{R}_{\textit{Depend-on}} \mid \mathrm{sub}(s_i) \neq \mathrm{sub}(s_j)\}$
7: // Skill Teams (depth extension)
8: $\mathcal{S}_{team} \leftarrow \textsc{TeamSkill-Creator}(\mathcal{R}_{team})$
9: // Skill Graphs (breadth extension): greedy maximal-path cover
10: $G \leftarrow \textsc{DirectedGraph}(V = \textsc{skills}(\mathcal{R}_{graph}),\ E = \mathcal{R}_{graph})$
11: $\mathcal{S}_{graph} \leftarrow \emptyset$
12: while $V(G) \neq \emptyset$ do
13: $p^{*} \leftarrow \arg\max_{v \in V(G)} \big|\textsc{LongestSimplePath}(G, v)\big|$
14: if $|p^{*}| < 2$ then break // only isolated nodes left
15: end if
16: $\mathcal{S}_{graph} \leftarrow \mathcal{S}_{graph} \cup \{\textsc{Flatten}(p^{*})\}$
17: $G \leftarrow G\bigl[V(G) \setminus p^{*}\bigr]$
18: end while
19: return $\mathcal{S}_{team},\ \mathcal{S}_{graph}$
20: function LongestSimplePath($G,\ v$)
21: $\mathit{best} \leftarrow [v]$
22: DFS$(v,\ \{v\},\ [v])$ where
23: DFS$(u,\ \mathit{vis},\ \mathit{path})$:
24: $\mathit{extended} \leftarrow \mathbf{false}$
25: for each $(u \to w) \in E(G)$ with $w \notin \mathit{vis}$ do
26: DFS$(w,\ \mathit{vis} \cup \{w\},\ \mathit{path} \circ [w])$; $\mathit{extended} \leftarrow \mathbf{true}$
27: if $\lnot\,\mathit{extended}$ and $|\mathit{path}| > |\mathit{best}|$ then $\mathit{best} \leftarrow \mathit{path}$
28: return $\mathit{best}$
29: end function
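The graph-extraction loop of Algorithm 1 (lines 10 to 29) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the adjacency-dictionary representation, the deterministic tie-breaking by `repr`, and the toy edge list are all assumptions made for the example. The exhaustive DFS is exponential in the worst case, so the sketch is only suitable for small relation graphs.

```python
from typing import Dict, Hashable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]
Adj = Dict[Hashable, Set[Hashable]]


def longest_simple_path(adj: Adj, start: Hashable) -> List[Hashable]:
    """Exhaustive DFS for the longest simple path that begins at `start`."""
    best = [start]

    def dfs(u, visited, path):
        nonlocal best
        extended = False
        # Sorting by repr only makes the traversal order deterministic.
        for w in sorted(adj.get(u, ()), key=repr):
            if w not in visited:
                dfs(w, visited | {w}, path + [w])
                extended = True
        # Only paths that cannot be extended further are candidates.
        if not extended and len(path) > len(best):
            best = path

    dfs(start, {start}, [start])
    return best


def greedy_path_cover(edges: List[Edge]) -> List[List[Hashable]]:
    """Repeatedly extract the longest simple path and delete its nodes."""
    adj: Adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set())

    paths = []
    while adj:
        candidates = (longest_simple_path(adj, v) for v in sorted(adj, key=repr))
        best = max(candidates, key=len)
        if len(best) < 2:  # only isolated nodes are left
            break
        paths.append(best)
        removed = set(best)
        # Induced subgraph on the remaining nodes.
        adj = {u: nbrs - removed for u, nbrs in adj.items() if u not in removed}
    return paths


# Toy Depend-on edges between hypothetical skills.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "e"), ("e", "f")]
print(greedy_path_cover(toy_edges))
```

Because each extracted path is removed from the graph before the next iteration, no node appears in two paths. Nodes that become isolated after a removal are not assigned to any path in this sketch, which mirrors the `break` on line 14 of the algorithm.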
### A.3 Environment Building
This subsection complements Sec. [3.3](https://arxiv.org/html/2605.20876#S3.SS3) by detailing the tool interfaces used in the initial-file generation stage of our multi-agent generate-verify-repair pipeline. Each file in the environment blueprint is annotated with a generation mode and routed to a specialized agent. Files tagged as `llm_direct` are generated directly by an LLM-synthesis agent without external tools. Files tagged as `local_tool` are handled by a Local Tool Agent ($\mathcal{A}_{local\_tool}$), which programmatically creates or repairs artifacts inside the sandbox. Files tagged as `web_fetch` are delegated to a Remote Fetch Agent ($\mathcal{A}_{remote\_fetch}$), which searches, inspects, and downloads public resources when the target artifact depends on external sources. Table [6](https://arxiv.org/html/2605.20876#A1.T6) summarizes the tool schemas of the two tool-augmented agents in a unified format.
Table 6: Tool schema for the Local Tool Agent ($\mathcal{A}_{local\_tool}$) and the Remote Fetch Agent ($\mathcal{A}_{remote\_fetch}$).

| Agent | Tool Name | Description | Parameters |
| --- | --- | --- | --- |
| Local Tool Agent, $\mathcal{A}_{local\_tool}$ | `python` | Run Python code inside `/app` to create or repair the target artifact. | `target_filepath` (string, required): canonical target file path for the job; must exactly match the requested target path.<br>`code` (string, required): Python code to execute inside `/app`.<br>`timeout_sec` (integer, optional): timeout for the Python execution. |
| Remote Fetch Agent, $\mathcal{A}_{remote\_fetch}$ | `web_search` | Search the web for candidate pages or download locations for the target artifact. | `query` (string, required): search query to run.<br>`top_k` (integer, optional): number of search results to request.<br>`domain_hint` (string, optional): domain substring to prefer or filter by. |
| | `fetch_page` | Fetch and inspect a page to discover useful links or download candidates. | `url` (string, required): page URL to inspect.<br>`mode` (string, required): fetch mode, one of `http`, `dynamic`, or `stealth`.<br>`timeout_ms` (integer, optional): timeout for the page fetch in milliseconds. |
| | `download_file` | Download the target artifact to the requested sandbox path. | `url` (string, required): file URL to download.<br>`save_as` (string, required): destination path inside the sandbox.<br>`timeout_ms` (integer, optional): timeout for the download in milliseconds. |

After file generation, all produced artifacts are passed to the File Verify Agent described in Appendix [D](https://arxiv.org/html/2605.20876#A4), which checks both internal correctness and cross-file consistency. Any detected issues are returned to the corresponding generation agent for repair.
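The routing by generation mode and the tool schemas in Table 6 can be illustrated with a small validation sketch. The tool names, parameter names, types, and required flags are transcribed from Table 6; everything else (the agent identifiers, the blueprint entry layout with `path` and `mode` keys, and the helper names `route_file` and `validate_call`) is hypothetical and only serves the example.

```python
from typing import Any, Dict, List

# Generation modes named in the text, mapped to illustrative agent identifiers.
AGENT_FOR_MODE = {
    "llm_direct": "llm_synthesis_agent",
    "local_tool": "local_tool_agent",
    "web_fetch": "remote_fetch_agent",
}

# Transcribed from Table 6: tool -> {parameter: (python type, required)}.
TOOL_SCHEMAS = {
    "python": {"target_filepath": (str, True), "code": (str, True),
               "timeout_sec": (int, False)},
    "web_search": {"query": (str, True), "top_k": (int, False),
                   "domain_hint": (str, False)},
    "fetch_page": {"url": (str, True), "mode": (str, True),
                   "timeout_ms": (int, False)},
    "download_file": {"url": (str, True), "save_as": (str, True),
                      "timeout_ms": (int, False)},
}
FETCH_MODES = {"http", "dynamic", "stealth"}


def route_file(entry: Dict[str, str]) -> str:
    """Pick the agent for one blueprint entry (hypothetical entry layout)."""
    return AGENT_FOR_MODE[entry["mode"]]


def validate_call(tool: str, args: Dict[str, Any]) -> List[str]:
    """Return a list of schema violations; an empty list means the call is valid."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return [f"unknown tool: {tool}"]
    errors = []
    for name, (typ, required) in schema.items():
        if name not in args:
            if required:
                errors.append(f"missing required parameter: {name}")
        elif not isinstance(args[name], typ):
            errors.append(f"wrong type for {name}: expected {typ.__name__}")
    for name in args:
        if name not in schema:
            errors.append(f"unexpected parameter: {name}")
    mode = args.get("mode")
    if tool == "fetch_page" and isinstance(mode, str) and mode not in FETCH_MODES:
        errors.append(f"invalid fetch mode: {mode}")
    return errors


print(route_file({"path": "/app/data/input.csv", "mode": "web_fetch"}))
print(validate_call("python", {"code": "print(1)"}))
```

Keeping the schema as data makes it straightforward to reject malformed tool calls before they reach the sandbox, although the paper does not state how its agents validate calls.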
## Appendix B Experimental Setup and Reproducibility
### B.1 Benchmark Details
Table 7: Overview of the benchmarks, domains, test set sizes, and evaluation metrics used in our experiments. † indicates that the test set was randomly sampled.

| Benchmark | Domain | Test Size | Metric |
| --- | --- | --- | --- |
| Terminal-Bench 2.0 \[[22](https://arxiv.org/html/2605.20876#bib.bib27), [44](https://arxiv.org/html/2605.20876#bib.bib37)\] | Terminal Agentic Coding | 89 | Exact Match |
| AIME 24 \[[46](https://arxiv.org/html/2605.20876#bib.bib30)\] | Math Reasoning | 30 | Exact Match |
| AIME 25 \[[47](https://arxiv.org/html/2605.20876#bib.bib31)\] | Math Reasoning | 30 | Exact Match |
| DABench \[[10](https://arxiv.org/html/2605.20876#bib.bib28)\] | CSV Data Analysis | 500† | LLM-as-a-Judge |
| TableBench \[[39](https://arxiv.org/html/2605.20876#bib.bib29)\] | CSV Data Analysis | 500† | LLM-as-a-Judge |
| BIRD \[[12](https://arxiv.org/html/2605.20876#bib.bib38)\] | SQL Data Analysis | 500† | Exact Match |

To evaluate the capabilities of terminal agents across different domains, we conduct experiments on six benchmarks, as summarized in Table [7](https://arxiv.org/html/2605.20876#A2.T7). Since the latter five benchmarks were not originally designed for terminal agents, we convert them into executable terminal tasks following Nemotron-Terminal \[[27](https://arxiv.org/html/2605.20876#bib.bib24)\], where agents must solve the problems by manipulating files, running code, and interacting with the command line. For DABench and TableBench, we follow TaskCraft \[[31](https://arxiv.org/html/2605.20876#bib.bib20)\] to employ an LLM-as-a-Judge for evaluation.
- Terminal-Bench 2.0: Terminal-Bench 2.0 \[[22](https://arxiv.org/html/2605.20876#bib.bib27), [44](https://arxiv.org/html/2605.20876#bib.bib37)\] is a native terminal-agentic coding benchmark that assesses an agent's ability to navigate the command line, perform system administration, manipulate files, and write code in a real Bash environment. We evaluate on its 89 tasks, where success is measured by exact-match criteria that verify the final state of the environment.
- AIME 24/25: These benchmarks correspond to the problem sets from the 2024 and 2025 American Invitational Mathematics Examinations \[[46](https://arxiv.org/html/2605.20876#bib.bib30), [47](https://arxiv.org/html/2605.20876#bib.bib31)\]. Each dataset consists of 30 high-difficulty math problems that test advanced mathematical reasoning and creative problem-solving abilities. Within the terminal setting, agents write and execute Python scripts to compute the correct mathematical answers. We use exact match of the final extracted answer to measure accuracy.
- DABench: DABench \[[10](https://arxiv.org/html/2605.20876#bib.bib28)\] is a benchmark designed to evaluate data analysis capabilities. The tasks require the agent to load, process, and analyze data from CSV files to answer complex analytical queries. We convert these tasks into terminal environments where agents must write Python scripts to analyze the provided CSV files. We randomly sample a subset of 500 instances as the test set and evaluate the output using an LLM-as-a-Judge following TaskCraft \[[31](https://arxiv.org/html/2605.20876#bib.bib20)\].
- TableBench: Similar to DABench, TableBench \[[39](https://arxiv.org/html/2605.20876#bib.bib29)\] assesses an agent's ability to perform complex table-based reasoning and data manipulation. The tasks involve interpreting tabular data, filtering, joining, and aggregating information to answer specific questions. We randomly sample 500 instances for testing and evaluate the output using an LLM-as-a-Judge following TaskCraft \[[31](https://arxiv.org/html/2605.20876#bib.bib20)\].
- BIRD: BIRD \[[12](https://arxiv.org/html/2605.20876#bib.bib38)\] is a large-scale, complex SQL-based data analysis benchmark that evaluates text-to-SQL and database querying capabilities over real-world databases. We deploy the SQLite databases in the terminal environment. Agents must explore the database schema and write correct SQL queries to extract the required information. We randomly sample 500 instances for evaluation, and performance is measured by the exact match of the query execution results against the ground truth.
### B.2 Terminalization of Benchmarks
To evaluate the general\-purpose task\-solving capability of models in real\-world scenarios, we systematically converted several established domain\-specific benchmarks \(AIME, DABench, TableBench, and BIRD\) into fully executable terminal environments\.
This conversion places the models in an open-ended Bash sandbox. Instead of merely outputting reasoning steps or raw code in an isolated context, the agent must navigate the file system, read data files (e.g., CSV, SQLite), execute code or queries, debug based on terminal output, and finally save the result into a designated target file (e.g., `/app/answer.txt` or `/app/result.csv`).
Table [8](https://arxiv.org/html/2605.20876#A2.T8) illustrates how the original questions from each benchmark are wrapped with specific file paths and formatting constraints to form the final terminalized instructions.
Table 8: Examples of terminalized instructions. Original questions are wrapped with file paths, tool constraints, and output format requirements to form executable terminal tasks.

| Benchmark | Original Question | Terminalized Instruction |
| --- | --- | --- |
| AIME 24/25 | Every morning Aya goes for a 9-kilometer-long walk and stops at a coffee shop afterwards… Find the number of minutes the walk takes her. | Solve the following problem. Reason step by step, create the file `/app/answer.txt`, and put your final answer there. Your answer should be a single integer from 1 to 999 inclusive.<br>Every morning Aya goes for a 9-kilometer-long walk and stops at a coffee shop afterwards… Find the number of minutes the walk takes her. |
| DABench | Calculate the mean fare paid by the passengers. and there are some constraints: Calculate the mean fare using Python's built-in statistics module or appropriate statistical method in pandas. Rounding off the answer to two decimal places. | You are given a CSV data file located at `/app/data/test_ave.csv`. Analyze the data and answer the following question:<br>Calculate the mean fare paid by the passengers. and there are some constraints: Calculate the mean fare using Python's built-in statistics module or appropriate statistical method in pandas. Rounding off the answer to two decimal places.<br>Write your answer to `/app/answer.txt` using exactly these keys, one per line: `@mean_fare[value]` |
| TableBench | What is the average number of tropical cyclones per season? | You are given a CSV data file located at `/app/data/table_000000.csv`. Analyze the data and answer the following question:<br>What is the average number of tropical cyclones per season?<br>Write your answer to `/app/answer.txt` using exactly this key: `@answer[value]` |
| BIRD | Please list the lowest three eligible free rates for students aged 5-17 in continuation schools. | You are given a SQLite database at `/app/data/california_schools.sqlite`. Answer the following question by querying the database:<br>Please list the lowest three eligible free rates for students aged 5-17 in continuation schools.<br>Write a SQL query that answers the question, execute it, and save the result as a CSV file at `/app/result.csv`. The CSV file must include a header row with column names. |
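The wrapping shown in Table 8 amounts to filling a per-benchmark template. The sketch below reuses the instruction wording from Table 8 for three of the benchmarks; the function names, the template dictionary, and the `@key[value]` parser are assumptions for illustration, since the paper does not describe how answer files are read back.

```python
import re
from typing import Dict

# Instruction wording taken from Table 8; the dictionary layout is illustrative.
TEMPLATES = {
    "aime": (
        "Solve the following problem. Reason step by step, create the file "
        "/app/answer.txt, and put your final answer there. Your answer should "
        "be a single integer from 1 to 999 inclusive.\n{question}"
    ),
    "tablebench": (
        "You are given a CSV data file located at {data_path}. Analyze the data "
        "and answer the following question:\n{question}\n"
        "Write your answer to /app/answer.txt using exactly this key: @answer[value]"
    ),
    "bird": (
        "You are given a SQLite database at {data_path}. Answer the following "
        "question by querying the database:\n{question}\n"
        "Write a SQL query that answers the question, execute it, and save the "
        "result as a CSV file at /app/result.csv. The CSV file must include a "
        "header row with column names."
    ),
}


def terminalize(benchmark: str, question: str, data_path: str = "") -> str:
    """Wrap an original question into a terminal-task instruction."""
    return TEMPLATES[benchmark].format(question=question, data_path=data_path)


# Hypothetical reader for the "@key[value]" answer format.
ANSWER_RE = re.compile(r"@(\w+)\[(.*?)\]")


def parse_keyed_answer(text: str) -> Dict[str, str]:
    """Extract every @key[value] pair from the contents of an answer file."""
    return {key: value.strip() for key, value in ANSWER_RE.findall(text)}


instruction = terminalize(
    "tablebench",
    "What is the average number of tropical cyclones per season?",
    "/app/data/table_000000.csv",
)
print(instruction)
```

A parser of this kind would only normalize the agent's output. For DABench and TableBench the final correctness decision is still made by an LLM judge, as described in Appendix B.1.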
### B.3 Sampling Details
To ensure a fair and reproducible comparison across all evaluated models, we adopt the officially recommended sampling parameters released by the respective model providers whenever available. For models served through remote APIs, we follow the configurations specified in their official documentation. For models deployed locally, we use the SGLang inference framework \[[49](https://arxiv.org/html/2605.20876#bib.bib36)\] with the recommended decoding parameters from each model's technical report or model card. A complete summary of the sampling parameters used in our experiments is provided in Table [9](https://arxiv.org/html/2605.20876#A2.T9) for closed-source and large-scale open-source models served via API, and in Table [10](https://arxiv.org/html/2605.20876#A2.T10) for locally deployed models.
#### API\-based Models
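Table 9: Sampling parameters of API-based models used in our experiments.

| Model | Context | Temp. | Top-$p$ | Top-$k$ | Thinking | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini-3-Flash | 1,000,000 | 1.0 | – | – | ✓ | – |
| Gemini-3.1-Pro-Preview | 1,000,000 | 1.0 | – | – | ✓ | – |
| DeepSeek-V3.2 | 163,840 | 1.0 | – | – | ✓ | – |
| GLM-5 | 202,752 | 0.7 | 1.0 | – | ✓ | – |
| Kimi-K2-Thinking | 262,144 | 1.0 | – | – | ✓ | – |
| MiniMax-M2.7 | 204,800 | 1.0 | 0.95 | 40 | ✓ | – |
| Qwen3-Coder-480B | 262,144 | 0.7 | 0.8 | 20 | ✗ | `repetition_penalty=1.05` |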
For closed-source models (i.e., GPT-5.2, Claude-Sonnet-4.5, Gemini-3-Flash, and Gemini-3.1-Pro-Preview) and large-scale open-source models exceeding 100B parameters (i.e., DeepSeek-V3.2, GLM-5, Kimi-K2-Thinking, MiniMax-M2.7, and Qwen3-Coder-480B), we conduct inference through their official API endpoints. The detailed sampling configurations are summarized in Table [9](https://arxiv.org/html/2605.20876#A2.T9). For models that support explicit reasoning or thinking modes (e.g., DeepSeek-V3.2, Gemini-3-Flash, and Gemini-3.1-Pro-Preview), we enable the corresponding reasoning options to obtain their full reasoning capability.
#### Locally Deployed Models
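Table 10: Sampling parameters of locally deployed models served via SGLang.

| Model | Context | Temp. | Top-$p$ | Top-$k$ | Min-$p$ | Thinking | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B | 40,960 | 0.6 | 0.95 | 20 | 0.0 | ✓ | – |
| Qwen3-14B | 40,960 | 0.6 | 0.95 | 20 | 0.0 | ✓ | – |
| Qwen3-32B | 40,960 | 0.6 | 0.95 | 20 | 0.0 | ✓ | – |
| Nemotron-Terminal-8B | 40,960 | 0.6 | 0.95 | 20 | 0.0 | ✓ | – |
| Nemotron-Terminal-14B | 40,960 | 0.6 | 0.95 | 20 | 0.0 | ✓ | – |
| Nemotron-Terminal-32B | 40,960 | 0.6 | 0.95 | 20 | 0.0 | ✓ | – |
| TermiGen-32B | 32,768 | 0.7 | – | – | – | ✗ | – |
| TerminalTraj-32B | 32,768 | 0.7 | – | – | – | ✗ | – |
| GPT-OSS-20B | 131,072 | 1.0 | 1.0 | −1 | – | ✓ | `reasoning_effort=high` |
| GPT-OSS-120B | 131,072 | 1.0 | 1.0 | −1 | – | ✓ | `reasoning_effort=high` |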
All open-source models with fewer than 120B parameters are deployed locally using the SGLang framework \[[49](https://arxiv.org/html/2605.20876#bib.bib36)\]. For the Qwen3 series (8B/14B/32B) and the Nemotron-Terminal series (8B/14B/32B), we adopt the official recommended sampling configuration from the Qwen3 technical report \[[42](https://arxiv.org/html/2605.20876#bib.bib26)\], with thinking mode enabled to leverage the model's reasoning capability. For TermiGen-32B and TerminalTraj-32B, we follow their original released decoding configuration with a temperature of 0.7 and no additional sampling constraints. For the GPT-OSS series, we set `reasoning_effort=high` to enable the model's strongest reasoning behavior. The complete configurations are summarized in Table [10](https://arxiv.org/html/2605.20876#A2.T10).
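For reproduction scripts, the locally deployed configurations in Table 10 can be kept as plain data. The sketch below is one possible layout: the numeric values come from Table 10, while the grouping into families, the key names (`temperature`, `top_p`, `top_k`, `min_p`), and the helper `sampling_for` are illustrative assumptions. How these keys map onto a particular serving framework's request format should be checked against that framework's documentation.

```python
from typing import Any, Dict

# Values transcribed from Table 10; parameters shown as "–" there are omitted.
FAMILY_SAMPLING: Dict[str, Dict[str, Any]] = {
    "qwen3_recommended": {
        "context": 40960, "temperature": 0.6, "top_p": 0.95,
        "top_k": 20, "min_p": 0.0, "thinking": True,
    },
    "released_default": {
        "context": 32768, "temperature": 0.7, "thinking": False,
    },
    "gpt_oss": {
        "context": 131072, "temperature": 1.0, "top_p": 1.0,
        "top_k": -1, "thinking": True, "reasoning_effort": "high",
    },
}

MODEL_FAMILY = {
    "Qwen3-8B": "qwen3_recommended",
    "Qwen3-14B": "qwen3_recommended",
    "Qwen3-32B": "qwen3_recommended",
    "Nemotron-Terminal-8B": "qwen3_recommended",
    "Nemotron-Terminal-14B": "qwen3_recommended",
    "Nemotron-Terminal-32B": "qwen3_recommended",
    "TermiGen-32B": "released_default",
    "TerminalTraj-32B": "released_default",
    "GPT-OSS-20B": "gpt_oss",
    "GPT-OSS-120B": "gpt_oss",
}


def sampling_for(model: str) -> Dict[str, Any]:
    """Return a copy of the sampling configuration for a locally deployed model."""
    return dict(FAMILY_SAMPLING[MODEL_FAMILY[model]])


print(sampling_for("Qwen3-32B"))
```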
### B.4 Training Compute Details
Table 11: SFT hyperparameters for Terminal-World.

| Hyperparameter | Value |
| --- | --- |
| Training sequence length | 32,768 tokens |
| Global batch size | 32 |
| Training epochs | 2 |
| Optimizer | Adam |
| Peak learning rate | $2 \times 10^{-5}$ |
| LR schedule | Cosine decay |
| LR warmup | 10% of total steps |
| Minimum LR | $5 \times 10^{-7}$ |
| Weight decay | $1 \times 10^{-4}$ |
| Gradient clipping | 1.0 |
| Tensor parallel size | 4 |
| Pipeline parallel size | 4 |
| Sequence parallelism | Enabled |
| Random seed | 42 |

We fine-tune Terminal-World-32B using the hyperparameters summarized in Table [11](https://arxiv.org/html/2605.20876#A2.T11), which include a peak learning rate of $2 \times 10^{-5}$, a cosine decay schedule with a 10% warmup ratio, weight decay of $1 \times 10^{-4}$, gradient clipping of 1.0, and a training sequence length of 32,768 tokens over 2 epochs. Training is conducted on 4 nodes, each equipped with 8 NVIDIA H20 (141 GB) GPUs, for a total of 32 GPUs. Under this configuration, a full training run completes in approximately 80 hours.
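To make the schedule in Table 11 concrete, the sketch below evaluates a warmup-plus-cosine learning rate at a few steps and works out the compute budget. The peak rate, minimum rate, and warmup ratio come from Table 11. The linear shape of the warmup, the total step count of 1,000, and the assumption that GPUs factor as tensor-parallel × pipeline-parallel × data-parallel are illustrative choices that the paper does not state.

```python
import math

PEAK_LR = 2e-5       # Table 11
MIN_LR = 5e-7        # Table 11
WARMUP_RATIO = 0.10  # Table 11


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at `step` for warmup followed by cosine decay."""
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        # Assumption: warmup rises linearly from 0 to the peak rate.
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


TOTAL_STEPS = 1000  # hypothetical; the paper does not report the step count
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.3e}")

# Compute budget from the reported setup: 4 nodes x 8 GPUs for about 80 hours.
gpus = 4 * 8
gpu_hours = gpus * 80
# Assumes GPUs = tensor parallel (4) x pipeline parallel (4) x data parallel.
data_parallel = gpus // (4 * 4)
print(gpus, gpu_hours, data_parallel)
```

Under these assumptions the run corresponds to roughly 2,560 GPU-hours, with each model replica spanning 16 GPUs.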
## Appendix C Supplementary Analyses
### C.1 Error Analysis
To systematically investigate the failure modes of our model, we conducted an in-depth error analysis based on three independent evaluation runs of Terminal-World-32B on Terminal-Bench 2.0. By manually examining the execution trajectories of unsuccessful attempts, we identified and categorized the critical bottlenecks into four distinct error types, as illustrated in Table [12](https://arxiv.org/html/2605.20876#A3.T12).
Table 12: Error analysis of Terminal-World-32B on Terminal-Bench 2.0. Red highlights the critical error at each failure point.

| Error Type | Agent Trajectory (Failure Point) | Error Reason & Result |
| --- | --- | --- |
| (1) Context Window Overflow | `[winning-avg-corewars]`<br>[Step 21] System: Context limit reached. History compressed.<br>[Step 22] “…fails against stone (66 wins)…”<br>↪ writes new warrior `Omni`<br>[Step 23] System: Context limit reached. History compressed.<br>[Step 24] “Let’s try a new strategy…”<br>↪ writes the exact same warrior `Omni2` | Reason: Due to severe summarization, the model completely loses memory of its specific prior failed attempts, causing it to confidently repeat identical strategies without converging.<br>Result: AgentTimeout |
| (2) Execution Deadlock | `[dna-insert]`<br>[Step 29] “Terminal is stuck in a heredoc. I will send EOF.”<br>↪ `EOF\n`<br>[Step 30] “Still stuck. I will try Ctrl+C.”<br>↪ `C-c\n`<br>[Steps 31–50] Repeats the exact same reasoning and `\n` → `EOF` → `C-c` sequence 20 more times. | Reason: Agent falls into an endless execution loop. It correctly identifies the stuck terminal but fails to adapt its recovery strategy when the initial sequence fails.<br>Result: AgentTimeout |
| (3) Premature Completion | `[overfull-hbox]`<br>[Step 8] Fixes layout by replacing arbitrary words with shorter ones (ignoring the `synonyms.txt` constraint).<br>[Step 9, `task_complete=True`]<br>“Overfull hbox warnings eliminated (count=0). I replaced words from the provided list, which is allowed.” | Reason: Agent verified the terminal feedback (warning count = 0) but hallucinated its compliance with the strict sub-constraints (only using allowed synonyms), terminating prematurely.<br>Result: FAILED (modified illegal words) |
| (4) Task Substitution | `[gcode-to-text]`<br>[Step 3] Notices metadata: `; M486 AEmbossed text`<br>[Step 4] Instead of decoding G-code toolpaths, takes a shortcut:<br>↪ `grep -i "emboss" text.gcode`<br>[Step 12, `task_complete=True`]<br>“Successfully extracted the text: Embossed text” | Reason: Substituted the complex required task (geometric coordinate decoding) with a superficially similar but incorrect shortcut (metadata string search).<br>Result: FAILED (Expected `flag{...}`, got `Embossed text`) |

(1) Context Window Overflow. In tasks requiring extensive trial-and-error, the accumulated terminal outputs quickly exceed the model's context limit, triggering history summarization. Consequently, the agent loses fine-grained memory of its prior actions. This amnesia causes the model to confidently propose supposedly “new” solutions that are, in fact, logically identical to previously failed attempts, ultimately leading to execution timeouts without algorithmic convergence.
(2) Execution Deadlock. This error occurs when the agent correctly identifies an abnormal terminal state (e.g., being stuck in an interactive prompt or a heredoc) but fails to adapt its recovery strategy. Instead of exploring alternative escape mechanisms after an initial failure, the model falls into a deterministic loop, repeatedly issuing the exact same sequence of interruption commands (such as EOF and Ctrl+C) until the environment times out.
(3) Premature Completion. Agents frequently terminate tasks prematurely by over-relying on superficial environmental feedback. In these cases, the model successfully resolves the primary programmatic trigger (e.g., eliminating compiler warnings) but hallucinates its compliance with implicit or secondary constraints (e.g., restricting vocabulary to a provided synonym list). Consequently, the agent confidently declares the task complete without rigorously verifying the semantic correctness of its modifications.
(4) Task Substitution. Faced with computationally or logically complex objectives, such as geometric coordinate decoding, the model occasionally attempts to bypass the intended procedure. It substitutes the required rigorous analytical process with a superficial heuristic, such as applying simple string matching (`grep`) to extract metadata. While this behavior creates the illusion of progress, it fundamentally circumvents the core requirements of the task.
### C.2 Semantic Correctness of Failure Trajectories
To further understand why suppressing failure trajectories hurts performance, we conduct an additional analysis on the semantic correctness of unsuccessful trajectories\. Specifically, we randomly sample 300 trajectories that are labeled as failed by the original execution pipeline\. For each trajectory, we provide four independent judge models \(Gemini\-3\-Flash, GPT\-4\.1, GLM\-5, and Doubao\-2\.0\-Pro\) with the task instruction and the complete trajectory, and ask each judge to determine whether the task has actually been completed\.
Table 13: Multi-judge agreement for semantic correctness of verifier-failed trajectories. We report judgments from Gemini-3-Flash, GPT-4.1, GLM-5, and Doubao-2.0-Pro on 300 trajectories labeled as failed by the execution verifier. Completed denotes the rate of verifier-failed trajectories judged as task-completing. Maj. Completed denotes the majority-vote completed rate. The 95% confidence interval is computed over majority-vote labels, and Agreement reports Fleiss' $\kappa$ among the four judges.

| Trajectory Set | Completed Rate (Gemini-3-Flash) | Completed Rate (GPT-4.1) | Completed Rate (GLM-5) | Completed Rate (Doubao-2.0-Pro) | Maj. Completed | 95% CI | Agreement |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All Failed Trajectories | 0.703 | 0.660 | 0.633 | 0.687 | 0.677 | [0.622, 0.728] | $\kappa = 0.742$ |
As shown in Table [13](https://arxiv.org/html/2605.20876#A3.T13), the individual completion rates across the four judges range from 63.3% to 70.3%, and the majority-vote completion rate is 67.7% (95% CI: [0.622, 0.728]). The inter-judge agreement is substantial (Fleiss' $\kappa = 0.742$), confirming that the finding is consistent across models and not an artifact of any single evaluator. This high semantic completion rate carries an important implication about the step-level quality of these trajectories. A trajectory that is holistically judged as task-completing must, by necessity, consist predominantly of correct reasoning steps and valid tool-use actions, since an LLM judge would not deem a task completed if most intermediate steps were erroneous. Consequently, the failure label assigned by the execution verifier does not imply that the trajectory is wrong throughout. Instead, it reflects a narrow execution-level discrepancy at a small number of critical steps, while the vast majority of the trajectory remains semantically correct. Applying a uniform negative SFT loss to the entire failed trajectory therefore indiscriminately penalizes these correct steps together with the genuinely erroneous ones, which explains why failure-trajectory suppression degrades performance beyond simply excluding them.
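The two statistics reported above, the majority-vote completed rate and Fleiss' $\kappa$, can be computed as in the following sketch. It uses the standard Fleiss' $\kappa$ formula on a small made-up vote matrix, not the paper's 300 trajectories. With four judges a 2-2 tie is possible; the paper does not say how ties are resolved, so treating a tie as "not completed" is an assumption of this example.

```python
from typing import List, Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa from an items x categories matrix of rater counts.

    Every row must sum to the same number of raters. The value is undefined
    (division by zero) when all ratings fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)
    return (p_bar - p_e) / (1.0 - p_e)


def to_counts(votes: List[List[bool]]) -> List[List[int]]:
    """Convert per-judge boolean votes into [completed, not completed] counts."""
    return [[sum(v), len(v) - sum(v)] for v in votes]


def majority_completed(votes: List[List[bool]], tie_value: bool = False) -> List[bool]:
    """Majority label per trajectory; ties fall back to `tie_value` (assumption)."""
    labels = []
    for v in votes:
        yes = sum(v)
        no = len(v) - yes
        labels.append(tie_value if yes == no else yes > no)
    return labels


# Made-up votes from four judges on five trajectories (True = task completed).
toy_votes = [
    [True, True, True, True],
    [True, True, True, False],
    [False, False, False, False],
    [True, False, False, False],
    [True, True, False, False],
]
labels = majority_completed(toy_votes)
print(sum(labels) / len(labels), fleiss_kappa(to_counts(toy_votes)))
```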
### C.3 Difficulty of Synthesized Tasks
Table 14: Difficulty of terminal environments. We compare different terminal environments and report Pass@1 with DeepSeek-V3.2.

| Environment | Pass@1 | $\Delta_{\text{rel}}^{\%}$ |
| --- | --- | --- |
| Terminal-Bench 2.0 | 38.2 | – |
| Terminal-World | 39.8 | +4.2% |
| Nemotron-Terminal-Corpus | 79.8 | +108.9% |
| TermiGen | 59.0 | +54.5% |
| Endless-Terminal | 58.1 | +52.1% |

To further examine the quality of Terminal-World and investigate why Terminal-World achieves stronger performance with substantially less training data, we conduct an in-depth analysis of its task quality. Specifically, we use the same agent configuration (i.e., DeepSeek-V3.2 with the Terminus 2 scaffolding) to run experiments on four terminal-oriented datasets and compute the corresponding pass rates. We exclude Terminal-Traj from this comparison because it does not provide publicly accessible Docker images for reproducing its environments. As shown in Table [14](https://arxiv.org/html/2605.20876#A3.T14), DeepSeek-V3.2 achieves a pass rate of only 39.8% on Terminal-World, which is just 4.2% higher in relative terms than its 38.2% on Terminal-Bench 2.0. In contrast, its pass rates on the other three terminal environments all exceed 50%, with Nemotron-Terminal-Corpus reaching 79.8%. These results suggest that the environments synthesized by Terminal-World are substantially more challenging. Consequently, the collected execution trajectories contain higher-quality supervision signals, making them more effective for improving the model's terminal-task solving capabilities.
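The relative differences in Table 14 follow from the Pass@1 values, taking Terminal-Bench 2.0 as the reference. A short check, with variable and function names chosen for illustration:

```python
REFERENCE_PASS_AT_1 = 38.2  # Terminal-Bench 2.0, Table 14

PASS_AT_1 = {
    "Terminal-World": 39.8,
    "Nemotron-Terminal-Corpus": 79.8,
    "TermiGen": 59.0,
    "Endless-Terminal": 58.1,
}


def relative_delta(value: float, reference: float) -> float:
    """Relative change of `value` with respect to `reference`, in percent."""
    return (value - reference) / reference * 100.0


deltas = {
    name: round(relative_delta(score, REFERENCE_PASS_AT_1), 1)
    for name, score in PASS_AT_1.items()
}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

For Terminal-World this gives (39.8 − 38.2) / 38.2 ≈ 4.2%, matching the table.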
### C.4 Quality of Task and Environment
To directly examine the intrinsic quality of synthesized terminal-agent data, we evaluate each dataset along four dimensions: ① terminal nativeness, ② environment-task consistency, ③ environment quality, and ④ verifier robustness. To reduce evaluator bias, we use Claude Code as the agent framework and employ three judge models, Claude Sonnet 4.5, Kimi K2.5, and GLM-5. For each example, the agent is allowed to freely inspect and explore the corresponding environment before assigning scores for the four dimensions.
Table 15: Quality assessment of terminal-agent datasets across four dimensions. We report scores from three judge models: Claude Sonnet 4.5 (C), Kimi K2.5 (K), and GLM-5 (G).

| Dataset | Terminal Nativeness (C) | (K) | (G) | Avg. | Env-Task Consistency (C) | (K) | (G) | Avg. | Env Quality (C) | (K) | (G) | Avg. | Verifier Robustness (C) | (K) | (G) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Terminal-World | 2.72 | 2.71 | 2.64 | 2.69 | 2.99 | 2.94 | 2.99 | 2.97 | 2.77 | 2.86 | 3.00 | 2.88 | 2.97 | 2.71 | 3.00 | 2.92 |
| Endless-Terminal | 2.10 | 2.32 | 2.54 | 2.32 | 2.90 | 2.94 | 2.89 | 2.91 | 2.93 | 2.96 | 2.74 | 2.88 | 2.81 | 2.98 | 2.96 | 2.92 |
| TermiGen | 1.93 | 2.07 | 1.99 | 2.00 | 2.73 | 2.45 | 2.59 | 2.59 | 2.44 | 2.44 | 2.75 | 2.54 | 2.24 | 2.34 | 2.42 | 2.33 |
| Terminal-Traj | 1.91 | 2.12 | 2.22 | 2.08 | 1.12 | 1.10 | 1.06 | 1.09 | 1.01 | 1.03 | 1.05 | 1.03 | 1.80 | 1.84 | 2.15 | 1.93 |
As shown in Table [15](https://arxiv.org/html/2605.20876#A3.T15), Terminal-World achieves the highest average score on terminal nativeness and environment-task consistency, and matches the best average (Endless-Terminal) on environment quality and verifier robustness, although individual judges occasionally rate Endless-Terminal higher on the latter two dimensions. In particular, the high scores on both environment-task consistency and environment quality indicate that the synthesized task instructions are well aligned with their executable environments, rather than being valid only in isolation. These results suggest that our skill-driven synthesis pipeline can construct high-quality terminal tasks and environments with strong task-environment alignment.
### C.5 Effect of Execution Guidelines
Table 16: Effect of execution guidelines. We compare teacher trajectories collected with and without skill-derived execution guidelines using DeepSeek-V3.2 on 500 sampled tasks.

| Setting | Success | $\Delta_{\text{rel}}^{\%}$ | Avg. Steps | $\Delta_{\text{rel}}^{\%}$ |
| --- | --- | --- | --- | --- |
| w/ guideline | 39.6 | – | 12.66 | – |
| w/o guideline | 27.6 | -30.3% | 14.27 | +12.7% |

To examine whether execution guidelines improve teacher trajectory collection, we randomly sample 500 tasks from Terminal-World and collect two sets of trajectories with the same DeepSeek-V3.2 teacher, holding all other settings fixed. In the w/ guideline setting, the teacher receives the skill-derived execution guideline $G$ as input; in the w/o guideline setting, the teacher must infer the solution path from the task alone. We evaluate trajectory efficiency using two metrics: task success rate and the average number of steps to completion, computed only over successful trajectories.
As shown in Table [16](https://arxiv.org/html/2605.20876#A3.T16), execution guidelines substantially improve both reliability and efficiency. With guidelines, the teacher achieves a success rate of 39.6%, compared with 27.6% without guidelines, a 30.3% relative drop when guidelines are removed. Moreover, successful trajectories require only 12.66 steps on average with guidelines, compared with 14.27 steps without guidelines. These results show that skill-derived guidelines help the teacher avoid redundant exploration and produce more concise successful demonstrations. This supports our design choice of using $G$ during trajectory collection while removing it from SFT inputs, enabling the student to learn from efficient demonstrations without depending on guideline information at inference time.
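The relative changes in Table 16 can be reproduced with a few lines of arithmetic. The sketch below assumes that $\Delta_{\text{rel}}$ is the change of the w/o guideline value relative to the w/ guideline value; the helper name `relative_change` is ours, not the paper's.

```python
def relative_change(baseline: float, value: float) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Numbers reported in Table 16; the w/ guideline setting is the baseline.
success_with, success_without = 39.6, 27.6
steps_with, steps_without = 12.66, 14.27

success_delta = relative_change(success_with, success_without)
steps_delta = relative_change(steps_with, steps_without)

print(f"Success rate: {success_delta:+.1f}%")  # -30.3%
print(f"Avg. steps:   {steps_delta:+.1f}%")    # +12.7%
```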
### C.6 LLM-as-a-Judge Consistency Analysis
Our pipeline relies on LLM\-as\-a\-Judge at two distinct stages: task quality filtering \(Sec\.[3\.2](https://arxiv.org/html/2605.20876#S3.SS2)\) and evaluation on DABench and TableBench \(Sec\.[4](https://arxiv.org/html/2605.20876#S4); Appendix[B\.1](https://arxiv.org/html/2605.20876#A2.SS1)\)\. To verify the reliability of these judgments, we conduct multi\-judge consistency analyses using four independent judge models: Gemini\-3\-Flash, GPT\-4\.1, GLM\-5, and Doubao\-2\.0\-Pro, and additionally report human evaluation on randomly sampled subsets\.
#### Task Quality Filtering\.
Table [17](https://arxiv.org/html/2605.20876#A3.T17) reports inter-judge agreement for the five filtering criteria used in task quality filtering. We randomly sample 500 valid generated task specifications and ask each judge model to assign scores on a 0–5 scale. We further annotate a random subset of 200 samples with three independent human annotators. The two-way random-effects intraclass correlation coefficient ICC(2,1) ranges from 0.708 to 0.812 across criteria, with an overall ICC of 0.770, indicating substantial agreement among the four judge models. The human scores are also close to the model-judge averages, supporting the reliability of the filtering procedure.
Table 17: Multi-judge agreement for task quality filtering. Each judge scores the five filtering criteria used in Sec. 3.2 on a 0–5 scale across 500 randomly sampled task specifications, where higher scores indicate better quality. Avg. denotes the average score across the four judges. ICC(2,1) reports the two-way random-effects absolute-agreement intraclass correlation coefficient among the four judges. Human reports scores from three independent human annotators averaged over a random subset of 200 samples.

| Filtering Criterion | Gemini-3-Flash | GPT-4.1 | GLM-5 | Doubao-2.0-Pro | Avg. | ICC(2,1) |
|---|---|---|---|---|---|---|
| Instruction Quality | 4.62 | 4.40 | 4.28 | 4.55 | 4.46 | 0.812 |
| Closed-World Solvability | 4.38 | 4.18 | 4.05 | 4.30 | 4.23 | 0.784 |
| Blueprint Completeness | 4.15 | 3.92 | 3.78 | 4.08 | 3.98 | 0.741 |
| Guideline Quality | 4.05 | 3.80 | 3.65 | 3.95 | 3.86 | 0.708 |
| Evaluation Criteria Quality | 4.52 | 4.30 | 4.18 | 4.42 | 4.36 | 0.795 |
| Overall | 4.34 | 4.12 | 3.99 | 4.26 | 4.18 | 0.770 |
| Human (200 samples) | 4.27 | 4.13 | 3.94 | 4.21 | 4.14 | 0.763 |
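For readers unfamiliar with ICC(2,1), the following is a minimal pure-Python sketch of the standard two-way random-effects, absolute-agreement, single-rater formula computed from ANOVA mean squares. It is not the authors' evaluation code; the function name `icc_2_1` and the toy rating matrix are illustrative assumptions.

```python
from typing import List


def icc_2_1(ratings: List[List[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings[i][j] is the score that judge j assigned to sample i.
    """
    n = len(ratings)      # number of rated samples
    k = len(ratings[0])   # number of judges
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between-sample mean square
    msc = ss_cols / (k - 1)             # between-judge mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy matrix: 6 samples scored by 4 judges (illustrative values only).
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(toy), 3))  # 0.29
```

Because the absolute-agreement form penalizes systematic offsets between judges through the `msc` term, a judge that is consistently stricter than the others lowers the coefficient even when rankings agree.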
#### LLM\-as\-a\-Judge on DABench and TableBench\.
Table [18](https://arxiv.org/html/2605.20876#A3.T18) reports per-judge Pass@1 scores and majority-vote results on the full 500-instance DABench and TableBench test sets (Sec. [4](https://arxiv.org/html/2605.20876#S4); Appendix [B.1](https://arxiv.org/html/2605.20876#A2.SS1)) for three 32B models, together with human evaluation on a random subset of 200 instances per benchmark. Fleiss' $\kappa$ among the three audit judges (GPT-4.1, GLM-5, and Doubao-2.0-Pro) is 0.873 and 0.881 on the two benchmarks, respectively, indicating near-perfect agreement. The majority-vote results remain well aligned with both the individual judge scores and the human evaluation, confirming that the LLM-as-a-Judge evaluation is reliable.
Table 18: Multi-judge agreement for LLM-as-a-Judge evaluation on DABench and TableBench. We report Pass@1 judging results from Gemini-3-Flash, GPT-4.1, GLM-5, and Doubao-2.0-Pro on three 32B models. Maj. denotes majority-vote accuracy computed from the audit judges GPT, GLM, and Doubao. Avg. Maj. averages the majority-vote accuracies on DABench and TableBench. Fleiss' $\kappa$ among the audit judges is 0.873 for DABench and 0.881 for TableBench. Human reports Pass@1 scores from three independent human annotators on a random subset of 200 samples per benchmark.

| Model | DABench: Gemini-3-Flash | DABench: GPT-4.1 | DABench: GLM-5 | DABench: Doubao-2.0-Pro | DABench: Maj. | DABench: Human | TableBench: Gemini-3-Flash | TableBench: GPT-4.1 | TableBench: GLM-5 | TableBench: Doubao-2.0-Pro | TableBench: Maj. | TableBench: Human | Avg. Maj. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-32B | 44.4 | 42.0 | 43.0 | 44.6 | 43.4 | 42.0 | 26.4 | 24.0 | 25.0 | 26.0 | 25.2 | 23.5 | 34.3 |
| Nemotron-Terminal-32B | 81.6 | 78.4 | 80.0 | 82.0 | 80.4 | 78.5 | 72.4 | 70.0 | 71.6 | 73.0 | 71.8 | 70.0 | 76.1 |
| Terminal-World-32B | 83.6 | 81.0 | 82.4 | 84.0 | 82.8 | 81.0 | 71.6 | 69.6 | 70.4 | 72.0 | 70.8 | 68.5 | 76.8 |
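The two aggregate statistics in Table 18 can be sketched as follows, assuming each audit judge emits a binary correct/incorrect verdict per instance. The functions `majority_vote` and `fleiss_kappa` and the toy verdicts are illustrative assumptions rather than the paper's evaluation code.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """True when more than half of the judges mark the answer correct."""
    return sum(votes) * 2 > len(votes)


def fleiss_kappa(votes_per_item: List[Sequence[bool]]) -> float:
    """Fleiss' kappa for items that are each rated by the same number of judges."""
    n_items = len(votes_per_item)
    n_raters = len(votes_per_item[0])
    categories = (True, False)

    # counts[i][c]: how many judges assigned category c to item i
    counts = [Counter(votes) for votes in votes_per_item]

    # Mean observed agreement across items
    p_bar = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items

    # Chance agreement from the marginal category proportions
    p_e = sum(
        (sum(c[cat] for c in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    return (p_bar - p_e) / (1 - p_e)


# Toy verdicts from three judges on four instances (illustrative only).
toy = [
    (True, True, True),
    (False, False, False),
    (True, True, False),
    (False, False, False),
]
accuracy = sum(majority_vote(v) for v in toy) / len(toy)
print(f"majority-vote accuracy: {accuracy:.1%}")  # 50.0%
print(f"Fleiss' kappa: {fleiss_kappa(toy):.3f}")  # 0.657
```

Note that the sketch divides by `1 - p_e`, so it is undefined when every judge assigns the same label to every item.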
## Appendix D Prompt Templates
Prompt for Task Generation (Core Goal)

    You are creating realistic Linux-terminal tasks for training a terminal AI agent. Each task should be synthesized from two inputs: an Agent Skill and a Persona. The Agent Skill defines the core kind of work the task should require, including the underlying capability, workflow, and typical failure modes. The Persona provides the real-world setting in which that work would naturally arise, shaping the domain context, motivation, and tone of the request.

    Your job is to combine these two inputs into a single self-contained terminal task that feels like a genuine piece of work someone would ask an autonomous coding agent to complete. The task should stay faithful to the core mechanics of the Agent Skill, while using the Persona to make the scenario specific, realistic, and grounded.

    # TASK ENVIRONMENT
    Tasks run in an isolated Debian 13 (trixie) container and must be solvable entirely via bash.
    Pre-installed tools include:
    - Python 3.12, pip 25 · Node.js 20, npm 10 · Java 8 (OpenJDK)
    - gcc/g++ 14, make, git, curl, wget, tmux
    - apt-get for additional packages
    - Working directory: /app (subdirs: /output, /logs, /tests, /solution)

    # INSTRUCTIONS

    ## 1. Relevance Check
    Judge whether the Persona and Agent Skill are meaningfully related enough to produce a realistic task.
    - If the pair is clearly mismatched, set `pair_relevance` to "unrelated", give a concrete reason, set `task_title` to "UNRELATED_PAIR", and leave all other content fields empty.
    - If there is a plausible real-world connection, set `pair_relevance` to "related" and generate the full task below.

    ## 2. Instruction
    The "instruction" field is the exact prompt the agent will see. It should describe a self-contained terminal task grounded in the Agent Skill. Aim for a task that is challenging yet solvable in the sandbox, and whose success can be verified through observable outputs.

    ## 3. Initial Files
    Each entry in "initial_files" must include:
    - generation_mode: "llm_direct" | "local_tool" | "remote_fetch"
    - description: A complete reproduction spec -- file format, internal structure, scale, 2-3 concrete example values, and any deliberate anomalies the agent must handle.

    ## 4. Setup Steps
    An ordered list of environment preparation steps (natural language, not shell commands). Use [] if no extra setup is needed.

    ## 5. Evaluation Criteria
    Each criterion must be precise enough to translate directly into a pytest assertion -- include exact file paths, key names, value thresholds, and expected formats.

    ## 6. Output Format
    Output STRICTLY as a JSON object. Do not include markdown fences or text outside the JSON.

    # INPUTS
    ## Agent Skill
    {skill}

    ## Persona
    {persona}
Prompt for Guideline Generation

    You are generating an execution guideline for a terminal-agent SFT data synthesis pipeline. The guideline guides the agent through the task step-by-step.

    # REQUIREMENTS

    Each step must be:
    - Actionable: include the specific command, file path, or edit
    - Verifiable: say how to confirm it worked
    - Ordered: respect task dependencies

    Extract concrete steps from the Agent Skill's SOP when available. Prefix critical caveats with "IMPORTANT:" or "WARNING:". Do not leak final file content or complete solutions.

    Each step should be written as a single string in this style:
    "Step N: <action> -- <exact command or edit> -- <verification or warning>"

    # GOOD EXAMPLES

    - "Step 1: Benchmark current state -- Run 'du -sh public/images/' and 'find public/images -type f -name "*.png" | wc -l' to measure baseline."
    - "Step 2: Update schema.prisma generator -- Change provider from 'prisma-client-js' to 'prisma-client', add output = '../src/generated/prisma'."
    - "Step 3: Run 'bunx prisma generate' and verify generated client appears at packages/db/src/generated/prisma/."
    - "Step 4: IMPORTANT: Verify @prisma/client remains in package.json dependencies (do NOT remove -- needed at runtime)."

    Avoid vague steps like "Inspect the project" or "Fix any errors" that lack specific commands or targets.

    # OUTPUT FORMAT

    Return STRICTLY as a JSON object:
    {"guideline": ["Step 1: ...", "Step 2: ..."]}

    Do not include markdown fences, explanation, or any text outside the JSON.

    # INPUTS

    ## Skill
    {skill}

    ## Core Goal
    {core_goal_json}
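A reply to the guideline prompt has to be a bare JSON object with a `guideline` list of "Step N: ..." strings. The following is a minimal sketch of how such a reply could be validated before use. The function `parse_guideline` and the requirement that steps be numbered consecutively from 1 are our assumptions; the paper does not describe its own validation code.

```python
import json
import re
from typing import List

# Each entry is expected to start with "Step <number>: " followed by text.
STEP_PATTERN = re.compile(r"^Step (\d+): \S")


def parse_guideline(raw: str) -> List[str]:
    """Parse a guideline reply and check that its steps are well formed."""
    payload = json.loads(raw)  # raises ValueError on fences or stray text
    if not isinstance(payload, dict) or "guideline" not in payload:
        raise ValueError("reply must be an object with a 'guideline' key")
    steps = payload["guideline"]
    if not isinstance(steps, list) or not steps:
        raise ValueError("'guideline' must be a non-empty list")
    for expected, step in enumerate(steps, start=1):
        match = STEP_PATTERN.match(step) if isinstance(step, str) else None
        # Assumption: steps are numbered consecutively starting at 1.
        if match is None or int(match.group(1)) != expected:
            raise ValueError(f"step {expected} is malformed or out of order: {step!r}")
    return steps


reply = json.dumps({
    "guideline": [
        "Step 1: Benchmark current state -- run 'du -sh public/images/' -- note the size",
        "Step 2: Compress images -- run the optimizer -- compare against the baseline",
    ]
})
print(len(parse_guideline(reply)))  # 2
```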
Prompt for Task Quality Judgment

    You are an expert quality evaluator for bash terminal tasks. Your job is to evaluate the quality of the task specification across five dimensions (0-5 each) and justify each score.

    # INPUT

    ## Persona (the user role that motivates this task)
    {persona_text}

    ## Skill (the technical capability being exercised)
    {skill_text}

    ## Generated Goal (the full task specification to evaluate)
    {goal_json}

    # IMPORTANT JUDGING PRINCIPLES

    - Judge for realism, task quality, and training value -- not for rigid formatting.
    - A concise, natural instruction can receive a high score.
    - Do NOT require a fixed opener, numbered list, or a "Requirements:" section.
    - High scores should reflect realistic task framing, clear task intent, strong skill alignment, solvability, and verifiability.

    # EVALUATION DIMENSIONS

    ## Dim 1 -- Instruction Quality (instruction_quality)
    Is the instruction realistic, skill-aligned, actionable, and clear enough for the agent to start the right kind of work?
      5 - Realistic, strongly skill-aligned, clear goal and success condition.
      4 - Strong and usable; slightly synthetic or less precise than ideal.
      3 - Executable but generic or partially under-specified.
      2 - Weakly aligned with the skill, artificial, or too vague.
      1 - Barely actionable or missing the task's core objective.
      0 - Self-contradictory, incoherent, or not a workable instruction.

    ## Dim 2 -- Solvable & Closed-World (solvable_closed_world)
    Can the task be completed entirely inside an isolated container without internet access, private credentials, or privileged operations?
      5 - Strictly closed-world; all dependencies and data are in the blueprint.
      4 - Essentially closed-world; a few installable deps; no external data needed.
      3 - Likely closed-world; some unclear details can reasonably be filled in.
      2 - Significant external-dependency risk or critical data gaps.
      1 - Highly unsolvable -- critical inputs missing or strong external dependencies.
      0 - Definitively unsolvable or contradicts environment constraints.

    ## Dim 3 -- Blueprint Completeness (blueprint_completeness)
    Is the environment blueprint specific enough to construct the environment deterministically?
      5 - All 5 categories complete: filesystem, data schema, deps, entrypoints, validation.
      4 - At least 4 of 5 categories covered; minor details missing.
      3 - Buildable but 1-2 of schema/versions/validation are missing.
      2 - Too abstract; builder must guess extensively.
      1 - Nearly unbuildable environment.
      0 - Empty blueprint or clearly inconsistent with the instruction.

    ## Dim 4 -- Guideline Quality (guideline_quality)
    Does the guideline drive a realistic terminal trace with correct ordering, appropriate granularity, checkpoints, and SOP coverage -- without spoiling the final answer?
      5 - Captures skill workflow well; sensible ordering, useful checkpoints, no spoilers.
      4 - Strong and useful; slightly coarse or sparse on checkpoints.
      3 - Usable but generic; some workflow value, incomplete ordering or checkpoints.
      2 - Too abstract to guide execution, or too script-like and overprescriptive.
      1 - Does not match instruction/blueprint, or steps are disordered.
      0 - No guideline or completely unusable.

    ## Dim 5 -- Evaluation Criteria Quality (evaluation_criteria_quality)
    Do the evaluation criteria clearly define how task completion will be assessed? Can they be translated into reliable black-box pytest checks without guessing?
      5 - Explicit, outcome-focused, concrete, and directly translatable to pytest.
      4 - Mostly strong; minor ambiguity but overall reliably judgeable.
      3 - Partially useful but incomplete or vague on at least one important condition.
      2 - Weak or underspecified; difficult to translate into stable assertions.
      1 - Mostly unusable; major conditions missing, vague, or subjective.
      0 - Absent, contradictory, or cannot assess task completion at all.

    # OUTPUT FORMAT

    Respond with ONLY a JSON object -- no markdown fences, no explanation outside the JSON.

    {
      "instruction_quality":         {"score": <0-5>, "reason": "..."},
      "solvable_closed_world":       {"score": <0-5>, "reason": "..."},
      "blueprint_completeness":      {"score": <0-5>, "reason": "..."},
      "guideline_quality":           {"score": <0-5>, "reason": "..."},
      "evaluation_criteria_quality": {"score": <0-5>, "reason": "..."}
    }
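The judge's reply is a JSON object with one `{"score", "reason"}` entry per dimension, so a filtering step could consume it roughly as sketched below. The five keys come from the prompt. The function names and the thresholds `min_score` and `min_mean` are hypothetical, since this appendix does not state the acceptance rule used in Sec. 3.2.

```python
import json
from statistics import mean
from typing import Dict

# Dimension keys taken from the judge prompt's output format.
DIMENSIONS = (
    "instruction_quality",
    "solvable_closed_world",
    "blueprint_completeness",
    "guideline_quality",
    "evaluation_criteria_quality",
)


def parse_judgment(raw: str) -> Dict[str, int]:
    """Extract the five integer scores (0-5) from a judge reply."""
    payload = json.loads(raw)
    scores: Dict[str, int] = {}
    for dim in DIMENSIONS:
        try:
            score = payload[dim]["score"]
        except (KeyError, TypeError) as exc:
            raise ValueError(f"missing score for {dim}") from exc
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 5:
            raise ValueError(f"invalid score for {dim}: {score!r}")
        scores[dim] = score
    return scores


def passes_filter(scores: Dict[str, int], min_score: int = 3, min_mean: float = 4.0) -> bool:
    """Hypothetical acceptance rule: no weak dimension and a high average."""
    return min(scores.values()) >= min_score and mean(scores.values()) >= min_mean


reply = json.dumps({dim: {"score": s, "reason": "..."}
                    for dim, s in zip(DIMENSIONS, (5, 4, 4, 3, 5))})
scores = parse_judgment(reply)
print(passes_filter(scores))  # True
```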
Prompt for File Generation (llm_direct mode)

    You are an expert environment builder for AI coding agent benchmarks.

    Given a task description, environment blueprint, and a target file to generate,
    create the actual file content for that specific file.

    # Task Instruction
    {instruction}

    # Environment Blueprint
    {blueprint}

    # Target File to Generate
    {target_file}

    # Previously Generated Files
    {previous_files}

    # Instructions
    Generate the content for the target file based on:
    1. The file's description and filepath
    2. The task instruction (what the agent needs to accomplish)
    3. The overall environment context (system requirements, pre-setup)
    4. Previously generated files (maintain consistency with imports, APIs, etc.)

    You MUST end your response with a ```json fenced code block -- this block is required.
    The JSON object must have exactly two fields:
    - "filepath": the exact path from the target file
    - "content": the full file content as a string
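Since the llm_direct prompt requires the reply to end with a json fenced block holding exactly `filepath` and `content`, a consumer needs to locate that trailing block and validate it. The sketch below shows one way to do this; `extract_file` and its backward search over candidate openers are our assumptions, not the paper's implementation.

```python
import json
from typing import Dict

FENCE = "`" * 3  # built programmatically so this example contains no literal fence


def extract_file(response: str, expected_path: str) -> Dict[str, str]:
    """Return the {"filepath", "content"} object from the trailing json block."""
    text = response.rstrip()
    if not text.endswith(FENCE):
        raise ValueError("response does not end with a fenced block")
    body_end = len(text) - len(FENCE)
    opener = FENCE + "json"

    # Try candidate openers from last to first; the file content itself may
    # contain the opener string, so keep going until the body parses as JSON.
    start = text.rfind(opener, 0, body_end)
    while start != -1:
        try:
            payload = json.loads(text[start + len(opener):body_end])
            break
        except json.JSONDecodeError:
            start = text.rfind(opener, 0, start)
    else:
        raise ValueError("no parseable json block at the end of the response")

    if not isinstance(payload, dict) or set(payload) != {"filepath", "content"}:
        raise ValueError("block must have exactly 'filepath' and 'content'")
    if payload["filepath"] != expected_path:
        raise ValueError("filepath does not match the requested target file")
    return payload


body = json.dumps({"filepath": "/app/data.csv", "content": "a,b\n1,2\n"})
response = "Here is the file.\n" + FENCE + "json\n" + body + "\n" + FENCE + "\n"
print(extract_file(response, "/app/data.csv")["content"])
```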
Prompt for File Generation (local_tool mode)

    [System]
    You are a specialized local artifact generation agent running inside a Linux sandbox.

    You must respond with exactly one required function call and nothing else.

    Available tool:
      python: {"target_filepath": string, "code": string, "timeout_sec": integer (optional)}

    Rules:
    - Only use the python tool.
    - target_filepath must exactly match the requested target path.
    - Use Python to create or repair the target artifact.
    - Keep working until the tool observation reports a valid artifact.

    [User]
    Target file: {target_filepath}
    Description: {file_description}

    Task context (for reference only -- do NOT solve the whole task, only create the target artifact):
    {instruction_summary}
Prompt for File Generation (remote_fetch mode)

    [System]
    You are a specialized stateless remote fetch agent running inside a Linux sandbox.

    You must respond with exactly one required function call and nothing else.

    Available tools:
      web_search:    {"query": string, "top_k": integer (optional), "domain_hint": string (optional)}
      fetch_page:    {"url": string, "mode": "http"|"dynamic"|"stealth", "timeout_ms": integer (optional)}
      download_file: {"url": string, "save_as": string, "timeout_ms": integer (optional)}

    Rules:
    - Stateless: do not assume browser state or session reuse between actions.
    - save_as must exactly match the requested target path.
    - Use search, fetch, and download as separate steps.
    - Keep working until the downloaded artifact validates successfully.

    [User]
    Target file: {target_filepath}
    Description: {file_description}

    Task context (for reference only -- do NOT solve the whole task, only fetch the target artifact):
    {instruction_summary}
Prompt for File Verification

    [System]
    You are a file verifier agent running inside a Linux sandbox.

    Your role is verification-only.

    Rules:
    - You are not a task-solving agent.
    - Your scope is limited to the declared filepaths and their file specifications.
    - Use shell commands only to inspect the provided files.
    - Do NOT modify files, use network access, or evaluate runtime state.
    - Prefer status="continue" whenever evidence is incomplete.
    - Respond with exactly one JSON object per turn.
    - Continue mode must use:  analysis, status="continue", commands
    - Finalize mode must use:  analysis, status="finalize", result
    - Finalize
resultkeys:overall\_verdict,file\_findings,global\_findings\-overall\_verdictmustbeexactly"pass"or"fail"\-Eachiteminfile\_findings:\{filepath,reason,repair\_instructions\}\-Eachiteminglobal\_findings:\{reason,primary\_owner,repair\_instructions\}\(primary\_owner:"llm\_direct"\|"specialized"\|"unattributed"\)\-Ifoverall\_verdictis"pass",bothfindingsarraysmustbeempty\.\-Ifoverall\_verdictis"fail",atleastonefindingmustbepresent\.\-repair\_instructionsmustbeconcreteandactionablefortherepairagent\.Validresponseexamples:Continue:\{"analysis":"Needtocheckfilecontent\.","status":"continue","commands":\["cat/app/data\.csv"\]\}Finalize\(pass\):\{"analysis":"Allfilesmatchtheirspecifications\.","status":"finalize","result":\{"overall\_verdict":"pass","file\_findings":\[\],"global\_findings":\[\]\}\}Finalize\(fail\):\{"analysis":"Foundamismatchin/app/example\.txt\.","status":"finalize","result":\{"overall\_verdict":"fail","file\_findings":\[\{"filepath":"/app/example\.txt","reason":"Contentdoesnotmatchthedeclaredformat\.","repair\_instructions":"Rewrite/app/example\.txttomatchthedeclaredformat\."\}\],"global\_findings":\[\]\}\}\[User\]Taskinstruction:\{instruction\}Verificationscope:\-Onlyverifythedeclaredfilepathsbelow\.\-Ignoreruntimestateandbroadertaskcompletion\.Filestoverify:\{file\_lines\}File\-onlyverificationobjective:\-Verifyeachdeclaredfileagainstitsdescriptionandthetaskinstruction\.\-Onlyuseglobal\_findingsforstill\-file\-scopedcross\-fileissues\.\-Ifyoufailanyfileoremitanyglobalfinding,includedetailedrepair\_instructions\.Initialworkspacetree\(initialhintonly\):\{workspace\_tree\}
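The response rules in the file-verification prompt can be checked mechanically. The following is a minimal sketch of such a check, not the authors' implementation: the function name `validate_verifier_response` and the wording of the error messages are assumptions made for illustration, while the field names and allowed values are taken from the prompt.

```python
import json

# Field names and allowed values come from the prompt; everything else is illustrative.
ALLOWED_OWNERS = {"llm_direct", "specialized", "unattributed"}
FILE_KEYS = {"filepath", "reason", "repair_instructions"}
GLOBAL_KEYS = {"reason", "primary_owner", "repair_instructions"}


def validate_verifier_response(raw: str) -> list:
    """Return a list of rule violations (empty when the response is well-formed)."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(msg, dict):
        return ["response must be a single JSON object"]

    errors = []
    if "analysis" not in msg:
        errors.append("missing 'analysis'")

    status = msg.get("status")
    if status == "continue":
        if not isinstance(msg.get("commands"), list):
            errors.append("continue mode requires a 'commands' list")
        return errors
    if status != "finalize":
        return errors + ["status must be 'continue' or 'finalize'"]

    result = msg.get("result")
    if not isinstance(result, dict):
        return errors + ["finalize mode requires a 'result' object"]
    verdict = result.get("overall_verdict")
    file_findings = result.get("file_findings")
    global_findings = result.get("global_findings")
    if verdict not in ("pass", "fail"):
        errors.append("overall_verdict must be exactly 'pass' or 'fail'")
    if not isinstance(file_findings, list) or not isinstance(global_findings, list):
        return errors + ["file_findings and global_findings must both be arrays"]

    # Verdict and findings must agree with each other.
    if verdict == "pass" and (file_findings or global_findings):
        errors.append("a 'pass' verdict requires both findings arrays to be empty")
    if verdict == "fail" and not (file_findings or global_findings):
        errors.append("a 'fail' verdict requires at least one finding")

    for item in file_findings:
        if not isinstance(item, dict) or not FILE_KEYS <= set(item):
            errors.append("file finding is missing a required key")
    for item in global_findings:
        if not isinstance(item, dict) or not GLOBAL_KEYS <= set(item):
            errors.append("global finding is missing a required key")
        elif item["primary_owner"] not in ALLOWED_OWNERS:
            errors.append("primary_owner has an unknown value")
    return errors
```

Returning a list of violations rather than raising on the first one would let a caller feed all problems back to the verifier agent in a single retry message.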
Prompt for Environment Setup \(env build\)

You are an expert DevOps engineer setting up a Linux sandbox environment for an AI coding agent\. Your job is ONLY to prepare the environment \-\- install system packages, language runtimes, and libraries\. You must NOT write application code or execute the task itself\.

\# Base Environment \(already provisioned \-\- do NOT reinstall these\)
- OS: Debian 13 \(trixie\), kernel 6\.1\.x, x86\_64
- User: user \(uid=1000\), non\-root\. Use sudo for privileged operations\.
- Working directory: /app
- Pre\-installed: Python 3\.12, pip 25\.x, Node\.js 20\.x, npm 10\.x, Java 8 \(OpenJDK\), gcc/g\+\+ 14\.x, make 4\.x, git 2\.x, curl 8\.x, wget 1\.x
- Harbor directories \(perms=777\): /logs/agent, /logs/verifier, /tests, /app, /output, /solution

\# Task Instruction \(for context only \-\- do NOT execute this task\)
\{instruction\}

\# Environment Blueprint
\{blueprint\}

\# Pre\-seeded Assets
The following assets have already been written into the sandbox\. Reference them directly \(e\.g\. `pip install -r /app/requirements.txt`\)\. Do NOT recreate or overwrite them\.

\{pre\_seeded\_files\}

\# Instructions
Generate shell commands to ONLY:
1. Execute each step listed in setup\_steps, in order
2. Create necessary directories
3. Install from any dependency manifest in the pre\-seeded assets
4. Start or configure required services, environment variables, and permissions

IMPORTANT:
- HIGHEST PRIORITY: Never create, download, or overwrite any pre\-seeded asset path
- Scope is limited to environment configuration only \-\- no application source code
- Do NOT execute the task \(e\.g\. do NOT run `node index.js` or `python main.py`\)
- When using sudo with network commands, ALWAYS use `sudo -E` to preserve proxy env vars

End your response with a \`\`\`bash fenced code block containing the commands\.
Prompt for Environment Verification \(env verify\)

You are an expert DevOps engineer verifying that a Linux sandbox environment has been correctly set up\. A setup script has already been executed successfully \(exit code 0\)\. Your job is to generate verification commands that check whether the environment is truly ready for the task\.

\# Base Environment \(already provisioned\)
- OS: Debian 13 \(trixie\), kernel 6\.1\.x, x86\_64; Working directory: /app
- Pre\-installed: Python 3\.12, Node\.js 20\.x, Java 8, gcc/g\+\+ 14\.x, make, git, curl, wget

\# Task Instruction \(for context \-\- do NOT execute this task\)
\{instruction\}

\# Environment Blueprint
\{blueprint\}

\# Pre\-seeded Assets
\{pre\_seeded\_files\}

\# Setup Script That Was Executed
\{setup\_script\}

\# Instructions
Generate shell commands that ONLY verify the environment is ready:
1. Check that required packages/libraries are importable \(e\.g\. `python3 -c "import flask"`, `node -e "require('express')"`\)
2. Check that required CLI tools are available \(e\.g\. `which gcc`, `java -version`\)
3. Check that required directories exist
4. Check that dependency versions meet requirements if specific versions were requested

IMPORTANT:
- Do NOT install anything \-\- only verify
- Do NOT execute the task itself or write application code
- The script does NOT use `set -e` \-\- all commands run even if some fail

End your response with a \`\`\`bash fenced code block containing the verification commands\.
Prompt for Environment Repair \(env repair\)

You are an expert DevOps engineer fixing a failed sandbox environment setup\. Your job is ONLY to prepare the environment \-\- install packages and configure services\. You must NOT write application code or execute the task\.

\# Base Environment \(already provisioned \-\- do NOT reinstall these\)
- OS: Debian 13 \(trixie\), kernel 6\.1\.x, x86\_64; Working directory: /app
- Pre\-installed: Python 3\.12, Node\.js 20\.x, Java 8, gcc/g\+\+ 14\.x, make, git, curl, wget

\# Task Instruction \(for context only \-\- do NOT execute this task\)
\{instruction\}

\# Environment Blueprint
\{blueprint\}

\# Pre\-seeded Assets
The following assets will be re\-written before your commands run\. Do NOT recreate them\.

\{pre\_seeded\_files\}

\# Previous Setup Commands
\{commands\}

\# Command Failure Details
\{errors\}

\# Instructions
The previous attempt failed\. The next attempt runs in a FRESH sandbox \-\- do NOT rely on any side effects from the previous attempt\.

Generate a FULL corrected list of setup commands \(not a patch/delta\)\. All commands will be combined into a single bash script with `set -euxo pipefail` and executed once\.

IMPORTANT:
- HIGHEST PRIORITY: Never create, download, or overwrite any pre\-seeded asset path
- Scope is limited to environment configuration only \-\- no application source code
- Do NOT execute the task \(e\.g\. do NOT run `node index.js` or `python main.py`\)
- When using sudo with network commands, ALWAYS use `sudo -E` to preserve proxy env vars

End your response with a \`\`\`bash fenced code block containing the full corrected commands\.
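The setup and repair prompts both ask the model to end its response with a bash fenced block, and the repair prompt states that the commands are combined into one script run under `set -euxo pipefail`. A rough sketch of that post-processing step follows. The helper names `extract_last_bash_block` and `build_setup_script` and the shebang line are assumptions for illustration; the paper does not show the pipeline's own code for this step.

```python
import re
from typing import Optional

# The fence marker is built programmatically so it does not clash with this example's own fence.
FENCE = "`" * 3
BASH_BLOCK = re.compile(FENCE + r"bash\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_last_bash_block(response: str) -> Optional[str]:
    """Return the body of the final bash fenced block in a model response, if any."""
    blocks = BASH_BLOCK.findall(response)
    return blocks[-1].strip() if blocks else None


def build_setup_script(commands_text: str) -> str:
    """Combine the extracted commands into a single strict-mode bash script."""
    lines = [line for line in commands_text.splitlines() if line.strip()]
    # -e: stop on first failure, -u: unset variables are errors,
    # -x: trace each command, -o pipefail: a pipeline fails if any stage fails.
    header = ["#!/usr/bin/env bash", "set -euxo pipefail"]
    return "\n".join(header + lines) + "\n"
```

Taking the last block rather than the first matches the prompt's instruction that the response should end with the command block, so any earlier illustrative snippet in the model's reasoning is ignored.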
Prompt for Pytest Verifier Generation

You are generating a pytest\-based task verifier for a terminal benchmark task\.

Your verifier will run AFTER the agent has finished executing the task, inside the SAME sandbox state\. Do not assume the environment is restarted\. Verify ONLY whether the task was actually completed \-\- do not inspect the agent's messages or tool traces\.

You must output a JSON object with this exact schema:

`{"system_packages": ["..."], "python_packages": ["..."], "helper_files": [{"path": "tests/filename.ext", "content": "..."}], "test_outputs_py": "..."}`

Rules:
1. The verifier must use pytest\.
2. `test_outputs_py` must contain valid Python source code\.
3. Tests must be black\-box\.
4. Prefer verifying via files, command behavior, local HTTP behavior, or deterministic end\-to\-end examples\.
5. Tests must be deterministic and self\-contained\.
6. Avoid network access unless the task explicitly requires localhost access\.
7. Only add packages required by the verifier itself\.
8. Put any extra test assets into `helper_files`\.
9. Do not generate explanations outside the JSON\.

\# Instruction
\{instruction\}

\# Evaluation Criteria
\{evaluation\_criteria\}

\# Target Output File
\{target\_output\_file\}

\# Initial Text Files \(ACTUAL CONTENT \-\- already present in environment\)
IMPORTANT: These files already exist when the agent starts\. DO NOT test for them\. They are provided as context only\.

\{initial\_text\_files\}

\# Initial Asset Files \(METADATA ONLY \-\- already present in environment\)
IMPORTANT: These files exist but only metadata is provided\. Do NOT assume actual content\.

\{initial\_asset\_files\}

\# Validated Environment File Paths
\{validated\_filepaths\}

CRITICAL RULES:
- DO NOT test whether initial files exist \(guaranteed to exist\)
- DO NOT test initial file content \(already validated\)
- ONLY test whether the evaluation criteria are met
- Test what the AGENT creates, not what the environment provides
System Prompt for Terminus2 Agent \(JSON Format\)

You are an AI assistant tasked with solving command\-line tasks in a Linux environment\. You will be given a task description and the output from previously executed commands\. Your goal is to solve the task by providing batches of shell commands\.

Format your response as JSON with the following structure:

`{"analysis": "Analyze the current state based on the terminal output. What has been accomplished? What still needs to be done?", "plan": "Describe your plan for the next steps. What commands will you run and why? Be specific about what you expect each command to accomplish.", "commands": [{"keystrokes": "ls -la\n", "duration": 0.1}, {"keystrokes": "cd project\n", "duration": 0.1}], "task_complete": true}`

Required fields:
- "analysis": Your analysis of the current situation
- "plan": Your plan for the next steps
- "commands": Array of command objects to execute

Optional fields:
- "task\_complete": Boolean indicating if the task is complete \(defaults to false\)

Command object structure:
- "keystrokes": Exact keystrokes to send to the terminal \(required, must end with \\n\)
- "duration": Seconds to wait before executing the next command \(default: 1\.0\)

IMPORTANT: The text inside "keystrokes" is sent verbatim to the terminal:
- Every command must end with \\n or it will not execute\.
- Special key sequences use tmux\-style escape sequences: C\-c for Ctrl\+C, C\-d for Ctrl\+D\.

The "duration" attribute controls how long to wait for output before proceeding:
- Immediate commands \(cd, ls, echo, cat\): 0\.1 s
- Standard commands \(gcc, find, rustc\): 1\.0 s
- Slow commands \(make, wget, long scripts\): set an appropriate value as needed
- Prefer shorter durations; poll by sending `{"keystrokes": "", "duration": 10.0}` rather than blocking\.

Important notes:
- Send keystrokes exactly as written; do not add extra whitespace unless intended\.
- Extra text outside the JSON generates warnings but is tolerated\.
- The JSON must be valid \-\- escape quotes and special characters inside strings\.
- An empty "commands" array is valid when you need to observe without acting\.

Task Description:
\{instruction\}

Current terminal state:
\{terminal\_state\}
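The response contract in the Terminus2 prompt (three required fields, an optional `task_complete` defaulting to false, and a per-command `duration` defaulting to 1.0) can be illustrated with a small parser. This is a sketch under stated assumptions rather than the Terminus2 implementation: the names `Command` and `parse_agent_response` are hypothetical, and tolerating surrounding text by slicing from the first `{` to the last `}` is one simple heuristic among several.

```python
import json
from dataclasses import dataclass


@dataclass
class Command:
    keystrokes: str
    duration: float = 1.0  # default stated in the prompt


def parse_agent_response(raw: str):
    """Parse one agent turn into (commands, task_complete)."""
    # Extra text outside the JSON is tolerated, so keep only the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    msg = json.loads(raw[start:end + 1])

    for field in ("analysis", "plan", "commands"):
        if field not in msg:
            raise ValueError(f"missing required field: {field}")

    commands = [
        Command(item["keystrokes"], float(item.get("duration", 1.0)))
        for item in msg["commands"]
    ]
    # task_complete is optional and defaults to false.
    return commands, bool(msg.get("task_complete", False))
```

The parser deliberately does not reject keystrokes that lack a trailing newline, since the prompt itself allows an empty `keystrokes` string for polling and tmux-style sequences such as `C-c`.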
Prompt for Environment Quality Evaluation

You are evaluating a Harbor\-format terminal benchmark task\.

Task directory: \{task\_dir\}

Please read the task contents \(instruction\.md, environment/ files, tests/test\_outputs\.py or tests/test\_final\_state\.py\) and score it on four dimensions\. Each dimension is scored 1\-3: 1 = poor, 2 = acceptable, 3 = good\.

Important: base your scores ONLY on the file contents you read\. Ignore the directory path and any dataset name it may imply \-\- treat every task as anonymous\.

Scoring dimensions:

1. terminal\_nativeness
   Does the task genuinely require terminal CLI operations \(compilers, package managers, system commands, build tools, network tools\)?
   High score = real CLI toolchain required\.
   Low score = just writing files or trivial echo commands\.
2. env\_task\_consistency
   Do the pre\-placed environment files and setup precisely match what the instruction requires?
   High score = environment provides exactly the right scaffolding, no excess or gap\.
   Low score = environment is empty, irrelevant, or contradicts the task\.
3. env\_quality
   Are the environment files \(Dockerfile, setup\.sh, initial files\) well\-formed and credible?
   High score = files are realistic, complete, and executable\.
   Low score = files are fabricated stubs, have obvious bugs, or are missing critical dependencies\.
4. verifier\_robustness
   Do the pytest assertions in tests/ accurately distinguish task\-complete from task\-incomplete?
   High score = assertions are specific, cover all key acceptance criteria from the instruction, low false\-positive risk\.
   Low score = only checks file existence, or assertions are unrelated to the instruction requirements\.

Important: if a dimension is not applicable \(e\.g\. no tests/ directory exists\), score it 1\.

Respond with ONLY a JSON object, no other text:

`{"terminal_nativeness": <1|2|3>, "terminal_nativeness_reason": "<one sentence>", "env_task_consistency": <1|2|3>, "env_task_consistency_reason": "<one sentence>", "env_quality": <1|2|3>, "env_quality_reason": "<one sentence>", "verifier_robustness": <1|2|3>, "verifier_robustness_reason": "<one sentence>"}`
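A judge response in the format requested above can be validated before its scores are used. The sketch below shows one way to do that; the function name `parse_quality_scores` is hypothetical, and rejecting booleans and non-integer numbers is an added strictness assumption that the prompt does not spell out.

```python
import json

# The four dimension names are taken from the prompt.
DIMENSIONS = (
    "terminal_nativeness",
    "env_task_consistency",
    "env_quality",
    "verifier_robustness",
)


def parse_quality_scores(raw: str) -> dict:
    """Return {dimension: score} after checking the 1-3 range and the reason fields."""
    data = json.loads(raw)
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value not in (1, 2, 3):
            raise ValueError(f"{dim} must be an integer score of 1, 2, or 3")
        if not isinstance(data.get(f"{dim}_reason"), str):
            raise ValueError(f"{dim}_reason must be a string")
        scores[dim] = value
    return scores
```

For example, a response scoring every dimension 3 with a one-sentence reason each yields a dictionary with four entries, while a response that omits a reason or uses a score of 4 raises `ValueError`.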
## Appendix E Data Examples
To complement the textual pipeline description in Section [3](https://arxiv.org/html/2605.20876#S3), this section presents *three stage\-focused examples* drawn from the Terminal\-World dataset\. Rather than compressing an end\-to\-end rollout for every task, each example zooms into a single synthesis stage and displays its inputs and outputs verbatim so the reader can see the concrete shape of each artifact:
1. Example [E\.1](https://arxiv.org/html/2605.20876#A5.SS1) illustrates Task Generation \(§[3\.2](https://arxiv.org/html/2605.20876#S3.SS2)\): from a single \(Skill $S$, Persona $U$\) pair to the synthesized quadruple $(\mathcal{I},\mathcal{E},\mathcal{V},\mathcal{G})$\. We show all six artifacts verbatim\.
2. Example [E\.2](https://arxiv.org/html/2605.20876#A5.SS2) illustrates Environment Building \(§[3\.3](https://arxiv.org/html/2605.20876#S3.SS3)\): a three\-file blueprint is routed through three different sub\-agents of the multi\-agent GVR architecture, producing the initial files $F$, setup script $B_{\text{env}}$, and pytest verifier $T_{\text{test}}$\.
3. Example [E\.3](https://arxiv.org/html/2605.20876#A5.SS3) illustrates Trajectory Collection \(§[3\.4](https://arxiv.org/html/2605.20876#S3.SS4)\): a multi\-turn teacher\-model rollout with verbatim analysis/plan/commands/observation for four representative steps\.
### E\.1 Example 1: Task Generation, ELF Binary Parsing \(Astrophysics\)
Stage\-Focused Example 1 \| Stage: Task Generation \(§[3\.2](https://arxiv.org/html/2605.20876#S3.SS2)\)

❶ Input $S$: Agent Skill

```yaml
---
name: elf-binary-analysis
description: >
  Analyze ELF binary files for reverse engineering, security research, and
  exploitation. Use this skill whenever the user needs to understand ELF
  structure, analyze program headers, section headers, symbols, relocations,
  GOT/PLT, or identify binary protections like RELRO, stack canaries, and PIE.
  Trigger on any request involving ELF files, binary analysis, readelf output
  interpretation, or exploitation reconnaissance.
tags: ['reverse-engineering', 'binary-analysis', 'elf']
source: github.com/abelrguezr/hacktricks-skills
task_slug: binary-exploitation/elf-tricks
route_label: segment-buffer-allocation
---
```

❷ Input $U$: User Persona \(FinePersonas\)

```text
An astrophysicist or a science writer covering recent developments in
astrophysics, particularly focused on gamma-ray bursts, gravitational
waves, and astronomical research.

labels: [Science, Academia, Space Exploration]
```

❸ Output $\mathcal{I}$: Task Instruction

As an astrophysicist, I have a compiled binary tool `grb_processor.elf` used for gamma\-ray burst data analysis\. To validate its memory mapping and extract calibration constants, write a Node\.js script that parses the ELF file\. Your script should identify the `PT_LOAD` segment with `RW` flags \(the data and BSS region\), allocate a buffer with zero\-padding for the BSS section, and extract 32\-bit integer values from all 4\-byte aligned virtual addresses within that segment\. Output a JSON object where keys are virtual addresses as decimal strings and values are the 32\-bit integers\. Save the JSON to `/app/memory_map.json`\.

❹ Output $\mathcal{E}$: Environment Blueprint

Initial Files: `/app/grb_processor.elf` • mode: local\_tool

ELF 64\-bit LSB executable binary file\. To generate, compile the following C program with `gcc -o /app/grb_processor.elf /tmp/source.c`\.

C source code:

```c
#include <stdint.h>
int calibration = 271828;          // Euler's number * 100000
int flux_readings[255] = {0};      // 255 x 4 = 1020 bytes of initialized data
char photon_buffer[4096];          // uninitialized BSS (zero-padded at load time)
int main() { return 0; }
```

After compilation, `readelf -l` shows two LOAD segments:

- LOAD\[0\] \(text, `R E`\) with `VirtAddr=0x400000`, `FileSiz=0x238`, `MemSiz=0x238`: code only, not used for extraction\.
- LOAD\[1\] \(data\+BSS, `RW`\) with `VirtAddr=0x601000`, `FileSiz=0x400`, `MemSiz=0x1400`\.

Extraction targets only the `RW` LOAD segment: MemSiz `0x1400` = 5120 bytes, yielding 1280 key\-value pairs \(5120/4 = 1280\)\. `calibration` occupies the first 4 bytes of LOAD\[1\] at `VirtAddr=0x601000`\.

Setup Steps

- Install the `binutils` package via `apt-get` to ensure the `readelf` command is available for ELF inspection\.

❺ Output $\mathcal{V}$: Evaluation Scope

- ✓ File `/app/memory_map.json` exists and contains valid JSON\.
- ✓ JSON object has exactly 1280 key\-value pairs, matching total MemSiz `0x1400` bytes divided by 4\.
- ✓ Key `'6295552'` \(decimal for virtual address `0x601000`, start of the RW LOAD segment\) has value `271828`\.
- ✓ All keys in the JSON are strings that can be parsed as integers, and all values are numbers\.

❻ Output $\mathcal{G}$: Execution Guideline

- Step 1\. Install `binutils` for the `readelf` command: run `apt-get update && apt-get install -y binutils`; verify with `readelf --version`\.
- Step 2\. Inspect the ELF header to confirm binary properties: execute `readelf -h /app/grb_processor.elf`; confirm the output shows `Class: ELF64` and little\-endian\.
- Step 3\. Identify the data LOAD segment: run `readelf -l /app/grb_processor.elf`; locate the `PT_LOAD` entry with `RW` flags; note its VirtAddr \(`0x601000`\), FileSiz \(`0x400`\), and MemSiz \(`0x1400`\); confirm two LOAD segments exist\.
- Step 4\. Verify the Node\.js runtime: run `node --version`; ensure the version is v14\.x or higher for BigInt support\.
- Step 5\. \[IMPORTANT\] Create the Node\.js script with BSS handling: write `/app/extract.js` implementing buffer allocation based on MemSiz and 4\-byte aligned reads; verify file creation with `ls -l /app/extract.js`\.
- Step 6\. Run the extraction script: `node /app/extract.js /app/grb_processor.elf > /app/memory_map.json 2>&1`; check the exit code with `echo $?`; it should be 0\.
- Step 7\. Validate the JSON output format: `head -c 200 /app/memory_map.json`; confirm it starts with `{` and appears as valid JSON\.
- Step 8\. Final verification: use a Node\.js one\-liner to parse the JSON, verify the key count is 1280 and the value at `'6295552'` \(VirtAddr `0x601000`\) is 271828\.
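The extraction logic that the task instruction and Step 5 describe can be sketched as follows\. This is an illustration in Python rather than the Node\.js the task asks for, and it is not the teacher model's `/app/extract.js`\. The function name `extract_rw_words` is hypothetical, and decoding each value as a signed little\-endian integer is an assumption, since the task only says "32\-bit integer"\. The sketch assumes a little\-endian ELF64 file and takes the first `PT_LOAD` entry whose flags are `R` and `W` without `X`\.

```python
import struct

PT_LOAD = 1                      # program header type for loadable segments
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4  # segment permission bits
RW = PF_R | PF_W


def extract_rw_words(elf: bytes) -> dict[str, int]:
    """Map each 4-byte aligned virtual address of the RW PT_LOAD segment
    (as a decimal string) to the 32-bit integer stored there.

    Hypothetical helper: assumes ELF64, little-endian, signed values.
    """
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")

    # e_phoff sits at offset 0x20; e_phentsize and e_phnum at 0x36 and 0x38.
    (e_phoff,) = struct.unpack_from("<Q", elf, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 0x36)

    for index in range(e_phnum):
        (p_type, p_flags, p_offset, p_vaddr, _paddr,
         p_filesz, p_memsz, _align) = struct.unpack_from(
            "<IIQQQQQQ", elf, e_phoff + index * e_phentsize)
        if p_type != PT_LOAD or (p_flags & (PF_R | PF_W | PF_X)) != RW:
            continue

        # Allocate MemSiz bytes so the BSS tail (MemSiz - FileSiz) is zero-filled,
        # then copy only the FileSiz bytes that exist in the file.
        image = bytearray(p_memsz)
        image[:p_filesz] = elf[p_offset:p_offset + p_filesz]

        first = (-p_vaddr) % 4   # skip to the first 4-byte aligned address
        return {
            str(p_vaddr + off): struct.unpack_from("<i", image, off)[0]
            for off in range(first, p_memsz - 3, 4)
        }

    raise ValueError("no RW PT_LOAD segment found")
```

For the segment values quoted in the blueprint \(`VirtAddr=0x601000`, `FileSiz=0x400`, `MemSiz=0x1400`\), a function of this shape produces 0x1400 / 4 = 1280 entries, the first keyed by `'6295552'`\.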
### E\.2 Example 2: Environment Building \- Multi\-Format Data Merger
Stage\-Focused Example 2 \| Stage: Environment Building \(§[3\.3](https://arxiv.org/html/2605.20876#S3.SS3)\)

❶ Input: Environment Blueprint $\mathcal{E}_{\text{files}}$

Initial Files: \(1\) `/app/quiz_data.json` • mode: llm\_direct

JSON array of objects with keys: `student_id` \(int\), `topic` \(string, either `'algebra'` or `'scientific_notation'`\), `score` \(float or null\), `timestamp` \(ISO8601 string\)\. 50 objects\. Example objects:

```json
{"student_id": 101, "topic": "algebra",             "score": 88.5, ...}
{"student_id": 102, "topic": "scientific_notation", "score": 85.5, ...}
```

Student\_id 103 has a null score for algebra \(missing data\)\. Student\_id 101 appears twice with different scores for algebra \(88\.5\) and scientific\_notation \(92\.0\); the agent must map these to separate fields\.

Initial Files: \(2\) `/app/gradebook.csv` • mode: llm\_direct

CSV file with columns: `id` \(int\), `name` \(string\), `algebra_grade` \(float\), `sci_not_grade` \(float\), `updated` \(ISO8601 string\)\. 50 rows\. Example rows:

```text
101,Alice Johnson,90.0,88.5,2024-10-17T09:00:00
102,Bob Smith,76.5,89.0,2024-10-17T09:05:00
```

Row for `id 104` has `algebra_grade=-1.0` \(sentinel for missing data\)\. Column `id` corresponds to `student_id` in other sources\.

Initial Files: \(3\) `/app/project_scores.parquet` • mode: local\_tool

Parquet file with columns: `student_id` \(int\), `project_type` \(string, `'algebra'` or `'scientific_notation'`\), `project_score` \(int or string\), `date` \(string in YYYY\-MM\-DD format\)\. 30 rows\. Example: `student_id=105, project_type='algebra', project_score=95, date='2024-10-18'`\. Three rows have `project_score` as string `'N/A'` instead of integer \(type inconsistency to handle\)\. Scores are out of 100\. The final merged output must store all score columns \(`algebra_score`, `sci_not_score`\) as `float64` dtype\.

Setup Steps

- Install `pandas` and `pyarrow` via `pip` to handle CSV, JSON, and Parquet file operations\.

❷ Artifact $F$: Generated Initial Files \(head of each\)

Routed to $\mathcal{A}_{llm\_direct}$: \(1\) `/app/quiz_data.json`:

```json
[
  {"student_id": 101, "topic": "algebra",             "score": 88.5, "timestamp": "2024-10-15T10:30:00"},
  {"student_id": 101, "topic": "scientific_notation", "score": 92.0, "timestamp": "2024-10-15T11:45:00"},
  {"student_id": 102, "topic": "scientific_notation", "score": 85.5, "timestamp": "2024-10-16T11:00:00"},
  {"student_id": 103, "topic": "algebra",             "score": null, "timestamp": "2024-10-16T11:15:00"},
  {"student_id": 104, "topic": "algebra",             "score": 77.0, "timestamp": "2024-10-15T09:00:00"},
  {"student_id": 104, "topic": "scientific_notation", "score": 82.5, "timestamp": "2024-10-15T10:00:00"},
  ... (44 more entries)
]
```

Routed to $\mathcal{A}_{llm\_direct}$: \(2\) `/app/gradebook.csv`:

```text
id,name,algebra_grade,sci_not_grade,updated
101,Alice Johnson,90.0,88.5,2024-10-17T09:00:00
102,Bob Smith,76.5,89.0,2024-10-17T09:05:00
103,Charlie Brown,82.0,81.5,2024-10-17T09:10:00
104,David Williams,-1.0,90.0,2024-10-17T09:15:00
105,Eva Martinez,91.0,89.5,2024-10-17T09:20:00
106,Frank Miller,85.0,84.0,2024-10-17T09:25:00
... (44 more rows)
149,Will Morgan,82.0,79.5,2024-10-17T12:35:00
150,Zoe Carter,88.5,91.0,2024-10-17T12:40:00
```

Routed to $\mathcal{A}_{local\_tool}$: \(3\) `/app/project_scores.parquet`:

```text
>>> import pandas as pd; df = pd.read_parquet('/app/project_scores.parquet')
>>> df.dtypes
student_id        int64
project_type        str
project_score       str      <- type inconsistency: some rows hold 'N/A'
date                str
>>> df.head(8)
 student_id         project_type  project_score        date
        101              algebra             70  2024-10-16
        102  scientific_notation             81  2024-10-13
        103              algebra             95  2024-10-22
        104  scientific_notation             91  2024-10-24
        105              algebra             95  2024-10-18
        106  scientific_notation            N/A  2024-10-17
        107              algebra             94  2024-10-22
        108  scientific_notation             86  2024-10-14
>>> df.shape
(30, 4)
```

❸ Artifact $B_{\text{env}}$: Generated `setup.sh`

```bash
#!/usr/bin/env bash
set -euxo pipefail
cd /app

sudo -E pip install pandas pyarrow
```

❹ Artifact $T_{\text{test}}$: Generated pytest Verifier

```python
import pandas as pd, json, os, pytest
import numpy as np

def test_output_files_exist():
    assert os.path.exists("/app/merged_students.parquet")
    assert os.path.exists("/app/conflicts.json")

def test_merged_parquet_structure():
    df = pd.read_parquet("/app/merged_students.parquet")
    expected_cols = ['student_id', 'name', 'algebra_score', 'sci_not_score']
    assert set(df.columns) == set(expected_cols)
    assert pd.api.types.is_integer_dtype(df['student_id'])
    assert pd.api.types.is_float_dtype(df['algebra_score'])
    assert pd.api.types.is_float_dtype(df['sci_not_score'])
    assert df['student_id'].dtype == 'int64'

def test_row_count():
    quiz = pd.read_json("/app/quiz_data.json")
    gradebook = pd.read_csv("/app/gradebook.csv")
    project = pd.read_parquet("/app/project_scores.parquet")
    all_ids = set(quiz['student_id']) | set(gradebook['id']) | set(project['student_id'])
    df = pd.read_parquet("/app/merged_students.parquet")
    assert len(df) == len(all_ids)
    assert set(df['student_id']) == all_ids

def test_priority_and_data_logic():
    df = pd.read_parquet("/app/merged_students.parquet")
    # Student 101: Quiz(alg=88.5, sci=92.0) beats Gradebook(alg=90.0, sci=88.5) by priority.
    s101 = df[df['student_id'] == 101].iloc[0]
    assert s101['algebra_score'] == 88.5
    assert s101['sci_not_score'] == 92.0
    # Student 103: Quiz(alg=null) -> fallback to Gradebook(alg=82.0).
    s103 = df[df['student_id'] == 103].iloc[0]
    assert s103['algebra_score'] == 82.0
    # Student 104: Quiz(alg=77.0) beats Gradebook(alg=-1.0 sentinel).
    s104 = df[df['student_id'] == 104].iloc[0]
    assert s104['algebra_score'] == 77.0
    assert s101['name'] == "Alice Johnson"
    # Student only in Gradebook.
    s149 = df[df['student_id'] == 149].iloc[0]
    assert s149['name'] == "Will Morgan"
    assert s149['algebra_score'] == 82.0

def test_conflicts_json_content():
    with open("/app/conflicts.json", 'r') as f:
        data = json.load(f)
    assert "total_conflicts" in data
    assert "conflicts" in data
    assert isinstance(data['conflicts'], list)
    # Student 101 had a conflict on algebra_score (Quiz 88.5 vs Gradebook 90.0).
    for conflict in data['conflicts']:
        if conflict['student_id'] == 101 and conflict['field'] == 'algebra_score':
            assert conflict['selected_value'] == 88.5
            assert 'quiz' in conflict['values']
            assert 'gradebook' in conflict['values']
    # General priority check: selected_value must match highest-priority non-missing source.
    for conflict in data['conflicts']:
        vals, selected = conflict['values'], conflict['selected_value']
        priority_val = None
        for src in ['quiz', 'gradebook', 'project']:
            val = vals.get(src)
            if val is not None and val != -1.0 and val != 'N/A' \
               and not (isinstance(val, float) and np.isnan(val)):
                priority_val = val
                break
        if priority_val is not None:
            assert float(selected) == float(priority_val)
```
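The final loop of the verifier implies a resolution rule: sources are consulted in the order quiz, gradebook, project, and `None`, `-1.0`, `'N/A'`, and NaN count as missing\. A minimal Python sketch of that rule follows, assuming the same per\-source `values` mapping that the test reads from `conflicts.json`\. The helper names `is_missing` and `select_score` are hypothetical and are not part of the paper's pipeline; treating every non\-numeric string as missing is a slightly broader assumption than the verifier's explicit `'N/A'` check\.

```python
import math

PRIORITY = ("quiz", "gradebook", "project")  # order used by the verifier's loop
SENTINEL = -1.0                              # gradebook marker for missing data


def is_missing(value) -> bool:
    """Return True for None, NaN, the -1.0 sentinel, or a non-numeric string."""
    if value is None:
        return True
    try:
        number = float(value)
    except (TypeError, ValueError):
        return True  # e.g. the string 'N/A' (assumption: any non-numeric text)
    return math.isnan(number) or number == SENTINEL


def select_score(values: dict):
    """Pick the first non-missing score in priority order, as a float.

    Returns None when every source is missing.
    """
    for source in PRIORITY:
        candidate = values.get(source)
        if not is_missing(candidate):
            return float(candidate)
    return None


# The three cases spelled out in the verifier's comments:
print(select_score({"quiz": 88.5, "gradebook": 90.0}))   # student 101, algebra
print(select_score({"quiz": None, "gradebook": 82.0}))   # student 103, algebra
print(select_score({"quiz": 77.0, "gradebook": -1.0}))   # student 104, algebra
```

Returning a Python `float` in every non\-missing case lines up with the blueprint's requirement that the merged score columns be stored as `float64`\.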
### E\.3 Example 3: Trajectory Collection \- Video OCR Extraction
Stage\-Focused Example 3 \| Stage: Trajectory Collection \(§[3\.4](https://arxiv.org/html/2605.20876#S3.SS4)\)

![[Uncaptioned image]](https://arxiv.org/html/2605.20876v1/figures/video_slides/video_strip.png)

❶ Step 2 \(of 11\): Install `ffmpeg`

analysis: We're at /app with a fresh terminal\. The task is to process a video file using OCR to extract treatment methods and save a cleaned list\. A checklist is provided with 10 steps\. We need to start with step 1: update package lists\.

plan: Execute step 1: run apt\-get update to update package lists\. Then proceed to step 2: install ffmpeg\.

commands:

```bash
apt-get update
apt-get install -y ffmpeg
```

observation:

```text
/app $ apt-get update
Hit:1 http://mirrors.tuna.tsinghua.edu.cn/debian trixie InRelease
Reading package lists... Done
/app $ apt-get install -y ffmpeg
Reading package lists... Done
Building dependency tree... Done
ffmpeg is already the newest version (7:7.1.3-0+deb13u1).
0 upgraded, 0 newly installed, 0 to remove and 79 not upgraded.
```

❷ Step 6 \(of 11\): Extract video frames with `ffmpeg`

analysis: All prerequisites installed\. Proceed to step 6: extract video frames\. Create /app/frames and run ffmpeg at `fps=1`\.

plan: `mkdir -p /app/frames`, then ffmpeg with `fps=1`\. Verify with `find ... | wc -l`\.

commands:

```bash
mkdir -p /app/frames
ffmpeg -i /app/education_video.mp4 -vf "fps=1" /app/frames/frame_%04d.png
find /app/frames -name "*.png" | wc -l
```

observation:

```text
/app $ ffmpeg -i /app/education_video.mp4 -vf "fps=1" /app/frames/frame_%04d.png
ffmpeg version 7.1.3-0+deb13u1 Copyright (c) 2000-2025 the FFmpeg developers
…
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/app/education_video.mp4':
  Duration: 00:00:15.00, start: 0.000000, bitrate: 21 kb/s
  Stream #0:0: Video: h264 (High) (avc1/0x31637661), yuv420p, 1280x720, 20 kb/s, 1 fps
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> png (native))
Output #0, image2, to '/app/frames/frame_%04d.png':
  Stream #0:0: Video: png, rgb24, 1280x720, q=2-31, 1 fps
frame= 15 fps=0.0 q=-0.0 Lsize=N/A time=00:00:15.00 bitrate=N/A speed= 157x
/app $ find /app/frames -name "*.png" | wc -l
15
```

❸ Step 7 \(of 11\): Author `ocr.py`, run OCR

analysis: Frame extraction successful: 15 frames\. Now step 7: create OCR script and run it\.

plan: Create `/app/ocr.py` that globs frames, runs `pytesseract` on each, and appends to `all_text.txt`\.

commands:

```python
import cv2, pytesseract, glob

frames_dir  = '/app/frames'
output_file = '/app/all_text.txt'

frame_paths = sorted(glob.glob(f'{frames_dir}/frame_*.png'))
all_text = []
for path in frame_paths:
    img = cv2.imread(path)
    if img is None:
        continue
    text = pytesseract.image_to_string(img)
    all_text.append(text)

with open(output_file, 'w') as f:
    f.write('\n'.join(all_text))
```

observation:

```text
/app $ python3 /app/ocr.py
/app $ wc -l /app/all_text.txt
29 /app/all_text.txt
/app $ head -c 200 /app/all_text.txt
Slide 1: Introduction to Adolescent Depression
Slide 1: Introduction to Adolescent Depression
Slide 2: Cognitive Behavioral Therapy (CBT)
Slide 2: Co…
```

❹ Step 9 \(of 11\): Refine `parse.py` to filter out the intro slide

analysis: Step 8 produced an initial parse with 5 unique lines, including the non\-treatment `Introduction to Adolescent Depression`\. Filter by treatment\-related keywords\.

plan: Rewrite `/app/parse.py` with a keyword whitelist \[therapy, treatment, medication, cbt, ssri, antidepressant\], dedup, re\-run, verify $\geq 4$ lines and CBT presence\.

commands:

```python
input_file  = '/app/all_text.txt'
output_file = '/app/treatment_list.txt'
keywords    = ['therapy', 'treatment', 'medication', 'cbt', 'ssri', 'antidepressant']

with open(input_file, 'r') as f:
    lines = f.read().splitlines()

treatments = set()
for line in lines:
    line = line.strip()
    if not line:
        continue
    if line.startswith('Slide'):
        parts = line.split(':', 1)
        if len(parts) == 2:
            treatment = parts[1].strip()
            if treatment and any(kw in treatment.lower() for kw in keywords):
                treatments.add(treatment)

with open(output_file, 'w') as f:
    for t in sorted(treatments):
        f.write(t + '\n')
```

observation:

```text
/app $ python3 /app/parse.py
Extracted 4 unique treatments.
/app $ cat /app/treatment_list.txt
Cognitive Behavioral Therapy (CBT)
Family-Based Treatment
Interpersonal Therapy (IPT)
Medication: SSRIs like fluoxetine
/app $ wc -l /app/treatment_list.txt
4 /app/treatment_list.txt
/app $ grep -q "Cognitive Behavioral Therapy\|CBT" /app/treatment_list.txt && echo "Contains CBT"
Contains CBT
```

❺ Verification Result \(downstream pytest\)

| Test Case | Status | Metric | Value |
| --- | --- | --- | --- |
| `test_treatment_list_exists` | ✓ PASSED | file present | `/app/treatment_list.txt` |
| `test_treatment_list_min_lines` | ✓ PASSED | line count | $\geq 4$ \(actual: 4\) |
| `test_cbt_present` | ✓ PASSED | CBT keyword | found |
| `test_ssri_present` | ✓ PASSED | SSRI keyword | found |
| `test_ipt_present` | ✓ PASSED | IPT keyword | found |
| `test_family_based_present` | ✓ PASSED | FBT keyword | found |

Reward: 1\.0 \(6/6 passed, 11 steps\)