CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents
Summary
CLI-Universe is a synthesis engine that generates verifiable terminal-agent tasks via a multi-dimensional capability taxonomy and evidence-guided research, producing a distilled dataset of 6,000 trajectories. Fine-tuning Qwen3-32B on this dataset achieves 33.4% on Terminal-Bench 2.0, a new state-of-the-art for models trained on open-source data at or below 32B parameters.
Cached at: 06/23/26, 05:40 AM
Source: https://huggingface.co/papers/2606.22883
Abstract
A principled synthesis engine generates high-quality terminal-agent tasks through a multi-dimensional capability taxonomy and evidence-guided research, creating a distilled dataset of 6,000 trajectories that lifts a fine-tuned Qwen3-32B to 33.4% on Terminal-Bench 2.0.
While recent LLM-based terminal agents have demonstrated promising capabilities, the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, frequently yielding ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To overcome this, we introduce CLI-Universe, a principled synthesis engine that constructs terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated into Dockerized environments and subjected to a multi-stage executable verification pipeline featuring rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, retaining only those that are genuine, verifiable, and non-trivially challenging. To validate our framework, we instantiate a highly distilled dataset of 6,000 trajectories called CLI-Universe-6K. Remarkably, fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0. This sets a new state-of-the-art for models trained on open-source data at or below 32B parameters, and outperforms several models an order of magnitude larger, demonstrating the profound data efficiency of structured, high-fidelity synthesis.
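As a rough illustration of two steps named in the abstract, the sketch below samples candidate combinations across the four taxonomy axes and applies a fail-to-pass check. It is not the authors' implementation. The taxonomy entries, the function names `sample_candidates` and `fail_to_pass`, and the callback interface are all hypothetical, and the check assumes the common meaning of fail-to-pass (tests fail before the reference solution is applied and pass afterwards), which the abstract does not spell out.

```python
import itertools
import random

# Hypothetical taxonomy values. The abstract names the four axes only;
# every entry below is an illustrative placeholder.
TAXONOMY = {
    "domain": ["data-processing", "networking", "build-systems"],
    "skill_type": ["debugging", "implementation"],
    "capability": ["log-analysis", "dependency-resolution"],
    "engineering_pillar": ["reliability", "security"],
}


def sample_candidates(taxonomy, k, seed=0):
    """Sample k distinct combinations, taking one value per taxonomy axis."""
    axes = list(taxonomy)
    combos = list(itertools.product(*(taxonomy[axis] for axis in axes)))
    rng = random.Random(seed)
    picked = rng.sample(combos, k)  # sampling without replacement
    return [dict(zip(axes, combo)) for combo in picked]


def fail_to_pass(run_tests, apply_solution):
    """Accept a task only if its tests fail first and pass after the solution.

    run_tests: callable returning True when the test suite passes.
    apply_solution: callable that applies the reference solution.
    """
    if run_tests():
        return False  # tests pass on the untouched environment, so they verify nothing
    apply_solution()
    return bool(run_tests())


# Toy stand-in for a Dockerized environment: a dict holding one flag.
state = {"fixed": False}
accepted = fail_to_pass(
    run_tests=lambda: state["fixed"],
    apply_solution=lambda: state.update(fixed=True),
)
candidates = sample_candidates(TAXONOMY, k=5)
```

In a real pipeline the two callbacks would run inside the task's container; the dict here only stands in for that environment so the check can be exercised in isolation.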
Similar Articles
Terminal-World: Scaling Terminal-Agent Environments via Agent Skills
Terminal-World introduces a fully automated pipeline that uses agent skills to synthesize high-quality training data for terminal agents, enabling models to outperform baselines with only 1.2% of the training data. The method co-derives task instructions, environments, and teacher trajectories from skill primitives.
LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents
LiteCoder-Terminal-Gen introduces a zero-dependency synthetic pipeline that generates executable terminal training environments, producing SFT and RL datasets that enable language agents to achieve significant performance gains on Terminal Bench benchmarks.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench evaluates language and vision-language models on realistic long-horizon tasks using actual CLI environments with real tools. The benchmark reveals that even the best model achieves only 62.2% accuracy, indicating long-horizon agent evaluation remains challenging.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
This paper introduces CUActSpot, a multimodal benchmark for evaluating computer-use agents, and a renderer-based data synthesis pipeline. The proposed Phi-Ground-Any-4B model outperforms open-source models under 32B parameters.
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
TOBench is a new benchmark for evaluating AI agents on real-world, task-oriented tool use with multimodal inputs and closed-loop verification. Experiments show top models like Qwen 3.5 Plus achieve only 41% success, far below the 94% human benchmark, highlighting a significant gap.