CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

Hugging Face Daily Papers Papers

Summary

CLI-Universe is a synthesis engine that generates verifiable terminal-agent tasks via multi-dimensional capability taxonomy and evidence-guided research, producing a distilled dataset of 6,000 trajectories. Fine-tuning Qwen3-32B on this dataset achieves 33.4% on Terminal-Bench 2.0, setting a new state-of-the-art for open-source models at or below 32B parameters.

While recent LLM-based terminal agents have demonstrated promising capabilities, the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, frequently yielding ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To overcome this, we introduce CLI-Universe, a principled synthesis engine that constructs terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated into Dockerized environments and subjected to a multi-stage executable verification pipeline featuring rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, retaining only those that are genuine, verifiable, and non-trivially challenging. To validate our framework, we instantiate a highly distilled dataset of 6,000 trajectories called CLI-Universe-6K. Remarkably, fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0. This sets a new state-of-the-art for models trained on open-source data at or below 32B parameters, and outperforms several models an order of magnitude larger, demonstrating the profound data efficiency of structured, high-fidelity synthesis.
Original Article
View Cached Full Text

Cached at: 06/23/26, 05:40 AM

Paper page - CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

Source: https://huggingface.co/papers/2606.22883 Authors:

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

Abstract

A principled synthesis engine generates high-quality terminal-agent tasks through multi-dimensional capability taxonomy and evidence-guided research, creating a distilled dataset that enables significant performance gains in LLM training.

While recentLLM-based terminal agentshave demonstrated promising capabilities, the scarcity of high-quality,executable training dataremains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, frequently yielding ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To overcome this, we introduce CLI-Universe, a principledsynthesis enginethat constructs terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensionalcapability taxonomy(domain, skill type, capability, and engineering pillar), then grounds each candidate throughevidence-guided deep researchover real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated intoDockerized environmentsand subjected to a multi-stageexecutable verification pipelinefeaturingrubric-gated test construction, hint-conditional filtering, and strictfail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, retaining only those that are genuine, verifiable, and non-trivially challenging. To validate our framework, we instantiate a highly distilled dataset of 6,000 trajectories calledCLI-Universe-6K. Remarkably, fine-tuningQwen3-32BonCLI-Universe-6Kachieves 33.4% onTerminal-Bench 2.0. This sets a new state-of-the-art for models trained on open-source data at or below 32B parameters, and outperforms several models an order of magnitude larger, demonstrating the profound data efficiency of structured, high-fidelity synthesis.

View arXiv pageView PDFAdd to collection

Get this paper in your agent:

hf papers read 2606\.22883

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.22883 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.22883 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.22883 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

Terminal-World: Scaling Terminal-Agent Environments via Agent Skills

arXiv cs.CL

Terminal-World introduces a fully automated pipeline that uses agent skills to synthesize high-quality training data for terminal agents, enabling models to outperform baselines with only 1.2% of the training data. The method co-derives task instructions, environments, and teacher trajectories from skill primitives.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Hugging Face Daily Papers

WildClawBench evaluates language and vision-language models on realistic long-horizon tasks using actual CLI environments with real tools. The benchmark reveals that even the best model achieves only 62.2% accuracy, indicating long-horizon agent evaluation remains challenging.