ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
Summary
ClawEnvKit is an automated pipeline that generates diverse, verified environments for claw-like agents from natural language descriptions, enabling the construction of Auto-ClawEval, a large-scale benchmark with 1,040 environments at 13,800x lower cost than human curation. The system supports continuous, on-demand evaluation and adaptive training environment generation across multiple model families and agent frameworks.
Source: https://huggingface.co/papers/2604.18543
Abstract
An automated pipeline generates diverse, verified environments for claw-like agents from natural language descriptions, enabling large-scale benchmark construction and continuous evaluation.
Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent’s current weaknesses rather than being bounded by existing user logs.
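The parser, generator, and validator stages described in the abstract can be pictured with a small Python sketch. This is an illustration only, not ClawEnvKit's code: every name here (`GenParams`, `Environment`, `parse_request`, `generate`, `validate`, `build`) is hypothetical, the parser is a keyword matcher rather than a language model, the generator fills a fixed template, and the validator covers only toy versions of structural validity, internal consistency, and diversity (feasibility checking is left out).

```python
from dataclasses import dataclass


@dataclass
class GenParams:
    """Structured parameters a parser might extract (hypothetical fields)."""
    category: str
    count: int


@dataclass
class Environment:
    """One generated environment (hypothetical layout)."""
    task: str                  # task specification
    tools: list[str]           # tool interface, reduced to tool names
    scoring: dict[str, float]  # scoring configuration: tool name -> weight


def parse_request(text: str) -> GenParams:
    """Toy parser: reads 'N <category> environments' from a request string."""
    words = text.lower().split()
    count = next((int(w) for w in words if w.isdigit()), 1)
    category = "general"
    if "environments" in words and words.index("environments") > 0:
        category = words[words.index("environments") - 1]
    return GenParams(category=category, count=count)


def generate(params: GenParams) -> list[Environment]:
    """Toy generator: fills a template instead of calling a language model."""
    tools = [f"{params.category}_read", f"{params.category}_write"]
    return [
        Environment(
            task=f"Task {i} in category '{params.category}'",
            tools=list(tools),
            scoring={tools[0]: 0.5, tools[1]: 0.5},
        )
        for i in range(params.count)
    ]


def validate(envs: list[Environment]) -> list[Environment]:
    """Keep environments that pass three simplified checks."""
    seen_tasks: set[str] = set()
    kept = []
    for env in envs:
        # Structural validity: all three parts are present.
        structural = bool(env.task) and bool(env.tools) and bool(env.scoring)
        # Internal consistency: scoring only refers to declared tools,
        # and the weights sum to 1.
        consistent = (
            all(name in env.tools for name in env.scoring)
            and abs(sum(env.scoring.values()) - 1.0) < 1e-9
        )
        # Diversity: no exact duplicate task text.
        diverse = env.task not in seen_tasks
        if structural and consistent and diverse:
            seen_tasks.add(env.task)
            kept.append(env)
    return kept


def build(text: str) -> list[Environment]:
    """Run the three stages in order: parse, generate, validate."""
    return validate(generate(parse_request(text)))
```

Calling `build("generate 3 calendar environments")` runs the three stages end to end. Keeping the validator as a separate filter means a generated environment whose scoring refers to an undeclared tool is dropped rather than shipped, which is the role the abstract assigns to the validation module.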
Similar Articles
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
EnvFactory automates the creation of executable tool environments and natural multi-turn trajectories for training LLMs with agentic reinforcement learning, achieving superior performance on benchmarks like BFCLv3 and MCP-Atlas with fewer environments than prior work.
VisualClaw: A Real-Time, Personalized Agent for the Physical World
VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution, while improving video-QA accuracy across multiple benchmarks.
EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
EnvScaler is an automated framework for scaling tool-interactive environments for LLM agents through programmatic synthesis, creating 191 diverse environments and 7K scenarios to improve agent performance on multi-turn, multi-tool interactions.
STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios
This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based computing environments, enabling scalable, state-based evaluation of LLM-powered agents.
ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
ClawGUI is an open-source framework for training, evaluating, and deploying GUI agents using reinforcement learning, featuring standardized benchmarks and cross-platform deployment to Android, iOS, and HarmonyOS.