CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Summary
CUA-Gym introduces a scalable pipeline for generating verifiable training environments and tasks for computer-use agents, addressing data scarcity. The resulting dataset and models achieve strong performance on benchmarks like OSWorld-Verified and WebArena.
Cached at: 05/26/26, 02:44 PM
Source: https://huggingface.co/papers/2605.25624
Abstract
An RLVR framework for computer-use agents addresses data scarcity through a scalable generation pipeline and synthetic environments, achieving superior performance on verification and transfer benchmarks.
Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires consistent task instruction, executable environment, and verifiable reward. However, hand-curated benchmarks achieve high reward fidelity but cover few applications, and LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.
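The co-generation loop and final filter described in the abstract can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the authors' implementation: every name here (TaskTuple, co_generate, passes_final_filter, the toy agents) is hypothetical, the acceptance rule (the reward function must score the golden state 1.0 and the initial state 0.0) is an assumed reading of "iterative rounds upon execution", and the filter thresholds are likewise assumed. The agents are stubbed with plain functions in place of LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]  # toy environment state: a flat key/value snapshot


@dataclass
class TaskTuple:
    """Hypothetical container for one (instruction, states, reward) tuple."""
    instruction: str
    initial_state: State
    golden_state: State
    reward_fn: Callable[[State], float]


def co_generate(
    instruction: str,
    generate_states: Callable[[str, str], Tuple[State, State]],
    write_reward: Callable[[str, str], Callable[[State], float]],
    max_rounds: int = 3,
) -> Optional[TaskTuple]:
    """Assumed orchestrator loop: the states and the reward function are
    produced separately, then checked against each other by execution."""
    feedback = ""
    for _ in range(max_rounds):
        initial, golden = generate_states(instruction, feedback)
        reward_fn = write_reward(instruction, feedback)
        # Assumed acceptance rule: golden state passes, initial state fails.
        if reward_fn(golden) == 1.0 and reward_fn(initial) == 0.0:
            return TaskTuple(instruction, initial, golden, reward_fn)
        feedback = "reward function disagreed with the generated states"
    return None  # no consistent tuple within the round budget


def passes_final_filter(votes: List[bool], rollout_rewards: List[float]) -> bool:
    """Assumed final filter: a strict majority of judge votes AND at least
    one agent rollout that earns full reward."""
    majority = sum(votes) * 2 > len(votes)
    solvable = any(r == 1.0 for r in rollout_rewards)
    return majority and solvable


# Toy stand-ins for the Generator and Discriminator agents.
def toy_generator(instruction: str, feedback: str) -> Tuple[State, State]:
    initial = {"cart": "empty"}
    # The first attempt forgets to change the state; it is fixed after feedback.
    golden = {"cart": "notebook"} if feedback else {"cart": "empty"}
    return initial, golden


def toy_discriminator(instruction: str, feedback: str) -> Callable[[State], float]:
    return lambda state: 1.0 if state.get("cart") == "notebook" else 0.0


task = co_generate("Add a notebook to the cart", toy_generator, toy_discriminator)
```

In this toy run the first round is rejected because the golden state does not satisfy the independently written reward function, and the second round is accepted. The point of the sketch is the separation of roles: the reward function is never written by the same component that builds the states it is checked against.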
Datasets citing this paper: 1
xlangai/CUA-Gym
Similar Articles
ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
ShopGym is a framework that converts live e-commerce storefronts into self-contained sandbox shops for realistic, controllable, and reproducible benchmarking of web agents, with synthetic tasks across seven skill categories.
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
MobileGym is a browser-based simulation platform for mobile GUI agent research, featuring deterministic state evaluation and scalable parallel execution. It includes a benchmark of 416 tasks and demonstrates gains using GRPO on Qwen3-VL-4B.
Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents
Hugging Face introduces EcomRLVE-GYM, a framework providing eight verifiable environments for training reinforcement learning agents on complex e-commerce tasks. The tool features adaptive difficulty curricula and algorithmic rewards to improve task completion in shopping assistants, demonstrated by training a Qwen 3 8B model.
OpenComputer: Verifiable Software Worlds for Computer-Use Agents
OpenComputer presents a framework for creating verifiable software environments for computer-use agents, integrating state verifiers, self-improving verification layers, task synthesis, and evaluation systems across 33 desktop applications. Experiments show its verifiers align better with human judgment than LLM-as-judge, and frontier agents struggle with end-to-end completion.
Computer-Using Agent
OpenAI introduced the Computer-Using Agent (CUA), a model combining GPT-4o's vision with reinforcement learning to interact with GUIs like a human, powering the new Operator agent. CUA sets new state-of-the-art results, including 38.1% on OSWorld and 58.1% on WebArena, and is available as a research preview for ChatGPT Pro users in the US.