CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Hugging Face Daily Papers

Summary

CUA-Gym introduces a scalable pipeline for generating verifiable training environments and tasks for computer-use agents, addressing data scarcity. The resulting dataset and models achieve strong performance on benchmarks like OSWorld-Verified and WebArena.

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires a consistent task instruction, an executable environment, and a verifiable reward. Existing sources each fall short: hand-curated benchmarks achieve high reward fidelity but cover few applications, while LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.
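
The co-generation loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`generator`, `discriminator`, `orchestrate`), the dictionary-based environment state, and the acceptance rule (the reward function must accept the golden state and reject the initial state) are all assumptions made for illustration. In the real pipeline each role is an LLM agent acting on a live environment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Toy stand-in for an environment state (hypothetical: filename -> contents).
State = Dict[str, str]
RewardFn = Callable[[State], float]


@dataclass
class TaskTuple:
    instruction: str
    initial_state: State
    golden_state: State
    reward_fn: RewardFn


def generator(instruction: str) -> Tuple[State, State]:
    """Hypothetical Generator: builds the initial and golden states."""
    initial = {"report.txt": "draft"}
    golden = {"report.txt": "final"}
    return initial, golden


def discriminator(instruction: str, round_idx: int) -> RewardFn:
    """Hypothetical Discriminator: writes a reward function from the spec.

    Round 0 is deliberately too loose (it only checks that the file exists),
    to show why checking the reward by execution is useful.
    """
    if round_idx == 0:
        return lambda s: 1.0 if "report.txt" in s else 0.0
    return lambda s: 1.0 if s.get("report.txt") == "final" else 0.0


def orchestrate(instruction: str, max_rounds: int = 3) -> Optional[TaskTuple]:
    """Hypothetical orchestrator: iterate until the reward survives execution."""
    initial, golden = generator(instruction)
    for round_idx in range(max_rounds):
        reward_fn = discriminator(instruction, round_idx)
        # Assumed acceptance rule: accept the golden state, reject the initial one.
        if reward_fn(golden) == 1.0 and reward_fn(initial) == 0.0:
            return TaskTuple(instruction, initial, golden, reward_fn)
    return None  # no consistent tuple found within the round budget


task = orchestrate("Change the contents of report.txt to 'final'.")
```

In this toy run the loose round-0 reward is rejected because it also rewards the untouched initial state, and the stricter round-1 reward is accepted. Keeping the state builder and the reward writer separate means a reward function cannot pass simply by mirroring how the golden state was constructed.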

Cached at: 05/26/26, 02:44 PM

Source: https://huggingface.co/papers/2605.25624

Abstract

An RLVR framework for computer-use agents addresses data scarcity through a scalable generation pipeline and synthetic environments, achieving strong results on OSWorld-Verified and on the held-out WebArena benchmark.

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires a consistent task instruction, an executable environment, and a verifiable reward. Existing sources each fall short: hand-curated benchmarks achieve high reward fidelity but cover few applications, while LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.
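
The final filter, which combines LLM majority voting with agent rollouts, might look roughly like the sketch below. The abstract does not give the actual criteria, so the strict-majority rule, the pass-rate band, and every name and threshold here (`majority_vote`, `rollout_pass_rate`, `keep_tuple`, `min_pass_rate`, `max_pass_rate`) are hypothetical.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """True when strictly more than half of the LLM judges accept the tuple."""
    return sum(votes) * 2 > len(votes)


def rollout_pass_rate(rewards: Sequence[float]) -> float:
    """Fraction of agent rollouts that earned full reward (1.0)."""
    if not rewards:
        return 0.0
    return sum(1 for r in rewards if r >= 1.0) / len(rewards)


def keep_tuple(
    votes: Sequence[bool],
    rewards: Sequence[float],
    min_pass_rate: float = 0.1,  # assumed: at least some rollouts must succeed
    max_pass_rate: float = 1.0,  # assumed: lower this to drop trivial tasks
) -> bool:
    """Keep a tuple only if the judges agree and the rollouts show it is solvable."""
    if not majority_vote(votes):
        return False
    rate = rollout_pass_rate(rewards)
    return min_pass_rate <= rate <= max_pass_rate


# Two of three judges accept, and half of the rollouts solve the task.
kept = keep_tuple([True, True, False], [1.0, 0.0, 0.0, 1.0])
```

The two signals are complementary under these assumptions: voting catches instructions that are ambiguous or inconsistent with the reward, while rollouts catch tasks that no agent can complete in the generated environment.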

Datasets citing this paper: 1

xlangai/CUA-Gym

Similar Articles

Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents

Hugging Face Blog

Hugging Face introduces EcomRLVE-GYM, a framework providing eight verifiable environments for training reinforcement learning agents on complex e-commerce tasks. The tool features adaptive difficulty curricula and algorithmic rewards to improve task completion in shopping assistants, demonstrated by training a Qwen 3 8B model.

OpenComputer: Verifiable Software Worlds for Computer-Use Agents

Hugging Face Daily Papers

OpenComputer presents a framework for creating verifiable software environments for computer-use agents, integrating state verifiers, self-improving verification layers, task synthesis, and evaluation systems across 33 desktop applications. Experiments show its verifiers align better with human judgment than LLM-as-judge, and frontier agents struggle with end-to-end completion.

Computer-Using Agent

OpenAI Blog

OpenAI introduced the Computer-Using Agent (CUA), a model combining GPT-4o's vision with reinforcement learning to interact with GUIs like a human, powering the new Operator agent. CUA sets new state-of-the-art results, including 38.1% on OSWorld and 58.1% on WebArena, and is available as a research preview for ChatGPT Pro users in the US.