ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Hugging Face Daily Papers 05/12/26, 12:00 AM Papers

Summary

ToolCUA is a new agent framework that optimizes GUI-tool path selection for computer use agents through staged training and reinforcement learning. It achieves state-of-the-art performance on OSWorld-MCP by effectively interleaving GUI actions and high-level tool calls.

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/
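The abstract describes the Tool-Efficient Path Reward only qualitatively: it "encourages appropriate tool use and shorter execution paths." The paper's actual formula is not given here, so the following is a minimal sketch of one way such a trajectory-level reward could be shaped. The `Step` type, the `path_reward` function, the weights, and the step budget are all hypothetical names and values chosen for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class Step:
    """One action in a hybrid trajectory (hypothetical structure)."""
    kind: Literal["gui", "tool"]  # atomic GUI action or high-level tool call
    name: str                     # e.g. "click", "type", "rename_file"
    ok: bool = True               # whether the action executed without error


def path_reward(
    steps: List[Step],
    success: bool,
    max_steps: int = 30,          # assumed step budget
    length_weight: float = 0.2,   # assumed weight for shorter paths
    tool_weight: float = 0.1,     # assumed weight for successful tool calls
) -> float:
    """Illustrative reward: task success plus bonuses for short paths
    and for tool calls that actually executed correctly."""
    if not success:
        return 0.0  # no shaping bonus without task success

    # Shorter trajectories earn a larger share of the length bonus.
    length_bonus = length_weight * max(0.0, 1.0 - len(steps) / max_steps)

    # Reward the fraction of tool calls that succeeded; GUI-only paths get 0.
    tool_calls = [s for s in steps if s.kind == "tool"]
    if tool_calls:
        tool_bonus = tool_weight * sum(s.ok for s in tool_calls) / len(tool_calls)
    else:
        tool_bonus = 0.0

    return 1.0 + length_bonus + tool_bonus


# A 12-step GUI-only path versus a 3-step path that uses one tool call.
gui_only = [Step("gui", "click") for _ in range(12)]
hybrid = [Step("gui", "click"), Step("tool", "rename_file"), Step("gui", "click")]

print(path_reward(gui_only, success=True))
print(path_reward(hybrid, success=True))
```

Under these assumed weights, the shorter hybrid path scores higher than the longer GUI-only path when both complete the task, which is the behavior the abstract attributes to the reward.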

Original Article


Cached at: 05/13/26, 08:12 AM

Paper page - ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Source: https://huggingface.co/papers/2605.12481

Abstract

ToolCUA is an end-to-end agent that learns optimal GUI-tool path selection through staged training, achieving superior performance in hybrid action space environments.

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/
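The abstract quotes an accuracy and a relative improvement but not the baseline score itself. As a back-of-the-envelope check, the arithmetic below inverts the relative-improvement formula to see roughly what baseline those two figures imply. The result is an inference from rounded numbers ("approximately 66%"), not a value reported in the source, and the variable names are illustrative only.

```python
# Figures quoted in the abstract.
toolcua_accuracy = 46.85  # percent accuracy on OSWorld-MCP
relative_gain = 0.66      # "approximately 66%" relative improvement

# Relative improvement is (new - old) / old, so old = new / (1 + gain).
implied_baseline = toolcua_accuracy / (1 + relative_gain)

# Absolute gap between ToolCUA and the implied baseline, in percentage points.
absolute_gain = toolcua_accuracy - implied_baseline

print(f"implied baseline: about {implied_baseline:.1f}%")
print(f"absolute gain: about {absolute_gain:.1f} percentage points")
```

Because the 66% figure is itself rounded, the implied baseline should be read as a rough estimate only. The abstract does not say whether the 3.9% gain over GUI-only settings is absolute or relative, so no similar inversion is attempted for it.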


Get this paper in your agent:

hf papers read 2605.12481

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

mPLUG/ToolCUA-8B • Image-Text-to-Text • 9B • Updated about 3 hours ago • 78


Similar Articles

PRO-CUA: Process-Reward Optimization for Computer Use Agents

Computer-Using Agent

GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
