Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Summary
This paper introduces CUActSpot, a multimodal benchmark for evaluating computer-use agents on complex interactions, and a renderer-based data synthesis pipeline. After training on the synthesized corpus, the proposed Phi-Ground-Any-4B model outperforms open-source models with fewer than 32B parameters.
Source: https://huggingface.co/papers/2605.12501
Abstract
Computer-use agents remain unreliable on complex GUI interactions, a problem attributed to data scarcity; this work addresses it with a multimodal benchmark and a synthetic data generation pipeline.
Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark, CUActSpot, for evaluating models’ capabilities on complex interactions across five modalities (GUI, text, table, canvas, and natural image) and a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
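The pipeline described in the abstract (generate a scene, record element coordinates, attach an instruction and an action trace) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Element`, `generate_table_scene`, `make_drag_sample`) is hypothetical, pixel rendering and screenshot capture are omitted, and a fixed text template stands in for the LLM that writes instructions in the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Element:
    """A scene element with a known bounding box (hypothetical structure)."""
    name: str
    box: tuple  # (left, top, right, bottom) in pixels

    def center(self):
        left, top, right, bottom = self.box
        return ((left + right) // 2, (top + bottom) // 2)


def generate_table_scene(rows, cols, cell_w=120, cell_h=40, seed=0):
    """Lay out a grid of table cells and record each cell's bounding box."""
    rng = random.Random(seed)
    # Random offset so the table does not always sit at the origin.
    x0, y0 = rng.randint(0, 200), rng.randint(0, 200)
    elements = []
    for r in range(rows):
        for c in range(cols):
            left, top = x0 + c * cell_w, y0 + r * cell_h
            box = (left, top, left + cell_w, top + cell_h)
            elements.append(Element(f"cell_r{r}_c{c}", box))
    return elements


def make_drag_sample(elements, rng):
    """Pair a templated instruction with a drag action trace.

    Assumption: a template replaces the LLM-written instruction.
    """
    src, dst = rng.sample(elements, 2)
    instruction = f"Drag from {src.name} to {dst.name}."
    trace = [
        {"action": "mouse_down", "at": src.center()},
        {"action": "mouse_move", "at": dst.center()},
        {"action": "mouse_up", "at": dst.center()},
    ]
    return {"instruction": instruction, "trace": trace}


scene = generate_table_scene(rows=3, cols=4, seed=7)
sample = make_drag_sample(scene, random.Random(7))
```

Because the coordinates come from the layout code rather than from human annotation, every target in the trace is exact by construction, which is the property that makes renderer-based synthesis attractive for rare actions such as drag.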
Models citing this paper: microsoft/Phi-Ground-Any
Similar Articles
ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes
ShapeCodeBench is a synthetic benchmark for perception-to-program reconstruction where models generate executable drawing programs from raster images, evaluated on metrics like exact match and pixel accuracy. The benchmark is designed to be renewable via seeded RNG, and current models still achieve low exact match rates, indicating room for improvement.
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
Collider-Bench is a new benchmark that evaluates LLM agents on reproducing particle physics analyses from the Large Hadron Collider using only public papers and open software, requiring physical reasoning to fill missing implementation details.
3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code
This paper introduces 3DCodeBench, a benchmark for evaluating vision-language models on procedural 3D modeling via code, and 3DCodeArena, a ranking platform based on pairwise human preferences.
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
CUA-Gym introduces a scalable pipeline for generating verifiable training environments and tasks for computer-use agents, addressing data scarcity. The resulting dataset and models achieve strong performance on benchmarks like OSWorld-Verified and WebArena.
Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
Stargazer introduces a scalable benchmark environment with 120 astrophysics tasks to evaluate AI agents on physics-grounded model-fitting of radial-velocity data, revealing gaps between statistical optimization and physical constraint adherence.