Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Hugging Face Daily Papers 05/12/26, 12:00 AM Papers

computer-use-agents benchmark data-synthesis gui-interaction multimodal phi-ground

Summary

This paper introduces CUActSpot, a multimodal benchmark for evaluating computer-use agents, and a renderer-based data synthesis pipeline. The proposed Phi-Ground-Any-4B model outperforms open-source models under 32B parameters.

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git

Original Article


Cached at: 05/14/26, 04:17 AM


Source: https://huggingface.co/papers/2605.12501

Abstract

Computer-use agents face reliability challenges with complex GUI interactions due to data scarcity, addressed through a multi-modal benchmark and synthetic data generation pipeline.

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models’ capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
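The abstract gives only a high-level outline of the synthesis pipeline, so the following is a rough sketch of the general idea rather than the authors' implementation. It assumes a toy table scene, computes cell coordinates instead of actually rendering a screenshot, and uses a fixed instruction template where the paper uses an LLM. All names (`Element`, `layout_table`, `make_sample`) and the action-trace schema are hypothetical.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass
class Element:
    """One on-screen element and its pixel bounding box (x0, y0, x1, y1)."""
    name: str
    box: tuple

    def center(self):
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) // 2, (y0 + y1) // 2)


def layout_table(rows, cols, cell_w=120, cell_h=40, origin=(50, 80)):
    """Hypothetical stand-in for a renderer: lay out a table, record cell boxes.

    A real pipeline would also draw the scene and save a screenshot; only the
    geometry is computed here so the sketch stays dependency-free.
    """
    ox, oy = origin
    elements = []
    for r in range(rows):
        for c in range(cols):
            x0, y0 = ox + c * cell_w, oy + r * cell_h
            box = (x0, y0, x0 + cell_w, y0 + cell_h)
            elements.append(Element(f"cell_r{r}_c{c}", box))
    return elements


def make_sample(elements, rng):
    """Pick two distinct cells and emit an instruction plus a drag trace.

    The templated instruction is an assumption standing in for LLM output.
    """
    src, dst = rng.sample(elements, 2)
    return {
        "instruction": f"Drag from {src.name} to {dst.name}.",
        "actions": [
            {"type": "mouse_down", "xy": src.center()},
            {"type": "mouse_move", "xy": dst.center()},
            {"type": "mouse_up", "xy": dst.center()},
        ],
        "elements": [asdict(e) for e in elements],
    }


if __name__ == "__main__":
    cells = layout_table(rows=2, cols=3)
    sample = make_sample(cells, random.Random(0))
    print(json.dumps(sample["actions"], indent=2))
```

Because the coordinates come from the layout code itself, the ground-truth targets for each action are exact by construction, which is presumably the appeal of a renderer-based approach over labeling real screenshots.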


Get this paper in your agent:

hf papers read 2605.12501

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### microsoft/Phi-Ground-Any Updated 1 day ago • 115 • 13


Similar Articles

ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
