Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Hugging Face Daily Papers

Summary

This paper introduces CUActSpot, a multimodal benchmark for evaluating computer-use agents, and a renderer-based data synthesis pipeline. The proposed Phi-Ground-Any-4B model outperforms open-source models under 32B parameters.

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
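To make the contrast with click-centric benchmarks concrete, the following is a minimal sketch of how click and drag predictions could be represented and checked against target regions. The summary does not state CUActSpot's scoring rule, so the `Click` and `Drag` types, the `inside` helper, and the "point must land in the target box" criterion are all illustrative assumptions, not the benchmark's actual protocol.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def inside(point: Point, box: Box) -> bool:
    """Return True if the point lies within the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


@dataclass(frozen=True)
class Click:
    point: Point


@dataclass(frozen=True)
class Drag:
    start: Point
    end: Point


def click_hit(pred: Click, target: Box) -> bool:
    # Assumed rule: a click is correct if it lands in the target region.
    return inside(pred.point, target)


def drag_hit(pred: Drag, start_box: Box, end_box: Box) -> bool:
    # Assumed rule: a drag is correct only if BOTH endpoints land in
    # their respective regions, which is stricter than a single click.
    return inside(pred.start, start_box) and inside(pred.end, end_box)
```

Under this assumed rule, a drag has two ways to fail where a click has one, which is one plausible reason multi-point actions are harder to ground than single clicks.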

Cached at: 05/14/26, 04:17 AM


Source: https://huggingface.co/papers/2605.12501

Abstract

Computer-use agents face reliability challenges with complex GUI interactions due to data scarcity, addressed through a multi-modal benchmark and synthetic data generation pipeline.

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models’ capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
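The renderer-based pipeline described above can be sketched in miniature. This is only an illustration under stated assumptions: the `Element` class, `render_table_scene`, and `synthesize_drag_sample` are hypothetical names, the "scene" is a plain grid of table cells rather than a real rendered screenshot, and a template string stands in for the LLM-written instruction. The idea it shows is that a renderer which places every element itself already knows exact coordinates, so action traces can be labeled without manual annotation.

```python
import random
from dataclasses import dataclass


@dataclass
class Element:
    name: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width
    h: int  # height

    def center(self):
        return (self.x + self.w // 2, self.y + self.h // 2)


def render_table_scene(rows, cols, cell_w=80, cell_h=24, seed=0):
    """Lay out a grid of cells at a random offset.

    A real renderer would also save a screenshot; here we keep only
    the element coordinates that the renderer knows by construction.
    """
    rng = random.Random(seed)
    ox, oy = rng.randint(0, 100), rng.randint(0, 100)
    return [
        Element(f"cell_r{r}_c{c}", ox + c * cell_w, oy + r * cell_h, cell_w, cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


def synthesize_drag_sample(elements, rng):
    """Pick two distinct elements and emit an instruction plus action trace."""
    src, dst = rng.sample(elements, 2)
    # Placeholder: in the paper's pipeline an LLM writes the instruction.
    instruction = f"Drag from {src.name} to {dst.name}"
    action = {"type": "drag", "start": src.center(), "end": dst.center()}
    return {"instruction": instruction, "action": action}


scene = render_table_scene(rows=2, cols=3, seed=1)
sample = synthesize_drag_sample(scene, random.Random(1))
print(sample["instruction"], sample["action"])
```

Seeding both the layout and the sampling keeps generated samples reproducible, which helps when regenerating a corpus after changing the renderer.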


Get this paper in your agent:

hf papers read 2605.12501

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### microsoft/Phi-Ground-Any Updated 1 day ago • 115 • 13

