OpenComputer: Verifiable Software Worlds for Computer-Use Agents
Summary
OpenComputer presents a framework for creating verifiable software environments for computer-use agents, integrating state verifiers, a self-improving verification layer, task synthesis, and an evaluation system across 33 desktop applications. Experiments show that its verifiers align more closely with human judgment than LLM-as-judge evaluation does, and that frontier agents struggle with end-to-end completion.
Source: https://huggingface.co/papers/2605.19769
Abstract
OpenComputer presents a framework for creating verifiable software environments for computer-use agents through integrated state verification, self-improving layers, task synthesis, and evaluation systems across multiple desktop applications.
We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer’s hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.
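To make the idea of a state verifier with auditable partial credit concrete, here is a minimal Python sketch. It is not OpenComputer's API: the `Check` class, the `partial_credit` function, the state keys, and the weighted-fraction scoring rule are all assumptions introduced for illustration. The sketch assumes the application state has already been read into a plain dictionary, standing in for whatever a structured inspection endpoint might return.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Check:
    """One machine-checkable condition on application state (hypothetical)."""
    name: str
    predicate: Callable[[Dict[str, Any]], bool]
    weight: float = 1.0


def partial_credit(state: Dict[str, Any], checks: List[Check]) -> Dict[str, Any]:
    """Score a state snapshot as the weighted fraction of checks passed.

    The scoring rule is an assumption for this sketch, not the paper's formula.
    """
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("checks must carry a positive total weight")

    audit = []
    earned = 0.0
    for c in checks:
        try:
            passed = bool(c.predicate(state))
        except (KeyError, TypeError, IndexError):
            # A missing or malformed field counts as a failed check.
            passed = False
        audit.append({"check": c.name, "passed": passed, "weight": c.weight})
        if passed:
            earned += c.weight

    return {
        "reward": earned / total,
        "success": all(entry["passed"] for entry in audit),
        "audit": audit,  # per-check record, so the score can be inspected
    }


# Hypothetical state snapshot after an agent edits a document.
state = {"open_file": "report.odt", "font_size": 12, "saved": False}

checks = [
    Check("file_opened", lambda s: s["open_file"] == "report.odt"),
    Check("font_size_14", lambda s: s["font_size"] == 14),
    Check("document_saved", lambda s: s["saved"] is True),
    Check("title_set", lambda s: s["title"] == "Q3 Report"),  # key absent
]

result = partial_credit(state, checks)
print(result["reward"], result["success"])
```

In this example the agent opened the right file but did not finish the task, so it earns a fractional reward while `success` stays false. Keeping the per-check audit list alongside the score is one way a reward could remain inspectable after the fact.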
Similar Articles
Open Computer Use
Open Computer Use is an open-source MCP (Model Context Protocol) for AI agents to control computer interfaces.
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
CUA-Gym introduces a scalable pipeline for generating verifiable training environments and tasks for computer-use agents, addressing data scarcity. The resulting dataset and models achieve strong performance on benchmarks like OSWorld-Verified and WebArena.
Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification
Researchers present an ontology-grounded framework for pre-deployment verification of enterprise AI agents, combining an Agent Operational Envelope, automated scenario generation, and machine-verifiable Trust Certificates with graduated deployment verdicts. A pilot across four regulated industries generated 1,800 scenarios and showed ontology-grounded generation significantly outperformed persona-based baselines on regulatory coverage.
On the Reliability of Computer Use Agents
A preprint analyzing why computer-use agents succeed once but fail on repeated executions, attributing unreliability to execution stochasticity, task ambiguity, and behavioral variability, and advocating repeated evaluation and stable strategies.
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
This academic paper proposes a unified architecture-lifecycle framework for securing computer-use agents (CUAs) as they transition from benchmarks to real-world software environments. It analyzes reliability challenges across the perception, decision, and execution layers and across the creation, deployment, operation, and maintenance stages.