OpenComputer: Verifiable Software Worlds for Computer-Use Agents

Hugging Face Daily Papers

Summary

OpenComputer presents a framework for creating verifiable software environments for computer-use agents, integrating state verifiers, self-improving verification layers, task synthesis, and evaluation systems across 33 desktop applications. Experiments show its verifiers align better with human judgment than LLM-as-judge, and frontier agents struggle with end-to-end completion.

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.

Cached at: 05/20/26, 02:35 AM


Source: https://huggingface.co/papers/2605.19769

Abstract

OpenComputer presents a framework for creating verifiable software environments for computer-use agents through integrated state verification, self-improving layers, task synthesis, and evaluation systems across multiple desktop applications.

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer’s hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.
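
To make the idea of a hard-coded state verifier with auditable partial-credit rewards concrete, here is a minimal Python sketch. It is not OpenComputer's implementation: the `Check` class, the `partial_credit` function, the weights, and the spreadsheet-like state snapshot are all hypothetical names and values chosen for illustration. The sketch assumes that an inspection endpoint has already returned the application state as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass
class Check:
    """One machine-checkable condition on the application state (hypothetical)."""
    name: str
    weight: float
    predicate: Callable[[State], bool]


def partial_credit(state: State, checks: List[Check]) -> Dict[str, Any]:
    """Return a weighted reward in [0, 1] plus the per-check results for auditing."""
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("checks must have a positive total weight")
    results: Dict[str, bool] = {}
    for c in checks:
        try:
            results[c.name] = bool(c.predicate(state))
        except (KeyError, TypeError):
            # A missing or malformed field counts as a failed check.
            results[c.name] = False
    earned = sum(c.weight for c in checks if results[c.name])
    return {"reward": earned / total, "checks": results}


# Hypothetical state snapshot, as if read from an inspection endpoint.
state = {"sheet": {"A1": "Total", "B1": 42}, "saved": False}

checks = [
    Check("header_set", 1.0, lambda s: s["sheet"]["A1"] == "Total"),
    Check("value_set", 1.0, lambda s: s["sheet"]["B1"] == 42),
    Check("file_saved", 2.0, lambda s: s["saved"] is True),
]

report = partial_credit(state, checks)
print(report)
```

In this made-up example the agent edited both cells but never saved the file, so it earns 2 of 4 weighted points. The per-check dictionary is what makes the score auditable: a reviewer can see which condition failed instead of trusting a single pass/fail judgment.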


Get this paper in your agent:

hf papers read 2605.19769

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Open Computer Use

Product Hunt

Open Computer Use is an open-source MCP (Model Context Protocol) for AI agents to control computer interfaces.

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

arXiv cs.AI

Researchers present an ontology-grounded framework for pre-deployment verification of enterprise AI agents, combining an Agent Operational Envelope, automated scenario generation, and machine-verifiable Trust Certificates with graduated deployment verdicts. A pilot across four regulated industries generated 1,800 scenarios and showed ontology-grounded generation significantly outperformed persona-based baselines on regulatory coverage.

On the Reliability of Computer Use Agents

Hugging Face Daily Papers

A preprint analyzing why computer-use agents succeed once but fail on repeated executions, attributing unreliability to execution stochasticity, task ambiguity, and behavioral variability, and advocating repeated evaluation and stable strategies.