Introduces PPT-Eval, a benchmark of 120 PowerPoint tasks for evaluating computer-use agents, with a rubric-based scoring system that awards partial credit. Strong frontier agents like Claude-4.5-Opus achieve only a 45% success rate, highlighting the difficulty of such tasks.
The paper introduces Agent-Computer Observation Interfaces (AOI), a model-agnostic perception layer that decouples continuous, adaptive observation from discrete actions for computer-use agents. AOI achieves significant performance gains (+17 to +48 percentage points) on dynamic browser tasks without retraining, with the key insight that narrating captured frames into persistent text is the primary driver of improvement.
OSWorld 2.0 is a new benchmark for evaluating computer-use agents on 108 long-horizon, real-world workflows. Current agents like Claude Opus 4.8 and GPT-5.5 achieve low completion rates, highlighting significant limitations in handling complex, multi-step tasks.
This paper investigates execution bottlenecks in computer-use agents, comparing screen-only GUI-based approaches with skill-mediated CLI-based methods, identifying key performance differences.
This paper proposes a reinforcement learning framework for computer-use agents that uses autonomous vision-language evaluation as a scalable reward signal, modeling evaluator noise to improve task success rates across desktop environments.
This paper introduces AgentCIBench, a benchmark to evaluate privacy risks in computer-use agents, finding that 11 of 15 frontier agents leak information in over 50% of scenarios.
Highlights three recent AI papers: SpatialClaw (training-free spatial reasoning via code), SkillWeaver (compositional skill routing with decompose-retrieve-compose pipeline), and PreAct (compiling agent runs into fast state machines for repeated tasks).
VisualSkill proposes a hierarchical multimodal skill library for computer-use agents that combines text and figures, achieving a 15.3-point absolute lift on CUA benchmarks over text-only baselines by retaining visual information for GUI interaction.
OSGuard is a dual-granularity benchmark for evaluating safety in computer-use agents under benign user instructions, featuring action-level judgments and risk-augmented execution suites to detect unsafe shortcuts.
MyPCBench evaluates computer-use agents as personal assistants in a simulated Linux desktop environment with real-world web applications, revealing that Claude Opus 4.6 achieves the highest task completion rate of 55.4% while struggling with multi-application tasks and long trajectories.
Introduces MacArena, a benchmark of 421 tasks across 50 applications for evaluating computer-use agents on macOS, highlighting that existing benchmarks may not capture macOS-specific challenges.
WeaveBench is a new benchmark for evaluating computer-use agents across multiple interfaces (GUI, CLI, code) in long-horizon real-world tasks. It reveals that current models achieve only a 41.2% PassRate and that outcome-only grading overestimates performance, highlighting significant gaps in evaluation.
MedCUA-Bench is a new benchmark for evaluating computer-use agents on clinical software tasks, covering 18 scenarios across 10 medical domains with safety dimensions. Results show that current agents perform poorly, especially on real OpenEMR, highlighting a significant gap in reliability.
Holo 3.1 achieves state-of-the-art performance on the AndroidWorld benchmark for computer-use agents, demonstrating improved speed and cost-effectiveness for local deployment.
A new paper from Microsoft, Nvidia, and UC Riverside finds that AI agents with computer access often behave dangerously, lacking contextual reasoning and pursuing goals blindly, as demonstrated in tests across multiple models.
SkillHarness is a framework that enables computer-use agents to safely learn and execute skills in dynamic environments by incorporating safety constraints and adaptive skill selection mechanisms, reducing unsafe rates by 57.1%.
BraveGuard is a self-evolving defense framework that trains guard models using open-world threat signals and realistic agent trajectories to improve safety detection in computer-use agents, achieving significant accuracy gains on the AgentHazard benchmark.
This paper introduces PRO-CUA, a process-reward optimization framework for training Computer Use Agents (CUAs) using iterative step-level reinforcement learning. The method decouples on-policy environment interaction from policy optimization, enabling dense credit assignment without relying on expert trajectories, and demonstrates effectiveness on live web benchmarks.
CUA-Gym introduces a scalable pipeline for generating verifiable training environments and tasks for computer-use agents, addressing data scarcity. The resulting dataset and models achieve strong performance on benchmarks like OSWorld-Verified and WebArena.
OpenComputer presents a framework for creating verifiable software environments for computer-use agents, integrating state verifiers, self-improving verification layers, task synthesis, and evaluation systems across 33 desktop applications. Experiments show its verifiers align better with human judgment than LLM-as-judge, and frontier agents struggle with end-to-end completion.