computer-use-agents

#computer-use-agents

PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks

arXiv cs.LG · 4d ago

Introduces PPT-Eval, a benchmark of 120 PowerPoint tasks for evaluating computer-use agents, with a rubric-based scoring system that awards partial credit. Strong frontier agents like Claude-4.5-Opus achieve only a 45% success rate, highlighting the difficulty of such tasks.


Agent-Computer Observation Interfaces Enable Dynamic Computer Use

arXiv cs.AI · 5d ago

The paper introduces Agent-Computer Observation Interfaces (AOI), a model-agnostic perception layer that decouples continuous, adaptive observation from discrete actions for computer-use agents. AOI achieves significant performance gains (+17 to +48 percentage points) on dynamic browser tasks without retraining, with the key insight that narrating captured frames into persistent text is the primary driver of improvement.


OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Hugging Face Daily Papers · 2026-06-28

OSWorld 2.0 is a new benchmark for evaluating computer-use agents on 108 long-horizon, real-world workflows. Current agents like Claude Opus 4.8 and GPT-5.5 achieve low completion rates, highlighting significant limitations in handling complex, multi-step tasks.


GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

arXiv cs.AI · 2026-06-24

This paper investigates execution bottlenecks in computer-use agents, comparing screen-only GUI-based approaches with skill-mediated CLI-based methods, identifying key performance differences.


Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation

arXiv cs.AI · 2026-06-24

This paper proposes a reinforcement learning framework for computer-use agents that uses autonomous vision-language evaluation as a scalable reward signal, modeling evaluator noise to improve task success rates across desktop environments.


Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?

Hugging Face Daily Papers · 2026-06-22

This paper introduces AgentCIBench, a benchmark to evaluate privacy risks in computer-use agents, finding that 11 of 15 frontier agents leak information in over 50% of scenarios.


@dair_ai: https://x.com/dair_ai/status/2068724104815890889

X AI KOLs Following · 2026-06-21

Highlights three recent AI papers: SpatialClaw (training-free spatial reasoning via code), SkillWeaver (compositional skill routing with decompose-retrieve-compose pipeline), and PreAct (compiling agent runs into fast state machines for repeated tasks).


VISUALSKILL: Multimodal Skills for Computer-Use Agents

arXiv cs.CL · 2026-06-18

VisualSkill proposes a hierarchical multimodal skill library for computer-use agents that combines text and figures. By retaining visual information for GUI interaction, it achieves a 15.3-point absolute lift on CUA benchmarks over text-only baselines.


OSGuard: A Benchmark for Safety in Computer-Use Agents

arXiv cs.AI · 2026-06-16

OSGuard is a dual-granularity benchmark for evaluating safety in computer-use agents under benign user instructions, featuring action-level judgments and risk-augmented execution suites to detect unsafe shortcuts.


MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

Hugging Face Daily Papers · 2026-06-15

MyPCBench evaluates computer-use agents as personal assistants in a simulated Linux desktop environment with real-world web applications, revealing that Claude Opus 4.6 achieves the highest task completion rate of 55.4% while struggling with multi-application tasks and long trajectories.


MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

arXiv cs.LG · 2026-06-08

Introduces MacArena, a benchmark of 421 tasks across 50 applications for evaluating computer-use agents on macOS, highlighting that existing benchmarks may not capture macOS-specific challenges.


WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Hugging Face Daily Papers · 2026-06-08

WeaveBench is a new benchmark for evaluating computer-use agents across multiple interfaces (GUI, CLI, code) in long-horizon real-world tasks. It reveals that current models achieve only 41.2% PassRate and that outcome-only grading overestimates performance, highlighting significant gaps in evaluation.


MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

arXiv cs.AI · 2026-06-03

MedCUA-Bench is a new benchmark for evaluating computer-use agents on clinical software tasks, covering 18 scenarios across 10 medical domains with safety dimensions. Results show that current agents perform poorly, especially on real OpenEMR, highlighting a significant gap in reliability.


@NielsRogge: Holo 3.1 reaches a new SOTA on AndroidWorld, a popular benchmark for computer-use agents. Can be explored here https://paper…

X AI KOLs Following · 2026-06-02

Holo 3.1 achieves state-of-the-art performance on the AndroidWorld benchmark for computer-use agents, demonstrating improved speed and cost-effectiveness for local deployment.


Nvidia and Microsoft Researchers Say AI Agents Don't Care About Safety or Reliability

Reddit r/artificial · 2026-06-02

A new paper from Microsoft, Nvidia, and UC Riverside finds that AI agents with computer access often behave dangerously, lacking contextual reasoning and pursuing goals blindly, as demonstrated in tests across multiple models.


SkillHarness: Harnessing Safe Skills for Computer-Use Agents

Hugging Face Daily Papers · 2026-06-02

SkillHarness is a framework that enables computer-use agents to safely learn and execute skills in dynamic environments by incorporating safety constraints and adaptive skill selection mechanisms, reducing unsafe rates by 57.1%.


BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Hugging Face Daily Papers · 2026-06-02

BraveGuard is a self-evolving defense framework that trains guard models using open-world threat signals and realistic agent trajectories to improve safety detection in computer-use agents, achieving significant accuracy gains on the AgentHazard benchmark.


PRO-CUA: Process-Reward Optimization for Computer Use Agents

arXiv cs.AI · 2026-05-29

This paper introduces PRO-CUA, a process-reward optimization framework for training Computer Use Agents (CUAs) using iterative step-level reinforcement learning. The method decouples on-policy environment interaction from policy optimization, enabling dense credit assignment without relying on expert trajectories, and demonstrates effectiveness on live web benchmarks.


CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Hugging Face Daily Papers · 2026-05-25

CUA-Gym introduces a scalable pipeline for generating verifiable training environments and tasks for computer-use agents, addressing data scarcity. The resulting dataset and models achieve strong performance on benchmarks like OSWorld-Verified and WebArena.


OpenComputer: Verifiable Software Worlds for Computer-Use Agents

Hugging Face Daily Papers · 2026-05-19

OpenComputer presents a framework for creating verifiable software environments for computer-use agents, integrating state verifiers, self-improving verification layers, task synthesis, and evaluation systems across 33 desktop applications. Experiments show its verifiers align better with human judgment than LLM-as-judge, and frontier agents struggle with end-to-end completion.
