outcome-process-gap

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

arXiv cs.AI · 2026-05-29

This paper introduces OpenClawBench, a large-scale dataset for benchmarking process-side anomalies in real-world AI agent execution trajectories. It shows that task success can mask process failures: 9.33% of executions that pass the oracle still contain anomalies. The dataset also provides structured supervision through a new taxonomy.
