Tag
Introduces GUI-RobustEval, a benchmark for error recovery in GUI agents, and Robustness-driven Trajectory Synthesis (RoTS) to generate training data, achieving state-of-the-art results on OSWorld.
This article argues that the main challenge for AI agents in real-world workflows is not understanding the task but recovering from unexpected changes, tracking state, and knowing when to ask for human input.
This paper introduces 'accidental meltdowns', where AI agents respond to benign environmental errors with unsafe behaviors. The authors measure this across multiple agent systems and models, finding that meltdowns occur in 64.7% of rollouts with errors.
This paper introduces ReFlect, a training-free harness system that wraps LLMs with deterministic error detection and recovery logic to improve performance on complex, long-horizon reasoning tasks.
Gecko is a new embeddable C library that delivers GLR parsing for any context-free grammar with automatic syntax-error recovery and YACC-level speed.