Should AI agent benchmarks separate “safe success” from “unsafe success”?
Summary
This article discusses the concept of 'Verifier Tax' in AI agent benchmarks, distinguishing between safe success (completing tasks without violating constraints) and unsafe success (completing tasks but violating constraints), and questions how to properly measure agent performance considering safety tradeoffs.
Similar Articles
Can an AI agent complete a task and still fail?
This paper introduces the concept of 'Verifier Tax' to categorize AI agent outcomes as safe success, unsafe success, or failure, and proposes a two-tier verification architecture for tool-using LLM agents.
The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]
This paper presents a safety evaluation framework for tool-using LLM agents, introducing the concept of the 'Verifier Tax'—a horizon-dependent tradeoff between safety and task completion. It proposes a two-tier verification architecture and uses Tau-bench scenarios to demonstrate how verification can reduce unsafe successes but also decrease task completion as task horizon increases.
What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents
This paper argues that current benchmarks for autonomous agents fail to evaluate whether an agent should have proceeded at all, introducing a 'compliance bias'. The authors propose a taxonomy of abstention-warranted scenarios and new evaluation protocols (Safety Rate, Usability Rate, Informed Refusal Rate) with preliminary results showing tunable safety–usability tradeoffs across model families.
I built an AI support agent where the main metric is unsafe auto-action rate, not just accuracy
A technical walkthrough of building a telecom customer support agent that prioritizes safety metrics over classifier accuracy, using a deterministic access gate, scoped tool execution, and route-level evaluation.
What your agent's green test suite actually proves
This article argues that standard test suites with fixed inputs and expected outputs are insufficient for AI agents due to infinite input spaces and non-deterministic behavior, advocating for property-based testing instead.