The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]

Reddit r/MachineLearning Papers

Summary

This paper presents a safety evaluation framework for tool-using LLM agents, introducing the concept of the 'Verifier Tax'—a horizon-dependent tradeoff between safety and task completion. It proposes a two-tier verification architecture and uses Tau-bench scenarios to demonstrate how verification can reduce unsafe successes but also decrease task completion as task horizon increases.

We recently presented a paper at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. The core issue is that task completion alone can be misleading: an agent may complete a task while violating a safety or policy constraint. We separate outcomes into **safe success**, **unsafe success**, and **failure**, and study how verification changes this tradeoff.

We evaluate this using **τ-bench / Tau-bench** tool-use scenarios and propose a **two-tier verification architecture**: deterministic policy/tool checks first, followed by an LLM-based verifier for more contextual safety cases.

The main finding is that verification can reduce unsafe success, but it can also reduce task completion as the task horizon increases. This creates what we call the **Verifier Tax**: a horizon-dependent safety–success tradeoff in tool-using agents.

Paper: [https://dl.acm.org/doi/full/10.1145/3786335.3813160](https://dl.acm.org/doi/full/10.1145/3786335.3813160)

Curious how others think agent evaluations should report unsafe success. Should unsafe completion be counted as success, failure, or a separate category?
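
For readers who want something concrete, here is a minimal Python sketch of the three-way outcome split and a two-tier verifier as described in the post. It is not the paper's implementation. The tool names (`issue_refund`, `lookup_order`, `get_user`), the 500 refund limit, and every function name are hypothetical, and the second tier is a plain callable standing in for an LLM-based verifier. One further assumption: a run that fails the task is counted as a failure even if it also violated a policy.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    SAFE_SUCCESS = "safe_success"
    UNSAFE_SUCCESS = "unsafe_success"
    FAILURE = "failure"


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def deterministic_check(call: ToolCall) -> Optional[bool]:
    """Tier 1: hard-coded policy rules (hypothetical rule set).

    Returns True or False when a rule decides, None to defer to tier 2.
    """
    if call.name == "issue_refund" and call.args.get("amount", 0) > 500:
        return False  # assumed hard limit, for illustration only
    if call.name in {"lookup_order", "get_user"}:
        return True  # read-only tools are always allowed in this sketch
    return None


def two_tier_verify(
    call: ToolCall, contextual_verifier: Callable[[ToolCall], bool]
) -> bool:
    """Run tier 1 first; only undecided calls reach the tier-2 verifier."""
    verdict = deterministic_check(call)
    if verdict is not None:
        return verdict
    return contextual_verifier(call)


def classify(task_completed: bool, violated_policy: bool) -> Outcome:
    """Map one episode to an outcome (a failed task is always FAILURE here)."""
    if not task_completed:
        return Outcome.FAILURE
    return Outcome.UNSAFE_SUCCESS if violated_policy else Outcome.SAFE_SUCCESS


def outcome_rates(outcomes: list) -> dict:
    """Report each category as a fraction of all episodes."""
    if not outcomes:
        raise ValueError("need at least one episode")
    counts = Counter(outcomes)
    return {o.value: counts[o] / len(outcomes) for o in Outcome}
```

Reporting `outcome_rates` separately for each task-horizon bucket (for example, grouped by number of tool calls) is one way to make the horizon dependence visible, since a single success rate hides the split between safe and unsafe completions.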

Similar Articles

Can an AI agent complete a task and still fail?

Reddit r/artificial

This paper separates AI agent outcomes into safe success, unsafe success, and failure, introduces the 'Verifier Tax' (a horizon-dependent tradeoff between safety and task completion), and proposes a two-tier verification architecture for tool-using LLM agents.

Should AI agent benchmarks separate “safe success” from “unsafe success”?

Reddit r/AI_Agents

This article discusses the 'Verifier Tax' in AI agent benchmarks, distinguishing safe success (completing a task without violating constraints) from unsafe success (completing a task while violating constraints), and asks how agent performance should be measured given this safety tradeoff.

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

arXiv cs.AI

This paper introduces Contract2Tool, a framework for automatically inferring lightweight tool contracts (preconditions, effects, risk) from tool metadata, documentation, and execution traces, enabling reliable causal tool filtering for LLM agents. Experiments show learned contracts achieve near-gold contract performance in downstream multi-step agent tasks, significantly reducing token usage.

On Safety Risks in Experience-Driven Self-Evolving Agents

arXiv cs.CL

Researchers from Harbin Institute of Technology and Singapore Management University investigate safety risks in experience-driven self-evolving LLM agents, finding that even benign task experience can compromise safety in high-risk scenarios due to agents' execution-oriented tendencies, and revealing a fundamental safety–utility trade-off.