Tag
Anchor is a task-generation pipeline that addresses artifact drift in AI agent benchmarks by jointly producing instructions, environments, solutions, and verifiers from a single constraint optimization specification, yielding consistent and auditable evaluation tasks for enterprise workflows. The paper introduces ERP-Bench, a benchmark of 300 long-horizon tasks in a production ERP system, showing that frontier models satisfy explicit constraints in 26.1% of trials but reach optimal solutions in only 17.4%.
TASTE is an automated method for generating challenging agent benchmarks with broader tool-use coverage by evolving tool sequences through adaptive contrastive n-gram modeling and iterative difficulty refinement. On the resulting τ^c-Bench, models that nearly saturate existing benchmarks suffer severe performance drops, suggesting that their high scores reflect benchmark saturation rather than robust skill.
SynAE is a framework for evaluating the quality of synthetic data used in tool-calling agent evaluations, assessing validity, fidelity, and diversity across multiple axes. It addresses challenges of insufficient or sensitive real data by providing metrics to guide synthetic data generation.
This paper introduces temporal semantic caching and MCP workflow optimizations for agentic plan-execute pipelines, achieving up to 30.6x speedup on cache hits and 1.67x overall speedup on the AssetOpsBench industrial benchmark.
This position paper argues that interactive AI evaluation should be treated as a design science paradigm, proposing a two-axis taxonomy and reporting standards for assessing dynamic system behavior through trajectories.