erp-systems

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

arXiv cs.AI · 2026-05-27

Anchor is a task-generation pipeline that addresses artifact drift in AI agent benchmarks. It derives the instructions, environments, solutions, and verifiers for each task jointly from a single constraint-optimization specification, which keeps the evaluation tasks for enterprise workflows consistent and auditable. The paper also introduces ERP-Bench, a benchmark of 300 long-horizon tasks in a production ERP system. On it, frontier models satisfy the explicit constraints in 26.1% of trials but reach optimal solutions in only 17.4%.
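
The idea of deriving every task artifact from one specification can be illustrated with a toy sketch. This is not the paper's implementation: the `Spec` class, the purchasing problem, and all function names below are hypothetical, chosen only to show how an instruction, a reference solution, and a verifier that separates "feasible" from "optimal" can share a single source of truth.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Spec:
    """Hypothetical single source of truth: a tiny purchasing problem."""
    unit_costs: dict[str, int]   # supplier -> cost per unit
    capacities: dict[str, int]   # supplier -> maximum units available
    demand: int                  # total units that must be ordered


def instruction(spec: Spec) -> str:
    """Render the natural-language task text from the spec."""
    offers = ", ".join(
        f"{name} (cost {spec.unit_costs[name]}, max {spec.capacities[name]})"
        for name in sorted(spec.capacities)
    )
    return f"Order exactly {spec.demand} units at minimum total cost from: {offers}."


def feasible(spec: Spec, order: dict[str, int]) -> bool:
    """Check the explicit constraints: known suppliers, capacity, demand."""
    if set(order) != set(spec.capacities):
        return False
    within_capacity = all(0 <= q <= spec.capacities[n] for n, q in order.items())
    return within_capacity and sum(order.values()) == spec.demand


def cost(spec: Spec, order: dict[str, int]) -> int:
    """Objective value of an order."""
    return sum(spec.unit_costs[n] * q for n, q in order.items())


def solve(spec: Spec) -> dict[str, int]:
    """Reference solution by brute force (fine only for toy sizes)."""
    names = sorted(spec.capacities)
    ranges = [range(spec.capacities[n] + 1) for n in names]
    candidates = (dict(zip(names, qs)) for qs in product(*ranges))
    feasible_orders = [o for o in candidates if feasible(spec, o)]
    if not feasible_orders:
        raise ValueError("spec has no feasible order")
    return min(feasible_orders, key=lambda o: cost(spec, o))


def verify(spec: Spec, order: dict[str, int]) -> dict[str, bool]:
    """Verifier derived from the same spec: feasibility and optimality."""
    ok = feasible(spec, order)
    best = cost(spec, solve(spec))
    return {"feasible": ok, "optimal": ok and cost(spec, order) == best}


spec = Spec(unit_costs={"A": 5, "B": 3}, capacities={"A": 10, "B": 4}, demand=6)
print(instruction(spec))
print(verify(spec, {"A": 6, "B": 0}))  # meets the constraints, but costs more than the best plan
print(verify(spec, solve(spec)))
```

Because the instruction text, the reference solution, and the verifier all read the same `Spec`, changing a cost or capacity updates all three together. The order `{"A": 6, "B": 0}` is feasible but not optimal in this toy, which is the same kind of gap the reported 26.1% and 17.4% figures distinguish.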
