Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

arXiv cs.AI 05/27/26, 04:00 AM Papers
Summary
Anchor is a task-generation pipeline that addresses artifact drift in AI agent benchmarks by jointly producing instructions, environments, solutions, and verifiers from a single constraint optimization specification, yielding consistent and auditable evaluation tasks for enterprise workflows. The paper introduces ERP-Bench, a benchmark of 300 long-horizon tasks in a production ERP system, showing that frontier models satisfy explicit constraints in 26.1% of trials but reach optimal solutions in only 17.4%.
arXiv:2605.26321v1 Announce Type: new
# Anchor: Mitigating Artifact Drift in Agent Benchmark Generation
Source: [https://arxiv.org/html/2605.26321](https://arxiv.org/html/2605.26321)
\(2026\)

###### Abstract\.

AI agents are beginning to complete valuable, long\-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale\. Environment and task creation frequently suffers from a failure mode we call*artifact drift*: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward\-hackable, or inconsistent\. We introduce Anchor, a task\-generation pipeline that formalizes domain experts’ specifications of business workflows into constraint optimization programs\. From a single parametric specification, the pipeline jointly produces a natural\-language instruction, environment configuration, solver\-certified ground\-truth solution, and state\-based verifier\. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness\-agnostic environments whose rewards depend solely on end\-state business correctness\. We apply Anchor to produce ERP\-Bench: a benchmark of 300 long\-horizon tasks spanning procurement and manufacturing workflows in a production\-grade ERP system\. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26\.1% of trials but reach a fully optimal solution in only 17\.4% of trials\. Overall, we show that Anchor and ERP\-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work\. We release the task generator and ERP\-Bench dataset at[erpbench\.ai](https://erpbench.ai/)\.

agent benchmarks, verifiable rewards, enterprise workflows, constraint optimization, ERP systems

Conference: RLEval: Methods and Reinforcement Learning Environments for Evaluating AI Agents, Workshop at the ACM Conference on AI and Agentic Systems \(ACM CAIS 2026\); May 26, 2026; San Jose, CA, USA\.

CCS concepts: Computing methodologies → Artificial intelligence; Computing methodologies → Planning and scheduling; Software and its engineering → Software testing and debugging\.

## 1\. Introduction

![Diagram of the four task artifacts and the four failure modes that arise when they disagree\.](https://arxiv.org/html/2605.26321v1/figures/artifact_drift.png)

Figure 1\. Artifact drift\. Inconsistencies between a task’s four artifacts \(instruction $I$, environment $E$, oracle solution $x^{*}$, and verifier $V$\) each invalidate the intended task $\tau$ in a different way\.

![Anchor single\-source task creation pipeline\.](https://arxiv.org/html/2605.26321v1/Fig1_v3.png)

Figure 2\. Anchor single\-source task creation pipeline\. A solved constraint satisfiability problem specification generates the instruction, environment setup, oracle solution, and terminal\-state verifier for each task instance\.

Recent surveys document substantial gaps between benchmark scores and the production performance of language\-model agents on enterprise tasks \(Pan and others, [2025](https://arxiv.org/html/2605.26321#bib.bib21); Mehta, [2025](https://arxiv.org/html/2605.26321#bib.bib16); Yehudai and others, [2025](https://arxiv.org/html/2605.26321#bib.bib30)\)\. Audits trace much of this gap to construction errors in the benchmarks themselves\. Do\-nothing agents pass 38% of $\tau$\-bench airline tasks \(Yao et al\., [2025](https://arxiv.org/html/2605.26321#bib.bib29)\) because the verifier accepts empty responses \(Zhu and others, [2025](https://arxiv.org/html/2605.26321#bib.bib34)\), and strengthening unit tests in SWE\-bench \(Jimenez et al\., [2024](https://arxiv.org/html/2605.26321#bib.bib13)\) reranks 40\.9% of SWE\-bench Lite leaderboard positions \(Aleithan and others, [2025](https://arxiv.org/html/2605.26321#bib.bib1); Yu and others, [2025](https://arxiv.org/html/2605.26321#bib.bib31)\)\. Benchmark authors face an inherent tension between realism, verifiability, and scale\.
Expert\-authored benchmarks improve realism but require costly curation \(Xu and others, [2025](https://arxiv.org/html/2605.26321#bib.bib28)\); synthetic generators scale task creation but ship noisy or single\-path graders \(Xie and others, [2026](https://arxiv.org/html/2605.26321#bib.bib27); Saxena and others, [2025](https://arxiv.org/html/2605.26321#bib.bib24)\); and $\tau^{2}$\-bench authors observe that earlier benchmarks pushed instructions toward “carefully crafted … to help ensure a single, solvable path” \(Barres and others, [2025](https://arxiv.org/html/2605.26321#bib.bib4)\)\. Most benchmarks still author each task’s four artifacts \(instruction, environment, oracle solution, and verifier\) in parallel and validate consistency post hoc through audit\. This creates a four\-way consistency failure we call *artifact drift* \([fig\.1](https://arxiv.org/html/2605.26321#S1.F1)\): loosely coupled processes end up describing subtly different tasks, such as when the environment lacks data the instruction assumes, the oracle depends on state the environment omits, or the verifier accepts outcomes the instruction did not request\.

We address this with Anchor, a task\-generation pipeline that compiles all four artifacts from a single solved specification\. A domain expert, with an engineer, formalizes a business workflow as a parametric constraint program in OR\-Tools CP\-SAT \(Perron and Didier, [2025](https://arxiv.org/html/2605.26321#bib.bib23)\) with decision variables, business\-rule constraints, and an objective metric\. For a proposed parameter setting, the CP\-SAT solver either rejects the parameters as infeasible or certifies an optimal solution, which then compiles deterministically into the task artifacts\. Because all four artifacts are deterministic projections of the same solved specification, the dataset mitigates artifact drift by construction\. The same parameters that define a task also tune its difficulty, so the pipeline could support a verifiable training\-data curriculum for agent training\. We apply Anchor to produce ERP\-Bench, 300 procurement and manufacturing tasks across 29 workflow patterns in Odoo 19, a production\-grade open\-source ERP system\. We evaluate five frontier models across coding, browser, and computer\-use harnesses, for a total of 18,000 trials\. Pass@5 falls monotonically from easy to hard tiers in every harness, dropping from 70\.5% to 22\.3% in coding, 46\.5% to 7\.7% in browser, and 56\.0% to 9\.5% in computer\-use, with zero instances of reward hacking during evaluation\.

Our key contributions are:

- Anchor: a task creation pipeline that compiles instruction, environment, oracle, and verifier from one solved constraint program specification\.
- ERP\-Bench: a 300\-task verifiable benchmark of long\-horizon procurement and manufacturing workflows in a production\-grade ERP, with controlled difficulty and certified optima\.
- Evaluation: a controlled comparison of frontier proprietary and open\-weight models across coding, browser, and computer\-use harnesses on ERP\-Bench tasks\.

## 2\. Anchor

Our task creation pipeline follows from prior work in large language models and mathematical reasoning\. AlphaProof and related systems treat an informal mathematical statement as the start of a pipeline that translates it into a formal Lean program, where a deterministic checker grades any candidate proof and where synthetic variants of the formalized statement become the curriculum for reinforcement learning\(Hubert and others,[2026](https://arxiv.org/html/2605.26321#bib.bib12); AlphaProof and AlphaGeometry Teams,[2024](https://arxiv.org/html/2605.26321#bib.bib2)\)\. We similarly aim to translate informal business workflows into checkable programs and generate synthetic variants to address the data scarcity and fidelity problem\. Unlike AlphaProof and other autoformalization work, we undertake the formalization step manually, and the generated variants are translated back into informal scenarios for agent training and evaluation\.

Many enterprise workflows operate on structured data in systems of record, follow explicit business rules, and optimize measurable outcomes, which makes them naturally expressible as constraint satisfiability and optimization problems\. The Anchor pipeline \([fig\.2](https://arxiv.org/html/2605.26321#S1.F2)\) starts when a domain expert and an engineer formalize a workflow such as invoice prioritization, deal qualification, or production scheduling as a parametric constraint program\(Perron and Didier,[2025](https://arxiv.org/html/2605.26321#bib.bib23)\)with decision variables, business\-rule constraints, and an objective function\. This constraint program becomes the core of the task generation engine\.

Given a parameter setting $\theta$, the solver either rejects the sample as infeasible or certifies an optimal solution $x^{*}_{\theta}$\. We call the resulting parameters, constraints, objective, and certified solution the solved specification $S_{\theta}$\. Four translation layers then compile $S_{\theta}$ into the task artifacts shown in [fig\.2](https://arxiv.org/html/2605.26321#S1.F2): an *instruction generator* renders parameters, constraints, and objective as natural language; a *setup generator* writes the sampled initial records into the environment container; an *oracle generator* writes the solver’s solution as the reference terminal state; and a *verifier generator* grades terminal states against the program’s constraints and objective\. Because the four artifacts are deterministic projections of $S_{\theta}$, the inconsistencies of [fig\.1](https://arxiv.org/html/2605.26321#S1.F1) are mitigated by construction\. Because the solver certifies an optimal objective value, verification remains tractable without restricting the task to a single action path\.
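As a minimal sketch of the "deterministic projections" idea, the toy example below derives all four artifacts from one object. Everything in it is an assumption made for illustration: the `SolvedSpec` fields, the function names, and the one-product purchasing scenario are hypothetical and are not taken from the released generator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SolvedSpec:
    """Hypothetical stand-in for a solved specification S_theta."""
    demand: int            # units the customer ordered
    on_hand: int           # units already in stock
    unit_price: int        # vendor price per unit
    optimal_purchase: int  # purchase quantity certified by the solver
    optimal_spend: int     # objective value certified by the solver


def render_instruction(spec: SolvedSpec) -> str:
    """Instruction generator: parameters and objective as natural language."""
    return (f"Fulfill an order of {spec.demand} units. {spec.on_hand} units are "
            f"in stock; buy the rest at {spec.unit_price} per unit, minimizing spend.")


def render_setup(spec: SolvedSpec) -> dict:
    """Setup generator: initial records to seed into the environment."""
    return {"stock": spec.on_hand, "order_qty": spec.demand, "price": spec.unit_price}


def render_oracle(spec: SolvedSpec) -> dict:
    """Oracle generator: the certified solution as a reference terminal state."""
    return {"purchased": spec.optimal_purchase}


def make_verifier(spec: SolvedSpec):
    """Verifier generator: grades a terminal state against the same spec."""
    def verify(final_state: dict) -> dict:
        purchased = final_state.get("purchased", 0)
        feasible = spec.on_hand + purchased >= spec.demand
        spend = purchased * spec.unit_price
        return {"feasible": feasible,
                "optimal": feasible and spend == spec.optimal_spend}
    return verify


spec = SolvedSpec(demand=10, on_hand=4, unit_price=3,
                  optimal_purchase=6, optimal_spend=18)
verify = make_verifier(spec)
print(render_instruction(spec))
print(verify(render_oracle(spec)))  # oracle replay
print(verify({}))                   # terminal state of an agent that did nothing
```

Because every function reads the same `spec`, changing a parameter changes the instruction, the seeded state, the oracle, and the grading rule together, which is the property the paragraph above relies on.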

Anchor does not eliminate every construction error: the constraint program can encode incomplete business logic and a renderer can mistranslate a correct specification\. Five end\-to\-end checks surface residual defects \([appendixI](https://arxiv.org/html/2605.26321#A9)\): a no\-op agent should score zero on every task, oracle replay should receive full credit, an LLM judge cross\-reads artifacts against the CP\-SAT program, a reward\-hacking canary flags rollouts that beat the solver objective without tripping verifier rules, and domain experts spot\-check tasks by hand\.
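Three of these checks are mechanical enough to sketch in a few lines. The code below is an illustration under stated assumptions, not the paper's implementation: the function names, the scalar reward convention (0.0 to 1.0), and the two toy verifiers are all hypothetical, and the canary assumes a minimization objective.

```python
def sanity_defects(verify, noop_state, oracle_state):
    """Return defect labels for the no-op and oracle-replay checks."""
    defects = []
    if verify(noop_state) != 0.0:
        defects.append("no-op agent earns nonzero reward")
    if verify(oracle_state) != 1.0:
        defects.append("oracle replay does not earn full credit")
    return defects


def canary_flags(rollout_objective, certified_optimum, verifier_passed):
    """Flag a passing rollout that is strictly cheaper than the certified optimum.

    Under a minimization objective this should be impossible, so a flag
    points at a gap in the verifier or in the constraint program.
    """
    return verifier_passed and rollout_objective < certified_optimum


def good_verifier(state):
    # Toy rule: full credit only for the certified purchase quantity.
    return 1.0 if state.get("purchased", 0) == 6 else 0.0


def lax_verifier(state):
    # Deliberately broken: accepts any terminal state, including an empty one.
    return 1.0


print(sanity_defects(good_verifier, {}, {"purchased": 6}))
print(sanity_defects(lax_verifier, {}, {"purchased": 6}))
print(canary_flags(rollout_objective=15, certified_optimum=18, verifier_passed=True))
```

The lax verifier is the kind of defect the no-op check exists to catch: it passes oracle replay but also rewards an agent that does nothing.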

## 3\. ERP\-Bench

We apply Anchor to create ERP\-Bench: 300 long\-horizon procurement and manufacturing tasks across 29 patterns in Odoo 19, an open\-source ERP system\(Odoo S\.A\.,[2026](https://arxiv.org/html/2605.26321#bib.bib20)\)\. Procurement and manufacturing back\-office work is economically consequential: manufacturing contributed $2\.91T to US GDP in 2024 across roughly 12\.6M workers and 239,000 firms\(U\.S\. Bureau of Economic Analysis,[2025](https://arxiv.org/html/2605.26321#bib.bib5); National Association of Manufacturers,[2025](https://arxiv.org/html/2605.26321#bib.bib18)\), and purchasing roles account for 58,700 projected annual openings through 2034\(U\.S\. Bureau of Labor Statistics,[2025](https://arxiv.org/html/2605.26321#bib.bib6)\)\. Mistakes in these workflows directly affect spend, fulfillment, capacity, invoicing, and auditability rather than only surface task completion\. Each task runs in its own container against a fresh database seeded with the customers, vendors, inventory, bills of materials, and workcenters the scenario requires\. Agents touch the same persistent records a back\-office user would, including sales orders, purchase orders, manufacturing orders, vendor pricelists, and invoices, through the JSON\-2 API or the standard Odoo web client\. For example, a task may ask the agent to fulfill four customer sales orders due within a week when the starting warehouse inventory does not cover them: it must place purchase orders against tiered vendor pricelists that respect minimum\-order quantities and lead times, schedule the manufacturing orders that assemble the finished goods from purchased components, link the resulting records back to each sales order while minimizing total spend, and send invoices to the customers with correct payment terms\. 
The 29 workflow patterns were grounded in roughly 40 person\-hours of consultation and review with 10 freelance ERP practitioners\. The pipeline then samples each pattern into many task instances, so expert effort is incurred per pattern rather than per instance\. The per\-pattern roster appears in [appendix B](https://arxiv.org/html/2605.26321#A2), the Harbor task specification in [appendix C](https://arxiv.org/html/2605.26321#A3), and generator details in [appendix D](https://arxiv.org/html/2605.26321#A4)\.

The verifier combines three dimensions weighted 25/60/15%: *constraint satisfaction* runs discrete checks for demand coverage, deadlines, sourcing rules, manufacturing feasibility, and invoicing; *optimality* compares the realized objective to the certified optimum with exponential decay for suboptimal plans; and *traceability* grades the audit linkage between POs, MOs, invoices, and the sales orders they serve\. The constraint dimension gates the others, and a small set of structural prerequisites act as hard zeros \([appendix E](https://arxiv.org/html/2605.26321#A5)\)\. ERP\-Bench tasks are therefore both verifiable and open\-ended: the CP\-SAT solver certifies an exact optimal objective value and an assignment that achieves it, while agents may reach many valid terminal states through many action sequences\.
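The scoring rule can be sketched as follows. Only the 25/60/15 weights, the gating role of the constraint dimension, the hard zeros, and the use of exponential decay come from the text; the decay rate, the relative-gap form, and the choice to award partial constraint credit are assumptions made here for illustration, and all names are hypothetical.

```python
import math

# Weights stated in the text: constraints / optimality / traceability.
WEIGHTS = {"constraints": 0.25, "optimality": 0.60, "traceability": 0.15}


def optimality_score(realized, optimum, decay=5.0):
    """Assumed decay form for a minimization objective: exp(-decay * relative gap)."""
    if optimum <= 0:
        raise ValueError("this sketch assumes a strictly positive optimum")
    gap = max(0.0, (realized - optimum) / optimum)
    return math.exp(-decay * gap)


def task_reward(prereqs_ok, checks, realized, optimum, traceability):
    """Combine the three dimensions into one scalar reward in [0, 1].

    prereqs_ok   -- structural prerequisites; failing them is a hard zero
    checks       -- list of booleans, one per discrete constraint check
    traceability -- fraction of audit links that are correct, in [0, 1]
    """
    if not prereqs_ok:
        return 0.0
    reward = WEIGHTS["constraints"] * (sum(checks) / len(checks))
    if all(checks):  # assumed gating: other dimensions count only when feasible
        reward += WEIGHTS["optimality"] * optimality_score(realized, optimum)
        reward += WEIGHTS["traceability"] * traceability
    return reward


print(task_reward(True, [True, True], realized=100, optimum=100, traceability=1.0))
print(task_reward(True, [True, False], realized=100, optimum=100, traceability=1.0))
print(task_reward(False, [True, True], realized=100, optimum=100, traceability=1.0))
```

Under these assumptions a plan that breaks one of two constraints earns only part of the 25% constraint weight, no matter how cheap or well linked it is.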

Difficulty is controlled by parameter groups that compose into easy, medium, and hard recipes\. Demand\-side parameters scale the number of customers, the size of each order, and how soon each order is due\. Supply\-side parameters shrink on\-hand stock, tighten vendor capacity, and reduce slack between supply and demand\. Sourcing parameters progressively unlock more of the ERP surface, layering in single\-stage and multi\-stage manufacturing, workcenter capacity, and broken initial states that the agent must diagnose and repair\. Higher difficulty also enriches the objective, moving from feasibility\-only or simple spend objectives toward vendor consolidation, capacity preservation, and plan repair\. [Appendices B](https://arxiv.org/html/2605.26321#A2) and [D](https://arxiv.org/html/2605.26321#A4) summarize the task taxonomy and generation design\.

## 4\. Evaluation

We evaluate five frontier models across three harnesses with five trials per agent\-task pair, for 18,000 scheduled trials, all sharing one verifier on identical containerized instances\. We build the harnesses on the minimal, open\-source pi\-mono agent toolkit \(Zechner, [2026](https://arxiv.org/html/2605.26321#bib.bib32)\) so the testbed reflects a real\-world agent scaffold rather than a benchmark\-specific wrapper used only for evaluation\. The coding harness uses shell and filesystem tools, driving Odoo through the JSON\-2 API\. The browser harness extends pi with a Playwright tool that drives the standard Odoo web client through a11y\-resolved actions\. The computer\-use harness drives an Xvfb\-backed Chromium through pixel\-coordinate clicks, keystrokes, and screenshots, with no DOM access\. We evaluate two proprietary models \(GPT\-5\.5, Claude Opus 4\.7\) and three open\-weight models \(GLM\-5\.1, GLM\-5V\-Turbo, Kimi K2\.5\)\. GLM\-5\.1 is swapped for GLM\-5V\-Turbo on the computer\-use harness because the former does not natively support vision input \(see [appendix F](https://arxiv.org/html/2605.26321#A6) for harness and model details\)\.
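The 18,000 figure is consistent with four models running on each harness rather than five, since GLM\-5\.1 and GLM\-5V\-Turbo occupy the same slot. This is an inference from the numbers in this section, not a breakdown stated in the text; the short check below spells out the arithmetic under that assumption.

```python
tasks = 300
harnesses = ("coding", "browser", "computer-use")
models_per_harness = 4   # assumption: the two GLM variants share one slot
trials_per_pair = 5

scheduled_trials = tasks * len(harnesses) * models_per_harness * trials_per_pair
print(scheduled_trials)
```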

![2x2 grid of per\-model pass@5 bar charts across Easy, Medium, and Hard difficulty bands, with three harness bars \(Coding, Browser, Computer\) per difficulty and 95% Wilson confidence intervals\.](https://arxiv.org/html/2605.26321v1/figures/fig2_v2.png)

Figure 3\. pass@5 by model, harness, and generated difficulty tier \(95% Wilson CIs\)\. Generated difficulty tiers correlate with realized performance in every evaluated model and harness\.

![Task parameters predict realized difficulty\.](https://arxiv.org/html/2605.26321v1/figures/image3.png)

Figure 4\. Task parameters correlate with realized difficulty\.

![Evaluation reveals a feasibility\-optimality gap\.](https://arxiv.org/html/2605.26321v1/figures/image5.png)

Figure 5\. Evaluation reveals a feasibility\-optimality gap\. Agents satisfy constraints more often than they reach fully optimal solutions\.

The parametric difficulty intervention is the first empirical claim the methodology enables\. Across the 300\-task release, the declared easy, medium, and hard bands collapse to the same monotone signal in every harness\. Aggregate pass@5 falls from 70\.5% to 22\.3% for coding agents, from 46\.5% to 7\.7% for browser agents, and from 56\.0% to 9\.5% for computer\-use agents between easy and hard tiers \([fig\.3](https://arxiv.org/html/2605.26321#S4.F3)\)\. The strongest negative Spearman correlations against task\-level pass@1 come from scale and structure variables such as workcenters, manufacturing orders, solver variables and constraints, BOM component lines, purchase orders, and distinct vendors, while maximum BOM depth and on\-hand\-stock\-to\-demand ratio sit near zero because the sampler clamps depth tightly and equalizes stock pressure across difficulty buckets \([fig\.4](https://arxiv.org/html/2605.26321#S4.F4)\)\.
Task parameters also correlate positively with the number of steps the agent has to take to resolve a task \([fig\.6](https://arxiv.org/html/2605.26321#A7.F6)\)\.
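For readers who want to reproduce numbers of this kind, the sketch below computes pass@5 and a 95% Wilson score interval from raw trial outcomes. It assumes that pass@5 counts a task as passed when at least one of its five trials passes, which is what the standard unbiased pass@k estimator reduces to when k equals the number of trials. The function names and sample data are illustrative only.

```python
import math


def pass_at_k_any(trials_by_task):
    """Count tasks with at least one passing trial.

    trials_by_task -- list of per-task lists of booleans (one entry per trial)
    Returns (tasks_passed, tasks_total).
    """
    passed = sum(1 for trials in trials_by_task if any(trials))
    return passed, len(trials_by_task)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Made-up outcomes for four tasks with five trials each.
outcomes = [
    [False, False, True, False, False],
    [False, False, False, False, False],
    [True, True, True, False, True],
    [False, False, False, False, False],
]
passed, total = pass_at_k_any(outcomes)
low, high = wilson_interval(passed, total)
print(f"pass@5 = {passed}/{total}, 95% Wilson CI = [{low:.3f}, {high:.3f}]")
```

The Wilson interval stays inside the unit interval even when the observed rate is near 0 or 1, which matters for hard\-tier cells with very few passes.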

The feasibility\-optimality gap is the second empirical claim \([fig\.5](https://arxiv.org/html/2605.26321#S4.F5)\)\. Across all evaluated models and harnesses, agents satisfy every explicit task constraint in 26\.1% of trials but reach a fully optimal solution in only 17\.4%, with the strongest evaluated setting, GPT\-5\.5 in the coding harness, showing a 23\.4\-point gap\. Score loss with difficulty comes from broken constraints rather than worse tradeoffs among feasible plans \([fig\.7](https://arxiv.org/html/2605.26321#A7.F7)\), so frontier agents remain brittle at business\-rule adherence rather than merely choosing more expensive valid plans\.

The construction claim holds end\-to\-end \([appendixI](https://arxiv.org/html/2605.26321#A9)\): the no\-op agent scores zero on every task, oracle replay receives full credit on every task, and no rollout reaches a feasible state strictly better than the certified optimum\.

The evaluated models appear more capable and cost\-efficient in the coding harness than in browser or computer\-use harnesses, even though ERP\-Bench is not a coding benchmark and humans complete these tasks through the Odoo GUI\. On identical task instances, GUI harnesses resolve 16–56 percentage points fewer tasks than coding harnesses while costing 3\.1–14\.3$\times$ more per attempted task across the two proprietary models evaluated\. [Appendices G](https://arxiv.org/html/2605.26321#A7) and [H](https://arxiv.org/html/2605.26321#A8) report further results, cost drivers, and failure modes\.

## 5\. Limitations

The main cost of our task construction is paid upfront in formalization: a domain expert specifies the workflow, an engineer writes a CP\-SAT program modelling the entities, constraints, objective, and ERP actions, and a translation layer keeps artifacts aligned\. Task validity still depends on the specification\. A constraint program can encode an incomplete business rule, a renderer can express the right rule unclearly, and an ERP can expose defaults the formal model did not intend to make relevant\. The benchmark covers the part of enterprise work that is naturally visible in terminal system state\. Workflows that depend on tacit managerial judgment, negotiation, free\-text persuasion, or outcomes outside the system of record would not fit this framework well\. Generated instructions are also more explicit than many real workplace requests\. ERP\-Bench prioritizes feasible, reproducible, and exactly gradable tasks, so released instructions expose the constraints and objectives needed for a fair terminal\-state reward\.

## 6\. Conclusion

ERP\-Bench and Anchor show that using constraint program solvers for agent task generation can be effective in enterprise knowledge work domains\. The constructed tasks are realistic, well\-specified, and solvable by both humans and agents\. Extending Anchor to additional workflows \(e\.g\. sales deal qualification, workforce management, patient intake\) and systems of record \(e\.g\. CRM, HRIS, EHR\) for agent evaluation and training is a natural next step\.

## References

- D\. Aleithan et al\. \(2025\) SWE\-Bench\+: enhanced coding benchmark for LLMs\. External Links: 2410\.06992, [Link](https://arxiv.org/abs/2410.06992) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.
- AlphaProof and AlphaGeometry Teams \(2024\) AI achieves silver\-medal standard solving international mathematical olympiad problems\. Google DeepMind\. External Links: [Link](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/) Cited by: [§2](https://arxiv.org/html/2605.26321#S2.p1.1)\.
- V\. Barres et al\. \(2025\) $\tau^{2}$\-Bench: evaluating conversational agents in a dual\-control environment\. External Links: 2506\.07982, [Link](https://arxiv.org/abs/2506.07982) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.
- Harbor Framework Team \(2026\) Task structure\. External Links: [Link](https://www.harborframework.com/docs/tasks) Cited by: [Appendix C](https://arxiv.org/html/2605.26321#A3.p1.1)\.
- T\. Hubert et al\. \(2026\) Olympiad\-level formal mathematical reasoning with reinforcement learning\. Nature 651, pp\. 607–613\. External Links: [Document](https://dx.doi.org/10.1038/s41586-025-09833-y), [Link](https://www.nature.com/articles/s41586-025-09833-y) Cited by: [§2](https://arxiv.org/html/2605.26321#S2.p1.1)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2024\) SWE\-bench: can language models resolve real\-world GitHub issues?\. In International Conference on Learning Representations, External Links: 2310\.06770, [Link](https://openreview.net/forum?id=8y2YPzvJaG) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.
- N\. Mehta \(2025\) Beyond accuracy: a multi\-dimensional framework for evaluating enterprise agentic AI systems\. External Links: 2511\.14136, [Link](https://arxiv.org/abs/2511.14136) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.
- National Association of Manufacturers \(2025\) Facts about manufacturing\. External Links: [Link](https://nam.org/mfgdata/facts-about-manufacturing-expanded/) Cited by: [§3](https://arxiv.org/html/2605.26321#S3.p1.1)\.
- Odoo S\.A\. \(2026\) Odoo 19\.0 documentation\. External Links: [Link](https://www.odoo.com/documentation/19.0/) Cited by: [§3](https://arxiv.org/html/2605.26321#S3.p1.1)\.
- A\. Pan et al\. \(2025\) Measuring agents in production\. External Links: 2512\.04123, [Link](https://arxiv.org/abs/2512.04123) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.
- L\. Perron and F\. Didier \(2025\) OR\-Tools CP\-SAT v9\.12\. External Links: [Link](https://developers.google.com/optimization/cp/cp_solver) Cited by: [§D\.1](https://arxiv.org/html/2605.26321#A4.SS1.p2.2), [§1](https://arxiv.org/html/2605.26321#S1.p2.1), [§2](https://arxiv.org/html/2605.26321#S2.p2.1)\.
- A\. Saxena et al\. \(2025\) Continuous benchmark generation for evaluating enterprise\-scale LLM agents\. External Links: 2511\.10049, [Link](https://arxiv.org/abs/2511.10049) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.
- U\.S\. Bureau of Economic Analysis \(2025\) Gross domestic product, 4th quarter and year 2024, third estimate, GDP by industry, and corporate profits\. External Links: [Link](https://www.bea.gov/news/2025/gross-domestic-product-4th-quarter-and-year-2024-third-estimate-gdp-industry-and) Cited by: [§3](https://arxiv.org/html/2605.26321#S3.p1.1)\.
- U\.S\. Bureau of Labor Statistics \(2025\) Purchasing managers, buyers, and purchasing agents\. External Links: [Link](https://www.bls.gov/ooh/business-and-financial/purchasing-managers-buyers-and-purchasing-agents.htm) Cited by: [§3](https://arxiv.org/html/2605.26321#S3.p1.1)\.
- J\. Xie et al\. \(2026\) AgentSynth: scalable task generation for generalist computer\-use agents\. External Links: 2506\.14205, [Link](https://arxiv.org/abs/2506.14205) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.
- F\. F\. Xu et al\. \(2025\) TheAgentCompany: benchmarking LLM agents on consequential real world tasks\. External Links: 2412\.14161, [Link](https://arxiv.org/abs/2412.14161) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.
- S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan \(2025\) $\tau$\-bench: a benchmark for tool\-agent\-user interaction in real\-world domains\. In International Conference on Learning Representations, External Links: 2406\.12045, [Link](https://openreview.net/forum?id=roNSXZpUDN) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.
- A\. Yehudai et al\. \(2025\) Survey on evaluation of LLM\-based agents\. External Links: 2503\.16416, [Link](https://arxiv.org/abs/2503.16416) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.
- C\. Yu et al\. \(2025\) UTBoost: rigorous evaluation of coding agents on SWE\-Bench\. External Links: 2506\.09289, [Link](https://arxiv.org/abs/2506.09289) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.
- M\. Zechner \(2026\) Pi monorepo\. External Links: [Link](https://github.com/badlogic/pi-mono) Cited by: [§4](https://arxiv.org/html/2605.26321#S4.p1.1)\.
- M\. Zhu et al\. \(2025\) Establishing best practices for building rigorous agentic benchmarks\. External Links: 2507\.02825, [Link](https://arxiv.org/abs/2507.02825) Cited by: [§1](https://arxiv.org/html/2605.26321#S1.p1.2)\.

## Appendix A Glossary

ERP\-Bench borrows operational terminology from enterprise systems\. [Table 1](https://arxiv.org/html/2605.26321#A1.T1) collects the domain terms that recur in the paper and the appendix\.

Table 1\. Domain terms used throughout the paper\. Definitions point at the meaning the term carries in Odoo and in ERP\-Bench rather than the broader supply\-chain literature\.

## Appendix B Task Taxonomy

Each ERP\-Bench task is an ordinary back\-office problem inside Odoo\. Customers need products by certain dates; the company has some stock, some suppliers, and sometimes a factory that can build the product or its parts\. The agent has to decide what to buy, what to make, which customer requests to keep, and which records to create or fix\. A task is solved only when the final ERP state covers the right demand, respects the business rules, and, when applicable, uses the best plan according to the stated objective\.

ERP\-Bench contains 300 such tasks across 29 workflow patterns generated under fixed seeds\. The patterns group into three practical situations: buying goods from suppliers \([table2](https://arxiv.org/html/2605.26321#A2.T2)\), repairing a plan after something breaks \([table3](https://arxiv.org/html/2605.26321#A2.T3)\), and combining buying with in\-house production \([table4](https://arxiv.org/html/2605.26321#A2.T4)\)\. Stable per\-task names include the scenario number, difficulty, and generator pattern, so per\-pattern result tables are reproducible from directory names\.

Objective mix\. Some tasks only ask for a valid plan; most also ask for a good plan\. In 183 tasks, good means spending the least on new supply\. In 34 tasks, it means using fewer vendors; in 31, preserving workcenter capacity; and in 28 repair\-plan tasks, changing an existing plan as little as possible\. The remaining 24 grade feasibility only\. Vendor consolidation, capacity preservation, and repair tasks also use spend as a secondary tie\-breaker \([appendix E](https://arxiv.org/html/2605.26321#A5)\)\.

Seeded ERP state scale\. Each database contains the task\-relevant records plus unrelated customers, vendors, products, and business documents for realism\. Easy tasks seed roughly 44 customers, 44 vendors, 41 products, and 8 BOMs without any workcenters; medium tasks seed 59, 71, 56, and 11 of these, respectively, plus 1\.8 workcenters on average; hard tasks seed 68, 69, 56, and 12, plus 3\.3 workcenters\. Across the corpus, 290 of 300 tasks include BOMs in the seeded state, 218 include workcenters, 105 include pre\-existing sales orders, 28 include pre\-existing purchase orders, and 20 include pre\-existing manufacturing orders\. Agents must find the records that matter and leave adjacent records untouched\.

Product domains and routes\. The same planning problem appears in different business contexts, giving agents varied names, catalogs, and surrounding records\. Tasks span 12 industrial product categories, most often power equipment \(53\), industrial equipment \(41\), office systems \(34\), material handling \(34\), building systems \(33\), safety systems \(22\), and commercial fixtures \(21\)\. Routes determine the available actions: 82 tasks are buy\-only, 207 allow both buying and manufacturing, and 11 require manufacturing only\.
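As a quick consistency check, the objective and route counts reported in this appendix each partition the 300\-task corpus. The dictionary keys below are shorthand labels chosen for this sketch, not identifiers from the dataset.

```python
objective_mix = {
    "minimum_spend": 183,
    "vendor_consolidation": 34,
    "capacity_preservation": 31,
    "plan_repair": 28,
    "feasibility_only": 24,
}
route_mix = {"buy_only": 82, "buy_and_manufacture": 207, "manufacture_only": 11}

print(sum(objective_mix.values()), sum(route_mix.values()))
```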

Table 2\. Buy\-and\-intake patterns\.

Table 3\. Disruption\-recovery patterns\.

Table 4\. Production\-and\-fulfillment patterns\.

## Appendix C Task Technical Specification

ERP\-Bench releases each instance as a Harbor task directory\. Harbor tasks package an instruction, task metadata, environment files, an optional solution, and tests that emit a verifier reward file \(Harbor Framework Team, [2026](https://arxiv.org/html/2605.26321#bib.bib11)\)\. ERP\-Bench follows that contract\.

Execution semantics\. A task starts by loading the sampled company state into Odoo\. The agent then creates, updates, confirms, or cancels ERP records through its assigned harness\. Evaluation reads the final Odoo database, not the trajectory\. The same verifier grades agent runs, oracle replay, and no\-op replay, and writes reward artifacts such as the scalar reward, optimality accounting, spend summary, and per\-rule results\.

## Appendix D Generator and CP\-SAT Specification

### D\.1\. Workflow Formalization

Ten freelance ERP practitioners with five or more years of production experience contributed roughly forty person\-hours of elicitation and review across the 29 workflow patterns\. An engineer then encoded each pattern as a parametric CP\-SAT program\. For `09_single_bom_lowest_cost`, the integer decision variables are:

$$
\begin{array}{@{}ll@{}}
q_{vp} \geq 0 & \text{units purchased on offer } v,p, \\
b_{vp} \in \{0,1\} & \text{indicator that offer } v,p \text{ is used}, \\
a_{w} \geq 0 & \text{units assembled on route } w, \\
s_{i} \geq 0 & \text{stock allocated to order } i.
\end{array}
$$

Constraints enforce per\-order demand coverage from stock, purchases, and assembly; MOQ\-and\-capacity tiers $L_{vp} b_{vp} \leq q_{vp} \leq U_{vp} b_{vp}$; BOM feasibility so each assembled unit draws on\-hand or purchased components; workcenter capacity; and on\-time delivery against per\-order deadlines\. The objective for this pattern is minimum new spend across confirmed purchases and assembly\. Other patterns reuse the same feasibility region: vendor consolidation minimizes used offers, capacity preservation minimizes scheduled workcenter minutes, repair tasks minimize distance from a seeded baseline plan, and feasibility\-only tasks omit the optimization phase\.
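To make the role of the indicator $b_{vp}$ concrete, the following minimal sketch enumerates a toy buy-only instance and keeps the cheapest feasible plan. It is an illustration under stated assumptions, not the paper's CP-SAT program: brute-force enumeration stands in for the solver, there is a single product and a single order, and the function name `cheapest_plan`, the offer fields `L`, `U`, and `price`, and all numbers are hypothetical.

```python
from itertools import product

def cheapest_plan(demand, stock, offers):
    """Brute-force stand-in for the solver on a toy single-product instance.

    Each offer is a dict with hypothetical fields: L (minimum order quantity),
    U (vendor capacity), and price (unit price). Returns (spend, quantities).
    """
    best = None
    for q in product(*(range(o["U"] + 1) for o in offers)):
        # b is the on/off indicator: 1 when the offer is used, else 0.
        b = [int(qi > 0) for qi in q]
        # MOQ-and-capacity tier: L*b <= q <= U*b for every offer.
        if any(not (o["L"] * bi <= qi <= o["U"] * bi)
               for o, qi, bi in zip(offers, q, b)):
            continue
        # Demand coverage from on-hand stock plus purchases.
        if stock + sum(q) < demand:
            continue
        spend = sum(o["price"] * qi for o, qi in zip(offers, q))
        if best is None or spend < best[0]:
            best = (spend, q)
    return best

# Hypothetical instance: 10 units demanded, 2 in stock, two vendor offers.
offers = [
    {"L": 5, "U": 6, "price": 3},   # cheap but capacity-limited
    {"L": 4, "U": 20, "price": 4},  # pricier, with a minimum order of 4
]
print(cheapest_plan(demand=10, stock=2, offers=offers))  # (31, (5, 4))
```

In this toy instance the cheapest plan over-buys by one unit, because the second offer's minimum order quantity rules out topping up the first offer with only two or three units.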

CP\-SAT \(Perron and Didier, [2025](https://arxiv.org/html/2605.26321#bib.bib23)\) fits this construction because it is a portfolio solver that pairs SAT\-style search with linear\-programming relaxation\. The $b_{vp}$ channeling, BOM disjunctions, and cumulative workcenter capacity each have native primitives in CP\-SAT, where a generic MIP formulation would need big\-M expansions for the on\-off conditions or an $O(n \cdot T)$ time\-indexed expansion for the cumulative constraint\. CP\-SAT also returns a certificate of optimality, which the verifier uses downstream to grade realized terminal states against a known optimum\.

### D\.2\. Difficulty Recipes

Difficulty is controlled by recipe parameters that scale demand, deadlines, stock, vendor capacity, manufacturing depth, workcenter capacity, intake screening, invoicing, and repair disruptions\. Authored configurations vary sourcing structure, capacity regime, order\-acceptance policy, objective type, and industrial domain\. [Table 5](https://arxiv.org/html/2605.26321#A4.T5) reports representative ranges for each tier\.

Within those ranges, the sampler draws concrete instances from a small set of distributions: discrete uniforms for customer counts, deadlines, demand quantities, vendor counts, lead times, and MOQ tiers; continuous uniforms for stock ratios, vendor\-capacity ratios, required margins, and budget multipliers; Gaussian noise around catalog list and standard prices for per\-customer prices and per\-vendor costs; and categorical draws for product domain and vendor\-category mix\. A seeded `numpy` generator makes every draw reproducible\.
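The four distribution families can be sketched with one seeded generator. This is an illustration only: it uses the standard library's `random.Random` in place of the `numpy` generator the paper names, and the function `sample_instance`, the field names, and every range and noise scale below are hypothetical rather than values from the released recipes.

```python
import random

def sample_instance(seed: int) -> dict:
    """Draw one illustrative parameter set from a seeded generator.

    All ranges and field names are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    list_price = 100.0  # hypothetical catalog list price
    return {
        # Discrete uniform: counts, deadlines, quantities, lead times.
        "n_customers": rng.randint(2, 5),
        "lead_time_days": rng.randint(1, 10),
        # Continuous uniform: ratios, margins, budget multipliers.
        "stock_ratio": rng.uniform(0.0, 0.5),
        # Gaussian noise around the list price, floored to stay positive.
        "vendor_cost": max(0.01, rng.gauss(list_price, 5.0)),
        # Categorical draw for the product domain.
        "domain": rng.choice(["power equipment", "office systems"]),
    }

# The same seed reproduces the same instance, draw for draw.
print(sample_instance(7) == sample_instance(7))  # True
```

Because every draw comes from one generator in a fixed order, adding or reordering a draw changes all later values for a given seed, so a sampler of this shape needs a stable draw order to keep released instances reproducible.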

This parameter diversity is the lever for solution diversity\. Different sampled vendor prices, MOQ tiers, and capacities push the CP\-SAT optimum onto different vendor subsets\. Different stock ratios shift which orders are covered from stock, purchases, or assembly\. Different component prices flip individual products between make and buy, and different deadline draws change which supply paths remain on time\. Two tasks generated from the same authored pattern can therefore force qualitatively different decisions despite sharing a constraint family and an objective shape\.

Table 5\. Representative generator recipes by difficulty\. Ranges are sampled uniformly unless the scenario fixes a value\. Stock ratio and vendor capacity ratio are relative to total sampled demand\.

Recipes compose along independent axes: a medium task can stay buy\-only with intake screening and invoicing, or add manufacturing; a hard task can combine tight supply with multi\-stage BOMs, shared or qualified workcenters, or a seeded disruption\.

### D\.3\. Sampling, Rejection, and Determinism

Anchor generated 300 accepted tasks after rejecting 732 sampled parameter sets, in 656\.76 s of wall time\. Pre\-solver discards \(275\) catch cheap arithmetic failures such as demand exceeding total available supply; solver discards \(457\) catch parameterizations CP\-SAT proves infeasible after a 5\-second attempt, optionally retried at 15 seconds\. Among successful solves, a multi\-phase sequence first finds a feasible plan, optimizes the primary objective, optionally optimizes a spend secondary, and then runs a fixed lexicographic search that selects one stable plan among equal optima so that the instruction, environment, oracle, and verifier all refer to the same plan\. [Table 6](https://arxiv.org/html/2605.26321#A4.T6) reports the per\-phase call counts; the 59 `UNKNOWN` statuses sit in tie\-break phases and leave the certified primary solution intact\.
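The two rejection stages can be sketched as a loop that filters cheaply before calling the solver. This is a minimal sketch under assumptions: `solve` is a hypothetical callable that returns a plan or `None`, `toy_solve` is a stand-in rather than a CP-SAT call, the parameter ranges are invented, and the retry is applied unconditionally here although the paper describes it as optional.

```python
import random
from typing import Callable, Optional

def presolve_ok(params: dict) -> bool:
    """Cheap arithmetic filter: demand must not exceed total available supply."""
    return params["demand"] <= params["stock"] + params["vendor_capacity"]

def sample_params(rng: random.Random) -> dict:
    # Hypothetical ranges, chosen only so that some samples get rejected.
    return {
        "demand": rng.randint(10, 100),
        "stock": rng.randint(0, 40),
        "vendor_capacity": rng.randint(0, 80),
    }

def generate(n_tasks: int, seed: int,
             solve: Callable[[dict, float], Optional[dict]]):
    """Sample until n_tasks parameter sets are accepted, counting discards."""
    rng = random.Random(seed)
    accepted, discards = [], {"pre_solver": 0, "solver": 0}
    while len(accepted) < n_tasks:
        params = sample_params(rng)
        if not presolve_ok(params):
            discards["pre_solver"] += 1
            continue
        plan = solve(params, 5.0)        # first attempt, 5-second budget
        if plan is None:
            plan = solve(params, 15.0)   # retry with a 15-second budget
        if plan is None:
            discards["solver"] += 1
            continue
        accepted.append((params, plan))
    return accepted, discards

def toy_solve(params: dict, time_limit_s: float) -> Optional[dict]:
    """Stand-in for the solver: buy exactly the demand not covered by stock."""
    return {"purchase": max(0, params["demand"] - params["stock"])}

tasks, discards = generate(n_tasks=5, seed=0, solve=toy_solve)
print(len(tasks), discards["solver"])  # 5 0
```

Running the arithmetic filter first keeps solver time for samples that could plausibly be feasible, which matches the split the paper reports between pre-solver and solver discards.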

Table 6\. Solver status accounting across generation phases\.

### D\.4\. Solver Performance by Difficulty

[Table 7](https://arxiv.org/html/2605.26321#A4.T7) reports cumulative solve time per accepted task, including failed samples, retry solver calls, and tie\-breaker solves\. Average solve time rises from 0\.044 seconds on easy tasks to 0\.951 seconds on medium tasks and 5\.856 seconds on hard tasks\. The increase tracks CP\-SAT model size, from 19\.8 variables on easy tasks to 78\.6 on medium tasks and 96\.7 on hard tasks, as recipes turn on manufacturing, workcenter capacity, screening, and repair constraints\. The P95 and maximum columns show why a short initial budget is not enough for every sample: hard tasks include a long tail of difficult parameter settings, which the generator resolves by retrying or resampling before release\.

Table 7\. Solver performance and CP\-SAT model size by difficulty tier\. Solve\-time columns report cumulative wall time per accepted task, including failed and retry solver calls and the tie\-breaker solve\.

## Appendix E Verifier and Reward Details

### E\.1\. Verifier Protocol and Reward Formula

The verifier is a terminal\-state coprocess\. A shell script waits for the seeded Odoo to come up, then feeds named check invocations over stdin to a Python checker that returns PASS, FAIL, or NA on stdout\. NA marks rules that do not apply to the task, so per\-task rule density reflects only checks that scored the run\. The same script grades agent rollouts, oracle replay, and the no\-op baseline, and writes per\-rule, spend, optimality, and aggregate\-reward artifacts under `/logs/verifier` for downstream auditing\.
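A minimal sketch of the PASS/FAIL/NA convention follows. It is not the released checker: the registry, the state dictionary, and the logic inside each check are invented for illustration (only the rule names echo ones the paper mentions), and returning `None` when no check applies is an assumption of this sketch.

```python
from typing import Callable, Dict, Iterable, List, Optional

# A check returns True (PASS), False (FAIL), or None (NA: rule does not apply).
Check = Callable[[dict], Optional[bool]]

def run_checks(names: Iterable[str], registry: Dict[str, Check],
               state: dict) -> List[str]:
    """Evaluate named checks against a terminal state, one verdict per name."""
    verdicts = []
    for name in names:
        result = registry[name](state)
        verdicts.append("NA" if result is None else ("PASS" if result else "FAIL"))
    return verdicts

def percent_passed(verdicts: Iterable[str]) -> Optional[float]:
    """Percent of applicable checks passed; NA verdicts stay out of the denominator."""
    scored = [v for v in verdicts if v != "NA"]
    if not scored:
        return None  # assumption: no applicable checks yields no score
    return 100.0 * sum(v == "PASS" for v in scored) / len(scored)

# Hypothetical check logic over a hypothetical terminal-state snapshot.
registry: Dict[str, Check] = {
    "so_confirmed": lambda s: all(o["state"] == "sale" for o in s["sales_orders"]),
    "mo_schedule_compliance": lambda s: (
        None if not s["manufacturing_orders"]
        else all(m["on_time"] for m in s["manufacturing_orders"])
    ),
}
state = {"sales_orders": [{"state": "sale"}], "manufacturing_orders": []}

verdicts = run_checks(["so_confirmed", "mo_schedule_compliance"], registry, state)
print(verdicts, percent_passed(verdicts))  # ['PASS', 'NA'] 100.0
```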

Let $c$ and $t$ be the per\-dimension percent of applicable checks passed for the constraint and traceability dimensions, and $o$ the optimality score \([section E\.3](https://arxiv.org/html/2605.26321#A5.SS3)\)\. The final reward is

$$
R = \begin{cases}
0 & \text{if a hard-zero gate fires,} \\
0.25\,c & \text{else if } c < 100, \\
0.25\,c + 0.60\,o + 0.15\,t & \text{otherwise.}
\end{cases}
$$

Hard\-zero gates are defined in [section E\.4](https://arxiv.org/html/2605.26321#A5.SS4)\. The constraint gate ensures an agent cannot trade an explicit business rule for a cheaper plan\.
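The piecewise reward translates directly into a small function. The sketch below assumes all three dimension scores are on a 0 to 100 scale, as the formula implies; the function name `final_reward` and the `hard_zero` flag are names chosen here, not identifiers from the release.

```python
def final_reward(c: float, o: float, t: float, hard_zero: bool = False) -> float:
    """Combine constraint (c), optimality (o), and traceability (t) scores.

    Each input is a percentage in [0, 100]; hard_zero marks a fired gate.
    """
    if hard_zero:
        return 0.0
    if c < 100:
        # Constraint gate: only the constraint slice earns credit.
        return 0.25 * c
    return 0.25 * c + 0.60 * o + 0.15 * t

print(final_reward(c=80, o=100, t=100))   # 20.0
print(final_reward(c=100, o=50, t=100))
```

The first call shows the gate at work: a run that breaks one in five constraints earns 20 even with a perfect optimality score, so it cannot outscore any constraint-clean run, whose reward is at least 25.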

### E\.2\. Rule Catalog

Constraint and traceability checks come from a fixed catalog grouped into six functional families \([table 8](https://arxiv.org/html/2605.26321#A5.T8)\)\. Each task instantiates the applicable subset with sampled arguments\. Demand and sales dominates because every task begins with customer demand: each customer\-product pair drives a separate coverage, deadline, list\-price, revenue, and budget check\. The remaining families enter only when the sampled pattern requires manufacturing, screening, invoicing, or repair logic\.

Table 8\. Rule families with representative checks and scored invocations across the 300\-task release\. The five constraint families gate the optimality and traceability slices of the reward \([section E\.3](https://arxiv.org/html/2605.26321#A5.SS3)\); the traceability family is graded into the traceability slice directly\.

### E\.3\. Optimality Calculation

The verifier recomputes the realized primary objective $a$ from terminal Odoo records and compares it with the certified value $e$\. The four objectives with an optimization phase share the exponential decay

$$
\texttt{score}(a,e) = \begin{cases}
100 & \text{if } a \leq e + \tau, \\
100 \cdot \exp\!\big({-}k\,(a-e)/\max(e,1)\big) & \text{otherwise,}
\end{cases}
$$

with per\-objective $\tau$ and $k$ in [table 9](https://arxiv.org/html/2605.26321#A5.T9)\. Objectives with a secondary spend metric combine primary score $p$ and secondary score $s$ lexicographically with band weight $w = 0.1$: $o = (1-w)\,p + w\,s$ when $p \geq 100$, and $o = (1-w)\,p$ otherwise\. The secondary can lift a fully satisfied primary above 90 but never compensates for a primary regression\.
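The decay and the lexicographic band can be sketched as two functions. The values of `tau` and `k` used in the example are placeholders, not the per-objective parameters from the paper's table, and treating objectives without a secondary metric as $o = p$ is an assumption of this sketch.

```python
import math
from typing import Optional

def decay_score(a: float, e: float, tau: float, k: float) -> float:
    """Score a realized objective a against the certified value e."""
    if a <= e + tau:
        return 100.0
    # Relative overshoot, with max(e, 1) guarding against a zero optimum.
    return 100.0 * math.exp(-k * (a - e) / max(e, 1.0))

def optimality(p: float, s: Optional[float] = None, w: float = 0.1) -> float:
    """Combine primary score p with an optional secondary spend score s."""
    if s is None:
        return p  # assumption: no secondary metric means o = p
    if p >= 100:
        return (1 - w) * p + w * s
    # A primary regression forfeits the secondary band entirely.
    return (1 - w) * p

# Placeholder parameters: zero tolerance and k = 5.
print(round(decay_score(a=110, e=100, tau=0.0, k=5.0), 2))  # 60.65
print(optimality(p=100.0, s=50.0))
```

With these placeholder parameters, a 10% overshoot on the primary objective costs about 39 points, while a perfect primary with a half-credit secondary lands at 95, inside the band above 90 that the secondary controls.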

Table 9\. Optimality objectives and decay parameters\. $\tau$ is the zero\-penalty tolerance; $k$ controls how quickly the score decays as the realized metric overshoots the oracle\.

### E\.4\. Hard\-Zero Gates

Two structural conditions collapse the reward to zero even when individual constraint checks pass\. A *partial\-acceptance* gate fires on screened tasks when the agent leaves every task\-relevant record untouched, and a *repair\-state* gate fires on repair\-plan tasks when the seeded broken plan is still present in the terminal state\. The gates exist because screened tasks reward an explicit triage decision a no\-op never demonstrates, and repair tasks measure deviation from a baseline an unmodified state has zero of; without the gates both task types would reward inactivity\.

### E\.5\. Reward\-Hacking Defense

The agent has write access to the PO `price_unit` field, so a naive verifier could be tricked into rewarding a small price entered by hand\. The optimality dimension defends against this by re\-pricing each PO line from the authoritative vendor offer tier on file rather than reading the agent\-written field, and the tier\-price compliance check in the constraint dimension fails the line if it deviates from the tier price by more than a small tolerance\. A failed tier check clips the score through the constraint gate, so an agent that writes a wrong price loses constraint credit regardless of optimality\.
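The two defenses can be sketched side by side: spend is recomputed from the offer tiers, and the written price is checked separately. The data shapes, function names, tolerance, and quantity below are hypothetical; only the two prices (865.41 as the tier price, 800.54 as the entered price) echo the computer-use example in the failure analysis.

```python
from typing import Dict, List, Optional, Tuple

Tier = Tuple[int, int, float]  # (min_qty, max_qty, unit_price)

def tier_price(tiers: List[Tier], qty: int) -> Optional[float]:
    """Unit price of the tier whose quantity range contains qty, if any."""
    for min_qty, max_qty, price in tiers:
        if min_qty <= qty <= max_qty:
            return price
    return None

def repriced_spend(po_lines: List[dict],
                   offers: Dict[Tuple[str, str], List[Tier]]) -> float:
    """Recompute spend from the offers on file, ignoring the written price_unit."""
    total = 0.0
    for line in po_lines:
        price = tier_price(offers[(line["vendor"], line["product"])], line["qty"])
        if price is None:
            raise ValueError("quantity falls outside every offer tier")
        total += price * line["qty"]
    return total

def tier_price_compliant(line: dict,
                         offers: Dict[Tuple[str, str], List[Tier]],
                         tol: float = 0.01) -> bool:
    """Constraint check: the written price must match the tier price within tol."""
    price = tier_price(offers[(line["vendor"], line["product"])], line["qty"])
    return price is not None and abs(line["price_unit"] - price) <= tol

offers = {("V1", "P1"): [(1, 50, 865.41)]}
line = {"vendor": "V1", "product": "P1", "qty": 10, "price_unit": 800.54}

print(round(repriced_spend([line], offers), 2))  # 8654.1
print(tier_price_compliant(line, offers))        # False
```

Under this split, an under-stated price gains nothing on the optimality side, since the written field is never read there, and it also fails the compliance check on the constraint side.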

### E\.6\. Rule Density

[Table 10](https://arxiv.org/html/2605.26321#A5.T10) reports the per\-task count of scored verifier checks\. Hard tasks instantiate more independent rules because they combine manufacturing, sourcing, capacity, and repair constraints in a single scenario\.

Table 10\. Scored verifier checks per task by difficulty tier\. The denominator counts applicable checks only; `NA` returns stay out of the rule density numbers\.

## Appendix F Harness Details

We choose the three harnesses to separate task competence from interface\. Coding tests direct API automation, a natural scaffold for terminal\-oriented agents; browser and computer\-use test less integrated UI modes closer to human Odoo work\. All three share the same `pi` CLI, task image, seeded database, instruction, and verifier; only the `pi-mono` adapter and tool surface change\.

Table 11\. Harness implementation surfaces\. Adapters live under `agents/`\. UI extensions mask the default coding tools, so browser and computer\-use agents cannot call shell, filesystem, or API helpers directly\.

The full evaluation schedules 5 trials per agent\-task pair\. Each trial starts from scratch with no retries, a one\-hour timeout, a 400\-turn budget, and provider\-default reasoning effort where exposed\. UI containers add only the packages needed to expose the interface, including Xvfb and Chromium for computer\-use trials\. The computer\-use runs for open\-weight models were halted early after over 500 trials produced zero points, and their results reflect partial trials\.

Model access\. Proprietary models are accessed via first\-party developer APIs; open\-weight models are accessed via OpenRouter\.

## Appendix G Additional Results

### G\.1\. Headline Metrics by Model and Harness

[Table 12](https://arxiv.org/html/2605.26321#A7.T12) reports evaluation metrics for each model and harness over the 300\-task release\. Every model\-harness pair schedules 1,500 trials \(300 tasks $\times$ 5 trials\); the two halted computer\-use runs stopped early after producing zero points across hundreds of attempts\. Per\-pattern pass@5 across the 29 task patterns ships with the dataset release\.

The cost of moving from coding to a UI harness is uneven across models\. Pass@5 falls by 49 percentage points for GPT\-5\.5 and 51 for GLM\-5\.1 from coding to browser, and by 56 for GPT\-5\.5 from coding to computer\-use\. Claude Opus 4\.7 loses 16 and 22 percentage points across the same two transitions, the smallest GUI penalty of any evaluated model\.

Table 12\. Evaluation metrics by model and harness\. *Clean* is the share of trials whose terminal state passes every applicable constraint check, *perfect* is the share with reward 100, and *opt\|clean* is the mean optimality score among constraint\-clean trials\. The two halted computer\-use runs produced zero points before being stopped\.

### G\.2\. Task Parameters Predict Action Burden

Generated difficulty predicts how much work success takes, not only whether success occurs\. [Figure 6](https://arxiv.org/html/2605.26321#A7.F6) reports per\-harness Spearman correlations between task parameters and turns to resolve, computed on resolved trials\. The scale variables that lower pass@1 in [fig\. 4](https://arxiv.org/html/2605.26321#S4.F4) \(purchase orders, distinct vendors, manufacturing orders, components, and BOM lines\) raise turns to resolve with correlations of $+0.6$ to $+0.7$ in every harness\.

![Spearman correlations between task parameters and mean turns to resolve](https://arxiv.org/html/2605.26321v1/figures/turns-parameters-correlation.png)

Figure 6\. Spearman correlations between task parameters and mean turns to resolve, on resolved trials, reported per harness\.

### G\.3\. Feasibility and Optimality Split

[Figure 7](https://arxiv.org/html/2605.26321#A7.F7) breaks the feasibility\-optimality gap of [fig\. 5](https://arxiv.org/html/2605.26321#S4.F5) down by model, harness, and difficulty tier\. Constraint satisfaction collapses with difficulty across the board, but mean optimality among constraint\-clean trials stays above 86 wherever at least one clean trial exists\. Score loss with difficulty comes from broken explicit constraints, not from worse trade\-offs among feasible plans\.

![Constraint satisfaction and mean optimality given constraints pass](https://arxiv.org/html/2605.26321v1/figures/image6.png)

Figure 7\. Constraint satisfaction and mean optimality given constraints pass, broken out by model, harness, and difficulty tier\.

### G\.4\. Cost and Step Efficiency

[Figure 8](https://arxiv.org/html/2605.26321#A7.F8) plots resolution rate against dollar cost per task\. The Pareto frontier is traced by three coding points \(Kimi K2\.5, GLM\-5\.1, and GPT\-5\.5\); every UI point and the Claude coding point sit below it\. Moving the same model from coding to a UI harness pushes its operating point off the frontier without exception\.

[Figure 9](https://arxiv.org/html/2605.26321#A7.F9) decomposes the gap\. Tokens per turn are within 5% across harnesses \($\sim$28k–30k\), so the cost gap is almost entirely a turn\-count gap: 24 turns per coding trial against 171 for computer\-use and 219 for browser\. The structural reason is action batching\. A coding agent writes one script and batches many ERP changes through the JSON\-2 API in a single tool call\. A UI agent has to wait for the next screen after each form submit, and the per\-call action cap \(7 for browser, 16 for computer\-use\) puts a hard ceiling on how many ERP effects fit in one turn\. Coding is a natural plan\-execute substrate; the UI harnesses are not\.

![Resolution rate against dollar cost per task](https://arxiv.org/html/2605.26321v1/figures/image7.png)

Figure 8\. Resolution rate against dollar cost per task\. The Pareto frontier is traced by Kimi K2\.5, GLM\-5\.1, and GPT\-5\.5 in the coding harness\.

![Per-harness cost drivers](https://arxiv.org/html/2605.26321v1/figures/image9.png)

Figure 9\. Per\-harness cost drivers\. Tokens per turn are comparable across harnesses; turns per trial drives the cost gap\.

### G\.5\. Reliability Across Repeated Trials

Pass@5 measures whether at least one of five trials succeeds; pass\^5 measures whether all five do\. Aggregate pass\^5 is 9\.8% in the coding harness and 3\.6% in the browser harness, roughly an order of magnitude below the corresponding pass@5\. Frontier agents on ERP\-Bench are not yet reliable enough for unattended deployment\.
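The two reliability metrics differ only in the quantifier applied to each task's trials. The sketch below computes both from per-task success flags, following the definitions given above; the function names and the outcome data are invented for illustration and are not ERP-Bench results.

```python
from typing import List

def pass_at_k(trials: List[List[bool]]) -> float:
    """Share of tasks where at least one trial succeeded."""
    return sum(any(task) for task in trials) / len(trials)

def pass_hat_k(trials: List[List[bool]]) -> float:
    """Share of tasks where every trial succeeded."""
    return sum(all(task) for task in trials) / len(trials)

# Invented outcomes: four tasks, five trials each.
trials = [
    [True, True, True, True, True],       # reliable
    [True, False, False, False, False],   # solved once
    [False, False, False, False, False],  # never solved
    [True, True, False, True, True],      # one slip
]
print(pass_at_k(trials), pass_hat_k(trials))  # 0.75 0.25
```

A single failed trial is enough to remove a task from the all-five count, which is why that metric sits so far below pass@5 for agents that succeed only intermittently.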

## Appendix H Failure Analysis

Parseable failures fall into five broad families\. *Demand and sales* failures miss, mis\-price, or fail to confirm the required sales orders\. *Timing* failures schedule purchases, manufacturing orders, or deliveries after the customer deadline\. *Sourcing* failures violate vendor price, lead time, minimum or maximum quantity, consolidation, screened\-vendor, spend\-floor, or margin policy\. *Manufacturing* failures build infeasible plans with incompatible components, workcenters, or capacities\. *Hygiene and state* failures break traceability, confirmation, invoicing, seeded\-order handling, repair state, or adjacent\-data preservation\. The families are not exclusive: one terminal state can break demand, timing, sourcing, and hygiene checks at once\. We hold optimality out of this view because a sub\-oracle optimality score is only meaningful once the terminal state clears constraints and hygiene\.

[Table 13](https://arxiv.org/html/2605.26321#A8.T13) labels each parseable failed trial by the families of failed verifier rules in `rule_results.tsv`\. Coding failures are mostly plan\-execution failures: the agent has enough information to size demand but schedules production before inputs can legally arrive, mutates confirmed procurements after discovering a planning error, or loses origin links while patching the plan\. Browser failures more often start with UI state ambiguity\. A browser agent can reason about the right quantities and still leave a sales\-order line blank, fail to confirm a quotation, or overwrite a procurement date after recognizing the selected vendor is late\. Computer\-use failures share the business\-rule structure of browser failures, amplified by screenshot\-based form editing: the characteristic mistake is committing a partially edited PO or SO because the agent cannot reliably tell whether the cell, datepicker, autocomplete, or source field accepted the intended value\.

Table 13\. Prevalence of failure families per harness, as percentages of parseable failed trials\. Families are not exclusive, so rows do not sum to 100\.

### H\.1\. Failure Examples

The three examples below are randomly sampled failed trajectories from the released run directories under a fixed seed, filtered to parseable rewards with concrete failed verifier rules\. Each example traces the verifier failure to the agent decision that caused it\.

Coding: manufacturing before inputs arrive\. GLM\-5\.1 on a medium two\-stage build correctly sized demand at 64 finished units, then scheduled manufacturing at the start of the run and forced component receipts through the API before respecting supplier lead times\. The verifier flagged `supply_timing_feasible`, `po_delivery_schedule_compliance`, `mo_schedule_compliance`, `component_stock_capacity_compliance`, `mo_component_feasibility`, and `po_origin_traceability`\. After the first MOs became invalid, the agent cancelled and recreated them but kept the already\-created component POs and tried to repair lineage after the fact, updating `origin` fields to point at the new MOs rather than cancelling and rebuilding the dependent POs\.

Browser: proceeding after a broken sales\-order row\. GPT\-5\.5 on the same two\-stage scenario hit a UI failure while creating a customer sales order \(a product autocomplete dropdown closed before the line selection committed\) and continued without verifying that the customer, product, quantity, price, commitment date, and confirmation state had persisted\. The verifier flagged `demand_coverage`, `deadline_fulfillment`, `sale_revenue`, `supply_timing_feasible`, `po_delivery_schedule_compliance`, `po_price_tier_compliance`, `so_confirmed`, and `po_origin_traceability`\. The agent then made a separate sourcing mistake: it identified that a candidate vendor would arrive after the deadline, described the vendor as invalid in its own scratchpad, and still entered the late purchase\.

Computer\-use: ordering after stock already covers demand\. GPT\-5\.5 on a screened buy\-only intake task identified the key fact \(only one seeded order should be accepted, and existing finished stock covers it\) and then created a purchase order anyway\. The verifier flagged `new_spend_margin_policy`, `supply_timing_feasible`, `po_delivery_schedule_compliance`, `po_consolidation_compliance`, `po_min_qty_compliance`, and `po_price_tier_compliance`\. The same trace shows the form\-entry uncertainty that drove most of those rules: the vendor price field displayed 800\.54 instead of the tier price 865\.41, and a quantity edit failed to update the line amount\. Because no new supply was needed for the accepted demand, any nonzero PO introduced avoidable sourcing, timing, and price\-tier obligations\.

## Appendix I Validity Checks

Single\-source generation does not eliminate every construction error\. The CP\-SAT program can encode incomplete business logic, and a renderer can mistranslate a correct specification into one of the artifacts\. Five end\-to\-end checks target each of the four artifact\-drift failure modes \([table 14](https://arxiv.org/html/2605.26321#A9.T14)\)\. Because the generator is a single program rather than a corpus of hand\-authored tasks, any defect a check surfaces is traced to the specification or one of its translation layers, fixed once, and propagated deterministically to every affected instance\.

Table 14\. Validity checks against the artifact pairs they constrain, using the notation of [fig\. 2](https://arxiv.org/html/2605.26321#S1.F2)\. A check fires when the two named artifacts disagree on what the task requires\.

No\-op agent\. A no\-op agent should score zero on every task; a nonzero score means the seeded environment already satisfies the verifier\. The no\-op agent scores zero on 300 of 300 tasks\.

Oracle replay\. Replaying the solver’s certified plan into the seeded environment should receive full reward; a lower score means setup, oracle, and verifier disagree on the same plan\. Oracle replay scores full reward on 300 of 300 tasks\.

LLM judge\. An LLM judge cross\-reads the instruction, environment configuration, oracle, and verifier against the CP\-SAT program and flags any artifact that disagrees with the others\. The judge reviewed all 300 tasks across 12 iterations during pipeline development\. Defects it surfaced and we then fixed at the generator level include the setup script writing the wrong initial on\-hand stock for some procurement scenarios, the verifier missing record\-lineage checks linking purchase and manufacturing orders to their originating sales orders, and the verifier missing checks for policy clauses that the instruction generator had introduced into the prompt\.

Reward\-hacking canary\. The canary flags any rollout whose terminal state both passes every applicable constraint check and reports an objective strictly better than the certified optimum, which would indicate either a verifier hole or a CP\-SAT optimality gap\. Across the 16,159 completed trials, zero rollouts triggered the canary\.
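The canary condition reduces to a two-part predicate. The sketch below assumes a minimized objective, which matches the spend, vendor-count, workcenter-minute, and repair-distance objectives described in the generator appendix; the function name, argument names, and numbers are hypothetical.

```python
def canary_fires(constraint_clean: bool, realized: float, certified: float) -> bool:
    """Flag a rollout that is constraint-clean yet beats the certified optimum.

    Assumes a minimized objective, so "strictly better" means strictly lower.
    """
    return constraint_clean and realized < certified

print(canary_fires(True, realized=950.0, certified=1000.0))   # True
print(canary_fires(True, realized=1000.0, certified=1000.0))  # False
print(canary_fires(False, realized=950.0, certified=1000.0))  # False
```

Matching the certified optimum does not trip the canary, and neither does a cheaper plan that breaks a constraint, since that case is already handled by the constraint gate.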

Expert spot checks\. Two domain experts independently completed a stratified sample of 15 tasks by hand in the standard Odoo web client, reading only the instruction\. The verifier scored their terminal states at a mean reward of 90\.48 across the 30 expert\-task trials\. The experts hit a broken explicit constraint in 3 of 30 trials, in line with the constraint\-clean rate of strong models\. Mean expert completion time was 55 minutes per task, comparable to the 60\-minute trial budget the agents receive\.

## Appendix J Release

We release the repository containing the task generator and the ERP\-Bench dataset in Harbor format at [erpbench\.ai](https://erpbench.ai/)\.