MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

arXiv cs.CL 05/08/26, 04:00 AM Papers
Summary
The article introduces MANTRA, a framework for automatically synthesizing SMT-validated compliance benchmarks for tool-using LLM agents from natural language manuals. It demonstrates that this approach enables scalable and reliable evaluation of agent adherence to complex procedural rules.
arXiv:2605.06334v1 Announce Type: new Abstract: Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack reliability for complex, long-horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace-level compliance checks for a given task, and validates their consistency using SMT solving. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback. %This yields benchmarks that are formally validated. Importantly, MANTRA supports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity which is utilized to automatically derive challenging tasks accompanying compliance checks. Using MANTRA, we build a new benchmark suite with 285 tasks across 6 domains scaling to 50+ page manuals with minimal human effort. Empirically, we show that the compliance checks are richer with stronger constraint enforcement compared to existing benchmarks. Additionally, the granularity of the checks can be used for debugging the agents' failure modes. These results demonstrate that combining automated benchmark generation with formally grounded validation methods enables scalable and reliable benchmarking of tool-using agents.
Original Article
View Cached Full Text
Cached at: 05/08/26, 07:47 AM
# MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
Source: [https://arxiv.org/html/2605.06334](https://arxiv.org/html/2605.06334)
11institutetext:Max Planck Institute for Software Systems, Kaiserslautern, Germany
11email:\{ashwani,ichatzi,rraha,akschmuck\}@mpi\-sws\.org###### Abstract

Tool\-using large language model \(LLM\) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals\. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls\. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM\-based judges, which either do not scale or lack reliability for complex, long\-horizon manuals\. To overcome these limitations, we presentMANTRA, a framework for automatically synthesizing machine\-checkable compliance benchmarks from natural\-language manuals and tool schemas\.MANTRA*independently*generates \(i\) a symbolic world model capturing procedural dependencies, and \(ii\) a set of trace\-level compliance checks for a given task, and validates their consistency using SMT solving\. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback\. Importantly,MANTRAsupports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity which is utilized to automatically derive challenging tasks accompanying compliance checks\. UsingMANTRA, we build a new benchmark suite with 285 tasks across 6 domains scaling to 50\+ page manuals with minimal human effort\. Empirically, we show that the compliance checks are richer with stronger constraint enforcement compared to existing benchmarks\. Additionally, the granularity of the checks can be used for debugging the agents’ failure modes\. These results demonstrate that combining automated benchmark generation with formally grounded validation methods enables scalable and reliable benchmarking of tool\-using agents\.

## 1Introduction

The capabilities of tool\-using large language\-model \(LLM\) agents have increased substantially in recent years\. Modern agents can autonomously plan, invoke external tools, and execute complex, multi\-step workflows across various domains such as enterprise operations\[[44](https://arxiv.org/html/2605.06334#bib.bib42),[27](https://arxiv.org/html/2605.06334#bib.bib41)\], scientific discovery\[[38](https://arxiv.org/html/2605.06334#bib.bib44),[33](https://arxiv.org/html/2605.06334#bib.bib45)\], travel booking\[[6](https://arxiv.org/html/2605.06334#bib.bib40)\], and healthcare\[[13](https://arxiv.org/html/2605.06334#bib.bib43),[25](https://arxiv.org/html/2605.06334#bib.bib46)\]\. However, their practical usefulness depends not only on their capability to successfully complete tasks, but also on whether they comply with the regulations and procedures that govern them\[[18](https://arxiv.org/html/2605.06334#bib.bib48),[48](https://arxiv.org/html/2605.06334#bib.bib47)\]\. For instance, when procuring company hardware, employees must adhere to detailed policies specifying under which conditions items should be sourced internally versus ordered, and which approval steps must precede a purchase\.

Such policies are typically written as natural\-language manuals for human interpretation, while agent behavior is observed as execution traces of tool calls\. This mismatch makes it inherently difficult to design benchmarks for evaluating procedural compliance of tool\-using agents, as it requires translating informal instructions into machine\-checkable constraints over execution traces\. Existing benchmarks address this problem through either manual constraint construction, LLM\-based judging, or a combination thereof\[[32](https://arxiv.org/html/2605.06334#bib.bib3),[43](https://arxiv.org/html/2605.06334#bib.bib13),[2](https://arxiv.org/html/2605.06334#bib.bib14),[20](https://arxiv.org/html/2605.06334#bib.bib51),[22](https://arxiv.org/html/2605.06334#bib.bib50),[26](https://arxiv.org/html/2605.06334#bib.bib4),[31](https://arxiv.org/html/2605.06334#bib.bib49),[39](https://arxiv.org/html/2605.06334#bib.bib7),[30](https://arxiv.org/html/2605.06334#bib.bib8)\]\(see Tab\.[1](https://arxiv.org/html/2605.06334#S1.T1)\)\. Manual construction can produce high\-quality evaluations, but it is costly to scale across domains and long manuals\. LLM\-based judges are easier to deploy, but they struggle at reliably handling strict temporal constraints, preconditions, and forbidden actions over long traces\. As a result, current methods do not provide a scalable and reproducible way to construct trace\-level compliance benchmarks from natural\-language manuals\.

We address this gap withMANTRA\(MANual\-to\-Test\-tRAnslation\), a framework which automatically generates compliance benchmarks from natural\-language procedural documents and tool schemas\. In contrast to all existing methods,MANTRAexposes*both*a very high level of automation, and yields high\-quality trace\-level compliance checks which allows to evaluate agents deterministically\. Thereby,MANTRAscales to large unstructured procedural documents and can easily be extended to new domains with minimal human effort\.

##### Contributions\.

Concretely, we make the following contributions:\(i\)We develop an automated pipeline for converting a given natural\-language procedural manual and a tool schema into machine\-checkable compliance benchmarks, with controllable task difficulty and deterministic trace\-level grading\.\(ii\)We introduce a two\-artifact formulation that separately generates a symbolic world model and compliance checks, which are automatically cross\-validated by a Satisfiability Modulo Theories \(SMT\)\[[8](https://arxiv.org/html/2605.06334#bib.bib2),[3](https://arxiv.org/html/2605.06334#bib.bib1)\]solver\. This allows to automatically identify and repair inconsistencies before benchmark items are accepted\. Human review is supported but only required as a fallback\.\(iii\)To show the capabilities ofMANTRAas a benchmark generator, we have created a benchmark suite with285285diverse benchmarks across 6 domains\. We show empirically that the granularity of the trace\-level checks can be used for debugging the agents’ failure modes\. The code is available at[https://anonymous\.4open\.science/r/mantra\-for\-compliance/](https://anonymous.4open.science/r/mantra-for-compliance/)and the benchmark data is available at[https://huggingface\.co/datasets/mantra\-anon/MANTRA](https://huggingface.co/datasets/mantra-anon/MANTRA)\.

Table 1:Comparison of MANTRA with representative tool\-use and agent\-compliance benchmarks along supported \(✓\), partially/indirectly supported \(△\\triangle\) and not supported \(✗\) features\.Notes:*Automated generation*is partially satisfied \(△\\triangle\) if tasks are automatically generated but require substantial human validation or hand\-built environments\.*Large document sizes*are only partially supported \(△\\triangle\) if long structured instructions rather than unstructured procedural manuals are used\.*Expanding to new domains*is only partially satisfied \(△\\triangle\) if domain\-specific manual effort is required\.*Trace evaluation*is supported \(✓\) when procedural properties of the execution trace \(e\.g\., required or forbidden calls, ordering of constraints, \.\.\) are checked and not only the final answer or database state, and partially supported \(△\\triangle\) if trajectory\-level properties are evaluated by LLM judges\.*Deterministic evaluation*is supported \(✓\) if per\-test grading is fully reproducible without invoking an LLM judge, and partially supported \(△\\triangle\) if evaluation relies only partly on LLM judges\.

##### Related work\.

Our work builds upon a growing literature on benchmarking tool\-using agents, automated benchmark generation, and the use of formal methods to validate LLM outputs\.

The study of tool\-using agents has been a major focus of recent research\[[24](https://arxiv.org/html/2605.06334#bib.bib21)\], with a large number of benchmarks proposed across various domains such as web navigation\[[42](https://arxiv.org/html/2605.06334#bib.bib26),[47](https://arxiv.org/html/2605.06334#bib.bib27)\], science\[[36](https://arxiv.org/html/2605.06334#bib.bib18)\], and general tool and API usage\[[21](https://arxiv.org/html/2605.06334#bib.bib36),[23](https://arxiv.org/html/2605.06334#bib.bib17),[17](https://arxiv.org/html/2605.06334#bib.bib20),[29](https://arxiv.org/html/2605.06334#bib.bib25),[30](https://arxiv.org/html/2605.06334#bib.bib8)\]\. However, the capabilities of benchmarks for unstructured and open\-ended tasks is still limited, as shown in Tab\.[1](https://arxiv.org/html/2605.06334#S1.T1)\(see App\.[0\.A](https://arxiv.org/html/2605.06334#Pt0.A1)for details\)\. Even the most similar benchmark SOP\-Bench\[[26](https://arxiv.org/html/2605.06334#bib.bib4)\]requires substantial human effort in benchmark generation, validation and extension to new domains, vastly limiting its scalability compared toMANTRA\.

Automated benchmark generation has been recently explored in different contexts\[[40](https://arxiv.org/html/2605.06334#bib.bib12)\], especially for coding tasks\[[37](https://arxiv.org/html/2605.06334#bib.bib33),[46](https://arxiv.org/html/2605.06334#bib.bib32),[16](https://arxiv.org/html/2605.06334#bib.bib34),[19](https://arxiv.org/html/2605.06334#bib.bib35)\]\. However, these benchmarks rely on an accuracy\-based evaluation of structured tasks, are not designed to test agents’ tool usage capabilities, and do not involve checking compliance of tool traces with complex natural language manuals designed for human use\.MANTRAbrings these techniques to unstructured domains where tool\-using LLM agents are increasingly deployed\.

While formal methods have a rich history in the specification and verification of software systems\[[1](https://arxiv.org/html/2605.06334#bib.bib37)\], their application in the context of LLMs remains under\-explored\[[45](https://arxiv.org/html/2605.06334#bib.bib23)\]\. An emerging line of work studies the usage of symbolic solvers to improve LLM\-generated code\[[5](https://arxiv.org/html/2605.06334#bib.bib28)\], logical reasoning\[[28](https://arxiv.org/html/2605.06334#bib.bib39),[41](https://arxiv.org/html/2605.06334#bib.bib38),[11](https://arxiv.org/html/2605.06334#bib.bib31)\]and planning\[[15](https://arxiv.org/html/2605.06334#bib.bib24)\], and to detect hallucinated claims\[[34](https://arxiv.org/html/2605.06334#bib.bib29)\]\. Recently, a translation of natural language policies and LLM outputs into symbolic specifications for response grounding was proposed\[[4](https://arxiv.org/html/2605.06334#bib.bib30)\], which focuses on verifying static natural\-language outputs\. In contrast,MANTRAdoes not only verify a final output, but constructs trace\-level compliance checks to evaluate sequences of tool calls\.

Procedural Manual for Hardware Procurement *Standard fulfillment \(A\)\.*When fulfilling a hardware request, the agent must firstcheck inventory\. If the requested item is in stock, the agent mustassign a warehouse picker\. It mustnot create a purchase orderfor an in\-stock item\.… *Integration\-lab procurement \(B\)\.*Forintegration\-lab requeststhat require procurement, the agent must firstcheck the legacy portalbeforecreating a purchase order, and it mustset delivery optionsfor the incoming hardware\.Tool schema check\_inventory assign\_warehouse\_picker create\_purchase\_order check\_legacy\_portal set\_delivery\_options …Case 1uses paragraph \(A\) *Scenario\.*“Please get 1 Dell UltraSharp U2723QE monitor for Maya Torres at Lake Union Studio\. The quoted item price is $1,850, under refresh request HW\-2841\. This request is still pending, and no purchase order has been created yet\.” CALLcheck\_inventory\(item\_name="Dell UltraSharp U2723QE"\) CALLassign\_warehouse\_picker\(item\_id="HWM2741", quantity=1\) check\_inventoryPRECEDESassign\_warehouse\_picker NO\-CALLcreate\_purchase\_orderCase 2uses paragraphs \(A\), \(B\) *Scenario\.*“I need 1 Cisco Catalyst 9300 switch for the Denver integration lab at 4100 Blake Street, Denver, CO 80216\. The quoted item price is $1,995, and if it has to be purchased, the supplier quote shows a 9\-day lead time\. Put it under pending request LAB\-9315\.” CALLcheck\_inventory\(item\_name="Cisco Catalyst 9300"\) CALLcheck\_legacy\_portal\(item\_id="HWC9305"\) check\_inventoryPRECEDEScheck\_legacy\_portal CALLcreate\_purchase\_order\(item\_id="HWC9305", quantity=1\) create\_purchase\_orderFOLLOWScheck\_legacy\_portal CALLset\_delivery\_options\(item\_id="HWC9305"\)MANTRA

Figure 1:Running example\.From a hardware\-procurement manual and a tool schema,MANTRAsynthesizes benchmark items with concrete scenarios and trace\-level checks\. Case 1 is generated from paragraph \(A\), while Case 2 combines paragraphs \(A\) and \(B\)\.

## 2Overview

Before presentingMANTRAin detail, we illustrate its operation—schematically shown in Fig\.[2](https://arxiv.org/html/2605.06334#S3.F2)—using the running example in Fig\.[1](https://arxiv.org/html/2605.06334#S1.F1)\. The example is based on a hardware\-procurement procedural manual𝒟\\mathcal\{D\}, of which two representative paragraphs are shown on the left of Fig\.[1](https://arxiv.org/html/2605.06334#S1.F1)\. In addition,MANTRAis given a global tool schema𝒯\\mathcal\{T\}describing the tools available to a tool\-using agent for interacting with a hidden databaseDBover variablesVarDB\\mathrm\{Var\}\_\{\\texttt\{DB\}\}\.

Given these inputs,MANTRAconstructs a benchmark suiteℬ=\{bi\}i∈\[n\],n∈ℕ\\mathcal\{B\}=\\\{b\_\{i\}\\\}\_\{i\\in\[n\]\},~n\\in\\mathbb\{N\}of validated test\-cases, two of which are shown on the right of Fig\.[1](https://arxiv.org/html/2605.06334#S1.F1)\. Each test\-casebib\_\{i\}is centered around a scenarioλi\\lambda\_\{i\}, generated from𝒟\\mathcal\{D\}in Step1of Fig\.[2](https://arxiv.org/html/2605.06334#S3.F2)\. To do so,MANTRAfirst builds a dependence graph𝒢𝒟\\mathcal\{G\}\_\{\\mathcal\{D\}\}capturing hierarchy and reference relations in𝒟\\mathcal\{D\}, and then samples a subgraph𝒢𝒟S\\mathcal\{G\}\_\{\\mathcal\{D\}\_\{\\mathrm\{S\}\}\}corresponding to a region𝒟S\\mathcal\{D\}\_\{\\mathrm\{S\}\}of𝒟\\mathcal\{D\}together with a subset of relevant tools𝒯S\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}\. An LLM then maps\(𝒟S,𝒯S\)\(\\mathcal\{D\}\_\{\\mathrm\{S\}\},\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}\)to a concrete scenarioλ\\lambda, consisting of a promptpλp\_\{\\lambda\}and a valuation of relevant variables\. The key design choice in this step is that the sampled subgraph controls scenario complexity: in Fig\.[1](https://arxiv.org/html/2605.06334#S1.F1), Case 1 uses only paragraph \(A\), whereas Case 2 combines paragraphs \(A\) and \(B\)\.

When the promptpλp\_\{\\lambda\}is later given to an agent along with𝒟\\mathcal\{D\}, it produces an execution traceτ=\[\(t0,a0\),…,\(tl−1,al−1\)\]\\tau=\\big\[\(t\_\{0\},a\_\{0\}\),\\ldots,\(t\_\{l\-1\},a\_\{l\-1\}\)\\big\], where eachti∈𝒯t\_\{i\}\\in\\mathcal\{T\}is a tool call and eachaia\_\{i\}instantiates its arguments\. Step2ofMANTRAgenerates, for each scenarioλ\\lambda, a set of trace\-level checks𝒞λ\\mathcal\{C\}\_\{\\lambda\}that specify whether such a trace complies with the sampled document region\. These checks provide a deterministic evaluation procedure for agent traces\.

Because the checks are generated by an LLM,MANTRAdoes not assume that they are correct\. Instead, Step3independently generates a world model𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}for the same𝒟S\{\\mathcal\{D\}\}\_\{\\mathrm\{S\}\}and uses SMT\-based cross\-validation to compare the bounded traces admitted by𝒞λ\\mathcal\{C\}\_\{\\lambda\}and𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}\. Concretely,MANTRAsearches for bounded traces that satisfy the checks but violate the world model, suggesting that the checks are under\-constrained or the world model is over\-constrained, and for bounded traces that satisfy the world model but only violate a single check, which may suggest the vice versa\. Such inconsistencies are used to iteratively refine the generated checks and world model\. Although𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}is not treated as ground truth, this formal cross\-validation between two independently generated artifacts yields a structured and largely automated refinement loop, producing validated test\-cases with minimal human intervention\.

## 3TheMANTRAFramework \(Methodology\)

MANTRAProcedural Manual\(𝒟\)\(\\mathcal\{D\}\)Dependence Graph\(𝒢𝒟\)\(\\mathcal\{G\}\_\{\\mathcal\{D\}\}\)Sample\(S=\(𝒟S,𝒢𝒟S\)\(\\mathrm\{S\}=\(\\mathcal\{D\}\_\{\\mathrm\{S\}\},\\mathcal\{G\}\_\{\\mathcal\{D\}\_\{\\mathrm\{S\}\}\}\)WorldModel\(𝒲S\)\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)Database\(DB\)ToolSchema\(𝒯\)\(\\mathcal\{T\}\)Relevant Tools\(𝒯S\)\(\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}\)Scenarios\(λ\)\(\\lambda\)Checks\(𝒞λ\)\(\\mathcal\{C\}\_\{\\lambda\}\)World\-ModelSMT Encoding\(Φ^h\(𝒲S\)\)\(\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)\)Check SMTEncoding\(Φ^h\(𝒞λ\)\)\(\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\)\)Z3 ForwardConflict\(Φλfwd\)\(\\Phi\_\{\\lambda\}^\{\\mathrm\{fwd\}\}\)Z3 BackwardConflict\(Φλbwd\)\(\\Phi\_\{\\lambda\}^\{\\mathrm\{bwd\}\}\)Validated✓\\checkmarkConflictingTrace\(τ∗\)\(\\tau^\{\*\}\)Benchmark\(ℬ\)\(\\mathcal\{B\}\)1Document Ingestion and Scenario Generation2Check Creation3Cross\-Validation using SMTSATUNSATUNSATSATrefinement

Figure 2:Overview of theMANTRApipeline\. See Sec\.[3](https://arxiv.org/html/2605.06334#S3)and App\.[0\.B](https://arxiv.org/html/2605.06334#Pt0.A2)–[0\.C](https://arxiv.org/html/2605.06334#Pt0.A3)for details\.The next subsections explain all steps ofMANTRAshown in Fig\.[2](https://arxiv.org/html/2605.06334#S3.F2)\(see App\.[0\.B](https://arxiv.org/html/2605.06334#Pt0.A2)–[0\.C](https://arxiv.org/html/2605.06334#Pt0.A3)for details\)\.

### 3\.1Document Ingestion and Generation Pipeline

The first stage of the pipeline involves generating a diverse set of candidate \(partial\) test\-cases called*scenarios*, which will later be converted into full test\-cases by adding necessary*checks*\. At this stage, each scenarioλ\\lambdacontains a natural\-language promptpλp\_\{\\lambda\}for the agent, along with a concrete valuation of a set of relevant variablesVarλ\\mathrm\{Var\}\_\{\\lambda\}\. In order to generate scenarios that reflect the diversity of rules and constraints in𝒟\\mathcal\{D\}, we first create and then sample parts of a graph representation of𝒟\\mathcal\{D\}\.

##### Dependence Graph Construction\.

The problem of extracting a graph representation of a text document has been studied extensively in the context of retrieval\-augmented generation\[[9](https://arxiv.org/html/2605.06334#bib.bib52),[14](https://arxiv.org/html/2605.06334#bib.bib53)\]\. The goal is to construct a dependence graph𝒢𝒟\\mathcal\{G\}\_\{\\mathcal\{D\}\}over𝒟\\mathcal\{D\}, where nodes represent chunks of content \(*e\.g\.*, chapters, sections, or paragraphs\) and edges represent hierarchical and reference relations between chunks\. In our implementation, the nodes are created deterministically by recursively chunking the document based on its natural hierarchy and a pre\-specified length limit,111This chunk\-size limit can be chosen by the benchmark creator based on the context window available for downstream generation stages\.while the edges are created by prompting an LLM to identify explicit and implicit references between chunks\. See App\.[0\.C\.5](https://arxiv.org/html/2605.06334#Pt0.A3.SS5)for details\.

##### Coverage\-Driven Node Sampling\.

Once𝒢𝒟\\mathcal\{G\}\_\{\\mathcal\{D\}\}is constructed,MANTRAsamples nodes from the graph and their corresponding edges to obtain smaller regions of𝒟\\mathcal\{D\}for scenario generation\. We call a sampleS=\(𝒢𝒟S,𝒟S\)\\mathrm\{S\}=\(\{\\mathcal\{G\}\_\{\\mathcal\{D\}\}\}\_\{\\mathrm\{S\}\},\{\\mathcal\{D\}\}\_\{\\mathrm\{S\}\}\), where𝒢𝒟S\{\\mathcal\{G\}\_\{\\mathcal\{D\}\}\}\_\{\\mathrm\{S\}\}is the sampled subgraph and𝒟S\{\\mathcal\{D\}\}\_\{\\mathrm\{S\}\}is the text of𝒟\\mathcal\{D\}that is contained in the nodes of𝒢𝒟S\{\\mathcal\{G\}\_\{\\mathcal\{D\}\}\}\_\{\\mathrm\{S\}\}\. Each sample will be used to generate one or more scenarios involving only the parts of the procedural manual that are in𝒟S\{\\mathcal\{D\}\}\_\{\\mathrm\{S\}\}\. As such, the sampling process naturally controls scenario complexity: small, densely connected subgraphs tend to yield simpler scenarios focused on a specific part of𝒟\\mathcal\{D\}, whereas larger or more weakly connected subgraphs can induce more complex ones involving constraints from multiple, distant parts of𝒟\\mathcal\{D\}\. Our chosen sampling strategy maintains coverage statistics over nodes in𝒢𝒟\\mathcal\{G\}\_\{\\mathcal\{D\}\}and biases sampling toward under\-covered regions – refer to App\.[0\.C\.6](https://arxiv.org/html/2605.06334#Pt0.A3.SS6)for details\.

##### Tool Relevance Mapping\.

For each sampleS\\mathrm\{S\},MANTRAprompts an LLM to identify a subset of tools𝒯S⊆𝒯\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}\\subseteq\\mathcal\{T\}that are relevant to𝒟S\{\\mathcal\{D\}\}\_\{\\mathrm\{S\}\}, to be used in subsequent stages of test\-case generation\. For instance, a sampled subgraph corresponding to paragraph \(A\) of𝒟\\mathcal\{D\}in Fig\.[1](https://arxiv.org/html/2605.06334#S1.F1)would retain tools such ascheck\_inventory,assign\_warehouse\_picker, andcreate\_purchase\_orderin𝒯S\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}, whereas a sampled subgraph that also includes paragraph \(B\) would additionally have tools such ascheck\_legacy\_portalandset\_delivery\_options\.

##### Scenario Generation\.

For each sampleS\\mathrm\{S\}and its relevant tool subset𝒯S\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\},MANTRAgenerates, via an LLM, a user\-specified number ofmmdistinct scenariosλ1,…,λm\\lambda\_\{1\},\\ldots,\\lambda\_\{m\}that involve𝒟S\{\\mathcal\{D\}\}\_\{\\mathrm\{S\}\}\. The LLM is instructed to generate, for each scenarioλ\\lambda, a natural\-language promptpλp\_\{\\lambda\}describing a task for the agent to complete in compliance with𝒟S\{\\mathcal\{D\}\}\_\{\\mathrm\{S\}\}, which contains specific named entities and exact numerical quantities\. In parallel, the LLM generates a corresponding valuationIλI\_\{\\lambda\}of a set of variablesVarλ=Varλext∪Varλint\\mathrm\{Var\}\_\{\\lambda\}=\\mathrm\{Var\}\_\{\\lambda\}^\{\\mathrm\{ext\}\}\\cup\\mathrm\{Var\}\_\{\\lambda\}^\{\\mathrm\{int\}\}, whereVarλext⊆VarDB\\mathrm\{Var\}\_\{\\lambda\}^\{\\mathrm\{ext\}\}\\subseteq\\mathrm\{Var\}\_\{\\texttt\{DB\}\}are variables that the agent may observe via tool calls to the database, andVarλint∩VarDB=∅\\mathrm\{Var\}\_\{\\lambda\}^\{\\mathrm\{int\}\}\\cap\\mathrm\{Var\}\_\{\\texttt\{DB\}\}=\\varnothingare additional scenario\-specific variables that are not part of the database and cannot be accessed by the agent\. In order for the agent to be able to complete the task, the values ofVarλint\\mathrm\{Var\}\_\{\\lambda\}^\{\\mathrm\{int\}\}are made explicit in the scenario promptpλp\_\{\\lambda\}\. For example, in Fig\.[1](https://arxiv.org/html/2605.06334#S1.F1)\(top right\), the internal variablesrequest\_pendingandpurchase\_order\_createdare initialized, and the prompt states that*“This request is still pending, and no purchase order has been created yet\.”*\.

### 3\.2Check Generation and World Model Creation

Evaluating whether an agent complied with𝒟\\mathcal\{D\}when executing a scenario is challenging because the agent may follow different valid tool\-call trajectories\. Existing benchmarks either inspect the final state after execution or rely on an LLM judge to detect violations\[[43](https://arxiv.org/html/2605.06334#bib.bib13),[20](https://arxiv.org/html/2605.06334#bib.bib51),[22](https://arxiv.org/html/2605.06334#bib.bib50),[26](https://arxiv.org/html/2605.06334#bib.bib4),[31](https://arxiv.org/html/2605.06334#bib.bib49)\]\. However, the former cannot determine whether the agent followed the correct procedure, and the latter is sensitive to the variability of LLM judgments\. To address these limitations,MANTRAgenerates a set of deterministic trace\-level checks𝒞λ\\mathcal\{C\}\_\{\\lambda\}for each scenarioλ\\lambda, to be later integrated into test\-cases\.

##### Check Generation \(𝒞λ\\mathcal\{C\}\_\{\\lambda\}\)\.

For each scenarioλ\\lambda,MANTRAgenerates a set𝒞λ\\mathcal\{C\}\_\{\\lambda\}of boolean checks\. These checks are intended to capture the properties that must be satisfied by an execution traceτ\\taugenerated by the agent when prompted withpλp\_\{\\lambda\}, in order for the scenario to be executed in compliance with𝒟\\mathcal\{D\}\. LetΠ\(𝒞λ\)\\Pi\(\\mathcal\{C\}\_\{\\lambda\}\)be the set of traces that satisfy all checks in𝒞λ\\mathcal\{C\}\_\{\\lambda\}, Traces inΠ\(𝒞λ\)\\Pi\(\\mathcal\{C\}\_\{\\lambda\}\)may be required to \(not\) contain particular calls and satisfy correct ordering between dependent calls\. More specifically, atomic checks take the formcall\(t,a\)\(t,a\)andno\_call\(t,a\)\(t,a\), wheret∈𝒯St\\in\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}andaais a concrete partial argument map,222The check only matches the arguments explicitly specified byaa\.while compound checks support disjunction and temporal operators such asafter,before,follows, andprecedes—see App\.[0\.B\.1](https://arxiv.org/html/2605.06334#Pt0.A2.SS1)for the check grammar and operator semantics\.

Importantly, the arguments appearing in the checks in𝒞λ\\mathcal\{C\}\_\{\\lambda\}are grounded rather than symbolic, because they use the values set inIλI\_\{\\lambda\}during scenario generation\. In an ideal world where𝒞λ\\mathcal\{C\}\_\{\\lambda\}perfectly represents the intent of𝒟\\mathcal\{D\}, an agent successfully completes the task in compliance with𝒟\\mathcal\{D\}if its execution trace satisfies all checks in𝒞λ\\mathcal\{C\}\_\{\\lambda\}\. For instance, in case 1 of the running example in Fig\.[1](https://arxiv.org/html/2605.06334#S1.F1), the requested*Dell Ultrasharp U2723QE*monitor is assigned to be in stock inIλI\_\{\\lambda\}, so a check must classify all execution traces which callcreate\_purchase\_orderas non\-compliant\.

##### Generating Formal World Model \(𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}\) from a Sample\.

Unfortunately, we cannot assume that𝒞λ\\mathcal\{C\}\_\{\\lambda\}accurately represents all procedural restrictions from𝒟\\mathcal\{D\}for the given scenario, due to the inherent limitations of LLMs which are used to generate them\.MANTRAtherefore independently generates a world model𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}for the sampled𝒟S\{\\mathcal\{D\}\}\_\{\\mathrm\{S\}\}from which𝒞λ\\mathcal\{C\}\_\{\\lambda\}was derived\. This world model aims to represent𝒟S\{\\mathcal\{D\}\}\_\{\\mathrm\{S\}\}as a state\-transition system, which defines the set of tracesΠ\(𝒲S\)\\Pi\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)that comply with𝒟S\{\\mathcal\{D\}\}\_\{\\mathrm\{S\}\}\. We then compareΠ\(𝒲S\)\\Pi\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)withΠ\(𝒞λ\)\\Pi\(\\mathcal\{C\}\_\{\\lambda\}\)in Sec\.[3\.3](https://arxiv.org/html/2605.06334#S3.SS3)in a validation and refinement loop\.

Formally, the world model𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}is defined as𝒲S=\(𝒳,𝒯S,δ\)\\mathcal\{W\}\_\{\\mathrm\{S\}\}=\(\\mathcal\{X\},\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\},\\delta\), where𝒯S⊆𝒯\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}\\subseteq\\mathcal\{T\}is the set of relevant tools \(actions\),𝒳\\mathcal\{X\}is the state space of relevant variablesVar𝒲S\\mathrm\{Var\}\_\{\\mathcal\{W\}\_\{\\mathrm\{S\}\}\}, andδ\\deltais a transition relation which maps each toolt∈𝒯St\\in\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}to a transition relationδt\(x,x′,a\)\\delta\_\{t\}\(x,x^\{\\prime\},a\)over the current statex∈𝒳x\\in\\mathcal\{X\}, the next statex′∈𝒳x^\{\\prime\}\\in\\mathcal\{X\}, and concrete tool argumentsaa\. Similarly to the scenario variables, we haveVar𝒲S=Var𝒲Sext∪Var𝒲Sint\\mathrm\{Var\}\_\{\\mathcal\{W\}\_\{\\mathrm\{S\}\}\}=\\mathrm\{Var\}\_\{\\mathcal\{W\}\_\{\\mathrm\{S\}\}\}^\{\\mathrm\{ext\}\}\\cup\\mathrm\{Var\}\_\{\\mathcal\{W\}\_\{\\mathrm\{S\}\}\}^\{\\mathrm\{int\}\}, whereVar𝒲Sext⊆VarDB\\mathrm\{Var\}\_\{\\mathcal\{W\}\_\{\\mathrm\{S\}\}\}^\{\\mathrm\{ext\}\}\\subseteq\\mathrm\{Var\}\_\{\\texttt\{DB\}\}may be accessible to the agent via tool calls, whileVar𝒲Sint∩VarDB=∅\\mathrm\{Var\}\_\{\\mathcal\{W\}\_\{\\mathrm\{S\}\}\}^\{\\mathrm\{int\}\}\\cap\\mathrm\{Var\}\_\{\\texttt\{DB\}\}=\\varnothingare hidden internal variables withVarλint∩Var𝒲Sint=∅\\mathrm\{Var\}\_\{\\lambda\}^\{\\mathrm\{int\}\}\\cap\\mathrm\{Var\}\_\{\\mathcal\{W\}\_\{\\mathrm\{S\}\}\}^\{\\mathrm\{int\}\}=\\varnothing\. We say that an execution traceτ=\[\(t0,a0\),…,\(tl−1,al−1\)\]\\tau=\[\(t\_\{0\},a\_\{0\}\),\\ldots,\(t\_\{l\-1\},a\_\{l\-1\}\)\]is compliant with𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}and writeτ∈Π\(𝒲S\)\\tau\\in\\Pi\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)if there exist statesx0,x1,…x\_\{0\},x\_\{1\},\\ldotss\.t\.xi∈𝒳x\_\{i\}\\in\\mathcal\{X\}and transitionsδti\(xi,xi\+1,ai\)\\delta\_\{t\_\{i\}\}\(x\_\{i\},x\_\{i\+1\},a\_\{i\}\)for alli∈\[l−1\]i\\in\[l\-1\]\.

In practice, generating𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}directly is unreasonable due to \(i\) the potentially prohibitive size of𝒳\\mathcal\{X\}andδ\\deltaand \(ii\) the lack of structure to guide its generation\. Instead, we prompt an LLM to describe𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}in a typed domain specific language \(DSL\), which allows us to generate a compact, structured representation of𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}\. We provide all details on this DSL in App\.[0\.B\.2](https://arxiv.org/html/2605.06334#Pt0.A2.SS2)\.

### 3\.3Cross\-Validation Using SMT\.

To validate that the generated checks for a scenario are consistent with the corresponding sampled document region,MANTRAleverages bounded model checking with SMT solvers to compare the sets ofhh\-bounded tracesΠh\(𝒞λ\)⊆Π\(𝒞λ\)\\Pi^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\)\\subseteq\\Pi\(\\mathcal\{C\}\_\{\\lambda\}\)andΠh\(𝒲S\)⊆Π\(𝒲S\)\\Pi^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)\\subseteq\\Pi\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\), wherehha pre\-specified bound on the length of traces\. Concretely, we search for two kinds of bounded inconsistencies: traces that comply with checks but violate the world model, and traces that comply with the world model while violating a single check\. The former indicates that the checks may be too weak or the world model too restrictive, whereas the latter indicates that the world model may be too permissive or the checks too restrictive\. We provide a concrete example of this process in App\.[0\.B\.5\.2](https://arxiv.org/html/2605.06334#Pt0.A2.SS5.SSS2)\.

##### SMT Compilation\.

To support this validation,MANTRAdeterministically compiles both the generated checks𝒞λ\\mathcal\{C\}\_\{\\lambda\}and the world model𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}into bounded SMT encodingsΦ^h\(𝒞λ\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\)andΦ^h\(𝒲S\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)respectively\. To illustrate these encodings, consider the running example from Fig\.[1](https://arxiv.org/html/2605.06334#S1.F1)\. A check in𝒞λ\\mathcal\{C\}\_\{\\lambda\}requiring that the tool callassign\_warehouse\_pickermust follow a tool callcheck\_inventory, contributes toΦ^h\(𝒞λ\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\)a bounded ordering constraint of the form

⋁0≤i<j<h\(ti=check\_inventory∧tj=assign\_warehouse\_picker\)\\bigvee\_\{0\\leq i<j<h\}\\Bigl\(t\_\{i\}=\\texttt\{check\\\_inventory\}\\;\\land\\;t\_\{j\}=\\texttt\{assign\\\_warehouse\\\_picker\}\\Bigr\)which asserts that there exist two positions in the bounded trace such thatcheck\_inventoryoccurs beforeassign\_warehouse\_picker\.

Similarly, a transition rule in𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}stating thatassign\_warehouse\_pickeris enabled only whenin\_stockholds and, when executed, setspicker\_assignedin the successor state, contributes the following family of per\-step constraints toΦ^h\(𝒲S\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\):

⋀i=0h−1\(\\displaystyle\\bigwedge\_\{i=0\}^\{h\-1\}\\biggl\(ti=assign\_warehouse\_picker⇒\\displaystyle t\_\{i\}=\\texttt\{assign\\\_warehouse\\\_picker\}\\Rightarrow\(1\)\(in\_stocki=⊤∧picker\_assignedi\+1=⊤\)\)\\displaystyle\\bigl\(\\texttt\{in\\\_stock\}\_\{i\}=\\top\\;\\land\\;\\texttt\{picker\\\_assigned\}\_\{i\+1\}=\\top\\bigr\)\\biggr\)Thus, bounded SMT compilation unrolls the transition semantics across the horizon0,…,h−10,\\dots,h\-1: if stepiiinvokes a tool, then its pre\-conditions must hold at positionii, and its post\-conditions constrain positioni\+1i\+1\. Refer to App\.[0\.B\.3](https://arxiv.org/html/2605.06334#Pt0.A2.SS3)–[0\.B\.4](https://arxiv.org/html/2605.06334#Pt0.A2.SS4)for the full SMT compilation procedure\.

##### Forward Z3 Conflict Search\.

Intuitively,Φ^h\(𝒲S\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)andΦ^h\(𝒞λ\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\)are a set of logical formulas representingΠh\(𝒲S\)\\Pi^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)andΠh\(𝒞λ\)\\Pi^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\), respectively, and we search for mutual inconsistencies in these formulas via a cross\-validation loop using the SMT solver Z3\. This search is meaningful as both formulas involve the same variables fromVarDB\\mathrm\{Var\}\_\{\\texttt\{DB\}\}\. A natural first attempt is to query the solver to check whether the formulaΦ=Φ^h\(𝒞λ\)∧¬Φ^h\(𝒲S\)\\Phi=\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\)\\;\\land\\;\\neg\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)is satisfiable\. Any trace that satisfiesΦ\\Phicomplies with the checks but not with the world model, and is hence non\-consistent\. In practice, however, this query is too coarse: the solver can satisfy the negated world\-model formula in spurious ways that do not isolate the actual source of disagreement\.

To obtain more informative inconsistencies,MANTRAdecomposesΦ^h\(𝒲S\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)into three distinct sets of formulas:Φ^preh\(𝒲S\)\\widehat\{\\Phi\}\_\{\\mathrm\{pre\}\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)andΦ^posth\(𝒲S\)\\widehat\{\\Phi\}\_\{\\mathrm\{post\}\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)include constrains on the pre\- and post\-conditions of the subset of tools𝒯λ⊆𝒯S\\mathcal\{T\}\_\{\\mathrm\{\\lambda\}\}\\subseteq\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}explicitly mentioned in𝒞λ\\mathcal\{C\}\_\{\\lambda\}, andΦ^bgh\(𝒲S\)\\widehat\{\\Phi\}\_\{\\mathrm\{bg\}\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)includes all constrains on the rest of the tools𝒯bg=𝒯S∖𝒯λ\\mathcal\{T\}\_\{\\mathrm\{bg\}\}=\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}\\setminus\\mathcal\{T\}\_\{\\mathrm\{\\lambda\}\}\. Then the refined forward conflict query takes the form

Φλfwd=Φ^h\(𝒞λ\)∧Φ^bgh\(𝒲S\)∧Φ^posth\(𝒲S\)∧¬Φ^preh\(𝒲S\)\\Phi\_\{\\lambda\}^\{\\mathrm\{fwd\}\}\\;=\\;\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\)\\;\\land\\;\\widehat\{\\Phi\}\_\{\\mathrm\{bg\}\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)\\;\\land\\;\\widehat\{\\Phi\}\_\{\\mathrm\{post\}\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)\\;\\land\\;\\neg\\widehat\{\\Phi\}\_\{\\mathrm\{pre\}\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)together with the initial\-state constraint, that is, assigning the values ofIλI\_\{\\lambda\}to the corresponding variables\. Intuitively, a trace that satisfiesΦλfwd\\Phi\_\{\\lambda\}^\{\\mathrm\{fwd\}\}complies with𝒞λ\\mathcal\{C\}\_\{\\lambda\}while violating the pre\-conditions of tools𝒞λ\\mathcal\{C\}\_\{\\lambda\}focuses on, rather than via conflicts in unrelated parts of the world model\. IfΦλfwd\\Phi\_\{\\lambda\}^\{\\mathrm\{fwd\}\}is satisfiable, the solver returns a bounded conflicting traceτ∗\\tau^\{\*\}witnessing this conflict\. This trace is then used in the subsequent resolution step to determine whether the inconsistency is better explained by an error in the world model or by an error in the generated checks\. See App\.[0\.B\.5\.2](https://arxiv.org/html/2605.06334#Pt0.A2.SS5.SSS2)for an example\.

##### Sample\-Batched World\-Model Resolution\.

Given a fixed sampleS\\mathrm\{S\}, the above forward conflict search via SMT solving is applied to each scenarioλ1,…,λm\\lambda\_\{1\},\\dots,\\lambda\_\{m\}generated fromS\\mathrm\{S\}and its corresponding check encodingΦ^h\(𝒞λ1\),…,Φ^h\(𝒞λm\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\_\{1\}\}\),\\dots,\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\_\{m\}\}\)against the same bounded world\-model encodingΦ^h\(𝒲S\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)\. In practice, several resulting conflicting traces often reflect the same defect in the shared world model, such as a missing pre\-condition for a tool transition, so it is sensible to resolve them together\. That is whyMANTRAfirst performs a batched world\-model resolution step: it groups the conflicting traces obtained for the same sampleS\\mathrm\{S\}and asks an LLM judge whether they point to a common defect in the world model\. If so, the judge proposes a single shared repair to the world model, such as adding a missing state predicate to the pre\-condition of a transition\. The repaired world model is then recompiled into SMT and the affected scenarios are re\-evaluated\.

##### Check Resolution\.

Conflicts that remain after the shared world\-model repair step are treated as evidence that the inconsistency may instead lie in the generated checks for a particular scenarioλ\\lambda, such as a missingno\_callcondition\. In this case,MANTRAperforms a per\-scenario resolution step: given a scenarioλ\\lambda, its check set𝒞λ\\mathcal\{C\}\_\{\\lambda\}, and a conflicting traceτ∗\\tau^\{\*\}, an LLM judge identifies which checks in𝒞λ\\mathcal\{C\}\_\{\\lambda\}are responsible for the disagreement and proposes localized fixes to those checks only\. This separation between shared world\-model repair and scenario\-specific check repair keeps modifications targeted to the artifact most likely to be incorrect\.

##### Deterministic Edits and Backward Audit\.

To ensure that repairs remain well\-formed,MANTRAdoes not apply free\-form text edits directly to the generated artifacts\. Instead, every repair is expressed through a typed edit language over the intermediate representations of the world model and checks\. These edits are then applied deterministically, after which the corresponding artifacts are recompiled before the next solver call\. The edit\-and\-validate loop continues until the forward conflict queryΦλifwd\\Phi\_\{\\lambda\_\{i\}\}^\{\\mathrm\{fwd\}\}is unsatisfiable for every generated scenarioλi\\lambda\_\{i\}of the sampled subgraph or the pre\-specified number of iterations\. The former case indicating that no bounded forward inconsistencies remain under the current encodings\.

After this forward phase,MANTRAperforms a backward audit to detect over\-restrictive checks\. For a fixed scenarioλ\\lambdawith check set𝒞λ=\{cλ,1,…,cλ,ℓλ\}\\mathcal\{C\}\_\{\\lambda\}=\\\{c\_\{\\lambda,1\},\\dots,c\_\{\\lambda,\\ell\_\{\\lambda\}\}\\\}, the backward audit considers each checkcλ,jc\_\{\\lambda,j\}in turn and asks whether the world model admits a bounded trace that satisfies all remaining checks but violatescλ,jc\_\{\\lambda,j\}\. Formally, the corresponding backward query is

Φλ,jbwd=Φ^h\(𝒲S\)∧⋀k≠jΦ^h\(cλ,k\)∧¬Φ^h\(cλ,j\)\.\\Phi\_\{\\lambda,j\}^\{\\mathrm\{bwd\}\}\\;=\\;\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)\\;\\land\\;\\bigwedge\_\{k\\neq j\}\\widehat\{\\Phi\}^\{h\}\(c\_\{\\lambda,k\}\)\\;\\land\\;\\neg\\widehat\{\\Phi\}^\{h\}\(c\_\{\\lambda,j\}\)\.IfΦλ,jbwd\\Phi\_\{\\lambda,j\}^\{\\mathrm\{bwd\}\}is satisfiable, thencλ,jc\_\{\\lambda,j\}is flagged as a candidate for over\-restriction, since the current world model admits a bounded trace that is rejected only because of that particular check\. An LLM judge then decides whether to removecλ,jc\_\{\\lambda,j\}from𝒞λ\\mathcal\{C\}\_\{\\lambda\}, which is deterministically done using the typed edit language\. Then𝒞λ\\mathcal\{C\}\_\{\\lambda\}is re\-compiled and re\-evaluated\.

If for a sampleS\\mathrm\{S\}, after a pre\-specified number of validation loops, all scenariosλ1,…,λm\\lambda\_\{1\},\\dots,\\lambda\_\{m\}pass both the forward conflict search and the backward audit, then all these scenarios are marked as validated\. Otherwise, they are deferred to human review and re\-evaluated\. At the end of the pipeline, all scenarios that have been validated make up the test\-cases of the final benchmarkℬ\\mathcal\{B\}\.

##### Human\-in\-the\-Loop Fallback\.

Most inconsistencies can be resolved automatically through the repair loop discussed above\. However, some cases remain ambiguous or unresolved after a fixed repair budget\.MANTRAthen falls back to structured human review: the unresolved scenarioλ\\lambda, its checks𝒞λ\\mathcal\{C\}\_\{\\lambda\}, and the corresponding conflicting traceτ∗\\tau^\{\*\}are presented to a reviewer, along with𝒟S\{\\mathcal\{D\}\}\_\{\\mathrm\{S\}\}who can inspect the conflict and apply a manual typed repair\. Once such a manual repair is made, the modified portion is excluded from subsequent automated rewriting, and any overwrite requires human approval\. See App\.[0\.C\.8](https://arxiv.org/html/2605.06334#Pt0.A3.SS8)for more details\.

## 4Evaluation

This section discusses both the evaluation of the benchmark*generation*process in Sec\.[4\.1](https://arxiv.org/html/2605.06334#S4.SS1)as well as the*utilization*of the resulting benchmark suit for evaluating tool\-using agents in Sec\.[4\.2](https://arxiv.org/html/2605.06334#S4.SS2)\.

### 4\.1Evaluating Benchmark Generation withMANTRA

We built a prototype implementation ofMANTRAwhich uses OpenAI GPT 5\.4 \(see App\.[0\.C\.2](https://arxiv.org/html/2605.06334#Pt0.A3.SS2)\) for all LLM calls in the pipeline and Z3\[[8](https://arxiv.org/html/2605.06334#bib.bib2)\]to evaluate SMT queries\. It was run on a machine with 2×\\times3\.2 GHz Intel Xeon Gold 6134M CPUs, 2×\\times32GB V100 Nvidia Tesla GPUs and 768GB RAM\. We used thecoverage\_islandssampling strategy \(described in App\.[0\.C\.6](https://arxiv.org/html/2605.06334#Pt0.A3.SS6)\) to create a sample graph𝒢𝒟S\\mathcal\{G\}\_\{\\mathcal\{D\}\_\{\\mathrm\{S\}\}\}and generatedm=4m=4scenarios per sampleS\\mathrm\{S\}\. For the SMT conflict search, we used trace boundh=16h=16, and refined the world model𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}and checks𝒞λ\\mathcal\{C\}\_\{\\lambda\}for at most55rounds\. The choice ofhhis based on the observation during development that most tests cases require only a few steps to fulfill the task, and that the SMT solving time \(specially for UNSAT proofs\) increases significantly with higher trace bounds\.

We used this implementation to generate a benchmark suite of 285 tasks across 6 domains, with procedural manuals𝒟\\mathcal\{D\}ranging from 1,158 to 16,644 words and tool schemas𝒯\\mathcal\{T\}spanning 13 to 148 tools \(Table[2](https://arxiv.org/html/2605.06334#S4.T2)\)\. Theoperationsdomain uses a synthetic manual; the others use real\-world procedural manuals\. Theairline,retail, andtelecomdomains are taken fromτ\\tau\-bench\[[2](https://arxiv.org/html/2605.06334#bib.bib14)\], with tool schemas𝒯\\mathcal\{T\}and databasesDBavailable\. For the remaining domains, we built𝒯\\mathcal\{T\}andDB\.

Table[2](https://arxiv.org/html/2605.06334#S4.T2)further illustrates, that many generated test cases where automatically validated\. These automatically validated cases form a benchmark suite with a variety of scenarios to cover the procedural rules from𝒟\\mathcal\{D\}as controlled by the sampling strategy\. The number of validated test cases can be increased by either running more cross\-validating rounds or by human review\. To aid human review,MANTRAincludes a user\-interface \(see App\.[0\.C\.8](https://arxiv.org/html/2605.06334#Pt0.A3.SS8)for details\) which enables the targeted inspection of remaining conflicts\. While we did not review these conflicts, we did inspect all test cases automatically validated byMANTRAto ensure the generation of high\-quality benchmark instances\. We emphasize that we did not find any instance where the LLM hallucinated similarly in the world model and checks, which would have resulted in a false positive validation\. During the review, we removed a few problematic generated test cases, as shown in the last two columns of Tab\.[2](https://arxiv.org/html/2605.06334#S4.T2), that originated from engineering challenges such as arguments pinning exact strings \(see App\.[0\.E](https://arxiv.org/html/2605.06334#Pt0.A5)for details\)\.

Table 2:Per\-domain corpus statistics: input size, average checks per case, and number of test\-cases yield through the automatic forward/backward validation pipeline\.![Refer to caption](https://arxiv.org/html/2605.06334v1/x1.png)Figure 3:Cumulative validated test cases through forward refinement rounds R1–R5, per domain\. Left: absolute counts\. Right: fraction of generated cases\.nnis the initial generated count\.
### 4\.2Evaluating Agent Performance on the Generated Benchmark Suites

To show the usefulness of the benchmark suite generated byMANTRAfor its intended application, we used it to evaluate 6 LLM models\. We implemented the agents using the Strands framework\[[35](https://arxiv.org/html/2605.06334#bib.bib57)\]\. We run all 285 tasks55times independently and report the success rate statistics in Table[4](https://arxiv.org/html/2605.06334#Pt0.A4.T4)of App\.[0\.D\.1](https://arxiv.org/html/2605.06334#Pt0.A4.SS1)\. We note high variance in performance within domains, which indicates varying test\-case difficulty\. An interesting observation is that domains taken fromτ\\tau\-bench seem harder for all models\. This is explained by Table[5](https://arxiv.org/html/2605.06334#Pt0.A4.T5)of App\.[0\.D\.1](https://arxiv.org/html/2605.06334#Pt0.A4.SS1), which shows that while the domains fromτ\\tau\-bench have shorter procedural manuals and fewer tools, these tools are more complex and harder to use correctly by the agents\.

Another interesting observation is that none of the used models reached very high procedural compliance\. While this might change for stronger models, the detailed trace\-level checks generated byMANTRAallow us to understand in detail*how*different models fail in different domains, which reveals structural problems of the used LLMs\. Towards this goal, we have assigned a*failure category*to each of the10,89510\{,\}895failed checks observed in our corpus\. Table[6](https://arxiv.org/html/2605.06334#Pt0.A4.T6)in App\.[0\.D\.2](https://arxiv.org/html/2605.06334#Pt0.A4.SS2)summarizes the distribution over these failure modes and shows that76%76\\%of all failed checks are due toMissing\-Required\-Call\(an atom of the formCALL\(t,a\)\\textbf\{\{CALL\}\}\(t,a\)appears in𝒞λ\\mathcal\{C\}\_\{\\lambda\}but the agent never invokedttwith arguments matchingaa\) andMissing\-Anchor\(the absence of the anchor tool of anAFTER,BEFORE,FOLLOWS, orPRECEDESclause\)\.

These missing tool calls in particular manifest themselves in the habit of weaker models to*write prematurely*, i\.e\., to call a write\-style tool before any read\-style tool\. Table[9](https://arxiv.org/html/2605.06334#Pt0.A4.T9)in App\.[0\.D\.2](https://arxiv.org/html/2605.06334#Pt0.A4.SS2)reports how often agents write prematurely\. Interestingly,Qwen3\.6:36Bnever prematurely writes intau2\-airlineandtau2\-retail, which may explain why it performs best in these domains\. It is plausible that the consistent read\-before\-write habit was explicitly thought toQwen3\.6:36Band could therefore be used to also improve other open\-source models\.

## 5Discussion and Limitations

Here, we point out limitations of our work, discuss its broader impact, and highlight interesting avenues for future research\.

##### Methodological limitations\.

MANTRAdoes not provide full formal certification of compliance with the original document\. Instead, it mitigates this challenge by independently generating trace\-level checks and a world model, then cross\-validating them through SMT\-based conflict search and structured repair\. Moreover, the framework assumes a specified tool schema as input and validates traces only up to a fixed horizonhh\. While this makes the procedure explicit and tractable, inconsistencies arising only beyond the chosen bound may remain undetected\. Some scenarios may also remain unresolved after the automated repair budget is exhausted and therefore require structured human review\.

##### Scope and generality\.

MANTRAis designed to evaluate tool\-using agents through their tool\-call traces\. This focus enables deterministic grading and fine\-grained compliance analysis within the target setting, but limits the framework’s immediate applicability beyond tool\-based agent interaction\. Extending the same ideas to settings without explicit tool calls, or to broader forms of agent evaluation, would be an interesting direction for future research\.

##### Evaluation coverage\.

Our benchmark currently contains 285 validated test cases across six domains, all derived from English\-language procedural manuals\. Although this shows thatMANTRAcan operate across diverse enterprise\-style settings, broader coverage across additional domains, languages, and document styles remains to be explored\. Likewise, while we evaluate six LLM agents, model capabilities continue to evolve rapidly, and broader evaluation on future systems is left to future work\.

##### Broader Impact\.

This work illustrates how machine\-learning\-based methods and verification\-based methods can complement each other in the development and evaluation of AI systems\. More broadly, such combinations may help make AI systems more trustworthy in settings where procedural compliance matters, including enterprise, regulatory, and safety\-critical workflows\. At the same time,MANTRAis not a standalone guarantee of safe or compliant behavior, but rather one component of a broader assurance process\. We hope this work encourages further research at the interface of machine learning and formal methods for evaluating increasingly capable LLM agents\.

## 6Conclusion

We presentMANTRA, a framework for automatically generating and validating compliance benchmarks for tool\-using LLM agents from procedural manuals and tool schemas\. By combining LLM\-based generation with SMT\-based cross\-validation,MANTRAproduces high\-quality test cases with deterministic trace\-level grading\. UsingMANTRA, we built a benchmark of 285 validated test cases across six domains and used it to evaluate six frontier LLM agents\. We believe thatMANTRAprovides a useful foundation for the more reliable evaluation of increasingly capable AI agents in compliance\-critical settings in the future\.

## 7Acknowledgements

This work was supported by DFG project 89792660 as part of TRR248\-CPEC, and by the Emmy Noether Grant SCHM 3541/1\-1\.

## References

- \[1\]C\. Baier and J\. Katoen\(2008\)Principles of model checking\.MIT press\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p4.1)\.
- \[2\]V\. Barres, H\. Dong, S\. Ray, X\. Si, and K\. Narasimhan\(2025\)Tau2\-bench: evaluating conversational agents in a dual\-control environment\.arXiv preprint arXiv:2506\.07982\.Cited by:[Appendix 0\.A](https://arxiv.org/html/2605.06334#Pt0.A1.SS0.SSS0.Px1.p1.2),[3rd item](https://arxiv.org/html/2605.06334#Pt0.A4.I1.i3.p1.6),[Table 1](https://arxiv.org/html/2605.06334#S1.T1.12.10.11.1.3),[§1](https://arxiv.org/html/2605.06334#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.06334#S4.SS1.p2.7),[Table 2](https://arxiv.org/html/2605.06334#S4.T2.5.5.5.2),[Table 2](https://arxiv.org/html/2605.06334#S4.T2.6.6.6.2),[Table 2](https://arxiv.org/html/2605.06334#S4.T2.7.7.7.2)\.
- \[3\]C\. Barrett and C\. Tinelli\(2018\)Satisfiability modulo theories\.InHandbook of Model Checking,pp\. 305–343\.External Links:ISBN 978\-3\-319\-10575\-8Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px1.p1.1)\.
- \[4\]S\. Bayless, S\. Buliani, D\. Cassel, B\. Cook, D\. Clough, R\. Delmas, N\. Diallo, F\. Erata, N\. Feng, D\. Giannakopoulou,et al\.\(2025\)A neurosymbolic approach to natural language formalization and verification\.arXiv preprint arXiv:2511\.09008\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p4.1)\.
- \[5\]Y\. Cai, Z\. Hou, D\. Sanan, X\. Luan, Y\. Lin, J\. Sun, and J\. S\. Dong\(2025\-01\)Automated program refinement: guide and verify code large language model with refinement calculus\.Proc\. ACM Program\. Lang\.9\(POPL\)\.External Links:[Document](https://dx.doi.org/10.1145/3704905)Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p4.1)\.
- \[6\]A\. Chen, X\. Ge, Z\. Fu, Y\. Xiao, and J\. Chen\(2024\)Travelagent: an ai assistant for personalized travel planning\.arXiv preprint arXiv:2409\.08069\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.p1.1)\.
- \[7\]M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. d\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman,et al\.\(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[2nd item](https://arxiv.org/html/2605.06334#Pt0.A4.I1.i2.p1.5)\.
- \[8\]L\. de Moura and N\. Bjørner\(2008\)Z3: an efficient smt solver\.InTools and Algorithms for the Construction and Analysis of Systems,Berlin, Heidelberg,pp\. 337–340\.External Links:ISBN 978\-3\-540\-78800\-3Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.06334#S4.SS1.p1.10)\.
- \[9\]D\. Edge, H\. Trinh, N\. Cheng, J\. Bradley, A\. Chao, A\. Mody, S\. Truitt, D\. Metropolitansky, R\. O\. Ness, and J\. Larson\(2024\)From local to global: a graph rag approach to query\-focused summarization\.arXiv preprint arXiv:2404\.16130\.Cited by:[§3\.1](https://arxiv.org/html/2605.06334#S3.SS1.SSS0.Px1.p1.2)\.
- \[10\]Fandom\(2026\)Sneaky sasquatch wiki \- doctor\.Note:Accessed: 2026\-05\-05[https://sneaky\-sasquatch\.fandom\.com/wiki/Doctor](https://sneaky-sasquatch.fandom.com/wiki/Doctor)Cited by:[Table 2](https://arxiv.org/html/2605.06334#S4.T2.3.3.3.2)\.
- \[11\]Y\. Feng, N\. Weir, K\. Bostrom, S\. Bayless, D\. Cassel, S\. Chaudhary, B\. Kiesl\-Reiter, and H\. Rangwala\(2025\)VeriCoT: neuro\-symbolic chain\-of\-thought validation via logical consistency checks\.arXiv preprint arXiv:2511\.04662\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p4.1)\.
- \[12\]Flight Safety Foundation\(2016\)Cabin safety compendium\.Note:Accessed: 2026\-05\-05[https://flightsafety\.org/wp\-content/uploads/2016/09/cabin\_safety\_compendium\.pdf](https://flightsafety.org/wp-content/uploads/2016/09/cabin_safety_compendium.pdf)Cited by:[Table 2](https://arxiv.org/html/2605.06334#S4.T2.4.4.4.2)\.
- \[13\]S\. Gao, A\. Fang, Y\. Huang, V\. Giunchiglia, A\. Noori, J\. R\. Schwarz, Y\. Ektefaie, J\. Kondic, and M\. Zitnik\(2024\)Empowering biomedical discovery with ai agents\.Cell187\(22\),pp\. 6125–6151\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.p1.1)\.
- \[14\]Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, H\. Wang, H\. Wang,et al\.\(2023\)Retrieval\-augmented generation for large language models: a survey\.arXiv preprint arXiv:2312\.109972\(1\),pp\. 32\.Cited by:[§3\.1](https://arxiv.org/html/2605.06334#S3.SS1.SSS0.Px1.p1.2)\.
- \[15\]Y\. Hao, Y\. Chen, Y\. Zhang, and C\. Fan\(2025\)Large language models can solve real\-world planning rigorously with formal verification tools\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 3434–3483\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p4.1)\.
- \[16\]D\. Huang, J\. M\. Zhang, M\. Harman, Q\. Zhang, M\. Du, and S\. Ng\(2026\-03\)Benchmarking llms for unit test generation from real\-world functions\.ACM Trans\. Softw\. Eng\. Methodol\.\.Note:Just AcceptedExternal Links:[Document](https://dx.doi.org/10.1145/3805043),ISSN 1049\-331XCited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p3.1)\.
- \[17\]Y\. Huang, J\. Shi, Y\. Li, C\. Fan, S\. Wu, Q\. Zhang, Y\. Liu, P\. Zhou, Y\. Wan, N\. Z\. Gong, and L\. Sun\(2024\)MetaTool benchmark for large language models: deciding whether to use tools and which to use\.InThe Twelfth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p2.1)\.
- \[18\]N\. Kolt\(2025\)Governing ai agents\.arXiv preprint arXiv:2501\.07913\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.p1.1)\.
- \[19\]H\. Lee, Z\. Zhang, H\. Lu, and L\. ZHANG\(2026\)SEC\-bench: automated benchmarking of LLM agents on real\-world software security tasks\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p3.1)\.
- \[20\]E\. Levi and I\. Kadar\(2025\)Intellagent: a multi\-agent framework for evaluating conversational ai systems\.arXiv preprint arXiv:2501\.11067\.Cited by:[Appendix 0\.A](https://arxiv.org/html/2605.06334#Pt0.A1.SS0.SSS0.Px1.p1.2),[§0\.C\.6](https://arxiv.org/html/2605.06334#Pt0.A3.SS6.p2.1),[Table 1](https://arxiv.org/html/2605.06334#S1.T1.12.10.11.1.4),[§1](https://arxiv.org/html/2605.06334#S1.p2.1),[§3\.2](https://arxiv.org/html/2605.06334#S3.SS2.p1.3)\.
- \[21\]M\. Li, Y\. Zhao, B\. Yu, F\. Song, H\. Li, H\. Yu, Z\. Li, F\. Huang, and Y\. Li\(2023\-12\)API\-bank: a comprehensive benchmark for tool\-augmented LLMs\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 3102–3116\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.187)Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p2.1)\.
- \[22\]Z\. Li, S\. Huang, J\. Wang, N\. Zhang, A\. Antoniades, W\. Hua, K\. Zhu, S\. Zeng, W\. Y\. Wang, and X\. Yan\(2025\)Agentorca: a dual\-system framework to evaluate language agents on operational routine and constraint adherence\.arXiv e\-prints,pp\. arXiv–2503\.Cited by:[Appendix 0\.A](https://arxiv.org/html/2605.06334#Pt0.A1.SS0.SSS0.Px1.p1.2),[Table 1](https://arxiv.org/html/2605.06334#S1.T1.12.10.11.1.5),[§1](https://arxiv.org/html/2605.06334#S1.p2.1),[§3\.2](https://arxiv.org/html/2605.06334#S3.SS2.p1.3)\.
- \[23\]C\. Ma, J\. Zhang, Z\. Zhu, C\. Yang, Y\. Yang, Y\. Jin, Z\. Lan, L\. Kong, and J\. He\(2024\)Agentboard: an analytical evaluation board of multi\-turn llm agents\.Advances in neural information processing systems37,pp\. 74325–74362\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p2.1)\.
- \[24\]M\. Mohammadi, Y\. Li, J\. Lo, and W\. Yip\(2025\)Evaluation and benchmarking of llm agents: a survey\.InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\.2,KDD ’25,New York, NY, USA,pp\. 6129–6139\.External Links:[Document](https://dx.doi.org/10.1145/3711896.3736570),ISBN 9798400714542Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p2.1)\.
- \[25\]M\. Moritz, E\. Topol, and P\. Rajpurkar\(2025\)Coordinated ai agents for advancing healthcare\.Nature Biomedical Engineering9\(4\),pp\. 432–438\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.p1.1)\.
- \[26\]S\. Nandi, A\. Datta, N\. Vichare, I\. Bhattacharya, H\. Raja, J\. Xu, S\. Ray, G\. Carenini, A\. Srivastava, A\. Chan,et al\.\(2025\)Sop\-bench: complex industrial sops for evaluating llm agents\.arXiv preprint arXiv:2506\.08119\.Cited by:[Appendix 0\.A](https://arxiv.org/html/2605.06334#Pt0.A1.SS0.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p2.1),[Table 1](https://arxiv.org/html/2605.06334#S1.T1.12.10.11.1.6),[§1](https://arxiv.org/html/2605.06334#S1.p2.1),[§3\.2](https://arxiv.org/html/2605.06334#S3.SS2.p1.3)\.
- \[27\]F\. M\. Ozman\(2025\)Systematic literature review on the rise of agentic ai in enterprise operations\.International Journal of Frontiers in Science and Technology Research8\(2\),pp\. 001–015\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.p1.1)\.
- \[28\]L\. Pan, A\. Albalak, X\. Wang, and W\. Wang\(2023\-12\)Logic\-LM: empowering large language models with symbolic solvers for faithful logical reasoning\.InFindings of the Association for Computational Linguistics: EMNLP 2023,Singapore,pp\. 3806–3824\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.248)Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p4.1)\.
- \[29\]S\. G\. Patil, H\. Mao, F\. Yan, C\. C\. Ji, V\. Suresh, I\. Stoica, and J\. E\. Gonzalez\(2025\)The berkeley function calling leaderboard \(BFCL\): from tool use to agentic evaluation of large language models\.InForty\-second International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p2.1)\.
- \[30\]H\. M\. Pysklo, A\. Zhuravel, and P\. D\. Watson\(2026\)Agent\-diff: benchmarking llm agents on enterprise api tasks via code execution with state\-diff\-based evaluation\.arXiv preprint arXiv:2602\.11224\.Cited by:[Appendix 0\.A](https://arxiv.org/html/2605.06334#Pt0.A1.SS0.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p2.1),[Table 1](https://arxiv.org/html/2605.06334#S1.T1.12.10.11.1.9),[§1](https://arxiv.org/html/2605.06334#S1.p2.1)\.
- \[31\]Y\. Qi, H\. Peng, X\. Wang, A\. Xin, Y\. Liu, B\. Xu, L\. Hou, and J\. Li\(2026\)AGENTIF: benchmarking large language models instruction following ability in agentic scenarios\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,Cited by:[Appendix 0\.A](https://arxiv.org/html/2605.06334#Pt0.A1.SS0.SSS0.Px1.p1.2),[Table 1](https://arxiv.org/html/2605.06334#S1.T1.12.10.11.1.7),[§1](https://arxiv.org/html/2605.06334#S1.p2.1),[§3\.2](https://arxiv.org/html/2605.06334#S3.SS2.p1.3)\.
- \[32\]Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, S\. Zhao, L\. Hong, R\. Tian, R\. Xie, J\. Zhou, M\. Gerstein, dahai li, Z\. Liu, and M\. Sun\(2024\)ToolLLM: facilitating large language models to master 16000\+ real\-world APIs\.InThe Twelfth International Conference on Learning Representations,Cited by:[Appendix 0\.A](https://arxiv.org/html/2605.06334#Pt0.A1.SS0.SSS0.Px1.p1.2),[Table 1](https://arxiv.org/html/2605.06334#S1.T1.12.10.11.1.2),[§1](https://arxiv.org/html/2605.06334#S1.p2.1)\.
- \[33\]C\. K\. Reddy and P\. Shojaee\(2025\)Towards scientific discovery with generative ai: progress, opportunities, and challenges\.InProceedings of the AAAI conference on artificial intelligence,Vol\.39,pp\. 28601–28609\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.p1.1)\.
- \[34\]V\. Singh, D\. Cassel, N\. Weir, N\. Feng, and S\. Bayless\(2026\)VERGE: formal refinement and guidance engine for verifiable llm reasoning\.arXiv preprint arXiv:2601\.20055\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p4.1)\.
- \[35\]Strands Labs\(2026\)Strands agents\.Note:Accessed: 2026\-05\-05[https://github\.com/strands\-agents](https://github.com/strands-agents)Cited by:[§4\.2](https://arxiv.org/html/2605.06334#S4.SS2.p1.3)\.
- \[36\]Q\. Sun, Z\. Liu, C\. Ma, Z\. Ding, F\. Xu, Z\. Yin, H\. Zhao, Z\. Wu, K\. Cheng, Z\. Liu,et al\.\(2025\)Scienceboard: evaluating multimodal autonomous agents in realistic scientific workflows\.arXiv preprint arXiv:2505\.19897\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p2.1)\.
- \[37\]K\. Vergopoulos, M\. N\. Mueller, and M\. Vechev\(2025\)Automated benchmark generation for repository\-level coding tasks\.InForty\-second International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p3.1)\.
- \[38\]H\. Wang, T\. Fu, Y\. Du, W\. Gao, K\. Huang, Z\. Liu, P\. Chandak, S\. Liu, P\. Van Katwyk, A\. Deac,et al\.\(2023\)Scientific discovery in the age of artificial intelligence\.Nature620\(7972\),pp\. 47–60\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.p1.1)\.
- \[39\]Z\. Wang, Q\. Chang, H\. Patel, S\. Biju, C\. Wu, Q\. Liu, A\. Ding, A\. Rezazadeh, A\. Shah, Y\. Bao, and E\. Siow\(2026\)MCP\-bench: benchmarking tool\-using LLM agents with complex real\-world tasks via MCP servers\.InThe Fourteenth International Conference on Learning Representations,Cited by:[Appendix 0\.A](https://arxiv.org/html/2605.06334#Pt0.A1.SS0.SSS0.Px1.p1.2),[Table 1](https://arxiv.org/html/2605.06334#S1.T1.12.10.11.1.8),[§1](https://arxiv.org/html/2605.06334#S1.p2.1)\.
- \[40\]C\. White, S\. Dooley, M\. Roberts, A\. Pal, B\. Feuer, S\. Jain, R\. Shwartz\-Ziv, N\. Jain, K\. Saifullah, S\. Dey, Shubh\-Agrawal, S\. S\. Sandha, S\. V\. Naidu, C\. Hegde, Y\. LeCun, T\. Goldstein, W\. Neiswanger, and M\. Goldblum\(2025\)LiveBench: a challenging, contamination\-limited LLM benchmark\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p3.1)\.
- \[41\]J\. Xu, H\. Fei, L\. Pan, Q\. Liu, M\. Lee, and W\. Hsu\(2024\-08\)Faithful logical reasoning via symbolic chain\-of\-thought\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Bangkok, Thailand,pp\. 13326–13365\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.720)Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p4.1)\.
- \[42\]S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan\(2022\)Webshop: towards scalable real\-world web interaction with grounded language agents\.Advances in Neural Information Processing Systems35,pp\. 20744–20757\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p2.1)\.
- \[43\]S\. Yao, N\. Shinn, P\. Razavi, and K\. R\. Narasimhan\(2025\)Tau\-bench: a benchmark for tool\-agent\-user interaction in real\-world domains\.InThe Thirteenth International Conference on Learning Representations,Cited by:[Appendix 0\.A](https://arxiv.org/html/2605.06334#Pt0.A1.SS0.SSS0.Px1.p1.2),[Table 1](https://arxiv.org/html/2605.06334#S1.T1.12.10.11.1.3),[§1](https://arxiv.org/html/2605.06334#S1.p2.1),[§3\.2](https://arxiv.org/html/2605.06334#S3.SS2.p1.3)\.
- \[44\]M\. Zdravković, H\. Panetto, and G\. Weichhart\(2022\)AI\-enabled enterprise information systems for manufacturing\.Enterprise Information Systems16\(4\),pp\. 668–720\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.p1.1)\.
- \[45\]Y\. Zhang, Y\. Cai, X\. Zuo, X\. Luan, K\. Wang, Z\. Hou, Y\. Zhang, Z\. Wei, M\. Sun, J\. Sun,et al\.\(2024\)The fusion of large language models and formal methods for trustworthy ai agents: a roadmap\.arXiv preprint arXiv:2412\.06512\.Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p4.1)\.
- \[46\]C\. Zhou, A\. Liu, Y\. Deng, Z\. Zeng, T\. Zhang, H\. Zhu, J\. Cai, Y\. Mao, C\. Zhang, L\. Tan, ZiyanXU, B\. Zhai, HengyiLIu, S\. Zhu, W\. Zhou, and F\. Lian\(2026\)AutoCodeBench: large language models are automatic code benchmark generators\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p3.1)\.
- \[47\]S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig\(2024\)WebArena: a realistic web environment for building autonomous agents\.InThe Twelfth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.06334#S1.SS0.SSS0.Px2.p2.1)\.
- \[48\]N\. Zwerdling, D\. Boaz, E\. Rabinovich, G\. Uziel, D\. Amid, and A\. Anaby Tavor\(2025\-11\)Towards enforcing company policy adherence in agentic workflows\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track,Suzhou \(China\),pp\. 595–606\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-industry.41),ISBN 979\-8\-89176\-333\-3Cited by:[§1](https://arxiv.org/html/2605.06334#S1.p1.1)\.

## Appendix 0\.ADetailed Comparison ofMANTRAto Existing Benchmarks

Table[1](https://arxiv.org/html/2605.06334#S1.T1)positions MANTRA against representative benchmarks for tool\-using and policy\-compliant LLM agents\. We highlight the contrasts that matter for compliance benchmarking on long procedural manuals\.

##### Automation vs\. human effort\.

ToolBench\[[32](https://arxiv.org/html/2605.06334#bib.bib3)\], IntellAgent\[[20](https://arxiv.org/html/2605.06334#bib.bib51)\], and MCP\-Bench\[[39](https://arxiv.org/html/2605.06334#bib.bib7)\]achieve fully automated task generation, but at the cost of either having no policy documents at all \(ToolBench, MCP\-Bench\) or relying on short synthetic policies \(IntellAgent\)\. At the other end of the spectrum,τ\\tau/τ2\\tau^\{2\}\-bench\[[43](https://arxiv.org/html/2605.06334#bib.bib13),[2](https://arxiv.org/html/2605.06334#bib.bib14)\]and AgentIF\[[31](https://arxiv.org/html/2605.06334#bib.bib49)\]produce high\-quality benchmarks through heavy manual authoring and validation, which limits scaling to new domains and longer documents\. AgentOrca\[[22](https://arxiv.org/html/2605.06334#bib.bib50)\], SOP\-Bench\[[26](https://arxiv.org/html/2605.06334#bib.bib4)\], and Agent\-Diff\[[30](https://arxiv.org/html/2605.06334#bib.bib8)\]sit in between, automating part of the pipeline but requiring substantial hand\-built environments, executable constraints, or state\-diff contracts per domain\. MANTRA is, to our knowledge, the only framework that combines fully automated benchmark generation with rigorous, formally grounded validation over arbitrary natural\-language manuals\.

##### Documents: instructions vs\. procedural manuals\.

Several benchmarks accept long inputs, but along a different axis than MANTRA\. AgentIF includes instructions averaging1,7231\{,\}723words \(max15,63015\{,\}630\), but each instruction is a self\-contained directive for one task in which essentially every clause applies\. SOP\-Bench similarly uses detailed industrial SOPs but pairs each test with a tightly scoped task\. In contrast, MANTRA targets standing*procedural manuals*—hierarchical, cross\-referenced documents \(up to∼\\sim16,00016\{,\}000words and148148tools in our suite\) where any given user request activates only a small, scenario\-dependent subset of the rules\. This setting requires the benchmark to identify which parts of the document are relevant to a scenario before constraints can even be formulated, a step absent from instruction\-style benchmarks\.

##### Trace\-level vs\. outcome\-level evaluation\.

A central distinction is whether evaluation inspects the agent’s tool\-call trajectory or only the final environment state\.τ\\tau/τ2\\tau^\{2\}\-bench, ToolBench, and Agent\-Diff fall on the outcome side: they compare final database states or LLM\-judged solutions, which cannot distinguish between an agent that followed the prescribed procedure and one that arrived at the right state through a non\-compliant path\. For example, a policy that requires a safety check before a critical operation would be satisfied by an agent that simply performed the critical operation without the check, as long as the final state is correct\. This is a fundamental limitation for compliance benchmarking, as it cannot reliably detect procedural violations\. AgentOrca, SOP\-Bench, AgentIF, and partially MCP\-Bench do evaluate trajectories, but their checks are either hand\-coded per task or expressed as natural\-language rubrics scored by an LLM judge\. MANTRA generates trace\-level checks automatically from the manual—ordering constraints, required calls, forbidden calls, conditional obligations—and validates them against an independently extracted symbolic world model before they are accepted into the benchmark\.

##### Determinism of grading\.

Once a benchmark is constructed, agent grading should be reproducible\. LLM\-as\-judge pipelines \(ToolBench, IntellAgent, MCP\-Bench, and the natural\-language\-assertion mode ofτ2\\tau^\{2\}\-bench\) introduce variance and prompt sensitivity into per\-test scoring\.τ\\tau\-bench, AgentOrca, SOP\-Bench, and Agent\-Diff achieve deterministic grading but only by restricting evaluation to outcome states or hand\-coded constraints\. MANTRA is the only framework in the table that delivers deterministic, trace\-level grading on automatically generated tests: once a test case is validated and committed, evaluating an agent trace against its check set requires zero LLM calls\.

Each existing benchmark optimizes for a subset of the five properties we consider, and trades off the others\. ToolBench and MCP\-Bench scale via automation but cannot enforce procedural compliance\.τ\\tau/τ2\\tau^\{2\}\-bench, SOP\-Bench, and AgentIF achieve high evaluation rigor through manual effort that does not scale to long manuals or new domains\. AgentOrca and Agent\-Diff achieve deterministic, code\-based grading at the cost of hand\-built constraints and sandboxes\. MANTRA is the first benchmark\-generation framework that simultaneously \(i\) automates task and check synthesis, \(ii\) scales to large procedural manuals, \(iii\) extends to new domains given only the manual and a tool schema, \(iv\) evaluates the full execution trace, and \(v\) grades each test deterministically through SMT\-validated checks\.

## Appendix 0\.BCheck Grammar, World Model DSL, and SMT Encoding

This appendix details the grammar used to formalize checks in𝒞\\mathcal\{C\}in Sec\.[0\.B\.1](https://arxiv.org/html/2605.06334#Pt0.A2.SS1)and the typed domain\-specific language \(DSL\) used to express the world model𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}in Sec\.[0\.B\.2](https://arxiv.org/html/2605.06334#Pt0.A2.SS2)\. It is then discussed how both are deterministically compiled into the bounded SMT encodingsΦ^h\(𝒲S\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)andΦ^h\(𝒞λ\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\)in Sec\.[0\.B\.3](https://arxiv.org/html/2605.06334#Pt0.A2.SS3)and Sec\.[0\.B\.4](https://arxiv.org/html/2605.06334#Pt0.A2.SS4), respectively\. We note that the DSL is the only object used by the LLMs in the world\-model generation, structured fix suggestions, and failure combination withinMANTRA’s pipeline\. Raw SMT encodings or Z3 Python expressions are never exposed to the LLMs\.

### 0\.B\.1Check Grammar

A set𝒞=\{c1,…,ck\}\\mathcal\{C\}=\\\{c\_\{1\},\\ldots,c\_\{k\}\\\}of checks is formalized via the following grammar\. Atoms are either required calls or forbidden calls:

call\(t,a\)no\_call\(t,a\),\\texttt\{call\}\(t,a\)\\qquad\\texttt\{no\\\_call\}\(t,a\),whereaais a concrete partial argument map\. Missing arguments are wildcards\. Compound checks support disjunction and temporal ordering:

The distinction between conditional ordering and positive ordering is important\.afterandbeforecan be vacuously satisfied by a trace in which the target call never occurs\.followsandprecedesfail unless both calls exist\.

### 0\.B\.2World model DSL

This subsection details the typed domain specific language \(DSL\) used to formalize the world model𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}\.

##### Model Syntax\.

The world model syntax is an S\-expression variant\. A model declares constants, state variables, and one transition per relevant toolt∈𝒯St\\in\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}, as detailed in the following listing where comments begin with;and run to end of line\.

model ::= ’\(’ ’model’ clause\* ’\)’clause ::= const \| state\-var \| transitionconst ::= ’\(’ ’const’ IDENT type\-expr value ’\)’state\-var ::= ’\(’ ’var’ IDENT type\-expr ’\)’type\-expr ::= ’Int’ \| ’Real’ \| ’Bool’ \| ’String’\| ’\(’ ’Enum’ STRING\+ ’\)’\| ’\(’ ’Record’ field\+ ’\)’\| ’\(’ ’Array’ type\-expr ’\)’field ::= ’\(’ IDENT type\-expr ’\)’transition ::= ’\(’ ’transition’ IDENT’\(’ ’params’ param\-bind\* ’\)’’\(’ ’pre’ expr\* ’\)’’\(’ ’post’ expr\* ’\)’ ’\)’param\-bind ::= ’\(’ IDENT IDENT ’\)’ ; \(tool\_param\_name local\_name\)expr ::= IDENT ; state\-var or constant\| LITERAL\| ’\(’ ’param’ IDENT ’\)’ ; tool argument\| ’\(’ ’next’ IDENT ’\)’ ; post\-state value\| ’\(’ ’field’ expr IDENT ’\)’ ; record field access\| ’\(’ ’contains’ expr expr ’\)’ ; array membership\| ’\(’ op expr\+ ’\)’op ::= ’\+’ \| ’\-’ \| ’\*’ \| ’/’\| ’=’ \| ’<’ \| ’<=’ \| ’\>’ \| ’\>=’\| ’and’ \| ’or’ \| ’not’ \| ’=\>’

The tokenizer is permissive about identifiers, which keeps state\-variable names and the ordering operators \(<=,\>=,=\>\) syntactically distinct without escaping\. The reserved head keywords are \{model,const,var,transition,params,pre,post,Enum,Record,Array,param,next,field,contains\} together with the operator set above\.

##### Strongly Typed Abstract Syntax Trees \(AST\)\.

Every DSL phrase constructed from the syntax above parses into a strongly typed abstract syntax tree \(AST\)\. Types are drawn from

DSLType::=Int∣Real∣Bool∣String∣Enum\(v1,…,vn\)∣Record\(fi:τi\)i=1n∣Array\(τ\),\\mathrm\{DSLType\}\\;::=\\;\\mathrm\{Int\}\\mid\\mathrm\{Real\}\\mid\\mathrm\{Bool\}\\mid\\mathrm\{String\}\\mid\\mathrm\{Enum\}\(v\_\{1\},\\ldots,v\_\{n\}\)\\mid\\mathrm\{Record\}\(f\_\{i\}\\\!:\\\!\\tau\_\{i\}\)\_\{i=1\}^\{n\}\\mid\\mathrm\{Array\}\(\\tau\),withRecordandArrayallowed to nest\. Expressions form the following AST node families:

A*transition*for toolt∈𝒯St\\in\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}binds a subset of the tool’s arguments to local names, and lists*precondition*and*postcondition*clauses\. Top\-level pre/post entries must be Bool\-typed\. The validator rejects bare value\-typed nodes such as\(pre \(var\_ref booking\_status\) \(literal "NONE"\)\), which would otherwise compile to non\-Boolean Z3 expressions; this discipline is enforced atomically before any in\-place edit is committed during the repair loop\.

##### Semantics of pre\- and postconditions

Letσ∈𝒳\\sigma\\in\\mathcal\{X\}denote a typed valuation of all state variables and constants, and letaadenote a tuple of concrete arguments to a tooltt\. A precondition clausePPover\(σ,a\)\(\\sigma,a\)evaluates by interpreting bare identifiers asσ\(⋅\)\\sigma\(\\cdot\)values and\(param p\)as the corresponding component ofaa\. A postcondition clauseQQover\(σ,σ′,a\)\(\\sigma,\\sigma^\{\\prime\},a\)additionally interprets\(next v\)asσ′\(v\)\\sigma^\{\\prime\}\(v\)\. The transition relationδt\(σ,σ′,a\)\\delta\_\{t\}\(\\sigma,\\sigma^\{\\prime\},a\)referenced in Sec\.[3\.2](https://arxiv.org/html/2605.06334#S3.SS2)is the conjunction of the corresponding pre\- and postcondition clauses together with a frame axiom on the unmentioned variables\.

##### Frame axiom\.

State variables that do not appear under any\(next⋅\\cdot\)in the postcondition list of a transition are*implicitly carried forward*\. Concretely, ifMod\(t\)⊆𝒳\\mathrm\{Mod\}\(t\)\\subseteq\\mathcal\{X\}is the set of variables mentioned via\(next⋅\\cdot\)intt’s postconditions, the transition relationδt\\delta\_\{t\}implicitly conjoins⋀v∉Mod\(t\)σ′\(v\)=σ\(v\)\.\\bigwedge\_\{v\\,\\notin\\,\\mathrm\{Mod\}\(t\)\}\\sigma^\{\\prime\}\(v\)=\\sigma\(v\)\.This eliminates a common source of malformed output, which is forgetting to restate identity for irrelevant variables\.

##### No\-op padding\.

Traces shorter than the bound are padded with a sentinel*no\-op*step that carries every state variable forward unchanged\. A*contiguous\-prefix*constraint enforces that once a step is inactive, all later steps are inactive too, so all encoded traces effectively have lengthi∈\[0,h\]i\\in\[0,h\]\.

### 0\.B\.3Bounded SMT compilation

For a fixed trace boundhh, the DSL compiles deterministically into the bounded encodingΦ^h\(𝒲S\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)\. The compiler introduces, for each stepi∈\{0,…,h−1\}i\\in\\\{0,\\ldots,h\{\-\}1\\\}and each transition indexk∈\{0,…,\|𝒯S\|\}k\\in\\\{0,\\ldots,\|\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}\|\\\}\(wherek=\|𝒯S\|k=\|\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}\|is the no\-op\):

- •a tool\-index variableti∈\{0,…,\|𝒯S\|\}t\_\{i\}\\in\\\{0,\\ldots,\|\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}\|\\\},
- •an activity flagactivei∈𝔹\\mathrm\{active\}\_\{i\}\\in\\mathbb\{B\}withactivei⇔\(ti≠\|𝒯S\|\)\\mathrm\{active\}\_\{i\}\\Leftrightarrow\(t\_\{i\}\\neq\|\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}\|\),
- •a per\-boundary state vectorσb\\sigma\_\{b\}forb∈\{0,…,h\}b\\in\\\{0,\\ldots,h\\\}, whereσ0\\sigma\_\{0\}is the initial state, and
- •a per\-step argument vectoraia\_\{i\}keyed on the local names declared in each transition’sparamsblock\.

##### Structural clauses\.

The compiler emits, for everyii, the constant pinning constraints, the enum\-domain bounds \(enums encode to consecutive integers in declaration order\), the tool\-index range, the activity\-flag definition, the contiguous\-prefix property¬activei⇒¬activei\+1\\neg\\mathrm\{active\}\_\{i\}\\Rightarrow\\neg\\mathrm\{active\}\_\{i\+1\}, and the no\-op carryti=\|𝒯S\|⇒σi\+1=σit\_\{i\}=\|\{\\mathcal\{T\}\}\_\{\\mathrm\{S\}\}\|\\Rightarrow\\sigma\_\{i\+1\}=\\sigma\_\{i\}\.

##### Transition clauses\.

For every transitiontkt\_\{k\}with preconditionsPk=\{pk1,…,pknk\}P\_\{k\}=\\\{p\_\{k\}^\{1\},\\ldots,p\_\{k\}^\{n\_\{k\}\}\\\}and postconditionsQk=\{qk1,…,qkmk\}Q\_\{k\}=\\\{q\_\{k\}^\{1\},\\ldots,q\_\{k\}^\{m\_\{k\}\}\\\}, the compiler emits, for everyii,

\(ti=k\)\\displaystyle\(t\_\{i\}=k\)⟹⋀j=1nk⟦pkj⟧\(σi,ai\),\\displaystyle\\;\\Longrightarrow\\;\\bigwedge\_\{j=1\}^\{n\_\{k\}\}\\llbracket p\_\{k\}^\{j\}\\rrbracket\(\\sigma\_\{i\},a\_\{i\}\),\(ti=k\)\\displaystyle\(t\_\{i\}=k\)⟹⋀j=1mk⟦qkj⟧\(σi,σi\+1,ai\),\\displaystyle\\;\\Longrightarrow\\;\\bigwedge\_\{j=1\}^\{m\_\{k\}\}\\llbracket q\_\{k\}^\{j\}\\rrbracket\(\\sigma\_\{i\},\\sigma\_\{i\+1\},a\_\{i\}\),\(ti=k\)\\displaystyle\(t\_\{i\}=k\)⟹⋀v∉Mod\(tk\)σi\+1\(v\)=σi\(v\),\\displaystyle\\;\\Longrightarrow\\;\\bigwedge\_\{v\\notin\\mathrm\{Mod\}\(t\_\{k\}\)\}\\sigma\_\{i\+1\}\(v\)=\\sigma\_\{i\}\(v\),where⟦⋅⟧\\llbracket\\cdot\\rrbracketdenotes the homomorphic translation from the AST to the corresponding Z3 expression\. Equality of an enum\-typed state variable with a string literal lowers to integer equality against the value’s declaration index; arithmetic operators preserve the sort that Z3 infers from their operands\.

##### Focused compilation\.

The validation loop of Sec\.[3\.3](https://arxiv.org/html/2605.06334#S3.SS3)requires, for a counterexample search, formulas of the form

Φλfwd=Φ^h\(𝒞λ\)∧Φ^bgh\(𝒲S\)∧Φ^posth\(𝒲S\)∧¬Φ^preh\(𝒲S\)\\Phi\_\{\\lambda\}^\{\\mathrm\{fwd\}\}\\;=\\;\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\)\\;\\land\\;\\widehat\{\\Phi\}\_\{\\mathrm\{bg\}\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)\\;\\land\\;\\widehat\{\\Phi\}\_\{\\mathrm\{post\}\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)\\;\\land\\;\\neg\\widehat\{\\Phi\}\_\{\\mathrm\{pre\}\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)together with the initial\-state constraint, that is, assigning the values ofIλI\_\{\\lambda\}to the corresponding variables\. Intuitively, a trace that satisfiesΦλfwd\\Phi\_\{\\lambda\}^\{\\mathrm\{fwd\}\}complies with𝒞λ\\mathcal\{C\}\_\{\\lambda\}while violating the pre\-conditions of tools𝒞λ\\mathcal\{C\}\_\{\\lambda\}focuses on, rather than via conflicts in unrelated parts of the world model\.

##### Witness extraction\.

Whenever the solver returnssat, a counterexample traceτ∗=\(\(ti0,ai0\),…,\(til−1,ail−1\)\)\\tau^\{\*\}=\(\(t\_\{i\_\{0\}\},a\_\{i\_\{0\}\}\),\\ldots,\(t\_\{i\_\{l\-1\}\},a\_\{i\_\{l\-1\}\}\)\)is recovered by reading back the values oftit\_\{i\}and the corresponding argument vector at each active step, stopping at the first no\-op \(which by the contiguous\-prefix invariant terminates the trace\)\. The state vectorσi0\\sigma\_\{i\_\{0\}\}at the witness’s first step is the concrete initial state that the solver chose, which is what the judge panel renders when reasoning about the disagreement\.

##### Determinism and concurrency\.

The compilation procedure depends only on the parsed AST, the trace bound, and the tool registry; given the same inputs, the same Z3 formula is produced\. Z3’s Python bindings share a single global C context, so all calls into the solver, including the encoding\-time sanity check that catches latent sort errors, serialise on a shared lock\.

### 0\.B\.4Check encoding

In order to run a cross\-validation check between𝒞λ\\mathcal\{C\}\_\{\\lambda\}and𝒲S\\mathcal\{W\}\_\{\\mathrm\{S\}\}also𝒞λ\\mathcal\{C\}\_\{\\lambda\}needs to be compiled into an SMT encodingΦ^h\(𝒞λ\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\)\. In order to formalize cross\-validation via a joint formula evaluated by Z3,Φ^h\(𝒞λ\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\)must reuse the step variables introduced in Sec\.[0\.B\.3](https://arxiv.org/html/2605.06334#Pt0.A2.SS3)\. The shared building block is a per\-step matching predicate

match\(t,a,i\)≡activei∧ti=idx\(t\)∧⋀\(p,v\)∈aai\(p\)=v,\\mathrm\{match\}\(t,a,i\)\\;\\equiv\\;\\mathrm\{active\}\_\{i\}\\,\\land\\,t\_\{i\}=\\mathrm\{idx\}\(t\)\\,\\land\\,\\bigwedge\_\{\(p,v\)\\in a\}a\_\{i\}\(p\)=v,which holds when stepiiis active, calls tooltt, and pins every argument specified inaa\(an emptyaacollapses the inner conjunction into a wildcard that matches any call tott\)\.

Each check formula compiles deterministically using this predicate\.atom\(call\(t,a\)\)expands to⋁imatch\(t,a,i\)\\bigvee\_\{i\}\\mathrm\{match\}\(t,a,i\)andatom\(no\_call\(t,a\)\)to its global negation;orreturns the disjunction of its operands\. The temporal forms quantify over ordered pairs of step indices:afterandbeforeare conditional — a match at stepjjimplies the existence \(resp\. non\-existence\) of a witnessing anchor call on the indicated side ofjj— whilefollowsandprecedesare positive obligations that require both events to occur with the demanded order\. The initial state declared by the test case is pinned by adding equalitiesσ0=σinit\\sigma\_\{0\}=\\sigma\_\{\\mathrm\{init\}\}toΦ^h\(𝒞λ\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{C\}\_\{\\lambda\}\), with enum strings lowered to their declaration index\.

### 0\.B\.5Illustrative Encoding and Cross\-Validation Example

We illustrate the introduced DSL and the subsequent cross\-validation loop on fragments of the hardware\-procurement running example from Fig\.[1](https://arxiv.org/html/2605.06334#S1.F1)\. We first show how a sampled document fragment is represented in the DSL and compiled into a bounded SMT world\-model encoding\. We then continue with a concrete refinement example showing how a conflicting trace reveals a missing check and how the repair is incorporated into the bounded check encoding\.

#### 0\.B\.5\.1Example DSL for World Model\.

Consider a fragment of the hardware\-procurement example containing a single Boolean state variable for inventory availability together with transitions for inventory checking, warehouse\-picker assignment, and purchase\-order creation\. The warehouse\-picker assignment requiresin\_stock, while purchase\-order creation requires that the legacy portal has already been consulted\. We omit tool argument names that do not participate in the world model, and use⊥\\bot/⊤\\topfor Boolean literals purely as typesetting shorthand\.

\(model\(var in\_stock Bool\)\(var legacy\_checked Bool\)\(var picker\_assigned Bool\)\(var po\_created Bool\)\(transition check\_inventory\(params \(item itm\)\)\(pre\)\(post \(= \(next in\_stock\) in\_stock\)\)\)\(transition check\_legacy\_portal\(params \(item itm\)\)\(pre\)\(post \(= \(next legacy\_checked\) true\)\)\)\(transition assign\_warehouse\_picker\(params \(item itm\)\)\(pre \(= in\_stock true\)\)\(post \(= \(next picker\_assigned\) true\)\)\)\(transition create\_purchase\_order\(params \(item itm\)\)\(pre \(= in\_stock false\)\(= legacy\_checked true\)\)\(post \(= \(next po\_created\) true\)\)\)\)

For trace boundh=4h=4, the compiler introduces step variablest0,…,t3∈\{0,…,4\}t\_\{0\},\\ldots,t\_\{3\}\\in\\\{0,\\ldots,4\\\}, where index44denotes the no\-op, boundary statesσ0,…,σ4\\sigma\_\{0\},\\ldots,\\sigma\_\{4\}over the four declared variables, and per\-step argument variables\. The transitionassign\_warehouse\_pickercontributes the following family of bounded constraints \(cf\. Sec\.[3\.3](https://arxiv.org/html/2605.06334#S3.SS3)\):

⋀i=0h−1\(\\displaystyle\\bigwedge\_\{i=0\}^\{h\-1\}\\Bigl\(ti=assign\_warehouse\_picker⟹\\displaystyle t\_\{i\}=\\texttt\{assign\\\_warehouse\\\_picker\}\\,\\Longrightarrow\\,σi\(in\_stock\)=⊤∧σi\+1\(picker\_assigned\)=⊤\),\\displaystyle\\sigma\_\{i\}\(\\texttt\{in\\\_stock\}\)=\\top\\;\\land\\;\\sigma\_\{i\+1\}\(\\texttt\{picker\\\_assigned\}\)=\\top\\Bigr\),together with the frame axiom on the three unmentioned variables of the same transition,

⋀i=0h−1\(\\displaystyle\\bigwedge\_\{i=0\}^\{h\-1\}\\Bigl\(ti=assign\_warehouse\_picker⟹\\displaystyle t\_\{i\}=\\texttt\{assign\\\_warehouse\\\_picker\}\\,\\Longrightarrow\\,⋀v∈\{in\_stock,legacy\_checked,po\_created\}σi\+1\(v\)=σi\(v\)\)\.\\displaystyle\\bigwedge\_\{v\\,\\in\\,\\\{\\texttt\{in\\\_stock\},\\,\\texttt\{legacy\\\_checked\},\\,\\texttt\{po\\\_created\}\\\}\}\\sigma\_\{i\+1\}\(v\)=\\sigma\_\{i\}\(v\)\\Bigr\)\.Combined with the analogous clauses for the remaining transitions and the structural constraints from Sec\.[0\.B\.3](https://arxiv.org/html/2605.06334#Pt0.A2.SS3), this yields the bounded world\-model encodingΦ^h\(𝒲S\)\\widehat\{\\Phi\}^\{h\}\(\\mathcal\{W\}\_\{\\mathrm\{S\}\}\)\.

#### 0\.B\.5\.2Cross\-Validation and Repair\.

We now continue with one concrete instance of the refinement loop from our implementation on the in\-stock hardware\-procurement scenario from Fig\.[1](https://arxiv.org/html/2605.06334#S1.F1)\. The scenario asks the agent to fulfill a request for one*Dell UltraSharp U2723QE*monitor for*Maya Torres*at*Lake Union Studio*, under the assumption that the request is still pending and no purchase order has yet been created\.

In an early version of this test case, the generated checks already required the calls

call\(check\_inventory,…\)andcall\(assign\_warehouse\_picker,…\),\\texttt\{call\}\(\\texttt\{check\\\_inventory\},\\ldots\)\\qquad\\text\{and\}\\qquad\\texttt\{call\}\(\\texttt\{assign\\\_warehouse\\\_picker\},\\ldots\),but did not yet include the missing ordering constraint thatcheck\_inventorymust precedeassign\_warehouse\_picker\. As a result, the preliminary bounded check encoding admitted traces in which both calls occurred, but in the wrong order\.

At the same time, the bounded world\-model encoding already captured the intended precondition for picker assignment\. For a fixed trace boundhh, it contained constraints of the form

⋀i=0h−1\(ti=assign\_warehouse\_picker⟹inventory\_checkedi=true\),\\bigwedge\_\{i=0\}^\{h\-1\}\\biggl\(t\_\{i\}=\\texttt\{assign\\\_warehouse\\\_picker\}\\implies\\texttt\{inventory\\\_checked\}\_\{i\}=\\mathrm\{true\}\\biggr\),that is,assign\_warehouse\_pickeris enabled only after inventory has already been checked\.

Z3 then returned the trace

τ∗=\(\\displaystyle\\tau^\{\*\}=\\bigl\(assign\_warehouse\_picker\(HWM2741,1\),\\displaystyle\\texttt\{assign\\\_warehouse\\\_picker\}\(\\texttt\{HWM2741\},1\),\\;check\_inventory\(Dell UltraSharp U2723QE\)\)\.\\displaystyle\\texttt\{check\\\_inventory\}\(\\texttt\{Dell UltraSharp U2723QE\}\)\\bigr\)\.This trace satisfies the preliminary check encoding, since both required calls occur\. However, the same tool callassign\_warehouse\_pickeralso activates the corresponding world\-model precondition\. At the first step ofτ∗\\tau^\{\*\}, we have

t0=assign\_warehouse\_pickerandinventory\_checked0=false,t\_\{0\}=\\texttt\{assign\\\_warehouse\\\_picker\}\\qquad\\text\{and\}\\qquad\\texttt\{inventory\\\_checked\}\_\{0\}=\\mathrm\{false\},so the world\-model implication

t0=assign\_warehouse\_picker⟹inventory\_checked0=truet\_\{0\}=\\texttt\{assign\\\_warehouse\\\_picker\}\\implies\\texttt\{inventory\\\_checked\}\_\{0\}=\\mathrm\{true\}is false at that step\. Hence its negation is satisfied byτ∗\\tau^\{\*\}\. Therefore, this trace simultaneously satisfies the incomplete preliminary checks and witnesses a violation of the world\-model precondition, which is exactly why it is returned as a conflicting trace\.

Given this conflict, the resolver added the missing ordering check

precedes\(check\_inventory,assign\_warehouse\_picker\)\.\\texttt\{precedes\}\(\\texttt\{check\\\_inventory\},\\texttt\{assign\\\_warehouse\\\_picker\}\)\.At the SMT level, this contributes the bounded constraint

⋁0≤i<j<h\(ti=check\_inventory∧tj=assign\_warehouse\_picker\),\\bigvee\_\{0\\leq i<j<h\}\\Bigl\(t\_\{i\}=\\texttt\{check\\\_inventory\}\\;\\land\\;t\_\{j\}=\\texttt\{assign\\\_warehouse\\\_picker\}\\Bigr\),which rules out traces in which the picker assignment occurs before the inventory check\. After recompiling the checks, the traceτ∗\\tau^\{\*\}no longer satisfies the check encoding, and the inconsistency disappears\.

This example illustrates one typical refinement step in the cross\-validation loop\. Even when the generated checks already capture the required calls, solver\-generated conflicting traces can reveal that an essential temporal constraint is still missing\. In this case, the repair consisted of adding the missing precedence check and re\-validating the resulting encoding against the same world model\.

## Appendix 0\.CAdditional implementation details

This section provides all implementation details ofMANTRAexcept for the details on the cross\-validation loop \(3in Fig\.[2](https://arxiv.org/html/2605.06334#S3.F2)\), which are detailed in App\.[0\.B](https://arxiv.org/html/2605.06334#Pt0.A2)\.

### 0\.C\.1Resumability

TheMANTRApipeline is designed to be resumable at every stage\. Intermediate outputs such as the parsed document graph, the sampled subgraph, the generated world model and checks, the round information are all saved to disk in a structured format\. If the process is interrupted or if a particular stage needs to be re\-run with modified parameters,MANTRAcan load the relevant intermediate files and resume from that point without having to repeat earlier stages\. This is particularly useful during development and debugging, as well as for iterating on specific samples or scenarios without having to regenerate everything from scratch\.

We built the document graph once and reused them to seed different runs, skipping the document parsing and graph construction stages\.

### 0\.C\.2LLM model choice

MANTRAis designed to be model\-agnostic and supports multiple LLM providers and models\. The configuration allows users to specify which model to use for each stage of the pipeline\. For the experiments in this paper, we used OpenAI’s GPT\-5\.4 with Codex for all stages\. The choice of the model is based on its strong performance, but we also experimented with other models during development\. We used open source models suchs as GPT\-OSS:20\.9B, Qwen3:8\.2B, and Qwen3\.6:36\.0B during development and found that GPT\-OSS and Qwen3 struggled with generating syntactically correct world models and checks, and could not recover even with retry\. Qwen3\.6 performed better and generated valid output more consistently, but the test cases tool more rounds of refinement to get validated\. In one of the runs, out of total 48 generated scenarios, only 11 got validated within 13 rounds of refinement\. In contrast, GPT\-5\.4 with Codex validated approximately 50% of the generated scenarios within 5 rounds of refinement\.

### 0\.C\.3Validation parameters

We set the trace boundhhto 16 for all experiments, which is a reasonable upper bound on the length of compliant traces in our domains\. The number of validation roundsnroundsn\_\{\\mathrm\{rounds\}\}is set to 5, which allows for multiple iterations of repair and conflict resolution while keeping the overall generation time manageable\.

### 0\.C\.4Pipeline Configuration

MANTRAoffers a highly configurable YAML\-based configuration system\. Below is an example config file used for generating the benchmarks in this paper, which specifies parameters for document parsing, tool integration, LLM usage, sampling strategies, validation settings, and logging preferences\.

run\_dir\_base: runsdocument\_path: environments/operations/instructions\.mdtools:format: python\_callablesource: assets\.operations\.toolsenv\_bindings:path: environments/operations/bindings\.yamlllm:provider\_default: codexdefault\_model: gpt\-5\.4ollama\_base\_url: $\{OLLAMA\_BASE\_URL:\-http://localhost:11434\}openai\_api\_key: $\{OPENAI\_API\_KEY:\-\}anthropic\_api\_key: $\{ANTHROPIC\_API\_KEY:\-\}codex\_executable: $\{CODEX\_EXECUTABLE:\-codex\}stage\_overrides:document\_extraction:provider: codexmodel: gpt\-5\.4tool\_relevance:provider: codexmodel: gpt\-5\.4scenario\_generation:provider: codexmodel: gpt\-5\.4check\_generation:provider: codexmodel: gpt\-5\.4smt\_schema\_prepass:provider: codexmodel: gpt\-5\.4smt\_full\_pass:provider: codexmodel: gpt\-5\.4conflict\_resolution:provider: codexmodel: gpt\-5\.4locus\_reassessment:provider: codexmodel: gpt\-5\.4fix\_repair:provider: codexmodel: gpt\-5\.4retry:max\_attempts: 16base\_delay\_s: 0\.5max\_repair\_attempts: 5cache:enabled: falsechunking:max\_chunk\_tokens: 4000sampling:max\_parallel\_samples: 3strategy: coverage\_islandsmin\_nodes\_per\_sample: 2max\_nodes\_per\_sample: 4scenarios\_per\_sample: 4max\_samples: 12validation:n\_rounds: 5z3\_trace\_bound: 16web\_ui:port: 8080logging:level: INFOconsole: falsefilename: session\.log

### 0\.C\.5Graph generation

To be able to generate test cases that cover a long document, we first build a graph representation of the document\. We ingest a markdown document and recursively build the graph by chunking the document\. We start with the whole graph as a node, and if the node content is longer than a certain threshold \(can be set via the config, based on the downstream LLM’s context window\), we chunk the node into smaller nodes by the next hierarchy level \(e\.g\., from section to subsection\) and add edges from the original node to the smaller nodes\. We repeat this process until we reach the leaf nodes, which are within the content threshold\. This deterministic process results in a hierarchical graph representation of the document with the edges representing the parent\-child relationships between the chunks\. We then build a table of contents \(ToC\) by traversing the graph and find implicit and explicit references between the nodes by prompting an LLM to identify them\. The LLM is provided with the content of a node and the ToC, and can read other nodes’ content if needed\. The extracted references are added as edges to the graph\. This graph representation allows us to sample coherent chunks of the document for test case generation, and also to identify related sections that may need to be covered together in a scenario\.

### 0\.C\.6Graph sampling strategies

The graph sampler \(1in Fig\.[2](https://arxiv.org/html/2605.06334#S3.F2)\) is a core part ofMANTRA’s benchmark generation pipeline\. It determines which paragraphs of the procedural document are used together for benchmark generation, which in turn affects tool and document coverage and hence influences the quality of the resulting compliance evaluation\. Once a node from the document graph𝒢𝒟\\mathcal\{G\}\_\{\\mathcal\{D\}\}is successfully sampled, its coverage counter is incremented\. This allows to prioritize low\-coverage nodes in successive batches to improve overall node coverage of the generated benchmark over the document graph\.MANTRAexposes four strategies:uniformis a random baseline;coverage\_drivenprioritizes least\-covered nodes;connected\_diverse\_coverageselects a low\-coverage, globally diverse seed and expands through graph neighbors to preserve local coherence; andcoverage\_islandsbuilds a sample from multiple locally coherent islands, which can exercise cross\-policy composition when a single scenario naturally touches separated parts of the document\. Our default experiments usecoverage\_islands\.

Intellagent\[[20](https://arxiv.org/html/2605.06334#bib.bib51)\]uses a sophisticated weight based sampling strategy to ensure good coverage of a policy graph \(different from our graph\) that is constructed from multiple policies with nodes weighted with complexity and edges representing likelihood that two policies co\-occur in the same interaction\. While we did not implement this exact strategy, ourcoverage\_islandssampler is designed to achieve similar goals of improving coverage and compositionality by sampling multiple connected subgraphs\. It will be interesting future work to use a weighted sampling strategy similar to the one in\[[20](https://arxiv.org/html/2605.06334#bib.bib51)\]\.

##### Sample Scenarios Across Sampling Strategies\.

TableLABEL:tab:scenarios\-cabinshows how the choice of sampling strategy andnodes\_per\_sampleinteracts to shape the scenarios that downstream stages consume\. To generate this table we have runMANTRAonly until stage1was completed, i\.e\.,a scenarioλ\\lambdawas generated\. The shown scenarios are, hence, not effected by the preceding refinement steps\. Columns are the three sampling strategies; rows arenodes\_per\_samplebuckets \(small, medium, large\)\. All cells shareseed = 42andmax\_samples = 2; the only knobs varied across the matrix arestrategy\(columns\) andmin/max\_nodes\_per\_sample\(rows:\[2,3\]\[2,3\],\[4,6\]\[4,6\],\[8,10\]\[8,10\]\)\. The cabin\-safety document graph 121 document chunks, while the operations document contains 12, so the*large*bucket is cut to 12 for the latter as the sampler refuses to draw more nodes than are eligible\. Each cell holds one cabin\-safety entry \(top, markedcabin\) and one operations entry \(bottom, markedops\); both are deterministically selected as the first sample emitted at that\(strategy,nodes\-per\-sample\)\(\\text\{strategy\},\\text\{nodes\-per\-sample\}\)configuration\. The italicised line above each scenario shows the document chunks the sampler selected\. We clarify that these scenarios were generated for the purpose of demonstrating effects of the different sampling strategies and these were not sent downstream to the generation, validation, repair phases\.

##### Observed Effect of Used Sampling Strategy\.

TableLABEL:tab:scenarios\-cabinallows us to draw the following conclusions\.

- •coverage\_drivenranks eligibles by tier \(least\-covered first\) and breaks ties at random\. The result is broad document coverage but topically loose pairings: in the cabin run, samples mix unrelated sections \(e\.g\.Turbulence\+Ramp escorting,Report closure\+Symptoms & treatment\)\. Scenarios end up multi\-pronged, weaving two unrelated themes per user request\.
- •coverage\_islandsdraws a globally diverse low\-coverage seed and then expands locally through graph neighbours, only starting a new island when the marginal coverage win exceeds local expansion\. Each sample becomes a coherent topical cluster: cabin samples look like \(Receipt of report\+Processing of report\), \(Symptoms\+Use of ground physician\), \(Bomb procedures\+Threat types\)\. Scenarios read as natural single\-workflow user requests\.
- •uniformignores coverage and graph structure entirely\. It produces the most jarring chunk pairings, which the scenario LLM is forced to splice into one user request — the cabin / medium cell pairsBomb handlingwithRapid decompressionandFinal descent, generating contrived multi\-topic scenarios that are unrealistic but stress\-test the downstream stages\.

We further observe that the size of the procedural document𝒟\\mathcal\{D\}matters for the diversity of generated scenarios\. While the cabin domain \(121 chunks\) shows clear strategy divergence; the operations domain \(12 chunks\) does not\. With so few eligible nodes, all three strategies converge onto largely the same chunk pairings, so the ops scenarios are nearly interchangeable across columns\. This shows that the used sampling strategy only becomes a meaningful differentiator when the document is large enough that a single sample cannot cover most of it\.

##### Implication for benchmark generation\.

For benchmarks that should resemble realistic single\-workflow user requests,coverage\_islandsis the strongest default — it gives broad coverage across batches while keeping each sample topically tight\.uniformis useful when the goal is adversarial cross\-section testing of how an agent handles unrelated obligations bundled together\.coverage\_drivensits between the two: balanced coverage at the cost of synthetic pairings\.

Table 3:Sample scenarios across sampling strategies \(columns\) andnodes\_per\_samplebuckets \(rows\)\. One cabin\-safety entry \(top,cabin\) and one operations entry \(bottom,ops\) per cell, deterministically selected as the first sample emitted at that configuration\. Cabin: 121\-chunk document; ops: 12\-chunk document — the ops*large*bucket clamps to 12 because the sampler refuses to draw more nodes than are eligible\. codex / gpt\-5\.4, seed = 42\.\#Nodescoverage\_drivencoverage\_islandsuniformsmall\(2–3\)\[cabin\]1\.3 GENERAL / 2\.10\.3 Non\-ambulatory PassengersWe’re fueling Flight SKX431 at Gate B12\. Wheelchair passenger Elena Petrov is in seat 18C\. Please update the safety log for this flight\.\[ops\]2\.1\. The “SSC” Conditional Rule / 4\.1\. Inventory Check LogicSet up Lena Ortiz’s finance workstation: install Northstar Ledger on SSC\-FIN\-07 with department head signature DHS\-8841 and verification IDs VER\-4418 and VER\-4421, and get 2 Meridian Glow monitors\.\[cabin\]5\.1\.3 Receipt of the Report / 5\.1\.4 Processing of the ReportI’m Elena Markovic, employee ID FA\-3187\. I need to report that galley cart GC\-17 broke loose during boarding on Flight LYN482 and struck seat row 12, and 1 flight attendant was bruised\.\[ops\]2\.1\. The “SSC” Conditional Rule / 2\. Digital Asset Acquisition \(Software & Licenses\)Install MapleEdit Community on workstation ENG\-BIRCH for Priya Natarajan\.\[cabin\]3\.11\.1 In\-Flight Medical Emergency Action Plan / 3\.6\.4 At DecompressionCase MED\-308\-07 on flight HJ308: Malik Thompson, 8, in seat 14A ate a peanut snack and now has facial swelling, hives, and wheezing\. No one has identified as a medical professional on board, and the emergency medical kit with epinephrine is in the forward galley\. Please handle this in\-flight medical emergency\.\[ops\]4\.2\. Delivery Requirements / 4\.1\. Inventory Check LogicPlease send one ApexForge Atlas desktop workstation, priced at $2,005, to Priya Natarajan’s home at 1187 Willow Creek Dr, Naperville, IL 60540\.medium\(4–6\)\[cabin\]2\.2\.8 Stowage of Cabin Baggage / 1\.3 GENERAL / 2\.10\.3 Non\-ambulatory Passengers / 2\.9\.6 Turbulence Terminology / 2\.9\.3 Turbulence Procedures During Flight / 2\.3\.1 Guidelines for EscortsOn flight AUR482, after 12 minutes of smooth air, the captain just called about the current ride conditions while service was underway\. The roller bag tagged BG\-447 is resting against the forward bulkhead at row 2, and tote BG\-448 was left in the lavatory\. Update the rest of the cabin crew using standard turbulence wording, give passengers the appropriate announcement, and move both bags into approved stowage\.\[ops\]3\.1\. Regional Overrides / 2\.1\. The “SSC” Conditional Rule / 4\.2\. Delivery Requirements / 3\.2\. The “Over\-Budget” Protocol / Global Operations & Procurement Standard Operating Procedure \(SOP\) v\.4\.2 / 4\.1\. Inventory Check LogicBook travel for Claire Dubois from Brussels to Paris for a 2\-day investor meeting\. My budget is $252 per day, and I want a business\-class flight\.\[cabin\]5\.1\.3 Receipt of the Report / 5\.1\.4 Processing of the Report / 5\.1\.6 Reporting Database / 5\.1\.2 How will staff report hazards? / 5\.1 HAZARD REPORTING/TRACKING / 5\.1\.5 Distribution of the Report ResultsPlease log a confidential cabin safety report for me: during boarding on Flight OA417 at Berlin Brandenburg on 2026\-05\-05 at 06:40, a SkyCater Services cart jammed in the forward left exit area and blocked access to the door for 11 minutes\. I am cabin crew member Leena Voss, employee ID CC\-2719, and I need my name withheld from anything shared outside Cabin Safety, but I do want confirmation that my report was received and a follow\-up when the investigation is done\.\[ops\]3\.1\. Regional Overrides / 3\.2\. The “Over\-Budget” Protocol / 3\. Travel & Expense Branching Logic / 5\. Compliance & Termination States / Global Operations & Procurement Standard Operating Procedure \(SOP\) v\.4\.2 / 4\. Hardware Procurement & LogisticsPlease book GlobalGate trip GGT\-55102 from Chicago to Toronto as a business\-class flight for 2 days\. The flight is 5 hours 40 minutes, the daily budget is $220, and the reason is “Urgent Business Need: Q2 board presentation\.”\[cabin\]4\.2\.4 Bomb Handling Procedures / 3\.6\.3 Rapid Decompression Subjective Signs / 3\.6\.6 Post\-Decompression Procedures / 2\.2\.3 Transportation of Passengers with Disabilities / 2\.9\.5 Anticipated Turbulence / 2\.7 FINAL DESCENT PROCEDURESOn flight HN517, we’re passing 10,000 feet on descent into Boston\. Passenger Rosa Delgado, ID PAX\-3307, uses a cane tagged CANE\-44, wants wheelchair help at the gate after landing, and is insisting on staying in exit row 21A\. Her grandson Mateo Delgado, ID PAX\-3308, is in 21B in a KidSecure 2 child restraint, and Rosa’s cane is still beside her seat\. Please handle this for landing\.\[ops\]4\.2\. Delivery Requirements / 3\. Travel & Expense Branching Logic / 2\. Digital Asset Acquisition \(Software & Licenses\) / 4\. Hardware Procurement & Logistics / 5\. Compliance & Termination States / 4\.1\. Inventory Check LogicPlease arrange 2 HelioView 32Q Monitor units for Dana Voss and deliver them to her residence at 1842 Cedar Lane, Ann Arbor, MI 48104\. The total quoted value is $2040, and the request number is HWR\-2026\-118\.large\(8–10\)\[cabin\]2\.2\.8 Stowage of Cabin Baggage / 1\.3 GENERAL / 2\.10\.3 Non\-ambulatory Passengers / 2\.9\.6 Turbulence Terminology / 2\.9\.3 Turbulence Procedures During Flight / 2\.3\.1 Guidelines for Escorts / 2\.2\.7 Cabin Baggage \(Carry\-On Luggage\) / 5\.1\.3 Receipt of the Report / 3\.11\.1 In\-Flight Medical Emergency Action Plan / 2\.2\.5 Seat DuplicationsBefore the door closes on BlueNorth 217 \(flight ID FLT\-BN217\), help passenger Nora Petrov in 14C with bag BG\-4411, a hard\-shell suitcase, and bag BG\-4412, a Styrofoam food cooler\. BG\-4411 weighs 4 kg, BG\-4412 weighs 3 kg, the underseat space at 14C has a restraint bar and BG\-4412 fits fully there, the overhead bin above row 14 is placarded for 22 kg and already has 18 kg in it, and the only other open spots are the forward lavatory and the bulkhead ledge at row 1\.\[ops\]2\.1\. The “SSC” Conditional Rule / 3\.1\. Regional Overrides / 3\.2\. The “Over\-Budget” Protocol / 3\. Travel & Expense Branching Logic / 4\.1\. Inventory Check Logic / 4\.2\. Delivery Requirements / 4\. Hardware Procurement & Logistics / 5\. Compliance & Termination States / Global Operations & Procurement Standard Operating Procedure \(SOP\) v\.4\.2 / 1\. IntroductionBook travel for Elena Marquez from Amsterdam to Brussels for June 17\-18, 2026\. She wants a business\-class flight, the daily budget is $255, and the reason on the request is Urgent Business Need for the NovaGrid antitrust hearing\.\[cabin\]5\.1\.3 Receipt of the Report / 5\.1\.4 Processing of the Report / 5\.1\.6 Reporting Database / 5\.1\.2 How will staff report hazards? / 5\.1 HAZARD REPORTING/TRACKING / 5\.1\.5 Distribution of the Report Results / 5\.1\.7 Report Closure / 5\.1\.1 What hazards should staff report? / 3\.5\.6 Galley Fire Procedure / 3\.5\.2 Classification of FireMeridian Air flight MA482 has smoke coming from the forward galley oven 2 minutes after meal service started, and the oven smoke detector has triggered once\. Please take care of the onboard emergency on MA482 and make sure Leah Ortiz’s report is handled confidentially by Cabin Safety afterward\.\[ops\]3\.1\. Regional Overrides / 3\.2\. The “Over\-Budget” Protocol / 3\. Travel & Expense Branching Logic / 5\. Compliance & Termination States / Global Operations & Procurement Standard Operating Procedure \(SOP\) v\.4\.2 / 4\. Hardware Procurement & Logistics / 4\.1\. Inventory Check Logic / 4\.2\. Delivery Requirements / 1\. Introduction / 2\.1\. The “SSC” Conditional RulePlease arrange travel for Elena Fischer from Brussels to Paris for a 2\-day regulatory meeting\. I want a flight in economy, my budget is $260 per day, and the justification on the request should read “Urgent Business Need”\.\[cabin\]4\.2\.4 Bomb Handling Procedures / 3\.6\.3 Rapid Decompression Subjective Signs / 3\.6\.6 Post\-Decompression Procedures / 2\.2\.3 Transportation of Passengers with Disabilities / 2\.9\.5 Anticipated Turbulence / 2\.7 FINAL DESCENT PROCEDURES / 1\.1 BACKGROUND / 3\.9\.2 Cabin Crew Incapacitation / 2\.2\.9 Child Restraint Systems \(CRS\) / 3\.6\.4 At DecompressionFlight ORC482 has started descent into Madrid\. Please handle Daniel Cho in 14A, who uses a cane and wants to remain in that seat, and Maya Torres in 13B, who wants her 25\-month\-old son Leo to stay in his child seat in 13A for landing\.\[ops\]4\.2\. Delivery Requirements / 3\. Travel & Expense Branching Logic / 2\. Digital Asset Acquisition \(Software & Licenses\) / 4\. Hardware Procurement & Logistics / 5\. Compliance & Termination States / 4\.1\. Inventory Check Logic / Global Operations & Procurement Standard Operating Procedure \(SOP\) v\.4\.2 / 3\.1\. Regional Overrides / 3\.2\. The “Over\-Budget” Protocol / 1\. IntroductionI’m Ravi Cho\. Please order one AtlasForge RackStation S2 server and send it to my home office at 417 Maple Crest Drive, Plano, TX 75024; it’s a residential address\. Purchasing quoted a 5\-day lead time, and the unit price is $2,480\.Table 3:Sample scenarios across sampling strategies andnodes\_per\_sample\(continued\)\.

### 0\.C\.7Syntactic Error Repair Loop

During the generation or repair of the world model and the checks, the LLM may produce output that does not conform to the expected DSL syntax or type constraints\. To address this, we implement a syntactic error repair loop that detects and corrects such issues before they propagate to the SMT encoding and validation stages\. When the LLM generates a candidate world model or check set, we first attempt to parse it into the corresponding AST\. If parsing fails due to syntax errors, we extract the error message and the offending text span, and feed this information back into the LLM with a prompt that includes the original generation, the error details, and instructions to produce a corrected version\. This loop continues until a valid AST is produced or a maximum number of attempts is reached\. We catch issues like use of wrong tool names, wrong argument types, etc\. By catching syntactic issues early, we ensure that the subsequent stages of compilation and validation operate on well\-formed inputs, which improves the overall robustness of the benchmark generation process\.

### 0\.C\.8Human intervention via WebUI

MANTRAincludes a web\-based user interface that allows human reviewers to inspect generated scenarios, checks, and counterexamples after the repair rounds are exhausted\. If the automated repair loop fails to resolve issues between the world model and the check set after a predefined number of iterations,MANTRAmarks the scenarioawaiting\_human\. We observed that such issues mainly arise when there is ambiguity in the document\. Then the human reviewer get the relevant information regarding the scenario, along with LLM generated suggestions for how to resolve the disagreement\. The human reviewer then can edit the world model or checks directly in the UI to fix the issue\. The human edits are considered final and are not subject to further automated repair \(the agents may still flag issues but any modification must be done via the WebUI\), but the scenario is re\-validated against the world model to ensure consistency\. This human\-in\-the\-loop mechanism allows us to handle edge cases and ambiguities that may be difficult for the LLM to resolve on its own, while still maintaining a high level of automation in the overall benchmark generation process\.

We observe that many repairs could be handled by the automated loop given sufficient time, but for the test suite, we discarded scenarios that required human intervention after 5 rounds of repair\. However, human intervention could be a valuable tool for generating benchmarks in cases where the document is particularly ambiguous or complex\.

### 0\.C\.9Reproducibility

While we share our implementation ofMANTRA, the generation of our provided benchmark involves non\-deterministic and non\-reproducible components because of the use of a proprietary LLM agent during the process\. However, we believe that one can obtain qualitatively similar benchmarks by running our code using the same LLM agent\.

## Appendix 0\.DEvaluating LLM Agents on Generated Benchmarks

### 0\.D\.1Success Rate Statistics

We evaluated the benchmark suite generated byMANTRAon 6 LLM models\. For a suite withNNtasks, we run each tasknntimes independently and letαi∈\{0,1,…,n\}\\alpha\_\{i\}\\in\\\{0,1,\\dots,n\\\}denote the number of successful runs on taskii\. We report the following standard success rates in Table[4](https://arxiv.org/html/2605.06334#Pt0.A4.T4):

- •Pass@𝟏\\mathbf\{1\}— the average single\-attempt success rate,1N∑i=1Nαin\\tfrac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\tfrac\{\\alpha\_\{i\}\}\{n\}\. This is the standard accuracy of one independent rollout\.
- •Pass@𝐤\\mathbf\{k\}— the probability that at least one ofkkindependent attempts on a task succeeds, averaged over the suite\. Following the unbiased estimator of Chen et al\.\[[7](https://arxiv.org/html/2605.06334#bib.bib54)\],Pass@k=1N∑i=1N\[1−\(n−αik\)/\(nk\)\]\\text\{Pass@\}k=\\tfrac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\\!\\left\[1\-\\binom\{n\-\\alpha\_\{i\}\}\{k\}\\big/\\binom\{n\}\{k\}\\right\]forn≥kn\\geq k\. We reportPass@𝟓\\mathbf\{5\}\.
- •𝐏𝐚𝐬𝐬^𝐤\\mathbf\{Pass\\text\{\\textasciicircum\}k\}— the probability that*all*kkindependent attempts on a task succeed, averaged over the suite\[[2](https://arxiv.org/html/2605.06334#bib.bib14)\]:𝐏𝐚𝐬𝐬^𝐤\\mathbf\{Pass\\text\{\\textasciicircum\}k\}=1N∑i=1N\(αik\)/\(nk\)=\\tfrac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\binom\{\\alpha\_\{i\}\}\{k\}\\big/\\binom\{n\}\{k\}\. This is a stricter consistency/reliability measure than Pass@kk\. We report𝐏𝐚𝐬𝐬^𝟓\\mathbf\{Pass\\text\{\\textasciicircum\}5\}\.

Table 4:Pass@𝟏\\mathbf\{1\},Pass@𝟓\\mathbf\{5\}, and𝐏𝐚𝐬𝐬^𝟓\\mathbf\{Pass\\text\{\\textasciicircum\}5\}by model and domain\. Values are percentages with 95% confidence intervals; bold indicates the best model for each domain and metric\.We see that Qwen3\.6 is the strongest overall model, leading the most metrics across domains, especially Cabin, Ops, Airline, and Retail\. Gemma4 is highly competitive, particularly on Cabin and Sasquatch, while Claude Haiku 4\.5 stands out on Sasquatch, Airline, and Telecom support\-style metrics\. GPT\-5\.4\-mini is the clear Telecom specialist and also has the best Retail support score\. Qwen3 and GPT\-OSS have isolated strengths—Qwen3 on Cabin support and GPT\-OSS on Ops pass@5—but are less consistent overall\. Retail appears to be the hardest domain, with generally lower scores than the others\.

We further see that domains taken fromτ\\tau\-bench seem harder for all models even thoughCabinhas a much longer document and a much larger tool set\. Yet, this discrepancy can be explained via the*complexity*of the tools used in the different domains, as reported in Table[5](https://arxiv.org/html/2605.06334#Pt0.A4.T5)\. This reveals that the difficulty of the benchmark \(low success rate in Table[4](https://arxiv.org/html/2605.06334#Pt0.A4.T4)\) correlates with the tool complexity \(highest values in Table[5](https://arxiv.org/html/2605.06334#Pt0.A4.T5)\)\.

Table 5:Tool argument structural complexity by domain\.*Tools*is the registry size;*Avg\. params*and*Avg\. required*are mean parameter counts per tool;*Enum*/*Array*count parameters whose typed slot is an enumeration or array \(records do not occur in any registry\);*Complex*counts tools with≥4\\geq 4parameters;*Max*is the largest parameter count of any tool in the domain\.
### 0\.D\.2Analysis of Compliance Failures

To understand*how*agents fail onMANTRAbenchmarks, we aggregate the per\-check verdicts emitted by the deterministic evaluator across all sweep runs \(66domains×\\times66models×\\times55attempts per test case\)\. Each failed check is assigned a*failure category*based on the structural reason the evaluator rejected the trace\. The categories below partition the10,89510\{,\}895failed checks observed in our corpus\.

##### Aggregate failure\-category distribution\.

Table[6](https://arxiv.org/html/2605.06334#Pt0.A4.T6)summarizes the distribution\. The dominant failure mode, by a wide margin, isMissing\-Required\-Call: an atom of the formCALL\(t,a\)\\textbf\{\{CALL\}\}\(t,a\)appears in the check set but the agent never invokedttwith arguments matchingaa\. Together,Missing\-Required\-CallandMissing\-Anchor\(the absence of the anchor tool of anAFTER,BEFORE,FOLLOWS, orPRECEDESclause\) account for76%76\\%of all failed checks\. Pure ordering violations—an agent that called the right tools but in the wrong order—are rare \(2%2\\%\), and compoundORclauses where every disjunct fails are rarer still \(1%1\\%\)\. The benchmark, in aggregate, is therefore primarily measuring whether agents*call the right tools at all*, not whether they sequence them correctly\.

Table 6:Distribution of failure categories across all failed checks in the evaluation sweep\.Missing\-Required\-CallandMissing\-Anchortogether account for76%76\\%of failures, indicating that agents most commonly fail by*skipping*a mandated tool call rather than by*misordering*the calls they do make\.
##### The skipped\-lookup pattern\.

The dominance ofMissing\-Required\-Callholds within every domain \(4949–67%67\\%of that domain’s failures\) and every model \(5151–71%71\\%\)\. Strikingly, the tools that account for the largest counts ofMissing\-Required\-Callfailures are almost all read\-style information\-gathering primitives:check\_inventory\(449449, operations\),get\_item\_details\(615615, retail\),get\_order\_details\(357357, retail\),get\_flight\_status\(306306, airline\), inspect\_patient\(204204, sneaky\-sasquatch\), andprocess\_hazard\_report\(227227, cabin safety\)\. Models tend to jump to write\-style or commit\-style tool calls without first running the lookups that the manual mandates as a precondition\. Concretely, the benchmark is in large part measuring whether models read before they write\.

##### Domain\-specific over\-action\.

The complementary failure mode—Forbidden\-Call—is concentrated on a small number of write\-style tools whose preconditions the agent skipped\. Table[7](https://arxiv.org/html/2605.06334#Pt0.A4.T7)lists the top forbidden\-tool hits per domain\. In the benchgen domains, these are exactly the tools that mandate gating reads:create\_purchase\_orderrequires an inventory or legacy\-portal check first,book\_travelrequires policy verification, andrun\_vitamin\_testshould not be ordered when the scenario does not call for it\. Across the threetau2domains, the dominant forbidden tool istransfer\_to\_human\_agents: agents prematurely escalate to a human when the policy expects them to handle the request themselves\.

Table 7:Top write\-style tools accounting forForbidden\-Callfailures, by domain\. These tools share the property that the manual prescribes a gating read before they may fire; the failure mode is the agent committing without that read\.
##### Model\-level pathologies\.

Beyond the per\-check categories, we observe model\-specific failure modes that manifest as crashes or empty traces rather than wrong tool calls\. Table[8](https://arxiv.org/html/2605.06334#Pt0.A4.T8)summarizes them\.GPT\-OSS:20\.9Bhas the highest crash rate at7\.6%7\.6\\%overall, rising to23\.4%23\.4\\%on the operations domain\. On the other end of the spectrum,Claude Haiku 4\.5crashes in fewer than0\.5%0\.5\\%of attempts overall, but on the cabin\-safety domain21%21\\%of its failures are*empty traces*—the model returns natural\-language text without invoking any tool\. We did not observe a comparable empty\-trace rate for any other \(model, domain\) pair\. We conjecture that the unusually long cabin\-safety policy document triggers a passivity or refusal mode in the Haiku system prompt; investigating the trigger is left for future work\.

Table 8:Model\-level crash rates \(no parseable trace\) and per\-model failure modes\. Three qualitatively different weaknesses appear: crashing \(GPT\-OSS\), staying silent \(Claude Haikuon cabin\), and committing before looking up \(Gemma4,Qwen3:8\.2B\)\.
##### Lookup discipline distinguishes strong from weak models\.

A behavioral statistic that cleanly separates the strongest from the weakest model on each domain is the*premature\-write rate*: the fraction of attempts in which a write\-style tool was called before any read\-style tool\. Table[9](https://arxiv.org/html/2605.06334#Pt0.A4.T9)reports this rate for the five non\-crashing models on the five domains where the gap is most pronounced\.Qwen3\.6:36Bis the only open model whose premature\-write rate is0%0\\%ontau2\-airlineandtau2\-retail—exactly the two domains on which it dominates the leaderboard\. Restricted to the attempts that pass ontau2\-retail,Qwen3\.6:36Baverages4\.134\.13read\-style calls per passing attempt, while every other model averages1\.841\.84–2\.562\.56\. The win is not driven by unique problem solving \(per domain, only0–11test cases are uniquely solved byQwen3\.6:36B\); it is driven by raising the floor through a consistent read\-before\-write habit\. This habit is plausibly teachable to smaller models via prompting or fine\-tuning\.

Table 9:Premature\-write rate \(%\): fraction of attempts in which a write\-style tool was invoked before any read\-style tool\. Lower is better\.Qwen3\.6:36Bis the only open model that never writes before looking up ontau2\-airlineortau2\-retail, which coincides with its leaderboard dominance on those domains\.
##### Illustrative head\-to\-head trace\.

Test casesmp\_003\_002ontau2\-retail\(“swapITM\-7210forITM\-7216on orderORD\-58804, charge difference to payment methodPM\-CC\-7730”\) exhibits the lookup\-discipline contrast in microcosm\.Qwen3\.6:36Bpassed all5/55/5attempts with the trace

find\_user\_id\_by\_email→get\_user\_details→\\displaystyle\\texttt\{find\\\_user\\\_id\\\_by\\\_email\}\\to\\texttt\{get\\\_user\\\_details\}\\toget\_order\_details→get\_item\_details\(7210\)→\\displaystyle\\texttt\{get\\\_order\\\_details\}\\to\\texttt\{get\\\_item\\\_details\}\(\\texttt\{7210\}\)\\toget\_item\_details\(7216\)→get\_product\_details\\displaystyle\\texttt\{get\\\_item\\\_details\}\(\\texttt\{7216\}\)\\to\\texttt\{get\\\_product\\\_details\}six read\-only lookups followed by no write at all \(the order is not in a modifiable state, which the lookups reveal\)\. BothQwen3:8\.2BandGemma4:8\.0Bfailed all5/55/5attempts with the same shorter trace,

get\_order\_details→modify\_pending\_order\_items\(…\),\\texttt\{get\\\_order\\\_details\}\\to\\texttt\{modify\\\_pending\\\_order\\\_items\}\(\\ldots\),an immediate write that violates the manual’s gating\-read requirement\. Both theMissing\-Required\-Callchecks \(for the skipped item lookups\) and theForbidden\-Callcheck onmodify\_pending\_order\_itemsfire, producing the characteristic compound failure pattern visible in the aggregate statistics\.

## Appendix 0\.EManual Benchmark Inspection and Filtering

UsingMANTRA, we generated test cases across 6 domains\. The validated \(both forward and backward\) test cases were then manually inspected for consistency and quality\. During this process, we identified some minor issues in the generated test cases that were not caught by the automated validation process\. The issues highlight engineering challenges instead of showing problems with the pipeline itself, and they were mostly resolved by manual filtering or minor edits to the checks\.

We describe these issues here for transparency and to inform future improvements to the benchmark generation process\.

##### Forcing only one tool call\.

In some cases, the generated checks forced at most one tool call in the trace\. Although we do not have a specific rule in the grammar to restrict the number of times a tool call can be made, the one\-call\-only constraint can be imposed using aNO\-CALLtoolBEFORECALLtoolcheck and aNO\-CALLtoolAFTERCALLtoolcheck\. While this is necessary for tools likegive\_medicineandperform\_surgeryin thesneaky\-sasquatchdomain, to ensure that unnecessary medical interventions are not performed, some of the generated test cases had this constraint on tools likerun\_mineral\_test\. While it is not necessarily wrong to have only one call torun\_mineral\_test, it is not a clear requirement as per the SOP\. In these cases, we manually removed the check that forced at most one call\. We believe that this issue is inherently difficult for the LLM to resolve, as it requires a nuanced understanding of the SOP and the specific requirements of each tool, which may not be explicitly stated in the document\.

##### Convention mismatch in string arguments\.

In someairlinetest cases, the checks pinned thebook\_flighttool call to specific string arguments that were present in the scenario text but did not match the expected format in the tool description\. For example, the scenario might mention a flight from “New York” to “Los Angeles”, while the tool expects IATA codes like “JFK” and “LAX”\. This mismatch caused some check failures that had to be manually filtered out\. These issues could not be caught by our repair loop because the checks do not violate any logical constraints and the SOP does not explicitly require the use of IATA codes\.

Such issues can be caught in the future by adding a post\-processing step that finds traces in𝒲∧𝒞\\mathcal\{W\}\\wedge\\mathcal\{C\}, to find a satisfying trace and executing the tract to identify any tool call errors\.

##### Trivial samples\.

Some of the generated test cases were deemed trivial upon manual inspection, since they were generated from chunks that contained very basic information or instructions that did not require any complex reasoning or tool use\. For example, for the following chunk fromoperationsdomain document, the generated scenarios did not have meaningful tasks\.

\#\#\# 1\. IntroductionThis document outlines the mandatory procedures for all personnelregarding the acquisition of digital assets, physical hardware,and travel arrangements\. Failure to adhere to these branching logicrequirements will result in a "Rejection State" for the procurementticket\.We discarded such test cases to maintain the overall quality and challenge of the benchmark\. Such issues can be mitigated by choosing a better sampling strategy that avoids chunks with trivial content\.

##### Pinned string arguments in checks\.

In some cases, the generated checks included specific string arguments in the tool calls, such as ‘reason’ for flight cancellation, or ‘delivery\_address’ for hardware procurement\. These could not be caught by the repair loop as they do not violate any logical constraints, but we removed them manually since they made the checks brittle\. For example, if a check requires that the ‘delivery\_address’ argument for a hardware delivery must be ‘123 Main St, London, UK’, and a trace with minor deviations from this address \(such as ‘123 Main St\., London, United Kingdom’\) would fail\.

We believe that this issue is again difficult to LLMs to avoid, but a better prompt design that teaches LLMs to avoid pinning string arguments such as ‘delivery\_address’ without affecting the string arguments like ‘user\_id’\.

A related issue was encountered in thecabindomain where a chunk mentioned that passengers should not be informed about bomb threats\. Although the domain provides a tool for making announcements, no check can appropriately capture the requirement\. For example,NO\-CALLmake\_announcement would be too broad, andNO\-CALLmake\_announcement\(*"We have received a bomb threat"*\) is effectively useless\.

##### Missing relevant information in a sample\.

Insneaky\-sasquatchdomain, a sample contained the following chunk\.

\#\#\# Unknown Illness — "I’m not sure what’s wrong" /"I dunno doc, I just\.\.\. dunno"1\. Perform a close\-up visual inspection for any visible clues\.2\. Run blood, cholesterol, and vitamin tests in the lab\.3\. Run any additional lab tests that seem relevant\.4\. If nothing shows up in the lab work, use the X\-Ray Machine to checkfor foreign bodies or broken bones\.5\. If the X\-Ray is clean, take the patient’s temperature andblood pressure\.6\. Special cases to keep in mind during this workup:\- \*\*Diabetes:\*\* the patient needs \*\*Insulin\*\*or a food high in glucose \(for example, Cookies\)\.\- \*\*Glitched lab result:\*\* if the results screenshows nothing under "problems found" but the white bloodcell count is too high, the patient has an \*\*Infection\*\* andneeds \*\*Antibiotics\*\*\.

A generated scenario required diagnosing a patient suffering from diabetes, so it created a check that required callinggive\_foodwith a pinned argument ‘Cookies’\. However, there exists another chunks that mentions the food items that are high in glucose\.

\#\#\# Glucose\- Corn\- Eggplant\- Soy Sauce\- Onion\- Yogurt\- Cake\- Soda \(can\)The check should have allowed any of these food items to be given, but it only allowed ‘Cookies’ due to the pinned argument\. We removed the pinned argument manually, and allowed any food item that is high in glucose to be given\.

This is again a sampling issue, as the document graph indeed contains a reference edge between the chunk that mentions diabetes and the chunk that lists glucose\-rich food items, but the sampling strategy did not include both chunks in the same sample\.

## Appendix 0\.FInteresting Test Cases

Diagnose\-only scenario vs\. treatment\-mandating SOPTEST CASE:sneaky\-sasquatchEntities:Eli Barker,patient\_id=PT\-7004,case\_id=CM44,deficiency=vitamin\_b12\.Scenario & ChecksScenarioPatient Eli Barker, patient IDPT\-7004, says, “Been eating terribly lately” and wants to know exactly what nutrient is missing*before we make any diet changes*\. Just sort out the deficiency for his chart\.Trace checks1CALLrun\_vitamin\_test\(PT\-7004\)2CALLrun\_mineral\_test\(PT\-7004\)Relevant ChunkSymptom:“Been eating terribly lately\.” Treatment:A food item from your inventory that is high in the missing vitamin or mineral\. Note:Run vitamin and mineral tests in the lab first so you know exactly which nutrient the patient is short on,*then deliver the matching food\.*A genuine scenario\-vs\-SOP design conflict\. The scenario deliberately framed a “diagnose only” task, but the SOP mentions that after vitamin and mineral tests are completed, an appropriate food should be delivered\. Here the expected behavior is to run the tests but not deliver anything, as the scenario only asks for the diagnosis\. This tests whether the model can correctly identify the relevant SOP instructions and not over\-apply them when the scenario doesn’t call for it\.

Misleading scenario against SOPTEST CASE:sneaky\-sasquatchEntities:Lena Cross,patient\_id=PT\-CINDER,inspection\_clue=faint\_glow\.Scenario & ChecksScenarioBefore discharge, give Lena Cross, patient IDPT\-CINDER, age 31, some Eye Drops for the irritation she has been complaining about\.Trace checks1CALLinspect\_patient\(PT\-CINDER\)2CALLgive\_medicine\(PT\-CINDER, bed\_rest\)AFTERCALLinspect\_patient\(PT\-CINDER\)3CALLgive\_medicine\(PT\-CINDER, bed\_rest\)Relevant ChunkPink Eye Inspection clue:Redness around the eye\. Treatment:Eye Drops\. Radiation Poisoning Inspection clue:Green skin with clicking sounds or a faint glow \(a Geiger counter can be heard nearby\)\. Treatment:Anti\-Radiation Medication\.This test checks whether the model can correctly identify the relevant SOP instructions when the scenario contains misleading information\. The scenario describes a patient with irritation, and suggests Eye Drops\. However, on inspection, a faint glow actually points to Radiation Poisoning, which requires a different treatment\. The model should run the patient inspection first, identify the correct condition based on the clue, and then administer the appropriate treatment, rather than being misled by the initial description\.

## Appendix 0\.GAdditional Test Cases

Cabin Safety – case\_029TEST CASE:cabin\-safetydataset\_hf: case\_029Scenario & ChecksScenarioPlease handle passenger Victor Alvarez, passenger ID PAX\-55209\. After 2 instructions from the cabin crew during climb, he is still refusing to stow his roller bag and is shouting profanities at the flight attendant\. No one has been injured, but the aisle remains partially blocked\.Trace checks1check\_000:CALLclassify\_passenger\_misconduct\_level\(passenger\_id=PAX\-55209\)2check\_001:CALLnotify\_pic\_of\_unruly\_passenger\(passenger\_id=PAX\-55209\)3check\_002:CALLattempt\_unruly\_passenger\_deescalation\(passenger\_id=PAX\-55209\)4check\_003:CALLcoordinate\_unruly\_passenger\_report\(passenger\_id=PAX\-55209\)5check\_004:CALLnotify\_pic\_of\_unruly\_passenger\(passenger\_id=PAX\-55209\)PRECEDESCALLattempt\_unruly\_passenger\_deescalation\(passenger\_id=PAX\-55209\)6check\_005:CALLnotify\_pic\_of\_unruly\_passenger\(passenger\_id=PAX\-55209\)PRECEDESCALLcoordinate\_unruly\_passenger\_report\(passenger\_id=PAX\-55209\)7check\_006:CALLclassify\_passenger\_misconduct\_level\(passenger\_id=PAX\-55209\)PRECEDESCALLcoordinate\_unruly\_passenger\_report\(passenger\_id=PAX\-55209\)8check\_007:NO\-CALLrecord\_unruly\_passenger\_restraint\(\)9check\_008:NO\-CALLrequest\_law\_enforcement\_for\_unruly\_passenger\(\)Relevant Chunk3\.7\.1 General\.An unruly passenger is one whose behaviour poses a threat to the safety of the flight and/or its passengers, crew, or properties\. \(Note: This behaviour is distinguished from attempted hijacking, skyjacking, or bomb threats\)Passenger misconduct involves behaviour that poses a threat to the safety of the flight, its passengers, crew, or property\. Passenger misconduct can range from rude and boorish behaviour to physical assault\. Operators should have a zero tolerance for physical assaults against its crewmembers or agents\. Refer to the "Misconduct and Category and Action Table" in Appendix D\.During the flight, inform the PIC whenever a potential unruly passenger is on board\. Flight deck crew should avoid dealing directly with such passengers as they are needed to fly the airplane\. If all efforts to contain such an unruly passenger fail and a threat to safety is identified, immediately advise the PIC who shall evaluate the situation and decide on the course of action\.If the PIC has reasonable grounds to believe that a person has committed or is about to commit an offence or act which may jeopardize the safety of the airplane, the PIC might impose upon the person reasonable measures, including restraint, to protect the safety of the airplane, its passengers, crew, and cargo\.There are several levels/categories of passenger misconduct, as follows:\- The most benign are those where a crewmember requests compliance with instructions and the passenger complies with the request; no further action is required by the crewmember, nor does this warrant a report to the flight deck, the carrier or the regulatory authority \- The second level are those where a crewmember requests the passenger to comply, but the passenger continues disturbance which interferes with cabin safety, such as continuation of verbal abuse or continuing refusal to comply with applicable regulations \- The most severe cases of passenger misconduct are those where a crewmember’s duties are disrupted by the continuing passenger interference, a passenger or crewmember is injured or subjected to a credible threat of injury, an unscheduled landing is made, and/or restraints are necessary3\.7\.2 Unruly Passenger Handling Procedures\.Procedures for handling the misconduct vary with the severity of the event\. An action/procedure table form of these procedures is contained in Appendix D, Section D\.8\.For the second level described above:\- Cabin crewmember and PIC should coordinate efforts to defuse the situation and the cabin crewmember completes a report of the disturbance \- PIC and the cabin crewmember should coordinate issuance of the report to the passenger and other appropriate actions; the PIC signs the report, which indicates concurrence with providing the report to the passenger and distributing it upon landing \- After landing, the cabin crewmember should provide the completed report to local station personnel; depending on the carrier operations procedures and local regulations, the PIC may also be required to submit a separated report of the incidentFor the most severe incidents, the procedures above should be followed by:\- Notification by the PIC to the operator dispatch of the name and general description of the passenger, seat number and the nature of the misconduct, and request law enforcement officials meet the flight \- Upon landing, the PIC files a complaint with the local law enforcement agency \- The operator dispatch obtains the name and general description of the passenger, seat number and nature of complaint, informs the landing station, and requests local management notify the appropriate law enforcement officials \- Operator dispatch files necessary paperwork \- The landing station where passenger exits the aircraft or is removed should request an appropriate law enforcement official meet the flight \- The landing station should complete all appropriate paperworkSMT ModelSMT id: smt\_smp\_003\_v012State vars: flight\_phase: enum\(ON\_GROUND, IN\_FLIGHT, AFTER\_LANDING\); passenger\_misconduct\_level: enum\(BENIGN\_COMPLIED, NON\_SAFETY\_GROUND, SECOND\_LEVEL, MOST\_SEVERE\); misconduct\_level\_checked: Bool; unruly\_passenger\_reported: Bool; pic\_notified: Bool; crew\_communication\_recorded: Bool; deescalation\_attempted: Bool; continued\_disturbance\_report\_coordinated: Bool; passenger\_restrained: Bool; law\_enforcement\_requested: BoolTransitions: attempt\_unruly\_passenger\_deescalation\(passenger\_id\-\>pid\): pre \(flight\_phase = "ON\_GROUND" or flight\_phase = "IN\_FLIGHT"\); post deescalation\_attempted’ = trueclassify\_passenger\_misconduct\_level\(passenger\_id\-\>pid\): pre true; post misconduct\_level\_checked’ = truecoordinate\_unruly\_passenger\_report\(passenger\_id\-\>pid\): pre \(passenger\_misconduct\_level = "SECOND\_LEVEL" or passenger\_misconduct\_level = "MOST\_SEVERE"\); post continued\_disturbance\_report\_coordinated’ = truenotify\_pic\_of\_unruly\_passenger\(passenger\_id\-\>pid\): pre \(flight\_phase = "IN\_FLIGHT" or passenger\_misconduct\_level = "NON\_SAFETY\_GROUND" or passenger\_misconduct\_level = "SECOND\_LEVEL" or passenger\_misconduct\_level = "MOST\_SEVERE"\); post pic\_notified’ = truerecord\_crew\_communication\(flight\_id\-\>fid, from\_role\-\>from, message\-\>msg, to\_role\-\>to\): pre true; post crew\_communication\_recorded’ = truerecord\_unruly\_passenger\_restraint\(passenger\_id\-\>pid\): pre pic\_notified = true, passenger\_misconduct\_level = "MOST\_SEVERE"; post passenger\_restrained’ = truereport\_unruly\_passenger\(passenger\_id\-\>pid\): pre true; post unruly\_passenger\_reported’ = truerequest\_law\_enforcement\_for\_unruly\_passenger\(passenger\_id\-\>pid\): pre misconduct\_level\_checked = true, pic\_notified = true, continued\_disturbance\_report\_coordinated = true, passenger\_misconduct\_level = "MOST\_SEVERE"; post law\_enforcement\_requested’ = true

Cabin Safety – case\_010TEST CASE:cabin\-safetydataset\_hf: case\_010Scenario & ChecksScenarioCategorize the active bomb threat for Kestrel Air flight KST482\.Trace checks1check\_000:CALLclassify\_bomb\_threat\(flight\_id=KST482\)Relevant Chunk4\.2\.1 Types of Bomb Threats\.\- \*\*Specific\*\*: Threat is identified by flight number, departure time, or bomb location and includes positive identification to aircraft \- \*\*Non\-specific\*\*: Threat is one in which the caller may identify the flight by destination or origin, flight number or time of departure or arrivalDetermination of bomb threat type should be conducted by Dispatch and/or PIC or management\.\> \*\*NOTE\*\*: While a majority of bomb threats are hoaxes, the threat should be treated as legitimate\.SMT ModelSMT id: smt\_smp\_002\_v001State vars: bomb\_threat\_type: enum\(SPECIFIC, NON\_SPECIFIC\); bomb\_threat\_type\_checked: BoolTransitions: classify\_bomb\_threat\(flight\_id\-\>fid\): pre true; post bomb\_threat\_type\_checked’ = true

Operations – case\_006TEST CASE:operationsdataset\_hf: case\_006Scenario & ChecksScenarioPlease arrange a 4\-day economy flight from Boston to Toronto for Elena Marquez so she can attend the NorthBridge compliance workshop, and keep the daily budget at $245\.Trace checks1check\_000:CALLbook\_travel\(daily\_budget\_usd=245, destination=Toronto, num\_days=4, origin=Boston, travel\_class=economy, travel\_mode=flight\)2check\_001:NO\-CALLbook\_travel\(daily\_budget\_usd=245, destination=Toronto, num\_days=4, origin=Boston, reason=NorthBridge compliance workshop, travel\_class=economy, travel\_mode=flight\)BEFORECALLbook\_travel\(daily\_budget\_usd=245, destination=Toronto, num\_days=4, origin=Boston, reason=NorthBridge compliance workshop, travel\_class=economy, travel\_mode=flight\)3check\_002:NO\-CALLbook\_travel\(daily\_budget\_usd=245, destination=Toronto, num\_days=4, origin=Boston, reason=NorthBridge compliance workshop, travel\_class=economy, travel\_mode=flight\)AFTERCALLbook\_travel\(daily\_budget\_usd=245, destination=Toronto, num\_days=4, origin=Boston, reason=NorthBridge compliance workshop, travel\_class=economy, travel\_mode=flight\)Relevant Chunk1\. Introduction\.This document outlines the mandatory procedures for all personnel regarding the acquisition of digital assets, physical hardware, and travel arrangements\. Failure to adhere to these branching logic requirements will result in a "Rejection State" for the procurement ticket\.3\. Travel & Expense Branching Logic\.All travel must be booked through the "GlobalGate" portal\.SMT ModelSMT id: smt\_smp\_004\_v002State vars: booking\_status: enum\(NONE, CONFIRMED\)Transitions: book\_travel\(daily\_budget\_usd\-\>b, destination\-\>d, num\_days\-\>n, origin\-\>o, reason\-\>r, travel\_class\-\>c, travel\_mode\-\>m\): pre true; post booking\_status’ = "CONFIRMED"

Operations – case\_016TEST CASE:operationsdataset\_hf: case\_016Scenario & ChecksScenarioPlease get one Dell Latitude 5540 Laptop for Priya Nand at her home office, 84 Willow Bend, Brookline, MA 02445\. Use request label OPS\-4471\. The hardware total is $2,150\.Trace checks1check\_000:CALLassign\_warehouse\_picker\(item\_id=HW\-001, quantity=1\)2check\_001:CALLset\_delivery\_options\(is\_residential=true, item\_value=2150\)3check\_002:NO\-CALLcreate\_purchase\_order\(\)4check\_003:NO\-CALLcheck\_legacy\_portal\(\)Relevant Chunk4\.1\. Inventory Check Logic\.\* \*\*If\*\* the requested hardware is "In\-Stock": \* Assign a "Warehouse Picker" task\. \* Set status to "Internal Fulfillment\."\* \*\*If\*\* the requested hardware is "Out\-of\-Stock": \* \*\*If\*\* the lead time is < 7 days: Place a "Priority PO\." \* \*\*If\*\* the lead time is \> 7 days: The agent must \*\*eventually\*\* check for a refurbished unit in the "Legacy Portal\." \* \*\*Only if\*\* no refurbished unit is found after 24 hours \(simulated wait\) should a "Standard PO" be issued\.4\.2\. Delivery Requirements\.For all hardware over $2,000:\* A "Signature Required" tag must be added\. \* \*\*If\*\* the delivery address is a residential zone, the "Delivery Window" must be set to "Evening \(6 PM \- 9 PM\)\."SMT ModelSMT id: smt\_smp\_000\_v006Constants: lead\_time\_threshold\_days: Int; refurbished\_wait\_threshold\_hours: Int; signature\_required\_value\_threshold\_usd: RealState vars: hardware\_in\_stock: Bool; inventory\_checked: Bool; lead\_time\_days: Int; warehouse\_picker\_assigned: Bool; fulfillment\_status: enum\(NONE, INTERNAL\_FULFILLMENT\); purchase\_order\_type: enum\(NONE, PRIORITY\_PO, STANDARD\_PO\); legacy\_portal\_checked: Bool; refurbished\_unit\_found: Bool; legacy\_wait\_elapsed\_hours: Int; signature\_required: Bool; delivery\_window: enum\(UNSET, EVENING\_6PM\_9PM\)Transitions: assign\_warehouse\_picker\(item\_id\-\>iid, quantity\-\>q\): pre hardware\_in\_stock = true; post warehouse\_picker\_assigned’ = true, fulfillment\_status’ = "INTERNAL\_FULFILLMENT"check\_legacy\_portal\(category\-\>c, item\_id\-\>iid\): pre \(hardware\_in\_stock = false and lead\_time\_days \> lead\_time\_threshold\_days\); post legacy\_portal\_checked’ = truecreate\_purchase\_order\(item\_id\-\>iid, priority\-\>p, quantity\-\>q\): pre \(hardware\_in\_stock = false and \(\(p = true and lead\_time\_days < lead\_time\_threshold\_days\) or \(p = false and lead\_time\_days \> lead\_time\_threshold\_days and legacy\_portal\_checked = true and refurbished\_unit\_found = false and legacy\_wait\_elapsed\_hours \>= refurbished\_wait\_threshold\_hours\)\)\); post p = true =\> purchase\_order\_type’ = "PRIORITY\_PO", p = false =\> purchase\_order\_type’ = "STANDARD\_PO"set\_delivery\_options\(delivery\_address\-\>a, is\_residential\-\>r, item\_value\-\>v, po\_id\-\>po, task\_id\-\>t\): pre true; post v \> signature\_required\_value\_threshold\_usd =\> signature\_required’ = true, \(v \> signature\_required\_value\_threshold\_usd and r = true\) =\> delivery\_window’ = "EVENING\_6PM\_9PM"

Sneaky Sasquatch – case\_016TEST CASE:sneaky\-sasquatchdataset\_hf: case\_016Scenario & ChecksScenarioPlease assess patient Caleb Neri in Recovery Juniper\. His patient ID is PT\-QUARRY\-RS\.Trace checks1check\_000:CALLinspect\_patient\(patient\_id=PT\-QUARRY\-RS\)2check\_001:CALLlisten\_to\_patient\(patient\_id=PT\-QUARRY\-RS\)3check\_002:NO\-CALLgive\_medicine\(\)Relevant ChunkPoisoning \(Visible\)\.\- \*\*Inspection clue:\*\* Dark green skin\. \- \*\*Treatment:\*\* Antidote\.Poisoning \(Ingested\)\.\- \*\*Symptom:\*\* "I ingested something toxic"\. \- \*\*Treatment:\*\* Antidote\.SMT ModelSMT id: smt\_smp\_000\_v006Constants: antidote\_medicine\_name: String; wrong\_medicine\_malpractice\_threshold: IntState vars: patient\_admitted: Bool; visible\_clue: enum\(DARK\_GREEN\_SKIN, OTHER\); reported\_symptom: enum\(INGESTED\_SOMETHING\_TOXIC, OTHER\); visible\_clue\_checked: Bool; symptoms\_checked: Bool; antidote\_administered: Bool; wrong\_medicine\_count: IntTransitions: inspect\_patient\(patient\_id\-\>p\): pre true; post visible\_clue\_checked’ = truelisten\_to\_patient\(patient\_id\-\>p\): pre true; post symptoms\_checked’ = truegive\_medicine\(medicine\_name\-\>m, patient\_id\-\>p\): pre \(\(\(visible\_clue\_checked = true and visible\_clue = "DARK\_GREEN\_SKIN"\) or \(symptoms\_checked = true and reported\_symptom = "INGESTED\_SOMETHING\_TOXIC"\)\) and m = antidote\_medicine\_name\), antidote\_administered = false; post antidote\_administered’ = true, wrong\_medicine\_count’ = wrong\_medicine\_count

Sneaky Sasquatch – case\_037TEST CASE:sneaky\-sasquatchdataset\_hf: case\_037Scenario & ChecksScenarioEvaluate Marcus Bell in Trauma Room 2 for possible cold injury\. His patient ID is PT\-55271\. He was trapped inside the FrostLine Produce trailer for 41 minutes at \-8 C\.Trace checks1check\_000:CALLinspect\_patient\(patient\_id=PT\-55271\)2check\_003:CALLorder\_bed\_rest\(patient\_id=PT\-55271\)3check\_006:CALLinspect\_patient\(patient\_id=PT\-55271\)PRECEDESCALLorder\_bed\_rest\(patient\_id=PT\-55271\)4check\_007:NO\-CALLinspect\_patient\(patient\_id=\)5check\_008:NO\-CALLorder\_bed\_rest\(patient\_id=\)Relevant ChunkHypothermia\.\- \*\*Inspection clue:\*\* Blue or gray\-tinted skin\. \- \*\*Treatment:\*\* Bed Rest\.Visual Inspection Diagnoses\.Some illnesses are identified by looking at the patient’s body during a visual inspection\.SMT ModelSMT id: smt\_smp\_005\_v012State vars: patient\_admitted: Bool; visual\_inspection\_done: Bool; visual\_inspection\_clue: enum\(BLUE\_OR\_GRAY\_TINTED\_SKIN, NO\_RELEVANT\_CLUE\); bed\_rest\_ordered: BoolTransitions: inspect\_patient\(patient\_id\-\>p\): pre not\(p = ""\); post visual\_inspection\_done’ = trueorder\_bed\_rest\(patient\_id\-\>p, recovery\_type\-\>r\): pre not\(p = ""\); post bed\_rest\_ordered’ = true

Tau2 Airline – case\_002TEST CASE:tau2\-airlinedataset\_hf: case\_002Scenario & ChecksScenarioWhich payment methods are saved for user USR\-77193?Trace checks1check\_000:CALLget\_user\_details\(user\_id=USR\-77193\)Relevant ChunkUser\.Each user has a profile containing: \- user id \- email \- addresses \- date of birth \- payment methods \- membership level \- reservation numbersThere are three types of payment methods: \*\*credit card\*\*, \*\*gift card\*\*, \*\*travel certificate\*\*\.There are three membership levels: \*\*regular\*\*, \*\*silver\*\*, \*\*gold\*\*\.SMT ModelSMT id: smt\_smp\_000\_v005State vars: user\_details\_retrieved: Bool; user\_id: String; email: String; addresses: \{"element": \{"kind": "primitive", "ptype": "String"\}, "kind": "array"\}; date\_of\_birth: String; payment\_methods: \{"element": \{"kind": "enum", "values": \["credit\_card", "gift\_card", "travel\_certificate"\]\}, "kind": "array"\}; membership\_level: enum\(regular, silver, gold\); reservation\_numbers: \{"element": \{"kind": "primitive", "ptype": "String"\}, "kind": "array"\}Transitions: get\_user\_details\(user\_id\-\>requested\_user\_id\): pre requested\_user\_id = user\_id, user\_details\_retrieved = false; post user\_details\_retrieved’ = true, user\_id’ = user\_id

Tau2 Airline – case\_004TEST CASE:tau2\-airlinedataset\_hf: case\_004Scenario & ChecksScenarioList the reservation numbers on file for user USR\-66340\.Trace checks1check\_000:CALLget\_user\_details\(user\_id=USR\-66340\)Relevant ChunkUser\.Each user has a profile containing: \- user id \- email \- addresses \- date of birth \- payment methods \- membership level \- reservation numbersThere are three types of payment methods: \*\*credit card\*\*, \*\*gift card\*\*, \*\*travel certificate\*\*\.There are three membership levels: \*\*regular\*\*, \*\*silver\*\*, \*\*gold\*\*\.SMT ModelSMT id: smt\_smp\_000\_v005State vars: user\_details\_retrieved: Bool; user\_id: String; email: String; addresses: \{"element": \{"kind": "primitive", "ptype": "String"\}, "kind": "array"\}; date\_of\_birth: String; payment\_methods: \{"element": \{"kind": "enum", "values": \["credit\_card", "gift\_card", "travel\_certificate"\]\}, "kind": "array"\}; membership\_level: enum\(regular, silver, gold\); reservation\_numbers: \{"element": \{"kind": "primitive", "ptype": "String"\}, "kind": "array"\}Transitions: get\_user\_details\(user\_id\-\>requested\_user\_id\): pre requested\_user\_id = user\_id, user\_details\_retrieved = false; post user\_details\_retrieved’ = true, user\_id’ = user\_id

Tau2 Retail – case\_004TEST CASE:tau2\-retaildataset\_hf: case\_004Scenario & ChecksScenarioUpdate the shipping address on order ORD\-58192 for user U\-20418 to 415 W 18th St, Apt 5B, New York, NY 10011, United States\. I already confirm this address change\. My account email is u\_20418@example\.com\.Trace checks1check\_000:CALLget\_order\_details\(order\_id=ORD\-58192\)2check\_001:NO\-CALLmodify\_pending\_order\_address\(\)Relevant ChunkOrder\.Each order has the following attributes:\- unique order id \- user id \- address \- items ordered \- status \- fullfilments info \(tracking id and item ids\) \- payment historyThe status of an order can be: \*\*pending\*\*, \*\*processed\*\*, \*\*delivered\*\*, or \*\*cancelled\*\*\.Orders can have other optional attributes based on the actions that have been taken \(cancellation reason, which items have been exchanged, what was the exchane price difference etc\)SMT ModelSMT id: smt\_smp\_001\_v009Constants: product\_type\_count: Int; max\_pending\_item\_modifications: Int; max\_delivered\_post\_delivery\_actions: Int; non\_gift\_card\_refund\_min\_business\_days: Int; non\_gift\_card\_refund\_max\_business\_days: IntState vars: order\_status: enum\(PENDING, PROCESSED, DELIVERED, CANCELLED, RETURN\_REQUESTED\); shipping\_address\_address1: String; shipping\_address\_address2: String; shipping\_address\_city: String; shipping\_address\_state: String; shipping\_address\_zip: String; shipping\_address\_country: String; order\_item\_ids: String; payment\_method\_id: String; payment\_method\_kind: enum\(GIFT\_CARD, NON\_GIFT\_CARD\); refund\_status: enum\(NONE, GIFT\_CARD\_BALANCE\_CREDITED, EXTERNAL\_REFUND\_PENDING\); pending\_item\_modification\_count: Int; delivered\_post\_delivery\_action\_count: Int; order\_details\_checked: Bool; item\_details\_checked: Bool; product\_details\_checked: Bool; product\_types\_listed: Bool; requested\_item\_type\_match: Bool; cancellation\_confirmed: Bool; pending\_address\_modification\_confirmed: Bool; pending\_item\_modification\_confirmed: Bool; pending\_payment\_modification\_confirmed: Bool; delivered\_exchange\_confirmed: Bool; delivered\_return\_confirmed: BoolTransitions: cancel\_pending\_order\(order\_id\-\>oid, reason\-\>r\): pre order\_status = "PENDING", cancellation\_confirmed = true, non\_gift\_card\_refund\_min\_business\_days <= non\_gift\_card\_refund\_max\_business\_days; post order\_status’ = "CANCELLED", payment\_method\_kind = "GIFT\_CARD" =\> refund\_status’ = "GIFT\_CARD\_BALANCE\_CREDITED", payment\_method\_kind = "NON\_GIFT\_CARD" =\> refund\_status’ = "EXTERNAL\_REFUND\_PENDING"exchange\_delivered\_order\_items\(item\_ids\-\>iids, new\_item\_ids\-\>nids, order\_id\-\>oid, payment\_method\_id\-\>pmid\): pre order\_details\_checked = true, requested\_item\_type\_match = true, \(product\_details\_checked = true or product\_types\_listed = true\), order\_status = "DELIVERED", delivered\_post\_delivery\_action\_count < max\_delivered\_post\_delivery\_actions, delivered\_exchange\_confirmed = true; post delivered\_post\_delivery\_action\_count’ = delivered\_post\_delivery\_action\_count \+ 1get\_item\_details\(item\_id\-\>iid\): pre true; post item\_details\_checked’ = trueget\_order\_details\(order\_id\-\>oid\): pre true; post order\_details\_checked’ = trueget\_product\_details\(product\_id\-\>pid\): pre true; post product\_details\_checked’ = truelist\_all\_product\_types\(\): pre product\_type\_count = 50; post product\_types\_listed’ = truemodify\_pending\_order\_address\(address1\-\>a1, address2\-\>a2, city\-\>c, country\-\>co, order\_id\-\>oid, state\-\>st, zip\-\>z\): pre order\_details\_checked = true, order\_status = "PENDING", pending\_address\_modification\_confirmed = true; post shipping\_address\_address1’ = a1, shipping\_address\_address2’ = a2, shipping\_address\_city’ = c, shipping\_address\_country’ = co, shipping\_address\_state’ = st, shipping\_address\_zip’ = zmodify\_pending\_order\_items\(item\_ids\-\>iids, new\_item\_ids\-\>nids, order\_id\-\>oid, payment\_method\_id\-\>pmid\): pre order\_details\_checked = true, item\_details\_checked = true, \(product\_details\_checked = true or product\_types\_listed = true\), requested\_item\_type\_match = true, order\_status = "PENDING", pending\_item\_modification\_count < max\_pending\_item\_modifications, pending\_item\_modification\_confirmed = true; post pending\_item\_modification\_count’ = pending\_item\_modification\_count \+ 1modify\_pending\_order\_payment\(order\_id\-\>oid, payment\_method\_id\-\>pmid\): pre order\_details\_checked = true, order\_status = "PENDING", pending\_payment\_modification\_confirmed = true; post payment\_method\_id’ = pmidreturn\_delivered\_order\_items\(item\_ids\-\>iids, order\_id\-\>oid, payment\_method\_id\-\>pmid\): pre order\_details\_checked = true, order\_status = "DELIVERED", delivered\_post\_delivery\_action\_count < max\_delivered\_post\_delivery\_actions, delivered\_return\_confirmed = true; post order\_status’ = "RETURN\_REQUESTED", delivered\_post\_delivery\_action\_count’ = delivered\_post\_delivery\_action\_count \+ 1

Tau2 Retail – case\_007TEST CASE:tau2\-retaildataset\_hf: case\_007Scenario & ChecksScenarioPlease exchange the Larkspur Hoodie from order ORD\-51763 from small to medium and charge any difference to gift card GC\-77105, which has $25\.00 available\. I’m returning the whole hoodie line, and yes, I confirm you should proceed\. My account email is usr\_auto\_\_51763@example\.com\.Trace checks1check\_000:CALLget\_order\_details\(order\_id=ORD\-51763\)2check\_001:NO\-CALLexchange\_delivered\_order\_items\(\)3check\_002:NO\-CALLreturn\_delivered\_order\_items\(\)Relevant ChunkExchange delivered order\.An order can only be exchanged if its status is ’delivered’, and you should check its status before taking the action\. In particular, remember to remind the customer to confirm they have provided all items to be exchanged\.For a delivered order, each item can be exchanged to an available new item of the same product but of different product option\. There cannot be any change of product types, e\.g\. modify shirt to shoe\.The user must provide a payment method to pay or receive refund of the price difference\. If the user provides a gift card, it must have enough balance to cover the price difference\.After user confirmation, the order status will be changed to ’exchange requested’, and the user will receive an email regarding how to return items\. There is no need to place a new order\.SMT ModelSMT id: smt\_smp\_002\_v002State vars: order\_status: enum\(DELIVERED, EXCHANGE\_REQUESTED, RETURN\_REQUESTED, OTHER\); order\_status\_checked: Bool; exchange\_confirmed: Bool; return\_confirmed: Bool; payment\_method\_kind: enum\(ORIGINAL\_PAYMENT\_METHOD, EXISTING\_GIFT\_CARD, OTHER\); gift\_card\_balance\_sufficient: Bool; exchange\_items\_eligible: BoolTransitions: get\_order\_details\(order\_id\-\>oid\): pre true; post order\_status\_checked’ = trueexchange\_delivered\_order\_items\(item\_ids\-\>iids, new\_item\_ids\-\>niids, order\_id\-\>oid, payment\_method\_id\-\>pmid\): pre order\_status\_checked = true, order\_status = "DELIVERED", exchange\_confirmed = true, exchange\_items\_eligible = true, payment\_method\_kind = "EXISTING\_GIFT\_CARD" =\> gift\_card\_balance\_sufficient = true; post order\_status’ = "EXCHANGE\_REQUESTED"return\_delivered\_order\_items\(item\_ids\-\>iids, order\_id\-\>oid, payment\_method\_id\-\>pmid\): pre order\_status\_checked = true, order\_status = "DELIVERED", return\_confirmed = true, \(payment\_method\_kind = "ORIGINAL\_PAYMENT\_METHOD" or payment\_method\_kind = "EXISTING\_GIFT\_CARD"\), not\(oid = ""\), not\(iids = ""\), not\(pmid = ""\); post order\_status’ = "RETURN\_REQUESTED"

Tau2 Telecom – case\_007TEST CASE:tau2\-telecomdataset\_hf: case\_007Scenario & ChecksScenarioI’m helping Priya Nair: her phone has been stuck with no signal bars since last night even after she turned Airplane Mode off and reseated the SIM\. Her line LINE\-88431 for customer CUST\-44862 is currently suspended, so please restore service on that line\.Trace checks1check\_000:CALLresume\_line\(customer\_id=CUST\-44862, line\_id=LINE\-88431\)2check\_003:NO\-CALLsend\_payment\_request\(\)Relevant ChunkDiagnosing Service Issues\.‘check\_status\_bar\(\)‘ can be used to check if the user is facing a service issue\. If there is cellular service, the status bar will return a signal strength indicator\.Common Service Issues and Their Causes\.If the user is experiencing service problems, here are some common causes:\* \*\*Airplane Mode is ON\*\*: This disables all wireless radios, including cellular\. \* \*\*SIM Card Problems\*\*: \* Not inserted or improperly seated\. \* Locked due to incorrect PIN/PUK entries\. \* \*\*Incorrect Network Settings\*\*: APN settings might be incorrect resulting in a loss of service\. \* \*\*Carrier Issues\*\*: Your line might be inactive due to billing problems\.Understanding and Troubleshooting Your Phone’s Cellular Service\.This section details for agents how a user’s phone connects to the cellular network \(often referred to as "service"\) and provides procedures to troubleshoot common issues\. Good cellular service is required for calls, texts, and mobile data\.SMT ModelSMT id: smt\_smp\_004\_v004State vars: service\_issue\_checked: Bool; has\_cellular\_signal: Bool; bill\_status\_checked: Bool; selected\_bill\_status: enum\(PAID, UNPAID, AWAITING\_PAYMENT\); other\_bill\_awaiting\_payment\_exists: Bool; customer\_exists: Bool; bill\_exists: Bool; bill\_belongs\_to\_customer: Bool; line\_status: enum\(ACTIVE, SUSPENDED, PENDING\_ACTIVATION\); suspension\_start\_date\_present: BoolTransitions: get\_bills\_for\_customer\(customer\_id\-\>cid, limit\-\>lim\): pre not\(cid = ""\); post bill\_status\_checked’ = trueresume\_line\(customer\_id\-\>cid, line\_id\-\>lid\): pre \(line\_status = "SUSPENDED" or line\_status = "PENDING\_ACTIVATION"\), not\(cid = ""\), not\(lid = ""\); post line\_status’ = "ACTIVE", suspension\_start\_date\_present’ = falsesend\_payment\_request\(bill\_id\-\>bid, customer\_id\-\>cid\): pre not\(bid = ""\), not\(cid = ""\), customer\_exists = true, bill\_exists = true, bill\_belongs\_to\_customer = true, other\_bill\_awaiting\_payment\_exists = false, bill\_status\_checked = true, not\(selected\_bill\_status = "PAID"\); post selected\_bill\_status’ = "AWAITING\_PAYMENT"

Tau2 Telecom – case\_026TEST CASE:tau2\-telecomdataset\_hf: case\_026Scenario & ChecksScenarioMaya Chen’s Galaxy S24 has had no calls or data for 6 hours\. I have already turned off airplane mode and reseated the SIM twice, and the account shows this line is currently suspended\. Please check why this line is down and get it working again\. Customer ID CUST\-10482, line ID LINE\-8841\.Trace checks1check\_000:CALLget\_bills\_for\_customer\(customer\_id=CUST\-10482\)2check\_001:CALLresume\_line\(customer\_id=CUST\-10482, line\_id=LINE\-8841\)3check\_002:CALLget\_bills\_for\_customer\(customer\_id=CUST\-10482\)PRECEDESCALLresume\_line\(customer\_id=CUST\-10482, line\_id=LINE\-8841\)Relevant ChunkCommon Service Issues and Their Causes\.If the user is experiencing service problems, here are some common causes:\* \*\*Airplane Mode is ON\*\*: This disables all wireless radios, including cellular\. \* \*\*SIM Card Problems\*\*: \* Not inserted or improperly seated\. \* Locked due to incorrect PIN/PUK entries\. \* \*\*Incorrect Network Settings\*\*: APN settings might be incorrect resulting in a loss of service\. \* \*\*Carrier Issues\*\*: Your line might be inactive due to billing problems\.SMT ModelSMT id: smt\_smp\_001\_v003State vars: service\_issue\_cause: enum\(AIRPLANE\_MODE\_ON, SIM\_CARD\_PROBLEM, INCORRECT\_NETWORK\_SETTINGS, BILLING\_PROBLEM\); billing\_checked: Bool; line\_status: enum\(ACTIVE, SUSPENDED, PENDING\_ACTIVATION\); suspension\_start\_date: StringTransitions: get\_bills\_for\_customer\(customer\_id\-\>cid\): pre true; post billing\_checked’ = trueresume\_line\(customer\_id\-\>cid, line\_id\-\>lid\): pre \(line\_status = "SUSPENDED" or line\_status = "PENDING\_ACTIVATION"\), not\(cid = ""\), not\(lid = ""\); post line\_status’ = "ACTIVE", suspension\_start\_date’ = ""
MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

Similar Articles

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

ToolMenuBench: Benchmarking Tool-Menu Filtering Strategies for Reliable and Efficient LLM Agents

MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

Submit Feedback

Similar Articles

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems
ToolMenuBench: Benchmarking Tool-Menu Filtering Strategies for Reliable and Efficient LLM Agents
MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks
Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents
Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability