Declarative Skills for AI Agents in Knowledge-Grounded Tool-Use Workflows
Summary
This paper studies orchestration mechanisms for tool-using AI agents in customer-service workflows, comparing declarative agents with imperative state machines and baselines. Results show retrieval quality is a key bottleneck, and under high-quality retrieval, declarative skills improve accuracy on procedural tasks.
# Declarative Skills for AI Agents in Knowledge-Grounded Tool-Use Workflows
Source: [https://arxiv.org/html/2606.06923](https://arxiv.org/html/2606.06923)
M\. Danish Lim, I\. Danial Bin Sharudin, Wen Han Chen, Cedric Lim, Laura Wynter School of Computing and Information Systems Singapore Management University Singapore
###### Abstract
We study orchestration mechanisms for tool-using AI agents in realistic customer-service workflows over an unstructured knowledge base. We argue that declarative agents, that is, AI agents equipped with natural-language skill files appended to the system prompt, are an effective orchestration paradigm. Concretely, we compare (i) a DeclarativeAgent that reads three domain-specific skill files at inference time and decides its own control flow, (ii) an ImperativeAgent based on a programmatic state machine with explicit phases, and (iii) an unscaffolded baseline agent modeled after the $\tau$-Knowledge benchmark agent. Our ImperativeAgent is motivated by externalised-control inference as in Recursive Language Models and graph-based orchestration frameworks. We formalise the three agents as policy classes within a decentralised partially-observable Markov decision process and analyse their information-theoretic and structural properties; we then test the predicted differences empirically on five language models and two retrieval regimes. Our results show that retrieval quality is a dominant bottleneck for AI agents: when evidence is incomplete or skewed, all agents degrade substantially, and skill files cannot recover lost performance. Under high-quality retrieval, however, declarative skills consistently improve accuracy on procedural tasks and reduce orchestration errors, while the more brittle imperative state machine does not reliably improve task success or compliance. Code will be provided upon publication.
## 1 Introduction
Tool\-using AI agents extend generative language models with the ability to act on external systems, and evaluating how well they perform real\-world tasks has become a central research problem as such agents are deployed at scale\. We focus on a general and practically important class of agent: customer\-service workflows, which combine three distinct competencies:*conversational*\(greeting the user, eliciting and clarifying intent\),*procedural*\(determining ordering constraints, verifications required, and tool\-call sequences\), and*reasoning*\(interpreting retrieved documents to choose the correct tool arguments\)\. These competencies are broadly applicable beyond banking customer service\.
We make use of $\tau$-Knowledge [[1](https://arxiv.org/html/2606.06923#bib.bib1)], an extension of $\tau$-Bench [[2](https://arxiv.org/html/2606.06923#bib.bib2)], which was developed for evaluating AI agents on tasks that require coordinated retrieval over a large unstructured knowledge base (KB). Unlike prior agent benchmarks that supply a fully-specified tool interface up front, $\tau$-Knowledge makes capabilities discoverable; for example, many state-changing operations are referenced only in natural-language documentation and must be located via knowledge-base search before they can be invoked. The interaction is framed as a decentralised partially observable Markov decision process (Dec-POMDP) [[4](https://arxiv.org/html/2606.06923#bib.bib4)] over a shared state space $S = S_{\text{DB}} \times S_{\text{history}}$, where $S_{\text{DB}}$ is the state of the database, which includes user and system entities, and $S_{\text{history}}$ is the stored conversation history between the user and the AI agent. The AI agent observes only tool outputs and user messages, and the binary task reward depends on whether the final database state $S_{\text{DB}}$ matches a hand-curated target. The problem is partially observable because the AI agent and the user have asymmetric and incomplete views of the state $S$.
We propose two paradigms to orchestrate the AI agents\. Imperative orchestration is defined as externalised, deterministic control\. Deterministic control has broad appeal in combating the non\-determinism of LLMs\. It can be argued that end\-to\-end LLM inference for real\-world tasks is too risky and that some of the control should be externalised to deterministic code\. Recursive Language Models \(RLMs\)\[[8](https://arxiv.org/html/2606.06923#bib.bib8)\]are an aggressive form of this idea that proposes to treat LLM input context as an external environment \(e\.g\. a Python REPL variable\)\. Then, using RLM, the LLM agent programmatically inspects and decomposes the REPL environment, and recursively invokes itself on smaller sub\-problems, composing the final answer from its sub\-results\. The same general idea is found also in LangGraph DFAs\[[10](https://arxiv.org/html/2606.06923#bib.bib10)\], ReAct reasoning\-action loops\[[11](https://arxiv.org/html/2606.06923#bib.bib11)\], and numerous other works\. In our instantiation, our ImperativeAgent owns the phase graph and the transition rules \(such as verification before write, bounded retries\) while the LLM is invoked as a per\-phase sub\-routine\. The desired benefit of the ImperativeAgent is that this deterministic enforcement will reduce hallucination, improve interpretability, and make compliance properties more auditable\.
The second form of orchestration we examine is called declarative. This paradigm relies on an agent skills approach similar to that proposed by Anthropic [[5](https://arxiv.org/html/2606.06923#bib.bib5)], who suggest that procedural knowledge should be expressed in natural language and read by the model at run time as needed. An agent skill is a markdown document describing when an action is appropriate, the preconditions and ordering constraints, and the tool-argument requirements. The LLM interprets the agent skill files as part of its system prompt and chooses the workflow and details on-line. The justification for the declarative paradigm is that the LLM's own attention mechanism can integrate natural-language skill content with retrieved evidence in more flexible ways than a fixed state graph.
In summary, we define the following two orchestrated agents:
- ImperativeAgent, which implements a finite-state machine with deterministic transitions, explicit verification gates, and hard-coded retry policies, representing externalised control and programmatic orchestration.
- DeclarativeAgent, which follows Anthropic-style Agent Skills: the model reads a small set of markdown skill files in its system prompt and is free to choose the workflow, tools, and verification strategy in natural language, with no explicit state machine.
Our contribution is to answer the question: does skill-file-based declarative orchestration outperform or underperform programmatic state-machine orchestration for tool-using AI agents in realistic, complex workflows, and what trade-offs do these approaches entail in terms of task success, robustness, compliance, and efficiency? Both our imperative and declarative agentic paradigms have plausible arguments for success; our imperative approach should reduce hallucination and provide reliable results, while our declarative approach should be less brittle. In addition to evaluating our imperative and declarative orchestration paradigms, we evaluate a baseline, unscaffolded LLM agent as used in the $\tau$-Knowledge benchmark paper [[1](https://arxiv.org/html/2606.06923#bib.bib1)].
The paper proceeds as follows\. In the next section, we discuss related work\. Then, Section[3](https://arxiv.org/html/2606.06923#S3)casts the three agents as policy classes within a Dec\-POMDP and states our three main research questions\. Sections[4](https://arxiv.org/html/2606.06923#S4)–[5](https://arxiv.org/html/2606.06923#S5)describe the DeclarativeAgent and ImperativeAgent in detail\. Section[6](https://arxiv.org/html/2606.06923#S6)analyses the three policy classes theoretically\. Section[7](https://arxiv.org/html/2606.06923#S7)provides the experimental design and the main results and Section[8](https://arxiv.org/html/2606.06923#S8)provides an ablation into the compliance and efficiency of the agents\. We conclude with a discussion that ties our findings back to our research questions\. The appendix includes additional details\.
## 2 Related Work
The application domain used in $\tau$-Knowledge is $\tau$-Banking, a set of fintech customer-service tasks. The benchmark environment contains 698 documents ($\sim$195K tokens, 71 topics, 21 product categories), 14 permanent agent tools plus 51 discoverable tools, and 97 evaluation tasks. Each task requires, on average, 18.6 documents and 9.52 tool calls (max 33) to resolve. The benchmark scoring proposed in [[1](https://arxiv.org/html/2606.06923#bib.bib1)] is $\text{pass}^{k}$, and while the paper considers $k = 1, 3$, we focus only on the more challenging $\text{pass}^{1}$ metric. Documents provided include product specifications, internal procedural policies (e.g. retention protocols, account-closure eligibility), and discoverable-tool signatures with required-argument schemas. Tool names contain random four-digit suffixes (e.g. `close_bank_account_7392`) that cannot be guessed. The paper uses two discoverability approaches: "gold" and retrieval. Gold means that the task-critical documents are provided in the system prompt, while retrieval uses an external retriever. The benchmark paper [[1](https://arxiv.org/html/2606.06923#bib.bib1)] uses both keyword-matching retrieval, via BM25, and embedding-based retrieval.
Beyond $\tau$-Knowledge, several recent works benchmark customer-support agents on adherence to business policies, multi-step workflows, and tool-use correctness [[12](https://arxiv.org/html/2606.06923#bib.bib12),[13](https://arxiv.org/html/2606.06923#bib.bib13)]. These works assume a fixed orchestration style, comprising a single LLMAgent-like loop with tools, and focus on model or retriever comparisons. Our work explores the benefits of agent skills via our DeclarativeAgent, as well as programmatic approaches to invoking tools via our ImperativeAgent, under a common benchmark and tool set.
Our work also relates to the literature on agentic scaffolding and orchestration\. ReAct\-style reasoning\-and\-acting loops\[[11](https://arxiv.org/html/2606.06923#bib.bib11)\]interleave natural\-language thoughts and tool calls\. Graph\-based and DFA\-style frameworks such as LangGraph\[[10](https://arxiv.org/html/2606.06923#bib.bib10)\]expose the agent’s control flow as an explicit state machine or graph, allowing deterministic transitions\. Recursive Language Models\[[8](https://arxiv.org/html/2606.06923#bib.bib8)\]externalise control further, treating the prompt as an environment variable, and allow the LLM to recursively call itself over decomposed subproblems, demonstrating strong gains on long\-context reasoning tasks\[[9](https://arxiv.org/html/2606.06923#bib.bib9)\]\. Our ImperativeAgent is in this broad family of deterministic approaches to orchestration\.
A parallel line of work studies retrieval-augmented generation in realistic, messy knowledge bases. $\tau$-Knowledge itself highlights that even frontier models struggle to retrieve, interpret, and act on unstructured documentation [[1](https://arxiv.org/html/2606.06923#bib.bib1)], and subsequent reports have underscored how retrieval quality often dominates model choice in customer-support agents [[1](https://arxiv.org/html/2606.06923#bib.bib1),[14](https://arxiv.org/html/2606.06923#bib.bib14)]. There is also increasing interest in replacing brittle tool registries and MCP-style plugins with file-centric agent interfaces, where files serve simultaneously as context, tools, and skills [[7](https://arxiv.org/html/2606.06923#bib.bib7)]. Our results strengthen these findings: golden retrieval exposes the benefits of skill-file orchestration, while noisy embedding retrieval sharply reduces performance for all agents, illustrating that agent skills and high-capacity LLMs with reasoning cannot compensate for fundamentally incorrect evidence.
Related to our declarative orchestration is Anthropic’s Agent Skills specification\[[5](https://arxiv.org/html/2606.06923#bib.bib5)\], which proposes reusable SKILL\.md files as composable, model\-readable procedural knowledge for agents\. Agent Skills are loaded via progressive disclosure, with short metadata always in context and full skill bodies read only when needed\[[5](https://arxiv.org/html/2606.06923#bib.bib5),[6](https://arxiv.org/html/2606.06923#bib.bib6)\]\. Follow\-on work has generalised this idea to other ecosystems, arguing that skills should be small, focused markdown files that can be swapped or combined to tailor agent behaviour\[[7](https://arxiv.org/html/2606.06923#bib.bib7),[6](https://arxiv.org/html/2606.06923#bib.bib6)\]\.
Our DeclarativeAgent instantiates this paradigm, using three skill files to encode conversational structure, banking procedures, and knowledge\-discovery strategy, and provides, to our knowledge, the first systematic comparison between a skill\-file declarative agent and a programmatic state\-machine agent on a realistic customer\-support benchmark\.
The authors of [[1](https://arxiv.org/html/2606.06923#bib.bib1)] identified the main causes of failure when using their benchmark on LLM agents as: (1) *complex interdependencies between offerings* ($\sim$14.5% of failures): multi-hop reasoning across documents to find the optimal product combination; (2) *failure to respect implicit subtask ordering* ($\sim$5%): e.g. disputes must resolve before credit limit increases; (3) *overtrusting user assertions* ($\sim$4%): acting on user-claimed state without verifying via tools; and (4) *search inefficiency and unwarranted assumptions* ($\sim$23%): committing to early hypotheses rather than searching the KB.
These causes of failure motivate our ImperativeAgent and DeclarativeAgent strategies\. While failure type 1 may be assumed to be tied mainly to LLM capacity \(and hence model parameter count\), we aim to rectify failures 2–3, namely topological task ordering and verification gating using code with our ImperativeAgent\. Similarly, we provide explicit KB\-search guidance in the agent skills of our DeclarativeAgent, positioning declarative skills as a low\-cost capability enhancement for AI agents\.
The $\tau$-Knowledge paper evaluates five frontier models across the various retrieval configurations. Their main finding is that the benchmark is hard for current LLM agents: their best non-gold configuration was GPT-5.2 (high reasoning) with terminal use at 25.52% $\text{pass}^{1}$, and even with gold documents provided to the agent in context their best score was Claude-4.5-Opus (high) at 39.69%. Our Table [1](https://arxiv.org/html/2606.06923#S2.T1) reproduces their benchmark's main $\text{pass}^{1}$ results using their unscaffolded LLM agent [[1](https://arxiv.org/html/2606.06923#bib.bib1), Table 2].
Table 1: $\tau$-Knowledge benchmark's frontier-model baselines on unscaffolded LLM agents using $\text{pass}^{1}$ (%), reproduced from [[1](https://arxiv.org/html/2606.06923#bib.bib1)]. Gold provides the minimal document set to the agent in context. Parentheses indicate $\Delta$ vs. the Gold setting for each row. Reas. denotes the reasoning-level setting.
## 3 Problem Formulation
We model the customer-service interaction as a finite-horizon, two-agent decentralised partially-observable Markov decision process (Dec-POMDP) [[4](https://arxiv.org/html/2606.06923#bib.bib4)]. A simulation, hereafter referred to as a *task*, is one rollout of this process on a fixed task specification drawn from $\tau$-Knowledge.
The world state is $S = S_{\text{DB}} \times S_{\text{conv}}$, where $S_{\text{DB}}$ is the relational state of the banking database (customer records, accounts, transactions, disputes, cards) and $S_{\text{conv}}$ is the rolling conversation history. Two policies operate over $S$ jointly: the task agent $\pi$ that we design, and a user simulator $\pi_{u}$ whose persona and intent are fixed by the task. Both observe $S$ only through messages and tool outputs; the database is not directly visible to either agent.
At each turn $t$, the task agent emits an action $a_t \in \mathcal{A}$ given an information state $h_t = (o_{1:t}, a_{1:t-1})$. We partition the action space as $\mathcal{A} = \mathcal{A}_{\text{say}} \cup \mathcal{A}_{\text{read}} \cup \mathcal{A}_{\text{write}}$: $\mathcal{A}_{\text{say}}$ are natural-language turns directed at the user; $\mathcal{A}_{\text{read}}$ are non-mutating tool calls (`KB_search`, account lookups, tool discovery, identity `log_verification`); $\mathcal{A}_{\text{write}}$ are state-mutating tool calls that change $S_{\text{DB}}$ (transaction submissions, account changes, referrals, transfers to human agents, etc.). A task terminates when either party emits a stop token, when an unrecoverable error is raised, or when a horizon $T_{\max}$ is reached. Reward is binary,
$$r(\tau) \;=\; 1\!\left\{S_{\text{DB}}^{\text{final}} = S_{\text{DB}}^{\text{gold}}\right\} \cdot 1\!\left\{A^{\text{required}} \subseteq A(\tau)\right\},$$

where $A^{\text{required}}$ is the set of gold action-checks (canonical write tools with their canonical arguments) supplied by the benchmark and $A(\tau)$ is the multiset of tool calls actually issued. Across $K$ trials, the standard $\tau$-Knowledge metric is $\mathrm{pass}^{k} = \mathbb{E}_{\tau_{1:K}}\left[1\{\forall i \in [k],\, r(\tau_i) = 1\}\right]$; we report $\mathrm{pass}^{1}$ as our primary metric.
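The reward and metric above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the names `task_reward` and `pass_hat_k` are hypothetical, tool calls are assumed to be hashable tuples, and $\mathrm{pass}^{k}$ is estimated with the combinatorial estimator $\binom{c}{k}/\binom{n}{k}$ averaged over tasks (with $c$ successes out of $n$ trials), which is an assumption about how the expectation is estimated.

```python
from collections import Counter
from math import comb


def task_reward(final_db, gold_db, required_actions, issued_actions):
    """Binary reward: DB state matches gold AND every required action-check was issued.

    Actions are assumed to be hashable (e.g. (tool_name, args) tuples).
    """
    # Multiset containment: nothing required is left over after subtracting issued calls.
    missing = Counter(required_actions) - Counter(issued_actions)
    return int(final_db == gold_db and not missing)


def pass_hat_k(rewards_per_task, k):
    """Estimate pass^k from n >= k binary trial rewards per task.

    For a task with c successes in n trials, C(c, k) / C(n, k) is the fraction of
    k-subsets of trials that all succeeded; the estimate is its mean over tasks.
    """
    scores = []
    for rewards in rewards_per_task:
        n, c = len(rewards), sum(rewards)
        if k > n:
            raise ValueError("need at least k trials per task")
        scores.append(comb(c, k) / comb(n, k))
    return sum(scores) / len(scores)
```

With $k = 1$ the estimator reduces to the mean success rate per task, which is the quantity reported as $\mathrm{pass}^{1}$.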
We compare three policy classes within this Dec-POMDP. Let $\theta_{\text{sys}}$ denote the baseline system prompt distributed with the benchmark, and let the LLM under evaluation be $M$.

Definition 1 (Baseline policy). The baseline policy is $\pi_B(a_t \mid h_t;\, \theta_{\text{sys}}, M) = M\!\left(a_t \mid \theta_{\text{sys}}, h_t\right)$, i.e. the model conditions only on the static system prompt and the running history. There is no agent-side control flow.

Definition 2 (Declarative policy). Let $\Sigma = \{s_1, s_2, s_3\}$ be a finite set of natural-language skill files. The declarative policy is

$$\pi_D(a_t \mid h_t;\, \theta_{\text{sys}}, \Sigma, M) = M\!\left(a_t \mid \theta_{\text{sys}} \oplus \Sigma, h_t\right),$$

where $\oplus$ denotes prompt concatenation. Structurally, $\pi_D$ differs from $\pi_B$ only in that the system prompt has been enlarged with $\Sigma$; there is no phase variable and no restriction on $\mathcal{A}$.

Definition 3 (Imperative policy). The imperative policy is a hierarchical pair $(\pi_{I,\phi}, \delta)$ where $\phi \in \Phi$ is a phase, $\delta: \Phi \times \mathrm{State} \to \Phi$ is a deterministic phase-transition function, and each $\pi_{I,\phi}$ is a phase-conditional sub-policy that emits actions only in a restricted subset $\mathcal{A}_{\phi} \subseteq \mathcal{A}$ together with a phase-specific instruction $\iota_{\phi}$ injected into the system prompt:

$$\pi_I(a_t \mid h_t,\, \phi_t;\, \theta_{\text{sys}}, M) = M\!\left(a_t \mid \theta_{\text{sys}} \oplus \iota_{\phi_t}, h_t\right) \cdot 1\{a_t \in \mathcal{A}_{\phi_t}\}.$$

Transitions $\phi_{t+1} = \delta(\phi_t, \mathrm{state}_t)$ are computed by code, not by the LLM.

The three policies share the same model $M$, the same tools, and the same user-simulator distribution; they differ only in how procedural knowledge is encoded (none, natural-language, or executable code) and in whether the action space is restricted at each turn. This isolates the orchestration choice as the independent variable in the experiments that follow.
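To make the structural difference between Definitions 1 to 3 concrete, the following sketch treats the model $M$ as a plain callable returning a distribution over actions. Everything here is an illustrative assumption: the `Model` type, the function names, and the `<skills>` wrapper format are hypothetical stand-ins, and the imperative mask multiplies by the indicator exactly as in Definition 3, without renormalising.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical stand-in for M: (system_prompt, history) -> {action: probability}.
Model = Callable[[str, str], Dict[str, float]]


def baseline_policy(model: Model, theta_sys: str, history: str) -> Dict[str, float]:
    """pi_B: condition only on the static system prompt and the history."""
    return model(theta_sys, history)


def declarative_policy(
    model: Model, theta_sys: str, skills: Iterable[str], history: str
) -> Dict[str, float]:
    """pi_D: same as pi_B, but the system prompt is enlarged with the skill files."""
    skill_block = "\n---\n".join(skills)
    prompt = f"{theta_sys}\n<skills>\n{skill_block}\n</skills>"
    return model(prompt, history)


def imperative_policy(
    model: Model,
    theta_sys: str,
    phase_instruction: str,
    allowed_actions: Set[str],
    history: str,
) -> Dict[str, float]:
    """pi_I: inject the phase instruction, then zero out actions outside A_phi."""
    dist = model(f"{theta_sys}\n{phase_instruction}", history)
    return {a: (p if a in allowed_actions else 0.0) for a, p in dist.items()}
```

The only difference between the first two functions is the prompt string, while the third also changes the support of the action distribution, which is the restriction that Definition 3 attributes to the ImperativeAgent.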
We define our three main research questions below\.
RQ1 (skill files as procedural prior). For models $M$ with a procedural-competence gap $g(M) > 0$ on this domain, operationalised by $\mathrm{pass}^{1}(\pi_B, M)$ being substantially below the human or oracle ceiling, our first research question asks whether the declarative policy weakly improves on the baseline, $\mathrm{pass}^{1}(\pi_D, M) \geq \mathrm{pass}^{1}(\pi_B, M)$, and whether the gain shrinks as $g(M) \to 0$.

RQ2 (imperative as compliance enforcer). Does the imperative policy reduce the unauthorized-write rate (i.e., writes before a successful `log_verification`) compared to baseline and declarative, by construction of a verification gate $V \to E$?

RQ3 (retrieval as bottleneck). Under noisy embedding retrieval, does the advantage of our declarative paradigm over the baseline collapse?

Figure [1](https://arxiv.org/html/2606.06923#S3.F1) contrasts the three policy classes schematically: all three involve conditioning the model $M$ on the history $h_t$, but they differ in what augments the system prompt and whether the action space is restricted at each turn.
Figure 1: Three policy classes within the same Dec-POMDP. The baseline ($\pi_B$) is conditioned on a fixed system prompt $\theta_{\text{sys}}$ and emits $a_t \in \mathcal{A}$. The DeclarativeAgent ($\pi_D$) enlarges the system prompt with skill files $\Sigma$ (skills $s_1, s_2, s_3$) and emits $a_t \in \mathcal{A}$. The ImperativeAgent ($\pi_I$) injects a phase-specific instruction $\iota_{\phi_t}$, restricts the per-turn action space to $\mathcal{A}_{\phi_t} \subsetneq \mathcal{A}$, and advances the phase via a deterministic transition $\delta(\phi_t, \mathrm{state}_t)$.
## 4 DeclarativeAgent
Our DeclarativeAgent is a minimal extension of the baseline: it inherits the $\tau$-Bench `LLMAgent` unchanged and adds a single intervention, namely natural-language *agent skill files* appended to the system prompt. This design isolates the orchestration choice as the independent variable, so any observed gap between the baseline and the DeclarativeAgent is attributable to the skill files themselves rather than to differences in tools, prompting style, or control flow. As proposed by the Agent Skills paradigm [[5](https://arxiv.org/html/2606.06923#bib.bib5)], the LLM decides what to do, when to call which tool, and when to verify, while the skill files describe the workflow in natural language. There is no agent-side control flow, no phase enumeration, and no per-turn instruction template.
The declarative system prompt is built from the baseline `LLMAgent` prompt with an appended `<skills>` block containing three concatenated Markdown skill files (see Appendix [9.3](https://arxiv.org/html/2606.06923#Sx1.SS3) for details). The instructions and policy are identical to the baseline agent; skill content is loaded once at agent construction by reading `src/skills/*.md` alphabetically and concatenating with `---` separators (see `DeclarativeAgent._load_skills`).
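The loading step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `DeclarativeAgent._load_skills`: the function names `load_skills` and `build_system_prompt`, the exact newline placement around the `---` separators, and the closing `</skills>` tag are hypothetical.

```python
from pathlib import Path


def load_skills(skills_dir):
    """Read every *.md file in skills_dir in alphabetical order and join with '---'."""
    paths = sorted(Path(skills_dir).glob("*.md"), key=lambda p: p.name)
    bodies = [p.read_text(encoding="utf-8").strip() for p in paths]
    return "\n---\n".join(bodies)


def build_system_prompt(base_prompt, skills_dir):
    """Append a <skills> block to the unchanged baseline prompt (assumed layout)."""
    return f"{base_prompt}\n<skills>\n{load_skills(skills_dir)}\n</skills>"
```

Because the files are read once and concatenated in a fixed order, the resulting system prompt is identical on every turn, which is what makes it a natural candidate for prompt caching.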
The skill set comprises three markdown documents (`src/skills/*.md`), one for each of the three capabilities required of the AI agent:

`banking-procedures.md`. Maps each banking operation (account information, credit-card operations, replacements, disputes, referrals, account closure, transfers) to its preconditions, the discoverable tools required, ordering constraints between operations (e.g. “credit-limit increase requires no pending disputes”), and the canonical argument set. This skill file is a policy index intended to keep the model from having to derive potentially erroneous ordering rules from the raw KB.

`customer-interaction.md`. Defines a generic four-step conversational structure (*Greeting and Understanding*, *Triage*, *Verification and Action*, *Confirmation*) with a multi-request inventory at the beginning (“Identify ALL requests in their message before choosing a path”) and guidance to ask exactly one targeted clarifying question. This is the natural-language, non-deterministic analogue of a state machine: the phases are descriptive, not code, and the model is free to deviate from them.

`knowledge-discovery.md`. Specifies the KB-search strategy, specifically: *when* to search (any operation needing a discoverable tool, any policy edge case), *how* to construct queries for both embedding-based and golden retrieval, and how to recover if the search fails to return a result. As the benchmark tool names contain random four-digit suffixes (e.g. `close_bank_account_7392`) that cannot be guessed, this skill file emphasises that KB search is a hard precondition for state-changing tools.

The behaviour of our DeclarativeAgent proceeds as follows. At each turn, the DeclarativeAgent appends the incoming message to its state, concatenates `system_messages + messages`, and forwards the result to `generate_cached` (our prompt-caching variant of the tau2 benchmark's `generate` function). The full tool set is exposed every turn. No phase-conditional tool restrictions are imposed, nor are `tool_choice` overrides or specific per-phase instructions provided. The returned `AssistantMessage` is then appended to the state and returned to the orchestrator. The control flow is provided in `DeclarativeAgent.generate_next_message`.
## 5 ImperativeAgent
Our ImperativeAgent is defined by a finite\-state machine pipeline, as follows:
$$\begin{array}{l}\text{GREETING} \to \text{TRIAGE} \to \text{VERIFICATION} \to \text{PLANNING} \\ \qquad \to \text{EXECUTION} \to \text{CONFIRMATION} \to \text{COMPLETE}\end{array}$$

Additionally, we define an `ADVISORY` branch from TRIAGE, which is taken for purely informational requests and thus bypasses verification and planning. We also define an `ESCALATE` terminal phase reachable from VERIFICATION or EXECUTION for the case where a tool exceeds its retry budget. Phase transitions are deterministic functions of agent state; the LLM is invoked once per turn with a phase-specific instruction and the phase's allowed-tool subset. The state graph is shown in Figure [2](https://arxiv.org/html/2606.06923#S5.F2).
Our `AgentState` is a Pydantic model in `src/agents/state.py` that carries six fields which the phase logic reads on every turn: two boolean gates (`user_identified`, `verified`) that control entry to VERIFICATION and EXECUTION; two list fields (`pending_tasks`, `completed_tasks`) implementing the explicit task queue; a `tool_retry_counts` dict for per-tool retry tracking; and an `expect_violation_count` counter for response-type-mismatch instrumentation.

`ESCALATE` is the deterministic exit path for tool failures whose retry budget is exhausted. Without it, a model that repeatedly calls a state-changing tool with malformed arguments would loop indefinitely; `ESCALATE` thus bounds the worst-case trajectory length.
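A compact way to picture the state container and the deterministic phase logic is the sketch below. It is an assumption-laden illustration rather than the paper's code: a standard-library dataclass stands in for the Pydantic model, the `intent` argument and the exact ordering of checks in `determine_phase` are hypothetical, the retry budgets are illustrative values, and the final transition to COMPLETE is omitted.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Phase(Enum):
    GREETING = "greeting"
    TRIAGE = "triage"
    VERIFICATION = "verification"
    PLANNING = "planning"
    EXECUTION = "execution"
    CONFIRMATION = "confirmation"
    COMPLETE = "complete"
    ADVISORY = "advisory"
    ESCALATE = "escalate"


@dataclass
class AgentState:
    """Dataclass stand-in for the six-field state described in the text."""
    user_identified: bool = False
    verified: bool = False
    pending_tasks: List[str] = field(default_factory=list)
    completed_tasks: List[str] = field(default_factory=list)
    tool_retry_counts: Dict[str, int] = field(default_factory=dict)
    expect_violation_count: int = 0


# Illustrative budgets: tool -> (max_retries, phase entered once the budget is used up).
RETRY_POLICY = {
    "log_verification": (3, Phase.ESCALATE),
    "KB_search": (4, Phase.ADVISORY),
}


def determine_phase(state: AgentState, intent: Optional[str] = None) -> Phase:
    """Pure function of state: the same state always yields the same phase."""
    # Retry limits are checked first so that no later rule can override them.
    for tool, (max_retries, failure_phase) in RETRY_POLICY.items():
        if state.tool_retry_counts.get(tool, 0) >= max_retries:
            return failure_phase
    if not state.user_identified:
        return Phase.GREETING
    if intent is None:  # hypothetical flag: intent not yet established
        return Phase.TRIAGE
    if intent == "informational":
        return Phase.ADVISORY
    if not state.verified:  # hard gate: no planning or execution without verification
        return Phase.VERIFICATION
    if state.pending_tasks:
        return Phase.EXECUTION
    if state.completed_tasks:
        return Phase.CONFIRMATION
    return Phase.PLANNING
```

Because the verification check precedes the task-queue checks, a state with pending tasks but `verified=False` is routed back to VERIFICATION, which mirrors the hard gate described for the EXECUTION phase.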
Figure 2: The state graph of the ImperativeAgent. Nodes: GREETING, TRIAGE, VERIFY, PLANNING, EXECUTION, CONFIRM, COMPLETE, ADVISORY, ESCALATE. Edge labels: user_id, KB hit, verified, empty, retry.

We define six structural strategies to implement the deterministic guarantees of our imperative pipeline. Each targets a specific failure identified in the $\tau$-Knowledge benchmark paper [[1](https://arxiv.org/html/2606.06923#bib.bib1)]. Details of our strategies are provided in the Appendix.
1. Explicit task queue. PLANNING emits a structured `TASKS:` … `END_TASKS` block; the agent parses it into `state.pending_tasks`, injects the live queue as a `<task_queue>` on every EXECUTION turn, and pops items from the queue as they complete. EXECUTION transitions to CONFIRMATION only when the pending queue is empty. The goal of this strategy is to reduce “forgot the second request” types of failures.
2. Topological task ordering. After parsing, the queue is sorted by Kahn's algorithm [[15](https://arxiv.org/html/2606.06923#bib.bib15)] over a five-rule keyword precedence table (for example, credit limit $\prec$ dispute; open $\prec$ close). This strategy targets ordering failures.
3. State-driven phase transitions. Phase logic reads boolean flags (`user_identified`, `verified`) instead of inspecting the previous assistant message. Flags are updated in `_update_task_state()` immediately after the incoming message is appended and before phase determination, so transitions are decoupled from message-stream timing.
4. Verification hard gate. EXECUTION re-checks `state.verified` and returns to VERIFICATION if it is false. This way, state-changing tools are not used without a verified identity. This strategy targets failures from the LLM agent trusting unverified user assertions.
5. Per-tool retry policy with deterministic escalation. Each retriable tool gets a (`max_retries`, `failure_phase`) policy (e.g. `log_verification`: 3 retries $\to$ ESCALATE; `KB_search`: 4 retries $\to$ ADVISORY). Retry-limit enforcement runs first in `_determine_phase()` so no other logic can override it. This bounds worst-case trajectory length and also aims to improve the cost efficiency of the ImperativeAgent compared to the baseline LLMAgent and the DeclarativeAgent.
6. Strict response-type enforcement. Each phase declares an expected response type (`text`, `tool_call`, or `either`); a wrapper around `generate()` re-prompts requiring `tool_choice` (up to 2 times) on mismatch and records violations in `state.expect_violation_count` for post-hoc analysis.
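Strategies 1 and 2 can be sketched together: parse the planner's task block, then order the tasks with Kahn's algorithm over keyword precedence rules. The sketch makes several assumptions that are not taken from the paper: the block is assumed to hold one task per line, the two rules in `PRECEDENCE` are illustrative rather than the paper's five-rule table, and falling back to the original order on a cycle is a design choice of this example.

```python
import re
from collections import deque

# Illustrative (before_keyword, after_keyword) rules; not the paper's actual table.
PRECEDENCE = [("open", "close"), ("verify", "transfer")]


def parse_task_block(text):
    """Extract tasks from a 'TASKS: ... END_TASKS' block, assuming one task per line."""
    match = re.search(r"TASKS:(.*?)END_TASKS", text, re.S)
    if match is None:
        return []
    tasks = []
    for line in match.group(1).splitlines():
        # Drop optional bullet or numbering prefixes such as '- ' or '1. '.
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if cleaned:
            tasks.append(cleaned)
    return tasks


def order_tasks(tasks, precedence=PRECEDENCE):
    """Kahn's algorithm: repeatedly emit tasks whose prerequisites are all emitted."""
    lowered = [t.lower() for t in tasks]
    successors = [set() for _ in tasks]
    indegree = [0] * len(tasks)
    for before_kw, after_kw in precedence:
        for i, a in enumerate(lowered):
            for j, b in enumerate(lowered):
                if i != j and before_kw in a and after_kw in b and j not in successors[i]:
                    successors[i].add(j)
                    indegree[j] += 1
    ready = deque(i for i, d in enumerate(indegree) if d == 0)
    ordered = []
    while ready:
        i = ready.popleft()
        ordered.append(tasks[i])
        for j in sorted(successors[i]):
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    # A cycle leaves some tasks unemitted; keep the planner's order in that case.
    return ordered if len(ordered) == len(tasks) else list(tasks)
```

Tasks with no applicable rule keep their relative position, so the sort only moves a task when a precedence rule requires it.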
Table [2](https://arxiv.org/html/2606.06923#S5.T2) shows the `generate_next_message` execution order.
Table 2: Execution steps of `generate_next_message`
## 6 Theoretical Analysis
We relate our three policy classes to their expected behaviour on the Dec\-POMDP\.
Proposition 1 (Skill-file information advantage). Let $\mathcal{A}^{*}(h_t)$ denote the set of optimal next actions given information state $h_t$, and let $H_M(\mathcal{A}^{*} \mid h_t)$ denote the conditional entropy of $M$'s action distribution at that turn. For any prompt-side prior $\Sigma$ that is informative about $\mathcal{A}^{*}$,

$$H_M(\mathcal{A}^{*} \mid h_t,\, \theta_{\text{sys}}, \Sigma) \;\leq\; H_M(\mathcal{A}^{*} \mid h_t,\, \theta_{\text{sys}}),$$

with strict inequality whenever the model has a non-trivial procedural-competence gap $g(M) > 0$.
Regarding the first part of the proposition, we posit that skill files encode procedural knowledge, such as ordering constraints, verification preconditions, and search heuristics, and as such should have a non-negative impact on model results on the task. Regarding the second part of the proposition, the procedural-competence gap measures the model's shortfall in correctly following procedures. We assume that the expected gain from skill files in $\pi_D$ relative to $\pi_B$ will decrease to zero as model capacity increases and thus as $g(M) \to 0$.
Proposition 2 (Imperative restriction is policy-class shrinking). Let $\Pi_B$ be the set of behaviours expressible by $\pi_B$ over a fixed $M$ and let $\Pi_I$ be the set expressible by $\pi_I$ with the same $M$. Because $\pi_I$ allows only $a_t \in \mathcal{A}_{\phi_t}$ at each turn, and because $\mathcal{A}_{\phi_t} \subsetneq \mathcal{A}$ for every non-terminal phase, $\Pi_I \subsetneq \Pi_B$. Consequently, in the absence of additional compliance benefits,

$$\sup_{\pi \in \Pi_I} \mathbb{E}[r(\tau)] \;\leq\; \sup_{\pi \in \Pi_B} \mathbb{E}[r(\tau)].$$
The imperative policy aims to use deterministic gating to increase compliance sufficiently to offset the capacity loss that comes from restricting the action space.
Proposition 3 (Trajectory length governs cost). Let $T$ be the number of LLM calls in a task and let $\bar{c}_{\text{turn}}$ be the mean per-call cost for $M$. Then

$$\mathbb{E}\!\left[\mathrm{Cost/Task}\right] \;=\; \mathbb{E}[T] \cdot \bar{c}_{\text{turn}}.$$

The imperative policy's bounded per-tool retry budget caps $\mathbb{E}[T]$ from above. This should lower expected cost relative to the baseline.
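The decomposition in Proposition 3 can be checked numerically. The sketch below is illustrative only: the function name and the input layout (one list of per-call costs per task) are assumptions, not part of the paper's code. It takes the mean per-call cost to be the call-weighted mean (total cost divided by total calls), under which the identity holds exactly.

```python
import math


def cost_decomposition(per_call_costs: list[list[float]]) -> tuple[float, float, float]:
    """Return (E[T], mean per-call cost, E[Cost/Task]).

    per_call_costs holds, for each task, the cost in USD of every LLM call
    made in that task (hypothetical input layout).
    """
    n_tasks = len(per_call_costs)
    total_calls = sum(len(calls) for calls in per_call_costs)
    total_cost = sum(sum(calls) for calls in per_call_costs)
    mean_calls = total_calls / n_tasks          # E[T]
    mean_call_cost = total_cost / total_calls   # call-weighted mean per-call cost
    mean_task_cost = total_cost / n_tasks       # E[Cost/Task]
    return mean_calls, mean_call_cost, mean_task_cost


# Two tasks: one with 2 calls, one with 4 calls.
mean_calls, mean_call_cost, mean_task_cost = cost_decomposition(
    [[0.01, 0.02], [0.03, 0.01, 0.02, 0.01]]
)
print(math.isclose(mean_calls * mean_call_cost, mean_task_cost))
```

With this definition the product equals the per-task mean by construction, so any cost reduction at a fixed per-call price has to come from a shorter trajectory.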
Proposition 4 (Retrieval noise as channel degradation). Define the retrieval step as a noisy observation channel $\tilde{O} = O + \eta$ where $\eta$ injects off-topic chunks into $h_t$. A skill-file prior $\Sigma$ sharpens the action distribution $\pi(\cdot \mid h_t)$ but leaves $h_t$ itself unchanged. Therefore the data-processing inequality gives us

$$I(\mathcal{A}^{*};\, \tilde{h}_t,\, \Sigma) \;\leq\; I(\mathcal{A}^{*};\, \tilde{h}_t) + H(\Sigma) \;\leq\; I(\mathcal{A}^{*};\, h_t) + H(\Sigma),$$

with the first gap growing as $\eta$ degrades $\tilde{O}$.

Beyond a noise threshold, the lift from $\Sigma$ will be dominated by the information loss in $\tilde{O}$, so the declarative advantage from Proposition 1 will collapse under noisy retrieval.
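One way to obtain the chain of inequalities in Proposition 4 is sketched below. It rests on two assumptions that are not stated explicitly in the text: that $\Sigma$ is a discrete random variable with finite entropy, and that the noise is conditionally independent of $\mathcal{A}^{*}$ given $h_t$, so that $\mathcal{A}^{*} \to h_t \to \tilde{h}_t$ forms a Markov chain.

```latex
\begin{align*}
I(\mathcal{A}^{*};\, \tilde{h}_t,\, \Sigma)
  &= I(\mathcal{A}^{*};\, \tilde{h}_t) + I(\mathcal{A}^{*};\, \Sigma \mid \tilde{h}_t)
  && \text{(chain rule)} \\
  &\leq I(\mathcal{A}^{*};\, \tilde{h}_t) + H(\Sigma \mid \tilde{h}_t)
  && \text{(mutual information is at most entropy)} \\
  &\leq I(\mathcal{A}^{*};\, \tilde{h}_t) + H(\Sigma)
  && \text{(conditioning does not increase entropy)} \\
  &\leq I(\mathcal{A}^{*};\, h_t) + H(\Sigma)
  && \text{(data processing on } \mathcal{A}^{*} \to h_t \to \tilde{h}_t \text{)}
\end{align*}
```

Only the last step uses the data-processing inequality; the first three bound the contribution of the skill-file prior by $H(\Sigma)$ regardless of the retrieval noise.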
The four propositions thus predict that: (i) $\pi_D > \pi_B$ on weaker models under clean retrieval, (ii) $\pi_I$ outperforms $\pi_B$ only if it increases compliance enough to compensate for the reduced capacity, (iii) the imperative cost benefit depends on retry reduction, and (iv) the declarative advantage is conditional on retrieval quality.
## 7 Experimental Results
We instantiate the three policies on five large language models spanning roughly an order of magnitude in capability: Qwen3.5-Flash, Claude Haiku-4.5, Gemini-3.1-Flash-Lite, DeepSeek-v4-Flash, and DeepSeek-v4-Pro. Each model is paired with two retrieval regimes: *golden* retrieval, in which the task-critical documents are placed directly in the system prompt, and *embedding* retrieval, using a local `all-MiniLM-L6-v2` dense index served through a custom retriever plugged into tau2-bench. We do not report BM25 keyword retrieval because of its poor recall and the very high token cost of its retry loops. We evaluate on the 97-task suite from $\tau$-Knowledge banking. Details are provided in the Appendix and the companion GitHub repository.
Our primary metrics are defined as follows. *Pass1* is $\mathbb{E}_{\tau}[\mathbf{1}\{r(\tau) = 1\}]$ averaged uniformly over the tasks within a condition; infrastructure errors (i.e., LiteLLM auth retries exhausted) are excluded from the average. *DB match* is $\mathbb{E}_{\tau}[\mathbf{1}\{S_{\text{DB}}^{\text{final}} = S_{\text{DB}}^{\text{gold}}\}]$, the database half of the reward without the action-check half, that is, whether the agent reached the right end-state regardless of how it got there. *Cost/Task* is the per-task mean of the LiteLLM-reported `agent_cost` field, which sums per-message provider-reported usage at published prices and includes cache-read discounts when available. *Write-argument accuracy* is the fraction of state-mutating tool calls whose argument set matches the gold trajectory, evaluated over all writes issued in a condition.
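The metric definitions above can be made concrete with a small sketch. The `TaskResult` record and its field names are hypothetical and do not reflect the tau2-bench or LiteLLM data model. Excluding infrastructure errors from DB match as well as from Pass1 is an assumption made here for symmetry.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    reward: float                      # r(tau); 1.0 means the task passed
    db_final_equals_gold: bool         # final DB state equals the gold DB state
    infra_error: bool = False          # e.g. auth retries exhausted
    write_arg_matches: list[bool] = field(default_factory=list)  # one flag per write call


def _scored(results: list[TaskResult]) -> list[TaskResult]:
    # Infrastructure errors are excluded from the average.
    return [r for r in results if not r.infra_error]


def pass1(results: list[TaskResult]) -> float:
    scored = _scored(results)
    return sum(r.reward == 1 for r in scored) / len(scored) if scored else float("nan")


def db_match(results: list[TaskResult]) -> float:
    scored = _scored(results)
    return sum(r.db_final_equals_gold for r in scored) / len(scored) if scored else float("nan")


def write_argument_accuracy(results: list[TaskResult]) -> float:
    # Pooled over all write calls issued in the condition, not averaged per task.
    writes = [m for r in results for m in r.write_arg_matches]
    return sum(writes) / len(writes) if writes else float("nan")
```

Pooling write calls across tasks means that a task issuing many writes weighs more heavily in write-argument accuracy than in Pass1, where each task counts once.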
Tables [3](https://arxiv.org/html/2606.06923#S7.T3) and [4](https://arxiv.org/html/2606.06923#S7.T4) report the metrics under golden and embedding retrieval, respectively. The golden-retrieval results in Table [3](https://arxiv.org/html/2606.06923#S7.T3) support our first research question RQ1: the DeclarativeAgent improves Pass1 on four of the five models, with parity on Gemini-Flash-Lite. The gain scales roughly with the procedural-competence gap of the underlying model, with the exception of DeepSeek-Pro, which shows a larger gain than its smaller Flash version. The ImperativeAgent underperforms the baseline for every model, consistent with the policy-class-shrinking prediction of Proposition 2. We examine the compliance gain posed in research question RQ2 in the ablations of the next section.
Under embedding retrieval, as shown in Table [4](https://arxiv.org/html/2606.06923#S7.T4), Pass1 drops sharply for every model and orchestration strategy. The declarative advantage over the baseline collapses on DeepSeek-Pro and Haiku, while moderate gains remain on the medium-capacity Gemini-Flash-Lite and DeepSeek-Flash. This is consistent with Proposition 4, in that skill files cannot fully compensate for the noisy observation channel.
Table [5](https://arxiv.org/html/2606.06923#S7.T5) reports write-argument accuracy on DeepSeek-v4-Pro. The DeclarativeAgent is a strict improvement over the baseline on both retrieval modes and is substantially superior to the ImperativeAgent, which loses roughly twenty percentage points of write accuracy on either retrieval setting.
Table 3: Aggregate metrics under *golden* retrieval. The DeclarativeAgent is the best orchestration across the board, with the exception of a tie with the baseline on Gemini-Flash-Lite. The ImperativeAgent underperforms across the board.

Table 4: Aggregate metrics under *embedding* retrieval (`all-MiniLM-L6-v2`). Pass1 drops sharply on every model under noisy retrieval. The DeclarativeAgent underperforms the baseline on three of the five models, but remains the strongest approach on the medium-sized Gemini-Flash-Lite and DeepSeek-Flash.

[Figure 3 plot. Horizontal axis: Cost/Task (USD, log scale). Vertical axis: Pass1. Legend: Golden, Embed.; Baseline, Declarative, Imperative.]

Figure 3: Pass1 versus Cost/Task across all 30 (5 models × 3 agents × 2 retrieval types) combinations. Filled symbols are golden retrieval; empty symbols are embedding retrieval. The DeclarativeAgent points (squares) form the upper envelope of the golden frontier except for Gemini-Flash-Lite. Embedding-retrieval points (empty shapes) and ImperativeAgent points (triangles) lie well inside the envelope.

Table 5: Write accuracy on DeepSeek-v4-Pro (fraction of write tool calls whose arguments match the gold trajectory) and mean Cost/Task. Write actions are the dominant reward signal in `banking_knowledge`: a single mismatched write usually drops the task reward to 0.
## 8 Ablations on Compliance and Efficiency
While the ImperativeAgent's task-success rate is lower across the board, we examine whether it is a safer approach for compliance, since the state machine should prevent state-changing tool calls outside the EXECUTION phase, with EXECUTION reachable only via VERIFICATION. To test this, we replayed the stored simulations and computed compliance and efficiency metrics on the ImperativeAgent's tool-call sequence.
Specifically, for every pre-evaluated simulation and the existing `results.json` files, we walked the message stream and counted: (a) the number of state-mutating tool calls (write tools, defined as the set `apply_for_credit_card`, `call_discoverable_agent_tool`, `call_discoverable_user_tool`, `change_user_email`, `give_discoverable_user_tool`, `request_human_agent_transfer`, `submit_referral`, `submit_transaction`); (b) whether each write occurred before the agent had successfully completed a `log_verification` call; and (c) the maximum number of consecutive failed retries of the same tool.
From these we derive four additional metrics:
- `unauthorized_write_rate`: percentage of tasks with any invalid write, that is, tasks in which at least one write was performed before successful verification.
- `over_retry_rate`: percentage of trials in which the agent re-issued the same tool $\geq 4$ times consecutively with failed results.
- writes pre-verify: unauthorized writes as a percentage of all write calls, pooled over all tasks.
- mean trajectory length: mean number of assistant turns per task.
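The counting procedure and the first three metrics above can be sketched as follows. This is not the authors' `safety_metrics.py`: the `ToolCall` record, the function names, and the reading of "consecutive failed retries" as an unbroken run of failed calls to the same tool are all assumptions. Mean trajectory length is omitted because it needs assistant turns rather than tool calls.

```python
from dataclasses import dataclass

# Write tools as enumerated in the text.
WRITE_TOOLS = {
    "apply_for_credit_card", "call_discoverable_agent_tool",
    "call_discoverable_user_tool", "change_user_email",
    "give_discoverable_user_tool", "request_human_agent_transfer",
    "submit_referral", "submit_transaction",
}


@dataclass
class ToolCall:
    """Hypothetical flattened view of one tool call in the message stream."""
    name: str
    ok: bool  # True if the tool returned without an error


def analyze_task(calls: list[ToolCall]) -> tuple[int, int, int]:
    """Return (writes, unauthorized writes, longest failed run of one tool)."""
    verified = False
    writes = unauthorized = 0
    run_tool, run_len, max_run = None, 0, 0
    for call in calls:
        if call.name in WRITE_TOOLS:
            writes += 1
            if not verified:
                unauthorized += 1  # write fired before verification succeeded
        if call.name == "log_verification" and call.ok:
            verified = True
        if call.ok:
            run_tool, run_len = None, 0  # any success breaks the failure run
        else:
            run_len = run_len + 1 if call.name == run_tool else 1
            run_tool = call.name
            max_run = max(max_run, run_len)
    return writes, unauthorized, max_run


def compliance_metrics(tasks: list[list[ToolCall]]) -> dict[str, float]:
    stats = [analyze_task(calls) for calls in tasks]
    total_writes = sum(w for w, _, _ in stats)
    total_unauthorized = sum(u for _, u, _ in stats)
    return {
        "unauthorized_write_rate": 100 * sum(u > 0 for _, u, _ in stats) / len(stats),
        "over_retry_rate": 100 * sum(m >= 4 for _, _, m in stats) / len(stats),
        "writes_pre_verify": 100 * total_unauthorized / total_writes if total_writes else 0.0,
    }
```

The first two metrics are per-task rates, while writes pre-verify is pooled per write call, which is why the two unauthorized-write measures can rank agents differently.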
Table 6: Compliance and efficiency metrics for the ImperativeAgent. "Writes pre-verify" is the raw share of write tool calls that fired before `log_verification` succeeded.

From Table [6](https://arxiv.org/html/2606.06923#S8.T6), we see that the ImperativeAgent's unauthorized-write rate is in fact not lower than the baseline's (4.4% vs. 4.3% on golden; constant at 3.2% on embedding). On the granular per-write measure, the ImperativeAgent using embedding retrieval actually has the *highest* share of pre-verification (that is, unauthorized) writes in the table. In addition, the over-retry rate is 4–7× higher for the ImperativeAgent (6.7% golden, 4.3% embedding) than for the Baseline or Declarative agents ($\leq 1.1\%$), and the trajectory length is not shorter.
This implies that the verification gate of the ImperativeAgent is not effective in practice. Indeed, while TRIAGE → VERIFICATION is gated on a successful identification tool call, a model can enter EXECUTION through a phase mis-classification and still write before verifying. The same rigidity shows in the over-retry rate, which for the ImperativeAgent is about 6× that of the baseline. This illustrates the brittleness of code-based actions as opposed to adaptive LLM-based actions.
Detailed inspection of the traces also showed how the DeclarativeAgent was able to better handle complex tasks. The baseline LLM agent, for example, frequently failed to issue a `log_verification` call before state-mutating tools, misordered multi-step requests, and re-asked the user for information already returned by KB search. The DeclarativeAgent, by contrast, handled these types of subtasks correctly through the use of its declarative skill files.
## 9 Discussion and Conclusion
We show that agent skill files act as a procedural prior whose benefit scales with the model's procedural-competence gap: the DeclarativeAgent improves on the baseline on four of the five models, with the gain trending downwards on stronger models. Research question RQ3, whether retrieval quality dominates the orchestration choice, is answered clearly in the affirmative. On the other hand, imperative verification did not reduce unauthorized writes. This points to the brittleness of the ImperativeAgent approach: phase mis-classification can route actions past the deterministic gate, eliminating any potential compliance benefit of the imperative paradigm.
Theoretically, agent skill files reduce $H_M(\mathcal{A}^{*} \mid h_t)$ without modifying $h_t$, so the lift they provide is bounded by the procedural-competence gap of the underlying model and by the quality of the observation channel. The imperative agent's restricted per-phase action sets shrink the policy class; the resulting capacity loss can only be recovered if the restriction prevents enough costly mistakes to offset the reduction of capacity, which was not the case in our experiments.
From an engineering deployment perspective, skill-augmented system prompts add a marginal per-task LLM cost while delivering accuracy benefits under high-quality retrieval. Latency is unaffected when using prompt caching, since the agent skill block is static. Maintaining the DeclarativeAgent is straightforward: skill files are Markdown, editable by domain experts, and can be versioned alongside business-policy documents. For regulated domains, the empirical refutation of the imperative state-machine compliance guarantee is a useful result, in that deterministic gates do not necessarily improve compliance.
We conclude with three main points. First, a declarative agent using natural-language skill files gives a measurable accuracy improvement on tool-using LLM agents whose underlying model has a procedural-competence gap, at a modest cost premium. Second, the imperative state-machine paradigm trades capacity for compliance guarantees that may not hold in practice. Third, retrieval quality remains a dominant bottleneck for tool-using AI agents: when the observation channel is noisy, no orchestration paradigm can recover the lost information, and advances in retrieval should be prioritised alongside advances in orchestration if agentic systems are to reach their commercial potential.
##### Declaration of generative AI in the manuscript preparation process\.
During the preparation of this work the authors used Claude Code in compiling the experimental results and producing an initial version of the paper\. After using this tool, the authors reviewed and heavily modified all of the content and take full responsibility for the content of the published article\.
##### Funding disclosure\.
This research did not receive any specific grant from funding agencies in the public, commercial, or not\-for\-profit sectors\.
## References
- [1] Shi, Q., Zytek, A., Razavi, P., Narasimhan, K., Barres, V. $\tau$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge. Sierra Research / Princeton, arXiv:2603.04370, 2026. [https://arxiv.org/abs/2603.04370](https://arxiv.org/abs/2603.04370).
- [2] Yao, S., Shinn, N., Razavi, P., Narasimhan, K. $\tau$-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. ICLR, 2025.
- [3] Sierra Research. $\tau$-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (Codebase). GitHub repository, 2025. [https://github.com/sierra-research/tau2-bench](https://github.com/sierra-research/tau2-bench).
- [4] Bernstein, D.S., Givan, R., Immerman, N., Zilberstein, S. The Complexity of Decentralized Control of Markov Decision Processes. Mathematics of Operations Research, 27(4):819–840, 2002.
- [5] Anthropic. Agent Skills: Composable, Model-Read Procedural Knowledge for LLM Agents. [https://agentskills.io](https://agentskills.io/), 2025.
- [6] Firecrawl. How SKILL.md Files Work and Why They're Everywhere. Firecrawl Blog, 2026. [https://www.firecrawl.dev/blog/agent-skills](https://www.firecrawl.dev/blog/agent-skills).
- [7] LlamaIndex. Files for AI Agents: Context, Search, Skills Guide. LlamaIndex Blog, 2026. [https://www.llamaindex.ai/blog/files-are-all-you-need](https://www.llamaindex.ai/blog/files-are-all-you-need).
- [8] Anonymous. Recursive Language Models. arXiv:2512.24601, 2025. [https://arxiv.org/abs/2512.24601](https://arxiv.org/abs/2512.24601).
- [9] Zhang, A. Recursive Language Models. Blog post, 2025. [https://alexzhang13.github.io/blog/2025/rlm/](https://alexzhang13.github.io/blog/2025/rlm/).
- [10] Chase, H. et al. LangGraph: Building Stateful, Multi-Actor Applications with LLMs. LangChain, 2024. [https://langchain-ai.github.io/langgraph/](https://langchain-ai.github.io/langgraph/).
- [11] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023.
- [12] [Redacted for blind review]. Benchmarking Customer Support LLM Agents for Business-Adherence. EACL Industry Track, 2026. [https://aclanthology.org/2026.eacl-industry.15.pdf](https://aclanthology.org/2026.eacl-industry.15.pdf).
- [13] Toloka AI. TAU-bench extension: benchmarking policy-aware agents in realistic settings. Toloka AI Blog, 2026. [https://toloka.ai/blog/tau-bench-extension-benchmarking-policy-aware-agents-in-realistic-settings/](https://toloka.ai/blog/tau-bench-extension-benchmarking-policy-aware-agents-in-realistic-settings/).
- [14] Sierra Research. $\tau^{3}$-Bench: Advancing Agent Benchmarking to Knowledge and Voice. Sierra Blog, 2026. [https://sierra.ai/blog/bench-advancing-agent-benchmarking-to-knowledge-and-voice](https://sierra.ai/blog/bench-advancing-agent-benchmarking-to-knowledge-and-voice).
- [15] Kahn, A.B. Topological sorting of large networks. Communications of the ACM, 5(11):558–562, 1962.
## Appendix
### 9.1 Running the Experiments
A pilot run (5 tasks, 2 conditions) and the full experiment (97 tasks, 4 conditions) are available via the project Makefile:

```shell
# Pilot run
make pilot

# Full experiment
make experiment
```
### 9.2 File Index

| File | Description |
| --- | --- |
| `src/agents/baseline_agent.py` | BaselineAgent: tau2 LLMAgent + `generate_cached`, no skills, no orchestration. |
| `src/agents/declarative_agent.py` | DeclarativeAgent: LLMAgent with `<skills>`-block injection. |
| `src/skills/*.md` | Skill files (`banking-procedures`, `customer-interaction`, `knowledge-discovery`) consumed by the declarative agent. |
| `src/agents/state.py` | AgentState Pydantic model. |
| `src/agents/imperative_agent.py` | ImperativeAgent implementation. |
| `src/agents/cached_generate.py` | Drop-in `generate_cached` with Anthropic-/DeepSeek-style prompt caching and DeepSeek reasoning-content passthrough. |
| `src/agents/register.py` | Factories registering `baseline_agent`, `declarative_agent`, `imperative_agent` with the tau2 registry. |
| `src/analysis/safety_metrics.py` | Offline compliance/efficiency metric script. |
| `configs/baseline.yaml`, `configs/baseline-haiku.yaml`, `configs/scaling-flash.yaml`, `configs/scaling-flash-lite.yaml` | Run configs for the evaluations. |
### 9.3 DeclarativeAgent system prompt
The skill-file declarative agent uses the following system prompt template. Three Markdown skill files (`banking-procedures`, `customer-interaction`, `knowledge-discovery`) are concatenated with `---` separators into the `<skills>` block at agent construction time.

```text
<instructions>
You are a customer service agent that helps the user
according to the <policy> provided below.
In each turn you can either:
- Send a message to the user.
- Make a tool call.
You cannot do both at the same time.
Try to be helpful and always follow the policy.
</instructions>
<policy>
{domain_policy}
</policy>
<skills>
## SKILL: banking-procedures
... markdown content ...
---
## SKILL: customer-interaction
... markdown content ...
---
## SKILL: knowledge-discovery
... markdown content ...
</skills>
```
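The assembly of this prompt at agent construction time might look like the following sketch. The function names and the in-memory `skills` mapping are hypothetical, and the sketch assumes that the `## SKILL: <name>` header is added by the agent code rather than stored in each skill file, which the template does not settle.

```python
INSTRUCTIONS = (
    "You are a customer service agent that helps the user\n"
    "according to the <policy> provided below.\n"
    "In each turn you can either:\n"
    "- Send a message to the user.\n"
    "- Make a tool call.\n"
    "You cannot do both at the same time.\n"
    "Try to be helpful and always follow the policy."
)


def build_skills_block(skills: dict[str, str]) -> str:
    """Concatenate skill bodies with '---' separators (header format assumed)."""
    sections = [f"## SKILL: {name}\n{body.strip()}" for name, body in skills.items()]
    return "\n---\n".join(sections)


def build_system_prompt(domain_policy: str, skills: dict[str, str]) -> str:
    return (
        f"<instructions>\n{INSTRUCTIONS}\n</instructions>\n"
        f"<policy>\n{domain_policy}\n</policy>\n"
        f"<skills>\n{build_skills_block(skills)}\n</skills>"
    )


# Placeholder skill bodies; the real files live under src/skills/*.md.
prompt = build_system_prompt(
    "Example policy text.",
    {
        "banking-procedures": "... markdown content ...",
        "customer-interaction": "... markdown content ...",
        "knowledge-discovery": "... markdown content ...",
    },
)
```

Because the block is built once and never changes between turns, it forms a static prompt prefix, which is what makes it a good fit for prompt caching.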
### 9.4 ImperativeAgent Implementation Details
This appendix collects the listings that implement the six deterministic strategies\.
#### AgentState fields
```python
from pydantic import BaseModel, Field


class AgentState(BaseModel):  # excerpt: only the fields discussed here
    # Explicit boolean gates for VERIFICATION and EXECUTION
    user_identified: bool = False
    verified: bool = False

    # Explicit task queue for ordered multi-step execution
    pending_tasks: list[str] = Field(default_factory=list)
    completed_tasks: list[str] = Field(default_factory=list)

    # Per-tool retry tracking
    tool_retry_counts: dict[str, int] = Field(default_factory=dict)

    # Response-type violation counter
    expect_violation_count: int = 0
```
#### Strategy 1: Explicit task queue
PLANNING instructs the model to emit a structured block:
```text
TASKS:
1. Request credit limit increase
2. File transaction dispute
END_TASKS
```

The parsed list is stored in `state.pending_tasks` and injected into the EXECUTION phase instruction as a `<task_queue>` hint:

```python
queue_hint = (
    "\n<task_queue>\n"
    f"Pending: {', '.join(f'{i+1}. {t}' for i, t in enumerate(state.pending_tasks))}\n"
    f"Completed: {', '.join(state.completed_tasks) or 'none'}\n"
    "</task_queue>"
)
```

After each successful execution tool call, `_update_task_state()` pops the first item from `pending_tasks` into `completed_tasks`. EXECUTION transitions to CONFIRMATION only when `pending_tasks` is empty.
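The parse, pop, and transition steps described for Strategy 1 can be sketched in a few lines. None of this is the authors' code: the regular expressions, the `QueueState` stand-in for `AgentState`, and the function names are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for the TASKS ... END_TASKS block emitted in PLANNING.
TASK_BLOCK = re.compile(r"TASKS:\s*\n(.*?)\nEND_TASKS", re.DOTALL)
TASK_LINE = re.compile(r"^\s*\d+\.\s*(.+?)\s*$")


@dataclass
class QueueState:
    """Minimal stand-in for the task-queue fields of AgentState."""
    pending_tasks: list[str] = field(default_factory=list)
    completed_tasks: list[str] = field(default_factory=list)


def parse_task_block(text: str) -> list[str]:
    """Extract the numbered task descriptions; empty list if no block is found."""
    match = TASK_BLOCK.search(text)
    if match is None:
        return []
    tasks = []
    for line in match.group(1).splitlines():
        item = TASK_LINE.match(line)
        if item:
            tasks.append(item.group(1))
    return tasks


def update_task_state(state: QueueState, tool_succeeded: bool) -> None:
    """Move the head of the queue to completed after a successful execution call."""
    if tool_succeeded and state.pending_tasks:
        state.completed_tasks.append(state.pending_tasks.pop(0))


def phase_after_execution(state: QueueState) -> str:
    """EXECUTION hands over to CONFIRMATION only once the queue is drained."""
    return "EXECUTION" if state.pending_tasks else "CONFIRMATION"
```

Keeping the queue in explicit state rather than in the conversation means the transition test is a plain emptiness check and does not depend on the model remembering what remains to be done.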
#### Strategy 2: Topological task ordering \(Kahn’s algorithm\)
```python
MUST_PRECEDE_RULES: list[tuple[str, str]] = [
    ("creditlimit", "dispute"),  # credit limit before dispute
    ("open", "clos"),            # open account before closing
    ("transfer", "clos"),        # transfer funds before closure
    ("replacement", "clos"),     # resolve replacement before closure
    ("balance", "clos"),         # clear balance before closure
]

def _sort_tasks_by_dependencies(self, tasks: list[str]) -> list[str]:
    n = len(tasks)
    predecessors = [set() for _ in range(n)]
    for i, ti in enumerate(tasks):
        for j, tj in enumerate(tasks):
            if i == j:
                continue
            for kw_a, kw_b in MUST_PRECEDE_RULES:
                if kw_a.lower() in ti.lower() and kw_b.lower() in tj.lower():
                    predecessors[j].add(i)  # i must precede j
    queue = [i for i in range(n) if not predecessors[i]]
    result = []
    while queue:
        idx = queue.pop(0)
        result.append(tasks[idx])
        for j in range(n):
            # Enqueue j only when idx was its last unmet predecessor,
            # so that no task is queued (and emitted) twice.
            if idx in predecessors[j]:
                predecessors[j].discard(idx)
                if not predecessors[j]:
                    queue.append(j)
    # Tasks caught in a dependency cycle never reach the queue and are omitted.
    return result
```
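For a worked example, the following standalone variant of the keyword-based ordering can be run directly. It is a sketch rather than the authors' method: it uses in-degree counters and `collections.deque`, takes a small hypothetical rule set, and, unlike the listing, appends any tasks caught in a dependency cycle in their original order instead of dropping them.

```python
from collections import deque

# Hypothetical rules for illustration: (keyword_a, keyword_b) means a task
# containing keyword_a must precede a task containing keyword_b.
EXAMPLE_RULES = [("open", "clos"), ("transfer", "clos")]


def sort_tasks(tasks: list[str], rules=EXAMPLE_RULES) -> list[str]:
    n = len(tasks)
    lowered = [t.lower() for t in tasks]
    successors = [set() for _ in range(n)]
    indegree = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j or j in successors[i]:
                continue
            if any(a in lowered[i] and b in lowered[j] for a, b in rules):
                successors[i].add(j)   # edge i -> j: i must precede j
                indegree[j] += 1
    # Kahn's algorithm: repeatedly emit a task with no unmet predecessors.
    queue = deque(i for i in range(n) if indegree[i] == 0)
    order = []
    while queue:
        i = queue.popleft()
        order.append(i)
        for j in sorted(successors[i]):
            indegree[j] -= 1
            if indegree[j] == 0:
                queue.append(j)
    # Assumed fallback: keep cyclic tasks, in their original order.
    placed = set(order)
    order.extend(i for i in range(n) if i not in placed)
    return [tasks[i] for i in order]


print(sort_tasks(["Close savings account", "Transfer funds to checking", "Open new account"]))
# → ['Transfer funds to checking', 'Open new account', 'Close savings account']
```

Substring matching keeps the rules short but is fragile: two tasks that each mention both opening and closing an account would depend on each other, which is the cyclic case the fallback handles.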
#### Strategy 3 & 4: State\-driven transitions and EXECUTION hard gate
```python
if current == "TRIAGE":
    if state.user_identified:
        return "VERIFICATION"
    ...

if current == "EXECUTION":
    if not state.verified:
        return "VERIFICATION"
    ...
```
#### Strategy 5: Tool retry policy with deterministic escalation
```python
from dataclasses import dataclass


@dataclass
class ToolRetryPolicy:
    max_retries: int
    failure_phase: str  # phase to enter after exhausting retries


TOOL_RETRY_POLICY: dict[str, ToolRetryPolicy] = {
    "log_verification": ToolRetryPolicy(3, "ESCALATE"),
    "call_discoverable_agent_tool": ToolRetryPolicy(2, "ESCALATE"),
    "unlock_discoverable_agent_tool": ToolRetryPolicy(2, "ESCALATE"),
    "KB_search": ToolRetryPolicy(4, "ADVISORY"),
}
```

The `_determine_phase()` method checks retry counts first, before all other logic, so no phase can override the escalation:

```python
failure_phase = self._retry_limit_exceeded(state)
if failure_phase and current not in ("ESCALATE", "COMPLETE"):
    return failure_phase
```
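The body of `_retry_limit_exceeded` is not shown in the listings. A minimal sketch of what such a check could look like follows, written as free functions for self-containment. It assumes that escalation triggers once a tool's count reaches `max_retries` and that the first matching policy in dictionary order wins; both are guesses rather than the paper's documented behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolRetryPolicy:
    max_retries: int
    failure_phase: str  # phase to enter after exhausting retries


TOOL_RETRY_POLICY: dict[str, ToolRetryPolicy] = {
    "log_verification": ToolRetryPolicy(3, "ESCALATE"),
    "call_discoverable_agent_tool": ToolRetryPolicy(2, "ESCALATE"),
    "unlock_discoverable_agent_tool": ToolRetryPolicy(2, "ESCALATE"),
    "KB_search": ToolRetryPolicy(4, "ADVISORY"),
}


def retry_limit_exceeded(tool_retry_counts: dict[str, int]) -> Optional[str]:
    """Return the failure phase of the first tool whose budget is used up."""
    for tool, policy in TOOL_RETRY_POLICY.items():
        # Assumption: ">=" rather than ">" marks the budget as exhausted.
        if tool_retry_counts.get(tool, 0) >= policy.max_retries:
            return policy.failure_phase
    return None


def determine_phase(current: str, tool_retry_counts: dict[str, int]) -> str:
    """Retry check runs first; the remaining phase logic is elided here."""
    failure_phase = retry_limit_exceeded(tool_retry_counts)
    if failure_phase and current not in ("ESCALATE", "COMPLETE"):
        return failure_phase
    return current  # placeholder for the state-driven transitions
```

Running the check before any other transition logic is what gives the bound on trajectory length: once a budget is exhausted, no later rule can send the agent back into the retrying phase.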
#### Strategy 6: Response\-type enforcement
```python
def _enforce_expect(self, phase, tools_arg, tool_choice, messages, state):
    expect = PHASES[phase]["expect"]
    for attempt in range(MAX_EXPECT_RETRIES + 1):
        response = generate(
            model=self.llm, tools=tools_arg,
            tool_choice=tool_choice, messages=messages,
            # ... (remaining arguments elided in the original listing)
        )
        got_tool = response.is_tool_call()
        got_text = response.has_text_content() and not got_tool
        if expect == "either":
            return response
        if expect == "tool_call" and got_tool:
            return response
        if expect == "text" and got_text:
            return response
        state.expect_violation_count += 1
        if expect == "tool_call":
            tool_choice = "required"  # force tool use on retry
        elif expect == "text":
            messages = [correction_msg] + messages[1:]  # inject correction
    return response  # best-effort fallback
```