Learning to Construct Practical Agentic Systems

arXiv cs.LG 06/02/26, 04:00 AM Papers
Summary
This paper proposes principled approaches for designing and optimizing practical agentic LLM systems, introducing a framework with pseudo-tools and fixed workflows to improve modularity, cost-efficiency, and accuracy across diverse tasks.
arXiv:2606.00189v1 Announce Type: new Abstract: Automated design and optimization of agentic LLM-based systems leads to sophisticated systems that substantially improve result quality over off-the-shelf agentic patterns. However, studies of fielded agentic systems show that production systems focus much more on issues such as simplicity, controllability, and predictability of inference costs. In this paper we propose principled approaches to designing and optimizing practical agentic systems. We describe an agent framework that enables designers to enforce modularity in agentic systems, by defining "pseudo-tools" that call LLMs recursively on a restricted context. Using this framework we hand-engineer agents for a diverse set of tasks, and show that relative to dynamically-planned workflows, hand-constructed fixed workflows are generally cheaper and more accurate. We then propose novel learning methods for the agentic components required by this framework, namely pseudo-tools and fixed workflows. These learning methods generally outperform hand-engineered agents. We also exploit the modularity of the framework to apply multi-objective optimization methods to jointly optimize cost and response quality and blend the results of multiple learning systems.
Original Article
View Cached Full Text
Cached at: 06/02/26, 03:40 PM
# Learning to Construct Practical Agentic Systems
Source: [https://arxiv.org/html/2606.00189](https://arxiv.org/html/2606.00189)
Aditya Kumar1 Carnegie Mellon University adityaku@andrew\.cmu\.eduZhihan Lei1 Carnegie Mellon University lexl@andrew\.cmu\.eduJerry Yan1 Carnegie Mellon University jerryy2@cs\.cmu\.eduJoshua W\. Momo Carnegie Mellon University jmomo@andrew\.cmu\.eduLauhitya Reddy Dept\. of Computer Science Emory University lreddy3@emory\.cmuRafael Enrique Cabrera Jimenez Carnegie Mellon University rafaelcabrerajimenez7@gmail\.comCassandra A\. Cohen Carnegie Mellon University ccohen2@andrew\.cmu\.eduArthur Kajiyama Carnegie Mellon University akajiyam@andrew\.cmu\.eduWilliam W\. Cohen Carnegie Mellon University wcohen@cmu\.edu

###### Abstract

Automated design and optimization of agentic LLM\-based systems leads to sophisticated systems that substantially improve result quality over off\-the\-shelf agentic patterns\. However, studies of fielded agentic systems show that production systems focus much more on issues such as simplicity, controllability, and predictability of inference costs\. In this paper we propose principled approaches to designing and optimizing*practical*agentic systems\. We describe an agent framework that enables designers to enforce modularity in agentic systems, by defining “pseudo\-tools” that call LLMs recursively on a restricted context\. Using this framework we hand\-engineer agents for a diverse set of tasks, and show that relative to dynamically\-planned workflows, hand\-constructed fixed workflows are generally cheaper and more accurate\. We then propose novel learning methods for the agentic components required by this framework, namely pseudo\-tools and fixed workflows\. These learning methods generally outperform hand\-engineered agents\. We also exploit the modularity of the framework to apply multi\-objective optimization methods to jointly optimize cost and response quality and blend the results of multiple learning systems\.

## 1Introduction

11footnotetext:Equal contribution\.Automated design and optimization of agentic systems is an important research area, with much work on optimizing agentic systemsKhattabet al\.\([2024](https://arxiv.org/html/2606.00189#bib.bib16)\); Yuksekgonulet al\.\([2024](https://arxiv.org/html/2606.00189#bib.bib17)\); Ye and others \([2024](https://arxiv.org/html/2606.00189#bib.bib18)\); Huet al\.\([2024](https://arxiv.org/html/2606.00189#bib.bib15)\); Zhugeet al\.\([2024a](https://arxiv.org/html/2606.00189#bib.bib13)\); Zhanget al\.\([2024](https://arxiv.org/html/2606.00189#bib.bib14)\), and even developing self\-improving onesZhanget al\.\([2026](https://arxiv.org/html/2606.00189#bib.bib8)\); Fanget al\.\([2025](https://arxiv.org/html/2606.00189#bib.bib5)\)\. Although this has led to improvements in result quality, studies suggest that different concerns dominate in production environments, such as simplicity and predictabilityPanet al\.\([2026](https://arxiv.org/html/2606.00189#bib.bib9)\)\. For example, it was observed that agentic systems in use usually use a hand\-coded static workflow, rather than a ReAct\-styleYaoet al\.\([2023](https://arxiv.org/html/2606.00189#bib.bib10)\)agent loop, and many use multiple LLMs for different tasks in the same workflow\.

In this paper, we developprincipled approaches to designing and optimizing agentic systems, focusing on designingpractical systems that are well\-suited to production settings\. We believe this goal requires aframework that enables designers to enforce modularityin agentic systems\. Our proposal is to do this by constructing “pseudo\-tools”—which look like external LLM tools, but call another LLM with a restricted context when they are called\.

Contributions\. We begin by hand\-engineering agents in the same framework for 19 diverse tasks, in domains including finance, math, planning, and health\. We verify that hand\-constructed fixed workflows are indeed often faster and more accurate than dynamically\-planned workflows\. We then introduce methods for improving and/or developing the components of these hand\-constructed systems, namely, the workflows and the pseudo\-tools invoked by workflows\. Finally, we exploit the modularity of the framework to apply multiobjective optimization methodsDebet al\.\([2002](https://arxiv.org/html/2606.00189#bib.bib39)\)to jointly optimize cost and response quality\.

The proposed framework\.When LLMs generate text, every token depends on all previously\-generated tokens\. This is also true when LLMs are “thinking", or planning tool\-using actions in a ReActYaoet al\.\([2023](https://arxiv.org/html/2606.00189#bib.bib10)\)loop \(reason, act, repeat\)\. The high degree of interdependence in agentic decision\-making and reasoning contrasts with traditional software systems, which are made up of modular pieces of software\.

However, agents do have the ability to restrict context when invoking tools, as tools receive only a part of the context as input\. We propose to exploit this feature to modularize reasoning: specifically, we introduce “pseudo\-tools" \(ptools\) which solve a subtask by recursively calling an LLM with a restricted context\. Technically, the core unit of our framework is the*interface*—a typed function stub with a natural\-language docstring specifying intended behavior\. An interface can be bound to different alternative*implementations*: possible implementations would include an LLM call with a task\-specific prompt; code generated on\-the\-fly given the input; a ReAct sub\-agent; or static Python code\. For convenience in developing agents, there is also a default LLM\-based implementation for every interface, which uses an LLM to predict the function stub’s output given an input\. The binding between an interface and an implementation is easily reconfigured, and the binding for one interface can be selected independently of other binding choices\. Because each interface exposes typed inputs and outputs, the space of valid implementations is constrained enough for tractable search, yet rich enough to span the full range from low\-cost deterministic code to flexible but expensive multi\-turn agent loops\.

## 2Related work

Agent architectures\.Because LLMs require some external harness to call tools, and because LLMs can often be productively used in sequence, there are numerous frameworks for implementing agentic systems: well\-known frameworks include LangChain/LangGraph, DSPyKhattabet al\.\([2024](https://arxiv.org/html/2606.00189#bib.bib16)\), Pydantic AIPydantic Services Inc\. \([2024](https://arxiv.org/html/2606.00189#bib.bib41)\), SmolAgentsRoucheret al\.\([2025](https://arxiv.org/html/2606.00189#bib.bib42)\), and othersChoure and Prajapat \([2025](https://arxiv.org/html/2606.00189#bib.bib3)\)\. Our framework has some novel elements, discussed below, but is very lightweight\.222It has less than 1000 lines of Python in its core, and makes use of several existing packages as components: we use Pydantic AI’s ReACT\-style dynamic planner, SmolAgents’ Python sandbox, and use LiteLLMBerriAI \([2023](https://arxiv.org/html/2606.00189#bib.bib40)\)for LLM calls\.Notably it was no explicit representation of workflows over LLM calls—instead workflows are implemented as Python functions\. \(This is sometimes called a "native agent architecture"\)\.

Agent optimization and workflow generation\.Agentic architectures are often designed for particular types of optimizations—e\.g\., DSPy supports prompt optimization well\. The optimizations we focus on are inspired by work on workflow optimization and generation, such as ADAS and the later AFLOWHuet al\.\([2024](https://arxiv.org/html/2606.00189#bib.bib15)\); Zhanget al\.\([2024](https://arxiv.org/html/2606.00189#bib.bib14)\)\. For instance, ADAS defines workflows \(similarly to us\) as code, and refines workflows by incrementally using a “meta agent” that is guided by a database of historically useful workflows; AFLOW represents agents as graphs, where each node is an LLM action, and edges encode data\-flow dependencies and data transformations\. In AFLOW the agent space is searched with Monte Carlo Tree Search \(MCTS\), rather than conducting a linear search as in ADAS\. GPTSwarmZhugeet al\.\([2024b](https://arxiv.org/html/2606.00189#bib.bib1)\)treats agent optimization as graph optimization, using operators that alter the graph topology \(e\.g\., by reducing the distance information flows\), combined with node\-level prompt optimization\. Most of these systems focus on improving quality along a single dimension, output quality, and often develop workflows that are complex and costly\. We focus here on automatically constructing workflows that are simple and efficient \(as well as effective\)\.

Workflow Memory And Caching\.An alternative to building workflows is to save and reuse workflows\. Agent Workflow MemoryWanget al\.\([2025b](https://arxiv.org/html/2606.00189#bib.bib28)\)stores complete trajectories and retrieves them as in\-context guides on new instances; similar systems include Agent Skill InductionWanget al\.\([2025a](https://arxiv.org/html/2606.00189#bib.bib29)\)and ExpeLZhaoet al\.\([2024](https://arxiv.org/html/2606.00189#bib.bib30)\)\. These methods reuse*whole trajectories or skills*; we instead extract recurring*sub\-steps*and promote them into typed pseudo\-tools that the agent composes through ordinary tool calls\.

## 3Methods

### 3\.1Framework

Interfaces and implementations\.The agentic framework we use333[https://github\.com/wwcohen/secretagent](https://github.com/wwcohen/secretagent)extends Pydantic AI444[https://pydantic\.dev/docs/ai/overview/](https://pydantic.dev/docs/ai/overview/)with two new concepts:*interfaces*and*implementations*\. An*interface*looks like a Python function stub, and consists of a function name; a set of typed named inputs; an output type; and a natural language description of the behavior of the function \(a Python “docstring”\)\. An interface can be*bound*to a type\-compatible*implementation*, and a bound interface can be called just like a Python function\. The framework includes a set of pre\-defined*implementation factories*which can be used to produce implementations of various types: ordinary Python functions; a single call to an LLM, using a*default prompt*automatically derived from the interface description;a single LLM call using a user\-defined prompt template;555The template should have slots corresponding to the interfaces argumentsa ReAct agent plus a list of tools; a single LLM call that generates Python code \(optionally using a given list of tools\) appropriate for a given input, and then executes that code in a sandbox\. There are also a small number of special\-purpose compound implementations for common agentic tasks, e\.g\., combining an implementation with an interface that extracts an answer from an unstructured LLM response, or retrying an LLM call until a validation step succeeds\.

The root interface and configurations\.One designated*root interface*is called for the top\-level input from a user or benchmark case\. A*task configuration*is defined by the root interface, together with the set of bindings of every interface \(including the root\)\. Task configurations are easy to change, serialize, and restore\.

Expressiveness\.This framework flexibly supports a broad set of agent behaviors\. Interfaces can act as tools, when bound to a Python implementation; as recursive LLM calls \(pseudo\-tools\); or as subagents\. Alternative bindings for the root interface can also implement many standard agent behaviors: e\.g\., the root interface can be bound to a Python function that calls a mix of tools and pseudo\-tools, providing a static, engineered workflow; alternatively, a workflow can be dynamically planned by binding the root interface to ReAct with the same set of \(pseudo\-\)tools\. Many variations and interpolations between these behaviors can also be easily implemented, as we will discuss below\.

### 3\.2Learning and optimization methods

The interface/implementation framework allows a designer to explore a broad space of possible agentic designs, which will lead to different trade\-offs with respect to different dimensions of performance \(e\.g\., cost, interpretability, output quality, etc\)\. Our goal is to support manual exploration of this space and, where possible, to automate the exploration of this space\. Among the design decisions are: when should pseudo\-tools be introduced to modularize agent behavior? given a toolkit, what kind of implementation is appropriate for the root interface—e\.g\., should actions be dynamically planned, or can a fixed workflow be used? how can the modularity of the system be exploited to improve efficiency or testability—e\.g\., can some pseudo\-tools be implemented by smaller and cheaper LLMs, or even replaced with zero\-cost Python code? In this section we consider these decisions\.

#### 3\.2\.1Learning a pseudo\-tool toolkit

One important design task is constructing pseudo\-tools to modularize agent behavior\.

The*ptool inducer*learns a typed library of pseudo\-tools from recorded agent traces, and is itself a one\-shot LLM pipeline rather than a gradient\-trained model\. The intuition is that even an unstructured rollout—a ReAct trajectory, or a free\-form chain\-of\-thought—implicitly factors a task into a sequence of reasoning operations, and that recurring operations can be promoted into named, recallable subroutines for a future agent\.

Concretely, the inducer takes as input a set of recorded rollouts from a base agent on training instances\. We support two trace modes:*ReAct*, which extracts the per\-stepthoughtfield of each tool\-using turn, and*CoT*, which chunks the agent’s free\-form chain\-of\-thought into atomic reasoning steps\. An optional filter restricts induction to rollouts that produced a correct final answer\.

A four\-stage LLM pipeline then converts these thoughts into ptool stubs\. \(1\) Each thought is labeled independently with a short, case\-independent action type \(e\.g\., “extract temporal constraints”, “verify suspect alibi”\)\. \(2\) If the resulting label set is large, a merge call collapses synonymous categories into a smaller canonical set\. \(3\) Categories whose frequency exceeds a minimum count are kept as candidates, and the top\-KKare retained\. \(4\) For each retained category, a final LLM call synthesizes an interface stub—a name, a typed signature, and a natural\-language docstring that describes the reasoning pattern as a self\-contained subroutine\.

The output is a Python module of typed function stubs, each bound to a single\-LLM\-call implementation that uses the synthesized docstring as the prompt; the module is then registered as the action space of a downstream ReAct sub\-agent\. At runtime each induced ptool reduces to a single LLM call against its docstring; the downstream agent never invokes the inducer or sees the synthesis prompts, so the runtime cost is the same as an equivalently sized hand\-written toolkit\. Three hyperparameters control the pipeline—the maximum number of ptoolsKK, the minimum category count for retention, and the correct\-only flag, but they are fixed for all our experiments; for details see §[4\.2](https://arxiv.org/html/2606.00189#S4.SS2)\.

### 3\.3Distilling pseudo\-tools to tools

After decomposing a problem, it may be that some interfaces are simple enough to implement with small specialist LMs, or even without calling LLMs at all\. In particular, given collected target input/output examples and the interface “stub”, strong coding models can sometimes generate Python code to implement some interface\.

The*code distillation learner*replaces a single pseudo\-tool’s LLM call with a Python function\. Given an interface and the \(input, output\) pairs harvested from recorded rollouts \(optionally restricted to rollouts that produced a correct final answer\), the learner prompts a strong code\-generating LLM fornncandidate functions, executes each on all the training cases, picks the best in a round by training accuracy, and feeds any failures back as targeted feedback for the next ofRRrounds\. The generated function is allowed to*abstain*by returningNoneor raising an error, in which case the workflow backs off to the original LLM\-based ptool; however, an error when the function does not abstain \(an error of commission\) cannot be recovered from\. The distilled code is therefore retained only when a holdout shows a low rate of errors of comission\. The result of code distillation is*ptool\-distill*: a collection of low cost, code\-based implementations that replace LLM calls for a subset of the steps\.

### 3\.4Learning a workflow

The experiments below will confirm the observations of othersPanet al\.\([2026](https://arxiv.org/html/2606.00189#bib.bib9)\), and show that well\-designed static workflows are often faster and more accurate than ReAct\-style agent loops\. This is no surprise: the flexibility of an agent loop comes at the price of unpredictability in cost and performance\. We thus explored two approaches to learning workflows for a task\.

Learning workflows by code distillation\.This approach applies the code distillation learner to the root interface itself, producing a Python workflow that orchestrates the existing toolkit\. For this case, the code generation prompt is extended with the signatures and docstrings for every tool or ptool, a few sampled tool\-call traces from successful rollouts of the existing workflow, and optional ReAct traces to show "thinking" episodes\. The prompt also includes some hand\-written reference workflows from other benchmarks, so it is akk\-shot rather than0\-shot prompt\. We instruct the learner that returningNoneis strictly preferred to an error of comission, as code distillation for workflows also allows backoff to a simulate call on the same root interface\.

Learning a workflow by hill\-climbing\.The*orchestration learner*treats both the workflow and the ptool toolkit as editable Python source, and improves them jointly across iterations\. The supervisor may rewrite the workflow body, refine ptool docstrings \(which serve as their prompts under the default LLM\-call binding\), or override per\-interface configuration such as backend model assignments\. The learner takes as input the root interface, a training set, an eval set, and a starting workflow that is either \(i\) the benchmark’s hand\-written workflow or \(ii\) a workflow composed at iteration zero from the available ptools by a single LLM call\. We call the former a workflow\-seeded run and the latter a tool\-seeded run\. In a workflow\-seeded run, the supervisor inherits both a workflow and a ptool toolkit, and may edit either; in a tool\-seeded run, only the toolkit is given, and the initialization step described in Appendix[C\.3](https://arxiv.org/html/2606.00189#A3.SS3)synthesizes an iteration\-zero workflow before the iterative loop begins\. Each iteration evaluates the current workflow, samples failed cases, and prompts a strong*supervisor*LLM with the current workflow source, a per\-interface profiling summary, the failure traces, and the iteration history\. The supervisor returns either edited source \(optionally with configuration overrides, such as model assignments\) or a no\-change signal\. A proposed edit is accepted if it improves training accuracy, or if it matches training accuracy and improves evaluation accuracy; otherwise is rejected\. The learner terminates after a fixed iteration budget or after five consecutive non\-improvements\. The two modes let the same machinery serve two purposes:*repairing*an existing workflow and*inducing*a workflow from a given toolkit\. Full algorithmic details and artifacts are in Appendix[C](https://arxiv.org/html/2606.00189#A3)\.

### 3\.5Bi\-objective optimization to fuse alternatives

Choice of learners and implementations as discrete optimization\.When ptools are optimized by code distillation, the result may be a toolkit that is faster on average, but possibly less accurate \(due to errors of comission\), so choosing the best set of ptool implementations can be non\-obvious—especially since some workflows might be more robust to tool errors than others\. Tradeoff arises because LLM\-calling ptools might require strong LLMs, or might be nearly as effective with weaker, cheaper LLMs\. In our architecture, all of these design choices can be expressed configuration choices, specifically choices about how to bind interfaces, and how to parameterize individual ptools \(e\.g\., with ptool\-specific model choices\)\. Alternative procedures for learning an agent, or revising an initial seed workflow and/or set of tools, can also be expressed as configuration choices\. This leads to the question of how to best search a space of alternative configurations—in general, for a desirable tradeoff of output quality with other metrics such as cost, latency, or interpretability\.

We use the DEAP frameworkFortinet al\.\([2012](https://arxiv.org/html/2606.00189#bib.bib38)\)to perform multi\-objective search over the configuration space\. Concretely, each interface in the system contributes one or more*genes*to a chromosome: a*method gene*which selects among a discrete set of available implementations, and a*model gene*which selects from a discrete set of backend LLMs\. We also allow higher\-level genes which*compound*—e\.g\., a single gene value may expand to multiple configuration overrides \(e\.g\., selecting “program\-of\-thought” also sets tool lists and token limits\)\. Interface\-level genes allow the optimizer to assign different models and methods to different stages of a workflow independently\. The search space is defined declaratively in YAML for each benchmark, based on available options \(potentially including learned implementations\)\.The optimizer evaluates configurations by running each on validation data, and recording output correctness and LLM cost per query\.666In the experiments below, we used the NSGA\-II algorithmDebet al\.\([2002](https://arxiv.org/html/2606.00189#bib.bib39)\), a population\-based evolutionary strategy that maintains a Pareto frontier over these two objectives\. NSGA\-II uses uniform crossover and random\-reset mutation over the categorical chromosome encoding, with crowding\-distance\-based tournament selection to preserve diversity along the frontier\.

As well as comparing learned implementations to hand\-engineered ones, this formalization naturally supports heterogeneous LLM assignment: different sub\-interfaces can be bound to different backend models, e\.g\., allowing the optimizer to discover configurations where a cheap model handles extraction while an expensive model handles reasoning\.

Caching\.The modularity of interfaces means that similar chromosomes often invoke the same interfaces with the same inputs; we thus cache LLM calls during optimization, storing both the output and the reported cost so that optimization is cheaper and faster while still reflecting the true cost of each configuration\.777Specifically, we cache the LLM output as well as the cost of the call via LiteLLM’s built\-in cache, so that optimization will be cheaper and faster, but still reflect true costs\. See Appendix[D\.3](https://arxiv.org/html/2606.00189#A4.SS3)for a measurement of cache effectiveness on one of our sweeps\.Caching can reduce the real cost of optimization sweeps by up to a factor of 8\.

## 4Experimental results

### 4\.1Why are static workflows preferred in practice?

To our knowledge, there are no published data on the outcome of systematic engineering of agentic systems, across multiple benchmarks, using a common experimental harness\. To address this shortage, over the course of approximately six weeks, each coauthor implemented and optimized888Optimization was by ”graduate student descent” on validation data, exploring different toolkits/ptoolkits, agent loops, and prompts\. Most coauthors are future or current master’s students in AI\.1\-3 benchmarks, while the group also engaged in literature review and refining the common framework\. Benchmarks selected were cited by related papers, or based on coauthor expertise and interest\.

Table 1:Left, result quality \(generally accuracy\) across multiple benchmarks for different root interface bindings\. Right, cost in US dollars per 100 task instances using Together\.ai’s costs for DeepSeek\-V3\.1\. The static workflows are hand\-engineered on a dev set\. The dynamic workflow \(ReAct\) uses the same pseudo\-tools and tools\. The zero\-shot baseline uses the same prompt as ReAct but no tools, so makes only one LLM call\. DeepSeek\-V3\.1 was used as the LLM model throughout\.CorrectnessCost \(per 100 examples\)TaskStaticWorkflowDynamicWorkflow\(ReAct\)Zero\-shot\(DefaultImp\.\)StaticWorkflowDynamicWorkflow\(ReAct\)Zero\-shot\(DefaultImp\.\)BBH Date Understanding0\.840\.840\.93\\mathbf\{0\.93\}0\.520\.520\.280\.280\.990\.990\.10\\mathbf\{0\.10\}BBH Geometric Shapes0\.300\.300\.350\.350\.53\\mathbf\{0\.53\}0\.630\.632\.402\.400\.07\\mathbf\{0\.07\}BBH Penguins in a Table0\.630\.630\.720\.720\.93\\mathbf\{0\.93\}0\.210\.210\.540\.540\.05\\mathbf\{0\.05\}BBH Sports Understanding0\.870\.870\.88\\mathbf\{0\.88\}0\.650\.650\.130\.130\.290\.290\.02\\mathbf\{0\.02\}FinQA0\.75\\mathbf\{0\.75\}0\.320\.320\.620\.620\.12\\mathbf\{0\.12\}0\.950\.950\.140\.14MedAgentBench0\.87\\mathbf\{0\.87\}0\.690\.690\.000\.000\.640\.640\.650\.650\.22\\mathbf\{0\.22\}MedCalc Formulas0\.810\.810\.82\\mathbf\{0\.82\}0\.600\.600\.15\\mathbf\{0\.15\}1\.201\.200\.190\.19MedCalc Rules0\.50\\mathbf\{0\.50\}0\.470\.470\.450\.450\.270\.271\.291\.290\.21\\mathbf\{0\.21\}MUSR Murder0\.68\\mathbf\{0\.68\}0\.620\.620\.520\.520\.470\.471\.341\.340\.11\\mathbf\{0\.11\}MUSR Objects0\.58\\mathbf\{0\.58\}0\.360\.360\.350\.350\.340\.340\.990\.990\.10\\mathbf\{0\.10\}MUSR Teams0\.61\\mathbf\{0\.61\}0\.490\.490\.590\.590\.280\.281\.381\.380\.14\\mathbf\{0\.14\}NaturalPlan Calendar0\.62\\mathbf\{0\.62\}0\.460\.460\.540\.540\.460\.460\.790\.790\.09\\mathbf\{0\.09\}NaturalPlan Meeting0\.37\\mathbf\{0\.37\}0\.230\.230\.230\.230\.570\.570\.910\.910\.11\\mathbf\{0\.11\}NaturalPlan Trip0\.21\\mathbf\{0\.21\}0\.160\.160\.170\.170\.370\.371\.191\.190\.08\\mathbf\{0\.08\}Rulearena Airlines0\.98\\mathbf\{0\.98\}0\.890\.890\.410\.410\.97\\mathbf\{0\.97\}2\.402\.401\.121\.12Rulearena Tax0\.50\\mathbf\{0\.50\}0\.150\.150\.110\.111\.011\.013\.493\.490\.68\\mathbf\{0\.68\}Rulearena NBA0\.610\.610\.590\.590\.67\\mathbf\{0\.67\}1\.491\.492\.952\.951\.42\\mathbf\{1\.42\}Tabular Math WP0\.900\.900\.97\\mathbf\{0\.97\}0\.870\.870\.070\.070\.690\.690\.04\\mathbf\{0\.04\}τ\\tauBench Retail0\.520\.520\.56\\mathbf\{0\.56\}0\.110\.114\.754\.757\.037\.033\.50\\mathbf\{3\.50\}Average0\.64\\mathbf\{0\.64\}0\.560\.560\.470\.470\.700\.701\.661\.660\.44\\mathbf\{0\.44\}![Refer to caption](https://arxiv.org/html/2606.00189v1/results/correct_react.png)

![Refer to caption](https://arxiv.org/html/2606.00189v1/results/correct_thinking.png)

Figure 1:Left, the accuracy of engineered static workflows compared to ReAct; right, accuracy of workflows compared to a traditional zero\-shot tuned prompt\. Each box is a benchmark, where the y\-axis position is mean workflow accuracy and the x\-axis position is ReAct, so boxes above the liney=xy=xshow that workflow performs best\. Box width is the standard error of the mean\.Static workflows and pseudo\-tools are fast and effective\.In Table[1](https://arxiv.org/html/2606.00189#S4.T1)\) we show the results of static workflows across all benchmarks \(on test data heldout during this optimization phase\)\. We also show two ablations\. One uses a ReAct loop over the same toolkit \(with a default prompt constructed from the interface description\)\. This conditions ablates exactly the information present in the hand\-engineered workflow, but retains the engineered set of tools/ptools\. The other is a zero\-shot baseline, using the same top\-level prompt as the ReAct loop, which ablates the engineered toolkit as well\.

In Table[1](https://arxiv.org/html/2606.00189#S4.T1)the engineered static workflows usually perform best\. The left of Figure[1](https://arxiv.org/html/2606.00189#S4.F1)gives a graphic view of the comparison to ReAct, and also indicates the standard error of these measurements\. In Table[1](https://arxiv.org/html/2606.00189#S4.T1)we also record the cost of each strategy, showing that static workflows are also cheaper than ReACt by a factor of more than 2 on average\.

The "agentic tax" is real\.Dynamic workflows have an obvious computational cost compared to a fixed workflow\. In these experiments they were also often fragile: e\.g\., tool/ptool calls often fail because of syntax or semantic errors\. Hand\-tuning can reduce these errors, but tuning dynamic agents can be expensive and slow\.999Some poorly planned experiments with ReAct on RuleArena tasks with variant ptools cost hundreds of dollars\.

Single\-shot LLMs are strong when Python tools are not needed\.On the right in Figure[1](https://arxiv.org/html/2606.00189#S4.F1)\(and in more detail in Appendix[B](https://arxiv.org/html/2606.00189#A2)\) we also compare to easy\-to\-tune manually constructed traditional zero\-shot prompts\. Briefly, the average correctness of these prompts is 0\.56, the same as ReAct,101010The 0\.56 is noticibly higher than the default zero\-shot prompt used in our ablations, which averaged 0\.47\. Much of the improvement is from instructions which suggest workflow like\-procedures to follow, which led to large gains on some tasks; if these are disallowed then the average score of traditional zero\-shot prompts drops to 0\.50\.but there are several cases where they outperform agentic models\. However, there are benchmarks where "real" tools are absolutely necessary for performance \(e\.g\., MedAgentBench andτ\\tauBench\)\.

Engineered agentic systems follow many design patterns\.The ability to easily design and invoke single\-shot LLM tools is one of the goals and advantages of the ptool architecture, and it was used in many ways\. For example, inτ\\tau\-bench, a LLM\-using ptool was be used to route between variant workflows, but calling it and conditioning on the output\. A frequently\-used pattern to gain efficiency \(not discussed to our knowledge in the literature\) was to replace selected pseudo\-tools for Pythonic "real" tools, which led to very low costs on some of the tasks\.

Table 2:Correctness for selected benchmarks using learned components\. Lines 1–2 are the static workflow and ReAct results from Table[1](https://arxiv.org/html/2606.00189#S4.T1), shown as baselines\. Orch\-WfSeed starts from the hand\-engineered workflow; Orch\-ToolSeed starts from a composer\-synthesized workflow over the toolkit \(engineered or induced\)\. All Orchestrate numbers are eval\-split accuracy at the best\-eval iteration over a 5\-iteration run; see Appendix D\.5 for full sweep\.MuSRNaturalPlanRuleArenaMedCalcWorkflow/PtoolsMurderObjectTeamMeetingTripNBAFormulasRulesAvg\.human/human0\.680\.680\.580\.580\.610\.610\.370\.370\.210\.210\.610\.610\.810\.810\.500\.500\.550\.55ReAct/human0\.480\.480\.340\.340\.470\.470\.290\.290\.160\.160\.500\.500\.820\.820\.470\.470\.440\.44ReAct/learned0\.750\.750\.690\.690\.680\.680\.350\.350\.010\.010\.720\.720\.590\.590\.500\.500\.540\.54CodeDist/human0\.640\.640\.610\.610\.670\.670\.270\.270\.970\.971\.00\\mathbf\{1\.00\}0\.700\.700\.500\.500\.670\.67CodeDist/learned0\.690\.690\.480\.480\.670\.671\.00\\mathbf\{1\.00\}0\.780\.780\.720\.720\.540\.540\.350\.350\.650\.65Orch\-WfSeed/human0\.680\.680\.640\.640\.700\.701\.00\\mathbf\{1\.00\}0\.110\.110\.520\.520\.780\.780\.510\.510\.620\.62Orch\-ToolSeed/human0\.76\\mathbf\{0\.76\}0\.690\.690\.740\.740\.750\.751\.00\\mathbf\{1\.00\}0\.260\.260\.80\\mathbf\{0\.80\}0\.52\\mathbf\{0\.52\}0\.690\.69Orch\-ToolSeed/learned0\.540\.540\.71\\mathbf\{0\.71\}0\.80\\mathbf\{0\.80\}1\.00\\mathbf\{1\.00\}1\.00\\mathbf\{1\.00\}0\.740\.740\.80\\mathbf\{0\.80\}0\.510\.510\.76\\mathbf\{0\.76\}
### 4\.2Can pseudo\-tools be learned?

The engineering done for the results of Table[1](https://arxiv.org/html/2606.00189#S4.T1)requires designing two artifacts: a*toolkit*, i\.e\., a set of tools and pseudo\-tools, and a*workflow*\. We consider here the problem of learning a toolkit\.

Many tasks clearly require external tools \(e\.g\., for database access, calculation, etc\), and use of these tools helps decompose the task into steps\. Here we focus on designing*pseudo\-tools*that decompose a*reasoning process*into reusable, modular pieces\. We conjecture that techniques developed for this difficult induction process will be adaptable to tasks where external tools also exist, and selected several subtasks that seemed appropriate test cases\.111111We omitted subtasks which seemed to be solvable well without decomposition, some costly subtasks, and subtasks requiring external tools\.

For each of these tasks, we used the method of §[3\.2\.1](https://arxiv.org/html/2606.00189#S3.SS2.SSS1)to induce a set ofK=5K=5ptools, and used that toolkit with a ReAct loop\. The results are shown in row 3 of Table[2](https://arxiv.org/html/2606.00189#S4.T2), along with two baselines: row 1, the original human\-engineered static workflow and human\-engineered toolkit; and row 2, the same ReAct loop with the human\-engineered toolkit\. For these experiments, DeepSeek\-V3DeepSeek\-AI \([2024](https://arxiv.org/html/2606.00189#bib.bib33)\)was used both for learning the ptools and as the inference LLM for ReAct\.

Induced Pseudo\-tools Perform Well\.On average, the learned ptools outperform the engineered workflow and toolkit\. The improvement is consistent, with induced tools scoring better on 4/6 tasks\. More importantly, when the same workflow process \(ReAct\) is used, the induced ptools perform better on 5/6 tasks\. We also evaluated induced ptools in conjunction with the workflow learning methods described in §[3\.4](https://arxiv.org/html/2606.00189#S3.SS4)\(see below\)\. Overall, the induced ptools also perform better overall than engineered ones in these conditions\. For more detail see Appendix[B\.2](https://arxiv.org/html/2606.00189#A2.SS2)\.

Table 3:Cost \(USD per 100 examples\) for selected benchmarks using learned components\. The first two lines are duplicated from the static workflow and ReAct results of Table[1](https://arxiv.org/html/2606.00189#S4.T1), and are shown here as baselines\. Zeros indicate workflows that are pure Python with no LLM calls, which are considered zero\-cost in our experiments\.MuSRNaturalPlanRuleArenaMedCalcWorkflow/PtoolsMurderObjectTeamMeetingTripNBAFormulasRulesAvg\.human/human0\.470\.470\.340\.340\.28\\mathbf\{0\.28\}0\.570\.570\.370\.371\.491\.490\.150\.150\.27\\mathbf\{0\.27\}0\.490\.49ReAct/human0\.710\.710\.530\.530\.360\.360\.940\.941\.071\.073\.093\.091\.201\.201\.291\.291\.151\.15ReAct/learned5\.245\.242\.362\.362\.842\.846\.566\.562\.352\.3568\.4068\.400\.730\.730\.870\.8711\.1711\.17CodeDist/human0\.730\.730\.880\.880\.470\.470\.680\.680\.000\.000\.000\.000\.250\.250\.380\.380\.570\.57CodeDist/learned0\.830\.830\.20\\mathbf\{0\.20\}0\.420\.420\.000\.000\.03\\mathbf\{0\.03\}2\.882\.880\.200\.200\.13\\mathbf\{0\.13\}0\.670\.67Orch\-WfSeed/human0\.460\.460\.780\.780\.320\.320\.000\.000\.390\.391\.501\.500\.150\.150\.280\.280\.550\.55Orch\-ToolSeed/human0\.490\.490\.240\.240\.390\.390\.970\.970\.000\.000\.05\\mathbf\{0\.05\}0\.14\\mathbf\{0\.14\}0\.290\.290\.37\\mathbf\{0\.37\}Orch\-ToolSeed/learned0\.21\\mathbf\{0\.21\}0\.700\.701\.031\.030\.000\.000\.000\.001\.441\.440\.550\.550\.470\.470\.730\.73
### 4\.3Can static workflows be learned?

We applied the two workflow\-learning methods described in §[3\.4](https://arxiv.org/html/2606.00189#S3.SS4)\. The code distillation learner was applied to the root interface, producing a Python workflow over the existing toolkit\. The orchestration learner was run in both of its modes: workflow\-seeded, starting from the hand\-engineered workflow over the engineered toolkit; and tool\-seeded, in which only a toolkit is provided\. Workflow learning requires a strong coding model, so all orchestration\-learner runs used Gemini 3\.1 ProGoogle DeepMind \([2026b](https://arxiv.org/html/2606.00189#bib.bib35)\)as supervisor\.

Learned workflows outperform engineered ones\.The results are shown in the final four lines of Table[2](https://arxiv.org/html/2606.00189#S4.T2)\. Lines 4 and 6 are using engineered toolkits with learned static workflows\. Learned workflows fairly consistently outperform the engineered ones \(there are two exceptions for the code distillation learner, and one for the orchestration learner\)\. As noted above, using induced ptools gives another improvement on average\.

The "agentic tax" is real\.Table[3](https://arxiv.org/html/2606.00189#S4.T3)summarizes the cost of the various learned models, and emphasizes one of the practical advantages of using static workflows: their relative efficiency over dynamic, ReAct\-style workflows\. The ptool induction method can produce tools that are effective but expensive to use in inference, but switching to a static workflow can dramatically reduce this overhead\.121212In the RuleArena/NBA task, the reduction in cost is nearly a factor of 50, and in MUSR/Murder it is a factor of 25\.

Learning by coding is powerful but susceptible to "reward hacking"\.Three of the tasks we selected turned out to be possible to implement without LLM calls at all–and both of our workflow learners discovered this, generating Python "workflows" sometimes hundreds of steps long that made no use of the toolkit at all, and instead combined regex\-based extraction from inputs with search methods to find solutions\. Surprisingly, most of these generated solutions generalized well\.131313Many LLM benchmark problems are problems that are hard for LLM reasoners, but not intrinsically hard to solve computationally, and unless the inputs are linguistically complex coded shortcuts are possible\.

We discuss this issue more in §[5](https://arxiv.org/html/2606.00189#S5)but note here that the experimental results for our learning methods are qualitatively the same on the MUSR problems, where no "reward hacking" was observed\.

### 4\.4Can modularity be exploited to improve performance in other ways?

![Refer to caption](https://arxiv.org/html/2606.00189v1/x1.png)

Figure 2:Left, test split cost vs accuracy of*ptool\-distill*under the Claude Opus 4\.6Anthropic \([2026](https://arxiv.org/html/2606.00189#bib.bib36)\)learner; right, under Gemini 3\.1 Pro Preview\. The lighter circle is the original hand\-written workflow, the darker square is the same workflow after all ptools \(excluding the root interfaces\)have been processed by code distillation Arrows pointing up\-and\-left are lower cost and higher accuracy\. Benchmarks means do ptools distillations were accepted\.![Refer to caption](https://arxiv.org/html/2606.00189v1/results/rulearena_nba_optimize.png)

![Refer to caption](https://arxiv.org/html/2606.00189v1/results/musr_murder_optimize.png)

Figure 3:Pareto\-optimal configurations found for two benchmarks by the NSGA\-II optimized\. Left: For NBA, green points are engineered workflows with a single LLM; config 0001 switched between DeepSeek\-V3 and Gemini 2\.5 Flash\-LiteComaniciet al\.\([2025](https://arxiv.org/html/2606.00189#bib.bib34)\), using induced ptools and ReAct; configs 011, 005 are the engineered baseline with ptools routed to a gpt\-oss\-20bOpenAI \([2025](https://arxiv.org/html/2606.00189#bib.bib37)\)model; config 009 is the same but only using gpt\-oss\-20b\. Right: Pareto curve for a second representative benchmark, MUSR Murder\. Two of the Pareto\-optimal configurations are from the orchestration learner of §[3\.4](https://arxiv.org/html/2606.00189#S3.SS4), one bound to Gemini 2\.5 Flash\-Lite and one to gpt\-oss\-120b, with similar output quality for≈\\approx1/7 the cost\.Many pseudo\-tools can be replaced with Python tools\. We applied the code distillation learner on handed\-coded fixed workflows for 13 benchmarks\. Under Opus,1212ptools across77benchmarks \(more than half of those tested\) have ptools that are substantively modified \(after all gating checks\)–i\.e\., 7 ptools are algorithmic enough to express in code\. Figure[3](https://arxiv.org/html/2606.00189#S4.F3)plots results\.

Modularity makes bi\-objective optimization possible\.As described in §[3\.5](https://arxiv.org/html/2606.00189#S3.SS5), we used the NSGA\-II evolutionary search method to search the space of possible configurations produced by the various learners described above\. Two representative results are shown in Figure[3](https://arxiv.org/html/2606.00189#S4.F3), where we show some “frontier configurations” that optimally tradeoff cost and output quality\.

## 5Limitations and broader impacts

The goal of making agentic systems more efficient and interpretable is of primary importance in AI, and we believe that this work makes some steps toward this: because static workflows are much easier to understand, and predict the behavior of, than dynamic ones, extending their capabilities is desirable\. As noted above, however, some of our methods can subvert the goal of interpretability, e\.g\., by generating Python "workflows" that bypass the toolkit that a designer intends to be used\. Auditing learned workflows and learned toolkits for undesirable behavior is left as a topic for further research\.

## 6Conclusion

Principled design of practical agentic systems requires modularity\. We implement a framework based on "pseudo\-tools"—LLM calls with a restricted context—and show that hand\-engineered pseudo\-toolkits and fixed workflows outperform dynamic ReAct\-style systems\. We further show that both toolkits and workflows can be learned, that components can be automatically converted into hard\-coded tools, and that learner and backend LLM choices can be configured via discrete search to select cheap, effective systems\.

### Acknowledgements

The authors thank Google and Theta Labs \([https://www\.thetalabs\.org/](https://www.thetalabs.org/)\) for their generous support with compute resources\.

## References

- Claude Opus 4\.6 system card\.Note:[https://www\.anthropic\.com/system\-cards](https://www.anthropic.com/system-cards)Accessed May 2026Cited by:[Figure 3](https://arxiv.org/html/2606.00189#S4.F3.3.1),[Figure 3](https://arxiv.org/html/2606.00189#S4.F3.4.1)\.
- BerriAI \(2023\)LiteLLM: python SDK and proxy server to call 100\+ LLM APIs in OpenAI format\.Note:[https://github\.com/BerriAI/litellm](https://github.com/BerriAI/litellm)Cited by:[footnote 2](https://arxiv.org/html/2606.00189#footnote2)\.
- W\. Chen, X\. Ma, X\. Wang, and W\. W\. Cohen \(2022\)Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks\.arXiv preprint arXiv:2211\.12588\.Cited by:[§B\.1](https://arxiv.org/html/2606.00189#A2.SS1.p3.1)\.
- Z\. Chen, W\. Chen, C\. Smiley, S\. Shah, I\. Borova, D\. Langdon, R\. Moussa, M\. Beane, T\. Huang, B\. Routledge, and W\. Y\. Wang \(2021\)FinQA: a dataset of numerical reasoning over financial data\.Conference on Empirical Methods in Natural Language Processing \(EMNLP\)\.Cited by:[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.9.8)\.
- P\. Choure and S\. Prajapat \(2025\)Agentic ai for emergency response and comparative analysis of smolagents, langgraph, autogen, agno agi and crewai for crisis solution\.Authorea Preprints\.Cited by:[§2](https://arxiv.org/html/2606.00189#S2.p1.1)\.
- G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat,et al\.\(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.External Links:2507\.06261,[Link](https://arxiv.org/abs/2507.06261)Cited by:[Figure 3](https://arxiv.org/html/2606.00189#S4.F3.1.1),[Figure 3](https://arxiv.org/html/2606.00189#S4.F3.2.1)\.
- K\. Deb, A\. Pratap, S\. Agarwal, and T\. Meyarivan \(2002\)A fast and elitist multiobjective genetic algorithm: NSGA\-II\.IEEE Transactions on Evolutionary Computation6\(2\),pp\. 182–197\.Cited by:[§1](https://arxiv.org/html/2606.00189#S1.p3.1),[footnote 6](https://arxiv.org/html/2606.00189#footnote6)\.
- DeepSeek\-AI \(2024\)DeepSeek\-V3 technical report\.External Links:2412\.19437,[Link](https://arxiv.org/abs/2412.19437)Cited by:[§4\.2](https://arxiv.org/html/2606.00189#S4.SS2.p3.1)\.
- J\. Fang, Y\. Peng, X\. Zhang, Y\. Wang, X\. Yi, G\. Zhang, Y\. Xu, B\. Wu, S\. Liu, Z\. Li, Z\. Ren, N\. Aletras, X\. Wang, H\. Zhou, and Z\. Meng \(2025\)A comprehensive survey of self\-evolving ai agents: a new paradigm bridging foundation models and lifelong agentic systems\.External Links:2508\.07407,[Link](https://arxiv.org/abs/2508.07407)Cited by:[§1](https://arxiv.org/html/2606.00189#S1.p1.1)\.
- F\. Fortin, F\. D\. Rainville, M\. Gardner, M\. Parizeau, and C\. Gagné \(2012\)DEAP: evolutionary algorithms made easy\.Journal of Machine Learning Research13\(70\),pp\. 2171–2175\.External Links:[Link](http://jmlr.org/papers/v13/fortin12a.html)Cited by:[§3\.5](https://arxiv.org/html/2606.00189#S3.SS5.p2.1)\.
- Google DeepMind \(2026a\)Gemini 3\.1 Flash\-Lite model card\.Note:[https://deepmind\.google/models/model\-cards/gemini\-3\-1\-flash\-lite/](https://deepmind.google/models/model-cards/gemini-3-1-flash-lite/)Accessed May 2026Cited by:[footnote 14](https://arxiv.org/html/2606.00189#footnote14)\.
- Google DeepMind \(2026b\)Gemini 3\.1 Pro model card\.Note:[https://deepmind\.google/models/model\-cards/gemini\-3\-1\-pro/](https://deepmind.google/models/model-cards/gemini-3-1-pro/)Accessed May 2026Cited by:[§4\.3](https://arxiv.org/html/2606.00189#S4.SS3.p1.1)\.
- S\. Hu, C\. Lu, and J\. Clune \(2024\)Automated design of agentic systems\.arXiv preprint arXiv:2408\.08435\.Cited by:[§1](https://arxiv.org/html/2606.00189#S1.p1.1),[§2](https://arxiv.org/html/2606.00189#S2.p2.1)\.
- Y\. Jiang, K\. C\. Black, G\. Geng, D\. Park, J\. Zou, A\. Y\. Ng, and J\. H\. Chen \(2025\)MedAgentBench: a realistic virtual EHR environment to benchmark medical LLM agents\.NEJM AI\.Note:arXiv:2501\.14654Cited by:[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.10.8)\.
- N\. Khandekar, Q\. Jin, G\. Xiong, S\. Dunn, S\. S\. Applebaum, Z\. Anwar, M\. Sarfo\-Gyamfi, C\. W\. Safranek, A\. A\. Anwar, A\. Zhang, A\. Gilson, M\. B\. Singer, A\. Dave, A\. Taylor, A\. Zhang, Q\. Chen, and Z\. Lu \(2024\)MedCalc\-Bench: evaluating large language models for medical calculations\.Advances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track\.Cited by:[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.11.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.12.8)\.
- O\. Khattab, A\. Singhvi, P\. Maheshwari, Z\. Zhang, K\. Santhanam, S\. Vardhamanan, S\. Haq, A\. Sharma, T\. T\. Joshi, H\. Mober, P\. K\. Shah, N\. Edalati, C\. Lee, R\. Shin, C\. Potts, and M\. Zaharia \(2024\)DSPy: compiling declarative language model calls into self\-improving pipelines\.International Conference on Learning Representations \(ICLR\)\.Cited by:[§1](https://arxiv.org/html/2606.00189#S1.p1.1),[§2](https://arxiv.org/html/2606.00189#S2.p1.1)\.
- P\. Lu, L\. Qiu, K\. Chang, Y\. N\. Wu, S\. Zhu, T\. Rajpurohit, P\. Clark, and A\. Kalyan \(2023\)Dynamic prompt learning via policy gradient for semi\-structured mathematical reasoning\.International Conference on Learning Representations \(ICLR\)\.Cited by:[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.22.8)\.
- OpenAI \(2025\)gpt\-oss\-120b & gpt\-oss\-20b model card\.External Links:2508\.10925,[Link](https://arxiv.org/abs/2508.10925)Cited by:[Figure 3](https://arxiv.org/html/2606.00189#S4.F3.1.1),[Figure 3](https://arxiv.org/html/2606.00189#S4.F3.2.1)\.
- M\. Z\. Pan, N\. Arabzadeh, R\. Cogo, Y\. Zhu, A\. Xiong, L\. A\. Agrawal, H\. Mao, E\. Shen, S\. Pallerla, L\. Patel, S\. Liu, T\. Shi, X\. Liu, J\. Q\. Davis, E\. Lacavalla, A\. Basile, S\. Yang, P\. Castro, D\. Kang, J\. E\. Gonzalez, K\. Sen, D\. Song, I\. Stoica, M\. Zaharia, and M\. Ellis \(2026\)Measuring agents in production\.External Links:2512\.04123,[Link](https://arxiv.org/abs/2512.04123)Cited by:[§1](https://arxiv.org/html/2606.00189#S1.p1.1),[§3\.4](https://arxiv.org/html/2606.00189#S3.SS4.p1.1)\.
- Pydantic Services Inc\. \(2024\)PydanticAI: agent framework, the Pydantic way\.Note:[https://github\.com/pydantic/pydantic\-ai](https://github.com/pydantic/pydantic-ai)Cited by:[§2](https://arxiv.org/html/2606.00189#S2.p1.1)\.
- A\. Roucher, A\. Villanova del Moral, T\. Wolf, L\. von Werra, and E\. Kaunismäki \(2025\)smolagents: a smol library to build great agentic systems\.Note:[https://github\.com/huggingface/smolagents](https://github.com/huggingface/smolagents)Cited by:[§2](https://arxiv.org/html/2606.00189#S2.p1.1)\.
- Z\. Sprague, X\. Ye, K\. Bostrom, S\. Chaudhuri, and G\. Durrett \(2024\)MuSR: testing the limits of chain\-of\-thought with multistep soft reasoning\.International Conference on Learning Representations \(ICLR\)\.Cited by:[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.13.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.14.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.15.8)\.
- M\. Suzgun, N\. Scales, N\. Schärli, S\. Gehrmann, Y\. Tay, H\. W\. Chung, A\. Chowdhery, Q\. Le, E\. Chi, D\. Zhou, and J\. Wei \(2023\)Challenging BIG\-bench tasks and whether chain\-of\-thought can solve them\.Findings of the Association for Computational Linguistics \(ACL Findings\)\.Cited by:[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.2.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.3.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.4.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.5.8)\.
- Z\. Z\. Wang, A\. Gandhi, G\. Neubig, and D\. Fried \(2025a\)Inducing programmatic skills for agentic tasks\.Conference on Language Modeling \(COLM\)\.Cited by:[§2](https://arxiv.org/html/2606.00189#S2.p3.1)\.
- Z\. Z\. Wang, J\. Mao, D\. Fried, and G\. Neubig \(2025b\)Agent workflow memory\.International Conference on Machine Learning \(ICML\)\.Cited by:[§2](https://arxiv.org/html/2606.00189#S2.p3.1)\.
- J\. Xiao, M\. Wang, M\. H\. Lam, Y\. Wan, J\. Liu, Y\. Huo, and M\. R\. Lyu \(2025\)Designbench: a comprehensive benchmark for mllm\-based front\-end code generation\.arXiv preprint arXiv:2506\.06251\.Cited by:[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.6.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.7.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.8.8)\.
- S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan \(2024\)τ\\tau\-bench: a benchmark for tool\-agent\-user interaction in real\-world domains\.arXiv preprint arXiv:2406\.12045\.Cited by:[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.23.8)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.International Conference on Learning Representations \(ICLR\)\.Cited by:[§1](https://arxiv.org/html/2606.00189#S1.p1.1),[§1](https://arxiv.org/html/2606.00189#S1.p4.1)\.
- H\. Yeet al\.\(2024\)GEPA: reflective prompt evolution can outperform reinforcement learning\.arXiv preprint\.Cited by:[§1](https://arxiv.org/html/2606.00189#S1.p1.1)\.
- M\. Yuksekgonul, F\. Bianchi, J\. Boen, S\. Liu, Z\. Huang, C\. Guestrin, and J\. Zou \(2024\)TextGrad: automatic “differentiation” via text\.International Conference on Machine Learning \(ICML\)\.Cited by:[§1](https://arxiv.org/html/2606.00189#S1.p1.1)\.
- J\. Zhang, S\. Hu, C\. Lu, R\. Lange, and J\. Clune \(2026\)Darwin godel machine: open\-ended evolution of self\-improving agents\.External Links:2505\.22954,[Link](https://arxiv.org/abs/2505.22954)Cited by:[§1](https://arxiv.org/html/2606.00189#S1.p1.1)\.
- J\. Zhang, D\. Xiang, A\. Yu,et al\.\(2024\)AFlow: automating agentic workflow generation\.arXiv preprint arXiv:2410\.10762\.Cited by:[§1](https://arxiv.org/html/2606.00189#S1.p1.1),[§2](https://arxiv.org/html/2606.00189#S2.p2.1)\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\)ExpeL: LLM agents are experiential learners\.AAAI Conference on Artificial Intelligence \(AAAI\)\.Cited by:[§2](https://arxiv.org/html/2606.00189#S2.p3.1)\.
- H\. S\. Zheng, S\. Mishra, H\. Zhang, X\. Chen, M\. Chen, A\. Nova, L\. Hou, H\. Cheng, Q\. V\. Le, E\. H\. Chi, and D\. Zhou \(2024\)NATURAL PLAN: benchmarking LLMs on natural language planning\.arXiv preprint arXiv:2406\.04520\.Cited by:[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.16.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.17.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.18.8)\.
- R\. Zhou, W\. Hua, L\. Pan, S\. Cheng, X\. Wu, E\. Yu, and W\. Y\. Wang \(2025\)RuleArena: a benchmark for rule\-guided reasoning with LLMs in real\-world scenarios\.Annual Meeting of the Association for Computational Linguistics \(ACL\)\.Cited by:[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.19.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.20.8),[Table 4](https://arxiv.org/html/2606.00189#A1.T4.3.21.8)\.
- M\. Zhuge, W\. Wang, L\. Kirsch, F\. Faccio, D\. Khizbullin, and J\. Schmidhuber \(2024a\)GPTSwarm: language agents as optimizable graphs\.International Conference on Machine Learning \(ICML\)\.Cited by:[§1](https://arxiv.org/html/2606.00189#S1.p1.1)\.
- M\. Zhuge, W\. Wang, L\. Kirsch, F\. Faccio, D\. Khizbullin, and J\. Schmidhuber \(2024b\)Gptswarm: language agents as optimizable graphs\.InForty\-first International Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2606.00189#S2.p2.1)\.

## Appendix ADescription of benchmarks

### A\.1Benchmarks used

Table 4:Benchmarks used in this work\. “Tools” counts Python or deterministic interfaces in the hand\-coded workflow; “Ptools” counts LLM\-backed pseudo\-tool interfaces\.BenchmarkDescriptionToolsPtoolsTrainValidTestReferenceBBH / date understandingDetermine a calendar date from prose temporal constraints067575100Suzgunet al\.\[[2023](https://arxiv.org/html/2606.00189#bib.bib19)\]BBH / geometric shapesIdentify a geometric figure from an ASCII art description077575100Suzgunet al\.\[[2023](https://arxiv.org/html/2606.00189#bib.bib19)\]BBH / penguins in a tableAnswer attribute lookup questions over a formatted table04434360Suzgunet al\.\[[2023](https://arxiv.org/html/2606.00189#bib.bib19)\]BBH / sports understandingDetermine whether a sports\-action sentence is plausible03757575Suzgunet al\.\[[2023](https://arxiv.org/html/2606.00189#bib.bib19)\]DesignBench / vanillaGenerate HTML/CSS code matching a reference screenshot01––120Xiaoet al\.\[[2025](https://arxiv.org/html/2606.00189#bib.bib7)\]DesignBench / vueGenerate a Vue\.js component matching a reference screenshot01––118Xiaoet al\.\[[2025](https://arxiv.org/html/2606.00189#bib.bib7)\]DesignBench / angularGenerate an Angular component matching a reference screenshot01––83Xiaoet al\.\[[2025](https://arxiv.org/html/2606.00189#bib.bib7)\]FinQANumerical reasoning over financial tables and free text42100100300Chenet al\.\[[2021](https://arxiv.org/html/2606.00189#bib.bib21)\]MedAgentBenchMedical EHR tasks via FHIR API \(lab orders, vitals, prescriptions\)210––300Jianget al\.\[[2025](https://arxiv.org/html/2606.00189#bib.bib26)\]MedCalc / equationCompute a medical score given an explicit formula146363660Khandekaret al\.\[[2024](https://arxiv.org/html/2606.00189#bib.bib25)\]MedCalc / ruleCompute a medical score given a prose eligibility rule153434380Khandekaret al\.\[[2024](https://arxiv.org/html/2606.00189#bib.bib25)\]MuSR / murder mysteryIdentify a murderer from alibi and motive evidence047575100Spragueet al\.\[[2024](https://arxiv.org/html/2606.00189#bib.bib23)\]MuSR / object placementsTrack object locations across a sequence of moves047575106Spragueet al\.\[[2024](https://arxiv.org/html/2606.00189#bib.bib23)\]MuSR / team allocationAssign agents to teams under compatibility constraints037575100Spragueet al\.\[[2024](https://arxiv.org/html/2606.00189#bib.bib23)\]Natural Plan / calendarSchedule appointments without time conflicts03100100100Zhenget al\.\[[2024](https://arxiv.org/html/2606.00189#bib.bib24)\]Natural Plan / meetingFind a meeting slot satisfying all participants’ availability03100100100Zhenget al\.\[[2024](https://arxiv.org/html/2606.00189#bib.bib24)\]Natural Plan / tripPlan a multi\-city trip within time and budget constraints03100100100Zhenget al\.\[[2024](https://arxiv.org/html/2606.00189#bib.bib24)\]RuleArena / airlineCompute baggage fees and ticket total from American Airlines fee rules115050100Zhouet al\.\[[2025](https://arxiv.org/html/2606.00189#bib.bib22)\]RuleArena / taxCompute US federal income tax owed from filled IRS form inputs115050100Zhouet al\.\[[2025](https://arxiv.org/html/2606.00189#bib.bib22)\]RuleArena / nbaDetermine whether proposed team operations violate NBA CBA rules01504246Zhouet al\.\[[2025](https://arxiv.org/html/2606.00189#bib.bib22)\]TabMWPArithmetic word problems requiring lookup in a structured table02100100100Luet al\.\[[2023](https://arxiv.org/html/2606.00189#bib.bib20)\]Tau Bench / retailMulti\-turn retail customer service with order and account operations167–6054Yaoet al\.\[[2024](https://arxiv.org/html/2606.00189#bib.bib27)\]

## Appendix BAdditional experimental results

### B\.1Other evaluated strategies

In addition to the results of Table[1](https://arxiv.org/html/2606.00189#S4.T1), we compared the static workflow engineered with two other agent strategies\.

One was a traditional zero\-shot prompt\. This allows more flexibility in prompt tuning, which was often successful\.

Recall that one supported implementation factory generates a \(possibly unique\) Python function for each call to an interface, and executes that function in a sandbox\. When an empty toolkit is used, this implements program of thoughtsChenet al\.\[[2022](https://arxiv.org/html/2606.00189#bib.bib4)\]\(PoT\)\. Replacing the empty toolkit with the same kit of tools and pseudo\-tools used by the static workflow and ReAct does an intermediate amount of planning: a new workflow is constructed for each problem instance, but the workflow is constructed all at once, rather than incrementally as tool outputs are seen, as in ReAct\. Tables[5](https://arxiv.org/html/2606.00189#A2.T5)and[6](https://arxiv.org/html/2606.00189#A2.T6)show this dynamic workflow variant is intermediate in average performance between ReAct and the zero\-shot default prompt: however, there are cases where it behaves quite differently to ReAct\.

Table 5:Result quality \(generally accuracy\) across multiple benchmarks for different root interface bindings, using DeepSeek V3\-1\. The static workflow is hand\-engineered on a dev set\. The dynamic workflows use the same pseudo\-tools and tools: ReAct replans after each tool or ptool output is observed, and PoT constructs a complete tool\-calling plan for each task instance before running and tools\. The default implementation zero\-shot baseline uses one non\-agentic LLM call, without demonstrations, and the custom prompt sometimes uses a second LLM call to extract a final answer from the first call\.TaskStaticWorkflowDynamicWorkflow\(ReAct\)DynamicWorkflow\(PoT\)Zero\-shot\(DefaultImp\.\)Zero\-shot\(TraditionalPrompt\)Zero\-shot\(TraditionalPromptw/ Thinking\)BBH Date Understanding0\.840\.840\.93\\mathbf\{0\.93\}0\.820\.820\.520\.520\.880\.880\.880\.88BBH Geometric Shapes0\.300\.300\.350\.350\.430\.430\.530\.530\.63\\mathbf\{0\.63\}0\.63\\mathbf\{0\.63\}BBH Penguins in a Table0\.630\.630\.720\.720\.250\.250\.930\.930\.98\\mathbf\{0\.98\}0\.98\\mathbf\{0\.98\}BBH Sports Understanding0\.870\.870\.88\\mathbf\{0\.88\}0\.710\.710\.650\.650\.840\.840\.840\.84FinQA0\.75\\mathbf\{0\.75\}0\.320\.320\.180\.180\.620\.620\.570\.570\.570\.57MedAgentBench0\.87\\mathbf\{0\.87\}0\.690\.690\.490\.490\.000\.000\.000\.000\.000\.00MedCalc Formulas0\.810\.810\.820\.820\.750\.750\.600\.600\.330\.330\.86\\mathbf\{0\.86\}∗MedCalc Rules0\.500\.500\.470\.470\.430\.430\.450\.450\.260\.260\.62\\mathbf\{0\.62\}∗MUSR Murder0\.68\\mathbf\{0\.68\}0\.620\.620\.650\.650\.520\.520\.370\.370\.460\.46∗MUSR Objects0\.580\.580\.360\.360\.67\\mathbf\{0\.67\}0\.350\.350\.310\.310\.370\.37∗MUSR Teams0\.61\\mathbf\{0\.61\}0\.490\.490\.520\.520\.590\.590\.600\.600\.590\.59∗NaturalPlan Calendar0\.620\.620\.460\.460\.560\.560\.540\.540\.70\\mathbf\{0\.70\}0\.70\\mathbf\{0\.70\}NaturalPlan Meeting0\.37\\mathbf\{0\.37\}0\.230\.230\.280\.280\.230\.230\.320\.320\.320\.32NaturalPlan Trip0\.21\\mathbf\{0\.21\}0\.160\.160\.21\\mathbf\{0\.21\}0\.170\.170\.010\.010\.010\.01Rulearena Airlines0\.98\\mathbf\{0\.98\}0\.890\.890\.840\.840\.410\.410\.430\.430\.460\.46∗Rulearena Tax0\.500\.500\.150\.150\.430\.430\.110\.110\.470\.470\.55\\mathbf\{0\.55\}∗Rulearena NBA0\.610\.610\.590\.590\.020\.020\.670\.670\.78\\mathbf\{0\.78\}0\.720\.72∗Tabular Math WP0\.900\.900\.97\\mathbf\{0\.97\}0\.940\.940\.870\.870\.960\.960\.960\.96τ\\tauBench Retail0\.520\.520\.56\\mathbf\{0\.56\}0\.150\.150\.110\.110\.110\.110\.110\.11Average0\.64\\mathbf\{0\.64\}0\.560\.560\.490\.490\.470\.470\.500\.500\.560\.56Table 6:Cost \(per 100 examples\) for different root interface bindings, using DeepSeek V3\-1\.TaskStaticWorkflowDynamicWorkflow\(ReAct\)DynamicWorkflow\(PoT\)Zero\-shot\(DefaultImp\.\)Zero\-shot\(TraditionalPrompt\)Zero\-shot\(TraditionalPromptw/ Thinking\)BBH Date Understanding0\.280\.280\.990\.990\.400\.400\.100\.100\.08\\mathbf\{0\.08\}0\.08\\mathbf\{0\.08\}BBH Geometric Shapes0\.630\.632\.402\.400\.640\.640\.07\\mathbf\{0\.07\}0\.07\\mathbf\{0\.07\}0\.07\\mathbf\{0\.07\}BBH Penguins in a Table0\.210\.210\.540\.540\.310\.310\.05\\mathbf\{0\.05\}0\.05\\mathbf\{0\.05\}0\.05\\mathbf\{0\.05\}BBH Sports Understanding0\.130\.130\.290\.290\.210\.210\.02\\mathbf\{0\.02\}0\.030\.030\.030\.03DesignBench Vanilla–0\.290\.29––0\.09\\mathbf\{0\.09\}0\.09\\mathbf\{0\.09\}FinQA0\.120\.120\.950\.950\.350\.350\.140\.140\.10\\mathbf\{0\.10\}0\.10\\mathbf\{0\.10\}MedAgentBench0\.640\.640\.650\.650\.820\.820\.220\.220\.02\\mathbf\{0\.02\}0\.02\\mathbf\{0\.02\}MedCalc Formulas0\.150\.151\.201\.200\.470\.470\.190\.190\.12\\mathbf\{0\.12\}0\.200\.20∗MedCalc Rules0\.270\.271\.291\.290\.540\.540\.210\.210\.13\\mathbf\{0\.13\}0\.240\.24∗MUSR Murder0\.470\.471\.341\.340\.800\.800\.11\\mathbf\{0\.11\}0\.140\.140\.160\.16∗MUSR Objects0\.340\.340\.990\.990\.390\.390\.10\\mathbf\{0\.10\}0\.150\.150\.130\.13∗MUSR Teams0\.280\.281\.381\.380\.350\.350\.14\\mathbf\{0\.14\}0\.280\.280\.210\.21∗NaturalPlan Calendar0\.460\.460\.790\.790\.630\.630\.09\\mathbf\{0\.09\}0\.260\.260\.260\.26NaturalPlan Meeting0\.570\.570\.910\.910\.790\.790\.11\\mathbf\{0\.11\}0\.270\.270\.270\.27NaturalPlan Trip0\.370\.371\.191\.190\.520\.520\.08\\mathbf\{0\.08\}0\.350\.350\.350\.35Rulearena Airlines0\.97\\mathbf\{0\.97\}2\.402\.401\.141\.141\.121\.121\.301\.301\.131\.13∗Rulearena Tax1\.011\.013\.493\.491\.671\.670\.68\\mathbf\{0\.68\}1\.011\.010\.980\.98∗Rulearena NBA1\.491\.492\.952\.951\.851\.851\.42\\mathbf\{1\.42\}1\.591\.591\.621\.62∗Tabular Math WP0\.070\.070\.690\.690\.100\.100\.040\.040\.03\\mathbf\{0\.03\}0\.03\\mathbf\{0\.03\}τ\\tauBench Retail4\.754\.757\.037\.035\.145\.143\.503\.503\.15\\mathbf\{3\.15\}3\.15\\mathbf\{3\.15\}Average0\.700\.701\.591\.590\.900\.900\.44\\mathbf\{0\.44\}0\.460\.460\.460\.46We also considered a baseline that uses a single user\-constructed and tuned prompt\. This baseline is also surprisingly strong for many tasks; the tables above give the full details of the results from Figure[1](https://arxiv.org/html/2606.00189#S4.F1)\.

### B\.2Pseudo\-tool induction experiments

For each subtask we evaluated two action spaces under the same ReAct framework \(react\_pydantic, a clean ReAct preamble over a pydantic\-ai loop\), the same backbone model \(DeepSeek\-V3\), and the same test split \(seed 42\):*engineered*, the hand\-written multi\-step pipeline, and*induced*, aK=5K\{=\}5ptool library induced from the engineered agent’s correct training rollouts using the pipeline of §[3\.2\.1](https://arxiv.org/html/2606.00189#S3.SS2.SSS1)\.141414NaturalPlan accuracy is reported under a Gemini\-2\.5\-flash LLM judge, since the strict format\-matching evaluator zero\-rates almost all induced plans; engineered runs are scored under the same judge for apples\-to\-apples\. The MedCalc induced module was synthesized by Gemini 3\.1 Flash\-LiteGoogle DeepMind \[[2026a](https://arxiv.org/html/2606.00189#bib.bib43)\]\(state\-aware, correct\-only,K=5K\{=\}5cap\), one LLM\-drift point relative to the other tasks where induction used DeepSeek\-V3\.

Table 7:Cross\-benchmark induced\-ptool evaluation\. Both columns use the same ReAct framework \(react\_pydantic\), backbone model \(DeepSeek\-V3\), and test split \(seed 42\); the only difference is the action space \(hand\-engineered multi\-step pipeline vs\. aK=5K\{=\}5induced ptool library, correct\-only filter\)\. Bold marks the higher of each row\. NaturalPlan rows are scored under a Gemini\-2\.5\-flash LLM judge for apples\-to\-apples; others use strict exact\-match\.SubtaskEngineered \(col 3\)Induced \(col 4\)Δ\\DeltaMuSR murder57\.075\.0\+18\.0MuSR object placements30\.268\.9\+38\.7MuSR team allocation44\.068\.0\+24\.0NaturalPlan meeting10\.035\.0\+25\.0NaturalPlan trip6\.01\.0−5\.0\-5\.0RuleArena NBA65\.271\.7\+6\.5MedCalc44\.054\.0\+10\.0The pattern matches the bad\-tools framing\. On MuSR, where the engineered toolkit consists of reasoning\-only steps with no external grounding, induction wins by 18–39 pp across all three subtasks, peaking at\+38\.7\+38\.7pp on object\-placements where the engineered 3\-step pipeline is the most abstract\. On NaturalPlan meeting, where the hand\-designed parse\-order\-build pipeline collapses under thereact\_pydanticframing \(10%\), induction sustains35%35\\%and wins by\+25\+25pp—the engineered pipeline’s value is recoverable from rollouts but not robust to a clean ReAct prompt\. On NBA, where the engineered toolkit is constrained to a single one\-shot extraction tool because the domain has no calculator, induction still wins but by only\+6\.5\+6\.5pp—a single typed extraction is most of what the task needs, so additional structure pays small dividends\. On MedCalc, induction wins by\+10\+10pp; the gain is moderate because the hand\-written calculator\-identification pipeline is genuinely informative on the formula\-driven subset of cases, even underreact\_pydantic\.

NaturalPlan trip is the lone reversal \(−5\-5pp\), but both methods saturate near zero \(11–6%6\\%\): the eval is dominated by harder multi\-day, multi\-constraint instances on which neither action space helps the LLM\. We treat it as a non\-comparison cell rather than evidence against the framing; per\-method differences below10%10\\%on a noisy LLM\-judge metric are within the regime where neither system reliably solves the task\.

The overall picture is that induction’s leverage tracks the*abstractness*of the engineered toolkit: largest gains where hand\-designed tools are reasoning steps without external grounding \(MuSR, \+24 pp average\), substantial gains where the engineered pipeline collapses under a clean ReAct framing \(NatPlan meeting, \+25 pp\), moderate gains where engineered tools have natural domain structure \(MedCalc, NBA\), and indistinguishable in the saturated\-low regime \(NatPlan trip\)\.

### B\.3Workflow learning experiments

We tested the ptool inducer on MedCalc, where the hand\-written ReAct toolkit decomposes the problem into three interfaces \(identify\_calculator,extract\_clinical\_values,compute\_calculation\)\. Inducing from101101recorded thoughts over110110training problems and a held\-out evaluation of1212induction variants \(axes: state injection, only\-correct filtering, ptool cap\), the best variant collapsed the three\-step pipeline into three self\-contained ptools \(calculate\_clinical\_score,compute\_clinical\_value,apply\_clinical\_score\) and improved within\-tolerance accuracy from58\.2%58\.2\\%to66\.4%66\.4\\%while reducing input tokens by60%60\\%\(2,129K to 848K\)\. The induced ptools recover most of the gain on formula\-driven categories \(*physical*\+15\.6\+15\.6pp,*lab test*\+12\.5\+12\.5pp\) but do not improve criterion\-counting categories\. The induction itself cost∼$0\.30\\sim\\mathdollar 0\.30over∼10\\sim 10minutes of wall\-clock; replacing several hand\-written tools with their induced counterparts is therefore a cheap and reversible operation\. Composing these induced ptools through the orchestration learner of §[3\.4](https://arxiv.org/html/2606.00189#S3.SS4)closes the remaining gap to the hand\-coded workflow on this benchmark \(§[4\.3](https://arxiv.org/html/2606.00189#S4.SS3)\)\.

We applied the orchestration learner of §[3\.4](https://arxiv.org/html/2606.00189#S3.SS4)to1515benchmarks, in two modes per benchmark:*existing*\(start from the hand\-engineered workflow\) and*seeded*\(compose a workflow from the benchmark’s \(pseudo\-\)toolkit at iteration zero, then improve\)\. Each run uses six iterations and Gemini 3\.1 Pro as supervisor\. Across the3030runs, eval accuracy improved over the iteration\-zero baseline on1616runs and was preserved on the rest; on99runs the supervisor consistently produced no applicable edit, mostly on RuleArena where the existing workflow is already specialized to a small number of rule paths\. The tool\-seeded mode is the more interesting comparison: on several compositional benchmarks the workflow induced from the toolkit alone matches or exceeds the hand\-engineered workflow—NaturalPlan trip \(100%100\\%vs\.50%50\\%best eval\), NaturalPlan calendar \(100%100\\%vs\.70%70\\%\), MuSR team allocation \(86\.7%86\.7\\%vs\.63\.3%63\.3\\%\), Geometric Shapes \(100%100\\%vs\.86\.4%86\.4\\%\), and MedCalc \(74\.4%74\.4\\%vs\.72\.0%72\.0\\%\)\. On MedCalc, where we ran a deeper six\-iteration study from induced ptools, the supervisor’s accepted edits mostly took the form of richer docstrings encoding formula references and scoring criteria; in particular, an iteration that added a deterministic Python expression evaluator as a ptool produced the run’s best eval \(62\.7%62\.7\\%\)\. These results indicate that, given a serviceable toolkit, supervised hill\-climbing over workflow source can match hand\-engineering on a non\-trivial fraction of tasks, while making the entire workflow—and the rationale for each edit—inspectable as a code diff\. The full sweep table, per\-iteration accuracy curves, and the MedCalc iteration log appear in Appendix[C\.5](https://arxiv.org/html/2606.00189#A3.SS5)\.

## Appendix CThe orchestration learner

This appendix describes the orchestration learner of §[3\.4](https://arxiv.org/html/2606.00189#S3.SS4)in full: its inputs, its per\-iteration loop, the supervisor prompt, the accept/reject rule, the workflow\-composer used to produce a seed in the*seeded*mode, the artifacts written for each run, and the implementation details that affect reproducibility\.

### C\.1Inputs and starting state

A run is parameterized by:

- •A*ptools module*, i\.e\. a Python file declaring a set of@interfacestubs together with their bindings\. The entry interface is named explicitly; all other interfaces in the module are available to the supervisor as building blocks\.
- •A*starting workflow*, in one of two modes\. In workflow\-seeded mode the starting source is the benchmark’s hand\-coded workflow\. In tool\-seeded mode the starting source is a workflow synthesized at iteration zero by the workflow composer of[C\.3](https://arxiv.org/html/2606.00189#A3.SS3)from the \(pseudo\-\)tools available in the module\.
- •A*train*dataset and an*eval*dataset, both stratified by a benchmark\-specific label\. We usedntrain=neval=110n\_\{\\mathrm\{train\}\}\\\!=\\\!n\_\{\\mathrm\{eval\}\}\\\!=\\\!110for MedCalc and either70/3570/35or110/110110/110for the other benchmarks\.
- •A*supervisor model*\(we usedgemini\-3\.1\-pro\-preview\) and an iteration budget \(six in all reported runs\)\. Optional inputs include*custom instructions*\(free\-text guidance about pipeline strengths/weaknesses, or hard rules such as “do not modify the deterministic calculator”\) and a*model\-choice catalog*\(cheap and expensive alternatives that the supervisor may recommend in configuration overrides\)\.

The starting source is copied to a scratch file \(<name\>\_<timestamp\>\_scratch\.py\) and all subsequent edits target that file; the original ptools module is never modified\. Iteration zero is a baseline evaluation of the starting workflow on both train and eval, recorded as aKEPTrecord\.

### C\.2Per\-iteration loop

At iterationt≥1t\\geq 1:

1. 1\.Profile\.Re\-run the most recent train evaluation through a profiler that aggregates per\-interface call counts, per\-interface average cost, and overall accuracy\.
2. 2\.Sample failures\.From the same train run, sample failed cases \(correct=false\\texttt\{correct\}\\\!=\\\!\\textsc\{false\}\) and format each as⟨\\langleinput, predicted answer, expected answer, ordered ptool calls⟩\\rangle\.
3. 3\.Build the supervisor prompt, which contains: - •the full source of the current scratch ptools file, verbatim; - •the profiling summary; - •the sampled failure traces; - •the iteration history—a concise per\-iteration record ofkept/rolled\-back, the supervisor’s stated reasoning, and the accuracy delta, so that the supervisor does not propose the same change twice; - •the custom instructions and the model\-choice catalog if provided\.
4. 4\.Call the supervisor\.A single LLM call returns one of: \(a\) a full rewritten ptools source plus optional configuration overrides \(e\.g\. a per\-interface model assignment\); \(b\) a no\-change signal; or \(c\) a syntactically invalid response\. \(b\) and \(c\) increment a no\-improvement counter; the run halts when this counter reaches five\.
5. 5\.Apply\.Write the proposed source over the scratch file and reload it viaexec\_ptools\_module; re\-bind every interface declared in the new source throughimplement\_via\_config; if configuration overrides were proposed, snapshot the global configuration and apply them\.
6. 6\.Re\-evaluateon train; then on eval if eval is available\.
7. 7\.Decide\.The proposal is*kept*iff it strictly improves the best train accuracy seen so far, with the eval accuracy used to break ties when train is matched: ``` if new_train > best_train: kept = True elif new_train == best_train and have_eval: kept = (new_eval > best_eval) else: kept = False ``` On a rollback the scratch file is restored from the previous iteration’s source, the configuration snapshot is restored, and the module is reloaded again so subsequent iterations see the previous code\. The supervisor’s reasoning is retained in the iteration history regardless of the decision, so it can inform later proposals\.

##### Stop conditions\.

The loop ends when \(i\) a target accuracy is reached \(rare and usually unset\), \(ii\) the no\-improvement counter reaches five, \(iii\) five consecutive supervisor failures or no\-change responses occur, or \(iv\) the iteration budget is exhausted\.

##### Final eval\.

After the in\-process loop terminates, the best\-train iteration’s source becomesptools\_evolved\.pyand a final evaluation is launched in a*fresh subprocess*on the eval set\. The fresh process is necessary because Python’simportlib\.reloaddoes not reset modules loaded viaspec\_from\_file\_location; in\-process iterations would otherwise produce duplicate\-tool\-name errors when a previously executed source defines tools with the same names\. Within\-iteration evals during the loop are unaffected because the previous source is replaced rather than added; the conflict surfaces only when a downstream consumer attempts to register tools whose names overlap with an earlier iteration still resident in module memory\.

### C\.3Workflow composer \(used in tool\-seeded mode\)

The composer turns a \(pseudo\-\)toolkit into an initial pipeline source\.

1. 1\.Build a*catalog*from the ptools module by iterating its declared interfaces, excluding the entry interface, and rendering each as “signature \+ docstring”\.
2. 2\.Issue a single LLM call \(defaultGemini 3\.1 Pro Preview\) with a prompt containing the task description, the catalog, and the desired entry signature; the call returns Python source for the entry function, expected to invoke each catalog interface as a typed call so that recording, caching, and cost tracking continue to work\.
3. 3\.Runruff \-\-fixon the returned source for deterministic cleanup \(unused imports, formatting\)\.
4. 4\.Optionally smoke\-test the composed source on one example case; on exception, retry up to three times by feeding the error back to the model along with the prior source\.
5. 5\.Bind the composed function to the entry interface; the source becomes the iteration\-zero scratch file\.

The composer thus performs*static*workflow induction in a single call; the orchestration learner performs*iterative*repair on top of it\. Because the orchestration learner uses a stronger supervisor model \(Gemini 3\.1 Pro\) than the composer, the design separates a fast cheap structural pass from a slow expensive refinement pass\.

### C\.4Per\-iteration artifacts and run\-level outputs

Every iteration directoryiterations/iter\_NNN/contains:

- •ptools\_before\.py,ptools\_after\.py, and a unified diff;
- •supervisor\_prompt\.txt,supervisor\_response\.txt;
- •profiling\_summary\.txt,failure\_traces\.txt,iteration\_history\.txt;
- •outcome\.txtrecording the decision \(kept/rolled\-back\), the train and eval deltas, and the supervisor’s reasoning;
- •config\_before\.yamlandconfig\_after\.yamlwhen configuration overrides were proposed;
- •result\_dirs\.jsonpointing at the train and eval result directories produced this iteration\.

The run directory itself contains a machine\-readablereport\.json, a self\-contained HTML dashboard with an accuracy curve and per\-iteration drilldown, the best\-trainptools\_evolved\.py, animplementation\.yamlthat points at the evolved file as a learned implementation, arun\_metadata\.json, and the fresh\-processfinal\_eval/subdirectory\. Because every prompt, response, and intermediate evaluation is preserved, any iteration can be replayed or audited offline; the run is reproducible up to LLM nondeterminism\.

### C\.5Sweep results

We ran the learner on1515\(benchmark, entry\-interface\) pairs in both modes for3030runs total, six iterations each\. Initial T/E is the iteration\-zero train/eval; Best E,T/E is the iteration that maximized eval \(and its train/eval\); Final T/E is the last accepted iteration\. The TabMWP final\-eval entries marked0\.0%†0\.0\\%^\{\\dagger\}are an instrumentation artefact: a one\-time setup hook populates an in\-memory table store, but the fresh\-process final eval reloads the evolved module*after*the hook has run, clearing the store; the per\-iteration evals \(which share the in\-process store\) are correct, and the workflow is not actually regressing\.

Table 8:Orchestration\-learner sweep: 30 runs,≤6\\leq\\\!6iterations each\. T/E denotes train/eval accuracy as a percentage\. “Workflow\-seeded” starts from the hand\-engineered workflow; “Tool\-Seeded” starts from a composer\-generated workflow over the same toolkit\.ModeBenchmarkInitial T/EBest E iter / T / EFinal T/Ewf\-seedfinqa66\.2 / 58\.90 / 66\.2 / 58\.966\.2 / 58\.9wf\-seedmedcalc79\.3 / 65\.94 / 76\.7 /72\.079\.3 / 67\.1wf\-seedmusr / murder72\.9 / 73\.34 / 70\.0 /86\.780\.0 / 80\.0wf\-seedmusr / object55\.4 / 46\.95 / 77\.0 /78\.177\.0 / 78\.1wf\-seedmusr / team72\.9 / 63\.30 / 72\.9 / 63\.372\.9 / 63\.3wf\-seednatural\_plan / calendar55\.7 / 70\.00 / 55\.7 / 70\.055\.7 / 70\.0wf\-seednatural\_plan / meeting31\.4 / 33\.31 /100\.0/100\.0100\.0 / 100\.0wf\-seednatural\_plan / trip47\.1 / 50\.00 / 47\.1 / 50\.047\.1 / 50\.0wf\-seedrulearena / airline28\.6 / 16\.70 / 28\.6 / 16\.728\.6 / 16\.7wf\-seedrulearena / nba89\.7 / 61\.50 / 89\.7 / 61\.589\.7 / 61\.5wf\-seedrulearena / tax0\.0 / 0\.00 / 0\.0 / 0\.00\.0 / 0\.0wf\-seedtabmwp54\.3 / 50\.00 / 54\.3 / 50\.054\.3 /0\.0†0\.0^\{\\dagger\}wf\-seedsports\_understanding90\.6 / 100\.00 / 90\.6 / 100\.0100\.0 / 100\.0wf\-seedgeometric\_shapes54\.7 / 50\.01 / 94\.3 /86\.496\.2 / 86\.4wf\-seedpenguins\_in\_a\_table86\.7 / 53\.85 / 100\.0 /92\.3100\.0 / 92\.3tool\-seedfinqa66\.2 / 63\.34 / 74\.3 /68\.974\.3 / 68\.9tool\-seedmedcalc56\.0 / 48\.84 / 81\.9 /74\.482\.9 / 72\.0tool\-seedmusr / murder72\.9 / 73\.35 / 77\.1 /90\.077\.1 / 90\.0tool\-seedmusr / object55\.4 / 46\.95 / 68\.9 /62\.568\.9 / 62\.5tool\-seedmusr / team72\.9 / 63\.33 / 85\.7 /86\.785\.7 / 86\.7tool\-seednatural\_plan / calendar55\.7 / 66\.74 / 85\.7 /100\.088\.6 / 93\.3tool\-seednatural\_plan / meeting31\.4 / 33\.34 / 82\.9 /83\.382\.9 / 83\.3tool\-seednatural\_plan / trip47\.1 / 50\.04 /100\.0/100\.0100\.0 / 100\.0tool\-seedrulearena / airline0\.0 / 0\.00 / 0\.0 / 0\.00\.0 / 0\.0tool\-seedrulearena / nba89\.7 / 76\.90 / 89\.7 / 76\.989\.7 / 76\.9tool\-seedrulearena / tax0\.0 / 0\.00 / 0\.0 / 0\.00\.0 / 0\.0tool\-seedtabmwp55\.7 / 50\.00 / 55\.7 / 50\.055\.7 /0\.0†0\.0^\{\\dagger\}tool\-seedsports\_understanding90\.6 / 100\.00 / 90\.6 / 100\.0100\.0 / 100\.0tool\-seedgeometric\_shapes52\.8 / 54\.55 / 100\.0 /100\.0100\.0 / 100\.0tool\-seedpenguins\_in\_a\_table86\.7 / 53\.83 / 83\.3 /92\.3100\.0 / 76\.9In aggregate1616of3030runs improved best\-eval over the iteration\-zero baseline, and1616retained that improvement at end of run; on99of3030runs the supervisor produced no applicable code edit across multiple iterations, concentrated on RuleArena where the existing workflow is already specialized to a small set of rule paths\. On six of the fifteen benchmarks the tool\-seeded mode reached or exceeded the workflow\-seeded mode \(FinQA, MedCalc, MuSR team, NaturalPlan\{\\\{calendar, meeting, trip\}\\\}, Geometric Shapes\), supporting the claim that workflow induction from a sound toolkit can match or surpass hand\-engineered workflows on tasks with compositional structure\.

## Appendix DOptimization details

### D\.1Cross\-benchmark frontier summary

In our optimization experiments \(Table[9](https://arxiv.org/html/2606.00189#A4.T9)\) we ran the NSGA\-II optimizer of §[3\.5](https://arxiv.org/html/2606.00189#S3.SS5)over 10 benchmarks spanning 6 domains\. Search\-space sizes range from 18 configurations on each natural\_plan subtask \(3 methods×\\times6 models, enumerated completely below the optimizer’s 20\-configuration exhaustive\-fallback threshold\) to over 10,000 on MedCalc \(8 genes\); the remaining benchmarks were searched by NSGA\-II over five generations of population 12\. Each configuration is evaluated on 50 validation cases; frontier configurations \(those Pareto\-optimal in cost–correctness on the validation set\) are re\-run on a held\-out test split where one is available\.

Table 9:Per\-benchmark NSGA\-II frontier summary\.*Frontier size*is the number of Pareto\-optimal configurations on the validation split;*valid*is the highest validation correctness on the frontier;*test*is the same configuration’s correctness on the held\-out test split\.*Cheapest cost*is the lowest LLM cost per query \(USD\) on the validation frontier\.*Frontier methods*lists the distinct top\-level methods that appear on the validation frontier, in descending order of contribution; method labels follow Table[10](https://arxiv.org/html/2606.00189#A4.T10)\. Test split is unavailable for FinQA \(the public release ships only the dev split, used here as*valid*\)\.frontierbest correct \(%\)ntestn\_\{\\mathrm\{test\}\}cheapestfrontier methodsbenchmarksizevalidtest\(cases\)cost \($/case\)\(deduped\)sports296\.094\.01002\.7×10−52\.7\\\!\\times\\\!10^\{\-5\}structtabmwp592\.094\.25004\.3×10−54\.3\\\!\\times\\\!10^\{\-5\}struct, pot, wfrulearena\_nba490\.569\.6462\.2×10−32\.2\\\!\\times\\\!10^\{\-3\}r\_learn, wf, unstrmusr\_object486\.078\.0503\.3×10−43\.3\\\!\\times\\\!10^\{\-4\}wf\_orch, struct, zs\_cotfinqa682\.0——1\.4×10−41\.4\\\!\\times\\\!10^\{\-4\}struct, wf, wf\_orchmusr\_team480\.062\.0504\.6×10−44\.6\\\!\\times\\\!10^\{\-4\}pot, wf\_orch, zs\_cotmedcalc178\.041\.73007\.6×10−47\.6\\\!\\times\\\!10^\{\-4\}structmusr\_murder276\.080\.0501\.5×10−41\.5\\\!\\times\\\!10^\{\-4\}structnatplan\_meeting344\.054\.0502\.1×10−42\.1\\\!\\times\\\!10^\{\-4\}structnatplan\_trip422\.022\.0501\.5×10−41\.5\\\!\\times\\\!10^\{\-4\}wf, structThe optimizer discovers a non\-trivial Pareto frontier on every benchmark, ranging from a 6\-point frontier on FinQA to a single\-point degenerate one on MedCalc\. Validation rankings mostly survive on test: across the eight benchmarks with public test sets, the best\-validation configuration loses more than 10pp on test on three \(MedCalc, RuleArena/NBA, MuSR/team\), and on five it loses less than 4pp or improves\.

Heterogeneous LLM assignment delivers cost\-quality tiers through a single workflow\.The structural payoff of multi\-objective optimization is visible in MuSR/object\_placements \(Table[9](https://arxiv.org/html/2606.00189#A4.T9), frontier of 4; Figure[4](https://arxiv.org/html/2606.00189#A4.F4)\): two of the four frontier configurations are the seeded orchestration workflow of §[3\.4](https://arxiv.org/html/2606.00189#S3.SS4), one bound to gemini\-2\.5\-flash at 86% valid and the other to gpt\-oss\-120b at 84% valid for roughly 1/7 the cost\. On the held\-out test split both configurations score 78%, so the validation spread vanishes while the cost spread remains; the frontier here is not a list of strictly better choices but a ladder of cost–quality tiers through the same learned workflow\. Per\-interface model assignments on the discovered frontiers, and the corresponding per\-benchmark sample distributions, are tabulated in Appendix[D\.4](https://arxiv.org/html/2606.00189#A4.SS4)\.

![Refer to caption](https://arxiv.org/html/2606.00189v1/figures/nsga2_musr_object.png)Figure 4:NSGA\-II Pareto frontier on MuSR/object\_placements \(cost on the validation split, USD per case, log axis, vs\. correctness;nvalid=50n\_\{\\mathrm\{valid\}\}=50\)\. Filled markers are frontier\-optimal; hollow markers are dominated samples; colour encodes top\-level method, marker shape encodes backend LLM\. The two frontier points labelledwf\_orch/gemini\-2\.5\-flash\(86% valid\) andwf\_orch/gpt\-oss\-120b\(84% valid\) are the same seeded orchestration workflow bound to two different backbones at roughly 1/7 cost ratio; both score 78% on the held\-out test split\.The optimizer’s frontiers track the per\-benchmark verdicts of the learning experiments\.Learned components areretainedwhere the corresponding learner reportedheadroom over hand\-engineering, andprunedwhere it did not\. The seeded orchestration workflow of §[3\.4](https://arxiv.org/html/2606.00189#S3.SS4)lands on the frontier of MuSR/object\_placements \(twice, as above\) and MuSR/team, two of the cases on which §[4\.3](https://arxiv.org/html/2606.00189#S4.SS3)reports the orchestration learner improving over its iteration\-zero baseline; it is sampled but pruned on Sports and TabMWP, where the hand\-engineered workflow already saturates the model\. Induced pseudo\-tool toolkits behave similarly:react\_learnedoccupies two of the four frontier points on RuleArena/NBA, at 90\.5% \(DeepSeek\-V3\) and 74% \(gpt\-oss\-20b\), reflecting the\+6\.5\+6\.5pp induction gain reported in §[4\.2](https://arxiv.org/html/2606.00189#S4.SS2)\. The optimizer istherefore not just a Pareto search but a verifier: it confirms or contradicts each learner’s per\-benchmark claim\.

Caveats\.Three points qualify the cross\-benchmark reading\.

*Sampling concentration\.*NSGA\-II concentrates effort on early\-population winners: across the eight stochastic\-search benchmarks, the most\-sampled method received between 26% and 62% of all configuration evaluations, while methods such as the seeded\-orchestration workflow received between 0 and 10\.

*Provider\-format wart\.*Under our Together AI deployment, DeepSeek V3 and V3\.1 emit native special\-token tool\-call sequences that pydantic\-ai never dispatches, zero\-rating every \(react, V3\) and \(react, V3\.1\) configuration; NSGA\-II Pareto\-prunes these after the first generation\.151515Appendix[D\.4](https://arxiv.org/html/2606.00189#A4.SS4)lists the affected \(method, model\) cells\.

*Test generalisation\.*MedCalc’s best\-validation configuration drops from 78% valid to 41\.7% test \(the largest gap in the table\),161616Both MedCalc/formulas and MedCalc/rules splits were combined and evaluated together in these experiments, as MedCalcand on MuSR/team the valid\-winningpotconfiguration tests at 62% whilewf\_orchon DeepSeek\-V3\.1, also on the frontier, tests at 70%\. So the optimizer’s valid\-best is not always the test\-best frontier point\.

### D\.2NSGA\-II search\-space details

For each benchmark we declare the search space in a YAML file atbenchmarks/<bench\>/nsga2\*\.yaml\. Each declaration consists of \(i\) the entry interface; \(ii\) a shared model pool, fixed across our experiments at six LLMs \(DeepSeek\-V3, DeepSeek\-V3\.1, gpt\-oss\-20b, gpt\-oss\-120b, gemini\-2\.5\-flash, gemini\-2\.5\-flash\-lite\); \(iii\) a top\-level*method gene*listing the alternative bindings for the entry interface, with each value expanding to the full set of dotlist overrides that bind the entry and any helper interfaces it requires; and \(iv\) a per\-sub\-interface method gene where applicable\. The encoder treats top\-level method as the primary categorical gene; the model gene and sub\-interface genes contribute additional positions to the chromosome \(src/secretagent/optimize/encoder\.py\)\. Search spaces range from 18 configurations on each natural\_plan subtask \(which the optimizer enumerated below the 20\-configuration exhaustive\-fallback threshold\) to over 10,000 on MedCalc, with RuleArena/NBA at 360 \(§[3\.5](https://arxiv.org/html/2606.00189#S3.SS5)\)\.

Table 10:Per\-benchmark top\-level method gene options\. Method labels:struct\(structured\_baseline; default LLM call against the entry interface\),unstr\(unstructured\_baseline; zero\-shot prompt\),wf\(workflow; hand\-engineered Python orchestration over the toolkit\),pot\(program\-of\-thought; sandboxed code generation\),react\(simulate\_pydanticReAct loop over the engineered toolkit\),react\_learned\(ReAct loop over the induced ptool toolkit of §[4\.2](https://arxiv.org/html/2606.00189#S4.SS2)\),wf\_orch\(the seeded orchestration workflow of §[3\.4](https://arxiv.org/html/2606.00189#S3.SS4)\), andzs\_cot\(zero\-shot chain\-of\-thought prompt template\)\. Sub\-interface genes per benchmark are documented in the YAML files; the largest is MedCalc \(3 sub\-interfaces:identify\_calculator,extract\_clinical\_values,compute\_calculation\)\.benchmarktop\-level method optionssportsstruct, unstr, wf, react, react\_learned, wf\_orchfinqastruct, unstr, wf, pot, react, wf\_orchtabmwpstruct, wf, pot, react, wf\_orchrulearena\_nbastruct, unstr, wf, pot, react, react\_learnedmusr/\{\\\{murder, object, team\}\\\}struct, zs\_cot, wf, pot, react, react\_learned, wf\_orchmedcalcstruct, wf, pot, react, react\_learned, wf\_orchnatural\_plan/\{\\\{meeting, trip\}\\\}struct, zs\_cot, wf
### D\.3Cache effectiveness in NSGA\-II

The body claim that caching makes optimization affordable is load\-bearing for the practicality of the framework: an uncached NSGA\-II sweep at our search\-space sizes \(Table[10](https://arxiv.org/html/2606.00189#A4.T10)\) and per\-configuration sample size \(nvalid=50n\_\{\\mathrm\{valid\}\}=50\) would cost several times the cached version\. We measured this directly on RuleArena/NBA, the only run for which we recorded both the cache\-hit cost and the equivalent no\-cache cost \(the latter computed by summing the per\-callcostfield that each evaluator records regardless of cache state\)\.

Across the 43\-configuration NBA sweep, the cache absorbed87\.5%of the run’s notional API spend: the run cost $11\.90 in fresh API calls against a $95\.44 no\-cache equivalent, with cache hits contributing $83\.54\. This is no surprise; NSGA\-II revisits good configurations as the population converges, and similar chromosomes call the same sub\-interfaces with the same inputs\. At the configuration level \(Table[11](https://arxiv.org/html/2606.00189#A4.T11)\) the hit rate climbs from 0% in generation 0 to 58% by generation 4 and plateaus; sub\-call hits within configurations contribute the difference between this 0–58% per\-configuration range and the aggregate 87\.5% figure\.171717Cache keys are\(prompt, model\)only\. Hyperparameter sweeps over the same prompt \(e\.g\. varyingmax\_tokensorreasoning\_effort\) require explicit invalidation viacachier\.enable\_caching=false\.

Table 11:Per\-generation NSGA\-II cache effectiveness at the configuration level on RuleArena/NBA\. “New evaluations” is the number of unique chromosomes the optimizer proposed and evaluated in each generation; “cache hits” is the number it proposed and recognised as already\-evaluated; “hit %” is the fraction of considered chromosomes that were already\-seen\. The aggregate sub\-call cache effectiveness over the entire run is 87\.5% of notional API spend, higher than any per\-generation row above because it counts ptool\-level call repetition within configurations as well as configuration\-level repetition across generations\.generationnew evaluationscache hitsconsideredhit %0120120\.01101119\.12731030\.03561154\.54571258\.3545955\.6
### D\.4Per\-benchmark optimization details

This section provides per\-benchmark detail supporting the cross\-benchmark claims of §[4\.4](https://arxiv.org/html/2606.00189#S4.SS4): NSGA\-II method\-gene sample frequencies \(Table[12](https://arxiv.org/html/2606.00189#A4.T12)\), Pareto\-frontier configurations evaluated on the held\-out test split \(Table[13](https://arxiv.org/html/2606.00189#A4.T13)\), the affected \(method, model\) cells of the V3/V3\.1 ReAct provider\-format wart \(Table[14](https://arxiv.org/html/2606.00189#A4.T14)\), and a worked\-example contrast for RuleArena/NBA between the optimizer’s frontier and the six\-model engineered baseline \(Figure[5](https://arxiv.org/html/2606.00189#A4.F5)\)\.

Table 12:Per\-benchmark NSGA\-II method\-gene sample counts\. Cells show the number of evaluated configurations that used each method on the validation split\. “—” marks methods not in the benchmark’s YAML gene set \(Table[10](https://arxiv.org/html/2606.00189#A4.T10)\); “0” marks methods that are in the gene set but received zero NSGA\-II samples within the population\-12, 5\-generation budget\. The most\-sampled method received between 26% \(musr\_teampot\) and 62% \(tabmwpstruct\) of total evaluations on each benchmark;wf\_orchwas sampled between 0 and 10 times on benchmarks where its yaml entry exists\. MedCalc is omitted: sample counts on its single\-frontier sweep are not load\-bearing\.benchmarktotalstructunstrwfpotreactr\_lrnwf\_orchzs\_cotsports5022107—605—tabmwp4226—474—1—finqa431821021—10—musr\_murder4821—650907musr\_object5418—53051013musr\_team503—81319412natplan\_meeting186—6————6natplan\_trip186—6————6rulearena\_nba43314103013——Table 13:Per\-benchmark Pareto\-frontier configurations re\-evaluated on the held\-out test split\. Each row is a validation\-frontier configuration; valid % is the optimizer’s reported correctness on the 50\-case validation split; test % is the same configuration’s correctness on the held\-out test split of sizentestn\_\{\\mathrm\{test\}\}\. Within each benchmark, rows are ordered by descending validation correctness\. FinQA is omitted because its public release ships only the dev split \(used here as*valid*\)\. Method labels follow Table[10](https://arxiv.org/html/2606.00189#A4.T10)\.benchmarkmethod / modelvalid %test %ntestn\_\{\\mathrm\{test\}\}sportsstruct / gemini\-2\.5\-flash96\.094\.0100sportsstruct / gemini\-2\.5\-flash\-lite90\.082\.0100tabmwpstruct / gemini\-2\.5\-flash92\.094\.2500tabmwppot / gpt\-oss\-120b90\.087\.4500tabmwpwf / gpt\-oss\-120b88\.074\.0500tabmwpstruct / gpt\-oss\-120b86\.072\.8500tabmwpstruct / gemini\-2\.5\-flash\-lite70\.064\.8500rulearena\_nbareact\_learned / DeepSeek\-V390\.569\.646rulearena\_nbawf / gemini\-2\.5\-flash\-lite78\.654\.346rulearena\_nbaunstr / gemini\-2\.5\-flash\-lite76\.258\.746rulearena\_nbareact\_learned / gpt\-oss\-20b73\.850\.046musr\_objectwf\_orch / gemini\-2\.5\-flash86\.078\.050musr\_objectwf\_orch / gpt\-oss\-120b84\.078\.050musr\_objectstruct / gpt\-oss\-120b74\.064\.050musr\_objectzs\_cot / gpt\-oss\-120b0\.02\.050musr\_teampot / DeepSeek\-V3\.180\.062\.050musr\_teamwf\_orch / DeepSeek\-V3\.176\.070\.050musr\_teamzs\_cot / DeepSeek\-V3\.152\.034\.050musr\_teamzs\_cot / gpt\-oss\-120b0\.00\.050musr\_murderstruct / gemini\-2\.5\-flash76\.080\.050musr\_murderstruct / gemini\-2\.5\-flash\-lite56\.064\.050medcalcstruct / gpt\-oss\-120b78\.041\.7300natplan\_meetingstruct / gpt\-oss\-20b44\.054\.050natplan\_meetingstruct / DeepSeek\-V3\.130\.036\.050natplan\_meetingstruct / gemini\-2\.5\-flash\-lite26\.024\.050natplan\_tripwf / DeepSeek\-V3\.122\.022\.050natplan\_tripstruct / DeepSeek\-V320\.018\.050natplan\_tripstruct / DeepSeek\-V3\.118\.018\.050natplan\_tripstruct / gemini\-2\.5\-flash\-lite16\.016\.050##### V3/V3\.1 ReAct provider\-format wart\.

Under our Together AI deployment, DeepSeek V3 and V3\.1 emit native special\-token tool\-call sequences \(e\.g\.<\|tool\_calls\_begin\|\>\.\.\.<\|tool\_calls\_end\|\>\) rather than the OpenAI\-style JSONtool\_callsfield\. The pydantic\-ai harness used by ourreactandsimulate\_pydanticfactories parses these as plain text and never dispatches the call, so the agent never reachesfinish\. Every affected configuration returns0%0\\%accuracy at non\-zero cost; NSGA\-II Pareto\-prunes them after one generation\. The same wart removes the \(wf\_orch, V3\) and \(wf\_orch, V3\.1\) cells on MedCalc, where the orchestration learner ran in non\-seed mode and exposed the evolved toolkit throughsimulate\_pydanticrather than as a directly\-callable Python function\.

Table 14:\(method, model\) cells affected by the V3/V3\.1 ReAct provider\-format wart, by benchmark\. Cells in benchmarks not listed are unaffected \(the benchmark’s YAML gene set does not include the method, or the method is not dispatched through pydantic\-ai\)\. natural\_plan/\{\\\{meeting, trip\}\\\}are both unaffected; their gene sets contain noreact\-family methods\.affected \(method, model\) cellaffected benchmarks\(react, V3\), \(react, V3\.1\)sports, tabmwp, finqa, musr/\{\\\{murder, object, team\}\\\}, rulearena\_nba, medcalc\(react\_learned, V3\), \(react\_learned, V3\.1\)sports, musr/\{\\\{murder, object, team\}\\\}, rulearena\_nba, medcalc\(wf\_orch, V3\), \(wf\_orch, V3\.1\)medcalc
##### Worked example: NSGA\-II versus engineered baselines on RuleArena/NBA\.

Figure[5](https://arxiv.org/html/2606.00189#A4.F5)contrasts the optimizer’s Pareto frontier with a six\-model engineered baseline \(thereact\_learnedagent run separately on each model, with no sub\-interface model gene\)\. The engineered frontier \(green squares\) tops out at $0\.30 and 90% accuracy onptool\-inducedwith DeepSeek\-V3\. NSGA\-II \(blue circles\) recovers a comparable 90% point at the same cost*and*populates the cheap end of the frontier with two further configurations the engineered sweep never explored: aneng/GPT\-OSS\-20B configuration at $0\.002 and 76% accuracy, and aneng/DeepSeek\-V3\.1 configuration at $0\.014 and 79% accuracy\. The optimizer’s value, on this benchmark, is at the cheap end of the frontier; the high\-accuracy point was already reachable by tuning the model on the engineered toolkit\.

![Refer to caption](https://arxiv.org/html/2606.00189v1/figures/nba_baseline_vs_nsga.png)Figure 5:RuleArena/NBA: engineered six\-modelreact\_learnedbaseline \(green\) versus the NSGA\-II frontier \(blue\); validation cost \(USD/case, linear axis\) versus correctness\. Squares mark frontier\-optimal, diamonds mark dominated\. Point labels show each configuration’s top\-level method and model assignment\(s\)\. NSGA\-II recovers the high\-accuracy DeepSeek\-V3 baseline \(top\-right\) and finds two cheaper frontier configurations the engineered sweep did not produce\.
Learning to Construct Practical Agentic Systems

Similar Articles

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

As we scale toward agentic, multimodal systems combining LLMs, RLHF, tool-use, and retrieval-augmented generation, what practical architecture best balances reliability, alignment, and cost?

AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

@Zephyr_hg: https://x.com/Zephyr_hg/status/2062176187384807488

Submit Feedback

Similar Articles

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs
As we scale toward agentic, multimodal systems combining LLMs, RLHF, tool-use, and retrieval-augmented generation, what practical architecture best balances reliability, alignment, and cost?
AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design
@Zephyr_hg: https://x.com/Zephyr_hg/status/2062176187384807488