CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents

arXiv cs.AI 05/12/26, 04:00 AM Papers
Summary
This paper introduces CoCoDA, a framework that uses a co-evolving compositional Directed Acyclic Graph (DAG) to manage tool libraries for augmented agents. It enables small language models to efficiently retrieve and compose tools, allowing an 8B model to match or exceed the performance of a 32B model on reasoning benchmarks.
arXiv:2605.08399v1 Announce Type: new Abstract: Tool-augmented language models can extend small language models with external executable skills, but scaling the tool library creates a coupled challenge: the library must evolve with the planner as new reusable subroutines emerge, while retrieval from the growing library must remain within a fixed context budget. Existing tool-use and skill-library methods typically treat tools as flat or text-indexed memories, causing prompt cost to grow with library size and obscuring the typed, compositional structure of executable code. We propose CoCoDA, a framework that co-evolves the planner and tool library through a single code-native structure: a compositional code DAG. Nodes are primitive or composite tools, edges encode invocation dependencies, and each node stores a typed signature, description, pre/post-condition specification, and worked examples. At inference time, Typed DAG Retrieval prunes candidates by symbolic signature unification, ranks survivors by descriptions, filters them by behavioral specifications, and disambiguates with examples, keeping expensive context materialization on progressively smaller candidate sets. At training time, successful trajectories are folded into validated composite tools, while the planner is updated with a DAG-induced reward that credits composites by their primitive expansion size. We provide theoretical results showing retrieval cost reduction, sublinear retrieval time, compositional advantage under the shaped reward, monotone co-evolution under conservative updates, and DAG well-formedness. Across mathematical reasoning, tabular analysis, and code task benchmarks, CoCoDA enables an 8B student to match or exceed a 32B teacher on GSM8K and MATH and consistently improves over strong tool-use and library-learning baselines.
Original Article
View Cached Full Text
Cached at: 05/12/26, 07:13 AM
# CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents
Source: [https://arxiv.org/html/2605.08399](https://arxiv.org/html/2605.08399)
Ziyang Yu1Qiyue Li2Liang Zhao11Emory University2Beijing Normal\-Hong Kong Baptist University

###### Abstract

Tool\-augmented language models can extend small language models with external executable skills, but scaling the tool library creates a coupled challenge: the library must evolve with the planner as new reusable subroutines emerge, while retrieval from the growing library must remain within a fixed context budget\. Existing tool\-use and skill\-library methods typically treat tools as flat or text\-indexed memories, causing prompt cost to grow with library size and obscuring the typed, compositional structure of executable code\. We propose CoCoDA, a framework that co\-evolves the planner and tool library through a single code\-native structure: a compositional code DAG\. Nodes are primitive or composite tools, edges encode invocation dependencies, and each node stores a typed signature, description, pre/post\-condition specification, and worked examples\. At inference time, Typed DAG Retrieval prunes candidates by symbolic signature unification, ranks survivors by descriptions, filters them by behavioral specifications, and disambiguates with examples, keeping expensive context materialization on progressively smaller candidate sets\. At training time, successful trajectories are folded into validated composite tools, while the planner is updated with a DAG\-induced reward that credits composites by their primitive expansion size\. We provide theoretical results showing retrieval cost reduction, sublinear retrieval time, compositional advantage under the shaped reward, monotone co\-evolution under conservative updates, and DAG well\-formedness\. Across mathematical reasoning, tabular analysis, and code task benchmarks, CoCoDA enables an 8B student to match or exceed a 32B teacher on GSM8K and MATH and consistently improves over strong tool\-use and library\-learning baselines\.

## 1Introduction

Tool\-augmented language models extend a planner with external executable tools, allowing even small language models to offload computation, symbolic manipulation, and long\-horizon procedures to a reusable library\. This division of labor is especially attractive for small language models, whose context windows and parametric capacity are limited\. In principle, a sufficiently rich tool library can act as an external procedural memory: the planner retrieves a small set of relevant tools, invokes them through an executor, and composes their outputs into a final answer\.

However, scaling such a library creates a tension\. On one hand, the library must evolve with the planner\. As the planner explores new tasks, it discovers recurring sub\-trajectories that should be abstracted into reusable tools; otherwise, the planner repeatedly reconstructs the same primitive computations\. On the other hand, every newly added tool increases the retrieval burden\. A flat library makes the planner’s prompt cost grow with the number of tools, eventually exhausting the context budget before the library becomes useful\. Thus, tool\-library learning for small language models requires solving two coupled problems: how to grow the library with the policy, and how to retrieve from the growing library under a fixed context window\.

Existing approaches address only parts of this problem\. Static tool\-use systems assume a fixed inventory and train or prompt the model to call tools from that inventory\. Skill\-library and agent\-memory methods append new skills during exploration, but typically store them as flat text records or natural\-language memories, so retrieval cost grows with library size and the internal dependency structure of executable code remains hidden\. Hierarchical memory and RAG methods reduce context cost by clustering summaries, but their hierarchies are induced by topical similarity rather than by executable composition\. Conversely, code\-library learning and program\-synthesis methods exploit compositional structure, but they are not designed for LLM planners operating under token\-budgeted retrieval\. The missing mechanism is a structure that is simultaneously code\-aware, retrieval\-efficient, and learnable online\.

Our key observation is that executable tools contain structure that text memories do not\. A tool has a typed signature, executable dependencies, behavioral specifications, and concrete input\-output examples\. These signals can be used not only to organize the library, but also to decide which records must be exposed to the planner\. We therefore represent the library as a compositional code DAG\. Nodes are primitive or composite tools, edges are invocation dependencies extracted from executable bodies, and each node stores a four\-level record: typed signature, description, pre/post\-condition specification, and worked examples\. This DAG is the single object through which CoCoDA couples library growth and context\-bounded retrieval\.

At inference time, CoCoDA performs Typed DAG Retrieval\. Given a query\-induced subgoal, retrieval first prunes candidates by symbolic signature unification, then ranks the survivors by descriptions, filters them by pre/post\-condition compatibility, and finally uses examples for disambiguation\. Because expensive LLM\-billed records are materialized only for surviving candidates, the retrieval cost contracts across stages rather than scaling with the full library\. Because the hierarchy follows invocation dependencies rather than text similarity, retrieval can surface a composite tool that behaviorally subsumes a primitive sub\-trajectory\.

At training time, CoCoDA co\-evolves the planner and the library\. Successful trajectories are passed to a fixed teacher abstractor, which proposes composite tools\. A proposal is committed only if it preserves acyclicity and satisfies dependency and specification checks\. The planner is updated with a DAG\-induced reward that credits tools by their primitive expansion size: invoking a composite that represents multiple primitive calls receives positive saved\-call credit, while invoking a primitive receives none\. Thus, among behavior\-equivalent trajectories with the same verified outcome, the planner is encouraged to use reusable composites\. The same DAG therefore plays two roles: it keeps retrieval within the context budget and defines the structural signal that trains the planner to exploit newly learned abstractions\.

We make the following contributions\. First, we formulate tool\-library learning for small language models as a joint planner\-library optimization problem with explicit retrieval\-cost and context\-window constraints\. Second, we propose Typed DAG Retrieval, a code\-native retrieval cascade that uses signatures, invocation edges, specifications, and examples to keep prompt cost sublinear in library size\. Third, we introduce an online co\-evolution procedure that inserts validated composite tools and trains the planner with a DAG\-induced saved\-call reward\. Fourth, we provide theoretical results showing retrieval cost reduction, sublinear retrieval time, compositional advantage under the shaped reward, monotone co\-evolution under conservative updates, and DAG well\-formedness\. Finally, across mathematical reasoning, tabular analysis, and code task benchmarks, CoCoDA enables an 8B student to match or exceed a 32B teacher on GSM8K and MATH and consistently improves over strong tool\-use and library\-learning baselines\.

## 2Related Works

### 2\.1Tool\-Augmented Language Models

Toolformer\(Schicket al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib1)\)teaches an LLM to self\-annotate API calls and interleave them with generation, while ReAct\(Yaoet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib2)\)threads reasoning traces with tool invocations at inference time\. ToolLLM\(Qinet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib3)\), Gorilla\(Patilet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib4)\), and AnyTool\(Duet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib5)\)scale tool use to thousands of real\-world APIs through instruction tuning and hierarchical API retrieval\. ToolAlpaca\(Tanget al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib6)\)and Gorilla\(Patilet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib4)\)distill tool\-use trajectories from a stronger teacher into smaller planners\. RLTF\(Liuet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib7)\), ToRL\(Liet al\.,[2025](https://arxiv.org/html/2605.08399#bib.bib8)\), and ToolRL\(Qianet al\.,[2025](https://arxiv.org/html/2605.08399#bib.bib14)\)train LLMs end\-to\-end with outcome rewards from execution feedback, and ReTool\(Fenget al\.,[2025](https://arxiv.org/html/2605.08399#bib.bib15)\)and ARTIST\(Singhet al\.,[2025](https://arxiv.org/html/2605.08399#bib.bib16)\)extend the recipe to multi\-turn agentic settings on top of GRPO\(Shaoet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib17); DeepSeek\-AIet al\.,[2025](https://arxiv.org/html/2605.08399#bib.bib18)\)\. Across these works, the tool inventory is treated as a static, flat*text\-indexed*resource that the policy optimizes against rather than reshapes, leaving both co\-evolution and the per\-prompt scaling of retrieval cost unaddressed\.

### 2\.2Skill Library Learning and Agent Memory

Prior agent\-memory work treats the library as a flat*text*collection: Voyager\(Wanget al\.,[2024a](https://arxiv.org/html/2605.08399#bib.bib9)\), Ghost in the Minecraft\(Zhuet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib33)\), CRAFT\(Yuanet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib19)\), LATM\(Caiet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib10)\), CREATOR\(Qianet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib11)\), TroVE\(Wanget al\.,[2024b](https://arxiv.org/html/2605.08399#bib.bib12)\), and ReGAL\(Stengel\-Eskinet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib13)\)grow skill or function memories offline whose dependency structure is invisible to both retrieval and training, while MemGPT\(Packeret al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib30)\), Generative Agents\(Parket al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib29)\), Reflexion\(Shinnet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib31)\), and Self\-Refine\(Madaanet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib32)\)hierarchize or self\-refine memory by topical similarity over natural language rather than by code structure\. A parallel line that does organize code into hierarchies splits along an axis nothing has crossed: a hierarchy is either*code\-aware*or*built for an LLM under a fixed context budget*, never both\. DreamCoder\(Elliset al\.,[2021](https://arxiv.org/html/2605.08399#bib.bib35)\), Stitch\(Bowerset al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib36)\), and software\-engineering call\-graph indexing yield genuinely code\-aware hierarchies, typed, compositional, statically extractable, but target symbolic synthesizers or human developers, with no token\-budget tiering, no signature\-level prefilter, and no certified macro\-action equivalence; conversely, CodeRAG\(Wanget al\.,[2024c](https://arxiv.org/html/2605.08399#bib.bib37)\)and CodeT5\+\(Wanget al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib38)\)retrievers target LLMs but use a flat embedding space, and RAPTOR\(Sarthiet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib39)\)tiers for LLM context budgets yet clusters by text similarity, discarding every code\-specific signal\. Our framework occupies the missing quadrant via three properties that serve both halves at once: \(i\)*typed signatures*are statically extractable \(code\-aware\) and prune candidates symbolically before any LLM call \(LLM\-aware\); \(ii\)*static invocation edges*follow real reuse structure while keeping per\-step cost proportional to local fan\-out rather than library size; and \(iii\)*pre/post\-condition\-validated*composites preserve executable behavior and license*lossless*substitution of one composite token for a subtree of primitives in the prompt, making context cost provably sublinear in library size where prior code hierarchies remain worst\-case linear\.

### 2\.3Policy–Memory Co\-Evolution

Voyager\(Wanget al\.,[2024a](https://arxiv.org/html/2605.08399#bib.bib9)\)is the canonical co\-evolution setup, appending new skills to a shared library as a frozen LLM explores, and Experiential Co\-Learning\(Qianet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib34)\)extends the idea to multi\-agent software development\. CREATOR\(Qianet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib11)\)and LATM\(Caiet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib10)\)interleave tool creation with tool use, while CRAFT\(Yuanet al\.,[2024](https://arxiv.org/html/2605.08399#bib.bib19)\)specializes toolsets to tasks encountered during deployment\. These works establish*prompt\-space, text\-memory*co\-evolution where the library grows by text appends against a frozen or lightly instructed large model and prompt cost grows linearly with library size, rather than gradient\-based co\-evolution against a verifiable reward over a structured*code*library whose hierarchy keeps retrieval cost sublinear\.

## 3Proposed Method

### 3\.1Problem Formulation

Setup\.We study tool\-mediated problem solving for a plannerπθ\\pi\_\{\\theta\}on tasks\(q,y\)∼𝒟\(q,y\)\\sim\\mathcal\{D\}, withqqa problem instance andyya verifiable outcome \(final answer or unit\-test oracle\)\. The planner accesses a tool libraryℒ=\(𝒱,ℰ,ℐ\)\\mathcal\{L\}=\(\\mathcal\{V\},\\mathcal\{E\},\\mathcal\{I\}\)where𝒱\\mathcal\{V\}is the set of tools,ℰ\\mathcal\{E\}records invocation dependencies, andℐ\\mathcal\{I\}stores per\-tool records\. Givenqq, retrieval builds a candidate set𝒞q\(ℒ\)⊆𝒱\\mathcal\{C\}\_\{q\}\(\\mathcal\{L\}\)\\subseteq\\mathcal\{V\}that augments the planner’s prompt; the planner then samplesτ∼πθ\(⋅∣q,𝒞q\(ℒ\)\)\\tau\\sim\\pi\_\{\\theta\}\(\\cdot\\mid q,\\mathcal\{C\}\_\{q\}\(\\mathcal\{L\}\)\)invoking tools\(v1,…,vT\)\(v\_\{1\},\\ldots,v\_\{T\}\)via an executor\. Quality is measured by a deterministic verifierRres\(τ\)∈\[0,1\]R\_\{\\text\{res\}\}\(\\tau\)\\in\[0,1\], and the context\-token cost of𝒞q\\mathcal\{C\}\_\{q\}byC\(q,ℒ\)C\(q,\\mathcal\{L\}\)\.

Compositional Tool DAG\.We organiseℒ\\mathcal\{L\}as a DAG𝒢=\(𝒱,ℰ\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}\), where𝒱=𝒱p∪𝒱c\\mathcal\{V\}=\\mathcal\{V\}\_\{p\}\\cup\\mathcal\{V\}\_\{c\}partitions tools into primitive nodes \(atomic executable functions\) and composite nodes \(macro\-actions that invoke one or more existing tools\)\. Eachvvcarries\(d\(v\),Ch\(v\),ℐ\(v\)\)\(d\(v\),\\mathrm\{Ch\}\(v\),\\mathcal\{I\}\(v\)\): depthd\(v\)=0d\(v\)=0for primitives andd\(v\)=1\+maxu∈Ch\(v\)⁡d\(u\)d\(v\)=1\+\\max\_\{u\\in\\mathrm\{Ch\}\(v\)\}d\(u\)for composites; an ordered child setCh\(v\)\\mathrm\{Ch\}\(v\)realisingvvat the next finer level; and a recordℐ\(v\)=\(L1,L2,L3,L4\)\(v\)\\mathcal\{I\}\(v\)=\(L\_\{1\},L\_\{2\},L\_\{3\},L\_\{4\}\)\(v\)giving a typed signature, description, pre/post specification, and worked examples\. Edges encode invocation, so𝒢\\mathcal\{G\}stratifies the library by abstraction rather than surface similarity\. We further define the*flat size*flat\(v\)=1\\mathrm\{flat\}\(v\)=1for primitives andflat\(v\)=∑u∈Ch\(v\)flat\(u\)\\mathrm\{flat\}\(v\)=\\sum\_\{u\\in\\mathrm\{Ch\}\(v\)\}\\mathrm\{flat\}\(u\)for composites \(counted with multiplicity: a child invokedkktimes contributeskflat\(u\)k\\,\\mathrm\{flat\}\(u\)\), i\.e\. the number of primitive leaves in the recursive expansion ofvv, and the*saved\-call count*Φ\(v\):=flat\(v\)−1\\Phi\(v\):=\\mathrm\{flat\}\(v\)\-1\. ThusΦ=0\\Phi=0on primitives andΦ\(v\)≥1\\Phi\(v\)\\geq 1on any composite expanding to≥2\\geq 2primitive invocations\.

![Refer to caption](https://arxiv.org/html/2605.08399v1/pics/framework.png)

Figure 1:Overview of theCoCoDAframework\.Left \(Tool Structure\):each tool stores a44\-layer record \(L1L\_\{1\}signature,L2L\_\{2\}description,L3L\_\{3\}specifications,L4L\_\{4\}examples\), used by Typed DAG Retrieval as a cascade filter from cheap to expensive layers\.Middle \(Compositional Tool Library\):a code DAG of primitive and composite tools \(e\.g\.,compute\_ratio,solve\_equation\) whose edges encode invocation dependencies, letting macro\-actions decompose into reusable subroutines\.Right \(Co\-evolve Stage\):the planner rolls out trajectories on the DAG; successful ones are abstracted into new composite tools viaInsert/Update Tool Node, while the planner is updated with a compositional plus result reward, co\-evolving library and policy\.Tool\-mediated output\.We co\-maintainπθ\\pi\_\{\\theta\}\(consumes𝒞q\\mathcal\{C\}\_\{q\}, emitsτ\\tau\) andℒ\\mathcal\{L\}\(read by retrieval, written byInsertTool, which admits composites distilled from successful rollouts\)\. We write the joint outputOπθ,ℒ\(q\)O\_\{\\pi\_\{\\theta\},\\mathcal\{L\}\}\(q\)to make this explicit\. A fixed teacher abstractorℳT\\mathcal\{M\}\_\{T\}proposes candidates;InsertToolaccepts only those preserving acyclicity and dependency specifications\.

Optimization objective\.We frameCoCoDAas a single optimization over the planner–library pair in terms of quantities intrinsic to the LLM\. Let𝒟ev\\mathcal\{D\}\_\{\\text\{ev\}\}be the training distribution\. Forτ∼πθ\(⋅∣q,𝒞q\(ℒ\)\)\\tau\\sim\\pi\_\{\\theta\}\(\\cdot\\mid q,\\mathcal\{C\}\_\{q\}\(\\mathcal\{L\}\)\)withT\(τ\)T\(\\tau\)tool calls, letC¯retr\(q,ℒ\)\\bar\{C\}\_\{\\text\{retr\}\}\(q,\\mathcal\{L\}\)be the average per\-call retrieval token cost; total inference token cost factorises asT\(τ\)⋅C¯retr\(q,ℒ\)T\(\\tau\)\\cdot\\bar\{C\}\_\{\\text\{retr\}\}\(q,\\mathcal\{L\}\)\. We seek

\(πθ⋆,ℒ⋆\)\\displaystyle\(\\pi\_\{\\theta^\{\\star\}\},\\mathcal\{L\}^\{\\star\}\)=arg⁡maxπθ,ℒ𝔼\(q,y\)∼𝒟ev\[Rres\(Oπθ,ℒ\(q\)\)−𝔼τ\[T\(τ\)⋅C¯retr\(q,ℒ\)\]\]\\displaystyle=\\;\\mathop\{\\arg\\max\}\\nolimits\_\{\\pi\_\{\\theta\},\\,\\mathcal\{L\}\}\\;\\mathop\{\\mathbb\{E\}\}\\nolimits\_\{\(q,y\)\\sim\\mathcal\{D\}\_\{\\text\{ev\}\}\}\\\!\\Bigl\[\\,R\_\{\\text\{res\}\}\\bigl\(O\_\{\\pi\_\{\\theta\},\\mathcal\{L\}\}\(q\)\\bigr\)\\;\-\\;\\mathop\{\\mathbb\{E\}\}\\nolimits\_\{\\tau\}\\\!\\bigl\[\\,T\(\\tau\)\\cdot\\bar\{C\}\_\{\\text\{retr\}\}\(q,\\mathcal\{L\}\)\\,\\bigr\]\\,\\Bigr\]\(1\)s\.t\.∑v∈𝒮ℓ−1cℓ\(v\)≤W∀ℓ∈\{2,3,4\},\\displaystyle\\text\{s\.t\.\}\\;\\;\\sum\\nolimits\_\{v\\in\\mathcal\{S\}\_\{\\ell\-1\}\}\\\!c\_\{\\ell\}\(v\)\\;\\leq\\;W\\qquad\\forall\\,\\ell\\in\\\{2,3,4\\\},where the second term is the expected token cost of solvingqq\. Retrieval issues three LLM calls \(atL2,L3,L4L\_\{2\},L\_\{3\},L\_\{4\}\), each materialisingLℓ\(v\)L\_\{\\ell\}\(v\)for survivingv∈𝒮ℓ−1v\\in\\mathcal\{S\}\_\{\\ell\-1\};L1L\_\{1\}is a symbolic type filter and incurs no LLM call\. AfterL4L\_\{4\}each typed sub\-goal commits to a single winner that the executor invokes directly, so the context\-window limit applies*per retrieval call*, giving the three constraints above\. None of these quantities is method\-specific:RresR\_\{\\text\{res\}\}is the task verifier,T⋅C¯retrT\\cdot\\bar\{C\}\_\{\\text\{retr\}\}is the standard token bill of any tool\-augmented LLM, andWWis a model context constant\.Three structural axes\.Eq\. \([1](https://arxiv.org/html/2605.08399#S3.E1)\) exposes three orthogonal cost axes, each addressed by a different component:\(A\)*Per\-call retrieval costC¯*retr*\\bar\{C\}\_\{\\text\{retr\}\}\.*Flat retrieval givesC¯retr=∑v∑ℓcℓ\(v\)\\bar\{C\}\_\{\\text\{retr\}\}=\\sum\_\{v\}\\sum\_\{\\ell\}c\_\{\\ell\}\(v\), linear in\|𝒱\|\|\\mathcal\{V\}\|\. Typed DAG Retrieval \(Section[3\.2](https://arxiv.org/html/2605.08399#S3.SS2)\) makes this sublinear via a cascade𝒮1⊇⋯⊇𝒮4\\mathcal\{S\}\_\{1\}\\\!\\supseteq\\\!\\cdots\\\!\\supseteq\\\!\\mathcal\{S\}\_\{4\}payingcℓc\_\{\\ell\}only on survivors;\(B\)*Number of tool callsT\(τ\)T\(\\tau\)\.*A depth\-ddcomposite subsuming anmm\-step primitive sub\-trajectory replacesmmplanner calls with one\. Co\-Evolution \(Section[3\.3](https://arxiv.org/html/2605.08399#S3.SS3)\) shrinksT\(τ\)T\(\\tau\)by growingd¯\(𝒱c\)\\bar\{d\}\(\\mathcal\{V\}\_\{c\}\)via teacher\-distilledInsertToolcommits;\(C\)*Per\-call context limitWW\.*Eachℓ∈\{2,3,4\}\\ell\\in\\\{2,3,4\\\}is a separate LLM call that fit inWWon its own; a flat single call scaling with\|𝒱\|\|\\mathcal\{V\}\|breachesWWon any non\-trivial library\. The cascade payscℓc\_\{\\ell\}only on𝒮ℓ−1\\mathcal\{S\}\_\{\\ell\-1\}, so each call fits regardless of\|𝒱\|\|\\mathcal\{V\}\|\.

Roadmap\.*Typed DAG Retrieval*\(Section[3\.2](https://arxiv.org/html/2605.08399#S3.SS2)\) handles axes \(A\) and \(C\) via the cascadeL1→L2→L3→L4L\_\{1\}\\\!\\to\\\!L\_\{2\}\\\!\\to\\\!L\_\{3\}\\\!\\to\\\!L\_\{4\}, paying eachcℓc\_\{\\ell\}only on survivors so per\-stage calls fit underWWand average retrieval cost is sublinear in\|𝒱\|\|\\mathcal\{V\}\|\. At training,*Co\-Evolution under a graph\-aware reward*\(Section[3\.3](https://arxiv.org/html/2605.08399#S3.SS3)\) handles axis \(B\):InsertToolcommits growd¯\(𝒱c\)\\bar\{d\}\(\\mathcal\{V\}\_\{c\}\)while a structural reward credits the planner for routing through deeper composites, shrinkingT\(τ\)T\(\\tau\)\. The two close the loop: a deeper library makes retrieval cheaper per call and shorter, while cheaper retrieval keeps the prompt underWWas the library grows\.

### 3\.2Typed DAG Retrieval: Approximating the Retrieval\-Cost Block

This section addresses axes \(A\) and \(C\) of Eq\. \(1\)\. Typed DAG Retrieval is coarse\-to\-fine along two orthogonal axes\. The first is the*abstraction axis*induced by the compositional DAG: retrieval starts from the highest\-level tool whose signature can cover the current subgoal and descends to children only when that composite is too broad, incompatible, or insufficiently discriminative\. Thus, the retriever stops as soon as a validated composite can solve the subgoal, instead of blindly opening all lower\-level subskills\. The second is the*evidence axis*inside each candidate node: compatibility is judged using increasingly informative and increasingly expensive records, from typed signatures to descriptions, specifications, and examples\. These two axes play different roles\. The abstraction axis decides*how deep*retrieval must go in the executable skill DAG, while the evidence axis decides*which alternative implementation*at the current abstraction level should be surfaced\.

The key asymmetry is that tool compatibility becomes increasingly expensive to judge\. Type compatibility is symbolic; semantic relevance requires description ranking; behavioral compatibility requires contract checking; and fine\-grained disambiguation requires examples\. CoCoDA orders these checks from cheapest to most expensive and pays each expensive cost only on the survivors of earlier stages\.

For a query\-induced subgoal, Typed DAG Retrieval constructs a nested sequenceV=S0⊇S1⊇S2⊇S3⊇S4V=S\_\{0\}\\supseteq S\_\{1\}\\supseteq S\_\{2\}\\supseteq S\_\{3\}\\supseteq S\_\{4\}\. The final setS4S\_\{4\}is surfaced to the planner or executor\. Importantly, this survivor cascade is applied lazily over the DAG: a composite node is expanded into its children only if the cascade cannot certify the composite itself as a sufficient tool for the subgoal\.

#### L1: symbolic signature pruning\.

The first stage drops any tool whose signature does not unify with the sub\-goal:𝒮1=\{v∈𝒱:L1\(v\)unifies with the sub\-goal signature\}\\mathcal\{S\}\_\{1\}=\\\{v\\in\\mathcal\{V\}:L\_\{1\}\(v\)\\text\{ unifies with the sub\-goal signature\}\\\}\. Unification is decided symbolically on a static type lattice via an inverted index, so the stage incurs zero prompt\-token cost; only its symbolic work scales with\|𝒱\|\|\\mathcal\{V\}\|, bounded by Corollary[2](https://arxiv.org/html/2605.08399#Thmtheorem2)\. Running it first lets every later LLM\-billed stage operate on a much smaller survivor set\.

#### L2: edge\-guided semantic expansion\.

The second stage exposes descriptionsL2\(v\)L\_\{2\}\(v\)only forv∈𝒮1v\\in\\mathcal\{S\}\_\{1\}; the planner ranks these by semantic relevance toqqand keeps the top\-k2k\_\{2\}shortlist𝒮2\\mathcal\{S\}\_\{2\}\. Because candidates live in an invocation DAG rather than a flat pool, ranking can climb to a subsuming composite or descend to a primitive needed to complete the sub\-goal—structural moves that similarity\-only hierarchies cannot make\. Cost is paid on𝒮1\\mathcal\{S\}\_\{1\}, not𝒱\\mathcal\{V\}\.

#### L3: specification filtering\.

The third stage exposes pre/post\-conditionsL3\(v\)L\_\{3\}\(v\)only forv∈𝒮2v\\in\\mathcal\{S\}\_\{2\}and treats them as a hard accept/reject judged by the planner under the prompt of Appendix[D\.5](https://arxiv.org/html/2605.08399#A4.SS5): a candidate is dropped if its preconditions are not satisfiable in the current sub\-goal or its postconditions do not imply the desired outcome\. Binary verdicts admit no false positives under faithful specification reading, so a surviving composite is certified as a macro\-action equivalent to its primitive sub\-trajectory \(Theorem[3](https://arxiv.org/html/2605.08399#Thmtheorem3)\)\. Cost is paid only on thek2k\_\{2\}\-shortlist\.

#### L4: example\-based reranking\.

The final stage exposes worked examplesL4\(v\)L\_\{4\}\(v\)only forv∈𝒮3v\\in\\mathcal\{S\}\_\{3\}and reranks the survivors to break near\-ties among tools with similar signatures and contracts\.

SinceL1L\_\{1\}is free, the total LLM\-billed retrieval cost of the cascade is∑ℓ=24∑v∈𝒮ℓ−1cℓ\(v\)\\sum\_\{\\ell=2\}^\{4\}\\\!\\sum\_\{v\\in\\mathcal\{S\}\_\{\\ell\-1\}\}\\\!c\_\{\\ell\}\(v\), which Theorem[1](https://arxiv.org/html/2605.08399#Thmtheorem1)bounds byα1α2Cflat\+o\(Cflat\)\\alpha\_\{1\}\\alpha\_\{2\}\\,C\_\{\\text\{flat\}\}\+o\(C\_\{\\text\{flat\}\}\)and Corollary[2](https://arxiv.org/html/2605.08399#Thmtheorem2)converts to sublinear time\.

#### Co\-evolution tightens the cascade\.

The decomposition above is independent of which nodes occupy𝒱\\mathcal\{V\}\. When co\-evolution \(Sec\.[3\.3](https://arxiv.org/html/2605.08399#S3.SS3)\) folds a length\-mmprimitive sub\-trajectory into one validated composite, that composite occupies a single slot of𝒮2\\mathcal\{S\}\_\{2\}instead ofmm, strictly contracting every later𝒮ℓ\\mathcal\{S\}\_\{\\ell\}on subsequent queries \(Remark[Remark](https://arxiv.org/html/2605.08399#Thmremarkx1)\)\. The same abstraction that shortens planner trajectories therefore also shortens future retrieval prompts\.

### 3\.3Co\-Evolution as a Coupled Update on the Joint Objective

Section[3\.2](https://arxiv.org/html/2605.08399#S3.SS2)treats\(πθ,ℒ\)\(\\pi\_\{\\theta\},\\mathcal\{L\}\)as fixed: any sub\-trajectory the planner stitches from primitives is consumed for the current query and not written back\. Updating onlyπθ\\pi\_\{\\theta\}leavesd¯\(𝒱c\)\\bar\{d\}\(\\mathcal\{V\}\_\{c\}\)unchanged, so every subsequent query keeps paying the same retrieval cost in Eq\. \([1](https://arxiv.org/html/2605.08399#S3.E1)\)\. We therefore approach the joint objective via a*coupled*update that co\-evolvesπθ\\pi\_\{\\theta\}andℒ\\mathcal\{L\}on the same gap signal, withℳT\\mathcal\{M\}\_\{T\}fixed\. The library is edited via atomic single\-tool operators:Addintroduces a primitive or composite with depth assigned by Algorithm[3](https://arxiv.org/html/2605.08399#alg3);Mergeconsolidatest⋆t^\{\\star\}with a semantically overlapping entry;Rejectprunes proposals failing dependency specifications or acyclicity\. They are jointly closed under composition: any acyclicℒ′\\mathcal\{L\}^\{\\prime\}is reachable from anyℒ\\mathcal\{L\}by a finite sequence of these operations\.

#### Surrogate Reward

The cost block is𝔼τ\[T\(τ\)⋅C¯retr\(q,ℒ\)\]\\mathbb\{E\}\_\{\\tau\}\[T\(\\tau\)\\cdot\\bar\{C\}\_\{\\text\{retr\}\}\(q,\\mathcal\{L\}\)\]\. Within a single rolloutC¯retr\\bar\{C\}\_\{\\text\{retr\}\}is fixed byℒ\\mathcal\{L\}and retrieval, notπθ\\pi\_\{\\theta\}, so minimising the cost block over the policy reduces to maximising−T\(τ\)\-T\(\\tau\)\. By the definition ofΦ\\Phi, every invocation satisfiesflat\(ti\)−Φ\(ti\)=1\\mathrm\{flat\}\(t\_\{i\}\)\-\\Phi\(t\_\{i\}\)=1, so summing over the trajectory gives

−T\(τ\)=∑i=1T\(τ\)\(Φ\(ti\)−flat\(ti\)\)=∑i=1T\(τ\)Φ\(ti\)⏟\(i\) saved\-call credit−Tprim\(τ\)⏟\(ii\) primitive workload,\-T\(\\tau\)\\;=\\;\\sum\\nolimits\_\{i=1\}^\{T\(\\tau\)\}\\bigl\(\\Phi\(t\_\{i\}\)\-\\mathrm\{flat\}\(t\_\{i\}\)\\bigr\)\\;=\\;\\underbrace\{\\sum\\nolimits\_\{i=1\}^\{T\(\\tau\)\}\\Phi\(t\_\{i\}\)\}\_\{\\text\{\(i\) saved\-call credit\}\}\\;\-\\;\\underbrace\{T\_\{\\text\{prim\}\}\(\\tau\)\}\_\{\\text\{\(ii\) primitive workload\}\},\(2\)whereTprim\(τ\):=∑iflat\(ti\)T\_\{\\text\{prim\}\}\(\\tau\):=\\sum\_\{i\}\\mathrm\{flat\}\(t\_\{i\}\)counts primitive invocations after expansion\. Term \(ii\) depends only on the sub\-computationτ\\taurealises, not on the composite\-vs\-primitive phrasing: lossless substitution leavesTprimT\_\{\\text\{prim\}\}invariant\. Within a GRPO group on the sameqq,TprimT\_\{\\text\{prim\}\}acts as a query\-conditional baseline absorbed by group\-mean subtraction in the advantage, leaving only term \(i\) as a non\-trivial gradient signal\. We therefore takeRcomp\(τ\)=∑i=1T\(τ\)Φ\(ti\)=∑i=1T\(τ\)\(flat\(ti\)−1\)R\_\{\\text\{comp\}\}\(\\tau\)\\;=\\;\\sum\\nolimits\_\{i=1\}^\{T\(\\tau\)\}\\Phi\(t\_\{i\}\)\\;=\\;\\sum\\nolimits\_\{i=1\}^\{T\(\\tau\)\}\\bigl\(\\mathrm\{flat\}\(t\_\{i\}\)\-1\\bigr\)as the surrogate reward\. The shaped training reward isR\(τ\)=Rres\(τ\)\+λRcomp\(τ\)R\(\\tau\)=R\_\{\\text\{res\}\}\(\\tau\)\+\\lambda\\,R\_\{\\text\{comp\}\}\(\\tau\), withλ\\lambdarescaling saved\-call credit \(units of “primitive calls”\) against the unit\-interval verifier reward\. Length\-cheating is structurally impossible: padding a primitive contributesΦ=0\\Phi=0, soRcompR\_\{\\text\{comp\}\}rewards only actual composite expansion\.

RcompR\_\{\\text\{comp\}\}depends on𝒢\\mathcal\{G\}only through the scalarflat\(ti\)\\mathrm\{flat\}\(t\_\{i\}\), not directly through edges or depth\. The DAG’s structural content is used elsewhere: retrieval traverses edges to bound per\-call cost,InsertTooluses them for acyclicity and child\-specification validation, and Theorem[4](https://arxiv.org/html/2605.08399#Thmtheorem4)relies on DAG inclusion𝒱k⊆𝒱k\+1\\mathcal\{V\}\_\{k\}\\subseteq\\mathcal\{V\}\_\{k\+1\}\. The reward inherits the DAG’s guarantees becauseflat\\mathrm\{flat\}is a structural invariant of recursive expansion, not a teacher annotation; without the DAG,flat\\mathrm\{flat\}would be unverifiable andRcompR\_\{\\text\{comp\}\}trivially exploitable\.

For eachqqwe run typed\-DAG retrieval and rollout to obtain\{τ\(i\)\}i=1G\\\{\\tau^\{\(i\)\}\\\}\_\{i=1\}^\{G\}scored byRR\. Library writes are gated onRresR\_\{\\text\{res\}\}alone, soRcompR\_\{\\text\{comp\}\}cannot inject unsolved trajectories intoℒ\\mathcal\{L\}; it acts only on the planner\.

#### Coupled Update Rule

At iterationkk, given gap\-signal\-positive rollouts𝒯q\+=\{τ\(i\):Rres\(τ\(i\)\)≥ρ\}\\mathcal\{T\}^\{\+\}\_\{q\}=\\\{\\tau^\{\(i\)\}:R\_\{\\text\{res\}\}\(\\tau^\{\(i\)\}\)\\geq\\rho\\\}, we update\(πθ,ℒ\)\(\\pi\_\{\\theta\},\\mathcal\{L\}\)simultaneously:

πθk\+1=GRPO\(πθk;\{\(τ\(i\),R\(τ\(i\)\)\)\}\),ℒk\+1=Fold\(InsertTool,ℳT\(𝒯q\+\),ℒk\),\\pi\_\{\\theta\_\{k\+1\}\}\\\!=\\\!\\textsc\{GRPO\}\\bigl\(\\pi\_\{\\theta\_\{k\}\};\\\{\(\\tau^\{\(i\)\},R\(\\tau^\{\(i\)\}\)\)\\\}\\bigr\),\\qquad\\mathcal\{L\}\_\{k\+1\}\\\!=\\\!\\textsc\{Fold\}\\bigl\(\\textsc\{InsertTool\},\\,\\mathcal\{M\}\_\{T\}\(\\mathcal\{T\}^\{\+\}\_\{q\}\),\\,\\mathcal\{L\}\_\{k\}\\bigr\),\(3\)where the planner update is a clipped GRPO step onR=Rres\+λRcompR=R\_\{\\text\{res\}\}\+\\lambda R\_\{\\text\{comp\}\}, andFoldappliesInsertToolto each candidatet⋆∈ℳT\(𝒯q\+\)t^\{\\star\}\\in\\mathcal\{M\}\_\{T\}\(\\mathcal\{T\}^\{\+\}\_\{q\}\)in sequence, performingAdd/Merge/Rejectafter spec and acyclicity validation\. The two updates jointly attack axis \(B\): each write raisesd¯\(𝒱c\)\\bar\{d\}\(\\mathcal\{V\}\_\{c\}\), letting a fixed planner solve the query with fewer calls; each GRPO step shiftsπθ\\pi\_\{\\theta\}toward deeper composites, biasingℳT\(𝒯q\+\)\\mathcal\{M\}\_\{T\}\(\\mathcal\{T\}^\{\+\}\_\{q\}\)toward reusable abstractions and accelerating the next commit\.

Why both updates are needed\.Updating onlyπθ\\pi\_\{\\theta\}leavesd¯\(𝒱c\)\\bar\{d\}\(\\mathcal\{V\}\_\{c\}\)untouched and the planner overfits to a fixed retrieval surface; updating onlyℒ\\mathcal\{L\}leaves the planner unable to invoke new composites\. Because Eq\. \([3](https://arxiv.org/html/2605.08399#S3.E3)\) co\-evolves both on the same gap signal, and\(πθk,ℒk\)\(\\pi\_\{\\theta\_\{k\}\},\\mathcal\{L\}\_\{k\}\)is always a feasible fallback \(clipped GRPO andRejectdefault to identity on failed proposals\), the sequence\{J\(πθk,ℒk\)\}k\\\{J\(\\pi\_\{\\theta\_\{k\}\},\\mathcal\{L\}\_\{k\}\)\\\}\_\{k\}is monotone non\-decreasing under a small enough step size \(Theorem[4](https://arxiv.org/html/2605.08399#Thmtheorem4)\)\.

### 3\.4Theoretical Analysis

Here we analyze our design as an objective\-consistent approximation to Eq\. \([1](https://arxiv.org/html/2605.08399#S3.E1)\)\. We show that Typed DAG Retrieval makes the cost block sublinear,RcompR\_\{\\text\{comp\}\}favors deeper compositions, and the coupled update monotonically improvesJ\(πθ,ℒ\)J\(\\pi\_\{\\theta\},\\mathcal\{L\}\), with the same DAG depth driving both sides\. Due to limited space, full statements and proofs are deferred to Appendix[L](https://arxiv.org/html/2605.08399#A12)\.

###### Theorem 1\(Retrieval Cost Reduction\)\.

LetCflat=∑v∈𝒱∑ℓ=14cℓ\(v\)C\_\{\\text\{flat\}\}=\\sum\_\{v\\in\\mathcal\{V\}\}\\sum\_\{\\ell=1\}^\{4\}c\_\{\\ell\}\(v\)be the expected context cost of the flat baseline, andChier=𝔼\[∑ℓ=14∑v∈𝒮ℓ−1cℓ\(v\)\]C\_\{\\text\{hier\}\}=\\mathbb\{E\}\[\\sum\_\{\\ell=1\}^\{4\}\\sum\_\{v\\in\\mathcal\{S\}\_\{\\ell\-1\}\}c\_\{\\ell\}\(v\)\]that ofTypedDAGRetrieve\(Algorithm[2](https://arxiv.org/html/2605.08399#alg2)\), where𝒮ℓ\\mathcal\{S\}\_\{\\ell\}is the surviving set after stageLℓL\_\{\\ell\}and𝒮0:=𝒱\\mathcal\{S\}\_\{0\}:=\\mathcal\{V\}\. Under Assumption[1](https://arxiv.org/html/2605.08399#Thmassumption1),Chier≤α1α2Cflat\+o\(Cflat\)=o\(Cflat\)C\_\{\\text\{hier\}\}\\leq\\alpha\_\{1\}\\alpha\_\{2\}\\,C\_\{\\text\{flat\}\}\+o\(C\_\{\\text\{flat\}\}\)=o\(C\_\{\\text\{flat\}\}\)asn→∞n\\\!\\to\\\!\\infty\.

###### Corollary 2\(Sublinear Retrieval Time\)\.

Letτℓ\\tau\_\{\\ell\}be the per\-candidate wall\-clock cost atLℓL\_\{\\ell\}withτ1≪τ2,τ3,τ4\\tau\_\{1\}\\ll\\tau\_\{2\},\\tau\_\{3\},\\tau\_\{4\}\. Under Assumption[1](https://arxiv.org/html/2605.08399#Thmassumption1)and the inverted\-index L1 \(Appendix[L](https://arxiv.org/html/2605.08399#A12)\),Thier=𝒪\(τ1log⁡n\+\(τ2\+τ3\+τ4\)k2\)T\_\{\\text\{hier\}\}=\\mathcal\{O\}\(\\tau\_\{1\}\\log n\+\(\\tau\_\{2\}\+\\tau\_\{3\}\+\\tau\_\{4\}\)k\_\{2\}\), sublinear inn=\|𝒱\|n=\|\\mathcal\{V\}\|, vs\.Tflat=Θ\(\(τ2\+τ3\+τ4\)n\)T\_\{\\text\{flat\}\}=\\Theta\(\(\\tau\_\{2\}\+\\tau\_\{3\}\+\\tau\_\{4\}\)n\)\.

###### Theorem 3\(Compositional Advantage under Graph\-Aware GRPO\)\.

Letτp\\tau\_\{p\}invoke only primitives, and letτc\\tau\_\{c\}replace a contiguous length\-mm\(m≥2m\\geq 2\) sub\-trajectory ofτp\\tau\_\{p\}by a compositet⋆∈𝒱ct^\{\\star\}\\in\\mathcal\{V\}\_\{c\}withflat\(t⋆\)≥m\\mathrm\{flat\}\(t^\{\\star\}\)\\geq m\. IfRres\(τp\)=Rres\(τc\)R\_\{\\text\{res\}\}\(\\tau\_\{p\}\)=R\_\{\\text\{res\}\}\(\\tau\_\{c\}\)andλ\>0\\lambda\>0, thenR\(τc\)−R\(τp\)=λΦ\(t⋆\)≥λ\(m−1\)\>0R\(\\tau\_\{c\}\)\-R\(\\tau\_\{p\}\)=\\lambda\\,\\Phi\(t^\{\\star\}\)\\geq\\lambda\(m\-1\)\>0, soτc\\tau\_\{c\}receives strictly larger group\-relative advantage under GRPO\.

###### Theorem 4\(Monotone Co\-Evolution\)\.

Let\(πθk,ℒk\)\(\\pi\_\{\\theta\_\{k\}\},\\mathcal\{L\}\_\{k\}\)be the planner–library pair after thekk\-th update in Stage 3 of Algorithm[1](https://arxiv.org/html/2605.08399#alg1), andJ\(π,ℒ\)=𝔼q,τ\[Rres\+λRcomp\]J\(\\pi,\\mathcal\{L\}\)=\\mathbb\{E\}\_\{q,\\tau\}\[R\_\{\\text\{res\}\}\+\\lambda R\_\{\\text\{comp\}\}\]\. Under Assumptions[2](https://arxiv.org/html/2605.08399#Thmassumption2)–[3](https://arxiv.org/html/2605.08399#Thmassumption3), there existsη¯\>0\\bar\{\\eta\}\>0such that for every GRPO learning rateη∈\(0,η¯\]\\eta\\in\(0,\\bar\{\\eta\}\],J\(πθk,ℒk\)J\(\\pi\_\{\\theta\_\{k\}\},\\mathcal\{L\}\_\{k\}\)is monotonically non\-decreasing inkk\.

###### Corollary 5\(DAG Well\-Formedness and Complexity\)\.

For allk≥0k\\geq 0,𝒢k\\mathcal\{G\}\_\{k\}from Algorithm[3](https://arxiv.org/html/2605.08399#alg3)is acyclic withd\(v\)=1\+maxu∈Ch\(v\)⁡d\(u\)d\(v\)=1\+\\max\_\{u\\in\\mathrm\{Ch\}\(v\)\}d\(u\)for everyv∈𝒱cv\\in\\mathcal\{V\}\_\{c\}; max depth is𝒪\(log⁡\|𝒱\|\)\\mathcal\{O\}\(\\log\|\\mathcal\{V\}\|\)when fan\-out≥2\\geq 2\.InsertToolcosts𝒪\(\|Ch\(t⋆\)\|\+\|ℰ\|\)\\mathcal\{O\}\(\|\\mathrm\{Ch\}\(t^\{\\star\}\)\|\+\|\\mathcal\{E\}\|\)per insertion plus𝒪\(1\)\\mathcal\{O\}\(1\)amortized indexing\.

## 4Experiments

We evaluateCoCoDAon a cloud VM with44NVIDIA H200 GPUs\. Qwen3\-32B serves as the teacher for tool extraction and solution generation; smaller students are trained via GRPO\. Hyperparameters and reward weights are in the appendix\. Source code:[https://anonymous\.4open\.science/r/CoCoDA\-664C](https://anonymous.4open.science/r/CoCoDA-664C)\.

### 4\.1Experimental Settings

Datasets\.We use six benchmarks across three categories with deterministic code\-based verification: math/logical reasoning \(GSM8K\(Cobbeet al\.,[2021](https://arxiv.org/html/2605.08399#bib.bib20)\), MATH\(Hendryckset al\.,[2021](https://arxiv.org/html/2605.08399#bib.bib21)\)\), tabular analysis \(WikiTableQuestions\(Pasupat and Liang,[2015](https://arxiv.org/html/2605.08399#bib.bib22)\), FinQA\(Chenet al\.,[2021](https://arxiv.org/html/2605.08399#bib.bib23)\)\), and code task \(EvalPlus\(Liuet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib26)\), MBPP\(Austinet al\.,[2021](https://arxiv.org/html/2605.08399#bib.bib28)\)\)\. Statistics are in the appendix\.

Comparison Methods\.We compare against two no\-tool CoT baselines \(Qwen3 student and Qwen3\-32B teacher\(Weiet al\.,[2022](https://arxiv.org/html/2605.08399#bib.bib25)\)\), three fixed\-library tool\-augmented methods \(ReAct\(Yaoet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib2)\), ToRL\(Liet al\.,[2025](https://arxiv.org/html/2605.08399#bib.bib8)\), ReTool\(Fenget al\.,[2025](https://arxiv.org/html/2605.08399#bib.bib15)\)\), and two library\-learning methods \(CREATOR\(Qianet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib11)\), TroVE\(Wanget al\.,[2024b](https://arxiv.org/html/2605.08399#bib.bib12)\)\)\. All baselines share the student backbone and task prompts\. Component\-level ablations are in Section[4\.4](https://arxiv.org/html/2605.08399#S4.SS4)\.

### 4\.2Main Results\.

Table[1](https://arxiv.org/html/2605.08399#S4.T1)reports accuracy on six benchmarks across four Qwen3 student sizes \(0\.60\.6B–88B\), with the Qwen3\-32B teacher as an upper bound\.CoCoDAwins every \(size, benchmark\) cell and beats the strongest tool\-augmented baseline, ReTool, by up to22points\. Gains over vanilla CoT are largest at the smallest scale \(\+10\.7\+10\.7on GSM8K,\+10\.8\+10\.8on MBPP at 0\.6B\), showing that the co\-evolved library transfers teacher competence most effectively when student capacity is scarce\. The teacher gap closes with scale: at 8B,CoCoDA*matches or exceeds*the 32B teacher on math/logic \(GSM8K93\.6793\.67vs\.93\.4093\.40, MATH63\.1863\.18vs\.61\.6261\.62;⋆\\starin Table[1](https://arxiv.org/html/2605.08399#S4.T1)\), recovering teacher\-level performance at roughly a quarter of the parameters\. Comparable margins on symbolic \(GSM8K, MATH\) and retrieval\-heavy \(WTQ, FinQA\) tasks indicate the library generalizes across reasoning styles rather than overfitting one skill\. In contrast, library\-construction baselines without a tight student–teacher loop \(CREATOR, TroVE\)*underperform*plain CoT on math and code, confirming tool creation helps only when the library co\-evolves with the learner\.

Table 1:Main results across three task categories and six benchmarks, reported for four student sizes of the Qwen3 family\. Within each student\-size block,bolddenotes the best result andunderlinethe second\-best\. CoCoDA entries that match or exceed the teacher are additionally marked with⋆\\star\.
### 4\.3Scalability and Efficiency Analysis

#### Library Growth\.

Figure[2\(a\)](https://arxiv.org/html/2605.08399#S4.F2.sf1)tracks library size and mean compositional depth of invoked tools over GRPO training\. Both rise sharply in the first∼\\sim150 steps and saturate, showing thatCoCoDAcommits to a compact set of reusable abstractions rather than growing unboundedly\. Asymptotic size varies with task complexity \(larger for symbolic math than code\), while depth stabilizes near33–44, consistent with Corollary[5](https://arxiv.org/html/2605.08399#Thmtheorem5)\.

#### Context Cost vs\. Library Size: Code Hierarchy vs\. Text Hierarchy\.

TDR’s novelty lies not in hierarchy itself but in exploiting properties unique to executable code\. Fixing the library at the full\-CoCoDAsnapshot, we swap only the retrieval substrate and measure mean prompt\-token cost for a top\-kkset as library size sweeps from5050to1,6001\{,\}600nodes \(Figure[2\(b\)](https://arxiv.org/html/2605.08399#S4.F2.sf2)\), at matched 4B\-student accuracy\.Flat retrievalembeds and ranks all four annotation levels per node \(no hierarchy; typical prior baseline\)\.Text\-hierarchical RAGbuilds a RAPTOR\-style summary tree over L2 descriptions \(generic hierarchy\)\.Typed DAG Retrieval \(ours\)traverses the call\-induced DAG \(Algorithm[2](https://arxiv.org/html/2605.08399#alg2)\)\. At1,6001\{,\}600nodes, flat retrieval costs≈11\.4×\\approx\\\!11\.4\\timesTDR’s tokens and text\-hierarchical RAG≈4\.7×\\approx\\\!4\.7\\times, since its topic\-clustered tree cannot prune by typed signature or contract on abstracted sub\-trajectories\. This gap is the empirical content of the three code\-specificity claims \(typed prune, edge\-guided expansion, behavior\-preserving leaf substitution\) of Section[3\.2](https://arxiv.org/html/2605.08399#S3.SS2)\.

![Refer to caption](https://arxiv.org/html/2605.08399v1/x1.png)\(a\)Library co\-evolution dynamics
![Refer to caption](https://arxiv.org/html/2605.08399v1/x2.png)\(b\)Retrieval context\-cost scaling

Figure 2:Left\.Co\-evolution dynamics of the tool library under GRPO training\.Right\.Mean prompt\-token cost to surface a top\-kktool set as library size grows\.
#### Scaling Analysis\.

Figure[3](https://arxiv.org/html/2605.08399#S4.F3)sweeps library size with the 4B student fixed \(one panel per benchmark\)\. Accuracy rises steeply over the first∼\\sim200 tools and plateaus by∼\\sim400, while latency grows roughly linearly, so400400tools dominate800800on both accuracy and cost\. Sweeping the student with the full library fixed yields monotone GSM8K gains from0\.60\.6B to88B, but the44B→\\to88B gain is only\+1\.0\+1\.0while latency nearly doubles—44B is the accuracy–cost knee\.

![Refer to caption](https://arxiv.org/html/2605.08399v1/x3.png)Figure 3:Accuracy and per\-query latency as a function of library size at a fixed44B student\.

### 4\.4Ablation Studies

We ablateCoCoDA’s three components: the*Compositional Tool DAG*\(CTD, reusable tools with dependency edges\);*Typed DAG Retrieval*\(TDR, typed pruning with edge\-guided expansion and example disambiguation\); and the*graph\-aware reward*\(GAR, a structural credit term for tools on verified DAG paths\)\. Each variant retrains the 4B student under identical hyperparameters, replacing only the ablated component with its non\-structural counterpart \(flat library, dense retrieval, plain reward\); see Table[2](https://arxiv.org/html/2605.08399#S4.T2)\. Removing CTD causes the largest drop \(22–44points everywhere\): without dependency edges the planner re\-derives intermediates and the library collapses to a CREATOR/TroVE\-style flat collection\. Disabling TDR yields a milder drop concentrated on tabular tasks \(WTQ, FinQA\) where the candidate set is largest; this reduces to flat dense retrieval—the configuration to which text\-style hierarchical RAG and similarity\-clustered skill memories collapse once code\-specific structure is removed, so the−1\.79\-1\.79mean loss isolates what*code\-specific*hierarchy buys beyond any text/skill\-hierarchy substrate\. Removing GAR weakens credit assignment, costing11–22points uniformly, most visibly on MATH and MBPP \(longest tool chains\)\. Disabling all three barely beats plain CoT, confirming the components are complementary\.

Table 2:Ablation ofCoCoDA’s three core components on the 4B Qwen3 student\.Math / LogicalTabular AnalysisCode TaskVariantGSM8KMATHWTQFinQAEPMBPPΔ\\Delta\\rowcolorgray\!20CoCoDA \(full\)92\.6459\.3746\.8329\.0666\.2873\.42—w/o CTD \(flat library\)89\.1255\.7443\.5126\.1862\.8369\.47−3\.49\-3\.49w/o TDR \(flat dense retrieval\)91\.0857\.9243\.8626\.6164\.7371\.58−1\.79\-1\.79w/o GAR \(exec\-only reward\)91\.3257\.4845\.2727\.7465\.1671\.82−1\.40\-1\.40w/o CTD \+ TDR88\.4754\.6341\.9225\.3861\.9468\.52−4\.64\-4\.64w/o CTD \+ TDR \+ GAR87\.9654\.2741\.3824\.8361\.2767\.86−5\.18\-5\.18

## 5Conclusion

This work addresses a key limitation of small language models equipped with external code libraries: the library must co\-evolve with the policy without exhausting the planner’s context budget\. We presentedCoCoDA, which resolves both faces through a single*compositional code DAG*that supports graph\-aware GRPO co\-evolution and Typed DAG Retrieval, exploiting properties only executable code admits\. Experiments on GSM8K and MATH show that an 8B student matches or exceeds its 32B teacher, with the largest gains at the smallest student sizes\. Future work will extend CoCoDA to multimodal tool\-use and multi\-agent settings, while improving composite compression for large evolving libraries\.

## References

- J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. Cai, M\. Terry, Q\. Le, and C\. Sutton \(2021\)Program synthesis with large language models\.arXiv preprint arXiv:2108\.07732\.Cited by:[Appendix B](https://arxiv.org/html/2605.08399#A2.p7.1),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p1.1)\.
- Top\-down synthesis for library learning\.InProceedings of the ACM on Programming Languages \(POPL\),Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1)\.
- T\. Cai, X\. Wang, T\. Ma, X\. Chen, and D\. Zhou \(2024\)Large language models as tool makers\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1),[§2\.3](https://arxiv.org/html/2605.08399#S2.SS3.p1.1)\.
- Z\. Chen, W\. Chen, C\. Smiley, S\. Shah, I\. Borova, D\. Langdon, R\. Moussa, M\. Beane, T\. Huang, B\. Routledge, and W\. Y\. Wang \(2021\)FinQA: a dataset of numerical reasoning over financial data\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 3697–3711\.Cited by:[Appendix B](https://arxiv.org/html/2605.08399#A2.p5.1),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[Appendix B](https://arxiv.org/html/2605.08399#A2.p2.1),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p1.1)\.
- DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang,et al\.\(2025\)DeepSeek\-R1: incentivizing reasoning capability in LLMs via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1)\.
- Y\. Du, F\. Wei, and H\. Zhang \(2024\)AnyTool: self\-reflective, hierarchical agents for large\-scale API calls\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1)\.
- K\. Ellis, C\. Wong, M\. Nye, M\. Sablé\-Meyer, L\. Morales, L\. Hewitt, L\. Cary, A\. Solar\-Lezama, and J\. B\. Tenenbaum \(2021\)DreamCoder: bootstrapping inductive program synthesis with wake\-sleep library learning\.InProceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation \(PLDI\),Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1)\.
- J\. Feng, S\. Huang, X\. Qu, G\. Zhang, Y\. Qin, B\. Zhong, C\. Jiang, J\. Chi, and W\. Zhong \(2025\)ReTool: reinforcement learning for strategic tool use in LLMs\.arXiv preprint arXiv:2504\.11536\.Cited by:[Appendix C](https://arxiv.org/html/2605.08399#A3.SS0.SSS0.Px2.p1.1),[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p2.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.16.11.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.23.18.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.30.25.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.9.4.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the MATH dataset\.InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks,Cited by:[Appendix B](https://arxiv.org/html/2605.08399#A2.p3.1),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p1.1)\.
- X\. Li, H\. Zou, and P\. Liu \(2025\)ToRL: scaling tool\-integrated rl\.arXiv preprint arXiv:2503\.23383\.Cited by:[Appendix C](https://arxiv.org/html/2605.08399#A3.SS0.SSS0.Px2.p1.1),[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p2.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.15.10.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.22.17.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.29.24.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.8.3.1)\.
- J\. Liu, Y\. Zhu, K\. Xiao, Q\. Fu, X\. Han, W\. Yang, and D\. Ye \(2024\)RLTF: reinforcement learning from unit test feedback\.Transactions on Machine Learning Research \(TMLR\)\.Cited by:[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1)\.
- J\. Liu, C\. S\. Xia, Y\. Wang, and L\. Zhang \(2023\)Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[Appendix B](https://arxiv.org/html/2605.08399#A2.p6.1),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p1.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\)Self\-refine: iterative refinement with self\-feedback\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1)\.
- C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2024\)MemGPT: towards LLMs as operating systems\.InConference on Language Modeling \(COLM\),Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\)Generative agents: interactive simulacra of human behavior\.InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology \(UIST\),Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1)\.
- P\. Pasupat and P\. Liang \(2015\)Compositional semantic parsing on semi\-structured tables\.InProceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),Beijing, China,pp\. 1470–1480\.Cited by:[Appendix B](https://arxiv.org/html/2605.08399#A2.p4.2),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p1.1)\.
- S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez \(2024\)Gorilla: large language model connected with massive APIs\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1)\.
- C\. Qian, Y\. Dang, J\. Li, W\. Liu, Z\. Xie, Y\. Wang, W\. Chen, C\. Yang, X\. Cong, Z\. Liu, and M\. Sun \(2024\)Experiential co\-learning of software\-developing agents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\),Cited by:[§2\.3](https://arxiv.org/html/2605.08399#S2.SS3.p1.1)\.
- C\. Qian, E\. C\. Acikgoz, Q\. He, H\. Wang, X\. Chen, D\. Hakkani\-Tür, G\. Tur, and H\. Ji \(2025\)ToolRL: reward is all tool learning needs\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1)\.
- C\. Qian, C\. Han, Y\. R\. Fung, Y\. Qin, Z\. Liu, and H\. Ji \(2023\)CREATOR: tool creation for disentangling abstract and concrete reasoning of large language models\.InFindings of the Association for Computational Linguistics: EMNLP,Cited by:[Appendix C](https://arxiv.org/html/2605.08399#A3.SS0.SSS0.Px3.p1.1),[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1),[§2\.3](https://arxiv.org/html/2605.08399#S2.SS3.p1.1),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p2.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.10.5.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.17.12.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.24.19.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.31.26.1)\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, S\. Zhao, L\. Hong, R\. Tian, R\. Xie, J\. Zhou, M\. Gerstein, D\. Li, Z\. Liu, and M\. Sun \(2024\)ToolLLM: facilitating large language models to master 16000\+ real\-world APIs\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1)\.
- P\. Sarthi, S\. Abdullah, A\. Tuli, S\. Khanna, A\. Goldie, and C\. D\. Manning \(2024\)RAPTOR: recursive abstractive processing for tree\-organized retrieval\.International Conference on Learning Representations \(ICLR\)\.Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§L\.5](https://arxiv.org/html/2605.08399#A12.SS5.2.p2.5),[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1)\.
- J\. Singh, R\. Magazine, Y\. Pandya, and A\. Nambi \(2025\)Agentic reasoning and tool integration for LLMs via reinforcement learning\.arXiv preprint arXiv:2505\.01441\.Cited by:[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1)\.
- E\. Stengel\-Eskin, A\. Prasad, and M\. Bansal \(2024\)ReGAL: refactoring programs to discover generalizable abstractions\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1)\.
- Q\. Tang, Z\. Deng, H\. Lin, X\. Han, Q\. Liang, B\. Cao, and L\. Sun \(2023\)ToolAlpaca: generalized tool learning for language models with 3000 simulated cases\.arXiv preprint arXiv:2306\.05301\.Cited by:[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2024a\)Voyager: an open\-ended embodied agent with large language models\.Transactions on Machine Learning Research \(TMLR\)\.Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1),[§2\.3](https://arxiv.org/html/2605.08399#S2.SS3.p1.1)\.
- Y\. Wang, H\. Le, A\. D\. Gotmare, N\. D\. Q\. Bui, J\. Li, and S\. C\. H\. Hoi \(2023\)CodeT5\+: open code large language models for code understanding and generation\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1)\.
- Z\. Wang, D\. Fried, and G\. Neubig \(2024b\)TroVE: inducing verifiable and efficient toolboxes for solving programmatic tasks\.InInternational Conference on Machine Learning \(ICML\),Cited by:[Appendix C](https://arxiv.org/html/2605.08399#A3.SS0.SSS0.Px3.p1.1),[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p2.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.11.6.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.18.13.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.25.20.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.32.27.1)\.
- Z\. Z\. Wang, A\. Asai, X\. V\. Yu, F\. F\. Xu, Y\. Xie, G\. Neubig, and D\. Fried \(2024c\)CodeRAG\-Bench: can retrieval augment code generation?\.InFindings of the Association for Computational Linguistics \(NAACL\),Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[Appendix C](https://arxiv.org/html/2605.08399#A3.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p2.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[Appendix C](https://arxiv.org/html/2605.08399#A3.SS0.SSS0.Px2.p1.1),[§2\.1](https://arxiv.org/html/2605.08399#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.08399#S4.SS1.p2.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.14.9.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.21.16.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.28.23.1),[Table 1](https://arxiv.org/html/2605.08399#S4.T1.4.7.2.1)\.
- L\. Yuan, Y\. Chen, X\. Wang, Y\. R\. Fung, H\. Peng, and H\. Ji \(2024\)CRAFT: customizing llms by creating and retrieving from specialized toolsets\.External Links:2309\.17428,[Link](https://arxiv.org/abs/2309.17428)Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1),[§2\.3](https://arxiv.org/html/2605.08399#S2.SS3.p1.1)\.
- X\. Zhu, Y\. Chen, H\. Tian, C\. Tao, W\. Su, C\. Yang, G\. Huang, B\. Li, L\. Lu, X\. Wang, Y\. Qiao, Z\. Zhang, and J\. Dai \(2023\)Ghost in the Minecraft: generally capable agents for open\-world environments via large language models with text\-based knowledge and memory\.arXiv preprint arXiv:2305\.17144\.Cited by:[§2\.2](https://arxiv.org/html/2605.08399#S2.SS2.p1.1)\.

## Appendix AAlgorithms

This appendix collects the pseudocode for the mainCoCoDAtraining loop \(Algorithm[1](https://arxiv.org/html/2605.08399#alg1)\) and its two subroutines: Typed DAG Retrieval \(Algorithm[2](https://arxiv.org/html/2605.08399#alg2)\) and online tool insertion \(Algorithm[3](https://arxiv.org/html/2605.08399#alg3)\)\.

Algorithm 1CoCoDA: Tool\-Augmented Planning with Online Co\-Evolving Compositional DAG1:Base student

πbase\\pi\_\{\\text\{base\}\}, teacher

ℳT\\mathcal\{M\}\_\{T\}, query stream

𝒬\\mathcal\{Q\}, success threshold

ρ\\rho, reward coefficient

λ\\lambda, group size

GG, epochs

NN
2:Co\-evolved planner

πθ⋆\\pi\_\{\\theta^\{\\star\}\}and library

ℒ⋆=\(𝒱⋆,ℰ⋆,ℐ⋆\)\\mathcal\{L\}^\{\\star\}=\(\\mathcal\{V\}^\{\\star\},\\mathcal\{E\}^\{\\star\},\\mathcal\{I\}^\{\\star\}\)with DAG

𝒢⋆\\mathcal\{G\}^\{\\star\}
3:

⊳\\trianglerightStage 1: Experience\-based Tool Distillation \(warm\-start library\)

4:

𝒯\+←Rollout\(πbase,𝒬seed\)\\mathcal\{T\}^\{\+\}\\\!\\leftarrow\\\!\\textsc\{Rollout\}\(\\pi\_\{\\text\{base\}\},\\mathcal\{Q\}\_\{\\text\{seed\}\}\)⊳\\trianglerightkeep only successful trajectories

5:

𝒱,ℰ,ℐ←∅,∅,∅\\mathcal\{V\},\\mathcal\{E\},\\mathcal\{I\}\\leftarrow\\emptyset,\\emptyset,\\emptyset
6:forall

τ\+∈𝒯\+\\tau^\{\+\}\\in\\mathcal\{T\}^\{\+\}do

7:forall

t⋆∈ℳT\(τ\+\)t^\{\\star\}\\in\\mathcal\{M\}\_\{T\}\(\\tau^\{\+\}\)do

8:InsertTool\(

t⋆,𝒱,ℰ,ℐt^\{\\star\},\\mathcal\{V\},\\mathcal\{E\},\\mathcal\{I\}\)⊳\\trianglerightAlgorithm[3](https://arxiv.org/html/2605.08399#alg3)

9:endfor

10:endfor

11:

ℒ←\(𝒱,ℰ,ℐ\)\\mathcal\{L\}\\leftarrow\(\\mathcal\{V\},\\mathcal\{E\},\\mathcal\{I\}\)
12:

⊳\\trianglerightStage 2: Cold\-start SFT on warm library

13:

𝒟SFT←ℳT\(𝒬seed,ℒ\)\\mathcal\{D\}\_\{\\text\{SFT\}\}\\leftarrow\\mathcal\{M\}\_\{T\}\(\\mathcal\{Q\}\_\{\\text\{seed\}\},\\mathcal\{L\}\)
14:

θ←SFT\(πbase,𝒟SFT\)\\theta\\leftarrow\\textsc\{SFT\}\(\\pi\_\{\\text\{base\}\},\\mathcal\{D\}\_\{\\text\{SFT\}\}\);

πref←πθ\\pi\_\{\\text\{ref\}\}\\leftarrow\\pi\_\{\\theta\}
15:

⊳\\trianglerightStage 3: Online Co\-Evolution \(πθ\\pi\_\{\\theta\}andℒ\\mathcal\{L\}update per query\)

16:forepoch

=1=1to

NNdo

17:foreach incoming query

q∈𝒬q\\in\\mathcal\{Q\}do

18:// \(a\) Retrieve context from the*current*library

19:

𝒞q←TypedDAGRetrieve\(q,ℐ,𝒢\)\\mathcal\{C\}\_\{q\}\\leftarrow\\textsc\{TypedDAGRetrieve\}\(q,\\mathcal\{I\},\\mathcal\{G\}\)⊳\\trianglerightAlgorithm[2](https://arxiv.org/html/2605.08399#alg2)

20:// \(b\) Group rollout under current policy

21:Sample

\{τ\(i\)\}i=1G∼πθ\(⋅∣q,𝒞q\)\\\{\\tau^\{\(i\)\}\\\}\_\{i=1\}^\{G\}\\sim\\pi\_\{\\theta\}\(\\cdot\\mid q,\\mathcal\{C\}\_\{q\}\)
22:

Ri←Rres\(τ\(i\)\)\+λ∑j\(flat\(tj\(i\)\)−1\)R\_\{i\}\\leftarrow R\_\{\\text\{res\}\}\(\\tau^\{\(i\)\}\)\+\\lambda\\sum\_\{j\}\\bigl\(\\mathrm\{flat\}\(t\_\{j\}^\{\(i\)\}\)\-1\\bigr\)
23:// \(c\) Policy update

24:Update

θ\\thetavia GRPO with group\-relative advantages over

\{Ri\}\\\{R\_\{i\}\\\}
25:// \(d\) Library update from this query’s successful rollouts

26:

𝒯q\+←\{τ\(i\):Rres\(τ\(i\)\)≥ρ\}\\mathcal\{T\}^\{\+\}\_\{q\}\\leftarrow\\\{\\tau^\{\(i\)\}\\,:\\,R\_\{\\text\{res\}\}\(\\tau^\{\(i\)\}\)\\geq\\rho\\\}
27:if

𝒯q\+≠∅\\mathcal\{T\}^\{\+\}\_\{q\}\\neq\\emptysetthen

28:forall

τ\+∈𝒯q\+\\tau^\{\+\}\\in\\mathcal\{T\}^\{\+\}\_\{q\}do

29:forall

t⋆∈ℳT\(τ\+\)t^\{\\star\}\\in\\mathcal\{M\}\_\{T\}\(\\tau^\{\+\}\)do

30:InsertTool\(

t⋆,𝒱,ℰ,ℐt^\{\\star\},\\mathcal\{V\},\\mathcal\{E\},\\mathcal\{I\}\)⊳\\trianglerightmerge / add edges / update DAG

31:endfor

32:endfor

33:

ℒ←\(𝒱,ℰ,ℐ\)\\mathcal\{L\}\\leftarrow\(\\mathcal\{V\},\\mathcal\{E\},\\mathcal\{I\}\)⊳\\trianglerightnext query sees the updated library

34:endif

35:endfor

36:endfor

37:return

πθ,ℒ\\pi\_\{\\theta\},\\ \\mathcal\{L\}

Algorithm 2TypedDAGRetrieve: Typed DAG Retrieval over the Compositional Code DAG1:Query

qq, DAG

𝒢=\(𝒱,ℰ\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}\)with per\-node records

\{L1,L2,L3,L4\}\\\{L\_\{1\},L\_\{2\},L\_\{3\},L\_\{4\}\\\}\(signature, description, specification, examples\), planner

πθ\\pi\_\{\\theta\}, shortlist size

k2k\_\{2\}
2:Candidate set

𝒞q\\mathcal\{C\}\_\{q\}annotated with graph\-structural features

3:Decompose

qqinto typed sub\-goals

\{\(Tin\(j\)→Tout\(j\),intent\(j\)\)\}j=1J\\\{\(T\_\{\\text\{in\}\}^\{\(j\)\}\\\!\\to\\\!T\_\{\\text\{out\}\}^\{\(j\)\},\\ \\text\{intent\}^\{\(j\)\}\)\\\}\_\{j=1\}^\{J\}
4:

𝒞q←∅\\mathcal\{C\}\_\{q\}\\leftarrow\\emptyset
5:for

j=1j=1to

JJdo

6:

𝒮1\(j\)←\{v∈𝒱∣L1\(v\)unifies withTin\(j\)→Tout\(j\)\}\\mathcal\{S\}\_\{1\}^\{\(j\)\}\\\!\\leftarrow\\\!\\\{v\\in\\mathcal\{V\}\\mid L\_\{1\}\(v\)\\ \\text\{unifies with\}\\ T\_\{\\text\{in\}\}^\{\(j\)\}\\\!\\to\\\!T\_\{\\text\{out\}\}^\{\(j\)\}\\\}⊳\\trianglerightL1 type filter

7:

𝒮2\(j\)←Top\-k2\{πθ\(intent\(j\),L2\(v\)\):v∈𝒮1\(j\)\}\\mathcal\{S\}\_\{2\}^\{\(j\)\}\\\!\\leftarrow\\\!\\mathrm\{Top\}\\text\{\-\}k\_\{2\}\\bigl\\\{\\pi\_\{\\theta\}\\bigl\(\\text\{intent\}^\{\(j\)\},L\_\{2\}\(v\)\\bigr\):v\\in\\mathcal\{S\}\_\{1\}^\{\(j\)\}\\bigr\\\}⊳\\trianglerightL2 semantic scan

8:

𝒮3\(j\)←\{v∈𝒮2\(j\)∣L3\(v\)satisfies pre/post\-conditions ofintent\(j\)\}\\mathcal\{S\}\_\{3\}^\{\(j\)\}\\\!\\leftarrow\\\!\\\{v\\in\\mathcal\{S\}\_\{2\}^\{\(j\)\}\\mid L\_\{3\}\(v\)\\ \\text\{satisfies pre/post\-conditions of\}\\ \\text\{intent\}^\{\(j\)\}\\\}⊳\\trianglerightL3 constraint check

9:

vj⋆←arg⁡maxv∈𝒮3\(j\)⁡πθ\(intent\(j\),L4\(v\)\)v^\{\\star\}\_\{j\}\\\!\\leftarrow\\\!\\arg\\max\_\{v\\in\\mathcal\{S\}\_\{3\}^\{\(j\)\}\}\\pi\_\{\\theta\}\\bigl\(\\text\{intent\}^\{\(j\)\},L\_\{4\}\(v\)\\bigr\)⊳\\trianglerightL4 example rerank

10:Attach graph features

\(d\(vj⋆\),Ch\(vj⋆\),\|𝒢vj⋆\|\)\\bigl\(d\(v^\{\\star\}\_\{j\}\),\\mathrm\{Ch\}\(v^\{\\star\}\_\{j\}\),\|\\mathcal\{G\}\_\{v^\{\\star\}\_\{j\}\}\|\\bigr\)
11:

𝒞q←𝒞q∪\{vj⋆\}\\mathcal\{C\}\_\{q\}\\leftarrow\\mathcal\{C\}\_\{q\}\\cup\\\{v^\{\\star\}\_\{j\}\\\}
12:endfor

13:return

𝒞q\\mathcal\{C\}\_\{q\}

Algorithm 3InsertTool: Online Insertion of a New Tool into the Compositional DAG1:Candidate tool

t⋆t^\{\\star\}, vertex set

𝒱=𝒱p∪𝒱c\\mathcal\{V\}=\\mathcal\{V\}\_\{p\}\\cup\\mathcal\{V\}\_\{c\}, edge set

ℰ\\mathcal\{E\}, hierarchical index

ℐ\\mathcal\{I\}
2:Updated

\(𝒱,ℰ,ℐ\)\(\\mathcal\{V\},\\mathcal\{E\},\\mathcal\{I\}\)with

t⋆t^\{\\star\}inserted as primitive or composite, or unchanged on rejection

3:Validate

t⋆t^\{\\star\}against

L3,L4L\_\{3\},L\_\{4\}of each

u∈Ch\(t⋆\)u\\in\\mathrm\{Ch\}\(t^\{\\star\}\)
4:ifvalidation failsor

𝒢∪\{t⋆\}\\mathcal\{G\}\\cup\\\{t^\{\\star\}\\\}not acyclicthen

5:return⊳\\trianglerightrejectt⋆t^\{\\star\}

6:endif

7:if

Ch\(t⋆\)=∅\\mathrm\{Ch\}\(t^\{\\star\}\)=\\emptysetthen

8:

𝒱p←𝒱p∪\{t⋆\}\\mathcal\{V\}\_\{p\}\\leftarrow\\mathcal\{V\}\_\{p\}\\cup\\\{t^\{\\star\}\\\};

d\(t⋆\)←0d\(t^\{\\star\}\)\\leftarrow 0⊳\\trianglerightprimitive

9:else

10:

𝒱c←𝒱c∪\{t⋆\}\\mathcal\{V\}\_\{c\}\\leftarrow\\mathcal\{V\}\_\{c\}\\cup\\\{t^\{\\star\}\\\};

d\(t⋆\)←1\+maxu∈Ch\(t⋆\)⁡d\(u\)d\(t^\{\\star\}\)\\leftarrow 1\+\\max\_\{u\\in\\mathrm\{Ch\}\(t^\{\\star\}\)\}d\(u\)⊳\\trianglerightcomposite

11:

ℰ←ℰ∪\{\(t⋆,u\)∣u∈Ch\(t⋆\)\}\\mathcal\{E\}\\leftarrow\\mathcal\{E\}\\cup\\\{\(t^\{\\star\},u\)\\mid u\\in\\mathrm\{Ch\}\(t^\{\\star\}\)\\\}
12:endif

13:Build record

\{L1,L2,L3,L4\}\(t⋆\)\\\{L\_\{1\},L\_\{2\},L\_\{3\},L\_\{4\}\\\}\(t^\{\\star\}\)and insert into

ℐ\\mathcal\{I\}

## Appendix BDatasets

We evaluateCoCoDAon six public benchmarks that span three categories of execution\-verifiable reasoning\. Table[3](https://arxiv.org/html/2605.08399#A3.T3)summarizes the split sizes used in our experiments\.

GSM8K\[Cobbeet al\.,[2021](https://arxiv.org/html/2605.08399#bib.bib20)\]is a dataset of8\.58\.5K grade\-school math word problems that require multi\-step arithmetic reasoning\. Each problem is paired with a natural\-language solution and a final numerical answer, so correctness can be checked by evaluating the Python expression produced by the planner against the reference answer\.

MATH\[Hendryckset al\.,[2021](https://arxiv.org/html/2605.08399#bib.bib21)\]contains12\.512\.5K competition\-level mathematics problems \(algebra, geometry, number theory, precalculus, etc\.\) with step\-by\-step solutions and boxed final answers\. Compared with GSM8K, it requires longer symbolic derivations, making it a natural stress test for compositional tool use\.

WikiTableQuestions \(WTQ\)\[Pasupat and Liang,[2015](https://arxiv.org/html/2605.08399#bib.bib22)\]is a table question\-answering benchmark over2222K questions about2,1082\{,\}108semi\-structured Wikipedia tables\. Answers are cells, aggregations, or comparisons computable by combining SQL\-style table lookups with numerical operators\.

FinQA\[Chenet al\.,[2021](https://arxiv.org/html/2605.08399#bib.bib23)\]is a financial QA dataset of8,2818\{,\}281questions grounded in real earnings reports, where each answer is produced by a short numerical reasoning program over tables and accompanying text\. It emphasizes multi\-step arithmetic over structured financial data\.

EvalPlus\[Liuet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib26)\]is an extension of HumanEval and MBPP that augments each problem with a substantially larger and more adversarial test suite, giving a more faithful estimate of functional correctness than the original pass@1\. We use its HumanEval\+ split for code Task\.

MBPP\[Austinet al\.,[2021](https://arxiv.org/html/2605.08399#bib.bib28)\]\(Mostly Basic Python Problems\) contains974974short Python programming tasks with natural\-language descriptions and three unit tests each\. Correctness is defined by passing all provided test cases\.

All six benchmarks share the property that each task can be verified by executing deterministic code \(Python for math and coding, SQL combined with numerical computation for tabular analysis\), which makes them well\-suited to the execution\-grounded reward signal used in our GRPO training\.

## Appendix CBaseline Methods

We group our baselines into three categories that together cover the main design choices our framework departs from\. All baselines are trained or prompted on the same Qwen3 student backbone and receive identical task prompts\.

#### No\-tool chain\-of\-thought baselines\.

CoT \(student\)\[Weiet al\.,[2022](https://arxiv.org/html/2605.08399#bib.bib25)\]prompts the base Qwen3 student with standard chain\-of\-thought reasoning and no external tool access, establishing the performance attainable from pure parametric reasoning at the student scale\.CoT \(teacher\)applies the same protocol to the Qwen3\-32B teacher and is reported as an upper\-bound reference\.

#### Fixed\-library tool\-augmented methods\.

ReAct\[Yaoet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib2)\]interleaves reasoning traces and tool calls at inference time; the planner chooses from a fixed tool inventory and is not trained via reinforcement learning\.ToRL\[Liet al\.,[2025](https://arxiv.org/html/2605.08399#bib.bib8)\]trains a tool\-integrated reasoning policy with reinforcement learning over a static set of tools, optimizing for end\-to\-end task reward\.ReTool\[Fenget al\.,[2025](https://arxiv.org/html/2605.08399#bib.bib15)\]also trains a tool\-integrated reasoner with GRPO\-style updates and shares our RL backbone, but keeps the tool inventory frozen throughout training, isolating the contribution of library co\-evolution\.

#### Library\-learning methods\.

CREATOR\[Qianet al\.,[2023](https://arxiv.org/html/2605.08399#bib.bib11)\]disentangles tool creation from tool use: a teacher model first synthesizes a flat library of Python utilities offline, and the solver later invokes them, without feedback from solver performance to library construction\.TroVE\[Wanget al\.,[2024b](https://arxiv.org/html/2605.08399#bib.bib12)\]induces a verified function library by abstracting recurring patterns from successful program traces and reuses them as a flat collection, again decoupling library growth from policy optimization\.

Table 3:Statistics of the six evaluation benchmarks\.

## Appendix DPrompts

This appendix lists every prompt used byCoCoDA\. All prompts are issued to either the teacher modelℳT\\mathcal\{M\}\_\{T\}\(Qwen3\-32B, used for tool abstraction, cold\-start SFT generation, and semantic deduplication\) or the student plannerπθ\\pi\_\{\\theta\}\(for task decomposition, retrieval, and tool\-augmented planning\)\. Placeholders in curly braces are substituted at runtime\. We fix temperature0\.20\.2and top\-pp0\.950\.95for all auxiliary \(non\-rollout\) prompts; rollout\-time generation settings are reported in Appendix[E](https://arxiv.org/html/2605.08399#A5)\.

### D\.1Teacher Tool\-Abstraction Prompt

Used in Stage 1 and Stage 3\(d\) of Algorithm[1](https://arxiv.org/html/2605.08399#alg1)to distill a successful trajectoryτ\+\\tau^\{\+\}into a set of candidate compositional tools\{t⋆\}\\\{t^\{\\star\}\\\}whose bodies invoke only tools already in𝒱\\mathcal\{V\}\.

`Teacher Tool\-Abstraction Prompt`

`D\.2 Teacher Cold\-Start SFT Prompt Used once per seed query q∈𝒬seedq\\in\\mathcal\{Q\}\_\{\\text\{seed\}\} to generate demonstrations 𝒟SFT\\mathcal\{D\}\_\{\\text\{SFT\}\} against the warm library ℒ\\mathcal\{L\}\. Teacher Cold\-Start SFT Prompt D\.3 Task\-Decomposition Prompt \(Planner\) Executed by πθ\\pi\_\{\\theta\} at the start of every query to emit a sequence of typed sub\-goals consumed by TypedDAGRetrieve \(Algorithm 2, line 1\)\. Task\-Decomposition Prompt D\.4 L2 Semantic\-Ranker Prompt Single forward pass over all L1\-surviving candidates for a given sub\-goal, returning a top\-k2k\_\{2\} shortlist\. The planner itself plays the role of the ranker, so no separate model is introduced\. L2 Semantic\-Ranker Prompt D\.5 L3 Constraint\-Check Prompt Applied to each candidate that survives the L2 semantic shortlist\. The planner verifies whether the tool’s pre/post\-conditions and complexity specification are compatible with the sub\-goal; only candidates judged compatible proceed to L4 reranking\. L3 Constraint\-Check Prompt D\.6 L4 Example\-Based Reranking Prompt Applied only to candidates that survive the L3 constraint check\. L4 Example\-Based Reranking Prompt D\.7 Planner System Prompt for Tool\-Augmented Rollouts Conditions every GRPO rollout\. The context 𝒞q\\mathcal\{C\}\_\{q\} returned by TypedDAGRetrieve is injected under \{TOOL\_CARDS\}, and each card contains L1, L2, L3, L4, the depth d\(v\)d\(v\), and the direct children Ch\(v\)\\mathrm\{Ch\}\(v\)\. Planner System Prompt for Tool\-Augmented Rollouts D\.8 InsertTool Semantic\-Deduplication Prompt Invoked inside InsertTool \(Algorithm 3\) when a candidate t⋆t^\{\\star\} passes validation\. Returns either a merge decision against an existing tool or a distinctness verdict\. InsertTool Semantic\-Deduplication Prompt Appendix E Hyperparameters Table 4 lists every hyperparameter used in the main experiments\. All student sizes share the same retrieval, reward, and generation configuration; only the optimizer learning rate and the parameter\-efficient fine\-tuning choice vary across student sizes\. The teacher ℳT\\mathcal\{M\}\_\{T\} is frozen throughout\. Table 4: Hyperparameters of CoCoDA\. Values without a size qualifier apply to every student\. “FT” = full fine\-tuning with DeepSpeed ZeRO\-3; “LoRA” = LoRA adapters on every attention and MLP projection\. Group Hyperparameter Value GRPO Group size GG 88 Clip range ϵ\\epsilon 0\.20\.2 KL coefficient 0\.010\.01 \(to πref\\pi\_\{\\text\{ref\}\}\) Rollout batch \(queries / step\) 6464 Online epochs NN 33 Advantage normalization group mean & std Reward Compositional weight λ\\lambda 0\.200\.20 Success threshold ρ\\rho 0\.80\.8 Result reward RresR\_\{\\text\{res\}\} \{0,1\}\\\{0,1\\\} via deterministic verifier Retrieval L2 shortlist size k2k\_\{2\} 3232 L1 type unification structural \(covariant outputs\) L3 specification check LLM\-judged on declared pre/post/complexity \(App\. D\.5\) Cold\-start SFT Seed queries \|𝒬seed\|\|\\mathcal\{Q\}\_\{\\text\{seed\}\}\| 512512 Epochs 33 Batch size 6464 Warmup ratio 0\.030\.03 LR schedule cosine to 0 Optimizer Type AdamW, β1=0\.9\\beta\_\{1\}=0\.9, β2=0\.95\\beta\_\{2\}=0\.95 Weight decay 0\.010\.01 Grad clipping 1\.01\.0 Mixed precision bf16 Per\-size 0\.6B / 1\.7B / 4B FT, lr =2×10−5, 1×10−5, 5×10−6=2\\\!\\times\\\!10^\{\-5\},\\,1\\\!\\times\\\!10^\{\-5\},\\,5\\\!\\times\\\!10^\{\-6\} 8B LoRA \(r=64,α=128r\{=\}64,\\alpha\{=\}128\), lr =1×10−5=1\\\!\\times\\\!10^\{\-5\} Max sequence length 40964096 Rollouts / query G=8G\{=\}8 Generation Student rollout temperature 0\.70\.7 Student top\-pp 0\.950\.95 Student max new tokens 20482048 Teacher temperature / top\-pp 0\.20\.2 / 0\.950\.95 Teacher max new tokens 40964096 Library init Warm library \|𝒱\|\|\\mathcal\{V\}\| after Stage 1 ≈180\\approx 180 Primitive set size 4242 \(shared across domains\) Max DAG depth at init 22 Appendix F Tool Library Data We give concrete examples of the four\-level tool records introduced in Section 3\.2, a snapshot of a small region of the Compositional Tool DAG, and end\-of\-training statistics of the evolved library on each benchmark\. F\.1 Example Primitive Tool Record name : add L1 : add :: \(float, float\) \-\> float ; deps = \[\] L2 : "Return the sum of two real numbers\." ; tags = \[arithmetic, primitive\] L3 : pre : a, b are finite floats post : returns a \+ b complexity : O\(1\) L4 : \[ \(1\.0, 2\.0\) \-\> 3\.0, \(\-1\.5, 2\.5\) \-\> 1\.0, \(0\.0, 0\.0\) \-\> 0\.0 \] body : def add\(a, b\): return a \+ b F\.2 Example Compositional Tool Record The composite quadratic\_expr has depth d=2d=2 and children \{add,mul,pow\_int\}\\\{\\texttt\{add\},\\texttt\{mul\},\\texttt\{pow\\\_int\}\\\}\. name : quadratic\_expr L1 : quadratic\_expr :: \(float, float, float, float\) \-\> float deps = \[add, mul, pow\_int\] L2 : "Evaluate the quadratic a\*xˆ2 \+ b\*x \+ c at x\." tags = \[algebra, polynomial, closed\-form\] L3 : pre : a, b, c, x are finite floats post : returns a\*xˆ2 \+ b\*x \+ c complexity : O\(1\) L4 : \[ \(1, \-3, 2, 2\) \-\> 0\.0, \(2, 0, \-1, 3\) \-\> 17\.0, \(0, 1, 0, 5\) \-\> 5\.0 \] body : def quadratic\_expr\(a, b, c, x\): return add\( add\( mul\(a, pow\_int\(x, 2\)\), mul\(b, x\) \), c \) A second example, extracted on WikiTableQuestions during online co\-evolution, aggregates a numeric column under a row filter: name : sum\_where L1 : sum\_where :: \(DataFrame, str, Predicate\) \-\> float deps = \[filter\_rows, select\_column, sum\_list\] L2 : "Sum a numeric column over rows satisfying a predicate\." tags = \[tabular, aggregation\] L3 : pre : column exists and is numeric; predicate is total on rows post : returns sum of column over filtered rows complexity : O\(n\) L4 : \[ \(sales\_df, "amount", lambda r: r\.region=="EU"\) \-\> 31420\.0, \.\.\. \] body : def sum\_where\(df, col, pred\): return sum\_list\( select\_column\( filter\_rows\(df, pred\), col \) \) F\.3 Compositional Tool DAG Snapshot \(Math Domain\) Figure 4 sketches a region of the DAG after training\. The figure makes visible how higher\-depth composites are layered on top of primitive arithmetic operators, with shared low\-level nodes reused across multiple composites, which is exactly the reuse pattern that drives the compositional advantage established in Theorem 3\. Figure 4: A subgraph of the Compositional Tool DAG on the math domain after training\. Edges point from composites to their direct dependencies\. F\.4 End\-of\-Training Library Statistics Table 5 reports the state of ℒ\\mathcal\{L\} at the end of Stage 3 for each benchmark, broken down by primitives, composites, maximum depth, and mean fan\-out\. Table 5: Evolved library statistics at the end of training \(4B student\)\. \|𝒱p\|\|\\mathcal\{V\}\_\{p\}\| is fixed at initialization; \|𝒱c\|\|\\mathcal\{V\}\_\{c\}\| grows during co\-evolution\. Mean fan\-out is computed over composites only\. Appendix G Dataset Details Table 6 summarises the six benchmarks\. We use the official train/test splits for GSM8K, MATH, WTQ, FinQA, and MBPP, and the full EvalPlus suite \(extended HumanEval with hardened unit tests\) for code generation\. The training split is used both as the seed set 𝒬seed\\mathcal\{Q\}\_\{\\text\{seed\}\} and as the GRPO query stream 𝒬\\mathcal\{Q\}; evaluation is reported on the held\-out test split\. Table 6: Dataset statistics and verifier used for each benchmark\. Appendix H Case Studies H\.1 Cascaded Retrieval on a GSM8K Query Query\. “Lena buys 3 notebooks at $4 each and 2 pens at $1\.50 each\. How much does she spend in total?” Task decomposition\. \[ \{"T\_in":"\(int,float\)", "T\_out":"float", "intent":"cost of notebooks"\}, \{"T\_in":"\(int,float\)", "T\_out":"float", "intent":"cost of pens"\}, \{"T\_in":"\(float,float\)","T\_out":"float","intent":"total spend"\} \] L1 type filter\. 1,8731\{,\}873 library tools →\\to 4141 type\-compatible candidates per sub\-goal\. L2 shortlist \(k2=32k\_\{2\}=32\)\. Top\-ranked candidates include mul, linear\_cost, unit\_price\_times\_qty, add, sum\_list\. L3 constraint check\. Discards sum\_list \(requires a list input\); retains mul, linear\_cost, unit\_price\_times\_qty, add\. L4 rerank\. For the first two sub\-goals, unit\_price\_times\_qty \(depth 11, example \(3,4\.0\)→12\.0\(3,4\.0\)\\to 12\.0\) outranks mul; for the third, add wins\. Planner emits a two\-call trajectory instead of the four\-call primitive\-only trajectory, receiving a higher shaped reward under RcompR\_\{\\text\{comp\}\}\. H\.2 A Tool Created from Success and Later Reused Early in MATH training, the planner solves “find the roots of x2−5x\+6=0x^\{2\}\-5x\+6=0” by invoking sub, mul, pow\_int, sqrt, add, div\. The teacher abstracts the trajectory into a new composite solve\_quadratic \(depth 33\)\. On the next query, “for what kk does x2\+kx\+9=0x^\{2\}\+kx\+9=0 have a double root?”, TypedDAGRetrieve surfaces solve\_quadratic at the L4 stage, and the planner solves the problem in two calls rather than seven\. This illustrates the per\-query co\-evolution claimed in Theorem 4\. H\.3 Rejected Candidates Three representative InsertTool rejections observed during training: Spec failure\. Candidate safe\_div was proposed with postcondition “returns a/ba/b for any bb” but its body dispatches to div, whose precondition requires b≠0b\\neq 0\. Validation against L3\(div\)L\_\{3\}\(\\texttt\{div\}\) fails and the tool is rejected\. Cycle\. On MBPP the teacher proposed a composite memoize\_factorial that, after substitution, invokes fact\_rec which already depends on memoize\_factorial\. The acyclicity check at line 3 of Algorithm 3 rejects the insertion\. Near\-duplicate merge\. A WTQ rollout produced column\_sum\_if, whose L1/L3/L4 agree with the existing sum\_where\. The deduplication prompt \(Appendix D\.8\) returns MERGE; the candidate’s additional L4 examples are appended to sum\_where and no new vertex is created\. Appendix I Compute and Reproducibility All experiments are run on a single node with 4×4\\times NVIDIA H200 80GB GPUs, 2×2\\times AMD EPYC 9654 CPUs, and 1\.5TB RAM\. Rollouts use vLLM \(0\.6\.30\.6\.3\) with tensor\-parallel =4=4; optimization uses PyTorch 2\.52\.5 and DeepSpeed ZeRO\-3 for FT runs and PEFT 0\.130\.13 for the LoRA run\. The RL loop is implemented on top of the verl framework\. Table 7 reports end\-to\-end wall\-clock training time and GPU\-hours per student size\. Latency numbers in Figure 3 are measured with a warm vLLM cache at batch size 11 and averaged over 256256 held\-out queries per benchmark; error bars are one standard deviation over 33 seeds\. Table 7: End\-to\-end training cost of CoCoDA on 4×4\\timesH200\. All main\-table numbers are averaged over 33 seeds \(\{13,42,2026\}\\\{13,42,2026\\\}\) and the standard deviation is at most 0\.40\.4 on GSM8K/MATH/WTQ/FinQA and at most 0\.70\.7 on EvalPlus/MBPP\. Appendix J Ablation Implementation Details Each ablated variant replaces exactly one component with a “natural non\-structural counterpart” and re\-trains the 4B student under identical optimizer, reward, and generation settings \(Appendix E\)\. Concretely: w/o CTD \(flat library\)\. The graph 𝒢\\mathcal\{G\} is discarded\. Every tool, including would\-be composites, is stored as an independent entry; edges and depths are unavailable at retrieval time, and the compositional reward term is set to 0 \(i\.e\., flat\(t\)=1\\mathrm\{flat\}\(t\)=1 and hence Φ\(t\)=0\\Phi\(t\)=0 for all tt\) so that only RresR\_\{\\text\{res\}\} remains as reward\. InsertTool still performs acyclicity\-free validation and deduplication\. w/o TDR \(flat dense retrieval\)\. All four annotation levels are concatenated into a single text blob per node and embedded once with BAAI/bge\-large\-en\-v1\.5\. Retrieval is top\-kk \(k=8k\{=\}8\) cosine similarity to the query embedding, equivalent to what text\-style hierarchical RAG and similarity\-clustered skill memories collapse to when code\-specific structure is removed\. The Compositional Tool DAG and the graph\-aware reward are preserved\. w/o GAR \(exec\-only reward\)\. The reward is reduced to R\(τ\)=Rres\(τ\)R\(\\tau\)=R\_\{\\text\{res\}\}\(\\tau\)\. The DAG and Typed DAG Retrieval are unchanged, but the planner receives no structural pressure\. All other components, including library updates, are unchanged\. w/o CTD \+ TDR\. Flat library and flat dense retrieval; compositional reward set to 0\. w/o CTD \+ TDR \+ GAR\. All three ablations compose; equivalent to a GRPO\-trained ReTool\-style planner over a flat pool of tools collected from the warm\-start phase\. The reward coefficient λ\\lambda is held at its main\-experiment value \(0\.200\.20\) rather than re\-tuned per ablation, so that the ablation isolates the mechanism rather than the coefficient schedule\. A separate re\-tuned version produced accuracies within 0\.30\.3 points of the values reported in Table 2\. Appendix K Sensitivity Sweeps Reward coefficient λ\\lambda\. Table 8 sweeps the compositional weight λ\\lambda on GSM8K with the 4B student\. Accuracy is largely insensitive in the range \[0\.1,0\.3\]\[0\.1,0\.3\] and degrades outside it: too small a λ\\lambda removes structural pressure, while too large a λ\\lambda biases rollouts toward deep composites even when primitives would suffice, raising the rate of Rres=0R\_\{\\text\{res\}\}=0 rollouts\. Table 8: Sensitivity of GSM8K accuracy \(4B student\) to the compositional reward weight λ\\lambda\. Shortlist size k2k\_\{2\}\. Varying k2∈\{8,16,32,64,128\}k\_\{2\}\\in\\\{8,16,32,64,128\\\} at a fixed 4B student changes GSM8K accuracy by at most 0\.60\.6 points; retrieval latency grows nearly linearly in k2k\_\{2\}\. We fix k2=32k\_\{2\}=32 as the knee of the accuracy–cost frontier\. Success threshold ρ\\rho\. Setting ρ\\rho too low \(≤0\.4\\leq 0\.4\) admits noisy compositions and raises the InsertTool rejection rate at later queries; setting ρ=1\.0\\rho=1\.0 starves the library in early training\. The value ρ=0\.8\\rho=0\.8 used in the main experiments balances library growth and purity: on MATH, it yields 364364 retained composites vs\. 419419 at ρ=0\.5\\rho=0\.5 \(of which 112112 are later overwritten by duplicates\) and 207207 at ρ=1\.0\\rho=1\.0\. Appendix L Theoretical Proofs In this section, we state the assumptions underlying our analysis and provide the detailed proofs for the theoretical results stated in Section 3\.4\. L\.1 Assumptions Assumption 1 \(Level Cost Ordering\)\. For every tool v∈𝒱v\\in\\mathcal\{V\}, the per\-tool token costs satisfy c1\(v\)≪c2\(v\)≪c3\(v\)c\_\{1\}\(v\)\\ll c\_\{2\}\(v\)\\ll c\_\{3\}\(v\) and c2\(v\)≪c4\(v\)c\_\{2\}\(v\)\\ll c\_\{4\}\(v\) \(no order is imposed between c3\(v\)c\_\{3\}\(v\) and c4\(v\)c\_\{4\}\(v\)\)\. Moreover, there exist constants 0<α1,α2,α3<10<\\alpha\_\{1\},\\alpha\_\{2\},\\alpha\_\{3\}<1 such that the L1 type filter retains at most an α1\\alpha\_\{1\}\-fraction of 𝒱\\mathcal\{V\}, the L2 semantic scan retains at most an α2\\alpha\_\{2\}\-fraction of its input, and the L3 constraint check retains at most an α3\\alpha\_\{3\}\-fraction of its input\. Assumption 2 \(Bounded Rewards and Verifier\)\. The result reward Rres\(τ\)∈\[0,1\]R\_\{\\text\{res\}\}\(\\tau\)\\in\[0,1\] is determined by a deterministic verifier, and candidate tools abstracted from any successful trajectory τ\+\\tau^\{\+\} that pass InsertTool’s spec and acyclicity checks are preserved in ℒ\\mathcal\{L\} across subsequent updates\. Assumption 3 \(Retrieval Consistency\)\. For every query qq and any two libraries ℒ⊆ℒ′\\mathcal\{L\}\\subseteq\\mathcal\{L\}^\{\\prime\} \(meaning 𝒱⊆𝒱′\\mathcal\{V\}\\subseteq\\mathcal\{V\}^\{\\prime\} and ℰ⊆ℰ′\\mathcal\{E\}\\subseteq\\mathcal\{E\}^\{\\prime\}\), TypedDAGRetrieve satisfies 𝒞q\(ℒ\)⊆𝒞q\(ℒ′\)\\mathcal\{C\}\_\{q\}\(\\mathcal\{L\}\)\\subseteq\\mathcal\{C\}\_\{q\}\(\\mathcal\{L\}^\{\\prime\}\) up to the shortlist cap k2k\_\{2\}: either every tool in 𝒞q\(ℒ\)\\mathcal\{C\}\_\{q\}\(\\mathcal\{L\}\) also appears in 𝒞q\(ℒ′\)\\mathcal\{C\}\_\{q\}\(\\mathcal\{L\}^\{\\prime\}\), or any tool it displaces is itself a composite that dominates it in the L4 reranker score\. Intuitively, new library entries cannot crowd out previously retrieved tools unless they are strictly better substitutes\. L\.2 Proof of Theorem 1 Proof\. Let p1\(v\)=Pr⁡\[v∈𝒮1\]p\_\{1\}\(v\)=\\Pr\[v\\in\\mathcal\{S\}\_\{1\}\], p2\(v\)=Pr⁡\[v∈𝒮2\]p\_\{2\}\(v\)=\\Pr\[v\\in\\mathcal\{S\}\_\{2\}\], and p3\(v\)=Pr⁡\[v∈𝒮3\]p\_\{3\}\(v\)=\\Pr\[v\\in\\mathcal\{S\}\_\{3\}\] be the marginal survival probabilities of tool vv after stages L1, L2, and L3\. Write c¯ℓ=1n∑v∈𝒱cℓ\(v\)\\bar\{c\}\_\{\\ell\}=\\tfrac\{1\}\{n\}\\sum\_\{v\\in\\mathcal\{V\}\}c\_\{\\ell\}\(v\) for the mean level\-ℓ\\ell cost and cℓmax=maxv∈𝒱⁡cℓ\(v\)c\_\{\\ell\}^\{\\max\}=\\max\_\{v\\in\\mathcal\{V\}\}c\_\{\\ell\}\(v\) for the maximum\. By linearity of expectation, Chier=∑v∈𝒱c1\(v\)\+∑v∈𝒱p1\(v\)c2\(v\)\+∑v∈𝒱p2\(v\)c3\(v\)\+∑v∈𝒱p3\(v\)c4\(v\)\.C\_\{\\text\{hier\}\}=\\sum\_\{v\\in\\mathcal\{V\}\}c\_\{1\}\(v\)\+\\sum\_\{v\\in\\mathcal\{V\}\}p\_\{1\}\(v\)\\,c\_\{2\}\(v\)\+\\sum\_\{v\\in\\mathcal\{V\}\}p\_\{2\}\(v\)\\,c\_\{3\}\(v\)\+\\sum\_\{v\\in\\mathcal\{V\}\}p\_\{3\}\(v\)\\,c\_\{4\}\(v\)\. \(4\) Assumption 1 states ∑vp1\(v\)=𝔼\[\|𝒮1\|\]≤α1n\\sum\_\{v\}p\_\{1\}\(v\)=\\mathbb\{E\}\[\|\\mathcal\{S\}\_\{1\}\|\]\\leq\\alpha\_\{1\}n, ∑vp2\(v\)=𝔼\[\|𝒮2\|\]≤α2𝔼\[\|𝒮1\|\]≤α1α2n\\sum\_\{v\}p\_\{2\}\(v\)=\\mathbb\{E\}\[\|\\mathcal\{S\}\_\{2\}\|\]\\leq\\alpha\_\{2\}\\mathbb\{E\}\[\|\\mathcal\{S\}\_\{1\}\|\]\\leq\\alpha\_\{1\}\\alpha\_\{2\}n, and ∑vp3\(v\)=𝔼\[\|𝒮3\|\]≤α3𝔼\[\|𝒮2\|\]≤α1α2α3n\\sum\_\{v\}p\_\{3\}\(v\)=\\mathbb\{E\}\[\|\\mathcal\{S\}\_\{3\}\|\]\\leq\\alpha\_\{3\}\\mathbb\{E\}\[\|\\mathcal\{S\}\_\{2\}\|\]\\leq\\alpha\_\{1\}\\alpha\_\{2\}\\alpha\_\{3\}n\. Applying Hölder’s inequality \(∑vpℓ\(v\)cℓ\(v\)≤cℓmax∑vpℓ\(v\)\\sum\_\{v\}p\_\{\\ell\}\(v\)\\,c\_\{\\ell\}\(v\)\\leq c\_\{\\ell\}^\{\\max\}\\sum\_\{v\}p\_\{\\ell\}\(v\)\) to the last three terms of \(4\) gives Chier≤nc¯1\+α1nc2max\+α1α2nc3max\+α1α2α3nc4max\.C\_\{\\text\{hier\}\}\\leq n\\bar\{c\}\_\{1\}\+\\alpha\_\{1\}n\\,c\_\{2\}^\{\\max\}\+\\alpha\_\{1\}\\alpha\_\{2\}n\\,c\_\{3\}^\{\\max\}\+\\alpha\_\{1\}\\alpha\_\{2\}\\alpha\_\{3\}n\\,c\_\{4\}^\{\\max\}\. \(5\) Since α3≤1\\alpha\_\{3\}\\leq 1 we have α1α2α3nc4max≤α1α2nc4max\\alpha\_\{1\}\\alpha\_\{2\}\\alpha\_\{3\}n\\,c\_\{4\}^\{\\max\}\\leq\\alpha\_\{1\}\\alpha\_\{2\}n\\,c\_\{4\}^\{\\max\}, so Chier≤nc¯1\+α1nc2max\+α1α2n\(c3max\+c4max\)\.C\_\{\\text\{hier\}\}\\leq n\\bar\{c\}\_\{1\}\+\\alpha\_\{1\}n\\,c\_\{2\}^\{\\max\}\+\\alpha\_\{1\}\\alpha\_\{2\}n\\,\(c\_\{3\}^\{\\max\}\+c\_\{4\}^\{\\max\}\)\. \(6\) Rewriting \(6\) by adding and subtracting α1α2n\(c¯1\+c2max\)\\alpha\_\{1\}\\alpha\_\{2\}n\(\\bar\{c\}\_\{1\}\+c\_\{2\}^\{\\max\}\) and using Cflat=n\(c¯1\+c¯2\+c¯3\+c¯4\)C\_\{\\text\{flat\}\}=n\(\\bar\{c\}\_\{1\}\+\\bar\{c\}\_\{2\}\+\\bar\{c\}\_\{3\}\+\\bar\{c\}\_\{4\}\), Chier≤α1α2Cflat\+nc¯1\(1−α1α2\)\+nc2max\(α1−α1α2\)\+α1α2n\[\(c3max\+c4max\)−\(c¯3\+c¯4\)\]\.C\_\{\\text\{hier\}\}\\leq\\alpha\_\{1\}\\alpha\_\{2\}\\,C\_\{\\text\{flat\}\}\+n\\bar\{c\}\_\{1\}\(1\-\\alpha\_\{1\}\\alpha\_\{2\}\)\+n\\,c\_\{2\}^\{\\max\}\(\\alpha\_\{1\}\-\\alpha\_\{1\}\\alpha\_\{2\}\)\+\\alpha\_\{1\}\\alpha\_\{2\}n\\bigl\[\(c\_\{3\}^\{\\max\}\+c\_\{4\}^\{\\max\}\)\-\(\\bar\{c\}\_\{3\}\+\\bar\{c\}\_\{4\}\)\\bigr\]\. \(7\) The bracketed term in \(7\) is non\-negative but, under the mild regularity cℓmax=𝒪\(c¯ℓ\)c\_\{\\ell\}^\{\\max\}=\\mathcal\{O\}\(\\bar\{c\}\_\{\\ell\}\) \(bounded per\-level dispersion\), is of the same order as α1α2Cflat\\alpha\_\{1\}\\alpha\_\{2\}C\_\{\\text\{flat\}\} and thus absorbed into the leading term\. Assumption 1 additionally gives c¯1/\(c¯3\+c¯4\)→0\\bar\{c\}\_\{1\}/\(\\bar\{c\}\_\{3\}\+\\bar\{c\}\_\{4\}\)\\to 0 and c2max/\(c¯3\+c¯4\)→0c\_\{2\}^\{\\max\}/\(\\bar\{c\}\_\{3\}\+\\bar\{c\}\_\{4\}\)\\to 0, so nc¯1,nc2max=o\(n\(c¯3\+c¯4\)\)=o\(Cflat\)n\\bar\{c\}\_\{1\},\\,nc\_\{2\}^\{\\max\}=o\\bigl\(n\(\\bar\{c\}\_\{3\}\+\\bar\{c\}\_\{4\}\)\\bigr\)=o\(C\_\{\\text\{flat\}\}\)\. Substituting into \(7\) yields Chier≤α1α2Cflat\+o\(Cflat\)\.C\_\{\\text\{hier\}\}\\;\\leq\\;\\alpha\_\{1\}\\alpha\_\{2\}\\,C\_\{\\text\{flat\}\}\+o\(C\_\{\\text\{flat\}\}\)\. Since α1α2<1\\alpha\_\{1\}\\alpha\_\{2\}<1 by assumption, we conclude lim supn→∞Chier/Cflat≤α1α2<1\\limsup\_\{n\\to\\infty\}C\_\{\\text\{hier\}\}/C\_\{\\text\{flat\}\}\\leq\\alpha\_\{1\}\\alpha\_\{2\}<1, which is strictly less than unity and in particular Chier=o\(Cflat\)C\_\{\\text\{hier\}\}=o\(C\_\{\\text\{flat\}\}\) when max⁡\(c¯1,c2max\)=o\(c¯3\+c¯4\)\\max\(\\bar\{c\}\_\{1\},c\_\{2\}^\{\\max\}\)=o\(\\bar\{c\}\_\{3\}\+\\bar\{c\}\_\{4\}\)\. ∎ L\.3 Proof of Corollary 2 Proof\. We show that the expected wall\-clock cost of TypedDAGRetrieve grows sublinearly in the library size n=\|𝒱\|n=\|\\mathcal\{V\}\|, in contrast with the flat baseline whose cost is Θ\(n\)\\Theta\(n\)\. Implementation of L1\. The L1 record of each tool vv is a typed signature Tin\(v\)→Tout\(v\)T\_\{\\text\{in\}\}\(v\)\\\!\\to\\\!T\_\{\\text\{out\}\}\(v\) drawn from a finite type alphabet 𝒯\\mathcal\{T\}\. We maintain an inverted index ℋ:𝒯×𝒯→2𝒱\\mathcal\{H\}:\\mathcal\{T\}\\\!\\times\\\!\\mathcal\{T\}\\\!\\to\\\!2^\{\\mathcal\{V\}\} that maps each signature pair \(Tin,Tout\)\(T\_\{\\text\{in\}\},T\_\{\\text\{out\}\}\) to the set of tools with that signature; InsertTool updates ℋ\\mathcal\{H\} in 𝒪\(1\)\\mathcal\{O\}\(1\) amortized time per insertion \(Corollary 5\)\. Given a sub\-goal signature \(Tin\(j\)→Tout\(j\)\)\(T\_\{\\text\{in\}\}^\{\(j\)\}\\\!\\to\\\!T\_\{\\text\{out\}\}^\{\(j\)\}\), the L1 candidate set 𝒮1\(j\)\\mathcal\{S\}\_\{1\}^\{\(j\)\} is recovered by hash lookup followed by an optional unification pass over polymorphic signatures organized in a balanced type trie of height 𝒪\(log⁡n\)\\mathcal\{O\}\(\\log n\)\. Hence the per\-query L1 time is 𝒪\(τ1\(log⁡n\+\|𝒮1\(j\)\|\)\)\\mathcal\{O\}\\bigl\(\\tau\_\{1\}\(\\log n\+\|\\mathcal\{S\}\_\{1\}^\{\(j\)\}\|\)\\bigr\) rather than Θ\(τ1n\)\\Theta\(\\tau\_\{1\}n\) as in an exhaustive scan\. Cost decomposition\. Let \|𝒮ℓ\|\|\\mathcal\{S\}\_\{\\ell\}\| denote the size of the surviving candidate set after stage LℓL\_\{\\ell\}\. By Algorithm 2, L2L\_\{2\} is applied to 𝒮1\\mathcal\{S\}\_\{1\}, L3L\_\{3\} to a top\-k2k\_\{2\} shortlist of size at most k2k\_\{2\}, and L4L\_\{4\} to 𝒮3⊆𝒮2\\mathcal\{S\}\_\{3\}\\subseteq\\mathcal\{S\}\_\{2\} with \|𝒮2\|≤k2\|\\mathcal\{S\}\_\{2\}\|\\leq k\_\{2\}\. Summing per\-level costs and taking expectations, Thier≤τ1\(log⁡n\+𝔼\|𝒮1\|\)\+τ2𝔼\|𝒮1\|\+τ3k2\+τ4𝔼\|𝒮3\|\.T\_\{\\text\{hier\}\}\\;\\leq\\;\\tau\_\{1\}\\bigl\(\\log n\+\\mathbb\{E\}\|\\mathcal\{S\}\_\{1\}\|\\bigr\)\+\\tau\_\{2\}\\,\\mathbb\{E\}\|\\mathcal\{S\}\_\{1\}\|\+\\tau\_\{3\}\\,k\_\{2\}\+\\tau\_\{4\}\\,\\mathbb\{E\}\|\\mathcal\{S\}\_\{3\}\|\. \(8\) Bounding the survivor sets\. By Assumption 1, 𝔼\|𝒮1\|≤α1n\\mathbb\{E\}\|\\mathcal\{S\}\_\{1\}\|\\leq\\alpha\_\{1\}n\. The L2 stage truncates to a shortlist of fixed size k2k\_\{2\}, independent of nn, so 𝔼\|𝒮2\|≤min⁡\(k2,α1α2n\)≤k2\\mathbb\{E\}\|\\mathcal\{S\}\_\{2\}\|\\leq\\min\(k\_\{2\},\\alpha\_\{1\}\\alpha\_\{2\}n\)\\leq k\_\{2\} and 𝔼\|𝒮3\|≤α3k2\\mathbb\{E\}\|\\mathcal\{S\}\_\{3\}\|\\leq\\alpha\_\{3\}k\_\{2\}\. Substituting into \(8\), Thier≤τ1log⁡n\+\(τ1\+τ2\)α1n\+\(τ3\+α3τ4\)k2\.T\_\{\\text\{hier\}\}\\;\\leq\\;\\tau\_\{1\}\\log n\+\(\\tau\_\{1\}\+\\tau\_\{2\}\)\\alpha\_\{1\}n\+\(\\tau\_\{3\}\+\\alpha\_\{3\}\\tau\_\{4\}\)\\,k\_\{2\}\. \(9\) Sublinear regime\. Inequality \(9\) already shows that the LLM\-bound stages L2,L3,L4L\_\{2\},L\_\{3\},L\_\{4\} contribute time independent of nn \(since k2k\_\{2\} is a fixed shortlist budget\)\. The only nn\-dependent term is the L1 scan τ1α1n\\tau\_\{1\}\\alpha\_\{1\}n\. Two cases are of interest: 1\. Indexed L1\. When the inverted index ℋ\\mathcal\{H\} is used and the sub\-goal signature is fully concrete, the unification pass touches only the matching bucket and the trie path, giving 𝔼\|𝒮1\|=𝒪\(k2\)\\mathbb\{E\}\|\\mathcal\{S\}\_\{1\}\|=\\mathcal\{O\}\(k\_\{2\}\) and the L1 cost collapses to 𝒪\(τ1log⁡n\)\\mathcal\{O\}\(\\tau\_\{1\}\\log n\)\. Combined with \(9\), Thier=𝒪\(τ1log⁡n\+\(τ2\+τ3\+τ4\)k2\),T\_\{\\text\{hier\}\}\\;=\\;\\mathcal\{O\}\\bigl\(\\tau\_\{1\}\\log n\+\(\\tau\_\{2\}\+\\tau\_\{3\}\+\\tau\_\{4\}\)\\,k\_\{2\}\\bigr\), which is the bound stated in the corollary and is sublinear \(o\(n\)o\(n\)\) in nn\. 2\. Exhaustive L1\. If ℋ\\mathcal\{H\} is not used and L1 is implemented as a flat scan, the L1 term becomes τ1α1n\\tau\_\{1\}\\alpha\_\{1\}n\. Because τ1\\tau\_\{1\} is the cost of a constant\-size symbolic type check \(no LLM call\) while τ2,τ3,τ4\\tau\_\{2\},\\tau\_\{3\},\\tau\_\{4\} are LLM\-call costs, τ1/τℓ→0\\tau\_\{1\}/\\tau\_\{\\ell\}\\to 0 for ℓ≥2\\ell\\geq 2 in the relevant regime\. The flat baseline incurs the LLM cost on every tool, so Tflat=Θ\(\(τ2\+τ3\+τ4\)n\)T\_\{\\text\{flat\}\}=\\Theta\\bigl\(\(\\tau\_\{2\}\+\\tau\_\{3\}\+\\tau\_\{4\}\)\\,n\\bigr\), and the ratio ThierTflat≤τ1α1τ2\+τ3\+τ4\+\(τ3\+α3τ4\)k2\(τ2\+τ3\+τ4\)n→n→∞τ1α1τ2\+τ3\+τ4≪ 1,\\frac\{T\_\{\\text\{hier\}\}\}\{T\_\{\\text\{flat\}\}\}\\;\\leq\\;\\frac\{\\tau\_\{1\}\\alpha\_\{1\}\}\{\\tau\_\{2\}\+\\tau\_\{3\}\+\\tau\_\{4\}\}\+\\frac\{\(\\tau\_\{3\}\+\\alpha\_\{3\}\\tau\_\{4\}\)k\_\{2\}\}\{\(\\tau\_\{2\}\+\\tau\_\{3\}\+\\tau\_\{4\}\)n\}\\;\\xrightarrow\[n\\to\\infty\]\{\}\\;\\frac\{\\tau\_\{1\}\\alpha\_\{1\}\}\{\\tau\_\{2\}\+\\tau\_\{3\}\+\\tau\_\{4\}\}\\;\\ll\\;1, so even without the inverted index the dominant LLM\-bound work is reduced from Θ\(n\)\\Theta\(n\) to 𝒪\(k2\)\\mathcal\{O\}\(k\_\{2\}\), and the residual linear term carries a constant τ1α1\\tau\_\{1\}\\alpha\_\{1\} that is orders of magnitude smaller than the flat constant\. Lower bound for the flat baseline\. For ℳflat\\mathcal\{M\}\_\{\\text\{flat\}\} \(Theorem 1\), the single ranking pass materializes \{L1,L2,L3,L4\}\(v\)\\\{L\_\{1\},L\_\{2\},L\_\{3\},L\_\{4\}\\\}\(v\) for every v∈𝒱v\\in\\mathcal\{V\} and applies the planner to each, so Tflat≥\(τ2\+τ3\+τ4\)n=Θ\(n\)T\_\{\\text\{flat\}\}\\geq\(\\tau\_\{2\}\+\\tau\_\{3\}\+\\tau\_\{4\}\)\\,n=\\Theta\(n\), matching the upper bound\. Combining with case \(i\) yields Thier/Tflat=𝒪\(\(log⁡n\+k2\)/n\)→0T\_\{\\text\{hier\}\}/T\_\{\\text\{flat\}\}=\\mathcal\{O\}\\bigl\(\(\\log n\+k\_\{2\}\)/n\\bigr\)\\to 0, i\.e\. TypedDAGRetrieve achieves strictly sublinear retrieval time while the flat baseline is necessarily linear\. ∎ L\.4 Proof of Theorem 3 Proof\. We compute both sides of the reward gap in closed form and then propagate the inequality through the GRPO advantage and the clipped policy\-gradient update\. Step 1: reward gap\. By construction τp\\tau\_\{p\} invokes only primitives, so Φ\(ti\)=0\\Phi\(t\_\{i\}\)=0 for every ii and Rcomp\(τp\)=∑i=1TpΦ\(ti\)=0R\_\{\\text\{comp\}\}\(\\tau\_\{p\}\)=\\sum\_\{i=1\}^\{T\_\{p\}\}\\Phi\(t\_\{i\}\)=0\. In τc\\tau\_\{c\}, the contiguous block of mm primitives in τp\\tau\_\{p\} is replaced by a single composite call t⋆t^\{\\star\} with saved\-call count Φ\(t⋆\)=flat\(t⋆\)−1\\Phi\(t^\{\\star\}\)=\\mathrm\{flat\}\(t^\{\\star\}\)\-1, while the remaining Tp−mT\_\{p\}\-m calls are the unchanged primitives with Φ=0\\Phi=0\. Hence Rcomp\(τc\)=Φ\(t⋆\)R\_\{\\text\{comp\}\}\(\\tau\_\{c\}\)=\\Phi\(t^\{\\star\}\)\. Subtracting and using R\(τ\)=Rres\(τ\)\+λRcomp\(τ\)R\(\\tau\)=R\_\{\\text\{res\}\}\(\\tau\)\+\\lambda R\_\{\\text\{comp\}\}\(\\tau\) with Rres\(τp\)=Rres\(τc\)R\_\{\\text\{res\}\}\(\\tau\_\{p\}\)=R\_\{\\text\{res\}\}\(\\tau\_\{c\}\), R\(τc\)−R\(τp\)=λΦ\(t⋆\)=λ\[flat\(t⋆\)−1\]\.R\(\\tau\_\{c\}\)\-R\(\\tau\_\{p\}\)\\;=\\;\\lambda\\,\\Phi\(t^\{\\star\}\)\\;=\\;\\lambda\\bigl\[\\mathrm\{flat\}\(t^\{\\star\}\)\-1\\bigr\]\. \(10\) The faithful\-executor lower bound flat\(t⋆\)≥m≥2\\mathrm\{flat\}\(t^\{\\star\}\)\\geq m\\geq 2 \(replacing a length\-mm primitive subroutine cannot reduce the number of primitive invocations actually executed\) gives the strict bound R\(τc\)−R\(τp\)≥λ\(m−1\)\>0R\(\\tau\_\{c\}\)\-R\(\\tau\_\{p\}\)\\geq\\lambda\(m\-1\)\>0\. Equivalently, writing the gap in terms of compression depth, R\(τc\)−R\(τp\)=λ\[Φ\(t⋆\)−\(m−1\)\]\+λ\(m−1\)=λ\[flat\(t⋆\)−m\]\+λ\(m−1\),R\(\\tau\_\{c\}\)\-R\(\\tau\_\{p\}\)\\;=\\;\\lambda\\bigl\[\\Phi\(t^\{\\star\}\)\-\(m\-1\)\\bigr\]\\;\+\\;\\lambda\(m\-1\)\\;=\\;\\lambda\\bigl\[\\mathrm\{flat\}\(t^\{\\star\}\)\-m\\bigr\]\+\\lambda\(m\-1\), \(11\) where the first summand is the strictly positive “internal\-reuse bonus” \(zero exactly when t⋆t^\{\\star\} is a flat renaming of the mm primitives\) and the second is the unconditional saving from collapsing mm tokens into one\. Step 2: group\-relative advantage\. For a group \{τ\(1\),…,τ\(G\)\}\\\{\\tau^\{\(1\)\},\\ldots,\\tau^\{\(G\)\}\\\} sampled from the current policy, GRPO normalizes rewards by the empirical group mean μ\\mu and standard deviation σ\>0\\sigma\>0: A\(τ\(i\)\)=\(R\(τ\(i\)\)−μ\)/σA\(\\tau^\{\(i\)\}\)=\(R\(\\tau^\{\(i\)\}\)\-\\mu\)/\\sigma\. The map R↦\(R−μ\)/σR\\mapsto\(R\-\\mu\)/\\sigma is strictly increasing for any fixed μ\\mu and σ\>0\\sigma\>0, so R\(τc\)\>R\(τp\)R\(\\tau\_\{c\}\)\>R\(\\tau\_\{p\}\) implies A\(τc\)−A\(τp\)=\(R\(τc\)−R\(τp\)\)/σ\>0A\(\\tau\_\{c\}\)\-A\(\\tau\_\{p\}\)=\(R\(\\tau\_\{c\}\)\-R\(\\tau\_\{p\}\)\)/\\sigma\>0\. Step 3: policy gradient\. Let ρθ\(τ\)=πθ\(τ\)/πθold\(τ\)\\rho\_\{\\theta\}\(\\tau\)=\\pi\_\{\\theta\}\(\\tau\)/\\pi\_\{\\theta\_\{\\text\{old\}\}\}\(\\tau\) be the importance ratio\. The clipped GRPO objective is 𝒥\(θ\)=𝔼\[min⁡\(ρθA,clip\(ρθ,1−ϵ,1\+ϵ\)A\)\]\\mathcal\{J\}\(\\theta\)=\\mathbb\{E\}\\bigl\[\\min\\\!\\bigl\(\\rho\_\{\\theta\}A,\\,\\mathrm\{clip\}\(\\rho\_\{\\theta\},1\-\\epsilon,1\+\\epsilon\)\\,A\\bigr\)\\bigr\], whose gradient at θold\\theta\_\{\\text\{old\}\} \(where ρθold=1\\rho\_\{\\theta\_\{\\text\{old\}\}\}\\\!=\\\!1 lies in the interior of \[1−ϵ,1\+ϵ\]\[1\-\\epsilon,1\+\\epsilon\]\) reduces to ∇θ𝒥\(θ\)\|θold=𝔼\[A\(τ\)∇θlog⁡πθ\(τ\)\]\\nabla\_\{\\theta\}\\mathcal\{J\}\(\\theta\)\\big\|\_\{\\theta\_\{\\text\{old\}\}\}=\\mathbb\{E\}\[A\(\\tau\)\\nabla\_\{\\theta\}\\log\\pi\_\{\\theta\}\(\\tau\)\]\. For a single GRPO step with step size η\>0\\eta\>0 along this gradient, a first\-order expansion of log⁡πθ\(τ\)\\log\\pi\_\{\\theta\}\(\\tau\) around θold\\theta\_\{\\text\{old\}\} gives Δθ\[log⁡πθ\(τc\)−log⁡πθ\(τp\)\]=η⟨A\(τc\)g\(τc\)−A\(τp\)g\(τp\),g\(τc\)−g\(τp\)⟩\+O\(η2\),\\Delta\_\{\\theta\}\\bigl\[\\log\\pi\_\{\\theta\}\(\\tau\_\{c\}\)\-\\log\\pi\_\{\\theta\}\(\\tau\_\{p\}\)\\bigr\]=\\eta\\,\\bigl\\langle A\(\\tau\_\{c\}\)g\(\\tau\_\{c\}\)\-A\(\\tau\_\{p\}\)g\(\\tau\_\{p\}\),\\;g\(\\tau\_\{c\}\)\-g\(\\tau\_\{p\}\)\\bigr\\rangle\+O\(\\eta^\{2\}\), where g\(τ\)=∇θlog⁡πθold\(τ\)g\(\\tau\)=\\nabla\_\{\\theta\}\\log\\pi\_\{\\theta\_\{\\text\{old\}\}\}\(\\tau\)\. Taking expectation over groups and using A\(τc\)\>A\(τp\)A\(\\tau\_\{c\}\)\>A\(\\tau\_\{p\}\) with the Fisher\-information positivity 𝔼\[⟨g\(τc\)−g\(τp\),g\(τc\)−g\(τp\)⟩\]≥0\\mathbb\{E\}\\bigl\[\\langle g\(\\tau\_\{c\}\)\-g\(\\tau\_\{p\}\),\\,g\(\\tau\_\{c\}\)\-g\(\\tau\_\{p\}\)\\rangle\\bigr\]\\geq 0, the leading term is non\-negative and is strictly positive whenever τc\\tau\_\{c\} and τp\\tau\_\{p\} are not policy\-equivalent \(i\.e\. g\(τc\)≠g\(τp\)g\(\\tau\_\{c\}\)\\neq g\(\\tau\_\{p\}\)\)\. Thus, for small enough η\\eta, a single GRPO update strictly increases the expected log\-odds of τc\\tau\_\{c\} relative to τp\\tau\_\{p\}, as claimed\. ∎ L\.5 Proof of Theorem 4 Proof\. Fix k≥0k\\geq 0 and telescope the one\-step gap as J\(πθk\+1,ℒk\+1\)−J\(πθk,ℒk\)⏟Δk=J\(πθk\+1,ℒk\)−J\(πθk,ℒk\)⏟Δk\(π\)\+J\(πθk\+1,ℒk\+1\)−J\(πθk\+1,ℒk\)⏟Δk\(ℒ\)\.\\underbrace\{J\(\\pi\_\{\\theta\_\{k\+1\}\},\\mathcal\{L\}\_\{k\+1\}\)\-J\(\\pi\_\{\\theta\_\{k\}\},\\mathcal\{L\}\_\{k\}\)\}\_\{\\Delta\_\{k\}\}=\\underbrace\{J\(\\pi\_\{\\theta\_\{k\+1\}\},\\mathcal\{L\}\_\{k\}\)\-J\(\\pi\_\{\\theta\_\{k\}\},\\mathcal\{L\}\_\{k\}\)\}\_\{\\Delta\_\{k\}^\{\(\\pi\)\}\}\+\\underbrace\{J\(\\pi\_\{\\theta\_\{k\+1\}\},\\mathcal\{L\}\_\{k\+1\}\)\-J\(\\pi\_\{\\theta\_\{k\+1\}\},\\mathcal\{L\}\_\{k\}\)\}\_\{\\Delta\_\{k\}^\{\(\\mathcal\{L\}\)\}\}\. \(12\) We establish Δk\(π\)≥0\\Delta\_\{k\}^\{\(\\pi\)\}\\geq 0 and Δk\(ℒ\)≥0\\Delta\_\{k\}^\{\(\\mathcal\{L\}\)\}\\geq 0 separately\. Policy term\. With the library held fixed at ℒk\\mathcal\{L\}\_\{k\}, the GRPO update on πθ\\pi\_\{\\theta\} is a trust\-region\-style update on the shaped\-reward objective J\(⋅,ℒk\)J\(\\cdot,\\mathcal\{L\}\_\{k\}\)\. By the standard Kakade–Langford monotonic\-improvement lemma applied to the clipped GRPO surrogate \(equivalent to the PPO surrogate with a group\-baseline advantage \[Shao et al\., 2024\]\), we have for any θ′\\theta^\{\\prime\} in the trust region of θold\\theta\_\{\\text\{old\}\}, J\(πθ′,ℒk\)−J\(πθold,ℒk\)≥ℒθoldCLIP\(θ′\)−2ϵγ\(1−γ\)2DKLmax\(πθ′,πθold\),J\(\\pi\_\{\\theta^\{\\prime\}\},\\mathcal\{L\}\_\{k\}\)\-J\(\\pi\_\{\\theta\_\{\\text\{old\}\}\},\\mathcal\{L\}\_\{k\}\)\\geq\\mathcal\{L\}^\{\\text\{CLIP\}\}\_\{\\theta\_\{\\text\{old\}\}\}\(\\theta^\{\\prime\}\)\-\\frac\{2\\epsilon\\,\\gamma\}\{\(1\-\\gamma\)^\{2\}\}\\,D\_\{\\mathrm\{KL\}\}^\{\\max\}\(\\pi\_\{\\theta^\{\\prime\}\},\\pi\_\{\\theta\_\{\\text\{old\}\}\}\), where ℒCLIP\\mathcal\{L\}^\{\\text\{CLIP\}\} is the clipped surrogate, γ\\gamma the effective discount of the trajectory reward, and ϵ\\epsilon is the advantage bound\. The GRPO update selects θk\+1\\theta\_\{k\+1\} as a stochastic ascent step on ℒCLIP\\mathcal\{L\}^\{\\text\{CLIP\}\}; for step sizes η≤η¯1\\eta\\leq\\bar\{\\eta\}\_\{1\} with η¯1=\(1−γ\)22ϵγLKL\\bar\{\\eta\}\_\{1\}=\\tfrac\{\(1\-\\gamma\)^\{2\}\}\{2\\epsilon\\gamma L\_\{\\text\{KL\}\}\} \(where LKLL\_\{\\text\{KL\}\} is the local Lipschitz constant of DKLmaxD\_\{\\mathrm\{KL\}\}^\{\\max\}\), the surrogate gain dominates the KL penalty and Δk\(π\)≥0\\Delta\_\{k\}^\{\(\\pi\)\}\\geq 0 holds in expectation\. This is the standard monotonic\-improvement guarantee; we reproduce it here only to make the step\-size requirement on η¯\\bar\{\\eta\} explicit\. Library term\. Let ℒk\+1=\(𝒱k∪Δ𝒱,ℰk∪Δℰ,ℐk\+1\)\\mathcal\{L\}\_\{k\+1\}=\(\\mathcal\{V\}\_\{k\}\\cup\\Delta\\mathcal\{V\},\\mathcal\{E\}\_\{k\}\\cup\\Delta\\mathcal\{E\},\\mathcal\{I\}\_\{k\+1\}\) where Δ𝒱\\Delta\\mathcal\{V\} is the set of new composites inserted from the kk\-th query’s successful rollouts\. By Assumption 2, 𝒱k⊆𝒱k\+1\\mathcal\{V\}\_\{k\}\\subseteq\\mathcal\{V\}\_\{k\+1\} and every previously admitted tool remains available, so every primitive trajectory feasible under ℒk\\mathcal\{L\}\_\{k\} is also feasible under ℒk\+1\\mathcal\{L\}\_\{k\+1\} with identical RresR\_\{\\text\{res\}\} \(the verifier is deterministic and depends only on the program, not the library\)\. Fix a query qq and let 𝒞q\(k\)=𝒞q\(ℒk\)\\mathcal\{C\}\_\{q\}^\{\(k\)\}\\\!=\\\!\\mathcal\{C\}\_\{q\}\(\\mathcal\{L\}\_\{k\}\), 𝒞q\(k\+1\)=𝒞q\(ℒk\+1\)\\mathcal\{C\}\_\{q\}^\{\(k\+1\)\}\\\!=\\\!\\mathcal\{C\}\_\{q\}\(\\mathcal\{L\}\_\{k\+1\}\)\. We partition trajectories under πθk\+1\(⋅∣q,𝒞q\(k\+1\)\)\\pi\_\{\\theta\_\{k\+1\}\}\(\\cdot\\mid q,\\mathcal\{C\}\_\{q\}^\{\(k\+1\)\}\) into two classes\. \(a\) Legacy trajectories that use only tools in 𝒱k∩𝒞q\(k\+1\)\\mathcal\{V\}\_\{k\}\\cap\\mathcal\{C\}\_\{q\}^\{\(k\+1\)\}: by Assumption 3, the legacy subset of the new context is at least as rich as 𝒞q\(k\)\\mathcal\{C\}\_\{q\}^\{\(k\)\} restricted to retained tools, so the conditional expected shaped return over legacy trajectories is bounded below by 𝔼τ∼πθk\+1\(⋅∣q,𝒞q\(k\)\)\[R\(τ\)\]\\mathbb\{E\}\_\{\\tau\\sim\\pi\_\{\\theta\_\{k\+1\}\}\(\\cdot\\mid q,\\mathcal\{C\}\_\{q\}^\{\(k\)\}\)\}\[R\(\\tau\)\]\. \(b\) Composite\-adopting trajectories that invoke at least one tool t⋆∈Δ𝒱t^\{\\star\}\\\!\\in\\\!\\Delta\\mathcal\{V\}: by construction of InsertTool, each t⋆t^\{\\star\} is validated against the L3 pre/post\-conditions of its children, so any trajectory that invokes t⋆t^\{\\star\} has RresR\_\{\\text\{res\}\} at least as high as the corresponding primitive\-expanded trajectory it replaces; moreover, by Theorem 3 applied to each composite substitution of length m≥2m\\geq 2, the shaped reward satisfies R\(τc\)≥R\(τp\)\+λΦ\(t⋆\)≥R\(τp\)\+λ\(m−1\)\>R\(τp\)R\(\\tau\_\{c\}\)\\geq R\(\\tau\_\{p\}\)\+\\lambda\\,\\Phi\(t^\{\\star\}\)\\geq R\(\\tau\_\{p\}\)\+\\lambda\(m\-1\)\>R\(\\tau\_\{p\}\)\. Taking expectation over the mixing distribution induced by πθk\+1\\pi\_\{\\theta\_\{k\+1\}\} and summing over classes \(a\) and \(b\) yields 𝔼τ∼πθk\+1\(⋅∣q,𝒞q\(k\+1\)\)\[R\(τ\)\]≥𝔼τ∼πθk\+1\(⋅∣q,𝒞q\(k\)\)\[R\(τ\)\]\\mathbb\{E\}\_\{\\tau\\sim\\pi\_\{\\theta\_\{k\+1\}\}\(\\cdot\\mid q,\\mathcal\{C\}\_\{q\}^\{\(k\+1\)\}\)\}\[R\(\\tau\)\]\\geq\\mathbb\{E\}\_\{\\tau\\sim\\pi\_\{\\theta\_\{k\+1\}\}\(\\cdot\\mid q,\\mathcal\{C\}\_\{q\}^\{\(k\)\}\)\}\[R\(\\tau\)\] for every qq\. Integrating over q∼𝒬q\\sim\\mathcal\{Q\} gives Δk\(ℒ\)≥0\\Delta\_\{k\}^\{\(\\mathcal\{L\}\)\}\\geq 0\. Combining the two bounds in \(12\) yields Δk≥0\\Delta\_\{k\}\\geq 0 for every kk, and choosing η¯≤η¯1\\bar\{\\eta\}\\leq\\bar\{\\eta\}\_\{1\} makes this hold for all η∈\(0,η¯\]\\eta\\in\(0,\\bar\{\\eta\}\]\. Induction on kk completes the proof\. ∎ L\.6 Proof of Corollary 5 Proof\. Acyclicity follows by induction on kk\. The base case 𝒢0\\mathcal\{G\}\_\{0\} is empty \(or consists only of primitives with no outgoing edges\) and is trivially acyclic\. Suppose 𝒢k\\mathcal\{G\}\_\{k\} is acyclic; InsertTool accepts a candidate t⋆t^\{\\star\} only if 𝒢k∪\{t⋆,edges\(t⋆,Ch\(t⋆\)\)\}\\mathcal\{G\}\_\{k\}\\cup\\\{t^\{\\star\},\\,\\mathrm\{edges\}\(t^\{\\star\},\\mathrm\{Ch\}\(t^\{\\star\}\)\)\\\} is acyclic \(line 3 of Algorithm 3\), so 𝒢k\+1\\mathcal\{G\}\_\{k\+1\} is acyclic\. The depth recurrence is the definition d\(v\)=1\+maxu∈Ch\(v\)⁡d\(u\)d\(v\)=1\+\\max\_\{u\\in\\mathrm\{Ch\}\(v\)\}d\(u\) for composites \(and d\(v\)=0d\(v\)=0 for primitives\), which the algorithm sets at insertion time and never revises\. For the logarithmic growth, every composite of depth dd has at least two children of depth ≤d−1\\leq d\-1, so by induction the subgraph 𝒢v\\mathcal\{G\}\_\{v\} rooted at vv contains at least 2d\(v\)2^\{d\(v\)\} primitive leaves, giving d\(v\)≤log2⁡\|𝒱p\|≤log2⁡\|𝒱\|d\(v\)\\leq\\log\_\{2\}\|\\mathcal\{V\}\_\{p\}\|\\leq\\log\_\{2\}\|\\mathcal\{V\}\|\. The 𝒪\(\|Ch\(t⋆\)\|\+\|ℰ\|\)\\mathcal\{O\}\(\|\\mathrm\{Ch\}\(t^\{\\star\}\)\|\+\|\\mathcal\{E\}\|\) acyclicity check is a standard DFS reachability test from Ch\(t⋆\)\\mathrm\{Ch\}\(t^\{\\star\}\); the 𝒪\(1\)\\mathcal\{O\}\(1\) amortized index insertion follows from maintaining ℐ\\mathcal\{I\} as a hash\-indexed record\. ∎`
CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents

Similar Articles

AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

Learning Agent-Compatible Context Management for Long-Horizon Tasks

GraphDC: A Divide-and-Conquer Multi-Agent System for Scalable Graph Algorithm Reasoning

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

Tools as Continuous Flow for Evolving Agentic Reasoning

Submit Feedback

Similar Articles

AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
Learning Agent-Compatible Context Management for Long-Horizon Tasks
GraphDC: A Divide-and-Conquer Multi-Agent System for Scalable Graph Algorithm Reasoning
Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents
Tools as Continuous Flow for Evolving Agentic Reasoning