PACE: Two-Timescale Self-Evolution for Small Language Model Agents

arXiv cs.LG 05/25/26, 04:00 AM Papers
Summary
PACE introduces a two-timescale framework for self-evolution of small language model agents, coordinating low-risk prompt refinement with higher-risk control-logic updates, achieving up to +9.2% relative improvement across benchmarks.
arXiv:2605.23019v1 Announce Type: new Abstract: Deploying language-model agents in production often requires substantial compute and human effort to tune prompts, parsers, validators, and other components of the agent pipeline. Self-evolution offers a promising alternative, but most existing frameworks assume access to frontier models that can reliably diagnose failures, propose revisions, and judge their own updates. We study whether frozen small language models (SLMs) can serve as effective self-evolving agents under resource constraints. We propose PACE (Prompt And Control Logic Evolution), a two-timescale framework that coordinates low-risk prompt refinement with higher-risk control-logic updates. PACE evolves prompts under fixed control logic until prompt-level gains saturate, then considers constrained control-logic updates that are accepted through held-out validation. Across three frozen SLM backbones ranging from 4B to 14B parameters and four controlled benchmarks, PACE achieves the best performance on all 12 backbone--benchmark combinations, improving over vanilla SLM agents by up to +9.2% relative improvement and over the stronger single-mode evolution baseline by up to +5.4% relative improvement. A tau-bench case study further shows that PACE improves multi-turn tool-use success over vanilla and prompt-only evolution. These results suggest that reliable SLM agent self-evolution is possible without updating model weights or relying on frontier-model teachers, and that the key benefit is not any single final solver pattern but autonomous, validated discovery of task-appropriate inference strategies.
# PACE: Two-Timescale Self-Evolution for Small Language Model Agents
Source: [https://arxiv.org/html/2605.23019](https://arxiv.org/html/2605.23019)
Chen Ling, Pei Chen, Albert Guan, Jiaming Qu, Shayan Ali Akbar, Madhu Gopinathan, Erwin Cornejo
Amazon, emorycl@amazon.com

###### Abstract

Deploying language-model agents in production often requires substantial compute and human effort to tune prompts, parsers, validators, and other components of the agent pipeline. Self-evolution offers a promising alternative, but most existing frameworks assume access to frontier models that can reliably diagnose failures, propose revisions, and judge their own updates. We study whether frozen small language models (SLMs) can serve as effective self-evolving agents under resource constraints. We propose PACE (Prompt And Control Logic Evolution), a two-timescale framework that coordinates low-risk prompt refinement with higher-risk control-logic updates. PACE evolves prompts under fixed control logic until prompt-level gains saturate, then considers constrained control-logic updates that are accepted through held-out validation. Across three frozen SLM backbones ranging from 4B to 14B parameters and four controlled benchmarks, PACE achieves the best performance on all 12 backbone–benchmark combinations, improving over vanilla SLM agents by up to +9.2% relative improvement and over the stronger single-mode evolution baseline by up to +5.4% relative improvement. A τ-bench case study further shows that PACE improves multi-turn tool-use success over vanilla and prompt-only evolution. These results suggest that reliable SLM agent self-evolution is possible without updating model weights or relying on frontier-model teachers, and that the key benefit is not any single final solver pattern but autonomous, validated discovery of task-appropriate inference strategies.

## 1 Introduction

Language-model-based agents (Wang et al., [2024a](https://arxiv.org/html/2605.23019#bib.bib3)) have become a common abstraction for solving complex tasks through reasoning, tool use, verification, and iterative refinement. However, deploying such agents in production often requires substantial compute and repeated human intervention to tune prompts, parsers, validators, and other components of the agent pipeline. Recent work on agent self-evolution (Agrawal et al., [2025](https://arxiv.org/html/2605.23019#bib.bib31); Opsahl-Ong et al., [2024](https://arxiv.org/html/2605.23019#bib.bib26); Zhang et al., [2025](https://arxiv.org/html/2605.23019#bib.bib4)) offers a promising alternative: agents can improve their own behavior by using execution feedback to revise prompts/task contexts, or modify control logic without changing the underlying model parameters. Most existing approaches, however, assume access to strong frontier models that can reliably diagnose failures, propose high-quality revisions, and judge whether those revisions should be accepted. These assumptions become fragile when the agent is powered by a small language model (SLM).

This paper studies self-evolution for frozen SLM agents. We focus on models with at most 14B parameters, where the model weights remain fixed throughout the evolution process. This setting is important for practical deployments in which local serving, latency, privacy, or cost constraints make frontier-model APIs or large-scale fine-tuning undesirable. It is also technically challenging. SLMs are more sensitive to prompt complexity and often reach diminishing returns after a small number of prompt revisions. At the same time, allowing an SLM to freely rewrite its own executable control logic¹ can be unstable: proposed edits may be syntactically valid but semantically incorrect, causing silent regressions in parsing, validation, retry behavior, or inference-time decision rules.

¹ Non-parameter code or configuration that governs an agent's inference procedure, including output parsing, validation, retry/repair policies, routing or fallback rules, tool-call handling, and decoding settings. It excludes updates to model weights.

![Refer to caption](https://arxiv.org/html/2605.23019v1/x1.png)

Figure 1: PACE evolution dynamics on Qwen3.5-9B. Prompt-only evolution (+PE) improves early but quickly saturates, while control-logic evolution (+CE) is noisy when structural updates are applied from the start. PACE first exploits stable prompt refinement, then introduces a validated control-logic update after prompt gains plateau, producing a performance jump and a higher final accuracy.

A key observation motivating our work is that prompt updates and control-logic updates play different roles in agent improvement. As shown in Figure [1](https://arxiv.org/html/2605.23019#S1.F1), prompt-only evolution improves performance quickly but often saturates once the remaining failures stem from structural bottlenecks, such as brittle output extraction, missing validation, weak repair logic, or ineffective sampling policies. In contrast, enabling control-logic updates from the beginning is substantially less stable, since SLM-proposed structural edits can introduce regressions before the prompt has been sufficiently optimized. These results suggest that reliable SLM self-evolution should not treat prompt refinement and control-logic modification as interchangeable actions: it should first exploit low-risk prompt updates, then invoke higher-risk control-logic updates only after prompt-level gains plateau, with an explicit mechanism for validating proposed changes.

We propose PACE (Prompt And Control Logic Evolution), a two-timescale agentic framework for self-evolving frozen SLM agents. PACE operationalizes self-evolution through a controller that can invoke multiple adaptation tools, such as prompt evolution, failure analysis, control-logic proposal, and control-logic validation. Prompt evolution is treated as a frequently callable, low-risk tool under a fixed agent structure, while control-logic evolution is treated as a higher-risk adaptation action that is considered only once prompt-level gains saturate. To make such updates reliable, PACE separates proposal from acceptance: the SLM may propose constrained changes to safe-to-edit components of the agent pipeline, but a candidate structure is committed only if it improves over the current agent on held-out validation while satisfying the resource budget. PACE is not merely prompt optimization plus code editing; the contribution is instead to show that a frozen SLM can autonomously discover when such strategies are useful, propose task-appropriate solver modifications, and commit them only through empirical validation, without human specification of the strategy–task mapping.

- We introduce PACE, a two-timescale agentic self-evolution framework for frozen SLM agents, where prompt evolution is a frequently callable low-risk adaptation tool and control-logic evolution is a higher-risk adaptation action invoked only after prompt-level gains saturate.
- We propose a prompt-saturation-based credit-assignment mechanism that determines when adaptation should leave prompt space and enter structural search, reducing premature code-level edits while avoiding inefficient prompt refinement after marginal gains vanish.
- We introduce validation-based structural evolution, where SLM-generated control-logic edits are treated as proposals and are committed only if they improve over the current agent under held-out validation and satisfy the resource budget.
- We empirically validate PACE across four benchmarks and three frozen SLM backbones ranging from 4B to 14B parameters. PACE achieves the best performance on all 12 backbone–benchmark combinations, improving over the vanilla SLM baseline by up to 9.2% relative accuracy and over the stronger single-mode evolution baseline by up to 5.4% relative accuracy. Additionally, on τ-bench, PACE improves multi-turn tool-use success over vanilla and prompt-only evolution.

## 2 Related Work

Agent self-evolution (Tao et al., [2024](https://arxiv.org/html/2605.23019#bib.bib33); Fang et al., [2025](https://arxiv.org/html/2605.23019#bib.bib32)) has emerged as a vital direction for improving the capabilities of language-model agents. Existing approaches can be broadly categorized according to the portion of the agent definition they are allowed to modify: 1) prompt-space evolution methods, which operate purely in the textual domain, and 2) self-referential control logic evolution methods, which permit modifications to executable control logic.

Prompt-Space Agent Evolution. Prompt-space evolution improves agent behavior by refining textual artifacts such as system prompts, task instructions, tool descriptions, and output constraints while keeping control logic fixed. These methods use execution feedback, including incorrect outputs, reasoning traces, and tool-use errors, to guide prompt updates (Shinn et al., [2023](https://arxiv.org/html/2605.23019#bib.bib29); Wang et al., [2023](https://arxiv.org/html/2605.23019#bib.bib27)). Since prompt updates remain in natural language, they are stable, sample-efficient, and easy to deploy, especially for frozen or small language models. Recent work (Zhang et al., [2025](https://arxiv.org/html/2605.23019#bib.bib4)) further treats prompt evolution as structured search, using reflection, specialization, and Pareto-aware selection to balance performance and cost (Agrawal et al., [2025](https://arxiv.org/html/2605.23019#bib.bib31)). However, prompt-space methods cannot directly repair structural bottlenecks such as brittle parsing, missing validation, or weak retry logic, and therefore often saturate once such failures dominate.

Control Logic Agent Evolution. In contrast, control logic evolution allows agents to inspect and modify their own executable logic (Wang et al., [2024b](https://arxiv.org/html/2605.23019#bib.bib9)), including control flow, validation routines, and inference-time configurations. These approaches are often motivated by recursive self-improvement, in which both the agent's policy and its update mechanism evolve jointly through runtime introspection and code modification (Schmidhuber, [2003](https://arxiv.org/html/2605.23019#bib.bib10); Yin et al., [2025](https://arxiv.org/html/2605.23019#bib.bib1); Zhou et al., [2025](https://arxiv.org/html/2605.23019#bib.bib24)).

PACE differs from both lines of work in how it coordinates the two adaptation modes under frozen SLM constraints. Prompt-optimization methods generally assume a fixed execution structure, and therefore cannot directly repair structural bottlenecks such as brittle parsing or missing validation. Self-referential code-evolution methods, in contrast, often allow structural changes throughout the search process, which can be unstable when the proposing model is small (Lin et al., [2025](https://arxiv.org/html/2605.23019#bib.bib23); Shao et al., [2025](https://arxiv.org/html/2605.23019#bib.bib22)). PACE addresses this gap by making the transition between prompt evolution and structural evolution explicit: structural search is delayed until prompt gains saturate, and structural updates are accepted only through a validation gate. More broadly, PACE treats prompt and control-logic updates not as independent optimization targets, but as adaptation actions with different risks that must be scheduled and validated under the limited proposal quality of frozen SLMs. Thus, the novelty of PACE lies not in allowing both prompt and control-logic changes, but in assigning them to different timescales and validating the higher-risk structural updates before committing them.

## 3 PACE: Two-Timescale Self-Evolution for Frozen SLM Agents

We consider self\-evolution for agents powered by frozen, resource\-constrained language models\. We first define the agent optimization objective, then introduce PACE as a two\-timescale framework for coordinating prompt refinement and constrained control\-logic updates\.

### 3.1 Problem Definition and Objective

Let $\mathcal{T}$ denote a distribution over tasks, and let $M_{\theta}$ be a pretrained language model with fixed parameters $\theta$. Throughout evolution, $\theta$ remains frozen: the model is not fine-tuned, distilled, or otherwise updated. Adaptation is restricted to the agent definition around the model.

We define an agent as $A=(P,C)$, where $P$ denotes textual artifacts such as system prompts, task instructions, and formatting constraints, and $C$ denotes executable control logic such as parsing routines, validation modules, fallback strategies, and inference-time configurations. Given a task $\tau\sim\mathcal{T}$, the agent produces $y=A(\tau;M_{\theta},P,C)$ and is evaluated by a task-specific utility $U(\tau,y)$. Executing the agent incurs a cost $\mathrm{Cost}(A)$, such as latency, token usage, model calls, or API calls. The goal of self-evolution is to iteratively improve $(P,C)$ under a fixed resource budget $B$:

$$\max_{P,C}\;\;\mathbb{E}_{\tau\sim\mathcal{T}}\left[U(\tau,A(\tau;M_{\theta},P,C))\right]\quad\text{s.t. }\mathrm{Cost}(A)\leq B. \tag{1}$$

Because $M_{\theta}$ remains fixed, all performance gains must arise from changes to $(P,C)$.

![Refer to caption](https://arxiv.org/html/2605.23019v1/x2.png)

Figure 2: Overview of PACE. An agentic controller invokes prompt evolution until gains saturate, then proposes bounded control-logic updates and commits them only after held-out validation.

### 3.2 PACE: A Two-Timescale Agentic Adaptation Framework

Directly optimizing $(P,C)$ in Eq. ([1](https://arxiv.org/html/2605.23019#S3.E1)) is difficult under frozen SLM constraints because prompt edits and control-logic edits fail in different ways. Prompt evolution is inexpensive and stable, but it can saturate once failures stem from structural bottlenecks such as brittle parsing, missing validation, weak retry logic, or ineffective sampling. Control-logic evolution can address these failures by changing how model outputs are sampled, parsed, checked, repaired, or re-executed, but such edits are higher-risk: SLM-proposed changes may be non-executable or semantically incorrect.

PACE therefore treats self-evolution as a two-timescale agentic adaptation process rather than an exact joint optimizer over $(P,C)$. As summarized in Figure [2](https://arxiv.org/html/2605.23019#S3.F2), an agentic controller invokes prompt evolution as the default low-risk tool, uses failure analysis once prompt gains saturate, proposes bounded control-logic updates, and commits a candidate solver only if it improves held-out validation performance under the resource budget. After each accepted update, PACE re-invokes prompt evolution under the new control logic. This cycle coordinates low-risk prompt refinement with higher-risk structural adaptation while reducing premature or harmful control-logic changes.
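The cycle described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `evolve_prompt`, `propose_control_logic`, `evaluate`, and `cost` are hypothetical callables supplied by the caller, and the default `eps` is an assumed value (the defaults for `delta` and `max_steps` follow the 0.02 gate and 20-step cap reported later in the paper).

```python
from typing import Callable, Tuple, TypeVar

P = TypeVar("P")  # prompt artifacts
C = TypeVar("C")  # control logic


def pace_loop(
    prompt: P,
    logic: C,
    evolve_prompt: Callable[[P, C], P],          # hypothetical prompt-evolution tool
    propose_control_logic: Callable[[P, C], C],  # hypothetical proposal tool
    evaluate: Callable[[P, C], float],           # held-out validation utility
    cost: Callable[[P, C], float],               # inference cost of the agent
    budget: float,
    eps: float = 0.0,     # assumed saturation threshold
    delta: float = 0.02,  # validation gate margin
    max_steps: int = 20,
) -> Tuple[P, C, float]:
    score = evaluate(prompt, logic)
    for _ in range(max_steps):
        # Fast timescale: low-risk prompt refinement under fixed control logic.
        cand_prompt = evolve_prompt(prompt, logic)
        cand_score = evaluate(cand_prompt, logic)
        gain = cand_score - score
        if gain > 0:
            prompt, score = cand_prompt, cand_score
        if gain > eps:
            continue  # prompt gains have not saturated yet

        # Slow timescale: a control-logic proposal, committed only if it
        # clears the validation gate and stays within the budget.
        cand_logic = propose_control_logic(prompt, logic)
        cand_score = evaluate(prompt, cand_logic)
        if cand_score > score + delta and cost(prompt, cand_logic) <= budget:
            logic, score = cand_logic, cand_score
    return prompt, logic, score
```

Because the loop returns to prompt evolution at the top of every iteration, an accepted control-logic update is automatically followed by prompt refinement under the new solver.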

Prompt Evolution (PE). For a fixed control logic $C$, PACE uses prompt evolution as the primary low-risk mechanism for improving the current agent. Conceptually, this step searches for a prompt configuration that improves expected task utility under the resource budget:

$$P^{*}(C)=\arg\max_{P}\;\mathbb{E}_{\tau\sim\mathcal{T}}\left[U(\tau,A(\tau;M_{\theta},P,C))\right]\quad\text{s.t. }\mathrm{Cost}(A)\leq B. \tag{2}$$

This prompt-evolution tool operates entirely in textual space (Pryzant et al., [2023](https://arxiv.org/html/2605.23019#bib.bib16); Opsahl-Ong et al., [2024](https://arxiv.org/html/2605.23019#bib.bib26)). At each iteration, prompt candidates are produced through three complementary channels: *(i)* a library of handcrafted mutations that perturb the role description, reasoning directives, or sampling temperature; *(ii)* *reflective* candidates, where the SLM first diagnoses each failure with a one-sentence root-cause analysis (e.g., factual confusion, logical gap, or format error) and a separate proposer call uses these diagnoses to generate targeted prompt revisions; and *(iii)* *crossover* candidates, in which the two most complementary configurations on the current Pareto front are combined by asking the SLM to unify their respective strengths into a single prompt.

Candidates are evaluated on a small sampled training mini-batch and maintained on a Pareto front trading off accuracy against token cost. Rather than always mutating the single highest-accuracy parent, the next parent is sampled from the front with probability proportional to its failure coverage, encouraging exploration of diverse error modes. After all iterations, every front member is re-evaluated on a held-out validation split; the configuration with the best validation accuracy is retained.
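The front maintenance and parent sampling described above might look like the following sketch. The paper does not define "failure coverage" precisely, so the definition here is an assumption: the number of mini-batch examples a candidate solves that at least one other front member fails, with add-one smoothing so every member keeps a non-zero chance. The `Candidate` type and all function names are illustrative.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Candidate:
    prompt: str
    accuracy: float         # mini-batch accuracy (higher is better)
    token_cost: float       # mean tokens per query (lower is better)
    solved: FrozenSet[int]  # ids of mini-batch examples answered correctly


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.token_cost <= b.token_cost
    better = a.accuracy > b.accuracy or a.token_cost < b.token_cost
    return no_worse and better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def failure_coverage(c: Candidate, front: List[Candidate]) -> int:
    """Assumed definition: examples c solves that some other front member fails."""
    others = [o for o in front if o is not c]
    return sum(1 for ex in c.solved if any(ex not in o.solved for o in others))


def sample_parent(front: List[Candidate], rng: random.Random) -> Candidate:
    """Sample the next parent with probability proportional to smoothed coverage."""
    weights = [1 + failure_coverage(c, front) for c in front]
    return rng.choices(front, weights=weights, k=1)[0]
```

Sampling by coverage rather than by accuracy alone lets a cheaper or lower-accuracy prompt that fixes a distinct error mode still be chosen as a parent.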

Because prompt edits are low\-cost, reversible, and stable under limited model capacity, PACE invokes this tool frequently under fixed control logic\. Prompt evolution therefore serves as the default adaptation mode before the controller considers higher\-risk structural changes\.

Constrained Control Logic Evolution (CE). When prompt-level gains saturate, PACE invokes control logic evolution as a higher-risk adaptation tool for modifying the agent's control logic $C$. Conceptually, this step searches over a restricted structural space $\mathcal{C}$:

$$\max_{C\in\mathcal{C}}\;\mathbb{E}_{\tau\sim\mathcal{T}}\left[U(\tau,A(\tau;M_{\theta},P^{*}(C),C))\right],$$

where $\mathcal{C}$ denotes the set of admissible control-logic modifications. In practice, control-logic updates are implemented as bounded edits to the agent's solver function, which specifies how the frozen SLM is prompted, sampled, parsed, verified, or re-executed at inference time. These edits are constrained by a fixed solver interface and resource budget, and typically affect safe-to-edit components such as output parsing and validation, retry mechanisms, inference-time configuration, self-consistency or verification passes, and lightweight routing or fallback strategies.

More specifically, PACE guides solver edits using a compact failure summary over recent errors, including extraction/runtime failures, format or constraint violations, and reasoning/content mistakes. The summary steers the next proposal toward an appropriate control-logic change, such as answer-extraction hardening, validation or retry logic, sampling adjustment, self-consistency, or verification/repair. Each candidate solver edit is treated as a proposal and is committed only if it improves held-out performance under the resource budget $B$. After an accepted update, PACE re-invokes prompt evolution to adapt $P$ to the new solver.
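A compact failure summary of this kind could be built as in the sketch below. The three category labels mirror the error types named above, but the category-to-edit mapping in `EDIT_HINTS` is an illustrative assumption: the paper lists both sides without stating which failure type triggers which edit. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Assumed mapping from failure category to a suggested control-logic edit.
EDIT_HINTS: Dict[str, str] = {
    "extraction_or_runtime": "harden answer extraction",
    "format_or_constraint": "add validation or retry logic",
    "reasoning_or_content": "add self-consistency or a verification/repair pass",
}


def summarize_failures(labels: Iterable[str]) -> Dict[str, Optional[object]]:
    """Count recent failure labels and suggest an edit for the dominant one."""
    counts = Counter(labels)
    unknown = set(counts) - set(EDIT_HINTS)
    if unknown:
        raise ValueError(f"unknown failure labels: {sorted(unknown)}")
    if not counts:
        return {"counts": {}, "dominant": None, "hint": None}
    dominant, _ = counts.most_common(1)[0]
    return {"counts": dict(counts), "dominant": dominant, "hint": EDIT_HINTS[dominant]}
```

The returned `hint` would be placed in the proposer's context; the proposal it produces still has to pass held-out validation before being committed.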

### 3.3 Empirical Findings for Reliable SLM Agent Evolution

Our empirical analysis suggests that reliable self\-evolution under frozen SLMs depends on two practical principles\. First, the agentic controller must correctly assign improvement opportunities between low\-risk prompt refinement and higher\-risk solver modification\. Second, because SLM\-generated control\-logic edits can introduce subtle but severe regressions, candidate solver updates should be validated empirically before being committed\. We discuss both findings below\.

#### 3.3.1 Evolution Credit Assignment

Observation. Under frozen SLMs, prompt evolution often saturates before all error modes are resolved. Further prompt refinement under a fixed control logic $C$ may yield only marginal gains, while admissible solver modifications can still provide non-trivial improvement. The challenge is therefore deciding *which adaptation tool* to invoke when performance stalls: premature control-logic edits are unstable, but excessive prompt updates waste budget once prompt-level gains have saturated.

To describe this tradeoff, we define two conceptual utility gains:

$$\begin{aligned}
\Delta U_{P}(C) &= \max_{P}\;\mathbb{E}_{\tau\sim\mathcal{T}}\left[U(\tau,A(\tau;M_{\theta},P,C))\right]-\mathbb{E}_{\tau\sim\mathcal{T}}\left[U(\tau,A(\tau;M_{\theta},P_{0},C))\right],\\
\Delta U_{C}(P) &= \max_{C'\in\mathcal{C}}\;\mathbb{E}_{\tau\sim\mathcal{T}}\left[U(\tau,A(\tau;M_{\theta},P,C'))\right]-\mathbb{E}_{\tau\sim\mathcal{T}}\left[U(\tau,A(\tau;M_{\theta},P,C))\right],
\end{aligned}$$

where $P_{0}$ is the baseline prompt and $\mathcal{C}$ is the restricted space of admissible solver modifications. These quantities are not solved exactly; they only distinguish improvements achievable through prompt refinement from those requiring changes to the inference procedure.

In practice, PACE uses recent validation improvement from the prompt-evolution tool as a conservative trigger, computed after round $t$ as $\widehat{\Delta U}_{P}^{(t)}=\widehat{U}(A(P_{t},C);\mathcal{V})-\widehat{U}(A(P_{t-1},C);\mathcal{V})$. The controller continues prompt evolution while $\widehat{\Delta U}_{P}^{(t)}>\epsilon$; once $\widehat{\Delta U}_{P}^{(t)}\leq\epsilon$, prompt refinement is treated as saturated, and admissible control-logic proposals are considered.
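As a small worked illustration of this trigger, the function below checks the most recent round-over-round gain against the threshold. It is only a sketch: the function name is hypothetical, and treating a history with fewer than two rounds as "not saturated" is an assumption rather than something the paper specifies.

```python
from typing import Sequence


def prompt_saturated(val_scores: Sequence[float], eps: float) -> bool:
    """Return True once the latest validation gain is at most eps.

    val_scores[t] is the validation utility of the prompt kept after round t,
    all measured under the same fixed control logic.
    """
    if len(val_scores) < 2:
        return False  # assumed: no gain estimate exists before round 1
    return val_scores[-1] - val_scores[-2] <= eps
```

For example, with scores 0.70, 0.75, 0.755 and `eps = 0.01`, the last gain of 0.005 falls below the threshold, so the controller would move on to control-logic proposals.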

#### 3.3.2 Empirical Validation of Solver Updates

Observation. Control-logic edits proposed by SLMs can easily lead to failures. Beyond producing invalid tool calls or non-executable code, SLMs may introduce semantically incorrect solver modifications that preserve syntactic validity but degrade performance. For instance, we observed cases where an SLM rewrote a correct solver into a verification-style procedure that omitted essential task context: the modified solver remained executable, yet caused a clear accuracy regression. Since such edits can pass runtime checks and may appear locally plausible, solver updates should not be accepted based on model self-assessment alone.

PACE addresses this by treating every control-logic edit as a candidate requiring empirical validation. We introduce `action_compare_variants`, which compares two control logic variants on the same validation subset, returning per-variant performance, sample-level differences, and an accept/reject recommendation. Formally, let $A_{\mathrm{old}}$ and $A_{\mathrm{new}}$ denote the current and candidate agents. Given a validation subset $\mathcal{V}\subset\mathcal{T}$, the candidate is accepted only if

$$\widehat{U}(A_{\mathrm{new}};\mathcal{V})>\widehat{U}(A_{\mathrm{old}};\mathcal{V})+\delta\quad\text{and}\quad\mathrm{Cost}(A_{\mathrm{new}})\leq B,$$

where $\widehat{U}(\cdot;\mathcal{V})$ denotes empirical utility on the shared validation subset and $\delta\geq 0$ is the *validation gate threshold*, a minimum improvement margin that filters out noisy or marginal gains. Setting $\delta=0$ accepts any candidate that strictly improves on the current agent, however marginally; larger values require stronger evidence before committing the update. In practice, we set $\delta=0.02$, and each validation comparison samples a fresh subset from the training pool, varying both size and composition across iterations. This prevents the evolution trajectory from overfitting to a fixed held-out slice and ensures that accepted control-logic updates generalize beyond any single validation draw.
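The acceptance rule and the fresh-subset sampling can be sketched as follows. This is not the paper's `action_compare_variants` tool, whose interface is not given; `fresh_subset`, `accept_update`, and their parameters are hypothetical names, and the size bounds are assumptions introduced only to make "varying size" concrete.

```python
import random
from statistics import fmean
from typing import Callable, List, Sequence, TypeVar

Agent = TypeVar("Agent")
Example = TypeVar("Example")


def fresh_subset(
    pool: Sequence[Example], rng: random.Random, min_size: int, max_size: int
) -> List[Example]:
    """Draw a new validation subset whose size and members vary per call."""
    if not 1 <= min_size <= min(max_size, len(pool)):
        raise ValueError("need 1 <= min_size <= min(max_size, len(pool))")
    size = rng.randint(min_size, min(max_size, len(pool)))
    return rng.sample(list(pool), size)


def accept_update(
    old: Agent,
    new: Agent,
    subset: Sequence[Example],
    utility: Callable[[Agent, Example], float],  # per-example utility
    cost: Callable[[Agent], float],
    budget: float,
    delta: float = 0.02,
) -> bool:
    """Accept only if the candidate beats the current agent by more than delta
    on the same subset and stays within the resource budget."""
    u_old = fmean(utility(old, ex) for ex in subset)
    u_new = fmean(utility(new, ex) for ex in subset)
    return u_new > u_old + delta and cost(new) <= budget
```

Both agents are scored on the identical subset, so the comparison is paired; only the subset itself changes from one comparison to the next.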

## 4 Experiment

We evaluate PACE on four static benchmarks using three frozen SLM backbones from 4B to 14B parameters, comparing against prompt-only and control-logic-only evolution while analyzing threshold sensitivity and token overhead. We further evaluate PACE on τ-bench as a realistic multi-turn tool-use case study. Additional results and evolved solver listings are provided in the Appendix.

### 4.1 Setup

In our implementation, $B$ denotes the evolution budget. Each run uses a frozen SLM backbone with at most 14B parameters and is capped at 20 evolution steps. We do not enforce a fixed per-query token budget during evolution; instead, we measure the inference-time cost of the evolved agent, including model calls and generated tokens, and report the resulting accuracy–cost tradeoff (Appendix [A.6](https://arxiv.org/html/2605.23019#A1.SS6)).

Benchmarks and Evaluation Metrics. We conduct controlled comparisons on four static benchmarks: MMLU (Hendrycks et al., [2020](https://arxiv.org/html/2605.23019#bib.bib21)) for knowledge-intensive multiple-choice reasoning, MGSM (Shi et al., [2022](https://arxiv.org/html/2605.23019#bib.bib12)) for multilingual math reasoning, HotpotQA (Yang et al., [2018](https://arxiv.org/html/2605.23019#bib.bib20)) for multi-hop question answering, and IFEval (Zhou et al., [2023](https://arxiv.org/html/2605.23019#bib.bib19)) for verifiable instruction following. We report task-standard accuracy metrics, including letter-match accuracy for MMLU, exact numeric match for MGSM, exact match for HotpotQA, and strict instruction accuracy for IFEval. We further evaluate PACE on τ-bench (Yao et al., [2024](https://arxiv.org/html/2605.23019#bib.bib18)) as a realistic multi-turn tool-use case study involving simulated users, domain policies, and API tools. Full dataset splits and evaluation details are in Appendix [A.1](https://arxiv.org/html/2605.23019#A1.SS1). Note that the utility $U$ is instantiated by each benchmark's primary metric: accuracy for MMLU, exact match for MGSM and HotpotQA, strict accuracy for IFEval, and task success for τ-bench.

SLM backbones. All experiments use frozen SLMs served locally via vLLM (Kwon et al., [2023](https://arxiv.org/html/2605.23019#bib.bib17)). We study three backbone families ranging from 4B to 14B parameters: 1) Qwen/Qwen3-4B-Instruct-2507 (Yang et al., [2025](https://arxiv.org/html/2605.23019#bib.bib2)), 2) Qwen/Qwen3.5-9B, and 3) Ministral-3-14B-Instruct-2512. Each backbone serves both as the inference model inside the solver and as the controller model that proposes control-logic edits. No stronger teacher model is used during evolution.

Table 1: Performance comparison, where CE denotes control logic evolution only, PE denotes prompt evolution only, and +% denotes relative performance gain. Best result per backbone is bolded.

Baselines. We compare PACE against the following: 1. Vanilla SLM: the unmodified solver with a task-specific prompt and default temperature ($T=0.2$). 2. +PE (prompt evolution only): we run prompt updates to convergence, starting from a minimal and unified template; no CE updates are permitted. 3. +CE (control logic evolution only): the agent may modify control logic from the start, without a dedicated prompt optimization phase. 4. GEPA (Agrawal et al., [2025](https://arxiv.org/html/2605.23019#bib.bib31)): declarative prompt compilation pipeline. 5. MIPROv2 (Opsahl-Ong et al., [2024](https://arxiv.org/html/2605.23019#bib.bib26)): multi-stage instruction and demonstration optimizer. 6. Gödel Agent (Yin et al., [2025](https://arxiv.org/html/2605.23019#bib.bib1)): a self-referential agent framework for recursive control-logic updates. 7. ACE (Zhang et al., [2025](https://arxiv.org/html/2605.23019#bib.bib4)): an agent self-evolution method that uses execution feedback to iteratively reflect on successes and failures, curate reusable lessons into a structured context/playbook, and improve future task performance without updating model weights. For a frontier reference, we also report DeepSeek-V3.2-685B results on each benchmark. Note that all experiments are conducted over 10 independent runs with different random seeds, and we report the mean accuracy in the main text. Results with standard deviation are reported in Table [6](https://arxiv.org/html/2605.23019#A1.T6) in the Appendix.

### 4.2 Quantitative Results

Table [1](https://arxiv.org/html/2605.23019#S4.T1) reports performance across three SLM backbones on four controlled benchmarks, with DeepSeek-V3.2 included as a frontier reference. Across all SLM backbones, PACE achieves the best overall performance among the evaluated adaptation methods. The strongest gains are observed on MMLU (+6.1% to +8.6% accuracy) and IFEval (+5.2% to +9.2%), where PACE consistently outperforms both prompt-only (+PE) and control-logic-only (+CE) evolution. Compared with the DeepSeek-V3.2 reference, PACE substantially narrows the vanilla-to-frontier gap. For example, on Qwen3.5-9B, PACE closes 79% of the MMLU gap and 38% of the IFEval gap, while on Ministral-3-14B it closes 50% of the MMLU gap and 67% of the IFEval gap.
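The two kinds of percentage quoted above can be computed as in the sketch below. The paper does not spell out its formula for "gap closed", so the usual reading is assumed here: the fraction of the vanilla-to-frontier difference recovered by the evolved agent. Function names are hypothetical, and the numbers in the usage note are illustrative rather than taken from Table 1.

```python
def relative_gain(baseline: float, evolved: float) -> float:
    """Relative improvement of the evolved agent over a baseline score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (evolved - baseline) / baseline


def gap_closed(vanilla: float, evolved: float, frontier: float) -> float:
    """Assumed definition: share of the vanilla-to-frontier gap recovered."""
    gap = frontier - vanilla
    if gap <= 0:
        raise ValueError("frontier must exceed the vanilla score")
    return (evolved - vanilla) / gap
```

With made-up scores of 0.80 (vanilla), 0.86 (evolved), and 0.90 (frontier), the relative gain is about 7.5% and the gap closed is about 60%. A value above 100% would mean the evolved agent surpasses the frontier reference.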

Comparison to single-axis evolution. Across backbones, +PE and +CE each capture partial gains, while PACE combines and exceeds them. For example, on Qwen3.5-9B MMLU, +PE reaches 0.832 and +CE reaches 0.835, whereas PACE achieves 0.889, approaching the DeepSeek-V3.2 reference score of 0.908. On Qwen3.5-9B IFEval, PACE improves by +9.2% over vanilla, compared with +3.0% for +PE and +4.0% for +CE. These results suggest that prompt and control-logic updates address complementary failure modes. The evolution trajectories in Figure [1](https://arxiv.org/html/2605.23019#S1.F1) further illustrate why both mechanisms are needed: prompt-only evolution saturates, control-logic-first evolution is noisy, and PACE delays structural updates until prompt gains plateau.

Comparison to existing optimizers. Prompt/context optimizers such as MIPROv2, GEPA, and ACE show inconsistent gains and occasionally regress, reflecting their inability to repair structural bottlenecks such as brittle parsing, missing validation, or weak retry logic. Gödel Agent, which permits structural modification, is more competitive but still trails PACE on most benchmarks. We attribute this gap to PACE's saturation-gated transition and validation-gated acceptance, which reduce premature or harmful control-logic edits under SLM constraints.

Benchmark-specific trends. On MGSM and HotpotQA, +CE already provides large gains for the 9B and 14B backbones, and PACE further improves performance, indicating that multi-step reasoning and retrieval-style tasks benefit strongly from control-logic edits such as verification, structured reasoning passes, or retry mechanisms. Notably, PACE with Qwen3.5-9B surpasses the DeepSeek-V3.2 reference on HotpotQA, reaching 0.803 compared with 0.779, while Ministral-3-14B with PACE nearly matches DeepSeek-V3.2 on HotpotQA, reaching 0.774. In contrast, Qwen3-4B shows little improvement on HotpotQA across all methods, suggesting that model capacity can remain the limiting factor even with improved agent design. On MMLU, prompt evolution contributes a larger share of the gain, consistent with the importance of improved elicitation for knowledge-intensive tasks.

Cost analysis. PACE incurs a one-time evolution cost of approximately 2–3M generated tokens for Qwen3.5-9B, depending on the benchmark. At inference time, PE adds little overhead, typically 1.1–1.2× the baseline token cost, while accepted CE increases cost by introducing additional sampling, verification, or repair calls. For example, the evolved MMLU solver uses about 3.6× the baseline per-query tokens, while the evolved IFEval solver uses about 5.1×. These results show that PACE trades additional inference-time compute for accuracy, and that the cost is concentrated in structural updates rather than prompt refinement. Detailed token breakdowns are provided in Appendix [A.6](https://arxiv.org/html/2605.23019#A1.SS6).
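To see how the one-time evolution cost and the per-query multiplier combine, the following sketch computes a total token budget for a deployment. It is a back-of-the-envelope model, not the accounting used in the appendix: the helper names and the baseline tokens-per-query value are hypothetical, while the 3.6× multiplier and the 2–3M evolution budget come from the paragraph above.

```python
def total_token_cost(n_queries: int,
                     baseline_tokens_per_query: float,
                     inference_multiplier: float,
                     evolution_tokens: float) -> float:
    """One-time evolution tokens plus inference tokens for n_queries."""
    per_query = baseline_tokens_per_query * inference_multiplier
    return evolution_tokens + n_queries * per_query


def evolution_share(n_queries: int,
                    baseline_tokens_per_query: float,
                    inference_multiplier: float,
                    evolution_tokens: float) -> float:
    """Fraction of the total budget spent on the one-time evolution phase."""
    total = total_token_cost(n_queries, baseline_tokens_per_query,
                             inference_multiplier, evolution_tokens)
    return evolution_tokens / total if total else 0.0


# Hypothetical deployment: 400 baseline tokens per query (assumed),
# 3.6x multiplier (evolved MMLU solver), 2.5M evolution tokens (mid-range).
for n in (1_000, 10_000, 100_000):
    share = evolution_share(n, 400, 3.6, 2_500_000)
    print(f"{n:>7} queries: evolution is {share:.0%} of total tokens")
```

Under these assumptions the one-time cost dominates only for small query volumes; at scale, the inference multiplier introduced by accepted control-logic edits becomes the main cost driver.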

Table 2: Performance on τ-bench. We leverage Claude Sonnet 4.5 as the stochastic user simulator across the Retail (115 tasks) and Airline (50 tasks) domains. $\text{pass}^k$ denotes the fraction of tasks solved in *all* $k$ independent trials. Best result per backbone is bolded. (Footnote 3: We omit the CE-only baseline because multi-turn dialogue performance is highly sensitive to prompt quality; applying structural evolution from an unoptimized prompt yields unstable trajectories that do not constitute a meaningful ablation.)

τ-bench case study. Table [2](https://arxiv.org/html/2605.23019#S4.T2) evaluates PACE on realistic multi-turn tool-use tasks, where the agent must maintain dialogue state, follow domain policies, and issue valid API calls while interacting with a stochastic user simulator. Prompt-only evolution is unstable in this setting: +PE improves Qwen3-4B Retail but degrades Qwen3.5-9B Retail (0.785 vs. 0.791) and collapses on Qwen3-4B Airline ($\text{pass}^4$: 0.040 vs. 0.120). PACE stabilizes evolution and achieves the best results across all configurations, with gains amplifying at higher $\text{pass}^k$ (+3.5 pp at $\text{pass}^4$ vs. +2.6 pp at $\text{pass}^1$ on Qwen3.5-9B Retail), indicating improved behavioral consistency across repeated user interactions rather than single-trial luck. Notably, Qwen3.5-9B + PACE surpasses both the Sonnet 3.5 and GPT-4o references (Yao et al., [2024](https://arxiv.org/html/2605.23019#bib.bib18)) on Retail and Airline across all $\text{pass}^k$ levels.

### 4.3 Ablation Study and Parameter Sensitivity Analysis

We ablate two core design choices in PACE using Qwen3.5-9B on MMLU and IFEval: (i) the credit-assignment threshold $\varepsilon$ that gates the transition from prompt optimization to structural exploration, and (ii) the validation gate threshold $\delta$ that filters proposed solver edits before they are committed. Table [3](https://arxiv.org/html/2605.23019#S4.T3) reports test accuracy under each setting. For the $\varepsilon$ sweep (left), $\varepsilon=0$ disables control-logic evolution entirely (prompt-only), while $\varepsilon=1.0$ bypasses saturation detection and unlocks structure edits after the first prompt round. For the $\delta$ sweep (right), $\delta=-1$ disables the gate (any candidate is accepted), while $\delta=0.05$ requires the candidate to exceed the current solver by at least 5%.

Table 3: Ablation on the credit-assignment threshold $\varepsilon$ (left) and the control-logic validation gate $\delta$ (right) with Qwen3.5-9B as the base model. Shaded columns mark the proposed defaults ($\varepsilon=0.01$, $\delta=0.02$).

Credit-assignment threshold $\varepsilon$. In Table [3](https://arxiv.org/html/2605.23019#S4.T3), both extremes hurt: $\varepsilon=0$ (prompt-only) never enters structural search, while $\varepsilon=1.0$ unlocks structure edits before the prompt is sufficiently refined. The default $\varepsilon=0.01$ achieves the best accuracy on both benchmarks, improving over the prompt-only setting by $\sim 5\%$ on each. These gains indicate that CE can provide substantial improvements beyond PE alone. The $\varepsilon=1.0$ setting approximates a naïve combination that allows control-logic edits before prompt refinement has saturated. Its lower performance, especially on IFEval, shows that simply combining prompt and control-logic search is insufficient; the transition policy matters.

Validation gate threshold $\delta$. Disabling the gate ($\delta=-1$) degrades IFEval by 3.8%, as harmful edits pass unchecked. On MMLU the effect is smaller, since MMLU's multiple-choice format leaves less room for structurally broken solvers to diverge. Overly strict gating ($\delta=0.05$) also underperforms the default, as it blocks beneficial edits that show moderate but genuine improvement on the small validation subset. The default $\delta=0.02$ strikes the best balance, filtering noisy candidates while admitting meaningful structural gains.
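The two thresholds can be read as a pair of simple predicates. The sketch below is one plausible reading of the ablation settings, not the authors' implementation: it assumes the prompt-level gain is non-negative, that the gate compares absolute accuracy differences, and that the comparison against $\delta$ is non-strict. The names `GateConfig`, `structure_unlocked`, and `accept_edit` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateConfig:
    epsilon: float = 0.01  # prompt-saturation threshold (reported default)
    delta: float = 0.02    # validation-gate threshold (reported default)


def structure_unlocked(prompt_gain: float, cfg: GateConfig) -> bool:
    """Control-logic edits unlock once the recent prompt gain drops below epsilon.

    Assumes prompt_gain >= 0, so epsilon = 0 never unlocks (prompt-only) and
    epsilon = 1.0 unlocks after the first prompt round.
    """
    return prompt_gain < cfg.epsilon


def accept_edit(candidate_acc: float, current_acc: float, cfg: GateConfig) -> bool:
    """Commit a candidate only if it beats the current solver by at least delta.

    Assumes an absolute accuracy difference measured on held-out validation.
    """
    return candidate_acc - current_acc >= cfg.delta


cfg = GateConfig()
print(structure_unlocked(0.004, cfg))   # prompt gains have plateaued
print(accept_edit(0.85, 0.80, cfg))     # clear improvement passes the gate
print(accept_edit(0.60, 0.80, GateConfig(delta=-1.0)))  # gate disabled
```

With `delta=-1.0` every executable candidate is accepted, which matches the "gate disabled" column of the ablation; larger values of `delta` trade missed improvements for protection against validation noise.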

### 4.4 Failure Mode Shift Across Evolution Phases

The two\-timescale design rests on the premise that prompt and structural interventions resolve*different*failure modes\. To validate this, we classify every failure on a 20\-sample mini\-batch into three categories—*extraction/runtime*\(malformed output, parse errors\),*format/constraint*\(valid output violating task requirements\), and*reasoning/content*\(well\-formed but incorrect\)—and track how the distribution shifts from Vanilla through \+PE to \+PACE\.

![Refer to caption](https://arxiv.org/html/2605.23019v1/x3.png)

Figure 3: Failure mode distribution per 20-sample mini-batch across evolution phases on all benchmarks with Qwen3.5-9B. PE resolves extraction and format errors; residual failures after +PE are dominated by reasoning or hard constraint violations, which CE then targets.

Fig. [3](https://arxiv.org/html/2605.23019#S4.F3) further shows that PE reduces extraction/format errors, while the residual failures after +PE are increasingly dominated by reasoning/content or hard-constraint errors. This shift supports the saturation trigger: once prompt-level gains plateau, the remaining errors are more likely to require changes to the solver's inference procedure, such as voting, verification, or task-specific extraction.
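The three-way taxonomy can be tallied with a small helper. The sketch below assumes each failed sample has already been annotated with two booleans (whether the output parsed, and whether it satisfied the task's format constraints); how those flags are obtained is task-specific and not specified here. The `FailureMode` enum and both functions are hypothetical names introduced for illustration.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    EXTRACTION_RUNTIME = "extraction/runtime"
    FORMAT_CONSTRAINT = "format/constraint"
    REASONING_CONTENT = "reasoning/content"


def classify(parsed_ok: bool, constraints_ok: bool) -> FailureMode:
    """Assign a failed sample to the first category whose check it fails."""
    if not parsed_ok:
        return FailureMode.EXTRACTION_RUNTIME
    if not constraints_ok:
        return FailureMode.FORMAT_CONSTRAINT
    # Well-formed and compliant, yet still wrong: a reasoning/content error.
    return FailureMode.REASONING_CONTENT


def distribution(failures: list[tuple[bool, bool]]) -> dict[str, float]:
    """Share of each failure mode among the failed samples of a mini-batch."""
    counts = Counter(classify(p, c) for p, c in failures)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode.value: counts[mode] / total for mode in FailureMode}


# Hypothetical mini-batch failures as (parsed_ok, constraints_ok) pairs.
print(distribution([(False, True), (True, False), (True, True), (True, True)]))
```

Comparing the output of `distribution` before and after prompt evolution gives the kind of shift the figure summarizes.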

### 4.5 Control-Logic Proposal Filtering

Table 4: Control-logic proposal filtering during PACE evolution, aggregated over three runs (with different random seeds) per task. Each entry reports Qwen3-4B / Qwen3.5-9B. "Rej. regression" denotes rejected executable candidates with non-positive validation gain.

To quantify the role of the validation gate, we track all control-logic proposals generated by `action_adjust_logic` across four benchmarks and two backbones. For each proposal, we record whether it is executable, accepted by held-out validation, or rejected despite being executable. Table [4](https://arxiv.org/html/2605.23019#S4.T4) shows that frozen SLMs are useful but noisy generators of solver edits. The 4B model is less stable than the 9B model, producing fewer executable and accepted proposals, but held-out validation filters non-improving edits for both backbones. Nearly all rejected executable edits have non-positive validation gain, supporting PACE's proposal–validation separation: the SLM proposes candidate control-logic changes, while validation determines which updates are committed.

## 5 Conclusion

We introduced PACE, a two-timescale agentic framework for frozen SLM self-evolution that coordinates fast prompt refinement with less frequent, constrained control-logic updates. By separating proposal from validation, PACE lets the SLM generate candidate edits while committing only those that improve held-out validation performance under a resource budget. Across all benchmarks and three SLM backbones (4B–14B), PACE outperforms prompt-only and control-logic-only evolution, achieving up to +9.2% accuracy over vanilla agents at a one-time evolution cost of 2–3M tokens. More broadly, PACE shows that the novelty is not in any single final solver pattern, but in enabling a frozen SLM to autonomously discover, select, and validate task-appropriate inference strategies; the resulting accepted and rejected evolution trajectories also provide preference- or reward-labeled data for training future evolution controllers.

## References

- L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, et al. (2025) Gepa: reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457.
- J. Fang, Y. Peng, X. Zhang, Y. Wang, X. Yi, G. Zhang, Y. Xu, B. Wu, S. Liu, Z. Li, et al. (2025) A comprehensive survey of self-evolving ai agents: a new paradigm bridging foundation models and lifelong agentic systems. arXiv preprint arXiv:2508.07407.
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2020) Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- J. Lin, Y. Guo, Y. Han, S. Hu, Z. Ni, L. Wang, M. Chen, H. Liu, R. Chen, Y. He, et al. (2025) Se-agent: self-evolution trajectory optimization in multi-step reasoning with llm-based agents. arXiv preprint arXiv:2508.02085.
- K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab (2024) Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 9340–9366.
- R. Pryzant, D. Iter, J. Li, Y. Lee, C. Zhu, and M. Zeng (2023) Automatic prompt optimization with "gradient descent" and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7957–7968.
- J. Schmidhuber (2003) Gödel machines: self-referential universal problem solvers making provably optimal self-improvements. arXiv preprint cs/0309048.
- S. Shao, Q. Ren, C. Qian, B. Wei, D. Guo, J. Yang, X. Song, L. Zhang, W. Zhang, D. Liu, et al. (2025) Your agent may misevolve: emergent risks in self-evolving llm agents. arXiv preprint arXiv:2509.26354.
- F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, et al. (2022) Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057.
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652.
- Z. Tao, T. Lin, X. Chen, H. Li, Y. Wu, Y. Li, Z. Jin, F. Huang, D. Tao, and J. Zhou (2024) A survey on self-evolution of large language models. arXiv preprint arXiv:2404.14387.
- L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al. (2024a) A survey on large language model based autonomous agents. Frontiers of Computer Science 18(6), pp. 186345.
- X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji (2024b) Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning.
- X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380.
- S. Yao, N. Shinn, P. Razavi, and K. Narasimhan (2024) τ-Bench: a benchmark for tool-agent-user interaction in real-world domains. arXiv:2406.12045. [Link](https://arxiv.org/abs/2406.12045)
- X. Yin, X. Wang, L. Pan, L. Lin, X. Wan, and W. Y. Wang (2025) Gödel agent: a self-referential agent framework for recursively self-improvement. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 27890–27913.
- Q. Zhang, C. Hu, S. Upasani, B. Ma, F. Hong, V. Kamanuru, J. Rainton, C. Wu, M. Ji, H. Li, et al. (2025) Agentic context engineering: evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618.
- H. Zhou, X. Wan, R. Sun, H. Palangi, S. Iqbal, I. Vulić, A. Korhonen, and S. Ö. Arık (2025) Multi-agent design: optimizing agents with better prompts and topologies. arXiv preprint arXiv:2502.02533.
- J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou (2023) Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.

## Appendix A

### A.1 Benchmark Details and Data Splits

This section documents the four benchmarks used throughout the paper \(MMLU, IFEval, HotpotQA, and MGSM\) and specifies the data splits, program structure, and evaluation metrics shared by the Vanilla, \+PE, and \+PACE conditions as well as the DSPy baselines \(GEPA, MIPROv2\) reported in the main results\.

##### Common protocol\.

Every benchmark is partitioned into three disjoint, seed-shuffled splits. The *train* split is used by the inner-loop prompt optimizer to sample minibatches and by DSPy baselines to bootstrap demonstrations. The *validation* split is used for comparisons of candidate solvers or prompts (including the PACE credit-assignment gate). The *test* split is held out and used only for the final reported numbers. All experiments use seed 0, the `auto="medium"` budget preset for DSPy optimizers, and identical splits across Vanilla, +PE, +PACE, GEPA, and MIPROv2 so that differences reflect the method rather than the data.

Each task is implemented as a single-module DSPy `ChainOfThought` predictor. The signature exposes an input schema, a `reasoning` field, and a task-specific output field; only the signature instruction is subject to optimization. At inference time, the model is served via an OpenAI-compatible interface backed by vLLM, with a round-robin dispatcher sharding requests across replicas.

Table 5: Controlled data splits used for repeated evolution experiments. All methods use the same train/test subsets; validation subsets are dynamically sampled by the agent during evolution. Results should be interpreted as within-protocol comparisons rather than standard leaderboard scores.

Because evolution requires repeated evaluation, we use fixed controlled subsets for training and testing rather than full benchmark test suites; the same splits are shared across all methods. For IFEval, whose official release consists of a single set of 541 prompts without a designated train/test partition, we randomly split the full prompt set into disjoint train and test subsets using a fixed seed.

Validation subsets are not fixed: at each evaluation step, the agent dynamically samples a fresh subset from the training pool, varying both size and composition across iterations to reduce overfitting to any single validation slice\. This design reflects a practical deployment constraint where repeated evaluation on an identical held\-out set can lead to implicit selection bias in the evolution trajectory\.
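A minimal sketch of this dynamic sampling, assuming a uniformly random subset size within fixed bounds, is given below. The size bounds and the function name are hypothetical; the text does not state how the agent chooses the subset size.

```python
import random


def sample_validation_subset(train_pool: list, rng: random.Random,
                             min_size: int = 20, max_size: int = 50) -> list:
    """Draw a fresh validation subset whose size and membership vary per call.

    min_size and max_size are illustrative defaults, not values from the paper.
    """
    upper = min(max_size, len(train_pool))
    lower = min(min_size, upper)
    size = rng.randint(lower, upper)      # vary the subset size
    return rng.sample(train_pool, size)   # vary the composition, no repeats


pool = list(range(200))                   # stand-in for 200 training examples
rng = random.Random(0)
first = sample_validation_subset(pool, rng)
second = sample_validation_subset(pool, rng)
print(len(first), len(second), first == second)
```

Because each call draws a new subset, a candidate that wins on one slice only by chance is less likely to keep winning across successive comparisons.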

Following standard practice in iterative prompt/agent optimization [Opsahl-Ong et al., [2024](https://arxiv.org/html/2605.23019#bib.bib26), Pryzant et al., [2023](https://arxiv.org/html/2605.23019#bib.bib16)], we evaluate on fixed controlled subsets to enable repeated evaluation during evolution while keeping compute tractable. All methods share identical splits, ensuring fair within-protocol comparison.

##### MMLU \(multiple\-choice knowledge\)\.

MMLU is a closed\-book multiple\-choice benchmark covering 57 subjects with four answer choices per question\.

- Data. The full dataset is loaded from Kaggle (https://www.kaggle.com/datasets/open-benchmarks/mmlu-massive-multitask-language-understanding/data). We sample 800 cases as the held-out test set.
- Program. `ChainOfThought` over a signature with input `question` and outputs `reasoning` and `answer`, where `answer` is constrained to one letter in $\{A, B, C, D\}$.
- Metric. We report letter-match accuracy. The first A/B/C/D token is extracted from the model output case-insensitively, and the prediction is counted correct iff it matches the gold letter.

##### IFEval \(verifiable instruction following\)\.

IFEval measures whether a model satisfies verifiable formatting and content constraints attached to each prompt\.

- Data. We shuffle the available IFEval prompts and use 200 examples for training and 100 examples for validation. A disjoint 141-example subset from the official test set is reserved for final evaluation.
- Program. `ChainOfThought` over a signature with input `prompt` and outputs `reasoning` and `response`. The `response` field is scored by the instruction checkers.
- Metrics. We use the IFEval instruction checkers for length, case, keyword, format, and punctuation constraints. Each response is scored by *strict accuracy*, which equals 1.0 only if all attached instructions are satisfied, and *loose accuracy*, the fraction of satisfied instructions. Strict accuracy is used as the optimization objective; failure feedback lists violated instruction identifiers and the loose score.
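Following the definitions in the Metrics item above, the two scores for a single response can be computed from the per-instruction checker results. The sketch assumes each checker has already returned a boolean; the function name is hypothetical.

```python
def score_response(checks: list[bool]) -> tuple[float, float]:
    """Return (strict, loose) scores for one response.

    strict: 1.0 only if every attached instruction is satisfied.
    loose:  fraction of attached instructions that are satisfied.
    """
    if not checks:
        raise ValueError("a prompt must carry at least one instruction")
    loose = sum(checks) / len(checks)
    strict = 1.0 if all(checks) else 0.0
    return strict, loose


# A response that satisfies three of its four attached instructions.
strict, loose = score_response([True, False, True, True])
print(strict, loose)
```

The loose score is what makes the failure feedback informative: a response with strict score 0.0 but loose score 0.75 is one violated instruction away from passing.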

##### HotpotQA \(multi\-hop QA, distractor setting\)\.

HotpotQA [Yang et al., [2018](https://arxiv.org/html/2605.23019#bib.bib20)] requires combining evidence from multiple Wikipedia paragraphs to answer a question. In the distractor setting, each example includes ten paragraphs, consisting of two gold paragraphs and eight distractors.

- Data. We use the HuggingFace `hotpot_qa` distractor split. After shuffling the training split, we use 200 examples for training and 100 for validation. Final evaluation uses the first 500 examples from the official validation split. Paragraphs are concatenated into a single context string with title markers and sentence text.
- Program. `ChainOfThought` over a signature with inputs `context` and `question` and outputs `reasoning` and `answer`, where `answer` is a short span.
- Metrics. We apply SQuAD-style normalization, including lowercasing, article removal, punctuation removal, and whitespace normalization. Exact Match is used as the optimizer objective, and token-level F1 is also reported for analysis.
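The normalization and the two metrics follow the familiar SQuAD evaluation recipe. The sketch below mirrors that recipe (lowercase, strip punctuation, drop English articles, collapse whitespace); it is an illustration and may differ in detail from the evaluation code used for the reported numbers.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style answer normalization."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # remove articles
    return " ".join(text.split())                 # collapse whitespace


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Eiffel Tower.", "Eiffel Tower"))
print(round(f1_score("Eiffel Tower in Paris", "Eiffel Tower"), 3))
```

Exact Match is all-or-nothing, which suits its role as the optimizer objective, while F1 gives partial credit to over-long spans and is therefore more useful for analysis.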

##### MGSM \(multilingual grade\-school math\)\.

MGSM is a multilingual grade\-school math benchmark with numeric answers, covering 11 languages: Bengali, German, English, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese\.

- Data. For each language, we load `mgsm_{lang}.tsv`, prepend a language-specific instruction prefix, pool examples across languages, shuffle with seed 0, and split the pooled data into 200 training, 100 validation, and 800 test examples.
- Program. `ChainOfThought` over a signature with input `question` and outputs `reasoning` and `answer`, where `answer` is numeric.
- Metric. We report exact numeric match. The model output is normalized by extracting the first numeric token, removing thousands separators, and trimming trailing zeros and decimal points; the prediction is correct iff the normalized value matches the gold answer.
- Inference note. Non-English problems occasionally produce long repetitive generations, so we set `max_tokens` to 8000 and use task-generation temperature 0.6 for MGSM. Parse failures caused by truncation are counted as incorrect rather than aborting evaluation.

##### τ-bench (multi-turn tool-augmented dialogue).

τ-bench [Yao et al., [2024](https://arxiv.org/html/2605.23019#bib.bib18)] is a realistic agent benchmark that evaluates LLM agents in dynamic, multi-turn customer-service conversations requiring policy-compliant tool use. The agent interacts with a stochastic LLM-simulated user who has a hidden intent (e.g., exchange an item, cancel an order), and must resolve the request by calling domain-specific API tools (e.g., `get_order_details`, `exchange_delivered_order_items`) while adhering to a detailed policy document (the "wiki").

- Data. We evaluate on two domains: *Retail* (115 tasks involving order management, returns, exchanges, and address modifications) and *Airline* (50 tasks involving flight changes, cancellations, and baggage policies). Each task specifies a user persona with a hidden intent and a gold action sequence; the environment is fully stateful with a database backend.
- Program. The agent runs a multi-turn loop: at each step it receives the user's message (or tool result), generates either a tool call or a natural-language response, and the environment advances accordingly. The system prompt consists of the domain policy wiki concatenated with any evolved role description and requirements. The agent has access to 10–15 domain-specific tools per environment. Maximum conversation length is 30 steps (tool calls + responses).
- Metric. Each task is evaluated over $k=4$ independent trials with the stochastic user simulator. $\text{pass}^k$ denotes the fraction of tasks where the agent succeeds (reward $=1.0$) in *all* $k$ trials, computed via the combinatorial estimator $\text{pass}^k_i = \binom{c_i}{k} / \binom{n}{k}$, where $c_i$ is the number of successful trials for task $i$ out of $n$ trials. This metric captures both accuracy and behavioral consistency under user variability.
- User simulator. We use Claude Sonnet 4.5 (`claude-sonnet-4-5-20250929`) as the user simulator for all reported results. The simulator follows the task's hidden intent, responds naturally to the agent, and terminates the conversation (`###STOP###`) if the agent becomes unresponsive or completes the task.
- Inference note. The agent model (Qwen3.5-9B) is served locally via vLLM on 2 A100 GPUs with temperature 0.0. During evolution, validation uses single-trial evaluation on all tasks for efficiency; the final reported metrics use 4-trial evaluation. Each full 4-trial evaluation run takes approximately 8 hours per domain.
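The combinatorial estimator in the Metric item is straightforward to compute with `math.comb`. In the sketch below, `n` is the number of trials run per task and `successes` plays the role of $c_i$; the function names are hypothetical, and the benchmark-level score is taken to be the mean of the per-task estimates.

```python
from math import comb


def pass_hat_k(successes: int, n: int, k: int) -> float:
    """Per-task estimate C(c_i, k) / C(n, k) that k sampled trials all succeed."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    return comb(successes, k) / comb(n, k)   # comb returns 0 when successes < k


def benchmark_pass_hat_k(success_counts: list[int], n: int, k: int) -> float:
    """Average the per-task estimates over all tasks in a domain."""
    return sum(pass_hat_k(c, n, k) for c in success_counts) / len(success_counts)


# Hypothetical success counts for four tasks, each run for n = 4 trials.
counts = [4, 3, 0, 4]
for k in (1, 2, 4):
    print(k, benchmark_pass_hat_k(counts, n=4, k=k))
```

With $n = k = 4$ the estimator reduces to an indicator that all four trials succeeded, which is why a single failed trial on a task removes it from $\text{pass}^4$ while only lowering $\text{pass}^1$ by a quarter.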

##### DSPy baselines\.

The GEPA and MIPROv2 baselines reported in the main tables use the splits above, the same `ChainOfThought` program for each task, and identical evaluation metrics. Both optimizers are run with `auto="medium"` and seed 0. GEPA additionally uses a reflection language model at temperature 1.0; MIPROv2 uses its default grounded proposer. To keep comparisons fair with PACE, DSPy's `max_errors` threshold is set to the size of the evaluation set so that sporadic parse failures on pathological examples are counted as incorrect rather than aborting the evaluation.

##### Inference infrastructure\.

All experiments run on a single node with eight vLLM replicas of the target SLM, each bound to one GPU. When four experiments share the cluster, we allocate two replicas per task and use 24 evaluation threads; when a single task has exclusive use of the cluster, we use all eight replicas with 48 threads. Task-generation temperature is 0.2 for MMLU, IFEval, and HotpotQA and 0.6 for MGSM; optimizer-side reflection temperature is fixed at 1.0.

### A.2 Experiment Settings

Agent Tools Description. The following list covers the tools available to the agent for control-logic (i.e., outer-loop) optimization. Note that `action_optimize_prompt_on_task` is the inner-loop prompt optimization, which runs by default in each iteration.

- `action_display_analysis`: Summarize the latest evaluation failures into a structured failure taxonomy (extraction/runtime, format/constraint, reasoning/content) and produce actionable evolution guidance with ranked failure modes and fix recommendations.
- `action_read_logic`: Read the source code of a specified function, method, or class within a given module.
- `action_adjust_logic`: Modify, add, or delete the source code of a specified function, method, or class within a given module to improve task-solving ability.
- `action_run_code`: Execute Python or shell code and capture the output, errors, and return value.
- `action_call_json_format_llm`: Call an external LLM for assistance with gathering insights, refining strategies, correcting errors, and solving complex problems, returning the response in JSON format.
- `action_select_examples`: Select a representative subset of task examples (diverse, random, or head strategy) for future train, valid, or evaluate calls.
- `action_compare_variants`: Run a candidate comparison of two solver or prompt variants on the same sampled validation subset to verify whether a change improves accuracy.
- `action_optimize_prompt_on_task`: Run the inner-loop prompt optimizer, evolving role, requirements, and temperature using minibatch feedback.
- `action_get_evolution_credit`: Return the current evolution credit assignment state, including $\Delta U_P$, $\Delta U_C$, prompt saturation status, and thresholds.
- `action_evaluate_on_task`: Evaluate the current solver on the goal task samples and return evaluation feedback including accuracy and per-sample details.

Table 6: Performance comparison across SLM backbones. CE denotes control-logic evolution only, PE denotes prompt evolution only, and PACE denotes the full two-timescale framework. Results are reported as mean $\pm$ standard deviation over multiple runs. Best result per backbone is bolded.

### A.3 Complete Experiment Results for Table [6](https://arxiv.org/html/2605.23019#A1.T6)

Table [6](https://arxiv.org/html/2605.23019#A1.T6) reports the standard deviations for the results in Table [1](https://arxiv.org/html/2605.23019#S4.T1). Three patterns stand out. *First*, PACE does not inflate variance despite its two-stage search: its standard deviation averages 0.011 across the 12 backbone-benchmark cells, on par with +CE (0.012) and marginally above +PE (0.009). The validation gate absorbs the instability of structural proposals rather than propagating it. *Second*, structural methods (+CE, Gödel Agent, +PACE) are consistently noisier than prompt-only baselines (+PE, GEPA, ACE, MIPROv2) on benchmarks where control-logic edits pay off (e.g., MGSM and HotpotQA on Qwen3.5-9B), reflecting run-to-run variability in *which* strategy the SLM discovers, even when the expected accuracy is reliably higher. *Third*, the main-text improvements survive these standard deviations. On Qwen3.5-9B MMLU, +PACE beats +CE by 5.4 points ($0.889 \pm 0.007$ vs. $0.835 \pm 0.008$), more than $3\sigma$ apart. On Qwen3.5-9B IFEval, +PACE beats +PE by 4.3 points ($0.761 \pm 0.012$ vs. $0.718 \pm 0.010$), roughly $2.7\sigma$. The only narrow cluster is Qwen3-4B HotpotQA, where all methods sit within one $\sigma$ of Vanilla, consistent with our observation in Table [1](https://arxiv.org/html/2605.23019#S4.T1) that the 4B model lacks the capacity for multi-hop reasoning regardless of agent design.

### A.4 Structural Edit Space

Prompt evolution (+PE) and the outer loop of PACE share a common interface, `action_adjust_logic`, but the categories of change they are allowed to make on top of the current solver differ qualitatively. This subsection enumerates the concrete edit categories that the outer-loop agent can emit, specifies whether each is free-form Python or constrained to a template, and lists the safety checks that gate every structural update before it is committed.

Interface. Every structural edit is a single call of the form `action_adjust_logic(module_name, target_name, new_code, target_type, operation)` where `operation` $\in$ {`modify`, `add`, `delete`} and `target_type` $\in$ {`function`, `class`}. By convention the principal target is the solver function (`agent_module.solver`); the same interface can edit or introduce helper functions and classes within `agent_module`. The prompt mutation interface (`action_optimize_prompt_on_task`) is template-based and operates only on the fields of the solver's `prompt_config` (role, requirements, temperature, and response format) without touching control flow.
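As a rough illustration of the call shape described above, a structural edit can be captured as a small validated record. This is a sketch under stated assumptions: the `StructuralEdit` class and its checks are hypothetical and are not part of the PACE codebase; only the five argument names and the allowed values for `operation` and `target_type` come from the text.

```python
from dataclasses import dataclass

ALLOWED_OPERATIONS = {"modify", "add", "delete"}
ALLOWED_TARGET_TYPES = {"function", "class"}


@dataclass(frozen=True)
class StructuralEdit:
    """One action_adjust_logic call captured as data (hypothetical wrapper)."""

    module_name: str
    target_name: str
    new_code: str
    target_type: str
    operation: str

    def __post_init__(self):
        # Reject values outside the two enumerations given in the text.
        if self.operation not in ALLOWED_OPERATIONS:
            raise ValueError(f"unknown operation: {self.operation!r}")
        if self.target_type not in ALLOWED_TARGET_TYPES:
            raise ValueError(f"unknown target_type: {self.target_type!r}")
        # Assumption: modify/add edits must carry code; delete need not.
        if self.operation != "delete" and not self.new_code.strip():
            raise ValueError("modify/add edits need non-empty new_code")


edit = StructuralEdit(
    module_name="agent_module",
    target_name="solver",
    new_code="def solver(agent, task):\n    return {}\n",
    target_type="function",
    operation="modify",
)
```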

Edit categories. Table [7](https://arxiv.org/html/2605.23019#A1.T7) lists the edit categories the agent actually uses across the four benchmarks, together with the target, whether they are free-form or template-constrained, and the safety checks that apply. Templates reflect recurring patterns in what the outer-loop agent discovers, but are not hard-coded: the agent writes arbitrary Python that conforms to the solver contract (`solver(agent, task)` $\rightarrow$ `dict`).

Table 7: Allowed edit categories in the PACE outer loop. All edits are expressed as calls to `action_adjust_logic`. *Free-form* means the agent writes arbitrary Python satisfying the solver contract; *template-constrained* means the edit is restricted to a fixed schema (prompt fields, temperature, or a hyperparameter dict).

Safety constraints. Every structural update is gated by a sequence of checks before it is applied to `agent_module`:

- Prompt-saturation gate. Structural updates are blocked unless $\Delta U_P < \varepsilon$ on the inner-loop credit state, i.e., the prompt optimizer has exhausted its gains. This enforces the PE $\to$ CE ordering described in the main body.
- Immutable targets. The LLM call primitives `Agent.action_call_llm` and `Agent.action_call_json_format_llm` cannot be edited or deleted. Any attempt raises `ValueError` and the edit is discarded.
- Blacklisted primitives. The solver may not contain `time.sleep`, and no target may `import logging` or `from logging ...`. These would break the evaluation harness's latency accounting and log capture.
- Syntax compilation. Every `modify` or `add` is parsed with `compile(...)`; a `SyntaxError` surfaces as a structured error message (line, column, offending text) and the edit is rejected.
- Validation gate (solver only). Edits that modify the `solver` function are run through a cheap candidate comparison (`_compare_variant_metrics`) against the current solver on 12 deterministic validation samples. If $\Delta_{\text{AB}} \leq \tau_{\text{AB}}$ (`agent.ab_gate_delta_threshold`) the edit is rejected and the candidate code is not installed.
- Post-commit structural eval. When an edit passes the validation gate and is committed, the agent records the current eval accuracy as `structure_eval_baseline` and marks a pending full evaluation. The next `action_evaluate_on_task` computes $\Delta U_C$, which is fed back into the credit mechanism (Section [4.4](https://arxiv.org/html/2605.23019#S4.SS4)) and can trigger rollback if the full-evaluation delta is negative.
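The static checks in the list above (immutable targets, blacklisted primitives, syntax compilation) and the validation-gate comparison can be sketched as follows. This is a minimal illustration rather than the PACE implementation: `static_checks` and `passes_validation_gate` are hypothetical names, the immutable targets are matched by bare function name, and the `time.sleep` check is applied to every target for simplicity even though the text restricts it to the solver.

```python
import ast

IMMUTABLE_TARGETS = {"action_call_llm", "action_call_json_format_llm"}


def static_checks(target_name, new_code, operation):
    """Return (ok, reason) for the checks that need no model calls."""
    if target_name in IMMUTABLE_TARGETS:
        return False, "immutable target"
    if operation == "delete":
        return True, "ok"
    try:
        # compile() with PyCF_ONLY_AST parses the code and returns its AST.
        tree = compile(new_code, "<candidate>", "exec", ast.PyCF_ONLY_AST)
    except SyntaxError as err:
        return False, f"syntax error at line {err.lineno}, column {err.offset}"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == "logging" for alias in node.names):
                return False, "blacklisted import: logging"
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] == "logging":
                return False, "blacklisted import: logging"
        elif (isinstance(node, ast.Attribute) and node.attr == "sleep"
              and isinstance(node.value, ast.Name) and node.value.id == "time"):
            return False, "blacklisted call: time.sleep"
    return True, "ok"


def passes_validation_gate(candidate_score, current_score, threshold):
    """Reject when the measured delta is at or below the threshold."""
    return (candidate_score - current_score) > threshold
```

Keeping the model-free checks ahead of the validation gate means a malformed edit is discarded before any validation samples are spent on it.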

Implications. The edit space is deliberately asymmetric: prompt fields are small, template-constrained, and cheap to revert, while structural edits are free-form Python but guarded by a prompt-saturation gate, a validation gate, and a set of immutable primitives. This is what lets a frozen 9B model act as its own architect: the edit space is free-form enough to discover self-consistency on MMLU or the generate–score–repair loop on IFEval (Figures [4](https://arxiv.org/html/2605.23019#A1.F4)–[5](https://arxiv.org/html/2605.23019#A1.F5)), while the gates prevent the model's occasional faulty edits from regressing the system.

### A.5 Algorithm Walkthrough

**Algorithm 1** PACE: Prompt And Control Logic Evolution

- 1: **Input:** frozen SLM $M_{\theta}$; initial prompt artifacts $P_0$; initial control logic $C_0$; training set $\mathcal{D}_{train}$; validation set $\mathcal{D}_{val}$; resource budget $B$; prompt saturation threshold $\epsilon$; validation gate acceptance margin $\delta$; maximum outer-loop steps $K$; maximum prompt-evolution steps $L$
- 2: **Output:** evolved agent $A^{*} = (P^{*}, C^{*})$
- 3: Initialize $P \leftarrow P_0$, $C \leftarrow C_0$
- 4: Evaluate initial agent $A = (P, C)$ on $\mathcal{D}_{val}$ to obtain $U_{best}$
- 5: **for** $k = 1$ to $K$ **do**
  - 6: $\triangleright$ Inner loop: prompt evolution under fixed control logic
  - 7: $P_{\mathrm{old}} \leftarrow P$
  - 8: **for** $\ell = 1$ to $L$ **do**
    - 9: Generate prompt candidates $\mathcal{P}_{cand}$ using handcrafted mutations, failure-guided reflection, and crossover
    - 10: Evaluate each $P' \in \mathcal{P}_{cand}$ with fixed $C$ on minibatches from $\mathcal{D}_{train}$
    - 11: Update the prompt Pareto front according to validation utility and resource cost
    - 12: Select the best prompt $P$ from the Pareto front using $\mathcal{D}_{val}$
    - 13: **if** $Cost(P, C) > B$ **then**
      - 14: Reject $P$ and restore the best feasible prompt on the Pareto front
    - 15: **end if**
  - 16: **end for**
  - 17: Compute recent prompt improvement $\Delta U_P \leftarrow U(P, C; \mathcal{D}_{val}) - U(P_{\mathrm{old}}, C; \mathcal{D}_{val})$
  - 18: **if** $\Delta U_P \geq \epsilon$ **then**
    - 19: **continue** $\triangleright$ Prompt evolution is still useful; do not edit control logic
  - 20: **end if**
  - 21: $\triangleright$ Outer loop: constrained structural evolution
  - 22: Analyze failures of $(P, C)$ on $\mathcal{D}_{val}$
  - 23: Classify failures into structural categories, e.g., parsing errors, validation failures, retry failures, or inference-configuration issues
  - 24: Generate a constrained structural candidate $C'$ within the safe-to-edit search space
  - 25: **if** $Cost(P, C') > B$ **then**
    - 26: Reject $C'$
    - 27: **continue**
  - 28: **end if**
  - 29: $\triangleright$ Empirical validation
  - 30: Evaluate old agent $A_{old} = (P, C)$ and candidate agent $A_{new} = (P, C')$ on the same validation subset $\mathcal{V} \subseteq \mathcal{D}_{val}$
  - 31: Compute $\Delta U_C \leftarrow U(A_{new}; \mathcal{V}) - U(A_{old}; \mathcal{V})$
  - 32: **if** $\Delta U_C > \delta$ **then**
    - 33: Accept the structural update: $C \leftarrow C'$
    - 34: Re-enter prompt evolution under the updated control logic
  - 35: **else**
    - 36: Reject $C'$ and keep $C$ unchanged
  - 37: **end if**
  - 38: Update $U_{best} \leftarrow U(P, C; \mathcal{D}_{val})$ if improved
- 39: **end for**
- 40: **return** $A^{*} = (P, C)$

Algorithm 1 summarizes the full PACE procedure. Starting from an initial agent, PACE first performs prompt evolution while holding the control logic fixed. Prompt candidates are generated through mutation, reflection, and crossover, then selected using validation utility and resource cost. If the marginal validation improvement from prompt evolution is at or above the saturation threshold $\epsilon$, PACE continues refining prompts. Once prompt gains fall below $\epsilon$, the framework activates constrained structural evolution. Structural candidates are proposed only within a predefined safe-to-edit search space and are accepted only if they improve over the current agent by more than the validation margin $\delta$ on the same held-out validation subset without violating the resource budget. After each accepted structural update, PACE returns to prompt evolution, allowing the prompt to adapt to the new control logic.
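The control flow of Algorithm 1 can be condensed into a short sketch. This is only an illustration of the two-timescale ordering, not the authors' implementation: `pace_loop` and every callable passed to it are hypothetical stand-ins, the inner loop is collapsed into a single `evolve_prompt` call, and the toy utility and cost functions at the bottom are invented so the example runs.

```python
def pace_loop(prompt, logic, evolve_prompt, propose_logic, utility, cost,
              budget, epsilon, delta, max_outer_steps):
    """Refine the prompt until gains saturate, then try one gated logic edit."""
    for _ in range(max_outer_steps):
        old_prompt = prompt
        prompt = evolve_prompt(prompt, logic)          # inner loop (fast timescale)
        gain = utility(prompt, logic) - utility(old_prompt, logic)
        if gain >= epsilon:
            continue                                   # prompts still improving
        candidate = propose_logic(prompt, logic)       # outer loop (slow timescale)
        if cost(prompt, candidate) > budget:
            continue                                   # over budget: reject
        if utility(prompt, candidate) - utility(prompt, logic) > delta:
            logic = candidate                          # accepted by the validation gate
    return prompt, logic


# Toy stand-ins (all invented): prompts are integers, logic is a label.
def toy_utility(prompt, logic):
    return 0.5 + 0.05 * min(prompt, 2) + (0.1 if logic == "vote" else 0.0)


def toy_cost(prompt, logic):
    return 4 if logic == "vote" else 1


final_prompt, final_logic = pace_loop(
    prompt=0, logic="single",
    evolve_prompt=lambda p, c: p + 1,
    propose_logic=lambda p, c: "vote",
    utility=toy_utility, cost=toy_cost,
    budget=5, epsilon=0.01, delta=0.02, max_outer_steps=6,
)
```

In the toy run the first two steps only change the prompt because its gain stays above `epsilon`; the structural candidate is considered from the third step onward, once the prompt gain has dropped to zero.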

### A.6 Token Usage Analysis

A defining constraint of the SLM setting is that every token is generated locally on limited hardware. Unlike API-based systems where cost is monetary, here cost is *latency and throughput*: each additional LLM call occupies a GPU that could serve another request. We therefore analyse the token budget of both the evolution process (outer loop) and the evolved solvers (inference time) to characterise the practical overhead of self-evolution under resource constraints.

Footnote 5: Token counts in this section are estimates derived from code-level analysis of prompt templates, solver call patterns, and observed output lengths. The current implementation tracks cost via `len(response_text)` (character count) rather than API-level `usage.prompt_tokens`/`completion_tokens`. Accuracy figures are from real evaluation runs on the respective test splits (800 samples for MMLU, 141 for IFEval).

Table 8: Estimated per-query inference token breakdown for baseline and evolved solvers. *Input* counts prompt tokens (system + user + context); *Output* counts generated tokens. Multipliers are relative to the single-call baseline. Token counts are derived from code-level analysis of prompt templates and observed output lengths (marked with $\sim$); accuracy figures are from real evaluation runs on the respective test splits.

Inference-time token cost. Table [8](https://arxiv.org/html/2605.23019#A1.T8) decomposes the per-query token budget across solver variants, and two patterns stand out:

- Prompt evolution is nearly free. The optimized prompt adds only 50–70 tokens of system-prompt overhead (longer role descriptions and requirement suffixes), yielding a 1.1–1.2$\times$ multiplier with no additional LLM calls. This makes PE the highest-ROI intervention: on MMLU it accounts for the majority of the accuracy gain at negligible cost.
- Structural evolution trades tokens for accuracy. The evolved MMLU solver issues up to $3+1$ calls (3-way sample + conditional verification), while the IFEval solver issues up to $3+3+1$ calls (3-way sample + 3 self-checks + conditional repair). The MMLU multiplier is kept moderate ($\sim$3.6$\times$) because the early exit on consensus fires $\sim$80% of the time, avoiding the verification call. IFEval is more expensive ($\sim$5.1$\times$) because the self-check scoring requires a separate LLM call per candidate, and the repair pass generates a full-length response. However, the accuracy gain on IFEval is substantially larger (+33.1 pp from +PE to +PACE), reflecting the high value of structural intervention for constrained-generation tasks. MGSM exhibits the highest per-query multiplier ($\sim$5.2$\times$) due to its translate-then-solve pipeline plus $N=5$ self-consistency voting, but this cost is justified: majority voting over diverse reasoning paths catches stochastic arithmetic errors that no single prompt can eliminate, yielding +14.5 pp on Ministral-14B. In contrast, HotpotQA's evolved solver adds only a lightweight second extraction round with question-type-aware routing ($\sim$1.6$\times$), because the dominant failure mode, answer over-specification, is largely addressable through prompt refinement alone; the structural contribution is a yes/no detector that forces exact “yes”/“no” output for comparison questions, eliminating a failure mode worth $\sim$3 F1 points that prompt wording alone could not resolve.

Crucially, these multipliers are *discovered* by the agent, not prescribed. The Pareto-front optimization in the inner loop and the validation gate in the outer loop jointly penalize candidates that increase cost without proportional accuracy gain, steering the agent toward architectures with built-in cost control (early exits, conditional refinement).
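To make the Pareto-front idea concrete, the sketch below maintains a set of non-dominated (accuracy, cost) points. It is a generic illustration, not the PACE code: the function names are hypothetical and the sample points are invented for the example.

```python
def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on one."""
    acc_a, cost_a = a
    acc_b, cost_b = b
    return acc_a >= acc_b and cost_a <= cost_b and (acc_a > acc_b or cost_a < cost_b)


def update_front(front, candidate):
    """Add candidate unless it is dominated; drop every point it dominates."""
    if any(dominates(kept, candidate) for kept in front):
        return front
    return [kept for kept in front if not dominates(candidate, kept)] + [candidate]


# Invented (accuracy, cost multiplier) pairs, for illustration only.
front = []
for point in [(0.80, 1.0), (0.82, 1.2), (0.81, 3.6), (0.89, 3.6), (0.80, 1.1)]:
    front = update_front(front, point)
```

A candidate that costs more without being more accurate, such as `(0.81, 3.6)` here, never enters the front, which is the sense in which cost increases without proportional accuracy gains are penalized.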

Table 9: Estimated total token budget for the evolution process (outer loop), broken down by phase. All figures are for Qwen3.5-9B with $K_{\max} = 20$ outer steps, extrapolated from code-level analysis of a single evolution run.

Evolution-time token budget. Table [9](https://arxiv.org/html/2605.23019#A1.T9) reports the estimated total tokens consumed during a single evolution run. The dominant cost is prompt optimization ($\sim$60% of the budget), which evaluates multiple candidate configurations on train and validation splits across several iterations. The outer-loop reasoning (agent conversation history, tool-call arguments, failure analysis summaries) accounts for only $\sim$15%, suggesting that the framework's token-trimming mechanism (`_trim_messages_to_budget`) effectively controls context growth over long evolution trajectories.

The total evolution budget of 2–3M tokens is a *one-time* cost that produces a permanently improved solver. For context, this is equivalent to $\sim$3,000–5,000 inference queries under the evolved solver, a break-even point reached quickly in any deployment scenario.
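The token-trimming mechanism mentioned above is named but not shown. One plausible shape for such a helper is sketched below, under loud assumptions: the function body is invented for illustration, it keeps the first system message plus the most recent turns, and it measures size in characters (matching the footnote's note that cost is tracked via `len(response_text)`). The real `_trim_messages_to_budget` may behave differently.

```python
def trim_messages_to_budget(messages, max_chars):
    """Keep the system message plus the newest turns that fit in max_chars.

    Hypothetical sketch; not the framework's actual implementation.
    """
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(len(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):            # walk from newest to oldest
        size = len(message["content"])
        if used + size > max_chars:
            break                             # older turns are dropped
        kept.append(message)
        used += size
    return system + kept[::-1]                # restore chronological order


history = [
    {"role": "system", "content": "sys"},
    {"role": "user", "content": "aaaa"},
    {"role": "assistant", "content": "bbbb"},
    {"role": "user", "content": "cc"},
]
trimmed = trim_messages_to_budget(history, max_chars=9)
```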

Table 10: Accuracy–cost Pareto trajectories for MMLU and IFEval using Qwen3.5-9B. Each row represents a solver variant: baseline, prompt-evolved (+PE), and fully evolved (+PACE). $\Delta$Acc reports the incremental gain over the previous row within each benchmark. Efficiency is the incremental accuracy gain per 1,000 additional tokens. Accuracy values are taken from Table [1](https://arxiv.org/html/2605.23019#S4.T1); token counts are estimates.

Why token efficiency matters for SLMs. The token analysis reveals a cost structure qualitatively different from the frontier-model setting. First, output tokens dominate: SLMs produce longer reasoning chains than frontier models to reach comparable accuracy, making output-token efficiency a first-order concern that the Pareto-front optimization directly addresses. Second, batched sampling amortizes latency: both evolved solvers use $n > 1$ in a single API call; on vLLM, batched sampling with $n = 3$ is only $\sim$1.3–1.5$\times$ slower than $n = 1$ thanks to KV-cache sharing, so the cost sensitivity of the validation gate implicitly favors batched over sequential designs. Third, the evolution cost is a fixed upfront investment (2–3M tokens); once the solver is evolved it runs at the per-query cost in Table [8](https://arxiv.org/html/2605.23019#A1.T8) indefinitely, an amortization property especially attractive for SLM deployments serving many queries on dedicated hardware. Finally, credit assignment controls cost growth: the $\varepsilon$-gated PE $\to$ CE transition (Section [4.3](https://arxiv.org/html/2605.23019#S4.SS3)) exhausts cheap prompt-level gains before incurring multi-call structural changes, avoiding the premature cost inflation visible in the $\varepsilon = 1.0$ row of Table [3](https://arxiv.org/html/2605.23019#S4.T3).
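The efficiency column described in the Table 10 caption can be written out explicitly. The notation below is an assumption introduced for illustration: $\text{Acc}_i$ and $T_i$ denote the accuracy and estimated per-query token count of row $i$ within a benchmark, and "additional tokens" is read as the increase over the previous row.

```latex
\Delta\text{Acc}_i = \text{Acc}_i - \text{Acc}_{i-1},
\qquad
\text{Efficiency}_i = \frac{\Delta\text{Acc}_i}{(T_i - T_{i-1}) / 1000}
```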

Comparison with frontier-model baselines. The evolved Qwen3.5-9B solver uses $\sim$1,710 tokens but runs on a single A100 GPU at zero marginal API cost. At a serving throughput of $\sim$2,000 tokens/s, the evolved solver adds $\sim$0.6 s of latency per query, which is acceptable for batch evaluation and many interactive use cases. This positions SLM self-evolution as a practical alternative to frontier-model API access: a modest one-time compute investment yields a permanently improved local model that approaches frontier accuracy on structured benchmarks without ongoing API costs.
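The latency figure can be sanity-checked with simple arithmetic. The sketch below assumes, without confirmation from the text, that the $\sim$1,710-token figure refers to the evolved MMLU solver and therefore pairs with the $\sim$3.6$\times$ multiplier quoted earlier; the function name is hypothetical.

```python
def added_latency_seconds(evolved_tokens, multiplier, tokens_per_second):
    """Extra seconds per query relative to a baseline `multiplier` times cheaper."""
    baseline_tokens = evolved_tokens / multiplier
    return (evolved_tokens - baseline_tokens) / tokens_per_second


# Assumed pairing: 1,710 tokens with the 3.6x MMLU multiplier.
extra = added_latency_seconds(1710, 3.6, 2000)
total = 1710 / 2000
print(f"added: {extra:.2f} s, total: {total:.2f} s")
```

Under that assumption the added latency is about 0.62 s on top of a baseline of roughly 0.24 s, which agrees with the $\sim$0.6 s stated above.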

### A.7 Cross-Benchmark Solver Analysis: MMLU vs. IFEval

The structural evolution trajectories on MMLU and IFEval reveal that the outer\-loop agent independently discovers qualitatively different inference strategies tailored to each task’s evaluation semantics, without any human\-specified architectural prior\. We analyse the two evolved solvers side by side to highlight how a frozen SLM, acting as its own architect, adapts control logic to the structure of the problem\.

Table 11: Structural comparison of evolved solvers on MMLU and IFEval. Both are discovered autonomously by the same outer-loop agent (Qwen3.5-9B) via `action_adjust_logic`.

Shared meta-strategy. Despite targeting fundamentally different evaluation protocols, both evolved solvers converge on the same three-phase skeleton: *diverse generation* $\rightarrow$ *selection* $\rightarrow$ *conditional refinement*. This is not prescribed by the framework; the agent arrives at it independently on each benchmark. The convergence suggests that *generate–evaluate–refine* is a natural attractor in the space of inference-time strategies discoverable by SLMs, analogous to how self-consistency and chain-of-thought emerge as effective patterns in the prompting literature \[Wang et al., [2023](https://arxiv.org/html/2605.23019#bib.bib27); Wei et al., [2022](https://arxiv.org/html/2605.23019#bib.bib15)\].

Task\-adaptive specialization\.Within the shared skeleton, the agent makes task\-specific design choices that a human engineer would recognize as sensible:

- Selection mechanism. MMLU admits a trivial aggregation (majority vote over four labels), whereas IFEval requires evaluating each candidate against a heterogeneous constraint set. The agent replaces voting with an LLM-as-judge self-check that enumerates constraints and scores compliance, effectively inventing a lightweight verifier without access to the ground-truth checker.
- Refinement strategy. For MMLU, a wrong answer is best addressed by re-examining the reasoning chain (verification). For IFEval, a constraint violation is best addressed by rewriting the offending passage (repair). The agent discovers this distinction: it uses a *confirm-or-correct* prompt for MMLU and a *rewrite-to-satisfy* prompt for IFEval.
- Output format. The MMLU solver requests structured JSON to facilitate answer extraction, while the IFEval solver explicitly uses `response_format="text"` to avoid introducing formatting artefacts that would violate content constraints. This choice reflects awareness of the interaction between output format and evaluation criteria.
- Temperature regime. The MMLU solver uses $T \geq 0.5$ for diversity (reasoning paths are more varied at higher temperature), while the IFEval solver uses $T \geq 0.3$ (lower diversity suffices because constraint violations are structural, not stochastic).

Evolved MMLU solver (self-consistency + verification)

```
def solver(agent, task):
    cfg = normalize_prompt_config(agent.prompt_config)
    msg = [{"role": "user", "content": task}]

    # Phase 1: 3-way self-consistency (temp >= 0.5)
    resps = call_llm(msg, temp=max(cfg["T"], 0.5), n=3)
    votes, best = {}, {}
    for r in resps:
        a = extract_answer(r)          # coerce to A/B/C/D
        if a in {"A", "B", "C", "D"}:
            votes[a] = votes.get(a, 0) + 1
            best.setdefault(a, r)      # keep the first response per label

    top = max(votes, key=votes.get)
    if votes[top] >= 2:                # consensus: early exit
        return best[top]

    # Phase 2: verification tiebreak (temp = 0)
    v = call_llm(
        f"Previous answer: {top}\nReasoning: "
        f"{best[top]['reasoning'][:600]}\n"
        "Confirm or correct.", temp=0.0, n=1)
    # Fall back to the plurality response if verification yields no valid label
    return extract_answer(v[0]) or best[top]
```

Figure 4: Simplified evolved solver for MMLU. `call_llm` wraps `action_call_json_format_llm`; `extract_answer` wraps the coercion and regex extraction pipeline.

Evolved IFEval solver (constraint scoring + targeted repair)

```
def solver(agent, task):
    cfg = normalize_prompt_config(agent.prompt_config)
    sys = build_system_prompt(cfg)
    msg = [{"role":"system","content":sys},
           {"role":"user","content":task}]

    # Phase 1: generate 3 candidates (temp >= 0.3)
    resps = call_llm(msg, temp=max(cfg["T"], 0.3),
                      n=3, fmt="text")
    candidates = [extract_text(r) for r in resps]

    # Phase 2: pick best via constraint self-check
    best, best_score = None, -1.0
    for c in candidates:
        s = self_check_constraints(agent, task, c)
        if s > best_score:
            best, best_score = c, s
    if best_score >= 1.0:          # all constraints pass
        return {"response": best}

    # Phase 3: targeted repair (temp = 0)
    repair_msg = msg + [
        {"role":"assistant","content":best},
        {"role":"user","content":
         "The response above violates some constraints."
         " Rewrite so EVERY constraint is satisfied."
         " Output ONLY the corrected response."}]
    fixed = call_llm(repair_msg, temp=0.0, n=1, fmt="text")

    # Accept repair only if it improves
    fix_s = self_check_constraints(
        agent, task, extract_text(fixed[0]))
    if fix_s >= best_score:
        return {"response": extract_text(fixed[0])}
    return {"response": best}
```

Figure 5: Simplified evolved solver for IFEval. `self_check_constraints` prompts the model to enumerate constraints and score compliance; `call_llm` wraps `action_call_llm` with `response_format="text"`. Compare with the MMLU solver in Figure [4](https://arxiv.org/html/2605.23019#A1.F4).

##### Implications for SLM-driven self-evolution.

The cross\-benchmark comparison surfaces several properties that distinguish SLM self\-evolution from conventional prompt engineering or human\-designed inference pipelines:

1. Emergent architectural transfer. The generate–select–refine skeleton emerges independently on both tasks, suggesting that SLMs possess implicit knowledge of effective inference-time compute patterns. This is notable because the 9B model was never explicitly trained on meta-reasoning about solver design; the pattern is *discovered* through the interplay of failure analysis, credit assignment, and code generation.
2. Constraint-aware adaptation without oracle access. On IFEval, the agent invents a self-check scoring mechanism that approximates the ground-truth verifier (`verify_all_instructions`) without ever observing it. The self-check is imperfect, since it relies on the same frozen model that generated the response, but it provides a sufficient signal to rank candidates and gate repairs. This demonstrates that SLMs can construct *task-specific evaluation proxies* as part of structural evolution, partially compensating for the absence of external reward models.
3. Cost-aware design under implicit budgets. Both solvers include early-exit conditions (consensus for MMLU, full-pass for IFEval) that avoid unnecessary refinement calls. The agent is not given an explicit cost objective, yet it discovers short-circuit logic that reduces average inference cost. This suggests that the validation gate mechanism, which rejects candidates that increase cost without proportional accuracy gain, implicitly teaches the agent to prefer efficient architectures.
4. Robustness through conservative acceptance. The IFEval solver accepts a repair only if it scores at least as well as the best candidate, preventing the repair pass from introducing new violations. Similarly, the MMLU solver falls back to the plurality answer when verification produces an invalid label. Both patterns reflect a *do-no-harm* principle that the agent learns from the empirical validation loop: edits that cannot be shown to help are rejected.
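The early-exit and do-no-harm patterns in items 3 and 4 can be isolated from the LLM calls and exercised on their own. The following is a self-contained sketch of the MMLU-style consensus rule, not the evolved solver itself: `vote_with_early_exit` and its `verify` callback are hypothetical names, and `verify` stands in for the extra verification call.

```python
from collections import Counter

VALID_LABELS = {"A", "B", "C", "D"}


def vote_with_early_exit(labels, verify, quorum=2):
    """Return (answer, used_verifier) for a list of sampled labels."""
    counts = Counter(label for label in labels if label in VALID_LABELS)
    if not counts:
        return None, False                    # nothing usable was sampled
    top, top_count = counts.most_common(1)[0]
    if top_count >= quorum:
        return top, False                     # early exit: no extra call
    verified = verify(top)                    # one extra call on disagreement
    if verified in VALID_LABELS:
        return verified, True
    return top, True                          # do no harm: keep plurality answer


consensus = vote_with_early_exit(["A", "A", "C"], verify=lambda top: "B")
corrected = vote_with_early_exit(["A", "B", "C"], verify=lambda top: "D")
fallback = vote_with_early_exit(["A", "B", "C"], verify=lambda top: "?")
```

The verifier is consulted only when no label reaches the quorum, and its output replaces the plurality answer only when it is itself a valid label.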

##### Unique strengths of SLM\-powered evolution\.

Taken together, these observations highlight capabilities that are distinctive to the self\-evolving SLM paradigm:

- Zero-shot solver design. The agent synthesises multi-phase inference strategies from scratch, without few-shot examples of solver code or access to a library of known techniques. A 9B model autonomously rediscovers self-consistency \[Wang et al., [2023](https://arxiv.org/html/2605.23019#bib.bib27)\] for MMLU and invents a generate–score–repair loop for IFEval, strategies that required separate research efforts when designed by humans.
- Benchmark-specific structural creativity. Rather than applying a single universal strategy, the agent tailors control flow, output format, temperature, and refinement logic to the evaluation protocol of each benchmark. This level of task-specific adaptation is difficult to achieve with prompt-only methods or fixed inference pipelines.
- Self-improving without stronger teachers. The entire evolution loop (prompt mutation, failure analysis, code generation, and empirical validation) is executed by the *same* frozen SLM that serves as the inference engine. No larger teacher model, human feedback, or gradient updates are required. The improvement comes purely from better *use* of the model's existing capabilities through evolved control logic.
- Graceful degradation via empirical gating. The credit-assignment and validation gate mechanisms ensure that structural exploration is both *triggered* and *validated* empirically. When the SLM proposes a flawed edit, which happens frequently at 9B scale, the gate rejects it without degrading the current solver. This makes the evolution process monotonically non-regressing in expectation, a property that is critical for deploying autonomous agents in practice.

Table [11](https://arxiv.org/html/2605.23019#A1.T11) and Figures [4](https://arxiv.org/html/2605.23019#A1.F4)–[5](https://arxiv.org/html/2605.23019#A1.F5) together illustrate that the PACE framework enables a single frozen SLM to act as both the *subject* and the *architect* of its own inference pipeline, discovering task-appropriate strategies that would traditionally require separate human engineering effort for each benchmark.

### A.8 $\tau$-bench Evolution Analysis

We provide a detailed analysis of the PACE evolution trajectory on $\tau$-bench Retail using Qwen3.5-9B, tracing both the failure taxonomy shifts and the accuracy progression across evolution phases.

#### A.8.1 Failure Taxonomy

Table [12](https://arxiv.org/html/2605.23019#A1.T12) reports the failure category breakdown at each evolution checkpoint. The $\tau$-bench evaluation framework classifies failures into four categories: *missing\_actions* (the agent omitted a required tool call), *extra\_actions* (the agent invoked an unnecessary or incorrect tool), *tool\_errors* (the agent called a tool with invalid arguments, e.g., querying a non-existent user), and *early\_stop* (the user simulator terminated the conversation prematurely due to agent confusion or unresponsiveness).

Table 12: Failure taxonomy across evolution phases for Qwen3.5-9B on $\tau$-bench Retail (115 tasks, single-trial mid-loop evaluation). Categories are non-exclusive: a single failed task may exhibit multiple failure types (e.g., an extra tool call that also triggers a tool error and leads to early user termination).

Three patterns emerge from the taxonomy progression:

1. *Tool errors are halved by structural evolution.* Tool errors increase slightly under PE (9 $\to$ 11) but drop sharply to 4 after the structural edit, representing the largest relative reduction among all categories. This indicates that structural evolution addresses failure modes that prompt refinement cannot: the evolved control logic adds pre-call validation (e.g., verifying order status before executing irreversible actions) and enforces the single-call-per-order constraint that the SLM frequently violates under prompt-only guidance.
2. *Early stop remains the dominant failure mode throughout.* Early termination accounts for 70–80% of failures at every checkpoint. These failures arise when the stochastic user simulator loses patience with the agent's responses, a signal that is difficult to address through either prompt or structural changes alone, as it depends on the model's conversational fluency and the simulator's tolerance threshold.
3. *Missing actions decrease only after structural intervention.* Prompt evolution leaves missing\_actions unchanged (9 $\to$ 9), while the structural edit reduces it to 7. The evolved solver's requirement to “collect ALL items into a single list before making the call” prevents the common failure pattern where the agent processes items sequentially and exhausts the one-call-per-order limit before completing all user requests.

![Refer to caption](https://arxiv.org/html/2605.23019v1/x4.png)

Figure 6: Evolution trajectory for Qwen3.5-9B on $\tau$-bench Retail (115 tasks, single-trial validation during evolution). The +PE baseline (blue, dashed) saturates at 0.809 after two rounds. PACE (red, solid) matches PE during the prompt phase, then achieves a discrete jump to 0.835 when the structural edit is accepted at step 3. Subsequent PE rounds (gray crosses) regress and are rejected by the validation gate, preserving the post-structural peak. The green dashed line marks the final 4-trial $\text{pass}^1 = 0.817$.

#### A.8.2 Evolution Trajectory

Figure [6](https://arxiv.org/html/2605.23019#A1.F6) visualizes the full evolution trajectory.

The trajectory reveals several characteristics specific to multi\-turn agent evolution:

1. *A single structural edit provides the largest discrete jump.* The accepted edit (step 3) improves validation accuracy from 0.809 to 0.835 (+2.6 pp), the largest single-step gain in the trajectory. The edit adds tool-call pre-validation and enforces the single-call-per-order constraint, directly addressing the `tool_errors` and `missing_actions` categories identified in the failure taxonomy.
2. *The validation gate prevents post-peak regression.* Steps 4–6 show that subsequent PE rounds after the structural edit consistently produce worse results (0.774, 0.783, 0.791), all rejected by the rollback mechanism. This illustrates the monotonic non-regression property of PACE: once a good structural configuration is found, the gate prevents destabilizing changes.
3. *Multi-trial evaluation is lower than single-trial mid-loop.* The final 4-trial evaluation ($\text{pass}^1 = 0.817$) is lower than the best validation score (0.835). This gap reflects the inherent variance of multi-turn dialogue with a stochastic user simulator: a solver that passes 96/115 tasks in one trial may fail different subsets in subsequent trials. The $\text{pass}^4$ improvement over vanilla (0.661 vs. 0.626) indicates that PACE's structural edits improve robustness across trials, not just peak single-run performance.
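For readers unfamiliar with the $\text{pass}^k$ metric used above, the sketch below assumes the estimator commonly associated with $\tau$-bench: for each task with $c$ successes out of $n$ trials, take $\binom{c}{k} / \binom{n}{k}$, then average over tasks. The function name and the toy success counts are invented for illustration; the text does not spell out the estimator.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Average over tasks of C(c, k) / C(n, k).

    Assumed estimator: the chance that k trials drawn from the n observed
    trials of a task all succeed, averaged over tasks.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    total = comb(num_trials, k)
    return sum(comb(c, k) for c in successes_per_task) / (total * len(successes_per_task))


# Toy example: three tasks, four trials each, with 4, 2, and 0 successes.
toy_pass_1 = pass_hat_k([4, 2, 0], num_trials=4, k=1)
toy_pass_4 = pass_hat_k([4, 2, 0], num_trials=4, k=4)
```

In the toy example $\text{pass}^1$ is the plain success rate, while $\text{pass}^4$ only credits the task that succeeded in all four trials, which is why $\text{pass}^4$ is the stricter robustness measure.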