PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

arXiv cs.AI 06/09/26, 04:00 AM Papers
Summary
PACE introduces an anytime-valid commit gate for self-evolving agents that replaces greedy acceptance with a sequential hypothesis test, controlling false-commit probability and reducing churn while matching performance with lower variance.
arXiv:2606.08106v1 Announce Type: new Abstract: Self-evolving agents improve by repeatedly proposing changes to their own prompts, skills, or workflows and keeping those that score higher on a small held-out set. Almost all effort has gone into the proposer that generates candidates; we argue the weak point is the acceptor, the rule that decides whether to commit a change. Applied hundreds of times against the same noisy dev estimate, the ubiquitous "keep it if the score went up" rule is uncontrolled adaptive multiple testing: the agent effectively p-hacks itself, accumulating false commits that make it churn and drift rather than improve. We recast committing as a sequential hypothesis test and propose PACE (Paired Anytime-valid Commit Evaluation), a training-free, anytime-valid commit gate. Each candidate is compared to the incumbent on identical instances and committed only when a testing-by-betting e-process accumulates decisive evidence, stopping early to save evaluations and controlling each candidate's false-commit probability at a user-set level even under optional stopping (a per-decision guarantee). On Qwen2.5 agents (0.5B-3B) self-evolving at the prompt level on GSM8K, SVAMP, and ARC-Challenge, greedy acceptance commits 30-42% false and 10-33% harmful edits when a genuine improvement is hidden among noisy proposals, while PACE commits the real one and essentially nothing else, matching greedy's held-out accuracy at sharply lower variance and about 18% lower evaluation cost. With no real gain available, greedy commits 13-21 spurious self-modifications per run (72-100% false) and degrades the most fragile agent by 4.9 points, while PACE holds at baseline. Reliability of self-evolution depends on the acceptor, not only on the proposer.
# PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents
Source: [https://arxiv.org/html/2606.08106](https://arxiv.org/html/2606.08106)
###### Abstract

Self-evolving agents improve by repeatedly proposing modifications to their own prompts, skills, or workflows and keeping those that score higher on a small held-out set. Nearly all effort has gone into the *proposer*; we identify the *acceptor* (the rule that decides whether to commit a change) as the loop’s silent weak point. Applied hundreds of times against the same noisy dev estimate, the ubiquitous “keep it if the score went up” rule is uncontrolled adaptive multiple testing. The agent effectively *p-hacks itself*, accumulating *false commits* that make it churn and drift rather than improve. We recast committing as a sequential hypothesis test and propose PACE (*Paired Anytime-valid Commit Evaluation*), a training-free *anytime-valid commit gate*: each candidate is compared to the incumbent on identical instances and committed only when a testing-by-betting e-process accumulates decisive evidence, stopping early to save evaluations and controlling *each candidate’s* false-commit probability at a user-set level even under optional stopping (a per-decision guarantee, not a run-level one). We evaluate this on Qwen2.5 agents (0.5B–3B) self-evolving at the prompt level on three tasks (GSM8K, SVAMP, ARC-Challenge): a deliberately minimal testbed that isolates the accept decision, and on which PACE cleanly separates real gains from noise. With a known beneficial edit hidden among noisy proposals, greedy acceptance commits 30–42% false and 10–33% harmful modifications, while PACE commits the real improvement and essentially *nothing else* (0/5 audit-labelled false commits), matching greedy’s held-out accuracy at sharply lower variance ($+0.74 \pm 0.04$ vs. $+0.54 \pm 0.30$ at 3B) and $\sim$18% lower evaluation cost.
With a stochastic agent and *no* real gain available, greedy commits 13–21 spurious self-modifications per run (72–100% false), churning and degrading the most fragile agent by 4.9 points, while PACE commits almost nothing and holds at baseline. Reliability of self-evolution depends on the acceptor, not only on the proposer.

## 1 Introduction

Self-evolving agents now rewrite their own prompts, induce reusable skills, edit their controlling code, or restructure multi-agent workflows, and they have made rapid progress (Hu et al., [2024](https://arxiv.org/html/2606.08106#bib.bib1); Yin et al., [2024](https://arxiv.org/html/2606.08106#bib.bib2); Zhang et al., [2025a](https://arxiv.org/html/2606.08106#bib.bib3); Wang et al., [2023](https://arxiv.org/html/2606.08106#bib.bib5); Khattab et al., [2024](https://arxiv.org/html/2606.08106#bib.bib7)). Yet nearly all of that progress comes from building better *proposers*: mechanisms that generate candidate modifications. The decision that closes the loop, whether to *commit* a proposed change, has been left to a single unexamined heuristic. This omission is where self-evolution most often goes wrong: a good acceptor keeps the genuine gains a proposer finds and rejects the rest, cheaply.

![Refer to caption](https://arxiv.org/html/2606.08106v1/figures/pace_overview_cartoon.png)
Figure 1: Overview of PACE. A self-evolving agent proposes modifications and must decide whether to commit each. Greedy acceptance keeps any candidate whose reused dev score ticks up, *p-hacking* a noisy, recycled signal. PACE instead runs an anytime-valid paired test (McNemar discordant pairs + a betting e-process), committing only when evidence crosses $E \geq 1/\alpha$, which bounds *each candidate’s* false-commit probability at $\alpha$ under optional stopping. A fresh held-out pool audits decisions for measurement only.

In practice the answer is overwhelmingly the same heuristic: measure the candidate on a small held-out set and keep it if the score went up. Some systems add lightweight safeguards (a second validation split, occasional human inspection, task-specific filters), but the operative rule that runs unattended, every round, remains “commit iff the dev score improved.” This paper makes a simple observation with sharp consequences. A self-evolution run applies this accept rule hundreds of times against the *same*, *noisy* estimate of quality. Statistically, this is adaptive multiple testing on a reused validation signal. Just as a researcher who tests many hypotheses on one dataset and keeps whatever reaches significance produces a flood of false discoveries, a self-evolving agent that keeps every change that bumped a small dev score accumulates *false commits*: modifications that helped the estimate by chance but do not help (or actively hurt) true performance. The agent then churns. It constantly modifies itself, drifts, and at best wastes compute; at worst it degrades. A self-evolving agent with a greedy acceptor thus runs an unregistered, never-corrected sequence of trials against one tiny validation set and believes the winners.

We make this failure mode precise and give a simple statistical control for it\. Our contributions:

- A diagnosis. We frame the commit step of self-evolution as a sequence of hypothesis tests and show that greedy acceptance is uncontrolled adaptive testing, predicting, and empirically exhibiting, a high rate of false and harmful commits (§[3](https://arxiv.org/html/2606.08106#S3)).
- A method (PACE). We propose *PACE*, an anytime-valid commit gate: a training-free wrapper that compares each candidate to the incumbent on identical instances and commits only when a testing-by-betting e-process clears a calibrated threshold, stopping early to minimize evaluations (§[4](https://arxiv.org/html/2606.08106#S4)). The method is deliberately simple (a single paired sequential test, $\sim$10 lines) and applies to any self-modification loop whose incumbent and candidate can be scored on shared instances (binary correctness, or pairwise preferences), consuming only those paired outcomes rather than the proposer’s internals. To our knowledge this is the first *explicit* treatment of the accept step in agent self-evolution as an anytime-valid hypothesis test, building on a long line of sequential and safe-anytime-valid inference (Wald, [1947](https://arxiv.org/html/2606.08106#bib.bib24); Shafer, [2021](https://arxiv.org/html/2606.08106#bib.bib22); Ramdas et al., [2023](https://arxiv.org/html/2606.08106#bib.bib23)); the contribution is this *abstraction*, locating an anytime-valid test at the commit step, not the test itself (which is standard).
- Evidence. Across Qwen2.5 agents (0.5B–3B) self-evolving on three tasks spanning arithmetic (GSM8K, SVAMP) and multiple-choice science (ARC-Challenge), PACE drives greedy’s 30–100% audit-labelled false-commit rate to $\approx 0\%$ and its 13–21 noise commits per run to $\approx 0$, while preserving genuine gains, in both a controlled regime with a known beneficial edit and a realistic stochastic regime, at lower evaluation cost than greedy (§[5](https://arxiv.org/html/2606.08106#S5)).

We study this acceptor in prompt-level self-evolution: deliberately the *minimal* self-modification loop, where any churn or degradation traces to the accept rule rather than to a complex proposer or a brittle execution substrate. By interface the gate is proposer- and loop-agnostic, consuming only paired correctness outcomes, so richer loops (skills, code, multi-agent topologies) inherit the very same decision. We demonstrate it only on prompt evolution, however, and so treat system-level generality as argued, not shown.

## 2 Related Work

#### Self\-evolving agents\.

Recent systems search over agent designs, code, prompts, and skills, retaining candidates by benchmark performance: ADAS programs agents in code (Hu et al., [2024](https://arxiv.org/html/2606.08106#bib.bib1)), the Gödel Agent rewrites its own logic (Yin et al., [2024](https://arxiv.org/html/2606.08106#bib.bib2)), the Darwin Gödel Machine evolves self-modifying coding agents (Zhang et al., [2025a](https://arxiv.org/html/2606.08106#bib.bib3)), Voyager grows a skill library (Wang et al., [2023](https://arxiv.org/html/2606.08106#bib.bib5)), and Agent Workflow Memory induces reusable routines (Wang et al., [2024](https://arxiv.org/html/2606.08106#bib.bib6)); see Gao and others ([2025](https://arxiv.org/html/2606.08106#bib.bib4)) for a survey. All select modifications by an empirical score and explicitly flag objective hacking and stability of self-modification as open problems; none treat acceptance as a statistical decision, which is the gap we fill.

#### Prompt and workflow optimization\.

A parallel line optimizes the scaffold around a frozen model: few-shot demonstrations and instructions (DSPy (Khattab et al., [2024](https://arxiv.org/html/2606.08106#bib.bib7))), text “gradients” (TextGrad (Yuksekgonul et al., [2024](https://arxiv.org/html/2606.08106#bib.bib8))), instruction search (OPRO (Yang et al., [2024](https://arxiv.org/html/2606.08106#bib.bib9))), reflective or evolutionary prompt search (GEPA (Agrawal and others, [2025](https://arxiv.org/html/2606.08106#bib.bib10)), Promptbreeder (Fernando et al., [2023](https://arxiv.org/html/2606.08106#bib.bib12)), EvoPrompt (Guo et al., [2024](https://arxiv.org/html/2606.08106#bib.bib27))), and workflow search (AFlow (Zhang et al., [2025b](https://arxiv.org/html/2606.08106#bib.bib11))). These optimize a scalar fitness on a fixed development set and are known to overfit it; we address the orthogonal question of whether a candidate should be committed at all.

#### Self\-improvement and its pathologies\.

Loop-based self-improvement (STaR (Zelikman et al., [2022](https://arxiv.org/html/2606.08106#bib.bib13)), Reflexion (Shinn et al., [2023](https://arxiv.org/html/2606.08106#bib.bib14)), Self-Refine (Madaan et al., [2023](https://arxiv.org/html/2606.08106#bib.bib15)), Self-Rewarding LMs (Yuan et al., [2024](https://arxiv.org/html/2606.08106#bib.bib16))) is bounded by the reliability of self-evaluation: intrinsic self-correction can degrade reasoning (Huang et al., [2024](https://arxiv.org/html/2606.08106#bib.bib17)), the generation–verification gap governs and eventually closes self-improvement (Song et al., [2025](https://arxiv.org/html/2606.08106#bib.bib18)), unanchored loops are reward-hacked and collapse (Shafayat et al., [2025](https://arxiv.org/html/2606.08106#bib.bib19)), and recursive self-training erodes diversity (Shumailov et al., [2024](https://arxiv.org/html/2606.08106#bib.bib20)). These analyze the *signal* that self-improvement produces; we address the *decision rule* that consumes it.

#### Sequential and anytime\-valid testing\.

PACE draws on safe, anytime-valid inference: e-processes and testing-by-betting yield nonnegative martingales whose maxima are controlled by Ville’s inequality, enabling tests that remain valid under optional stopping (Wald, [1947](https://arxiv.org/html/2606.08106#bib.bib24); Shafer, [2021](https://arxiv.org/html/2606.08106#bib.bib22); Ramdas et al., [2023](https://arxiv.org/html/2606.08106#bib.bib23)); online error-control procedures bound error over streams of tests (Ramdas et al., [2017](https://arxiv.org/html/2606.08106#bib.bib21)). We import this machinery into the agent self-evolution loop, where, to our knowledge, commit decisions are currently made by unguarded point estimates.

## 3 The Commit Decision in Self-Evolution

#### Setup\.

A self-evolving agent maintains a configuration $c_t$ (here, a system prompt) and runs for $T$ rounds. At round $t$ a proposer suggests a modification $c_t'$ (add/rewrite/delete/merge an instruction). An *accept rule* $A$ decides whether to commit ($c_{t+1} = c_t'$) or reject ($c_{t+1} = c_t$). Quality is read through a small held-out set $D$ of size $n$ via accuracy $\hat{s}_D(c)$, an estimate of the agent’s true accuracy $s(c)$.

#### Greedy acceptance is adaptive multiple testing\.

The near\-universal rule is greedy:

$$A_{\text{greedy}}:\; \text{commit} \iff \hat{s}_D(c_t') > \hat{s}_D(c_t). \tag{1}$$

Because $D$ is small and reused every round, $\hat{s}_D$ is a noisy estimate of $s$, and over $T$ rounds the agent performs $T$ comparisons against the same noise. Consider what happens when no real improvement is available ($s(c_t') \leq s(c_t)$ for every round): the candidate still wins the comparison with non-trivial probability simply because $\hat{s}_D$ fluctuates. Each such accepted change is a *false commit*, and a *harmful commit* if true accuracy strictly decreases. Greedy therefore accumulates false commits at a rate set by the dev-set noise, independent of whether real improvements exist: the agent keeps modifying itself on noise. This is precisely the structure of adaptive multiple testing on a reused dataset: the agent is running $T$ unregistered, uncorrected significance tests and keeping whatever happens to look good.
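The argument above can be checked with a small simulation. The sketch below is illustrative only and is not the paper's experimental setup: it assumes a simplified null world in which incumbent and candidate share the same true accuracy and are scored independently on `n_dev` items each round, so every commit made by Eq. (1) is a false commit by construction. The function name and default values are hypothetical.

```python
import random


def greedy_false_commit_rate(true_acc=0.6, n_dev=40, rounds=2000, seed=0):
    """Fraction of rounds in which greedy commits a candidate that is no better.

    Assumption (for illustration): both configurations have the same true
    accuracy, and each is scored on n_dev independent Bernoulli outcomes.
    """
    rng = random.Random(seed)
    commits = 0
    for _ in range(rounds):
        # Noisy dev scores: number of correct answers out of n_dev.
        incumbent = sum(rng.random() < true_acc for _ in range(n_dev))
        candidate = sum(rng.random() < true_acc for _ in range(n_dev))
        if candidate > incumbent:  # Eq. (1): commit iff the dev score went up
            commits += 1
    return commits / rounds


if __name__ == "__main__":
    print(f"false-commit rate under the null: {greedy_false_commit_rate():.2f}")
```

In this toy null world the candidate beats the incumbent whenever the two noisy scores do not tie and the coin falls its way, so the rate sits a little under one half regardless of how many rounds are run.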

#### Why the obvious fixes fall short\.

This is a textbook case of *adaptive data analysis*: reusing one validation set to steer a long, data-dependent sequence of choices invalidates its naive estimates and inflates false discoveries (Dwork et al., [2015](https://arxiv.org/html/2606.08106#bib.bib28); Blum and Hardt, [2015](https://arxiv.org/html/2606.08106#bib.bib29)). The standard remedies sit poorly in an autonomous loop. A larger dev set only postpones the problem: its noise shrinks as $1/\sqrt{n}$ while the number of adaptive comparisons grows with the run. A fixed multiplicity correction (Bonferroni or $\alpha$-spending) must know the number of tests in advance, which an open-ended run does not, and spends its budget so fast that genuine gains are missed (§[5](https://arxiv.org/html/2606.08106#S5)). Refreshing the holdout every round restores validity but assumes a stream of fresh labeled data the agent rarely has. What the loop actually needs is a test that stays valid under an *unbounded, adaptively chosen* number of looks at the same data: an anytime-valid sequential test, which is what the next section builds.

#### What we want\.

A good accept rule should commit a modification only when there is reliable evidence that it improves true accuracy, while \(i\) keeping genuine improvements, \(ii\) spending as little evaluation as possible, and \(iii\) requiring no training\. The next section gives such a rule\.

## 4 PACE: An Anytime-Valid Commit Gate

#### Paired evaluation\.

A naive comparison of $\hat{s}_D(c_t)$ and $\hat{s}_D(c_t')$ conflates two sources of variance: which instances are easy, and whether the candidate is genuinely better. Pairing removes the first. We evaluate the incumbent $c_t$ and candidate $c_t'$ on the *same* instances. For instance $i$ let $w_i = 1$ if the candidate is correct and the incumbent is wrong, $w_i = 0$ if the reverse, and discard ties (both right or both wrong), a McNemar-style paired comparison (McNemar, [1947](https://arxiv.org/html/2606.08106#bib.bib30)). Under the null “the candidate is not better,” a discordant pair favors the candidate with probability at most one half, with equality when the two configurations are equally good: $\Pr[w_i = 1] \leq \tfrac{1}{2}$.
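The pairing step can be sketched in a few lines. This is a minimal illustration, assuming per-instance correctness is already available as two aligned boolean sequences; the function name `discordant_outcomes` is hypothetical and not from the paper.

```python
def discordant_outcomes(incumbent_correct, candidate_correct):
    """Yield w_i for each discordant pair, skipping ties.

    w_i = 1 when the candidate is right and the incumbent is wrong,
    w_i = 0 when the incumbent is right and the candidate is wrong.
    """
    for inc_ok, cand_ok in zip(incumbent_correct, candidate_correct):
        if inc_ok == cand_ok:
            continue  # tie: both right or both wrong, carries no evidence
        yield 1 if cand_ok else 0


if __name__ == "__main__":
    incumbent = [True, False, False, True, False]
    candidate = [True, True, False, False, True]
    print(list(discordant_outcomes(incumbent, candidate)))  # [1, 0, 1]
```

Only three of the five instances in the usage example are discordant; the two ties are dropped, which is why the gate's cost depends on how often the two configurations disagree rather than on the raw dev size.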

#### Testing by betting\.

The idea behind an e-process is simple: treat testing as a betting game. We start with wealth $E_0 = 1$ and bet a fraction $\lambda$ of our wealth on each discordant pair being a “win” ($w_i = 1$). If the null holds, the bet is fair and wealth stays near $1$ on average; if the candidate is truly better, wins outnumber losses and wealth grows. Concretely, after each discordant pair we update

$$E \leftarrow E \cdot \bigl(1 + \lambda\,(2w_i - 1)\bigr), \qquad \lambda \in [0,1). \tag{2}$$

At the boundary of the null ($\Pr[w_i = 1] = \tfrac{1}{2}$) we have $\mathbb{E}[2w_i - 1] = 0$, so $E$ is a nonnegative martingale with $\mathbb{E}[E] = 1$ (a supermartingale if the candidate is strictly worse); by Ville’s inequality $\Pr[\sup_t E_t \geq 1/\alpha] \leq \alpha$ (Shafer, [2021](https://arxiv.org/html/2606.08106#bib.bib22); Ramdas et al., [2023](https://arxiv.org/html/2606.08106#bib.bib23)). We therefore commit as soon as $E \geq 1/\alpha$, which controls the false-commit probability at level $\alpha$ *at any stopping time*. Because the test is anytime-valid, we may evaluate instances incrementally and stop the moment the evidence is conclusive; if the evaluation budget is exhausted without crossing the threshold, the candidate is rejected. We use $\alpha = 0.05$, $\lambda = 0.5$, and a fixed batch size unless noted.
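To make the wealth update concrete, the following sketch applies Eq. (2) to a stream of discordant outcomes and stops at the first crossing of $1/\alpha$. It is an illustration under the stated defaults ($\alpha = 0.05$, $\lambda = 0.5$); the function name and return shape are assumptions made here, not the authors' code.

```python
def run_e_process(outcomes, alpha=0.05, lam=0.5):
    """Apply Eq. (2) to discordant outcomes; stop once wealth reaches 1/alpha.

    Returns (committed, final_wealth, pairs_used).
    """
    if not (0.0 < alpha < 1.0) or not (0.0 <= lam < 1.0):
        raise ValueError("need 0 < alpha < 1 and 0 <= lam < 1")
    wealth = 1.0  # E_0 = 1
    used = 0
    for w in outcomes:
        used += 1
        wealth *= 1.0 + lam * (2 * w - 1)  # x(1 + lam) on a win, x(1 - lam) on a loss
        if wealth >= 1.0 / alpha:
            return True, wealth, used  # decisive evidence: commit and stop early
    return False, wealth, used  # budget exhausted without crossing: reject


if __name__ == "__main__":
    print(run_e_process([1] * 8))         # (True, 25.62890625, 8)
    print(run_e_process([1] * 7))         # (False, 17.0859375, 7)
    print(run_e_process([0] + [1] * 10))  # one early loss: commits after 11 pairs
```

With these defaults each win multiplies wealth by 1.5 and each loss halves it, so eight straight wins are the fastest route past the threshold of 20, and a single loss costs two additional wins to recover.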

#### Guarantee\.

The control is anytime-valid. Let $w_1, w_2, \dots$ be the discordant-pair outcomes in evaluation order and $\mathcal{F}_i = \sigma(w_1, \dots, w_i)$. Under the null $H_0$ that the candidate is not better, so that $\Pr[w_i = 1 \mid \mathcal{F}_{i-1}] \leq \tfrac{1}{2}$, the wealth $E_i = \prod_{j \leq i} \bigl(1 + \lambda(2w_j - 1)\bigr)$ is a nonnegative supermartingale with $E_0 = 1$ and $\mathbb{E}_{H_0}[E_i] \leq 1$. For the commit time $\tau = \inf\{i : E_i \geq 1/\alpha\}$, Ville’s inequality gives $\Pr_{H_0}[\tau < \infty] \leq \alpha$: the probability of *ever* committing a non-improving candidate is at most $\alpha$, for any $\alpha \in (0,1)$, any $\lambda \in [0,1)$, and any data-dependent stopping. Two scope points keep this honest. First, validity is *per candidate*: for a fixed $c_t'$, the discordant outcomes form the evaluation stream and, under $H_0$, each is conditionally fair ($\Pr[w_i = 1 \mid \mathcal{F}_{i-1}] \leq \tfrac{1}{2}$), so $E_i$ is a supermartingale and *that candidate’s* false-commit probability is $\leq \alpha$. This is not a run-level familywise bound: across many tested non-improving candidates the *expected* number of false commits grows at most as $\alpha$ per candidate; but this already defuses greedy’s pathology, whose per-candidate false rate is set by the dev-set noise (typically far above $\alpha$) and compounds every round. Second, the null concerns *this* candidate against the incumbent on paired instances; the adaptive *reuse* of $D$ to generate the *next* candidate is greedy’s problem, and because PACE gates each decision independently at $\alpha$ this per-decision control curbs it in practice.
The theorem covers each candidate conditional on its own evaluation stream; we treat the loop-level adaptivity as an empirical matter, not a theoretical claim. What the guarantee does *not* control is power: a genuine improvement is committed only if its evidence crosses $1/\alpha$ within the evaluation budget (§[5](https://arxiv.org/html/2606.08106#S5) and Table [4](https://arxiv.org/html/2606.08106#A1.T4) probe this empirically).
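The supermartingale step behind this guarantee can be written out in one line. The derivation below is a sketch that uses only the definitions in this section; the symbol $p_i$ is shorthand introduced here for the conditional win probability and does not appear in the paper.

```latex
% p_i := \Pr[w_i = 1 \mid \mathcal{F}_{i-1}] \le 1/2 under H_0 (shorthand for this sketch)
\begin{align*}
\mathbb{E}_{H_0}\bigl[E_i \mid \mathcal{F}_{i-1}\bigr]
  &= E_{i-1}\bigl(p_i(1+\lambda) + (1-p_i)(1-\lambda)\bigr) \\
  &= E_{i-1}\bigl(1 + \lambda(2p_i - 1)\bigr) \\
  &\le E_{i-1}, \qquad \text{since } \lambda \ge 0 \text{ and } 2p_i - 1 \le 0 .
\end{align*}
% Ville's inequality for a nonnegative supermartingale with E_0 = 1:
\[
\Pr_{H_0}\Bigl[\sup_{i} E_i \ge 1/\alpha\Bigr] \le \alpha
\quad\Longrightarrow\quad
\Pr_{H_0}[\tau < \infty] \le \alpha .
\]
```

The first display is where $\lambda \in [0,1)$ matters: $\lambda \geq 0$ gives the inequality, and $\lambda < 1$ keeps every factor $1 \pm \lambda$ strictly positive, so the wealth never reaches zero and stays nonnegative as Ville's inequality requires.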

Algorithm 1: Self-evolution with the PACE commit gate

1: input: agent config $c$, dev set $D$, level $\alpha$, bet $\lambda$

2: for $t = 1$ to $T$ do

3: $c' \leftarrow \mathrm{Propose}(c)$

4: $E \leftarrow 1$

5: for instance $i$ drawn from $D$ do

6: evaluate $c, c'$ on $i$; if tie then continue

7: $w \leftarrow \mathbf{1}[c' \text{ right},\, c \text{ wrong}]$

8: $E \leftarrow E\,(1 + \lambda(2w - 1))$

9: if $E \geq 1/\alpha$ then $c \leftarrow c'$; break {commit}

10: end for

11: end for

12: return $c$
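Algorithm 1 translates almost line for line into Python. The sketch below is one possible rendering, not the authors' implementation: `propose` and `is_correct` are hypothetical callables supplied by the caller, and shuffling the dev set each round is an assumption made here about what "drawn from $D$" means.

```python
import random


def pace_gate(incumbent, candidate, instances, is_correct, alpha=0.05, lam=0.5):
    """Return True iff the candidate's wealth reaches 1/alpha on `instances`."""
    wealth = 1.0
    for instance in instances:
        inc_ok = is_correct(incumbent, instance)
        cand_ok = is_correct(candidate, instance)
        if inc_ok == cand_ok:
            continue  # tie: skip (Algorithm 1, line 6)
        w = 1 if cand_ok else 0
        wealth *= 1.0 + lam * (2 * w - 1)  # line 8
        if wealth >= 1.0 / alpha:
            return True  # line 9: commit and stop evaluating
    return False  # dev set exhausted without decisive evidence: reject


def self_evolve(config, dev_set, propose, is_correct, rounds,
                alpha=0.05, lam=0.5, seed=0):
    """Run `rounds` propose/gate steps and return the final configuration."""
    rng = random.Random(seed)
    for _ in range(rounds):
        candidate = propose(config)
        order = list(dev_set)
        rng.shuffle(order)  # assumption: instances are visited in random order
        if pace_gate(config, candidate, order, is_correct, alpha, lam):
            config = candidate
    return config


if __name__ == "__main__":
    # Toy world: a config is an integer "skill"; it solves instance i iff i < skill.
    solves = lambda skill, i: i < skill
    print(pace_gate(10, 30, range(40), solves))  # True: 20 discordant wins available
    print(pace_gate(10, 10, range(40), solves))  # False: every pair is a tie
```

Keeping the gate as a separate function reflects the interface claim in the text: it consumes only paired correctness outcomes, so the same `pace_gate` could sit behind any proposer.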

#### Why so simple, and why anytime\-valid\.

A long stream of tests, one might expect, should require an online false-discovery-rate correction across rounds. We found one unnecessary and, worse, over-conservative: because PACE gates each decision independently at $\alpha$ (§[4](https://arxiv.org/html/2606.08106#S4)), reuse of $D$ does not compound the per-decision error the way it does for greedy, whereas a decaying-memory FDR variant was so cautious that it sometimes *missed the genuine improvement entirely*. (As above, this is per-decision control, not a run-level familywise bound.) We therefore keep PACE as the single paired anytime-valid test of Algorithm [1](https://arxiv.org/html/2606.08106#alg1). Anytime-validity, not just pairing, is what earns its keep. A *fixed*-$n$ paired test ties with PACE when the gain is obvious (Table [1](https://arxiv.org/html/2606.08106#S5.T1)), but it must fix $n$ in advance: too small misses marginal gains, too large wastes evaluation. The e-process instead adapts the number of pairs to the evidence, which is why PACE alone holds 0% false at a near-constant $\Delta$ across the full $\alpha$/$n$ sweep (Table [4](https://arxiv.org/html/2606.08106#A1.T4)) where the fixed-$n$ and FDR variants degrade. It adds no model calls beyond the evaluations greedy already performs, and typically fewer, because it stops early.

## 5 Experiments

Our experiments answer three questions. (Q1) Does greedy acceptance actually accumulate false and harmful commits, as the diagnosis predicts? (Q2) Does PACE prevent them while keeping genuine improvements, and at what evaluation cost? (Q3) Is the effect robust across model scales, across tasks and domains, and across PACE’s own knobs? A controlled regime and a sensitivity sweep answer Q1–Q2 and the knob part of Q3; a stochastic regime and two additional tasks answer the rest.

#### Agents and task\.

The agent is a frozen Qwen2.5-Instruct model (Qwen Team, [2024](https://arxiv.org/html/2606.08106#bib.bib26)) (0.5B, 1.5B, 3B) steered by an editable system prompt; it solves GSM8K grade-school math (Cobbe et al., [2021](https://arxiv.org/html/2606.08106#bib.bib25)), scored by exact match against the dataset’s ground-truth answer. A proposer (a separate LLM) suggests one prompt edit per round. We use disjoint splits: a small reused dev split ($n = 40$) and a fresh audit pool ($n = 120$) used *only* for measurement, which the accept rule never sees. All numbers are averages over 5 seeds.

#### Metrics\.

For every decision we audit the change on a large, fresh audit pool ($n = 120$, disjoint from dev, which the accept rule never sees). A commit is a *false commit* if its audited accuracy change is $\leq 0$, and *harmful* if $< 0$. We report the false- and harmful-commit rates among commits, the end-to-end accuracy change $\Delta$ measured on the audit pool, and the evaluation cost (number of dev problems scored). Two caveats keep this honest. The audit is a finite estimate, so these are *observed* (audit-labelled) false commits, not ground truth; but auditing at temperature 0 on a pool far larger than dev makes the reported gaps (greedy 30–100% vs. gates 0%) dwarf its sampling error. And the audit measures the *realized* outcome, whereas $\alpha$ bounds the *prospective* false-commit probability (§[4](https://arxiv.org/html/2606.08106#S4)); the two coincide here.
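The two commit labels reduce to sign checks on the audited accuracy change. The helper below is a minimal sketch of that bookkeeping; the function name and the example deltas are hypothetical and are not values from the paper.

```python
def commit_rates(audit_deltas):
    """Summarize committed edits by their audited accuracy change.

    audit_deltas: one audited accuracy change per committed edit.
    A commit is false if its delta is <= 0 and harmful if its delta is < 0.
    Rates are None when nothing was committed (empty denominator).
    """
    n = len(audit_deltas)
    if n == 0:
        return {"commits": 0, "false_rate": None, "harmful_rate": None}
    n_false = sum(d <= 0 for d in audit_deltas)
    n_harmful = sum(d < 0 for d in audit_deltas)
    return {
        "commits": n,
        "false_rate": n_false / n,
        "harmful_rate": n_harmful / n,
    }


if __name__ == "__main__":
    # Hypothetical run: one helpful commit, one neutral, one harmful.
    print(commit_rates([0.05, 0.0, -0.025]))
    print(commit_rates([]))
```

Returning `None` for an empty commit list makes explicit that a rate over zero commits is undefined rather than 0%, which matters when a gate commits almost nothing.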

#### Two regimes\.

The two play complementary roles: a *diagnostic* with a known ground-truth answer (a unit test for acceptors), and a *realistic* stochastic deployment with no planted gain; neither is a full agent-*system* deployment, which we leave to future work. (i) Controlled: we seed the agent with one deliberately harmful instruction that the proposer can remove for a large, *known* true gain; the run thus contains exactly one big real improvement amid the proposer’s genuinely noisy edits. This isolates whether an accept rule *keeps the real gain while rejecting noise*. (ii) Stochastic: the agent is sampled at temperature 0.7 (as deployed agents are), so the dev signal is genuinely noisy even though no large improvement is available; this is the realism evidence, testing whether an accept rule *avoids chasing sampling noise*. In the stochastic regime the audit is measured at temperature 0 (true quality) while the agent’s own dev signal is sampled.

### 5.1 Controlled: keep the real gain, reject noise

Table [1](https://arxiv.org/html/2606.08106#S5.T1) compares greedy, a fixed-$n$ paired test, and our anytime-valid gate (5 seeds; $\pm$ is SD across seeds). The contrast is direct: greedy commits 3–3.4 changes per run, 30–42% false and 10–33% harmful; both statistical rules commit exactly the one real improvement and *nothing else* (0% false, 0% harmful). At 1.5B all three rules reach the same held-out gain ($\Delta \approx +0.57$), so here the gate’s advantage is hygiene rather than accuracy: it keeps the real improvement while committing none of greedy’s noise, at lower cost. At 3B the difference is one of *variance*: the gate reaches a tight $\Delta = +0.74 \pm 0.04$ while greedy is a high-variance $+0.54 \pm 0.30$, because greedy’s occasional harmful commits cause large drops on some seeds. We therefore read the gate’s value as *reliability* (it removes the downside) rather than a guaranteed mean margin (the $+0.20$ gap is within seed noise at $n = 5$). The gate also costs the least: by stopping each comparison once the evidence is conclusive it uses $\sim$18% fewer dev evaluations than greedy (e.g. 1712 vs. 2080 paired problems at 1.5B), though greedy runs no test at all. The weakest model (0.5B) cannot do GSM8K even with the handicap removed, so no real gain exists for it in this regime; we use 0.5B only in the stochastic regime below.

To check that the gate is not merely exploiting a large, obvious gain, we repeated the 1.5B run with a *milder* handicap (a soft discouragement of reasoning rather than a hard override), leaving a smaller true gain of $\approx +0.18$. The gate still captures most of it ($\Delta = +0.14$ at 0% false), while greedy takes the slightly larger raw gain ($+0.18$) at the cost of 17% false and 17% harmful commits. This is the expected precision/recall trade-off: with a weaker signal the anytime-valid test gives up a little recall to preserve its false-commit control (0% observed here).

Table 1: Controlled regime (5 seeds; $\pm$ is SD). Greedy commits many false/harmful changes; the gate keeps only the genuine improvement (0/5 audit-labelled false commits). At 1.5B all rules tie on held-out $\Delta$ (the gate wins on hygiene + cost); at 3B the gate matches greedy’s mean with far lower variance ($\pm 0.04$ vs. $\pm 0.30$). Cmt: commits per run. False/Harm are *audit-estimated* (retrospective, temp-0); the gate’s $\alpha$ bounds the *prospective* false-commit probability.

### 5.2 Stochastic: don’t chase sampling noise

Table [2](https://arxiv.org/html/2606.08106#S5.T2) reports the realistic regime and Figure [2](https://arxiv.org/html/2606.08106#S5.F2) shows the dynamics. A sampled agent makes the dev estimate fluctuate, and greedy reads up-fluctuations as improvements: it commits a stream of 13.3–20.7 self-modifications per run (SD $\leq 3.9$), 72–100% of which a temperature-0 audit deems false. By contrast *every* statistical gate (paired fixed-$n$, online-FDR, and ours) commits $<1$ per run: the family uniformly refuses to chase the noise, with greedy the lone offender. This commit-rate gap is the robust headline. The churn is not always harmless: on the most fragile agent (0.5B) the accumulated noise-chasing edits drive true held-out accuracy *down* by $4.9 \pm 3.0$ points; at 1.5B the mean drop (3.1 points) is real but seed-noisy ($\pm 7.0$), and the robust 3B agent escapes net damage despite a 100% false-commit rate. The *guaranteed* cost of greedy, then, is the churn itself (continual, unjustified self-modification, a stability and compute liability in deployment), while held-out degradation is an additional risk that materializes on weaker agents. The gate holds held-out accuracy at baseline throughout (Figure [2](https://arxiv.org/html/2606.08106#S5.F2)).

![Refer to caption](https://arxiv.org/html/2606.08106v1/x1.png)
Figure 2: Stochastic-regime dynamics (Qwen2.5-1.5B; mean $\pm$ SE over 3 seeds). (a) Greedy reads sampling up-fluctuations as wins and commits continually ($\approx$15 edits over the run), while the gate commits essentially nothing. (b) Those commits are not free: greedy’s true held-out accuracy drifts below where it started, whereas the gate holds at the baseline line.

Table 2: Stochastic regime, agent sampled at $T = 0.7$ (3 seeds; $\pm$ is SD). Greedy commits 13–21 self-modifications/run (72–100% false by a temp-0 audit) and degrades the fragile 0.5B agent; every statistical gate commits $<1$ and holds $\Delta$ at baseline. False% is *audit-estimated* (retrospective), not the prospective $\alpha$ bound, and for the gates is over their $<1$ commit/run (a near-empty denominator).

### 5.3 Sensitivity

The gate exposes two knobs, the level $\alpha$ and the dev size $n$, and a natural worry is that its zero false-commit rate is merely bought by setting them conservatively. Table [4](https://arxiv.org/html/2606.08106#A1.T4) (Appendix [A](https://arxiv.org/html/2606.08106#A1)) shows otherwise. Sweeping $\alpha$ over a $10\times$ range (0.01 to 0.10) and $n$ over a $4\times$ range (20 to 80), the gate commits the single real improvement at *every* setting, with 0% false commits and a near-constant held-out gain $\Delta \approx +0.57$; greedy stays 33–53% false throughout. The anytime-valid e-process is what buys this stability: because it accumulates evidence across instances rather than thresholding one fixed-$n$ estimate, it neither misses the gain when $\alpha$ is tightened nor fires on noise when $n$ is small. A fixed-schedule online-FDR alternative behaves far worse: it lost the genuine improvement entirely at $\alpha = 0.01$ ($\Delta = +0.000$) and recovered only part of it at $n = 20$ ($\Delta = +0.41$), which is precisely why our gate is a single betting test rather than a multiplicity correction.
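A back-of-envelope calculation helps explain why a large planted gain survives the whole $\alpha$ sweep. The sketch below computes the fewest discordant pairs that can cross $1/\alpha$ when every pair is a win, using the default $\lambda = 0.5$; it is arithmetic on Eq. (2), not an analysis from the paper, and the function name is hypothetical.

```python
def min_consecutive_wins(alpha, lam=0.5):
    """Smallest k with (1 + lam)**k >= 1/alpha, i.e. wins needed if no loss occurs."""
    if not (0.0 < alpha < 1.0) or not (0.0 < lam < 1.0):
        raise ValueError("need 0 < alpha < 1 and 0 < lam < 1")
    wealth, k = 1.0, 0
    while wealth < 1.0 / alpha:
        wealth *= 1.0 + lam  # each uninterrupted win multiplies wealth by (1 + lam)
        k += 1
    return k


if __name__ == "__main__":
    for alpha in (0.01, 0.05, 0.10):
        print(alpha, min_consecutive_wins(alpha))  # 12, 8 and 6 wins respectively
```

Tightening $\alpha$ tenfold, from 0.10 to 0.01, only doubles the best-case evidence requirement from 6 to 12 discordant wins, because wealth grows geometrically. A candidate that wins nearly every discordant pair can therefore still cross the threshold on a small dev set, provided the set contains that many discordant instances.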

### 5.4 Generalization across tasks and domains

To test that the diagnosis and the fix are not artifacts of GSM8K, we repeat both regimes on SVAMP (Patel et al., [2021](https://arxiv.org/html/2606.08106#bib.bib32)), a second arithmetic benchmark, and ARC-Challenge (Clark et al., [2018](https://arxiv.org/html/2606.08106#bib.bib33)), a multiple-choice science benchmark and thus a *non-math* domain. Table [5](https://arxiv.org/html/2606.08106#A1.T5) (Appendix [A](https://arxiv.org/html/2606.08106#A1)) shows the same pattern throughout. In the stochastic regime greedy commits 12–16 self-modifications per run (67–93% false) and ends at or below baseline, while the gate commits essentially none and holds. In the controlled regime the gate again captures the planted improvement at 0% false and matches greedy's held-out gain (e.g. $+0.45$ vs. $+0.45$ on SVAMP-3B, where greedy pays with 11% false commits). The effect is thus a property of the *accept rule under noisy evaluation*, not of any single task or domain.

## 6 Discussion and Limitations

The gate helps precisely where a self-evolution loop faces exploitable evaluation noise (a small or reused dev set, or a stochastic agent), which is the common case. When the base configuration is already near-optimal and the task is not prompt-sensitive, no rule (including greedy) commits much: on un-handicapped GSM8K, where instruct-tuned Qwen2.5 already sits near its prompt-induced ceiling, greedy and the gate alike commit essentially nothing ($\Delta \approx 0$), so the gate costs nothing when there is nothing to decide. The danger the gate removes is the asymmetric one, churning on noise, and it removes it without sacrificing genuine gains.
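To make "exploitable evaluation noise" concrete, the sketch below computes exactly how often a "keep it if the score went up" rule would accept a candidate that is truly no better than the incumbent. The setup is an assumption for illustration, not the paper's experimental protocol: each of `n` paired dev instances is independently a candidate-only win with probability `q`, an incumbent-only win with the same probability `q`, and a tie otherwise. The function name and the value of `q` are hypothetical.

```python
from math import comb


def greedy_false_accept_prob(n: int, q: float) -> float:
    """P(candidate dev score > incumbent dev score) for a no-gain candidate.

    n: number of paired dev instances.
    q: per-instance probability of a candidate-only win, which equals the
       probability of an incumbent-only win under the assumed null.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if not 0.0 <= q <= 0.5:
        raise ValueError("q must lie in [0, 0.5]")

    # Probability that wins and losses tie exactly: k wins, k losses,
    # and n - 2k concordant instances (a multinomial term for each k).
    p_tie = 0.0
    for k in range(n // 2 + 1):
        ways = comb(n, k) * comb(n - k, k)
        p_tie += ways * q ** (2 * k) * (1.0 - 2.0 * q) ** (n - 2 * k)

    # By symmetry, "score went up" and "score went down" are equally likely.
    return (1.0 - p_tie) / 2.0
```

Under these assumptions, with the default dev size $n = 40$ and a hypothetical $q = 0.1$, the acceptance probability for a no-gain candidate comes out at roughly $0.43$ per proposal. Nothing bounds this rate for greedy acceptance, so spurious commits accumulate with every further proposal; this is the churn the gate is designed to remove.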

#### When (not) to use it.

The price of the gate's safety is power, and that power scales with effect size: a large true gain crosses the e-process threshold in a handful of discordant pairs, a marginal gain needs many, and a gain too small for the evaluation budget to resolve is rejected. This is exactly the milder-handicap case, where the gate kept 78% of a $+0.18$ improvement rather than all of it. The gate is therefore the wrong tool when genuine gains are expected to be tiny and a *missed* improvement is costlier than a wrongly committed one; there, a higher $\alpha$ or a larger dev budget trades precision back for recall. In the far more common regime (noisy, reused evaluation where spurious commits are the dominant failure) it is a sensible default, at no greater cost than greedy. Finally, the gate controls false commits but inherits the dev set's coverage: it cannot certify improvements the dev set cannot see.
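The dependence of power on effect size can be illustrated with a back-of-the-envelope calculation. The sketch assumes a fixed-fraction betting test on discordant pairs (a simplification, not necessarily the paper's betting strategy) and uses the expected log-growth of the wealth per discordant pair to approximate how many such pairs are needed to reach $1/\alpha$. It ignores randomness and threshold overshoot, and the function and parameter names are hypothetical.

```python
import math


def approx_discordant_pairs_needed(
    win_rate: float,
    alpha: float = 0.05,
    bet: float = 0.5,
) -> float:
    """Rough number of discordant pairs for the wealth to reach 1/alpha.

    win_rate: fraction of discordant pairs the candidate wins
              (0.5 means no true gain).
    Returns math.inf when the expected log-growth is not positive, i.e. the
    gain is too small for this bet size to resolve.
    """
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError("win_rate must lie in [0, 1]")
    if not 0.0 < alpha < 1.0 or not 0.0 < bet < 1.0:
        raise ValueError("alpha and bet must lie in (0, 1)")

    # Expected log-wealth gained per discordant pair.
    growth = win_rate * math.log1p(bet) + (1.0 - win_rate) * math.log1p(-bet)
    if growth <= 0.0:
        return math.inf
    return math.ceil(math.log(1.0 / alpha) / growth)
```

With these illustrative settings, a candidate that wins every discordant pair needs about 8 of them, one that wins 75% needs about 23, and one that wins only 60% has non-positive expected growth at this bet size, so the approximation returns infinity. A higher $\alpha$ lowers the threshold and a larger dev budget supplies more discordant pairs, which is the precision-for-recall trade described above.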

Table 3: What PACE claims, what supports each claim, and its scope. The guarantee is per-decision; breadth across systems and model families is argued from the interface, not demonstrated.

#### What PACE does and does not claim.

Table [3](https://arxiv.org/html/2606.08106#S6.T3) summarizes the contract: PACE provides a per-decision false-commit bound under the paired null and empirically eliminates greedy's 30–100% false commits and the resulting degradation. It does not claim higher power, a new statistic, or system-level generality beyond its interface. Three caveats remain. First, we embed PACE in our own prompt-evolution loop rather than a third-party system (DSPy, ADAS); the interface is identical, but cross-system validation remains future work. Second, the evidence uses one model family and one proposer, so proposer-agnosticism is argued at the interface level, not demonstrated. Third, greedy's held-out *degradation* is strongest on small, fragile agents; the universally robust effect is false-commit suppression, not always a large accuracy rescue.
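The per-decision bound can be written compactly using Ville's inequality, the standard result behind anytime-valid tests. The notation is assumed here for illustration and may differ from the symbols used in §4: $E_t$ is a candidate's e-process after $t$ paired instances, $H_0$ is the paired null (the candidate is no better than the incumbent), and $\tau$ is any stopping time.

```latex
% Under H_0, (E_t) is a nonnegative supermartingale with E_0 = 1.
\Pr_{H_0}\!\left( \exists\, t \ge 1 : E_t \ge \tfrac{1}{\alpha} \right) \;\le\; \alpha
\quad\Longrightarrow\quad
\Pr_{H_0}\!\left( E_\tau \ge \tfrac{1}{\alpha} \right) \;\le\; \alpha
\quad \text{for every stopping time } \tau .
```

The bound applies to one candidate at a time, which is why stopping early or continuing to evaluate does not inflate that candidate's false-commit probability; it does not by itself give a run-level guarantee.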

## 7 Conclusion

Self-evolving agents spend almost all of their effort proposing changes and almost none deciding whether to keep them; the unguarded commit step then accumulates false acceptances on a reused dev set. We made this failure mode precise (the commit step is uncontrolled adaptive testing) and showed that *PACE*, a training-free paired anytime-valid test, bounds *each candidate's* false-commit probability at a user-set level (per decision, not run-level) while preserving genuine gains, at lower cost than greedy. The acceptor is an overlooked piece of the self-evolution loop.

## References

- L. A. Agrawal et al. (2025) GEPA: reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457.
- A. Blum and M. Hardt (2015) The ladder: a reliable leaderboard for machine learning competitions. In International Conference on Machine Learning (ICML), pp. 1006–1014.
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- K. Cobbe, V. Kosaraju, M. Bavarian, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth (2015) The reusable holdout: preserving validity in adaptive data analysis. Science 349(6248), pp. 636–638.
- C. Fernando, D. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel (2023) Promptbreeder: self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797.
- H. Gao et al. (2025) A survey of self-evolving agents: on path to artificial super intelligence. arXiv preprint arXiv:2507.21046.
- Q. Guo, R. Wang, J. Guo, B. Li, K. Song, et al. (2024) Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In International Conference on Learning Representations (ICLR).
- S. Hu, C. Lu, and J. Clune (2024) Automated design of agentic systems. arXiv preprint arXiv:2408.08435.
- J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, and D. Zhou (2024) Large language models cannot self-correct reasoning yet. In International Conference on Learning Representations (ICLR).
- O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, et al. (2024) DSPy: compiling declarative language model calls into self-improving pipelines. In International Conference on Learning Representations (ICLR).
- A. Madaan, N. Tandon, P. Gupta, et al. (2023) Self-refine: iterative refinement with self-feedback. In Advances in Neural Information Processing Systems (NeurIPS).
- Q. McNemar (1947) Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2), pp. 153–157.
- A. Patel, S. Bhattamishra, and N. Goyal (2021) Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 2080–2094.
- Qwen Team (2024) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- A. Ramdas, P. Grünwald, V. Vovk, and G. Shafer (2023) Game-theoretic statistics and safe anytime-valid inference. Statistical Science 38(4), pp. 576–601.
- A. Ramdas, F. Yang, M. J. Wainwright, and M. I. Jordan (2017) Online control of the false discovery rate with decaying memory. In Advances in Neural Information Processing Systems (NeurIPS).
- S. Shafayat, F. Tajwar, R. Salakhutdinov, J. Schneider, and A. Zanette (2025) Can large reasoning models self-train? arXiv preprint arXiv:2505.21444.
- G. Shafer (2021) Testing by betting: a strategy for statistical and scientific communication. Journal of the Royal Statistical Society: Series A 184(2), pp. 407–431.
- N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS).
- I. Shumailov, Z. Shumaylov, Y. Zhao, N. Papernot, R. Anderson, and Y. Gal (2024) AI models collapse when trained on recursively generated data. Nature 631, pp. 755–759.
- Y. Song, H. Zhang, C. Eisenach, S. Kakade, D. Foster, and U. Ghai (2025) Mind the gap: examining the self-improvement capabilities of large language models. In International Conference on Learning Representations (ICLR).
- A. Wald (1947) Sequential analysis. John Wiley & Sons.
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023) Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- Z. Z. Wang, J. Mao, D. Fried, and G. Neubig (2024) Agent workflow memory. arXiv preprint arXiv:2409.07429.
- C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen (2024) Large language models as optimizers. In International Conference on Learning Representations (ICLR).
- X. Yin, X. Wang, L. Pan, L. Lin, X. Wan, and W. Y. Wang (2024) Gödel agent: a self-referential agent framework for recursive self-improvement. arXiv preprint arXiv:2410.04444.
- W. Yuan, R. Y. Pang, K. Cho, X. Li, S. Sukhbaatar, J. Xu, and J. Weston (2024) Self-rewarding language models. In International Conference on Machine Learning (ICML).
- M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, Z. Huang, C. Guestrin, and J. Zou (2024) TextGrad: automatic "differentiation" via text. arXiv preprint arXiv:2406.07496.
- E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman (2022) STaR: bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems (NeurIPS).
- J. Zhang, S. Hu, C. Lu, R. Lange, and J. Clune (2025a) Darwin gödel machine: open-ended evolution of self-improving agents. arXiv preprint arXiv:2505.22954.
- J. Zhang, J. Xiang, Z. Yu, et al. (2025b) AFlow: automating agentic workflow generation. In International Conference on Learning Representations (ICLR).

## Appendix A Additional Results

Full sensitivity (Table [4](https://arxiv.org/html/2606.08106#A1.T4)) and cross-task generalization (Table [5](https://arxiv.org/html/2606.08106#A1.T5)) results, summarized in §[5](https://arxiv.org/html/2606.08106#S5), are collected here.

Table 4: Sensitivity (1.5B, controlled regime, 3 seeds; default $\alpha = 0.05$, $n = 40$). Over a $10\times$ range of $\alpha$ and a $4\times$ range of dev size $n$, the gate commits the one real improvement with 0% false commits and an essentially constant $\Delta$, while greedy remains 33–53% false. The gate is insensitive to both knobs. (False% is audit-estimated; $\alpha$ bounds the prospective rate.)

Table 5: Generalization to a second arithmetic task (SVAMP) and a non-math task (ARC-Challenge), 3 seeds. Controlled (ctrl): the gate keeps the planted gain at 0% false. Stochastic (stoch): greedy churns 12–16 edits per run (67–93% false) and drifts down, while the gate commits $\approx 0$ and holds. Cmt: commits per run; False% is audit-estimated ($\alpha$ bounds the prospective rate).