Don't Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems

arXiv cs.AI 06/03/26, 04:00 AM Papers
ai-driven-research llm analysis-framework optimization automated-discovery np-hard-problems
Summary
The paper introduces GAMBLe, a framework that decomposes AI-Driven Research Systems into generator, assessor, discovery mechanism, and budget, revealing how component interactions shape optimization landscapes. Experiments on NP-hard problems show no universally best configuration, emphasizing the need for careful component selection.
arXiv:2606.02863v1 Announce Type: new Abstract: AI-Driven Research Systems (ADRS) -- systems coupling LLMs with automated evaluation to discover algorithms, proofs, and designs -- are being optimized and adopted across domains, but the tools to analyze them have not kept pace. ADRS performance depends on component interactions that are poorly understood, expensive to explore, and (as we show) not well captured by standard convergence guarantees. These guarantees rely on structural assumptions that do not hold under the ADRS process we formalize. We introduce GAMBLe, a framework that decomposes ADRS behavior into four parameters (generator $G$, assessor $\mathcal{A}$, discovery mechanism $\mathcal{M}$, budget $B$) and one compositional object, the effective landscape $L_{\text{eff}} = \mathcal{A} \circ G$, which reveals that distinct generator-assessor pairs induce structurally different per-problem optimization landscapes. We exercise the framework on 760+ replicated runs (>46,000 iterations) spanning generators from single LLMs to dynamically-adaptive ensembles, mechanisms from greedy selection to co-evolutionary meta-search, and three NP-hard problems whose assessors range from continuous scoring to cliff functions. The experiments reveal no total ordering of generators or mechanisms: frontier models can underperform open-source alternatives and the simplest mechanism sometimes outperforms state-of-the-art meta-search. Results show that even under limited budgets (60 iterations per run), the right component choices can improve performance by 13-67% and search efficiency by 6-39x.
Original Article
View Cached Full Text
Cached at: 06/03/26, 09:41 AM
# Don’t gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems
Source: [https://arxiv.org/html/2606.02863](https://arxiv.org/html/2606.02863)
###### Abstract

AI\-Driven Research Systems \(ADRS\)—systems coupling LLMs with automated evaluation to discover algorithms, proofs, and designs—are being optimized and adopted across domains, but the tools to analyze them have not kept pace\. ADRS performance depends on component interactions that are poorly understood, expensive to explore, and \(as we show\) not well captured by standard convergence guarantees\. These guarantees rely on structural assumptions that do not hold under the ADRS process we formalize\. We introduce GAMBLe, a framework that decomposes ADRS behavior into four parameters \(generatorGG, assessor𝒜\\mathcal\{A\}, discovery mechanismℳ\\mathcal\{M\}, budgetBB\) and one compositional object, the*effective landscape*Leff=𝒜∘GL\_\{\\text\{eff\}\}=\\mathcal\{A\}\\circ G, which reveals that distinct generator–assessor pairs induce structurally different per\-problem optimization landscapes\. We exercise the framework on760\+760\{\+\}replicated runs \(\>\>46,000 iterations\) spanning generators from single LLMs to dynamically\-adaptive ensembles, mechanisms from greedy selection to co\-evolutionary meta\-search, and three NP\-hard problems whose assessors range from continuous scoring to cliff functions\. The experiments reveal no total ordering of generators or mechanisms: frontier models can underperform open\-source alternatives and the simplest mechanism sometimes outperforms state\-of\-the\-art meta\-search\. Results show that even under limited budgets \(6060iterations per run\), the right component choices can improve performance by1313–67%67\\%and search efficiency by66–39×39\\times\.

## 1Introduction

Recent systems for automated scientific and algorithmic discovery \(FunSearch\(Romera\-Paredeset al\.,[2024](https://arxiv.org/html/2606.02863#bib.bib11)\), AlphaEvolve\(Novikovet al\.,[2025](https://arxiv.org/html/2606.02863#bib.bib12)\), LEVI\(Tanveer,[2026](https://arxiv.org/html/2606.02863#bib.bib9)\), and others\) share a common architecture: a language model generates candidate solutions, a scoring function evaluates them, and a search algorithm directs the process by selecting parents, constructing prompts, and adapting strategy over time\. We refer to these asAI\-Driven Research Systems\(ADRS\), followingChenget al\.\([2025b](https://arxiv.org/html/2606.02863#bib.bib3)\)\. These systems have produced new mathematical lower bounds on the cap set problem, improved matrix multiplication algorithms, and competitive solutions to open optimization challenges\. The architecture can work; the question is*when and why*it works well across problems\.

We identify the minimal decomposition needed to analyze ADRS behavior: thegeneratorGG\(the entire candidate\-producing system\), theassessor𝒜\\mathcal\{A\}\(the evaluation system\), and thediscovery mechanismℳ\\mathcal\{M\}\(the search algorithm and its configuration\), operating under a computational budgetBBto explore a given problem landscapeLL\. Recent engineering advances have expanded each component’s design space:GGfrom single models to agentic multi\-model systems\(Hamadanianet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib31); Qu and Lu,[2026](https://arxiv.org/html/2606.02863#bib.bib23)\),𝒜\\mathcal\{A\}from scalar scoring to rich feedback\(Agrawalet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib5); Chenget al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib2)\), andℳ\\mathcal\{M\}from greedy to co\-evolutionary meta\-search\(Liuet al\.,[2026a](https://arxiv.org/html/2606.02863#bib.bib8); Cemriet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib7)\)\. Exhaustive search over the resulting configuration space is quickly becoming intractable: each iteration consumes LLM inference \(or, for compound generators, multiple coordinated calls\), runs span tens to hundreds of iterations, and building confidence in near\-optimality requires replication across runs \(as we show\), multiplying costs further\. Without principled characterization, the compute, energy, token, and expert time costs of navigating this space grow with the complexity of the system\.

ADRS lack theoretical foundations for efficient application and optimization\. At least four sources of variability make it difficult to identify which component is limiting a given system: generator sensitivity,G×ℳG\\times\\mathcal\{M\}interaction, configuration sensitivity, and run\-to\-run variance\. We formalize their origins \(Section[2](https://arxiv.org/html/2606.02863#S2)\) and expose each empirically \(Section[3](https://arxiv.org/html/2606.02863#S3)\)\.

#### Contributions\.

We show that standard analytical tools’ assumptions are often violated in ADRS, and introduce the minimal machinery necessary to begin bridging the gap:

1. 1\.We formalize the ADRS process and prove that the best\-score process\{st∗\}\\\{s^\{\*\}\_\{t\}\\\}isnot Markov\(Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)\)\. The full state\(Dt,ℳt\)\(D\_\{t\},\\mathcal\{M\}\_\{t\}\)*is*Markov but evolves in agrowing\-dimensionalspace \(the history gains an entry every step\)\. While this process can be embedded in a fixed infinite\-dimensional state space, many standard quantitative convergence analyses rely on structural assumptions—fixed representations, stationary objectives, stable transition operators—that are often violated in ADRS due to history\-dependent context construction and adaptive mechanisms\. Moreover, natural scalar summaries such asst∗s^\{\*\}\_\{t\}are not, in general, sufficient statistics for the process, so convergence behavior is not fully determined by low\-dimensional progress metrics alone\. Under history\-preserving context construction, initial conditions propagate: runs starting from different seeds may follow persistently different trajectories even at the samest∗s^\{\*\}\_\{t\}\(Appendix[J](https://arxiv.org/html/2606.02863#A10)\)\.
2. 2\.We define theeffective landscapeLeff=𝒜∘GL\_\{\\text\{eff\}\}=\\mathcal\{A\}\\circ Gand show that different generators can induce structurally different landscapes on the same problem \(Theorem[4](https://arxiv.org/html/2606.02863#Thmtheorem4)\); generator sensitivity in our data reflects structural differences rather than purely sampling noise\. Ensemble generators canescape barriersthat trap any single generator\.
3. 3\.We define thegenerator ceilings∞∗\(G,𝒜\)s^\{\*\}\_\{\\infty\}\(G,\\mathcal\{A\}\)andsystem ceilings∞∗\(G,𝒜,ℳ\)s^\{\*\}\_\{\\infty\}\(G,\\mathcal\{A\},\\mathcal\{M\}\)and derive aregime classification\(GG\-limited,𝒜\\mathcal\{A\}\-limited,ℳ\\mathcal\{M\}\-limited, saturated\) that identifies the binding constraint with targeted evaluations rather than exhaustive ablation\.
4. 4\.We validate the framework empirically across760\+760\{\+\}replicated runs totaling\>46,000\{\>\}46\{,\}000iterations, spanning 12 generators from 6 model families \(static through dynamically adaptive\), 3 search mechanisms, and 3 NP\-hard problems whose assessors range from rich continuous scoring to cliff functions\. The data reveals basin structure,G×ℳG\\times\\mathcal\{M\}interaction, and regime diversity including𝒜\\mathcal\{A\}\-limited configurations where no generator or mechanism can make progress \(Figure[1](https://arxiv.org/html/2606.02863#S3.F1); Section[3](https://arxiv.org/html/2606.02863#S3)\)\.

#### Related work\.

LeffL\_\{\\text\{eff\}\}extends fitness landscape theory\(Kauffman,[1993](https://arxiv.org/html/2606.02863#bib.bib21)\)to ADRS, where the generating operator is context\-dependent, stochastic, and potentially adaptive \(Appendix[G](https://arxiv.org/html/2606.02863#A7)\)\. Unlike AutoML\(Feureret al\.,[2015](https://arxiv.org/html/2606.02863#bib.bib22)\), our framework addresses non\-additive component interactions\. Our\(G,𝒜,ℳ\)\(G,\\mathcal\{A\},\\mathcal\{M\}\)decomposition is analytical rather than architectural; we map it to[Chenget al\.](https://arxiv.org/html/2606.02863#bib.bib3)’s five\-component description in Appendix[E](https://arxiv.org/html/2606.02863#A5)\. Several concurrent groups have observed generator sensitivity,G×ℳG\\times\\mathcal\{M\}interaction, and run\-to\-run variance empirically\(Lehmanet al\.,[2022](https://arxiv.org/html/2606.02863#bib.bib24); Chenget al\.,[2025a](https://arxiv.org/html/2606.02863#bib.bib4); Liuet al\.,[2026b](https://arxiv.org/html/2606.02863#bib.bib6),[a](https://arxiv.org/html/2606.02863#bib.bib8); Cemriet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib7); Tanveer,[2026](https://arxiv.org/html/2606.02863#bib.bib9); Agrawalet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib5); Chenget al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib2); Hamadanianet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib31); Karimiet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib32); Qu and Lu,[2026](https://arxiv.org/html/2606.02863#bib.bib23)\); none provides a theoretical account of why these phenomena arise or how to diagnose the limiting factor with targeted evaluations rather than exhaustive ablation \(Appendix[L](https://arxiv.org/html/2606.02863#A12)\)\.

## 2The GAMBLe framework

We define the GAMBLe framework, beginning with a process model of howGG,𝒜\\mathcal\{A\}, andℳ\\mathcal\{M\}interact within a run, then prove two structural results: non\-Markov best\-score dynamics, and generator\-dependent effective landscapes\. ThegeneratorGGproduces candidates given context; theassessormaps candidates to scores,𝒜:𝒳→ℝ\\mathcal\{A\}:\\mathcal\{X\}\\to\\mathbb\{R\}, where𝒳\\mathcal\{X\}is the candidate space defined by the problemlandscapeLL; and thediscovery mechanismℳ\\mathcal\{M\}directs exploration, encompassing parent selection, prompt construction, parameter adaptation, and potentially meta\-level strategy evolution\. Crucially,𝒜\\mathcal\{A\}’s signal reachesGGonly throughℳ\\mathcal\{M\}’s context construction:ℳ\\mathcal\{M\}selects which scores, candidates, and history to surface, so whatGG“sees” of the landscape depends on both𝒜\\mathcal\{A\}’s fidelity andℳ\\mathcal\{M\}’s filtering\. When𝒜\\mathcal\{A\}cannot distinguish candidates, noℳ\\mathcal\{M\}can provide useful signal—𝒜\\mathcal\{A\}can be a binding constraint\. We formalize𝒜\\mathcal\{A\}as scalar\-valued; structured feedback\(Agrawalet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib5)\)enters through context constructionCCrather than extending𝒜\\mathcal\{A\}’s codomain, so the theorems apply to\{st∗\}\\\{s^\{\*\}\_\{t\}\\\}regardless of feedback richness \(feedback affectsLeffL\_\{\\text\{eff\}\}geometry viaCC’s effectiveness\)\. Notation is summarized in Appendix[A](https://arxiv.org/html/2606.02863#A1); all assumptions are stated formally in Appendix[B](https://arxiv.org/html/2606.02863#A2); alternative formulations considered are in Appendix[D](https://arxiv.org/html/2606.02863#A4)\.

Every ADRS produces arun historyDtD\_\{t\}: the append\-only record of all data produced through steptt\.ℳ\\mathcal\{M\}accesses the history through context constructionCC, which determines what subset ofDtD\_\{t\}the generator sees\. Without history\-dependent context construction, the process degenerates to i\.i\.d\. sampling and gains come only from drawing more samples, not from learning to generate better candidates\.

###### Definition 1\(ADRS trajectory\)\.

An ADRS run produces states\(D0,ℳ0\)→⋯→\(DT,ℳT\)\(D\_\{0\},\\mathcal\{M\}\_\{0\}\)\\to\\cdots\\to\(D\_\{T\},\\mathcal\{M\}\_\{T\}\), whereDt=\{\(xi,si\)\}i=1tD\_\{t\}=\\\{\(x\_\{i\},s\_\{i\}\)\\\}\_\{i=1\}^\{t\}is the run history \(projected to candidate\-score pairs\) andℳt\\mathcal\{M\}\_\{t\}is the mechanism state at steptt\.At each step, the system\(1\)constructs contextct=C\(Dt,ℳt\)c\_\{t\}=C\(D\_\{t\},\\mathcal\{M\}\_\{t\}\),\(2\)generates a candidatext\+1∼G\(⋅∣ct\)x\_\{t\+1\}\\sim G\(\\cdot\\mid c\_\{t\}\),\(3\)evaluatesst\+1=𝒜\(xt\+1\)s\_\{t\+1\}=\\mathcal\{A\}\(x\_\{t\+1\}\), and\(4\)records the result,Dt\+1=Dt∪\{\(xt\+1,st\+1\)\}D\_\{t\+1\}=D\_\{t\}\\cup\\\{\(x\_\{t\+1\},s\_\{t\+1\}\)\\\}, and updates mechanism stateℳt\+1=U\(Dt,ℳt,xt\+1,st\+1\)\\mathcal\{M\}\_\{t\+1\}=U\(D\_\{t\},\\mathcal\{M\}\_\{t\},x\_\{t\+1\},s\_\{t\+1\}\)\. The best\-score process isst∗=maxi≤t⁡sis^\{\*\}\_\{t\}=\\max\_\{i\\leq t\}s\_\{i\}\.

### 2\.1The best\-score process is not Markov

###### Theorem 2\(Non\-reducibility\)\.

The best\-score process\{st∗\}\\\{s^\{\*\}\_\{t\}\\\}isnot Markovfor an ADRS satisfying:

1. \(A1\)Faithful context construction:∃Dt≠Dt′\\exists\\,D\_\{t\}\\neq D\_\{t\}^\{\\prime\}with the same best score \(st∗=maxi≤t⁡sis^\{\*\}\_\{t\}=\\max\_\{i\\leq t\}s\_\{i\}\) such thatC\(Dt,ℳt\)≠C\(Dt′,ℳt\)C\(D\_\{t\},\\mathcal\{M\}\_\{t\}\)\\neq C\(D\_\{t\}^\{\\prime\},\\mathcal\{M\}\_\{t\}\),
2. \(A2\)Context\-dependent generation:c≠c′⇒G\(⋅∣c\)≠G\(⋅∣c′\)c\\neq c^\{\\prime\}\\Rightarrow G\(\\cdot\\mid c\)\\neq G\(\\cdot\\mid c^\{\\prime\}\)forc,c′c,c^\{\\prime\}in the range ofCC,
3. \(A3\)Upper\-tail\-separating assessor: for any thresholdww\(in particularw=st∗w=s^\{\*\}\_\{t\}\) and distinctG\(⋅∣c\)≠G\(⋅∣c′\)G\(\\cdot\\mid c\)\\neq G\(\\cdot\\mid c^\{\\prime\}\)that both admit improvement aboveww, the distributions ofmax⁡\(w,𝒜\(x\)\)\\max\(w,\\,\\mathcal\{A\}\(x\)\)are not identical,

###### Proof\.

*State dependence\.*Take\(Dt,ℳt\)\(D\_\{t\},\\mathcal\{M\}\_\{t\}\),\(Dt′,ℳt\)\(D\_\{t\}^\{\\prime\},\\mathcal\{M\}\_\{t\}\)withmax⁡Dt=max⁡Dt′=w<sup𝒜\(𝒳\)\\max D\_\{t\}=\\max D\_\{t\}^\{\\prime\}=w<\\sup\\mathcal\{A\}\(\\mathcal\{X\}\)butDt≠Dt′D\_\{t\}\\neq D\_\{t\}^\{\\prime\}\. Faithfulness \(A1\) givesC\(Dt,ℳt\)≠C\(Dt′,ℳt\)C\(D\_\{t\},\\mathcal\{M\}\_\{t\}\)\\neq C\(D\_\{t\}^\{\\prime\},\\mathcal\{M\}\_\{t\}\); context\-dependence \(A2\) gives distinct generation distributionsG\(⋅∣ct\)≠G\(⋅∣ct′\)G\(\\cdot\\mid c\_\{t\}\)\\neq G\(\\cdot\\mid c\_\{t\}^\{\\prime\}\); upper\-tail separation \(A3\) then gives distinct distributions ofst\+1∗=max⁡\(w,st\+1\)s^\{\*\}\_\{t\+1\}=\\max\(w,s\_\{t\+1\}\), sost∗s^\{\*\}\_\{t\}alone does not determine the distribution ofst\+1∗s^\{\*\}\_\{t\+1\}\.

*Non\-Markov\.*Two realizations can reachs2∗=w\>s0∗s^\{\*\}\_\{2\}=w\>s^\{\*\}\_\{0\}via different paths: improving at step 1 \(history\(s0∗,w,w\)\(s^\{\*\}\_\{0\},w,w\)\) or at step 2 \(history\(s0∗,s0∗,w\)\(s^\{\*\}\_\{0\},s^\{\*\}\_\{0\},w\)\)\. Their historiesD2D\_\{2\}differ\. By state dependence, these produce distinct distributions ofs3∗s^\{\*\}\_\{3\}\. Both conditionals shares2∗=ws^\{\*\}\_\{2\}=wbut prior historys1∗s^\{\*\}\_\{1\}is informative, so\{st∗\}\\\{s^\{\*\}\_\{t\}\\\}is not Markov\. ∎

A1–A3 hold by construction for any ADRS that shows prior candidates to the generator \(A1\), uses a context\-sensitive generator \(A2\), and has a non\-degenerate assessor \(A3\); see Appendix[B](https://arxiv.org/html/2606.02863#A2)\. In practice, the run history may include structured assessor feedback\(Agrawalet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib5)\), generation reasoning traces, and experimental logs\(Karimiet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib32); Hamadanianet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib31)\)\. Richer histories make A1 easier to satisfy—more ways forDtD\_\{t\}to differ—so systems that persist and leverage such data strengthen the non\-Markov property\. Moreover, under history\-preserving context construction, initial conditions can persist: whenC\(Dt,ℳt\)C\(D\_\{t\},\\mathcal\{M\}\_\{t\}\)carries information fromD0D\_\{0\}, the generation distribution remainsD0D\_\{0\}\-dependent for alltt\(Appendix[J](https://arxiv.org/html/2606.02863#A10)\)\.

Consequence\.The full state\(Dt,ℳt\)\(D\_\{t\},\\mathcal\{M\}\_\{t\}\)*is*Markov—but it evolves in a*growing\-dimensional*space: the history gains an entry every step\. Although the process can be embedded in a fixed infinite\-dimensional state space, many existing rate guarantees in optimization, bandit theory, and evolutionary algorithms typically assume fixed representations, stationary objectives, or stable transition operators—assumptions often violated in ADRS through history\-dependent context construction and adaptive mechanisms \(Appendix[C](https://arxiv.org/html/2606.02863#A3)\)\. Scalar progress metrics such asst∗s^\{\*\}\_\{t\}do not determine future behavior, so analysis requires tracking the full history\. When the effective landscape contains regions with different improvement probabilities, both reachable from the initial history, independent replications can produce multimodal final\-score distributions\. Because\{st∗\}\\\{s^\{\*\}\_\{t\}\\\}is not Markov, the score alone does not determine which region a run is exploring—two runs at the samest∗s^\{\*\}\_\{t\}but with different histories can have different improvement probabilities \(Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)\)\. Neither proximity to the ceiling, nor the binding constraint, nor the structure of final\-score distributions can be determined from a single trajectory—all require independent replications\.

### 2\.2The effective landscape

The system does not exploreLLdirectly—it exploresLLthrough the lens of whatGGcan generate\. This gives the non\-Markov result geometric content: generator sensitivity is not noise but reflects structurally different optimization landscapes, and which region ofLeffL\_\{\\text\{eff\}\}a run explores is unobservable fromst∗s^\{\*\}\_\{t\}alone\. To formalize this:

###### Definition 3\(Effective landscape\)\.

Theeffective landscapeisLeff\(G,𝒜\)=𝒜∗∘GL\_\{\\text\{eff\}\}\(G,\\mathcal\{A\}\)=\\mathcal\{A\}\_\{\*\}\\circ G, whereGGis the full family of conditional distributionsG\(⋅∣c\)G\(\\cdot\\mid c\)over candidates, indexed by contextcc\.LeffL\_\{\\text\{eff\}\}maps each contextccto the pushforward distribution𝒜∗G\(⋅∣c\)\\mathcal\{A\}\_\{\*\}G\(\\cdot\\mid c\)over scores\.

WhenGGis static during a run—the common case, covering single models, static ensembles, andℳ\\mathcal\{M\}\-controlled routing—LeffL\_\{\\text\{eff\}\}is time\-invariant: a fixed surface the system navigates asct=C\(Dt,ℳt\)c\_\{t\}=C\(D\_\{t\},\\mathcal\{M\}\_\{t\}\)evolves\. WhenGGitself adapts \(Section[2\.4](https://arxiv.org/html/2606.02863#S2.SS4)\), the landscape becomes time\-varying:Leff,t=𝒜∗∘GtL\_\{\\text\{eff\},t\}=\\mathcal\{A\}\_\{\*\}\\circ G\_\{t\}\. All structural results below hold in both cases; for adaptiveGG, they apply at each instanttt\.

###### Theorem 4\(Generator\-dependent effective landscape\)\.

Let𝒜\\mathcal\{A\}be a non\-constant assessor \(A4\)\. Two generatorsG1≠G2G\_\{1\}\\neq G\_\{2\}can induce different effective landscapes:Leff\(G1,𝒜\)≠Leff\(G2,𝒜\)L\_\{\\text\{eff\}\}\(G\_\{1\},\\mathcal\{A\}\)\\neq L\_\{\\text\{eff\}\}\(G\_\{2\},\\mathcal\{A\}\)\.

###### Proof\.

Since𝒜\\mathcal\{A\}is not constant, there exist candidatesx,yx,ywith𝒜\(x\)≠𝒜\(y\)\\mathcal\{A\}\(x\)\\neq\\mathcal\{A\}\(y\)\. Generators that concentrate onxxvs\.yyat some contextccthen produce distinct pushforward score distributions\. ∎

Note, while Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)requires A3 \(upper\-tail separation\), Theorem[4](https://arxiv.org/html/2606.02863#Thmtheorem4)does not\. Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)shows that landscape differences propagate to observable score dynamics, so it must exclude generators that differ only below the currentst∗s^\{\*\}\_\{t\}\(which would produce identical next\-best\-score distributions\)\. Theorem[4](https://arxiv.org/html/2606.02863#Thmtheorem4)is only a structural claim about the landscape as an object, not about dynamics at a particular state, and requires only that𝒜\\mathcal\{A\}distinguishes some candidates \(A4\)\. Conversely, when𝒜\\mathcal\{A\}is coarse \(many candidates share scores\),LeffL\_\{\\text\{eff\}\}differences between generators collapse—the𝒜\\mathcal\{A\}\-limited regime \(Table[1](https://arxiv.org/html/2606.02863#S2.T1)\), where𝒜\\mathcal\{A\}’s lack of discrimination renders generator choice irrelevant\.

Ensemble effective landscape\.WhenG=∑i=1kwiGiG=\\sum\_\{i=1\}^\{k\}w\_\{i\}G\_\{i\}is a flat mixture \(fixed routing weights\), the score distribution decomposes by linearity of pushforward:𝒜∗G\(⋅∣c\)=∑i=1kwi⋅𝒜∗Gi\(⋅∣c\)\\mathcal\{A\}\_\{\*\}G\(\\cdot\\mid c\)=\\sum\_\{i=1\}^\{k\}w\_\{i\}\\cdot\\mathcal\{A\}\_\{\*\}G\_\{i\}\(\\cdot\\mid c\)\. A generator faces abarrierat contextccwhen its improvement probability vanishes:𝒜∗G\(⋅∣c\)\(\(s∗,∞\)\)=0\\mathcal\{A\}\_\{\*\}G\(\\cdot\\mid c\)\(\(s^\{\*\},\\infty\)\)=0\. The landscape may contain higher scores, butGGcannot reach them fromcc\. An ensemble’s reachable set is the union of each component’s, so ensembles can escape barriers that trap single generators\. Ensemble benefit depends onLeffL\_\{\\text\{eff\}\}diversity, not ensemble size alone—addingGk\+1G\_\{k\+1\}dilutes existing weights \(∑wi=1\\sum w\_\{i\}=1\), and helps only if it reaches score regions not already covered; consistent withChenget al\.\([2025a](https://arxiv.org/html/2606.02863#bib.bib4)\)’s finding that ensembles beyond two models provided no additional benefit\. For verifying or routed generators \(where component selection depends on candidate quality\), the linearity decomposition does not necessarily hold\. Our experiments suggest barrier escape extends to these systems, though the internal mechanism differs from flat\-mixture union \(Section[3\.1](https://arxiv.org/html/2606.02863#S3.SS1)\)\.

### 2\.3Ceilings and regime classification

###### Definition 5\(Ceilings\)\.

Thegenerator ceilingiss∞∗\(G,𝒜\)=sup\{s:∃cs\.t\.𝒜∗G\(⋅∣c\)\(\[s,∞\)\)\>0\}s^\{\*\}\_\{\\infty\}\(G,\\mathcal\{A\}\)=\\sup\\\{s:\\exists\\,c\\text\{ s\.t\.\\ \}\\mathcal\{A\}\_\{\*\}G\(\\cdot\\mid c\)\(\[s,\\infty\)\)\>0\\\}—the supremum over all contextscc, including those that only the best mechanism with unlimited budget could construct\. It is the intrinsic capability ofGGunder𝒜\\mathcal\{A\}: an upper bound that no choice ofℳ\\mathcal\{M\}orBBcan exceed\. Thesystem ceilings∞∗\(G,𝒜,ℳ\)≤s∞∗\(G,𝒜\)s^\{\*\}\_\{\\infty\}\(G,\\mathcal\{A\},\\mathcal\{M\}\)\\leq s^\{\*\}\_\{\\infty\}\(G,\\mathcal\{A\}\)restricts to contexts constructable by a specificℳ\\mathcal\{M\}from the evolving history\.

Neither ceiling is directly observable; in practice, proximity is inferred from cross\-configuration comparisons \(Section[3](https://arxiv.org/html/2606.02863#S3)\): if varyingℳ\\mathcal\{M\}does not change scores, the system is likely near the generator ceiling\. These ceilings, together with the assessor’s signal structure, induce a regime classification \(Table[1](https://arxiv.org/html/2606.02863#S2.T1)\)\. In practice, a system may exhibit features of more than one regime simultaneously, but identifying even a single binding constraint narrows the space of useful interventions\.

Table 1:Regime classification\. Each regime identifies a binding constraint and intervention\. Empirical illustrations appear in Section[3](https://arxiv.org/html/2606.02863#S3)\.The𝒜\\mathcal\{A\}\-limited regime is qualitatively different: the limitation is prior to theGG/ℳ\\mathcal\{M\}/BBhierarchy—no mechanism can navigate a landscape where𝒜\\mathcal\{A\}provides no optimization signal\. This can arise even when𝒜\\mathcal\{A\}is*correct*\(it scores valid solutions accurately\) but its feedback structure is too coarse—e\.g\., cliff scoring that collapses all invalid candidates to zero regardless of proximity to validity\. Every call in this regime produces zero usable signal; since generation is expensive \(each candidate requires at least one LLM call\), the waste compounds\. Distinguishing “𝒜\\mathcal\{A\}provides insufficient signal” from “GGis producing uniformly poor candidates” requires inspecting the candidates themselves \(Section[3\.3](https://arxiv.org/html/2606.02863#S3.SS3)\)\.

### 2\.4Static vs\. adaptive generators

An adaptive generator \(GG\) learns or changes its approach during the course of a run\. A static generator navigatesLeffL\_\{\\text\{eff\}\}; an adaptiveGtG\_\{t\}simultaneously navigates and reshapes it\. The structural results \(Theorem[4](https://arxiv.org/html/2606.02863#Thmtheorem4), ceilings, regimes\) apply to adaptiveGGat each instantttwithout modification\. The non\-Markov property \(Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)\) is strengthened: the generator’s internal stateθt\\theta\_\{t\}is an additional source of path\-dependence beyond the history\. Ceilings become time\-varying—s∞∗\(Gt,𝒜\)s^\{\*\}\_\{\\infty\}\(G\_\{t\},\\mathcal\{A\}\)is itself a process whose trajectory characterizes how fast the generator is learning\. Adaptation is not automatically beneficial: a generator that overfits to early candidates may narrow its reachable set, decreasings∞∗\(Gt,𝒜\)s^\{\*\}\_\{\\infty\}\(G\_\{t\},\\mathcal\{A\}\)\. The framework provides the quantities \(Leff,tL\_\{\\text\{eff\},t\}, ceiling trajectory\) needed to measure whether adaptation helps, without assuming that it does\.

## 3Empirical validation

Benchmark and problems\.We select three NP\-hard problems for diversity in landscape structure and assessor signal \(Table[2](https://arxiv.org/html/2606.02863#S3.T2)\) from Frontier\-CS, an open\-source competitive programming benchmark\(Manget al\.,[2025](https://arxiv.org/html/2606.02863#bib.bib30)\)\.

Table 2:Problems used in experiments \(all NP\-hard\)\.Generators\.We use 12 generators spanning 3 architectural categories \(Table[3](https://arxiv.org/html/2606.02863#S3.T3)\)\. Static generators have a fixedGGthroughout the run, each inducing a time\-invariantLeffL\_\{\\text\{eff\}\}\(Theorem[4](https://arxiv.org/html/2606.02863#Thmtheorem4)\)\. Network\-of\-networks \(NoN\)\(Daviset al\.,[2024](https://arxiv.org/html/2606.02863#bib.bib10)\)combine multiple models with internal verifiers, composing reachable sets acrossLeffL\_\{\\text\{eff\}\}\(Section[2\.2](https://arxiv.org/html/2606.02863#S2.SS2)\)\. Static NoN have fixed components and routing; adaptive NoN reshape their generation distributions across the run, makingLeff,tL\_\{\\text\{eff\},t\}non\-stationary \(Section[2\.4](https://arxiv.org/html/2606.02863#S2.SS4)\)\. The eb1 variants are closed\-source NoN systems\(Daviset al\.,[2024](https://arxiv.org/html/2606.02863#bib.bib10)\)not yet publicly released\.

CategoryGeneratorTypeLeffL\_\{\\text\{eff\}\}Static \(single model\) \(7\)Claude Opus 4\.6\(Anthropic,[2023](https://arxiv.org/html/2606.02863#bib.bib19)\)CommercialStationaryGemini\-2\.5\-Flash\(Google DeepMind,[2023](https://arxiv.org/html/2606.02863#bib.bib18)\)CommercialStationaryGPT\-5\.4\(OpenAI,[2026](https://arxiv.org/html/2606.02863#bib.bib20)\)CommercialStationaryGPT\-5\-mini\(OpenAI,[2026](https://arxiv.org/html/2606.02863#bib.bib20)\)CommercialStationaryGPT\-OSS\-20B\(OpenAI,[2026](https://arxiv.org/html/2606.02863#bib.bib20)\)Open\-weight, MoEStationaryKimi\-K2\.5\(Moonshot AI,[2026](https://arxiv.org/html/2606.02863#bib.bib16)\)Open\-weight, MoEStationaryQwen\-3\.5\-9B†\(Alibaba Cloud,[2024](https://arxiv.org/html/2606.02863#bib.bib17)\)Open\-weightStationaryStatic NoN \(2\)eb1, eb1\-pro†NoN, fixedStationaryAdaptive NoN \(3\)eb1\-previewNoN, adaptiveNon\-stationaryeb1\-delta\-previewNoN, adaptiveNon\-stationaryeb1\-frontier\-previewNoN, adaptiveNon\-stationaryTable 3:Generators used in experiments\.†Partial problem coverage due to availability/latency\.Discovery mechanisms\.In order to varyℳ\\mathcal\{M\}andGG, we use SkyDiscover\(Liuet al\.,[2026b](https://arxiv.org/html/2606.02863#bib.bib6)\)with 3 mechanisms, from a greedy baseline to state\-of\-the\-art adaptive search\. Best\-of\-N \(BoN\), our greedy baseline \(Appendix[H](https://arxiv.org/html/2606.02863#A8)\) carries no adaptive state, but is not stateless: its context depends on the growing history \(Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)\), so generator differences under BoN reflectLeffL\_\{\\text\{eff\}\}directly\. AdaEvolve\(Cemriet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib7)\)is an adaptive multi\-island evolutionary search with33\+33\{\+\}tunable parameters in SkyDiscover, including stagnation detection and island migration\. EvoX\(Liuet al\.,[2026a](https://arxiv.org/html/2606.02863#bib.bib8)\)additionally adapts the search strategy itself across iterations via co\-evolutionary meta\-search\.

Budget and replication\.Each run usesB=60B=60iterations\. Convergence guarantees have not yet been established for ADRS \(Section[2](https://arxiv.org/html/2606.02863#S2)\), so we use a fixed budget to normalize comparisons across configurations while keeping total cost tractable; runs that repeatedly reach saturation \(e\.g\., P1 score100100\) are terminated early\. We target≥5\\geq 5independent replications per configuration, with additional runs for multimodal distributions until each detected basin contains≥3\\geq 3observations \(Appendix[I](https://arxiv.org/html/2606.02863#A9)\)\.

### 3\.1Results on polyomino packing \(Problem 0\)

![Refer to caption](https://arxiv.org/html/2606.02863v1/x1.png)Figure 1:Problem 0 results\. \(a\) Score distributions across generators and mechanisms\. Each point is one run’s best score; the black line marks the BoN median \(greedy baseline\)\. eb1\-frontier\-preview shown as representative of the dynamic eb1 variants \(all three variants in Appendix Figure[3](https://arxiv.org/html/2606.02863#A11.F3)\)\. Gemini\-2\.5\-Flash and Qwen\-3\.5\-9B omitted \(rare/no breakthroughs\)\. \(b\) Mechanism contribution relative to BoN median\. Annotations show the score difference each adaptive mechanism achieves\. The effect is non\-monotonic: adaptive search can hurt \(Claude Opus 4\.6 AdaEvolve−5\-5, GPT\-5\.4 AdaEvolve−29\-29\)\. GPT\-5\-mini and Kimi\-K2\.5 omitted \(full version in Appendix Figure[4](https://arxiv.org/html/2606.02863#A11.F4)\)\.Problem 0 is a polyomino packing instance: pack up to10410^\{4\}polyominoes of size 1–10 into a minimum\-area axis\-aligned rectangle \(70 test cases, NP\-hard\)\. The assessor\(Manget al\.,[2025](https://arxiv.org/html/2606.02863#bib.bib30)\)scores continuously, proportional to packing density\. Results across 12 generators from 6 model families and 3 mechanisms are summarized in Figure[1](https://arxiv.org/html/2606.02863#S3.F1)\(representative subset; full set in Appendix Figure[3](https://arxiv.org/html/2606.02863#A11.F3)\)\. We use BoN as a baseline because it isolates generator effects: with no adaptive state, score differences under BoN reflectLeffL\_\{\\text\{eff\}\}directly\.

Generator scores span a wide range: the best \(eb1\) reaches 82\.3, while the weakest \(Qwen\-3\.5\-9B\) never exceeds 3\.07, and Gemini\-2\.5\-Flash rarely breaks through \(4/19 runs\), though its rare successes reach scores above 66 \(Appendix Figure[3](https://arxiv.org/html/2606.02863#A11.F3)\)\. The ordering does not follow general capability rankings: Claude Opus 4\.6 \(BoN median 21\.5\) ranks below GPT\-5\-mini \(45\.8\) and GPT\-OSS\-20B \(45\.0\)\. No generator exceeds a score of 82 regardless of mechanism\. The eb1 family \(Networks\-of\-Networks with internal verification, Table[3](https://arxiv.org/html/2606.02863#S3.T3)\) illustrates the complexity: eb1 base reaches 82 under BoN alone, yet variants adapting dynamically cluster near 44\. Whether a configuration ever exceeds score 0—an event we call*breakthrough*—varies sharply by generator: GPT\-OSS\-20B and GPT\-5\-mini break through on every run, while Qwen\-3\.5\-9B, Gemini\-2\.5\-Flash, and GPT\-5\.4 frequently score 0\.

For several generators, final scores cluster at discrete levels: most clearly GPT\-OSS\-20B at 44 and the eb1 variants’ shared 44 attractor \(Figure[1](https://arxiv.org/html/2606.02863#S3.F1)\); others show wider distributions consistent with multiple partially\-resolved basins\. The eb1 variants \(preview, frontier, delta\) all cluster near 44 under EvoX, and frontier/delta under BoN as well, a shared attractor across related but distinct generators and mechanisms suggesting this basin is a feature of the problem landscape\. Under AdaEvolve, eb1\-preview and eb1\-frontier\-preview shift to medians of 68–71, indicating that AdaEvolve can escape the 44 basin\. The mechanism ranking reverses within the family: for eb1\-preview, EvoX*loses*1212points relative to BoN \(Appendix Figure[4](https://arxiv.org/html/2606.02863#A11.F4)\), the opposite of the AdaEvolve benefit seen in eb1\-frontier\-preview \(\+24\+24\)\. In contrast, GPT\-OSS\-20B converges to 44 across all three mechanisms \(CV≈2%\\approx 2\\%\), exhibiting a single accessible basin regardless ofℳ\\mathcal\{M\}\. Kimi\-K2\.5 shows the widest spread \(0–69 under EvoX\), consistent with multiple accessible basins\.

Figure[1](https://arxiv.org/html/2606.02863#S3.F1)b quantifies each mechanism’s contribution relative to BoN\. If mechanism adaptivity predicted performance, we would expect a consistent ordering BoN<<AdaEvolve<<EvoX; instead, no mechanism dominates across all generators\. For eb1, both AdaEvolve and EvoX improve on BoN by only\+4\+4; mechanism choice is almost irrelevant\. For Claude Opus 4\.6, EvoX adds\+46\+46but AdaEvolve*loses*55points relative to BoN: same generator, opposite mechanism effects\. GPT\-5\.4 under AdaEvolve scores 0 on 8/11 runs \(median 0 vs BoN median 28\.7\): guided search actively lowers performance for this pairing\. The breakthrough asymmetry reinforces this: GPT\-5\.4 breaks through on 27% of AdaEvolve runs but 100% of EvoX runs—sameGG,𝒜\\mathcal\{A\},LL, differentℳ\\mathcal\{M\}, completely different breakthrough behavior\. In general, generators with lower BoN medians benefit more from guided search, but the relationship is non\-monotonic and mechanism\-specific\. No configuration reaches score100100within6060iterations\.

### 3\.2Results on bounded knapsack \(Problem 1\)

Problem 1 is a bounded 2D knapsack instance: select quantities of 12 item types to maximize total value subject to joint mass and volume constraints\. Candidate solutions are evaluated on 3 test cases with continuous scoring relative to a known\-optimal reference solution\(Manget al\.,[2025](https://arxiv.org/html/2606.02863#bib.bib30)\)\. Most generator–mechanism combinations reach the optimum \(score100100\), but iterations\-to\-saturation and reliability vary significantly\.

![Refer to caption](https://arxiv.org/html/2606.02863v1/x2.png)Figure 2:Problem 1 search efficiency and reliability\. \(a\) Median iterations to saturation \(score≥99\\text\{score\}\\geq 99\) by generator and mechanism; brackets:\[min,max\]\[\\min,\\max\]over saturating runs\. \(b\) Best scores for non\-saturating runs, colored by mechanism\. The eb1 family groups eb1, eb1\-pro, eb1\-preview, and eb1\-frontier\-preview\. Full set including GPT\-5\-mini and Gemini\-2\.5\-Flash in Appendix Figure[5](https://arxiv.org/html/2606.02863#A11.F5)\.Figure[2](https://arxiv.org/html/2606.02863#S3.F2)a shows median iterations to saturation \(score≥99\\geq 99\) across 6 representative generator classes and 3 mechanisms \(full set including GPT\-5\-mini and Gemini\-2\.5\-Flash in Appendix Figure[5](https://arxiv.org/html/2606.02863#A11.F5)\)\. Theeb1 family\(eb1, eb1\-pro, eb1\-preview, and eb1\-frontier\-preview\) are most efficient, all saturating at a median of11iteration across all mechanisms\. Among static generators, the convergence rate spans nearly an order of magnitude: Kimi\-K2\.5 reaches saturation in77–1616iterations \(median, depending on mechanism\), while GPT\-OSS\-20B \(1717–2222\) and Claude Opus 4\.6 \(1212–3939\) form a slower tier\. GPT\-5\.4 spans the full range \(22–2424\) depending on mechanism, the widest spread among static generators\. This hierarchy is not predicted by general model capability rankings: Claude Opus 4\.6 is the slowest static generator to saturate despite being one of the most capable on standard coding benchmarks at the time of result collection\.

The mechanism ranking is not consistent across generators\. Claude Opus 4\.6 shows strongℳ\\mathcal\{M\}\-sensitivity: AdaEvolve reaches saturation in a median of1212iterations versus3939for Best\-of\-N and3131for EvoX—a3×3\\timesreduction in iterations from mechanism choice\. In contrast, GPT\-OSS\-20B isℳ\\mathcal\{M\}\-insensitive, saturating in1717–2222iterations regardless of mechanism\. The mechanism ranking*reverses*between generators: eb1\-delta\-preview saturates in11iteration with EvoX but1111with AdaEvolve; GPT\-5\.4 saturates in22with Best\-of\-N but2424with AdaEvolve, the opposite of Claude Opus 4\.6’s ordering in both cases\. There is no universally bestℳ\\mathcal\{M\}; the optimal pairing depends on the specificG×ℳG\\times\\mathcal\{M\}combination\.

Among runs that do not reach saturation \(Figure[2](https://arxiv.org/html/2606.02863#S3.F2)b\), scores cluster at≈66\.7\\approx 66\.7\. Within the eb1 family, all sub\-saturation runs come from the adaptive variants \(preview, frontier, delta\); static eb1 and eb1\-pro always saturate, suggesting that adaptiveLeff,tL\_\{\\text\{eff\},t\}reshaping occasionally steers into worse regions on this problem\. Breakthrough rates vary byG×ℳG\\times\\mathcal\{M\}: GPT\-OSS\-20B achieves6969–92%92\\%\(depending onℳ\\mathcal\{M\}\), Gemini\-2\.5\-Flash4040–80%80\\%, Kimi\-K2\.5 on Best\-of\-N71%71\\%, and Qwen\-3\.5\-9B0/160/16runs even with the most adaptive mechanism\.

### 3\.3Results on palindrome path \(Problem 11\)

Problem 11 requires finding a minimum\-length palindromic move sequence that visits every blank cell of ann×mn\\times mgrid \(up to30×3030\\times 30, 3 test cases\)\. Across\>13K\{\>\}13\\text\{K\}iterations \(233233runs,2222configurations spanning 10 generators from Table[3](https://arxiv.org/html/2606.02863#S3.T3)and all 3 mechanisms\), every run scores 0 \(Figure[6](https://arxiv.org/html/2606.02863#A11.F6)\)\. This is the cleanest regime example in our study: unlike P0 \(where breakthrough depends onG×ℳG\\times\\mathcal\{M\}\) or P1 \(where breakthrough is stochastic but common\), P11 exhibits universal failure regardless of configuration\.

The universal failures do not reflect generators ignoring the problem\. Across all runs,\>120K\{\>\}120\\text\{K\}candidate programs were generated and evaluated\. These candidates are substantive \(median 220 lines, 93% implementing BFS/DFS\-based graph traversal\) and often demonstrate correct mathematical reasoning about the palindrome constraint, but no candidate satisfies all constraints simultaneously\. The assessor\(Manget al\.,[2025](https://arxiv.org/html/2606.02863#bib.bib30)\)scores viaClamp\(1−\(ℓ−ℓ∗\)/ℓ∗,0,1\)\\text\{Clamp\}\(1\-\(\\ell\-\\ell^\{\*\}\)/\\ell^\{\*\},\\;0,\\;1\)whereℓ\\ellis the path length andℓ∗\\ell^\{\*\}a reference bound\. However, this formula is only reached for*valid*palindrome paths visiting all cells\. Any candidate that fails validity scores 0 outright, with no partial credit for near\-valid solutions\. This cliff structure meansℳ\\mathcal\{M\}receives no gradient signal: a candidate covering 90% of cells scores identically to one covering 0%\. All\>120K\{\>\}120\\text\{K\}candidates map to the same score, so no mechanism—regardless of sophistication—can distinguish improving from worsening candidates\. Formally, P11 satisfies Theorems[2](https://arxiv.org/html/2606.02863#Thmtheorem2)and[4](https://arxiv.org/html/2606.02863#Thmtheorem4)vacuously: no candidate scores above 0, so A3’s antecedent is never met and the theorems make no predictions\. However, the framework’s regime classification still provides actionable diagnosis independent of these guarantees\. P11 is optimization\-signal\-limited under the current assessor: the assessor is*correct*\(it would score a valid solution accurately\) but its cliff structure collapses all invalid candidates to a single score, flatteningLeffL\_\{\\text\{eff\}\}so that noℳ\\mathcal\{M\}can extract a gradient\. This is diagnosed from the decompositionLeff=𝒜∘GL\_\{\\text\{eff\}\}=\\mathcal\{A\}\\circ G: generators produce structurally relevant candidates \(above\), so the binding constraint is notGG’s capability but𝒜\\mathcal\{A\}’s feedback granularity\. Enriching𝒜\\mathcal\{A\}to provide partial credit \(e\.g\., for cell coverage or palindrome prefix length\) is the necessary first step\.

## 4Discussion

Table 4:Theory–empirics mapping with evidence\.The effective landscape unifies several empirical phenomena observed across our results and concurrent work\.Generator sensitivity\(Sections[3\.1](https://arxiv.org/html/2606.02863#S3.SS1)–[3\.2](https://arxiv.org/html/2606.02863#S3.SS2)\): different generators induce structurally differentLeffL\_\{\\text\{eff\}\}\(Theorem[4](https://arxiv.org/html/2606.02863#Thmtheorem4)\)\. Capability leaderboards do not predict ADRS performance; on P0, GPT\-OSS\-20B \(20B open\-weight MoE\) reaches median 45\.0 versus 21\.5 for Claude Opus 4\.6, and the ranking reverses across mechanisms\. Model selection is a topology question, not a ranking question\.G×ℳG\\times\\mathcal\{M\}interaction:ℳ\\mathcal\{M\}navigatesLeff\(G,𝒜\)L\_\{\\text\{eff\}\}\(G,\\mathcal\{A\}\), so a mechanism suited to one landscape topology may be unsuited to another\. This interaction is itself problem\-dependent: GPT\-OSS\-20B isℳ\\mathcal\{M\}\-insensitive on P0 butℳ\\mathcal\{M\}\-responsive on P1\.Run\-to\-run variance: reflects basin structure whenLeffL\_\{\\text\{eff\}\}has multiple accessible regions\. BecauseCCsamples from the history, same\-region entries accumulate, producing self\-reinforcing contexts—unlessℳ\\mathcal\{M\}actively injects diversity\. Low variance can also be an artifact of a coarse𝒜\\mathcal\{A\}that maps diverse candidates to the same score\.Regime diagnosis: ceilings and assessor signal structure identify the binding constraint \(GG,𝒜\\mathcal\{A\},ℳ\\mathcal\{M\}, orBB\) with targeted evaluations\. On P1 alone, Qwen\-3\.5\-9B is hard G\-limited \(0/160/16breakthrough\), GPT\-OSS\-20B stochastically G\-limited, and eb1 saturated—all diagnosed per\-configuration \(Section[3\.2](https://arxiv.org/html/2606.02863#S3.SS2)\)\.

Component effectiveness is mutually conditioned—each component’s contribution depends on what the others provide\. On P0, rich assessor signal enablesℳ\\mathcal\{M\}to exploit landscape structure; on P1,ℳ\\mathcal\{M\}affects efficiency butGGdetermines breakthrough; on P11,ℳ\\mathcal\{M\}is powerless because the current assessor’s cliff scoring flattensLeffL\_\{\\text\{eff\}\}\(𝒜\\mathcal\{A\}collapses all invalid candidates to zero, so no search strategy can extract a gradient\)\. Enriching𝒜\\mathcal\{A\}’s signal structure \(e\.g\., partial credit for cell coverage or palindrome prefix length\) could restore optimization signal and unlockℳ\\mathcal\{M\}’s ability to navigate, which changes whatGGneeds to achieve\. The framework identifies which component to invest in next\. This co\-design perspective builds on independent component advances: mechanism diversity \(co\-evolutionary meta\-search, island models\) revealsG×ℳG\\times\\mathcal\{M\}interaction; trajectory\-aware mechanisms matter because the process is non\-Markov: mechanisms that exploit history structure can escape basins that memoryless mechanisms cannot\(Karimiet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib32)\)\. Our evidence suggests thatℳ\\mathcal\{M\}innovation \(which dominates recent work\) faces diminishing returns whenGGor𝒜\\mathcal\{A\}is binding: GPT\-OSS\-20B isℳ\\mathcal\{M\}\-insensitive on P0, and on P11 no mechanism helps because the assessor’s cliff scoring provides no gradient\. Adaptive generation \(internal verification, dynamic routing\) and assessor enrichment may offer higher returns on many problems than further mechanism sophistication alone\.

Though regime taxonomy,G×ℳG\\times\\mathcal\{M\}interaction, basin structure, and mechanism reversal are all observable from the 7 publicly available static generators alone, the eb1 family further illustrates adaptive generation can subsume mechanism choice\. Its internal verification \(Section[2\.4](https://arxiv.org/html/2606.02863#S2.SS4)\) might reshapeLeffL\_\{\\text\{eff\}\}independently ofℳ\\mathcal\{M\}, which would explain its high ceilings: the verifier preferentially passes high\-quality candidates, raisings∞∗s^\{\*\}\_\{\\infty\}\. eb1 base isℳ\\mathcal\{M\}\-insensitive across problems\. The dynamic variants share this verification but additionally reshape their generation distributions over a run, introducing path\-dependence beyond the history \(Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)\)\. That eb1 base reaches higher basins \(∼82\{\\sim\}82\) than these variants suggests that the form of internal adaptation, not external mechanism choice, determines which basins are reached\. These eb1 results are a strong example of the upside from NoN\-style compound generation, though no framework conclusion depends on them\.

The framework connects each design decision to a specific theoretical quantity, identifying where to invest with targeted evaluations rather than exhaustive ablation\. Even on P1 \(a “solved” problem\),GG\-selection reduces iterations by∼39×\{\\sim\}39\\times\(1 for eb1 vs\. 39 for Claude Opus 4\.6, both under BoN\) andℳ\\mathcal\{M\}\-selection by3×3\\times\(12 for AdaEvolve vs\. 39 for BoN, both with Claude Opus 4\.6\)—the regime classification tells practitioners which system component matters for their task\. Under basin structure, summary statistics conflate basin membership with performance: when P0 scores cluster at4444and7272, the mean \(5858\) describes no actual run\. The framework shifts the question from “how many runs to estimate a mean?” to “how many runs to resolve the mixture?,” telling practitioners*what to look for*\(basins\) and thereby*when they have enough data*\. The framework is problem\-general: theory establishes the structures to diagnose \(regimes, ceilings\); empirics reveal basin structure and which regime a specific\(G,𝒜,ℳ,B,L\)\(G,\\mathcal\{A\},\\mathcal\{M\},B,L\)falls into \(Table[4](https://arxiv.org/html/2606.02863#S4.T4)\)\. A related open direction is formalizing how seed choice \(D0D\_\{0\}\) affects which outcome a run reaches\. Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)establishes thatD0D\_\{0\}content \(not justs0∗s^\{\*\}\_\{0\}\) can shape the trajectory; multimodal final\-score distributions from identical configurations \(Section[3\.1](https://arxiv.org/html/2606.02863#S3.SS1)\) suggest that stochastic choices during a run determine basin commitment\.

#### Limitations\.

The empirical validation uses a competitive programming benchmark; the theoretical results are architecture\-general, but the specific phenomena could in principle be domain\-specific\. Independent results across math, systems, and science domains suggest they are not, though direct validation on additional domains would strengthen the case\. Search efficiency \(iterations\-to\-saturation\) is not computational efficiency: NoN generators perform opaque internal computation per API call, so one iteration for a verified generator may involve much more compute than for a single model\. Our comparisons measure mechanism\-level attempts needed, not cost per attempt\. Regime classifications are inferred from observed score distributions under finite replication; low\-probability events \(e\.g\., rare basin access or infrequent breakthrough\) may remain undetected at the sample sizes used\. The empirical results establish existence of the reported phenomena, not exhaustive characterization of the underlying landscapes\. Characterizing whenLeffL\_\{\\text\{eff\}\}admits multiple separated score regions—contextsc1,c2c\_\{1\},c\_\{2\}where𝒜∗G\(⋅∣ci\)\\mathcal\{A\}\_\{\*\}G\(\\cdot\\mid c\_\{i\}\)concentrate on different score ranges with no improvement\-probability path between them—is a topological question about the effective landscape that remains open\.

## 5Conclusion

We prove that the ADRS best\-score process is non\-Markov and that its full state is growing\-dimensional \(Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)\); convergence guarantees based on fixed\-dimensional Markov state do not directly apply\. GAMBLe takes the first steps in bridging this gap: its effective landscapeLeff=𝒜∘GL\_\{\\text\{eff\}\}=\\mathcal\{A\}\\circ Gexplains why generator sensitivity andG×ℳG\\times\\mathcal\{M\}interaction arise structurally, and its regime classification identifies the binding constraint \(GG,𝒜\\mathcal\{A\},ℳ\\mathcal\{M\}, orBB\) with targeted evaluations rather than exhaustive ablation\. Validation across760\+760\{\+\}runs \(\>46,000\{\>\}46\{,\}000iterations\), 12 generators from 6 model families, 3 mechanisms, and 3 NP\-hard problems exposes basin structure, interaction effects, and regime diversity\.

The framework makes ADRS optimization systematic: each design decision maps to a specific theoretical quantity \(LeffL\_\{\\text\{eff\}\}geometry, ceilings, regime boundaries\) that researchers can estimate empirically\. Because components are mutually conditioned—ℳ\\mathcal\{M\}’s value depends on theLeffL\_\{\\text\{eff\}\}thatGGand𝒜\\mathcal\{A\}jointly determine—effective improvement requires diagnosing which component is binding before investing in any one\. When𝒜\\mathcal\{A\}is binding, no amount ofGGorℳ\\mathcal\{M\}spend helps; when a configuration is saturated, further iterations are wasted\. As ADRS scale to longer runs and broader deployment, the non\-Markov characterization and effective landscape geometry established here lay the groundwork for making ADRS efficient, supporting future work on autotuning, formal diagnostics, and scaling under resource constraints\.

## Acknowledgments

We thank Jared Quincy Davis for early access to eb1, insightful feedback, and the whole eb1 model development team for their ongoing support\.

## References

- L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang, C\. Potts, K\. Sen, A\. G\. Dimakis, I\. Stoica, D\. Klein, M\. Zaharia, and O\. Khattab \(2026\)GEPA: reflective prompt evolution can outperform reinforcement learning\.External Links:2507\.19457,[Link](https://arxiv.org/abs/2507.19457)Cited by:[Appendix L](https://arxiv.org/html/2606.02863#A12.SS0.SSS0.Px6.p1.5),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3),[§1](https://arxiv.org/html/2606.02863#S1.p2.8),[§2\.1](https://arxiv.org/html/2606.02863#S2.SS1.p1.5),[§2](https://arxiv.org/html/2606.02863#S2.p1.24)\.
- Alibaba Cloud \(2024\)Qwen model series\.External Links:[Link](https://huggingface.co/Qwen)Cited by:[Table 3](https://arxiv.org/html/2606.02863#S3.T3.1.8.7.2)\.
- Anthropic \(2023\)Claude model family\.External Links:[Link](https://www.anthropic.com/system-cards)Cited by:[Table 3](https://arxiv.org/html/2606.02863#S3.T3.1.2.1.2)\.
- M\. Cemri, S\. Agrawal, A\. Gupta, S\. Liu, A\. Cheng, Q\. Mang, A\. Naren, L\. E\. Erdogan, K\. Sen, M\. Zaharia, A\. Dimakis, and I\. Stoica \(2026\)AdaEvolve: adaptive llm driven zeroth\-order optimization\.External Links:2602\.20133,[Link](https://arxiv.org/abs/2602.20133)Cited by:[Appendix L](https://arxiv.org/html/2606.02863#A12.SS0.SSS0.Px4.p1.12),[Appendix H](https://arxiv.org/html/2606.02863#A8.SS0.SSS0.Px2.p1.2),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3),[§1](https://arxiv.org/html/2606.02863#S1.p2.8),[§3](https://arxiv.org/html/2606.02863#S3.p3.4)\.
- A\. Cheng, S\. Liu, M\. Pan, Z\. Li, S\. Agarwal, M\. Cemri, B\. Wang, A\. Krentsel, T\. Xia, J\. Park, S\. Yang, J\. Chen, L\. Agrawal, A\. Naren, S\. Li, R\. Ma, A\. Desai, J\. Xing, K\. Sen, M\. Zaharia, and I\. Stoica \(2025a\)Let the barbarians in: how ai can accelerate systems performance research\.External Links:2512\.14806,[Link](https://arxiv.org/abs/2512.14806)Cited by:[Appendix L](https://arxiv.org/html/2606.02863#A12.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3),[§2\.2](https://arxiv.org/html/2606.02863#S2.SS2.p4.9)\.
- A\. Cheng, S\. Liu, M\. Pan, Z\. Li, B\. Wang, A\. Krentsel, T\. Xia, M\. Cemri, J\. Park, S\. Yang, J\. Chen, L\. Agrawal, A\. Desai, J\. Xing, K\. Sen, M\. Zaharia, and I\. Stoica \(2025b\)Barbarians at the gate: how ai is upending systems research\.External Links:2510\.06189,[Link](https://arxiv.org/abs/2510.06189)Cited by:[Table 6](https://arxiv.org/html/2606.02863#A5.T6),[Appendix E](https://arxiv.org/html/2606.02863#A5.p1.1),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3),[§1](https://arxiv.org/html/2606.02863#S1.p1.1)\.
- A\. Cheng, H\. Ng, A\. Kabcenell, P\. Bailis, M\. Zaharia, L\. Ma, X\. Shi, and I\. Stoica \(2026\)AI\-driven research for databases\.External Links:2604\.06566,[Link](https://arxiv.org/abs/2604.06566)Cited by:[Appendix L](https://arxiv.org/html/2606.02863#A12.SS0.SSS0.Px7.p1.9),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3),[§1](https://arxiv.org/html/2606.02863#S1.p2.8)\.
- J\. Q\. Davis, B\. Hanin, L\. Chen, P\. Bailis, I\. Stoica, and M\. Zaharia \(2024\)Networks of networks: complexity class principles applied to compound ai systems design\.External Links:2407\.16831,[Link](https://arxiv.org/abs/2407.16831)Cited by:[Appendix F](https://arxiv.org/html/2606.02863#A6.p2.2),[§3](https://arxiv.org/html/2606.02863#S3.p2.4)\.
- M\. Feurer, A\. Klein, K\. Eggensperger, J\. Springenberg, M\. Blum, and F\. Hutter \(2015\)Efficient and robust automated machine learning\.InProceedings of the 28th International Conference on Advances in Neural Information Processing Systems \(NIPS’15\),Cited by:[Appendix G](https://arxiv.org/html/2606.02863#A7.p1.2),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3)\.
- Google DeepMind \(2023\)Gemini model family\.External Links:[Link](https://ai.google.dev/gemini-api/docs/models)Cited by:[Table 3](https://arxiv.org/html/2606.02863#S3.T3.1.3.2.2)\.
- P\. Hamadanian, P\. Karimi, A\. Nasr\-Esfahany, K\. Noorbakhsh, J\. Chandler, A\. ParandehGheibi, M\. Alizadeh, and H\. Balakrishnan \(2026\)Glia: a human\-inspired ai for automated systems design and optimization\.InProceedings of the ACM Conference on AI and Agentic Systems,CAIS ’26,New York, NY, USA,pp\. 61–84\.External Links:ISBN 9798400724152,[Link](https://doi.org/10.1145/3786335.3813125),[Document](https://dx.doi.org/10.1145/3786335.3813125)Cited by:[Appendix L](https://arxiv.org/html/2606.02863#A12.SS0.SSS0.Px8.p1.8),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3),[§1](https://arxiv.org/html/2606.02863#S1.p2.8),[§2\.1](https://arxiv.org/html/2606.02863#S2.SS1.p1.5)\.
- P\. Karimi, K\. Noorbakhsh, M\. Alizadeh, and H\. Balakrishnan \(2026\)Improving coherence and persistence in agentic ai for system optimization\.InProceedings of the ACM Conference on AI and Agentic Systems,CAIS ’26,New York, NY, USA,pp\. 124–160\.External Links:ISBN 9798400724152,[Link](https://doi.org/10.1145/3786335.3813138),[Document](https://dx.doi.org/10.1145/3786335.3813138)Cited by:[Appendix L](https://arxiv.org/html/2606.02863#A12.SS0.SSS0.Px9.p1.2),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3),[§2\.1](https://arxiv.org/html/2606.02863#S2.SS1.p1.5),[§4](https://arxiv.org/html/2606.02863#S4.p2.14)\.
- S\. A\. Kauffman \(1993\)The origins of order: self\-organization and selection in evolution\.Oxford University Press,New York, NY\.External Links:ISBN 978\-0195079517Cited by:[Appendix G](https://arxiv.org/html/2606.02863#A7.p2.3),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3)\.
- T\. Lai and H\. Robbins \(1985\)Asymptotically efficient adaptive allocation rules\.Advances in Applied Mathematics6\(1\),pp\. 4–22\.External Links:ISSN 0196\-8858,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/0196-8858%2885%2990002-8),[Link](https://www.sciencedirect.com/science/article/pii/0196885885900028)Cited by:[Appendix C](https://arxiv.org/html/2606.02863#A3.p1.1)\.
- J\. Lehman, J\. Gordon, S\. Jain, K\. Ndousse, C\. Yeh, and K\. O\. Stanley \(2022\)Evolution through large models\.External Links:2206\.08896,[Link](https://arxiv.org/abs/2206.08896)Cited by:[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3)\.
- D\. A\. Levin and Y\. Peres \(2017\)Markov chains and mixing times\.Vol\.107,American Mathematical Soc\.\.Cited by:[Appendix C](https://arxiv.org/html/2606.02863#A3.p2.2)\.
- S\. Liu, S\. Agarwal, M\. Maheswaran, M\. Cemri, Z\. Li, Q\. Mang, A\. Naren, E\. Boneh, A\. Cheng, M\. Z\. Pan, A\. Du, K\. Keutzer, A\. Cheung, A\. G\. Dimakis, K\. Sen, M\. Zaharia, and I\. Stoica \(2026a\)EvoX: meta\-evolution for automated discovery\.External Links:2602\.23413,[Link](https://arxiv.org/abs/2602.23413)Cited by:[Appendix L](https://arxiv.org/html/2606.02863#A12.SS0.SSS0.Px3.p1.2),[Appendix H](https://arxiv.org/html/2606.02863#A8.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3),[§1](https://arxiv.org/html/2606.02863#S1.p2.8),[§3](https://arxiv.org/html/2606.02863#S3.p3.4)\.
- S\. Liu, M\. Cemri, S\. Agarwal, A\. Krentsel, A\. Naren, Q\. Mang, Z\. Li, A\. Gupta, M\. Maheswaran, A\. Cheng, M\. Pan, E\. Boneh, K\. Ramchandran, K\. Sen, A\. G\. Dimakis, M\. Zaharia, and I\. Stoica \(2026b\)SkyDiscover: a flexible framework for ai\-driven scientific and algorithmic discovery\.External Links:[Link](https://skydiscover-ai.github.io/blog.html)Cited by:[Appendix L](https://arxiv.org/html/2606.02863#A12.SS0.SSS0.Px2.p1.1),[Appendix H](https://arxiv.org/html/2606.02863#A8.p1.1),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3),[§3](https://arxiv.org/html/2606.02863#S3.p3.4)\.
- Q\. Mang, W\. Chai, Z\. Li, H\. Mao, S\. Zhou, A\. Du, H\. Li, S\. Liu, E\. Chen, Y\. Wang, X\. Chu, Z\. Cheng, Y\. Xu, T\. Xia, Z\. Wang, T\. Shi, J\. Yao, Y\. Zhao, Q\. Zhang, C\. Ruan, Z\. Shen, K\. Liu, R\. He, D\. Xing, Z\. Li, Z\. Zeng, Y\. Jiang, L\. Cheng, Z\. Zhao, Y\. Sun, W\. Zheng, M\. Zhang, R\. Ji, X\. Tu, Z\. Zheng, Z\. Chen, K\. Zhou, Z\. Wang, J\. Chen, A\. Korolova, P\. Henderson, P\. Viswanath, V\. Ganesh, S\. Xie, Z\. Liu, D\. Song, S\. Min, I\. Stoica, J\. E\. Gonzalez, J\. Shang, and A\. Cheung \(2025\)FrontierCS: evolving challenges for evolving intelligence\.External Links:2512\.15699,[Link](https://arxiv.org/abs/2512.15699)Cited by:[§3\.1](https://arxiv.org/html/2606.02863#S3.SS1.p1.2),[§3\.2](https://arxiv.org/html/2606.02863#S3.SS2.p1.1),[§3\.3](https://arxiv.org/html/2606.02863#S3.SS3.p2.12),[§3](https://arxiv.org/html/2606.02863#S3.p1.1)\.
- Moonshot AI \(2026\)Kimi\-k2\.5\.External Links:[Link](https://github.com/MoonshotAI/Kimi-K2.5)Cited by:[Table 3](https://arxiv.org/html/2606.02863#S3.T3.1.7.6.2)\.
- Y\. Nesterov \(2018\)Lectures on convex optimization\.Springer Optimization and Its Applications, Vol\.137,Springer\.Cited by:[Appendix C](https://arxiv.org/html/2606.02863#A3.p1.1)\.
- F\. Neumann and C\. Witt \(2010\)Bioinspired computation in combinatorial optimization: algorithms and their computational complexity\.Springer\.Cited by:[Appendix C](https://arxiv.org/html/2606.02863#A3.p1.1)\.
- A\. Novikov, N\. Vũ, M\. Eisenberger, E\. Dupont, P\. Huang, A\. Z\. Wagner, S\. Shirobokov, B\. Kozlovskii, F\. J\. R\. Ruiz, A\. Mehrabian, M\. P\. Kumar, A\. See, S\. Chaudhuri, G\. Holland, A\. Davies, S\. Nowozin, P\. Kohli, and M\. Balog \(2025\)AlphaEvolve: a coding agent for scientific and algorithmic discovery\.External Links:2506\.13131,[Link](https://arxiv.org/abs/2506.13131)Cited by:[§1](https://arxiv.org/html/2606.02863#S1.p1.1)\.
- OpenAI \(2026\)The generative pre\-trained transformer \(gpt\) model family\.Note:Accessed: May 2026External Links:[Link](https://openai.com/)Cited by:[Table 3](https://arxiv.org/html/2606.02863#S3.T3.1.4.3.2),[Table 3](https://arxiv.org/html/2606.02863#S3.T3.1.5.4.2),[Table 3](https://arxiv.org/html/2606.02863#S3.T3.1.6.5.2)\.
- Y\. Qu and M\. Lu \(2026\)Bilevel autoresearch: meta\-autoresearching itself\.External Links:2603\.23420,[Link](https://arxiv.org/abs/2603.23420)Cited by:[Appendix L](https://arxiv.org/html/2606.02863#A12.SS0.SSS0.Px10.p1.6),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3),[§1](https://arxiv.org/html/2606.02863#S1.p2.8)\.
- B\. Romera\-Paredes, M\. Barekatain, A\. Novikov, M\. Balog, M\. P\. Kumar, E\. Dupont, F\. J\. R\. Ruiz, J\. S\. Ellenberg, P\. Wang, O\. Fawzi, P\. Kohli, and A\. Fawzi \(2024\)Mathematical discoveries from program search with large language models\.Nature625\(7995\),pp\. 468–475\.External Links:[Document](https://dx.doi.org/10.1038/s41586-023-06924-6),ISBN 1476\-4687,[Link](https://doi.org/10.1038/s41586-023-06924-6)Cited by:[§1](https://arxiv.org/html/2606.02863#S1.p1.1)\.
- A\. Slivkins \(2019\)Introduction to multi\-armed bandits\.Found\. Trends Mach\. Learn\.12\(1–2\),pp\. 1–286\.External Links:ISSN 1935\-8237,[Link](https://doi.org/10.1561/2200000068),[Document](https://dx.doi.org/10.1561/2200000068)Cited by:[Appendix C](https://arxiv.org/html/2606.02863#A3.p1.1)\.
- T\. Tanveer \(2026\)LEVI: llm\-guided evolutionary search needs better harnesses, not bigger modelsExternal Links:[Link](https://github.com/ttanv/levi)Cited by:[Appendix L](https://arxiv.org/html/2606.02863#A12.SS0.SSS0.Px5.p1.1),[§1](https://arxiv.org/html/2606.02863#S1.SS0.SSS0.Px2.p1.3),[§1](https://arxiv.org/html/2606.02863#S1.p1.1)\.

## Appendix

Outline\.

- §A\.Notation reference
- §B\.Assumptions
- §C\.Why standard analytical tools do not apply
- §D\.Alternative formulations considered
- §E\.Relationship to Cheng et al\.’s ADRS decomposition
- §F\.Generation\-verification structure
- §G\.Methodological ancestors
- §H\.Mechanism implementation details
- §I\.Replication design
- §J\.Persistence of initial conditions
- §K\.Supplementary figures
- §L\.Extended comparison with concurrent work

## Appendix ANotation reference

Table 5:Notation reference\.
## Appendix BAssumptions

The theoretical results in this paper depend on the following assumptions\. We state them explicitly for auditability; each theorem references which assumptions it requires\.

A1\. Faithful context construction\.Context construction uses history content beyond the current best score: there existDt≠Dt′D\_\{t\}\\neq D\_\{t\}^\{\\prime\}with the same best score \(maxi≤t⁡si=maxi≤t⁡si′\\max\_\{i\\leq t\}s\_\{i\}=\\max\_\{i\\leq t\}s\_\{i\}^\{\\prime\}\) such thatC\(Dt,ℳt\)≠C\(Dt′,ℳt\)C\(D\_\{t\},\\mathcal\{M\}\_\{t\}\)\\neq C\(D\_\{t\}^\{\\prime\},\\mathcal\{M\}\_\{t\}\)with positive probability\. This excludes mechanisms that condition only onst∗s^\{\*\}\_\{t\}or ignore the history entirely\.*Used in:*Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2), Corollary[6](https://arxiv.org/html/2606.02863#Thmtheorem6)\.

A2\. Context\-dependent generator\.The generator’s output distribution depends on the context:c≠c′⇒G\(⋅∣c\)≠G\(⋅∣c′\)c\\neq c^\{\\prime\}\\Rightarrow G\(\\cdot\\mid c\)\\neq G\(\\cdot\\mid c^\{\\prime\}\)for allc,c′c,c^\{\\prime\}in the range ofCC\. For LLM\-based generators receiving contexts fromCC, this is empirically natural: contexts differ in substantive content \(code, scores, run history\), not merely in paraphrasing or irrelevant tokens, so distinct contexts produce distinct output distributions\.*Used in:*Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2), Corollary[6](https://arxiv.org/html/2606.02863#Thmtheorem6)\.

A3\. Upper\-tail\-separating assessor\.For any thresholdwwand any distinct generation distributionsG\(⋅∣c\)≠G\(⋅∣c′\)G\(\\cdot\\mid c\)\\neq G\(\\cdot\\mid c^\{\\prime\}\)that both admit improvement aboveww\(i\.e\.,P\(𝒜\(x\)\>w∣c\)\>0P\(\\mathcal\{A\}\(x\)\>w\\mid c\)\>0andP\(𝒜\(x\)\>w∣c′\)\>0P\(\\mathcal\{A\}\(x\)\>w\\mid c^\{\\prime\}\)\>0\), the distributions ofmax⁡\(w,𝒜\(x\)\)\\max\(w,\\,\\mathcal\{A\}\(x\)\)underG\(⋅∣c\)G\(\\cdot\\mid c\)andG\(⋅∣c′\)G\(\\cdot\\mid c^\{\\prime\}\)are not identical\.

This is the load\-bearing assumption for Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)\. It is strictly about the*upper tail*: two generation distributions that differ only belowwwproduce identical distributions ofst\+1∗=max⁡\(w,st\+1\)s^\{\*\}\_\{t\+1\}=\\max\(w,s\_\{t\+1\}\), so below\-threshold differences cannot violate the Markov property at best scoreww\. A3 requires that when contexts differ and improvement is possible from both, the distributions of the next best score differ—either through different improvement probabilities, or through different conditional distributions aboveww\. The quantifier is restricted to the generation family\{G\(⋅∣c\)\}c\\\{G\(\\cdot\\mid c\)\\\}\_\{c\}; we do not require this for arbitrary distributions over candidates\.

For LLM\-based generators, A3 is empirically natural: a prompt containing high\-scoring code produces a different probability of*exceeding*the current best than a prompt containing low\-scoring code—the generation distribution doesn’t partition into “above current best” and “below” independently of context\.

*Used in:*Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)\(for the non\-Markov step\)\.*Vacuous on:*P11’s cliff assessor \(Section[3\.3](https://arxiv.org/html/2606.02863#S3.SS3)\), where improvement probability is zero for all contexts \(𝒜∗\\mathcal\{A\}\_\{\*\}maps every generation distribution toδ0\\delta\_\{0\}\)\. A3’s antecedent is never satisfied, so the theorem is silent—not because the assumption is violated, but because no context admits improvement for it to distinguish\.

A4\. Non\-constant assessor\.There exist candidatesx,y∈𝒳x,y\\in\\mathcal\{X\}with𝒜\(x\)≠𝒜\(y\)\\mathcal\{A\}\(x\)\\neq\\mathcal\{A\}\(y\)\.*Used in:*Theorem[4](https://arxiv.org/html/2606.02863#Thmtheorem4)\. Strictly weaker than A3\.

A5\. Budget sufficiency\.The computational budgetBBis large enough for the mechanism to explore multiple history states\. Required for the budget\-limited vs\. saturated regime distinction to be meaningful\.*Used in:*regime classification \(Table[1](https://arxiv.org/html/2606.02863#S2.T1)\)\.

A6\. Assessor faithfulness\.The assessor𝒜\\mathcal\{A\}faithfully represents the optimization objective—higher𝒜\\mathcal\{A\}\-scores correspond to genuinely better candidates\. The framework optimizes𝒜\\mathcal\{A\}; if𝒜\\mathcal\{A\}is a misleading proxy, the system converges to optima of the proxy\.*Used in:*all interpretive claims that connect scores to candidate quality\.

Assumptions A1–A4 are structural and verifiable from the system’s architecture\. A3 targets the upper\-tail distribution within the generation family and directly delivers what the non\-Markov proof requires\. A5 is a design assumption about experimental adequacy\. A6 is an implicit modeling assumption common to all score\-based optimization\.

## Appendix CWhy standard analytical tools do not apply

Many standard quantitative convergence tools rely on structural assumptions that are often violated in ADRS\. Classical rate guarantees for gradient methods assume a fixed objective and stable update operator\[Nesterov,[2018](https://arxiv.org/html/2606.02863#bib.bib25)\]; classical regret bounds for bandits assume a fixed or well\-structured action set\[Lai and Robbins,[1985](https://arxiv.org/html/2606.02863#bib.bib26), Slivkins,[2019](https://arxiv.org/html/2606.02863#bib.bib27)\]; EA runtime analyses typically assume fixed representations with stationary mutation kernels\[Neumann and Witt,[2010](https://arxiv.org/html/2606.02863#bib.bib29)\]\.

The growing history can be embedded in an infinite\-dimensional product space where the full\-state process is Markov—and indeed time\-homogeneous if the mechanism state is included—but the resulting representation grows withttand does not, in general, admit tractable forms of the structural properties \(e\.g\., contraction, ergodicity, compactness\) on which standard analyses rely\[Levin and Peres,[2017](https://arxiv.org/html/2606.02863#bib.bib28)\]\. History\-dependent context construction and adaptive mechanisms mean that existing analytical tools do not directly yield nontrivial quantitative bounds on convergence rates in this setting\. The fundamental difficulty is that natural scalar summaries such asst∗s^\{\*\}\_\{t\}are not, in general, sufficient statistics for the process \(Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)\): convergence behavior depends on the full history, not the current best score alone\.

## Appendix DAlternative formulations considered

#### True landscapeL∗L^\{\*\}\.

We considered introducingL∗:𝒳→ℝL^\{\*\}:\\mathcal\{X\}\\to\\mathbb\{R\}representing actual candidate quality, distinct from𝒜\\mathcal\{A\}\. Rejected: for computationally hard problems𝒜=L∗\\mathcal\{A\}=L^\{\*\}; for engineering problemsL∗L^\{\*\}is ill\-defined\.L∗L^\{\*\}does no analytical work—all results operate on𝒜\\mathcal\{A\}\. We note𝒜\\mathcal\{A\}\-faithfulness as an implicit assumption: the assessor𝒜\\mathcal\{A\}is taken as ground truth for what constitutes improvement\.

#### Prompt construction as separate parameter\.

Rejected: examination of four discovery mechanisms shows prompt construction is tightly coupled to mechanism architecture\. Separating them creates a parameter that cannot vary independently in practice\.

#### Assessor embedded in landscape\.

Initial formulation used\(G,ℳ,L\)\(G,\\mathcal\{M\},L\)with𝒜⊂L\\mathcal\{A\}\\subset L\. Revised to\(G,𝒜,ℳ,L\)\(G,\\mathcal\{A\},\\mathcal\{M\},L\):𝒜\\mathcal\{A\}is a design choice, and separating it enables assessor\-limited diagnosis\.

## Appendix ERelationship to Cheng et al\.’s ADRS decomposition

Chenget al\.\[[2025b](https://arxiv.org/html/2606.02863#bib.bib3)\]define five architectural components of ADRS: Prompt Generator, Solution Generator, Evaluator, Storage, and Solution Selector\. Our parameterization maps as follows:

Table 6:Mapping betweenChenget al\.\[[2025b](https://arxiv.org/html/2606.02863#bib.bib3)\]’s five\-component decomposition and our parameterization\.The Prompt Generator is the component least cleanly captured by\(G,𝒜,ℳ,L\)\(G,\\mathcal\{A\},\\mathcal\{M\},L\)\. It controls: \(a\) problem statement formatting \(fixed per problem—part ofLL\), \(b\) which previous solutions to show \(determined by Solution Selector—part ofℳ\\mathcal\{M\}\), and \(c\) the prompt template and content: how selected information is assembled into text\. We absorb \(c\) intoC\(Dt,ℳt\)C\(D\_\{t\},\\mathcal\{M\}\_\{t\}\), treating it as part ofℳ\\mathcal\{M\}, because in current ADRS implementations prompt construction is tightly coupled to the discovery mechanism: AdaEvolve’s prompts include island\-specific context, EvoX’s prompts include strategy performance history, and GEPA’s prompts include rejection feedback\. Changingℳ\\mathcal\{M\}already changes prompt generation\.

## Appendix FGeneration\-verification structure

Generators may be compound AI systems whose internal composition determines the structure ofLeffL\_\{\\text\{eff\}\}\. Table[7](https://arxiv.org/html/2606.02863#A6.T7)shows the composition hierarchy; each level builds on the previous\.

Table 7:Composition hierarchy for generators\.Ensemble composition smoothsLeffL\_\{\\text\{eff\}\}by mixing reachable sets \(Section[2\.2](https://arxiv.org/html/2606.02863#S2.SS2)\)\. Verified composition additionally filters: internal verifiers preferentially pass high\-quality candidates before they reach the outer assessor𝒜\\mathcal\{A\}\. In general, composition nests recursively—generators contain verifiers that contain generators—producing arbitrarily deep filtering hierarchies\[Daviset al\.,[2024](https://arxiv.org/html/2606.02863#bib.bib10)\]\. A verified generator may be*static*\(non\-adaptive components and routing\) or*adaptive*\(components or routing evolve across iterations, with or without a learning conductor\)\. Our experiments span both: eb1 base is a static NoN, while eb1\-preview, eb1\-frontier\-preview, and eb1\-delta\-preview are adaptive NoN whose generation distributions shift over a run \(Table[3](https://arxiv.org/html/2606.02863#S3.T3)\)\.

## Appendix GMethodological ancestors

AutoML systems such as auto\-sklearn\[Feureret al\.,[2015](https://arxiv.org/html/2606.02863#bib.bib22)\]operate over large, structured configuration spaces whose practical tractability depends on limiting effective parameter interactions—surrogate models work best when effective dimensionality is low and high\-order interactions are limited\. ADRS components interact strongly:G×ℳG\\times\\mathcal\{M\}interactions are pervasive in our data, and the binding constraint shifts across problems, so which component to optimize depends on the full\(G,𝒜,ℳ\)\(G,\\mathcal\{A\},\\mathcal\{M\}\)configuration\. The configuration space is not a numeric hypercube but a choice among qualitatively different systems\.

Fitness landscape theory in evolutionary computation\[Kauffman,[1993](https://arxiv.org/html/2606.02863#bib.bib21)\]is the closest intellectual ancestor\. Kauffman’s NK model takes a fixed local mutation neighborhood as given; later formalizations \(e\.g\., Stadler’s⟨\\langlespace, neighborhood, fitness⟩\\rangledefinition\) make landscape topology explicitly dependent on the mutation operator, rendering it solver\-configuration\-dependent\.LeffL\_\{\\text\{eff\}\}extends this lineage: the “operator” is now a generative system comprising at least one LLM, whose output distribution is context\-dependent, stochastic, and potentially adaptive, making the landscape both generator\-dependent and, for adaptive generators, time\-varying \(Section[2\.4](https://arxiv.org/html/2606.02863#S2.SS4)\)\.

## Appendix HMechanism implementation details

All mechanisms are implemented in SkyDiscover\[Liuet al\.,[2026b](https://arxiv.org/html/2606.02863#bib.bib6)\]\(commit48bf7ae\)\. All generation hyperparameters \(temperature, max\_tokens, top\-p\) use SkyDiscover’s defaults at this commit unless otherwise stated\. We document the precise context construction for each mechanism, since the information shown to the generator determines whether the process satisfies the faithfulness condition of Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)\.

#### Best\-of\-N \(BoN\)\.

BoN uses greedy parent selection with history\-dependent context:

1. 1\.Parent selection: the highest\-scoring program in the history \(ties broken by insertion order\)\. The same parent is reused forN=5N\{=\}5consecutive iterations before re\-selection \(Nreuse=5N\_\{\\text\{reuse\}\}\{=\}5\)\.
2. 2\.Context programs:kctx=4k\_\{\\text\{ctx\}\}\{=\}4programs sampled uniformly from the history’s top\-kpool=10k\_\{\\text\{pool\}\}\{=\}10by score \(excluding the parent\)\. These are non\-best programs—the generator sees alternatives, not just the current optimum\.
3. 3\.Previous attempts: up tokprev=3k\_\{\\text\{prev\}\}\{=\}3recent programs \(from the lastw=100w\{=\}100iterations\), sorted by score, shown with their metrics and whether they improved or regressed relative to their parent\.
4. 4\.Prompt: the parent’s code, score breakdown, and evaluator feedback; context programs’ code and scores; previous attempts with outcomes; improvement direction hints \(score trend, solution length\)\.

BoN’s context constructionCCis faithful: two histories sharing the same best score but differing in non\-maximal entries will produce different context samples and different previous\-attempt histories, satisfying the conditions of Theorem[2](https://arxiv.org/html/2606.02863#Thmtheorem2)\. Score ties are common in practice \(e\.g\.,5353programs sharing score63\.7863\.78within a single P0 run\), so the “strictly injective assessor” assumption that would make BoN Markov\-reducible does not hold\.

BoN carries no adaptive state beyond the history: no population partitions, no stagnation counters, no strategy parameters\. This makes it the minimal faithful mechanism—any simplerCC\(e\.g\., showing only the best program with no context\) would be unfaithful, reducing the process to i\.i\.d\. sampling\.

#### AdaEvolve\.

AdaEvolve\[Cemriet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib7)\]adds multi\-island population structure with33\+33\{\+\}tunable parameters\. Its context construction includes island\-specific parent selection, stagnation detection that triggers “meta\-guidance” strategy shifts, and island migration\. The mechanism stateℳt\\mathcal\{M\}\_\{t\}includes per\-island stagnation counters, migration history, and adapted strategy parameters\.

#### EvoX\.

EvoX\[Liuet al\.,[2026a](https://arxiv.org/html/2606.02863#bib.bib8)\]adapts the search strategy itself via co\-evolutionary meta\-search\. Its context construction includes strategy performance history across iterations and a switch interval \(2020iterations in our experiments\) that triggers strategy re\-evaluation\. The mechanism state includes the strategy archive and performance statistics\.

## Appendix IReplication design

Every configuration receives≥5\\geq 5independent runs\. For configurations whose sorted scores show gaps\>15\>15points \(indicating distinct performance basins\), we add runs until each detected cluster contains≥3\\geq 3independent observations\.

## Appendix JPersistence of initial conditions

###### Corollary 6\(Persistence of initial conditions\)\.

For an ADRS with faithfulCC, context\-dependentGG, andD0D\_\{0\}\-preserving context construction \(i\.e\.,C\(Dt,ℳt\)C\(D\_\{t\},\\mathcal\{M\}\_\{t\}\)carries information fromD0D\_\{0\}for alltt\), the generation distributionG\(⋅∣ct\)G\(\\cdot\\mid c\_\{t\}\)depends onD0D\_\{0\}for allt≤Tt\\leq T\.

###### Proof\.

By induction\.D0D\_\{0\}determinesc0=C\(D0,ℳ0\)c\_\{0\}=C\(D\_\{0\},\\mathcal\{M\}\_\{0\}\), henceG\(⋅∣c0\)G\(\\cdot\\mid c\_\{0\}\)depends onD0D\_\{0\}\(by A2\)\. At steptt,D0D\_\{0\}\-preservation ensuresct=C\(Dt,ℳt\)c\_\{t\}=C\(D\_\{t\},\\mathcal\{M\}\_\{t\}\)carriesD0D\_\{0\}information; context\-dependence \(A2\) then gives aD0D\_\{0\}\-dependent generation distributionG\(⋅∣ct\)G\(\\cdot\\mid c\_\{t\}\)\. ∎

In practice,D0D\_\{0\}\-preservation holds so long asD0D\_\{0\}remains competitive, orℳ\\mathcal\{M\}orGGcontinue surfacing it in context\. When the run improves pastD0D\_\{0\}, mechanisms typically replace early entries with higher\-scoring candidates, potentially weakeningD0D\_\{0\}’s influence on the context\.

## Appendix KSupplementary figures

![Refer to caption](https://arxiv.org/html/2606.02863v1/x3.png)Figure 3:P0 score distributions for all 12 generators individually \(Figure[1](https://arxiv.org/html/2606.02863#S3.F1)a shows eb1\-frontier\-preview as representative\)\. All configurations haven≥5n\\geq 5runs per mechanism; see Appendix[I](https://arxiv.org/html/2606.02863#A9)for replication criteria\.![Refer to caption](https://arxiv.org/html/2606.02863v1/x4.png)Figure 4:Mechanism contribution relative to BoN median on P0 \(full version of Figure[1](https://arxiv.org/html/2606.02863#S3.F1)b, with all eb1 variants shown individually\)\. Gemini\-2\.5\-Flash and Qwen\-3\.5\-9B omitted \(median score 0 across all mechanisms\)\.![Refer to caption](https://arxiv.org/html/2606.02863v1/x5.png)Figure 5:P1 iterations to saturation and sub\-saturation scores for all generators \(full version of Figure[2](https://arxiv.org/html/2606.02863#S3.F2)\)\. Includes GPT\-5\-mini and Gemini\-2\.5\-Flash, omitted from the main figure for space\. GPT\-5\-mini behaves similarly to Kimi\-K2\.5 \(fast saturation\); Gemini\-2\.5\-Flash is in the slower tier with GPT\-OSS\-20B and Claude Opus 4\.6\.![Refer to caption](https://arxiv.org/html/2606.02863v1/x6.png)Figure 6:G×ℳG\\times\\mathcal\{M\}coverage matrix showing the fraction of runs reaching score≥99\\geq 99on P1 \(left\) vs\. P11 \(right\)\. Each cell shows runs reaching near\-optimal out of total runs\. P1 exhibits rich differentiation across generators and mechanisms \(Section[3\.2](https://arxiv.org/html/2606.02863#S3.SS2)\); P11 shows universal zero across all 22 tested configurations\. Grey cells indicate untested combinations\. Generators grouped by family; eb1 variants shown individually \(eb1\-pro excluded due to availability\-related coverage sparsity\)\.
## Appendix LExtended comparison with concurrent work

The phenomena we formalize have been observed empirically by several groups working on ADRS during the same period\. We provide a detailed comparison here\.

#### Cheng et al\. \(2025\)\.

Chenget al\.\[[2025a](https://arxiv.org/html/2606.02863#bib.bib4)\]compare three frameworks \(OpenEvolve, GEPA, ShinkaEvolve\) across two generators \(GPT\-5, Gemini\-3\.0\) on ten systems problems and report that “GEPA performed significantly better with GPT\-5, whereas ShinkaEvolve favored Gemini\-3\.0”—a clearG×ℳG\\times\\mathcal\{M\}interaction across diverse problem domains\. Their best practices distill these observations into practitioner heuristics \(“select LLM based on desired solution structure,” “the solution is only as good as the evaluator”\) that correspond to our generator sensitivity and assessor\-limited concepts, respectively\.

#### SkyDiscover\.

SkyDiscover\[Liuet al\.,[2026b](https://arxiv.org/html/2606.02863#bib.bib6)\]extends this to 172 Frontier\-CS problems and 14 math/systems tasks, demonstrating both generator sensitivity and mechanism sensitivity at scale, though without run\-to\-run variance data to assess the significance of individual differences\.

#### EvoX\.

EvoX\[Liuet al\.,[2026a](https://arxiv.org/html/2606.02863#bib.bib8)\]reports mean and best scores over three runs per configuration across∼200\\sim\\\!200tasks, revealing both generator sensitivity \(e\.g\., on Cloudcast, Gemini\-3\.0\-Pro succeeds with random sampling alone while GPT\-5 requires an adaptive mechanism—aG×ℳG\\times\\mathcal\{M\}interaction\) and mean–best gaps consistent with basin structure\. However, EvoX focuses on*within\-run*variation: adapting the search strategy across iterations as the landscape changes during a single trajectory\. The*across\-run*question—why do independent runs of the same configuration produce different outcomes?—is not studied, and the replicated data that could answer it is reported only as a robustness metric\.

#### AdaEvolve\.

AdaEvolve\[Cemriet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib7)\]reports mean±\\pmstd over three runs per configuration on six math and seven systems tasks, using both GPT\-5 and Gemini\-3\-Pro\. Their cross\-backbone data revealsG×ℳG\\times\\mathcal\{M\}interactions that the paper itself does not highlight: ShinkaEvolve on Circle Packing scores2\.464±\.0832\.464\\pm\.083with GPT\-5 but2\.622±\.0122\.622\\pm\.012with Gemini, while GEPA on Circle Packing \(Rect\)*worsens*from2\.3262\.326to2\.2162\.216when switching to Gemini\. The variance data is also informative: ShinkaEvolve’s std of\.083\.083on Circle Packing with GPT\-5 suggests bimodal outcomes \(some runs reaching2\.5412\.541, others not\), while AdaEvolve achieves std=\.001=\.001on the same task\. This variance reduction is not accidental—AdaEvolve’s Level 3 “Meta\-Guidance” is explicitly designed to escape stagnation plateaus, which in our terminology are basin commitment events\. Their case study confirms this: runs without meta\-guidance “remain stuck near2\.5142\.514,” while meta\-guidance triggers a qualitative strategy shift that escapes to2\.6362\.636\. AdaEvolve thus provides engineering validation that basin structure is real and that explicit escape mechanisms can address it, consistent with our theoretical prediction that run\-to\-run variance reflects basin structure inLeffL\_\{\\text\{eff\}\}\.

#### LEVI\.

LEVI\[Tanveer,[2026](https://arxiv.org/html/2606.02863#bib.bib9)\]builds a framework around generator–mechanism interaction: cheap models handle most mutations \(within\-basin refinement in our terminology\), while a frontier model is reserved for infrequent “paradigm shifts” \(cross\-basin jumps\)\. LEVI’s CVT\-MAP\-Elites archive, which maintains diversity across both structural and behavioral dimensions, can be understood as a mechanism designed to prevent premature basin commitment\. Their controlled comparison—same model, same budget, three seeds—shows that the search architecture matters as much as the generator, consistent with ourG×ℳG\\times\\mathcal\{M\}interaction results\. LEVI observes run\-to\-run variance \(their shaded confidence bands\) but treats it as noise; our framework characterizes its source\.

#### GEPA\.

GEPA\[Agrawalet al\.,[2026](https://arxiv.org/html/2606.02863#bib.bib5)\], a reflective evolutionary optimizer applicable to any text\-based optimization problem, provides evidence that assessor richness is a first\-class component\. By extending the scalar evaluator to return natural language feedback \(execution traces, error diagnostics\), GEPA outperforms reinforcement learning baselines by up to 20% with44–35×35\\timesfewer rollouts, using the same underlying search mechanism\. In our terminology, enriching𝒜\\mathcal\{A\}reshapesLeffL\_\{\\text\{eff\}\}to be more navigable\. GEPA also exhibitsG×ℳG\\times\\mathcal\{M\}interaction: their merge mechanism improves performance with GPT\-4\.1 Mini but*degrades*it with Qwen3 8B\.

#### Cheng et al\. \(2026\)\.

Chenget al\.\[[2026](https://arxiv.org/html/2606.02863#bib.bib2)\]extend ADRS to database optimization and address theAA\-limited regime directly, co\-evolving the evaluator alongside solutions\. Across three case studies they demonstrate that a misleadingAAtraps evolution in false optima and that improvingAAunlocks gains noGGorℳ\\mathcal\{M\}change could achieve \(up to6\.8×6\.8\\timeslatency reduction\)—makingAAadaptive \(Leff,t=At∘GL\_\{\\text\{eff\},t\}=A\_\{t\}\\circ G\) even with staticGG\.

#### Glia\.

Hamadanianet al\.\[[2026](https://arxiv.org/html/2606.02863#bib.bib31)\]introduce a multi\-agent workflow \(Researcher \+ Supervisor\) for systems design that reasons at the hypothesis level rather than performing code\-level mutation\. Compared against EoH, FunSearch, and OpenEvolve on LLM\-serving request routing, their Multi\-Context variant \(MCG, best\-of\-NNparallel reasoning trajectories\) outperforms evolutionary methods by1\.31\.3–1\.7×1\.7\\times, with diminishing returns beyondN=4N=4\. Glia simultaneously changes both generator \(compound reasoning agent vs\. single\-model code completion\) and mechanism \(best\-of\-NNvs\. island evolution\), so the improvement cannot be attributed to either component alone—consistent with performance being a property of the\(G,𝒜,ℳ\)\(G,\\mathcal\{A\},\\mathcal\{M\}\)triple\. Their finding that MCG benefit saturates at smallNNis consistent with ensemble diminishing returns whenLeffL\_\{\\text\{eff\}\}diversity is exhausted\. No variance analysis or formal decomposition is provided\.

#### Engram\.

Karimiet al\.\[[2026](https://arxiv.org/html/2606.02863#bib.bib32)\]address what they call the*coherence ceiling*: a single long\-running agent’s performance degrades as context grows, yet independent runs discard prior insights\. Their solution decouples exploration from persistence: a sequence of agents each operates in a fresh context window, reading a structured*Research Digest*summarizing prior agents’ findings, and writing results into a persistent*Archive*\. On LLM request routing \(same benchmark as Glia\), multi\-cloud multicast, and KV\-cache reuse, Engram outperforms Glia, EoH, FunSearch, and OpenEvolve\. In our terminology, the coherence ceiling is a practical manifestation of the growing\-dimensional state\(Dt,ℳt\)\(D\_\{t\},\\mathcal\{M\}\_\{t\}\)exceeding what fits in a finite context window; the Digest is a lossy compression ofDtD\_\{t\}designed to preserve the reasoning behind prior attempts, not just their scores\. Their ablation confirms that context construction is a first\-class design choice: removing the Digest degrades performance more than removing the raw Archive, suggesting that structured interpretation of history matters more than raw access to it\. Engram also tolerates temporary score regression \(costs rising from $772 to $1104 before recovering to $644\): the Research Digest preserves the reasoning behind the failed attempt, enabling the next agent to persist within a promising algorithmic family rather than retreating to the previous best\. Score\-only selection would prune the intermediate, making sustained exploration through a temporary regression unlikely without the structured context that history\-dependent reasoning provides\.

#### Bilevel Autoresearch\.

Qu and Lu \[[2026](https://arxiv.org/html/2606.02863#bib.bib23)\]meta\-optimize the search mechanism itself: an outer loop analyzes the inner autoresearch loop’s trace, generates new Python mechanism code \(Tabu Search, Multi\-Armed Bandit, Orthogonal Exploration\), and injects it at runtime\. On a GPT pretraining task, mechanism replacement \(their Level 2\) produces5×5\\timesimprovement over the inner loop alone, while parameter\-level adjustment yields essentially zero gain—an extreme instance ofℳ\\mathcal\{M\}\-limitation where the generator can produce better solutions but the fixed mechanism cannot navigate to them\. They observe that the inner loop without intervention exhibits “near\-deterministic repetition” \(the LLM proposes the same changes every iteration\), which in our terminology is a barrier inLeffL\_\{\\text\{eff\}\}: the generator’s prior biases create deterministic trajectories that the mechanism must break\. Their high run\-to\-run variance \(±0\.030\\pm 0\.030,67%67\\%of absolute mean atn=3n=3\) further confirms that single runs are unreliable estimators\.

#### Summary\.

These works collectively demonstrate that generator sensitivity,G×ℳG\\times\\mathcal\{M\}interaction, and assessor\-dependent performance are robust empirical phenomena spanning competitive programming, math, systems optimization, and database domains\. All report variance data but treat it as noise or a robustness metric; none provides a theoretical account of why these phenomena arise or how to diagnose the limiting factor in a given configuration with targeted evaluations rather than exhaustive ablation\. The GAMBLe framework takes the first steps to fill this gap\.
Don't Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems

Similar Articles

GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

Games people — and machines — play: Untangling strategic reasoning to advance AI

Most “agentic AI” conversations feel too abstract. Here is how my agentic research system looks like

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

Playing games with knowledge: AI-Induced delusions need game theoretic interventions

Submit Feedback

Similar Articles

GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives
Games people — and machines — play: Untangling strategic reasoning to advance AI
Most “agentic AI” conversations feel too abstract. Here is how my agentic research system looks like
Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
Playing games with knowledge: AI-Induced delusions need game theoretic interventions