Scientific discovery as meta-optimization: a combinatorial optimization case study

arXiv cs.AI Papers

Summary

This paper proposes formalizing scientific discovery as a meta-optimization problem where LLMs generate and aggregate objective functions via correlation-weighted voting, applied to 3-SAT algorithm discovery using digital MemComputing, achieving a 67x speedup on large instances.

arXiv:2606.26728v1 Announce Type: new Abstract: Scientific discovery is fundamentally an optimization problem, defined by a vast "state space" of theories and experiments, and an evaluation criterion based on quality, novelty, and validity. Large language models (LLMs) have enabled automated exploration of this space, but we argue that simultaneous modification of the evaluation criteria is equally important. Here, we propose formalizing research as meta-optimization, where the optimization objective itself is also being optimized. Our key contribution is "consensus objective aggregation," where LLM-generated objective functions are combined via correlation-weighted voting, yielding a stable, self-correcting evaluation criterion that evolves as understanding deepens. We apply this framework to algorithm discovery for 3-SAT problems based on digital MemComputing machines, reducing the baseline scaling with problem size $N$ from $\sim N^{2.51}$ to $\sim N^{1.33}$ and delivering a $\sim 67\times$ speedup on the largest instances tested. As a problem-agnostic framework, we hope this approach will considerably aid scientific discovery.

Cached at: 06/26/26, 05:16 AM

# Scientific discovery as meta-optimization: a combinatorial optimization case study
Source: [https://arxiv.org/html/2606.26728](https://arxiv.org/html/2606.26728)
Yuan-Hang Zhang, Department of Physics, University of California San Diego, La Jolla, CA 92093, yuz092@ucsd.edu

Chesson Sipling, Department of Physics, University of California San Diego, La Jolla, CA 92093, csipling@ucsd.edu

Massimiliano Di Ventra, Department of Physics, University of California San Diego, La Jolla, CA 92093, diventra@physics.ucsd.edu

###### Abstract

Scientific discovery is fundamentally an optimization problem, defined by a vast “state space” of theories and experiments, and an evaluation criterion based on quality, novelty, and validity. Large language models (LLMs) have enabled automated exploration of this space, but we argue that simultaneous modification of the evaluation criteria is equally important. Here, we propose formalizing research as meta-optimization, where the optimization objective itself is also being optimized. Our key contribution is “consensus objective aggregation,” where LLM-generated objective functions are combined via correlation-weighted voting, yielding a stable, self-correcting evaluation criterion that evolves as understanding deepens. We apply this framework to algorithm discovery for 3-SAT problems based on digital MemComputing machines, reducing the baseline scaling with problem size $N$ from $\sim N^{2.51}$ to $\sim N^{1.33}$ and delivering a $\sim 67\times$ speedup on the largest instances tested. As a problem-agnostic framework, we hope this approach will considerably aid scientific discovery.

## 1 Introduction

At its core, scientific research is an optimization process over a vast space of theories, experiments, and implementations\[[8](https://arxiv.org/html/2606.26728#bib.bib2)\]\. Every published result corresponds to a local optimum in this “research space,” found through a slow, noisy sampling process carried out by humans\. Evaluating a single idea may take months, and mistakes are common\. Recently, large language models \(LLMs\) have started to automate this research process at scales far exceeding individual researchers, covering the entire loop from generating hypotheses, writing code, designing experiments, to analyzing results and writing academic papers\[[18](https://arxiv.org/html/2606.26728#bib.bib21),[37](https://arxiv.org/html/2606.26728#bib.bib50),[14](https://arxiv.org/html/2606.26728#bib.bib16),[5](https://arxiv.org/html/2606.26728#bib.bib5),[22](https://arxiv.org/html/2606.26728#bib.bib27)\]\.

The early results are striking. The AI Scientist \[[18](https://arxiv.org/html/2606.26728#bib.bib21)\] and its successor \[[37](https://arxiv.org/html/2606.26728#bib.bib50)\] produce complete research papers across machine learning subfields. Google’s AI co-scientist \[[14](https://arxiv.org/html/2606.26728#bib.bib16)\] generates and debates biomedical hypotheses through a tournament-style loop. FunSearch \[[26](https://arxiv.org/html/2606.26728#bib.bib34)\], AlphaEvolve \[[23](https://arxiv.org/html/2606.26728#bib.bib30)\], and SATLUTION \[[39](https://arxiv.org/html/2606.26728#bib.bib52)\] pair LLMs with evolutionary search to push the state of the art on algorithm discovery. Multi-agent systems now tackle materials discovery \[[21](https://arxiv.org/html/2606.26728#bib.bib26),[33](https://arxiv.org/html/2606.26728#bib.bib46)\], chemical synthesis \[[5](https://arxiv.org/html/2606.26728#bib.bib5)\], and the recovery of physical laws \[[15](https://arxiv.org/html/2606.26728#bib.bib18)\]. Tree-search methods, including Monte Carlo Tree Search (MCTS), have been coupled with LLMs for heuristic design \[[42](https://arxiv.org/html/2606.26728#bib.bib60),[35](https://arxiv.org/html/2606.26728#bib.bib45),[34](https://arxiv.org/html/2606.26728#bib.bib43)\] and mathematical reasoning \[[40](https://arxiv.org/html/2606.26728#bib.bib54),[38](https://arxiv.org/html/2606.26728#bib.bib51)\]. For surveys of this rapidly expanding area, see, e.g., \[[36](https://arxiv.org/html/2606.26728#bib.bib47),[41](https://arxiv.org/html/2606.26728#bib.bib58),[20](https://arxiv.org/html/2606.26728#bib.bib24),[11](https://arxiv.org/html/2606.26728#bib.bib11)\].

A more fundamental question, though, runs beneath all of these efforts: What should these systems actually optimize? The objectives guiding automated discovery, from hand-crafted benchmarks and proxy scores to LLM-generated evaluation functions, are imperfect measures of genuine research progress. This is Goodhart’s law in action, well known from reinforcement learning: “when a measure becomes a target, it ceases to be a good measure” \[[13](https://arxiv.org/html/2606.26728#bib.bib15)\]. Optimizing against proxy objectives invites reward hacking, in which solutions score well on the metric while failing at the underlying goal \[[30](https://arxiv.org/html/2606.26728#bib.bib39),[31](https://arxiv.org/html/2606.26728#bib.bib41)\]. Objective specification has been flagged as a central difficulty for AI-driven discovery \[[10](https://arxiv.org/html/2606.26728#bib.bib10)\], and recent work on evolving objectives \[[10](https://arxiv.org/html/2606.26728#bib.bib10),[19](https://arxiv.org/html/2606.26728#bib.bib23)\] has begun to address it, though usually by swapping one objective for the next in sequence.

In real scientific research, the goals are often unclear or evolving\. A project may start with something like “develop an efficient algorithm” or “discover a potent drug,” but as understanding builds, the criteria for success shift\. Goals are typically plural—accuracy, explanatory power, robustness, cost, novelty, societal impact—and may resist any single stable ordering\. It is possible to define a meta\-objective and optimize the objective function itself, but this simply pushes the same specification and gaming problem up one level\. Therefore, we treat objectives as evolving, uncertain, and multi\-dimensional, working with portfolios rather than individual metrics\.

In this work, we propose *consensus objective aggregation* as a robust mechanism for meta-optimization in automated scientific discovery. Instead of maintaining a single evolving objective, the framework generates many objective functions, each encoding a different aspect of solution quality, and aggregates them through correlation-weighted voting. We compute pairwise Kendall’s $\tau$ rank correlations \[[17](https://arxiv.org/html/2606.26728#bib.bib61)\] across all objectives, weight each one by its median agreement with the rest (clamped to be non-negative) times an exponential age decay factor, and produce a consensus ranking via a weighted Borda count \[[12](https://arxiv.org/html/2606.26728#bib.bib12)\].

Since each proxy objective is an imperfect attempt to capture the true, unknown research goal, objectives that agree with each other are more likely to capture genuine research progress, while disagreeing outliers are likely noisy or misleading\. Therefore, we assign higher weight to objectives that agree with many others\. The age decay mechanism allows newer objectives to gradually replace older ones informed by updated understanding\. Noisy or adversarial objectives are automatically suppressed, making the consensus ranking self\-correcting\.

This consensus mechanism sits inside a multi-agent system that iterates in a closed loop, where objective evolution is steered by a meta-agent and carried out by an objective agent, while actual research is outlined by the planner agent and implemented by the designer agent. We apply the framework to algorithm discovery for random planted 3-SAT problems \[[2](https://arxiv.org/html/2606.26728#bib.bib62)\] using digital MemComputing machines (DMMs) \[[9](https://arxiv.org/html/2606.26728#bib.bib64),[32](https://arxiv.org/html/2606.26728#bib.bib65),[3](https://arxiv.org/html/2606.26728#bib.bib4),[29](https://arxiv.org/html/2606.26728#bib.bib38)\]. Across the search, the system explored 414 solver designs under the guidance of 42 co-evolving objectives. The best solver found cuts the baseline scaling from $\sim N^{2.51}$ to $\sim N^{1.33}$ and achieves a $\sim 67\times$ speedup on the largest test instance. Because the framework is problem-agnostic, the same LLM-based meta-optimization can in principle be applied to a wide variety of problems, and we hope it will considerably aid future scientific discovery.

## 2 Results

### 2.1 Framework Overview

![Refer to caption](https://arxiv.org/html/2606.26728v1/x1.png)

Figure 1: Framework overview. The system consists of four LLM agents in an iterative cycle. Starting from a high-level human-designed research goal, the meta-agent sets the research strategy, guiding objective generation and analyzing objective quality. The objective agent proposes proxy objective functions reflecting different aspects of solution quality; these feed into a consensus objective that aggregates rankings through correlation-weighted voting. The planner agent uses Monte Carlo Graph Search (MCGS) under the consensus objective to identify strategic research directions. The designer agent turns those directions into concrete implementations, which are tested by a multi-fidelity execution engine that allocates computational budget to the most promising designs. A hyperparameter optimization module periodically tunes the leading design’s parameters.

Our framework has four LLM agents in an iterative cycle (Fig. [1](https://arxiv.org/html/2606.26728#S2.F1)). Starting from a high-level goal defined by the human researcher, each iteration proceeds through the following stages:

Meta-agent. The meta-agent receives the human-defined research goal and steers the overall research strategy. Periodically, it analyzes existing objectives via Kendall’s $\tau$ correlations, providing strategic guidance to the objective agent and assigning weight multipliers that can amplify useful objectives or suppress harmful ones.

Objective agent. Acting on the meta-agent’s assessment, the objective agent generates new proxy objective functions mapping experiment results to scalar quality scores. Each objective is meant to capture a different aspect of solution quality. All objectives are scored on all designs and aggregated into a single consensus ranking through Kendall’s $\tau$ correlation-weighted voting with age decay (Sec. [2.2](https://arxiv.org/html/2606.26728#S2.SS2) and Sec. [4.2](https://arxiv.org/html/2606.26728#S4.SS2)).

Planner agent. The planner receives the MCGS-ranked design list (scored by the consensus objective) along with the full experiment history. It looks for patterns among successes and failures and outputs several strategic research directions, with references for each direction to build upon.

Designer agent. For each direction the planner proposes, the designer writes new solver code and runs it through a multi-fidelity experiment schedule \[[25](https://arxiv.org/html/2606.26728#bib.bib33)\]. Designs advance to higher fidelity levels by rule-based criteria relative to the current population, so that computational budget concentrates on the strongest candidates. The Heteroscedastic Evolutionary Bayesian Optimization (HEBO) algorithm \[[6](https://arxiv.org/html/2606.26728#bib.bib63)\] periodically tunes hyperparameters on the best untuned design.

This iterative cycle allows solutions and evaluation criteria to evolve simultaneously\. The consensus mechanism ensures a stable research objective: even if certain LLM\-generated objectives could be misleading, the consensus ranking stays robust and informative\.

### 2.2 Consensus Objective Aggregation

![Refer to caption](https://arxiv.org/html/2606.26728v1/x2.png)

Figure 2: Consensus objective aggregation. (a) Kendall’s $\tau$ correlation matrix across 42 LLM-generated objectives. Most objectives are positively correlated (red), but a few outliers have negative correlations with the majority (blue), showing that LLM-generated objectives can indeed be misleading sometimes. (b) Consensus weights after correlation-weighted voting with age decay ($\lambda = 0.9$). Newer objectives are weighted higher due to age decay, while some older objectives still carry a large weight from the correlated majority. Outlier objectives with low agreement are suppressed. (c) PCA of objectives based on pairwise $\tau$ correlations, colored by objective ID (creation order); larger dots denote higher weight. Earlier objectives form a small, isolated cluster, while later objectives converge towards another larger cluster, reflecting the shift in the research goal as research progresses. Outlier objectives are visually separated and get suppressed in the consensus ranking.

No single proxy objective reliably captures solution quality. An objective rewarding low computational cost at small problem sizes, for example, may unintentionally favor solutions that scale poorly. Each time an LLM generates a new objective function, it can introduce subtle biases or overlook important edge cases. Depending on any one such objective, even one that evolves over time, can lead to reward hacking.

Instead, we maintain a portfolio of objective functions and aggregate them into a consensus ranking\. The intuition is straightforward: every LLM\-generated objective is an imperfect proxy for the same underlying latent research goal\. Good proxies should more or less agree, because they each approximate the same target from different aspects; an objective that disagrees with the majority has likely drifted too far from that goal through subtle biases or blind spots\. The aggregation procedure \(Fig\.[2](https://arxiv.org/html/2606.26728#S2.F2)\) works as follows:

1. Score matrix. Each objective function $f_i$ scores every design $d_j$, producing a matrix $S_{ij} = f_i(d_j)$.
2. Rank conversion. Scores are converted to ranks within each objective (lower score = better rank), keeping the score range consistent across different objectives.
3. Pairwise correlations. We compute the Kendall’s $\tau$ rank correlation \[[17](https://arxiv.org/html/2606.26728#bib.bib61)\] $\tau_{ik}$ between every pair of objectives $f_i$ and $f_k$ over all designs. The complete Kendall’s $\tau$ matrix is plotted in Fig. [2](https://arxiv.org/html/2606.26728#S2.F2)(a), revealing clusters of positively correlated objectives and a few outliers.
4. Agreement weighting. Each objective’s weight is proportional to its median pairwise $\tau$ with all other objectives (clamped at zero) multiplied by an exponential age decay: $w_i = \max(\widetilde{\tau}_i, 0) \cdot \lambda^{t - t_i}$, where $\widetilde{\tau}_i = \text{median}_{k \neq i}(\tau_{ik})$ is the median correlation, $\lambda = 0.9$ is the decay base, $t$ is the current round, and $t_i$ is the round in which objective $i$ was created. Weights are then normalized to sum to one. The weights of all LLM-generated objectives are plotted in Fig. [2](https://arxiv.org/html/2606.26728#S2.F2)(b).
5. Consensus ranking. The final ranking is a weighted Borda count \[[12](https://arxiv.org/html/2606.26728#bib.bib12)\]: $C_j = \sum_i w_i \cdot R_{ij}/(n-1)$, where $R_{ij}$ is the rank of design $d_j$ under objective $f_i$ and $n$ is the number of designs.

This aggregation procedure is self-correcting. Objectives that disagree with the majority get near-zero weight through the median $\tau$ clamping, and age decay gradually shifts influence from older objectives to newer, better-informed ones.
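The agreement-weighting step can be illustrated with a minimal sketch. It assumes the Kendall's $\tau$ matrix has already been computed and covers at least two objectives; the function name `agreement_weights`, its arguments, and the toy matrix are illustrative and not taken from the paper's code. The uniform fallback follows the rule stated in Sec. 4.2.

```python
from statistics import median


def agreement_weights(tau, created_round, current_round, decay=0.9):
    """Normalized consensus weights from a symmetric K x K tau matrix (K >= 2).

    tau[i][k] is Kendall's tau between objectives i and k; created_round[i]
    is t_i. Function and argument names are illustrative, not from the paper.
    """
    k = len(tau)
    raw = []
    for i in range(k):
        others = [tau[i][j] for j in range(k) if j != i]
        agreement = max(median(others), 0.0)      # clamp negative agreement
        age = current_round - created_round[i]    # t - t_i
        raw.append(agreement * decay ** age)
    total = sum(raw)
    if total == 0:
        return [1.0 / k] * k                      # uniform fallback (Sec. 4.2)
    return [r / total for r in raw]


# Toy matrix (made-up values): objectives 0 and 1 agree, objective 2 is an outlier.
tau = [
    [1.0, 0.8, -0.4],
    [0.8, 1.0, -0.2],
    [-0.4, -0.2, 1.0],
]
weights = agreement_weights(tau, created_round=[0, 2, 2], current_round=2)
print(weights)
```

In this toy case the outlier's median agreement is negative, so its weight is clamped to zero, and objective 1 outweighs objective 0 both because it agrees slightly more and because it is two rounds newer.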

Principal component analysis (PCA) on the standardized rank vectors of all objectives (Fig. [2](https://arxiv.org/html/2606.26728#S2.F2)(c)) confirms this picture: objectives that rank designs similarly cluster together, while outlier objectives sit apart and receive low consensus weight. The earlier objectives form an initial, smaller cluster, which gradually shifts towards another larger cluster as understanding deepens and the system’s notion of quality converges. The Supplementary Information (SI), Sec. [S2](https://arxiv.org/html/2606.26728#A2) traces the full evolution of the objective portfolio, including reward hacking episodes and their mitigation.

Still, agreement\-based weighting does have a blind spot\. When a block of correlated objectives constitutes the majority, mutual agreement inflates every member’s weight, even if the criterion they collectively encode is misaligned with the true research goal\. The consensus mechanism cannot tell a genuinely informative majority from an echo chamber of redundant proxies\.

To address this, the meta-agent periodically reviews all objectives and assigns weight multipliers $m_i$ that modulate the consensus weights ($w_i' \propto w_i \cdot m_i$). This adjustment can break cluster dominance where statistical aggregation alone cannot. In the 3-SAT case study we discuss below, the meta-agent used this to progressively zero out 20 early objectives whose tight mutual correlation rewarded small-$N$ performance, redirecting weight to newer metrics better matched to the large-$N$ regime (see SI Sec. [S2.2](https://arxiv.org/html/2606.26728#A2.SS2) for details).

### 2.3 Exploration via Monte Carlo Graph Search

The exploration-exploitation dilemma \[[4](https://arxiv.org/html/2606.26728#bib.bib68)\] is a well-studied problem and has standard solutions in bandit-style problems \[[1](https://arxiv.org/html/2606.26728#bib.bib67)\]. In our framework, we balance the exploration of novel designs and exploitation of promising designs via Monte Carlo Graph Search (MCGS) \[[7](https://arxiv.org/html/2606.26728#bib.bib8)\]. Each new design is built from one or more reference designs, and this parent-child structure forms a directed graph, which is the basis for MCGS.

We adapt the Upper Confidence Bound (UCB) algorithm \[[1](https://arxiv.org/html/2606.26728#bib.bib67),[27](https://arxiv.org/html/2606.26728#bib.bib36),[28](https://arxiv.org/html/2606.26728#bib.bib37)\] to our design graph. Each design node $j$ is scored by

$$\text{UCB}_j = \underbrace{r_j}_{\text{exploitation}} + \underbrace{c \cdot \frac{\sqrt{N_{\text{total}}}}{1 + n_j}}_{\text{exploration}}, \tag{1}$$

where $r_j \in [0, 1]$ is the normalized rank score from the consensus objective (higher is better), $n_j$ is the visit count for design $j$, $N_{\text{total}}$ is the sum of all visit counts, and $c = 0.1$ is the exploration constant. The exploitation term $r_j$ is obtained from the consensus score $C_j$ by rank-based normalization, $r_j = 1 - (\mathrm{rank}(C_j) - 1)/(n - 1)$, so that the best design has $r_j = 1$.
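As an illustration of Eq. (1), the following sketch scores a small population. It assumes ties in the consensus score are broken by list order, which the text does not specify; the function name `ucb_scores` and the sample numbers are hypothetical.

```python
import math


def ucb_scores(consensus, visits, c=0.1):
    """UCB_j = r_j + c * sqrt(N_total) / (1 + n_j), following Eq. (1).

    consensus[j] is C_j (lower is better); visits[j] is n_j.
    Ties in C_j are broken by list order (an assumption of this sketch).
    """
    n = len(consensus)
    order = sorted(range(n), key=lambda j: consensus[j])
    r = [1.0] * n
    for position, j in enumerate(order):          # position = rank(C_j) - 1
        if n > 1:
            r[j] = 1.0 - position / (n - 1)
    n_total = sum(visits)
    return [r[j] + c * math.sqrt(n_total) / (1.0 + visits[j]) for j in range(n)]


# Made-up example: design 0 ranks best but has already been referenced heavily.
scores = ucb_scores(consensus=[0.1, 0.5, 0.9], visits=[9.0, 0.0, 0.0])
print(scores)
```

With $c = 0.1$ the exploration bonus is small relative to the rank term, so in this example the heavily visited best design still scores highest while unvisited designs receive a modest boost.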

Visit counts measure how thoroughly a lineage has been explored. Our definition is different from standard MCTS \[[27](https://arxiv.org/html/2606.26728#bib.bib36),[28](https://arxiv.org/html/2606.26728#bib.bib37)\] to accommodate the graph structure, variable parent influences, and the fact that any node can be expanded. When a design is created, the designer agent self-reports how much each parent contributed (weights summing to 1). These weighted counts propagate upward through the genealogy by breadth-first search with a depth-increasing decay ($\kappa = 0.9$), concentrating credit near the immediate parents. Designs referenced often as parents accumulate high visit counts, shrinking their exploration bonus and moving the planner toward less-explored but potentially rewarding parts of the design space.

As the graph is rebuilt every iteration from the current consensus objective, rankings always reflect the latest evaluation criteria\. The planner agent picks high\-UCB designs as starting points for new research directions\.

### 2.4 Application: Algorithm Discovery for Combinatorial Optimization

![Refer to caption](https://arxiv.org/html/2606.26728v1/x3.png)

Figure 3: Algorithm discovery for 3-SAT DMM solvers. (a) Design genealogy graph. Nodes are solver designs colored by ID (chronological); larger nodes rank higher under the consensus. Directed edges represent the reference weights. Sub-tree structures arise from merging several exploratory workspaces, with diverse lineages converging toward the best design 340. (b) Scaling of median solution steps with problem size $N$. The baseline DMM solver scales as $N^{2.51 \pm 0.06}$; the best solver (design 340) scales as $N^{1.33 \pm 0.07}$, a $\sim 67\times$ speedup at $N = 1810$ (95,503 vs. 6,369,516 median steps). Each median is computed over 100 random 3-SAT instances at fixed $N$; reported uncertainties on the exponents and the shaded $1\sigma$ envelopes around the fit lines come from the log-log linear-regression covariance. Background points cover all 414 designs, colored by ID to show progressive improvement across the search. Note that earlier designs frequently perform better on smaller $N$, but later designs achieve much stronger performance on larger $N$.

We test our framework on a challenging problem: designing efficient solvers for planted random 3-SAT instances (instances from \[[2](https://arxiv.org/html/2606.26728#bib.bib62)\]) near the complexity peak. The baseline digital MemComputing machine (DMM) solver \[[3](https://arxiv.org/html/2606.26728#bib.bib4),[9](https://arxiv.org/html/2606.26728#bib.bib64)\] is a physics-inspired dynamical system, whose fixed points encode satisfying assignments of the Boolean formula (see Sec. [4.5](https://arxiv.org/html/2606.26728#S4.SS5)). A key challenge is that solver performance at certain sizes does not necessarily predict scaling: a solver that looks fast at $N \lesssim 100$ variables may scale poorly to $N > 1000$, while benchmarking at large $N$ can be prohibitively expensive. This makes the choice of the evaluation criteria especially important. The seven baseline parameters are the hand-tuned defaults from prior empirical studies of the DMM dynamics.

Our system operated across several individual workspaces, producing 414 distinct solver designs evaluated by 42 co-evolving objectives. Roughly the first 200 designs came from the top performers and their full genealogies in separate exploratory workspaces; these were then merged into a single workspace where roughly 200 more designs were generated (see SI Sec. [S3.3](https://arxiv.org/html/2606.26728#A3.SS3) for details). The merging produced the sub-tree structures in Fig. [3](https://arxiv.org/html/2606.26728#S2.F3)(a), with independent lineages eventually converging toward the best design. Over the course of the search, objectives grew steadily more sophisticated, from simple power-law fits early on to schedule-faithful reach metrics that emphasize the largest solvable problem size, penalize budget cliffs, and analyze tail scaling.

Fig. [3](https://arxiv.org/html/2606.26728#S2.F3)(b) plots the scaling of median steps-to-solution vs. the problem size $N$ for all designs. The baseline DMM solver \[[3](https://arxiv.org/html/2606.26728#bib.bib4)\] scales as $N^{2.51 \pm 0.06}$. Under the regular schedule the baseline cannot clear $N = 1810$; we re-evaluated it in a separate run with an enlarged step budget, obtaining 6,369,516 median steps at clause-to-variable ratio 4.3. Design 340, the best solver found, scales as $N^{1.33 \pm 0.07}$ and reaches $N = 1810$ at 95,503 median steps under the regular schedule, a $\sim 67\times$ speedup. All 414 designs are plotted in the background, colored by design ID. Earlier solvers often perform better at smaller $N$, but their performance degrades more steeply with increasing problem size. In SI Sec. [S1](https://arxiv.org/html/2606.26728#A1) we provide a detailed analysis of the best solver’s architecture, genealogy, and scaling behavior.
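The reported speedup follows directly from the two median step counts, and a scaling exponent of this kind is the slope of a straight-line fit in log-log space. The sketch below checks the first and illustrates the second on synthetic data; the helper names and the synthetic sizes are assumptions of this example, and the fit omits the covariance-based uncertainties reported in the paper.

```python
import math


def speedup(baseline_steps, new_steps):
    """Ratio of median steps-to-solution, baseline over new solver."""
    return baseline_steps / new_steps


def loglog_slope(sizes, steps):
    """Least-squares slope of log(steps) against log(sizes)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(s) for s in steps]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Median step counts at N = 1810 as reported in the text.
ratio = speedup(6_369_516, 95_503)
print(round(ratio, 1))  # 66.7

# Synthetic, noise-free power law (not the paper's measurements).
sizes = [100, 200, 400, 800]
steps = [n ** 1.33 for n in sizes]
print(loglog_slope(sizes, steps))
```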

The multi\-fidelity schedule was essential for efficiency\. Every design was first evaluated at low fidelity to filter out poor candidates quickly\. Fidelity promotion was rule\-based: the top 50% of designs \(by consensus ranking\) advance to medium fidelity, then the top 10% to high fidelity\. We also tried an LLM\-based judge for promotion decisions but found the simpler rules more reliable \(SI Sec\.[S3\.2](https://arxiv.org/html/2606.26728#A3.SS2)\)\.
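The promotion rule can be sketched as below. The text does not say whether the top 10% is taken over the whole population or only over the medium-fidelity set, nor how fractional counts are rounded; this sketch assumes the whole population and rounds up. The function name and the example scores are hypothetical.

```python
import math


def promotion_sets(consensus, medium_frac=0.5, high_frac=0.1):
    """Return (medium_ids, high_ids) from {design_id: C_j}, lower C_j is better.

    Assumptions of this sketch: both fractions refer to the full current
    population, and fractional counts are rounded up.
    """
    ranked = sorted(consensus, key=consensus.get)
    n_medium = math.ceil(medium_frac * len(ranked))
    n_high = math.ceil(high_frac * len(ranked))
    return ranked[:n_medium], ranked[:n_high]


# Made-up consensus scores for ten designs.
population = {f"design_{i}": i / 10 for i in range(10)}
medium, high = promotion_sets(population)
print(medium)
print(high)
```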

## 3 Discussion

We have demonstrated that treating scientific discovery as meta\-optimization by co\-evolving solutions and evaluation criteria can produce significant improvements over fixed\-objective optimization\. Consensus objective aggregation offers a rigorous solution to the objective specification and the reward hacking problem faced by automated research systems, and the objective portfolio adapts as the research matures\.

This consensus mechanism plays a key role in our 3\-SAT DMM case study\. Early objectives, built around power\-law extrapolation, provided useful initial direction but were gradually replaced by metrics that assessed scaling at larger problem sizes \(see SI Sec\.[S2](https://arxiv.org/html/2606.26728#A2)\)\. As new objectives entered and old ones lost weight through age decay, the consensus ranking evolved gradually, making the research process stable and consistent\.

Still, several limitations are worth addressing\. Because all objectives come from the same LLM backbone, they tend to share assumptions, and agreement\-based weighting can reinforce those shared biases instead of filtering noise, causing an “echo chamber” effect\. During research, we observed early objectives converged on similar power\-law extrapolation metrics until the meta\-agent stepped in to suppress redundant ones and steer later generations toward greater diversity\. The meta\-agent’s weight multipliers offer a top\-down correction to the agreement signal \(Sec\.[2\.2](https://arxiv.org/html/2606.26728#S2.SS2)\), but more principled diversity\-promoting mechanisms would further strengthen the guard against collective blind spots\.

There is also an inherent tension in how much autonomy the system should have\. We opted for a structured design: fixed agent roles, predefined workflow stages, and guardrails constraining what each agent can touch\. The result is predictable and safe, though less flexible than agentic systems with full authorization in a sandbox environment\. Our preliminary experiments with fewer restrictions revealed a few failure modes, including emergent goal drift, “cheating” by modifying evaluation scripts, and cherry\-picking favorable results\. How to balance the safety of structured workflows against the creative reach of unconstrained agents remains an important challenge\.

Two further limitations are scope\-related\. The reported scaling exponent comes from a single end\-to\-end search; quantifying run\-to\-run variance would require repeating the full pipeline, which is prohibitive at the present compute budget\. The case study also targets a narrow problem \(planted 3\-SAT near the complexity peak\); we are extending the framework into a general automated\-research pipeline applicable across domains\.

##### Broader impacts\.

As a problem\-agnostic framework, the consensus\-driven meta\-optimization approach can accelerate scientific discovery beyond combinatorial optimization\. Risks include amplifying biases inherited from foundation models or embedding reward\-hacking patterns into deployed algorithms; consensus aggregation mitigates single\-objective gaming but does not eliminate misuse risk\.

## 4 Methods

### 4.1 Agent Architecture

All four agents run on GPT 5.2, with structured outputs enforced through Pydantic schemas for reliable parsing. The planner takes in the MCGS-ranked design list together with a summary of the full experiment history (genealogies, performance metrics, failure patterns) and returns $N$ strategic research directions, each specifying reference design IDs and a natural-language description of the intended modification. The designer receives one direction, writes Python solver code, and launches multi-fidelity experiments. Every 20 designs, the meta-agent computes Kendall’s $\tau$ correlations between all objectives, assigns weight multipliers, identifies the current research phase, and issues strategic guidance to the objective agent for subsequent rounds. The objective agent then produces a Python function, mapping experiment results to a scalar score. A HEBO \[[6](https://arxiv.org/html/2606.26728#bib.bib63)\] hyperparameter optimization module runs 50 iterations on the best untuned design whenever the tuned fraction drops below 5%, refining continuous hyperparameters without changing the solver architecture. The methodology evolution of the framework is documented in the SI Sec. [S3](https://arxiv.org/html/2606.26728#A3), and the modularized solver framework is documented in the SI Sec. [S4](https://arxiv.org/html/2606.26728#A4). All agent conversations are logged for reproducibility, and complete prompt templates can be found in the SI Sec. [S5](https://arxiv.org/html/2606.26728#A5).

### 4.2 Consensus Objective Algorithm

Given $K$ objective functions $\{f_1, \ldots, f_K\}$ and $n$ designs $\{d_1, \ldots, d_n\}$ with experiment results, the consensus objective is computed as follows:

1. Score matrix: $S_{ij} = f_i(d_j)$ for each objective $f_i$ and design $d_j$. Invalid or missing scores are set to $+\infty$.
2. Rank matrix: For each objective $i$, designs are sorted by $S_{ij}$ and assigned ranks $R_{ij}$, with tied designs receiving the average of the ranks they span.
3. Kendall’s $\tau$ matrix: For each pair $(i, k)$, compute $\tau_{ik} = \text{KendallTau}(\{R_{ij}\}_j, \{R_{kj}\}_j)$.
4. Objective weights:

   $$w_i = \frac{\max\!\big(\widetilde{\tau}_i, 0\big) \cdot \lambda^{t - t_i}}{\sum_{k=1}^{K} \max\!\big(\widetilde{\tau}_k, 0\big) \cdot \lambda^{t - t_k}}, \tag{2}$$

   where $\widetilde{\tau}_i = \text{median}_{k \neq i}\, \tau_{ik}$ is the median pairwise correlation, $\lambda = 0.9$ is the age decay base, $t$ is the current round, and $t_i$ is the creation round of objective $i$. If the denominator vanishes, weights default to uniform.
5. Meta-agent adjustment (optional): If meta-agent multipliers $\{m_i\}$ are provided, weights are updated as $w_i' = w_i \cdot m_i / \sum_k w_k \cdot m_k$.
6. Consensus score:

   $$C_j = \sum_{i=1}^{K} w_i \cdot \frac{R_{ij}}{n - 1}, \tag{3}$$

   where $C_j \in [0, 1]$ with lower values indicating better designs. This is a weighted Borda count \[[12](https://arxiv.org/html/2606.26728#bib.bib12)\] that aggregates normalized ranks across all weighted objectives.

For designs evaluated on\-the\-fly during the search and not yet stored in the database, normalized ranks are estimated by binary search into each objective’s sorted score distribution\.
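The six steps above, together with the on-the-fly rank estimate, can be sketched in plain Python. This is a minimal illustration rather than the authors’ implementation: the function names, the zero-based ranks (chosen so that $C_j$ stays in $[0,1]$), the tau-b tie correction, the weight of 1 for a lone objective, and the clipping in `estimate_normalized_rank` are all assumptions. Invalid scores are expected to be mapped to `math.inf` by the caller (step 1), and at least two designs are assumed.

```python
import math
from bisect import bisect_left
from statistics import median


def average_ranks(scores):
    """Ascending ranks (0 = best); tied scores share the average rank they span."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2.0
        start = end + 1
    return ranks


def kendall_tau(a, b):
    """Kendall's tau-b between two equal-length rank lists (O(n^2) pair scan)."""
    conc = disc = tied_a = tied_b = 0
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            da, db = a[i] - a[j], b[i] - b[j]
            tied_a += da == 0
            tied_b += db == 0
            if da * db > 0:
                conc += 1
            elif da * db < 0:
                disc += 1
    pairs = n * (n - 1) // 2
    denom = math.sqrt((pairs - tied_a) * (pairs - tied_b))
    return (conc - disc) / denom if denom > 0 else 0.0


def consensus_scores(scores, created, now, lam=0.9, multipliers=None):
    """scores[i][j] = f_i(d_j); created[i] = creation round of objective i."""
    K, n = len(scores), len(scores[0])
    ranks = [average_ranks(row) for row in scores]                  # step 2
    raw = []
    for i in range(K):
        taus = [kendall_tau(ranks[i], ranks[k]) for k in range(K) if k != i]  # step 3
        tau_med = median(taus) if taus else 1.0  # lone objective: assumed weight 1
        raw.append(max(tau_med, 0.0) * lam ** (now - created[i]))   # Eq. (2) numerator
    total = sum(raw)
    weights = [r / total for r in raw] if total > 0 else [1.0 / K] * K
    if multipliers is not None:                                     # step 5
        adjusted = [w * m for w, m in zip(weights, multipliers)]
        norm = sum(adjusted)
        if norm > 0:
            weights = [a / norm for a in adjusted]
    consensus = [sum(weights[i] * ranks[i][j] / (n - 1) for i in range(K))
                 for j in range(n)]                                 # Eq. (3)
    return consensus, weights


def estimate_normalized_rank(sorted_scores, new_score):
    """Approximate normalized rank of an unstored design via binary search."""
    position = bisect_left(sorted_scores, new_score)
    return min(position / max(len(sorted_scores) - 1, 1), 1.0)
```

With three objectives that agree on the ordering and one that reverses it, the reversed objective has a negative median correlation and therefore receives zero weight, which is the self-correcting behavior the weighting is meant to provide.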

### 4.3 Monte Carlo Graph Search

We adapt Monte Carlo Graph Search (MCGS) \[[7](https://arxiv.org/html/2606.26728#bib.bib8)\] from board games to the design space. Each design is a node in a directed acyclic graph whose edges are defined by the reference weights in the design’s genealogy: when the designer creates design $d_j$ from parents $p_1,\ldots,p_m$ with weights $\omega_1,\ldots,\omega_m$ (self-reported, normalized to sum to 1), edges $(p_k\to d_j)$ are added for every $k\in\{1,\ldots,m\}$.

Visit counts $n_j$ for each design $d_j$ propagate upward by breadth-first search. Creating design $d_j$ with reference weight $\omega_k$ to parent $p_k$ adds $\omega_k$ to $p_k$’s visit count $n_k$. Propagation continues recursively, each hop from depth $d$ to $d+1$ attenuated by the ancestor’s edge weight times $\kappa^{d+1}$ ($\kappa=0.9$); deeper ancestors therefore receive rapidly diminishing credit. Propagation halts once contributions drop below $10^{-4}$.
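One way to read the propagation rule is sketched below. The depth convention is an assumption: direct parents are taken to sit at depth 1 and receive the unattenuated weight $\omega_k$, and each further hop from depth $d$ to $d+1$ multiplies the contribution by the edge weight and by $\kappa^{d+1}$. The data layout (`parents` as a dictionary of `(parent, weight)` lists) and the function name are hypothetical.

```python
from collections import deque


def propagate_visits(parents, new_design, visits, kappa=0.9, cutoff=1e-4):
    """Add visit credit from a newly created design to its ancestors (BFS).

    parents[d] is a list of (parent_id, reference_weight) pairs for design d.
    """
    queue = deque((p, w, 1) for p, w in parents.get(new_design, []))
    while queue:
        node, contrib, depth = queue.popleft()
        if contrib < cutoff:          # stop once the contribution is negligible
            continue
        visits[node] = visits.get(node, 0.0) + contrib
        for grandparent, w in parents.get(node, []):
            # hop from depth d to d+1: edge weight times kappa^(d+1)
            queue.append((grandparent, contrib * w * kappa ** (depth + 1), depth + 1))
    return visits
```

For a simple chain a → b → c with unit weights, creating c credits b with 1.0 and a with $0.9^2=0.81$ under this convention.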

The UCB score (Eq. ([1](https://arxiv.org/html/2606.26728#S2.E1))) combines the exploitation term $r_j$ (normalized consensus rank; the best design scores 1.0) and the exploration term $c\cdot\frac{\sqrt{N_{\text{total}}}}{1+n_j}$, with $N_{\text{total}}=\sum_j n_j$ and $c=0.1$ setting the balance. Rebuilding the graph each iteration from the current consensus objective keeps rankings updated with the latest evaluation criteria.
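A short sketch of the selection score follows. It assumes that Eq. (1) is the sum of the two terms named above, which is the usual UCB form; the function names are hypothetical.

```python
import math


def ucb_scores(exploit, visits, c=0.1):
    """exploit[j] = r_j in [0, 1] (1.0 = best design); visits[j] = n_j."""
    n_total = sum(visits.values())
    return {j: r + c * math.sqrt(n_total) / (1.0 + visits.get(j, 0.0))
            for j, r in exploit.items()}


def select_reference(exploit, visits, c=0.1):
    """Pick the design with the highest UCB score."""
    scores = ucb_scores(exploit, visits, c)
    return max(scores, key=scores.get)
```

With `exploit = {"a": 1.0, "b": 0.5, "c": 0.95}` and `visits = {"a": 3.0, "b": 1.0, "c": 0.0}`, the unvisited design c overtakes the top-ranked design a, which illustrates how the exploration term steers effort toward under-explored branches.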

### 4.4 Multi-Fidelity Schedule

Computational resources are allocated progressively across three fidelity tiers. Problem sizes follow a geometric progression $N_k=\mathrm{round}(N_0\cdot 2^{k/2})$ with $N_0=10$ and $k=0,1,2,\ldots$, yielding the sequence $N\in\{10,14,20,28,40,\ldots\}$. The solver step budget (max\_steps) starts at 50 and doubles at each level until it reaches the per-fidelity cap ($10^4$ low, $10^5$ medium, $10^6$ high).

- Low fidelity: $N\leq 640$, max\_steps up to $10^4$, timeout after 1 minute. Fast initial screening.
- Medium fidelity: $N\leq 1280$, max\_steps up to $10^5$, timeout after 5 minutes. Moderate evaluation for promising candidates.
- High fidelity: $N\leq 2560$, max\_steps up to $10^6$, timeout after 30 minutes. Thorough benchmark of the best designs.
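The size progression and step budget can be reproduced with a few lines of Python. The sketch assumes that "level" refers to the size index $k$ and that the timeout is expressed in seconds; the names `problem_sizes`, `step_budget`, and `FIDELITY` are hypothetical.

```python
def problem_sizes(n_max, n0=10):
    """Sizes N_k = round(n0 * 2^(k/2)) for k = 0, 1, 2, ... up to n_max."""
    sizes, k = [], 0
    while True:
        n = round(n0 * 2 ** (k / 2))
        if n > n_max:
            return sizes
        sizes.append(n)
        k += 1


def step_budget(level, cap, start=50):
    """max_steps starts at `start` and doubles per level until it hits `cap`."""
    return min(start * 2 ** level, cap)


# tier -> (largest N, max_steps cap, timeout in seconds)
FIDELITY = {
    "low": (640, 10**4, 60),
    "medium": (1280, 10**5, 300),
    "high": (2560, 10**6, 1800),
}

if __name__ == "__main__":
    for tier, (n_max, cap, timeout) in FIDELITY.items():
        sizes = problem_sizes(n_max)
        print(tier, sizes[-1], step_budget(len(sizes) - 1, cap), timeout)
```

Under these assumptions the high-fidelity tier covers 17 sizes, including $N=1810$, the size at which the speedup is reported in the supplementary analysis.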

At every level, the solver is run on 100 random planted 3-SAT instances \[[2](https://arxiv.org/html/2606.26728#bib.bib62)\] per problem size at clause-to-variable ratio $\alpha_r=4.3$ (near the complexity peak \[[16](https://arxiv.org/html/2606.26728#bib.bib66)\]). A problem size is considered solved if at least half of the instances finish within the step budget. Fidelity promotion is rule-based: designs in the top 50% by consensus advance to medium fidelity; those in the top 10% further advance to high fidelity. The scheme allocates computational effort toward the strongest candidates and filters out weak designs early.
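The solved criterion and the promotion rule can be written down directly. In this sketch the rounding of the 50% and 10% cutoffs (ceiling) is an assumption, as are the function names; lower consensus scores are treated as better, as in Eq. (3).

```python
import math


def size_solved(finished_flags):
    """A problem size counts as solved if at least half of its instances finish."""
    return 2 * sum(finished_flags) >= len(finished_flags)


def assign_fidelity(consensus):
    """Map design id -> highest tier reached, given consensus scores (lower = better)."""
    ordered = sorted(consensus, key=consensus.get)
    n_medium = math.ceil(0.5 * len(ordered))   # top 50% advance to medium
    n_high = math.ceil(0.1 * len(ordered))     # top 10% advance further to high
    tiers = {}
    for position, design in enumerate(ordered):
        if position < n_high:
            tiers[design] = "high"
        elif position < n_medium:
            tiers[design] = "medium"
        else:
            tiers[design] = "low"
    return tiers
```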

### 4.5 Baseline 3-SAT DMM Algorithm

A random 3-SAT instance has $N$ Boolean variables $V_i\in\{0,1\}$ and $M=\lfloor\alpha_r N\rfloor$ clauses, each being the disjunction of three literals. The baseline DMM algorithm relaxes every variable to a continuous value $v_i\in[-1,1]$ and introduces auxiliary memory variables: a long-term clause penalty weight $x_{l,m}\in[1,10^6]$ that grows for persistently violated clauses, and a short-term satisfaction switch $x_{s,m}\in[0,1]$ toggling between push and hold modes \[[3](https://arxiv.org/html/2606.26728#bib.bib4)\]. The coupled ordinary differential equations governing the dynamics are:

$$\dot{v}_n=\sum_{m=1}^{M}\Bigl[\,x_{l,m}\,x_{s,m}\,G_{n,m}+(1+\zeta\,x_{l,m})(1-x_{s,m})\,R_{n,m}\,\Bigr],\tag{4}$$

$$\dot{x}_{s,m}=\beta\,(x_{s,m}+\epsilon)\,(c_m-\gamma),\tag{5}$$

$$\dot{x}_{l,m}=\alpha\,(c_m-\delta).\tag{6}$$

The literal polarity $q_{n,m}=\pm 1$ indicates whether variable $n$ appears positive or negated in clause $m$. The clause satisfaction monitor,

$$c_m(v_i,v_j,v_k)=\tfrac{1}{2}\min\bigl[(1-q_{i,m}\,v_i),\;(1-q_{j,m}\,v_j),\;(1-q_{k,m}\,v_k)\bigr],\tag{7}$$

ranges between 0 (satisfied) and 1 (violated). $\alpha,\beta,\gamma,\delta,\epsilon,\zeta$, and the integration step size $\Delta t_0$ are hyperparameters \[[3](https://arxiv.org/html/2606.26728#bib.bib4)\]. The gradient term

$$G_{n,m}=\tfrac{1}{2}\,q_{n,m}\min\!\bigl[(1-q_{j,m}\,v_j),\;(1-q_{k,m}\,v_k)\bigr],\tag{8}$$

pushes variable $v_n$ toward its satisfying polarity, with a magnitude set by how far the other two literals $j,k$ in the clause are from being satisfied, so the push vanishes once either of them already satisfies the clause. The rigidity term

$$R_{n,m}=\begin{cases}\tfrac{1}{2}(q_{n,m}-v_n),&\text{if }c_m=\tfrac{1}{2}(1-q_{n,m}\,v_n),\\ 0,&\text{otherwise},\end{cases}\tag{9}$$

fires only when variable $n$ is the most satisfied literal in clause $m$, pulling it back toward its satisfying value to keep the clause from flipping.
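To make the roles of Eqs. (4)-(9) concrete, the following pure-Python sketch evaluates the right-hand sides for a small instance. It is an illustration only, not the released solver: the clause encoding (a list of `(variable_index, polarity)` pairs) and the function name are assumptions, and the hyperparameter values in the usage example are placeholders chosen for easy arithmetic.

```python
def dmm_derivatives(clauses, v, xs, xl, alpha, beta, gamma, delta, eps, zeta):
    """Right-hand sides of Eqs. (4)-(6) for the baseline DMM.

    clauses[m] is a list of three (variable_index, polarity) pairs, polarity = +1 or -1.
    """
    dv = [0.0] * len(v)
    dxs = [0.0] * len(clauses)
    dxl = [0.0] * len(clauses)
    for m, clause in enumerate(clauses):
        unsat = [1.0 - q * v[n] for n, q in clause]   # (1 - q v) for each literal
        least = min(unsat)
        c_m = 0.5 * least                              # Eq. (7)
        for pos, (n, q) in enumerate(clause):
            others = unsat[:pos] + unsat[pos + 1:]
            grad = 0.5 * q * min(others)               # Eq. (8)
            # Eq. (9): nonzero only for the literal(s) attaining the minimum
            rigid = 0.5 * (q - v[n]) if unsat[pos] == least else 0.0
            dv[n] += (xl[m] * xs[m] * grad
                      + (1.0 + zeta * xl[m]) * (1.0 - xs[m]) * rigid)   # Eq. (4)
        dxs[m] = beta * (xs[m] + eps) * (c_m - gamma)  # Eq. (5)
        dxl[m] = alpha * (c_m - delta)                 # Eq. (6)
    return dv, dxs, dxl


if __name__ == "__main__":
    # Single clause (x0 OR NOT x1 OR x2); hyperparameter values are placeholders.
    clauses = [[(0, +1), (1, -1), (2, +1)]]
    print(dmm_derivatives(clauses, v=[0.5, 0.5, -0.5], xs=[0.5], xl=[1.0],
                          alpha=5.0, beta=20.0, gamma=0.25, delta=0.05,
                          eps=1e-3, zeta=0.1))
```

In this example variable 0 is the most satisfied literal, so it alone receives the rigidity contribution, while the negated variable 1 is pushed toward $-1$ by the gradient term.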

Finally, the numerical integration time step $\Delta t$ is rescaled by the inverse of the maximum absolute $v$-derivative:

$$\Delta t=\frac{1}{\max_n|\dot{v}_n|}\,\Delta t_0,\tag{10}$$

with the denominator lower-bounded at $10^{-6}$ for numerical stability. In practice, $\max_n|\dot{v}_n|$ will not reach 0 unless the system converges exactly at the solution point (which is a property of DMMs \[[9](https://arxiv.org/html/2606.26728#bib.bib64),[3](https://arxiv.org/html/2606.26728#bib.bib4)\]), and this rescaling constitutes an adaptive time step, stabilizing the numerical integration.
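The adaptive step of Eq. (10) is a one-liner; the sketch below pairs it with a clipped forward-Euler update. The choice of forward Euler and the clipping of each variable to its stated range are assumptions made for illustration, and the helper names are hypothetical.

```python
def adaptive_dt(dv, dt0, floor=1e-6):
    """Eq. (10): dt = dt0 / max_n |dv_n|, with the denominator floored at 1e-6."""
    return dt0 / max(max(abs(d) for d in dv), floor)


def euler_update(values, derivs, dt, lo, hi):
    """One forward-Euler step (assumed integrator), clipped to [lo, hi]."""
    return [max(lo, min(hi, x + dt * d)) for x, d in zip(values, derivs)]
```

Because $\Delta t\cdot\max_n|\dot{v}_n|=\Delta t_0$ whenever the floor is inactive, no variable $v_n$ moves by more than $\Delta t_0$ in a single step under this update.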

##### Data and code availability\.

The complete framework code (planner, designer, objective, and meta agents; consensus aggregator; Monte Carlo Graph Search; multi-fidelity schedule; baseline and best 3-SAT DMM solver), the database of all 414 explored designs with experiment results, and the 42 evolving objective functions are available at [https://github\.com/yuanhangzhang98/LLM\_meta\_optimization](https://github.com/yuanhangzhang98/LLM_meta_optimization) under the MIT license. A companion Claude Code skill packaging the same algorithm into a one-command installable form is available at [https://github\.com/yuanhangzhang98/meta\-discovery](https://github.com/yuanhangzhang98/meta-discovery).

## Acknowledgments and Disclosure of Funding

Author Contributions. Y.-H.Z. suggested and M.D. supervised the work. Y.-H.Z. designed and performed the numerical simulations. All authors contributed to the scientific discussion, and have read and approved the final manuscript.

Funding. This work was funded by the National Science Foundation via grant No. ECCS-2229880. M.D. also acknowledges funding by the Alexander von Humboldt Stiftung through the Humboldt Research Award.

## References

- \[1\]\(2002\-05\)Finite\-time Analysis of the Multiarmed Bandit Problem\.Machine Learning47\(2\),pp\. 235–256\.External Links:ISSN 1573\-0565,[Document](https://dx.doi.org/10.1023/A%3A1013689704352)Cited by:[§2\.3](https://arxiv.org/html/2606.26728#S2.SS3.p1.1),[§2\.3](https://arxiv.org/html/2606.26728#S2.SS3.p2.1)\.
- \[2\]W\. Barthel, A\. K\. Hartmann, M\. Leone, F\. Ricci\-Tersenghi, M\. Weigt, and R\. Zecchina\(2002\-04\)Hiding Solutions in Random Satisfiability Problems: A Statistical Mechanics Approach\.Physical Review Letters88\(18\),pp\. 188701\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevLett.88.188701)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p7.3),[§2\.4](https://arxiv.org/html/2606.26728#S2.SS4.p1.3),[§4\.4](https://arxiv.org/html/2606.26728#S4.SS4.p3.1)\.
- \[3\]S\. R\. B\. Bearden, Y\. R\. Pei, and M\. Di Ventra\(2020\-11\)Efficient solution of Boolean satisfiability problems with digital memcomputing\.Scientific Reports10\(1\),pp\. 19741\.External Links:ISSN 2045\-2322,[Document](https://dx.doi.org/10.1038/s41598-020-76666-2)Cited by:[§S4\.1](https://arxiv.org/html/2606.26728#A4.SS1.SSS0.Px1.p1.9),[§1](https://arxiv.org/html/2606.26728#S1.p7.3),[§2\.4](https://arxiv.org/html/2606.26728#S2.SS4.p1.3),[§2\.4](https://arxiv.org/html/2606.26728#S2.SS4.p3.9),[§4\.5](https://arxiv.org/html/2606.26728#S4.SS5.p1.6),[§4\.5](https://arxiv.org/html/2606.26728#S4.SS5.p2.5),[§4\.5](https://arxiv.org/html/2606.26728#S4.SS5.p3.4)\.
- \[4\]O\. Berger\-Tal, J\. Nathan, E\. Meron, and D\. Saltz\(2014\-04\)The Exploration\-Exploitation Dilemma: A Multidisciplinary Framework\.PLOS ONE9\(4\),pp\. e95693\.External Links:ISSN 1932\-6203,[Document](https://dx.doi.org/10.1371/journal.pone.0095693)Cited by:[§2\.3](https://arxiv.org/html/2606.26728#S2.SS3.p1.1)\.
- \[5\]D\. A\. Boiko, R\. MacKnight, B\. Kline, and G\. Gomes\(2023\-12\)Autonomous chemical research with large language models\.Nature624\(7992\),pp\. 570–578\.External Links:ISSN 0028\-0836, 1476\-4687,[Document](https://dx.doi.org/10.1038/s41586-023-06792-0)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p1.1),[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[6\]A\. Cowen\-Rivers, W\. Lyu, R\. Tutunov, Z\. Wang, A\. Grosnit, R\. Griffiths, A\. Maravel, J\. Hao, J\. Wang, J\. Peters, and H\. Bou Ammar\(2022\-07\)HEBO: pushing the limits of sample\-efficient hyperparameter optimisation\.Journal of Artificial Intelligence Research74,pp\.\.Cited by:[§S1\.3](https://arxiv.org/html/2606.26728#A1.SS3.p3.1),[item 2](https://arxiv.org/html/2606.26728#A4.I1.i2.p1.1),[§2\.1](https://arxiv.org/html/2606.26728#S2.SS1.p5.1),[§4\.1](https://arxiv.org/html/2606.26728#S4.SS1.p1.2)\.
- \[7\]J\. Czech, P\. Korus, and K\. Kersting\(2021\-05\)Improving AlphaZero Using Monte\-Carlo Graph Search\.Proceedings of the International Conference on Automated Planning and Scheduling31,pp\. 103–111\.External Links:ISSN 2334\-0843, 2334\-0835,[Document](https://dx.doi.org/10.1609/icaps.v31i1.15952)Cited by:[§2\.3](https://arxiv.org/html/2606.26728#S2.SS3.p1.1),[§4\.3](https://arxiv.org/html/2606.26728#S4.SS3.p1.5)\.
- \[8\]M\. Di Ventra\(2018\)The scientific method: reflections from a practitioner\.Oxford University Press, Oxford\.Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p1.1)\.
- \[9\]M\. Di Ventra\(2022\-02\)MemComputing: Fundamentals and Applications\.Oxford University Press\.External Links:[Document](https://dx.doi.org/10.1093/oso/9780192845320.001.0001),ISBN 978\-0\-19\-284532\-0Cited by:[§S4\.1](https://arxiv.org/html/2606.26728#A4.SS1.SSS0.Px1.p1.9),[§1](https://arxiv.org/html/2606.26728#S1.p7.3),[§2\.4](https://arxiv.org/html/2606.26728#S2.SS4.p1.3),[§4\.5](https://arxiv.org/html/2606.26728#S4.SS5.p3.4)\.
- \[10\]Y\. Du, B\. Yu, T\. Liu, T\. Shen, J\. Chen, J\. G\. Rittig, K\. Sun, Y\. Zhang, Z\. Song, B\. Zhou, C\. Masschelein, Y\. Wang, H\. Wang, H\. Jia, C\. Zhang, H\. Zhao, M\. Ester, T\. Head\-Gordon, C\. P\. Gomes, H\. Sun, C\. Duan, P\. Schwaller, and W\. Jin\(2025\-12\)Accelerating Scientific Discovery with Autonomous Goal\-evolving Agents\.arXiv\.Note:arXiv preprint arXiv:2512\.21782External Links:2512\.21782,[Document](https://dx.doi.org/10.48550/arXiv.2512.21782)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p3.1)\.
- \[11\]S\. Eger, Y\. Cao, J\. D’Souza, A\. Geiger, C\. Greisinger, S\. Gross, Y\. Hou, B\. Krenn, A\. Lauscher, Y\. Li, C\. Lin, N\. S\. Moosavi, W\. Zhao, and T\. Miller\(2025\-04\)Transforming Science with Large Language Models: A Survey on AI\-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation\.arXiv\.Note:arXiv preprint arXiv:2502\.05151External Links:2502\.05151,[Document](https://dx.doi.org/10.48550/arXiv.2502.05151)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[12\]P\. Emerson\(2013\-02\)The original Borda count and partial voting\.Social Choice and Welfare40\(2\),pp\. 353–358\.External Links:ISSN 1432\-217X,[Document](https://dx.doi.org/10.1007/s00355-011-0603-9)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p5.1),[item 5](https://arxiv.org/html/2606.26728#S2.I1.i5.p1.5),[item 6](https://arxiv.org/html/2606.26728#S4.I1.i6.p1.1)\.
- \[13\]C\. A\. E\. Goodhart\(1984\)Problems of Monetary Management: The UK Experience\.InMonetary Theory and Practice: The UK Experience,C\. A\. E\. Goodhart \(Ed\.\),pp\. 91–121\.External Links:[Document](https://dx.doi.org/10.1007/978-1-349-17295-5%5F4),ISBN 978\-1\-349\-17295\-5Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p3.1)\.
- \[14\]J\. Gottweis, W\. Weng, A\. Daryin, T\. Tu, A\. Palepu, P\. Sirkovic, A\. Myaskovsky, F\. Weissenberger, K\. Rong, R\. Tanno, K\. Saab, D\. Popovici, J\. Blum, F\. Zhang, K\. Chou, A\. Hassidim, B\. Gokturk, A\. Vahdat, P\. Kohli, Y\. Matias, A\. Carroll, K\. Kulkarni, N\. Tomasev, Y\. Guan, V\. Dhillon, E\. D\. Vaishnav, B\. Lee, T\. R\. D\. Costa, J\. R\. Penadés, G\. Peltz, Y\. Xu, A\. Pawlosky, A\. Karthikesalingam, and V\. Natarajan\(2025\-02\)Towards an AI co\-scientist\.arXiv\.Note:arXiv preprint arXiv:2502\.18864External Links:2502\.18864,[Document](https://dx.doi.org/10.48550/arXiv.2502.18864)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p1.1),[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[15\]X\. Han, Z\. Gao, P\. Guo, and Z\. Lu\(2025\-08\)PhysAgent: A Multi\-Agent Approach to the Automated Discovery of Physical Laws\.Note:QeiosExternal Links:[Document](https://dx.doi.org/10.32388/J2MXUW)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[16\]A\. K\. Hartmann and H\. Rieger \(Eds\.\)\(2004\)New optimization algorithms in physics\.Wiley\-VCH ; John Wiley,Weinheim : Chichester\.External Links:ISBN 978\-3\-527\-40406\-3,LCCN QC20\.7\.C58 N49 2004Cited by:[§4\.4](https://arxiv.org/html/2606.26728#S4.SS4.p3.1)\.
- \[17\]M\. G\. Kendall\(1938\-06\)A New Measure of Rank Correlation\.Biometrika30\(1\-2\),pp\. 81–93\.External Links:ISSN 0006\-3444,[Document](https://dx.doi.org/10.1093/biomet/30.1-2.81)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p5.1),[item 3](https://arxiv.org/html/2606.26728#S2.I1.i3.p1.5)\.
- \[18\]C\. Lu, C\. Lu, R\. T\. Lange, J\. Foerster, J\. Clune, and D\. Ha\(2024\-09\)The AI Scientist: Towards Fully Automated Open\-Ended Scientific Discovery\.arXiv\.Note:arXiv preprint arXiv:2408\.06292External Links:2408\.06292,[Document](https://dx.doi.org/10.48550/arXiv.2408.06292)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p1.1),[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[19\]R\. Lu, Z\. Shao, Y\. Ding, R\. Chen, D\. Wu, H\. Su, T\. Yang, F\. Zhang, J\. Wang, Y\. Shi, Z\. Jiang, H\. Ding, and H\. Zhang\(2025\-12\)Discovery of the reward function for embodied reinforcement learning agents\.Nature Communications16\(1\),pp\. 11064\.External Links:ISSN 2041\-1723,[Document](https://dx.doi.org/10.1038/s41467-025-66009-y)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p3.1)\.
- \[20\]Z\. Luo, Z\. Yang, Z\. Xu, W\. Yang, and X\. Du\(2025\-01\)LLM4SR: A Survey on Large Language Models for Scientific Research\.arXiv\.Note:arXiv preprint arXiv:2501\.04306External Links:2501\.04306,[Document](https://dx.doi.org/10.48550/arXiv.2501.04306)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[21\]A\. Merchant, S\. Batzner, S\. S\. Schoenholz, M\. Aykol, G\. Cheon, and E\. D\. Cubuk\(2023\-12\)Scaling deep learning for materials discovery\.Nature624\(7990\),pp\. 80–85\.External Links:ISSN 0028\-0836, 1476\-4687,[Document](https://dx.doi.org/10.1038/s41586-023-06735-9)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[22\]L\. Mitchener, A\. Yiu, B\. Chang, M\. Bourdenx, T\. Nadolski, A\. Sulovari, E\. C\. Landsness, D\. L\. Barabasi, S\. Narayanan, N\. Evans, S\. Reddy, M\. Foiani, A\. Kamal, L\. P\. Shriver, F\. Cao, A\. T\. Wassie, J\. M\. Laurent, E\. Melville\-Green, M\. Caldas, A\. Bou, K\. F\. Roberts, S\. Zagorac, T\. C\. Orr, M\. E\. Orr, K\. J\. Zwezdaryk, A\. E\. Ghareeb, L\. McCoy, B\. Gomes, E\. A\. Ashley, K\. E\. Duff, T\. Buonassisi, T\. Rainforth, R\. J\. Bateman, M\. Skarlinski, S\. G\. Rodriques, M\. M\. Hinks, and A\. D\. White\(2025\-11\)Kosmos: An AI Scientist for Autonomous Discovery\.arXiv\.Note:arXiv preprint arXiv:2511\.02824External Links:2511\.02824,[Document](https://dx.doi.org/10.48550/arXiv.2511.02824)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p1.1)\.
- \[23\]A\. Novikov, N\. Vũ, M\. Eisenberger, E\. Dupont, P\. Huang, A\. Z\. Wagner, S\. Shirobokov, B\. Kozlovskii, F\. J\. R\. Ruiz, A\. Mehrabian, M\. P\. Kumar, A\. See, S\. Chaudhuri, G\. Holland, A\. Davies, S\. Nowozin, P\. Kohli, and M\. Balog\(2025\-06\)AlphaEvolve: A coding agent for scientific and algorithmic discovery\.arXiv\.Note:arXiv preprint arXiv:2506\.13131External Links:2506\.13131,[Document](https://dx.doi.org/10.48550/arXiv.2506.13131)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[24\]A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga, A\. Desmaison, A\. Kopf, E\. Yang, Z\. DeVito, M\. Raison, A\. Tejani, S\. Chilamkurthy, B\. Steiner, L\. Fang, J\. Bai, and S\. Chintala\(2019\)PyTorch: An Imperative Style, High\-Performance Deep Learning Library\.InAdvances in Neural Information Processing Systems,Vol\.32\.Cited by:[§S4\.2](https://arxiv.org/html/2606.26728#A4.SS2.SSS0.Px1.p1.2)\.
- \[25\]B\. Peherstorfer, K\. Willcox, and M\. Gunzburger\(2018\-01\)Survey of Multifidelity Methods in Uncertainty Propagation, Inference, and Optimization\.SIAM Review60\(3\),pp\. 550–591\.External Links:ISSN 0036\-1445, 1095\-7200,[Document](https://dx.doi.org/10.1137/16M1082469)Cited by:[§2\.1](https://arxiv.org/html/2606.26728#S2.SS1.p5.1)\.
- \[26\]B\. Romera\-Paredes, M\. Barekatain, A\. Novikov, M\. Balog, M\. P\. Kumar, E\. Dupont, F\. J\. R\. Ruiz, J\. S\. Ellenberg, P\. Wang, O\. Fawzi, P\. Kohli, and A\. Fawzi\(2024\-01\)Mathematical discoveries from program search with large language models\.Nature625\(7995\),pp\. 468–475\.External Links:ISSN 1476\-4687,[Document](https://dx.doi.org/10.1038/s41586-023-06924-6)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[27\]D\. Silver, A\. Huang, C\. J\. Maddison, A\. Guez, L\. Sifre, G\. Van Den Driessche, J\. Schrittwieser, I\. Antonoglou, V\. Panneershelvam, M\. Lanctot, S\. Dieleman, D\. Grewe, J\. Nham, N\. Kalchbrenner, I\. Sutskever, T\. Lillicrap, M\. Leach, K\. Kavukcuoglu, T\. Graepel, and D\. Hassabis\(2016\-01\)Mastering the game of Go with deep neural networks and tree search\.Nature529\(7587\),pp\. 484–489\.External Links:ISSN 0028\-0836, 1476\-4687,[Document](https://dx.doi.org/10.1038/nature16961)Cited by:[§2\.3](https://arxiv.org/html/2606.26728#S2.SS3.p2.1),[§2\.3](https://arxiv.org/html/2606.26728#S2.SS3.p3.1)\.
- \[28\]D\. Silver, J\. Schrittwieser, K\. Simonyan, I\. Antonoglou, A\. Huang, A\. Guez, T\. Hubert, L\. Baker, M\. Lai, A\. Bolton, Y\. Chen, T\. Lillicrap, F\. Hui, L\. Sifre, G\. Van Den Driessche, T\. Graepel, and D\. Hassabis\(2017\-10\)Mastering the game of Go without human knowledge\.Nature550\(7676\),pp\. 354–359\.External Links:ISSN 0028\-0836, 1476\-4687,[Document](https://dx.doi.org/10.1038/nature24270)Cited by:[§2\.3](https://arxiv.org/html/2606.26728#S2.SS3.p2.1),[§2\.3](https://arxiv.org/html/2606.26728#S2.SS3.p3.1)\.
- \[29\]C\. Sipling, Y\. Zhang, and M\. Di Ventra\(2026\-01\)Phase\-space engineering and collective dynamics in memcomputing\.Physical Review Applied25\(1\),pp\. 014048\.External Links:[Document](https://dx.doi.org/10.1103/f8tv-jv1b)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p7.3)\.
- \[30\]J\. Skalse, N\. Howe, D\. Krasheninnikov, and D\. Krueger\(2022\-12\)Defining and Characterizing Reward Gaming\.Advances in Neural Information Processing Systems35,pp\. 9460–9471\.Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p3.1)\.
- \[31\]H\. Su, Y\. Sun, and C\. Yu\(2026\-01\)The End of Reward Engineering: How LLMs Are Redefining Multi\-Agent Coordination\.arXiv\.Note:arXiv preprint arXiv:2601\.08237External Links:2601\.08237,[Document](https://dx.doi.org/10.48550/arXiv.2601.08237)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p3.1)\.
- \[32\]F\. L\. Traversa and M\. Di Ventra\(2017\-02\)Polynomial\-time solution of prime factorization and NP\-complete problems with digital memcomputing machines\.Chaos: An Interdisciplinary Journal of Nonlinear Science27\(2\),pp\. 023107\.External Links:ISSN 1054\-1500, 1089\-7682,[Document](https://dx.doi.org/10.1063/1.4975761)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p7.3)\.
- \[33\]F\. Y\. Wang, D\. S\. Lee, D\. L\. Kaplan, and M\. J\. Buehler\(2025\-11\)Swarms of Large Language Model Agents for Protein Sequence Design with Experimental Validation\.arXiv\.Note:arXiv preprint arXiv:2511\.22311External Links:2511\.22311,[Document](https://dx.doi.org/10.48550/arXiv.2511.22311)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[34\]H\. Wang and L\. Zeng\(2025\-11\)Automated Algorithmic Discovery for Scientific Computing through LLM\-Guided Evolutionary Search: A Case Study in Gravitational\-Wave Detection\.arXiv\.Note:arXiv preprint arXiv:2508\.03661External Links:2508\.03661,[Document](https://dx.doi.org/10.48550/arXiv.2508.03661)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[35\]H\. Wang, X\. Zhang, and C\. Mu\(2025\-06\)Planning of Heuristics: Strategic Planning on Large Language Models with Monte Carlo Tree Search for Automating Heuristic Optimization\.arXiv\.Note:arXiv preprint arXiv:2502\.11422External Links:2502\.11422,[Document](https://dx.doi.org/10.48550/arXiv.2502.11422)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[36\]J\. Wei, Y\. Yang, X\. Zhang, Y\. Chen, X\. Zhuang, Z\. Gao, D\. Zhou, G\. Wang, Z\. Gao, J\. Cao, Z\. Qiu, M\. Hu, C\. Ma, S\. Tang, J\. He, C\. Song, X\. He, Q\. Zhang, C\. You, S\. Zheng, N\. Ding, W\. Ouyang, N\. Dong, Y\. Cheng, S\. Sun, L\. Bai, and B\. Zhou\(2025\-10\)From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery\.arXiv\.Note:arXiv preprint arXiv:2508\.14111External Links:2508\.14111,[Document](https://dx.doi.org/10.48550/arXiv.2508.14111)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[37\]Y\. Yamada, R\. T\. Lange, C\. Lu, S\. Hu, C\. Lu, J\. Foerster, J\. Clune, and D\. Ha\(2025\-04\)The AI Scientist\-v2: Workshop\-Level Automated Scientific Discovery via Agentic Tree Search\.arXiv\.Note:arXiv preprint arXiv:2504\.08066External Links:2504\.08066,[Document](https://dx.doi.org/10.48550/arXiv.2504.08066)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p1.1),[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[38\]S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan\(2023\-12\)Tree of Thoughts: Deliberate Problem Solving with Large Language Models\.arXiv\.Note:arXiv preprint arXiv:2305\.10601External Links:2305\.10601,[Document](https://dx.doi.org/10.48550/arXiv.2305.10601)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[39\]C\. Yu, R\. Liang, C\. Ho, and H\. Ren\(2025\-09\)Autonomous Code Evolution Meets NP\-Completeness\.arXiv\.Note:arXiv preprint arXiv:2509\.07367External Links:2509\.07367,[Document](https://dx.doi.org/10.48550/arXiv.2509.07367)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[40\]D\. Zhang, X\. Huang, D\. Zhou, Y\. Li, and W\. Ouyang\(2024\-06\)Accessing GPT\-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self\-refine with LLaMa\-3 8B\.arXiv\.Note:arXiv preprint arXiv:2406\.07394External Links:2406\.07394,[Document](https://dx.doi.org/10.48550/arXiv.2406.07394)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[41\]T\. Zheng, Z\. Deng, H\. T\. Tsang, W\. Wang, J\. Bai, Z\. Wang, and Y\. Song\(2025\-09\)From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery\.arXiv\.Note:arXiv preprint arXiv:2505\.13259External Links:2505\.13259,[Document](https://dx.doi.org/10.48550/arXiv.2505.13259)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.
- \[42\]Z\. Zheng, Z\. Xie, Z\. Wang, and B\. Hooi\(2025\-01\)Monte Carlo Tree Search for Comprehensive Exploration in LLM\-Based Automatic Heuristic Design\.arXiv\.Note:arXiv preprint arXiv:2501\.08603External Links:2501\.08603,[Document](https://dx.doi.org/10.48550/arXiv.2501.08603)Cited by:[§1](https://arxiv.org/html/2606.26728#S1.p2.1)\.

Supplementary Information for “Scientific discovery as meta\-optimization: a combinatorial optimization case study”

## Appendix S1 Best Solver Analysis

Design 340, the best solver found, reduces the scaling from $\sim N^{2.51}$ (baseline) to $\sim N^{1.33}$ and delivers a $\sim 67\times$ speedup at $N=1810$ variables for a clause-to-variable ratio $\alpha_r=4.3$. Below we trace the solver’s genealogy, compare its modifications to the baseline DMM equations (Eqs. ([4](https://arxiv.org/html/2606.26728#S4.E4))-([6](https://arxiv.org/html/2606.26728#S4.E6)) of the main text), and examine why those modifications improve scaling.

### S1.1 Architectural modifications

Design 340 leaves the baseline variable dynamics \(Eq\. \([4](https://arxiv.org/html/2606.26728#S4.E4)\)\) and long\-term memory dynamics \(Eq\. \([6](https://arxiv.org/html/2606.26728#S4.E6)\)\) intact\. All innovation is concentrated in the short\-term memory equation \(Eq\. \([5](https://arxiv.org/html/2606.26728#S4.E5)\)\), which acquires an additive release term:

$$\dot{x}_{s,m}=\beta\bigl[\underbrace{(x_{s,m}+\epsilon)(c_m-\gamma)}_{\text{baseline}}+\underbrace{\mathcal{R}_m}_{\text{release}}\bigr],\tag{S1}$$

where the release term $\mathcal{R}_m$ is a product of five gating functions, each targeting a specific condition that must hold before the system intervenes:

$$\mathcal{R}_m=\kappa\cdot\mathrm{gate}_{x_l}\cdot\mathrm{weak\_band}\cdot\mathrm{push\_up}\cdot\mathrm{gate}_{\mathrm{tail}}\cdot\mathrm{amp\_norm}.\tag{S2}$$

The multiplicative structure means $\mathcal{R}_m$ is nonzero only when all five gates open at once. Each component is described below and visualized in Fig. [S1](https://arxiv.org/html/2606.26728#A1.F1).

![Refer to caption](https://arxiv.org/html/2606.26728v1/x4.png)

Figure S1: Gating components of the release term in design 340. (a) Long-term memory gate $\mathrm{gate}_{x_l}$: activates only when $x_{l,m}$ exceeds the threshold $x_{l,\mathrm{thr}}=1000$ (dashed red line). (b) Weak-satisfaction band: peaks when $c_m$ falls between $\eta$ and $\gamma$; note that the optimized values $\eta=0.336>\gamma=0.282$ produce a narrow, low-amplitude response (peak $\approx 0.16$). (c) Bounded upward push: linearly decreasing in $x_{s,m}$, reaching zero at $x_s^{*}=0.091$. (d) Tail-safety gate at three levels of $\mathrm{gate}_{x_l}$, showing how the $x_l$-dependent shift extends the safe operating range. (e) Amplitude normalization with power-law decay and clause-state-driven floor at three levels of $\mathrm{floor\_gate}$. (f) Combined release magnitude in the $(c_m,x_{l,m})$ plane at $x_{s,m}=0.05$, showing the narrow region where all gates are simultaneously open.

1\. Long-term memory gate (Fig. [S1](https://arxiv.org/html/2606.26728#A1.F1)(a)). A sigmoid in log-space that switches on only for clauses carrying a large long-term memory:

$$\mathrm{gate}_{x_l}=\sigma\!\left(\frac{\ln x_{l,m}-\ln x_{l,\mathrm{thr}}}{\omega_l}\right),\tag{S3}$$

with $x_{l,\mathrm{thr}}=1000.4$ and $\omega_l=2.03$. The release mechanism therefore acts only on persistently violated clauses.

2\. Weak-satisfaction band (Fig. [S1](https://arxiv.org/html/2606.26728#A1.F1)(b)). Two opposing sigmoids select clauses in a narrow satisfaction window:

$$\mathrm{weak\_band}=\sigma\!\left(\frac{c_m-\eta}{\omega_b}\right)\cdot\sigma\!\left(\frac{\gamma-c_m}{\omega_b}\right),\tag{S4}$$

where $\eta=0.336$, $\gamma=0.282$, and $\omega_b=0.063$. Notably, the optimized values satisfy $\eta>\gamma$, yielding a very narrow, low-amplitude response (peak $\approx 0.16$). The band targets clauses hovering near the satisfaction boundary, which are partially satisfied but at risk of flipping. The inverted ordering $\eta>\gamma$ emerged from hyperparameter optimization and makes the release mechanism highly selective, delivering only a small nudge to clauses in this narrow region.
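The quoted peak amplitude can be checked numerically from Eq. (S4) with the stated parameter values. The sketch below scans $c_m$ on a grid (the grid resolution and function names are illustrative choices); by symmetry the maximum sits at the midpoint $(\eta+\gamma)/2=0.309$, where each sigmoid is evaluated at a negative argument and therefore stays below one half.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weak_band(c, eta=0.336, gamma=0.282, omega_b=0.063):
    """Eq. (S4): product of a rising sigmoid at eta and a falling sigmoid at gamma."""
    return sigmoid((c - eta) / omega_b) * sigmoid((gamma - c) / omega_b)


grid = [i / 1000 for i in range(1001)]   # c_m in [0, 1]
peak_c = max(grid, key=weak_band)
peak_value = weak_band(peak_c)

if __name__ == "__main__":
    print(f"peak at c_m = {peak_c:.3f}, amplitude = {peak_value:.3f}")
```

For comparison, a conventionally ordered band such as `weak_band(0.3, eta=0.1, gamma=0.5)` exceeds 0.9, so the inverted ordering reduces the maximum response by roughly a factor of six.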

3\. Bounded upward push (Fig. [S1](https://arxiv.org/html/2606.26728#A1.F1)(c)). A ReLU gate that drives $x_{s,m}$ toward a target value:

$$\mathrm{push\_up}=\frac{\max(x_s^{*}-x_{s,m},\;0)}{x_s^{*}},\tag{S5}$$

with $x_s^{*}=0.091$. This is perhaps the most surprising parameter choice: the target sits near the lower bound of $x_s$, so the push shuts off almost as soon as $x_s$ rises above $\sim 0.09$. Together with the narrow weak-band, it makes the release mechanism extremely conservative.

4\. Tail-safety gate (Fig. [S1](https://arxiv.org/html/2606.26728#A1.F1)(d)). A sigmoid that damps release when $x_{s,m}$ approaches its upper bound:

$$\mathrm{gate}_{\mathrm{tail}}=\sigma\!\left(\frac{x_{s,\mathrm{tail}}-x_{s,m}+\mu_{\mathrm{tail}}\cdot\mathrm{gate}_{x_l}}{\omega_{\mathrm{tail}}}\right),\tag{S6}$$

with $x_{s,\mathrm{tail}}=0.531$, $\mu_{\mathrm{tail}}=0.424$, and $\omega_{\mathrm{tail}}=0.107$. The $x_l$-dependent shift ($\mu_{\mathrm{tail}}\cdot\mathrm{gate}_{x_l}$) widens the safe operating range for high-penalty clauses, giving them more headroom before the safety cutoff kicks in. With the previous push\_up term concentrating on small $x_s$, this gate becomes almost redundant. This is an artifact arising from hyperparameter optimization and might warrant simplification.

5\. Amplitude normalization (Fig. [S1](https://arxiv.org/html/2606.26728#A1.F1)(e)). A power-law decay with a clause-state-driven floor:

$$\mathrm{amp}_{\mathrm{pow}}=\left(\frac{x_{l,\mathrm{norm}}}{x_{l,\mathrm{norm}}+x_{l,m}}\right)^{\!p},\tag{S7}$$

$$\mathrm{floor\_gate}=1-(1-\mathrm{weak\_band})(1-\mathrm{push\_up}),\tag{S8}$$

$$\mathrm{amp\_norm}=(1-f)\,\mathrm{amp}_{\mathrm{pow}}+f,\quad f=a_{\mathrm{floor}}\cdot\mathrm{floor\_gate},\tag{S9}$$

where $x_{l,\mathrm{norm}}=10{,}087$, $p=1.75$, and $a_{\mathrm{floor}}=0.0093$. The power-law decay (Eq. ([S7](https://arxiv.org/html/2606.26728#A1.E7))) regulates the release term as $x_{l,m}$ increases. The floor gate (Eq. ([S8](https://arxiv.org/html/2606.26728#A1.E8))) is a soft logical OR of the weak-band and push-up signals: it guarantees a minimum amplitude precisely when a clause needs intervention. Still, hyperparameter optimization yields a near-zero $a_{\mathrm{floor}}$, suggesting that the floor gate mechanism could be simplified.
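The four gates defined in Eqs. (S4)–(S9) can be evaluated directly from the quoted parameter values. The following is a minimal sketch, not the authors' solver code: the function names are illustrative, and `gate_xl` is passed in as a plain number in $[0,1]$ because its definition lies outside this passage.

```python
import math

# Parameter values quoted in the text for design 340.
ETA, GAMMA, OMEGA_B = 0.336, 0.282, 0.063
XS_STAR = 0.091
XS_TAIL, MU_TAIL, OMEGA_TAIL = 0.531, 0.424, 0.107
XL_NORM, P, A_FLOOR = 10087.0, 1.75, 0.0093


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weak_band(c):
    """Eq. (S4): product of two opposing sigmoids in the clause value c."""
    return sigmoid((c - ETA) / OMEGA_B) * sigmoid((GAMMA - c) / OMEGA_B)


def push_up(xs):
    """Eq. (S5): normalized ReLU that is zero once xs reaches XS_STAR."""
    return max(XS_STAR - xs, 0.0) / XS_STAR


def gate_tail(xs, gate_xl):
    """Eq. (S6): gate_xl is an assumed input in [0, 1], not computed here."""
    return sigmoid((XS_TAIL - xs + MU_TAIL * gate_xl) / OMEGA_TAIL)


def amp_norm(xl, c, xs):
    """Eqs. (S7)-(S9): power-law decay in xl with a state-gated floor."""
    amp_pow = (XL_NORM / (XL_NORM + xl)) ** P
    floor_gate = 1.0 - (1.0 - weak_band(c)) * (1.0 - push_up(xs))
    f = A_FLOOR * floor_gate
    return (1.0 - f) * amp_pow + f


# By symmetry the weak band peaks at the midpoint of ETA and GAMMA.
peak = weak_band(0.5 * (ETA + GAMMA))
print(f"weak-band peak: {peak:.3f}")
```

Evaluating `peak` reproduces the low-amplitude response of about 0.16 noted for Eq. (S4), and `amp_norm` tends to `f` (at most `A_FLOOR`) as `xl` grows.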

Table [S1](https://arxiv.org/html/2606.26728#A1.T1) gives the full hyperparameter comparison between the baseline and design 340.

Table S1: Hyperparameter comparison. The baseline uses 7 parameters; design 340 uses 20, with the 13 additional parameters controlling the release mechanism. Values shown are the HEBO-optimized defaults for design 340.

Design choices and potential simplifications. The five-gate structure was found without any explicit simplicity constraint: agents were judged purely on solver performance, with no penalty for parameter count or code length. The evaluation schedule’s wall-time timeout acts as an implicit regularizer: complex designs that run slower are less likely to pass higher fidelity levels. But this pressure is indirect and does not penalize architectural complexity sufficiently. A systematic ablation study would sort out which components are essential and which are evolutionary artifacts.

An explicit simplicity term, like a parameter\-count or description\-length penalty, could bias the search toward more parsimonious solutions and reduce the hyperparameter optimization burden\. Calibrating such a penalty without unintentionally suppressing useful mechanisms would require further fine\-tuning\.

### S1.2 Genealogy

Design 340 is the product of 32 generations of LLM-guided search spanning 154 planner rounds. Fig. [S2](https://arxiv.org/html/2606.26728#A1.F2) shows the primary lineage, traced by following the highest-weight parent reference at each step. Table [S2](https://arxiv.org/html/2606.26728#A1.T2) lists every design in this lineage with its key innovation and maximum solved problem size. A consistent pattern runs through the entire ancestry: every architectural change modifies only the short-term memory dynamics $\dot{x}_{s,m}$; the variable update (Eq. ([4](https://arxiv.org/html/2606.26728#S4.E4))) and long-term memory update (Eq. ([6](https://arxiv.org/html/2606.26728#S4.E6))) are never touched.

![Refer to caption](https://arxiv.org/html/2606.26728v1/x5.png)

Figure S2: Primary lineage of design 340. Each node is a milestone design, colored by innovation phase. Design IDs appear below each node; several intermediate designs are omitted for clarity. The full lineage spans 32 designs across 154 planner rounds.

Table S2: Complete primary lineage of design 340. Each row shows a design along the highest-weight ancestry path, its key innovation, the release term structure at that point, maximum solved problem size $N_{\max}$ (with median steps at $N_{\max}$), and consensus score across all objectives. HEBO-tuned designs share identical code with their parent; only hyperparameters change. $N_{\max}$ is determined by the largest $N$ at which unsolved fraction $<0.5$.

| ID | Short name | Innovation | $N_{\max}$ | Steps@$N_{\max}$ | $\mathcal{R}_{m}$ structure |
|---|---|---|---|---|---|
| | *Phase 1: Core release structure (designs 1–12)* | | | | |
| 1 | Baseline | – | 320 | 37,729 | – |
| 5 | $x_{l}$-gated $x_{s}$ bump | $\mathrm{gate}_{x_{l}}$, $\kappa$ | 320 | 9,241 | $\kappa\cdot\mathrm{gate}_{x_{l}}\cdot c\cdot\max(\gamma-c,0)$ |
| 6 | $x_{s}$ hysteresis bump | $(1-x_{s})$ damping | 320 | 9,452 | $\kappa\cdot\mathrm{gate}_{x_{l}}\cdot c\cdot\max(\gamma-c,0)\cdot(1-x_{s})$ |
| 7 | Bounded $x_{s}$ release | weak_band, push_up, $x_{s}^{*}$ | 453 | 66,681 | $\kappa\cdot\mathrm{gate}_{x_{l}}\cdot\mathrm{wb}\cdot\mathrm{pu}$ |
| 10 | Near-$\gamma$ release gate | near_gamma gate | 453 | 45,234 | $+\;\mathrm{near}_{\gamma}$ factor |
| 11 | HEBO-tuned | (hyperparameters) | 453 | 12,409 | (same) |
| 12 | Below-$\gamma$ windowed | below-$\gamma$ filter | 905 | 229,470 | $+\;\mathrm{below}_{\gamma}$ window |
| | *Phase 2: Window shaping & amplitude refinement (designs 16–285)* | | | | |
| 16 | Floored window | win_floor | 905 | 203,389 | $+\;\mathrm{amp}$ with floor |
| 20 | Power-shaped window | win_pow | 905 | 191,288 | $+\;t^{p}$ shaping |
| 21 | HEBO-tuned | (hyperparameters) | 640 | 32,071 | (same) |
| 30 | Dead-zone window | win_tau dead zone | 1,280 | 220,176 | $+\;\tau$ dead zone |
| 34 | Smootherstep amp | smootherstep function | 1,280 | 230,042 | $+\;s(t)$ smoothing |
| 36 | Warp amp | rational warp | 1,280 | 225,071 | $+\;s/(s+(1-s)^{2})$ warp |
| 39 | One-sided window | simplified to one-sided | 1,280 | 130,867 | one-sided sigmoid window |
| 43 | $x_{l}$-gain amp | $x_{l}$-dependent boost | 1,280 | 191,946 | $+\;(1+g\cdot\mathrm{gate}_{x_{l}})$ |
| 222 | Safe $x_{l}$-sharp cap | sharp_min_ratio, sharp_gain | 905 | 35,104 | $+\;x_{l}$-adaptive sharpness |
| 228 | $\gamma$-threshold shift | thr_shift | 905 | 62,151 | $+\;x_{l}$-dependent $\gamma$ shift |
| 280 | Shift sharpening | shift_pow | 1,280 | 143,552 | $+\;$power-law shift |
| 285 | Dead-zone shift | shift_tau | 320 | 3,476 | $+\;\tau$ dead zone on shift |
| | *Phase 3: Simplification & component assembly (designs 298–322)* | | | | |
| 298 | Weak-band release | radical simplification | 640 | 64,272 | $\kappa\cdot\mathrm{gate}_{x_{l}}\cdot\mathrm{wb}\cdot\mathrm{pu}$ |
| 302 | Tail gate | gate_tail, $x_{s,\mathrm{tail}}$ | 640 | 51,263 | $+\;\mathrm{gate}_{\mathrm{tail}}$ |
| 305 | $x_{l}$-shifted tail | tail_mu | 320 | 6,001 | $+\;\mu\cdot\mathrm{gate}_{x_{l}}$ in tail |
| 310 | Amplitude norm | amp_norm | 640 | 45,978 | $+\;\mathrm{amp\_norm}$ |
| 316 | Sign flip in tail | $-\mu\to+\mu$ | 640 | 38,994 | sign correction |
| 320 | Over-damped | $\mathrm{amp\_norm}^{2}$ | 640 | 58,343 | squared amp_norm |
| 321 | HEBO-tuned | (hyperparameters) | 1,280 | 118,834 | (same) |
| 322 | Tunable amp power | amp_p | 905 | 28,192 | $\mathrm{amp\_norm}=(\cdot)^{p}$ |
| | *Phase 4: Floor refinement & final tuning (designs 328–340)* | | | | |
| 328 | $x_{l}$-gated floor | amp_floor $\cdot\,\mathrm{gate}_{x_{l}}^{2}$ | 905 | 41,271 | $+\;f\cdot\mathrm{gate}_{x_{l}}^{2}$ floor |
| 332 | Linear floor | amp_floor $\cdot\,\mathrm{gate}_{x_{l}}$ | 905 | 31,154 | $+\;f\cdot\mathrm{gate}_{x_{l}}$ floor |
| 336 | Ungated floor | amp_floor (constant) | 320 | 2,916 | $+\;f$ constant floor |
| 339 | State-gated floor | floor_gate (soft-OR) | 1,810 | 691,058 | $+\;f\cdot\mathrm{OR}(\mathrm{wb},\mathrm{pu})$ |
| 340 | HEBO-tuned | (hyperparameters) | 1,810 | 95,503 | (same) |

The evolution falls into four phases, each contributing distinct pieces of the final solver\.

#### Phase 1: Core release structure \(designs 1\-12\)

The first phase laid down the release term’s basic architecture. Design 5 introduced the central idea: an additive “bump” in the $x_{s}$ dynamics, gated by long-term memory: $\mathcal{R}_{m}=\kappa\cdot\mathrm{gate}_{x_{l}}(x_{l,m})\cdot c_{m}\cdot\max(\gamma-c_{m},0)$. It fires for persistently violated clauses ($x_{l,m}\gg x_{l,\mathrm{thr}}$) whose satisfaction monitor falls below $\gamma$, cutting the baseline’s median step count at $N=320$ from 37,729 to 9,241. Design 6 then added a $(1-x_{s})$ damping factor so the bump would not saturate $x_{s}$.

Design 7 was the first real breakthrough. It replaced the ad-hoc $c_{m}\cdot\max(\gamma-c_{m},0)\cdot(1-x_{s})$ product with three cleanly separated components ($\mathrm{weak\_band}$, $\mathrm{push\_up}$, and $x_{s}^{*}$), establishing the multiplicative gating structure that survives into the final solver:

$$\mathcal{R}_{m}=\kappa\cdot\mathrm{gate}_{x_{l}}\cdot\mathrm{weak\_band}(c_{m})\cdot\mathrm{push\_up}(x_{s,m}).\tag{S10}$$

$N_{\max}$ jumped from 320 to 453 after this change. Designs 10–12 refined the release window: design 10 added a “near-$\gamma$” gate to concentrate release on clauses closest to the satisfaction threshold, and design 12 restricted action to $c_{m}<\gamma$ (below-$\gamma$ windowed release), pushing $N_{\max}$ to 905. HEBO tuning of design 10 alone cut median steps at $N=453$ from 45,234 to 12,409, a $3.6\times$ speedup from hyperparameters.

#### Phase 2: Window shaping and amplitude refinement \(designs 16\-285\)

With the basic release structure working up to $N\leq 905$, the search spent designs 16–285 (12 generations, planner rounds 8–133) exploring progressively more elaborate amplitude shaping functions. Design 16 added a floor to the amplitude window. Design 20 brought in power-law shaping ($t^{p}$). Design 30 introduced a dead-zone threshold. Design 34 swapped the amplitude for a smootherstep function ($6t^{5}-15t^{4}+10t^{3}$). Design 36 applied a rational warp ($s/(s+(1-s)^{2})$). These refinements pushed $N_{\max}$ to 1,280 (design 30), though step counts at that size remained high (130,867–230,042).

Two other directions were explored in parallel: $x_{l}$-dependent amplitude boosting (design 43, replacing $\mathrm{gate}_{x_{l}}$ with $\mathrm{gate}_{x_{l}}(1+g\cdot\mathrm{gate}_{x_{l}})$) and $x_{l}$-adaptive sharpness control (designs 222–285, making the window sharpness depend on clause memory state). These additions pushed the parameter count upward (from 13 to 24) without matching gains in scaling. Design 285, the most complex solver in this phase, ran 183 lines of code with 24 hyperparameters yet could only solve up to $N=320$.

#### Phase 3: Simplification and component assembly \(designs 298\-322\)

Design 298 marks a turning point. The LLM discarded the entire accumulated complexity of Phase 2 (smootherstep, rational warp, dead zones, $x_{l}$-adaptive sharpness, threshold shifts) and reverted to the clean three-gate structure of design 7: $\mathcal{R}_{m}=\kappa\cdot\mathrm{gate}_{x_{l}}\cdot\mathrm{weak\_band}\cdot\mathrm{push\_up}$. Code shrank from 183 lines and 24 hyperparameters to 118 lines and 13 hyperparameters. Reach dropped ($N_{\max}$ fell from the Phase 2 peak of 1,280 to 640), but the simplified base gave the remaining innovations a clean foundation.

From there, the system assembled the final release term’s remaining components in quick succession:

- Design 302: Added $\mathrm{gate}_{\mathrm{tail}}$, a sigmoid suppressing release at high $x_{s,m}$ to prevent saturation.
- Design 305: Added $\mu_{\mathrm{tail}}\cdot\mathrm{gate}_{x_{l}}$ to the tail gate, letting the safety threshold shift with $x_{l}$.
- Design 310: Added $\mathrm{amp\_norm}=x_{l,\mathrm{norm}}/(x_{l,\mathrm{norm}}+x_{l,m})$, a monotone decay preventing release amplitude from growing with $x_{l}$.
- Design 316: Flipped the tail-gate shift sign from $-\mu$ to $+\mu$, extending (not reducing) the safe range for high-$x_{l}$ clauses.
- Design 320: Squared $\mathrm{amp\_norm}$ for stronger damping at large $x_{l}$.
- Design 322: Generalized the square to a tunable power $p$, replacing $\mathrm{amp\_norm}^{2}$ with $\mathrm{amp\_base}^{p}$ and making $p$ a hyperparameter.

By design 321 (HEBO tuning of design 320), $N_{\max}$ had recovered to 1,280 at 118,834 median steps. The release term now contained all five gating components of the final solver.

#### Phase 4: Floor refinement and final tuning \(designs 328\-340\)

One problem remained: $\mathrm{amp\_norm}$ decays to zero as $x_{l,m}\to\infty$, which can starve clauses that genuinely need sustained help. Design 328 introduced an amplitude floor, $\mathrm{amp\_norm}=(1-f)\cdot\mathrm{amp\_pow}+f$, with $f=a_{\mathrm{floor}}\cdot\mathrm{gate}_{x_{l}}^{2}$. But because the release term already multiplies by $\mathrm{gate}_{x_{l}}$, this created a double-gating pathology: the effective floor scaled as $\mathrm{gate}_{x_{l}}^{3}$.

Three floor variants followed in rapid succession:

- Design 332: Reduced to $f=a_{\mathrm{floor}}\cdot\mathrm{gate}_{x_{l}}$ (linear; effective $\propto\mathrm{gate}_{x_{l}}^{2}$).
- Design 336: Dropped $x_{l}$ dependence entirely: $f=a_{\mathrm{floor}}$ (constant floor; effective $\propto\mathrm{gate}_{x_{l}}$).
- Design 339: Conditioned the floor on clause state instead: $f=a_{\mathrm{floor}}\cdot[1-(1-\mathrm{weak\_band})(1-\mathrm{push\_up})]$. This soft-OR activates the floor when the clause sits in the weak-satisfaction band *or* when $x_{s,m}$ is below target, and shuts off otherwise.
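The effective-floor exponents quoted for designs 328, 332, and 336 follow from a short limit argument. The derivation below is a sketch under the assumption that the release term is proportional to $\mathrm{gate}_{x_{l}}\cdot\mathrm{amp\_norm}$ with all other factors held fixed; the superscript "floor" is notation introduced here for the large-$x_{l,m}$ contribution.

```latex
% As x_{l,m} -> infinity, amp_pow -> 0, so amp_norm = (1-f) amp_pow + f -> f.
% The surviving contribution to the release term is therefore gate_{x_l} * f.
\begin{align*}
\mathcal{R}_{m}^{\mathrm{floor}} &\propto \mathrm{gate}_{x_{l}}\cdot f, \\
\text{Design 328: } f = a_{\mathrm{floor}}\,\mathrm{gate}_{x_{l}}^{2}
  &\;\Rightarrow\; \mathcal{R}_{m}^{\mathrm{floor}} \propto \mathrm{gate}_{x_{l}}^{3}, \\
\text{Design 332: } f = a_{\mathrm{floor}}\,\mathrm{gate}_{x_{l}}
  &\;\Rightarrow\; \mathcal{R}_{m}^{\mathrm{floor}} \propto \mathrm{gate}_{x_{l}}^{2}, \\
\text{Design 336: } f = a_{\mathrm{floor}}
  &\;\Rightarrow\; \mathcal{R}_{m}^{\mathrm{floor}} \propto \mathrm{gate}_{x_{l}}.
\end{align*}
```

For design 339 the same limit gives a floor proportional to $\mathrm{gate}_{x_{l}}$ times the soft-OR of the weak-band and push-up signals, so the floor tracks clause state rather than a higher power of the memory gate.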

Design 339 reached $N_{\max}=1{,}810$ (691,058 steps), the first solver in the lineage to clear the largest benchmark instances. Design 340 (HEBO tuning of 339, no code changes) then brought the median steps at $N=1{,}810$ down from 691,058 to 95,503, a $7.2\times$ speedup from hyperparameters alone, confirming the importance of both architecture and tuning.

#### Summary: Building up the release term

Table [S3](https://arxiv.org/html/2606.26728#A1.T3) records which design first introduced each component of the final release term. The core three-gate structure ($\mathrm{gate}_{x_{l}}\cdot\mathrm{weak\_band}\cdot\mathrm{push\_up}$) was in place by design 7; assembling the full five-gate structure with a clause-state-driven floor took 32 generations and a critical simplification event at design 298.

Table S3: Release term component origins. Each row shows the first design in the lineage that introduced a component of the final release term (Eq. ([S2](https://arxiv.org/html/2606.26728#A1.E2))).

### S1.3 Analysis

Why the release mechanism improves scaling. The baseline $x_{s}$ dynamics (Eq. ([5](https://arxiv.org/html/2606.26728#S4.E5))) amounts to a simple proportional controller: $x_{s,m}$ grows when $c_{m}>\gamma$ (clause violated) and shrinks when $c_{m}<\gamma$ (clause satisfied). Every clause gets the same treatment: one that has been stuck for thousands of steps receives the same dynamical response as one that was violated only briefly. At large problem sizes, where clauses compete for variable updates through the shared variable dynamics (Eq. ([4](https://arxiv.org/html/2606.26728#S4.E4))), this indiscriminate response leads to wasted computational effort.

The release term \(Eq\. \([S2](https://arxiv.org/html/2606.26728#A1.E2)\)\) provides a targeted intervention that fires only when five conditions hold simultaneously:

1. The clause has been persistently violated ($x_{l,m}\gg x_{l,\mathrm{thr}}$, via $\mathrm{gate}_{x_{l}}$).
2. The clause sits in a critical satisfaction state ($c_{m}\approx 0.3$, via $\mathrm{weak\_band}$).
3. The short-term memory is below target ($x_{s,m}<x_{s}^{*}$, via $\mathrm{push\_up}$).
4. The short-term memory is not saturated ($x_{s,m}$ below safety threshold, via $\mathrm{gate}_{\mathrm{tail}}$).
5. The release amplitude is properly regulated (via $\mathrm{amp\_norm}$).

The conjunction channels effort toward persistently stuck clauses that are close to flipping and need a gentle push, rather than clauses that are far from satisfaction or would resolve on their own through the baseline dynamics\.

Conservative parameter regime. The HEBO-optimized [[6](https://arxiv.org/html/2606.26728#bib.bib63)] hyperparameters significantly restrain the release mechanism. Three noteworthy choices:

- High memory threshold ($x_{l,\mathrm{thr}}=1000$). The long-term memory gate demands $x_{l,m}>1000$ before it opens. Since $x_{l,m}$ starts at 1 and grows at rate $\alpha(c_{m}-\delta)$, a clause must be persistently violated before the release mechanism engages at all.
- Inverted weak-band ($\eta=0.336>\gamma=0.282$). The band edges are flipped relative to their original design intent ($\eta$ was meant to be less than $\gamma$), producing a narrow, low-amplitude response peak of $\approx 0.16$. The weak-band filter ends up highly selective.
- Very low push-up target ($x_{s}^{*}=0.091$). The push deactivates once $x_{s,m}>0.091$: a small initial nudge, not sustained forcing.

Together these produce a release mechanism that fires rarely (high $x_{l,\mathrm{thr}}$), responds weakly (inverted band, low $x_{s}^{*}$), and acts briefly (push-up saturates quickly). The combined amplitude is typically 1–3% of the baseline $x_{s}$ dynamics. This small intervention suffices to break deadlocks at large $N$ without disrupting the baseline dynamics that handle the majority of clauses well.

Backbone parameter shifts. The optimized backbone parameters also shift substantially from the baseline: $\alpha$ more than doubles ($5\to 11.2$), $\beta$ roughly doubles ($20\to 42$), and the time step $\Delta t_{0}$ triples ($1\to 3.0$). These shifts make the baseline dynamics faster and more responsive overall; the release mechanism then provides a focused correction for the few clauses that get stuck.

Scaling comparison. Fig. [S3](https://arxiv.org/html/2606.26728#A1.F3) compares scaling in detail. The speedup grows with problem size: no speedup at small $N$ (${\sim}1\times$ at $N=10$), modest at intermediate $N$ (${\sim}4\times$ at $N=320$), and accelerating at large $N$ ($67\times$ at $N=1810$). The pattern matches the picture above: at small $N$ most clauses resolve through the baseline dynamics alone, so the release mechanism adds little. At large $N$ the growing web of clause interactions creates more persistent deadlocks, and the targeted release becomes increasingly valuable.

![Refer to caption](https://arxiv.org/html/2606.26728v1/x6.png)

Figure S3: Scaling comparison. (a) Median solution steps vs. problem size $N$ for the baseline (black) and design 340 (red); power-law fits are dashed. (b) Speedup ratio (baseline steps / design 340 steps) vs. $N$, growing monotonically to $67\times$ at $N=1810$. The horizontal dashed line corresponds to a speedup of $1\times$ (no change).

## Appendix S2 Objective Evolution Analysis

Over the course of the search, the objective agent produced 42 objective functions across eight independent workspaces that were later merged. These objectives pass through four phases, tracking how the system’s notion of a good solver changes as understanding deepens. Table [S7](https://arxiv.org/html/2606.26728#A2.T7) lists all 42 objectives with their origin and final consensus weights. We trace this evolution below, document three reward hacking episodes that were detected and mitigated, and analyze the shifting correlation structure of the objective portfolio.

### S2.1 Objective phases

Phase 1a: Power-law extrapolation (objectives 0–5). The first six objectives came from workspace 22. They shared a common strategy: fit power-law ($\log s\sim a+b\log N$) and exponential ($\log s\sim a+bN$) models to the observed median steps $s$ vs. problem size $N$, then extrapolate to a fixed target $N=2000$. Objective 0, the initial human-designed baseline, used an $R^{2}$-weighted softmax blend of both models. Successive objectives show incremental refinements: reliability penalties based on `unsolved_fraction` (objective 1), coverage penalties when the largest tested $N$ fell far short of the target (objective 2), AIC-weighted model averaging (objective 3), and smooth failure-margin penalties (objectives 4–5).
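The extrapolation strategy shared by these objectives can be sketched as follows. This is an illustrative reconstruction, not the objective agent's code: the function names, the softmax temperature of 1, and the synthetic data (which lie exactly on $s=10\,N^{1.5}$) are all assumptions.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


def extrapolate_log_steps(sizes, steps, target_n=2000):
    """Blend power-law and exponential predictions of log(steps) at target_n."""
    log_s = [math.log(s) for s in steps]
    a_p, b_p, r2_p = fit_line([math.log(n) for n in sizes], log_s)  # log s ~ a + b log N
    a_e, b_e, r2_e = fit_line([float(n) for n in sizes], log_s)     # log s ~ a + b N
    pred_p = a_p + b_p * math.log(target_n)
    pred_e = a_e + b_e * target_n
    # Softmax over the two R^2 values (temperature 1 is an assumption).
    w_p = math.exp(r2_p) / (math.exp(r2_p) + math.exp(r2_e))
    return w_p * pred_p + (1.0 - w_p) * pred_e, b_p


# Hypothetical median-step data on an exact power law s = 10 * N^1.5.
sizes = [10, 40, 160, 640]
steps = [10.0 * n ** 1.5 for n in sizes]
blend, exponent = extrapolate_log_steps(sizes, steps)
print(f"fitted exponent: {exponent:.2f}, blended log-steps at N=2000: {blend:.2f}")
```

Even on data that follow a power law exactly, the exponential model retains a sizeable softmax weight and pulls the blended prediction upward, which hints at how sensitive extrapolation from $N\leq 640$ to $N=2000$ can be.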

All six shared a basic limitation: they worked entirely from small-$N$ data ($N\leq 640$ under low-fidelity evaluation) and tried to extrapolate to $N=2000$. They also misread the `unsolved_fraction` variable: a problem is considered solved and terminates when `unsolved_fraction` falls below 0.5 (effectively a binary success/fail signal), but the objectives treated it as a continuous variable and rewarded designs with smaller `unsolved_fraction`. The result was a tight cluster of correlated objectives: the mean within-phase pairwise Kendall’s $\tau$ was 0.588, with no negative pairs.

Phase 1b: Censored-aware scaling (objectives 6–15). In parallel, workspace 24 produced ten objectives through a similar iterative refinement loop. The key advance was a restart-steps model: instead of raw median steps $s$, these objectives computed an effective cost $s_{\mathrm{eff}}=s+\bigl(\texttt{unsolved\_fraction}/(1-\texttt{unsolved\_fraction})\bigr)\cdot\texttt{max\_steps}$, accounting for the restarts needed when some runs fail. They also adopted AIC-weighted model averaging between polynomial and exponential scaling models and used Theil-Sen (median) regression for outlier robustness.
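The restart-steps model is simple enough to state in a few lines. A minimal sketch with a hypothetical function name: the factor $u/(1-u)$ is the expected number of failed runs before the first success when each run fails independently with probability $u$, and each failed run is charged the full `max_steps` budget.

```python
def effective_steps(median_steps, unsolved_fraction, max_steps):
    """Effective cost s_eff = s + (u / (1 - u)) * max_steps."""
    if not 0.0 <= unsolved_fraction < 1.0:
        raise ValueError("unsolved_fraction must lie in [0, 1)")
    expected_failed_runs = unsolved_fraction / (1.0 - unsolved_fraction)
    return median_steps + expected_failed_runs * max_steps


# Hypothetical numbers: half the runs fail, so one full budget is added on average.
print(effective_steps(1000.0, 0.5, 5000.0))
```

The cost grows smoothly with `unsolved_fraction`, which is exactly the continuous treatment that later objectives abandoned in favor of a strict pass/fail gate at 0.5.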

Nevertheless, Phase 1b objectives still used the extrapolation-to-$N{=}2000$ paradigm and still treated `unsolved_fraction` as a continuous variable below the success threshold of 0.5. Without the meta-agent (not yet introduced), objective evolution in this workspace amounted to incremental polishing of one approach, producing a tight cluster with mean within-workspace $\tau=0.536$.

Phase 2: Reach-dominant metrics (objectives 16–31). The later exploratory workspaces exhibit a paradigm shift. Instead of extrapolating to a fixed target $N$, these objectives maximized the largest problem size the solver could actually clear, a metric we call “reach.” Objective 20 introduced the structure that every subsequent objective adopted: a reach term (e.g., $-\log_{2}N_{\mathrm{max}}$) dominates by orders of magnitude, with tail-scaling and budget-headroom terms serving only as tie-breakers among designs that reach the same $N$.

Hard gates at schedule milestones also appeared in this phase: objective 20 penalized solvers failing before $N=640$ and, when medium/high-fidelity data were available, before $N=1280$. Later objectives (25–28) added conservative worst-window tail fits and budget-cliff hazard terms for sharp jumps in $\texttt{median\_step}/\texttt{max\_steps}$ at the frontier.

The meta-agent entered in workspace 36, and its effect on diversity is visible in the numbers: workspace 36 yielded the most diverse objective set (within-workspace mean $\tau=0.359$, versus 0.587 and 0.536 in the pre-meta-agent workspaces). Phase 2 also produced the most problematic objective in the entire search (objective 17; see Sec. [S2.2](https://arxiv.org/html/2606.26728#A2.SS2)).

Phase 3: Schedule-faithful objectives (objectives 32–41). The last ten objectives were generated in the merged workspace, with all earlier objectives and designs available as context. They drew on the lessons of Phases 1 and 2 and produced a mature design pattern resting on three principles:

1. Strict binary pass/fail. Success means `unsolved_fraction` $<0.5$; a value of 0.49 is never penalized.
2. Schedule-faithful headroom. Budget headroom is computed from the *schedule budget* $B(N)$ (a deterministic function of $N$ and the fidelity cap), not from the run’s `max_steps`. This blocks designs from inflating headroom by running with inflated budgets.
3. Smooth-max bottleneck detection. Rather than sampling headroom at a single point, the objective takes a smooth-max (log-sum-exp) of $\log(\texttt{median\_step}/B(N))$ over the last 3 cleared levels plus conservative worst-window predictions at $N=1810$ and $N=2560$. The worst bottleneck is identified without being dominated by a single noisy data point.
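The smooth-max in the third principle can be sketched with a standard log-sum-exp. The function names, the `sharpness` parameter, and the step and budget numbers below are hypothetical; only the use of log-sum-exp over $\log(\texttt{median\_step}/B(N))$ comes from the text.

```python
import math


def smooth_max(values, sharpness=4.0):
    """Log-sum-exp soft maximum, shifted by max(values) for numerical stability.

    The result lies between max(values) and max(values) + log(n) / sharpness.
    """
    m = max(values)
    total = sum(math.exp(sharpness * (v - m)) for v in values)
    return m + math.log(total) / sharpness


def bottleneck(median_steps, budgets, sharpness=4.0):
    """Soft worst-case of log(median_step / B(N)) across schedule levels."""
    log_ratios = [math.log(s / b) for s, b in zip(median_steps, budgets)]
    return smooth_max(log_ratios, sharpness)


# Hypothetical last three cleared levels: steps used vs. schedule budget B(N).
median_steps = [2.0e4, 6.0e4, 4.5e5]
budgets = [1.0e5, 2.0e5, 5.0e5]
print(f"soft bottleneck: {bottleneck(median_steps, budgets):.3f}")
```

A larger `sharpness` moves the result toward the hard maximum; a smaller one lets several moderately tight levels contribute, which is what keeps a single noisy point from dominating.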

Further components included multi-step budget-cliff hazard penalties for acceleration in $\log(\texttt{median\_step}/B)$ across the $905\to 1280\to 1810\to 2560$ transitions (gated so they fire only once the solver reaches those regimes), and optional repeat-robustness terms penalizing mixed pass/fail outcomes across repeated runs at the same $N$.

Phase 3 objectives produced the strongest selection pressure: their per-objective winners had a median consensus rank of 2 (vs. 95 for Phase 1b), and their mean Kendall’s $\tau$ with the consensus was 0.624 (Phase 1a’s higher 0.665 reflects the large early cluster, not better selection).

### S2.2 Reward hacking episodes

Three reward hacking patterns surfaced during the search. Each was caught through correlation analysis and corrected by the meta-agent’s weight adjustments.

Episode 1: Small-$N$ over-optimization (objectives 0–15). Phase 1 objectives rewarded designs whose scaling curves looked favorable on small-$N$ data, regardless of whether the extrapolation held up. The meta-agent’s first assessment (round 1) diagnosed the issue: “designs reach $N=640$ under the adaptive schedule, but this is largely driven by hovering just below the `unsolved_fraction` $<0.5$ gate while `median_step` explodes at higher $N$.” By round 6 of the merged workspace, the meta-agent had zeroed out all 16 Phase 1 objectives (weights set to 0.0), concentrating the consensus on Phase 2 and 3 metrics. Even after suppression, these objectives still contributed passively to the consensus through agreement-based weighting: objectives 0, 2, 4, 7, 12, and 15 retained Kendall’s $\tau>0.78$ with the final consensus ranking. Their winners were often poor, but their overall ranking structure was partially aligned.

Episode 2: Gate-passing artifacts (objective 17). Objective 17 was the worst reward hacking case in the search. It used a soft threshold at $u_{\mathrm{eff}}\leq 0.40$ (with uncertainty-aware unsolved fraction $u_{\mathrm{eff}}=u+\mathrm{SE}_{\mathrm{binom}}$) and computed a composite score from effective solved size $N_{\mathrm{eff}}$, a time-to-solution (TTS) surrogate, and a scaling exponent penalty. In practice, the formulation rewarded designs achieving moderate solve rates ($u\approx 0.35$–$0.40$) across many small $N$ values, even if those designs failed catastrophically at larger $N$.

The consequences were dramatic: objective 17 had a Kendall’s $\tau$ of $-0.472$ with the consensus ranking, anti-correlated with the majority view of quality. Its top 10 designs all sat between consensus ranks 345 and 414 (bottom 5%). Conversely, it placed the consensus winner (design 340) at rank 390 out of 414. The meta-agent detected the anti-correlation and set objective 17 to weight 0.0 in every round after its creation.

Episode 3: Echo chamber in pre-meta-agent workspaces. Without the meta-agent’s diversity-promoting guidance, objective evolution in workspaces 22 and 24 amounted to incremental refinement of the same approach. Workspace 22’s objectives shared a mean pairwise $\tau$ of 0.587; workspace 24’s shared 0.536.

Introducing the meta-agent in workspace 36 visibly increased diversity. Its strategic guidance explicitly steered later objectives toward different facets of solver quality (headroom vs. hazard vs. robustness), yielding within-workspace $\tau=0.359$, the lowest among all multi-objective workspaces. In the merged workspace, the meta-agent diversified Phase 3 objectives further by specifying distinct tie-breaking mechanisms and introducing optional components.

Detection and mitigation. The meta-agent’s weight trajectory (Table [S4](https://arxiv.org/html/2606.26728#A2.T4)) shows weight steadily migrating from early to late objectives. By round 6, all Phase 1 objectives (0–15) were zeroed out. In the final round only 13 of 42 objectives retained nonzero weights, with 8 receiving weights $\geq 1.0$. The heaviest-weighted objectives (39, 40, 37, 35, 34) were all Phase 3 schedule-faithful metrics. Importantly, suppression was not merely age-based: objectives 20–22 and 24 (Phase 2) kept nonzero weights in the final round because of their high correlation with the consensus, while the more recent objective 32 (Phase 3, $\tau=-0.162$) was suppressed.

Table S4: Meta-agent weight trajectory. Aggregate weight per objective phase across meta-agent rounds, showing the progressive shift from Phase 1 extrapolation objectives to Phase 3 schedule-faithful metrics.

#### Limitation of agreement-based weighting

The meta-agent’s progressive suppression of Phase 1 objectives (Table [S4](https://arxiv.org/html/2606.26728#A2.T4)) raises a natural question: was the intervention necessary, or would the built-in age decay have done the same job without external override?

Agreement-based weighting gives each objective a weight proportional to its median pairwise Kendall’s $\tau$ with all others. This works well when misaligned objectives are isolated outliers, because their low correlation with the majority suppresses their influence. But the mechanism has a structural weakness: when a block of correlated objectives forms the majority, each member’s median $\tau$ is inflated by agreement with other block members, regardless of whether what they collectively measure aligns with the true research goal. Agreement is computed within the current portfolio, so the mechanism cannot distinguish a genuinely informative majority from an echo chamber of redundant proxies.
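A toy portfolio makes the echo-chamber effect concrete. The sketch below is an assumption-laden illustration, not the paper's implementation: it uses Kendall's tau-a without tie correction, clips negative medians to zero, normalizes the weights to sum to one, and runs on four made-up objectives that each score five designs.

```python
from itertools import combinations
from statistics import median


def kendall_tau(a, b):
    """Kendall's tau-a between two score lists over the same designs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        prod = (a[i] - a[j]) * (b[i] - b[j])
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def agreement_weights(scores):
    """Weight each objective by its median tau with all others (clipped at 0)."""
    raw = []
    for i, s_i in enumerate(scores):
        taus = [kendall_tau(s_i, s_j) for j, s_j in enumerate(scores) if j != i]
        raw.append(max(median(taus), 0.0))
    total = sum(raw)
    if total == 0:
        return [1.0 / len(raw)] * len(raw)  # fallback: uniform weights
    return [r / total for r in raw]


# Hypothetical portfolio: three near-duplicate objectives and one dissenter.
block = [[1, 2, 3, 4, 5], [1, 2, 3, 5, 4], [2, 1, 3, 4, 5]]
dissenter = [[5, 4, 3, 2, 1]]
weights = agreement_weights(block + dissenter)
print([round(w, 2) for w in weights])
```

The three block members share all of the weight and the dissenter receives none, whichever side actually tracks the research goal. That outcome depends only on the block being the majority, which is the structural weakness described above.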

That is precisely what happened in the merged workspace. At the time of merging, Phase 1a and 1b objectives (IDs 0–15) made up 16 of 32 total objectives. They formed a tight cluster (mean pairwise $\tau=0.561$; Table [S5](https://arxiv.org/html/2606.26728#A2.T5); zero negatively correlated pairs) that collectively rewarded small-$N$ performance, even when those designs scaled poorly past $N=1000$. Because block members agreed with each other, their per-objective median $\tau$ values stayed high (mean 0.49, vs. 0.40 for Phase 2), and the consensus mechanism gave them a combined weight share of 55% despite age decay.

Age decay alone could not fix this. At $\lambda=0.9$, an objective still retains $0.9^{10}\approx 0.35$ of its original contribution after 10 rounds. With 16 objectives in the cluster, collective influence decays slowly. Furthermore, a newly generated objective that happens to correlate with the Phase 1 cluster would receive high agreement-based weight regardless of recency, perpetuating the misalignment.
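The slow pace of geometric age decay is easy to quantify. A small sketch with hypothetical helper names, assuming the retained fraction after $k$ rounds is simply $\lambda^{k}$:

```python
def age_factor(rounds, decay=0.9):
    """Fraction of the original contribution retained after `rounds` rounds."""
    return decay ** rounds


def rounds_until(fraction, decay=0.9):
    """Smallest whole number of rounds after which decay**rounds <= fraction."""
    rounds = 0
    while decay ** rounds > fraction:
        rounds += 1
    return rounds


print(f"retained after 10 rounds: {age_factor(10):.3f}")
print(f"rounds to fall below 10%: {rounds_until(0.1)}")
```

Under this assumption an objective needs 22 rounds to drop below a tenth of its original contribution, far longer than the 6 rounds in which the meta-agent zeroed the Phase 1 cluster.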

The meta\-agent supplies an orthogonal correction\. Instead of measuring statistical agreement among objectives, it assesses whether each objective’s evaluation criterion still fits the current research strategy\. Here, the meta\-agent recognized that Phase 1 objectives were optimizing for small\-$N$ scaling when the search needed large\-$N$ reach and tail efficiency\. Its weight multipliers broke the cluster’s grip: by round 6 all 16 Phase 1 objectives were set to $m_{i}=0.0$, shifting the consensus entirely to Phase 2 and 3 metrics\. The intervention was gradual, so the consensus ranking shifted smoothly, avoiding destabilization of the planner and designer agents that depend on consistent rankings\.

The multipliers compose with, rather than replace, the agreement\-based weights: the final weight is $w_{i}^{\prime}=w_{i}\cdot m_{i}/\sum_{k}w_{k}\cdot m_{k}$, where $w_{i}$ is the consensus weight and $m_{i}$ is the meta\-agent multiplier\. The meta\-agent therefore cannot amplify an objective that the consensus mechanism has already flagged as an outlier\. Multipliers only modulate the relative influence among objectives that already carry nonzero consensus weight, providing a targeted correction for cluster dominance without undermining the consensus mechanism’s built\-in outlier suppression\.
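A minimal sketch of this composition rule follows. The function name and the numerical values are hypothetical; only the formula comes from the text.

```python
def compose_weights(consensus_weights, multipliers):
    """Return w'_i = w_i * m_i / sum_k(w_k * m_k)."""
    products = [w * m for w, m in zip(consensus_weights, multipliers)]
    total = sum(products)
    if total == 0:
        raise ValueError("every objective has zero weight after applying multipliers")
    return [p / total for p in products]


# Hypothetical portfolio: objective 0 is an outlier that the consensus mechanism
# already zeroed out, objectives 1 and 2 belong to a dominant cluster, and
# objective 3 is a newer objective.
consensus_weights = [0.0, 0.4, 0.4, 0.2]
multipliers = [3.0, 0.0, 0.5, 2.0]
final_weights = compose_weights(consensus_weights, multipliers)
```

Objective 0 stays at zero despite its multiplier of 3.0, while the cluster's share drops from 0.8 to one third, which is the behavior the paragraph describes.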

### S2\.3 Correlation structure

The pairwise Kendall’s $\tau$ matrix \(Fig\. [2](https://arxiv.org/html/2606.26728#S2.F2)\(a\)\) captures the full correlation structure of the 42\-objective portfolio\. Here we quantify how that structure evolved as objectives were added\.

Table [S5](https://arxiv.org/html/2606.26728#A2.T5) tells a clear story: the mean pairwise $\tau$ fell from 0\.588 \(Phase 1a alone, 6 objectives\) to 0\.399 \(all 42\), while the fraction of negatively correlated pairs rose from 0% to 11\.3%\. This growing disagreement is not pathological\. Instead, it reflects increasing diversity in how solvers are evaluated\. Early objectives agreed because they were variations on a single extrapolation theme; later objectives introduced fundamentally different evaluation criteria \(reach vs\. headroom vs\. hazard\), which naturally reduced pairwise agreement while broadening the coverage of the portfolio\.

Table S5: Pairwise Kendall’s $\tau$ snapshots\. As objectives accumulate across phases, mean pairwise correlation drops and the fraction of negative pairs grows, reflecting greater diversity\.

The within\-phase vs\. cross\-phase structure reveals an informative asymmetry\. Phase 1a objectives have higher within\-phase $\tau$ \(0\.588\) than cross\-phase $\tau$ \(0\.452\), which is a classic echo\-chamber signature\. Phase 2 shows the reverse: within\-phase $\tau$ \(0\.287\) is *lower* than cross\-phase $\tau$ \(0\.369\), meaning these objectives were more diverse among themselves than in their relationship to other phases\. That inversion traces directly to the meta\-agent’s diversity\-promoting guidance in workspace 36\.

Per\-objective Kendall’s $\tau$ with the consensus ranking \(Table [S6](https://arxiv.org/html/2606.26728#A2.T6)\) confirms that no single objective captures the consensus perfectly\. Even the best\-aligned \(objectives 0, 2, 21, 28, 39\) reach only $\tau\approx 0.84$–$0.85$, so roughly 8% of design pairs are ranked differently\. At the other end, objectives 17, 32, 23, 25, and 27 have $\tau<0.16$, i\.e\., they are near\-random or anti\-correlated\.
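The 8% figure follows from the definition of Kendall's tau. Assuming no tied pairs, tau equals the fraction of concordant pairs minus the fraction of discordant pairs, so tau = 1 - 2d and the discordant fraction is d = (1 - tau) / 2. A small check (the helper name is illustrative):

```python
def discordant_fraction(tau):
    """Fraction of design pairs ordered differently by two rankings, assuming no ties."""
    return (1.0 - tau) / 2.0


at_084 = discordant_fraction(0.84)  # about 0.080
at_085 = discordant_fraction(0.85)  # about 0.075
at_016 = discordant_fraction(0.16)  # about 0.42, close to the 0.5 of a random ranking
```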

Table S6: Per\-phase agreement with the consensus ranking\. Mean and range of Kendall’s $\tau$ between individual objectives and the consensus, by phase\.

### S2\.4 Why the consensus outperforms individual objectives

Three observations from the data above illustrate why the consensus mechanism matters\.

Winner divergence\. Only 7 of 42 objectives \(17%\) rank the same design as the consensus winner \(design 340\) at position \#1\. Just 24 of 42 \(57%\) place their \#1 design in the consensus top\-10\. The mean consensus rank of per\-objective winners is 84\.3 \(out of 414 designs\), and for Phase 1b objectives the median consensus rank of their winners is 95\.

Visibility of the best design\. Design 340 falls outside the top\-10 under 14 of 42 objectives \(33%\), outside the top\-20 under 10 \(24%\), and outside the top\-50 under 8 \(19%\)\. Objective 17 ranks it 390th out of 414, worse than 94% of all designs\. Objective 3 ranks it 239th\. No single objective consistently identifies design 340 as a top candidate\. By weighting objectives on agreement and recency, the consensus aggregation ensures that design 340’s strong performance across the *majority* of objectives lifts it to rank \#1 despite its poor standing under a minority\.

Phase\-dependent selection quality\. Phase 3 objectives’ per\-objective winners have a mean consensus score of 0\.064 \(near the best achievable score of 0\.028\), while Phase 1b winners have a mean consensus score of 0\.344, over $5\times$ worse\. Later objectives, drawing on more experimental data and meta\-agent guidance, exert substantially better selection pressure\. The consensus mechanism picks this up automatically through the age decay factor \($\lambda=0.9$\), which down\-weights older objectives even before the meta\-agent explicitly suppresses them\.

### S2\.5 Complete objective list

Table [S7](https://arxiv.org/html/2606.26728#A2.T7) lists all 42 objectives with their origin and final meta\-agent weight multipliers\.

Table S7: Complete list of all 42 objectives\. Each objective is listed with its ID, source workspace, phase, description \(abbreviated\), and final meta\-agent weight multiplier\. Objectives with weight 0\.0 were suppressed by the meta\-agent; a dash \(–\) indicates the objective was created after the final meta\-agent round\.

| ID | Source | Phase | Description | Wt\. |
|---|---|---|---|---|
| | | | *Phase 1a: Power\-law extrapolation* | |
| 0 | merged | 1a | $R^{2}$\-weighted power\-law / exponential blend, extrapolate to $N{=}2000$ | 0\.0 |
| 1 | ws\-22 | 1a | Predicted $\log_{10}$ steps at $N{=}2000$ incl\. scaling \+ reliability | 0\.0 |
| 2 | ws\-22 | 1a | $\log_{10}$ expected steps with reliability\-adjusted cost \+ coverage penalty | 0\.0 |
| 3 | ws\-22 | 1a | AIC\-mixed scaling to $N{=}2000$ with smooth fail/coverage | 0\.0 |
| 4 | ws\-22 | 1a | Smooth reliability and budget\-cap penalties | 0\.0 |
| 5 | ws\-22 | 1a | Scaling \+ reliability\-margin penalties | 0\.0 |
| | | | *Phase 1b: Censored\-aware scaling* | |
| 6 | ws\-24 | 1b | Success\-rate \+ scaling extrapolation to $N{=}2000$ | 0\.0 |
| 7 | ws\-24 | 1b | Censor\-aware AIC model avg\. with failure lower\-bounds | 0\.0 |
| 8 | ws\-24 | 1b | Success \+ scaling \+ coverage extrapolation | 0\.0 |
| 9 | ws\-24 | 1b | AIC\-avg $\log_{10}$ expected solve\-steps incl\. success \+ coverage | 0\.0 |
| 10 | ws\-24 | 1b | AIC\-avg robust scaling \+ reliability | 0\.0 |
| 11 | ws\-24 | 1b | AIC\-avg $\log_{10}$ restart\-steps \+ reliability margin | 0\.0 |
| 12 | ws\-24 | 1b | AIC\-avg restart\-steps \+ slope and reliability | 0\.0 |
| 13 | ws\-24 | 1b | $p$\-adjusted steps with smooth coverage and slope penalties | 0\.0 |
| 14 | ws\-24 | 1b | Restart\-steps with reliability margin and censoring | 0\.0 |
| 15 | ws\-24 | 1b | AIC\-weighted censored scaling fit of restart\-steps | 0\.0 |
| | | | *Phase 2: Reach\-dominant* | |
| 16 | ws\-32 | 2 | Extrapolated log\-steps \+ near\-threshold failure penalty \+ reach reward | 0\.0 |
| 17 | ws\-33 | 2 | Composite $N_{\mathrm{eff}}$ at $u\leq 0.40$ \+ TTS surrogate \+ scaling exponent | 0\.0 |
| 18 | ws\-34 | 2 | Robust high\-$N$ solves \($u\leq 0.3/0.2$\) \+ brittleness penalty | 0\.0 |
| 19 | ws\-34 | 2 | Soft\-pass area \+ top\-level reward \+ anti\-truncation | 0\.0 |
| 20 | ws\-36 | 2 | Reach\-dominant: hard fail before 640/1280 \+ tail slope \+ slack | 0\.2 |
| 21 | ws\-36 | 2 | Reach \+ budget\-margin/slack \+ conservative next\-$N$ clearance | 0\.3 |
| 22 | ws\-36 | 2 | Successful\-prefix clearance $N^{*}$ \+ near\-budget tail penalty | 0\.4 |
| 23 | ws\-36 | 2 | Reach \+ conservative worst\-of\-multi\-fit clearance at 1280/2560 | 0\.0 |
| 24 | ws\-36 | 2 | Reach \+ worst\-window predicted earliest schedule failure | 0\.3 |
| 25 | ws\-36 | 2 | Robust scheduled $N^{*}$ with ratio\-space clearance $\leq 0.60$ | 0\.0 |
| 26 | ws\-36 | 2 | Gated tail\-only headroom\-to\-1280 \+ slope/curvature penalties | 0\.0 |
| 27 | ws\-36 | 2 | Tail\-local polynomial scaling \+ budget\-clearance robustness | 0\.0 |
| 28 | ws\-36 | 2 | Two\-jump headroom at 640$\to$905$\to$1280 \+ exponent penalty | 0\.0 |
| 29 | ws\-37 | 2 | Reach \+ low scaling exponent \+ censored extrapolation to $N{=}2000$ | 0\.0 |
| 30 | ws\-37 | 2 | Reach \+ scaling \+ headroom \+ frontier robustness | 0\.0 |
| 31 | ws\-37 | 2 | Schedule\-faithful lexicographic: reach, then steps, then slope | 0\.1 |
| | | | *Phase 3: Schedule\-faithful* | |
| 32 | merged | 3 | Tail headroom: conservative worst\-window next\-$N$ overshoot | 0\.0 |
| 33 | merged | 3 | Lexicographic reach \+ strong tail headroom tie\-break | 1\.2 |
| 34 | merged | 3 | Reach \+ tail headroom \+ worst\-window 1810/2560 overshoot | 2\.0 |
| 35 | merged | 3 | Reach \+ schedule\-budget headroom \+ budget\-cliff hazard | 2\.2 |
| 36 | merged | 3 | Reach \+ smooth\-max headroom \+ hazard \+ repeat\-stability | 1\.2 |
| 37 | merged | 3 | Reach \+ smooth\-max headroom \+ focus 1280$\to$1810$\to$2560 \+ hazard | 2\.4 |
| 38 | merged | 3 | Reach \+ smooth\-max headroom \+ tail exponent \+ post\-1280 cliff | 1\.6 |
| 39 | merged | 3 | Reach \+ smooth\-max $\log$\(headroom\) \+ multi\-jump hazard \+ censor | 3\.0 |
| 40 | merged | 3 | Reach \+ smooth\-max $\log$\(headroom\) \+ cliff hazard \+ robustness | 3\.2 |
| 41 | merged | 3 | Three\-variant objective \(balanced / hazard / robustness\) | – |

## Appendix S3 Methodology Evolution

The framework described in the main text went through several rounds of redesign across independent workspaces\. This section records the key methodological changes and the empirical reasoning behind each\.

### S3\.1 Meta\-agent introduction

Before workspace 36, the objective agent ran on its own: it proposed new objective functions and set their weights with no external oversight\. The result was the *echo\-chamber effect* documented in Sec\. [S2\.2](https://arxiv.org/html/2606.26728#A2.SS2) \(Episode 3\): the agent repeatedly reinforced a narrow set of evaluation criteria\. Objectives produced this way exhibited high mutual correlation \(mean pairwise Kendall $\tau=0.587$–$0.949$ within echo\-chamber clusters; see Table [S5](https://arxiv.org/html/2606.26728#A2.T5)\) and consistently ranked the same small subset of designs at the top, narrowing the search\.

Workspace 36 introduced a *meta\-agent* as an oversight layer between the objective agent and the consensus mechanism\. The meta\-agent reviews newly proposed objectives, adjusts their consensus weights, and can suppress objectives that are redundant or misaligned with the broader research goals\. In practice, it progressively reallocated weight from Phase 1 and Phase 2 objectives to the more informative Phase 3 \(schedule\-faithful\) objectives as those were created \(Table [S4](https://arxiv.org/html/2606.26728#A2.T4)\)\. This broke the echo\-chamber pattern and restored the objective diversity that the consensus mechanism needs to work well \(Sec\. [S2\.4](https://arxiv.org/html/2606.26728#A2.SS4)\)\.

### S3\.2 Evaluation schedule

An early design question was whether the evaluation schedule \(problem sizes $N$, clause\-to\-variable ratios, and computational budgets used to benchmark each solver\) should co\-evolve with the solver designs\. Initial experiments let the objective agent propose new schedules alongside new evaluation criteria\.

Three problems emerged:

1. Inconsistent evaluation\. When the schedule shifts between iterations, scores from different iterations are not directly comparable\. A design that appears to improve may simply have been tested on an easier schedule, making consensus rankings unreliable\.
2. Schedule echo chamber\. The LLM agent, aware of the current top designs and existing schedules, tended to propose schedules favoring those same designs, a self\-reinforcing bias analogous to the objective echo chamber \(Sec\. [S2\.2](https://arxiv.org/html/2606.26728#A2.SS2)\)\.
3. No quality criterion\. Unlike solver designs, which can be ranked by objective performance, there is no obvious ground truth for schedule quality\. The system had no reliable signal for judging whether a new schedule was more informative than its predecessor\.

We resolved these issues by fixing the evaluation schedule and pairing it with rule\-based multi\-fidelity promotion\. All designs face the same base schedule, so rankings are consistent and comparable\. Computational resources are allocated by deterministic rules: the top 50% of designs by consensus advance to medium fidelity \(larger $N$, more instances\), and the top 10% to high fidelity\. This preserves the efficiency benefits of adaptive evaluation without the instabilities of co\-evolving schedules\.
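The promotion rule can be sketched as follows. This is an illustration under stated assumptions rather than the released code: consensus scores are minimized (lower is better), cut-offs are rounded up, and the tier label "low" for designs that stay on the base schedule is a name chosen here.

```python
import math


def promote(consensus_scores, medium_fraction=0.5, high_fraction=0.1):
    """Assign a fidelity tier to each design from its consensus score (lower is better)."""
    ranked = sorted(consensus_scores, key=consensus_scores.get)
    n_medium = math.ceil(medium_fraction * len(ranked))  # assumption: round up
    n_high = math.ceil(high_fraction * len(ranked))
    tiers = {}
    for position, design_id in enumerate(ranked):
        if position < n_high:
            tiers[design_id] = "high"
        elif position < n_medium:
            tiers[design_id] = "medium"
        else:
            tiers[design_id] = "low"
    return tiers


# Hypothetical consensus scores for ten designs, keyed by design ID.
scores = dict(enumerate([0.30, 0.05, 0.90, 0.12, 0.44, 0.61, 0.08, 0.75, 0.52, 0.20]))
tiers = promote(scores)
```

Because the rule depends only on the consensus ranking, two runs with the same scores always promote the same designs, which is what makes results comparable across iterations.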

Whether flexible schedule generation might help in domains where the evaluation landscape is less well\-characterized remains an open question\. Possible criteria for schedule quality include: \(i\) discriminative power \(does the schedule separate good designs from bad?\), \(ii\) predictive validity \(do rankings on the schedule predict rankings under held\-out test conditions?\), and \(iii\) cost\-efficiency \(does the schedule achieve comparable discrimination at lower cost?\)\. We leave these for future work\.

### S3\.3 Workspace merging

The search initially ran across multiple independent workspaces, each operating the full framework autonomously with its own design pool, objective history, and MCGS graph\. Running in parallel increases diversity: independent workspaces explore different regions of the design space without interfering, reducing the risk that an early success monopolizes the search\.

##### Selection and import\.

From each source workspace, the top 5 designs by consensus ranking were selected together with their *complete* parent\-child genealogies\. Preserving the genealogy matters: it gives the planner in the merged workspace the full evolutionary context behind each imported design, not just the final product\. Roughly 200 designs were imported in total\.
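One way to read "complete genealogy" is the set of all ancestors reachable through parent links. The sketch below follows that reading, which is an assumption; the parent map and the function name are hypothetical.

```python
def collect_genealogy(parents, selected):
    """Return the selected designs plus every ancestor reachable via parent links."""
    keep = set()
    stack = list(selected)
    while stack:
        design_id = stack.pop()
        if design_id in keep:
            continue  # already visited; a design can be reached along several paths
        keep.add(design_id)
        stack.extend(parents.get(design_id, []))
    return keep


# Hypothetical parent map: design -> list of parent designs (several parents allowed).
parents = {1: [0], 2: [0], 3: [1, 2], 4: [3], 5: [2], 6: []}
top_designs = [4]  # stands in for the top designs by consensus ranking
imported = collect_genealogy(parents, top_designs)
```

Here design 4 brings designs 3, 1, 2, and 0 with it, while the unrelated designs 5 and 6 are left behind.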

##### Merged workspace dynamics\.

Once merged, the planner can see designs from all source workspaces at once\. The MCGS graph is rebuilt from the combined pool, and UCB scoring treats every design uniformly regardless of origin\. This opens the door to *cross\-workspace recombination*: the planner can pick parents from different source workspaces, combining innovations that evolved independently\. The sub\-tree structures in Fig\. [3](https://arxiv.org/html/2606.26728#S2.F3)\(a\) of the main text are a direct imprint of this merging: each sub\-tree traces back to a particular source workspace, with inter\-tree edges marking cross\-workspace references\.

About 200 additional designs were generated in the merged workspace\. Design 340, the best solver found, emerged from this phase, incorporating architectural elements that trace back to multiple source workspaces \(see Sec\. [S1\.2](https://arxiv.org/html/2606.26728#A1.SS2) for the full lineage\)\. Cross\-workspace recombination appears to have been productive: innovations that evolved in isolation were combined in ways unlikely to have been found within any single workspace\.

##### Practical considerations\.

Merging introduces a cold\-start problem: imported designs have not been evaluated under the merged workspace’s objective history, and MCGS visit counts reset to zero\. The UCB parameterization partially addresses this by maintaining exploration pressure even with a large initial pool, and the consensus mechanism helps by aggregating objectives from all phases, including those created in the source workspaces\.

## Appendix S4 Base Solver Framework

This section describes the baseline DMM solver provided to the LLM agents and the modular code framework that supports automated solver generation\.

### S4\.1 Research goal and problem context

The high\-level research goal given to the meta\-agent is:

> *Develop an algorithm that efficiently solves hard, large\-scale 3\-SAT problems\. Prioritize robust scaling, with focus on large\-$N$ behaviors\. Aim for polynomial step vs\. $N$ scaling, ideally sub\-quadratic\. Common pitfall: polynomial at small $N$, exponentially diverging at large $N$\.*

##### 3\-SAT and the DMM relaxation\.

A 3\-SAT instance has $N$ Boolean variables $V_{i}\in\{0,1\}$ and $M$ clauses $C_{m}=L_{i}\vee L_{j}\vee L_{k}$, where each literal $L_{i}$ is either $V_{i}$ or its negation\. A digital MemComputing machine \(DMM\) \[[9](https://arxiv.org/html/2606.26728#bib.bib64), [3](https://arxiv.org/html/2606.26728#bib.bib4)\] relaxes each $V_{i}$ to a continuous variable $v_{i}\in[-1,1]$ and introduces auxiliary *memory* variables that encode information about the dynamics’ history\. The sign of $v_{i}$ at the fixed point gives the Boolean assignment\.
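The read-out step can be sketched in a few lines. The clause encoding (a list of `(variable index, polarity)` pairs) and the convention that a value of exactly zero maps to False are assumptions made for this illustration.

```python
def to_assignment(v):
    """Map continuous values in [-1, 1] to Booleans by sign (v > 0 means True)."""
    return [value > 0 for value in v]


def is_satisfied(clauses, assignment):
    """Check a 3-SAT formula; each clause is a list of (variable index, polarity) literals."""
    return all(
        any(assignment[i] == (q > 0) for i, q in clause)
        for clause in clauses
    )


# (V0 or not V1 or V2) and (not V0 or V1 or V2)
clauses = [[(0, +1), (1, -1), (2, +1)], [(0, -1), (1, +1), (2, +1)]]
assignment = to_assignment([0.7, 0.9, -0.3])
solved = is_satisfied(clauses, assignment)
```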

##### Baseline DMM equations\.

The baseline dynamics \(Eqs\. \([4](https://arxiv.org/html/2606.26728#S4.E4)\)\-\([6](https://arxiv.org/html/2606.26728#S4.E6)\) of the main text\) evolve three sets of variables\. The state variables $v_{n}$ follow

$$
\dot{v}_{n}=\sum_{m=1}^{M}x_{l,m}\,x_{s,m}\,G_{n,m}+(1+\zeta\,x_{l,m})(1-x_{s,m})\,R_{n,m},\tag{S11}
$$

where $G_{n,m}$ is a gradient term nudging variables toward clause satisfaction and $R_{n,m}$ is a rigidity term preventing satisfied literals from flipping:

$$
G_{n,m}=\tfrac{1}{2}\,q_{n,m}\min\bigl[(1-q_{j,m}v_{j}),\,(1-q_{k,m}v_{k})\bigr],\tag{S12}
$$

$$
R_{n,m}=\begin{cases}\tfrac{1}{2}\,q_{n,m}(1-q_{n,m}v_{n}),&\text{if }c_{m}=\tfrac{1}{2}(1-q_{n,m}v_{n}),\\ 0,&\text{otherwise},\end{cases}\tag{S13}
$$

with $q_{n,m}\in\{+1,-1\}$ the literal polarity and the clause cost

$$
c_{m}=\tfrac{1}{2}\min\bigl[(1-q_{i,m}v_{i}),\,(1-q_{j,m}v_{j}),\,(1-q_{k,m}v_{k})\bigr].\tag{S14}
$$

Two auxiliary memory variables per clause track satisfaction history:

$$
\dot{x}_{s,m}=\beta\,(x_{s,m}+\epsilon)\,(c_{m}-\gamma),\tag{S15}
$$

$$
\dot{x}_{l,m}=\alpha\,(c_{m}-\delta).\tag{S16}
$$

The short\-term switch $x_{s,m}\in[0,1]$ toggles between “push” \($x_{s}\approx 1$\) and “hold” \($x_{s}\approx 0$\) modes; the long\-term weight $x_{l,m}\in[1,10^{6}]$ grows monotonically for persistently violated clauses\. Default hyperparameters are $\alpha=5$, $\beta=20$, $\gamma=0.25$, $\delta=0.05$, $\epsilon=10^{-3}$, and $\zeta=10^{-3}$\.
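To make the index bookkeeping in Eqs. (S11)-(S16) concrete, here is a plain-Python transcription of the right-hand sides for a single instance. It is a sketch for reading the equations, not the released PyTorch implementation: the function name, the clause encoding as `(variable index, polarity)` pairs, and the dictionary keys of `hp` are assumptions.

```python
def dmm_derivatives(v, xs, xl, clauses, hp):
    """Right-hand sides of Eqs. (S11)-(S16) for one instance.

    clauses[m] is a list of three (variable index, polarity q) literals.
    """
    dv = [0.0] * len(v)
    dxs, dxl = [], []
    for m, clause in enumerate(clauses):
        # Per-literal terms (1 - q v); a term of 0 means the literal is fully satisfied.
        terms = [1.0 - q * v[i] for i, q in clause]
        c = 0.5 * min(terms)  # clause cost, Eq. (S14)
        for pos, (n, q) in enumerate(clause):
            others = [t for k, t in enumerate(terms) if k != pos]
            G = 0.5 * q * min(others)  # gradient term, Eq. (S12)
            # Rigidity term, Eq. (S13): nonzero only for a literal that sets the cost.
            R = 0.5 * q * terms[pos] if 0.5 * terms[pos] == c else 0.0
            dv[n] += (xl[m] * xs[m] * G
                      + (1.0 + hp["zeta"] * xl[m]) * (1.0 - xs[m]) * R)  # Eq. (S11)
        dxs.append(hp["beta"] * (xs[m] + hp["epsilon"]) * (c - hp["gamma"]))  # Eq. (S15)
        dxl.append(hp["alpha"] * (c - hp["delta"]))                           # Eq. (S16)
    return dv, dxs, dxl


# Default hyperparameters quoted in the text.
hp = {"alpha": 5.0, "beta": 20.0, "gamma": 0.25, "delta": 0.05,
      "epsilon": 1e-3, "zeta": 1e-3}
clauses = [[(0, +1), (1, -1), (2, +1)]]     # V0 or (not V1) or V2
v, xs, xl = [-0.5, 0.5, 0.0], [1.0], [1.0]  # illustrative state; the clause is unsatisfied
dv, dxs, dxl = dmm_derivatives(v, xs, xl, clauses, hp)
```

With the switch in "push" mode ($x_s=1$) only the gradient term acts: $v_0$ and $v_2$ are driven up and $v_1$ down, each toward satisfying its literal, while both memory variables grow because the clause cost of 0.5 exceeds $\gamma$ and $\delta$.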

### S4\.2 Modularized solver framework

The solver codebase \(released as supplementary code\) separates problem\-specific dynamics from generic solving infrastructure\. A single Python file fully defines each solver experiment through three components:

1. VARIABLES\_SPEC: a dictionary mapping variable names to their initialization, shape, and bounds\. For the baseline:
   - v: shape $(B,N)$, bounds $[-1,1]$, randomly initialized;
   - xl: shape $(B,M)$, bounds $[1,10^{6}]$, initialized to 1;
   - xs: shape $(B,M)$, bounds $[0,1]$, initialized to 0;

   where $B$ is the batch size\. The LLM can add, remove, or modify variables to introduce new memory mechanisms\.
2. HYPER\_SPACE: a dictionary defining the hyperparameter search space\. Each entry specifies a type \(uniform, log\-uniform, integer, or categorical\), default value, and bounds\. The baseline defines seven parameters \($\alpha,\beta,\gamma,\delta,\epsilon,\zeta$, and the integration step size\)\. A Bayesian optimizer \(HEBO \[[6](https://arxiv.org/html/2606.26728#bib.bib63)\]\) tunes these parameters for each solver design\.
3. \_grad\_single\(vars, idx, sgn, hp\): the core dynamics function computing per\-instance gradients for all state variables\. Inputs are the current variable values \(vars\), clause structure \(idx, sgn\), and hyperparameters \(hp\)\. It returns a gradient dictionary with the same keys as vars\. This function encodes the dynamical equations \(Eqs\. \([S11](https://arxiv.org/html/2606.26728#A4.E11)\)\-\([S16](https://arxiv.org/html/2606.26728#A4.E16)\)\) and is the primary target of LLM\-driven design\.
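The three components above might be laid out as in the following skeleton. The exact schema of the released code is not given in the text, so the key names (`init`, `shape`, `bounds`, `type`, `default`), the distribution types, every bound, and the `dt` default are placeholders chosen for illustration; only the variable names, variable bounds, and the six default values quoted in Sec. S4.1 come from the source. The dynamics function is a stub that returns zeros.

```python
# Hypothetical layout of a solver file; the schema is an assumption.
VARIABLES_SPEC = {
    "v":  {"init": "uniform", "shape": ("B", "N"), "bounds": (-1.0, 1.0)},
    "xl": {"init": 1.0,       "shape": ("B", "M"), "bounds": (1.0, 1e6)},
    "xs": {"init": 0.0,       "shape": ("B", "M"), "bounds": (0.0, 1.0)},
}

HYPER_SPACE = {
    # Defaults for alpha..zeta follow the text; types, bounds, and dt are placeholders.
    "alpha":   {"type": "log-uniform", "default": 5.0,  "bounds": (0.5, 50.0)},
    "beta":    {"type": "log-uniform", "default": 20.0, "bounds": (2.0, 200.0)},
    "gamma":   {"type": "uniform",     "default": 0.25, "bounds": (0.05, 0.5)},
    "delta":   {"type": "uniform",     "default": 0.05, "bounds": (0.0, 0.25)},
    "epsilon": {"type": "log-uniform", "default": 1e-3, "bounds": (1e-5, 1e-1)},
    "zeta":    {"type": "log-uniform", "default": 1e-3, "bounds": (1e-5, 1e-1)},
    "dt":      {"type": "log-uniform", "default": 0.1,  "bounds": (1e-3, 1.0)},
}


def _grad_single(vars, idx, sgn, hp):
    """Stub dynamics: returns a zero gradient with the same keys as `vars`."""
    return {name: [0.0] * len(values) for name, values in vars.items()}


def defaults_within_bounds(hyper_space):
    """Sanity check: every default value lies inside its declared bounds."""
    return all(spec["bounds"][0] <= spec["default"] <= spec["bounds"][1]
               for spec in hyper_space.values())


state = {"v": [0.1, -0.2], "xl": [1.0], "xs": [0.0]}  # one instance with N=2, M=1
grads = _grad_single(state, idx=[[0, 1, 0]], sgn=[[1, -1, 1]], hp={})
```

A check such as `defaults_within_bounds` is the kind of lightweight validation a framework could run before handing a generated file to the optimizer; whether the released code does so is not stated.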

##### Framework integration\.

The solver framework \(sat\_solver\.py\) dynamically imports these three components at runtime\. Launching an experiment with solver ID $k$ loads solvers/solver\_$k$\.py\. The SATSolver class reads VARIABLES\_SPEC to allocate and initialize state tensors, then vectorizes \_grad\_single over the batch dimension via torch\.vmap \[[24](https://arxiv.org/html/2606.26728#bib.bib69)\]\. The solving loop is standard: zero gradients, compute dynamics through the vmapped function, take an optimizer step, clamp variables to their specified bounds, and check for satisfying assignments\.
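The order of operations in that loop can be sketched without PyTorch. In the illustration below an explicit Euler update stands in for the optimizer step, and batching via `torch.vmap` is not reproduced; `run_loop`, its arguments, and the toy dynamics are all assumptions made here.

```python
def clamp(value, low, high):
    """Restrict a value to the closed interval [low, high]."""
    return max(low, min(high, value))


def run_loop(state, bounds, grad_fn, is_solved, dt=0.1, max_steps=1000):
    """Compute dynamics, step, clamp, then check; returns the step count or None."""
    for step in range(1, max_steps + 1):
        grads = grad_fn(state)                       # compute dynamics
        for name, values in state.items():
            low, high = bounds[name]
            state[name] = [clamp(x + dt * g, low, high)   # Euler step, then clamp
                           for x, g in zip(values, grads[name])]
        if is_solved(state):                         # check for a satisfying assignment
            return step
    return None


# Toy problem: a single variable that must become positive.
state = {"v": [-0.2]}
steps = run_loop(
    state,
    bounds={"v": (-1.0, 1.0)},
    grad_fn=lambda s: {"v": [1.0 - x for x in s["v"]]},
    is_solved=lambda s: all(x > 0 for x in s["v"]),
)
```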

##### LLM\-generated solvers\.

The designer agent produces new solver files containing modified versions of these three components\. Because the interface is fixed \(any file exporting VARIABLES\_SPEC, HYPER\_SPACE, and \_grad\_single with the correct signatures is automatically integrated\), the LLM can freely redesign the dynamics, introduce new auxiliary variables, or restructure the hyperparameter space without touching framework code\. This modularity is what makes the automated search described in the main text possible: each MCGS iteration generates a new solver file, evaluates it through the framework, and records the results\.

## Appendix S5 LLM Agent Prompts

This section provides the complete prompt templates used by each LLM agent in the system, organized by agent role\.

### S5\.1 Planner Agent

The Planner Agent analyzes research progress and assigns strategic directions to Designer Agents\. It operates in two stages: first selecting promising experiments to review from MCGS rankings, then synthesizing insights into distinct, non\-overlapping research directions\.

##### System prompt\.

Sets the agent’s role and scope\.

You are the Planner Agent in an LLM\-driven autonomous

research system\. Guide the research direction and provide

strategic recommendations for Designer Agents\.

##### Stage 1: Direction selection\.

The planner receives an overview of all experiments ranked by UCB score from MCGS and selects which experiments to examine in detail\.

\#\# Research Context

\{main\_research\_context\}

\#\# Task

Select \{DESIGNER\_AGENT\_COUNT\} promising research

directions and identify experiments to review in detail\.

\#\# Database Overview

\- Total experiments: \{total\_experiments\}

\- Best performance: \{best\_performance\}

\- Recent experiments: \{recent\_summary\}

\#\# Experiment Summaries

Up to \{MAX\_EXPERIMENT\_SUMMARY\} experiments ranked by

upper confidence bound \(UCB\) from Monte Carlo graph

search \(MCGS\):

\{experiment\_summaries\}

\#\# Guidelines

\- Request details for up to \{MAX\_EXPERIMENT\_DETAIL\}

experiments

\- Prioritize high\-UCB experiments and complementary ideas

\- Avoid redundant or near\-duplicate directions

\- Hyperparameter tuning is automatic via HEBO optimizer \-

don’t assign this to Designer Agents

\- Your goal: MINIMIZE the objective function

\#\# Output Format

\`\`\`json

\{

"lookup\_experiment\_ids": \[/\* int IDs to review \*/\],

"context\_rationale": "Concise reasoning behind chosen

directions and their relevance"

\}

\`\`\`

##### Stage 2: Designer assignment\.

After reviewing the requested experiment details, the planner synthesizes findings into non\-overlapping research directions with reference experiments for each Designer Agent\.

\#\# Experiment Details

\{design\_details\}

\#\# Task

Summarize progress and assign \{DESIGNER\_AGENT\_COUNT\}

designer agents distinct research directions with up to

\{MAX\_DESIGNER\_REFERENCE\} reference experiments each\.

\#\# Output Format

\`\`\`json

\{

"current\_phase": "early\_exploration\|systematic\_search

\|exploitation\|stagnation\|breakthrough\_needed",

"key\_insights": \["Top takeaways explaining what works"\],

"success\_patterns": \[

"Shared traits of strong experiments"\],

"failure\_patterns": \[

"Shared traits of weak experiments"\],

"research\_directions": \[

"Direction for d1", "Direction for d2"\],

"strategy\_rationales": \[

"Rationale for d1", "Rationale for d2"\],

"focus\_areas": \["themes for d1", "themes for d2"\],

"avoid\_areas": \["pitfalls for d1", "pitfalls for d2"\],

"reference\_design\_ids": \[

\[int IDs for d1\], \[int IDs for d2\]\]

\}

\`\`\`

\#\# Guidelines

\- All arrays must have length \{DESIGNER\_AGENT\_COUNT\}

and align by index

\- Directions must be non\-overlapping, complementary,

and distinct
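A response in this format can be checked mechanically before it is handed to the Designer Agents. The validator below is a hypothetical sketch: the text does not say whether the framework performs such a check, and it assumes that the length rule applies to the five per-designer arrays rather than to the summary fields.

```python
import json

# Assumption: these are the arrays that must align by designer index.
PER_DESIGNER_FIELDS = ("research_directions", "strategy_rationales",
                       "focus_areas", "avoid_areas", "reference_design_ids")


def validate_assignment(raw_json, designer_count, max_references):
    """Return a list of problems found in a Stage 2 planner response."""
    data = json.loads(raw_json)
    problems = []
    for field in PER_DESIGNER_FIELDS:
        if len(data.get(field, [])) != designer_count:
            problems.append(f"{field} must have length {designer_count}")
    for refs in data.get("reference_design_ids", []):
        if len(refs) > max_references:
            problems.append("too many reference experiments for one designer")
    return problems


response = json.dumps({
    "research_directions": ["Direction for d1", "Direction for d2"],
    "strategy_rationales": ["Rationale for d1", "Rationale for d2"],
    "focus_areas": ["themes for d1", "themes for d2"],
    "avoid_areas": ["pitfalls for d1"],  # one entry short
    "reference_design_ids": [[3, 7], [12]],
})
problems = validate_assignment(response, designer_count=2, max_references=3)
```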

### S5\.2 Designer Agent

The Designer Agent creates new experiment designs based on the Planner’s strategic directions and reference experiments\. It operates through a multi\-turn conversation: first proposing a design, then receiving experiment results, and finally analyzing outcomes\. If execution errors occur, an error\-recovery prompt is used\.

##### System prompt\.

Sets the agent’s role and scope\.

You are the Designer Agent in an LLM\-driven autonomous

research system\. You receive strategic recommendations

from the Planner Agent and create targeted experiments\.

##### Design creation\.

The designer receives the research context, planner guidance, baseline components, and reference experiments, then proposes a new design with one principled modification\.

\#\# Research Context

\{main\_research\_context\}

\#\# Planner Context

\{planner\_context\}

\#\# Architecture

\- \`domain\_knowledge/\{framework\_module\}\.py\` \- Main

framework

\- \`domain\_knowledge/\{baseline\_filename\}\` \- Baseline

components

\- \`solvers/solver\_N\.py\` \- Your experiment components

\- Objective: Consensus of all generated objectives

\(to MINIMIZE\)

\- \`schedules/schedule\_\{current\_schedule\_id\}\.py\` \- Current

experiment schedule

\{framework\_module\}\.py dynamically imports your

\{num\_components\} components from solver\_N\.py\.

\#\# Baseline Components

\`\`\`python

\{base\_solver\_code\}

\`\`\`

\#\# Reference Experiments

\{reference\_experiments\}

\#\# Current Objective

\{objective\_description\}

\#\# Current Schedule

\{schedule\_description\}

\#\# Task

Design a new experiment by modifying the \{num\_components\}

core components:

1\. Make ONE small, principled modification to baseline

2\. Build on proven ideas from reference experiments

3\. Follow the Planner’s direction and rationale

4\. Explain how changes should improve the objective

\#\# Required Components

\{component\_descriptions\}

Available imports: \`math\`, \`numpy\`, \`scipy\`, \`torch\`,

and standard libraries

\#\# Output Format

Return a JSON object with:

\`\`\`json

\{

"explanation": "Rationale for modification, referencing

evidence and strategy",

"solver\_code": "Complete Python code with imports and

components: \{component\_names\}"

\}

\`\`\`

##### Experiment results\.

After the design is executed, the experiment results are appended to the conversation as context\. No LLM response is requested at this stage\.

\#\# Experiment Results

You have completed experiments at \*\*\{current\_fidelity\}\*\*

fidelity\.

\#\#\# Results

\`\`\`json

\{experiment\_results\}

\`\`\`

\*\*Objective value\*\* \(lower is better\): \{objective\_value\}

##### Result analysis\.

The designer analyzes outcomes and extracts actionable insights\. Reference weights are computed for MCGS graph updates, reflecting how much each parent design influenced the current result\.

\#\# Task

Analyze results and extract actionable insights\. Focus on

why the outcome occurred and what to do next\.

Evaluate how much each reference design influenced this

result for Monte Carlo Graph Search \(MCGS\) updates\.

\#\# Success Levels

\- \*\*excellent\*\*: Major breakthrough or validated

improvement

\- \*\*good\*\*: Noticeable improvement with well\-understood

cause

\- \*\*moderate\*\*: Partial progress or useful insight

despite limited gains

\- \*\*poor\*\*: No improvement or regression, but still

informative

\#\# Reference Weights

\- Include only referenced design IDs

\- Each weight \(0\-1\) represents influence on current

design

\- Weights must sum to 1\.0

\- Higher weights for ideas/parameters that most strongly

shaped results

\#\# Output Format

Return a JSON object with:

\`\`\`json

\{

"short\_name": "Concise descriptive title \(<=40 chars\)",

"key\_insight": "Most important takeaway \(1 line\)",

"success\_level": "poor\|moderate\|good\|excellent",

"detailed\_analysis": "Comprehensive explanation of

mechanisms and outcomes",

"comparison\_to\_references": "How results compare with

referenced designs",

"recommended\_next\_steps": "Concrete suggestions for

future designs",

"reference\_weights": \[

\{"design\_id": int, "weight": float\}

\]

\}

\`\`\`
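LLM output does not always respect the "weights must sum to 1.0" rule exactly, so a defensive post-processing step is a natural companion to this prompt. The sketch below is an assumption about how such a step could look, not a description of the released code: it drops IDs that were never referenced, clips negative weights, rescales the rest, and falls back to uniform weights when nothing usable remains.

```python
def normalize_reference_weights(reference_weights, referenced_ids):
    """Drop unknown IDs, clip negatives, and rescale so the weights sum to 1."""
    kept = {}
    for entry in reference_weights:
        if entry["design_id"] in referenced_ids:
            kept[entry["design_id"]] = max(0.0, float(entry["weight"]))
    total = sum(kept.values())
    if total == 0:
        # Fallback assumption: spread influence evenly over the referenced designs.
        return {design_id: 1.0 / len(referenced_ids) for design_id in referenced_ids}
    return {design_id: weight / total for design_id, weight in kept.items()}


raw = [{"design_id": 12, "weight": 0.6}, {"design_id": 40, "weight": 0.6},
       {"design_id": 99, "weight": 0.3}]  # design 99 was never referenced
weights = normalize_reference_weights(raw, referenced_ids={12, 40})
```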

##### Error recovery\.

When a design produces runtime errors, this prompt is appended to the existing conversation so the designer can see the full context of the failed attempt\.

\#\# Error

\`\`\`

\{error\_message\}

\{full\_traceback\}

\`\`\`

\#\# Task

Fix the error in your implementation components and

return all \{num\_components\} corrected components:

\{component\_names\}\.

\#\# Output Format

Return a JSON object with:

\`\`\`json

\{

"error\_summary": "What went wrong and how you fixed it",

"solver\_code": "Complete Python code with imports and

corrected components"

\}

\`\`\`

### S5\.3 Objective Agent

The Objective Agent generates proxy objective functions that guide the search\. The evaluation schedule remains fixed at the baseline; only the objective function is generated\. The Meta\-Agent provides strategic guidance on what properties the next objective should emphasize\.

##### System prompt\.

Sets the agent’s role and scope\.

\#\# Role

You are the \*\*Objective Agent\*\*\. You design proxy

objective functions that guide the discovery of better

algorithms\.

##### Objective generation\.

The Objective Agent receives the current research state, Meta\-Agent guidance, baseline code, existing objectives, and recent results, then proposes a new proxy objective function\.

\#\# Research Context

\{main\_research\_context\}

\#\# Meta\-Agent Guidance

\{meta\_agent\_directions\}

Consider this guidance when designing your objective

function\. The meta\-agent has analyzed research progress

and identified areas that need attention\.

\#\# Task

Generate a proxy objective function that better estimates

the research goal\. The experiment schedule will use the

baseline schedule \(shown below for reference\)\.

\#\# Discovery Philosophy

The baseline objective is deliberately simple\. You have

freedom to design objectives that:

\- Target any experiment or combination of all experiments

\- Use any combination of available metrics

\- Apply any scaling model or none at all

\- Incorporate uncertainty, robustness, or other advanced

concepts

Learn from existing results and think about what truly

matters for identifying the best algorithms\.

\#\# Baseline Code

\*\*Objective:\*\*

\`\`\`python

\{baseline\_objective\_code\}

\`\`\`

\*\*Schedules \(for reference \- will be used unchanged\):\*\*

\`\`\`python

\{baseline\_schedule\_code\}

\`\`\`

\#\# Existing Objectives

\{existing\_objective\_summary\}

\#\# Recent Experiment Results

\{recent\_experiments\_objectives\_summary\}

\#\# Required Function

\*\*Objective function:\*\*

\`\`\`python

def objective\(experiment\_results\):

"""Estimate the research goal from experiment results\.

Args:

experiment\_results: List of dicts with experiment

details

Returns:

Float to be MINIMIZED

"""

return objective\_value

\`\`\`

\#\# Experiment Interface

\`\`\`python

\{experiment\_code\}

\`\`\`

\#\# Guidelines

\- \`experiment\_results\` is a list of dicts containing all

experiment kwargs and outputs

\- Objective should be smooth and friendly to Bayesian

optimization \(avoid large penalties for failures\)

\- Objective should adapt to different schedules and remain

backward compatible when possible

\- Available imports: \`math\`, \`numpy\`, \`scipy\`, \`torch\`,

and standard libraries

\#\# Output Format

Return a JSON object with:

\`\`\`json

\{

"objective\_description": "What this objective measures

\(one line, comprehensive but extremely concise\)",

"objective\_code": "Complete Python code with imports and

objective\(\) function"

\}

\`\`\`

##### Error recovery\.

Appended to the conversation when the generated objective function produces a runtime error\.

\#\#Error

‘‘‘

\{error\_message\}

\{full\_traceback\}

‘‘‘

\#\# Task

Fix the error in your objective code\.

\#\# Output Format

Return a JSON object with:

‘‘‘json

\{

"error\_summary": "What went wrong and how you fixed it",

"objective\_code": "Corrected Python code with imports and objective\(\) function"

\}

‘‘‘
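The error-recovery step can be pictured with the following sketch. It assumes the generated `objective_code` string is executed in a fresh namespace and that any exception is formatted into the `{error_message}` and `{full_traceback}` slots of the template above. The function name `run_objective_code` and the exact template string are hypothetical, and the paper does not describe how (or whether) the generated code is sandboxed.

```python
import traceback

# Hypothetical, condensed version of the error-recovery template.
ERROR_TEMPLATE = (
    "## Error\n{error_message}\n{full_traceback}\n\n"
    "## Task\nFix the error in your objective code."
)


def run_objective_code(objective_code, experiment_results):
    """Run generated code; return (value, None) or (None, follow_up_prompt)."""
    namespace = {}
    try:
        # Defines objective() inside `namespace`. A real system would
        # likely isolate this call, since the code is model-generated.
        exec(objective_code, namespace)
        value = float(namespace["objective"](experiment_results))
        return value, None
    except Exception as exc:
        prompt = ERROR_TEMPLATE.format(
            error_message=f"{type(exc).__name__}: {exc}",
            full_traceback=traceback.format_exc(),
        )
        return None, prompt
```

In this sketch a non-`None` prompt would be appended to the conversation, and the agent's corrected `objective_code` would be passed back through the same function.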

### S5\.4 Meta\-Agent

The Meta\-Agent oversees the entire research process\. It analyzes objective function performance using Kendall tau correlations, adjusts objective weights in the consensus mechanism, and provides strategic guidance for the Objective Agent’s next generation\.
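To make the correlation analysis concrete, the sketch below computes a pairwise Kendall tau matrix between objectives from their scores on a shared set of designs. It uses a plain tau-a (ties contribute zero) written in pure Python; this section does not say which tau variant the authors use, and a library routine such as `scipy.stats.kendalltau` (which computes tau-b by default) would be an equally reasonable choice. The `scores` layout is an assumption made for illustration.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall tau-a between two equal-length score lists."""
    concordant_minus_discordant = 0
    pairs = 0
    for i, j in combinations(range(len(a)), 2):
        product = (a[i] - a[j]) * (b[i] - b[j])
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
        concordant_minus_discordant += (product > 0) - (product < 0)
        pairs += 1
    return concordant_minus_discordant / pairs if pairs else 0.0


def tau_matrix(scores):
    """scores[k][d] is the value objective k assigns to design d."""
    k = len(scores)
    return [[kendall_tau(scores[i], scores[j]) for j in range(k)] for i in range(k)]
```

Two objectives that rank every pair of designs the same way get a tau of 1, and two that rank every pair oppositely get -1, so the matrix measures agreement in ordering rather than in raw scale.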

##### System prompt\.

Sets the agent’s role and scope\.

You are the \*\*Meta\-Agent\*\*\. You oversee the entire research process, analyze what’s working and what’s not, guide the objective agent, and adjust objective weights to improve research progress\.

##### Research analysis\.

The Meta\-Agent receives a comprehensive view of all objective functions, their correlation structure, weighting state, and recent progress, then produces an assessment with updated weights and directions for the Objective Agent\.

\#\# Research Context

\{main\_research\_context\}

\#\# High\-Level Research Goal

\{high\_level\_research\_goal\}

\#\# Task

Analyze research progress, evaluate objective functions, and provide strategic guidance\.

Your responsibilities:

1\. Assess research progress

2\. \*\*Maintain and update an evolving consensus objective function\*\*

\- Objective functions are periodically generated by Objective Agent

\- Planner/Designer Agents/hyperparameter optimizer minimize the consensus objective\.

\- You can adjust objective weights: amplify useful ones, suppress harmful ones\.

3\. Guide the Objective Agent in generating new objectives\.

\#\# Experiment Schedule

\(for reference \- will be used unchanged\):

‘‘‘python

\{baseline\_schedule\_code\}

‘‘‘

\#\# Current Objective Functions

\{objective\_summary\_with\_code\}

\#\# Objective Performance Analysis

\*\*Kendall Tau Correlation Matrix\*\* \(measures agreement between objectives\):

\{objective\_correlation\_matrix\}

\#\# Objective Weighting Mechanism

‘‘‘python

weight = default\_weight \* weight\_multiplier

\# You assign weight\_multiplier, default 1\.0

default\_weight = agreement \* age\_decay

agreement = max\(median\_tau, 0\.0\)

age\_decay = 0\.9 \*\* rounds\_since\_creation

‘‘‘

\*\*Objective Agreement Summary\*\*:

\{objective\_agreement\_summary\}

\*\*Recent Progress\*\*:

\{recent\_progress\_summary\}

\*\*Top Designs\*\*:

\{top\_designs\_summary\}

\#\# Previous Meta\-Agent Guidance

\{previous\_meta\_guidance\}

\#\# Guidelines

\*\*For Objective Evaluation\*\*:

\- Objectives with low Kendall tau \(disagreeing with others\) may be misleading

\- Objectives that rank failed designs highly are problematic

\- Objectives that don’t differentiate between designs are not useful

\- Consider whether the objective aligns with the high\-level research goal

\#\# Output Format

Return a JSON object:

‘‘‘json

\{

"research\_assessment": "Overall assessment of research progress \(2\-4 sentences\)",

"research\_phase": "exploring\|converging\|stuck\|breakthrough\_needed\|refining",

"objective\_analysis": \[

\{

"objective\_id": 0,

"assessment": "How this objective is performing",

"weight\_multiplier": 1\.0

\}

\],

"objective\_directions": "Strategic guidance for next objective generation \(be specific\)"

\}

‘‘‘
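The weighting mechanism quoted in the Meta-Agent prompt can be written out as a small function. This is a sketch under stated assumptions: `median_tau` is taken to be the median of each objective's Kendall tau with every *other* objective (the prompt does not say whether the diagonal is included), at least two objectives are assumed to exist, and the function and parameter names are hypothetical.

```python
from statistics import median


def objective_weights(tau, rounds_since_creation, weight_multipliers=None, decay=0.9):
    """Per-objective weights following the formulas in the prompt.

    tau: square Kendall tau matrix (list of lists), one row per objective.
    rounds_since_creation: age of each objective, in rounds.
    weight_multipliers: Meta-Agent multipliers, defaulting to 1.0 each.
    Requires at least two objectives (assumption of this sketch).
    """
    n = len(tau)
    multipliers = weight_multipliers or [1.0] * n
    weights = []
    for i in range(n):
        # Assumption: exclude the diagonal (self-correlation of 1).
        median_tau = median(tau[i][j] for j in range(n) if j != i)
        agreement = max(median_tau, 0.0)  # disagreeing objectives get zero
        age_decay = decay ** rounds_since_creation[i]
        default_weight = agreement * age_decay
        weights.append(default_weight * multipliers[i])
    return weights
```

Under these formulas an objective whose median agreement with the others is negative receives zero weight regardless of its multiplier, so the multiplier can amplify or suppress an objective but cannot revive one that the consensus rejects.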
