Distribution-Aware Algorithm Design with LLM Agents

arXiv cs.AI 05/15/26, 04:00 AM Papers
Summary
This paper introduces a framework for distribution-aware algorithm design where LLM agents learn to generate solver code specialized to target distributions, achieving high solution quality and significant speedups over standard solvers.
arXiv:2605.14141v1 Announce Type: new Abstract: We study learning when the learned object is executable solver code rather than a predictor. In this setting, correctness is not enough: two solvers may both return valid solutions on the deployment distribution while differing substantially in runtime. Given samples from an unknown task distribution, the learner returns code evaluated on fresh instances by both solution quality and execution time. Our central abstraction is a \emph{solver hint}: reusable structure inferred from samples and compiled into specialized solver code. We prove that the empirically fastest sample-consistent solver from a fixed library generalizes in both correctness and runtime, and that statistically identifiable hints can be recovered and compiled from polynomially many samples. Empirically, we instantiate the framework with LLM code agents on \(21\) structured combinatorial-optimization target distributions across seven problem classes. The synthesized solvers reach mean normalized quality \(0.971\), improve by \(+0.224\) over the average heuristic pool and by \(+0.098\) over the highest-quality heuristic, and are \(336.9\times\), \(342.8\times\), and \(16.1\times\) faster than the quality-best heuristic, Gurobi, and the selected time-limited exact backend, respectively. On released PACE 2025 Dominating Set private instances, the synthesized solver is valid on all \(100\) graphs and runs about two orders of magnitude faster than top competition solvers, with a moderate quality gap. Inspection shows that many gains come from changing the computational scale: replacing ambient exponential search or general-purpose optimization with compiled distribution-specific computation.
Original Article
View Cached Full Text
Cached at: 05/15/26, 06:21 AM
# Distribution-Aware Algorithm Design with LLM Agents
Source: [https://arxiv.org/html/2605.14141](https://arxiv.org/html/2605.14141)
Saharsh Koganti Texas A&M University saharshk11@tamu\.edu &Priyadarsi Mishra Texas A&M University priyadarsimishra@tamu\.edu Pierfrancesco Beneventano Massachusetts Institute of Technology pierb@mit\.edu &Tomer Galanti Texas A&M University galanti@tamu\.edu

###### Abstract

We study learning when the learned object is executable solver code rather than a predictor\. In this setting, correctness is not enough: two solvers may both return valid solutions on the deployment distribution while differing substantially in runtime\. Given samples from an unknown task distribution, the learner returns code evaluated on fresh instances by both solution quality and execution time\. Our central abstraction is a*solver hint*: reusable structure inferred from samples and compiled into specialized solver code\. We prove that the empirically fastest sample\-consistent solver from a fixed library generalizes in both correctness and runtime, and that statistically identifiable hints can be recovered and compiled from polynomially many samples\.

Empirically, we instantiate the framework with LLM code agents on2121structured combinatorial\-optimization target distributions across seven problem classes\. The synthesized solvers reach mean normalized quality0\.9710\.971, improve by\+0\.224\+0\.224over the average heuristic pool and by\+0\.098\+0\.098over the highest\-quality heuristic, and are336\.9×336\.9\\times,342\.8×342\.8\\times, and16\.1×16\.1\\timesfaster than the quality\-best heuristic, Gurobi, and the selected time\-limited exact backend, respectively\. On released PACE 2025 Dominating Set private instances, the synthesized solver is valid on all100100graphs and runs about two orders of magnitude faster than top competition solvers, with a moderate quality gap\. Inspection shows that many gains come from changing the computational scale: replacing ambient exponential search or general\-purpose optimization with compiled distribution\-specific computation\.

## 1Introduction

Classical learning theory\(Valiant,[1984](https://arxiv.org/html/2605.14141#bib.bib33); Vapnik and Chervonenkis,[1971](https://arxiv.org/html/2605.14141#bib.bib38); Vapnik,[1998](https://arxiv.org/html/2605.14141#bib.bib39)\)is largely organized around correctness and generalization\. But when the learned object is an*executable procedure*\(Singhalet al\.,[2025](https://arxiv.org/html/2605.14141#bib.bib37)\), such as an algorithm, solver, or piece of code, correctness is only part of the objective\. Two procedures may both return valid solutions on the deployment distribution while differing substantially in runtime\. In that setting, they have not learned equally well\. The learned object must generalize not only in quality, but also in computation\.

We study this through*distribution\-aware program learning*: from samples of an unknown deployment distribution, a learner returns solver code evaluated on fresh instances by both solution quality and runtime\. For decision problems, quality is correctness; for optimization problems, it measures proximity to optimality\. The goal is not merely to solve the ambient problem class, but to learn an algorithm whose execution is specialized to the target regime\.

Our central abstraction is a*solver hint*: reusable structure inferred from samples and compiled into solver code\. Hints may be backdoors in SAT, latent decompositions in graph problems, active\-resource patterns in packing problems, or geometric structure in routing problems\. The sample\-to\-solver map therefore factors asS⟼h^S⟼c^S=Comp\(h^S\)S\\;\\longmapsto\\;\\widehat\{h\}\_\{S\}\\;\\longmapsto\\;\\widehat\{c\}\_\{S\}=\\mathrm\{Comp\}\(\\widehat\{h\}\_\{S\}\)\. This factorization is also how we interpret LLM synthesis\. The agent is not best viewed as writing a solver in one shot; it proposes structural hypotheses, estimates reusable summaries from the sample, and writes a deployment solver conditioned on those summaries\.

Solver hints are computational shortcuts\. In SAT, for example, the hint might be a reusable backdoor set\. Correctness need not be learned: a complete solver can always be used as fallback\. What the sample learns is which shortcut to compile so that future instances are solved faster\. This separation between correctness and learned computation is central to both our theory and our experiments\.

The framing connects to a long\-standing question in complexity theory: what does it mean for a problem to be hard*on average*? Worst\-case complexity has a clean primary object, the language, but average\-case complexity\(Levin,[1986](https://arxiv.org/html/2605.14141#bib.bib40); Impagliazzo,[1995](https://arxiv.org/html/2605.14141#bib.bib41); Bogdanov and Trevisan,[2006](https://arxiv.org/html/2605.14141#bib.bib42)\)also requires a distribution\. In practice, deployment distributions often lack closed\-form descriptions amenable to classical analysis\. Smoothed analysis\(Spielman and Teng,[2004](https://arxiv.org/html/2605.14141#bib.bib43)\), parameterized complexity\(Downey and Fellows,[2013](https://arxiv.org/html/2605.14141#bib.bib44)\), and structural backdoors\(Williamset al\.,[2003](https://arxiv.org/html/2605.14141#bib.bib45)\)each give hand\-designed routes to distribution\-specific tractability\. We propose a complementary constructive route: treat the distribution as accessible only through samples, learn a solver against it directly, and use the learned solver as a witness of distribution\-specific tractability\.

Contributions\.In this paper we make the following contributions:

1. 1\.We introduce*distribution\-aware program learning*: a setting in which the learned object is executable solver code, and success is measured by both solution quality and deployment runtime\. The key abstraction is a*solver hint*, a reusable description of distribution\-specific structure inferred from samples and compiled into specialized solver code\.
2. 2\.We formalize when samples can improve computation\. For fixed solver libraries, the empirically fastest sample\-consistent solver generalizes in both correctness and runtime\. For structured hint classes, statistically identifiable hints can be recovered from polynomially many samples and compiled into specialized solvers\. A hidden SAT\-backdoor model makes the mechanism concrete: fallback preserves correctness, while the recovered hint yields exponential per\-instance speedup on the deployment distribution\.
3. 3\.We instantiate the sample\-to\-hint\-to\-solver view with an LLM synthesis procedure in which each candidate consists of a distributional hypothesis, a train\-time analysis program, and a deployment solver\. Across2121structured combinatorial\-optimization target distributions, the synthesized solvers reach mean normalized quality0\.9710\.971, improve by\+0\.224\+0\.224over the average heuristic pool and by\+0\.098\+0\.098over the highest\-quality heuristic, and are336\.9×336\.9\\times,342\.8×342\.8\\times, and16\.1×16\.1\\timesfaster than the quality\-best heuristic, Gurobi, and the selected time\-limited exact backend, respectively\.
4. 4\.We inspect the synthesized solvers and find that many gains come from changing the computational scale: generated code often replaces ambient exponential search or full general\-purpose optimization with lower\-order, distribution\-specific computations\. As an external diagnostic, on released PACE 2025PACE Challenge \([2025](https://arxiv.org/html/2605.14141#bib.bib72)\)Dominating Set private instances, the synthesized solver returns valid solutions on all100100graphs and runs about two orders of magnitude faster than top competition solvers, with a moderate quality gap\.

access toDDlearned representationdeployed solverWorst\-casealgorithm designAverage\-casecomplexityThis papernone\(worst case overDD\)algorithm withworst\-case guaranteeanalyzeDDspecifiedanalyticallyalgorithm tunedtoDDanalyzesampleS∼DnS\\sim D^\{n\}solver hinth^S∈ℋ\\hat\{h\}\_\{S\}\\in\\mathcal\{H\}solverc^S∈𝒞\\widehat\{c\}\_\{S\}\\in\\mathcal\{C\}learncompileComp\\mathrm\{Comp\}effectiveclassComp\(ℋ\)⊆𝒞\\mathrm\{Comp\}\(\\mathcal\{H\}\)\\subseteq\\mathcal\{C\}

Figure 1:Three access models for designing solvers against a distributionDD\.Worst\-case design assumes no distributional information, while average\-case complexity assumes an analytic specification ofDD\. We study the intermediate sample\-access regime: fromS∼DnS\\sim D^\{n\}, the learner infers a solver hinth^S∈ℋ\\hat\{h\}\_\{S\}\\in\\mathcal\{H\}and compiles it into a deployed solverc^S=Comp\(h^S\)\\widehat\{c\}\_\{S\}=\\mathrm\{Comp\}\(\\hat\{h\}\_\{S\}\)\. Thus the effective search space is the structured subfamilyComp\(ℋ\)⊆𝒞\\mathrm\{Comp\}\(\\mathcal\{H\}\)\\subseteq\\mathcal\{C\}\. Our agents approximate this sample\-to\-hint\-to\-solver map\.
## 2Related Work

We treat*execution runtime*as part of generalization when the learned object is executable solver code\. Several established literatures touch parts of this picture\. We therefore situate the contribution narrowly: the main novelty is not the broad idea that data can improve computation, but the explicit*sample→\\tohint→\\tosolver*formalization and the theory\-plus\-benchmark package built around it\.

Average\-case and beyond\-worst\-case analysis\.Average\-case complexity\(Levin,[1986](https://arxiv.org/html/2605.14141#bib.bib40); Impagliazzo,[1995](https://arxiv.org/html/2605.14141#bib.bib41); Bogdanov and Trevisan,[2006](https://arxiv.org/html/2605.14141#bib.bib42)\)asks when problems become tractable under input distributions, but often requires an analytic distribution before the analysis can begin\. Smoothed analysis\(Spielman and Teng,[2004](https://arxiv.org/html/2605.14141#bib.bib43)\), parameterized complexity\(Downey and Fellows,[2013](https://arxiv.org/html/2605.14141#bib.bib44)\), and SAT backdoor analyses\(Williamset al\.,[2003](https://arxiv.org/html/2605.14141#bib.bib45)\)give hand\-designed routes to distribution\-specific tractability\. Our framework is complementary and sample\-driven: the deployment distribution is accessed through examples, and the learned solver serves as a constructive witness of tractability for that regime\.

Algorithm selection, portfolios, and hyper\-heuristics\.The closest classical line is algorithm selection\(Rice,[1976](https://arxiv.org/html/2605.14141#bib.bib1)\)and feature\-based portfolios such as SATzilla, Hydra, and AutoFolio\(Xuet al\.,[2008](https://arxiv.org/html/2605.14141#bib.bib2),[2010](https://arxiv.org/html/2605.14141#bib.bib3); Lindaueret al\.,[2015](https://arxiv.org/html/2605.14141#bib.bib4); Kotthoff,[2014](https://arxiv.org/html/2605.14141#bib.bib6); Kerschkeet al\.,[2019](https://arxiv.org/html/2605.14141#bib.bib5)\)\. Our fixed\-library regime is close in spirit: once a solver library is given, choose the best member for the deployment instances\. A broader neighboring literature studies hyper\-heuristics and automated heuristic design, where the goal is to select, compose, or generate heuristics for families of optimization problems\. These lines already show that data can be used to reduce runtime\. Our main object is narrower and more structural: a reusable solver hint inferred from samples and compiled into solver code, potentially yielding a solver not present in any predefined library\.

Learning\-augmented algorithms, adaptive computation, and amortization\.Our deployment criterion aligns with work arguing that worst\-case complexity is too coarse for practical performance\(Roughgarden,[2019](https://arxiv.org/html/2605.14141#bib.bib7)\), and with learning\-augmented algorithms, where predictions or advice modify the behavior of a fixed algorithm while preserving robustness when advice is inaccurate\(Lykouris and Vassilvitskii,[2018](https://arxiv.org/html/2605.14141#bib.bib9); Mitzenmacher and Vassilvitskii,[2022](https://arxiv.org/html/2605.14141#bib.bib10)\)\. The shared motivation is that data can help computation\. The difference is in the learned object: these works begin with a fixed algorithmic template and study how advice changes its behavior, while we ask whether samples can reveal a reusable hint that compiles into a*different executable solver*\. The framework can also be read as amortization: instead of paying inference\-time compute on every instance, we pay a one\-time synthesis cost against the sample and then deploy a solver whose per\-instance cost is lower\.

Program synthesis, sketches, and code agents\.Our synthesis regime connects to program synthesis from examples or natural language\(Gulwani,[2011](https://arxiv.org/html/2605.14141#bib.bib78); Baloget al\.,[2017](https://arxiv.org/html/2605.14141#bib.bib24); Devlinet al\.,[2017](https://arxiv.org/html/2605.14141#bib.bib25)\), and is closest in spirit to work that learns intermediate guides for search, such as sketches, rather than predicting complete programs\(Nyeet al\.,[2019](https://arxiv.org/html/2605.14141#bib.bib26)\)\. It is also related to code\-generation and agentic coding benchmarks\(Hendryckset al\.,[2021](https://arxiv.org/html/2605.14141#bib.bib27); Liet al\.,[2022](https://arxiv.org/html/2605.14141#bib.bib28); Jimenezet al\.,[2024](https://arxiv.org/html/2605.14141#bib.bib30); Huanget al\.,[2024](https://arxiv.org/html/2605.14141#bib.bib31)\)\. Our contribution is not to initiate automated heuristic design or agentic code generation\. Rather, we recast one important regime of such systems through the lens of distribution\-aware program learning: runtime is part of the objective, the intermediate object is a solver hint, and the empirical study is designed so that the recovered structural hypothesis can be inspected after synthesis\.

## 3Problem Setup

We study learning when the learned object is an executable solver and the deployment criterion includes runtime\. Classical statistical learning measures generalization by accuracy: a hypothesis is good if it performs well on fresh draws fromDD\(Valiant,[1984](https://arxiv.org/html/2605.14141#bib.bib33); Vapnik and Chervonenkis,[1971](https://arxiv.org/html/2605.14141#bib.bib38); Vapnik,[1998](https://arxiv.org/html/2605.14141#bib.bib39)\)\. For solvers, accuracy is incomplete\. Two solvers may both return valid solutions on the same deployment distribution, yet differ substantially in runtime\. We therefore ask how well a learner can identify, from samples, a solver whose computation is specialized toDD\.

This places the problem between worst\-case and average\-case analysis\. Worst\-case design assumes no distributional information, while average\-case complexity assumes an analytic distribution\. We study the sample\-access regime: givenS=\(x1,…,xn\)∼DnS=\(x\_\{1\},\\ldots,x\_\{n\}\)\\sim D^\{n\}from an unknown deployment distribution, the learner returns solver code for future instances from the sameDD\.

Formally, let𝒳\\mathcal\{X\}be the instance space, and letV\(x,z\)∈\{0,1\}\\mathrm\{V\}\(x,z\)\\in\\\{0,1\\\}indicate whetherzzis a valid solution forxx\. A solverc∈𝒞c\\in\\mathcal\{C\}is a program mappingx↦c\(x\)x\\mapsto c\(x\), with execution timeT\(c,x\)∈ℝ≥0\\mathrm\{T\}\(c,x\)\\in\\mathbb\{R\}\_\{\\geq 0\}\. We evaluateccby its deployment errorErrD\(c\):=Prx∼D⁡\[V\(x,c\(x\)\)=0\]\\mathrm\{Err\}\_\{D\}\(c\):=\\Pr\_\{x\\sim D\}\[\\mathrm\{V\}\(x,c\(x\)\)=0\]and expected deployment runtimeRunD\(c\):=𝔼x∼D\[T\(c,x\)\]\\mathrm\{Run\}\_\{D\}\(c\):=\\mathbb\{E\}\_\{x\\sim D\}\[\\mathrm\{T\}\(c,x\)\]\.

Correctness is a feasibility constraint: among solvers withErrD\(c\)=0\\mathrm\{Err\}\_\{D\}\(c\)=0, two solvers may differ arbitrarily inRunD\(c\)\\mathrm\{Run\}\_\{D\}\(c\)\. For a solver class𝒞\\mathcal\{C\}, define the class\-relative distribution\-aware runtime optimum asRunD⋆\(𝒞\):=infc∈𝒞:ErrD\(c\)=0RunD\(c\)\\mathrm\{Run\}\_\{D\}^\{\\star\}\(\\mathcal\{C\}\):=\\inf\_\{c\\in\\mathcal\{C\}:\\,\\mathrm\{Err\}\_\{D\}\(c\)=0\}\\mathrm\{Run\}\_\{D\}\(c\)\. This is the best expected deployment runtime achievable by a correct solver in𝒞\\mathcal\{C\}\. Unlike worst\-case runtime, it depends onDD, and the role of the sample is to discover which fast correct solver the presentDDadmits\.

Selection from a fixed library\.When𝒞\\mathcal\{C\}is given in advance, a natural rule is the runtime\-aware analogue of empirical risk minimization\. LetErr^S\(c\):=1n∑i=1n𝟏\{V\(xi,c\(xi\)\)=0\}\\widehat\{\\mathrm\{Err\}\}\_\{S\}\(c\):=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\mathbf\{1\}\\\{\\mathrm\{V\}\(x\_\{i\},c\(x\_\{i\}\)\)=0\\\}andRun^S\(c\):=1n∑i=1nT\(c,xi\)\\widehat\{\\mathrm\{Run\}\}\_\{S\}\(c\):=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\mathrm\{T\}\(c,x\_\{i\}\)\. We selectc^S∈arg⁡minc∈𝒞:Err^S\(c\)=0⁡Run^S\(c\)\\widehat\{c\}\_\{S\}\\in\\arg\\min\_\{c\\in\\mathcal\{C\}:\\,\\widehat\{\\mathrm\{Err\}\}\_\{S\}\(c\)=0\}\\widehat\{\\mathrm\{Run\}\}\_\{S\}\(c\), the empirically fastest sample\-consistent solver\. Thus fixed\-library selection empirically minimizes the sample analogue ofRunD⋆\(𝒞\)\\mathrm\{Run\}\_\{D\}^\{\\star\}\(\\mathcal\{C\}\)\. This rule is appropriate when the solver library is enumerated, but it does not address richer settings where the useful specialization is not already present in𝒞\\mathcal\{C\}\.

Synthesis via solver hints\.To move beyond enumerated libraries, we factor the learner through a*solver hint*: reusable structure inferred from samples and compiled into solver code\. A hint spaceℋ\\mathcal\{H\}and compilation mapComp:ℋ→𝒞\\mathrm\{Comp\}:\\mathcal\{H\}\\to\\mathcal\{C\}split learning intoS↦h^S↦c^S=Comp\(h^S\)S\\mapsto\\widehat\{h\}\_\{S\}\\mapsto\\widehat\{c\}\_\{S\}=\\mathrm\{Comp\}\(\\widehat\{h\}\_\{S\}\)\. The hint may encode a backdoor, decomposition, active\-resource pattern, local repair rule, or other distribution\-specific shortcut\. We focus on the regime in whichComp\(h\)\\mathrm\{Comp\}\(h\)is correct for everyh∈ℋh\\in\\mathcal\{H\}, typically because the compiled solver falls back to a generic complete solver\. The sample is then not used to learn correctness, but to identify which shortcut to compile\.

## 4Method

The setup above defines the object we want to learn: a reusable solver hint compiled into executable solver code\. In realistic synthesis, however, the hint space, evidence statistics, and compilation map are unknown\. The learner must propose what structure to look for, measure it from samples, and write code that exploits the resulting summary on future instances\.

Our method implements this sample\-to\-hint\-to\-solver factorization with an LLM code agent\. Each candidate contains a hypothesis about the hidden structure, an analysis program that estimates it from public training instances, and a deployment solver conditioned on the resulting summary\. Validation selects among candidate factorizations by quality, optimality, and runtime\. Additional implementation details, prompt schemas, validation checks, and failure handling appear in Appendix[C](https://arxiv.org/html/2605.14141#A3)\.

1\. hypothesis2\. analysis3\. solverpublicsamplesSpubS^\{\\mathrm\{pub\}\}HcH\_\{c\}AcA\_\{c\}scs\_\{c\}hintaca\_\{c\}evaluatesc\(⋅,ac\)s\_\{c\}\(\\cdot,a\_\{c\}\)rank by\(Q,O,−T\)val\(Q,O,\-T\)\_\{\\mathrm\{val\}\}deployc^S\\widehat\{c\}\_\{S\}executecondition onaca\_\{c\}refine*\(refine / fork / replace / push runtime / push quality\)*

Figure 2:Method pipeline\.Each candidate contains a hypothesisHcH\_\{c\}, an analysis programAcA\_\{c\}, and a deployment solverscs\_\{c\}\. ExecutingAcA\_\{c\}on the public training sample produces a summaryac=Ac\(Strpub\)a\_\{c\}=A\_\{c\}\(S^\{\\mathrm\{pub\}\}\_\{\\mathrm\{tr\}\}\), which serves as the recovered hint\. The solversc\(⋅,ac\)s\_\{c\}\(\\cdot,a\_\{c\}\)is evaluated on public splits, ranked by validation quality, optimality, and runtime, and refined in a diversity\-preserving beam\.Candidate representation\.Each candidate has the formc=\(Hc,Ac,sc\)c=\(H\_\{c\},A\_\{c\},s\_\{c\}\)\. The hypothesisHcH\_\{c\}is a structured natural\-language description of a suspected distributional rule\. The analysis programAcA\_\{c\}maps the public training sample to a compact JSON\-serializable summary,ac=Ac\(Strpub\)a\_\{c\}=A\_\{c\}\(S\_\{\\mathrm\{tr\}\}^\{\\mathrm\{pub\}\}\), and the deployment solver maps a new public instance and this summary to a solution,z=sc\(xpub,ac\)z=s\_\{c\}\(x^\{\\mathrm\{pub\}\},a\_\{c\}\)\. Thusaca\_\{c\}is the empirical solver hint, whilesc\(⋅,ac\)s\_\{c\}\(\\cdot,a\_\{c\}\)is the compiled solver\. Neither the hint space nor the compilation map is fixed in advance; both are generated separately for each candidate\.

Public instances are stripped of evaluator\-only fields, including family identity, hidden\-rule metadata, optimum solutions, and optimum objective values\. The agent sees only the public instance format, the problem specification, the scoring rule, and samples from the unknown structured distribution\.

Three\-stage construction\.Candidates are generated by three sequential LLM calls\. The first producesHcH\_\{c\}: the suspected rule, evidence to measure, solver strategy, expected failure modes, and diversity key\. The second writesAcA\_\{c\}, which measures this evidence and compresses it into a reusable summary\. The third is conditioned onHcH\_\{c\},AcA\_\{c\}, and the executed summaryaca\_\{c\}, and writes the deployment solverscs\_\{c\}\. Thus synthesis is sample\-conditioned: the solver is written against an actual estimate of distributional structure, not merely a plan to compute one\.

Search and selection\.Because the relevant hypothesis class is unknown, we search over a beam of candidates\. The initial beam is seeded with broad structural directives, such as latent subtypes, separators, bottlenecks, decompositions, higher\-order interactions, objective\-aware marginal rules, and shortcut\-plus\-repair strategies\. Later rounds refine surviving candidates, fork them into different diversity classes, replace brittle hypotheses, or push them toward higher quality or lower runtime\. Each child is conditioned on the parent hypothesis, analysis output, code, validation metrics, and representative public failure cases\.

Candidates are ranked lexicographically by\(Qval,Oval,−Tval\)\(Q\_\{\\mathrm\{val\}\},\\,O\_\{\\mathrm\{val\}\},\\,\-T\_\{\\mathrm\{val\}\}\), whereQvalQ\_\{\\mathrm\{val\}\}is average normalized quality,OvalO\_\{\\mathrm\{val\}\}is optimality rate, andTvalT\_\{\\mathrm\{val\}\}is average runtime\. Beam survivors are chosen in two passes: first, the best candidate for each diversity key is retained; then remaining slots are filled by the top\-ranked candidates overall\. Failed candidates receive zero quality and a large failure runtime\. After the final iteration, we select the best candidate among all evaluated candidates, rerun its analysis onStrpubS\_\{\\mathrm\{tr\}\}^\{\\mathrm\{pub\}\}, and deploy the resulting solver\.

## 5Theory

The method above treats algorithm design as learning: from samples of an unknown distribution, the agent identifies reusable structure and compiles it into solver code\. We now abstract this mechanism and ask when such sample\-conditioned design generalizes beyond the observed instances\.

We study two regimes\. First, the learner selects from a fixed solver library, using empirical runtime to identify a fast correct solver for the deployment distribution\. Second, the useful specialization is not enumerated in advance; the learner must recover a reusable structural*hint*and compile it into a solver\. The results use standard concentration and union\-bound arguments\(Shalev\-Shwartz and Ben\-David,[2014](https://arxiv.org/html/2605.14141#bib.bib18)\)\. Their purpose is to formalize the core principle: samples can improve computation when they identify a shortcut shared by future instances\. Proofs appear in Appendix[F](https://arxiv.org/html/2605.14141#A6)\.

### 5\.1Runtime\-Aware ERM Over a Fixed Solver Class

When the solver library is fixed, the empirically fastest sample\-consistent solver approaches the best correct distribution\-specialized solver in the class\.

###### Theorem 5\.1\(Runtime\-aware generalization for library selection\)\.

Assume0≤T\(c,x\)≤Tmax0\\leq\\mathrm\{T\}\(c,x\)\\leq\\mathrm\{T\}\_\{\\max\}for everyc∈𝒞c\\in\\mathcal\{C\}andx∈𝒳x\\in\\mathcal\{X\}, and assume𝒞\\mathcal\{C\}contains at least one solver that is correct almost surely underDD\. Letπ\\pibe a prior on𝒞\\mathcal\{C\}with∑c∈𝒞π\(c\)≤1\\sum\_\{c\\in\\mathcal\{C\}\}\\pi\(c\)\\leq 1, and writeΓ\(c\):=log⁡\(1/π\(c\)\)\\Gamma\(c\):=\\log\(1/\\pi\(c\)\)\. Let𝒞feas=\{c∈𝒞:ErrD\(c\)=0\}\\mathcal\{C\}^\{\\rm feas\}=\\\{c\\in\\mathcal\{C\}:\\mathrm\{Err\}\_\{D\}\(c\)=0\\\}\. For everyδ∈\(0,1\)\\delta\\in\(0,1\), with probability at least1−δ1\-\\deltaoverS∼DnS\\sim D^\{n\},

ErrD\(c^S\)≤Γ\(c^S\)\+log⁡\(2δ\)n,RunD\(c^S\)≤infc∈𝒞feas\{RunD\(c\)\+2Tmaxmax⁡\{Γ\(c^S\),Γ\(c\)\}\+log⁡\(4δ\)2n\}\\mathrm\{Err\}\_\{D\}\(\\widehat\{c\}\_\{S\}\)\\\!\\leq\\\!\\frac\{\\Gamma\(\\widehat\{c\}\_\{S\}\)\\\!\+\\\!\\log\(\\tfrac\{2\}\{\\delta\}\)\}\{n\},~~\\mathrm\{Run\}\_\{D\}\(\\widehat\{c\}\_\{S\}\)\\\!\\leq\\\!\\inf\_\{c\\in\\mathcal\{C\}^\{\\rm feas\}\}\\\!\\left\\\{\\mathrm\{Run\}\_\{D\}\(c\)\\\!\+\\\!2\\mathrm\{T\}\_\{\\max\}\\sqrt\{\\frac\{\\max\\\{\\Gamma\(\\widehat\{c\}\_\{S\}\),\\Gamma\(c\)\\\}\\\!\+\\\!\\log\(\\tfrac\{4\}\{\\delta\}\)\}\{2n\}\}\\right\\\}

For a finite uniform library withπ\(c\)=1/\|𝒞\|\\pi\(c\)=1/\|\\mathcal\{C\}\|, Theorem[5\.1](https://arxiv.org/html/2605.14141#S5.Thmtheorem1)says thatc^S\\widehat\{c\}\_\{S\}approachesRunD⋆\(𝒞\)\\mathrm\{Run\}\_\{D\}^\{\\star\}\(\\mathcal\{C\}\)up to an additiveO\(Tmaxlog⁡\|𝒞\|/n\)O\(\\mathrm\{T\}\_\{\\max\}\\sqrt\{\\log\|\\mathcal\{C\}\|/n\}\)term, while sample consistency controls deployment error\. Thus runtime\-aware ERM is not merely choosing a fast solver on the sample; it estimates the best correct distribution\-specialized solver available in the class\. The guarantee is class\-relative: it does not say that the library contains the right specialization, only that if such a solver is present, empirical runtime minimization can identify it from samples\. This motivates the hint\-based synthesis regime below, where the specialization itself must be constructed\.

### 5\.2Synthesis via Learnable Hints

In the rich domains we ultimately care about, no fixed library contains the right specialized solver\. To make the search tractable, we shift it from full solvers to a smaller space of reusable structure\. Letℋ\\mathcal\{H\}be a finite hint space, where eachh∈ℋh\\in\\mathcal\{H\}induces a distributionDhD\_\{h\}over instances sharing that structure\. We assume a realizable setting:D=Dh⋆D=D\_\{h^\{\\star\}\}for some unknownh⋆∈ℋh^\{\\star\}\\in\\mathcal\{H\}, and both training and deployment instances are drawn fromDh⋆D\_\{h^\{\\star\}\}\.

Given a score family with a margin separation, hint recovery reduces to a finite\-class estimation problem\. Suppose score functions\{ψh:𝒳→\[0,1\]\}h∈ℋ\\\{\\psi\_\{h\}:\\mathcal\{X\}\\to\[0,1\]\\\}\_\{h\\in\\mathcal\{H\}\}satisfy

𝔼x∼Dh\[ψh\(x\)\]≥𝔼x∼Dh\[ψg\(x\)\]\+γfor everyh∈ℋand everyg∈ℋ∖\{h\},\\mathbb\{E\}\_\{x\\sim D\_\{h\}\}\[\\psi\_\{h\}\(x\)\]\\geq\\mathbb\{E\}\_\{x\\sim D\_\{h\}\}\[\\psi\_\{g\}\(x\)\]\+\\gamma\\qquad\\text\{for every \}h\\in\\mathcal\{H\}\\text\{ and every \}g\\in\\mathcal\{H\}\\setminus\\\{h\\\},for someγ\>0\\gamma\>0\. GivenS=\(x1,…,xn\)S=\(x\_\{1\},\\ldots,x\_\{n\}\), the learner returnsh^∈arg⁡maxh∈ℋ⁡μ^S\(h\)\\widehat\{h\}\\in\\arg\\max\_\{h\\in\\mathcal\{H\}\}\\widehat\{\\mu\}\_\{S\}\(h\)withμ^S\(h\):=1n∑t=1nψh\(xt\)\\widehat\{\\mu\}\_\{S\}\(h\):=\\tfrac\{1\}\{n\}\\sum\_\{t=1\}^\{n\}\\psi\_\{h\}\(x\_\{t\}\)\.

###### Theorem 5\.2\(Exact recovery under identifiable structure\)\.

If\|ℋ\|=N\|\\mathcal\{H\}\|=Nand the margin isγ\>0\\gamma\>0, thenn≥2γ2log⁡2Nδn\\geq\\frac\{2\}\{\\gamma^\{2\}\}\\log\\frac\{2N\}\{\\delta\}samples suffice forh^=h⋆\\widehat\{h\}=h^\{\\star\}with probability at least1−δ1\-\\delta\.

The sample complexity is logarithmic in\|ℋ\|\|\\mathcal\{H\}\|and inverse\-quadratic in the margin, after which the learner compilesh^\\widehat\{h\}into a specialized solver\. The theorem is intentionally idealized: in realistic domains the hard part is not estimatingh^\\widehat\{h\}for a known score family but*discovering*what the hint should be, what statistic reveals it, and how to compile it into code\. This is precisely the regime our experiments target\. We use an LLM agent as an approximate procedure for these three sub\-tasks and evaluate whether the resulting solvers improve in deployment runtime and quality on structured task distributions\. Before turning to that empirical study, we give a formal example where the score family is known and Theorem[5\.2](https://arxiv.org/html/2605.14141#S5.Thmtheorem2)applies directly\.

### 5\.3A Formal Example: Hidden SAT Backdoors

SAT illustrates the point cleanly because correctness never requires learning: a complete solver is always available\. The value of learning is computational, recovering a reusable backdoor that makes future solving faster\. Fix variables\[d\]\[d\]and backdoor sizekk\. The hint space isℋ=\(\[d\]k\)\\mathcal\{H\}=\\binom\{\[d\]\}\{k\}, and an unknownB∈ℋB\\in\\mathcal\{H\}indexes a distributionDBD\_\{B\}over CNF formulas onddvariables\. We assumeBBis a strong backdoor into a tractable class𝒯\\mathcal\{T\}: for everyFFin the support ofDBD\_\{B\}and everyα∈\{0,1\}B\\alpha\\in\\\{0,1\\\}^\{B\}, the restricted formulaF\|B=αF\|\_\{B=\\alpha\}lies in𝒯\\mathcal\{T\}\. The learner observesF\(1\),…,F\(m\)∼DBF^\{\(1\)\},\\ldots,F^\{\(m\)\}\\sim D\_\{B\}but notBB\. Assume membership inBBis identifiable from a bounded variable\-level salience statisticσi\(F\)∈\[0,1\]\\sigma\_\{i\}\(F\)\\in\[0,1\]with𝔼F∼DB\[σi\(F\)\]≥q1\\mathbb\{E\}\_\{F\\sim D\_\{B\}\}\[\\sigma\_\{i\}\(F\)\]\\geq q\_\{1\}fori∈Bi\\in Band≤q0\\leq q\_\{0\}fori∉Bi\\notin B, with marginγ=q1−q0\\gamma=q\_\{1\}\-q\_\{0\}\. The learner estimatesσ^i=1m∑tσi\(F\(t\)\)\\widehat\{\\sigma\}\_\{i\}=\\frac\{1\}\{m\}\\sum\_\{t\}\\sigma\_\{i\}\(F^\{\(t\)\}\)and setsB^\\widehat\{B\}to the top\-kkvariables\. The compiled solver enumerates assignments toB^\\widehat\{B\}, solves the residual if it lies in𝒯\\mathcal\{T\}, and otherwise falls back to a complete base solver—preserving correctness for everyB^\\widehat\{B\}while gaining speed whenB^=B\\widehat\{B\}=B\.

###### Theorem 5\.3\(Learning a hidden SAT backdoor from samples\)\.

Ifm≥8γ−2log⁡2dδm\\geq 8\\,\\gamma^\{\-2\}\\log\\frac\{2d\}\{\\delta\}, thenB^=B\\widehat\{B\}=Bwith probability at least1−δ1\-\\delta\.

On the eventB^=B\\widehat\{B\}=B, the learned solver runs inO\(2kpoly\(\|F\|\)\)O\(2^\{k\}\\textnormal\{poly\}\(\|F\|\)\)on formulas fromDBD\_\{B\}, with learner\-side costO\(md\)O\(md\)for the salience scores\. The sample is not used to learn SAT correctness; it identifies which structural shortcut to compile\. A concrete planted Horn\-backdoor family satisfying the separation assumption is given in Appendix[B](https://arxiv.org/html/2605.14141#A2)\.

## 6Experiments

We evaluate whether the synthesis procedure of Section[4](https://arxiv.org/html/2605.14141#S4)extracts reusable structure from samples and improves the quality–runtime tradeoff over heuristic and solver\-backed baselines\. The appendices give the full implementation details \(Appendix[C](https://arxiv.org/html/2605.14141#A3)\), benchmark distributions \(Appendix[D\.1](https://arxiv.org/html/2605.14141#A4.SS1)\), baseline protocols \(Appendix[D\.3](https://arxiv.org/html/2605.14141#A4.SS3)\), additional setup details \(Appendix[D](https://arxiv.org/html/2605.14141#A4)\), full per\-target results \(Appendix[E\.1](https://arxiv.org/html/2605.14141#A5.SS1)\), distributional\-complexity diagnostics \(Appendix[E\.3](https://arxiv.org/html/2605.14141#A5.SS3)\), perturbation ablations \(Appendix[E\.6](https://arxiv.org/html/2605.14141#A5.SS6)\), and our results on the PACE Dominating Set competition \(Appendix[E\.2](https://arxiv.org/html/2605.14141#A5.SS2)\)\.

### 6\.1Experimental setup

Benchmarks\.The benchmark contains structured combinatorial optimization tasks\. Each*target*pairs a problem class with a hidden distribution family\. The learner observes sampled public instances, but not the family identity, hidden rule, optimum solutions, optimum objective values, or hidden\-rule metadata, which are retained only by the evaluator\. The benchmark spans seven problem classes and2121hidden families, three per class; the full target list is in Table[5](https://arxiv.org/html/2605.14141#A4.T5)of Appendix[D\.1](https://arxiv.org/html/2605.14141#A4.SS1)\. We report aggregate results over all2121targets\. Each family is designed so that recurring cross\-instance structure, rather than per\-instance heuristic search alone, is the relevant signal\.

Baselines\.We compare the LLM synthesis agent of Section[4](https://arxiv.org/html/2605.14141#S4)with the full benchmark baseline catalog, evaluating all methods under the same public\-instance protocol\. The catalog includes problem\-specific heuristics, time\-limited optimization backends, and exact or certifying methods when available\. The heuristic pool includes DSATUR\-style coloring\(Brélaz,[1979](https://arxiv.org/html/2605.14141#bib.bib46)\), greedy set\-cover and graph rules for dominating set and independent set variants\(Chvátal,[1979](https://arxiv.org/html/2605.14141#bib.bib47)\), density and relaxation\-based rules for packing and knapsack\(Dantzig,[1957](https://arxiv.org/html/2605.14141#bib.bib48)\), and insertion or local\-search heuristics for TSP, including two\-opt and LKH\(Lin,[1965](https://arxiv.org/html/2605.14141#bib.bib49); Rosenkrantzet al\.,[1977](https://arxiv.org/html/2605.14141#bib.bib50); Helsgaun,[2000](https://arxiv.org/html/2605.14141#bib.bib76)\)\. The optimization\-backed pool includes time\-limited Gurobi formulations\(Gurobi Optimization, LLC,[2026](https://arxiv.org/html/2605.14141#bib.bib51)\), OR\-Tools CP\-SAT/GLOP formulations\(Perron and Furnon,[2025](https://arxiv.org/html/2605.14141#bib.bib52); Perronet al\.,[2023](https://arxiv.org/html/2605.14141#bib.bib53)\), PySAT/RC2 MaxSAT solvers\(Ignatievet al\.,[2019](https://arxiv.org/html/2605.14141#bib.bib54)\), branch\-and\-bound routines\(Land and Doig,[1960](https://arxiv.org/html/2605.14141#bib.bib55)\), Held–Karp dynamic programming for TSP\(Held and Karp,[1962](https://arxiv.org/html/2605.14141#bib.bib56)\), and external solvers including SCIP, HiGHS, CBC, Open\-WBO, and UWrMaxSAT\(Bestuzhevaet al\.,[2023](https://arxiv.org/html/2605.14141#bib.bib57); HiGHS Development Team,[2024](https://arxiv.org/html/2605.14141#bib.bib58); Forrest and Lougee\-Heimer,[2005](https://arxiv.org/html/2605.14141#bib.bib60); Martinset al\.,[2014](https://arxiv.org/html/2605.14141#bib.bib61); Piotrów,[2020](https://arxiv.org/html/2605.14141#bib.bib62)\)\. The exhaustive per\-problem catalog appears in Table[7](https://arxiv.org/html/2605.14141#A4.T7)of Appendix[D\.3](https://arxiv.org/html/2605.14141#A4.SS3)\.

Heuristic baselines have no access to the hidden family rule\. We report Gurobi separately because it uses a1010\-second, single\-thread budget and may return incumbents rather than certified optima\. When several time\-limited exact or certifying backends are available, we report the fastest method among those that attain the best certified or validated quality, denoted the*time\-limited exact*baseline\.

Evaluation\.The primary metric is average normalized quality, scaled so that1\.01\.0is optimal and larger is better; invalid or infeasible outputs receive quality zero\. We also report optimality rate, feasibility rate, and average per\-instance wall\-clock runtime\. Per\-problem quality definitions appear in Table[6](https://arxiv.org/html/2605.14141#A4.T6)of Appendix[D](https://arxiv.org/html/2605.14141#A4)\. Each per\-target value averages1010repeated evaluations per solver and split\.

For aggregate quality, optimality, feasibility, and quality\-lift values, we compute one value per target distribution by averaging over held\-out test instances and repeated runs, and then take the arithmetic mean over the2121target distributions\. For aggregate runtime comparisons, we compute per\-target speedup ratios and report their geometric mean\. This treats speedup as a multiplicative quantity and avoids having a few high\-runtime targets dominate the aggregate\. Synthesis methods may use training and validation instances, including representative failure cases from those splits, for refinement\. Test instances are never used during synthesis, selection, or refinement\.

### 6\.2Results

Table 1:Headline quality–runtime summary over2121target distributions\.Quality is normalized so higher is better\. Quality lifts compare the synthesized solver to the average and best heuristic\.TLLMT\_\{\\rm LLM\}and speedups are geometric means over target distributions; speedups above1×1\\timesmean the synthesized solver is faster\. Heuristic runtimes are clipped at1010seconds before ratio computation\. Full per\-target results, including optimality rates, appear in Table[9](https://arxiv.org/html/2605.14141#A5.T9)\.Main quality–runtime results\.The relevant comparison is the quality–runtime tradeoff: a useful distribution\-aware solver should recover high\-quality solutions without treating each instance as worst case\. Table[1](https://arxiv.org/html/2605.14141#S6.T1)summarizes this tradeoff over all2121benchmark target distributions; full per\-target results appear in Appendix[E\.1](https://arxiv.org/html/2605.14141#A5.SS1)\.

LLM synthesis reaches mean normalized quality0\.9710\.971, improving by\+0\.224\+0\.224over the average heuristic pool and by\+0\.098\+0\.098over the quality\-best heuristic\. Using geometric means of per\-target runtime ratios, it is336\.9×336\.9\\timesfaster than the quality\-best heuristic,342\.8×342\.8\\timesfaster than Gurobi, and16\.1×16\.1\\timesfaster than the selected time\-limited exact backend\. The gains are strongest on families where the synthesized solver finds a distribution\-specific shortcut, such as Packing LP, MDS, and TSP\.

The exceptions are also informative\. On MDKP, LLM synthesis is faster but trails the quality\-best heuristic by0\.0260\.026, while on Coloring it improves quality but is not uniformly faster than the best heuristic\. Thus, the method improves the average quality–runtime frontier, but does not always recover the cheapest or most accurate specialized procedure\.

Iteration ablation\.We next examine how much of the runtime gain comes from iterative synthesis, rather than from the first generated proposal alone\. After each iteration, we measure the test\-time speedup of the best generated candidate found so far, using the zero\-shot generated solver as the reference point\. This isolates the effect of search depth on the efficiency of the synthesized solver\.

![Refer to caption](https://arxiv.org/html/2605.14141v1/x1.png)

Figure 3:Runtime speedup relative to the zero\-shot generated solver across synthesis iterations\.The bar att=0t=0is the zero\-shot reference\. Fort\>0t\>0, each bar reports the best generated candidate found up to iterationtt, with speedup computed as zero\-shot runtime divided by candidate runtime\. Values above1×1\\timesindicate faster execution than zero\-shot, while values below1×1\\timesindicate slower execution\. Tasks are ordered by final speedup\.Figure[3](https://arxiv.org/html/2605.14141#S6.F3.11)shows that later synthesis iterations often improve runtime beyond the first proposal\. Some gains appear immediately, as in MIS, while others require refinement: TSP improves from about8\.7×8\.7\\timesafter the first iteration to40\.6×40\.6\\timesby iteration44, and Packing LP recovers from initially slower candidates to about2\.1×2\.1\\timesspeedup\. MDKP and MDS also improve after the initial proposal\.

Thus, the loop is not merely selecting among equivalent first\-pass proposals\. Later iterations can recover from poor specializations or refine useful ones into faster implementations\. The MaxSAT case is a counterexample: under this search budget, generated candidates remain slower than zero\-shot\. Iterative synthesis is therefore not uniformly beneficial, but can produce substantial runtime gains when the family contains exploitable structure\.

What did the LLM compile?The speedups in Table[1](https://arxiv.org/html/2605.14141#S6.T1)are not just implementation effects\. In many cases, the selected solver changes the effective computation: it replaces an ambient worst\-case search or generic optimization call with a distribution\-specific procedure inferred from samples\. Table[2](https://arxiv.org/html/2605.14141#S6.T2)gives a compact summary of these computation patterns, while Appendix[E\.3](https://arxiv.org/html/2605.14141#A5.SS3)\(specifically Table[12](https://arxiv.org/html/2605.14141#A5.T12)\) gives the full per\-distribution expressions and notation\.

Table 2:Representative computation patterns discovered by LLM synthesis\.Bounds show dominant runtime terms; full expressions and notation appear in Appendix[E\.3](https://arxiv.org/html/2605.14141#A5.SS3)\.The main pattern is that synthesis often compiles a structural shortcut\. For MAXSAT, the generated solvers exploit latent Boolean rules and bounded local repair instead of searching over all assignments\. For graph problems, they use planted palettes, motif decompositions, hubs, gateways, or small residual kernels instead of treating each graph as an arbitrary coloring, MIS, or MDS instance\. For Packing LP and MDKP, they exploit active resources or bottleneck structure, replacing full LP or exponential knapsack search with sorting, scoring, fractional filling, and bounded repair\. For TSP, they exploit clustered or latent geometric structure to construct and improve tours without running full Held–Karp\-style dynamic programming\.

Thus, the empirical speedups reflect the intended mechanism of distribution\-aware program learning: the LLM is not merely producing faster code for the same computation, but often compiling sampled regularities into a lower\-dimensional or smaller\-residual computation\.

External PACE Dominating Set comparison\.To test whether the synthesis procedure produces useful solvers beyond our controlled benchmark families, we also evaluate on the PACE 2025 Dominating Set heuristic track\. PACE released100100public instances and, after the competition,100100private instances generated in a similar way\(Grobler and Siebertz,[2025](https://arxiv.org/html/2605.14141#bib.bib69); PACE Challenge,[2025](https://arxiv.org/html/2605.14141#bib.bib72)\)\. We split the public set into5050training and5050validation instances, and report on the released private set\. This comparison is not an official PACE score, but it provides an external test against highly engineered competition solvers\. We compare against the top released Dominating Set heuristic submissions: Fontan–Verger, Root, Swats, and Shadoks\(Fontan and Verger,[2025](https://arxiv.org/html/2605.14141#bib.bib70); Luoet al\.,[2025](https://arxiv.org/html/2605.14141#bib.bib74); Swat,[2025](https://arxiv.org/html/2605.14141#bib.bib73); da Fonsecaet al\.,[2025](https://arxiv.org/html/2605.14141#bib.bib71)\)\.

On the private instances, the synthesized solver returns verified\-valid solutions on all100100graphs and runs about two orders of magnitude faster than the released PACE solvers\. The tradeoff is quality: the PACE solvers return smaller dominating sets on every matched instance\. The aggregate gap, however, is moderate: the synthesized solver’s total solution size is about3\.3%3\.3\\%larger than Fontan–Verger, Root, and Shadoks, while its average runtime is roughly99×99\\times–109×109\\timeslower\. Thus, the PACE experiment shows a fast\-but\-lower\-quality operating point on a large external benchmark; full details appear in Appendix[E\.2](https://arxiv.org/html/2605.14141#A5.SS2)\.

Table 3:PACE 2025 Dominating Set comparison\.Lower size and runtime are better\. The synthesized solver is valid on all private instances and much faster, while PACE solvers find smaller dominating sets\. This is a local comparison using released instances and solvers, not an official PACE score; setup and official results are described in\(Grobler and Siebertz,[2025](https://arxiv.org/html/2605.14141#bib.bib69); PACE Challenge,[2025](https://arxiv.org/html/2605.14141#bib.bib72)\)\. Full results appear in Appendix[E\.2](https://arxiv.org/html/2605.14141#A5.SS2)\.†Computed only on the7575instances for which Swats returned verified\-valid solutions\. “LLM/solver size” is the matched total\-size ratio; values above11mean the agent returns a larger dominating set\.

## 7Discussion and Limitations

We study learning when the output is executable solver code rather than a prediction rule\. The main lesson is that samples can improve computation when they reveal structure reused across future instances\. A solver hint captures this structure: it is not a solution to one instance, but information that changes the algorithm used for many later instances\. Empirically, LLM synthesis sometimes performs this amortized algorithm design\. The generated solvers often change the computational scale, replacing broad search or full optimization with distribution\-specific sorting, scoring, bounded repair, kernelization, or structured construction\. Thus, the learned object is closer to a specialized algorithm for the deployment regime than to a tuned heuristic for isolated instances\.

The same view explains the limitations\. The one\-time synthesis cost is useful only when amortized over enough future instances\. The resulting solver is specialized to the sampled regime, so its advantage may degrade under certain distribution shifts\. Finally, because the multi\-agent system searches over a rich program space, different runs may recover different hints, useful proxies, or brittle shortcuts\. As future work, it would be interesting to understand how to improve stability while preserving the discovery of distribution\-specific computation\.

## References

- \[1\]\(2006\)The traveling salesman problem: a computational study\.Princeton Series in Applied Mathematics,Princeton University Press\.External Links:ISBN 9780691129938Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.8.7.3.1.1)\.
- \[2\]F\. Avellaneda\(2020\)A short description of the solver EvalMaxSAT\.InMaxSAT Evaluation 2020: Solver and Benchmark Descriptions,F\. Bacchus, J\. Berg, M\. Järvisalo, and R\. Martins \(Eds\.\),pp\. 8–9\.Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.3.2.3.1.1)\.
- \[3\]M\. Balog, A\. L\. Gaunt, M\. Brockschmidt, S\. Nowozin, and D\. Tarlow\(2017\)DeepCoder: learning to write programs\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ByldLrqlx)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p5.1)\.
- \[4\]K\. Bestuzheva, M\. Besançon, W\. Chen, A\. Chmiela, T\. Donkiewicz, J\. van Doornmalen, L\. Eifler, O\. Gaul, G\. Gamrath, A\. Gleixner,et al\.\(2023\)Enabling research through the SCIP optimization suite 8\.0\.ACM Transactions on Mathematical Software49\(2\),pp\. 1–21\.External Links:[Document](https://dx.doi.org/10.1145/3585516)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.2.1.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.4.3.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.5.4.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.6.5.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.7.6.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.8.7.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[5\]A\. Bogdanov and L\. Trevisan\(2006\)Average\-case complexity\.Foundations and Trends in Theoretical Computer Science2\(1\),pp\. 1–106\.External Links:[Document](https://dx.doi.org/10.1561/0400000004)Cited by:[§1](https://arxiv.org/html/2605.14141#S1.p5.1),[§2](https://arxiv.org/html/2605.14141#S2.p2.1)\.
- \[6\]D\. Brélaz\(1979\)New methods to color the vertices of a graph\.Communications of the ACM22\(4\),pp\. 251–256\.External Links:[Document](https://dx.doi.org/10.1145/359094.359101)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.2.1.2.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.2.1.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[7\]V\. Chvátal\(1979\)A greedy heuristic for the set\-covering problem\.Mathematics of Operations Research4\(3\),pp\. 233–235\.External Links:[Document](https://dx.doi.org/10.1287/moor.4.3.233)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.5.4.2.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.5.4.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[8\]J\. Coll, C\. M\. Li, S\. Li, D\. Habet, and F\. Manyà\(2025\)Solving weighted maximum satisfiability with branch and bound and clause learning\.Computers & Operations Research183,pp\. 107195\.External Links:[Document](https://dx.doi.org/10.1016/j.cor.2025.107195)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.3.2.3.1.1)\.
- \[9\]G\. D\. da Fonseca, F\. Feschet, and Y\. Gerard\(2025\)PACE Solver Description: Shadoks Approach to Minimum Hitting Set and Dominating Set\.In20th International Symposium on Parameterized and Exact Computation \(IPEC 2025\),A\. Agrawal and E\. J\. van Leeuwen \(Eds\.\),Leibniz International Proceedings in Informatics \(LIPIcs\), Vol\.358,Dagstuhl, Germany,pp\. 34:1–34:5\.Note:Keywords: Optimization, heuristic, hitting set, dominating setExternal Links:ISBN 978\-3\-95977\-407\-9,ISSN 1868\-8969,[Link](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.IPEC.2025.34),[Document](https://dx.doi.org/10.4230/LIPIcs.IPEC.2025.34)Cited by:[§6\.2](https://arxiv.org/html/2605.14141#S6.SS2.p10.4),[Table 3](https://arxiv.org/html/2605.14141#S6.T3.21.21.21.21.21.21.21.21.21.21.21.5)\.
- \[10\]G\. B\. Dantzig\(1957\)Discrete\-variable extremum problems\.Operations Research5\(2\),pp\. 266–288\.External Links:[Document](https://dx.doi.org/10.1287/opre.5.2.266)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.4.3.2.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.7.6.2.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[11\]J\. Davies and F\. Bacchus\(2013\)Exploiting the power of MIP solvers in MAXSAT\.InTheory and Applications of Satisfiability Testing – SAT 2013,Lecture Notes in Computer Science, Vol\.7962,pp\. 166–181\.External Links:[Document](https://dx.doi.org/10.1007/978-3-642-39071-5%5F13)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.3.2.3.1.1)\.
- \[12\]J\. Davies and F\. Bacchus\(2013\)Postponing optimization to speed up MAXSAT solving\.InPrinciples and Practice of Constraint Programming – CP 2013,Lecture Notes in Computer Science, Vol\.8124,pp\. 247–262\.External Links:[Document](https://dx.doi.org/10.1007/978-3-642-40627-0%5F21)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.3.2.3.1.1)\.
- \[13\]J\. Devlin, J\. Uesato, S\. Bhupatiraju, R\. Singh, A\. Mohamed, and P\. Kohli\(2017\)RobustFill: neural program learning under noisy i/o\.InProceedings of the 34th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.70,pp\. 990–998\.External Links:[Link](https://proceedings.mlr.press/v70/devlin17a.html)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p5.1)\.
- \[14\]R\. G\. Downey and M\. R\. Fellows\(2013\)Fundamentals of parameterized complexity\.Texts in Computer Science,Springer\.External Links:[Document](https://dx.doi.org/10.1007/978-1-4471-5559-1)Cited by:[§1](https://arxiv.org/html/2605.14141#S1.p5.1),[§2](https://arxiv.org/html/2605.14141#S2.p2.1)\.
- \[15\]F\. Fontan and G\. Verger\(2025\)PACE Solver Description: Reductions and Heuristic Search for the Dominating Set Problem and the Hitting Set Problem\.In20th International Symposium on Parameterized and Exact Computation \(IPEC 2025\),A\. Agrawal and E\. J\. van Leeuwen \(Eds\.\),Leibniz International Proceedings in Informatics \(LIPIcs\), Vol\.358,Dagstuhl, Germany,pp\. 36:1–36:3\.Note:Keywords: dominating set, hitting set, unicost set covering, reductions, large neighborhood search, local searchExternal Links:ISBN 978\-3\-95977\-407\-9,ISSN 1868\-8969,[Link](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.IPEC.2025.36),[Document](https://dx.doi.org/10.4230/LIPIcs.IPEC.2025.36)Cited by:[§6\.2](https://arxiv.org/html/2605.14141#S6.SS2.p10.4),[Table 3](https://arxiv.org/html/2605.14141#S6.T3.8.8.8.8.8.8.8.8.8.8.8.5)\.
- \[16\]J\. J\. Forrest and R\. Lougee\-Heimer\(2005\)CBC user guide\.InEmerging Theory, Methods, and Applications,pp\. 257–277\.External Links:[Document](https://dx.doi.org/10.1287/educ.1053.0020)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.4.3.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.5.4.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.8.7.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[17\]coin\-or/Clp: release releases/1\.17\.7External Links:[Document](https://dx.doi.org/10.5281/zenodo.5839302),[Link](https://doi.org/10.5281/zenodo.5839302)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.7.6.3.1.1)\.
- \[18\]M\. Grobler and S\. Siebertz\(2025\)The PACE 2025 Parameterized Algorithms and Computational Experiments Challenge: Dominating Set and Hitting Set\.In20th International Symposium on Parameterized and Exact Computation \(IPEC 2025\),A\. Agrawal and E\. J\. van Leeuwen \(Eds\.\),Leibniz International Proceedings in Informatics \(LIPIcs\), Vol\.358,Dagstuhl, Germany,pp\. 32:1–32:17\.Note:Keywords: PACE 2025 Report, Dominating Set, Hitting Set, Algorithm Engineering, FPT, HeuristicsExternal Links:ISBN 978\-3\-95977\-407\-9,ISSN 1868\-8969,[Link](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.IPEC.2025.32),[Document](https://dx.doi.org/10.4230/LIPIcs.IPEC.2025.32)Cited by:[§6\.2](https://arxiv.org/html/2605.14141#S6.SS2.p10.4),[Table 3](https://arxiv.org/html/2605.14141#S6.T3)\.
- \[19\]S\. Gulwani\(2011\)Automating string processing in spreadsheets using input\-output examples\.InProceedings of the 38th Annual ACM SIGPLAN\-SIGACT Symposium on Principles of Programming Languages,pp\. 317–330\.External Links:[Document](https://dx.doi.org/10.1145/1926385.1926423)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p5.1)\.
- \[20\]Gurobi Optimization, LLC\(2026\)Gurobi optimizer reference manual\.External Links:[Link](https://www.gurobi.com/)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.2.1.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.3.2.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.4.3.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.5.4.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.6.5.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.7.6.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.8.7.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[21\]M\. Held and R\. M\. Karp\(1962\)A dynamic programming approach to sequencing problems\.Journal of the Society for Industrial and Applied Mathematics10\(1\),pp\. 196–210\.External Links:[Document](https://dx.doi.org/10.1137/0110015)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.8.7.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[22\]K\. Helsgaun\(2000\)An effective implementation of the lin–kernighan traveling salesman heuristic\.European Journal of Operational Research126\(1\),pp\. 106–130\.External Links:[Document](https://dx.doi.org/10.1016/S0377-2217%2899%2900284-2)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.8.7.2.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[23\]D\. Hendrycks, S\. Basart, S\. Kadavath, M\. Mazeika, A\. Arora, E\. Guo, C\. Burns, S\. Y\. Puranik, H\. He, D\. Song, and J\. Steinhardt\(2021\)Measuring coding challenge competence with APPS\.InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks,External Links:[Link](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/c24cd76e1ce41366a4bbe8a49b02a028-Abstract-round2.html)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p5.1)\.
- \[24\]HiGHS Development Team\(2024\)HiGHS: high\-performance parallel linear optimization software\.External Links:[Link](https://highs.dev/)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.2.1.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.4.3.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.5.4.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.6.5.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.7.6.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[25\]Q\. Huang, J\. Vora, P\. Liang, and J\. Leskovec\(2024\)MLAgentBench: evaluating language agents on machine learning experimentation\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 20271–20309\.External Links:[Link](https://proceedings.mlr.press/v235/huang24b.html)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p5.1)\.
- \[26\]Q\. Huangfu and J\. A\. J\. Hall\(2018\)Parallelizing the dual revised simplex method\.Mathematical Programming Computation10\(1\),pp\. 119–142\.External Links:[Document](https://dx.doi.org/10.1007/s12532-017-0130-5)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.2.1.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.4.3.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.5.4.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.6.5.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.7.6.3.1.1)\.
- \[27\]A\. Ignatiev, A\. Morgado, and J\. Marques\-Silva\(2019\)RC2: an efficient MaxSAT solver\.Journal on Satisfiability, Boolean Modeling and Computation11\(1\),pp\. 53–64\.External Links:[Document](https://dx.doi.org/10.3233/SAT190116)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.3.2.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[28\]R\. Impagliazzo\(1995\)A personal view of average\-case complexity\.InProceedings of the Tenth Annual Structure in Complexity Theory Conference,pp\. 134–147\.External Links:[Document](https://dx.doi.org/10.1109/SCT.1995.514853)Cited by:[§1](https://arxiv.org/html/2605.14141#S1.p5.1),[§2](https://arxiv.org/html/2605.14141#S2.p2.1)\.
- \[29\]C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan\(2024\)SWE\-bench: can language models resolve real\-world GitHub issues?\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=VTF8yNQM66)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p5.1)\.
- \[30\]P\. Kerschke, H\. H\. Hoos, F\. Neumann, and H\. Trautmann\(2019\)Automated algorithm selection: survey and perspectives\.Evolutionary Computation27\(1\),pp\. 3–45\.External Links:[Document](https://dx.doi.org/10.1162/evco%5Fa%5F00242)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p3.1)\.
- \[31\]L\. Kotthoff\(2014\)Algorithm selection for combinatorial search problems: a survey\.AI Magazine35\(3\),pp\. 48–60\.External Links:[Document](https://dx.doi.org/10.1609/aimag.v35i3.2460)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p3.1)\.
- \[32\]S\. Lamm, C\. Schulz, D\. Strash, R\. Williger, and H\. Zhang\(2019\)Exactly solving the maximum weight independent set problem on large real\-world graphs\.InProceedings of the Twenty\-First Workshop on Algorithm Engineering and Experiments,pp\. 144–158\.External Links:[Document](https://dx.doi.org/10.1137/1.9781611975499.12)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.6.5.3.1.1)\.
- \[33\]A\. H\. Land and A\. G\. Doig\(1960\)An automatic method of solving discrete programming problems\.Econometrica28\(3\),pp\. 497–520\.External Links:[Document](https://dx.doi.org/10.2307/1910129)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.2.1.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.4.3.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.5.4.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.6.5.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[34\]L\. A\. Levin\(1986\)Average case complete problems\.SIAM Journal on Computing15\(1\),pp\. 285–286\.External Links:[Document](https://dx.doi.org/10.1137/0215020)Cited by:[§1](https://arxiv.org/html/2605.14141#S1.p5.1),[§2](https://arxiv.org/html/2605.14141#S2.p2.1)\.
- \[35\]Y\. Li, D\. Choi, J\. Chung, N\. Kushman, J\. Schrittwieser, R\. Leblond, T\. Eccles, J\. Keeling, F\. Gimeno, A\. Dal Lago, T\. Hubert, P\. Choy, C\. de Masson d’Autume, I\. Babuschkin, X\. Chen, P\. Huang, J\. Welbl, S\. Gowal, A\. Cherepanov, J\. Molloy, D\. J\. Mankowitz, E\. S\. Robson, P\. Kohli, N\. de Freitas, K\. Kavukcuoglu, and O\. Vinyals\(2022\)Competition\-level code generation with AlphaCode\.Science378\(6624\),pp\. 1092–1097\.External Links:[Document](https://dx.doi.org/10.1126/science.abq1158)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p5.1)\.
- \[36\]S\. Lin\(1965\)Computer solutions of the traveling salesman problem\.Bell System Technical Journal44\(10\),pp\. 2245–2269\.External Links:[Document](https://dx.doi.org/10.1002/j.1538-7305.1965.tb04146.x)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.8.7.2.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[37\]M\. Lindauer, H\. H\. Hoos, F\. Hutter, and T\. Schaub\(2015\)AutoFolio: an automatically configured algorithm selector\.Journal of Artificial Intelligence Research53,pp\. 745–778\.External Links:[Document](https://dx.doi.org/10.1613/jair.4726)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p3.1)\.
- \[38\]C\. Luo, Q\. Zhang, Z\. Su, and Z\. Lü\(2025\)PACE Solver Description: Weighting\-Based Local Search Heuristic for the Hitting Set Problem\.In20th International Symposium on Parameterized and Exact Computation \(IPEC 2025\),A\. Agrawal and E\. J\. van Leeuwen \(Eds\.\),Leibniz International Proceedings in Informatics \(LIPIcs\), Vol\.358,Dagstuhl, Germany,pp\. 40:1–40:4\.Note:Keywords: PACE 2025, Dominating Set, Hitting Set, Heuristic Optimization, Weighted Local SearchExternal Links:ISBN 978\-3\-95977\-407\-9,ISSN 1868\-8969,[Link](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.IPEC.2025.40),[Document](https://dx.doi.org/10.4230/LIPIcs.IPEC.2025.40)Cited by:[§6\.2](https://arxiv.org/html/2605.14141#S6.SS2.p10.4),[Table 3](https://arxiv.org/html/2605.14141#S6.T3.12.12.12.12.12.12.12.12.12.12.12.5)\.
- \[39\]T\. Lykouris and S\. Vassilvitskii\(2018\)Competitive caching with machine learned advice\.InProceedings of the 35th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.80,pp\. 3296–3305\.External Links:[Link](https://proceedings.mlr.press/v80/lykouris18a.html)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p4.1)\.
- \[40\]R\. Martins, V\. Manquinho, and I\. Lynce\(2014\)Open\-WBO: a modular MaxSAT solver\.InTheory and Applications of Satisfiability Testing – SAT 2014,Lecture Notes in Computer Science, Vol\.8561,pp\. 438–445\.External Links:[Document](https://dx.doi.org/10.1007/978-3-319-09284-3%5F33)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.3.2.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[41\]M\. Mitzenmacher and S\. Vassilvitskii\(2022\)Algorithms with predictions\.Communications of the ACM65\(7\),pp\. 33–35\.External Links:[Document](https://dx.doi.org/10.1145/3528087)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p4.1)\.
- \[42\]M\. Nye, L\. Hewitt, J\. B\. Tenenbaum, and A\. Solar\-Lezama\(2019\)Learning to infer program sketches\.InProceedings of the 36th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.97,pp\. 4861–4870\.External Links:[Link](https://proceedings.mlr.press/v97/nye19a.html)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p5.1)\.
- \[43\]PACE Challenge\(2025\)PACE 2025 Results\.Note:[https://pacechallenge\.org/2025/results/](https://pacechallenge.org/2025/results/)Accessed 2026\-05\-07Cited by:[item 4](https://arxiv.org/html/2605.14141#S1.I1.i4.p1.1),[§6\.2](https://arxiv.org/html/2605.14141#S6.SS2.p10.4),[Table 3](https://arxiv.org/html/2605.14141#S6.T3)\.
- \[44\]L\. Perron, F\. Didier, and S\. Gay\(2023\)The CP\-SAT\-LP solver\.In29th International Conference on Principles and Practice of Constraint Programming,Leibniz International Proceedings in Informatics, Vol\.280,pp\. 3:1–3:2\.External Links:[Document](https://dx.doi.org/10.4230/LIPIcs.CP.2023.3),[Link](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CP.2023.3)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.2.1.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.4.3.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.5.4.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.6.5.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.8.7.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[45\]L\. Perron and V\. Furnon\(2025\-02\-17\)OR\-Tools\.Google\.External Links:[Link](https://developers.google.com/optimization/)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.2.1.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.4.3.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.5.4.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.6.5.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.7.6.3.1.1),[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.8.7.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[46\]M\. Piotrów\(2020\)UWrMaxSat: efficient solver for MaxSAT and pseudo\-boolean problems\.In2020 IEEE 32nd International Conference on Tools with Artificial Intelligence,pp\. 132–136\.External Links:[Document](https://dx.doi.org/10.1109/ICTAI50040.2020.00031)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.3.2.3.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[47\]J\. R\. Rice\(1976\)The algorithm selection problem\.Advances in Computers15,pp\. 65–118\.External Links:[Document](https://dx.doi.org/10.1016/S0065-2458%2808%2960520-3)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p3.1)\.
- \[48\]D\. J\. Rosenkrantz, R\. E\. Stearns, and P\. M\. Lewis\(1977\)An analysis of several heuristics for the traveling salesman problem\.SIAM Journal on Computing6\(3\),pp\. 563–581\.External Links:[Document](https://dx.doi.org/10.1137/0206041)Cited by:[Table 7](https://arxiv.org/html/2605.14141#A4.T7.1.1.1.1.1.1.1.1.1.8.7.2.1.1),[§6\.1](https://arxiv.org/html/2605.14141#S6.SS1.p2.1)\.
- \[49\]T\. Roughgarden\(2019\)Beyond worst\-case analysis\.Communications of the ACM62\(3\),pp\. 88–96\.External Links:[Document](https://dx.doi.org/10.1145/3232535)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p4.1)\.
- \[50\]S\. Shalev\-Shwartz and S\. Ben\-David\(2014\)Understanding machine learning: from theory to algorithms\.Cambridge University Press\.External Links:ISBN 9781107057135Cited by:[§5](https://arxiv.org/html/2605.14141#S5.p2.1)\.
- \[51\]S\. Singhal, P\. Mishra, E\. Malach, and T\. Galanti\(2025\)LLM priors for ERM over programs\.External Links:2510\.14331,[Document](https://dx.doi.org/10.48550/arXiv.2510.14331),[Link](https://arxiv.org/abs/2510.14331)Cited by:[§1](https://arxiv.org/html/2605.14141#S1.p1.1)\.
- \[52\]D\. A\. Spielman and S\. Teng\(2004\)Smoothed analysis of algorithms: why the simplex algorithm usually takes polynomial time\.Journal of the ACM51\(3\),pp\. 385–463\.External Links:[Document](https://dx.doi.org/10.1145/990308.990310)Cited by:[§1](https://arxiv.org/html/2605.14141#S1.p5.1),[§2](https://arxiv.org/html/2605.14141#S2.p2.1)\.
- \[53\]S\. Swat\(2025\)PACE Solver Description: HitS&DoSeS \- Exact and Heuristic Solvers for the Dominating Set and Hitting Set Problems\.In20th International Symposium on Parameterized and Exact Computation \(IPEC 2025\),A\. Agrawal and E\. J\. van Leeuwen \(Eds\.\),Leibniz International Proceedings in Informatics \(LIPIcs\), Vol\.358,Dagstuhl, Germany,pp\. 38:1–38:4\.Note:Keywords: dominating set, hitting set, exact algorithms, heuristic algorithms, large graphs, combinatorial optimizationExternal Links:ISBN 978\-3\-95977\-407\-9,ISSN 1868\-8969,[Link](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.IPEC.2025.38),[Document](https://dx.doi.org/10.4230/LIPIcs.IPEC.2025.38)Cited by:[§6\.2](https://arxiv.org/html/2605.14141#S6.SS2.p10.4),[Table 3](https://arxiv.org/html/2605.14141#S6.T3.17.17.17.17.17.17.17.17.17.17.17.6)\.
- \[54\]L\. G\. Valiant\(1984\)A theory of the learnable\.InProceedings of the Sixteenth Annual ACM Symposium on Theory of Computing,pp\. 436–445\.External Links:[Document](https://dx.doi.org/10.1145/800057.808710)Cited by:[§1](https://arxiv.org/html/2605.14141#S1.p1.1),[§3](https://arxiv.org/html/2605.14141#S3.p1.2)\.
- \[55\]V\. N\. Vapnik and A\. Y\. Chervonenkis\(1971\)On the uniform convergence of relative frequencies of events to their probabilities\.Theory of Probability and Its Applications16\(2\),pp\. 264–280\.Cited by:[§1](https://arxiv.org/html/2605.14141#S1.p1.1),[§3](https://arxiv.org/html/2605.14141#S3.p1.2)\.
- \[56\]V\. N\. Vapnik\(1998\)Statistical learning theory\.Wiley\.External Links:ISBN 9780471030034Cited by:[§1](https://arxiv.org/html/2605.14141#S1.p1.1),[§3](https://arxiv.org/html/2605.14141#S3.p1.2)\.
- \[57\]R\. Williams, C\. P\. Gomes, and B\. Selman\(2003\)Backdoors to typical case complexity\.InProceedings of the Eighteenth International Joint Conference on Artificial Intelligence,pp\. 1173–1178\.Cited by:[§1](https://arxiv.org/html/2605.14141#S1.p5.1),[§2](https://arxiv.org/html/2605.14141#S2.p2.1)\.
- \[58\]L\. Xu, H\. H\. Hoos, and K\. Leyton\-Brown\(2010\)Hydra: automatically configuring algorithms for portfolio\-based selection\.InProceedings of the Twenty\-Fourth AAAI Conference on Artificial Intelligence,pp\. 210–216\.Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p3.1)\.
- \[59\]L\. Xu, F\. Hutter, H\. H\. Hoos, and K\. Leyton\-Brown\(2008\)SATzilla: portfolio\-based algorithm selection for SAT\.Journal of Artificial Intelligence Research32,pp\. 565–606\.External Links:[Document](https://dx.doi.org/10.1613/jair.2490)Cited by:[§2](https://arxiv.org/html/2605.14141#S2.p3.1)\.

## Appendix ABroader Impacts

This work studies methods for learning distribution\-specific solvers from examples\. Potential positive impacts include reducing the computational cost of repeated optimization workloads, making specialized solvers easier to construct, and improving access to efficient algorithmic tools in scientific, engineering, and logistics settings\. Potential negative impacts arise if synthesized solvers are deployed outside the distributions for which they were inferred, where brittle specialization may lead to incorrect or inefficient decisions\. More broadly, automated solver synthesis could also reduce the barrier to optimizing objectives in harmful or poorly governed applications\. We therefore view validation on held\-out instances, explicit distributional assumptions, fallback mechanisms, and human review of generated code as important safeguards\.

## Appendix BA Planted Horn\-Backdoor Family

We give a concrete distribution satisfying the separation assumption in Section[5\.3](https://arxiv.org/html/2605.14141#S5.SS3)\. Fix an unknown setB⊆\[d\]B\\subseteq\[d\]of sizekk, and generate a CNF formulaF∼DBF\\sim D\_\{B\}as a conjunction ofMMindependently sampled clauses\. Each clause is Horn with probability1−ρ1\-\\rho\. With probabilityρ\\rho, it is a non\-Horn clause of the form

xi∨xj∨⋁ℓ∈T¬xℓ,x\_\{i\}\\vee x\_\{j\}\\vee\\bigvee\_\{\\ell\\in T\}\\neg x\_\{\\ell\},wherei∼Unif\(B\)i\\sim\\mathrm\{Unif\}\(B\),j∼Unif\(\[d\]∖B\)j\\sim\\mathrm\{Unif\}\(\[d\]\\setminus B\), andT⊆\[d\]∖\(B∪\{j\}\)T\\subseteq\[d\]\\setminus\(B\\cup\\\{j\\\}\)is sampled by any fixed rule\. For every assignment to the variables inBB, each non\-Horn clause is either satisfied or reduced to a Horn clause over the remaining variables\. ThusBBis a strong backdoor to Horn\-SAT\.

Define the salience statistic

σi\(F\)=1M∑C∈F𝟏\{Cis non\-Horn andxiappears positively inC\}\.\\sigma\_\{i\}\(F\)=\\frac\{1\}\{M\}\\sum\_\{C\\in F\}\\mathbf\{1\}\\\{C\\text\{ is non\-Horn and \}x\_\{i\}\\text\{ appears positively in \}C\\\}\.Then

𝔼\[σi\(F\)\]=\{ρ/k,i∈B,ρ/\(d−k\),i∉B\.\\mathbb\{E\}\[\\sigma\_\{i\}\(F\)\]=\\begin\{cases\}\\rho/k,&i\\in B,\\\\ \\rho/\(d\-k\),&i\\notin B\.\\end\{cases\}Hence the variable\-level separation assumption holds withq1=ρ/kq\_\{1\}=\\rho/k,q0=ρ/\(d−k\)q\_\{0\}=\\rho/\(d\-k\), andγ=ρ\(1/k−1/\(d−k\)\)\\gamma=\\rho\(1/k\-1/\(d\-k\)\), which is positive wheneverk<d/2k<d/2\.

## Appendix CAdditional Method Details

This appendix records the experimental protocol behind Section[4](https://arxiv.org/html/2605.14141#S4)\. The goal is to make clear what information was available to the synthesis agent, how candidates were represented and selected, and how invalid candidates were handled\. The protocol enforces a separation between distribution\-level inference and deployment\-time solving: the analysis program may use the public training sample to estimate a reusable summary, while the deployed solver must solve new public instances using only that summary and the instance itself\.

A single LLM\-generated candidate is not treated as the learned algorithm\. Instead, the synthesis loop is a proposal\-and\-selection procedure over solver hints and implementations\. Different proposals may encode different structural explanations of the same sample, and some may be brittle, slow, or incorrect\. The deterministic evaluator converts this high\-variance proposal process into a selected deployed solver by filtering candidates on public validation quality, optimality, and runtime\.

Public information and leakage boundary\.For each target, the synthesis agent receives public training and validation instances, together with a small task manifest\. The manifest specifies the problem type, the required solution format, the normalized\-quality metric, and basic instance\-size information\. It also states that the instances are drawn from a shared but unknown structured distribution\.

The public view excludes all fields used to define or score the hidden family\. In particular, it does not include the distribution\-family identity, the planted rule, optimum solutions, optimum objective values, or evaluator\-only metadata\. This boundary is intended to model deployment: the learner knows the ambient optimization problem and the scoring rule, but not the generative mechanism that produces the deployment instances\.

All synthesis stages use a shared instruction emphasizing that the goal is to infer reusable structure from public samples and compile it into a fast solver\. The instruction states that validation quality is the primary selection signal, and that runtime matters once quality is high\. It does not reveal hidden\-rule metadata or test performance\.

Candidate representation and stage contracts\.Each candidate has the formc=\(Hc,Ac,ac,sc\)c=\(H\_\{c\},A\_\{c\},a\_\{c\},s\_\{c\}\), whereHcH\_\{c\}is a structured hypothesis,AcA\_\{c\}is a train\-time analysis program,ac=Ac\(Strpub\)a\_\{c\}=A\_\{c\}\(S\_\{\\mathrm\{tr\}\}^\{\\mathrm\{pub\}\}\)is the recovered summary, andscs\_\{c\}is the deployment\-time solver\. The analysis program performs distribution\-level inference from the public training sample\. The solver then maps a new public instance and the recovered summary to a candidate solution,z=sc\(xpub,ac\)z=s\_\{c\}\(x^\{\\mathrm\{pub\}\},a\_\{c\}\)\. Thusaca\_\{c\}is the empirical analogue of the solver hint in Section[3](https://arxiv.org/html/2605.14141#S3)\.

The fields elicited at each stage are summarized in Table[4](https://arxiv.org/html/2605.14141#A3.T4)\. These stage contracts prevent hypothesis formation, evidence extraction, and solver construction from collapsing into a single opaque generation step\. They also make the recovered hint explicit and inspectable after synthesis\.

Table 4:Stage contracts used by the synthesis procedure\.The diversity key is used only for beam selection and is not part of the validation score\.Synthesis loop\.Algorithm[1](https://arxiv.org/html/2605.14141#alg1)gives the full synthesis loop\. Each round constructs a batch of candidate hint–solver pairs, executes their analysis programs on the public training sample, evaluates their solvers on the public train and validation splits, and updates a diversity\-preserving beam\. After the final round, the selected candidate is the best evaluated candidate across all rounds, not necessarily the best member of the final beam\. Thus the deployed solver is validation\-selected from a portfolio of generated candidates rather than taken from a single LLM proposal\.

Algorithm 1LLM\-based distribution\-to\-algorithm synthesis1:Public train/validation sets

Strpub,SvalpubS\_\{\\mathrm\{tr\}\}^\{\\mathrm\{pub\}\},S\_\{\\mathrm\{val\}\}^\{\\mathrm\{pub\}\}; iterations

RR, beam width

BB, candidate budget

KK
2:Learned solvers^\\hat\{s\}and recovered summarya^\\hat\{a\}

3:Initialize beam

ℬ0\\mathcal\{B\}\_\{0\}with diverse hypothesis directives

4:Initialize evaluated set

ℰ←∅\\mathcal\{E\}\\leftarrow\\emptyset
5:for

r=0,…,R−1r=0,\\ldots,R\-1do

6:Build

KKcandidate prompts from

ℬr\\mathcal\{B\}\_\{r\}⊳\\trianglerightseed / refine / fork / replace / push runtime / push quality

7:foreach prompt

ppdo

8:Generate hypothesis

Hc\{\\color\[rgb\]\{0\.4296875,0\.28125,0\.66796875\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.4296875,0\.28125,0\.66796875\}H\_\{c\}\}
9:Generate analysis program

Ac\{\\color\[rgb\]\{0\.16796875,0\.40234375,0\.69140625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.16796875,0\.40234375,0\.69140625\}A\_\{c\}\}
10:Execute analysis

ac←Ac\(Strpub\)\{\\color\[rgb\]\{0\.16796875,0\.40234375,0\.69140625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.16796875,0\.40234375,0\.69140625\}a\_\{c\}\\leftarrow A\_\{c\}\(S\_\{\\mathrm\{tr\}\}^\{\\mathrm\{pub\}\}\)\}
11:Generate solver

sc\{\\color\[rgb\]\{0\.171875,0\.5,0\.375\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.171875,0\.5,0\.375\}s\_\{c\}\}conditioned on

\(Hc,Ac,ac\)\(\{\\color\[rgb\]\{0\.4296875,0\.28125,0\.66796875\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.4296875,0\.28125,0\.66796875\}H\_\{c\}\},\\,\{\\color\[rgb\]\{0\.16796875,0\.40234375,0\.69140625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.16796875,0\.40234375,0\.69140625\}A\_\{c\}\},\\,\{\\color\[rgb\]\{0\.16796875,0\.40234375,0\.69140625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.16796875,0\.40234375,0\.69140625\}a\_\{c\}\}\)
12:Evaluate

sc\(⋅,ac\)s\_\{c\}\(\\cdot,a\_\{c\}\)on

StrpubS\_\{\\mathrm\{tr\}\}^\{\\mathrm\{pub\}\}and

SvalpubS\_\{\\mathrm\{val\}\}^\{\\mathrm\{pub\}\}
13:Add

c=\(Hc,Ac,ac,sc\)c=\(\{\\color\[rgb\]\{0\.4296875,0\.28125,0\.66796875\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.4296875,0\.28125,0\.66796875\}H\_\{c\}\},\{\\color\[rgb\]\{0\.16796875,0\.40234375,0\.69140625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.16796875,0\.40234375,0\.69140625\}A\_\{c\}\},\{\\color\[rgb\]\{0\.16796875,0\.40234375,0\.69140625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.16796875,0\.40234375,0\.69140625\}a\_\{c\}\},\{\\color\[rgb\]\{0\.171875,0\.5,0\.375\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.171875,0\.5,0\.375\}s\_\{c\}\}\)to

ℰ\\mathcal\{E\}
14:endfor

15:Scoreeach

c∈ℰc\\in\\mathcal\{E\}by

Score\(c\)=\(Qval\(c\),Oval\(c\),−Tval\(c\)\)\.\\mathrm\{Score\}\(c\)=\\bigl\(Q\_\{\\mathrm\{val\}\}\(c\),\\,O\_\{\\mathrm\{val\}\}\(c\),\\,\-T\_\{\\mathrm\{val\}\}\(c\)\\bigr\)\.
16:Beam update: set

ℬr\+1\\mathcal\{B\}\_\{r\+1\}using a diversity\-preserving rule: keep the best candidate per diversity key, then fill remaining slots by score

17:Extractsummaries, scores, and failure cases from

ℬr\+1\\mathcal\{B\}\_\{r\+1\}for the next refinement round

18:endfor

19:Select

c⋆∈ℰc^\{\\star\}\\in\\mathcal\{E\}with highest

Score\(c⋆\)\\mathrm\{Score\}\(c^\{\\star\}\)
20:Re\-run analysis

a^←Ac⋆\(Strpub\)\{\\color\[rgb\]\{0\.16796875,0\.40234375,0\.69140625\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.16796875,0\.40234375,0\.69140625\}\\hat\{a\}\\leftarrow A\_\{c^\{\\star\}\}\(S\_\{\\mathrm\{tr\}\}^\{\\mathrm\{pub\}\}\)\}
21:returns^:=sc⋆\\hat\{s\}:=s\_\{c^\{\\star\}\}anda^\\hat\{a\}

Beam selection and refinement\.Each hypothesis includes a diversity key identifying its broad explanation class, such as a separator\-based explanation, a bottleneck\-resource explanation, a latent\-cluster explanation, or a geometric\-template explanation\. After each round, candidates are ranked by the validation\-aware score in Algorithm[1](https://arxiv.org/html/2605.14141#alg1)\. The next beam is then formed in two passes\. First, the best candidate for each diversity key is retained\. Second, any remaining slots are filled by the highest\-scoring candidates overall\.

This rule encourages exploration early in synthesis without preventing a strong candidate from surviving once diversity has been exhausted\. Refinement prompts include the parent hypothesis, the recovered summary, aggregate train and validation metrics, and representative failure cases\. They do not include hidden\-rule metadata or test performance\. The refinement action asks the agent to refine the current explanation, fork to a different explanation, improve quality, reduce runtime, or replace a hypothesis that appears to rely on an accidental shortcut\.

This design is important because LLM synthesis is stochastic\. The search loop is not only an optimization over code implementations; it is also an exploration over competing explanations of the distribution\. Diversity keys keep multiple structural hypotheses alive, while validation selection determines which hypothesis actually compiles into a reliable and fast solver\.

Recovered summaries\.The recovered summaryaca\_\{c\}is not restricted to a fixed representation\. Its form depends on the structure inferred by the candidate\. In graph problems, summaries may contain vertex\-role scores, cluster assignments, separator candidates, or recurring motif statistics\. In MaxSAT, they may contain variable\-salience scores, polarity statistics, backbone estimates, or clause\-position signals\. In packing and knapsack problems, they may contain estimated resource bottlenecks, item\-class prototypes, or active\-constraint patterns\. In TSP\-like tasks, they may contain geometric prototypes, coarsened tour skeletons, or detected coordinate transformations\.

These summaries are not directly supervised or scored\. They matter only through the quality and runtime of the solver that uses them\. Their role is to amortize distribution\-level inference: the analysis program may spend computation once on the public training sample, while the deployed solver should use the resulting summary cheaply on each new instance\. This separation is what distinguishes the method from per\-instance prompting or repeated test\-time reasoning: the expensive distribution\-level inference happens before deployment, and the deployed solver is ordinary executable code conditioned on the recovered summary\.

Failure handling\.A candidate can fail during analysis, solver construction, or instance\-level execution\. If the analysis programAcA\_\{c\}raises an error or exceeds its time budget, the candidate receives zero quality on every instance and is assigned a large failure runtime\. The same convention applies if the solverscs\_\{c\}fails to load, for example because of a syntax error or unavailable dependency\.

If bothAcA\_\{c\}andscs\_\{c\}load successfully but the solver returns an invalid output on some instances, only those instances receive normalized quality zero\. The remaining instances are scored normally\. This convention keeps the synthesis loop fully automatic: malformed candidates are not manually repaired, and partial failures degrade the validation score rather than requiring special\-case intervention\.

Connection to the theoretical framework\.The protocol implements the hint\-learning decomposition from Section[3](https://arxiv.org/html/2605.14141#S3)\. The hypothesisHcH\_\{c\}proposes a possible family of reusable structure, the analysis programAcA\_\{c\}estimates a hint from samples, and the solverscs\_\{c\}compiles the estimated hint into an executable algorithm\. Unlike the idealized theory, the hint space and evidence statistics are not fixed in advance\. They are proposed by the LLM, and the resulting candidates are selected by deterministic train\-validation evaluation\. Thus the method is better viewed as search over a proposal distribution of candidate hints and solvers, rather than as a deterministic sample\-to\-solver map\. Candidates survive only when their inferred summaries yield high\-quality solutions with low deployment runtime on public validation instances\.

## Appendix DAdditional Experimental Details

### D\.1Benchmark Distributions

Table 5:Benchmark target distributions\.The benchmark contains2121targets: seven problem classes with three hidden distribution families per class\. Each target contains6464training instances,3232validation instances, and500500held\-out test instances\. Sizes are listed in the same order as the distribution families\.Table[5](https://arxiv.org/html/2605.14141#A4.T5)summarizes the benchmark targets\. The benchmark contains2121designed targets, spanning seven problem classes with three hidden distribution families per class\. The benchmark contains2121target distributions, each with6464training instances,3232validation instances, and500500held\-out test instances\. Each target defines a distribution over structured optimization instances, rather than a collection of arbitrary worst\-case instances\. The goal is to test whether a synthesis procedure can infer reusable regularities from a small sample and compile them into a specialized solver\. Each completed target contains6464training instances,3232validation instances, and500500held\-out test instances\.

The distributions are designed to cover several kinds of exploitable structure: planted assignments, latent classes, recurring bottlenecks, geometric layouts, motif decompositions, and hidden resource constraints\. We briefly describe each family below\.

Graph coloring\.The coloring targets contain graphs with planted low\-color structure\. In*Ring\-template*, vertices are partitioned into local blocks with compatible palette assignments and ring\-style bridges between neighboring blocks\. In*Overlapping\-palette*, blocks share shifted and overlapping palettes, so a solver must recover the global palette rather than color each block independently\. In*Separator\-trap*, separator vertices create long\-range color\-reuse constraints, making local coloring rules unreliable unless the global palette structure is inferred\.

MAXSAT\.The MAXSAT targets contain formulas with hidden assignments induced by small latent rules\. In*Community\-parity*, variables are partitioned into communities whose values are controlled by parity\-style functions of a few anchor bits\. In*Last\-clause signal*, the final clause reveals an anchor pattern, while the remaining variables copy or negate anchor bits according to a hidden rule table\. In*Latent\-backdoor*, each instance comes from one of several regimes in which variable blocks are governed by anchor\-dependent Boolean rules, with additional noise and bridge clauses\.

Maximum independent set\.The MIS targets are block\-structured graph distributions\. In*Clique\-path*, blocks alternate between clique\-like and path\-like motifs, with sparse bridge edges between adjacent blocks\. In*Motif\-bridge*, each instance samples a latent sequence of motifs, including cliques, cycles, bicliques, and crown\-like gadgets\. Good solvers should decompose the graph into motifs and account for bridge conflicts\.

Minimum dominating set\.The MDS targets emphasize different coverage structures\. In*Gateway\-hub*, clusters contain hubs and gateways, where gateways provide overlapping coverage across neighboring clusters\. In*Geometric\-anchor*, vertices come from geometric clusters with heterogeneous density and connector edges\. In*Star\-kernel*, each cluster has a stable hub that dominates most local vertices, with sparse connectors between hubs\. These targets test whether the solver can identify high\-coverage vertices rather than relying only on generic degree heuristics\.

Packing LP\.The packing LP targets contain continuous packing problems with recurring resource structure\. In*Block\-coupled*, items belong to latent resource blocks, but a shared coupling resource limits the combination of otherwise attractive block\-local items\. In*Active\-resource*, each instance has a hidden active\-resource regime with recurring dual\-price patterns\. In*Single\-bottleneck*, one resource is consistently much tighter than the others, so the LP optimum is largely governed by value per unit of the hidden bottleneck resource\.

Multidimensional knapsack\.The MDKP targets are integer packing problems with latent item and resource structure\. In*Decoy\-complement*, high\-value decoy items look attractive locally but consume a scarce hidden resource, while complementary item classes combine better across resources\. In*Latent\-class*, items come from hidden resource\-consumption classes, and the active bottleneck changes across instances\. In*Single\-resource*, one hidden resource is typically tight, making density with respect to that resource a strong but not always exact rule\.

Traveling salesperson problem\.The TSP targets contain Euclidean or Euclidean\-like geometric structure\. In*Clustered Euclidean*, cities are sampled from balanced clusters arranged around a ring, so good tours combine short intra\-cluster paths with the correct inter\-cluster order\. In*Latent\-metric*, each instance comes from one of several geometric regimes, including ring clusters, paired ribbons, and barrier\-bridge layouts\. This target tests whether the solver can classify the latent geometry and choose an appropriate tour construction strategy\.

Together, these targets test the central hypothesis of the benchmark: sampled instances can reveal distribution\-specific algorithmic structure, and a solver that identifies this structure can improve the quality–runtime tradeoff relative to generic heuristics and off\-the\-shelf optimization backends\.

### D\.2Evaluation Metrics

We evaluate each solver on held\-out instances using three quantities: normalized quality, optimality rate, and runtime\. The goal is to compare solvers across problem classes with different objective scales and directions\. Some tasks are maximization problems, such as MAXSAT, MIS, Packing LP, and MDKP\. Other tasks are minimization problems, such as Coloring, MDS, and TSP\. We therefore convert all objective values to a common normalized quality score in\[0,1\]\[0,1\], where larger is better\. Table[6](https://arxiv.org/html/2605.14141#A4.T6)summarizes the normalization used for each problem class\.

Table 6:Normalized quality metrics used for evaluation\.Higher is better, and optimal solutions have quality11\. For maximization problems, quality is the algorithm value divided by the optimum value; for minimization problems, it is the optimum value divided by the algorithm value\. Invalid or infeasible outputs receive quality0\.Normalized quality\.For each instancexx, letqA\(x\)q\_\{A\}\(x\)denote the normalized quality of solverAA\. For maximization problems, ifA\(x\)A\(x\)is feasible and the optimum value is positive, we set

qA\(x\)=valA\(x\)valopt\(x\)\.q\_\{A\}\(x\)=\\frac\{\\mathrm\{val\}\_\{A\}\(x\)\}\{\\mathrm\{val\}\_\{\\rm opt\}\(x\)\}\.For minimization problems, ifA\(x\)A\(x\)is feasible and the returned cost is positive, we set

qA\(x\)=costopt\(x\)costA\(x\)\.q\_\{A\}\(x\)=\\frac\{\\mathrm\{cost\}\_\{\\rm opt\}\(x\)\}\{\\mathrm\{cost\}\_\{A\}\(x\)\}\.ThusqA\(x\)=1q\_\{A\}\(x\)=1means that the solver attains the optimum objective value, whileqA\(x\)<1q\_\{A\}\(x\)<1measures the multiplicative gap from optimum\. Invalid, malformed, or infeasible outputs are assignedqA\(x\)=0q\_\{A\}\(x\)=0\. In the benchmark distributions used here, optimum objective values are positive, so the normalizations in Table[6](https://arxiv.org/html/2605.14141#A4.T6)are well\-defined\.

For Coloring,χ\(G\)\\chi\(G\)is the chromatic number of the instance graphGG, andkalgk\_\{\\rm alg\}is the number of colors used by the returned coloring\. For MAXSAT,satalg\\mathrm\{sat\}\_\{\\rm alg\}is the number of clauses satisfied by the returned assignment, andsatopt\\mathrm\{sat\}\_\{\\rm opt\}is the maximum satisfiable number of clauses\. For MIS,ISalg\\mathrm\{IS\}\_\{\\rm alg\}andISopt\\mathrm\{IS\}\_\{\\rm opt\}are the returned and maximum independent sets\. For MDS,DSalg\\mathrm\{DS\}\_\{\\rm alg\}andDSopt\\mathrm\{DS\}\_\{\\rm opt\}are the returned and minimum dominating sets\. For Packing LP and MDKP,objalg\\mathrm\{obj\}\_\{\\rm alg\}andobjopt\\mathrm\{obj\}\_\{\\rm opt\}are the returned and optimal objective values\. For TSP,lenalg\\mathrm\{len\}\_\{\\rm alg\}is the length of the returned tour andlenopt\\mathrm\{len\}\_\{\\rm opt\}is the optimal tour length\.

Optimality rate\.The optimality rate measures the fraction of held\-out instances on which a solver attains the optimum value\. Equivalently, for a test set𝒟test\\mathcal\{D\}\_\{\\rm test\}, we report

1\|𝒟test\|∑x∈𝒟test𝟏\{qA\(x\)=1\}\.\\frac\{1\}\{\|\\mathcal\{D\}\_\{\\rm test\}\|\}\\sum\_\{x\\in\\mathcal\{D\}\_\{\\rm test\}\}\\mathbf\{1\}\\\{q\_\{A\}\(x\)=1\\\}\.In practice, the equality check is implemented using the benchmark verifier and the stored optimal objective value, with the appropriate numerical tolerance for continuous Packing LP objectives\. Invalid or infeasible outputs are not counted as optimal\.

Runtime\.Runtime is measured per instance in milliseconds and averaged over the held\-out test set\. The runtime includes the deployed solver computation on the instance, including any verification, repair, fallback, or local\-search steps used by the selected generated program\. It does not include the earlier synthesis process that produced the solver, since synthesis is an offline training\-time cost\. When we report aggregate rows over several benchmark distributions, we first compute the mean metric for each distribution and then average across distributions, so each distribution contributes equally to the reported problem\-level or overall aggregate\.

Aggregation protocol\.All reported aggregate quantities are computed over the2121benchmark target distributions\. For each target distribution, we first average over held\-out test instances and repeated runs to obtain a single quality, optimality, feasibility, and runtime value for each method\. Quality lifts are then computed at the target level, for exampleQLLM−QavgheurQ\_\{\\rm LLM\}\-Q\_\{\\rm avg\\ heur\}andQLLM−QbestheurQ\_\{\\rm LLM\}\-Q\_\{\\rm best\\ heur\}\.

For additive quantities such as quality, optimality, feasibility, and quality lift, aggregate values are arithmetic means over the completed target distributions\. Runtime comparisons are aggregated as speedup ratios\. For each completed targetτ\\tau, we compute

Tbaseline\(τ\)TLLM\(τ\),\\frac\{T\_\{\\rm baseline\}\(\\tau\)\}\{T\_\{\\rm LLM\}\(\\tau\)\},so that values larger than11indicate that the LLM\-synthesized solver is faster\. Because runtime ratios are multiplicative, we aggregate them using geometric means, equivalently by averaging log\-speedups and exponentiating\. All aggregate speedups are computed over the same2121target distributions\.

### D\.3Baseline Catalog and Reporting Protocol

The main text summarizes the baseline families used in our experiments\. Here we give the complete per\-problem baseline catalog used in the benchmark\. Each entry specifies the problem class, solver family, backend, category, aliases when applicable, and implementation notes\. The benchmark protocol also specifies the execution details, including formulations, solver wrappers, time limits, thread settings, external\-solver requirements, and alias mappings\.

Table[7](https://arxiv.org/html/2605.14141#A4.T7)lists all baseline entries used by the benchmark\. The catalog contains7878registered baseline entries across the seven problem classes, corresponding to7272unique implementations after removing registry aliases\. We nevertheless list aliases explicitly because they are executable entries in the registry\. In particular, several problems include anexactentry that aliases a problem\-local exact implementation, such as an OR\-Tools formulation or the Held–Karp dynamic program\.

We distinguish three categories of baselines\.*Heuristic*baselines are problem\-specific procedures implemented locally in the benchmark, such as greedy, local\-search, insertion, rounding, and randomized rules\.*Solver\-backed*baselines are calls to general\-purpose optimization software under the benchmark protocol\. In particular, Gurobi is reported separately because it is run with a fixed1010\-second, single\-thread budget and may return an incumbent solution rather than a certified optimum\.*Exact*and*external exact*baselines are used as certified exact comparisons when they return certified optima under the evaluation protocol\.

In the aggregate result tables, “best heuristic” denotes the strongest heuristic baseline for that problem, “Gurobi \(10s\)” denotes the time\-limited Gurobi baseline, and “best exact” denotes the fastest certified exact solver available for that problem\. All baselines are evaluated under the same public\-instance protocol as the LLM synthesis agent\. The baselines are given the benchmark instances but not direct access to the hidden family\-generation rule\.

Table 7:Baseline catalog used in the benchmark registry\.The catalog includes lightweight problem\-specific heuristics, time\-limited solver\-backed baselines, and exact or certifying backends when available\. Theexactentries are registry aliases for problem\-local exact implementations\. Citations for heuristic variants refer to the closest classical heuristic family or solver framework; exact implementation details are defined in the registry\.
### D\.4Compute and API Usage

Solver synthesis was performed with GPT\-5\.2 Thinking through remote API calls\. We used fixed high reasoning effort and structured outputs\. Temperature, top\-pp, and maximum output length were left at service defaults\. Across the2121target distributions, synthesis used463463successful API calls and16\.116\.1M total tokens, where total tokens are prompt plus completion tokens\. This amounts to22\.022\.0calls and767767K tokens per distribution on average\. Of the16\.116\.1M total tokens,5\.955\.95M were prompt tokens and10\.1610\.16M were completion tokens; the recorded8\.058\.05M reasoning tokens are a subset of the completion tokens\.

All generated solvers and benchmark baselines were executed locally on CPUs\. No local GPU inference, GPU\-backed solver, model training, or fine\-tuning was used\. Experiments were run on a dual\-socket AMD EPYC 9554 machine with128128physical cores,256256logical CPUs, and approximately1\.11\.1TiB of RAM\. The main benchmark was parallelized across target distributions\. Summed over target distributions, the recorded synthesis\-stage time was73\.473\.4hours, while baseline pre\-synthesis evaluation took417\.3417\.3hours\. Additional local work inside synthesis, including analysis execution, solver construction, training evaluation, and validation evaluation, took approximately0\.60\.6hours\. These times are aggregate recorded per\-target stage times rather than raw calendar elapsed time\.

Table 8:Compute summary\.API usage is for solver synthesis\. Timing values are aggregate recorded per\-target stage times\. Generated solvers and baselines were evaluated locally on CPUs\.

## Appendix EAdditional Empirical Results

### E\.1Full Per\-Target Results

Table 9:Per\-target held\-out test results for all2121benchmark target distributions\.HereQQdenotes normalized quality, scaled so that higher is better; opt\. is the fraction of instances solved to optimality; andTTis average wall\-clock runtime per instance in milliseconds\. “Quality\-best heuristic” selects, within each target, the heuristic with highest normalized quality, breaking ties by optimality and then runtime\. “Avg\. heuristic” reports the arithmetic average over the heuristic baselines for that target\. For the two heuristic columns, reported runtimes are clipped at10,00010\{,\}000ms before averaging or reporting\. “Time\-limited exact” selects the strongest exact backend under the benchmark time limit using the same quality\-first rule\. The Gurobi and exact backends are run with a1010second solver limit, but the reportedTTvalues are measured wall\-clock runtimes and may include wrapper overhead\. The headline speedups in Table[1](https://arxiv.org/html/2605.14141#S6.T1)are computed from these per\-target runtime ratios using geometric means\.Table[9](https://arxiv.org/html/2605.14141#A5.T9)reports the full held\-out test results for all2121benchmark target distributions\. Unlike the headline summary in Table[1](https://arxiv.org/html/2605.14141#S6.T1), this table gives the per\-target normalized quality, optimality rate, and mean wall\-clock runtime for each baseline group\. Runtime is reported in milliseconds\.

For each target distribution, “Quality\-best heuristic” denotes the heuristic baseline with the highest normalized quality, breaking ties by optimality rate and then runtime\. “Avg\. heuristic” averages over all heuristic baselines for that target\. “Time\-limited exact” denotes the strongest exact\-formulation or certifying backend under the benchmark time limit, selected by the same quality\-first rule\. The aggregate statistics in Table[1](https://arxiv.org/html/2605.14141#S6.T1)are computed from these2121target\-level results\.

### E\.2PACE 2025 Dominating Set Comparison

We also evaluate on instances from the PACE 2025 Dominating Set heuristic track\. The task is to output a dominating set in an undirected graph, with smaller solution size corresponding to higher quality\. PACE released a public set of100100heuristic instances and, after the competition, the private evaluation set of100100instances\. The organizers state that the private instances are similar to the public instances\. We therefore use the public set as our development distribution, split into5050training instances and5050validation instances, and report final results on the released private set of100100instances\. We emphasize that this is a local comparison against released PACE solvers and instances, not an official PACE score\.

The private test graphs are large and sparse\. Table[10](https://arxiv.org/html/2605.14141#A5.T10)summarizes their sizes\. The graphs range from tens of thousands to several million vertices, with median size about6\.5×1056\.5\\times 10^\{5\}vertices and9\.4×1059\.4\\times 10^\{5\}edges\.

Table 10:PACE private\-test graph statistics\.Statistics are computed over the100100released private Dominating Set heuristic instances used for evaluation\.For each test instance, we run our synthesized solver and compare it with the released PACE heuristic solvers Fontanf, Root, Shadoks, and Swats\. We verify validity of every returned solution\. Since this is a minimization problem, a solver has better quality when it returns a smaller dominating set\. Runtime is measured locally\. Root uses the parallel rerun for instancesprivate\_heuristic\_001–046and the original completed run forprivate\_heuristic\_047–100\.

Table 11:PACE 2025 Dominating Set comparison\.Lower solution size and lower runtime are better\. The synthesized solver is valid on all instances and is roughly two orders of magnitude faster than the PACE baselines, but the PACE solvers return consistently smaller dominating sets\. Ratios are computed on matched verified\-valid instances\.The comparison shows a clear speed–quality tradeoff\. The synthesized solver is not competitive with the strongest PACE submissions in solution quality: every valid PACE baseline returns a smaller dominating set on every matched instance\. However, the gap is moderate in aggregate\. Against Fontanf, Root, and Shadoks, the agent’s total solution size is about3\.3%3\.3\\%larger, while its average runtime is roughly99×99\\times–109×109\\timessmaller\. Thus, on this benchmark the synthesized solver discovers a very fast distribution\-specific heuristic, but not one that matches the quality of highly engineered competition solvers\.

### E\.3Distributional Algorithmic Complexity

Table 12:Distribution\-aware computation for all benchmark distributions\.This appendix explains the computational mechanisms behind the generated solvers in Table[12](https://arxiv.org/html/2605.14141#A5.T12)\. The comparison is distributional\. We do not claim that the generated programs improve the worst\-case complexity of MaxSAT, graph coloring, minimum dominating set, maximum independent set, Packing LP, multidimensional knapsack, or TSP as abstract problem classes\. Instead, the generated programs exploit regularities of the benchmark distributions and replace an ambient generic computation by a smaller distribution\-specific computation on instances drawn from the same distribution\.

The central contrast is between an ambient solver and a distribution\-aware solver\. A generic exact baseline must remain valid for arbitrary instances of the problem class\. It therefore searches over color assignments for graph coloring, Boolean assignments for MaxSAT, vertex subsets for MIS and MDS, multidimensional item subsets for MDKP, or tours for TSP\. For Packing LP, the ambient problem is polynomial\-time solvable, but a generic LP backend still solves the fullNN\-variable,rr\-constraint linear program\. By contrast, a generated solver is selected after seeing samples from a fixed distribution\. When it identifies reusable structure, it can verify a template, seed a local search, restrict a candidate set, use a surrogate score, or construct a small set of structured candidates\. This changes the effective deployed computation, even though it does not change the worst\-case complexity of the ambient problem\.

Notation\.For graph problems,nnis the number of vertices,mmis the number of edges, andΔmax\\Delta\_\{\\max\}is the maximum degree\. For Coloring,κ\\kappadenotes the number of colors considered by the ambient exact search\. For MaxSAT,vvis the number of variables,\|F\|\|F\|is the number of clauses,ℓ\\ellis the maximum clause width, andΔocc\\Delta\_\{\\rm occ\}is the maximum variable occurrence count\. For Packing LP and MDKP,NNis the number of items,rris the number of resource constraints, andLLis the input bit length\. For TSP,nnis the number of cities\.

The symbolsBB,RR,PP,FF,SS, andKKin Table[12](https://arxiv.org/html/2605.14141#A5.T12)denote fixed implementation budgets: repair steps, restarts, local\-search rounds, coordinate frames, candidate families, pricing variants, improvement passes, or bounded candidate pools\. These quantities are fixed by the generated program or by the benchmark analysis file, not by the asymptotic input size\. We keep them visible because they identify the bounded computation that replaces the ambient search\.

More specifically,BrepB\_\{\\rm rep\}is the bounded coloring\-repair budget;RdsR\_\{\\rm ds\}is the number of DSATUR\-style restarts;PelimP\_\{\\rm elim\}is the number of color\-elimination passes; andBrecolorB\_\{\\rm recolor\}is the capped local recoloring budget\. For MaxSAT,RwsR\_\{\\rm ws\}is the number of WalkSAT\-style restarts,BflipB\_\{\\rm flip\}is the flip budget per restart, andBpatchB\_\{\\rm patch\}is the bounded patch\-search budget\. For MIS,PordP\_\{\\rm ord\}is the number of vertex\-ordering rules,PgrP\_\{\\rm gr\}is the number of greedy candidate constructions,BlocB\_\{\\rm loc\}is the local\-search budget,BkickB\_\{\\rm kick\}is the number of kick\-repair attempts, andTARWT\_\{\\rm ARW\}andTtinyT\_\{\\rm tiny\}denote bounded ARW\-style and tiny time\-limited improvement subroutines\. In the Core\-fringe trap row,tjt\_\{j\}is the size of residual fringe componentjj,kkis the number of core choices considered, andqqis the number of candidate configurations scored\.

For MDS,Theap\-greedyT\_\{\\rm heap\\text\{\-\}greedy\},TgreedyT\_\{\\rm greedy\}, andTpruneT\_\{\\rm prune\}denote the bounded heap\-greedy, greedy\-completion, and pruning subroutines;BgeoB\_\{\\rm geo\}is the bounded geometric\-anchor completion budget\. In the Star\-kernel row,tjt\_\{j\}is the number of residual target vertices in componentjj, andcjc\_\{j\}is the number of candidate cover vertices for that component\. For Packing LP,PLPP\_\{\\rm LP\}is the number of density or active\-resource rules evaluated\. For MDKP,PscoreP\_\{\\rm score\}is the number of surrogate scoring rules,PpriceP\_\{\\rm price\}is the number of resource\-price vectors,TiterT\_\{\\rm iter\}is one bounded price\-update and repair iteration,KcandK\_\{\\rm cand\}is the restricted candidate\-set size, andCbC\_\{b\}is the effective one\-dimensional bottleneck capacity used by the guarded DP\. For TSP,BcandB\_\{\\rm cand\}andBfullB\_\{\\rm full\}are bounded candidate\-2\-opt and full\-2\-opt budgets,FFis the number of coordinate frames,kkis the number of stripes or groups in the latent\-metric construction,SSis the number of selected stripe frames,PimpP\_\{\\rm imp\}is the number of candidate tours sent to improvement,BimpB\_\{\\rm imp\}is the improvement budget per candidate, andB2optB\_\{\\rm 2opt\}is the bounded 2\-opt budget in the paired\-ribbon fallback\.

For Coloring, the DSATUR implementation used by the generated solvers scans all uncolored vertices at each step\. Thus

TDSATUR\(n,m,κ\)=O\(n2\+m\+nκ\),T\_\{\\rm DSATUR\}\(n,m,\\kappa\)=O\(n^\{2\}\+m\+n\\kappa\),andTDSATUR\(n,m,κ\)=O\(n2\+m\)T\_\{\\rm DSATUR\}\(n,m,\\kappa\)=O\(n^\{2\}\+m\)whenκ≤n\\kappa\\leq n\. This is the main polynomial term replacing the ambientO∗\(κn\)O^\{\*\}\(\\kappa^\{n\}\)search in the DSATUR\-based coloring rows\.

Graph coloring\.A generic exact coloring solver must handle arbitrary graphs and therefore faces an ambientO∗\(κn\)O^\{\*\}\(\\kappa^\{n\}\)assignment space, even if practical solvers use pruning, SAT encodings, or branch\-and\-bound\. The generated solvers exploit the fact that the benchmark graphs are not arbitrary\.

In the Ring\-template distribution, the dominant fast path is template checking: the solver verifies a learned coloring pattern against the edge set inO\(m\)O\(m\)\. If this path fails or leaves conflicts, the solver uses bounded repair or falls back to a polynomial\-time greedy coloring routine, summarized in Table[12](https://arxiv.org/html/2605.14141#A5.T12)byO\(n2\+m\+BrepΔmax\)O\(n^\{2\}\+m\+B\_\{\\rm rep\}\\Delta\_\{\\max\}\)\. The gain is that most instances can be handled by verification and bounded repair rather than by color\-assignment search\.

In the Overlapping\-palette and Separator\-trap distributions, the selected solvers do not reduce to pure template checking\. They run a bounded number of DSATUR\-style attempts and cleanup passes\. For Overlapping\-palette, the cleanup includes bounded color\-elimination passes, yielding

O\(n\+m\+Rds\(n2\+m\)\+Pelimκ\(n\+m\)\)\.O\(n\+m\+R\_\{\\rm ds\}\(n^\{2\}\+m\)\+P\_\{\\rm elim\}\\kappa\(n\+m\)\)\.For Separator\-trap, the cleanup is capped recoloring repair, yielding

O\(n\+m\+Rds\(n2\+m\)\+Brecolor\)\.O\(n\+m\+R\_\{\\rm ds\}\(n^\{2\}\+m\)\+B\_\{\\rm recolor\}\)\.These rows are bounded polynomial\-time heuristics rather than exact reductions, but they still replace ambient exponential search by a small number of structured attempts\.

MaxSAT\.For MaxSAT, a generic exact backend must reason over the full2v2^\{v\}Boolean assignment space and certify optimality without assuming planted structure\. The generated solvers instead use distributional statistics to construct strong initial assignments, then run bounded local repair\. Building occurrence lists, initializing scores, and evaluating the seed costsO\(\|F\|ℓ\)O\(\|F\|\\ell\)\. A flip updates clauses incident to the flipped variable, which costsO\(ℓΔocc\)O\(\\ell\\Delta\_\{\\rm occ\}\)with maintained clause counts\.

The Community\-parity and Latent\-backdoor rows therefore have dominant cost

O\(\|F\|ℓ\+RwsBflipℓΔocc\),O\(\|F\|\\ell\+R\_\{\\rm ws\}B\_\{\\rm flip\}\\ell\\Delta\_\{\\rm occ\}\),whereRwsR\_\{\\rm ws\}is the number of restarts andBflipB\_\{\\rm flip\}is the flip budget per restart\. The Last\-clause signal row adds a bounded patch term,

O\(BpatchℓΔocc\),O\(B\_\{\\rm patch\}\\ell\\Delta\_\{\\rm occ\}\),because it explicitly repairs variables around remaining unsatisfied clauses\. In all three cases, the computational advantage is not a new MaxSAT algorithm for arbitrary formulas\. The advantage is that the distributional seed moves the solver close enough to a high\-quality assignment that bounded local search is often sufficient\.

Maximum independent set\.For MIS, an exact generic solver must handle arbitrary vertex subsets, giving an ambientO∗\(2n\)O^\{\*\}\(2^\{n\}\)search space\. The selected generated solvers are not certified exact MIS algorithms\. They build small pools of candidate independent sets using orderings or decompositions suggested by the distribution, then apply bounded local improvement or small residual enumeration\.

In the Clique\-path distribution, the solver constructs candidates from analysis\-guided vertex orders, greedy maximal independent sets, randomized restarts, and bounded ARW\-style local improvement\. This gives

O\(n\+m\+Pordnlog⁡n\+Pgr\(n\+m\)\+TARW\)\.O\(n\+m\+P\_\{\\rm ord\}n\\log n\+P\_\{\\rm gr\}\(n\+m\)\+T\_\{\\rm ARW\}\)\.The benefit is that the solver spends its budget on a small set of distributionally promising independent sets rather than on the full subset space\.

In the Core\-fringe trap distribution, the solver uses the high\-degree core and low\-degree fringe structure\. On the structured path, it decomposes the fringe into tiny components, enumerates exact choices inside each component, and scores a bounded set of candidate core decisions\. This gives

O\(n\+m\+∑j4tj\+kq\),O\(n\+m\+\\sum\_\{j\}4^\{t\_\{j\}\}\+kq\),wheretjt\_\{j\}is the size of residual fringe componentjj,kkis the number of core choices considered, andqqis the number of candidate configurations scored\. The exponential term is over tiny residual pieces, not over the full graph\.

In the Motif\-bridge distribution, the solver uses multiple greedy constructions, bounded local improvement, bounded kick moves, and an optional tiny time\-limited exact improvement step:

O\(n\+m\+Pgr\(n\+m\)\+Bloc\(n\+m\)\+Bkick\(n\+m\)\+Ttiny\)\.O\(n\+m\+P\_\{\\rm gr\}\(n\+m\)\+B\_\{\\rm loc\}\(n\+m\)\+B\_\{\\rm kick\}\(n\+m\)\+T\_\{\\rm tiny\}\)\.This is again a bounded heuristic, but its search is restricted to a small set of distributionally meaningful candidates\.

Minimum dominating set\.For MDS, a generic exact solver faces an ambientO∗\(2n\)O^\{\*\}\(2^\{n\}\)subset\-selection problem\. The generated solvers exploit distribution\-specific structure in how domination is achieved\.

In Gateway\-hub, the main computation is graph processing, greedy completion, and redundant\-vertex pruning:

O\(n\+m\+Theap\-greedy\+Tprune\)\.O\(n\+m\+T\_\{\\rm heap\\text\{\-\}greedy\}\+T\_\{\\rm prune\}\)\.The advantage is that high\-degree hubs and leaf or isolate structure make much of the dominating set apparent without enumerating subsets\.

In Geometric\-anchor, the fast path isO\(n\+m\)O\(n\+m\), because the solver checks a small anchor pattern induced by a residue\-class structure\. If the pattern fails, it uses bounded greedy completion, giving

O\(\(1\+Bgeo\)\(n\+m\)\)\.O\(\(1\+B\_\{\\rm geo\}\)\(n\+m\)\)\.In Star\-kernel, the solver uses forced leaf\-neighbor structure, then solves only tiny residual components\. The residual exact work is

∑jcj2tj,\\sum\_\{j\}c\_\{j\}2^\{t\_\{j\}\},wheretjt\_\{j\}is the number of residual target vertices in componentjj, andcjc\_\{j\}is the number of candidate cover vertices relevant to that component\. This is exponential only in the tiny residual size, not in the original graph sizenn\.

Packing LP\.Packing LP is different from the discrete NP\-hard families\. The ambient problem is already polynomial\-time solvable, so the comparison is not exponential versus polynomial\. The comparison is between a generic LP solve,TLP\(N,r,L\)T\_\{\\rm LP\}\(N,r,L\), and a specialized sorting\-based computation\. The generated solvers exploit the fact that the benchmark LP distributions often have a small number of dominant or active resources\.

In the Block\-coupled and Single\-bottleneck distributions, the solver identifies a bottleneck or effective resource score and sorts items by density, giving

O\(Nr\+Nlog⁡N\)\.O\(Nr\+N\\log N\)\.TheNrNrterm comes from evaluating feasibility and resource usage across constraints, whileNlog⁡NN\\log Ncomes from sorting\. In Active\-resource, the solver evaluates a small portfolio ofPLPP\_\{\\rm LP\}density rules, giving

O\(Nr\+PLP\(Nr\+Nlog⁡N\)\)\.O\(Nr\+P\_\{\\rm LP\}\(Nr\+N\\log N\)\)\.The advantage overTLP\(N,r,L\)T\_\{\\rm LP\}\(N,r,L\)is not a worst\-case complexity improvement in the abstract LP model, but a deployment advantage on these distributions: a small portfolio of density rules captures the active\-resource structure without invoking a full general\-purpose optimizer\.

Multidimensional knapsack\.For MDKP, a generic exact baseline is a MIP or branch\-and\-bound solver over2N2^\{N\}possible item subsets\. The generated solvers replace this ambient search by surrogate scoring, candidate restriction, bounded dynamic programming, add/drop improvement, and repair\.

In Decoy\-complement MDKP, the solver evaluates a small portfolio of surrogate scores, greedily packs feasible candidates, applies bounded add/drop improvement, and repairs violations:

O\(Nr\+Pscore\(Nr\+Nlog⁡N\)\+Tadd/drop\+Trepair\)\.O\(Nr\+P\_\{\\rm score\}\(Nr\+N\\log N\)\+T\_\{\\rm add/drop\}\+T\_\{\\rm repair\}\)\.In Latent\-class MDKP, the solver uses adaptive price vectors and iterated surrogate packing:

O\(Nr\+Nlog⁡N\+PpriceTiter\(Nr\+Nlog⁡N\+Trepair\)\)\.O\(Nr\+N\\log N\+P\_\{\\rm price\}T\_\{\\rm iter\}\(Nr\+N\\log N\+T\_\{\\rm repair\}\)\)\.In Single\-resource MDKP, the solver identifies a dominant resource, filters or sorts items, and runs one\-dimensional DP on a bounded candidate set:

O\(Nr\+Nlog⁡N\+KcandCb\+Trepair\),O\(Nr\+N\\log N\+K\_\{\\rm cand\}C\_\{b\}\+T\_\{\\rm repair\}\),whereKcandK\_\{\\rm cand\}is the candidate\-set size andCbC\_\{b\}is the effective one\-dimensional capacity used by the guarded DP\. These rows are bounded heuristics or restricted dynamic programs rather than exact reductions for general MDKP\. Their advantage is that they search over a small number of distributionally meaningful item orders or candidates instead of exploring the full subset space\.

Traveling salesman problem\.For TSP, a generic exact dynamic program such as Held\-Karp has costO\(n22n\)O\(n^\{2\}2^\{n\}\), and exact branch\-and\-bound solvers must still be prepared for arbitrary tour structure\. The generated solvers do not implement exact TSP dynamic programming\. They exploit geometric structure to generate a small set of candidate tours and then improve them by bounded local search\.

In Clustered Euclidean TSP, the solver builds geometric or nearest\-neighbor candidate tours and candidate neighborhoods\. Sorting neighborhoods contributes theO\(n2log⁡n\)O\(n^\{2\}\\log n\)term, while bounded candidate 2\-opt and bounded full 2\-opt cleanup contribute quadratic improvement terms:

O\(n2log⁡n\+Bcandn2\+Bfulln2\)\.O\(n^\{2\}\\log n\+B\_\{\\rm cand\}n^\{2\}\+B\_\{\\rm full\}n^\{2\}\)\.Thus the effective computation is structured candidate generation plus bounded quadratic local improvement, rather than exponential tour search\.

In Latent\-metric TSP, the solver builds several structured candidate families\. The termF\(kn2\+nlog⁡n\)F\(kn^\{2\}\+n\\log n\)comes from evaluatingFFcoordinate frames and computing stripe or projection structure withkkgroups\. The termS\(2kk\+kn\)S\(2^\{k\}k\+kn\)comes from using the topSSstripe frames and solving the small stripe\-order subproblem overkkstripes\. Sincekkis fixed and small in the selected program, this is not an exponential dependence on the number of cities\. Finally,PimpBimpn2P\_\{\\rm imp\}B\_\{\\rm imp\}n^\{2\}summarizes improving at mostPimpP\_\{\\rm imp\}candidate tours for a bounded numberBimpB\_\{\\rm imp\}of local improvement passes:

O\(n2\+F\(kn2\+nlog⁡n\)\+S\(2kk\+kn\)\+PimpBimpn2\)\.O\(n^\{2\}\+F\(kn^\{2\}\+n\\log n\)\+S\(2^\{k\}k\+kn\)\+P\_\{\\rm imp\}B\_\{\\rm imp\}n^\{2\}\)\.
In Paired\-ribbon zigzag TSP, the solver has a fast structured path based on the two\-rail geometry of the distribution\. It detects the rail structure, sorts points along the rails, and pairs endpoints, giving

If the structured path is not used, it falls back to nearest\-neighbor construction and bounded 2\-opt improvement:

O\(n2\+B2optn2\)\.O\(n^\{2\}\+B\_\{\\rm 2opt\}n^\{2\}\)\.The ambient comparison remainsO\(n22n\)O\(n^\{2\}2^\{n\}\), since a generic exact TSP solver must remain valid for arbitrary instances\. The gain is therefore distributional: for instances matching the ribbon structure, the generated solver avoids general tour search and most quadratic local\-search work\.

Summary\.Across the benchmark families, the generated solvers win when the training samples reveal reusable low\-dimensional, local, or template structure\. Coloring can become template verification plus bounded repair\. MaxSAT can become seeded local search\. MIS and MDS can become greedy construction plus tiny residual enumeration or bounded local improvement\. Packing LP can become density sorting over active resources\. MDKP can become surrogate pricing, restricted one\-dimensional DP, or bounded repair\. TSP can become structured candidate generation plus bounded quadratic local search, and in the paired\-ribbon case even a near\-sorting computation on the fast path\.

The limitation is equally important\. These programs do not establish new worst\-case algorithms for the ambient problem classes\. Several rows are bounded heuristics, and some include fallback or repair\. The faithful claim is distributional and mechanistic: on these benchmark distributions, the generated programs often replace generic exact optimization over a large ambient space by the smaller computations summarized in Table[12](https://arxiv.org/html/2605.14141#A5.T12)\.

### E\.4Zero\-Sample versus Sample\-Conditioned Synthesis

We use a focused ablation to test whether public samples help the synthesis procedure compile better deployment solvers\. The zero\-sample variant uses the same synthesis pipeline but provides no public training instances\. The sample\-conditioned variant usesn=64n=64public training instances, matching the default setting in our main experiments rather than selecting the best sample size\. Results are averaged over1010paired synthesis seeds, so each row compares the two variants under matched randomness\.

Table[13](https://arxiv.org/html/2605.14141#A5.T13)shows that normalized quality is near\-saturated in both variants\. Thus, in these target distributions, the main observable effect of sample access is not feasibility or final quality, but deployment runtime\. Sample\-conditioned synthesis is at least as fast on10/1210/12targets and strictly faster on9/129/12, with large gains on Single\-resource MDKP, Separator\-palette Coloring, and Paired\-ribbon zigzag TSP\. The only substantial regression occurs on Decoy\-complement MDKP, where the sample\-conditioned solver is slower despite a small quality gain\.

We interpret this experiment as evidence for sample\-conditioned compilation: public samples often help the synthesis process discover faster deployment code, even when zero\-sample synthesis already finds high\-quality solutions\. The effect is distribution\-dependent, however, rather than a monotone consequence of providing more samples\.

Table 13:Zero\-sample versus sample\-conditioned synthesis\.The zero\-sample variant uses no public training instances\. The sample\-conditioned variant uses the default training size,n=64n=64\. We report mean normalized qualityQQand mean runtimeTTin milliseconds over1010paired synthesis seeds\. The speedup column is the ratio of mean runtimes,T0/T64T\_\{0\}/T\_\{64\}; values above1×1\\timesindicate that using samples produced a faster deployed solver\.
### E\.5Diagnostic Traces for Synthesized Solvers

The preceding section summarized the distribution\-specific computations used by the generated solvers\. These computations are heterogeneous: depending on the problem family and selected program, a synthesized solver may verify a template, seed a local search, restrict a candidate set, solve a small residual subproblem, sort by an active\-resource score, construct a structured tour, or invoke bounded repair or fallback\. We therefore record diagnostic traces during held\-out evaluation to understand which parts of the synthesized computation are actually used at test time\.

The diagnostics use a common logging interface, but they should not be read as saying that every solver implements the same mechanisms\. A*shortcut*denotes the solver’s learned fast path, whose concrete meaning is problem\-dependent\. For example, it may correspond to template checking in Coloring, seeded assignment construction in MAXSAT, density sorting in Packing LP, surrogate scoring or restricted dynamic programming in MDKP, or structured candidate generation in TSP\. A*fallback*records whether the solver routed the instance to a generic safety routine, when such a routine is present\.*Residual size*records the solver\-reported size of the remaining subproblem after applying the learned specialization, when the generated solver creates such a residual\. Its scale depends on the native problem representation and should be interpreted within, not across, problem families\. Finally,*repair iterations*records the amount of bounded local repair used after the initial construction or reduction\.

These traces are not intervention ablations: we do not disable shortcutting, fallback, residual solving, or repair and rerun the benchmark\. Instead, they provide a behavioral diagnosis of the deployed synthesized solver under the same held\-out evaluation protocol used for the headline quality and runtime results\. For each target distribution, we average the diagnostic traces over held\-out test instances\. Family rows average over the three target distributions in that problem family, and the final row averages over all2121target distributions\.

Table 14:Diagnostic traces for the synthesized solvers over held\-out test instances\.QLLMQ\_\{\\rm LLM\}is normalized quality andTLLMT\_\{\\rm LLM\}is average wall\-clock runtime per instance in milliseconds\. The columns use a common logging interface, but their concrete interpretation is solver\-dependent\. Shortcut rate is the fraction of instances on which the solver used its learned fast path, which may correspond to different computations in different problem families\. Fallback rate is the fraction routed to a generic safety routine when such a routine is present\. Residual size is the solver\-reported size of the remaining subproblem after specialization and is meaningful only within a problem family\. Repair iters\. is the average number of bounded local repair iterations\. Values are first averaged within each target distribution and then averaged across the three targets in each problem family\. The final row averages over all2121target distributions\.Table[14](https://arxiv.org/html/2605.14141#A5.T14)should therefore be read as a mechanism\-use diagnostic, not as evidence that every synthesized solver uses the same algorithmic components\. The dominant aggregate pattern is the use of a learned fast path: the average shortcut rate is86\.9%86\.9\\%, and it reaches100%100\\%for Packing LP, MDKP, and TSP\. The concrete fast path differs across families\. In Packing LP it corresponds to specialized density or active\-resource rules; in MDKP, to surrogate scoring, candidate restriction, or restricted dynamic programming; and in TSP, to structured candidate generation or geometric tour construction\. Thus, the high shortcut rate supports the interpretation from Table[12](https://arxiv.org/html/2605.14141#A5.T12): the synthesized solvers often replace generic search by a distribution\-specific computation\.

Fallback is rare overall\. The average fallback rate is only3\.0%3\.0\\%, which indicates that the reported speedups are not primarily obtained by calling a generic backend on most instances\. Instead, fallback acts as a safety mechanism for slices of the distribution not covered by the learned hint\. The largest fallback rates occur in MDS and MAXSAT, matching the qualitative picture that these solvers often use useful but incomplete structure, such as seeded local search, hub or anchor rules, or bounded cleanup\.

Repair is also usually limited\. The average number of repair iterations is0\.430\.43, and most families have essentially zero repair\. The main exception is MDKP, where repair is part of the synthesized packing strategy: the solver first constructs a candidate solution using surrogate scores or restricted dynamic programming, then performs bounded add/drop or feasibility repair\. Thus, MDKP is more repair\-mediated, whereas Packing LP and TSP are more directly fast\-path\-mediated\.

The residual\-size column gives a complementary but problem\-dependent view\. In some families, the solver does not solve the ambient instance directly; it first reduces the instance to a structured residual and then applies enumeration, repair, or a backend routine to that residual\. This is the distributional computation pattern described in the previous section\. However, because “residual size” has different native meanings across problem families, it should not be used for cross\-family comparisons\.

Overall, the diagnostic traces strengthen the mechanistic interpretation of the benchmark while clarifying its limits\. The synthesized solvers are not merely faster implementations of the same generic algorithms, nor are they mostly wrappers around exact solvers\. They usually execute a learned distribution\-specific fast path, occasionally use fallback as a safety mechanism, and perform bounded repair only when the generated strategy leaves small violations or residual choices\.

### E\.6Perturbation Robustness Ablation

We include a graph\-relabeling perturbation ablation as a diagnostic for presentation dependence\. The perturbation randomly relabels the vertices of each graph instance\. This preserves the underlying graph up to isomorphism, but changes the vertex identifiers and the input order seen by the generated solver\. Thus, if a generated solver relies only on isomorphism\-invariant structure, its quality and runtime should remain similar after relabeling\. If it relies on incidental identifiers or ordering artifacts, its behavior may change\.

For each selected generated solver, we evaluate the solver on the original held\-out test instances and on relabeled copies of the same instances\. This perturbation is applied only as a post\-hoc diagnostic and is not used during solver selection\. We report the original normalized qualityQorigQ\_\{\\rm orig\}, the perturbed normalized qualityQpertQ\_\{\\rm pert\}, and

ΔQ=Qpert−Qorig\.\\Delta Q=Q\_\{\\rm pert\}\-Q\_\{\\rm orig\}\.We also report the fraction of test instances for which the quality value, optimality indicator, or feasibility indicator changes after relabeling\. Finally, we report the runtime ratiotpert/torigt\_\{\\rm pert\}/t\_\{\\rm orig\}\. Runtime ratios are aggregated by geometric mean, since runtime effects are multiplicative\.

Table 15:Aggregate graph\-relabeling perturbation ablation\.The perturbation preserves each graph instance up to isomorphism, but changes the vertex identifiers and input order seen by the generated solver\. The changed columns report the fraction of test instances whose quality, optimality, or feasibility value changes under relabeling\. Runtime ratios are geometric means oftpert/torigt\_\{\\rm pert\}/t\_\{\\rm orig\}across target distributions\.Table[15](https://arxiv.org/html/2605.14141#A5.T15)shows that feasibility is fully stable under relabeling: the feasibility\-changed fraction is zero for every problem class and every tested target distribution\. Thus, the generated solvers continue to return feasible solutions under an isomorphic presentation of the same graph instances\.

Quality is more mixed\. Across the nine graph target distributions, mean normalized quality decreases from0\.9450\.945to0\.8670\.867\. This drop is concentrated in two target distributions: Ring\-template in Coloring and Geometric\-anchor in MDS\. The remaining seven graph targets have nearly unchanged mean quality under relabeling\. This suggests that many generated solvers capture reusable distribution\-specific structure, but that some selected solvers also depend on presentation\-specific regularities\.

Runtime is also presentation\-sensitive\. Across the nine graph targets, relabeling increases runtime by a geometric mean factor of1\.36×1\.36\\times\. Most targets remain within a small constant factor, but there are notable outliers: Ring\-template slows down by9\.65×9\.65\\times, while Core\-fringe trap slows down by2\.19×2\.19\\timesdespite unchanged quality\. Thus, relabeling probes not only solution robustness but also computational robustness\. A generated solver may preserve feasibility and quality while still taking a different, slower execution path under an isomorphic presentation\.

Table 16:Target\-level graph\-relabeling perturbation ablation\.The aggregate quality drop is concentrated in Ring\-template and Geometric\-anchor, while feasibility remains unchanged on all tested targets\.Overall, the perturbation ablation is best viewed as a diagnostic test of presentation dependence\. The results are consistent with the synthesis agent often extracting reusable distribution\-specific structure: feasibility is invariant, and quality is nearly unchanged on most graph targets\. At the same time, the brittle cases show that some generated solvers depend on incidental presentation details such as vertex identifiers or ordering\. Graph relabeling is therefore a useful stress test for separating invariant distributional structure from presentation\-specific shortcuts\.

## Appendix FProofs for Section[5](https://arxiv.org/html/2605.14141#S5)

See[5\.1](https://arxiv.org/html/2605.14141#S5.Thmtheorem1)

###### Proof\.

Forc∈𝒞c\\in\\mathcal\{C\}, writee\(c\):=ErrD\(c\)e\(c\):=\\mathrm\{Err\}\_\{D\}\(c\)\. IfErr^S\(c\)=0\\widehat\{\\mathrm\{Err\}\}\_\{S\}\(c\)=0, thenPr⁡\[Err^S\(c\)=0\]=\(1−e\(c\)\)n≤e−ne\(c\)\\Pr\[\\widehat\{\\mathrm\{Err\}\}\_\{S\}\(c\)=0\]=\(1\-e\(c\)\)^\{n\}\\leq e^\{\-ne\(c\)\}\. Hence

Pr⁡\(Err^S\(c\)=0ande\(c\)\>Γ\(c\)\+log⁡\(2/δ\)n\)≤π\(c\)δ2\.\\Pr\\\!\\left\(\\widehat\{\\mathrm\{Err\}\}\_\{S\}\(c\)=0\\;\\text\{and\}\\;e\(c\)\>\\frac\{\\Gamma\(c\)\+\\log\(2/\\delta\)\}\{n\}\\right\)\\leq\\pi\(c\)\\frac\{\\delta\}\{2\}\.A union bound gives, with probability at least1−δ/21\-\\delta/2,

ErrD\(c\)≤Γ\(c\)\+log⁡\(2/δ\)nfor allc∈𝒞withErr^S\(c\)=0\.\\mathrm\{Err\}\_\{D\}\(c\)\\leq\\frac\{\\Gamma\(c\)\+\\log\(2/\\delta\)\}\{n\}\\quad\\text\{for all \}c\\in\\mathcal\{C\}\\text\{ with \}\\widehat\{\\mathrm\{Err\}\}\_\{S\}\(c\)=0\.Sincec^S\\widehat\{c\}\_\{S\}is sample\-consistent, this proves the error bound\.

For runtime, set

r\(c\):=TmaxΓ\(c\)\+log⁡\(4/δ\)2n\.r\(c\):=\\mathrm\{T\}\_\{\\max\}\\sqrt\{\\frac\{\\Gamma\(c\)\+\\log\(4/\\delta\)\}\{2n\}\}\.By Hoeffding’s inequality and another union bound, with probability at least1−δ/21\-\\delta/2,

\|Run^S\(c\)−RunD\(c\)\|≤r\(c\)for allc∈𝒞\.\|\\widehat\{\\mathrm\{Run\}\}\_\{S\}\(c\)\-\\mathrm\{Run\}\_\{D\}\(c\)\|\\leq r\(c\)\\quad\\text\{for all \}c\\in\\mathcal\{C\}\.On the intersection of the two events, fix anyc∈𝒞feasc\\in\\mathcal\{C\}^\{\\rm feas\}\. SinceErrD\(c\)=0\\mathrm\{Err\}\_\{D\}\(c\)=0,ccis sample\-consistent almost surely, so the selection rule givesRun^S\(c^S\)≤Run^S\(c\)\\widehat\{\\mathrm\{Run\}\}\_\{S\}\(\\widehat\{c\}\_\{S\}\)\\leq\\widehat\{\\mathrm\{Run\}\}\_\{S\}\(c\)\. Therefore

RunD\(c^S\)≤Run^S\(c^S\)\+r\(c^S\)≤Run^S\(c\)\+r\(c^S\)≤RunD\(c\)\+r\(c\)\+r\(c^S\)\.\\mathrm\{Run\}\_\{D\}\(\\widehat\{c\}\_\{S\}\)\\leq\\widehat\{\\mathrm\{Run\}\}\_\{S\}\(\\widehat\{c\}\_\{S\}\)\+r\(\\widehat\{c\}\_\{S\}\)\\leq\\widehat\{\\mathrm\{Run\}\}\_\{S\}\(c\)\+r\(\\widehat\{c\}\_\{S\}\)\\leq\\mathrm\{Run\}\_\{D\}\(c\)\+r\(c\)\+r\(\\widehat\{c\}\_\{S\}\)\.Taking the infimum overc∈𝒞feasc\\in\\mathcal\{C\}^\{\\rm feas\}and using

r\(c\)\+r\(c^S\)≤2Tmaxmax⁡\{Γ\(c^S\),Γ\(c\)\}\+log⁡\(4/δ\)2nr\(c\)\+r\(\\widehat\{c\}\_\{S\}\)\\leq 2\\mathrm\{T\}\_\{\\max\}\\sqrt\{\\frac\{\\max\\\{\\Gamma\(\\widehat\{c\}\_\{S\}\),\\Gamma\(c\)\\\}\+\\log\(4/\\delta\)\}\{2n\}\}gives the stated runtime bound\. The two events together hold with probability at least1−δ1\-\\delta\. ∎

See[5\.2](https://arxiv.org/html/2605.14141#S5.Thmtheorem2)

###### Proof\.

Fix the true hinth⋆∈ℋh^\{\\star\}\\in\\mathcal\{H\}, and for eachh∈ℋh\\in\\mathcal\{H\}letμ\(h\):=𝔼X∼Dh⋆\[ψh\(X\)\]\\mu\(h\):=\\mathbb\{E\}\_\{X\\sim D\_\{h^\{\\star\}\}\}\[\\psi\_\{h\}\(X\)\]\. By assumption,μ\(h⋆\)≥μ\(g\)\+γ\\mu\(h^\{\\star\}\)\\geq\\mu\(g\)\+\\gammafor everyg≠h⋆g\\neq h^\{\\star\}\. Sinceψh\(X\)∈\[0,1\]\\psi\_\{h\}\(X\)\\in\[0,1\], Hoeffding’s inequality givesPr⁡\(\|μ^S\(h\)−μ\(h\)\|\>γ/2\)≤2e−nγ2/2\\Pr\(\|\\widehat\{\\mu\}\_\{S\}\(h\)\-\\mu\(h\)\|\>\\gamma/2\)\\leq 2e^\{\-n\\gamma^\{2\}/2\}for eachhh, and a union bound overℋ\\mathcal\{H\}yields

Pr⁡\(∃h∈ℋ:\|μ^S\(h\)−μ\(h\)\|\>γ2\)≤2\|ℋ\|e−nγ2/2\.\\Pr\\\!\\left\(\\exists h\\in\\mathcal\{H\}:\\ \\bigl\|\\widehat\{\\mu\}\_\{S\}\(h\)\-\\mu\(h\)\\bigr\|\>\\tfrac\{\\gamma\}\{2\}\\right\)\\leq 2\|\\mathcal\{H\}\|e^\{\-n\\gamma^\{2\}/2\}\.Forn≥2γ2log⁡2\|ℋ\|δn\\geq\\tfrac\{2\}\{\\gamma^\{2\}\}\\log\\tfrac\{2\|\\mathcal\{H\}\|\}\{\\delta\}, the right\-hand side is at mostδ\\delta, so with probability1−δ1\-\\deltawe have\|μ^S\(h\)−μ\(h\)\|≤γ/2\|\\widehat\{\\mu\}\_\{S\}\(h\)\-\\mu\(h\)\|\\leq\\gamma/2for allh∈ℋh\\in\\mathcal\{H\}\. On this event, for everyg≠h⋆g\\neq h^\{\\star\},

μ^S\(h⋆\)≥μ\(h⋆\)−γ2≥μ\(g\)\+γ2≥μ^S\(g\),\\widehat\{\\mu\}\_\{S\}\(h^\{\\star\}\)\\geq\\mu\(h^\{\\star\}\)\-\\tfrac\{\\gamma\}\{2\}\\geq\\mu\(g\)\+\\tfrac\{\\gamma\}\{2\}\\geq\\widehat\{\\mu\}\_\{S\}\(g\),soh^=h⋆\\hat\{h\}=h^\{\\star\}\. ∎

See[5\.3](https://arxiv.org/html/2605.14141#S5.Thmtheorem3)

###### Proof\.

Setε:=γ/4\\varepsilon:=\\gamma/4\. For each variablei∈\[d\]i\\in\[d\], the random variablesσi\(F\(1\)\),…,σi\(F\(m\)\)\\sigma\_\{i\}\(F^\{\(1\)\}\),\\dots,\\sigma\_\{i\}\(F^\{\(m\)\}\)are IID and lie in\[0,1\]\[0,1\], with meanμi:=𝔼F∼DB\[σi\(F\)\]\\mu\_\{i\}:=\\mathbb\{E\}\_\{F\\sim D\_\{B\}\}\[\\sigma\_\{i\}\(F\)\]\. By Hoeffding’s inequality,

Pr⁡\(\|σ^i−μi\|\>ε\)≤2e−2mε2\.\\Pr\\\!\\left\(\|\\widehat\{\\sigma\}\_\{i\}\-\\mu\_\{i\}\|\>\\varepsilon\\right\)\\leq 2e^\{\-2m\\varepsilon^\{2\}\}\.A union bound over alli∈\[d\]i\\in\[d\]yields

Pr⁡\(max1≤i≤d⁡\|σ^i−μi\|\>ε\)≤2de−2mε2\.\\Pr\\\!\\left\(\\max\_\{1\\leq i\\leq d\}\|\\widehat\{\\sigma\}\_\{i\}\-\\mu\_\{i\}\|\>\\varepsilon\\right\)\\leq 2d\\,e^\{\-2m\\varepsilon^\{2\}\}\.Therefore, ifm≥8γ2log⁡2dδm\\geq\\frac\{8\}\{\\gamma^\{2\}\}\\log\\frac\{2d\}\{\\delta\}, then with probability at least1−δ1\-\\deltawe have\|σ^i−μi\|≤ε\|\\widehat\{\\sigma\}\_\{i\}\-\\mu\_\{i\}\|\\leq\\varepsilonfor alli∈\[d\]i\\in\[d\]\.

On this event, ifi∈Bi\\in Bthenσ^i≥q1−ε\\widehat\{\\sigma\}\_\{i\}\\geq q\_\{1\}\-\\varepsilon, while ifj∉Bj\\notin Bthenσ^j≤q0\+ε\\widehat\{\\sigma\}\_\{j\}\\leq q\_\{0\}\+\\varepsilon\. Sinceγ=q1−q0\\gamma=q\_\{1\}\-q\_\{0\}andε=γ/4\\varepsilon=\\gamma/4, we getq1−ε\>q0\+εq\_\{1\}\-\\varepsilon\>q\_\{0\}\+\\varepsilon\. Hence every variable inBBhas strictly larger empirical score than every variable outsideBB, so the topkkempirical scores are attained exactly on the variables inBB\. ThereforeB^=B\\widehat\{B\}=B\. ∎
Distribution-Aware Algorithm Design with LLM Agents

Similar Articles

AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

Mixture of Complementary Agents for Robust LLM Ensemble

Revisiting DAgger in the Era of LLM-Agents

TradingAgents: Multi-Agents LLM Financial Trading Framework

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

Submit Feedback

Similar Articles

AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design
Mixture of Complementary Agents for Robust LLM Ensemble
Revisiting DAgger in the Era of LLM-Agents
TradingAgents: Multi-Agents LLM Financial Trading Framework
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive