
# AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design
Source: [https://arxiv.org/html/2605.08756](https://arxiv.org/html/2605.08756)
Haoze Lv¹ʼ², Ning Lu¹ʼ³, Ziang Zhou¹, Shengcai Liu¹ʼ²

¹Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology  ²Zhongguancun Academy  ³The Hong Kong University of Science and Technology

Project Page: [https://github.com/Antoniano1963/AHD-Agent](https://github.com/Antoniano1963/AHD-Agent)

###### Abstract

Automatic heuristic design (AHD) has emerged as a promising paradigm for solving NP-hard combinatorial optimization problems (COPs). Recent works show that large language models (LLMs), when integrated into well-designed frameworks (i.e., LLM-AHD), can autonomously discover high-performing heuristics. However, existing LLM-AHD frameworks typically treat LLMs as passive generators within fixed workflows, where the model generates heuristics from manually designed, limited context. Such context may fail to capture state-dependent information (e.g., specific failure modes), leading to inefficient trial-and-error exploration. To overcome these limitations, we propose AHD Agent, a novel tool-integrated, multi-turn framework that empowers LLMs to proactively decide whether to generate heuristics or invoke tools to retrieve targeted evidence from the solving environment. To effectively train such a dynamic decision-making agent, we introduce an agentic reinforcement learning (RL) system, which leverages a novel environment synthesis pipeline to optimize a compact model's generalizable AHD capabilities. Experiments across eight diverse domains, including four held-out tasks, demonstrate that our 4B-parameter agent matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations. Model and inference scaling analysis further reveals that AHD Agent offers an effective trajectory toward truly autonomous heuristic design.

## 1 Introduction

NP-hard combinatorial optimization problems (COPs) are fundamental to many real-world systems, including transportation, planning, and decision-making [[30](https://arxiv.org/html/2605.08756#bib.bib13), [35](https://arxiv.org/html/2605.08756#bib.bib11)]. Efficiently solving these problems relies heavily on well-designed heuristics [[7](https://arxiv.org/html/2605.08756#bib.bib6)]. Traditionally, heuristic design is a highly manual and time-intensive process that requires experts to analyze the solving process and rely on extensive trial and error. To mitigate these limitations, automatic heuristic design (AHD) has emerged as a promising paradigm for heuristic generation [[2](https://arxiv.org/html/2605.08756#bib.bib15)]. However, traditional AHD approaches, such as genetic programming (GP), still rely heavily on expert-designed components [[22](https://arxiv.org/html/2605.08756#bib.bib36), [31](https://arxiv.org/html/2605.08756#bib.bib12)].

Recently, large language models (LLMs) have been introduced into AHD as heuristic generators within evolutionary computing (EC) frameworks [[36](https://arxiv.org/html/2605.08756#bib.bib19), [33](https://arxiv.org/html/2605.08756#bib.bib17)]. In these frameworks, LLMs generate new heuristics based on candidates selected according to predefined rules. The generated heuristics are then evaluated, forming a feedback-generation loop.

![Refer to caption](https://arxiv.org/html/2605.08756v1/x1.png)

Figure 1: Traditional LLM-based AHD vs. our AHD Agent. Traditional AHD places the LLM inside a fixed loop. AHD Agent enables the LLM to design heuristics by actively calling tools, generating candidates, and performing evaluations.

However, existing LLM-based AHD frameworks (e.g., EoH [[26](https://arxiv.org/html/2605.08756#bib.bib22)], ReEvo [[59](https://arxiv.org/html/2605.08756#bib.bib23)]) still face key limitations. As shown in [Figure 1](https://arxiv.org/html/2605.08756#S1.F1), they treat LLMs as passive heuristic generators within fixed workflows. These workflows rely on manually designed and limited context (e.g., crossover based on top heuristics [[26](https://arxiv.org/html/2605.08756#bib.bib22)]), which may fail to capture the information needed at a specific design step, such as the failure modes of prior heuristics. As a result, the model cannot identify information gaps or retrieve targeted evidence, and instead relies on inefficient trial-and-error generation. Our preliminary study in [Figure 2](https://arxiv.org/html/2605.08756#S1.F2) further shows that simply providing all available information (tools) to LLMs within these fixed workflows brings limited gains and may even hurt performance, suggesting that the key challenge is not information availability alone, but the lack of state-dependent mechanisms for acquiring and using relevant information. Additionally, existing frameworks typically use general-purpose LLMs that are not specifically aligned for AHD, leading to costly trial-and-error search.

![Refer to caption](https://arxiv.org/html/2605.08756v1/x2.png)

Figure 2: Tools help AHD Agent more than fixed-workflow LLM-based AHD. Mean validation gaps are reported under DeepSeek-V4-Flash [[6](https://arxiv.org/html/2605.08756#bib.bib55)]. For EoH and ReEvo, all tools are called at each LLM-generation step. Details are shown in [Section D.6](https://arxiv.org/html/2605.08756#A4.SS6).

To overcome these limitations, we propose AHD Agent, the first tool-integrated multi-turn framework for LLM-based AHD. Instead of following a fixed pipeline, AHD Agent enables LLMs to proactively decide whether to generate heuristics or use tools to retrieve relevant information. This allows the model to adapt its design strategy based on intermediate feedback, such as evaluation results and tool outputs. Building on AHD Agent, we further develop an agentic reinforcement learning (RL) system that optimizes the base model with GRPO [[53](https://arxiv.org/html/2605.08756#bib.bib95)] to improve its generalizable AHD capability. We introduce an AHD RL environment synthesis pipeline that constructs diverse training environments by varying evaluation instances, solver backbones, and initial heuristics. The RL-trained agent matches or surpasses traditional LLM-based AHD frameworks with substantially fewer heuristic evaluations. Model and inference scaling analyses further reveal the potential of our framework, suggesting an effective and efficient path toward LLM-based AHD.

Our contributions can be summarized as follows:

- We introduce AHD Agent, the first tool-integrated multi-turn framework for LLM-based AHD that enables proactive, state-dependent tool use for heuristic generation, rather than following fixed workflows with static context.
- We develop an agentic RL system with an AHD environment synthesis pipeline and multi-domain joint training, substantially improving the model's generalizable AHD capability across diverse settings.
- We conduct extensive experiments across eight evaluation settings spanning diverse problem domains, instance scales, and solver backbones. Our 4B-parameter agent outperforms baselines using larger models and exhibits strong generalization, establishing AHD Agent as a competitive and efficient alternative to conventional methods.

## 2 Related Work

LLM-based AHD. Recent LLM-based AHD methods use LLMs as code generators in a feedback-driven search loop, where candidate heuristics are generated, evaluated, and refined. FunSearch [[36](https://arxiv.org/html/2605.08756#bib.bib19)] and EoH [[26](https://arxiv.org/html/2605.08756#bib.bib22)] established this paradigm, and later works extend it with fixed workflows such as reflection and tree search [[59](https://arxiv.org/html/2605.08756#bib.bib23), [64](https://arxiv.org/html/2605.08756#bib.bib24), [25](https://arxiv.org/html/2605.08756#bib.bib18), [5](https://arxiv.org/html/2605.08756#bib.bib56), [42](https://arxiv.org/html/2605.08756#bib.bib57)]. Other extensions apply LLM-AHD to routing, scheduling, MILP, SAT, and related optimization problems [[13](https://arxiv.org/html/2605.08756#bib.bib25), [24](https://arxiv.org/html/2605.08756#bib.bib26), [62](https://arxiv.org/html/2605.08756#bib.bib28), [4](https://arxiv.org/html/2605.08756#bib.bib27), [61](https://arxiv.org/html/2605.08756#bib.bib102)]. Despite promising results, most methods still prescribe the search procedure externally, leaving the LLM mainly as a candidate generator.

RL for AHD. Recent work has begun to use RL to enhance LLM-based heuristic generation [[48](https://arxiv.org/html/2605.08756#bib.bib54), [65](https://arxiv.org/html/2605.08756#bib.bib59), [15](https://arxiv.org/html/2605.08756#bib.bib53)]. For example, CALM [[15](https://arxiv.org/html/2605.08756#bib.bib53)] co-evolves the LLM and the heuristic population within a fixed evolutionary search workflow. These methods show that RL feedback can improve heuristic search, but they either remain tied to fixed workflows or specialize to a particular solver and problem family. In contrast, AHD Agent learns a transferable heuristic-design policy from multi-domain RL training. The learned policy controls the multi-turn design process itself: it decides when to evaluate, which tools to invoke, and how to revise candidate heuristics based on feedback. Our experiments show that this policy generalizes to unseen problem families and transfers across evaluation protocols.

LLM agents and reinforcement learning. LLMs have increasingly been used as autonomous decision-making agents across optimization [[17](https://arxiv.org/html/2605.08756#bib.bib99), [28](https://arxiv.org/html/2605.08756#bib.bib100)], program generation [[60](https://arxiv.org/html/2605.08756#bib.bib88)], smart device operation [[63](https://arxiv.org/html/2605.08756#bib.bib84), [14](https://arxiv.org/html/2605.08756#bib.bib77)], interactive gameplay [[51](https://arxiv.org/html/2605.08756#bib.bib79)], and robotic control [[66](https://arxiv.org/html/2605.08756#bib.bib80)]. Early studies mainly leveraged frozen pre-trained models through prompting strategies such as ReAct [[57](https://arxiv.org/html/2605.08756#bib.bib85)] and Reflexion [[43](https://arxiv.org/html/2605.08756#bib.bib87)], often enhanced with memory, retrieval, and external tools [[52](https://arxiv.org/html/2605.08756#bib.bib76), [49](https://arxiv.org/html/2605.08756#bib.bib75), [29](https://arxiv.org/html/2605.08756#bib.bib104), [38](https://arxiv.org/html/2605.08756#bib.bib86), [56](https://arxiv.org/html/2605.08756#bib.bib81), [18](https://arxiv.org/html/2605.08756#bib.bib103)]. More recent work has shifted toward adapting model parameters via supervised fine-tuning or reinforcement learning (RL) [[19](https://arxiv.org/html/2605.08756#bib.bib3), [1](https://arxiv.org/html/2605.08756#bib.bib106), [9](https://arxiv.org/html/2605.08756#bib.bib105), [55](https://arxiv.org/html/2605.08756#bib.bib101), [11](https://arxiv.org/html/2605.08756#bib.bib89)], enabling agents to improve through direct interaction with environments rather than relying solely on handcrafted workflows. Representative RL methods include PPO [[39](https://arxiv.org/html/2605.08756#bib.bib74)], MCTS [[44](https://arxiv.org/html/2605.08756#bib.bib93)], RLOO [[21](https://arxiv.org/html/2605.08756#bib.bib92)], and GRPO [[53](https://arxiv.org/html/2605.08756#bib.bib95)].

## 3 Methodology

We propose AHD Agent, a tool-integrated, multi-turn framework for LLM-based AHD. Unlike fixed-workflow LLM-AHD methods, AHD Agent treats the LLM as the decision-making agent in a multi-turn design process. We first formulate the problem and describe the AHD Agent interaction protocol and tool set. We then present the agentic RL training process, including the AHD environment synthesis pipeline and cross-domain training.

### 3.1 The AHD Agent Framework

#### 3.1.1 Formulation

Let $\mathcal{I}$ denote the instance space of a target optimization problem, and let $\mathcal{D}_{\mathrm{design}} \subset \mathcal{I}$ be the design set visible during one heuristic-design episode. The goal of AHD is to construct a heuristic $h$, represented in executable code form. The evaluator runs $h$ under the specified solver setting on $\mathcal{D}_{\mathrm{design}}$ and returns a scalar score $\operatorname{Score}(h; \mathcal{D}_{\mathrm{design}})$, together with execution feedback. We normalize $\operatorname{Score}$ so that larger values are always better.
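Since the objectives of most COPs considered here are minimized (e.g., tour length), one simple normalization that makes larger scores always better is to negate the mean objective over the design set. A minimal sketch of this convention; the helper names `evaluate` and `score` are ours, not from the paper:

```python
from statistics import mean

def score(heuristic, design_set, evaluate):
    """Run `heuristic` on every design instance and return a
    larger-is-better scalar: the negated mean objective."""
    objectives = [evaluate(heuristic, inst) for inst in design_set]
    return -mean(objectives)

# Toy usage: `evaluate` returns a tour length (lower is better),
# so negation makes the better heuristic score higher.
evaluate = lambda h, inst: h(inst)
short = lambda inst: inst * 1.0   # pretend this yields shorter tours
long_ = lambda inst: inst * 1.5
design = [10.0, 20.0]
assert score(short, design, evaluate) > score(long_, design, evaluate)
```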

In the AHD Agent framework, the AHD process is formulated as a finite-horizon Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R)$. Here $\mathcal{S}$ is the state space, where each state $s_t$ can be an observation sequence or interaction history; $\mathcal{A}$ is the token-level action space, covering heuristic generation/evaluation, tool calls, and final responses; $P$ and $R$ denote the transition dynamics and the reward generation process, respectively. The initial state $s_0$ contains the problem description and a seed heuristic $h_0$. At each time step $t$, the agent policy $\pi_\theta$ generates an action conditioned on the current state $s_t$ and the interaction history $\tau_{<t}$, after which the environment returns a reward and a new state:

$$a_t \sim \pi_\theta(\cdot \mid s_t, \tau_{<t}), \qquad r_t = R(s_t, a_t), \qquad s_{t+1} \sim P(\cdot \mid s_t, a_t).$$

Here $\tau_{<t} = \{s_0, a_0, r_0, \ldots, s_{t-1}, a_{t-1}, r_{t-1}\}$ denotes the interaction history. This interactive process continues for a maximum horizon $K$ or until a final response is produced. We denote the final response as $h_{\mathrm{final}}$, where the model returns a heuristic in a predefined format as the final answer. We set intermediate rewards to zero, so the episode return is determined only by the final heuristic score, i.e., $R(\tau) = \operatorname{Score}(h_{\mathrm{final}}; \mathcal{D}_{\mathrm{design}})$.

![Refer to caption](https://arxiv.org/html/2605.08756v1/x3.png)

Figure 3: Demonstration of the AHD Agent workflow. Given a problem description, a seed heuristic, and a set of tools, the model iteratively decides its next action based on the session history, which records all previous interactions. At each turn, it can call tools, generate and evaluate heuristics, and finally return the best heuristic.
#### 3.1.2 Agent Loop

Figure [3](https://arxiv.org/html/2605.08756#S3.F3) illustrates how AHD Agent conducts a heuristic-design episode through multi-turn decision making. The episode starts from the initial state $s_0$, which contains the problem description and a seed heuristic $h_0$. At the first turn, the agent generates a candidate heuristic and submits it to the evaluator. The resulting execution feedback and objective score are then recorded, updating both the state $s$ and the interaction history $\tau$. In a later turn, the agent calls a tool to inspect specific properties of $\mathcal{D}_{\mathrm{design}}$. After multiple turns, the agent produces a final answer, and the framework returns the final heuristic $h_{\mathrm{final}}$.

In practical evaluation, the design loop is constrained by a maximum evaluator-call budget $B$. Evaluator submissions consume this budget, while diagnostic tool calls do not. When the remaining evaluator budget is nearly exhausted, the framework reminds the agent to stop further exploration and output a final heuristic.
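The budgeted loop above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the action schema (`tool` / `evaluate` / final) and all names are our simplification:

```python
def run_episode(agent, evaluator, tools, budget=30, max_turns=50):
    """Multi-turn design loop: evaluator submissions consume the budget,
    diagnostic tool calls do not (illustrative action schema)."""
    history, best = [], None
    for _ in range(max_turns):
        if budget <= 1:
            history.append("system: budget nearly exhausted; "
                           "return your final heuristic")
        action = agent(history)
        if action["type"] == "tool":              # free diagnostic call
            history.append(tools[action["name"]](*action.get("args", ())))
        elif action["type"] == "evaluate":        # consumes one budget unit
            budget -= 1
            score = evaluator(action["code"])
            history.append(f"score={score}")
            if best is None or score > best[1]:
                best = (action["code"], score)
        else:                                     # final answer
            return action.get("code") or (best and best[0])
    return best and best[0]
```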

#### 3.1.3 Tool Library

The tool library provides a set of callable tools through which the LLM agent can acquire information during heuristic design. At each turn of a design run, AHD Agent decides whether to invoke a tool according to the current design state. If a tool call is needed, the agent further selects which tool to use, enabling it to acquire information adaptively during the search process. Moreover, this modular design makes the framework flexible: adding or removing a tool only changes the information that the agent can actively request, without requiring a manually redesigned search workflow. All tools in the tool library are design-set-only: they can inspect the design dataset $\mathcal{D}_{\mathrm{design}}$, the current interaction history, and the evaluator feedback already produced during the design run, but they never reveal the validation dataset. The current tool library contains two read-only diagnostic tool groups.

Instance-analysis tools. Instance-analysis tools provide three query modes to expose structural information about $\mathcal{D}_{\mathrm{design}}$: `summary`, `instance`, and `compare`. The `summary` mode returns dataset-level statistics, including scale, spatial, nearest-neighbor, density, and domain-specific features when available. The `instance` mode inspects selected instances, while the `compare` mode contrasts selected instances or groups to reveal structural heterogeneity. By selecting the appropriate mode, the agent can acquire instance-level information when needed, e.g., to decide whether a heuristic should account for clustered nodes, capacity-demand patterns, or heterogeneous instance scales.
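A toy version of such a diagnostic tool over 2-D point instances might look like the sketch below; the statistics computed are a small subset of those the paper describes, and the interface is our assumption:

```python
from statistics import mean, pstdev

def instance_analysis(design_set, mode="summary", ids=None):
    """Read-only diagnostic over 2-D point instances (illustrative
    stand-in for the summary / instance / compare query modes)."""
    sizes = [len(inst) for inst in design_set]
    if mode == "summary":           # dataset-level statistics
        return {"num_instances": len(design_set),
                "mean_size": mean(sizes),
                "size_std": pstdev(sizes)}
    if mode == "instance":          # inspect one selected instance
        i = ids[0]
        xs = [p[0] for p in design_set[i]]
        return {"id": i, "size": sizes[i], "x_range": (min(xs), max(xs))}
    if mode == "compare":           # contrast two selected instances
        a, b = ids
        return {"size_diff": sizes[a] - sizes[b]}
    raise ValueError(f"unknown mode: {mode}")
```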

AST tools. AST tools compare a candidate program against previous attempts in the same session at the abstract-syntax-tree level and quantify its structural novelty. If a candidate is structurally close to previously evaluated code, the agent can revise it before committing an evaluator call. Moreover, when recent attempts exhibit low structural novelty, the agent can use AST-derived information to choose larger structural revisions, reducing the risk of being trapped in local refinement loops during heuristic search.
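A simple stand-in for this kind of AST-level novelty check can be built from Python's `ast` and `difflib` modules. Comparing node-type sequences is our own choice of similarity measure, not necessarily the paper's:

```python
import ast
from difflib import SequenceMatcher

def ast_novelty(candidate, previous):
    """Structural novelty of `candidate` source code vs. earlier
    attempts: 1 minus the highest similarity between AST node-type
    sequences (constant values and names are ignored)."""
    def shape(src):
        return [type(n).__name__ for n in ast.walk(ast.parse(src))]
    cand = shape(candidate)
    if not previous:
        return 1.0  # nothing to compare against: maximally novel
    best = max(SequenceMatcher(None, cand, shape(p)).ratio()
               for p in previous)
    return 1.0 - best
```

Because only node types are compared, changing a constant (e.g., a weight from `1` to `2`) yields zero novelty, while a genuinely different control structure scores higher.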

Overall, the tool library is designed to support active information acquisition rather than to inject hand-designed heuristic rules. The tools expose information about the design dataset and code novelty, while the LLM agent remains responsible for deciding when to call them and how to translate their observations into heuristic revisions. Detailed tool interfaces and outputs are provided in Appendix [B](https://arxiv.org/html/2605.08756#A2).

### 3.2 Agentic RL Training

An AHD environment consists of an exploration setup and an evaluation protocol. The exploration setup is specified by the seed heuristic (when provided), which defines the initial state of the agent's search. The evaluation protocol is specified by the target problem, design set, and solver backbone, which together determine the feedback for each generated heuristic. This definition makes environment synthesis automatic: instances are generated programmatically, and candidate heuristics are evaluated by execution. We synthesize environments by varying three elements that shape the feedback: seed heuristics, design-set instances, and evaluation solver backbones.

#### 3.2.1 AHD Environment Synthesis Pipeline

RL training requires diverse AHD environments with different optimization dynamics and feedback patterns. An AHD environment is naturally specified by three components: the evaluation instances, the solver backbone in which the heuristic is used, and the starting point of the design process. Since evaluation instances can be automatically generated and candidate heuristics can be programmatically evaluated, we can synthesize large-scale AHD RL environments without human annotations. We therefore construct training data by diversifying these three components.

Seed heuristic diversity. We vary the starting point of the AHD process across two settings: seed-guided design, where the model improves a provided seed heuristic, and seed-free design, where the model starts only from the task description and function interface. For seed-guided tasks, we use ReEvo [[59](https://arxiv.org/html/2605.08756#bib.bib23)] as an automatic seed-heuristic collector and retain executable heuristics with different structures and quality levels, rather than only the best candidates. These seed heuristics expose the model to varied failure modes and improvement opportunities. For seed-free tasks, a reference heuristic is used internally for reward normalization. This reduces over-reliance on existing seed heuristics and improves the model's ability to initialize heuristic logic.

Instance diversity. We construct design sets from instances with different characteristics, such as instance sizes and node layouts in TSP, or capacity constraints and demand patterns in CVRP. The same heuristic can behave differently across these design-set instances, producing diverse feedback signals. This encourages the model to learn refinement strategies that are robust to changes in instance characteristics.
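As an illustration, instance diversity for TSP could be synthesized by varying instance size and node layout. The generator below (uniform vs. clustered layouts, with our own parameter choices) is a sketch of this idea, not the paper's pipeline:

```python
import random

def make_tsp_instance(n, layout="uniform", rng=None):
    """Sample one TSP instance of n points in the unit square;
    layout varies between uniform and clustered node placement."""
    rng = rng or random.Random(0)
    if layout == "uniform":
        return [(rng.random(), rng.random()) for _ in range(n)]
    # clustered: points scattered tightly around a few centers
    centers = [(rng.random(), rng.random()) for _ in range(3)]
    pts = []
    for _ in range(n):
        cx, cy = rng.choice(centers)
        pts.append((cx + rng.gauss(0, 0.02), cy + rng.gauss(0, 0.02)))
    return pts

def make_design_set(sizes=(50, 100), layouts=("uniform", "clustered"),
                    seed=0):
    """Cross sizes with layouts to diversify design-set feedback."""
    rng = random.Random(seed)
    return [make_tsp_instance(n, lay, rng)
            for n in sizes for lay in layouts]
```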

Solver-backbone diversity. We vary the solver backbone used for heuristic evaluation, including constructive and ACO-based solvers. These backbones differ in heuristic interfaces, code structures, and search behaviors. Training across them encourages transferable design decisions beyond a single evaluation solver.

#### 3.2.2 Cross-domain RL Training

Existing LLM-based AHD RL methods train the model on a single problem domain [[15](https://arxiv.org/html/2605.08756#bib.bib53), [65](https://arxiv.org/html/2605.08756#bib.bib59)], limiting its general AHD ability. To improve general AHD capability, we jointly train one LLM across multiple problem domains and solver backbones. Specifically, we use two problem domains and two solver backbones, and construct the corresponding training environments with the synthesis pipeline above. We train the model with GRPO, with the algorithm and implementation details provided in [Appendix C](https://arxiv.org/html/2605.08756#A3). At test time, the trained model is deployed in the same multi-turn environment, but evaluated on datasets disjoint from those used for training.
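At the core of GRPO is a group-relative advantage: each trajectory's reward is normalized against the other rollouts sampled for the same prompt. A minimal sketch of that computation (clipping, the KL term, and token-level credit assignment are omitted):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: standardize each
    trajectory's reward by the mean and std of its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Trajectories that beat their group average get positive advantages and are reinforced; the normalization removes the need for a separate learned value function.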

Reward design. We use the score of the final heuristic as the reward signal for the whole trajectory. If no valid code can be extracted, the trajectory receives a fixed penalty. If the code fails during execution or produces an infeasible solution, it receives another fixed penalty. For a feasible final heuristic $h_{\mathrm{final}}$, we measure its improvement over a baseline heuristic $h_b$ on the design set, defined as $\Delta(h_{\mathrm{final}}) = \operatorname{Score}(h_{\mathrm{final}}; \mathcal{D}_{\mathrm{design}}) - \operatorname{Score}(h_b; \mathcal{D}_{\mathrm{design}})$. Here $h_b$ is the provided seed heuristic for seed-guided tasks, and a default human-designed heuristic for seed-free tasks. The reward is


$$\mathcal{R} = \begin{cases} -2.0, & \text{no code is extracted}, \\ -1.5, & \text{execution fails or the solution is infeasible}, \\ \Delta(h_{\mathrm{final}}), & \text{feasible heuristic}. \end{cases}$$

This design discourages invalid code while directly rewarding improvements over the baseline associated with each synthesized environment.
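These reward cases translate directly into code. The `extract` and `run` callbacks below are illustrative placeholders for the framework's code-extraction and evaluation steps, not the authors' API:

```python
def reward(response, extract, run):
    """Trajectory reward following the case analysis above.

    extract(response) -> heuristic source or None if no code found.
    run(heuristic)    -> Score(h_final) - Score(h_b) on the design
                         set, None if the solution is infeasible;
                         raises on execution failure.
    """
    heuristic = extract(response)
    if heuristic is None:
        return -2.0              # no code is extracted
    try:
        delta = run(heuristic)   # improvement over the baseline
    except Exception:
        return -1.5              # execution fails
    if delta is None:
        return -1.5              # infeasible solution
    return delta                 # feasible heuristic
```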

Table 1: In-domain performance and search efficiency. Validation results on the four RL training domains: TSP-Constructive, CVRP-Constructive, TSP-ACO, and CVRP-ACO. Each problem size reports the mean objective and mean Gap (%); lower is better (↓). Eval denotes design-time evaluator calls, and Cost denotes the average USD cost per run. Method results, Eval, and Cost are averaged over five independent runs. Reference objectives and seed heuristics are detailed in Appendix [D.2](https://arxiv.org/html/2605.08756#A4.SS2).

TSP-Constructive (↓)

| Method | N=100 Mean | N=100 Gap (%) | N=200 Mean | N=200 Gap (%) | Eval | Cost ($) |
|---|---|---|---|---|---|---|
| Optimal | 7.726 | 0.000 | 10.696 | 0.000 | – | – |
| Baseline heuristic | 9.586 | 24.094 | 13.399 | 25.243 | – | – |
| ReEvo (GPT-4o) | 9.583 | 24.054 | 13.379 | 25.064 | 100 | 0.392 |
| EOH (GPT-4o) | 9.127 | 18.136 | 12.754 | 19.240 | 100 | 0.214 |
| MCTS-AHD (GPT-4o) | 9.271 | 19.998 | 13.010 | 21.632 | 100 | 0.073 |
| AHD Agent (GPT-4o) | 9.558 | 23.727 | 13.402 | 25.278 | 4.6 | 0.123 |
| ReEvo (DeepSeek-V4-Flash) | 9.093 | 17.702 | 12.843 | 20.069 | 100 | 0.059 |
| EOH (DeepSeek-V4-Flash) | 9.484 | 22.759 | 13.289 | 24.228 | 100 | 0.027 |
| MCTS-AHD (DeepSeek-V4-Flash) | 9.345 | 20.944 | 13.106 | 22.523 | 100 | 0.042 |
| AHD Agent (DeepSeek-V4-Flash) | 9.448 | 22.285 | 13.114 | 22.598 | 18.4 | 0.054 |
| CALM (Qwen3-4B RL) | 8.852 | 14.597 | 13.159 | 23.049 | 150 | 0.033 |
| AHD Agent (Qwen3-4B RL) | 9.044 | 17.067 | 12.723 | 18.961 | 12.8 | 0.010 |
| AHD Agent w/SR (Qwen3-4B RL) | 8.795 | 13.855 | 12.314 | 15.137 | 100 | 0.043 |

CVRP-Constructive (↓)

| Method | N=100 Mean | N=100 Gap (%) | N=200 Mean | N=200 Gap (%) | Eval | Cost ($) |
|---|---|---|---|---|---|---|
| Optimal | 17.753 | 0.000 | 32.301 | 0.000 | – | – |
| Baseline heuristic | 24.277 | 37.275 | 42.466 | 32.037 | – | – |
| ReEvo (GPT-4o) | 22.915 | 29.507 | 40.521 | 25.878 | 100 | 0.642 |
| EOH (GPT-4o) | 23.348 | 31.940 | 40.988 | 27.340 | 100 | 0.332 |
| MCTS-AHD (GPT-4o) | 22.587 | 27.583 | 39.809 | 23.577 | 100 | 0.702 |
| AHD Agent (GPT-4o) | 23.619 | 33.471 | 41.255 | 28.094 | 12 | 0.114 |
| ReEvo (DeepSeek-V4-Flash) | 22.426 | 26.605 | 39.658 | 23.054 | 100 | 0.037 |
| EOH (DeepSeek-V4-Flash) | 22.671 | 28.005 | 40.216 | 24.747 | 100 | 0.016 |
| MCTS-AHD (DeepSeek-V4-Flash) | 22.680 | 28.070 | 40.138 | 24.609 | 100 | 0.074 |
| AHD Agent (DeepSeek-V4-Flash) | 22.081 | 24.600 | 39.120 | 21.317 | 20.4 | 0.067 |
| CALM (Qwen3-4B RL) | 24.134 | 36.300 | 43.236 | 34.588 | 150 | 0.094 |
| AHD Agent (Qwen3-4B RL) | 21.466 | 21.167 | 38.166 | 18.369 | 14.6 | 0.004 |
| AHD Agent w/SR (Qwen3-4B RL) | 21.442 | 21.030 | 38.155 | 18.325 | 100 | 0.033 |

TSP-ACO (↓)

| Method | N=100 Mean | N=100 Gap (%) | N=200 Mean | N=200 Gap (%) | Eval | Cost ($) |
|---|---|---|---|---|---|---|
| Optimal | 7.784 | 0.000 | 10.682 | 0.000 | – | – |
| Baseline heuristic | 10.081 | 29.503 | 15.304 | 43.276 | – | – |
| ReEvo (GPT-4o) | 8.669 | 11.358 | 12.811 | 19.943 | 100 | 0.473 |
| EOH (GPT-4o) | 8.412 | 8.057 | 12.218 | 14.385 | 100 | 0.236 |
| MCTS-AHD (GPT-4o) | 8.365 | 8.536 | 12.302 | 15.274 | 100 | 1.026 |
| AHD Agent (GPT-4o) | 8.893 | 14.847 | 13.299 | 24.307 | 7 | 0.058 |
| ReEvo (DeepSeek-V4-Flash) | 8.350 | 7.255 | 12.232 | 14.514 | 100 | 0.051 |
| EOH (DeepSeek-V4-Flash) | 8.473 | 8.839 | 12.370 | 15.798 | 100 | 0.013 |
| MCTS-AHD (DeepSeek-V4-Flash) | 8.475 | 8.863 | 12.395 | 16.039 | 100 | 0.104 |
| AHD Agent (DeepSeek-V4-Flash) | 8.389 | 7.766 | 12.115 | 13.415 | 14.2 | 0.061 |
| CALM (Qwen3-4B RL) | 8.607 | 10.555 | 12.672 | 18.630 | 150 | 0.069 |
| AHD Agent (Qwen3-4B RL) | 8.345 | 7.195 | 12.039 | 12.705 | 11.8 | 0.003 |
| AHD Agent w/SR (Qwen3-4B RL) | 8.342 | 7.118 | 12.015 | 12.483 | 100 | 0.031 |

CVRP-ACO (↓)

| Method | N=100 Mean | N=100 Gap (%) | N=200 Mean | N=200 Gap (%) | Eval | Cost ($) |
|---|---|---|---|---|---|---|
| Optimal | 12.708 | 0.000 | 22.353 | 0.000 | – | – |
| Baseline heuristic | 29.962 | 135.900 | 48.389 | 116.568 | – | – |
| ReEvo (GPT-4o) | 32.258 | 147.916 | 52.867 | 209.525 | 100 | 1.101 |
| EOH (GPT-4o) | 15.871 | 25.002 | 28.578 | 27.895 | 100 | 0.401 |
| MCTS-AHD (GPT-4o) | 17.797 | 39.883 | 32.292 | 44.518 | 100 | 0.986 |
| AHD Agent (GPT-4o) | 27.236 | 114.492 | 44.779 | 100.385 | 7.8 | 0.086 |
| ReEvo (DeepSeek-V4-Flash) | 17.310 | 36.233 | 31.735 | 41.997 | 100 | 0.065 |
| EOH (DeepSeek-V4-Flash) | 16.279 | 28.105 | 29.203 | 30.659 | 100 | 0.026 |
| MCTS-AHD (DeepSeek-V4-Flash) | 16.457 | 29.514 | 29.200 | 30.656 | 100 | 0.464 |
| AHD Agent (DeepSeek-V4-Flash) | 18.327 | 44.249 | 32.841 | 46.953 | 17.6 | 0.066 |
| CALM (Qwen3-4B RL) | 16.612 | 30.721 | 30.243 | 35.298 | 150 | 0.075 |
| AHD Agent (Qwen3-4B RL) | 16.603 | 30.667 | 30.516 | 36.538 | 12.9 | 0.004 |
| AHD Agent w/SR (Qwen3-4B RL) | 15.634 | 23.145 | 28.100 | 25.637 | 100 | 0.039 |

## 4 Experiments

### 4.1 Experimental Setup

Problems. We evaluate AHD Agent on heuristic-design benchmarks spanning combinatorial and continuous optimization. The combinatorial suite includes constructive routing tasks (TSP, CVRP, and OVRP) [[23](https://arxiv.org/html/2605.08756#bib.bib62), [27](https://arxiv.org/html/2605.08756#bib.bib63)] and ACO-based tasks (TSP, CVRP, OP, and MKP) [[8](https://arxiv.org/html/2605.08756#bib.bib61)], covering different problem structures and solver backbones. We further evaluate cross-protocol transfer on cost-aware Bayesian optimization (CAF) [[58](https://arxiv.org/html/2605.08756#bib.bib52)], which lies outside the combinatorial suite. Detailed formulations, function interfaces, instance-generation protocols, and validation configurations are provided in Appendix [A](https://arxiv.org/html/2605.08756#A1).

Baselines. We compare AHD Agent with three groups of baselines: problem-specific hand-crafted heuristics, fixed-workflow LLM-AHD methods, and an RL-enhanced AHD baseline. Problem-specific hand-crafted heuristics include Greedy-Construct (GC) [[37](https://arxiv.org/html/2605.08756#bib.bib68)], standard hand-crafted ACO heuristics for ACO tasks [[45](https://arxiv.org/html/2605.08756#bib.bib64), [3](https://arxiv.org/html/2605.08756#bib.bib65), [47](https://arxiv.org/html/2605.08756#bib.bib66), [10](https://arxiv.org/html/2605.08756#bib.bib67)], and classical acquisition functions for CAF [[58](https://arxiv.org/html/2605.08756#bib.bib52)]. The fixed-workflow methods include ReEvo [[59](https://arxiv.org/html/2605.08756#bib.bib23)], EOH [[26](https://arxiv.org/html/2605.08756#bib.bib22)], and MCTS-AHD [[64](https://arxiv.org/html/2605.08756#bib.bib24)], evaluated with GPT-4o [[16](https://arxiv.org/html/2605.08756#bib.bib73)] and DeepSeek-V4-Flash [[6](https://arxiv.org/html/2605.08756#bib.bib55)]. We also include CALM [[15](https://arxiv.org/html/2605.08756#bib.bib53)], which fine-tunes the LLM within a fixed evolutionary search workflow. For fair comparison, all LLM-based AHD methods share the same task interface, data splits, objective computation, and seed heuristic when applicable.

RL training. We fine-tune Qwen3-4B-Instruct-2507 [[50](https://arxiv.org/html/2605.08756#bib.bib60)] with GRPO on 4,000 prompts sampled equally from four training domains: TSP-Constructive, CVRP-Constructive, TSP-ACO, and CVRP-ACO. Generalization is evaluated on three unseen combinatorial domains (OP-ACO, OVRP-Constructive, and MKP-ACO) and on the cross-protocol CAF benchmark. All experiments are conducted on 4×A100-80G GPUs with Intel Xeon 8358P CPUs (72 cores, 2.6 GHz). Detailed information is provided in Appendix [C](https://arxiv.org/html/2605.08756#A3).

Hyperparameter settings. We use validation Gap (%) as the primary metric, where lower is better. To compare search efficiency, we cap design-time evaluator calls: fixed-workflow AHD baselines use 100 calls, CALM uses 150 calls, AHD Agent uses 30 calls, and Sequential Refinement AHD Agent (AHD Agent w/SR) uses 100 calls for budget-matched comparison. We also report the average USD cost per run; pricing details are provided in Appendix [D.1](https://arxiv.org/html/2605.08756#A4.SS1).
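The Gap (%) metric can be computed as the mean relative deviation from a per-instance reference objective. The sketch below is our reconstruction of this standard definition and handles both minimization and maximization domains:

```python
def mean_gap(objectives, optima, minimize=True):
    """Validation Gap (%): average relative deviation of achieved
    objectives from (near-)optimal references; lower is better."""
    gaps = []
    for obj, opt in zip(objectives, optima):
        g = (obj - opt) / opt if minimize else (opt - obj) / opt
        gaps.append(100.0 * g)
    return sum(gaps) / len(gaps)
```

For example, a tour of length 11.0 against an optimum of 10.0 yields a 10% gap; for a maximization task (e.g., OP or MKP profit), the numerator is flipped so that falling short of the optimum again gives a positive gap.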

Table 2: Generalization to unseen combinatorial domains. Validation results on OP-ACO (↑, higher is better), OVRP-Constructive (↓, lower is better), and MKP-ACO (↑), which are excluded from RL training. Method results, Eval, and Cost are averaged over five independent runs. Reference objectives and seed heuristics are detailed in Appendix [D.2](https://arxiv.org/html/2605.08756#A4.SS2).

OP-ACO (↑)

| Method | N=50 Mean | N=50 Gap (%) | N=100 Mean | N=100 Gap (%) | N=200 Mean | N=200 Gap (%) | Eval | Cost ($) |
|---|---|---|---|---|---|---|---|---|
| Optimal | 16.019 | 0.000 | 33.234 | 0.000 | 62.595 | 0.000 | – | – |
| Baseline heuristic | 13.463 | 16.019 | 24.380 | 26.635 | 36.675 | 41.385 | – | – |
| ReEvo (GPT-4o) | 15.043 | 6.136 | 29.307 | 11.822 | 51.817 | 17.183 | 100 | 0.903 |
| EOH (GPT-4o) | 15.084 | 5.890 | 29.651 | 10.786 | 52.299 | 16.442 | 100 | 0.343 |
| MCTS-AHD (GPT-4o) | 14.977 | 6.358 | 29.863 | 9.900 | 52.892 | 15.230 | 100 | 0.930 |
| AHD Agent (GPT-4o) | 13.732 | 14.344 | 24.995 | 24.787 | 38.599 | 38.297 | 7 | 0.089 |
| ReEvo (DeepSeek-V4-Flash) | 14.952 | 6.687 | 29.118 | 12.368 | 51.161 | 18.257 | 100 | 0.047 |
| EOH (DeepSeek-V4-Flash) | 15.060 | 6.071 | 30.096 | 9.454 | 53.815 | 14.022 | 100 | 0.018 |
| MCTS-AHD (DeepSeek-V4-Flash) | 15.141 | 5.530 | 30.270 | 8.933 | 54.298 | 13.252 | 100 | 0.074 |
| AHD Agent (DeepSeek-V4-Flash) | 14.948 | 6.736 | 29.208 | 12.102 | 50.783 | 18.841 | 19.0 | 0.065 |
| CALM (Qwen3-4B RL) | 14.989 | 6.471 | 29.809 | 10.289 | 53.176 | 15.036 | 150 | 0.074 |
| AHD Agent (Qwen3-4B RL) | 15.173 | 5.340 | 30.364 | 8.624 | 54.555 | 12.837 | 13.6 | 0.005 |
| AHD Agent w/SR (Qwen3-4B RL) | 15.361 | 4.184 | 30.970 | 6.601 | 55.836 | 10.467 | 100 | 0.034 |

OVRP-Constructive (↓)

| Method | N=50 Mean | N=50 Gap (%) | N=100 Mean | N=100 Gap (%) | N=200 Mean | N=200 Gap (%) | Eval | Cost ($) |
|---|---|---|---|---|---|---|---|---|
| Optimal | 6.574 | 0.000 | 10.636 | 0.000 | 18.491 | 0.000 | – | – |
| Baseline heuristic | 14.413 | 119.248 | 23.711 | 123.114 | 41.900 | 126.979 | – | – |
| ReEvo (GPT-4o) | 13.341 | 102.901 | 22.645 | 113.117 | 40.221 | 117.773 | 100 | 0.678 |
| EOH (GPT-4o) | 13.089 | 98.952 | 22.432 | 111.016 | 39.960 | 116.182 | 100 | 0.346 |
| MCTS-AHD (GPT-4o) | 13.086 | 98.942 | 22.193 | 108.690 | 39.420 | 113.459 | 100 | 0.728 |
| AHD Agent (GPT-4o) | 14.145 | 115.091 | 23.503 | 121.105 | 41.450 | 124.441 | 7.2 | 0.087 |
| ReEvo (DeepSeek-V4-Flash) | 12.880 | 95.659 | 22.115 | 107.953 | 39.690 | 114.745 | 100 | 0.040 |
| EOH (DeepSeek-V4-Flash) | 13.050 | 98.372 | 22.350 | 110.188 | 39.724 | 114.980 | 100 | 0.016 |
| MCTS-AHD (DeepSeek-V4-Flash) | 13.039 | 98.139 | 22.110 | 107.846 | 39.682 | 114.685 | 100 | 0.054 |
| AHD Agent (DeepSeek-V4-Flash) | 12.885 | 95.744 | 21.977 | 106.580 | 39.232 | 112.351 | 16.6 | 0.077 |
| CALM (Qwen3-4B RL) | 12.830 | 94.908 | 21.966 | 106.506 | 39.403 | 113.158 | 150 | 0.080 |
| AHD Agent (Qwen3-4B RL) | 12.103 | 83.756 | 20.551 | 93.153 | 37.247 | 101.337 | 10.7 | 0.004 |
| AHD Agent w/SR (Qwen3-4B RL) | 12.083 | 83.485 | 20.523 | 92.892 | 37.236 | 101.272 | 100 | 0.035 |

MKP-ACO (↑)

| Method | N=100 Mean | N=100 Gap (%) | N=200 Mean | N=200 Gap (%) | N=300 Mean | N=300 Gap (%) | Eval | Cost ($) |
|---|---|---|---|---|---|---|---|---|
| Optimal | 17.162 | 0.000 | 44.507 | 0.000 | 57.007 | 0.000 | – | – |
| Baseline heuristic | 16.497 | 3.686 | 41.069 | 7.689 | 52.783 | 8.048 | – | – |
| ReEvo (GPT-4o) | 16.952 | 1.225 | 42.591 | 4.303 | 52.476 | 7.074 | 100 | 0.637 |
| EOH (GPT-4o) | 16.874 | 1.464 | 42.667 | 3.922 | 55.130 | 2.882 | 100 | 0.227 |
| MCTS-AHD (GPT-4o) | 16.923 | 1.439 | 42.810 | 3.375 | 52.662 | 2.861 | 100 | 0.844 |
| AHD Agent (GPT-4o) | 16.943 | 1.168 | 43.201 | 2.910 | 55.674 | 2.178 | 5.8 | 0.058 |
| ReEvo (DeepSeek-V4-Flash) | 16.957 | 1.249 | 43.141 | 3.011 | 55.783 | 1.850 | 100 | 0.033 |
| EOH (DeepSeek-V4-Flash) | 17.010 | 0.921 | 43.324 | 2.607 | 55.813 | 1.818 | 100 | 0.011 |
| MCTS-AHD (DeepSeek-V4-Flash) | 16.941 | 1.311 | 42.651 | 4.130 | 55.569 | 2.451 | 100 | 0.050 |
| AHD Agent (DeepSeek-V4-Flash) | 17.021 | 0.860 | 43.301 | 2.659 | 55.789 | 1.906 | 15.2 | 0.040 |
| CALM (Qwen3-4B RL) | 16.906 | 1.517 | 42.913 | 3.521 | 55.408 | 2.427 | 150 | 0.053 |
| AHD Agent (Qwen3-4B RL) | 17.029 | 0.843 | 43.522 | 2.166 | 56.186 | 1.188 | 12.4 | 0.004 |
| AHD Agent w/SR (Qwen3-4B RL) | 17.058 | 0.669 | 43.840 | 1.442 | 56.565 | 0.592 | 100 | 0.033 |

Table 3: Generalization to cost-aware Bayesian optimization. CAF comparison on 12 benchmark functions. Ackley and Rastrigin are used as design functions during AHD search, while the remaining ten functions are held-out validation functions. Entries report gaps to the known optimum with a BO budget of 30 and 10 trials per test function; lower is better. The results of EI, EIpu, and EI-cool are from [[64]](https://arxiv.org/html/2605.08756#bib.bib24). Method results, Eval, and Cost are averaged over three independent runs.

| Method | Ackley | Rastrigin | Griewank | Rosenbrock | Levy | ThreeHumpCamel | StyblinskiTang | Hartmann-3D | Powell | Shekel | Hartmann-6D | Cosine8 | Avg. | Eval | Cost ($) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EI | 2.66% | 4.74% | 0.49% | 1.26% | 0.01% | 0.05% | 0.03% | 0.00% | 18.89% | 7.91% | 0.03% | 0.47% | 3.05% | – | – |
| EIpu | 2.33% | 5.62% | 0.34% | 2.36% | 0.01% | 0.12% | 0.02% | 0.00% | 19.83% | 7.92% | 0.03% | 0.47% | 3.25% | – | – |
| EI-cool | 2.74% | 5.78% | 0.34% | 2.29% | 0.01% | 0.07% | 0.03% | 0.00% | 14.95% | 8.21% | 0.03% | 0.54% | 2.92% | – | – |
| *LLM-based AHD: DeepSeek-V4-Flash* | | | | | | | | | | | | | | | |
| EoH | 3.14% | 4.02% | 0.23% | 1.63% | 0.02% | 0.16% | 0.18% | 0.00% | 32.02% | 7.38% | 0.18% | 0.50% | 4.12% | 100 | 0.025 |
| MCTS-AHD | 3.19% | 5.29% | 0.34% | 0.83% | 0.04% | 0.07% | 0.12% | 0.01% | 26.36% | 7.59% | 0.14% | 0.61% | 3.72% | 100 | 0.071 |
| ReEvo | 4.82% | 5.25% | 1.09% | 11.02% | 0.23% | 0.29% | 2.13% | 0.05% | 73.56% | 8.70% | 0.61% | 0.75% | 9.04% | 100 | 0.153 |
| *RL Model: Qwen3-4B-Instruct-2507* | | | | | | | | | | | | | | | |
| CALM | 3.28% | 4.22% | 1.01% | 23.05% | 0.31% | 0.57% | 1.61% | 0.04% | 50.38% | 8.44% | 0.60% | 0.86% | 7.86% | 150 | 0.057 |
| AHD Agent | 2.88% | 1.73% | 0.35% | 2.99% | 0.01% | 0.05% | 0.49% | 0.11% | 7.00% | 5.02% | 0.47% | 0.40% | 1.79% | 14.2 | 0.007 |
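For context on the hand-crafted baselines in Table 3: EIpu divides expected improvement by the evaluation cost, and EI-cool anneals that cost penalty as the budget is spent. Below is a minimal sketch of minimization EI and a cost-cooled variant; the formulation and the linear annealing schedule are our illustrative assumptions, not the paper's code.

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI for minimization under a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0:
        # degenerate posterior: improvement is deterministic
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (f_best - mu) * cdf + sigma * pdf

def ei_cool(mu, sigma, f_best, cost, spent, budget):
    """Cost-cooled EI: EI / cost**alpha, with alpha annealed from 1 to 0
    as the evaluation budget is consumed (illustrative schedule)."""
    alpha = max(0.0, (budget - spent) / budget)
    return expected_improvement(mu, sigma, f_best) / (cost ** alpha)
```

Early in the run (`alpha` near 1) cheap points are strongly preferred; near budget exhaustion the acquisition reduces to plain EI.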

### 4.2 Overall Results

Training-domain efficiency. As shown in Table [1](https://arxiv.org/html/2605.08756#S3.T1), the RL-trained AHD Agent achieves strong search efficiency on all four training domains. In the short-budget setting, AHD Agent uses only about 11–15 evaluator calls, yet reaches competitive performance and often matches or surpasses AHD baselines that use 100 evaluations. When given a larger budget, AHD Agent w/SR achieves the best gap on all four training domains, showing that RL substantially improves the search efficiency of the LLM-agent paradigm over fixed-workflow AHD baselines. In addition, AHD Agent with DeepSeek-V4-Flash achieves competitive or superior results on TSP-ACO and CVRP-Constructive with a small evaluator budget, suggesting that AHD Agent remains highly efficient when paired with stronger general-purpose LLMs. Statistical significance results are reported in Appendix [D.8](https://arxiv.org/html/2605.08756#A4.SS8).

Generalization across combinatorial domains. As shown in Table [2](https://arxiv.org/html/2605.08756#S4.T2), the RL-trained AHD Agent generalizes well to unseen combinatorial domains that are excluded from RL training. With only a small evaluator budget, AHD Agent achieves strong validation gaps across three domains; for example, on OP-ACO with N=200, it obtains a gap of 12.837%, ranking first among all non-SR methods while using only 13.6 evaluator calls on average. OP-ACO and OVRP-Constructive are routing domains related to the training tasks but differ in objective or route structure, and AHD Agent maintains strong performance on both. More importantly, MKP-ACO is a knapsack-style domain absent from the training set, yet the same checkpoint remains competitive. These results suggest that the RL-trained agent, together with the AHD Agent interaction protocol, learns how to design and revise heuristics from feedback rather than simply memorizing domain-specific heuristic patterns.

Transfer to continuous optimization. As shown in Table [3](https://arxiv.org/html/2605.08756#S4.T3), the RL-trained AHD Agent also transfers effectively to CAF, a continuous optimization task outside the combinatorial training domains. Although trained only on combinatorial heuristic-design tasks, AHD Agent achieves an average gap of 1.79% over 12 Bayesian-optimization test functions, using only 14.2 evaluator calls on average. This outperforms both hand-crafted acquisition baselines, such as EI-cool (2.92%), and fixed-workflow AHD baselines, including EoH (4.12%), MCTS-AHD (3.72%), and ReEvo (9.04%) with DeepSeek-V4-Flash. These results show that the learned agentic design policy can transfer across evaluation protocols, not only across combinatorial problem families.

![Refer to caption](https://arxiv.org/html/2605.08756v1/x4.png)\(a\)
![Refer to caption](https://arxiv.org/html/2605.08756v1/x5.png)\(b\)

Figure 4: Scaling effect of AHD Agent. (a) Inference-time scaling comparison. The SR strategy outperforms the PS strategy on two tasks. (b) Model scaling favors AHD Agent. Performance of AHD Agent increases as the model size increases from 30B to 397B parameters.
### 4.3 Further Experiments

#### 4.3.1 Scaling Effect of AHD Agent Framework

Inference-time scaling. Figure [4(a)](https://arxiv.org/html/2605.08756#S4.F4.sf1) compares two ways of scaling the evaluator budget for AHD Agent. AHD Agent with Sequential Refinement (w/ SR) extends a single agent session, allowing later turns to build on accumulated feedback, while AHD Agent with Parallel Sampling (w/ PS) runs multiple short trajectories independently and selects the best candidate. The convergence curves show that continuing one coherent trajectory is generally more effective than aggregating independent short runs. Meanwhile, both AHD Agent w/ SR and AHD Agent w/ PS improve as the evaluator budget increases, showing that AHD Agent can further benefit from additional evaluations and has the potential to scale with larger search budgets. AHD Agent w/ PS remains useful when wall-clock time is constrained, since independent trajectories can be executed in parallel.
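The two budget-scaling strategies can be sketched abstractly. Here `run_session` is a hypothetical callable (our naming, not the paper's API) that spends a given evaluator budget in one agent session and returns the best validation gap found:

```python
def parallel_sampling(run_session, budget, k):
    """w/ PS: split the evaluator budget across k independent short
    sessions and keep the lowest (best) gap found among them."""
    per_run = budget // k
    return min(run_session(per_run) for _ in range(k))

def sequential_refinement(run_session, budget):
    """w/ SR: spend the whole budget in one coherent session, so later
    turns can condition on all accumulated feedback."""
    return run_session(budget)
```

With a toy session whose gap improves with budget (e.g. `gap = 100 / budget`), SR's single long run beats PS's best-of-k short runs, mirroring the trend in Figure 4(a).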

Model scaling. Figure [4(b)](https://arxiv.org/html/2605.08756#S4.F4.sf2) shows that the agentic multi-turn paradigm benefits more consistently from stronger LLM backbones than fixed-workflow AHD methods. On OP-ACO, the AHD Agent gap decreases from 29.14% with the 30B Qwen model to 14.35% with the 397B model. In contrast, EOH and ReEvo show non-monotonic or domain-dependent trends under the same backbone sweep. This suggests that stronger reasoning capability is converted into performance more effectively when the LLM controls the design process, including feedback interpretation, tool use, and iterative revision, rather than only generating candidates inside a fixed workflow.

### 4.4 Training Curves

![Refer to caption](https://arxiv.org/html/2605.08756v1/x6.png) Figure 5: Training curves during design. AHD Agent converges faster and achieves better performance under larger evaluation budgets.
![Refer to caption](https://arxiv.org/html/2605.08756v1/x7.png) Figure 6: RL training dynamics. Reward and the number of turns increase over 500 steps of RL training.

Training curves during design. We further visualize the training curves during the design process. After RL training, we use a separate design set, distinct from the training set used in RL training, to obtain the heuristics for testing. For plotting, we align the five independent design runs by evaluator-call index and use the longest run as the curve horizon; runs that stop earlier are padded by carrying forward their final best-so-far gap. [Figure 5](https://arxiv.org/html/2605.08756#S4.F5) shows that our method converges faster than the baselines while requiring significantly fewer evaluations. Moreover, when scaling the evaluation budget with SR, our method converges to substantially better performance.
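The alignment and padding described above can be sketched as follows; averaging the padded curves across runs is our assumption about the plotted statistic:

```python
def align_runs(runs):
    """Align best-so-far gap curves from design runs of different lengths.

    The horizon is the longest run (by evaluator-call index); shorter runs
    are padded by carrying forward their final best-so-far gap. Returns the
    mean curve across runs at each evaluator-call index.
    """
    horizon = max(len(r) for r in runs)
    padded = [r + [r[-1]] * (horizon - len(r)) for r in runs]
    return [sum(col) / len(col) for col in zip(*padded)]
```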

RL training dynamics. Figure [6](https://arxiv.org/html/2605.08756#S4.F6) shows the dynamics of reward and number of turns during AHD Agent training. The reward increases steadily over the 500-step training process, indicating that the policy progressively learns to generate higher-quality heuristic revisions. Meanwhile, the number of turns first increases, suggesting that the agent learns to move beyond one-shot heuristic generation and instead makes more state-dependent decisions about when to generate, evaluate, or call tools based on intermediate feedback. After this exploration phase, the turn count fluctuates and eventually converges to around 30 steps, showing that the learned policy does not simply extend the dialogue indefinitely. More details are provided in Appendix [D.1](https://arxiv.org/html/2605.08756#A4.SS1).

#### 4.4.1 Ablation Study

![Refer to caption](https://arxiv.org/html/2605.08756v1/x8.png) Figure 7: Performance changes as the number of training domains increases. Cross-domain RL training increases general AHD capability.

Figure [7](https://arxiv.org/html/2605.08756#S4.F7) shows how performance on an in-domain task and a held-out task changes as the number of RL training domains increases. The training mixture is gradually expanded by adding TSP-ACO, CVRP-ACO, and TSP/CVRP-Constructive domains. Note that all runs are trained for the same 500 steps. On OP-ACO, which is never included in the RL training mixture, the gap decreases steadily from 24.5% to 8.9% as more training domains are added. This consistent held-out improvement indicates that cross-domain RL training learns transferable heuristic-design behavior rather than fitting domain-specific heuristic templates. Meanwhile, on in-domain TSP-ACO, RL reduces the gap from 27.3% to 9.0%, and adding more domains further improves it to 7.7%. This suggests that broader training mixtures improve both held-out generalization and in-domain AHD performance.

Table 4: Ablation study of tool access and RL. Results evaluated on CVRP-Constructive and CVRP-ACO. Entries are average Mean/Best Gap (%) over N=50, 100, 200.

Effectiveness of RL training and tool access. Table [4](https://arxiv.org/html/2605.08756#S4.T4) shows that RL training is the main source of improvement, while diagnostic tools provide additional gains. Removing RL while keeping the same multi-turn framework increases the mean gap from 21.20% to 37.59% on CVRP-Constructive, and from 30.00% to 125.81% on CVRP-ACO. Replacing the full diagnostic tool library with the evaluator-only setting also degrades performance, increasing the mean gap from 21.20% to 22.90% on CVRP-Constructive and from 30.00% to 33.21% on CVRP-ACO. These results validate the effectiveness of our design.

## 5 Conclusion

We introduced AHD Agent, an agentic reinforcement\-learning framework for automatic heuristic design\. By letting the LLM Agent control a multi\-turn design process and by specializing a compact base model with GRPO on cross\-domain heuristic\-design tasks, AHD Agent improves search efficiency while reducing reliance on large general\-purpose LLMs\. Experiments across combinatorial and continuous optimization benchmarks show that the learned policy can effectively revise heuristics with limited evaluator feedback, generalize to unseen domains, and transfer across evaluation protocols\. These results suggest that agentic, RL\-trained LLMs are a promising alternative to fixed\-workflow LLM\-AHD methods\.

## 6 Limitations and Broader Impacts

Limitations. In this work, we train a relatively small 4B model using data from four domains. Scaling to larger models and broader training data may further improve the capability of the model and help characterize the limits of our framework.

Broader Impacts. This work advances LLM-based automatic heuristic design through an agentic framework that improves design efficiency and generalization. We do not anticipate societal impacts beyond those generally associated with automated algorithm design and LLM-based systems.

## References

- \[1\]C\. Baronio, P\. Marsella, B\. Pan, S\. Guo, and S\. Alberti\(2026\)Kevin: multi\-turn RL for generating CUDA kernels\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[2\]E\. K\. Burke, M\. Hyde, G\. Kendall, G\. Ochoa, E\. Özcan, and J\. R\. Woodward\(2010\)A classification of hyper\-heuristic approaches\.InHandbook of metaheuristics,pp\. 449–468\.Cited by:[§1](https://arxiv.org/html/2605.08756#S1.p1.1)\.
- \[3\]J\. Cai, P\. Wang, S\. Sun, and H\. Dong\(2022\)A dynamic space reduction ant colony optimization for capacitated vehicle routing problem\.Soft Computing26\(17\),pp\. 8745–8756\.Cited by:[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p2.1)\.
- \[4\]M\. Chen and G\. Li\(2025\)DaSAThco: data\-aware sat heuristics combinations optimization via large language models\.arXiv preprint arXiv:2509\.12602\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p1.1)\.
- \[5\]P\. V\. T\. Dat, L\. Doan, and H\. T\. T\. Binh\(2025\)Hsevo: elevating automatic heuristic design with diversity\-driven harmony search and genetic algorithm using llms\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 26931–26938\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p1.1)\.
- \[6\]DeepSeek\-AI\(2026\)DeepSeek\-v4: towards highly efficient million\-token context intelligence\.Cited by:[Figure 2](https://arxiv.org/html/2605.08756#S1.F2),[Figure 2](https://arxiv.org/html/2605.08756#S1.F2.4.2.1),[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p2.1)\.
- \[7\]S\. Desale, A\. Rasool, S\. Andhale, and P\. Rane\(2015\)Heuristic and meta\-heuristic algorithms and their relevance to the real world: a survey\.Int\. J\. Comput\. Eng\. Res\. Trends351\(5\),pp\. 2349–7084\.Cited by:[§1](https://arxiv.org/html/2605.08756#S1.p1.1)\.
- \[8\]M\. Dorigo, M\. Birattari, and T\. Stutzle\(2006\)Ant colony optimization\.IEEE computational intelligence magazine1\(4\),pp\. 28–39\.Cited by:[§A\.2\.2](https://arxiv.org/html/2605.08756#A1.SS2.SSS2.p1.3),[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p1.1)\.
- \[9\]Z\. Feng, Q\. Chen, N\. Lu, Y\. Li, S\. Cheng, S\. Peng, D\. Tang, S\. Liu, and Z\. Zhang\(2025\)Is PRM necessary? problem\-solving RL implicitly induces PRM capability in LLMs\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[10\]S\. Fidanova\(2020\)Hybrid ant colony optimization algorithm for multiple knapsack problem\.In2020 5th IEEE International Conference on Recent Advances and Innovations in Engineering \(ICRAIE\),pp\. 1–5\.Cited by:[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p2.1)\.
- \[11\]Y\. Gou, K\. Chen, Z\. Liu, L\. HONG, X\. Jin, Z\. Li, J\. Kwok, and Y\. Zhang\(2026\)Reasoning\-aligned perception decoupling for scalable multi\-modal reasoning\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[12\]K\. Helsgaun\(2017\)An extension of the lin\-kernighan\-helsgaun tsp solver for constrained traveling salesman and vehicle routing problems\.Roskilde: Roskilde University12,pp\. 966–980\.Cited by:[§D\.2](https://arxiv.org/html/2605.08756#A4.SS2.SSS0.Px1.p2.1)\.
- \[13\]A\. Hottung, F\. Berto, C\. Hua, N\. G\. Zepeda, D\. Wetzel, M\. Römer, H\. Ye, D\. Zago, M\. Poli, S\. Massaroli,et al\.\(2025\)VRPAgent: llm\-driven discovery of heuristic operators for vehicle routing problems\.arXiv preprint arXiv:2510\.07073\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p1.1)\.
- \[14\]S\. Hu, M\. Ouyang, D\. Gao, and M\. Z\. Shou\(2024\)The dawn of GUI agent: a preliminary case study with claude 3\.5 computer use\.arXiv preprint arXiv:2411\.10323\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[15\]Z\. Huang, W\. Wu, K\. Wu, W\. Lee, and J\. Wang\(2026\)CALM: co\-evolution of algorithms and language model for automatic heuristic design\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=x6bG2Hoqdf)Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p2.1),[§3\.2\.2](https://arxiv.org/html/2605.08756#S3.SS2.SSS2.p1.1),[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p2.1)\.
- \[16\]A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford,et al\.\(2024\)Gpt\-4o system card\.arXiv preprint arXiv:2410\.21276\.Cited by:[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p2.1)\.
- \[17\]C\. Jiang, X\. Shu, H\. Qian, X\. Lu, J\. Zhou, A\. Zhou, and Y\. Yu\(2025\)LLMOPT: learning to define and solve general optimization problems from scratch\.InProceedings of the Thirteenth International Conference on Learning Representations \(ICLR\),Singapore, Singapore\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[18\]H\. Jiang and K\. Tang\(2026\)Why agents compromise safety under pressure\.arXiv preprint arXiv:2603\.14975\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[19\]B\. Jin, H\. Zeng, Z\. Yue, J\. Yoon, S\. Arik, D\. Wang, H\. Zamani, and J\. Han\(2025\)Search\-r1: training llms to reason and leverage search engines with reinforcement learning\.arXiv preprint arXiv:2503\.09516\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[20\]G\. Kobeaga, J\. Rojas\-Delgado, M\. Merino, and J\. A\. Lozano\(2024\)A revisited branch\-and\-cut algorithm for large\-scale orienteering problems\.European Journal of Operational Research313\(1\),pp\. 44–68\.Cited by:[§D\.2](https://arxiv.org/html/2605.08756#A4.SS2.SSS0.Px1.p2.1)\.
- \[21\]W\. Kool, H\. van Hoof, and M\. Welling\(2019\)Buy 4 reinforce samples, get a baseline for free\!\.InICLR 2019 Workshop,Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[22\]W\. B\. Langdon and R\. Poli\(2002\)Foundations of genetic programming\.Vol\.90,Springer\.Cited by:[§1](https://arxiv.org/html/2605.08756#S1.p1.1)\.
- \[23\]E\. L\. Lawler\(1985\)The traveling salesman problem: a guided tour of combinatorial optimization\.Wiley\-Interscience Series in Discrete Mathematics\.Cited by:[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p1.1)\.
- \[24\]R\. Li, L\. Wang, H\. Sang, L\. Yao, and L\. Pan\(2025\)LLM\-assisted automatic memetic algorithm for lot\-streaming hybrid job shop scheduling with variable sublots\.IEEE Transactions on Evolutionary Computation\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p1.1)\.
- \[25\]F\. Liu, Y\. Liu, Q\. Zhang, T\. Xialiang, and M\. Yuan\(2026\)Eoh\-s: evolution of heuristic set using llms for automated heuristic design\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 37090–37098\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p1.1)\.
- \[26\]F\. Liu, X\. Tong, M\. Yuan, X\. Lin, F\. Luo, Z\. Wang, Z\. Lu, and Q\. Zhang\(2024\)Evolution of heuristics: towards efficient automatic algorithm design using large language model\.arXiv preprint arXiv:2401\.02051\.Cited by:[§1](https://arxiv.org/html/2605.08756#S1.p3.1),[§2](https://arxiv.org/html/2605.08756#S2.p1.1),[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p2.1)\.
- \[27\]F\. Liu, R\. Zhang, Z\. Xie, R\. Sun, K\. Li, Q\. Hu, P\. Guo, X\. Lin, X\. Tong, M\. Yuan,et al\.\(2024\)Llm4ad: a platform for algorithm design with large language model\.arXiv preprint arXiv:2412\.17287\.Cited by:[§D\.2](https://arxiv.org/html/2605.08756#A4.SS2.SSS0.Px1.p3.1),[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p1.1)\.
- \[28\]S\. Liu, C\. Chen, X\. Qu, K\. Tang, and Y\. Ong\(2024\)Large language models as evolutionary optimizers\.In2024 IEEE Congress on Evolutionary Computation \(CEC\),pp\. 1–8\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[29\]N\. Lu, S\. Liu, R\. He, Y\. Ong, Q\. Wang, and K\. Tang\(2024\)Large language models can be guided to evade ai\-generated text detection\.TMLR\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[30\]R\. Matai, S\. P\. Singh, and M\. L\. Mittal\(2010\)Traveling salesman problem: an overview of applications, formulations, and solution approaches\.Traveling salesman problem, theory and applications1\(1\),pp\. 1–25\.Cited by:[§1](https://arxiv.org/html/2605.08756#S1.p1.1)\.
- \[31\]Y\. Mei, Q\. Chen, A\. Lensen, B\. Xue, and M\. Zhang\(2022\)Explainable artificial intelligence by genetic programming: a survey\.IEEE Transactions on Evolutionary Computation27\(3\),pp\. 621–641\.Cited by:[§1](https://arxiv.org/html/2605.08756#S1.p1.1)\.
- \[32\]J\. Močkus\(1974\)On bayesian methods for seeking the extremum\.InIFIP Technical Conference on Optimization Techniques,pp\. 400–404\.Cited by:[§A\.2\.3](https://arxiv.org/html/2605.08756#A1.SS2.SSS3.p1.5)\.
- \[33\]A\. Novikov, N\. Vũ, M\. Eisenberger, E\. Dupont, P\. Huang, A\. Z\. Wagner, S\. Shirobokov, B\. Kozlovskii, F\. J\. Ruiz, A\. Mehrabian,et al\.\(2025\)Alphaevolve: a coding agent for scientific and algorithmic discovery\.arXiv preprint arXiv:2506\.13131\.Cited by:[§1](https://arxiv.org/html/2605.08756#S1.p2.1)\.
- \[34\]OR\-toolsGoogle\.External Links:[Link](https://developers.google.com/optimization/)Cited by:[§D\.2](https://arxiv.org/html/2605.08756#A4.SS2.SSS0.Px1.p2.1)\.
- \[35\]C\. Rajendran\(1993\)Heuristic algorithm for scheduling in a flowshop to minimize total flowtime\.International Journal of Production Economics29\(1\),pp\. 65–73\.Cited by:[§1](https://arxiv.org/html/2605.08756#S1.p1.1)\.
- \[36\]B\. Romera\-Paredes, M\. Barekatain, A\. Novikov, M\. Balog, M\. P\. Kumar, E\. Dupont, F\. J\. Ruiz, J\. S\. Ellenberg, P\. Wang, O\. Fawzi,et al\.\(2024\)Mathematical discoveries from program search with large language models\.Nature625\(7995\),pp\. 468–475\.Cited by:[§1](https://arxiv.org/html/2605.08756#S1.p2.1),[§2](https://arxiv.org/html/2605.08756#S2.p1.1)\.
- \[37\]D\. J\. Rosenkrantz, R\. E\. Stearns, and P\. M\. Lewis\(1977\)An analysis of several heuristics for the traveling salesman problem\.SIAM journal on computing6\(3\),pp\. 563–581\.Cited by:[§A\.2\.1](https://arxiv.org/html/2605.08756#A1.SS2.SSS1.p3.1),[§D\.2](https://arxiv.org/html/2605.08756#A4.SS2.SSS0.Px1.p3.1),[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p2.1)\.
- \[38\]T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom\(2023\)Toolformer: language models can teach themselves to use tools\.Advances in Neural Information Processing Systems36,pp\. 68539–68551\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[39\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[40\]B\. Shahriari, K\. Swersky, Z\. Wang, R\. P\. Adams, and N\. De Freitas\(2015\)Taking the human out of the loop: a review of bayesian optimization\.Proceedings of the IEEE104\(1\),pp\. 148–175\.Cited by:[§A\.2\.3](https://arxiv.org/html/2605.08756#A1.SS2.SSS3.p1.2)\.
- \[41\]Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§C\.1](https://arxiv.org/html/2605.08756#A3.SS1.p1.8),[Table 8](https://arxiv.org/html/2605.08756#A3.T8.4.4.7.3.2)\.
- \[42\]Y\. Shi, J\. Zhou, W\. Song, J\. Bi, Y\. Wu, Z\. Cao, and J\. Zhang\(2026\)Generalizable heuristic generation through LLMs with meta\-optimization\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=tIQZ7pVN6S)Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p1.1)\.
- \[43\]N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao\(2024\)Reflexion: language agents with verbal reinforcement learning\.Advances in Neural Information Processing Systems36\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[44\]D\. Silver, J\. Schrittwieser, K\. Simonyan, I\. Antonoglou, A\. Huang, A\. Guez, T\. Hubert, L\. Baker, M\. Lai, A\. Bolton,et al\.\(2017\)Mastering the game of go without human knowledge\.Nature550\(7676\),pp\. 354–359\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[45\]R\. Skinderowicz\(2022\)Improving ant colony optimization efficiency for solving large tsp instances\.Applied Soft Computing120,pp\. 108653\.Cited by:[§A\.2\.2](https://arxiv.org/html/2605.08756#A1.SS2.SSS2.p3.3),[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p2.1)\.
- \[46\]J\. Snoek, H\. Larochelle, and R\. P\. Adams\(2012\)Practical bayesian optimization of machine learning algorithms\.Advances in neural information processing systems25\.Cited by:[§A\.2\.3](https://arxiv.org/html/2605.08756#A1.SS2.SSS3.p3.4)\.
- \[47\]S\. Sohrabi, K\. Ziarati, and M\. Keshtkaran\(2021\)ACS\-ophs: ant colony system for the orienteering problem with hotel selection\.EURO Journal on Transportation and Logistics10,pp\. 100036\.Cited by:[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p2.1)\.
- \[48\]A\. Surina, A\. Mansouri, L\. Quaedvlieg, A\. Seddas, M\. Viazovska, E\. Abbe, and C\. Gulcehre\(2025\)Algorithm discovery with llms: evolutionary search meets reinforcement learning\.arXiv preprint arXiv:2504\.05108\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p2.1)\.
- \[49\]W\. Tan, W\. Zhang, X\. Xu, H\. Xia, G\. Ding, B\. Li, B\. Zhou, J\. Yue, J\. Jiang, Y\. Li,et al\.\(2024\)Cradle: empowering foundation agents towards general computer control\.InNeurIPS 2024 Workshop on Open\-World Agents,Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[50\]Q\. Team\(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§C\.3](https://arxiv.org/html/2605.08756#A3.SS3.p1.1),[Table 8](https://arxiv.org/html/2605.08756#A3.T8.4.4.6.2.2),[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p3.1)\.
- \[51\]G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar\(2024\)Voyager: an open\-ended embodied agent with large language models\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=ehfRiF0R3a)Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[52\]J\. Wang, H\. Xu, H\. Jia, X\. Zhang, M\. Yan, W\. Shen, J\. Zhang, F\. Huang, and J\. Sang\(2024\)Mobile\-Agent\-v2: mobile device operation assistant with effective navigation via multi\-agent collaboration\.Advances in Neural Information Processing Systems37,pp\. 2686–2710\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[53\]Z\. Wang, K\. Wang, Q\. Wang, P\. Zhang, L\. Li, Z\. Yang, K\. Yu, M\. N\. Nguyen, L\. Liu, E\. Gottlieb,et al\.\(2025\)RAGEN: understanding self\-evolution in LLM agents via multi\-turn reinforcement learning\.arXiv preprint arXiv:2504\.20073\.Cited by:[§1](https://arxiv.org/html/2605.08756#S1.p4.1),[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[54\]N\. A\. Wouda, L\. Lan, and W\. Kool\(2024\)PyVRP: a high\-performance vrp solver package\.INFORMS Journal on Computing36\(4\),pp\. 943–955\.Cited by:[§D\.2](https://arxiv.org/html/2605.08756#A4.SS2.SSS0.Px1.p2.1)\.
- \[55\]J\. Wu, N\. Lu, S\. Liu, K\. Wang, Y\. Yang, L\. Qing, and K\. Tang\(2026\)Train at moving edge: online\-verified prompt selection for efficient rl training of large reasoning model\.arXiv preprint arXiv:2603\.25184\.Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[56\]T\. Xie, D\. Zhang, J\. Chen, X\. Li, S\. Zhao, R\. Cao, T\. J\. Hua, Z\. Cheng, D\. Shin, F\. Lei, Y\. Liu, Y\. Xu, S\. Zhou, S\. Savarese, C\. Xiong, V\. Zhong, and T\. Yu\(2024\)OSWorld: benchmarking multimodal agents for open\-ended tasks in real computer environments\.InThe Thirty\-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track,Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[57\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao\(2023\)ReAct: synergizing reasoning and acting in language models\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=WE_vluYUL-X)Cited by:[§2](https://arxiv.org/html/2605.08756#S2.p3.1)\.
- \[58\]Y\. Yao, F\. Liu, J\. Cheng, and Q\. Zhang\(2024\)Evolve cost\-aware acquisition functions using large language models\.InInternational Conference on Parallel Problem Solving from Nature,pp\. 374–390\.Cited by:[§A\.1\.8](https://arxiv.org/html/2605.08756#A1.SS1.SSS8.p1.1),[§A\.2\.3](https://arxiv.org/html/2605.08756#A1.SS2.SSS3.p3.4),[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.08756#S4.SS1.p2.1)\.
- [59] H. Ye, J. Wang, Z. Cao, F. Berto, C. Hua, H. Kim, J. Park, and G. Song (2024). ReEvo: large language models as hyper-heuristics with reflective evolution. Advances in Neural Information Processing Systems 37, pp. 43571–43608.
- [60] K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin (2024). CodeAgent: enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13643–13658.
- [61] S. Zhang, S. Liu, N. Lu, J. Wu, J. Liu, Y. Ong, and K. Tang (2026). LLM-driven instance-specific heuristic generation and selection. arXiv preprint arXiv:2506.00490.
- [62] Z. Zhang, S. Li, C. Li, F. Liu, M. Chen, K. Li, T. Zhong, B. An, and P. Liu (2025). DHEvo: data-algorithm based heuristic evolution for generalizable MILP solving. arXiv preprint arXiv:2507.15615.
- [63] Z. Zhang and A. Zhang (2024). You only look at screens: multimodal chain-of-action agents. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 3132–3149.
- [64] Z. Zheng, Z. Xie, Z. Wang, and B. Hooi (2025). Monte Carlo tree search for comprehensive exploration in LLM-based automatic heuristic design. arXiv preprint arXiv:2501.08603.
- [65] R. Zhu, C. Zhang, and Z. Cao (2026). Refining hybrid genetic search for CVRP via reinforcement learning-finetuned LLM. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=aITKXFeivk)
- [66] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. (2023). RT-2: vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pp. 2165–2183.

## Appendix A. Details of Problem Domains

### A.1 Problem Domain Definitions

We evaluate on eight problem domains spanning combinatorial and continuous optimization\. Each subsection below states the mathematical formulation, the training/validation instance sizes, and the function interface that the LLM is asked to design\.

#### A.1.1 TSP-Constructive

The Travelling Salesman Problem (TSP) asks for a minimum-cost Hamiltonian cycle over $n$ cities. Given a distance matrix $D \in \mathbb{R}^{n \times n}$, the objective is

$$\min_{\sigma \in S_n} \sum_{i=1}^{n} D_{\sigma(i),\sigma(i+1)}, \quad \text{where } \sigma(n+1) \equiv \sigma(1).$$

In the constructive setting, the LLM designs a greedy selection rule that builds a tour one city at a time.

Function interface.

```python
def select_next_node(current_node, destination_node, unvisited_nodes, distance_matrix) -> int:
    """Return the index of the next city to visit."""
    return next_node
```

Instance sizes. Design set: $N=50$; validation: $N = 50, 100, 200$.
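As a concrete reference point, the nearest-neighbor baseline discussed in §A.2.1 can be written against this interface. This is a minimal illustrative sketch, not a heuristic produced by the agent:

```python
import numpy as np

def select_next_node(current_node, destination_node, unvisited_nodes, distance_matrix):
    """Nearest-neighbor baseline: greedily visit the closest unvisited city."""
    unvisited = list(unvisited_nodes)
    dists = [distance_matrix[current_node][j] for j in unvisited]
    return unvisited[int(np.argmin(dists))]
```

An LLM-designed heuristic replaces this scoring logic with richer features (look-ahead insertion cost, angular ordering, etc.) while keeping the same signature.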

#### A.1.2 CVRP-Constructive

The Capacitated Vehicle Routing Problem (CVRP) extends TSP with a depot node and per-customer demands $d_i$. A fleet of homogeneous vehicles, each with capacity $Q$, must serve all customers while minimizing total travel distance:

$$\min \sum_{k} \sum_{(i,j) \in R_k} D_{i,j}, \quad \text{s.t.} \quad \sum_{i \in R_k} d_i \leq Q \;\; \forall k,$$

where $R_k$ denotes the route of vehicle $k$. In the constructive setting, the LLM designs the next-customer selection rule, which must also decide when to return to the depot and start a new route.

Function interface.

```python
def select_next_node(current_node, depot, feasible_unvisited, capacity_remaining,
                     demands, distance_matrix) -> int:
    """Return the next customer index, or 0 (depot) to start a new route."""
    return next_node
```

Instance sizes. Design set: $N=50$; validation: $N = 50, 100, 200$.
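A hand-written capacity-aware baseline for this interface might look as follows. This is an illustrative sketch only; the depot is node 0, per the docstring above:

```python
import numpy as np

def select_next_node(current_node, depot, feasible_unvisited, capacity_remaining,
                     demands, distance_matrix):
    """Greedy capacity-aware baseline: serve the closest feasible customer,
    or return to the depot when no customer fits the remaining capacity."""
    candidates = [j for j in feasible_unvisited if demands[j] <= capacity_remaining]
    if not candidates:
        return depot  # start a new route
    dists = [distance_matrix[current_node][j] for j in candidates]
    return candidates[int(np.argmin(dists))]
```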

#### A.1.3 TSP-ACO

In the ACO setting for TSP, a colony of ants constructs tours in parallel by sampling transitions with probabilities proportional to $\tau_{ij}^{\alpha} \cdot \eta_{ij}^{\beta}$, where $\tau$ is the pheromone matrix and $\eta$ is the heuristic desirability matrix. The LLM designs the function that computes $\eta$ from the distance matrix.

Function interface.

```python
def heuristic(distance_matrix: np.ndarray) -> np.ndarray:
    """Return an (n, n) heuristic desirability matrix."""
    return heuristic_matrix
```

Instance sizes. Design set: $N=50$; validation: $N = 50, 100, 200$.
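The classical inverse-distance choice $\eta_{ij} = 1/D_{ij}$ described in §A.2.2 fits this interface directly. A sketch (the epsilon guard against division by zero is an implementation detail we add):

```python
import numpy as np

def heuristic(distance_matrix: np.ndarray) -> np.ndarray:
    """Classical inverse-distance desirability: eta_ij = 1 / D_ij."""
    eps = 1e-10
    eta = 1.0 / (distance_matrix + eps)
    np.fill_diagonal(eta, 0.0)  # self-transitions are never desirable
    return eta
```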

#### A.1.4 CVRP-ACO

The ACO formulation of CVRP follows the same pheromone\-driven construction as TSP\-ACO but incorporates capacity constraints\. The LLM designs the heuristic matrix that guides ant transitions, taking into account both distances and remaining vehicle capacity\.

Function interface.

```python
def heuristic(distance_matrix: np.ndarray, coordinates: np.ndarray,
              demands: np.ndarray, capacity: float) -> np.ndarray:
    """Return an (n, n) heuristic desirability matrix for capacitated routing."""
    return heuristic_matrix
```

Instance sizes. Design set: $N=50$; validation: $N = 50, 100, 200$.

#### A.1.5 OP-ACO

The Orienteering Problem (OP) maximizes the total prize collected by visiting a subset of nodes, subject to a maximum route length $L$:

$$\max_{\sigma} \sum_{i \in \sigma} p_i, \quad \text{s.t.} \quad \sum_{i} D_{\sigma(i),\sigma(i+1)} \leq L.$$

The ACO framework is used with the LLM designing the heuristic matrix that balances prize attractiveness against distance cost.

Function interface.

```python
def heuristic(prize: np.ndarray, distance: np.ndarray, maxlen: float) -> np.ndarray:
    """Return an (n, n) heuristic desirability matrix for prize collection."""
    return heuristic_matrix
```

Instance sizes. RL training: not used (held-out domain). Design set: $N=50$; validation: $N = 50, 100, 200$.
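One plausible hand-written baseline for this interface scores transitions by prize per unit distance. This is an illustrative sketch, not the paper's heuristic; ignoring `maxlen` is a simplification, and a stronger heuristic could penalize moves that exhaust the length budget:

```python
import numpy as np

def heuristic(prize: np.ndarray, distance: np.ndarray, maxlen: float) -> np.ndarray:
    """Prize-over-distance desirability: eta_ij = p_j / D_ij."""
    eps = 1e-10
    eta = prize[None, :] / (distance + eps)
    np.fill_diagonal(eta, 0.0)
    return eta
```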

#### A.1.6 OVRP-Constructive

The Open Vehicle Routing Problem \(OVRP\) is a variant of CVRP where vehicles are not required to return to the depot after serving their last customer\. The LLM designs the constructive selection rule under this modified termination condition\.

Function interface. Same as CVRP-Constructive.

Instance sizes. RL training: not used (held-out domain). Design set: $N=50$; validation: $N = 50, 100, 200$.

#### A.1.7 MKP-ACO

The Multidimensional Knapsack Problem (MKP) maximizes the total value of selected items subject to multiple resource constraints:

$$\max_{x \in \{0,1\}^n} \sum_{j=1}^{n} v_j x_j, \quad \text{s.t.} \quad \sum_{j=1}^{n} w_{ij} x_j \leq C_i, \;\; i = 1, \ldots, m.$$

The ACO framework constructs solutions by sequentially adding items. The LLM designs the heuristic desirability function that guides item selection.

Function interface.

```python
def heuristic(prize: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Return an (n,) heuristic desirability vector for item selection."""
    return heuristic_vector
```

Instance sizes. RL training: not used (held-out domain). Design set: $N=100$; validation: $N = 100, 200, 300$.
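A standard value-density baseline fits this interface (illustrative sketch only): each item's value divided by its total resource consumption across the constraints.

```python
import numpy as np

def heuristic(prize: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Value-density desirability: item value over summed resource consumption."""
    eps = 1e-10
    return prize / (weight.sum(axis=1) + eps)
```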

#### A.1.8 CAF (Cost-Aware Bayesian Optimization)

In cost\-aware Bayesian optimization\[[58](https://arxiv.org/html/2605.08756#bib.bib52)\], the goal is to design a utility \(acquisition\) function that guides the selection of the next evaluation point\. Given a surrogate model’s predictions, the LLM designs a function that maps posterior statistics and cost information to a scalar utility value\.

Function interface.

```python
def utility(train_x, train_Y, best_x, best_Y, test_x, mean_test_y, std_test_y,
            cost_test_y, budget_used, budget_total) -> torch.Tensor:
    """Return a scalar acquisition value for each candidate point."""
    return utility_value
```

Test protocol. 12 benchmark functions (Ackley, Rastrigin, Griewank, Rosenbrock, Levy, ThreeHumpCamel, StyblinskiTang, Hartmann-3D, Powell, Shekel, Hartmann-6D, Cosine8), BO budget 30, 10 trials per instance. Ackley and Rastrigin are training instances; the others are validation instances.
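For intuition, the EI-per-cost family described in §A.2.3 can be sketched in NumPy. The real interface above operates on torch tensors and receives the full surrogate state; the function below is a simplified analogue whose argument names and budget-decay schedule are our assumptions:

```python
import math
import numpy as np

def cost_aware_ei(mean, std, cost, best_y, budget_used, budget_total):
    """Simplified EI-per-cost acquisition under a minimization convention.
    alpha decays from 1 to 0 as the budget is spent, so predicted cost
    matters less near the end of the run (an assumed schedule)."""
    std = np.maximum(std, 1e-9)
    z = (best_y - mean) / std
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2)))
    ei = std * (z * cdf + pdf)  # closed-form expected improvement
    alpha = max(0.0, 1.0 - budget_used / budget_total)
    return ei / np.maximum(cost, 1e-9) ** alpha
```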

### A.2 Algorithmic Framework Definitions

The LLM\-designed heuristic is embedded into one of three algorithmic frameworks\. Each framework defines the overall solving procedure; the LLM is responsible for designing a specific component within that procedure\. Below we describe each framework in detail, including the pseudocode and the exact role of the LLM\-designed component\.

#### A.2.1 Constructive Heuristic Framework

Step\-by\-step construction is an intuitive and general\-purpose framework for combinatorial optimization\. It builds a feasible solution incrementally from scratch: starting from an empty partial solution, the framework repeatedly selects the most promising candidate element and appends it, until the solution is complete\. This greedy construction paradigm is applicable to a wide range of CO problems, including routing, packing, scheduling, and assignment problems\.

The core of the framework is a *selection function* $f$ that assigns a priority score to each candidate element given the current partial solution state. At every construction step, the element with the highest priority is selected and added to the solution. The framework itself handles feasibility enforcement (e.g., capacity constraints in vehicle routing), termination conditions, and final objective evaluation. This separation of concerns allows the LLM to focus entirely on designing the scoring logic, while the framework guarantees that every generated solution is complete and feasible.

For TSP, the classical manually designed baseline is the nearest\-neighbor heuristic\[[37](https://arxiv.org/html/2605.08756#bib.bib68)\], which scores unvisited cities solely by their Euclidean distance from the current city\. This simple rule produces tours that are typically 20–25% longer than optimal\. For CVRP, a natural baseline additionally considers remaining vehicle capacity when scoring customers, returning to the depot when no feasible customer can be served\. The LLM\-designed function is expected to discover more sophisticated scoring strategies that go beyond local distance, potentially incorporating features such as angular sweep ordering, look\-ahead insertion cost, demand\-to\-distance ratios, distance to the depot, or density\-aware clustering of unvisited customers\.

We apply the constructive framework to three routing domains\. The detailed settings of the LLM\-designed selection function for each domain are as follows:

- • For TSP-Constructive, the LLM designs a function to select the next city to visit based on the current node, a destination node (node 0), the set of unvisited nodes, and the distance matrix. There is no capacity constraint and no depot return during construction. The function signature is $f(v, v_{\mathrm{dst}}, \mathcal{U}, D) \to v^{\star}$. All instances use 2D Euclidean coordinates sampled uniformly from $[0,1]^2$. The design set uses $N=50$ with 64 instances; validation uses $N \in \{50, 100, 200\}$ with 64 instances per size.
- • For CVRP-Constructive, the LLM designs a function to select the next customer or decide to return to the depot, taking into account the current node, the depot location, feasible unvisited customers, remaining vehicle capacity, customer demands, and the distance matrix. The vehicle capacity is $Q=40$, customer demands are integer-valued and sampled uniformly from $\{1, \ldots, 9\}$, and all coordinates (including the depot) are sampled uniformly from $[0,1]^2$. Instance sizes follow TSP-Constructive.
- • For OVRP-Constructive, the function interface and instance generation are the same as CVRP-Constructive ($Q=40$, demands in $\{1, \ldots, 9\}$, random depot). The only structural difference is that the final vehicle does not return to the depot, so the last return-edge cost is omitted from the objective. This domain is used only for held-out evaluation.

#### A.2.2 Ant Colony Optimization (ACO) Framework

Ant Colony Optimization (ACO) [[8]](https://arxiv.org/html/2605.08756#bib.bib61) is a population-based meta-heuristic inspired by the foraging behavior of natural ant colonies. In ACO, a colony of $m$ artificial ants independently constructs candidate solutions in parallel. Each ant builds a solution step by step, selecting the next element with a probability that depends on two factors: (1) a *pheromone matrix* $\tau$, which encodes the collective search experience accumulated from previous iterations, and (2) a *heuristic desirability matrix* $\eta$, which provides problem-specific prior knowledge about the attractiveness of each transition.

The transition probability from node $i$ to node $j$ for ant $k$ is given by:

$$p_{ij}^{(k)} = \frac{\tau_{ij}^{\alpha} \cdot \eta_{ij}^{\beta} \cdot \mathbb{1}[j \in \mathcal{N}_k]}{\sum_{l \in \mathcal{N}_k} \tau_{il}^{\alpha} \cdot \eta_{il}^{\beta}},$$

where $\mathcal{N}_k$ denotes the set of feasible candidates for ant $k$ (e.g., unvisited cities in TSP, or capacity-feasible customers in CVRP), and $\alpha, \beta$ control the relative influence of pheromone and heuristic information. The pheromone matrix is initialized uniformly ($\tau_{ij} = 1$ for all $i, j$) and updated after each iteration: all values decay by a factor $\rho$ (evaporation), and edges appearing in high-quality solutions receive additional pheromone deposits proportional to the inverse solution cost. This feedback mechanism enables the colony to progressively concentrate its search around promising solution structures while maintaining exploratory diversity.
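The transition rule above corresponds to a single construction step like the following (illustrative sketch; argument names are ours):

```python
import numpy as np

def sample_next_node(current, feasible, tau, eta, alpha=1.0, beta=1.0, rng=None):
    """One ACO construction step: sample the next node with probability
    proportional to tau^alpha * eta^beta over the feasible candidate set."""
    rng = rng or np.random.default_rng(0)
    feasible = np.asarray(list(feasible))
    scores = tau[current, feasible] ** alpha * eta[current, feasible] ** beta
    probs = scores / scores.sum()
    return int(rng.choice(feasible, p=probs))
```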

The key component designed by the LLM is the heuristic desirability matrix $\eta \in \mathbb{R}^{n \times n}$. Unlike the pheromone matrix, which evolves during optimization, the heuristic matrix is computed once from the instance data before the first iteration and remains fixed throughout. For example, the classical choice for TSP is $\eta_{ij} = 1/D_{ij}$ (inverse distance) [[45]](https://arxiv.org/html/2605.08756#bib.bib64), which biases ants toward nearby cities but ignores global structure. The LLM can design more effective heuristic functions that incorporate features such as $k$-nearest-neighbor density, distance rank normalization, spatial clustering, or non-linear transformations that reshape the exploration-exploitation balance of the colony.

We apply the ACO framework to four domains, following the setting in ReEvo [[59]](https://arxiv.org/html/2605.08756#bib.bib23). The LLM-designed heuristic function $h$ is the sole component the model controls; its input signature varies by domain, but the output is always a matrix (or vector for MKP) of non-negative desirability values. The detailed settings for each domain are as follows:

For TSP-ACO, the heuristic function signature is $h(D) \to \eta$. We use $m=30$ ants, $T=100$ iterations, decay $\rho=0.9$, and exponents $\alpha = \beta = 1$. Instances are 2D Euclidean with coordinates in $[0,1]^2$. The design set uses $N=50$ with 16 instances; validation uses $N \in \{50, 100, 200\}$ with 64 instances per size.

For CVRP-ACO, the heuristic function receives additional inputs: $h(D, X, d, Q) \to \eta$, where $X$ is the node coordinate matrix, $d$ is the demand vector, and $Q=50$ is the vehicle capacity. Customer demands are integer-valued and sampled uniformly from $\{1, \ldots, 9\}$, and the depot is fixed at $(0.5, 0.5)$. The ACO construction process incorporates capacity constraints: at each step, an ant can only transition to a customer whose demand does not exceed the remaining vehicle capacity, or return to the depot to start a new route. The remaining ACO parameters ($m=30$, $T=100$, $\rho=0.9$, $\alpha=\beta=1$) are the same as TSP-ACO. The design set uses $N=50$ with 10 instances; validation uses $N \in \{50, 100, 200\}$ with 64 instances per size.

For OP-ACO, the objective is to maximize the total prize collected within a maximum route length $L$. The heuristic function signature is $h(p, D, L) \to \eta$, where $p$ is the prize vector. Prizes are computed as

$$\tilde{p}_i = 1 + \lfloor 99\, d_{0i} / \max_j d_{0j} \rfloor, \qquad p_i = \tilde{p}_i / 100,$$

so the prizes are normalized to $[0,1]$, where $d_{0i}$ is the distance from the depot to node $i$. The maximum route length depends on problem size: $L=3.0$ for $N \leq 50$, $L=4.0$ for $N \leq 100$, $L=5.0$ for $N \leq 200$, and $L=6.0$ for $N \leq 300$. We use $m=20$ ants and $T=50$ iterations. This domain is used only for held-out evaluation; validation uses $N \in \{50, 100, 200\}$ with 64 instances per size.

For MKP-ACO, the Multidimensional Knapsack Problem has $m_{\mathrm{dim}} = 5$ resource constraints. The heuristic function signature is $h(v, W) \to \eta$, where $v \in \mathbb{R}^n$ is the value vector and $W \in \mathbb{R}^{n \times 5}$ is the weight matrix. The ACO construction sequentially selects items, masking those that would violate any capacity constraint. Capacities are normalized to 1 during instance generation (weights are rescaled accordingly), so the evaluator enforces feasibility internally. The LLM-designed heuristic receives only values and weights, not capacities. Pheromone deposits are scaled by the total prize sum: $\Delta\tau = (1/\sum_i v_i) \cdot \text{objective}$. We use $m=10$ ants and $T=50$ iterations. This domain is used only for held-out evaluation; validation uses $N \in \{100, 200, 300\}$ with 5 instances per size.

#### A.2.3 Bayesian Optimization (BO) Framework

Bayesian optimization (BO) [[40]](https://arxiv.org/html/2605.08756#bib.bib96) is a method for optimizing expensive black-box functions, where the objective is to find the global minimum of an unknown function $f(x)$ over a bounded search space $\mathcal{X}$:

$$x^{\star} = \arg\min_{x \in \mathcal{X}} f(x).$$

Two core components of BO are the *probabilistic surrogate model* and the *acquisition function* [[32]](https://arxiv.org/html/2605.08756#bib.bib97). In each iteration, the surrogate model (typically a Gaussian process) is fitted to all observed evaluations, providing a posterior distribution with mean $\hat{\mu}(x)$ and standard deviation $\hat{\sigma}(x)$ at any candidate point $x$. The acquisition function then uses these posterior statistics to determine the next evaluation point, balancing exploitation (evaluating where the predicted objective is good) and exploration (evaluating where uncertainty is high).

In cost-aware BO settings, different evaluation points incur different costs, and the total evaluation budget is measured in cumulative cost rather than number of evaluations. This requires the acquisition function to additionally consider the predicted cost $\hat{c}(x)$ of each candidate, managing the trade-off between expected improvement, uncertainty, and evaluation cost under a fixed budget. The goal of LLM-based AHD in this setting is to design a *cost-aware acquisition function* (CAF) that can adaptively balance these three objectives. The LLM designs a utility function $u$ that replaces the standard acquisition function; it receives the full surrogate state (posterior mean, standard deviation, cost prediction, training history, and remaining budget) and returns a scalar value that, when maximized, guides the search toward promising and cost-efficient regions.

Standard acquisition functions such as Expected Improvement (EI) [[58]](https://arxiv.org/html/2605.08756#bib.bib52) use a closed-form formula that depends only on $\hat{\mu}(x)$, $\hat{\sigma}(x)$, and the current best value. Cost-aware variants like EI-per-unit-cost [[46]](https://arxiv.org/html/2605.08756#bib.bib98) divide EI by the predicted cost raised to a decaying power: $a(x) = \mathrm{EI}(x) / \hat{c}(x)^{\alpha}$, where $\alpha$ decreases as the remaining budget shrinks. These formulas encode fixed trade-off strategies that cannot adapt to the specific geometry of the objective landscape. The LLM-designed utility function is expected to discover acquisition strategies that go beyond these templates, potentially conditioning on the full training history, adapting the exploration-exploitation ratio based on observed progress, or incorporating cost-awareness in non-linear ways.

For CAF, the cost-aware BO benchmark uses 12 standard test functions with varying input dimensions: Ackley ($d=2$), Rastrigin ($d=2$), Griewank ($d=2$), Rosenbrock ($d=2$), Levy ($d=2$), ThreeHumpCamel ($d=2$), StyblinskiTang ($d=2$), Hartmann-3D ($d=3$), Powell ($d=2$), Shekel ($d=4$), Hartmann-6D ($d=6$), and Cosine8 ($d=8$). The evaluation cost function is $c(x) = \exp(-\|x - x_{\mathrm{opt}}\|_2)$: the cost is highest near the optimum and decreases with distance from it. The total cost budget is $C_{\max} = 30$.

## Appendix B. Definition of AHD Agent

This section gives the implementation\-level definition of AHD Agent used in our experiments\. We separate the agent\-facing diagnostic tools from the evaluation components required to execute candidate heuristics\. Diagnostic tools provide train\-only information to the model during the design loop, whereas the evaluation components manage sessions, execute submitted code on train instances, and record the resulting feedback\.

### B.1 Diagnostic Tool Interfaces

All diagnostic tools are train-only: they may inspect the current training instances, the session-local attempt history, and evaluator feedback already produced during the design loop, but they never expose validation or test objectives. We focus on two diagnostic interfaces used by the agent: `InstanceAnalysis` and `ASTNoveltyAnalyzer`.

Table 5: Diagnostic tool interface summary. The diagnostic tools are read-only, operate only on the design set and session history, and do not consume evaluator calls.

#### B.1.1 Train-Instance Analysis

InstanceAnalysis. This tool summarizes the train-only dataset bound to the current task. It requires a `session_id` and supports three scopes. With `scope=summary`, it analyzes the full train set and returns aggregated feature statistics. With `scope=single_instance`, it additionally requires an `instance_id` and returns a detailed feature summary for that one train instance. With `scope=contrastive_pair`, it selects the most dissimilar pair of train instances by standardized feature distance and reports the largest feature gaps between them. The output is a text summary for the model and a structured metrics dictionary for logging.

Feature computation. For each instance in the train set, the tool computes five categories of structural features from the raw coordinates and domain-specific attributes:

- •Spacing uniformity: using a KD\-tree, the tool computes each node’s nearest\-neighbor distance\. It reports the coefficient of variation of these distances \(nn\_cv\) and the ratio of the observed mean nearest\-neighbor distance to the expected value under a uniform random distribution \(nn\_mean\_normalized\)\. These metrics indicate whether nodes are regularly spaced, clustered, or randomly scattered\.
- • Cluster structure: DBSCAN is applied with $\varepsilon$ set to the 10th percentile of nearest-neighbor distances. The tool reports the number of detected clusters and the silhouette score when at least two clusters are found. This helps the agent identify whether the instance has distinct spatial groups that may benefit from cluster-aware heuristic logic.
- •Density variation: a 2D histogram grid partitions the coordinate space, and the coefficient of variation of bin counts \(density\_cv\) quantifies how unevenly nodes are distributed across the spatial domain\.
- •Boundary shape: the convex hull of the node coordinates is computed\. The tool reports the fraction of nodes on the hull \(hull\_fraction\) and the ratio of hull area to bounding\-box area \(hull\_area\_ratio\), characterizing whether nodes fill the interior or concentrate near the boundary\.
- •Demand pattern\(CVRP/OVRP only\): for domains with per\-customer demands, the tool computes the demand coefficient of variation \(demand\_cv\) and Moran’s I spatial autocorrelation index \(demand\_morans\_i\) using 5\-nearest\-neighbor spatial weights\. Moran’s I reveals whether high\-demand customers are spatially clustered or randomly distributed, which can inform capacity\-aware routing strategies\.

Aggregation. When `scope=summary`, the tool computes the above features for every instance in the train set and aggregates them by reporting the mean, minimum, and maximum of each metric. The aggregated statistics are then translated into a natural-language summary returned to the agent. This summary allows the agent to form hypotheses about the dataset structure, such as whether instances are clustered, whether demands are spatially correlated, or whether boundary nodes are disproportionately represented, before spending evaluator calls.
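For illustration, the spacing-uniformity features can be computed as follows. The tool uses a KD-tree; this brute-force O(n²) version is only a sketch, and the 0.5/√n normalization constant (the expected nearest-neighbor distance for uniform points in the unit square) is our assumption about the exact normalization:

```python
import numpy as np

def spacing_features(coords: np.ndarray) -> dict:
    """Compute nn_cv and a normalized mean nearest-neighbor distance."""
    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.min(axis=1)            # each node's nearest-neighbor distance
    expected = 0.5 / np.sqrt(n)   # expectation under uniform random placement
    return {"nn_cv": float(nn.std() / nn.mean()),
            "nn_mean_normalized": float(nn.mean() / expected)}
```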

#### B.1.2 Code-Structure Analysis

ASTNoveltyAnalyzer. The AST novelty tool compares a candidate heuristic program against all previously evaluated attempts in the same session. Its inputs are `session_id`, the candidate `code`, and an optional `top_k` (default 3). The tool does not decide which heuristic is best; it only exposes structural similarity signals so that the agent can judge whether another evaluator call is worthwhile or whether further revision is needed first.

AST normalization. Comparing raw source code is unreliable because superficial differences, such as variable renaming, constant tuning, or comment changes, inflate dissimilarity without reflecting genuine algorithmic changes. To address this, the tool uses Python's `ast` module to parse both the candidate and each historical attempt into abstract syntax trees. It then applies a structure-preserving normalization pass that replaces all variable names with a placeholder `VAR`, all function arguments with `ARG`, and all literal constants with type-level placeholders (`NUM` for numbers, `STR` for strings, `BOOL` for booleans). The resulting normalized tree, referred to as the *shape tree*, captures only the control-flow and operator structure of the program, discarding surface-level variation.

Similarity computation. The tool computes three complementary similarity scores between the candidate and each historical attempt:

- • Raw similarity: the `SequenceMatcher` ratio between the full (unnormalized) AST dumps of the two programs. This component is sensitive to variable names and constant values, capturing cases where the candidate is a near-verbatim copy.
- • Shape similarity: the `SequenceMatcher` ratio between the normalized shape-tree dumps. This component ignores naming and constant differences and focuses on whether the control-flow structure (branches, loops, function calls) has changed.
- • Node-type similarity: the cosine similarity between the node-type frequency vectors of the two ASTs (e.g., counts of `If`, `For`, `Call`, `BinOp`, etc.). This component captures coarse structural composition even when the tree layout differs.

The final AST similarity is a weighted combination:

$$s = 0.25 \times s_{\mathrm{raw}} + 0.50 \times s_{\mathrm{shape}} + 0.25 \times s_{\mathrm{node}}.$$

The shape component receives the highest weight because structural changes to control flow and operator composition are the strongest indicators of genuinely different algorithmic logic.
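A minimal sketch of the shape-tree normalization and shape-similarity component, following the description above (normalizing function names to a `FUNC` placeholder is our addition; the tool's exact normalization may differ):

```python
import ast
from difflib import SequenceMatcher

class _ShapeNormalizer(ast.NodeTransformer):
    """Replace identifiers and literals with placeholders, keeping structure."""
    def visit_FunctionDef(self, node):
        node.name = "FUNC"  # function names are surface detail (our assumption)
        self.generic_visit(node)
        return node
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)
    def visit_arg(self, node):
        node.arg = "ARG"
        self.generic_visit(node)
        return node
    def visit_Constant(self, node):
        tag = "BOOL" if isinstance(node.value, bool) else \
              "NUM" if isinstance(node.value, (int, float, complex)) else "STR"
        return ast.copy_location(ast.Constant(value=tag), node)

def shape_similarity(code_a: str, code_b: str) -> float:
    """SequenceMatcher ratio between the normalized shape-tree dumps."""
    dump = lambda src: ast.dump(_ShapeNormalizer().visit(ast.parse(src)))
    return SequenceMatcher(None, dump(code_a), dump(code_b)).ratio()
```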

The tool returns the AST similarity to each of the top-$k$ most similar historical attempts, along with a candidate summary (node count, branches, loops, constants, function calls), a novelty score ($1 - s_{\max}$), and an evaluation hint that helps the agent decide whether to evaluate or revise further.

#### B.1.3 Diagnostic Tool-Use Constraints

The diagnostic tools are intentionally indirect\. They expose instance structure and code\-structure novelty, but they do not rank candidates by validation performance\. The agent must decide whether these signals justify a new train evaluation or a code revision\. This design prevents validation leakage while still allowing the policy to perform model\-controlled experimentation, diagnose failure modes, and revise heuristics over multiple turns\.

### B.2 Evaluation Implementation

Candidate execution is implemented by the evaluation harness rather than by the diagnostic tool set. Each design run starts by creating a persistent session workspace identified by a `session_id`. The workspace stores the problem configuration, the train dataset binding, all candidate programs submitted during the run, execution logs, aggregate objective values, and instance-level cost histories. This persistent state makes the interaction history available to later diagnostic calls without exposing validation data.

When the agent submits a candidate heuristic, the harness writes the complete Python implementation into an isolated session\-local evaluation directory\. The same evaluator used by the corresponding baseline solver then executes the candidate on the train instances\. The harness records whether the program is executable and feasible, its average train objective, repeat\-level statistics when repeated evaluation is enabled, a monotonically increasing attempt id, and the best\-so\-far train objective in the current session\. For successful evaluations, the harness also persists per\-instance costs by problem size\. These records support session auditing and debugging without exposing validation data\.

The evaluator budget is counted only when a candidate is actually executed on the train set\. Diagnostic calls are read\-only and do not consume evaluator calls\. During search, the agent observes only train\-time feedback: execution errors, feasibility status, train objective values, best\-so\-far status, diagnostic summaries, and the remaining evaluator budget\. Validation and test sets are used only after the final heuristic is selected, so the design loop does not receive validation or test feedback\.

### B.3 Prompt Templates

This section summarizes the prompt templates used by the multi-turn heuristic designer. Unlike fixed-search frameworks that maintain separate prompts for initialization, mutation, crossover, or tree-path reasoning actions, AHD Agent uses one system prompt and one user prompt. The task identity, function interface, seed code, and observations are injected through placeholders such as `{problem.description}`, `{function_signature}`, `{initial_code}`, and `{algorithm_details(given_heuristics)}`.

#### B.3.1 System Prompt

The system prompt defines the role of the model, the diagnostic tools, and the global interaction protocol. The same template is shared by all problem domains; only `{task_brief}`, `{objective_text}`, and the available diagnostic interfaces are instantiated at runtime.

System prompt template

Role. You are an expert in designing heuristic algorithms for `{task_brief}`. Your goal is to iteratively improve the current heuristic and optimize `{objective_text}`.

Available diagnostic interfaces.
1. `InstanceAnalysis`: summarize structural properties of the training instances, such as spacing, clustering, density, boundary statistics, and task-specific attributes when available.
2. `ASTNoveltyAnalyzer`: compare the AST structure of a candidate against previously evaluated candidates. This interface is used only as a novelty checkpoint; final ranking is always determined by train evaluation.

Interaction rules. Use diagnostic feedback and train evaluation results to revise the code over multiple turns. Do not submit the initial code unchanged. The final answer must contain only the complete Python code after the marker `#### FINAL SOLUTION ####`.

#### B.3.2 User Prompt Template

The user prompt contains the task\-specific design request, the current baseline, the implementation constraints, appended observations, and the final\-answer format\. For constructive routing tasks, the target is a next\-node decision function; for ACO tasks, the target is a heuristic\-information function; for CAF, the target is an acquisition utility function\. These differences are represented by placeholders rather than by separate prompt families\.

User prompt template

Task request. Please design the required heuristic function using the following background.

Problem description. `{problem.description}`

Algorithmic context. `{algorithm_details(given_heuristics)}`

Function interface. `{function_signature}`

Current baseline code. `{initial_code}`

Current baseline objective. `{baseline_objective}`

Evaluation objective. Direction: `{objective_direction}`. Improve over the current baseline using only training-set feedback.

Implementation constraints. The function name and signature must match `{function_name}` exactly. The returned value must satisfy all validity constraints in `{problem.description}`. The code must be deterministic and executable in the provided evaluator. NumPy is available as `np`.

Observations appended during interaction. After each diagnostic call or train evaluation, the returned observation is appended to the conversation memory. The agent may then revise a candidate, call a diagnostic interface, submit a new implementation for train evaluation, or finalize the best successfully evaluated code.

Final answer format. All domains use the same final-answer constraint: `#### FINAL SOLUTION ####` followed by complete Python code only.

#### B\.3\.3Problem Information Used in Prompts

Table [6](https://arxiv.org/html/2605.08756#A2.T6) summarizes the domain-specific information inserted into {problem.description} and related placeholders. The entries are intentionally concise because the contribution of this work is the agentic interaction policy rather than domain-specific prompt engineering.

Table 6: Information of each problem used in prompts. The same system and user prompt templates are reused across domains; only these problem-specific fields are instantiated.

## Appendix C RL Training Details

This section provides additional details on the RL training process, including training data statistics, training configuration, and training curves\.

### C.1 GRPO

We optimize the AHD Agent with the GRPO algorithm [[41](https://arxiv.org/html/2605.08756#bib.bib5)]. Let $\pi_{\theta}$ be the trainable LLM, $\pi_{\mathrm{ref}}$ a frozen reference LLM, and $q$ an input prompt. For each $q$, GRPO samples a group of $G$ rollouts $\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)$, obtains rewards $r_i$, and computes the normalized advantage $\hat{A}_i=\big(r_i-\mathrm{mean}(\{r_j\})\big)/\big(\mathrm{std}(\{r_j\})+\delta\big)$. The LLM is updated by the clipped objective

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q,\{o_i\}}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}\hat{A}_i,\;\mathrm{clip}(\rho_{i,t},1-\epsilon_{\mathrm{clip}},1+\epsilon_{\mathrm{clip}})\hat{A}_i\big)-\beta\,D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}})\Big)\Bigg],\qquad(1)$$

where $\rho_{i,t}(\theta)=\pi_{\theta}(o_{i,t}\mid q,o_{i,<t})/\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})$ is the per-token importance ratio, $\epsilon_{\mathrm{clip}}$ controls the clipping range, and $\beta$ controls the strength of the KL penalty $D_{\mathrm{KL}}$. We set $\epsilon_{\mathrm{clip}}=0.2$ and $\beta=0.001$ for training.
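As a concrete reference, the group-normalized advantage and the clipped surrogate term of Eq. (1) can be sketched in NumPy as follows; the KL penalty, batching over prompts, and gradient computation are omitted, and the function names are illustrative rather than the paper's implementation:

```python
import numpy as np

def grpo_advantages(rewards, delta=1e-6):
    """Group-normalized advantages: (r_i - mean) / (std + delta)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + delta)

def grpo_token_objective(logp_new, logp_old, advantage, eps_clip=0.2):
    """Clipped surrogate averaged over one rollout's tokens (KL term omitted).

    logp_new / logp_old: per-token log-probs under the current and behavior
    policies; advantage: the scalar group-normalized advantage of the rollout.
    """
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # importance ratios
    clipped = np.clip(rho, 1.0 - eps_clip, 1.0 + eps_clip)
    return np.minimum(rho * advantage, clipped * advantage).mean()
```

With identical log-probs the ratio is 1 for every token, so the objective reduces to the advantage itself, matching the unclipped case of Eq. (1).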

### C\.2Training Data

The RL training data is constructed from seed heuristics collected via ReEvo searches and prior agentic design sessions across four domains. For each domain, we retain all executable heuristics from the collection process, not only the best-performing ones, to ensure diversity in starting quality. Each seed heuristic is paired with multiple training-instance variants at problem sizes $N\in\{40,50,60,70,80\}$, and the baseline objective is re-evaluated on each variant. Failed or timed-out evaluations are discarded. The final dataset contains 1,000 candidate prompts per domain, drawn from the source pools summarized in Table [7](https://arxiv.org/html/2605.08756#A3.T7). Approximately 50% of the prompts use the *improve-from-code* mode, and the remaining 50% use the *design-from-scratch* mode. Figure [8](https://arxiv.org/html/2605.08756#A3.F8) visualizes the objective-value distribution of the source heuristic pools, showing that all four domains cover a broad range of starting quality levels.
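The construction procedure above can be sketched as follows; `evaluate` and `make_instances` are hypothetical stand-ins for the actual evaluation and instance-generation pipeline, not the paper's API:

```python
import random

def build_prompts(seed_heuristics, evaluate, make_instances,
                  sizes=(40, 50, 60, 70, 80), n_prompts=1000,
                  improve_ratio=0.5):
    """Pair seed heuristics with instance variants; drop failed baselines."""
    prompts = []
    while len(prompts) < n_prompts:
        size = random.choice(sizes)
        instances = make_instances(size)           # training-instance variant
        if random.random() < improve_ratio:        # improve-from-code mode
            seed = random.choice(seed_heuristics)
            baseline = evaluate(seed, instances)
            if baseline is None:                   # failed or timed-out: discard
                continue
            prompts.append({"mode": "improve-from-code",
                            "seed": seed, "baseline": baseline, "size": size})
        else:                                      # design-from-scratch mode
            prompts.append({"mode": "design-from-scratch", "size": size})
    return prompts
```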

Table 7: Training data statistics for the four-domain RL dataset. Source heuristics are the number of distinct seed programs collected per domain. Generated prompts are candidate RL records after pairing with instance variants. Training sizes report the instance scales used for variant augmentation, and the format ratio reports improve-from-code versus design-from-scratch prompts. Obj. value statistics are computed over the source pool.

![Refer to caption](https://arxiv.org/html/2605.08756v1/x9.png)

Figure 8: Objective-value distribution of the source heuristic pools used to generate the RL training data. Solid and dashed vertical lines mark the pool mean and median, respectively. All four domains exhibit broad coverage from near-baseline to competitive quality levels.

### C.3 Training Configuration

The RL policy is initialized from Qwen3-4B-Instruct-2507 [[50](https://arxiv.org/html/2605.08756#bib.bib60)] and trained with GRPO in the multi-turn tool-augmented environment described above. Each RL step samples a mini-batch of 4 prompts, generates 8 rollouts per prompt, executes tool calls and evaluations within the environment, and updates the policy using the staged reward described in [Section 3.2.2](https://arxiv.org/html/2605.08756#S3.SS2.SSS2). Training runs for 500 steps on 4× A100-80G GPUs. Table [8](https://arxiv.org/html/2605.08756#A3.T8) summarizes the full hyperparameter configuration.

Table 8: RL training hyperparameters.

### C.4 Training Curves

Figures [9](https://arxiv.org/html/2605.08756#A3.F9) and [10](https://arxiv.org/html/2605.08756#A3.F10) report key training diagnostics and per-domain validation curves over 500 RL steps.

The reward (Figure [9](https://arxiv.org/html/2605.08756#A3.F9), top-left) increases steadily throughout training, indicating that the policy progressively learns to produce higher-quality heuristics. The number of turns (top-right) stabilizes after an initial adjustment period, suggesting that the policy converges to a consistent multi-turn interaction strategy rather than expanding trajectory length indefinitely. Response length (bottom-left) shows a moderate increase as the model learns to generate more detailed reasoning and code revisions. Step time (bottom-right) remains stable, confirming that the training infrastructure scales consistently across steps.

The per-domain validation curves (Figure [10](https://arxiv.org/html/2605.08756#A3.F10)) show that the train-side objective improves on all four domains over the course of training. TSP-ACO and CVRP-ACO converge relatively quickly, while CVRP-Constructive shows more gradual improvement, consistent with the higher variance of constructive heuristic design.

![Refer to caption](https://arxiv.org/html/2605.08756v1/x10.png)

![Refer to caption](https://arxiv.org/html/2605.08756v1/x11.png)

![Refer to caption](https://arxiv.org/html/2605.08756v1/x12.png)

![Refer to caption](https://arxiv.org/html/2605.08756v1/x13.png)

Figure 9: RL training diagnostics over 500 steps. Top-left: quality reward (higher is better). Top-right: average interaction turns per rollout. Bottom-left: average response length in tokens. Bottom-right: wall-clock time per RL step.

![Refer to caption](https://arxiv.org/html/2605.08756v1/x14.png)

![Refer to caption](https://arxiv.org/html/2605.08756v1/x15.png)

![Refer to caption](https://arxiv.org/html/2605.08756v1/x16.png)

![Refer to caption](https://arxiv.org/html/2605.08756v1/x17.png)

Figure 10: Per-domain train-side validation curves over RL training steps. Each panel reports the $N=50$ validation objective; lower is better for all four domains.

## Appendix D Detailed Experimental Results

### D.1 Pricing Details

The reported dollar costs correspond to the Cost columns in Tables [1](https://arxiv.org/html/2605.08756#S3.T1), [2](https://arxiv.org/html/2605.08756#S4.T2), and [3](https://arxiv.org/html/2605.08756#S4.T3). We compute these costs from token usage using the public OpenRouter pricing of the corresponding API-based backbone models. Since OpenRouter does not provide a separate price for Qwen3-4B, we use the Qwen3-8B price as a proxy for locally served Qwen3-4B and the RL-trained checkpoints based on it. The detailed token prices used for cost estimation are reported in Table [9](https://arxiv.org/html/2605.08756#A4.T9).

Table 9: Token pricing used for the reported cost estimates. Prices are in USD per one million tokens.

### D.2 Metrics Details

This section provides the details behind the rows labeled “Optimal” and “Baseline heuristic” in Tables [1](https://arxiv.org/html/2605.08756#S3.T1) and [2](https://arxiv.org/html/2605.08756#S4.T2). The “Optimal” row reports the reference objective $f^{\star}$ used to compute validation Gap, while the “Baseline heuristic” row reports the performance of the initial heuristic provided to each heuristic-design method. For all LLM-based AHD methods compared within the same domain, we use the same target function interface, design instances, validation instances, objective computation, and initial baseline heuristic when applicable; only the search or interaction workflow differs.

##### Gap computation\.

We use validation Gap \(%\) as the primary performance metric\. For minimization domains, including TSP\-Constructive, CVRP\-Constructive, TSP\-ACO, CVRP\-ACO, OVRP\-Constructive, and CAF, the Gap is computed as

$$\mathrm{Gap}=\frac{f-f^{\star}}{|f^{\star}|}\times 100\%,$$

where $f$ is the objective value achieved by the evaluated heuristic and $f^{\star}$ is the corresponding optimal or best-known reference objective. For maximization domains, including OP-ACO and MKP-ACO, the Gap is computed as

$$\mathrm{Gap}=\frac{f^{\star}-f}{|f^{\star}|}\times 100\%.$$

Under this convention, lower Gap is always better, and a Gap of $0$ indicates that the evaluated heuristic matches the reference objective on the validation set.
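Both conventions can be folded into a single helper; the `maximize` flag below is an illustrative parameter name, not part of the paper's code:

```python
def validation_gap(f, f_star, maximize=False):
    """Validation Gap (%): lower is better under both conventions.

    For minimization domains the gap is (f - f*)/|f*|; for maximization
    domains it flips to (f* - f)/|f*|, so a gap of 0 means the heuristic
    matches the reference objective.
    """
    diff = (f_star - f) if maximize else (f - f_star)
    return 100.0 * diff / abs(f_star)
```

For example, a tour of length 110 against a reference of 100 yields a 10% gap, as does a collected reward of 90 against a maximization reference of 100.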

##### Reference objectives.

For each validation domain and problem size, the reference objective $f^{\star}$ is computed independently of the AHD search process and is never exposed to the agent during design. For TSP-Constructive and TSP-ACO, the reference tour lengths are computed by LKH [[12](https://arxiv.org/html/2605.08756#bib.bib70)]. For CVRP-Constructive, CVRP-ACO, and OVRP-Constructive, the reference routing objectives are computed by PyVRP [[54](https://arxiv.org/html/2605.08756#bib.bib71)]. For OP-ACO, the reference collected rewards are computed by OP-Solver [[20](https://arxiv.org/html/2605.08756#bib.bib72)]. For MKP-ACO, the reference packed profits are computed by OR-Tools [[34](https://arxiv.org/html/2605.08756#bib.bib69)]. These reference objectives are used only for reporting validation Gap after a final heuristic has been selected.

##### Baseline heuristics.

The initial baseline heuristic defines the starting point of the improve-from-code setting and also provides the “Baseline heuristic” row in the main tables. For TSP-Constructive, the baseline follows the classical greedy constructive rule [[37](https://arxiv.org/html/2605.08756#bib.bib68)]. For CVRP-Constructive, we use the same constructive starting point as LLM4AD [[27](https://arxiv.org/html/2605.08756#bib.bib63)]. For OVRP-Constructive, we use the same constructive rule as CVRP-Constructive, adapted to the open-route objective by omitting the final return-to-depot cost. For TSP-ACO, CVRP-ACO, OP-ACO, and MKP-ACO, the baseline heuristic follows the standard ReEvo ACO settings [[59](https://arxiv.org/html/2605.08756#bib.bib23)]. In ACO domains, the LLM-designed component is the heuristic desirability function, while the remaining ACO solver configuration is kept fixed across methods.

##### Evaluation protocol.

All reported method results are computed after the design process terminates and the final heuristic is selected. During design, the agent and all baselines receive feedback only from the design instances; validation and test objectives are not available to the search process. The evaluator-call count reports the number of times candidate heuristic code is executed on the design set. Diagnostic tool calls do not consume evaluator budget, and Cost reports the average USD cost per run under the pricing protocol described in Section [D.1](https://arxiv.org/html/2605.08756#A4.SS1). Unless otherwise specified, method results, evaluator calls, and costs are averaged over five independent runs.

### D.3 Data Sources and Split Protocol

Our experiments involve three different data sources, which serve different purposes and are never used interchangeably: the RL prompt corpus, the evaluation\-time design set, and the held\-out validation set\.

##### RL prompt corpus\.

The RL prompt corpus $\mathcal{Q}_{\mathrm{RL}}$ is used only to optimize the parameters of the AHD Agent policy with GRPO. Each RL prompt instantiates a train-time design environment, including a problem domain, a seed heuristic or a design-from-scratch mode, a design-dataset variant, the tool library, and the reward/evaluation interface. These prompts are used to train the agentic interaction policy, i.e., how the model uses feedback, calls tools, revises code, repairs invalid candidates, and decides when to finalize. The RL prompt corpus is not used for reporting the validation results in the main tables.

##### Evaluation\-time design set\.

During evaluation, every AHD run is associated with a design set $D_{\mathrm{design}}$. This set is the only data visible to the heuristic-design loop. Candidate heuristics generated by AHD Agent or by fixed-workflow AHD baselines are evaluated on $D_{\mathrm{design}}$ during search, and all design-time evaluator scores, execution feedback, and per-instance diagnostic records are computed only on this set. Diagnostic tools are also restricted to $D_{\mathrm{design}}$, the current session history, and evaluator feedback already produced within the same design run. They do not inspect held-out validation instances, validation objectives, or validation per-instance performance. The design set is used to select the final heuristic within the allowed evaluator-call budget, but design-set performance is not the metric reported in the main result tables.

##### Held\-out validation set\.

After the design budget is exhausted, the selected heuristic is fixed and evaluated on the held-out validation set $D_{\mathrm{val}}$. No validation feedback is returned to the agent, tools, evaluator loop, or fixed-workflow search process during heuristic design. Therefore, the reported COP results measure held-out validation performance of a heuristic selected using only design-set feedback. For the combinatorial optimization domains in this paper, $D_{\mathrm{val}}$ is the final reporting split; we do not use an additional COP test split. The cost-aware Bayesian optimization benchmark follows its own test-function protocol described in Appendix [A.1](https://arxiv.org/html/2605.08756#A1.SS1).

##### In\-domain and held\-out domains\.

The distinction between in-domain and held-out domains refers only to whether a domain appears in the RL prompt corpus used to train the AHD Agent policy. In-domain domains are included in $\mathcal{Q}_{\mathrm{RL}}$, whereas held-out domains are absent from RL training. However, both in-domain and held-out evaluation domains still use an evaluation-time design set $D_{\mathrm{design}}$ during AHD search, because all AHD methods require design-time feedback to discover a heuristic. Thus, held-out-domain experiments test whether the trained agentic design policy transfers to new problem families, not whether the final heuristic is produced without any design-time feedback.

Table 10: Instance split statistics for COP domains. “Used in RL training” indicates whether the domain appears in the GRPO prompt corpus. Held-out domains are not used to train the AHD Agent policy, but they still have an evaluation-time design set for heuristic search.

### D.4 Motivation for RL Training

**General-purpose LLMs are backbone-sensitive.** Table [11](https://arxiv.org/html/2605.08756#A4.T11) compares GPT-4o and DeepSeek-V4-Flash under the same fixed-search AHD framework across seven domains. The best backbone varies by domain and framework: for example, on TSP-ACO, EOH performs better with GPT-4o, while on CVRP-ACO it performs better with DeepSeek-V4-Flash. ReEvo shows a similar inconsistency, with the dominant backbone flipping between domains. This instability means that practitioners must re-evaluate backbone choices for each new domain, adding cost and uncertainty to the design process.

Table 11: Model-sensitivity view for motivating RL-trained multi-turn design. Entries are average validation mean Gap (%) over the reported sizes, comparing GPT-4o (G) with DeepSeek-V4-Flash (D) under the same AHD method. Each cell reports the lower-gap model first, with the other model in parentheses; lower is better.

**RL-trained AHD Agent is efficient and backbone-independent.** Table [12](https://arxiv.org/html/2605.08756#A4.T12) compares AHD Agent (RL) against fixed-search AHD baselines with GPT-4o on four representative domains. AHD Agent achieves the lowest average gap (32.85%) across all four domains while using only 12.7 evaluator calls on average, compared to 100 calls for the fixed-search baselines. The average cost per run is $0.004, roughly two orders of magnitude lower than the GPT-4o baselines ($0.31–$0.85). Because the RL-trained policy has internalized effective design strategies during training, it does not depend on the reasoning capabilities of a specific frontier backbone at inference time.

Table 12: Efficiency summary on four representative domains (two in-domain, two held-out). For each domain, Gap (%) is the average Mean Gap over the reported validation sizes ($N=100,200$ for training domains; $N=50,100,200$ for held-out domains), matching the main-text tables. Eval is the average evaluator-call count per run. Cost is the average USD per run. Lower is better for all metrics.

### D.5 Model Scaling with GPT-Series Models

In addition to the Qwen-family scaling, we conduct the same comparison using three GPT-series models (GPT-4o-Mini, GPT-5.4-Mini, GPT-5.4) on three validation domains: CVRP-Constructive, OP-ACO, and OVRP-Constructive. Figure [11](https://arxiv.org/html/2605.08756#A4.F11) reports the results.

![Refer to caption](https://arxiv.org/html/2605.08756v1/x18.png)

Figure 11: Mean validation Gap (%) across the three reported problem sizes as the GPT-series backbone changes from GPT-4o-Mini to GPT-5.4-Mini and GPT-5.4. AHD Agent uses the LLM as an agent that iteratively calls tools and reacts to feedback, whereas EOH and ReEvo use the LLM inside a fixed search procedure.

**AHD Agent shows a clear scaling trend.** The agentic multi-turn framework consistently improves with stronger backbones: on all three domains, the best gap is achieved with the strongest model GPT-5.4, with endpoint improvements of −9.00, −8.14, and −5.65 percentage points from GPT-4o-Mini to GPT-5.4. This monotonic trend confirms that the multi-turn paradigm can effectively convert stronger model capabilities into better heuristic design decisions.

**Fixed-search frameworks do not exhibit a consistent scaling trend.** EOH and ReEvo show non-monotonic or domain-dependent behavior under the same backbone sweep. On OP-ACO, EOH degrades from 12.06% at GPT-4o-Mini to 17.18% at GPT-5.4. ReEvo is even more erratic: its OP-ACO gap spikes to 46.16% at GPT-5.4-Mini before recovering, and its OVRP-Constructive gap increases by +13.26 points when moving to the strongest backbone. These results suggest that simply using a more powerful LLM as a code generator inside a fixed search procedure does not reliably improve performance.

### D.6 Tool Ablation with DeepSeek-V4

Tools consistently benefit the agentic multi-turn framework but produce mixed or negative effects in fixed-search workflows (Table [13](https://arxiv.org/html/2605.08756#A4.T13)). This experiment uses DeepSeek-V4-Flash as the LLM backbone for all three frameworks, comparing two tool configurations: (i) evaluator only, where the agent can evaluate candidates but has no diagnostic tools; and (ii) full tools, where instance analysis and AST novelty are additionally available. We report mean and best validation gaps averaged over three problem sizes ($N=50,100,200$) on CVRP-Constructive and TSP-ACO.

**AHD Agent results.** Adding diagnostic tools consistently improves the multi-turn agent on both domains. The mean gap decreases by 1.87 points on CVRP-Constructive (25.60 → 23.73) and by 1.15 points on TSP-ACO (9.30 → 8.15). The best gap also improves: −1.73 points on CVRP-Constructive and −0.10 points on TSP-ACO. This confirms that the agentic multi-turn policy can effectively leverage additional diagnostic signals to guide its revisions.

**Fixed-workflow tool injection.** To provide the same diagnostic information to EOH and ReEvo, we inject the tools through deterministic adapters that preserve the original evolutionary workflow:

- EOH: A drop-in *ToolAugmentedSampler* wraps the original EoH sampler with a four-phase pipeline. (1) Instance analysis computes a structural summary of the training instances and appends it to the code-generation prompt. (2) The base sampler generates a candidate as usual. (3) AST novelty analysis compares the candidate against the current population. (4) If the candidate is structurally too similar to existing individuals, the LLM is asked to revise it once before evaluation.
- ReEvo: A *ReEvoToolController* injects the instance summary (computed once and cached) into the LLM prompts before initialization, crossover, and mutation operations. After each candidate is generated, the controller performs AST novelty checking against the evolutionary history and triggers a single revision if similarity exceeds a threshold.

In both cases, tool injection is*fixed and deterministic*: the LLM cannot choose whether or when to query the tools, and the feedback is always appended in the same format at the same pipeline stage\.
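The EOH-side adapter can be sketched as follows; the class name mirrors the text, but the constructor arguments, similarity interface, and threshold are illustrative assumptions rather than the actual implementation:

```python
class ToolAugmentedSampler:
    """Deterministic four-phase wrapper around a base sampler (illustrative)."""

    def __init__(self, base_sampler, analyze_instances, ast_similarity,
                 threshold=0.9):
        self.base = base_sampler          # callable: prompt -> candidate code
        self.analyze = analyze_instances  # callable: instances -> text summary
        self.similarity = ast_similarity  # callable: (code_a, code_b) -> [0, 1]
        self.threshold = threshold

    def sample(self, prompt, instances, population):
        # (1) prepend a structural summary of the training instances
        prompt = f"{prompt}\n\nInstance summary:\n{self.analyze(instances)}"
        # (2) generate a candidate with the unmodified base sampler
        candidate = self.base(prompt)
        # (3) AST novelty check against the current population
        too_similar = population and max(
            self.similarity(candidate, p) for p in population) > self.threshold
        if too_similar:
            # (4) exactly one forced revision before evaluation
            candidate = self.base(
                prompt + "\n\nRevise: too similar to existing code.")
        return candidate
```

Note the wrapper never decides *whether* to call a tool: both tools run at fixed points on every sample, which is exactly the deterministic injection contrasted with the agent's own tool-use policy.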

**Fixed-workflow results.** Unlike the multi-turn agent, EOH and ReEvo show inconsistent responses to tool injection. On CVRP-Constructive, EOH improves only marginally (mean gap −0.27) and ReEvo improves moderately (−1.19). On TSP-ACO, both frameworks *degrade* with tools: EOH worsens by +0.18 and ReEvo by +0.49 on mean gap. This asymmetry demonstrates that the same diagnostic information can be helpful or harmful depending on how it is integrated. When tools are injected at fixed points without the model's agency over when and how to use the feedback, the additional information may disrupt the existing search dynamics rather than improve them. The multi-turn agent, by contrast, can choose to query tools selectively, ignore uninformative feedback, and adapt its revision strategy based on the diagnostic results.

Table 13: DeepSeek-V4 tool ablation on two domains. Entries are mean/best validation Gap (%) averaged over three sizes; lower is better. AHD Agent compares evaluator-only tools with AST/analysis tools, while EOH and ReEvo compare the original design with AST/analysis tools. Shading marks improved full-tool cells; bold and underline mark the best and second-best values.

### D.7 AHD Agent w/PS vs. AHD Agent w/SR: Budget-Expanded Inference

The standard AHD Agent setting uses a short evaluator budget (up to 30 calls) to maximize efficiency. When a larger evaluator budget is available, we study two complementary strategies for scaling multi-turn inference: *Sequential Refinement* (AHD Agent w/SR) and *Parallel Sampling* (AHD Agent w/PS). Both strategies reuse the same RL-trained policy without any additional fine-tuning; they differ only in how the evaluator budget is allocated across interaction trajectories.

#### D.7.1 Sequential Refinement (AHD Agent w/SR)

AHD Agent w/SR increases inference depth by extending a single agent session across multiple continuation rounds until a global evaluator budget (default 100 calls) is exhausted. Concretely, the agent begins with its standard multi-turn interaction. When it submits a `FINAL SOLUTION` marker but the global budget has not been reached, the system automatically starts a new continuation round: the best-evaluated heuristic from previous rounds is injected as the new seed code, a continuation note informs the agent of the remaining budget, and the conversation history is reset to avoid context-length overflow. This process repeats for up to $K$ dialog rounds (default $K=10$) or until the global budget is fully consumed.

The key advantage of this strategy is that the agent accumulates design experience across rounds\. Each continuation round starts from the best result of the previous round, enabling progressive refinement of a single heuristic trajectory\. The agent can exploit insights from earlier evaluations—such as which code patterns improved performance or which instance groups remain weak—to guide subsequent revisions\.
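The continuation loop can be sketched as follows; `run_session` and its return signature are assumptions for illustration, not the actual system interface:

```python
def sequential_refinement(run_session, budget=100, max_rounds=10):
    """Budget-expanded inference by chained continuation rounds (illustrative).

    `run_session(seed, remaining)` runs one multi-turn agent session seeded
    with `seed` (None for the first round) and returns
    (best_code, best_objective, evals_used).
    """
    seed, best_code, best_obj, used = None, None, float("inf"), 0
    for _ in range(max_rounds):
        if used >= budget:
            break
        code, obj, evals = run_session(seed, budget - used)
        used += evals
        if obj < best_obj:               # minimization convention
            best_code, best_obj = code, obj
        seed = best_code                 # next round starts from the best so far
    return best_code, best_obj, used
```

Re-seeding each round with the running best is what lets insights from earlier evaluations compound along a single refinement trajectory.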

#### D.7.2 Parallel Sampling (AHD Agent w/PS)

AHD Agent w/PS increases inference breadth by launching $L$ independent short multi-turn lanes in parallel (default $L=5$). Each lane runs a complete standard AHD Agent session with its own conversation context and evaluator state. The lanes execute concurrently using a thread pool and do not share information during execution. After all lanes complete, the system selects the best candidate across all lanes based on the design-set (training) objective.

This strategy trades depth for diversity. While each individual lane uses only a short budget, the parallel ensemble samples $L$ independent trajectories from the policy, increasing the probability that at least one trajectory discovers a strong heuristic. The total evaluator budget is approximately $L$ times the per-lane budget. AHD Agent w/PS is naturally parallelizable and more robust to individual trajectory failures: if one lane produces a weak or failed candidate, the remaining lanes can still succeed.
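A minimal thread-pool sketch of this breadth-first allocation; `run_lane` and its return pair are illustrative assumptions rather than the paper's API:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sampling(run_lane, n_lanes=5):
    """Run independent short sessions concurrently, keep the best candidate.

    `run_lane(i)` runs one complete short AHD Agent session and returns
    (code, design_objective). Lanes share nothing during execution;
    selection by design-set objective happens only after all complete.
    """
    with ThreadPoolExecutor(max_workers=n_lanes) as pool:
        results = list(pool.map(run_lane, range(n_lanes)))
    return min(results, key=lambda r: r[1])   # minimization convention
```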

#### D.7.3 Comparison

Table [14](https://arxiv.org/html/2605.08756#A4.T14) compares both strategies on seven domains. AHD Agent w/SR obtains the lower average mean gap on five of seven domains (TSP-C, TSP-ACO, CVRP-ACO, OP-ACO, MKP-ACO), while AHD Agent w/PS is better on two (CVRP-C, OVRP-C). Averaged across the seven domains, AHD Agent w/SR achieves a mean gap of 23.57%, compared with 24.47% for AHD Agent w/PS. AHD Agent w/PS uses 88.8 evaluator calls on average across the seven domains, compared with the fixed 100-call budget for AHD Agent w/SR. In practice, the choice between the two strategies depends on the deployment scenario: AHD Agent w/SR is preferred when serial depth and progressive refinement are important, while AHD Agent w/PS is preferred when wall-clock time is constrained and parallelism is available.

Table 14: AHD Agent w/PS versus AHD Agent w/SR on the seven domains where both budget-expanded variants are reported in the appendix mean-gap tables. AHD Agent w/PS corresponds to the AHD Agent w/PS rows in those tables. For each domain, Avg. Mean Gap averages the three validation sizes. Avg. Eval reports the design-time evaluator budget: AHD Agent w/PS uses the reported evaluator-call count averaged over the same sizes, while AHD Agent w/SR uses a fixed 100-call sequential-refinement budget. Lower is better.

### D.8 P-values for Significance

Following the significance-test protocol in prior AHD work [[64](https://arxiv.org/html/2605.08756#bib.bib24)], we conduct ten independent runs for each method and test whether the RL-trained multi-turn agent significantly outperforms the strongest LLM-based AHD baseline on each domain. Each run is summarized by the average validation mean gap over three problem sizes ($N\in\{50,100,200\}$). We compare the run-level gap distributions using single-tailed Welch $t$-tests, where the alternative hypothesis is that the multi-turn agent has a lower mean gap than the baseline. For each domain, we select the best-performing AHD baseline with DeepSeek-V4-Flash as the comparison target.
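The test statistic can be sketched with the standard library alone; note the sketch substitutes a normal approximation for the exact Student-t CDF (which a library routine such as SciPy's `ttest_ind(..., equal_var=False, alternative="less")` computes), so for ten-run samples its p-values are slightly optimistic:

```python
import math
from statistics import fmean, stdev

def welch_t(a, b):
    """Welch t statistic for two unequal-variance samples."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (fmean(a) - fmean(b)) / math.sqrt(va / len(a) + vb / len(b))

def one_sided_p(a, b):
    """Approximate p-value for H1: mean(a) < mean(b).

    Uses the standard-normal CDF in place of the exact Student-t CDF,
    which slightly understates p at small sample sizes.
    """
    t = welch_t(a, b)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
```

Here `a` would hold the agent's ten run-level gaps and `b` the baseline's; small p-values indicate the agent's mean gap is significantly lower.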

Table [15](https://arxiv.org/html/2605.08756#A4.T15) reports the per-run gaps, averages, standard deviations, and $p$-values. On CVRP-Constructive, the standard AHD Agent RL agent (avg. 22.13%) significantly outperforms the best baseline ReEvo (avg. 25.44%) with $p=4.65\times 10^{-5}$. On OVRP-Constructive, a held-out domain unseen during RL training, the AHD Agent RL agent (avg. 92.75%) likewise significantly outperforms ReEvo (avg. 106.30%) with $p=1.44\times 10^{-9}$, demonstrating strong generalization.

On OP-ACO, the standard AHD Agent RL agent (avg. 8.93%) is slightly better than the best baseline MCTS-AHD (avg. 9.22%) on average, but the difference is not statistically significant ($p=0.355$) due to higher run-to-run variance under the limited short multi-turn evaluator budget. When we increase the evaluator budget with Sequential Refinement (AHD Agent w/SR), the variance shrinks substantially (std from 1.73 to 0.52) and the gap drops to 7.08%, achieving significance with $p=7.80\times 10^{-4}$. This confirms that the RL-trained policy has learned an effective design strategy; the additional evaluator budget allows it to realize this advantage more consistently.

Table 15: Ten-run significance tests against strong AHD baselines. Each run reports the mean gap (%) averaged over $N=50,100,200$. "avg" and "std" are computed over the ten run-level gaps, and $p$-values are from single-tailed Welch $t$-tests. Lower is better.

| Domain | Method | run1 | run2 | run3 | run4 | run5 | run6 | run7 | run8 | run9 | run10 | avg | std | $p$-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CVRP-C | ReEvo | 23.377 | 24.812 | 27.330 | 26.993 | 25.504 | 25.559 | 27.151 | 25.716 | 22.125 | 25.875 | 25.444 | 1.576 | – |
| CVRP-C | AHD Agent RL | 20.926 | 22.229 | 23.637 | 22.098 | 22.576 | 20.926 | 23.452 | 20.926 | 23.637 | 20.926 | 22.133 | 1.109 | **4.65×10⁻⁵** |
| OVRP-C | ReEvo | 102.632 | 105.358 | 110.180 | 104.906 | 107.519 | 107.022 | 105.167 | 106.235 | 108.805 | 105.161 | 106.299 | 2.054 | – |
| OVRP-C | AHD Agent RL | 92.549 | 92.549 | 94.227 | 92.549 | 92.549 | 92.549 | 92.865 | 92.549 | 92.549 | 92.549 | 92.749 | 0.502 | **1.44×10⁻⁹** |
| OP-ACO | MCTS-AHD | 11.342 | 8.524 | 7.518 | 10.339 | 8.259 | 11.678 | 8.652 | 7.465 | 10.112 | 8.284 | 9.217 | 1.455 | – |
| OP-ACO | AHD Agent RL | 6.896 | 10.528 | 9.214 | 7.866 | 7.554 | 10.139 | 12.662 | 7.583 | 9.534 | 7.360 | 8.934 | 1.727 | 3.55×10⁻¹ |
| OP-ACO | AHD Agent w/SR RL | 7.202 | 7.867 | 7.626 | 6.796 | 7.152 | 7.829 | 6.384 | 6.398 | 6.729 | 6.853 | 7.084 | 0.520 | **7.80×10⁻⁴** |

## Appendix E External Resources and Licenses

Table [16](https://arxiv.org/html/2605.08756#A5.T16) lists the external codebases, models, and solver resources referenced or used in our implementation and experiments. The table is intended as a practical resource summary rather than a complete legal attribution list.

Table 16: External resources and license information.
