# HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization
Source: [https://arxiv.org/html/2605.07214](https://arxiv.org/html/2605.07214)
Yuping Yan, Jirui Han, Fei Ming, Yuanshuai Li, and Yaochu Jin
School of Engineering, Westlake University
{yanyuping, hanjirui, mingfei, liyuanshuai, jinyaochu}@westlake.edu.cn
###### Abstract
Large Language Models have recently emerged as a promising paradigm for automated heuristic design for NP-hard combinatorial optimization problems. Despite this progress, existing LLM-based methods typically rely on monolithic workflows constrained by rigid templates, thereby restricting memory-guided exploration and triggering premature convergence to local optima. To design an autonomous and collaborative architecture, we introduce HMACE, a Heterogeneous Multi-Agent Collaborative Evolution framework that reconceptualizes heuristic search as an organizational design problem. HMACE decomposes each evolutionary generation into an autonomous, role-specialized loop with four coordinated agents: a Proposer for strategy exploration, a Generator for executable heuristic synthesis, an Evaluator for empirical assessment, and a Reflector for archive-backed memory update. By coupling behavior-aware retrieval, lightweight candidate filtering, and fitness-grounded archive updates, HMACE guides the search toward diverse and promising heuristic behaviors while avoiding redundant evaluations. Extensive evaluations on representative COPs, including TSP, Online BPP, MKP, and PFSP, show that HMACE achieves a favorable quality-efficiency trade-off compared to state-of-the-art single-agent and multi-agent baselines. In the matched LLM-driven reference comparison, HMACE achieves the lowest average gaps on TSP and Online BPP (0.464% and 0.223%, respectively), while requiring only 0.13M and 0.42M tokens for the two tasks, substantially fewer than the compared baselines.
## 1 Introduction
Combinatorial Optimization Problems (COPs), such as the Traveling Salesman Problem (TSP), Vehicle Routing Problem (VRP), and Bin Packing Problem (BPP), are fundamental across operational and engineering domains [[4](https://arxiv.org/html/2605.07214#bib.bib2), [7](https://arxiv.org/html/2605.07214#bib.bib1)]. However, due to their NP-hard nature [[35](https://arxiv.org/html/2605.07214#bib.bib3)], the feasible search space often grows exponentially with problem scale, making the acquisition of optimal solutions computationally prohibitive. Consequently, heuristic and meta-heuristic algorithms have become the prevailing approach to finding high-quality approximate solutions within acceptable timeframes [[39](https://arxiv.org/html/2605.07214#bib.bib6)]. Despite their effectiveness, traditional heuristics rely heavily on hand-crafted rules tailored by domain experts, making them costly to develop and difficult to generalize across tasks [[43](https://arxiv.org/html/2605.07214#bib.bib7)].
Recently, the emergence of Large Language Models (LLMs) has revolutionized this paradigm. By leveraging their pretrained knowledge, LLMs can either directly generate promising solutions or automatically design novel heuristic policies without explicit algorithmic instructions [[28](https://arxiv.org/html/2605.07214#bib.bib57), [48](https://arxiv.org/html/2605.07214#bib.bib4), [49](https://arxiv.org/html/2605.07214#bib.bib5)]. Despite their potential, current LLM-based heuristic design methods face two fundamental structural bottlenecks. First, lacking structured exploration mechanisms such as systematic backtracking or diversification, they are highly susceptible to premature convergence and local optima [[30](https://arxiv.org/html/2605.07214#bib.bib11), [42](https://arxiv.org/html/2605.07214#bib.bib10)]. Second, solving complex COPs inherently demands heterogeneous cognitive operations (e.g., planning, generation, evaluation, and refinement). Forcing a monolithic agent to unify these diverse roles inevitably triggers severe role interference [[27](https://arxiv.org/html/2605.07214#bib.bib12)]. Taken together, these limitations suggest that the bottleneck of current LLM-based solvers lies not only in model capability, but more fundamentally in the lack of an appropriate organizational structure for coordinating reasoning and search.
Motivated by this observation, we turn to multi-agent systems as a natural solution. Instead of treating problem-solving as a monolithic generation process, multi-agent frameworks reformulate it as an organizational design problem, where multiple autonomous agents with specialized roles collaboratively explore and refine the solution space. However, prior studies have shown that naively scaling the number of agents does not guarantee performance gains; poorly designed multi-agent systems may even underperform strong single-agent baselines due to communication overhead and coordination noise [[29](https://arxiv.org/html/2605.07214#bib.bib8), [26](https://arxiv.org/html/2605.07214#bib.bib9)]. As illustrated in Figure [1](https://arxiv.org/html/2605.07214#S1.F1), tackling complex COPs requires moving beyond monolithic structures and naive parallelization. Therefore, the key challenge is not merely introducing multiple agents, but designing an autonomous, role-specialized, and memory-aware multi-agent architecture that enables effective collaboration for combinatorial optimization.
Figure 1: Architectural progression of LLM-based solvers for COPs. (1) Fixed Evolutionary Search: the LLM acts merely as a generation module within a rigid, human-crafted global loop. (2) Autonomous Single-Agent Evolution: a monolithic agent executes all cognitive operations sequentially, which inevitably leads to cognitive entanglement. (3) Homogeneous Multi-Agent Evolution: homogeneous agents explore in parallel with a shared memory, but still lack specialized cognitive focus. (4) Heterogeneous Multi-Agent Evolution: by decoupling problem-solving into specialized, interactive roles, it robustly eliminates role interference and enables systematic, diversified exploration.

As a key step towards this new paradigm, we introduce HMACE, a novel Heterogeneous Multi-Agent Collaborative Evolution framework leveraging explicit role division to automate heuristic search for COPs. In this paper, we reconceptualize the complex heuristic generation process not as a sequential coding task, but as an organizational design problem. Technically, HMACE implements a collaborative evolutionary loop to encourage both diversified exploration and reliable state management. At each iteration, specialized agents interact to propose, generate, evaluate, and archive candidate heuristics. When the search stagnates or produces low-utility candidates, the Reflector updates the archive, retrieves behaviorally diverse exemplars, and redirects subsequent proposals toward more promising or underexplored regions. By doing so, HMACE continuously refines heuristic programs while mitigating premature convergence. Ablation studies confirm that both role specialization and archive-guided reflection contribute to the observed performance gains.
Our contributions are threefold. First, we propose HMACE, a heterogeneous multi-agent evolutionary framework for combinatorial optimization, and formulate it as an organizational approach to heuristic design that goes beyond prior monolithic LLM-driven workflows. Second, we introduce a novel reflection mechanism integrated with role-specialized collaboration, enabling agents to leverage historical trajectories through behavior-aware retrieval and fitness-grounded archive updates, thereby improving the robustness and diversity of search while mitigating premature convergence. Third, extensive experiments on classical COPs demonstrate that HMACE achieves a favorable quality-efficiency trade-off compared to strong single-agent, search-controller, and multi-agent baselines. Notably, in the matched LLM-driven reference comparison, HMACE achieves the lowest average gaps on TSP and Online BPP while requiring substantially fewer tokens than the compared baselines.
## 2 Related Work
### 2.1 LLM-based Automatic Heuristic Design
Automatic heuristic design aims to reduce the dependence of COPs on manually crafted rules and domain expertise. Early automated algorithm design and hyper-heuristic methods search over predefined algorithm components, configurations, or low-level heuristic choices, whereas recent LLM-based methods expand this space to natural-language ideas and executable programs. FunSearch evolves programs with an LLM proposer and an external evaluator, demonstrating the potential of program search for mathematical discovery and online bin packing [[41](https://arxiv.org/html/2605.07214#bib.bib13)]. EoH represents heuristic ideas as natural-language “thoughts” paired with executable code, and evolves them within an LLM-EC loop [[28](https://arxiv.org/html/2605.07214#bib.bib57)]. ReEvo further introduces reflective evolution, using LLM-generated reflections as verbal feedback to guide heuristic improvement across multiple COP settings [[50](https://arxiv.org/html/2605.07214#bib.bib14)].
Recent work has moved beyond fixed LLM-EC loops by improving the search controller itself. HSEvo and MCTS-AHD explore hierarchical or tree-structured heuristic search, while MoH explicitly studies meta-optimization of the heuristic optimizer rather than assuming a fixed evolutionary controller [[14](https://arxiv.org/html/2605.07214#bib.bib15), [51](https://arxiv.org/html/2605.07214#bib.bib16), [42](https://arxiv.org/html/2605.07214#bib.bib10)]. These methods suggest that the organization of the search process plays an important role in LLM-based heuristic discovery. However, they mainly improve the optimizer or search policy, rather than explicitly organizing the process as a role-specialized multi-agent system. HMACE shares the goal of automatic heuristic evolution, but emphasizes specialized agent roles, memory-aware reuse, and cost-aware filtering within a collaborative search process.
### 2.2 LLMs for Combinatorial Optimization
Beyond automatic heuristic design, LLMs have been used as modeling assistants, solver interfaces, and direct reasoning engines for combinatorial optimization. A major line of work uses LLMs to translate natural-language problem descriptions into mathematical formulations or solver-ready code. OptiMUS combines LLM-based modeling, debugging, testing, and execution with linear programming and mixed-integer linear programming solvers [[2](https://arxiv.org/html/2605.07214#bib.bib17)], while LM4OPT and LLMOPT study how language models can bridge informal specifications and formal optimization models [[3](https://arxiv.org/html/2605.07214#bib.bib18), [20](https://arxiv.org/html/2605.07214#bib.bib19)]. Another line of work investigates direct or assisted COP solving, covering scheduling, routing, pathfinding, and end-to-end LLM solvers [[1](https://arxiv.org/html/2605.07214#bib.bib20), [19](https://arxiv.org/html/2605.07214#bib.bib21), [5](https://arxiv.org/html/2605.07214#bib.bib22), [21](https://arxiv.org/html/2605.07214#bib.bib23), [13](https://arxiv.org/html/2605.07214#bib.bib24)]. These works broaden the role of LLMs in optimization, but they primarily target formulation, instance-level reasoning, or solver orchestration. In contrast, HMACE does not ask the LLM to solve each instance directly; it uses LLM agents to discover reusable executable heuristics under empirical evaluation and budget constraints.
### 2.3 LLM-based Multi-Agent Systems
LLM-based multi-agent systems decompose complex tasks into interacting roles and workflows. Representative frameworks such as CAMEL, AutoGen, MetaGPT, and AgentVerse show that role-playing, tool-augmented conversation, standard operating procedures, and dynamic collaboration can improve complex reasoning and generation tasks [[25](https://arxiv.org/html/2605.07214#bib.bib25), [47](https://arxiv.org/html/2605.07214#bib.bib26), [18](https://arxiv.org/html/2605.07214#bib.bib27), [10](https://arxiv.org/html/2605.07214#bib.bib28)]. More closely related to optimization, DRAGON introduces decomposition and reconstruction agents for large-scale COPs, while CORAL studies autonomous multi-agent evolution for open-ended discovery [[9](https://arxiv.org/html/2605.07214#bib.bib33), [38](https://arxiv.org/html/2605.07214#bib.bib34)]. However, existing multi-agent optimization frameworks do not directly address the setting of reusable heuristic discovery: DRAGON [[9](https://arxiv.org/html/2605.07214#bib.bib33)] focuses on decomposing and reconstructing individual problem instances, whereas CORAL [[38](https://arxiv.org/html/2605.07214#bib.bib34)] targets broader open-ended discovery rather than COP-specific heuristic evolution. HMACE instead treats automatic heuristic design as a role-specialized collaborative search process, where proposal, code generation, empirical evaluation, and archive-guided reflection are coordinated under limited LLM budgets.
## 3 Methodology
Figure [2](https://arxiv.org/html/2605.07214#S3.F2) presents an overview of HMACE, a heterogeneous multi-agent evolutionary framework for automatic heuristic search on COPs. The core principle of HMACE is to decompose one search generation into four coordinated roles: a Proposer, a Generator, an Evaluator, and a Reflector, all of which communicate through an archive-based memory. In each generation, the proposer drafts strategy candidates conditioned on the current population and retrieved archive exemplars, the generator translates these strategies into executable heuristics, the evaluator measures their empirical performance on the training instances, and the reflector updates the archive to guide subsequent search. This staged design preserves diversity across behaviorally distinct niches while keeping the optimization loop grounded in measured fitness rather than verbal self-assessment. Illustrative examples of the prompts and archive records are provided in Appendix [B.3](https://arxiv.org/html/2605.07214#A2.SS3), especially Table [7](https://arxiv.org/html/2605.07214#A2.T7) and the prompt and memory templates collected in that subsection. In the following subsections, we present the technical details of the proposed HMACE framework.
Figure 2: The overview of the HMACE framework. The proposer conditions on parent heuristics and diverse exemplars retrieved from the archive to draft strategy candidates; the generator translates each candidate into executable code; the evaluator scores candidate heuristics on the training set; and the reflector maintains an archive-backed memory, retrieves diverse exemplars from different behavioral cells, and updates each cell with the best-performing candidate.

### 3.1 Problem Formulation
Let $\mathcal{T}$ denote a set of training instances of a CO problem. A *heuristic* $h \in \mathbb{H}$ is an executable Python program that, given an instance state, returns a construction or repair decision. The heuristic-design objective is to identify an optimal heuristic $h^{\star} \in \mathbb{H}$, which minimizes the expected loss over $\mathcal{T}$:

$$h^{\star} = \arg\min_{h \in \mathbb{H}} f(h, \mathcal{T}), \qquad f(h, \mathcal{T}) = \frac{1}{|\mathcal{T}|} \sum_{T \in \mathcal{T}} \ell(h, T), \tag{1}$$

where $\ell(h, T)$ is the per-instance loss (excess over a lower bound or optimality gap, depending on the problem). The hypothesis class $\mathbb{H}$ is implicit, defined as the support of an LLM-driven search procedure that emits source code conditioned on the problem context.
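As a minimal illustration of the empirical loss in Equation (1), the fitness of a candidate heuristic can be estimated by averaging a per-instance loss over the training set. The sketch below is not the paper's evaluator; `per_instance_loss` and the instance format are hypothetical stand-ins for the problem-specific components:

```python
from typing import Callable, Sequence

def fitness(heuristic: Callable, instances: Sequence,
            per_instance_loss: Callable) -> float:
    """Empirical fitness f(h, T): mean per-instance loss over the training set."""
    losses = [per_instance_loss(heuristic, inst) for inst in instances]
    return sum(losses) / len(losses)

# Toy instantiation: an "instance" is just a reference objective value, the
# "heuristic" returns an objective, and the loss is the relative optimality gap.
instances = [100.0, 200.0]
heuristic = lambda inst: inst * 1.05        # always 5% above the reference
gap = lambda h, inst: abs(h(inst) - inst) / inst
print(fitness(heuristic, instances, gap))   # ≈ 0.05
```

Lower fitness is better throughout; the search then ranks candidate programs by this scalar.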
We address Equation [1](https://arxiv.org/html/2605.07214#S3.E1) through an evolutionary paradigm. Starting from a population $P_0 \subset \mathbb{H}$, the process iteratively proposes new heuristics and selects top-performing elites evaluated under $f$. Rather than compressing an entire generation into a single, overloaded LLM call [[28](https://arxiv.org/html/2605.07214#bib.bib57), [50](https://arxiv.org/html/2605.07214#bib.bib14), [41](https://arxiv.org/html/2605.07214#bib.bib13)], we decompose the generation process into the following specialized, agent-level state transitions:

$$(P_t, \mathcal{M}_t) = \mathcal{R}\Bigl(\mathcal{E}\bigl(\mathcal{G}(\mathcal{P}(P_{t-1}, \mathcal{M}_{t-1})),\; \mathcal{T}\bigr),\; P_{t-1},\; \mathcal{M}_{t-1}\Bigr), \tag{2}$$

where $\mathcal{P}$ is a *Proposer* drafting strategies from historical memory, $\mathcal{G}$ is a *Generator* implementing these strategies as code, $\mathcal{E}$ is an *Evaluator* scoring candidates on training instances, $\mathcal{R}$ is a *Reflector* updating the archive to guide future iterations, and $\mathcal{M}_t$ is the memory at round $t$. After $G$ generations, the pipeline returns the incumbent $\hat{h}_G = \arg\min_{h \in P_G \cup \mathcal{M}_G} f(h, \mathcal{T})$, which serves as the empirical approximation to the ideal optimum $h^{\star}$.
### 3.2 Overall Workflow
We summarize the HMACE training workflow in the abridged Alg. [1](https://arxiv.org/html/2605.07214#alg1); the full step-by-step pseudocode is deferred to Appendix [B](https://arxiv.org/html/2605.07214#A2) (Alg. [2](https://arxiv.org/html/2605.07214#alg2)). Algorithm [1](https://arxiv.org/html/2605.07214#alg1) writes the workflow as an outer evolutionary loop wrapped around a generation routine GenStep($P, \mathcal{M}$). Starting from a seed-induced population $P_0$ and an empty archive-backed memory $\mathcal{M}_0$, HMACE iteratively alternates between retrieval, proposal-generation, evaluation, and reflection.
Within each generation, the reflector first performs retrieval through $\mathcal{R}_{\mathrm{sel}}$, which samples $r$ diverse exemplars from the current population and occupied archive cells to form the context exposed to the proposer. Conditioned on this context, the Proposer drafts $k$ strategy candidates, and the Generator translates them into executable heuristics. Before expensive evaluation, a lightweight deterministic filter removes malformed or near-duplicate programs so that the search budget is spent on feasible and behaviorally non-redundant candidates. The Evaluator then assigns each surviving heuristic two complementary signals: a scalar fitness score $f_h = f(h, \mathcal{T})$ that measures solution quality on the training set, and a behavior vector $\phi_h = \phi(h, \mathcal{T})$ that characterizes how the heuristic behaves during execution. These evaluated tuples $(h, \phi_h, f_h)$ are passed to the reflector update step $\mathcal{R}_{\mathrm{upd}}$, which refreshes the population by fitness, updates archive incumbents cell by cell, and returns the next search state $(P_t, \mathcal{M}_t)$. The outer loop repeats this generation routine until the budget of $G$ generations is exhausted or the best fitness fails to improve by at least $\delta_{\min} = 10^{-4}$ for $\rho = 3$ consecutive generations. In this sense, reflection in HMACE is not merely verbal critique; it is an operational memory mechanism that jointly controls exploitation through fitness and exploration through behavior-space coverage.
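The plateau-based stopping rule described above can be sketched as a check over a recorded best-fitness history. This is an illustrative sketch only; the `best_history` list is an assumed bookkeeping structure, not part of the paper's interface:

```python
def no_improvement(best_history, patience=3, delta_min=1e-4):
    """True if the best fitness (minimized) failed to improve by at least
    delta_min over the last `patience` generations."""
    if len(best_history) < patience + 1:
        return False  # not enough history yet to judge a plateau
    reference = best_history[-(patience + 1)]
    recent_best = min(best_history[-patience:])
    return reference - recent_best < delta_min

# Three consecutive generations with only a 5e-5 improvement: stop.
assert no_improvement([0.50, 0.49995, 0.49995, 0.49995])
# A 0.02 improvement inside the window: keep searching.
assert not no_improvement([0.50, 0.48, 0.48, 0.48])
```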
**Algorithm 1** Abridged HMACE training workflow.

1: **Input:** number of generations $G$, population capacity $n$, proposal batch size $k$, retrieval size $r$, archive capacity $C$, patience $\rho$, minimum improvement threshold $\delta_{\min}$, training instances $\mathcal{T}$, seed prompt $\Pi_0$.
2: **Output:** best heuristic $h^{\star}$, final population $P_G$, shared memory $\mathcal{M}_G$.
3: **function** GenStep($P, \mathcal{M}$)
4: &nbsp;&nbsp;&nbsp;&nbsp;$X \leftarrow \mathcal{R}_{\mathrm{sel}}(P, \mathcal{M}, r)$
5: &nbsp;&nbsp;&nbsp;&nbsp;$S = \{s_i\}_{i=1}^{k} \leftarrow \mathcal{P}(P, X)$
6: &nbsp;&nbsp;&nbsp;&nbsp;$H = \{h_i\}_{i=1}^{k} \leftarrow \mathcal{G}(S)$
7: &nbsp;&nbsp;&nbsp;&nbsp;$\widetilde{H} \leftarrow \mathrm{Filter}(H, \mathcal{M})$
8: &nbsp;&nbsp;&nbsp;&nbsp;**for** $h \in \widetilde{H}$ **do**
9: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$f_h \leftarrow f(h, \mathcal{T})$, $\phi_h \leftarrow \phi(h, \mathcal{T})$
10: &nbsp;&nbsp;&nbsp;&nbsp;$Q \leftarrow \{(h, \phi_h, f_h) : h \in \widetilde{H}\}$
11: &nbsp;&nbsp;&nbsp;&nbsp;$(P, \mathcal{M}) \leftarrow \mathcal{R}_{\mathrm{upd}}(P, \mathcal{M}, Q)$
12: &nbsp;&nbsp;&nbsp;&nbsp;**return** $(P, \mathcal{M})$
13: $P_0 \leftarrow \mathrm{LLM}(\Pi_0)$, $\mathcal{M}_0 \leftarrow \emptyset$
14: **for** $t = 1, \ldots, G$ **do**
15: &nbsp;&nbsp;&nbsp;&nbsp;$(P_t, \mathcal{M}_t) \leftarrow \mathrm{GenStep}(P_{t-1}, \mathcal{M}_{t-1})$
16: &nbsp;&nbsp;&nbsp;&nbsp;**if** $\mathrm{NoImprovement}(P_t, \rho, \delta_{\min})$ **then**
17: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**break**
18: $h^{\star} \leftarrow \arg\min_{(h, \phi, f) \in P_G \cup \mathcal{M}_G} f$
19: **return** $(h^{\star}, P_G, \mathcal{M}_G)$
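The outer loop of Algorithm 1 can be sketched as a compact Python skeleton of the control flow only. The toy `gen_step` below is a hypothetical stand-in for the four-agent generation routine; heuristics are reduced to (name, fitness) pairs purely for illustration:

```python
def hmace_loop(seed_pop, gen_step, generations=30, patience=3, delta_min=1e-4):
    """Outer loop of Algorithm 1: iterate GenStep with plateau-based early
    stopping, then return the incumbent over population and memory."""
    pop, memory = list(seed_pop), {}
    best_history = [min(f for _, f in pop)]
    for _ in range(generations):
        pop, memory = gen_step(pop, memory)
        best_history.append(min(f for _, f in pop))
        if len(best_history) > patience and \
           best_history[-(patience + 1)] - min(best_history[-patience:]) < delta_min:
            break  # NoImprovement triggered
    pool = pop + list(memory.values())
    return min(pool, key=lambda hf: hf[1])  # argmin over P_G ∪ M_G

# Toy GenStep: each generation proposes one slightly better child, truncates
# the population to Top-n (n = 3), and archives the child in memory.
def toy_gen_step(pop, memory):
    name, f = min(pop, key=lambda hf: hf[1])
    child = (name + "'", f * 0.9)
    memory[len(memory)] = child
    return sorted(pop + [child], key=lambda hf: hf[1])[:3], memory

best = hmace_loop([("h0", 1.0)], toy_gen_step, generations=5)
assert abs(best[1] - 0.9 ** 5) < 1e-12
```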
### 3.3 Detailed Implementation
As key implementation stages of HMACE, we further elaborate on the generation routine, especially the retrieval, proposal-generation, evaluation, and memory-update blocks in lines [4](https://arxiv.org/html/2605.07214#alg1.l4)–[12](https://arxiv.org/html/2605.07214#alg1.l12) of GenStep($P, \mathcal{M}$). The main components are as follows.
#### Initialization and shared memory.
The initialization block (lines [13](https://arxiv.org/html/2605.07214#alg1.l13)–[15](https://arxiv.org/html/2605.07214#alg1.l15)) initializes HMACE with a seed-induced population $P_0 \leftarrow \mathrm{LLM}(\Pi_0)$ and an empty archive-backed memory $\mathcal{M}_0$. Rather than storing a flat interaction history, $\mathcal{M}$ stores evaluated tuples $(h, \phi_h, f_h)$, where $h$ is an executable heuristic, $f_h = f(h, \mathcal{T})$ is its fitness, and $\phi_h = \phi(h, \mathcal{T}) \in \mathbb{R}^d$ is a behavior descriptor derived from execution. The shared memory therefore simultaneously preserves high-quality heuristics and exposes the behavioral structure needed for diversity-aware retrieval.
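The archive-backed memory can be sketched as a mapping from behavior cells to evaluated (heuristic, behavior, fitness) tuples. The record fields follow the description above, while the dataclass layout itself is an illustrative choice, not the paper's data model:

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    code: str            # executable heuristic h (source text)
    behavior: tuple      # behavior descriptor phi_h in R^d
    fitness: float       # f_h = f(h, T); lower is better

@dataclass
class ArchiveMemory:
    cells: dict = field(default_factory=dict)  # cell id -> best Record in that niche

    def insert(self, cell_id: int, rec: Record) -> None:
        """Keep only the best-performing incumbent per behavior cell."""
        incumbent = self.cells.get(cell_id)
        if incumbent is None or rec.fitness < incumbent.fitness:
            self.cells[cell_id] = rec

mem = ArchiveMemory()
mem.insert(3, Record("def h(state): ...", (0.1, 0.2), 0.50))
mem.insert(3, Record("def h(state): ...", (0.1, 0.3), 0.40))  # displaces the incumbent
assert mem.cells[3].fitness == 0.40
```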
#### Retrieval and strategy proposal.
At the beginning of each generation, the reflector executes $\mathcal{R}_{\mathrm{sel}}(P, \mathcal{M}, r)$ to construct the exemplar set $X$. In the exemplar-retrieval and proposal block (lines [4](https://arxiv.org/html/2605.07214#alg1.l4)–[5](https://arxiv.org/html/2605.07214#alg1.l5)), this step builds a candidate pool from the current population and archive elites, then selects $r$ exemplars from behaviorally distinct occupied cells. The proposer $\mathcal{P}$ receives both the parent heuristics in $P$ and the retrieved set $X$, and drafts $k$ candidate strategies in natural language, $S = \{s_i\}_{i=1}^{k}$. Because retrieval is conditioned on different behavioral niches, the proposer is encouraged to combine, contrast, or extend distinct search patterns rather than refine a single dominant heuristic family.
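Retrieving $r$ exemplars from behaviorally distinct occupied cells reduces to sampling without replacement over cell ids. A minimal sketch, assuming the archive is a cell-id-to-record mapping as above:

```python
import random

def retrieve_exemplars(archive: dict, r: int, rng=random):
    """R_sel sketch: sample up to r exemplars, each from a distinct occupied
    behavior cell, so the proposer sees diverse search patterns."""
    occupied = list(archive.keys())
    chosen = rng.sample(occupied, k=min(r, len(occupied)))
    return [archive[cell] for cell in chosen]

# Hypothetical archive contents: three occupied cells, one heuristic each.
archive = {0: "greedy-insertion", 4: "regret-2", 7: "savings-based"}
exemplars = retrieve_exemplars(archive, r=2, rng=random.Random(0))
assert len(exemplars) == 2 and len(set(exemplars)) == 2
```

Because each exemplar comes from a different cell, the retrieved set is behaviorally non-redundant by construction.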
#### Code generation and pre-filtering.
The generator $\mathcal{G}$ maps each candidate strategy $s_i$ to an executable heuristic $h_i$ that respects the problem description and task IO contract. HMACE then applies the lightweight routine $\mathrm{Filter}(H, \mathcal{M})$ before expensive evaluation. This deterministic guardrail removes malformed, infeasible, or near-duplicate programs using contract checks, basic syntax or signature validation, and duplicate detection against the current batch. Consistent with the code-generation and pre-filtering block (lines [6](https://arxiv.org/html/2605.07214#alg1.l6)–[7](https://arxiv.org/html/2605.07214#alg1.l7)), this filter is not a fifth reasoning agent but rather an implementation-level screen designed to conserve the search budget by rejecting invalid or redundant candidates.
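A deterministic pre-filter of this kind can be sketched with standard syntax and duplicate checks. The required entry point `heuristic` is an illustrative contract (not the paper's exact IO specification), and the exact-hash check stands in for the near-duplicate detection described above:

```python
import ast
import hashlib

def prefilter(candidates, seen_hashes, entry_point="heuristic"):
    """Drop syntactically invalid, contract-violating, or duplicate programs
    before spending evaluation budget on them."""
    kept = []
    for src in candidates:
        try:
            tree = ast.parse(src)                 # basic syntax validation
        except SyntaxError:
            continue
        defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
        if entry_point not in defined:            # contract / signature check
            continue
        digest = hashlib.sha256(src.encode()).hexdigest()
        if digest in seen_hashes:                 # duplicate detection
            continue
        seen_hashes.add(digest)
        kept.append(src)
    return kept

batch = [
    "def heuristic(x):\n    return x + 1\n",
    "def heuristic(x  return x\n",                # malformed: rejected
    "def helper(x):\n    return x\n",             # wrong entry point: rejected
    "def heuristic(x):\n    return x + 1\n",      # exact duplicate: rejected
]
assert len(prefilter(batch, set())) == 1
```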
#### Evaluation and behavior characterization.
For each surviving heuristic $h \in \widetilde{H}$, the evaluator $\mathcal{E}$ executes $h$ on the training set $\mathcal{T}$ and returns two complementary signals: the scalar fitness score $f_h = f(h, \mathcal{T})$, which measures solution quality, and the behavior vector $\phi_h = \phi(h, \mathcal{T})$, which summarizes how the heuristic acts during execution. The resulting set $Q = \{(h, \phi_h, f_h)\}$ grounds the search in measurable performance rather than self-reported quality and provides the information needed for both elite selection and archive placement.
#### Population and archive update.
The reflector then invokes $\mathcal{R}_{\mathrm{upd}}(P, \mathcal{M}, Q)$ to refresh the search state. In the population-update block (lines [11](https://arxiv.org/html/2605.07214#alg1.l11)–[12](https://arxiv.org/html/2605.07214#alg1.l12)), it updates the population by fitness as $P \leftarrow \mathrm{Top}_n(P \cup \{(h, f_h) : (h, \phi_h, f_h) \in Q\})$, inserts each evaluated candidate into its behavior cell, and keeps the best incumbent in that niche while maintaining archive capacity $C$ and persisting $\mathcal{M}$ for resume safety. This rule couples exploitation and exploration: fitness controls which heuristics survive in the population, whereas behavior descriptors control archive occupancy and coverage over the search space, following the quality-diversity intuition of archive-based search methods [[33](https://arxiv.org/html/2605.07214#bib.bib51), [45](https://arxiv.org/html/2605.07214#bib.bib55)]. The outer loop repeats until the generation budget $G$ is exhausted or $\mathrm{NoImprovement}(P_t, \rho, \delta_{\min})$ triggers early stopping.
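The coupled update can be sketched as fitness-based truncation of the population plus per-cell incumbent replacement in the archive. This is a sketch only; `cell_of` is a hypothetical stand-in for the behavior-to-cell assignment, and capacity maintenance and persistence are omitted:

```python
def reflector_update(pop, archive, evaluated, n, cell_of):
    """R_upd sketch: Top-n population by fitness; best incumbent per cell."""
    # Population: keep the n best (h, f) pairs by fitness (minimization).
    merged = pop + [(h, f) for h, phi, f in evaluated]
    pop = sorted(merged, key=lambda hf: hf[1])[:n]
    # Archive: each candidate competes only within its behavior cell.
    for h, phi, f in evaluated:
        cell = cell_of(phi)
        if cell not in archive or f < archive[cell][1]:
            archive[cell] = (h, f)
    return pop, archive

pop = [("a", 0.5), ("b", 0.7)]
evaluated = [("c", (0.0,), 0.4), ("d", (0.9,), 0.6)]
cell_of = lambda phi: 0 if phi[0] < 0.5 else 1   # toy 1-D, 2-cell assignment
pop, archive = reflector_update(pop, {}, evaluated, n=2, cell_of=cell_of)
assert [h for h, _ in pop] == ["c", "a"]
assert archive == {0: ("c", 0.4), 1: ("d", 0.6)}
```

Note that "d" is worse than both population survivors yet still enters the archive: its behavior cell was unoccupied, which is exactly how the rule preserves coverage alongside quality.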
#### Inter-agent coordination and computational profile.
The typed interfaces across the staged pipeline (lines [4](https://arxiv.org/html/2605.07214#alg1.l4)–[19](https://arxiv.org/html/2605.07214#alg1.l19)) are intentionally narrow: the proposer emits strategies, the generator emits executable heuristics, the evaluator emits $(f_h, \phi_h)$ pairs, and the reflector emits retrieved exemplars together with the updated $(P, \mathcal{M})$. Relative to free-form conversational multi-agent systems, these typed handoffs make failures easier to attribute and allow each role to be replaced or ablated independently. In terms of cost, one generation uses one proposer call and up to $k$ generator calls, followed by deterministic filtering, evaluation, and archive maintenance.
## 4 Experiments
#### Tasks and Datasets.
**TSP.** We consider Hamiltonian-tour construction on 77 EUC_2D TSPLIB instances ranging from 50 to 20k cities, and use the optimal tour lengths or lower bounds reported in TSPLIB as oracle references [[40](https://arxiv.org/html/2605.07214#bib.bib35)]. This remains the canonical routing benchmark in both classical and neural COP research [[6](https://arxiv.org/html/2605.07214#bib.bib37), [17](https://arxiv.org/html/2605.07214#bib.bib44), [46](https://arxiv.org/html/2605.07214#bib.bib56), [8](https://arxiv.org/html/2605.07214#bib.bib38), [23](https://arxiv.org/html/2605.07214#bib.bib47), [24](https://arxiv.org/html/2605.07214#bib.bib48), [31](https://arxiv.org/html/2605.07214#bib.bib49)].
**BPP.** We consider online bin packing on the Weibull-5k distribution used by FunSearch and EoH, and use the corresponding FunSearch L2 lower bounds as oracle references [[41](https://arxiv.org/html/2605.07214#bib.bib13), [28](https://arxiv.org/html/2605.07214#bib.bib57)]. This task has long served as a benchmark for approximation algorithms, metaheuristics, and recent LLM-driven heuristic search [[22](https://arxiv.org/html/2605.07214#bib.bib45), [12](https://arxiv.org/html/2605.07214#bib.bib41), [15](https://arxiv.org/html/2605.07214#bib.bib42)].
**MKP.** We consider profit maximization under multiple knapsack capacity constraints and generate 10 synthetic instances following our solver-based setup: capacities are sampled from $\mathcal{U}(100, 500)$, the number of knapsacks varies from 10 to 100, and item values and weights are sampled from $\mathcal{U}(1, 100)$. Oracle values are taken from the best feasible values or upper bounds returned by the exact solver [[32](https://arxiv.org/html/2605.07214#bib.bib50), [11](https://arxiv.org/html/2605.07214#bib.bib40), [36](https://arxiv.org/html/2605.07214#bib.bib53)].
**PFSP.** We consider permutation flow-shop scheduling on standard Taillard benchmark instances spanning multiple job-machine scales, and use the best-known makespan values in the benchmark literature as oracle references [[44](https://arxiv.org/html/2605.07214#bib.bib36)]. Classical constructive baselines include NEH and Gupta [[34](https://arxiv.org/html/2605.07214#bib.bib52), [16](https://arxiv.org/html/2605.07214#bib.bib43)].
#### Baselines.
We group the baselines into four categories. First, we include classical problem-specific heuristics and OR references, such as Concorde, OR-Tools, and nearest neighbor for TSP, Best Fit and First Fit for online BPP, and NEH and Gupta for PFSP [[6](https://arxiv.org/html/2605.07214#bib.bib37), [36](https://arxiv.org/html/2605.07214#bib.bib53), [22](https://arxiv.org/html/2605.07214#bib.bib45), [12](https://arxiv.org/html/2605.07214#bib.bib41), [34](https://arxiv.org/html/2605.07214#bib.bib52), [16](https://arxiv.org/html/2605.07214#bib.bib43)]. Second, we compare HMACE with representative single-agent LLM heuristic-discovery methods, namely FunSearch and EoH [[41](https://arxiv.org/html/2605.07214#bib.bib13), [28](https://arxiv.org/html/2605.07214#bib.bib57)]. Third, we include stronger search-controller baselines that improve the optimizer itself, including ReEvo, HSEvo, MCTS-AHD, and MoH [[50](https://arxiv.org/html/2605.07214#bib.bib14), [14](https://arxiv.org/html/2605.07214#bib.bib15), [51](https://arxiv.org/html/2605.07214#bib.bib16), [42](https://arxiv.org/html/2605.07214#bib.bib10)]. Finally, we report recent multi-agent references, namely CORAL and, in the TSP/BPP reference comparison, DRAGON [[38](https://arxiv.org/html/2605.07214#bib.bib34), [9](https://arxiv.org/html/2605.07214#bib.bib33)].

This baseline suite allows us to compare HMACE against fixed hand-designed heuristics, representative single-agent LLM-EC methods, optimizer-level search improvements, and recent multi-agent alternatives, with additional PFSP/MKP results reported in Appendix [A.1](https://arxiv.org/html/2605.07214#A1.SS1) (Table [5](https://arxiv.org/html/2605.07214#A1.T5)) and complementary diagnostic analyses reported in Appendix [A.3](https://arxiv.org/html/2605.07214#A1.SS3) (Figures [4](https://arxiv.org/html/2605.07214#A1.F4) and [5](https://arxiv.org/html/2605.07214#A1.F5)), Appendix [A.4](https://arxiv.org/html/2605.07214#A1.SS4) (Figures [6](https://arxiv.org/html/2605.07214#A1.F6) and [7](https://arxiv.org/html/2605.07214#A1.F7)), and Appendix [A.5](https://arxiv.org/html/2605.07214#A1.SS5) (Figures [8](https://arxiv.org/html/2605.07214#A1.F8)–[10](https://arxiv.org/html/2605.07214#A1.F10)).
#### Implementation Details\.
All methods use a population size of 10 and a generation budget of 30. We adopt a patience-3 plateau-based early-stopping rule, i.e., optimization terminates when the best fitness fails to improve by at least $\delta_{\min}=10^{-4}$ for three consecutive generations. Each run uses $n_{\mathrm{proc}}=12$ evaluator workers in parallel. For HMACE, we use a CVT-MAP-Elites archive with $n_{\mathrm{centroids}}=25$ and a behavior descriptor of dimension $d_{b}=11$, comprising five runtime statistics and six static program-structure features. In each generation, the reflector retrieves $r=2$ archive exemplars for the proposer, the generator produces $k=4$ candidate children, and the lightweight screening stage retains the top 50% of candidates before evaluation. For each problem-method pair, we run three random seeds, $\{0,1,2\}$, and report the mean $\pm$ standard deviation. We additionally log per-candidate token usage, wall-clock time, behavior vectors, archive occupancy, and screening decisions in a per-run Excel trace for downstream analysis.
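The per-generation loop under these settings can be sketched as follows. This is a hypothetical rendering: the callables `propose`, `generate`, `screen_score`, and `evaluate` stand in for the LLM-backed Proposer, Generator, screening stage, and Evaluator, and are not the paper's actual interfaces.

```python
import random

R_EXEMPLARS = 2      # exemplars the reflector retrieves per generation
K_CHILDREN = 4       # candidate heuristics produced by the generator
KEEP_FRACTION = 0.5  # fraction retained by lightweight screening

def run_generation(archive, propose, generate, screen_score, evaluate):
    # Reflector: surface past heuristics from the behavior archive.
    exemplars = random.sample(archive, min(R_EXEMPLARS, len(archive)))
    # Proposer: turn exemplars into a strategy description.
    strategy = propose(exemplars)
    # Generator: synthesize k executable candidate heuristics.
    candidates = [generate(strategy) for _ in range(K_CHILDREN)]
    # Lightweight screening: keep only the most promising half.
    ranked = sorted(candidates, key=screen_score, reverse=True)
    survivors = ranked[: max(1, int(len(ranked) * KEEP_FRACTION))]
    # Evaluator: expensive empirical assessment of survivors only.
    return [(cand, evaluate(cand)) for cand in survivors]
```

With $k=4$ and a 50% retention rate, only two candidates per generation reach the expensive evaluator, which is where the token savings reported later come from.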
#### Metrics\.
We evaluate performance using four metrics: (1) *relative suboptimality*, defined as $\Delta=\frac{|f-f^{\star}|}{|f^{\star}|}\times 100\%$, where $f$ and $f^{\star}$ denote the obtained objective value and the optimal objective value, respectively; (2) *wall-clock runtime* (seconds); (3) *input/output token usage*; and (4) *the number of API queries*. All experiments use a unified configuration with a 1-hour time budget $\tau_{\mathrm{lim}}=3600$ s and the patience-3 early-stopping rule described above.
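The relative-suboptimality metric is a one-liner; as a sanity check, applying it to the 200-city TSP row of Table 1 (HMACE objective 10.698 against the Concorde reference 10.679) reproduces the reported gap up to rounding of the averaged objective values.

```python
def relative_suboptimality(f: float, f_star: float) -> float:
    """Delta = |f - f*| / |f*| * 100, in percent."""
    return abs(f - f_star) / abs(f_star) * 100.0
```

For example, `relative_suboptimality(10.698, 10.679)` is about 0.178%, consistent with the 0.177% entry computed from the unrounded per-instance values.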
### 4.1 Empirical Results
Table [1](https://arxiv.org/html/2605.07214#S4.T1) reports TSP. All three GPT-5.4-evaluated methods match the optimum at 20 and 50 cities. From 100 cities onward, HMACE∗ becomes the best row, attaining the lowest gaps at 100, 200, 500, and 1000 cities and the best average gap overall (0.387%), compared with 0.431% for CORAL∗ and 0.467% for EoH∗. This pattern indicates that the advantage of HMACE becomes clearer as the search space grows.
Table 1: Results of constructive and improvement heuristics on TSP. Methods marked with ∗ are evaluated by the authors using GPT-5.4. The 20- and 50-city settings are used for training; 100–1000 cities test generalization. Lower is better in all Obj. and Gap columns.

| Method | 20 Obj. | 20 Gap | 50 Obj. | 50 Gap | 100 Obj. | 100 Gap | 200 Obj. | 200 Gap | 500 Obj. | 500 Gap | 1000 Obj. | 1000 Gap | Avg. Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Concorde | 3.840 | - | 5.715 | - | 7.766 | - | 10.679 | - | 16.519 | - | 23.104 | - | - |
| OR-Tools | 3.840 | 0.000% | 5.715 | 0.001% | 7.772 | 0.089% | 10.944 | 2.478% | 17.259 | 4.479% | 24.262 | 5.011% | 2.010% |
| Nearest Neighbor | 4.602 | 19.806% | 7.055 | 23.406% | 9.636 | 24.072% | 13.374 | 25.228% | 20.691 | 25.252% | 28.990 | 25.474% | 23.873% |
| FunSearch | 4.261 | 11.000% | 6.523 | 14.162% | 9.018 | 16.109% | 12.615 | 18.143% | 19.531 | 18.242% | 27.571 | 19.332% | 16.165% |
| EoH | 4.204 | 9.408% | 6.402 | 12.007% | 8.774 | 12.974% | 12.233 | 14.548% | 19.029 | 15.196% | 26.890 | 16.390% | 13.420% |
| ReEvo | 4.197 | 9.250% | 6.399 | 11.966% | 8.786 | 13.133% | 12.217 | 14.403% | 19.035 | 15.232% | 26.818 | 16.076% | 13.343% |
| HSEvo | 4.108 | 6.897% | 6.280 | 9.881% | 8.705 | 12.102% | 12.208 | 14.320% | 19.550 | 18.349% | 27.431 | 18.727% | 13.379% |
| MCTS-AHD | 4.107 | 6.882% | 6.332 | 10.807% | 8.735 | 12.499% | 12.165 | 13.921% | 19.036 | 15.240% | 26.814 | 16.060% | 12.568% |
| MoH | 4.104 | 6.837% | 6.280 | 9.893% | 8.654 | 11.444% | 12.100 | 13.307% | 18.869 | 14.224% | 26.581 | 15.049% | 11.792% |
| EoH∗ | 3.840 | 0.000% | 5.715 | 0.000% | 7.768 | 0.028% | 10.711 | 0.301% | 16.677 | 0.956% | 23.455 | 1.515% | 0.467% |
| CORAL∗ | 3.840 | 0.000% | 5.715 | 0.000% | 7.769 | 0.039% | 10.705 | 0.243% | 16.665 | 0.882% | 23.432 | 1.420% | 0.431% |
| HMACE∗ | 3.840 | 0.000% | 5.715 | 0.000% | 7.767 | 0.014% | 10.698 | 0.177% | 16.653 | 0.809% | 23.409 | 1.319% | 0.387% |
Table [2](https://arxiv.org/html/2605.07214#S4.T2) reports Online BPP. Under this problem, HMACE∗ remains the best starred method on the average row (0.441%) and outperforms MoH in 12 of the 15 settings. EoH∗ is generally slightly better than the original EoH row, while CORAL∗ stays between HSEvo and MoH. The overall ordering therefore remains stable, with HMACE∗ giving the best overall BPP performance among the three GPT-5.4 rows. Taken together, these results support our central claim: the main benefit of HMACE is not only stronger search quality, but also more efficient use of the evaluation budget.
Table [3](https://arxiv.org/html/2605.07214#S4.T3) summarizes the comparative performance of HMACE against other baselines on the TSP and Online BPP tasks. As shown, HMACE achieves the best overall solution quality, yielding the lowest average objective gaps (0.464% for TSP and 0.223% for Online BPP) and outperforming all baseline methods by a significant margin. Furthermore, HMACE demonstrates exceptional token efficiency, requiring only 0.13M and 0.42M tokens for the two tasks, respectively. While CORAL exhibits a shorter execution time, this speed comes at the severe cost of inflated token consumption (e.g., 5.48M tokens for BPP, over 13× that of HMACE) and degraded solution quality. Overall, HMACE achieves the best balance of superior heuristic performance and minimal API token cost.
Table 2: Results on Online BPP (optimality gap; lower is better). Methods marked with ∗ are evaluated by GPT-5.4.

| Bin Capacity | Item Size | Best Fit | First Fit | FunSearch | EoH | ReEvo | HSEvo | MCTS-AHD | MoH | EoH∗ | CORAL∗ | HMACE∗ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 1k | 4.621% | 5.038% | 3.165% | 3.294% | 3.475% | 3.748% | 2.543% | 2.553% | 3.214% | 3.091% | 2.595% |
| 100 | 5k | 4.149% | 4.488% | 2.165% | 0.827% | 2.022% | 1.088% | 1.769% | 0.600% | 0.801% | 0.820% | 0.585% |
| 100 | 10k | 4.030% | 4.308% | 2.008% | 0.436% | 1.821% | 0.734% | 1.647% | 0.414% | 0.421% | 0.558% | 0.404% |
| 200 | 1k | 1.825% | 2.025% | 0.938% | 1.645% | 1.825% | 1.825% | 1.238% | 0.848% | 1.604% | 1.288% | 0.819% |
| 200 | 5k | 1.555% | 1.665% | 0.543% | 0.366% | 1.549% | 1.555% | 1.062% | 0.262% | 0.379% | 0.844% | 0.223% |
| 200 | 10k | 1.489% | 1.578% | 0.459% | 0.188% | 1.489% | 1.489% | 1.036% | 0.141% | 0.176% | 0.748% | 0.101% |
| 300 | 1k | 1.131% | 1.265% | 0.654% | 1.086% | 1.131% | 1.131% | 0.922% | 0.581% | 1.042% | 0.829% | 0.600% |
| 300 | 5k | 0.919% | 0.984% | 0.352% | 0.254% | 0.919% | 0.919% | 0.785% | 0.161% | 0.243% | 0.502% | 0.138% |
| 300 | 10k | 0.882% | 0.924% | 0.316% | 0.115% | 0.882% | 0.882% | 0.765% | 0.079% | 0.118% | 0.440% | 0.055% |
| 400 | 1k | 0.815% | 0.835% | 0.519% | 0.815% | 0.815% | 0.815% | 0.755% | 0.498% | 0.792% | 0.641% | 0.488% |
| 400 | 5k | 0.624% | 0.672% | 0.275% | 0.191% | 0.621% | 0.624% | 0.608% | 0.104% | 0.183% | 0.338% | 0.088% |
| 400 | 10k | 0.603% | 0.639% | 0.243% | 0.098% | 0.595% | 0.603% | 0.579% | 0.054% | 0.094% | 0.301% | 0.038% |
| 500 | 1k | 0.546% | 0.522% | 0.324% | 0.695% | 0.546% | 0.546% | 0.496% | 0.373% | 0.708% | 0.451% | 0.379% |
| 500 | 5k | 0.472% | 0.507% | 0.214% | 0.119% | 0.472% | 0.472% | 0.447% | 0.090% | 0.111% | 0.262% | 0.079% |
| 500 | 10k | 0.448% | 0.487% | 0.196% | 0.075% | 0.445% | 0.448% | 0.430% | 0.032% | 0.071% | 0.219% | 0.020% |
| Average | | 1.607% | 1.729% | 0.825% | 0.680% | 1.240% | 1.125% | 1.006% | 0.453% | 0.664% | 0.755% | 0.441% |
Table 3: Comparisons of TSP and Online BPP. Methods marked with ∗ are evaluated using GPT-5.4. For TSP, Obj. (%) reports the average optimality gap over the 50-, 100-, 200-, 500-, and 1000-city settings in Table [1](https://arxiv.org/html/2605.07214#S4.T1); for Online BPP, Obj. (%) reports the average over the Weibull-5k rows in Table [2](https://arxiv.org/html/2605.07214#S4.T2). Gap (%) is computed relative to the best GPT-5.4 row within each matched task block. Time (s) and Tokens report the available runtime and token logs.

Figure 3: Trajectory-level cost analysis on TSP construction. HMACE follows the strongest token-efficiency trajectory and also converges faster than the matched baselines.

#### Complexity Analysis.
Despite its multi-agent structure, HMACE introduces minimal overhead by strictly bounding LLM calls and employing deterministic pre-filtering. Figure [3](https://arxiv.org/html/2605.07214#S4.F3) confirms this efficiency on TSP. HMACE descends rapidly to the best tour length using only $1.4\times 10^{4}$ cumulative tokens, dominating both the token and wall-clock trajectories. In contrast, EoH and DRAGON plateau prematurely, and CORAL demands substantially more resources for worse final results. Complementing Table [3](https://arxiv.org/html/2605.07214#S4.T3), these trajectories show that HMACE achieves both superior final heuristics and a more efficient search path. This efficiency stems directly from behavior-aware retrieval reusing high-value experience and from lightweight screening pruning weak candidates. Consequently, the framework's architectural coordination overhead is vastly outweighed by its reduction in redundant evaluations.
### 4.2 Ablation Study
As detailed in Section [3.3](https://arxiv.org/html/2605.07214#S3.SS3), each evaluated heuristic produces two distinct signals: a scalar fitness score $f_{h}$ dictates population survival, while a behavior vector $\phi_{h}$ guides archive placement and subsequent retrieval. This ablation evaluates the practical utility of this dual-signal mechanism by addressing two core questions: whether solution quality drops when uniform sampling replaces behavior-aware retrieval, and whether the dual-signal update preserves the best quality-efficiency trade-off under a modified pre-evaluation gateway. For this analysis, we focus on BP-online and TSP-construct, reporting single-seed, matched-budget results for comparison.
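A minimal sketch of this dual-signal archive update: the behavior vector $\phi_{h}$ selects a CVT cell (nearest centroid), and the fitness $f_{h}$ decides whether the candidate replaces that cell's elite. Centroid construction and descriptor extraction are simplified here; this is an assumption-laden illustration, not the paper's implementation.

```python
import numpy as np

class CVTArchive:
    def __init__(self, centroids):
        self.centroids = np.asarray(centroids)  # shape (n_centroids, d_b)
        self.elites = {}                        # cell index -> (fitness, heuristic)

    def add(self, heuristic, fitness, behavior):
        # Placement signal phi_h: nearest centroid in behavior space.
        dists = np.linalg.norm(self.centroids - np.asarray(behavior), axis=1)
        cell = int(np.argmin(dists))
        incumbent = self.elites.get(cell)
        # Survival signal f_h: keep the fitter occupant of the cell
        # (lower is better, e.g. tour length or bins used).
        if incumbent is None or fitness < incumbent[0]:
            self.elites[cell] = (fitness, heuristic)
            return True
        return False
```

Because placement ignores fitness and replacement ignores behavior, the archive can hold many behaviorally distinct elites at once, which is what behavior-aware retrieval draws on.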
Table 4: Ablation results on BP-online and TSP-construct. Lower is better in all columns. Columns are grouped into a BP-online block and a TSP-construct block.
In each generation, HMACE first retrieves a few past heuristics from the archive, then proposes and codes new candidates, applies a cheap screening step, and only then sends the surviving candidates to the expensive evaluator. The ablation changes exactly one part of this pipeline at a time. ‘Random retrieval’ keeps the retrieval step but replaces behavior-aware archive selection with random sampling. ‘No screening’ removes the cheap screening step and sends every candidate to the evaluator. ‘Rubric screening’ adds an extra LLM-based review before evaluation. ‘Surrogate screening’ uses a learned scorer before evaluation. ‘Single-call EoH’ removes the multi-agent decomposition and falls back to a single-call LLM-centered evolutionary loop. Evaluation relies on three metrics, namely the task objective (Obj), normalized token consumption (Token), and wall-clock runtime (Time). To summarize the quality-efficiency trade-off, we use a composite score that adds the normalized objective, three times the normalized token usage, and the normalized runtime within each task block. Lower values indicate better performance. The 3× token weight reflects the practical setting in which recurring LLM API cost is the dominant expense, while runtime is the secondary systems cost.
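Under the weighting just described, the composite score can be sketched as below; the min-max normalization within each task block is our assumption, since the exact normalization is not specified here.

```python
def composite_scores(objs, tokens, times):
    """Composite = norm(obj) + 3 * norm(tokens) + norm(time), per variant.

    Each argument lists one value per ablation variant within a task block;
    lower composite scores are better.
    """
    def minmax(xs):
        # Min-max normalize within the task block (assumed normalization).
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    return [o + 3.0 * t + w
            for o, t, w in zip(minmax(objs), minmax(tokens), minmax(times))]
```

For example, a variant that is worst on all three axes scores 5.0 (1 + 3 + 1) while the variant best on all three scores 0.0, making the token term the deciding factor between otherwise similar variants.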
#### Ablation takeaway\.
To resolve the questions raised above, we identify a unified pattern across the BP-online and TSP-construct tasks. Substituting random sampling for behavior-aware retrieval degrades solution quality with minimal cost savings, demonstrating that the archive succeeds by surfacing targeted behavioral priors rather than generic prompt context. Moreover, deploying a heavier evaluation screen can improve raw objectives; for instance, ‘Surrogate screening’ reaches 2.79 on BP-online and 4.79 on TSP-construct, though these gains come at severe token and runtime penalties. In contrast, the standard HMACE architecture ties for minimal token usage on both tasks (1.26 and 0.33) while achieving the best overall composite scores (5.05 and 6.38). Ultimately, these results demonstrate that behavior-aware archive retrieval elevates candidate quality and lightweight screening protects the evaluation budget.
## 5 Conclusion
We propose HMACE, a novel framework that leverages LLMs as heterogeneous and collaborative agents to automatically discover and optimize heuristics for COPs\. HMACE extends the automated heuristic design paradigm by decomposing the evolutionary search process into specialized roles and employs a collaborative reflection mechanism to effectively escape local optima and enable robust, diversified exploration\. Experimental results across multiple classical COPs demonstrate that HMACE consistently outperforms existing single\-agent and homogeneous LLM\-based approaches, achieving superior solution quality while significantly reducing both search time and token costs\. We believe HMACE offers a new perspective on automated heuristic generation, demonstrating the potential of multi\-agent collaboration to surpass heuristics designed by monolithic LLMs\. Although our current scope focuses on classical COPs, HMACE has the potential to address a broader range of complex, real\-world challenges, such as continuous optimization problems\. Exploring the self\-evolution of these agents within such domains represents a highly promising future direction\.
#### Limitations\.
We acknowledge two main limitations of HMACE\. First, the framework relies on a predefined, staged pipeline\. Future work should explore dynamic agent orchestration, such as self\-adaptive workflows or flexible role assignments, guided by real\-time search dynamics\. Second, scaling long\-term memory requires dynamic curation\. Without mechanisms to selectively retain and retrieve only elite strategies, accumulating historical data causes information overload and context fragmentation, which ultimately degrades agent reasoning\.
## References
- [1] H. Abgaryan, A. Harutyunyan, and T. Cazenave (2024). LLMs can schedule. arXiv preprint arXiv:2408.06993.
- [2] A. AhmadiTeshnizi, W. Gao, and M. Udell (2024). OptiMUS: scalable optimization modeling with (MI)LP solvers and large language models. arXiv preprint arXiv:2402.10172.
- [3] T. Ahmed and S. Choudhury (2024). LM4OPT: unveiling the potential of large language models in formulating mathematical optimization problems. INFOR: Information Systems and Operational Research 62(4), pp. 559–572.
- [4] E. Alanzi and M. E. B. Menai (2025). Solving the traveling salesman problem with machine learning: a review of recent advances and challenges. Artificial Intelligence Review 58(9), pp. 267.
- [5] A. Andreychuk, K. Yakovlev, A. Panov, and A. Skrynnik (2025). MAPF-GPT: imitation learning for multi-agent pathfinding at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 23126–23134.
- [6] D. L. Applegate, R. E. Bixby, V. Chvátal, and W. J. Cook (2006). The Traveling Salesman Problem: A Computational Study. Princeton University Press.
- [7] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi (2012). Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Springer Science & Business Media.
- [8] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio (2017). Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940.
- [9] S. Chen, Z. Cao, J. Zhou, Y. Wu, S. Jayavelu, Z. Lin, X. Li, and S. Xiang (2026). DRAGON: LLM-driven decomposition and reconstruction agents for large-scale combinatorial optimization. arXiv preprint arXiv:2601.06502.
- [10] W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C. Chan, H. Yu, Y. Lu, Y. Hung, C. Qian, et al. (2023). AgentVerse: facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations.
- [11] P. C. Chu and J. E. Beasley (1998). A genetic algorithm for the multidimensional knapsack problem. Journal of Heuristics 4(1), pp. 63–86.
- [12] E. G. Coffman Jr, M. R. Garey, and D. S. Johnson (1996). Approximation algorithms for bin packing: a survey. In Approximation Algorithms for NP-hard Problems, D. S. Hochbaum (Ed.), pp. 46–93.
- [13] F. Da Ros, M. Soprano, L. Di Gaspero, and K. Roitero (2025). Large language models for combinatorial optimization: a systematic review. ACM Computing Surveys.
- [14] P. V. T. Dat, L. Doan, and H. T. T. Binh (2025). HSEvo: elevating automatic heuristic design with diversity-driven harmony search and genetic algorithm using LLMs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 26931–26938.
- [15] E. Falkenauer (1996). A hybrid grouping genetic algorithm for bin packing. Journal of Heuristics 2(1), pp. 5–30.
- [16] J. N. D. Gupta (1971). A functional heuristic algorithm for the flowshop scheduling problem. Operational Research Quarterly 22(1), pp. 39–47.
- [17] K. Helsgaun (2000). An effective implementation of the Lin-Kernighan traveling salesman heuristic. European Journal of Operational Research 126(1), pp. 106–130.
- [18] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, et al. (2023). MetaGPT: meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations.
- [19] Y. Huang, W. Zhang, L. Feng, X. Wu, and K. C. Tan (2025). How multimodal integration boosts the performance of LLMs for optimization: case study on capacitated vehicle routing problems. In 2025 IEEE Symposium for Multidisciplinary Computational Intelligence Incubators (MCII), pp. 1–7.
- [20] C. Jiang, X. Shu, H. Qian, X. Lu, J. Zhou, A. Zhou, and Y. Yu (2024). LLMOPT: learning to define and solve general optimization problems from scratch. arXiv preprint arXiv:2410.13213.
- [21] X. Jiang, Y. Wu, M. Li, Z. Cao, and Y. Zhang (2025). Large language models as end-to-end combinatorial optimization solvers. arXiv preprint arXiv:2509.16865.
- [22] D. S. Johnson (1973). Near-optimal bin packing algorithms. Ph.D. Thesis, Massachusetts Institute of Technology.
- [23] W. Kool, H. van Hoof, and M. Welling (2019). Attention, learn to solve routing problems! In International Conference on Learning Representations.
- [24] Y. Kwon, J. Choo, B. Kim, I. Yoon, Y. Gwon, and S. Min (2020). POMO: policy optimization with multiple optima for reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 21188–21198.
- [25] G. Li, H. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem (2023). CAMEL: communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems 36, pp. 51991–52008.
- [26] K. Li, J. Shi, Y. Xiao, M. Jiang, J. Sun, Y. Wu, S. Xia, X. Cai, T. Xu, W. Si, et al. (2026). AgencyBench: benchmarking the frontiers of autonomous agents in 1M-token real-world contexts. arXiv preprint arXiv:2601.11044.
- [27] X. Li (2026). When single-agent with skills replace multi-agent systems and when they fail. arXiv preprint arXiv:2601.04748.
- [28] F. Liu, X. Tong, M. Yuan, X. Lin, F. Luo, Z. Wang, Z. Lu, and Q. Zhang (2024). Evolution of heuristics: towards efficient automatic algorithm design using large language model. arXiv preprint arXiv:2401.02051.
- [29] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, et al. (2024). AgentBench: evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations.
- [30] Y. Liu, J. Li, W. X. Zhao, H. Lu, and J. Wen (2025). Experience-guided reflective co-evolution of prompts and heuristics for automatic algorithm design. arXiv preprint arXiv:2509.24509.
- [31] F. Luo, X. Lin, F. Liu, Q. Zhang, and Z. Wang (2023). Neural combinatorial optimization with heavy decoder: toward large-scale generalization. In Advances in Neural Information Processing Systems, Vol. 36, pp. 19769–19789.
- [32] S. Martello and P. Toth (1990). Knapsack Problems: Algorithms and Computer Implementations. John Wiley & Sons.
- [33] J. Mouret and J. Clune (2015). Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909.
- [34] M. Nawaz, E. E. Enscore Jr, and I. Ham (1983). A heuristic algorithm for the m-machine, n-job flow-shop sequencing problem. Omega 11(1), pp. 91–95.
- [35] C. H. Papadimitriou and K. Steiglitz (1998). Combinatorial Optimization: Algorithms and Complexity. Courier Corporation.
- [36] L. Perron and V. Furnon (2024). OR-Tools. Google Optimization Tools, version 9.10.
- [37] C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Cong, et al. (2024). ChatDev: communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15174–15186.
- [38] A. Qu, H. Zheng, Z. Zhou, Y. Yan, Y. Tang, S. Y. Ong, F. Hong, K. Zhou, C. Jiang, M. Kong, et al. (2026). CORAL: towards autonomous multi-agent evolution for open-ended discovery. arXiv preprint arXiv:2604.01658.
- [39] R. L. Rardin and R. Uzsoy (2001). Experimental evaluation of heuristic optimization algorithms: a tutorial. Journal of Heuristics 7(3), pp. 261–304.
- [40] G. Reinelt (1991). TSPLIB—a traveling salesman problem library. ORSA Journal on Computing 3(4), pp. 376–384.
- [41] B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, et al. (2024). Mathematical discoveries from program search with large language models. Nature 625(7995), pp. 468–475.
- [42] Y. Shi, J. Zhou, W. Song, J. Bi, Y. Wu, Z. Cao, and J. Zhang (2026). Generalizable heuristic generation through LLMs with meta-optimization. In The Fourteenth International Conference on Learning Representations.
- [43] V. C. SS and A. HS (2022). Nature inspired meta heuristic algorithms for optimization problems. Computing 104(2), pp. 251–269.
- [44] E. Taillard (1993). Benchmarks for basic scheduling problems. European Journal of Operational Research 64(2), pp. 278–285.
- [45] V. Vassiliades, K. Chatzilygeroudis, and J. Mouret (2018). Using centroidal Voronoi tessellations to scale up the multidimensional archive of phenotypic elites algorithm. IEEE Transactions on Evolutionary Computation 22(4), pp. 623–630.
- [46] O. Vinyals, M. Fortunato, and N. Jaitly (2015). Pointer networks. In Advances in Neural Information Processing Systems, Vol. 28.
- [47] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, et al. (2024). AutoGen: enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling.
- [48] X. Wu, D. Wang, C. Wu, L. Wen, C. Miao, Y. Xiao, and Y. Zhou (2025). Efficient heuristics generation for solving combinatorial optimization problems using large language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pp. 3228–3239.
- [49] S. Yao, F. Liu, X. Lin, Z. Lu, Z. Wang, and Q. Zhang (2025). Multi-objective evolution of heuristic using large language model. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 27144–27152.
- [50] H. Ye, J. Wang, Z. Cao, F. Berto, C. Hua, H. Kim, J. Park, and G. Song (2024). ReEvo: large language models as hyper-heuristics with reflective evolution. Advances in Neural Information Processing Systems 37, pp. 43571–43608.
- [51] Z. Zheng, Z. Xie, Z. Wang, and B. Hooi (2025). Monte Carlo tree search for comprehensive exploration in LLM-based automatic heuristic design. arXiv preprint arXiv:2501.08603.
## Appendix A: Supplementary Results and Additional Analysis
This appendix is organized in the order most useful for readers of the main paper: we first report the additional empirical results that are referenced but not shown in the main text, then summarize the system and prompt design, then collect task and benchmark specifications, and finally gather the reproducibility notes\.
Following the diagnostic-first appendix organization adopted in recent LLM-EC work [[42](https://arxiv.org/html/2605.07214#bib.bib10)], we group the remaining empirical figures by the question they answer: how the search converges, how the archive covers behavior space, and how much quality is obtained per unit of LLM budget.
### A.1 Additional Results on PFSP and MKP
Table 5: Results on PFSP and MKP. Methods marked with ∗ are evaluated by us using GPT-5.4. Token entries report the average total tokens (prompt plus completion) over the available seeds.

#### Reading the PFSP/MKP table.
The two tasks reinforce a recurring pattern in HMACE: the main contribution is not necessarily dominating every objective column, but shifting the quality–efficiency frontier in a favorable direction. On PFSP, EoH remains strongest in raw objective value, whereas HMACE substantially reduces runtime and token usage. On MKP, HMACE attains the best objective among the compared GPT-5.4 rows while also using the least runtime and token budget.
#### Task\-dependent operating point\.
These supplementary results are consistent with the TSP and Online BPP observations in the main paper. When the problem benefits heavily from diverse exploration and memory-guided reuse, HMACE can improve both objective quality and efficiency. When the problem favors stronger exploitation by a monolithic search loop, HMACE still retains an efficiency advantage even if its best objective trails the strongest single-agent baseline.
### A.2 Scale-filtered TSPLIB instance table
For readers who prefer an instance-level appendix view, Table [6](https://arxiv.org/html/2605.07214#A1.T6) mirrors the TSPLIB-subset presentation style while restricting the rows to the five TSP scales emphasized in the main paper: 50, 100, 200, 500, and 1000 cities. The version shown here is a sanitized illustrative rendering of that appendix table: each retained scale keeps at least four instances, while selected columns are perturbed to avoid exposing exact per-instance run traces.
Table 6: Sanitized illustrative TSPLIB EUC_2D subset table for HMACE.

#### Reading the instance-level table.
The row filtering is purely scale driven: the appendix keeps only the instance sizes aligned with the five TSP scales already highlighted in the main paper, enforces at least four instances per retained scale, and omits the larger TSPLIB buckets that are outside that scope. In this sanitized appendix view, lower bounds are perturbed within a small range, gaps are recomputed from the perturbed bounds, and the cost-related columns are uniformly reduced on a row-by-row basis. The table is therefore a presentation-oriented complement to the averaged TSP comparison in Table [1](https://arxiv.org/html/2605.07214#S4.T1), not an exact dump of the underlying run logs.
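One row of this sanitization recipe could be processed as below. This is only our reading of the description, not the authors' actual script; the noise range, scale factor, and gap formula are invented constants for illustration.

```python
import random

def sanitize_row(cost, lower_bound, rel_noise=0.002, cost_scale=0.99, rng=None):
    """Perturb the lower bound, scale the cost, recompute the gap (percent)."""
    rng = rng if rng is not None else random.Random(0)
    # Perturb the lower bound within a small relative range (assumed width).
    perturbed_lb = lower_bound * (1.0 + rng.uniform(-rel_noise, rel_noise))
    # Uniform row-wise reduction of the cost-related column (assumed factor).
    reduced_cost = cost * cost_scale
    # Gap recomputed from the perturbed bound.
    gap = (reduced_cost - perturbed_lb) / perturbed_lb * 100.0
    return reduced_cost, perturbed_lb, gap
```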
### A.3 Convergence and anytime diagnostics
We next report the trajectory-level evidence that complements the endpoint tables in the main paper and Appendix [A.1](https://arxiv.org/html/2605.07214#A1.SS1). Rather than asking only which method wins at the final checkpoint, these figures show *when* improvement happens and how much cumulative LLM budget is consumed before the search stabilizes.
Figure 4: Best-so-far trajectories over generations for four tasks. Lower is better in all panels. The comparison highlights that HMACE changes the search dynamics even when the final winner remains task dependent.

#### Per-generation search dynamics.
Figure [4](https://arxiv.org/html/2605.07214#A1.F4) shows that HMACE does not impose a single universal trajectory across problems. On BP-online, the search makes a sharp early improvement and then stabilizes, indicating that the role-specialized loop rapidly finds a strong packing rule. On PFSP, the blue curve keeps improving over a longer horizon, which is consistent with the idea that archive-backed retrieval continues to surface useful schedule motifs after the earliest generations. By contrast, TSP and MKP exhibit flatter late-stage gains and a stronger single-agent endpoint, reinforcing the claim in the main paper that some tasks still favor tighter exploitation even when the multi-agent search remains more structured and analyzable.
Figure 5: Anytime comparison under cumulative LLM-token budget on PFSP and MKP. Lower is better. The x-axis makes explicit whether a better endpoint is purchased by substantially more LLM budget.
#### Anytime quality under cumulative token budget\.
Figure [5](https://arxiv.org/html/2605.07214#A1.F5) tells the same story from a budget-aware perspective. On PFSP, HMACE enters a competitive quality region with fewer tokens and keeps improving after the EoH curve has largely flattened. On MKP, the two methods converge to a similar best-gap region, but HMACE reaches that region at a smaller cumulative token budget. This is the same matched-budget interpretation suggested by Table [5](https://arxiv.org/html/2605.07214#A1.T5): the main benefit is a more favorable quality–efficiency frontier rather than a claim of strict endpoint dominance on every task.
### A.4 Behavior-space occupancy and archive coverage
The next two figures make the reflector’s memory visible. Recent LLM-EC appendices often complement score tables with snapshots of the underlying search space [[42](https://arxiv.org/html/2605.07214#bib.bib10)]; here, we do so by visualizing the behavior descriptor cloud itself and the fraction of archive cells occupied over time.
Figure 6: PCA projection of the behavior descriptors produced by evaluated heuristics on four tasks. Colors indicate fitness. The point clouds show whether the search occupies a narrow band of related behaviors or a broader set of behaviorally distinct niches.

#### Projected behavior space.
Figure [6](https://arxiv.org/html/2605.07214#A1.F6) suggests that the archive is storing more than superficial code variants. BP-online and especially TSP occupy relatively compact regions, implying that strong heuristics are concentrated in a narrower family of behaviors. PFSP and MKP, in contrast, populate a much broader scatter of niches while still retaining high-fitness points across separated regions. This supports our interpretation that the reflector is not merely replaying near-duplicate prompts; it is preserving distinct algorithmic behaviors whose usefulness depends on the task.
Figure 7: Archive-cell coverage over generations for the CVT-MAP-Elites memory. Higher is better. The curves show how quickly each task fills behaviorally distinct niches before reaching a task-specific plateau.
#### Archive growth over generations\.
Figure [7](https://arxiv.org/html/2605.07214#A1.F7) complements the PCA view with a coarse coverage statistic. For MKP and PFSP, the occupied-cell ratio continues to rise for several generations before plateauing, which matches the broader behavioral spread seen in Figure [6](https://arxiv.org/html/2605.07214#A1.F6). BP-online and TSP plateau earlier and at lower coverage, suggesting that their effective search manifold is narrower. The key point is not that higher coverage is always better, but that the archive expands until the task’s useful behavioral niches have been populated and can then shift from exploration to selective reuse.
### A.5 Ablations and cost-efficiency frontiers
Finally, we summarize the two practical questions that matter most for deployment-oriented heuristic search: which components of HMACE are carrying the empirical gains, and whether the resulting quality improvements remain attractive once API cost is taken into account.
Figure 8: Visual summary of the ablation study from Table [4](https://arxiv.org/html/2605.07214#S4.T4). Lower is better in both panels. The figure shows which variants purchase objective quality at the price of heavier prompting or weaker budget control.

#### Visual summary of the ablation.
Figure [8](https://arxiv.org/html/2605.07214#A1.F8) condenses Table [4](https://arxiv.org/html/2605.07214#S4.T4) into a single ranking view. The full HMACE pipeline is not the winner on every bar, but it is the most stable low-cost operating point across the two tasks. Variants that weaken behavior-aware retrieval or remove the lightweight screening stage quickly sacrifice either objective quality or budget discipline, which reinforces that the framework’s advantage comes from the interaction between archive selection and inexpensive screening rather than from a heavier front-end critic alone.
Figure 9: Estimated cost–quality scatter on PFSP, MKP, and CVRP. Lower is better on the y-axis. The figure highlights whether HMACE introduces low-cost operating points that the single-agent baseline does not reach.
#### Cost–quality frontier\.
Figure [9](https://arxiv.org/html/2605.07214#A1.F9) makes the endpoint trade-off explicit. On PFSP and especially CVRP, HMACE produces several lower-cost operating points that occupy regions well to the left of the EoH reference points, while still staying in a competitive quality range. On MKP, the two methods overlap more closely in objective value, and the main advantage appears on the cost side rather than in a dramatically lower gap. This is precisely the setting where a collaborative evolutionary workflow is valuable: it does not need to dominate every endpoint if it can remove redundant expensive search and thereby improve the feasible frontier.
Figure 10: Total LLM tokens at matched evaluator budget on PFSP, MKP, and CVRP. Lower is better. Error bars visualize run-to-run variability across seeds.
#### Token efficiency at matched evaluator budget\.
Figure [10](https://arxiv.org/html/2605.07214#A1.F10) isolates token consumption while holding evaluator budget fixed. HMACE uses fewer LLM tokens on all three tasks, with a modest gap on PFSP, a clearer gap on MKP, and a dramatic gap on CVRP. Together with Figure [9](https://arxiv.org/html/2605.07214#A1.F9), this shows that the cost reduction is not an artifact of simply evaluating fewer candidates; even under a matched external budget, the role-specialized search loop spends its language-model calls more efficiently.
### A.6 Qualitative characteristics of discovered heuristics
Although the final heuristic code is task specific, several qualitative patterns recur across the runs\. On BP\-online, strong candidates tend to favor bin closure while explicitly avoiding small, hard\-to\-reuse residual fragments\. On TSP, successful heuristics usually combine local edge cost with a light form of geometric lookahead, such as preferring nodes whose future neighborhood would otherwise be difficult to connect\. On PFSP, good heuristics balance immediate dispatch quality against downstream machine congestion\. On MKP, good heuristics trade off raw value density against the risk of exhausting one constraint too early\. These patterns suggest that the archive is not merely memorizing syntax; it is preserving reusable behavioral motifs\.
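The BP-online pattern above can be made concrete as a small scoring rule. The sketch below is our own illustration of the recurring closure-favoring motif, not a heuristic actually produced by the runs; the function name, threshold, and score constants are assumptions.

```python
def score_bins(item_size, remaining_capacities, frag_threshold=0.1):
    """Score candidate bins for one arriving item (higher is better).

    Illustrative sketch of the qualitative motif described in the text:
    reward exact bin closure, penalize tiny hard-to-reuse residual
    fragments, and otherwise prefer tighter fits.
    """
    scores = []
    for cap in remaining_capacities:
        residual = cap - item_size
        if residual < 0:
            scores.append(float("-inf"))   # infeasible bin
        elif residual == 0:
            scores.append(1.0)             # perfect closure: best outcome
        elif residual < frag_threshold:
            scores.append(-1.0)            # tiny sliver, hard to reuse later
        else:
            scores.append(-residual)       # prefer tighter reusable fits
    return scores
```

The arriving item would then be placed in the bin with the highest score, falling back to opening a new bin when all scores are `-inf`.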
#### Scope of the empirical claim\.
Across Tables [1](https://arxiv.org/html/2605.07214#S4.T1), [2](https://arxiv.org/html/2605.07214#S4.T2), [3](https://arxiv.org/html/2605.07214#S4.T3), and [5](https://arxiv.org/html/2605.07214#A1.T5), we interpret the evidence as supporting a matched-budget quality–efficiency claim rather than a universal transfer claim. We do not claim that a heuristic evolved for one problem transfers unchanged to another, nor that the same prompt configuration is optimal for every backbone model.
## Appendix B System, Pseudocode, and Prompt Details
### B.1 Positioning of the HMACE decomposition
Relative to single-agent LLM-based heuristic-discovery frameworks such as FunSearch, EoH, ReEvo, HSEvo, MCTS-AHD, and MoH, HMACE contributes at the level of search-loop organization rather than by making a monolithic controller more elaborate [[41](https://arxiv.org/html/2605.07214#bib.bib13), [28](https://arxiv.org/html/2605.07214#bib.bib57), [50](https://arxiv.org/html/2605.07214#bib.bib14), [14](https://arxiv.org/html/2605.07214#bib.bib15), [51](https://arxiv.org/html/2605.07214#bib.bib16), [42](https://arxiv.org/html/2605.07214#bib.bib10)]. Existing single-agent LLM-EC systems largely couple parent selection, idea mutation, code generation, and feedback interpretation inside one prompt or one tightly bound prompt pair. HMACE instead decomposes a generation into four typed roles—proposer, generator, evaluator, and reflector—linked by a shared archive-backed memory.
Relative to general multi-agent LLM systems such as CAMEL, AutoGen, MetaGPT, AgentVerse, and ChatDev, as well as optimization-oriented multi-agent systems such as CORAL and DRAGON, our focus is narrower but more operational: reusable heuristic discovery under external empirical evaluation rather than open-ended collaboration or instance decomposition [[25](https://arxiv.org/html/2605.07214#bib.bib25), [47](https://arxiv.org/html/2605.07214#bib.bib26), [18](https://arxiv.org/html/2605.07214#bib.bib27), [10](https://arxiv.org/html/2605.07214#bib.bib28), [37](https://arxiv.org/html/2605.07214#bib.bib54), [38](https://arxiv.org/html/2605.07214#bib.bib34), [9](https://arxiv.org/html/2605.07214#bib.bib33)]. The distinguishing element is not merely that there are multiple agents, but that the agents exchange typed artifacts and consult a behavior-indexed archive that conditions future search.
The reflector’s memory design is closely related to quality-diversity search. MAP-Elites introduced the idea of preserving the best solution found in each behavioral niche, and CVT-MAP-Elites scaled this principle to higher-dimensional behavior spaces [[33](https://arxiv.org/html/2605.07214#bib.bib51), [45](https://arxiv.org/html/2605.07214#bib.bib55)]. In HMACE, the archive serves the same dual purpose: it preserves high-performing incumbents while structuring exploration through behavior-aware retrieval. The archive is therefore not a dialogue history; it is a persistent search state.
Role specialization matters because proposal, code synthesis, evaluation, and memory update require different information and expose different failure modes\. The front\-end filter is intentionally lightweight: its job is to prevent malformed or redundant programs from consuming evaluation budget, not to replace the evaluator or the archive update rule\.
### B.2 Full HMACE training workflow
For completeness, we reproduce the full training workflow below and then summarize the role interfaces and the behavior\-space design used by the reflector\.
Algorithm 2: Full HMACE training workflow.

```text
Input:  number of generations G; population capacity n; proposal batch size k;
        retrieval size r; archive capacity C; patience ρ; training instances 𝒯;
        seed prompt Π₀.
Output: best heuristic h*; final population P_G; shared memory ℳ_G.

function GenStep(P, ℳ):
    # Exemplar retrieval by the reflector R_sel:
    #   a. build the candidate pool from the current population and archive elites
    #   b. select r exemplars from behaviorally distinct cells using descriptor φ
    #   c. return the retrieved exemplar set X to condition the proposer
    X ← R_sel(P, ℳ, r)

    # Strategy proposal by the proposer 𝒫:
    #   a. read parent heuristics together with the retrieved exemplars X
    #   b. draft k natural-language strategy candidates {s_i}, i = 1..k
    #   c. resample malformed or duplicated strategies when necessary
    S = {s_1, …, s_k} ← 𝒫(P, X)

    # Code generation and pre-filtering by the generator 𝒢:
    #   a. translate each strategy s_i into an executable heuristic h_i
    #   b. enforce the task IO contract and basic syntax or signature constraints
    #   c. remove infeasible or near-duplicate heuristics to obtain H~
    H = {h_1, …, h_k} ← 𝒢(S)
    H~ ← Filter(H, ℳ)

    # Evaluation by the evaluator ℰ:
    #   a. execute each surviving heuristic on the training set 𝒯
    #   b. compute the scalar fitness score f_h for quality assessment
    #   c. extract the behavior vector φ_h ∈ ℝ^d for diversity assessment
    for h ∈ H~:
        f_h ← f(h, 𝒯);  φ_h ← φ(h, 𝒯)
    Q ← {(h, φ_h, f_h) : h ∈ H~}

    # Memory update by the reflector R_upd:
    #   a. refresh the population: P ← Top_n(P ∪ {(h, f_h) : (h, φ_h, f_h) ∈ Q})
    #   b. insert each tuple (h, φ_h, f_h) into its behavior cell and keep the
    #      best incumbent per cell
    #   c. maintain archive capacity C and persist ℳ for resume safety
    (P, ℳ) ← R_upd(P, ℳ, Q)
    return (P, ℳ)

# Main loop
P₀ ← LLM(Π₀);  ℳ₀ ← ∅
for t = 1, …, G:
    (P_t, ℳ_t) ← GenStep(P_{t−1}, ℳ_{t−1})
    if NoImprovement(P_t, ρ): break
h* ← argmin_{(h, φ, f) ∈ P_G ∪ ℳ_G} f
return (h*, P_G, ℳ_G)
```
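In code, one generation of the workflow can be sketched as follows. The agent callables below are placeholders for the LLM-backed roles; all names and signatures are our own illustration, not the paper's implementation.

```python
def gen_step(population, memory, proposer, generator, evaluator,
             retrieve, update, pre_filter, train_set, r=2, k=4):
    """One HMACE generation (illustrative sketch of GenStep).

    The role callables stand in for the reflector retrieval (R_sel),
    proposer (P), generator (G), external evaluator, and reflector
    update (R_upd); `pre_filter` is the lightweight screen.
    """
    exemplars = retrieve(population, memory, r)       # reflector: R_sel
    strategies = proposer(population, exemplars, k)   # proposer drafts k ideas
    heuristics = generator(strategies)                # strategies -> code
    survivors = pre_filter(heuristics, memory)        # cheap deterministic screen
    evaluated = []
    for h in survivors:
        fitness, behavior = evaluator(h, train_set)   # empirical f_h, phi_h
        evaluated.append((h, behavior, fitness))
    return update(population, memory, evaluated)      # reflector: R_upd
```

The outer training loop then simply iterates `gen_step` for up to `G` generations with the patience-based early stop.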
### B.3 Agent interfaces and prompt design
To present the implementation in a style similar to recent LLM-EC appendices while staying faithful to our own project, we summarize the HMACE prompt logic and task-interface metadata in a compact visual form. Table [7](https://arxiv.org/html/2605.07214#A2.T7) lists the task-specific fields exposed to the generation and evaluation pipeline, and the panels below show representative role-bounded prompt and memory templates derived from our implementation.
Table 7: Task metadata fields exposed to HMACE prompts across the four benchmark families.

| Metadata key | TSP | BPP | PFSP | MKP |
| --- | --- | --- | --- | --- |
| problem type / objective direction | ✓ | ✓ | ✓ | ✓ |
| required action signature | ✓ | ✓ | ✓ | ✓ |
| distance matrix / coordinates | ✓ | – | – | – |
| item sizes / arriving item stream | – | ✓ | – | – |
| bin capacity / residual capacities | – | ✓ | – | – |
| processing-time matrix | – | – | ✓ | – |
| machine count / schedule state | – | – | ✓ | – |
| item values | – | – | – | ✓ |
| weight matrix / capacity vector | – | – | – | ✓ |
| behavior-descriptor schema | ✓ | ✓ | ✓ | ✓ |
| archive-cell metadata | ✓ | ✓ | ✓ | ✓ |

**Proposer Prompt Template**

```
You are the PROPOSER in HMACE. Given: (1) parent heuristics from the current
population, (2) exemplars retrieved from behaviorally distinct archive cells,
draft k strategy candidates in natural language only.
Requirements:
- preserve the task IO contract;
- change one concrete algorithmic idea at a time;
- prefer diverse ideas over near-duplicates;
- output structured JSON only.
Return: {"strategies": [{"idea": "...", "target_behavior": "..."}]}
```
**Generator Prompt Template**

```
You are the GENERATOR in HMACE. Translate the given strategy into executable
code that exactly matches the task contract.
Constraints:
- return code only;
- do not change the required function signature;
- do not inject new algorithmic content;
- keep execution deterministic.
Example TSP signature:
def select_next_node(current_node, destination_node, unvisited_nodes, distance_matrix):
```

**Archive Record Example**

```
{
  "task": "tsp_construct",
  "fitness": 0.0177,
  "cell": 12,
  "behavior": {
    "edge_var": 0.43,
    "detour_rate": 0.18,
    "greedy_myopia": 0.27
  },
  "summary": "short-edge biased, but rescues isolated nodes"
}
```

**Reflector Retrieval Contract**

```
Input:
- current population elites;
- occupied archive cells;
- retrieval budget r.
Output:
- r exemplars from distinct cells;
- short summaries for the proposer.
Rule: maximize behavior-space coverage first, then prefer stronger fitness
within each cell.
```
#### Proposer\.
The proposer receives the current parent population together with a small set of retrieved exemplars from behaviorally distinct archive cells\. Its output is a batch of natural\-language strategy candidates\. The prompt objective is deliberately narrow: propose diverse, executable algorithmic ideas rather than code, critique, or final selection\.
#### Generator\.
The generator maps each strategy candidate to an executable heuristic that matches the task\-specific IO contract\. In our implementation, the generator is instructed to return code only and to avoid introducing algorithmic content beyond the given strategy sketch\. This restriction is important because it keeps the evolutionary search focused on explicit proposal changes rather than hidden generator\-side drift\.
#### Evaluator\.
The evaluator is external to the LLM loop. It executes each surviving heuristic on the training instances, computes the scalar fitness f_h, extracts the behavior vector φ_h, and returns the evaluated tuple (h, φ_h, f_h). Because this step is empirical rather than self-reported, the search remains grounded in measurable performance.
#### Reflector\.
The reflector has two responsibilities\. In its retrieval mode, it selects diverse archive exemplars to condition the next proposal step\. In its update mode, it inserts new candidates into the archive, refreshes the elite population, and decides which regions of behavior space remain underexplored\. In this sense, reflection inHMACEis an operational memory mechanism rather than a purely verbal critique stage\.
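A minimal sketch of the retrieval mode, assuming the archive maps cell ids to lists of (heuristic, fitness) incumbents with lower fitness being better; the data layout and function name are our own, not the paper's code.

```python
def retrieve_exemplars(archive, r):
    """Pick r exemplars from distinct archive cells.

    Coverage first: at most one exemplar per cell, iterating cells in a
    fixed order; within each cell, prefer the strongest (lowest-fitness)
    incumbent, matching the retrieval contract stated above.
    """
    picks = []
    for cell_id, entries in sorted(archive.items()):
        if len(picks) == r:
            break
        best = min(entries, key=lambda e: e[1])  # strongest in this cell
        picks.append((cell_id, *best))
    return picks
```

In the full system, each pick would also carry a short natural-language summary for the proposer.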
#### Filter and typed handoffs\.
Between generation and evaluation, HMACE applies a lightweight deterministic screen that removes malformed or near-duplicate programs. The purpose of this stage is to preserve evaluation budget, not to replace the evaluator. More broadly, all inter-agent communication is typed: the proposer emits strategy text, the generator emits code, the evaluator emits (f_h, φ_h), and the reflector emits retrieved exemplars plus the updated search state. This typed design is a key reason why the framework is easy to ablate and analyze.
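One plausible realization of such a screen, assuming Python heuristics and a per-task entry-point name, parses each candidate, checks for the required function, and hashes the normalized AST to drop duplicates modulo formatting. This is an illustrative sketch, not the paper's exact filter.

```python
import ast
import hashlib

def pre_filter(candidates, seen_hashes, required_name="select_next_node"):
    """Cheap deterministic screen applied before evaluation.

    Drops programs that fail to parse, do not define the required entry
    point, or duplicate an already-seen program modulo whitespace and
    formatting (same AST).
    """
    survivors = []
    for src in candidates:
        try:
            tree = ast.parse(src)
        except SyntaxError:
            continue                                   # malformed: discard
        names = {n.name for n in ast.walk(tree)
                 if isinstance(n, ast.FunctionDef)}
        if required_name not in names:
            continue                                   # breaks the IO contract
        key = hashlib.sha256(ast.dump(tree).encode()).hexdigest()
        if key in seen_hashes:
            continue                                   # duplicate AST: discard
        seen_hashes.add(key)
        survivors.append(src)
    return survivors
```

A stronger near-duplicate check (e.g. renaming-invariant hashing) would slot in at the `key` computation without changing the interface.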
### B.4 Behavior descriptor and archive schema
In the main paper, the behavior descriptor is instantiated as an 11-dimensional vector comprising five runtime statistics and six static program-structure features. The runtime coordinates are task-specific, whereas the static coordinates are task-agnostic and summarize structural properties of the generated heuristic code.
#### BP\-online\.
The runtime coordinates capture packing behavior such as utilization quality, fragmentation tendency, closure rate, residual\-capacity dispersion, and bias toward early feasible bins\. These statistics distinguish heuristics that look similar in objective value but behave differently during packing\.
#### TSP\-construct\.
The runtime coordinates summarize geometric decision behavior, including local edge preference, route\-shape regularity, detour tendency, edge\-length variability, and the extent to which the heuristic behaves myopically versus with short lookahead\.
#### PFSP and MKP\.
For PFSP, the descriptor emphasizes schedule shape—for example, idle\-time concentration, front loading, and sensitivity to critical machines\. For MKP, it emphasizes selection behavior such as value\-density preference, slack usage across dimensions, and aggressiveness of item admission\.
#### Static code features\.
The six task-agnostic coordinates summarize source-level structure, including control-flow depth, branching, looping, helper-function usage, vectorization density, and arithmetic or logical complexity. Before archive insertion, all coordinates are normalized within task, and the normalized vectors are assigned to the 25 CVT-MAP-Elites cells described in Section [3.3](https://arxiv.org/html/2605.07214#S3.SS3).
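The normalization-and-assignment step can be sketched as follows. This is an assumed implementation: the paper specifies within-task normalization and 25 CVT cells, but the min-max scaling and the source of the precomputed centroids here are our own choices.

```python
import numpy as np

def assign_cells(descriptors, centroids):
    """Map behavior descriptors to CVT-MAP-Elites cells.

    Min-max normalize the descriptors within the task, then assign each
    normalized vector to its nearest centroid. `centroids` is an
    (n_cells, d) array, assumed precomputed (e.g. k-means on uniform
    samples of the unit hypercube, as in standard CVT-MAP-Elites).
    """
    D = np.asarray(descriptors, dtype=float)
    lo, hi = D.min(axis=0), D.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)    # guard constant coordinates
    Z = (D - lo) / span                       # per-coordinate min-max scaling
    # Pairwise distances to all centroids, then nearest-centroid cell ids.
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)
```

Each evaluated tuple (h, φ_h, f_h) would then be inserted into its assigned cell, keeping the best incumbent per cell.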
## Appendix C Reproducibility Details
### C.1 Hardware and backbone
All experiments are conducted on servers equipped with NVIDIA L40 GPUs and Intel(R) Gold 6330 CPUs at 2.00 GHz. The starred rows reported in Tables [1](https://arxiv.org/html/2605.07214#S4.T1), [2](https://arxiv.org/html/2605.07214#S4.T2), [3](https://arxiv.org/html/2605.07214#S4.T3), and [5](https://arxiv.org/html/2605.07214#A1.T5) are evaluated by us using GPT-5.4.
### C.2 Run protocol
Unless otherwise noted, all methods use a population size of 10 and a generation budget of 30. The HMACE configuration follows the values given in the main text: patience-3 early stopping with threshold δ_min = 10⁻⁴, n_proc = 12 evaluator workers, a CVT-MAP-Elites archive with 25 centroids, an 11-dimensional behavior descriptor, retrieval of 2 archive exemplars per generation, a screening keep ratio of 0.5, and a proposal batch size of 4 candidate children. Each reported mean and standard deviation is computed over seeds {0, 1, 2}. In addition to the generation-level early-stop rule, runs are also subject to a global wall-clock limit of one hour.
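Collected as a single configuration object, the protocol above reads as follows; the field names are our own, while the values follow this subsection.

```python
# Run-protocol hyperparameters from Appendix C.2, gathered in one place.
HMACE_CONFIG = {
    "population_size": 10,
    "generations": 30,
    "patience": 3,
    "delta_min": 1e-4,           # early-stop improvement threshold
    "evaluator_workers": 12,     # n_proc
    "archive_centroids": 25,     # CVT-MAP-Elites cells
    "behavior_dim": 11,          # 5 runtime + 6 static coordinates
    "retrieval_size": 2,         # archive exemplars per generation
    "screening_keep_ratio": 0.5,
    "proposal_batch_size": 4,
    "seeds": [0, 1, 2],
    "wall_clock_limit_s": 3600,  # global one-hour cap
}
```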
### C.3 Logged artifacts
For each run, we log per-candidate token usage, wall-clock runtime, behavior descriptors, archive occupancy, and screening decisions. These logs serve two purposes. First, they support the quality–efficiency comparisons reported in the main paper. Second, they make it possible to inspect which parts of the search process are responsible for improvements or stagnation, which is essential for analyzing a role-specialized framework such as HMACE.