Generating Robust Portfolios of Optimization Models using Large Language Models

arXiv cs.AI Papers

Summary

Proposes a method to generate portfolios of optimization models using LLMs, with theoretical guarantees and empirical validation.

arXiv:2605.27013v1 Announce Type: new Abstract: Mathematical optimization is a powerful tool for structured decision-making across domains such as resource allocation and planning. Formulating optimization models faithful to reality, though, remains a significant bottleneck as it typically demands both domain expertise and optimization knowledge that are often scarce. Recent advances in large language models (LLMs) promise to bridge this gap, enabling the generation of candidate optimization models from natural language descriptions. However, there is no guarantee that any single LLM-generated model is reliable, and existing approaches that output only one model are therefore risky. In this work, we propose a novel algorithm that generates a portfolio of optimization models, designed to be robust to the limitations of LLMs. Our method exploits the observation that a single LLM can play two distinct roles, as a stochastic generator and as a reasoning evaluator, and proposes a unified framework that leverages both capabilities in a complementary manner. We provide theoretical guarantees showing that, as long as either the generator or the evaluator is well-aligned with human preferences, the portfolio is guaranteed to contain high-quality candidates, enabling a principled human-in-the-loop process in which a decision-maker can review multiple candidates before committing to one. We further validate our approach empirically, demonstrating strong performance across a range of optimization modeling tasks.

Cached at: 05/27/26, 09:10 AM

# Generating Robust Portfolios of Optimization Models using Large Language Models
Source: [https://arxiv.org/html/2605.27013](https://arxiv.org/html/2605.27013)
###### Abstract

Mathematical optimization is a powerful tool for structured decision\-making across domains such as resource allocation and planning\. Formulating optimization models faithful to reality, though, remains a significant bottleneck as it typically demands both domain expertise and optimization knowledge that are often scarce\. Recent advances in large language models \(LLMs\) promise to bridge this gap, enabling the generation of candidate optimization models from natural language descriptions\. However, there is no guarantee that any single LLM\-generated model is reliable, and existing approaches that output only one model are therefore risky\. In this work, we propose a novel algorithm that generates a portfolio of optimization models, designed to be robust to the limitations of LLMs\. Our method exploits the observation that a single LLM can play two distinct roles—as a stochastic generator and as a reasoning evaluator—and proposes a unified framework that leverages both capabilities in a complementary manner\. We provide theoretical guarantees showing that, as long as either the generator or the evaluator is well\-aligned with human preferences, the portfolio is guaranteed to contain high\-quality candidates, enabling a principled human\-in\-the\-loop process in which a decision\-maker can review multiple candidates before committing to one\. We further validate our approach empirically, demonstrating strong performance across a range of optimization modeling tasks\.

Optimization Modeling, Portfolios, LLMs

## 1 Introduction

Formalizing a structured decision\-making task as a mathematical optimization problem plays a pivotal role in designing optimal decision policies in domains such as resource allocation and planning \(Brill Jr, [1979](https://arxiv.org/html/2605.27013#bib.bib27); Katoh and Ibaraki, [1998](https://arxiv.org/html/2605.27013#bib.bib28); Vercellis, [2011](https://arxiv.org/html/2605.27013#bib.bib26)\)\. However, defining an optimization model that accurately reflects all the real\-world requirements and constraints of the decision\-making task can be rather challenging; typically, it requires not only exhaustive manual tuning, but also a combination of domain expertise and deep optimization knowledge\.

To address this challenge, there has been a growing interest in leveraging Large Language Models \(LLMs\) to automate the definition of optimization models given a task description in natural language\. Recent lines of work typically focus on either automating the entire model definition \(Yang et al\., [2023](https://arxiv.org/html/2605.27013#bib.bib20), [2024](https://arxiv.org/html/2605.27013#bib.bib23); Ahmaditeshnizi et al\., [2024](https://arxiv.org/html/2605.27013#bib.bib21); Astorga et al\., [2024](https://arxiv.org/html/2605.27013#bib.bib22); Zhang et al\., [2024](https://arxiv.org/html/2605.27013#bib.bib29); Jiang et al\., [2024](https://arxiv.org/html/2605.27013#bib.bib31); Ahmed and Choudhury, [2024](https://arxiv.org/html/2605.27013#bib.bib32); Huang et al\., [2025](https://arxiv.org/html/2605.27013#bib.bib19); Xiao et al\., [2025](https://arxiv.org/html/2605.27013#bib.bib30)\) or a partial model definition limited to the design of the objective \(reward\) function \(Icarte et al\., [2022](https://arxiv.org/html/2605.27013#bib.bib3); Yu et al\., [2023](https://arxiv.org/html/2605.27013#bib.bib1); Shinn et al\., [2023](https://arxiv.org/html/2605.27013#bib.bib6); Ma et al\., [2024](https://arxiv.org/html/2605.27013#bib.bib7); Hwang et al\., [2024](https://arxiv.org/html/2605.27013#bib.bib9); Xie et al\., [2024](https://arxiv.org/html/2605.27013#bib.bib4); Behari et al\., [2024](https://arxiv.org/html/2605.27013#bib.bib5); Verma et al\., [2025](https://arxiv.org/html/2605.27013#bib.bib11); Sun et al\., [2026](https://arxiv.org/html/2605.27013#bib.bib2)\)\.

Prior work, though, typically proposes computationally\-heavy approaches, requiring additional training or fine\-tuning of language models, with the objective to generate a single optimization model, while lacking guarantees on its quality\. In this work, we introduce a lightweight, training\-free algorithm to generate a portfolio of optimization models with guarantees over the quality of generated models while being robust to limitations of language models\. To achieve this, our algorithm leverages a dual perspective on the capabilities of language models that we describe next\.

Language models can operate as stochastic generators \(Verma et al\., [2025](https://arxiv.org/html/2605.27013#bib.bib11); Cardenoso and Caarls, [2025](https://arxiv.org/html/2605.27013#bib.bib15)\) that can provide diverse models through repeated stochastic sampling, accounting for different trade\-offs present in the optimization task at hand\. In addition, language models can also operate as judges or reasoning evaluators \(Verma et al\., [2025](https://arxiv.org/html/2605.27013#bib.bib11); Saccon et al\., [2025](https://arxiv.org/html/2605.27013#bib.bib14)\), arguing about the quality of given inputs based on their world knowledge and reasoning capabilities\. By unifying these modes of operation, our approach leverages their complementary strengths to provide robustness guarantees over the quality of models comprising our portfolios, allowing a decision\-maker to review multiple high quality candidates before committing to one\.

Contributions\. We propose a method that first uses a language model, namely a generator, to generate a distribution of candidate optimization models through repeated stochastic sampling\. Next, our method uses a judge agent, namely an evaluator, to rank these optimization models based on how well they align with the optimization task description in natural language\. We construct a portfolio comprising the candidate optimization models with the highest evaluator ranks that have a total generation probability above a user\-specified threshold\. We show that, constructed in this way, our portfolio is guaranteed to include high quality candidates if either the generator or the evaluator satisfies a certain level of alignment with respect to human preferences\. We empirically verify the strong performance of our portfolio in experiments with both synthetic and real data\.

## 2 Building a Portfolio through a Generator and Evaluator

Let $d$ be the natural language description of an optimization problem, $g$ be a stochastic generator model and $e$ an evaluator model\.

Generator\. Given a natural language description $d \in \mathcal{D}$, where $\mathcal{D}$ is the space of any natural language prompt, we consider the generator $g$ as a probability distribution $p$ over the space of candidate optimization models $\mathcal{O}$\. \(Footnote 1: We assume that $\mathcal{O} = \mathcal{O}_v \cup \bot$, where $\mathcal{O}_v$ is the space of text representations of any valid optimization model, and $\bot = \mathcal{D} \setminus \mathcal{O}_v$\. Note that $\mathcal{O}$ can be finite or infinite\.\)

Evaluator\. Given the optimization problem description $d$, we consider the evaluator as a ranking policy $\pi_e$ over the space of candidate optimization models $o \in \mathcal{O}$, inducing a ranking

$$\pi_e(d) = (o_{(1)^e}, o_{(2)^e}, \dots), \tag{1}$$

where the subscript $(\cdot)$ denotes the rank of candidate $o$\. We assume that the lower the rank, the better the candidate given the description $d$, according to the world knowledge of the evaluator, and ties are broken randomly\.

We use the probability distribution $p$ induced by the generator values $g(d)$ and the ranking $\pi_e(d)$ induced by the evaluator $e$ to construct our portfolio $\mathcal{P}$ of candidate optimization models as follows:

$$\mathcal{P}(d;\alpha) = \left\{ o_{(i)^e} \right\}_{i=1}^{k^*(\alpha)}, \tag{2}$$

where $\alpha \in (0,1)$ is a user\-defined parameter and

$$k^*(\alpha) = \inf\left\{ k \in \mathbb{N} : \sum_{i=1}^{k} p(o_{(i)^e}) \geq 1 - \alpha \right\}.$$

By constructing our portfolio in this way, we make sure that, for a small enough $\alpha$, our portfolio will include candidate models either because they have a low \(good\) evaluator rank or because they have a high probability of being generated\. This means that our portfolio will include high\-quality models as long as either the evaluator ranks them low or the generator produces them with high probability, *i\.e\.*, whenever either of them is well\-aligned with human preferences\. Motivated by this observation, in what follows, we prove that our portfolio is guaranteed to include high\-quality candidates if either the generator or the evaluator is human\-aligned\.
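The construction in Eq. 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_portfolio`, the list-based representation of the probabilities and the ranking, the example numbers, and the small floating-point tolerance are all assumptions introduced here.

```python
from typing import List, Sequence


def build_portfolio(probs: Sequence[float], ranking: Sequence[int], alpha: float) -> List[int]:
    """Take candidates in evaluator order until their generator mass reaches 1 - alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    portfolio: List[int] = []
    mass = 0.0
    for idx in ranking:  # ranking[0] is the evaluator's best candidate
        portfolio.append(idx)
        mass += probs[idx]
        # Small tolerance (an assumption of this sketch) guards against float round-off.
        if mass >= 1.0 - alpha - 1e-12:
            break
    return portfolio


# Hypothetical example: four candidates, indexed 0..3.
probs = [0.5, 0.2, 0.2, 0.1]   # generator probabilities p(o)
ranking = [2, 0, 1, 3]         # evaluator ranking, best first
print(build_portfolio(probs, ranking, alpha=0.25))
```

In this example the evaluator's top candidate carries little generator mass, so the portfolio keeps growing until the high-probability candidate `0` and then candidate `1` are included, which is the complementary behaviour described above.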

## 3 Robust Portfolio Generation

We begin by defining human\-alignment of the generator and evaluator under the following assumption\.

###### Assumption 3\.1\.

Given the natural language description $d$, there is a human ranking policy $\pi^*$, inducing a ranking $\pi^*(d)$ over $o \in \mathcal{O}$, with

$$\pi^*(d) = (o_{(1)^*}, o_{(2)^*}, \dots), \tag{3}$$

where the lower the rank, the higher the quality of the candidate optimization model $o$ according to human preferences\.

Based on the above, we provide the following definitions\.

###### Definition 3\.2 \(Evaluator Alignment\)\.

An evaluator is human\-aligned given $d$ if the induced ranking satisfies

$$\pi_e(d) = \pi^*(d).$$

###### Definition 3\.3 \(Generator Alignment\)\.

A generator model inducing a probability $p$ over candidate optimization models $o \in \mathcal{O}$ is human\-aligned given $d$ if, for any $o_{(i)^*}, o_{(j)^*} \in \pi^*(d)$ with $i \leq j$, it holds that

$$p(o_{(i)^*}) \geq p(o_{(j)^*}). \tag{4}$$

Intuitively, a human\-aligned generator generates candidates judged as high quality according to human preferences with high probability\.
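For a finite candidate set, Definitions 3.2 and 3.3 reduce to simple checks. The sketch below is illustrative only: the function names and the convention that candidates are integer indices, with rankings given best first, are assumptions made here.

```python
from typing import Sequence


def is_evaluator_aligned(evaluator_ranking: Sequence[int], human_ranking: Sequence[int]) -> bool:
    """Definition 3.2: the evaluator ranking coincides with the human ranking."""
    return list(evaluator_ranking) == list(human_ranking)


def is_generator_aligned(probs: Sequence[float], human_ranking: Sequence[int]) -> bool:
    """Definition 3.3: probabilities are non-increasing along the human ranking."""
    ordered = [probs[o] for o in human_ranking]
    # Checking adjacent pairs is enough, since >= is transitive.
    return all(a >= b for a, b in zip(ordered, ordered[1:]))


human_ranking = [0, 1, 2]
print(is_generator_aligned([0.5, 0.3, 0.2], human_ranking))  # aligned
print(is_generator_aligned([0.2, 0.3, 0.5], human_ranking))  # reversed, not aligned
```

Note that a uniform distribution satisfies the inequality in Eq. 4 with equality everywhere, so it passes this check as written.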

We show that under either evaluator or generator alignment our portfolio is guaranteed to include high quality candidates or, more formally, to achieve positive coverage, where we define coverage as follows\.

###### Definition 3\.4 \(Portfolio coverage\)\.

A portfolio $\mathcal{P}$ of $k$ candidates has coverage

$$c(\mathcal{P}) = \frac{\sum_{i=1}^{k} \mathbb{I}\{o_{(i)^*} \in \mathcal{P}\}}{k}. \tag{5}$$
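Coverage as defined in Eq. 5 is the fraction of the human top-$k$ candidates that appear in a portfolio of size $k$. A minimal sketch, assuming candidates are integer indices and `human_ranking` lists them best first (both conventions are introduced here for illustration):

```python
from typing import Sequence


def coverage(portfolio: Sequence[int], human_ranking: Sequence[int]) -> float:
    """Fraction of the human top-k candidates contained in a size-k portfolio."""
    k = len(portfolio)
    if k == 0:
        raise ValueError("coverage is undefined for an empty portfolio")
    members = set(portfolio)
    hits = sum(1 for o in human_ranking[:k] if o in members)
    return hits / k


human_ranking = [0, 1, 2, 3]
# The portfolio {2, 0} contains one of the human top-2 candidates {0, 1}.
print(coverage([2, 0], human_ranking))
```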

Under evaluator alignment and any generator, we prove the following\. \(Footnote 2: Proofs are deferred to Appendix [A](https://arxiv.org/html/2605.27013#A1)\.\)

###### Corollary 3\.5 \(Evaluator robustness\)\.

Assume an aligned evaluator under Definition [3\.2](https://arxiv.org/html/2605.27013#S3.Thmtheorem2) for a description $d$\. For any value of $\alpha \in (0,1)$ and any generator, the portfolio $\mathcal{P}(d;\alpha)$ constructed using Eq\. [2](https://arxiv.org/html/2605.27013#S2.E2) has coverage $c(\mathcal{P}(d;\alpha)) = 1$\.

Under generator alignment and any evaluator, we prove the following\.

###### Proposition 3\.6\.

Assume an aligned generator under Definition [3\.3](https://arxiv.org/html/2605.27013#S3.Thmtheorem3) for a description $d$\. For any value of $\alpha \in (0, 1/2)$ and any evaluator, any non\-empty portfolio $\mathcal{P}(d;\alpha)$ constructed using Eq\. [2](https://arxiv.org/html/2605.27013#S2.E2) has coverage

$$c(\mathcal{P}(d;\alpha)) > \frac{1 - 2\alpha}{k^*(\alpha)} > 0.$$
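The bound in Proposition 3.6 can be sanity-checked numerically on a toy instance. The sketch below is an illustration, not part of the paper's experiments: it assumes a hypothetical aligned generator over four candidates, enumerates every possible evaluator ranking, and uses exact rational arithmetic so that threshold comparisons are not affected by rounding. All names and numbers are assumptions introduced here.

```python
from fractions import Fraction
from itertools import permutations


def build_portfolio(probs, ranking, alpha):
    """Eq. 2: top-ranked candidates until generator mass reaches 1 - alpha."""
    mass, portfolio = Fraction(0), []
    for idx in ranking:
        portfolio.append(idx)
        mass += probs[idx]
        if mass >= 1 - alpha:
            break
    return portfolio


def coverage(portfolio, human_ranking):
    """Eq. 5: share of the human top-k found in a size-k portfolio."""
    k = len(portfolio)
    return Fraction(len(set(human_ranking[:k]) & set(portfolio)), k)


# Hypothetical aligned generator: probabilities decrease along the human ranking.
probs = [Fraction(4, 10), Fraction(3, 10), Fraction(2, 10), Fraction(1, 10)]
human_ranking = [0, 1, 2, 3]
alphas = [Fraction(1, 10), Fraction(1, 4), Fraction(2, 5), Fraction(9, 20)]


def bound_holds_everywhere():
    """Check c(P) > (1 - 2*alpha) / k* for every evaluator ranking and alpha."""
    for alpha in alphas:
        for ranking in permutations(range(len(probs))):
            portfolio = build_portfolio(probs, ranking, alpha)
            if not coverage(portfolio, human_ranking) > (1 - 2 * alpha) / len(portfolio):
                return False
    return True


print(bound_holds_everywhere())
```

Enumerating all 24 rankings mirrors the "any evaluator" clause of the proposition, although a check on one small instance is of course no substitute for the proof.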

## 4 Experiments

We begin by simulating the generator and evaluator under different levels of human\-alignment in a synthetic data setup, where we i\) empirically verify the robustness of our portfolio, and ii\) investigate the effect of human\-alignment on the portfolio coverage and size\. We proceed with a real data implementation of our portfolios on optimization modeling, where we show that the optimization models comprising our portfolios are qualitatively superior compared to optimization models provided by randomly sampled portfolios\.

### 4\.1 Simulated Portfolios

In our synthetic data setup, we consider a finite space of candidate optimization models $\mathcal{O}$, such that $|\mathcal{O}| = K$, for $K \in \{10, 20, 50, 100\}$, with human ranking $(1, 2, \dots, K)$\. We simulate several generator and evaluator types, each with a different level of human alignment as described below\. For each generator\-evaluator pair, we use Eq\. [2](https://arxiv.org/html/2605.27013#S2.E2) to construct a portfolio for each $\alpha$ value from $0$ to $1$ with step $0.02$ and repeat the experiment with $40$ different seeds\.

Generators\. We implement the following generator types \(refer to Appendix [B\.1](https://arxiv.org/html/2605.27013#A2.SS1) for more details\)\.

- Aligned\. A generator satisfying Definition [3\.3](https://arxiv.org/html/2605.27013#S3.Thmtheorem3)\.
- Weakly Aligned\. A generator violating Definition [3\.3](https://arxiv.org/html/2605.27013#S3.Thmtheorem3), for which Proposition [3\.6](https://arxiv.org/html/2605.27013#S3.Thmtheorem6) nevertheless holds for any portfolio of $K/2$ generated candidates\.
- Uniform\. A generator where each generated candidate $i \in [K]$ has probability $p(i) = 1/K$\.
- Misaligned\. A reversely aligned generator\.

Evaluators\. We characterize each evaluator by the fraction of incorrectly ranked candidates, which we denote as the evaluator error $\epsilon = \frac{\sum_{i=1}^{K} \mathbb{I}\{o_{(i)^*} \neq o_{(i)^e}\}}{K}$\. We implement the evaluator for $\epsilon \in \{0, 0.3, 0.5, 0.7, 1\}$, where $\epsilon = 0$ characterizes a human\-aligned evaluator with $\pi^* = \pi_e$, and $\epsilon = 1$ an evaluator $e$ such that $\forall i \in [K],\ o_{(i)^e} = o_{(K+1-i)^*}$\.
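The simulated components above can be sketched as follows. The evaluator error follows the formula in the text; the generator shapes are only illustrative, since the exact distributions are deferred to Appendix B.1. In particular, the linearly decaying weights used for the aligned generator are an assumption of this sketch, as are all function names.

```python
from typing import List, Sequence


def evaluator_error(human_ranking: Sequence[int], evaluator_ranking: Sequence[int]) -> float:
    """Fraction of rank positions where the evaluator disagrees with the human ranking."""
    K = len(human_ranking)
    return sum(h != e for h, e in zip(human_ranking, evaluator_ranking)) / K


def uniform_generator(K: int) -> List[float]:
    """Every candidate has probability 1 / K."""
    return [1.0 / K] * K


def aligned_generator(K: int) -> List[float]:
    """Assumed shape: probabilities decay linearly along the human ranking."""
    weights = [K - i for i in range(K)]
    total = sum(weights)
    return [w / total for w in weights]


def misaligned_generator(K: int) -> List[float]:
    """Reversely aligned: the human-worst candidate is the most likely."""
    return aligned_generator(K)[::-1]


K = 10
human = list(range(K))
reversed_evaluator = human[::-1]          # the epsilon = 1 evaluator from the text
print(evaluator_error(human, human))
print(evaluator_error(human, reversed_evaluator))
```

A full reversal yields an error of exactly 1 only when $K$ is even, which holds for all values of $K$ used in the experiments; for odd $K$ the middle candidate keeps its position.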

![Refer to caption](https://arxiv.org/html/2605.27013v1/x1.png)

Figure 1: Portfolio mean coverage against the value of $1-\alpha$ for the Weakly Aligned generator paired with each evaluator for $K=100$\. The mean is over $40$ iterations and shaded areas represent $95\%$ confidence intervals\.

![Refer to caption](https://arxiv.org/html/2605.27013v1/x2.png)

Figure 2: Portfolio mean size against the value of $1-\alpha$ for the Weakly Aligned generator paired with each evaluator for $K=100$\. The mean is over $40$ iterations and shaded areas represent $95\%$ confidence intervals\.

![Refer to caption](https://arxiv.org/html/2605.27013v1/x3.png)

Figure 3: Portfolio mean coverage against the value of $1-\alpha$ for the evaluator with $\epsilon=1.0$ paired with each generator for $K=100$\. The mean is over $40$ iterations and shaded areas represent $95\%$ confidence intervals\.

![Refer to caption](https://arxiv.org/html/2605.27013v1/x4.png)

Figure 4: Portfolio mean size against the value of $1-\alpha$ for the evaluator with $\epsilon=1.0$ paired with each generator for $K=100$\. The mean is over $40$ iterations and shaded areas represent $95\%$ confidence intervals\.

Results\. Figures [1](https://arxiv.org/html/2605.27013#S4.F1) and [3](https://arxiv.org/html/2605.27013#S4.F3) show, consistent with Proposition [3\.6](https://arxiv.org/html/2605.27013#S3.Thmtheorem6), that for $\alpha<0.5$ our portfolio achieves positive coverage even under the Weakly Aligned generator, independently of the evaluator error $\epsilon$\. In fact, Figure [1](https://arxiv.org/html/2605.27013#S4.F1) shows that, in practice, coverage is lower bounded by $1-\alpha$, a tighter lower bound compared to Proposition [3\.6](https://arxiv.org/html/2605.27013#S3.Thmtheorem6), as coverage is always above the diagonal for $\alpha<0.5$\.

Further, Figures [1](https://arxiv.org/html/2605.27013#S4.F1)–[4](https://arxiv.org/html/2605.27013#S4.F4) reveal notable insights on how the level of human alignment controls the trade\-off between portfolio coverage and its size\. Figures [1](https://arxiv.org/html/2605.27013#S4.F1) and [2](https://arxiv.org/html/2605.27013#S4.F2) show that under a Weakly Aligned generator, the lower the evaluator error, the higher the coverage and the lower the portfolio size for the same value of $\alpha$\. Figures [3](https://arxiv.org/html/2605.27013#S4.F3) and [4](https://arxiv.org/html/2605.27013#S4.F4) show that under the worst evaluator, the more aligned the generator, the higher the coverage, though at the price of larger portfolios\.

### 4\.2 Portfolios of Optimization Models

![Refer to caption](https://arxiv.org/html/2605.27013v1/x5.png)\(a\)
![Refer to caption](https://arxiv.org/html/2605.27013v1/x6.png)\(b\)

Figure 5: KDE plot of scores assigned by gpt\-5\.4\-as\-a\-judge to portfolios of size $s \in \{2, 4, 6, 8\}$ against $s$ randomly selected candidates for two evaluator types\. The score distributions are over $25$ problems and $30$ samplings of the random candidates\. Dashed and dotted lines represent the mean score value over the randomly chosen and the portfolio candidates, respectively\.

We construct portfolios of optimization models for $25$ optimization problems given in natural language from the dataset NL4LP \(Ahmaditeshnizi et al\., [2024](https://arxiv.org/html/2605.27013#bib.bib21)\)\. We use a strong LLM\-as\-a\-judge to evaluate the quality of the candidate optimization models in our portfolios and compare it against portfolios constructed by random sampling \(i\.e\., prompting multiple times\)\. Refer to Appendix [B\.2](https://arxiv.org/html/2605.27013#A2.SS2) for all LLM prompts and further details\.

Generator\. For each optimization problem, we generate $50$ candidate optimization models with gpt\-5\.4\-nano, where each model comprises a natural language description followed by the corresponding Python code\. For each candidate model $o$ we compute the generator probability $p(o)$ using the normalized log probabilities of the tokens of the text representation of the candidate model\.
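The text does not spell out the normalization, so the following is only one plausible reading, offered as a sketch: each candidate is scored by its mean token log probability (length normalization), and the scores are turned into a distribution over the sampled candidates with a softmax. Both steps, the function name, and the example values are assumptions introduced here, not the authors' confirmed procedure.

```python
import math
from typing import List, Sequence


def candidate_probabilities(token_logprobs: Sequence[Sequence[float]]) -> List[float]:
    """Map per-token log probabilities of each candidate to a distribution over candidates."""
    if any(len(lp) == 0 for lp in token_logprobs):
        raise ValueError("every candidate needs at least one token log probability")
    # Assumed step 1: length-normalize by averaging token log probabilities.
    scores = [sum(lp) / len(lp) for lp in token_logprobs]
    # Assumed step 2: softmax across candidates, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical token log probabilities for two sampled candidates.
example = [[-0.1, -0.2], [-1.0, -2.0, -3.0]]
print(candidate_probabilities(example))
```

Averaging rather than summing avoids penalizing longer model descriptions purely for their length; whether that matches the paper's choice is not stated in this section.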

Evaluator\. For each generated model, we execute the code and use the output, together with the model and problem, to assign a scalar score from 1 to 100 to each generated model using gpt\-5\.4\-nano with repeated sampling\. For each problem, we rank the generated models by the average score over $4$ samples and use this as the evaluator ranking\. Further, we implement an additional evaluator type, where we rank the generated models using the probability distribution induced by the generator, essentially using the generator as the evaluator too\.
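Turning repeated evaluator scores into a ranking can be sketched as below. The scores here are hypothetical placeholders for the LLM outputs, and the function name and seeded tie-breaking are assumptions of this sketch; random tie-breaking itself follows the convention stated in Section 2.

```python
import random
from typing import List, Sequence


def rank_by_mean_score(scores: Sequence[Sequence[float]], seed: int = 0) -> List[int]:
    """Return candidate indices ordered from highest to lowest mean score."""
    means = [sum(s) / len(s) for s in scores]
    order = list(range(len(means)))
    # Shuffle first so that the stable sort below breaks ties randomly.
    random.Random(seed).shuffle(order)
    order.sort(key=lambda i: -means[i])
    return order


# Hypothetical scores (1 to 100) from four evaluator samples per candidate.
scores = [
    [10, 20, 30, 40],   # candidate 0, mean 25.0
    [90, 80, 85, 95],   # candidate 1, mean 87.5
    [50, 50, 50, 50],   # candidate 2, mean 50.0
]
print(rank_by_mean_score(scores))
```

The second evaluator type described above corresponds to calling the same function with each candidate's generator probability as its single score.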

Results using LLM\-as\-a\-judge\. We use gpt\-5\.4 as a judge under the same scoring regime as for the evaluator for each generated model, except that in the scoring input prompt, we also add the ground truth solution of the optimization model as provided in the NL4LP dataset\. As the final score of each generated model we consider the average normalized LLM\-as\-a\-judge score over the repeated samplings\. We characterize the quality of each portfolio with the minimum among the scores of the models it contains, assuming a worst\-case scenario, where a human would select the worst model in the portfolio\. Figure [5](https://arxiv.org/html/2605.27013#S4.F5) shows that our portfolio for both evaluator types consistently outperforms random portfolios under this worst\-case scenario\. Further, it also shows that our portfolios using the reasoning capabilities of the LLM evaluator achieve higher scores compared to portfolios using the generator probabilities\. These results demonstrate in practice the competitive advantage of our approach in unifying the generative and reasoning capabilities of LLMs\.
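The worst-case comparison described above can be sketched as follows. The judge scores are hypothetical, and the function names, the seeded sampler, and the default of 30 samplings (taken from the Figure 5 caption) are assumptions of this illustration rather than the authors' evaluation code.

```python
import random
from typing import List, Sequence


def worst_case_quality(portfolio: Sequence[int], judge_scores: Sequence[float]) -> float:
    """Portfolio quality is the minimum judge score among its members."""
    return min(judge_scores[i] for i in portfolio)


def random_baseline_qualities(judge_scores: Sequence[float], size: int,
                              n_samplings: int = 30, seed: int = 0) -> List[float]:
    """Worst-case quality of randomly drawn portfolios of the same size."""
    rng = random.Random(seed)
    n = len(judge_scores)
    return [
        worst_case_quality(rng.sample(range(n), size), judge_scores)
        for _ in range(n_samplings)
    ]


# Hypothetical normalized judge scores for four candidates.
judge_scores = [0.9, 0.1, 0.7, 0.4]
print(worst_case_quality([0, 2], judge_scores))
baseline = random_baseline_qualities(judge_scores, size=2)
print(sum(baseline) / len(baseline))
```

Using the minimum makes the comparison conservative: a single poor candidate lowers the portfolio's score regardless of how good the remaining members are.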

## Acknowledgments

Straitouri acknowledges support from a Google PhD Fellowship\. Kim and Tambe acknowledge support by ONR MURI N00014\-24\-1\-2742\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here\.

## References

- A\. Ahmaditeshnizi, W\. Gao, and M\. Udell \(2024\) OptiMUS: scalable optimization modeling with \(MI\)LP solvers and large language models\. In Proceedings of the 41st International Conference on Machine Learning, R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 235, pp\. 577–596\.
- T\. Ahmed and S\. Choudhury \(2024\) LM4OPT: unveiling the potential of large language models in formulating mathematical optimization problems\. INFOR: Information Systems and Operational Research 62 \(4\), pp\. 559–572\.
- N\. Astorga, T\. Liu, Y\. Xiao, and M\. Van Der Schaar \(2024\) Autoformulation of mathematical optimization models using llms\. arXiv preprint arXiv:2411\.01679\.
- N\. Behari, E\. Zhang, Y\. Zhao, A\. Taneja, D\. Nagaraj, and M\. Tambe \(2024\) A decision\-language model \(dlm\) for dynamic restless multi\-armed bandit tasks in public health\. Advances in Neural Information Processing Systems 37, pp\. 3964–4002\.
- E\. D\. Brill Jr \(1979\) The use of optimization models in public\-sector planning\. Management Science 25 \(5\), pp\. 413–422\.
- F\. Cardenoso and W\. Caarls \(2025\) Leveraging llms for reward function design in reinforcement learning control tasks\. arXiv preprint arXiv:2511\.19355\.
- C\. Huang, Z\. Tang, S\. Hu, R\. Jiang, X\. Zheng, D\. Ge, B\. Wang, and Z\. Wang \(2025\) Orlm: a customizable framework in training large models for automated optimization modeling\. Operations Research 73 \(6\), pp\. 2986–3009\.
- M\. Hwang, L\. Weihs, C\. Park, K\. Lee, A\. Kembhavi, and K\. Ehsani \(2024\) Promptable behaviors: personalizing multi\-objective rewards from human preferences\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pp\. 16216–16226\.
- R\. T\. Icarte, T\. Q\. Klassen, R\. Valenzano, and S\. A\. McIlraith \(2022\) Reward machines: exploiting reward function structure in reinforcement learning\. Journal of Artificial Intelligence Research 73, pp\. 173–208\.
- C\. Jiang, X\. Shu, H\. Qian, X\. Lu, J\. Zhou, A\. Zhou, and Y\. Yu \(2024\) LLMOPT: learning to define and solve general optimization problems from scratch\. arXiv preprint arXiv:2410\.13213\.
- N\. Katoh and T\. Ibaraki \(1998\) Resource allocation problems\. Handbook of Combinatorial Optimization: Volume 1–3, pp\. 905–1006\.
- Y\. J\. Ma, W\. Liang, G\. Wang, D\. Huang, O\. Bastani, D\. Jayaraman, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2024\) Eureka: human\-level reward design via coding large language models\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=IEduRUO55F)
- E\. Saccon, D\. De Martini, M\. Saveriano, E\. Lamon, L\. Palopoli, and M\. Roveri \(2025\) Automated generation of mdps using logic programming and llms for robotic applications\. IEEE Robotics and Automation Letters 11 \(2\), pp\. 1770–1777\.
- N\. Shinn, F\. Cassano, B\. Labash, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. URL https://arxiv\.org/abs/2303\.11366\.
- S\. Sun, R\. Liu, J\. Lyu, J\. Yang, L\. Zhang, and X\. Li \(2026\) A large language model\-driven reward design framework via dynamic feedback for reinforcement learning\. Know\.\-Based Syst\. 326 \(C\)\. External Links: ISSN 0950\-7051, [Link](https://doi.org/10.1016/j.knosys.2025.114065), [Document](https://dx.doi.org/10.1016/j.knosys.2025.114065)
- C\. Vercellis \(2011\) Business intelligence: data mining and optimization for decision making\. John Wiley & Sons\.
- S\. Verma, N\. Boehmer, L\. Kong, and M\. Tambe \(2025\) Balancing act: prioritization strategies for llm\-designed restless bandit rewards\. In International Conference on Game Theory and AI for Security, pp\. 376–394\.
- Z\. Xiao, J\. Xie, L\. Xu, S\. Guan, J\. Zhu, X\. Han, X\. Fu, W\. Yu, H\. Wu, W\. Shi, et al\. \(2025\) A survey of optimization modeling meets llms: progress and future directions\. arXiv preprint arXiv:2508\.10047\.
- T\. Xie, S\. Zhao, C\. H\. Wu, Y\. Liu, Q\. Luo, V\. Zhong, Y\. Yang, and T\. Yu \(2024\) Text2Reward: reward shaping with language models for reinforcement learning\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=tUM39YTRxH)
- C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen \(2023\) Large language models as optimizers\. In The Twelfth International Conference on Learning Representations\.
- Z\. Yang, Y\. Wang, Y\. Huang, Z\. Guo, W\. Shi, X\. Han, L\. Feng, L\. Song, X\. Liang, and J\. Tang \(2024\) Optibench meets resocratic: measure and improve llms for optimization modeling\. arXiv preprint arXiv:2407\.09887\.
- W\. Yu, N\. Gileadi, C\. Fu, S\. Kirmani, K\. Lee, M\. G\. Arenas, H\. L\. Chiang, T\. Erez, L\. Hasenclever, J\. Humplik, B\. Ichter, T\. Xiao, P\. Xu, A\. Zeng, T\. Zhang, N\. Heess, D\. Sadigh, J\. Tan, Y\. Tassa, and F\. Xia \(2023\) Language to rewards for robotic skill synthesis\. In Proceedings of The 7th Conference on Robot Learning, J\. Tan, M\. Toussaint, and K\. Darvish \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 229, pp\. 374–404\.
- J\. Zhang, W\. Wang, S\. Guo, L\. Wang, F\. Lin, C\. Yang, and W\. Yin \(2024\) Solving general natural\-language\-description optimization problems with large language models\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 6: Industry Track\), pp\. 483–490\.

## Appendix A Proofs

### A\.1 Proof of Corollary [3\.5](https://arxiv.org/html/2605.27013#S3.Thmtheorem5)

###### Proof\.

Let $\pi_e$ be the ranking policy of the aligned evaluator\. By Definition [3\.2](https://arxiv.org/html/2605.27013#S3.Thmtheorem2), we have that $\pi_e(d) = \pi^*(d)$, thus $o_{(i)^e} = o_{(i)^*}$ for any $i \in \mathbb{N}$\. For any generator model and $\alpha \in (0,1)$ we use Eq\. [2](https://arxiv.org/html/2605.27013#S2.E2) to construct a portfolio $\mathcal{P}(d;\alpha) = \{o_{(i)^e}\}_{i=1}^{k^*(\alpha)}$, where $k^*(\alpha)$ is given by Eq\. [2](https://arxiv.org/html/2605.27013#S2.Ex1)\. The portfolio $\mathcal{P}(d;\alpha)$ has coverage

$$c(\mathcal{P}(d;\alpha)) = c\left(\{o_{(i)^e}\}_{i=1}^{k^*(\alpha)}\right) = c\left(\{o_{(i)^*}\}_{i=1}^{k^*(\alpha)}\right) = \frac{\sum_{i=1}^{k^*(\alpha)} \mathbb{I}\left\{o_{(i)^*} \in \{o_{(i)^*}\}_{i=1}^{k^*(\alpha)}\right\}}{k^*(\alpha)} = 1.$$

∎

### A\.2 Proof of Proposition [3\.6](https://arxiv.org/html/2605.27013#S3.Thmtheorem6)

###### Proof\.

With a slight abuse of notation, we will write $\mathcal{P}$ to denote $\mathcal{P}(d;\alpha)$ for ease of exposition. Let $\mathcal{X}\subseteq\mathcal{P}$ such that $\mathcal{X}=\mathcal{P}\cap\mathcal{P}^{*}$, where $\mathcal{P}^{*}$ is a portfolio of size $|\mathcal{P}^{*}|=|\mathcal{P}|$ with coverage $c(\mathcal{P}^{*})=1$. For $\mathcal{X}\subseteq\mathcal{P}$ we have

$$\sum_{o\in\mathcal{X}}p(o)+\sum_{o'\in\mathcal{P}\setminus\mathcal{X}}p(o')\geq 1-\alpha. \tag{6}$$

Since the generator model is human aligned, it must hold that $\forall o\in\mathcal{P}\setminus\mathcal{X}$ and $\forall o'\in\mathcal{P}^{*}\setminus\mathcal{X}$, $\mathrm{rank}(o)\geq\mathrm{rank}(o')$ given the human ranking $\pi^{*}(d)$ and, as a result, $p(o)\leq p(o')$. Using this and the fact that $|\mathcal{P}\setminus\mathcal{X}|=|\mathcal{P}^{*}\setminus\mathcal{X}|=k^{*}(\alpha)-|\mathcal{X}|$, Eq. [6](https://arxiv.org/html/2605.27013#A1.E6) becomes

$$\sum_{o\in\mathcal{X}}p(o)+\sum_{o''\in\mathcal{P}^{*}\setminus\mathcal{X}}p(o'')\geq\sum_{o\in\mathcal{X}}p(o)+\sum_{o'\in\mathcal{P}\setminus\mathcal{X}}p(o')\geq 1-\alpha. \tag{7}$$

We will now use that $\sum_{o''\in\mathcal{P}^{*}\setminus\mathcal{X}}p(o'')\leq\alpha$. This is because $\mathcal{P}^{*}\setminus\mathcal{X}\subseteq\mathcal{O}\setminus\mathcal{P}$ and by Eq. [6](https://arxiv.org/html/2605.27013#A1.E6)

$$1-\sum_{o\in\mathcal{O}\setminus\mathcal{P}}p(o)=\sum_{o\in\mathcal{X}}p(o)+\sum_{o'\in\mathcal{P}\setminus\mathcal{X}}p(o')\geq 1-\alpha.$$

As a result,

$$\sum_{o\in\mathcal{O}\setminus\mathcal{P}}p(o)\leq\alpha$$

and since $\mathcal{P}^{*}\setminus\mathcal{X}\subseteq\mathcal{O}\setminus\mathcal{P}$ it also holds

$$\sum_{o''\in\mathcal{P}^{*}\setminus\mathcal{X}}p(o'')\leq\sum_{o\in\mathcal{O}\setminus\mathcal{P}}p(o)\leq\alpha.$$

Using the above and Eq. [7](https://arxiv.org/html/2605.27013#A1.E7), we have

$$\sum_{o\in\mathcal{X}}p(o)+\alpha\geq\sum_{o\in\mathcal{X}}p(o)+\sum_{o''\in\mathcal{P}^{*}\setminus\mathcal{X}}p(o'')\geq 1-\alpha$$

and therefore

$$\sum_{o\in\mathcal{X}}p(o)\geq 1-2\alpha.$$

We will now use that $\sum_{o\in\mathcal{X}}p(o)<|\mathcal{X}|=k^{*}(\alpha)\cdot\frac{|\mathcal{X}|}{k^{*}(\alpha)}$ to rewrite the above as

$$k^{*}(\alpha)\cdot\frac{|\mathcal{X}|}{k^{*}(\alpha)}>\sum_{o\in\mathcal{X}}p(o)\geq 1-2\alpha.$$

Remember that $\mathcal{X}=\mathcal{P}\cap\mathcal{P}^{*}$, so $c(\mathcal{P})=\frac{|\mathcal{X}|}{k^{*}(\alpha)}$. Therefore the above becomes

$$k^{*}(\alpha)\cdot c(\mathcal{P})>1-2\alpha$$

and, thus

$$c(\mathcal{P})>\frac{1-2\alpha}{k^{*}(\alpha)}.$$

∎
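The chain of inequalities in this proof can be checked numerically. The following is a minimal Python sketch, not the authors' code. It assumes that Eq. 2 selects the shortest prefix of the evaluator's ranking whose cumulative generator probability reaches $1-\alpha$ (Eq. 2 is not reproduced in this appendix, so that reading is an assumption), that candidates are indexed by human rank with index 0 the best, and that the aligned generator uses linearly decreasing probabilities. The names `aligned_probs`, `build_portfolio`, `coverage`, and `check_bound` are hypothetical.

```python
import random


def aligned_probs(K):
    """Aligned generator: probability decreases with human rank (index 0 is best)."""
    total = K * (K + 1) / 2
    return [(K - i) / total for i in range(K)]


def build_portfolio(p, evaluator_order, alpha, tol=1e-12):
    """Assumed reading of Eq. 2: add candidates in evaluator order until the
    cumulative generator probability reaches 1 - alpha (up to a float tolerance)."""
    portfolio, mass = [], 0.0
    for idx in evaluator_order:
        portfolio.append(idx)
        mass += p[idx]
        if mass >= 1 - alpha - tol:
            break
    return portfolio


def coverage(portfolio):
    """c(P) = |P ∩ P*| / k, where P* holds the k human-best candidates (indices 0..k-1)."""
    k = len(portfolio)
    return sum(1 for idx in portfolio if idx < k) / k


def check_bound(K, alpha, rng):
    """Return (coverage, lower bound (1 - 2*alpha)/k, probability mass of X = P ∩ P*)
    for an aligned generator paired with a randomly ordered evaluator."""
    p = aligned_probs(K)
    order = list(range(K))
    rng.shuffle(order)  # arbitrary, possibly misaligned, evaluator ranking
    P = build_portfolio(p, order, alpha)
    k = len(P)
    mass_X = sum(p[i] for i in P if i < k)
    return coverage(P), (1 - 2 * alpha) / k, mass_X


if __name__ == "__main__":
    rng = random.Random(0)
    cov, bound, mass_X = check_bound(K=10, alpha=0.2, rng=rng)
    print(f"coverage={cov:.3f}  bound={bound:.3f}  mass of X={mass_X:.3f}")
```

Under these assumptions, the mass of $\mathcal{X}$ should be at least $1-2\alpha$ and the coverage should exceed the bound for any $\alpha<0.5$. Passing the identity order `list(range(K))` to `build_portfolio` reproduces the aligned-evaluator case of A.1, where coverage equals 1.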

## Appendix B Additional Experimental Details

### B.1 Simulated Portfolios

Generator Types.

- Aligned. Each generated candidate $i\in[K]$ has probability $p(i)=\frac{K+1-i}{\sum_{i'=1}^{K}(K+1-i')}$; therefore $\forall i<j\in[K]$, $p(i)>p(j)$, satisfying Definition [3.3](https://arxiv.org/html/2605.27013#S3.Thmtheorem3).
- Weakly Aligned. Each generated candidate $i\in[K]$ has a randomly selected probability value such that $\sum_{i\in[K/2]}p(i)>\sum_{j\in[K/2]}p(K/2+j)$, where we select $\sum_{i\in[K/2]}p(i)\in[0.5,0.99)$ uniformly at random. Note that Proposition [3.6](https://arxiv.org/html/2605.27013#S3.Thmtheorem6) holds for any portfolio of $K/2$ generation candidates constructed using this generator type.
- Uniform. Each generated candidate $i\in[K]$ has probability $p(i)=1/K$.
- Misaligned. Each generated candidate $i\in[K]$ has probability $p(i)=\frac{i}{\sum_{i'=1}^{K}i'}$.
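The four generator types can be written as probability vectors over $K$ candidates ordered by human rank. A minimal sketch follows. The text does not specify how the weakly aligned generator distributes mass within each half, so the normalized uniform draws below are an assumption, as are the function names and the requirement that $K$ be even.

```python
import random


def aligned(K):
    """p(i) proportional to K + 1 - i for i = 1..K (strictly decreasing)."""
    total = K * (K + 1) // 2
    return [(K + 1 - i) / total for i in range(1, K + 1)]


def misaligned(K):
    """p(i) proportional to i for i = 1..K (strictly increasing)."""
    total = K * (K + 1) // 2
    return [i / total for i in range(1, K + 1)]


def uniform(K):
    """p(i) = 1/K for every candidate."""
    return [1.0 / K] * K


def weakly_aligned(K, rng):
    """Top half receives total mass s drawn uniformly from [0.5, 0.99).

    Assumption: within each half, mass is split by normalizing independent
    uniform draws; the source only constrains the two half totals.
    """
    if K % 2 != 0:
        raise ValueError("this sketch assumes an even K")
    half = K // 2
    s = rng.uniform(0.5, 0.99)
    # 1 - random() lies in (0, 1], which avoids an all-zero half.
    top = [1.0 - rng.random() for _ in range(half)]
    bottom = [1.0 - rng.random() for _ in range(half)]
    top_sum, bottom_sum = sum(top), sum(bottom)
    return ([s * x / top_sum for x in top]
            + [(1.0 - s) * x / bottom_sum for x in bottom])


if __name__ == "__main__":
    rng = random.Random(0)
    for name, probs in [("aligned", aligned(6)),
                        ("weakly aligned", weakly_aligned(6, rng)),
                        ("uniform", uniform(6)),
                        ("misaligned", misaligned(6))]:
        print(name, [round(x, 3) for x in probs])
```

Each function returns a list whose entry at position $i-1$ is $p(i)$, so the first half of the list corresponds to the $K/2$ human-best candidates.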

### B.2 Portfolios of Optimization Models

We prompt all LLMs with temperature $1.0$, while setting verbosity to ‘low’ and the reasoning effort to ‘none’. Below we provide the prompt templates and system prompts we used for the generator, evaluator and LLM-as-a-judge models. For the templates we follow the templates used by Huang et al. ([2025](https://arxiv.org/html/2605.27013#bib.bib19)).

![Refer to caption](https://arxiv.org/html/2605.27013v1/x7.png) Figure 6: Generator prompt and system prompt.

![Refer to caption](https://arxiv.org/html/2605.27013v1/x8.png) Figure 7: Evaluator prompt and system prompt.

![Refer to caption](https://arxiv.org/html/2605.27013v1/x9.png) Figure 8: LLM-as-a-judge prompt and system prompt.

## Appendix C Additional Experimental Results on Simulated Portfolios

Below we present results over several generator-evaluator pairs and values of $K$.

![Refer to caption](https://arxiv.org/html/2605.27013v1/x10.png) Figure 9: Portfolio coverage against $1-\alpha$ for several generator-evaluator pairs and values of $K$. The mean is over $40$ iterations and shaded areas represent $95\%$ confidence intervals.

![Refer to caption](https://arxiv.org/html/2605.27013v1/x11.png) Figure 10: Portfolio size against $1-\alpha$ for several generator-evaluator pairs and values of $K$. The mean is over $40$ iterations and shaded areas represent $95\%$ confidence intervals.
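The captions report a mean over 40 iterations with 95% confidence intervals. One common way to obtain such a band is a normal approximation around the sample mean. The text does not state which interval construction was used, so the sketch below is an assumption, and `mean_and_ci95` is a hypothetical helper.

```python
import math
import statistics


def mean_and_ci95(values):
    """Sample mean with a normal-approximation 95% confidence interval.

    Assumption: the interval is mean +/- 1.96 * (sample std / sqrt(n));
    the source does not say how its intervals were computed.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two iterations to estimate spread")
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    coverages = [0.5, 0.75, 1.0, 0.75] * 10  # 40 simulated iterations
    mean, low, high = mean_and_ci95(coverages)
    print(f"mean={mean:.3f}  95% CI=[{low:.3f}, {high:.3f}]")
```

Applied once per value of $1-\alpha$, the three returned numbers give the line and the shaded band of a plot of this kind.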
