StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

arXiv cs.LG Papers

Summary

StarOR proposes a framework that synergizes Monte Carlo Tree Search with test-time reinforcement learning for automated optimization modeling, achieving state-of-the-art performance across multiple benchmarks.

arXiv:2606.15197v1 Announce Type: new

Cached at: 06/16/26, 11:38 AM

# StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling
Source: [https://arxiv.org/html/2606.15197](https://arxiv.org/html/2606.15197)
Yu Ding¹, Shisi Guan², Ran Hou¹, Wanyuan Wang¹,\*. ¹School of Computer Science and Engineering, Southeast University; ²Northwest A&F University. \*Corresponding author. E-mail: wywang@seu.edu.cn

###### Abstract

Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or curated training data, but are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, causing repeated rollouts to inherit similar modeling biases and providing limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and frontier LLMs. Code is available at [StarOR](https://github.com/Liwow/StarOR).

## 1 Introduction

Across various industries, decision-making challenges are typically articulated in natural language, ranging from complex engineering designs (Belegundu and Chandrupatla, [2019](https://arxiv.org/html/2606.15197#bib.bib33)) to large-scale energy management (Krishnamurthy et al., [2018](https://arxiv.org/html/2606.15197#bib.bib35); Singh, [2012](https://arxiv.org/html/2606.15197#bib.bib34)). While modern solvers are remarkably robust, they necessitate precise mathematical formulations to function, creating a critical translation bottleneck. This process requires an exacting sequence of logical commitments: classifying the optimization type, defining sets and parameters, and mapping constraints to executable code. Consequently, the semantic gap between a problem *description* and its formal *formulation* remains a primary barrier, preventing non-experts from leveraging advanced operations research tools (Ramamonjison et al., [2022a](https://arxiv.org/html/2606.15197#bib.bib13), [b](https://arxiv.org/html/2606.15197#bib.bib14); Ahmed and Choudhury, [2024](https://arxiv.org/html/2606.15197#bib.bib15)).

Large Language Models (LLMs) have made automated optimization modeling increasingly viable. Recent systems can already translate natural language into structured formulations or executable programs (Xiao et al., [2024](https://arxiv.org/html/2606.15197#bib.bib25); Huang et al., [2024a](https://arxiv.org/html/2606.15197#bib.bib1); Jiang et al., [2024](https://arxiv.org/html/2606.15197#bib.bib2); Chen et al., [2025](https://arxiv.org/html/2606.15197#bib.bib4); Wu et al., [2025](https://arxiv.org/html/2606.15197#bib.bib16); Wang et al., [2025c](https://arxiv.org/html/2606.15197#bib.bib17)). However, optimization modeling is highly sensitive to minor errors; a single silent mistake in an index or constraint direction can invalidate the entire model (Zhang et al., [2025](https://arxiv.org/html/2606.15197#bib.bib38); Lian et al., [2026](https://arxiv.org/html/2606.15197#bib.bib37)). This sensitivity stems from the task's hierarchical structure: early definitions of sets and types govern all subsequent symbols, meaning errors in variables inevitably propagate to the final objective. Treating modeling as a flat, end-to-end task ignores these dependencies, making the results brittle and difficult to repair.

Current research has attempted to mitigate these issues through various methodological paradigms, yet each faces a fundamental trade-off between structural rigor and adaptive reasoning. Specifically, model-based methods using strong frontier models rely on scaling to massive parameters, which often entails prohibitive costs and privacy risks in industrial settings. In contrast, learning-based methods (Huang et al., [2024a](https://arxiv.org/html/2606.15197#bib.bib1); Jiang et al., [2024](https://arxiv.org/html/2606.15197#bib.bib2); Lu et al., [2025](https://arxiv.org/html/2606.15197#bib.bib3); Chen et al., [2025](https://arxiv.org/html/2606.15197#bib.bib4)) focus on refining policies offline; however, they frequently struggle with the combinatorial complexity of industrial scenarios and lack a mechanism for rapid, instance-specific iteration. Most recently, search-based methods (Astorga et al., [2025](https://arxiv.org/html/2606.15197#bib.bib6); Li et al., [2025](https://arxiv.org/html/2606.15197#bib.bib5); Liu et al., [2025](https://arxiv.org/html/2606.15197#bib.bib7); Wang et al., [2025a](https://arxiv.org/html/2606.15197#bib.bib18)) enhance reliability by exploring multiple candidates at inference time. However, while search can guide traversal or candidate selection, the underlying generator remains fixed. Consequently, the process is bounded by the original proposal distribution: systematic modeling errors often recur across branches, and additional rollouts yield diminishing marginal returns, a bottleneck inherent to pure test-time scaling.

![Refer to caption](https://arxiv.org/html/2606.15197v1/x1.png)Figure 1: (a) One-shot optimization modeling is brittle because an early silent mistake can invalidate the full program. (b) Search-based methods with a fixed policy keep repeating the same failure. (c) StarOR searches and adapts the policy within the current instance. (d) The result empirically demonstrates the effectiveness of the search-and-adaptation paradigm.

As illustrated in Figure [1](https://arxiv.org/html/2606.15197#S1.F1), the limitations of existing paradigms suggest a promising direction: coupling structured exploration with instance-specific online adaptation. This trade-off is especially appropriate for OR modeling, where high-value industrial applications often prioritize formulation correctness over real-time generation, and additional test-time computation is acceptable when it improves reliability. However, realizing this coupling presents two key challenges. First, it requires a unified framework where discrete search steps serve as high-quality signals for continuous policy refinement. Second, it demands granular evaluation to address the credit assignment problem: since execution feedback is obtained only after completing partial formulations into code and ground-truth labels are unavailable at test time, the framework must infer which intermediate commitments are responsible for success or failure. To address these challenges, we propose StarOR, a framework that treats optimization modeling as a trajectory of structured commitments across four hierarchical stages. By anchoring a multi-faceted reward system to intermediate formulation nodes, StarOR enables each search step to both explore the formulation space and provide fine-grained feedback for adapting the policy to the current instance. This transforms optimization modeling from static generation into dynamic evolution. The main contributions of this work are as follows:

- Synergistic Search-and-Adapt Framework. We introduce StarOR, the first architecture to integrate hierarchical MCTS with test-time training, enabling online policy evolution during the optimization modeling process.
- OR-Specific Reward Design. We propose an unsupervised multi-reward system to evaluate formulation quality without ground-truth labels.
- State-of-the-art Performance. StarOR consistently outperforms prior methods, achieving a state-of-the-art 65.0% average accuracy on five optimization modeling benchmarks.

## 2 Related Work

##### Learning\-based Methods for Optimization Modeling\.

LLM-based optimization modeling has progressed from benchmark construction (Ramamonjison et al., [2022a](https://arxiv.org/html/2606.15197#bib.bib13), [b](https://arxiv.org/html/2606.15197#bib.bib14); Ahmed and Choudhury, [2024](https://arxiv.org/html/2606.15197#bib.bib15)) to systems trained with synthetic formulations and solver feedback. ORLM and OptMATH build large-scale synthesis pipelines (Huang et al., [2024a](https://arxiv.org/html/2606.15197#bib.bib1); Lu et al., [2025](https://arxiv.org/html/2606.15197#bib.bib3)), LLMOPT studies standardized formulation representations (Jiang et al., [2024](https://arxiv.org/html/2606.15197#bib.bib2)), and SIRL uses solver-informed rewards for executable modeling (Chen et al., [2025](https://arxiv.org/html/2606.15197#bib.bib4)). Recent systems such as Step-Opt-Instruct and ORMind further emphasize structured validation and iterative reasoning (Wu et al., [2025](https://arxiv.org/html/2606.15197#bib.bib16); Wang et al., [2025c](https://arxiv.org/html/2606.15197#bib.bib17)). These methods improve the base modeling prior, but they remain largely offline: adapting to new industrial constraints typically requires additional data, retraining, or stronger proprietary models.

##### Search\-based Methods for Optimization Modeling\.

Search-based inference addresses the brittleness of one-shot generation by exploring multiple formulation trajectories. Inspired by Tree of Thoughts (Yao et al., [2023](https://arxiv.org/html/2606.15197#bib.bib21)), OR-specific methods use structured search to improve candidate selection: AutoFormulation combines MCTS with symbolic pruning (Astorga et al., [2025](https://arxiv.org/html/2606.15197#bib.bib6)), SolverLLM uses feedback-guided dynamic expansion (Li et al., [2025](https://arxiv.org/html/2606.15197#bib.bib5)), and OptiTree and related methods decompose modeling through hierarchical search (Liu et al., [2025](https://arxiv.org/html/2606.15197#bib.bib7); Wang et al., [2025a](https://arxiv.org/html/2606.15197#bib.bib18)). However, these approaches keep the policy fixed. Search can filter or rerank candidates, but it does not internalize the feedback encountered during exploration, so modeling biases may recur across branches.

##### Test\-time Adaptation and Reinforcement Learning\.

Test-time training and reinforcement learning update a model or lightweight adapter during inference, allowing feedback from the current instance to shape later predictions (Akyürek et al., [2025](https://arxiv.org/html/2606.15197#bib.bib39); Zuo et al., [2025](https://arxiv.org/html/2606.15197#bib.bib8); Hu et al., [2025](https://arxiv.org/html/2606.15197#bib.bib45)). Related work strengthens this idea with tool verification, transient policy adaptation, and online discovery (Liao et al., [2026](https://arxiv.org/html/2606.15197#bib.bib40); Jiao et al., [2026](https://arxiv.org/html/2606.15197#bib.bib9); Wang et al., [2025b](https://arxiv.org/html/2606.15197#bib.bib42); Yuksekgonul et al., [2026](https://arxiv.org/html/2606.15197#bib.bib43)). In operations research, OR-R1 applies group relative policy optimization to improve modeling outputs at test time (Ding et al., [2025](https://arxiv.org/html/2606.15197#bib.bib11)). StarOR follows this direction but changes the granularity of adaptation: instead of learning only from complete outputs, it attaches executable, semantic, and structural rewards to intermediate MCTS nodes, enabling instance-specific policy refinement during formulation search.

## 3 Method

### 3.1 Problem Formulation

Given a natural language optimization problem $x$, our objective is to generate executable, faithful, and mathematically sound Python code $c$. Following (Jiang et al., [2024](https://arxiv.org/html/2606.15197#bib.bib2); Li et al., [2025](https://arxiv.org/html/2606.15197#bib.bib5)), we decompose the modeling process into six elements (*type, sets, parameters, variables, objective,* and *constraints*), structured as a four-stage trajectory $\tau=(z^{(1)},z^{(2)},z^{(3)},z^{(4)})$. These stages represent: (1) sets and type, (2) parameters and variables, (3) objective and constraints, and (4) the final executable code. Any prefix $\tau_{\leq s}$ ($s<4$) denotes a *partial formulation*.

We formalize this sequence as a Markov Decision Process (MDP) defined by $(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R})$:

- States $\mathcal{S}$: A state $\sigma_{t}=(x,\tau_{\leq t})$ pairs the problem description $x$ with the partial formulation $\tau_{\leq t}$ generated so far.
- Actions $\mathcal{A}$: An action $a_{t}$ corresponds to generating the next-stage commitment $z^{(t+1)}$.
- Transitions $\mathcal{P}$: The transition is deterministic, defined by $\sigma_{t+1}=(x,\tau_{\leq t+1})$, where $\tau_{\leq t+1}$ is the result of appending $a_{t}$ to $\tau_{\leq t}$.
- Rewards $\mathcal{R}$: The task-level reward is defined on completed code at $\sigma_{4}$. StarOR estimates a node-level reward by completing each partial trajectory and evaluating it (see Section [3.2.1](https://arxiv.org/html/2606.15197#S3.SS2.SSS1.Px2)).
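The MDP above can be sketched as a small immutable data structure. This is a minimal illustration, not the authors' implementation: the names `State`, `STAGES`, and `step` are hypothetical, and each commitment is assumed to be a plain string produced by the LLM.

```python
from dataclasses import dataclass
from typing import Tuple

# Stage names are illustrative labels for the four stages in Section 3.1.
STAGES = (
    "sets_and_type",
    "parameters_and_variables",
    "objective_and_constraints",
    "code",
)


@dataclass(frozen=True)
class State:
    """sigma_t = (x, tau_<=t): the problem text plus the commitments so far."""
    problem: str
    commitments: Tuple[str, ...] = ()

    @property
    def stage(self) -> int:
        # Number of stages already committed (t in sigma_t).
        return len(self.commitments)

    @property
    def is_terminal(self) -> bool:
        # sigma_4 holds the final executable code.
        return self.stage == len(STAGES)


def step(state: State, action: str) -> State:
    """Deterministic transition: append the next-stage commitment z^(t+1)."""
    if state.is_terminal:
        raise ValueError("trajectory is already complete")
    return State(state.problem, state.commitments + (action,))
```

Because the state is frozen, sibling candidates sampled from the same prefix can share it safely while each extends it with a different action.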

![Refer to caption](https://arxiv.org/html/2606.15197v1/x2.png)Figure 2: Overview of the StarOR framework. (a) Overall framework: StarOR forms a closed loop between tree search, code execution, multi-faceted reward evaluation, and GRPO-based transient LoRA adaptation. (b) Search-and-adaptation: StarOR searches over four formulation stages and uses execution-grounded feedback to adapt the local policy during exploration. (c) Multi-faceted reward: An unsupervised reward provides node-level feedback for both tree search and policy evolution.

### 3.2 StarOR: Online Optimization and Tree Search for Modeling

We propose StarOR, a framework that transforms optimization modeling into a synergized process of Monte Carlo Tree Search (MCTS) and online policy optimization. Unlike conventional static generation, StarOR treats each modeling stage as a localized task of search-and-adaptation. Figure [2](https://arxiv.org/html/2606.15197#S3.F2) gives an overview of the StarOR pipeline for optimization modeling.

At each iteration $t$, MCTS identifies a partial formulation $\tau_{\leq s-1}$ and addresses a localized sub-problem: determining the optimal next-stage component $z^{(s)}$ conditioned on the task $x$ and preceding modeling decisions. Formally, for stage $s$, we sample a candidate set:

$$z_{i}^{(s)}\sim\pi_{\phi+\Delta\phi}\!\left(\cdot\mid x,\tau_{\leq s-1},s\right),\qquad i=1,\dots,K, \tag{1}$$

where $\phi$ denotes the base policy and $\Delta\phi$ represents a transient LoRA adapter maintained exclusively for the current test instance.

This integration fosters a self-evolving search process: MCTS facilitates structured exploration of modeling hypotheses, while online optimization distills execution feedback to refine the transient adapter $\Delta\phi$. In this dual-purpose cycle, the sampled candidates simultaneously expand the search tree and serve as the training batch for test-time adaptation. Consequently, the framework identifies critical decision points and immediately sharpens the local policy for the specific modeling instance. The remainder of this section is organized as follows: Section [3.2.1](https://arxiv.org/html/2606.15197#S3.SS2.SSS1) describes our node-level online policy evolution, while Section [3.2.2](https://arxiv.org/html/2606.15197#S3.SS2.SSS2) provides a detailed exposition of the MCTS framework.

#### 3.2.1 Policy Evolution via Group Relative Policy Optimization (GRPO)

##### Group Advantage Estimation via GRPO\.

At each MCTS iteration $t$ corresponding to stage $s$, we sample a sibling group of $K$ candidate continuations $\{z_{i}^{(s)}\}_{i=1}^{K}$ conditioned on the same prefix $\tau_{\leq s-1}$. Following Group Relative Policy Optimization (GRPO), these siblings serve as a local comparison group, allowing StarOR to estimate candidate quality by relative reward rather than an additional value function (Shao et al., [2024](https://arxiv.org/html/2606.15197#bib.bib12)):

$$A_{i}=\frac{R_{i}-\operatorname{mean}_{j=1}^{K}R_{j}}{\operatorname{std}_{j=1}^{K}R_{j}+\epsilon}, \tag{2}$$

where $\epsilon$ is a small numerical constant. A detailed description of the GRPO update is in Appendix [D](https://arxiv.org/html/2606.15197#A4).
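The group-relative advantage in Eq. (2) can be illustrated in a few lines. This is a sketch under stated assumptions: the function name `group_advantages` is hypothetical, and the population standard deviation is assumed because the text does not say whether the population or sample form is used.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalise sibling rewards R_1..R_K into advantages A_i (Eq. 2)."""
    mean = statistics.fmean(rewards)
    # Assumption: population std; the paper does not state which variant it uses.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Two siblings, one clearly better than the other.
advantages = group_advantages([1.0, 0.0])
```

When all siblings receive the same reward the advantages collapse to zero, so a group without internal contrast contributes no learning signal.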

##### Multi\-faceted Reward\.

To stabilize the search process and prevent extreme outliers from distorting the learning signal, we first use the base policy to establish a pre-generation prior before search-and-adaptation. Conditioned on the problem context, the base policy performs problem-level reasoning to construct synthetic perturbation cases and estimate a plausible objective range $[L_{x},U_{x}]$ for instance $x$. This prior is used only as a soft grounding signal: it calibrates the objective-range penalty in $r_{\mathrm{sem}}$ and provides expected objective ranges for $r_{\mathrm{test}}$, rather than serving as a hard correctness oracle. For candidate $i$, the total reward $R_{i}=\sum_{j}w_{j}r_{j,i}$ is computed from the following components:

Semantic Reward ($r_{\mathrm{sem},i}$): This measures semantic consensus via the objective-value cluster size $|C_{\mathrm{sem},i}|$. Let $K_{\mathrm{valid}}$ denote the number of candidates in the sibling rollout group that are executable and return a finite objective value. We cluster these valid objective values under a numerical tolerance, and $C_{\mathrm{sem},i}$ denotes the cluster containing candidate $i$. To penalize numerically implausible results while maintaining robustness against potential inaccuracies in the initial estimation, we apply a soft-penalty factor $\lambda\in(0,1)$ for objectives $O_{i}$ falling outside the predicted range:

$$r_{\mathrm{sem},i}=\frac{|C_{\mathrm{sem},i}|}{K_{\mathrm{valid}}}\cdot\begin{cases}1&\text{if }O_{i}\in[L_{x},U_{x}],\\ \lambda&\text{otherwise.}\end{cases} \tag{3}$$

For candidates that fail to execute or do not produce a finite objective value, we set $r_{\mathrm{sem},i}=0$. This $\lambda$-scaling ensures that otherwise valid formulations are not entirely dismissed due to conservative estimates from the pre-generation phase.
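A possible reading of Eq. (3) is sketched here. Several details are assumptions rather than the paper's specification: the function name `semantic_rewards` is hypothetical, failed candidates are encoded as `None`, a candidate's cluster is approximated by the valid siblings within a relative tolerance of its own objective, and the defaults for `lam` and `rel_tol` are placeholders.

```python
import math


def semantic_rewards(objectives, lower, upper, lam=0.5, rel_tol=1e-4):
    """Consensus share times a soft range penalty for each sibling (Eq. 3).

    objectives: one float per candidate, or None if it failed to execute.
    """
    def is_valid(o):
        return o is not None and math.isfinite(o)

    valid = [o for o in objectives if is_valid(o)]
    k_valid = len(valid)
    rewards = []
    for o in objectives:
        if not is_valid(o):
            rewards.append(0.0)  # non-executable or non-finite objective
            continue
        # Assumption: cluster = valid siblings numerically close to this one.
        cluster = sum(
            math.isclose(o, v, rel_tol=rel_tol, abs_tol=1e-9) for v in valid
        )
        scale = 1.0 if lower <= o <= upper else lam
        rewards.append(cluster / k_valid * scale)
    return rewards


# Three valid siblings (two agree), one failure, prior range [5, 20].
example = semantic_rewards([10.0, 10.0, 25.0, None], lower=5.0, upper=20.0)
```

In this example the two agreeing candidates each receive 2/3, the out-of-range outlier receives 1/3 scaled by `lam`, and the failed candidate receives 0.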

Structural Reward ($r_{\mathrm{struct},i}$): This evaluates structural consistency via a signature vector $\mathbf{v}_{i}=[n_{\mathrm{bin}},n_{\mathrm{int}},n_{\mathrm{cont}},n_{\mathrm{con}},\mathrm{sense}]$, representing the counts of binary, integer, and continuous variables, the number of constraints, and the optimization sense. Let $|C(i,d)|$ denote the number of executable candidates sharing candidate $i$'s value on dimension $d$. The reward is:

$$r_{\mathrm{struct},i}=\frac{1}{5}\sum_{d\in\mathbf{v}_{i}}\sqrt{\frac{|C(i,d)|}{K^{*}_{\mathrm{valid}}}}. \tag{4}$$

Here $K^{*}_{\mathrm{valid}}$ is the number of executable candidates.
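Eq. (4) can be sketched as a per-dimension agreement score. The function name `structural_rewards` is hypothetical, signatures are assumed to be 5-tuples matching $\mathbf{v}_i$, and giving non-executable candidates a reward of 0 is an assumption, since the text only defines the reward over executable ones.

```python
import math


def structural_rewards(signatures):
    """Average square-rooted agreement share over signature dimensions (Eq. 4).

    signatures: (n_bin, n_int, n_cont, n_con, sense) per candidate,
    or None for candidates whose code did not execute.
    """
    valid = [s for s in signatures if s is not None]
    k_valid = len(valid)  # K*_valid
    rewards = []
    for sig in signatures:
        if sig is None:
            rewards.append(0.0)  # assumption: no structural credit on failure
            continue
        total = 0.0
        for d in range(len(sig)):
            # |C(i, d)|: executable siblings with the same value on dimension d
            share = sum(v[d] == sig[d] for v in valid)
            total += math.sqrt(share / k_valid)
        rewards.append(total / len(sig))
    return rewards


# Two executable siblings that differ only in their constraint count.
example = structural_rewards([(2, 0, 3, 4, "min"), (2, 0, 3, 5, "min"), None])
```

The square root softens the penalty for minority structures: a candidate that disagrees on one dimension still keeps most of its structural reward.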

Execution Reward ($r_{\mathrm{exec},i}$): A binary indicator $r_{\mathrm{exec},i}\in\{0,1\}$ of code executability.

Test-case Reward ($r_{\mathrm{test},i}$): This provides grounding through $N_{t}$ test cases. Each rollout is assigned a score $S_{i,j}\in\{0,\lambda,1\}$ based on whether its output aligns with the expected range for test case $j$:

$$r_{\mathrm{test},i}=\frac{1}{N_{t}}\sum_{j=1}^{N_{t}}S_{i,j}. \tag{5}$$

To optimize the search-and-adapt process, Dynamic Reward Shaping adaptively tunes the weights $w_{j}$ as MCTS iterations progress. It transitions the learning signal from *feasibility-oriented* metrics ($r_{\mathrm{exec}},r_{\mathrm{test}}$) to *consensus-oriented* metrics ($r_{\mathrm{sem}},r_{\mathrm{struct}}$), thereby balancing broad initial exploration with eventual convergence toward a precise and consistent mathematical architecture. A detailed description of the reward is provided in Appendix [D.2](https://arxiv.org/html/2606.15197#A4.SS2).
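To make the weighted total $R_i=\sum_j w_j r_{j,i}$ and the shift from feasibility to consensus concrete, here is a rough sketch. The actual schedule is given in the paper's appendix and is not reproduced here: the linear interpolation, the `EARLY` and `LATE` weight profiles, and all function names are illustrative assumptions.

```python
# Hypothetical weight profiles; the paper's real values are in its appendix.
EARLY = {"exec": 0.4, "test": 0.4, "sem": 0.1, "struct": 0.1}
LATE = {"exec": 0.1, "test": 0.1, "sem": 0.4, "struct": 0.4}


def case_reward(scores):
    """Mean of per-test-case scores S_{i,j} in {0, lambda, 1} (Eq. 5)."""
    return sum(scores) / len(scores) if scores else 0.0


def shaped_weights(iteration, max_iterations):
    """Assumed linear shift from feasibility weights to consensus weights."""
    progress = iteration / max(1, max_iterations - 1)
    progress = min(max(progress, 0.0), 1.0)
    return {k: (1 - progress) * EARLY[k] + progress * LATE[k] for k in EARLY}


def total_reward(components, weights):
    """R_i = sum_j w_j * r_{j,i} over the four reward components."""
    return sum(weights[k] * components[k] for k in weights)


# An executable candidate that passes half its test cases, scored early on.
components = {"exec": 1.0, "test": case_reward([1, 0.5, 0]), "sem": 0.5, "struct": 0.9}
reward = total_reward(components, shaped_weights(iteration=0, max_iterations=10))
```

Keeping the weights normalized to sum to 1 keeps rewards on a comparable scale across iterations, which matters because Eq. (2) only normalizes within a sibling group.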

##### Sample\-Specific Policy Evolution\.

This reward-driven feedback is internalized through a transient LoRA adapter $\Delta\phi$, which is initialized to zero and reset for every new problem instance to prevent cross-task interference. This mechanism facilitates a policy evolution where search and learning mutually reinforce one another: MCTS generates the local contrastive hypotheses necessary for meaningful advantage estimation, while GRPO distills the execution feedback to sharpen the policy's reasoning for the specific instance. This coupling overcomes the limitations of static exploration; while MCTS identifies alternative modeling paths, the online optimization ensures that the policy progressively specializes in the unique constraints and nuances of the current OR problem, refining the model from high-level set definitions down to terminal code implementation.

#### 3.2.2 Monte Carlo Tree Search for Optimization Modeling

We formulate the modeling process as a structured decision-making task, navigating a four-stage search tree via Monte Carlo Tree Search (MCTS). This framework systematically explores the space of LLM-proposed formulations through four refined phases: Problem Type and Sets, Variables and Parameters, Constraints and Objectives, and Code.

##### Selection\.

The search traverses the tree from the root by selecting child nodes that maximize a PUCT\-style objective, balancing exploitation and exploration:

$$\mathrm{Score}(n,a)=Q(n,a)+c_{\mathrm{puct}}\cdot P(n,a)\frac{\sqrt{\sum_{a^{\prime}}N(n,a^{\prime})}}{1+N(n,a)}, \tag{6}$$

where $n$ denotes the current node and the prior probability $P(n,a)$ is derived from the average log-likelihood of the sequence $Y_{a}$ corresponding to action $a$:

$$P(n,a)=\frac{\exp(s_{a}/\eta)}{\sum_{a^{\prime}\in\mathcal{A}(n)}\exp(s_{a^{\prime}}/\eta)},\quad s_{a}=\frac{1}{|Y_{a}|}\sum_{u=1}^{|Y_{a}|}\log\pi_{\theta}(y_{u}\mid x,\tau_{\leq s-1},y_{<u}). \tag{7}$$

Here, $\eta$ is the prior softmax temperature, and $\pi_{\theta}$ (where $\theta=\phi+\Delta\phi$) represents the transient policy. This mechanism ensures that the selection is guided by both historical rewards $Q$ and the model's evolving, instance-specific beliefs.
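The selection rule in Eqs. (6) and (7) can be sketched as two small functions. The names `sequence_prior` and `puct_score` and the default values of `eta` and `c_puct` are illustrative assumptions; the per-action inputs are the length-normalized log-likelihoods $s_a$, assumed to be computed elsewhere from the transient policy.

```python
import math


def sequence_prior(avg_logprobs, eta=1.0):
    """Softmax with temperature eta over the scores s_a of sibling actions (Eq. 7)."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(avg_logprobs)
    exps = [math.exp((s - top) / eta) for s in avg_logprobs]
    norm = sum(exps)
    return [e / norm for e in exps]


def puct_score(q, prior, visits, total_visits, c_puct=1.0):
    """Exploitation term Q plus a prior-weighted exploration bonus (Eq. 6)."""
    return q + c_puct * prior * math.sqrt(total_visits) / (1 + visits)


# Pick the child with the highest score among three siblings.
priors = sequence_prior([-0.8, -1.2, -2.0], eta=0.5)
q_values, visit_counts = [0.6, 0.4, 0.0], [3, 1, 0]
scores = [
    puct_score(q, p, n, sum(visit_counts))
    for q, p, n in zip(q_values, priors, visit_counts)
]
best_child = max(range(len(scores)), key=scores.__getitem__)
```

Averaging log-probabilities over sequence length keeps long commitments, such as full code, from being penalized relative to short ones when priors are compared.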

##### Expansion and Simulation\.

Upon reaching an expandable leaf node at stage $s$, StarOR expands the tree by sampling $K$ candidate commitments $\{z_{i}^{(s)}\}_{i=1}^{K}$ from the current transient policy $\pi_{\phi+\Delta\phi_{t}}$. Each candidate is appended to the existing partial formulation and temporarily completed into executable code for evaluation (in a single generation). After execution, sibling candidates that yield near-identical objective values (tolerance $\epsilon_{c}=0.01\%$) are aggregated into a Group Cluster, representing behaviorally similar modeling paths under the current instance. This clustering enables the search to share evidence across semantically equivalent branches and reduce redundant exploration.
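One way to form Group Clusters from sibling objective values is sketched below. The greedy sort-and-scan procedure and the function name `cluster_by_objective` are assumptions (the text specifies only the tolerance), and $\epsilon_c=0.01\%$ is interpreted here as a relative tolerance of `1e-4`.

```python
import math


def cluster_by_objective(objectives, rel_tol=1e-4):
    """Group candidate indices whose objective values nearly coincide.

    objectives: one float per sibling, or None if execution failed.
    Returns a list of clusters, each a list of candidate indices.
    """
    valid = [
        i for i, o in enumerate(objectives)
        if o is not None and math.isfinite(o)
    ]
    valid.sort(key=lambda i: objectives[i])
    clusters = []
    for i in valid:
        # Compare against the first member of the latest cluster so that
        # a chain of small gaps cannot stretch a cluster indefinitely.
        if clusters and math.isclose(
            objectives[i], objectives[clusters[-1][0]],
            rel_tol=rel_tol, abs_tol=1e-12,
        ):
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


clusters = cluster_by_objective([100.0, 100.005, 250.0, None, 100.0])
```

Failed candidates are left out of every cluster, so they neither share evidence with nor receive evidence from their siblings.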

##### Group Backpropagation\.

To leverage collective evidence during tree search, we propagate rewards at both individual and group granularities. Each child node $i$ maintains its own reward $R_{i}$, while the selected ancestral path is updated using the group mean $\bar{R}=\frac{1}{K}\sum_{i=1}^{K}R_{i}$. Beyond standard backpropagation, we introduce Group Backpropagation to share evidence across behaviorally similar nodes. The intuition is that nodes within the same Group Cluster often correspond to formulations with similar objective behavior and hence similar values; repeatedly revisiting them therefore leads to redundant exploration. For any sibling node $m$ residing in the same Group Cluster as a trajectory node, its value is softly updated as:

$$Q(m)\leftarrow\frac{N(m)Q(m)+\rho^{\ell}\bar{R}}{N(m)+\rho^{\ell}}, \tag{8}$$

where $N(m)$ is incremented by $\rho^{\ell}$, $\ell$ denotes the structural tree distance between node $m$ and the trajectory node, and $\rho\in[0,1]$ is a decay factor. This shared update allows PUCT to treat same-cluster nodes as partially explored, reducing repeated visits to semantically equivalent branches.
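The soft update in Eq. (8) amounts to a weighted running average with a fractional visit count. The class `NodeStats`, the function `soft_group_update`, and the default `rho` are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    q: float = 0.0  # running value estimate Q(m)
    n: float = 0.0  # visit count N(m); fractional because soft updates add rho**l


def soft_group_update(node, group_mean, distance, rho=0.5):
    """Blend the group mean reward into a same-cluster sibling (Eq. 8)."""
    weight = rho ** distance  # evidence decays with structural tree distance
    if node.n + weight == 0:
        return  # nothing to blend (only possible when rho == 0 and n == 0)
    node.q = (node.n * node.q + weight * group_mean) / (node.n + weight)
    node.n += weight


sibling = NodeStats(q=0.2, n=2.0)
soft_group_update(sibling, group_mean=0.8, distance=1, rho=0.5)
```

Because the visit count also grows by the same weight, the exploration bonus in Eq. (6) shrinks for the sibling, which is what discourages revisiting equivalent branches.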

##### Early Stopping and Termination\.

The search terminates upon reaching the *code* stage. To enhance symbolic coverage, StarOR applies a one-shot suppression factor $\gamma_{\mathrm{sup}}=0.5$ to the PUCT scores of the current trajectory and its associated objective-consensus cluster. This temporarily diverts the search from the current basin, forcing the exploration of alternative branches before final acceptance. Early stopping is triggered if three consecutive steps across distinct stages converge to the same objective consensus. Upon termination, the optimal candidate is selected from the *code* nodes; if the budget is exhausted, a global consensus voting mechanism identifies the most robust formulation. Any residual errors trigger a repair loop guided by execution traces. This synergy allows MCTS to navigate the formulation space while the repair phase finalizes executable reliability.
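The early-stopping rule admits more than one reading; one plausible interpretation is sketched here. It assumes the search keeps a history of `(stage, consensus_objective)` pairs, one per step, and stops when the last three steps cover three different stages and agree on the objective within a relative tolerance. The function name, the history format, and the tolerance are all assumptions.

```python
import math


def should_stop_early(history, patience=3, rel_tol=1e-4):
    """Check whether the last `patience` steps agree on one consensus objective.

    history: list of (stage, consensus_objective) tuples, oldest first;
    consensus_objective is None when a step produced no valid consensus.
    """
    if len(history) < patience:
        return False
    recent = history[-patience:]
    # The steps must come from distinct stages, not repeats of one stage.
    if len({stage for stage, _ in recent}) < patience:
        return False
    reference = recent[0][1]
    return all(
        obj is not None and math.isclose(obj, reference, rel_tol=rel_tol)
        for _, obj in recent
    )


stop = should_stop_early([(1, 42.0), (2, 42.0), (3, 42.0)])
```

Requiring distinct stages guards against stopping just because the same node was expanded repeatedly and trivially agreed with itself.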

## 4 Experiments

To evaluate the effectiveness of StarOR, we conduct comprehensive experiments across five benchmark datasets, comparing our approach with both prompt\-based and learning\-based baselines\. Our study is designed to answer the following key research questions:

1. (RQ1) Effectiveness of StarOR: How does StarOR, as a test-time scalable framework, compare with strong large models (e.g., GPT-4, DeepSeek-R1), specialized learning-based models, and test-time reasoning methods across diverse optimization benchmarks?
2. (RQ2) Synergy of Search and Test-Time RL: What is the specific contribution of test-time adaptation to the overall performance? How does the framework perform when using only MCTS without the iterative GRPO-based policy update?
3. (RQ3) Impact of Reward Mechanism: How does our specialized reward design influence the search quality and decision-making during inference? Can it significantly outperform simpler baselines such as majority voting?
4. (RQ4) Test-Time Scalability: How does the performance of StarOR scale with respect to the expansion group size $K$ and the overall computational budget?

### 4.1 Experiment Setup

##### Benchmarks\.

We evaluate on five cleaned benchmarks totaling 1356 instances, following Chen et al. ([2025](https://arxiv.org/html/2606.15197#bib.bib4)). Table [1](https://arxiv.org/html/2606.15197#S4.T1) summarizes the main benchmarks.

Table 1: Main benchmark suite. #Inst. is the number of instances in each dataset. The benchmarks are roughly ordered by increasing modeling difficulty from top to bottom.

| Dataset | #Inst. | Focus |
| --- | --- | --- |
| NL4OPT (Ramamonjison et al. ([2022b](https://arxiv.org/html/2606.15197#bib.bib14))) | 245 | Natural-language OR modeling |
| MAMO-Easy (Huang et al. ([2024b](https://arxiv.org/html/2606.15197#bib.bib23))) | 642 | Easy LP/MILP modeling |
| MAMO-Complex (Huang et al. ([2024b](https://arxiv.org/html/2606.15197#bib.bib23))) | 203 | Complex LP/MILP modeling |
| IndustryOR (Huang et al. ([2024a](https://arxiv.org/html/2606.15197#bib.bib1))) | 100 | Industrial OR problems |
| OptMATH (Lu et al. ([2025](https://arxiv.org/html/2606.15197#bib.bib3))) | 166 | Optimization modeling benchmark |
| Total | 1356 | Varying-difficulty OR modeling tasks |

##### Baselines\.

To benchmark our method, we compare against three categories of baseline approaches (see Appendix [A](https://arxiv.org/html/2606.15197#A1) for detailed descriptions):

1. Model-based Methods. Approaches that use state-of-the-art frontier models, such as GPT-4 (OpenAI et al., [2023](https://arxiv.org/html/2606.15197#bib.bib27)) and DeepSeek-V3.1 (DeepSeek-AI et al., [2024](https://arxiv.org/html/2606.15197#bib.bib29)), and the large reasoning models DeepSeek-R1 (DeepSeek-AI et al., [2025](https://arxiv.org/html/2606.15197#bib.bib28)) and OpenAI-o3 (OpenAI, [2025](https://arxiv.org/html/2606.15197#bib.bib46)), integrated with the Gurobi solver (Gurobi Optimization, LLC, [2026](https://arxiv.org/html/2606.15197#bib.bib26)) for optimization tasks.
2. Learning-based Methods. Methods that enhance the backbone model through offline training or synthetic supervision prior to inference. This category includes ORLM (Huang et al., [2024a](https://arxiv.org/html/2606.15197#bib.bib1)), LLMOPT (Jiang et al., [2024](https://arxiv.org/html/2606.15197#bib.bib2)), OptMATH (Lu et al., [2025](https://arxiv.org/html/2606.15197#bib.bib3)), and SIRL (Chen et al., [2025](https://arxiv.org/html/2606.15197#bib.bib4)).
3. Test-time Methods. Strategies that operate during inference to refine solution quality using the same backbone when applicable. We compare against zero-shot decoding, best-of-$N$ (Kang et al., [2025](https://arxiv.org/html/2606.15197#bib.bib31)), Reflexion (Shinn et al., [2023](https://arxiv.org/html/2606.15197#bib.bib36)), AutoFormulator (an MCTS-style method) (Astorga et al., [2025](https://arxiv.org/html/2606.15197#bib.bib6)), OptiTree (Liu et al., [2025](https://arxiv.org/html/2606.15197#bib.bib7)), and OR-R1 (Ding et al., [2025](https://arxiv.org/html/2606.15197#bib.bib11)).

##### Implementation Details and Metrics\.

To ensure a fair comparison, we standardize on Qwen3-4B-Instruct-2507 as the backbone for all test-time methods except OR-R1 (Qwen3-8B). Detailed hyperparameters are provided in Appendix [D.1](https://arxiv.org/html/2606.15197#A4.SS1). We adopt Solution Accuracy (Acc.) as our primary metric, using a numerical tolerance of $10^{-6}$ for verification. We use an expansion group size of $K=8$ and a maximum of $T=10$ iterations in our experiments.
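
The Solution Accuracy metric can be sketched as below. This is a minimal illustration rather than the evaluation script used in the paper: the helper names `is_correct` and `solution_accuracy` are hypothetical, and applying the $10^{-6}$ tolerance as a relative error (with a floor of 1 on the reference magnitude) is an assumption, since the text states only the tolerance value.

```python
def is_correct(predicted, reference, tol=1e-6):
    """Return True when a predicted objective value matches the reference.

    Assumption: the tolerance is relative, with the reference magnitude
    floored at 1. A missing prediction (None) counts as incorrect.
    """
    if predicted is None:
        return False
    return abs(predicted - reference) <= tol * max(1.0, abs(reference))


def solution_accuracy(predictions, references, tol=1e-6):
    """Percentage of instances whose predicted objective matches the reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    hits = sum(is_correct(p, r, tol) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```

For example, with predictions `[10.0, None, 3.0000000001, 5.1]` and references `[10.0, 2.0, 3.0, 5.0]`, two of the four instances are counted as solved under this rule.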

### 4.2 Result Analysis

##### Main results \(RQ1\)\.

As shown in Table [2](https://arxiv.org/html/2606.15197#S4.T2), StarOR-4B achieves a state-of-the-art (SOTA) average accuracy of 65.0%, matching DeepSeek-V3.1 (65.0%), slightly exceeding the much larger DeepSeek-R1 (64.6%), and outperforming GPT-4 (55.0%). Despite its compact 4B parameter scale, StarOR surpasses specialized learning-based models such as SIRL-7B (60.6%) and all test-time baselines in average accuracy, although SIRL and OptMATH remain ahead on NL4OPT. Our method is particularly robust in complex domains, reaching the best overall result of 51.0% on IndustryOR and the highest accuracy among learning-based and test-time methods on MAMO-Complex and OptMATH. These results suggest that test-time training is well suited to OR modeling, where precision in high-complexity scenarios outweighs real-time constraints. By converting test-time compute into modeling accuracy, StarOR offers an efficient alternative to massive model scaling.

Table 2: Main performance comparison (Acc. %) on five benchmarks. Results marked with ∗ are taken from Chen et al. ([2025](https://arxiv.org/html/2606.15197#bib.bib4)) and evaluated on the same datasets, except for OR-R1, whose results are taken from its original paper; – denotes missing results.

| Method | NL4OPT | MAMO Easy | MAMO Complex | IndustryOR | OptMATH | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Model-based Methods | | | | | | |
| GPT-4 | 89.0∗ | 87.3∗ | 49.3∗ | 33.0∗ | 16.6∗ | 55.0∗ |
| DeepSeek-V3.1 | 84.8∗ | 88.9∗ | 63.5∗ | 44.0∗ | 43.9∗ | 65.0∗ |
| DeepSeek-R1 | 82.4∗ | 87.2∗ | 67.9∗ | 45.0∗ | 40.4∗ | 64.6∗ |
| OpenAI-o3 | 69.4∗ | 77.1∗ | 51.2∗ | 44.0∗ | 44.0∗ | 57.1∗ |
| Learning-based Methods | | | | | | |
| ORLM-LLaMA3-8B | 85.7∗ | 82.3∗ | 37.4∗ | 38.0∗ | 2.6∗ | 49.2∗ |
| LLMOPT-Qwen2.5-14B | 80.3∗ | 89.5∗ | 44.1∗ | 29.0∗ | 12.5∗ | 51.1∗ |
| OptMATH-Qwen2.5-7B | 94.7∗ | 86.5∗ | 51.2∗ | 20.0∗ | 24.4∗ | 55.4∗ |
| SIRL-Qwen2.5-7B | 96.3∗ | 91.7∗ | 51.7∗ | 33.0∗ | 30.5∗ | 60.6∗ |
| Test-Time Methods (base model: Qwen3-4B-Instruct-2507) | | | | | | |
| Zero-shot | 55.9 | 81.9 | 11.8 | 27.0 | 10.8 | 37.5 |
| Best-of-N ($N=16$) | 70.6 | 88.9 | 22.2 | 39.0 | 21.6 | 48.4 |
| Reflexion | 58.0 | 82.7 | 16.3 | 28.0 | 11.4 | 39.3 |
| AutoFormulator | 75.5 | 85.7 | 23.1 | 30.0 | 11.4 | 45.1 |
| OptiTree | 86.4 | 90.0 | 32.2 | 33.0 | 12.7 | 50.9 |
| OR-R1-Qwen3-8B | 88.3∗ | 86.1∗ | 49.9∗ | 35.3∗ | – | – |
| StarOR (Ours) | 92.7 | 94.4 | 54.2 | 51.0 | 32.5 | 65.0 |

##### Effect of LoRA Adaptation on Search \(RQ2\)\.

Table [3](https://arxiv.org/html/2606.15197#S4.T3) compares search-only exploration with the full StarOR framework to assess the effect of LoRA adaptation. While search alone already improves over the zero-shot baseline, online LoRA adaptation brings consistent further gains on every benchmark, raising average accuracy from 60.9% to 65.0%. This suggests that StarOR can internalize search-time feedback through policy updates, moving beyond pure exploration toward more robust reasoning under complex constraints.

Table 3: (RQ2) Contribution of online policy optimization beyond pure search.

| Configuration | NL4OPT | MAMO Easy | MAMO Complex | IndustryOR | OptMATH | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Zero-shot | | | | | | |
| Qwen3-4B-Instruct-2507 | 55.9 | 81.9 | 11.8 | 27.0 | 10.8 | 37.5 |
| Search Only (No Adaptation) | | | | | | |
| AutoFormulator (MCTS) | 75.5 | 85.7 | 23.1 | 30.0 | 11.4 | 45.1 |
| StarOR w/o LoRA | 89.5 | 90.1 | 49.3 | 47.0 | 28.5 | 60.9 |
| Search + Adaptation (Ours) | | | | | | |
| StarOR w/ LoRA | 92.7 | 94.4 | 54.2 | 51.0 | 32.5 | 65.0 |

##### Impact of Reward Design \(RQ3\)\.

Table [4](https://arxiv.org/html/2606.15197#S4.T4) evaluates the effect of reward design in test-time adaptation. The multi-faceted reward achieves the best average accuracy (65.0%), whereas the majority-voting reward (60.8%) does not improve on search only (60.9%). This indicates that simple consensus is insufficient for complex OR modeling; rich, multi-dimensional feedback is needed to guide policy evolution. Table [5](https://arxiv.org/html/2606.15197#S4.T5) further shows that the reward components are especially useful on complex benchmarks: removing any of them degrades OptMATH performance. On the easier NL4OPT benchmark, however, some components yield neutral or negative effects, mainly because valid consensus is easier to obtain and imperfect pre-generation priors may introduce additional noise. We provide a detailed analysis of the pre-generation prior in Appendix [B](https://arxiv.org/html/2606.15197#A2).

Table 4: (RQ3-1) Impact of reward design. We compare a simple majority-voting reward against the full multi-faceted reward under StarOR, reporting Acc. (%).

| Method / Variant | NL4OPT | MAMO Easy | MAMO Complex | IndustryOR | OptMATH | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| OR-R1-Qwen3-8B | 88.3 | 86.1 | 49.9 | 35.3 | – | – |
| StarOR | | | | | | |
| + No RL (Search Only) | 89.5 | 90.1 | 49.3 | 47.0 | 28.5 | 60.9 |
| + RL (Majority-voting Reward) | 90.6 | 92.4 | 48.3 | 46.0 | 26.8 | 60.8 |
| + RL (Multi-faceted Reward) | 92.7 | 94.4 | 54.2 | 51.0 | 32.5 | 65.0 |

Table 5: (RQ3-2) Detailed analysis for reward ablation on OptMATH (Complex) and NL4OPT (Easy). $\Delta$ reports the change relative to the full StarOR reward on each dataset.

| Variant | OptMATH Acc. (%) | OptMATH $\Delta$ | NL4OPT Acc. (%) | NL4OPT $\Delta$ |
| --- | --- | --- | --- | --- |
| StarOR (Ours) | 32.5 | – | 92.7 | – |
| w/o objective scale | 29.5 | -3.0 | 93.1 | +0.4 |
| w/o test-case reward ($r_{\mathrm{test}}$) | 30.7 | -1.8 | 92.2 | -0.5 |
| w/o dynamic shaping | 31.3 | -1.2 | 92.7 | +0.0 |

##### Test\-Time Scaling Analysis \(RQ4\)\.

Table [6](https://arxiv.org/html/2606.15197#S4.T6) illustrates how OptMATH performance scales with the expansion group size $K$. We observe a consistent upward trend as $K$ increases from 2 to 32, demonstrating StarOR's ability to leverage additional test-time compute. With only $K=4$, StarOR already reaches 26.0% accuracy in 93.9s, outperforming Best-of-32 (24.1% in 209.2s) while using less than half of its time. This scaling behavior supports its use in complex scenarios where precision benefits from extended search. A detailed analysis of cost and efficiency is given in Appendix [C](https://arxiv.org/html/2606.15197#A3).

Table 6: (RQ4) We compare StarOR's expansion-based scaling against Best-of-N on OptMATH. Acc, $\Delta$, and Time Cost denote accuracy, absolute gain, and time cost per sample, respectively.

| Exp. Num $K$ | Acc (%) | $\Delta$ (%) | Time Cost (s) | Best-of-N | Acc (%) | $\Delta$ (%) | Time Cost (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $K=2$ | 23.5 | – | 56.1 | $N=2$ | 12.0 | – | 10.9 |
| $K=4$ | 26.0 | +2.5 | 93.9 | $N=4$ | 15.1 | +3.1 | 18.7 |
| $K=8$ | 32.5 | +9.0 | 205.2 | $N=8$ | 18.1 | +6.1 | 39.3 |
| $K=16$ | 35.0 | +11.5 | 575.7 | $N=16$ | 21.6 | +9.6 | 89.9 |
| $K=32$ | 37.3 | +13.8 | 1396.1 | $N=32$ | 24.1 | +12.1 | 209.2 |
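
The arithmetic behind the scaling comparison can be reproduced with a short script. The numbers are transcribed from Table 6; the helper names `gains` and `cheapest_setting_reaching` are hypothetical and serve only this illustration, which assumes that $\Delta$ is measured against the smallest setting in each column.

```python
# (group size, accuracy in %, seconds per sample), transcribed from Table 6.
STAROR = [(2, 23.5, 56.1), (4, 26.0, 93.9), (8, 32.5, 205.2),
          (16, 35.0, 575.7), (32, 37.3, 1396.1)]
BEST_OF_N = [(2, 12.0, 10.9), (4, 15.1, 18.7), (8, 18.1, 39.3),
             (16, 21.6, 89.9), (32, 24.1, 209.2)]


def gains(rows):
    """Absolute accuracy gain of each setting over the smallest setting."""
    base = rows[0][1]
    return [round(acc - base, 1) for _, acc, _ in rows]


def cheapest_setting_reaching(rows, target_acc):
    """First (size, acc, seconds) row whose accuracy exceeds target_acc, else None."""
    for row in rows:
        if row[1] > target_acc:
            return row
    return None


best_of_32 = BEST_OF_N[-1]
match = cheapest_setting_reaching(STAROR, best_of_32[1])
time_ratio = match[2] / best_of_32[2]  # StarOR time as a fraction of Best-of-32 time
```

Here `match` is the $K=4$ row and `time_ratio` is below 0.5, which is the "less than half of its time" comparison made in the text.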

## 5 Conclusion

In this paper, we revisit optimization modeling as a hierarchical decision-making process rather than a flat text-to-code generation task. Because hierarchical formulation is brittle, one-shot generation is prone to error propagation, while fixed-policy search may repeatedly explore flawed modeling paths. To address this challenge, we propose StarOR, a framework integrating stage-wise MCTS with node-level test-time reinforcement learning. By decomposing formulation into discrete stages and updating a transient LoRA adapter via GRPO, StarOR transforms execution-driven feedback into instance-specific policy refinement, with an unsupervised multi-faceted reward providing fine-grained node-level supervision. Experiments across five benchmarks show that StarOR achieves state-of-the-art performance with a 4B backbone, suggesting that search-driven test-time evolution is an effective paradigm for reliable LLM-based optimization modeling.

## References

- T\. Ahmed and S\. Choudhury \(2024\)LM4OPT: unveiling the potential of large language models in formulating mathematical optimization problems\.INFOR: Information Systems and Operational Research62\(4\),pp\. 559–572\.External Links:[Document](https://dx.doi.org/10.1080/03155986.2024.2388452),[Link](https://www.tandfonline.com/doi/abs/10.1080/03155986.2024.2388452)Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p1.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px1.p1.1)\.
- E\. Akyürek, M\. Damani, A\. Zweiger, L\. Qiu, H\. Guo, J\. Pari, Y\. Kim, and J\. Andreas \(2025\)The surprising effectiveness of test\-time training for few\-shot learning\.InProceedings of the 42nd International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=asgBo3FNdg)Cited by:[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px3.p1.1)\.
- N\. Astorga, T\. Liu, Y\. Xiao, and M\. Van Der Schaar \(2025\)Autoformulation of mathematical optimization models using llms\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 1864–1886\.External Links:[Link](https://proceedings.mlr.press/v267/astorga25a.html)Cited by:[Appendix A](https://arxiv.org/html/2606.15197#A1.SS0.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2606.15197#S1.p3.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px2.p1.1),[item 3\.](https://arxiv.org/html/2606.15197#S4.I2.i3.p1.1)\.
- A\. D\. Belegundu and T\. R\. Chandrupatla \(2019\)Optimization concepts and applications in engineering\.3 edition,Cambridge University Press\.Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p1.1)\.
- Y\. Chen, J\. Xia, S\. Shao, D\. Ge, and Y\. Ye \(2025\)Solver\-informed rl: grounding large language models for authentic optimization modeling\.arXiv preprint arXiv:2505\.11792\.External Links:[Link](https://arxiv.org/abs/2505.11792)Cited by:[Appendix A](https://arxiv.org/html/2606.15197#A1.SS0.SSS0.Px7.p1.1),[Appendix F](https://arxiv.org/html/2606.15197#A6.p1.1),[§1](https://arxiv.org/html/2606.15197#S1.p2.1),[§1](https://arxiv.org/html/2606.15197#S1.p3.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px1.p1.1),[item 2\.](https://arxiv.org/html/2606.15197#S4.I2.i2.p1.1),[§4\.1](https://arxiv.org/html/2606.15197#S4.SS1.SSS0.Px1.p1.1),[Table 2](https://arxiv.org/html/2606.15197#S4.T2)\.
- DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma,et al\.\(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.Nature645,pp\. 633–638\.Note:Also available as arXiv preprint arXiv:2501\.12948External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09422-z),[Link](https://arxiv.org/abs/2501.12948)Cited by:[item 1\.](https://arxiv.org/html/2606.15197#S4.I2.i1.p1.1)\.
- DeepSeek\-AI, A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang,et al\.\(2024\)DeepSeek\-v3 technical report\.arXiv preprint arXiv:2412\.19437\.Note:DeepSeek\-V3\.1 is an incremental update based on this architectureExternal Links:2412\.19437,[Link](https://arxiv.org/abs/2412.19437),[Document](https://dx.doi.org/10.48550/arXiv.2412.19437)Cited by:[item 1\.](https://arxiv.org/html/2606.15197#S4.I2.i1.p1.1)\.
- Z\. Ding, Z\. Tan, J\. Zhang, and T\. Chen \(2025\)OR\-r1: automating modeling and solving of operations research optimization problem via test\-time reinforcement learning\.arXiv preprint arXiv:2511\.09092\.External Links:[Link](https://arxiv.org/abs/2511.09092)Cited by:[Appendix A](https://arxiv.org/html/2606.15197#A1.SS0.SSS0.Px6.p1.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px3.p1.1),[item 3\.](https://arxiv.org/html/2606.15197#S4.I2.i3.p1.1)\.
- Gurobi Optimization, LLC \(2026\)Gurobi Optimizer Reference Manual\.External Links:[Link](https://www.gurobi.com/)Cited by:[item 1\.](https://arxiv.org/html/2606.15197#S4.I2.i1.p1.1)\.
- J\. Hu, Z\. Zhang, G\. Chen, X\. Wen, C\. Shuai, W\. Luo, B\. Xiao, Y\. Li, and M\. Tan \(2025\)Test\-time learning for large language models\.External Links:2505\.20633,[Link](https://arxiv.org/abs/2505.20633)Cited by:[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Huang, Z\. Tang, S\. Hu, R\. Jiang, X\. Zheng, D\. Ge, B\. Wang, and Z\. Wang \(2024a\)ORLM: a customizable framework in training large models for automated optimization modeling\.arXiv preprint arXiv:2405\.17743\.External Links:[Link](https://arxiv.org/abs/2405.17743)Cited by:[Appendix A](https://arxiv.org/html/2606.15197#A1.SS0.SSS0.Px7.p1.1),[§1](https://arxiv.org/html/2606.15197#S1.p2.1),[§1](https://arxiv.org/html/2606.15197#S1.p3.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px1.p1.1),[item 2\.](https://arxiv.org/html/2606.15197#S4.I2.i2.p1.1),[Table 1](https://arxiv.org/html/2606.15197#S4.T1.3.5.1.1)\.
- X\. Huang, Q\. Shen, Y\. Hu, A\. Gao, and B\. Wang \(2024b\)Mamo: a mathematical modeling benchmark with solvers\.External Links:2405\.13144Cited by:[Table 1](https://arxiv.org/html/2606.15197#S4.T1.3.3.1.1),[Table 1](https://arxiv.org/html/2606.15197#S4.T1.3.4.1.1)\.
- C\. Jiang, X\. Shu, H\. Qian, X\. Lu, J\. Zhou, A\. Zhou, and Y\. Yu \(2024\)LLMOPT: learning to define and solve general optimization problems from scratch\.arXiv preprint arXiv:2410\.13213\.External Links:[Link](https://arxiv.org/abs/2410.13213)Cited by:[Appendix A](https://arxiv.org/html/2606.15197#A1.SS0.SSS0.Px7.p1.1),[§1](https://arxiv.org/html/2606.15197#S1.p2.1),[§1](https://arxiv.org/html/2606.15197#S1.p3.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.15197#S3.SS1.p1.5),[item 2\.](https://arxiv.org/html/2606.15197#S4.I2.i2.p1.1)\.
- Z\. Jiao, H\. Xian, Q\. Wang, Y\. Ma, Z\. Wang, Z\. Zhang, D\. Kong, and M\. Han \(2026\)Policy of thoughts: scaling llm reasoning via test\-time policy evolution\.arXiv preprint arXiv:2601\.20379\.External Links:[Link](https://arxiv.org/abs/2601.20379)Cited by:[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Kang, X\. Zhao, and D\. Song \(2025\)Scalable best\-of\-n selection for large language models via self\-certainty\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.38\.External Links:[Link](https://arxiv.org/abs/2502.18581),2502\.18581Cited by:[Appendix A](https://arxiv.org/html/2606.15197#A1.SS0.SSS0.Px2.p1.7),[item 3\.](https://arxiv.org/html/2606.15197#S4.I2.i3.p1.1)\.
- D\. Krishnamurthy, C\. Uckun, Z\. Zhou, P\. R\. Thimmapuram, and A\. Botterud \(2018\)Energy storage arbitrage under day\-ahead and real\-time price uncertainty\.IEEE Transactions on Power Systems33\(1\),pp\. 84–93\.External Links:[Document](https://dx.doi.org/10.1109/TPWRS.2017.2685347),ISSN 0885\-8950Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p1.1)\.
- D\. Li, X\. Zhao, L\. Yu, Y\. Liu, W\. Cheng, Z\. Chen, Z\. Chen, F\. Chen, C\. Zhao, and H\. Chen \(2025\)SolverLLM: leveraging test\-time scaling for optimization problem via llm\-guided search\.arXiv preprint arXiv:2510\.16916\.External Links:[Link](https://arxiv.org/abs/2510.16916)Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p3.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.15197#S3.SS1.p1.5)\.
- J\. J\. Lian, Y\. Sun, H\. Chen, C\. Zhang, H\. Qin, and C\. Teo \(2026\)ReLoop: structured modeling and behavioral verification for reliable llm\-based optimization\.External Links:2602\.15983,[Link](https://arxiv.org/abs/2602.15983)Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p2.1)\.
- R\. Liao, N\. Röhrich, X\. Wang, Y\. Zhang, Y\. Samadzadeh, V\. Tresp, and S\. Yeung\-Levy \(2026\)Tool verification for test\-time reinforcement learning\.arXiv preprint arXiv:2603\.02203\.External Links:[Link](https://arxiv.org/abs/2603.02203)Cited by:[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Liu, J\. Wang, Y\. Cai, X\. Han, Y\. Kuang, and J\. Hao \(2025\)OptiTree: hierarchical thoughts generation with tree search for llm optimization modeling\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=Ej20yjWMCj)Cited by:[Appendix A](https://arxiv.org/html/2606.15197#A1.SS0.SSS0.Px5.p1.1),[§1](https://arxiv.org/html/2606.15197#S1.p3.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px2.p1.1),[item 3\.](https://arxiv.org/html/2606.15197#S4.I2.i3.p1.1)\.
- H\. Lu, Z\. Xie, Y\. Wu, C\. Ren, Y\. Chen, and Z\. Wen \(2025\)OptMATH: a scalable bidirectional data synthesis framework for optimization modeling\.arXiv preprint arXiv:2502\.11102\.External Links:[Link](https://arxiv.org/abs/2502.11102)Cited by:[Appendix A](https://arxiv.org/html/2606.15197#A1.SS0.SSS0.Px7.p1.1),[§1](https://arxiv.org/html/2606.15197#S1.p3.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px1.p1.1),[item 2\.](https://arxiv.org/html/2606.15197#S4.I2.i2.p1.1),[Table 1](https://arxiv.org/html/2606.15197#S4.T1.3.6.1.1)\.
- OpenAI, J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida,et al\.\(2023\)GPT\-4 technical report\.arXiv preprint arXiv:2303\.08774\.External Links:2303\.08774,[Link](https://arxiv.org/abs/2303.08774),[Document](https://dx.doi.org/10.48550/arXiv.2303.08774)Cited by:[item 1\.](https://arxiv.org/html/2606.15197#S4.I2.i1.p1.1)\.
- OpenAI \(2025\)Introducing openai o3 and o4\-mini\.Note:[https://openai\.com/index/introducing\-o3\-and\-o4\-mini/](https://openai.com/index/introducing-o3-and-o4-mini/)Accessed: 2026\-05\-05Cited by:[item 1\.](https://arxiv.org/html/2606.15197#S4.I2.i1.p1.1)\.
- R\. Ramamonjison, H\. Li, T\. Yu, S\. He, V\. Rengan, A\. Banitalebi\-dehkordi, Z\. Zhou, and Y\. Zhang \(2022a\)Augmenting operations research with auto\-formulation of optimization models from problem descriptions\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track,Abu Dhabi, UAE,pp\. 29–62\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-industry.4),[Link](https://aclanthology.org/2022.emnlp-industry.4/)Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p1.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Ramamonjison, T\. Yu, R\. Li, H\. Li, G\. Carenini, B\. Ghaddar, S\. He, M\. Mostajabdaveh, A\. Banitalebi\-Dehkordi, Z\. Zhou, and Y\. Zhang \(2022b\)NL4Opt competition: formulating optimization problems based on their natural language descriptions\.InProceedings of the NeurIPS 2022 Competitions Track,Proceedings of Machine Learning Research, Vol\.220,pp\. 189–203\.External Links:[Link](https://proceedings.mlr.press/v220/ramamonjison23a.html)Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p1.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.15197#S4.T1.3.2.1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.External Links:[Link](https://arxiv.org/abs/2402.03300)Cited by:[§3\.2\.1](https://arxiv.org/html/2606.15197#S3.SS2.SSS1.Px1.p1.5)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2024\)HybridFlow: a flexible and efficient rlhf framework\.arXiv preprint arXiv:2409\.19256\.Cited by:[Appendix D](https://arxiv.org/html/2606.15197#A4.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.36,pp\. 39870–39890\.External Links:[Link](https://arxiv.org/abs/2303.11366),2303\.11366Cited by:[Appendix A](https://arxiv.org/html/2606.15197#A1.SS0.SSS0.Px3.p1.4),[item 3\.](https://arxiv.org/html/2606.15197#S4.I2.i3.p1.1)\.
- A\. Singh \(2012\)An overview of the optimization modelling applications\.Journal of Hydrology466–467,pp\. 167–182\.External Links:[Document](https://dx.doi.org/10.1016/j.jhydrol.2012.08.004),ISSN 0022\-1694Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p1.1)\.
- T\. Wang, W\. Yu, Z\. He, Z\. Liu, H\. Gong, H\. Wu, X\. Han, W\. Shi, R\. She, F\. Zhu, and T\. Zhong \(2025a\)BPP\-search: enhancing tree of thought reasoning for mathematical modeling problem solving\.External Links:2411\.17404,[Link](https://arxiv.org/abs/2411.17404)Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p3.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. V\. Le, E\. H\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\)Self\-consistency improves chain of thought reasoning in language models\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=1PL1NIMMrw)Cited by:[Appendix A](https://arxiv.org/html/2606.15197#A1.SS0.SSS0.Px2.p1.7)\.
- Y\. Wang, S\. Su, Z\. Zeng, E\. Xu, L\. Ren, X\. Yang, Z\. Huang, X\. He, L\. Ma, B\. Peng, H\. Cheng, P\. He, W\. Chen, S\. Wang, S\. S\. Du, and Y\. Shen \(2025b\)ThetaEvolve: test\-time learning on open problems\.arXiv preprint arXiv:2511\.23473\.External Links:[Link](https://arxiv.org/abs/2511.23473)Cited by:[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Wang, B\. Chen, Y\. Huang, Q\. Cao, M\. He, J\. Fan, and X\. Liang \(2025c\)ORMind: a cognitive\-inspired end\-to\-end reasoning framework for operations research\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 6: Industry Track\),Vienna, Austria,pp\. 104–131\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-industry.10),[Link](https://aclanthology.org/2025.acl-industry.10/)Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p2.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Wu, Y\. Zhang, Y\. Wu, Y\. Wang, J\. Zhang, and J\. Cheng \(2025\)Training llms for optimization modeling via iterative data synthesis and structured validation\.InFindings of the Association for Computational Linguistics: EMNLP 2025,Suzhou, China,pp\. 12880–12896\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.691),[Link](https://aclanthology.org/2025.findings-emnlp.691/)Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p2.1),[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Xiao, D\. Zhang, Y\. Wu, L\. Xu, Y\. J\. Wang, X\. Han, X\. Fu, T\. Zhong, J\. Zeng, M\. Song, and G\. Chen \(2024\)Chain\-of\-experts: when llms meet complex operations research problems\.InThe Twelfth International Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=HobyL1B9CZ)Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p2.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. R\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=5Xc1ecxO1h)Cited by:[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Yuksekgonul, D\. Koceja, X\. Li, F\. Bianchi, J\. McCaleb, X\. Wang, J\. Kautz, Y\. Choi, J\. Zou, C\. Guestrin, and Y\. Sun \(2026\)Learning to discover at test time\.arXiv preprint arXiv:2601\.16175\.External Links:[Link](https://arxiv.org/abs/2601.16175)Cited by:[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Zhang, Q\. Kang, Y\. Chen, Y\. Wang, X\. Han, T\. Zhong, M\. Yuan, and C\. Ma \(2025\)SAC\-opt: semantic anchors for iterative correction in optimization modeling\.External Links:2510\.05115,[Link](https://arxiv.org/abs/2510.05115)Cited by:[§1](https://arxiv.org/html/2606.15197#S1.p2.1)\.
- Y\. Zuo, K\. Zhang, L\. Sheng, S\. Qu, G\. Cui, X\. Zhu, H\. Li, Y\. Zhang, X\. Long, E\. Hua, B\. Qi, Y\. Sun, Z\. Ma, L\. Yuan, N\. Ding, and B\. Zhou \(2025\)TTRL: test\-time reinforcement learning\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=VuVhgEiu20)Cited by:[§2](https://arxiv.org/html/2606.15197#S2.SS0.SSS0.Px3.p1.1)\.


## Appendix A Baselines

We compare StarOR with three families of baselines: strong general\-purpose LLMs, offline learning\-based OR modeling systems, and inference\-time scaling methods\. For all test\-time baselines implemented in our codebase, we use the same backbone, Qwen3\-4B\-Instruct\-2507, and the same Gurobi execution environment as StarOR unless otherwise stated\. This controls for differences in base model capability and isolates the effect of the test\-time algorithm\.

##### Zero\-shot\.

The zero\-shot baseline directly prompts the backbone model to generate the complete optimization formulation and solver\-ready Gurobi Python program in a single pass\. The output is executed once and the resulting objective value is compared against the reference answer\. This baseline measures the raw modeling ability of the backbone without sampling, search, repair, or adaptation\.

##### Best\-of\-N\.

Best-of-N is a standard inference-time scaling baseline inspired by self-consistency and repeated sampling \[Wang et al., [2023](https://arxiv.org/html/2606.15197#bib.bib20), Kang et al., [2025](https://arxiv.org/html/2606.15197#bib.bib31)\]. We sample $N$ independent solver-code candidates from the same backbone and execute each candidate with a 30-second solver/execution budget. The final answer is selected from executable candidates by objective-value consensus: candidates are clustered by numerical objective value under the same tolerance used in evaluation, and the largest executable cluster is selected; ties are resolved by execution validity and generation order. In the main comparison, we use $N=16$ with temperature 1.0 and a maximum generation length of 8196 tokens. In the scaling study, $N$ is swept over $\{2,4,8,16,32\}$ to match the compute axis used by StarOR.
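
The consensus rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_by_consensus`, the relative-tolerance comparison, and the greedy clustering against each cluster's first member are assumptions, and failed executions are represented by `None`.

```python
def select_by_consensus(objectives, tol=1e-6):
    """Pick a candidate index by objective-value consensus.

    objectives: objective values in generation order; None marks a candidate
    that failed to execute. Returns the index of the earliest member of the
    largest cluster, or None when no candidate executed.
    """
    clusters = []  # each cluster: {"rep": first value seen, "members": [indices]}
    for idx, obj in enumerate(objectives):
        if obj is None:
            continue  # only executable candidates take part in the vote
        for cluster in clusters:
            rep = cluster["rep"]
            if abs(obj - rep) <= tol * max(1.0, abs(rep)):
                cluster["members"].append(idx)
                break
        else:
            clusters.append({"rep": obj, "members": [idx]})
    if not clusters:
        return None
    # max() keeps the first maximal element, so a tie goes to the cluster
    # whose first member was generated earliest.
    largest = max(clusters, key=lambda c: len(c["members"]))
    return largest["members"][0]
```

With `[None, 12.0, 7.5, 12.0000000001, 7.5]`, both clusters have two members, so the tie falls to the earlier cluster and index 1 is returned.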

##### Reflexion\.

Reflexion \[Shinn et al., [2023](https://arxiv.org/html/2606.15197#bib.bib36)\] is implemented as an iterative generate-execute-refine loop. A candidate Gurobi program is first generated and executed. The execution trace, including Python exceptions, solver status, infeasibility/unboundedness messages, or suspicious missing objective output, is then fed back to the same model as verbal feedback for code refinement. We run up to 10 refinement attempts with a maximum generation length of 8196 tokens and a 30-second execution budget per attempt. The final candidate is the first executable candidate that passes the solver checks; if no such candidate exists, we select the highest-quality candidate according to the same execution and objective-consensus rule used for Best-of-N.
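
The generate-execute-refine loop can be sketched as below. The `Execution` record and the `generate` and `execute` callables are hypothetical stand-ins for the LLM call and the sandboxed Gurobi run; the sketch covers only the early-exit logic and leaves the consensus fallback to the caller.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Execution(NamedTuple):
    ok: bool                    # True when the program ran and passed the solver checks
    objective: Optional[float]  # objective value, if one was produced
    trace: str                  # exceptions, solver status, or other diagnostics


def reflexion(
    generate: Callable[[Optional[str]], str],
    execute: Callable[[str], Execution],
    max_attempts: int = 10,
) -> Tuple[Optional[str], Optional[Execution], List[Tuple[str, Execution]]]:
    """Run up to max_attempts rounds, feeding each failed trace back as feedback."""
    feedback: Optional[str] = None
    history: List[Tuple[str, Execution]] = []
    for _ in range(max_attempts):
        code = generate(feedback)   # first call receives no feedback
        result = execute(code)
        history.append((code, result))
        if result.ok:
            return code, result, history  # first candidate passing the checks
        feedback = result.trace
    # No candidate passed: the caller can apply a fallback rule over history.
    return None, None, history
```

Returning the full `history` keeps every attempt available, so a fallback such as the objective-consensus rule can be applied when no attempt passes.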

##### AutoFormulator\.

AutoFormulator refers to the MCTS-style autoformulation framework of Astorga et al. \[[2025](https://arxiv.org/html/2606.15197#bib.bib6)\]. The method treats optimization modeling as a hierarchical search problem, where an LLM proposes formulation components and MCTS explores alternative modeling hypotheses. It improves search efficiency with symbolic pruning and uses LLM-based partial-formulation evaluation to guide the tree. In our comparison, AutoFormulator represents a strong search-only baseline: it explores multiple partial formulations at test time but does not update the policy parameters during inference. This makes it a useful contrast to StarOR, which couples tree search with node-level GRPO updates.

##### OptiTree\.

OptiTree \[Liu et al., [2025](https://arxiv.org/html/2606.15197#bib.bib7)\] is a hierarchical thought-generation and tree-search method for optimization modeling. Instead of relying on a fixed decomposition, OptiTree searches over a taxonomy-like modeling tree, retrieves high-level modeling thoughts from relevant subproblem categories, and synthesizes these thoughts into a final formulation. We include OptiTree as a strong OR-specific tree-search baseline. Compared with StarOR, OptiTree emphasizes adaptive decomposition and retrieval of modeling thoughts, while StarOR emphasizes execution-grounded reward feedback and test-time policy evolution within each sample.

##### OR\-R1\.

OR-R1 \[Ding et al., [2025](https://arxiv.org/html/2606.15197#bib.bib11)\] is a learning-based OR modeling system that combines supervised fine-tuning with test-time group relative policy optimization. Its reward contains OR-specific components such as format validity, code validity, code executability, solution correctness, and consistency. We include the reported OR-R1-Qwen3-8B results as a strong TTRL-oriented baseline. The comparison is intentionally conservative: OR-R1 uses a larger 8B backbone and task-specific training (SFT), whereas StarOR uses a 4B backbone and performs instance-level adaptation with a transient LoRA adapter during search.

##### Learning\-based baselines\.

We also report specialized offline-training methods, including ORLM \[Huang et al., [2024a](https://arxiv.org/html/2606.15197#bib.bib1)\], LLMOPT \[Jiang et al., [2024](https://arxiv.org/html/2606.15197#bib.bib2)\], OptMATH \[Lu et al., [2025](https://arxiv.org/html/2606.15197#bib.bib3)\], and SIRL \[Chen et al., [2025](https://arxiv.org/html/2606.15197#bib.bib4)\]. These methods improve LLMs through synthetic data, solver-informed supervision, or reinforcement learning before evaluation. They provide a reference for how far static training can push OR modeling accuracy, while StarOR studies the complementary direction of spending compute at test time on each instance.

## Appendix B More Experiment Results and Analysis

### B.1 Experiment Settings

Unless otherwise stated, all reward weights are ordered as $(w_{\mathrm{sem}}, w_{\mathrm{exec}}, w_{\mathrm{test}}, w_{\mathrm{struct}})$, corresponding to semantic consensus, executability, test-case robustness, and structural consistency. In the full StarOR configuration, we use dynamic reward shaping: the weights are $(0.2,0.5,0.2,0.1)$ for iterations 1–3, $(0.4,0.4,0.1,0.1)$ for iterations 4–5, and $(0.6,0.2,0.1,0.1)$ afterwards. For the *w/o dynamic shaping* ablation in Table [5](https://arxiv.org/html/2606.15197#S4.T5), we disable this schedule and use the fixed late-stage weight vector $(0.6,0.2,0.1,0.1)$ throughout all MCTS iterations. For the *w/o test-case reward* ablation, we remove $r_{\mathrm{test}}$ from the reward computation and transfer its weight to $r_{\mathrm{exec}}$ so that the total reward scale remains unchanged.

Table 7: Reward-weight settings for the reward ablations. Tuples are ordered as $(w_{\mathrm{sem}}, w_{\mathrm{exec}}, w_{\mathrm{test}}, w_{\mathrm{struct}})$.

| Setting | Iter. 1–3 | Iter. 4–5 | Iter. $>5$ |
| --- | --- | --- | --- |
| Full StarOR | $(0.2,0.5,0.2,0.1)$ | $(0.4,0.4,0.1,0.1)$ | $(0.6,0.2,0.1,0.1)$ |
| w/o dynamic shaping | $(0.6,0.2,0.1,0.1)$ | $(0.6,0.2,0.1,0.1)$ | $(0.6,0.2,0.1,0.1)$ |
| w/o test-case reward | $(0.2,0.7,0.0,0.1)$ | $(0.4,0.5,0.0,0.1)$ | $(0.6,0.3,0.0,0.1)$ |
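
The weight schedule in Table 7 can be written as a small lookup. The function names and the flags `dynamic` and `use_test_reward` are hypothetical, and combining the four reward components as a weighted sum is an assumption made for illustration; this appendix specifies the weights but not the exact aggregation.

```python
def reward_weights(iteration, dynamic=True, use_test_reward=True):
    """Return (w_sem, w_exec, w_test, w_struct) for a 1-indexed MCTS iteration."""
    if iteration < 1:
        raise ValueError("iteration is 1-indexed")
    if not dynamic:
        weights = (0.6, 0.2, 0.1, 0.1)   # fixed late-stage vector (w/o dynamic shaping)
    elif iteration <= 3:
        weights = (0.2, 0.5, 0.2, 0.1)   # early: emphasize executability
    elif iteration <= 5:
        weights = (0.4, 0.4, 0.1, 0.1)
    else:
        weights = (0.6, 0.2, 0.1, 0.1)   # late: emphasize semantic consensus
    if not use_test_reward:
        w_sem, w_exec, w_test, w_struct = weights
        # Move the test-case weight onto executability; round away float noise.
        weights = (w_sem, round(w_exec + w_test, 10), 0.0, w_struct)
    return weights


def combined_reward(components, weights):
    """Assumed aggregation: weighted sum of (r_sem, r_exec, r_test, r_struct)."""
    return sum(w * r for w, r in zip(weights, components))
```

Every weight vector sums to 1, so moving the test-case weight onto executability keeps the total reward scale unchanged, as the ablation description requires.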
### B.2 Pre-generation analysis

Before running the stage-wise search, StarOR constructs a reward prior by estimating a conservative objective-scale envelope $[L_x, U_x]$ for each instance. This pre-generation signal is not used as a hard correctness oracle; rather, it provides a soft preference that down-weights candidates whose objective values fall outside the estimated range. Table [8](https://arxiv.org/html/2606.15197#A2.T8) analyzes the reliability and effect of this objective-scale prior.

We report four diagnostic statistics. First, *GT covered by base scale* measures the percentage of instances whose ground-truth objective lies within the estimated range $[L_x, U_x]$, reflecting the coverage quality of the pre-generated scale. Second, we report the conditional accuracy when the ground-truth objective is inside the estimated range, which measures how well StarOR performs when the prior is consistent with the true objective scale. Third, we report the conditional accuracy when the ground-truth objective falls outside the estimated range, which evaluates whether StarOR can still recover the correct answer when the prior is inaccurate. Finally, $\Delta$Acc. measures the accuracy gap between these two subsets, quantifying the performance drop caused by an incorrect objective-scale estimate.

Overall, the estimated base scale covers the ground-truth objective for a large fraction of instances (70.5% to 95.5% across datasets), while StarOR still retains non-trivial recovery ability when the ground truth falls outside the predicted range. The trends also reveal clear differences in dataset difficulty. On relatively easier datasets such as NL4OPT and MAMO-Easy, StarOR achieves high conditional accuracy when the ground truth is covered by the base scale, and MAMO-Easy in particular remains robust even when the scale estimate is incorrect. In contrast, harder datasets such as MAMO-Complex, IndustryOR, and especially OptMATH exhibit substantially lower conditional accuracies in both subsets, indicating that their difficulty is not solely caused by objective-scale misestimation but also by more challenging modeling, reasoning, or search requirements. Moreover, the positive $\Delta$Acc. values across all datasets (from 1.7 pp on MAMO-Complex to 30.4 pp on NL4OPT) suggest that a correct objective-scale estimate consistently benefits performance, while the magnitude of this gap reflects how strongly each dataset depends on the pre-generation prior. These observations support the view that the objective-scale prior serves as a useful guidance signal rather than an overly restrictive filter.

Table 8: Pre-generation objective-scale diagnostics. "GT covered by base scale" reports the percentage of instances whose ground-truth objective lies within the pre-generated objective range $[L_x, U_x]$. The two conditional accuracy columns report model accuracy separately for instances where the ground-truth objective is inside or outside this range. $\Delta$Acc. denotes the accuracy gap between these two subsets.

| Dataset | #Inst. | GT covered by base scale (%) | Acc. when GT in scale (%) | Acc. when GT out of scale (%) | $\Delta$Acc. (in–out, pp) |
|---|---|---|---|---|---|
| NL4OPT | 245 | 95.5 | 94.0 | 63.6 | 30.4 |
| MAMO-Easy | 642 | 86.4 | 95.3 | 88.5 | 6.8 |
| MAMO-Complex | 203 | 82.3 | 54.5 | 52.8 | 1.7 |
| IndustryOR | 100 | 79.0 | 53.2 | 42.9 | 10.3 |
| OptMATH | 166 | 70.5 | 35.0 | 26.5 | 8.5 |

### B.3 More Ablation Studies

##### Code Repair\.

We also evaluate the effect of the final code repair budget. Repair is applied after terminal candidate selection and is intended to fix implementation-level issues such as syntax errors, missing imports, undefined variables, and solver-status printing without changing the searched mathematical formulation. Table [9](https://arxiv.org/html/2606.15197#A2.T9) reports the comparison on OptMATH. The default setting uses two repair rounds, which balances correction capacity and runtime overhead. Larger repair budgets may improve executability but can also introduce extra latency or over-edit a formulation that was already semantically correct.

Table 9: Code repair ablation on OptMATH. "Repair rounds" denotes the maximum number of final repair attempts after terminal candidate selection.

| Variant | Repair rounds | Acc. (%) | Time Cost (s) |
|---|---|---|---|
| StarOR (Ours) | 2 | 32.5 | 205.2 |
| repair = 0 | 0 | 30.7 | 194.6 |
| repair = 4 | 4 | 33.1 | 219.8 |
| repair = 8 | 8 | 33.1 | 252.1 |

## Appendix C Computational Cost and Efficiency Analysis

### C.1 Execution Budget and Computational Overhead

StarOR is designed for optimization modeling settings where a small formulation error can invalidate a downstream decision\. Its runtime is therefore best understood as a structured test\-time optimization cost rather than as a raw count of generated programs\. In this section, we decompose the wall\-clock time into the components that are actually sequential in our implementation: reward\-prior construction, batched rollout generation, node\-level GRPO update, and parallel execution\-based reward evaluation\.

##### Wall\-clock decomposition\.

For a problem instance that runs for $I$ MCTS iterations with expansion size $K$, the per-instance wall-clock time can be approximated as

$$T_{\mathrm{StarOR}} \approx T_{\mathrm{reward\text{-}prior}} + \sum_{t=1}^{I}\left[T_{\mathrm{rollout}}(K) + T_{\mathrm{GRPO}}(K) + T_{\mathrm{exec}}^{\parallel}(K, N_t)\right] + T_{\mathrm{repair}}, \tag{9}$$

where $T_{\mathrm{rollout}}(K)$ denotes the batched generation and rollout-to-code completion time for the $K$ sibling candidates, $T_{\mathrm{GRPO}}(K)$ denotes the transient LoRA update time, and $T_{\mathrm{exec}}^{\parallel}(K, N_t)$ denotes the wall-clock execution time for the original instance and the $N_t$ test cases. The execution term is parallelized across candidates and cases: although each candidate is protected by a 30 s timeout, a normal execution takes about 1 s on average, so execution contributes a small batch-level overhead instead of a serial $K(1+N_t)$ multiplier. Consequently, the main sequential cost is the number of search/adaptation rounds.

Table 10: Component-level wall-clock estimate for StarOR with $K=8$ on OptMATH. The estimate matches the measured time scale in Table [6](https://arxiv.org/html/2606.15197#S4.T6): the observed average time is 205.2 s per sample.

| Component | Approx. time | Explanation |
|---|---|---|
| Reward-prior construction | $\approx 4$ s | Roughly two single-rollout equivalents for objective-scale estimation and test-case preparation. |
| Rollout generation per iteration | $\approx 22$ s | Batched generation and completion for $K=8$ sibling candidates. |
| GRPO update per iteration | $\approx 7$ s | Node-level LoRA backpropagation and policy update. |
| Parallel execution per iteration | $\approx 1$ s | Original and test-case executions are run in parallel; the 30 s solver limit is a timeout cap. |
| Average number of iterations | $\approx 7.1$ | Empirical mean number of MCTS/adaptation rounds on OptMATH. |

##### Consistency with empirical runtime\.

Plugging these values into the decomposition gives

$$T_{\mathrm{StarOR}}(K=8) \approx 4 + 7.1 \times (22 + 7 + 1) \approx 217\ \text{s}. \tag{10}$$

This simple estimate is intentionally conservative because it treats all iterations as full iterations with both rollout and GRPO update. In practice, some terminal or low-utility iterations are shorter, several callback operations overlap with batched rollout execution, and optional repair is not triggered for every sample. These effects reduce the observed average to 205.2 s in Table [6](https://arxiv.org/html/2606.15197#S4.T6), which is close to the theoretical estimate. The decomposition therefore supports the runtime interpretation of StarOR: the dominant cost is not serial solver execution, but repeated rollout-and-adaptation rounds that refine the instance-specific policy during search.
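
The arithmetic behind this estimate can be reproduced in a few lines. The sketch below is illustrative only: the helper `estimate_wall_clock` is a hypothetical name, it assumes identical per-iteration costs, and the serial variant assumes $N_t = 3$ test cases and about 1 s per execution as stated in the text.

```python
def estimate_wall_clock(prior_s: float, rollout_s: float, grpo_s: float,
                        exec_s: float, iterations: float,
                        repair_s: float = 0.0) -> float:
    """Eq. (9) under the simplifying assumption that every iteration costs the same."""
    return prior_s + iterations * (rollout_s + grpo_s + exec_s) + repair_s


# Component times from the wall-clock table (K = 8, OptMATH).
parallel = estimate_wall_clock(4.0, 22.0, 7.0, 1.0, iterations=7.1)

# Hypothetical serial execution: K * (1 + N_t) runs of about 1 s each per iteration.
serial_exec_s = 8 * (1 + 3) * 1.0
serial = estimate_wall_clock(4.0, 22.0, 7.0, serial_exec_s, iterations=7.1)

print(f"parallel execution: {parallel:.1f} s, serial execution: {serial:.1f} s")
```

Under these assumptions the parallel estimate is about 217 s, roughly 6% above the measured 205.2 s, whereas running the same executions serially would roughly double the per-sample time.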

##### Compute efficiency\.

The key question is not which method uses the smallest absolute test-time budget, but which method converts test-time compute into correct formulations more effectively. The scaling results in Table [6](https://arxiv.org/html/2606.15197#S4.T6) support this view: StarOR with only $K=4$ already reaches 26.0% accuracy on OptMATH in 93.9 s, slightly outperforming Best-of-$N$ with $N=32$ (24.1%) while using less than one-half of its wall-clock time (209.2 s). This indicates that the additional computation in StarOR is not merely spent on independent resampling; it is reused to update the instance-specific policy and steer later search toward more reliable formulation decisions.

### C.2 Difficulty-Aware Resource Allocation

A core strength of StarOR is its ability to adaptively allocate computational resources based on problem complexity\. This difficulty\-awareness is reflected in the search depth, convergence speed of the reward signals, and the final wall\-clock time:

##### Case 1: Low\-Complexity Instances \(e\.g\., NL4OPT\)\.

Problems in the NL4OPT benchmark typically feature direct linear constraints and clear objective mappings. For these instances, StarOR often achieves high semantic consensus ($r_{\mathrm{sem}}$) and structural agreement ($r_{\mathrm{struct}}$) within the first 2–3 iterations. Consequently, the MCTS logic triggers early termination, keeping the average time-per-sample minimal.

##### Case 2: High\-Complexity Instances \(e\.g\., OptMATH\)\.

In contrast, OptMATH contains dense numerical descriptions and intricate logical dependencies. Early search nodes often yield diverging objective values (low $r_{\mathrm{sem}}$) or inconsistent variable types (low $r_{\mathrm{struct}}$). This lack of consensus prevents premature termination and drives the model to utilize the full MCTS budget and more GRPO update steps to resolve modeling ambiguities.

Table [11](https://arxiv.org/html/2606.15197#A3.T11) illustrates this adaptive behavior by comparing representative statistics across easy and complex benchmarks. The table should be read together with the cost decomposition above: the average time is driven primarily by the number of sequential search/adaptation rounds, while parallel code execution contributes only a small batch-level overhead.

Table 11: Difficulty-aware runtime statistics. StarOR ($K=8$) spends little time on benchmarks where semantic and structural consensus emerges early, and allocates more iterations to datasets with ambiguous variables, dense constraints, or unstable objective consensus.

| Dataset | Difficulty | Avg. Iter. | Avg. Time | Dominant runtime behavior |
|---|---|---|---|---|
| NL4OPT | Easy | $\sim 3.1$ | 65.1 s | Early semantic consensus; most instances terminate after shallow stage-wise exploration. |
| MAMO-Easy | Easy | $\sim 3.2$ | 70.4 s | Simple LP/MILP structure; extra time mainly comes from code completion and verification. |
| MAMO-Complex | Complex | $\sim 6.5$ | 188.5 s | Ambiguous variable/constraint mapping; more rounds are needed before structural agreement. |
| OptMATH | Complex | $\sim 7.1$ | 205.2 s | Dense numeric descriptions and unstable objective consensus; search often uses a deeper budget. |

These statistics support the intended deployment pattern of StarOR. On easier instances, the reward signals converge quickly and the method behaves like a lightweight verifier around a small number of search rounds. On harder instances, the additional cost is spent precisely where one-shot and Best-of-$N$ are weakest: resolving variable definitions, constraint directions, and objective-scale disagreements through repeated search-and-adaptation. Thus, the runtime increase is difficulty-aware rather than uniformly applied to every problem.

## Appendix D Implementation Details

Experiments are conducted on a single NVIDIA H20 140GB GPU using the veRL [Sheng et al., [2024](https://arxiv.org/html/2606.15197#bib.bib47)] framework for RL training.

##### GRPO\.

StarOR performs a lightweight online GRPO update after each non-terminal MCTS expansion, based on the veRL implementation. For a selected node at stage $s$, the current policy consists of the frozen backbone parameters $\phi$ and a transient LoRA adapter $\Delta\phi$, yielding the stage-conditioned policy $\pi_{\phi+\Delta\phi}(\cdot \mid x, \tau_{\leq s-1})$. The adapter is initialized at the beginning of each problem instance and is reset after the instance finishes, so no test-time update is carried across benchmark examples.

At an expansion step, we sample a sibling group of $K$ continuations from the same prefix,

$$z_i^{(s)} \sim \pi_{\phi+\Delta\phi_{\mathrm{old}}}\bigl(\cdot \mid x, \tau_{\leq s-1}\bigr), \qquad i = 1, \ldots, K, \tag{11}$$

and roll each continuation to executable code for reward evaluation. Let $R_i$ denote the scalar multi-faceted reward assigned to the $i$-th sibling after combining semantic consensus, execution, test-case robustness, and structural consistency. We use the siblings as a local comparison group and compute the GRPO advantage without a learned value function:

$$\mu_R = \frac{1}{K}\sum_{j=1}^{K} R_j, \qquad \sigma_R = \operatorname{std}_{j=1}^{K}(R_j), \qquad A_i = \frac{R_i - \mu_R}{\sigma_R + \epsilon}, \tag{12}$$

where $\epsilon$ is a small numerical constant. In implementation, the scalar reward is placed on the final valid response token and then summed over the response, so Eq. ([12](https://arxiv.org/html/2606.15197#A4.E12)) is exactly the group-normalized outcome advantage. The scalar $A_i$ is broadcast to the valid response tokens selected by the response mask:

$$A_{i,t} = A_i\, m_{i,t}, \tag{13}$$

where $m_{i,t} \in \{0, 1\}$ indicates whether token $t$ of candidate $i$ participates in the update. When stage-level updating is enabled, this mask is further intersected with the current-stage span so that only tokens corresponding to the selected formulation stage receive gradient signal.
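
As an illustration of Eqs. (12) and (13), the following sketch computes the sibling-group advantages and broadcasts one of them over a token mask. It is a plain-Python approximation, not the veRL code path: the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Eq. (12): normalize each sibling reward by the group mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the K siblings
    return [(r - mean) / (std + eps) for r in rewards]


def token_advantages(advantage: float, mask: List[int]) -> List[float]:
    """Eq. (13): broadcast the scalar advantage to tokens whose mask entry is 1."""
    return [advantage * m for m in mask]


# Four siblings: two with a high reward, two with a low one.
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])

# Only the last two tokens belong to the stage being updated.
per_token = token_advantages(advantages[0], mask=[0, 1, 1])
```

Note that when all siblings receive the same reward, every advantage is zero and the group contributes no policy-gradient signal.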

The LoRA adapter is then updated with a PPO-style clipped policy loss. Let $y_{i,t}$ be the $t$-th token of the generated response and define the token-level importance ratio

$$\rho_{i,t}(\Delta\phi) = \exp\left[\log\pi_{\phi+\Delta\phi}(y_{i,t} \mid x, \tau_{\leq s-1}, y_{i,<t}) - \log\pi_{\phi+\Delta\phi_{\mathrm{old}}}(y_{i,t} \mid x, \tau_{\leq s-1}, y_{i,<t})\right]. \tag{14}$$

With clipping range $\varepsilon_{\mathrm{clip}}$, the clipped surrogate for each valid token is

$$\ell^{\mathrm{clip}}_{i,t}(\Delta\phi) = -\min\left(\rho_{i,t}(\Delta\phi)\,A_{i,t},\ \operatorname{clip}\bigl(\rho_{i,t}(\Delta\phi),\, 1-\varepsilon_{\mathrm{clip}},\, 1+\varepsilon_{\mathrm{clip}}\bigr)\,A_{i,t}\right). \tag{15}$$

Our implementation follows the veRL dual-clip variant for negative advantages, using an additional constant $c > 1$ to cap overly large negative-advantage losses:

$$\ell^{\mathrm{pg}}_{i,t}(\Delta\phi) = \begin{cases} \ell^{\mathrm{clip}}_{i,t}(\Delta\phi), & A_{i,t} \geq 0, \\ \min\left(\ell^{\mathrm{clip}}_{i,t}(\Delta\phi),\ -c\,A_{i,t}\right), & A_{i,t} < 0. \end{cases} \tag{16}$$

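
For a single token, Eqs. (14)–(16) reduce to a few scalar operations. The sketch below is a scalar illustration rather than the batched veRL kernel; the function name is hypothetical, and the defaults `eps_clip=0.2` and `c=3.0` are illustrative assumptions because the clipping constants are not reported here.

```python
import math


def dual_clip_token_loss(logp_new: float, logp_old: float, advantage: float,
                         eps_clip: float = 0.2, c: float = 3.0) -> float:
    """Per-token policy loss following Eqs. (14)-(16)."""
    ratio = math.exp(logp_new - logp_old)                       # Eq. (14)
    clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    loss_clip = -min(ratio * advantage, clipped * advantage)    # Eq. (15)
    if advantage < 0:
        # Dual clip: cap the loss for negative advantages at -c * A.
        return min(loss_clip, -c * advantage)                   # Eq. (16)
    return loss_clip


# A token whose probability grew tenfold but whose sibling had a negative advantage:
capped = dual_clip_token_loss(math.log(10.0), 0.0, advantage=-1.0)
```

Without the second clip the loss in this example would scale with the ratio (about 10); with it, the loss is bounded by $-cA = 3$, which limits the step taken on a single badly off-policy token.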
In addition to PPO clipping, we regularize the update toward the reference policy. We do not subtract the KL term from the reward in our default configuration; instead, we add an actor-level KL loss with coefficient $\beta$. Using the low-variance KL estimator, the per-token regularizer is

$$\begin{aligned} d^{\mathrm{KL}}_{i,t}(\Delta\phi) &= \exp\left(\delta_{i,t}\right) - \delta_{i,t} - 1, \\ \delta_{i,t} &= \log\pi_{\mathrm{ref}}(y_{i,t} \mid \cdot) - \log\pi_{\phi+\Delta\phi}(y_{i,t} \mid \cdot). \end{aligned} \tag{17}$$

The final optimization objective for one sibling group is therefore

$$\mathcal{L}_{\mathrm{GRPO}}(\Delta\phi) = \frac{1}{\sum_{i,t} m_{i,t}} \sum_{i=1}^{K}\sum_{t} m_{i,t}\left[\ell^{\mathrm{pg}}_{i,t}(\Delta\phi) + \beta\, d^{\mathrm{KL}}_{i,t}(\Delta\phi)\right], \tag{18}$$

and only the LoRA parameters $\Delta\phi$ are optimized. After the update, the refreshed adapter is synchronized to the rollout engine, so subsequent MCTS expansions sample from the instance-adapted policy $\pi_{\phi+\Delta\phi}$.
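
The regularizer in Eq. (17) and the masked token-mean in Eq. (18) can be sketched as follows. This is a list-based illustration with hypothetical function names; per-token policy losses are passed in as precomputed numbers, and `beta=0.001` is the KL coefficient listed in the hyperparameter table.

```python
import math
from typing import List


def k3_kl(logp_ref: float, logp_cur: float) -> float:
    """Eq. (17): exp(delta) - delta - 1 with delta = log pi_ref - log pi_cur."""
    delta = logp_ref - logp_cur
    return math.exp(delta) - delta - 1.0


def grpo_objective(pg_losses: List[List[float]], kl_terms: List[List[float]],
                   masks: List[List[int]], beta: float = 0.001) -> float:
    """Eq. (18): masked token-mean of (policy loss + beta * KL) over one sibling group."""
    total, count = 0.0, 0
    for pg_row, kl_row, mask_row in zip(pg_losses, kl_terms, masks):
        for pg, kl, m in zip(pg_row, kl_row, mask_row):
            total += m * (pg + beta * kl)
            count += m
    if count == 0:
        raise ValueError("no valid tokens in the sibling group")
    return total / count
```

The estimator is zero when the current and reference log-probabilities coincide and positive otherwise, so the penalty only activates once the adapter drifts away from the reference policy on the sampled tokens.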

##### Test\-case\.

The robustness test cases are prepared once before the MCTS search for each problem instance. StarOR first converts the raw problem into a numeric instance representation and a feature catalog. This catalog is constructed by deterministic parsing rather than by free-form generation: inline numerical mentions in the natural-language description are stored as keys such as `num_0`, while numerical entries extracted from markdown-style tables are stored as table keys. Each entry records its numeric value, source type, local text snippet, and a heuristic importance score. Scores are higher when the surrounding text or table header contains OR-relevant terms such as *cost*, *profit*, *capacity*, *demand*, *budget*, *limit*, or relational language. The ranked entries are exposed to the planner as a compact feature catalog

$$\mathcal{F}_x = \{(F_m, k_m, v_m, q_m, h_m)\}_{m=1}^{M}, \tag{19}$$

where $F_m$ is a stable feature id (e.g., `F01`), $k_m$ is the internal instance key, $v_m$ is the original numeric value, $q_m$ is the source, and $h_m$ is the supporting snippet.

Given $\mathcal{F}_x$, StarOR uses an auxiliary pre-search generation step to produce two structured artifacts: (i) a conservative objective-scale envelope for the original instance, and (ii) $N_t$ robustness tests. The LLM is asked to return tagged JSON blocks rather than executable code. A test case contains a case id, a list of coordinated patches, an objective-scale envelope, and a short rationale:

$$c_j = \bigl(\mathrm{id}_j,\ \{(F_m, \mathrm{op}, a)\}_{m \in \mathcal{P}_j},\ \mathrm{scale}_j,\ \mathrm{rationale}_j\bigr). \tag{20}$$

The patch list is interpreted programmatically. Each patch must reference an existing feature id in $\mathcal{F}_x$; the implementation maps the id back to its internal numeric key $k_m$ and applies the edit to a deep copy of the original instance. Three edit forms are supported:

$$v'_m = \begin{cases} a, & \mathrm{op} = \texttt{replace}, \\ a\, v_m, & \mathrm{op} = \texttt{scale}, \\ v_m + a, & \mathrm{op} = \texttt{shift}. \end{cases} \tag{21}$$

Integer-valued features are rounded back to integers after editing. Invalid patches, such as unknown feature ids or edits to non-numeric fields, are discarded. The resulting perturbed instance stores the normalized patch metadata and is cached in the problem instance under the precomputed test-case field.
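
A minimal sketch of this patch interpreter is given below. It assumes a flat dictionary as the instance representation, a catalog that maps feature ids to instance keys, and patch dictionaries with the hypothetical fields `fid`, `op`, and `a`; none of these names are confirmed by the paper beyond the tuple in Eq. (20).

```python
import copy
from typing import Any, Dict, List


def _is_number(x: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def apply_patches(instance: Dict[str, Any], catalog: Dict[str, str],
                  patches: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply replace/scale/shift edits (Eq. 21) to a deep copy of `instance`."""
    perturbed = copy.deepcopy(instance)
    for patch in patches:
        key = catalog.get(patch.get("fid"))          # feature id -> internal key
        old = perturbed.get(key) if key is not None else None
        a = patch.get("a")
        if not _is_number(old) or not _is_number(a):
            continue                                  # discard invalid patches
        op = patch.get("op")
        if op == "replace":
            new = a
        elif op == "scale":
            new = a * old
        elif op == "shift":
            new = old + a
        else:
            continue                                  # unknown edit form
        # Integer-valued features are rounded back to integers after editing.
        perturbed[key] = round(new) if isinstance(old, int) else new
    return perturbed
```

Working on a deep copy keeps the original instance untouched, so the same catalog can be reused for all $N_t$ test cases. Python's built-in `round` uses round-half-to-even, which is one possible reading of "rounded back to integers".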

Importantly, the LLM is not called again when computing $r_{\mathrm{test}}$ during MCTS. During search, every candidate program is executed directly on the precomputed perturbed instances, in parallel when multiple tests are available. The observed objective is then checked against the corresponding case-specific objective-scale envelope. Thus, the test-case reward is an execution-time verification signal: LLM reasoning is used only once to design the stress tests before search, while the per-candidate reward computation is deterministic given the generated code, the cached perturbation patches, and the objective-scale filters. If structured test-case precomputation fails, the current veRL pipeline marks the test-case reward as disabled for that instance rather than repeatedly querying the LLM inside the search loop.

### D.1 Hyperparameters

This section provides a comprehensive summary of the hyperparameters and implementation configurations for StarOR. Table [12](https://arxiv.org/html/2606.15197#A4.T12) details the settings for the backbone model, Monte Carlo Tree Search (MCTS), and the Group Relative Policy Optimization (GRPO) components.

Table 12: Main hyperparameters for StarOR.

| Category | Parameter | Value / Setting |
|---|---|---|
| Backbone | Model | Qwen3-4B-Instruct-2507 |
| | Max Response Length | 6,144 tokens |
| | Sampling | Temperature 1.0, Top-$p = 0.95$ |
| Search (MCTS) | Budget | Max iterations $T = 10$ |
| | Expansion Group Size | $K = 8$ siblings per node |
| | PUCT Constant | $c_{\mathrm{puct}} = 1.414$ |
| | Prior Softmax | Temperature $\eta = 0.7$ |
| | Backpropagation | Decay factor $\rho = 0.95$ for clusters |
| | Stop | One-shot suppression factor $\gamma_{\mathrm{sup}} = 0.5$ |
| | Clustering | Tolerance $\epsilon_c = 0.01\%$ |
| Adaptation (RL) | Optimizer | GRPO (Online) |
| | Learning Rate | $1 \times 10^{-4}$ |
| | LoRA Config | Rank 8, Alpha 16, all linear layers |
| | KL Penalty | Coefficient $0.001$, low-variance estimator |
| Rewards | Reward Components | $r_{\mathrm{sem}}$: Semantic, $r_{\mathrm{exec}}$: Exec, $r_{\mathrm{test}}$: Test-case, $r_{\mathrm{struct}}$: Structural |
| | Phase 1 (Iter 1–3) | $(0.2, 0.5, 0.2, 0.1)$ for $r_{\mathrm{sem}}$, $r_{\mathrm{exec}}$, $r_{\mathrm{test}}$, $r_{\mathrm{struct}}$ |
| | Phase 2 (Iter 4–5) | $(0.4, 0.4, 0.1, 0.1)$ |
| | Phase 3 (Iter > 5) | $(0.6, 0.2, 0.1, 0.1)$ |
| | Out-of-scale Penalty | Multiplier $\lambda = 0.5$ for $r_{\mathrm{sem}}$ and $r_{\mathrm{test}}$ |
| | Test-Case Number | $N_t = 3$ |
| Others | Consensus Tolerance | Relative objective tolerance $0.01\%$ |
| | Code Refinement | Max 2 repair rounds at termination |
| | Solver Backend | Gurobi, 30 s limit per execution |

### D.2 Details of Reward System

The reward system is designed for the unlabeled test-time setting, where ground-truth answers are unavailable during search. For candidate $i$, it combines four weak but complementary signals:

$$R_i = \max\left(0,\ w_{\mathrm{sem}}\, r_{\mathrm{sem},i} + w_{\mathrm{exec}}\, r_{\mathrm{exec},i} + w_{\mathrm{test}}\, r_{\mathrm{test},i} + w_{\mathrm{struct}}\, r_{\mathrm{struct},i}\right). \tag{22}$$

##### Execution reward $r_{\mathrm{exec}}$.

The execution reward is a binary indicator of whether the generated Python/Gurobi program executes without fatal errors\. Specifically, the implementation treats any program free of syntax and runtime exceptions as an effective success, prioritizing the model’s executability\. This broad definition includes not only cases that yield recognizable Gurobi optimality messages or parseable objective values but also any run that completes its process normally\. Conversely, only timeouts, unhandled code exceptions, and completely missing solver outputs are treated as failures\.

##### Semantic reward $r_{\mathrm{sem}}$.

Following the definition in Eq. [3](https://arxiv.org/html/2606.15197#S3.E3), StarOR calculates the semantic consistency reward for each candidate $i$ based on the size of its objective-value cluster $|C_{\mathrm{sem},i}|$. Valid objective values are clustered using a relative tolerance of $0.01\%$. To ensure numerical stability during test-time optimization, the practical implementation employs a smoothed version of the consensus ratio:

$$r_{\mathrm{sem},i} = \frac{|C_{\mathrm{sem},i}| + \alpha_1}{K_{\mathrm{valid}} + \alpha_1 \cdot \max(N_{\mathrm{clusters}}, K_{\min})} \cdot \lambda, \tag{23}$$

where $\alpha_1 = 0.6$ is the smoothing coefficient, $K_{\min} = 3$ is the baseline cluster scale, $K_{\mathrm{valid}}$ is the number of executable candidates, and $N_{\mathrm{clusters}}$ denotes the total number of identified clusters in the rollout group. The multiplier $\lambda$ corresponds to the objective-scale penalty described in Section [3.2.1](https://arxiv.org/html/2606.15197#S3.SS2.SSS1.Px2): $\lambda = 1.0$ if the objective value $O_i$ falls within the predicted scale $[L_x, U_x]$, and $\lambda = 0.5$ otherwise. This reward design ensures that independently generated formulations reaching the same plausible objective value receive higher credit.
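
To make Eq. (23) concrete, the sketch below clusters objective values and computes the smoothed consensus reward. The greedy first-match clustering and the convention that a candidate without a valid objective (`None`) receives zero are assumptions for illustration; the constants $\alpha_1 = 0.6$, $K_{\min} = 3$, $\lambda = 0.5$ and the $0.01\%$ tolerance are taken from the text.

```python
from typing import List, Optional


def cluster_objectives(values: List[float], rel_tol: float = 1e-4) -> List[int]:
    """Greedy clustering: a value joins the first cluster whose representative
    lies within the relative tolerance (0.01% = 1e-4)."""
    reps: List[float] = []
    labels: List[int] = []
    for v in values:
        for idx, rep in enumerate(reps):
            if abs(v - rep) <= rel_tol * max(abs(v), abs(rep)):
                labels.append(idx)
                break
        else:
            reps.append(v)
            labels.append(len(reps) - 1)
    return labels


def semantic_rewards(objectives: List[Optional[float]], lower: float, upper: float,
                     alpha: float = 0.6, k_min: int = 3,
                     penalty: float = 0.5) -> List[float]:
    """Eq. (23) for one sibling group; None marks a candidate without a valid objective."""
    valid = [v for v in objectives if v is not None]
    labels = cluster_objectives(valid)
    sizes = {lab: labels.count(lab) for lab in set(labels)}
    denom = len(valid) + alpha * max(len(sizes), k_min)
    rewards, pos = [], 0
    for v in objectives:
        if v is None:
            rewards.append(0.0)
            continue
        lam = 1.0 if lower <= v <= upper else penalty   # objective-scale penalty
        rewards.append((sizes[labels[pos]] + alpha) / denom * lam)
        pos += 1
    return rewards


# Three candidates agree near 100, one outlier at 250, one failed run.
rewards = semantic_rewards([100.0, 100.0, 100.005, 250.0, None], lower=50.0, upper=200.0)
```

With eight valid candidates in a single in-scale cluster, Eq. (23) gives $(8 + 0.6)/(8 + 0.6 \cdot 3) \approx 0.878$, so the smoothing keeps even unanimous agreement below 1.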

##### Structural consistency reward $r_{\mathrm{struct}}$.

Following the definition in Eq. [4](https://arxiv.org/html/2606.15197#S3.E4), the structural reward $r_{\mathrm{struct},i}$ evaluates candidate $i$ by its architectural consistency within the group. To ensure the reward is bounded in $[0, 1]$, the $\mathrm{Norm}(\cdot)$ operator is implemented as a dimension-wise average. Specifically, let $\mathbf{v}_i$ be the signature vector containing the counts of 5 core components:

$$\mathbf{v}_i = [n_{\mathrm{bin}}, n_{\mathrm{int}}, n_{\mathrm{cont}}, n_{\mathrm{con}}, \mathrm{sense}]. \tag{24}$$

The implementation incorporates a smoothing term $(\alpha_4, K_4)$ into the consensus ratio for each dimension $d$, and the final reward is calculated as:

$$r_{\mathrm{struct},i} = \frac{1}{5}\sum_{d \in \mathbf{v}_i} \sqrt{\frac{|C(i,d)| + \alpha_4}{K_{\mathrm{valid}} + \alpha_4 \cdot K_4}}, \tag{25}$$

where $|C(i,d)|$ is the number of candidates sharing the same value on dimension $d$, $K_{\mathrm{valid}}$ is the count of executable candidates, $\alpha_4 = 0.4$ is the smoothing coefficient, and $K_4 = 3$ is the normalization constant. This formulation ensures that when perfect consensus is reached ($|C| = K_{\mathrm{valid}}$), the reward approaches 1.0 (subject to smoothing), effectively rewarding structurally stable modeling trajectories.
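
Eq. (25) can be sketched directly from signature tuples. The function name and the tuple encoding of the signature are hypothetical; the sketch assumes only executable candidates are passed in, so that the list length equals $K_{\mathrm{valid}}$.

```python
import math
from typing import List, Sequence


def structural_rewards(signatures: List[Sequence], alpha: float = 0.4,
                       k4: int = 3) -> List[float]:
    """Eq. (25): each signature is (n_bin, n_int, n_cont, n_con, sense)."""
    if not signatures:
        return []
    k_valid = len(signatures)
    n_dims = len(signatures[0])
    denom = k_valid + alpha * k4
    rewards = []
    for sig in signatures:
        total = 0.0
        for d in range(n_dims):
            # |C(i, d)|: candidates (including i itself) with the same value on dimension d.
            agree = sum(1 for other in signatures if other[d] == sig[d])
            total += math.sqrt((agree + alpha) / denom)
        rewards.append(total / n_dims)
    return rewards


# Eight candidates that all declare the same structure.
unanimous = structural_rewards([(2, 0, 5, 7, "min")] * 8)
```

For eight identical signatures every dimension contributes $\sqrt{8.4/9.2} \approx 0.956$, which illustrates the "approaches 1.0 (subject to smoothing)" behavior noted above.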

##### Test-case reward $r_{\mathrm{test}}$.

Following the definition in Eq. [5](https://arxiv.org/html/2606.15197#S3.E5), the test-case reward $r_{\mathrm{test},i}$ evaluates the robustness of candidate $i$ against synthetic test cases. For each problem instance, StarOR precomputes $N_t = 3$ perturbed versions (e.g., modified numeric capacities or coefficients) with associated objective-scale envelopes. Candidate $i$ is executed on each perturbed instance, and its performance on case $j$ is assigned a score $S_{i,j}$ based on the following criteria:

- $S_{i,j} = 1.0$: Execution succeeds, and the objective value falls within the case-specific scale.
- $S_{i,j} = \lambda$ ($\lambda = 0.5$): Execution succeeds, but the objective value is outside the predicted scale.
- $S_{i,j} = 0.0$: Execution fails (e.g., syntax error, timeout, or infeasibility).

The final test-case reward is the average across all test cases:

$$r_{\mathrm{test},i} = \frac{1}{N_t}\sum_{j=1}^{N_t} S_{i,j}, \qquad S_{i,j} \in \{0, \lambda, 1\}. \tag{26}$$

By incorporating the soft-penalty factor $\lambda$, this design ensures that out-of-scale but executable programs are penalized rather than entirely discarded, acknowledging the potentially conservative nature of the initial objective-scale estimation.
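
The scoring rule above amounts to a three-way case split followed by an average. In the sketch below, the function name is hypothetical, a failed execution is represented by `None`, and each envelope is simplified to a `(lower, upper)` pair instead of the full objective-scale object.

```python
from typing import List, Optional, Tuple


def robustness_reward(results: List[Optional[float]],
                      envelopes: List[Tuple[float, float]],
                      penalty: float = 0.5) -> float:
    """Eq. (26): average of per-case scores in {0, lambda, 1}.

    results[j] is the objective obtained on perturbed case j, or None if the
    execution failed (error, timeout, or infeasibility)."""
    if not results or len(results) != len(envelopes):
        raise ValueError("need one envelope per test-case result")
    scores = []
    for obj, (lower, upper) in zip(results, envelopes):
        if obj is None:
            scores.append(0.0)          # execution failed
        elif lower <= obj <= upper:
            scores.append(1.0)          # succeeded and in scale
        else:
            scores.append(penalty)      # succeeded but out of scale
    return sum(scores) / len(scores)


# One in-scale run, one out-of-scale run, one failure across N_t = 3 cases.
r_test = robustness_reward([210.0, 400.0, None], [(180.0, 245.0)] * 3)
```

In this example the three scores are 1, 0.5, and 0, so the reward is 0.5.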

##### Objective\-scale Prior Estimation\.

Prior to the hierarchical search, StarOR leverages the base policy to estimate a conservative objective-scale envelope for the problem instance. This prior acts as a grounded reference for the reward system, particularly for the $\lambda$-penalty in $r_{\mathrm{sem}}$ and $r_{\mathrm{test}}$. The estimated scale is structured as a JSON object, enabling programmatic verification across multiple dimensions:

Listing 1: Example for objective-scale priors.

```json
{
  "kind": "interval",
  "lower": 500,
  "upper": 2800,
  "sign_relation": "positive",
  "magnitude": {
    "min_order": 2,
    "max_order": 4,
    "use_abs": true
  },
  "reject_exact": [0]
}
```

The runtime verification engine applies a hierarchical filtering logic:

1. Validity Check: Rejects non-finite objectives (e.g., NaN, $\pm\infty$) and values (e.g., 0) specified in the `reject_exact` list.
2. Structural Constraints: Enforces sign consistency (e.g., non-negativity) and magnitude-order constraints (e.g., ensuring the value is within $10^{2}$ to $10^{4}$).
3. Interval Grounding: Validates the objective against explicit numeric bounds.

To accommodate the potential conservatism of the zero-shot backbone, we apply a $10\%$ margin relaxation to the interval bounds during final selection and early stopping. This buffer reduces false negatives, ensuring that valid modeling refinements that slightly exceed the initial estimate are not prematurely discarded.
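
The three filtering stages can be sketched as a single predicate over the JSON schema shown earlier. This is one plausible reading rather than the actual verification engine: the function name is hypothetical, the magnitude check is interpreted as $10^{\text{min\_order}} \le |v| \le 10^{\text{max\_order}}$, and the margin is assumed to widen each bound by a fraction of its own absolute value.

```python
import math
from typing import Any, Dict


def in_scale(obj: float, scale: Dict[str, Any], margin: float = 0.0) -> bool:
    """Return True if `obj` passes validity, structural, and interval checks."""
    # 1) Validity check: finite and not explicitly rejected.
    if not math.isfinite(obj) or obj in scale.get("reject_exact", []):
        return False
    # 2) Structural constraints: sign relation, then order of magnitude.
    sign = scale.get("sign_relation", "unknown")
    if sign == "positive" and obj <= 0:
        return False
    if sign == "nonnegative" and obj < 0:
        return False
    if sign == "negative" and obj >= 0:
        return False
    magnitude = scale.get("magnitude")
    if magnitude:
        size = abs(obj) if magnitude.get("use_abs", True) else obj
        if not (10 ** magnitude["min_order"] <= size <= 10 ** magnitude["max_order"]):
            return False
    # 3) Interval grounding, optionally relaxed by `margin` (assumed semantics).
    lower = scale["lower"] - margin * abs(scale["lower"])
    upper = scale["upper"] + margin * abs(scale["upper"])
    return lower <= obj <= upper


prior = {
    "kind": "interval", "lower": 500, "upper": 2800,
    "sign_relation": "positive",
    "magnitude": {"min_order": 2, "max_order": 4, "use_abs": True},
    "reject_exact": [0],
}
strict = in_scale(3000.0, prior)                 # outside [500, 2800]
relaxed = in_scale(3000.0, prior, margin=0.10)   # inside the widened [450, 3080]
```

Under this reading, an objective of 3000 is rejected by the strict interval but accepted once the 10% relaxation is applied, which matches the stated purpose of reducing false negatives.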

##### Test\-case Generation\.

To synthesize the $N_t$ test cases for robustness evaluation, the test-case generator first constructs a compact numeric feature catalog by parsing the natural-language problem description. Each numerical value or table entry is assigned a unique identifier (e.g., `F01`, `F02`), with priority given to features proximal to critical OR semantics such as *cost*, *profit*, *capacity*, *demand*, *budget*, and *limit*. Leveraging this catalog and the original problem description, the generator produces coordinated patches that target several high-ranking features simultaneously, accompanied by a re-estimated objective-scale envelope for each generated test case. Listing [2](https://arxiv.org/html/2606.15197#A4.SS2.SSS0.Px6) shows an example.

Listing 2: Example for test-case generation.

```json
[
  {
    "case_id": "easier_route_key_edges_down",
    "patches": [
      {"fid": "F04", "new_value": 38},
      {"fid": "F08", "new_value": 24},
      {"fid": "F05", "new_value": 42}
    ],
    "obj_scale": {
      "kind": "interval",
      "lower": 180,
      "upper": 245,
      "sign_relation": "positive",
      "magnitude": {
        "min_order": 2,
        "max_order": 3,
        "use_abs": true
      },
      "reject_exact": [0]
    }
  }
]
```

##### Prompt templates\.

The two templates below are used before search to construct the objective\-scale prior and the robustness tests\. They are displayed in a verbatim\-style prompt box, so the raw prompt can be pasted directly into the paper source without manually inserting LaTeX line breaks or escaping prompt placeholders\.


**Base Objective-Scale Prompt Template**

You are an OR objective-scale analyst. Return exactly two tagged blocks and nothing else: `<analysis>…</analysis>` `<base_scale>…JSON…</base_scale>`

Goal: Produce a conservative runtime `filter(obj, scale)` for the ORIGINAL optimization instance.

Priority: Prioritize correctness over precision. The interval should be wide enough to include all mathematically plausible optimal objective values, but narrow enough to reject absurd, dimensionally impossible, or semantically invalid values.

Analysis logic in `<analysis>`:

1. Identify the optimization sense (minimize or maximize) and the physical meaning of the objective.
2. Estimate a theoretical floor and ceiling from the numeric data, such as total demand, maximum capacity, fixed costs, largest unit costs, or profit bounds.
3. Explain why the selected scale is safe and should not reject the true optimum.
4. Explicitly exclude impossible values, such as negative costs when all costs are nonnegative, zero objective values when fixed costs are mandatory, or magnitudes that violate the instance scale.

Output schema for `<base_scale>`: `{"kind": "interval", "lower": number, "upper": number, "sign_relation": "positive | nonnegative | negative | mixed | unknown", "magnitude": {"min_order": integer, "max_order": integer, "use_abs": true}, "reject_exact": [number, …]}`

Task description: `task_description`

Numeric snapshot: `compact_instance_json`

Feature catalog: `feature_catalog_json`

Think carefully in `<analysis>`, then provide the final JSON object in `<base_scale>`.

Perturbation Test-Case Generation Prompt Template

You are an expert Operations Research (OR) stress-test engineer. Return exactly two tagged blocks and nothing else: <analysis\>…</analysis\> <tests\>[…JSON list…]</tests\>

Goal: Design robustness tests that challenge the formulation's sensitivity, feasibility boundaries, and objective-scale consistency.

Each test should contain:

- case_id: a short descriptive identifier.
- patches: 1-3 coordinated feature edits using feature ids from the catalog.
- obj_scale: a conservative objective-scale envelope for the perturbed instance.
- rationale: a short explanation of why this perturbation is meaningful.

Analysis logic in <analysis\>:

1. Identify sensitive parameters whose changes strongly affect feasibility or objective value, such as bottleneck capacities, high-cost coefficients, demands, budgets, resource limits, or service requirements.
2. Plan 3-5 realistic scenarios, such as resource scarcity, relaxed capacity, extreme cost variation, demand surge, or boundary feasibility.
3. For each scenario, estimate a safe objective range that is broad enough to include plausible optimal values but still excludes impossible values.

Requirements for <tests\>:

- Use only feature ids that appear in the feature catalog.
- Prefer small, meaningful perturbations over random large changes.
- Preserve unit consistency and problem semantics.
- Avoid patches that make the natural-language problem nonsensical unless the scenario explicitly tests infeasibility handling.
- The obj_scale must use the schema: {"kind": "interval", "lower": number, "upper": number, "sign_relation": "positive | nonnegative | negative | mixed | unknown", "magnitude": {"min_order": integer, "max_order": integer, "use_abs": true}, "reject_exact": [number, …]}

Task description: {task_description}

Numeric snapshot: {compact_instance_json}

Feature catalog: {feature_catalog_json}

Think carefully in <analysis\>, then provide the final JSON list in <tests\>.
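Both templates ask for exactly two tagged blocks, so the response has to be split and validated before it can be used. The sketch below shows one way this could be done; `extract_block` and `parse_tests` are hypothetical helper names, and the filtering rules simply mirror the requirements stated in the perturbation template (1-3 patches, catalog feature ids only, an `obj_scale` per case). It is not the authors' parser.

```python
import json
import re


def extract_block(text, tag):
    """Return the stripped content of the first <tag>...</tag> block, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_tests(response, catalog_ids):
    """Parse the <tests> block and keep only cases that satisfy the prompt's rules."""
    raw = extract_block(response, "tests")
    if raw is None:
        return []
    try:
        cases = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(cases, list):
        return []
    valid = []
    for case in cases:
        if not isinstance(case, dict) or "obj_scale" not in case:
            continue
        patches = case.get("patches", [])
        if not 1 <= len(patches) <= 3:
            continue
        # Reject cases that patch feature ids missing from the catalog.
        if any(p.get("fid") not in catalog_ids for p in patches):
            continue
        valid.append(case)
    return valid


# Hypothetical model response with one valid and one invalid test case.
scale = {"kind": "interval", "lower": 180, "upper": 245}
tests = [
    {"case_id": "ok_case", "patches": [{"fid": "F04", "new_value": 38}], "obj_scale": scale},
    {"case_id": "bad_fid", "patches": [{"fid": "F99", "new_value": 1}], "obj_scale": scale},
]
response = f"<analysis>Capacity edges dominate.</analysis>\n<tests>{json.dumps(tests)}</tests>"
kept = parse_tests(response, catalog_ids={"F04", "F05", "F08"})
```

Dropping malformed cases silently, rather than raising, keeps the pre-search stage from failing on a single bad generation; whether StarOR does this is not stated in the paper.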

### D.3 Algorithm

Algorithm [1](https://arxiv.org/html/2606.15197#alg1) summarizes the complete StarOR inference procedure.

Algorithm 1 StarOR: Synergistic Tree Search and Test-Time Policy Adaptation

Input: Problem instance $x$, base policy $\pi_{\phi}$, search budget $T$, expansion size $K$

Output: Final executable solver program $c^{\star}$

1. Construct the pre-generation prior: objective-scale envelope $[L_x, U_x]$ and synthetic test cases $\mathcal{D}_{\mathrm{test}}$
2. Initialize root node $n_0$, terminal candidate set $\mathcal{C} \leftarrow \emptyset$, stage archives $\{\mathcal{B}_s\}_{s=1}^{3}$, and transient adapter $\Delta\phi \leftarrow 0$
3. Initialize the one-step suppression mask $\mathcal{S} \leftarrow \emptyset$
4. for $t = 1, \dots, T$ do
5. Select an expandable leaf $n$ using the PUCT score under the current policy $\pi_{\phi+\Delta\phi}$ and suppression mask $\mathcal{S}$
6. Let $\tau_{\leq s-1}$ be the partial formulation stored at $n$, where $s$ is the next stage to be generated
7. if $s = \texttt{code}$ and the one-time code deferral has not been used then
8. Add the current trajectory and its objective-consensus cluster to $\mathcal{S}$ for one iteration
9. Mark the code deferral as used and continue
10. end if
11. Sample a sibling group $\{z_i^{(s)}\}_{i=1}^{K} \sim \pi_{\phi+\Delta\phi}(\cdot \mid x, \tau_{\leq s-1}, s)$ and complete it into an executable rollout $c_i$
12. Execute each $c_i$ on $x$ and $\mathcal{D}_{\mathrm{test}}$; compute $r_{\mathrm{sem}}$, $r_{\mathrm{exec}}$, $r_{\mathrm{test}}$, $r_{\mathrm{struct}}$, and total reward $R_i$
13. Cluster executable siblings by objective consensus and structural signatures; let $\mathcal{I}$ be the evaluated rollout index set
14. Add children $\{n_i\}_{i\in\mathcal{I}}$ to the tree and assign priors from normalized model log-likelihoods
15. for each child $n_i$ do
16. Store its partial trajectory, executable rollout, cluster ID, and individual reward $R_i$
17. end for
18. Compute the rollout-group value $\bar{R} = \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} R_i$
19. Backpropagate $\bar{R}$ along the selected path from $n$ to $n_0$
20. Apply discounted group backpropagation to same-cluster sibling nodes using decay $\rho^{\ell}$
21. if $s \neq \texttt{code}$ then
22. Estimate GRPO advantages from $\{R_i\}_{i\in\mathcal{I}}$ and update the transient adapter $\Delta\phi$
23. Add the scored partial trajectories $\{\tau_{\leq s,i}\}_{i\in\mathcal{I}}$ to archive $\mathcal{B}_s$
24. else
25. Add terminal rollouts $\{c_i\}_{i\in\mathcal{I}}$ to $\mathcal{C}$
26. Select $c^{\star}$ by objective consensus, objective-scale filtering, and optional repair
27. break
28. end if
29. if objective consensus is stable across recent distinct stages then
30. Complete the consensus-supported trajectory into code, add it to $\mathcal{C}$, and select $c^{\star}$ with the terminal-selection rule
31. break
32. end if
33. Clear expired entries in the one-step suppression mask $\mathcal{S}$
34. end for
35. if $c^{\star}$ has not been selected then
36. Select $c^{\star}$ from $\mathcal{C}$ or archived executable rollouts by global objective-consensus voting and objective-scale-aware tie breaking
37. end if
38. return $c^{\star}$
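Three numerical steps of Algorithm 1 (leaf scoring, rollout-group backpropagation, and GRPO advantage estimation) can be sketched in a few lines. This is an illustrative sketch under assumptions: the `Node` class, the AlphaZero-style PUCT formula, the constant `c_puct`, and the standard-deviation normalization of advantages are common choices that this appendix does not spell out. The suppression mask, the same-cluster discounted backpropagation, and the LoRA update itself are omitted.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    prior: float = 1.0
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    @property
    def q(self) -> float:
        """Mean backed-up value; zero for unvisited nodes."""
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(child: Node, c_puct: float = 1.5) -> float:
    """Assumed AlphaZero-style PUCT: mean value plus prior-weighted exploration bonus."""
    parent_visits = child.parent.visits if child.parent else 0
    bonus = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return child.q + bonus


def backpropagate(node: Optional[Node], value: float) -> None:
    """Add the rollout-group value to every node on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node = node.parent


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: standardize rewards within one sibling group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(variance) + eps) for r in rewards]


# One expansion step with K = 3 sibling rewards (values are made up).
root = Node()
leaf = Node(prior=0.5, parent=root)
root.children.append(leaf)
rewards = [0.2, 0.8, 0.5]
group_value = sum(rewards) / len(rewards)  # R-bar in step 18
backpropagate(leaf, group_value)           # step 19
advantages = grpo_advantages(rewards)      # step 22
score = puct_score(leaf)
```

Because the advantages are computed relative to the sibling group, a rollout is rewarded only for being better than the alternatives sampled from the same partial formulation, which is what lets the MCTS siblings serve as local comparison sets.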

## Appendix E Prompt Templates and Structured Instructions

This section reports the structured prompting interface used by StarOR\.

### E.1 Global System Instruction

Every model query is prefixed with the following system instruction\. The same instruction is used for all four formulation stages, auto\-completion rollouts, and local policy adaptation samples\.

System Instruction for StarOR

You are a helpful assistant with expertise in operations research modeling and the Gurobi Python solver.

Think step by step before producing the requested tagged output\.

Output only clean tag\-specific content inside the required tags\. Do not include extra explanations outside the tags\.

At the end of each rollout, you must provide complete executable Gurobi Python code inside <python\>…</python\>\.

### E.2 Stage Rollout Templates

Stage 1: Type and Sets

You are a professional optimization problem analyst. Your task is to extract the problem type and the minimum necessary indexing sets from a natural-language optimization problem.

Optimization problem: {task_description}

Instructions:

1. Think step by step inside <thought\>. Identify the decision context, objective direction, resources, agents, items, time periods, locations, or other entities that need indexing.
2. Output the problem type inside <Type\>. Include the optimization family when identifiable, such as LP, MILP, assignment, transportation, facility location, routing, scheduling, blending, or production planning.
3. Output the indexing sets inside <Sets\>. Define only the sets needed for a clean mathematical formulation.
4. After completing this stage, continue the remaining formulation and provide complete executable Gurobi Python code inside <python\>.

Required output order: <thought\>…</thought\> <Type\>…</Type\> <Sets\>…</Sets\> <python\>…</python\>

MANDATORY FORMAT RULES:

1. <Type\> should summarize:
   - optimization type: LP / MILP / NLP / MINLP and so on.
   - classical OR family when identifiable: TSP / Facility Location Problem / VRP (Vehicle Routing Problem) and so on.
   - Explanation: Provide a brief sentence outlining the rationale and key points.
2. <Sets\> should define the minimum necessary indexing sets.
   - set_name: description: elements if explicitly enumerable
   - Example: s: Employee types: f,p where f=full-time workers, p=part-time workers

Here is some code example:

```text
<python>
import gurobipy as gp
from gurobipy import GRB

# Create model
# …… (here is core modeling code)

model.optimize()

status = model.status
if status == GRB.OPTIMAL:
    optimal = model.objVal
    print(f"Optimal value: {optimal}")
else:
    print(f"Model status: {status}")
</python>
```

Stage 2: Parameters and Variables

You are a professional optimization problem analyst. Your task is to define the numerical parameters and decision variables given the previously committed problem type and sets.

Optimization problem: {task_description}

Committed Type and Sets: {type_sets_content}

Instructions:

1. Think step by step inside <thought\>. Identify all numeric constants, tables, capacities, demands, costs, profits, budgets, bounds, and logical coefficients required by the model.
2. Output the parameters inside <Parameters\>. Preserve units and map each parameter to the relevant set indices.
3. Output the decision variables inside <Variables\>. Specify variable domain, index set, and semantic meaning.
4. After completing this stage, continue the remaining formulation and provide complete executable Gurobi Python code inside <python\>.

Required output order: <thought\>…</thought\> <Parameters\>…</Parameters\> <Variables\>…</Variables\> <python\>…</python\>

MANDATORY FORMAT RULES:

1. Parameters format:
   - Indexed parameter:
     - param_index: description [unit][indexed by set_name] (data type): value_or_semantic_value
   - Global parameter:
     - param: description [unit] (data type): value_or_semantic_value
2. Variables format:
   - Indexed variable:
     - x_index: description (domain)
   - Global variable:
     - x: description (domain)
3. Naming rules:
   - parameter names must be concise and consistent with sets/entities.
   - variable names must be concise and consistent with later symbolic modeling.
   - use the same terminology as previous stages.
   - do not rename entities casually.

Here is some code example:

```text
<python>
import gurobipy as gp
from gurobipy import GRB

# Create model
# …… (here is core modeling code)

model.optimize()

status = model.status
if status == GRB.OPTIMAL:
    optimal = model.objVal
    print(f"Optimal value: {optimal}")
else:
    print(f"Model status: {status}")
</python>
```

Stage 3: Objective and Constraints

You are a professional optimization problem analyst. Your task is to write the mathematical objective and constraints using the previously committed type, sets, parameters, and variables.

Optimization problem: {task_description}

Committed Type and Sets: {type_sets_content}

Committed Parameters and Variables: {para_var_content}

Instructions:

1. Think step by step inside <thought\>. Determine the objective sense, the exact objective expression, and every feasibility condition stated or implied by the problem.
2. Output the objective inside <Objective\>. Include whether the problem is a minimization or maximization problem.
3. Output the constraints inside <Constraints\>. Group constraints by semantic role, such as demand satisfaction, capacity, assignment, budget, balance, precedence, compatibility, or domain restrictions.
4. After completing this stage, provide complete executable Gurobi Python code inside <python\>.

Required output order: <thought\>…</thought\> <Objective\>…</Objective\> <Constraints\>…</Constraints\> <python\>…</python\>

MANDATORY FORMAT RULES:

1. Objective format:
   - objective_name: description: $LaTeX expression$
2. Constraints format:
   - constraint_name: description: $LaTeX expression$ (type: Equality/Inequality)
3. All symbols in objective/constraints must come from previous stages.
4. Use symbolic parameters rather than hard-coded numeric coefficients whenever possible.
5. Output results directly. Do NOT output chain-of-thought.

CONSISTENCY RULES:

- Every variable appearing in the objective must be defined earlier.
- Every parameter appearing in the objective must be defined earlier.
- Every symbol in every constraint must be defined earlier.
- Objective and constraints must together reflect the original task faithfully.
- Do not omit key structural constraints.
- Do not add assumptions that materially change the problem.

Here is some code example:

```text
<python>
import gurobipy as gp
from gurobipy import GRB

# Create model
# …… (here is core modeling code)

model.optimize()

status = model.status
if status == GRB.OPTIMAL:
    optimal = model.objVal
    print(f"Optimal value: {optimal}")
else:
    print(f"Model status: {status}")
</python>
```

Stage 4: Code

You are a professional optimization model implementer. Your task is to faithfully translate the committed formulation into executable Gurobi Python code.

Optimization problem: {task_description}

Committed Type and Sets: {type_sets_content}

Committed Parameters and Variables: {para_var_content}

Committed Objective and Constraints: {obj_con_content}

Instructions:

1. Think briefly inside <thought\>. Check that the committed formulation is internally consistent.
2. Output only the executable Gurobi Python implementation inside <python\>.
3. Translate the committed formulation faithfully. Do not redesign the model unless the committed formulation contains an obvious contradiction that would prevent execution.

Required output order: <thought\>…</thought\> <python\>…</python\>

Here is some code example:

```text
<python>
import gurobipy as gp
from gurobipy import GRB

# Create model
# …… (here is core modeling code)

model.optimize()

status = model.status
if status == GRB.OPTIMAL:
    optimal = model.objVal
    print(f"Optimal value: {optimal}")
else:
    print(f"Model status: {status}")
</python>
```
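The four stage templates chain together: the tagged blocks produced at one stage become the committed content passed to the next, and the `<python>` block is executed to obtain an objective value. A minimal sketch of that plumbing follows. The helper names and the `STAGE_TAGS` mapping are hypothetical; only the tag names and the `Optimal value:` output line are taken from the templates above.

```python
import re

# Tags that each stage is required to emit before its <python> rollout.
STAGE_TAGS = {
    1: ["Type", "Sets"],
    2: ["Parameters", "Variables"],
    3: ["Objective", "Constraints"],
    4: [],
}


def extract_tag(text, tag):
    """Return the stripped content of the first <tag>...</tag> block, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def commit_stage(output, stage):
    """Split one stage output into (committed formulation text, rollout code)."""
    parts = []
    for tag in STAGE_TAGS[stage]:
        content = extract_tag(output, tag)
        if content is None:
            raise ValueError(f"missing <{tag}> block in stage {stage} output")
        parts.append(f"<{tag}>\n{content}\n</{tag}>")
    return "\n".join(parts), extract_tag(output, "python")


def parse_objective(stdout):
    """Read the objective printed by the template's final print statement."""
    match = re.search(
        r"Optimal value:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)", stdout
    )
    return float(match.group(1)) if match else None


# Hypothetical stage-1 output in the required tag order.
stage1_output = (
    "<thought>Two worker types are scheduled.</thought>\n"
    "<Type>MILP, workforce scheduling</Type>\n"
    "<Sets>- s: Employee types: f,p</Sets>\n"
    "<python>import gurobipy as gp\nmodel = gp.Model()</python>"
)
committed, code = commit_stage(stage1_output, stage=1)
objective = parse_objective("Optimal value: 212.5\n")
```

Here `committed` is the text that would fill the "Committed Type and Sets" slot of the Stage 2 template, while `code` is the auto-completed rollout that gets executed for reward computation.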

### E.3 Repair and Completion Templates

When the final candidate fails to produce an objective value, StarOR invokes a lightweight repair prompt. If the code raises an execution error, we use the code-error repair prompt; if the model is infeasible, we use the infeasibility repair prompt.

Code Error Repair Prompt

You are an experienced operations research algorithm engineer. You are presented with an operations research problem and a previous attempt to model and code a solution. That attempt resulted in an error.

Problem Description: {task_description}

Previous Code Solution Attempt: <python\> {code_text} </python\>

After running the provided code from the previous attempt, the following error occurred: {error_info}

Your task: Based on the information above, please perform the following:

1. Analyze Root Cause and Identify Pitfalls
   - Thoroughly analyze the root cause of the error.
   - Summarize potential pitfalls or common mistakes related to this type of code error.
2. Provide Corrected Gurobi Code:
   - Write the complete and corrected Python code using the 'gurobipy' library to accurately solve the problem.

Please structure your response strictly as follows:

```text
## Cause of the Error and Potential Pitfalls:
<thought>
(Your detailed analysis of the error's cause and a summary of potential pitfalls.)
</thought>

## Corrected Gurobi Code:
<python>
import gurobipy as gp
from gurobipy import GRB

# Create model
# …… (here is core modeling code)

model.optimize()

status = model.status
if status == GRB.OPTIMAL:
    optimal = model.objVal
    print(f"Optimal value: {optimal}")
else:
    print(f"Model status: {status}")
</python>
```

Please think step by step.

Code Infeasible Repair Prompt

You are an experienced operations research algorithm engineer. You are presented with an operations research problem and a previous attempt to model and code a solution. That attempt resulted in an error.

Task description: {task_description}

Fixed model blocks (unchanged): {model_text}

Current code: <python\> {code_text} </python\>

Execution error: {execution_text}

You are an experienced operations research algorithm engineer. You are presented with an operations research problem and a previous attempt to model and code a solution. That attempt resulted in an infeasible solution.

Problem Description: {task_description}

Previous Model: {model_text}

Code Solution Attempt: <python\> {code_text} </python\>

After running the provided code from the previous attempt, the answer could not provide a feasible solution.

Your task: Based on the information above, please perform the following:

1. Analyze Root Cause and Identify Pitfalls
   - Thoroughly analyze the root cause of the infeasibility.
   - Summarize potential pitfalls or common mistakes related to this type of infeasibility.
2. Provide an Improved Mathematical Model:
   - Develop a mathematical model for correctly modeling this OR problem. This should address the flaws in the previous attempt.
3. Provide Corrected Gurobi Code:
   - Write the complete and corrected Python code associated with the mathematical model using the 'gurobipy' library to accurately solve the problem.

Please structure your response strictly as follows:

```text
## Cause of the Error and Potential Pitfalls:
<thought>
(Your detailed analysis of the error's cause and a summary of potential pitfalls.)
</thought>

## Corrected Mathematical Model:
<Type>
[Identify the problem class: LP, MILP, NLP, etc.]
</Type>

<Sets>
[Define all indices and sets with clear descriptions]
</Sets>

<Parameters>
[Define all constants and data structures, including units]
</Parameters>

<Variables>
[Define decision variables, their domains (Binary, Non-negative, etc.), and physical meanings]
</Variables>

<Objective>
[Mathematical expression of the objective function with Max/Min direction]
</Objective>

<Constraints>
[List all mathematical constraints. Ensure they are indexed correctly (e.g., $\forall i \in I$) and clearly explained]
</Constraints>

## Corrected Gurobi Code:
<python>
import gurobipy as gp
from gurobipy import GRB

# Create model
# …… (here is core modeling code)

model.optimize()

status = model.status
if status == GRB.OPTIMAL:
    optimal = model.objVal
    print(f"Optimal value: {optimal}")
else:
    print(f"Model status: {status}")
</python>
```

Note: Do not rewrite the model from scratch; instead, surgically patch the existing model and code by addressing valid feedback while critically filtering out any incorrect or redundant signals.

Please think step by step.
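The choice between the two repair prompts in this subsection depends only on how the final candidate failed. One possible dispatch is sketched below; the `Outcome` enum, the function names, and the returned template labels are hypothetical. The status codes 3 (`INFEASIBLE`) and 4 (`INF_OR_UNBD`) are Gurobi's documented numeric values, and the sketch assumes the candidate prints `Model status: <code>` as in the templates above.

```python
import re
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    CODE_ERROR = "code_error"
    INFEASIBLE = "infeasible"
    OTHER = "other"


# Gurobi status codes treated as infeasibility in this sketch.
INFEASIBLE_CODES = {3, 4}  # GRB.INFEASIBLE, GRB.INF_OR_UNBD


def classify_run(returncode, stdout):
    """Classify one execution of a generated solver program."""
    if returncode != 0:
        return Outcome.CODE_ERROR
    if "Optimal value:" in stdout:
        return Outcome.OK
    match = re.search(r"Model status:\s*(\d+)", stdout)
    if match and int(match.group(1)) in INFEASIBLE_CODES:
        return Outcome.INFEASIBLE
    return Outcome.OTHER


def select_repair_prompt(outcome):
    """Map a failure outcome to the label of the repair template to use, if any."""
    return {
        Outcome.CODE_ERROR: "code_error_repair",
        Outcome.INFEASIBLE: "infeasible_repair",
    }.get(outcome)


crashed = select_repair_prompt(classify_run(1, ""))
infeasible = select_repair_prompt(classify_run(0, "Model status: 3\n"))
solved = select_repair_prompt(classify_run(0, "Optimal value: 212.5\n"))
```

Outcomes that match neither case (for example, an unbounded model) return `None` here; how StarOR treats them is not described in this appendix.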

## Appendix F Licenses for Existing Assets

We use existing datasets, models, software frameworks, and solver assets only for academic research and benchmarking. Table [13](https://arxiv.org/html/2606.15197#A6.T13) summarizes the main assets used in our experiments, their original creators or maintainers, and the license or terms that we identified from the public release pages. Regarding benchmark preparation, we follow the data cleaning and standardization protocols established by Chen et al. [[2025](https://arxiv.org/html/2606.15197#bib.bib4)].

Table 13: Existing assets used in StarOR. "Terms" denotes a non-open-source usage agreement rather than an OSI-style license.

| Asset | Creator / Maintainer | License / Terms | Use in this paper |
| --- | --- | --- | --- |
| Qwen3-4B-Instruct-2507 | Qwen team / Alibaba Cloud | Apache-2.0 | Backbone model for all StarOR and same-backbone test-time baselines. Model card: [https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507). |
| verl | Volcano Engine / ByteDance Seed and community | Apache-2.0 | RL training and rollout infrastructure adapted for our TTRL-OR runtime. Repository: [https://github.com/volcengine/verl](https://github.com/volcengine/verl). |
| Gurobi Optimizer | Gurobi Optimization, LLC | Proprietary Gurobi license; academic licenses are available for non-commercial academic research | Solver backend for executing generated Python optimization models. We use Gurobi under academic/research terms and cite the Gurobi reference manual. |
| NL4OPT | NL4OPT organizers; converted optimal-answer version by Cardinal Operations / ORLM | Official competition repository: MIT; converted Hugging Face dataset card: CC BY-NC 4.0 | Evaluation benchmark for natural-language LP modeling. We cite the NL4OPT competition and use the converted/checked benchmark only for non-commercial academic evaluation. |
| MAMO-Easy and MAMO-Complex | MAMO authors; public mirror by Cardinal Operations | Public Hugging Face dataset card: CC BY-NC 4.0 | Evaluation benchmarks for easy and complex LP/MILP modeling. We use cleaned versions that preserve source attribution and benchmark identity. |
| IndustryOR | ORLM / Cardinal Operations | Public Hugging Face dataset card: CC BY-NC 4.0; ORLM code repository: Apache-2.0 | Industrial OR evaluation benchmark. We use the 100-instance benchmark for non-commercial research evaluation. |
| OptMATH-Bench | OptMATH authors | Publicly released for research; no separate dataset license was identified on the project page at the time of writing | Evaluation benchmark for difficult optimization modeling. We cite the OptMATH paper/project page and use only the released benchmark instances for academic evaluation. |

##### External services and closed models.

Some baselines report results from proprietary or externally hosted models such as GPT\-4, DeepSeek\-V3\.1, and DeepSeek\-R1\. For these systems, we cite the corresponding technical reports or model papers and report results either from prior work or from API\-based evaluation under the providers’ terms of service\. These models are not redistributed with our artifacts\.

## Appendix G Broader Impacts

StarOR aims to make optimization modeling more accessible by reducing the effort required to translate natural\-language decision problems into executable solver programs\. This can benefit domains such as logistics, manufacturing, scheduling, energy planning, healthcare operations, and public\-sector resource allocation, where better optimization models can improve utilization, reduce waste, and support more transparent decision\-making\. The framework is especially useful when problem descriptions vary across instances and collecting large supervised datasets for every domain is impractical\.

At the same time, automated optimization modeling carries operational and societal risks\. A generated formulation may be executable but semantically misaligned with the intended decision problem, leading to recommendations that violate constraints, omit stakeholder preferences, or optimize an incomplete objective\. These risks are more serious in high\-stakes settings such as healthcare scheduling, disaster response, infrastructure planning, and workforce allocation\. StarOR reduces this risk through staged formulation, solver execution, structural checks, perturbation tests, and objective\-consensus rewards, but these mechanisms are safeguards rather than correctness guarantees\.

The method also increases test\-time computation\. Additional compute can improve reliability, but it has economic and environmental costs\. Our results suggest that this compute is most valuable for difficult instances where one\-shot decoding is unreliable; practical deployments should therefore use difficulty\-aware budgets, early stopping, and smaller expansion groups for easy instances\. We recommend using StarOR as a decision\-support assistant: generated formulations, solver traces, reward diagnostics, and final\-selection evidence should be logged and reviewed by domain experts before decisions are enacted\.

## Appendix H Limitations

StarOR incurs higher test-time cost than one-shot decoding and Best-of-$N$ with the same number of raw samples, since it performs staged search, repeated code execution, reward computation, and transient LoRA updates. However, in real industrial optimization scenarios, especially high-complexity and high-value applications, reliability is often more important than real-time response. For many offline optimization-modeling settings, this additional cost is therefore acceptable when it leads to more faithful and robust formulations.

Another limitation is that StarOR relies on unsupervised reward signals, particularly objective\-scale priors and synthetic robustness tests\. These signals are useful for guiding search without ground\-truth labels, but their quality depends on the reasoning ability of the backbone model\. For weaker small models, inaccurate objective\-scale estimation or flawed perturbation analysis may introduce noisy feedback and mislead policy adaptation\. This suggests that StarOR still requires a reasonably capable base model\. Future work should design stronger reward and feedback mechanisms that provide more fine\-grained and reliable supervision, thereby better guiding both policy evolution and tree search during test\-time optimization\.

## NeurIPS Paper Checklist

1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
   - Answer: [Yes]
   - Justification: The main claims in the abstract and introduction (the integration of stage-wise MCTS with node-level test-time RL, the proposed multi-faceted reward design, and the strong empirical performance across five OR modeling benchmarks) are supported by the method and experimental results in Sections 3 and 4, including the main comparisons and ablations in Tables 2–5.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: [Yes]
   - Justification: Appendix [H](https://arxiv.org/html/2606.15197#A8) discusses the main limitations, including StarOR's higher test-time cost from staged search, repeated code execution, reward computation, and transient LoRA updates, as well as its dependence on unsupervised reward signals such as objective-scale priors and synthetic robustness tests, whose quality depends on the reasoning ability of the backbone model.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   - Answer: [N/A]
   - Justification: The paper does not present formal theoretical results such as theorems, propositions, or proofs; it is primarily a method-and-experiments paper.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
   - Answer: [Yes]
   - Justification: The paper specifies the evaluated benchmarks, baselines, backbone model, evaluation metric, and the full StarOR pipeline in the main text, and the supplemental material provides implementation details including hyperparameters, reward design, computation-cost analysis, and prompt templates needed to reproduce the main results.
21. 5\. Open access to data and code
22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
23. Answer: \[No\]
24. Justification: At submission time, we provide anonymized key implementation code in the supplemental material, including the core StarOR search\-and\-adaptation components, reward computation, GRPO update logic, and representative run scripts needed to inspect the main algorithmic implementation\. The full repository, with complete instructions for reproducing the experiments, will be publicly released after the review period and de\-anonymization\.
25. Guidelines:
    - The answer \[N/A\] means that the paper does not include experiments requiring code\.
    - While we encourage the release of code and data, we understand that this might not be possible, so \[No\] is an acceptable answer\. Papers cannot be rejected simply for not including code, unless this is central to the contribution \(e\.g\., for a new open\-source benchmark\)\.
    - The instructions should contain the exact command and environment needed to run to reproduce the results\. See the NeurIPS code and data submission guidelines \([https://neurips\.cc/public/guides/CodeSubmissionPolicy](https://neurips.cc/public/guides/CodeSubmissionPolicy)\) for more details\.
    - The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc\.
    - The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines\. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why\.
    - At submission time, to preserve anonymity, the authors should release anonymized versions \(if applicable\)\.
    - Providing as much information as possible in supplemental material \(appended to the paper\) is recommended, but including URLs to data and code is permitted\.
26. 6\. Experimental setting/details
27. Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results?
28. Answer: \[Yes\]
29. Justification: The paper specifies the benchmark suite, evaluation metric, backbone model, and test\-time comparison setup in the main text, while detailed hyperparameters, reward\-system details, and prompt templates are provided in the supplemental material\.
30. Guidelines:
    - The answer \[N/A\] means that the paper does not include experiments\.
    - The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them\.
    - The full details can be provided either with the code, in appendix, or as supplemental material\.
31. 7\. Experiment statistical significance
32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
33. Answer: \[No\]
34. Justification: The current submission reports single\-run accuracy numbers without error bars or formal significance tests\. We acknowledge that the test\-time search and adaptation procedure may introduce run\-to\-run variance, and reporting repeated\-run statistics would strengthen the empirical evaluation\.
35. Guidelines:
    - The answer \[N/A\] means that the paper does not include experiments\.
    - The authors should answer \[Yes\] if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper\.
    - The factors of variability that the error bars are capturing should be clearly stated \(for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions\)\.
    - The method for calculating the error bars should be explained \(closed form formula, call to a library function, bootstrap, etc\.\)\.
    - The assumptions made should be given \(e\.g\., Normally distributed errors\)\.
    - It should be clear whether the error bar is the standard deviation or the standard error of the mean\.
    - It is OK to report 1\-sigma error bars, but one should state it\. The authors should preferably report a 2\-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified\.
    - For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range \(e\.g\., negative error rates\)\.
    - If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text\.
36. 8\. Experiments compute resources
37. Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments?
38. Answer: \[Yes\]
39. Justification: Appendix [D](https://arxiv.org/html/2606.15197#A4) and Appendix [C](https://arxiv.org/html/2606.15197#A3) report the hardware setup, GPU memory, solver timeout, per\-sample runtime decomposition, dataset\-level runtime statistics, and estimated compute required for the main experiments\.
40. Guidelines:
    - The answer \[N/A\] means that the paper does not include experiments\.
    - The paper should indicate the type of compute workers \(CPU or GPU, internal cluster, or cloud provider\), including relevant memory and storage\.
    - The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute\.
    - The paper should disclose whether the full research project required more compute than the experiments reported in the paper \(e\.g\., preliminary or failed experiments that didn’t make it into the paper\)\.
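41. 9\. Code of ethics
42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
43. Answer: \[Yes\]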
44. Justification: To the best of our knowledge, the research conforms to the NeurIPS Code of Ethics\. The work focuses on benchmark\-based evaluation of optimization\-modeling methods and does not involve human\-subject experimentation or other procedures that would raise special ethical concerns\.
45. Guidelines:
    - The answer \[N/A\] means that the authors have not reviewed the NeurIPS Code of Ethics\.
    - If the authors answer \[No\], they should explain the special circumstances that require a deviation from the Code of Ethics\.
    - The authors should make sure to preserve anonymity \(e\.g\., if there is a special consideration due to laws or regulations in their jurisdiction\)\.
46. 10\. Broader impacts
47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
48. Answer: \[Yes\]
49. Justification: Appendix [G](https://arxiv.org/html/2606.15197#A7) discusses positive impacts on accessible optimization modeling as well as risks from semantically incorrect formulations, high\-stakes deployment, increased compute, and the need for expert review\.
50. Guidelines:
    - The answer \[N/A\] means that there is no societal impact of the work performed\.
    - If the authors answer \[N/A\] or \[No\], they should explain why their work has no societal impact or why the paper does not address societal impact\.
    - Examples of negative societal impacts include potential malicious or unintended uses \(e\.g\., disinformation, generating fake profiles, surveillance\), fairness considerations \(e\.g\., deployment of technologies that could make decisions that unfairly impact specific groups\), privacy considerations, and security considerations\.
    - The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments\. However, if there is a direct path to any negative applications, the authors should point it out\. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation\. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster\.
    - The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from \(intentional or unintentional\) misuse of the technology\.
    - If there are negative societal impacts, the authors could also discuss possible mitigation strategies \(e\.g\., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML\)\.
51. 11\. Safeguards
52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)?
53. Answer: \[N/A\]
54. Justification: The paper does not release high\-risk generative models or scraped datasets that would require special safeguards beyond standard research disclosure\.
55. Guidelines:
    - The answer \[N/A\] means that the paper poses no such risks\.
    - Released models that have a high risk for misuse or dual\-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters\.
    - Datasets that have been scraped from the Internet could pose safety risks\. The authors should describe how they avoided releasing unsafe images\.
    - We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort\.
56. 12\. Licenses for existing assets
57. Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
58. Answer: \[Yes\]
59. Justification: We credit and cite the original datasets, models, solver, and software frameworks used in the paper\. Appendix [F](https://arxiv.org/html/2606.15197#A6) summarizes the creators, public URLs, and licenses or usage terms for Qwen3\-4B, verl, Gurobi, NL4OPT, MAMO, IndustryOR, and OptMATH\-Bench, and states how corrected benchmark files are treated as derived assets\.
60. Guidelines:
    - The answer \[N/A\] means that the paper does not use existing assets\.
    - The authors should cite the original paper that produced the code package or dataset\.
    - The authors should state which version of the asset is used and, if possible, include a URL\.
    - The name of the license \(e\.g\., CC\-BY 4\.0\) should be included for each asset\.
    - For scraped data from a particular source \(e\.g\., website\), the copyright and terms of service of that source should be provided\.
    - If assets are released, the license, copyright information, and terms of use in the package should be provided\. For popular datasets, [paperswithcode\.com/datasets](https://paperswithcode.com/datasets) has curated licenses for some datasets\. Their licensing guide can help determine the license of a dataset\.
    - For existing datasets that are re\-packaged, both the original license and the license of the derived asset \(if it has changed\) should be provided\.
    - If this information is not available online, the authors are encouraged to reach out to the asset’s creators\.
61. 13\. New assets
62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
63. Answer: \[Yes\]
64. Justification: The supplemental material includes anonymized key implementation code for the proposed method, together with representative run scripts and implementation descriptions\. The paper does not introduce new datasets or model checkpoints; existing benchmarks, models, and software assets are credited separately in Appendix [F](https://arxiv.org/html/2606.15197#A6)\.
65. Guidelines:
    - The answer \[N/A\] means that the paper does not release new assets\.
    - Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates\. This includes details about training, license, limitations, etc\.
    - The paper should discuss whether and how consent was obtained from people whose asset is used\.
    - At submission time, remember to anonymize your assets \(if applicable\)\. You can either create an anonymized URL or include an anonymized zip file\.
66. 14\. Crowdsourcing and research with human subjects
67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)?
68. Answer: \[N/A\]
69. Justification: The paper does not involve crowdsourcing or research with human subjects\.
70. Guidelines:
    - The answer \[N/A\] means that the paper does not involve crowdsourcing nor research with human subjects\.
    - Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper\.
    - According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector\.
71. 15\. Institutional review board \(IRB\) approvals or equivalent for research with human subjects
72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
73. Answer: \[N/A\]
74. Justification: The paper does not involve human\-subject research, so IRB approval or an equivalent review was not required\.
75. Guidelines:
    - The answer \[N/A\] means that the paper does not involve crowdsourcing nor research with human subjects\.
    - Depending on the country in which research is conducted, IRB approval \(or equivalent\) may be required for any human subjects research\. If you obtained IRB approval, you should clearly state this in the paper\.
    - We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution\.
    - For initial submissions, do not include any information that would break anonymity \(if applicable\), such as the institution conducting the review\.
76. 16\. Declaration of LLM usage
77. Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required\.
78. Answer: \[Yes\]
79. Justification: LLM usage is a core methodological component of this work\. The paper explicitly describes the backbone model, the stage\-wise generation process, the transient LoRA\-based test\-time adaptation, and the role of the LLM within search and policy optimization\.
80. Guidelines:
    - The answer \[N/A\] means that the core method development in this research does not involve LLMs as any important, original, or non\-standard components\.
    - Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described\.
