Models Can Model, But Can't Bind: Structured Grounding in Text-to-Optimization
Summary
This paper introduces Text2Opt-Bench, a scalable benchmark for text-to-optimization, and identifies that LLMs struggle with 'binding' (grounding problem data) rather than 'modeling' (choosing optimization structure). The authors propose BIND, a simple inference-time method that externalizes numeric data, significantly improving accuracy across models.
Cached at: 05/22/26, 08:51 AM
# Models Can Model, But Can’t Bind: Structured Grounding in Text-to-Optimization
Source: [https://arxiv.org/html/2605.21751](https://arxiv.org/html/2605.21751)
Albert Ge (University of Wisconsin-Madison), Alexander Berenbeim (United States Military Academy), Nathaniel D. Bastian (United States Military Academy), Frederic Sala (University of Wisconsin-Madison)
## Abstract
*Text-to-optimization* requires two separable capabilities: *modeling* (choosing the right optimization structure) and *binding* (grounding every coefficient, index, and parameter in the concrete problem data). We study this via Text2Opt-Bench, a scalable benchmark of solver-verified optimization problems spanning 12 categories, from textbook linear programs to stochastic and multi-objective formulations with up to thousands of variables. Across 10+ models, we find that accuracy collapses as instance data grows, even when the formulation itself is simple. We call the data scale beyond which this collapse occurs the *effective binding limit*. We address this via a simple inference-time approach, BIND, which externalizes numeric data to structured files so the model binds data programmatically rather than transcribing from the prompt. BIND improves GPT-5-Nano from 59.1% to 82.4% accuracy, matching pass@5 (82.0%) at lower token cost than pass@1, and GPT-5 from 86.2% to 95.8%. Furthermore, we validate our hypothesis by finetuning a model exclusively on binding and show that it outperforms end-to-end SFT and RL across three structurally distinct optimization categories, with a 1.5B binding specialist alone matching a 7B end-to-end baseline.
Corresponding author: Zhiqi Gao (zhiqi@cs.wisc.edu). Equal contribution.
## 1 Introduction
Operations research (OR) is central to industrial decision-making in logistics, energy, and supply chains. Solving OR tasks from natural language with LLMs (performing text-to-optimization) requires two distinct abilities: (1) modeling, i.e., selecting the correct optimization model and structure, and (2) binding, i.e., grounding variables, constraints, coefficients, and other problem parameters to the given data. The first capability requires *reasoning* skills, an area where models have recently made significant progress. The second, however, remains challenging. We argue that current text-to-optimization systems are primarily bottlenecked by binding rather than modeling.
To test this hypothesis, we turn to benchmarks that measure text-to-optimization capabilities. Existing benchmarks (Ramamonjison et al., [2022](https://arxiv.org/html/2605.21751#bib.bib18), Mostajabdaveh et al., [2025](https://arxiv.org/html/2605.21751#bib.bib17), Wang et al., [2024](https://arxiv.org/html/2605.21751#bib.bib26), Huang et al., [2025a](https://arxiv.org/html/2605.21751#bib.bib10)) address textbook problem scale: small, deterministic, single-objective programs in which every constraint is explicitly stated. Real-world OR involves uncertainty, competing objectives, and domain knowledge that is used to induce constraints. These features are absent from existing benchmarks.
We address these challenges via Text2Opt-Bench, a scalable benchmark of verified optimization problems spanning 12 problem categories covering linear programs (LP), mixed-integer linear programs (MILP), mixed-integer quadratic programs (MIQP), and nonlinear formulations, including stochastic programs with chance constraints, multi-objective formulations with competing cost and emissions targets, and problems requiring domain-specific constraint derivation (Ohm's law, Erlang-C queuing). Our benchmark is built via a *forward-engineering* pipeline: we first construct a solver-verified optimization problem, then generate a natural language description grounded in the problem's underlying scenario parameters. This decouples linguistic generation from mathematical structure, ensuring that each problem instance is feasible by construction and that evaluation failures can be unambiguously attributed to the model rather than to benchmark artifacts.
Figure 1: Solution accuracy vs. combined token cost across three model families (550 template problems). BIND significantly improves pass@1 accuracy, and remains competitive with other test-time-compute strategies while using significantly fewer tokens. We compare against oracle feedback, representing an upper bound on iterative refinement, and pass@5 as an upper bound on parallel sampling.

Using this benchmark, we evaluate 10+ models from the OpenAI, Claude, DeepSeek, Llama, and Qwen families and report three main findings:
(1) For frontier models, binding is the primary bottleneck. GPT-5-Nano's accuracy drops from 72% to 11% as instance data grows, even when the formulation is unchanged. Closed-source frontier models are closely matched at 86–88% overall, while reasoning models (o4-mini, DeepSeek-R1) fail to surpass standard models, suggesting that explicit reasoning does not address binding failures. The same accuracy cliff appears on non-OR RULER retrieval tasks (§[4.2](https://arxiv.org/html/2605.21751#S4.SS2)).
(2) Binding-aware inference substantially improves performance. We introduce BIND, which externalizes numeric data to structured files so the model binds programmatically. BIND improves the performance of GPT-5-Nano from 59.1% to 82.4%, matching pass@5 (82.0%) at the lowest token cost, and GPT-5 from 86.2% to 95.8%, with the largest gains on data-heavy categories (+56pp for GPT-5-Nano on stochastic transportation problems).
(3) Training binding-specific models is most effective. We turn from inference-only approaches to training. Surprisingly, we find that supervised finetuning (SFT) outperforms reinforcement learning (RL) at 7B scale. This is consistent with binding as the bottleneck: SFT provides dense supervision of coefficient transcription while RL's sparse reward struggles to distinguish between a wrong formulation and a wrong parameter. Motivated by this observation, we show that training a 7B binding specialist outperforms end-to-end SFT across three structurally distinct categories: 58.1% vs. 51.2% (resource allocation), 100% vs. 96% (job-shop scheduling), and 96% vs. 88% (transportation).
In summary, our primary contributions are \(1\) Text2Opt\-Bench, a scalable, solver\-verified benchmark of 12 problem categories \(LP/MILP/MIQP/nonlinear, up to 1,000\+ variables\); \(2\) a binding bottleneck analysis showing that instance binding is the primary failure mode, confirmed via RULER retrieval experiments; \(3\) BIND, a binding\-aware inference method that outperforms both iterative repair and parallel sampling at lower cost; and \(4\) a demonstration that decomposing training by binding yields stronger and more parameter\-efficient models than end\-to\-end SFT or RL\.
## 2 Related Work
We briefly detail relevant related work\.
Text-to-Optimization. There is ongoing work to develop benchmarks and methods for solving optimization problems from natural language. On the benchmark side, NL4Opt (Ramamonjison et al., [2022](https://arxiv.org/html/2605.21751#bib.bib18)) treats optimization as entity extraction on small LPs. OptiBench (Wang et al., [2024](https://arxiv.org/html/2605.21751#bib.bib26)), ORLM (Huang et al., [2025a](https://arxiv.org/html/2605.21751#bib.bib10)), MAMO (Huang et al., [2025b](https://arxiv.org/html/2605.21751#bib.bib11)), and OptMATH (Lu et al., [2025](https://arxiv.org/html/2605.21751#bib.bib16)) offer solver-verified instances but at textbook problem scale. More recent efforts (OPT-Engine (Chen et al., [2026](https://arxiv.org/html/2605.21751#bib.bib6)), ProOPF (Shen et al., [2026](https://arxiv.org/html/2605.21751#bib.bib21)), ConstraintBench (Tso et al., [2026](https://arxiv.org/html/2605.21751#bib.bib25)), ORQA (Mostajabdaveh et al., [2025](https://arxiv.org/html/2605.21751#bib.bib17)), NLMOptimizer (Berenbeim et al., [2025](https://arxiv.org/html/2605.21751#bib.bib2))) expand problem types and scale. Table [1](https://arxiv.org/html/2605.21751#S2.T1) compares these benchmarks. Our benchmark, Text2Opt-Bench, offers controllable difficulty, scalability up to 1,000+ variables, and industrially-motivated formulations.
On the methods side, OptiMUS (AhmadiTeshnizi et al., [2024](https://arxiv.org/html/2605.21751#bib.bib1)) and Chain-of-Experts (Xiao et al., [2024](https://arxiv.org/html/2605.21751#bib.bib27)) use modular decomposition; LLMOPT (Jiang et al., [2025](https://arxiv.org/html/2605.21751#bib.bib12)) learns to define problems end-to-end. OR-LLM-Agent (Zhang et al., [2025](https://arxiv.org/html/2605.21751#bib.bib30)) decomposes tasks into modeling, coding, and debugging. For a survey, see Xiao et al. ([2025](https://arxiv.org/html/2605.21751#bib.bib28)).
Table 1: Comparison with existing OR benchmarks.

| Benchmark | Problems | Verified | Max Vars | Types | Adv. Form. |
|---|---|---|---|---|---|
| NL4Opt | 1,101 | × | 5 | LP | × |
| OptiBench | 605 | | 50 | Mixed | × |
| ORLM | 100 | | 10 | LP/MILP/NLP | × |
| MAMO | 1,209 | | 50 | LP/MILP/ODE | × |
| OPT-Engine | 1,810 | | 40 | LP/MIP | × |
| Ours | scalable | | 1,000+ | LP/MILP/MIQP/NLP | |

Synthetic Data Generation. Verifiable synthetic data has proven valuable for reasoning (Liu et al., [2025](https://arxiv.org/html/2605.21751#bib.bib14), Goldie et al., [2025](https://arxiv.org/html/2605.21751#bib.bib8), Seegmiller et al., [2025](https://arxiv.org/html/2605.21751#bib.bib19)); our forward-engineering pipeline differs from back-translation approaches (e.g., OptMATH) by jointly generating descriptions and OR structures from simulated world states.
Data Externalization and Programmatic Access. A growing body of work offloads context from the prompt to external environments that the model accesses programmatically. PAL (Gao et al., [2023](https://arxiv.org/html/2605.21751#bib.bib7)) and Program of Thoughts (Chen et al., [2023a](https://arxiv.org/html/2605.21751#bib.bib4)) generate code rather than performing computation in-context; Recursive Language Models (Zhang et al., [2026](https://arxiv.org/html/2605.21751#bib.bib29)) generalize this by treating the entire prompt as an external environment the model can recursively query. These approaches address computational or context-length limitations. BIND targets a different bottleneck, the faithful transcription of numerical data, by externalizing instance data to structured files before loading into the context.
Long-Context Retrieval. Liu et al. ([2023](https://arxiv.org/html/2605.21751#bib.bib15)) show that LLMs struggle to retrieve from mid-context; RULER (Hsieh et al., [2024](https://arxiv.org/html/2605.21751#bib.bib9)) measures retrieval degradation using controlled tasks. Our experiments (§[4.2](https://arxiv.org/html/2605.21751#S4.SS2)) show that this retrieval degradation also explains binding failures in text-to-optimization, with multi-parameter retrieval exhibiting sharp accuracy cliffs as extraction failures compound.
## 3 Text2Opt-Bench: Design and Evaluation
Problem Excerpt (from LLM input)

“Catalyst Grade A (whole-batch only): Each full batch contributes 7.12 in margin. Every batch requires 1.85 hours of reactor time and 5.99 hours on the packaging line. We may run 0 to 6 full batches; partial batches are not possible. Solvent Blend B (flexible run-size): Each unit contributes 5.69 in margin… Bulk Intermediate C (whole-load only): Each full load contributes 4.84 in margin… Reactor availability: up to 69.83 hours. Packaging line: up to 20.61 hours. Maximize total contribution.”

Modeling (structural understanding)

- “maximize total contribution” → max. objective
- “whole-batch only” → integer variable
- “fractional quantities” → continuous variable
- Two shared resources → 2 constraints
- “does not require packaging” → $A_{1,2}=0$

*Requires reasoning; no specific numbers.*

Binding (numeric extraction)

| Prose | Param. | Val. |
|---|---|---|
| “contributes 7.12” | $c_{0}$ (obj.) | 7.12 |
| “1.85 h reactor” | $A_{0,0}$ | 1.85 |
| “5.99 h packaging” | $A_{1,0}$ | 5.99 |
| “0 to 6 batches” | bounds $x_{0}$ | $[0,6]$ |
| “up to 69.83 h” | $b_{0}$ (RHS) | 69.83 |

…and 9 more values.

*Requires faithful transcription; errors compound at scale.*

Figure 2: Modeling vs. binding on a resource allocation instance. Modeling selects the optimization structure (objective type, variable domains, constraints); binding extracts every numerical coefficient from prose. As instances scale, binding becomes the dominant failure mode.

Solving an optimization problem from natural language requires choosing the right mathematical structure and grounding that structure in the problem's numerical data. We formalize this decomposition first as it directly informs our benchmark design. Each problem category and evaluation mode is constructed to isolate one capability or the other.
### 3.1 Problem Definition
We define text-to-optimization as the task of producing executable solver code from a natural language description $D$. The description specifies both the problem's structure (what to optimize, under what constraints) and its instance data (the numerical coefficients, bounds, demands, and parameters). A correct solution requires two separable capabilities, as illustrated in Figure [2](https://arxiv.org/html/2605.21751#S3.F2).
- Modeling $\mathcal{M}: (D^{*}, \theta) \to S$: given a problem description $D^{*}$ and parameters $\theta$, select the objective, constraints, and variable domains to produce executable solver code $S$.
- Binding $\mathcal{B}: D \to \theta$: given a natural language description $D$, extract concrete parameters $\theta$ (cost coefficients, capacity limits, demand values, etc.).
An end-to-end approach performs both steps simultaneously: a single model maps $D$ directly to $S$, implicitly binding parameters while constructing the formulation (here $D^{*}=D$). A decomposed approach separates them: first extract $\theta=\mathcal{B}(D)$, then produce $S=\mathcal{M}(D^{*},\theta)$, where $D^{*}$ may be $D$ itself or a structured representation.
Regardless of approach, these capabilities scale differently. Modeling difficulty depends on the structural complexity of the problem and is independent of instance scale. The same structure must be selected regardless of the cardinality of $\theta$ (e.g., a transportation LP requires the same formulation whether it has 5 or 500 supply nodes). Binding difficulty grows with instance scale, as each additional coefficient is an opportunity for transcription error. They are also empirically separable: varying instance scale at fixed structure isolates binding (§[4](https://arxiv.org/html/2605.21751#S4)); externalizing data isolates modeling (§[3.3](https://arxiv.org/html/2605.21751#S3.SS3)).
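The decomposition above can be made concrete with a small sketch. This is an illustration only, not the authors' implementation: the `Theta` container, the regex-based `bind`, and the string-emitting `model` are hypothetical names, the sentence pattern is fixed, the structure (standing in for $D^{*}$) is hard-coded, and the 2.40 hours value is invented for the example. The point is that `model` never reads a number from prose, while `bind` never makes a structural choice.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Theta:
    """Bound instance data (theta): numbers only, no structural choices."""
    margin: List[float]   # objective coefficient per product
    hours: List[float]    # reactor hours per unit of each product
    capacity: float       # reactor hours available


def bind(description: str) -> Theta:
    """B: D -> theta. Toy binder for one fixed sentence pattern (assumption)."""
    pairs = re.findall(
        r"contributes ([\d.]+) in margin and needs ([\d.]+) hours", description
    )
    capacity = re.search(r"up to ([\d.]+) hours", description)
    return Theta(
        margin=[float(m) for m, _ in pairs],
        hours=[float(h) for _, h in pairs],
        capacity=float(capacity.group(1)),
    )


def model(theta: Theta) -> str:
    """M: (D*, theta) -> S. Emits a formulation string; the structure is the
    same for any number of products, only theta grows."""
    n = len(theta.margin)
    objective = " + ".join(f"{theta.margin[i]}*x{i}" for i in range(n))
    usage = " + ".join(f"{theta.hours[i]}*x{i}" for i in range(n))
    return f"maximize {objective}\nsubject to {usage} <= {theta.capacity}\nx >= 0"


# Toy description; the 2.40 figure is made up for illustration.
D = (
    "Grade A contributes 7.12 in margin and needs 1.85 hours. "
    "Blend B contributes 5.69 in margin and needs 2.40 hours. "
    "Reactor availability: up to 69.83 hours."
)
theta = bind(D)    # decomposed approach: first theta = B(D) ...
S = model(theta)   # ... then S = M(D*, theta)
print(S)
```

Adding a third product changes only the length of the lists in `theta`; `model` is untouched, which mirrors the claim that modeling difficulty is independent of instance scale.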
### 3.2 Dataset Creation
Figure 3: Text2Opt-Bench generation pipeline. Problems are constructed via forward engineering with solver verification, then described in natural language. Template-based insertion decouples linguistic complexity from data scale.

Equipped with these definitions, we seek to build a dataset able to test models' abilities to handle modeling and binding. Rather than constructing constraints around a known solution (*backward* engineering, as in OptMATH (Lu et al., [2025](https://arxiv.org/html/2605.21751#bib.bib16))), we use a *forward-engineering* framework (Figure [3](https://arxiv.org/html/2605.21751#S3.F3)): (1) simulate a world state: business parameters, resource limits, and logical rules; (2) derive the optimization structure and solve with an optimization solver (in this paper, we use Gurobi, a standard solver package; this choice is consistent with prior work Lu et al. ([2025](https://arxiv.org/html/2605.21751#bib.bib16)), Berenbeim et al. ([2025](https://arxiv.org/html/2605.21751#bib.bib2))); (3) generate a natural language description grounded in the world state. This guarantees feasibility by construction and produces semantically realistic narratives. We adopt two complementary generation strategies: direct translation and template-based insertion.
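The three forward-engineering steps can be sketched at toy scale as follows. This is a hedged illustration rather than the paper's pipeline: all function names and parameter ranges are hypothetical, brute-force enumeration stands in for the Gurobi solve, and an f-string stands in for the LLM-written description.

```python
import itertools
import random


def simulate_world_state(seed: int) -> dict:
    """Step 1: draw scenario parameters (margins, hours, batch limits, capacity)."""
    rng = random.Random(seed)
    n = 3
    return {
        "margin": [round(rng.uniform(1.0, 10.0), 2) for _ in range(n)],
        "hours": [round(rng.uniform(1.0, 5.0), 2) for _ in range(n)],
        "max_batches": [rng.randint(2, 6) for _ in range(n)],
        "capacity": round(rng.uniform(20.0, 40.0), 2),
    }


def solve(state: dict):
    """Step 2: derive the integer program and solve it by enumeration.
    Brute force stands in for a real solver and only works at toy scale."""
    best_value, best_plan = float("-inf"), None
    for plan in itertools.product(*(range(u + 1) for u in state["max_batches"])):
        used = sum(h * x for h, x in zip(state["hours"], plan))
        if used <= state["capacity"]:
            value = sum(m * x for m, x in zip(state["margin"], plan))
            if value > best_value:
                best_value, best_plan = value, plan
    return best_value, best_plan


def describe(state: dict) -> str:
    """Step 3: render a description grounded in the same world state."""
    rows = zip(state["margin"], state["hours"], state["max_batches"])
    lines = [
        f"Product {i}: each batch contributes {m} in margin and needs {h} "
        f"reactor hours; run 0 to {u} full batches."
        for i, (m, h, u) in enumerate(rows)
    ]
    lines.append(
        f"Reactor availability: up to {state['capacity']} hours. Maximize total margin."
    )
    return "\n".join(lines)


state = simulate_world_state(seed=0)
ground_truth, plan = solve(state)   # the all-zero plan is always feasible
prompt = describe(state)
```

Because the description is rendered from the same `state` that was solved, the ground-truth objective and the prose cannot drift apart, which is the property the forward direction is meant to guarantee.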
Direct Translation. An LLM weaves all numerical coefficients directly into natural language prose. We use this for developing resource allocation problems (LP/MILP, 2–20 variables), where the formulation requires minimal OR expertise but high faithfulness to the constraint values. Because the model must extract every coefficient from unstructured text, this category isolates binding difficulty from modeling difficulty (see Appendix [A.3](https://arxiv.org/html/2605.21751#A1.SS3) for an example). To confirm the binding difficulty of the constructed dataset, we analyze the failure modes of 9 models. Across all capable models, 60.4–92.3% of resource allocation failures produce correct variable and constraint counts but wrong objective values; structural errors are near-zero (full results in Appendix [B](https://arxiv.org/html/2605.21751#A2)).
Template-Based Insertion. For structured problems requiring domain-specific modeling (for example, in scheduling, routing, and facility design), embedding all data in prose would exceed the model's effective binding capacity. Instead, we decouple language from data:
1. Generate & verify: Create domain-specific parameters and solve with Gurobi.
2. Template: LLMs generate natural language descriptions from data *schema* only (dimensions, field names, no numeric values), with placeholders for data tables.
3. Insert: Placeholders are filled deterministically with pipe-separated numerical data, enabling natural-language problem descriptions that scale to 1000+ variables.
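The insertion step above can be sketched as plain string substitution. The `{{NAME}}` placeholder syntax, the helper names, and the example rows are assumptions made for illustration; the source only states that placeholders are filled deterministically with pipe-separated data.

```python
def to_pipe_table(header, rows):
    """Serialize rows as pipe-separated lines, one record per line."""
    lines = ["|".join(header)]
    lines.extend("|".join(str(v) for v in row) for row in rows)
    return "\n".join(lines)


def fill_template(template: str, tables: dict) -> str:
    """Step 3 (Insert): replace each placeholder with its data table."""
    for name, table in tables.items():
        placeholder = "{{" + name + "}}"
        if placeholder not in template:
            raise KeyError(f"template has no placeholder {placeholder}")
        template = template.replace(placeholder, table)
    return template


# Step 2 output (hand-written stand-in): prose written from the schema only,
# so it contains no numeric values.
template = (
    "A distributor ships goods from warehouses to stores. "
    "Each route has a per-unit shipping cost:\n{{COST_TABLE}}\n"
    "Minimize total shipping cost while meeting every store's demand."
)

# Step 1 output (stand-in): instance data that would already be solver-verified.
cost_rows = [("W0", "S0", 4.5), ("W0", "S1", 2.0), ("W1", "S0", 1.2), ("W1", "S1", 6.3)]
prompt = fill_template(
    template, {"COST_TABLE": to_pipe_table(("from", "to", "cost"), cost_rows)}
)
print(prompt)
```

Since no model touches the numbers during insertion, the prose stays fixed while the table can grow to any size.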
Problem Categories. Text2Opt-Bench spans 12 categories organized into four tiers of increasing *modeling* difficulty. Each template category includes 50 small-tier instances (10K data tokens); three categories also include 50 large-tier instances (30K tokens) with identical structure, isolating the effect of *binding* scale. The pipeline is fully automated, so additional instances can be generated on demand.
The four tiers span increasing modeling difficulty: Direct Translation (Resource Allocation), Template-Based (Transportation, Disaster Response, JSSP, VRPTW, RCPSP), Induced Constraint (Facility Location, Power Transmission, Queuing/Staffing, with parameters derived from domain knowledge), and Industrially-Motivated (Stochastic Transportation, Multi-Objective Transportation, Modified Facility Location). Details are shown in Table [3](https://arxiv.org/html/2605.21751#S4.T3).
### 3.3 Evaluation Protocol
LLMs generate executable Gurobi Python code, which is run in a sandboxed subprocess. A response is correct iff the code (1) executes without error, (2) achieves optimal solver status, and (3) produces an objective value matching the ground truth within a relative tolerance of $10^{-4}$. All instances are feasible by construction; otherwise, a wrong formulation could also return "infeasible" and be falsely marked correct. Because coefficients are randomly generated continuous values, objective-value matching serves as an effective fingerprint: a wrong formulation is extremely unlikely to coincidentally produce the same optimum. We use objective matching rather than structural matching (e.g., variable or constraint counts) because many OR problems admit multiple valid formulations (details in Appendix [A.5](https://arxiv.org/html/2605.21751#A1.SS5)).
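The three-part correctness criterion can be written as a small predicate. This is a sketch under assumptions: the function name, the `"OPTIMAL"` status label, and the choice to normalize the error by the ground-truth magnitude are illustrative, since the text specifies only a relative tolerance of $10^{-4}$.

```python
from typing import Optional


def is_correct(
    executed_ok: bool,
    status: str,
    objective: Optional[float],
    ground_truth: float,
    rel_tol: float = 1e-4,
) -> bool:
    """Return True only if all three conditions of the protocol hold."""
    # (1) the generated code must run without error
    if not executed_ok:
        return False
    # (2) the solver must report an optimal solution (label is an assumption)
    if status != "OPTIMAL" or objective is None:
        return False
    # (3) objective must match ground truth within the relative tolerance;
    # the tiny floor avoids a zero tolerance when ground_truth is 0
    scale = max(abs(ground_truth), 1e-12)
    return abs(objective - ground_truth) <= rel_tol * scale


print(is_correct(True, "OPTIMAL", 100.005, 100.0))     # within tolerance
print(is_correct(True, "OPTIMAL", 100.02, 100.0))      # off by 2e-4 relative
print(is_correct(True, "INFEASIBLE", None, 100.0))     # not optimal
```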
The benchmark naturally tests binding at three difficulty levels, from easiest to hardest. (1) Data-externalized (BIND): numeric data lives in an external JSON file that the model can access. The model only needs to match constraint attributes to the corresponding keys and values in the file, making this the easiest binding setting. (2) Table-embedded (template default): data appears as structured tables within the prompt. The model must locate and transcribe the correct entries from potentially large tables into code. (3) Prose-embedded (direct translation): all coefficients are stated in natural-language sentences, requiring the model to parse numeric values from unstructured text. This is the hardest binding setting. This design enables separate assessment of modeling vs. binding failures.
BIND: Binding-Aware Data Offloading. For template problems, BIND externalizes all numeric data (e.g., cost matrices) to a JSON file. The model receives: (1) the structural problem description (objectives and constraints in natural language), (2) the data schema with dimensions and types, and (3) a file path. This forces the model to bind programmatically via `json.load()` rather than transcribing coefficients from the prompt. BIND assumes pre-extracted structured data; it therefore serves as a *diagnostic tool* for binding-aware methods, isolating how much accuracy is recoverable when the transcription burden is removed.
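A minimal sketch of the BIND setup, assuming a toy transportation instance: the field names, schema wording, and helper functions below are hypothetical and not taken from the paper. It shows the three prompt parts (structure, schema, file path) and the kind of loading code a model would be expected to write instead of copying numbers out of the prompt.

```python
import json
import os
import tempfile


def build_bind_prompt(structure: str, schema: dict, data_path: str) -> str:
    """Assemble the three prompt parts: structure, schema, and file path."""
    schema_lines = "\n".join(f"- {key}: {desc}" for key, desc in schema.items())
    return f"{structure}\n\nData schema:\n{schema_lines}\n\nData file: {data_path}"


def load_costs(data_path: str) -> dict:
    """What model-written code would do: bind coefficients via json.load()."""
    with open(data_path) as f:
        data = json.load(f)
    return {
        (i, j): data["cost"][i][j]
        for i in range(len(data["supply"]))
        for j in range(len(data["demand"]))
    }


# Toy instance (values invented for illustration).
instance = {
    "supply": [20, 30],
    "demand": [10, 25, 15],
    "cost": [[4.5, 2.0, 3.1], [1.2, 6.3, 2.8]],
}
schema = {
    "supply": "list of numbers, one per source",
    "demand": "list of numbers, one per destination",
    "cost": "matrix of numbers, sources x destinations",
}
structure = "Minimize total shipping cost subject to supply and demand limits."

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "instance.json")
    with open(path, "w") as f:
        json.dump(instance, f)
    prompt = build_bind_prompt(structure, schema, path)  # contains no coefficients
    costs = load_costs(path)                             # coefficients come from the file
```

The prompt length here depends on the schema, not on the number of coefficients, so growing the cost matrix does not add transcription work for the model.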
## 4 The Binding Bottleneck
We argue that binding, not modeling, is the primary bottleneck. We first show that accuracy collapses as the data scale increases, even when the formulation structure is fixed (Table [2](https://arxiv.org/html/2605.21751#S4.T2)), then benchmark 9 models across the dataset (§[4.1](https://arxiv.org/html/2605.21751#S4.SS1)), and finally confirm via RULER retrieval tasks that this reflects a general limitation in context-processing (§[4.2](https://arxiv.org/html/2605.21751#S4.SS2)).
Figure 4: (a) Failure composition by model scale on resource allocation (1,012 problems). As model size grows, binding errors increasingly make up a significant proportion of failures. (b) Each model exhibits an *effective binding limit* beyond which accuracy sharply declines. Curves are smoothed with a Gaussian-weighted rolling average.

Direct translation. Figure [4](https://arxiv.org/html/2605.21751#S4.F4) presents the relationship between model scale, data scale, and binding failures on all 1,012 resource allocation problems, which isolate binding as discussed in §[3.2](https://arxiv.org/html/2605.21751#S3.SS2). We use the Qwen2.5 family (0.5B–72B) to control for architectural differences. Panel (b) shows accuracy as a function of prompt token length.
We observe three primary trends: (1) Binding failures dominate at scale: Panel (a) shows a phase transition in failure composition: at 0.5B, nearly all failures are modeling errors (the model cannot formulate LPs); by 32B, 86% of failures are binding errors (correct formulation structure but wrong coefficients). (2) Accuracy declines sharply with instance scale: Panel (b) shows that accuracy drops as the size of the optimization problem grows. This confirms that the advertised context window (128k for the Qwen2.5 family) is far larger than the effective window for dense numerical tasks, aligning with recent findings on context scaling limits (Shi et al., [2026](https://arxiv.org/html/2605.21751#bib.bib23), Zhou et al., [2025](https://arxiv.org/html/2605.21751#bib.bib32), Liu et al., [2023](https://arxiv.org/html/2605.21751#bib.bib15)). (3) Model-specific thresholds: Larger models maintain accuracy on longer prompts. This shows a clear correlation between parameter count and effective context length.
Table 2: Binding degradation: small (n=50, 10K tokens) vs. large (n=50, 23–35K tokens) on three binding-limited categories. Same formulation structure, only data scale changes.

| Category | Avg Tokens (Small) | Avg Tokens (Large) | GPT-5 S | GPT-5 L | GPT-5 Δ | GPT-5-Nano S | GPT-5-Nano L | GPT-5-Nano Δ |
|---|---|---|---|---|---|---|---|---|
| Transportation | 1.4K | 23K | 100 | 90 | −10 | 100 | 32 | −68 |
| Multi-Obj T. | 3.6K | 35K | 70 | 48 | −22 | 60 | 0 | −60 |
| Queue/Staff. | 5.4K | 34K | 80 | 66 | −14 | 56 | 0 | −56 |
| Average | | | 83 | 68 | −15 | 72 | 11 | −61 |

Template binding degradation. Table [2](https://arxiv.org/html/2605.21751#S4.T2) confirms this on a set of structured problems. Accuracy degrades from 83% to 68% (GPT-5) and from 72% to 11% (GPT-5-Nano) when scaling from 10K to 30K data tokens at identical structure. Transportation is the clearest case: both models achieve 100% on small-tier instances, ruling out any formulation difficulty; the drop in GPT-5-Nano's accuracy to 32% on large instances is therefore attributable purely to binding scale, consistent with the multi-key retrieval cliff observed in RULER (§[4.2](https://arxiv.org/html/2605.21751#S4.SS2)).
### 4.1 Model and Scale Comparison
We evaluate 9 models on the main Text2Opt-Bench benchmark (Table [3](https://arxiv.org/html/2605.21751#S4.T3)), with additional scale analysis across the Qwen2.5 family (0.5B–72B). All problem descriptions are generated using GPT-5, at a cost of ~$0.03 (template) to ~$0.10 (direct translation) per instance. Because GPT-5 is also one of the evaluated models, data leakage concerns are discussed in Appendix [A.4](https://arxiv.org/html/2605.21751#A1.SS4).
Table [3](https://arxiv.org/html/2605.21751#S4.T3) presents pass@1 accuracy on the 550 small-tier template problems (50 per category). We measure correctness as described in §[3.3](https://arxiv.org/html/2605.21751#S3.SS3). This table reveals several patterns. (1) Frontier models are closely matched: Claude Sonnet 4.6, Opus 4.6, and GPT-5 achieve 84–90% on both resource allocation and template problems. (2) Reasoning models do not outperform standard models: DeepSeek-R1 performs similarly to DeepSeek-V3.2, suggesting that chain-of-thought reasoning does not address the binding bottleneck (Appendix [C](https://arxiv.org/html/2605.21751#A3) confirms that prompting strategies also fail to help). (3) Small models lack modeling skills: Qwen2.5-7B achieves 0% across many categories.
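Table 3: Text2Opt-Bench pass@1 accuracy (%). Template categories: 50 small-tier instances each. Resource allocation: 248 eval-subset instances. Best per row in **bold**. †: 50 additional large-tier instances (30K data tokens) for binding stress tests (§[4](https://arxiv.org/html/2605.21751#S4)).

Model groups: Frontier (Sonnet 4.6, Opus 4.6, GPT-5); Reasoning (o4-mini, DS-R1); Open-Large (DS-V3.2); Close-Mid (GPT-5-Nano); Open-Other (Llama3.3-70B, Qwen2.5-7B).

| Category | Problem Type | Form. | Variable Count | Sonnet 4.6 | Opus 4.6 | GPT-5 | o4-mini | DS-R1 | DS-V3.2 | GPT-5-Nano | Llama3.3-70B | Qwen2.5-7B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Direct Tran. | Resource Alloc. | LP/MILP | 2–20 | 84.7 | **89.9** | 87.9 | 80.2 | 80.6 | 79.0 | 49.2 | 49.6 | 13.3 |
| Template Based | Transportation† | LP | 9–625 | **100** | 98 | **100** | **100** | **100** | **100** | **100** | 88 | 38 |
| Template Based | Disaster Resp. | MILP | 30–792 | **96** | **96** | 86 | 94 | 78 | 90 | 62 | 30 | 0 |
| Template Based | JSSP | MILP | 19–365 | 98 | **100** | 90 | 96 | 96 | 96 | 82 | 0 | 0 |
| Template Based | VRPTW | MILP | 41–419 | 50 | 38 | **70** | 34 | 34 | 22 | 2 | 0 | 0 |
| Template Based | RCPSP | MILP | 26–181 | **100** | 96 | 88 | 82 | 34 | 62 | 26 | 0 | 0 |
| Induced Constraint | Facility Loc. | MILP | 18–980 | 98 | **100** | **100** | 98 | 94 | 98 | 90 | 98 | 6 |
| Induced Constraint | Power Trans. | MIQP | 18–360 | 64 | 88 | **98** | 70 | 64 | 54 | 50 | 16 | 0 |
| Induced Constraint | Queuing/Staff.† | NLP | 36–2.6K | **98** | 92 | 80 | 76 | 70 | 66 | 56 | 10 | 0 |
| Industrially Motivated | Stoch. Transp. | MILP | 172–1.4K | 62 | 62 | **70** | 66 | 60 | 18 | 32 | 6 | 0 |
| Industrially Motivated | Multi-Obj T.† | MILP | 40–896 | **98** | 88 | 70 | 68 | 76 | 86 | 60 | 42 | 4 |
| Industrially Motivated | Mod. Fac. Loc. | MILP | 28–390 | **100** | 96 | 96 | **100** | 96 | **100** | 90 | 96 | 2 |
| Template Avg. | | | | **87.6** | 86.7 | 86.2 | 80.4 | 72.9 | 72.0 | 59.1 | 35.1 | 4.5 |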
### 4.2 Retrieval Failures Beyond Optimization
To isolate the retrieval component of binding from OR-specific knowledge, we evaluate the Qwen2.5 family (0.5B–32B) on four tasks adapted from the RULER long-context benchmark (Hsieh et al., [2024](https://arxiv.org/html/2605.21751#bib.bib9)): single-key retrieval (analogous to reading `demand[j]`), multi-key retrieval (binding all coefficients in a constraint), multi-value retrieval (reading a data column), and aggregation (assembling an objective from scattered data). We harden RULER with distractor keys and scale difficulty by context length (1K–32K tokens). All tasks use strict exact-match: every requested value must be correct. Full details are in Appendix [D](https://arxiv.org/html/2605.21751#A4).
Figure 5: Accuracy on four RULER binding tasks across Qwen2.5 sizes (0.5B–32B). Strict exact-match scoring; 200 samples per task per context length. Multi-binding tasks exhibit sharp cliffs as individual retrieval failures compound multiplicatively.

Figure [5](https://arxiv.org/html/2605.21751#S4.F5) reveals two findings. First, *every* model degrades as context grows: even Qwen-32B's average score drops from 90% (1K) to 16% (32K). Second, degradation depends on the number of simultaneous bindings: single-key retrieval degrades gradually (32B: $96\% \to 63\%$), whereas multi-key and multi-value retrieval collapse from >90% to 0% between 8K and 16K tokens. This cliff is consistent with per-binding failure rates that compound multiplicatively ($p^{k}$), confirming that the binding bottleneck reflects a general retrieval limitation, instead of an OR-specific deficit.
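The multiplicative compounding can be checked numerically. The sketch below assumes each of the $k$ bindings succeeds independently with the same probability $p$, which is a simplification; the function name and the sample values of $p$ are illustrative. Under that assumption, a per-binding accuracy of 0.99 already gives only about a 37% chance of getting 100 bindings all right.

```python
def all_bindings_correct(p: float, k: int) -> float:
    """Probability that k independent bindings are all correct,
    given per-binding accuracy p (independence is an assumption)."""
    return p ** k


for k in (1, 10, 100, 1000):
    print(
        f"k={k:>4}: p=0.99 -> {all_bindings_correct(0.99, k):.3f}, "
        f"p=0.999 -> {all_bindings_correct(0.999, k):.3f}"
    )
```

Under strict exact-match scoring, a single wrong value fails the whole sample, so this product, not the per-binding accuracy, is what the benchmark observes.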
Summary. The evidence above cleanly separates two failure regimes. For binding-limited categories (transportation, facility location, JSSP, queuing/staffing), BIND recovers most failures: GPT-5 reaches 98–100%, confirming that residual errors were transcription failures. For modeling-limited categories (VRPTW, stochastic transportation, power transmission), BIND provides smaller or no gains; these failures reflect structural errors such as incorrect subtour elimination or mis-formulated chance constraints (see Appendix [F.1](https://arxiv.org/html/2605.21751#A6.SS1) for details).
## 5 Mitigating Binding Failures
Having established binding as the primary bottleneck, we now investigate how to address it. We consider two complementary approaches: inference-time strategies (§[5.1](https://arxiv.org/html/2605.21751#S5.SS1)), and training-time strategies that specialize a model for binding (§[5.2](https://arxiv.org/html/2605.21751#S5.SS2)).
### 5.1 Inference: BIND and Test-Time Compute
We evaluate test-time compute (TTC) strategies that trade additional inference cost for higher accuracy, including repeated sampling (Brown et al., [2024](https://arxiv.org/html/2605.21751#bib.bib3), Snell et al., [2024](https://arxiv.org/html/2605.21751#bib.bib24)), iterative repair (Chen et al., [2023b](https://arxiv.org/html/2605.21751#bib.bib5)), and our binding-aware data offloading (BIND).
Figure [1](https://arxiv.org/html/2605.21751#S1.F1) compares these strategies across seven models from three families on 550 template problems (Llama3.3-70B and Qwen2.5-7B are excluded due to insufficient modeling ability; their results are in Appendix [E](https://arxiv.org/html/2605.21751#A5)). We establish two upper bounds: pass@5 and iterative repair with oracle feedback (a verifier with ground-truth objective and model structure provides diagnostic feedback each round). We compare BIND against the strongest possible repair baseline, since the Gurobi solver cannot provide a valid signal without additional information.
We find that BIND consistently matches or exceeds both upper bounds, achieving near-ceiling accuracy at lower token cost than pass@1. For example, GPT-5 reaches 95.8% with BIND at 3.1K tokens vs. 4.2K for pass@1; Claude Opus achieves 98.7% at 3.3K tokens. This confirms that the binding bottleneck is the primary failure mode, and addressing it architecturally is more efficient than brute-force resampling.
Even with oracle feedback, repair at 5 rounds matches BIND only at 2–4× the token cost (e.g., Claude Opus: 98.7% at 6.3K tokens vs. 3.3K for BIND; GPT-5: 95.5% at 6.9K vs. 3.1K). Weaker models benefit more from repair, but still fall short of BIND's efficiency. Pass@5 requires 5× the token cost and has similar performance to repair. The full per-model breakdown including token costs is in Table [7](https://arxiv.org/html/2605.21751#A5.T7) (Appendix [E](https://arxiv.org/html/2605.21751#A5)).
BIND per-category analysis. BIND consistently improves all capable models: GPT-5 gains +9.6pp, Sonnet 4.6 +8.6pp, DeepSeek-R1 +11.6pp, GPT-5-Nano +23.3pp. Even Qwen2.5-7B doubles from 4.5% to 8.9%, primarily through transportation (+44pp), where binding is the bottleneck. However, BIND cannot compensate for missing *modeling* ability: Qwen2.5-7B's accuracy remains 0% on all structurally complex categories (VRPTW, RCPSP, stochastic) with BIND. The exception is power transmission, where GPT-5's accuracy drops by 18pp because this induced-constraint problem requires deriving physics formulas from concrete values that are no longer inline when BIND externalizes them. Per-category results are in Table [8](https://arxiv.org/html/2605.21751#A5.T8) (Appendix [E](https://arxiv.org/html/2605.21751#A5)); Appendix [F](https://arxiv.org/html/2605.21751#A6) provides a detailed case study of binding failures.
### 5\.2Training binding\-specific models is most effective
If binding is the bottleneck, then a model trained *only* to bind should outperform one trained end-to-end. We test this with a two-phase pipeline: a fine-tuned binding model produces structured JSON, and a separate solver stage (an untrained LLM or deterministic template code) constructs the Gurobi program. We compare against standard end-to-end supervised fine-tuning (SFT) and reinforcement learning via GRPO (Shao et al., [2024](https://arxiv.org/html/2605.21751#bib.bib20)).
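To make the two-phase split concrete, here is a minimal sketch for a transportation instance. It rests on several assumptions: the JSON schema (`supply`, `demand`, `cost`) and the function `build_transportation_lp` are hypothetical and not taken from the paper, and the deterministic second phase only assembles LP coefficient arrays rather than constructing a Gurobi program.

```python
import json

# Hypothetical Phase 1 output: the binding model emits only structured data.
binding_json = """
{"supply": [20, 30],
 "demand": [10, 25, 15],
 "cost": [[8, 6, 10], [9, 12, 13]]}
"""


def build_transportation_lp(data: dict):
    """Phase 2 (deterministic template): turn bound data into LP arrays.

    Variables x[i][j] are flattened row-major. Returns (c, A_ub, b_ub, A_eq, b_eq)
    for: min c.x  s.t.  A_ub x <= b_ub (supply), A_eq x == b_eq (demand), x >= 0.
    """
    supply, demand, cost = data["supply"], data["demand"], data["cost"]
    m, n = len(supply), len(demand)
    if len(cost) != m or any(len(row) != n for row in cost):
        raise ValueError("cost matrix shape does not match supply/demand")
    c = [cost[i][j] for i in range(m) for j in range(n)]
    # One supply row per source: sum_j x[i][j] <= supply[i].
    A_ub = [[1 if k // n == i else 0 for k in range(m * n)] for i in range(m)]
    # One demand row per destination: sum_i x[i][j] == demand[j].
    A_eq = [[1 if k % n == j else 0 for k in range(m * n)] for j in range(n)]
    return c, A_ub, list(supply), A_eq, list(demand)


c, A_ub, b_ub, A_eq, b_eq = build_transportation_lp(json.loads(binding_json))
print(len(c), "variables,", len(A_ub) + len(A_eq), "constraints")
```

Because the second phase is a fixed template, any error in the resulting program traces back to the JSON produced in the first phase, which is the property the decomposition is meant to isolate.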
Table[4](https://arxiv.org/html/2605.21751#S5.T4)reports results\. Across all three categories, the 7B binding specialist outperforms end\-to\-end SFT: 58\.1% vs\. 51\.2% \(resource allocation\), 100% vs\. 96% \(JSSP\), and 96\.0% vs\. 88\.0% \(transportation\)\. GRPO underperforms SFT, and adding denser reward signals \(hierarchical partial credit\) further degrades accuracy as the model exploits intermediate reward gates \(see Appendix[G\.5](https://arxiv.org/html/2605.21751#A7.SS5.SSS0.Px2)for a full study\)\. In\-distribution accuracy is near\-perfect \(96–100%\) across categories, and a 1\.5B binding specialist already matches 7B end\-to\-end SFT on resource allocation\. Fixed\-schema categories \(transportation, JSSP\) generalize well OOD \(91\.7–100%\), while free\-form categories like resource allocation require training coverage closer to the target distribution \(Appendix[G\.9](https://arxiv.org/html/2605.21751#A7.SS9)\)\. These results reinforce the modeling–binding decomposition: isolating the bottleneck task yields both stronger performance and better parameter efficiency than joint training, with SFT’s dense token\-level supervision proving more suited to faithful transcription than RL’s sparse outcome\-based reward\.
Table 4: Accuracy (%) of the two-phase binding pipeline vs. end-to-end training across three categories. Phase 1 binding specialists are Qwen2.5-7B-Instruct (full SFT). Phase 2 uses an untrained Qwen2.5-7B for resource allocation and deterministic template code for transportation and JSSP.

| Category | System | Phase 2 | All | In-distribution | OOD |
|---|---|---|---|---|---|
| Resource Alloc. (Phase 2: Qwen-7B) | Ground-truth | Qwen-7B | 100 | 100 | 100 |
| | 7B binding spec. | Qwen-7B | 58.1 | 99.2 | 11.2 |
| | 1.5B binding spec. | Qwen-7B | 51.2 | 94.7 | 1.7 |
| | 7B SFT | | 51.2 | 88.6 | 8.6 |
| | 7B GRPO² | | 44.0 | 76.5 | 6.9 |
| Transportation (Phase 2: Template) | Ground-truth | Template | 100 | 100 | 100 |
| | 7B binding spec. | Template | 96.0 | 100 | 91.7 |
| | 7B SFT | | 88.0 | 100 | 75.0 |
| JSSP (Phase 2: Template) | Ground-truth | Template | 100 | 100 | 100 |
| | 7B binding spec. | Template | 100 | 100 | 100 |
| | 7B SFT | | 96.0 | 100 | 92.0 |

² RL requires a reward signal from executing generated code; at 7B scale, the base model has 0% base accuracy for JSSP. Thus, we opted not to include RL baselines for the categories in the table.
## 6Conclusion
We presented Text2Opt-Bench, a benchmark of 12 solver-verified optimization categories, and showed that instance binding is the primary bottleneck for frontier LLMs. BIND, training, and controlled retrieval experiments on RULER tasks all converge on this conclusion, while modeling limitations still exist for structurally complex problems (VRPTW, power transmission). SFT outperforms RL at 7B scale, and binding specialists outperform end-to-end SFT across three categories; both findings are consistent with binding as the bottleneck.
Limitations: Our benchmark covers mathematical programming formulations solvable by Gurobi but does not cover combinatorial optimization requiring heuristic or metaheuristic approaches. The fine-tuning study covers three categories (resource allocation, transportation, JSSP) at 7B scale; extension to structurally complex categories (VRPTW, stochastic transportation) and larger models remains future work. BIND assumes cleanly separated structured data; real-world settings where parameters are embedded in unstructured documents would require an additional data-extraction step that BIND does not address.
## Acknowledgments
We are grateful for the support of the National Science Foundation (NSF) (CCF2106707), the Defense Advanced Research Projects Agency (DARPA Young Faculty Award), and the Wisconsin Alumni Research Foundation (WARF).
## Ethics Statement
This work uses GPT\-5 to generate natural language problem descriptions for benchmark instances \(Section[3\.2](https://arxiv.org/html/2605.21751#S3.SS2)\)\. The generated solution is solver\-verified via Gurobi; no LLM is used for evaluation or scoring\. No human subjects, personally identifiable information, or sensitive data are involved in this work\.
## References
- AhmadiTeshnizi et al\. \(2024\)Ali AhmadiTeshnizi, Wenzhi Gao, and Madeleine Udell\.Optimus: Scalable optimization modeling with \(mi\)lp solvers and large language models, 2024\.URL[https://arxiv\.org/abs/2402\.10172](https://arxiv.org/abs/2402.10172)\.
- Berenbeim et al\. \(2025\)Alexander Michael Berenbeim, Ryan McNeil, Timeo Williams, and Nathaniel D\. Bastian\.NLMOptimizer: A neurosymbolic framework and benchmark for operations research optimization problems from natural language, 2025\.URL[https://openreview\.net/forum?id=skctEx59f2](https://openreview.net/forum?id=skctEx59f2)\.
- Brown et al\. \(2024\)Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V\. Le, Christopher Ré, and Azalia Mirhoseini\.Large language monkeys: Scaling inference compute with repeated sampling, 2024\.URL[https://arxiv\.org/abs/2407\.21787](https://arxiv.org/abs/2407.21787)\.
- Chen et al\. \(2023a\)Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W\. Cohen\.Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, 2023a\.URL[https://arxiv\.org/abs/2211\.12588](https://arxiv.org/abs/2211.12588)\.
- Chen et al\. \(2023b\)Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou\.Teaching large language models to self\-debug, 2023b\.URL[https://arxiv\.org/abs/2304\.05128](https://arxiv.org/abs/2304.05128)\.
- Chen et al\. \(2026\)Yitian Chen, Cheng Cheng, Yinan Sun, Zi Ling, and Dongdong Ge\.Opt\-engine: Benchmarking the limits of llms in optimization modeling via complexity scaling, 2026\.URL[https://arxiv\.org/abs/2601\.19924](https://arxiv.org/abs/2601.19924)\.
- Gao et al\. \(2023\)Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig\.Pal: Program\-aided language models, 2023\.URL[https://arxiv\.org/abs/2211\.10435](https://arxiv.org/abs/2211.10435)\.
- Goldie et al\. \(2025\)Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai, and Christopher D\. Manning\.Synthetic data generation and multi\-step rl for reasoning and tool use, 2025\.URL[https://arxiv\.org/abs/2504\.04736](https://arxiv.org/abs/2504.04736)\.
- Hsieh et al\. \(2024\)Cheng\-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg\.Ruler: What’s the real context size of your long\-context language models?, 2024\.URL[https://arxiv\.org/abs/2404\.06654](https://arxiv.org/abs/2404.06654)\.
- Huang et al\. \(2025a\)Chenyu Huang, Zhengyang Tang, Shixi Hu, Ruoqing Jiang, Xin Zheng, Dongdong Ge, Benyou Wang, and Zizhuo Wang\.Orlm: A customizable framework in training large models for automated optimization modeling\.*Operations Research*, 73\(6\):2986–3009, November 2025a\.ISSN 1526\-5463\.[10\.1287/opre\.2024\.1233](https://arxiv.org/doi.org/10.1287/opre.2024.1233)\.URL[http://dx\.doi\.org/10\.1287/opre\.2024\.1233](http://dx.doi.org/10.1287/opre.2024.1233)\.
- Huang et al\. \(2025b\)Xuhan Huang, Qingning Shen, Yan Hu, Anningzhe Gao, and Benyou Wang\.Llms for mathematical modeling: Towards bridging the gap between natural and mathematical languages, 2025b\.URL[https://arxiv\.org/abs/2405\.13144](https://arxiv.org/abs/2405.13144)\.
- Jiang et al\. \(2025\)Caigao Jiang, Xiang Shu, Hong Qian, Xingyu Lu, Jun Zhou, Aimin Zhou, and Yang Yu\.Llmopt: Learning to define and solve general optimization problems from scratch, 2025\.URL[https://arxiv\.org/abs/2410\.13213](https://arxiv.org/abs/2410.13213)\.
- Kwon et al\. \(2023\)Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E\. Gonzalez, Hao Zhang, and Ion Stoica\.Efficient memory management for large language model serving with pagedattention, 2023\.URL[https://arxiv\.org/abs/2309\.06180](https://arxiv.org/abs/2309.06180)\.
- Liu et al\. \(2025\)Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, and Junxian He\.Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond, 2025\.URL[https://arxiv\.org/abs/2505\.19641](https://arxiv.org/abs/2505.19641)\.
- Liu et al\. \(2023\)Nelson F\. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang\.Lost in the middle: How language models use long contexts, 2023\.URL[https://arxiv\.org/abs/2307\.03172](https://arxiv.org/abs/2307.03172)\.
- Lu et al\. \(2025\)Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, and Zaiwen Wen\.Optmath: A scalable bidirectional data synthesis framework for optimization modeling, 2025\.URL[https://arxiv\.org/abs/2502\.11102](https://arxiv.org/abs/2502.11102)\.
- Mostajabdaveh et al\. \(2025\)Mahdi Mostajabdaveh, Timothy T\. Yu, Samarendra Chandan Bindu Dash, Rindranirina Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, and Yong Zhang\.Evaluating llm reasoning in the operations research domain with orqa, 2025\.URL[https://arxiv\.org/abs/2412\.17874](https://arxiv.org/abs/2412.17874)\.
- Ramamonjison et al\. \(2022\)Rindranirina Ramamonjison, Timothy Yu, Raymond Li, Haley Li, Giuseppe Carenini, Bissan Ghaddar, Shiqi He, Mahdi Mostajabdaveh, Amin Banitalebi\-Dehkordi, Zirui Zhou, and Yong Zhang\.Nl4opt competition: Formulating optimization problems based on their natural language descriptions\.In Marco Ciccone, Gustavo Stolovitzky, and Jacob Albrecht, editors,*Proceedings of the NeurIPS 2022 Competitions Track*, volume 220 of*Proceedings of Machine Learning Research*, pages 189–203\. PMLR, 28 Nov–09 Dec 2022\.URL[https://proceedings\.mlr\.press/v220/ramamonjison23a\.html](https://proceedings.mlr.press/v220/ramamonjison23a.html)\.
- Seegmiller et al\. \(2025\)Parker Seegmiller, Kartik Mehta, Soumya Saha, Chenyang Tao, Shereen Oraby, Arpit Gupta, Tagyoung Chung, Mohit Bansal, and Nanyun Peng\.Flames: Improving llm math reasoning via a fine\-grained analysis of the data synthesis pipeline, 2025\.URL[https://arxiv\.org/abs/2508\.16514](https://arxiv.org/abs/2508.16514)\.
- Shao et al\. \(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y\. K\. Li, Y\. Wu, and Daya Guo\.Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024\.URL[https://arxiv\.org/abs/2402\.03300](https://arxiv.org/abs/2402.03300)\.
- Shen et al\. \(2026\)Chao Shen, Zihan Guo, Xu Wan, Zhenghao Yang, Yifan Zhang, Wengi Huang, Jie Song, Zongyan Zhang, and Mingyang Sun\.Proopf: Benchmarking and improving llms for professional\-grade power systems optimization modeling, 2026\.URL[https://arxiv\.org/abs/2602\.03070](https://arxiv.org/abs/2602.03070)\.
- Sheng et al\. \(2025\)Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu\.Hybridflow: A flexible and efficient rlhf framework\.In*Proceedings of the Twentieth European Conference on Computer Systems*, EuroSys ’25, page 1279–1297\. ACM, March 2025\.[10\.1145/3689031\.3696075](https://arxiv.org/doi.org/10.1145/3689031.3696075)\.URL[http://dx\.doi\.org/10\.1145/3689031\.3696075](http://dx.doi.org/10.1145/3689031.3696075)\.
- Shi et al\. \(2026\)Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng\-Neng Hwang, and Lei Li\.Intrinsic entropy of context length scaling in llms, 2026\.URL[https://arxiv\.org/abs/2502\.01481](https://arxiv.org/abs/2502.01481)\.
- Snell et al\. \(2024\)Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar\.Scaling llm test\-time compute optimally can be more effective than scaling model parameters, 2024\.URL[https://arxiv\.org/abs/2408\.03314](https://arxiv.org/abs/2408.03314)\.
- Tso et al\. \(2026\)Joseph Tso, Preston Schmittou, Quan Huynh, and Jibran Hutchins\.Constraintbench: Benchmarking llm constraint reasoning on direct optimization, 2026\.URL[https://arxiv\.org/abs/2602\.22465](https://arxiv.org/abs/2602.22465)\.
- Wang et al\. \(2024\)Zhuohan Wang, Ziwei Zhu, Yizhou Han, Yufeng Lin, Zhihang Lin, Ruoyu Sun, and Tian Ding\.Optibench: Benchmarking large language models in optimization modeling with equivalence\-detection evaluation, 2024\.URL[https://openreview\.net/forum?id=KD9F5Ap878](https://openreview.net/forum?id=KD9F5Ap878)\.
- Xiao et al\. \(2024\)Ziyang Xiao, Dongxiang Zhang, Yangjun Wu, Lilin Xu, Yuan Jessica Wang, Xiongwei Han, Xiaojin Fu, Tao Zhong, Jia Zeng, Mingli Song, and Gang Chen\.Chain\-of\-experts: When LLMs meet complex operations research problems\.In*The Twelfth International Conference on Learning Representations*, 2024\.URL[https://openreview\.net/forum?id=HobyL1B9CZ](https://openreview.net/forum?id=HobyL1B9CZ)\.
- Xiao et al\. \(2025\)Ziyang Xiao, Jingrong Xie, Lilin Xu, Shisi Guan, Jingyan Zhu, Xiongwei Han, Xiaojin Fu, WingYin Yu, Han Wu, Wei Shi, Qingcan Kang, Jiahui Duan, Tao Zhong, Mingxuan Yuan, Jia Zeng, Yuan Wang, Gang Chen, and Dongxiang Zhang\.A survey of optimization modeling meets llms: Progress and future directions, 2025\.URL[https://arxiv\.org/abs/2508\.10047](https://arxiv.org/abs/2508.10047)\.
- Zhang et al\. \(2026\)Alex L\. Zhang, Tim Kraska, and Omar Khattab\.Recursive language models, 2026\.URL[https://arxiv\.org/abs/2512\.24601](https://arxiv.org/abs/2512.24601)\.
- Zhang et al\. \(2025\)Bowen Zhang, Pengcheng Luo, Genke Yang, Boon\-Hee Soong, and Chau Yuen\.Or\-llm\-agent: Automating modeling and solving of operations research optimization problems with reasoning llm, 2025\.URL[https://arxiv\.org/abs/2503\.10009](https://arxiv.org/abs/2503.10009)\.
- Zheng et al\. \(2024\)Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma\.Llamafactory: Unified efficient fine\-tuning of 100\+ language models\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\)*, Bangkok, Thailand, 2024\. Association for Computational Linguistics\.URL[http://arxiv\.org/abs/2403\.13372](http://arxiv.org/abs/2403.13372)\.
- Zhou et al\. \(2025\)Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, and Beidi Chen\.Gsm\-infinite: How do your llms behave over infinitely increasing context length and reasoning complexity?, 2025\.URL[https://arxiv\.org/abs/2502\.05252](https://arxiv.org/abs/2502.05252)\.
## Appendix ADataset Curation Details
This appendix provides additional detail on the generation pipeline described in Section[3\.2](https://arxiv.org/html/2605.21751#S3.SS2)\. Algorithm[1](https://arxiv.org/html/2605.21751#alg1)gives the pseudocode\.
#### Problem category details\.
Each template\-based category tests a distinct combination of modeling and binding challenges:
- Transportation: Bipartite supply-demand LP.
- Disaster Response: Multi-period MILP with vehicle routing, supply shortages, and route security.
- JSSP: Job-shop scheduling with machine assignments and precedence constraints.
- VRPTW: Vehicle routing with time windows, capacity, and subtour elimination.
- RCPSP: Multi-mode project scheduling with time lags, budget, and deadlines.
- Facility Location: Requires deriving cost matrices from Euclidean distances (MILP).
- Power Transmission: Requires deriving quadratic power loss from Ohm’s law (MIQP).
- Queuing/Staffing: Requires Erlang-C formulas for service levels (nonlinear).
- Stochastic Transportation: Two-stage MILP with SAA and chance constraints.
- Multi-Objective Transportation: Bi-objective (cost + emissions) with fixed charges, MOQ, and supplier cardinality.
- Modified Facility Location: Extended facility location with additional operational constraints.
Algorithm 1: Text2Opt-Bench Generation Pipeline

1. Input: Problem type $T$, dimensions $n, m$
2. $\mathcal{D}_{\text{struct}} \gets \text{GenerateWorldState}(T, n, m)$ ▷ Domain-specific parameters
3. $x^{*}, z^{*} \gets \text{SolverVerify}(\mathcal{D}_{\text{struct}})$ ▷ Gurobi ground truth
4. if Direct Translation (small scale) then
5. $\mathcal{T} \gets \text{LLM}(\text{Prompt}_{\text{Direct}}, \mathcal{D}_{\text{struct}})$ ▷ Full data in narrative
6. else ▷ Template-Based (large scale)
7. $\mathcal{T}_{\text{tmpl}} \gets \text{LLM}(\text{Prompt}_{\text{Template}}, \text{Schema}(\mathcal{D}_{\text{struct}}))$ ▷ Structure only
8. $\mathcal{T} \gets \text{InsertData}(\mathcal{T}_{\text{tmpl}}, \mathcal{D}_{\text{struct}})$ ▷ Fill placeholders
9. end if
10. Output: $(\mathcal{T}, \mathcal{D}_{\text{struct}}, x^{*}, z^{*})$
### A\.1Direct Translation: Mathematical Construction
For resource allocation problems, we generate a linear programming problem in standard form:
$$
\begin{aligned}
\text{minimize} \quad & c^{T}x \\
\text{subject to} \quad & Ax \gtreqless b \\
& x \geq 0
\end{aligned}
\tag{1}
$$

To ensure control over the problem’s characteristics, we use an anchor solution:

1. Matrix Construction ($A$): We initialize $A \in \mathbb{R}^{m \times n}$ with random values and apply a sparsity mask to simulate real-world interactions.
2. Anchor Solution ($x_{\text{anchor}}$): We sample a feasible solution $x_{\text{anchor}} \geq 0$.
3. RHS Derivation ($b$): The vector $b$ is derived via $b_{i} = (Ax_{\text{anchor}})_{i} + s_{i}$, ensuring feasibility by construction.
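The anchor-based construction can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' generator: the coefficient ranges, the random choice of constraint sense, and the rule that `<=` rows add the slack while `>=` rows subtract it are all choices made here, and the solver verification step is omitted.

```python
import random


def generate_anchored_lp(n, m, sparsity=0.5, seed=0):
    """Build (A, b, c, senses, x_anchor) so that x_anchor is feasible by construction."""
    rng = random.Random(seed)
    # Random coefficients with a sparsity mask (entry kept with prob. 1 - sparsity).
    A = [[round(rng.uniform(1, 10), 2) if rng.random() > sparsity else 0.0
          for _ in range(n)] for _ in range(m)]
    x_anchor = [round(rng.uniform(0, 5), 2) for _ in range(n)]
    c = [round(rng.uniform(1, 10), 2) for _ in range(n)]
    b, senses = [], []
    for row in A:
        lhs = sum(a * x for a, x in zip(row, x_anchor))
        slack = rng.uniform(0.5, 2.0)  # strictly positive slack
        sense = rng.choice(["<=", ">="])
        # "<=" rows get b = lhs + slack, ">=" rows get b = lhs - slack.
        b.append(lhs + slack if sense == "<=" else lhs - slack)
        senses.append(sense)
    return A, b, c, senses, x_anchor


def is_feasible(A, b, senses, x, tol=1e-9):
    """Check each row of A x against b with its sense, plus x >= 0."""
    if any(v < 0 for v in x):
        return False
    for row, rhs, sense in zip(A, b, senses):
        lhs = sum(a * v for a, v in zip(row, x))
        if sense == "<=" and lhs > rhs + tol:
            return False
        if sense == ">=" and lhs < rhs - tol:
            return False
    return True


A, b, c, senses, x_anchor = generate_anchored_lp(n=4, m=3)
print(is_feasible(A, b, senses, x_anchor))
```

Feasibility of the anchor only guarantees a non-empty feasible region; a solver pass is still needed to confirm that an optimum exists and to record it.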
The structured representation is then passed to an LLM (GPT-5) with a prompt, and all numerical coefficients from $A$, $b$, and $c$ are embedded in a text description. An example of this process is in §[A.3](https://arxiv.org/html/2605.21751#A1.SS3).
Algorithm 2: Direct Translation Dataset Generation

1. Input: Dimensions $n, m$, sparsity $S$
2. Phase 1: Construction (Guaranteed Feasibility)
3. $A \gets \text{RandomMatrix}(m, n, \text{sparsity}=S)$
4. $x_{\text{anchor}} \gets \text{RandomVector}(n, \min=0)$
5. $s \gets \text{RandomVector}(m, \min=0.5)$
6. $b \gets Ax_{\text{anchor}} \pm s$ ▷ Constructs $b$ s.t. $x_{\text{anchor}}$ is feasible
7. $c \gets \text{RandomVector}(n)$
8. Phase 2: Verification (Ensured Optimality)
9. $x^{*}, z^{*}, \text{status} \gets \text{GurobiSolve}(A, b, c)$
10. if $\text{status} \neq \text{OPTIMAL}$ then
11. return Retry ▷ Reject unbounded/infeasible
12. end if
13. $\mathcal{D}_{\text{struct}} \gets \{A, b, c, \text{senses}, \text{types}\}$
14. $\mathcal{T}_{\text{text}} \gets \text{LLM}(\text{SystemPrompt}, \mathcal{D}_{\text{struct}})$
15. Output: Pair $(\mathcal{T}_{\text{text}}, \mathcal{D}_{\text{struct}})$
### A\.2Template\-Based Generation: Full Pipeline
For structured problems (100+ variables), direct translation becomes impractical. We initially explored hierarchical decomposition via a block-diagonal structure ($A = \text{diag}(A_{1}, \dots, A_{k})$), which would allow decoupling into $k$ independent sub-problems with $Z^{*} = \sum_{i=1}^{k} Z_{i}^{*}$. However, we discarded this approach due to three critical bottlenecks: (1) context explosion, where merged narratives exceeded 100K tokens; (2) semantic fragmentation, resulting in disjointed narratives lacking global coherence; and (3) topological inflexibility, as the method could not accommodate complex linking constraints.
To resolve this, we developed the template-based pipeline:
#### 1\. Structured Parameter Generation\.
Instead of a generic matrix $A$, we generate domain-specific parameters. For example, when generating facility location problems:
- Coordinates for $N$ facilities and $M$ customers.
- Fixed costs $f_{i}$, capacities $s_{i}$, demands $d_{j}$, and transport rates $r$.
The transport cost matrix is not provided directly; the model must compute it from coordinates via Euclidean distance\.
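The derivation a model has to perform here is small but easy to get wrong when done by hand over many coordinates. A minimal sketch follows, assuming the cost is the transport rate multiplied by the straight-line distance (the source only states that the matrix comes from coordinates via Euclidean distance); the function name and sample points are illustrative.

```python
import math


def transport_cost_matrix(facilities, customers, rate):
    """cost[i][j] = rate * Euclidean distance between facility i and customer j."""
    return [[rate * math.dist(f, cust) for cust in customers] for f in facilities]


# Illustrative coordinates, not benchmark data.
facilities = [(0.0, 0.0), (3.0, 4.0)]
customers = [(3.0, 4.0), (6.0, 8.0)]
cost = transport_cost_matrix(facilities, customers, rate=2.0)
for row in cost:
    print(row)
```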
#### 2\. Template Generation via LLM\.
The LLM generates a template “business memo” describing the logic of the problem but excluding numerical data. The prompt requires placeholders such as {CUSTOMER_DEMANDS} to be included.
#### 3\. Deterministic Data Insertion\.
The pipeline programmatically replaces placeholders with formatted generated data, decoupling linguistic complexity from numerical complexity\.
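A minimal sketch of this insertion step is shown below. The helper names (`insert_data`, `format_demands`), the memo wording, and the uppercase-with-underscores placeholder pattern are assumptions for illustration; only the `{CUSTOMER_DEMANDS}` placeholder name comes from the text.

```python
import re


def insert_data(template: str, data: dict) -> str:
    """Replace every {PLACEHOLDER} with its formatted data block; fail loudly on gaps."""
    def substitute(match):
        key = match.group(1)
        if key not in data:
            raise KeyError(f"template references missing placeholder {key!r}")
        return data[key]

    return re.sub(r"\{([A-Z_]+)\}", substitute, template)


def format_demands(demands):
    """Render a list of demands as one bullet per customer."""
    return "\n".join(f"- Customer {j}: {d} units" for j, d in enumerate(demands, start=1))


template = "Team,\nplease plan shipments to cover these orders:\n{CUSTOMER_DEMANDS}\nThanks."
memo = insert_data(template, {"CUSTOMER_DEMANDS": format_demands([42, 17])})
print(memo)
```

Raising on an unknown placeholder, rather than leaving it in the text, keeps a template that drifted from the schema from silently producing an incomplete problem statement.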
### A\.3Data Embedding Example
Example: Data Embedding Transformation

We sample a specific variable and constraint to demonstrate the mapping from structured parameters to natural language narrative.

1. Variable Embedding
   - Input (Structured):
     - Var_1: Type Integer, Obj Coeff $c_{1} = 6.86$
     - Interaction: Consumes 0.8 of Resource $C_{0}$
   - Output (Narrative): “On-Site Retrofit Packages: Each completed package adds 6.86 in contribution. Each package uses 0.8 units from our environmental emissions allowance…”
2. Constraint Embedding
   - Input (Structured):
     - Constraint C0: Sense $\leq$, RHS $b_{0} = 8.25$
   - Output (Narrative): “Environmental Emissions Allowance: Total available is 8.25 allowance units and cannot be exceeded.”
### A\.4Note on GPT\-5 Contamination
GPT\-5 is used both to generate benchmark instances and as an evaluated model, raising a potential self\-contamination concern\. For template\-based categories, this concern is structurally precluded: GPT\-5 generates only the prose template \(natural language structure\), while all numerical data is inserted deterministically by scripts\.
For resource allocation \(direct translation\), GPT\-5 generates the full problem description including numerical coefficients\. However, two observations argue against contamination: \(1\) GPT\-5 \(87\.9%\) is*outperformed*by Claude Opus 4\.6 \(89\.9%\), which is not included in generation; \(2\) Table[5](https://arxiv.org/html/2605.21751#A2.T5)shows that GPT\-5’s failures are predominantly binding errors, similar to other models — memorization would primarily aid coefficient recall, yet GPT\-5 shows no such advantage\.
### A\.5Evaluation Validity: False\-Positive Prevention
Our evaluation pipeline is designed to minimize false positives at two levels\.
#### Feasibility by construction\.
If infeasible problems were included, a model producing a wrong formulation would frequently also yield an infeasible result, creating a false positive under code\-result evaluation\. Restricting to feasible instances ensures that any infeasible output is unambiguously incorrect\.
#### Objective fingerprinting\.
The remaining false-positive risk is a structurally different formulation that coincidentally matches the gold objective. In our pipeline, all instances use randomly generated continuous coefficients with wide ranges (e.g., costs from $[5, 30]$, demands from $[10, 100]$), making the optimal objective an effective fingerprint: the chance of coincidental agreement to within $10^{-4}$ relative tolerance is negligible. We avoid supplementing objective matching with variable/constraint count checks, as correct formulations can legitimately differ in these counts due to auxiliary variables (e.g., $t = \max(x, y)$), constraint decomposition, or alternative modeling strategies (e.g., Miller–Tucker–Zemlin vs. lazy subtour elimination in VRPTW).
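As an illustration of the fingerprint check, the sketch below compares a predicted objective against the gold value under a relative tolerance. The exact formula is an assumption (the source states only the $10^{-4}$ relative tolerance), and the small floor on the denominator is added here to avoid division by zero. The first pair of values is taken from the GPT-5 case study in Appendix B.

```python
def objectives_match(pred: float, gold: float, rel_tol: float = 1e-4) -> bool:
    """True when pred agrees with gold to within rel_tol relative error."""
    scale = max(abs(gold), 1e-12)  # guard against division by zero (assumption)
    return abs(pred - gold) / scale <= rel_tol


# One leaked coefficient: relative error is about 2.3e-3, well above 1e-4.
print(objectives_match(580.06, 581.41))
# A solver-noise-sized difference: relative error is about 1.7e-5.
print(objectives_match(581.42, 581.41))
```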
## Appendix BFailure Mode Analysis
To validate that resource allocation is a binding\-dominated task, we classify every failure across 9 models by checking the Gurobi model structure of the generated code\.
#### Classification\.
For each failed solution, we extract the number of decision variables \(NumVars\) and constraints \(NumConstrs\) from the Gurobi model object and compare against the gold solution:
- Binding error: correct NumVars and NumConstrs but wrong objective value; the model understood the formulation but mis-transcribed coefficients.
- Modeling error: incorrect NumVars or NumConstrs; the model produced a structurally different formulation.
- Execution error: the generated code fails to execute (syntax errors, runtime exceptions).
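The three-way classification above reduces to a short decision rule. The sketch below is one possible rendering: `ModelSummary` and `classify` are hypothetical names, the counts stand in for the NumVars and NumConstrs values read from the Gurobi model, and a crashed run is represented by `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelSummary:
    num_vars: int
    num_constrs: int
    objective: float


def classify(generated: Optional[ModelSummary], gold: ModelSummary,
             rel_tol: float = 1e-4) -> str:
    """Label one solution as pass / execution / modeling / binding."""
    if generated is None:  # code crashed or did not run
        return "execution"
    rel_err = abs(generated.objective - gold.objective) / max(abs(gold.objective), 1e-12)
    if rel_err <= rel_tol:
        return "pass"
    if (generated.num_vars, generated.num_constrs) != (gold.num_vars, gold.num_constrs):
        return "modeling"
    return "binding"  # right structure, wrong numbers


gold = ModelSummary(12, 13, 581.41)
print(classify(ModelSummary(12, 13, 580.06), gold))
print(classify(ModelSummary(11, 13, 570.00), gold))
print(classify(None, gold))
```

Checking the objective first means a solution that passes is never penalized for using a different number of auxiliary variables or constraints.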
#### Results\.
Table [5](https://arxiv.org/html/2605.21751#A2.T5) reports the breakdown on the 248-instance resource allocation eval subset. For all capable models (pass rate > 13%), binding errors account for 60–92% of failures, with modeling errors at 0–3%. This confirms that resource allocation failures are overwhelmingly due to incorrect coefficient transcription, i.e., binding errors.
Table 5: Failure mode breakdown on resource allocation (248 eval-subset instances). Percentages are of total failures per model. Binding errors dominate for all models.

| Model | Pass (%) | Failures | Execution errors (%) | Modeling errors (%) | Binding errors (%) |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 89.9 | 25 | 36.0 | 0.0 | 64.0 |
| GPT-5 | 87.9 | 30 | 16.7 | 0.0 | 83.3 |
| Claude Sonnet 4.6 | 84.7 | 38 | 15.8 | 0.0 | 84.2 |
| DeepSeek-R1 | 80.6 | 48 | 37.5 | 2.1 | 60.4 |
| o4-mini | 80.2 | 49 | 14.3 | 0.0 | 75.5 |
| DeepSeek-V3.2 | 79.0 | 52 | 5.8 | 1.9 | 92.3 |
| Llama3.3-70B | 49.6 | 125 | 25.6 | 3.2 | 71.2 |
| GPT-5-Nano | 49.2 | 126 | 21.4 | 2.4 | 76.2 |
| Qwen2.5-7B | 13.3 | 215 | 22.3 | 14.9 | 62.8 |
#### Case study\.
A representative binding error from GPT-5 on a problem with 12 variables and 13 constraints: the generated code reproduces all variable bounds, the objective function, and 12 of 13 constraints exactly. However, one constraint (“On-Time Delivery Deviation”) contains an extra term 3.16 * x1 that leaked from a different constraint (“Safety Risk Index”), shifting the optimal objective from 581.41 to 580.06. The formulation is correct, but a single misplaced coefficient makes this a binding failure.
#### Sensitivity to evaluation tolerance\.
Many binding errors produce near\-optimal solutions: for Claude Opus 4\.6, 100% of binding failures have relative objective error below 5%; for GPT\-5, 87% are within 5% \(median 1\.5%\)\. Under a relaxed 5% tolerance, these would all pass—but this inflates scores without changing the relative model ranking or the binding\-vs\-modeling conclusion\.
#### Note on Qwen2\.5\-7B\.
Qwen2\.5\-7B has the highest modeling error rate \(14\.9%\) and execution error rate \(22\.3%\) of any model, reflecting insufficient code generation and formulation capability at this scale\. Its remaining failures are still predominantly binding errors \(62\.8%\), consistent with the overall pattern\.
#### Isomorphism validation of passing solutions\.
A potential concern is that passing solutions achieve the correct objective with a structurally different formulation (“correct for the wrong reasons”). We check this by extracting the full Gurobi model (objective, bounds, constraint matrix, RHS, senses) from both gold and generated code, then comparing under a canonical ordering: columns sorted by (objective coefficient, lower bound, upper bound, variable type), rows sorted by (sense, RHS, coefficient vector), with integer bounds normalized to effective integer range and inequality constraints converted to a common sense by negation. Across 1,512 passing solutions from the 9 models above, 90.5% are provably isomorphic under this canonical form. Of the remaining 9.5%, a mutual feasibility check (verifying that each model’s optimal solution satisfies the other’s constraints) confirms 2.6% are algebraically equivalent reformulations. The final 6.8% have different feasible regions but identical optima; manual inspection of a sample reveals these are variable permutations unresolved by our canonical sort and algebraic rewrites (e.g., a constraint $-5.4x \geq -9.34$ rewritten as $x \leq 1.73$). No cases of genuinely different formulations coincidentally matching the objective were found, consistent with the near-zero probability of such coincidence under random continuous coefficients with $10^{-4}$ tolerance.
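A simplified version of the canonical-ordering comparison can be sketched as follows. The dictionary layout and the function `canonical_form` are hypothetical, `>=` rows are negated into `<=` as one possible choice of common sense, and the integer-bound normalization is omitted. Like the procedure described above, it does not resolve ties between columns with identical sort keys or rescaled constraints.

```python
def canonical_form(model, ndigits=6):
    """Return a hashable canonical form of a small LP/MILP description.

    `model` is a dict with per-variable lists "obj", "lb", "ub", "vtype" and
    "rows", a list of (coeffs, sense, rhs) with sense in {"<=", ">=", "="}.
    """
    def r(v):
        return round(v, ndigits) + 0.0  # "+ 0.0" turns -0.0 into 0.0

    n = len(model["obj"])
    col_keys = [
        (r(model["obj"][j]), r(model["lb"][j]), r(model["ub"][j]), model["vtype"][j])
        for j in range(n)
    ]
    # Column order: sort variables by (objective coeff, lower, upper, type).
    order = sorted(range(n), key=lambda j: col_keys[j])
    rows = []
    for coeffs, sense, rhs in model["rows"]:
        if sense == ">=":  # negate so every inequality is written as "<="
            coeffs, sense, rhs = [-a for a in coeffs], "<=", -rhs
        rows.append((sense, r(rhs), tuple(r(coeffs[j]) for j in order)))
    return tuple(col_keys[j] for j in order), tuple(sorted(rows))


gold = {"obj": [3.0, 5.0], "lb": [0, 0], "ub": [10, 10], "vtype": ["C", "C"],
        "rows": [([1.0, 2.0], "<=", 8.0), ([1.0, 0.0], "<=", 3.0)]}
# Same model with the variables swapped and one row written as ">=".
generated = {"obj": [5.0, 3.0], "lb": [0, 0], "ub": [10, 10], "vtype": ["C", "C"],
             "rows": [([0.0, -1.0], ">=", -3.0), ([2.0, 1.0], "<=", 8.0)]}
print(canonical_form(gold) == canonical_form(generated))
```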
#### Why this analysis does not extend to template problems\.
As described in §[A.5](https://arxiv.org/html/2605.21751#A1.SS5), the analysis is less informative for template problems. An alternative approach, automated error classification via LLM, is possible in principle but unreliable at scale due to the binding limitation. For template problems, BIND provides a cleaner diagnostic: categories where BIND recovers most failures are binding-limited, while categories where BIND provides no gain are modeling-limited (§[5.1](https://arxiv.org/html/2605.21751#S5.SS1)).
## Appendix CPrompting Ablation
To investigate whether advanced prompting improves performance, we conducted an ablation study with GPT-5-Nano on the resource allocation subset ($n$=248). Table [6](https://arxiv.org/html/2605.21751#A3.T6) shows that no prompting strategy noticeably improves over the base prompt. Every variant (extra reasoning, explicit warnings, additional focus requirements, second-pass refinement, and one-shot examples) either performs similarly or degrades accuracy, with one-shot examples dropping accuracy to 44.4%. We attribute this to the effective context limit: the bottleneck is not instruction quality but the model’s capacity to faithfully process dense numerical specifications, which prompting alone cannot address.
Table 6: Prompting strategy ablation on GPT-5-Nano (resource allocation, $n$=248).

| Prompting Strategy | Accuracy (%) |
|---|---|
| Base Prompt (with Template) | 49.2 |
| Base + Extra Reasoning | 50.0 |
| Base + Explicit Warnings | 45.7 |
| Base + Additional Focus Requirements | 47.6 |
| Base + Second Pass Refinement | 50.0 |
| One-Shot Example | 44.4 |
## Appendix DRULER Binding Task Implementation Details
We adapt the RULER long\-context benchmark\(Hsieh et al\.,[2024](https://arxiv.org/html/2605.21751#bib.bib9)\)with modifications designed to prevent ceiling effects at larger model sizes\. The specifications are listed below\.
#### Task descriptions\.
- Single-key retrieval (niah_single): retrieve one UUID value associated with a specific key embedded in the haystack, analogous to reading a single parameter.
- Multi-key retrieval (niah_multikey): retrieve UUID values for $N$ distinct keys simultaneously, analogous to binding all coefficients in a constraint.
- Multi-value retrieval (niah_multivalue): recall all $N$ UUID values associated with a single key that appears multiple times, analogous to reading an entire data column.
- Aggregation: locate records scattered across multiple categories and compute a count for a target category, the operation closest to assembling an objective function from distributed data.
#### Task generation\.
Each task generates synthetic prompts at six target context lengths: 1K, 2K, 4K, 8K, 16K, and 32K tokens (measured via the Qwen-2.5 tokenizer). Prompts consist of a *haystack* of expository prose paragraphs from Paul Graham essays, following the original RULER implementation (Hsieh et al., [2024](https://arxiv.org/html/2605.21751#bib.bib9)), with task-specific *needles* (key-value pairs or records) inserted at uniformly random positions. A question requiring extraction of the embedded information is appended at the end. We generate 200 samples per task per context length (1,200 per task, 4,800 total across four tasks).
#### Hardening against easy retrieval\.
The original RULER tasks use unique, easily distinguishable keys\. We introduce three forms of increased difficulty that scale with context length:
- Distractor needles with confusable names. For single-key and multi-value tasks, distractor keys are generated by substituting one character in the target key name (e.g., special\_item\_abcde vs. special\_item\_abcdf). The number of distractors scales as $\max(3, L/1024)$, where $L$ is the target token length.
- Scaled binding complexity. For multi-key and multi-value tasks, the number of target values to retrieve scales as $\max(2, L/2048)$, with $3\times$ distractors per real key. At 32K tokens, the model must retrieve 15 values amidst 45 distractors.
- Category-based aggregation. The aggregation task scatters records across $\max(3, L/4096 + 2)$ categories with $\max(3, L/2048)$ items each. The model must count or sum values for a single target category while ignoring all others.
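The scaling rules above can be tabulated per context length. The following is a minimal sketch, assuming floor division and nominal lengths of 1,000 × K tokens; both are assumptions, chosen because they reproduce the 15 targets and 45 distractors quoted for 32K. The function name and dictionary keys are hypothetical.

```python
def hardening_schedule(target_len: int) -> dict:
    """Difficulty parameters implied by the scaling rules for one context length."""
    n_targets = max(2, target_len // 2048)
    return {
        # Confusable keys for single-key and multi-value tasks.
        "confusable_distractors": max(3, target_len // 1024),
        # Values to retrieve for multi-key and multi-value tasks.
        "targets": n_targets,
        # Three distractors per real key.
        "target_distractors": 3 * n_targets,
        # Aggregation task: number of categories and items per category.
        "agg_categories": max(3, target_len // 4096 + 2),
        "agg_items_per_category": max(3, target_len // 2048),
    }


for k in (1, 2, 4, 8, 16, 32):
    print(f"{k}K", hardening_schedule(k * 1000))
```

The lower bounds in each `max` keep the shortest contexts from degenerating into trivially easy prompts, while the linear terms raise difficulty with length.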
#### Evaluation\.
We evaluate six models from the Qwen2.5-Instruct family (0.5B–32B) using vLLM (Kwon et al., [2023](https://arxiv.org/html/2605.21751#bib.bib13)) with greedy decoding (temperature $= 0$) and a maximum generation length of 128 tokens. We use *strict exact-match* scoring for all tasks with no partial credit. This all-or-nothing criterion is motivated by optimization evaluation, where a single incorrect parameter yields an incorrect result. We did not include closed-source frontier models because their content filters falsely flag these tasks as jailbreaking attempts.
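The all-or-nothing criterion can be sketched as follows. This is an illustration only: the function name is hypothetical, and treating the comparison as order-insensitive and case-insensitive is an assumption, since the scoring script itself is not shown here.

```python
def strict_exact_match(predicted: list[str], gold: list[str]) -> int:
    """Return 1 only if every gold value is recovered and nothing extra is reported."""
    def normalize(values):
        # Assumption: whitespace, case, and ordering are not penalized.
        return sorted(v.strip().lower() for v in values)

    return int(normalize(predicted) == normalize(gold))


# One missing value out of three scores 0, not 2/3.
print(strict_exact_match(["a1", "b2"], ["a1", "b2", "c3"]))
```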
## Appendix E Full TTC and BIND Results
Table 7: Test-Time-Compute: accuracy (%) and total tokens (input + output, in K) per problem across models and methods (550 template problems). Pass@5 represents the parallel upper bound. Repair uses feedback from an oracle verifier (objective value and model structure comparison), representing the sequential upper bound on iterative refinement.

| Model | Pass@1 Acc | Pass@1 Tok | BIND Acc | BIND Tok | Maj. Vote Acc | Maj. Vote Tok | Pass@5 Acc | Pass@5 Tok | Repair@5† Acc | Repair@5† Tok |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 87.6 | 4.4K | 96.2 | 3.2K | 89.6 | 22.0K | 97.8 | 22.0K | 98.4 | 6.5K |
| Claude Opus 4.6 | 86.7 | 4.4K | 98.7 | 3.3K | 88.4 | 21.8K | 96.0 | 21.8K | 98.7 | 6.3K |
| GPT-5 | 86.2 | 4.2K | 95.8 | 3.1K | 91.5 | 21.0K | 95.8 | 21.0K | 95.5 | 6.9K |
| o4-mini | 80.4 | 4.1K | 94.7 | 3.2K | 83.6 | 20.4K | 94.9 | 20.4K | 95.1 | 12.0K |
| DeepSeek-R1 | 72.9 | 3.8K | 84.5 | 2.9K | 78.4 | 18.8K | 93.6 | 18.8K | 92.4 | 7.1K |
| DeepSeek-V3.2 | 72.0 | 4.6K | 87.1 | 3.3K | 69.6 | 22.8K | 89.1 | 22.8K | 88.7 | 10.7K |
| GPT-5-Nano | 59.1 | 3.8K | 82.4 | 3.2K | 61.8 | 19.2K | 82.0 | 19.2K | 78.2 | 12.2K |
| Llama3.3-70B | 35.1 | 4.0K | 46.0 | 3.1K | 33.3 | 20.1K | 51.6 | 20.1K | 50.7 | 19.5K |
| Qwen2.5-7B | 4.5 | 3.6K | 8.9 | 3.2K | 2.7 | 18.0K | 10.5 | 18.0K | 8.9 | 28.0K |

† Repair uses oracle feedback: ground-truth objective value and model structure comparison after each round.
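To make the difference between the parallel methods concrete, here is a minimal sketch of Pass@k and majority voting over k sampled solutions. The voting rule is not spelled out in this appendix, so voting on rounded objective values is an assumption, as are the function names and the tolerance.

```python
from collections import Counter


def _matches(value, reference, tol=1e-6):
    # Relative tolerance, with a floor of 1.0 so references near zero still work.
    return abs(value - reference) <= tol * max(1.0, abs(reference))


def pass_at_k(objectives, reference):
    """1 if any sampled objective value matches the reference, else 0."""
    return int(any(o is not None and _matches(o, reference) for o in objectives))


def majority_vote(objectives, reference, ndigits=6):
    """1 if the most frequent (rounded) objective value matches the reference."""
    votes = Counter(round(o, ndigits) for o in objectives if o is not None)
    if not votes:
        return 0
    winner, _ = votes.most_common(1)[0]
    return int(_matches(winner, reference))


# None stands for a sample whose code failed to produce an objective.
samples = [120.0, 95.0, 95.0, None, 95.0]
print(pass_at_k(samples, 120.0), majority_vote(samples, 120.0))
```

Under this reading, a consistent wrong answer outvotes a single correct sample, which is one way majority voting can fall well below Pass@5 when errors are systematic.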
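Table 8: BIND per-category accuracy (%, n=50 per category). B = accuracy with BIND; Δ = improvement over default (data in prompt). BIND helps most on data-heavy categories for capable models, but cannot fix modeling gaps in weaker models.

| Category | GPT-5 B | GPT-5 Δ | Opus B | Opus Δ | Sonnet B | Sonnet Δ | o4-mini B | o4-mini Δ | DS-R1 B | DS-R1 Δ | DS-V3.2 B | DS-V3.2 Δ | Nano B | Nano Δ | Llama B | Llama Δ | Qwen B | Qwen Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Transp. | 100 | 0 | 100 | +2 | 100 | 0 | 100 | 0 | 94 | −6 | 100 | 0 | 100 | 0 | 100 | +12 | 82 | +44 |
| Disaster | 100 | +14 | 100 | +4 | 100 | +4 | 100 | +6 | 66 | −12 | 88 | −2 | 84 | +22 | 60 | +30 | 0 | 0 |
| JSSP | 100 | +10 | 100 | 0 | 100 | +2 | 100 | +4 | 100 | +4 | 96 | 0 | 100 | +18 | 0 | 0 | 0 | 0 |
| VRPTW | 96 | +26 | 100 | +62 | 92 | +42 | 82 | +48 | 66 | +32 | 48 | +26 | 36 | +34 | 2 | +2 | 0 | 0 |
| RCPSP | 100 | +12 | 100 | +4 | 100 | 0 | 98 | +16 | 94 | +60 | 94 | +32 | 68 | +42 | 2 | +2 | 0 | 0 |
| Fac. Loc. | 100 | 0 | 100 | 0 | 100 | +2 | 100 | +2 | 98 | +4 | 100 | +2 | 100 | +10 | 98 | 0 | 12 | +6 |
| Power T. | 80 | −18 | 86 | −2 | 66 | +2 | 82 | +12 | 66 | +2 | 72 | +18 | 68 | +18 | 12 | −4 | 0 | 0 |
| Queue/St. | 98 | +18 | 100 | +8 | 100 | +2 | 100 | +24 | 100 | +30 | 94 | +28 | 98 | +42 | 30 | +20 | 0 | 0 |
| Stoch. T. | 96 | +26 | 100 | +38 | 100 | +38 | 100 | +34 | 66 | +6 | 76 | +58 | 88 | +56 | 22 | +16 | 0 | 0 |
| M-Obj T. | 84 | +14 | 100 | +12 | 100 | +2 | 80 | +12 | 86 | +10 | 90 | +4 | 72 | +12 | 86 | +44 | 2 | −2 |
| Mod. FL | 100 | +4 | 100 | +4 | 100 | 0 | 100 | 0 | 94 | −2 | 100 | 0 | 92 | +2 | 94 | −2 | 2 | 0 |
| Avg | 95.8 | +9.6 | 98.7 | +12.0 | 96.2 | +8.6 | 94.7 | +14.4 | 84.5 | +11.6 | 87.1 | +15.1 | 82.4 | +23.3 | 46.0 | +10.9 | 8.9 | +4.4 |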
## Appendix F Case Study: Binding Failures in Transportation Problems
We illustrate the binding bottleneck with Qwen2\.5\-7B on a simple transportation LP \(trans\_001\)\. The problem has 7 sources and 6 destinations with supply, demand, and cost data specified in the prompt\.
Default (data in prompt): FAIL. The model correctly identifies the LP structure (continuous variables, supply constraints, demand $=$ constraints, minimize cost) and accurately copies the $7 \times 6$ cost matrix. However, it replaces all supply capacities with a uniform value of 100:
```
# Actual supply: [94, 47, 50, 55, 67, 37, 69]
# Qwen generates:
m.addConstr(quicksum(x[i,j] for j in range(6)) <= 100,
f"source_capacity_{i}")
# Actual demand: [14, 47, 21, 70, 72, 58]
# Qwen generates:
m.addConstr(quicksum(x[i,j] for i in range(7)) == 100,
f"destination_requirement_{j}")
```
This error pattern recurs across all 31 transportation failures from Qwen2.5-7B, although in some cases the model gets part of the numbers right.
BIND (data offloaded to file): PASS. With BIND, numerical data is externalized to a JSON file. The model generates code that *reads* rather than *transcribes* the values:
```
with open(INSTANCE_DATA_PATH, "r") as f:
    d = json.load(f)
# Supply constraint: reads d['supplies'][i] from file
m.addConstr(quicksum(x[i,j] for j in range(d['num_destinations']))
            <= d['supplies'][i], name=f"supply_{i}")
# Demand constraint: reads d['demands'][j] from file
m.addConstr(quicksum(x[i,j] for i in range(d['num_sources']))
            == d['demands'][j], name=f"demand_{j}")
```
As we see above, BIND raises Qwen2.5-7B from 38% to 82% on transportation (+44pp) by eliminating the need to transcribe numerical values, confirming that binding is the bottleneck for this category.
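The read-instead-of-transcribe pattern can be reproduced without a solver. The sketch below uses the supply and demand values from the `trans_001` excerpt; the file name, the JSON key names beyond those visible in the excerpt, and the tuple representation of constraints are assumptions made for illustration.

```python
import json
import os
import tempfile

# Supply and demand values from the trans_001 example.
instance = {
    "num_sources": 7,
    "num_destinations": 6,
    "supplies": [94, 47, 50, 55, 67, 37, 69],
    "demands": [14, 47, 21, 70, 72, 58],
}

# Step 1: externalize the numbers to a file instead of placing them in the prompt.
path = os.path.join(tempfile.mkdtemp(), "instance.json")
with open(path, "w") as f:
    json.dump(instance, f)

# Step 2: generated code binds values by reading them back from the file.
with open(path) as f:
    d = json.load(f)

S, D = d["num_sources"], d["num_destinations"]
# Each constraint is (variable indices, sense, right-hand side).
supply_rows = [([(i, j) for j in range(D)], "<=", d["supplies"][i]) for i in range(S)]
demand_rows = [([(i, j) for i in range(S)], "==", d["demands"][j]) for j in range(D)]

print(supply_rows[0][1:], demand_rows[0][1:])
```

Because every right-hand side is indexed out of the loaded data, a uniform placeholder such as 100 cannot appear unless it is present in the file itself.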
### F.1 BIND Regression on Power Transmission
Power transmission is the only category where BIND causes a notable regression for GPT-5 (−18pp, from 98% to 80%). We analyze all BIND-induced failures. The dominant error is a spurious unit-conversion factor in the loss coefficient:
```
# WRONG: spurious 1e6 inflates loss cost
loss_coef = loss_cost_rate * R * (1e6 / (V_kV ** 2))
# CORRECT:
loss_coef = loss_cost_rate * R / (V_kV ** 2)
```
The model reasons “power is in MW, so multiply by $10^{6}$ to convert to W,” but this double-counts the conversion since the kV denominator already absorbs the scaling.
We attribute this mainly to hint loss relative to the original in-prompt data. Once the data values are no longer shown in the prompt, the model must reconstruct the unit-conversion chain from the schema alone, which leads to the spurious factor.
Most models improve or stay flat under BIND on this task (e.g., GPT-5-Nano: +18pp, DeepSeek-V3.2: +18pp, o4-mini: +12pp), suggesting the regression is model-specific rather than a systematic limitation. GPT-5’s high baseline (98%) appears to rely on in-context physics reasoning that BIND disrupts, making it uniquely sensitive to hint loss when data must be interpreted to derive coefficients rather than passed through directly.
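A unit check makes the double-counting explicit. This is a sketch that assumes the loss term has the standard resistive form $I^{2}R$, with line flow $P$ in MW, voltage $V$ in kV, and resistance $R$ in ohms; the exact loss model used by the benchmark is not stated here.

```latex
P_{\text{loss}} = I^{2} R = \left(\frac{P}{V}\right)^{2} R,
\qquad
\frac{(1\,\text{MW})^{2}}{(1\,\text{kV})^{2}} \cdot 1\,\Omega
  = \frac{(10^{6}\,\text{W})^{2}}{(10^{3}\,\text{V})^{2}} \cdot 1\,\Omega
  = 10^{6}\,\text{A}^{2} \cdot 1\,\Omega
  = 10^{6}\,\text{W}
  = 1\,\text{MW}
```

Under these assumptions, $P^{2}R/V^{2}$ with MW and kV inputs is already expressed in MW, so an additional factor of $10^{6}$ rescales the loss cost unless the cost rate is quoted per watt.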
## Appendix G Training Details
### G.1 Two-Phase Pipeline
We decompose the optimization solving task into two phases:
1. Phase 1 (Binding): A fine-tuned model extracts all decision variables, constraints, and objective function parameters from the natural language problem description into structured JSON.
2. Phase 2 (Solve): A deterministic template loads the extracted JSON and constructs a Gurobi optimization model programmatically; no LLM is needed.
This decomposition isolates*binding*—the mapping from unstructured text to structured mathematical parameters—as the sole task requiring learned reasoning\.
### G.2 Training Data
We construct binding supervision from the resource\_allocation training split (train\_2\_11), which contains problems with 2–11 decision variables. Each training example pairs a natural language problem description (input) with the corresponding structured JSON extraction (output). The JSON schema includes:
- goal: optimization direction (MINIMIZE or MAXIMIZE)
- variables: list of decision variables with name, type, bounds, and objective coefficient
- constraints: list of constraints with coefficients, sense ($\leq$, $\geq$, $=$), and right-hand side
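As an illustration of what a Phase 1 output and the first step of a Phase 2 template might look like, here is a minimal sketch. Only the top-level keys `goal`, `variables`, and `constraints` come from the schema above; the nested field names (`lb`, `ub`, `obj`, `coefficients`, `rhs`), the toy problem, and the `to_matrix_form` helper are hypothetical.

```python
import json

# Hypothetical binding output for a two-variable problem.
binding_json = """
{"goal": "MAXIMIZE",
 "variables": [
   {"name": "x", "type": "CONTINUOUS", "lb": 0, "ub": null, "obj": 3.0},
   {"name": "y", "type": "INTEGER", "lb": 0, "ub": 10, "obj": 5.0}],
 "constraints": [
   {"coefficients": {"x": 1.0, "y": 2.0}, "sense": "<=", "rhs": 14.0},
   {"coefficients": {"x": 3.0, "y": -1.0}, "sense": ">=", "rhs": 0.0}]}
"""


def to_matrix_form(spec: dict):
    """Convert a binding spec into (names, c, A, senses, b) for a solver template."""
    names = [v["name"] for v in spec["variables"]]
    c = [v["obj"] for v in spec["variables"]]
    # Variables absent from a constraint get coefficient 0.
    A = [[con["coefficients"].get(n, 0.0) for n in names] for con in spec["constraints"]]
    senses = [con["sense"] for con in spec["constraints"]]
    b = [con["rhs"] for con in spec["constraints"]]
    return names, c, A, senses, b


spec = json.loads(binding_json)
names, c, A, senses, b = to_matrix_form(spec)
print(spec["goal"], names, c, A, senses, b)
```

Since the conversion is deterministic, any error in the final model can be traced back to the extracted JSON rather than to solver code.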
The dataset contains 429 examples \(387 train / 42 validation after a 90/10 split\), with a roughly uniform distribution across variable counts \(30–51 examples per variable count from 2 to 11\)\. Average input length is approximately 4,200 characters; average output length is approximately 2,400 characters\.
### G.3 Model Configurations
We train two binding specialists via full-parameter supervised fine-tuning (SFT) using LLaMA-Factory (Zheng et al., [2024](https://arxiv.org/html/2605.21751#bib.bib31)):

Table 9: Binding model training hyperparameters.

| Hyperparameter | 1.5B Binder | 7B Binder |
|---|---|---|
| Base model | Qwen2.5-1.5B-Instruct | Qwen2.5-7B-Instruct |
| Fine-tuning type | Full | Full |
| Epochs | 6 | 6 |
| Learning rate | $1 \times 10^{-5}$ | $1 \times 10^{-5}$ |
| LR scheduler | Cosine | Cosine |
| Warmup ratio | 0.1 | 0.1 |
| Per-device batch size | 2 | 1 |
| Gradient accumulation | 4 | 8 |
| Effective batch size | 8 | 16 |
| Max sequence length | 8,192 | 8,192 |
| Precision | bf16 | bf16 |
| DeepSpeed | not used | ZeRO Stage 3 |
| Hardware | 1× A100 80GB | 2× A100 80GB |
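The effective batch sizes in Table 9 follow from the usual product of per-device batch size, gradient accumulation steps, and number of GPUs. The worked arithmetic below assumes one data-parallel process per GPU:

```latex
B_{\text{eff}} = B_{\text{device}} \times N_{\text{accum}} \times N_{\text{GPU}}
\qquad
\text{1.5B: } 2 \times 4 \times 1 = 8
\qquad
\text{7B: } 1 \times 8 \times 2 = 16
```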
### G.4 End-to-End SFT Baseline
For comparison, the end-to-end SFT baseline fine-tunes Qwen2.5-7B-Instruct to directly generate Gurobi solver code from problem descriptions (no intermediate binding step). It is trained on 366 examples from the same variable range (2–11 vars) for 3 epochs with identical learning rate ($1 \times 10^{-5}$), cosine schedule, and DeepSpeed ZeRO-3 configuration.
### G.5 GRPO Training
We additionally train via Group Relative Policy Optimization (GRPO) using verl (Sheng et al., [2025](https://arxiv.org/html/2605.21751#bib.bib22)). GRPO uses outcome-based rewards from executing generated code against the Gurobi solver, avoiding the need for a learned reward model. Table [10](https://arxiv.org/html/2605.21751#A7.T10) summarizes the configuration.

Table 10: GRPO training hyperparameters.

| Hyperparameter | Value |
|---|---|
| Base model | Qwen2.5-7B-Instruct |
| Algorithm | GRPO |
| Training data | train\_2\_11 (vars 2–11) |
| Train batch size | 8 |
| Max prompt length | 4,096 |
| Max response length | 4,096 |
| Group size ($n$) | 5 |
| Learning rate | $1 \times 10^{-6}$ |
| KL loss | Low-variance KL ($\beta = 0.001$) |
| KL in reward | No |
| Entropy coefficient | 0 |
| Advantage normalization | By std (GRPO default) |
| Rollout engine | vLLM (TP=2) |
| Parallelism strategy | FSDP2 (param + optimizer offload) |
| Total epochs | 15 |
| Save frequency | Every 20 steps |
| Hardware | 4× A100 80GB |
| Precision | bf16 |

We experiment with three GRPO variants: (1) a standard binary reward (1 if the generated code produces the correct optimal objective, 0 otherwise), (2) an adaptive curriculum sampler that adjusts the sampling distribution across difficulty levels based on an exponential moving average of per-level solve rates ($\alpha = 0.3$, floor weight $= 0.05$), and (3) a *partial-reward* variant that replaces the sparse binary signal with a hierarchical continuous-credit reward function.
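The adaptive curriculum sampler can be sketched as below. Only the EMA coefficient (0.3) and the floor weight (0.05) come from the text; the mapping from solve rate to sampling weight is not specified, so the `p * (1 - p)` rule, the initial rate of 0.5, and the class and method names are assumptions made for illustration.

```python
import random


class AdaptiveCurriculumSampler:
    """Sample difficulty levels using an EMA of per-level solve rates."""

    def __init__(self, levels, alpha=0.3, floor=0.05, init_rate=0.5):
        self.alpha = alpha
        self.floor = floor
        self.rate = {lvl: init_rate for lvl in levels}

    def update(self, level, batch_solve_rate):
        # Exponential moving average: alpha * observation + (1 - alpha) * old.
        old = self.rate[level]
        self.rate[level] = self.alpha * batch_solve_rate + (1 - self.alpha) * old

    def weights(self):
        # Assumed rule: p * (1 - p) peaks where a level is neither solved nor
        # hopeless; the floor keeps every level reachable.
        raw = {l: max(self.floor, p * (1 - p)) for l, p in self.rate.items()}
        total = sum(raw.values())
        return {l: w / total for l, w in raw.items()}

    def sample(self, rng=random):
        w = self.weights()
        return rng.choices(list(w), weights=list(w.values()), k=1)[0]


sampler = AdaptiveCurriculumSampler(levels=range(2, 12))
sampler.update(2, 1.0)  # level 2 was fully solved in the last batch
print(sampler.rate[2], sampler.sample())
```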
#### Partial\-reward function\.
The partial-reward variant addresses the sparse-reward problem inherent in binary outcome-based RL: most rollouts for hard problems receive zero reward, providing no gradient signal. We design a hierarchical reward $r \in [0,1]$ that awards incremental credit at successive gates:
1. Code extraction (+0.05): valid Python/Gurobi code is parsed from the model output.
2. Execution (+0.10): the extracted code executes without runtime error.
3. Solver status (+0.10 if optimal; +0.05 if feasible but not optimal): the Gurobi solver reaches a meaningful termination status.
4. Variable-count match (+0.05): the number of decision variables in the generated model equals the reference.
5. Constraint satisfaction (+0.20, continuous): the fraction of reference constraints satisfied by the generated solution, evaluated by substituting generated variable values into the ground-truth constraint matrix.
6. Objective closeness (+0.20, continuous): $\exp(-\alpha \cdot \text{rel\_gap})$, where $\text{rel\_gap} = |z_{\text{gen}} - z^{*}| / (|z^{*}| + 10^{-6})$ and $\alpha = 10$, awarding near-full credit for small deviations and decaying smoothly for larger gaps.

An exact solution (objective and all variable values within $10^{-4}$ of the reference) overrides the partial score and receives $r = 1.0$. All other hyperparameters (Table [10](https://arxiv.org/html/2605.21751#A7.T10)) remain identical across the three GRPO variants.
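The six gates can be written as a single function. This is a sketch under stated assumptions: the gates are treated as sequential with early exit when extraction, execution, or solver status fails, and the two continuous terms scale their 0.20 weight. The function signature and status strings are hypothetical.

```python
import math


def partial_reward(code_extracted, executed, status, var_count_match,
                   constraint_frac, z_gen, z_ref, exact, alpha=10.0):
    """Hierarchical reward in [0, 1] following the six gates listed above."""
    if exact:
        return 1.0  # exact solutions override the partial score
    r = 0.0
    if not code_extracted:
        return r
    r += 0.05
    if not executed:
        return r
    r += 0.10
    if status == "optimal":
        r += 0.10
    elif status == "feasible":
        r += 0.05
    else:
        return r  # no solution available for the remaining gates
    if var_count_match:
        r += 0.05
    r += 0.20 * constraint_frac
    rel_gap = abs(z_gen - z_ref) / (abs(z_ref) + 1e-6)
    r += 0.20 * math.exp(-alpha * rel_gap)
    return r


# Runs and solves to optimality, but the objective is off by 10%.
print(partial_reward(True, True, "optimal", True, 0.8, 110.0, 100.0, False))
```

Under this reading, the largest score reachable without an exact match is 0.70, so exact solutions remain separated from partial ones by a margin of 0.30.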
#### GRPO results\.
Table [11](https://arxiv.org/html/2605.21751#A7.T11) compares all GRPO variants against the Qwen2.5-7B-Instruct baseline (zero-shot) on the 248-problem resource allocation eval set.

Table 11: GRPO variant results on resource allocation (248 eval problems, vars 2–20). In-distribution: vars $\leq 11$; OOD: vars 12–20.

| Model | Overall | In-dist ($\leq 11$) | OOD (12–20) |
|---|---|---|---|
| Qwen2.5-7B-Instruct (zero-shot) | 14.5% (36/248) | 27.3% (36/132) | 0.0% (0/116) |
| GRPO (binary reward) | 44.0% (109/248) | 76.5% (101/132) | 6.9% (8/116) |
| GRPO + adaptive curriculum | 44.8% (111/248) | 80.3% (106/132) | 4.3% (5/116) |
| GRPO + partial reward | 30.2% (75/248) | 56.8% (75/132) | 0.0% (0/116) |

All three GRPO variants substantially improve over the zero-shot baseline. The binary-reward and adaptive-curriculum variants perform comparably (44.0% vs. 44.8%), with the curriculum sampler providing a marginal gain by focusing training on difficulty levels where the model can still learn. The partial-reward variant underperforms at 30.2%, suggesting that the dense but noisy intermediate credit signal may encourage the model to satisfy partial gates (code extraction, execution, feasibility) without converging to fully correct solutions. This would be a form of reward hacking in which the model exploits the hierarchical structure to collect partial credit rather than optimizing for exact correctness. No GRPO variant achieves meaningful OOD generalization beyond 11 variables.
### G.6 Evaluation
All models are evaluated on the full eval/ split containing 248 problems with 2–20 decision variables. Problems with 12–20 variables are out-of-distribution (OOD), testing generalization beyond the training range. Inference uses vLLM with greedy decoding (temperature 0, top-$p$ = 1) and a maximum generation length of 4,096 tokens.
### G.7 Per-Complexity Breakdown
Figure [6](https://arxiv.org/html/2605.21751#A7.F6) shows accuracy as a function of problem size (number of variables and constraints) for each training approach. The red horizontal line marks the maximum number of variables seen during training ($\leq 11$); problems above this line are out-of-distribution (OOD).

Figure 6: Accuracy heatmaps by problem size (number of variables vs. constraints) for each training approach on resource allocation (248 eval problems). The red line marks the maximum number of variables seen during training; problems above it are out-of-distribution. Yellow = 100% accuracy, purple = 0%.

The heatmaps reveal several distinct generalization patterns. All approaches learn a sharp in-distribution boundary: accuracy is near-perfect (yellow) below the red line and collapses almost entirely (purple) above it, indicating that none of the training regimes generalize binding to larger problem sizes. The 7B binding specialist shows the cleanest in-distribution coverage, while SFT 7B end-to-end exhibits scattered failures even on seen problem sizes, consistent with imperfect binding under joint training. GRPO shows the most irregular in-distribution pattern, with high variance across cells of similar complexity, reflecting the difficulty of learning precise coefficient transcription from a sparse, binary reward.
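A per-cell breakdown of this kind can be computed from per-problem results as sketched below. The record layout and function names are assumptions, and the toy records are illustrative rather than taken from the evaluation.

```python
from collections import defaultdict


def accuracy_grid(results):
    """Accuracy per (n_vars, n_constraints) cell from (n_vars, n_cons, correct) records."""
    hits, totals = defaultdict(int), defaultdict(int)
    for n_vars, n_cons, correct in results:
        totals[(n_vars, n_cons)] += 1
        hits[(n_vars, n_cons)] += int(correct)
    return {cell: hits[cell] / totals[cell] for cell in totals}


def split_accuracy(results, max_train_vars=11):
    """Accuracy at or below, and above, the largest variable count seen in training."""
    groups = {"in_dist": [], "ood": []}
    for n_vars, _, correct in results:
        groups["in_dist" if n_vars <= max_train_vars else "ood"].append(int(correct))
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}


# Illustrative records only.
toy = [(3, 2, True), (3, 2, False), (11, 5, True), (12, 6, False)]
print(accuracy_grid(toy))
print(split_accuracy(toy))
```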
### G.8 Multi-Category Binding Specialists
We extend the binding hypothesis to two additional OR problem categories: transportation (LP) and JSSP (MILP). Both use Qwen2.5-7B-Instruct with the same training configuration as the resource allocation binder (Table [9](https://arxiv.org/html/2605.21751#A7.T9)): full-parameter SFT, 6 epochs, lr = $1 \times 10^{-5}$, cosine schedule, ZeRO-3 on 2× A100 80GB. Phase 2 uses deterministic template code that loads the extracted JSON and constructs a Gurobi model programmatically.
#### Training data\.
Transportation: 244 instances with $\text{sources} \times \text{destinations} \leq 52$. JSSP: 224 instances with $n_{\text{jobs}} \leq 5$. Both are drawn from the respective Template\_train/ splits.
#### Evaluation\.
50 problems per category. Transportation: 26 in-distribution ($\text{sources} \times \text{destinations} \leq 52$), 24 OOD (up to $25 \times 25 = 625$ variables). JSSP: 25 in-distribution (jobs $\leq 5$), 25 OOD (jobs 6–13, up to 52 operations). The ground-truth (GT) template upper bound achieves 100% on both categories, confirming that binding is the sole bottleneck.
#### JSSP results\.
The binding specialist achieves 100% overall \(50/50\), including 100% OOD \(25/25\)\. End\-to\-end SFT achieves 96\.0% \(48/50\), with 2 OOD failures from code generation errors\.
#### Transportation results\.
The binding specialist achieves 96.0% overall (48/50) vs. 88.0% for end-to-end SFT (44/50), with OOD accuracy of 91.7% vs. 75.0%. A key implementation detail: the binding target uses *compact* JSON (no indentation, separators=(',', ':')), which reduces output token length by ${\sim}3\times$ compared to indented JSON for large cost matrices. In an initial experiment using indented JSON, the binding specialist achieved only 80.0% vs. 90.0% for end-to-end SFT; the token-length disadvantage caused attention copy errors on OOD instances.
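The effect of the serialization choice is easy to see on a synthetic matrix. The sketch below compares character counts rather than token counts, and the 20 × 20 cost matrix is made up for illustration; `separators` and `indent` are standard arguments of `json.dumps`.

```python
import json

# Illustrative 20 x 20 cost matrix; the values are made up.
costs = [[(7 * i + 3 * j) % 50 + 1 for j in range(20)] for i in range(20)]
payload = {"costs": costs}

# Indented output places every matrix entry on its own line.
indented = json.dumps(payload, indent=2)
# Compact output drops all optional whitespace.
compact = json.dumps(payload, separators=(",", ":"))

print(len(indented), len(compact))
```

Both strings decode to the same object, so the compact form shortens the target sequence without losing information.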
#### Description format robustness\.
We additionally evaluate both approaches with prose descriptions (per-source sentences with randomized destination ordering, no tables). Both models degrade proportionally: JSSP drops ${\sim}20$pp OOD for both (binding: 60%, end-to-end: 64%); transportation drops similarly. The parallel degradation confirms that the binding bottleneck is output sequence length, not input description complexity.
### G.9 OOD Cliff-Shift Experiment
To test whether the OOD gap reflects limited training coverage or a fundamental extraction limit, we train a second 7B binding specialist on vars 2–15 (565 examples, same configuration as Table [9](https://arxiv.org/html/2605.21751#A7.T9)) and evaluate on the same 248-problem eval set.
Table 12: Effect of training coverage on binding specialist accuracy (resource allocation, 248 eval problems). The OOD cliff shifts from var=11 to var=15, with no regression on the original in-distribution range.

| Training range | Model | Overall | In-dist | OOD |
|---|---|---|---|---|
| vars 2–11 | 7B binding specialist | 58.1% (144/248) | 99.2% (131/132, $\leq 11$) | 11.2% (13/116, 12–20) |
| vars 2–11 | 7B end-to-end SFT | 51.2% (127/248) | 88.6% (117/132, $\leq 11$) | 8.6% (10/116, 12–20) |
| vars 2–15 | 7B binding specialist | 75.4% (187/248) | 91.8% (169/184, $\leq 15$) | 28.1% (18/64, 16–20) |
| vars 2–15 | 7B end-to-end SFT | 57.3% (142/248) | 75.5% (139/184, $\leq 15$) | 4.7% (3/64, 16–20) |
| Any | Ground truth template | 100% | 100% | 100% |

Three findings emerge. First, the cliff shifts: overall accuracy improves from 58.1% to 75.4%, and the binding specialist’s advantage over end-to-end SFT widens from +6.9pp to +18.1pp. Second, accuracy on vars 2–11 is preserved, ruling out catastrophic forgetting. Third, vars 12–15 reach 80.8%, below the 99.2% achieved by the original model on vars 2–11; this reflects the genuine difficulty of extracting 50+ coefficients from longer prose, not a training artifact. These results confirm that the OOD gap reflects training coverage, not a fundamental limit of binding SFT.