TriVAL: A Tri-Validation Framework for Faithful Automatic Optimization Modeling
Summary
TriVAL introduces a tri-validation framework that performs explicit validation at three stages of automatic optimization modeling (semantic specification, mathematical formulation, code generation) to improve faithfulness, and also presents NL4COP, a new benchmark for combinatorial optimization problems.
# TriVAL: A Tri-Validation Framework for Faithful Automatic Optimization Modeling
Source: [https://arxiv.org/html/2605.23966](https://arxiv.org/html/2605.23966)
Ziyang Fang, JinXi Wang, Jinghui Zhong, and Yew\-Soon Ong

\(Corresponding author: Jinghui Zhong\.\) Ziyang Fang, JinXi Wang, and Jinghui Zhong are with the School of Computer Science and Engineering, South China University of Technology, Guangzhou 510000, China \(e\-mail: 202520143514@mail\.scut\.edu\.cn, 202511095364@mail\.scut\.edu\.cn, jinghuizhong@scut\.edu\.cn\)\. Yew\-Soon Ong is with the Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore 138632, and also with the College of Computing and Data Science, Nanyang Technological University, Singapore 639798 \(e\-mail: asysong@ntu\.edu\.sg\)\.
###### Abstract
Optimization modeling serves as the pivotal bridge between natural\-language problem descriptions and optimization solvers, and remains a cornerstone for bringing operations research \(OR\) into real\-world decision making\. Recent advances in large language models \(LLMs\) have driven significant progress in automatic optimization modeling\. However, existing methods still lack explicit validation during the modeling process, allowing errors introduced in earlier stages to carry through the pipeline and ultimately reduce final modeling accuracy\. To address this challenge, we introduce TriVAL, a tri\-validation framework that performs explicit validation at three stages of automatic optimization modeling: semantic specification, mathematical formulation, and code generation\. At each stage, TriVAL follows a construct\-validate\-revise loop that assesses the current result against stage\-specific criteria and revises it when needed\. This design helps identify and correct errors before they accumulate across stages, helping preserve faithfulness throughout the modeling process\. To evaluate automatic optimization modeling on more challenging combinatorial problems, we further introduce NL4COP, a benchmark of 150 instances across 50 diverse problem types with more complex decision logic, more tightly coupled constraints, and more demanding modeling requirements than existing benchmarks\. Experiments on NL4COP and established benchmarks show that TriVAL consistently outperforms state\-of\-the\-art methods, with the largest gains on the most challenging problems\.
## I Introduction
Optimization modeling constitutes the essential conduit for translating real\-world decision problems into formal optimization models; nevertheless, it has long been the primary bottleneck preventing the large\-scale deployment of operations research \(OR\) by non\-specialist users\. This bottleneck is particularly consequential in domains such as supply chain planning\[[21](https://arxiv.org/html/2605.23966#bib.bib2)\], production scheduling\[[34](https://arxiv.org/html/2605.23966#bib.bib3)\], and transportation\[[25](https://arxiv.org/html/2605.23966#bib.bib4)\], where the need for optimization modeling far exceeds the availability of expert modelers\[[33](https://arxiv.org/html/2605.23966#bib.bib8)\]\. In practice, optimization problems are rarely presented as fully specified mathematical programs\. They are typically described in natural language, often with intricate decision logic, tightly coupled requirements, and assumptions that are left implicit\. Constructing a correct mathematical model from such descriptions requires far more than direct formalization—it demands abstraction, semantic understanding, and careful modeling judgment\. Advancing this capability matters both for research on automatic optimization modeling and for extending the practical reach of OR beyond expert modelers\[[35](https://arxiv.org/html/2605.23966#bib.bib32),[29](https://arxiv.org/html/2605.23966#bib.bib33)\]\.
Automatic optimization modeling is difficult because correctness must be maintained across a sequence of dependent modeling stages: understanding the problem, formulating it, and generating code\[[11](https://arxiv.org/html/2605.23966#bib.bib19),[2](https://arxiv.org/html/2605.23966#bib.bib9),[20](https://arxiv.org/html/2605.23966#bib.bib13)\]\. Each stage produces a result that constrains the next; a flaw at any earlier stage can propagate forward; later results may still be internally coherent while disconnected from the original problem\. Thus, the true bottleneck in automatic optimization modeling is no longer just the generation of code, but the robust containment of errors across dependent modeling stages to preserve faithfulness to the original problem\.
Recent large language models \(LLMs\) have significantly advanced automatic optimization modeling\[[35](https://arxiv.org/html/2605.23966#bib.bib32),[29](https://arxiv.org/html/2605.23966#bib.bib33),[32](https://arxiv.org/html/2605.23966#bib.bib30)\]\. As optimization modeling tasks become more complex, existing work has increasingly adopted multi\-step strategies to improve modeling capability\. Current approaches can be broadly categorized into two main lines: learning\-based methods and prompt\-based methods\. Learning\-based methods enhance optimization modeling capability through large\-scale data synthesis and model training\[[9](https://arxiv.org/html/2605.23966#bib.bib23),[11](https://arxiv.org/html/2605.23966#bib.bib19),[16](https://arxiv.org/html/2605.23966#bib.bib22),[4](https://arxiv.org/html/2605.23966#bib.bib24)\], whereas prompt\-based methods improve inference\-time modeling through multi\-step decomposition and the incorporation of modeling knowledge\[[30](https://arxiv.org/html/2605.23966#bib.bib18),[36](https://arxiv.org/html/2605.23966#bib.bib10),[2](https://arxiv.org/html/2605.23966#bib.bib9),[27](https://arxiv.org/html/2605.23966#bib.bib12),[40](https://arxiv.org/html/2605.23966#bib.bib16)\]\. Both directions have improved the quality of generated models\. However, while these approaches enhance multi\-step generation, they typically treat intermediate results from earlier stages as inherently reliable, giving much less attention to explicitly validating results before they are used in later stages\. Without explicit validation, early hallucinations, misinterpretations, or faulty reasoning can easily embed themselves into stage results\. Because later stages directly build on these inputs, they merely translate already\-flawed modeling decisions into code\. The final program may thus execute perfectly yet solve the wrong problem\. 
To mitigate semantic drift, automatic optimization modeling must explicitly validate key stage results, catching defects before they become difficult to correct later\.
Building on this observation, we propose TriVAL, a dedicated tri\-validation framework for automatic optimization modeling that organizes the modeling process around explicit validation of intermediate modeling results\. It introduces three validations targeting distinct stages of the modeling process: semantic, formulation, and code\. For each target, TriVAL follows a construct\-validate\-revise loop: it first constructs the current modeling result, then validates it against criteria tailored to the specific risk at that stage, and revises it when necessary before proceeding\. Through this design, TriVAL contains early errors before they propagate, safeguarding the correct mapping from problem to solution throughout the modeling process\.
The detrimental impact of error propagation is largely masked in existing LP and MILP benchmarks, where problem logic and formulation requirements are relatively straightforward\. However, in complex combinatorial settings with greater modeling difficulty, errors in problem understanding and formulation become more pervasive, making explicit validation critically important\. To evaluate optimization modeling where faithfulness is difficult to preserve, we introduce NL4COP, a new benchmark comprising 150 instances across 50 diverse combinatorial problem types with more complex decision logic and greater modeling difficulty than existing benchmarks\. Experiments on NL4COP and established benchmarks show that TriVAL consistently outperforms state\-of\-the\-art methods, with the largest gains on the most challenging tasks\.
The main contributions of this work are threefold:
1. We propose TriVAL, a dedicated tri\-validation framework for automatic optimization modeling that organizes the modeling process around explicit validation of three key modeling results: semantic specification, mathematical formulation, and generated code\. Each result is evaluated and, when necessary, revised before the process proceeds, helping contain error propagation and preserve faithfulness throughout the modeling process\.
2. We introduce NL4COP, a new benchmark for automatic optimization modeling on challenging combinatorial problems\. It comprises 50 problem types and 150 instances with complex decision logic and tightly coupled constraints, providing a more challenging and discriminative evaluation than existing benchmarks\.
3. Through extensive experiments on NL4COP and established benchmarks, we show that TriVAL consistently outperforms state\-of\-the\-art optimization\-modeling methods, with the largest gains on the most challenging problems\. The results further highlight NL4COP as a challenging and discriminative benchmark for demonstrating the benefits of explicit validation\.
The remainder of this paper is organized as follows\. Section [II](https://arxiv.org/html/2605.23966#S2) reviews related work and clarifies our positioning\. Section [III](https://arxiv.org/html/2605.23966#S3) presents the proposed TriVAL framework\. Section [IV](https://arxiv.org/html/2605.23966#S4) introduces the NL4COP benchmark and reports the experimental evaluation\. Finally, Section [V](https://arxiv.org/html/2605.23966#S5) concludes the paper\.
## II Related Work
Automatic optimization modeling translates natural\-language problem descriptions into formal formulations and executable code for mathematical modeling frameworks and solvers such as CVXPY\[[6](https://arxiv.org/html/2605.23966#bib.bib7)\], Gurobi\[[8](https://arxiv.org/html/2605.23966#bib.bib5)\], and OR\-Tools\[[23](https://arxiv.org/html/2605.23966#bib.bib6)\]\. Recent literature applying LLMs to this task generally pursues two main directions\. The first direction trains specialized models via data synthesis, supervised fine\-tuning, or alignment training\. The second direction guides general\-purpose models using prompt engineering, reasoning frameworks, or external knowledge injection\. After reviewing these foundational approaches, we discuss the literature most closely aligned with our contribution: the evaluation of generated outputs and explicit optimization validation\.
### II\-A Learning\-Based Methods for Optimization Modeling
Learning\-based methods primarily advance automatic optimization modeling via supervised fine\-tuning or alignment training, leveraging optimization\-oriented data and feedback\. One prominent line of work focuses on constructing structured datasets to teach models to produce accurate formulations\. For instance, ORLM\[[9](https://arxiv.org/html/2605.23966#bib.bib23)\] and LLMOPT\[[11](https://arxiv.org/html/2605.23966#bib.bib19)\] design multi\-element schemas that link natural\-language descriptions, mathematical formulations, and code, enabling instruction tuning for optimization modeling\. Alternatively, ReSocratic\[[37](https://arxiv.org/html/2605.23966#bib.bib20)\] and OptMATH\[[16](https://arxiv.org/html/2605.23966#bib.bib22)\] employ formulation\-centered synthesis, generating data bidirectionally between mathematical expressions and problem descriptions to expand training diversity with controllable complexity\.
A complementary direction incorporates verifiable solver signals directly into model alignment and inference\. SIRL\[[4](https://arxiv.org/html/2605.23966#bib.bib24)\] treats optimization solvers as reward sources, using executable code and formal models to guide reinforcement\-learning\-based alignment\. Building upon this verifiable feedback loop, OptiMind\[[41](https://arxiv.org/html/2605.23966#bib.bib21)\] applies solver signals at test time, combining class\-based error summaries with self\-consistency to reduce common modeling mistakes\.
While these learning\-driven approaches successfully produce higher\-quality formulations and executable code, they remain fundamentally generation\-focused\. They tend to treat results from earlier stages as inherently reliable, lacking explicit mechanisms to examine key modeling results before they are used in later stages\.
### II\-B Prompt\-Based Methods for Optimization Modeling
A second major direction guides inference\-time optimization modeling via task decomposition, multi\-agent coordination, and structured search\. Given that single\-step generation often fails on complex problem structures\[[28](https://arxiv.org/html/2605.23966#bib.bib34)\], one prominent line of work decomposes the modeling workflow into specialized roles\. For instance, frameworks like Chain\-of\-Experts\[[36](https://arxiv.org/html/2605.23966#bib.bib10)\], OptiMUS\[[1](https://arxiv.org/html/2605.23966#bib.bib14)\], OptimAI\[[27](https://arxiv.org/html/2605.23966#bib.bib12)\], and OR\-LLM\-Agent\[[40](https://arxiv.org/html/2605.23966#bib.bib16)\] coordinate distinct agents or personas to separately handle mathematical formulation, code generation, and execution repair\. By modularizing the workflow, these methods allow models to focus on one specific part of the modeling task at a time\.
A complementary line of work enforces structural consistency through search trajectories and semantic grounding\. OptiTree\[[13](https://arxiv.org/html/2605.23966#bib.bib15)\] organizes the generation process via hierarchical tree search, decomposing complex problems into simpler subproblems to synthesize a global modeling strategy\. Alongside tree search, retrieving relevant modeling knowledge via in\-context learning has shown promise for Constraint Programming\[[18](https://arxiv.org/html/2605.23966#bib.bib27)\]\. To further ground the generation process, SAC\-Opt\[[42](https://arxiv.org/html/2605.23966#bib.bib17)\] focuses on structural alignment, reconstructing semantic anchors from the generated code and correcting mismatched components against the original problem description\.
While these structural and multi\-agent approaches make reasoning more reliable, their feedback loops are still concentrated at the code stage, typically through execution and debugging after code generation\. Because they rarely impose explicit validation gates on early modeling results \(such as the initial semantic extraction or the mathematical formulation\), early misinterpretations can still freely propagate into the final code structure\.
### II\-C Evaluation of Generated Outputs and Explicit Optimization Validation
Recognizing the limitations of pure generation, recent machine learning literature increasingly leverages LLMs as evaluators to critique and refine outputs\[[17](https://arxiv.org/html/2605.23966#bib.bib39),[12](https://arxiv.org/html/2605.23966#bib.bib40),[3](https://arxiv.org/html/2605.23966#bib.bib38),[7](https://arxiv.org/html/2605.23966#bib.bib41),[31](https://arxiv.org/html/2605.23966#bib.bib31)\]\. Within optimization modeling, recent work has explored validation\-guided search and testing\. For instance, AutoFormulation\[[2](https://arxiv.org/html/2605.23966#bib.bib9)\] incorporates correctness scoring to explore and screen candidate formulations\. Other works focus on post\-generation verification: OptiVer\[[14](https://arxiv.org/html/2605.23966#bib.bib43)\] checks final models via dual\-side verification of structure and solutions, while an agent\-based approach formalizes model checking using generated tests and mutations\[[39](https://arxiv.org/html/2605.23966#bib.bib42)\]\.
Despite these advances, existing optimization verifiers are typically applied only at specific points of the modeling process or after a complete model has already been produced\. Because they do not systematically gate the intermediate transitions connecting natural language, mathematical formulations, and code, they struggle to prevent early semantic misunderstandings from cascading into complex formulation errors\.
TABLE I: Comparison of representative optimization modeling methods\. TriVAL proposes a dedicated tri\-validation framework covering all core modeling stages\.

| Method | SU / F / CG | Validation Scope |
| --- | --- | --- |
| ORLM\[[9](https://arxiv.org/html/2605.23966#bib.bib23)\] | ✓✓ | – |
| LLMOPT\[[11](https://arxiv.org/html/2605.23966#bib.bib19)\] | ✓✓ | – |
| SIRL\[[4](https://arxiv.org/html/2605.23966#bib.bib24)\] | ✓✓ | – |
| AutoFormulation\[[2](https://arxiv.org/html/2605.23966#bib.bib9)\] | ✓✓✓ | Formulation stage only |
| OptiMUS\[[1](https://arxiv.org/html/2605.23966#bib.bib14)\] | ✓✓✓ | – |
| OR\-LLM\-Agent\[[40](https://arxiv.org/html/2605.23966#bib.bib16)\] | ✓✓ | – |
| OptiTree\[[13](https://arxiv.org/html/2605.23966#bib.bib15)\] | ✓✓✓ | – |
| OptiVer\[[14](https://arxiv.org/html/2605.23966#bib.bib43)\] | ✓✓✓ | Final\-model only |
| TriVAL \(ours\) | ✓✓✓ | Tri\-Validation |
SU: Semantic Understanding; F: Formulation; CG: Code Generation\.
### II\-D Our Positioning
As shown in Table [I](https://arxiv.org/html/2605.23966#S2.T1), prior work covers the three broad stages of optimization modeling to different extents, while explicit validation is absent in most methods or introduced only partially\. TriVAL differs fundamentally: it elevates validation from a supporting component to the central mechanism of the framework\. Rather than introducing checks only at selected stages of the modeling process, TriVAL introduces explicit validation at each of the three stages: semantic understanding, formulation, and code generation\. By governing whether intermediate modeling results should be accepted, revised, or prevented from affecting later stages, this design contains error propagation at each stage, keeping the final generated code faithful to the original problem\.
## III Methodology: TriVAL
### III\-A Overview: A Tri\-Validation Framework for Automatic Optimization Modeling
TriVAL organizes automatic optimization modeling around explicit validation at three stages: semantic specification, mathematical formulation, and code generation\. These stages produce the semantic specification $S$, the mathematical formulation $M$, and the generated code $C$, respectively\. An error at any stage can carry forward and cause the final result to deviate from the original problem\.
To address this risk, TriVAL introduces three validation gates, each targeting one stage of the modeling process\. Semantic validation evaluates whether $S$ faithfully captures the original problem\. Formulation validation examines whether $M$ correctly formalizes the specification\. Code validation assesses whether $C$ faithfully translates $M$ and produces correct solutions\.
Across all three stages, TriVAL follows a unified construct\-validate\-revise loop\. The framework first constructs the current stage result, then evaluates it against stage\-specific criteria, and revises it when needed before the modeling process moves forward\. In this way, validation is embedded throughout the modeling process and acts as the central mechanism for containing error propagation\.
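The shared loop can be sketched generically. The snippet below is a minimal illustration, not the authors' implementation: the function name `construct_validate_revise`, the `(verdict, feedback)` return convention, and the toy callables are all assumptions standing in for LLM\-backed components.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def construct_validate_revise(
    construct: Callable[[], T],
    validate: Callable[[T], Tuple[str, str]],
    revise: Callable[[T, str], T],
    max_rounds: int,
) -> Tuple[T, bool]:
    """Build a stage result, then validate and revise it for at most max_rounds rounds."""
    result = construct()
    for _ in range(max_rounds):
        verdict, feedback = validate(result)
        if verdict == "accept":
            return result, True
        result = revise(result, feedback)
    # Budget exhausted: the current result still moves on to the next stage.
    return result, False


# Toy stage (hypothetical): the result is a list of extracted facts, and the
# validator reports a missing capacity limit as feedback.
def toy_construct() -> List[str]:
    return ["minimize cost"]


def toy_validate(facts: List[str]) -> Tuple[str, str]:
    if "capacity limit" in facts:
        return "accept", ""
    return "revise", "capacity limit"


def toy_revise(facts: List[str], feedback: str) -> List[str]:
    return facts + [feedback]


final, accepted = construct_validate_revise(
    toy_construct, toy_validate, toy_revise, max_rounds=3
)
print(final, accepted)
```

Passing the three steps as callables keeps the loop identical across stages, so only the stage\-specific criteria inside `validate` need to change.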
To improve candidate quality before validation, TriVAL adopts stage\-specific construction mechanisms: semantic extraction and ambiguity resolution for $S$, multi\-expert formulation exploration for $M$, and a ReAct\-based code agent with code self\-correction for $C$\. These mechanisms improve the quality of the results entering validation, making the overall process more reliable\.
Fig\. [1](https://arxiv.org/html/2605.23966#S3.F1) illustrates the overall architecture of TriVAL, and Algorithm [1](https://arxiv.org/html/2605.23966#alg1) summarizes the complete construct\-validate\-revise procedure\. The following subsections describe the three validation stages in detail\. Section [IV](https://arxiv.org/html/2605.23966#S4) further presents a case study demonstrating how TriVAL identifies and corrects modeling errors in practice\.
Figure 1: Overview of TriVAL\. The framework organizes automatic optimization modeling around three validation gates for the semantic specification $S$, mathematical formulation $M$, and generated code $C$\. At each stage, construction, validation, and revision are performed before the process moves to the next stage\.

Algorithm 1: TriVAL: The construct\-validate\-revise procedure

- **Input:** problem description $d$
- **Budgets:** semantic validation rounds $T_S$, formulation validation rounds $T_M$, code validation rounds $T_C$, code self\-correction rounds $E_C$
- **Output:** semantic specification $S$, mathematical formulation $M$, generated code $C$
- **Semantic Validation**
  - $S \leftarrow \textsc{ConstructSemanticSpecification}(d)$
  - **for** $t_S = 1$ **to** $T_S$ **do**
    - $(r_S, \delta_S) \leftarrow \textsc{ValidateSemanticSpecification}(d, S)$
    - **if** $r_S = \texttt{accept}$ **then**
      - **break**
    - **else if** $r_S = \texttt{revise}$ **then**
      - $S \leftarrow \textsc{ReviseSemanticSpecification}(S, \delta_S)$
    - **end if**
  - **end for**
- **Formulation Validation**
  - $\mathcal{B} \leftarrow \textsc{ConstructFormulationCandidates}(S)$
  - $M \leftarrow \textsc{SelectFormulation}(\mathcal{B}, S)$
  - **for** $t_M = 1$ **to** $T_M$ **do**
    - $(r_M, \delta_M) \leftarrow \textsc{ValidateFormulation}(d, S, M)$
    - **if** $r_M = \texttt{accept}$ **then**
      - **break**
    - **else if** $r_M = \texttt{partial\_revise}$ **then**
      - $M \leftarrow \textsc{ReviseFormulation}(S, M, \delta_M)$
    - **else if** $r_M = \texttt{reformulate}$ **then**
      - $\mathcal{B} \leftarrow \textsc{ConstructFormulationCandidates}(S, \delta_M)$
      - $M \leftarrow \textsc{SelectFormulation}(\mathcal{B}, S)$
    - **end if**
  - **end for**
- **Code Validation**
  - $C \leftarrow \textsc{GenerateSolverCode}(d, M)$
  - **for** $t_C = 1$ **to** $T_C$ **do**
    - **for** $e = 1$ **to** $E_C$ **do**
      - $(\eta_C, \phi_C) \leftarrow \textsc{ExecuteCode}(C)$
      - **if** $\eta_C = \texttt{executable}$ **then**
        - **break**
      - **end if**
      - $C \leftarrow \textsc{SelfCorrectCodeFromExecution}(C, \phi_C)$
    - **end for**
    - **if** $\eta_C \neq \texttt{executable}$ **then**
      - **break**
    - **end if**
    - $(r_C, \delta_C) \leftarrow \textsc{ValidateSolverCode}(d, M, C, \phi_C)$
    - **if** $r_C = \texttt{accept}$ **then**
      - **return** $(S, M, C)$
    - **else if** $r_C = \texttt{code\_revise}$ **then**
      - $C \leftarrow \textsc{ReviseSolverCode}(C, \delta_C)$
    - **else if** $r_C = \texttt{formulation\_revise}$ **then**
      - $M \leftarrow \textsc{PartialReviseFormulation}(S, M, \delta_C)$
      - $C \leftarrow \textsc{GenerateSolverCode}(d, M)$
    - **end if**
  - **end for**
- **return** $(S, M, C)$
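The inner self\-correction loop of the code stage in Algorithm 1 can be illustrated with a small sketch. It assumes generated code is a Python string that stores its answer in a variable named `result`; the helpers `execute_code`, `make_executable`, and `toy_self_correct` are hypothetical stand\-ins, and a real system would run untrusted code in a sandboxed subprocess rather than with `exec`.

```python
import traceback
from typing import Callable, Tuple


def execute_code(code: str) -> Tuple[bool, str]:
    """Run code in a fresh namespace and return (executable, log)."""
    namespace: dict = {}
    try:
        # In-process exec is for illustration only; it offers no isolation.
        exec(code, namespace)
    except Exception:
        return False, traceback.format_exc()
    return True, repr(namespace.get("result"))


def make_executable(
    code: str, self_correct: Callable[[str, str], str], max_fixes: int
) -> Tuple[str, bool, str]:
    """Execute the code; on failure, self-correct from the log, up to max_fixes rounds."""
    ok, log = False, ""
    for _ in range(max_fixes):
        ok, log = execute_code(code)
        if ok:
            break
        code = self_correct(code, log)
    return code, ok, log


# Hypothetical corrector: a real one would prompt an LLM with the execution log.
def toy_self_correct(code: str, log: str) -> str:
    if "NameError" in log:
        return "undefined_total = 41\n" + code
    return code


fixed, ok, log = make_executable(
    "result = undefined_total + 1", toy_self_correct, max_fixes=2
)
print(ok, log)
```

As in Algorithm 1, a correction made in the last allowed round is not re\-executed, so the code is reported as non\-executable and the outer validation loop stops.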
### III\-B Semantic Validation for Problem Understanding
Figure 2: Semantic validation targets the semantic specification $S=(\mathcal{F},\mathcal{A},\mathcal{R})$\. $S$ is constructed through semantic fact extraction and ambiguity resolution, then validated for factual faithfulness, ambiguity relevance, and resolution consistency\.

Semantic validation is the first validation gate in TriVAL, evaluating whether the problem has been understood faithfully at the semantic level before mathematical formulation begins\. This stage is critical because the semantic specification $S$ serves as the foundation on which all subsequent modeling depends: formulation inherits its starting point from $S$, and code generation in turn builds upon the formulation\. A specification that omits essential problem facts, introduces unsupported interpretations, or leaves formulation\-relevant ambiguities unresolved can cause subsequent modeling to proceed from a mistaken understanding of the original problem, even if later stages remain internally consistent\.
The specification $S$ under validation is a triple

$$S=(\mathcal{F},\mathcal{A},\mathcal{R}), \tag{1}$$

where $\mathcal{F}$ captures semantic facts extracted from the problem description $d$, including given conditions, constraint requirements, and the optimization objective; $\mathcal{A}$ identifies formulation\-relevant ambiguities, i\.e\., parts of the description that admit multiple plausible interpretations and may affect modeling decisions \(such as variable domains, parameter definitions, or the scope of constraints\); and $\mathcal{R}$ records the interpretations and conventions adopted for subsequent modeling\. Together, they capture the problem’s given information, ambiguities, and adopted interpretations\. Because such ambiguities bear directly on the formulation, leaving them unresolved may lead the formulation stage to adopt incorrect variable definitions or constraint scopes, introducing errors that are difficult to detect later\.
TriVAL constructs $S$ through a two\-step extraction\-resolution process \(Fig\. [2](https://arxiv.org/html/2605.23966#S3.F2)\)\. A semantic fact extractor first extracts $\mathcal{F}$ from the problem description and identifies candidate ambiguities $\mathcal{A}$\. An ambiguity resolver then determines the adopted interpretation for each ambiguity in $\mathcal{A}$ based on the problem context, the extracted facts, and standard optimization modeling conventions, producing the resolution set $\mathcal{R}$\. When no formulation\-relevant ambiguity is identified, the extracted facts proceed directly to validation\. This specification provides the semantic basis for all subsequent modeling\.
The semantic validator examines $S$ along three dimensions: factual faithfulness, ambiguity relevance, and resolution consistency\. For $\mathcal{F}$, it evaluates whether the extracted facts are complete, correct, and free of unsupported additions\. For $\mathcal{A}$, it gauges whether the listed ambiguities are genuinely relevant to formulation and sufficiently consequential to warrant resolution\. For $\mathcal{R}$, it determines whether the adopted resolutions are consistent with both the original problem description and the extracted facts\. These three dimensions jointly assess whether the current specification captures the semantic content of the original problem with sufficient faithfulness for formulation to proceed\.
If the validator identifies defects, it returns `revise`, and TriVAL revises $S$ according to the validation feedback and re\-evaluates it\. Missing or incorrect facts lead to revision of $\mathcal{F}$; missing, irrelevant, or misidentified ambiguities lead to revision of $\mathcal{A}$; and when the adopted interpretation is weakly supported or inconsistent with the problem description, TriVAL revises $\mathcal{R}$ accordingly\. This construct\-validate\-revise loop continues until the specification is accepted or the semantic validation round limit is reached, after which the current specification proceeds to the formulation stage\.
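One way to picture the triple and its component\-wise revision is a small data structure. The sketch below is an assumption\-laden illustration: the class `SemanticSpec`, the feedback keys `add_facts`, `drop_ambiguities`, and `set_resolutions`, and the warehouse example are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SemanticSpec:
    facts: List[str] = field(default_factory=list)             # F: extracted facts
    ambiguities: List[str] = field(default_factory=list)       # A: open questions
    resolutions: Dict[str, str] = field(default_factory=dict)  # R: adopted answers


def unresolved(spec: SemanticSpec) -> List[str]:
    """Ambiguities that have no adopted interpretation yet."""
    return [a for a in spec.ambiguities if a not in spec.resolutions]


def apply_feedback(spec: SemanticSpec, feedback: dict) -> SemanticSpec:
    """Route validator feedback to the component it concerns (F, A, or R)."""
    new_facts = [f for f in feedback.get("add_facts", []) if f not in spec.facts]
    dropped = set(feedback.get("drop_ambiguities", []))
    ambiguities = [a for a in spec.ambiguities if a not in dropped]
    resolutions = {k: v for k, v in spec.resolutions.items() if k not in dropped}
    resolutions.update(feedback.get("set_resolutions", {}))
    return SemanticSpec(spec.facts + new_facts, ambiguities, resolutions)


question = "shipment quantities: integer or continuous?"
spec = SemanticSpec(
    facts=["3 warehouses", "minimize total shipping cost"],
    ambiguities=[question],
)
print(unresolved(spec))

revised = apply_feedback(spec, {"set_resolutions": {question: "integer"}})
print(unresolved(revised))
```

Keeping the three components separate lets each kind of defect touch only the part of the specification it concerns, which mirrors the revision rules described above.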
### III\-C Formulation Validation for Mathematical Modeling
Figure 3: Formulation validation targets the mathematical formulation $M=(P,V,\mathcal{C},O)$\. Candidate formulations are constructed through multi\-expert exploration, and the selected candidate is validated for variable\-design quality, constraint soundness, and objective alignment\.

With the semantic specification $S$ in place, the modeling process proceeds to mathematical formulation\. Formulation validation is the second validation gate, examining whether the problem has been correctly expressed as a mathematical formulation\. Because $M$ will be directly translated into code, any deviation from the specification propagates to subsequent stages\. Expression\-level defects may prevent the model from being solved correctly or produce abnormal solver behavior\. Modeling\-level deviations are more consequential: they can cause the variables to misrepresent the intended decisions, the constraints to misstate the problem requirements, or the objective to no longer match the original goal, regardless of whether code generation itself is error\-free\.
The mathematical formulation is defined as

$$M=(P,V,\mathcal{C},O), \tag{2}$$

where $P$ specifies the known parameters and constants in the model; $V$ defines the decision variables that abstract the core choices in the optimization problem; $\mathcal{C}$ encodes the constraints that feasible solutions must satisfy; and $O$ expresses the optimization objective to be minimized or maximized\. Together, these four components constitute the complete mathematical representation of the problem and define the optimization model to be implemented\.
To constructMM, TriVAL employs multi\-expert formulation exploration\. This design reflects that the same optimization problem can be approached from different modeling perspectives\. Different perspectives lead to different choices of decision variables, which in turn produce different expressions of constraints and objectives, yielding formulations that are all mathematically valid yet vary substantially in compactness, solving difficulty, and ease of code generation\. For example, a routing problem may be formulated through an ordering\-based representation or through an edge\-based graph representation\. Both may be correct, yet they can differ substantially in variable design, constraint organization, and ease of code generation\. TriVAL uses multiple experts to explore these alternatives explicitly, so that validation can operate on a more diverse and higher\-quality candidate set\.
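The routing example can be made concrete with a toy instance. The sketch below encodes the same tour once as a visiting order and once as binary edge indicators and checks that both give the same cost; the distance matrix and the function names are made up for illustration and are not taken from the paper.

```python
from typing import List

# Hypothetical symmetric distance matrix for four locations.
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]


def cost_from_order(order: List[int], dist: List[List[int]]) -> int:
    """Ordering-based view: the decision is the sequence of visited locations."""
    n = len(order)
    return sum(dist[order[i]][order[(i + 1) % n]] for i in range(n))


def order_to_edges(order: List[int]) -> List[List[int]]:
    """Edge-based view: x[i][j] = 1 if the tour travels directly from i to j."""
    n = len(order)
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        x[order[i]][order[(i + 1) % n]] = 1
    return x


def cost_from_edges(x: List[List[int]], dist: List[List[int]]) -> int:
    """Linear objective over the edge indicators."""
    n = len(x)
    return sum(dist[i][j] * x[i][j] for i in range(n) for j in range(n))


order = [0, 1, 3, 2]
x = order_to_edges(order)
print(cost_from_order(order, dist), cost_from_edges(x, dist))
```

The ordering view uses four position decisions, while the edge view uses sixteen binary indicators plus degree constraints (each row and column of `x` sums to one here). This kind of difference in variable design and constraint organization is what the multi\-expert exploration is meant to surface.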
As illustrated in Fig\. [3](https://arxiv.org/html/2605.23966#S3.F3), TriVAL explores candidate formulations through four experts\. Each expert takes the same semantic specification $S$ as input and independently produces a complete formulation candidate\. The four experts approach the problem from different modeling perspectives:
- Parameter\-and\-index expert: focuses on the organization of known quantities, emphasizing parameters, constants, and their index structure to provide a clear basis for later variable definition and constraint expression\.
- Decision\-variable expert: focuses on decision representation, favoring variable designs that capture the core decisions compactly and sufficiently\.
- Constraint expert: focuses on requirement expression, emphasizing how constraints should be organized and scoped to express the problem requirements completely and accurately\.
- Objective expert: focuses on optimization\-goal expression, favoring formulations whose objective is direct and well aligned with the variable design and constraint system\.
This design allows TriVAL to explore different formulations of the same problem while keeping every candidate complete\. The exploration process yields a candidate set

$$\mathcal{B}=\{M_1,M_2,\dots,M_k\}. \tag{3}$$

From this candidate set, an LLM\-based selector chooses the formulation that offers the best combination of sound variable design, concise representation, standard formulation conventions, and correctness with respect to $S$, and uses it as the formulation to be validated\.
The formulation validator examines $M$ along three component-level dimensions:

$$\mathrm{Val}_{M}(d,S,M)=\big(\mathrm{Qual}(V),\ \mathrm{Sound}(\mathcal{C}),\ \mathrm{Align}(O)\big), \tag{4}$$

which correspond to the quality of variable design, the soundness of constraint formulation, and the alignment of the optimization objective. The validator first checks variable design ($V$): decision variables should capture the core choices compactly and sufficiently, with domains, bounds, and types consistent with problem semantics. It then examines constraint soundness ($\mathcal{C}$), focusing on completeness, correct scope, and accurate expression of problem requirements under the current variable design. The final component-level check concerns objective alignment ($O$): the objective function should match the intended optimization goal in both direction and expression and remain coordinated with the variables and constraints. Beyond these component-level dimensions, the validator also examines whether the formulation as a whole is coherent. Variables, constraints, and objectives are tightly coupled: variable design shapes how constraints and objectives are expressed, while constraint organization affects whether the overall formulation remains clear and well coordinated.
If the validator identifies defects, TriVAL determines the revision mode according to the nature of the defect. Expression-level defects trigger `partial_revise`. These cases indicate that the overall formulation choice remains appropriate and that the main problem lies in local expressions, such as a missing constraint term, an improper variable bound, an incorrect scope condition, or a local deviation in the objective expression. Such defects can usually be corrected directly without changing the overall modeling design. Modeling-level defects trigger `reformulate`. These cases indicate that the defect lies in the overall formulation choice, such as an unsuitable decision abstraction, a constraint system that depends on an unsuitable variable design, or an objective that is misaligned with problem intent. In such cases, local patching is usually insufficient because variables, constraints, and objectives are tightly coupled, so changing one part can affect the rest of the formulation. TriVAL therefore returns to multi-expert formulation exploration and reconstructs candidate formulations under the same semantic specification, guided by the validation feedback. This step also allows the framework to revisit alternative modeling perspectives and potentially find a stronger formulation. This construct-validate-revise loop continues until the formulation is accepted or the formulation validation round limit is reached, after which the formulation proceeds to code generation.
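The control flow of this loop can be sketched as below. It is an illustrative outline under stated assumptions: `construct`, `validate`, and `patch` are hypothetical callables standing in for multi-expert exploration, the formulation validator, and local revision, and the `Verdict` type is invented for the example. The default of five rounds follows the per-stage limit reported in the experimental protocol.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    """Hypothetical validator output: acceptance flag, revision mode, feedback text."""
    accepted: bool
    mode: Optional[str] = None  # "partial_revise" or "reformulate" when rejected
    feedback: str = ""


def formulation_stage(
    spec: str,
    construct: Callable[[str, str], str],    # (S, feedback) -> formulation
    validate: Callable[[str, str], Verdict],  # (S, M) -> verdict
    patch: Callable[[str, str], str],         # (M, feedback) -> locally revised M
    max_rounds: int = 5,
) -> Tuple[str, bool]:
    """Construct-validate-revise loop; returns (formulation, accepted)."""
    formulation = construct(spec, "")
    for _ in range(max_rounds):
        verdict = validate(spec, formulation)
        if verdict.accepted:
            return formulation, True
        if verdict.mode == "partial_revise":
            # Expression-level defect: keep the modeling design, fix it locally.
            formulation = patch(formulation, verdict.feedback)
        elif verdict.mode == "reformulate":
            # Modeling-level defect: rebuild candidates, guided by the feedback.
            formulation = construct(spec, verdict.feedback)
        else:
            raise ValueError(f"unknown revision mode: {verdict.mode!r}")
    # Round limit reached: the current formulation proceeds to code generation.
    return formulation, False
```

Separating the two revision modes keeps cheap local fixes from triggering a full re-exploration, while still allowing a complete rebuild when the decision abstraction itself is at fault.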
### III-D Code Validation for Executable Translation
Figure 4: Code validation targets the generated code $C$. The code agent generates code, and runtime errors are fixed through code self-correction. The resulting code is then validated for consistency with the formulation using the observed execution result. Detected defects are attributed to either the code or the formulation.

The formulation $M$ now enters the final stage: code generation. Code validation is the third validation gate in TriVAL, assessing whether the formulation $M$ has been correctly translated into code $C$. Two distinct error sources arise at this stage: solver-interface and code-expression errors, such as invalid variable types, incorrect constraint-construction calls, or syntax errors; and deviations from the formulation, where the code no longer accurately reflects the variables, constraints, and objective defined in $M$. To address both sources, the code stage proceeds in two phases. Execution-feedback code self-correction first fixes runtime errors through execution feedback. Code validation then assesses whether the resulting code remains faithful to $M$.
To construct $C$, TriVAL introduces a ReAct-based code agent [[38](https://arxiv.org/html/2605.23966#bib.bib35)] that constructs code iteratively by alternating reasoning and tool use. The agent is equipped with three code-interaction tools for reading the current code state, writing initial code, and making partial edits to existing code. These tools let the agent refine code statefully rather than regenerate the full program each time. To improve adaptation to solver interfaces, TriVAL injects basic solver-interface syntax rules, common error patterns, and task guidance into the agent prompt as prior knowledge for code generation.
TriVAL pairs the code agent with a code self-correction loop: each time the code agent completes generation, an independent execution environment runs the code. If execution fails, the environment returns execution feedback, including runtime errors, interface-call errors, and solver-status information. This feedback is sent back to the code agent and used to guide the next round of partial revision. This self-correction loop targets runtime and interface errors, iteratively fixing them through the returned feedback. The loop continues until the code runs without errors or the repair budget is exhausted [[26](https://arxiv.org/html/2605.23966#bib.bib11)], returning an execution status $\eta_{C}$ together with execution feedback $\phi_{C}$. Code self-correction resolves runtime errors but does not assess whether the resulting code remains faithful to $M$. Code validation addresses this gap: even code that runs without errors may still return infeasibility or an abnormal solver status, and even code that produces a solution may still deviate from the formulation, yielding a solved problem that no longer matches $M$.
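A minimal sketch of such an execution-feedback loop is shown below, assuming the generated program is a standalone Python script. The `repair` callable is a hypothetical stand-in for the code agent's partial revision, and the subprocess-based runner is one possible realization of an independent execution environment, not necessarily the one used by the authors. The defaults of 20 rounds and a 100 s timeout follow the experimental protocol.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, Tuple


def run_code(code: str, timeout_s: float = 100.0) -> Tuple[bool, str]:
    """Run `code` in a separate Python process; return (status, feedback)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout_s
        )
        ok = proc.returncode == 0
        # On success the feedback is the program output, otherwise the error trace.
        return ok, proc.stdout if ok else proc.stderr
    except subprocess.TimeoutExpired:
        return False, f"timeout after {timeout_s} s"
    finally:
        os.unlink(path)


def self_correct(
    code: str,
    repair: Callable[[str, str], str],  # hypothetical: (code, feedback) -> revised code
    max_rounds: int = 20,
    timeout_s: float = 100.0,
) -> Tuple[str, bool, str]:
    """Execute, feed errors back to `repair`, and stop on success or budget exhaustion."""
    ok, feedback = run_code(code, timeout_s)
    rounds = 1
    while not ok and rounds < max_rounds:
        code = repair(code, feedback)
        ok, feedback = run_code(code, timeout_s)
        rounds += 1
    return code, ok, feedback  # (C, eta_C, phi_C) in the paper's notation
```

Note that a successful exit only means the script ran; the returned feedback still has to be inspected by the later validation step for infeasibility or abnormal solver status.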
The code-stage validation is defined as

$$\mathrm{Val}_{C}(d,M,C,\phi_{C})\to(r_{C},\delta_{C}), \tag{5}$$

where $\phi_{C}$ denotes the observed execution result and solver feedback, $r_{C}$ is the validation result, and $\delta_{C}$ carries the validation feedback, including error attribution and a description of the defect.
The validator inspects the executable code together with the observed execution result $\phi_{C}$, assessing faithfulness to $M$. It checks that variable types, domains, and indices remain consistent with $M$, that constraints are added completely and correctly under the intended scope, and that the objective function remains aligned in direction and expression. This faithfulness check also supports error attribution. If the code deviates from the formulation, the defect is attributed to the code, since code generation is responsible for translating mathematical expressions into executable solver code. If the code is consistent with $M$, the defect is attributed to the formulation. Certain formulation-level errors, such as incorrect constraint scope or missing coupling conditions, become visible only through actual execution when the solver result remains abnormal. If the defect is attributed to the code (e.g., an interface-call error, a missing constraint term, an incorrect variable definition, or a deviation in objective expression), the validation result is `code_revise`, and the code agent revises the current generated code accordingly. If the defect is attributed to the formulation (e.g., a formulation defect exposed through infeasibility or an abnormal solver status), the validation result is `formulation_revise`, the formulation is partially revised, and the code is regenerated from the revised formulation. This construct-validate-revise loop continues until the code is accepted or the code validation round limit is reached, completing the TriVAL modeling process and producing the final results $(S,M,C)$.
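The attribution rule described above reduces to a small decision function. The sketch below is a simplification under stated assumptions: the boolean faithfulness flag would in practice come from an LLM-based comparison of $C$ against $M$, and the `"optimal"` status string and the `"accept"` label are hypothetical placeholders rather than names from the paper.

```python
def attribute_defect(code_faithful: bool, solver_status: str) -> str:
    """Map the faithfulness check and solver status to a validation result.

    code_faithful : whether the code matches M in variables, constraints, objective
    solver_status : hypothetical status string; "optimal" is treated as normal
    """
    if not code_faithful:
        # The translation deviates from M, so the code agent must revise the code.
        return "code_revise"
    if solver_status != "optimal":
        # Code matches M but the result is abnormal: the defect lies in M itself.
        return "formulation_revise"
    return "accept"  # assumed label for the accepted case
```

Checking faithfulness first matters: an infeasible result from unfaithful code says nothing reliable about the formulation, so the code is repaired before the formulation is questioned.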
## IV Experiments
This section evaluates TriVAL and NL4COP from four perspectives\. We first introduce NL4COP and the experimental setup\. We then assess the overall effectiveness of TriVAL and examine how its advantage changes with problem complexity across benchmarks and within NL4COP\. Next, we study the value of the complete validation framework through validator ablations, error analysis, and transfer experiments on an existing optimization\-modeling framework\. Finally, we analyze how the three\-stage modeling design and stage\-specific mechanisms contribute to TriVAL’s effectiveness, and evaluate the cost of validation\.
### IV-A Experimental Setup
Benchmarks. Existing benchmarks for automatic optimization modeling focus predominantly on LP and MILP problems with relatively simple structures [[19](https://arxiv.org/html/2605.23966#bib.bib26)], while combinatorial optimization remains systematically underrepresented. These problems involve complex constraint interactions and discrete decisions, making them particularly demanding to model. To fill this gap, we introduce NL4COP, a benchmark that provides broad coverage of combinatorial problem types, graduated difficulty within each problem type, verified reference solutions, and strong discriminative power for distinguishing modeling methods.
NL4COP comprises 150 instances spanning 50 combinatorial problem types across seven major families: routing, packing and cutting, scheduling, location and allocation, graph and network optimization, knapsack and selection, and hybrid problems\. These 50 types are selected to systematically cover the principal modeling structures in combinatorial optimization, from path and flow decisions to resource coupling, assignment, sequencing, and set selection\.
Each problem type contains three instances at distinct difficulty levels \(simple, medium, and hard\), enabling controlled comparison of modeling performance as difficulty increases within the same problem type\. The three levels differ in description length, constraint complexity, and data scale, placing progressively higher demands on long\-context semantic understanding and mathematical formalization quality\.
All instances are designed and constructed by PhD\-level operations research experts, each grounded in a realistic OR scenario\. The problem data is verified for feasibility via solver execution, and every instance is fully specified in natural language with reference code and a reference optimal solution\. Two experts independently cross\-check all cases for consistency among the problem description, reference code, and reference answer\.
Compared with existing benchmarks, NL4COP features more detailed problem descriptions and more intricate rule interactions, substantially increasing both description length and modeling complexity. We compare NL4COP with existing benchmarks along these two dimensions, following the definition and computation of modeling complexity in prior work [[35](https://arxiv.org/html/2605.23966#bib.bib32)]. As shown in Table [II](https://arxiv.org/html/2605.23966#S4.T2) and Fig. [5](https://arxiv.org/html/2605.23966#S4.F5), NL4COP substantially exceeds existing benchmarks in both dimensions. Accordingly, constructing correct models for NL4COP requires more careful discrete abstraction, constraint scoping, and coordination across interacting constraints [[15](https://arxiv.org/html/2605.23966#bib.bib25)].
TABLE II: Optimization modeling benchmarks compared by instance count, average description length, and modeling complexity. NL4COP ranks first in both description length and modeling complexity.

| Benchmark | Instances | Avg. Length (chars) | Complexity |
| --- | --- | --- | --- |
| ComplexOR [[36](https://arxiv.org/html/2605.23966#bib.bib10)] | 18 | 1273 | 4.0 |
| OptiBench [[37](https://arxiv.org/html/2605.23966#bib.bib20)] | 403 | 621 | 5.1 |
| NL4Opt [[24](https://arxiv.org/html/2605.23966#bib.bib28)] | 205 | 530 | 5.1 |
| NL4LP [[1](https://arxiv.org/html/2605.23966#bib.bib14)] | 178 | 533 | 5.2 |
| Mamo [[10](https://arxiv.org/html/2605.23966#bib.bib29)] | 852 | 1225 | 6.4 |
| OptMath [[16](https://arxiv.org/html/2605.23966#bib.bib22)] | 129 | 3083 | 7.1 |
| IndustryOR [[9](https://arxiv.org/html/2605.23966#bib.bib23)] | 99 | 1046 | 7.9 |
| NL4COP (ours) | 150 | 3249 | 9.4 |

Figure 5: Distribution of modeling complexity across benchmarks. NL4COP instances cluster at the high-complexity end, with substantially greater modeling difficulty than existing benchmarks.

In addition to NL4COP, the main experiments include Mamo$_{\text{complex}}$ [[10](https://arxiv.org/html/2605.23966#bib.bib29)], OptMath [[16](https://arxiv.org/html/2605.23966#bib.bib22)], and IndustryOR [[9](https://arxiv.org/html/2605.23966#bib.bib23)]. Mamo$_{\text{complex}}$ is the complex subset of the Mamo benchmark, consisting of challenging LP and MILP instances. OptMath spans a broader range of optimization problem families and includes longer descriptions with greater modeling difficulty. IndustryOR focuses on optimization-modeling tasks drawn from real industrial settings. These benchmarks, together with NL4COP, span established LP/MILP settings to more challenging combinatorial problems, enabling evaluation across varying modeling difficulty.
Baselines. We compare TriVAL against two classes of prior methods. One class consists of learning-based methods, including ORLM (ORLM-LLaMA-3-8B) [[9](https://arxiv.org/html/2605.23966#bib.bib23)], LLMOPT (LLMOPT-Qwen2.5-14B) [[11](https://arxiv.org/html/2605.23966#bib.bib19)], and SIRL (SIRL-Gurobi32B) [[4](https://arxiv.org/html/2605.23966#bib.bib24)], which improve optimization modeling through training data, structured representations, and solver-verifiable learning signals. The other consists of prompt- and agent-based methods, including AutoFormulation [[2](https://arxiv.org/html/2605.23966#bib.bib9)], OR-LLM-Agent [[40](https://arxiv.org/html/2605.23966#bib.bib16)], and OptiTree [[13](https://arxiv.org/html/2605.23966#bib.bib15)], which improve inference-time modeling through task decomposition, search, planning, external feedback, and multi-agent coordination.
Evaluation Metric. Following prior work [[9](https://arxiv.org/html/2605.23966#bib.bib23), [11](https://arxiv.org/html/2605.23966#bib.bib19), [13](https://arxiv.org/html/2605.23966#bib.bib15), [4](https://arxiv.org/html/2605.23966#bib.bib24), [2](https://arxiv.org/html/2605.23966#bib.bib9)], we adopt solving accuracy as the primary metric. No general-purpose automatic test of structural equivalence currently exists for optimization modeling, so solving accuracy remains the standard evaluation criterion. For each instance, we execute the generated code, obtain the predicted objective value $y_{\text{pred}}$, and compare it with the reference objective value $y_{\text{label}}$. Following the SIRL evaluation protocol [[4](https://arxiv.org/html/2605.23966#bib.bib24)], an instance is counted as correct when

$$\frac{|y_{\text{pred}}-y_{\text{label}}|}{|y_{\text{label}}|+1}<10^{-6}. \tag{6}$$

We also report process-level indicators and error-type analyses as supplementary evidence. For the small number of instances where the decision-variable type is not explicitly specified, we evaluate both the integral and the continuous settings. If either setting matches the reference objective value, we count the instance as correct. This rule is applied uniformly across all methods.
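The correctness criterion in Eq. (6), together with the either-setting rule for unspecified variable types, can be written as a short check. The function names are illustrative, and treating a failed run as `None` is an assumption made for the sketch.

```python
from typing import Optional


def is_correct(y_pred: float, y_label: float, tol: float = 1e-6) -> bool:
    """Eq. (6): relative error with +1 in the denominator to stay stable near zero."""
    return abs(y_pred - y_label) / (abs(y_label) + 1.0) < tol


def instance_correct(
    y_label: float,
    y_int: Optional[float] = None,   # objective under integral variables, None if the run failed
    y_cont: Optional[float] = None,  # objective under continuous variables, None if the run failed
    tol: float = 1e-6,
) -> bool:
    """Count the instance as correct if either variable-type setting matches the label."""
    return any(
        y is not None and is_correct(y, y_label, tol) for y in (y_int, y_cont)
    )
```

For example, with $y_{\text{label}} = 100$ a prediction of $100.00001$ passes (relative error about $9.9 \times 10^{-8}$), while $100.001$ fails (about $9.9 \times 10^{-6}$).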
Protocol. The main experiments use DeepSeek-V3.2 [[5](https://arxiv.org/html/2605.23966#bib.bib36)] and GPT-5.1 [[22](https://arxiv.org/html/2605.23966#bib.bib37)] as base models. For prompt- and agent-based methods, TriVAL and all baselines are evaluated on the same base model to ensure a fair comparison. For learning-based methods, we evaluate the authors’ released models under their best reported settings. For all prompt- and agent-based methods evaluated by us, we repeat each experiment five times. Unless otherwise specified, we report the best solving accuracy over the five runs. We use OR-Tools [[23](https://arxiv.org/html/2605.23966#bib.bib6)] and CVXPY [[6](https://arxiv.org/html/2605.23966#bib.bib7)] as solvers to cover the optimization problem types in all benchmarks. In TriVAL, each stage allows up to five validation iterations, after which the current result proceeds to the next stage. The code self-correction stage allows up to 20 rounds of execution feedback, with a timeout of 100 s per run. Unless a module is explicitly removed in an ablation, all other settings are kept fixed across variants.
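TABLE III: Accuracy (%) across benchmarks. TriVAL achieves the strongest overall performance, with the largest gains on the most challenging benchmarks.

| Category | Model | Method | Mamo$_{\text{complex}}$ | OptMath | IndustryOR | NL4COP | Overall (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Learning-based Methods | — | ORLM | 48.6 | 14.1 | 34.3 | 11.3 | 29.1 |
| Learning-based Methods | — | LLMOPT | 50.0 | 21.9 | 38.4 | 10.7 | 31.9 |
| Learning-based Methods | — | SIRL | 65.2 | 71.1 | 52.5 | 28.0 | 54.9 |
| Prompt-based Methods | GPT-5.1 | AutoFormulation | 74.3 | 46.9 | 52.5 | 25.3 | 52.1 |
| Prompt-based Methods | GPT-5.1 | OR-LLM-Agent | 91.0 | 85.2 | 78.8 | 68.0 | 81.8 |
| Prompt-based Methods | GPT-5.1 | OptiTree | 80.5 | 83.6 | 75.8 | 56.0 | 74.1 |
| Prompt-based Methods | GPT-5.1 | TriVAL | 90.5 | 87.5 | 86.9 | 81.3 | 86.9 |
| Prompt-based Methods | DeepSeek-V3.2 | AutoFormulation | 74.8 | 40.6 | 63.6 | 14.7 | 50.1 |
| Prompt-based Methods | DeepSeek-V3.2 | OR-LLM-Agent | 88.6 | 85.9 | 77.8 | 52.0 | 76.8 |
| Prompt-based Methods | DeepSeek-V3.2 | OptiTree | 82.9 | 74.2 | 67.7 | 54.7 | 71.2 |
| Prompt-based Methods | DeepSeek-V3.2 | TriVAL | 90.0 | 87.5 | 89.9 | 87.3 | 88.8 |

Best results are in bold and second-best results are underlined.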
### IV-B Main Results
Table [III](https://arxiv.org/html/2605.23966#S4.T3) reports the solving accuracy of TriVAL and competing methods on NL4COP and three established optimization-modeling benchmarks. TriVAL achieves the highest overall accuracy under both base models and the highest accuracy on every benchmark except Mamo$_{\text{complex}}$ under GPT-5.1, where OR-LLM-Agent is marginally ahead (91.0% vs. 90.5%). The gains are particularly large on IndustryOR and NL4COP, where longer descriptions and greater modeling difficulty make modeling more error-prone: methods without explicit validation allow these errors to propagate unchecked, whereas TriVAL’s staged validation catches them early.
Fig. [6](https://arxiv.org/html/2605.23966#S4.F6) traces this trend across benchmarks. As complexity increases, all methods decline, but TriVAL degrades much more slowly: under DeepSeek-V3.2, TriVAL drops only 2.7 points from Mamo$_{\text{complex}}$ to NL4COP, whereas OR-LLM-Agent drops 36.6 and OptiTree drops 28.2 (Table [III](https://arxiv.org/html/2605.23966#S4.T3)). NL4COP thus provides strong discriminative power among methods: on simpler benchmarks, most approaches perform adequately and differences remain small, whereas NL4COP’s greater complexity amplifies these differences, clearly separating methods with and without explicit validation.
Table [IV](https://arxiv.org/html/2605.23966#S4.T4) breaks down the NL4COP results by difficulty level. Under both base models, the gap between TriVAL and the baselines widens from simple to hard splits, consistent with the cross-benchmark trend observed above. This within-benchmark result further shows that TriVAL is more robust on harder instances, and that NL4COP’s graduated difficulty reveals this difference.
Figure 6: Solving accuracy versus benchmark complexity under DeepSeek-V3.2. As benchmark difficulty increases, all methods decline but TriVAL degrades the slowest, widening the gap to competing methods.

TABLE IV: Solving accuracy across the simple, medium, and hard splits of NL4COP (%). The gap between TriVAL and the baselines widens as difficulty increases.

| Model | Method | Simple | Medium | Hard | Overall |
| --- | --- | --- | --- | --- | --- |
| GPT-5.1 | OR-LLM-Agent | 72.0 | 72.0 | 60.0 | 68.0 |
| GPT-5.1 | OptiTree | 52.0 | 56.0 | 60.0 | 56.0 |
| GPT-5.1 | TriVAL | 88.0 | 90.0 | 66.0 | 81.3 |
| DeepSeek-V3.2 | OR-LLM-Agent | 56.0 | 56.0 | 44.0 | 52.0 |
| DeepSeek-V3.2 | OptiTree | 60.0 | 58.0 | 46.0 | 54.7 |
| DeepSeek-V3.2 | TriVAL | 94.0 | 92.0 | 76.0 | 87.3 |
### IV-C Effectiveness of the Validation Framework
#### IV-C1 Validator Ablation
To examine the role of each validation stage, we keep the construction process unchanged and remove the semantic, formulation, and code validators individually or jointly\.
Table [V](https://arxiv.org/html/2605.23966#S4.T5) reports the ablation results on IndustryOR. Explicit validation improves accuracy by at least 11.1 percentage points (pp) and up to 13.1 pp, showing that it is essential to TriVAL’s performance. Among the three, the formulation validator contributes the most, improving accuracy by 4.0–8.1 pp, as it guards the critical interface where semantic understanding is translated into concrete mathematical expressions; errors at this stage directly corrupt code generation and are difficult to recover from in later stages. The code validator improves accuracy by 2.0–6.1 pp by addressing more local implementation defects such as solver-interface mismatches, while the semantic validator contributes 1.0–4.0 pp by catching early misinterpretations that would otherwise cascade through both formulation and code. Together, these results support the core design of TriVAL: the three validators target distinct failure modes across the modeling pipeline, and each addresses errors that the others do not catch.
We further examine the same ablation on NL4COP, where the greater modeling difficulty makes errors more likely to arise. The effect of validation is substantially larger: the complete framework improves accuracy by at least 18.0 pp and up to 21.3 pp, with the formulation validator alone contributing 11.3–14.0 pp (Table [VI](https://arxiv.org/html/2605.23966#S4.T6)). Every individual ablation produces a larger degradation than on IndustryOR, showing that validation becomes more valuable as modeling difficulty increases. This growing gap also highlights NL4COP’s discriminative power: the benchmark’s complex decision logic and greater modeling difficulty amplify the difference between methods with and without validation, making the contribution of each validator clearly observable.
TABLE V: Validation ablation on IndustryOR. All three validators contribute to the overall accuracy.

| Variant | Accuracy (%) | $\Delta$ (pp) |
| --- | --- | --- |
| Full Method | 89.9 | — |
| w/o Semantic Validation | 85.9–88.9 | -1.0 – -4.0 |
| w/o Formulation Validation | 81.8–85.9 | -4.0 – -8.1 |
| w/o Code Validation | 83.8–87.9 | -2.0 – -6.1 |
| w/o All Validation | 76.8–78.8 | -11.1 – -13.1 |

Accuracy (%): min–max over five runs. $\Delta$: drop from Full Method in percentage points (pp).
TABLE VI: Validation ablation on NL4COP. The effect of each validator is larger than on IndustryOR, reflecting the greater modeling difficulty of this benchmark.

| Variant | Accuracy (%) | $\Delta$ (pp) |
| --- | --- | --- |
| Full Method | 87.3 | — |
| w/o Semantic Validation | 82.0–86.0 | -1.3 – -5.3 |
| w/o Formulation Validation | 73.3–76.0 | -11.3 – -14.0 |
| w/o Code Validation | 78.7–84.0 | -3.3 – -8.7 |
| w/o All Validation | 66.0–69.3 | -18.0 – -21.3 |

Accuracy (%): min–max over five runs. $\Delta$: drop from Full Method in percentage points (pp).
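The $\Delta$ ranges in these ablation tables follow directly from the min–max accuracies: the smallest drop pairs the full-method accuracy with the best ablated run, and the largest drop pairs it with the worst. The short check below reproduces the IndustryOR rows; the helper name is illustrative, and the values are copied from the table.

```python
from typing import Dict, Tuple


def drop_range(full: float, acc_min: float, acc_max: float) -> Tuple[float, float]:
    """Return (smallest drop, largest drop) in percentage points, to one decimal."""
    return round(full - acc_max, 1), round(full - acc_min, 1)


FULL_INDUSTRYOR = 89.9
# (min accuracy, max accuracy) over five runs, from the IndustryOR ablation table.
ABLATIONS: Dict[str, Tuple[float, float]] = {
    "w/o Semantic Validation": (85.9, 88.9),
    "w/o Formulation Validation": (81.8, 85.9),
    "w/o Code Validation": (83.8, 87.9),
    "w/o All Validation": (76.8, 78.8),
}

drops = {name: drop_range(FULL_INDUSTRYOR, lo, hi) for name, (lo, hi) in ABLATIONS.items()}
```

Because the reported accuracies are themselves rounded, recomputing a drop from the rounded values can differ from a tabulated entry by 0.1 pp in some rows.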
#### IV-C2 Reduction of Variable and Constraint Errors
Having established that validation improves accuracy, we now ask what types of errors it actually fixes. Table [VII](https://arxiv.org/html/2605.23966#S4.T7) classifies failed cases into three categories. Variable-design errors include wrong variable domains, missing decision variables, and incorrect index scopes. Constraint-expression errors include missing constraints, wrong scopes, and incorrect coupling relations. Code-generation errors involve solver-interface mismatches and other translation-level mistakes. The key observation is that removing validation disproportionately increases errors in the first two categories. On NL4COP, constraint errors more than double from 12 to 30 and variable errors double from 6 to 12, showing that TriVAL’s validators mainly reduce errors in variable design and constraint formulation. These are precisely the errors that undermine faithfulness to the original problem: wrong variable abstractions alter the decision space, and flawed constraint formulations distort the feasible region, regardless of whether the subsequent code executes correctly. By catching these defects early, the three validators directly protect the key modeling stages, helping the solver optimize the intended problem.
TABLE VII: Error distribution on failed instances with and without validation. Removing validation mainly increases variable-design and constraint-expression errors rather than code-generation errors.

| Benchmark | Variant | Total failed | Code | Variable | Constraint |
| --- | --- | --- | --- | --- | --- |
| IndustryOR | Full Method | 10 | 1 | 1 | 8 |
| IndustryOR | w/o All Validation | 21 | 1 | 3 | 17 |
| NL4COP | Full Method | 20 | 2 | 6 | 12 |
| NL4COP | w/o All Validation | 46 | 4 | 12 | 30 |
#### IV-C3 Applying Validation to an Existing Framework
The preceding experiments show that explicit validation improves modeling accuracy within TriVAL. We now ask whether the same mechanism can benefit another optimization-modeling framework. To test this, we add explicit validation into OR-LLM-Agent [[40](https://arxiv.org/html/2605.23966#bib.bib16)], a representative multi-stage framework that performs optimization modeling through mathematical modeling, code generation, and debugging. We keep its original modeling pipeline unchanged and insert formulation-side validation after the mathematical modeling stage and code-side validation after the debugging loop produces executable code. Detected formulation-side defects trigger natural-language feedback to the mathematical-modeling stage; code-side defects trigger feedback to the code-generation stage, while runtime errors are still handled by the original debugging loop. The added validators reuse the same prompts as in TriVAL, with each validator allowed up to five iterations.
After adding these validation steps, accuracy improves from 77\.8% to 83\.2% on IndustryOR and from 52\.0% to 70\.2% on NL4COP\. Formulation\-side validation consistently detects more errors than code\-side validation, flagging 37\.4% vs\. 19\.2% of instances on IndustryOR and 42\.7% vs\. 22\.7% on NL4COP, indicating that many modeling defects arise during the mathematical formulation stage\. On NL4COP, both validators trigger revision more frequently than on IndustryOR, and the accuracy gain is correspondingly larger, further reinforcing that validation grows more valuable as problem difficulty increases\. These results demonstrate that explicit validation can also improve a representative existing multi\-stage framework\.
TABLE VIII: Effect of adding validation to OR-LLM-Agent. Validation improves accuracy on both benchmarks, with a substantially larger gain on NL4COP.

| Benchmark | OR-LLM-Agent | + Validation | Gain |
| --- | --- | --- | --- |
| IndustryOR | 77.8 | 83.2 (82.8–83.8) | +5.4 (5.0–6.0) |
| NL4COP | 52.0 | 70.2 (68.7–72.0) | +18.2 (16.7–20.0) |
OR\-LLM\-Agent: original mean accuracy\. \+ Validation: mean \(min–max\) over five runs after inserting formulation\-side and code\-side validation\.
Across these experiments, explicit validation consistently improves modeling accuracy, becomes more effective as problem difficulty increases, and also improves a representative existing framework\. We next analyze the internal design choices that contribute to TriVAL’s effectiveness\.
### IV-D Analysis of TriVAL’s Design
#### IV-D1 Contribution of Modeling Design and Mechanisms
The preceding section establishes that explicit validation is a key driver of TriVAL’s performance\. We now turn to the broader framework: how the three\-stage modeling structure and the stage\-specific mechanisms contribute to the overall effectiveness\. To isolate each component’s role, we keep the validator configuration fixed and evaluate the following variants: without semantic understanding, without formulation, without both stages \(code\-only\), without multi\-expert formulation, and without code self\-correction\. All variants are evaluated on IndustryOR and NL4COP\.
TABLE IX: Ablation of modeling design choices on IndustryOR and NL4COP. Both the three-stage structure and the stage-specific mechanisms contribute to final accuracy, with the largest drops from removing semantic understanding or code self-correction.

| Group | Variant | Semantic Understanding | Formulation | Code Generation | Multi-Expert Formulation | Code Self-Correction | IndustryOR | $\Delta$ | NL4COP | $\Delta$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Three-Stage Structure | Full Method | ✓ | ✓ | ✓ | ✓ | ✓ | 89.9 | — | 87.3 | — |
| Three-Stage Structure | w/o Semantic Understanding | × | ✓ | ✓ | ✓ | ✓ | 80.8 | -9.1 | 78.7 | -8.6 |
| Three-Stage Structure | w/o Formulation Stage | ✓ | × | ✓ | — | ✓ | 83.8 | -6.1 | 78.0 | -9.3 |
| Three-Stage Structure | Code-Only | × | × | ✓ | — | ✓ | 82.8 | -7.1 | 78.0 | -9.3 |
| Stage-Specific Mechanisms | w/o Multi-Expert | ✓ | ✓ | ✓ | × | ✓ | 83.8 | -6.1 | 78.7 | -8.6 |
| Stage-Specific Mechanisms | w/o Code Self-Correction | ✓ | ✓ | ✓ | ✓ | × | 79.8 | -10.1 | 80.0 | -7.3 |

Accuracy (%); $\Delta$: drop from Full Method; —: not applicable.

Both the three-stage structure and the stage-specific mechanisms contribute to overall effectiveness (Table [IX](https://arxiv.org/html/2605.23966#S4.T9)). Removing semantic understanding or formulation causes accuracy drops of 6.1–9.1 pp on IndustryOR and 8.6–9.3 pp on NL4COP. The three-stage structure decomposes the modeling process into explicit stages for semantic understanding and mathematical construction, allowing defects in variable definitions, constraint scope, and objective structure to be identified and revised before code generation.
The stage\-specific mechanisms also contribute substantially\. Removing the multi\-expert mechanism costs 6\.1–8\.6 pp, while removing code self\-correction costs 7\.3–10\.1 pp\. The multi\-expert mechanism generates candidate formulations from diverse expert perspectives, increasing the diversity and quality of formulations that enter validation\. Code self\-correction iteratively refines generated code by using execution feedback to diagnose and fix errors such as syntax violations, incorrect solver API calls, and infeasible model constructions\.
#### IV-D2 Synergy Between Modeling Quality and Validation
The previous analysis shows that both the modeling design and the validation framework contribute to accuracy. We now examine how they reinforce each other. Better modeling construction reduces errors at the source, gives validation formulations that are already closer to acceptance, and makes subsequent repair more focused. We use the multi-expert formulation mechanism as an example (Table [X](https://arxiv.org/html/2605.23966#S4.T10)).
TABLE X: Formulation quality and validation efficiency. The multi-expert mechanism raises Pass@1 and reduces both validation iterations and execution rounds.

| Bench. | Variant | Pass@1 (%) ↑ | Valid. Iters ↓ | Exec. Rounds ↓ |
| --- | --- | --- | --- | --- |
| IndustryOR | Full Method | 51.0 | 2.9 | 1.9 |
| IndustryOR | w/o Multi-Expert | 27.3 | 4.5 | 1.9 |
| NL4COP | Full Method | 46.0 | 3.1 | 2.9 |
| NL4COP | w/o Multi-Expert | 30.7 | 4.8 | 4.1 |

This interaction is reflected in how formulation quality reshapes the role of validation. Better formulation construction raises formulation Pass@1 from 27.3% to 51.0% on IndustryOR and from 30.7% to 46.0% on NL4COP. Validation then more often operates on formulations that are already close to acceptance and can focus on screening and targeted revision. Better construction also reduces average validation iterations from 4.5 to 2.9 on IndustryOR and from 4.8 to 3.1 on NL4COP. On NL4COP, it further reduces code execution rounds from 4.1 to 2.9, indicating that better formulation construction reduces formulation errors that carry into code translation and execution.
Fig\.[7](https://arxiv.org/html/2605.23966#S4.F7)illustrates the same mechanism at the instance level\. In the initial formulationM1M\_\{1\}, the capacity constraint is applied to end\-of\-month inventory rather than to the inventory state after purchasing\. The resulting codeC1C\_\{1\}faithfully translates this incorrect formulation and therefore produces an infeasible model\. In this setting, execution feedback exposes the symptom but provides limited guidance about the source of the error: infeasibility alone does not determine whether the defect comes from variable timing, constraint scope, or another formulation decision\. At the formulation stage, however, the defect is still visible and amenable to targeted revision before it is translated into code and reflected in execution results\.
Figure 7: Case study: formulation\-side validation identifies an incorrect capacity constraint in $M_1$ at a stage where the defect remains interpretable, enabling a targeted revision that restores feasibility\.
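The timing distinction in this case study can be written out with illustrative notation\. The symbols below are assumptions introduced for this sketch and are not taken from Fig\. 7: $I_t$ is end\-of\-month inventory, $p_t$ and $s_t$ are the purchases and sales in month $t$, $C$ is the storage capacity, and purchases are assumed to arrive before sales within a month\.

```latex
\begin{align}
I_t &= I_{t-1} + p_t - s_t && \text{(inventory balance)} \\
I_t &\le C && \text{(capacity on end-of-month inventory, the placement described for } M_1\text{)} \\
I_{t-1} + p_t &\le C && \text{(capacity on the inventory state after purchasing)}
\end{align}
```

The two capacity constraints bound different points within the month, so they admit different sets of purchase plans\. A formulation\-stage validator can compare the constrained quantity against the problem statement directly, whereas a solver status reports only the downstream effect\.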
#### IV\-D3 Cost of Validation
Validation incurs additional LLM calls\. Table [XI](https://arxiv.org/html/2605.23966#S4.T11) shows that removing all validation saves 46\.8% of tokens on IndustryOR and 33\.8% on NL4COP, but more than doubles the error rate on both benchmarks\. The additional tokens spent on validation are used to evaluate intermediate modeling results and repair detected errors across the semantic, formulation, and code stages\. On NL4COP, the modeling process is already more token\-intensive even without validation, because longer descriptions and greater modeling complexity demand more reasoning and more careful construction\. For challenging optimization modeling tasks, reducing modeling errors and keeping the resulting models reliable carries greater practical value than minimizing token usage alone\.
TABLE XI: Token cost and error rate with and without validation\. The additional tokens spent on validation reduce the error rate by more than half on both benchmarks\.

| Benchmark | Tokens \(Full / w/o\) | Token Saving \(%\) | Error \(%\) \(Full / w/o\) | Error Increase \(×\) |
| --- | --- | --- | --- | --- |
| IndustryOR | 54\.9k / 29\.2k | 46\.8 | 10\.1 / 21\.2 | 2\.1 |
| NL4COP | 62\.8k / 41\.6k | 33\.8 | 12\.7 / 30\.7 | 2\.4 |
Tokens and errors are reported as Full / w/o Validation values\.
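The derived columns of Table XI can be recomputed from the raw values\. The sketch below assumes that token saving is measured relative to the full method's token count and that error increase is the ratio of the two error rates; both assumptions reproduce the reported figures\. The function and variable names are illustrative\.

```python
# Raw values copied from Table XI:
# (tokens_full_k, tokens_without_k, error_full_pct, error_without_pct)
TABLE_XI = {
    "IndustryOR": (54.9, 29.2, 10.1, 21.2),
    "NL4COP": (62.8, 41.6, 12.7, 30.7),
}


def token_saving_pct(tokens_full: float, tokens_without: float) -> float:
    """Tokens saved by removing validation, as a percentage of the full method."""
    return 100.0 * (tokens_full - tokens_without) / tokens_full


def error_increase_factor(error_full: float, error_without: float) -> float:
    """Multiplicative growth of the error rate when validation is removed."""
    return error_without / error_full


def summarize(table: dict) -> dict:
    """Return {benchmark: (token saving %, error increase factor)}, rounded to 1 decimal."""
    summary = {}
    for name, (tok_full, tok_wo, err_full, err_wo) in table.items():
        summary[name] = (
            round(token_saving_pct(tok_full, tok_wo), 1),
            round(error_increase_factor(err_full, err_wo), 1),
        )
    return summary


if __name__ == "__main__":
    for name, (saving, factor) in summarize(TABLE_XI).items():
        print(f"{name}: token saving {saving}%, error increase x{factor}")
```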
## V Conclusion and Future Work
This work presents TriVAL, a tri\-validation framework for automatic optimization modeling, and NL4COP, a new benchmark with 50 problem types and 150 instances that provides a more challenging and discriminative evaluation than existing benchmarks\. TriVAL introduces explicit validation at three stages of the modeling pipeline \(semantic specification, mathematical formulation, and generated code\), evaluating each result before errors propagate across stages\. Experiments on both established benchmarks and NL4COP show that staged validation consistently improves modeling accuracy, with the largest gains on the most challenging problems\. Validation of intermediate results is the key driver of these improvements, and the same validation mechanism also improves a representative existing optimization\-modeling framework\. These results position the tri\-validation approach as an effective design principle for automatic optimization modeling and NL4COP as a challenging benchmark that clearly distinguishes the modeling capabilities of different optimization\-modeling methods\.
Future work will focus first on strengthening validation under more challenging problem settings\. As problem complexity grows, semantic ambiguity, complex constraint interactions, and long\-range dependencies make modeling errors harder to identify reliably\. Since TriVAL relies on LLM\-based validators, subtle constraint violations and ambiguous formulations may still challenge their assessments\. A natural next step is to incorporate richer problem\-aware signals, finer\-grained cross\-representation consistency checks, and more precise evaluation criteria to improve validation quality\.
Beyond the combinatorial problems studied here, extending TriVAL to multi\-objective, stochastic, robust, and dynamic optimization settings opens a broader research direction for both modeling and validation\. Improving validation efficiency while maintaining strong performance is also an important practical goal\. On the benchmark side, expanding NL4COP to cover these broader problem classes would provide a more comprehensive testbed for future research\.