COOPA: A Modular LLM Agent Architecture for Operations Research Problems

arXiv cs.LG 06/29/26, 04:00 AM Papers
Summary
This paper introduces COOPA, a modular LLM agent architecture for operations research problems that combines iterative confidence-based modeling, element-level provenance, and multi-solver routing. Evaluated across eight LLM backbones and four baselines, COOPA achieves the best macro-average accuracy on six backbones and improves over the strongest baseline by up to 6.7 percentage points.
arXiv:2606.27611v1 Announce Type: new
# COOPA: A Modular LLM Agent Architecture for Operations Research Problems
Source: [https://arxiv.org/html/2606.27611](https://arxiv.org/html/2606.27611)
Chuanhao Li¹\*, Xiaoan Xu²\*, Dirk Bergemann³, Ethan X\. Fang², Yehua Wei², Zhuoran Yang³

¹Tsinghua University, ²Duke University, ³Yale University

\*These authors contributed equally to this work\.

###### Abstract

Operations Research \(OR\) provides a rigorous framework for high\-stakes decision\-making, but effective OR modeling requires substantial domain knowledge, mathematical abstraction, and solver expertise\. Recent LLM\-based systems automate parts of this pipeline, yet remain limited by low accuracy on complex problems, opaque outputs, and narrow solver support\. We propose COOPA \(COoperative OPerations Agent\), a modular LLM\-agent architecture for interpretable and scalable OR decision support\. It combines three components: iterative confidence\-based modeling, which generates multiple candidate formulations, self\-evaluates them across modeling dimensions, and selects one using a max\-min confidence criterion; element\-level provenance and confidence explanations, which link variables, parameters, constraints, and objectives to quoted source text and provide an audit trail for human verification; and multi\-solver routing to specialized optimizer agents for different OR problem classes\. Across three OR benchmarks, eight LLM backbones, and four baselines under identical conditions, COOPA achieves the best macro\-average accuracy on six of eight backbones and improves over the strongest baseline by up to 6\.7 percentage points\. A within\-system ablation isolates the contribution of iterative confidence\-based modeling, while additional analyses and case studies illustrate the value of source traceability and multi\-solver dispatch\. All resources are available at [https://github\.com/xxxxxa\-hub/COOPA](https://github.com/xxxxxa-hub/COOPA)\.

## 1 Introduction

Operations Research \(OR\) provides a rigorous foundation for decision\-making in complex systems\. Across domains such as supply chains, transportation, energy, healthcare, finance, engineering, and public policy, OR uses mathematical and computational models to capture resource constraints, system dynamics, uncertainty, and performance objectives\. Its toolkit spans deterministic, stochastic, and robust optimization; dynamic programming; queueing theory; and simulation\-based optimization, supported by exact algorithms, decomposition methods, approximation algorithms, heuristics, and metaheuristics\. This breadth has made OR a foundational discipline for high\-stakes decision support\.

However, applying OR effectively remains expertise\-intensive\. Building a useful model requires specifying variables, objectives, and constraints; deciding what to model explicitly or abstract away; representing uncertainty, temporal structure, and system interactions; balancing decision relevance with computational tractability; and implementing the model in specialized optimization environments\. This process is typically iterative and requires close interaction with domain experts\. Even experienced practitioners may spend weeks or months refining formulations, diagnosing computational behavior, and validating results against operational reality\. Consequently, OR remains underutilized in many organizations, especially where specialized modeling expertise is scarce\.

Recent advances in large language models \(LLMs\) offer a plausible way to lower this barrier\. LLMs can process natural\-language problem descriptions, generate structured outputs and code, and reason over symbolic representations, making them natural candidates for tasks such as formulation, solver\-oriented code generation, and result interpretation\. A growing literature explores this direction through prompting, code synthesis, and fine\-tuning strategies tailored to optimization tasks Xiao et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib3)\); AhmadiTeshnizi et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib7)\); Huang et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib8)\); Shu et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib6)\)\. Yet current systems remain far from reliable OR decision\-support tools\. We highlight three limitations\.

First, accuracy on harder OR problems remains limited, and formulation errors are the main bottleneck\. Across several studies, incomplete formulations, incorrect variables or constraints, and misspecified objectives account for 50–70% of failures, substantially more than coding errors Huang et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib8)\); Liu et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib25)\); AhmadiTeshnizi et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib7)\)\. Most methods still generate a single formulation in one pass, with little opportunity to improve it before code generation Xiao et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib3)\); Liu et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib25)\); Zhang and Luo \([2025](https://arxiv.org/html/2606.27611#bib.bib13)\); Huang et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib8)\)\. When revision occurs, it is usually triggered by downstream execution failures rather than explicit iteration on the formulation itself AhmadiTeshnizi et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib7)\); Shu et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib6)\)\. Thus, mathematically incorrect but executable formulations can survive unchecked\.

Second, existing systems produce opaque outputs\. Most LLM\-based OR systems present users with a formulation, solver code, and an optimal solution value AhmadiTeshnizi et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib7)\); Xiao et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib3)\); Zhang and Luo \([2025](https://arxiv.org/html/2606.27611#bib.bib13)\); Huang et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib8)\); some additionally report quality indicators such as per\-clause confidence scores AhmadiTeshnizi et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib7)\)\. However, none of these outputs includes the rationale behind quality assessments, such as why a particular element received a given score or what aspect of the problem description it was derived from, and no system provides element\-level provenance linking each parameter, variable, constraint, or objective term back to the specific text that motivated it\. In practical OR problems involving hundreds of constraints, the absence of such traceability forces practitioners to re\-read the entire problem description and mentally reconstruct the LLM’s reasoning to verify correctness or locate errors, which is labor\-intensive and does not scale\.

Third, existing systems are difficult to adapt to new solver backends\. For methods that rely on model fine\-tuning, solver dependence is direct: ORLM Huang et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib8)\), OR\-R1 Ding et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib28)\), and SIRL Chen et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib31)\) target COPT, while LLMOPT Shu et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib6)\) uses Pyomo\. Changing the backend in such methods requires new solver\-specific data, revised output formats, or additional fine\-tuning\. Systems without fine\-tuning avoid solver\-specific retraining, but solver\-specific assumptions can still be embedded throughout the workflow, including formulation templates, code\-generation prompts, solver API calls, execution and debugging logic, and output parsing\. OptiMUS AhmadiTeshnizi et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib7)\), CoE Xiao et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib3)\), OptiTree Liu et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib25)\), OR\-LLM\-Agent Zhang and Luo \([2025](https://arxiv.org/html/2606.27611#bib.bib13)\), and LEAN\-LLM\-OPT Liang et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib27)\), for example, generate Gurobi\-oriented code\. OptimAI Thind et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib30)\) broadens coverage by supporting multiple mathematical\-programming backends, including PuLP, Pyomo, and OR\-Tools\. Nevertheless, existing systems generally treat backend support as a fixed implementation choice rather than an extensible architectural interface\.

Motivated by these limitations, we propose COOPA \(COoperative OPerations Agent\), a modular LLM\-agent architecture for OR modeling and solver execution built around three design requirements: improving formulation quality before code generation, representing modeling artifacts in a structured and traceable form, and supporting extensible solver dispatch\. COOPA implements these requirements through three components\. First, *iterative confidence\-based modeling*: it generates multiple candidate formulations, scores each across four modeling dimensions with confidence explanations, and selects the candidate with the highest minimum confidence via a max\-min criterion\. Second, *interpretable output for human verification*: it extracts modeling elements through a validated schema and links them to quoted source text, creating an audit trail from problem statement to formulation\. Third, *scalable multi\-solver dispatch*: it classifies the problem type and routes it to one of four optimizer agents covering Pyomo, OR\-Tools, pymoo, and standard Python; new agents can be added without modifying the rest of the workflow\.

We evaluate COOPA on three benchmarks and eight LLM backbones against four agentic baselines\. COOPA achieves the highest macro\-average accuracy on 6 of 8 backbones\. The strongest results are COOPA with GPT\-5\.2 \(70\.6%\), GPT\-5 \(69\.4%\), and Gemini\-3\-Flash \(68\.4%\), suggesting that the architecture amplifies strong backbones\. The gains also extend beyond the very top models: COOPA improves over the best baseline by 6\.7 percentage points on GPT\-5 and by 6\.3 on GPT\-4\.1\. A within\-system ablation then isolates the contribution of iterative modeling\.

## 2 Failure Modes and Design Requirements for LLM\-Based OR Systems

We provide a detailed discussion of related work in Section [5](https://arxiv.org/html/2606.27611#S5) and focus here on the failure modes that motivate COOPA\. The key bottleneck of LLM\-based OR systems is often the *problem\-to\-model* stage: translating natural language into variables, objectives, constraints, and parameters\. This motivates two requirements for COOPA: improving formulation quality before code generation and representing modeling artifacts in a traceable form\. A third requirement, extensible solver dispatch, follows from the broader diversity of OR problem classes and solver backends\.

### 2\.1 Formulation Errors as the Central Bottleneck

Existing analyses consistently identify formulation\-stage errors as a major source of failure\. ORLM’s error taxonomy Huang et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib8)\) concludes that “the primary challenges lie in the optimization modeling phase,” while code generation achieves a pass rate of approximately 95%\. Among sampled modeling errors, 56\.3% come from low model completeness, 30\.3% from objective or constraint translation errors, and 13\.4% from semantic misunderstanding of the problem text\. OptiMUS AhmadiTeshnizi et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib7)\) similarly reports that, on hard mixed\-integer problems with multi\-dimensional variables, parameter extraction and modeling are harder than coding the resulting model\. OptiTree Liu et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib25)\) further observes that many medium\- and hard\-instance failures stem from incorrect variable definitions\. These findings imply that code\-generation improvements alone are insufficient\. Execution feedback can catch syntax errors, runtime exceptions, or malformed constraints, but not formulations that are executable yet semantically inconsistent with the problem description\. COOPA therefore targets formulation quality, using iterative candidate generation, confidence\-based assessment, and structured extraction\.

### 2\.2 From Failure Modes to Design Requirements

LLM\-based OR systems differ both in how they construct formulations and in which solver backends they support\. These workflow choices point to three design requirements for COOPA\.

Pre\-code formulation improvement\. Methods differ both in how they obtain the formulation passed to code generation and in when they verify it\. Some generate an initial formulation and rely on downstream feedback to repair errors, using limited retries Xiao et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib3)\); Zhang and Luo \([2025](https://arxiv.org/html/2606.27611#bib.bib13)\) or correction loops driven by execution or solver feedback AhmadiTeshnizi et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib7)\); Shu et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib6)\)\. Others improve formulation quality through taxonomy\-guided decomposition Liu et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib25)\) or fine\-tuning on OR tasks Huang et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib8)\); Chen et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib31)\); Ding et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib28)\)\. These approaches can help, but they usually revise a single formulation trajectory rather than explicitly comparing alternatives before code generation\. Moreover, verification is often downstream\- or outcome\-driven: some systems use evaluate\-improve cycles AhmadiTeshnizi et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib7)\), execution\-based self\-correction Shu et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib6)\), or backward reflection after solution failure Xiao et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib3)\), while others accept the first result once the generated code runs and returns an answer Zhang and Luo \([2025](https://arxiv.org/html/2606.27611#bib.bib13)\); Liu et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib25)\); Huang et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib8)\)\. This leaves a key gap: a formulation can be executable and still misrepresent the original problem semantics\. COOPA therefore evaluates formulation quality before solver code is generated, generating multiple candidates and selecting among them with confidence\-based assessment\.

Structured and traceable modeling artifacts\. Systems also differ in how they represent intermediate artifacts\. Prior work uses JSON AhmadiTeshnizi et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib7)\), regex extraction from solver logs Liu et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib25)\), free\-form comments Xiao et al\. \([2024](https://arxiv.org/html/2606.27611#bib.bib3)\), and Markdown outputs Zhang and Luo \([2025](https://arxiv.org/html/2606.27611#bib.bib13)\)\. This choice matters because downstream stages must reliably recover variables, parameters, constraints, and objectives from LLM outputs\. Brittle formatting assumptions or ad hoc parsing can therefore affect both execution and measured accuracy; for example, we find that the regex\-based extraction used by OptiTree Liu et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib25)\) substantially reduces accuracy on some backbones relative to LLM\-based extraction\. At the same time, existing systems provide limited support for human verification: they return a formulation, solver code, and final answer, but not element\-level provenance linking variables, parameters, constraints, or objective terms to the problem text, nor confidence information for individual modeling choices\. Users must therefore reconstruct why each modeling element was introduced, a process that becomes costly as formulations grow larger\. COOPA addresses both issues with schema\-validated structured outputs and source traceability\.

Extensible solver dispatch\. Solver support matters not only in the range of backends currently available, but also in how easily the architecture can be extended to new ones\. OR problems may require mathematical programming, constraint programming, multi\-objective optimization, simulation\-based modeling, or black\-box search\. Systems with broader backend coverage, such as OptimAI Thind et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib30)\), still focus mainly on mathematical\-programming interfaces\. COOPA instead separates problem classification from solver\-specific optimization: it routes each instance to a specialized optimizer agent, and new optimizer agents can be added without changing the rest of the workflow\.

The three requirements above motivate the three components in Section [3](https://arxiv.org/html/2606.27611#S3): structured modeling with source traceability, iterative confidence\-based model selection, and multi\-solver dispatch\.

![Refer to caption](https://arxiv.org/html/2606.27611v1/x1.png)

Figure 1: Overview of COOPA\. A problem description is parsed into structured candidate models with source traceability, filtered by confidence\-based selection, and dispatched to a specialized solver\.

## 3 The Architecture of COOPA

COOPA addresses the three design requirements through structured modeling with source traceability \(Section [3\.2](https://arxiv.org/html/2606.27611#S3.SS2)\), iterative confidence\-based selection \(Section [3\.3](https://arxiv.org/html/2606.27611#S3.SS3)\), and solver dispatch \(Section [3\.4](https://arxiv.org/html/2606.27611#S3.SS4)\)\.

### 3\.1 System Overview

COOPA is a hierarchical multi\-agent system implemented with smolagents Roucher et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib36)\)\. It is organized around three components, shown in Figure [1](https://arxiv.org/html/2606.27611#S2.F1) and listed below\. Prompts are given in Appendix [A](https://arxiv.org/html/2606.27611#A1)\.

1. Structured modeling with source traceability\. The problem is converted into a structured candidate model with parameters, decision variables, an objective, and constraints\. Each element is produced under a Pydantic schema and includes a *source field* quoting the original problem text\.
2. Iterative confidence\-based model selection\. The structured\-modeling step is repeated $k$ times, with $k=3$ by default, to obtain multiple candidate models\. For each candidate, the LLM assigns confidence scores from 0 to 100 across four modeling dimensions and provides short explanations\. The candidate that maximizes the minimum confidence score across dimensions is selected\.
3. Multi\-solver dispatch\. The selected model is then routed to one of the optimizer agents\. Each optimizer agent is an LLM controller specialized to a solver package: it translates the model into solver\-specific code, invokes the corresponding backend, and returns the executed result\.

We use the following BWOR Question 76 Zhang and Luo \([2025](https://arxiv.org/html/2606.27611#bib.bib13)\) as a running example to illustrate COOPA\. The reported candidate models, confidence scores, and solver outputs come from an actual run \(o4\-mini backbone\)\. The key challenge here is that assembly\-line throughput is determined by the *slowest* station, so the formulation must optimize a bottleneck objective rather than total station output\.

Running Example: Problem Description \(BWOR Question 76\)

An assembly line has 5 stations, each responsible for one of the five steps in the assembly of a certain product\. Five workers \(A, B, C, D, E\) are to be assigned to operate these stations\. Due to differences in individual skills, the production efficiency of each worker varies by station \(unit: pieces/min\)\.

Question: How should each worker be assigned to a station to maximize the overall production capacity of the assembly line?

Gold answer: $T^{*} = 5.0$ pieces/min \(bottleneck throughput\)\.

| Worker | I | II | III | IV | V |
| --- | --- | --- | --- | --- | --- |
| A | 2 | 3 | 4 | 1 | 7 |
| B | 3 | 4 | 2 | 5 | 6 |
| C | 2 | 5 | 3 | 4 | 1 |
| D | 5 | 2 | 3 | 2 | 5 |
| E | 3 | 7 | 6 | 2 | 4 |

### 3\.2 Structured Modeling with Source Traceability

The first component, shown on the left of Figure [1](https://arxiv.org/html/2606.27611#S2.F1), defines the structured representation used throughout the pipeline and applies it once to the problem description to produce a candidate model whose elements can be checked individually\. A predefined Pydantic schema ensures that downstream components receive structured artifacts in a consistent form\. Each model contains four element types:

- ParameterDefinition: name, symbol, value or domain, and source reference\.
- VariableDefinition: name, symbol, type, bounds, and source reference\.
- ObjectiveDefinition: direction, math expression, variables involved, and source reference\.
- ConstraintDefinition: name, math expression, type, and source reference\.

Each element includes a required source reference quoting the original problem text that supports it\. This makes the formulation easier to inspect and helps localize errors when the final answer is wrong\. Schema validation reduces reliance on post\-hoc parsing and ensures that downstream stages receive well\-formed objects with the required fields\. Validation guarantees structure, not mathematical correctness; correctness is assessed separately in Section [3\.3](https://arxiv.org/html/2606.27611#S3.SS3)\.
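The four element types and the required source field can be sketched as follows. This is a minimal illustration, not COOPA's actual schema: it uses standard-library dataclasses as a stand-in for Pydantic so that it runs without extra dependencies, and the field names (`var_type`, `ctype`, `lower`, `upper`), the symbol `S`, and the helper `unsupported_elements` are assumptions introduced here. The verbatim substring check is also an assumption; quoted sources in practice may contain elisions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ModelElement:
    """Common base: every modeling element must quote its supporting text."""
    source: str

    def __post_init__(self) -> None:
        # Structural check only; it says nothing about mathematical correctness.
        if not self.source.strip():
            raise ValueError(f"{type(self).__name__} is missing a source reference")


@dataclass(frozen=True)
class ParameterDefinition(ModelElement):
    name: str
    symbol: str
    value: float


@dataclass(frozen=True)
class VariableDefinition(ModelElement):
    name: str
    symbol: str
    var_type: str  # hypothetical labels, e.g. "binary", "integer", "continuous"
    lower: Optional[float]
    upper: Optional[float]


@dataclass(frozen=True)
class ObjectiveDefinition(ModelElement):
    direction: str  # "max" or "min"
    expression: str
    variables: Tuple[str, ...]


@dataclass(frozen=True)
class ConstraintDefinition(ModelElement):
    name: str
    expression: str
    ctype: str  # hypothetical labels, e.g. "equality", "inequality"


def unsupported_elements(elements: List[ModelElement], problem_text: str) -> List[ModelElement]:
    """Return elements whose quoted source does not occur verbatim in the problem text."""
    return [e for e in elements if e.source not in problem_text]


PROBLEM = (
    "An assembly line has 5 stations, each responsible for one of the five steps "
    "in the assembly of a certain product. Five workers (A, B, C, D, E) are to be "
    "assigned to operate these stations."
)

elements = [
    ParameterDefinition(source="An assembly line has 5 stations",
                        name="num_stations", symbol="S", value=5),
    # Deliberately unsupported quote, to show what the audit helper flags.
    ObjectiveDefinition(source="maximize total profit",
                        direction="max", expression="sum(e[w,s] * x[w,s])",
                        variables=("x",)),
]

flagged = unsupported_elements(elements, PROBLEM)
print([type(e).__name__ for e in flagged])  # ['ObjectiveDefinition']
```

As in the text, passing these checks only shows that an element is well formed and points at real source text; it does not show that the element encodes that text correctly.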

The following box shows selected elements from the first candidate produced with this schema in the running example\. Each element includes a source reference linking it to the original problem text\.

Running Example: Candidate Model $M_1$

| Type | Element | Source reference |
| --- | --- | --- |
| Parameters | num\_stations = 5 | “An assembly line has 5 stations…” |
| | num\_workers = 5 | “Five workers — A, B, C, D, and E — are to be assigned to operate these stations\.” |
| | efficiency\_A\_I = 2 | “The specific efficiencies are shown in Table 8–10 …”; worker A, station I |
| | efficiency\_B\_IV = 5 | “The specific efficiencies are shown in Table 8–10 …”; worker B, station IV |
| | $\cdots$ | $\cdots$ \(23 more efficiency parameters\) |
| Variables | $x_{w,s} \in \{0,1\} \quad \forall w,s$ | “How should each worker be assigned to a station to maximize the overall production capacity of the assembly line?” |
| Objective | $\max \sum_{w,s} e_{w,s} x_{w,s}$ \(WRONG\) | “How should each worker be assigned to a station to maximize the overall production capacity of the assembly line?” |
| Constraints | $\sum_{s} x_{w,s} = 1 \quad \forall w$ | “Five workers — A, B, C, D, and E — are to be assigned to operate these stations\.” |
| | $\sum_{w} x_{w,s} = 1 \quad \forall s$ | “An assembly line has 5 stations…” |

This highlights both the value and the limit of source traceability\. The binary variables that represent assigning workers to stations, as well as the constraints that assign each worker to one station and each station to one worker, are well grounded in the quoted text\. The objective, however, cites the phrase “maximize the overall production capacity” but incorrectly encodes it as maximizing the sum of station efficiencies rather than the bottleneck throughput\. Source references therefore make modeling choices easier to audit, but do not by themselves ensure correctness\. The next component builds directly on this one: it generates several candidate models in the same schema and then evaluates them against the original problem statement\.

### 3\.3 Iterative Confidence\-Based Model Selection

Because LLMs sometimes misinterpret parts of the problem description, committing to a single candidate leaves no opportunity to catch errors before code generation\. As shown in the center of Figure [1](https://arxiv.org/html/2606.27611#S2.F1), COOPA therefore invokes the structured\-modeling component multiple times and uses self\-evaluated confidence to select the most reliable candidate before dispatch\.

Algorithm 1: Iterative Confidence\-Based Model Selection

1. Input: problem description $P$, number of candidates $k$
2. $\mathcal{H} \leftarrow \emptyset$ $\quad\triangleright$ History of candidates and evaluations
3. for $i = 1, \ldots, k$ do
4. $\quad M_i \leftarrow \textsc{GenerateModel}(P, \mathcal{H})$ $\quad\triangleright$ Structured model with source references
5. $\quad (\{c_{i,j}\}_j, \{e_{i,j}\}_j) \leftarrow \textsc{EvaluateConfidence}(P, M_i)$ $\quad\triangleright$ One call returns all scores
6. $\quad \mathcal{H} \leftarrow \mathcal{H} \cup \{(M_i, \{c_{i,j}, e_{i,j}\}_j)\}$
7. $M^{*} \leftarrow \arg\max_{i \in \{1,\ldots,k\}} \min_{j} c_{i,j}$ $\quad\triangleright$ Max\-min selection
8. return $M^{*}$

The procedure operates as follows \(Algorithm [1](https://arxiv.org/html/2606.27611#alg1)\)\. In this component, the LLM first invokes the structured\-modeling component to produce a candidate model $M_1$ under the same schema, with source references attached to each element, then assigns confidence scores $c_{1,j} \in [0,100]$ across four dimensions \(parameters, variables, objective, and constraints\), together with short explanations\. The evaluation uses one structured output call over the candidate and the original problem text; no external rubric or few\-shot examples are provided\. The LLM then generates revised candidates $M_2, \ldots, M_k$ using the original problem together with earlier candidates and their evaluations, allowing low\-confidence dimensions to be addressed explicitly\. After $k$ candidates have been generated and evaluated \(default $k=3$\), the system selects the final model using a *max\-min* strategy: $M^{*} = \arg\max_{i \in \{1,\ldots,k\}} \min_{j \in \{1,2,3,4\}} c_{i,j}$\. That is, the selected model is the one with the highest worst\-case confidence across the four modeling dimensions\. This targets the common case where one flawed element invalidates an otherwise strong formulation, consistent with the failure mode discussed in Section [2\.1](https://arxiv.org/html/2606.27611#S2.SS1)\. These scores also support human verification by indicating uncertain dimensions; we discuss their calibration in Appendix [D\.4](https://arxiv.org/html/2606.27611#A4.SS4)\.
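The control flow of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: `generate` and `evaluate` stand in for the two LLM calls and are replaced here by stubs that return canned values. The scores for `M1` and `M3` are the ones reported for the running example; only the minimum score of `M2` (90) is reported, so its per-dimension values below are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, int]             # dimension name -> confidence in [0, 100]
History = List[Tuple[str, Scores]]  # (candidate, scores) pairs seen so far


def select_model(
    problem: str,
    generate: Callable[[str, History], str],
    evaluate: Callable[[str, str], Scores],
    k: int = 3,
) -> Tuple[str, Scores]:
    """Generate k candidates, score each, and return the max-min winner."""
    history: History = []
    for _ in range(k):
        candidate = generate(problem, history)  # later calls see earlier evaluations
        scores = evaluate(problem, candidate)   # one call returns all four scores
        history.append((candidate, scores))
    # Max-min: pick the candidate whose weakest dimension is strongest.
    # On ties, max() keeps the earliest candidate.
    return max(history, key=lambda item: min(item[1].values()))


# Canned scores standing in for LLM output. M2's per-dimension values are
# HYPOTHETICAL; the source reports only that its minimum is 90.
CANNED: Dict[str, Scores] = {
    "M1": {"parameters": 95, "variables": 100, "objective": 30, "constraints": 100},
    "M2": {"parameters": 95, "variables": 100, "objective": 90, "constraints": 100},
    "M3": {"parameters": 95, "variables": 100, "objective": 100, "constraints": 100},
}


def stub_generate(problem: str, history: History) -> str:
    return f"M{len(history) + 1}"


def stub_evaluate(problem: str, candidate: str) -> Scores:
    return CANNED[candidate]


chosen, chosen_scores = select_model("BWOR Question 76", stub_generate, stub_evaluate)
print(chosen, min(chosen_scores.values()))  # M3 95
```

Selecting on the minimum rather than the mean means a single low-scoring dimension, such as the objective of `M1`, is enough to rule a candidate out even when its other scores are high.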

The next two boxes illustrate how the confidence scores guide model selection\. The first shows the evaluation of the initial candidate $M_1$\. The second shows the formulation selected as $M^{\star}$\.

Running Example: Confidence Evaluation of Candidate Model $M_1$

| Type | Score | Comments |
| --- | --- | --- |
| Parameters | 95/100 | All station and worker counts and the 25 efficiency parameters are identified with correct values and units\. |
| Variables | 100/100 | Binary variable x\[w,s\] is defined for each worker–station pair with correct domain\. |
| Objective | 30/100 \(MIN\) | The formulation maximizes the sum of efficiencies, but assembly\-line capacity is driven by the bottleneck \(minimum station rate\), so the objective is mis\-specified\. |
| Constraints | 100/100 | Exactly\-one assignment constraints for each worker and each station are correctly specified\. |

The scores isolate the objective as the only weak dimension: it receives 30/100, while parameters, variables, and constraints score 95–100\. This indicates that the main error lies in how throughput is formulated, not in the extracted entities or assignment structure\. Guided by the low objective score, $M_2$ replaces the incorrect sum\-of\-efficiencies objective with a bottleneck formulation, and $M_3$ refines it further\. The selected model $M^{\star} = M_3$ makes three key changes relative to $M_1$, as shown below\.

Running Example: Changes of Selected Candidate Model $M^{\star}$ Relative to $M_1$

| Type | Updated element | Source reference |
| --- | --- | --- |
| Variable | $T \geq 0$: continuous bottleneck\-throughput variable | “How should each worker be assigned to a station to maximize the overall production capacity of the assembly line?” |
| Objective | Maximize $T$ | “How should each worker be assigned to a station to maximize the overall production capacity of the assembly line?” |
| Constraint | $T \leq \sum_{w} e_{w,s} \cdot x_{w,s} \quad \forall\, s$ | “An assembly line has 5 stations, each responsible for one of the five steps in the assembly of a certain product\.” |

Confidence: Parameters 95, Variables 100, Objective 100, Constraints 100; min score: 95/100

The max\-min criterion therefore selects $M_3$ \(min = 95\) over $M_2$ \(min = 90\) and $M_1$ \(min = 30\)\. Solving $M_3$ yields the correct bottleneck throughput \($T^{*} = 5.0$\)\. By contrast, executing $M_1$ would optimize the mis\-specified sum\-of\-efficiencies objective and return 28\.0, which is not meaningful for the problem\. Appendix [B](https://arxiv.org/html/2606.27611#A2) provides more details for this example and an additional solver\-dispatch case study\.
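Both reported values can be checked by brute force, since five workers and five stations give only 120 possible assignments. The sketch below is an independent check written for this example, not part of COOPA; it enumerates permutations instead of calling a solver, using the efficiency table from the running example.

```python
from itertools import permutations

# Efficiency table from the running example (pieces/min); columns are stations I..V.
EFFICIENCY = {
    "A": [2, 3, 4, 1, 7],
    "B": [3, 4, 2, 5, 6],
    "C": [2, 5, 3, 4, 1],
    "D": [5, 2, 3, 2, 5],
    "E": [3, 7, 6, 2, 4],
}
WORKERS = list(EFFICIENCY)


def rates(assignment):
    """Station rates when WORKERS[i] is placed at station index assignment[i]."""
    return [EFFICIENCY[w][s] for w, s in zip(WORKERS, assignment)]


# Bottleneck objective (as in the selected model): maximize the slowest station's rate.
best_bottleneck = max(min(rates(p)) for p in permutations(range(5)))

# Sum objective (as in the first candidate): maximize the total of all station rates.
best_sum = max(sum(rates(p)) for p in permutations(range(5)))

print(best_bottleneck, best_sum)  # 5 28
```

The two objectives return different numbers for the same data, which is why an executable but mis-specified formulation cannot be caught by execution feedback alone.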

### 3\.4 Multi\-Solver Dispatch

Because different OR problem classes are best served by different solvers and modeling paradigms, COOPA does not commit to a single solver backend\. As shown on the right of Figure [1](https://arxiv.org/html/2606.27611#S2.F1), the solver\-dispatch component classifies the selected model $M^{*}$ and routes it to one of four specialized optimizer agents \(Table [1](https://arxiv.org/html/2606.27611#S3.T1)\)\. The optimizer agent is not the solver itself; it is the agent layer that decides how to encode the model, calls the appropriate backend library or external solver, inspects the output, and retries when execution fails\. Each agent therefore packages solver\-specific prompts \(Appendix [A](https://arxiv.org/html/2606.27611#A1)\), whitelisted tool access, and domain\-appropriate libraries around a particular solver family\.

Each optimizer agent operates within a *Thought–Code–Observation* loop: it analyzes the model, generates backend-specific Python code, executes that code to call the target solver or library, inspects the output, and retries if errors occur. The general optimizer handles problems that do not require a dedicated solver package, such as simulation-based or custom numerical tasks.
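The loop described above can be pictured as a small retry harness. The sketch below is an assumption-laden toy: `run_with_retries` and `toy_writer` are hypothetical names, the LLM is replaced by a stub that fixes its bug on the second attempt, and generated code is run with a bare `exec`, which a real system would replace with a sandboxed executor.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_with_retries(
    write_code: Callable[[Optional[str]], str],
    max_attempts: int = 3,
) -> Tuple[Optional[object], int]:
    """Run generated code, feeding any traceback back to the code writer."""
    observation: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = write_code(observation)  # Thought + Code: draft or repair a script
        namespace: dict = {}
        try:
            exec(code, namespace)  # execute the generated script
            return namespace.get("result"), attempt
        except Exception:
            # Observation: keep the traceback so the next draft can react to it.
            observation = traceback.format_exc()
    return None, max_attempts


def toy_writer(observation: Optional[str]) -> str:
    """Stand-in for the LLM: the first draft is buggy, the retry is fixed."""
    if observation is None:
        return "result = undefined_name + 1"  # raises NameError
    return "result = 2 + 3"


result, attempts = run_with_retries(toy_writer)
print(result, attempts)
```

Passing the traceback back into the writer is what distinguishes this loop from blind re-sampling: each retry is conditioned on the previous failure.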

This design reflects OR practice: different problem classes benefit from different solvers. Vehicle routing and scheduling fit OR-Tools better than Pyomo, while non-convex or multi-objective problems may require evolutionary algorithms rather than branch-and-bound. For example, a multi-objective portfolio problem requires a Pareto front, so COOPA routes it to the metaheuristic optimizer rather than forcing repeated single-objective scalarizations. This design is also *scalable*: adding support for a new solver requires only defining a new optimizer agent, with no changes to the modeling stage. Fine-tuned models such as ORLM Huang et al. ([2025](https://arxiv.org/html/2606.27611#bib.bib8)) could likewise be integrated as specialized agents within the same architecture.

Table 1: Specialized Optimizer Agents in COOPA.

| Optimizer Agent | Problem Type | Backend Invoked |
|---|---|---|
| Mathematical Optimizer | LP, MIP, nonlinear | Pyomo |
| Combinatorial Optimizer | Routing, assignment, scheduling | OR-Tools |
| Metaheuristic Optimizer | Non-convex, black-box | pymoo |
| General Optimizer | Numerical, simulation, uncategorized | Standard Python |
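The routing in Table 1 can be summarized as a lookup from problem class to optimizer agent. Note that COOPA performs this classification with an LLM prompt (Appendix A.5); the dictionary below is only a rule-based stand-in for illustration. The agent identifiers follow the names used in the solver-dispatch prompt, while the class labels and the `route` helper are assumptions.

```python
ROUTES = {
    # Mathematical Optimizer (Pyomo)
    "LP": "mathematical_optimizer_agent",
    "MIP": "mathematical_optimizer_agent",
    "NLP": "mathematical_optimizer_agent",
    # Combinatorial Optimizer (OR-Tools)
    "routing": "combinatorial_optimizer_agent",
    "assignment": "combinatorial_optimizer_agent",
    "scheduling": "combinatorial_optimizer_agent",
    # Metaheuristic Optimizer (pymoo)
    "non-convex": "metaheuristic_optimizer_agent",
    "black-box": "metaheuristic_optimizer_agent",
    "multi-objective": "metaheuristic_optimizer_agent",
}

BACKENDS = {
    "mathematical_optimizer_agent": "Pyomo",
    "combinatorial_optimizer_agent": "OR-Tools",
    "metaheuristic_optimizer_agent": "pymoo",
    "general_optimizer_agent": "Standard Python",
}


def route(problem_class: str) -> str:
    """Map a problem class to an agent; unknown classes go to the general agent."""
    return ROUTES.get(problem_class, "general_optimizer_agent")


for label in ("MIP", "scheduling", "multi-objective", "simulation"):
    agent = route(label)
    print(f"{label:16s} -> {agent} ({BACKENDS[agent]})")
```

Under this view, supporting a new solver amounts to adding one entry to each table, which mirrors the scalability argument made in the text.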

## 4 Experiments

We evaluate COOPA across three benchmarks, eight LLM backbones, and four baselines\.

### 4.1 Experimental Setup

Datasets. We use three deterministic optimization benchmarks. ComplexLP contains 211 LP and MILP problems from MAMO Huang et al. ([2024](https://arxiv.org/html/2606.27611#bib.bib38)); although all are linear, they require correct variable typing and constraint modeling. IndustryOR contains 100 real-world industrial OR problems from ORLM Huang et al. ([2025](https://arxiv.org/html/2606.27611#bib.bib8)), spanning eight industries, three difficulty levels, and five problem types (LP, IP, MIP, NLP, and others). BWOR contains 82 textbook problems from OR-LLM-Agent Zhang and Luo ([2025](https://arxiv.org/html/2606.27611#bib.bib13)), covering LP, IP, and MIP with no nonlinear instances; two of these are infeasible and have no ground-truth optimum, so we evaluate on the remaining 80. We exclude NLP4LP, NL4OPT, and EasyLP because frontier models already reach saturation-level accuracy ($>80\%$), limiting their value for method comparison.

LLM backbones. We evaluate eight backbones spanning proprietary and open-source, reasoning and non-reasoning models: GPT-5.2, GPT-5, GPT-4.1, o3, o4-mini, Gemini-3-Flash, Gemini-2.5-Flash, and Qwen3-30B-A3B (abbreviated as Qwen3-30B). This range lets us test how performance varies across LLMs.

Baselines. We compare against four prior LLM-based OR systems. Chain-of-Experts (CoE) Xiao et al. ([2024](https://arxiv.org/html/2606.27611#bib.bib3)) uses a conductor model to orchestrate specialized expert agents and applies backward reflection after failure. OptiMUS AhmadiTeshnizi et al. ([2024](https://arxiv.org/html/2606.27611#bib.bib7)) decomposes problems into clauses and uses an evaluate-improve/debug loop with a Gurobi backend. OptiTree Liu et al. ([2026](https://arxiv.org/html/2606.27611#bib.bib25)) retrieves modeling thoughts from a hierarchical modeling tree. OR-LLM-Agent Zhang and Luo ([2025](https://arxiv.org/html/2606.27611#bib.bib13)) uses a simple three-agent pipeline (Math $\to$ Code $\to$ Debugging), with correction driven mainly by downstream execution failures.

Metric. We report accuracy, defined as the fraction of problems whose reported objective value is within $0.1$ absolute error of the ground-truth optimum. Two BWOR problems are infeasible and excluded. We report macro-averages across the three benchmarks.
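As a concrete reading of this metric, the sketch below scores a list of reported objective values against ground-truth optima and then macro-averages per-benchmark accuracies. The function names are hypothetical, and two details are assumptions not stated in the text: the tolerance is treated as inclusive, and a run that returns no value (`None`) counts as incorrect.

```python
from typing import Optional, Sequence


def is_correct(reported: Optional[float], optimum: float, tol: float = 0.1) -> bool:
    """True when the reported objective is within `tol` absolute error."""
    # Assumption: a failed run (None) is scored as incorrect.
    return reported is not None and abs(reported - optimum) <= tol


def accuracy(reported: Sequence[Optional[float]], optima: Sequence[float]) -> float:
    """Percentage of problems solved within tolerance."""
    hits = sum(is_correct(r, o) for r, o in zip(reported, optima))
    return 100.0 * hits / len(optima)


def macro_average(per_benchmark: Sequence[float]) -> float:
    """Unweighted mean over benchmarks, regardless of benchmark size."""
    return sum(per_benchmark) / len(per_benchmark)


# Toy instance: one hit, one wrong value, one failed run.
print(round(accuracy([5.0, 28.0, None], [5.0, 5.0, 12.0]), 1))

# Per-benchmark accuracies reported for COOPA with GPT-5.2
# (ComplexLP, IndustryOR, BWOR).
print(round(macro_average([55.9, 76.0, 80.0]), 1))
```

Because the macro-average is unweighted, the 80-problem BWOR set contributes as much as the 211-problem ComplexLP set.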

### 4.2 Main Results and Analysis

The results are reported in Table [2](https://arxiv.org/html/2606.27611#S4.T2). COOPA achieves the highest cross-model macro-average accuracy at 64.8%, compared with 61.6% for OR-LLM-Agent, 61.3% for OptiTree, and 60.1% for CoE. It achieves the best macro-average on 6 of 8 backbones. The 3.2-point gain over the next-best method is modest in absolute terms but consistent across multiple backbones.

Table 2: Accuracy (%) across three benchmarks and eight LLM backbones. **Bold** indicates best performance for each model. COOPA achieves the highest macro-average on 6 of 8 backbones.

| Model | Method | ComplexLP | IndustryOR | BWOR | Macro-Avg |
|---|---|---|---|---|---|
| GPT-5.2 | Chain-of-Experts | 55.5 | 70.0 | 75.0 | 66.8 |
| | OptiMUS | 14.2 | 14.0 | 25.0 | 17.7 |
| | OptiTree | 53.6 | 74.0 | 77.5 | 68.4 |
| | OR-LLM-Agent | 49.8 | 70.0 | 78.8 | 66.2 |
| | COOPA (Ours) | **55.9** | **76.0** | **80.0** | **70.6** |
| GPT-5 | Chain-of-Experts | 48.8 | 63.0 | 76.3 | 62.7 |
| | OptiMUS | 38.4 | 43.0 | 47.5 | 43.0 |
| | OptiTree | 43.1 | 54.0 | 52.5 | 49.9 |
| | OR-LLM-Agent | 40.3 | 63.0 | 67.5 | 56.9 |
| | COOPA (Ours) | **53.1** | **75.0** | **80.0** | **69.4** |
| GPT-4.1 | Chain-of-Experts | 43.6 | 60.0 | 70.0 | 57.9 |
| | OptiMUS | 39.3 | 48.0 | 58.8 | 48.7 |
| | OptiTree | 49.3 | 62.0 | 68.8 | 60.0 |
| | OR-LLM-Agent | 45.5 | 61.0 | 71.3 | 59.3 |
| | COOPA (Ours) | **53.6** | **69.0** | **76.3** | **66.3** |
| o3 | Chain-of-Experts | **55.5** | 72.0 | 75.0 | **67.5** |
| | OptiMUS | 36.0 | 43.0 | 42.5 | 40.5 |
| | OptiTree | 43.1 | **75.0** | 78.8 | 65.6 |
| | OR-LLM-Agent | 47.9 | 64.0 | **80.0** | 64.0 |
| | COOPA (Ours) | 53.6 | 73.0 | 73.8 | 66.8 |
| o4-mini | Chain-of-Experts | 48.3 | 69.0 | 68.8 | 62.0 |
| | OptiMUS | 34.6 | 43.0 | 43.8 | 40.5 |
| | OptiTree | **52.6** | 65.0 | 71.3 | 63.0 |
| | OR-LLM-Agent | 47.4 | 68.0 | **78.8** | 64.7 |
| | COOPA (Ours) | 47.9 | **72.0** | 77.5 | **65.8** |
| Gemini-3-Flash | Chain-of-Experts | 47.4 | **75.0** | 75.0 | 65.8 |
| | OptiMUS | 38.4 | 26.0 | 28.8 | 31.1 |
| | OptiTree | **60.2** | 67.0 | 75.0 | 67.4 |
| | OR-LLM-Agent | 52.6 | 69.0 | **81.3** | 67.6 |
| | COOPA (Ours) | 52.6 | **75.0** | 77.5 | **68.4** |
| Gemini-2.5-Flash | Chain-of-Experts | 49.3 | 32.0 | 47.5 | 42.9 |
| | OptiMUS | 27.0 | 35.0 | 36.3 | 32.8 |
| | OptiTree | **53.6** | 62.0 | 70.0 | 61.9 |
| | OR-LLM-Agent | 46.4 | 67.0 | 70.0 | 61.1 |
| | COOPA (Ours) | 47.4 | **71.0** | **77.5** | **65.3** |
| Qwen3-30B† | Chain-of-Experts | 42.2 | 57.0 | **67.5** | **55.6** |
| | OptiMUS | 23.7 | 28.0 | 35.0 | 28.9 |
| | OptiTree | **46.4** | 55.0 | 61.3 | 54.2 |
| | OR-LLM-Agent | 38.9 | **58.0** | 62.5 | 53.1 |
| | COOPA (Ours) | 32.2 | 48.0 | 56.3 | 45.5 |

† Qwen3-30B is shorthand for Qwen3-30B-A3B-Thinking-2507.

Gains vary with backbone capability. COOPA’s largest gains over the best baseline appear on GPT-5 (+6.7 points: 69.4 vs. 62.7 for CoE) and GPT-4.1 (+6.3: 66.3 vs. 60.0 for OptiTree). Gains are smaller on GPT-5.2 (+2.2) and on the reasoning models o3 (−0.7, where CoE leads) and o4-mini (+1.1).
This pattern suggests that iterative refinement helps most when the backbone is strong enough to benefit from self\-correction but not already near saturation\.
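The per-backbone gains quoted above follow directly from the Macro-Avg column of Table 2. The short script below recomputes them as a sanity check; the numbers are copied from the table, while the helper name `gain_over_best_baseline` is only illustrative.

```python
# Macro-average accuracy (%) per backbone, copied from Table 2.
macro_avg = {
    "GPT-5.2": {"CoE": 66.8, "OptiMUS": 17.7, "OptiTree": 68.4, "OR-LLM-Agent": 66.2, "COOPA": 70.6},
    "GPT-5": {"CoE": 62.7, "OptiMUS": 43.0, "OptiTree": 49.9, "OR-LLM-Agent": 56.9, "COOPA": 69.4},
    "GPT-4.1": {"CoE": 57.9, "OptiMUS": 48.7, "OptiTree": 60.0, "OR-LLM-Agent": 59.3, "COOPA": 66.3},
    "o3": {"CoE": 67.5, "OptiMUS": 40.5, "OptiTree": 65.6, "OR-LLM-Agent": 64.0, "COOPA": 66.8},
    "o4-mini": {"CoE": 62.0, "OptiMUS": 40.5, "OptiTree": 63.0, "OR-LLM-Agent": 64.7, "COOPA": 65.8},
    "Gemini-3-Flash": {"CoE": 65.8, "OptiMUS": 31.1, "OptiTree": 67.4, "OR-LLM-Agent": 67.6, "COOPA": 68.4},
    "Gemini-2.5-Flash": {"CoE": 42.9, "OptiMUS": 32.8, "OptiTree": 61.9, "OR-LLM-Agent": 61.1, "COOPA": 65.3},
    "Qwen3-30B": {"CoE": 55.6, "OptiMUS": 28.9, "OptiTree": 54.2, "OR-LLM-Agent": 53.1, "COOPA": 45.5},
}


def gain_over_best_baseline(row):
    """Return (strongest baseline, COOPA minus that baseline) for one backbone."""
    baselines = {m: s for m, s in row.items() if m != "COOPA"}
    best = max(baselines, key=baselines.get)
    return best, round(row["COOPA"] - baselines[best], 1)


gains = {backbone: gain_over_best_baseline(row) for backbone, row in macro_avg.items()}
wins = sum(delta > 0 for _, delta in gains.values())
coopa_mean = round(sum(r["COOPA"] for r in macro_avg.values()) / len(macro_avg), 1)

for backbone, (baseline, delta) in gains.items():
    print(f"{backbone:17s} {delta:+5.1f} vs {baseline}")
print("backbones where COOPA leads:", wins)
print("COOPA cross-model mean:", coopa_mean)
```

The same table also shows the one large negative case, Qwen3-30B, where COOPA trails the strongest baseline by about ten points.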

IndustryOR advantage. COOPA is especially strong on IndustryOR, the most ambiguous benchmark. On GPT-5 and Gemini-3-Flash, it reaches 75.0%; on Gemini-3-Flash this ties the best baseline, while on GPT-5 it is the top result. This is consistent with our motivating claim that COOPA is most useful when the main bottleneck is problem-to-model abstraction rather than code execution.

No single baseline dominates. The strongest baseline changes across backbones: CoE leads on o3 and Qwen3-30B, OptiTree is best among baselines on GPT-5.2, and OR-LLM-Agent is strongest on Gemini-3-Flash. This reinforces the need for cross-model evaluation rather than single-model benchmarking.

### 4.3 Additional Analysis Summary

Appendix [C](https://arxiv.org/html/2606.27611#A3) reports the supplementary experiment results. Here we retain only the two findings most central to the main claim.

Ablation. Iterative confidence-based modeling is the main driver of COOPA’s gains. Relative to solving only the first candidate from the same run, the full $k=3$ pipeline improves macro-average accuracy on 7 of 8 backbones and raises the cross-model mean from 61.8% to 64.8% (+3.0 points). Without iteration, the base pipeline is roughly tied with the strongest baselines, indicating that iterative refinement explains most of COOPA’s advantage.

Confidence analysis. The max-min confidence criterion adds signal beyond candidate diversity. Solving the selected candidate instead of the first candidate improves accuracy on 7 of 8 backbones, confidence gain correlates positively with accuracy gain across model–benchmark pairs ($r = 0.58$, $p = 0.003$), and the criterion makes 181 beneficial overrides versus 95 harmful ones. The appendix also provides full results on cross-model robustness, the Qwen3-30B underperformance case, solver-dispatch statistics, cost, and case studies.

## 5 Related Work

The literature on LLM-based OR has grown rapidly. We organize prior work into three categories: workflow and agent approaches that use general-purpose LLMs, training-based approaches that adapt model weights, and methodological foundations in self-improvement and multi-agent systems. Table [3](https://arxiv.org/html/2606.27611#S5.T3) summarizes how existing agent-based methods compare along key design dimensions.

### 5.1 Workflow and Agent Approaches

Early work on LLM-based OR workflows established two complementary strategies for managing problem complexity. OptiMUS AhmadiTeshnizi et al. ([2024](https://arxiv.org/html/2606.27611#bib.bib7)) introduces a modular workflow that decomposes problems into individual clauses, formulates each into mathematical expressions independently, and uses a connection graph to manage context across clauses. Importantly, OptiMUS is the first system to incorporate formulation quality assessment: a confidence-based feedback mechanism scores each clause on a 1–5 scale and can escalate uncertain formulations to a stronger model. Chain-of-Experts (CoE) Xiao et al. ([2024](https://arxiv.org/html/2606.27611#bib.bib3)) takes a different approach by orchestrating 11 specialized agents (e.g., Terminology Interpreter, Variable Extraction, Code Reviewer) through a Conductor LLM that dynamically selects which expert to invoke. When the initial solution fails, a backward reflection mechanism re-consults experts in reverse order to identify and correct errors. While OptiMUS focuses on decomposing the *problem* into manageable pieces, CoE focuses on decomposing the *workflow* into specialized roles.

To address the difficulty of formulating problems from scratch, subsequent methods incorporate external knowledge into the modeling process. OptiTree Liu et al. ([2026](https://arxiv.org/html/2606.27611#bib.bib25)) constructs a hierarchical modeling tree offline, organized by problem taxonomy and complexity, so that at inference time the system can retrieve modeling thoughts from the closest known subproblem as context for formulation. This knowledge-augmented approach is particularly effective for problems that fall within the tree’s coverage, though it generates formulations in a single pass at inference time. LEAN-LLM-OPT Liang et al. ([2026](https://arxiv.org/html/2606.27611#bib.bib27)) addresses a complementary challenge: large-scale problems where input data resides in external files and exceeds prompt length limits. It uses RAG-based retrieval from 96 reference examples to classify problems and dynamically construct step-by-step workflows, offloading mechanical data-handling operations to auxiliary tools. Both methods demonstrate the value of grounding LLM generations in external knowledge, but neither assesses formulation quality before code generation.

Other methods explore different design priorities. OR-LLM-Agent Zhang and Luo ([2025](https://arxiv.org/html/2606.27611#bib.bib13)) argues that modern reasoning LLMs have sufficiently strong mathematical capabilities to benefit from simple task decomposition rather than elaborate prompting or retrieval. It employs three sequential agents, a Math Agent, a Code Agent, and a Debugging Agent, achieving competitive results with minimal workflow complexity. OptimAI Thind et al. ([2025](https://arxiv.org/html/2606.27611#bib.bib30)) stands out for supporting multiple solver backends (including PuLP, Pyomo, and OR-Tools), broadening the range of problems that can be addressed. It generates multiple solution plans and falls back to alternatives when the initial plan fails.

Table 3: Design comparison of agent-based LLM-OR methods. *External knowledge*: whether the system retrieves or references pre-built examples or taxonomies. *Formulation quality assessment*: whether the system evaluates formulation correctness before code generation. *Source traceability*: whether users can trace modeling elements back to the problem description. *Multi-solver*: whether the system supports more than one solver backend.

| Method | External Knowledge | Formulation Quality Assessment | Source Traceability | Multi-Solver |
|---|---|---|---|---|
| OptiMUS | ✗ | ✓ | ✗ | ✗ |
| Chain-of-Experts | ✗ | ✗ | ✗ | ✗ |
| OptiTree | ✓ | ✗ | ✗ | ✗ |
| OR-LLM-Agent | ✗ | ✗ | ✗ | ✗ |
| LEAN-LLM-OPT | ✓ | ✗ | ✗ | ✗ |
| OptimAI | ✗ | ✗ | ✗ | ✓ |
| COOPA (ours) | ✗ | ✓ | ✓ | ✓ |

Across these methods, two gaps remain. First, although OptiMUS introduces confidence scoring, it neither explains in natural language why a score is low nor uses evaluation to iteratively improve formulations. Other workflow methods either omit formulation-quality assessment or perform it reactively, after downstream execution failures rather than through direct analysis of the formulation. As a result, these systems fail to exploit LLMs’ demonstrated capacity for self-evaluation Kadavath et al. ([2022](https://arxiv.org/html/2606.27611#bib.bib33)); Madaan et al. ([2023](https://arxiv.org/html/2606.27611#bib.bib32)) to proactively refine mathematical formulations before code generation, leaving the dominant error source unaddressed. Second, no existing method provides source traceability from individual modeling elements (parameters, variables, constraints, and objectives) to the specific problem text that motivates them. In practical OR problems with dozens or hundreds of constraints, this forces practitioners to manually reread the full problem description and reconstruct the LLM’s reasoning to verify correctness or locate errors, which does not scale.
COOPA addresses both gaps through iterative confidence\-based modeling with four\-dimensional scoring and natural\-language explanations, plus source references that create a transparent audit trail from problem text to mathematical model\.

### 5.2 Training-Based Approaches

A parallel line of work improves OR modeling accuracy by adapting model weights rather than designing workflows. ORLM Huang et al. ([2025](https://arxiv.org/html/2606.27611#bib.bib8)) introduces OR-Instruct, a semi-automated workflow for synthesizing optimization training data from 686 real-world seed cases, and demonstrates that fine-tuned 7B-parameter models can outperform standard GPT-4 on several benchmarks. Building on the idea of a universal intermediate representation, LLMOPT Shu et al. ([2025](https://arxiv.org/html/2606.27611#bib.bib6)) proposes a five-element formulation (Sets, Parameters, Variables, Objective, Constraints) combined with multi-instruction supervised fine-tuning and KTO alignment. LLMOPT also introduces an auto-testing self-correction loop of up to 12 iterations, though the correction is driven by execution errors and solver logs rather than formulation quality assessment, so correctly-executing but mathematically-incorrect models can persist. Finally, DPLM Zhou et al. ([2025](https://arxiv.org/html/2606.27611#bib.bib26)) addresses the unique formulation challenges of dynamic programming via a synthetic data-generation pipeline (DualReflect) designed to capture complex state transitions and recursive relationships.

More recent work applies reinforcement learning to further improve modeling accuracy. OR-R1 Ding et al. ([2026](https://arxiv.org/html/2606.27611#bib.bib28)) uses test-time reinforcement learning (TGRPO) with majority voting across generated candidates as a pseudo-label, achieving strong results with as few as 100 SFT training samples. SIRL Chen et al. ([2026](https://arxiv.org/html/2606.27611#bib.bib31)) takes a different approach by using the solver itself as a verifier to provide reward signals during training, incorporating domain-specific feedback directly into the learning process. These training-based methods are complementary to workflow approaches: COOPA’s architecture could use fine-tuned models as backbone LLMs, potentially combining the benefits of both paradigms.

### 5.3 Self-Improvement and Multi-Agent Systems

LLMs can iteratively improve their outputs through self-generated feedback. Self-Refine Madaan et al. ([2023](https://arxiv.org/html/2606.27611#bib.bib32)) demonstrates this across tasks such as code optimization and mathematical reasoning, where a model critiques and revises a single output over multiple rounds without external supervision. Kadavath et al. ([2022](https://arxiv.org/html/2606.27611#bib.bib33)) show that LLMs exhibit meaningful, though imperfectly calibrated, self-knowledge about their capabilities and can often distinguish correct from incorrect answers. In code generation, Self-Debug Chen et al. ([2024](https://arxiv.org/html/2606.27611#bib.bib34)) and Reflexion Shinn et al. ([2023](https://arxiv.org/html/2606.27611#bib.bib35)) extend self-improvement with execution feedback and episodic memory, respectively. Across these methods, the model iterates on a single output by repeatedly revising its latest attempt.

COOPA departs from this single-output paradigm. Instead of revising one formulation in place, it generates $k$ independent candidates, each informed by confidence evaluations of all prior candidates, then selects the candidate with the highest minimum score across the four modeling dimensions. This separates *generation* from *selection*: later candidates can address earlier weaknesses, while final selection can fall back to an earlier candidate if a later revision introduces new errors.
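The separation of generation from selection can be made concrete with a small loop. In this sketch the LLM calls are replaced by stub callables (`generate` and `evaluate` are hypothetical names), and the scores are invented so that the third draft regresses; the point is only to show that selection ranges over the whole history and can therefore fall back to an earlier candidate.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, int]
History = List[Tuple[str, Scores]]


def iterate_and_select(
    generate: Callable[[History], str],
    evaluate: Callable[[str], Scores],
    k: int = 3,
) -> Tuple[str, Scores]:
    """Generate k candidates, each seeing all earlier ones, then pick by max-min."""
    history: History = []
    for _ in range(k):
        candidate = generate(history)  # conditioned on every prior candidate and score
        history.append((candidate, evaluate(candidate)))
    # Selection considers every candidate, not only the latest revision.
    return max(history, key=lambda item: min(item[1].values()))


# Stubs standing in for the LLM generator and the confidence evaluator.
DRAFTS = ["draft-1", "draft-2", "draft-3"]
TOY_SCORES = {
    "draft-1": {"parameters": 90, "variables": 85, "objective": 30, "constraints": 80},
    "draft-2": {"parameters": 95, "variables": 100, "objective": 100, "constraints": 100},
    "draft-3": {"parameters": 95, "variables": 100, "objective": 100, "constraints": 70},
}

chosen, chosen_scores = iterate_and_select(
    generate=lambda history: DRAFTS[len(history)],
    evaluate=lambda candidate: TOY_SCORES[candidate],
)
print(chosen, min(chosen_scores.values()))
```

A revise-in-place scheme would have returned the third draft here; keeping the full history lets the weaker final revision be discarded.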

*Multi-agent architectures* decompose complex problems into specialized subtasks handled by dedicated agents. In OR, CoE Xiao et al. ([2024](https://arxiv.org/html/2606.27611#bib.bib3)) shows the value of specialization by assigning workflow steps (terminology interpretation, variable extraction, and code review) to different agents. COOPA specializes agents along a different axis: each optimizer agent targets a problem class and solver, routing problems to the appropriate solving paradigm rather than forcing them into a single formulation.

## 6 Conclusion

We proposed COOPA, a modular LLM agent architecture for OR modeling that combines three components: iterative confidence\-based modeling with max\-min selection, source traceability with confidence explanations, and solver dispatch to specialized optimizer agents\. Together, these components target three practical weaknesses of prior systems: weak formulation refinement, opaque outputs, and reliance on a single solver\.

Experiments across three benchmarks and eight LLM backbones show that COOPA achieves the highest cross\-model average accuracy \(64\.8%\) and the best macro\-average on 6 of 8 backbones\. The main empirical driver is iterative modeling: compared with solving only the first candidate, the full pipeline improves accuracy on 7 of 8 backbones and raises the cross\-model mean by 3\.0 points\. The interpretability features provide an audit trail from problem text to formulation, while the appendix presents additional evidence and case studies for confidence analysis and solver dispatch\.

COOPA’s gains are largest on GPT\-5 \(\+6\.7 points\) and GPT\-4\.1 \(\+6\.3\), and remain consistent enough to produce a 3\.2\-point average improvement over the next\-best method\. This pattern suggests that structured refinement is most useful when the backbone is strong enough to benefit from self\-correction but not already saturated\.

Future work should test COOPA on broader OR benchmarks with greater problem\-type diversity, evaluate whether traceability and confidence explanations improve human verification in user studies, and combine the architecture with training\-based OR models such as ORLM or SIRL\.

## References

- AhmadiTeshnizi et al. (2024). OptiMUS-0.3: using large language models to model and solve optimization problems at scale. arXiv preprint arXiv:2407.19633.
- X. Chen, M. Lin, N. Schärli, and D. Zhou (2024). Teaching large language models to self-debug. In International Conference on Learning Representations, Vol. 2024, pp. 8746–8825.
- Y. Chen, J. Xia, S. Shao, D. Ge, and Y. Ye (2026). Solver-informed RL: grounding large language models for authentic optimization modeling. Advances in Neural Information Processing Systems 38, pp. 106027–106069.
- Z. Ding, Z. Tan, J. Zhang, and T. Chen (2026). OR-R1: automating modeling and solving of operations research optimization problem via test-time reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40, pp. 228–236.
- C. Huang, Z. Tang, S. Hu, R. Jiang, X. Zheng, D. Ge, B. Wang, and Z. Wang (2025). ORLM: a customizable framework in training large models for automated optimization modeling. Operations Research 73(6), pp. 2986–3009.
- X. Huang, Q. Shen, Y. Hu, A. Gao, and B. Wang (2024). Mamo: a mathematical modeling benchmark with solvers. arXiv preprint arXiv:2405.13144.
- H. Jain and K. Deb (2013). An evolutionary many-objective optimization algorithm using reference-point based nondominated sorting approach, part II: handling constraints and extending to an adaptive approach. IEEE Transactions on Evolutionary Computation 18(4), pp. 602–622.
- S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- K. Liang, Y. Lu, J. Mao, S. Sun, C. Yang, C. Zeng, X. Jin, H. Qin, R. Zhu, and C. Teo (2026). LLM for large-scale optimization model auto-formulation: a lightweight few-shot learning approach. arXiv preprint arXiv:2601.09635.
- H. Liu, J. Wang, Y. Cai, X. Han, Y. Kuang, and J. Hao (2026). OptiTree: hierarchical thoughts generation with tree search for LLM optimization modeling. Advances in Neural Information Processing Systems 38, pp. 120713–120781.
- A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. (2023). Self-refine: iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36, pp. 46534–46594.
- A. Roucher, A. V. del Moral, T. Wolf, L. von Werra, and E. Kaunismäki (2025). `smolagents`: a smol library to build great agentic systems. [https://github.com/huggingface/smolagents](https://github.com/huggingface/smolagents)
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023). Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652.
- X. Shu, H. Qian, X. Lu, J. Zhou, A. Zhou, Y. Yu, et al. (2025). LLMOPT: learning to define and solve general optimization problems from scratch. In International Conference on Learning Representations, Vol. 2025, pp. 101580–101606.
- R. Thind, Y. Sun, L. Liang, and H. Yang (2025). OptimAI: optimization from natural language using LLM-powered AI agents. arXiv preprint arXiv:2504.16918.
- Z. Xiao, D. Zhang, Y. Wu, L. Xu, Y. Wang, X. Han, X. Fu, T. Zhong, J. Zeng, M. Song, et al. (2024). Chain-of-experts: when LLMs meet complex operations research problems. In International Conference on Learning Representations, Vol. 2024, pp. 48519–48537.
- B. Zhang and P. Luo (2025). OR-LLM-Agent: automating modeling and solving of operations research optimization problem with reasoning large language model. arXiv e-prints, pp. arXiv–2503.
- C. Zhou, J. Yang, L. Xin, Y. Chen, Z. He, and D. Ge (2025). Auto-formulating dynamic programming problems with large language models. arXiv preprint arXiv:2507.11737.

## Appendix A System Prompts

This section presents the core system prompts used by each agent in COOPA\. We show the key instruction sections; boilerplate template variables \(tool lists, managed agent lists, planning prompts\) are omitted for brevity\.

### A.1 Formulation Extraction Agent

The formulation extraction agent maps a natural-language description to a structured OptimizationFormulation schema. The system prompt is:

Formulation Extraction System Prompt

```
You are an operations-research formulation assistant. Given a natural-language optimization problem you will populate the OptimizationFormulation schema exactly.

Guidelines:
- Copy the entire user prompt verbatim into the `question` field.
- Enumerate every numeric fact or named constant inside `parameters`. Use meaningful names (e.g., cost_harry) and include a SourceReference with the exact quote and contextual note.
- Decision variables must capture domains precisely (binary/integer/continuous, bounds, logical implications). Use SourceReference entries quoting the sentence that motivated the variable or domain.
- The `objective.expression` should be an algebraic description that references variable names, and `variables_involved` must list those variable identifiers.
- Each constraint gets its own entry. Use algebraic expressions when possible; fall back to `logical` for implications. Every constraint needs a SourceReference quoting the relevant requirement.

Return valid JSON only. Do not add fields.
```
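To make the target of this prompt easier to picture, here is one possible shape for the schema, written as Python dataclasses. The paper does not list the schema fields in this section, so everything beyond the names that appear in the prompt (`question`, `parameters`, `objective.expression`, `variables_involved`, `SourceReference`) is a hypothetical choice, including `note`, `domain`, `sense`, and `kind`. The example instance uses the variable and objective from the running example, with the question abbreviated to the single quoted sentence.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SourceReference:
    quote: str      # exact text copied from the problem statement
    note: str = ""  # contextual note (hypothetical field name)


@dataclass
class Parameter:
    name: str
    value: float
    source: SourceReference


@dataclass
class Variable:
    name: str
    domain: str     # e.g. "continuous, >= 0" (hypothetical encoding)
    source: SourceReference


@dataclass
class Objective:
    sense: str      # "maximize" or "minimize" (hypothetical field name)
    expression: str
    variables_involved: List[str]


@dataclass
class Constraint:
    name: str
    expression: str
    source: SourceReference
    kind: str = "algebraic"  # or "logical" for implications (hypothetical)


@dataclass
class OptimizationFormulation:
    question: str
    parameters: List[Parameter] = field(default_factory=list)
    variables: List[Variable] = field(default_factory=list)
    objective: Optional[Objective] = None
    constraints: List[Constraint] = field(default_factory=list)


quote = (
    "How should each worker be assigned to a station to maximize "
    "the overall production capacity of the assembly line?"
)
formulation = OptimizationFormulation(
    question=quote,  # abbreviated: the full problem text is not reproduced here
    variables=[Variable("T", "continuous, >= 0", SourceReference(quote))],
    objective=Objective("maximize", "T", ["T"]),
)
payload = json.dumps(asdict(formulation), indent=2)
print(payload)
```

Attaching a `SourceReference` to every element is what makes the audit trail possible: a reviewer can check each variable or constraint against the quoted sentence instead of rereading the whole problem.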

### A.2 Confidence Evaluator

After each formulation extraction, a confidence evaluator scores the formulation on four dimensions \(parameters, decision variables, objective, constraints\) on a 0–100 scale\. The evaluation prompt is:

Confidence Evaluation Prompt (Template)

```
You are an expert in optimization and mathematical modeling. Your task is to evaluate the quality and correctness of an optimization problem formulation.

Given:
1. **Raw Question**: {raw_question}
2. **Proposed Formulation**: {formulation_str}

Please evaluate the confidence (0-100) for each of the following components:
1. **PARAMETERS**: Are all necessary parameters identified with correct values and units?
2. **DECISION VARIABLES**: Are all decision variables properly defined with correct domains?
3. **OBJECTIVE**: Is the objective function correct and does it properly represent what should be optimized?
4. **CONSTRAINTS**: Are all necessary constraints included and correctly formulated?

For each component, provide a confidence score from 0-100 and a brief explanation (1-3 sentences).
```

### A.3 Refinement Prompt

After each iteration, the next candidate is refined using all previous iterations as context. In the default setting, the system completes all $k$ iterations rather than stopping at a confidence threshold:

Refinement Prompt (Template)

```
You are refining an optimization formulation. Review all previous attempts and the feedback to create a better formulation.

**Original Problem:** {raw_question}

**HISTORY OF PREVIOUS FORMULATIONS:**

--- Iteration 1 ---
Formulation: {past_formulation_str}
Confidence Scores:
- Parameters: {score}/100 - {explanation}
- Decision Variables: {score}/100 - {explanation}
- Objective: {score}/100 - {explanation}
- Constraints: {score}/100 - {explanation}

[... additional iterations ...]

Please create a REFINED formulation that addresses all the identified issues from past iterations. Learn from what worked well and avoid repeating mistakes. Pay special attention to the components with lower confidence scores.
```
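The history section of this template grows by one block per iteration. A minimal sketch of how such a history could be rendered is shown below; `render_history` and the data layout (a formulation string paired with a score and explanation per dimension) are assumptions made for illustration, and the example feedback text is invented.

```python
from typing import Dict, List, Tuple

DIMENSIONS = ("Parameters", "Decision Variables", "Objective", "Constraints")

# One entry per past iteration: (formulation text, {dimension: (score, explanation)}).
Feedback = Dict[str, Tuple[int, str]]


def render_history(history: List[Tuple[str, Feedback]]) -> str:
    """Render past formulations and their confidence feedback in template order."""
    blocks = []
    for index, (formulation, feedback) in enumerate(history, start=1):
        lines = [
            f"--- Iteration {index} ---",
            f"Formulation: {formulation}",
            "Confidence Scores:",
        ]
        for dimension in DIMENSIONS:
            score, explanation = feedback[dimension]
            lines.append(f"- {dimension}: {score}/100 - {explanation}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Invented feedback for a first attempt with a weak objective.
history = [
    (
        "maximize sum of efficiencies",
        {
            "Parameters": (90, "Efficiency table captured."),
            "Decision Variables": (85, "Assignment variables are binary."),
            "Objective": (30, "Sums efficiencies instead of modeling the bottleneck."),
            "Constraints": (80, "One worker per station enforced."),
        },
    )
]
text = render_history(history)
print(text)
```

Keeping the explanations next to the scores gives the next generation step a reason for each low score, not just the number.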

### A.4 User Prompt for Solver Dispatch

After iterative refinement selects the best formulation, it is formatted into a structured prompt and sent to the solver\-dispatch component as the user message:

Formatted User Prompt for Solver Dispatch (Template)

```
Delegate the following operations research problem to the correct optimizer agent:

## PARAMETERS:
- {name} ({type}): {description} = {value} [{units}]
...

## DECISION VARIABLES:
- {name} ({type}): {description} | Domain: {domain}
...

## OBJECTIVE:
- Sense: MAXIMIZE/MINIMIZE
- Description: {description}
- Expression: {expression}
- Variables involved: {variables}

## CONSTRAINTS:
1. {name} ({sense}):
   Expression: {expression}
   Variables: {variables}
...

## CRITICAL INSTRUCTIONS:
- Your role is solver dispatch. You MUST NOT solve this problem yourself.
- Your ONLY job is to delegate the COMPLETE problem above to the appropriate optimizer agent in your FIRST Code block.
- The optimizer agent will handle everything: saving parameters to JSON, building the solver, executing it, and returning the result.
- Do NOT call final_answer() in the same response where you call an optimizer agent. You MUST wait for the system to return the optimizer’s REAL result first.
```

### A\.5 Solver\-Dispatch Prompt

This component delegates the problem to the appropriate optimizer agent\. Its system prompt defines the routing logic:

Solver\-Dispatch System Prompt \(core instructions\)

```
You are responsible only for solver dispatch in an advanced multi-agent operations research system. Your role is to orchestrate specialized agents and tools to deliver correct, clear, and actionable solutions.

=== CORE CONSTRAINTS ===
1. Immediately delegate the problem to an optimizer agent in your FIRST Code block.
2. Do not try to solve the problem directly. No internal reasoning, solver code, or calculations.
3. Do not iterate - Let optimizer agents handle all solution refinement.
4. Identify agent type and pass complete problem statement to chosen agent.

=== PROCEDURE ===
1. Clarify the Problem.
2. Select and Delegate to the Appropriate Optimizer:
   - mathematical_optimizer_agent for algebraic LP, MILP, and continuous NLP models.
   - combinatorial_optimizer_agent for routing, scheduling, CP-SAT, and other discrete problems best expressed in OR-Tools.
   - metaheuristic_optimizer_agent for metaheuristic or black-box search, especially multi-objective or non-convex cases.
   - general_optimizer_agent for simulation-based, custom algorithmic, or general scripting.
3. Review and Present Results.
4. Call final_answer with the final result.
```
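The routing rules in this prompt can be pictured as a lookup from problem class to agent name. The sketch below is only an illustration: in COOPA the choice is made by the dispatch LLM from the prompt, the class labels used as keys are assumptions, and the fallback to `general_optimizer_agent` mirrors its description as the agent for tasks that fit no other category.

```python
# Illustrative lookup only. The agent names come from the system prompt;
# the problem-class labels used as keys are assumptions for this sketch.
ROUTES = {
    "LP": "mathematical_optimizer_agent",
    "MILP": "mathematical_optimizer_agent",
    "continuous NLP": "mathematical_optimizer_agent",
    "routing": "combinatorial_optimizer_agent",
    "scheduling": "combinatorial_optimizer_agent",
    "CP-SAT": "combinatorial_optimizer_agent",
    "multi-objective": "metaheuristic_optimizer_agent",
    "black-box": "metaheuristic_optimizer_agent",
    "simulation": "general_optimizer_agent",
}


def select_agent(problem_class):
    """Return the optimizer agent for a class, defaulting to the general agent."""
    return ROUTES.get(problem_class, "general_optimizer_agent")


for label in ("MILP", "scheduling", "multi-objective", "queueing study"):
    print(f"{label:16s} -> {select_agent(label)}")
```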

### A\.6 Mathematical Optimizer Agent

Solves algebraic LP, MILP, and continuous NLP problems using Pyomo with GLPK and IPOPT\.

Mathematical Optimizer System Prompt \(core instructions\)

```
You are an expert operations research assistant who models and solves mathematical optimization problems using Pyomo in Python, supported by solvers such as GLPK and IPOPT.

=== SCOPE AND CAPABILITIES ===
Supported problem types:
- Linear Programming (LP)
- Mixed-Integer Linear Programming (MIP)
- Nonlinear Programming (NLP)

Default solver support:
- GLPK for LP/MIP
- IPOPT for continuous NLP

If a problem requires unsupported MINLP capabilities, do not claim IPOPT can solve it directly. Either reformulate to a supported class or report the limitation.

=== PROCEDURE ===
1. Understand the Problem.
2. Create Parameters JSON File.
3. Build the Solver with Pyomo and Save to Python File.
4. Load and Execute the Solver via load_object_from_python_file().
5. If execution FAILED: fix and re-execute.
6. Final Answer (only after successful execution).
```

### A\.7 Combinatorial Optimizer Agent

Solves VRP, scheduling, assignment, constraint satisfaction, and bin packing problems using Google OR\-Tools\.

Combinatorial Optimizer System Prompt \(core instructions\)

```
You are an expert combinatorial optimization assistant specializing in solving combinatorial, routing, and constraint satisfaction problems using Google OR-Tools.

=== SCOPE AND CAPABILITIES ===
- Vehicle Routing Problems (VRP, CVRP, VRPTW)
- Job Shop / Flow Shop Scheduling
- Assignment Problems
- Constraint Programming (CP-SAT)
- Bin Packing & Knapsack
- Graph traversal and network design

=== PROCEDURE ===
1. Understand the Problem.
2. Create Parameters JSON File.
3. Build the Solver with OR-Tools, Save to Python File.
4. Load and Execute via load_object_from_python_file().
5. If execution FAILED: fix and re-execute.
6. Final Answer (only after successful execution).
```

### A\.8 Metaheuristic Optimizer Agent

Solves multi\-objective, non\-convex, and black\-box optimization problems using pymoo\.

Metaheuristic Optimizer System Prompt \(core instructions\)

```
You are an expert meta-heuristic optimization assistant specializing in solving complex, non-convex, and multi-objective optimization problems using evolutionary and meta-heuristic algorithms implemented in pymoo.

=== SCOPE AND CAPABILITIES ===
- Black-box or simulation-based objective functions
- Multi-objective scenarios (Pareto-optimal solutions)
- Non-convex, discontinuous, nonlinear landscapes
- Mixed-variable types (continuous, discrete, binary, permutations)

Algorithms available through pymoo:
- Single-objective: GA, DE, PSO, CMA-ES
- Multi-objective: NSGA-II, NSGA-III, MOEA/D

=== PROCEDURE ===
(Same as mathematical optimizer, using pymoo instead)
```

### A\.9 General Optimizer Agent

Handles simulation\-based, custom algorithmic, or scripting tasks that do not fit the other categories\.

General Optimizer System Prompt \(core instructions\)

```
You are the general purpose optimizer agent for operations research and optimization. Your job is to solve problems that do not fit into mathematical, combinatorial, or metaheuristic categories. You are especially suited for simulation-based, custom algorithmic, or scripting tasks that require flexible Python code.

=== SCOPE AND CAPABILITIES ===
You can write, modify, and execute Python code to solve a wide range of operations research and simulation problems. You are not limited to any specific optimization paradigm.
```

### A\.10 Shared Execution Constraints

All four optimizer agents share the following critical execution rules, designed to prevent hallucinated solver outputs:

Shared Execution Rules \(all optimizer agents\)

```
=== CRITICAL: STOPPING CRITERIA ===
YOU MUST STOP GENERATING TEXT IMMEDIATELY AFTER WRITING A CODE BLOCK.
- Do NOT simulate or fabricate the execution output.
- Do NOT hallucinate solver results.
- Wait for the actual system to execute your code.

=== CRITICAL: SOLVER EXECUTION IS MANDATORY ===
YOU MUST NEVER:
- Output just numbers from reasoning/analysis.
- Skip the solver step or assume results without running code.
- Provide final answers without actual solver execution results.

YOU MUST ALWAYS:
- Write complete, executable Python code blocks.
- Actually execute the solver in your code.
- Extract results ONLY from actual solver output.
```

## Appendix B Supplementary Case Studies

### B\.1 Iterative Modeling Corrects a Conceptual Error

We return to the assembly line example from Section [3](https://arxiv.org/html/2606.27611#S3) \(BWOR Question 76, o4\-mini backbone\) and show the corresponding COOPA execution logs\. The example illustrates how iterative confidence\-based modeling catches and corrects a fundamental objective error\.

Problem\. Five workers \(A–E\) must be assigned to five stations \(I–V\), one per station\. Because the product passes through the line sequentially, throughput is determined by the *slowest* station\. The correct model is therefore max\-min: maximize the minimum station throughput\. The gold answer is $T^{*}=5.0$ pieces/min\.

Without iterative modeling\. The first candidate \(Iteration 1\) instead maximizes the *sum* of worker efficiencies across all stations:

$$\max \sum_{w\in W}\sum_{s\in S} e_{w,s}\cdot x_{w,s}$$

subject to one\-worker\-per\-station and one\-station\-per\-worker assignment constraints\. This is a valid assignment model but the wrong objective for an assembly line: maximizing the sum \(= 28\) ignores the bottleneck, so the reported value 28\.0 is a total efficiency rather than a line throughput and does not match the gold answer of 5\.0\.

With iterative modeling, Iteration 1\. The first candidate has the *same* incorrect sum\-of\-efficiencies objective\. The confidence evaluator assigns only 30/100 to the objective dimension, with the explanation:

> *“The formulation maximizes the sum of efficiencies, but assembly\-line capacity is driven by the bottleneck \(minimum station rate\), so the objective is mis\-specified\.”*

The other three dimensions score highly \(parameters: 95, variables: 100, constraints: 100\), so the evaluator isolates the objective as the only weak dimension\. The candidate’s minimum confidence is therefore 30/100\.

Iteration 2\. Guided by that low objective score, the LLM restructures the formulation around a bottleneck\-throughput variable $T\geq 0$, replaces the objective with $\max\;T$, and adds bottleneck constraints $T\leq\sum_{w}e_{w,s}\cdot x_{w,s}$ for each station $s$\. This is the correct max\-min formulation, and its minimum confidence rises to 90/100\.
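Written out in full, the Iteration 2 model reads as follows. This is a restatement assembled from the objective, the bottleneck constraints, and the two assignment constraints described in this case study, using the same symbols ($W$ workers, $S$ stations, $e_{w,s}$ efficiencies, $x_{w,s}$ binary assignments); it is not a formula quoted from the execution logs.

```latex
\begin{aligned}
\max_{x,\,T}\quad & T \\
\text{s.t.}\quad
  & T \le \sum_{w\in W} e_{w,s}\, x_{w,s} && \forall s \in S \\
  & \sum_{s\in S} x_{w,s} = 1 && \forall w \in W \\
  & \sum_{w\in W} x_{w,s} = 1 && \forall s \in S \\
  & x_{w,s}\in\{0,1\}, \quad T \ge 0
\end{aligned}
```

Because exactly one $x_{w,s}$ equals 1 for each station, the sum on the right of the first constraint is the efficiency of the worker assigned to station $s$. $T$ is bounded above by every station's rate, so at the optimum it equals the minimum station rate, which is the max\-min objective.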

Iteration 3 and selection\. The third candidate keeps the same bottleneck structure with minor refinements and reaches the highest minimum confidence, 95/100\. Max\-min selection therefore chooses Iteration 3 over Iteration 2 \(90\) and Iteration 1 \(30\)\. The mathematical optimizer then returns the correct bottleneck throughput, $T^{*}=5.0$\.

Analysis\. This case highlights three points\. First, the confidence evaluator catches a *conceptual* error, not just a syntactic one\. Second, the fix is structural: Iteration 2 introduces a new decision variable $T$ and bottleneck constraints rather than patching a local mistake\. Third, without iterative modeling, COOPA returns the same wrong answer \(28\.0\), showing that the gain comes from the confidence\-and\-refinement loop itself\. Table [4](https://arxiv.org/html/2606.27611#A2.T4) summarizes the progression\.

Table 4: Case study summary: confidence scores and outcomes across three iterations for BWOR Question 76 \(o4\-mini\)\.

| | Params | Vars | Obj | Constrs | Min | Objective type | Answer |
|---|---|---|---|---|---|---|---|
| Iteration 1 | 95 | 100 | 30 | 100 | 30 | Sum \(wrong\) | 28\.0 |
| Iteration 2 | 90 | 90 | 100 | 95 | 90 | Max\-min \(correct\) | 5\.0 |
| Iteration 3 | 95 | 100 | 100 | 100 | 95 | Max\-min \(correct\) | 5\.0 |
| w/o iterative modeling | | | | | | Sum \(wrong\) | 28\.0 |

Solver Dispatch and Result

Routing decision: Mixed\-integer program \(binary assignment \+ continuous $T$\)\.

Routed to: mathematical optimizer \(Pyomo\)\.

Generated code:

```
from pyomo.environ import (
    Binary, ConcreteModel, Constraint, NonNegativeReals,
    Objective, Set, SolverFactory, Var, maximize,
)

# `params` is read from the parameters JSON file (loading code not shown).
model = ConcreteModel()
model.WORKERS = Set(initialize=params["workers"])
model.STATIONS = Set(initialize=params["stations"])
model.x = Var(model.WORKERS, model.STATIONS, domain=Binary)
model.T = Var(domain=NonNegativeReals)  # bottleneck throughput

def one_station_per_worker_rule(m, w):
    return sum(m.x[w, s] for s in m.STATIONS) == 1
model.one_station_per_worker = Constraint(
    model.WORKERS, rule=one_station_per_worker_rule)

def one_worker_per_station_rule(m, s):
    return sum(m.x[w, s] for w in m.WORKERS) == 1
model.one_worker_per_station = Constraint(
    model.STATIONS, rule=one_worker_per_station_rule)

def bottleneck_rule(m, s):
    # T cannot exceed the rate of the worker assigned to station s
    return m.T <= sum(
        params["efficiency"][w][s] * m.x[w, s]
        for w in m.WORKERS)
model.bottleneck = Constraint(
    model.STATIONS, rule=bottleneck_rule)

model.obj = Objective(expr=model.T, sense=maximize)
solver = SolverFactory('glpk')
result = solver.solve(model)
```

Execution result:

```
status: optimal
objective: 5.0
assignment: A->V, B->IV, C->II, D->I, E->III
T: 5.0
```

*Without iterative modeling, the incorrect formulation optimizes $\sum e_{w,s}$ and returns 28\.0, which is not the target throughput\.*
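The selection step in this case study can be replayed from the Table 4 scores. The following is a minimal sketch, not COOPA's implementation: the function names are hypothetical, and ties are broken in favor of the earliest iteration, which is an assumption the case study does not address.

```python
# Per-dimension confidence scores copied from Table 4.
candidates = {
    "Iteration 1": {"params": 95, "vars": 100, "obj": 30, "constrs": 100},
    "Iteration 2": {"params": 90, "vars": 90, "obj": 100, "constrs": 95},
    "Iteration 3": {"params": 95, "vars": 100, "obj": 100, "constrs": 100},
}


def min_confidence(scores):
    """A candidate is only as trustworthy as its weakest dimension."""
    return min(scores.values())


def select_max_min(cands):
    """Pick the candidate whose weakest dimension is strongest.

    max() keeps the first maximal key, so ties go to the earliest
    iteration (an assumption made for this sketch).
    """
    return max(cands, key=lambda name: min_confidence(cands[name]))


mins = {name: min_confidence(s) for name, s in candidates.items()}
print(mins)
print(select_max_min(candidates))
```

Note that Iteration 2 scores 100 on the objective yet is limited to 90 by its parameter and variable scores, so a criterion based on the mean or on the objective alone could rank the candidates differently.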

### B\.2 Multi\-Objective Optimization via Solver Dispatch

To demonstrate the value of COOPA’s multi\-solver dispatch, we present a case study on the car side impact design problem Jain and Deb \([2013](https://arxiv.org/html/2606.27611#bib.bib39)\), a well\-known multi\-objective benchmark from the engineering optimization literature\. This problem has 7 continuous decision variables \(structural thicknesses\), 3 conflicting objectives \(minimize weight, minimize pubic symphysis force, minimize average V\-pillar velocity\), and 10 nonlinear safety constraints including bilinear and quadratic terms\. The goal is to find the *Pareto front*: the set of non\-dominated designs representing the best tradeoffs among the three objectives\.

This problem poses two challenges for systems locked to a single solver\. First, it is *multi\-objective*: the designer needs the full Pareto front, not a single optimal point, because the tradeoff between weight and safety depends on domain priorities\. Gurobi and PuLP can optimize scalarized versions of the problem, but they do not directly return a Pareto front; obtaining one requires manual scalarization or repeated $\epsilon$\-constraint solves\. Second, the constraints contain *bilinear terms* \($x_{2}x_{4}$, $x_{1}x_{2}$, etc\.\) and *quadratic terms* \($x_{2}^{2}$\), which PuLP cannot handle and which require Gurobi’s non\-convex QCQP mode\. We fed the same natural\-language problem description to both COOPA and OR\-LLM\-Agent \(the best baseline by cross\-model mean accuracy\) and compared their generated solutions\.

#### B\.2\.1 OR\-LLM\-Agent Solution \(Gurobi\)

OR\-LLM\-Agent recognized that the problem is multi\-objective and nonlinear\. Its Math Agent correctly identified all three objectives, all constraints, and noted that the epsilon\-constraint method is needed for multi\-objective optimization with Gurobi\. However, the system encountered two failures in sequence\.

Attempt 1: Epsilon\-constraint \(infeasible\)\. The LLM chose $\epsilon_{F}=3.5$ and $\epsilon_{V}=12.0$ as bounds for the force and velocity objectives, converting the multi\-objective problem into a single\-objective one \(minimize weight\)\. Gurobi reported the model as *infeasible*: no design exists that simultaneously satisfies all 10 safety constraints and the epsilon bounds\. The LLM selected these epsilon values without prior knowledge of the feasible objective space, and the chosen combination was too restrictive\.

Attempt 2: Self\-repair \(single objective only\)\. The Debugging Agent detected the infeasibility and self\-repaired by removing the epsilon constraints entirely, reducing the problem to minimizing weight subject only to the safety constraints\. Gurobi solved this simplified problem and returned a single feasible design:

OR\-LLM\-Agent Result \(Gurobi, after self\-repair\)

Approach: Minimize weight only \(abandoned multi\-objective\)

Result: One feasible point \(not a Pareto front\)

| Objective | Value |
|---|---|
| Weight \($f_{1}$\) | 23\.5857 |
| Pubic force \($f_{2}$\) | 4\.0000 \(at safety limit\) |
| Avg V\-velocity \($f_{3}$\) | 12\.5211 |

Design:

| Variable | Value |
|---|---|
| $x_{1}$ \(B\-pillar inner\) | 0\.5000 |
| $x_{2}$ \(B\-pillar reinforce\) | 1\.2257 |
| $x_{3}$ \(floor side inner\) | 0\.5000 |
| $x_{4}$ \(cross member\) | 1\.2071 |
| $x_{5}$ \(door beam\) | 0\.8750 |
| $x_{6}$ \(door belt\) | 0\.8843 |
| $x_{7}$ \(roof rail\) | 0\.4000 |

Execution result:

```
Attempt 1 (epsilon-constraint):
  Barrier solved model in 0 iterations and 0.00s
  Model is infeasible.
  IIS computed: 1 constraint, 4 bounds

Attempt 2 (self-repair, minimize weight only):
  Optimal solution found (tolerance 1.00e-04)
  Best objective 2.358565798676e+01, gap 0.0000%
  f1 (weight) = 23.5857
  f2 (pubic force) = 4.0000
  f3 (avg velocity) = 12.5211
  x = [0.5000, 1.2257, 0.5000, 1.2071, 0.8750, 0.8843, 0.4000]
```

Problem: This is the minimum\-weight extreme of the Pareto front\. The force \($f_{2}=4.00$\) is at its maximum safety limit, so this design gives up all margin on that safety objective in exchange for weight reduction\. A decision\-maker would need to see the full tradeoff surface to make an informed choice\.

The generated code is shown below\. Note the 145\-line implementation with 9 auxiliary variables for bilinear terms, explicit quadratic constraint definitions, and the epsilon\-constraint workaround that ultimately had to be abandoned\.

OR\-LLM\-Agent Generated Code \(excerpts\)

```
# 9 auxiliary variables for bilinear terms
q_12 = m.addVar(name="q_12")    # x1 * x2
q_23 = m.addVar(name="q_23")    # x2 * x3
q_24 = m.addVar(name="q_24")    # x2 * x4
q_26 = m.addVar(name="q_26")    # x2 * x6
q_27 = m.addVar(name="q_27")    # x2 * x7
q_35 = m.addVar(name="q_35")    # x3 * x5
q_37 = m.addVar(name="q_37")    # x3 * x7
q_56 = m.addVar(name="q_56")    # x5 * x6
q_2sq = m.addVar(name="q_2sq")  # x2^2

# Quadratic equality constraints to define aux vars
m.addQConstr(q_12 == x[0] * x[1], "def_q_12")
m.addQConstr(q_23 == x[1] * x[2], "def_q_23")
...  # (9 total quadratic equalities)

# Epsilon-constraint (caused infeasibility)
m.addQConstr(F_PS <= 3.5, "c_epsilon_F_PS")
m.addQConstr(V_avg <= 12.0, "c_epsilon_V_avg")

# After self-repair: removed epsilon constraints,
# minimize weight only
m.setObjective(W, GRB.MINIMIZE)
m.params.NonConvex = 2
m.optimize()
```

#### B\.2\.2 COOPA Solution \(pymoo\)

COOPA’s solver\-dispatch component classified this as a multi\-objective nonlinear problem and routed it to the metaheuristic optimizer, which used pymoo’s NSGA\-II algorithm\. The generated code directly defines the problem class with all three objectives and 10 constraints, then runs the evolutionary algorithm to compute the full Pareto front\.

COOPA Result \(pymoo NSGA\-II\)

Approach: NSGA\-II, all 3 objectives optimized simultaneously

Result: Full Pareto front \(set of non\-dominated designs\)

| Objective | Range on the Pareto front |
|---|---|
| Weight \($f_{1}$\) | $[23.61, 42.70]$ |
| Pubic force \($f_{2}$\) | $[3.59, 4.00]$ |
| Avg V\-velocity \($f_{3}$\) | $[10.62, 12.44]$ |

Execution result:

```
Optimization completed successfully.
Pareto front: 100 non-dominated solutions found
  f1 (weight) in [23.61, 42.70]
  f2 (pubic force) in [3.59, 4.00]
  f3 (avg velocity) in [10.62, 12.44]

Sample Pareto-optimal designs:
  Design A: f1=23.61, f2=4.00, f3=12.43 (lightest)
  Design B: f1=31.52, f2=3.73, f3=11.68 (balanced)
  Design C: f1=42.70, f2=3.59, f3=10.62 (safest)
```

Output: A population of Pareto\-optimal designs, each representing a different tradeoff\. A decision\-maker can select the design that best balances weight reduction against passenger safety\.

COOPA Generated Code \(excerpts\)

```
class CarStructureOptimization(Problem):
    def __init__(self, params):
        super().__init__(
            n_var=7, n_obj=3, n_constr=10,
            xl=np.array([...]),  # lower bounds
            xu=np.array([...])   # upper bounds
        )

    def _evaluate(self, X, out, *args, **kwargs):
        x1, x2, ..., x7 = X[:,0], X[:,1], ..., X[:,6]

        # Objectives (directly expressed)
        f1 = 1.98 + 4.9*x1 + 6.67*x2 + ...
        f2 = 4.72 - 0.5*x4 - 0.19*x2*x3
        f3 = 0.5 * (V_MBP + V_FD)
        out["F"] = np.column_stack([f1, f2, f3])

        # Constraints (directly expressed)
        g = np.zeros((X.shape[0], 10))
        g[:,0] = (1.16 - 0.3717*x2*x4 ...) - 1.0
        ...
        out["G"] = g

# Solve with NSGA-II
algorithm = NSGA2(pop_size=100)
res = minimize(problem, algorithm,
               termination=('n_gen', 200))
```

#### B\.2\.3 Comparison

| | OR\-LLM\-Agent \(Gurobi\) | COOPA \(pymoo\) |
|---|---|---|
| Solver | Gurobi \(non\-convex QCQP\) | pymoo NSGA\-II |
| Multi\-objective method | Epsilon\-constraint | Native \(evolutionary\) |
| Code complexity | 145 lines | 50 lines |
| Aux variables needed | 9 \(for bilinear terms\) | 0 |
| Attempt 1 | Infeasible \(bad $\epsilon$ values\) | Succeeded directly |
| Attempt 2 | Minimized weight only | n/a |
| Output | 1 point \(min\-weight extreme\) | Full Pareto front |
| Force \($f_{2}$\) | 4\.00 \(at safety limit\) | $[3.59, 4.00]$ \(full range\) |
| Pareto front | Not computed | Complete tradeoff surface |

Analysis\. The comparison reveals three advantages of COOPA’s multi\-solver dispatch for this problem class\.

First, *paradigm matching*: NSGA\-II is designed for multi\-objective optimization and computes the full Pareto front in a single run\. By contrast, a Gurobi\-based approach must rely on scalarization or repeated $\epsilon$\-constraint solves, which requires choosing objective tradeoffs without knowing the feasible objective space\. When the LLM’s guesses are infeasible, the solver fails entirely\.

Second, *code simplicity*: pymoo’s API lets the LLM express objectives and constraints as direct function evaluations \(50 lines\)\. Gurobi requires auxiliary variables for every bilinear term, quadratic constraint definitions, and manual epsilon\-constraint setup \(145 lines\)\. Simpler code means fewer opportunities for the LLM to introduce errors\.

Third, *solution completeness*: even after self\-repair, OR\-LLM\-Agent returns only one extreme point \(minimum weight with force at the safety limit\)\. A decision\-maker cannot see the tradeoff between weight and safety\. COOPA returns the full Pareto front, enabling informed design decisions\.

We also tested CoE on this problem; its first attempt with $\epsilon$\-constraint was similarly infeasible, and its second attempt returned the same single\-point minimum\-weight solution \(weight = 23\.5857\), confirming that the limitation is not specific to one baseline, but arises more broadly when these pipelines attack the task through repeated scalarized solves\.
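To make the notions of dominance and of an over\-tight $\epsilon$ bound concrete, the sketch below uses only numbers reported in this case study: the single OR\-LLM\-Agent design, the three sample COOPA designs, the reported Pareto ranges, and the bounds $\epsilon_{F}=3.5$ and $\epsilon_{V}=12.0$. The helper names are hypothetical. The NSGA\-II ranges are an approximation of the true front, so the bound check is suggestive rather than a proof of infeasibility.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere.

    All three objectives are minimized.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def non_dominated(points):
    """Names of the points that no other point dominates."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }


# (f1 weight, f2 pubic force, f3 avg velocity), as reported in the case study.
designs = {
    "OR-LLM-Agent": (23.5857, 4.0000, 12.5211),
    "COOPA A": (23.61, 4.00, 12.43),
    "COOPA B": (31.52, 3.73, 11.68),
    "COOPA C": (42.70, 3.59, 10.62),
}
front = non_dominated(designs)

# Objective ranges over the 100 NSGA-II solutions vs. the chosen bounds.
observed = {"f2": (3.59, 4.00), "f3": (10.62, 12.44)}
epsilon = {"f2": 3.5, "f3": 12.0}
below_observed_min = [k for k in epsilon if epsilon[k] < observed[k][0]]

print(sorted(front))
print(below_observed_min)
```

All four designs are mutually non\-dominated, so the OR\-LLM\-Agent result is a legitimate tradeoff point, but only one of them. The force bound of 3\.5 lies below the smallest $f_{2}$ value found anywhere on the reported front \(3\.59\), which is consistent with the infeasibility reported in Attempt 1\.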

## Appendix C Supplementary Experiments

### C\.1 Cross\-Model Consistency Analysis

To measure robustness across backbones, we summarize each method’s macro\-average accuracy over the eight models in Table [5](https://arxiv.org/html/2606.27611#A3.T5)\. COOPA has the highest mean accuracy \(64\.8%\) and the highest maximum \(70\.6%\)\. OR\-LLM\-Agent has the lowest standard deviation \(5\.0\) but a lower mean, indicating steadier but weaker performance\. COOPA’s spread \(standard deviation 7\.5\) is driven mainly by Qwen3\-30B, its minimum at 45\.5%\. Overall, single\-backbone evaluation can be misleading: the leader changes by model, whereas COOPA has the best mean with a moderate spread\.

Table 5: Cross\-model consistency: summary statistics of macro\-average accuracy \(%\) across 8 LLMs\.

| Method | Mean | Std | Min | Max |
|---|---|---|---|---|
| Chain\-of\-Experts | 60\.1 | 8\.1 | 42\.9 | 67\.5 |
| OptiMUS | 35\.4 | 9\.8 | 17\.7 | 48\.7 |
| OptiTree | 61\.3 | 6\.5 | 49\.9 | 68\.4 |
| OR\-LLM\-Agent | 61\.6 | 5\.0 | 53\.1 | 67\.6 |
| COOPA \(Ours\) | 64\.8 | 7\.5 | 45\.5 | 70\.6 |

### C\.2 Ablation Study: Effect of Iterative Modeling

To isolate iterative confidence\-based modeling, we compare two configurations from the same run: \(1\) solve only the first candidate \(Iteration 1\), and \(2\) run the full $k=3$ pipeline with confidence evaluation and max\-min selection\. We extract the first candidate from the iterative logs and solve it independently, so both conditions share the same generation context and differ only in candidate selection\. All other components remain fixed\. We focus on this ablation because iterative modeling is the main algorithmic contribution; structured output mainly supports human verification \(Section [3\.2](https://arxiv.org/html/2606.27611#S3.SS2)\), and solver dispatch has limited signal on the current benchmarks because 91\.1% of problems are routed to the mathematical optimizer \(Section [C\.4](https://arxiv.org/html/2606.27611#A3.SS4)\)\.

Table 6: Ablation of iterative modeling\. Both rows are drawn from the same iterative run: “w/o iteration” solves only the first candidate \(Iteration 1\); “w/ iteration” solves the candidate selected by the max\-min confidence criterion among $k=3$ candidates\. $\Delta$ is w/ iteration minus w/o iteration\.

| Model | Configuration | ComplexLP | IndustryOR | BWOR | Macro\-Avg | $\Delta$ |
|---|---|---|---|---|---|---|
| GPT\-5\.2 | w/o iteration | 52\.6 | 77\.0 | 82\.5 | 70\.7 | |
| | w/ iteration | 55\.9 | 76\.0 | 80\.0 | 70\.6 | \-0\.1 |
| GPT\-5 | w/o iteration | 52\.1 | 74\.0 | 77\.5 | 67\.9 | |
| | w/ iteration | 53\.1 | 75\.0 | 80\.0 | 69\.4 | \+1\.5 |
| GPT\-4\.1 | w/o iteration | 51\.7 | 67\.0 | 71\.3 | 63\.3 | |
| | w/ iteration | 53\.6 | 69\.0 | 76\.3 | 66\.3 | \+3\.0 |
| o3 | w/o iteration | 49\.8 | 73\.0 | 71\.3 | 64\.7 | |
| | w/ iteration | 53\.6 | 73\.0 | 73\.8 | 66\.8 | \+2\.1 |
| o4\-mini | w/o iteration | 45\.5 | 73\.0 | 72\.5 | 63\.7 | |
| | w/ iteration | 47\.9 | 72\.0 | 77\.5 | 65\.8 | \+2\.1 |
| Gemini\-3\-Flash | w/o iteration | 50\.7 | 75\.0 | 76\.3 | 67\.3 | |
| | w/ iteration | 52\.6 | 75\.0 | 77\.5 | 68\.4 | \+1\.1 |
| Gemini\-2\.5\-Flash | w/o iteration | 45\.5 | 66\.0 | 70\.0 | 60\.5 | |
| | w/ iteration | 47\.4 | 71\.0 | 77\.5 | 65\.3 | \+4\.8 |
| Qwen3\-30B | w/o iteration | 28\.9 | 42\.0 | 38\.8 | 36\.6 | |
| | w/ iteration | 32\.2 | 48\.0 | 56\.3 | 45\.5 | \+8\.9 |

![Refer to caption](https://arxiv.org/html/2606.27611v1/x2.png)

Figure 2: Per\-benchmark ablation: iterative modeling \($k=3$, max\-min selection\) vs\. first candidate only \($k=1$\)\. Iterative modeling improves accuracy on the majority of backbone–benchmark pairs, with the largest gains on BWOR and ComplexLP\.

Overall effect\. Table [6](https://arxiv.org/html/2606.27611#A3.T6) and Figure [2](https://arxiv.org/html/2606.27611#A3.F2) show gains on 7 of 8 backbones\. The cross\-model mean rises from 61\.8% to 64\.8% \(\+3\.0 points\)\. The largest improvements appear on Qwen3\-30B \(\+8\.9\), Gemini\-2\.5\-Flash \(\+4\.8\), and GPT\-4\.1 \(\+3\.0\)\. GPT\-5\.2 is essentially unchanged \(\-0\.1\)\.

Gains are largest on weaker backbones\. Qwen3\-30B gains the most \(\+8\.9\), while GPT\-5\.2 shows no gain \(\-0\.1\)\. This suggests weaker models produce more correctable first\-pass errors, leaving more room for confidence\-based selection\. The effect also appears on reasoning models such as o3 and o4\-mini \(\+2\.1 each\), so it is not limited to non\-reasoning architectures\.

Takeaway\. Iterative confidence\-based modeling is the main driver of COOPA’s advantage\. Without it, the base pipeline reaches 61\.8%, roughly matching OR\-LLM\-Agent \(61\.6%\) and OptiTree \(61\.3%\)\. The extra 3\.0 points from iteration explain most of COOPA’s 3\.2\-point lead over the next\-best baseline, consistent with the error analysis in Section [2\.1](https://arxiv.org/html/2606.27611#S2.SS1)\.
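The cross\-model figures quoted in this subsection can be recomputed from the Macro\-Avg column of Table 6\. The sketch below does so with the standard library. One assumption is flagged: the population standard deviation \(`pstdev`\) reproduces the 7\.5 listed for COOPA in Table 5, which suggests, but does not confirm, that this convention was used there.

```python
from statistics import mean, pstdev

# Macro-average accuracy (%) from Table 6: (w/o iteration, w/ iteration).
macro_avg = {
    "GPT-5.2": (70.7, 70.6),
    "GPT-5": (67.9, 69.4),
    "GPT-4.1": (63.3, 66.3),
    "o3": (64.7, 66.8),
    "o4-mini": (63.7, 65.8),
    "Gemini-3-Flash": (67.3, 68.4),
    "Gemini-2.5-Flash": (60.5, 65.3),
    "Qwen3-30B": (36.6, 45.5),
}

without = [wo for wo, _ in macro_avg.values()]
with_iter = [w for _, w in macro_avg.values()]

mean_without = round(mean(without), 1)
mean_with = round(mean(with_iter), 1)
# The text reports the difference of the two rounded means.
gain = round(mean_with - mean_without, 1)
improved = sum(w > wo for wo, w in macro_avg.values())

# Population standard deviation: an assumed convention (see lead-in).
std_with = round(pstdev(with_iter), 1)

print(f"mean w/o iteration: {mean_without}")
print(f"mean w/  iteration: {mean_with} (gain {gain:+})")
print(f"backbones improved: {improved} of {len(macro_avg)}")
print(f"std / min / max w/ iteration: {std_with} / {min(with_iter)} / {max(with_iter)}")
```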

### C\.3 Confidence Calibration Analysis

The ablation shows that iterative modeling helps, but it leaves one question: does max\-min confidence actually pick better candidates, or does it mainly benefit from generating more of them? To separate selection quality from candidate diversity, we analyze confidence scores and candidate\-level correctness from COOPA’s execution logs\.

![Refer to caption](https://arxiv.org/html/2606.27611v1/x3.png)

Figure 3: Confidence analysis of iterative refinement \(aggregated across 3 datasets\)\. \(a\) Min\-confidence increases across iterations \(82\.0 → 87\.8\)\. \(b\) Per\-dimension improvement from Iteration 1 to the selected candidate: constraints improve the most \(\+8\.7\), followed by objective \(\+4\.2\), variables \(\+2\.8\), and parameters \(\+2\.4\)\. \(c\) Selection frequency shifts toward later iterations on weaker backbones\. \(d\) Selected candidates consistently have higher min\-confidence than the first candidate, with the gap largest on weaker backbones\.

Figure [3](https://arxiv.org/html/2606.27611#A3.F3) summarizes how confidence changes across iterations\. Min\-confidence rises from 82\.0 to 87\.8 \(panel a\)\. From the first candidate to the selected one, constraints improve the most \(\+8\.7\), followed by objective \(\+4\.2\), variables \(\+2\.8\), and parameters \(\+2\.4\) \(panel b\), matching the error profile in Section [2\.1](https://arxiv.org/html/2606.27611#S2.SS1)\. Later iterations are selected more often on weaker models \(panel c\), and the selected candidate’s min\-confidence is consistently higher than the first candidate’s \(panel d\), with the largest gaps on Qwen3\-30B \(\+27\.7\) and Gemini\-2\.5\-Flash \(\+17\.5\)\.

Experiment 1: Max\-min vs\. first\-candidate selection\. Table [6](https://arxiv.org/html/2606.27611#A3.T6) already gives this comparison: solving the max\-min selected candidate instead of the first candidate improves accuracy on 7 of 8 backbones, for a \+3\.0\-point cross\-model mean gain\. This is consistent with the confidence criterion adding value beyond candidate diversity\.

Experiment 2: Confidence gain predicts accuracy gain\. We plot the mean min\-confidence gain \(selected minus Iteration 1\) against the accuracy gain \(w/ iteration minus w/o iteration\) for each model–benchmark pair \(Figure [4](https://arxiv.org/html/2606.27611#A3.F4)\)\. The Pearson correlation is $r=0.58$ \($p=0.003$\), indicating that larger confidence gains usually translate into larger accuracy gains\.

![Refer to caption](https://arxiv.org/html/2606.27611v1/x4.png)

Figure 4: Confidence gain vs\. accuracy gain per model–benchmark pair\. Each point represents one \(model, dataset\) combination\. The positive correlation \($r=0.58$, $p=0.003$\) confirms that confidence improvements from iterative modeling translate into accuracy improvements\.

BWOR shows the strongest payoff from confidence gains, while ComplexLP is noisier and includes several near\-zero or negative gains despite higher confidence, consistent with its greater difficulty\. Overall, the max\-min criterion yields 181 beneficial overrides and 95 harmful ones across all problem–backbone pairs \(1\.9:1\), with net\-positive selection on all 8 backbones and all 3 benchmarks\.

### C\.4 Solver Dispatch Statistics

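Table 7: Distribution of optimizer agent calls per model and dataset\. Columns give Math, Comb, and Meta/General counts for ComplexLP \(211 problems\), IndustryOR \(100\), and BWOR \(80\)\.

| Model | ComplexLP Math | ComplexLP Comb | ComplexLP Meta/Gen | IndustryOR Math | IndustryOR Comb | IndustryOR Meta/Gen | BWOR Math | BWOR Comb | BWOR Meta/Gen |
|---|---|---|---|---|---|---|---|---|---|
| GPT\-5\.2 | 199 | 12 | 0/0 | 95 | 4 | 0/1 | 78 | 2 | 0/0 |
| GPT\-5 | 207 | 4 | 0/0 | 95 | 4 | 1/0 | 76 | 4 | 0/0 |
| GPT\-4\.1 | 178 | 33 | 0/0 | 90 | 9 | 1/0 | 67 | 13 | 0/0 |
| o3 | 188 | 23 | 0/0 | 93 | 5 | 0/2 | 74 | 4 | 1/1 |
| o4\-mini | 174 | 31 | 2/4 | 91 | 7 | 1/1 | 65 | 14 | 0/1 |
| Gemini\-3\-Flash | 211 | 0 | 0/0 | 96 | 3 | 1/0 | 76 | 4 | 0/0 |
| Gemini\-2\.5\-Flash | 197 | 14 | 0/0 | 97 | 2 | 1/0 | 76 | 4 | 0/0 |
| Qwen3\-30B | 175 | 33 | 0/3 | 87 | 10 | 0/3 | 64 | 13 | 1/2 |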

Table [7](https://arxiv.org/html/2606.27611#A3.T7) shows that the mathematical optimizer dominates the workload: across 3,128 model–problem invocations, it handles 91\.1% of calls, versus 8\.0% for the combinatorial optimizer and less than 1% combined for the metaheuristic and general optimizers\. This largely reflects the benchmarks, which are overwhelmingly LP/IP/MIP\. Routing still varies by backbone: GPT\-4\.1, o4\-mini, and Qwen3\-30B send 13–15% of problems to the combinatorial optimizer, while Gemini\-3\-Flash sends fewer than 2%\.
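The mathematical\-optimizer share and the per\-backbone combinatorial shares can be recomputed from Table 7\. The sketch below transcribes the table as \(Math, Comb, Meta, General\) tuples in the order ComplexLP, IndustryOR, BWOR; the variable names are illustrative\.

```python
# (math, comb, meta, general) per dataset, transcribed from Table 7.
calls = {
    "GPT-5.2": [(199, 12, 0, 0), (95, 4, 0, 1), (78, 2, 0, 0)],
    "GPT-5": [(207, 4, 0, 0), (95, 4, 1, 0), (76, 4, 0, 0)],
    "GPT-4.1": [(178, 33, 0, 0), (90, 9, 1, 0), (67, 13, 0, 0)],
    "o3": [(188, 23, 0, 0), (93, 5, 0, 2), (74, 4, 1, 1)],
    "o4-mini": [(174, 31, 2, 4), (91, 7, 1, 1), (65, 14, 0, 1)],
    "Gemini-3-Flash": [(211, 0, 0, 0), (96, 3, 1, 0), (76, 4, 0, 0)],
    "Gemini-2.5-Flash": [(197, 14, 0, 0), (97, 2, 1, 0), (76, 4, 0, 0)],
    "Qwen3-30B": [(175, 33, 0, 3), (87, 10, 0, 3), (64, 13, 1, 2)],
}

total_calls = sum(sum(row) for rows in calls.values() for row in rows)
math_calls = sum(row[0] for rows in calls.values() for row in rows)
math_share = round(100 * math_calls / total_calls, 1)

# Share of each backbone's problems sent to the combinatorial optimizer.
comb_share = {
    model: round(100 * sum(r[1] for r in rows) / sum(sum(r) for r in rows), 1)
    for model, rows in calls.items()
}

print(f"total invocations: {total_calls}")
print(f"mathematical optimizer share: {math_share}%")
for model, share in comb_share.items():
    print(f"{model:18s} combinatorial share: {share}%")
```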

The low use of non\-mathematical optimizers limits empirical validation of solver dispatch\. With only 253 combinatorial, 10 metaheuristic, and 18 general\-optimizer calls, we cannot reliably test whether specialized routing improves accuracy over always using a mathematical programming solver\. The contribution here is therefore architectural rather than strongly empirical, and broader validation will require more diverse benchmarks\.

### C\.5 Cost and Efficiency

COOPA’s multi\-step pipeline is inherently more expensive than single\-pass methods such as OR\-LLM\-Agent\. To measure that overhead, we sample 10 problems from each benchmark \(30 total\), run all five methods with Gemini\-2\.5\-Flash, and record wall\-clock time, total tokens, API calls, and estimated API cost\. Table [8](https://arxiv.org/html/2606.27611#A3.T8) reports the averages\.

Table 8: Cost and efficiency metrics averaged over 30 randomly sampled problems \(10 per benchmark\) using Gemini\-2\.5\-Flash\. Wall\-clock time includes all LLM calls and code execution\.

| Method | Wall Time \(s\) | Total Tokens \(K\) | API Calls | Cost \(\$\) |
|---|---|---|---|---|
| Chain\-of\-Experts | 101\.7 | 26\.0 | 7\.0 | 0\.044 |
| OptiMUS | 198\.8 | 88\.7 | 36\.4 | 0\.103 |
| OptiTree | 74\.0 | 15\.6 | 3\.5 | 0\.028 |
| OR\-LLM\-Agent | 61\.3 | 9\.8 | 2\.1 | 0\.020 |
| COOPA \(Ours\) | 201\.4 | 146\.0 | 14\.5 | 0\.138 |

COOPA is the most expensive method on three of the four metrics \(wall\-clock time, tokens, and cost\): \$0\.138 per problem on average, about 7× OR\-LLM\-Agent \(\$0\.020\) and 3× CoE \(\$0\.044\)\. The main driver is token usage: 146K tokens per problem, reflecting $k=3$ candidate generations and confidence evaluations\. On API calls, COOPA uses 14\.5 per problem, fewer than OptiMUS \(36\.4\) but far more than OR\-LLM\-Agent \(2\.1\) and OptiTree \(3\.5\)\. Its wall\-clock time \(201\.4s\) is similar to OptiMUS \(198\.8s\) and about 3× OR\-LLM\-Agent \(61\.3s\)\.

In absolute terms, the cost remains modest: a 100\-problem benchmark costs under $14 with COOPA\. That premium may be acceptable when a wrong formulation is expensive, but it is less attractive in high\-throughput settings\. In those settings, the ablation in Section [C\.2](https://arxiv.org/html/2606.27611#A3.SS2) suggests that disabling iterative modeling can lower cost while retaining structured output and solver dispatch\.
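The ratios and the 100\-problem estimate above follow directly from Table 8\. The short sketch below recomputes them and also reports which method is highest on each metric; the dictionary layout and variable names are illustrative\.

```python
METRICS = ("wall_time_s", "tokens_k", "api_calls", "cost_usd")

# Per-problem averages transcribed from Table 8.
table8 = {
    "Chain-of-Experts": (101.7, 26.0, 7.0, 0.044),
    "OptiMUS": (198.8, 88.7, 36.4, 0.103),
    "OptiTree": (74.0, 15.6, 3.5, 0.028),
    "OR-LLM-Agent": (61.3, 9.8, 2.1, 0.020),
    "COOPA": (201.4, 146.0, 14.5, 0.138),
}

coopa = table8["COOPA"]
cost_vs_orllm = round(coopa[3] / table8["OR-LLM-Agent"][3], 1)
cost_vs_coe = round(coopa[3] / table8["Chain-of-Experts"][3], 1)
time_vs_orllm = round(coopa[0] / table8["OR-LLM-Agent"][0], 1)
cost_100_problems = 100 * coopa[3]

# Method with the largest value on each metric.
highest = {
    metric: max(table8, key=lambda name: table8[name][i])
    for i, metric in enumerate(METRICS)
}

print(f"cost ratio vs OR-LLM-Agent: {cost_vs_orllm}x")
print(f"cost ratio vs CoE:          {cost_vs_coe}x")
print(f"time ratio vs OR-LLM-Agent: {time_vs_orllm}x")
print(f"100 problems with COOPA:    ${cost_100_problems:.2f}")
print(highest)
```

COOPA is highest on time, tokens, and cost, while OptiMUS makes the most API calls\.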

Wall\-clock time is sensitive to API latency, so token counts and API calls are the more reliable cross\-method comparisons\.

## Appendix D Discussion and Limitations

### D\.1 Modularity

A central design principle of COOPA is that each workflow component is independent and replaceable\. This modularity has direct practical value: adding support for a new solver or problem class requires only defining a new optimizer agent \(a system prompt and tool access list\), with no changes to the other architectural components\. For example, integrating a constraint programming solver \(e\.g\., MiniZinc\) or a stochastic programming framework \(e\.g\., PySP\) would require a single new agent definition\. Fine\-tuned models from the training\-based literature \(e\.g\., ORLM Huang et al\. \([2025](https://arxiv.org/html/2606.27611#bib.bib8)\), SIRL Chen et al\. \([2026](https://arxiv.org/html/2606.27611#bib.bib31)\)\) could similarly be integrated as specialized backbone LLMs within individual agents, combining the benefits of training\-based and pipeline approaches\.
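One way to picture "a system prompt and tool access list" is as a small record in a registry. The sketch below is a hypothetical illustration, not COOPA's actual data model: `OptimizerAgent`, `REGISTRY`, `register`, the MiniZinc agent name, and the `run_minizinc_model` tool are all invented for this example, and the prompts are abbreviated. The four existing agent names and `load_object_from_python_file` are taken from the prompts in Appendix A\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizerAgent:
    """An optimizer agent reduced to its two defining parts (hypothetical)."""
    name: str
    system_prompt: str
    tools: tuple


REGISTRY = {}


def register(agent):
    """Add an agent without touching any existing entry."""
    if agent.name in REGISTRY:
        raise ValueError(f"agent already registered: {agent.name}")
    REGISTRY[agent.name] = agent


# The four agents described in Appendix A (prompts abbreviated here).
for agent_name, library in [
    ("mathematical_optimizer_agent", "Pyomo"),
    ("combinatorial_optimizer_agent", "OR-Tools"),
    ("metaheuristic_optimizer_agent", "pymoo"),
    ("general_optimizer_agent", "Python"),
]:
    register(OptimizerAgent(
        name=agent_name,
        system_prompt=f"You solve optimization problems using {library}. ...",
        tools=("load_object_from_python_file",),
    ))

before = dict(REGISTRY)

# Hypothetical fifth agent: one new definition, nothing else changes.
register(OptimizerAgent(
    name="constraint_programming_agent",
    system_prompt="You model and solve constraint programs with MiniZinc. ...",
    tools=("run_minizinc_model",),  # hypothetical tool name
))

print(sorted(REGISTRY))
```

In this picture the dispatch prompt would also need one extra line describing the new agent, which is the only coupling between components that the sketch leaves out.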

The scalable multi\-solver architecture addresses a real gap in existing systems\. As discussed in Section[3\.4](https://arxiv.org/html/2606.27611#S3.SS4), problems such as multi\-objective portfolio optimization, where the goal is to compute a Pareto front of non\-dominated solutions balancing return and risk, are often more naturally handled by multi\-objective evolutionary algorithms \(e\.g\., NSGA\-II via pymoo\) than by repeated scalarization with a single mathematical\-programming solver\. COOPA’s architecture supports such problems through its metaheuristic optimizer agent without forcing them into a single LP/MILP\-style formulation\. Appendix[B\.2](https://arxiv.org/html/2606.27611#A2.SS2)demonstrates this on a real engineering design problem\.

We acknowledge that the multi\-solver dispatch is the least empirically validated design choice in this paper\. The solver dispatch statistics \(Table [7](https://arxiv.org/html/2606.27611#A3.T7)\) show that 91\.1% of problems are routed to the Mathematical Optimizer, reflecting the LP/MILP focus of existing benchmarks rather than the architectural capability\. The qualitative case study in Appendix [B\.2](https://arxiv.org/html/2606.27611#A2.SS2) provides one demonstration, but a comprehensive evaluation would require benchmarks with greater problem\-type diversity, including vehicle routing, job\-shop scheduling, and multi\-objective optimization at meaningful scale\.

### D\.2 Interpretability

COOPA provides two layers of interpretability that existing methods lack: \(1\) source traceability linking each modeling element to quoted problem text, and \(2\) confidence scores with natural\-language explanations articulating the LLM’s uncertainty\. These features are practically important because OR practitioners need to trust and verify automated models before deploying solutions in high\-stakes settings; a supply chain planner, for instance, must confirm that constraints correctly capture capacity limits and demand forecasts\.
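
A minimal sketch of how these two layers could be represented, and of one simple check a reviewer might automate, is given below\. The record fields, the \[0, 1\] confidence scale, and the verbatim\-quote check are assumptions made for illustration; they are not COOPA’s actual schema or audit procedure\.

```python
from dataclasses import dataclass


@dataclass
class ModelElement:
    """Hypothetical record for one modeling element with provenance and confidence."""
    kind: str          # "variable", "parameter", "constraint", or "objective"
    expression: str    # the formal element, e.g. "sum(x) <= 500"
    source_quote: str  # text quoted from the problem statement
    confidence: float  # self-reported score; the [0, 1] scale is an assumption
    explanation: str   # natural-language rationale for the score


def unsupported_elements(problem_text, elements):
    """Return elements whose quote does not appear verbatim in the problem text."""
    def normalize(text):
        # Collapse whitespace and ignore case so line wrapping does not matter.
        return " ".join(text.split()).lower()

    haystack = normalize(problem_text)
    return [e for e in elements if normalize(e.source_quote) not in haystack]


problem = "Total production across plants cannot exceed 500 units per week."
elements = [
    ModelElement("constraint", "sum(x) <= 500",
                 "cannot exceed 500 units per week", 0.9,
                 "Capacity limit is stated explicitly."),
    ModelElement("constraint", "x[i] >= 10",
                 "each plant must produce at least 10 units", 0.4,
                 "Minimum output is not stated in the text."),
]
flagged = unsupported_elements(problem, elements)  # only the second element
```

A check of this kind only confirms that a quote exists in the source; whether the quote actually justifies the element still requires human judgment\.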

We note that the interpretability contribution in this paper is primarily architectural: we demonstrate that the system produces source references and confidence explanations, and we illustrate their use in the case studies \(Appendix [B](https://arxiv.org/html/2606.27611#A2)\)\. We do not empirically measure whether these features improve human verification efficiency or error detection rates, nor do we separately score the factual accuracy and completeness of the extracted source references\. Controlled user studies and source\-level audits with OR practitioners would substantially strengthen this contribution and are an important direction for future work\.

### D\.3 Ablation Takeaways

The ablation results \(Table [6](https://arxiv.org/html/2606.27611#A3.T6)\) reveal that iterative confidence\-based modeling is the primary driver of COOPA’s advantage over baselines\. Without iterative modeling, COOPA’s cross\-model mean is 61\.8%, comparable to OR\-LLM\-Agent \(61\.6%\) and OptiTree \(61\.3%\)\. Iterative modeling adds 3\.0 percentage points on average \(61\.8% $\to$ 64\.8%\), accounting for nearly all of COOPA’s lead over the next\-best baseline\. This finding confirms the hypothesis from Section [2\.1](https://arxiv.org/html/2606.27611#S2.SS1): formulation quality is the central bottleneck in LLM\-based OR, and a mechanism that enables the LLM to evaluate and refine its own formulations addresses this bottleneck directly\.
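
The arithmetic behind "nearly all" can be spelled out from the cross\-model means quoted above\. The decomposition assumes that OR\-LLM\-Agent \(61\.6%\) is the next\-best baseline, which is inferred from the figures in this paragraph rather than stated explicitly\.

```latex
\begin{align*}
\text{total lead over baseline} &= 64.8 - 61.6 = 3.2 \text{ pp} \\
\text{iterative modeling} &= 64.8 - 61.8 = 3.0 \text{ pp} \\
\text{rest of the architecture} &= 61.8 - 61.6 = 0.2 \text{ pp} \\
\text{share from iterative modeling} &= 3.0 / 3.2 \approx 94\%
\end{align*}
```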

The ablation deltas are inversely correlated with backbone capability\. Qwen3\-30B, the weakest model, gains the most \(\+8\.9pp\), while GPT\-5\.2, the strongest, is essentially unchanged \(−0\.1pp\)\. This pattern suggests that iterative modeling functions as an equalizer: it compensates for the lower single\-attempt quality of weaker backbones by providing multiple chances to identify and correct formulation errors\. For stronger models that already produce high\-quality first candidates, the mechanism still provides modest gains \(\+1\.1 to \+2\.1pp on GPT\-5, o3, Gemini\-3\-Flash\) rather than degradation, indicating that the confidence evaluator generally avoids overriding correct formulations even when the margin for improvement is small\.

For practitioners building LLM\-based OR tools, the ablation supports deploying iterative modeling broadly across backbone models\. The consistent gains on 7 of 8 backbones indicate that the mechanism is robust rather than narrowly tuned to specific model families\. The primary consideration is cost rather than accuracy risk: iterative modeling increases per\-problem cost by approximately $7\times$ \(Section [C\.5](https://arxiv.org/html/2606.27611#A3.SS5)\), so practitioners with tight cost budgets may prefer the single\-candidate variant, which already matches the best existing baselines\. When accuracy is the priority, iterative modeling is justified across all tested backbones\.
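
A rough back\-of\-envelope bound illustrates the trade\-off\. It combines the approximately $7\times$ ratio with the figure of under $14 per 100\-problem benchmark, and it assumes that the $14 figure refers to full COOPA with iterative modeling and that the ratio applies to dollar cost\. Neither assumption is broken down further here, so the numbers are illustrative upper bounds only\.

```latex
\begin{align*}
c_{\text{iterative}} &< \$14 / 100 = \$0.14 \text{ per problem} \\
c_{\text{single}} &\approx c_{\text{iterative}} / 7 < \$0.02 \text{ per problem} \\
c_{\text{iterative}} - c_{\text{single}} &\approx \tfrac{6}{7}\, c_{\text{iterative}} < \$0.12 \text{ per problem}
\end{align*}
```

Under these assumptions, the average gain of 3\.0 percentage points costs at most about twelve cents per problem\.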

We emphasize that this is a *within\-system* ablation: it measures the contribution of iterative modeling within COOPA’s architecture, not across different systems\. The gains from iterative modeling depend on the structured output format that enables meaningful confidence evaluation; whether similar gains would arise in architectures with different output representations remains an open question\.

### D\.4 Confidence Reliability

The max\-min selection criterion is only as effective as the LLM’s ability to accurately self\-evaluate its formulations\. Two failure modes are possible: systematic overconfidence \(high scores for incorrect formulations\) and poor discrimination \(similar scores for correct and incorrect formulations\)\.
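
The selection rule itself is simple to state in code\. The sketch below picks the candidate whose weakest dimension score is highest; the dimension names and scores are hypothetical, and the candidate representation is an assumption rather than COOPA’s data model\.

```python
def select_max_min(candidates):
    """Pick the candidate whose lowest dimension score is highest (max-min)."""
    return max(candidates, key=lambda c: min(c["scores"].values()))


def select_by_mean(candidates):
    """Alternative rule for comparison: pick the highest average score."""
    return max(candidates, key=lambda c: sum(c["scores"].values()) / len(c["scores"]))


# Hypothetical dimension names and self-evaluation scores, for illustration only.
candidates = [
    {"id": "A", "scores": {"variables": 0.95, "constraints": 0.40, "objective": 0.90}},
    {"id": "B", "scores": {"variables": 0.80, "constraints": 0.75, "objective": 0.70}},
    {"id": "C", "scores": {"variables": 0.85, "constraints": 0.60, "objective": 0.95}},
]
best = select_max_min(candidates)      # B: its weakest score (0.70) is the highest
by_mean = select_by_mean(candidates)   # C: highest average, despite a 0.60 dimension
```

Max\-min penalizes a formulation with one weak dimension even when its other scores are high\. Both failure modes above act directly on this rule: an overconfident evaluator can lift an incorrect candidate’s minimum score, and a poorly discriminating one leaves the minima too close together to separate candidates\.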

The ablation results provide indirect evidence that the confidence\-based selection criterion is functional: iterative modeling improves accuracy on 7 of 8 backbones, with gains of up to \+8\.9 percentage points\. If confidence scores were uninformative, generating additional candidates would provide only diversity\-based improvement \(roughly proportional to the chance that any one of $k$ candidates is correct\), and the max\-min criterion would perform no better than random selection\. The consistently positive deltas suggest that the evaluator provides genuine signal for distinguishing better formulations from worse ones\. However, the magnitude of the gains varies substantially across backbones, from \+8\.9pp on Qwen3\-30B to −0\.1pp on GPT\-5\.2\. This variation raises the question of whether the evaluator is well\-calibrated in absolute terms, or whether it merely provides sufficient relative ranking to support selection\. The targeted analyses in Appendix [C\.3](https://arxiv.org/html/2606.27611#A3.SS3) address this question directly\.
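
The gap between what candidate diversity makes available and what an uninformative selector realizes can be made concrete under a simplifying assumption that is not from the paper: each of the $k$ candidates is correct independently with the same probability $p$\.

```latex
\begin{align*}
\Pr[\text{at least one of } k \text{ candidates is correct}] &= 1 - (1 - p)^k \\
\Pr[\text{a uniformly random pick is correct}] &= p
\end{align*}
```

With illustrative values $p = 0.6$ and $k = 3$, a correct candidate is available with probability $1 - 0.4^3 = 0.936$, yet a selector with no signal still picks a correct one with probability $0.6$\. Under this idealization, accuracy above the single\-attempt level requires scores that carry some information about correctness\.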

To move beyond indirect evidence, we conduct three targeted analyses in Appendix [C\.3](https://arxiv.org/html/2606.27611#A3.SS3)\. First, we compare the accuracy of max\-min selection against always using the first candidate, testing whether the criterion adds value beyond candidate diversity\. Second, we measure whether confidence gains predict accuracy gains across model–benchmark pairs\. Third, we examine cases where the criterion overrides the first candidate, measuring how often such overrides are beneficial versus harmful\. Together, these analyses determine whether the confidence\-based selection mechanism provides genuine signal or operates as a near\-random selector among candidates\.

### D\.5 Potential Impacts
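
The first and third analyses can be sketched as simple tallies over per\-problem records\. The record fields and the sample data below are hypothetical and are not results from Appendix C\.3; the sketch only shows how the two quantities relate\.

```python
from collections import Counter


def override_outcomes(records):
    """Classify each problem by whether max-min selection overrode candidate 0.

    Each record is a hypothetical dict with:
      first_correct    -- bool, whether candidate 0 was correct
      selected_index   -- int, index of the candidate chosen by max-min
      selected_correct -- bool, whether the chosen candidate was correct
    """
    counts = Counter()
    for r in records:
        if r["selected_index"] == 0:
            counts["no_override"] += 1
        elif r["selected_correct"] and not r["first_correct"]:
            counts["beneficial"] += 1
        elif r["first_correct"] and not r["selected_correct"]:
            counts["harmful"] += 1
        else:
            counts["neutral"] += 1  # override did not change correctness
    return counts


def accuracy_gain(records):
    """Accuracy of max-min selection minus accuracy of always using candidate 0."""
    n = len(records)
    selected = sum(r["selected_correct"] for r in records) / n
    first = sum(r["first_correct"] for r in records) / n
    return selected - first


# Made-up records, for illustration only.
records = [
    {"first_correct": True,  "selected_index": 0, "selected_correct": True},
    {"first_correct": False, "selected_index": 2, "selected_correct": True},
    {"first_correct": True,  "selected_index": 1, "selected_correct": False},
    {"first_correct": False, "selected_index": 1, "selected_correct": True},
    {"first_correct": False, "selected_index": 1, "selected_correct": False},
]
outcomes = override_outcomes(records)
gain = accuracy_gain(records)
```

On this toy data there are two beneficial overrides and one harmful one among five problems, and the accuracy gain over the first candidate equals their difference divided by the number of problems, \(2 − 1\) / 5 = 0\.2\. A near\-random selector would be expected to produce roughly as many harmful overrides as beneficial ones\.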

COOPA could lower the expertise barrier for optimization modeling in settings such as logistics, energy, manufacturing, and public\-service planning, which may improve access to OR tools beyond specialist teams\. At the same time, incorrect automatically generated formulations could misallocate resources, hide important tradeoffs, or create unjustified confidence in flawed decisions if used without human review\.

For this reason, we view COOPA as a decision\-support system rather than an autonomous decision\-maker\. Its source traceability, confidence explanations, and explicit case studies are intended to support human checking before deployment, especially in high\-stakes settings where modeling errors can have operational or social consequences\.