OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Hugging Face Daily Papers 05/27/26, 12:00 AM Papers

operations-research benchmark llm-agents industrial-optimization workspace lifecycle

Summary

OR-Space is a benchmark for evaluating large language model agents in industrial operations research workflows, focusing on multi-stage task lifecycles and persistent workspaces beyond simple text generation.

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

Cached at: 05/29/26, 03:00 AM

Source: https://huggingface.co/papers/2605.28158

Abstract

OR-Space is a comprehensive benchmark for evaluating large language model agents in industrial operations research workflows, assessing their ability to handle persistent workspaces and multi-stage task lifecycles beyond simple text generation.

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.
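To make the workspace description above more concrete, here is a minimal Python sketch of how one benchmark instance and its task mode might be represented. It is an illustration only: the class name `WorkspaceInstance`, its field names, the file names, and the per-mode sanity checks are all assumptions made for this example, not the schema or evaluator interface used by OR-Space.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class TaskMode(Enum):
    """The three task modes named in the abstract."""
    BUILD = "build"
    REVISE = "revise"
    EXPLAIN = "explain"


@dataclass
class WorkspaceInstance:
    """Hypothetical container for one instance (not the paper's schema)."""
    mode: TaskMode
    business_docs: list[str] = field(default_factory=list)
    structured_data: list[str] = field(default_factory=list)
    code_artifacts: list[str] = field(default_factory=list)  # optional per the abstract
    solver_outputs: list[str] = field(default_factory=list)
    # Assumed evaluator shape: takes the agent's output text, returns pass/fail.
    evaluator: Optional[Callable[[str], bool]] = None

    def artifacts(self) -> list[str]:
        """All files an agent may need to read, in a fixed order."""
        return [*self.business_docs, *self.structured_data,
                *self.code_artifacts, *self.solver_outputs]

    def problems(self) -> list[str]:
        """Sanity checks based on our own reading of each mode (an assumption)."""
        issues = []
        if self.mode is TaskMode.REVISE and not self.code_artifacts:
            issues.append("Revise needs an existing model to modify")
        if self.mode is TaskMode.EXPLAIN and not self.solver_outputs:
            issues.append("Explain needs a solution to reason about")
        if self.evaluator is None:
            issues.append("no evaluator attached")
        return issues


# Example: a Revise task that is missing the existing model it should modify.
instance = WorkspaceInstance(
    mode=TaskMode.REVISE,
    business_docs=["requirements.md"],   # hypothetical file names
    structured_data=["demand.csv"],
    evaluator=lambda output: "minimize" in output.lower(),
)
print(instance.artifacts())
print(instance.problems())
```

Keeping the task mode and the artifact lists in one record reflects the abstract's point that an instance is a workspace spread across interdependent files, not a single self-contained problem statement.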

Get this paper in your agent:

hf papers read 2605.28158

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Orc (working name) - auditable and declarative AI workflow

Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

@xdotli: 5 spaces you should be evaluating your agents using robust environments: 1) output space: the input and results of agen…

Orchard: An Open-Source Agentic Modeling Framework