OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents
Summary
OR-Space is a benchmark for evaluating large language model agents in industrial operations research workflows, focusing on multi-stage task lifecycles and persistent workspaces beyond simple text generation.
Source: https://huggingface.co/papers/2605.28158
Abstract
OR-Space is a comprehensive benchmark for evaluating large language model agents in industrial operations research workflows, assessing their ability to handle persistent workspaces and multi-stage task lifecycles beyond simple text generation.
Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.
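To make the workspace structure described in the abstract more concrete, here is a minimal Python sketch of how one instance and the three task modes might be represented. The abstract does not publish a schema, so every name here (TaskMode, WorkspaceInstance, group_by_mode, and all field names and file paths) is a hypothetical illustration, not the benchmark's actual format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class TaskMode(Enum):
    """The three task modes named in the abstract."""
    BUILD = "build"      # construct a solver-ready model from artifacts
    REVISE = "revise"    # modify an existing model under new requirements
    EXPLAIN = "explain"  # answer grounded questions about a solution


@dataclass
class WorkspaceInstance:
    """Hypothetical container for one instance (not the paper's schema)."""
    instance_id: str
    mode: TaskMode
    business_documents: List[str] = field(default_factory=list)
    structured_data: List[str] = field(default_factory=list)
    code_artifacts: List[str] = field(default_factory=list)  # optional per the abstract
    solver_outputs: List[str] = field(default_factory=list)
    evaluator: str = ""  # path to a task-specific evaluator (assumed)

    def all_artifacts(self) -> List[str]:
        """Return every artifact path an agent may need to read."""
        return (self.business_documents + self.structured_data
                + self.code_artifacts + self.solver_outputs)


def group_by_mode(instances: List[WorkspaceInstance]) -> Dict[TaskMode, List[str]]:
    """Group instance ids by task mode, e.g. to report per-mode results."""
    grouped: Dict[TaskMode, List[str]] = {mode: [] for mode in TaskMode}
    for inst in instances:
        grouped[inst.mode].append(inst.instance_id)
    return grouped


# Illustrative instances with made-up file names.
build_task = WorkspaceInstance(
    instance_id="demo-build",
    mode=TaskMode.BUILD,
    business_documents=["docs/requirements.md"],
    structured_data=["data/demand.csv"],
    evaluator="eval/check_model.py",
)
explain_task = WorkspaceInstance(
    instance_id="demo-explain",
    mode=TaskMode.EXPLAIN,
    business_documents=["docs/requirements.md"],
    code_artifacts=["model.py"],
    solver_outputs=["out/solution.json"],
    evaluator="eval/check_answer.py",
)
```

Grouping by mode reflects the lifecycle framing: a Build instance may ship without code artifacts or solver outputs, while an Explain instance would typically include both.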
Similar Articles
Orc (working name) - auditable and declarative AI workflow
The developer is seeking feedback on "ORC," an early-stage orchestration-as-code tool that uses a declarative DSL to define, validate, and version control LLM workflows. Aimed at users combining local and cloud models, it replaces complex Python scripts with auditable, Terraform-like definitions for agents and tool execution.
Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration
The paper introduces COSMO-Agent, a tool-augmented reinforcement learning framework that trains LLMs to perform closed-loop CAD-CAE optimization, iteratively generating parametric geometries and running simulations until constraints are satisfied, with a multi-constraint reward and a new industry-aligned dataset.
SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces
SABER introduces a benchmark for evaluating the operational safety of LLM coding agents in realistic stateful project workspaces, showing that even the best model has over a 54% harmful safety-violation rate, indicating insufficient alignment for real-world environments.
@xdotli: 5 spaces you should be evaluating your agents using robust environments: 1) output space: the input and results of agen…
Highlights five key spaces for evaluating AI agents using robust environments (output, action, reasoning, latent, memory) and recommends using @benchflow_ai for implementation.
Orchard: An Open-Source Agentic Modeling Framework
Orchard is an open-source framework for scalable agentic modeling that enables training diverse autonomous agents, achieving state-of-the-art results on coding, GUI navigation, and personal assistance tasks.