OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Hugging Face Daily Papers

Summary

OR-Space is a benchmark for evaluating large language model agents in industrial operations research workflows, focusing on multi-stage task lifecycles and persistent workspaces beyond simple text generation.

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

Cached at: 05/29/26, 03:00 AM

Source: https://huggingface.co/papers/2605.28158

Abstract

OR-Space is a comprehensive benchmark for evaluating large language model agents in industrial operations research workflows, assessing their ability to handle persistent workspaces and multi-stage task lifecycles beyond simple text generation.

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.
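
To make the workspace-plus-task-mode idea concrete, here is a minimal sketch in Python. It is not the benchmark's actual schema or API: `TaskMode`, `WorkspaceInstance`, `run_instance`, the file names, and the toy evaluator are all hypothetical names chosen for illustration. It only shows the general shape the abstract describes: an instance bundles artifacts with a task-specific evaluator, and a Revise task checks that new requirements are met while prior logic is preserved.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class TaskMode(Enum):
    """The three task modes named in the abstract."""
    BUILD = "build"
    REVISE = "revise"
    EXPLAIN = "explain"


@dataclass
class WorkspaceInstance:
    """Hypothetical container for one benchmark instance (illustrative only)."""
    mode: TaskMode
    artifacts: Dict[str, str]                    # relative path -> file content
    evaluator: Callable[[Dict[str, str]], bool]  # task-specific check


def run_instance(instance: WorkspaceInstance, agent) -> bool:
    """Give the agent a copy of the workspace and score what it returns."""
    outputs = agent(instance.mode, dict(instance.artifacts))
    return instance.evaluator(outputs)


def revise_evaluator(files: Dict[str, str]) -> bool:
    """Toy check: the new constraint exists and the prior one is preserved."""
    model = files.get("model.lp", "")
    return "capacity:" in model and "min_service:" in model


instance = WorkspaceInstance(
    mode=TaskMode.REVISE,
    artifacts={
        "requirements.md": "Add a minimum service level of 10 units.",
        "model.lp": "capacity: x + y <= 100\n",
    },
    evaluator=revise_evaluator,
)


def toy_agent(mode: TaskMode, files: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for an LLM agent: appends the requested constraint."""
    if mode is TaskMode.REVISE:
        files["model.lp"] += "min_service: x >= 10\n"
    return files


def lazy_agent(mode: TaskMode, files: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for an agent that ignores the revision request."""
    return files


result = run_instance(instance, toy_agent)
lazy_result = run_instance(instance, lazy_agent)
```

Passing the agent a copy of the artifacts keeps the original workspace intact between runs, so the same instance can be scored against several agents. A real evaluator would presumably execute the model with a solver rather than match strings; the string check here is only a placeholder.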
