Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

arXiv cs.AI Papers

Summary

This paper introduces Bellman-Taylor Score Decoding, a method to handle state-dependent feasible action sets in Markov decision processes, addressing a key challenge in applying deep reinforcement learning to operations research problems.

arXiv:2606.10979v1 Announce Type: new Abstract: Many Markov decision processes (MDPs) in operations research have feasible actions that are state dependent and defined implicitly by various operational constraints. These features make it difficult to use standard deep reinforcement learning (DRL) algorithms, whose action interfaces typically assume either a fixed finite action catalog or a simple Euclidean space. Motivated by a Taylor expansion of the optimal action-value function, we propose Bellman--Taylor score decoding, a framework that moves policy learning to a Euclidean score space while enforcing feasibility through an action decoder. The induced latent-score MDP then can be optimized by standard DRL algorithms without differentiating through the decoder. We provide a performance guarantee showing that the optimality gap of this approach decomposes into a structural approximation error and an algorithmic learning error. Lastly, we apply this framework to a queueing network control problem, where the policy essentially learns a state-dependent index-based dispatching rule. Numerical experiments show near-optimal performance in small instances and considerable improvements over benchmarks in larger systems.
Original Article
View Cached Full Text

Cached at: 06/10/26, 06:18 AM

# Bellman–Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets
Source: [https://arxiv.org/html/2606.10979](https://arxiv.org/html/2606.10979)
Yi Chen, Rushuai Yang, Qiang Chen, Dongyan \(Lucy\) Huo Department of Industrial Engineering and Decision AnalyticsHong Kong University of Science and Technology

## 1Introduction

In recent years, deep reinforcement learning \(DRL\) has emerged as one of the most successful algorithmic paradigms for sequential decision\-making\. By combining reinforcement learning algorithms with expressive neural\-network function approximation, DRL has achieved remarkable empirical success in many AI domains such as game playing, robotic control, and autonomous racing\(Mnihet al\.,[2015](https://arxiv.org/html/2606.10979#bib.bib15); Rajeswaranet al\.,[2018](https://arxiv.org/html/2606.10979#bib.bib1); Wurmanet al\.,[2022](https://arxiv.org/html/2606.10979#bib.bib4); Tanget al\.,[2025](https://arxiv.org/html/2606.10979#bib.bib9)\)\. These successes suggest that DRL can be a powerful tool for learning complex control policies from simulation or interaction data, especially when explicit dynamic\-programming solutions are computationally infeasible\.

Many sequential decision\-making problems in operations research \(OR\) can naturally be formulated as Markov decision processes \(MDPs\)\. Examples include inventory control, queueing control, and resource allocation\. Classical dynamic programming provides a principled framework for such problems, but exact solution methods quickly become intractable as the state and action spaces grow\. Approximate dynamic programming \(ADP\) and simulation\-based policy optimization have therefore long played an important role\(Bertsekas,[2025](https://arxiv.org/html/2606.10979#bib.bib13); Powell,[2011](https://arxiv.org/html/2606.10979#bib.bib14)\)\. From a mathematical perspective, ADP and DRL are closely connected: both seek tractable approximations to Bellman evaluation and improvement steps, either by approximating value functions, action\-value functions, or policies\. This connection has motivated a growing body of work applying DRL algorithms to solve large\-scale OR problems\.\(Delarueet al\.,[2020](https://arxiv.org/html/2606.10979#bib.bib35); Dai and Gluzman,[2022](https://arxiv.org/html/2606.10979#bib.bib46); Gijsbrechtset al\.,[2022](https://arxiv.org/html/2606.10979#bib.bib48); Harshaet al\.,[2025](https://arxiv.org/html/2606.10979#bib.bib36); Xuet al\.,[2025](https://arxiv.org/html/2606.10979#bib.bib37); Chenet al\.,[2026](https://arxiv.org/html/2606.10979#bib.bib47)\)\.

Despite this natural connection, applying off\-the\-shelf DRL algorithms to operational MDPs is rarely plug\-and\-play\. A critical challenge is the action interface\. Standard DRL implementations typically assume that actions can be enumerated, sampled, and optimized over through a simple and fixed representation\. In finite\-action problems, a neural network often outputs one logit or value for each action in a fixed catalog\. In continuous\-action problems, the action is usually represented as a fixed\-dimensional Euclidean vector, and the actor outputs either this vector or the parameters of a distribution over such vectors\. Many OR problems do not fit either template\. Their feasible actions are often state dependent, high dimensional, and implicitly defined by capacity, compatibility, and integrality constraints\.

The action\-interface difficulty is a critical reason why successful DRL applications in operational MDPs often require substantial case\-specific algorithmic engineering\. Researchers may need to design problem\-specific action decompositions, feasibility corrections, masking rules, or specialized network architectures\. These techniques can be effective, but they also reduce the extent to which standard DRL solvers can be used as reusable tools across operational MDPs\. In this sense, the challenge is not only that OR problems are large or stochastic; it is also essential the natural action spaces of OR models often do not align with the action representations expected by standard DRL implementations\.

The existing literature have made important progress toward addressing this issue\. For example, action\-representation methods embed large finite action sets into lower\-dimensional spaces, allowing feedback from one action to generalize to similar actions\(Chandaket al\.,[2019](https://arxiv.org/html/2606.10979#bib.bib25); Dulac\-Arnoldet al\.,[2015](https://arxiv.org/html/2606.10979#bib.bib26)\)\. Feasibility\-preserving methods such as masking, action elimination, can prevent infeasible actions from being executed when a fixed action catalog or a simple feasible region is available\(Huang and Ontañón,[2022](https://arxiv.org/html/2606.10979#bib.bib30); Zahavyet al\.,[2018](https://arxiv.org/html/2606.10979#bib.bib31); Phamet al\.,[2018](https://arxiv.org/html/2606.10979#bib.bib32)\)\. A more recent line of work combines reinforcement learning with optimization\-based action selection, either by embedding learned value information into an action selection problem or by incorporating an optimization layer inside the policy representation\(Delarueet al\.,[2020](https://arxiv.org/html/2606.10979#bib.bib35); Harshaet al\.,[2025](https://arxiv.org/html/2606.10979#bib.bib36); Xuet al\.,[2025](https://arxiv.org/html/2606.10979#bib.bib37); Hoppeet al\.,[2025](https://arxiv.org/html/2606.10979#bib.bib34)\)\.

These approaches are valuable, but do not fully address the interface problem that arises when one wants to apply standard off\-the\-shelf DRL algorithms to MDPs with state\-dependent feasible action sets\. Action\-representation methods are primarily designed for fixed finite action catalogs, whereas many feasible sets are defined implicitly by constraints and may be too large to enumerate\. Masking and elimination also rely on an explicit catalog or a finite superset of actions\. Optimization\-layer methods typically treat the optimization score as a generic learned utility coefficient and require specialized training for the layer\. These approaches either depend on a particular action representation, a problem\-specific feasibility mechanism, or a customized optimization\-based learning architecture\.

In this paper, we propose a novel action interface for MDPs with state\-dependent feasible action sets, which we call the*Bellman\-Taylor score decoding*\. The central idea is to standardize the learning interface rather than the operational action space\. Instead of learning a policy directly over the irregular feasible action set, the policy learns a score vector in a Euclidean space\. An action\-decoder then maps this score into an implementable natural action by solving an optimization problem over the original feasible set\. The decoder is motivated by a Taylor approximation of the post\-action Bellman continuation value \(or the optimal action\-value function\), where the learned score is intended to represent the marginal effect of the post\-action system configuration on future value\.

This construction separates learning from feasibility\. The DRL algorithm operates on a regular Euclidean score space, while feasibility, integrality, and combinatorial coupling are handled by the action decoder\. Once the decoder is specified, the original MDP induces a latent\-score MDP in which the action is the score vector\. Standard continuous\-action DRL algorithms, such as proximal policy optimization, can then directly be applied to this induced problem\. Importantly, the decoder is solved only in the forward pass to convert a sampled score into a feasible natural action\. The policy\-gradient update does not require differentiating through the decoder\. This is distinct with differentiable optimization\-layer methods, where the optimizer is part of the trainable computation graph and must be differentiated through or approximated by a surrogate gradient\.

The proposed framework does not eliminate the inherent difficulty of operational MDPs\. Instead, it confines model structure to the decoder, allowing the learning algorithm to remain standard\. The model specifies the feasible action set, immediate reward, and a post\-action representation of the system\. The policy learns state\-dependent scores, which the decoder translates into feasible operational actions\. This creates a bridge between off\-the\-shelf continuous\-action DRL solvers and MDPs with state\-dependent, constrained, and often combinatorial actions\. The trade\-off is that the decoder\-induced policy class is restricted: not every feasible policy in the original MDP can be represented by a score\-decoding policy\. Our theoretical analysis quantifies this trade\-off\.

As an application of this framework, we investigate a queueing network control problem\. Dynamic scheduling in multi\-class, multi\-pool queueing systems is a classical and practically important problem, with applications such as call centers and inpatient overflow management\(Dai and Shi,[2019](https://arxiv.org/html/2606.10979#bib.bib23); Chenet al\.,[2020](https://arxiv.org/html/2606.10979#bib.bib8)\)\. Recent studies have also explored the use of DRL for such problems, but successful implementations often rely on problem\-specific action decompositions, customized policy architectures, or variance\-reduction techniques tailored to the queueing model\. In contrast, our framework allows us to apply a standard PPO solver directly without introducing any additional problem\-specific algorithmic engineering such as variance reduction, which are common and essential in peer works\(Dai and Gluzman,[2022](https://arxiv.org/html/2606.10979#bib.bib46); Donget al\.,[2025](https://arxiv.org/html/2606.10979#bib.bib24)\)\. In this setting, the Bellman\-Taylor score decoding essentially becomes a learned index\-based dispatching rule: the RL policy learns state\-dependent indices, and the action decoder selects the feasible dispatching action with the largest total score\. Through simulation experiments, we show that the resulting decoded PPO policy outperforms various benchmarks across a range of queueing instances\.

In summary, our contributions are threefold\.

- •We introduce the Bellman\-Taylor score\-decoding algorithm framework for MDPs with state\-dependent feasible action sets\. The framework converts a constrained natural\-action MDP into a latent\-score MDP, so that standard continuous\-action DRL algorithms can be applied without designing a problem\-specific probability distribution over feasible actions\. Feasibility is enforced exactly through the decoder\.
- •We provide a structural performance guarantee showing that optimality gap of our approach can be decomposed into a structural approximation error and an algorithmic learning error from solving the induced latent\-score MDP\. The structural term is controlled by the local Taylor remainder of the post\-action continuation value, clarifying when the restricted score\-decoding policy class can approximate Bellman\-greedy decisions well\.
- •We apply the framework to a queueing network control problem\. Our implementation uses a standard PPO solver on the latent\-score MDP without queue\-specific modifications to the learning algorithm or additional variance\-reduction devices\. The numerical study demonstrates the superior performance of our approach\.

The remainder of the paper is organized as follows\. Section[2](https://arxiv.org/html/2606.10979#S2)reviews related literature\. Section[3](https://arxiv.org/html/2606.10979#S3)formulates the MDP setting and introduces background\. Section[4](https://arxiv.org/html/2606.10979#S4)introduces Bellman–Taylor score decoding framework and the PPO implementation, as well as performance guarantee\. Section[5](https://arxiv.org/html/2606.10979#S5)investigates an inventory problem as a sanity check\. Section[6](https://arxiv.org/html/2606.10979#S6)presents the queueing network control case study\. Section[7](https://arxiv.org/html/2606.10979#S7)concludes lastly\.

## 2Related Literature

### 2\.1Deep Reinforcement Learning and Standard Action Interfaces

For small\-scale MDPs with finite state and action spaces, classical dynamic programming and exact MDP methods, including value iteration, policy iteration, and linear programming, provide exact solution approaches\(Suttonet al\.,[1998](https://arxiv.org/html/2606.10979#bib.bib10)\)\. However, when the state and action spaces become large, these exact methods quickly become computationally prohibitive, due to the curse of dimensionality\(Bellman,[1957](https://arxiv.org/html/2606.10979#bib.bib11); Powell,[2011](https://arxiv.org/html/2606.10979#bib.bib14)\)\. This challenge has motivated a broad literature on approximation algorithms, with substantial progress in approximate dynamic programming\(Bertsekas,[2025](https://arxiv.org/html/2606.10979#bib.bib13); Powell,[2011](https://arxiv.org/html/2606.10979#bib.bib14)\)\. Deep reinforcement learning can be viewed as a neural\-network\-based extension of this approximation paradigm, in which value functions or policies are represented by deep neural networks and learned from trajectory data\. Owing to the strong approximation power of neural networks, DRL has achieved remarkable empirical success in a variety of benchmark problems in AI domains, such as game playing and robotic control\(Mnihet al\.,[2015](https://arxiv.org/html/2606.10979#bib.bib15); Rajeswaranet al\.,[2018](https://arxiv.org/html/2606.10979#bib.bib1); Wurmanet al\.,[2022](https://arxiv.org/html/2606.10979#bib.bib4); Tanget al\.,[2025](https://arxiv.org/html/2606.10979#bib.bib9)\)\.

In general, classic DRL algorithms can be roughly divided into two categories:*value\-based*methods, including the deepQQ\-networks \(DQN\) and related variants\(Mnihet al\.,[2015](https://arxiv.org/html/2606.10979#bib.bib15)\), which approximate the optimal action\-value function by a neural networkQθ​\(s,a\)Q\_\{\\theta\}\(s,a\)and select actions according toat∈arg⁡maxa∈𝒜​\(st\)⁡Qθ​\(st,a\);a\_\{t\}\\in\\arg\\max\_\{a\\in\\mathcal\{A\}\(s\_\{t\}\)\}Q\_\{\\theta\}\(s\_\{t\},a\);and*policy\-based*or*actor\-critic*methods, including the proximal policy optimization \(PPO\) and related variants\(Schulmanet al\.,[2017](https://arxiv.org/html/2606.10979#bib.bib16)\), which directly parameterize a policy asπθ​\(a∣s\)\\pi\_\{\\theta\}\(a\\mid s\)and optimize the expected discounted returnJ​\(πθ;ν0\)J\(\\pi^\{\\theta\};\\nu\_\{0\}\)over the parameter space ofθ\\theta\(which is often Euclidean\) using stochastic gradient methods\(Suttonet al\.,[1998](https://arxiv.org/html/2606.10979#bib.bib10)\)\. In both categories, deep neural networks serve as flexible function approximators and have demonstrated strong empirical performance in a variety of benchmark problems\(Rajeswaranet al\.,[2018](https://arxiv.org/html/2606.10979#bib.bib1); Wurmanet al\.,[2022](https://arxiv.org/html/2606.10979#bib.bib4); Tanget al\.,[2025](https://arxiv.org/html/2606.10979#bib.bib9)\)\.

However, a common implicit assumption in standard DRL implementations is that they typically rely on a tractable action interface\. In discrete\-action settings, the action space is often a explicit finite action catalog of manageable size, and the neural network outputs one logit orQQ\-value for each candidate action\(Mnihet al\.,[2013](https://arxiv.org/html/2606.10979#bib.bib5),[2015](https://arxiv.org/html/2606.10979#bib.bib15)\)\. In continuous\-action settings, the action is often represented as a fixed\-dimensional Euclidean vector, and the policy network outputs the parameters of a probability distribution, or directly outputs the action vector itself\(Schulmanet al\.,[2017](https://arxiv.org/html/2606.10979#bib.bib16)\)\. Thus, the neural\-network architecture implicitly assumes that, once the current state is given, actions can be scored, sampled, or optimized over through a simple and fixed interface\.

### 2\.2Action Representations and Feasibility\-Preserving Policies

When the feasible\-action geometry becomes complex, or the feasible action sets vary across states, the standard feasibility mechanisms in DRL depend on how the action is represented\. A related stream of work modifies the action parameterization so that large action spaces can be handled by standard function approximation\. Action\-embedding methods learn a low\-dimensional representation of a large discrete action set and use this representation to generalize feedback across actions\(Dulac\-Arnoldet al\.,[2015](https://arxiv.org/html/2606.10979#bib.bib26); Chandaket al\.,[2019](https://arxiv.org/html/2606.10979#bib.bib25)\)\. Other methods reduce the output dimension by exploiting factorized or hybrid action structure: action\-branching architectures decompose high\-dimensional discrete actions into componentwise decisions\(Tavakoliet al\.,[2018](https://arxiv.org/html/2606.10979#bib.bib27)\), while parameterized\-action and hybrid\-action methods represent decisions through a fixed discrete action type together with continuous parameters\(Hausknecht and Stone,[2016](https://arxiv.org/html/2606.10979#bib.bib28); Liet al\.,[2021](https://arxiv.org/html/2606.10979#bib.bib29)\)\. These approaches are effective when the action space admits a fixed catalog, a fixed factorization, or a fixed hybrid template\. They are less directly applicable when feasible actions are state dependent, implicitly specified by capacity and integrality constraints, and too large to enumerate\.

Feasibility\-preserving mechanisms in DRL are also closely related\. Invalid\-action masking removes infeasible actions from a fixed discrete catalog before sampling from the policy\(Huang and Ontañón,[2022](https://arxiv.org/html/2606.10979#bib.bib30)\)\. Action\-elimination methods learn to discard invalid or undesirable actions, often using external elimination signals\(Zahavyet al\.,[2018](https://arxiv.org/html/2606.10979#bib.bib31)\)\. Projection and repair layers map an unconstrained action proposed by a neural network to a feasible action satisfying prescribed constraints\(Phamet al\.,[2018](https://arxiv.org/html/2606.10979#bib.bib32)\)\. Knowledge\-compilation approaches encode valid combinatorial actions symbolically and learn policies supported only on the valid action set\(Linget al\.,[2023](https://arxiv.org/html/2606.10979#bib.bib33)\)\. These methods are useful when feasibility can be represented through a tractable catalog, mask, projection, repair rule, or symbolic encoding\.

Another approach handles combinatorial actions by factorizing a system\-level decision into a sequence of smaller decisions\. Atomic\-action and autoregressive policies generate assignments one at a time, often using feasibility masks to rule out invalid local choices\(Sunet al\.,[2024](https://arxiv.org/html/2606.10979#bib.bib12)\)\. This idea avoids enumerating the full feasible action set and provides a scalable policy parameterization\. It is natural for matching and routing problems, where a large assignment decision can be decomposed into individual customer\-server matches\.

Bellman–Taylor score decoding provides an alternative solution strategy\. The latent variable in our framework is not an embedding of an action, nor is it a componentwise action proposal to be repaired\. It is a state\-dependent marginal\-value score motivated by a Taylor approximation of the action\-value function\. Given this score, the decoder selects an action by optimizing a Bellman\-motivated surrogate over the original state\-dependent feasible set\. Thus, feasibility and action ranking are handled jointly: the decoder enforces the operational constraints exactly while using the learned score to rank feasible actions according to their estimated downstream value\. This feature is convenient for MDPs whose feasible actions are naturally represented by optimization program or various constraints rather than by a fixed action catalog\.

### 2\.3Optimization\-Oriented RL for Combinatorial Actions

Our method is also connected to the area that combines RL with optimization\-based action selection for combinatorial actions\. In these methods, optimization is used not merely to repair infeasible actions, but to select actions from a structured feasible set\.

A family of approaches focuses on value\-based RL algorithms, where a learned action\-value function is embedded into a combinatorial action\-selection problem at each decision step\. For example,Delarueet al\.\([2020](https://arxiv.org/html/2606.10979#bib.bib35)\)formulate policy improvement as a mixed\-integer optimization problem over the learnedQQ\-function\.Harshaet al\.\([2025](https://arxiv.org/html/2606.10979#bib.bib36)\)combine neural value approximation with mathematical programming and sample\-average approximation to solve large state\-dependent inventory actions\.Xuet al\.\([2025](https://arxiv.org/html/2606.10979#bib.bib37)\)embed an action\-value network into a mixed\-integer program for coupled restless bandits\. These methods are highly expressive because the optimizer can directly exploit a nonlinear approximation of the downstream value function\. Compared with our approach, these methods are computationally heavy, especially when the value function is nonlinear or requires multiple scenario evaluations\. Greedy optimization over an approximate value function may also amplify errors, leading to unstable or suboptimal actions\.

Our work is also related to differentiable optimization layers and decision\-focused learning\. This line of work embeds an optimization problem into a learning pipeline and trains upstream models by differentiating, exactly or approximately, through the optimizer\. Examples include differentiable convex or quadratic optimization layers, predict\-then\-optimize losses, MIP layers, black\-box solver differentiation, and perturbed combinatorial optimizers\(Elmachtoub and Grigas,[2022](https://arxiv.org/html/2606.10979#bib.bib39); Mandiet al\.,[2024](https://arxiv.org/html/2606.10979#bib.bib40); Amos and Kolter,[2017](https://arxiv.org/html/2606.10979#bib.bib41); Agrawalet al\.,[2019](https://arxiv.org/html/2606.10979#bib.bib42); Ferberet al\.,[2020](https://arxiv.org/html/2606.10979#bib.bib43); Vlastelica Pogančićet al\.,[2020](https://arxiv.org/html/2606.10979#bib.bib44); Berthetet al\.,[2020](https://arxiv.org/html/2606.10979#bib.bib45)\)\. These methods are powerful when the downstream optimization problem is continuous or admits a useful differentiable surrogate, because they enable end\-to\-end training with gradients that directly reflect downstream decision quality\.

Bellman\-Taylor score decoding utilizes optimization in a different way\. The decoder is not a trainable differentiable layer, but a fixed action\-selection map inside MDP\. Given a latent score, it selects a feasible natural action over the original state\-dependent feasible set\. The score policy is trained from MDP return using likelihood\-ratio policy gradients, so the optimizer is used only in the forward pass and need not be differentiated through\. This distinction is important for operational MDPs with integer or combinatorial actions, where the exact decoder may be discontinuous and difficult to differentiate\.

Remarkably,Hoppeet al\.\([2025](https://arxiv.org/html/2606.10979#bib.bib34)\)is also close to our approach, which maps the state to objective coefficients and uses a linear optimization to select the feasible action per iteration\. Although the idea of algorithm design is similar, it requires a more specialized training pipeline\. The discrete optimization layer is generally non\-differentiable, so it relies on perturbation\-based or differentiable surrogate losses for learning\. As a result, their approach lacks theoretical performance guarantees and cannot be naturally integrated with off\-the\-shelf DRL algorithms, making it less straightforward to apply in practical operational MDPs\.

## 3Problem Background

### 3\.1Markov Decision Process

Markov decision process \(MDP\) is a preeminent modeling paradigm in sequential decision\-making under uncertainty\(Suttonet al\.,[1998](https://arxiv.org/html/2606.10979#bib.bib10)\)\. Many applications in operations research \(OR\), including inventory management, queue scheduling, and resource allocation, can be naturally formulated as MDPs to solve\. Throughout this paper, we focus on solving infinite\-horizon, discounted\-reward MDPs, while alternative formulations have been extensively studied in the literature\(Puterman,[2014](https://arxiv.org/html/2606.10979#bib.bib2)\)\.

In general, a discounted\-reward MDP is described by tupleℳ=\(𝒮,\{𝒜​\(s\)\}s∈𝒮,ℙ,r,γ,ν0\),\\mathcal\{M\}=\(\\mathcal\{S\},\\\{\\mathcal\{A\}\(s\)\\\}\_\{s\\in\\mathcal\{S\}\},\\mathbb\{P\},r,\\gamma,\\nu\_\{0\}\),where𝒮\\mathcal\{S\}is the state space and𝒜​\(s\)\\mathcal\{A\}\(s\)denotes the feasible action set at statess\. Note that𝒜​\(s\)\\mathcal\{A\}\(s\)may vary for different states, i\.e\., it is state\-dependent\.ℙ\(⋅∣s,a\)\\mathbb\{P\}\(\\cdot\\mid s,a\)is the transition kernel, which governs the state transition’s rule\. Lastly,r​\(s,a\)r\(s,a\)is the one\-period immediate reward,γ∈\(0,1\)\\gamma\\in\(0,1\)is the discount rate, andν0\\nu\_\{0\}is the distribution of initial state\. A stationary policyπ\\pisamples an actionat∈𝒜​\(s\)a\_\{t\}\\in\\mathcal\{A\}\(s\)for each statests\_\{t\}in periodtt, following a fixed probability distributionπ\(⋅\|st\)\\pi\(\\cdot\|s\_\{t\}\)supported on𝒜​\(st\)\\mathcal\{A\}\(s\_\{t\}\)\. Thus, under policyπ\\pi, starting from an statess, the expected accumulative discounted reward isVπ​\(s\)=𝔼π​\[∑t=0∞γt​r​\(st,at\)\|s0=s\]\.V^\{\\pi\}\(s\)=\\mathbb\{E\}^\{\\pi\}\[\\sum\_\{t=0\}^\{\\infty\}\\gamma^\{t\}r\(s\_\{t\},a\_\{t\}\)\|s\_\{0\}=s\]\.The decision\-maker’s objective is to seek the optimal stationary policy to maximize the objectiveJ\(π;ν0\)=𝔼s∼ν0\[Vπ\(s\)\]\}J\(\\pi;\\nu\_\{0\}\)=\\mathbb\{E\}\_\{s\\sim\\nu\_\{0\}\}\[V^\{\\pi\}\(s\)\]\\big\\\}overΠ\\Pi, the set of all stationary policies\. LetV∗​\(s\)=supπ∈ΠVπ​\(s\)V^\{\*\}\(s\)=\\sup\_\{\\pi\\in\\Pi\}V^\{\\pi\}\(s\)be the optimal value function\. The corresponding optimal action\-value function \(orQQ\-function\) is define asQ∗​\(s,a\)=r​\(s,a\)\+γ⋅𝔼s′∼ℙ\(⋅\|s,a\)​\[V∗​\(s′\)∣s,a\]\.Q^\{\*\}\(s,a\)=r\(s,a\)\+\\gamma\\cdot\\mathbb\{E\}\_\{s^\{\\prime\}\\sim\\mathbb\{P\}\(\\cdot\|s,a\)\}\\big\[V^\{\*\}\(s^\{\\prime\}\)\\mid s,a\\big\]\.

According to dynamic programming principle, it is well\-known thatV∗​\(s\)=maxa∈𝒜​\(s\)⁡Q∗​\(s,a\)V^\{\*\}\(s\)=\\max\_\{a\\in\\mathcal\{A\}\(s\)\}Q^\{\*\}\(s,a\)and the optimal policyπ∗\\pi^\{\*\}chooses action inargmaxa∈𝒜​\(s\)\{Q∗​\(s,a\)\}\\mathop\{\\mathrm\{argmax\}\}\_\{a\\in\\mathcal\{A\}\(s\)\}\\\{Q^\{\*\}\(s,a\)\\\}to take for eachss\. In addition, there exists a deterministic optimal policy that maximizesJ​\(π;ν0\)J\(\\pi;\\nu\_\{0\}\)\(Suttonet al\.,[1998](https://arxiv.org/html/2606.10979#bib.bib10)\)\.

It is useful to distinguish between MDPs with known and unknown transition dynamics\. In the former case, the transition kernelℙ\(⋅\|s,a\)\\mathbb\{P\}\(\\cdot\|s,a\)is explicitly specified, or at least a simulator is available to generate state transitions under any admissible action\. In the latter case, the decision\-maker only observes transition samples through interaction or from logged trajectories without knowingℙ\(⋅\|s,a\)\\mathbb\{P\}\(\\cdot\|s,a\)\. In this paper, we focus on the known\-transition \(or simulator\-available\) setting, which is common in many OR applications where the system dynamics and operational constraints are explicitly specified\.

### 3\.2Applying DRL to OR: Challenges from State\-Dependent Feasible Action Sets

In many OR models, feasible action sets have complex state\-dependent constraint geometry\. Feasible actions are not drawn from a fixed finite action catalog, but are instead defined implicitly through operational constraints, integrality restrictions, and combinatorial coupling\. For example, the feasible action set at statesscan be represented abstractly as

𝒜​\(s\)=\{a∈𝒜¯:gi​\(s,a\)≤0,i=1,…,m1,hj​\(s,a\)=0,j=1,…,m2\},\\mathcal\{A\}\(s\)=\\left\\\{a\\in\\mathcal\{\\bar\{A\}\}:g\_\{i\}\(s,a\)\\leq 0,\\ i=1,\\ldots,m\_\{1\},\\ \\ h\_\{j\}\(s,a\)=0,\\ j=1,\\ldots,m\_\{2\}\\right\\\},where𝒜¯\\mathcal\{\\bar\{A\}\}is an ambient mixed\-integer action domain\. The constraint functions encode the system’s capacity limits, balance relations, and compatibility conditions\. Furthermore, the action variables and feasible sets are often high\-dimensional, coupled, or combinatorial\. Even when individual action components are simple, the joint feasible set can be highly coupled and difficult to enumerate\. The number of feasible actions often grows rapidly with system size, rendering explicit enumeration prohibitive\.

These features do not align with standard action interfaces used by standard DRL implementations\. Value\-based methods require evaluating or maximizing over candidate actions, which is nontrivial when the feasible set cannot be explicitly enumerated\. Policy\-based methods typically parameterize distributions over simple action domains, such as fixed finite catalogs or Euclidean boxes, which is equally problematic when feasible actions form an irregularly constrained set\. Mechanisms such as action masking is not a general remedy because they still rely on the existence of an tractable explicit finite superset of candidate actions, which can be prohibitively large\. This limits the direct applicability of off\-the\-shelf DRL methods in OR\.

## 4Bellman\-Taylor Score Decoding

Many MDPs in OR applications admit a structural representation of the state transition\. Throughout this paper, we investigate a particular class of problems in which, for each statessand natural actiona∈𝒜​\(s\)a\\in\\mathcal\{A\}\(s\), the next states′s^\{\\prime\}can be represented ass′=Ξs​\(ϕs​\(a\),ξs\)s^\{\\prime\}=\\Xi\_\{s\}\(\\phi\_\{s\}\(a\),\\xi\_\{s\}\)\. Hereϕs:𝒜​\(s\)→𝒳s⊆ℝds\\phi\_\{s\}:\\mathcal\{A\}\(s\)\\to\\mathcal\{X\}\_\{s\}\\subseteq\\mathbb\{R\}^\{d\_\{s\}\}represents a post\-action configuration of system status, which captures an intermediate state immediately after the action is applied but before the uncertainty is realized\.ξs\\xi\_\{s\}is an exogenous disturbance whose distribution is known and may depend onss, andΞs​\(⋅,⋅\)\\Xi\_\{s\}\(\\cdot,\\cdot\)is a transition function with known analytical formula\. For notational consistency, we denote the one\-period reward asr​\(s,a\)=ψs​\(a\)r\(s,a\)=\\psi\_\{s\}\(a\), whereψs:𝒜​\(s\)→ℝ\\psi\_\{s\}:\\mathcal\{A\}\(s\)\\to\\mathbb\{R\}is a known state\-dependent map that quantifies the immediate contribution of actionaa\.

For instance, in a queueing\-control problem, the statessmay describe the current queue lengths and server availability, while the natural actiona∈𝒜​\(s\)a\\in\\mathcal\{A\}\(s\)specifies a dispatching or service\-allocation decision\. The post\-action stateϕs​\(a\)\\phi\_\{s\}\(a\)then denotes the queue configuration immediately after the action is implemented but before any new arrivals or service completions occur\. The disturbanceξs\\xi\_\{s\}represents the exogenous randomness during the period, such as random arrivals or service outcomes, andΞs​\(ϕs​\(a\),ξs\)\\Xi\_\{s\}\\bigl\(\\phi\_\{s\}\(a\),\\xi\_\{s\}\\bigr\)maps the post\-decision state and the realized randomness to the next state\. In such problems, the optimal action\-value function has additional structure beyond the generic MDP formulation, allowing for approximation via a Taylor expansion\. This approximation then induces a score\-based view of action selection\.

### 4\.1Motivation and Decoder\-Induced Score Policy

Given above structural transition representation, we define the continuation\-value function \(or the value\-to\-go function\) of post\-action configuration as

Gs∗​\(x\)=𝔼​\[V∗​\(Ξs​\(x,ξs\)\)\],x∈𝒳s,G^\{\*\}\_\{s\}\(x\)=\\mathbb\{E\}\\\!\\left\[V^\{\*\}\\bigl\(\\Xi\_\{s\}\(x,\\xi\_\{s\}\)\\bigr\)\\right\],\\ x\\in\\mathcal\{X\}\_\{s\},\(4\.1\)which denotes the expected optimal downstream reward when the system is placed at post\-decision configurationxxand the subsequent evolution is driven by the stochastic disturbanceξs\\xi\_\{s\}\. Then the optimalQQ\-function admits decompositionQ∗​\(s,a\)=ψs​\(a\)\+γ​Gs∗​\(ϕs​\(a\)\),a∈𝒜​\(s\)Q^\{\*\}\(s,a\)=\\psi\_\{s\}\(a\)\+\\gamma\\,G^\{\*\}\_\{s\}\\bigl\(\\phi\_\{s\}\(a\)\\bigr\),\\ a\\in\\mathcal\{A\}\(s\)\.

Since the optimal action at statessisargmaxa∈𝒜​\(s\)Q∗​\(s,a\)\\mathop\{\\mathrm\{argmax\}\}\_\{a\\in\\mathcal\{A\}\(s\)\}Q^\{\*\}\(s,a\), the difficulty to solve the MDP essentially lies in learning the continuation\-value functionGs∗​\(x\)G^\{\*\}\_\{s\}\(x\)\. To obtain a tractable surrogate forGs∗​\(ϕs​\(a\)\)G^\{\*\}\_\{s\}\\bigl\(\\phi\_\{s\}\(a\)\\bigr\), given a statess, we choose a reference actionaref​\(s\)∈𝒜​\(s\)a^\{\\text\{ref\}\}\(s\)\\in\\mathcal\{A\}\(s\)and assume the corresponding post\-action configuration isxref​\(s\)=ϕs​\(aref​\(s\)\)∈𝒳sx^\{\\text\{ref\}\}\(s\)=\\phi\_\{s\}\(a^\{\\text\{ref\}\}\(s\)\)\\in\\mathcal\{X\}\_\{s\}\. We then approximateGs∗​\(ϕs​\(a\)\)G^\{\*\}\_\{s\}\(\\phi\_\{s\}\(a\)\)locally via a first\-order Taylor expansion atxref​\(s\)x^\{\\text\{ref\}\}\(s\),

Gs∗​\(ϕs​\(a\)\)≈Gs∗​\(xref​\(s\)\)\+⟨∇Gs∗​\(xref​\(s\)\),ϕs​\(a\)−xref​\(s\)⟩\.\\displaystyle G^\{\*\}\_\{s\}\(\\phi\_\{s\}\(a\)\)\\approx G^\{\*\}\_\{s\}\(x^\{\\text\{ref\}\}\(s\)\)\+\\big\\langle\\nabla G^\{\*\}\_\{s\}\(x^\{\\text\{ref\}\}\(s\)\),\\phi\_\{s\}\(a\)\-x^\{\\text\{ref\}\}\(s\)\\big\\rangle\.\(4\.2\)Although the state, action, and post\-action configuration may be discrete, for analytical convenience, we still assume thatGs∗​\(x\)G^\{\*\}\_\{s\}\(x\)has a smooth extension aroundxref​\(s\)x^\{\\text\{ref\}\}\(s\)\. Thus, the gradient and Taylor expansion are well\-defined\. Substituting this expansion into the optimalQQ\-function gives

Q∗​\(s,a\)\\displaystyle Q^\{\*\}\(s,a\)≈γ​Gs∗​\(xref​\(s\)\)−γ​⟨∇Gs∗​\(xref​\(s\)\),xref​\(s\)⟩\+ψs​\(a\)\+⟨γ​∇Gs∗​\(xref​\(s\)\),ϕs​\(a\)⟩\\displaystyle\\approx\\gamma G^\{\*\}\_\{s\}\(x^\{\\text\{ref\}\}\(s\)\)\-\\gamma\\big\\langle\\nabla G^\{\*\}\_\{s\}\(x^\{\\text\{ref\}\}\(s\)\),x^\{\\text\{ref\}\}\(s\)\\big\\rangle\+\\psi\_\{s\}\(a\)\+\\big\\langle\\gamma\\nabla G^\{\*\}\_\{s\}\(x^\{\\text\{ref\}\}\(s\)\),\\phi\_\{s\}\(a\)\\big\\rangle:=α1​\(s\)\+ψs​\(a\)\+⟨z1∗​\(s\),ϕs​\(a\)⟩\.\\displaystyle:=\\alpha\_\{1\}\(s\)\+\\psi\_\{s\}\(a\)\+\\langle z\_\{1\}^\{\*\}\(s\),\\phi\_\{s\}\(a\)\\rangle\.Since the interceptα1​\(s\)\\alpha\_\{1\}\(s\)does not rely onaa, the optimal action of statesscan be approximated as

argmaxa∈𝒜​\(s\)Q∗​\(s,a\)≈argmaxa∈𝒜​\(s\)\{ψs​\(a\)\+⟨z1∗​\(s\),ϕs​\(a\)⟩\}\.\\displaystyle\\mathop\{\\mathrm\{argmax\}\}\_\{a\\in\\mathcal\{A\}\(s\)\}Q^\{\*\}\(s,a\)\\approx\\mathop\{\\mathrm\{argmax\}\}\_\{a\\in\\mathcal\{A\}\(s\)\}\\big\\\{\\psi\_\{s\}\(a\)\+\\langle z\_\{1\}^\{\*\}\(s\),\\phi\_\{s\}\(a\)\\rangle\\big\\\}\.\(4\.3\)
Equation \([4\.3](https://arxiv.org/html/2606.10979#S4.E3)\) suggests that the Bellman\-greedy action ranking can be approximated by a score\-based linear surrogate over the post\-action configuration\. This motivates us to restrict policy optimization in a class of decoder\-induced policies, where the policy outputs a latent marginal\-value score and the decoder maps this score to a feasible action by solving the surrogate maximization problem\. Specifically, given a latent scorez∈ℝdsz\\in\\mathbb\{R\}^\{d\_\{s\}\}, define the Bellman–Taylor action decoder

Γ​\(s,z\)∈argmaxa∈𝒜​\(s\)\{ψs​\(a\)\+⟨z,ϕs​\(a\)⟩\}∈𝒜​\(s\)\.\\displaystyle\\Gamma\(s,z\)\\in\\mathop\{\\mathrm\{argmax\}\}\_\{a\\in\\mathcal\{A\}\(s\)\}\\big\\\{\\psi\_\{s\}\(a\)\+\\langle z,\\phi\_\{s\}\(a\)\\rangle\\big\\\}\\in\\mathcal\{A\}\(s\)\.\(4\.4\)We aim to optimize a stationary policyπ~\\widetilde\{\\pi\}that mapssstozz, and then recovers the original natural actionaathrough \([4\.4](https://arxiv.org/html/2606.10979#S4.E4)\)\. In this sense, the original MDP’s action is transformed: the decision variable is no longer the natural actionaaitself, but a score vectorzzliving in an Euclidean spaceℝds\\mathbb\{R\}^\{d\_\{s\}\}without any restrictions\.

Through this construction, action feasibility is enforced exactly by the decoder, where the MDP policy does not need to handle the irregular geometry of the original action space directly\. As a result, policy optimization is moved to a fixed Euclidean score space\. A standard continuous\-action policy can generate latent scores, while the decoder converts each score into an implementable natural action before interacting with the original MDP\.

We emphasize that the reference post\-action configurationxref​\(s\)x^\{\\text\{ref\}\}\(s\)is used to motivate and analyze the Taylor expansion only\. It does not enter the construction of decoderΓ​\(s,z\)\\Gamma\(s,z\)\. Thus, our method can be implemented without specifying axref​\(s\)x^\{\\text\{ref\}\}\(s\)\. Also note that optimizing the latent score policy is generally not equivalent to optimizing over all stationary policies of the original MDP, because the decoder\-induced score policy space is restricted\. Nevertheless, as long as the remainder terms of Taylor expansion \([4\.2](https://arxiv.org/html/2606.10979#S4.E2)\) can be controlled, the induced action ranking error can be controlled, leading to a bounded optimality gap\. In Section[4\.4](https://arxiv.org/html/2606.10979#S4.SS4), we make this statement precise\.

Lastly, we discuss some costs of this method\. First, each action selection now requires solving the decoder optimization problem, which introduces additional computational overhead relative to directly parameterized policies\. Second, the latent score MDP is only approximately aligned with the original optimization, since it is induced by an first\-order Taylor approximation of the continuation\-value function\. The optimality gap depends on how accurately the induced surrogate preserves the Bellman action ranking\. To mitigate these difficulties, note that many OR problems admit well\-performed benchmark heuristics, which can be used to warm\-start the DRL training via various techniques like behavior cloning\. This may reduce the iteration numbers needed for training\. In addition, to handle continuation\-value functions with large curvature, we generalize above method via a higher\-order Taylor expansion in Section[4\.3](https://arxiv.org/html/2606.10979#S4.SS3), yielding a richer decoder\-induced score policy family that better aligns with the original problem\.

### 4\.2Solving Latent Score MDP via PPO

Let𝒵⊆ℝds\\mathcal\{Z\}\\subseteq\\mathbb\{R\}^\{d\_\{s\}\}be the latent score space\. Given the action decoderΓ​\(s,z\)\\Gamma\(s,z\)\([4\.4](https://arxiv.org/html/2606.10979#S4.E4)\), we define the decoder\-induced MDPℳ~=\(𝒮,𝒵,P~,r~,γ,ν0\)\\widetilde\{\\mathcal\{M\}\}=\(\\mathcal\{S\},\\mathcal\{Z\},\\widetilde\{P\},\\widetilde\{r\},\\gamma,\\nu\_\{0\}\), where the state space remains𝒮\\mathcal\{S\}and the action space is now𝒵\\mathcal\{Z\}\. For each state\-score pair\(s,z\)\(s,z\), the decoder first selects a feasible natural actiona=Γ​\(s,z\)∈𝒜​\(s\)a=\\Gamma\(s,z\)\\in\\mathcal\{A\}\(s\), and the induced reward and transition kernel arer~​\(s,z\)=r​\(s,Γ​\(s,z\)\)\\widetilde\{r\}\(s,z\)=r\(s,\\Gamma\(s,z\)\)andℙ~\(⋅∣s,z\)=ℙ\(⋅∣s,Γ\(s,z\)\)\\widetilde\{\\mathbb\{P\}\}\(\\cdot\\mid s,z\)=\\mathbb\{P\}\(\\cdot\\mid s,\\Gamma\(s,z\)\)respectively\. If the decoder has multiple maximizers, we choose an arbitrary tie\-breaking rule\. A stationary score policyπ~\\widetilde\{\\pi\}sampleszt∼π~\(⋅∣st\)z\_\{t\}\\sim\\widetilde\{\\pi\}\(\\cdot\\mid s\_\{t\}\)first and executes the decoded natural actionat=Γ​\(st,zt\)a\_\{t\}=\\Gamma\(s\_\{t\},z\_\{t\}\)\. Its expected accumulative discounted value isV~π~​\(s\)=𝔼π~​\[∑t=0∞γt​r~​\(st,zt\)\|s0=s\]\.\\widetilde\{V\}^\{\\widetilde\{\\pi\}\}\(s\)=\\mathbb\{E\}^\{\\widetilde\{\\pi\}\}\[\\sum\_\{t=0\}^\{\\infty\}\\gamma^\{t\}\\widetilde\{r\}\(s\_\{t\},z\_\{t\}\)\|s\_\{0\}=s\]\.

To optimize MDPℳ~\\widetilde\{\\mathcal\{M\}\}, we essentially need to solve a continuous\-action control problem, where PPO is a standard approach\. In PPO, the policy is parameterized asπθ​\(z\|s\)\\pi\_\{\\theta\}\(z\|s\), whereθ\\thetais optimized via gradient\-based methods\. Although the decoderΓ​\(s,z\)\\Gamma\(s,z\)is an optimization program which can be nonsmooth inzz, applying PPO to solveℳ~\\widetilde\{\\mathcal\{M\}\}does not require differentiating throughΓ​\(s,z\)\\Gamma\(s,z\)\. The policy gradient only involves∇θlog⁡πθ​\(z∣s\)\\nabla\_\{\\theta\}\\log\\pi\_\{\\theta\}\(z\\mid s\)rather than derivative of action decoder\. This is an important distinction from differentiable optimization\-layer approaches\. We call this procedure to solve the original MDP Bellman\-Taylor Score Decoding PPO \(BTSD\-PPO\) algorithm\.

Beyond PPO, the decoder\-induced MDPℳ~\\widetilde\{\\mathcal\{M\}\}can be solved by other DRL algorithms as well\. Any method applicable for continuous\-action control can be applied directly\. Value\-based methods such as DQN may also apply with an appropriate discretization of the latent score space to generate a finite action set\. To be focused, the subsequent discussions are restricted to PPO\. In the numerical study, we also test other DRL algorithms to demonstrate that the empirical benefit is primarily due to the Bellman\-Taylor score decoding, rather than a particular optimizer\.

### 4\.3Generalization: Higher\-Order Action Decoder

The first\-order Bellman\-Taylor action decoder constructed above relies on a linear approximation of the continuation\-value function, which is effective when the continuation effect is locally close to linear around the reference post\-decision point\. However, if the continuation\-value function exhibits non\-negligible curvature, then a first\-order Taylor expansion may no longer capture the Bellman action ranking with sufficient accuracy\. In such cases, it is natural to retain higher\-order terms in the Taylor expansion, thereby enriching the features used by the action decoder\.

We next introduce some notations to ease presentation\. Fix an integerK≥1K\\geq 1, for a multi\-indexα=\(α1,…,αds\)∈ℕds\\alpha=\(\\alpha\_\{1\},\\ldots,\\alpha\_\{d\_\{s\}\}\)\\in\\mathbb\{N\}^\{d\_\{s\}\}, let\|α\|=∑i=1dsαi\|\\alpha\|=\\sum\_\{i=1\}^\{d\_\{s\}\}\\alpha\_\{i\}andα\!:=∏i=1dsαi\!\\alpha\!:=\\prod\_\{i=1\}^\{d\_\{s\}\}\\alpha\_\{i\}\!\. For anyx∈𝒳sx\\in\\mathcal\{X\}\_\{s\}, define

\(x−xref​\(s\)\)α=∏i=1ds\(\[x\]i−\[xref​\(s\)\]i\)αi=∏i=1ds\(\[x\]i−\[ϕs​\(aref​\(s\)\)\]i\)αi,\\bigl\(x\-x^\{\\mathrm\{ref\}\}\(s\)\\bigr\)^\{\\alpha\}=\\prod\_\{i=1\}^\{d\_\{s\}\}\\bigl\(\[x\]\_\{i\}\-\[x^\{\\mathrm\{ref\}\}\(s\)\]\_\{i\}\\bigr\)^\{\\alpha\_\{i\}\}=\\prod\_\{i=1\}^\{d\_\{s\}\}\\bigl\(\[x\]\_\{i\}\-\[\\phi\_\{s\}\{\(a^\{\\mathrm\{ref\}\}\(s\)\)\}\]\_\{i\}\\bigr\)^\{\\alpha\_\{i\}\},where\[a\]i\[a\]\_\{i\}denotes theii\-th coordinate of a vectoraa\. Under regularity conditions, the continuation\-value function admits theKK\-th order Taylor expansion

Gs∗​\(ϕs​\(a\)\)≈Gs∗​\(xref​\(s\)\)\+∑1≤\|α\|≤KDα​Gs∗​\(xref​\(s\)\)α\!​\(ϕs​\(a\)−xref​\(s\)\)α,G^\{\*\}\_\{s\}\\bigl\(\\phi\_\{s\}\(a\)\\bigr\)\\approx G^\{\*\}\_\{s\}\\bigl\(x^\{\\mathrm\{ref\}\}\(s\)\\bigr\)\+\\sum\_\{1\\leq\|\\alpha\|\\leq K\}\\frac\{D^\{\\alpha\}G^\{\*\}\_\{s\}\\bigl\(x^\{\\mathrm\{ref\}\}\(s\)\\bigr\)\}\{\\alpha\!\}\\bigl\(\\phi\_\{s\}\(a\)\-x^\{\\mathrm\{ref\}\}\(s\)\\bigr\)^\{\\alpha\},\(4\.5\)whereDα​Gs∗​\(⋅\)D^\{\\alpha\}G^\{\*\}\_\{s\}\(\\cdot\)denotes theα\\alpha\-th order partial derivative of functionGs∗​\(⋅\)G^\{\*\}\_\{s\}\(\\cdot\)\. If we define the lifted feature map

Fs,K​\(a\):=\(\(ϕs​\(a\)−xref​\(s\)\)αα\!\)1≤\|α\|≤K,F\_\{s,K\}\(a\):=\\left\(\\frac\{\\bigl\(\\phi\_\{s\}\(a\)\-x^\{\\mathrm\{ref\}\}\(s\)\\bigr\)^\{\\alpha\}\}\{\\alpha\!\}\\right\)\_\{1\\leq\|\\alpha\|\\leq K\},whose dimension isms,K=∑k=1K\(ds\+k−1k\)m\_\{s,K\}=\\sum\_\{k=1\}^\{K\}\\binom\{d\_\{s\}\+k\-1\}\{k\}, and substitute the expansion \([4\.5](https://arxiv.org/html/2606.10979#S4.E5)\) into the optimalQQ\-function, we obtain

Q∗​\(s,a\)≈αK​\(s\)\+ψs​\(a\)\+⟨zK∗​\(s\),Fs,K​\(a\)⟩,Q^\{\*\}\(s,a\)\\approx\\alpha\_\{K\}\(s\)\+\\psi\_\{s\}\(a\)\+\\langle z\_\{K\}^\{\*\}\(s\),F\_\{s,K\}\(a\)\\rangle,whereαK​\(s\)\\alpha\_\{K\}\(s\)is an intercept term independent ofaa, andzK∗​\(s\)=γ​\(Dα​Gs∗​\(xref​\(s\)\)\)1≤\|α\|≤K∈ℝms,K\.z\_\{K\}^\{\*\}\(s\)=\\gamma\(D^\{\\alpha\}G^\{\*\}\_\{s\}\(x^\{\\mathrm\{ref\}\}\(s\)\)\)\_\{1\\leq\|\\alpha\|\\leq K\}\\in\\mathbb\{R\}^\{m\_\{s,K\}\}\.Hence, the optimal action corresponding to statesscan be locally approximated by

argmaxa∈𝒜​\(s\)Q∗​\(s,a\)≈argmaxa∈𝒜​\(s\)\{ψs​\(a\)\+⟨zK∗​\(s\),Fs,K​\(a\)⟩\}\.\\mathop\{\\mathrm\{argmax\}\}\_\{a\\in\\mathcal\{A\}\(s\)\}Q^\{\*\}\(s,a\)\\approx\\mathop\{\\mathrm\{argmax\}\}\_\{a\\in\\mathcal\{A\}\(s\)\}\\left\\\{\\psi\_\{s\}\(a\)\+\\langle z\_\{K\}^\{\*\}\(s\),F\_\{s,K\}\(a\)\\rangle\\right\\\}\.
As a result, under theKK\-th order Taylor expansion, the Bellman\-greedy action ranking is determined by a higher\-order score vectorzK∗​\(s\)z\_\{K\}^\{\*\}\(s\)\. Instead of optimizing over the original natural action space, it suffices to learn the map fromsstozK∗​\(s\)z\_\{K\}^\{\*\}\(s\)\. For a latent score vectorz∈ℝms,Kz\\in\\mathbb\{R\}^\{m\_\{s,K\}\}, define the higher\-order Bellman\-Taylor decoder

ΓK​\(s,z\)=arg⁡maxa∈𝒜⁡\{ψs​\(a\)\+⟨z,Fs,K​\(a\)⟩\}∈𝒜​\(s\)\.\\Gamma\_\{K\}\(s,z\)=\\arg\\max\_\{a\\in\\mathcal\{A\}\}\\left\\\{\\psi\_\{s\}\(a\)\+\\langle z,F\_\{s,K\}\(a\)\\rangle\\right\\\}\\in\\mathcal\{A\}\(s\)\.\(4\.6\)Then we obtain an optimization on the Euclidean spaceℝms,K\\mathbb\{R\}^\{m\_\{s,K\}\}whereΓK​\(s,z\)\\Gamma\_\{K\}\(s,z\)maps latent score to the natural action\.

The induced reparameterized MDP is defined analogously by replacing the first\-order decoderΓ\\GammawithΓK\\Gamma\_\{K\}\. Specifically, for each states∈𝒮s\\in\\mathcal\{S\}and latent scorez∈ℝms,Kz\\in\\mathbb\{R\}^\{m\_\{s,K\}\}, definer~K​\(s,z\):=r​\(s,ΓK​\(s,z\)\)\\widetilde\{r\}\_\{K\}\(s,z\):=r\(s,\\Gamma\_\{K\}\(s,z\)\)andℙ~K\(⋅∣s,z\):=ℙ\(⋅∣s,ΓK\(s,z\)\)\\widetilde\{\\mathbb\{P\}\}\_\{K\}\(\\cdot\\mid s,z\):=\\mathbb\{P\}\(\\cdot\\mid s,\\Gamma\_\{K\}\(s,z\)\)\. A latent\-score policy on the corresponding Euclidean score space then induces a policy on the original natural\-action space exactly as in the first\-order case\. All subsequent constructions, including the value function, the policy\-gradient update, and the PPO implementation, extend verbatim after replacingϕs​\(a\)\\phi\_\{s\}\(a\)withFs,K​\(a\)F\_\{s,K\}\(a\)andΓ\\GammawithΓK\\Gamma\_\{K\}\.

In practice, we can reorganize the Taylor expansion \([4\.5](https://arxiv.org/html/2606.10979#S4.E5)\) into an uncentered polynomial basis such that the constructed action decoder does not rely on reference actionxref​\(s\)x^\{\\mathrm\{ref\}\}\(s\)explicitly\. By the multi\-index binomial identity,

\(ϕs​\(a\)−xref​\(s\)\)α=∑0≤β≤α\(αβ\)​ϕs​\(a\)β​\(−xref​\(s\)\)α−β,\\bigl\(\\phi\_\{s\}\(a\)\-x^\{\\mathrm\{ref\}\}\(s\)\\bigr\)^\{\\alpha\}=\\sum\_\{0\\leq\\beta\\leq\\alpha\}\\binom\{\\alpha\}\{\\beta\}\\phi\_\{s\}\(a\)^\{\\beta\}\\bigl\(\-x^\{\\mathrm\{ref\}\}\(s\)\\bigr\)^\{\\alpha\-\\beta\},there exist a state\-dependent interceptα~K​\(s\)\\widetilde\{\\alpha\}\_\{K\}\(s\)and a coefficient vectorz~K∗​\(s\)∈ℝms,K\\widetilde\{z\}\_\{K\}^\{\*\}\(s\)\\in\\mathbb\{R\}^\{m\_\{s,K\}\}such that

Q∗​\(s,a\)≈α~K​\(s\)\+ψs​\(a\)\+⟨z~K∗​\(s\),F~s,K​\(a\)⟩,Q^\{\*\}\(s,a\)\\approx\\widetilde\{\\alpha\}\_\{K\}\(s\)\+\\psi\_\{s\}\(a\)\+\\big\\langle\\widetilde\{z\}\_\{K\}^\{\*\}\(s\),\\widetilde\{F\}\_\{s,K\}\(a\)\\big\\rangle,where the uncentered lifted feature map is defined byF~s,K​\(a\)=\(ϕs​\(a\)β\)1≤\|β\|≤K\.\\widetilde\{F\}\_\{s,K\}\(a\)=\(\{\\phi\_\{s\}\(a\)^\{\\beta\}\}\)\_\{1\\leq\|\\beta\|\\leq K\}\.Accordingly, one may equivalently define the higher\-order Bellman\-Taylor decoder in the uncentered form

Γ~K​\(s,z\)=argmaxa∈𝒜​\(s\)\{ψs​\(a\)\+⟨z,F~s,K​\(a\)⟩\}\.\\widetilde\{\\Gamma\}\_\{K\}\(s,z\)=\\mathop\{\\mathrm\{argmax\}\}\_\{a\\in\\mathcal\{A\}\(s\)\}\\big\\\{\\psi\_\{s\}\(a\)\+\\langle z,\\widetilde\{F\}\_\{s,K\}\(a\)\\rangle\\big\\\}\.This representation absorbs the reference pointxref​\(s\)x^\{\\mathrm\{ref\}\}\(s\)into the state\-dependent coefficients and removes its explicit appearance from the decoder objective\. Thus, same as the first\-order method, we do not need to specify the choice of reference post\-action configurationxref​\(s\)x^\{\\text\{ref\}\}\(s\)in algorithm\.

The higher\-order action decoder therefore involves a clear tradeoff\. Retaining more terms in the local Taylor expansion yields a richer surrogate for the continuation\-value function, and can better preserve the Bellman action ranking when local curvature is large\. However, this gain comes at a computational price\. As the expansion orderKKincreases, the lifted feature dimensionms,Km\_\{s,K\}grows rapidly, so the action space of the score\-induced MDP is high\-dimensional, the policy optimization becomes more expensive, and the action decoder must be evaluated on a larger feature set\. In many OR problems of interest, however, the dominant structural effect is already captured by the first\-order marginal continuation value, especially when the state transition exhibits certain affine structure\. For this reason, the first\-order decoder often provides a favorable balance between approximation quality, computational tractability, and ease of implementation, and will be our main focus in the numerical study\.

### 4\.4Performance Analysis

In this section, we analyze the performance of Bellman\-Taylor score decoding\. To ease presentation, we focus on the first\-order decoder and the analysis for higher order case is similar\. Letz^1​\(s\)\\widehat\{z\}\_\{1\}\(s\)be the policy trained by PPO to solve the decoder\-induced MDPℳ~\\widetilde\{\\mathcal\{M\}\}\. When a stochastic policy is adopted,z^1​\(s\)\\widehat\{z\}\_\{1\}\(s\)becomes a random variable\.z^1​\(s\)\\widehat\{z\}\_\{1\}\(s\)induces a policyπ^1\\widehat\{\\pi\}\_\{1\}to the original MDPℳ\\mathcal\{M\}that maps statessto a natural actionaathrough the decoderπ^1​\(s\)=Γ​\(s,z^1​\(s\)\)\.\\widehat\{\\pi\}\_\{1\}\(s\)=\\Gamma\(s,\\widehat\{z\}\_\{1\}\(s\)\)\.LetΠΓ\\Pi\_\{\\Gamma\}be the decoder\-induced policy class to MDPℳ\\mathcal\{M\}\. Also define the best achievable value within the decoder\-induced class byJΓ∗​\(ν0\)=supπ∈ΠΓJ​\(π;ν0\)\.J\_\{\\Gamma\}^\{\*\}\(\\nu\_\{0\}\)=\\sup\_\{\\pi\\in\\Pi\_\{\\Gamma\}\}J\(\\pi;\\nu\_\{0\}\)\.If the supremum is attained, we denote the maximizer byπΓ∗\\pi\_\{\\Gamma\}^\{\*\}, so thatJΓ∗​\(ν0\)=J​\(πΓ∗;ν0\)J\_\{\\Gamma\}^\{\*\}\(\\nu\_\{0\}\)=J\(\\pi\_\{\\Gamma\}^\{\*\};\\nu\_\{0\}\)\. Our objective is to boundJ​\(π∗;ν0\)−Jν0​\(π^1;ν0\)J\(\\pi^\{\*\};\\nu\_\{0\}\)\-J\_\{\\nu\_\{0\}\}\(\\widehat\{\\pi\}\_\{1\};\\nu\_\{0\}\), whereπ∗\\pi^\{\*\}is the optimal policy to the original MDPℳ\\mathcal\{M\}\. Throughout this section, we assume that all value functions are finite and that the Bellman maximizer and decoder maximizer admit measurable selections under the fixed tie\-breaking rule used by the decoder\. We fist impose an assumption to quantify the performance of PPO algorithm\.

###### Assumption 4\.1\.

There exists some constantℰlearn≥0\\mathcal\{E\}\_\{\\rm learn\}\\geq 0such thatJΓ∗​\(ν0\)−J​\(π^1;ν0\)≤ℰlearn\.J\_\{\\Gamma\}^\{\*\}\(\\nu\_\{0\}\)\-J\(\\widehat\{\\pi\}\_\{1\};\\nu\_\{0\}\)\\leq\\mathcal\{E\}\_\{\\rm learn\}\.

Assumption[4\.1](https://arxiv.org/html/2606.10979#S4.Thmtheorem1)measures how well PPO solves the latent\-score MDP relative to the best policy representable by the decoder\-induced class\. In practice,ℰlearn\\mathcal\{E\}\_\{\\rm learn\}includes finite\-sample error, function\-approximation error, and optimization error, etc\. The convergence guarantee of PPO algorithm has been established inLiuet al\.\([2019](https://arxiv.org/html/2606.10979#bib.bib3)\)and the references therein, which supports the rationale of this assumption\.

For any measurable deterministic score mapz​\(⋅\):𝒮→Zz\(\\cdot\):\\mathcal\{S\}\\to Z, letπz\\pi^\{z\}denote the deterministic decoded policyπz​\(s\)=Γ​\(s,z​\(s\)\)\.\\pi^\{z\}\(s\)=\\Gamma\(s,z\(s\)\)\.For each statessand actionaa, define the Bellman residual after subtracting the linear scorezzbyRsz​\(a\)=γ​Gs∗​\(ϕs​\(a\)\)−⟨z,ϕs​\(a\)⟩R\_\{s\}^\{z\}\(a\)=\\gamma G\_\{s\}^\{\*\}\(\\phi\_\{s\}\(a\)\)\-\\langle z,\\phi\_\{s\}\(a\)\\rangleand define its oscillation over the feasible action set𝒜​\(s\)\\mathcal\{A\}\(s\)by

Ωs​\(z\)=osca∈𝒜​\(s\)⁡\[Rsz​\(a\)\]=supa∈𝒜​\(s\)Rsz​\(a\)−infa∈𝒜​\(s\)Rsz​\(a\)\.\\Omega\_\{s\}\(z\)=\\operatorname\{osc\}\_\{a\\in\\mathcal\{A\}\(s\)\}\\big\[R\_\{s\}^\{z\}\(a\)\\big\]=\\sup\_\{a\\in\\mathcal\{A\}\(s\)\}R\_\{s\}^\{z\}\(a\)\-\\inf\_\{a\\in\\mathcal\{A\}\(s\)\}R\_\{s\}^\{z\}\(a\)\.It easy to check thatΩs​\(z\)=2​infc∈ℝsupa∈𝒜​\(s\)\|γ​Gs∗​\(ϕs​\(a\)\)−c−⟨z,ϕs​\(a\)⟩\|\\Omega\_\{s\}\(z\)=2\\inf\_\{c\\in\\mathbb\{R\}\}\\sup\_\{a\\in\\mathcal\{A\}\(s\)\}\\left\|\\gamma G\_\{s\}^\{\*\}\(\\phi\_\{s\}\(a\)\)\-c\-\\langle z,\\phi\_\{s\}\(a\)\\rangle\\right\|\. Thus,Ωs​\(z\)\\Omega\_\{s\}\(z\)measures the maximal remaining variation in the Bellman continuation value atssthat cannot be explained by the linear scorezz\. We have the following theorem\.

###### Theorem 4\.2\.

Suppose Assumption[4\.1](https://arxiv.org/html/2606.10979#S4.Thmtheorem1)holds\. Then

J​\(π∗;ν0\)−J​\(π^1;ν0\)≤infz​\(⋅\):𝒮→Z\{11−γ​𝔼s∼dν0,γπz​\[Ωs​\(z​\(s\)\)\]\}\+ℰlearn\.J\(\\pi^\{\*\};\\nu\_\{0\}\)\-J\(\\widehat\{\\pi\}\_\{1\};\\nu\_\{0\}\)\\leq\\inf\_\{z\(\\cdot\):\\mathcal\{S\}\\to Z\}\\left\\\{\\frac\{1\}\{1\-\\gamma\}\\mathbb\{E\}\_\{s\\sim d\_\{\\nu\_\{0\},\\gamma\}^\{\\pi^\{z\}\}\}\\left\[\\Omega\_\{s\}\(z\(s\)\)\\right\]\\right\\\}\+\\mathcal\{E\}\_\{\\mathrm\{learn\}\}\.

Theorem[4\.2](https://arxiv.org/html/2606.10979#S4.Thmtheorem2)decomposes the optimality gap of policyπ^1\\widehat\{\\pi\}\_\{1\}into two terms\. The first term is a structural approximation error: it measures how much Bellman continuation\-value variation remains after the best state\-dependent linear score is removed\. The second term is the algorithmic learning error incurred when PPO approximately solves the decoder\-induced MDP\.

The oscillation bound also gives a direct condition for exact Bellman action agreement\. For each statess, define the Bellman action gap

Δs=V∗​\(s\)−supa∈𝒜​\(s\)\{Q∗​\(s,a\):Q∗​\(s,a\)<V∗​\(s\)\}\.\\Delta\_\{s\}=V^\{\*\}\(s\)\-\\sup\_\{a\\in\\mathcal\{A\}\(s\)\}\\big\\\{Q^\{\*\}\(s,a\):\\ Q^\{\*\}\(s,a\)<V^\{\*\}\(s\)\\big\\\}\.Fix a statessand a scorez∈Zz\\in Z\. IfΩs​\(z\)<Δs\\Omega\_\{s\}\(z\)<\\Delta\_\{s\}, then every action selected by the decoder is Bellman optimal, i\.e\.,Γ​\(s,z\)∈arg⁡maxa∈𝒜​\(s\)⁡Q∗​\(s,a\)\.\\Gamma\(s,z\)\\in\\arg\\max\_\{a\\in\\mathcal\{A\}\(s\)\}Q^\{\*\}\(s,a\)\.This shows that an exact affine representation ofGs∗​\(⋅\)G\_\{s\}^\{\*\}\(\\cdot\)is not necessary for correct action selection; it is enough for the residual oscillation to be smaller than the Bellman action gap\. Thus, the residual oscillation term in Theorem[4\.2](https://arxiv.org/html/2606.10979#S4.Thmtheorem2)is a worst\-case certificate of statewise Bellman\-ranking error; it may be nonzero without affecting action selection whenever it remains below the local Bellman action gap\.

Theorem[4\.2](https://arxiv.org/html/2606.10979#S4.Thmtheorem2)does not require a smooth and continuous extension ofGs∗​\(⋅\)G^\{\*\}\_\{s\}\(\\cdot\)and therefore applies to finite, discrete, or continuous feasible action sets\. However, when additional structure is available, the residual oscillation term can be characterized more explicitly as two examples below\.

Smooth Interpolant and Curvature\-Radius Characterization\.For each statess, letΦs=\{ϕs​\(a\):a∈𝒜​\(s\)\}\\Phi\_\{s\}=\\\{\\phi\_\{s\}\(a\):a\\in\\mathcal\{A\}\(s\)\\\}be the attainable post\-action set\. In a discrete MDP, the original model only defines the continuation value onΦs\\Phi\_\{s\}\. Assume the continuation value function admits a low\-curvature smooth interpolant\.

###### Assumption 4\.3\.

For each statess, there exist an open convex setUs⊆ℝdsU\_\{s\}\\subseteq\\mathbb\{R\}^\{d\_\{s\}\}withconv⁡\(Φs\)⊆Us,\\operatorname\{conv\}\(\\Phi\_\{s\}\)\\subseteq U\_\{s\},and a twice continuously differentiable functionG¯s∗:Us→ℝ\\bar\{G\}\_\{s\}^\{\*\}:U\_\{s\}\\to\\mathbb\{R\}such thatG¯s∗​\(x\)=Gs∗​\(x\),∀x∈Φs\\bar\{G\}\_\{s\}^\{\*\}\(x\)=G\_\{s\}^\{\*\}\(x\),\\ \\forall x\\in\\Phi\_\{s\}andsupx∈conv⁡\(Φs\)‖∇2G¯s∗​\(x\)‖op≤L2​\(s\)\.\.\\sup\_\{x\\in\\operatorname\{conv\}\(\\Phi\_\{s\}\)\}\\left\\\|\\nabla^\{2\}\\bar\{G\}\_\{s\}^\{\*\}\(x\)\\right\\\|\_\{\\mathrm\{op\}\}\\leq L\_\{2\}\(s\)\.\.

The smooth interpolant is an analytical device used to certify that Bellman continuation value function has bounded curvature on the relevant post\-action geometry\. For any reference mapx​\(⋅\)x\(\\cdot\)satisfyingx​\(s\)∈conv⁡\(Φs\)x\(s\)\\in\\operatorname\{conv\}\(\\Phi\_\{s\}\), definezx​\(s\)=γ​∇G¯s∗​\(x​\(s\)\)\.z\_\{x\}\(s\)=\\gamma\\nabla\\bar\{G\}\_\{s\}^\{\*\}\(x\(s\)\)\.We have the following proposition\.

###### Proposition 4\.4\.

Suppose Assumption[4\.3](https://arxiv.org/html/2606.10979#S4.Thmtheorem3)holds\. Then

Ωs​\(zx​\(s\)\)≤γ​L2​\(s\)​supa∈𝒜​\(s\)‖ϕs​\(a\)−x​\(s\)‖22\.\\Omega\_\{s\}\(z\_\{x\}\(s\)\)\\leq\\gamma L\_\{2\}\(s\)\\sup\_\{a\\in\\mathcal\{A\}\(s\)\}\\\|\\phi\_\{s\}\(a\)\-x\(s\)\\\|\_\{2\}^\{2\}\.

Proposition[4\.4](https://arxiv.org/html/2606.10979#S4.Thmtheorem4)shows that the residual oscillation is a second\-order local quantity\. It is bounded by the local curvature of the post\-action continuation interpolant multiplied by the squared worst\-case deviation of feasible post\-action configurations from the reference point\.

Exact Finite\-Action Characterization\.Theorem[4\.2](https://arxiv.org/html/2606.10979#S4.Thmtheorem2)admits further simplification when feasible actions are finite for each state\. Specifically, fix a statessand suppose𝒜​\(s\)=\{a1,…,an\}\.\\mathcal\{A\}\(s\)=\\\{a\_\{1\},\\ldots,a\_\{n\}\\\}\.We denote byxi=ϕs​\(ai\)∈ℝdsx\_\{i\}=\\phi\_\{s\}\(a\_\{i\}\)\\in\\mathbb\{R\}^\{d\_\{s\}\}andyi=γ​Gs∗​\(xi\)y\_\{i\}=\\gamma G\_\{s\}^\{\*\}\(x\_\{i\}\)\. Lety=\(y1,…,yn\)⊤y=\(y\_\{1\},\\ldots,y\_\{n\}\)^\{\\top\}and the augmented feature matrixBs=\(\(1,x1⊤\),⋯,\(1,xn⊤\)\)⊤∈ℝn×\(ds\+1\)\.B\_\{s\}=\(\(1,x^\{\\top\}\_\{1\}\),\\cdots,\(1,x^\{\\top\}\_\{n\}\)\)^\{\\top\}\\in\\mathbb\{R\}^\{n\\times\(d\_\{s\}\+1\)\}\.We also define the optimal statewise oscillationΩ1∗​\(s\)=infz∈ℝdsosci=1,…,n⁡\[yi−z⊤​xi\]\.\\Omega\_\{1\}^\{\*\}\(s\)=\\inf\_\{z\\in\\mathbb\{R\}^\{d\_\{s\}\}\}\\operatorname\{osc\}\_\{i=1,\\ldots,n\}\[y\_\{i\}\-z^\{\\top\}x\_\{i\}\]\.

###### Proposition 4\.5\.

LetΔn\\Delta\_\{n\}be the probability simplex inℝn\\mathbb\{R\}^\{n\}\. Then

Ω1∗​\(s\)\\displaystyle\\Omega\_\{1\}^\{\*\}\(s\)=2minθ∈ℝds\+1∥y−Bsθ∥∞=sup\{\|p⊤y−q⊤y\|:p,q∈Δn,∑i=1npixi=∑i=1nqixi\}\.\\displaystyle=2\\min\_\{\\theta\\in\\mathbb\{R\}^\{d\_\{s\}\+1\}\}\\\|y\-B\_\{s\}\\theta\\\|\_\{\\infty\}=\\sup\\Big\\\{\|p^\{\\top\}y\-q^\{\\top\}y\|:p,q\\in\\Delta\_\{n\},\\ \\sum\_\{i=1\}^\{n\}p\_\{i\}x\_\{i\}=\\sum\_\{i=1\}^\{n\}q\_\{i\}x\_\{i\}\\Big\\\}\.

Proposition[4\.5](https://arxiv.org/html/2606.10979#S4.Thmtheorem5)shows that, in the finite\-action case, the statewise oscillation is the Chebyshev affine approximation error of the Bellman continuation trace on the attainable post\-action points\. Equivalently, if two randomized mixtures of feasible actions have the same average post\-action configuration, then any first\-order score assigns them the same average score; any difference in their average Bellman continuation values is invisible to the first\-order decoder\. To connect it back to the performance bound, suppose there exists a measurable selectorzloc​\(s\)∈argminz∈ℝdsΩs​\(z\)z^\{\\mathrm\{loc\}\}\(s\)\\in\\mathop\{\\mathrm\{argmin\}\}\_\{z\\in\\mathbb\{R\}^\{d\_\{s\}\}\}\\Omega\_\{s\}\(z\), and letπ∗\\pi^\{\*\}be the corresponding decoded policyπ∗​\(s\)=Γ​\(s,zloc​\(s\)\)\.\\pi^\{\*\}\(s\)=\\Gamma\(s,z^\{\\mathrm\{loc\}\}\(s\)\)\.Then Theorem[4\.2](https://arxiv.org/html/2606.10979#S4.Thmtheorem2)implies

J​\(π∗;ν0\)−J​\(π^1;ν0\)≤11−γ​𝔼s∼dν0,γπloc​\[Ω1∗​\(s\)\]\+ℰlearn\.J\(\\pi^\{\*\};\\nu\_\{0\}\)\-J\(\\widehat\{\\pi\}\_\{1\};\\nu\_\{0\}\)\\leq\\frac\{1\}\{1\-\\gamma\}\\mathbb\{E\}\_\{s\\sim d\_\{\\nu\_\{0\},\\gamma\}^\{\\pi^\{\\mathrm\{loc\}\}\}\}\\big\[\\Omega\_\{1\}^\{\*\}\(s\)\\big\]\+\\mathcal\{E\}\_\{\\mathrm\{learn\}\}\.
Overall, the preceding results separate the optimality gap into a structural approximation term and an algorithmic learning term, and provide analytical characterizations of the structural term through residual oscillation\. However, these bounds involves quantities that are defined through the optimal continuation value, which is itself the solution of the Bellman fixed\-point equation\. For a general MDP, converting these quantities into explicit bounds involving only primitive model parameters is typically difficult: it requires controlling the geometry, curvature, or finite\-difference structure of the optimal value function over the relevant post\-action configurations\.

## 5Sanity Check: Application to Inventory Control

In this section, we consider a multi\-location inventory control problem with cross\-transshipment as a controlled diagnostic application of our method\. The model is intentionally small and simplified, rather than a practical inventory operations\. Its purpose is to demonstrate that the proposed action reparameterization framework can handle OR problems with state\-dependent feasible action sets\.

Specifically, letℐ=\{1,…,n\}\\mathcal\{I\}=\\\{1,\\dots,n\\\}index inventory locations\. At the beginning of each period, the on\-hand inventory level of locationiiissi∈\{1,2,⋯,Li\}s\_\{i\}\\in\\\{1,2,\\cdots,L\_\{i\}\\\}whereLiL\_\{i\}denotes an inventory upper bound imposed for computational purposes so that the resulting MDP has a finite state space\. Thus,s=\(s1,…,sn\)s=\(s\_\{1\},\\dots,s\_\{n\}\)denotes the system’s state\. Each location independently faces stochastic demanddid\_\{i\}whose distribution is𝒟i\\mathcal\{D\}\_\{i\}\. To fill the demand, locationiiorders a quantity ofui∈\{0,1,⋯,Ui\}u\_\{i\}\\in\\\{0,1,\\cdots,U\_\{i\}\\\}items to replenish inventory at the end of period\. In addition to self\-replenishment, transshipment across different locations is allowed\. Letyi​j∈ℤ\+y\_\{ij\}\\in\\mathbb\{Z\}\_\{\+\}denote the quantity transshipped from locationiitojj\. Thus, the action of this problem isa=\(y,u\)a=\(y,u\)\. Since the total transshipment from locationiicannot exceed its current inventory capacitysis\_\{i\}, the feasible action space of statessis

𝒜​\(s\)=\{\(y,u\):yi​j∈ℤ\+,ui∈\{0,…,Ui\},∑j≠iyi​j≤si,∀i∈ℐ\}\.\\mathcal\{A\}\(s\)=\\Big\\\{\(y,u\):y\_\{ij\}\\in\\mathbb\{Z\}\_\{\+\},\\ u\_\{i\}\\in\\\{0,\\dots,U\_\{i\}\\\},\\ \\sum\_\{j\\neq i\}y\_\{ij\}\\leq s\_\{i\},\\ \\forall i\\in\\mathcal\{I\}\\Big\\\}\.Accordingly, we define outbound and inbound transshipment of locationiiasoi​\(y\)=∑j≠iyi​jo\_\{i\}\(y\)=\\sum\_\{j\\neq i\}y\_\{ij\}andmi​\(y\)=∑j≠iyj​im\_\{i\}\(y\)=\\sum\_\{j\\neq i\}y\_\{ji\}\.

To increase the nonlinearity of system dynamics, we assume that cross\-transshipment may incur losses, so not all inbound shipments are successfully processed\. This captures the effect of receiving congestion and diminishing marginal contribution of large inbound volumes\. Specifically, given an inbound volumemi​\(y\)m\_\{i\}\(y\), the successfully received inbound inventory of locationiiis a random variableei∼Binomial\(mi\(y\),pi,ρ\(mi\(y\)\)e\_\{i\}\\sim\\mathrm\{Binomial\}\(m\_\{i\}\(y\),p\_\{i,\\rho\}\(m\_\{i\}\(y\)\), where the success probability satisfiespi,ρ​\(m\)=1−ρ​\(m−1\)\+/κip\_\{i,\\rho\}\(m\)=1\-\\rho\{\(m\-1\)^\{\+\}\}/\{\\kappa\_\{i\}\}whereκi=max⁡\{1,∑j≠iLj−1\}\\kappa\_\{i\}=\\max\\\{1,\\sum\_\{j\\neq i\}L\_\{j\}\-1\\\}\. Note that0≤ρ<10\\leq\\rho<1is a parameter that controls the strength of receiving cost in cross\-transshipment, with largerρ\\rhoreducing the probability that inbound inventory is successfully processed and incorporated into the available stock\. Whenρ=0\\rho=0, all inbound transshipment is received\. After receiving self\-replenishment and cross\-transshipment, demandd=\(d1,…,dn\)d=\(d\_\{1\},\\dots,d\_\{n\}\)is realized and the next\-period inventory level of locationiiis updated assi′=min⁡\{Li,\(si−oi​\(y\)\+ei\+ui−di\)\+\}\.s\_\{i\}^\{\\prime\}=\\min\\\{L\_\{i\},\(s\_\{i\}\-o\_\{i\}\(y\)\+e\_\{i\}\+u\_\{i\}\-d\_\{i\}\)^\{\+\}\\\}\.Letri=si−oi​\(y\)\+ei\+uir\_\{i\}=s\_\{i\}\-o\_\{i\}\(y\)\+e\_\{i\}\+u\_\{i\}be the total inventory level of locationiiimmediately beforedid\_\{i\}is realized\. Then the one\-period cost is

c​\(s,a\)=∑i≠jτi​j​yi​j\+∑iχi​ui\+∑ihi​\(ri−di\)\+\+∑iℓi​\(di−ri\)\+,c\(s,a\)=\\sum\_\{i\\neq j\}\\tau\_\{ij\}y\_\{ij\}\+\\sum\_\{i\}\\chi\_\{i\}u\_\{i\}\+\\sum\_\{i\}h\_\{i\}\(r\_\{i\}\-d\_\{i\}\)^\{\+\}\+\\sum\_\{i\}\\ell\_\{i\}\(d\_\{i\}\-r\_\{i\}\)^\{\+\},whereτi​j,χi,hi,ℓi\\tau\_\{ij\},\\chi\_\{i\},h\_\{i\},\\ell\_\{i\}are the unit transshipment, replenishment, holding, and lost\-sale costs, respectively\. The objective is to minimizeJ​\(π;ν0\)=𝔼s0∼ν0π​\[∑t=0∞γt​c​\(st,at\)\]J\(\\pi;\\nu\_\{0\}\)=\\mathbb\{E\}\_\{s\_\{0\}\\sim\\nu\_\{0\}\}^\{\\pi\}\[\\sum\_\{t=0\}^\{\\infty\}\\gamma^\{t\}c\(s\_\{t\},a\_\{t\}\)\]whereν0\\nu\_\{0\}is the initial distribution\.

In this formulation, the post\-action configuration isϕs​\(a\)=\(b,m\)∈ℝ2​n\\phi\_\{s\}\(a\)=\(b,m\)\\in\\mathbb\{R\}^\{2n\}, wherebi=si−oi​\(y\)\+uib\_\{i\}=s\_\{i\}\-o\_\{i\}\(y\)\+u\_\{i\}is the local inventory base after outbound transshipment and replenishment, andmi=mi​\(y\)m\_\{i\}=m\_\{i\}\(y\)is the nominal inbound transshipment volume\. The randomness of system transition are incorporated by the received inbound transshipment and demand, which can be represented asΞs​\(ϕs​\(a\),ξs\)\\Xi\_\{s\}\(\\phi\_\{s\}\(a\),\\xi\_\{s\}\)\. Thus, this problem can be solved by our proposed method\. We test the first\-order and second\-order action decoder, i\.e\.,ΓK​\(s,z\)∈argmina∈𝒜​\(s\)\{c​\(s,a\)\+⟨z,FK​\(ϕs​\(a\)\)⟩\}\\Gamma\_\{K\}\(s,z\)\\in\\mathop\{\\mathrm\{argmin\}\}\_\{a\\in\\mathcal\{A\}\(s\)\}\\\{c\(s,a\)\+\\langle z,F\_\{K\}\(\\phi\_\{s\}\(a\)\)\\rangle\\\}, whereF1​\(w\)=wF\_\{1\}\(w\)=wis an identity map andF2​\(w\)=\(w,w⊗w\)F\_\{2\}\(w\)=\(w,w\\otimes w\)incorporates all second\-order terms\.

Letπ^K\\widehat\{\\pi\}\_\{K\}be the optimal policy trained by PPO withKK\-th order action decoder\. For each problem instance, we also exactly calculate the optimal policyπ∗\\pi^\{\*\}value as benchmark, which is achievable for small\-scale problems\. The corresponding value function andQQ\-function areV∗​\(s\)V^\{\*\}\(s\)andQ∗​\(s,a\)Q^\{\*\}\(s,a\)\. We compareπ^K\\widehat\{\\pi\}\_\{K\}with theπ∗\\pi^\{\*\}via three metrics: \(1\) optimality gapJ​\(π^K;ν0\)/J​\(π∗;ν0\)−1J\(\\widehat\{\\pi\}\_\{K\};\\nu\_\{0\}\)/J\(\\pi^\{\*\};\\nu\_\{0\}\)\-1; \(2\) optimal action agreement rate; and \(3\) Bellman regret𝔼x∼μρ∗​\[Qρ∗​\(x,aρ,Kbr​\(x\)\)−Vρ∗​\(x\)\],\\mathbb\{E\}\_\{x\\sim\\mu\_\{\\rho\}^\{\*\}\}\[Q\_\{\\rho\}^\{\*\}\(x,a\_\{\\rho,K\}^\{\\rm br\}\(x\)\)\-V\_\{\\rho\}^\{\*\}\(x\)\],whereμρ∗\\mu\_\{\\rho\}^\{\*\}is the occupancy measure induced byπ∗\\pi^\{\*\}\. We test 4 instances where parameterρ\\rhovaries in\{0,0\.25,0\.5,0\.75\}\\\{0,0\.25,0\.5,0\.75\\\}\. Asρ\\rhoincreases, the post\-action transition becomes more nonlinear, in the sense that the successfully received inbound inventory exhibits stronger diminishing marginal returns with respect to the nominal inbound transshipment volume\. Other model parameters, as well as PPO training details, are provided in Appendix\. The experiment results are summarized in Table[1](https://arxiv.org/html/2606.10979#S5.T1)\.

Table 1:Performance for first\-order and second\-order decoded policies in the inventory model\.First\-order decoded policySecond\-order decoded policyρ\\rhooptimality gapagreementregretoptimality gapagreementregret01\.0%95\.7%0\.380\.2%98\.8%0\.030\.251\.5%93\.5%0\.600\.3%97\.6%0\.080\.5010\.5%51\.1%4\.230\.4%96\.1%0\.120\.7511\.4%49\.6%5\.080\.6%91\.7%0\.27The results in Table[1](https://arxiv.org/html/2606.10979#S5.T1)demonstrates the role of action decoder\. The first\-order decoder already yields effective decoded policies under the state\-dependent transshipment constraints, particularly when the receiving process is close to linear\. As the congestion parameterρ\\rhoincreases, the expected received quantity becomes a more curved function of the nominal inbound volume, and the performance of the first\-order decoder gradually deteriorates\. This is reflected in the increase of its optimality gap and Bellman regret, as well as the decline in the Bellman action agreement\.

The second\-order decoder mitigates this deterioration by allowing the score function to depend on quadratic features of the post\-action configuration\. Its optimality gap remains below0\.59%0\.59\\%and its Bellman regret remains small across all values ofρ\\rho, suggesting that the additional second\-order terms help preserve the Bellman action ranking when nonlinear receiving effects are present\. These findings highlight the intended tradeoff of the proposed framework: a first\-order decoder offers a simple and computationally attractive reparameterization when the dynamics are nearly linear, whereas higher\-order decoders provide additional flexibility for systems with complex transition mechanisms such as congestion, loss, saturation, or diminishing marginal processing efficiency\.

## 6Case Study: Application to Queueing Network Control

Dynamic routing in multi\-class and multi\-pool service systems is a canonical problem of queueing network control, where heterogeneous customer classes need to be dynamically matched to server pools with different skills or service efficiencies\(Chenet al\.,[2020](https://arxiv.org/html/2606.10979#bib.bib8)\)\. The routing decisions must balance several competing effects: assigning a customer to a flexible or non\-primary pool may reduce current congestion, but it can also consume scarce capacity, lower service efficiency, or incur overflow costs\. This model finds many applications in service systems such as inpatient overflow management\(Dai and Shi,[2019](https://arxiv.org/html/2606.10979#bib.bib23); Sunet al\.,[2024](https://arxiv.org/html/2606.10979#bib.bib12)\)\. This problem can be naturally formulated as an MDP and the action space is state dependent: each action is an integer dispatching matrix whose feasibility is determined by the current queues and server capacities\. As a result, this problems becomes a suitable testbed for our algorithm\.

### 6\.1Model Description and MDP Formulation

We consider a discrete\-time, non\-preemptive, parallel\-service system with multiple customer classes and multiple server pools following the setup inChenet al\.\([2020](https://arxiv.org/html/2606.10979#bib.bib8)\)\. Letℐ=\{1,…,I\}\\mathcal\{I\}=\\\{1,\\ldots,I\\\}and𝒥=\{1,…,J\}\\mathcal\{J\}=\\\{1,\\ldots,J\\\}denote the sets of customer classes and server pools, respectively\. Customers of classi∈ℐi\\in\\mathcal\{I\}arrive according to a homogeneous Poisson process with rateλi\\lambda\_\{i\}\. Server poolj∈𝒥j\\in\\mathcal\{J\}has capacityUjU\_\{j\}, meaning that at mostUjU\_\{j\}customers can be simultaneously served in pooljj\. The service time of a class\-iicustomer served by a type\-jjserver is an exponentialμi​j\\mu\_\{ij\}distribution\. Equivalently, in the discrete\-time system, the probability of service completion in one time period is1−e−μi​j∈\(0,1\)1\-e^\{\-\\mu\_\{ij\}\}\\in\(0,1\)\. Letℰ=\{\(i,j\)∈ℐ×𝒥:μi​j\>0\}\\mathcal\{E\}=\\\{\(i,j\)\\in\\mathcal\{I\}\\times\\mathcal\{J\}:\\mu\_\{ij\}\>0\\\}be the set of all compatible class\-pool pairs\. This setup generalizes the pool\-dependent service rate model studied in\(Dai and Shi,[2019](https://arxiv.org/html/2606.10979#bib.bib23); Sunet al\.,[2024](https://arxiv.org/html/2606.10979#bib.bib12)\), but also increases the state’s dimension\.

At the beginning of each periodtt, the system state isst=\(qt,ht\)s\_\{t\}=\(q\_\{t\},h\_\{t\}\), whereqt=\(q1,t,…,qI,t\)∈ℤ\+Iq\_\{t\}=\(q\_\{1,t\},\\ldots,q\_\{I,t\}\)\\in\\mathbb\{Z\}\_\{\+\}^\{I\}records the queue lengths, andht=\(hi​j,t\)\(i,j\)∈ℰ∈ℤ\+\|ℰ\|h\_\{t\}=\(h\_\{ij,t\}\)\_\{\(i,j\)\\in\\mathcal\{E\}\}\\in\\mathbb\{Z\}\_\{\+\}^\{\|\\mathcal\{E\}\|\}records the in\-service occupancy by origin, withhi​j,th\_\{ij,t\}denoting the number of class\-iicustomers currently being served in pooljj\. This finer state description is necessary because, once the service rate depends on\(i,j\)\(i,j\), the total occupancy of pooljjalone is no longer sufficient to characterize the departure process\. The action of periodttis a matrixat=\(ai​j,t\)\(i,j\)∈ℰa\_\{t\}=\(a\_\{ij,t\}\)\_\{\(i,j\)\\in\\mathcal\{E\}\}, whereai​j,t∈ℤ\+a\_\{ij,t\}\\in\\mathbb\{Z\}\_\{\+\}denotes the number of classiicustomers dispatched from the queue to pooljjat timett\. A feasible action must satisfy constraints∑j:\(i,j\)∈ℰai​j,t≤qi,t,∀i∈ℐ\\sum\_\{j:\(i,j\)\\in\\mathcal\{E\}\}a\_\{ij,t\}\\leq q\_\{i,t\},\\forall i\\in\\mathcal\{I\}, which means that one cannot dispatch more classiicustomers than are currently waiting in queueii, and∑i:\(i,j\)∈ℰ\(hi​j,t\+ai​j,t\)≤Uj,∀j∈𝒥\\sum\_\{i:\(i,j\)\\in\\mathcal\{E\}\}\\bigl\(h\_\{ij,t\}\+a\_\{ij,t\}\\bigr\)\\leq U\_\{j\},\\forall j\\in\\mathcal\{J\}, which means that the total occupancy of pooljjafter dispatching cannot exceed its service capacity\. As a result, the feasible action set at states=\(q,h\)s=\(q,h\)is characterized by

𝒜​\(s\)=\{a=\(ai​j\)\(i,j\)∈ℰ∈ℤ\+\|ℰ\|:∑j:\(i,j\)∈ℰai​j≤qi,∑i:\(i,j\)∈ℰ\(hi​j\+ai​j\)≤Uj\}\.\\displaystyle\\mathcal\{A\}\(s\)=\\Big\\\{a=\(a\_\{ij\}\)\_\{\(i,j\)\\in\\mathcal\{E\}\}\\in\\mathbb\{Z\}\_\{\+\}^\{\|\\mathcal\{E\}\|\}:\\sum\_\{j:\(i,j\)\\in\\mathcal\{E\}\}a\_\{ij\}\\leq q\_\{i\},\\ \\sum\_\{i:\(i,j\)\\in\\mathcal\{E\}\}\(h\_\{ij\}\+a\_\{ij\}\)\\leq U\_\{j\}\\Big\\\}\.\(6\.1\)Given statest=\(qt,ht\)s\_\{t\}=\(q\_\{t\},h\_\{t\}\)and actionata\_\{t\}, the dispatching is implemented first\. Then random service completions and exogenous arrivals occur\. Specifically, for each classii, the queue lengths becomesqi,t\+1=qi,t−∑j:\(i,j\)∈ℰai​j,t\+δi,t,q\_\{i,t\+1\}=q\_\{i,t\}\-\\sum\_\{j:\(i,j\)\\in\\mathcal\{E\}\}a\_\{ij,t\}\+\\delta\_\{i,t\},whereδi,t∼Poisson​\(λi\)\\delta\_\{i,t\}\\sim\\mathrm\{Poisson\}\(\\lambda\_\{i\}\)\. For each compatible pair\(i,j\)∈ℰ\(i,j\)\\in\\mathcal\{E\}, the in\-service occupancy satisfieshi​j,t\+1=hi​j,t\+ai​j,t−di​j,t,h\_\{ij,t\+1\}=h\_\{ij,t\}\+a\_\{ij,t\}\-d\_\{ij,t\},wheredi​j,t∼Binomial​\(hi​j,t\+ai​j,t,1−e−μi​j\)\.d\_\{ij,t\}\\sim\\mathrm\{Binomial\}\(h\_\{ij,t\}\+a\_\{ij,t\},1\-e^\{\-\\mu\_\{ij\}\}\)\.

We consider a one\-period cost consisting of queue holding cost and dispatching cost\. Letwi\>0w\_\{i\}\>0be the waiting\-cost coefficient of classii, and letci​j≥0c\_\{ij\}\\geq 0denote the one\-time cost of dispatching a classiicustomer to pooljj\. This allowsci​jc\_\{ij\}to encode, for example, overflow penalties for sending customers to non\-primary pools\. Thus, the one\-period cost isc​\(st,at\)=∑i=1Iwi​qi,t\+∑\(i,j\)∈ℰci​j​ai​j,tc\(s\_\{t\},a\_\{t\}\)=\\sum\_\{i=1\}^\{I\}w\_\{i\}q\_\{i,t\}\+\\sum\_\{\(i,j\)\\in\\mathcal\{E\}\}c\_\{ij\}a\_\{ij,t\}\.

A stationary dispatching policyπ\\pispecifies, for each statess, a probability distribution supported on the feasible action set𝒜​\(s\)\\mathcal\{A\}\(s\)\. Starting from an initial statesswhose distribution isν0\\nu\_\{0\}, the decision\-maker’s objective is to minimize the expected cumulative discounted costJ​\(π;ν0\)=𝔼s0∼ν0π​\[∑t=0∞γt​c​\(st,at\)\],J\(\\pi;\\nu\_\{0\}\)=\\mathbb\{E\}\_\{s\_\{0\}\\sim\\nu\_\{0\}\}^\{\\pi\}\[\\sum\_\{t=0\}^\{\\infty\}\\gamma^\{t\}c\(s\_\{t\},a\_\{t\}\)\],whereγ∈\(0,1\)\\gamma\\in\(0,1\)is the discount factor \(or equivalently, maximize the negative accumulative cost as reward\.\)

### 6\.2First\-Order Score Decoding: Learning State\-Dependent Index Rule

Above queueing model fits the structural MDP studied in this paper\. Given a states=\(q,h\)s=\(q,h\)and a feasible actiona∈𝒜​\(s\)a\\in\\mathcal\{A\}\(s\), define the post\-action configurationϕs​\(a\)=\(q\+​\(a\),h\+​\(a\)\)\\phi\_\{s\}\(a\)=\(q^\{\+\}\(a\),h^\{\+\}\(a\)\), whereqi\+​\(a\)=qi−∑j:\(i,j\)∈ℰai​jq\_\{i\}^\{\+\}\(a\)=q\_\{i\}\-\\sum\_\{j:\(i,j\)\\in\\mathcal\{E\}\}a\_\{ij\}andhi​j\+​\(a\)=hi​j\+ai​jh\_\{ij\}^\{\+\}\(a\)=h\_\{ij\}\+a\_\{ij\}\. It represents the system configuration immediately after dispatching and before random arrivals or service completions occur\. Also define the exogenous disturbance asξs=\(δ,d\),\\xi\_\{s\}=\(\\delta,d\),whereδ=\(δi\)i∈ℐ\\delta=\(\\delta\_\{i\}\)\_\{i\\in\\mathcal\{I\}\}is the new arrivals with independent componentsδi∼Poisson​\(λi\)\\delta\_\{i\}\\sim\\mathrm\{Poisson\}\(\\lambda\_\{i\}\), andd=\(di​j\)\(i,j\)∈ℰd=\(d\_\{ij\}\)\_\{\(i,j\)\\in\\mathcal\{E\}\}is the service\-completion vector satisfyingdi​j∼Binomial​\(hi​j\+​\(a\),1−e−μi​j\)\.d\_\{ij\}\\sim\\mathrm\{Binomial\}\(h\_\{ij\}^\{\+\}\(a\),1\-e^\{\-\\mu\_\{ij\}\}\)\.Then the state transition can be written ass′=Ξs​\(ϕs​\(a\),ξs\),s^\{\\prime\}=\\Xi\_\{s\}\(\\phi\_\{s\}\(a\),\\xi\_\{s\}\),whereΞs​\(\(q\+,h\+\),\(δ,d\)\)=\(q\+\+δ,h\+−d\)\.\\Xi\_\{s\}\(\(q^\{\+\},h^\{\+\}\),\(\\delta,d\)\)=\(q^\{\+\}\+\\delta,\\ h^\{\+\}\-d\)\.The one period reward is also a functionψs​\(a\)=−∑i=1Iwi​qi−∑\(i,j\)∈ℰci​j​ai​j\.\\psi\_\{s\}\(a\)=\-\\sum\_\{i=1\}^\{I\}w\_\{i\}q\_\{i\}\-\\sum\_\{\(i,j\)\\in\\mathcal\{E\}\}c\_\{ij\}a\_\{ij\}\.

For the first\-order Bellman\-Taylor score decoding strategy introduced before, letz=\(zq,zh\)∈ℝI\+\|ℰ\|,z=\(z^\{q\},z^\{h\}\)\\in\\mathbb\{R\}^\{I\+\|\\mathcal\{E\}\|\},wherezq=\(ziq\)i∈ℐz^\{q\}=\(z\_\{i\}^\{q\}\)\_\{i\\in\\mathcal\{I\}\}andzh=\(zi​jh\)\(i,j\)∈ℰz^\{h\}=\(z\_\{ij\}^\{h\}\)\_\{\(i,j\)\\in\\mathcal\{E\}\}are the latent scores associated with the post\-decision queue lengthsq\+​\(a\)q^\{\+\}\(a\)and in\-service occupanciesh\+​\(a\)h^\{\+\}\(a\), respectively\. The action decoder is

ψs​\(a\)\+⟨z,ϕs​\(a\)⟩=const​\(s,z\)\+∑\(i,j\)∈ℰ\(−ci​j−ziq\+zi​jh\)⋅ai​j\\displaystyle\\psi\_\{s\}\(a\)\+\\langle z,\\phi\_\{s\}\(a\)\\rangle=\\mathrm\{const\}\(s,z\)\+\\sum\_\{\(i,j\)\\in\\mathcal\{E\}\}\\bigl\(\-c\_\{ij\}\-z\_\{i\}^\{q\}\+z\_\{ij\}^\{h\}\\bigr\)\\cdot a\_\{ij\}whereconst​\(s,z\)\\mathrm\{const\}\(s,z\)denotes an intercept term not depending onaa\. Hence, in this queueing application, the first\-order action decoder reduces to an integer program over the feasible dispatching set of formatΓ​\(s,z\)=argmaxa∈𝒜​\(s\)∑\(i,j\)∈ℰz~i​j​\(s\)⋅ai​j\\Gamma\(s,z\)=\\mathop\{\\mathrm\{argmax\}\}\_\{a\\in\\mathcal\{A\}\(s\)\}\\sum\_\{\(i,j\)\\in\\mathcal\{E\}\}\\widetilde\{z\}\_\{ij\}\(s\)\\cdot a\_\{ij\}withz~i​j​\(s\)=−ci​j−ziq\+zi​jh\\widetilde\{z\}\_\{ij\}\(s\)=\-c\_\{ij\}\-z\_\{i\}^\{q\}\+z\_\{ij\}^\{h\}\. With such a reparameterization, the formal queueing problem is translated into a continuous\-action MDP on the Euclidean score spaceℝI\+\|ℰ\|\\mathbb\{R\}^\{I\+\|\\mathcal\{E\}\|\}\. At each statess, the policy no longer generates the dispatching matrixaadirectly, but instead, produces a state\-dependent score vectorzz\.

This separation is crucial in queueing network control\. If we parameterize a policy directly on the dispatching matrixaa, the neural network must operate over a state\-dependent action set defined by integer, queue\-availability, and pool\-capacity constraints specified in \([6\.1](https://arxiv.org/html/2606.10979#S6.E1)\), which is highly nontrivial and typically requires either explicit projection or ad hoc feasibility corrections\. By contrast, after reparameterization, the policy acts on the latent score spaceℝI\+\|ℰ\|\\mathbb\{R\}^\{I\+\|\\mathcal\{E\}\|\}, which is a fixed Euclidean space independent of system status\. The difficult combinatorial part of the decision is then delegated to the decoder, which maps the score vector to a feasible dispatching action by solving the induced integer program\. Hence feasibility is enforced exactly by the decoder, while policy optimization is carried out on a regular latent space rather than on the original irregular action set\. As a result, the queueing\-control problem is converted into a latent\-score MDP that can be solved using the standard PPO procedure efficiently\.

This approach also finds connections with several classical routing rules from the queueing literature\. With policy optimization, the decoder actually assigns to each compatible class\-pool pair\(i,j\)\(i,j\)an effective state\-dependent indexz~i​j​\(s\)\\widetilde\{z\}\_\{ij\}\(s\)and then chooses the feasible dispatching matrix that maximizes the total weighted sum∑\(i,j\)∈ℰz~i​j​\(s\)⋅ai​j\\sum\_\{\(i,j\)\\in\\mathcal\{E\}\}\\widetilde\{z\}\_\{ij\}\(s\)\\cdot a\_\{ij\}\. In this sense, the resulting policy is a learned max\-weight\-type dispatching rule\. Classical heuristics such as the generalizedc​μc\\murule, max\-weight rule\(Buyukkocet al\.,[1985](https://arxiv.org/html/2606.10979#bib.bib20); Tassiulas and Ephremides,[1990](https://arxiv.org/html/2606.10979#bib.bib21); Stolyar,[2004](https://arxiv.org/html/2606.10979#bib.bib22)\), and their overflow\-cost\-adjusted variants\(Chenet al\.,[2020](https://arxiv.org/html/2606.10979#bib.bib8)\)can all be viewed as hand\-crafted specifications of such pairwise indices, typically expressed analytically in terms of queue lengths, holding costs, service rates, and overflow costs\. Under our notation,z~i​j​\(s\)\\widetilde\{z\}\_\{ij\}\(s\)can bewi​μi​jw\_\{i\}\\mu\_\{ij\},wi​μi​j−ci​jw\_\{i\}\\mu\_\{ij\}\-c\_\{ij\},wi​qi​μi​jw\_\{i\}q\_\{i\}\\mu\_\{ij\}, andwi​qi​μi​j−ci​jw\_\{i\}q\_\{i\}\\mu\_\{ij\}\-c\_\{ij\}, corresponding to thec​μc\\murule, modifiedc​μc\\murule, max\-weight rule, and modified max\-weight rule in literature\. By contrast, our formulation does not fix the indexz~i​j​\(s\)\\widetilde\{z\}\_\{ij\}\(s\)a priori but learns the optimal through DRL automatically\.

### 6\.3Numerical Experiments

We test our algorithm empirically in this section\. Due to space limitation, we only report the experiment results here\.

#### 6\.3\.1Small\-scale setting\.

We first conduct a small\-scale experiment in which the optimal dynamic policy can be computed exactly\. This serves as a sanity check to verify how our algorithm performs against the optimal in theory\. Particularly, we consider a2×22\\times 2queueing network with two customer classes and two server pools\. Each server pool has capacity 5, and all compatible class\-pool pairs have homogeneous service\-completion probabilityμi​j=1\\mu\_\{ij\}=1\. Customer classii’s primary server pool isii,i=1,2i=1,2\. The waiting\-cost coefficients are normalized to one\. Dispatching a customer to its primary pool is cost\-free, whereas to a non\-primary pool incurs a routing cost\. We consider three overflow\-cost levelscov∈\{0\.1,0\.3,0\.6\}c^\{\\mathrm\{ov\}\}\\in\\\{0\.1,0\.3,0\.6\\\}, and setci​j=covc\_\{ij\}=c^\{\\mathrm\{ov\}\}fori≠ji\\neq j\.

To simplify model, we assume the service rates only depend on server pool type, i\.e\.,μi​j=μj\\mu\_\{ij\}=\\mu\_\{j\}\. We study two traffic configurations\. In the balanced\-load instance, both classes have utilization levelsρ1=ρ2=0\.9\\rho\_\{1\}=\\rho\_\{2\}=0\.9, whereρi=λi/\(Ui​μi\)\\rho\_\{i\}=\\lambda\_\{i\}/\(U\_\{i\}\\mu\_\{i\}\)\. In the imbalanced\-load instance, the two classes have utilization levelsρ1=0\.95\\rho\_\{1\}=0\.95andρ2=0\.85\\rho\_\{2\}=0\.85, which creates asymmetric congestion pressure and makes routing decisions more state dependent\. The system starts from empty\. For each instance, we compute the optimal value function exact dynamic\-programming solution on the finite\-state system, with an appropriate state truncation mechanism, to obtain the optimal value functionV∗V^\{\*\}\. We then train the BTSD\-PPO algorithm and evaluate its performance via simulation\. The results are summarized in Table[2](https://arxiv.org/html/2606.10979#S6.T2)\. The number in bracket denotes the estimation standard error of policy evaluation via simulation\. We note that BTSD\-PPO achieves performance close to the exact optimal, with relative gaps below2\.5%2\.5\\%across all tested instances\. This suggests that the first\-order decoder\-induced policy family indeed preserves the essential structure of the optimal policy in small systems where exact comparison is possible\.

Table 2:Comparison of BTSD\-PPO with the exact optimal solution in a2×22\\times 2network\.ρ=\(0\.90,0\.90\)\\rho=\(0\.90,0\.90\)ρ=\(0\.95,0\.85\)\\rho=\(0\.95,0\.85\)Metriccov=0\.1c^\{\\mathrm\{ov\}\}=0\.1cov=0\.3c^\{\\mathrm\{ov\}\}=0\.3cov=0\.6c^\{\\mathrm\{ov\}\}=0\.6cov=0\.1c^\{\\mathrm\{ov\}\}=0\.1cov=0\.3c^\{\\mathrm\{ov\}\}=0\.3cov=0\.6c^\{\\mathrm\{ov\}\}=0\.6Exact optimal1210\.4 \(1\.8\)1241\.1 \(2\.0\)1285\.3 \(2\.1\)1185\.4 \(2\.0\)1218\.1 \(2\.0\)1264\.5 \(2\.0\)BTSD\-PPO1226\.3 \(1\.8\)1261\.0 \(2\.2\)1316\.9 \(2\.1\)1198\.1 \(1\.6\)1234\.2 \(1\.9\)1278\.2 \(1\.9\)Gap1\.3%1\.6%2\.5%1\.1%1\.3%1\.1%
#### 6\.3\.2Moderate\-scale setting\.

We next consider a moderate\-scale queueing network withI=5I=5customer classes andJ=5J=5server pools\. Each pool has capacityUj=30U\_\{j\}=30\. The network is fully connected, i\.e\.,ℰ=ℐ×𝒥\.\\mathcal\{E\}=\\mathcal\{I\}\\times\\mathcal\{J\}\.Customer classiihas primary poolii\. Thus, the action at each state is a5×55\\times 5integer dispatching matrix subject to the queue\-availability and pool\-capacity constraints in \([6\.1](https://arxiv.org/html/2606.10979#S6.E1)\)\. Same as above, service rates are pool dependent but class independent\. Specifically, for each server pooljj, we randomly selectμj∼Unif​\(0\.75,0\.95\),\\mu\_\{j\}\\sim\\mathrm\{Unif\}\(0\.75,0\.95\),and setμi​j=μj\\mu\_\{ij\}=\\mu\_\{j\}\. This specification preserves service\-rate heterogeneity across server pools while keeping the moderate\-scale setting comparable to classical pool\-dependent service models\. Problem of similar structure and scale has been investigated inDai and Shi \([2019](https://arxiv.org/html/2606.10979#bib.bib23)\)in the context of inpatient ward management\.

We consider three traffic\-intensity regimes\. Letmi=Ui​μim\_\{i\}=U\_\{i\}\\mu\_\{i\}be the nominal primary\-pool service capacity for classii\. For each regimerr, we specify a primary\-load vector𝝆r=\(ρ1r,…,ρ5r\)\\boldsymbol\{\\rho\}^\{\\,r\}=\(\\rho\_\{1\}^\{r\},\\ldots,\\rho\_\{5\}^\{r\}\)and determine the arrival ofλi\\lambda\_\{i\}asλir=ρir​mi\\lambda\_\{i\}^\{r\}=\\rho\_\{i\}^\{r\}m\_\{i\},1≤i≤51\\leq i\\leq 5\. The first regime is a symmetric heavy\-loaded system, where all classes have same utilization rate0\.950\.95, i\.e\.,𝝆1=\(0\.95,0\.95,0\.95,0\.95,0\.95\)\.\\boldsymbol\{\\rho\}^\{1\}=\(0\.95,0\.95,0\.95,0\.95,0\.95\)\.In the second regime, two classes are close to heavy traffic, while the remaining three classes are moderately loaded:𝝆2=\(0\.98,0\.96,0\.88,0\.84,0\.84\)\.\\boldsymbol\{\\rho\}^\{2\}=\(0\.98,0\.96,0\.88,0\.84,0\.84\)\.In the last regime, one class is moderately overloaded, two classes are near heavy traffic, and two classes are relatively lightly loaded:𝝆3=\(1\.05,0\.96,0\.92,0\.80,0\.77\)\.\\boldsymbol\{\\rho\}^\{3\}=\(1\.05,0\.96,0\.92,0\.80,0\.77\)\.Note that both imbalanced regimes have average nominal primary utilization equal to0\.900\.90, but they create different congestion patterns\. For each regime, we create three instances, by choosing the waiting costwiw\_\{i\}and overflow costci​jc\_\{ij\}from\[0,10\]\[0,10\]such that optimal OR heuristic introduced below varies\.

For benchmarks, we first consider four classical index\-based routing heuristics discussed inChenet al\.\([2020](https://arxiv.org/html/2606.10979#bib.bib8)\), namely,c​μc\\murule, modifiedc​μc\\murule, max\-weight rule, and modified max\-weight rule\. These benchmarks are state\-of\-the\-art OR heuristics literature and allow us to assess whether learning state\-dependent routing scores improves upon standard hand\-crafted indices\. We also test two RL\-based benchmark\. The first is the PPO algorithm proposed inSunet al\.\([2024](https://arxiv.org/html/2606.10979#bib.bib12)\), which decomposes system\-level combinatorial actions into sequential atomic actions within a PPO framework, enabling scalable policy learning for large inpatient overflow problem\. Another one is the MIP\-based value\-lookahead algorithm proposed inHarshaet al\.\([2025](https://arxiv.org/html/2606.10979#bib.bib36)\), which combines value\-function approximation with mathematical programming to optimize per\-step combinatorial actions for long\-term rewards in multi\-node inventory systems\. We emphasize that these benchmarks are adapted implementations rather than direct replications of the original algorithms\. The original methods were developed for different operational models and include problem\-specific tailoring to those settings\. For example,Sunet al\.\([2024](https://arxiv.org/html/2606.10979#bib.bib12)\)investigate a time\-varying Poisson arrival process while in our model, the arrival rate is fixed\. These model\-specific components are not directly applicable to our setting\. The comparison of different methods of each problem setting is reported in Table[3](https://arxiv.org/html/2606.10979#S6.T3)\.

Table 3:Comparison of BTSD\-PPO with various benchmarks in a5×55\\times 5network\.Casec​μc\\mumod\.c​μc\\mum\.p\.mod\. m\.p\.Atom\-PPOMIPBTSD\-PPOImproveB1233\.6 \(1\.2\)373\.2 \(1\.2\)310\.2 \(0\.4\)310\.0 \(0\.1\)297\.6 \(0\.2\)323\.3 \(0\.8\)224\.2 \(0\.2\)\+4\.0%B2459\.2 \(1\.3\)373\.8 \(0\.8\)550\.8 \(0\.7\)501\.5 \(0\.3\)418\.1 \(0\.7\)373\.0 \(0\.5\)351\.2 \(0\.9\)\+6\.0%B3954\.0 \(1\.6\)903\.8 \(1\.2\)999\.9 \(3\.0\)583\.0 \(2\.0\)677\.5 \(1\.3\)904\.3 \(2\.3\)547\.8 \(1\.2\)\+6\.0%T1225\.1 \(2\.7\)368\.7 \(1\.1\)305\.4 \(0\.6\)302\.8 \(1\.3\)289\.5 \(0\.3\)367\.8 \(1\.3\)208\.5 \(1\.5\)\+7\.4%T2468\.4 \(1\.6\)368\.2 \(1\.0\)545\.5 \(1\.7\)492\.0 \(1\.2\)606\.8 \(2\.6\)370\.0 \(1\.3\)346\.4 \(0\.4\)\+5\.9%T3545\.7 \(0\.6\)600\.5 \(1\.2\)610\.3 \(1\.3\)518\.3 \(1\.1\)758\.4 \(13\.1\)595\.2 \(0\.9\)395\.7 \(1\.7\)\+23\.7%O1233\.1 \(0\.9\)359\.0 \(2\.3\)305\.2 \(1\.3\)302\.2 \(1\.5\)349\.4 \(0\.8\)272\.3 \(0\.5\)196\.5 \(3\.8\)\+15\.7%O2474\.8 \(0\.9\)365\.6 \(1\.3\)544\.6 \(1\.1\)494\.8 \(1\.6\)499\.6 \(0\.7\)365\.3 \(1\.5\)305\.0 \(1\.0\)\+16\.6%O3487\.7 \(1\.3\)522\.9 \(1\.5\)598\.5 \(1\.9\)491\.3 \(1\.2\)574\.7 \(3\.8\)522\.7 \(1\.7\)414\.9 \(2\.9\)\+14\.9%

We observe that across all the nine cases, BTSD\-PPO achieves the lowest cost, with improvements over the best benchmark ranging from 4\.0% to 23\.7%\. We also note that the best\-performing OR heuristic varies across instances, highlighting that fixed heuristics are sensitive to traffic patterns and cost structures\. For the Atom\-PPO baseline, it generates feasible actions sequentially and preserves feasibility but fail to capture the global coupling of queue lengths and pool capacities, which are consistently outperformed by BTSD\-PPO\. Lastly, While the MIP\-based baseline can be strong in some cases, it lacks robustness across all scenarios\.

We further conduct an ablation study to isolate the role of BTSD from the choice of the underlying DRL optimizer\. Specifically, we combine the same BTSD decoder with three representative DRL backbones: PPO, SAC, and DQN\. We compare them with their vanilla counterparts that operate on the original action representation\. A naive action masking and projection mechanism is used to guarantee the feasibility The results, presented in Table[4](https://arxiv.org/html/2606.10979#S6.T4), show that BTSD consistently improves all three backbones\. Relative to the vanilla implementations, BTSD reduces the cost by 44\.1% for PPO, 44\.8% for SAC, and 41\.8% for DQN\. Moreover, all BTSD\-enhanced variants outperform the best benchmark policy\. In contrast, the vanilla DRL algorithms perform substantially worse than the heuristic benchmark\. These results suggest that the performance gain is not driven by a particular DRL optimizer, but by the BTSD action interface itself, which converts the original state\-dependent and combinatorial action space into a tractable latent\-score decision space while preserving exact feasibility\.

Table 4:Ablation Study: BTSD consistently improves different DRL backbones\.DRL BackboneVanillaBTSDImprovementPPO397\.2 \(0\.00\)222\.2\(0\.02\)\+44\.1%SAC417\.0 \(0\.01\)230\.2\(0\.22\)\+44\.8%DQN393\.1 \(0\.02\)228\.9\(0\.04\)\+41\.8%Best benchmark \(c​μc\\mu\-rule\)234\.3 \(0\.05\)–

## 7Conclusions and Future Work

This paper proposes Bellman\-Taylor score decoding as an action interface for MDPs with state\-dependent feasible action sets\. The main idea is to separate learning from feasibility\. A standard continuous\-action DRL algorithm learns state\-dependent scores in a Euclidean latent space, while an optimization decoder maps these scores into feasible natural actions in the original MDP\. This construction allows problem\-specific operational structure, such as capacity constraints, integrality restrictions, and combinatorial coupling, to be handled by the decoder rather than by a customized policy architecture\. Performance analysis suggests that the optimality gap can be decomposed into a structural approximation error and an algorithmic learning error\. In the queueing network case study, the decoder yields a learned max\-weight\-type dispatching rule, and numerical results demonstrate that a standard PPO implementation outperform classical routing heuristics and ADP benchmarks considerably\.

Several directions remain open for future research\. First, this paper focuses on the infinite\-horizon discounted\-reward formulation\. Many operational models are more naturally evaluated under a long\-run average\-cost criterion\. Extending Bellman–Taylor score decoding to average\-cost MDPs would require replacing the discounted continuation value with a relative value or bias function\. Second, our framework assumes that the transition mechanism and a meaningful post\-action representation are explicitly available\. This is natural in many stylized OR models, but may be restrictive in data\-driven settings where the dynamics are unknown\. An important direction is to learn or adapt the post\-action representation from data while retaining feasibility\-preserving decoding\. Third, the present implementation uses on\-policy learning through PPO\. Extending the framework to offline reinforcement learning would broaden its applicability and may improve sample efficiency, especially in systems where simulation is expensive or real\-world exploration is limited\.

## References

- Differentiable convex optimization layers\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2\.3](https://arxiv.org/html/2606.10979#S2.SS3.p3.1)\.
- B\. Amos and J\. Z\. Kolter \(2017\)OptNet: differentiable optimization as a layer in neural networks\.InInternational Conference on Machine Learning \(ICML\),pp\. 136–145\.Cited by:[§2\.3](https://arxiv.org/html/2606.10979#S2.SS3.p3.1)\.
- R\. Bellman \(1957\)Dynamic programming\.Princeton University Press\.Cited by:[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p1.1)\.
- Q\. Berthet, M\. Blondel, O\. Teboul, M\. Cuturi, J\. Vert, and F\. Bach \(2020\)Learning with differentiable perturbed optimizers\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2\.3](https://arxiv.org/html/2606.10979#S2.SS3.p3.1)\.
- D\. P\. Bertsekas \(2025\)Neuro\-dynamic programming\.InEncyclopedia of optimization,pp\. 1–6\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p1.1)\.
- C\. Buyukkoc, P\. Varaiya, and J\. Walrand \(1985\)Thec​μc\\murule revisited\.Advances in Applied Probability17\(1\),pp\. 237–238\.External Links:[Document](https://dx.doi.org/10.2307/1427064)Cited by:[§6\.2](https://arxiv.org/html/2606.10979#S6.SS2.p4.12)\.
- Y\. Chandak, G\. Theocharous, J\. Kostas, S\. Jordan, and P\. S\. Thomas \(2019\)Learning action representations for reinforcement learning\.InInternational Conference on Machine Learning \(ICML\),pp\. 941–950\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p5.1),[§2\.2](https://arxiv.org/html/2606.10979#S2.SS2.p1.1)\.
- J\. Chen, J\. Dong, and P\. Shi \(2020\)A survey on skill\-based routing with applications to service operations management\.Queueing Systems96\(1\),pp\. 53–82\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p10.1),[§6\.1](https://arxiv.org/html/2606.10979#S6.SS1.p1.13),[§6\.2](https://arxiv.org/html/2606.10979#S6.SS2.p4.12),[§6\.3\.2](https://arxiv.org/html/2606.10979#S6.SS3.SSS2.p3.2),[§6](https://arxiv.org/html/2606.10979#S6.p1.1)\.
- Y\. Chen, J\. Dong, Z\. Wang, and C\. Zhang \(2026\)A primal\-dual approach to constrained markov decision processes with applications to queue scheduling and inventory management\.Management Science72\(2\),pp\. 955–988\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p2.1)\.
- J\. G\. Dai and M\. Gluzman \(2022\)Queueing network controls via deep reinforcement learning\.Stochastic Systems12\(1\),pp\. 30–67\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p10.1),[§1](https://arxiv.org/html/2606.10979#S1.p2.1)\.
- J\. G\. Dai and P\. Shi \(2019\)Inpatient overflow: an approximate dynamic programming approach\.Manufacturing & Service Operations Management21\(4\),pp\. 894–911\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p10.1),[§6\.1](https://arxiv.org/html/2606.10979#S6.SS1.p1.13),[§6\.3\.2](https://arxiv.org/html/2606.10979#S6.SS3.SSS2.p1.10),[§6](https://arxiv.org/html/2606.10979#S6.p1.1)\.
- A\. Delarue, R\. Anderson, and C\. Tjandraatmadja \(2020\)Reinforcement learning with combinatorial actions: an application to vehicle routing\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p2.1),[§1](https://arxiv.org/html/2606.10979#S1.p5.1),[§2\.3](https://arxiv.org/html/2606.10979#S2.SS3.p2.1)\.
- J\. Dong, B\. Görgülü, and V\. Sarhangian \(2025\)Multiclass queue scheduling under slowdown: an approximate dynamic programming approach\.arXiv preprint arXiv:2501\.10523\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p10.1)\.
- G\. Dulac\-Arnold, R\. Evans, H\. van Hasselt, P\. Sunehag, T\. Lillicrap, J\. Hunt, T\. Mann, T\. Weber, T\. Degris, and B\. Coppin \(2015\)Deep reinforcement learning in large discrete action spaces\.arXiv preprint arXiv:1512\.07679\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p5.1),[§2\.2](https://arxiv.org/html/2606.10979#S2.SS2.p1.1)\.
- A\. N\. Elmachtoub and P\. Grigas \(2022\)Smart “predict, then optimize”\.Management Science68\(1\),pp\. 9–26\.External Links:[Document](https://dx.doi.org/10.1287/mnsc.2020.3922)Cited by:[§2\.3](https://arxiv.org/html/2606.10979#S2.SS3.p3.1)\.
- A\. Ferber, B\. Wilder, B\. Dilkina, and M\. Tambe \(2020\)MIPaaL: mixed integer program as a layer\.InAAAI Conference on Artificial Intelligence,Cited by:[§2\.3](https://arxiv.org/html/2606.10979#S2.SS3.p3.1)\.
- J\. Gijsbrechts, R\. N\. Boute, J\. A\. Van Mieghem, and D\. J\. Zhang \(2022\)Can deep reinforcement learning improve inventory management? performance on lost sales, dual\-sourcing, and multi\-echelon problems\.Manufacturing & Service Operations Management24\(3\),pp\. 1349–1368\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p2.1)\.
- P\. Harsha, A\. Jagmohan, J\. Kalagnanam, B\. Quanz, and D\. Singhvi \(2025\)Deep policy iteration with integer programming for inventory management\.Manufacturing & Service Operations Management27\(2\),pp\. 369–388\.External Links:[Document](https://dx.doi.org/10.1287/msom.2022.0617)Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p2.1),[§1](https://arxiv.org/html/2606.10979#S1.p5.1),[§2\.3](https://arxiv.org/html/2606.10979#S2.SS3.p2.1),[§6\.3\.2](https://arxiv.org/html/2606.10979#S6.SS3.SSS2.p3.2)\.
- M\. Hausknecht and P\. Stone \(2016\)Deep reinforcement learning in parameterized action space\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2\.2](https://arxiv.org/html/2606.10979#S2.SS2.p1.1)\.
- H\. Hoppe, L\. Baty, L\. Bouvier, A\. Parmentier, and M\. Schiffer \(2025\)Structured reinforcement learning for combinatorial decision\-making\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:[Link](https://arxiv.org/abs/2505.19053)Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p5.1),[§2\.3](https://arxiv.org/html/2606.10979#S2.SS3.p5.1)\.
- S\. Huang and S\. Ontañón \(2022\)A closer look at invalid action masking in policy gradient algorithms\.InInternational Florida Artificial Intelligence Research Society Conference \(FLAIRS\),Note:arXiv:2006\.14171External Links:[Document](https://dx.doi.org/10.32473/flairs.v35i.130584)Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p5.1),[§2\.2](https://arxiv.org/html/2606.10979#S2.SS2.p2.1)\.
- B\. Li, H\. Tang, Y\. Zheng, J\. Hao, P\. Li, Z\. Wang, Z\. Meng, and L\. Wang \(2021\)HyAR: addressing discrete\-continuous action reinforcement learning via hybrid action representation\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2\.2](https://arxiv.org/html/2606.10979#S2.SS2.p1.1)\.
- J\. Ling, M\. L\. Schuler, A\. Kumar, and P\. Varakantham \(2023\)Knowledge compilation for constrained combinatorial action spaces in reinforcement learning\.InProceedings of the International Conference on Autonomous Agents and Multiagent Systems \(AAMAS\),Note:IFAAMAS, 9 pagesCited by:[§2\.2](https://arxiv.org/html/2606.10979#S2.SS2.p2.1)\.
- B\. Liu, Q\. Cai, Z\. Yang, and Z\. Wang \(2019\)Neural trust region/proximal policy optimization attains globally optimal policy\.Advances in neural information processing systems32\.Cited by:[§4\.4](https://arxiv.org/html/2606.10979#S4.SS4.p2.1)\.
- J\. Mandi, J\. Kotary, S\. Berden, M\. Mulamba, V\. Bucarey, T\. Guns, and F\. Fioretto \(2024\)Decision\-focused learning: foundations, state of the art, benchmark and future opportunities\.Journal of Artificial Intelligence Research81,pp\. 1623–1701\.External Links:[Document](https://dx.doi.org/10.1613/jair.1.15320)Cited by:[§2\.3](https://arxiv.org/html/2606.10979#S2.SS3.p3.1)\.
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. Graves, I\. Antonoglou, D\. Wierstra, and M\. Riedmiller \(2013\)Playing atari with deep reinforcement learning\.arXiv preprint arXiv:1312\.5602\.Cited by:[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p3.1)\.
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. A\. Rusu, J\. Veness, M\. G\. Bellemare, A\. Graves, M\. Riedmiller, A\. K\. Fidjeland, G\. Ostrovski,et al\.\(2015\)Human\-level control through deep reinforcement learning\.Nature518\(7540\),pp\. 529–533\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p2.6),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p3.1)\.
- T\. Pham, G\. De Magistris, and R\. Tachibana \(2018\)OptLayer – practical constrained optimization for deep reinforcement learning in the real world\.InIEEE International Conference on Robotics and Automation \(ICRA\),pp\. 6236–6243\.External Links:[Document](https://dx.doi.org/10.1109/ICRA.2018.8460547)Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p5.1),[§2\.2](https://arxiv.org/html/2606.10979#S2.SS2.p2.1)\.
- W\. B\. Powell \(2011\)Approximate dynamic programming: solving the curses of dimensionality\.2 edition,John Wiley & Sons\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p1.1)\.
- M\. L\. Puterman \(2014\)Markov decision processes: discrete stochastic dynamic programming\.John Wiley & Sons\.Cited by:[§3\.1](https://arxiv.org/html/2606.10979#S3.SS1.p1.1)\.
- A\. Rajeswaran, V\. Kumar, A\. Gupta, G\. Vezzani, J\. Schulman, E\. Todorov, and S\. Levine \(2018\)Learning complex dexterous manipulation with deep reinforcement learning and demonstrations\.InProceedings of Robotics: Science and Systems \(RSS\),Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p2.6)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p2.6),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p3.1)\.
- A\. L\. Stolyar \(2004\)Maxweight scheduling in a generalized switch: state space collapse and workload minimization in heavy traffic\.The Annals of Applied Probability14\(1\),pp\. 1–53\.Cited by:[§6\.2](https://arxiv.org/html/2606.10979#S6.SS2.p4.12)\.
- J\. Sun, J\. Dai, and P\. Shi \(2024\)Inpatient overflow management with proximal policy optimization\.arXiv preprint arXiv:2410\.13767\.Cited by:[§2\.2](https://arxiv.org/html/2606.10979#S2.SS2.p3.1),[§6\.1](https://arxiv.org/html/2606.10979#S6.SS1.p1.13),[§6\.3\.2](https://arxiv.org/html/2606.10979#S6.SS3.SSS2.p3.2),[§6](https://arxiv.org/html/2606.10979#S6.p1.1)\.
- R\. S\. Sutton, A\. G\. Barto,et al\.\(1998\)Reinforcement learning: an introduction\.Vol\.1,MIT press Cambridge\.Cited by:[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p2.6),[§3\.1](https://arxiv.org/html/2606.10979#S3.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.10979#S3.SS1.p3.5)\.
- C\. Tang, B\. Abbatematteo, J\. Hu, R\. Chandra, R\. Martín\-Martín, and P\. Stone \(2025\)Deep reinforcement learning for robotics: a survey of real\-world successes\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 28694–28698\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p2.6)\.
- L\. Tassiulas and A\. Ephremides \(1990\)Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks\.In29th IEEE Conference on Decision and Control,pp\. 2130–2132\.Cited by:[§6\.2](https://arxiv.org/html/2606.10979#S6.SS2.p4.12)\.
- A\. Tavakoli, F\. Pardo, and P\. Kormushev \(2018\)Action branching architectures for deep reinforcement learning\.InAAAI Conference on Artificial Intelligence,Cited by:[§2\.2](https://arxiv.org/html/2606.10979#S2.SS2.p1.1)\.
- M\. Vlastelica Pogančić, A\. Paulus, V\. Musil, G\. Martius, and M\. Rolínek \(2020\)Differentiation of blackbox combinatorial solvers\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2\.3](https://arxiv.org/html/2606.10979#S2.SS3.p3.1)\.
- P\. R\. Wurman, S\. Barrett, K\. Kawamoto, J\. MacGlashan, K\. Subramanian, T\. J\. Walsh, R\. Capobianco, A\. Devlic, F\. Eckert, F\. Fuchs,et al\.\(2022\)Outracing champion gran turismo drivers with deep reinforcement learning\.Nature602\(7896\),pp\. 223–228\.Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.10979#S2.SS1.p2.6)\.
- L\. Xu, B\. Wilder, E\. B\. Khalil, and M\. Tambe \(2025\)Reinforcement learning with combinatorial actions for coupled restless bandits\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p2.1),[§1](https://arxiv.org/html/2606.10979#S1.p5.1),[§2\.3](https://arxiv.org/html/2606.10979#S2.SS3.p2.1)\.
- T\. Zahavy, M\. Haroush, N\. Merlis, D\. J\. Mankowitz, and S\. Mannor \(2018\)Learn what not to learn: action elimination with deep reinforcement learning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2606.10979#S1.p5.1),[§2\.2](https://arxiv.org/html/2606.10979#S2.SS2.p2.1)\.

Similar Articles

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

Hugging Face Daily Papers

This paper introduces DeScore, a video reward model that decouples reasoning and scoring processes to improve training efficiency and generalization. It addresses the limitations of existing discriminative and generative reward models by using a 'think-then-score' paradigm with multimodal large language models.

What are MDPs? And how can we Solve them?

ML at Berkeley

This article explains the fundamentals of Markov Decision Processes (MDPs), a core framework in deep reinforcement learning, using an educational example of a student's daily decisions.