Generative OOD-regularized Model-based Policy Optimization
Summary
Introduces GORMPO, a density-regularized offline RL algorithm that uses generative density modeling to restrict policy updates to high-density areas, achieving 17% improvement on a real-world medical dataset and outperforming state-of-the-art baselines.
# Generative OOD-regularized Model-based Policy Optimization
Source: [https://arxiv.org/html/2605.24405](https://arxiv.org/html/2605.24405)
Aysin Tumay¹, Jiahe Huang¹, Elise Jortberg², Rose Yu¹ (¹University of California, San Diego; ²Abiomed)
###### Abstract
We study sequential decision-making with offline reinforcement learning (RL). Traditional offline RL policies may result in out-of-distribution (OOD) actions when training relies only on sparse offline representations. To ensure safe offline policies in a sparse state-action space, we explore how density estimation models can be integrated into model-based RL methods to avoid the OOD regions. Generative models are capable of explicitly modeling the density in sparse state-action spaces. Building on this, we introduce Generative OOD-regularized Model-based Policy Optimization (GORMPO), a density-regularized offline RL algorithm that uses generative density modeling to restrict policy updates to high-density areas of the dataset. Furthermore, we examine whether better OOD detection corresponds to better model-based offline policies. We compare (1) the OOD detection capabilities of various density estimators and (2) their performance within the GORMPO framework on a real-world medical dataset and sparse offline RL datasets. We theoretically guarantee GORMPO's performance under mild assumptions. Empirically, GORMPO outperforms state-of-the-art baselines by 17% on a real-world medical dataset and enhances the base model on the offline RL datasets. Our empirical findings show that better OOD detection generally results in improved policies in environments with stable dynamics, while conservative penalties with poor density estimation are favored when dynamics are uncertain.
## 1 Introduction
Offline reinforcement learning (RL) has shown great promise in safety-critical control tasks where interaction with an online environment is impossible. Applications have been demonstrated in sepsis Komorowski et al. ([2018](https://arxiv.org/html/2605.24405#bib.bib34)); Raghu et al. ([2017](https://arxiv.org/html/2605.24405#bib.bib35)) and cancer prediction Eckardt et al. ([2021](https://arxiv.org/html/2605.24405#bib.bib37)); Tseng et al. ([2020](https://arxiv.org/html/2605.24405#bib.bib36)), autonomous driving Bansal et al. ([2019](https://arxiv.org/html/2605.24405#bib.bib40)), and unmanned aerial vehicle control Brunke et al. ([2022](https://arxiv.org/html/2605.24405#bib.bib43)). However, distribution shift constitutes a significant problem for offline RL because the policy reaches out-of-distribution (OOD) regions, leading to unsafe actions. Offline RL also depends strongly on dataset coverage: limited state–action support leads to extrapolation error in Bellman backups Kumar et al. ([2019](https://arxiv.org/html/2605.24405#bib.bib57)). This issue is common in clinical settings because most patients receive adequate support and remain hemodynamically stable, while cases of insufficient treatment are rare and underrepresented, as seen in Figure [1](https://arxiv.org/html/2605.24405#S1.F1).


Figure 1: Left: reward–action space of our medical dataset with the sparse region in gray. Right: the transition rollout distribution versus train distribution modeled by KDE and NeuralODE. While transition rollouts exhibit low likelihoods indicating OOD behavior, KDE saturates in OOD regions with a large spike.

A core challenge in offline RL is mitigating distribution shift without being overly conservative. Prior model-free methods constrain the learned policy to the behavior distribution on the offline data only Fujimoto et al. ([2019](https://arxiv.org/html/2605.24405#bib.bib42)); Kumar et al. ([2020](https://arxiv.org/html/2605.24405#bib.bib39)). However, these constraints potentially suppress valid actions where the Q-function generalizes well An et al. ([2021](https://arxiv.org/html/2605.24405#bib.bib59)). Other model-free methods Wu et al. ([2022](https://arxiv.org/html/2605.24405#bib.bib5)); Mao et al. ([2023](https://arxiv.org/html/2605.24405#bib.bib19)) address this by explicitly regularizing the policy with behavior support based on the evidence lower bound (ELBO), or with importance sampling Mao et al. ([2023](https://arxiv.org/html/2605.24405#bib.bib19)), which might fall short on highly sparse offline data. Some model-based uncertainty-penalized methods Yu et al. ([2020](https://arxiv.org/html/2605.24405#bib.bib15)); Sun et al. ([2023](https://arxiv.org/html/2605.24405#bib.bib4)), built on top of model-based policy optimization (MBPO) Janner et al. ([2019](https://arxiv.org/html/2605.24405#bib.bib16)), address distribution shift by penalizing policy optimization with uncertainty. However, augmenting the dataset with model-generated rollouts can quickly drift into OOD states when the dynamics model is trained on sparse data and fails to generalize reliably.
Other works explored MBPO with kernel density estimation (KDE) regularization in healthcare Yan et al. ([2025](https://arxiv.org/html/2605.24405#bib.bib61)); Tumay et al. ([2025](https://arxiv.org/html/2605.24405#bib.bib14)). However, generative density estimation offers a more principled alternative for flexible OOD detection compared to KDE, especially in sparse and high-stakes tasks such as healthcare Zhai et al. ([2016](https://arxiv.org/html/2605.24405#bib.bib23)); Melnychuk et al. ([2023](https://arxiv.org/html/2605.24405#bib.bib22)); Ruff et al. ([2021](https://arxiv.org/html/2605.24405#bib.bib21)).
In this work, we propose Generative OOD-regularized Model-based Policy Optimization (GORMPO), a density-penalized model-based offline RL algorithm. To avoid the uncovered areas in the state-action space during dynamics model rollouts, we discount the rolled-out rewards based on the next state-action density. Lower-density areas are penalized more, which keeps the augmented data in-distribution (ID) during policy optimization. In addition, since our pipeline is highly flexible, we calibrate five different density estimation models to answer whether better OOD detection leads to better offline policies when used in GORMPO.
In summary, our primary contributions are:
- We propose GORMPO, a plug-and-play framework for any model-based offline RL algorithm that explicitly penalizes OOD state-action rollouts with generative models. GORMPO outperforms state-of-the-art baselines by 17% on our proprietary real-world medical dataset.
- We provide the first comprehensive evaluation of five distinct families of density estimators, including four generative models, for the OOD detection task.
- We demonstrate that expressive density models lead to better policies in stable dynamics, whereas pessimistic regularization is required for uncertain dynamics.
## 2 Related Work
**Constrained Offline RL.** Addressing distribution shift is central to offline reinforcement learning. A prominent class of methods mitigates extrapolation error by constraining the learned policy to the support of the behavior policy. Approaches such as Batch-Constrained Q-learning (BCQ) Fujimoto et al. ([2019](https://arxiv.org/html/2605.24405#bib.bib42)), Conservative Q-Learning (CQL) Kumar et al. ([2020](https://arxiv.org/html/2605.24405#bib.bib39)), and Behavior Regularized Actor-Critic (BRAC) Wu et al. ([2019](https://arxiv.org/html/2605.24405#bib.bib58)) employ divergence metrics to penalize deviations from the offline dataset. While effective at ensuring safety, distance-based constraints often induce excessive conservatism. By restricting learning to the neighborhood of observed data, they may suppress potentially optimal actions that remain plausible within the true underlying distribution An et al. ([2021](https://arxiv.org/html/2605.24405#bib.bib59)); Degrave et al. ([2022](https://arxiv.org/html/2605.24405#bib.bib60)).
To overcome these limitations, recent work has shifted to explicit regularization in offline RL. Model-free methods regularize the policy toward the behavior distribution Wu et al. ([2022](https://arxiv.org/html/2605.24405#bib.bib5)), reinforcing the patterns in the offline dataset during optimization. However, estimating the lower bound provides a coarse regularization. With explicit density estimation in model-free RL, CPED Zhang et al. ([2023](https://arxiv.org/html/2605.24405#bib.bib26)) integrates FlowGAN, highlighting the benefits of generative density models. Nevertheless, relying on the data support with weak expert demonstrations can still yield suboptimal policies. In this case, MBPO Janner et al. ([2019](https://arxiv.org/html/2605.24405#bib.bib16)) proves that learning a dynamics model helps in broader exploration of the state-action space. Some methods impose pessimism through uncertainty penalties Sun et al. ([2023](https://arxiv.org/html/2605.24405#bib.bib4)); Yu et al. ([2020](https://arxiv.org/html/2605.24405#bib.bib15)), quantifying uncertainty across an ensemble of dynamics models and penalizing value estimation and dynamics rollouts. LEQ Park and Lee ([2025](https://arxiv.org/html/2605.24405#bib.bib13)) further addresses value overestimation in model rollouts via lower expectile regression of $\lambda$-returns. SAMBO-RL Luo et al. ([2024](https://arxiv.org/html/2605.24405#bib.bib48)) introduces a shift-aware reward correction to mitigate model bias. However, uncertainty penalties, bias corrections, and value regularization can still be insufficient if rollouts drift into OOD regions when the dynamics model is trained on sparse data, requiring a density constraint.
Density regularization in model-based RL with kernel density estimators, such as OGSRL Yan et al. ([2025](https://arxiv.org/html/2605.24405#bib.bib61)) and CORMPO Tumay et al. ([2025](https://arxiv.org/html/2605.24405#bib.bib14)), has demonstrated success in suppressing OOD policies. However, the kernel density estimator is not expressive in high-dimensional settings. For MBPO, density regularization with an expressive generative model for transition rollouts is essential in sparse state-action domains.
**Generative Density Estimators.** Generative density estimation has advanced through three primary methodological paradigms. First, normalizing flows Rezende and Mohamed ([2015](https://arxiv.org/html/2605.24405#bib.bib1)) construct complex distributions from simple base measures via sequences of invertible transformations, facilitating exact likelihood evaluation. Second, diffusion and score-based generative models Song et al. ([2021](https://arxiv.org/html/2605.24405#bib.bib3)) model distributions through a forward noise-injection and learned reverse process. Recent iterations, such as improved DDPMs Ho et al. ([2020](https://arxiv.org/html/2605.24405#bib.bib2)) and EDM Karras et al. ([2022](https://arxiv.org/html/2605.24405#bib.bib71)), have further optimized log-likelihood estimation and sample fidelity. To address the computational cost of high-dimensional data, Latent Diffusion Models Rombach et al. ([2022](https://arxiv.org/html/2605.24405#bib.bib70)) and autoregressive methods Xu et al. ([2025](https://arxiv.org/html/2605.24405#bib.bib69)) shift the generative process into a compressed latent space. Third, high-capacity backbones such as ViT Dosovitskiy et al. ([2021](https://arxiv.org/html/2605.24405#bib.bib28)) and DiT Peebles and Xie ([2023](https://arxiv.org/html/2605.24405#bib.bib29)) have demonstrated remarkable success in modeling complex manifolds. When combined with continuous-time frameworks like Neural ODEs Chen et al. ([2018](https://arxiv.org/html/2605.24405#bib.bib6)), they offer a powerful inductive bias for capturing the underlying dynamics of temporal trajectories. Most recently, unified frameworks such as Flow Matching Lipman et al. ([2023](https://arxiv.org/html/2605.24405#bib.bib30)) and one-step approaches like MeanFlow Geng et al. ([2026](https://arxiv.org/html/2605.24405#bib.bib31)) have sought to reconcile flow and diffusion methods, significantly accelerating inference while maintaining expressive density estimation.
However, while individual classes of generative models have been applied in isolation, such as diffusion for planning Janner et al. ([2022](https://arxiv.org/html/2605.24405#bib.bib66)); Wang et al. ([2023](https://arxiv.org/html/2605.24405#bib.bib67)) or normalizing flows for policy constraints Akimov et al. ([2022](https://arxiv.org/html/2605.24405#bib.bib68)), a systematic evaluation of modern density estimators specifically for OOD detection within model-based offline RL remains underexplored.
## 3 Methodology
Figure 2: System diagram of GORMPO. We enhance MBPO (blue module) through generative OOD-regularization (red module). We sample an action $a_t$ and use the pretrained dynamics model to predict the next state $\hat{s}_{t+1}$ and reward $\hat{r}_t$. We then compute the likelihood of $(\hat{s}_{t+1}, a_t)$ under a pretrained generative density estimator and penalize $\hat{r}_t$ in low-density regimes, producing $\tilde{r}_t$. Finally, we store $(s_t, a_t, \tilde{r}_t, \hat{s}_{t+1})$ in the generated data buffer.

We propose GORMPO, a generative density estimator-based OOD regularization for model-based RL (see Figure [2](https://arxiv.org/html/2605.24405#S3.F2) and the algorithm pseudocode in Appendix [A](https://arxiv.org/html/2605.24405#A1)). In this section, we formulate the problem and detail the components of GORMPO.
### 3.1 Problem Setting
#### Markov Decision Process.
We formulate our setting as a Markov decision process (MDP), defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, r, \mu_0, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $T: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$, reward function $r(s,a): \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, initial state distribution $\mu_0$, and discount factor $\gamma \in (0,1)$. RL algorithms aim to find a policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$ that maximizes the expected cumulative discounted reward $\mathbb{E}_{\pi, s_0 \sim \mu_0}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$, where $s_0$ is the initial state. The optimal policy is defined as

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi, s_0 \sim \mu_0}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]. \tag{1}$$

In the offline RL setting, the algorithm only has access to a dataset $\mathcal{D}_{\text{env}} = \{(s, a, r, s')\}_{t=1}^{N}$, sampled from the environment by a behavior policy, and cannot interact with the environment.
#### Density Estimator Formulation.
Our objective for density estimation is to probabilistically model the density of next-state and action pairs. Our generative model learns an approximation $p_\theta(x_t)$ of the density, where $x_t = (s_{t+1}, a_t)$ at time $t$, by maximizing the expected log-likelihood on the training samples:

$$\max_{\theta} \; \mathbb{E}_{x_t \sim \mathcal{D}_{\text{env}}}\big[\log p_\theta(x_t)\big].$$

By doing so, the model becomes capable of estimating how likely a candidate pair of next state and action is under the learned dynamics distribution.
### 3.2 Density-based guardian for Offline RL
Our goal is to learn a policy $\pi$ that maximizes the expected return in the true MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, r, \mu_0, \gamma)$. We denote the value function under the true dynamics as $V^{\pi}_{\mathcal{M}}(s)$ and the expected return as

$$\eta_{\mathcal{M}}(\pi) = \mathbb{E}_{s_0 \sim \mu_0}\left[V^{\pi}_{\mathcal{M}}(s_0)\right].$$

Following the model-based approach, we jointly learn a transition dynamics model $\hat{T}_\phi(s' \mid s, a)$ and a reward model $\hat{r}_\psi(s, a)$ from $\mathcal{D}_{\text{env}}$. Model rollouts are generated by sampling $\hat{s}_{t+1} \sim \hat{T}_\phi(\cdot \mid s_t, a_t)$ and $\hat{r}_t \sim \hat{r}_\psi(\cdot \mid s_t, a_t)$. We define the model MDP as $\hat{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \hat{T}_\phi, \hat{r}_\psi, \mu_0, \gamma)$. We drop the parameters $\phi$ and $\psi$ for ease of notation. Let

$$\rho^{\pi}_{\hat{T}}(s_t, a_t) = \pi(a_t \mid s_t) \sum_{t'=0}^{\infty} \gamma^{t'} P^{\pi}_{\hat{T}}(s_{t'}) \tag{2}$$

denote the discounted occupancy measure under policy $\pi$ and dynamics $\hat{T}$, where $P^{\pi}_{\hat{T},t}(s)$ is the probability of visiting state $s$ at time $t$. We use $\rho^{\pi}$ for $\rho^{\pi}_{\hat{T}}(s_t, a_t)$ in Section [4](https://arxiv.org/html/2605.24405#S4) for ease of notation.
Our model-based offline RL architecture leverages data augmentation with the rollouts of a learned dynamics model, which exhibits varying degrees of accuracy across the state-action space. This directly controls the support of the training data during policy optimization. While the optimal policy under perfect dynamics may venture beyond the behavioral distribution to achieve higher returns, model inaccuracies in these out-of-distribution (OOD) regions can lead to catastrophic failures in high-stakes medical tasks.
To account for OOD rollouts during data augmentation, the density-based guardian (red module in Figure [2](https://arxiv.org/html/2605.24405#S3.F2)) distinguishes estimated in-distribution (ID) and OOD dynamics rollouts. Estimating the probability of $(\hat{s}_{t+1}, a_t)$ through different families of generative models is non-trivial because (1) likelihood scales vary substantially across model families, which makes the penalty level difficult to control (see Appendix [D](https://arxiv.org/html/2605.24405#A4) for likelihood scales), and (2) not all generative models are inherently density estimators, e.g., score-matching-based diffusion models. To address these challenges, we penalize the rolled-out reward with a tanh penalty, which ensures training stability and strict theoretical guarantees.
Given the density estimator $p_\theta$, we define the estimated density regularizer as

$$u(\hat{s}_{t+1}, a_t) = \tanh\big(\max\big(\tau - \log p_\theta(\hat{s}_{t+1}, a_t),\, 0\big)\big) \tag{3}$$

at time step $t$, where $\tau$ is the density threshold of ID data. $\tau$ is a hyperparameter chosen based on the validation set. Now define the regularized MDP as $\tilde{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \hat{T}, \tilde{r}, \mu_0, \gamma)$ with

$$\tilde{r}(s_t, a_t) = \hat{r}(s_t, a_t) - \lambda\, u(\hat{s}_{t+1}, a_t), \tag{4}$$

where $\lambda = \gamma c \cdot C_{\hat{T}}$, introduced in the next section. $C_{\hat{T}}$ establishes a link between the dynamics model error and the density-based penalty. The optimal policy for the density-penalized MDP is obtained by solving

$$\hat{\pi} = \arg\max_{\pi} \eta_{\tilde{\mathcal{M}}}(\pi). \tag{5}$$

Next, we explain the formulation of the density estimators incorporated in the GORMPO pipeline.
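The penalty in Eqs. (3) and (4) can be illustrated with a minimal sketch. The function names and the numeric values of the threshold, penalty weight, and log-densities are hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
import math


def density_penalty(log_p: float, tau: float) -> float:
    """Eq. (3): u = tanh(max(tau - log p, 0)), bounded between 0 and 1."""
    return math.tanh(max(tau - log_p, 0.0))


def penalized_reward(r_hat: float, log_p: float, tau: float, lam: float) -> float:
    """Eq. (4): r_tilde = r_hat - lam * u."""
    return r_hat - lam * density_penalty(log_p, tau)


# Hypothetical numbers, for illustration only.
tau, lam = -5.0, 2.0
r_in = penalized_reward(1.0, log_p=-3.0, tau=tau, lam=lam)    # log p >= tau: reward unchanged
r_edge = penalized_reward(1.0, log_p=-6.0, tau=tau, lam=lam)  # mild penalty: lam * tanh(1)
r_far = penalized_reward(1.0, log_p=-50.0, tau=tau, lam=lam)  # penalty saturates near lam
```

Because tanh saturates, the subtracted term never exceeds $\lambda$, whatever log-likelihood scale a given density estimator produces.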
### 3.3 Generative Density Estimators
In this section, we detail the specific density estimation models $p_\theta(\mathbf{x})$ employed within the GORMPO framework. For all models, we treat the next state-action pair $\mathbf{x} = (\mathbf{s}, \mathbf{a})$ as the joint input variable, where $\mathbf{s} \in \mathbb{R}^{N \times d_1}$, $\mathbf{a} \in \mathbb{R}^{N \times d_2}$, and $\mathbf{x} \in \mathbb{R}^{N \times (d_1 + d_2)}$, and $N$ is the sample size.
#### Kernel Density Estimation (KDE).
We utilize KDE as a non-parametric baseline to benchmark against generative approaches. The density estimate is given by $p_{\text{KDE}}(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} K_h(\mathbf{x} - \mathbf{x}_i)$, where $K_h$ is the Gaussian kernel with bandwidth $h$, and $\{\mathbf{x}_i\}_{i=1}^{N}$ represents the training dataset $\mathcal{D}_{\text{env}}$. To ensure computational efficiency in training, we approximate the summation with the $k$-nearest neighbors of $\mathbf{x}$.
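As a rough sketch of the $k$-nearest-neighbor approximation, the kernel sum can be restricted to the $k$ closest training points. The brute-force neighbor search and the choice to keep the $1/N$ normalization are assumptions made here for illustration; the text does not specify either.

```python
import math


def knn_kde_log_density(x, data, h, k):
    """Log of a Gaussian KDE whose kernel sum runs over the k nearest points only."""
    d = len(x)
    # Squared distances to every training point, keeping the k smallest.
    sq_dists = sorted(sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in data)[:k]
    # Log normalizer of an isotropic Gaussian kernel with bandwidth h.
    log_norm = -0.5 * d * math.log(2.0 * math.pi * h * h)
    exponents = [-sq / (2.0 * h * h) for sq in sq_dists]
    # Log-sum-exp keeps the computation stable for far-away queries.
    m = max(exponents)
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    return log_norm + log_sum - math.log(len(data))


# Hypothetical toy data: three clustered points and one outlier.
data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
near = knn_kde_log_density([0.1, 0.1], data, h=0.5, k=3)
far = knn_kde_log_density([20.0, 20.0], data, h=0.5, k=3)
```

With `k` equal to the dataset size, this reduces to the full KDE sum given above.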
#### Variational Autoencoders (VAE).
We adapt the standard VAE Kingma and Welling ([2014](https://arxiv.org/html/2605.24405#bib.bib49)) to the continuous state-action space. The model maximizes the ELBO on the log-likelihood:

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - D_{\text{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big), \tag{6}$$

where the prior is $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$. The approximate posterior is parameterized as a diagonal Gaussian $q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}\big(\boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))\big)$. At inference, we utilize the importance-weighted estimate Burda et al. ([2016](https://arxiv.org/html/2605.24405#bib.bib27)).
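A minimal sketch of an importance-weighted log-likelihood estimate follows, using a toy one-dimensional model in place of a trained VAE. The decoder $p(x \mid z) = \mathcal{N}(z, 1)$, the proposal parameters, and the sample count are assumptions for illustration only.

```python
import math
import random


def log_normal(x: float, mean: float, var: float) -> float:
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def iw_log_likelihood(x, n_samples, q_mean, q_var, rng):
    """log p(x) ~ log mean_k [ p(x|z_k) p(z_k) / q(z_k|x) ] with z_k ~ q(z|x)."""
    log_w = []
    for _ in range(n_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_w.append(
            log_normal(x, z, 1.0)             # toy decoder p(x|z) = N(z, 1)
            + log_normal(z, 0.0, 1.0)         # prior p(z) = N(0, 1)
            - log_normal(z, q_mean, q_var)    # proposal q(z|x)
        )
    m = max(log_w)  # log-mean-exp for numerical stability
    return m + math.log(sum(math.exp(w - m) for w in log_w) / n_samples)


rng = random.Random(0)
x = 1.3
# For this toy model the exact posterior is N(x/2, 1/2), so the estimate is exact.
estimate = iw_log_likelihood(x, n_samples=16, q_mean=x / 2.0, q_var=0.5, rng=rng)
```

In the toy model the marginal is $\mathcal{N}(0, 2)$, which gives a closed-form value to compare against; with a learned, imperfect posterior the estimate is a stochastic lower bound that tightens as the number of samples grows.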
#### Normalizing Flows (RealNVP).
To enable exact log-likelihood computation, we employ the RealNVP architecture Dinh et al. ([2017](https://arxiv.org/html/2605.24405#bib.bib55)). The model learns a bijection $f_\theta: \mathcal{X} \to \mathcal{Z}$ mapping the complex state-action distribution to a standard Gaussian latent space $\mathcal{Z}$. The log-density is computed via the change-of-variables formula:

$$\log p_\theta(\mathbf{x}) = \log p_{\mathcal{Z}}(f_\theta(\mathbf{x})) + \log\left|\det \frac{\partial f_\theta(\mathbf{x})}{\partial \mathbf{x}^{T}}\right|. \tag{7}$$

We utilize coupling layers to ensure the Jacobian determinant is computationally tractable.
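To illustrate why coupling layers keep the Jacobian determinant tractable, here is a sketch of a single two-dimensional affine coupling layer. The fixed `scale` and `shift` functions are hypothetical stand-ins for learned networks, and a real RealNVP stacks many such layers with alternating masks.

```python
import math


def scale(x1: float) -> float:
    """Hypothetical stand-in for the learned log-scale network s(x1)."""
    return 0.5 * math.tanh(x1)


def shift(x1: float) -> float:
    """Hypothetical stand-in for the learned translation network t(x1)."""
    return 0.3 * x1


def coupling_forward(x1: float, x2: float):
    """z1 = x1, z2 = x2 * exp(s(x1)) + t(x1); the Jacobian is triangular."""
    s = scale(x1)
    return x1, x2 * math.exp(s) + shift(x1), s  # log|det J| = s


def coupling_inverse(z1: float, z2: float):
    """Exact inverse, available without inverting s or t."""
    return z1, (z2 - shift(z1)) * math.exp(-scale(z1))


def log_prob(x1: float, x2: float) -> float:
    """Change of variables: standard normal base log-density plus log|det J|."""
    z1, z2, log_det = coupling_forward(x1, x2)
    log_base = -0.5 * (z1 * z1 + z2 * z2) - math.log(2.0 * math.pi)
    return log_base + log_det


z1, z2, _ = coupling_forward(0.7, -1.2)
x1_rec, x2_rec = coupling_inverse(z1, z2)
```

The log-determinant is just the scale output, so evaluating the density costs one forward pass.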
#### Diffusion Probabilistic Models.
We model the density of $\mathbf{x}$ using a Denoising Diffusion Probabilistic Model (DDPM) (Ho et al., [2020](https://arxiv.org/html/2605.24405#bib.bib2)). The forward process $q(\mathbf{x}^{(k)} \mid \mathbf{x}^{(k-1)})$ incrementally adds Gaussian noise over $K$ steps. We learn a reverse process parameterized by a noise prediction network $\boldsymbol{\epsilon}_\theta(\mathbf{x}^{(k)}, k)$ trained to minimize:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{\mathbf{x}^{(0)}, \boldsymbol{\epsilon}, k}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_k}\,\mathbf{x}^{(0)} + \sqrt{1 - \bar{\alpha}_k}\,\boldsymbol{\epsilon},\, k\right)\right\|^2\right]. \tag{8}$$

Exact likelihood computation requires evaluating the ELBO over the full trajectory ($K \approx 1000$ steps), which is computationally expensive for online OOD detection. To ensure tractability, we estimate the likelihood using a strided inference schedule, summing the variational terms over a subsampled sequence of timesteps $\mathcal{S} \subset \{1, \dots, K\}$ where $|\mathcal{S}| \ll K$ (e.g., 20 to 50 steps). This reduces the inference complexity from $O(K)$ to $O(|\mathcal{S}|)$ while preserving sufficient accuracy to distinguish ID transitions from OOD.
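The strided schedule can be sketched as follows. Evenly spaced timesteps and the placeholder per-step term are assumptions; the text does not state how the subsequence is chosen, and a real per-step term would wrap the trained noise-prediction network.

```python
from typing import Callable, List


def strided_timesteps(K: int, n_steps: int) -> List[int]:
    """Evenly strided subset of {1, ..., K} with at most n_steps elements."""
    stride = max(K // n_steps, 1)
    return list(range(1, K + 1, stride))[:n_steps]


def strided_bound(term: Callable[[int], float], K: int, n_steps: int) -> float:
    """Sum a per-timestep variational term over the strided subset only."""
    return sum(term(k) for k in strided_timesteps(K, n_steps))


steps = strided_timesteps(K=1000, n_steps=50)
# Placeholder term: a constant stands in for the network-dependent quantity.
approx = strided_bound(lambda k: 1.0, K=1000, n_steps=50)
```

Only `len(steps)` network evaluations are needed per query instead of `K`.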
#### Continuous Normalizing Flows (Neural ODE).
As a continuous-time counterpart to normalizing flows, we model the evolution of the density using a Neural ODE (Chen et al., [2018](https://arxiv.org/html/2605.24405#bib.bib6)). The state-action pair is treated as the terminal state of a continuous transformation $\mathbf{z}(\tau)$ governed by the ODE $d\mathbf{z}(\tau)/d\tau = f_\theta(\mathbf{z}(\tau), \tau)$. The log-density is obtained by integrating the instantaneous change of variables:

$$\log p_\theta(\mathbf{x}) = \log p_0(\mathbf{z}(0)) - \int_0^T \text{Tr}\left(\frac{\partial f_\theta}{\partial \mathbf{z}(\tau)}\right) d\tau. \tag{9}$$

This formulation allows for smooth density estimation that respects the continuous nature of physical control tasks.
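Eq. (9) can be checked numerically on a toy problem. The linear one-dimensional vector field and the fixed-step Euler integrator below are assumptions for illustration; practical Neural ODE implementations use a learned network and an adaptive solver.

```python
import math


def f(z: float, t: float) -> float:
    """Hypothetical 1-D vector field standing in for the learned f_theta."""
    return 0.5 * z


def df_dz(z: float, t: float) -> float:
    """Its derivative in z; in one dimension this is the Jacobian trace."""
    return 0.5


def cnf_log_prob(x: float, T: float = 1.0, n_steps: int = 1000) -> float:
    """log p(x) = log p_0(z(0)) - integral_0^T Tr(df/dz) dtau, via Euler steps."""
    dt = T / n_steps
    z, trace_integral = x, 0.0
    for i in range(n_steps):
        t = T - i * dt
        trace_integral += df_dz(z, t) * dt
        z -= f(z, t) * dt  # integrate backwards from z(T) = x towards z(0)
    log_p0 = -0.5 * (z * z + math.log(2.0 * math.pi))  # standard normal base
    return log_p0 - trace_integral


# For this linear field, z(T) = z(0) * exp(0.5 * T), so x ~ N(0, exp(T)).
T = 1.0
exact = -0.5 * (math.log(2.0 * math.pi * math.exp(T)) + 1.0 / math.exp(T))
approx = cnf_log_prob(1.0, T=T)
```

The numerical value agrees with the closed-form Gaussian log-density up to the Euler discretization error.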
## 4 Theoretical Results
In this section, we theoretically show that GORMPO has a guaranteed performance in the real MDP $\mathcal{M}$ and achieves near-optimal performance under the learned dynamics. Our guarantee relies on three assumptions. Assumption [4.1](https://arxiv.org/html/2605.24405#S4.Thmtheorem1) on bounded rewards is standard in RL theory and holds for most practical applications. Assumption [4.2](https://arxiv.org/html/2605.24405#S4.Thmtheorem2) captures epistemic uncertainty in supervised learning, where model performance degrades as we move away from the training distribution. Assumption [4.3](https://arxiv.org/html/2605.24405#S4.Thmtheorem3) captures the quality of the density estimation model. For ease of notation, we denote the next state and action at any $t$ as $s'$ and $a$, respectively.
###### Assumption 4.1 (Bounded Rewards and Value Functions).
The reward function is bounded: $|r(s,a)| \leq r_{\max}$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$. Consequently, $V^{\pi}_{\mathcal{M}} \in c\mathcal{F}$, where $\mathcal{F} = \{f : \|f\|_\infty \leq 1\}$ and $c = r_{\max}/(1-\gamma)$.
###### Assumption 4.2 (Density-Dependent Model Error).
There exists a constant $C_{\hat{T}} > 0$ such that for all $(s,a) \in \mathcal{S} \times \mathcal{A}$:

$$d_{\mathcal{F}}\big(\hat{T}(s,a), T(s,a)\big) \leq C_{\hat{T}} \cdot \Big[\mathbb{E}_{s' \sim \hat{T}(s,a)}\big[u(s', a)\big]\Big] + \epsilon_{\text{approx}}, \tag{10}$$

where $d_{\mathcal{F}}$ is the integral probability metric w.r.t. $\mathcal{F}$, and $\epsilon_{\text{approx}} > 0$ represents an irreducible approximation error.
$C_{\hat{T}}$ depends on the model class capacity and the smoothness of the true dynamics, discussed in Appendix [B](https://arxiv.org/html/2605.24405#A2).
###### Assumption 4.3 (Density Estimation Error).
There exists a constant $\epsilon_{\text{density}} > 0$ such that for all $(s', a) \in \mathcal{S} \times \mathcal{A}$:

$$\left|\log p_\theta(s', a) - \log p(s', a)\right| \leq \epsilon_{\text{density}}, \tag{11}$$

where $p_\theta(s', a)$ is the learned density estimator and $p(s', a)$ is the true behavioral density from the offline dataset.
###### Theorem 4.4 (Conservative Value Bound with Density Error).
Under Assumptions 4.1–4.3, for any policy $\pi$:

$$\eta_{\mathcal{M}}(\pi) \geq \eta_{\tilde{\mathcal{M}}}(\pi) - \gamma c\, \epsilon_{\text{approx}} - \lambda\, \epsilon_{\text{density}}, \tag{12}$$

where $\lambda = \gamma c\, C_{\hat{T}}$ is the penalty weight.
###### Theorem 4.5 (Optimality Gap with Density Error).
Let $\pi^{*}$ be the optimal policy for the true MDP $\mathcal{M}$ and $\hat{\pi}$ be the solution to the density-regularized MDP $\tilde{\mathcal{M}}$ with penalty weight $\lambda = \gamma c\, C_{\hat{T}}$:

$$\hat{\pi} = \arg\max_{\pi} \mathbb{E}_{(s,a) \sim \rho^{\pi}}\big[\hat{r}(s,a) - \lambda\, u(\hat{s}', a)\big].$$

Define $\delta_{\min} = \min_{\pi} \mathbb{E}_{(s,a) \sim \rho^{\pi}}[u(s', a)]$, where $u(s', a)$ is the true regularizer. Then for any $\delta \geq \delta_{\min}$:

$$\eta_{\mathcal{M}}(\hat{\pi}) \geq \max_{\pi:\, \mathbb{E}[u] \leq \delta + \epsilon_{\text{density}}} \eta_{\mathcal{M}}(\pi) - 2\lambda\,(\delta + 2\epsilon_{\text{density}}) - 2\gamma c\, \epsilon_{\text{approx}}. \tag{13}$$
With Theorem [4.4](https://arxiv.org/html/2605.24405#S4.Thmtheorem4), we demonstrate that the lower bound on performance in the true MDP degrades linearly with the density estimation error $\epsilon_{\text{density}}$. This shows that improving the density estimator, i.e., reducing $\epsilon_{\text{density}}$, directly improves the guaranteed performance. In Theorem [4.5](https://arxiv.org/html/2605.24405#S4.Thmtheorem5), we show that the policy learned under imperfect density estimation is competitive with the best policy satisfying a relaxed constraint. The relaxation is proportional to $\epsilon_{\text{density}}$, meaning better density estimation allows comparison with a less conservative constraint set. We improve the theoretical results of (Tumay et al., [2025](https://arxiv.org/html/2605.24405#bib.bib14)) with stronger guarantees in $\mathcal{M}$, and better near-optimal policies maintaining a low penalty in $\tilde{\mathcal{M}}$. See the proofs, lemmas, and further discussion on assumptions in Appendix [B](https://arxiv.org/html/2605.24405#A2).
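The linear dependence on the density error in Theorem 4.4 can be made concrete with a small arithmetic sketch. All constants below are hypothetical and were picked only to keep the numbers easy to follow; the function names are not from the paper.

```python
def value_scale(r_max: float, gamma: float) -> float:
    """c = r_max / (1 - gamma), as in Assumption 4.1."""
    return r_max / (1.0 - gamma)


def penalty_weight(gamma: float, c: float, C_T: float) -> float:
    """lambda = gamma * c * C_T."""
    return gamma * c * C_T


def bound_slack(gamma, c, lam, eps_approx, eps_density) -> float:
    """Slack subtracted in Theorem 4.4: gamma*c*eps_approx + lam*eps_density."""
    return gamma * c * eps_approx + lam * eps_density


# Hypothetical constants, chosen only to make the arithmetic easy to follow.
gamma, r_max, C_T = 0.5, 1.0, 0.5
c = value_scale(r_max, gamma)        # 2.0
lam = penalty_weight(gamma, c, C_T)  # 0.5
slack_coarse = bound_slack(gamma, c, lam, eps_approx=0.1, eps_density=0.4)
slack_fine = bound_slack(gamma, c, lam, eps_approx=0.1, eps_density=0.2)
```

Halving the density error from 0.4 to 0.2 tightens the bound by `lam * 0.2`, while the term driven by the approximation error is unaffected.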
## 5 Results
In this section, we present our experimental setup and the results on the real-world medical dataset and sparse D4RL datasets. See Appendix [C](https://arxiv.org/html/2605.24405#A3) for hyperparameter sensitivity and penalty type ablations.
### 5.1 Experiment Setup
#### Dataset and Task.
We use a medical dataset that includes cardiogenic shock patient physiologic features and mechanical circulatory support inputs. Cardiogenic Shock (CGS) is a syndrome characterized by cardiac output insufficient for end-organ perfusion, where mechanical circulatory support (MCS) devices play an integral role in aiding heart muscle recovery Vahdatpour et al. ([2019](https://arxiv.org/html/2605.24405#bib.bib17)); Tumay et al. ([2025](https://arxiv.org/html/2605.24405#bib.bib14)). Our task is to learn a safe MCS weaning policy by reducing the pump level (P-level) while maintaining stable hemodynamics. This task reflects sparse state-action coverage because few patients receive insufficient MCS and remain unstable, which leaves the low P-level and low reward region in Figure [1](https://arxiv.org/html/2605.24405#S1.F1) underrepresented.
#### Baselines\.
Our base model is MBPO Janner et al. ([2019](https://arxiv.org/html/2605.24405#bib.bib16)), which uses a learned dynamics model to generate short synthetic rollouts that are then mixed with real environment data to train a standard model-free policy. For SOTA offline RL baselines, we select the model-free SPOT Wu et al. ([2022](https://arxiv.org/html/2605.24405#bib.bib5)), which uses a VAE to penalize the actor network with an ELBO loss, and the state-of-the-art model-based MOBILE Sun et al. ([2023](https://arxiv.org/html/2605.24405#bib.bib4)), which penalizes the Bellman estimate based on the disagreement among the dynamics models.
#### Hyperparameter Tuning\.
Based on Brandfonbrener et al. ([2021](https://arxiv.org/html/2605.24405#bib.bib9)); Kurenkov and Kolesnikov ([2022](https://arxiv.org/html/2605.24405#bib.bib10)), we tune $\lambda$ over a small set of candidate values by allowing access to the online simulation environment for 100 episodes (see Appendix [E](https://arxiv.org/html/2605.24405#A5)).
#### OOD Experiment Setup\.
We investigate the OOD detection performance of KDE, VAE, RealNVP, diffusion, and NeuralODE models. The goal is to emulate the OOD transitions encountered during policy training, since a controlled OOD detection experiment directly on transition model rollouts is not trivial. We generate OOD datasets by concatenating the datasets with copies perturbed by additive Gaussian noise $\mathcal{N}(\mu, 0.1)$ (see Appendix [F](https://arxiv.org/html/2605.24405#A6)). The mean of the added noise is labeled as “OOD Distance” in Figures [4](https://arxiv.org/html/2605.24405#S5.F4) and [7](https://arxiv.org/html/2605.24405#S5.F7). Smaller OOD distance values correspond to a harder task because the ID and OOD samples are poorly separated. For evaluation, we report accuracy, true positive rate (TPR), and true negative rate (TNR), computed with a threshold selected on the validation set, and ROC AUC for OOD-ID separation.
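The evaluation protocol above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: `toy_log_prob` is a hypothetical stand-in for a trained density estimator, the 0.1 in $\mathcal{N}(\mu, 0.1)$ is treated as a standard deviation, and the threshold is placed so that about 1% of validation ID samples fall below it (the quantile mentioned in the OOD detection results). All function names are illustrative.

```python
import math
import random


def make_ood_eval_set(id_samples, mu, noise_scale=0.1, seed=0):
    """Concatenate ID samples with noise-shifted copies (label 0 = ID, 1 = OOD)."""
    rng = random.Random(seed)
    # Assumption: 0.1 in N(mu, 0.1) is used here as the standard deviation.
    shifted = [[x + rng.gauss(mu, noise_scale) for x in row] for row in id_samples]
    return id_samples + shifted, [0] * len(id_samples) + [1] * len(shifted)


def quantile_threshold(val_log_probs, frac=0.01):
    """Log-density threshold below which roughly `frac` of validation ID samples fall."""
    ordered = sorted(val_log_probs)
    return ordered[int(frac * len(ordered))]


def roc_auc(ood_scores, labels):
    """Probability that a random OOD sample scores higher than a random ID sample."""
    pos = [s for s, y in zip(ood_scores, labels) if y == 1]
    neg = [s for s, y in zip(ood_scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(log_probs, labels, tau):
    """Accuracy, TPR, and TNR when samples with log-density below tau are flagged OOD."""
    preds = [int(lp < tau) for lp in log_probs]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {"accuracy": (tp + tn) / len(labels), "tpr": tp / n_pos, "tnr": tn / n_neg}


def toy_log_prob(row):
    """Stand-in density estimator: standard normal log-density (not a trained model)."""
    return sum(-0.5 * x * x - 0.5 * math.log(2 * math.pi) for x in row)


data_rng = random.Random(1)
id_data = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(300)]
val_data = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(300)]

tau = quantile_threshold([toy_log_prob(r) for r in val_data])
samples, labels = make_ood_eval_set(id_data, mu=3.0)
log_probs = [toy_log_prob(r) for r in samples]
auc = roc_auc([-lp for lp in log_probs], labels)  # higher score = more OOD
metrics = threshold_metrics(log_probs, labels, tau)
print(round(auc, 3), metrics)
```

Because the threshold deliberately flags about 1% of ID samples, TNR is capped near 0.99 even for a perfect density model, while ROC AUC does not depend on the threshold.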
### 5.2 Real-world Medical Dataset Experiments
In this section, we provide the OOD detection and offline RL results on the real-world medical dataset. All evaluations use the Transformer Digital Twin (TDT) Tumay et al. ([2025](https://arxiv.org/html/2605.24405#bib.bib14)) with 6-hour rollouts, matching the dataset’s deterministic 6-hour episode length. We utilize average reward, weaning score (WS), and action change penalty (ACP) for evaluation, where reward signifies the patient’s stability based on physiological features; weaning score quantifies the successful decrease in the P-level after observing patient stability as follows Tumay et al. ([2025](https://arxiv.org/html/2605.24405#bib.bib14)):
$$\text{WS}=\frac{\sum_{i=1}^{T}\mathbb{I}(\texttt{Is\_Stable}(i),1)\cdot\texttt{Weaned}(i)}{\sum_{i=1}^{T}\mathbb{I}(\texttt{Is\_Stable}(i),1)} \tag{14}$$

where $T$ is the episode length. See Appendix [G](https://arxiv.org/html/2605.24405#A7) for $\texttt{Is\_Stable}$, $\texttt{Weaned}$, and dataset details. Action change penalty (ACP) Yan et al. ([2025](https://arxiv.org/html/2605.24405#bib.bib61)); Tumay et al. ([2025](https://arxiv.org/html/2605.24405#bib.bib14)) accumulates the magnitude of P-level changes over an episode of length $T$ as

$$\text{ACP}=\sum_{i=1}^{T}\|a_{i-1}-a_{i}\|_{2},\quad\text{if }\|a_{i-1}-a_{i}\|_{2}>2.$$
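To make the two metrics concrete, the following sketch computes WS and ACP for a single episode. It assumes that `is_stable` and `weaned` are already available as binary per-step sequences (their actual definitions are in Appendix G and are not reproduced here), that the P-level is a scalar so the L2 norm reduces to an absolute value, and that WS is reported as 0 when no step is stable; the last convention is our assumption, not the paper's.

```python
def weaning_score(is_stable, weaned):
    """WS: fraction of stable time steps at which the P-level was weaned."""
    stable_steps = [i for i, s in enumerate(is_stable) if s == 1]
    if not stable_steps:
        return 0.0  # assumption: WS defaults to 0 when the patient is never stable
    return sum(weaned[i] for i in stable_steps) / len(stable_steps)


def action_change_penalty(p_levels, min_change=2.0):
    """ACP: sum of |a_{i-1} - a_i| over steps whose change exceeds min_change."""
    total = 0.0
    for prev, curr in zip(p_levels, p_levels[1:]):
        change = abs(prev - curr)  # L2 norm of a scalar action difference
        if change > min_change:
            total += change
    return total


# Hypothetical 6-step episode.
is_stable = [1, 1, 0, 1, 1, 0]
weaned = [0, 1, 1, 1, 0, 0]
p_levels = [8, 8, 5, 4, 4, 1, 2]  # a_0 ... a_6

ws = weaning_score(is_stable, weaned)   # 2 weaned steps out of 4 stable steps
acp = action_change_penalty(p_levels)   # only the two jumps of size 3 count
print(ws, acp)  # 0.5 6.0
```

Only abrupt changes larger than two P-levels contribute to ACP, so a policy that weans by one level at a time incurs no penalty.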
Figure 3: Penalty during training rollouts for 200000 steps in our medical dataset. GORMPO-Diffusion and VAE overlap at 0.

Figure 4: OOD detection performance of density estimation models trained on the real-world medical dataset in terms of ROC AUC and accuracy. OOD distance denotes the mean shift $\mu$ in $\mathcal{N}(\mu, 0.1)$, with larger $\mu$ making OOD detection easier.

#### OOD Detection Results.
On our proprietary medical dataset, NeuralODE achieves the best ROC AUC and the highest accuracy as the OOD detection task gets harder. The other models perform close to NeuralODE, indicating that this dataset is well suited to our sparsity problem. This suggests that the valid medical states lie on a sharp, low-dimensional manifold. The physiological rules governing patient stability create a clear separation from random or unsafe regions, simplifying the density estimation task. RealNVP’s accuracy stays roughly constant even as ROC AUC improves because the density threshold is chosen on a validation set so that 1% of ID samples are labeled as OOD. This setup introduces an unavoidable error floor in accuracy.
Figure 5: Comparison of different versions of GORMPO against offline RL baselines on our proprietary medical dataset in reward ($\uparrow$), WS ($\uparrow$), and ACP ($\downarrow$) on 5 seeds. $\uparrow$: higher is better; $\downarrow$: lower is better. GORMPO outperforms baselines in Reward and WS. SPOT outperforms in ACP.

Figure 6: On our proprietary medical dataset, we compare the recommended GORMPO P-levels (solid blue) against offline RL baselines in 6-hour mean arterial pressure (MAP) rollouts (solid red), which is a significant variable for hemodynamic stability, on an OOD expert policy (low reward - low P-level). Since TDT predicts stable hemodynamics above the MAP clinical threshold, an ideal weaning policy is to decrease the P-level every 1 hour, which GORMPO-VAE and NeuralODE successfully follow.

Figure 7: OOD detection performance of KDE, VAE, RealNVP, diffusion, and NeuralODE models trained on sparse halfcheetah, hopper, and walker2d in ROC AUC, accuracy, TNR, and TPR. OOD distance denotes the mean shift $\mu$ in $\mathcal{N}(\mu, 0.1)$, with larger $\mu$ making OOD detection easier.
#### RL Results\.
We further show the performance of GORMPO on our medical dataset in Figures [5](https://arxiv.org/html/2605.24405#S5.F5) and [6](https://arxiv.org/html/2605.24405#S5.F6) (see the table in Appendix [D](https://arxiv.org/html/2605.24405#A4)). GORMPO-RealNVP outperforms baselines in reward by 17% and in WS by 13%. GORMPO-VAE, DDPM, and NeuralODE succeed in reducing MBPO’s high ACP but underperform SPOT, which is less prone to exploration due to its model-free nature. However, following the clinician behavior is not always optimal, as depicted by the observed P-level in Figure [6](https://arxiv.org/html/2605.24405#S5.F6), due to high stochasticity. We depict the weaning performance on an OOD sample in Figure [6](https://arxiv.org/html/2605.24405#S5.F6) (see more in Appendix [D](https://arxiv.org/html/2605.24405#A4)). GORMPO variants wean successfully while preserving patient stability, keeping the mean arterial pressure (MAP) above the critical limit in Figure [6](https://arxiv.org/html/2605.24405#S5.F6). Patient stability (reward) is maintained with GORMPO-RealNVP and NeuralODE, while VAE and DDPM result in a suboptimal patient trajectory since they fail to penalize the policy enough to avoid unsafe actions, as shown in Figure [3](https://arxiv.org/html/2605.24405#S5.F3.3).
### 5.3 Sparse D4RL Benchmark Experiments
In this section, we provide the OOD detection and offline RL results of GORMPO using both MBPO Janner et al. ([2019](https://arxiv.org/html/2605.24405#bib.bib16)) and MOBILE Sun et al. ([2023](https://arxiv.org/html/2605.24405#bib.bib4)) as base models on the sparse D4RL datasets.
#### Datasets\.
We show performance on the Gym-MuJoCo tasks of the D4RL Fu et al. ([2021](https://arxiv.org/html/2605.24405#bib.bib33)) benchmark at the medium-expert level due to its high resemblance to real-world clinical decision dynamics, i.e., a clinician adjusting P-level on MCS without access to patient hemodynamic forecasts. We create the sparse datasets by randomly discarding 50% of the trajectories in a designated unsafe area, which creates sparsity in the action-reward space. See Appendix [H](https://arxiv.org/html/2605.24405#A8) for more details.
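The sparsification step can be illustrated with a short sketch. The unsafe-region predicate below is hypothetical (the paper defines the designated area in Appendix H), as are the trajectory fields `return` and `action_norm`; only the "randomly discard 50% of the trajectories in that region" rule comes from the text.

```python
import random


def sparsify(trajectories, in_unsafe_region, drop_frac=0.5, seed=0):
    """Randomly drop `drop_frac` of the trajectories that fall in the unsafe region."""
    rng = random.Random(seed)
    unsafe_idx = [i for i, t in enumerate(trajectories) if in_unsafe_region(t)]
    n_drop = int(drop_frac * len(unsafe_idx))
    dropped = set(rng.sample(unsafe_idx, n_drop))
    return [t for i, t in enumerate(trajectories) if i not in dropped]


def is_unsafe(traj):
    """Hypothetical predicate: low return combined with low action magnitude."""
    return traj["return"] < 30.0 and traj["action_norm"] < 0.5


gen = random.Random(0)
trajs = [
    {"return": gen.uniform(0, 100), "action_norm": gen.uniform(0, 1)}
    for _ in range(200)
]
sparse_trajs = sparsify(trajs, is_unsafe)

n_unsafe_before = sum(is_unsafe(t) for t in trajs)
n_unsafe_after = sum(is_unsafe(t) for t in sparse_trajs)
print(len(trajs), len(sparse_trajs), n_unsafe_before, n_unsafe_after)
```

Trajectories outside the designated region are left untouched, so the sparsity is confined to one part of the action-reward space rather than spread over the whole dataset.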
#### OOD Detection Results\.
Figure [7](https://arxiv.org/html/2605.24405#S5.F7) shows the OOD detection performance: VAE is the best in halfcheetah, and RealNVP is the best in hopper and walker2d. In halfcheetah, VAE performs consistently close to the top performance in all metrics, while other models demonstrate inconsistencies. Since the halfcheetah dataset comprises two distinct modes, a variational generative model with a successful threshold selection can display robust performance. In hopper and walker2d, RealNVP achieves the strongest OOD detection. Since there are many outliers around the main modes, capturing the heavy tails benefits from an explicit density estimator rather than models based on a lower bound estimation (see Appendix [F](https://arxiv.org/html/2605.24405#A6)). Diffusion and NeuralODE exhibit the poorest separation, with accuracy levels approaching random chance, possibly due to conservative threshold selection, which prevents these models from assigning the sharply distinct likelihoods required for robust OOD detection.
Table 1: Performance comparison of GORMPO to offline RL baselines in average normalized reward $\pm$ standard deviation over 3 seeds on sparse D4RL medium-expert datasets. The best-performing density estimator is consistent across both GORMPO variants, also improving the base models.

| Dataset | SPOT | Base model | GORMPO-KDE | GORMPO-VAE | GORMPO-RealNVP | GORMPO-DDPM |
|---|---|---|---|---|---|---|
| *Base model: MOBILE, GORMPO (MOBILE-based)* | | | | | | |
| halfcheetah-medium-expert-sparse | 76.4 ± 3.22 | 104.0 ± 1.63 | 103.7 ± 1.39 | **106.0 ± 0.54** | 104.0 ± 1.18 | 103.0 ± 1.44 |
| hopper-medium-expert-sparse | 83.8 ± 23.3 | 95.2 ± 24.5 | 95.2 ± 24.5 | 113.0 ± 0.53 | 95.5 ± 29.8 | **113.0 ± 0.62** |
| walker2d-medium-expert-sparse | 113.5 ± 0.54 | 116.0 ± 2.44 | 117.0 ± 1.08 | 117.0 ± 4.34 | **118.0 ± 2.66** | 117.0 ± 4.45 |
| *Base model: MBPO, GORMPO (MBPO-based)* | | | | | | |
| halfcheetah-medium-expert-sparse | 76.4 ± 3.22 | 63.7 ± 8.23 | 78.4 ± 16.9 | **89.9 ± 1.50** | 86.0 ± 8.34 | 85.7 ± 4.18 |
| hopper-medium-expert-sparse | 83.8 ± 23.3 | 4.81 ± 3.18 | 8.91 ± 12.89 | 2.32 ± 0.68 | 10.0 ± 9.02 | **14.1 ± 8.66** |
| walker2d-medium-expert-sparse | 113.5 ± 0.54 | 3.19 ± 1.21 | 6.31 ± 1.26 | 8.20 ± 6.04 | **10.1 ± 5.34** | 5.23 ± 0.53 |
#### RL Results\.
We report offline RL results on the sparse medium-expert datasets in Table [1](https://arxiv.org/html/2605.24405#S5.T1). In halfcheetah, MBPO-based GORMPO outperforms MBPO by 41% while remaining competitive with SOTA baselines, highlighting the benefit of the density-based guardian for regularizing the OOD transition rollouts in MBPO. The walker2d and hopper datasets exhibit high uncertainty in the terminal signal, unlike halfcheetah with its uniform episode lengths (see Appendix [H](https://arxiv.org/html/2605.24405#A8)). In this case, MBPO largely fails in walker2d and hopper, and MBPO-based GORMPO’s improvement over MBPO is insufficient to outperform SOTA methods. Without a direct uncertainty penalty on the critic network like that of MOBILE, MBPO is highly sensitive to compounding model error and termination misprediction, where small transition mistakes in contact-rich locomotion quickly push model rollouts OOD. With MOBILE as the base model, our density-based guardian yields an 18.7% improvement in hopper, demonstrating that it further strengthens SOTA uncertainty-penalized methods, especially under hopper’s highly uncertain dynamics.
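As a quick check, the two relative improvements quoted above appear to follow from the mean scores in Table 1, assuming they are computed as the relative gain of the best GORMPO variant over its base model (the paper does not state the formula explicitly):

$$\frac{89.9 - 63.7}{63.7} \approx 0.411 \;\; (\text{halfcheetah, MBPO-based}), \qquad \frac{113.0 - 95.2}{95.2} \approx 0.187 \;\; (\text{hopper, MOBILE-based}).$$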
Figure 8: t-SNE projections of equal-sized offline dataset and policy rollout $(s', a)$ pairs.

In Figure [8](https://arxiv.org/html/2605.24405#S5.F8), MOBILE covers the data support in hopper and walker2d, whereas MBPO largely reaches OOD regions. The best-performing MOBILE-based GORMPO variants achieve better data coverage by avoiding OOD regions and overlapping with the dataset projection. See more results on the medium, medium-replay, and medium-expert D4RL datasets in Appendix [D](https://arxiv.org/html/2605.24405#A4).
### 5.4 Does better OOD detection lead to better policies?
We address this question by combining the OOD detection and RL results. In our proprietary medical dataset, NeuralODE is the best OOD detection model, while GORMPO-RealNVP and NeuralODE are the best and second-best policies in reward and WS. As this dataset exhibits vast sparsities, penalties with exact likelihoods are essential for policy performance. In halfcheetah, VAE is consistently satisfactory on each metric with a distinctive threshold selection, which also coincides with the high reward of GORMPO-VAE. Since halfcheetah has a uniform episode length distribution, all density estimators successfully regularize the OOD rollouts of the dynamics model (see Appendix [I](https://arxiv.org/html/2605.24405#A9)). In hopper, RealNVP is consistently the best OOD detection model across all metrics. We also observe its strength in the RL results, with GORMPO-RealNVP being the second-best model after diffusion. Despite the OOD detection challenges inherent to diffusion models (see Appendix [J](https://arxiv.org/html/2605.24405#A10)), they surprisingly support policy training better than many explicit density models under uncertain dynamics. We hypothesize this stems from penalty saturation, where DDPM assigns low probabilities to most generated samples, consistently driving the tanh penalty to 1 (see Appendix [K](https://arxiv.org/html/2605.24405#A11)). The high performance of GORMPO-DDPM is therefore likely driven by this aggressive conservatism, which forces the policy to remain extremely close to the training support as shown in Figure [8](https://arxiv.org/html/2605.24405#S5.F8). In walker2d, RealNVP is the best model in OOD separation, which matches the RL result, with GORMPO-RealNVP being the best variant. Overall, we find that stronger OOD detection generally yields better policies, though under highly uncertain dynamics, such as hopper, the most pessimistic density models can be preferable.
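The saturation effect discussed above can be seen directly from the penalty in Algorithm 1, $u = \tanh(\max(\tau - \log p_{\theta}, 0))$. The sketch below is illustrative only: the threshold and log-density values are made up, and the way the penalty enters the reward (`reward - lam * u`) is our assumption about the penalized MDP rather than a form confirmed in this section.

```python
import math


def density_penalty(log_prob, tau):
    """u = tanh(max(tau - log p, 0)): zero at or above the threshold, at most 1 below it."""
    return math.tanh(max(tau - log_prob, 0.0))


def penalized_reward(reward, log_prob, tau, lam):
    """Assumed penalized reward: model reward minus lam times the density penalty."""
    return reward - lam * density_penalty(log_prob, tau)


tau = -5.0  # hypothetical log-density threshold
for log_p in [-2.0, -5.0, -6.0, -10.0, -50.0]:
    u = density_penalty(log_p, tau)
    print(f"log p = {log_p:6.1f}  penalty = {u:.4f}")
```

Samples above the threshold receive no penalty, while anything far below it is pushed to a penalty of nearly 1. An estimator that assigns very low log-density to most rollout samples therefore acts as an almost constant penalty of size $\lambda$, which is consistent with the aggressive conservatism attributed to GORMPO-DDPM.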
## 6 Conclusion
In this paper, we propose GORMPO, a generative OOD-regularized model-based policy optimization algorithm for sequential decision-making. During policy optimization, we penalize rewards predicted by the dynamics model based on the joint probability of next-state and action pairs under generative density estimators. Overall, GORMPO outperforms state-of-the-art offline RL methods by 17% on our real-world medical dataset and enhances the base models. Thanks to our plug-and-play framework, we benchmark the OOD detection performance of five distinct families of density estimators on our medical dataset and sparse offline RL datasets. We conclude that the best OOD detection models yield policies ranked among the top two in offline performance. Notably, GORMPO-DDPM achieves high rewards despite weak OOD detection, which suggests that highly uncertain dynamics may require more pessimistic density penalties. A key limitation of our approach is its reliance on a penalization threshold, since distribution skewness can shift this threshold and alter conservatism. In addition, our $\tanh$ penalty function saturates in highly OOD regions, which might be too optimistic in safety-critical tasks. Future work will explore model-based RL with threshold-free density penalization. Furthermore, we will study extending GORMPO to datasets with uncertain dynamics.
## References
- Let Offline RL Flow: training conservative agents in the latent space of normalizing flows\.arXiv preprint arXiv:2211\.11096\.Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1)\.
- G\. An, S\. Moon, J\. Kim, and H\. O\. Song \(2021\)Uncertainty\-based offline reinforcement learning with diversified q\-ensemble\.InAdvances in Neural Information Processing Systems,M\. Ranzato, A\. Beygelzimer, Y\. Dauphin, P\.S\. Liang, and J\. W\. Vaughan \(Eds\.\),Vol\.34,pp\. 7436–7447\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2021/file/3d3d286a8d153a4a58156d0e02d8570c-Paper.pdf)Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p2.1),[§2](https://arxiv.org/html/2605.24405#S2.p1.1)\.
- M\. Bansal, A\. Krizhevsky, and A\. S\. Ogale \(2019\)ChauffeurNet: learning to drive by imitating the best and synthesizing the worst\.InRobotics: Science and Systems \(RSS\),Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p1.1)\.
- D\. Brandfonbrener, W\. Whitney, R\. Ranganath, and J\. Bruna \(2021\)Offline RL without off\-policy evaluation\.InAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6\-14, 2021, virtual,pp\. 4933–4946\.Cited by:[§E\.1](https://arxiv.org/html/2605.24405#A5.SS1.p1.3),[§5\.1](https://arxiv.org/html/2605.24405#S5.SS1.SSS0.Px3.p1.1)\.
- L\. Brunke, M\. Greeff, A\. Hall, Z\. Yuan, S\. Zhou, J\. Panerati, and A\. P\. Schoellig \(2022\)Safe learning in robotics: from learning\-based control to safe reinforcement learning\.IEEE Robotics and Automation Letters7\(2\),pp\. 4915–4922\.Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p1.1)\.
- E\. Buitenwerf, M\. F\. Boekel, M\. I\. van der Velde, M\. F\. Voogd, M\. N\. Kerstens, G\. J\. Wietasch, and T\. W\. Scheeren \(2019\)The haemodynamic instability score: development and internal validation of a new rating method of intra\-operative haemodynamic instability\.European Journal of Anaesthesiology\| EJA36\(4\),pp\. 290–296\.Cited by:[Table 15](https://arxiv.org/html/2605.24405#A7.T15),[Table 15](https://arxiv.org/html/2605.24405#A7.T15.17.2)\.
- Y\. Burda, R\. Grosse, and R\. Salakhutdinov \(2016\)Importance weighted autoencoders\.InInternational Conference on Learning Representations \(ICLR\),Note:PosterExternal Links:[Link](https://arxiv.org/abs/1509.00519)Cited by:[§3\.3](https://arxiv.org/html/2605.24405#S3.SS3.SSS0.Px2.p1.2)\.
- R\. T\. Q\. Chen, Y\. Rubanova, J\. Bettencourt, and D\. K\. Duvenaud \(2018\)Neural ordinary differential equations\.InAdvances in Neural Information Processing Systems,S\. Bengio, H\. Wallach, H\. Larochelle, K\. Grauman, N\. Cesa\-Bianchi, and R\. Garnett \(Eds\.\),Vol\.31,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2018/file/69386f6bb1dfed68692a24c8686939b9-Paper.pdf)Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1),[§3\.3](https://arxiv.org/html/2605.24405#S3.SS3.SSS0.Px5.p1.2)\.
- J\. Degrave, F\. Felici, J\. Buchli, M\. Neunert, B\. Tracey, F\. Carpanese, T\. Ewalds, R\. Hafner, A\. Abdolmaleki, D\. de las Casas,et al\.\(2022\)Magnetic control of tokamak plasmas through deep reinforcement learning\.Nature602\(7897\),pp\. 414–419\.Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p1.1)\.
- L\. Dinh, J\. Sohl\-Dickstein, and S\. Bengio \(2017\)Density estimation using real nvp\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§3\.3](https://arxiv.org/html/2605.24405#S3.SS3.SSS0.Px3.p1.2)\.
- A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, J\. Uszkoreit, and N\. Houlsby \(2021\)An image is worth 16x16 words: transformers for image recognition at scale\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1)\.
- J\. Eckardt, K\. Wendt, M\. Bornhaeuser, and J\. M\. Middeke \(2021\)Reinforcement learning for precision oncology\.Cancers13\(18\),pp\. 4624\.Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p1.1)\.
- J\. Fu, A\. Kumar, O\. Nachum, G\. Tucker, and S\. Levine \(2021\)D4RL: datasets for deep data\-driven reinforcement learning\.InInternational Conference on Learning Representations,Cited by:[§5\.3](https://arxiv.org/html/2605.24405#S5.SS3.SSS0.Px1.p1.1)\.
- S\. Fujimoto, E\. Conti, M\. Ghavamzadeh, and J\. Pineau \(2019\)Benchmarking batch deep reinforcement learning algorithms\.InarXiv preprint arXiv:1910\.01708,Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p2.1),[§2](https://arxiv.org/html/2605.24405#S2.p1.1)\.
- Z\. Geng, M\. Deng, X\. Bai, J\. Z\. Kolter, and K\. He \(2026\)Mean flows for one\-step generative modeling\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=uWj4s7rMnR)Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020\)Denoising diffusion probabilistic models\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 6840–6851\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf)Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1),[§3\.3](https://arxiv.org/html/2605.24405#S3.SS3.SSS0.Px4.p1.4)\.
- M\. Janner, Y\. Du, J\. Tenenbaum, and S\. Levine \(2022\)Planning with diffusion for flexible behavior synthesis\.InProceedings of the 39th International Conference on Machine Learning,K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvari, G\. Niu, and S\. Sabato \(Eds\.\),Proceedings of Machine Learning Research, Vol\.162,pp\. 9902–9915\.External Links:[Link](https://proceedings.mlr.press/v162/janner22a.html)Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1)\.
- M\. Janner, J\. Fu, M\. Zhang, and S\. Levine \(2019\)When to trust your model: model\-based policy optimization\.InAdvances in Neural Information Processing Systems,H\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d'Alché\-Buc, E\. Fox, and R\. Garnett \(Eds\.\),Vol\.32,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/5faf461eff3099671ad63c6f3f094f7f-Paper.pdf)Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p2.1),[§2](https://arxiv.org/html/2605.24405#S2.p2.1),[§5\.1](https://arxiv.org/html/2605.24405#S5.SS1.SSS0.Px2.p1.1),[§5\.3](https://arxiv.org/html/2605.24405#S5.SS3.p1.1)\.
- J\. Johnson, M\. Douze, and H\. Jégou \(2021\)Billion\-scale similarity search with gpus\.IEEE Transactions on Big Data7\(3\),pp\. 535–547\.External Links:[Document](https://dx.doi.org/10.1109/TBDATA.2019.2921572)Cited by:[§E\.2](https://arxiv.org/html/2605.24405#A5.SS2.SSS0.Px1.p1.1)\.
- T\. Karras, M\. Aittala, T\. Aila, and S\. Laine \(2022\)Elucidating the design space of diffusion\-based generative models\.InAdvances in Neural Information Processing Systems,Vol\.35\.Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1)\.
- D\. P\. Kingma and M\. Welling \(2014\)Auto\-encoding variational bayes\.In2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14\-16, 2014, Conference Track Proceedings,Y\. Bengio and Y\. LeCun \(Eds\.\),External Links:[Link](http://arxiv.org/abs/1312.6114)Cited by:[§3\.3](https://arxiv.org/html/2605.24405#S3.SS3.SSS0.Px2.p1.3)\.
- M\. Komorowski, L\. A\. Celi, O\. Badawi, A\. C\. Gordon, and A\. A\. Faisal \(2018\)The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care\.Nature Medicine24\(11\),pp\. 1716–1720\.Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p1.1)\.
- A\. Kumar, J\. Fu, J\. Soh, G\. Tucker, and S\. Levine \(2019\)Stabilizing off\-policy q\-learning via bootstrapping error reduction\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p1.1)\.
- A\. Kumar, A\. Zhou, G\. Tucker, and S\. Levine \(2020\)Conservative q\-learning for offline reinforcement learning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p2.1),[§2](https://arxiv.org/html/2605.24405#S2.p1.1)\.
- V\. Kurenkov and S\. Kolesnikov \(2022\)Showing your offline reinforcement learning work: online evaluation budget matters\.InProceedings of the 39th International Conference on Machine Learning,K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvari, G\. Niu, and S\. Sabato \(Eds\.\),Proceedings of Machine Learning Research, Vol\.162,pp\. 11729–11752\.External Links:[Link](https://proceedings.mlr.press/v162/kurenkov22a.html)Cited by:[§E\.1](https://arxiv.org/html/2605.24405#A5.SS1.p1.3),[§5\.1](https://arxiv.org/html/2605.24405#S5.SS1.SSS0.Px3.p1.1)\.
- Y\. Lipman, R\. T\. Q\. Chen, H\. Ben\-Hamu, M\. Nickel, and M\. Le \(2023\)Flow matching for generative modeling\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=PqvMRDCJT9t)Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1)\.
- W\. Luo, H\. Li, Z\. Zhang, C\. Han, J\. Lv, and T\. Guo \(2024\)Mitigating distribution shift in model\-based offline RL via shifts\-aware reward learning\.External Links:2408\.12830Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p2.1)\.
- Y\. Mao, H\. Zhang, C\. Chen, Y\. Xu, and X\. Ji \(2023\)Supported value regularization for offline reinforcement learning\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 40587–40609\.Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p2.1)\.
- V\. Melnychuk, D\. Frauen, and S\. Feuerriegel \(2023\)Normalizing flows for interventional density estimation\.InProceedings of the 40th International Conference on Machine Learning,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 24361–24397\.External Links:[Link](https://proceedings.mlr.press/v202/melnychuk23a.html)Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p2.1)\.
- K\. Park and Y\. Lee \(2025\)Model\-based offline reinforcement learning with lower expectile q\-learning\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=OATPSB5JK1)Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p2.1)\.
- W\. Peebles and S\. Xie \(2023\)Scalable diffusion models with transformers\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 4195–4205\.Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1)\.
- A\. Raghu, M\. Komorowski, L\. A\. Celi, P\. Szolovits, and M\. Ghassemi \(2017\)Continuous state\-space models for optimal sepsis treatment: a deep reinforcement learning approach\.InProceedings of the Machine Learning for Healthcare Conference \(MLHC\),pp\. 147–163\.Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p1.1)\.
- D\. Rezende and S\. Mohamed \(2015\)Variational inference with normalizing flows\.InProceedings of the 32nd International Conference on Machine Learning,F\. Bach and D\. Blei \(Eds\.\),Proceedings of Machine Learning Research, Vol\.37,Lille, France,pp\. 1530–1538\.External Links:[Link](https://proceedings.mlr.press/v37/rezende15.html)Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1)\.
- R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer \(2022\)High\-resolution image synthesis with latent diffusion models\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 10684–10695\.Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1)\.
- L\. Ruff, J\. R\. Kauffmann, R\. A\. Vandermeulen, G\. Montavon, W\. Samek, M\. Kloft, T\. G\. Dietterich, and K\. Muller \(2021\)A unifying review of deep and shallow anomaly detection\.Proceedings of the IEEE109\(5\)\.External Links:ISSN 1558\-2256,[Link](http://dx.doi.org/10.1109/JPROC.2021.3052449),[Document](https://dx.doi.org/10.1109/jproc.2021.3052449)Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p2.1)\.
- J\. Song, C\. Meng, and S\. Ermon \(2021\)Denoising diffusion implicit models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=St1giarCHLP)Cited by:[§2](https://arxiv.org/html/2605.24405#S2.p3.1)\.
- Y\. Sun, J\. Zhang, C\. Jia, H\. Lin, J\. Ye, and Y\. Yu \(2023\)Model\-Bellman inconsistency for model\-based offline reinforcement learning\.InProceedings of the 40th International Conference on Machine Learning,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 33177–33194\.External Links:[Link](https://proceedings.mlr.press/v202/sun23q.html)Cited by:[Remark B\.4](https://arxiv.org/html/2605.24405#A2.Thmtheorem4.p1.2),[Table 5](https://arxiv.org/html/2605.24405#A4.T5),[Table 5](https://arxiv.org/html/2605.24405#A4.T5.16.2.1),[Table 6](https://arxiv.org/html/2605.24405#A4.T6),[Table 6](https://arxiv.org/html/2605.24405#A4.T6.2.1),[§E\.1](https://arxiv.org/html/2605.24405#A5.SS1.p2.1.1),[§1](https://arxiv.org/html/2605.24405#S1.p2.1),[§2](https://arxiv.org/html/2605.24405#S2.p2.1),[§5\.1](https://arxiv.org/html/2605.24405#S5.SS1.SSS0.Px2.p1.1),[§5\.3](https://arxiv.org/html/2605.24405#S5.SS3.p1.1)\.
- H\. Tseng, L\. Wei, S\. Cui, Y\. Luo, R\. K\. Ten Haken, and I\. El Naqa \(2020\)Machine learning and imaging informatics in oncology\.Oncology98\(6\),pp\. 344–362\.Cited by:[§1](https://arxiv.org/html/2605.24405#S1.p1.1)\.
- A. Tumay, S. Sun, S. Fereidooni, A. Dumas, E. Jortberg, and R. Yu (2025) Guardian-regularized safe offline reinforcement learning for smart weaning of mechanical circulatory devices. External Links: 2511.06111, [Link](https://arxiv.org/abs/2511.06111) Cited by: [Lemma B.3](https://arxiv.org/html/2605.24405#A2.Thmtheorem3), [§G.1](https://arxiv.org/html/2605.24405#A7.SS1.p1.1), [§G.2](https://arxiv.org/html/2605.24405#A7.SS2.p1.5), [Appendix G](https://arxiv.org/html/2605.24405#A7.p1.1), [§1](https://arxiv.org/html/2605.24405#S1.p2.1), [§2](https://arxiv.org/html/2605.24405#S2.p2.1), [§4](https://arxiv.org/html/2605.24405#S4.p3.5), [Figure 3](https://arxiv.org/html/2605.24405#S5.F3.2.2), [§5.1](https://arxiv.org/html/2605.24405#S5.SS1.SSS0.Px1.p1.1), [§5.2](https://arxiv.org/html/2605.24405#S5.SS2.p1.1).
- C. Vahdatpour, D. Collins, and S. Goldberg (2019) Cardiogenic shock. Journal of the American Heart Association 8 (8), pp. e011991. Cited by: [§5.1](https://arxiv.org/html/2605.24405#S5.SS1.SSS0.Px1.p1.1).
- Z. Wang, J. J. Hunt, and M. Zhou (2023) Diffusion policies as an expressive policy class for offline reinforcement learning. In The Eleventh International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=AHvFDPi-FA) Cited by: [§2](https://arxiv.org/html/2605.24405#S2.p3.1).
- J. Wu, H. Wu, Z. Qiu, J. Wang, and M. Long (2022) Supported policy optimization for offline reinforcement learning. In Advances in Neural Information Processing Systems. Cited by: [Table 6](https://arxiv.org/html/2605.24405#A4.T6), [Table 6](https://arxiv.org/html/2605.24405#A4.T6.2.1), [§E.1](https://arxiv.org/html/2605.24405#A5.SS1.p2.1.1), [§1](https://arxiv.org/html/2605.24405#S1.p2.1), [§2](https://arxiv.org/html/2605.24405#S2.p2.1), [§5.1](https://arxiv.org/html/2605.24405#S5.SS1.SSS0.Px2.p1.1).
- Y. Wu, G. Tucker, and O. Nachum (2019) Behavior regularized offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2605.24405#S2.p1.1).
- S. Xu, Z. Ma, W. Chai, X. Chen, W. Jin, J. Chai, S. Xie, and S. X. Yu (2025) Next-embedding prediction makes strong vision learners. External Links: 2512.16922, [Link](https://arxiv.org/abs/2512.16922) Cited by: [§2](https://arxiv.org/html/2605.24405#S2.p3.1).
- R. Yan, X. Shen, A. Wachi, S. Gros, A. Zhao, and X. Hu (2025) Offline guarded safe reinforcement learning for medical treatment optimization strategies. In Advances in Neural Information Processing Systems (NeurIPS) 39. Cited by: [§G.3](https://arxiv.org/html/2605.24405#A7.SS3.p1.1), [§1](https://arxiv.org/html/2605.24405#S1.p2.1), [§2](https://arxiv.org/html/2605.24405#S2.p2.1), [Figure 3](https://arxiv.org/html/2605.24405#S5.F3.2.2).
- T. Yu, G. Thomas, L. Yu, S. Ermon, J. Y. Zou, S. Levine, C. Finn, and T. Ma (2020) MOPO: model-based offline policy optimization. In Advances in Neural Information Processing Systems, Vol. 33, pp. 14129–14142. Cited by: [Lemma B.1](https://arxiv.org/html/2605.24405#A2.Thmtheorem1), [Remark B.4](https://arxiv.org/html/2605.24405#A2.Thmtheorem4.p1.2), [Table 6](https://arxiv.org/html/2605.24405#A4.T6), [Table 6](https://arxiv.org/html/2605.24405#A4.T6.2.1), [§E.1](https://arxiv.org/html/2605.24405#A5.SS1.p2.1), [§1](https://arxiv.org/html/2605.24405#S1.p2.1), [§2](https://arxiv.org/html/2605.24405#S2.p2.1).
- S. Zhai, Y. Cheng, W. Lu, and Z. Zhang (2016) Deep structured energy based models for anomaly detection. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 1100–1109. External Links: [Link](https://proceedings.mlr.press/v48/zhai16.html) Cited by: [§1](https://arxiv.org/html/2605.24405#S1.p2.1).
- J. Zhang, C. Zhang, W. Wang, and B. Jing (2023) Constrained policy optimization with explicit behavior density for offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2605.24405#S2.p2.1).
## Appendix A Algorithm Pseudocode

**Algorithm 1** Generative OOD-regularized Model-based Policy Optimization (GORMPO)

1. **input** Jointly pre-trained dynamics model $\hat{T}_{\phi}(s'|s,a)$ and reward model $\hat{r}_{\psi}(s',a)$, pretrained density estimator $p_{\theta}(s,a)$, penalty weight $\lambda$, density threshold $\tau$.
2. **output** Learned policy $\hat{\pi}$
3. **for** $K$ epochs **do**
4. Define penalized MDP
5. $\hat{s}' \sim \hat{T}_{\phi}(\cdot|s,a)$
6. $u(\hat{s}',a) = \tanh(\max(\tau - \log p_{\theta}(\hat{s}',a), 0))$
7. $\tilde{r}(s,a) = \hat{r}(s,a) - \lambda\, u(\hat{s}',a)$
8. $\tilde{\mathcal{M}} \leftarrow (S, A, \hat{T}_{\phi}, \tilde{r}, \mu_{0}, \gamma)$
9. Train any MBRL model until convergence on $\tilde{\mathcal{M}}$
10. $\hat{\pi} \leftarrow \arg\max_{\pi} \eta_{\tilde{\mathcal{M}}}(\pi)$
11. **end for**
12. **return** $\hat{\pi}$
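Steps 6 and 7 of Algorithm 1 can be illustrated with a minimal Python sketch. The function names `ood_penalty` and `penalized_reward` and the numeric inputs are hypothetical and chosen only for illustration; the log-density is assumed to come from some already-trained density estimator.

```python
import math

def ood_penalty(log_density: float, tau: float) -> float:
    """u = tanh(max(tau - log p_theta, 0)): zero at or above the threshold, below 1 otherwise."""
    return math.tanh(max(tau - log_density, 0.0))

def penalized_reward(r_hat: float, log_density: float, lam: float, tau: float) -> float:
    """r_tilde = r_hat - lam * u, as in step 7 of Algorithm 1."""
    return r_hat - lam * ood_penalty(log_density, tau)

# Hypothetical values: one sample above the density threshold, one far below it.
in_dist = penalized_reward(r_hat=1.0, log_density=-1.0, lam=0.5, tau=-2.0)
far_ood = penalized_reward(r_hat=1.0, log_density=-50.0, lam=0.5, tau=-2.0)
print(in_dist, far_ood)
```

Because $\tanh$ saturates, the penalty subtracted from the reward never exceeds $\lambda$, no matter how low the estimated density is.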
## Appendix BProofs
###### Lemma B.1 (Telescoping Lemma - Lemma 4.1 in \[[46](https://arxiv.org/html/2605.24405#bib.bib15)\]).
Let $M$ and $\hat{M}$ be two MDPs with the same reward function $r$, but different dynamics $T$ and $\hat{T}$, respectively. Let

$$G^{\pi}_{\hat{M}}(s,a) := \mathbb{E}_{s'\sim\hat{T}(s,a)}[V^{\pi}_{M}(s')] - \mathbb{E}_{s'\sim T(s,a)}[V^{\pi}_{M}(s')].$$

Then,

$$\eta_{\hat{M}}(\pi) - \eta_{M}(\pi) = \gamma\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[G^{\pi}_{\hat{M}}(s,a)]$$

As an immediate corollary:

$$\eta_{M}(\pi) = \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[r(s,a) - \gamma G^{\pi}_{\hat{M}}(s,a)] \geq \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[r(s,a) - \gamma |G^{\pi}_{\hat{M}}(s,a)|]$$
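The telescoping identity can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses a hypothetical 2-state chain with the policy already folded into the transition matrices `T` and `T_hat`, a state-only reward `r`, and the unnormalized discounted occupancy $\rho = \sum_t \gamma^t \Pr(s_t = s)$, so that $\eta = \sum_s \rho(s)\, r(s)$.

```python
def mat_vec(P, v):
    return [sum(P[i][j] * v[j] for j in range(len(v))) for i in range(len(P))]

def vec_mat(d, P):
    return [sum(d[i] * P[i][j] for i in range(len(d))) for j in range(len(P[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def value(P, r, gamma, iters=2000):
    """Fixed point of V = r + gamma * P V, found by iteration."""
    v = [0.0] * len(r)
    for _ in range(iters):
        Pv = mat_vec(P, v)
        v = [r[i] + gamma * Pv[i] for i in range(len(r))]
    return v

def occupancy(P, mu0, gamma, iters=2000):
    """Unnormalized discounted occupancy: sum_t gamma^t * mu0 P^t."""
    d = [0.0] * len(mu0)
    cur, weight = list(mu0), 1.0
    for _ in range(iters):
        d = [d[i] + weight * cur[i] for i in range(len(d))]
        cur = vec_mat(cur, P)
        weight *= gamma
    return d

# Hypothetical toy quantities.
gamma, r, mu0 = 0.9, [1.0, 0.0], [0.5, 0.5]
T = [[0.9, 0.1], [0.2, 0.8]]      # true dynamics
T_hat = [[0.7, 0.3], [0.4, 0.6]]  # learned dynamics

V_M = value(T, r, gamma)
V_M_hat = value(T_hat, r, gamma)
rho_hat = occupancy(T_hat, mu0, gamma)
# G(s) = E_{s'~T_hat}[V_M(s')] - E_{s'~T}[V_M(s')]
G = [a - b for a, b in zip(mat_vec(T_hat, V_M), mat_vec(T, V_M))]

lhs = dot(mu0, V_M_hat) - dot(mu0, V_M)  # eta_{M_hat} - eta_M
rhs = gamma * dot(rho_hat, G)            # gamma * E_{rho_hat}[G]
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point error, which is the content of the lemma in this small setting.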
###### Lemma B.2 (Function Class Bound).
If $\mathcal{F}$ is a set of functions mapping $\mathcal{S}$ to $\mathbb{R}$ with $\|\cdot\|_{\infty} \leq 1$, and $V^{\pi} \in c\mathcal{F}$ where $c = r_{\max}/(1-\gamma)$, then

$$|G^{\pi}_{\hat{M}}(s,a)| \leq c \cdot d_{\mathcal{F}}(\hat{T}(s,a), T(s,a))$$

where $d_{\mathcal{F}}(\hat{T}(s,a), T(s,a)) = \sup_{f\in\mathcal{F}} \left|\mathbb{E}_{s'\sim\hat{T}(s,a)}[f(s')] - \mathbb{E}_{s'\sim T(s,a)}[f(s')]\right|$.
###### Lemma B.3 (Telescoping Lemma \[[39](https://arxiv.org/html/2605.24405#bib.bib14)\]: Model Approximation Bound for Transition Functions).
Let $\pi$ be any feasible solution and $\hat{T}_{\phi}$ be the Transformer-learned transition function with parameters $\phi$. Assume the following:
- Dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ with $N$ samples
- Transformer with $L$ layers, dimension $d_{model}$, and $H_{attn}$ attention heads
- Lipschitz continuous true transition $T$ with constant $L_T$

Then, with probability at least $1 - 2\beta - 4\delta$, the following holds:

$$\left|V_{\psi,\hat{T}_{\phi}}^{\pi}(\rho_0) - V_{\psi,T}^{\pi}(\rho_0)\right| \leq \epsilon_H + \epsilon_{trans}$$

where:

$$\epsilon_H := \frac{\gamma^{H+1}(2-\gamma)\psi_{max}}{(1-\gamma)^{2}}$$

$$\epsilon_{trans} := \frac{\psi_{max}(\gamma - \gamma^{H+2})}{(1-\gamma)^{2}}\left(\epsilon_{approx} + \epsilon_{gen}\right)$$

with:

$$\epsilon_{approx} := C_{trans} \cdot \min\left\{\frac{1}{L \cdot d_{model}}, \frac{1}{H_{attn} \cdot N_{ctx}}\right\}$$

$$\epsilon_{gen} := L_T\sqrt{\frac{d\log(1/\delta) + \log N}{N}} + \mathcal{O}\left(\frac{1}{\sqrt{N_{eff}}}\right)$$
Assumption [4.2](https://arxiv.org/html/2605.24405#S4.Thmtheorem2) is supported by the above derivation of model approximation error. In regions with high estimated density $p_{\theta}(s',a)$, the local sample density is high, yielding large $N_{\text{eff}}$ and small error. Conversely, as $p_{\theta}(s,a) \rightarrow 0$, the effective sample size diminishes, causing the error to grow. The linear relationship $\tau - p_{\theta}(s,a)$ captures this first-order dependence between local data density and model error, while $\epsilon_{\text{approx}} = C_{\text{trans}} \cdot \min\{1/(L \cdot d_{\text{model}}), 1/(H_{\text{attn}} \cdot N_{\text{ctx}})\}$ represents the transformer's irreducible approximation error determined by its architecture (depth $L$, dimension $d_{\text{model}}$, and attention heads $H_{\text{attn}}$). The constant $C_{\hat{T}}$ encapsulates the Lipschitz constant $L_T$ of the true dynamics and the dimensionality-dependent factors.
###### Lemma B.5 (Density Estimation Error Propagation).
Under Assumption [4.3](https://arxiv.org/html/2605.24405#S4.Thmtheorem3), the difference between the true regularizer $u(s',a) = \tanh(\max(\tau - \log p(s',a), 0))$ and the estimated regularizer $u(\hat{s}',a) = \tanh(\max(\tau - \log p_{\theta}(s',a), 0))$ satisfies:

$$|u(\hat{s}',a) - u(s',a)| \leq \epsilon_{\text{density}}$$

for all $(s',a) \in \mathcal{S}\times\mathcal{A}$.
###### Proof.
Since $\tanh(\cdot)$ is 1-Lipschitz and $\max(\cdot, 0)$ is 1-Lipschitz:

$$
\begin{aligned}
|u(\hat{s}',a) - u(s',a)| &= |\tanh(\max(\tau - \log p_{\theta}(s',a), 0)) - \tanh(\max(\tau - \log p(s',a), 0))| && (15)\\
&\leq |\max(\tau - \log p_{\theta}(s',a), 0) - \max(\tau - \log p(s',a), 0)| && (16)\\
&\leq |(\tau - \log p_{\theta}(s',a)) - (\tau - \log p(s',a))| && (17)\\
&= |\log p(s',a) - \log p_{\theta}(s',a)| && (18)\\
&\leq \epsilon_{\text{density}} && (19)
\end{aligned}
$$

where the last inequality follows from Assumption [4.3](https://arxiv.org/html/2605.24405#S4.Thmtheorem3). ∎
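As a sanity check on the Lipschitz argument, the following Python sketch samples random thresholds and log-densities and confirms that the gap between regularizers never exceeds the gap between log-densities. The sampling ranges and the helper name `regularizer` are arbitrary assumptions made for this illustration.

```python
import math
import random

def regularizer(log_p: float, tau: float) -> float:
    """tanh(max(tau - log p, 0)), the regularizer form used in Lemma B.5."""
    return math.tanh(max(tau - log_p, 0.0))

random.seed(0)
worst_slack = float("inf")
for _ in range(10_000):
    tau = random.uniform(-10.0, 10.0)
    log_p = random.uniform(-20.0, 20.0)              # stand-in for the true log-density
    log_p_theta = log_p + random.uniform(-3.0, 3.0)  # stand-in for the estimated log-density
    gap_u = abs(regularizer(log_p_theta, tau) - regularizer(log_p, tau))
    gap_log = abs(log_p_theta - log_p)
    # Slack of the bound |u_hat - u| <= |log p - log p_theta|; should never be negative.
    worst_slack = min(worst_slack, gap_log - gap_u)

print(worst_slack)
```

A non-negative worst-case slack (up to rounding) is what the 1-Lipschitz chain in the proof predicts.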
### B.1 Proof of Theorem [4.4](https://arxiv.org/html/2605.24405#S4.Thmtheorem4): Conservative Value Bound with Density Error
###### Proof.
Starting from Lemma [B.1](https://arxiv.org/html/2605.24405#A2.Thmtheorem1):

$$\eta_{M}(\pi) = \eta_{\hat{M}}(\pi) - \gamma\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[G^{\pi}_{\hat{M}}(s,a)] \geq \eta_{\hat{M}}(\pi) - \gamma\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[|G^{\pi}_{\hat{M}}(s,a)|]$$

By Assumption [4.2](https://arxiv.org/html/2605.24405#S4.Thmtheorem2) and Lemma [B.2](https://arxiv.org/html/2605.24405#A2.Thmtheorem2), with the true regularizer $u(s',a)$ based on the true density $p(s',a)$:

$$|G^{\pi}_{\hat{M}}(s,a)| \leq c \cdot d_{\mathcal{F}}(\hat{T}(s,a), T(s,a)) \leq c\,C_{\hat{T}} \cdot \mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)] + c\,\epsilon_{\text{approx}}.$$
We now relate the true regularizer $u(s',a)$ to the estimated regularizer $u(\hat{s}',a)$ based on the learned density $p_{\theta}(s',a)$. Note that $\eta_{\hat{M}}(\pi) = \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\hat{r}(s,a)]$. We add and subtract the penalty term based on the estimated regularizer:

$$
\begin{aligned}
\eta_{M}(\pi) &\geq \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\hat{r}(s,a)\big] - \gamma c C_{\hat{T}}\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\Big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\Big] - \gamma c\,\epsilon_{\text{approx}} && (20)\\
&= \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\Big[\hat{r}(s,a) - \lambda\,\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\Big] + \lambda\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\Big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\Big] \\
&\quad - \gamma c C_{\hat{T}}\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\Big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\Big] - \gamma c\,\epsilon_{\text{approx}} && (21)
\end{aligned}
$$

By Lemma [B.5](https://arxiv.org/html/2605.24405#A2.Thmtheorem5), $|u(s',a) - u(\hat{s}',a)| \leq \epsilon_{\text{density}}$ for all $(s',a)$. Therefore:

$$\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] \leq \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\big] + \epsilon_{\text{density}} \tag{22}$$
Substituting this into our bound:

$$
\begin{aligned}
\eta_{M}(\pi) &\geq \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\tilde{r}(s,a)] + \lambda\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\big] && (23)\\
&\quad - \gamma c C_{\hat{T}}\left(\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\big] + \epsilon_{\text{density}}\right) - \gamma c\,\epsilon_{\text{approx}} \\
&= \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\tilde{r}(s,a)] + (\lambda - \gamma c C_{\hat{T}})\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\big] \\
&\quad - \gamma c C_{\hat{T}}\,\epsilon_{\text{density}} - \gamma c\,\epsilon_{\text{approx}} && (24)
\end{aligned}
$$

Setting $\lambda = \gamma c C_{\hat{T}}$, the middle term vanishes:

$$\eta_{M}(\pi) \geq \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\tilde{r}(s,a)] - \gamma c C_{\hat{T}}\,\epsilon_{\text{density}} - \gamma c\,\epsilon_{\text{approx}} \tag{25}$$

Since $\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\tilde{r}(s,a)] = \eta_{\tilde{M}}(\pi)$ (using normalized occupancy measure) and $\lambda = \gamma\,c\,C_{\hat{T}}$:

$$\eta_{M}(\pi) \geq \eta_{\tilde{M}}(\pi) - \gamma c\,\epsilon_{\text{approx}} - \lambda\,\epsilon_{\text{density}} \tag{26}$$

where $\tilde{M} = (\mathcal{S}, \mathcal{A}, \hat{T}, \tilde{r}, \mu_{0}, \gamma)$ is the density-regularized MDP.
∎
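To make the bound in (26) concrete, here is a small Python sketch that evaluates its right-hand side. All numeric inputs are hypothetical placeholders, since the constants $C_{\hat{T}}$, $\epsilon_{\text{approx}}$, and $\epsilon_{\text{density}}$ are not reported numerically here; the function name `conservative_lower_bound` is likewise an assumption.

```python
def conservative_lower_bound(eta_tilde, gamma, r_max, c_T, eps_approx, eps_density):
    """Right-hand side of (26) with lambda = gamma * c * C_T and c = r_max / (1 - gamma)."""
    c = r_max / (1.0 - gamma)
    lam = gamma * c * c_T
    bound = eta_tilde - gamma * c * eps_approx - lam * eps_density
    return bound, lam

# Hypothetical inputs, chosen only to show the arithmetic.
bound, lam = conservative_lower_bound(
    eta_tilde=5.0, gamma=0.9, r_max=1.0, c_T=0.5, eps_approx=0.01, eps_density=0.02
)
print(bound, lam)
```

With these placeholder values the penalty weight is about 4.5, and the two error terms each subtract about 0.09 from the return in the penalized MDP.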
### B.2 Proof of Theorem [4.5](https://arxiv.org/html/2605.24405#S4.Thmtheorem5): Optimality Gap with Density Error
###### Proof.
We establish a lower bound on $\eta_{M}(\hat{\pi})$ by relating it to the optimal policy within a constrained class.

From Lemma [B.1](https://arxiv.org/html/2605.24405#A2.Thmtheorem1) and Lemma [B.2](https://arxiv.org/html/2605.24405#A2.Thmtheorem2), for any policy $\pi$:

$$
\begin{aligned}
|\eta_{\hat{M}}(\pi) - \eta_{M}(\pi)| &\leq \gamma\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[|G^{\pi}_{\hat{M}}(s,a)|] && (27)\\
&\leq \gamma c\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[d_{\mathcal{F}}(\hat{T}(s,a), T(s,a))] && (28)
\end{aligned}
$$

By Assumption [4.2](https://arxiv.org/html/2605.24405#S4.Thmtheorem2) with the true regularizer $u(s',a)$:

$$|\eta_{\hat{M}}(\pi) - \eta_{M}(\pi)| \leq \gamma c C_{\hat{T}}\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] + \gamma c\,\epsilon_{\text{approx}} \tag{29}$$

From equation ([29](https://arxiv.org/html/2605.24405#A2.E29)) applied to $\hat{\pi}$:

$$\eta_{M}(\hat{\pi}) \geq \eta_{\hat{M}}(\hat{\pi}) - \gamma c C_{\hat{T}}\,\mathbb{E}_{(s,a)\sim\rho^{\hat{\pi}}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] - \gamma c\,\epsilon_{\text{approx}} \tag{30}$$
By definition, $\hat{\pi}$ maximizes $\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\hat{r}(s,a) - \lambda\,\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\big]$. Therefore, for any policy $\pi$:

$$\mathbb{E}_{(s,a)\sim\rho^{\hat{\pi}}_{\hat{T}}}\big[\hat{r}(s,a) - \lambda\,\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\big] \geq \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\hat{r}(s,a) - \lambda\,\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\big] \tag{31}$$

Rearranging:

$$
\begin{aligned}
\eta_{\hat{M}}(\hat{\pi}) &= \mathbb{E}_{(s,a)\sim\rho^{\hat{\pi}}_{\hat{T}}}[\hat{r}(s,a)] && (32)\\
&\geq \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\hat{r}(s,a)] - \lambda\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\big] + \lambda\,\mathbb{E}_{(s,a)\sim\rho^{\hat{\pi}}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\big] && (33)
\end{aligned}
$$
By Lemma [B.5](https://arxiv.org/html/2605.24405#A2.Thmtheorem5), $|u(\hat{s}',a) - u(s',a)| \leq \epsilon_{\text{density}}$ for all $(s,a)$. This gives us two inequalities:

$$
\begin{aligned}
u(\hat{s}',a) &\leq u(s',a) + \epsilon_{\text{density}} && (34)\\
u(\hat{s}',a) &\geq u(s',a) - \epsilon_{\text{density}} && (35)
\end{aligned}
$$

From ([35](https://arxiv.org/html/2605.24405#A2.E35)):

$$\mathbb{E}_{(s,a)\sim\rho^{\hat{\pi}}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\big] \geq \mathbb{E}_{(s,a)\sim\rho^{\hat{\pi}}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] - \epsilon_{\text{density}} \tag{36}$$

From ([34](https://arxiv.org/html/2605.24405#A2.E34)):

$$\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(\hat{s}',a)]\big] \leq \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] + \epsilon_{\text{density}} \tag{37}$$

Using ([36](https://arxiv.org/html/2605.24405#A2.E36)) and ([37](https://arxiv.org/html/2605.24405#A2.E37)) in equation ([33](https://arxiv.org/html/2605.24405#A2.E33)):

$$
\begin{aligned}
\eta_{\hat{M}}(\hat{\pi}) &\geq \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\hat{r}(s,a)] - \lambda\left(\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] + \epsilon_{\text{density}}\right) \\
&\quad + \lambda\left(\mathbb{E}_{(s,a)\sim\rho^{\hat{\pi}}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] - \epsilon_{\text{density}}\right) && (38)\\
&= \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\hat{r}(s,a)] - \lambda\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] \\
&\quad + \lambda\,\mathbb{E}_{(s,a)\sim\rho^{\hat{\pi}}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] - 2\lambda\,\epsilon_{\text{density}} && (39)
\end{aligned}
$$
Substituting ([39](https://arxiv.org/html/2605.24405#A2.E39)) into ([30](https://arxiv.org/html/2605.24405#A2.E30)):

$$
\begin{aligned}
\eta_{M}(\hat{\pi}) &\geq \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\hat{r}(s,a)] - \lambda\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] + \lambda\,\mathbb{E}_{(s,a)\sim\rho^{\hat{\pi}}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] \\
&\quad - 2\lambda\,\epsilon_{\text{density}} - \gamma c C_{\hat{T}}\,\mathbb{E}_{(s,a)\sim\rho^{\hat{\pi}}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] - \gamma c\,\epsilon_{\text{approx}} && (40)
\end{aligned}
$$

Setting $\lambda = \gamma c C_{\hat{T}}$, the terms involving $\mathbb{E}_{(s,a)\sim\rho^{\hat{\pi}}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big]$ cancel:

$$\eta_{M}(\hat{\pi}) \geq \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\hat{r}(s,a)] - \lambda\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] - 2\lambda\,\epsilon_{\text{density}} - \gamma c\,\epsilon_{\text{approx}} \tag{41}$$

Note that $\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[r(s,a) - \lambda u(s,a)]$ represents the expected return in a hypothetical MDP with reward $r(s,a) - \lambda u(s,a)$ and dynamics $\hat{T}$. While we actually optimize $\tilde{M}$ with reward $\tilde{r}(s,a) = \hat{r}(s,a) - \lambda\hat{u}(s,a)$ using the estimated regularizer $\hat{u}$, the bound in ([41](https://arxiv.org/html/2605.24405#A2.E41)) shows the relationship when evaluated with the true regularizer $u$.

We need to relate $\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\hat{r}(s,a)]$ to $\eta_{M}(\pi)$. From equation ([29](https://arxiv.org/html/2605.24405#A2.E29)):

$$\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}[\hat{r}(s,a)] = \eta_{\hat{M}}(\pi) \geq \eta_{M}(\pi) - \gamma c C_{\hat{T}}\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] - \gamma c\,\epsilon_{\text{approx}} \tag{42}$$
For any policy $\pi$ satisfying $\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] \leq \delta + \epsilon_{\text{density}}$, substituting into ([41](https://arxiv.org/html/2605.24405#A2.E41)):

$$
\begin{aligned}
\eta_{M}(\hat{\pi}) &\geq \eta_{M}(\pi) - \gamma c C_{\hat{T}}\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] - \gamma c\,\epsilon_{\text{approx}} \\
&\quad - \lambda\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] - 2\lambda\,\epsilon_{\text{density}} - \gamma c\,\epsilon_{\text{approx}} && (43)\\
&= \eta_{M}(\pi) - (\lambda + \gamma c C_{\hat{T}})\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] - 2\lambda\,\epsilon_{\text{density}} - 2\gamma c\,\epsilon_{\text{approx}} && (44)
\end{aligned}
$$

Since $\lambda = \gamma c C_{\hat{T}}$:

$$
\begin{aligned}
\eta_{M}(\hat{\pi}) &\geq \eta_{M}(\pi) - 2\lambda\,\mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\big[\mathbb{E}_{s'\sim\hat{T}(s,a)}[u(s',a)]\big] - 2\lambda\,\epsilon_{\text{density}} - 2\gamma c\,\epsilon_{\text{approx}} && (45)\\
&\geq \eta_{M}(\pi) - 2\lambda(\delta + \epsilon_{\text{density}}) - 2\lambda\,\epsilon_{\text{density}} - 2\gamma c\,\epsilon_{\text{approx}} && (46)\\
&= \eta_{M}(\pi) - 2\lambda\delta - 4\lambda\,\epsilon_{\text{density}} - 2\gamma c\,\epsilon_{\text{approx}} && (47)
\end{aligned}
$$

Since this holds for all policies $\pi$ satisfying $\mathbb{E}[u] \leq \delta + \epsilon_{\text{density}}$:

$$\eta_{M}(\hat{\pi}) \geq \max_{\pi:\,\mathbb{E}[u]\leq\delta+\epsilon_{\text{density}}} \eta_{M}(\pi) - 2\lambda\delta - 4\lambda\,\epsilon_{\text{density}} - 2\gamma c\,\epsilon_{\text{approx}} \tag{48}$$

This can be rewritten as:

$$\eta_{M}(\hat{\pi}) \geq \max_{\pi:\,\mathbb{E}[u]\leq\delta+\epsilon_{\text{density}}} \eta_{M}(\pi) - 2\lambda(\delta + 2\epsilon_{\text{density}}) - 2\gamma c\,\epsilon_{\text{approx}} \tag{49}$$
∎
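The suboptimality term in (49) can be evaluated directly. The Python sketch below computes $2\lambda(\delta + 2\epsilon_{\text{density}}) + 2\gamma c\,\epsilon_{\text{approx}}$ and shows how much of the gap is attributable to density estimation error; the function name `optimality_gap` and all numeric values are hypothetical.

```python
def optimality_gap(lam, delta, eps_density, gamma, c, eps_approx):
    """Gap subtracted in (49): 2*lam*(delta + 2*eps_density) + 2*gamma*c*eps_approx."""
    return 2.0 * lam * (delta + 2.0 * eps_density) + 2.0 * gamma * c * eps_approx

# Hypothetical constants, for illustration only.
params = dict(lam=4.5, delta=0.1, gamma=0.9, c=10.0, eps_approx=0.01)
gap = optimality_gap(eps_density=0.02, **params)
gap_perfect_density = optimality_gap(eps_density=0.0, **params)

# Share of the gap caused by density error: equals 4 * lam * eps_density.
density_share = gap - gap_perfect_density
print(gap, gap_perfect_density, density_share)
```

The density error enters the gap linearly with coefficient $4\lambda$, so a larger penalty weight makes the guarantee more sensitive to the quality of the density estimator.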
## Appendix C Ablations

(a) Medical dataset. All scores are stable except for $\lambda = 0.1$ in ACP and WS.

(b) Sparse D4RL datasets.

Figure 9: $\lambda$ sensitivity plots averaged over 2 seeds.

Figure 10: Linear (non-saturating) penalty ablation on the medical dataset. GORMPO with tanh penalty has large gains on reward and WS over the linear penalty models.
## Appendix D Additional Results

Table 2: Test Negative Log-Likelihood (NLL) scores (lower is better) for all density estimators on all datasets. All models except for diffusion demonstrate reasonable scores, which verify their reliability as density-based guardians. The reason for diffusion having the highest NLL is that we use a Gaussian approximation in computing log-likelihoods via Monte Carlo sampling.

| Dataset | KDE | VAE | RealNVP | Diffusion | NeuralODE |
| --- | --- | --- | --- | --- | --- |
| halfcheetah-m-e-sparse | 2.948 | 5.199 | -6.559 | 71.497 | 32.89 |
| hopper-m-e-sparse | 0.405 | 5.823 | 258.420 | 355134.44 | -7.96 |
| walker2d-m-e-sparse | 0.939 | 2.863 | -21.292 | 750839.06 | -17.98 |
| medical dataset | 4.299 | 77.892 | 61.949 | 30220.58 | 0.398 |

Table 3: GORMPO evaluation results on the medical dataset (mean $\pm$ std) on the same 5 seeds for 1000 episodes to ensure statistical significance.

| Medical dataset | SPOT | MOBILE | MBPO | GORMPO KDE | GORMPO VAE | GORMPO RealNVP | GORMPO DDPM | GORMPO NeuralODE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Return | $0.419 \pm 0.091$ | $0.637 \pm 0.125$ | $0.644 \pm 0.131$ | $0.586 \pm 0.125$ | $0.377 \pm 0.130$ | $\mathbf{0.739 \pm 0.118}$ | $0.561 \pm 0.142$ | $\mathbf{0.757 \pm 0.135}$ |
| WS | $0.196 \pm 0.009$ | $-0.440 \pm 0.006$ | $0.116 \pm 0.004$ | $0.215 \pm 0.006$ | $0.216 \pm 0.009$ | $\mathbf{0.222 \pm 0.009}$ | $0.145 \pm 0.007$ | $0.036 \pm 0.006$ |
| ACP | $\mathbf{0.036 \pm 0.015}$ | $3.479 \pm 0.084$ | $0.616 \pm 0.064$ | $1.194 \pm 0.050$ | $0.334 \pm 0.036$ | $1.861 \pm 0.074$ | $0.434 \pm 0.039$ | $0.322 \pm 0.043$ |

Table 4: Performance comparison of MBPO-based GORMPO, including NeuralODE as density estimator, to offline RL baselines in average normalized reward $\pm$ standard deviation over 3 seeds on sparse D4RL medium-expert datasets. Best score is bolded; best GORMPO score is underlined. m=medium, e=expert.

| Dataset | SPOT | MOBILE | MBPO | GORMPO-KDE | GORMPO-VAE | GORMPO-RealNVP | GORMPO-DDPM | GORMPO-NeuralODE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ha-m-e-sparse | $76.4 \pm 3.22$ | $\mathbf{99.4 \pm 0.48}$ | $63.7 \pm 8.23$ | $78.4 \pm 16.9$ | $\underline{89.9 \pm 1.50}$ | $86.0 \pm 8.34$ | $85.7 \pm 4.18$ | $80.3 \pm 13.7$ |
| ho-m-e-sparse | $83.8 \pm 23.3$ | $\mathbf{95.2 \pm 24.5}$ | $4.81 \pm 3.18$ | $8.91 \pm 12.89$ | $2.32 \pm 0.68$ | $10.0 \pm 9.02$ | $14.1 \pm 8.66$ | $5.05 \pm 3.85$ |
| wa-m-e-sparse | $113.5 \pm 0.54$ | $\mathbf{116 \pm 2.44}$ | $3.19 \pm 1.21$ | $6.31 \pm 1.26$ | $8.20 \pm 6.04$ | $\underline{10.1 \pm 5.34}$ | $5.23 \pm 0.53$ | $3.44 \pm 2.19$ |

Table 5: Performance comparison of MOBILE and density-regularized MOBILE variants with VAE, RealNVP, and DDPM across D4RL medium datasets, averaged over 3 seeds. Our density regularizers further increase MOBILE's reward scores. We measure MOBILE's performance on sparse datasets, and medium dataset rewards are imported from MOBILE \[[37](https://arxiv.org/html/2605.24405#bib.bib4)\].

| Datasets | MOBILE | MOBILE-VAE | MOBILE-RealNVP | MOBILE-DDPM |
| --- | --- | --- | --- | --- |
| halfcheetah-m | $74.6$ | $\mathbf{76.5 \pm 0.686}$ | $75.1 \pm 0.99$ | $73.36 \pm 1.79$ |
| hopper-m | $\mathbf{107.0}$ | $104.0 \pm 0.905$ | $103.0 \pm 0.068$ | $93.6 \pm 16.4$ |
| walker2d-m | $87.7$ | $88.2 \pm 2.08$ | $88.0 \pm 1.45$ | $\mathbf{91.1 \pm 0.73}$ |

Figure 11: More visualizations of the OOD detection performance on the medical dataset. We expect the policies to start weaning off the P-level (solid blue) as we observe a stable (above the clinical threshold and stationary) mean arterial pressure (MAP) signal. GORMPO-VAE and GORMPO-NeuralODE display the most successful weaning behavior with stable patient trajectory, as supported by their weaning scores. GORMPO-RealNVP shows low magnitude in action change, resulting in 0 WS in this sample. Although MOBILE starts from an offset P-level of 6, it fails at weaning later.

Table 6: Reward $\pm$ one standard deviation of MBPO-based GORMPO on D4RL medium, medium-replay, and medium-expert datasets averaged over 3 seeds. We bold the best score and underline the second-best score. SPOT \[[42](https://arxiv.org/html/2605.24405#bib.bib5)\] and MOBILE \[[37](https://arxiv.org/html/2605.24405#bib.bib4)\] results are transferred from the reported values. The divergence from the reported results in \[[46](https://arxiv.org/html/2605.24405#bib.bib15)\] for MBPO on halfcheetah-m-r, hopper-m-e, and walker2d-m-e datasets is caused by the limited number of seeds averaged, due to computational expense.

| Dataset | SPOT | MOBILE | MBPO | GORMPO-KDE | GORMPO-VAE | GORMPO-RealNVP | GORMPO-DDPM | GORMPO-NeuralODE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| halfcheetah-m-v2 | $58.2$ | $\mathbf{74.6}$ | $52.85 \pm 11.72$ | $46.33 \pm 2.76$ | $67.34 \pm 0.65$ | $64.08 \pm 5.58$ | $10.65 \pm 8.81$ | $56.01 \pm 8.00$ |
| hopper-m-v2 | $83.4$ | $\mathbf{106.6}$ | $10.29 \pm 3.57$ | $3.73 \pm 2.42$ | $12.90 \pm 9.24$ | $11.38 \pm 7.81$ | $16.47 \pm 9.79$ | $14.14 \pm 13.41$ |
| walker2d-m-v2 | $\mathbf{88.1}$ | $87.7$ | $3.43 \pm 2.86$ | $3.06 \pm 1.72$ | $6.36 \pm 1.24$ | $1.06 \pm 2.36$ | $2.14 \pm 2.88$ | $-0.11 \pm 0.36$ |
| halfcheetah-m-r-v2 | $52.2$ | $\mathbf{71.7}$ | $0.79 \pm 1.07$ | $19.89 \pm 25.81$ | $39.66 \pm 23.58$ | $9.56 \pm 11.41$ | $18.11 \pm 24.41$ | $0.65 \pm 0.31$ |
| hopper-m-r-v2 | $100.2$ | $\mathbf{103.9}$ | $37.63 \pm 4.70$ | $48.23 \pm 34.21$ | $22.06 \pm 8.56$ | $67.60 \pm 24.61$ | $25.41 \pm 5.46$ | $12.93 \pm 12.91$ |
| walker2d-m-r-v2 | $\mathbf{91.6}$ | $89.9$ | $2.83 \pm 2.99$ | $6.28 \pm 4.94$ | $3.01 \pm 4.13$ | $6.67 \pm 5.97$ | $0.46 \pm 0.87$ | $9.91 \pm 0.44$ |
| halfcheetah-m-e-v2 | $86.2$ | $\mathbf{108.2}$ | $79.92 \pm 3.64$ | $83.96 \pm 10.66$ | $81.28 \pm 6.32$ | $75.54 \pm 8.45$ | $72.04 \pm 6.68$ | $61.45 \pm 7.02$ |
| hopper-m-e-v2 | $72.3$ | $\mathbf{112.6}$ | $13.11 \pm 8.54$ | $1.84 \pm 0.01$ | $2.31 \pm 0.33$ | $2.98 \pm 1.61$ | $1.83 \pm 0.96$ | $5.88 \pm 1.28$ |
| walker2d-m-e-v2 | $112.0$ | $\mathbf{115.2}$ | $-0.16 \pm 0.02$ | $-0.16 \pm 0.02$ | $-0.16 \pm 0.14$ | $-0.29 \pm 0.02$ | $-0.17 \pm 0.04$ | $-0.06 \pm 0.07$ |
## Appendix E GORMPO Hyperparameters and Model Architectures
#### Computing Resources\.
Models are trained on NVIDIA A100 80GB and NVIDIA GeForce RTX 4090 GPUs\.
Table 7: Elapsed time during GORMPO training on halfcheetah-medium-expert dataset.

| Time Elapsed (m) | GORMPO-KDE | GORMPO-VAE | GORMPO-RealNVP | GORMPO-DDPM | GORMPO-NeuralODE |
| --- | --- | --- | --- | --- | --- |
| halfcheetah-m-e | 439 ± 27.3 | 478.0 ± 29.2 | 614.0 ± 169.9 | 379.6 ± 16.4 | 573.8 ± 26.6 |
### E.1 MBPO Parameters
Following prior work highlighting that reliable policy selection often requires *online* evaluation rather than offline policy evaluation \[[4](https://arxiv.org/html/2605.24405#bib.bib9),[25](https://arxiv.org/html/2605.24405#bib.bib10)\], we tune the penalty coefficient $\lambda$ using a small online interaction budget of 6 hyperparameters. Concretely, we sweep $\lambda$ with online rollouts for a single seed, then fix the best-performing value and train/evaluate across three additional seeds for 100 episodes each to limit interaction and reduce overfitting. This protocol mirrors common practice in offline RL, where hyperparameters are chosen under an explicit online evaluation budget \[[25](https://arxiv.org/html/2605.24405#bib.bib10)\] while maintaining a conservative use of environment interaction. See Table [8](https://arxiv.org/html/2605.24405#A5.T8) for the best $\lambda$.
Table 8: Finetuned $\lambda$ values for each sparse dataset. m=medium, r=replay, e=expert.

| Models | GORMPO-KDE | GORMPO-VAE | GORMPO-RealNVP | GORMPO-DDPM | GORMPO-NeuralODE |
| --- | --- | --- | --- | --- | --- |
| medical dataset | 0.2 | 0.1 | 0.2 | 0.4 | 0.2 |
| halfcheetah-m-e-sparse | 0.1 | 0.1 | 0.8 | 0.3 | 0.1 |
| hopper-m-e-sparse | 0.05 | 0.3 | 0.8 | 0.05 | 0.5 |
| walker2d-m-e-sparse | 0.5 | 0.5 | 0.8 | 0.05 | 0.05 |
| halfcheetah-m | 0.05 | 0.1 | 0.8 | 0.3 | 0.1 |
| hopper-m | 0.05 | 0.3 | 0.8 | 0.05 | 0.5 |
| walker2d-m | 0.8 | 0.5 | 0.05 | 0.05 | 0.05 |
| halfcheetah-m-r | 0.05 | 0.1 | 0.8 | 0.3 | 0.1 |
| hopper-m-r | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| walker2d-m-r | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| halfcheetah-m-e | 0.05 | 0.1 | 0.8 | 0.3 | 0.1 |
| hopper-m-e | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| walker2d-m-e | 0.5 | 0.5 | 0.8 | 0.5 | 0.5 |

We directly use the same hyperparameters for MBPO training in GORMPO as reported in MOPO \[[46](https://arxiv.org/html/2605.24405#bib.bib15)\] except for the rollout length of the transition model. We tuned this parameter to achieve stable Q-values during training, which is essential for strong policy optimization. We find 5, 5, 3, 5 for the medical dataset; halfcheetah, hopper, and walker2d medium-expert-sparse datasets, respectively, as providing the strongest training. See Table [9](https://arxiv.org/html/2605.24405#A5.T9) for base parameters of MBPO for all datasets.

We also give the same tuning budget of 6 hyperparameters to SPOT \[[42](https://arxiv.org/html/2605.24405#bib.bib5)\] and MOBILE’s penalty parameters \[[37](https://arxiv.org/html/2605.24405#bib.bib4)\].

For MOBILE-based GORMPO experiments, we use the same hyperparameters found in Table [8](https://arxiv.org/html/2605.24405#A5.T8) for each sparse D4RL dataset.
Table 9: Base hyperparameters of our GORMPO implementation.

| Parameters | Value |
| --- | --- |
| Actor learning rate | $3 \times 10^{-4}$ |
| Critic learning rate | $3 \times 10^{-4}$ |
| Discount factor ($\gamma$) | 0.99 |
| Target network update coefficient ($\tau$) | 0.005 |
| Target entropy (often $-\text{action dimension}$) | -1 |
| Temperature optimizer learning rate | $3 \times 10^{-4}$ |
| Dynamics model learning rate | $1 \times 10^{-3}$ |
| Dynamics ensemble size | 7 |
| Holdout ratio | 0.2 |
| Training epochs | 100 |
| Steps per epoch | 1000 |
| Evaluation episodes | 1000 |
| Mini-batch size | 256 |
| Model rollout horizon | 5 |
| Rollout batch size | 10000 |
| Rollout frequency | 1000 |
| Real-to-model data sampling ratio | 0.05 |
### E.2 Density Estimator Parameters
In this section, we include hyperparameters and training settings of the different density estimator models. All density estimators use the 1st percentile of the estimated validation distribution as the model threshold. The input dimensions of each model for the medical, halfcheetah, walker2d, and hopper tasks are 73, 23, 23, and 14, respectively.
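The thresholding rule above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the estimator outputs per-sample log-densities in which lower values mean "more OOD", and the names `fit_threshold`, `is_ood`, and the synthetic validation scores are hypothetical.

```python
import numpy as np

def fit_threshold(val_log_probs, percentile=1.0):
    """Return the validation log-density value at the given percentile."""
    scores = np.asarray(val_log_probs, dtype=float)
    return float(np.percentile(scores, percentile))

def is_ood(log_probs, threshold):
    """Flag samples whose log-density falls below the threshold (assumed convention)."""
    return np.asarray(log_probs, dtype=float) < threshold

# Synthetic stand-in for validation log-densities (not real model outputs).
rng = np.random.default_rng(0)
val_log_probs = rng.normal(loc=-5.0, scale=1.0, size=10_000)

threshold = fit_threshold(val_log_probs, percentile=1.0)
flagged = is_ood(val_log_probs, threshold)
print(f"threshold={threshold:.3f}, flagged fraction={flagged.mean():.4f}")
```

By construction, roughly 1% of the validation samples themselves fall below the threshold, which fixes the false-positive rate on in-distribution data.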
#### Kernel Density Estimator\.
We employ the FAISS\[[19](https://arxiv.org/html/2605.24405#bib.bib25)\]k\-nearest neighbor model with GPU acceleration\. See Table[10](https://arxiv.org/html/2605.24405#A5.T10)for parameters\.
Table 10: KDE Model and Training Parameters

| Parameter | Value |
| --- | --- |
| Kernel Function | Gaussian |
| Bandwidth | 1.0 |
| Number of Neighbors (k) | 100 |
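To make the role of the three parameters in Table 10 concrete, here is a small brute-force sketch of a k-nearest-neighbor Gaussian KDE. It is an assumption-laden illustration: it uses plain NumPy instead of FAISS, and the exact way the paper combines the k neighbors into a density is not stated in this section, so the truncated-kernel-sum form below (function name `knn_gaussian_log_density` included) is hypothetical.

```python
import numpy as np

def knn_gaussian_log_density(queries, data, k=100, bandwidth=1.0):
    """Log of a Gaussian KDE truncated to the k nearest neighbors of each query."""
    queries = np.atleast_2d(np.asarray(queries, dtype=float))
    data = np.atleast_2d(np.asarray(data, dtype=float))
    n, d = data.shape
    k = min(k, n)
    # Pairwise squared Euclidean distances, shape (num_queries, n).
    sq_dist = ((queries[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    # Keep the k smallest squared distances per query (order within the k is irrelevant).
    nearest = np.partition(sq_dist, k - 1, axis=1)[:, :k]
    log_kernel = -0.5 * nearest / bandwidth**2
    # Numerically stable log-sum-exp over the k neighbors.
    m = log_kernel.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(log_kernel - m).sum(axis=1))
    # Gaussian normalization constant and division by the dataset size.
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * bandwidth**2)
    return lse - log_norm

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))
log_density = knn_gaussian_log_density([[0.0, 0.0, 0.0], [8.0, 8.0, 8.0]], data)
print(log_density)  # the first (central) query should score higher than the far one
```

A GPU index such as FAISS would replace the brute-force distance computation; the density formula itself is unaffected by that choice.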
#### Variational Autoencoder\.
See Table[11](https://arxiv.org/html/2605.24405#A5.T11)\.
Table 11: VAE Model and Training Parameters

| Parameter | Value |
| --- | --- |
| Latent Dimension | 16 |
| Hidden Dimensions | [256, 256] |
| Activation Function | ReLU |
| Training Epochs | 100 |
| Batch Size | 256 |
| Optimizer | Adam |
| Learning Rate | $1 \times 10^{-3}$ |
| Weight Decay | 0 |
| Beta ($\beta$-VAE weight) | 1.0 |
| LR Scheduler | ReduceLROnPlateau |
| Scheduler Factor | 0.5 |
| Scheduler Patience | patience // 2 |
| Early Stopping Patience | 15 |
#### RealNVP\.
See Table[12](https://arxiv.org/html/2605.24405#A5.T12)\.
Table 12: RealNVP Training Parameters

| Parameter | Value |
| --- | --- |
| Number of Coupling Layers | 6 |
| Hidden Dimensions | [256, 256] |
| Activation Function | ReLU |
| Scale Stabilization | tanh |
| Prior Distribution | Standard Normal ($\mu = 0$, $\sigma = 1$) |
| Training Epochs | 100 |
| Batch Size | 256 |
| Optimizer | Adam |
| Learning Rate | $1 \times 10^{-3}$ |
| Weight Decay | 0 |
| LR Scheduler | ReduceLROnPlateau |
| Scheduler Factor | 0.5 |
| Scheduler Patience | patience // 2 |
| Early Stopping Patience | 15 |
#### Diffusion\.
See Table[13](https://arxiv.org/html/2605.24405#A5.T13)\.
Table 13: Diffusion Architecture and Training Hyperparameters.

| Parameter | Value |
| --- | --- |
| Parameters | $513d + 590{,}848$ (610K) |
| Observation Dimension ($d$) | $\{14, 23\}$ |
| Training Epochs | 50 |
| Batch Size | 512 |
| Model Type | DDPM / DDIM |
| Noise Predictor | Time-conditioned MLP |
| Time Embedding | Sinusoidal (128-dim) |
| Hidden Layers | 3 |
| Hidden Units | 512 |
| Activation | SiLU |
| Input Structure | $[\mathbf{x}_t, \text{emb}(t)] \in \mathbb{R}^{d+128}$ |
| Output Structure | $\boldsymbol{\epsilon} \in \mathbb{R}^{d}$ |
| Optimizer | AdamW |
| Learning Rate | $2 \times 10^{-4}$ |
| Beta Schedule | Linear |
#### NeuralODE\.
See Table[14](https://arxiv.org/html/2605.24405#A5.T14)\.
Table 14: NeuralODE Architecture and Training Hyperparameters.

| Parameter | Value |
| --- | --- |
| Parameters | $1025d + 263{,}680$ (280K) |
| Observation Dimension ($d$) | $\{14, 23\}$ |
| Training Epochs | 20 |
| Batch Size | 512 |
| Network Type | Time-conditioned MLP |
| Hidden Layers | 2 |
| Hidden Units | 512 |
| Activation | SiLU |
| Input Structure | $[\mathbf{x}, t] \in \mathbb{R}^{d+1}$ |
| Output Structure | $\mathbf{v} \in \mathbb{R}^{d}$ |
| Optimizer | AdamW |
| Learning Rate | $1 \times 10^{-3}$ |
| Weight Decay | $1 \times 10^{-4}$ |
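The NeuralODE parameter count in Table 14 can be reproduced with a short calculation. The sketch below assumes a plain fully connected network with biases and layer widths $(d+1) \to 512 \to 512 \to d$; that layer layout is inferred from the table, and the helper names are hypothetical.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def neural_ode_param_count(d, hidden=512):
    """Input [x, t] has d + 1 entries; two hidden layers; output velocity has d entries."""
    return mlp_param_count([d + 1, hidden, hidden, d])

for d in (14, 23):
    total = neural_ode_param_count(d)
    # Closed form from Table 14: 1025 * d + 263,680.
    print(d, total, total == 1025 * d + 263_680)
```

Under this assumption the closed form matches for both observation dimensions, giving about 278K parameters for $d = 14$ and about 287K for $d = 23$.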
## Appendix F OOD Dataset Generation Details
We create “OOD datasets” by randomly selecting 5 trajectories, adding Gaussian noise ($\mathcal{N}(\mu, 0.1)$) to observations and actions, and concatenating the noiseless subset of the dataset with the noisy version. Different levels of OOD datasets correspond to varying values of $\mu$ of the Gaussian noise. We depict the state norm versus action norm plots in Figures [12](https://arxiv.org/html/2605.24405#A6.F12) and [13](https://arxiv.org/html/2605.24405#A6.F13). The number of samples in each dataset is 400, 7992, 10000, and 9546 for the medical, hopper, halfcheetah, and walker2d datasets, respectively, proportional to the original dataset size.
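A minimal sketch of this construction is given below. Several details are assumptions made for illustration: each trajectory is stored as an array of concatenated observation-action rows, the value 0.1 is treated as the standard deviation (the text does not say whether it is a standard deviation or a variance), and the binary labels and the function name `make_ood_dataset` are additions for evaluation convenience.

```python
import numpy as np

def make_ood_dataset(trajectories, mu, n_select=5, noise_scale=0.1, seed=0):
    """Concatenate a clean subset of trajectories with a Gaussian-perturbed copy."""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(trajectories), size=n_select, replace=False)
    clean = np.concatenate([trajectories[i] for i in chosen], axis=0)
    # Noise ~ N(mu, noise_scale) added to every observation and action entry.
    noisy = clean + rng.normal(loc=mu, scale=noise_scale, size=clean.shape)
    samples = np.concatenate([clean, noisy], axis=0)
    # Label 0 = in-distribution (clean), 1 = OOD (perturbed).
    labels = np.concatenate([np.zeros(len(clean), dtype=int),
                             np.ones(len(noisy), dtype=int)])
    return samples, labels

# Toy trajectories: 8 episodes of 20 steps with 4-dimensional (obs, action) rows.
rng = np.random.default_rng(1)
toy_trajectories = [rng.normal(size=(20, 4)) for _ in range(8)]
samples, labels = make_ood_dataset(toy_trajectories, mu=1.0)
print(samples.shape, labels.sum())
```

Larger values of `mu` shift the perturbed copy further from the data support, which corresponds to the different OOD levels described above.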
Figure 12: OOD dataset visualizations for the medical dataset.

Figure 13: OOD dataset visualizations for sparse D4RL medium-expert datasets. Panels: (a) hopper-medium-expert, (b) halfcheetah-medium-expert, (c) walker2d-medium-expert.
## Appendix G More Details on the Smart Weaning of Mechanical Circulatory Devices Task and Dataset
We use the same medical task and real-world dataset as in CORMPO \[[39](https://arxiv.org/html/2605.24405#bib.bib14)\].
### G.1 Dataset Details
Our real-world medical dataset comprises 379 patients, corresponding to 17865 samples after downsampling and sliding-window processing. This dataset includes 12 features recorded directly or derived from signals of the MCS device, namely: mean aortic pressure (MAP), mean pump speed, mean motor current, mean pump flow, left ventricular pressure (LVP), left ventricular end-diastolic pressure (LVEDP), heart rate (HR), systolic blood pressure (SBP), diastolic blood pressure (DBP), pulsatility, relaxation constant (Tau_LV), and elastance estimation (ESE_LV). We downsample the original 25 Hz signal to 0.00167 Hz (1 sample per 10 minutes) and extract samples with a sliding window of 1 hour \[[39](https://arxiv.org/html/2605.24405#bib.bib14)\].
### G.2 MDP Design for MCS
We first formulate the MCS weaning problem as an MDP. The challenge in formulating the environment is balancing the rich information of medical time series with the learning challenges of a high-dimensional Markov Decision Process (MDP). We define each *state* in the MDP to consist of $t$ time steps of $k$ different physiological features, i.e., $\mathcal{S} \subseteq \mathbb{R}^{t \times k}$. The *action* space is $\mathcal{A} = \{2, 3, \cdots, 9\}$, corresponding to pump levels P2 to P9 on the MCS device. The objective is to optimize patient outcome with a clinically appropriate weaning strategy. For the offline RL problem, we organize the patient data into a replay buffer dataset $\mathcal{D} = \{(s_i, a_i, s'_i, r_i)\}_i$ according to the formulation. The state space, action space, reward, and MDP design are informed by expert recommendation \[[39](https://arxiv.org/html/2605.24405#bib.bib14)\].
#### Observations\.
The observation space includes 12 hemodynamic features of the patient. Our inputs are the pump pressure, pump speed, and motor current 25 Hz signals recorded by the MCS device. Based on expert suggestion, we downsample patient data from 25 Hz to 0.00167 Hz (1 sample per 10 minutes) and process them into sliding windows of 1 hour (6 time steps) to be used as states for digital twin prediction and decision making. Therefore, the observation space is $\mathcal{S} = \mathbb{R}^{6 \times 12}$, where each $s_i = x_{t:t+6}$ at some $t$ for a patient.
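The preprocessing described above can be sketched as follows. This is an illustrative reconstruction: averaging within each 10-minute bin and using a stride of one step between windows are assumptions (the text gives only the rates and the window length), and the function names are hypothetical.

```python
import numpy as np

RAW_HZ = 25
SAMPLES_PER_BIN = RAW_HZ * 600  # 10 minutes of 25 Hz data = 15,000 raw samples

def downsample_mean(signal, factor):
    """Average consecutive blocks of `factor` rows; trailing incomplete blocks are dropped."""
    signal = np.asarray(signal, dtype=float)
    n_bins = signal.shape[0] // factor
    trimmed = signal[: n_bins * factor]
    return trimmed.reshape(n_bins, factor, *signal.shape[1:]).mean(axis=1)

def sliding_windows(x, window=6, stride=1):
    """Stack overlapping windows x[t : t + window]; requires at least `window` rows."""
    x = np.asarray(x)
    starts = range(0, x.shape[0] - window + 1, stride)
    return np.stack([x[s : s + window] for s in starts])

# Small toy example (factor 4 instead of 15,000 to keep it fast).
raw = np.arange(40, dtype=float).reshape(40, 1)
binned = downsample_mean(raw, factor=4)     # shape (10, 1)
states = sliding_windows(binned, window=6)  # shape (5, 6, 1)
print(binned.shape, states.shape)
```

With 12 features and `factor=SAMPLES_PER_BIN`, each state would have shape `(6, 12)`, matching $\mathcal{S} = \mathbb{R}^{6 \times 12}$.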
#### Action\.
The action for our MDP is the pump support level (P-level) of the MCS device. The device operates at 8 different speed levels, from P2 to P9, each with a constant motor speed (rpm). The P-level proportionally determines the blood flow provided to the patient by the motor’s speed and current. Clinicians can control the P-level while the patient is on support. The P-level generally stays unchanged in 1-hour intervals, unlike the state features, since it is manually controlled by the clinicians during the treatment. In practice, we take the mean P-level over the 1-hour interval as expert action. As a result, we define $\mathcal{A} = \{2, \dots, 9\}$.
#### Rewards\.
The design table for the reward function in Appendix [G.3](https://arxiv.org/html/2605.24405#A7.SS3) is generated in line with medical consultancy. It assigns an (inverted) risk score based on acceptable intervals for hemodynamic features. The physiological reward is further normalized through Z-score normalization and clipped to $[-2, 2]$ to ensure training stability.
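The normalization step can be illustrated with a short sketch. It assumes the mean and standard deviation are computed over the whole set of offline rewards, which the text does not specify; `normalize_rewards` is a hypothetical name.

```python
import numpy as np

def normalize_rewards(rewards, clip=2.0):
    """Z-score the rewards, then clip the result to [-clip, clip]."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    # Guard against a constant reward signal, where the z-score is undefined.
    z = (r - r.mean()) / std if std > 0 else np.zeros_like(r)
    return np.clip(z, -clip, clip)

raw_rewards = [-10.0, -1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
normalized = normalize_rewards(raw_rewards)
print(normalized)  # the outlier at -10 is clipped to -2
```

Clipping bounds the influence of rare, very large penalties on the critic targets, which is the stability motivation stated above.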
### G.3 Medically-informed Metrics
Action Change Penalty \(ACP\)\[[45](https://arxiv.org/html/2605.24405#bib.bib61)\]: Abrupt and extreme changes in P\-level may maximize rewards; however, they can induce physiological instability in a real\-world setting\. ACP gauges policy volatility and is given by:
$$\text{ACP} = \sum_{i=1}^{T} \|a_{i-1} - a_i\|_2, \quad \text{if } \|a_{i-1} - a_i\|_2 > 2,$$

where $a_{i-1}$ is an action at state $i-1$, $a_i$ is a subsequent action, and $T$ is the episode length. Lower ACP values indicate stable physiology and safe weaning, but note that a value of 0 is undesirable as the P-level must be lowered for weaning.
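As a concrete reading of the ACP definition, the sketch below sums only those consecutive P-level changes whose magnitude exceeds 2. Actions are treated as scalar P-levels, so the L2 norm reduces to an absolute value; the function name is hypothetical, and any per-episode averaging used for reporting is not covered here.

```python
def action_change_penalty(actions, threshold=2):
    """Sum of |a_{i-1} - a_i| over steps where the change is strictly greater than threshold."""
    total = 0.0
    for prev, curr in zip(actions[:-1], actions[1:]):
        change = abs(prev - curr)
        if change > threshold:
            total += change
    return total

# Changes are 0, 3, 1, 2, 0; only the jump of 3 exceeds the threshold.
print(action_change_penalty([9, 9, 6, 5, 3, 3]))  # 3.0
```

Gradual reductions of one or two P-levels contribute nothing, so a policy can wean without incurring any penalty as long as it avoids abrupt jumps.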
Weaning Score (WS): To capture satisfactory weaning patterns, we support P-level reductions at most every 1 hour when the patient is observed as hemodynamically stable for the past 1 hour as depicted in Eq. [14](https://arxiv.org/html/2605.24405#S5.E14). Higher weaning scores signify an appropriate reduction in P-level during relatively stable physiological states. We define stability based on the gradient of the past hemodynamic state as

$$\texttt{Is\_Stable}(i) = \left|\frac{\partial\,\mathrm{MAP}(i)}{\partial t}\right| < \tau_{1} \,\land\, \left|\frac{\partial\,\mathrm{HR}(i)}{\partial t}\right| < \tau_{2} \,\land\, \left|\frac{\partial\,\mathrm{Pulsat}(i)}{\partial t}\right| < \tau_{3}, \tag{50}$$

where $\tau_{1} = \tau_{\text{MAP}} = 1.36$, $\tau_{2} = \tau_{\text{HR}} = 2.16$, and $\tau_{3} = \tau_{\text{Pulsat}} = 1.95$. These thresholds give a proxy for stability, namely a low gradient value of 3 hemodynamic indicators in the past state, and were chosen with statistical significance tests.
We evaluate the policies with the gradient-based WS definition. The second part of the WS metric is defined as follows:

$$\texttt{Weaned}(i) = \begin{cases} -1, & \text{if } a_i - a_{i+1} < 0, \\ a_i - a_{i+1}, & \text{if } a_i - a_{i+1} \in \{1, 2\}, \\ 0, & \text{otherwise}. \end{cases}$$
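The two ingredients of the weaning score can be written down directly. The sketch below implements only the stability test of Eq. (50) and the `Weaned` case analysis; how the two are combined into the final score is given by Eq. 14 in the main text and is not reproduced here. The function and argument names are hypothetical, and the gradients are assumed to be precomputed scalars.

```python
def is_stable(d_map, d_hr, d_pulsat, tau_map=1.36, tau_hr=2.16, tau_pulsat=1.95):
    """True when all three gradient magnitudes are below their thresholds."""
    return abs(d_map) < tau_map and abs(d_hr) < tau_hr and abs(d_pulsat) < tau_pulsat

def weaned(a_curr, a_next):
    """Case analysis on the P-level reduction a_i - a_{i+1}."""
    diff = a_curr - a_next
    if diff < 0:          # P-level was raised: penalized with -1
        return -1
    if diff in (1, 2):    # moderate reduction: rewarded by its size
        return diff
    return 0              # no change, or a reduction larger than 2

print(is_stable(0.5, 1.0, -1.0), weaned(6, 5), weaned(6, 3), weaned(5, 6))
```

Note that a reduction of three or more levels scores 0 rather than a positive value, which is consistent with the ACP threshold on abrupt changes.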
Physiological Reward: The reward generally reflects the well\-being of the patient, according to the mean arterial pressure \(MAP\), heart rate \(HR\), and pulsatility of the past hour\. Our design follows the clinically defined ranges for hemodynamic stability while caring for the smoothness and differentiability of the function\.
Table 15: Hemodynamic instability score table from \[[6](https://arxiv.org/html/2605.24405#bib.bib51)\]. We use a modified version of this table as our physiological reward. When used for evaluating the learned policy as a reward function, we multiply the risk score by -1.

| Score Component (Hemodynamic Variable) | Value | Score |
| --- | --- | --- |
| MAP | ≥ 60 | 0 |
| | 50 to 59 | 1 |
| | 40 to 49 | 3 |
| | < 40 | 7 |
| Minimum MAP in window | ≥ 60 | 0 |
| | 50 to 59 | 1 |
| | 40 to 49 | 3 |
| | < 40 | 7 |
| Time Spent MAP < 60 mmHg (%) | 0 | 0 |
| | 2 | 1 |
| | 5 | 3 |
| | > 5 | 7 |
| Pulsatility | > 20 | 0 |
| | 10-20 | 5 |
| | < 10 | 7 |
| HR | > 100 | 3 |
| | < 50 | 3 |
| LVEDP | > 20 | 7 |
| | 15 to 20 | 4 |
| | < 15 | 3 |
| CPO | 0.6 to 1 | 1 |
| | < 0.6 | 3 |
| | < 0.5 | 5 |

The reward design in Table [15](https://arxiv.org/html/2605.24405#A7.T15) is staircase-shaped, which has two drawbacks: non-differentiability and a sparse signal. We reformulate the hemodynamic instability score in the following way.
- •Heart Rate Penalty Function: The heart rate penalty function penalizes deviations from an optimal heart rate of 75 bpm using a quadratic penalty: $$P_{\text{hr}}(hr) = \text{ReLU}\left(\frac{(hr - 75)^{2}}{250} - 1\right) \tag{51}$$ where $\text{ReLU}(x) = \max(0, x)$. This function has zero penalty for heart rates in the range $[50, 100]$ bpm and applies quadratic penalties for heart rates outside this range.
- •Minimum MAP Penalty Function: The minimum Mean Arterial Pressure (MAP) penalty function ensures MAP values remain above 60 mmHg: $$P_{\text{minMAP}}(MAP) = \text{ReLU}\left(\frac{7(60 - MAP)}{20}\right) \tag{52}$$ This function applies a linear penalty when MAP falls below 60 mmHg, with the penalty increasing as MAP decreases further from this threshold.
- •Pulsatility Penalty Function: The pulsatility penalty function maintains pulsatility within the range $[20, 50]$: $$P_{\text{pulsat}}(p) = \text{ReLU}\left(\frac{7(20 - p)}{20}\right) + \text{ReLU}\left(\frac{p - 50}{20}\right) \tag{53}$$ This bi-directional penalty function penalizes pulsatility values below 20 and above 50, with zero penalty for pulsatility in the range $[20, 50]$.
- •Hypertension Penalty Function: The hypertension penalty function penalizes elevated mean MAP values above 106 mmHg: $$P_{\text{hyp}}(MAP) = \text{ReLU}\left(\frac{MAP - 106}{18}\right) \tag{54}$$ This function applies a linear penalty for mean MAP values exceeding the hypertension threshold of 106 mmHg.
The overall reward function combines all penalty components and negates the sum to create a reward signal:
$$R(s) = -\left[P_{\text{minMAP}}(\min(\text{MAP})) + P_{\text{hyp}}(\overline{\text{MAP}}) + P_{\text{hr}}(\min(HR)) + P_{\text{pulsat}}(\min(\text{Pulsat}))\right] \tag{55}$$
where:
- •$\min(\text{MAP})$, $\min(HR)$, $\min(\text{Pulsat})$ are the minimum values over the time horizon
- •$\overline{\text{MAP}}$ is the mean MAP over the time horizon
- •The negative sign converts penalties into rewards (higher rewards for lower penalties)
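Equations (51) to (55) translate into a few lines of code. The sketch below follows the formulas exactly as printed and evaluates them on one window of MAP, heart-rate, and pulsatility values; the function names and the list-based window representation are assumptions. With the constants as printed in Eq. (51), the heart-rate term first becomes positive once $|hr - 75|$ exceeds $\sqrt{250} \approx 15.8$ bpm.

```python
def relu(x):
    return max(0.0, x)

def p_hr(hr):
    """Eq. (51): quadratic heart-rate penalty centered at 75 bpm."""
    return relu((hr - 75.0) ** 2 / 250.0 - 1.0)

def p_min_map(map_value):
    """Eq. (52): linear penalty once MAP drops below 60 mmHg."""
    return relu(7.0 * (60.0 - map_value) / 20.0)

def p_pulsat(p):
    """Eq. (53): penalty for pulsatility below 20 or above 50."""
    return relu(7.0 * (20.0 - p) / 20.0) + relu((p - 50.0) / 20.0)

def p_hyp(mean_map):
    """Eq. (54): linear penalty once mean MAP exceeds 106 mmHg."""
    return relu((mean_map - 106.0) / 18.0)

def reward(map_window, hr_window, pulsat_window):
    """Eq. (55): negated sum of the four penalty terms over one time window."""
    mean_map = sum(map_window) / len(map_window)
    penalty = (p_min_map(min(map_window)) + p_hyp(mean_map)
               + p_hr(min(hr_window)) + p_pulsat(min(pulsat_window)))
    return -penalty

# A stable window incurs no penalty; a hypotensive window (MAP 50) is penalized.
print(reward([80.0] * 6, [75.0] * 6, [30.0] * 6))
print(reward([50.0] * 6, [75.0] * 6, [30.0] * 6))
```

At MAP = 40 mmHg the minimum-MAP term equals 7, which matches the top score for MAP < 40 in Table 15, so the linear form interpolates the original staircase.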
## Appendix H Details and Visualization of Sparse D4RL Datasets
We show the details of sparse dataset generation from the Gym\-MuJoCo D4RL medium\-expert datasets\. The unsafe region limits are decided based on the visualizations in Figure[14](https://arxiv.org/html/2605.24405#A8.F14)on Action Norm vs Reward contour plots\. We display the selected limits in Table[16](https://arxiv.org/html/2605.24405#A8.T16)\. In general, we enclose one mode of the action norm vs reward space\.
Table 16: Sparse D4RL action norm and reward unsafe region limits. ha=halfcheetah, ho=hopper, wa=walker2d.

| Dataset | Minimum reward | Maximum reward | Minimum action norm | Maximum action norm | Discarded percentage from the unsafe region | Discarded percentage from the full dataset |
| --- | --- | --- | --- | --- | --- | --- |
| ha-m-e-sparse | -3.014 | 6.569 | 0.363 | 1.936 | 41% | 27.5% |
| ho-m-e-sparse | 0.549 | 3.892 | 0.012 | 1.058 | 40% | 22% |
| wa-m-e-sparse | -2.557 | 4.579 | 0.255 | 1.762 | 48% | 27% |

Figure 14: State-action space visualizations for D4RL datasets. Top row: full datasets, (a) halfcheetah, (b) hopper, (c) walker2d. Bottom row: sparse datasets, (d) halfcheetah, (e) hopper, (f) walker2d. Overall, we discard 27.5%, 22%, and 27% of the samples from halfcheetah, hopper, and walker2d, respectively.

Figure 15: Action-reward space of sparse walker2d and hopper datasets. Panels: (a) sparse halfcheetah, (b) sparse hopper, (c) sparse walker2d. We compute the L2 norm of the actions to project it to 1D space for visualization purposes.

Figure 16: The distribution of episode lengths of the sparse D4RL datasets. halfcheetah’s terminal signal is always 1000, while others vary a lot. This shows a deterministic termination signal in halfcheetah, unlike hopper and walker2d datasets.
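One way to read Table 16 is as a box in (reward, action norm) space from which a fixed fraction of samples is removed. The sketch below implements that reading under explicit assumptions: samples inside the box are discarded uniformly at random, and the function name `sparsify_mask` and the synthetic data are hypothetical. The section does not state how the discarded samples are selected.

```python
import numpy as np

def sparsify_mask(rewards, actions, r_min, r_max, n_min, n_max, drop_frac, seed=0):
    """Boolean keep-mask that drops `drop_frac` of the samples inside the region."""
    rng = np.random.default_rng(seed)
    rewards = np.asarray(rewards, dtype=float)
    norms = np.linalg.norm(np.asarray(actions, dtype=float), axis=1)
    in_region = ((rewards >= r_min) & (rewards <= r_max)
                 & (norms >= n_min) & (norms <= n_max))
    region_idx = np.flatnonzero(in_region)
    n_drop = int(round(drop_frac * region_idx.size))
    dropped = rng.choice(region_idx, size=n_drop, replace=False)
    keep = np.ones(rewards.shape[0], dtype=bool)
    keep[dropped] = False
    return keep

# Synthetic data with the halfcheetah limits from Table 16.
rng = np.random.default_rng(1)
toy_rewards = rng.uniform(-5.0, 8.0, size=2000)
toy_actions = rng.uniform(-1.0, 1.0, size=(2000, 6))
keep = sparsify_mask(toy_rewards, toy_actions, -3.014, 6.569, 0.363, 1.936, drop_frac=0.41)
print(f"discarded {100 * (1 - keep.mean()):.1f}% of the full dataset")
```

The two percentage columns of Table 16 correspond to `drop_frac` (relative to the region) and `1 - keep.mean()` (relative to the full dataset), respectively.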
## Appendix I t-SNE Plots for OOD Avoidance Behavior
Figure 17: t-SNE projections of equal-sized offline dataset and policy rollout $(s', a)$ pairs. MBPO-based GORMPO rollouts expand beyond MOBILE rollouts in hopper and walker2d, while MOBILE has more overlap and coverage. The best GORMPO models either symmetrically surround data support (GORMPO-DDPM in hopper) or stay near support without overlapping (GORMPO-RealNVP in walker2d). In halfcheetah, GORMPO-RealNVP depicts minimal deviation from data support, while MOBILE complements the data support.
## Appendix J Empirical Analysis of Diffusion Noise Predictions
To rationalize the low OOD detection accuracy observed in our experiments, we analyze the statistical properties of the noise vectors predicted by the diffusion model\. Our hypothesis is that the model’s strong inductive bias towards the standard normal distribution masks the presence of Gaussian\-perturbed anomalies\.
As illustrated in Figure [18](https://arxiv.org/html/2605.24405#A10.F18), the diffusion model consistently predicts noise $\epsilon$ that conforms tightly to the standard normal distribution $\mathcal{N}(0, 1)$, regardless of whether the input is ID or OOD.
- •Gaussian Adherence (Panel A): The predicted noise exhibits negligible deviation from normality, with near-zero skewness ($0.000$) and excess kurtosis ($-0.009$).
- •Distributional Collapse (Panel B): Consequently, the noise distributions for ID and Gaussian-perturbed OOD inputs become statistically indistinguishable, yielding a Kullback-Leibler (KL) divergence of only $0.0246$.
This analysis suggests that the diffusion model projects both ID and Gaussian\-perturbed OOD samples toward a similar Gaussian latent manifold, reducing separability at the aggregate distribution level\. However, diffusion\-based OOD detection often relies on sample\-wise denoising or reconstruction discrepancies rather than solely on marginal statistics of the predicted noise\. Therefore, overlapping aggregate noise distributions do not necessarily imply the complete absence of OOD signal\. In our setting, the weak empirical separation may additionally stem from the Gaussian approximation strategy, or saturation effects in the density penalty, which can diminish sample\-level discriminative information during policy optimization\.
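The aggregate statistics used in this analysis can be computed as in the sketch below. It is illustrative only: the arrays are synthetic draws rather than predicted noise from the diffusion model, and the histogram-based KL estimator (with shared bin edges and a small smoothing constant) is an assumption, since the estimator used for the reported value is not described in this section.

```python
import numpy as np

def skewness(x):
    x = np.ravel(np.asarray(x, dtype=float))
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def excess_kurtosis(x):
    x = np.ravel(np.asarray(x, dtype=float))
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

def histogram_kl(p_samples, q_samples, bins=50, eps=1e-8):
    """Estimate KL(P || Q) from two sample sets using a shared histogram grid."""
    p_samples = np.ravel(np.asarray(p_samples, dtype=float))
    q_samples = np.ravel(np.asarray(q_samples, dtype=float))
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(p_samples, bins=edges)
    q, _ = np.histogram(q_samples, bins=edges)
    # Smooth and renormalize so that empty bins do not produce log(0).
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
noise_id = rng.standard_normal(100_000)          # stand-in for ID noise predictions
noise_ood = rng.standard_normal(100_000)         # stand-in for OOD noise predictions
noise_shifted = rng.normal(3.0, 1.0, 100_000)    # clearly different distribution

print(skewness(noise_id), excess_kurtosis(noise_id))
print(histogram_kl(noise_id, noise_ood), histogram_kl(noise_id, noise_shifted))
```

When both sample sets are close to $\mathcal{N}(0, 1)$, all three statistics are near zero, which is why these marginal summaries alone cannot separate ID from OOD inputs.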
Figure 18: Distributional overlap of predicted noise. (A) The predicted noise $\epsilon$ tightly conforms to a standard normal distribution $\mathcal{N}(0, 1)$, evidenced by negligible skewness ($0.000$) and excess kurtosis ($-0.009$). (B) This Gaussian constraint causes substantial overlap between ID and OOD predictions. The low KL divergence ($0.0246$) confirms that the diffusion model projects distinct input distributions onto an indistinguishable Gaussian manifold.
## Appendix K Penalty Progression Plots for Sparse D4RL Datasets
Figure 19: Mean ± standard deviation of the penalty while training MBPO-based GORMPO for 3000000 steps on the sparse D4RL datasets.
## NeurIPS Paper Checklist
1. 1\.Claims
2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
3. Answer:\[Yes\]
4. Justification: Our theoretical and empirical results are summarized in the abstract and in the introduction with bullet points\.
5. Guidelines: - •The answer\[N/A\]means that the abstract and introduction do not include the claims made in the paper\. - •The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations\. A\[No\]or\[N/A\]answer to this question will not be perceived well by the reviewers\. - •The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings\. - •It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper\.
6. 2\.Limitations
7. Question: Does the paper discuss the limitations of the work performed by the authors?
8. Answer:\[Yes\]
9. Justification: Please see the last few sentences of the Conclusion in Section [6](https://arxiv.org/html/2605.24405#S6).
10. Guidelines: - •The answer\[N/A\]means that the paper has no limitation while the answer\[No\]means that the paper has limitations, but those are not discussed in the paper\. - •The authors are encouraged to create a separate “Limitations” section in their paper\. - •The paper should point out any strong assumptions and how robust the results are to violations of these assumptions \(e\.g\., independence assumptions, noiseless settings, model well\-specification, asymptotic approximations only holding locally\)\. The authors should reflect on how these assumptions might be violated in practice and what the implications would be\. - •The authors should reflect on the scope of the claims made, e\.g\., if the approach was only tested on a few datasets or with a few runs\. In general, empirical results often depend on implicit assumptions, which should be articulated\. - •The authors should reflect on the factors that influence the performance of the approach\. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting\. Or a speech\-to\-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon\. - •The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size\. - •If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness\. - •While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper\. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community\. 
Reviewers will be specifically instructed to not penalize honesty concerning limitations\.
11. 3\.Theory assumptions and proofs
12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
13. Answer:\[Yes\]
14. Justification: We provide proofs of our theorems with supporting lemmas in Appendix[B](https://arxiv.org/html/2605.24405#A2)\.
15. Guidelines: - •The answer\[N/A\]means that the paper does not include theoretical results\. - •All the theorems, formulas, and proofs in the paper should be numbered and cross\-referenced\. - •All assumptions should be clearly stated or referenced in the statement of any theorems\. - •The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition\. - •Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material\. - •Theorems and Lemmas that the proof relies upon should be properly referenced\.
16. 4\.Experimental result reproducibility
17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper \(regardless of whether the code and data are provided or not\)?
18. Answer:\[Yes\]
19. Justification: We provide all information about model hyperparameters in Appendix [E](https://arxiv.org/html/2605.24405#A5), details about the proprietary medical dataset in Appendix [G](https://arxiv.org/html/2605.24405#A7), OOD test dataset generation in Appendix [F](https://arxiv.org/html/2605.24405#A6), and sparse dataset creation details in Appendix [H](https://arxiv.org/html/2605.24405#A8). We give all methodological details in Section [3](https://arxiv.org/html/2605.24405#S3) in the main paper.
20. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •If the paper includes experiments, a\[No\]answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not\. - •If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable\. - •Depending on the contribution, reproducibility can be accomplished in various ways\. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model\. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model \(e\.g\., in the case of a large language model\), releasing of a model checkpoint, or other means that are appropriate to the research performed\. - •While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution\. For example: 1. \(a\)If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm\. 2. \(b\)If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully\. 3. \(c\)If the contribution is a new model \(e\.g\., a large language model\), then there should either be a way to access this model for reproducing the results or a way to reproduce the model \(e\.g\., with an open\-source dataset or instructions for how to construct the dataset\)\. 4. \(d\)We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility\. In the case of closed\-source models, it may be that access to the model is limited in some way \(e\.g\., to registered users\), but it should be possible for other researchers to have some path to reproducing or verifying the results\.
21. 5\.Open access to data and code
22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
23. Answer:\[No\]
24. Justification: We will release code, sparse D4RL datasets and OOD test data upon an acceptance decision\. The real\-world medical dataset contains human\-subject information and cannot be publicly released\.
25. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments requiring code\. - •While we encourage the release of code and data, we understand that this might not be possible, so\[No\]is an acceptable answer\. Papers cannot be rejected simply for not including code, unless this is central to the contribution \(e\.g\., for a new open\-source benchmark\)\. - •The instructions should contain the exact command and environment needed to run to reproduce the results\. See the NeurIPS code and data submission guidelines \([https://neurips\.cc/public/guides/CodeSubmissionPolicy](https://neurips.cc/public/guides/CodeSubmissionPolicy)\) for more details\. - •The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc\. - •The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines\. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why\. - •At submission time, to preserve anonymity, the authors should release anonymized versions \(if applicable\)\. - •Providing as much information as possible in supplemental material \(appended to the paper\) is recommended, but including URLs to data and code is permitted\.
26. 6\.Experimental setting/details
27. Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results?
28. Answer:\[Yes\]
29. Justification: Please refer to Appendix [E](https://arxiv.org/html/2605.24405#A5)\.
30. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them\. - •The full details can be provided either with the code, in appendix, or as supplemental material\.
31. 7\.Experiment statistical significance
32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
33. Answer:\[Yes\]
34. Justification: We provide the RL results with 1\-sigma error bars over 3 random seeds\.
35. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The authors should answer\[Yes\]if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper\. - •The factors of variability that the error bars are capturing should be clearly stated \(for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions\)\. - •The method for calculating the error bars should be explained \(closed form formula, call to a library function, bootstrap, etc\.\) - •The assumptions made should be given \(e\.g\., Normally distributed errors\)\. - •It should be clear whether the error bar is the standard deviation or the standard error of the mean\. - •It is OK to report 1\-sigma error bars, but one should state it\. The authors should preferably report a 2\-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified\. - •For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range \(e\.g\., negative error rates\)\. - •If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text\.
36. 8\.Experiments compute resources
37. Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments?
38. Answer:\[Yes\]
39. Justification: We report compute resources along with training duration in Appendix [E](https://arxiv.org/html/2605.24405#A5)\.
40. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage\. - •The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute\. - •The paper should disclose whether the full research project required more compute than the experiments reported in the paper \(e\.g\., preliminary or failed experiments that didn’t make it into the paper\)\.
41. 9\.Code of ethics
43. Answer:\[Yes\]
44. Justification: This study used real patient data under IRB approval with no access to personally identifiable information\. The proposed methods are not deployed in real\-world clinical decision making\. To the best of our knowledge, we do not introduce or amplify harmful biases, and we report limitations transparently\.
45. Guidelines: - •The answer\[N/A\]means that the authors have not reviewed the NeurIPS Code of Ethics\. - •If the authors answer\[No\], they should explain the special circumstances that require a deviation from the Code of Ethics\. - •The authors should make sure to preserve anonymity \(e\.g\., if there is a special consideration due to laws or regulations in their jurisdiction\)\.
46. 10\.Broader impacts
47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
48. Answer:\[Yes\]
49. Justification: This work contributes to safer offline reinforcement learning by reducing out\-of\-distribution behavior in sparse and safety\-critical settings such as healthcare\. More reliable policy optimization under limited data may improve the robustness of clinical decision\-support systems and other real\-world sequential decision\-making applications\. However, because learned policies remain dependent on historical data quality and coverage, careful validation and human oversight are necessary before deployment\.
50. Guidelines: - •The answer\[N/A\]means that there is no societal impact of the work performed\. - •If the authors answer\[N/A\]or\[No\], they should explain why their work has no societal impact or why the paper does not address societal impact\. - •Examples of negative societal impacts include potential malicious or unintended uses \(e\.g\., disinformation, generating fake profiles, surveillance\), fairness considerations \(e\.g\., deployment of technologies that could make decisions that unfairly impact specific groups\), privacy considerations, and security considerations\. - •The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments\. However, if there is a direct path to any negative applications, the authors should point it out\. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation\. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster\. - •The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from \(intentional or unintentional\) misuse of the technology\. - •If there are negative societal impacts, the authors could also discuss possible mitigation strategies \(e\.g\., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML\)\.
51. 11\.Safeguards
52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)?
53. Answer:\[N/A\]
54. Justification: There is no high risk of misuse of our models as they are not deployed on real subjects\. No sensitive data are made publicly available\.
55. Guidelines: - •The answer\[N/A\]means that the paper poses no such risks\. - •Released models that have a high risk for misuse or dual\-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters\. - •Datasets that have been scraped from the Internet could pose safety risks\. The authors should describe how they avoided releasing unsafe images\. - •We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort\.
56. 12\.Licenses for existing assets
57. Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
58. Answer:\[Yes\]
59. Justification: We properly cite all existing works used and mentioned in our paper\.
60. Guidelines: - •The answer\[N/A\]means that the paper does not use existing assets\. - •The authors should cite the original paper that produced the code package or dataset\. - •The authors should state which version of the asset is used and, if possible, include a URL\. - •The name of the license \(e\.g\., CC\-BY 4\.0\) should be included for each asset\. - •For scraped data from a particular source \(e\.g\., website\), the copyright and terms of service of that source should be provided\. - •If assets are released, the license, copyright information, and terms of use in the package should be provided\. For popular datasets,[paperswithcode\.com/datasets](https://arxiv.org/html/2605.24405v1/paperswithcode.com/datasets)has curated licenses for some datasets\. Their licensing guide can help determine the license of a dataset\. - •For existing datasets that are re\-packaged, both the original license and the license of the derived asset \(if it has changed\) should be provided\. - •If this information is not available online, the authors are encouraged to reach out to the asset’s creators\.
61. 13\.New assets
62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
63. Answer:\[N/A\]
64. Justification: Code will be made public in the final version of the paper\.
65. Guidelines: - •The answer\[N/A\]means that the paper does not release new assets\. - •Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates\. This includes details about training, license, limitations, etc\. - •The paper should discuss whether and how consent was obtained from people whose asset is used\. - •At submission time, remember to anonymize your assets \(if applicable\)\. You can either create an anonymized URL or include an anonymized zip file\.
66. 14\.Crowdsourcing and research with human subjects
67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)?
68. Answer:\[N/A\]
69. Justification: This work does not involve crowdsourcing experiments or human subject studies\. Our real\-world medical dataset is anonymized and not personally identifiable\.
70. Guidelines: - •The answer\[N/A\]means that the paper does not involve crowdsourcing nor research with human subjects\. - •Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper\. - •According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector\.
71. 15\.Institutional review board \(IRB\) approvals or equivalent for research with human subjects
72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
73. Answer:\[Yes\]
74. Justification: This study used real patient data under IRB approval\.
75. Guidelines: - •The answer\[N/A\]means that the paper does not involve crowdsourcing nor research with human subjects\. - •Depending on the country in which research is conducted, IRB approval \(or equivalent\) may be required for any human subjects research\. If you obtained IRB approval, you should clearly state this in the paper\. - •We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution\. - •For initial submissions, do not include any information that would break anonymity \(if applicable\), such as the institution conducting the review\.
76. 16\.Declaration of LLM usage
77. Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required\.
78. Answer:\[N/A\]
79. Justification: This paper used LLM tools only for writing, editing, or formatting purposes\.
80. Guidelines: - •The answer\[N/A\]means that the core method development in this research does not involve LLMs as any important, original, or non\-standard components\. - •Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described\.