
# Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning
Source: [https://arxiv.org/html/2605.08202](https://arxiv.org/html/2605.08202)
Qingjun Wang¹, Hongtu Zhou¹, Hang Yu¹, Junqiao Zhao¹·², Yanping Zhao¹, Chen Ye¹, Ziqiao Wang¹, Guang Chen¹·³
¹School of Computer Science and Technology, Tongji University, Shanghai, China
²MOE Key Lab of Embedded System and Service Computing, Tongji University, Shanghai, China
³Shanghai Innovation Institute
{2432069, zhouhongtu, 2432034, zhaojunqiao}@tongji.edu.cn, {2534018, yechen, ziqiaowang, guangchen}@tongji.edu.cn

###### Abstract

Offline reinforcement learning (RL) faces a critical challenge of overestimating the value of out-of-distribution (OOD) actions. Existing methods mitigate this issue by penalizing unseen samples, yet they fail to accurately identify OOD actions and may suppress beneficial exploration beyond the behavioral support. Although several methods have been proposed to differentiate OOD samples with distinct properties, they typically rely on restrictive assumptions about the data distribution and remain limited in discrimination ability. To address this problem, we propose DOSER (Diffusion-based OOD Detection and SElective Regularization), a novel framework that goes beyond uniform penalization. DOSER trains two diffusion models to capture the behavior policy and state distribution, using single-step denoising reconstruction error as a reliable OOD indicator. During policy optimization, it further distinguishes between beneficial and detrimental OOD actions by evaluating predicted transitions, selectively suppressing risky actions while encouraging exploration of high-potential ones. Theoretically, we prove that DOSER is a $\gamma$-contraction and therefore admits a unique fixed point with bounded value estimates. We further provide an asymptotic performance guarantee relative to the optimal policy under model approximation and OOD detection errors. Across extensive offline RL benchmarks, DOSER consistently attains superior performance to prior methods, especially on suboptimal datasets.

## 1 Introduction

Offline reinforcement learning (RL) has emerged as a powerful paradigm for learning policies exclusively from static datasets, eliminating the need for potentially costly or risky online interactions (Levine et al., [2020](https://arxiv.org/html/2605.08202#bib.bib1)). This capability makes it particularly appealing for real-world domains where exploration is constrained, such as robotics, healthcare, and autonomous systems. However, directly applying standard off-policy RL algorithms to offline datasets poses a fundamental challenge of *distribution shift*. When the learned policy generates actions that deviate substantially from the training data distribution, value functions tend to extrapolate erroneously, leading to severe value overestimation and ultimately catastrophic performance degradation (Fujimoto et al., [2019](https://arxiv.org/html/2605.08202#bib.bib2)).

Existing approaches fall into two categories: 1) Policy constraint methods enforce that the learned policy remains close to the behavior policy to avoid OOD actions (Kumar et al., [2019](https://arxiv.org/html/2605.08202#bib.bib3); Wu et al., [2019](https://arxiv.org/html/2605.08202#bib.bib4); Fujimoto and Gu, [2021](https://arxiv.org/html/2605.08202#bib.bib5); Kostrikov et al., [2021](https://arxiv.org/html/2605.08202#bib.bib6)), typically relying on variational auto-encoders (VAEs) (Kingma and Welling, [2013](https://arxiv.org/html/2605.08202#bib.bib7)) for behavior modeling. While effective in principle, these methods struggle to capture the multi-modal nature of real-world behaviors, often collapsing diverse action distributions into suboptimal averaged outputs within low-density regions (Wang et al., [2022](https://arxiv.org/html/2605.08202#bib.bib8)). 2) Value regularization methods offer an alternative by learning conservative Q-functions that penalize OOD actions (Kumar et al., [2020](https://arxiv.org/html/2605.08202#bib.bib9); Wu et al., [2021](https://arxiv.org/html/2605.08202#bib.bib10); Bai et al., [2022](https://arxiv.org/html/2605.08202#bib.bib11); Mao et al., [2023](https://arxiv.org/html/2605.08202#bib.bib12)). Their effectiveness depends on the underlying OOD identification mechanism, which is a challenging task because the models used to characterize the data distribution have limited representation capacity. Furthermore, they usually apply uniform penalties across the entire out-of-support region, ignoring valuable explorations that could enhance policy performance (Figure [1](https://arxiv.org/html/2605.08202#S1.F1), left).

Recent efforts have sought to mitigate excessive pessimism by controlling the level of conservatism in a fine-grained manner. CCVL (Hong et al., [2022](https://arxiv.org/html/2605.08202#bib.bib13)) conditions the Q-function on a confidence level to learn a spectrum of conservative value estimates, enabling adaptive policies that dynamically adjust pessimism during online evaluation. ACL-QL (Wu et al., [2024](https://arxiv.org/html/2605.08202#bib.bib15)) models the behavior policy as a Gaussian distribution and introduces learnable weighting functions to adaptively modulate conservatism at the state-action level. DoRL-VC (Huang et al., [2024](https://arxiv.org/html/2605.08202#bib.bib14)) employs a VAE-based detector to separate OOD from ID actions and further distinguishes OOD actions with different properties. Nevertheless, such approaches either rely on Q-ensemble learning to achieve varying degrees of conservatism, incurring additional training overhead, or inherit strong Gaussian assumptions about the behavior policy, which fundamentally limits their ability to reliably identify OOD samples.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x1.png)
Figure 1: VAE-based behavior modeling methods (left) misidentify OOD actions, whereas uniform penalties suppress high-potential OOD actions. DOSER (right) models the multi-modal behavior policy via a diffusion model and uses reconstruction error as an OOD indicator, further distinguishing detrimental from beneficial actions for selective regularization.

To address these challenges, we present DOSER (Diffusion-based OOD Detection and SElective Regularization), advancing OOD handling through two key innovations (Figure [1](https://arxiv.org/html/2605.08202#S1.F1), right). First, we utilize diffusion models to achieve OOD detection. By deploying two diffusion models for behavior policy approximation and state distribution modeling, we establish reconstruction errors as theoretically rigorous metrics, avoiding strong Gaussian assumptions while maintaining well-calibrated detection performance. Second, we introduce an adaptive discrimination mechanism that goes beyond binary classification of in-distribution (ID) and OOD. By integrating a learned dynamics model, we distinguish between beneficial OOD actions (those with potential to improve performance while staying within the state distribution) and detrimental OOD actions (those likely to induce state distribution shift or value degradation). This fine-grained discrimination enables selective regularization, discouraging hazardous actions while encouraging promising explorations, which yields a robust framework that maintains necessary conservatism while facilitating policy improvement.

The key contributions of this paper are as follows: 1) We propose a diffusion-based approach for OOD detection in offline RL, using reconstruction error as a theoretically grounded metric. 2) We introduce a dual regularization strategy that adaptively adjusts its treatment of OOD actions based on predicted outcomes, suppressing detrimental actions while encouraging beneficial ones. 3) Extensive experiments on D4RL benchmarks demonstrate superior or competitive performance compared to prior methods, with detailed ablations verifying the effectiveness of each component.

## 2 Preliminary

Offline RL. We consider the RL problem formulated as a Markov Decision Process (MDP), defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, R, \gamma, d_0)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$, reward function $R: \mathcal{S} \times \mathcal{A} \to [R_{\mathrm{min}}, R_{\mathrm{max}}]$, discount factor $\gamma \in [0,1)$, and initial state distribution $d_0: \mathcal{S} \to [0,1]$ (Sutton et al., [1998](https://arxiv.org/html/2605.08202#bib.bib16)). The goal of RL is to learn a policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ that maximizes the expected discounted return $J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(\bm{s}_t, \bm{a}_t)\right]$. For any policy $\pi$, we define the value function as $V^{\pi}(\bm{s}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(\bm{s}_t, \bm{a}_t) \mid \bm{s}_0 = \bm{s}\right]$ and the Q-function as $Q^{\pi}(\bm{s}, \bm{a}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(\bm{s}_t, \bm{a}_t) \mid \bm{s}_0 = \bm{s}, \bm{a}_0 = \bm{a}\right]$. Given that rewards are bounded, the Q-function must lie between $Q_{\mathrm{min}} = R_{\mathrm{min}}/(1-\gamma)$ and $Q_{\mathrm{max}} = R_{\mathrm{max}}/(1-\gamma)$. In offline RL, the agent is limited to learning from a static dataset $\mathcal{D} = \{(\bm{s}, \bm{a}, r, \bm{s}')\}$ collected by a behavior policy $\pi_{\beta}$, without any interaction with the environment (Lange et al., [2012](https://arxiv.org/html/2605.08202#bib.bib17)). We denote the empirical behavior policy as $\hat{\pi}_{\beta}$, which describes the conditional action distribution observed in $\mathcal{D}$.

Diffusion Models. Diffusion models (Sohl-Dickstein et al., [2015](https://arxiv.org/html/2605.08202#bib.bib18); Ho et al., [2020](https://arxiv.org/html/2605.08202#bib.bib19); Song et al., [2020](https://arxiv.org/html/2605.08202#bib.bib20)) have emerged as a powerful class of generative models that excel at capturing complex data distributions. The core idea revolves around a forward diffusion process that gradually perturbs data into noise and a reverse process that learns to reconstruct the original data. Given a clean sample $\bm{x}_0 \sim p_{\mathrm{data}}(\bm{x}_0)$ with standard deviation $\sigma_{\mathrm{data}}$, the forward process constructs a sequence of increasingly noisy samples $\bm{x}_t \sim p(\bm{x}_t; \sigma_t)$ by adding i.i.d. Gaussian noise with standard deviation $\sigma_t$, which increases along the schedule $\sigma_{\mathrm{min}} = \sigma_0 < \sigma_1 < \cdots < \sigma_N = \sigma_{\mathrm{max}}$. Commonly, $\sigma_{\mathrm{min}}$ is chosen sufficiently small that $p_{\mathrm{min}}(\bm{x}) \approx p_{\mathrm{data}}(\bm{x})$, while $\sigma_{\mathrm{max}}$ is large enough that the final distribution approximates isotropic Gaussian noise, i.e., $p_{\mathrm{max}}(\bm{x}) \approx \mathcal{N}(\bm{x}; 0, \sigma_{\mathrm{max}}^2 \bm{I})$.

In the original DDPM (Ho et al., [2020](https://arxiv.org/html/2605.08202#bib.bib19)) formulation, this process is modeled as a discrete Markov chain. Subsequent works reinterpret it through the lens of stochastic differential equations (SDEs) (Song et al., [2020](https://arxiv.org/html/2605.08202#bib.bib20)), describing the evolution of $\bm{x}_t$ over continuous time $t \in [0, T]$ as:

$$d\bm{x}_t = f(\bm{x}_t, t)\,dt + g(t)\,d\bm{w}_t \quad (1)$$

where $f(\cdot, t)$ and $g(t)$ are the drift and diffusion coefficients, and $\bm{w}_t$ is a standard Wiener process.

The EDM framework (Karras et al., [2022](https://arxiv.org/html/2605.08202#bib.bib21)) refines this paradigm by reparameterizing the diffusion path with a differentiable noise schedule $\sigma(t)$. The reverse process is governed by a probability-flow ODE derived from the forward SDE, which is formulated as:

$$d\bm{x}_t = -\dot{\sigma}(t)\,\sigma(t)\,\nabla_{\bm{x}_t} \log p_t(\bm{x}_t)\,dt \quad (2)$$

where $\dot{\sigma}(t) = \frac{d\sigma}{dt}$ is the time derivative of the noise schedule controlling the rate of noise change, and $\nabla_{\bm{x}_t} \log p_t(\bm{x}_t)$ is the score function of the marginal distribution $p_t(\bm{x}_t)$. The score is approximated by a neural network $\bm{\epsilon}_{\theta}(\bm{x}_t; \sigma_t)$ trained via denoising score matching (Vincent, [2011](https://arxiv.org/html/2605.08202#bib.bib22)). The denoising model $\bm{\epsilon}_{\theta}$ is trained to predict the true clean sample $\bm{x}_0$ from its noisy version $\bm{x}_t = \bm{x}_0 + \sigma_t \bm{\epsilon}$ by minimizing the reweighted $L_2$ loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{\sigma_t,\, \bm{x}_0 \sim p(\bm{x}_0),\, \bm{\epsilon} \sim \mathcal{N}(0, \bm{I})} \left[ \lambda(\sigma_t)\, \|\bm{x}_0 - \bm{\epsilon}_{\theta}(\bm{x}_t, \sigma_t)\|_2^2 \right], \quad (3)$$

where $\lambda(\sigma_t)$ is the loss weight. Compared to the original DDPM, which requires thousands of denoising steps, EDM accelerates sampling by introducing optimized noise schedules and higher-order ODE solvers, achieving high-quality generation within only a few dozen steps.
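To make the denoising objective in (3) concrete, the following PyTorch-style sketch implements one evaluation of the reweighted loss under common EDM-style choices (log-normal sampling of $\sigma_t$ and $\lambda(\sigma) = (\sigma^2 + \sigma_{\mathrm{data}}^2)/(\sigma\,\sigma_{\mathrm{data}})^2$); the hyperparameter values and the `denoiser` interface are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def edm_denoising_loss(denoiser, x0, sigma_data=0.5, p_mean=-1.2, p_std=1.2):
    """One evaluation of the reweighted denoising loss in Eq. (3).

    denoiser(x_t, sigma) is expected to predict the clean sample x0.
    sigma_data, p_mean, p_std follow common EDM-style defaults (assumed here).
    """
    # Sample a noise level per example (log-normal schedule, an EDM convention).
    log_sigma = p_mean + p_std * torch.randn(x0.shape[0], device=x0.device)
    sigma = log_sigma.exp().view(-1, *([1] * (x0.dim() - 1)))

    # Perturb the clean sample: x_t = x0 + sigma * eps.
    eps = torch.randn_like(x0)
    x_t = x0 + sigma * eps

    # Loss weight lambda(sigma) balancing scales across noise levels.
    lam = (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2

    # Reweighted L2 reconstruction error against the clean sample.
    pred_x0 = denoiser(x_t, sigma)
    return (lam * (pred_x0 - x0) ** 2).mean()
```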

## 3 Diffusion-based OOD Detection and Selective Regularization

In this section, we present the technical framework of DOSER. We begin by introducing three main components that enable precise detection and classification of OOD actions, then demonstrate the complete integration of these components into a unified algorithmic framework, detailing the practical implementation. Figure [2](https://arxiv.org/html/2605.08202#S3.F2) provides an overview of the proposed method. For comprehensive theoretical analysis, please refer to Appendix [A](https://arxiv.org/html/2605.08202#A1).

![Refer to caption](https://arxiv.org/html/2605.08202v1/x2.png)
Figure 2: Overview of the proposed method: (a) diffusion-based OOD action detection; (b) integrating the detector to achieve OOD action classification.

### 3.1 Diffusion-based Behavior and State Modeling

The foundation of our approach is to establish two diffusion models that jointly capture the underlying distributions of the offline dataset. We first construct a conditional diffusion model that learns the empirical behavior policy distribution $\hat{\pi}_{\beta}(\bm{a} \mid \bm{s})$ by training a denoising network $\bm{\epsilon}_{\theta_a}(\bm{a}_t, \sigma_t, \bm{s})$ to reconstruct the original action $\bm{a}_0$ through the following optimization objective:

$$\mathcal{L}(\theta_a) = \mathbb{E}_{\sigma_t,\, (\bm{s}, \bm{a}_0) \sim \mathcal{D},\, \bm{\epsilon} \sim \mathcal{N}(0, \bm{I})} \left[ \lambda(\sigma_t)\, \|\bm{a}_0 - \bm{\epsilon}_{\theta_a}(\bm{a}_t, \sigma_t, \bm{s})\|_2^2 \right], \quad (4)$$

where $\bm{a}_t = \bm{a}_0 + \sigma_t \bm{\epsilon}$ is the noisy action with noise scale $\sigma_t$, $\lambda(\sigma_t)$ balances loss scales across noise levels, and $\bm{\epsilon} \sim \mathcal{N}(0, \bm{I})$ is standard Gaussian noise.

In parallel, we develop a diffusion model to capture the state distribution $d_0(\bm{s})$ of the dataset. The corresponding denoising network $\bm{\epsilon}_{\theta_s}(\bm{s}_t, \sigma_t)$ is trained to recover the original state $\bm{s}_0$ from its noisy version $\bm{s}_t$, using the following reconstruction objective:

$$\mathcal{L}(\theta_s) = \mathbb{E}_{\sigma_t,\, \bm{s}_0 \sim \mathcal{D},\, \bm{\epsilon} \sim \mathcal{N}(0, \bm{I})} \left[ \lambda(\sigma_t)\, \|\bm{s}_0 - \bm{\epsilon}_{\theta_s}(\bm{s}_t, \sigma_t)\|_2^2 \right]. \quad (5)$$
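A minimal sketch of pretraining the two denoisers with objectives (4) and (5), reusing `edm_denoising_loss` from the sketch above; the network architecture, sigma embedding, and the assumption that the data loader yields `(s, a)` pairs are all illustrative choices rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    """MLP denoiser; predicts the clean target given (noisy target, sigma, condition)."""
    def __init__(self, target_dim, cond_dim=0, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(target_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, target_dim),
        )

    def forward(self, x_t, sigma, cond=None):
        sigma_feat = sigma.log().view(x_t.shape[0], 1)   # simple log-sigma embedding
        parts = [x_t, sigma_feat] if cond is None else [x_t, cond, sigma_feat]
        return self.net(torch.cat(parts, dim=-1))

def pretrain_diffusion_models(behavior_model, state_model, loader, epochs=100, lr=3e-4):
    """Hypothetical pretraining loop; `loader` is assumed to yield offline (s, a) batches."""
    opt = torch.optim.Adam(
        list(behavior_model.parameters()) + list(state_model.parameters()), lr=lr)
    for _ in range(epochs):
        for s, a in loader:
            # Eq. (4): conditional behavior denoiser epsilon_{theta_a}(a_t, sigma_t, s).
            loss_a = edm_denoising_loss(lambda x, sig: behavior_model(x, sig, cond=s), a)
            # Eq. (5): unconditional state denoiser epsilon_{theta_s}(s_t, sigma_t).
            loss_s = edm_denoising_loss(state_model, s)
            opt.zero_grad()
            (loss_a + loss_s).backward()
            opt.step()
```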

### 3.2 OOD Detection via Reconstruction Error

Our detection mechanism leverages the denoising capabilities of the pretrained diffusion models to identify OOD samples based on reconstruction errors. Given a state-action pair $(\bm{s}, \bm{a}_0)$ encountered during policy optimization, we compute its OOD score through a two-step procedure.

First, we sample a noise scale $\sigma_t$ from the training noise schedule and perturb the action as $\bm{a}_t = \bm{a}_0 + \sigma_t \bm{\epsilon}$, where $\bm{\epsilon} \sim \mathcal{N}(0, \bm{I})$. The OOD score is then defined as the $L_2$ reconstruction error between the original action and its denoised counterpart:

$$\mathcal{E}_a(\bm{s}, \bm{a}_0) = \left\| \bm{a}_0 - \bm{\epsilon}_{\theta_a}(\bm{a}_t, \sigma_t, \bm{s}) \right\|_2. \quad (6)$$

Analogously, for state inputs, we measure the reconstruction error between the original state $\bm{s}_0$ and its denoised version:

$$\mathcal{E}_s(\bm{s}_0) = \left\| \bm{s}_0 - \bm{\epsilon}_{\theta_s}(\bm{s}_t, \sigma_t) \right\|_2, \quad (7)$$

where $\bm{s}_t$ denotes the noise-corrupted state.

Formally, the OOD indicator functions are given by:

$$\mathbb{I}_{\mathrm{ood}}(\bm{a}_0) = \{\mathcal{E}_a(\bm{s}, \bm{a}_0) > \tau_a\}, \qquad \mathbb{I}_{\mathrm{ood}}(\bm{s}_0) = \{\mathcal{E}_s(\bm{s}_0) > \tau_s\}, \quad (8)$$

where the thresholds $\tau_a$ and $\tau_s$ are set as the $p$-th percentiles of the reconstruction errors on the training dataset $\mathcal{D}$, with $p$ controlling the level of conservatism.

This reconstruction-based method offers three key advantages: 1) Reconstruction error provides a likelihood-free surrogate for distributional alignment, directly measuring conformity to the data manifold without explicit density estimation. 2) Diffusion models naturally capture multi-modal distributions, avoiding the restrictive unimodal Gaussian assumptions of conventional approaches. 3) Detection is efficient, requiring only a single forward pass per sample. Moreover, evaluating errors across multiple randomly sampled diffusion timesteps rather than a fixed noise level improves robustness, since different noise scales correspond to varying levels of information bottleneck in the data distribution.
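A minimal sketch of the reconstruction-error OOD scores in (6)–(8), assuming the pretrained denoisers from the previous sketches; averaging over a few random noise levels follows the robustness remark above, and the number of levels, the sigma-sampling parameters, and the percentile value are illustrative assumptions.

```python
import torch

@torch.no_grad()
def action_ood_score(behavior_model, s, a0, n_levels=4, p_mean=-1.2, p_std=1.2):
    """Eq. (6): L2 error between a0 and its single-step denoised reconstruction,
    averaged over a few randomly sampled noise levels for robustness."""
    scores = 0.0
    for _ in range(n_levels):
        sigma = (p_mean + p_std * torch.randn(a0.shape[0], device=a0.device)).exp().view(-1, 1)
        a_t = a0 + sigma * torch.randn_like(a0)
        scores = scores + (behavior_model(a_t, sigma, cond=s) - a0).norm(dim=-1)
    return scores / n_levels

@torch.no_grad()
def state_ood_score(state_model, s0, n_levels=4, p_mean=-1.2, p_std=1.2):
    """Eq. (7): the state-level analogue, using the unconditional state denoiser."""
    scores = 0.0
    for _ in range(n_levels):
        sigma = (p_mean + p_std * torch.randn(s0.shape[0], device=s0.device)).exp().view(-1, 1)
        s_t = s0 + sigma * torch.randn_like(s0)
        scores = scores + (state_model(s_t, sigma) - s0).norm(dim=-1)
    return scores / n_levels

@torch.no_grad()
def calibrate_threshold(scores_on_dataset, p=95.0):
    """Eq. (8): threshold tau as the p-th percentile of in-sample reconstruction errors."""
    return torch.quantile(scores_on_dataset, p / 100.0)

# Usage sketch: tau_a = calibrate_threshold(action_ood_score(behavior_model, S, A), p=95.0)
#               is_ood = action_ood_score(behavior_model, s, a) > tau_a
```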

### 3.3 Adaptive OOD Action Classification

Building on the detection framework, we introduce an adaptive classification mechanism to handle OOD actions during policy optimization. Unlike conventional methods that indiscriminately penalize all deviations, our approach distinguishes between *beneficial* and *detrimental* OOD actions through a two-stage assessment process.

For each policy-generated OOD action $\bm{a}_{\mathrm{ood}}$ in state $\bm{s}$, we first predict the subsequent state $\bm{s}'_{\pi}$ using the learned dynamics model $p_{\psi}(\bm{s}' \mid \bm{s}, \bm{a})$, pretrained via supervised learning on the offline dataset $\mathcal{D}$. Since value estimation for OOD states is inherently unreliable, we then evaluate the outcome of $\bm{a}_{\mathrm{ood}}$ along two dimensions: 1) whether $\bm{s}'_{\pi}$ lies outside the training distribution, as determined by the proposed OOD detection mechanism; and 2) if $\bm{s}'_{\pi}$ is in-distribution, whether $V(\bm{s}'_{\pi})$ exceeds $V(\bm{s}'_{\mathrm{id}})$, where $\bm{s}'_{\mathrm{id}}$ denotes the predicted next state after executing the optimal in-distribution action.

Algorithm 1: Diffusion-based OOD Detection with Selective Regularization (DOSER)

Initialize Q-network $Q_{\theta}$, V-network $V_{\theta}$, diffusion behavior model $\bm{\epsilon}_{\theta_a}$, diffusion state model $\bm{\epsilon}_{\theta_s}$, policy network $\pi_{\phi}$, dynamics model $p_{\psi}$, and target networks $Q_{\theta'}$, $V_{\theta'}$, $\pi_{\phi'}$

\# Model Pretraining

Pretrain the dynamics model $p_{\psi}$ by minimizing ([13](https://arxiv.org/html/2605.08202#S3.E13))

Pretrain the diffusion models $\bm{\epsilon}_{\theta_a}$ and $\bm{\epsilon}_{\theta_s}$ by minimizing ([4](https://arxiv.org/html/2605.08202#S3.E4)) and ([5](https://arxiv.org/html/2605.08202#S3.E5))

Calculate the OOD detection thresholds $\tau_a$ and $\tau_s$ based on in-sample reconstruction errors

for each iteration do

Sample a transition minibatch $\{(\bm{s}, \bm{a}, r, \bm{s}')\}$ from $\mathcal{D}$

\# Critic Learning

Generate action $\bm{a}_{\pi} \sim \pi_{\phi}(\bm{s})$ and predict the next state $\bm{s}'_{\pi} = p_{\psi}(\bm{s}, \bm{a}_{\pi})$

Select the best ID action $\bm{a}^*_{\mathrm{id}}$ and predict the next state $\bm{s}'_{\mathrm{id}} = p_{\psi}(\bm{s}, \bm{a}^*_{\mathrm{id}})$

Calculate the reconstruction errors of the policy action and the predicted next state by ([6](https://arxiv.org/html/2605.08202#S3.E6)) and ([7](https://arxiv.org/html/2605.08202#S3.E7))

Calculate the adaptive bonus $\delta_V = V_{\theta}(\bm{s}'_{\pi}) - V_{\theta}(\bm{s}'_{\mathrm{id}})$

Update $Q_{\theta}$ and $V_{\theta}$ by minimizing ([10](https://arxiv.org/html/2605.08202#S3.E10)) and ([12](https://arxiv.org/html/2605.08202#S3.E12))

\# Actor Learning

Update $\pi_{\phi}$ by minimizing ([14](https://arxiv.org/html/2605.08202#S3.E14))

\# Target Network Update

$\theta' \leftarrow \rho\theta + (1-\rho)\theta'$, $\phi' \leftarrow \rho\phi + (1-\rho)\phi'$

end for

Formally, the classification rule for OOD actions is given in Definition [1](https://arxiv.org/html/2605.08202#Thmdefinition1).

###### Definition 1 (Beneficial and detrimental OOD action sets).

Let the beneficial OOD action set $\mathcal{A}_{\mathrm{ood}}^{+}$ and the detrimental OOD action set $\mathcal{A}_{\mathrm{ood}}^{-}$ be subsets of the action space $\mathcal{A}$. Then:

$$\mathcal{A}_{\mathrm{ood}}^{+} := \left\{ \bm{a} \in \mathcal{A} \;\middle|\; \mathcal{E}_s(\bm{s}'_{\pi}) \leq \tau_s \;\wedge\; V(\bm{s}'_{\pi}) \geq V(\bm{s}'_{\mathrm{id}}) \right\}, \quad (9)$$
$$\mathcal{A}_{\mathrm{ood}}^{-} := \left\{ \bm{a} \in \mathcal{A} \;\middle|\; \mathcal{E}_s(\bm{s}'_{\pi}) > \tau_s \;\vee\; V(\bm{s}'_{\pi}) < V(\bm{s}'_{\mathrm{id}}) \right\},$$

where $\bm{s}'_{\pi} \sim p_{\psi}(\cdot \mid \bm{s}, \bm{a}_{\mathrm{ood}})$, $\bm{s}'_{\mathrm{id}} \sim p_{\psi}(\cdot \mid \bm{s}, \bm{a}^*_{\mathrm{id}})$, $\bm{a}^*_{\mathrm{id}} = \operatorname*{arg\,max}_{\bm{a} \sim \pi_{\beta}(\cdot \mid \bm{s})} Q(\bm{s}, \bm{a})$ is the optimal in-distribution action at state $\bm{s}$, $\mathcal{E}_s(\cdot)$ is the state reconstruction error defined in ([7](https://arxiv.org/html/2605.08202#S3.E7)), and $\tau_s$ is the state OOD threshold.

Accordingly, detrimental OOD actions are penalized to mitigate overestimation. Conversely, to encourage exploration beyond the dataset support, beneficial OOD actions receive an adaptive bonus $\delta_V = V(\bm{s}'_{\pi}) - V(\bm{s}'_{\mathrm{id}})$. This compensates for extrapolation errors in value estimation and guides the policy towards high-value regions, even when Q-value estimates for OOD actions remain imperfect.
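A minimal sketch of the classification in Definition 1 and the adaptive bonus, assuming the `state_ood_score` helper from Section 3.2's sketch; the interfaces of `dynamics_model` (deterministic next-state prediction) and `value_net` are hypothetical simplifications.

```python
import torch

@torch.no_grad()
def classify_ood_action(s, a_ood, a_id_star, dynamics_model, state_model, value_net, tau_s):
    """Classify policy-generated OOD actions as beneficial or detrimental (Definition 1).

    Assumed interfaces: dynamics_model(s, a) -> predicted next state,
    value_net(s) -> V(s) of shape (batch,), state_ood_score implements Eq. (7).
    """
    s_next_pi = dynamics_model(s, a_ood)        # predicted outcome of the OOD action
    s_next_id = dynamics_model(s, a_id_star)    # predicted outcome of the best ID action

    next_state_in_dist = state_ood_score(state_model, s_next_pi) <= tau_s
    value_improves = value_net(s_next_pi) >= value_net(s_next_id)

    beneficial = next_state_in_dist & value_improves        # Eq. (9): A_ood^+
    detrimental = ~beneficial                                # complement: A_ood^-
    delta_v = value_net(s_next_pi) - value_net(s_next_id)   # adaptive bonus delta_V
    return beneficial, detrimental, delta_v
```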

Therefore, we minimize the following loss for policy evaluation:

$$\begin{aligned}
\mathcal{L}(\theta) = {}& \mathbb{E}_{(\bm{s}, \bm{a}, \bm{s}') \sim \mathcal{D}} \Big[ \underbrace{\big( Q_{\theta}(\bm{s}, \bm{a}) - \big( R(\bm{s}, \bm{a}) + \gamma\, \mathbb{E}_{\bm{a}' \sim \pi_{\beta}(\cdot \mid \bm{s}')} [ Q_{\theta'}(\bm{s}', \bm{a}') ] \big) \big)^2}_{\text{Standard Bellman error}} \Big] \\
& + \beta\, \mathbb{E}_{\bm{s} \sim \mathcal{D},\, \bm{a} \sim \pi_{\phi}(\cdot \mid \bm{s})} \Big[ \underbrace{\mathbb{I}(\bm{a} \in \mathcal{A}_{\mathrm{ood}}^{-}) \cdot \big( Q_{\theta}(\bm{s}, \bm{a}) - Q_{\mathrm{min}} \big)^2}_{\text{Penalty for detrimental OOD actions}} \Big] \\
& + \lambda\, \mathbb{E}_{\bm{s} \sim \mathcal{D},\, \bm{a} \sim \pi_{\phi}(\cdot \mid \bm{s})} \Big[ \underbrace{\mathbb{I}(\bm{a} \in \mathcal{A}_{\mathrm{ood}}^{+}) \cdot \big( Q_{\theta}(\bm{s}, \bm{a}) - \eta\, ( Q_{\theta'}(\bm{s}, \bm{a}^*_{\mathrm{id}}) + \delta_V ) \big)^2}_{\text{Bonus for beneficial OOD actions}} \Big]
\end{aligned} \quad (10)$$

where $Q_{\theta'}$ is the target Q-network and $Q_{\mathrm{min}} = R_{\mathrm{min}}/(1-\gamma)$ is the theoretical minimum Q-value of the MDP. In practical implementation, we approximate $\bm{a}^*_{\mathrm{id}}$ as:

$$\hat{\bm{a}}^*_{\mathrm{id}} = \operatorname*{arg\,max}_{\bm{a}_i \sim \hat{\pi}_{\beta}(\cdot \mid \bm{s})} Q(\bm{s}, \bm{a}_i) \quad \text{for } i = 1, \dots, N, \quad (11)$$

with $N = 10$ empirically balancing computational cost and performance across all tasks.
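A sketch of the selective critic objective in (10), with $\hat{\bm{a}}^*_{\mathrm{id}}$ approximated from $N$ behavior-model candidates as in (11). All module interfaces and the default coefficients are assumptions: `q_net`/`q_target` are taken to return tensors of shape `(B,)`, `sample_behavior_actions(s, n)` is a hypothetical sampler for the diffusion behavior model, `is_action_ood` implements (8), and `classify_fn` is assumed to wrap the classifier sketched above with its models and threshold already bound.

```python
import torch
import torch.nn.functional as F

def critic_loss(batch, q_net, q_target, policy, sample_behavior_actions,
                classify_fn, is_action_ood, q_min,
                gamma=0.99, beta=0.1, lam=0.001, eta=1.0, n_candidates=10):
    """Three-term critic loss of Eq. (10); coefficients here are illustrative defaults."""
    s, a, r, s_next = batch

    with torch.no_grad():
        # Bellman target with a' drawn from the behavior model at s'.
        a_next = sample_behavior_actions(s_next, 1).squeeze(1)
        target = r + gamma * q_target(s_next, a_next)

        # Eq. (11): best in-distribution action among N behavior-model candidates.
        cands = sample_behavior_actions(s, n_candidates)             # (B, N, act_dim)
        s_rep = s.unsqueeze(1).expand(-1, n_candidates, -1)
        q_cands = q_target(s_rep.reshape(-1, s.shape[-1]),
                           cands.reshape(-1, cands.shape[-1])).view(-1, n_candidates)
        a_id_star = cands[torch.arange(s.shape[0]), q_cands.argmax(dim=1)]

        # Classify the policy's actions (Definition 1) and build the bonus target.
        a_pi = policy(s)
        ood = is_action_ood(s, a_pi)
        beneficial, detrimental, delta_v = classify_fn(s, a_pi, a_id_star)
        beneficial, detrimental = beneficial & ood, detrimental & ood
        bonus_target = eta * (q_target(s, a_id_star) + delta_v)

    bellman = F.mse_loss(q_net(s, a), target)                         # first term of Eq. (10)
    q_pi = q_net(s, a_pi)
    penalty = (detrimental.float() * (q_pi - q_min) ** 2).mean()      # push A_ood^- toward Q_min
    bonus = (beneficial.float() * (q_pi - bonus_target) ** 2).mean()  # pull A_ood^+ toward target
    return bellman + beta * penalty + lam * bonus
```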

### 3.4 Practical Implementation

In this section, we provide the practical implementation of our algorithm. Code is available at [https://github.com/7ingw24/DOSER](https://github.com/7ingw24/DOSER).

Value Learning. Similar to IQL, we perform expectile regression to train the value network.

$$\mathcal{L}(\theta) = \mathbb{E}_{(\bm{s}, r, \bm{s}') \sim \mathcal{D}} \left[ L^{\tau}_2 \big( r + \gamma V_{\theta'}(\bm{s}') - V_{\theta}(\bm{s}) \big) \right], \quad (12)$$

where $L^{\tau}_2(u) = |\tau - \mathbb{I}(u < 0)|\, u^2$ denotes the asymmetric $L_2$ loss, and $V_{\theta'}$ is the target V-network.
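A minimal sketch of the expectile regression update in (12); the expectile value `tau=0.7` is an illustrative assumption rather than the paper's reported hyperparameter.

```python
import torch

def expectile_value_loss(v_net, v_target_net, s, r, s_next, gamma=0.99, tau=0.7):
    """Asymmetric L2 (expectile) regression of Eq. (12)."""
    with torch.no_grad():
        target = r + gamma * v_target_net(s_next)
    u = target - v_net(s)                              # TD residual
    weight = torch.abs(tau - (u < 0).float())          # |tau - 1{u < 0}|
    return (weight * u ** 2).mean()
```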

Dynamics Model. With the transitions $(\bm{s}, \bm{a}, \bm{s}')$ in the offline dataset $\mathcal{D}$, we train the dynamics model via supervised regression:

$$\mathcal{L}(\psi) = \mathbb{E}_{(\bm{s}, \bm{a}, \bm{s}') \sim \mathcal{D}} \left\| p_{\psi}(\cdot \mid \bm{s}, \bm{a}) - \bm{s}' \right\|_2^2. \quad (13)$$
Policy Learning. To enhance exploration, we optimize the actor network with maximum entropy regularization:

$$\mathcal{L}(\phi) = \mathbb{E}_{\bm{s} \sim \mathcal{D},\, \bm{a} \sim \pi_{\phi}(\cdot \mid \bm{s})} \left[ \alpha \log \pi_{\phi}(\bm{a} \mid \bm{s}) - Q_{\theta}(\bm{s}, \bm{a}) \right], \quad (14)$$

where $\alpha$ is dynamically adjusted to maintain a target entropy.
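A sketch of the entropy-regularized actor update in (14), with $\alpha$ auto-tuned toward a target entropy in the usual SAC fashion; the tuning rule and the assumption that `policy(s)` returns a reparameterized action together with its log-probability are illustrative, not the paper's exact recipe.

```python
import torch

def actor_and_alpha_loss(policy, q_net, log_alpha, s, target_entropy):
    """Eq. (14): alpha-weighted log-probability minus Q, plus a SAC-style alpha update."""
    a, log_prob = policy(s)                    # reparameterized action and log pi_phi(a|s)
    alpha = log_alpha.exp()
    actor_loss = (alpha.detach() * log_prob - q_net(s, a)).mean()
    # Temperature update driving policy entropy toward the target (an assumption here).
    alpha_loss = -(log_alpha * (log_prob.detach() + target_entropy)).mean()
    return actor_loss, alpha_loss
```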

Overall Algorithm. Putting everything together, we summarize our implementation in Algorithm [1](https://arxiv.org/html/2605.08202#alg1).

## 4 Experiments

![Refer to caption](https://arxiv.org/html/2605.08202v1/x3.png)
Figure 3: OOD detection experiments on the 1D navigation task, where a higher OOD detection metric (reconstruction error or uncertainty estimation) indicates a greater likelihood of being OOD. (a) Two offline datasets with distinct data distributions: expert (top) and medium (bottom). (b) OOD scores across the entire state-action space, evaluated using diffusion-based reconstruction error. (c) OOD scores based on model ensemble uncertainty. (d) OOD scores based on MC dropout uncertainty. (e) OOD scores derived from CVAE-based reconstruction error.

In this section, we conduct a series of experiments to validate the effectiveness of our proposed method. We aim to answer the following key questions: 1) Is diffusion-based reconstruction error better than existing approaches at detecting OOD samples? 2) How does DOSER perform on standard offline RL benchmarks compared to prior SOTA methods? 3) Does each component in DOSER contribute meaningfully to the overall performance? 4) How sensitive is DOSER to its key hyperparameters? More experimental details and results are provided in Appendices [B](https://arxiv.org/html/2605.08202#A2) and [C](https://arxiv.org/html/2605.08202#A3).

### 4.1 OOD Detection

To evaluate the effectiveness of diffusion-based reconstruction error for OOD detection, we design a simple 1D navigation task, in which the discrete state-action space is defined over position $\bm{s} \in [-10, 10]$ and step size $\bm{a} \in [-1, 1]$. The reward function is defined as the negative distance to the target state $0$, such that rewards increase as the agent approaches the target. By perturbing optimal actions with noise of varying scales, we generate two offline datasets, *expert* and *medium* (a construction sketch follows the baseline list below). We then compare our diffusion-based approach against three representative baselines:

1) Model ensemble. An ensemble of dynamics models is trained to capture epistemic uncertainty, with OOD samples identified based on high prediction variance across ensemble members.

2) MC dropout. Monte Carlo dropout is applied during inference to approximate model uncertainty, where actions with high estimated uncertainty are flagged as OOD.

3) CVAE-based reconstruction error. A conditional VAE (CVAE) is trained to model the behavior distribution, and the reconstruction error is used as the OOD indicator.
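To make the toy setting concrete, here is a minimal sketch of how the expert and medium datasets described above could be constructed; the specific noise scales, dataset size, and whether the reward uses the current or next position are illustrative assumptions.

```python
import numpy as np

def make_1d_nav_dataset(noise_scale, n_transitions=10_000, seed=0):
    """Toy 1D navigation data: position s in [-10, 10], step a in [-1, 1],
    reward = negative distance to the target state 0.
    Optimal actions are perturbed with Gaussian noise; a small noise_scale gives
    an 'expert'-like dataset, a larger one a 'medium'-like dataset (assumed scales)."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(-10, 10, size=n_transitions)
    a_opt = np.clip(-s, -1.0, 1.0)                      # step toward the target at 0
    a = np.clip(a_opt + noise_scale * rng.normal(size=n_transitions), -1.0, 1.0)
    s_next = np.clip(s + a, -10.0, 10.0)
    r = -np.abs(s_next)                                  # negative distance to target
    return dict(observations=s, actions=a, rewards=r, next_observations=s_next)

expert_data = make_1d_nav_dataset(noise_scale=0.1)   # assumed scale for 'expert'
medium_data = make_1d_nav_dataset(noise_scale=0.5)   # assumed scale for 'medium'
```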

As shown in Figure [3](https://arxiv.org/html/2605.08202#S4.F3), our diffusion-based method effectively separates ID and OOD samples across the entire state-action space, whereas the baseline methods fail to achieve reliable identification even in this simple setting. In particular, the model ensemble approach frequently misclassifies OOD samples as in-distribution due to its inability to disentangle epistemic and aleatoric uncertainty. Similarly, MC dropout tends to conflate these two sources of uncertainty, while also introducing undesirable stochasticity at inference. Although the CVAE-based reconstruction error baseline shows stronger discrimination than the other two methods, its performance primarily stems from its reconstruction ability, and its limited capacity to model multi-modal distributions remains a fundamental limitation (Wang et al., [2022](https://arxiv.org/html/2605.08202#bib.bib8)). For more experimental details, please refer to Appendix [B.4](https://arxiv.org/html/2605.08202#A2.SS4).

![Refer to caption](https://arxiv.org/html/2605.08202v1/x4.png)
Figure 4: Comparison of OOD action detection performance between CVAE-based reconstruction error and the proposed diffusion-based method in the D4RL MuJoCo domain.

We further compare OOD action detection using CVAE-based reconstruction error against our proposed diffusion-based approach in the D4RL MuJoCo domain (Fu et al., [2020](https://arxiv.org/html/2605.08202#bib.bib28)), with results presented in Figure [4](https://arxiv.org/html/2605.08202#S4.F4). Both methods rely solely on reconstruction error as the detection metric, without incorporating any additional classification or compensation. As illustrated, the CVAE-based method struggles to reliably identify OOD samples in high-dimensional continuous control tasks, which we attribute to its tendency to produce over-smoothed reconstructions that diminish sensitivity to anomalous action inputs. In contrast, our proposed diffusion-based OOD detection consistently delivers superior performance across all evaluated datasets.

### 4.2 Comparisons on D4RL Benchmarks

Table 1: Evaluation results on the D4RL benchmark. We report the average normalized scores at the last training iteration over 4 random seeds. Note that m = medium, m-r = medium-replay, m-e = medium-expert. Bold indicates the values within 95% of the maximum value. TD3+BC through ACL-QL are conventional methods; DQL through DTQL are diffusion-based methods.

| Dataset | TD3+BC | IQL | A2PR | CQL | SVR | ACL-QL | DQL | SfBC | IDQL | QGPO | SRPO | DTQL | DOSER (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| halfcheetah-m | 48.3 | 47.4 | 68.6 | 44.0 | 60.5 | 69.8 | 51.5 | 45.9 | 51.0 | 54.1 | 60.4 | 57.9 | 67.5±0.5 |
| hopper-m | 59.3 | 66.3 | 100.8 | 58.5 | 103.5 | 97.9 | 90.5 | 57.1 | 65.4 | 98.0 | 95.5 | 99.6 | 104.0±0.5 |
| walker2d-m | 83.7 | 78.3 | 89.7 | 72.5 | 92.4 | 79.3 | 87.0 | 77.9 | 82.5 | 86.0 | 84.4 | 89.4 | 86.7±1.2 |
| halfcheetah-m-r | 44.6 | 44.2 | 56.6 | 45.5 | 52.5 | 55.9 | 47.8 | 37.1 | 45.9 | 47.6 | 51.4 | 50.9 | 63.0±1.1 |
| hopper-m-r | 60.9 | 94.7 | 101.5 | 95.0 | 103.7 | 99.3 | 101.3 | 86.2 | 92.1 | 96.9 | 101.2 | 100.0 | 104.4±0.6 |
| walker2d-m-r | 81.8 | 73.9 | 94.4 | 77.2 | 95.6 | 96.5 | 95.5 | 65.1 | 85.1 | 84.4 | 84.6 | 88.5 | 94.4±1.3 |
| halfcheetah-m-e | 90.7 | 86.7 | 98.3 | 91.6 | 94.2 | 87.4 | 96.8 | 92.6 | 95.9 | 93.5 | 92.2 | 92.7 | 96.2±0.4 |
| hopper-m-e | 98.0 | 91.5 | 112.1 | 105.4 | 111.2 | 107.2 | 111.1 | 108.6 | 108.6 | 108.0 | 100.1 | 109.3 | 111.5±1.6 |
| walker2d-m-e | 110.1 | 109.6 | 114.6 | 108.8 | 109.3 | 113.4 | 110.1 | 109.8 | 112.7 | 110.7 | 114.0 | 110.0 | 110.9±0.2 |
| MuJoCo-v2 Average | 75.3 | 83.3 | 93.0 | 77.6 | 91.4 | 89.6 | 88.0 | 75.6 | 82.1 | 86.6 | 87.1 | 88.7 | 93.2 |
| pen-human | 54.9 | 71.5 | – | 35.2 | 73.1 | – | 72.8 | – | – | 73.9 | – | 64.1 | 87.8±14.7 |
| pen-cloned | 63.8 | 37.3 | – | 27.2 | 70.2 | – | 57.3 | – | – | 54.2 | – | 81.3 | 79.3±8.9 |
| Adroit-v1 Average | 59.4 | 54.4 | – | 31.2 | 71.7 | – | 65.1 | – | – | 64.1 | – | 72.7 | 83.6 |

We evaluate the policy performance of DOSER on the standard D4RL benchmark, covering a diverse set of continuous control tasks with varying dataset qualities.

We compare DOSER against a broad range of baselines, including conventional algorithms and SOTA diffusion-based approaches. For policy constraint methods, we include TD3+BC (Fujimoto and Gu, [2021](https://arxiv.org/html/2605.08202#bib.bib5)), IQL (Kostrikov et al., [2021](https://arxiv.org/html/2605.08202#bib.bib6)) and A2PR (Liu et al., [2024](https://arxiv.org/html/2605.08202#bib.bib57)). For value regularization methods, we compare against CQL (Kumar et al., [2020](https://arxiv.org/html/2605.08202#bib.bib9)), SVR (Mao et al., [2023](https://arxiv.org/html/2605.08202#bib.bib12)) and ACL-QL (Wu et al., [2024](https://arxiv.org/html/2605.08202#bib.bib15)). For diffusion-based methods, we consider approaches that also leverage diffusion models for behavior cloning, such as DQL (Wang et al., [2022](https://arxiv.org/html/2605.08202#bib.bib8)), SfBC (Chen et al., [2022](https://arxiv.org/html/2605.08202#bib.bib25)), IDQL (Hansen-Estruch et al., [2023](https://arxiv.org/html/2605.08202#bib.bib23)), QGPO (Lu et al., [2023](https://arxiv.org/html/2605.08202#bib.bib24)), SRPO (Chen et al., [2023](https://arxiv.org/html/2605.08202#bib.bib26)) and DTQL (Chen et al., [2024](https://arxiv.org/html/2605.08202#bib.bib27)). Baseline performance is taken from the original papers or recent literature. Some baselines did not report results on the pen tasks, and key hyperparameters for reproduction are unavailable, so we mark these entries as “–”.

As shown in Table [1](https://arxiv.org/html/2605.08202#S4.T1), DOSER consistently achieves strong performance, outperforming prior methods on both Gym-MuJoCo and Adroit tasks. Its advantage is particularly pronounced in the more challenging “medium” and “medium-replay” settings, where the datasets contain a significant proportion of suboptimal and heterogeneous behaviors. This highlights the effectiveness of our proposed diffusion-based OOD detection mechanism and its capacity for selective regularization. While existing diffusion-based baselines already improve over traditional approaches thanks to their expressive modeling capacity, DOSER further improves upon them by explicitly classifying OOD actions, which allows for more refined value estimation and better policy improvement. Note that methods such as SVR and A2PR also incorporate behavior modeling into their frameworks, either for value regularization or policy constraint. Specifically, SVR employs a CVAE to approximate the support of the behavior policy and imposes uniform penalties on actions that fall outside this estimated support. Similar in motivation to DOSER, A2PR introduces an action discrimination mechanism to guide policy optimization. However, A2PR's discriminator is applied only to in-distribution actions identified by an enhanced CVAE, thereby restricting policy learning to a potentially inaccurate approximation of the dataset support. In contrast to these CVAE-based approaches, DOSER leverages the expressive power of diffusion models for more accurate OOD detection and employs a selective regularization strategy targeted at OOD actions. This enables the learned policy to extrapolate to high-value regions beyond the offline dataset, ultimately contributing to superior empirical performance.

### 4.3 Ablation Study on Components in DOSER

To systematically validate the effectiveness of each component in the DOSER framework, we conduct ablation studies on two variants.

1) DOSER w/o AC and VC. This variant removes both OOD action classification (AC) and value compensation (VC). It relies solely on diffusion-based reconstruction error to detect OOD actions, applying a uniform penalty without distinguishing between beneficial and detrimental cases. This serves as a direct test of the core capability of diffusion models in OOD detection.

2) DOSER w/o VC. Building on the baseline above, this variant further differentiates OOD actions by incorporating both next-state distribution modeling and value estimation. Specifically, it identifies as detrimental those OOD actions that either (i) lead to OOD states or (ii) yield lower value outcomes than optimal ID actions, penalizing only those. All other OOD actions are retained without regularization, enabling a more nuanced treatment of OOD behavior.

We keep all hyperparameters fixed across these variants and evaluate performance on MuJoCo locomotion tasks. Table [2](https://arxiv.org/html/2605.08202#S4.T2) reports the average normalized scores of DOSER and its two ablated variants across nine datasets. The complete learning curves are provided in Appendix [C.9](https://arxiv.org/html/2605.08202#A3.SS9).

The results show that even the baseline variant (DOSER w/o AC and VC) already performs competitively with existing SOTA methods, confirming the strong effectiveness of diffusion models for OOD action detection. However, its uniform penalization strategy excessively suppresses potentially beneficial OOD actions, leading to noticeable performance degradation. In contrast, the classification-based variant (DOSER w/o VC) alleviates this issue by selectively regularizing only detrimental OOD actions, resulting in smaller performance drops. Overall, these findings strongly validate the effectiveness of DOSER's fine-grained classification and compensation mechanism in better balancing conservatism and exploration during policy optimization.

Table 2: Component ablations across MuJoCo-v2 tasks.

| Method | halfcheetah-m | halfcheetah-m-r | halfcheetah-m-e | hopper-m | hopper-m-r | hopper-m-e | walker2d-m | walker2d-m-r | walker2d-m-e |
|---|---|---|---|---|---|---|---|---|---|
| DOSER w/o AC and VC | 65.4±1.1 | 58.8±1.6 | 94.9±0.2 | 102.1±1.7 | 104.2±1.3 | 108.3±2.5 | 85.4±0.4 | 94.1±1.5 | 110.8±0.4 |
| DOSER w/o VC | 67.2±0.9 | 61.9±1.5 | 96.0±0.2 | 99.4±4. | 103.2±1.8 | 111.2±3.2 | 85.8±0.6 | 93.0±1.0 | 111.1±0.5 |
| DOSER | 67.5±0.5 | 63.0±1.1 | 96.2±0.4 | 104.0±0.5 | 104.4±0.6 | 111.5±1.6 | 86.7±1.2 | 94.4±1.3 | 110.9±0.2 |

### 4.4 Sensitivity Analysis

![Refer to caption](https://arxiv.org/html/2605.08202v1/x5.png)
(a) OOD detection thresholds $\tau_a$ and $\tau_s$.
![Refer to caption](https://arxiv.org/html/2605.08202v1/x6.png)
(b) Penalty coefficient $\beta$.
![Refer to caption](https://arxiv.org/html/2605.08202v1/x7.png)
(c) Compensation coefficient $\lambda$.

Figure 5: Ablation study on hyperparameters for halfcheetah tasks.

We compare different OOD detection thresholds in Figure [5(a)](https://arxiv.org/html/2605.08202#S4.F5.sf1), set as the $p$-th percentile of in-distribution reconstruction errors. A smaller threshold implies more samples will be identified as OOD, which is beneficial for narrow behavior distributions like the “medium-expert” dataset, where a larger threshold might overlook OOD samples. For more diverse datasets like “medium” and “medium-replay”, larger thresholds are preferred to prevent ID samples from being misclassified as OOD.

We also investigate the impact of the penalty coefficient $\beta$, varying it from $10^{-5}$ to $1$, as shown in Figure [5(b)](https://arxiv.org/html/2605.08202#S4.F5.sf2). Datasets with narrow distributions require a larger $\beta$ to prevent value overestimation, while more diverse datasets benefit from a smaller $\beta$ to avoid suppressing beneficial OOD actions.

An ablation study on the compensation coefficient $\lambda$ in Figure [5(c)](https://arxiv.org/html/2605.08202#S4.F5.sf3) shows that DOSER performs well across a wide range of $\lambda$ values on the more diverse datasets. Setting $\lambda = 0.001$ yields stable performance across datasets, while excessively large values can amplify the compensation effect, leading to value overestimation and disrupting the learning process.

## 5 Related Works

OOD Detection. Reliable identification of OOD samples is critical for the robustness of machine learning systems. Existing methods primarily fall into two categories: generative-based and reconstruction-based. Generative-based methods leverage probabilistic models to estimate the likelihood of test samples under the learned distribution (Ren et al., [2019](https://arxiv.org/html/2605.08202#bib.bib29)), but models such as Glow (Kingma and Dhariwal, [2018](https://arxiv.org/html/2605.08202#bib.bib30)) and VAEs (Kingma and Welling, [2013](https://arxiv.org/html/2605.08202#bib.bib7)) often assign higher likelihoods to OOD samples than to ID data (Hendrycks et al., [2018](https://arxiv.org/html/2605.08202#bib.bib31); Nalisnick et al., [2018](https://arxiv.org/html/2605.08202#bib.bib32)). Although improvements like likelihood ratios (Ren et al., [2019](https://arxiv.org/html/2605.08202#bib.bib29)) and typicality tests (Nalisnick et al., [2019](https://arxiv.org/html/2605.08202#bib.bib34)) have been proposed, their reliance on likelihood estimation remains a fundamental limitation. In contrast, reconstruction-based methods (Denouden et al., [2018](https://arxiv.org/html/2605.08202#bib.bib35); Zong et al., [2018](https://arxiv.org/html/2605.08202#bib.bib36)) directly measure reconstruction quality, based on the premise that models trained on ID data reconstruct familiar patterns well while exhibiting significant errors on anomalous inputs. Traditional autoencoders (Lyudchik, [2016](https://arxiv.org/html/2605.08202#bib.bib37)) and more recent diffusion-based models (Graham et al., [2023](https://arxiv.org/html/2605.08202#bib.bib38)) have shown promising results in this regard, with diffusion models leveraging iterative refinement to further enhance ID reconstruction. Consequently, reconstruction error provides a more reliable signal of distribution shift than likelihood-based metrics, offering improved discriminability between ID and OOD samples.

OOD Detection in Offline RL. Offline RL presents additional challenges for OOD detection due to the lack of online interaction. To mitigate the risk of extrapolation error, BCQ (Fujimoto et al., [2019](https://arxiv.org/html/2605.08202#bib.bib2)) and SVR (Mao et al., [2023](https://arxiv.org/html/2605.08202#bib.bib12)) employ VAEs to approximate the behavior policy, constraining the learned policy to remain within the behavior support. However, VAEs often fail to capture multi-modal distributions accurately (Wang et al., [2024](https://arxiv.org/html/2605.08202#bib.bib39)), resulting in oversimplified generations. Another line of work quantifies uncertainty to identify OOD samples. Model ensemble methods (Lakshminarayanan et al., [2017](https://arxiv.org/html/2605.08202#bib.bib40)) identify OOD state-action pairs via predictive variance, with algorithms such as MOPO (Yu et al., [2020](https://arxiv.org/html/2605.08202#bib.bib41)) incorporating this uncertainty as a penalty in the reward function. Similarly, Monte Carlo (MC) dropout offers a computationally efficient approximation to Bayesian inference (Gal and Ghahramani, [2016](https://arxiv.org/html/2605.08202#bib.bib42)) and has been applied in offline RL for uncertainty-aware OOD detection (Wu et al., [2021](https://arxiv.org/html/2605.08202#bib.bib10)). While effective to some extent, both approaches often conflate epistemic and aleatoric uncertainty, which may lead to erroneous identification of OOD actions (Zhang et al., [2023](https://arxiv.org/html/2605.08202#bib.bib43)). Alternatively, CQL (Kumar et al., [2020](https://arxiv.org/html/2605.08202#bib.bib9)) avoids explicit density estimation by regularizing the Q-function to assign lower values to all unseen actions. This implicit OOD detection eliminates the need for behavior modeling but risks being overly conservative, potentially suppressing valuable actions that lie outside the behavior support but could lead to improved performance.

Diffusion Models in Offline RL. Diffusion models have recently emerged as a powerful paradigm in RL for modeling multi-modal distributions. This capability is particularly valuable in offline RL, where capturing the diversity of behaviors is essential for deriving robust policies. Methods such as Diffusion-QL (Wang et al., [2022](https://arxiv.org/html/2605.08202#bib.bib8)) and DAC (Fang et al., [2024](https://arxiv.org/html/2605.08202#bib.bib44)) incorporate Q-function guidance into the reverse diffusion process, shaping action generation toward higher-value regions. In contrast, IDQL (Hansen-Estruch et al., [2023](https://arxiv.org/html/2605.08202#bib.bib23)) and SfBC (Chen et al., [2022](https://arxiv.org/html/2605.08202#bib.bib25)) first pretrain a conditional diffusion model to generate multiple action candidates for a given state and subsequently resample according to Q-values to select the best action for execution. Notably, while these approaches effectively integrate diffusion models with value functions for policy improvement, their use of diffusion remains largely limited to guiding or selecting actions; none of them fully exploit the inherent properties of diffusion models, such as reconstruction fidelity or noise sensitivity, to directly assess whether state-action pairs lie within the support of the training distribution.

## 6 Conclusion

In this work, we proposed DOSER, a framework that mitigates distribution shift through diffusion-based reconstruction error. Unlike prior methods that rely on heuristic uncertainty measures or unreliable likelihood estimates, DOSER leverages the expressive power of diffusion models to compute theoretically grounded reconstruction errors for both the behavior policy and the state distribution. This provides robust detection metrics that overcome the multi-modality limitations of Gaussian-based approximators. Crucially, DOSER introduces a selective regularization mechanism that classifies OOD samples into beneficial and detrimental actions, enabling suppression of detrimental extrapolations while compensating promising explorations via value-difference bonuses. Extensive experiments demonstrate that DOSER achieves superior or competitive performance compared to state-of-the-art methods, particularly on suboptimal datasets.

Nonetheless, DOSER has two key limitations: 1) its reliance on the accuracy of the diffusion-based reconstruction and the learned dynamics model, and 2) the computational overhead of iterative diffusion sampling. Future work could focus on enhancing the robustness of the dynamics model and improving efficiency via model distillation and accelerated sampling techniques.

## Acknowledgments

This work is supported by the Shanghai Municipal Science and Shanghai Automotive Industry Science and Technology Development Foundation (No. 2407).

## References

- D. P. Kingma and J. Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- C. Bai, L. Wang, Z. Yang, Z. Deng, A. Garg, P. Liu, and Z. Wang (2022). Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. arXiv preprint arXiv:2202.11566.
- S. Banach (1922). Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1), pp. 133–181.
- L. Bergman and Y. Hoshen (2020). Classification-based anomaly detection for general data. arXiv preprint arXiv:2005.02359.
- H. Chen, C. Lu, Z. Wang, H. Su, and J. Zhu (2023). Score regularized policy optimization through diffusion behavior. arXiv preprint arXiv:2310.07297.
- H. Chen, C. Lu, C. Ying, H. Su, and J. Zhu (2022). Offline reinforcement learning via high-fidelity generative behavior modeling. arXiv preprint arXiv:2209.14548.
- T. Chen, Z. Wang, and M. Zhou (2024). Diffusion policies creating a trust region for offline reinforcement learning. Advances in Neural Information Processing Systems, 37, pp. 50098–50125.
- Y. Chen, X. S. Zhou, and T. S. Huang (2001). One-class SVM for learning in image retrieval. In Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205), Vol. 1, pp. 34–37.
- T. Denouden, R. Salay, K. Czarnecki, V. Abdelzad, B. Phan, and S. Vernekar (2018). Improving reconstruction autoencoder out-of-distribution detection with Mahalanobis distance. arXiv preprint arXiv:1812.02765.
- F. Dufour and T. Prieto-Rumeau (2013). Finite linear programming approximations of constrained discounted Markov decision processes. SIAM Journal on Control and Optimization, 51(2), pp. 1298–1324.
- L. Fang, R. Liu, J. Zhang, W. Wang, and B. Jing (2024). Diffusion actor-critic: formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning. arXiv preprint arXiv:2405.20555.
- J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine (2020). D4RL: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
- S. Fujimoto and S. S. Gu (2021). A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34, pp. 20132–20145.
- S. Fujimoto, D. Meger, and D. Precup (2019). Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062.
- Y. Gal and Z. Ghahramani (2016). Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
- H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree (2021). Regularisation of neural networks by enforcing Lipschitz continuity. Machine Learning, 110(2), pp. 393–416.
- M. S. Graham, W. H. Pinaya, P. Tudosiu, P. Nachev, S. Ourselin, and J. Cardoso (2023). Denoising diffusion models for out-of-distribution detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2948–2957.
- T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. (2018). Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
- P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, and S. Levine (2023). IDQL: implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573.
- D. Hendrycks, M. Mazeika, and T. Dietterich (2018). Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606.
- J. Ho, A. Jain, and P. Abbeel (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, pp. 6840–6851.
- J. Hong, A. Kumar, and S. Levine (2022). Confidence-conditioned value functions for offline reinforcement learning. arXiv preprint arXiv:2212.04607.
- Z. Huang, J. Zhao, and S. Sun (2024). De-pessimism offline reinforcement learning via value compensation. IEEE Transactions on Neural Networks and Learning Systems.
- Y. Jin, Z. Yang, and Z. Wang (2021). Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pp. 5084–5096.
- T. Karras, M. Aittala, T. Aila, and S. Laine (2022). Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35, pp. 26565–26577.
- D. P. Kingma and M. Welling (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- D\. P\. Kingma and M\. Welling \(2013\)Auto\-encoding variational bayes\.arXiv preprint arXiv:1312\.6114\.Cited by:[§1](https://arxiv.org/html/2605.08202#S1.p2.1),[§5](https://arxiv.org/html/2605.08202#S5.p1.1)\.
- D\. P\. Kingma and P\. Dhariwal \(2018\)Glow: generative flow with invertible 1x1 convolutions\.Advances in neural information processing systems31\.Cited by:[§5](https://arxiv.org/html/2605.08202#S5.p1.1)\.
- I\. Kostrikov, A\. Nair, and S\. Levine \(2021\)Offline reinforcement learning with implicit q\-learning\.arXiv preprint arXiv:2110\.06169\.Cited by:[§1](https://arxiv.org/html/2605.08202#S1.p2.1),[§4\.2](https://arxiv.org/html/2605.08202#S4.SS2.p2.1)\.
- A\. Kumar, J\. Fu, M\. Soh, G\. Tucker, and S\. Levine \(2019\)Stabilizing off\-policy q\-learning via bootstrapping error reduction\.Advances in neural information processing systems32\.Cited by:[§1](https://arxiv.org/html/2605.08202#S1.p2.1)\.
- A\. Kumar, A\. Zhou, G\. Tucker, and S\. Levine \(2020\)Conservative q\-learning for offline reinforcement learning\.Advances in neural information processing systems33,pp\. 1179–1191\.Cited by:[§1](https://arxiv.org/html/2605.08202#S1.p2.1),[§4\.2](https://arxiv.org/html/2605.08202#S4.SS2.p2.1),[§5](https://arxiv.org/html/2605.08202#S5.p2.1)\.
- B\. Lakshminarayanan, A\. Pritzel, and C\. Blundell \(2017\)Simple and scalable predictive uncertainty estimation using deep ensembles\.Advances in neural information processing systems30\.Cited by:[§5](https://arxiv.org/html/2605.08202#S5.p2.1)\.
- S\. Lange, T\. Gabel, and M\. Riedmiller \(2012\)Batch reinforcement learning\.InReinforcement learning: State\-of\-the\-art,pp\. 45–73\.Cited by:[§2](https://arxiv.org/html/2605.08202#S2.p1.18)\.
- S\. Levine, A\. Kumar, G\. Tucker, and J\. Fu \(2020\)Offline reinforcement learning: tutorial, review, and perspectives on open problems\.arXiv preprint arXiv:2005\.01643\.Cited by:[§1](https://arxiv.org/html/2605.08202#S1.p1.1)\.
- T\. Liu, Y\. Li, Y\. Lan, H\. Gao, W\. Pan, and X\. Xu \(2024\)Adaptive advantage\-guided policy regularization for offline reinforcement learning\.arXiv preprint arXiv:2405\.19909\.Cited by:[§4\.2](https://arxiv.org/html/2605.08202#S4.SS2.p2.1)\.
- I\. Loshchilov and F\. Hutter \(2016\)Sgdr: stochastic gradient descent with warm restarts\.arXiv preprint arXiv:1608\.03983\.Cited by:[Table 3](https://arxiv.org/html/2605.08202#A2.T3.2.6.2)\.
- C\. Lu, H\. Chen, J\. Chen, H\. Su, C\. Li, and J\. Zhu \(2023\)Contrastive energy prediction for exact energy\-guided diffusion sampling in offline reinforcement learning\.InInternational Conference on Machine Learning,pp\. 22825–22855\.Cited by:[§4\.2](https://arxiv.org/html/2605.08202#S4.SS2.p2.1)\.
- O\. Lyudchik \(2016\)Outlier detection using autoencoders\.Technical reportCited by:[§5](https://arxiv.org/html/2605.08202#S5.p1.1)\.
- Y\. Mao, H\. Zhang, C\. Chen, Y\. Xu, and X\. Ji \(2023\)Supported value regularization for offline reinforcement learning\.Advances in Neural Information Processing Systems36,pp\. 40587–40609\.Cited by:[§B\.2](https://arxiv.org/html/2605.08202#A2.SS2.p2.1),[§1](https://arxiv.org/html/2605.08202#S1.p2.1),[§4\.2](https://arxiv.org/html/2605.08202#S4.SS2.p2.1),[§5](https://arxiv.org/html/2605.08202#S5.p2.1)\.
- E\. Nalisnick, A\. Matsukawa, Y\. W\. Teh, D\. Gorur, and B\. Lakshminarayanan \(2018\)Do deep generative models know what they don’t know?\.arXiv preprint arXiv:1810\.09136\.Cited by:[§5](https://arxiv.org/html/2605.08202#S5.p1.1)\.
- E\. Nalisnick, A\. Matsukawa, Y\. W\. Teh, and B\. Lakshminarayanan \(2019\)Detecting out\-of\-distribution inputs to deep generative models using typicality\.arXiv preprint arXiv:1906\.02994\.Cited by:[§5](https://arxiv.org/html/2605.08202#S5.p1.1)\.
- Y\. Ran, Y\. Li, F\. Zhang, Z\. Zhang, and Y\. Yu \(2023\)Policy regularization with dataset constraint for offline reinforcement learning\.InInternational conference on machine learning,pp\. 28701–28717\.Cited by:[§A\.3\.4](https://arxiv.org/html/2605.08202#A1.SS3.SSS4.1.p1.1)\.
- J\. Ren, P\. J\. Liu, E\. Fertig, J\. Snoek, R\. Poplin, M\. Depristo, J\. Dillon, and B\. Lakshminarayanan \(2019\)Likelihood ratios for out\-of\-distribution detection\.Advances in neural information processing systems32\.Cited by:[§5](https://arxiv.org/html/2605.08202#S5.p1.1)\.
- J\. Sohl\-Dickstein, E\. Weiss, N\. Maheswaranathan, and S\. Ganguli \(2015\)Deep unsupervised learning using nonequilibrium thermodynamics\.InInternational conference on machine learning,pp\. 2256–2265\.Cited by:[§2](https://arxiv.org/html/2605.08202#S2.p2.9)\.
- Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole \(2020\)Score\-based generative modeling through stochastic differential equations\.arXiv preprint arXiv:2011\.13456\.Cited by:[§2](https://arxiv.org/html/2605.08202#S2.p2.9),[§2](https://arxiv.org/html/2605.08202#S2.p3.2)\.
- R\. S\. Sutton, A\. G\. Barto,et al\.\(1998\)Reinforcement learning: an introduction\.Vol\.1,MIT press Cambridge\.Cited by:[§2](https://arxiv.org/html/2605.08202#S2.p1.18)\.
- P\. Vincent \(2011\)A connection between score matching and denoising autoencoders\.Neural computation23\(7\),pp\. 1661–1674\.Cited by:[§2](https://arxiv.org/html/2605.08202#S2.p4.9)\.
- M\. Wang, Y\. Jin, and G\. Montana \(2024\)Learning on one mode: addressing multi\-modality in offline reinforcement learning\.arXiv preprint arXiv:2412\.03258\.Cited by:[§5](https://arxiv.org/html/2605.08202#S5.p2.1)\.
- Z\. Wang, J\. J\. Hunt, and M\. Zhou \(2022\)Diffusion policies as an expressive policy class for offline reinforcement learning\.arXiv preprint arXiv:2208\.06193\.Cited by:[§1](https://arxiv.org/html/2605.08202#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.08202#S4.SS1.p5.1),[§4\.2](https://arxiv.org/html/2605.08202#S4.SS2.p2.1),[§5](https://arxiv.org/html/2605.08202#S5.p3.1)\.
- K\. Wu, Y\. Zhao, Z\. Xu, Z\. Che, C\. Yin, C\. H\. Liu, Q\. Qiu, F\. Feng, and J\. Tang \(2024\)ACL\-ql: adaptive conservative level inQQ\-learning for offline reinforcement learning\.IEEE Transactions on Neural Networks and Learning Systems\.Cited by:[§1](https://arxiv.org/html/2605.08202#S1.p3.1),[§4\.2](https://arxiv.org/html/2605.08202#S4.SS2.p2.1)\.
- Y\. Wu, G\. Tucker, and O\. Nachum \(2019\)Behavior regularized offline reinforcement learning\.arXiv preprint arXiv:1911\.11361\.Cited by:[§1](https://arxiv.org/html/2605.08202#S1.p2.1)\.
- Y\. Wu, S\. Zhai, N\. Srivastava, J\. Susskind, J\. Zhang, R\. Salakhutdinov, and H\. Goh \(2021\)Uncertainty weighted actor\-critic for offline reinforcement learning\.arXiv preprint arXiv:2105\.08140\.Cited by:[§1](https://arxiv.org/html/2605.08202#S1.p2.1),[§5](https://arxiv.org/html/2605.08202#S5.p2.1)\.
- H\. Xiong, T\. Xu, L\. Zhao, Y\. Liang, and W\. Zhang \(2022\)Deterministic policy gradient: convergence analysis\.InUncertainty in Artificial Intelligence,pp\. 2159–2169\.Cited by:[§A\.3\.4](https://arxiv.org/html/2605.08202#A1.SS3.SSS4.1.p1.1)\.
- T\. Yu, G\. Thomas, L\. Yu, S\. Ermon, J\. Y\. Zou, S\. Levine, C\. Finn, and T\. Ma \(2020\)Mopo: model\-based offline policy optimization\.Advances in Neural Information Processing Systems33,pp\. 14129–14142\.Cited by:[§5](https://arxiv.org/html/2605.08202#S5.p2.1)\.
- S\. Zhai, Y\. Cheng, W\. Lu, and Z\. Zhang \(2016\)Deep structured energy based models for anomaly detection\.InInternational conference on machine learning,pp\. 1100–1109\.Cited by:[§C\.2](https://arxiv.org/html/2605.08202#A3.SS2.p1.1)\.
- H\. Zhang, J\. Shao, S\. He, Y\. Jiang, and X\. Ji \(2023\)DARL: distance\-aware uncertainty estimation for offline reinforcement learning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.37,pp\. 11210–11218\.Cited by:[§5](https://arxiv.org/html/2605.08202#S5.p2.1)\.
- B\. Zong, Q\. Song, M\. R\. Min, W\. Cheng, C\. Lumezanu, D\. Cho, and H\. Chen \(2018\)Deep autoencoding gaussian mixture model for unsupervised anomaly detection\.InInternational conference on learning representations,Cited by:[§C\.2](https://arxiv.org/html/2605.08202#A3.SS2.p1.1),[§5](https://arxiv.org/html/2605.08202#S5.p1.1)\.

## Appendix

## Appendix A Theoretical Analysis

In this section, we provide the formal definitions and the theoretical analysis referenced in the main paper.

### A.1 Definitions

###### Definition 2 (In-sample Bellman operator).

The in\-sample Bellman operator is defined as:

$$
\mathcal{T}_{\mathrm{In}}Q(\bm{s},\bm{a}) := R(\bm{s},\bm{a}) + \gamma\,\mathbb{E}_{\bm{s}'\sim P(\cdot\mid\bm{s},\bm{a}),\,\bm{a}'\sim\hat{\pi}_{\beta}(\cdot\mid\bm{s}')}\big[Q(\bm{s}',\bm{a}')\big], \tag{15}
$$

where $\hat{\pi}_{\beta}$ is the empirical behavior policy in the dataset.

Based on Definition [2](https://arxiv.org/html/2605.08202#Thmdefinition2), the DOSER operator is defined as follows.

###### Definition 3 (DOSER operator).

From an optimization perspective, ([10](https://arxiv.org/html/2605.08202#S3.E10)) leads to the DOSER policy evaluation operator:

$$
\mathcal{T}_{\mathrm{DOSER}}Q(\bm{s},\bm{a}) =
\begin{cases}
\mathcal{T}_{\mathrm{In}}Q(\bm{s},\bm{a}) & \text{if } \mathcal{E}_{a}(\bm{s},\bm{a}) \leq \tau_{a} \\
Q_{\mathrm{adj}}(\bm{s},\bm{a}) & \text{otherwise}
\end{cases} \tag{16}
$$

where $Q_{\mathrm{adj}}(\bm{s},\bm{a})$ is the adjusted Q-target for OOD actions:

$$
Q_{\mathrm{adj}}(\bm{s},\bm{a}) =
\begin{cases}
Q_{\min} & \text{if } \bm{a}\in\mathcal{A}_{\mathrm{ood}}^{-} \\
\eta\left(Q(\bm{s},\bm{a}_{\mathrm{id}}^{*}) + \delta_{V}\right) & \text{if } \bm{a}\in\mathcal{A}_{\mathrm{ood}}^{+}
\end{cases} \tag{17}
$$

Therefore, DOSER guarantees that the Q-values of ID actions remain unbiased while underestimating the values of detrimental OOD actions. By applying the value compensation $\delta_{V}$ to beneficial OOD actions, it incentivizes exploration toward high-potential state-action pairs.
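For concreteness, the following is a minimal sketch of how the selective target of Eqs. (16)–(17) could be assembled for a batch of transitions. All names (`recon_error`, `tau_a`, `q_min`, `eta`, `is_beneficial`) are placeholders for quantities defined elsewhere in the paper; this is an illustration, not the authors' implementation.

```python
import numpy as np

def doser_q_target(q_in_sample, q_best_id, delta_v, recon_error, is_beneficial,
                   tau_a=0.1, q_min=-100.0, eta=0.9):
    """Return the DOSER Q-target for a batch of (s, a) pairs.

    q_in_sample  : in-sample Bellman target T_In Q(s, a)             (Eq. 15)
    q_best_id    : Q(s, a*_id), value of the best in-dataset action
    delta_v      : value compensation V(s'_pi) - V(s'_id)
    recon_error  : single-step denoising reconstruction error E_a(s, a)
    is_beneficial: boolean mask marking OOD actions judged beneficial
    """
    in_dist = recon_error <= tau_a
    # Eq. (17): beneficial OOD actions get a compensated target, detrimental ones Q_min
    adjusted = np.where(is_beneficial, eta * (q_best_id + delta_v), q_min)
    # Eq. (16): in-distribution actions keep the unbiased in-sample target
    return np.where(in_dist, q_in_sample, adjusted)
```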

### A.2 Theorems

###### Theorem 1 (Contraction mapping property).

For arbitrary Q-functions $Q_{1}$ and $Q_{2}$ defined on the whole state-action space $\mathcal{S}\times\mathcal{A}$, the DOSER operator $\mathcal{T}_{\mathrm{DOSER}}$ constitutes a $\gamma$-contraction mapping in the $\mathcal{L}_{\infty}$ norm:

$$
\|\mathcal{T}_{\mathrm{DOSER}}Q_{1} - \mathcal{T}_{\mathrm{DOSER}}Q_{2}\|_{\infty} \leq \gamma\,\|Q_{1} - Q_{2}\|_{\infty}. \tag{18}
$$

By the Banach fixed-point theorem (Banach, [1922](https://arxiv.org/html/2605.08202#bib.bib47)), repeated application of $\mathcal{T}_{\mathrm{DOSER}}$ converges to a unique fixed point from any initial Q-function.
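The fixed-point iteration can be checked numerically on the in-sample branch of the operator. The toy 2-state, 2-action MDP below is made up purely for illustration (it is not from the paper); the successive sup-norm changes shrink by roughly the factor $\gamma$, as the contraction argument predicts.

```python
import numpy as np

gamma = 0.9
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                        # R[s, a] (hypothetical rewards)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],           # P[s, a, s'] (hypothetical dynamics)
              [[0.5, 0.5], [0.3, 0.7]]])
pi_beta = np.array([[0.7, 0.3],
                    [0.4, 0.6]])                  # empirical behavior policy pi_beta(a|s)

def in_sample_backup(Q):
    """One application of the in-sample Bellman operator T_In (Eq. 15)."""
    v_next = (pi_beta * Q).sum(axis=1)            # E_{a'~pi_beta}[Q(s', a')]
    return R + gamma * P @ v_next

Q, prev_gap = np.zeros((2, 2)), None
for i in range(50):
    Q_new = in_sample_backup(Q)
    gap = np.abs(Q_new - Q).max()                 # sup-norm change per iteration
    if prev_gap is not None and i % 10 == 0:
        print(f"iter {i:2d}  gap {gap:.6f}  ratio {gap / prev_gap:.3f}")
    prev_gap, Q = gap, Q_new
```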

###### Theorem 2 (Bounded value estimation).

For any policy $\pi$, let $Q^{\pi}_{\mathrm{DOSER}}$ denote the unique fixed point of the DOSER operator $\mathcal{T}_{\mathrm{DOSER}}$. Then $Q^{\pi}_{\mathrm{DOSER}}$ satisfies the following boundedness property for all $(\bm{s},\bm{a})$:

$$
Q_{\min} \leq Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) \leq Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) + \eta\,\delta_{V}, \tag{19}
$$

where $Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) = \max_{\bm{a}\sim\pi_{\beta}(\cdot\mid\bm{s})} Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a})$ is the optimal Q-value obtained by iterating the in-sample Bellman operator $\mathcal{T}_{\mathrm{In}}$.

Since $Q^{\pi}_{\mathrm{In}}$ corresponds to the fixed point of the in-sample Bellman operator, it yields reliable value estimates within the data distribution. Theorem [2](https://arxiv.org/html/2605.08202#Thmtheorem2) therefore implies that DOSER incurs only controlled value overestimation while enabling exploration of high-value regions via the compensation mechanism. The upper bound tightens as the compensation weight $\eta$ and the value difference $\delta_{V}$ decrease, since smaller values of these parameters directly constrain the magnitude of the value adjustment for OOD actions, keeping the value estimate close to the in-distribution baseline $Q^{\pi}_{\mathrm{In}}$.

By dynamically adjusting how OOD actions are treated based on their predicted outcomes, the proposed selective regularization mechanism balances safety and performance, avoiding the pitfalls of binary classification\. Crucially, our method preserves standard Q\-learning convergence guarantees while enabling safer exploration beyond the behavior policy support\.

###### Theorem 3 (Bounded critic deviation).

Let $\pi_{\mathrm{ref}}$ denote the reference policy obtained with the true environment dynamics $P$ and without OOD detection error, with $Q^{\pi_{\mathrm{ref}}}$ being its corresponding action-value function. Let $\widehat{\pi}$ denote the learned policy of DOSER under a dynamics model approximation error $\varepsilon_{\mathrm{dyn}}$ and an OOD detection misclassification probability $\varepsilon_{\mathrm{det}}$. Then the deviation of the learned critic $\widehat{Q}$ from $Q^{\pi_{\mathrm{ref}}}$ is bounded as follows:

$$
\|\widehat{Q} - Q^{\pi_{\mathrm{ref}}}\|_{\infty} \leq \frac{\gamma}{1-\gamma}\left(Q_{\max}\left(C_{1}\varepsilon_{\mathrm{dyn}} + C_{2}\varepsilon_{\mathrm{det}}\right) + \eta\,\delta_{V}\right), \tag{20}
$$

where $Q_{\max} = \dfrac{R_{\max}}{1-\gamma}$, and $C_{1}$, $C_{2}$ are constants that capture the sensitivity of the policy optimization process to dynamics and detection errors, respectively.

###### Theorem 4 (Performance gap of DOSER).

Let $\widehat{\pi}$ be the policy learned by DOSER through iterative application of $\mathcal{T}_{\mathrm{DOSER}}$, and let $\pi^{*}$ denote the optimal policy. Suppose $\delta_{f}$ represents the function approximation error. Then the performance gap between $\pi^{*}$ and $\widehat{\pi}$ satisfies

$$
\left|J(\pi^{*}) - J(\widehat{\pi})\right| \leq \delta_{f} + \frac{C L_{P} R_{\max}}{1-\gamma}\left(C_{1}\varepsilon_{\mathrm{dyn}} + C_{2}\varepsilon_{\mathrm{det}}\right), \tag{21}
$$

where $C_{1}$, $C_{2}$ are positive constants, and $L_{P}$ is the Lipschitz constant of the environment dynamics.

Consequently, the performance gap is determined by three key components: the function approximation error $\delta_{f}$, the OOD detection error $\varepsilon_{\mathrm{det}}$, and the dynamics model approximation error $\varepsilon_{\mathrm{dyn}}$. In our setting, the diffusion model provides a reliable mechanism for OOD detection, as confirmed by extensive experiments, which keeps $\varepsilon_{\mathrm{det}}$ small. Meanwhile, the learned dynamics model maintains stable predictive performance, ensuring that $\varepsilon_{\mathrm{dyn}}$ remains bounded. Therefore, when the diffusion reconstruction error becomes negligible, the dynamics model is sufficiently well fitted, and $\delta_{f}$ is small, the right-hand side of the bound vanishes, implying $J(\widehat{\pi}) \to J(\pi^{*})$.

### A.3 Proofs

#### A.3.1 Proof of Theorem [1](https://arxiv.org/html/2605.08202#Thmtheorem1)

###### Proof\.

Let $\mathcal{T}_{\mathrm{DOSER}}$ denote the DOSER operator acting on bounded Q-functions defined on $\mathcal{S}\times\mathcal{A}$. Assume

- Q-functions lie in the Banach space $(\mathcal{B}, \|\cdot\|_{\infty})$ of bounded real functions on $\mathcal{S}\times\mathcal{A}$ equipped with the sup-norm;
- the compensation coefficient satisfies $0 \leq \eta \leq \gamma < 1$;
- the value-compensation term $\delta_{V}$ is a scalar that does not depend on the Q-function being evaluated (i.e., it is treated as fixed when comparing two Q-functions; if $\delta_{V}$ depends on $Q$, it must be Lipschitz continuous in $Q$ with a sufficiently small Lipschitz constant to preserve contraction; here we assume it is fixed for simplicity and clarity).

Let $Q_{1}, Q_{2} \in \mathcal{B}$ be two arbitrary Q-functions. We bound $\|\mathcal{T}_{\mathrm{DOSER}}Q_{1} - \mathcal{T}_{\mathrm{DOSER}}Q_{2}\|_{\infty}$ by considering the three types of actions that DOSER treats differently: 1) in-distribution actions, 2) detrimental OOD actions, and 3) beneficial OOD actions.

##### 1\) In\-distribution actions\.

For any $(\bm{s},\bm{a})$ with $\mathcal{E}_{a}(\bm{s},\bm{a}) \leq \tau_{a}$, DOSER reduces to the in-sample Bellman operator

$$
\mathcal{T}_{\mathrm{DOSER}}Q(\bm{s},\bm{a}) = \mathcal{T}_{\mathrm{In}}Q(\bm{s},\bm{a}) = R(\bm{s},\bm{a}) + \gamma\,\mathbb{E}_{\bm{s}'\sim P(\cdot\mid\bm{s},\bm{a}),\,\bm{a}'\sim\hat{\pi}_{\beta}(\cdot\mid\bm{s}')}\big[Q(\bm{s}',\bm{a}')\big]. \tag{22}
$$

Hence, the contraction property follows from standard Bellman operator arguments:

$$
\begin{aligned}
&\left\|\mathcal{T}_{\mathrm{DOSER}}Q_{1}(\bm{s},\bm{a}) - \mathcal{T}_{\mathrm{DOSER}}Q_{2}(\bm{s},\bm{a})\right\|_{\infty} \\
&\quad= \left\|\mathcal{T}_{\mathrm{In}}Q_{1}(\bm{s},\bm{a}) - \mathcal{T}_{\mathrm{In}}Q_{2}(\bm{s},\bm{a})\right\|_{\infty} \\
&\quad= \left\|\big(R(\bm{s},\bm{a}) + \gamma\,\mathbb{E}_{\bm{s}',\bm{a}'}[Q_{1}(\bm{s}',\bm{a}')]\big) - \big(R(\bm{s},\bm{a}) + \gamma\,\mathbb{E}_{\bm{s}',\bm{a}'}[Q_{2}(\bm{s}',\bm{a}')]\big)\right\|_{\infty} \\
&\quad= \gamma\,\max_{\bm{s},\bm{a}}\left|\mathbb{E}_{\bm{s}',\bm{a}'}\big[Q_{1}(\bm{s}',\bm{a}') - Q_{2}(\bm{s}',\bm{a}')\big]\right| \\
&\quad\leq \gamma\,\max_{\bm{s},\bm{a}}\mathbb{E}_{\bm{s}',\bm{a}'}\left|Q_{1}(\bm{s}',\bm{a}') - Q_{2}(\bm{s}',\bm{a}')\right| \\
&\quad\leq \gamma\,\|Q_{1} - Q_{2}\|_{\infty}.
\end{aligned} \tag{23}
$$

Thus, for all in-distribution $(\bm{s},\bm{a})$, we have

$$
\|\mathcal{T}_{\mathrm{DOSER}}Q_{1} - \mathcal{T}_{\mathrm{DOSER}}Q_{2}\|_{\infty} \leq \gamma\,\|Q_{1} - Q_{2}\|_{\infty}. \tag{24}
$$

##### 2\) Detrimental OOD actions\.

For detrimental OOD actions $\bm{a}\in\mathcal{A}_{\mathrm{OOD}}^{-}$, the Q-target is set to a constant $Q_{\min}$ (independent of the current Q-function):

$$
\mathcal{T}_{\mathrm{DOSER}}Q(\bm{s},\bm{a}) = Q_{\min}. \tag{25}
$$

The difference therefore vanishes for any Q-functions $Q_{1}$, $Q_{2}$:

$$
\|\mathcal{T}_{\mathrm{DOSER}}Q_{1}(\bm{s},\bm{a}) - \mathcal{T}_{\mathrm{DOSER}}Q_{2}(\bm{s},\bm{a})\|_{\infty} = \|Q_{\min} - Q_{\min}\|_{\infty} = 0 \leq \gamma\,\|Q_{1} - Q_{2}\|_{\infty}. \tag{26}
$$

##### 3\) Beneficial OOD actions\.

For beneficial OOD actions $\bm{a}\in\mathcal{A}_{\mathrm{OOD}}^{+}$, DOSER applies a value compensation, quantified as the difference between the value of the next state $\bm{s}'_{\pi}$ reached by taking the beneficial OOD action $\bm{a}$ and the value of $\bm{s}'_{\mathrm{id}}$ reached by executing the best ID action $\bm{a}_{\mathrm{id}}^{*}$:

$$
\begin{aligned}
\mathcal{T}_{\mathrm{DOSER}}Q(\bm{s},\bm{a}) &= \eta\left(Q(\bm{s},\bm{a}_{\mathrm{id}}^{*}) + \delta_{V}\right) \\
&= \eta\left(\max_{\bm{a}\sim\hat{\pi}_{\beta}(\cdot\mid\bm{s})} Q(\bm{s},\bm{a}) + V(\bm{s}'_{\pi}) - V(\bm{s}'_{\mathrm{id}})\right),
\end{aligned} \tag{27}
$$

where, by assumption, $\delta_{V}$ is treated as a fixed scalar with respect to the Q-function comparison. Thus

$$
\begin{aligned}
&\|\mathcal{T}_{\mathrm{DOSER}}Q_{1}(\bm{s},\bm{a}) - \mathcal{T}_{\mathrm{DOSER}}Q_{2}(\bm{s},\bm{a})\|_{\infty} \\
&\quad= \left\|\eta\left(Q_{1}(\bm{s},\bm{a}_{\mathrm{id},1}^{*}) + \delta_{V}\right) - \eta\left(Q_{2}(\bm{s},\bm{a}_{\mathrm{id},2}^{*}) + \delta_{V}\right)\right\|_{\infty} \\
&\quad= \eta\,\max_{\bm{s},\bm{a}}\left|Q_{1}(\bm{s},\bm{a}_{\mathrm{id},1}^{*}) - Q_{2}(\bm{s},\bm{a}_{\mathrm{id},2}^{*})\right| \\
&\quad= \eta\,\max_{\bm{s},\bm{a}}\left|\max_{\bm{a}} Q_{1}(\bm{s},\bm{a}) - \max_{\bm{a}} Q_{2}(\bm{s},\bm{a})\right| \\
&\quad\leq \eta\,\|Q_{1} - Q_{2}\|_{\infty} \\
&\quad\leq \gamma\,\|Q_{1} - Q_{2}\|_{\infty}.
\end{aligned} \tag{28}
$$

Combining all three cases, we obtain $\|\mathcal{T}_{\mathrm{DOSER}}Q_{1} - \mathcal{T}_{\mathrm{DOSER}}Q_{2}\|_{\infty} \leq \gamma\,\|Q_{1} - Q_{2}\|_{\infty}$.

By the Banach fixed-point theorem (Banach, [1922](https://arxiv.org/html/2605.08202#bib.bib47)), $\mathcal{T}_{\mathrm{DOSER}}$ admits a unique fixed point in $\mathcal{B}$, and iterative application of $\mathcal{T}_{\mathrm{DOSER}}$ from any initial Q-function converges to that fixed point at a rate of at most $\gamma$. ∎

#### A.3.2 Proof of Theorem [2](https://arxiv.org/html/2605.08202#Thmtheorem2)

###### Proof\.

Suppose the DOSER operator $\mathcal{T}_{\mathrm{DOSER}}$ admits a unique fixed point $Q^{\pi}_{\mathrm{DOSER}}$ (Theorem [1](https://arxiv.org/html/2605.08202#Thmtheorem1)). Assume the compensation term $\delta_{V}$ is a scalar that does not depend on the Q-function being evaluated (if $\delta_{V}$ is estimated from $Q$, a Lipschitz assumption on this estimator must be made).

We reason by cases according to DOSER's treatment of actions. By Theorem [1](https://arxiv.org/html/2605.08202#Thmtheorem1), the fixed point $Q^{\pi}_{\mathrm{DOSER}}$ exists and, for each $(\bm{s},\bm{a})$, satisfies

$$
Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) =
\begin{cases}
\mathcal{T}_{\mathrm{In}}Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) & \text{if } \mathcal{E}_{a}(\bm{s},\bm{a}) \leq \tau_{a} \quad (\text{in-distribution}) \\
Q_{\min} & \text{if } \bm{a}\in\mathcal{A}^{-}_{\mathrm{OOD}} \quad (\text{detrimental OOD}) \\
\eta\big(Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) + \delta_{V}\big) & \text{if } \bm{a}\in\mathcal{A}^{+}_{\mathrm{OOD}} \quad (\text{beneficial OOD})
\end{cases} \tag{29}
$$
We show the two inequalities \(lower and upper bounds\) by treating each action type\.

##### 1) Lower bound: $Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) \geq Q_{\min}$.

- *In-distribution actions.* For $\mathcal{E}_{a}(\bm{s},\bm{a}) \leq \tau_{a}$, we have the in-sample Bellman fixed-point relation
  $$
  Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) = R(\bm{s},\bm{a}) + \gamma\,\mathbb{E}_{\bm{s}'\sim P(\cdot\mid\bm{s},\bm{a}),\,\bm{a}'\sim\hat{\pi}_{\beta}(\cdot\mid\bm{s}')}\big[Q^{\pi}_{\mathrm{DOSER}}(\bm{s}',\bm{a}')\big]. \tag{30}
  $$
  Since $R(\bm{s},\bm{a}) \geq R_{\min}$ and $Q^{\pi}_{\mathrm{DOSER}}(\bm{s}',\bm{a}') \geq Q_{\min}$ for all successor pairs, we obtain
  $$
  Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) \geq R_{\min} + \gamma Q_{\min} = Q_{\min}. \tag{31}
  $$
- *Detrimental OOD actions.* The operator directly assigns $Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) = Q_{\min}$ for $\bm{a}\in\mathcal{A}^{-}_{\mathrm{OOD}}$, so the lower bound holds with equality.
- *Beneficial OOD actions.* For $\bm{a}\in\mathcal{A}^{+}_{\mathrm{OOD}}$, the value compensation is weighted by $\eta\in[0,1)$. Given that the fixed-point value $Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}^{*}_{\mathrm{id}})$ is at least $Q_{\min}$, that $\delta_{V} \geq 0$ for beneficial OOD actions, and that $Q_{\min} < 0$ is strictly negative, we have
  $$
  Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) = \eta\big(Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) + \delta_{V}\big) \geq \eta\,Q_{\min} \geq Q_{\min}. \tag{32}
  $$

Combining the three subcases establishes the global lower bound $Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) \geq Q_{\min}$ for all state-action pairs $(\bm{s},\bm{a})$.

##### 2) Upper bound: $Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) \leq Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) + \eta\,\delta_{V}$.

Let $Q^{\pi}_{\mathrm{In}}$ denote the fixed point of the in-sample Bellman operator $\mathcal{T}_{\mathrm{In}}$; by construction, $Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) = Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a})$ for every ID state-action pair $(\bm{s},\bm{a})$. This upper bound guarantees that DOSER incurs only limited overestimation.

- *In-distribution actions.* If $\mathcal{E}_{a}(\bm{s},\bm{a}) \leq \tau_{a}$, then
  $$
  Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) = Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}) \leq Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) \leq Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) + \eta\,\delta_{V}, \tag{33}
  $$
  since $\eta\,\delta_{V} \geq 0$.
- *Detrimental OOD actions.* For $\bm{a}\in\mathcal{A}^{-}_{\mathrm{OOD}}$:
  $$
  Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) = Q_{\min} \leq Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) + \eta\,\delta_{V}. \tag{34}
  $$
- *Beneficial OOD actions.* For $\bm{a}\in\mathcal{A}^{+}_{\mathrm{OOD}}$,
  $$
  Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) = \eta\big(Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) + \delta_{V}\big). \tag{35}
  $$
  Note that $\bm{a}^{*}_{\mathrm{id}}$ is an in-distribution action, hence
  $$
  Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) = Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}). \tag{36}
  $$
  Substituting yields
  $$
  Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) = \eta\big(Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) + \delta_{V}\big) \leq Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) + \eta\,\delta_{V}. \tag{37}
  $$

Putting together the three cases yields the desired upper bound $Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) \leq Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) + \eta\,\delta_{V}$ for all $(\bm{s},\bm{a})$.

Combining the lower and upper bounds above, we obtain, for any state-action pair $(\bm{s},\bm{a})$,

$$
Q_{\min} \leq Q^{\pi}_{\mathrm{DOSER}}(\bm{s},\bm{a}) \leq Q^{\pi}_{\mathrm{In}}(\bm{s},\bm{a}^{*}_{\mathrm{id}}) + \eta\,\delta_{V}, \tag{38}
$$

which shows that the fixed-point values are uniformly bounded and that DOSER prevents uncontrolled value overestimation while permitting strategic exploration of beneficial out-of-distribution regions. ∎

#### A.3.3 Proof of Theorem [3](https://arxiv.org/html/2605.08202#Thmtheorem3)

We begin by introducing three key assumptions and an auxiliary lemma that will be used in the proof\.

###### Assumption 1 (Dynamics model error bound).

There exists a constant $\varepsilon_{\mathrm{dyn}} \geq 0$ such that the learned dynamics model $\widehat{P}(\cdot\mid\bm{s},\bm{a})$ is uniformly close to the true transition kernel $P(\cdot\mid\bm{s},\bm{a})$ in the $\ell_{1}$-norm, satisfying, for all $(\bm{s},\bm{a})\in\mathcal{S}\times\mathcal{A}$:

$$
\|\widehat{P}(\cdot\mid\bm{s},\bm{a}) - P(\cdot\mid\bm{s},\bm{a})\|_{1} \leq \varepsilon_{\mathrm{dyn}}. \tag{39}
$$

###### Assumption 2 (OOD detector error bound).

There exists a constant $\varepsilon_{\mathrm{det}} \geq 0$ such that the misclassification probability of the out-of-distribution detector is uniformly bounded:

$$
\mathrm{Pr}\big[\text{detector misclassifies } (\bm{s},\bm{a})\big] \leq \varepsilon_{\mathrm{det}} \quad \text{for all } (\bm{s},\bm{a})\in\mathcal{S}\times\mathcal{A}. \tag{40}
$$

###### Assumption 3 (Policy deviation bound).

There exist constants $C_{1}, C_{2} > 0$, characterizing the sensitivity of the policy optimization to dynamics model and OOD detection errors respectively, such that for all states $\bm{s}\in\mathcal{S}$:

$$
\|\widehat{\pi}(\cdot\mid\bm{s}) - \pi_{\mathrm{ref}}(\cdot\mid\bm{s})\|_{\mathrm{TV}} \leq C_{1}\varepsilon_{\mathrm{dyn}} + C_{2}\varepsilon_{\mathrm{det}}, \tag{41}
$$

where $\varepsilon_{\mathrm{dyn}}$ and $\varepsilon_{\mathrm{det}}$ are defined in Assumptions [1](https://arxiv.org/html/2605.08202#Thmassumption1) and [2](https://arxiv.org/html/2605.08202#Thmassumption2).

###### Lemma 1\.

Let $\mu$ and $\nu$ be two probability distributions over a finite set $\mathcal{X}$, and let $f:\mathcal{X}\to\mathbb{R}$ be a bounded function with $\|f\|_{\infty} \leq M$. Then,

$$
\left|\mathbb{E}_{x\sim\mu}[f(x)] - \mathbb{E}_{x\sim\nu}[f(x)]\right| \leq 2M\cdot\|\mu-\nu\|_{\mathrm{TV}}, \tag{42}
$$

where $\|\mu-\nu\|_{\mathrm{TV}} = \sup_{A\subseteq\mathcal{X}}|\mu(A)-\nu(A)|$ is the total variation distance. For a finite set $\mathcal{X}$, this is equivalent to $\|\mu-\nu\|_{\mathrm{TV}} = \frac{1}{2}\sum_{x\in\mathcal{X}}|\mu(x)-\nu(x)|$.

###### Proof\.

The expectation difference can be written as:

$$
\begin{aligned}
\left|\mathbb{E}_{\mu}[f] - \mathbb{E}_{\nu}[f]\right| &= \left|\sum_{x\in\mathcal{X}} f(x)\mu(x) - \sum_{x\in\mathcal{X}} f(x)\nu(x)\right| \\
&= \left|\sum_{x\in\mathcal{X}} f(x)\big(\mu(x)-\nu(x)\big)\right| \\
&\leq \sum_{x\in\mathcal{X}} |f(x)|\cdot|\mu(x)-\nu(x)| \\
&\leq M \sum_{x\in\mathcal{X}} |\mu(x)-\nu(x)| \\
&= 2M\cdot\|\mu-\nu\|_{\mathrm{TV}}.
\end{aligned} \tag{43}
$$

The last equality follows from the definition of the total variation distance. ∎
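A quick numerical sanity check of Lemma 1 on arbitrary random discrete distributions (illustrative only; the distributions and the test function below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.dirichlet(np.ones(10))
nu = rng.dirichlet(np.ones(10))
f = rng.uniform(-3.0, 3.0, size=10)    # bounded test function, M = max|f|
M = np.abs(f).max()

lhs = abs(f @ mu - f @ nu)             # |E_mu[f] - E_nu[f]|
tv = 0.5 * np.abs(mu - nu).sum()       # total variation distance
assert lhs <= 2 * M * tv               # the bound of Eq. (42) holds
print(f"{lhs:.4f} <= {2 * M * tv:.4f}")
```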

Now we start the proof of Theorem[3](https://arxiv.org/html/2605.08202#Thmtheorem3)\.

###### Proof\.

We begin by defining the Bellman operator associated with the reference policy $\pi_{\mathrm{ref}}$:

$$
\mathcal{T}_{\mathrm{ref}}Q(\bm{s},\bm{a}) := r(\bm{s},\bm{a}) + \gamma\,\mathbb{E}_{\bm{s}'\sim P(\cdot\mid\bm{s},\bm{a}),\,\bm{a}'\sim\pi_{\mathrm{ref}}(\cdot\mid\bm{s}')}\big[Q(\bm{s}',\bm{a}')\big]. \tag{44}
$$

Denote the fixed point of the reference operator as $Q^{\pi_{\mathrm{ref}}}$, so that

$$
Q^{\pi_{\mathrm{ref}}} = \mathcal{T}_{\mathrm{ref}}Q^{\pi_{\mathrm{ref}}}. \tag{45}
$$

The DOSER critic constructs a modified Bellman target due to three factors: (i) dynamics model error, (ii) detector misclassification, and (iii) value adjustment. Accordingly, the DOSER Bellman operator can be expressed as:

$$
\mathcal{T}_{\mathrm{DOSER}}Q(\bm{s},\bm{a}) := r(\bm{s},\bm{a}) + \gamma\,\mathbb{E}_{\bm{s}'\sim\widehat{P}(\cdot\mid\bm{s},\bm{a}),\,\bm{a}'\sim\widehat{\pi}(\cdot\mid\bm{s}')}\big[Q(\bm{s}',\bm{a}') + b(\bm{s}',\bm{a}')\big], \tag{46}
$$

where $\widehat{\pi}$ differs from $\pi_{\mathrm{ref}}$ due to the dynamics model error and the OOD detector error, and $b$ represents the value adjustment applied to the target Q.

Now we compare the difference between the two operators when applied to $Q^{\pi_{\mathrm{ref}}}$. Define

$$
\Delta(\bm{s},\bm{a}) := \big|\mathcal{T}_{\mathrm{DOSER}}Q^{\pi_{\mathrm{ref}}}(\bm{s},\bm{a}) - \mathcal{T}_{\mathrm{ref}}Q^{\pi_{\mathrm{ref}}}(\bm{s},\bm{a})\big|. \tag{47}
$$

Substituting the operator definitions yields:

$$
\Delta(\bm{s},\bm{a}) = \gamma\,\Big|\mathbb{E}_{\widehat{P},\widehat{\pi}}\big[Q^{\pi_{\mathrm{ref}}} + b\big] - \mathbb{E}_{P,\pi_{\mathrm{ref}}}\big[Q^{\pi_{\mathrm{ref}}}\big]\Big| \leq \gamma\big((I) + (II)\big), \tag{48}
$$

where the components correspond to

$$
(I) := \big|\mathbb{E}_{\widehat{P},\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}] - \mathbb{E}_{P,\pi_{\mathrm{ref}}}[Q^{\pi_{\mathrm{ref}}}]\big|, \qquad (II) := \big|\mathbb{E}_{\widehat{P},\widehat{\pi}}[b]\big|. \tag{49}
$$
Bound on (I): We decompose (I) into the dynamics model approximation error and the policy distribution bias:

$$
\begin{aligned}
(I) &= \big|\mathbb{E}_{\widehat{P},\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}] - \mathbb{E}_{P,\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}] + \mathbb{E}_{P,\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}] - \mathbb{E}_{P,\pi_{\mathrm{ref}}}[Q^{\pi_{\mathrm{ref}}}]\big| \\
&\leq \big|\mathbb{E}_{\widehat{P},\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}] - \mathbb{E}_{P,\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}]\big| + \big|\mathbb{E}_{P,\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}] - \mathbb{E}_{P,\pi_{\mathrm{ref}}}[Q^{\pi_{\mathrm{ref}}}]\big|.
\end{aligned} \tag{50}
$$

For the dynamics model error, consider the function $f(\bm{s}') = \mathbb{E}_{\bm{a}'\sim\widehat{\pi}(\cdot\mid\bm{s}')}[Q^{\pi_{\mathrm{ref}}}(\bm{s}',\bm{a}')]$. Since $|Q^{\pi_{\mathrm{ref}}}| \leq Q_{\max}$, the function is bounded by $\|f\|_{\infty} \leq Q_{\max}$. Applying Lemma [1](https://arxiv.org/html/2605.08202#Thmlemma1) with distributions $\mu = \widehat{P}(\cdot\mid\bm{s},\bm{a})$ and $\nu = P(\cdot\mid\bm{s},\bm{a})$ gives:

$$
\big|\mathbb{E}_{\widehat{P},\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}] - \mathbb{E}_{P,\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}]\big| = \big|\mathbb{E}_{\bm{s}'\sim\widehat{P}}[f(\bm{s}')] - \mathbb{E}_{\bm{s}'\sim P}[f(\bm{s}')]\big| \leq 2Q_{\max}\cdot\|\widehat{P}(\cdot\mid\bm{s},\bm{a}) - P(\cdot\mid\bm{s},\bm{a})\|_{\mathrm{TV}}. \tag{51}
$$

Using the relation $\|\mu-\nu\|_{\mathrm{TV}} = \frac{1}{2}\|\mu-\nu\|_{1} \leq \frac{\varepsilon_{\mathrm{dyn}}}{2}$ from Assumption [1](https://arxiv.org/html/2605.08202#Thmassumption1), we have:

$$
\big|\mathbb{E}_{\widehat{P},\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}] - \mathbb{E}_{P,\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}]\big| \leq 2Q_{\max}\cdot\frac{\varepsilon_{\mathrm{dyn}}}{2} = Q_{\max}\varepsilon_{\mathrm{dyn}}. \tag{52}
$$

For the policy distribution bias, a similar argument in the action space $\mathcal{A}$ gives:

$$
\big|\mathbb{E}_{P,\widehat{\pi}}[Q^{\pi_{\mathrm{ref}}}] - \mathbb{E}_{P,\pi_{\mathrm{ref}}}[Q^{\pi_{\mathrm{ref}}}]\big| \leq 2Q_{\max}\,\|\widehat{\pi}(\cdot\mid\bm{s}') - \pi_{\mathrm{ref}}(\cdot\mid\bm{s}')\|_{\mathrm{TV}} \leq 2Q_{\max}\big(C_{1}\varepsilon_{\mathrm{dyn}} + C_{2}\varepsilon_{\mathrm{det}}\big). \tag{53}
$$

Therefore, the combined bound for (I) is:

$$
(I) \leq Q_{\max}\big((1+2C_{1})\varepsilon_{\mathrm{dyn}} + 2C_{2}\varepsilon_{\mathrm{det}}\big). \tag{54}
$$
Bound on (II): Given $|b| \leq \eta\,\delta_{V}$, it follows directly that:

$$
(II) = \big|\mathbb{E}_{\widehat{P},\widehat{\pi}}[b]\big| \leq \mathbb{E}_{\widehat{P},\widehat{\pi}}|b| \leq \eta\,\delta_{V}. \tag{55}
$$
Thus, for all $(\bm{s},\bm{a})$,

$$
\Delta(\bm{s},\bm{a}) \leq \gamma\big(Q_{\max}\big((1+2C_{1})\varepsilon_{\mathrm{dyn}} + 2C_{2}\varepsilon_{\mathrm{det}}\big) + \eta\,\delta_{V}\big). \tag{56}
$$

Consequently, the operator difference is bounded in the supremum norm by:

$$
\|(\mathcal{T}_{\mathrm{DOSER}} - \mathcal{T}_{\mathrm{ref}})Q^{\pi_{\mathrm{ref}}}\|_{\infty} \leq \gamma\big(Q_{\max}\big((1+2C_{1})\varepsilon_{\mathrm{dyn}} + 2C_{2}\varepsilon_{\mathrm{det}}\big) + \eta\,\delta_{V}\big). \tag{57}
$$
By Theorem [1](https://arxiv.org/html/2605.08202#Thmtheorem1) in the main paper, the DOSER critic converges to the fixed point of $\mathcal{T}_{\mathrm{DOSER}}$. Thus:

$$
\widehat{Q} = \mathcal{T}_{\mathrm{DOSER}}\widehat{Q}. \tag{58}
$$
We now bound the final approximation error:

$$
\begin{aligned}
\|\widehat{Q} - Q^{\pi_{\mathrm{ref}}}\|_{\infty} &= \|\mathcal{T}_{\mathrm{DOSER}}\widehat{Q} - \mathcal{T}_{\mathrm{ref}}Q^{\pi_{\mathrm{ref}}}\|_{\infty} \\
&\leq \|\mathcal{T}_{\mathrm{DOSER}}\widehat{Q} - \mathcal{T}_{\mathrm{DOSER}}Q^{\pi_{\mathrm{ref}}}\|_{\infty} + \|(\mathcal{T}_{\mathrm{DOSER}} - \mathcal{T}_{\mathrm{ref}})Q^{\pi_{\mathrm{ref}}}\|_{\infty} \\
&\leq \gamma\,\|\widehat{Q} - Q^{\pi_{\mathrm{ref}}}\|_{\infty} + \gamma\big(Q_{\max}\big((1+2C_{1})\varepsilon_{\mathrm{dyn}} + 2C_{2}\varepsilon_{\mathrm{det}}\big) + \eta\,\delta_{V}\big).
\end{aligned} \tag{59}
$$
Rearranging terms:

$$
(1-\gamma)\,\|\widehat{Q} - Q^{\pi_{\mathrm{ref}}}\|_{\infty} \leq \gamma\big(Q_{\max}\big((1+2C_{1})\varepsilon_{\mathrm{dyn}} + 2C_{2}\varepsilon_{\mathrm{det}}\big) + \eta\,\delta_{V}\big). \tag{60}
$$

Absorbing the constants into $C_{1}$ and $C_{2}$ yields the final result:

$$
\|\widehat{Q} - Q^{\pi_{\mathrm{ref}}}\|_{\infty} \leq \frac{\gamma}{1-\gamma}\Big(Q_{\max}\big(C_{1}\varepsilon_{\mathrm{dyn}} + C_{2}\varepsilon_{\mathrm{det}}\big) + \eta\,\delta_{V}\Big). \tag{61}
$$

This completes the proof. ∎

#### A.3.4 Proof of Theorem [4](https://arxiv.org/html/2605.08202#Thmtheorem4)

We first make several standard continuity assumptions about the learned $Q$ function and the transition dynamics $P$, which are frequently employed in the theoretical analysis of RL (Gouk et al., [2021](https://arxiv.org/html/2605.08202#bib.bib53); Dufour and Prieto-Rumeau, [2013](https://arxiv.org/html/2605.08202#bib.bib54)).

###### Assumption 4 (Lipschitz Q).

For all $\bm{s}\in\mathcal{S}$ and $\bm{a}_{1},\bm{a}_{2}\in\mathcal{A}$, the learned value function is $L_{Q}$-Lipschitz:

$$
\|Q(\bm{s},\bm{a}_{1}) - Q(\bm{s},\bm{a}_{2})\| \leq L_{Q}\,\|\bm{a}_{1} - \bm{a}_{2}\|. \tag{62}
$$

###### Assumption 5 (Lipschitz P).

For all $\bm{s}\in\mathcal{S}$ and $\bm{a}_{1},\bm{a}_{2}\in\mathcal{A}$, the transition dynamics is $L_{P}$-Lipschitz:

$$
\|P(\cdot\mid\bm{s},\bm{a}_{1}) - P(\cdot\mid\bm{s},\bm{a}_{2})\| \leq L_{P}\,\|\bm{a}_{1} - \bm{a}_{2}\|. \tag{63}
$$

###### Lemma 2\.

Under Assumption [5](https://arxiv.org/html/2605.08202#Thmassumption5), the following inequality holds:

$$
\mathrm{TV}(d^{\pi_{1}}\,\|\,d^{\pi_{2}}) \leq C L_{P}\max_{\bm{s}}\|\pi_{1}(\bm{s}) - \pi_{2}(\bm{s})\|, \tag{64}
$$

where $C$ is a positive constant and $d^{\pi}$ is the state occupancy measure under policy $\pi$:

$$
d^{\pi}(\bm{s}) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}_{\pi}\big[\mathbb{I}[\bm{s}_{t} = \bm{s}]\big]. \tag{65}
$$

###### Proof\.

Please refer to Lemma 1 in Xiong et al. ([2022](https://arxiv.org/html/2605.08202#bib.bib55)) and Lemma A.5 in Ran et al. ([2023](https://arxiv.org/html/2605.08202#bib.bib56)). ∎

Now we start the proof of Theorem[4](https://arxiv.org/html/2605.08202#Thmtheorem4)\.

###### Proof\.

The proof proceeds by decomposing the overall performance gap between the optimal policy $\pi^{*}$ and the learned policy $\widehat{\pi}$ into manageable components and then bounding each term individually. As in Theorem [3](https://arxiv.org/html/2605.08202#Thmtheorem3), let $\pi_{\mathrm{ref}}$ denote the ideal reference policy. Then

$$
\begin{aligned}
|J(\pi^{*}) - J(\widehat{\pi})| &= |J(\pi^{*}) - J(\pi_{\mathrm{ref}}) + J(\pi_{\mathrm{ref}}) - J(\widehat{\pi})| \\
&\leq |J(\pi^{*}) - J(\pi_{\mathrm{ref}})| + |J(\pi_{\mathrm{ref}}) - J(\widehat{\pi})|.
\end{aligned} \tag{66}
$$

The first term captures the error due to function approximation, which we denote as $\delta_{f}$. Under the asymptotic regime where the empirical fitting errors vanish, this term can be made arbitrarily small. Hence we focus on the second term, which quantifies the performance gap between the learned policy and the reference policy.

$$
\begin{aligned}
|J(\pi_{\mathrm{ref}}) - J(\widehat{\pi})| &= \left|\frac{1}{1-\gamma}\mathbb{E}_{\bm{s}\sim d^{\pi_{\mathrm{ref}}}}[r(\bm{s})] - \frac{1}{1-\gamma}\mathbb{E}_{\bm{s}\sim d^{\widehat{\pi}}}[r(\bm{s})]\right| \\
&= \frac{1}{1-\gamma}\left|\sum_{\bm{s}}\big(d^{\pi_{\mathrm{ref}}}(\bm{s}) - d^{\widehat{\pi}}(\bm{s})\big)r(\bm{s})\right| \\
&\leq \frac{1}{1-\gamma}\sum_{\bm{s}}\big|d^{\pi_{\mathrm{ref}}}(\bm{s}) - d^{\widehat{\pi}}(\bm{s})\big|\,|r(\bm{s})| \\
&\leq \frac{R_{\max}}{1-\gamma}\,\mathrm{TV}\big(d^{\pi_{\mathrm{ref}}}\,\|\,d^{\widehat{\pi}}\big) \\
&\leq \frac{C L_{P} R_{\max}}{1-\gamma}\max_{\bm{s}}\|\pi_{\mathrm{ref}}(\bm{s}) - \widehat{\pi}(\bm{s})\| \\
&\leq \frac{C L_{P} R_{\max}}{1-\gamma}\big(C_{1}\varepsilon_{\mathrm{dyn}} + C_{2}\varepsilon_{\mathrm{det}}\big).
\end{aligned} \tag{67}
$$

Combining both error terms yields the overall performance guarantee:

$$
|J(\pi^{*}) - J(\widehat{\pi})| \leq \delta_{f} + \frac{C L_{P} R_{\max}}{1-\gamma}\big(C_{1}\varepsilon_{\mathrm{dyn}} + C_{2}\varepsilon_{\mathrm{det}}\big). \tag{68}
$$

∎

## Appendix B Experimental Details

### B.1 Diffusion Model Framework

We adopt the EDM framework (Karras et al., [2022](https://arxiv.org/html/2605.08202#bib.bib21)) to leverage the advantages of continuous-time diffusion models for offline RL. EDM builds on the continuous-time formulation of diffusion processes, which allows us to use an optimized ODE solver for sampling. This solver adaptively determines the steps along the noise-level trajectory, significantly reducing the computational load and accelerating generation while maintaining high sample quality compared to sampling with a fixed discrete schedule.

Noise schedule. In the DOSER framework, the noise schedule is a crucial component of the diffusion model, defining how the noise levels vary over time. Following the insights from the EDM paper, the noise level $\sigma_{t}$ is sampled from a log-logistic distribution $\sigma_{t}\sim\text{log-logistic}(\log\sigma_{\text{data}}, s)$, where $\log\sigma_{\text{data}}$ serves as the shape parameter and $s$ as the scale parameter. Using this schedule, a noisy action $\bm{a}_{t}$ is constructed as $\bm{a}_{t} = \bm{a}_{0} + \sigma_{t}\bm{\epsilon}$, with $\bm{\epsilon}\sim\mathcal{N}(0,\bm{I})$. The parameters are configured as follows: $\sigma_{\text{data}} = 0.5$, and the noise schedule is clamped between $\sigma_{\min} = 0.02$ and $\sigma_{\max} = 80$.
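A minimal sketch of this noising step follows. The inverse-CDF parameterization of the log-logistic sampler (treating $\log\sigma_{\text{data}}$ as the location in log-space and $s$ as the scale) is our reading of the text and should be taken as an assumption, as should the value `s = 1.0`.

```python
import torch

sigma_data, s = 0.5, 1.0
sigma_min, sigma_max = 0.02, 80.0

def sample_sigma(batch_size):
    # log(sigma) = log(sigma_data) + s * logit(u), u ~ U(0, 1)  (assumed parameterization)
    u = torch.rand(batch_size)
    log_sigma = torch.log(torch.tensor(sigma_data)) + s * torch.logit(u)
    return log_sigma.exp().clamp(sigma_min, sigma_max)

def perturb_actions(actions):
    """Construct noisy actions a_t = a_0 + sigma_t * eps with eps ~ N(0, I)."""
    sigma = sample_sigma(actions.shape[0]).unsqueeze(-1)
    eps = torch.randn_like(actions)
    return actions + sigma * eps, sigma
```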

Training loss. The EDM framework preconditions the neural network with a $\sigma_{t}$-dependent skip connection to improve numerical stability. Specifically, the denoising network for behavior policy modeling is defined as follows:

$$
\bm{\epsilon}_{\theta_{a}}(\bm{a}_{t},\sigma_{t},\bm{s}) = c_{\text{skip}}(\sigma_{t})\,\bm{a}_{t} + c_{\text{out}}(\sigma_{t})\,F_{\theta_{a}}\big(c_{\text{in}}(\sigma_{t})\,\bm{a}_{t};\, c_{\text{noise}}(\sigma_{t})\mid\bm{s}\big). \tag{69}
$$
Similarly, the denoising network for state distribution modeling is defined as:

ϵθs​\(𝒔t,σt\)=cskip​\(σt\)​𝒔t\+cout​\(σt\)​Fθs​\(cin​\(σt\);cnoise​\(σt\)\)\\bm\{\\epsilon\}\_\{\\theta\_\{s\}\}\(\{\\bm\{s\}\}\_\{t\},\\sigma\_\{t\}\)=c\_\{\\text\{skip\}\}\(\\sigma\_\{t\}\)\{\\bm\{s\}\}\_\{t\}\+c\_\{\\text\{out\}\}\(\\sigma\_\{t\}\)F\_\{\\theta\_\{s\}\}\(c\_\{\\text\{in\}\}\(\\sigma\_\{t\}\);c\_\{\\text\{noise\}\}\(\\sigma\_\{t\}\)\)\(70\)
where $F_{\theta_{a}}$ and $F_{\theta_{s}}$ are the neural networks that are actually trained, $c_{\text{skip}}(\sigma_{t})$ modulates the skip connection, $c_{\text{in}}(\sigma_{t})$ and $c_{\text{out}}(\sigma_{t})$ scale the input and output magnitudes respectively, and $c_{\text{noise}}(\sigma_{t})$ maps the noise level $\sigma_{t}$ into a conditioning input for $F_{\theta_{a}}$ and $F_{\theta_{s}}$.

We can equivalently express the loss [(4)](https://arxiv.org/html/2605.08202#S3.E4) with respect to the raw network output $F_{\theta_{a}}$ in [(69)](https://arxiv.org/html/2605.08202#A2.E69):

$$
\mathbb{E}_{\sigma_{t},\boldsymbol{s},\boldsymbol{a},\boldsymbol{\epsilon}}\!\left[\lambda(\sigma_{t})\,c^{2}_{\text{out}}(\sigma_{t})\left\|F_{\theta_{a}}\big(c_{\text{in}}(\sigma_{t})\cdot(\boldsymbol{a}+\boldsymbol{\epsilon});\,c_{\text{noise}}(\sigma_{t})\mid\boldsymbol{s}\big)-\frac{1}{c_{\text{out}}(\sigma_{t})}\big(\boldsymbol{a}-c_{\text{skip}}(\sigma_{t})\cdot(\boldsymbol{a}+\boldsymbol{\epsilon})\big)\right\|^{2}\right]
\tag{71}
$$
Similarly, the loss [(5)](https://arxiv.org/html/2605.08202#S3.E5) can be expressed based on [(70)](https://arxiv.org/html/2605.08202#A2.E70):

$$
\mathbb{E}_{\sigma_{t},\boldsymbol{s},\boldsymbol{\epsilon}}\!\left[\lambda(\sigma_{t})\,c^{2}_{\text{out}}(\sigma_{t})\left\|F_{\theta_{s}}\big(c_{\text{in}}(\sigma_{t})\cdot(\boldsymbol{s}+\boldsymbol{\epsilon});\,c_{\text{noise}}(\sigma_{t})\big)-\frac{1}{c_{\text{out}}(\sigma_{t})}\big(\boldsymbol{s}-c_{\text{skip}}(\sigma_{t})\cdot(\boldsymbol{s}+\boldsymbol{\epsilon})\big)\right\|^{2}\right]
\tag{72}
$$
Following the variance-normalization principles of EDM, we adopt its practical parameter choices:

$$
\begin{cases}
c_{\text{skip}}(\sigma_{t})=\sigma^{2}_{\text{data}}/(\sigma_{t}^{2}+\sigma^{2}_{\text{data}})\\[4pt]
c_{\text{out}}(\sigma_{t})=\sigma_{t}\cdot\sigma_{\text{data}}/\sqrt{\sigma_{t}^{2}+\sigma^{2}_{\text{data}}}\\[4pt]
c_{\text{in}}(\sigma_{t})=1/\sqrt{\sigma_{t}^{2}+\sigma^{2}_{\text{data}}}\\[4pt]
c_{\text{noise}}(\sigma_{t})=\tfrac{1}{4}\ln(\sigma_{t})\\[4pt]
\lambda(\sigma_{t})=(\sigma_{t}^{2}+\sigma^{2}_{\text{data}})/(\sigma_{t}\cdot\sigma_{\text{data}})^{2}
\end{cases}
\tag{73}
$$
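
A compact sketch of these preconditioning functions, using the parameter choices above with $\sigma_{\text{data}} = 0.5$, might look as follows; the per-sample weighting $\lambda(\sigma_t)c^2_{\text{out}}(\sigma_t)$ that appears in the losses (71) and (72) is included for convenience.

```python
import torch

SIGMA_DATA = 0.5  # as configured above

def edm_precond(sigma: torch.Tensor):
    """Return (c_skip, c_out, c_in, c_noise) for a batch of noise levels."""
    c_skip = SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)
    c_out = sigma * SIGMA_DATA / torch.sqrt(sigma**2 + SIGMA_DATA**2)
    c_in = 1.0 / torch.sqrt(sigma**2 + SIGMA_DATA**2)
    c_noise = 0.25 * torch.log(sigma)
    return c_skip, c_out, c_in, c_noise

def loss_weight(sigma: torch.Tensor):
    # lambda(sigma) * c_out(sigma)^2 simplifies to 1 under these choices,
    # which is exactly the variance-normalization property mentioned above.
    lam = (sigma**2 + SIGMA_DATA**2) / (sigma * SIGMA_DATA)**2
    c_out_sq = (sigma * SIGMA_DATA)**2 / (sigma**2 + SIGMA_DATA**2)
    return lam * c_out_sq
```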

### B.2 Network Architecture

Behavior policy and state distribution modeling. Following Chen et al. ([2024](https://arxiv.org/html/2605.08202#bib.bib27)), we implement both the behavior policy and the state distribution as MLP-based diffusion models. The denoising network for the behavior policy, $\boldsymbol{\epsilon}_{\theta_a}(\boldsymbol{a}_t, t, \boldsymbol{s})$, is a conditional diffusion model that predicts actions given a noisy action vector $\boldsymbol{a}_t$, a diffusion timestep $t$ (encoded via sinusoidal positional embedding), and a state condition $\boldsymbol{s}$. In contrast, the denoising network for the state distribution, $\boldsymbol{\epsilon}_{\theta_s}(\boldsymbol{s}_t, t)$, is an unconditional diffusion model that predicts states from a noisy state $\boldsymbol{s}_t$ and the timestep embedding. Both models share the same base architecture: a 4-layer MLP with Mish activations and 256 hidden units per layer. The main difference lies in their input dimensions: the behavior policy network additionally concatenates the state condition $\boldsymbol{s}$, while the state distribution network operates without conditioning.
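
As a rough illustration of this architecture, the conditional denoiser for the behavior policy could be assembled as below; the embedding size and the exact way the timestep embedding is concatenated with the inputs are assumptions, not the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn

class SinusoidalEmbedding(nn.Module):
    """Standard sinusoidal positional embedding of the diffusion timestep."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.dim = dim

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
        args = t.float()[:, None] * freqs[None, :]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class ConditionalDenoiser(nn.Module):
    """4-layer MLP with Mish activations and 256 hidden units per layer."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256, t_dim: int = 32):
        super().__init__()
        self.t_embed = SinusoidalEmbedding(t_dim)
        in_dim = action_dim + state_dim + t_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a_t, t, s):
        # Concatenate noisy action, timestep embedding, and state condition.
        return self.net(torch.cat([a_t, self.t_embed(t), s], dim=-1))
```

The unconditional state denoiser would follow the same pattern but omit the state input from the concatenation.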

Critic Networks. Following the implementation of SVR (Mao et al., [2023](https://arxiv.org/html/2605.08202#bib.bib12)), the critic network comprises four Q-networks and two V-networks, each implemented as a 3-layer MLP with 256 hidden units per layer and ReLU activation functions.

Actor Network. The actor network adopts a Tanh-Gaussian policy structure similar to SAC (Haarnoja et al., [2018](https://arxiv.org/html/2605.08202#bib.bib48)). It is implemented as a 3-layer MLP with 256 hidden units and ReLU activations in all hidden layers. The network supports both deterministic and stochastic action sampling, while preserving entropy regularization for effective exploration.

Dynamics Model. The dynamics model is implemented as a 3-layer MLP with 256 hidden units and ReLU activations; it takes concatenated state-action pairs as input and predicts both the next state and the reward.

### B.3 Hyperparameters

Diffusion models and networks share the same hyperparameter settings across all tasks\. The detailed configurations are provided in Table[3](https://arxiv.org/html/2605.08202#A2.T3)\.

Table 3: Hyperparameters for all tasks.

| Hyperparameter | Value |
| --- | --- |
| Optimizer | Adam (Adam and others, [2014](https://arxiv.org/html/2605.08202#bib.bib45)) |
| Learning rate | 3e-4 |
| Learning rate decay | Cosine (Loshchilov and Hutter, [2016](https://arxiv.org/html/2605.08202#bib.bib46)) |
| Batch size | 256 |
| Discount factor | 0.99 |
| Target update rate | 0.005 |
| Policy update frequency | 2 |
| Target network update frequency | 2 |
| Number of sampled actions | 10 |
| Compensation coefficient $\lambda$ | 0.001 |
| Compensation target weight $\eta$ | 0.9 |

To accommodate varying data distributions across different tasks, we employ task-specific hyperparameters, including the penalty coefficient $\beta$ for the detrimental OOD action penalty, the OOD detection thresholds $\tau_a$ and $\tau_s$ for actions and states, the expectile regression factor $\tau$, and the lower bound of the Q-value $Q_{\min}$; their specific values for each task are detailed in Table [4](https://arxiv.org/html/2605.08202#A2.T4).

Table 4: Task-specific hyperparameter settings.

| Task | $\beta$ | $\tau_a$ | $\tau_s$ | $\tau$ | $Q_{\min}$ |
| --- | --- | --- | --- | --- | --- |
| halfcheetah-medium-v2 | 0.001 | 99th | 99th | 0.9 | -366 |
| halfcheetah-medium-replay-v2 | 0.001 | 99th | 99th | 0.9 | -366 |
| halfcheetah-medium-expert-v2 | 0.05 | 80th | 80th | 0.7 | -366 |
| halfcheetah-expert-v2 | 0.05 | 80th | 80th | 0.7 | -366 |
| halfcheetah-random-v2 | 0.001 | 99th | 99th | 0.9 | -366 |
| hopper-medium-v2 | 0.001 | 99th | 99th | 0.9 | -125 |
| hopper-medium-replay-v2 | 0.001 | 99th | 99th | 0.9 | -125 |
| hopper-medium-expert-v2 | 0.05 | 80th | 80th | 0.7 | -125 |
| hopper-expert-v2 | 0.05 | 80th | 80th | 0.7 | -125 |
| hopper-random-v2 | 0.001 | 99th | 99th | 0.9 | -125 |
| walker2d-medium-v2 | 0.001 | 99th | 99th | 0.9 | -471 |
| walker2d-medium-replay-v2 | 0.001 | 99th | 99th | 0.9 | -471 |
| walker2d-medium-expert-v2 | 0.05 | 99th | 99th | 0.7 | -471 |
| walker2d-expert-v2 | 0.05 | 99th | 99th | 0.7 | -471 |
| walker2d-random-v2 | 0.001 | 99th | 99th | 0.9 | -471 |
| pen-cloned-v1 | 1 | 60th | 60th | 0.7 | -715 |
| pen-human-v1 | 20 | 80th | 80th | 0.7 | -715 |
### B.4 Experimental Details on the Toy Example

For the 1D navigation task illustrated in Figure [6](https://arxiv.org/html/2605.08202#A2.F6)(a), the state space $[-10, 10]$ represents the agent's current position, while actions correspond to step sizes within $[-1, 1]$. The reward function is the negative distance to the target state 0. Based on this reward function, the ground-truth Q-function is calculated and depicted in Figure [6](https://arxiv.org/html/2605.08202#A2.F6)(b). To evaluate the performance of different methods, we generate an *expert* dataset and a *medium* dataset, each containing 500,000 transitions. The expert dataset is constructed by perturbing the optimal action derived from the ground-truth Q-value with small noise $\epsilon\sim\mathcal{U}[-0.05, 0.05]$, while the medium dataset is generated by adding larger noise $\epsilon\sim\mathcal{U}[-0.5, 0.5]$. The score network in this toy example is implemented as a 4-layer MLP with Mish activations and 256 hidden units per layer.
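
A minimal sketch of how such datasets could be generated is given below; it assumes the optimal action is a unit-clipped step toward the target state 0 and that the reward is evaluated at the reached state, which is consistent with the description above but remains an assumption rather than the authors' exact construction.

```python
import numpy as np

def make_dataset(n=500_000, noise=0.05, seed=0):
    """Generate (s, a, r, s') tuples for the 1D navigation toy task."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(-10.0, 10.0, size=n)            # positions in [-10, 10]
    a_opt = np.clip(-s, -1.0, 1.0)                  # step toward the target 0
    a = np.clip(a_opt + rng.uniform(-noise, noise, n), -1.0, 1.0)
    s_next = np.clip(s + a, -10.0, 10.0)
    r = -np.abs(s_next)                             # negative distance to the target
    return s, a, r, s_next

expert = make_dataset(noise=0.05)   # expert dataset: small perturbations
medium = make_dataset(noise=0.5)    # medium dataset: larger perturbations
```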

![Refer to caption](https://arxiv.org/html/2605.08202v1/x8.png)

Figure 6: Toy environment and ground-truth Q-function heatmap visualization.

For the model ensemble method, we employ 5 independently trained neural networks with identical architectures to quantify predictive uncertainty. Each model is a 3-layer MLP with ReLU activations and 128 hidden units. All models are trained in a supervised manner for 100 epochs using the Adam optimizer with a fixed learning rate of 1e-3. During inference, the ensemble estimates epistemic uncertainty by computing the normalized variance across model predictions.

For the MC dropout framework, we adopt a Q-network architecture consisting of a 3-layer MLP with 256 hidden units and ReLU activations. Dropout layers with a fixed probability of 0.1 are incorporated to introduce stochasticity during inference. This configuration enables the model to approximate Bayesian inference by maintaining dropout activation during both training and evaluation phases. The Q-network undergoes supervised training for 1,000 epochs using the Adam optimizer with a consistent learning rate of 1e-3. For uncertainty quantification, we perform 20 stochastic forward passes per state-action pair with dropout enabled, computing the epistemic uncertainty as the normalized variance across these Monte Carlo samples.
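
A hedged sketch of this MC-dropout uncertainty estimate is shown below; the network definition and the input dimensions are schematic, but it illustrates keeping dropout active at evaluation time and taking the variance over repeated stochastic forward passes.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 17, 6          # e.g., halfcheetah dimensions (assumed here)

q_net = nn.Sequential(                 # schematic 3-layer MLP Q-network with dropout
    nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(256, 1),
)

@torch.no_grad()
def mc_dropout_uncertainty(sa: torch.Tensor, n_passes: int = 20) -> torch.Tensor:
    """Epistemic uncertainty as the variance of Q over stochastic passes."""
    q_net.train()                      # keep dropout active during inference
    preds = torch.stack([q_net(sa) for _ in range(n_passes)], dim=0)
    return preds.var(dim=0).squeeze(-1)  # per-sample variance across the passes
```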

For the VAE-based method, we adopt a conditional VAE (CVAE) architecture to model the behavior policy distribution and quantify out-of-distribution actions using reconstruction error. The decoder reconstructs the original action through a single output head. The model is trained for 1,000 epochs with the Adam optimizer at a learning rate of 1e-3. During inference, the reconstruction error is computed for state-action pairs by comparing the reconstructed action to the original input action.

### B.5 Experimental Details on D4RL Benchmarks

For all MuJoCo locomotion tasks, we pretrain the diffusion models for both the behavior policy and the state distribution for 100,000 gradient steps using the Adam optimizer with a learning rate of 3e-4 and a batch size of 1024. The dynamics models are also pretrained for 100,000 gradient steps with the same learning rate and batch size. Our algorithm is then trained for 2 million gradient steps to ensure convergence, with policy evaluation performed every 20,000 gradient steps. Results are reported as the average normalized scores over 40 random rollouts, comprising 4 independently trained models and 10 evaluation trajectories per model across all tasks. All experiments are conducted on four NVIDIA GeForce RTX 3090 GPUs, with each experiment taking approximately 30 hours to complete, including both training and evaluation.

## Appendix C Additional Experimental Results

### C.1 OOD Detection Performance on D4RL Benchmarks

![Refer to caption](https://arxiv.org/html/2605.08202v1/x9.png)(a) halfcheetah-medium-v2 (ID) vs. halfcheetah-medium-expert-v2 (OOD).
![Refer to caption](https://arxiv.org/html/2605.08202v1/x10.png)(b) hopper-medium-expert-v2 (ID) vs. hopper-medium-replay-v2 (OOD).
![Refer to caption](https://arxiv.org/html/2605.08202v1/x11.png)(c) walker2d-medium-replay-v2 (ID) vs. walker2d-medium-v2 (OOD).

Figure 7: Diffusion-based reconstruction error distribution across datasets. Diffusion models were trained exclusively on in-distribution (ID) data. From left to right: t-SNE embedding of the state-action distributions; reconstruction errors of ID samples; reconstruction errors of OOD samples; and density plots of error distributions for both ID and OOD samples. The color bar indicates the magnitude of reconstruction error in the second and third columns.

To evaluate the ability of our diffusion-based models to distinguish OOD samples, we conduct experiments on the D4RL benchmarks, designating certain datasets as in-distribution (ID) and others as OOD. Specifically, we pretrain diffusion models on the ID datasets and evaluate their performance on OOD datasets drawn from the same environment. For each dataset, we randomly sample 5,000 state-action pairs to ensure a balanced comparison. The reconstruction error distributions for the actions are visualized via color-mapped scatter plots and histograms in Figure [7](https://arxiv.org/html/2605.08202#A3.F7).
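
A minimal sketch of the single-step denoising reconstruction error used as the OOD score might look as follows; the fixed evaluation noise level `sigma_eval` and the exact interface of the preconditioned denoiser from (69) are assumptions about implementation details not fully specified here.

```python
import torch

@torch.no_grad()
def recon_error(denoiser, s, a, sigma_eval: float = 0.1):
    """Single-step denoising reconstruction error for state-action pairs.

    `denoiser(a_t, sigma, s)` is assumed to return the denoised action,
    i.e. the preconditioned network from Eq. (69).
    """
    sigma = torch.full((a.shape[0], 1), sigma_eval, device=a.device)
    a_t = a + sigma * torch.randn_like(a)      # one forward noising step
    a_hat = denoiser(a_t, sigma, s)            # one denoising step
    return ((a_hat - a) ** 2).sum(dim=-1)      # per-sample squared error
```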

Across all environments, OOD datasets consistently exhibit significantly larger reconstruction errors compared to their ID counterparts\. This pronounced discrepancy is visually evident in both the color\-mapped scatter plots and the histogram plots\. In the scatter plots, ID samples are consistently associated with low reconstruction errors, whereas OOD samples display markedly high error values\. Similarly, the histogram plots reveal a distinct shift in the error distributions between ID and OOD samples, with OOD data showing a heavier tail toward higher error values\. These results strongly suggest that diffusion\-based reconstruction error serves as a robust and effective indicator for OOD detection in this setting\.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x12.png)\(a\)Noise scale = 0\.5\.
![Refer to caption](https://arxiv.org/html/2605.08202v1/x13.png)\(b\)Noise scale = 1\.0\.
![Refer to caption](https://arxiv.org/html/2605.08202v1/x14.png)\(c\)Noise scale = 5\.0\.

Figure 8: Diffusion-based reconstruction error distributions on original ID datasets and synthetic OOD datasets.

Table 5: OOD detection metrics on synthetic OOD datasets.

| Noise Scale | TP | TN | FP | FN | Accuracy | Precision | Recall | F1-Score | AUROC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.5 | 2910 | 4957 | 43 | 2090 | 0.7867 | 0.9854 | 0.5820 | 0.7318 | 0.9637 |
| 1.0 | 4832 | 4957 | 43 | 168 | 0.9789 | 0.9912 | 0.9664 | 0.9876 | 0.9980 |
| 5.0 | 5000 | 4957 | 43 | 0 | 0.9957 | 0.9915 | 1.0000 | 0.9957 | 1.0000 |

To provide a more comprehensive quantitative analysis, we construct synthetic OOD datasets as follows. We first sample 5,000 state-action pairs from the original D4RL dataset, and for each pair we generate a corresponding OOD sample by perturbing the action with standard Gaussian noise using noise scales of 0.5, 1.0, and 5.0, respectively. We evaluate the OOD detection capability of our diffusion-based reconstruction error on these datasets, using the 99th percentile of reconstruction errors computed from ID samples as the detection threshold. Based on this threshold, we report the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), together with standard classification metrics including precision, recall, F1-score, and AUROC. These results are summarized in Table [5](https://arxiv.org/html/2605.08202#A3.T5). Figure [8](https://arxiv.org/html/2605.08202#A3.F8) presents the empirical distributions and histograms of reconstruction errors for both ID and OOD samples under different noise scales, while the corresponding ROC curves are shown in Figure [9](https://arxiv.org/html/2605.08202#A3.F9).
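
The detection metrics above can be reproduced with a short routine of the following form; `err_id` and `err_ood` are assumed to be arrays of reconstruction errors for the ID samples and the noise-perturbed samples, respectively.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_metrics(err_id: np.ndarray, err_ood: np.ndarray, pct: float = 99.0):
    """Threshold at the `pct`-th percentile of ID errors and score detection."""
    thr = np.percentile(err_id, pct)
    tp = int((err_ood > thr).sum());  fn = int((err_ood <= thr).sum())
    tn = int((err_id <= thr).sum());  fp = int((err_id > thr).sum())
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # AUROC treats OOD samples as the positive class (label 1).
    labels = np.concatenate([np.zeros_like(err_id), np.ones_like(err_ood)])
    auroc = roc_auc_score(labels, np.concatenate([err_id, err_ood]))
    return dict(TP=tp, TN=tn, FP=fp, FN=fn, accuracy=accuracy,
                precision=precision, recall=recall, f1=f1, auroc=auroc)
```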

The results show that diffusion\-based reconstruction error is highly effective for OOD action detection across different levels of perturbation\. When the noise scale is relatively small, the method achieves high precision but moderate recall, indicating that mildly perturbed OOD actions are more difficult to detect\. As the noise scale increases, both recall and F1\-score improve substantially, reaching nearly perfect detection performance at large perturbations\. The AUROC also increases consistently and reaches 1\.0 for the largest noise setting, demonstrating that the reconstruction error provides a reliable and discriminative signal for distinguishing ID and OOD actions under challenging distribution shifts\.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x15.png)\(a\)Noise scale = 0\.5\.
![Refer to caption](https://arxiv.org/html/2605.08202v1/x16.png)\(b\)Noise scale = 1\.0\.
![Refer to caption](https://arxiv.org/html/2605.08202v1/x17.png)\(c\)Noise scale = 5\.0\.

Figure 9:ROC curves for diffusion\-based OOD detection under different noise scales\.
### C.2 Validation on OOD Detection Benchmarks

Table 6: Validation on OOD detection benchmarks.

| Method | KDDCUP Precision | KDDCUP Recall | KDDCUP $F_1$ | KDDCUP-Rev Precision | KDDCUP-Rev Recall | KDDCUP-Rev $F_1$ | Arrhythmia Precision | Arrhythmia Recall | Arrhythmia $F_1$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OC-SVM | 0.7457 | 0.8523 | 0.7954 | 0.7148 | 0.9940 | 0.8316 | 0.5397 | 0.4082 | 0.4581 |
| DCN | 0.7696 | 0.7829 | 0.7762 | 0.2875 | 0.2895 | 0.2885 | 0.3758 | 0.3907 | 0.3815 |
| DSEBM-r | 0.1972 | 0.2001 | 0.1987 | 0.2036 | 0.2036 | 0.2036 | 0.1515 | 0.1513 | 0.1510 |
| DAGMM | 0.9297 | 0.9442 | 0.9369 | 0.9370 | 0.9390 | 0.9380 | 0.4909 | 0.5078 | 0.4983 |
| GOAD | - | - | 0.9840 | - | - | 0.9890 | - | - | 0.5200 |
| Ours | 0.9862 | 0.9937 | 0.9899 | 0.9476 | 0.9144 | 0.9307 | 0.9545 | 0.9545 | 0.9545 |

To further investigate the effectiveness and generalizability of our proposed diffusion-based OOD detection mechanism beyond the reinforcement learning domain, we conduct additional experiments on three widely used anomaly detection benchmarks: KDDCUP, KDDCUP-Rev, and Arrhythmia. Following the procedure described in the main paper, we compute the OOD score of each sample using the single-step denoising reconstruction error produced by a trained diffusion model. We compare our approach against several state-of-the-art deep learning methods, including OC-SVM (Chen et al., [2001](https://arxiv.org/html/2605.08202#bib.bib49)), DCN (Jin et al., [2021](https://arxiv.org/html/2605.08202#bib.bib50)), DSEBM-r (Zhai et al., [2016](https://arxiv.org/html/2605.08202#bib.bib51)), DAGMM (Zong et al., [2018](https://arxiv.org/html/2605.08202#bib.bib36)), and GOAD (Bergman and Hoshen, [2020](https://arxiv.org/html/2605.08202#bib.bib52)). Our experimental setup follows GOAD, and the baseline results are taken directly from the respective original publications.

Table[6](https://arxiv.org/html/2605.08202#A3.T6)summarizes the precision, recall, and F1\-scores across all benchmarks\. The results demonstrate that our diffusion\-based approach consistently achieves high detection accuracy and outperforms or matches existing baselines across all datasets\. This robust performance further validates the diffusion model’s superior capability in modeling the in\-distribution data manifold and confirms the reliability of using the reconstruction error as a general indicator for OOD detection, even across varying data distributions and task contexts\.

### C.3 Quantitative Analysis of Diffusion-based Reconstruction Error versus Negative Log-Likelihood (NLL)

To further examine whether diffusion\-based reconstruction error serves as a meaningful proxy for likelihood estimation, we conduct quantitative correlation analyses on both synthetic and D4RL datasets\. Scatter plots illustrating the relationship between diffusion reconstruction error and negative log\-likelihood \(NLL\) are shown in Figure[10](https://arxiv.org/html/2605.08202#A3.F10)\.

We first validate the relationship in a Gaussian mixture setting, where the ground-truth density is analytically available. Specifically, we construct a four-component symmetric Gaussian mixture and uniformly sample 10,000 points. Using a diffusion model trained on this distribution, we compute the reconstruction error for each sample and compare it against the true NLL derived from the underlying GMM. The result indicates a very strong Pearson correlation ($\rho = 0.9718$), confirming that the diffusion-based reconstruction error is highly consistent with the true likelihood when the underlying distribution is accurately modeled.
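
A sketch of this correlation check is given below; the mixture parameters, dimensionality, and sampling range are illustrative assumptions, and the reconstruction errors are assumed to have been computed separately by the trained diffusion model.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical four-component symmetric 2D Gaussian mixture with equal weights.
means = np.array([[2.0, 2.0], [-2.0, 2.0], [2.0, -2.0], [-2.0, -2.0]])
var = 0.5

def gmm_nll(x: np.ndarray) -> np.ndarray:
    """Exact negative log-likelihood under the equal-weight, isotropic GMM."""
    d2 = ((x[:, None, :] - means[None]) ** 2).sum(-1)     # squared distances to the means
    log_comp = -d2 / (2 * var) - np.log(2 * np.pi * var)  # per-component log-density
    return -(np.logaddexp.reduce(log_comp, axis=1) - np.log(len(means)))

def correlation_with_nll(recon_err: np.ndarray, x: np.ndarray) -> float:
    """Pearson correlation between reconstruction errors and the true NLL."""
    rho, _ = pearsonr(recon_err, gmm_nll(x))
    return rho

# Usage: x = rng.uniform(-5, 5, size=(10_000, 2)); rho = correlation_with_nll(err, x)
```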

We further evaluate this relationship on the halfcheetah-medium-replay-v2 dataset, where the true behavior policy density is not directly accessible. To approximate the behavior support likelihood, we train a CVAE on the dataset as a reference model and compute its NLL as a surrogate likelihood estimator. The diffusion reconstruction error again exhibits a strong positive correlation with the CVAE-based NLL ($\rho = 0.7848$), indicating that reconstruction error retains a substantial degree of statistical consistency with likelihood even in high-dimensional continuous control environments. However, since CVAEs are known to struggle with accurately modeling multi-modal behavior distributions, the resulting NLL values from the reference model may introduce estimation bias and should therefore be interpreted as only a rough approximation of the true likelihood.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x18.png)\(a\)Diffusion\-based reconstruction error vs\. true NLL on GMM dataset\.
![Refer to caption](https://arxiv.org/html/2605.08202v1/x19.png)\(b\)Diffusion\-based reconstruction error vs\. CVAE\-based NLL on D4RL dataset\.

Figure 10: Correlation analysis between diffusion-based reconstruction error and negative log-likelihood (NLL).
### C.4 Action Type Proportions During Policy Optimization

To gain deeper insight into how the action distribution induced by the learned policy evolves over time, we track the proportions of ID, beneficial OOD, and detrimental OOD actions throughout the training process. At each training iteration, the statistics are calculated over a sampled batch of size 256. We present the results for three halfcheetah tasks in Figure [11](https://arxiv.org/html/2605.08202#A3.F11).

![Refer to caption](https://arxiv.org/html/2605.08202v1/x20.png)

Figure 11: Proportions of different action types during policy optimization.

Across all datasets, the proportion of ID actions consistently increases as training progresses, accompanied by a corresponding decline in the overall proportion of OOD actions. This trend suggests that the learned policy progressively aligns more closely with the behavioral support during optimization. However, the relative magnitudes of the ID proportions vary considerably across the three datasets. Specifically, the medium-replay dataset exhibits the largest ID action ratio, followed by the medium dataset, whereas the medium-expert dataset yields the lowest ID proportion. We also visualize the ratio of beneficial to detrimental OOD actions in the rightmost column. For both the medium-replay and medium datasets, the proportion of beneficial OOD actions substantially exceeds that of detrimental ones. Conversely, on the medium-expert dataset, beneficial and detrimental OOD actions appear in almost equal proportion.

These differences can be attributed to the inherent characteristics of the datasets\. The medium\-expert dataset has a relatively narrow support concentrated around near\-optimal trajectories\. As a result, the learned policy more frequently generates actions that fall outside this narrow support, leading to a higher overall OOD proportion\. Despite this, since the dataset already contains expert demonstrations, the potential performance gain from extrapolation is limited, leading to only a comparable proportion of beneficial and detrimental OOD actions\. In contrast, the medium\-replay and medium datasets exhibit more diverse distributions of generally suboptimal behaviors\. This broader support enables the learned policy to benefit from moderate extrapolation, where slight deviations outside the data manifold can lead to meaningful performance improvements, which is consistent with our empirical results\.

### C.5 Visualization of Q-value Distribution

![Refer to caption](https://arxiv.org/html/2605.08202v1/x21.png)

Figure 12: Q-value distributions for different action types.

To investigate whether beneficial OOD actions indeed lead the policy toward higher-value regions, we visualize the learned Q-value landscape. Specifically, we randomly sample 10,000 states from the offline dataset. For each evaluation state, we generate an action using the learned policy and categorize it as an ID action, a beneficial OOD action, or a detrimental OOD action based on the diffusion reconstruction error. We then apply t-SNE to embed the corresponding state-action pairs into a two-dimensional space, where each point is colored according to its Q-value estimate. In addition, we plot the Q-value distributions for the three categories to enable a direct statistical comparison.

As illustrated in Figure[12](https://arxiv.org/html/2605.08202#A3.F12), beneficial OOD actions exhibit a clearly right\-shifted Q\-value distribution, indicating that the actions identified as beneficial by our method correspond to regions where the critic consistently predicts higher returns\. In contrast, detrimental OOD actions predominantly occupy the low\-value region, with their Q\-value distribution concentrated near the lower tail\. The critic assigns persistently low values to these actions, implying that they are unlikely to yield performance gains and should therefore be suppressed during policy improvement\.

### C.6 Additional Sensitivity Analysis

#### C.6.1 Dynamics Model Error

To evaluate how the accuracy of the learned dynamics model influences the performance of OOD action classification, we conduct an additional ablation study\. Specifically, we pretrain the dynamics model for 100k gradient steps and save the intermediate checkpoints at 10k and 20k steps\. These early\-stage models exhibit substantially higher prediction error compared to the final checkpoint, providing a controlled mechanism to examine the impact of model inaccuracies\. In the subsequent experiments, we replace the fully trained dynamics model with the selected checkpoint while keeping all other components unchanged\. We evaluate these variants on two halfcheetah datasets, with the corresponding training curves illustrated in Figure[13](https://arxiv.org/html/2605.08202#A3.F13)\.

Our results indicate that employing dynamics models derived from the early checkpoints consistently deteriorates policy performance relative to the fully trained model. Furthermore, on the halfcheetah-medium-replay-v2 dataset, we observe that when the dynamics model is poorly trained, the performance may fall below that of the DOSER w/o AC and VC variant introduced in Section [4.3](https://arxiv.org/html/2605.08202#S4.SS3), which intentionally excludes dynamics modeling. This finding highlights that if the dynamics model fails to produce reliable next-state predictions, the resulting misclassification of OOD actions can be more detrimental to overall performance than omitting OOD action classification entirely.

In addition, the observed performance gap between the early\-stage and fully trained dynamics models provides further evidence regarding the model’s ability to generalize beyond the dataset support\. This indicates that the final checkpoint captures meaningful structural regularities of the environment rather than merely memorizing in\-distribution transitions\. As a result, a well\-trained dynamics model can provide sufficiently reliable predictions for moderate OOD actions, which is crucial for the selective regularization mechanism of DOSER\.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x22.png)

Figure 13: Sensitivity analysis of the dynamics model error.
#### C.6.2 The Number of Critic Networks

In our main experiments, we employ four critic networks for Q-function learning. This design choice follows the implementation of SVR, upon which our training pipeline is partially built. The use of multiple critics has been shown to reduce overestimation bias and stabilize value learning. To examine whether this choice confers any unintended advantage, we perform an additional ablation in which DOSER is trained with only two critic networks while keeping all other components and hyperparameters fixed. We evaluate both settings on the halfcheetah-medium-v2 and halfcheetah-medium-replay-v2 datasets; the corresponding learning curves are presented in Figure [14](https://arxiv.org/html/2605.08202#A3.F14).

Empirically, we observe that using two critics achieves comparable final performance to the four\-critic setting across both tasks, with only a slight difference within an acceptable range\. This indicates that DOSER does not rely on the increased critic ensemble size to obtain its performance gains\. While additional critics can enhance robustness during training, the core algorithmic contributions of DOSER remain effective under the standard two\-critic setup commonly used in offline RL\.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x23.png)

Figure 14: Sensitivity analysis of the number of critic networks.
#### C.6.3 Compensation Target Weight $\eta$

We set the default value of the compensation target weight $\eta$ to 0.9 in all main experiments. This hyperparameter controls the weight of the target Q-value of beneficial OOD actions: a smaller $\eta$ reduces the extent of compensation and makes DOSER more conservative. To evaluate the sensitivity of DOSER to this hyperparameter, we conduct an ablation study by varying $\eta \in \{0.8, 1.0\}$ while keeping all other components unchanged.

As shown in Figure [15](https://arxiv.org/html/2605.08202#A3.F15), DOSER maintains consistently stable performance across this range. However, when setting $\eta = 1.0$, we observe a slight degradation in performance on halfcheetah-medium-replay-v2. This is primarily due to mild value overestimation introduced by fully adopting the target Q-values of beneficial OOD actions without discounting. Overall, the default choice $\eta = 0.9$ effectively mitigates such overestimation while still enabling meaningful policy improvement.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x24.png)

Figure 15: Sensitivity analysis of the compensation target weight $\eta$.
#### C.6.4 The Number of Sampled In-Distribution Actions $N$

In the OOD action classification stage, DOSER estimates the optimal in-distribution action by sampling $N$ candidate actions from the offline dataset and selecting the one with the highest Q-value as the reference. To assess the robustness of DOSER to the choice of $N$, we conduct experiments on the halfcheetah tasks with $N \in \{5, 10, 20\}$.

The results in Figure [16](https://arxiv.org/html/2605.08202#A3.F16) indicate that DOSER maintains strong performance across different values of $N$. A larger $N$ provides a more accurate approximation of the optimal ID action but comes with increased computational cost, whereas a small $N$ may introduce randomness in the estimation. Since the optimal ID Q-value is used only to construct an optimistic Q-target that guides beneficial OOD actions toward higher-value regions, DOSER does not rely heavily on the precise accuracy of this estimate. Therefore, we choose $N = 10$ as a reasonable trade-off between computational efficiency and estimation accuracy in our main experiments.
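
A minimal sketch of this candidate-sampling step is shown below; the way candidate actions are drawn from the dataset (here, uniform sampling over a stored action tensor) is an assumption about an implementation detail not fully specified above.

```python
import torch

@torch.no_grad()
def optimal_id_q(q_net, s, dataset_actions, n: int = 10):
    """Approximate the optimal in-distribution Q-value at a single state s.

    `dataset_actions` holds actions from the offline dataset; `q_net` maps a
    concatenated (state, action) vector to a scalar Q-value.
    """
    idx = torch.randint(0, dataset_actions.shape[0], (n,))
    cands = dataset_actions[idx]                     # N candidate ID actions
    s_rep = s.unsqueeze(0).expand(n, -1)             # repeat the state per candidate
    q_vals = q_net(torch.cat([s_rep, cands], dim=-1)).squeeze(-1)
    return q_vals.max()                              # highest Q among the candidates
```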

![Refer to caption](https://arxiv.org/html/2605.08202v1/x25.png)

Figure 16: Sensitivity analysis of the number of sampled in-distribution actions $N$.
#### C.6.5 Q-value Lower Bound $Q_{\min}$

DOSER employs a lower bound $Q_{\min}$ when penalizing detrimental OOD actions. In our main experiments, this value is not treated as a tunable hyperparameter. Instead, it is derived directly from the environment dynamics as $Q_{\min} = \frac{R_{\min}}{1-\gamma}$, which corresponds to the standard minimum achievable return under the given discount factor $\gamma$. For the halfcheetah environment, setting $\gamma$ to 0.99 yields $Q_{\min} = -366$. To further examine the impact of this parameter, we conduct an ablation study in which $Q_{\min}$ is set to 0 while keeping all other components unchanged. This alternative setting corresponds to a less conservative penalty on potentially detrimental OOD actions. We evaluate this variant on the halfcheetah tasks, with the resulting learning curves reported in Figure [17](https://arxiv.org/html/2605.08202#A3.F17).
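
For instance, with the reported numbers the lower bound works out as follows (the per-step minimum reward $R_{\min}\approx-3.66$ is implied by the stated $\gamma$ and $Q_{\min}$ rather than given explicitly):

$$
Q_{\min} = \frac{R_{\min}}{1-\gamma} = \frac{-3.66}{1-0.99} = -366.
$$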

The results show that replacing the original value ofQminQ\_\{\\min\}with0leads to performance degradation across both datasets\. This outcome can be attributed to the fact that a higher lower bound reduces the penalization applied to detrimental OOD actions, thereby weakening the mechanism designed to mitigate value overestimation\. Nevertheless, even with this suboptimal setting, the overall performance remains competitive with existing offline RL baselines\.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x26.png)

Figure 17: Sensitivity analysis of the value of $Q_{\min}$.

### C.7 Ensemble-guided Gating Mechanism

![Refer to caption](https://arxiv.org/html/2605.08202v1/x27.png)

Figure 18: Comparison of DOSER with and without ensemble-guided gating.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x28.png)

Figure 19: Proportions of different action types with ensemble-guided gating.

We further introduce an ensemble-guided uncertainty gating mechanism on top of the learned dynamics model, which is designed to prevent unreliable next-state predictions from influencing the classification of OOD actions. We construct an ensemble of $K = 5$ independently initialized dynamics models, each trained on the original offline dataset. For any state-action pair $(\boldsymbol{s}, \boldsymbol{a})$, the ensemble produces multiple next-state predictions $\{\hat{\boldsymbol{s}}'_1, \hat{\boldsymbol{s}}'_2, \dots, \hat{\boldsymbol{s}}'_K\}$. We calculate the prediction variance across ensemble members as a measure of epistemic uncertainty:

$$
\mathrm{Var}(\hat{\boldsymbol{s}}')=\frac{1}{K}\sum_{k=1}^{K}\left\|\hat{\boldsymbol{s}}'_{k}-\bar{\boldsymbol{s}}'\right\|^{2},\quad\text{where}\quad\bar{\boldsymbol{s}}'=\frac{1}{K}\sum_{k=1}^{K}\hat{\boldsymbol{s}}'_{k}
\tag{74}
$$
To determine whether a predicted next state is reliable, we estimate the empirical distribution of these variances on the offline dataset and use the 99th percentile as a reliability threshold $\tau_{\mathrm{var}}$. Only when the prediction variance for an OOD action falls below this threshold do we trust the predicted next state and apply the value-based beneficial/detrimental classification; otherwise, the action is conservatively categorized as detrimental.
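
A hedged sketch of this gating rule is given below; the ensemble members and the variance threshold $\tau_{\mathrm{var}}$ are assumed to have been obtained as described above, and the function signature is illustrative.

```python
import torch

@torch.no_grad()
def gated_next_state(models, s, a, tau_var):
    """Ensemble-gated next-state prediction for OOD action classification.

    `models` is a list of K dynamics models mapping concatenated (s, a) to a
    next state; returns the mean prediction and a per-sample reliability flag.
    """
    preds = torch.stack([m(torch.cat([s, a], dim=-1)) for m in models], dim=0)  # (K, B, ds)
    mean = preds.mean(dim=0)
    var = ((preds - mean) ** 2).sum(dim=-1).mean(dim=0)   # Eq. (74), per sample
    reliable = var < tau_var                              # below the 99th-percentile threshold
    # Unreliable predictions => conservatively treat the action as detrimental.
    return mean, reliable
```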

However, Figure [18](https://arxiv.org/html/2605.08202#A3.F18) demonstrates that incorporating this ensemble-guided gating mechanism into DOSER brings no noticeable performance improvement in the halfcheetah environments. We further analyze the action type proportions during training (Figure [19](https://arxiv.org/html/2605.08202#A3.F19)), defining confident actions as those with prediction variance below the reliability threshold, and uncertain actions as those filtered out by the gate. The results indicate that fewer than 5% of actions exceed the threshold and are consequently filtered out.

This result again suggests that the pretrained dynamics model already generalizes reasonably well to the moderately OOD regions\. It is also possible that the chosen ensemble size of 5 is insufficient to fully capture epistemic uncertainty, and larger or more expressive ensembles might provide stronger gating effects\. We leave the exploration of more sophisticated uncertainty quantification methods to future work\.

### C.8 Additional Experimental Results on the D4RL Benchmark

To further validate DOSER’s performance across a wider range of dataset qualities, we conduct additional experiments on the Gym\-MuJoCo expert and random datasets, as shown in Table[7](https://arxiv.org/html/2605.08202#A3.T7)\. Across the expert datasets, DOSER achieves competitive performance relative to prior offline RL methods\. More notably, on the random datasets, where the behavior data is highly suboptimal, DOSER exhibits stronger performance, indicating its robustness under poor\-quality offline data\.

Table 7: Additional performance comparison on Gym-MuJoCo expert and random datasets. We report the mean and standard deviation over 4 seeds for DOSER.

| Dataset | BC | BCQ | BEAR | DT | AWAC | OneStep | TD3+BC | CQL | IQL | DMG | DOSER (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| halfcheetah-e | 92.9 | 89.9 | 92.7 | 87.7 | 81.7 | 88.2 | 96.7 | 96.3 | 95.0 | 95.9 | 95.4±0.6 |
| hopper-e | 110.9 | 109.0 | 54.6 | 94.2 | 109.5 | 106.9 | 107.8 | 96.5 | 109.4 | 111.5 | 111.6±0.5 |
| walker2d-e | 107.7 | 106.3 | 106.6 | 108.3 | 110.1 | 110.7 | 110.2 | 108.5 | 109.9 | 114.7 | 111.2±0.3 |
| halfcheetah-r | 2.6 | 2.2 | 2.3 | 2.2 | 6.1 | 2.3 | 11.0 | 17.5 | 13.1 | 28.8 | 32.8±1.5 |
| hopper-r | 4.1 | 7.8 | 3.9 | 5.4 | 9.2 | 5.6 | 8.5 | 7.9 | 7.9 | 20.4 | 31.2±0.1 |
| walker2d-r | 1.2 | 4.9 | 12.8 | 2.2 | 0.2 | 6.9 | 1.6 | 5.1 | 5.4 | 4.8 | 3.5±2.3 |
| Average | 53.2 | 53.4 | 45.5 | 50.0 | 52.8 | 53.4 | 56.0 | 55.3 | 56.8 | 62.7 | 64.9 |
### C.9 Learning Curves

Learning curves on D4RL tasks are provided in Figure[20](https://arxiv.org/html/2605.08202#A3.F20), Figure[21](https://arxiv.org/html/2605.08202#A3.F21), and Figure[22](https://arxiv.org/html/2605.08202#A3.F22)\. The curves are averaged over 4 random seeds, with the shaded area representing the standard deviation across seeds\.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x29.png)

Figure 20: Learning curves of the component ablation study on Gym-MuJoCo tasks.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x30.png)

Figure 21: Learning curves on Adroit tasks.

![Refer to caption](https://arxiv.org/html/2605.08202v1/x31.png)

Figure 22: Learning curves on Gym-MuJoCo expert and random tasks.

## Appendix D The Use of Large Language Models (LLMs)

We acknowledge the assistance of GPT\-5 in proofreading and polishing the manuscript\. The authors bear full responsibility for the content and presentation of this paper\.
