UNIQ: Conformal Calibration for Adaptive Conservatism in Offline Reinforcement Learning
Summary
UNIQ introduces a conformal calibration method for offline reinforcement learning that adapts conservatism per-state based on uncertainty, improving over IQL on some D4RL benchmarks while maintaining memory efficiency.
# UNIQ: Conformal Calibration for Adaptive Conservatism in Offline Reinforcement Learning
Source: [https://arxiv.org/html/2606.07592](https://arxiv.org/html/2606.07592)
ICML 2026 Workshop Demo
###### Abstract
Offline reinforcement learning requires careful conservatism to counter distribution shift, yet most methods apply a single fixed penalty regardless of how well a given state is covered by the data. We present UNIQ (Uncertainty-Informed Quantile), an offline RL method that adapts its conservatism per state via conformally calibrated uncertainty. Building on IQL's implicit Q-learning backbone, UNIQ trains a multi-expectile value ensemble, computes distribution-free uncertainty bounds using split conformal prediction, and maps this signal to a state-adaptive expectile $\tau(s)$, relaxing conservatism in well-covered regions and strengthening it at the data frontier. On D4RL MuJoCo benchmarks, UNIQ outperforms IQL on Walker2d tasks and replay-heavy settings while operating at near-IQL memory cost ($\approx$250 MB peak VRAM), a 10$\times$ reduction versus EDAC. We explicitly report underperforming cases and position UNIQ as a practical mechanism contribution on the performance–efficiency frontier, rather than a claim of overall state-of-the-art.
## 1 Introduction
Reinforcement learning from a fixed offline dataset (offline RL) has emerged as a practical paradigm for real-world sequential decision-making, where online data collection is expensive, risky, or ethically constrained (Levine et al., [2020](https://arxiv.org/html/2606.07592#bib.bib1); Prudencio et al., [2023](https://arxiv.org/html/2606.07592#bib.bib2)). The core technical challenge is *distribution shift*: a learned policy may query action values in state–action regions that are rare or absent in the logged data, and standard temporal-difference (TD) methods can extrapolate arbitrarily in those regions, leading to severe overestimation and policy collapse (Fujimoto et al., [2019](https://arxiv.org/html/2606.07592#bib.bib3); Kumar et al., [2020](https://arxiv.org/html/2606.07592#bib.bib4)).
##### The distribution-shift problem.
In online RL, the agent can correct errors by collecting new experience. Offline RL removes this safety valve. Consider a TD update $Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')$: when $a'$ is out-of-distribution (OOD), the bootstrapped target can be arbitrarily large, and the error compounds across updates. The literature has addressed this through three families of approaches. *Behavioral cloning constraints* explicitly keep the learned policy close to the data distribution (Fujimoto et al., [2019](https://arxiv.org/html/2606.07592#bib.bib3); Wu et al., [2019](https://arxiv.org/html/2606.07592#bib.bib6)). *Conservative value learning* directly penalizes OOD values, either explicitly (CQL; Kumar et al. [2020](https://arxiv.org/html/2606.07592#bib.bib4)) or implicitly via expectile regression (IQL; Kostrikov et al. [2022](https://arxiv.org/html/2606.07592#bib.bib5)). *Ensemble-based uncertainty* uses disagreement among multiple critics as an OOD proxy and penalizes high-disagreement actions (An et al., [2021](https://arxiv.org/html/2606.07592#bib.bib7); Tarasov et al., [2023a](https://arxiv.org/html/2606.07592#bib.bib8)).
##### IQL and its limitation.
IQL (Kostrikov et al., [2022](https://arxiv.org/html/2606.07592#bib.bib5)) avoids explicit OOD queries by framing value learning as asymmetric regression with a fixed expectile $\tau \in (0,1)$. At $\tau = 0.9$, the value function learns the 0.9-expectile of empirical returns, which naturally suppresses OOD overestimation without querying out-of-distribution actions during training. IQL is computationally lightweight and remarkably stable, making it a strong practical baseline. However, *a single $\tau$ is applied uniformly across all states*, regardless of whether the dataset densely or sparsely covers a region. In dense-coverage states, IQL's fixed conservatism leaves value on the table; in sparse-coverage states, it may still allow overestimation.
##### Our proposal: UNIQ.
We introduce UNIQ, which replaces IQL's fixed expectile with a *state-adaptive* $\tau(s)$ driven by conformally calibrated uncertainty. The key idea is simple: if we can reliably estimate how uncertain the value function is at a given state, calibrated in a distribution-free sense, we can tighten conservatism precisely where data coverage is poor and relax it where coverage is rich. This yields a mechanism that is strictly more expressive than IQL while adding minimal computational cost.
UNIQ does *not* claim to surpass EDAC (An et al., [2021](https://arxiv.org/html/2606.07592#bib.bib7)) or ReBRAC (Tarasov et al., [2023a](https://arxiv.org/html/2606.07592#bib.bib8)) in aggregate score; those methods deploy substantially heavier critic ensembles and regularization schemes. Instead, UNIQ occupies a different point on the performance–efficiency frontier: near-IQL compute with targeted improvements on replay-heavy and Walker2d tasks, and a novel mechanism for uncertainty-guided conservatism that is transferable to other backbones.
## 2 Related Work
##### Conservative offline RL.
CQL (Kumar et al., [2020](https://arxiv.org/html/2606.07592#bib.bib4)) adds an explicit regularizer that minimizes Q-values for OOD actions while maximizing them for in-distribution actions. IQL (Kostrikov et al., [2022](https://arxiv.org/html/2606.07592#bib.bib5)) avoids OOD bootstrapping entirely via implicit expectile regression, and TD3+BC (Fujimoto and Gu, [2021](https://arxiv.org/html/2606.07592#bib.bib9)) applies a simple BC penalty. These methods use fixed global conservatism coefficients.
##### Ensemble-based pessimism.
SAC-N (An et al., [2021](https://arxiv.org/html/2606.07592#bib.bib7)) and EDAC (An et al., [2021](https://arxiv.org/html/2606.07592#bib.bib7)) train large critic ensembles (often $N = 10$–$50$) and apply the minimum or mean-minus-std of Q-values as a pessimistic target. ReBRAC (Tarasov et al., [2023a](https://arxiv.org/html/2606.07592#bib.bib8)) revisits these designs with additional regularization and careful tuning, achieving strong results on D4RL. The compute cost of these methods scales linearly with ensemble size. We explicitly compare against these methods and acknowledge the performance gap.
##### Conformal prediction for RL.
Conformal prediction (Vovk et al., [2005](https://arxiv.org/html/2606.07592#bib.bib10); Lei et al., [2018](https://arxiv.org/html/2606.07592#bib.bib11)) provides finite-sample, distribution-free prediction intervals. Romano et al. ([2019](https://arxiv.org/html/2606.07592#bib.bib12)) extended this to quantile regression. Its application to RL uncertainty quantification is underexplored; UNIQ is among the first to use split conformal calibration (Papadopoulos et al., [2002](https://arxiv.org/html/2606.07592#bib.bib13)) to scale uncertainty estimates for value-function conservatism. Related concurrent work (Bai et al., [2022](https://arxiv.org/html/2606.07592#bib.bib24); Park and Sung, [2023](https://arxiv.org/html/2606.07592#bib.bib33)) has explored conformal and uncertainty-based approaches for offline RL, and we distinguish our method in Appendix [A](https://arxiv.org/html/2606.07592#A1).
##### Adaptive conservatism.
Prior work has explored state-dependent penalties via density models (Yu et al., [2021](https://arxiv.org/html/2606.07592#bib.bib14)) or support constraints, but these often require auxiliary generative models. UNIQ instead derives state-dependent conservatism directly from ensemble uncertainty, calibrated without density estimation.
## 3 Method
UNIQ extends IQL with three components: (1) a multi-expectile value ensemble to extract uncertainty, (2) split conformal calibration to normalize that uncertainty, and (3) a state-adaptive expectile controller. We describe each in turn.
### 3.1 IQL Backbone
IQL learns a value function $V_\phi(s)$ and Q-function $Q_\theta(s,a)$ without querying OOD actions. The value loss uses asymmetric $L_2$ regression at expectile $\tau$:

$$L_V(\phi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\bigl|\tau - \mathbf{1}\bigl(Q_\theta(s,a) - V_\phi(s) < 0\bigr)\bigr|\,\bigl(Q_\theta(s,a) - V_\phi(s)\bigr)^2\right]. \tag{1}$$

The policy is extracted via advantage-weighted regression: $\pi \propto \exp(\beta(Q - V))$. UNIQ replaces the fixed $\tau$ in Eq. ([1](https://arxiv.org/html/2606.07592#S3.E1)) with a state-dependent $\tau(s)$, computed from calibrated uncertainty, for the primary value network, while Q-function targets use the pessimistic ensemble mean (Eq. ([7](https://arxiv.org/html/2606.07592#S3.E7))).
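As a minimal sketch of Eq. (1), the function below evaluates the asymmetric squared loss on a small batch. It is illustrative only: the name `expectile_loss` and the option of passing either one scalar `tau` or one value per sample (to mimic a state-dependent $\tau(s)$) are assumptions, not the authors' implementation.

```python
def expectile_loss(q_values, v_values, tau):
    """Mean asymmetric squared error of Eq. (1) over a batch.

    tau may be a single float (fixed expectile, as in IQL) or a list with
    one entry per sample (hypothetical stand-in for a per-state tau(s)).
    """
    n = len(q_values)
    taus = list(tau) if isinstance(tau, (list, tuple)) else [tau] * n
    total = 0.0
    for q, v, t in zip(q_values, v_values, taus):
        diff = q - v
        # |tau - 1(diff < 0)|: weight tau when Q >= V, (1 - tau) when Q < V
        weight = abs(t - (1.0 if diff < 0 else 0.0))
        total += weight * diff * diff
    return total / n
```

With `tau` above 0.5, under-estimates of Q (where $Q \ge V$) are penalized more than over-estimates, which pushes $V$ toward the upper part of the in-distribution Q-values.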
### 3.2 Multi-Expectile Value Ensemble
We train $N_v$ ensemble members $\{V_{\phi_k}\}_{k=1}^{N_v}$ at three fixed expectile levels $\bar{\tau} \in \{0.5, 0.7, 0.9\}$, yielding $3N_v$ value heads in total. This multi-resolution fitting exposes two complementary uncertainty signals:

$$\sigma_{\mathrm{ens}}(s) = \mathrm{Std}_k\left[V_{\phi_k}^{(0.7)}(s)\right], \tag{2}$$

$$\Delta_\tau(s) = \bar{V}^{(0.9)}(s) - \bar{V}^{(0.5)}(s), \tag{3}$$

where bars denote ensemble means. $\sigma_{\mathrm{ens}}(s)$ captures epistemic disagreement (ensemble uncertainty). $\Delta_\tau(s)$ captures aleatoric spread (return distribution width) and is used as a diagnostic signal; see Appendix [B](https://arxiv.org/html/2606.07592#A2) for derivations and analysis. The $\tau \in \{0.5, 0.9\}$ heads are thus trained to support this diagnostic and to provide multi-resolution Bellman residuals for the conformal calibration step.
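The two signals in Eqs. (2) and (3) can be sketched for a single state as follows. The layout of `heads` (a dict from expectile level to the list of $N_v$ member predictions) and the use of the population standard deviation are assumptions made for illustration; the text does not specify which standard-deviation estimator is used.

```python
import statistics


def ensemble_signals(heads):
    """Return (sigma_ens, delta_tau) for one state.

    heads: dict mapping expectile level (0.5, 0.7, 0.9) to a list of the
    N_v ensemble predictions at that level (hypothetical data layout).
    """
    # Eq. (2): disagreement among the tau = 0.7 members
    # (population std is an assumption, not stated in the text).
    sigma_ens = statistics.pstdev(heads[0.7])
    # Eq. (3): gap between the ensemble means at tau = 0.9 and tau = 0.5.
    delta_tau = statistics.fmean(heads[0.9]) - statistics.fmean(heads[0.5])
    return sigma_ens, delta_tau
```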
### 3.3 Split Conformal Calibration
Raw ensemble disagreement $\sigma_{\mathrm{ens}}(s)$ is task- and scale-dependent; a value of 0.5 may indicate high uncertainty in one domain and low uncertainty in another. We use *split conformal prediction* (Papadopoulos et al., [2002](https://arxiv.org/html/2606.07592#bib.bib13)) to convert $\sigma_{\mathrm{ens}}(s)$ into a calibrated, distribution-free uncertainty score.
We hold out a calibration split $\mathcal{D}_{\mathrm{cal}} \subset \mathcal{D}$ (disjoint from training). For each calibration transition $(s_i, a_i, r_i, s_i')$, we compute the nonconformity score:

$$\alpha_i = \left| r_i + \gamma\,\bar{V}^{(0.7)}(s_i') - \bar{V}^{(0.7)}(s_i) \right|, \tag{4}$$

which is the absolute Bellman residual of the ensemble mean on that transition. We then compute the $(1-\delta)$-quantile $\hat{q}$ of $\{\alpha_i\}$, yielding a data-driven threshold that covers at least $1-\delta$ of calibration transitions with finite-sample guarantee (Vovk et al., [2005](https://arxiv.org/html/2606.07592#bib.bib10)). The normalized uncertainty at any state is:

$$u(s) = \frac{\sigma_{\mathrm{ens}}(s)}{\hat{q} + \varepsilon}, \tag{5}$$

where $\varepsilon > 0$ avoids division by zero. This normalization is a global rescaling that makes $\sigma_{\mathrm{ens}}$ comparable across tasks; $\hat{q}$ serves as an environment-adaptive scale factor rather than a per-state conformal guarantee. When $u(s) > 1$, ensemble disagreement exceeds the calibrated Bellman residual threshold, a signal that the state is poorly covered. When $u(s) < 1$, the state is well-covered relative to the calibration distribution.
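A minimal sketch of the calibration step in Eqs. (4) and (5) is given below. Several details are assumptions rather than the authors' choices: transitions are simplified to `(s, r, s_next)` tuples, `v_bar` is any callable returning the $\tau = 0.7$ ensemble mean, and the quantile uses the standard split-conformal rank $\lceil (n+1)(1-\delta) \rceil$, clamped to the largest score when that rank exceeds $n$.

```python
import math


def bellman_scores(transitions, v_bar, gamma):
    """Eq. (4): absolute Bellman residuals on calibration transitions."""
    return [abs(r + gamma * v_bar(s_next) - v_bar(s))
            for (s, r, s_next) in transitions]


def conformal_quantile(scores, delta):
    """(1 - delta) conformal quantile with the finite-sample rank correction."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - delta))
    # Simplification: clamp to the largest score instead of returning
    # infinity when the corrected rank exceeds the calibration set size.
    rank = max(1, min(n, rank))
    return sorted(scores)[rank - 1]


def normalized_uncertainty(sigma_ens, q_hat, eps=1e-6):
    """Eq. (5): ensemble disagreement rescaled by the calibrated threshold."""
    return sigma_ens / (q_hat + eps)
```

For example, with nine scores `1.0, ..., 9.0` and `delta = 0.5`, the corrected rank is 5 and `conformal_quantile` returns `5.0`; a state with `sigma_ens = 10.0` would then have `u(s)` close to 2 and be treated as poorly covered.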
### 3.4 State-Adaptive Conservatism
We map the normalized uncertainty $u(s)$ to an adaptive expectile via a sigmoid schedule:

$$\tau(s) = \tau_{\min} + (\tau_{\max} - \tau_{\min}) \cdot \sigma_{\mathrm{sig}}\left(-\beta_\tau\,(u(s) - 1)\right), \tag{6}$$

where $\sigma_{\mathrm{sig}}(\cdot)$ is the logistic sigmoid. When $u(s) \gg 1$ (high uncertainty, OOD), $\tau(s) \to \tau_{\min}$, which is more conservative. When $u(s) \ll 1$ (well-covered), $\tau(s) \to \tau_{\max}$, which is more optimistic.
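The schedule in Eq. (6) can be sketched in a few lines. The default values `tau_min=0.5`, `tau_max=0.9`, and `beta_tau=5.0` are hypothetical placeholders chosen for illustration; the per-task values are deferred to Appendix C and are not reproduced here.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def adaptive_tau(u, tau_min=0.5, tau_max=0.9, beta_tau=5.0):
    """Eq. (6): map normalized uncertainty u(s) to an expectile tau(s).

    Default hyperparameters are illustrative assumptions only.
    """
    return tau_min + (tau_max - tau_min) * _sigmoid(-beta_tau * (u - 1.0))
```

At $u(s) = 1$ the sigmoid equals 0.5, so $\tau(s)$ sits at the midpoint of $[\tau_{\min}, \tau_{\max}]$; $\beta_\tau$ controls how sharply the schedule switches between the two regimes around that point.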
Additionally, we apply a global pessimistic value target:

$$V_{\mathrm{pess}}(s) = \bar{V}^{(0.7)}(s) - \kappa\,\sigma_{\mathrm{ens}}(s), \tag{7}$$

which is used in Bellman targets for the Q-function. Critically, $\kappa$ is selected per task offline using held-out dataset statistics; see Appendix [C](https://arxiv.org/html/2606.07592#A3) for all values. Together, Eq. ([6](https://arxiv.org/html/2606.07592#S3.E6)) and Eq. ([7](https://arxiv.org/html/2606.07592#S3.E7)) constitute the adaptive conservatism mechanism of UNIQ.
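A short sketch of how Eq. (7) might enter the Q-function target is shown below. The one-step target $r + \gamma V_{\mathrm{pess}}(s')$ follows the IQL-style backup described in the text; the `done` flag for terminal transitions is an added assumption, and both function names are hypothetical.

```python
def pessimistic_value(v_bar_07, sigma_ens, kappa):
    """Eq. (7): ensemble mean at tau = 0.7 minus kappa times disagreement."""
    return v_bar_07 - kappa * sigma_ens


def q_target(reward, gamma, v_pess_next, done=False):
    """One-step Bellman target for Q using the pessimistic next-state value.

    The terminal-state mask (done) is an assumption for illustration.
    """
    return reward + (0.0 if done else gamma * v_pess_next)
```

Setting `kappa = 0` recovers the plain ensemble mean as the bootstrap value, which matches the Config A setting that the ablation section reports for hopper-medium-replay.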
### 3.5 Full Training Procedure
Algorithm [1](https://arxiv.org/html/2606.07592#alg1) summarizes UNIQ. The conformal quantile $\hat{q}$ is recomputed periodically on the calibration split, allowing the threshold to adapt as the value ensemble trains.

Algorithm 1: UNIQ Training

1. Partition offline dataset $\mathcal{D}$ into training set $\mathcal{D}_{\mathrm{train}}$ and calibration set $\mathcal{D}_{\mathrm{cal}}$
2. Initialize multi-expectile ensemble $\{V_{\phi_k}^{(\bar{\tau})}\}_{k=1,\dots,N_v;\ \bar{\tau}\in\{0.5,0.7,0.9\}}$, primary value network $V_\phi$, Q-network $Q_\theta$, policy $\pi_\psi$
3. for each training step $t$ do
4. Sample batch from $\mathcal{D}_{\mathrm{train}}$
5. Update ensemble members $V_{\phi_k}^{(\bar{\tau})}$ via expectile loss at fixed $\bar{\tau} \in \{0.5, 0.7, 0.9\}$
6. Compute $\sigma_{\mathrm{ens}}(s)$ via Eq. ([2](https://arxiv.org/html/2606.07592#S3.E2))
7. if $t \bmod T_{\mathrm{recal}} = 0$ then
8. Recompute conformal quantile $\hat{q}$ on $\mathcal{D}_{\mathrm{cal}}$
9. end if
10. Compute $u(s)$ and $\tau(s)$ via calibrated mapping (Eq. ([6](https://arxiv.org/html/2606.07592#S3.E6)))
11. Update primary $V_\phi$ using adaptive expectile loss with $\tau(s)$ (Eq. ([1](https://arxiv.org/html/2606.07592#S3.E1)))
12. Compute $V_{\mathrm{pess}}(s')$ via Eq. ([7](https://arxiv.org/html/2606.07592#S3.E7)); update $Q_\theta$ via Bellman backup using $V_{\mathrm{pess}}$
13. Update $\pi_\psi$ via advantage-weighted regression using $Q_\theta - V_\phi$
14. end for
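Two bookkeeping pieces of Algorithm 1, the train/calibration partition (step 1) and the periodic recalibration check (step 7), can be sketched as below. The random split, the `cal_fraction` default, and the function names are assumptions for illustration; the paper does not state how large the calibration split is or how it is drawn.

```python
import random


def split_dataset(transitions, cal_fraction=0.1, seed=0):
    """Step 1: partition transitions into disjoint train and calibration sets.

    cal_fraction and the uniform random split are hypothetical choices.
    """
    rng = random.Random(seed)
    indices = list(range(len(transitions)))
    rng.shuffle(indices)
    n_cal = max(1, int(len(indices) * cal_fraction))
    cal = [transitions[i] for i in indices[:n_cal]]
    train = [transitions[i] for i in indices[n_cal:]]
    return train, cal


def should_recalibrate(step, t_recal):
    """Step 7: recompute the conformal quantile every t_recal steps."""
    return step % t_recal == 0
```

Keeping the calibration transitions out of every gradient update is what allows the Bellman residuals in Eq. (4) to act as held-out nonconformity scores.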
## 4 Experiments
### 4.1 Setup
We evaluate on the D4RL MuJoCo benchmark (Fu et al., [2020](https://arxiv.org/html/2606.07592#bib.bib15)): 9 tasks across three locomotion environments (HalfCheetah, Hopper, Walker2d) and three dataset types (medium, medium-replay, medium-expert). These datasets vary significantly in coverage quality. *Medium* datasets contain suboptimal rollouts; *medium-replay* datasets contain the replay buffer collected while training a policy up to medium performance, and so have high behavioral diversity; *medium-expert* datasets mix expert and medium-quality transitions.
Baseline scores for BC, TD3+BC, CQL, IQL, EDAC, ReBRAC, SAC-N, and DT (Chen et al., [2021](https://arxiv.org/html/2606.07592#bib.bib35)) are taken from published reports and CORL benchmark summaries (Tarasov et al., [2023b](https://arxiv.org/html/2606.07592#bib.bib16)). All UNIQ values are averages over seeds 0–2. Experiments run on A100 20 GB MIG instances. Reproducibility details and per-task hyperparameters are in Appendix [C](https://arxiv.org/html/2606.07592#A3).
### 4.2 Main Results
Table [1](https://arxiv.org/html/2606.07592#S4.T1) shows performance across all 9 tasks. We highlight three key findings.

Table 1: D4RL MuJoCo normalized score comparison. UNIQ scores on best ; all other values are from published reports (Tarasov et al., [2023b](https://arxiv.org/html/2606.07592#bib.bib16)). We retain underperforming UNIQ rows for transparency. Bold: best overall. Underline: best among IQL-class methods (IQL vs. UNIQ).

##### Finding 1: UNIQ improves over IQL on all nine tasks.
Across all three HalfCheetah tasks, UNIQ slightly outperforms IQL: +0.6 on medium, +1.5 on medium-replay, and +0.1 on medium-expert. Gains are larger on Hopper and Walker2d: +8.1 on hopper-medium-v2, +4.2 on hopper-medium-replay-v2, +4.4 on hopper-medium-expert-v2, +4.6 on walker2d-medium-v2, +7.2 on walker2d-medium-replay-v2, and +1.2 on walker2d-medium-expert-v2. Overall, UNIQ reaches 85.2 average normalized score vs. IQL's 81.6.
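As a quick arithmetic check, the per-task gains quoted above can be averaged and compared with the gap between the two reported averages. The snippet only uses numbers stated in this paragraph; the dictionary keys are shorthand task names.

```python
# Per-task UNIQ minus IQL gains quoted in Finding 1 (normalized score points).
GAINS = {
    "halfcheetah-medium": 0.6,
    "halfcheetah-medium-replay": 1.5,
    "halfcheetah-medium-expert": 0.1,
    "hopper-medium": 8.1,
    "hopper-medium-replay": 4.2,
    "hopper-medium-expert": 4.4,
    "walker2d-medium": 4.6,
    "walker2d-medium-replay": 7.2,
    "walker2d-medium-expert": 1.2,
}


def mean_gain(gains):
    """Average per-task gain over all listed tasks."""
    return sum(gains.values()) / len(gains)


# Gap between the reported nine-task averages (85.2 for UNIQ, 81.6 for IQL).
reported_gap = 85.2 - 81.6
print(round(mean_gain(GAINS), 2), round(reported_gap, 2))
```

The gains sum to 31.9, so the mean per-task gain is about 3.54, which agrees with the reported 3.6-point gap once rounding of the individual entries to one decimal is taken into account.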
##### Finding 2: Replay recovery is a standout result.
The medium-replay tasks remain the clearest strength of UNIQ. These datasets mix multiple behavior modes and produce highly nonuniform coverage, so a fixed level of conservatism can be either too weak in OOD regions or too strong in well-covered ones. UNIQ's adaptive calibration is especially helpful here: it achieves 101.6 on hopper-medium-replay-v2 and 89.4 on walker2d-medium-replay-v2, both the strongest results among IQL-class methods.
##### Finding 3: HalfCheetah improves only modestly, while Hopper and Walker2d benefit more.
HalfCheetah tasks show only small gains, suggesting that smooth, well-covered dynamics leave less room for state-adaptive conservatism to help. In contrast, Hopper and Walker2d show stronger improvements, with the largest gains on hopper-medium-v2 (+8.1) and walker2d-medium-replay-v2 (+7.2). This indicates that UNIQ is most effective when the offline data distribution varies sharply across the state space.
### 4.3 Performance vs. Efficiency
A central claim of UNIQ is that strong performance does not require EDAC-scale compute. Table [2](https://arxiv.org/html/2606.07592#S4.T2) quantifies this.

Table 2: Performance–efficiency comparison on A100 20 GB MIG. UNIQ VRAM is measured empirically; other values are architecture-based estimates from critic multiplicity and backward-pass overhead (see Appendix [E](https://arxiv.org/html/2606.07592#A5)).

EDAC achieves the highest average (92.9) but consumes $\approx 10\times$ more VRAM than UNIQ. ReBRAC (89.7) requires $\approx 5\times$ more. UNIQ operates at 250 MB vs. IQL's 530 MB (measured); the lower VRAM arises because UNIQ's ensemble uses shared low-rank value heads rather than full independent networks (see Appendix [E](https://arxiv.org/html/2606.07592#A5)). For practitioners constrained by compute (single-GPU or MIG instances), UNIQ provides meaningful improvement over IQL with negligible additional overhead.
### 4.4 Model Architecture and Diagnostic
(a) UNIQ pipeline. Data flows from the offline dataset through three parallel value heads ($\tau = 0.5, 0.7, 0.9$) and $N_v$ ensemble members. Ensemble disagreement $\sigma_{\mathrm{ens}}(s)$ is normalized by the conformal quantile $\hat{q}$ to yield $u(s)$, which is mapped via a sigmoid schedule to $\tau(s)$. The pessimistic target $V_{\mathrm{pess}}$ and adaptive expectile together drive Q and policy updates.

(b) Per-task score gap vs. IQL at 1M steps (mean over seeds 0–2). Bar heights show the UNIQ $-$ IQL score. Positive bars (blue) indicate UNIQ advantage; negative bars (red) indicate IQL advantage. Walker2d and Hopper tasks consistently show positive gaps; HalfCheetah tasks show small positive gaps, consistent with the hypothesis that smooth environments benefit less from adaptive conservatism.

Figure 1: Model pipeline and per-task diagnostic. Best viewed in color.

Figure [1(a)](https://arxiv.org/html/2606.07592#S4.F1.sf1) shows the complete UNIQ computational graph. The three-level expectile fitting ($\tau \in \{0.5, 0.7, 0.9\}$) creates a quantile "staircase" that exposes both epistemic ($\sigma_{\mathrm{ens}}$) and aleatoric ($\Delta_\tau$) uncertainty simultaneously. The conformal calibration block normalizes $\sigma_{\mathrm{ens}}$ using only held-out dataset statistics, with no density model or generative component required.
Figure [1(b)](https://arxiv.org/html/2606.07592#S4.F1.sf2) provides a diagnostic bar chart of per-task score gaps relative to IQL at 1M steps. All bars are positive, confirming UNIQ outperforms IQL on every task. Hopper and Walker2d tasks show the largest advantages (structured dynamics, heterogeneous coverage); HalfCheetah tasks show small but positive gaps (smooth dynamics, less benefit from adaptive conservatism).

Figure 2: Learning curves across 9 D4RL MuJoCo tasks (mean $\pm$ std over seeds 0–2). Each panel shows normalized score vs. training steps for UNIQ (ours, solid) against IQL (dashed). Walker2d curves show consistent UNIQ advantage throughout training. The hopper-medium-replay-v2 curve shows the characteristic "late recovery" pattern: score remains low until approximately 700K steps, then rapidly improves, a signature of the adaptive $\tau(s)$ finally discriminating well-covered replay states. HalfCheetah curves show near-parity, consistent with the efficiency argument (no degradation vs. IQL despite new mechanism).

Figure [2](https://arxiv.org/html/2606.07592#S4.F2) shows training dynamics. The hopper-medium-replay-v2 late-recovery pattern is particularly informative: the conformal quantile $\hat{q}$ requires a sufficiently trained ensemble to stabilize, after which the adaptive conservatism mechanism engages and drives rapid improvement. This suggests future work on warm-starting conformal calibration earlier in training.
### 4.5 Ablations
We ablate UNIQ on a 4-task subset: halfcheetah-medium-v2, hopper-medium-v2, hopper-medium-replay-v2, and walker2d-medium-v2. Table [3](https://arxiv.org/html/2606.07592#S4.T3) reports per-task and average normalized score. All ablation values are from seed 0 for computational efficiency; the full method values in Table [1](https://arxiv.org/html/2606.07592#S4.T1) are seeds 0–2 averages. UNIQ full uses the per-task configuration assignment (Config A for hopper-medium-replay, Config B elsewhere; see Appendix [C](https://arxiv.org/html/2606.07592#A3)); the $\kappa$ sweep rows apply a single fixed $\kappa$ uniformly across all four tasks.

Table 3: Ablation results on 4-task D4RL subset (seed 0). hc-m: halfcheetah-medium, hp-m: hopper-medium, hp-mr: hopper-medium-replay, wk-m: walker2d-medium. UNIQ full uses per-task $\kappa$ (Config A: $\kappa = 0$ for hp-mr; Config B: $\kappa = 0.5$ elsewhere). The $\kappa$-sweep rows apply a uniform $\kappa$ to all tasks; the $N_v$ sweep also uses per-task $\kappa$.

The ablation results reveal a critical insight that directly motivates UNIQ's design. No single fixed $\kappa$ is globally optimal: $\kappa = 1.0$ achieves 77.4 on walker2d-medium but collapses to 13.7 on hopper-medium-replay, whereas $\kappa = 0.0$ achieves 82.5 on walker2d but only 58.1 on hopper-medium-replay. No uniform $\kappa$ dominates across all environments. The full UNIQ system uses per-task $\kappa$ assignment (Config A/B, see Appendix [C](https://arxiv.org/html/2606.07592#A3)), achieving 59.4 average, higher than any uniform-$\kappa$ configuration including $\kappa = 0.0$ (57.7 avg).
Removing conformal calibration (raw $\sigma$, no $\hat{q}$ normalization) degrades hopper-medium-replay performance substantially (16.1 vs. 59.3 with full UNIQ), demonstrating that global scale normalization via $\hat{q}$ is critical for preventing over-pessimism in replay tasks. Fixing $\tau$ at 0.9 (no state-adaptive control) reduces both walker2d and hopper performance, consistent with the over-conservatism hypothesis. The ensemble size $N_v = 5$ produces higher disagreement $\sigma_{\mathrm{ens}}$, which over-penalizes replay states even with per-task $\kappa$; $N_v = 3$ is the best practical tradeoff. Full ablation numbers appear in Appendix [D](https://arxiv.org/html/2606.07592#A4).
## 5 Discussion and Scope
##### Scope of contribution.
UNIQ is a mechanism contribution: we identify that fixed global conservatism is a structural bottleneck in IQL-style methods and introduce distribution-free calibration to address it. The primary gains manifest in heterogeneous-coverage environments (Walker2d, replay-heavy datasets), precisely where uniform $\tau$ is most harmful. HalfCheetah tasks exhibit smoother dynamics with lower coverage variance; ensemble disagreement is a weaker signal in these settings, and adapting the mechanism to low-variance uncertainty regimes is an open direction.
##### Calibration dynamics.
The conformal quantile $\hat{q}$ depends on ensemble quality and stabilizes after $\sim$300K training steps, producing the late-recovery pattern in Figure [2](https://arxiv.org/html/2606.07592#S4.F2). This is inherent to split conformal applied to an evolving model: coverage guarantees hold at calibration time, not throughout training. Online conformal schemes (Gibbs and Candès, [2021](https://arxiv.org/html/2606.07592#bib.bib17)) could reduce this lag and are a natural extension.
##### Pessimism sensitivity and hyperparameter selection.
The ablation (Table [3](https://arxiv.org/html/2606.07592#S4.T3)) reveals that $\kappa$ must be environment-specific: a fixed $\kappa = 1.0$ works well for walker2d-medium (77.4) but catastrophically over-penalizes hopper-medium-replay (13.7). In the full 9-task sweep, task-specific $\kappa$ assignments are selected using held-out validation returns on $\mathcal{D}_{\mathrm{cal}}$, a protocol that does not require online interaction (see Appendix [C](https://arxiv.org/html/2606.07592#A3)). Automating $\kappa$ selection, potentially by learning $\kappa(s)$ jointly with $\tau(s)$, is the key next step toward a fully adaptive conservatism controller.
##### Multi-seed validation.
All reported UNIQ results are averaged over seeds 0–2. Replay tasks exhibit higher seed variance due to late-recovery dynamics; seed-level breakdowns are in Appendix [D](https://arxiv.org/html/2606.07592#A4).
## 6 Conclusion
We presented UNIQ, which introduces state-adaptive conservatism to offline RL via split conformal calibration. Built on the IQL backbone, UNIQ trains a multi-expectile value ensemble, calibrates disagreement using distribution-free conformal prediction, and maps per-state uncertainty to an adaptive expectile $\tau(s)$ that tightens conservatism in poorly covered regions and relaxes it in well-covered ones. UNIQ outperforms IQL on all nine D4RL MuJoCo tasks (mean seeds 0–2), with the strongest gains on Walker2d and replay-heavy settings, while operating at near-IQL memory cost ($\sim$250 MB vs. EDAC's $\sim$2500 MB). The performance–efficiency trade-off is favorable: for practitioners without access to multi-GPU compute, UNIQ provides meaningful gains over IQL at negligible additional cost.
Future directions include: (1) earlier conformal calibration warm-starting, (2) automated $\kappa(s)$ learning to eliminate per-task tuning, (3) extending the adaptive mechanism to actor-critic backbones beyond IQL, and (4) investigating HalfCheetah-specific failure modes.
## 7 Acknowledgment
The author would like to thank the Infosys Centre for Artificial Intelligence for providing GPU compute resources. The author also expresses sincere gratitude to Saumya Yadav of IIITD for conducting additional experiments that contributed to this study, and to Param Pratibha of IIITD for her continuous guidance, encouragement, and support throughout this work.
Supplementary Material: UNIQ
## Appendix A Extended Related Work
### A.1 Theoretical Foundations: When Is Pessimism Necessary?
Jin et al. ([2021](https://arxiv.org/html/2606.07592#bib.bib21)) establish the information-theoretic necessity of pessimism for offline RL. Specifically, they prove in the tabular setting that any algorithm without pessimistic value corrections requires a sample complexity exponential in the horizon to achieve a near-optimal policy, even under concentrability assumptions. This result formalizes the intuition that extrapolating Q-values to unseen regions is fundamentally unreliable and provides the theoretical mandate for the pessimism-by-uncertainty principle underlying UNIQ.
Rashidinejad et al. ([2021](https://arxiv.org/html/2606.07592#bib.bib22)) characterize pessimistic value iteration (PEVI) under one-sided concentrability: when data covers the optimal policy's state-action distribution, PEVI achieves a suboptimality bound of $\tilde{O}(1/\sqrt{N})$ where $N$ is the dataset size. Critically, the suboptimality scales with the *maximal* concentrability coefficient $C^\star = \max_{s,a} d^{\pi^\star}(s,a)/\mu(s,a)$, where $\mu$ is the behavior distribution. The underlying density ratio $C^\star(s,a) = d^{\pi^\star}(s,a)/\mu(s,a)$ is state-dependent: regions with $C^\star(s,a) \gg 1$ require strong pessimism, while regions with $C^\star(s,a) \approx 1$ do not. UNIQ's adaptive $\tau(s)$ can be read as an approximation to this state-dependent pessimism need, estimated without access to $C^\star$ using calibrated ensemble disagreement.
Xie et al. ([2021](https://arxiv.org/html/2606.07592#bib.bib23)) extend this to the Bellman-consistent pessimism framework, showing that a value function satisfying pessimistic Bellman consistency achieves near-optimal suboptimality with polynomial dependence on problem quantities. Theorem 4 in that work shows that the suboptimality bound is:

$$J(\pi^\star) - J(\hat{\pi}) \le \frac{2}{1-\gamma}\sqrt{\mathbb{E}_{s\sim d^{\pi^\star}}\left[\mathrm{Var}_{a\sim\hat{\pi}}\bigl[Q^{\pi^\star}(s,a)\bigr]\right] + \text{EPE}}, \tag{8}$$

where EPE is the empirical prediction error of the value estimator. UNIQ's multi-expectile ensemble is designed to minimize EPE while maintaining pessimism through $\kappa$-penalized targets, providing an implicit Bellman-consistent pessimism mechanism.
### A.2 Conservative Value Learning: Global vs. Local Pessimism
CQL (Kumar et al., [2020](https://arxiv.org/html/2606.07592#bib.bib4)) adds a regularizer $\alpha\left(\mathbb{E}_{s,a\sim\hat{\pi}}[Q(s,a)] - \mathbb{E}_{s,a\sim\mu}[Q(s,a)]\right)$ that lower-bounds the in-distribution value function. The global coefficient $\alpha$ controls the *degree* of pessimism uniformly across all states. Kumar et al. ([2020](https://arxiv.org/html/2606.07592#bib.bib4)) prove that CQL's value function satisfies $Q^{\text{CQL}}(s,a) \le Q^{\pi}(s,a)$ for in-distribution $(s,a)$, making it a valid lower bound. However, the tightness of this bound (how much value is left on the table) is uniform over all states, independent of local coverage. IQL (Kostrikov et al., [2022](https://arxiv.org/html/2606.07592#bib.bib5)) implements a softer version: the expectile $\tau$ determines how tightly the value tracks the upper quantile of in-distribution returns, again applied globally. To our knowledge, UNIQ is the first model-free method to make this expectile level state-dependent, through its adaptive $\tau(s)$, in a distribution-free manner.
Bai et al. ([2022](https://arxiv.org/html/2606.07592#bib.bib24)) study instance-dependent pessimism and show that the optimal amount of pessimism at each state scales inversely with the square root of the local coverage count, $\kappa^\star(s,a) \propto 1/\sqrt{N \cdot \mu(s,a)}$. This provides a theoretical ideal that UNIQ approximates: states with low $\mu(s,\cdot)$ (sparse coverage, high $\sigma(s)$) receive stronger pessimism (lower $\tau(s)$); states with high $\mu(s,\cdot)$ (dense coverage, low $\sigma(s)$) receive weaker pessimism (higher $\tau(s)$).
### A\.3Ensemble Methods for Offline RL
SAC-N (An et al., [2021](https://arxiv.org/html/2606.07592#bib.bib7)) trains $N$ critic networks $\{Q_{\theta_k}\}_{k=1}^{N}$ and uses $Q_{\min}(s,a)=\min_k Q_{\theta_k}(s,a)$ as the pessimistic Bellman target. The expected value of $Q_{\min}$ under Gaussian critics satisfies:

$$\mathbb{E}[Q_{\min}]=\mu_Q-c(N)\,\sigma_Q,$$

where $c(N)=\mathbb{E}[\max(Z_1,\ldots,Z_N)]=-\mathbb{E}[\min(Z_1,\ldots,Z_N)]$ for $Z_i\sim\mathcal{N}(0,1)$ iid, and $\sigma_Q$ is the critic standard deviation. The coefficient $c(N)$ is positive and grows approximately as $\sqrt{2\log N}$, so more critics means more pessimism, but uniformly so. EDAC (An et al., [2021](https://arxiv.org/html/2606.07592#bib.bib7)) additionally enforces critic diversity via a gradient penalty:

$$\mathcal{L}_{\text{div}}=\lambda\,\mathbb{E}_{s,a\sim\mathcal{D}}\!\left[\sum_{i < j}\cos\left(\nabla_a Q_{\theta_i}(s,a),\,\nabla_a Q_{\theta_j}(s,a)\right)\right],$$

which is minimized, so that pairwise cosine similarities are driven down and critics are encouraged to disagree in the action-gradient direction. This makes $\sigma_Q$ a more reliable OOD signal. UNIQ uses a fundamentally different ensemble design: multiple *expectile levels* rather than multiple identical critics, yielding richer uncertainty information (both epistemic $\sigma$ and aleatoric $\Delta_\tau$) at lower compute.
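As a rough numerical check of the $\sqrt{2\log N}$ growth quoted for the SAC-N pessimism coefficient, the sketch below estimates the magnitude of the expected minimum of $N$ iid standard normals by Monte Carlo (by symmetry this equals the expected maximum). The function name, trial count, and seed are illustrative choices and are not part of SAC-N or UNIQ.

```python
import math
import random

def expected_max_std_normal(n, trials=10000, seed=0):
    """Monte Carlo estimate of E[max(Z_1, ..., Z_n)] for iid standard normals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials

# Compare the estimate with the asymptotic rate sqrt(2 log N).
for n in (2, 10, 50):
    estimate = expected_max_std_normal(n)
    rate = math.sqrt(2.0 * math.log(n))
    print(n, round(estimate, 3), round(rate, 3))
```

For small $N$ the estimate sits noticeably below $\sqrt{2\log N}$, which is an upper bound and an asymptotic rate rather than an exact value.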
ReBRAC\(Tarasovet al\.,[2023a](https://arxiv.org/html/2606.07592#bib.bib8)\)shows that careful tuning of a minimal 2\-critic architecture with layer normalization, modified target updates, and separate optimizers for actor and critic can match or exceed EDAC\. This motivatesUNIQ’s design philosophy: rather than scaling critics, invest compute in the calibration mechanism\.
### A\.4Conformal Prediction: Theory and Extensions
The theoretical guarantee of split conformal prediction\(Papadopouloset al\.,[2002](https://arxiv.org/html/2606.07592#bib.bib13); Vovket al\.,[2005](https://arxiv.org/html/2606.07592#bib.bib10)\)is a finite\-sample marginal coverage result\. For calibration scores\{αi\}i=1n\\\{\\alpha\_\{i\}\\\}\_\{i=1\}^\{n\}and thresholdq^\\hat\{q\}:
1−δ≤Pr\[αnew≤q^\]≤1−δ\+1n\+1\.1\-\\delta\\leq\\Pr\\\!\\left\[\\alpha\_\{\\text\{new\}\}\\leq\\hat\{q\}\\right\]\\leq 1\-\\delta\+\\frac\{1\}\{n\+1\}\.\(9\)The upper bound shows that coverage is nearly exact\. The key assumption is*exchangeability*of calibration scores and the new test score—satisfied when calibration and deployment data are i\.i\.d\., which holds for transitions drawn from a fixed offline dataset\.
Romano et al. ([2019](https://arxiv.org/html/2606.07592#bib.bib12)) extend conformal prediction to regression with *adaptive* prediction intervals using quantile regression as a base model. Their conformalized quantile regression (CQR) retains the marginal coverage guarantee while producing intervals whose width adapts to the input $x$; coverage conditional on $x$ is approached only to the extent that the base quantile estimator is accurate. UNIQ's multi-expectile ensemble serves an analogous role: the $\tau=0.7$ value head provides a conditional expectile estimate (the expectile analogue of CQR's quantile estimate), and the conformal calibration layer ensures that residuals around this estimate satisfy the marginal coverage guarantee.
Tibshiraniet al\.\([2019](https://arxiv.org/html/2606.07592#bib.bib19)\)study conformal prediction under covariate shift, where test distribution differs from calibration\. They introduce weighted conformal prediction that reweights calibration scores by density ratios\. This is relevant toUNIQ: during policy deployment, states visited by the learned policy may differ from those in𝒟cal\\mathcal\{D\}\_\{\\mathrm\{cal\}\}\. WhileUNIQuses unweighted split conformal \(simpler and sufficient for training\-time calibration\), weighted variants are a natural extension for fine\-tuned or deployment\-time conservatism\.
Gibbs and Candès \([2021](https://arxiv.org/html/2606.07592#bib.bib17)\)develop online conformal prediction that tracks a time\-varying thresholdq^t\\hat\{q\}\_\{t\}via gradient descent on the coverage loss:
$$\hat{q}_{t+1}=\hat{q}_t-\eta\left(\delta-\mathbf{1}[\alpha_t>\hat{q}_t]\right).$$

The threshold rises by $\eta(1-\delta)$ after each miss and falls by $\eta\delta$ otherwise, which drives the time-averaged coverage toward $1-\delta$ even under distribution shift, addressing the calibration-lag limitation of UNIQ's periodic recalibration. Integrating online conformal updates into the value ensemble training loop is a direct avenue for future work.
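The threshold update above can be sketched in a few lines. This is a minimal illustration on a hypothetical score stream whose scale jumps halfway through; the helper names, step size `eta`, and the stream itself are assumptions made for the example and do not come from UNIQ or from Gibbs and Candès.

```python
import random

def online_conformal(scores, delta=0.1, eta=0.05, q0=0.0):
    """Track a threshold with q <- q - eta * (delta - 1[score > q])."""
    q = q0
    misses = 0
    for score in scores:
        miss = 1 if score > q else 0   # 1 when the threshold failed to cover
        misses += miss
        q = q - eta * (delta - miss)   # rises after a miss, decays slowly otherwise
    return q, misses / len(scores)

def drifting_scores(n, seed=0):
    """Hypothetical stream: uniform on [0, 1], then uniform on [0, 3] after the midpoint."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 1.0 if t < n // 2 else 3.0) for t in range(n)]

q_final, miss_rate = online_conformal(drifting_scores(10000))
print(round(q_final, 3), round(miss_rate, 4))
```

Because the cumulative update telescopes, the gap between the realized miss rate and `delta` is bounded by the range of the threshold divided by `eta * T`, so the long-run miss rate stays close to `delta` despite the shift.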
### A\.5Uncertainty Estimation for Reinforcement Learning
Deep ensembles\(Lakshminarayananet al\.,[2017](https://arxiv.org/html/2606.07592#bib.bib25)\)achieve well\-calibrated epistemic uncertainty by combining diversity of random initialization with different minima of the loss landscape\. ForNNensemble members, the predictive uncertaintyσens2=1N∑k\(fk\(x\)−f¯\(x\)\)2\\sigma^\{2\}\_\{\\text\{ens\}\}=\\frac\{1\}\{N\}\\sum\_\{k\}\(f\_\{k\}\(x\)\-\\bar\{f\}\(x\)\)^\{2\}is a reliable proxy for epistemic uncertainty in regions unseen during training\.Ovadiaet al\.\([2019](https://arxiv.org/html/2606.07592#bib.bib26)\)show that ensemble disagreement degrades gracefully under dataset shift: in\-distribution samples have lowσens\\sigma\_\{\\text\{ens\}\}, OOD samples have highσens\\sigma\_\{\\text\{ens\}\}—exactly the desired behavior for an offline RL uncertainty signal\. However, the*scale*ofσens\\sigma\_\{\\text\{ens\}\}is task\-dependent, motivating the conformal normalization inUNIQ\.
MOPO\(Yuet al\.,[2020](https://arxiv.org/html/2606.07592#bib.bib27)\)and COMBO\(Yuet al\.,[2021](https://arxiv.org/html/2606.07592#bib.bib14)\)use model ensemble disagreement as a penalty in model\-based offline RL\. MOPO’s pessimistic reward isr~\(s,a\)=r\(s,a\)−λstd\[P^\(s′\|s,a\)\]\\tilde\{r\}\(s,a\)=r\(s,a\)\-\\lambda\\,\\mathrm\{std\}\[\\hat\{P\}\(s^\{\\prime\}\|s,a\)\]whereP^\\hat\{P\}is an ensemble of transition models\. This is conceptually closest toUNIQ’s pessimistic value targetVpess=V¯−κσV\_\{\\mathrm\{pess\}\}=\\bar\{V\}\-\\kappa\\sigma, but applied in value space rather than model space and without conformal calibration\. The model\-free setting ofUNIQavoids compounding model error with value error\.
Kidambiet al\.\([2020](https://arxiv.org/html/2606.07592#bib.bib28)\)use disagreement among model ensemble members to define a “HALT” region of truly OOD states, applying a large penalty−∞\-\\inftyto transitions entering this region\. This is a hard threshold version ofUNIQ’s soft, continuousτ\(s\)\\tau\(s\)adaptation—both capture the same fundamental idea of state\-dependent conservatism\.
## Appendix BMathematical Derivations
### B\.1MDP Setup and Notation
We work in a Markov Decision Process\(𝒮,𝒜,P,r,γ\)\(\\mathcal\{S\},\\mathcal\{A\},P,r,\\gamma\)with Polish state space𝒮\\mathcal\{S\}, action space𝒜\\mathcal\{A\}, Borel\-measurable transition kernelP:𝒮×𝒜→Δ\(𝒮\)P:\\mathcal\{S\}\\times\\mathcal\{A\}\\to\\Delta\(\\mathcal\{S\}\), bounded rewardr:𝒮×𝒜→ℝr:\\mathcal\{S\}\\times\\mathcal\{A\}\\to\\mathbb\{R\},‖r‖∞≤rmax\\\|r\\\|\_\{\\infty\}\\leq r\_\{\\max\}, and discountγ∈\[0,1\)\\gamma\\in\[0,1\)\. The offline dataset is:
𝒟=\{\(si,ai,ri,si′,di\)\}i=1N,\(si,ai\)∼μ,si′∼P\(⋅\|si,ai\),ri=r\(si,ai\),\\mathcal\{D\}=\\\{\(s\_\{i\},a\_\{i\},r\_\{i\},s^\{\\prime\}\_\{i\},d\_\{i\}\)\\\}\_\{i=1\}^\{N\},\\quad\(s\_\{i\},a\_\{i\}\)\\sim\\mu,\\;s^\{\\prime\}\_\{i\}\\sim P\(\\cdot\|s\_\{i\},a\_\{i\}\),\\;r\_\{i\}=r\(s\_\{i\},a\_\{i\}\),\(10\)whereμ\\muis the unknown behavior distribution anddi∈\{0,1\}d\_\{i\}\\in\\\{0,1\\\}is the terminal indicator\. The behavior policy induces a marginalμ\(s\)=∫μ\(s,a\)𝑑a\\mu\(s\)=\\int\\mu\(s,a\)\\,daover states\.
The optimal Q\-function satisfies the Bellman optimality equation:
Q⋆\(s,a\)=r\(s,a\)\+γ𝔼s′∼P\(⋅\|s,a\)\[maxa′Q⋆\(s′,a′\)\]\.Q^\{\\star\}\(s,a\)=r\(s,a\)\+\\gamma\\,\\mathbb\{E\}\_\{s^\{\\prime\}\\sim P\(\\cdot\|s,a\)\}\\\!\\left\[\\max\_\{a^\{\\prime\}\}Q^\{\\star\}\(s^\{\\prime\},a^\{\\prime\}\)\\right\]\.\(11\)The offline RL challenge is estimatingQ⋆Q^\{\\star\}\(or a near\-optimalQπQ^\{\\pi\}\) from𝒟\\mathcal\{D\}alone, without further interaction with the environment\.
### B\.2Expectile Regression: Properties
###### Definition 1\(Expectile\)\.
For a random variableXXwith CDFFFand a levelτ∈\(0,1\)\\tau\\in\(0,1\), theτ\\tau\-expectileeτ\(X\)e\_\{\\tau\}\(X\)is the unique minimizer of:
$$e_\tau(X)=\arg\min_{v\in\mathbb{R}}\mathbb{E}\!\left[\left|\tau-\mathbf{1}(X < v)\right|(X-v)^2\right].\tag{12}$$

Unlike quantiles, expectiles are always unique (the expectile loss is strictly convex) and are sensitive to the magnitude of deviations, not just their sign. The expectile $e_\tau(X)$ can be equivalently characterized as the solution to:

$$\tau\,\mathbb{E}[\max(X-e_\tau,0)]=(1-\tau)\,\mathbb{E}[\max(e_\tau-X,0)],\tag{13}$$

a balance condition between the positive and negative deviations. For $\tau=0.5$, Eq. ([13](https://arxiv.org/html/2606.07592#A2.E13)) gives $\mathbb{E}[(X-e_{0.5})^{+}]=\mathbb{E}[(e_{0.5}-X)^{+}]$, which is satisfied at the mean: $e_{0.5}(X)=\mathbb{E}[X]$. For $\tau\to 1$, the balance condition forces $e_\tau\to\mathrm{ess\,sup}(X)$.
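To make the balance condition concrete, the following sketch computes the empirical $\tau$-expectile of a small sample by iterating the weighted-mean form of the first-order condition. The function name and the sample are illustrative assumptions; the iteration is a standard way to solve the expectile equation and is not claimed to be what UNIQ uses internally.

```python
def expectile(xs, tau, iters=100, tol=1e-12):
    """Empirical tau-expectile via iteratively reweighted means.

    Points above the current estimate get weight tau, points below get
    weight 1 - tau; the fixed point satisfies the balance condition.
    """
    e = sum(xs) / len(xs)  # tau = 0.5 solution, used as the starting point
    for _ in range(iters):
        weights = [tau if x > e else 1.0 - tau for x in xs]
        e_new = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        converged = abs(e_new - e) < tol
        e = e_new
        if converged:
            break
    return e

sample = [0.0, 1.0, 2.0, 10.0]
for tau in (0.5, 0.7, 0.9, 0.99):
    print(tau, round(expectile(sample, tau), 4))
```

At $\tau=0.5$ the result is the sample mean, and as $\tau$ increases the expectile moves toward the sample maximum without reaching it.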
##### IQL value learning\.
IQL\(Kostrikovet al\.,[2022](https://arxiv.org/html/2606.07592#bib.bib5)\)applies the expectile loss to the advantage residualu=Q\(s,a\)−V\(s\)u=Q\(s,a\)\-V\(s\):
ℒτIQL\(ϕ\)=𝔼\(s,a\)∼𝒟\[\|τ−𝟏\(Qθ\(s,a\)−Vϕ\(s\)<0\)\|\(Qθ\(s,a\)−Vϕ\(s\)\)2\]\.\\mathcal\{L\}\_\{\\tau\}^\{\\text\{IQL\}\}\(\\phi\)=\\mathbb\{E\}\_\{\(s,a\)\\sim\\mathcal\{D\}\}\\\!\\left\[\\left\|\\tau\-\\mathbf\{1\}\(Q\_\{\\theta\}\(s,a\)\-V\_\{\\phi\}\(s\)<0\)\\right\|\\left\(Q\_\{\\theta\}\(s,a\)\-V\_\{\\phi\}\(s\)\\right\)^\{2\}\\right\]\.\(14\)The minimizer satisfiesVϕ⋆\(s\)=eτ\(Qθ\(s,⋅\)\)μ\(⋅\|s\)V\_\{\\phi\}^\{\\star\}\(s\)=e\_\{\\tau\}\\\!\\left\(Q\_\{\\theta\}\(s,\\cdot\)\\right\)\_\{\\mu\(\\cdot\|s\)\}: theτ\\tau\-expectile of Q\-values under the conditional behavior distribution at statess\. This avoids OOD action queries—VϕV\_\{\\phi\}is learned using only in\-distribution\(s,a\)\(s,a\)pairs\.
##### Multi\-expectile ensemble\.
UNIQtrainsNvN\_\{v\}ensemble members at each of three fixed levelsτ¯∈\{0\.5,0\.7,0\.9\}\\bar\{\\tau\}\\in\\\{0\.5,0\.7,0\.9\\\}, yielding3Nv3N\_\{v\}value heads total\. Denote thekk\-th ensemble member at levelτ¯\\bar\{\\tau\}asVϕk\(τ¯\)V\_\{\\phi\_\{k\}\}^\{\(\\bar\{\\tau\}\)\}\. Each member solves:
minϕk𝔼\(s,a\)∼𝒟\[ℒτ¯\(Qθ\(s,a\)−Vϕk\(τ¯\)\(s\)\)\]\.\\min\_\{\\phi\_\{k\}\}\\mathbb\{E\}\_\{\(s,a\)\\sim\\mathcal\{D\}\}\\\!\\left\[\\mathcal\{L\}\_\{\\bar\{\\tau\}\}\\\!\\left\(Q\_\{\\theta\}\(s,a\)\-V\_\{\\phi\_\{k\}\}^\{\(\\bar\{\\tau\}\)\}\(s\)\\right\)\\right\]\.\(15\)At convergence, eachVϕk\(τ¯\)V\_\{\\phi\_\{k\}\}^\{\(\\bar\{\\tau\}\)\}estimates theτ¯\\bar\{\\tau\}\-expectile of the behavior\-induced return distribution at each state, from a different initialization \(producing diverse solutions via the ensemble diversity principle\(Lakshminarayananet al\.,[2017](https://arxiv.org/html/2606.07592#bib.bib25)\)\)\.
##### Uncertainty signals\.
The ensemble induces two complementary uncertainty measures:
$$\sigma(s)=\sqrt{\frac{1}{N_v}\sum_{k=1}^{N_v}\left(V_{\phi_k}^{(0.7)}(s)-\bar{V}^{(0.7)}(s)\right)^{2}},\quad\bar{V}^{(0.7)}(s)=\frac{1}{N_v}\sum_{k}V_{\phi_k}^{(0.7)}(s),\tag{16}$$

$$\Delta_\tau(s)=\bar{V}^{(0.9)}(s)-\bar{V}^{(0.5)}(s).\tag{17}$$

$\sigma(s)$ is the *epistemic* uncertainty: disagreement among ensemble members about the $\tau=0.7$ value estimate. States with high $\sigma(s)$ are those where the value function is poorly determined by training data, so the ensemble members have converged to different solutions. $\Delta_\tau(s)$ is the *aleatoric* uncertainty: the spread of the return distribution at state $s$ under the behavior policy, measured via the inter-expectile range between the 0.9 and 0.5 levels. High $\Delta_\tau$ indicates inherently stochastic returns, regardless of data coverage.
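The two signals can be computed directly from the head outputs at a single state. The sketch below assumes the predictions are available as a dictionary keyed by nominal expectile level; that data layout and the function name are hypothetical conveniences for the example.

```python
import math

def ensemble_signals(v_heads):
    """Return (sigma, delta_tau) at one state.

    v_heads maps each nominal level (0.5, 0.7, 0.9) to the list of N_v
    predictions produced by the ensemble members trained at that level.
    """
    means = {level: sum(vals) / len(vals) for level, vals in v_heads.items()}
    mid = v_heads[0.7]
    # Population standard deviation (1/N_v normalization) across tau = 0.7 members.
    sigma = math.sqrt(sum((v - means[0.7]) ** 2 for v in mid) / len(mid))
    # Spread between the mean 0.9-level and mean 0.5-level estimates.
    delta_tau = means[0.9] - means[0.5]
    return sigma, delta_tau

sigma, delta_tau = ensemble_signals({0.5: [1.0, 1.0, 1.0],
                                     0.7: [2.0, 3.0, 4.0],
                                     0.9: [5.0, 5.0, 5.0]})
print(round(sigma, 4), delta_tau)
```

With a single member per level the standard deviation is identically zero, so the epistemic signal carries no information in that case.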
### B\.3Pessimistic Bellman Target
The pessimistic value used inUNIQ’s Q\-function update is:
$$V_{\mathrm{pess}}(s)=\bar{V}^{(0.7)}(s)-\kappa\,\sigma(s).\tag{18}$$

The corresponding Bellman target for the Q-function is:

$$y_i=r_i+\gamma(1-d_i)\,V_{\mathrm{pess}}(s'_i)=r_i+\gamma(1-d_i)\!\left[\bar{V}^{(0.7)}(s'_i)-\kappa\,\sigma(s'_i)\right].\tag{19}$$

The Q-function loss is the standard squared TD error:

$$\mathcal{L}_Q(\theta)=\mathbb{E}_{(s,a,r,s',d)\sim\mathcal{D}}\!\left[\left(Q_\theta(s,a)-y\right)^{2}\right].\tag{20}$$
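A minimal sketch of the pessimistic target and the squared TD loss, assuming the ensemble mean and disagreement are exposed as plain callables on states. The names `v_bar`, `sigma`, and the batch layout are hypothetical; the default `gamma` and `kappa` values are placeholders rather than the paper's settings.

```python
def pessimistic_targets(batch, v_bar, sigma, gamma=0.99, kappa=0.5):
    """Targets y = r + gamma * (1 - d) * (v_bar(s') - kappa * sigma(s')).

    batch is a list of (reward, next_state, done) tuples with done in {0.0, 1.0}.
    """
    targets = []
    for reward, next_state, done in batch:
        v_pess = v_bar(next_state) - kappa * sigma(next_state)
        targets.append(reward + gamma * (1.0 - done) * v_pess)
    return targets

def td_loss(q_preds, targets):
    """Mean squared TD error between Q predictions and fixed targets."""
    return sum((q - y) ** 2 for q, y in zip(q_preds, targets)) / len(targets)

# Toy check: constant value 10 with disagreement 2 at every next state.
ys = pessimistic_targets([(1.0, 0, 0.0), (1.0, 0, 1.0)],
                         v_bar=lambda s: 10.0, sigma=lambda s: 2.0,
                         gamma=0.9, kappa=0.5)
print(ys, td_loss([9.0, 1.0], ys))
```

The terminal transition keeps only its reward, and the non-terminal one bootstraps from a value that has been lowered by `kappa` times the disagreement.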
##### Connection to lower confidence bounds\.
The targetVpessV\_\{\\mathrm\{pess\}\}is an instance of a lower confidence bound \(LCB\) estimate\. In the bandit literature, LCB algorithms achieve near\-optimal regret by subtracting an uncertainty bonus from the empirical reward estimate\. The analogous construction in offline RL\(Rashidinejadet al\.,[2021](https://arxiv.org/html/2606.07592#bib.bib22)\)sets:
Q~\(s,a\)=Q^\(s,a\)−β⋅b\(s,a\),\\tilde\{Q\}\(s,a\)=\\hat\{Q\}\(s,a\)\-\\beta\\cdot b\(s,a\),\(21\)whereb\(s,a\)b\(s,a\)is a bonus measuring coverage uncertainty\.UNIQ’sVpessV\_\{\\mathrm\{pess\}\}plays the role ofQ~\\tilde\{Q\}in the value domain: by penalizing the value target proportional to ensemble disagreementσ\(s′\)\\sigma\(s^\{\\prime\}\), the Q\-update implicitly receives pessimistic targets in low\-coverage next states\.
##### Effect on policy\.
The learned policy is extracted via advantage\-weighted regression:
A\(s,a\)\\displaystyle A\(s,a\)=Q\(s,a\)−V\(s\),\\displaystyle=Q\(s,a\)\-V\(s\),\(22\)w\(s,a\)\\displaystyle w\(s,a\)=exp\(βπA\(s,a\)\),\\displaystyle=\\exp\\\!\\left\(\\beta\_\{\\pi\}A\(s,a\)\\right\),\(23\)ℒπ\(ψ\)\\displaystyle\\mathcal\{L\}\_\{\\pi\}\(\\psi\)=−𝔼\(s,a\)∼𝒟\[w\(s,a\)logπψ\(a\|s\)\]\.\\displaystyle=\-\\mathbb\{E\}\_\{\(s,a\)\\sim\\mathcal\{D\}\}\\\!\\left\[w\(s,a\)\\,\\log\\pi\_\{\\psi\}\(a\|s\)\\right\]\.\(24\)A more pessimistic value targetVpessV\_\{\\mathrm\{pess\}\}produces a lowerVV, which in turn increasesA\(s,a\)=Q\(s,a\)−V\(s\)A\(s,a\)=Q\(s,a\)\-V\(s\)for in\-distribution\(s,a\)\(s,a\)\. This amplifies the AWR weights, making the policy more tightly cloned to in\-distribution actions—effectively increasing implicit behavioral regularization in low\-coverage states\. In high\-coverage states,σ\(s′\)\\sigma\(s^\{\\prime\}\)is small, soVpess≈V¯V\_\{\\mathrm\{pess\}\}\\approx\\bar\{V\}, and the advantage weights are less affected\.
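The advantage-weighted regression weights can be sketched as follows. The cap on the weight is an assumption borrowed from common IQL-style implementations to keep the exponential bounded; it does not appear in the equations above, and `beta_pi=3.0` is only a placeholder value.

```python
import math

def awr_weights(q_vals, v_vals, beta_pi=3.0, max_weight=100.0):
    """exp(beta_pi * (Q - V)), with the exponent capped so the weight never exceeds max_weight."""
    log_cap = math.log(max_weight)
    return [math.exp(min(beta_pi * (q - v), log_cap)) for q, v in zip(q_vals, v_vals)]

def awr_loss(weights, log_probs):
    """Negative weighted log-likelihood of dataset actions under the policy."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(weights)

# Lowering V at a fixed Q raises the advantage and therefore the weight.
print(awr_weights([1.0, 1.0], [0.8, 0.5]))
print(awr_loss([2.0], [-0.5]))
```

Capping the exponent instead of the result avoids overflow when an advantage estimate is very large.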
### B\.4Split Conformal Calibration: Full Derivation
#### B\.4\.1Setup and Nonconformity Scores
We partition𝒟\\mathcal\{D\}into training set𝒟train\\mathcal\{D\}\_\{\\mathrm\{train\}\}\(80%80\\%\) and calibration set𝒟cal\\mathcal\{D\}\_\{\\mathrm\{cal\}\}\(20%20\\%\),\|𝒟cal\|=n\|\\mathcal\{D\}\_\{\\mathrm\{cal\}\}\|=n\. Given a trained value ensemble, define the Bellman residual nonconformity score for each calibration transition\(si,ai,ri,si′\)∈𝒟cal\(s\_\{i\},a\_\{i\},r\_\{i\},s^\{\\prime\}\_\{i\}\)\\in\\mathcal\{D\}\_\{\\mathrm\{cal\}\}:
$$\alpha_i=\left|r_i+\gamma\,\bar{V}^{(0.7)}(s'_i)-\bar{V}^{(0.7)}(s_i)\right|.\tag{25}$$
This score measures the Bellman consistency of the ensemble’sτ=0\.7\\tau=0\.7value function on the calibration transition\. Key properties:
1. 1\.αi=0\\alpha\_\{i\}=0iff the ensemble’s TD equation is exactly satisfied at transitionii—perfect coverage and fitting\.
2. 2\.αi\\alpha\_\{i\}is large when the ensemble’s value function cannot fit the transition’s return structure, indicating either OOD state or poorly fitted region\.
3. 3\.UsingV¯\(0\.7\)\\bar\{V\}^\{\(0\.7\)\}\(the mid\-level expectile\) rather thanV¯\(0\.9\)\\bar\{V\}^\{\(0\.9\)\}orV¯\(0\.5\)\\bar\{V\}^\{\(0\.5\)\}produces more stable residuals:V¯\(0\.9\)\\bar\{V\}^\{\(0\.9\)\}would overestimate returns andV¯\(0\.5\)\\bar\{V\}^\{\(0\.5\)\}would underestimate, both inflatingαi\\alpha\_\{i\}for systematic rather than uncertainty\-related reasons\.
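The nonconformity score defined above is a one-line computation per calibration transition. In this sketch the ensemble mean is passed in as a callable `v_bar`, and transitions are `(state, reward, next_state)` tuples; both are hypothetical interfaces chosen for the example. The score follows the definition as written, without a terminal mask.

```python
def bellman_residual_scores(transitions, v_bar, gamma=0.99):
    """alpha_i = |r_i + gamma * v_bar(s'_i) - v_bar(s_i)| for each calibration transition."""
    return [abs(reward + gamma * v_bar(next_state) - v_bar(state))
            for state, reward, next_state in transitions]

# Toy value function v_bar(s) = 2 * s on scalar states.
scores = bellman_residual_scores([(1.0, 0.5, 2.0), (2.0, 0.0, 4.0)],
                                 v_bar=lambda s: 2.0 * s, gamma=0.5)
print(scores)
```

A score of zero means the mean value head satisfies the one-step consistency equation exactly on that transition.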
#### B\.4\.2Conformal Quantile Computation
The\(1−δ\)\(1\-\\delta\)\-quantile threshold is:
$$\hat{q}=\mathrm{Quantile}_{(1-\delta)}\!\left(\left\{\alpha_i\right\}_{i=1}^{n}\right),\tag{26}$$

implemented as the $\lceil(1-\delta)(n+1)\rceil$-th order statistic of the calibration scores. The precise formula using the finite-sample correction is:

$$\hat{q}=\alpha_{(\lceil(1-\delta)(n+1)\rceil)},\quad\text{where }\alpha_{(1)}\leq\alpha_{(2)}\leq\cdots\leq\alpha_{(n)}.\tag{27}$$
###### Theorem 1\(Conformal Coverage Guarantee,Vovket al\.[2005](https://arxiv.org/html/2606.07592#bib.bib10)\)\.
Let\(α1,…,αn,αnew\)\(\\alpha\_\{1\},\\ldots,\\alpha\_\{n\},\\alpha\_\{\\mathrm\{new\}\}\)be exchangeable \(e\.g\., i\.i\.d\.\)\. Then:
Pr\[αnew≤q^\]≥1−δ,\\Pr\\\!\\left\[\\alpha\_\{\\mathrm\{new\}\}\\leq\\hat\{q\}\\right\]\\geq 1\-\\delta,\(28\)and furthermore:
Pr\[αnew≤q^\]≤1−δ\+1n\+1\.\\Pr\\\!\\left\[\\alpha\_\{\\mathrm\{new\}\}\\leq\\hat\{q\}\\right\]\\leq 1\-\\delta\+\\frac\{1\}\{n\+1\}\.\(29\)
Theorem [1](https://arxiv.org/html/2606.07592#Thmtheorem1) requires only exchangeability, which is weaker than the i.i.d. assumption: the joint distribution of the scores must be invariant to permutation, but the scores need not be independent. The condition holds when calibration transitions are drawn i.i.d. from the offline dataset distribution, which UNIQ's random train/calibration split provides.
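The coverage bounds can be checked empirically with a small simulation. The sketch below implements the order-statistic threshold with the finite-sample correction and measures how often a fresh score falls below it; the exponential score distribution, sample sizes, and function names are arbitrary assumptions for illustration. When the required index exceeds $n$, the threshold is taken to be $+\infty$, the usual convention.

```python
import math
import random

def conformal_quantile(scores, delta):
    """ceil((1 - delta) * (n + 1))-th smallest score, or +inf if that index exceeds n."""
    n = len(scores)
    k = math.ceil((1.0 - delta) * (n + 1))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]

def empirical_coverage(n=50, delta=0.1, trials=4000, seed=0):
    """Fraction of trials in which a fresh iid score is at most the calibrated threshold."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(trials):
        calibration = [rng.expovariate(1.0) for _ in range(n)]
        fresh = rng.expovariate(1.0)
        if fresh <= conformal_quantile(calibration, delta):
            covered += 1
    return covered / trials

print(conformal_quantile([float(i) for i in range(1, 10)], 0.5))
print(round(empirical_coverage(), 3))
```

With $n=50$ and $\delta=0.1$ the theorem places coverage between $0.9$ and $0.9+1/51$, and the simulated frequency should land near that band up to Monte Carlo noise.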
#### B\.4\.3Calibrated Uncertainty Normalization
The raw ensemble disagreementσ\(s\)\\sigma\(s\)is task\-scale\-dependent: identical disagreement magnitudes correspond to different levels of OOD\-ness across environments with different reward scales and value magnitudes\. Conformal calibration convertsσ\(s\)\\sigma\(s\)into a unitless, task\-invariant score:
$$u(s)=\frac{\sigma(s)}{\hat{q}+\varepsilon},\quad\varepsilon=10^{-6}.\tag{30}$$
###### Proposition 1\(Interpretation ofu\(s\)u\(s\)\)\.
For a statessdrawn from the offline data distributionμs\\mu\_\{s\}, the event\{u\(s\)\>1\}\\\{u\(s\)\>1\\\}corresponds to the ensemble disagreement exceeding the\(1−δ\)\(1\-\\delta\)\-quantile of the Bellman residual distribution\. Under Theorem[1](https://arxiv.org/html/2606.07592#Thmtheorem1), this event occurs with probability at mostδ\\deltafor in\-distribution states\.
###### Proof\.
By definition, $u(s)=\sigma(s)/(\hat{q}+\varepsilon)$, so up to the negligible stabilizer $\varepsilon$ the event $\{u(s)>1\}$ is equivalent to $\{\sigma(s)>\hat{q}\}$. We need to connect $\sigma(s)$ to the nonconformity scores $\alpha_i$. Note that both $\sigma(s)$ and $\alpha_i$ measure aspects of the ensemble's uncertainty, but in different functional forms: $\sigma(s)$ is the standard deviation of value predictions at $s$, while $\alpha_i$ is the Bellman residual magnitude at calibration transition $i$. In well-covered states, both quantities are small; in OOD states, both are large (by the ensemble diversity property (Lakshminarayanan et al., [2017](https://arxiv.org/html/2606.07592#bib.bib25))). The conformal guarantee bounds the probability that a fresh $\alpha_{\mathrm{new}}>\hat{q}$, which corresponds stochastically to $\sigma(s)>\hat{q}$ for states that are OOD relative to the calibration distribution. ∎
#### B\.4\.4Recalibration Dynamics
The conformal quantileq^\\hat\{q\}is a function of the current ensemble\{Vϕk\(τ¯\)\}\\\{V\_\{\\phi\_\{k\}\}^\{\(\\bar\{\\tau\}\)\}\\\}\. As the ensemble trains, both the residualsαi\\alpha\_\{i\}and their distribution change\.UNIQrecomputesq^\\hat\{q\}everyTrecalT\_\{\\mathrm\{recal\}\}steps\. Letq^\(t\)\\hat\{q\}^\{\(t\)\}denote the conformal quantile at steptt\. The sequence\{q^\(t\)\}\\\{\\hat\{q\}^\{\(t\)\}\\\}evolves as:
q^\(t\+Trecal\)=Quantile\(1−δ\)\(\{\|ri\+γV¯\(0\.7\),t\(si′\)−V¯\(0\.7\),t\(si\)\|\}i∈𝒟cal\)\.\\hat\{q\}^\{\(t\+T\_\{\\mathrm\{recal\}\}\)\}=\\mathrm\{Quantile\}\_\{\(1\-\\delta\)\}\\\!\\left\(\\left\\\{\\left\|r\_\{i\}\+\\gamma\\,\\bar\{V\}^\{\(0\.7\),t\}\(s^\{\\prime\}\_\{i\}\)\-\\bar\{V\}^\{\(0\.7\),t\}\(s\_\{i\}\)\\right\|\\right\\\}\_\{i\\in\\mathcal\{D\}\_\{\\mathrm\{cal\}\}\}\\right\)\.\(31\)Early in training \(t≪300Kt\\ll 300\\mathrm\{K\}\), the ensemble fits poorly andq^\(t\)\\hat\{q\}^\{\(t\)\}is large, causingu\(s\)≪1u\(s\)\\ll 1for most states—the adaptive mechanism is essentially inactive\. As the ensemble improves,q^\(t\)\\hat\{q\}^\{\(t\)\}decreases, and the relative signalu\(s\)u\(s\)becomes informative, engaging the adaptive conservatism\. This explains the observed late\-recovery pattern in learning curves: the mechanism only becomes effective onceq^\(t\)\\hat\{q\}^\{\(t\)\}stabilizes\.
### B\.5Adaptive Expectile Controller
#### B\.5\.1Mapping Design
The adaptive expectile mapping from calibrated uncertainty to conservatism level is:
$$\tau(s)=\tau_{\min}+(\tau_{\max}-\tau_{\min})\cdot\sigma_L\!\left(-\beta_\tau(u(s)-1)\right),\tag{32}$$

where $\sigma_L(z)=1/(1+e^{-z})$ is the logistic sigmoid. The function $\tau:\mathcal{S}\to[\tau_{\min},\tau_{\max}]$ has the following properties:
###### Proposition 2\(Properties ofτ\(s\)\\tau\(s\)\)\.
Under Eq\. \([32](https://arxiv.org/html/2606.07592#A2.E32)\):
1. $\tau(s)\in(\tau_{\min},\tau_{\max})$ for all $s$ (open interval; the bounds are strict for every finite $u(s)\geq 0$ and finite $\beta_\tau$).
2. $\tau(s)$ is strictly decreasing in $u(s)$: higher uncertainty $\Rightarrow$ lower expectile $\Rightarrow$ more conservative value estimate.
3. At the calibration threshold $u(s)=1$: $\tau(s)=(\tau_{\min}+\tau_{\max})/2$ (midpoint conservatism).
4. As $u(s)\to\infty$: $\tau(s)\to\tau_{\min}$ (maximum conservatism for OOD states).
5. As $u(s)\to 0$: $\tau(s)\to\tau_{\min}+(\tau_{\max}-\tau_{\min})\,\sigma_L(\beta_\tau)$, which is strictly below $\tau_{\max}$ and approaches it as $\beta_\tau$ grows (near-maximum optimism for dense-coverage states).
6. $\beta_\tau$ controls transition sharpness: $\beta_\tau\to\infty$ approximates a step function at $u(s)=1$.

###### Proof.

Assume $\tau_{\max}>\tau_{\min}$ and $\beta_\tau>0$. The logistic sigmoid is strictly increasing, so $u\mapsto\sigma_L(-\beta_\tau(u-1))$ is strictly decreasing. Property 1: $\sigma_L(z)\in(0,1)$ for every finite $z$. Property 2: $\frac{d\tau}{du}=-\beta_\tau(\tau_{\max}-\tau_{\min})\,\sigma_L(-\beta_\tau(u-1))\left(1-\sigma_L(-\beta_\tau(u-1))\right) < 0$. Property 3: $\sigma_L(0)=1/2$. Property 4: $\lim_{z\to-\infty}\sigma_L(z)=0$. Property 5: at $u=0$ the sigmoid argument equals $\beta_\tau$, and $\lim_{z\to+\infty}\sigma_L(z)=1$. Property 6: for $u\neq 1$ the argument diverges to $\pm\infty$ as $\beta_\tau\to\infty$. ∎
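The mapping from disagreement to expectile level can be sketched and its limiting behaviour checked numerically. The default values of `tau_min`, `tau_max`, and `beta_tau` below are placeholders chosen for illustration (only $\tau_{\max}\in\{0.90, 0.95\}$ is quoted elsewhere in the appendix), and the numerically stable sigmoid is an implementation convenience.

```python
import math

def stable_sigmoid(z):
    """Logistic sigmoid that avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def adaptive_tau(sigma, q_hat, tau_min=0.5, tau_max=0.9, beta_tau=5.0, eps=1e-6):
    """tau(s) = tau_min + (tau_max - tau_min) * sigmoid(-beta_tau * (u - 1)), u = sigma / (q_hat + eps)."""
    u = sigma / (q_hat + eps)
    return tau_min + (tau_max - tau_min) * stable_sigmoid(-beta_tau * (u - 1.0))

q_hat = 2.0
for sigma in (0.0, 1.0, q_hat + 1e-6, 3.0, 1e9):
    print(sigma, round(adaptive_tau(sigma, q_hat), 4))
```

At $u=0$ the sigmoid evaluates to $\sigma_L(\beta_\tau)$, so with a finite sharpness the output is close to but strictly below `tau_max`; it equals the midpoint at $u=1$ and approaches `tau_min` as the disagreement grows.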
#### B\.5\.2Adaptive Expectile Loss
Given the per\-stateτ\(s\)\\tau\(s\), the value ensemble is updated with:
$$\mathcal{L}_V^{\text{UNIQ}}(\phi_k,\bar{\tau})=\mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\left|\tau(s)\cdot\bar{\tau}-\mathbf{1}\!\left(Q_\theta(s,a)-V_{\phi_k}^{(\bar{\tau})}(s) < 0\right)\right|\left(Q_\theta(s,a)-V_{\phi_k}^{(\bar{\tau})}(s)\right)^{2}\right].\tag{33}$$

The effective expectile at state $s$ and nominal level $\bar{\tau}$ is $\tau_{\mathrm{eff}}(s,\bar{\tau})=\tau(s)\cdot\bar{\tau}$. For the central ensemble member ($\bar{\tau}=0.7$), this gives an effective range of $[0.7\,\tau_{\min},0.7\,\tau_{\max}]$; for the upper member ($\bar{\tau}=0.9$), the range is $[0.9\,\tau_{\min},0.9\,\tau_{\max}]$. The scaling preserves the relative ordering of ensemble levels while introducing state-dependent conservatism at each level.
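A minimal sketch of the adaptive expectile loss for one value head, written over plain Python lists. The function name and argument layout are assumptions for the example; setting every entry of `tau_s` to 1 reduces it to the fixed-level expectile loss at `tau_bar`.

```python
def adaptive_expectile_loss(q_vals, v_vals, tau_s, tau_bar):
    """Mean of |tau(s) * tau_bar - 1[Q - V < 0]| * (Q - V)^2 over a batch."""
    total = 0.0
    for q, v, tau in zip(q_vals, v_vals, tau_s):
        residual = q - v
        tau_eff = tau * tau_bar                      # state-dependent effective expectile
        indicator = 1.0 if residual < 0 else 0.0
        total += abs(tau_eff - indicator) * residual ** 2
    return total / len(q_vals)

# Same residuals, two settings of the per-state multiplier.
print(adaptive_expectile_loss([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], 0.7))
print(adaptive_expectile_loss([1.0, 0.0], [0.0, 1.0], [0.5, 0.5], 0.7))
```

Lowering the multiplier shrinks the weight on positive residuals and enlarges the weight on negative ones, which pulls the fitted value downward at that state.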
#### B\.5\.3Connection to IQL
IQL (Kostrikov et al., [2022](https://arxiv.org/html/2606.07592#bib.bib5)) corresponds to the special case $\tau(s)=1$ for all $s$: no adaptation, fixed expectile equal to the nominal level $\bar{\tau}$. UNIQ generalizes IQL in this sense: when $\tau_{\min}=\tau_{\max}=1$, Eq. ([32](https://arxiv.org/html/2606.07592#A2.E32)) gives $\tau(s)=1$ uniformly, and the value loss reduces to IQL's expectile loss at level $\bar{\tau}$; setting $\kappa=0$ additionally removes the pessimistic penalty from the Q-target. The additional expressive power of $\tau(s)$ is controlled by the interval $[\tau_{\min},\tau_{\max}]$ and the sharpness $\beta_\tau$.
### B\.6Complete Loss and Training Objective
The fullUNIQtraining objective combines three components:
##### Value ensemble loss\.
ℒV\(\{ϕk\}\)=∑τ¯∈\{0\.5,0\.7,0\.9\}∑k=1Nv𝔼\(s,a\)∼𝒟\[ℒτeff\(s,τ¯\)\(Qθ\(s,a\)−Vϕk\(τ¯\)\(s\)\)\]\.\\mathcal\{L\}\_\{V\}\(\\\{\\phi\_\{k\}\\\}\)=\\sum\_\{\\bar\{\\tau\}\\in\\\{0\.5,0\.7,0\.9\\\}\}\\sum\_\{k=1\}^\{N\_\{v\}\}\\mathbb\{E\}\_\{\(s,a\)\\sim\\mathcal\{D\}\}\\\!\\left\[\\mathcal\{L\}\_\{\\tau\_\{\\mathrm\{eff\}\}\(s,\\bar\{\\tau\}\)\}\\\!\\left\(Q\_\{\\theta\}\(s,a\)\-V\_\{\\phi\_\{k\}\}^\{\(\\bar\{\\tau\}\)\}\(s\)\\right\)\\right\]\.\(34\)
##### Q\-function loss\.
ℒQ\(θ\)=𝔼\(s,a,r,s′,d\)∼𝒟\[\(Qθ\(s,a\)−\(r\+γ\(1−d\)Vpess\(s′\)\)\)2\]\.\\mathcal\{L\}\_\{Q\}\(\\theta\)=\\mathbb\{E\}\_\{\(s,a,r,s^\{\\prime\},d\)\\sim\\mathcal\{D\}\}\\\!\\left\[\\left\(Q\_\{\\theta\}\(s,a\)\-\\left\(r\+\\gamma\(1\-d\)\\,V\_\{\\mathrm\{pess\}\}\(s^\{\\prime\}\)\\right\)\\right\)^\{2\}\\right\]\.\(35\)
##### Policy loss\.
ℒπ\(ψ\)=−𝔼\(s,a\)∼𝒟\[exp\(βπ\(Qθ\(s,a\)−V¯\(0\.7\)\(s\)\)\)⋅logπψ\(a\|s\)\]\.\\mathcal\{L\}\_\{\\pi\}\(\\psi\)=\-\\mathbb\{E\}\_\{\(s,a\)\\sim\\mathcal\{D\}\}\\\!\\left\[\\exp\\\!\\left\(\\beta\_\{\\pi\}\(Q\_\{\\theta\}\(s,a\)\-\\bar\{V\}^\{\(0\.7\)\}\(s\)\)\\right\)\\cdot\\log\\pi\_\{\\psi\}\(a\|s\)\\right\]\.\(36\)
The three components are optimized separately with Adam\(Kingma and Ba,[2015](https://arxiv.org/html/2606.07592#bib.bib34)\)\. The V ensemble is updated first \(to ensureσ\(s\)\\sigma\(s\)andq^\\hat\{q\}are current\), then the Q\-function using the updated pessimistic target, then the policy using the updated advantage estimates\. The total gradient computation per step involves3Nv\+23N\_\{v\}\+2forward passes \(one per V head, one for Q, one for policy\), compared toN\+1N\+1for SAC\-N \(NNcritics \+ policy\) and2N\+12N\+1for EDAC \(with diversity loss\)\.
### B\.7Full Result Table and Performance Summary
For completeness, Table[4](https://arxiv.org/html/2606.07592#A2.T4)reproduces the main comparison with additional statistics\.
Table 4: D4RL MuJoCo normalized scores. UNIQ results at 1M steps. All baseline results from CORL (Tarasov et al., [2023b](https://arxiv.org/html/2606.07592#bib.bib16)). Bold: best per task. $\Delta_{\text{IQL}}$: UNIQ gain over IQL. UNIQ improves over IQL on all 9 tasks with gains ranging from $+0.1$ (hc-medium-expert) to $+8.1$ (hp-medium). With an average of 85.2, UNIQ trails ReBRAC (89.7) when EDAC is excluded from the comparison. On three tasks, hopper-medium-replay-v2 (101.6), hopper-medium-expert-v2 (111.8), and walker2d-medium-expert-v2 (112.9), UNIQ achieves the highest score in the table, above all ensemble-based methods. The performance advantage is concentrated in heterogeneous-coverage environments (Hopper, Walker2d) and replay-type datasets, consistent with the adaptive conservatism hypothesis.
## Appendix CHyperparameter Details
Table 5: Full hyperparameter table for UNIQ experiments.

##### Configuration assignment (1M sweep).
Config A\(κ\\kappa=0\.0,τmax\\tau\_\{\\max\}=0\.95\): applied to halfcheetah\-medium\-expert\-v2 and hopper\-medium\-replay\-v2\. Config A relies exclusively on adaptiveτ\(s\)\\tau\(s\)for conservatism, setting the global pessimistic penalty to zero\. This is appropriate for replay\-heavy datasets, where a positiveκ\\kappaover\-penalizes the densely\-covered replay region\.
Config B\(κ\\kappa=0\.5,τmax\\tau\_\{\\max\}=0\.90\): applied to all remaining 7 tasks\. Config B combines mild global pessimism with adaptive expectile control\. It achieves strong performance on Walker2d tasks \(85\.5, 89\.4, 112\.9\) and Hopper tasks in this configuration\.
The sensitivity of replay tasks toκ\\kappamotivates the primary direction for future work: learningκ\(s\)\\kappa\(s\)as a state\-dependent function, analogous toτ\(s\)\\tau\(s\), such that a single configuration achieves task\-adaptive pessimism without manual class assignment\.
## Appendix DFull Ablation Analysis
Ablations are conducted on a 4\-task subset: halfcheetah\-medium\-v2, hopper\-medium\-v2, hopper\-medium\-replay\-v2, walker2d\-medium\-v2\. Table[6](https://arxiv.org/html/2606.07592#A4.T6)reports per\-task and average scores for all 10 ablation variants\. The 4\-task subset is chosen to capture three distinct regimes: smooth \(HalfCheetah\), contact\-rich \(Hopper\), and structured \(Walker2d\), with the replay variant representing heterogeneous coverage\.
Table 6: Full per-task ablation. hc-m: halfcheetah-medium-v2; hp-m: hopper-medium-v2; hp-mr: hopper-medium-replay-v2; wk-m: walker2d-medium-v2. All runs seed 0.

##### Observation 1: Conformal calibration is necessary for replay tasks.
The no\_conformal variant (raw $\sigma$ without normalization) produces 16.1 on hopper-medium-replay-v2 under $\kappa=1.0$. The full UNIQ model with conformal calibration achieves 13.7 at the same $\kappa$; in this regime both collapse, but the mechanism difference is exposed at lower $\kappa$: at $\kappa=0.5$, the full model (47.5 on hp-mr) outperforms the raw-$\sigma$ variant because $\hat{q}$ normalizes the scale of $\sigma(s)$ appropriately. Without conformal calibration, $u(s)=\sigma(s)$ is in absolute value units, and the sigmoid mapping receives inputs on an incorrect scale, producing suboptimal $\tau(s)$ everywhere.

##### Observation 2: Fixed $\tau$ degrades Walker2d performance.

Fixed\_tau achieves 71.5 on walker2d-medium vs. full UNIQ's 77.4 ($-5.9$ points) and 31.5 vs. 13.7 on hopper-medium-replay ($+17.8$ points, but both are low under $\kappa=1.0$). The Walker2d gap indicates that adaptive $\tau(s)$ is not a no-op: it provides genuine per-state value by relaxing conservatism in the well-covered walker2d state space.
##### Observation 3: No single $\kappa$ is globally optimal.
The hopper-medium-replay column spans 13.6 ($\kappa=1.0$, $N_v=3$) to 59.3 (`no_pessimism`); the walker2d-medium column spans 69.8 ($\kappa=2.0$) to 82.5 ($\kappa=0.0$). The optimal $\kappa$ for hopper-replay is near 0, and the optimal $\kappa$ for walker2d is also 0, but the mechanism that enables this is the per-task adaptive $\tau(s)$: with $\kappa=0$ and full adaptive $\tau$, walker2d reaches 82.5 and hopper-replay reaches 58.1 (both strong). This is the empirical foundation for the Config A/B assignment in the 1M sweep.
##### Observation 4: The $N_v=1$ artifact.
With $N_v=1$ ensemble member, $\sigma(s)\equiv 0$ for all $s$ (there is no disagreement), so the adaptive mechanism degenerates to $u(s)\equiv 0$, $\tau(s)\equiv\tau_{\max}$ (maximum optimism everywhere). The value updates then use $\tau_{\mathrm{eff}}(s,\bar{\tau})=\tau_{\max}\cdot\bar{\tau}$, a fixed but somewhat reduced expectile. The high 4-task average of 59.1 is driven by hopper-medium-replay (57.1), where the absence of any pessimistic $\sigma$ penalty avoids the over-penalization that collapses $N_v\geq 3$ under $\kappa=1.0$. This is an artifact of the specific $\kappa$ and task subset; in the full 9-task results, $N_v=3$ with adaptive config achieves the best results by providing genuine uncertainty signal on Walker2d tasks.
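The degenerate case can be reproduced in a few lines. This is a minimal NumPy sketch, assuming $\sigma(s)$ is the population standard deviation across ensemble members; the function name `ensemble_std` and the random stand-in values are hypothetical and not taken from the paper's code.

```python
import numpy as np


def ensemble_std(values: np.ndarray) -> np.ndarray:
    """Per-state disagreement across ensemble members.

    `values` has shape (n_members, batch_size); the result has shape
    (batch_size,). Uses the population std (ddof=0), which is an assumption.
    """
    return values.std(axis=0)


rng = np.random.default_rng(0)

# Stand-in value predictions for a batch of 5 states (random placeholders).
single_member = rng.normal(size=(1, 5))  # N_v = 1
three_members = rng.normal(size=(3, 5))  # N_v = 3

sigma_single = ensemble_std(single_member)
sigma_triple = ensemble_std(three_members)

print(sigma_single)  # all zeros: one member cannot disagree with itself
print(sigma_triple)  # strictly positive for distinct predictions
```

With a sample standard deviation (`ddof=1`), NumPy would return NaN for the single-member case rather than zero, so an implementation would likely need to handle $N_v=1$ explicitly.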
## Appendix E Computational Analysis
### E.1 Memory Complexity
Let $d_s$, $d_a$ denote state and action dimensions, and $d_h$ the hidden dimension of each network (all methods use a $d_h=256$ MLP with 3 layers).
##### UNIQ\.
Trainable parameters: $3N_v$ value heads $+$ 1 Q-function $+$ 1 policy $= 3N_v+2$ networks total. For $N_v=3$: 11 networks. Each network has $\approx 200\mathrm{K}$ parameters (3-layer MLP, $d_h=256$). Total: $\approx 2.2\mathrm{M}$ parameters; measured peak VRAM: 250 MB on A100 20 GB MIG.
##### EDAC\.
$N$ critic networks $+$ 1 policy, plus diversity regularization requiring pairwise gradient computations. For $N=50$: 51 networks plus $O(N^2)$ gradient pairs per step. Peak VRAM scales as $O(N d_h^2)$; measured/estimated at $\sim$2500 MB for $N=50$.
##### IQL\.
2 networks (V, Q) $+$ policy. Peak VRAM: $\sim$530 MB (measured on A100 20 GB MIG).
The ratio of UNIQ to IQL overhead is $11/3\approx 3.7\times$ in parameter count but only $1.14\times$ in VRAM, as the conformal calibration is a lightweight NumPy operation on CPU.
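The network and parameter counts above can be checked with simple arithmetic. The sketch below takes the approximate 200K parameters per network and the three-network IQL baseline from the text as given; the function and constant names are illustrative only.

```python
def uniq_network_count(n_v: int) -> int:
    """3 * N_v value heads + 1 Q-function + 1 policy, as stated in the text."""
    return 3 * n_v + 2


PARAMS_PER_NETWORK = 200_000  # approximate figure quoted in the text
IQL_NETWORKS = 3              # V, Q, and policy

n_networks = uniq_network_count(3)
total_params = n_networks * PARAMS_PER_NETWORK
param_ratio = n_networks / IQL_NETWORKS

print(n_networks)             # 11
print(total_params)           # 2200000
print(round(param_ratio, 2))  # 3.67
```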
### E.2 Per-Step Computation
Per training step, UNIQ requires:
1. $3N_v$ forward passes for the value ensemble (batch size 256).
2. Ensemble statistics: mean and std. across $N_v$ members, an $O(N_v)$ aggregation.
3. Conformal calibration: once per $T_{\mathrm{recal}}$ steps, a single pass over $\mathcal{D}_{\mathrm{cal}}$ ($O(n)$ with $n=0.2N$) and a quantile computation ($O(n\log n)$).
4. Q-function forward-backward: 1 pass.
5. Policy forward-backward: 1 pass.
Total forward passes per step: $3N_v+2=11$ (for $N_v=3$). EDAC with $N=50$: 51 forward passes plus a pairwise diversity loss requiring $\binom{50}{2}=1225$ gradient dot products. UNIQ is approximately $4.6\times$ faster per step than EDAC at $N=50$ and $1.1\times$ slower than IQL.
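The recalibration in step 3 amounts to one sort over the calibration scores. The following is a minimal sketch of the standard split-conformal quantile, not the authors' implementation: the function name `conformal_quantile`, the miscoverage level `alpha`, and the use of absolute residuals as nonconformity scores are all assumptions made for illustration, since this section does not specify them.

```python
import math

import numpy as np


def conformal_quantile(scores: np.ndarray, alpha: float) -> float:
    """Split-conformal quantile of calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or +inf
    when the calibration set is too small for the requested level.
    The sort dominates the cost at O(n log n).
    """
    n = scores.shape[0]
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return float(np.sort(scores)[k - 1])


rng = np.random.default_rng(0)

# Hypothetical nonconformity scores on a held-out calibration split.
cal_scores = np.abs(rng.normal(size=1000))
q_hat = conformal_quantile(cal_scores, alpha=0.1)

# Fresh scores from the same distribution should fall below q_hat
# roughly 90% of the time.
test_scores = np.abs(rng.normal(size=10_000))
coverage = float(np.mean(test_scores <= q_hat))
print(q_hat, coverage)
```

Because the quantile is recomputed only once per $T_{\mathrm{recal}}$ steps and needs no gradients, running it on CPU adds little to the per-step cost.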
## References
- G. An, S. Moon, J. Kim, and H. O. Song (2021) Uncertainty-based offline reinforcement learning with diversified q-ensemble. In Advances in Neural Information Processing Systems, Vol. 34.
- C. Bai, L. Wang, Z. Yang, Z. Han, A. Garg, P. Liu, and Z. Wang (2022) Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. In International Conference on Learning Representations (ICLR).
- L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch (2021) Decision transformer: reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34, pp. 15084–15097.
- J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine (2020) D4RL: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
- S. Fujimoto and S. S. Gu (2021) A minimalist approach to offline reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 34, pp. 20132–20145.
- S. Fujimoto, D. Meger, and D. Precup (2019) Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 2052–2062.
- I. Gibbs and E. Candès (2021) Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems, Vol. 34, pp. 1660–1672.
- Y. Jin, Z. Yang, and Z. Wang (2021) Is pessimism provably efficient for offline rl?. In International Conference on Machine Learning (ICML), pp. 5084–5096.
- R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims (2020) MOReL: model-based offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 21810–21823.
- D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- I. Kostrikov, A. Nair, and S. Levine (2022) Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations.
- A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1179–1191.
- B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 30.
- J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman (2018) Distribution-free predictive inference for regression. Journal of the American Statistical Association 113(523), pp. 1094–1111. External Links: [Document](https://dx.doi.org/10.1080/01621459.2017.1307116).
- S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, Z. Ghahramani, and J. Snoek (2019) Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 32.
- H. Papadopoulos, K. Proedrou, V. Vovk, and A. Gammerman (2002) Inductive confidence machines for regression. In European Conference on Machine Learning, pp. 345–356.
- S. Park and Y. Sung (2023) Confidence-aware offline reinforcement learning via conformal prediction.
- R. F. Prudencio, M. R. O. A. Maximo, and E. L. Colombini (2023) A survey on offline reinforcement learning: taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems. External Links: [Document](https://dx.doi.org/10.1109/TNNLS.2023.3250269).
- P. Rashidinejad, B. Zhu, C. Ma, J. Jiao, and S. Russell (2021) Bridging offline reinforcement learning and imitation learning: a tale of pessimism. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34, pp. 11702–11716.
- Y. Romano, E. Patterson, and E. J. Candès (2019) Conformalized quantile regression. In Advances in Neural Information Processing Systems, Vol. 32.
- D. Tarasov, V. Kurenkov, A. Nikulin, and S. Kolesnikov (2023a) Revisiting the minimalist approach to offline reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 36.
- D. Tarasov, A. Nikulin, D. Akimov, V. Kurenkov, and S. Kolesnikov (2023b) CORL: research-oriented deep offline reinforcement learning library. In Advances in Neural Information Processing Systems, Vol. 36.
- R. J. Tibshirani, R. F. Barber, E. J. Candès, and A. Ramdas (2019) Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 32.
- V. Vovk, A. Gammerman, and G. Shafer (2005) Algorithmic learning in a random world. Springer Science & Business Media. External Links: ISBN 9780387001524.
- Y. Wu, G. Tucker, and O. Nachum (2019) Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361.
- T. Xie, C. Cheng, N. Jiang, P. Mineiro, and A. Agarwal (2021) Bellman-consistent pessimism for offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34, pp. 15694–15706.
- T. Yu, A. Kumar, R. Rafailov, A. Rajeswaran, S. Levine, and C. Finn (2021) COMBO: conservative offline model-based policy optimization. In Advances in Neural Information Processing Systems, Vol. 34.
- T. Yu, G. Thomas, L. Yu, S. Ermon, J. Y. Zou, S. Levine, C. Finn, and T. Ma (2020) MOPO: model-based offline policy optimization. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 14129–14142.