Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

arXiv cs.LG Papers

Summary

This paper proposes Q-align DT, a framework that aligns return-to-go with Q-values to improve controllability and performance in offline reinforcement learning, achieving superior results on D4RL benchmarks.

arXiv:2605.29028v1 Announce Type: new Abstract: Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.

Cached at: 05/29/26, 09:15 AM

# Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning
Source: [https://arxiv.org/html/2605.29028](https://arxiv.org/html/2605.29028)
###### Abstract

Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-align DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-align DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-align DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.

Machine Learning, ICML

## 1 Introduction

Offline reinforcement learning (Offline RL) aims to learn an effective and robust policy that can be deployed without interacting with the environment, relying solely on pre-collected datasets (Fujimoto et al., [2019](https://arxiv.org/html/2605.29028#bib.bib31)). Recently, transformer architectures (Vaswani et al., [2017](https://arxiv.org/html/2605.29028#bib.bib20)), which have shown remarkable success in natural language processing (Devlin et al., [2019](https://arxiv.org/html/2605.29028#bib.bib21); Brown et al., [2020](https://arxiv.org/html/2605.29028#bib.bib36); Liu et al., [2019](https://arxiv.org/html/2605.29028#bib.bib37)) and computer vision (Dosovitskiy et al., [2021a](https://arxiv.org/html/2605.29028#bib.bib19); He et al., [2021](https://arxiv.org/html/2605.29028#bib.bib39); Dosovitskiy et al., [2021b](https://arxiv.org/html/2605.29028#bib.bib38); Liu et al., [2021](https://arxiv.org/html/2605.29028#bib.bib40)), have also been adopted in RL due to their powerful sequence modeling capabilities (Laskin et al., [2023](https://arxiv.org/html/2605.29028#bib.bib24); Lee et al., [2023](https://arxiv.org/html/2605.29028#bib.bib25); Grigsby et al., [2024](https://arxiv.org/html/2605.29028#bib.bib32)). Among these advances, Conditioned Sequence Models (CSMs) (Chen et al., [2021](https://arxiv.org/html/2605.29028#bib.bib8); Janner et al., [2021](https://arxiv.org/html/2605.29028#bib.bib26)) provide a new perspective by treating policy learning as a supervised sequence modeling problem conditioned on the desired performance. In particular, Decision Transformer (DT) introduces a Return-to-Go (RTG) token which enables the model to generate trajectories that achieve expected returns rather than imitating the behavioral policy.

However, how the actual return obtained by a CSM *aligns* with the target input RTG is a fundamental yet often overlooked property. Precise alignment enables a single model to represent a diverse family of policies and is essential for controllable robot behaviors with varying velocities (Tanaka et al., [2025](https://arxiv.org/html/2605.29028#bib.bib4)). Therefore, we would like to ask:

*How well can CSMs actually align with target RTGs?*

Unfortunately, recent studies (Kim et al., [2024b](https://arxiv.org/html/2605.29028#bib.bib5); Tanaka et al., [2025](https://arxiv.org/html/2605.29028#bib.bib4)) report that existing CSMs often exhibit significant insensitivity to RTG and fail to achieve proper alignment. As empirically demonstrated in [Figure 1](https://arxiv.org/html/2605.29028#S1.F1), we hypothesize that this failure stems from a lack of structural awareness: a robust CSM should capture the structure between different RTG targets and their corresponding behaviors instead of merely treating the RTGs as simple tokens. Specifically, a higher RTG should consistently correspond to trajectories with higher expected cumulative returns, ensuring this *partial order* between target RTGs and realized performance.

In offline RL, enforcing such a partial order is challenging since it can be infeasible to construct sufficient trajectories within the fixed dataset following the partial order. To address this challenge, in this paper, we propose Q-align DT, which introduces an auxiliary $Q$-function to provide dense guidance. With a novel *RTG-to-behavior* objective alongside an *RTG-perturbation* technique, our method encourages the model to output actions that precisely reflect the relative differences in desired RTGs. We further integrate this RTG perturbation into the $Q$-function updates during co-training, ensuring that the critic and the policy co-evolve toward a consistent, reward-sensitive behavior.

Building on our algorithm, we provide a theoretical analysis of the alignment properties of CSMs and conduct extensive experiments to evaluate our trained models. We find that Q-align DT learns a family of RTG-conditioned policies that actively respond to target shifts, rather than merely relying on static reward associations found in the training data. Moreover, we demonstrate that our model can be adapted to distinct tasks (e.g., HalfCheetah-Vel) while maintaining competitive performance and alignment, indicating its potential for generalization across diverse behaviors.

Overall, our contributions are threefold:

- We propose Q-align DT, which introduces a new RTG-to-behavior alignment objective to enforce consistency between the input RTG and policy behavior, significantly improving RTG-conditioned alignment. It further employs a co-training framework with RTG perturbations that provide a high-quality action space for $Q$-function learning, enabling bidirectional improvements.
- Theoretically, we show that Q-align DT improves alignment by restricting the policy class, and that the alignment objective ensures, under high-RTG conditioning, equivalence to maximizing the $Q$-function. Combined with RTG perturbations, this leads to convergence to a near-optimal in-distribution policy under mild assumptions.
- Extensive experiments show that Q-align DT consistently achieves competitive performance across a wide range of offline RL tasks while significantly improving RTG-conditioned alignment. Remarkably and somewhat surprisingly, we report that Q-align DT generalizes effectively to the challenging HalfCheetah-Vel task and attains competitive performance by only controlling the RTG signal with zero-shot transfer.

Code. Our code is available at [https://github.com/yangyuxiao-sjtu/Q-Align-DT](https://github.com/yangyuxiao-sjtu/Q-Align-DT).

![Refer to caption](https://arxiv.org/html/2605.29028v1/x1.png) (a) hopper-medium
![Refer to caption](https://arxiv.org/html/2605.29028v1/x2.png) (b) halfcheetah-medium
![Refer to caption](https://arxiv.org/html/2605.29028v1/x3.png) (c) walker2d-medium

Figure 1: Performance of Q-align DT (Q\_align) and other baseline models on D4RL tasks. Target RTGs are spaced at intervals of 100 in cumulative reward. We sample 30 trajectories for each target RTG and report the mean performance for each method.

## 2 Related Works

### 2.1 Offline Reinforcement Learning

Reinforcement Learning aims to train an agent to solve tasks through direct interaction with the environment, but such interaction is often expensive or impractical in domains such as robotics and healthcare. To address this limitation, Offline Reinforcement Learning learns policies from a fixed dataset collected by a behavior policy (Levine et al., [2020](https://arxiv.org/html/2605.29028#bib.bib45); Siegel et al., [2020](https://arxiv.org/html/2605.29028#bib.bib51); Jaques et al., [2019](https://arxiv.org/html/2605.29028#bib.bib50); Agarwal et al., [2020](https://arxiv.org/html/2605.29028#bib.bib49); Ernst et al., [2005](https://arxiv.org/html/2605.29028#bib.bib48)).

Despite its promise, naively applying online RL algorithms in an offline setting often leads to severe performance degradation due to out-of-distribution actions and extrapolation errors (Fujimoto et al., [2019](https://arxiv.org/html/2605.29028#bib.bib31); Kumar et al., [2019](https://arxiv.org/html/2605.29028#bib.bib34); Levine et al., [2020](https://arxiv.org/html/2605.29028#bib.bib45)). Existing work mitigates these issues through a range of techniques, including $Q$-value regularization (Wu et al., [2019](https://arxiv.org/html/2605.29028#bib.bib53); Kumar et al., [2020](https://arxiv.org/html/2605.29028#bib.bib12); Wang et al., [2020](https://arxiv.org/html/2605.29028#bib.bib54)) and behavior cloning–based objectives (Fujimoto and Gu, [2021](https://arxiv.org/html/2605.29028#bib.bib14)).

More recently, the fixed-dataset nature of Offline RL has motivated the adoption of Transformer-based architectures (Chen et al., [2021](https://arxiv.org/html/2605.29028#bib.bib8); Wu et al., [2023](https://arxiv.org/html/2605.29028#bib.bib23); Chebotar et al., [2023](https://arxiv.org/html/2605.29028#bib.bib57)), which enable efficient and fully parallelized supervised training over trajectories.

### 2.2 Conditioned Sequence Models

Conditioned Sequence Models (CSMs) (Chen et al., [2021](https://arxiv.org/html/2605.29028#bib.bib8); Janner et al., [2021](https://arxiv.org/html/2605.29028#bib.bib26)) cast RL in the supervised learning paradigm by predicting actions from historical state-action pairs $(s_{i},a_{i})$ and return-to-go (RTG) tokens $\text{rtg}_{i}$. In particular, the RTG is derived from the dataset as cumulative rewards during training, and is provided as a user-specified conditioning during inference. Although recent theoretical results suggest CSMs can recover target returns under ideal assumptions (Brandfonbrener et al., [2022](https://arxiv.org/html/2605.29028#bib.bib27); Lin et al., [2024](https://arxiv.org/html/2605.29028#bib.bib28); Furuta et al., [2022](https://arxiv.org/html/2605.29028#bib.bib29)), empirical studies (Kim et al., [2024b](https://arxiv.org/html/2605.29028#bib.bib5); Tanaka et al., [2025](https://arxiv.org/html/2605.29028#bib.bib4)) show that the information carried by the RTG is often under-utilized by CSMs, leading to poor alignment between the target behavior and desired RTGs.

To address this issue, RADT (Tanaka et al., [2025](https://arxiv.org/html/2605.29028#bib.bib4)) enhances RTG sensitivity through architectural modifications, at the cost of substantial computational and parameter overhead, by introducing an additional attention layer in each transformer block.

### 2.3 $Q$-Learning in Conditioned Sequence Models

$Q$-functions are widely used to improve CSMs. Early methods like QDT (Yamagata et al., [2023](https://arxiv.org/html/2605.29028#bib.bib22)) and CGDT (Wang et al., [2024](https://arxiv.org/html/2605.29028#bib.bib11)) leverage pretrained $Q$-functions for RTG relabeling and dataset-bias mitigation, while later approaches such as QT (Hu et al., [2024](https://arxiv.org/html/2605.29028#bib.bib2)), QCS (Kim et al., [2024a](https://arxiv.org/html/2605.29028#bib.bib1)), and TD3-ODT (Yan et al., [2024](https://arxiv.org/html/2605.29028#bib.bib3)) backpropagate $Q$-value gradients through predicted actions.

While $Q$-gradient methods achieve competitive peak performance, they often push the policy toward maximum-value actions regardless of the target RTG, degrading alignment as the policy collapses to relatively high-return regions within the data distribution ([Figure 1](https://arxiv.org/html/2605.29028#S1.F1)). Furthermore, although existing methods leverage $Q$-functions to improve CSMs, little attention has been paid to how CSMs can, in turn, benefit $Q$-function training through their alignment capabilities.

### 2.4 Multi-Task and Meta-Reinforcement Learning

Meta-Reinforcement Learning (Meta-RL) (Beck et al., [2025](https://arxiv.org/html/2605.29028#bib.bib58)) aims to quickly adapt an agent to new tasks with similar underlying structures (Finn et al., [2017](https://arxiv.org/html/2605.29028#bib.bib60); Duan et al., [2016](https://arxiv.org/html/2605.29028#bib.bib59)). A more challenging setting, Offline Meta-RL, considers scenarios where the agent must learn from a fixed dataset and is expected to generalize to unseen test tasks (Mitchell et al., [2021](https://arxiv.org/html/2605.29028#bib.bib16); Dorfman et al., [2021](https://arxiv.org/html/2605.29028#bib.bib61)). Owing to the in-context learning capabilities of Transformer architectures, recent works have increasingly deployed Transformers for such tasks (Xu et al., [2022](https://arxiv.org/html/2605.29028#bib.bib18); Lee et al., [2023](https://arxiv.org/html/2605.29028#bib.bib25)).

## 3 Preliminaries

We consider a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S},\mathcal{A},P,R,\gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, $P(s_{t+1}|s_{t},a_{t})$ denotes the transition probability, $R(s,a)$ is the reward function, and $\gamma\in(0,1]$ is the discount factor. Following the Decision Transformer framework (Chen et al., [2021](https://arxiv.org/html/2605.29028#bib.bib8)), we represent the input as a sequence of return-to-go (RTG), state, and action tokens:

$$\boldsymbol{\tau}=(\text{rtg}_{0},s_{0},a_{0},\ldots,\text{rtg}_{H},s_{H},a_{H}),$$

where $\text{rtg}_{t}=\sum_{i=t}^{H}r_{i}$ represents the cumulative future reward at time $t$. This RTG signal serves as the target return to condition the policy. To ensure computational efficiency, we employ a $k$-step context window, where the truncated sequence at time $t$ is defined as:

$$\boldsymbol{\tau}_{t}=(\text{rtg}_{t-k+1},s_{t-k+1},a_{t-k+1},\ldots,\text{rtg}_{t},s_{t},a_{t}).$$

The RTG token serves as the primary mechanism for conditioning the model's behavior. Ideally, a DT-based model should represent a family of policies indexed by the target RTG (Brandfonbrener et al., [2022](https://arxiv.org/html/2605.29028#bib.bib27)), expressed as:

$$\Pi_{\mathrm{DT}}=\{\pi_{z}\mid z\in\mathbb{R}_{+}\}. \tag{1}$$

In practice, the RTG tokens are usually controlled during inference to inform the desired behavior. To reflect this, we define a *modified RTG sequence* $\boldsymbol{\tau}_{t}^{g}$ for any scalar $g\in\mathbb{R}$ by shifting all RTG tokens in the context window by $g$:

$$\boldsymbol{\tau}_{t}^{g}=(\text{rtg}_{t-k+1}+g,s_{t-k+1},a_{t-k+1},\ldots,\text{rtg}_{t}+g,s_{t},a_{t}). \tag{2}$$
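
As a concrete illustration of the definitions above, the following minimal NumPy sketch computes $\text{rtg}_{t}$ as a suffix sum of rewards, extracts a $k$-step context window, and builds the shifted RTG sequence of Equation (2). The function names (`returns_to_go`, `context_window`, `shift_rtg`) and the toy trajectory are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def returns_to_go(rewards):
    """rtg_t = sum_{i=t}^{H} r_i, computed as an undiscounted suffix sum."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards[::-1])[::-1]

def context_window(rtg, states, actions, t, k):
    """Return the (rtg, state, action) slices for steps t-k+1 .. t (clipped at 0)."""
    lo = max(0, t - k + 1)
    return rtg[lo:t + 1], states[lo:t + 1], actions[lo:t + 1]

def shift_rtg(rtg_window, g):
    """Modified RTG sequence: every RTG token in the window is shifted by g."""
    return np.asarray(rtg_window, dtype=float) + g

# Toy trajectory (illustrative values only).
rewards = [1.0, 0.0, 2.0, 3.0]
states = np.arange(4)
actions = np.arange(4) * 10

rtg = returns_to_go(rewards)
win_rtg, win_s, win_a = context_window(rtg, states, actions, t=3, k=2)
print(rtg)                      # suffix sums of the rewards
print(shift_rtg(win_rtg, 10.0)) # window RTGs shifted by g = 10
```

Only the RTG tokens change under the shift; states and actions in the window are left as they are, which matches the form of Equation (2).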

## 4 Methods

Motivation. To investigate the interpretability of target RTG, we benchmark several variants of the Decision Transformer (Chen et al., [2021](https://arxiv.org/html/2605.29028#bib.bib8); Hu et al., [2024](https://arxiv.org/html/2605.29028#bib.bib2); Tanaka et al., [2025](https://arxiv.org/html/2605.29028#bib.bib4); Kim et al., [2024b](https://arxiv.org/html/2605.29028#bib.bib5)) and analyze their behavior under various RTG conditioning values. As illustrated in [Figure 1](https://arxiv.org/html/2605.29028#S1.F1), we consistently observe a significant gap between the target RTG and the actual rollout performance.

This misalignment is especially evident in the HalfCheetah environment, where the model exhibits almost no sensitivity to variations in the requested RTG. Such findings suggest that the RTG conditioning remains marginal in the model's decision process, directly motivating the development of Q-align DT to explicitly enforce RTG-behavior alignment.

### 4.1 Training Model with RTG-to-Behavior Alignment

From the above discussion, we aim to train a CSM variant that internalizes the *partial ordering* between return-to-go (RTG) and action quality. Ideally, this learning objective can be formulated as a constrained optimization problem:

$$\min_{\theta}L_{\text{SL}}(\theta),\quad\text{s.t. }\frac{\partial Q_{\psi}(s,\pi_{\theta}(s,\text{RTG}))}{\partial\text{RTG}}\geq 0, \tag{3}$$

where $L_{\text{SL}}(\theta)$ is the standard supervised learning loss ensuring the policy remains anchored to the offline dataset, while the constraint ensures the $Q$-value of the predicted action is monotonically increasing with respect to the input RTGs.

To preserve this monotonicity under the variations of RTGs, directly calculating the gradient of the critic $Q$ can be time-consuming due to the backpropagation through the $Q$ function and the decision transformer. To improve time efficiency, we consider a zeroth-order estimate of the gradient following Salimans et al. ([2017](https://arxiv.org/html/2605.29028#bib.bib42)); Bernstein et al. ([2018](https://arxiv.org/html/2605.29028#bib.bib41)). In particular, for any small enough perturbation $\delta$,

$$\frac{\partial Q_{\psi}(s,\pi_{\theta}(s,\text{RTG}))}{\partial\text{RTG}}\approx\frac{Q_{\psi}(s,\hat{a}^{\delta})-Q_{\psi}(s,\hat{a})}{\delta},$$

where $\hat{a}^{\delta}=\pi_{\theta}(s,\text{RTG}+\delta)$ and $\hat{a}=\pi_{\theta}(s,\text{RTG})$. To further mitigate the effect of the magnitude of $\delta$ and allow large perturbations $\delta$, we convert ([3](https://arxiv.org/html/2605.29028#S4.E3)) into the following objective:

$$\min_{\theta}L_{\text{SL}}(\theta),~\text{s.t.}~\mathrm{sgn}(\delta)\big(Q_{\psi}(s,\hat{a}^{\delta})-Q_{\psi}(s,\hat{a})\big)\geq 0. \tag{4}$$
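
To make the finite-difference estimate and the sign constraint in (4) concrete, here is a minimal sketch on a toy problem. The policy `toy_policy` and critic `toy_critic` are hypothetical closed-form stand-ins chosen so the derivative can be checked analytically; they are assumptions for illustration and do not represent the transformer policy or the learned critic in the paper.

```python
import numpy as np

def toy_policy(state, rtg):
    """Hypothetical RTG-conditioned policy: the action saturates as RTG grows."""
    return np.tanh(0.01 * rtg)

def toy_critic(state, action):
    """Hypothetical critic that increases linearly with the action."""
    return state + 2.0 * action

def q_difference(state, rtg, delta):
    """Q(s, a_delta) - Q(s, a) for actions predicted under RTG + delta and RTG."""
    q_shifted = toy_critic(state, toy_policy(state, rtg + delta))
    q_base = toy_critic(state, toy_policy(state, rtg))
    return q_shifted - q_base

def dq_drtg_estimate(state, rtg, delta):
    """Finite-difference (zeroth-order) estimate of dQ/dRTG."""
    return q_difference(state, rtg, delta) / delta

def sign_constraint_holds(state, rtg, delta):
    """Constraint of Eq. (4): sgn(delta) * (Q(s, a_delta) - Q(s, a)) >= 0."""
    return bool(np.sign(delta) * q_difference(state, rtg, delta) >= 0)

state, rtg = 0.3, 50.0
estimate = dq_drtg_estimate(state, rtg, delta=1e-3)
# Analytic derivative of 2 * tanh(0.01 * rtg) with respect to rtg.
analytic = 2.0 * 0.01 * (1.0 - np.tanh(0.01 * rtg) ** 2)
print(estimate, analytic)
# The sign form only checks direction, so large perturbations are usable too.
print(sign_constraint_holds(state, rtg, 25.0), sign_constraint_holds(state, rtg, -25.0))
```

The estimate needs only two forward evaluations of the policy and critic, and the sign form does not divide by $\delta$, which is why it tolerates perturbations of any magnitude.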
To this end, we solve ([4](https://arxiv.org/html/2605.29028#S4.E4)) by relaxing the constraint into a Lagrangian-style penalty, which we call the *alignment loss* $L_{\text{Align}}$. In particular, given an input sequence $\boldsymbol{\tau}_{t}$ and its perturbed version $\boldsymbol{\tau}_{t}^{\delta}$ generated by adding a sequence-level noise $\delta\sim\mathcal{N}(0,\sigma_{e}^{2})$, the alignment loss $L_{\text{Align}}$ is defined by:

$$L_{\text{Align}}=\sum_{i=t-k+1}^{t}\mathbb{I}_{\mathcal{C}}\cdot\big|Q_{\psi}(s_{i},\hat{a}_{i}^{\delta})-Q_{\psi}^{\perp}(s_{i},\hat{a}_{i})\big|, \tag{5}$$

where $\hat{a}_{i}$ and $\hat{a}_{i}^{\delta}$ are the $i$-th predicted actions conditioned on $\boldsymbol{\tau}_{t}$ and the modified RTG sequence $\boldsymbol{\tau}_{t}^{\delta}$ defined by ([2](https://arxiv.org/html/2605.29028#S3.E2)); $Q^{\perp}_{\psi}$ denotes the critic under the stop-gradient operator, which is widely used in $Q$-based algorithms (Sutton and Barto, [2018](https://arxiv.org/html/2605.29028#bib.bib43)) and treats the original $Q$-value as a fixed reference to stabilize training; and the indicator function $\mathbb{I}_{\mathcal{C}}$ detects the constraint violation in ([4](https://arxiv.org/html/2605.29028#S4.E4)):

$$\mathbb{I}_{\mathcal{C}}=\begin{cases}1,&\text{if }\mathrm{sgn}(\delta)\big(Q_{\psi}(s_{i},\hat{a}_{i}^{\delta})-Q_{\psi}^{\perp}(s_{i},\hat{a}_{i})\big)<0\\ 0,&\text{otherwise}\end{cases}.$$

For simplicity, let $Q^{\delta}_{i}=Q_{\psi}(s_{i},\hat{a}_{i}^{\delta})$ and $Q_{i}=Q^{\perp}_{\psi}(s_{i},\hat{a}_{i})$. The indicator activates when the RTG perturbation and the induced critic change have inconsistent directions, i.e., $\mathrm{sgn}(\delta)(Q^{\delta}_{i}-Q_{i})<0$. This results in a directional ranking penalty that corrects violations of the RTG–value monotonicity while leaving already ordered pairs unchanged. In the context of offline RL, where $Q$-functions trained on static datasets are prone to scale bias and local inaccuracies (Kumar et al., [2019](https://arxiv.org/html/2605.29028#bib.bib34)), this yields a *conservative alignment* objective by enforcing relative ordering rather than exploiting potentially inaccurate value magnitudes that could push the policy toward out-of-distribution actions (Fujimoto et al., [2019](https://arxiv.org/html/2605.29028#bib.bib31)).

Together with the joint state-action prediction loss $L_{\text{SL}}(\theta)=\sum_{i=t-k+1}^{t}\|s_{i}-\hat{s}_{i}\|^{2}+\|a_{i}-\hat{a}_{i}\|^{2}$, which follows the decision transformer's causal structure and input RTGs, this alignment objective encourages the model to internalize the RTG-action-$Q$ mapping and regularizes the policy to respect a partial RTG ordering. The overall training objective can be summarized as

$$\mathcal{L}_{\text{total}}(\theta)=L_{\text{SL}}(\theta)+\lambda_{e}\,L_{\text{Align}}(\theta), \tag{6}$$

where $\lambda_{e}$ controls the relative weight of the alignment constraint. This formulation allows Q-align DT to learn a coherent family of policies that not only imitate the dataset but are also systematically steerable via RTG conditioning.
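
The forward computation of the alignment loss (5) and the total objective (6) can be sketched as follows. This is a NumPy illustration of the arithmetic only, under the assumption that the critic values for the perturbed and unperturbed actions have already been computed; the names `alignment_loss` and `total_loss` are hypothetical. In an autodiff framework the reference values would additionally be detached to realize the stop-gradient, which plain NumPy does not model.

```python
import numpy as np

def alignment_loss(q_perturbed, q_reference, delta):
    """Sum of |Q^delta_i - Q_i| over positions where sgn(delta) * (Q^delta_i - Q_i) < 0.

    q_perturbed: critic values for actions predicted under the shifted RTGs.
    q_reference: critic values for the unperturbed actions (treated as constants).
    delta:       the scalar sequence-level RTG perturbation.
    """
    q_perturbed = np.asarray(q_perturbed, dtype=float)
    q_reference = np.asarray(q_reference, dtype=float)
    diff = q_perturbed - q_reference
    violated = np.sign(delta) * diff < 0      # indicator of a constraint violation
    return float(np.sum(np.abs(diff) * violated))

def total_loss(sl_loss, align_loss, lambda_e):
    """L_total = L_SL + lambda_e * L_Align."""
    return sl_loss + lambda_e * align_loss

q_ref = [1.0, 2.0, 3.0]
q_pert = [1.5, 1.0, 3.0]

# Positive perturbation: only the position where Q dropped is penalized.
print(alignment_loss(q_pert, q_ref, delta=2.0))
# Negative perturbation: only the position where Q rose is penalized.
print(alignment_loss(q_pert, q_ref, delta=-2.0))
print(total_loss(sl_loss=0.25, align_loss=1.0, lambda_e=0.5))
```

Pairs that already respect the ordering contribute zero, so the penalty acts on direction rather than on the magnitude of the critic values.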

### 4.2 Training the $Q$ Function with RTG Perturbation

In Q-align DT, a $Q$-function is required to compute the alignment loss. If the critic is either pretrained solely on the behavior policy or naively co-trained using actions generated by the policy conditioned on $\boldsymbol{\tau}_{t}$, it tends to merely reflect the offline data distribution and remains anchored to suboptimal returns. When the critic remains stationary, the alignment loss defined in ([6](https://arxiv.org/html/2605.29028#S4.E6)) becomes self-limiting, as it penalizes the policy for attempting to exceed the RTGs present in the static dataset. In practice, this mismatch not only limits alignment in high-RTG regions, but can also degrade alignment across the entire RTG spectrum (see [Table E.10](https://arxiv.org/html/2605.29028#A5.T10)).

To ensure the critic accurately reflects the policy's behavior across the RTG spectrum, we train the critic to satisfy Bellman consistency with respect to the current policy's response under RTG perturbations. Specifically, the critic $Q_{\psi}$ is optimized to minimize:

$$L_{q}(\psi_{1,2})=\sum_{i=t-k+1}^{t-1}\sum_{m=1}^{2}\bigl(Q_{\psi_{m}}(s_{i},a_{i})-y_{i}^{\prime}\bigr)^{2}, \tag{7}$$

with the target value $y_{i}^{\prime}$ being defined as:

$$y_{i}^{\prime}=r_{i}+\gamma\min_{m=1,2}Q_{\psi_{m}^{\prime}}^{\perp}(s_{i+1},\,\hat{a}_{i+1}^{\prime,\Delta\mathrm{RTG}}),$$

where $\hat{a}_{i+1}^{\prime,\Delta\mathrm{RTG}}$ denotes the action predicted by the target policy $\pi_{\theta^{\prime}}$ at the $(i+1)$-th step conditioned on the perturbed trajectory sequence $\boldsymbol{\tau}_{t}^{\Delta\mathrm{RTG}}$ and $Q_{\psi^{\prime}}$ is the target critic. The offset $\Delta\mathrm{RTG}$ is a positive scalar that serves as a fixed trajectory-level perturbation. Conditioning on the RTG-perturbed trajectory $\boldsymbol{\tau}_{t}^{\Delta\mathrm{RTG}}$, the policy $\pi_{\theta^{\prime}}$ induces a higher-return action support for the critic $Q_{\psi}$, encouraging the critic to reflect higher-return behaviors within the RTG-conditioned policy family. As shown in our analysis in Sec. [5.2](https://arxiv.org/html/2605.29028#S5.SS2), $\Delta\mathrm{RTG}$ in the co-training mechanism effectively acts as a control parameter for the critic learning dynamics.

In particular, increasing $\Delta\mathrm{RTG}$ biases the RTG-conditioned policy toward higher-return actions, generating more informative targets for critic learning. Through the alignment loss, these improved value estimates are fed back to the actor, enabling a positive actor–critic feedback loop that facilitates policy improvement while remaining grounded in the support of the offline data, rather than merely fitting static, suboptimal return averages.
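
The critic update in (7) can be sketched with toy stand-ins as follows. Here `target_policy` and `target_critics` are hypothetical closed-form functions replacing the target transformer $\pi_{\theta^{\prime}}$ and the two target critics, and the default `gamma=0.99` is an assumed value; the sketch only illustrates how the RTG offset enters the Bellman target through the next action.

```python
import numpy as np

def target_policy(state, rtg):
    """Hypothetical stand-in for the target policy pi_theta'."""
    return np.tanh(0.01 * rtg)

def target_critics(state, action):
    """Two hypothetical target critics Q_psi1' and Q_psi2'."""
    return state + action, state + 0.9 * action

def critic_target(reward, next_state, next_rtg, delta_rtg, gamma=0.99):
    """y' = r + gamma * min_m Q_m'(s', a'), with a' predicted under RTG + delta_rtg."""
    next_action = target_policy(next_state, next_rtg + delta_rtg)
    q1, q2 = target_critics(next_state, next_action)
    return reward + gamma * np.minimum(q1, q2)

def critic_loss(q1_pred, q2_pred, target):
    """Squared Bellman error summed over both critics and all positions."""
    q1_pred = np.asarray(q1_pred, dtype=float)
    q2_pred = np.asarray(q2_pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sum((q1_pred - target) ** 2 + (q2_pred - target) ** 2))

# Same transition, evaluated without and with the RTG offset.
y_plain = critic_target(reward=1.0, next_state=0.2, next_rtg=50.0, delta_rtg=0.0)
y_shift = critic_target(reward=1.0, next_state=0.2, next_rtg=50.0, delta_rtg=100.0)
print(y_plain, y_shift)
print(critic_loss([1.0, 2.0], [1.5, 2.0], [1.0, 3.0]))
```

In this toy setup the shifted target is larger because the stand-in policy and critics are monotone by construction; whether that holds for a learned model depends on how well the alignment objective has been satisfied.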

Combining the Q-align DT objective with *RTG-perturbation* when co-training the $Q$ function and the decision transformer, we present the full algorithm as [Algorithm 1](https://arxiv.org/html/2605.29028#alg1) in Appendix [A](https://arxiv.org/html/2605.29028#A1).

## 5 Theoretical Analysis

In this section, we analyze the alignment error of CSMs and the policy behavior of Q-align DT under data support constraints. We defer the detailed proof to Appendix [B](https://arxiv.org/html/2605.29028#A2).

### 5.1 Why Q-align DT Reduces Alignment Error

Our analysis builds on the following finite-sample analysis for DT presented in Brandfonbrener et al. ([2022](https://arxiv.org/html/2605.29028#bib.bib27)):

###### Lemma 5.1 (Theorem 1 and Corollary 3, Brandfonbrener et al. [2022](https://arxiv.org/html/2605.29028#bib.bib27)).

For a sample size $N$, a finite policy class $\Pi$, and an MDP with horizon $H$, let $\hat{\pi}\in\Pi$ be the empirical risk minimizer that minimizes the training loss. With probability at least $1-\delta_{p}$, the total alignment error satisfies

$$\text{RTG}_{\text{tgt}}-\mathbb{E}_{\boldsymbol{\tau}\sim\hat{\pi}_{\text{RTG}_{\text{tgt}}}}[\text{RTG}_{\text{real}}]\leq\mathcal{O}\bigg(H^{2}\frac{\log(|\Pi|/\delta_{p})^{1/4}}{N^{1/4}}\bigg).$$

Crucially, the bound shows that alignment error scales with $(\log|\Pi|)^{1/4}$. In unconstrained sequence modeling, the policy class $\Pi$ can be very large and therefore lead to a significant alignment gap. In contrast, Q-align DT imposes an order-preserving bias that encourages $Q(s,\pi(s,g))$ to be monotonically increasing with respect to the RTG $g$. As the following theorem shows, enforcing this monotonicity effectively reduces the policy class $\Pi$.

###### Theorem 5.1 (Policy Space Reduction in Discrete Case).

Consider discrete state and action spaces $S,A$ and a set of RTG values $G=\{g_{1}<\dots<g_{|G|}\}$. Let $\Pi_{\rm free}$ be the class of all deterministic policies $\pi:S\times G\to A$. Assume that for each state $s$, the function $Q(s,\cdot):A\to\mathcal{V}_{s}\subset\mathbb{R}$ induces an ordering over actions. A policy $\pi$ is admissible in the constrained class $\Pi_{\rm mono}$ if it preserves this order across RTG levels. Then the log-complexity satisfies:

$$\log|\Pi_{\rm mono}|\leq|S||G|\log\frac{(|G|+|A|-1)}{|G|}+|S||G|.$$

Furthermore, assuming each $Q$-value is shared by at most $C$ actions and the resolution of the value space $|\mathcal{V}_{s}|$ is of the same order as the RTG resolution $|G|$ (i.e., $|\mathcal{V}_{s}|=\Theta(|G|)$), we have $\log|\Pi_{\rm mono}|=\tilde{O}(|S||G|)$.

Compared to the unconstrained case where $\log|\Pi_{\rm free}|=O(|S||G|\log|A|)$, Theorem [5.1](https://arxiv.org/html/2605.29028#S5.Thmtheorem1) illustrates that Q-align DT utilizes the directional information of the $Q$ function to effectively eliminate the $\log|A|$ dependence in the complexity term. This significant reduction in the hypothesis space provides a theoretical intuition for why Q-align DT can significantly diminish alignment errors observed in practice.
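
The counting behind this bound can be checked numerically for a single state. The sketch below makes a simplifying assumption that is not stated in the theorem: all actions have distinct $Q$-values, so an order-preserving policy is a non-decreasing map from the $|G|$ RTG levels to the $|A|$ actions ranked by $Q$. Under that assumption the count is the stars-and-bars number $\binom{|G|+|A|-1}{|G|}$, and the standard inequality $\binom{n}{k}\leq(en/k)^{k}$ gives the per-state form of the bound. All function names are hypothetical.

```python
import math
from itertools import product

def count_monotone_bruteforce(num_rtg, num_actions):
    """Count non-decreasing maps from RTG levels to Q-ranked actions by enumeration."""
    return sum(
        all(seq[i] <= seq[i + 1] for i in range(num_rtg - 1))
        for seq in product(range(num_actions), repeat=num_rtg)
    )

def count_monotone_closed_form(num_rtg, num_actions):
    """Stars-and-bars count of non-decreasing sequences of length num_rtg."""
    return math.comb(num_rtg + num_actions - 1, num_rtg)

def log_bound_per_state(num_rtg, num_actions):
    """|G| * log((|G| + |A| - 1) / |G|) + |G|, i.e. the bound with |S| = 1."""
    return num_rtg * math.log((num_rtg + num_actions - 1) / num_rtg) + num_rtg

# Small case: enumeration agrees with the closed form.
print(count_monotone_bruteforce(3, 4), count_monotone_closed_form(3, 4))

# Larger action space: compare log-counts for |G| = 10, |A| = 1000.
num_rtg, num_actions = 10, 1000
log_mono = math.log(count_monotone_closed_form(num_rtg, num_actions))
log_free = num_rtg * math.log(num_actions)   # log |A|^|G| for unconstrained policies
print(log_mono, log_bound_per_state(num_rtg, num_actions), log_free)
```

Because states are constrained independently, multiplying the per-state quantities by $|S|$ recovers the form stated in the theorem. The bound is loose for very small action spaces and becomes informative as $|A|$ grows relative to $|G|$.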

### 5.2 Policy Optimality under High-RTG Conditioning

As the alignment loss $L_{\text{Align}}$ enables the partial ordering between the RTG and action quality, we would like to examine how the learned policy behaves when conditioned on extremely large RTG values. This setting is particularly important in practice, as inputting sufficiently large RTGs is a standard evaluation protocol for CSMs.

To better understand this regime, the following theorem establishes the asymptotic equivalence of Q-align DT and QT (Hu et al., [2024](https://arxiv.org/html/2605.29028#bib.bib2)), where QT explicitly optimizes for action-value maximization under fixed RTG.

###### Theorem 5.2 (Equivalence in High-RTG Regime).

For a fixed RTG $R$, let $\pi_{\theta}(s,R)$ be the policy learned by Q-align DT and $\pi_{\text{QT}}(s,R)$ be the policy that minimizes the following QT objective (Hu et al., [2024](https://arxiv.org/html/2605.29028#bib.bib2)) using a fixed $Q$ function:

$$\mathcal{L}_{\text{QT}}(\pi,R)=\underset{(s,a)\sim\mathcal{D}}{\mathbb{E}}[\|\pi(s,R)-a\|^{2}-\eta Q(s,\pi(s,R))].$$

Assuming the dataset $\mathcal{D}$ has full support over the action space, then as $R\to\infty$, Q-align DT recovers the behavior of an idealized QT that pursues maximum value:

$$\lim_{R\to\infty}\pi_{\theta}(s,R)=\lim_{R\to\infty}\pi_{\text{QT}}(s,R)=\arg\max_{a\in\mathcal{A}}Q(s,a).$$

The next theorem further characterizes the effectiveness of the RTG-perturbation technique in learning the $Q$ function used in [Theorem 5.2](https://arxiv.org/html/2605.29028#A5.EGx13). In particular, it shows different asymptotic behaviors under different choices of the $\Delta\mathrm{RTG}$ used in RTG-perturbation.

###### Theorem 5.3 (Impact of $\Delta\mathrm{RTG}$).

Let $\{Q_{m},\pi_{m}\}_{m}$ be the sequences produced by Q-align DT. Consider the critic update using a perturbed conditioning signal $\tilde{R}=\mathrm{RTG}+\Delta\mathrm{RTG}$. Then the following statements hold:

(No Perturbation). If $\Delta\mathrm{RTG}=0$, the updates remain within the behavior policy's action support, leading to convergence of $\{Q_{m}\}_{m}$ to a value function close to $Q_{\beta}$, corresponding to conservative evaluation within the behavior support.

(Large Perturbation). If $\Delta\mathrm{RTG}$ is sufficiently large such that $\tilde{R}$ exceeds all returns in $\mathcal{D}$, then $\{Q_{m}\}_{m}$ converges to the optimal action-value function $Q^{*}$ and the induced policies $\{\pi_{m}(s,\tilde{R})\}_{m}$ converge to the corresponding optimal policy $\pi^{*}$, both restricted to the support of $\mathcal{D}$.
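One way to picture the two regimes is through the bootstrapped critic target. The sketch below assumes a one-step TD target whose next action comes from the RTG-conditioned policy evaluated at the perturbed signal. That form, and the names `perturbed_td_target`, `policy`, and `target_q`, are assumptions made for illustration, since this section does not spell out the exact update.

```python
from typing import Callable


def perturbed_td_target(
    reward: float,
    next_state: float,
    next_rtg: float,
    delta_rtg: float,
    policy: Callable[[float, float], float],
    target_q: Callable[[float, float], float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """One-step TD target with the next action taken at R~ = RTG + delta_rtg."""
    if done:
        return reward
    r_tilde = next_rtg + delta_rtg  # perturbed conditioning signal
    next_action = policy(next_state, r_tilde)
    return reward + gamma * target_q(next_state, next_action)


# Toy stand-ins (hypothetical): the action grows with the RTG and saturates
# at 1, and the critic value grows linearly with the action.
def toy_policy(state: float, rtg: float) -> float:
    return min(rtg, 1.0)


def toy_critic(state: float, action: float) -> float:
    return 10.0 * action


no_shift = perturbed_td_target(1.0, 0.0, 0.2, 0.0, toy_policy, toy_critic, gamma=0.5)
big_shift = perturbed_td_target(1.0, 0.0, 0.2, 5.0, toy_policy, toy_critic, gamma=0.5)
```

In this toy, `delta_rtg = 0` bootstraps from the action selected by the data-like RTG, while a large offset bootstraps from the highest-value action the policy can produce. This loosely mirrors the two cases of the theorem and is not a substitute for its proof.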

Table 1: Performance (D4RL normalized score $\uparrow$) of Q-align DT and other state-of-the-art baselines on Gym domains. Results are averaged over five random seeds, and we report the mean $\pm$ standard error. Boldface numbers denote the highest or comparable scores among the algorithms.

| Dataset | IQL | TD3+BC | DT | CGDT | LSDT | DC | DM | RADT | QT | QCS | Q-align DT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| halfcheetah-medium-replay | 44.1 | 44.6 | 36.6 | 40.4 | 42.9 | 41.3 | 39.6 | 41.3 | 48.9 | 54.1 | 57.1 $\pm$ 0.74 |
| hopper-medium-replay | 92.1 | 60.9 | 82.7 | 93.4 | 93.9 | 94.2 | 95.4 | 95.7 | 102 | 100.4 | 102.2 $\pm$ 0.64 |
| walker2d-medium-replay | 73.7 | 81.8 | 79.4 | 78.1 | 74.7 | 76.6 | 85.5 | 75.9 | 98.5 | 94.1 | 101.3 $\pm$ 0.73 |
| halfcheetah-medium | 47.4 | 48.3 | 42.6 | 43 | 43.6 | 43 | 43.5 | – | 51.4 | 59 | 65.3 $\pm$ 0.63 |
| hopper-medium | 63.8 | 59.3 | 67.6 | 96.9 | 87.2 | 92.5 | 98.1 | – | 96.9 | 96.4 | 102.1 $\pm$ 0.74 |
| walker2d-medium | 79.9 | 83.7 | 74 | 79.1 | 81 | 79.2 | 83.8 | – | 88.8 | 88.2 | 94.7 $\pm$ 0.67 |
| halfcheetah-medium-expert | 86.7 | 90.7 | 86.8 | 93.6 | 93.2 | 93 | 93.9 | 93.1 | 96.1 | 93.3 | 98.8 $\pm$ 0.68 |
| hopper-medium-expert | 91.5 | 98 | 107.6 | 107.6 | 111.7 | 110.4 | 111.8 | 110.4 | 113.4 | 110.2 | 114.0 $\pm$ 0.18 |
| walker2d-medium-expert | 109.6 | 110.1 | 108.1 | 109.3 | 109.8 | 109.6 | 112.7 | 109.7 | 112.6 | 116.6 | 121.4 $\pm$ 0.52 |
| Sum | 688.8 | 677.4 | 685.4 | 741.4 | 738 | 739.8 | 764.3 | – | 808.6 | 812.3 | 856.9 |

Table 2: Performance (D4RL normalized score $\uparrow$) of Q-align DT and baseline methods on AntMaze. Results averaged over five seeds. Boldface numbers denote the highest or comparable scores among the algorithms.

## 6 Experiments

We evaluate Q-align DT mainly on the D4RL benchmark (Fu et al., [2021](https://arxiv.org/html/2605.29028#bib.bib15)), including the Gym (Hopper, HalfCheetah, Walker2d) and AntMaze (umaze, medium) tasks.

### 6.1 Implementation Details

We compare Q-align DT with a diverse set of baselines, including both value-based methods and CSM methods. For value-based approaches, we consider IQL (Kostrikov et al., [2022](https://arxiv.org/html/2605.29028#bib.bib9)), TD3+BC (Fujimoto and Gu, [2021](https://arxiv.org/html/2605.29028#bib.bib14)), and CQL (Kumar et al., [2020](https://arxiv.org/html/2605.29028#bib.bib12)). For CSM-based methods, we include DT (Chen et al., [2021](https://arxiv.org/html/2605.29028#bib.bib8)), DC (Kim et al., [2024b](https://arxiv.org/html/2605.29028#bib.bib5)), RVS (Emmons et al., [2022](https://arxiv.org/html/2605.29028#bib.bib35)), CGDT (Wang et al., [2024](https://arxiv.org/html/2605.29028#bib.bib11)), LSDT (Wang et al., [2025](https://arxiv.org/html/2605.29028#bib.bib7)), DM (Zheng et al., [2025](https://arxiv.org/html/2605.29028#bib.bib6)), RADT (Tanaka et al., [2025](https://arxiv.org/html/2605.29028#bib.bib4)), QT (Hu et al., [2024](https://arxiv.org/html/2605.29028#bib.bib2)), and QCS (Kim et al., [2024a](https://arxiv.org/html/2605.29028#bib.bib1)).

For training stability, we pretrain the $Q$-function on the offline dataset. To decouple the performance of RTG-to-behavior alignment from the effects of critic pretraining, we employ a standard Double $Q$-learning update (Hasselt et al., [2016](https://arxiv.org/html/2605.29028#bib.bib13)) across most environments, with the exception of antmaze, where IQL (Kostrikov et al., [2022](https://arxiv.org/html/2605.29028#bib.bib9)) is adopted due to its superior performance in sparse-reward tasks (see Appendix [D.2](https://arxiv.org/html/2605.29028#A4.SS2) for details).

Motivated by prior observations that vanilla attention underutilizes RTG tokens (Kim et al., [2024b](https://arxiv.org/html/2605.29028#bib.bib5); Tanaka et al., [2025](https://arxiv.org/html/2605.29028#bib.bib4)), we empirically introduce a lightweight convolutional projection for attention inputs. As shown in [Table 4](https://arxiv.org/html/2605.29028#S6.T4), this does not affect peak performance but improves alignment behavior; architectural details and ablations are provided in Appendix [D.1](https://arxiv.org/html/2605.29028#A4.SS1).

### 6.2 Metric

We follow standard offline RL evaluation and report D4RL normalized scores averaged over five seeds, with 50 rollouts per seed \(100 for AntMaze\)\.
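For reference, D4RL normalizes a raw episode return against per-environment random and expert reference returns and scales the result to 100. A minimal sketch of that normalization, plus a mean and standard error across seeds, is given below. The function names are illustrative, the numbers in the usage lines are placeholders rather than values from the paper, and using the sample standard deviation for the standard error is an assumption.

```python
import math
import statistics
from typing import Sequence, Tuple


def d4rl_normalized_score(
    raw_return: float, random_return: float, expert_return: float
) -> float:
    """Map a raw return to the 0 (random) to 100 (expert) D4RL scale."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


def mean_and_stderr(per_seed_scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean across seeds (needs >= 2 seeds)."""
    mean = statistics.mean(per_seed_scores)
    stderr = statistics.stdev(per_seed_scores) / math.sqrt(len(per_seed_scores))
    return mean, stderr


# Placeholder reference returns and per-seed scores, for illustration only.
half_way = d4rl_normalized_score(raw_return=50.0, random_return=0.0, expert_return=100.0)
seed_mean, seed_se = mean_and_stderr([1.0, 2.0, 3.0])
```

Each per-seed score would itself be the average normalized score over that seed's evaluation rollouts.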

To evaluate *alignment* error, we measure how accurately a policy tracks a prescribed RTG target. For each environment, we sweep target RTGs from the minimum to maximum D4RL return in increments of $100$, execute $30$ rollouts per target, and record the resulting normalized scores. Let $\text{score}_{\mathrm{tgt},j}$ and $\text{score}_{\mathrm{real},j}$ denote the $j$-th target and achieved scores. For $n$ target RTGs, we evaluate the alignment error by the root mean squared deviation

$$M=\sqrt{\tfrac{1}{n}\textstyle\sum_{j=1}^{n}\left(\text{score}_{\mathrm{real},j}-\text{score}_{\mathrm{tgt},j}\right)^{2}},\tag{8}$$

where lower $M$ indicates better RTG-to-behavior alignment.
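The alignment metric in Eq. (8) is a plain root mean squared deviation between achieved and target scores. A minimal sketch, assuming the per-target achieved scores have already been averaged over rollouts; the helper names `rtg_sweep` and `alignment_rmse` and the sample numbers are illustrative:

```python
import math
from typing import List, Sequence


def rtg_sweep(min_return: float, max_return: float, step: float = 100.0) -> List[float]:
    """Target RTGs from min_return up to max_return (inclusive) in fixed steps."""
    count = int((max_return - min_return) // step) + 1
    return [min_return + i * step for i in range(count)]


def alignment_rmse(achieved: Sequence[float], targets: Sequence[float]) -> float:
    """Root mean squared deviation between achieved and target scores."""
    if len(achieved) != len(targets) or len(targets) == 0:
        raise ValueError("achieved and targets must be non-empty and equally long")
    squared = [(a - t) ** 2 for a, t in zip(achieved, targets)]
    return math.sqrt(sum(squared) / len(squared))


sweep = rtg_sweep(0.0, 300.0)                            # four target RTGs
example_m = alignment_rmse([13.0, 16.0], [10.0, 20.0])   # sqrt((9 + 16) / 2)
```

A policy that hits every target exactly gives `alignment_rmse(...) == 0.0`. A policy that ignores the RTG and always lands on the same score is penalized most at the ends of the sweep.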

### 6.3 Main Results

We first present the performance of Q-align DT on Gym domains and AntMaze tasks in [Tables 1](https://arxiv.org/html/2605.29028#S5.T1) and [2](https://arxiv.org/html/2605.29028#S5.T2). The results show that Q-align DT consistently achieves state-of-the-art performance, and in many cases surpasses existing methods across the datasets. This demonstrates that, although our method is primarily motivated by alignment, it can attain high scores when conditioned on high RTG, as is typical for CSMs.

We also report alignment metrics in [Table 3](https://arxiv.org/html/2605.29028#S6.T3). Q-align DT achieves strong alignment across the evaluated Gym domains, validating the effectiveness of our proposed approach. Notably, in HalfCheetah, where many CSMs tend to produce similar behaviors for different RTGs, our method reliably aligns with the target RTG.

Table 3: Alignment performance (Root Mean Squared Error $\downarrow$) of Q-align DT and other state-of-the-art baselines on Gym domains. Results are averaged over five random seeds.

### 6.4 Ablation Study

In this section, we present key ablation studies for Q-align DT. Additional experiments, including hyperparameter variations, alternative indicator functions, architectural components, and training procedures, are reported in Appendix [E.1](https://arxiv.org/html/2605.29028#A5.SS1).

#### Selection of $\Delta\mathrm{RTG}$ in RTG-perturbation.

To evaluate the impact of different $\Delta\mathrm{RTG}$ values, we conduct experiments on halfcheetah-medium with normalized $\Delta\mathrm{RTG}\in\{0,1,3,5,10\}$ (i.e., actual RTG normalized by $1000$). We report both the D4RL score and the alignment error in [Figure 2](https://arxiv.org/html/2605.29028#S6.F2). Consistent with our analysis, increasing $\Delta\mathrm{RTG}$ generally improves overall performance, while the alignment error increases mildly.

![Refer to caption](https://arxiv.org/html/2605.29028v1/x4.png)

Figure 2: Effect of normalized RTG offset $\Delta\text{RTG}$ (actual RTG divided by 1000) on policy performance (D4RL score) and alignment error in halfcheetah-medium.

Interestingly, alignment error increases as the RTG offset approaches zero, with most degradation occurring in the low-RTG regime. This effect is closely related to early-timestep collapse and is exacerbated when the $Q$-function is fixed after pretraining (Appendix [E.5](https://arxiv.org/html/2605.29028#A5.SS5)). We hypothesize that this occurs because the critic remains close to $Q_{\beta}$. In low-RTG regions, this leads to weak and noisy guidance due to the highly multimodal nature of low-return behaviors and their limited value separation. Consequently, RTG perturbations are important not only for high-RTG performance, but also for maintaining reliable alignment across the RTG spectrum.

On the other hand, larger RTG offsets target higher-return behaviors but also increase distributional shift, potentially destabilizing $Q$-function learning. To mitigate this, we restrict $\Delta\mathrm{RTG}$ to a moderate range; detailed values and experimental settings are reported in Appendix [D.5](https://arxiv.org/html/2605.29028#A4.SS5).

#### Effect of the Convolution Layer\.

We ablate the 1D convolution layer in our architecture by reverting the model to the standard Decision Transformer. As shown in [Table 4](https://arxiv.org/html/2605.29028#S6.T4), while the convolution layer has a marginal impact on the best achievable performance, it substantially reduces the alignment error. A possible explanation is that the convolution operation enables the attention mechanism to better extract and propagate information from the RTG token, thereby providing more informative alignment signals.

Table 4: Effect of the convolution layer. Align. RMSE ($\downarrow$) denotes the alignment error, Perf. ($\uparrow$) denotes the overall performance (D4RL normalized score). Results averaged over three seeds.

## 7 Behavioral Analysis and Generalization

While D4RL rewards are scalar, they often aggregate multiple components. For instance, halfcheetah uses the difference between the forward-speed term *forward\_reward* and *ctrl\_cost* as the reward. It is therefore natural to ask whether the model can meaningfully adjust its behavior conditioned on different target RTGs even when trained only on scalar returns.

To answer this, we evaluate rollouts across a range of target RTGs. Higher RTGs consistently induce faster locomotion, while lower RTGs lead to more cautious movement ([Figure 3](https://arxiv.org/html/2605.29028#S7.F3)). A finer-grained analysis of instantaneous velocity along representative trajectories ([Figure 4](https://arxiv.org/html/2605.29028#S7.F4)) shows that the agent adaptively adjusts gait and speed rather than following a single policy to satisfy the specified reward constraints. These results suggest that Q-align DT has captured a semantically meaningful mapping from RTG tokens to the underlying action space. Additional ant experiments are provided in Appendix [E.3](https://arxiv.org/html/2605.29028#A5.SS3) with similar observations.

![Refer to caption](https://arxiv.org/html/2605.29028v1/x5.png)

Figure 3: Mean forward velocity of the agent conditioned on varying target RTGs across different datasets on halfcheetah.

![Refer to caption](https://arxiv.org/html/2605.29028v1/x6.png)

Figure 4: Agent velocity over timesteps for different target RTGs on halfcheetah-medium-expert, averaged over 30 runs.

#### Zero-shot transfer to Meta-RL tasks.

With this observation, we further challenge the model's alignment robustness through a zero-shot transfer to halfcheetah-vel (Mitchell et al., [2021](https://arxiv.org/html/2605.29028#bib.bib16)). We compare our approach against two distinct categories of baselines: (i) In-domain Meta-RL methods, including Prompt-DT (Xu et al., [2022](https://arxiv.org/html/2605.29028#bib.bib18)), MACAW (Mitchell et al., [2021](https://arxiv.org/html/2605.29028#bib.bib16)), and PEARL (Rakelly et al., [2019](https://arxiv.org/html/2605.29028#bib.bib17)), which are trained on the halfcheetah-vel task family across multiple tasks; and (ii) Cross-domain CSMs, which, like Q-align DT, are trained exclusively on the halfcheetah-medium-expert dataset and generalize to the velocity-tracking task directly.

Table 5: Performance on HalfCheetah-vel. Meta-RL methods are trained on HalfCheetah-vel, while offline RL methods are trained on HalfCheetah-medium-expert and evaluated via zero-shot transfer. For Prompt-DT, $K^{*}=5$ follows the original setting with 5-step expert prompts used at both training and inference, while $K^{*}=0$ removes prompts entirely to enable a fair comparison with prompt-free CSMs.

To account for differences in reward formulation, observation dimensions, and episode lengths, we apply a unified adjustment protocol to all cross-domain CSMs (details in Appendix [C](https://arxiv.org/html/2605.29028#A3)). Specifically, for each target velocity, the agent's behavior is controlled solely by the input RTG, which is determined using simple linear interpolation.
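The linear interpolation from a target velocity to an input RTG can be sketched as follows. The two calibration points (a velocity paired with the RTG assumed to produce it) and the function name `velocity_to_rtg` are hypothetical. The actual protocol is in Appendix C and is not reproduced here.

```python
def velocity_to_rtg(
    target_velocity: float,
    v_low: float,
    v_high: float,
    rtg_low: float,
    rtg_high: float,
) -> float:
    """Linearly map a target velocity to an RTG between two calibration points."""
    if v_high == v_low:
        raise ValueError("calibration velocities must differ")
    # Fraction of the way from v_low to v_high (values outside [0, 1] extrapolate).
    fraction = (target_velocity - v_low) / (v_high - v_low)
    return rtg_low + fraction * (rtg_high - rtg_low)


# Placeholder calibration: velocity 0.0 <-> RTG 0, velocity 3.0 <-> RTG 6000.
mid_rtg = velocity_to_rtg(1.5, v_low=0.0, v_high=3.0, rtg_low=0.0, rtg_high=6000.0)
```

Because the mapping is a single scalar function, any tracking error at test time reflects how well the policy's achieved behavior follows the RTG rather than any task-specific adaptation.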

Surprisingly, we find that Q-align DT exhibits strong zero-shot transferability compared to other CSMs and achieves competitive performance even against in-domain Meta-RL baselines, as shown in [Table 5](https://arxiv.org/html/2605.29028#S7.T5). This demonstrates its superior ability to represent and generalize multiple RTG-conditioned policies across environments with significant discrepancies. Furthermore, we evaluate the model on higher target velocities that exceed the original range of the halfcheetah-vel benchmark in [Table E.15](https://arxiv.org/html/2605.29028#A5.T15). These results highlight the controllability of Q-align DT and its ability to generalize to a wide range of target velocities.

## 8 Conclusion

In this work, we introduce Q-align DT, a framework designed to align conditioned sequence models with different return-to-go (RTG) targets. We analyze the underlying mechanisms of Q-align DT and show that actions from alignment-consistent CSMs can provide informative signals for $Q$-function learning. Extensive experiments demonstrate that our approach learns a coherent family of policies and achieves state-of-the-art performance in both return maximization and RTG alignment across a wide range of environments.

## Acknowledgements

We thank the anonymous reviewers for their helpful comments\. This research was supported by WZ’s startup funding provided by the School of Data Science and Society at UNC Chapel Hill\.

## Impact Statement

This paper presents work whose primary goal is to advance the field of machine learning, specifically in offline reinforcement learning and conditioned sequence models\. Our contributions focus on improving the alignment and controllability of learned policies using fixed datasets, without introducing new data sources or deployment mechanisms\.

The techniques studied in this work are general algorithmic methods and are evaluated exclusively in standard simulated benchmark environments\. As such, we do not anticipate direct negative societal impacts arising uniquely from this research beyond those already well established for reinforcement learning methods in general\. Potential applications of improved policy alignment include enabling safer and more controllable decision\-making systems, subject to appropriate domain\-specific considerations\.

## Limitations and Future Work

This paper studies the RTG mismatch issue in return-conditioned offline RL and proposes Q-guided alignment as a practical mitigation. Our experiments focus on standard Gym and AntMaze benchmarks, and extending the evaluation to broader domains such as visual control and robotic manipulation is an important direction for future work. Our theoretical analysis uses simplifying assumptions to isolate the effect of RTG–behavior alignment; extending the analysis to more general settings remains open. Finally, while Q-align DT shows empirical benefits in sparse-reward environments such as AntMaze, its stable performance in these settings currently relies on additional initialization or preprocessing for obtaining a reliable critic signal. Developing a more robust sparse-reward variant is an interesting direction for future work.

## References

- R. Agarwal, D. Schuurmans, and M. Norouzi (2020) An optimistic perspective on offline reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 104–114.
- J. Beck, R. Vuorio, E. Zheran Liu, Z. Xiong, L. Zintgraf, C. Finn, and S. Whiteson (2025) A tutorial on meta-reinforcement learning. Foundations and Trends in Machine Learning 18(2–3), pp. 224–384. External Links: [Document](https://dx.doi.org/10.1561/2200000080), ISSN 1935-8245.
- J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar (2018) SignSGD: compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 560–569.
- D. Brandfonbrener, A. Bietti, J. Buckman, R. Laroche, and J. Bruna (2022) When does return-conditioned supervised learning work for offline reinforcement learning? In Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho (Eds.).
- T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. External Links: 2005.14165.
- Y. Chebotar, Q. Vuong, K. Hausman, F. Xia, Y. Lu, A. Irpan, A. Kumar, T. Yu, A. Herzog, K. Pertsch, K. Gopalakrishnan, J. Ibarz, O. Nachum, S. A. Sontakke, G. Salazar, H. T. Tran, J. Peralta, C. Tan, D. Manjunath, J. Singh, B. Zitkovich, T. Jackson, K. Rao, C. Finn, and S. Levine (2023) Q-transformer: scalable offline reinforcement learning via autoregressive q-functions. In Proceedings of The 7th Conference on Robot Learning, J. Tan, M. Toussaint, and K. Darvish (Eds.), Proceedings of Machine Learning Research, Vol. 229, pp. 3909–3928.
- L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch (2021) Decision transformer: reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.).
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 4171–4186. External Links: [Document](https://dx.doi.org/10.18653/v1/N19-1423).
- R. Dorfman, I. Shenfeld, and A. Tamar (2021) Offline meta reinforcement learning – identifiability challenges and effective data collection strategies. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. W. Vaughan (Eds.), Vol. 34, pp. 4607–4618.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021a) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021b) An image is worth 16x16 words: transformers for image recognition at scale. External Links: 2010.11929.
- Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel (2016) RL2: fast reinforcement learning via slow reinforcement learning. External Links: 1611.02779.
- S. Emmons, B. Eysenbach, I. Kostrikov, and S. Levine (2022) RvS: what is essential for offline RL via supervised learning? In International Conference on Learning Representations.
- D. Ernst, P. Geurts, and L. Wehenkel (2005) Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6(18), pp. 503–556.
- C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. External Links: 1703.03400.
- J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine (2021) D4RL: datasets for deep data-driven reinforcement learning.
- S. Fujimoto and S. Gu (2021) A minimalist approach to offline reinforcement learning. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.).
- S. Fujimoto, D. Meger, and D. Precup (2019) Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 2052–2062.
- H. Furuta, Y. Matsuo, and S. S. Gu (2022) Generalized decision transformer for offline hindsight information matching. In International Conference on Learning Representations.
- J. Grigsby, L. Fan, and Y. Zhu (2024) AMAGO: scalable in-context reinforcement learning for adaptive agents. In The Twelfth International Conference on Learning Representations.
- H. v. Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pp. 2094–2100.
- K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2021) Masked autoencoders are scalable vision learners. External Links: 2111.06377.
- D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (gelus). arXiv: Learning.
- S. Hu, Z. Fan, C. Huang, L. Shen, Y. Zhang, Y. Wang, and D. Tao (2024) Q-value regularized transformer for offline reinforcement learning. In Forty-first International Conference on Machine Learning.
- M. Janner, Q. Li, and S. Levine (2021) Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.).
- N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard (2019) Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. External Links: 1907.00456.
- J. Kim, S. Lee, W. Kim, and Y. Sung (2024a) Adaptive $q$-aid for conditional supervised learning in offline reinforcement learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- J. Kim, S. Lee, W. Kim, and Y. Sung (2024b) Decision convformer: local filtering in metaformer is sufficient for decision making. In The Twelfth International Conference on Learning Representations.
- D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
- I. Kostrikov, A. Nair, and S. Levine (2022) Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations.
- A. Kumar, J. Fu, G. Tucker, and S. Levine (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. In Neural Information Processing Systems.
- A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1179–1191.
- M. Laskin, L. Wang, J. Oh, E. Parisotto, S. Spencer, R. Steigerwald, D. Strouse, S. S. Hansen, A. Filos, E. Brooks, M. Gazeau, H. Sahni, S. Singh, and V. Mnih (2023) In-context reinforcement learning with algorithm distillation. In The Eleventh International Conference on Learning Representations.
- J. Lee, A. Xie, A. Pacchiano, Y. Chandak, C. Finn, O. Nachum, and E. Brunskill (2023) Supervised pretraining can learn in-context reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems.
- S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. External Links: 2005.01643.
- L. Lin, Y. Bai, and S. Mei (2024) Transformers as decision makers: provable in-context reinforcement learning via supervised pretraining. In The Twelfth International Conference on Learning Representations.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. ArXiv abs/1907.11692.
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. External Links: 2103.14030.
- E. Mitchell, R. Rafailov, X. B. Peng, S. Levine, and C. Finn (2021) Offline meta-reinforcement learning with advantage weighting. In Proceedings of the 38th International Conference on Machine Learning, M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 7780–7791.
- K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen (2019) Efficient off-policy meta-reinforcement learning via probabilistic context variables. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 5331–5340.
- T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever (2017) Evolution strategies as a scalable alternative to reinforcement learning. External Links: 1703.03864.
- N. Y. Siegel, J. T. Springenberg, F. Berkenkamp, A. Abdolmaleki, M. Neunert, T. Lampe, R. Hafner, N. Heess, and M. Riedmiller (2020) Keep doing what worked: behavioral modelling priors for offline reinforcement learning. External Links: 2002.08396.
- R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. A Bradford Book, Cambridge, MA, USA. External Links: ISBN 0262039249.
- T. Tanaka, K. Abe, K. Ariu, T. Morimura, and E. Simo-Serra (2025) Return-aligned decision transformer. Transactions on Machine Learning Research. External Links: ISSN 2835-8856.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Red Hook, NY, USA, pp. 6000–6010. External Links: ISBN 9781510860964.
- J. Wang, P. Karanasou, P. Wei, E. Gatti, D. M. Plasencia, and D. Kanoulas (2025) Long-short decision transformer: bridging global and local dependencies for generalized decision-making. In The Thirteenth International Conference on Learning Representations.
- Y\. Wang, C\. Yang, Y\. Wen, Y\. Liu, and Y\. Qiao \(2024\)Critic\-guided decision transformer for offline reinforcement learning\.InAAAI,pp\. 15706–15714\.Cited by:[§D\.3](https://arxiv.org/html/2605.29028#A4.SS3.p1.1),[§2\.3](https://arxiv.org/html/2605.29028#S2.SS3.p1.3),[§6\.1](https://arxiv.org/html/2605.29028#S6.SS1.p1.1)\.
- Z\. Wang, A\. Novikov, K\. Zolna, J\. S\. Merel, J\. T\. Springenberg, S\. E\. Reed, B\. Shahriari, N\. Siegel, C\. Gulcehre, N\. Heess, and N\. de Freitas \(2020\)Critic regularized regression\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 7768–7778\.Cited by:[§2\.1](https://arxiv.org/html/2605.29028#S2.SS1.p2.1)\.
- Y\. Wu, G\. Tucker, and O\. Nachum \(2019\)Behavior regularized offline reinforcement learning\.External Links:1911\.11361Cited by:[§2\.1](https://arxiv.org/html/2605.29028#S2.SS1.p2.1)\.
- Y\. Wu, X\. Wang, and M\. Hamaya \(2023\)Elastic decision transformer\.InThirty\-seventh Conference on Neural Information Processing Systems,Cited by:[§2\.1](https://arxiv.org/html/2605.29028#S2.SS1.p3.1)\.
- M\. Xu, Y\. Shen, S\. Zhang, Y\. Lu, D\. Zhao, B\. J\. Tenenbaum, and C\. Gan \(2022\)Prompting decision transformer for few\-shot policy generalization\.InThirty\-ninth International Conference on Machine Learning,Cited by:[§2\.4](https://arxiv.org/html/2605.29028#S2.SS4.p1.1),[Table 5](https://arxiv.org/html/2605.29028#S7.T5.5.1.1.1.1),[§7](https://arxiv.org/html/2605.29028#S7.p3.1)\.
- T\. Yamagata, A\. Khalil, and R\. Santos\-Rodriguez \(2023\)Q\-learning decision transformer: leveraging dynamic programming for conditional sequence modelling in offline RL\.InProceedings of the 40th International Conference on Machine Learning,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 38989–39007\.Cited by:[§2\.3](https://arxiv.org/html/2605.29028#S2.SS3.p1.3)\.
- K\. Yan, A\. G\. Schwing, and Y\. Wang \(2024\)Reinforcement learning gradients as vitamin for online finetuning decision transformers\.InNeural Information Processing Systems \(NeurIPS\),Cited by:[§B\.1](https://arxiv.org/html/2605.29028#A2.SS1.p3.1),[§D\.2](https://arxiv.org/html/2605.29028#A4.SS2.p2.7),[§2\.3](https://arxiv.org/html/2605.29028#S2.SS3.p1.3)\.
- H\. Zheng, L\. Shen, Y\. Luo, D\. Ye, B\. Du, J\. Shen, and D\. Tao \(2025\)Decision mixer: integrating long\-term and local dependencies via dynamic token selection for decision\-making\.InForty\-second International Conference on Machine Learning,Cited by:[§D\.3](https://arxiv.org/html/2605.29028#A4.SS3.p1.1),[§6\.1](https://arxiv.org/html/2605.29028#S6.SS1.p1.1)\.

## Appendix A Algorithm

In this section, we provide the formal pseudocode for Q-align DT. The training procedure is designed to ensure that the policy $\pi_{\theta}$ not only clones the behavioral data but also adheres to directional value consistency through the alignment loss.

**Algorithm 1** Training Procedure

**Input:** Dataset $\mathcal{D}$, sequence horizon $k$, training epochs $N$, batch size $B$, RTG offset $\Delta\text{RTG}$, noise variance $\sigma_{e}^{2}$, target update rate $\alpha$, pretrained critic networks $Q_{\psi_{1}}, Q_{\psi_{2}}$, target critic networks $Q_{\psi_{1}^{\prime}}, Q_{\psi_{2}^{\prime}}$

**Output:** Trained actor parameters $\theta$

1. Initialize actor $\pi_{\theta}$ and target actor $\pi_{\theta^{\prime}}$
2. **for** $\text{epoch}=1$ **to** $N$ **do**
3. Sample batch of trajectories $\boldsymbol{\tau}\sim\mathcal{D}$
4. Sample sub-trajectory $\boldsymbol{\tau}_{t}$ of length $k$ with random timestep $t$
   *// Critic update*
5. Sample target actions $\hat{a}^{\prime,\Delta\mathrm{RTG}}\sim\pi_{\theta^{\prime}}(\boldsymbol{\tau}_{t}^{\Delta\text{RTG}})$
6. Update critic networks using ([7](https://arxiv.org/html/2605.29028#S4.E7))
   *// Actor update*
7. Predict actions $\hat{a}\sim\pi_{\theta}(\boldsymbol{\tau}_{t})$
8. Sample noise $\delta\sim\mathcal{N}(0,\sigma_{e}^{2})$
9. Predict noisy actions $\hat{a}^{\delta}\sim\pi_{\theta}(\boldsymbol{\tau}_{t}^{\delta})$
10. Update actor using ([6](https://arxiv.org/html/2605.29028#S4.E6))
11. Update target networks (actor and critics) with rate $\alpha$
12. **end for**
13. **Return** $\theta$
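Step 11 of Algorithm 1 only states that the target networks are updated "with rate $\alpha$". A minimal sketch of one common reading, Polyak (soft) averaging, is given below. The update direction $\theta^{\prime}\leftarrow(1-\alpha)\,\theta^{\prime}+\alpha\,\theta$, the function name `soft_update`, and the flat parameter lists are assumptions made for illustration; the paper's own implementation may differ.

```python
def soft_update(target_params, online_params, alpha):
    """Move each target parameter a fraction `alpha` of the way toward its online value.

    Assumed convention: target <- (1 - alpha) * target + alpha * online.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * t + alpha * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    target = [0.0, 1.0]  # hypothetical target-network parameters
    online = [1.0, 3.0]  # hypothetical online-network parameters
    for step in range(3):
        target = soft_update(target, online, alpha=0.5)
        print(step, target)
```

Under this convention, a small $\alpha$ makes the target actor and target critics change slowly, which is the usual motivation for keeping separate target networks in the critic update.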

## Appendix B Mathematical Details

### B.1 Proof of [Theorem 5.1](https://arxiv.org/html/2605.29028#S5.Thmtheorem1)

In order to prove [Theorem 5.1](https://arxiv.org/html/2605.29028#S5.Thmtheorem1), we first give a formal statement and analysis of results from Brandfonbrener et al. ([2022](https://arxiv.org/html/2605.29028#bib.bib27)).

###### Theorem B\.1\.

Consider an MDP, a behavior policy $\beta$, and a conditioning function $f$ that is consistent with the reward dynamics, i.e.,

$$f(s)=f(s^{\prime})+r(s,a)$$

for any transition from $s$ to $s^{\prime}$ under action $a$. Let

$$g(\boldsymbol{\tau})=\sum_{i=0}^{H}r_{i}$$

denote the cumulative reward of a trajectory $\boldsymbol{\tau}$. Assume:

1. **Return Coverage:** the policy $\beta$ covers all initial states $s_{1}$, i.e., $$P\big(g(\boldsymbol{\tau})=f(s_{1})\mid s_{1}\big)\geq\alpha_{f},\quad\forall s_{1}.$$
2. **Near Determinism:** the environment satisfies $$P(r\neq R(s,a)\mid s,a)\leq\epsilon,\quad P(s^{\prime}\neq T(s,a)\mid s,a)\leq\epsilon$$ for some reward function $R$ and transition function $T$.

Then the expected return of the conditioned policy $\pi_{f}$ trained on infinite data satisfies

$$\mathbb{E}_{s_{1}}[f(s_{1})]-\mathbb{E}_{\boldsymbol{\tau}\sim\pi_{f}}[g(\boldsymbol{\tau})]\leq\epsilon\left(\frac{1}{\alpha_{f}}+2\right)H^{2},$$

where $H$ is the horizon.

The result follows directly from Theorem 1 of Brandfonbrener et al. ([2022](https://arxiv.org/html/2605.29028#bib.bib27)). ∎
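To get a feel for how the bound in Theorem B.1 behaves, the right-hand side $\epsilon\,(1/\alpha_{f}+2)\,H^{2}$ can be evaluated directly. The sketch below is only a calculator for that expression. The function name `rcsl_gap_bound` and all numeric inputs are hypothetical and are not taken from the paper's experiments.

```python
def rcsl_gap_bound(eps, alpha_f, horizon):
    """Right-hand side of Theorem B.1: eps * (1 / alpha_f + 2) * H^2."""
    if not 0.0 < alpha_f <= 1.0:
        raise ValueError("alpha_f is a probability and must lie in (0, 1]")
    return eps * (1.0 / alpha_f + 2.0) * horizon ** 2


if __name__ == "__main__":
    # Hypothetical settings: the bound grows as return coverage alpha_f shrinks.
    for alpha_f in (0.5, 0.1, 0.01):
        print(alpha_f, rcsl_gap_bound(eps=1e-3, alpha_f=alpha_f, horizon=10))
```

The $1/\alpha_{f}$ factor is the relevant one for alignment: a target RTG that the behavior policy rarely achieves corresponds to a small $\alpha_{f}$, so the guarantee weakens exactly for the out-of-distribution targets discussed next.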

With [Theorem B.1](https://arxiv.org/html/2605.29028#A2.Thmtheorem1), we can analyze why previous CSMs struggle to align with target RTG values that deviate from the training distribution. Following Yan et al. ([2024](https://arxiv.org/html/2605.29028#bib.bib3)), we consider the case where $f(s_{1})=\text{RTG}_{\text{tgt}}$. According to [Theorem B.1](https://arxiv.org/html/2605.29028#A2.Thmtheorem1), the alignment error is bounded by:

$$\text{RTG}_{\text{tgt}}-\mathbb{E}_{\boldsymbol{\tau}\sim\pi(\cdot|s,\text{RTG}_{\text{tgt}})}[\text{RTG}_{\text{real}}]\leq\epsilon\left(\frac{1}{\alpha_{f}}+2\right)H^{2}.$$

While [Theorem B.1](https://arxiv.org/html/2605.29028#A2.Thmtheorem1) provides an analysis of DT's behavior in the infinite-data regime, we are more interested in the finite-data setting. In particular, we have:

###### Theorem B\.2\.

Under the assumptions of [Theorem B.1](https://arxiv.org/html/2605.29028#A2.Thmtheorem1), let the training sample size be $N$, and further assume that

- $\frac{P_{\pi_{f}}(s)}{P_{\pi_{\beta}}(s)}\leq C_{f}$ for all $s$.
- The policy class $\Pi$ is finite.
- $|\log\pi(a|s,g)-\log\pi(a^{\prime}|s^{\prime},g^{\prime})|<c$ for all $(a,s,g,a^{\prime},s^{\prime},g^{\prime})$ and all $\pi\in\Pi$.
- The approximation error of $\Pi$ is bounded by $\epsilon_{\text{error}}$.

Then, with probability at least $1-\delta_{p}$,

$$\mathbb{E}_{\boldsymbol{\tau}\sim\pi(\cdot|s,\text{RTG}_{\text{tgt}})}[\text{RTG}_{\text{real}}]-\mathbb{E}_{\boldsymbol{\tau}\sim\hat{\pi}(\cdot|s,\text{RTG}_{\text{tgt}})}[\text{RTG}_{\text{real}}]\leq O\Bigg(\frac{C_{f}}{\alpha_{f}}H^{2}\Big(\sqrt{c}\Big(\frac{\log|\Pi|/\delta_{p}}{N}\Big)^{1/4}\Big)+\sqrt{\epsilon_{\text{error}}}\Bigg),$$

where $\hat{\pi}\in\Pi$ is the policy that minimizes the KL divergence on the training dataset.

The result follows directly from Corollary 3 in Brandfonbrener et al. ([2022](https://arxiv.org/html/2605.29028#bib.bib27)). ∎

Combining [Theorem B.1](https://arxiv.org/html/2605.29028#A2.Thmtheorem1) and [Theorem B.2](https://arxiv.org/html/2605.29028#A2.Thmtheorem2), we conclude that with probability at least $1-\delta_{p}$, the total alignment error of the learned policy $\hat{\pi}$ scales as

$$\text{RTG}_{\text{tgt}}-\mathbb{E}_{\boldsymbol{\tau}\sim\hat{\pi}(\cdot|s,\text{RTG}_{\text{tgt}})}[\text{RTG}_{\text{real}}]\leq O\Bigg(\epsilon\tfrac{1}{\alpha_{f}}H^{2}+\frac{C_{f}}{\alpha_{f}}H^{2}\Big(\sqrt{c}\Big(\frac{\log|\Pi|/\delta_{p}}{N}\Big)^{1/4}\Big)+\sqrt{\epsilon_{\text{error}}}\Bigg).\tag{9}$$

By imposing a monotonicity constraint on the $Q$-value of the policy output with respect to the given RTG, Q-align DT effectively restricts the policy space $\Pi$. We then give the proof of [Theorem 5.1](https://arxiv.org/html/2605.29028#S5.Thmtheorem1):

###### Proof of [Theorem 5.1](https://arxiv.org/html/2605.29028#S5.Thmtheorem1).

In the unconstrained case, each state-RTG pair $(s,g)$ can be independently mapped to any of $|A|$ actions. With $|S|$ states and $|G|$ RTG levels, the total number of policies is:

$$|\Pi_{\rm free}|=|A|^{|S||G|},\quad\log|\Pi_{\rm free}|=|S||G|\log|A|.$$

In the constrained case $\Pi_{\rm mono}$, for a fixed state $s$, the sequence $(\pi(s,g_{1}),\dots,\pi(s,g_{|G|}))$ must be non-decreasing with respect to $\preceq_{s}$ induced by the scalar-valued $Q(s,\cdot)$. In the presence of ties in $Q(s,\cdot)$, we assume an arbitrary but fixed tie-breaking rule, which induces a total order $\preceq_{s}$ compatible with $Q(s,\cdot)$. Treating each state independently, the number of monotone sequences of length $|G|$ over the set of actions $A$ is given by the multiset coefficient:

$$|\Pi_{{\rm mono},s}|=\binom{|G|+|A|-1}{|G|}.$$

Across all states:

$$|\Pi_{\rm mono}|=\prod_{s\in S}|\Pi_{{\rm mono},s}|,\quad\log|\Pi_{\rm mono}|=\sum_{s\in S}\log\binom{|G|+|A|-1}{|G|}.$$

Using $\binom{n}{k}\leq(en/k)^{k}$ with $n=|G|+|A|-1$ and $k=|G|$, we have

$$\log|\Pi_{\rm mono}|\leq|S||G|\log\frac{e(|G|+|A|-1)}{|G|}.$$

If we further consider the regime where the resolution of the score space $|\mathcal{V}_{s}|$ is comparable to the RTG discretization $|G|$ (i.e., $|\mathcal{V}_{s}|=\Theta(|G|)$), then, since each score value can be shared by at most a constant number $C$ of actions, we have $|A|\leq C|\mathcal{V}_{s}|=\Theta(|G|)$. Therefore,

$$\frac{|G|+|A|-1}{|G|}=O(1).$$

Consequently, the resulting log-complexity bound satisfies

$$\log|\Pi_{\rm mono}|=O(|S||G|),$$

eliminating any explicit logarithmic dependence on $|A|$.

∎
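The counting argument above can be checked numerically for small problem sizes. The sketch below enumerates non-decreasing action sequences by brute force, compares the count with the multiset coefficient $\binom{|G|+|A|-1}{|G|}$, and evaluates both log-complexities. Actions are identified with their rank under $\preceq_{s}$; the function names and problem sizes are hypothetical choices for illustration.

```python
from itertools import product
from math import comb, log


def count_monotone_bruteforce(num_actions, num_rtg):
    """Count non-decreasing action sequences of length num_rtg by enumeration."""
    count = 0
    for seq in product(range(num_actions), repeat=num_rtg):
        if all(seq[i] <= seq[i + 1] for i in range(num_rtg - 1)):
            count += 1
    return count


def count_monotone_closed_form(num_actions, num_rtg):
    """Multiset coefficient C(|G| + |A| - 1, |G|) used in the proof."""
    return comb(num_rtg + num_actions - 1, num_rtg)


def log_class_sizes(num_states, num_actions, num_rtg):
    """Return (log |Pi_free|, log |Pi_mono|) for a tabular policy class."""
    log_free = num_states * num_rtg * log(num_actions)
    log_mono = num_states * log(count_monotone_closed_form(num_actions, num_rtg))
    return log_free, log_mono


if __name__ == "__main__":
    print(count_monotone_bruteforce(3, 2), count_monotone_closed_form(3, 2))
    print(log_class_sizes(num_states=10, num_actions=8, num_rtg=8))
```

Because $\log|\Pi|$ enters the finite-sample term of (9), a smaller monotone class translates into a tighter statistical term for the same sample size $N$.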

### B.2 Proof of [Theorem 5.2](https://arxiv.org/html/2605.29028#A5.EGx13)

We consider an MDP

$$M=(\mathcal{S},\mathcal{A},P,R,\gamma)$$

with horizon $H$, where $P(s^{\prime}\mid s,a)$ denotes the transition probability. Let $R^{*}$ be a sufficiently large total reward, larger than the return of any trajectory $\boldsymbol{\tau}=(s_{0},a_{0},r_{0},\dots,s_{H},a_{H},r_{H})$ in the environment.

Our proof strategy is to decompose [Theorem 5.2](https://arxiv.org/html/2605.29028#A5.EGx13) into two parts, formally stated as two lemmas. These lemmas show that the objectives of both QT and Q-align DT converge to the same greedy action-selection criterion when the target RTG is sufficiently large.

###### Lemma B\.1\.

Let $Q$ be a fixed critic defined for all $(s,a)$ pairs in the dataset, and let $\pi^{\mathrm{Align}}$ denote the policy that minimizes the alignment actor loss in ([6](https://arxiv.org/html/2605.29028#S4.E6)). Assume the dataset fully covers all state-action pairs, i.e.,

$$p_{\mathcal{D}}(a\mid s)>0,\quad\forall s\in\mathcal{S},\;\forall a\in\mathcal{A},$$

where $p_{\mathcal{D}}(a\mid s)$ denotes the empirical probability of observing action $a$ at state $s$ in the dataset. Then, for the maximal target return $R^{*}$, the policy $\pi^{\mathrm{Align}}_{R^{*}}$ outputs an action

$$a^{*}=\arg\max_{a\in\mathcal{A}_{\mathrm{data}}(s)}Q(s,a),$$

for each state $s$.

###### Proof\.

To analyze the behavior of Q-align DT as $\text{RTG}_{\text{tgt}}\to R^{*}$, we focus on the components of the loss that are sensitive to the target return. For a fixed state $s$, the alignment actor loss at $R^{*}$ is defined as:

$$\begin{aligned}L_{\mathrm{total}}(s,R^{*})&=\mathbb{E}_{a\sim\mathcal{D}_{R^{*}}(s)}\|\pi_{R^{*}}(s)-a\|^{2}+\mathbb{E}_{\text{RTG}\sim\mathcal{D}(s)}\mathbb{I}\bigl(Q(s,\pi_{\text{RTG}}(s))>Q(s,\pi_{R^{*}}(s))\bigr)\\&=\underbrace{\mathbb{E}_{a\sim\mathcal{D}_{R^{*}}(s)}\|\pi_{R^{*}}(s)-a\|^{2}}_{\text{supervised regression term}}+\underbrace{\mathbb{E}_{a^{\prime}\sim\mathcal{D}(s)}\mathbb{I}\bigl(Q(s,a^{\prime})>Q(s,\pi_{R^{*}}(s))\bigr)}_{\text{ranking term}},\end{aligned}$$

where $\mathbb{I}(\cdot)$ is the indicator function. The second equality holds under the assumption that the model successfully minimizes the supervised loss within the data distribution, such that $\pi_{\text{RTG}}(s)\approx a^{\prime}$ for $(s,a^{\prime},\text{RTG})\sim\mathcal{D}$.

In the high-RTG regime where $R^{*}$ exceeds the return of any trajectory in the dataset, the supervised regression term vanishes as $\mathcal{D}_{R^{*}}(s)=\emptyset$. The objective thus simplifies to the expectation of the indicator function:

$$L_{\mathrm{total}}(s,R^{*})=\sum_{a^{\prime}\in\mathcal{A}_{\mathrm{data}}(s)}p_{\mathcal{D}}(a^{\prime}\mid s)\cdot\mathbb{I}\bigl(Q(s,a^{\prime})>Q(s,\pi_{R^{*}}(s))\bigr).$$

Since the empirical probability $p_{\mathcal{D}}(a^{\prime}\mid s)$ is strictly positive for all $a^{\prime}\in\mathcal{A}_{\mathrm{data}}(s)$, the non-negative sum reaches its global minimum of $0$ if and only if:

$$Q(s,\pi_{R^{*}}(s))\geq Q(s,a^{\prime}),\quad\forall a^{\prime}\in\mathcal{A}_{\mathrm{data}}(s).$$

This directly implies $Q(s,\pi_{R^{*}}(s))\geq\max_{a^{\prime}\in\mathcal{A}_{\mathrm{data}}(s)}Q(s,a^{\prime})$. Given that the dataset covers all state-action pairs, the global minimizer satisfies:

$$\pi_{R^{*}}^{\mathrm{Align}}(s)=\arg\max_{a\in\mathcal{A}_{\mathrm{data}}(s)}Q(s,a).$$

This completes the proof. ∎

###### Lemma B\.2\.

Assume the dataset fully covers all state-action pairs, and consider an idealized version of QT in which the conditioning space is extended to $R\to\infty$. Let $\pi^{\mathrm{QT}}$ denote the policy that minimizes the QT actor loss. Then, for the maximal target return $R^{*}$, $\pi^{\mathrm{QT}}$ outputs an action

$$a^{*}=\arg\max_{a\in\mathcal{A}_{\mathrm{data}}(s)}Q(s,a),$$

for each state $s$.

###### Proof\.

For a fixed state $s$, the full QT actor loss is

$$L_{\mathrm{QT}}(s,R^{*})=\underbrace{\mathbb{E}_{a\sim\mathcal{D}_{R^{*}}(s)}\|\pi_{R^{*}}(s)-a\|^{2}}_{\text{supervised regression term}}-\underbrace{Q(s,\pi_{R^{*}}(s))}_{\text{Q-term}}.$$

When the target return $R^{*}$ is larger than the return of any trajectory in the dataset, the supervised regression term vanishes because $\mathcal{D}_{R^{*}}(s)$ is empty. The remaining loss reduces to

$$L_{\mathrm{QT}}(s,R^{*})=-Q(s,\pi_{R^{*}}(s)),$$

which is minimized by choosing

$$\pi_{R^{*}}(s)=\arg\max_{a\in\mathcal{A}_{\mathrm{data}}(s)}Q(s,a).$$

Thus, for the maximal target return, $\pi^{\mathrm{QT}}$ outputs the action with the highest Q-value among all dataset actions, which concludes the proof. ∎

###### Proof of [Theorem 5.2](https://arxiv.org/html/2605.29028#A5.EGx13).

By combining the results of [Lemma B.1](https://arxiv.org/html/2605.29028#A2.Thmlemma1) and [Lemma B.2](https://arxiv.org/html/2605.29028#A2.Thmlemma2), we observe that both policies, $\pi^{\mathrm{QT}}$ and $\pi^{\mathrm{Align}}$, satisfy the same optimality condition in the high-RTG limit. Specifically, since both objectives lead to the selection of the greedy action

$$a^{*}=\arg\max_{a\in\mathcal{A}_{\mathrm{data}}(s)}Q(s,a)\tag{10}$$

for any given state $s$, the behavior of Q-align DT is theoretically equivalent to that of an idealized QT in this regime. This concludes the proof. ∎
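The two lemmas can be illustrated on a single state with a small discrete action set. The sketch below evaluates the high-RTG ranking term of Lemma B.1 and the $-Q$ term of Lemma B.2 for every candidate action and returns a minimizer of each. Restricting to discrete actions, and the specific $Q$-values and empirical probabilities, are simplifying assumptions for illustration, as are the function names.

```python
def ranking_loss(q_values, probs, chosen):
    """Probability mass of dataset actions whose Q-value exceeds that of `chosen`."""
    return sum(p for q, p in zip(q_values, probs) if q > q_values[chosen])


def align_minimizer(q_values, probs):
    """A minimizer of the ranking term (first index on ties)."""
    return min(range(len(q_values)),
               key=lambda a: ranking_loss(q_values, probs, a))


def qt_minimizer(q_values):
    """A minimizer of the idealized QT loss -Q(s, a) (first index on ties)."""
    return min(range(len(q_values)), key=lambda a: -q_values[a])


if __name__ == "__main__":
    q = [0.2, 1.5, -0.3, 0.9]  # hypothetical Q(s, a) for four dataset actions
    p = [0.4, 0.1, 0.3, 0.2]   # hypothetical empirical p_D(a | s), all positive
    print(align_minimizer(q, p), qt_minimizer(q))
```

The coverage assumption matters here: if the highest-valued action had empirical probability zero, a lower-valued action could also attain ranking loss $0$, and the two minimizers would no longer be guaranteed to coincide.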

### B.3 Proof of [Theorem 5.3](https://arxiv.org/html/2605.29028#S5.Thmtheorem3)

To simplify the analysis, we consider *support-constrained optimality*, where all state-action pairs $(s,a)$ are restricted to the support of the offline dataset $\mathcal{D}$, and, for theoretical analysis, we assume exact critic updates. When $\Delta\mathrm{RTG}=0$, the critic Bellman update evaluates the unperturbed RTG-conditioned policy within the dataset support, yielding a conservative, behavior-aligned value estimate. Consequently, the combination of the supervised learning loss and the unperturbed critic keeps the policy anchored near the behavior policy $\beta$. We then focus on the *high-RTG regime* where the perturbation $\Delta\mathrm{RTG}$ is sufficiently large to steer the policy toward high-value regions. Following IQL (Kostrikov et al., [2022](https://arxiv.org/html/2605.29028#bib.bib9)), we define the support-constrained optimal state-action value function $Q^{*}$ and the corresponding optimal policy $\pi^{*}$ as

$$\begin{aligned}Q^{*}(s,a)&=r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a)}\Big[\max_{a^{\prime}\in\mathcal{A}:\pi_{\beta}(a^{\prime}|s^{\prime})>0}Q^{*}(s^{\prime},a^{\prime})\Big],\\\pi^{*}(s)&=\arg\max_{a\in\mathcal{A}:\pi_{\beta}(a|s)>0}Q^{*}(s,a),\end{aligned}$$

where $\mathcal{A}$ is the action space and $\pi_{\beta}$ is the behavior policy of the offline dataset $\mathcal{D}$.

To proceed, we first establish the following lemma:

###### Lemma B\.3\.

Let $Q_{m}$ be the action-value function of some policy $\pi$. Let $\pi_{m}$ be the policy obtained by minimizing the alignment actor loss on $Q_{m}$. Let $Q_{m+1}$ be the action-value function obtained by minimizing the TD loss ([7](https://arxiv.org/html/2605.29028#S4.E7)) with $\Delta\text{RTG}=R^{*}$.

Then for every $(s,a)$ we have

$$Q_{m+1}(s,a)\geq Q_{m}(s,a).$$

Moreover, if there exists some state $s$ where $\pi_{m}(s)$ strictly improves the one-step target, then the inequality is strict for at least one state-action pair.

###### Proof sketch\.

For notational simplicity, we use $\tilde{\pi}_{m}$ to denote the policy induced by $\pi_{m}$ under a return perturbation $\Delta\mathrm{RTG}=R^{*}$. We prove the claim by induction on the remaining horizon.

For the terminal timestep $H$, where no bootstrapping is applied, we have

$$Q_{m+1}(s_{H},a_{H})=Q_{m}(s_{H},a_{H})=r(s_{H},a_{H}).$$

Assume the statement holds for all timesteps $t+1,\dots,H$. Consider a state-action pair $(s_{t},a_{t})$ at timestep $t$:

$$\begin{aligned}Q_{m+1}(s_{t},a_{t})&=r(s_{t},a_{t})+\gamma\mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_{t},a_{t})}\Big[Q_{m+1}\big(s_{t+1},\tilde{\pi}_{m}(s_{t+1})\big)\Big]\\&\geq r(s_{t},a_{t})+\gamma\mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_{t},a_{t})}\Big[Q_{m}\big(s_{t+1},\tilde{\pi}_{m}(s_{t+1})\big)\Big]\\&\geq r(s_{t},a_{t})+\gamma\mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_{t},a_{t})}\Big[Q_{m}\big(s_{t+1},\pi(s_{t+1})\big)\Big]\\&=Q_{m}(s_{t},a_{t}),\end{aligned}$$

where the first inequality follows from the induction hypothesis, and the second inequality follows from [Lemma B.1](https://arxiv.org/html/2605.29028#A2.Thmlemma1), since $\tilde{\pi}_{m}$ selects the action with the highest $Q_{m}$-value among dataset actions. This completes the induction and proves $Q_{m+1}\geq Q_{m}$, with strict inequality for at least one state-action pair if $\pi_{m}$ strictly improves the one-step target. ∎

###### Proof of [Theorem 5.3](https://arxiv.org/html/2605.29028#S5.Thmtheorem3).

By [Lemma B.3](https://arxiv.org/html/2605.29028#A2.Thmlemma3), the Q-align DT update ensures $\{Q_{m}\}$ is monotonically non-decreasing: $Q_{m+1}\geq Q_{m}$ on $\mathcal{D}$. Given bounded rewards and finite horizon, the sequence $\{Q_{m}\}$ converges pointwise to a limit $\bar{Q}$ by the Monotone Convergence Theorem.

We show $\bar{Q}=Q^{*}$ on $\mathcal{D}$ by contradiction. Suppose $\bar{Q}$ is not optimal; then there exists $(s,a)\in\mathcal{D}$ such that:

$$\bar{Q}(s,a)<r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P}\big[\max_{a^{\prime}\in\mathcal{D}(s^{\prime})}\bar{Q}(s^{\prime},a^{\prime})\big].\tag{11}$$

By [Lemma B.1](https://arxiv.org/html/2605.29028#A2.Thmlemma1), under a sufficiently large perturbation $\delta$, the induced policy $\tilde{\pi}$ (derived from the converged policy $\pi_{\theta}$) selects greedy actions w.r.t. $\bar{Q}$: $\tilde{\pi}(s)=\arg\max_{a^{\prime}\in\mathcal{D}(s)}\bar{Q}(s,a^{\prime})$. Substituting this into ([11](https://arxiv.org/html/2605.29028#A2.E11)) yields:

$$\bar{Q}(s,a)<r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P}\big[\bar{Q}(s^{\prime},\tilde{\pi}(s^{\prime}))\big].\tag{12}$$

However, upon convergence, $\bar{Q}$ must satisfy the Bellman equation corresponding to the induced policy $\tilde{\pi}$:

$$\bar{Q}(s,a)=r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P}\big[\bar{Q}(s^{\prime},\tilde{\pi}(s^{\prime}))\big],$$

which directly contradicts ([12](https://arxiv.org/html/2605.29028#A2.E12)). Therefore, $\bar{Q}$ satisfies the Bellman optimality equation on $\mathcal{D}$, implying $\bar{Q}=Q^{*}$. Consequently, the sequence $\{\tilde{\pi}_{m}\}$ converges to an optimal policy $\pi^{*}(s)\in\arg\max_{a}\bar{Q}(s,a)$ restricted to the dataset. ∎
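The monotone-improvement argument can be seen on a toy tabular problem. The sketch below alternates a greedy step, standing in for the high-RTG alignment actor as characterized by Lemma B.1, with exact policy evaluation, standing in for the exact critic update. The 3-state deterministic MDP, the discounted infinite-horizon setting with $\gamma=0.9$, and all function names are assumptions chosen for illustration; the paper's analysis uses a finite horizon and dataset-supported actions.

```python
GAMMA = 0.9
NUM_STATES, NUM_ACTIONS = 3, 2
# Hypothetical deterministic MDP: NEXT[s][a] is the successor state, REWARD[s][a] the reward.
NEXT = [[0, 1], [0, 2], [2, 2]]
REWARD = [[0.1, 0.0], [0.0, 0.0], [1.0, 0.5]]


def evaluate(policy, sweeps=500):
    """Policy evaluation by repeated backups: Q(s,a) = r + gamma * Q(s', policy[s'])."""
    q = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]
    for _ in range(sweeps):
        q = [[REWARD[s][a] + GAMMA * q[NEXT[s][a]][policy[NEXT[s][a]]]
              for a in range(NUM_ACTIONS)] for s in range(NUM_STATES)]
    return q


def greedy(q):
    """Greedy action per state (stand-in for the high-RTG alignment actor)."""
    return [max(range(NUM_ACTIONS), key=lambda a: q[s][a]) for s in range(NUM_STATES)]


def optimal_q(sweeps=500):
    """Value iteration on the Bellman optimality equation, used as a reference."""
    q = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]
    for _ in range(sweeps):
        q = [[REWARD[s][a] + GAMMA * max(q[NEXT[s][a]])
              for a in range(NUM_ACTIONS)] for s in range(NUM_STATES)]
    return q


def run(initial_policy, iterations=8):
    """Return the sequence Q_0, Q_1, ... produced by greedy improvement plus evaluation."""
    history = [evaluate(initial_policy)]
    for _ in range(iterations):
        history.append(evaluate(greedy(history[-1])))
    return history


if __name__ == "__main__":
    for m, q in enumerate(run([0, 0, 1])):
        print(m, [[round(v, 3) for v in row] for row in q])
```

In this toy setting the iterates never decrease and the last one matches the value-iteration reference, mirroring the structure of Lemma B.3 and the contradiction argument above.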

## Appendix C Experiments on halfcheetah-vel

Transferring a model trained on the halfcheetah environment and deploying it directly on halfcheetah-vel is challenging, as the two environments differ in several important ways. We summarize the main challenges as follows:

- **Different Reward Functions.** Although both environments include a forward-velocity term and a control cost, their reward definitions differ significantly. In halfcheetah, the forward reward is simply the current velocity. In contrast, halfcheetah-vel defines the forward reward as $-\lvert v_{t}-v_{\text{target}}\rvert$, where $v_{t}$ is the current velocity and $v_{\text{target}}$ is the target velocity.
- **Different Horizons.** halfcheetah uses a horizon of 1000 steps, whereas halfcheetah-vel terminates after 200 steps. This discrepancy affects the RTG interpretation learned during training.
- **Different Observation Dimensions.** Compared to halfcheetah, the halfcheetah-vel environment further includes a 3-dimensional absolute position in its observations.

For fair comparison on halfcheetah-vel, we directly deploy the model trained on halfcheetah, but apply several modifications to the inputs:

- **Reward Mismatch.** To maintain consistency with the training setup, we re-compute the reward using the current forward velocity, updating the RTG as $$\bar{r}_{t}=v_{t}+c_{t},\qquad\mathrm{RTG}_{t+1}=\mathrm{RTG}_{t}-\bar{r}_{t},$$ where $c_{t}$ is the control cost from halfcheetah-vel. Note that although this modification is applied when updating the RTG, we report results using the original reward definition of halfcheetah-vel for all transferred methods.
- **Horizon Mismatch.** We continue to construct the RTG token using the original horizon of 1000 (i.e., $1000\times v_{\text{target}}$), while the environment naturally truncates at 200 steps.
- **Observation Mismatch.** We simply discard the extra 3 dimensions in halfcheetah-vel, and rely on the RTG token to convey velocity-related information.
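The RTG bookkeeping described in the list above can be sketched as follows. The helper names `initial_rtg` and `step_rtg` are hypothetical, and the numbers in the usage example are made up; the control-cost term $c_{t}$ is passed through with whatever sign the environment reports.

```python
def initial_rtg(target_velocity, train_horizon=1000):
    """Base RTG token built with the training horizon: 1000 * v_target."""
    return train_horizon * target_velocity


def step_rtg(rtg, velocity, control_cost):
    """Apply RTG_{t+1} = RTG_t - (v_t + c_t), using the surrogate halfcheetah-style reward."""
    surrogate_reward = velocity + control_cost
    return rtg - surrogate_reward


if __name__ == "__main__":
    rtg = initial_rtg(target_velocity=1.5)
    # Hypothetical (velocity, control cost) readings from three environment steps.
    for v_t, c_t in [(1.0, -0.25), (1.5, -0.5), (2.0, -0.5)]:
        rtg = step_rtg(rtg, v_t, c_t)
        print(rtg)
```

Only the RTG token uses the surrogate reward; the reported evaluation metric still uses the native halfcheetah-vel reward, as stated above.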

Even after applying these adjustments, selecting an appropriate initial RTG for deployment on halfcheetah-vel remains challenging. The main difficulty lies in estimating the cumulative control cost required to compute the RTG corresponding to a given target velocity: since the control cost depends on the induced trajectory, the desired RTG cannot be determined analytically and must be calibrated empirically at deployment time. Moreover, since the model is trained with a 1000-step horizon, it is allowed to defer RTG consumption toward later timesteps; empirically, we observe that the learned policy can exhibit such behavior (as illustrated in [Figure 4](https://arxiv.org/html/2605.29028#S7.F4)), which becomes infeasible under the 200-step evaluation horizon.

To address this mismatch, we define a base RTG as $1000\times v_{\text{target}}$ and empirically calibrate the RTG values. Specifically, we sample 40 RTG candidates around this base value for the maximum and minimum target velocities to identify the corresponding optimal RTGs, denoted as $\text{RTG}_{\max}$ and $\text{RTG}_{\min}$. For intermediate target velocities $v$, we assume a smooth and approximately monotonic relationship between RTG and the resulting velocity, and linearly interpolate the RTG as

$$\text{RTG}(v)=\text{RTG}_{\min}+\frac{v-v_{\min}}{v_{\max}-v_{\min}}\left(\text{RTG}_{\max}-\text{RTG}_{\min}\right).$$

All methods are evaluated using this procedure independently to ensure a fair comparison.
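The interpolation rule above is a standard two-point linear map. A minimal sketch, with a hypothetical function name and made-up calibration endpoints, is:

```python
def interpolate_rtg(v, v_min, v_max, rtg_min, rtg_max):
    """Linearly interpolate the RTG for target velocity v between two calibrated endpoints."""
    if v_max == v_min:
        raise ValueError("v_min and v_max must differ")
    fraction = (v - v_min) / (v_max - v_min)
    return rtg_min + fraction * (rtg_max - rtg_min)


if __name__ == "__main__":
    # Hypothetical calibrated endpoints; in the procedure above they would come from
    # the RTG candidates sampled at the minimum and maximum target velocities.
    for v in (0.0, 1.5, 3.0):
        print(v, interpolate_rtg(v, v_min=0.0, v_max=3.0, rtg_min=100.0, rtg_max=2500.0))
```

Only the two endpoint RTGs require calibration; every intermediate target velocity reuses them, which is why the approach depends on the RTG-to-velocity relationship being close to monotone and smooth.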

The effectiveness of linear interpolation demonstrates both the controllability and alignment ofQ\-align DT\. In standard CSMs, the mapping from RTG to realized behavior is often inconsistent, leading to poor alignment with the target velocity in this setting\. In contrast,Q\-align DTlearns a smooth and approximately linear relationship between RTG and behavior\. This allows us to anchor the policy at only the two endpoint velocities and reliably interpolate for intermediate targets, handling the environment shift without the need for additional RTG tuning, a strategy that fails for standard CSMs\.

To understand the sensitivity of Q-align DT to the number of RTG candidates, we conduct experiments with different calibration budgets and evaluate on HalfCheetah-vel. As shown in [Table C.6](https://arxiv.org/html/2605.29028#A3.T6), zero-shot transfer performance degrades gradually rather than collapsing as the number of calibration samples is reduced. Even with substantially fewer calibration samples, Q-align DT remains much stronger than the other CSM-based baselines evaluated with 40 samples, as shown in [Table 5](https://arxiv.org/html/2605.29028#S7.T5).

Table C.6: Effect of the number of calibration samples on zero-shot transfer performance.

## Appendix D Implementation Details

### D.1 Model Architecture

CSM Architecture. We adopt a Transformer-based policy following the sequence-conditioned structure in Chen et al. ([2021](https://arxiv.org/html/2605.29028#bib.bib8)). To enhance the local temporal consistency of control signals, we insert a lightweight 1D convolution layer after the projection matrices $W_k, W_q, W_v$. Specifically, for each layer, we compute:

$$k_{\ell}=\mathrm{Conv}(W_k x_{\ell}),\quad q_{\ell}=\mathrm{Conv}(W_q x_{\ell}),\quad v_{\ell}=\mathrm{Conv}(W_v x_{\ell}).$$

The convolution window size is set to $w=6$, enabling each output to incorporate information from a local temporal neighborhood. To preserve causality, we apply *causal left padding* by padding the input sequence with $w-1=5$ zeros on the left, ensuring that the output at timestep $t$ depends only on inputs at or before $t$. This prevents future information leakage and maintains the temporal causality required for sequence modeling. As our ablation studies show, this modification is not the main driver of peak performance.
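
To make the causal left padding concrete, the following sketch applies a window of $w=6$ to a single scalar channel in plain Python. It is an illustrative simplification under stated assumptions: the actual model convolves the projected key, query, and value vectors with learned kernels, whereas `causal_conv1d` and the all-ones kernel here are hypothetical stand-ins.

```python
W = 6  # convolution window size used in the text


def causal_conv1d(x, kernel):
    """Causal 1D convolution (cross-correlation) over one scalar channel.

    The input is left-padded with len(kernel) - 1 zeros, so output[t]
    is a weighted sum of x[t - w + 1], ..., x[t] and never sees x[t + 1].
    """
    w = len(kernel)
    padded = [0.0] * (w - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + j] for j in range(w))
        for t in range(len(x))
    ]


# With an all-ones kernel, each output is a running sum over the last 6 inputs.
sequence = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
outputs = causal_conv1d(sequence, [1.0] * W)
```

Changing `sequence[5]` leaves `outputs[0]` through `outputs[4]` untouched, which is the causality property the padding is meant to guarantee.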

Prediction Heads. Following the design in Hu et al. ([2024](https://arxiv.org/html/2605.29028#bib.bib2)), the model performs prediction sequentially at each timestep $i$. Specifically, the hidden representation of the RTG token $\text{rtg}_i$ is mapped through a linear head to reconstruct the current state $\hat{s}_i$, and the representation of the state token $s_i$ is mapped through a separate head to predict the corresponding action $\hat{a}_i$. These architectural details are summarized in [Table D.7](https://arxiv.org/html/2605.29028#A4.T7).

$Q$-Function. For the $Q$-function, we adopt a 3-layer MLP with a hidden dimension of 256.

### D.2 Q Pretraining

For stability, we pretrain the $Q$-function before training the CSM. To isolate the effect of the proposed alignment method, we adopt a simple Double $Q$-learning update (Hasselt et al., [2016](https://arxiv.org/html/2605.29028#bib.bib13)) in all experiments, except for the AntMaze environments, where IQL (Kostrikov et al., [2022](https://arxiv.org/html/2605.29028#bib.bib9)) is used.

Specifically, for $Q$ networks $Q_{\psi_1}, Q_{\psi_2}$ and target networks $Q_{\psi_1'}, Q_{\psi_2'}$, the pretraining loss is

$$L_{\text{pretrain}}=\sum_{i=t-k+1}^{t}\sum_{m=1}^{2}\Bigl(Q_{\psi_m}(s_i,a_i)-y_i\Bigr)^{2},\quad y_i=r_i+\gamma\min_{m=1,2}Q_{\psi_m'}(s_{i+1},a_{i+1}),$$

where $\gamma=0.99$. While RTG typically represents an undiscounted return ($\gamma=1$) and we follow this standard practice for RTG conditioning, we employ a discounted $Q$-function ($\gamma=0.99$) solely as a guidance signal during pretraining and alignment. This design choice is empirically robust and consistent with prior successful architectures (Hu et al., [2024](https://arxiv.org/html/2605.29028#bib.bib2); Kim et al., [2024a](https://arxiv.org/html/2605.29028#bib.bib1); Yan et al., [2024](https://arxiv.org/html/2605.29028#bib.bib3)). Importantly, since our alignment loss relies exclusively on the *relative ordering* rather than absolute magnitude, the scale discrepancy introduced by the discount factor does not affect the directional gradient. As long as the discounted $Q$-function preserves monotonic preference over returns, the structural regularization remains theoretically sound and avoids optimization inconsistency.
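
The target and loss above can be written out numerically as follows. This is a hedged sketch on plain floats rather than network outputs: the function names are hypothetical, the critic values are passed in directly instead of being computed by $Q_{\psi_m}$ and $Q_{\psi_m'}$, and terminal-state handling is omitted because the formula does not specify it.

```python
GAMMA = 0.99  # discount factor stated in the text


def double_q_target(reward, q1_target_next, q2_target_next, gamma=GAMMA):
    """y_i = r_i + gamma * min over the two target critics at (s_{i+1}, a_{i+1})."""
    return reward + gamma * min(q1_target_next, q2_target_next)


def pretrain_loss(q1_preds, q2_preds, targets):
    """Sum of squared errors of both critics against the shared targets y_i."""
    return sum(
        (q1 - y) ** 2 + (q2 - y) ** 2
        for q1, q2, y in zip(q1_preds, q2_preds, targets)
    )


# One-transition example with made-up numbers.
y = double_q_target(reward=1.0, q1_target_next=10.0, q2_target_next=8.0)
loss = pretrain_loss([9.0], [8.0], [y])
```

Taking the minimum over the two target critics makes the bootstrapped target pessimistic, which is the usual motivation for this form of update.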

For AntMaze, standard Bellman updates often fail to propagate values across long horizons in offline settings, leading to vanishing gradients and uninformative value estimates\. We therefore use IQL for pretraining, with an expectile of 0\.8\.
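
For reference, the expectile regression loss used by IQL for its value function weights positive and negative residuals asymmetrically. The sketch below shows that loss for a single residual with the expectile of 0.8 mentioned above; the function name is hypothetical, and the rest of the IQL pipeline (the $Q$ update and policy extraction) is not shown.

```python
def expectile_loss(diff, tau=0.8):
    """IQL-style expectile loss |tau - 1(diff < 0)| * diff^2.

    diff is the residual Q(s, a) - V(s). With tau > 0.5, positive
    residuals are penalized more heavily, pushing V toward an upper
    expectile of the Q-values observed in the dataset.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2


# A positive residual costs four times as much as a negative one at tau = 0.8.
loss_under = expectile_loss(2.0)   # V underestimates Q
loss_over = expectile_loss(-2.0)   # V overestimates Q
```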

Table D.7: Architecture of Q-align DT.

### D.3 Baseline Details

We evaluate Q-align DT against a comprehensive set of baselines, including traditional offline RL methods: IQL (Kostrikov et al., [2022](https://arxiv.org/html/2605.29028#bib.bib9)), TD3+BC (Fujimoto and Gu, [2021](https://arxiv.org/html/2605.29028#bib.bib14)), and CQL (Kumar et al., [2020](https://arxiv.org/html/2605.29028#bib.bib12)); and sequence modeling approaches: DT (Chen et al., [2021](https://arxiv.org/html/2605.29028#bib.bib8)), DC (Kim et al., [2024b](https://arxiv.org/html/2605.29028#bib.bib5)), RVS (Emmons et al., [2022](https://arxiv.org/html/2605.29028#bib.bib35)), CGDT (Wang et al., [2024](https://arxiv.org/html/2605.29028#bib.bib11)), LSDT (Wang et al., [2025](https://arxiv.org/html/2605.29028#bib.bib7)), DM (Zheng et al., [2025](https://arxiv.org/html/2605.29028#bib.bib6)), RADT (Tanaka et al., [2025](https://arxiv.org/html/2605.29028#bib.bib4)), QT (Hu et al., [2024](https://arxiv.org/html/2605.29028#bib.bib2)), and QCS (Kim et al., [2024a](https://arxiv.org/html/2605.29028#bib.bib1)).

For most tasks, we report the performance directly from the original publications. However, for the AntMaze environment, some baselines such as QT (Hu et al., [2024](https://arxiv.org/html/2605.29028#bib.bib2)) originally reported results on the v0 version. To ensure a fair comparison on the current standard benchmark, we re-evaluate these methods on the v2 environment using their official codebases. For the specific evaluation of alignment properties, we re-train and evaluate DT, DC, RADT, QT, and QCS using their respective official implementations to ensure consistency in our experimental setup.

### D.4 Hardware and Environment Configuration

All experiments were conducted on an NVIDIA L40S GPU with 48 GB of GDDR6 memory. The CPU is an Intel(R) Xeon(R) Gold 6430.

For the software environment, we used:

- d4rl 1.1, gym 0.18, mujoco 2.0.2
- PyTorch 2.4, CUDA 12.1
- Operating system: Red Hat Enterprise Linux 9.6
- Environment managed with conda and Python 3.8

### D.5 Hyperparameters

We report the hyperparameters of Q-align DT for the selected tasks in [Table D.8](https://arxiv.org/html/2605.29028#A4.T8). Below, we describe the key components: the noise scale $\sigma_e$, the weight of the $Q$-alignment loss $\lambda_e$, and the RTG perturbation magnitude $\Delta\text{RTG}$. Following the standard protocol of Decision Transformer (Chen et al., [2021](https://arxiv.org/html/2605.29028#bib.bib8)), we scale the target RTG by a factor of $1/1000$ for all D4RL Gym environments. Accordingly, the perturbation scale $\Delta\text{RTG}$ and the noise magnitude $\sigma_e$ reported in this paper are defined with respect to these normalized RTG values.

The choice of $\sigma_e$ is primarily influenced by the reward scale of each environment. The coefficient $\lambda_e$ varies across datasets because it depends on both the dataset distribution and the stability of the environment during $Q$-function updates.

The perturbation value $\Delta\text{RTG}$ is introduced during the $Q$-function update so that the learned critic reflects the value of relatively high-reward policies within the policy family of Q-align DT. Ideally, we would like it to approximate the value of the best policy in this family, i.e., by letting $\Delta\text{RTG}\to\infty$. In practice, however, this may lead to out-of-distribution behavior and destabilize training. Therefore, we choose $\Delta\text{RTG}$ according to the stability characteristics of each environment.

Table D.8: Hyperparameters of Q-align DT across different environments. $\sigma_e$ controls the scale of RTG perturbation noise; $\lambda_e$ is the weight of the $Q$-alignment loss; $\Delta\text{RTG}$ specifies the RTG perturbation used for $Q$-function updates.

| Environment / Dataset | $\sigma_e$ | $\lambda_e$ | $\Delta\text{RTG}$ |
|---|---|---|---|
| walker2d-medium-replay-v2 | 10 | 0.3 | 2 |
| halfcheetah-medium-replay-v2 | 15 | 5 | 5 |
| hopper-medium-replay-v2 | 10 | 0.3 | 0.5 |
| walker2d-medium-v2 | 10 | 0.3 | 1 |
| halfcheetah-medium-v2 | 15 | 5 | 10 |
| hopper-medium-v2 | 10 | 0.3 | 1 |
| walker2d-medium-expert-v2 | 10 | 0.3 | 10 |
| halfcheetah-medium-expert-v2 | 15 | 1 | 10 |
| hopper-medium-expert-v2 | 10 | 0.3 | 5 |
| antmaze-umaze-v2 | 5 | 100 | 1 |
| antmaze-umaze-diverse-v2 | 5 | 100 | 1 |
| antmaze-medium-play-v2 | 5 | 100 | 1 |
| antmaze-medium-diverse-v2 | 5 | 100 | 1 |

Specifically, when sampling RTG perturbations $\delta$ in the AntMaze environments, we employ a half-normal distribution, $\delta=|\epsilon|$, $\epsilon\sim\mathcal{N}(0,\sigma_e^{2})$. This ensures that the perturbations are positive, providing directional guidance toward the goal in sparse-reward navigation tasks, while negative perturbations, which are uninformative in this setting, are avoided.
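
The two sampling schemes differ only in whether the absolute value is taken. The sketch below illustrates this with Python's standard library; `sample_rtg_perturbation` and its `half_normal` flag are hypothetical names, and $\sigma_e$ is assumed to be expressed in the same (normalized) units as the RTG.

```python
import random


def sample_rtg_perturbation(sigma_e, half_normal=False, rng=random):
    """Draw an RTG perturbation delta.

    half_normal=False: delta = eps,   eps ~ N(0, sigma_e^2)  (signed)
    half_normal=True:  delta = |eps|, eps ~ N(0, sigma_e^2)  (non-negative,
                       the variant described for AntMaze)
    """
    eps = rng.gauss(0.0, sigma_e)
    return abs(eps) if half_normal else eps


# Seeded generator so the draws are reproducible.
generator = random.Random(0)
antmaze_deltas = [
    sample_rtg_perturbation(5.0, half_normal=True, rng=generator)
    for _ in range(1000)
]
```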

We do not perform extensive hyperparameter sweeps, as our goal is to evaluate the robustness and controllability of the proposed method rather than optimize peak performance. Accordingly, we largely share hyperparameters across environments within the same domain, making only minor adjustments when necessary for training stability. Despite this minimal tuning, Q-align DT consistently outperforms baselines, indicating that the observed gains arise from improved alignment rather than hyperparameter optimization. For AntMaze, we fix a larger $\lambda_e$ across all variants to accommodate the sparse-reward structure and different reward scale.

### D.6 Alignment Calculation

For evaluating alignment performance, we sweep the target RTG from the minimum to the maximum return in each dataset with a step size of 100. For each RTG, we run 30 rollouts and compute the average D4RL score. The root mean squared deviation between the achieved scores and the target scores is used as the alignment metric. The minimum and maximum returns of each dataset are listed in [Table D.9](https://arxiv.org/html/2605.29028#A4.T9).
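
The metric can be sketched as follows. This is an illustrative outline under stated assumptions: `rtg_sweep` and `alignment_rmse` are hypothetical names, the achieved scores are assumed to already be averaged over the 30 rollouts, and the targets are assumed to be expressed on the same D4RL score scale as the achieved values.

```python
import math


def rtg_sweep(min_return, max_return, step=100):
    """Target RTGs from min_return up to max_return (inclusive) in fixed steps."""
    targets = []
    rtg = min_return
    while rtg <= max_return:
        targets.append(rtg)
        rtg += step
    return targets


def alignment_rmse(target_scores, achieved_scores):
    """Root mean squared deviation between achieved and target scores."""
    if not target_scores or len(target_scores) != len(achieved_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    squared = [(a - t) ** 2 for t, a in zip(target_scores, achieved_scores)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up numbers: a perfectly aligned policy would give an RMSE of 0.
targets = rtg_sweep(0, 300)
rmse = alignment_rmse([0.0, 100.0], [3.0, 96.0])
```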

This evaluation is similar to prior work (Tanaka et al., [2025](https://arxiv.org/html/2605.29028#bib.bib4)); however, unlike RADT, which evaluates only 7 RTG values per dataset, our evaluation considers at least 32 distinct RTG values (in Hopper), providing a finer-grained assessment of policy alignment across the entire return spectrum.

Table D.9: Maximum and minimum return of each dataset.

## Appendix E More Experiment Results

### E.1 More Ablation Experiments

In this section, we provide a detailed ablation analysis of the core components in Q-align DT.

#### Effect of Fixing the $Q$-function.

We conduct an ablation in which the $Q$-function is fixed after pretraining and is no longer updated jointly with the actor. As shown in [Table E.10](https://arxiv.org/html/2605.29028#A5.T10), fixing the $Q$-function leads to a clear degradation in the best achievable performance, as the static critic fails to provide informative signals to improve the actor beyond the behavior policy. Furthermore, alignment error increases accordingly, indicating that co-training is essential for maintaining reward-sensitive behavior.

Notably, most of the degradation occurs under low-RTG conditions. During inference, when conditioned on a relatively low target RTG, the model trained with a fixed $Q$-function often collapses (e.g., falls) in the early timesteps. This early failure causes the realized return to approach zero, falling far below the intended RTG target and resulting in a breakdown of alignment (see [Section E.5](https://arxiv.org/html/2605.29028#A5.SS5) for further results).

Table E.10: Effect of fixing the $Q$-function. RMSE ($\downarrow$) denotes the alignment error, while Perf. ($\uparrow$) denotes the overall performance (D4RL normalized score). Results are averaged over three random seeds.

#### Asymmetric vs. Symmetric Indicator Functions.

We investigate the necessity of the asymmetric indicator function by considering a symmetric variant:

$$\mathbb{I}_{\text{sym}}=\begin{cases}1,&\text{if }\text{sgn}(\delta)\big(Q_{\psi}(s_i,\hat{a}_i^{\delta})-Q^{\perp}_{\psi}(s_i,\hat{a}_i)\big)<0,\\ -1,&\text{otherwise}.\end{cases}$$

Unlike our original formulation, which only penalizes constraint violations, this variant explicitly encourages the actor to further maximize the value gap even when the ranking is already correct.
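
The contrast between the two indicators can be illustrated on scalar values. The symmetric case follows the definition above. The asymmetric case is an assumption made for this sketch: it returns 0 whenever the ranking is already correct, which is one reading of "only penalizes constraint violations" and may differ from the exact form used in the main text. Here `q_perturbed` stands for $Q_{\psi}(s_i,\hat{a}_i^{\delta})$ and `q_base` for the reference value $Q^{\perp}_{\psi}(s_i,\hat{a}_i)$; all function names are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def symmetric_indicator(delta, q_perturbed, q_base):
    """+1 when the Q ordering contradicts the sign of delta, -1 otherwise."""
    violated = sign(delta) * (q_perturbed - q_base) < 0
    return 1 if violated else -1


def asymmetric_indicator(delta, q_perturbed, q_base):
    """Assumed form: +1 on a violation, 0 (no gradient signal) otherwise."""
    violated = sign(delta) * (q_perturbed - q_base) < 0
    return 1 if violated else 0


# Raising the RTG (delta > 0) but obtaining a lower Q is a violation.
violation = symmetric_indicator(1.0, q_perturbed=5.0, q_base=6.0)
# A correct ranking still yields a nonzero signal in the symmetric variant.
already_correct = symmetric_indicator(1.0, q_perturbed=6.0, q_base=5.0)
```

Under this reading, the two variants agree on violations and differ only when the ranking is already correct, which is exactly the regime where the symmetric variant keeps pushing the actor.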

As shown in [Table E.11](https://arxiv.org/html/2605.29028#A5.T11), this “active alignment” approach does not generally improve performance. On the contrary, it significantly increases training instability and even leads to behavioral collapse in environments like Hopper and Walker2d. This suggests that the primary role of the alignment loss should be *correcting* misaligned rankings rather than continuously pushing actions beyond the critic’s reliable regions, which may introduce detrimental gradient noise.

#### Sensitivity to the Alignment Loss Form\.

We further study the sensitivity of Q-align DT to the functional form of the alignment loss. Specifically, we replace the default linear penalty with a Squared Penalty,

$$\big(Q_{\psi}(s_i,\hat{a}_i^{\delta})-Q^{\perp}_{\psi}(s_i,\hat{a}_i)\big)^{2},$$

which imposes a quadratic cost on directional inconsistencies.

As shown in [Table E.11](https://arxiv.org/html/2605.29028#A5.T11), this variant yields comparable performance across all tasks, with only minor deviations from the default formulation. Empirically, introducing higher-order penalties does not yield consistent improvements, which leads us to adopt the simplest formulation, viewable as a first-order approximation of the value difference under RTG perturbations.

Table E.11: Ablation of the alignment loss structure. We compare the default Q-align DT with its symmetric and squared (L2) variants.

#### Sensitivity to RTG $\sigma_e$.

We evaluate the sensitivity of Q-align DT to the perturbation magnitude $\sigma_e$, which is defined relative to the normalized RTG range. [Table E.12](https://arxiv.org/html/2605.29028#A5.T12) reports the performance on the halfcheetah-medium task across a wide range of $\sigma_e$ values.

The results indicate that Q-align DT is highly robust to the choice of $\sigma_e$ within a reasonable range (from 1 to 50). Even a large perturbation ($\sigma_e=50$) does not lead to performance collapse, suggesting that the directional signal remains informative. By contrast, an excessively small perturbation (e.g., $\sigma_e=0.5$) limits the actor’s ability to explore the value landscape, resulting in suboptimal alignment.

Table E.12: Sensitivity analysis of the perturbation scale $\sigma_e$ on halfcheetah-medium. The noise is applied to the normalized RTG (scaled by $1/1000$). Q-align DT exhibits strong robustness across an order of magnitude of noise levels.

#### Sensitivity to $\lambda_e$.

We further evaluate the sensitivity of Q-align DT to the alignment loss weight $\lambda_e$. As shown in [Table E.14](https://arxiv.org/html/2605.29028#A5.T14), we report the performance on halfcheetah-medium across a wide spectrum of weights.

The results reveal a clear phase transition: at $\lambda_e=0$, where the directional guidance is absent, the model achieves a score of only $42.84$, consistent with standard supervised learning baselines. However, with even a small weight ($\lambda_e=1.0$), the performance surges to $61.7$. Crucially, for $\lambda_e\geq 3.0$, the performance enters a stable plateau, with scores remaining consistently above $64.0$ regardless of the specific weight chosen. While the peak performance reaches $68.2$ at $\lambda_e=100.0$, the marginal gains across the $3.0$ to $100.0$ range suggest that Q-align DT is highly robust to this hyperparameter. This insensitivity allows for a “plug-and-play” deployment without the need for exhaustive, task-specific fine-tuning.

#### Sensitivity to $Q$ Function Accuracy

We investigate the sensitivity of Q-align DT to inaccuracies in the learned $Q$ function. Specifically, we perturb the target critic $Q_{\psi'}$ by injecting Gaussian noise $\epsilon\sim\mathcal{N}(0,\sigma_n^{2})$ to simulate potential estimation errors.

We conduct experiments on halfcheetah-medium and walker2d-medium with noise scales $\sigma_n\in\{1,10,50\}$, and report the results in [Table E.13](https://arxiv.org/html/2605.29028#A5.T13). Notably, Q-align DT exhibits remarkable resilience, maintaining stable performance even under substantial perturbations. This indicates that the method is insensitive to the exact accuracy of the $Q$-function. We attribute this robustness to the alignment objective, which emphasizes the *relative ordering* and *directional consistency* of $Q$-values rather than their absolute magnitudes. As long as the critic preserves coarse rank-ordering of actions, the alignment loss provides sufficient guidance for effective policy extraction, making the learning process tolerant to estimation noise.

Table E.13: Sensitivity analysis of the value function. We report the normalized D4RL scores under varying noise scales $\sigma_n$.

Table E.14: Sensitivity analysis of the alignment loss weight $\lambda_e$ on halfcheetah-medium.

### E.2 Extended Generalization on HalfCheetah-Vel

While the standard halfcheetah-vel task focuses on target velocities within the $[0,3]$ range, our model leverages the diverse transitions in the medium-expert dataset to acquire a behavioral repertoire capable of reaching velocities exceeding 10. To fully explore the limits of its controllability, we evaluate Q-align DT on extended target velocities that far exceed those typically used in Meta-RL benchmarks.

We follow the evaluation protocol in [Appendix C](https://arxiv.org/html/2605.29028#A3), extending the episode horizon to 1000 steps to allow sufficient time for the agent to accelerate to these higher targets. Since the original halfcheetah-vel dataset is restricted to low-velocity samples, we use CSMs trained on the broader halfcheetah-medium-expert dataset as baselines. As reported in [Table E.15](https://arxiv.org/html/2605.29028#A5.T15), Q-align DT demonstrates a remarkably wide effective range of controllability, accurately tracking high-velocity targets even when trained under a single, straightforward reward objective (i.e., where reward simply scales with velocity).

Table E.15: Performance under high target velocities on HalfCheetah-vel. Maximum episode return is reported for each target velocity (horizon = 1000).

### E.3 Behavioral Analysis on Ant

We also evaluate Q-align DT on ant-medium to examine whether the agent can modulate its behavior under different target RTGs. As shown in [Figure E.5](https://arxiv.org/html/2605.29028#A5.F5), the agent demonstrates similar trends to halfcheetah: higher target RTGs induce faster locomotion, while lower RTGs lead to more conservative movement. These results suggest that, even when trained only on scalar returns, the actor can learn distinct behaviors that are controllable via the RTG tokens, indicating that the structured mapping from RTG to actions generalizes across different environments.

![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/ant-medium_mean-vel.png)

![Refer to caption](https://arxiv.org/html/2605.29028v1/x7.png)

Figure E.5: Behavioral modulation of the agent on ant-medium under varying target RTGs. (Left) The relationship between target RTG and realized mean velocity; (Right) Velocity trajectories across different RTG levels averaged over 30 rollouts.

### E.4 Additional Experiments

We first show the last-iteration performance of DT, QT, and Q-align DT in [Table E.16](https://arxiv.org/html/2605.29028#A5.T16).

Table E.16: Last-iteration performance comparison on D4RL Gym tasks.

We report a detailed ablation of the architectural change and the alignment objective in [Table E.17](https://arxiv.org/html/2605.29028#A5.T17). To make the attribution clearer, we compare vanilla DT, DT with 1D convolution, DT with the alignment loss, and Q-align DT in terms of alignment MSE. As shown in [Table E.17](https://arxiv.org/html/2605.29028#A5.T17), 1D convolution alone slightly reduces the alignment MSE, while the alignment loss provides the dominant improvement. Combining both components achieves the lowest alignment MSE, indicating that the main gain comes from the proposed alignment objective rather than the architectural modification alone.

Table E.17: Ablation study on the effect of 1D convolution and the alignment loss, measured by alignment MSE. Lower is better.

We further report the training cost of DT, QT, and Q-align DT in [Table E.18](https://arxiv.org/html/2605.29028#A5.T18), where each method is trained for 100 epochs with 1000 steps per epoch. QT and Q-align DT are slower than DT because both use an additional $Q$-function. Compared with QT, Q-align DT only adds extra forward computation for perturbed-action generation, without introducing additional backward-pass overhead. As a result, its practical training cost remains comparable to QT.

Table E.18: Wall-clock training time under the same setting of 100 epochs and 1000 steps per epoch.

Table E.19: Spearman correlation between predicted $Q$-values and realized returns across environments.

To better understand the different behavior of Q-align DT on Gym and AntMaze tasks, we further compute the Spearman correlation between predicted $Q$-values and realized returns, as shown in [Table E.19](https://arxiv.org/html/2605.29028#A5.T19). The correlation is much lower on AntMaze than on dense-reward Gym tasks (e.g., 0.94 on Hopper-medium vs. 0.26 on AntMaze-umaze-diverse). This suggests that the $Q$-guided alignment signal is substantially weaker in sparse-reward environments, especially on the diverse split. We view this phenomenon as a limitation of Q-align DT in sparse-reward settings and leave further improvement to future work.
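
For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the rank-transformed data, so it measures whether predicted $Q$-values order trajectories the same way realized returns do. The following standard-library sketch shows the computation, with ties receiving average ranks; the function names and the example numbers are illustrative and are not taken from the paper's evaluation code or results.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Any strictly increasing relationship, even a nonlinear one, gives rho = 1.
rho = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 8.0, 27.0, 64.0])
```

Because only ranks enter the computation, the statistic matches the way the alignment loss uses the critic: through relative ordering rather than absolute magnitude.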

### E.5 Further Analysis on Alignment Stability

We first present the complete alignment curves for Q-align DT across all nine Gym tasks in [Figure E.6](https://arxiv.org/html/2605.29028#A5.F6), demonstrating consistently robust alignment.

Beyond overall performance, we examine scenarios in which alignment degrades, specifically under a fixed critic or when the RTG offset is near zero. As shown in [Figure E.7](https://arxiv.org/html/2605.29028#A5.F7), when the $Q$-function is fixed, the agent’s performance under low-RTG targets effectively collapses to zero. A stable response only emerges once the target RTG exceeds a certain threshold, indicating a sharp transition in behavior. In [Figure E.8](https://arxiv.org/html/2605.29028#A5.F8), we further report the agent’s velocity at each timestep, which shows that the misalignment in the low-RTG regime is caused by collapse in the early timesteps.

We further examine the impact of RTG offsets in [Figure E.9](https://arxiv.org/html/2605.29028#A5.F9). Notably, even with synchronized training, a near-zero offset replicates the failure mode of the fixed-critic baseline. This is because a sufficient $\Delta\mathrm{RTG}$ is essential to ensure that the $Q$-function evaluates the better-performing policies within the evolving policy family. Without this positive offset, the alignment objective is guided by value estimates of mediocre or failing behaviors. Consequently, the policy lacks the structural guidance needed to remain stable under low-RTG targets, leading to the observed performance collapse where realized returns drop to zero.

![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/alignment/halfcheetah-medium.png) (a) HalfCheetah Medium
![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/alignment/hopper-medium.png) (b) Hopper Medium
![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/alignment/walker2d-medium.png) (c) Walker2d Medium
![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/alignment/halfcheetah-medium-replay.png) (d) HalfCheetah Medium-Replay
![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/alignment/hopper-medium-replay.png) (e) Hopper Medium-Replay
![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/alignment/walker2d-medium-replay.png) (f) Walker2d Medium-Replay
![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/alignment/halfcheetah-medium-expert.png) (g) HalfCheetah Medium-Expert
![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/alignment/hopper-medium-expert.png) (h) Hopper Medium-Expert
![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/alignment/walker2d-medium-expert.png) (i) Walker2d Medium-Expert

Figure E.6: Performance of Q-align DT across Gym locomotion tasks. Each row corresponds to one dataset type, and each column corresponds to one environment.

![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/walker2d_no_update_Q.png) (a) Walker2d Medium
![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/halfcheetah_no_update_Q.png) (b) HalfCheetah Medium

Figure E.7: Results on walker2d-medium and halfcheetah-medium when $Q$ is fixed.

![Refer to caption](https://arxiv.org/html/2605.29028v1/x8.png) (a) Walker2d Medium
![Refer to caption](https://arxiv.org/html/2605.29028v1/x9.png) (b) HalfCheetah Medium

Figure E.8: Velocity on walker2d-medium and halfcheetah-medium with a fixed $Q$. At lower RTG targets, the model consistently *collapses* within the first few timesteps, rendering it *unresponsive* to RTG changes. A meaningful behavioral response to the conditioning only emerges after the RTG exceeds a critical threshold, allowing the model to escape early collapse.

![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/halfcheetah_offset0.png) (a) Offset=0
![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/halfcheetah_offset1.png) (b) Offset=1
![Refer to caption](https://arxiv.org/html/2605.29028v1/fig/halfcheetah_offset3.png) (c) Offset=3

Figure E.9: Alignment curves on halfcheetah-medium under different RTG offsets.
