Reversal Q-Learning

arXiv cs.LG Papers

Summary

This paper proposes Reversal Q-Learning (RQL), an offline reinforcement learning algorithm that trains a flow policy using an expanded Markov decision process framework and techniques to enable off-policy RL without backpropagation through time. It achieves state-of-the-art performance on challenging simulated robotic tasks.

arXiv:2606.17551v1 (announce type: new)

# Reversal Q-Learning
Source: [https://arxiv.org/html/2606.17551](https://arxiv.org/html/2606.17551)
###### Abstract

Iterative generative modeling techniques, such as flow matching, provide powerful tools to model complex behaviors for effective offline reinforcement learning (RL). In this work, we propose a new off-policy RL algorithm that trains a flow policy based on prior data. Our idea starts from the “expanded” Markov decision process (MDP) framework, which treats individual flow refinement steps as separate actions in an MDP. To enable off-policy RL within this framework, we apply two techniques: we generate virtual on-policy trajectories (by “reversing” flows) to make this framework compatible with prior data, and we apply a bias-and-variance reduction technique to mitigate the curse of horizon in off-policy RL. We call the resulting algorithm reversal Q-learning (RQL). RQL has several advantages over previous flow-based RL methods: it does not suffer from backpropagation through time, makes better use of the learned value function, and directly trains the full, expressive flow policy. Through our experiments on 50 challenging simulated robotic tasks, we show that RQL leads to the best average offline RL performance compared to state-of-the-art flow-based offline RL algorithms.

Code: [https://github.com/aoberai/rql](https://github.com/aoberai/rql)

Website: [https://aober.ai/rql](https://aober.ai/rql)

Machine Learning, ICML

![Refer to caption](https://arxiv.org/html/2606.17551v1/x1.png)

Figure 1: Reducing the effective horizon. We reduce the effective TD horizon to leverage the expanded MDP framework for *off-policy* RL. We avoid the naive solution, which requires $F \times T$ backups. Instead, RQL requires just $T$ on average.

## 1 Introduction

Recent advancements in iterative generative modeling, such as denoising diffusion\(Sohl\-Dickstein et al\.,[2015](https://arxiv.org/html/2606.17551#bib.bib57); Ho et al\.,[2020](https://arxiv.org/html/2606.17551#bib.bib24)\)and flow matching\(Lipman et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib39); Albergo & Vanden\-Eijnden,[2023](https://arxiv.org/html/2606.17551#bib.bib4); Liu et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib42)\), have provided powerful tools for effective off\-policy reinforcement learning \(RL\)\(Wang et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib61); Hansen\-Estruch et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib20); Park et al\.,[2025c](https://arxiv.org/html/2606.17551#bib.bib52)\)\. By modeling complex behaviors in offline datasets with expressive generative models \(e\.g\., by training diffusion or flow policies\), they can capture diverse behavioral priors that can be rapidly adapted to downstream tasks\.

While promising in principle, training diffusion or flow policies with off\-policy RL is a difficult problem\. This challenge stems from their*iterative*nature\. For example, if we naïvely train a diffusion policy to maximize a learned value function\(Lillicrap et al\.,[2016](https://arxiv.org/html/2606.17551#bib.bib38); Wang et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib61)\), gradients are backpropagated through the entire iterative generative process, often leading to unstable training and suboptimal performance\(Park et al\.,[2025c](https://arxiv.org/html/2606.17551#bib.bib52)\)\. Prior work has sidestepped this issue using other techniques like weighted regression\(Zhang et al\.,[2025](https://arxiv.org/html/2606.17551#bib.bib66)\), distillation\(Park et al\.,[2025c](https://arxiv.org/html/2606.17551#bib.bib52)\), and rejection sampling\(Hansen\-Estruch et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib20)\), but these approaches come with their own limitations \(see[Section2](https://arxiv.org/html/2606.17551#S2)\)\.

In this work, we consider an alternative paradigm that has recently been explored in diffusion-based *on-policy* RL (Black et al., [2023](https://arxiv.org/html/2606.17551#bib.bib7); Fan et al., [2023](https://arxiv.org/html/2606.17551#bib.bib15); Ren et al., [2025](https://arxiv.org/html/2606.17551#bib.bib54)). The idea is simple: instead of treating a diffusion policy as a black box that generates an action from a state, this paradigm treats individual denoising steps as part of a Markov decision process (MDP), effectively expanding the horizon by a factor of $F$, the number of denoising steps ([Figure 1](https://arxiv.org/html/2606.17551#S0.F1)). This way, we can fully avoid handling tricky issues from training iterative policies with RL, such as backpropagation through time. Previous work has shown that this expanded MDP paradigm is highly effective when combined with on-policy algorithms like REINFORCE (Williams, [1992](https://arxiv.org/html/2606.17551#bib.bib62)) and PPO (Schulman et al., [2017](https://arxiv.org/html/2606.17551#bib.bib56)).

Unfortunately, this expanded MDP framework is not directly suitable for *off-policy* RL, whose goal is to train diffusion or flow policies with RL in a sample-efficient manner, leveraging prior data. There are two main reasons. First, standard offline datasets only contain state-action pairs from the original environment, and do not provide diffusion or flow trajectories corresponding to the expanded MDP. Second, the MDP expansion increases the horizon by a factor of $F$, which makes it challenging to estimate accurate values due to “the curse of horizon” in off-policy RL (Liu et al., [2018](https://arxiv.org/html/2606.17551#bib.bib41); Park et al., [2025b](https://arxiv.org/html/2606.17551#bib.bib51)).

Our key insight in this work is that the *reversibility* of *deterministic* iterative generative models (e.g., flow matching) provides an effective solution to both challenges. Specifically, we first generate “virtual” trajectories in the expanded MDP by reconstructing the flow trajectories that the current policy would have produced for each state-action pair in the dataset. This is done by solving an inverse problem via reverse flows. We then apply multi-step returns to these virtual trajectories to reduce the effective horizon for value function learning. Since these virtual trajectories are fully deterministic and on-policy, we can obtain unbiased and zero-variance return estimates from otherwise biased multi-step returns (Sutton & Barto, [2005](https://arxiv.org/html/2606.17551#bib.bib58)).

We call the resulting off-policy flow RL algorithm reversal Q-learning (RQL), which is the main contribution of this work. Through our diverse experiments across 50 simulated robotic tasks, we demonstrate that RQL leads to the best performance compared to a number of strong off-policy flow-based RL baselines. We show that RQL is particularly strong in challenging long-horizon manipulation and locomotion environments.

## 2 Related Work

RL with iterative generative models\.Prior works have developed a variety of techniques to use modern iterative generative models \(e\.g\., denoising diffusion\(Sohl\-Dickstein et al\.,[2015](https://arxiv.org/html/2606.17551#bib.bib57); Ho et al\.,[2020](https://arxiv.org/html/2606.17551#bib.bib24)\)and flow matching\(Lipman et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib39); Albergo & Vanden\-Eijnden,[2023](https://arxiv.org/html/2606.17551#bib.bib4); Liu et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib42)\)\) for data\-driven RL, such as offline RL\(Lange et al\.,[2012](https://arxiv.org/html/2606.17551#bib.bib32); Levine et al\.,[2020](https://arxiv.org/html/2606.17551#bib.bib34)\)and offline\-to\-online RL\. These works have employed diffusion or flow matching for trajectory modeling\(Janner et al\.,[2022](https://arxiv.org/html/2606.17551#bib.bib27); Ajay et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib3); Zheng et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib69); Li et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib37); Chen et al\.,[2024](https://arxiv.org/html/2606.17551#bib.bib9)\), world modeling\(Lu et al\.,[2023a](https://arxiv.org/html/2606.17551#bib.bib43); Ding et al\.,[2024b](https://arxiv.org/html/2606.17551#bib.bib13); Jackson et al\.,[2024](https://arxiv.org/html/2606.17551#bib.bib26); Alonso et al\.,[2024](https://arxiv.org/html/2606.17551#bib.bib5)\), and policy learning\(Wang et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib61); Hansen\-Estruch et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib20); Chen et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib10); Kang et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib28); Ren et al\.,[2025](https://arxiv.org/html/2606.17551#bib.bib54); Park et al\.,[2025c](https://arxiv.org/html/2606.17551#bib.bib52)\)\. Our work falls into the last category\. 
We aim to develop a better algorithm to train a flow policy for off\-policy RL leveraging prior data\(Ball et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib6)\)\.

RL with diffusion and flow policies\.Due to their iterative nature, training diffusion or flow policies with RL is not a straightforward task\. A number of diverse approaches have been proposed to guide the iterative generation process to maximize returns\. These methods are based on different principles, such as backpropagation through time\(Wang et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib61); He et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib21); Ding & Jin,[2024](https://arxiv.org/html/2606.17551#bib.bib12); Ada et al\.,[2024](https://arxiv.org/html/2606.17551#bib.bib1); Zhang et al\.,[2024](https://arxiv.org/html/2606.17551#bib.bib65); Espinosa\-Dice et al\.,[2026](https://arxiv.org/html/2606.17551#bib.bib14)\), regression\(Lu et al\.,[2023b](https://arxiv.org/html/2606.17551#bib.bib44); Kang et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib28); Hansen\-Estruch et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib20); Chen et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib10); Ding et al\.,[2024a](https://arxiv.org/html/2606.17551#bib.bib11); Zhang et al\.,[2025](https://arxiv.org/html/2606.17551#bib.bib66)\), distillation\(Park et al\.,[2025c](https://arxiv.org/html/2606.17551#bib.bib52); Agrawalla et al\.,[2026](https://arxiv.org/html/2606.17551#bib.bib2)\), MDP expansion\(Ren et al\.,[2025](https://arxiv.org/html/2606.17551#bib.bib54); Gao et al\.,[2025](https://arxiv.org/html/2606.17551#bib.bib19)\), and more\(Yang et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib64); Mark et al\.,[2024](https://arxiv.org/html/2606.17551#bib.bib45); Fang et al\.,[2025](https://arxiv.org/html/2606.17551#bib.bib16); Wagenmaker et al\.,[2025](https://arxiv.org/html/2606.17551#bib.bib60); Zhang et al\.,[2026](https://arxiv.org/html/2606.17551#bib.bib67)\)\. In the rest of this section, we discuss why some of these paradigms may be limited in practice and how our method can provide a better alternative\.

\(1\) Backpropagation through time\.Arguably, the most straightforward way to train a diffusion policy with RL is to directly maximize a learned value function with gradient ascent, treating the iterative generation process as a black box\. While prior work has shown that this can sometimes be effective\(Wang et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib61); He et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib21); Ada et al\.,[2024](https://arxiv.org/html/2606.17551#bib.bib1); Zhang et al\.,[2024](https://arxiv.org/html/2606.17551#bib.bib65)\), this paradigm often suffers from an issue called backpropagation through time \(BPTT\), especially when using larger iteration steps\. Since gradients are propagated through the long chain of the entire iterative generation procedure, it often causes training instability and leads to suboptimal performance in practice\(Park et al\.,[2025c](https://arxiv.org/html/2606.17551#bib.bib52)\)\. In contrast, our method does not suffer from the BPTT issue because we treat iterative refinement steps as distinct MDP environment steps\.

\(2\) Regression\.To avoid BPTT, many previous works have explored regression\-based techniques to maximize returns with diffusion or flow policies\. These methods include weighted regression\(Lu et al\.,[2023b](https://arxiv.org/html/2606.17551#bib.bib44); Kang et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib28); Ding et al\.,[2024a](https://arxiv.org/html/2606.17551#bib.bib11); Zhang et al\.,[2025](https://arxiv.org/html/2606.17551#bib.bib66)\), rejection sampling\(Chen et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib10); Hansen\-Estruch et al\.,[2023](https://arxiv.org/html/2606.17551#bib.bib20); He et al\.,[2024](https://arxiv.org/html/2606.17551#bib.bib22); Park et al\.,[2025b](https://arxiv.org/html/2606.17551#bib.bib51)\), and filtering\(Frans et al\.,[2025](https://arxiv.org/html/2606.17551#bib.bib17); Intelligence et al\.,[2025](https://arxiv.org/html/2606.17551#bib.bib25)\)\. While these approaches do not suffer from the instabilities of BPTT, they only use zeroth\-order information from the value function \(i\.e\., they do not use value gradients\), which often leads to suboptimal performance \(in weighted regression\- or filtering\-based methods\) or requires a large amount of compute \(in rejection sampling\-based methods\)\(Park et al\.,[2024](https://arxiv.org/html/2606.17551#bib.bib49)\)\. Unlike these regression\-based approaches, we make better use of the value function by utilizing its first\-order \(gradient\) information, which we show leads to better performance in practice\.

(3) MDP expansion. An alternative paradigm to diffusion policy learning is to treat iterative refinement steps as MDP steps, and solve this “expanded” MDP with a standard, off-the-shelf RL algorithm. This framework is beneficial in that it does not suffer from BPTT and can fully utilize value gradients. Prior work has shown that variants of this idea indeed lead to strong performance in on-policy RL settings (Black et al., [2023](https://arxiv.org/html/2606.17551#bib.bib7); Fan et al., [2023](https://arxiv.org/html/2606.17551#bib.bib15); Ren et al., [2025](https://arxiv.org/html/2606.17551#bib.bib54)). However, this framework has rarely been applied to an offline *off-policy* RL setting. This is mainly because (1) the original dataset does not contain full diffusion trajectories and (2) it makes the horizon $F$ times longer (where $F$ is the number of iterative refinement steps), which exacerbates “the curse of horizon” in off-policy value learning (Liu et al., [2018](https://arxiv.org/html/2606.17551#bib.bib41); Park et al., [2025b](https://arxiv.org/html/2606.17551#bib.bib51)).

To our knowledge, the only prior work that applies MDP expansion to off\-policy RL is BDPO\(Gao et al\.,[2025](https://arxiv.org/html/2606.17551#bib.bib19)\), which concerns stochastic diffusion policies and employs bi\-level hierarchical value functions to deal with the increased horizon\. Unlike BDPO, our method is based on deterministic “reverse” flows, which enables us to address the horizon challenge without potentially complicated hierarchies\. Empirically, we also show that RQL leads to substantially better performance than this prior work\.

## 3 Preliminaries

Problem setting. We consider a Markov decision process (MDP) defined as $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, \mu, p)$ (Sutton & Barto, [2005](https://arxiv.org/html/2606.17551#bib.bib58)). $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $r(s, a): \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, $\mu(s) \in \Delta(\mathcal{S})$ is an initial state distribution, and $p(s' \mid s, a): \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is a transition dynamics kernel, where $\Delta(\mathcal{X})$ denotes the set of probability distributions over a space $\mathcal{X}$. We also assume that we are given a prior dataset $\mathcal{D} = \{\tau^{(n)}\}_{n \in \{1, 2, \ldots, N\}}$ consisting of trajectories $\tau = (s_0, a_0, r_0, s_1, \ldots, s_T)$, which may correspond to human demonstrations, previous rollouts, or even suboptimal data.

In this work, we consider the problem of *offline RL*. That is, we aim to find a return-maximizing policy leveraging a prior dataset $\mathcal{D}$. Formally, our goal is to train a policy $\pi(a \mid s): \mathcal{S} \to \Delta(\mathcal{A})$ (based on $\mathcal{D}$) that maximizes the discounted sum of rewards:

$$J(\pi) = \mathbb{E}_{\tau \sim p^{\pi}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right], \tag{1}$$

where $\gamma \in (0, 1)$ is a discount factor and

$$p^{\pi}(\tau) = \mu(s_0)\, \pi(a_0 \mid s_0)\, p(s_1 \mid s_0, a_0) \cdots \pi(a_{T-1} \mid s_{T-1})\, p(s_T \mid s_{T-1}, a_{T-1}). \tag{2}$$

Flow policies. Flow matching (Lipman et al., [2023](https://arxiv.org/html/2606.17551#bib.bib39); Albergo & Vanden-Eijnden, [2023](https://arxiv.org/html/2606.17551#bib.bib4); Liu et al., [2023](https://arxiv.org/html/2606.17551#bib.bib42)) provides a scalable way to train an expressive generative network to model a continuous data distribution. In this work, we consider *flow policies*, which model action distributions via flow matching. Formally, a flow policy is modeled by a time-dependent velocity field $v(s, x, f): \mathcal{S} \times \mathbb{R}^d \times [0, F] \to \mathbb{R}^d$, where we assume $\mathcal{A} = \mathbb{R}^d$ and use $f$ to denote the time variable (this choice is to avoid a notational clash with the time step $t$ in the MDP), and $F$ denotes the maximum time. This velocity field induces a *flow* $\psi(s, x, f): \mathcal{S} \times \mathbb{R}^d \times [0, F] \to \mathbb{R}^d$, which is defined as the unique solution (Lee, [2012](https://arxiv.org/html/2606.17551#bib.bib33)) to the following ordinary differential equation (ODE):

$$\frac{\mathrm{d}}{\mathrm{d}f} \psi(s, x, f) = v(s, \psi(s, x, f), f). \tag{3}$$

This ODE transforms a fixed prior distribution (e.g., the standard normal $\mathcal{N}(0, I_d)$) at $f = 0$ into a different distribution at $f = F$, which defines the action distribution of the flow policy. In practice, we use the Euler method with $F$ iteration steps to solve the ODE for $\psi$: i.e., we compute

$$x^{f+1} \leftarrow x^{f} + v(s, x^{f}, f) \tag{4}$$

at $f = 0, 1, \ldots, F-1$, where $x^0$ is sampled from the prior distribution and $x^F$ approximates the output $\psi(s, x^0, F)$.
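The Euler update in Equation (4) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: `euler_sample` and `toy_velocity` are hypothetical names, and the toy velocity field is a hand-written stand-in for a learned network, chosen so that the final step lands on a known target.

```python
import random


def euler_sample(velocity, s, x0, F):
    """Apply Equation (4): x^{f+1} = x^f + v(s, x^f, f) for f = 0, ..., F-1.

    Returns the whole path [x^0, ..., x^F]; x^F is the generated action.
    """
    xs = [list(x0)]
    for f in range(F):
        x = xs[-1]
        v = velocity(s, x, f)
        xs.append([xi + vi for xi, vi in zip(x, v)])
    return xs


def toy_velocity(s, x, f, F=4):
    """Hypothetical velocity field (an assumption, not from the paper).

    Moves x toward the state-dependent target 2*s so that the remaining
    distance is covered in the remaining F - f unit-size Euler steps.
    """
    target = [2.0 * si for si in s]
    remaining = F - f
    return [(ti - xi) / remaining for ti, xi in zip(target, x)]


random.seed(0)
state = [0.5, -1.0]
noise = [random.gauss(0.0, 1.0) for _ in state]  # x^0 ~ N(0, I_d)
path = euler_sample(toy_velocity, state, noise, F=4)
action = path[-1]  # x^F
```

Note that the step size is 1 because the flow time runs over $[0, F]$ rather than $[0, 1]$, which is why $F$ Euler steps cover the whole interval.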

Flow policies are often trained to model behavioral action distributions in the dataset $\mathcal{D}$. This can be done by minimizing the following *flow-matching* loss:

$$\mathcal{L}^{\mathrm{BC}}(v) = \mathbb{E}_{\substack{(s, a) \sim \mathcal{D}, \\ x^0 \sim \mathcal{N}(0, I_d), \\ f \sim \mathcal{U}(0, F)}}\left[\left\| v(s, x^{f}, f) - \frac{1}{F}(a - x^0) \right\|_2^2\right], \tag{5}$$

where

$$x^{f} = (1 - f/F)\, x^0 + (f/F)\, a \tag{6}$$

and $\mathcal{U}$ denotes a uniform distribution. It has been shown that the resulting velocity field generates a flow that transforms the Gaussian distribution $\mathcal{N}(0, I_d)$ into the behavioral action distribution $\pi^{\beta}(a \mid s)$ of the dataset (Lipman et al., [2024](https://arxiv.org/html/2606.17551#bib.bib40); Black et al., [2024](https://arxiv.org/html/2606.17551#bib.bib8); Park et al., [2025c](https://arxiv.org/html/2606.17551#bib.bib52)).
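To make Equations (5) and (6) concrete, the following sketch evaluates the flow-matching loss for a single sample and then averages it by Monte Carlo. It assumes a velocity field given as a plain Python callable; the function names, the one-element toy dataset, and `zero_velocity` are illustrative assumptions, and no gradient step is shown.

```python
import random


def interpolate(x0, a, f, F):
    """Equation (6): x^f = (1 - f/F) x^0 + (f/F) a."""
    w = f / F
    return [(1.0 - w) * x0i + w * ai for x0i, ai in zip(x0, a)]


def flow_matching_loss(velocity, s, a, x0, f, F):
    """Single-sample integrand of Equation (5)."""
    xf = interpolate(x0, a, f, F)
    target = [(ai - x0i) / F for ai, x0i in zip(a, x0)]  # (a - x^0) / F
    pred = velocity(s, xf, f)
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def mc_loss(velocity, dataset, F, num_samples, rng):
    """Monte Carlo estimate of Equation (5) over (s, a), x^0, and f."""
    total = 0.0
    for _ in range(num_samples):
        s, a = rng.choice(dataset)
        x0 = [rng.gauss(0.0, 1.0) for _ in a]
        f = rng.uniform(0.0, F)
        total += flow_matching_loss(velocity, s, a, x0, f, F)
    return total / num_samples


def zero_velocity(s, x, f):
    """Hypothetical untrained velocity field that always predicts zero."""
    return [0.0 for _ in x]


rng = random.Random(0)
dataset = [([0.0], [2.0, 0.0])]  # one toy (state, action) pair
single = flow_matching_loss(zero_velocity, [0.0], [2.0, 0.0], [0.0, 0.0], 1.0, 4)
estimate = mc_loss(zero_velocity, dataset, 4, 1000, rng)
```

The regression target $(a - x^0)/F$ is constant along the straight path from $x^0$ to $a$, so $F$ unit-size Euler steps with a perfectly fitted velocity would move $x^0$ exactly onto $a$.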

## 4 Reversal Q-Learning

Our goal in this work is to develop a performant (offline) off-policy RL algorithm that leverages prior data to train a flow policy. As briefly discussed in [Section 2](https://arxiv.org/html/2606.17551#S2), we develop our method based on the expanded MDP framework.

In this section, we first formally define the expanded MDP framework in [Section 4.1](https://arxiv.org/html/2606.17551#S4) and describe why it is challenging to directly apply this framework to off-policy RL in [Section 4.2](https://arxiv.org/html/2606.17551#S4.SS2). Then, we introduce our method, reversal Q-learning (RQL), as a solution in [Section 4.3](https://arxiv.org/html/2606.17551#S4.SS3), and discuss practical implementation techniques in [Section 4.4](https://arxiv.org/html/2606.17551#S4.SS4).

![Refer to caption](https://arxiv.org/html/2606.17551v1/x2.png)

Figure 2: Expanded MDP. The expanded MDP construction treats individual denoising steps as individual actions, which enables training a diffusion or flow policy with a standard RL algorithm. $F$ denotes the number of diffusion or flow integration steps.

### 4.1 Expanded MDPs

The main idea behind the expanded MDP framework is to treat each Euler integration step in a flow policy as a separate action. Essentially, this “expands” the MDP horizon by a factor of $F$, where $F$ is the number of Euler integration steps. This expanded MDP framework was originally proposed in prior work in diffusion models and diffusion policies (Black et al., [2023](https://arxiv.org/html/2606.17551#bib.bib7); Fan et al., [2023](https://arxiv.org/html/2606.17551#bib.bib15); Ren et al., [2025](https://arxiv.org/html/2606.17551#bib.bib54)). In this work, we consider flow policies instead of diffusion policies, and we describe its (deterministic) flow variant in this section.

Specifically, given an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A} = \mathbb{R}^d, r, \mu, p)$, the expanded MDP is defined as $\widetilde{\mathcal{M}} = (\widetilde{\mathcal{S}}, \mathcal{A} = \mathbb{R}^d, \widetilde{r}, \widetilde{\mu}, \widetilde{p})$. The augmented state space $\widetilde{\mathcal{S}} = \mathcal{S} \times \mathbb{R}^d \times \{0, 1, \ldots, F-1\}$ consists of elements corresponding to the tuples $(s, x, f)$ of a state $s \in \mathcal{S}$, a partially generated action $x \in \mathbb{R}^d$, and the discretized flow time $f \in \{0, 1, \ldots, F-1\}$.

The transition dynamics kernel $\widetilde{p}((s', x', f') \mid (s, x, f), a): \widetilde{\mathcal{S}} \times \mathbb{R}^d \to \Delta(\widetilde{\mathcal{S}})$ is defined as follows. If $f < F-1$, $(s', x', f')$ is deterministically set to $(s, x+a, f+1)$. Otherwise, $s'$ is sampled from $p(\cdot \mid s, x+a)$, $x'$ is sampled from $\mathcal{N}(0, I_d)$, and $f'$ is deterministically set to $0$. Intuitively, the expanded MDP queries the original MDP every $F$ steps to update the environment state $s$, and otherwise only updates the partially generated action $x$ following the Euler integration rule. The flow step $f$ in the expanded MDP serves as a counter. For the initial state distribution $\widetilde{\mu}(s, x, f)$, $s$ is sampled from $\mu(\cdot)$, $x$ is sampled from $\mathcal{N}(0, I_d)$, and $f$ is deterministically set to $0$.

The reward function $\widetilde{r}((s, x, f), a): \widetilde{\mathcal{S}} \times \mathbb{R}^d \to \mathbb{R}$ is defined as follows: if $f < F-1$, $\widetilde{r}((s, x, f), a) = 0$, and otherwise $\widetilde{r}((s, x, f), a) = r(s, x+a)$. In other words, as in the new transition dynamics, rewards in the expanded MDP are given only every $F$ steps by querying the original reward function. Similarly, only rewards from the original MDP are discounted, defined by a modified discount factor $\widetilde{\gamma}$: if $f < F-1$, $\widetilde{\gamma} = 1$, and otherwise $\widetilde{\gamma} = \gamma$.
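The transition, reward, and discount rules of the expanded MDP can be collected into a single step function. The sketch below is a minimal illustration under stated assumptions: `expanded_step`, `toy_env_step`, and `toy_reward` are hypothetical names, and the toy environment (additive dynamics, quadratic action penalty) stands in for the original MDP's $p$ and $r$.

```python
import random


def expanded_step(aug_state, a, F, gamma, env_step, env_reward, rng):
    """One transition of the expanded MDP.

    aug_state is (s, x, f). Returns (next_aug_state, reward, discount).
    """
    s, x, f = aug_state
    x_next = [xi + ai for xi, ai in zip(x, a)]
    if f < F - 1:
        # Intermediate flow step: only the partial action changes.
        return (s, x_next, f + 1), 0.0, 1.0
    # f == F - 1: x + a is the completed action sent to the original MDP.
    reward = env_reward(s, x_next)
    s_next = env_step(s, x_next, rng)
    fresh_noise = [rng.gauss(0.0, 1.0) for _ in x]  # x' ~ N(0, I_d)
    return (s_next, fresh_noise, 0), reward, gamma


def toy_env_step(s, action, rng):
    """Hypothetical deterministic dynamics: s' = s + action."""
    return [si + ai for si, ai in zip(s, action)]


def toy_reward(s, action):
    """Hypothetical reward: negative squared action norm."""
    return -sum(ai * ai for ai in action)


rng = random.Random(0)
state0 = ([0.0], [1.0], 0)
state1, r1, d1 = expanded_step(state0, [0.5], 2, 0.99, toy_env_step, toy_reward, rng)
state2, r2, d2 = expanded_step(state1, [-0.5], 2, 0.99, toy_env_step, toy_reward, rng)
```

With $F = 2$, the first call only advances the flow counter (reward 0, discount 1), while the second call executes the completed action $x + a = 1.0$ in the toy environment and applies the discount $\gamma$.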

### 4.2 Challenges

The expanded MDP framework defined in the previous section allows us to use an existing RL algorithm to directly train individual refinement steps of a flow policy to maximize returns. Indeed, prior work has shown that (a diffusion-based variant of) this expanded MDP framework enables training performant diffusion policies when combined with *on-policy* RL methods like PPO (Schulman et al., [2017](https://arxiv.org/html/2606.17551#bib.bib56); Ren et al., [2025](https://arxiv.org/html/2606.17551#bib.bib54)).

However, this framework in its vanilla form is not directly suitable for *off-policy* RL, which is the main focus of this work. There are two challenges. First, the given offline dataset $\mathcal{D} = \{(s_0, a_0, r_0, s_1, \ldots, s_T)^{(n)}\}$ only consists of transitions in the original MDP and does not contain flow trajectories for the expanded MDP. In other words, we do not have intermediate flow integration steps for each $(s, a, r, s')$ tuple in the dataset.

Second, and perhaps more importantly, this expanded MDP framework increases the horizon length by a factor of $F$. Compared to on-policy RL methods, which can tolerate long horizons relatively well thanks to on-policy value estimation techniques (e.g., GAE (Schulman et al., [2016](https://arxiv.org/html/2606.17551#bib.bib55))), off-policy RL struggles more as the horizon grows (Liu et al., [2018](https://arxiv.org/html/2606.17551#bib.bib41)). This is mainly because off-policy RL typically relies on temporal difference (TD) learning to estimate off-policy values, where biases in TD targets accumulate over the entire horizon and can substantially harm performance (Park et al., [2025b](https://arxiv.org/html/2606.17551#bib.bib51)).

### 4.3 Solution: Reversal

Our key insight in this work is that the *reversibility* and *determinism* of flow ODEs provide us with a solution that addresses both of the challenges. Specifically, for each $(s, a, r, s')$ tuple in the original dataset, we first generate “virtual” flow trajectories in the expanded MDP by following the flow ODE in the *reverse* direction. Next, we apply multi-step returns to these virtual trajectories to reduce the effective horizon. Importantly, since the virtual flow trajectories are deterministic and on-policy, these multi-step returns are unbiased and zero-variance, unlike in the general case (Sutton & Barto, [2005](https://arxiv.org/html/2606.17551#bib.bib58)).

![Refer to caption](https://arxiv.org/html/2606.17551v1/x3.png)

Figure 3: Flow reversal. We generate “virtual” on-policy flow trajectories by following the ODE in the reverse direction.

Generating virtual trajectories for $\widetilde{\mathcal{M}}$. The first step is, for a transition tuple $(s, a, r, s')$ in the dataset $\mathcal{D}$, to generate the corresponding flow trajectory in $\widetilde{\mathcal{M}}$ with respect to the current flow policy $v$. Our observation is that this can be done by computing the “reverse” flow $\theta(s, x, f): \mathcal{S} \times \mathbb{R}^d \times [0, F] \to \mathbb{R}^d$ defined by the following ODE:

$$\frac{\mathrm{d}}{\mathrm{d}f} \theta(s, x, f) = -v(s, \theta(s, x, f), f). \tag{7}$$

Note that the sign of the velocity field $v$ is reversed. The reason behind this is that a flow induces diffeomorphisms (i.e., smooth bijections whose inverses are also smooth) between any two time steps under mild regularity assumptions (Lee, [2012](https://arxiv.org/html/2606.17551#bib.bib33)), and its inverse flow is induced by the negative of the original velocity field. This reverse ODE can be solved with the Euler method by computing

$$x^{f-1} \leftarrow x^{f} - v(s, x^{f}, f) \tag{8}$$

at $f = F, F-1, \ldots, 1$, where the initial value is given by $x^F = a$ ([Figure 1](https://arxiv.org/html/2606.17551#S0.F1)). After computing $x^0, x^1, \ldots, x^F$, we obtain the following transitions for $\widetilde{\mathcal{M}}$:

$$\big(\underbrace{(s, x^{f}, f)}_{\mathrm{state}},\ \underbrace{x^{f+1} - x^{f}}_{\mathrm{action}},\ \underbrace{0}_{\mathrm{reward}},\ \underbrace{(s, x^{f+1}, f+1)}_{\mathrm{next\ state}}\big) \tag{9}$$

for $f = 0, 1, \ldots, F-2$ and

$$\big(\underbrace{(s, x^{f}, f)}_{\mathrm{state}},\ \underbrace{x^{f+1} - x^{f}}_{\mathrm{action}},\ \underbrace{r}_{\mathrm{reward}},\ \underbrace{(s', x^{\prime 0}, 0)}_{\mathrm{next\ state}}\big), \tag{10}$$

for $f = F-1$, where $x^{\prime 0} \sim \mathcal{N}(0, I_d)$. We define $\widetilde{\mathcal{D}}$ as the set of transitions in [Equations 9](https://arxiv.org/html/2606.17551#S4.E9) and [10](https://arxiv.org/html/2606.17551#S4.E10). Note that $\widetilde{\mathcal{D}}$ depends on the current flow policy $v$, and is re-computed from $\mathcal{D}$ for each batch (see [Algorithm 1](https://arxiv.org/html/2606.17551#alg1) for details).
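The reverse Euler pass in Equation (8) and the transition construction in Equations (9) and (10) can be sketched as follows. This is an illustrative sketch only, not the released implementation: the function names are hypothetical, `toy_velocity` is an arbitrary hand-written field standing in for the current flow policy $v$, and batching and training are omitted.

```python
import random


def reverse_euler(velocity, s, a, F):
    """Equation (8): x^{f-1} = x^f - v(s, x^f, f) for f = F, ..., 1, with x^F = a."""
    xs = [None] * (F + 1)
    xs[F] = list(a)
    for f in range(F, 0, -1):
        v = velocity(s, xs[f], f)
        xs[f - 1] = [xi - vi for xi, vi in zip(xs[f], v)]
    return xs


def virtual_transitions(velocity, s, a, r, s_next, F, rng):
    """Build the F expanded-MDP transitions of Equations (9) and (10)."""
    xs = reverse_euler(velocity, s, a, F)
    out = []
    for f in range(F):
        act = [hi - lo for hi, lo in zip(xs[f + 1], xs[f])]  # x^{f+1} - x^f
        if f < F - 1:
            out.append(((s, xs[f], f), act, 0.0, (s, xs[f + 1], f + 1)))
        else:
            noise = [rng.gauss(0.0, 1.0) for _ in a]  # x'^0 ~ N(0, I_d)
            out.append(((s, xs[f], f), act, r, (s_next, noise, 0)))
    return out


def toy_velocity(s, x, f):
    """Hypothetical velocity field (an assumption, not a trained policy)."""
    return [0.1 * xi + 0.05 * f + si for xi, si in zip(x, s)]


rng = random.Random(0)
a = [0.8, -0.3]
transitions = virtual_transitions(toy_velocity, [0.2, 0.1], a, 1.5, [0.3, 0.0], 4, rng)

# Replaying the stored actions from x^0 recovers the dataset action a.
x = list(transitions[0][0][1])
for _, act, _, _ in transitions:
    x = [xi + ai for xi, ai in zip(x, act)]
```

Because each stored action is the difference $x^{f+1} - x^{f}$, the replayed trajectory ends at the dataset action $a$ by construction, and only the final transition carries the dataset reward $r$.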

Reducing the effective horizon. With $\widetilde{\mathcal{D}}$ defined above, we can now train an off-policy value function for $\widetilde{\mathcal{M}}$ with Q-learning. For example, one may train a Q function $Q((s, x, f), a): \widetilde{\mathcal{S}} \times \mathcal{A} \to \mathbb{R}$ with the following standard temporal difference loss:

$$\mathcal{L}(Q) = \mathbb{E}\Big[\Big(Q((s, x, f), a) - r - \widetilde{\gamma} \max_{a'} \bar{Q}((s', x', f'), a')\Big)^2\Big], \tag{11}$$

where transition tuples $((s, x, f), a, r, (s', x', f'))$ are sampled from the dataset $\widetilde{\mathcal{D}}$, and $\bar{Q}$ denotes a target network (Mnih et al., [2013](https://arxiv.org/html/2606.17551#bib.bib46)). However, while this is a valid objective in theory, it is challenging to obtain an accurate value function with this vanilla objective in practice. This is because the horizon length has increased from $T$ to $T \times F$ in the expanded MDP, where this increased horizon impedes off-policy value learning (see [Section 4.2](https://arxiv.org/html/2606.17551#S4.SS2) for a detailed explanation).

Our idea to address this horizon challenge is to observe that (some) multi-step returns are *unbiased and zero-variance* in this expanded MDP, because the flow trajectories in $\widetilde{\mathcal{D}}$ are deterministic and on-policy. Specifically, we propose to use the following multi-step Q-learning objective instead:

$$\mathcal{L}(Q) = \mathbb{E}_{\widetilde{\tau}}\Big[\Big(Q((s,x^{f},f),a^{f}) - r - \gamma\max_{a^{\prime}}\bar{Q}((s^{\prime},x^{\prime 0},0),a^{\prime})\Big)^{2}\Big], \tag{12}$$

where flow trajectories

$$\begin{aligned}\widetilde{\tau} = \big(&\underbrace{(s,x^{0},0)}_{\mathrm{state}},\ \underbrace{a^{0}}_{\mathrm{action}},\ \underbrace{0}_{\mathrm{reward}},\ \underbrace{(s,x^{1},1)}_{\mathrm{state}},\ \underbrace{a^{1}}_{\mathrm{action}},\ \underbrace{0}_{\mathrm{reward}},\ \ldots,\\ &\underbrace{(s,x^{F-1},F-1)}_{\mathrm{state}},\ \underbrace{a^{F-1}}_{\mathrm{action}},\ \underbrace{r}_{\mathrm{reward}},\ \underbrace{(s^{\prime},x^{\prime 0},0)}_{\mathrm{state}}\big)\end{aligned} \tag{13}$$

are sampled from $\widetilde{\mathcal{D}}$ and $f$ is sampled uniformly from $\{0, 1, \ldots, F-1\}$.

Intuitively, [Equation 12](https://arxiv.org/html/2606.17551#S4.E12) directly takes the TD target at the end of each flow trajectory, skipping intermediate flow transitions ([Figure 1](https://arxiv.org/html/2606.17551#S0.F1)). Since intermediate flow trajectories are fully deterministic and are synthesized with respect to the current policy $v$ (i.e., they are on-policy), the multi-step return in [Equation 12](https://arxiv.org/html/2606.17551#S4.E12) is unbiased and zero-variance, unlike standard off-policy multi-step TD learning (Sutton & Barto, [2005](https://arxiv.org/html/2606.17551#bib.bib58)).

This technique reduces the effective value horizon (i.e., the number of Bellman updates required to propagate information along each trajectory) from $T \times F$ to $T$ on average. In our experiments, we show that this “value horizon reduction” (Park et al., [2025b](https://arxiv.org/html/2606.17551#bib.bib51)) is indeed crucial in achieving strong performance in practice.
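The horizon argument can be checked on a toy problem. The sketch below is a simplified tabular illustration, not the paper's implementation: it uses a single deterministic chain of $T$ environment steps with $F$ flow steps each, a sparse reward on the last environment step, a state-value table instead of a Q-function, and a discount applied only at environment-step boundaries (all of these are assumptions made for brevity). It counts how many synchronous Bellman sweeps are needed before the start state receives any signal. The function name `sweeps_until_start_learns` is hypothetical.

```python
def sweeps_until_start_learns(T, F, gamma, multi_step):
    """Count synchronous sweeps until V at (t=0, f=0) becomes nonzero.

    Requires gamma > 0, otherwise the reward never reaches the start state.
    """
    V = [[0.0] * F for _ in range(T)]
    reward = [0.0] * (T - 1) + [1.0]  # sparse reward on the final environment step

    def end_of_flow_target(t, V_old):
        # Target at the end of flow trajectory t: r + gamma * V(s', x'^0, 0).
        bootstrap = V_old[t + 1][0] if t + 1 < T else 0.0
        return reward[t] + gamma * bootstrap

    sweeps = 0
    while V[0][0] == 0.0:
        V_old = [row[:] for row in V]
        for t in range(T):
            for f in range(F):
                if multi_step or f == F - 1:
                    # Jump straight to the end-of-trajectory target.
                    V[t][f] = end_of_flow_target(t, V_old)
                else:
                    # One-step backup across an intermediate, zero-reward flow step.
                    V[t][f] = V_old[t][f + 1]
        sweeps += 1
    return sweeps


one_step = sweeps_until_start_learns(T=3, F=4, gamma=0.99, multi_step=False)
multi = sweeps_until_start_learns(T=3, F=4, gamma=0.99, multi_step=True)
print(one_step, multi)  # 12 3, i.e. T * F versus T
```

In this toy setting the one-step backup needs $T \times F$ sweeps, while the end-of-trajectory target needs only $T$, mirroring the reduction described above.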

### 4.4 Practical Algorithm

Based on the ideas described in [Section 4.3](https://arxiv.org/html/2606.17551#S4.SS3), we now introduce a practical algorithm to train a flow policy with off-policy RL using prior data. We call the resulting method reversal Q-learning (RQL).

Value learning. The multi-step value loss in [Equation 12](https://arxiv.org/html/2606.17551#S4.E12) requires computing the maximum over next actions ($\max_{a^{\prime}}$). Since naïvely computing it often leads to out-of-distribution queries of the Q-function, especially when using offline data (Levine et al., [2020](https://arxiv.org/html/2606.17551#bib.bib34)), we instead use expectile regression (Newey & Powell, [1987](https://arxiv.org/html/2606.17551#bib.bib48); Kostrikov et al., [2022](https://arxiv.org/html/2606.17551#bib.bib31)) to compute this maximum in an implicit manner.

Specifically, we consider the following IVL-like loss (which is a value-only variant of implicit Q-learning (Kostrikov et al., [2022](https://arxiv.org/html/2606.17551#bib.bib31); Park et al., [2025a](https://arxiv.org/html/2606.17551#bib.bib50))) to train a value function $V(s,x,f): \mathcal{S} \times \mathbb{R}^{d} \times \{0, \ldots, F-1\} \to \mathbb{R}$:

$$\mathcal{L}(V) = \mathbb{E}_{\widetilde{\tau}}\big[\ell_{2}^{\kappa}\big(V(s,x^{f},f) - (r + \gamma V(s^{\prime},x^{\prime 0},0))\big)\big], \tag{14}$$

where $\ell_{2}^{\kappa}(x) = |\kappa - \mathbb{I}(x > 0)|\,x^{2}$ is the expectile loss with an expectile $\kappa$. Intuitively, this asymmetric expectile loss approximates the $\max_{a^{\prime}}$ operator in [Equation 12](https://arxiv.org/html/2606.17551#S4.E12) without having to explicitly search for the maximum.
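To make the effect of the expectile $\kappa$ concrete, here is a small Python sketch of the expectile loss $\ell_{2}^{\kappa}$ with the sign convention used above (residual = prediction minus target), together with a helper that finds the scalar minimizing the summed loss over a fixed set of targets. The helper `expectile`, its reweighted-averaging scheme, and the sample targets are illustrative assumptions, not part of the method as described.

```python
def expectile_loss(x, kappa):
    """l_2^kappa(x) = |kappa - 1(x > 0)| * x^2, with x = prediction - target."""
    indicator = 1.0 if x > 0 else 0.0
    return abs(kappa - indicator) * x * x


def expectile(targets, kappa, iters=200):
    """Scalar m minimizing sum_i expectile_loss(m - y_i, kappa), for 0 < kappa < 1.

    Uses repeated weighted averaging: the minimizer satisfies
    sum_i w_i * (m - y_i) = 0 with the weights defined below.
    """
    m = sum(targets) / len(targets)
    for _ in range(iters):
        # Positive residual (m above the target) gets weight 1 - kappa;
        # non-positive residual (m at or below the target) gets weight kappa.
        weights = [(1.0 - kappa) if m > y else kappa for y in targets]
        m = sum(w * y for w, y in zip(weights, targets)) / sum(weights)
    return m


hypothetical_targets = [0.0, 1.0, 2.0, 3.0, 10.0]
for kappa in (0.5, 0.9, 0.99):
    print(kappa, expectile(hypothetical_targets, kappa))
```

With $\kappa = 0.5$ the loss is symmetric and the minimizer is the mean of the targets; as $\kappa$ approaches 1 the minimizer moves toward the largest target, which is the sense in which the loss approximates a maximum.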

Policy learning. To train a flow policy to maximize the learned value function $V$, we employ a DDPG-style loss with a behavioral regularizer (Lillicrap et al., [2016](https://arxiv.org/html/2606.17551#bib.bib38); Wu et al., [2019](https://arxiv.org/html/2606.17551#bib.bib63); Fujimoto & Gu, [2021](https://arxiv.org/html/2606.17551#bib.bib18)). Specifically, we train the velocity field $v$ to minimize the following loss:

$$\mathcal{L}(v) = \underbrace{-\mathbb{E}_{\widetilde{\tau}}\big[V(s, x^{f} + v(s,x^{f},f), f+1)\big]}_{\mathrm{value\ maximization}} + \underbrace{\alpha\,\mathcal{L}^{\mathrm{BC}}(v)}_{\mathrm{behavioral\ regularization}}, \tag{15}$$

where $\mathcal{L}^{\mathrm{BC}}(v)$ is defined in [Equation 5](https://arxiv.org/html/2606.17551#S3.E5) and $\alpha$ is a hyperparameter that controls the strength of the behavioral flow-matching regularizer. Intuitively, the first term pushes velocity vectors to maximize the value, while the second term regularizes the flow policy to be close to the prior dataset.

We found that having this behavioral regularizer (common in offline RL) is beneficial in practice for two reasons. First, it allows the policy to capture useful behavioral priors from the dataset throughout the training. Second, it encourages $x^{0}$ computed via reversal ([Equation 8](https://arxiv.org/html/2606.17551#S4.E8)) to be closer to the prior distribution $\mathcal{N}(0, I_{d})$ by minimizing distributional shifts between the current policy and the prior dataset, which in turn makes the target value in [Equation 14](https://arxiv.org/html/2606.17551#S4.E14) more accurate.
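The trade-off between the two terms of the policy loss can be seen in a one-dimensional toy problem. The sketch below is only an illustration under strong assumptions: the velocity is a single scalar parameter `w`, the value function is replaced by a hand-written quadratic with its peak at `goal`, and the flow-matching regularizer $\mathcal{L}^{\mathrm{BC}}$ is replaced by a squared distance to a hypothetical behavioral velocity `u_bc`. None of these stand-ins come from the paper; they only show how $\alpha$ interpolates between following the value gradient and staying close to the data.

```python
def policy_loss(w, x, goal, u_bc, alpha):
    """Toy analogue of the policy loss: -V(x + w) + alpha * BC(w)."""
    value = -((x + w - goal) ** 2)  # stand-in for V(s, x^f + v, f + 1)
    bc = (w - u_bc) ** 2            # stand-in for L^BC(v) (assumption)
    return -value + alpha * bc


def policy_grad(w, x, goal, u_bc, alpha):
    """Gradient of policy_loss with respect to w (first-order value information)."""
    return 2.0 * (x + w - goal) + 2.0 * alpha * (w - u_bc)


def train_velocity(x, goal, u_bc, alpha, steps=100):
    """Plain gradient descent on w; the step size is scaled by the curvature 2 * (1 + alpha)."""
    lr = 0.25 / (1.0 + alpha)
    w = 0.0
    for _ in range(steps):
        w -= lr * policy_grad(w, x, goal, u_bc, alpha)
    return w


for alpha in (0.0, 1.0, 1e6):
    print(alpha, train_velocity(x=0.0, goal=1.0, u_bc=0.2, alpha=alpha))
```

In this toy problem the minimizer has the closed form $w^{*} = ((\mathrm{goal} - x) + \alpha\,u_{\mathrm{bc}}) / (1 + \alpha)$, so $\alpha = 0$ recovers pure value maximization and a very large $\alpha$ recovers the behavioral velocity.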

Implementation. In practice, we additionally apply action chunking (Zhao et al., [2023](https://arxiv.org/html/2606.17551#bib.bib68); Li et al., [2025](https://arxiv.org/html/2606.17551#bib.bib36)) for both value functions and the flow policy, which we found to improve performance. We denote the resulting action-chunked dataset as $\mathcal{D}_{\mathrm{ac}}$. We summarize our algorithm in [Algorithm 1](https://arxiv.org/html/2606.17551#alg1).

Algorithm 1: Reversal Q-Learning (RQL)

1. Initialize value function $V(s,x^{f},f)$ and flow policy $v(s,x^{f},f)$.
2. While unfinished, do:
   - Sample $(s,a,r,s^{\prime}) \sim \mathcal{D}_{\mathrm{ac}}$.
   - Compute $\widetilde{\tau}$ using [Equation 8](https://arxiv.org/html/2606.17551#S4.E8).
   - Train $V$ by minimizing $\mathcal{L}(V)$ ([Equation 14](https://arxiv.org/html/2606.17551#S4.E14)).
   - Train $v$ by minimizing $\mathcal{L}(v)$ ([Equation 15](https://arxiv.org/html/2606.17551#S4.E15)).

Why is RQL beneficial? The RQL algorithm described in [Algorithm 1](https://arxiv.org/html/2606.17551#alg1) has several appealing properties. First, by treating each individual flow step as a distinct action, RQL does not suffer from backpropagation through time or related issues that arise when performing RL with iterative models. Second, the flow policy loss in [Equation 15](https://arxiv.org/html/2606.17551#S4.E15) utilizes first-order (gradient) information from the value function, which leads to better efficiency and performance compared to zeroth-order (value-gradient-free) methods, such as the regression-based methods described in [Section 2](https://arxiv.org/html/2606.17551#S2). We empirically support this claim in our experiments. Third, we reduce the increased effective horizon of the expanded MDP to the original length *without incurring any bias or variance*. This mitigates the curse of horizon in off-policy RL, enabling more accurate and effective off-policy value learning.

## 5 Experiments

In this section, we empirically evaluate the performance of RQL on a variety of challenging simulated robotic manipulation tasks\. We also experimentally demonstrate how each component of RQL is necessary to achieve strong performance in practice\.

![Refer to caption](https://arxiv.org/html/2606.17551v1/figures/envs/scene.png)scene
![Refer to caption](https://arxiv.org/html/2606.17551v1/figures/envs/puzzle-4x4.png)puzzle\-4x4
![Refer to caption](https://arxiv.org/html/2606.17551v1/figures/envs/cube-quadruple.png)cube\-quad
![Refer to caption](https://arxiv.org/html/2606.17551v1/figures/envs/antmaze-giant-v0_layout.png)amz\-giant
![Refer to caption](https://arxiv.org/html/2606.17551v1/figures/envs/humanoidmaze1.png)hmz\-large

Figure 4: Environments.

### 5.1 Experimental Setup

Tasks and datasets. We employ 50 robotic manipulation tasks in the OGBench benchmark suite (Park et al., [2025a](https://arxiv.org/html/2606.17551#bib.bib50)) in our experiments. Specifically, following Li & Levine ([2024](https://arxiv.org/html/2606.17551#bib.bib35)), we consider both manipulation tasks like scene, puzzle, and cube as well as locomotion tasks like antmaze and humanoidmaze ([Figure 4](https://arxiv.org/html/2606.17551#S5.F4)). These tasks require object manipulation with stitching (scene), combinatorial reasoning (puzzle), fine-grained control (cube), long-horizon locomotion (antmaze), and high-dimensional control (humanoidmaze). We consider two variants of tasks with different levels of difficulty for puzzle, cube, antmaze, and humanoidmaze respectively.

Using the same experimental setting as Li & Levine ([2024](https://arxiv.org/html/2606.17551#bib.bib35)), we use the expanded 100M-sized datasets for cube-quadruple and puzzle-4x4 provided by Park et al. ([2025a](https://arxiv.org/html/2606.17551#bib.bib50)), while other environments use the standard play and navigate datasets. These datasets consist of task-agnostic trajectories that repeatedly perform random atomic motions (e.g., randomly press buttons in puzzle). Hence, the agent must be able to stitch different trajectory segments in the dataset to solve the given task. In addition, we use the sparse reward variant of scene and puzzle following Li & Levine ([2024](https://arxiv.org/html/2606.17551#bib.bib35)), while other tasks use the standard semi-sparse reward function (i.e., one that only depends on the number of remaining tasks), which is the default setting for OGBench singletask tasks (Park et al., [2025a](https://arxiv.org/html/2606.17551#bib.bib50)).

Methods and comparisons. We compare RQL with diverse strong baselines across different categories. For a Gaussian policy baseline, we consider ReBRAC (Tarasov et al., [2023](https://arxiv.org/html/2606.17551#bib.bib59)). For flow-based off-policy RL baselines, we consider 18 methods across seven categories: FQL (Park et al., [2025c](https://arxiv.org/html/2606.17551#bib.bib52)) as a distillation-based method, IFQL (Hansen-Estruch et al., [2023](https://arxiv.org/html/2606.17551#bib.bib20); Park et al., [2025c](https://arxiv.org/html/2606.17551#bib.bib52)) as a rejection sampling-based method, FAWAC (Nair et al., [2020](https://arxiv.org/html/2606.17551#bib.bib47); Park et al., [2025c](https://arxiv.org/html/2606.17551#bib.bib52)) as a weighted regression-based method, FBRAC (Park et al., [2025c](https://arxiv.org/html/2606.17551#bib.bib52)) as a backpropagation through time method, CGQL, DAC, and QSM (Li & Levine, [2024](https://arxiv.org/html/2606.17551#bib.bib35); Fang et al., [2025](https://arxiv.org/html/2606.17551#bib.bib16); Psenka et al., [2024](https://arxiv.org/html/2606.17551#bib.bib53)) as test-time Q-gradient-based methods, DSRL as a latent noise steering method (Wagenmaker et al., [2025](https://arxiv.org/html/2606.17551#bib.bib60)), FEDIT as a residual edit method (Li & Levine, [2024](https://arxiv.org/html/2606.17551#bib.bib35)), BAM and QAM (Li & Levine, [2024](https://arxiv.org/html/2606.17551#bib.bib35)) as adjoint matching methods, and BDPO (Gao et al., [2025](https://arxiv.org/html/2606.17551#bib.bib19)) as a method using the expanded MDP construction.

For methods other than BDPO and RQL, we take the corresponding results from prior work (Li & Levine, [2024](https://arxiv.org/html/2606.17551#bib.bib35)); note that we use the same setting as this prior work, so the results are fully compatible. We also use action-chunking variants of these methods to ensure a fair comparison. All experiments in this work are averaged over four seeds, and we present 95% confidence intervals in tables and figures.
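The text does not state how the confidence intervals are computed. As one common possibility, the sketch below computes a mean and a two-sided 95% Student-t interval from four per-seed scores. The interval method, the function name `mean_and_ci95`, and the scores are all assumptions for illustration; they are not results or procedures taken from the paper.

```python
import statistics

# Two-sided 95% Student-t critical value for 3 degrees of freedom (four seeds).
T_CRIT_95_DF3 = 3.182


def mean_and_ci95(scores):
    """Return (mean, half_width) of a t-based 95% interval for exactly four seeds."""
    if len(scores) != 4:
        raise ValueError("this sketch hard-codes the critical value for four seeds")
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, T_CRIT_95_DF3 * standard_error


hypothetical_scores = [80.0, 84.0, 78.0, 86.0]  # made-up success rates, one per seed
mean, half_width = mean_and_ci95(hypothetical_scores)
print(f"{mean:.1f} +/- {half_width:.1f}")
```

With only four seeds the t critical value is large, so intervals of this kind are wide; a bootstrap interval would be another reasonable choice.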

Figure 5: Overall Performance. RQL exceeds the aggregate performance of all baselines across the 50 tasks.

![Refer to caption](https://arxiv.org/html/2606.17551v1/x4.png)

![Refer to caption](https://arxiv.org/html/2606.17551v1/x5.png)

Figure 6: Ablation study on the expectile $\kappa$.

![Refer to caption](https://arxiv.org/html/2606.17551v1/x6.png)

Figure 7: Ablation study on the BC regularization coefficient $\alpha$.

### 5.2 Results

We present the full evaluation results on 50 challenging robotic manipulation tasks in [Table 1](https://arxiv.org/html/2606.17551#A1.T1) and aggregated performance in [Figure 5](https://arxiv.org/html/2606.17551#S5.F5). The results show that RQL achieves the best average performance across the board, showing particularly strong performance on the most challenging variants of tasks like antmaze-giant, humanoidmaze-large, puzzle-4x4, and cube-quadruple. Note that our expanded MDP-based framework outperforms the most standard ways of training a flow policy from a value function, such as backpropagation through time (FBRAC), rejection sampling (IFQL), and weighted regression (FAWAC).

One of our main ideas is to apply unbiased and zero-variance multi-step returns ([Equation 12](https://arxiv.org/html/2606.17551#S4.E12)) to reduce the effective horizon length for value learning. To understand the importance of value horizon reduction, we compare against a variant without this technique, denoted by TFQL in [Figure 5](https://arxiv.org/html/2606.17551#S5.F5) and [Table 1](https://arxiv.org/html/2606.17551#A1.T1). In particular, TFQL retains an effective horizon length of $T \times F$, compared to $T$ for RQL. The results suggest that applying horizon reduction is indeed crucial in achieving strong performance, and naïvely using the vanilla expanded MDP framework (without horizon reduction) can lead to a complete failure on some benchmark tasks.

### 5.3 Ablation Study

In this section, we ablate two components of RQL and discuss their effects. We run these ablations on the default singletask variants of five representative tasks: antmaze-giant, humanoidmaze-large, scene, puzzle-4x4, and cube-quadruple.

Expectile $\kappa$. In RQL, we use the expectile loss ([Equation 14](https://arxiv.org/html/2606.17551#S4.E14)) to approximate the $\max$ operator in the Bellman operator. [Figure 6](https://arxiv.org/html/2606.17551#S5.F6) shows an ablation study on the value of $\kappa$. Note that $\kappa = 0.5$ corresponds to SARSA, since the expectile loss then reduces to a symmetric squared error. The results suggest that using a high $\kappa$ (i.e., “more RL”) often leads to better performance across different tasks, likely because the OGBench datasets are highly suboptimal for the given tasks.

BC coefficient $\alpha$. As in many other offline RL methods (Fujimoto & Gu, [2021](https://arxiv.org/html/2606.17551#bib.bib18); Wang et al., [2023](https://arxiv.org/html/2606.17551#bib.bib61); Park et al., [2025c](https://arxiv.org/html/2606.17551#bib.bib52)), RQL also has a hyperparameter that interpolates between RL and BC (i.e., the BC coefficient $\alpha$ in [Equation 15](https://arxiv.org/html/2606.17551#S4.E15)). We ablate this hyperparameter and present the results on tasks across diverse categories in [Figure 7](https://arxiv.org/html/2606.17551#S5.F7). The results show that $\alpha$ is the most important hyperparameter to tune.

## 6 Conclusion

In this work, we proposed a flow\-based off\-policy RL algorithm based on the expanded MDP framework\. Our ideas based on “flow reversal” enable training an effective flow policy without suffering from backpropagation through time or the curse of horizon in off\-policy RL, while making use of rich gradient information in the learned value function\. Through our experiments across a number of robotic manipulation tasks, we empirically demonstrate that RQL achieves the best performance compared to other strong baselines, especially on challenging long\-horizon tasks\.

Limitations and future work. While RQL achieves strong empirical performance across diverse tasks, it has several limitations, which open up diverse opportunities for future work. First, we find offline RL performance is relatively sensitive to both the BC coefficient and expectile, and we expect that these hyperparameters need to be swept properly ([Section C.2](https://arxiv.org/html/2606.17551#A3.SS2)) for best performance. Second, we only demonstrate the capabilities of RQL in the context of RL and robotic control. Given the generality of this framework, we believe it may also be applied to fine-tune image generation models or other modalities beyond RL and control, which we leave for future work.

## Acknowledgments

This work was partly supported by the Korea Foundation for Advanced Studies \(KFAS\), AFOSR FA9550\-22\-1\-0273, ONR N00014\-25\-1\-2060, and DARPA ANSR\. This research used the Savio computational cluster resource provided by the Berkeley Research Computing program at UC Berkeley\.

## References

- Ada et al. (2024) Ada, S. E., Oztop, E., and Ugur, E. Diffusion policies for out-of-distribution generalization in offline reinforcement learning. In *IEEE Robotics and Automation Letters (RA-L)*, 2024.
- Agrawalla et al. (2026) Agrawalla, B., Nauman, M., Agrawal, K., and Kumar, A. floq: Training critics via flow-matching for scaling compute in value-based rl. In *International Conference on Learning Representations (ICLR)*, 2026.
- Ajay et al. (2023) Ajay, A., Du, Y., Gupta, A., Tenenbaum, J., Jaakkola, T., and Agrawal, P. Is conditional generative modeling all you need for decision-making? In *International Conference on Learning Representations (ICLR)*, 2023.
- Albergo & Vanden-Eijnden (2023) Albergo, M. S. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. In *International Conference on Learning Representations (ICLR)*, 2023.
- Alonso et al. (2024) Alonso, E., Jelley, A., Micheli, V., Kanervisto, A., Storkey, A., Pearce, T., and Fleuret, F. Diffusion for world modeling: Visual details matter in atari. In *Neural Information Processing Systems (NeurIPS)*, 2024.
- Ball et al. (2023) Ball, P. J., Smith, L., Kostrikov, I., and Levine, S. Efficient online reinforcement learning with offline data. In *International Conference on Machine Learning (ICML)*, 2023.
- Black et al. (2023) Black, K., Janner, M., Du, Y., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learning. *ArXiv*, abs/2305.13301, 2023.
- Black et al. (2024) Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. $\pi_{0}$: A vision-language-action flow model for general robot control. *ArXiv*, abs/2410.24164, 2024.
- Chen et al. (2024) Chen, C., Deng, F., Kawaguchi, K., Gulcehre, C., and Ahn, S. Simple hierarchical planning with diffusion. In *International Conference on Learning Representations (ICLR)*, 2024.
- Chen et al. (2023) Chen, H., Lu, C., Ying, C., Su, H., and Zhu, J. Offline reinforcement learning via high-fidelity generative behavior modeling. In *International Conference on Learning Representations (ICLR)*, 2023.
- Ding et al. (2024a) Ding, S., Hu, K., Zhang, Z., Ren, K., Zhang, W., Yu, J., Wang, J., and Shi, Y. Diffusion-based reinforcement learning via q-weighted variational policy optimization. In *Neural Information Processing Systems (NeurIPS)*, 2024a.
- Ding & Jin (2024) Ding, Z. and Jin, C. Consistency models as a rich and efficient policy class for reinforcement learning. In *International Conference on Learning Representations (ICLR)*, 2024.
- Ding et al. (2024b) Ding, Z., Zhang, A., Tian, Y., and Zheng, Q. Diffusion world model. *ArXiv*, abs/2402.03570, 2024b.
- Espinosa-Dice et al. (2026) Espinosa-Dice, N., Zhang, Y., Chen, Y., Guo, B., Oertell, O., Swamy, G., Brantley, K., and Sun, W. Scaling offline rl via efficient and expressive shortcut models. In *Neural Information Processing Systems (NeurIPS)*, 2026.
- Fan et al. (2023) Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., and Lee, K. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. In *Neural Information Processing Systems (NeurIPS)*, 2023.
- Fang et al. (2025) Fang, L., Liu, R., Zhang, J., Wang, W., and Jing, B. Diffusion actor-critic: Formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning. In *International Conference on Learning Representations (ICLR)*, 2025.
- Frans et al. (2025) Frans, K., Park, S., Abbeel, P., and Levine, S. Diffusion guidance is a controllable policy improvement operator. *ArXiv*, abs/2505.23458, 2025.
- Fujimoto & Gu (2021) Fujimoto, S. and Gu, S. S. A minimalist approach to offline reinforcement learning. In *Neural Information Processing Systems (NeurIPS)*, 2021.
- Gao et al. (2025) Gao, C.-X., Wu, C., Cao, M., Xiao, C., Yu, Y., and Zhang, Z. Behavior-regularized diffusion policy optimization for offline reinforcement learning. In *International Conference on Machine Learning (ICML)*, 2025.
- Hansen-Estruch et al. (2023) Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G., and Levine, S. Idql: Implicit q-learning as an actor-critic method with diffusion policies. *ArXiv*, abs/2304.10573, 2023.
- He et al. (2023) He, L., Shen, L., Zhang, L., Tan, J., and Wang, X. Diffcps: Diffusion model based constrained policy search for offline reinforcement learning. *ArXiv*, abs/2310.05333, 2023.
- He et al. (2024) He, L., Shen, L., Tan, J., and Wang, X. Aligniql: Policy alignment in implicit q-learning through constrained optimization. *ArXiv*, abs/2405.18187, 2024.
- Hendrycks & Gimpel (2016) Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). *ArXiv*, abs/1606.08415, 2016.
- Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In *Neural Information Processing Systems (NeurIPS)*, 2020.
- Intelligence et al. (2025) Intelligence, P., Amin, A., Aniceto, R. J., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dhabalia, K., DiCarlo, J., Driess, D., Equi, M., Esmail, A., Fang, Y., Finn, C., Glossop, C., Godden, T., Goryachev, I., Groom, L., Hancock, H., Hausman, K., Hussein, G., Ichter, B., Jakubczak, S., Jen, R., Jones, T., Katz, B., Ke, L., Kuchi, C., Lamb, M., LeBlanc, D., Levine, S., Li-Bell, A., Lu, Y., Mano, V., Mothukuri, M., Nair, S., Pertsch, K., Ren, A. Z., Sharma, C., Shi, L. X., Smith, L., Springenberg, J. T., Stachowicz, K., Stoeckle, W., Swerdlow, A., Tanner, J., Torne, M., Vuong, Q., Walling, A., Wang, H., Williams, B., Yoo, S., Yu, L., Zhilinsky, U., and Zhou, Z. $\pi^{*}_{0.6}$: a vla that learns from experience. *ArXiv*, abs/2511.14759, 2025.
- Jackson et al. (2024) Jackson, M. T., Matthews, M. T., Lu, C., Ellis, B., Whiteson, S., and Foerster, J. Policy-guided diffusion. In *Reinforcement Learning Conference (RLC)*, 2024.
- Janner et al. (2022) Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. Planning with diffusion for flexible behavior synthesis. In *International Conference on Machine Learning (ICML)*, 2022.
- Kang et al. (2023) Kang, B., Ma, X., Du, C., Pang, T., and Yan, S. Efficient diffusion policies for offline reinforcement learning. In *Neural Information Processing Systems (NeurIPS)*, 2023.
- Karras et al. (2024) Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., and Laine, S. Analyzing and improving the training dynamics of diffusion models. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 2024.
- Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In *International Conference on Learning Representations (ICLR)*, 2015.
- Kostrikov et al. (2022) Kostrikov, I., Nair, A., and Levine, S. Offline reinforcement learning with implicit q-learning. In *International Conference on Learning Representations (ICLR)*, 2022.
- Lange et al. (2012) Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In *Reinforcement learning: State-of-the-art*, pp. 45–73. Springer, 2012.
- Lee (2012) Lee, J. M. *Introduction to Smooth Manifolds*. Springer, 2012.
- Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. *ArXiv*, abs/2005.01643, 2020.
- Li & Levine (2024) Li, Q. and Levine, S. Q-learning with adjoint matching. *ArXiv*, abs/2601.14234, 2024.
- Li et al. (2025) Li, Q., Zhou, Z., and Levine, S. Reinforcement learning with action chunking. *ArXiv*, abs/2507.07969, 2025.
- Li et al. (2023) Li, W., Wang, X., Jin, B., and Zha, H. Hierarchical diffusion for offline decision making. In *International Conference on Machine Learning (ICML)*, 2023.
- Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N. M. O., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In *International Conference on Learning Representations (ICLR)*, 2016.
- Lipman et al. (2023) Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In *International Conference on Learning Representations (ICLR)*, 2023.
- Lipman et al. (2024) Lipman, Y., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., Chen, R. T. Q., Lopez-Paz, D., Ben-Hamu, H., and Gat, I. Flow matching guide and code. *ArXiv*, abs/2412.06264, 2024.
- Liu et al. (2018) Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In *Neural Information Processing Systems (NeurIPS)*, 2018.
- Liu et al. (2023) Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. In *International Conference on Learning Representations (ICLR)*, 2023.
- Lu et al. (2023a) Lu, C., Ball, P., Teh, Y. W., and Parker-Holder, J. Synthetic experience replay. In *Neural Information Processing Systems (NeurIPS)*, 2023a.
- Lu et al. (2023b) Lu, C., Chen, H., Chen, J., Su, H., Li, C., and Zhu, J. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In *International Conference on Machine Learning (ICML)*, 2023b.
- Mark et al. (2024) Mark, M. S., Gao, T., Sampaio, G. G., Srirama, M. K., Sharma, A., Finn, C., and Kumar, A. Policy agnostic rl: Offline rl and online rl fine-tuning of any class and backbone. *ArXiv*, abs/2412.06685, 2024.
- Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. Playing atari with deep reinforcement learning. *ArXiv*, abs/1312.5602, 2013.
- Nair et al. (2020) Nair, A., Dalal, M., Gupta, A., and Levine, S. Accelerating online reinforcement learning with offline datasets. *ArXiv*, abs/2006.09359, 2020.
- Newey & Powell (1987) Newey, W. and Powell, J. L. Asymmetric least squares estimation and testing. *Econometrica*, 55:819–847, 1987.
- Park et al. (2024) Park, S., Frans, K., Levine, S., and Kumar, A. Is value learning really the main bottleneck in offline rl? In *Neural Information Processing Systems (NeurIPS)*, 2024.
- Park et al. (2025a) Park, S., Frans, K., Eysenbach, B., and Levine, S. Ogbench: Benchmarking offline goal-conditioned rl. In *International Conference on Learning Representations (ICLR)*, 2025a.
- Park et al. (2025b) Park, S., Frans, K., Mann, D., Eysenbach, B., Kumar, A., and Levine, S. Horizon reduction makes rl scalable. In *Neural Information Processing Systems (NeurIPS)*, 2025b.
- Park et al. (2025c) Park, S., Li, Q., and Levine, S. Flow q-learning. In *International Conference on Machine Learning (ICML)*, 2025c.
- Psenka et al. (2024) Psenka, M., Escontrela, A., Abbeel, P., and Ma, Y. Learning a diffusion model policy from rewards via q-score matching. In *International Conference on Machine Learning (ICML)*, 2024.
- Ren et al. (2025) Ren, A. Z., Lidard, J., Ankile, L. L., Simeonov, A., Agrawal, P., Majumdar, A., Burchfiel, B., Dai, H., and Simchowitz, M. Diffusion policy policy optimization. In *International Conference on Learning Representations (ICLR)*, 2025.
- Schulman et al. (2016) Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In *International Conference on Learning Representations (ICLR)*, 2016.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. *ArXiv*, abs/1707.06347, 2017.
- Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In *International Conference on Machine Learning (ICML)*, 2015.
- Sutton & Barto (2005) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. *IEEE Transactions on Neural Networks*, 16:285–286, 2005.
- Tarasov et al. (2023) Tarasov, D., Kurenkov, V., Nikulin, A., and Kolesnikov, S. Revisiting the minimalist approach to offline reinforcement learning. In *Neural Information Processing Systems (NeurIPS)*, 2023.
- Wagenmaker et al. (2025) Wagenmaker, A., Nakamoto, M., Zhang, Y., Park, S., Yagoub, W., Nagabandi, A., Gupta, A., and Levine, S. Steering your diffusion policy with latent space reinforcement learning. *ArXiv*, abs/2506.15799, 2025.
- Wang et al. (2023) Wang, Z., Hunt, J. J., and Zhou, M. Diffusion policies as an expressive policy class for offline reinforcement learning. In *International Conference on Learning Representations (ICLR)*, 2023.
- Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. *Machine learning*, 8(3):229–256, 1992.
- Wu et al. (2019) Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. *ArXiv*, abs/1911.11361, 2019.
- Yang et al. (2023) Yang, L., Huang, Z., Lei, F., Zhong, Y., Yang, Y., Fang, C., Wen, S., Zhou, B., and Lin, Z. Policy representation via diffusion probability model for reinforcement learning. *ArXiv*, abs/2305.13122, 2023.
- Zhang et al. (2024) Zhang, R., Luo, Z., Sjölund, J., Schön, T. B., and Mattsson, P. Entropy-regularized diffusion policy with q-ensembles for offline reinforcement learning. In *Neural Information Processing Systems (NeurIPS)*, 2024.
- Zhang et al. (2025) Zhang, S., Zhang, W., and Gu, Q. Energy-weighted flow matching for offline reinforcement learning. In *International Conference on Learning Representations (ICLR)*, 2025.
- Zhang et al. (2026) Zhang, Y., Yu, S., Zhang, T., Guang, M., Hui, H., Long, K., Wang, Y., Yu, C., and Ding, W. Sac flow: Sample-efficient reinforcement learning of flow-based policies via velocity-reparameterized sequential modeling. In *International Conference on Learning Representations (ICLR)*, 2026.
- Zhao et al. (2023) Zhao, T. Z., Kumar, V., Levine, S., and Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware. In *Robotics: Science and Systems (RSS)*, 2023.
- Zheng et al. (2023) Zheng, Q., Le, M., Shaul, N., Lipman, Y., Grover, A., and Chen, R. T. Guided flows for generative modeling and decision making. *ArXiv*, abs/2311.13443, 2023.

## Appendix A Full Result Table

Table 1: Performance on $50$ simulated robotic manipulation tasks\. RQL generally achieves the best performance across the board, particularly on more challenging, long\-horizon tasks like humanoidmaze\-large and cube\-quadruple\.

## Appendix B Additional Implementation Details

Computing $x^f$\. For simplicity, [Section 4\.3](https://arxiv.org/html/2606.17551#S4.SS3) shows that we compute $x^f$ at discretized flow intervals $f \in \{0, 1, \ldots, F-1\}$\. In practice, our value update rule in [Equation 14](https://arxiv.org/html/2606.17551#S4.E14) does not require the entire flow trajectory, but only a single $x^f$ for a flow time $f$\. This $f$ can be sampled in many ways; we choose to sample half of the flow times from a continuous uniform distribution on $[0, F]$ and the other half uniformly from $\{0, 1, \ldots, F-1\}$\. For each such $f$, given $x^F$, we want to compute $x^f$\. This can be done with $F$ Euler steps as follows:

$$x^{f-h} \leftarrow x^{f} - h\,v(s, x^{f}, f), \tag{16}$$

where $h = \frac{F-f}{F}$ is the fixed Euler step size for the target flow time $f$\. Iterating this update for $F$ steps starting from $x^F$ yields $x^f$\.
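The flow-time sampling and the reverse Euler integration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `sample_flow_times` and `reverse_euler`, the scalar action, and the `velocity(s, x, t)` callable standing in for the learned velocity field $v$ are all hypothetical.

```python
import random
from typing import Any, Callable, List


def sample_flow_times(batch_size: int, F: int, rng: random.Random) -> List[float]:
    """Sample flow times: half from continuous U[0, F], half from {0, ..., F-1}."""
    n_cont = batch_size // 2
    continuous = [rng.uniform(0.0, F) for _ in range(n_cont)]
    discrete = [float(rng.randrange(F)) for _ in range(batch_size - n_cont)]
    return continuous + discrete


def reverse_euler(
    x_F: float,
    s: Any,
    f: float,
    F: int,
    velocity: Callable[[Any, float, float], float],
) -> float:
    """Integrate backwards from flow time F to the target time f in F Euler steps."""
    h = (F - f) / F          # fixed step size for this target time
    x, t = x_F, float(F)     # start at the final action x^F
    for _ in range(F):
        x = x - h * velocity(s, x, t)  # one reverse step, as in Equation 16
        t -= h
    return x
```

With a constant velocity of 2, `reverse_euler(5.0, None, 4.0, 10, lambda s, x, t: 2.0)` moves back a total flow time of 6 and so returns a value close to -7. When `f` equals `F`, the step size is zero and the input is returned unchanged.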

Actor exponential moving average \(EMA\)\. We use an EMA of the flow policy during evaluation with $\lambda = 0.999$, as in previous works \(Ho et al\., [2020](https://arxiv.org/html/2606.17551#bib.bib24); Karras et al\., [2024](https://arxiv.org/html/2606.17551#bib.bib29)\)\. While this is optional in practice, we find it helps in our experiments\.
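As a rough sketch of what such an EMA update could look like, the snippet below assumes the common convention in which $\lambda$ is the decay applied to the averaged weights. The text does not spell out the exact update rule, so this form, the function name `ema_update`, and the dictionary-of-lists parameter layout are assumptions.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def ema_update(ema_params: Params, params: Params, lam: float = 0.999) -> Params:
    """Return lam * ema + (1 - lam) * params, element-wise (assumed convention)."""
    return {
        name: [lam * e + (1.0 - lam) * p
               for e, p in zip(ema_params[name], params[name])]
        for name in ema_params
    }
```

In this sketch the averaged copy would be updated after each training step and used only for evaluation, while training continues on the raw parameters.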

Critic pessimism coefficient \($\rho$\)\. We use a pessimistic critic backup \(Fang et al\., [2025](https://arxiv.org/html/2606.17551#bib.bib16)\) in TD targets, computed from an ensemble of value functions, as in Li & Levine \([2024](https://arxiv.org/html/2606.17551#bib.bib35)\)\. For an ensemble of $K$ value functions parameterized by $\varphi_j$ for $j \in \{1, \dots, K\}$ and corresponding target networks parameterized by $\bar{\varphi}_j$, the loss function is

$$\mathcal{L}(\varphi_j) = \mathbb{E}_{\widetilde{\tau}}\big[\ell_2^{\kappa}\big(V_{\varphi_j}(s, x^{f}, f) - (r + \gamma[\bar{V}_{\text{mean}}(s', x'^{0}, 0) - \rho \bar{V}_{\text{std}}(s', x'^{0}, 0)])\big)\big], \tag{17}$$

where

$$\bar{V}_{\text{mean}}(s', x'^{0}, 0) = \frac{1}{K}\sum_{k} V_{\bar{\varphi}_k}(s', x'^{0}, 0), \qquad \bar{V}_{\text{std}}(s', x'^{0}, 0) = \sqrt{\frac{1}{K}\sum_{k}\left(V_{\bar{\varphi}_k}(s', x'^{0}, 0) - \bar{V}_{\text{mean}}(s', x'^{0}, 0)\right)^{2}},$$

and $\rho$ controls the degree of pessimism\. See [Section C\.2](https://arxiv.org/html/2606.17551#A3.SS2) for more details\.
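The pessimistic TD target inside Equation 17 can be illustrated with a short sketch. It computes only the target $r + \gamma[\bar{V}_{\text{mean}} - \rho \bar{V}_{\text{std}}]$ from a list of target-network values, using the population standard deviation (division by $K$) as written above. The function name `pessimistic_td_target` is hypothetical, and the expectile loss $\ell_2^{\kappa}$ and any terminal-state handling are deliberately left out.

```python
import math
from typing import Sequence


def pessimistic_td_target(
    r: float, gamma: float, target_values: Sequence[float], rho: float
) -> float:
    """Compute r + gamma * (mean - rho * std) over K target-network values."""
    K = len(target_values)
    mean = sum(target_values) / K
    # Population standard deviation: divide by K, matching the definition above.
    std = math.sqrt(sum((v - mean) ** 2 for v in target_values) / K)
    return r + gamma * (mean - rho * std)
```

Setting `rho=0` recovers the plain ensemble-mean backup, while larger values lower the target more when the ensemble members disagree.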

## Appendix C Experimental Details

We evaluate all baselines with the official OGBench environments and datasets\. We fix the following hyperparameters for fair comparison, unless otherwise mentioned\.

Table 2: Common Hyperparameters\.

### C\.1 Methods

- BDPO \(Gao et al\., [2025](https://arxiv.org/html/2606.17551#bib.bib19)\)\. We use the original implementation by Gao et al\. \([2025](https://arxiv.org/html/2606.17551#bib.bib19)\), sweep $\eta$ within $\{0.03, 0.1, 0.3, 0.7, 1\}$, and use $\rho = 0.5$ for all tasks except humanoidmaze, which uses $\rho = 0.0$\. Following Gao et al\. \([2025](https://arxiv.org/html/2606.17551#bib.bib19)\), BDPO requires an additional BC warmup stage, which we provide for $1$M additional offline steps\. Thus, it uses twice as many offline steps as RQL and other methods in this comparison\.
- TFQL\. We implement TFQL and sweep the expectile $\kappa$ within $\{0.5, 0.7, 0.9\}$ and the BC regularization coefficient $\alpha$ within $\{0.1, 0.3, 1, 3, 10\}$\.
- RQL\. See Appendix [C\.2](https://arxiv.org/html/2606.17551#A3.SS2)\.

For other methods, we use the official results by Li & Levine \([2024](https://arxiv.org/html/2606.17551#bib.bib35)\) and maintain an apples\-to\-apples setting when adding new results\. Complete hyperparameters for BDPO, TFQL, and RQL can be found in Table [3](https://arxiv.org/html/2606.17551#A3.T3)\.

### C\.2 Hyperparameter Tuning

There are two important hyperparameters to tune when using RQL: the expectile $\kappa$ and the BC regularization coefficient $\alpha$\. We sweep $\kappa$ within $\{0.5, 0.7, 0.9\}$ and $\alpha$ within $\{0.1, 0.3, 1, 3, 10\}$\. For an apples\-to\-apples comparison with the results of Li & Levine \([2024](https://arxiv.org/html/2606.17551#bib.bib35)\), we use the ensemble critic target pessimistic coefficient \(Fang et al\., [2025](https://arxiv.org/html/2606.17551#bib.bib16)\) $\rho = 0.5$ for all tasks except humanoidmaze\. However, we find that $\rho = 0$ is often better, for example on cube\-double, and with additional tuning budget we recommend ablating $\rho$ within $\{0, 0.5\}$\. Similarly, we match Li & Levine \([2024](https://arxiv.org/html/2606.17551#bib.bib35)\) with action chunk horizon $h = 5$ for manipulation and $h = 1$ for locomotion tasks, but we recommend additional tuning within $\{1, 3, 5, 10\}$\.
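The sweep described above amounts to a grid of 15 configurations per task. A minimal sketch of how that grid could be enumerated is shown below; the function name `rql_sweep` and the dictionary keys are illustrative assumptions, not part of any released code.

```python
from itertools import product
from typing import Dict, List

KAPPAS = (0.5, 0.7, 0.9)          # expectile values swept
ALPHAS = (0.1, 0.3, 1, 3, 10)     # BC regularization coefficients swept


def rql_sweep(rho: float = 0.5, chunk_horizon: int = 5) -> List[Dict[str, float]]:
    """Enumerate every (kappa, alpha) pair with fixed rho and action chunk horizon."""
    return [
        {"kappa": k, "alpha": a, "rho": rho, "h": chunk_horizon}
        for k, a in product(KAPPAS, ALPHAS)
    ]
```

Calling `rql_sweep(rho=0.5, chunk_horizon=1)` would give the grid matching the locomotion setting above; adding $\rho$ or $h$ to the sweep, as recommended, multiplies the number of configurations accordingly.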

Table 3: Task\-specific hyperparameters\. We describe task\-specific hyperparameters below \($\alpha$: BC coefficient, $\kappa$: expectile, $\eta$: regularization strength, $\rho$: pessimistic coefficient\)\.
