@aditya_oberai: What if we treat flow steps as RL actions? Combined with our “flow reversal” technique, this leads to a really clean & …
Summary
Proposes treating flow steps as RL actions combined with a 'flow reversal' technique for flow offline reinforcement learning.
Cached at: 06/17/26, 08:02 PM
What if we treat flow steps as RL actions?
Combined with our “flow reversal” technique, this leads to a really clean & powerful recipe for flow offline RL!
Thread
We introduce Reversal Q-Learning (RQL).
RQL achieves strong results across 50 locomotion and manipulation tasks compared to 19 other state-of-the-art flow-based offline RL algorithms.
We know iterative generative models like flow matching are powerful for modeling complex robot policies in offline reinforcement learning (RL).
Yet training them is non-trivial: backpropagation through time (BPTT) is unstable, and 1-step distillation limits expressivity.
We propose a new algorithmic idea, starting from a simple view of flow RL.
A flow policy constructs actions via a sequence of refinement steps. To do RL, we can treat individual refinement steps as actions and apply standard RL algorithms.
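A minimal sketch of this view, assuming Euler integration and a toy scalar velocity field. The names `velocity` and `rollout_flow` are hypothetical and not taken from the RQL codebase; in practice the velocity field would be a trained network.

```python
import math
import random

def velocity(state, x, t):
    """Hypothetical velocity field v(s, x, t); a trained network in practice."""
    # Toy choice (assumption): pull the partial action toward tanh(state).
    return math.tanh(state) - x

def rollout_flow(state, noise, num_steps=8):
    """Euler-integrate the flow ODE from noise (t=0) to an action (t=1).

    Returns every intermediate point, so each transition x_k -> x_{k+1}
    can be read as one decision step of an expanded MDP.
    """
    h = 1.0 / num_steps
    xs = [noise]
    for k in range(num_steps):
        x = xs[-1]
        xs.append(x + h * velocity(state, x, k * h))
    return xs

random.seed(0)
path = rollout_flow(state=0.5, noise=random.gauss(0.0, 1.0))
print(len(path) - 1, "refinement steps; final action:", path[-1])
```

Here the last element of `path` is the complete action sent to the environment, and the earlier elements are the partially generated actions.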
We could run RL directly over the refinement steps, but this expands each action into multiple decision steps, multiplying the value learning horizon.
This expansion is particularly bad for off-policy RL, which exhibits the “curse of horizon”.
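To make the horizon expansion concrete, here is a rough back-of-the-envelope sketch. It assumes the common rule of thumb that a discount factor gamma corresponds to an effective horizon of 1 / (1 - gamma), and that the expanded MDP spreads the discount evenly across flow steps; both are illustrative assumptions, not details from the thread.

```python
def effective_horizon(gamma):
    """Rough bootstrapping horizon 1 / (1 - gamma) for a discount gamma."""
    return 1.0 / (1.0 - gamma)

def per_flow_step_discount(gamma, num_flow_steps):
    """Discount per flow step so that num_flow_steps of them compound to gamma.

    ASSUMPTION: the expanded MDP spreads the discount evenly over flow steps.
    """
    return gamma ** (1.0 / num_flow_steps)

gamma, num_flow_steps = 0.99, 10
expanded_gamma = per_flow_step_discount(gamma, num_flow_steps)
print("original horizon:", effective_horizon(gamma))
print("expanded horizon:", effective_horizon(expanded_gamma))
```

With these example numbers, 10 flow steps stretch an effective horizon of about 100 decisions to roughly ten times that.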
We observe that this expansion of the value learning horizon can be avoided by constructing “virtual” flow trajectories from standard prior data, which are perfectly suited for multi-step returns.
We generate trajectories in the expanded framework via “flow reversal”, which follows the flow ODE in reverse from actions in prior data.
We show these trajectories are deterministic and on-policy, and they thereby allow for unbiased, zero-variance multi-step returns.
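A minimal sketch of flow reversal under stated assumptions: a toy scalar velocity field and plain Euler steps, with hypothetical function names. It integrates backward from a dataset action to recover the noise and the intermediate points, then re-runs the flow forward to check that the path leads back to the same action. With Euler steps the round trip is only approximate and tightens as the step count grows; the paper's actual integrator may differ.

```python
import math

def velocity(state, x, t):
    """Hypothetical velocity field; stands in for the learned flow policy."""
    return math.tanh(state) - x

def reverse_flow(state, action, num_steps):
    """Integrate the flow ODE backward from a dataset action (t=1) to t=0.

    Returns the path ordered from recovered noise to the dataset action.
    """
    h = 1.0 / num_steps
    xs = [action]
    for k in range(num_steps, 0, -1):
        x = xs[-1]
        xs.append(x - h * velocity(state, x, k * h))
    return xs[::-1]

def forward_flow(state, noise, num_steps):
    """Integrate the same ODE forward from noise (t=0) to an action (t=1)."""
    h = 1.0 / num_steps
    x = noise
    for k in range(num_steps):
        x = x + h * velocity(state, x, k * h)
    return x

def round_trip_error(state, action, num_steps):
    """Distance between the dataset action and the re-generated action."""
    path = reverse_flow(state, action, num_steps)
    return abs(forward_flow(state, path[0], num_steps) - action)

for n in (8, 64, 256):
    print(n, "steps, round-trip error:", round_trip_error(0.5, -0.3, n))
```

Because the reversed path is fully determined by the state, the dataset action, and the current velocity field, no sampling is involved once the action is fixed, which matches the deterministic property described above.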
The implementation is really simple.
We learn a value function jointly over complete and partially-generated actions.
We can then use reparameterized gradients on each flow step (alongside a BC term).
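One plausible reading of the value-learning part, sketched under a loudly stated assumption: flow steps inside a single environment step carry zero reward and no discount, so every point on a reversed path shares the multi-step target of the complete action. This is an illustration only; `reversal_targets` is a hypothetical helper, and the actual RQL loss may be defined differently.

```python
def reversal_targets(reward, gamma, next_value, reversed_path):
    """Build (flow_step_index, partial_action, target) tuples for one transition.

    ASSUMPTION (not from the thread): intermediate flow steps give zero reward
    and are not discounted, so each partial action on the reversed path gets
    the same multi-step target r + gamma * V(s').
    """
    target = reward + gamma * next_value
    return [(k, x, target) for k, x in enumerate(reversed_path)]

# Example: a reversed path from recovered noise (index 0) to the dataset action.
examples = reversal_targets(
    reward=1.0, gamma=0.99, next_value=2.0, reversed_path=[0.9, 0.4, -0.3]
)
for step_index, partial_action, target in examples:
    print(step_index, partial_action, target)
```

Under this assumption, a single value network that takes the state, the (partial) action, and the flow step index can be regressed onto these targets without lengthening the bootstrapping horizon.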
That’s it! A big thank you to my co-authors @seohong_park @svlevine.
Website: http://aober.ai/rql
Paper: http://arxiv.org/abs/2606.17551
Codebase: http://github.com/aoberai/rql
Similar Articles
@seohong_park: RQL is a new, clean algorithm for (offline) flow RL! The main idea is to treat flow steps as MDP steps, and use "revers…
RQL is a new algorithm for offline flow reinforcement learning that treats flow steps as MDP steps and uses reversed flows to generate hindsight trajectories.
@svlevine: Flow reversal steering allows "steering" diffusion-based VLAs with high-level actions, for example from VLM reasoning. …
Flow reversal steering enables steering diffusion-based vision-language-action models with high-level actions, such as from VLM reasoning, and allows RL in diffusion noise space for task exploration.
@svlevine: Diffusion (or flow) makes for excellent policies, but training them with RL is notoriously hard: BPTT is unstable, RL o…
New paper shows how to optimize flow matching actors for reinforcement learning by approximating the Jacobian of the flow denoising process with the identity matrix, making training feasible.
Reversal Q-Learning
This paper proposes Reversal Q-Learning (RQL), an offline reinforcement learning algorithm that trains a flow policy using an expanded Markov decision process framework and techniques to enable off-policy RL without backpropagation through time. It achieves state-of-the-art performance on challenging simulated robotic tasks.
@svlevine: A new way to do off-policy RL with diffusion: if we have off-policy data, we need to figure out what the diffusion late…
A new method for off-policy reinforcement learning with diffusion models, using flow reversal to handle off-policy data by reversing the diffusion process on it.