@aditya_oberai: What if we treat flow steps as RL actions? Combined with our “flow reversal” technique, this leads to a really clean & …


Summary

Proposes treating flow steps as RL actions combined with a 'flow reversal' technique for flow offline reinforcement learning.

What if we treat flow steps as RL actions? Combined with our “flow reversal” technique, this leads to a really clean & powerful recipe for flow offline RL! Thread 🧵 https://t.co/PxE8yzH9gM

Cached at: 06/17/26, 08:02 PM


We introduce Reversal Q-Learning (RQL).

RQL achieves strong results across 50 locomotion and manipulation tasks compared to 19 other state-of-the-art flow-based offline RL algorithms.

We know iterative generative models like flow matching are powerful for modeling complex robot policies in offline reinforcement learning (RL).

Yet training them is non-trivial: backpropagation through time (BPTT) is unstable, and 1-step distillation limits expressivity.

We propose a new algorithmic idea, starting from a simple view of flow RL.

A flow policy constructs actions via a sequence of refinement steps. To do RL, we can treat individual refinement steps as actions and apply standard RL algorithms.
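To make the "refinement steps as actions" view concrete, here is a minimal sketch in plain Python. It assumes a scalar action and Euler integration, and `velocity` is a hypothetical stand-in for a learned velocity field, not anything from the RQL codebase.

```python
def velocity(state, x, t):
    """Hypothetical velocity field: pulls x toward a state-dependent target."""
    target = 2.0 * state
    return target - x


def rollout_flow_policy(state, noise, num_steps):
    """Integrate the flow ODE from t=0 to t=1 with Euler steps.

    Returns every intermediate (partially generated) action, so each
    refinement step can be viewed as one decision in an expanded problem.
    """
    h = 1.0 / num_steps
    x = noise
    partial_actions = [x]
    for k in range(num_steps):
        t = k * h
        x = x + h * velocity(state, x, t)  # one refinement step
        partial_actions.append(x)
    return partial_actions


traj = rollout_flow_policy(state=0.5, noise=0.0, num_steps=4)
final_action = traj[-1]  # the complete action sent to the environment
```

Only the last entry of `traj` is executed in the environment; the earlier entries are the partially generated actions.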

We can directly do RL over refinement steps, but this expands each action into multiple decision steps, multiplying the value learning horizon.

This expansion is particularly bad for off-policy RL, which exhibits the “curse of horizon”.
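A quick back-of-the-envelope illustration of the expansion, with made-up numbers (the episode length and step count below are assumptions for illustration, not values from the paper):

```python
def expanded_horizon(env_horizon, flow_steps):
    """Number of decisions per episode when every flow step counts as an action."""
    return env_horizon * flow_steps


# Illustrative numbers only: 1000 environment steps, 10 flow steps per action.
decisions = expanded_horizon(env_horizon=1000, flow_steps=10)
```

With these numbers, a naive value function would have to propagate reward information across ten times as many bootstrapped decisions.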

We recognize that we can avoid expanding the value learning horizon by constructing “virtual” flow trajectories from standard prior data, which are perfectly suited for multi-step returns.

We generate trajectories in the expanded framework via “flow reversal”, which follows the flow ODE in reverse from actions in prior data.

We show these trajectories are deterministic and on-policy, and they thereby allow for unbiased, zero-variance multi-step returns.
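A rough sketch of what following the flow ODE in reverse could look like, assuming a scalar action, Euler steps, and a hypothetical `velocity` function standing in for the learned flow policy. The helper names are illustrative and not taken from the RQL codebase.

```python
def velocity(state, x, t):
    """Hypothetical stand-in for a learned velocity field (t is unused in this toy)."""
    return 2.0 * state - x


def reverse_flow(state, action, num_steps):
    """Follow the flow ODE backward from t=1 to t=0, starting at a dataset action."""
    h = 1.0 / num_steps
    x = action
    traj = [x]
    for k in range(num_steps, 0, -1):
        x = x - h * velocity(state, x, k * h)  # Euler step backward in time
        traj.append(x)
    traj.reverse()  # traj[0] is the t=0 end, traj[-1] is the dataset action
    return traj


def forward_flow(state, x0, num_steps):
    """Run the same flow policy forward from t=0 to t=1."""
    h = 1.0 / num_steps
    x = x0
    for k in range(num_steps):
        x = x + h * velocity(state, x, k * h)
    return x


virtual_traj = reverse_flow(state=0.5, action=0.3, num_steps=1000)
replayed_action = forward_flow(state=0.5, x0=virtual_traj[0], num_steps=1000)
```

The reversed trajectory involves no sampling, so it is deterministic, and running the flow policy forward from its starting point lands close to the dataset action. In this sketch the round trip is only approximate because of the Euler discretization.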

The implementation is really simple.

We learn a value function jointly over complete and partially-generated actions.
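One way to picture a value function over partial actions is sketched below. It rests on assumptions the thread does not spell out: intermediate flow steps carry zero reward, no discounting is applied within a flow trajectory, and every partial action on a virtual trajectory regresses to the same environment-level target. `build_value_targets` is a hypothetical helper, not the paper's implementation.

```python
def build_value_targets(state, virtual_traj, reward, next_value, discount):
    """Create (state, partial_action, flow_index, target) training tuples.

    Assumption: each flow step along the virtual trajectory has zero reward
    and no discount, so the multi-step return from any partial action equals
    the environment-level target of the complete action.
    """
    target = reward + discount * next_value
    return [(state, x, k, target) for k, x in enumerate(virtual_traj)]


# Toy inputs: three partial actions ending in the dataset action 0.3.
tuples = build_value_targets(
    state=0.5,
    virtual_traj=[-0.9, -0.2, 0.3],
    reward=1.0,
    next_value=2.0,
    discount=0.5,
)
```

Under these assumptions, the value function is queried with the flow index as an extra input, and the bootstrapping depth stays the same as in the original environment.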

We can then use reparameterized gradients on each flow step (alongside a BC term).
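As a toy illustration of a reparameterized gradient on a single flow step with a BC term, the sketch below uses a one-parameter velocity `theta`, a made-up quadratic critic Q(a) = -(a - best_action)^2, and a squared-error BC penalty. All names, the critic, and the loss weighting are assumptions for illustration; gradients are written out by hand rather than computed with autodiff.

```python
def step_loss_grad(theta, x, h, best_action, bc_velocity, alpha):
    """Gradient of  -Q(x + h * theta) + alpha * (theta - bc_velocity)**2  w.r.t. theta."""
    a_next = x + h * theta                  # one flow step, differentiable in theta
    dq_da = -2.0 * (a_next - best_action)   # derivative of the toy critic w.r.t. the action
    grad_q_term = -dq_da * h                # chain rule through the flow step
    grad_bc_term = 2.0 * alpha * (theta - bc_velocity)
    return grad_q_term + grad_bc_term


x, h, best_action, bc_velocity, alpha = 0.0, 0.25, 1.0, 0.5, 1.0
theta = 0.0
for _ in range(200):
    theta -= 0.1 * step_loss_grad(theta, x, h, best_action, bc_velocity, alpha)

# Closed-form minimizer of the same quadratic loss, for comparison.
theta_star = (alpha * bc_velocity + h * (best_action - x)) / (h * h + alpha)
```

The learned `theta` sits between the BC velocity and the velocity the critic alone would prefer, with `alpha` controlling the trade-off.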

That’s it! A big thank you to my co-authors @seohong_park @svlevine.

Website: http://aober.ai/rql
Paper: http://arxiv.org/abs/2606.17551
Codebase: http://github.com/aoberai/rql
