@aditya_oberai: What if we treat flow steps as RL actions? Combined with our “flow reversal” technique, this leads to a really clean & …

X AI KOLs Timeline 06/17/26, 03:30 PM Papers

Summary

Proposes treating flow steps as RL actions combined with a 'flow reversal' technique for flow offline reinforcement learning.

What if we treat flow steps as RL actions? Combined with our “flow reversal” technique, this leads to a really clean & powerful recipe for flow offline RL! Thread 🧵 https://t.co/PxE8yzH9gM

Original Article


Cached at: 06/17/26, 08:02 PM

What if we treat flow steps as RL actions?

Combined with our “flow reversal” technique, this leads to a really clean & powerful recipe for flow offline RL!

Thread

We introduce Reversal Q-Learning (RQL).

RQL achieves strong results across 50 locomotion and manipulation tasks compared to 19 other state-of-the-art flow-based offline RL algorithms.

We know iterative generative models like flow matching are powerful for modeling complex robot policies in offline reinforcement learning (RL).

Yet training them is non-trivial: backpropagation through time (BPTT) is unstable, and one-step distillation limits expressivity.

We propose a new algorithmic idea, starting from a simple view of flow RL.

A flow policy constructs actions via a sequence of refinement steps. To do RL, we can treat individual refinement steps as actions and apply standard RL algorithms.
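As a rough illustration of this view, the sketch below generates one action with Euler steps and logs every refinement step as a (state, action) pair of an expanded decision process. The velocity field `toy_velocity`, the scalar action, and the step count are assumptions made for illustration; none of them are taken from the RQL codebase.

```python
import random

def toy_velocity(obs, x, t):
    # Hypothetical velocity field: pulls the partial action x toward obs.
    # A real flow policy would use a learned network here.
    return obs - x

def rollout_flow_steps(obs, num_steps=4, seed=0):
    """Generate one action with Euler steps, logging each refinement step.

    Each logged entry is one decision in the expanded view:
    state = (obs, partial action, flow time), action = velocity applied.
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)      # start from noise at flow time t = 0
    h = 1.0 / num_steps
    steps = []
    for k in range(num_steps):
        t = k * h
        v = toy_velocity(obs, x, t)
        steps.append({"state": (obs, x, t), "action": v})
        x = x + h * v            # one refinement step
    return x, steps

action, steps = rollout_flow_steps(obs=0.5)
print(len(steps), action)
```

With this framing, one environment action corresponds to `num_steps` logged decisions, and any standard RL algorithm could in principle be applied to them.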

We can directly do RL over refinement steps, but this expands each action into multiple decision steps, multiplying the value learning horizon.

This expansion is particularly bad for off-policy RL, which exhibits the “curse of horizon”.
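A small arithmetic sketch makes the expansion concrete. The episode length and the number of flow steps below are made-up numbers, not values from the paper.

```python
def expanded_horizon(env_steps, flow_steps):
    # Each environment action becomes `flow_steps` separate decisions,
    # so one-step value backups must propagate through this many steps.
    return env_steps * flow_steps

H = 200   # hypothetical episode length in environment steps
K = 10    # hypothetical number of flow steps per action
print(expanded_horizon(H, K))   # 2000
```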

We observe that this expansion of the value learning horizon can be avoided by constructing “virtual” flow trajectories from standard prior data, which are perfectly suited to multi-step returns.

We generate trajectories in the expanded framework via “flow reversal”, which follows the flow ODE in reverse from actions in prior data.
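A minimal sketch of the reversal idea, under stated assumptions: `toy_velocity` is a hypothetical stand-in for the learned flow policy, actions are scalars, and plain Euler integration is used in both directions. The actual discretization used by RQL may differ.

```python
def toy_velocity(obs, x, t):
    # Hypothetical velocity field standing in for a learned flow policy.
    return obs - x

def reverse_flow(obs, action, num_steps=1000):
    """Integrate the flow ODE backward from t=1 (data action) to t=0.

    Returns the partial actions ordered from t=0 to t=1; the last
    entry is the dataset action itself.
    """
    h = 1.0 / num_steps
    x = action
    traj = [x]
    for k in range(num_steps, 0, -1):
        t = k * h
        x = x - h * toy_velocity(obs, x, t)   # Euler step with negative dt
        traj.append(x)
    traj.reverse()
    return traj

def forward_flow(obs, x0, num_steps=1000):
    # Ordinary forward generation from a given starting point.
    h = 1.0 / num_steps
    x = x0
    for k in range(num_steps):
        x = x + h * toy_velocity(obs, x, k * h)
    return x

traj = reverse_flow(obs=0.3, action=0.8)
recovered = forward_flow(obs=0.3, x0=traj[0])
print(abs(recovered - 0.8))   # small, but not exactly zero with Euler steps
```

Running the policy forward from the reversed starting point lands close to the dataset action, which is the sense in which the virtual trajectory is one the current policy would itself produce. Euler steps are only approximately invertible, so the round trip leaves a small discretization error.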

We show these trajectories are deterministic and on-policy, and they thereby allow for unbiased, zero-variance multi-step returns.
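The sketch below shows why a deterministic virtual trajectory gives the same multi-step target at every partial action along it. The reward and discount placement is an assumption for illustration (zero reward and no discount on intermediate flow steps, with the environment reward and discount applied when the action completes); the thread does not spell out this convention.

```python
def n_step_return(rewards, discounts, bootstrap_value):
    """Discounted sum of rewards plus a discounted bootstrap value."""
    g, scale = 0.0, 1.0
    for r, d in zip(rewards, discounts):
        g += scale * r
        scale *= d
    return g + scale * bootstrap_value

K, gamma = 4, 0.99                    # hypothetical flow steps and discount
env_reward, next_value = 1.0, 10.0    # made-up numbers

# Assumed convention: only the final flow step, which completes the
# action, pays the environment reward and applies the discount.
rewards = [0.0] * (K - 1) + [env_reward]
discounts = [1.0] * (K - 1) + [gamma]

# Multi-step target for each partial action along the virtual trajectory.
targets = [n_step_return(rewards[k:], discounts[k:], next_value) for k in range(K)]
print(targets)
```

Because the remaining flow steps are fixed by the reversed trajectory, no sampling enters these sums, so under this convention every partial action shares the target of the completed action.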

The implementation is really simple.

We learn a value function jointly over complete and partially-generated actions.

We can then use reparameterized gradients on each flow step (alongside a BC term).
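A toy sketch of a per-step actor update of this kind, with the chain rule written out by hand. Everything here is an assumption for illustration: the critic `q_value` is a fixed quadratic rather than a learned function of state, partial action, and flow time; the policy is a single scalar parameter; and `alpha` is a hypothetical BC weight.

```python
def q_value(x_next):
    # Hypothetical critic over partial actions: prefers x_next near 1.0.
    return -(x_next - 1.0) ** 2

def dq_dx(x_next):
    # Analytic derivative of the toy critic with respect to its input.
    return -2.0 * (x_next - 1.0)

def actor_loss_and_grad(theta, x, h, bc_velocity, alpha):
    """Loss and gradient for a one-parameter velocity v = theta.

    loss = -Q(x + h * v) + alpha * (v - bc_velocity) ** 2
    The first gradient term flows through the flow step into the critic.
    """
    v = theta
    x_next = x + h * v
    loss = -q_value(x_next) + alpha * (v - bc_velocity) ** 2
    grad = -dq_dx(x_next) * h + 2.0 * alpha * (v - bc_velocity)
    return loss, grad

theta = 0.0
for _ in range(200):
    loss, grad = actor_loss_and_grad(theta, x=0.0, h=0.25, bc_velocity=0.5, alpha=1.0)
    theta -= 0.1 * grad   # plain gradient descent
print(theta)
```

The learned velocity settles between the BC target and the velocity the critic alone would prefer, which is the trade-off the BC weight controls.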

That’s it! A big thank you to my co-authors @seohong_park @svlevine.

Website: http://aober.ai/rql Paper: http://arxiv.org/abs/2606.17551 Codebase: http://github.com/aoberai/rql

