@seohong_park: RQL is a new, clean algorithm for (offline) flow RL! The main idea is to treat flow steps as MDP steps, and use "revers…
Summary
RQL is a new algorithm for offline flow reinforcement learning that treats flow steps as MDP steps and uses reversed flows to generate hindsight trajectories.
Cached at: 06/18/26, 12:01 AM
RQL is a new, clean algorithm for (offline) flow RL!
The main idea is to treat flow steps as MDP steps, and use “reversed” flows to generate hindsight flow trajectories for off-policy data.
Aditya Oberai (@aditya_oberai): What if we treat flow steps as RL actions?
Combined with our “flow reversal” technique, this leads to a really clean & powerful recipe for flow offline RL!
Thread 🧵
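To make the "flow steps as MDP steps" idea concrete, here is a minimal illustrative sketch, not the RQL implementation. It assumes a toy velocity field in place of a learned flow policy and plain Euler integration. The names velocity, forward_flow, reverse_flow and flow_transitions are hypothetical, and the post does not say how RQL actually performs the reversal, assigns rewards, or trains its critic. The sketch only shows that integrating the flow backward from a dataset action yields a noise-to-action trajectory whose individual steps can be read as transitions.

```python
import numpy as np


def velocity(x, t, state):
    # Hypothetical stand-in for a learned velocity field v(x, t | state).
    # A real flow policy would be a neural network; this toy field simply
    # pulls x toward the state vector so the example stays runnable.
    return state - x


def forward_flow(noise, state, num_steps):
    # Euler-integrate from t=0 (noise) to t=1 (action), keeping every step.
    h = 1.0 / num_steps
    xs = [noise]
    for k in range(num_steps):
        x = xs[-1]
        xs.append(x + h * velocity(x, k * h, state))
    return xs  # xs[0] is the noise, xs[-1] is the generated action


def reverse_flow(action, state, num_steps):
    # Euler-integrate backward from t=1 (a dataset action) to t=0.
    # The result is a "hindsight" trajectory that ends at the dataset action.
    # This is an approximate inverse of forward_flow, not an exact one.
    h = 1.0 / num_steps
    xs = [action]
    for k in range(num_steps, 0, -1):
        x = xs[-1]
        xs.append(x - h * velocity(x, k * h, state))
    return xs[::-1]  # ordered from recovered noise to dataset action


def flow_transitions(xs):
    # Read each flow step as one transition:
    # ((x_k, k), step taken, (x_{k+1}, k + 1)).
    return [
        ((xs[k], k), xs[k + 1] - xs[k], (xs[k + 1], k + 1))
        for k in range(len(xs) - 1)
    ]


state = np.array([0.5, -0.2])
dataset_action = np.array([0.3, 0.1])

hindsight = reverse_flow(dataset_action, state, num_steps=100)
transitions = flow_transitions(hindsight)
replayed = forward_flow(hindsight[0], state, num_steps=100)
```

In this toy setup, replaying the forward flow from the recovered noise lands close to the dataset action, up to Euler discretization error. That is the property that makes the reversed trajectory usable as a stand-in for the flow steps that "would have" produced the off-policy action.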
Similar Articles
Reversal Q-Learning
This paper proposes Reversal Q-Learning (RQL), an offline reinforcement learning algorithm that trains a flow policy using an expanded Markov decision process framework and techniques to enable off-policy RL without backpropagation through time. It achieves state-of-the-art performance on challenging simulated robotic tasks.
@aditya_oberai: What if we treat flow steps as RL actions? Combined with our “flow reversal” technique, this leads to a really clean & …
Proposes treating flow steps as RL actions combined with a 'flow reversal' technique for flow offline reinforcement learning.
Drift Q-Learning
Proposes DriftQL, which combines a drift-based behavioral regularizer with critic-driven policy improvement for offline RL, outperforming diffusion and flow methods on D4RL and OGBench while maintaining simplicity and efficiency.
@svlevine: A new way to do off-policy RL with diffusion: if we have off-policy data, we need to figure out what the diffusion late…
A new method for off-policy reinforcement learning with diffusion models. It uses flow reversal to handle off-policy data by running the diffusion process in reverse on that data.
@svlevine: Diffusion (or flow) makes for excellent policies, but training them with RL is notoriously hard: BPTT is unstable, RL o…
New paper shows how to optimize flow matching actors for reinforcement learning by approximating the Jacobian of the flow denoising process with the identity matrix, making training feasible.