@svlevine: A new way to do off-policy RL with diffusion: if we have off-policy data, we need to figure out what the diffusion late…
Summary
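A new method for off-policy reinforcement learning with diffusion models. Because off-policy data was collected by a different policy, the method uses flow reversal: it runs the diffusion process backward on that data to recover the latent steps the current policy would have taken.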
Cached at: 06/18/26, 04:03 AM
A new way to do off-policy RL with diffusion: if we have off-policy data, we need to figure out what the diffusion latent steps for it would be with our current policy (not the one that collected it), so this requires reversing the diffusion process on off-policy data.
Aditya Oberai (@aditya_oberai): What if we treat flow steps as RL actions?
Combined with our “flow reversal” technique, this leads to a really clean & powerful recipe for flow offline RL!
Thread 🧵
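To make the idea of "reversing the diffusion process on off-policy data" concrete, here is a minimal sketch. It is not the authors' implementation, and the posts above do not specify how the reversal is computed. The sketch assumes the policy is a deterministic flow (an ODE integrated with Euler steps), so an action from the dataset can be pushed backward through the current policy's velocity field to recover the intermediate latent steps. The names `velocity`, `generate`, and `reverse` are hypothetical, and the toy velocity field stands in for a learned network.

```python
import math
import random


def velocity(x, t, state):
    """Toy stand-in for the current policy's learned velocity field v(x, t | state).

    Assumption: a real policy would be a neural network. t is unused in this
    toy field, although a learned field would normally depend on it.
    """
    return [math.tanh(s) - 0.5 * xi for xi, s in zip(x, state)]


def generate(noise, state, n_steps=200):
    """Integrate forward (t: 0 -> 1) from noise to an action with Euler steps."""
    dt = 1.0 / n_steps
    x = list(noise)
    latents = [x]
    for k in range(n_steps):
        v = velocity(x, k * dt, state)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        latents.append(x)
    return x, latents


def reverse(action, state, n_steps=200):
    """Integrate backward (t: 1 -> 0) from a given action toward its noise.

    Returns the approximate starting noise and the latent steps ordered from
    t=0 to t=1, i.e. the trajectory the current policy would have followed
    to produce this action.
    """
    dt = 1.0 / n_steps
    x = list(action)
    latents = [x]
    for k in range(n_steps, 0, -1):
        v = velocity(x, k * dt, state)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
        latents.append(x)
    latents.reverse()
    return x, latents


# Round trip: sample an action, then recover its latents by reversal.
random.seed(0)
state = [random.gauss(0, 1) for _ in range(3)]
noise = [random.gauss(0, 1) for _ in range(3)]
action, _ = generate(noise, state)
recovered, latents = reverse(action, state)
max_err = max(abs(a - b) for a, b in zip(recovered, noise))
```

In the off-policy setting, `action` would come from a replay buffer or dataset rather than from `generate`, and the recovered `latents` are what one could then treat as per-step "actions" under the current policy. Euler reversal is only approximate, so `max_err` is small but not zero and shrinks as `n_steps` grows.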
Similar Articles
@svlevine: Diffusion (or flow) makes for excellent policies, but training them with RL is notoriously hard: BPTT is unstable, RL o…
New paper shows how to optimize flow matching actors for reinforcement learning by approximating the Jacobian of the flow denoising process with the identity matrix, making training feasible.
@svlevine: Flow reversal steering allows "steering" diffusion-based VLAs with high-level actions, for example from VLM reasoning. …
Flow reversal steering enables steering diffusion-based vision-language-action models with high-level actions, such as from VLM reasoning, and allows RL in diffusion noise space for task exploration.
@probablynotaz9: Solo-author ICML paper alert Ever wanted to post-train your diffusion LLM with good old policy gradients, without havin…
This solo-author ICML paper introduces Amortized Group Relative Policy Optimization (AGRPO) to enable effective reinforcement learning post-training for diffusion language models.
Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
Proposes Model-Based Diffusion Policy Optimization (MBDPO), a framework that unifies search and policy optimization in world models using diffusion policy representations, achieving consistent scaling behavior and superior performance across offline and online reinforcement learning tasks.
Drift Q-Learning
Proposes DriftQL, which combines a drift-based behavioral regularizer with critic-driven policy improvement for offline RL, outperforming diffusion and flow methods on D4RL and OGBench while maintaining simplicity and efficiency.