@svlevine: Diffusion (or flow) makes for excellent policies, but training them with RL is notoriously hard: BPTT is unstable, RL o…

X AI KOLs Following | 06/10/26, 03:52 AM | Papers

Summary

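A new paper shows how to optimize flow matching actors for reinforcement learning by approximating the Jacobian of the flow denoising process with the identity matrix. This sidesteps the two obstacles the post names: unstable backpropagation through time (BPTT) and the longer horizon incurred by running RL over the diffusion process.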

Diffusion (or flow) makes for excellent policies, but training them with RL is notoriously hard: BPTT is unstable, RL over diffusion blows up the horizon. In our new paper, we show how we can optimize flow matching actors by using "one weird trick" -- "approximate" the Jacobian of the flow denoising process with the identity matrix.
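
To make the identity-Jacobian idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's algorithm: the scalar action, the linear velocity field `w * a + b`, the quadratic critic, and all function names are assumptions chosen for illustration. It compares the exact gradient obtained by backpropagating through every Euler denoising step with the gradient obtained when each step's Jacobian is replaced by 1 (the scalar identity).

```python
# Toy illustration only: scalar action, hypothetical linear velocity field
# v(a) = w * a + b, and a hypothetical critic Q(a) = -(a - target)^2.

def rollout(w, b, a0, steps):
    """Euler-integrate da/dt = w*a + b from t=0 to t=1; return every intermediate action."""
    dt = 1.0 / steps
    actions = [a0]
    for _ in range(steps):
        a = actions[-1]
        actions.append(a + dt * (w * a + b))
    return actions


def q_grad(a, target):
    """dQ/da for the toy critic Q(a) = -(a - target)^2."""
    return -2.0 * (a - target)


def grad_bptt(w, b, a0, steps, target):
    """Exact dQ/d(w, b): backpropagate through every denoising step."""
    dt = 1.0 / steps
    actions = rollout(w, b, a0, steps)
    g = q_grad(actions[-1], target)      # dQ/da at the final action
    gw = gb = 0.0
    for k in reversed(range(steps)):
        gw += g * dt * actions[k]        # direct effect of w on step k
        gb += g * dt                     # direct effect of b on step k
        g *= 1.0 + dt * w                # true step Jacobian da_{k+1}/da_k
    return gw, gb


def grad_identity(w, b, a0, steps, target):
    """Approximate dQ/d(w, b): treat every step Jacobian as the identity."""
    dt = 1.0 / steps
    actions = rollout(w, b, a0, steps)
    g = q_grad(actions[-1], target)      # the same dQ/da is sent to every step
    gw = sum(g * dt * a for a in actions[:-1])
    gb = g * dt * steps
    return gw, gb


if __name__ == "__main__":
    print("BPTT    :", grad_bptt(0.3, 1.0, 0.5, 10, 2.0))
    print("identity:", grad_identity(0.3, 1.0, 0.5, 10, 2.0))
```

In this toy setting the two gradients coincide when `w = 0`, because the true step Jacobian `1 + dt * w` is then exactly 1, and they drift apart as `w` grows. The approximate version never multiplies Jacobians across steps, which is the chain of products that makes BPTT fragile over long denoising trajectories.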

Similar Articles

Reinforcement Learning via Value Gradient Flow

Hugging Face Daily Papers

Value Gradient Flow (VGF) is a scalable approach to behavior-regularized reinforcement learning that formulates it as an optimal transport problem solved through discrete gradient flow, achieving state-of-the-art results on offline RL and LLM RL benchmarks. The method eliminates explicit policy parameterization and enables adaptive test-time scaling by controlling the transport budget.

@svlevine: Flow reversal steering allows "steering" diffusion-based VLAs with high-level actions, for example from VLM reasoning. …

X AI KOLs Following

Flow reversal steering lets high-level actions, such as those produced by VLM reasoning, steer diffusion-based vision-language-action models (VLAs), and it allows RL in the diffusion noise space for task exploration.

Drift Q-Learning

arXiv cs.LG

The paper proposes DriftQL, which combines a drift-based behavioral regularizer with critic-driven policy improvement for offline RL. It outperforms diffusion and flow methods on D4RL and OGBench while remaining simple and efficient.

Path-Coupled Bellman Flows for Distributional Reinforcement Learning

arXiv cs.LG

This paper introduces Path-Coupled Bellman Flows (PCBF), a continuous-time distributional reinforcement learning method that uses flow matching to model return distributions without heuristic projections. It addresses boundary mismatch and high-variance issues in previous flow-based approaches by coupling current and successor return flows through shared base noise.

@probablynotaz9: Solo-author ICML paper alert Ever wanted to post-train your diffusion LLM with good old policy gradients, without havin…