Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Hugging Face Daily Papers 06/09/26, 12:00 AM Papers

Summary

Flow-DPPO replaces ratio clipping with divergence proximal constraints in flow matching models, improving training stability and multi-objective optimization through exact KL divergence computation.

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
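The contrast between a single-sample ratio and an exact Gaussian KL can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the old and new per-step policies are isotropic Gaussians with the same standard deviation `sigma`, differing only in their means. The masking rule in `divergence_mask` is one plausible reading of the abstract's "asymmetric divergence mask": an update counts as moving away from the trusted region when the advantage is positive and the ratio is already above 1, or the advantage is negative and the ratio is already below 1. All function names, the threshold `delta`, and the numeric values are hypothetical.

```python
import math


def gaussian_kl(mu_old, mu_new, sigma):
    """Exact KL(old || new) for two isotropic Gaussians sharing std `sigma`.

    With equal covariances the KL reduces to ||mu_old - mu_new||^2 / (2 sigma^2).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def sample_ratio(x, mu_old, mu_new, sigma):
    """Probability ratio pi_new(x) / pi_old(x) for one sampled point x.

    The Gaussian normalizers cancel because both policies share `sigma`.
    """
    d_old = sum((xi - m) ** 2 for xi, m in zip(x, mu_old))
    d_new = sum((xi - m) ** 2 for xi, m in zip(x, mu_new))
    return math.exp((d_old - d_new) / (2.0 * sigma ** 2))


def divergence_mask(advantage, ratio, kl, delta):
    """Return 0.0 to block the gradient and 1.0 to keep it (assumed rule).

    The gradient is blocked only when BOTH conditions hold:
      1. the update pushes the policy further from the old one, and
      2. the exact KL already exceeds the threshold `delta`.
    """
    moving_away = (advantage > 0 and ratio > 1.0) or (advantage < 0 and ratio < 1.0)
    return 0.0 if (moving_away and kl > delta) else 1.0


# Hypothetical per-step means and noise level.
mu_old, mu_new, sigma = [0.0, 0.0], [0.1, 0.0], 0.5
kl = gaussian_kl(mu_old, mu_new, sigma)

# The KL is a property of the two policies and is identical for every sample,
# while the ratio depends on which sample happened to be drawn.
ratio_a = sample_ratio([0.5, 0.0], mu_old, mu_new, sigma)   # ratio above 1
ratio_b = sample_ratio([-0.5, 0.0], mu_old, mu_new, sigma)  # ratio below 1

mask_a = divergence_mask(advantage=1.0, ratio=ratio_a, kl=kl, delta=0.01)
mask_b = divergence_mask(advantage=1.0, ratio=ratio_b, kl=kl, delta=0.01)
```

In this toy setting the two samples see the same policy pair, and therefore the same KL, yet one yields a ratio above 1 and the other a ratio below 1. A ratio-clipping rule would treat them differently for reasons unrelated to how far the policy has actually moved. Under the assumed mask, `mask_a` blocks the update (moving away with KL above `delta`), while `mask_b` lets it through because that update pulls the policy back toward the old one.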

Original Article


Cached at: 06/10/26, 05:44 AM


Source: https://huggingface.co/papers/2606.11025



Similar Articles

Flow-OPD: On-Policy Distillation for Flow Matching Models

PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation

$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

Constraint-Aware Flow Matching: Decision Aligned End-to-End Training for Constrained Sampling
