Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models
Summary
Flow-DPPO replaces ratio clipping with divergence proximal constraints in flow matching models, improving training stability and multi-objective optimization through exact KL divergence computation.
Cached at: 06/10/26, 05:44 AM
Source: https://huggingface.co/papers/2606.11025
Abstract
Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
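The two ideas in the abstract, a closed-form Gaussian KL and an asymmetric mask, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the old and new per-step policies are isotropic Gaussians that share the same standard deviation `sigma`, in which case KL(old || new) = ||mu_old - mu_new||^2 / (2 sigma^2). It also assumes that "moving away from the trusted region" means the gradient would push the probability ratio further from 1 (positive advantage with ratio above 1, or negative advantage with ratio below 1). All function names and the `kl_threshold` parameter are hypothetical.

```python
import math

def gaussian_kl_shared_std(mu_old, mu_new, sigma):
    """KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)).

    Assumption: both policies are isotropic Gaussians with the same sigma.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)

def gaussian_log_prob(x, mu, sigma):
    """Log-density of x under an isotropic Gaussian N(mu, sigma^2 I)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    log_norm = len(x) * math.log(sigma * math.sqrt(2.0 * math.pi))
    return -sq_dist / (2.0 * sigma ** 2) - log_norm

def divergence_mask(ratio, advantage, kl, kl_threshold):
    """Return 0.0 to block the gradient, 1.0 to keep it.

    Assumed reading of the asymmetric mask: block only when the update
    pushes the ratio further from 1 AND the KL exceeds the threshold.
    """
    moving_away = (advantage > 0 and ratio > 1.0) or (advantage < 0 and ratio < 1.0)
    return 0.0 if (moving_away and kl > kl_threshold) else 1.0

def masked_surrogate(x, mu_old, mu_new, sigma, advantage, kl_threshold):
    """Per-step policy-gradient loss with the divergence mask applied."""
    log_ratio = gaussian_log_prob(x, mu_new, sigma) - gaussian_log_prob(x, mu_old, sigma)
    ratio = math.exp(log_ratio)
    kl = gaussian_kl_shared_std(mu_old, mu_new, sigma)
    mask = divergence_mask(ratio, advantage, kl, kl_threshold)
    return -mask * ratio * advantage, kl, mask
```

Note the contrast with ratio clipping: the mask decision uses the exact KL between the two Gaussians, which depends only on the policy means and `sigma`, whereas the ratio depends on the particular sample `x`. In a real training loop the means would come from the model under an autograd framework; plain floats are used here only to keep the sketch self-contained.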
Similar Articles
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD is a research paper introducing a two-stage on-policy distillation framework for Flow Matching text-to-image models, significantly improving generation quality and alignment metrics using Stable Diffusion 3.5 Medium.
PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation
PrismFlow introduces a Flow Matching method with Koopman-inspired dynamical experts to handle multimodal and multiscale time-series data, achieving state-of-the-art performance with significant improvements in Context-FID and Discriminative Score.
$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin
This paper introduces xi-DPO, a novel preference optimization method that reformulates the objective to minimize distance to optimal ratio reward margins, addressing hyperparameter tuning challenges in SimPO. Experimental results show that xi-DPO outperforms existing methods on open benchmarks.
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow introduces a novel any-step video diffusion distillation framework that optimizes full ODE sampling trajectories through flow-map transition learning and backward simulation, achieving performance that matches or surpasses consistency-based counterparts while scaling with sampling step budgets.
Constraint-Aware Flow Matching: Decision Aligned End-to-End Training for Constrained Sampling
Proposes Constraint-Aware Flow Matching, a novel end-to-end framework that aligns the model's learning dynamics with the constrained sampling procedure, mitigating the distributional shift caused by projection corrections for high-quality constrained generation.