Flow-OPD: On-Policy Distillation for Flow Matching Models
Summary
The Flow-OPD paper introduces a two-stage on-policy distillation framework for Flow Matching text-to-image models. Built on Stable Diffusion 3.5 Medium, it raises the GenEval score from 63 to 92 and OCR accuracy from 59 to 94.
Cached at: 05/11/26, 07:20 AM
Source: https://huggingface.co/papers/2605.08063
Abstract
Flow-OPD addresses limitations in Flow Matching text-to-image models through a two-stage alignment approach combining on-policy distillation and manifold anchor regularization, achieving significant improvements in generation quality and alignment metrics.
Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a ‘seesaw effect’ of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent ‘teacher-surpassing’ effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
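The three-step orchestration named in the abstract (on-policy sampling, task-routing labeling, dense trajectory-level supervision) plus a MAR-style anchor term can be illustrated with a toy sketch. This is not the paper's implementation: the abstract gives no loss formulas, so the squared-error velocity matching, the additive anchor penalty, the 1-D "images", the linear student, and every name below (`TEACHERS`, `anchor_teacher`, `Student`, `rollout`, `distill_step`, `mar_weight`) are assumptions made only to show how the pieces could fit together.

```python
import random


# Hypothetical frozen teachers: stand-ins for single-reward GRPO experts.
def geneval_teacher(x, t):
    return 1.0 - x          # pulls samples toward x = 1


def ocr_teacher(x, t):
    return -1.0 - x         # pulls samples toward x = -1


def anchor_teacher(x, t):
    return -x               # task-agnostic teacher for the MAR-style term (assumed form)


TEACHERS = {"geneval": geneval_teacher, "ocr": ocr_teacher}


class Student:
    """Toy linear velocity field per task: v(x, t | task) = a[task] * x + b[task]."""

    def __init__(self, tasks):
        self.a = {task: 0.0 for task in tasks}
        self.b = {task: 0.0 for task in tasks}

    def velocity(self, x, t, task):
        return self.a[task] * x + self.b[task]


def rollout(student, task, x0, steps=8):
    """On-policy sampling: Euler-integrate the student's own ODE starting from noise x0."""
    dt = 1.0 / steps
    x, states = x0, []
    for i in range(steps):
        t = i * dt
        states.append((x, t))
        x = x + dt * student.velocity(x, t, task)
    return states


def distill_step(student, task, x0, lr=0.05, mar_weight=0.1):
    """One update on a single student trajectory; returns the mean per-state loss."""
    teacher = TEACHERS[task]                # task-routing labeling
    states = rollout(student, task, x0)     # visited states are treated as constants
    grad_a = grad_b = loss = 0.0
    for x, t in states:                     # dense supervision at every visited state
        v = student.velocity(x, t, task)
        err_teacher = v - teacher(x, t)
        err_anchor = v - anchor_teacher(x, t)
        loss += err_teacher ** 2 + mar_weight * err_anchor ** 2
        g = 2.0 * (err_teacher + mar_weight * err_anchor)
        grad_a += g * x
        grad_b += g
    n = len(states)
    student.a[task] -= lr * grad_a / n
    student.b[task] -= lr * grad_b / n
    return loss / n


def train(iterations=600, seed=0):
    rng = random.Random(seed)
    student = Student(TEACHERS)
    history = []
    for _ in range(iterations):
        task = rng.choice(sorted(TEACHERS))     # stand-in for a prompt's task label
        history.append(distill_step(student, task, rng.gauss(0.0, 1.0)))
    return student, history


if __name__ == "__main__":
    student, history = train()
    print(student.a, student.b, history[-1])
```

In this toy, the minimizer of the combined objective is `a = -1` and `b = ±1 / (1 + mar_weight)` for the two tasks, so the anchor term visibly pulls each task's offset slightly away from its expert's target. A scalar reward would score only the final sample, whereas here every state along the student's own trajectory receives a teacher signal, which is the contrast the abstract draws with reward sparsity.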
Models citing this paper: 1
#### CostaliyA/Flow-OPD
Similar Articles
DanceOPD: On-Policy Generative Field Distillation
DanceOPD proposes an on-policy generative field distillation framework for flow-matching models that unifies text-to-image generation, local editing, and global editing via capability-specific routing and velocity-based training, improving multi-capability composition while preserving anchor generation quality.
Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models
Flow-DPPO replaces ratio clipping with divergence proximal constraints in flow matching models, improving training stability and multi-objective optimization through exact KL divergence computation.
On-policy distillation: one of the hottest terms on PapersWithCode [R]
Hugging Face's Niels introduces On-policy Distillation (OPD), a key post-training technique used in models like Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4, now featured on PapersWithCode with a linked whiteboard explanation by Sasha Rush and Dwarkesh Patel.
DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models
DiffusionOPD proposes a multi-task training paradigm for diffusion models that uses online policy distillation to efficiently combine task-specific teachers into a unified student, achieving state-of-the-art results on all evaluated benchmarks.
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow introduces a novel any-step video diffusion distillation framework that optimizes full ODE sampling trajectories through flow-map transition learning and backward simulation, achieving performance that matches or surpasses consistency-based counterparts while scaling with sampling step budgets.