Flow-OPD: On-Policy Distillation for Flow Matching Models
Summary
This paper introduces Flow-OPD, a two-stage on-policy distillation framework for Flow Matching text-to-image models that significantly improves generation quality and alignment metrics on Stable Diffusion 3.5 Medium.
Source: https://huggingface.co/papers/2605.08063
Abstract
Flow-OPD addresses limitations in Flow Matching text-to-image models through a two-stage alignment approach combining on-policy distillation and manifold anchor regularization, achieving significant improvements in generation quality and alignment metrics.
Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a ‘seesaw effect’ of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent ‘teacher-surpassing’ effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
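The abstract's first stage fine-tunes one specialist teacher per reward with GRPO. The core of GRPO is the group-relative advantage: each prompt gets a group of samples scored by a single scalar reward, and each sample's advantage is its reward standardized within that group, removing the need for a learned value baseline. The sketch below illustrates only that advantage computation; the tensor layout and function name are my assumptions, not the paper's code.

```python
# Hypothetical sketch of GRPO's group-relative advantage, as used in stage 1
# to fine-tune each single-reward specialist teacher. Names and shapes are
# illustrative assumptions.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled images.
    Returns advantages standardized within each prompt's group, so no learned
    value baseline is required."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 prompts with 4 samples each, scored by one scalar reward model
adv = grpo_advantages(torch.tensor([[0.1, 0.4, 0.9, 0.2],
                                    [0.7, 0.6, 0.8, 0.5]]))
```

In a full GRPO loop, these advantages would weight a policy-gradient update on the sampled flow trajectories, with one reward model per specialist teacher.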
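The second stage, as described in the abstract, combines three pieces per update: the student rolls out its own trajectory (on-policy sampling), a router picks the relevant specialist teacher (task-routing labeling), and the teacher supervises every step of the rollout (dense trajectory-level supervision), with a task-agnostic anchor model contributing the MAR term. Below is a minimal, assumption-heavy sketch on a toy velocity network; all module names, the Euler rollout, the MSE objectives, the placeholder router, and `mar_weight` are my assumptions, since the paper's actual parameterization is not shown on this page.

```python
# Hypothetical sketch of one stage-2 Flow-OPD update on toy networks.
# Everything here (ToyFlow, route_task, loss form, mar_weight) is an
# illustrative assumption, not the paper's implementation.
import torch
import torch.nn as nn

class ToyFlow(nn.Module):
    """Stand-in velocity network v(x_t, t, c); not the paper's architecture."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 64), nn.SiLU(),
                                 nn.Linear(64, dim))

    def forward(self, x, t, c):
        return self.net(torch.cat([x, t, c], dim=-1))

def route_task(prompt_emb, experts):
    """Task-routing labeling (assumed): pick the specialist teacher for this
    prompt. Placeholder routing: always return expert 0."""
    return experts[0]

def flow_opd_step(student, experts, anchor, opt, prompt_emb,
                  steps: int = 8, mar_weight: float = 0.1):
    b, dim = prompt_emb.shape
    x = torch.randn(b, dim)                      # on-policy: start from noise
    teacher = route_task(prompt_emb, experts)    # task-routing labeling
    loss = x.new_zeros(())
    for i in range(steps):                       # dense trajectory-level supervision
        t = torch.full((b, 1), i / steps)
        v_s = student(x, t, prompt_emb)
        with torch.no_grad():
            v_t = teacher(x, t, prompt_emb)      # specialist teacher target
            v_a = anchor(x, t, prompt_emb)       # task-agnostic anchor (MAR)
        loss = loss + (v_s - v_t).pow(2).mean() \
                    + mar_weight * (v_s - v_a).pow(2).mean()
        x = (x + v_s / steps).detach()           # Euler step along the student's own rollout
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# usage with toy models
student, anchor = ToyFlow(), ToyFlow()
experts = [ToyFlow()]
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
print(flow_opd_step(student, experts, anchor, opt, torch.randn(4, 16)))
```

The key on-policy property is that supervision is applied along states the student itself visits, rather than along teacher-generated trajectories, which is what distinguishes OPD from ordinary off-policy distillation.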
Get this paper in your agent:
hf papers read 2605.08063
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
CostaliyA/Flow-OPD (updated about 4 hours ago) • 12
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0
Similar Articles
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow introduces a novel any-step video diffusion distillation framework that optimizes full ODE sampling trajectories through flow-map transition learning and backward simulation, achieving performance that matches or surpasses consistency-based counterparts while scaling with sampling step budgets.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
This paper introduces D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy self-distillation during supervised fine-tuning. It allows models to learn new concepts or styles without compromising their efficient few-step inference capabilities.
The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation
This paper identifies that on-policy distillation (OPD) in language models leads to severe overconfidence due to information mismatch between training and deployment, and proposes CaOPD, a calibration-aware framework that improves both performance and confidence reliability.
Rubric-based On-policy Distillation
This paper introduces ROPD, a rubric-based on-policy distillation framework that achieves superior sample efficiency compared to traditional logit-based methods. It enables model alignment in black-box scenarios by using structured semantic rubrics instead of teacher logits.