Flow-OPD: On-Policy Distillation for Flow Matching Models

Hugging Face Daily Papers

Summary

Flow-OPD is a research paper introducing a two-stage on-policy distillation framework for Flow Matching text-to-image models; built on Stable Diffusion 3.5 Medium, it significantly improves generation quality and alignment metrics.

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
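
To make the three-step orchestration concrete, the sketch below illustrates one distillation step in PyTorch: the student rolls out its own Euler trajectory (on-policy sampling), a router assigns the prompt to an expert teacher (task-routing labeling), and the teacher's velocity field supervises the student at every step (dense trajectory-level supervision), with a task-agnostic anchor teacher supplying the MAR term. This is a minimal illustration under assumed interfaces, not the paper's implementation; `TinyVelocityNet`, `route_task`, `lambda_mar`, and the Euler rollout are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Toy stand-in for a Flow Matching velocity model v(x_t, t)."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Broadcast the scalar timestep onto every sample in the batch.
        return self.net(torch.cat([x, t.expand(x.shape[0], 1)], dim=-1))

def route_task(prompt: str) -> str:
    """Hypothetical router: pick the expert teacher for this prompt."""
    return "ocr" if '"' in prompt else "compositional"

def flow_opd_step(student, teachers, anchor_teacher, x0, prompt,
                  n_steps: int = 8, lambda_mar: float = 0.1):
    """One Flow-OPD-style distillation step (a sketch, not the paper's losses)."""
    teacher = teachers[route_task(prompt)]          # task-routing labeling
    x, dt = x0, 1.0 / n_steps
    distill, mar = 0.0, 0.0
    for i in range(n_steps):                        # on-policy Euler rollout
        t = torch.full((1, 1), i * dt)
        v_student = student(x, t)
        with torch.no_grad():                       # teachers stay frozen
            v_teacher = teacher(x, t)
            v_anchor = anchor_teacher(x, t)
        distill = distill + (v_student - v_teacher).pow(2).mean()
        mar = mar + (v_student - v_anchor).pow(2).mean()   # MAR anchor term
        x = (x + dt * v_student).detach()           # follow the student's own flow
    return (distill + lambda_mar * mar) / n_steps   # dense trajectory-level loss

# Usage: one student, two single-reward experts, one task-agnostic anchor.
student = TinyVelocityNet()
teachers = {"ocr": TinyVelocityNet(), "compositional": TinyVelocityNet()}
anchor = TinyVelocityNet()
loss = flow_opd_step(student, teachers, anchor,
                     torch.randn(4, 16), 'a sign reading "OPEN"')
loss.backward()
```

Detaching the state between steps treats the student's own rollout as fixed data, so each timestep receives an independent dense target instead of backpropagating through the whole ODE; whether Flow-OPD makes the same choice is an assumption here.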

Paper page - Flow-OPD: On-Policy Distillation for Flow Matching Models

Source: https://huggingface.co/papers/2605.08063

Abstract

Flow-OPD addresses limitations in Flow Matching text-to-image models through a two-stage alignment approach combining on-policy distillation and manifold anchor regularization, achieving significant improvements in generation quality and alignment metrics.

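Stage one of the pipeline fine-tunes each expert teacher with single-reward GRPO, i.e., one scalar reward model per task. As a rough illustration of the group-relative core of GRPO (assuming the standard formulation; the paper's flow-specific policy gradient is not reproduced here), every prompt gets a group of sampled images, and each sample's advantage is its reward standardized within the group:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt's group of samples.

    GRPO replaces a learned value baseline with group statistics:
    A_i = (r_i - mean(r)) / (std(r) + eps). 'Single-reward' here means
    each expert teacher sees only one task's scalar reward (e.g. OCR).
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 4 images for one prompt, scored by one reward model.
print(grpo_advantages(torch.tensor([0.9, 0.2, 0.6, 0.4])))
```

Training each expert against a single reward in isolation is what lets it reach its performance ceiling before distillation, sidestepping the gradient interference the abstract attributes to jointly optimizing heterogeneous objectives.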


Get this paper in your agent:

hf papers read 2605.08063

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper (1)

CostaliyA/Flow-OPD · Updated about 4 hours ago • 12


Similar Articles

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

Hugging Face Daily Papers

AnyFlow introduces a novel any-step video diffusion distillation framework that optimizes full ODE sampling trajectories through flow-map transition learning and backward simulation, achieving performance that matches or surpasses consistency-based counterparts while scaling with sampling step budgets.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

arXiv cs.CL

This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.

Rubric-based On-policy Distillation

Hugging Face Daily Papers

This paper introduces ROPD, a rubric-based on-policy distillation framework that achieves superior sample efficiency compared to traditional logit-based methods. It enables model alignment in black-box scenarios by using structured semantic rubrics instead of teacher logits.