Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

Hugging Face Daily Papers

Summary

RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to better align few-step image generators with human preferences. The authors report state-of-the-art results on SD3, SD3.5, and FLUX.2 with only 4 inference steps.

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
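The decomposition mentioned in the abstract can be sketched as follows. This is a minimal derivation under assumed notation, not the paper's own statement: $p_\theta$ is the few-step generator's distribution, $p_T$ the teacher's, $r$ the reward, and $\beta > 0$ a temperature. The exponential form of the tilt and every symbol here are assumptions made for illustration.

```latex
% Assumed reward-tilted teacher (exponential tilt with temperature \beta):
p^{*}(x) = \frac{p_T(x)\,\exp\!\big(r(x)/\beta\big)}{Z},
\qquad
Z = \mathbb{E}_{x \sim p_T}\!\left[\exp\!\big(r(x)/\beta\big)\right].

% Expanding the KL divergence from the generator to the tilted teacher:
D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p^{*}\right)
  = \mathbb{E}_{x \sim p_\theta}\!\left[\log p_\theta(x) - \log p_T(x) - \tfrac{1}{\beta}\, r(x)\right] + \log Z
  = \underbrace{D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_T\right)}_{\text{distribution matching}}
  \;-\; \frac{1}{\beta}\,\underbrace{\mathbb{E}_{x \sim p_\theta}\!\left[r(x)\right]}_{\text{reward maximization}}
  \;+\; \log Z .
```

Since $\log Z$ does not depend on $\theta$, minimizing the left-hand side amounts to minimizing the KL divergence to the teacher while maximizing the expected reward, with $\beta$ setting the trade-off between the two terms.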

Cached at: 05/26/26, 06:42 AM


Source: https://huggingface.co/papers/2605.26108

Abstract

RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to better align few-step image generators with human preferences.

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
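To make the "GRPO-style estimator" concrete, here is a minimal sketch of the group-relative advantage normalization commonly associated with GRPO: rewards for several samples drawn from the same prompt are centered and scaled by the group's own statistics. This is not the paper's hybrid policy gradient or SubGRPO. The function name `group_relative_advantages` and the `eps` parameter are hypothetical, chosen for illustration only.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sample group.

    Illustrative assumption: each advantage is the reward minus the
    group mean, divided by the group (population) standard deviation.
    `eps` avoids division by zero when all rewards are identical.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four images sampled for the same prompt, scored by some reward model.
    group_rewards = [0.2, 0.8, 0.5, 0.5]
    for reward, adv in zip(group_rewards, group_relative_advantages(group_rewards)):
        print(f"reward={reward:.2f}  advantage={adv:+.3f}")
```

Samples that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate learned value baseline is needed. A group with identical rewards yields all-zero advantages and therefore contributes no gradient signal.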



Models citing this paper: 2

Harahan/FLUX2-4B-RTDMD (Text-to-Image, updated 39 minutes ago)

Harahan/SD35M-RTDMD (Text-to-Image, updated 39 minutes ago)


Similar Articles

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Hugging Face Daily Papers

RAD-2 presents a unified generator-discriminator framework for autonomous driving that combines diffusion-based trajectory generation with RL-optimized reranking, achieving 56% collision rate reduction compared to diffusion-based planners. The approach introduces techniques like Temporally Consistent Group Relative Policy Optimization and BEV-Warp simulation environment for efficient large-scale training.

Reinforcement learning with prediction-based rewards

OpenAI Blog

OpenAI introduces Random Network Distillation (RND), a prediction-based method for encouraging exploration in RL agents through curiosity, achieving human-level performance on Montezuma's Revenge without demonstrations or game state access.