Reinforcing Few-step Generators via Reward-Tilted Distribution Matching
Summary
RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to better align few-step image generators with human preferences. It reports state-of-the-art results on SD3, SD3.5, and FLUX.2 with only 4 inference steps.
Cached at: 05/26/26, 06:42 AM
Source: https://huggingface.co/papers/2605.26108
Abstract
Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
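The decomposition stated in the abstract can be sketched as a short derivation. The notation here is an assumption rather than the paper's own: $p_\theta$ is the few-step generator distribution, $p_{\mathrm{T}}$ the teacher distribution, $r$ the reward model, and $\beta > 0$ a hypothetical temperature that sets the strength of the tilt. Text conditioning is omitted for brevity.

```latex
\begin{aligned}
p^{*}(x) &= \frac{1}{Z}\, p_{\mathrm{T}}(x)\, \exp\!\big(r(x)/\beta\big),
\qquad Z = \int p_{\mathrm{T}}(x)\, \exp\!\big(r(x)/\beta\big)\, dx, \\
D_{\mathrm{KL}}\big(p_\theta \,\|\, p^{*}\big)
&= \mathbb{E}_{x \sim p_\theta}\!\left[\log p_\theta(x) - \log p_{\mathrm{T}}(x) - \frac{r(x)}{\beta}\right] + \log Z \\
&= \underbrace{D_{\mathrm{KL}}\big(p_\theta \,\|\, p_{\mathrm{T}}\big)}_{\text{distribution matching}}
\;-\; \frac{1}{\beta}\,\underbrace{\mathbb{E}_{x \sim p_\theta}\big[r(x)\big]}_{\text{reward maximization}}
\;+\; \log Z .
\end{aligned}
```

Because $\log Z$ does not depend on $\theta$, minimizing the left-hand side under these assumptions amounts to minimizing the KL divergence to the teacher while maximizing the expected reward, with $1/\beta$ weighting the trade-off between the two.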
Models citing this paper: 2
#### Harahan/FLUX2-4B-RTDMD (Text-to-Image • Updated 39 minutes ago)
#### Harahan/SD35M-RTDMD (Text-to-Image • Updated 39 minutes ago)
Similar Articles
High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation
This paper introduces Z-Image Turbo++, a two-step image generation model distilled from an eight-step teacher using distribution-aligned adversarial learning, step-decoupled parameterization, and end-to-end training with iterative regularization to narrow the quality gap with multi-step generation.
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 presents a unified generator-discriminator framework for autonomous driving that combines diffusion-based trajectory generation with RL-optimized reranking, achieving 56% collision rate reduction compared to diffusion-based planners. The approach introduces techniques like Temporally Consistent Group Relative Policy Optimization and BEV-Warp simulation environment for efficient large-scale training.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 introduces a reliability-perplexity aware reward distillation framework for streaming video generation that adaptively weights supervision to improve visual and motion quality without additional computational overhead.
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
DeltaRubric is a research paper introducing a two-step multimodal preference evaluation approach using a single MLLM to improve reward modeling reliability through joint planning and verification.
Reinforcement learning with prediction-based rewards
OpenAI introduces Random Network Distillation (RND), a prediction-based method for encouraging exploration in RL agents through curiosity, achieving human-level performance on Montezuma's Revenge without demonstrations or game state access.