Optimizing Visual Generative Models via Distribution-wise Rewards
Summary
This paper presents a reinforcement learning framework for visual generative models that uses distribution-wise rewards, with a subset-replace strategy for efficiency, improving image diversity and quality while addressing mode collapse and reward hacking.
Source: https://huggingface.co/papers/2607.02291
Abstract
A novel reinforcement learning framework for visual generation uses distribution-wise rewards to improve image diversity and quality while addressing mode collapse and computational efficiency issues.
Conventional reinforcement learning strategies for visual generation typically employ sample-wise reward functions, yet this practice frequently results in reward hacking that degrades image diversity and introduces visual anomalies. To address these limitations, we present a novel framework that finetunes generative models using distribution-wise rewards, ensuring better alignment with real-world data distributions. Unlike rewards that evaluate samples individually, distribution-wise reward accounts for the data distribution of the samples, mitigating the mode collapse problem that occurs when all samples optimize towards the same direction independently. To overcome the prohibitive computational cost of estimating these rewards, we introduce a subset-replace strategy that efficiently provides reward signals by updating only a small subset of a generated reference set. Additionally, we apply RL to optimize post-hoc model merging coefficients, potentially mitigating the train-inference inconsistency caused by introducing stochastic differential equation (SDE) in regular RL practices. Extensive experiments show our approach significantly improves FID-50K across various base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2. Qualitative evaluation also confirms that our method enhances perceptual quality while preserving sample diversity.
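The abstract does not spell out how the subset-replace reward is computed, so the following is only a toy sketch of one way such a signal could work, not the authors' implementation. It assumes the distribution-wise reward is the drop in a Fréchet distance (the quantity behind FID) when a small batch of new samples replaces a random subset of a fixed generated reference set. The function names (`frechet_distance`, `set_distance`, `subset_replace_reward`), the 4-dimensional synthetic Gaussian features standing in for real image features, and the set sizes are all hypothetical.

```python
import numpy as np


def frechet_distance(mu_a, cov_a, mu_b, cov_b):
    """Frechet distance between two Gaussians (the quantity behind FID)."""
    diff = mu_a - mu_b
    # Tr((cov_a cov_b)^{1/2}) is the sum of square roots of the
    # eigenvalues of cov_a @ cov_b; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def set_distance(features, mu_real, cov_real):
    """Distance between a set of generated features and the real statistics."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    return frechet_distance(mu, cov, mu_real, cov_real)


def subset_replace_reward(reference, candidates, mu_real, cov_real, rng):
    """Assumed reward: distance reduction after swapping candidates in."""
    # Overwrite a random subset of the reference set with the new samples,
    # so only a small part of the set changes per reward evaluation.
    idx = rng.choice(len(reference), size=len(candidates), replace=False)
    swapped = reference.copy()
    swapped[idx] = candidates
    before = set_distance(reference, mu_real, cov_real)
    after = set_distance(swapped, mu_real, cov_real)
    return before - after


rng = np.random.default_rng(0)
dim = 4  # toy feature dimension (assumption)
real = rng.normal(0.0, 1.0, size=(5000, dim))
mu_real, cov_real = real.mean(axis=0), np.cov(real, rowvar=False)

# Reference set from a generator with too little variance.
reference = rng.normal(0.0, 0.3, size=(1000, dim))

diverse = rng.normal(0.0, 1.0, size=(100, dim))  # spread like the real data
collapsed = np.zeros((100, dim))                 # every sample at the mode

r_diverse = subset_replace_reward(
    reference, diverse, mu_real, cov_real, np.random.default_rng(1))
r_collapsed = subset_replace_reward(
    reference, collapsed, mu_real, cov_real, np.random.default_rng(1))
print(f"diverse subset reward:   {r_diverse:.3f}")
print(f"collapsed subset reward: {r_collapsed:.3f}")
```

In this toy setup the diverse batch receives a positive reward and the collapsed batch a negative one, even though each collapsed sample sits at the most likely point of the real distribution. That is the intuition behind scoring samples as a set rather than individually. A practical version would presumably update the reference statistics incrementally instead of recomputing them from scratch, but that detail is not given in the abstract.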
Similar Articles
Reinforcing Few-step Generators via Reward-Tilted Distribution Matching
RTDMD is a two-stage framework combining distribution matching distillation with reward-guided reinforcement learning to improve few-step image generation alignment with human preferences. It achieves state-of-the-art results on multiple models with only 4 inference steps.
Reward as An Agent for Embodied World Models
This paper introduces Reward as an Agent and DynDiff-GRPO to address reward hacking and limited exploration in reinforcement learning for embodied world models, achieving significant accuracy gains.
Accelerating Disaggregated RL for Visual Generative LLMs with Diffusion-Based Parallelism and Trainer-Assisted Generation
This paper introduces DigenRL, a disaggregated RL framework for diffusion-based generative LLMs that uses generation-axis pipeline parallelism and trainer-assisted generation to improve throughput by 1.56-2.10x over existing systems.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 introduces a reliability-perplexity aware reward distillation framework for streaming video generation that adaptively weights supervision to improve visual and motion quality without additional computational overhead.
Hierarchical Variational Policies for Reward-Guided Diffusion
Proposes a hierarchical variational policy framework for reward-guided diffusion, enabling high-quality sampling with reduced inference cost. Achieves strong quality-speed tradeoff on tasks like super-resolution.