MARBLE: Multi-Aspect Reward Balance for Diffusion RL
Summary
This paper introduces MARBLE, a gradient-space optimization framework for multi-reward reinforcement learning fine-tuning of diffusion models, which harmonizes policy gradients without manual weighting.
Source: https://huggingface.co/papers/2605.06507
Abstract
A novel gradient-space optimization framework called MARBLE addresses limitations in multi-reward reinforcement learning fine-tuning of diffusion models by maintaining independent advantage estimators and harmonizing policy gradients through quadratic programming without manual reward weighting.
Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deals with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x) = \sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model jointly trained on all rewards or necessitate heavy, manually tuned sequential training. We find that the failure stems from naive weighted-sum reward aggregation, which suffers from a sample-level mismatch: most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others, so weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains an independent advantage estimator for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction, without manually tuned reward weighting, by solving a quadratic programming problem. We further propose an amortized formulation that exploits the affine structure of the DiffusionNFT loss to reduce the per-step cost from K+1 backward passes to near the single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine similarity from negative in 80% of mini-batches under weighted summation to consistently positive, and runs at 0.97x the training speed of baseline training.
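The balancing step described in the abstract can be pictured as a small quadratic program over the per-reward gradients. Below is a minimal sketch, assuming an MGDA-style min-norm QP over the simplex combined with the EMA smoothing of the balancing coefficients mentioned above; the function and parameter names (`balance_gradients`, `ema_beta`) are hypothetical, and the paper's exact QP objective and constraints may differ.

```python
import numpy as np
from scipy.optimize import minimize

def balance_gradients(grads, alpha_prev=None, ema_beta=0.9):
    """Combine K per-reward policy gradients into one update direction.

    Sketch of a min-norm quadratic program over the simplex (the
    MGDA-style formulation); the paper's exact QP may differ.
    grads: (K, D) array, one flattened policy gradient per reward.
    """
    K = grads.shape[0]
    G = grads @ grads.T  # Gram matrix of pairwise gradient inner products

    # min_alpha  alpha^T G alpha   s.t.  alpha >= 0, sum(alpha) = 1
    obj = lambda a: a @ G @ a
    jac = lambda a: 2.0 * G @ a
    cons = ({"type": "eq", "fun": lambda a: a.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * K
    a0 = np.full(K, 1.0 / K)
    alpha = minimize(obj, a0, jac=jac, bounds=bounds,
                     constraints=cons, method="SLSQP").x

    # EMA smoothing of the balancing coefficients across steps,
    # damping transient single-batch fluctuations
    if alpha_prev is not None:
        alpha = ema_beta * alpha_prev + (1.0 - ema_beta) * alpha
        alpha /= alpha.sum()

    update = alpha @ grads  # single harmonized update direction
    return update, alpha
```

In this sketch, `update` replaces the single fixed weighted-sum gradient in the optimizer step: when the QP shifts mass toward the worst-aligned reward, that reward's specialist samples drive the update instead of being diluted by static weights.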
Get this paper in your agent:
`hf papers read 2605.06507`
Don't have the latest CLI? `curl -LsSf https://hf.co/cli/install.sh | bash`
Similar Articles
Recovering Hidden Reward in Diffusion-Based Policies
This research paper explores methods for recovering hidden rewards within diffusion-based policies, likely aiming to improve the alignment or efficiency of such models.
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
UDM-GRPO introduces a stable RL training framework for uniform discrete diffusion models, boosting GenEval accuracy from 69% to 96% and OCR benchmark accuracy from 8% to 57%.
Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
The paper introduces PRISM, a method that inserts a distribution-alignment stage between supervised fine-tuning and reinforcement learning to mitigate distributional drift in multimodal models. It uses a black-box adversarial game with an MoE discriminator to improve RLVR performance on models like Qwen3-VL.
Reinforcement Learning via Value Gradient Flow
Value Gradient Flow (VGF) presents a scalable approach to behavior-regularized reinforcement learning by formulating it as an optimal transport problem solved through discrete gradient flow, achieving state-of-the-art results on offline RL and LLM RL benchmarks. The method eliminates explicit policy parameterization while enabling adaptive test-time scaling by controlling transport budget.
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 presents a unified generator-discriminator framework for autonomous driving that combines diffusion-based trajectory generation with RL-optimized reranking, achieving 56% collision rate reduction compared to diffusion-based planners. The approach introduces techniques like Temporally Consistent Group Relative Policy Optimization and BEV-Warp simulation environment for efficient large-scale training.