This paper introduces MARBLE, a gradient-space optimization framework for multi-reward reinforcement-learning fine-tuning of diffusion models that harmonizes the per-reward policy gradients without manual reward weighting.
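The summary does not say how MARBLE combines conflicting reward gradients. As an illustration only, the sketch below shows one common way to harmonize multiple gradients without manual weights: project each gradient off any other it conflicts with (PCGrad-style), then average. This is a generic technique, not MARBLE's actual algorithm; `harmonize` and `dot` are hypothetical names.

```python
def dot(a, b):
    """Inner product of two gradient vectors (plain Python lists)."""
    return sum(x * y for x, y in zip(a, b))

def harmonize(grads):
    """Combine per-reward policy gradients without manual weights:
    for each gradient, project away its component along any other
    gradient it conflicts with (negative inner product), then
    average the de-conflicted gradients. Illustrative only; not
    MARBLE's published method.
    """
    out = [list(g) for g in grads]
    for g in out:
        for h in grads:
            d = dot(g, h)
            if d < 0:  # g points against reward gradient h
                scale = d / dot(h, h)
                for k in range(len(g)):
                    g[k] -= scale * h[k]  # remove the conflicting component
    # Average columns across the de-conflicted gradients.
    return [sum(col) / len(out) for col in zip(*out)]

# Two toy reward gradients that disagree along the first axis.
g1 = [1.0, 0.0]
g2 = [-0.5, 1.0]
combined = harmonize([g1, g2])  # non-negative alignment with both
```

After projection the combined update no longer opposes either reward, which is the property that removes the need for hand-tuned weights.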
Self-Distillation Zero (SD-Zero) is a training method that converts sparse binary rewards into dense token-level supervision through dual-role training, in which a single model acts as both generator and reviser. It achieves 10%+ improvements on math and code reasoning benchmarks with higher sample efficiency than standard RL approaches.
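To make the sparse-to-dense idea concrete, here is a toy sketch of how a draft/revision pair could yield per-token targets: tokens the reviser kept inherit the original binary reward, while tokens it changed become dense correction targets. Every name and design choice here is an assumption for illustration; SD-Zero's actual credit-assignment scheme is not specified in the summary.

```python
def token_supervision(draft, revision, reward):
    """Toy illustration of turning one sparse binary reward into
    token-level supervision by aligning a draft with its revision
    (assumes equal length for simplicity). Tokens left unchanged
    carry the sparse reward; tokens the reviser edited are
    supervised toward the revised token with full credit.
    Hypothetical interface, not SD-Zero's actual API.
    """
    labels = []
    for pos, (d, r) in enumerate(zip(draft, revision)):
        if d == r:
            labels.append((pos, d, reward))  # kept token: sparse reward as-is
        else:
            labels.append((pos, r, 1.0))     # edited token: dense target
    return labels

# A failed draft (reward 0.0) whose revision fixes one token:
draft = ["2", "+", "2", "=", "5"]
revision = ["2", "+", "2", "=", "4"]
labels = token_supervision(draft, revision, 0.0)
```

Only the final position receives a positive target, so a single binary outcome now supervises exactly the tokens the reviser deemed wrong.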