Tag
This paper traces reward hacking in reward-guided flow and diffusion models to finite-particle plug-in estimation of the Doob h-function, and proposes a reward damping schedule that corrects within-mode bias at no additional computational cost.