doob-h-function

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

arXiv cs.LG · 2026-06-03

This paper explains the root cause of reward hacking in reward-guided flow and diffusion models, attributing it to finite-particle plug-in estimation of the Doob h-function, and proposes a reward damping schedule to correct within-mode bias without additional computational cost.
