Tag
DiRecT introduces a training-free algorithm for safe diffusion-based planning that enforces constraints only on final clean trajectories using receding-horizon denoising, improving safety and performance over existing methods.
This paper formalizes model exploitation in reinforcement learning, proving it is unavoidable in large policy sets, and establishes a theoretical bridge between reward hacking and model exploitation.