Recovering Hidden Reward in Diffusion-Based Policies
Summary
This paper explores methods for recovering hidden rewards within diffusion-based policies.
Source: https://huggingface.co/papers/2605.00623
Similar Articles
Hierarchical Variational Policies for Reward-Guided Diffusion
Proposes a hierarchical variational policy framework for reward-guided diffusion, enabling high-quality sampling with reduced inference cost. Achieves strong quality-speed tradeoff on tasks like super-resolution.
Adaptive Order Policies for Masked Diffusion
Proposes learning the unmasking order in masked diffusion models using a lightweight policy network, with a weighted loss that outperforms heuristics on combinatorial tasks and protein design.
@svlevine: A new way to do off-policy RL with diffusion: if we have off-policy data, we need to figure out what the diffusion late…
A new method for off-policy reinforcement learning with diffusion models, using flow reversal to handle off-policy data by reversing the diffusion process on it.
Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
Proposes Model-Based Diffusion Policy Optimization (MBDPO), a framework that unifies search and policy optimization in world models using diffusion policy representations, achieving consistent scaling behavior and superior performance across offline and online reinforcement learning tasks.
SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
SafeDiffusion-R1 introduces an online reinforcement learning framework using GRPO and a steering reward mechanism to improve safety in diffusion models without requiring supervised data or reward tuning, achieving state-of-the-art performance on multiple harm categories.