SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
Summary
SafeDiffusion-R1 introduces an online reinforcement learning framework using GRPO and a steering reward mechanism to improve safety in diffusion models without requiring supervised data or reward tuning, achieving state-of-the-art performance on multiple harm categories.
Cached at: 05/19/26, 10:31 AM
Source: https://huggingface.co/papers/2605.18719
Abstract
A novel online reinforcement learning framework for diffusion models that improves safety without requiring supervised paired data or reward tuning, achieving state-of-the-art performance on multiple harm categories.
Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe text paired with safe-image ground truth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: steering text representations toward positive safety directions and away from negative ones in the embedding space. Our online-policy approach enables the model to learn from diverse prompts, including explicit unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 baseline) while improving compositional generation quality from 42.08% to 47.83% on GenEval. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: https://github.com/MAXNORM8650/SafeDiffusion-R1.
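The abstract names two ingredients: a reward computed from directions in an embedding space, and GRPO's group-relative scoring. The sketch below illustrates how such pieces could fit together, and it is not the authors' implementation. It assumes the reward is the cosine similarity to a "safe" direction minus the cosine similarity to an "unsafe" direction, and that advantages are rewards standardized within one prompt's sample group. Random vectors stand in for real CLIP embeddings, and the names `steering_reward` and `group_relative_advantages` are hypothetical.

```python
import numpy as np


def unit(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def steering_reward(sample_emb, safe_emb, unsafe_emb):
    """Assumed reward form: cosine similarity to the safe direction
    minus cosine similarity to the unsafe direction."""
    s = unit(sample_emb)
    return s @ unit(safe_emb) - s @ unit(unsafe_emb)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group
    of samples drawn for the same prompt."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Stand-in embeddings (assumption: real usage would take these from CLIP).
rng = np.random.default_rng(0)
dim = 512
safe_dir = rng.normal(size=dim)
unsafe_dir = rng.normal(size=dim)

# A group of four fake samples, interpolated from fully "unsafe" to fully "safe".
mix = np.array([0.0, 0.25, 0.75, 1.0])[:, None]
group = mix * safe_dir + (1.0 - mix) * unsafe_dir

rewards = steering_reward(group, safe_dir, unsafe_dir)
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Because the advantages are centered within the group, samples closer to the safe direction than their siblings receive positive values and the rest receive negative ones, so no separately trained reward model or absolute reward scale is needed in this toy setting.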
Models citing this paper: 1
ItsMaxNorm/SafeDiffusion-R1 (Text-to-Image)
Datasets citing this paper: 1
ItsMaxNorm/SafeDiffusion-R1-dataset (236k)
Similar Articles
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 is a new discrete diffusion planner for autonomous driving that uses reinforcement learning to enable self-editing of trajectory tokens, achieving high performance and low latency on the NAVSIM benchmark.
@svlevine: A new way to do off-policy RL with diffusion: if we have off-policy data, we need to figure out what the diffusion late…
A new method for off-policy reinforcement learning with diffusion models, using flow reversal to handle off-policy data by reversing the diffusion process on it.
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
This paper introduces a novel adaptive scheduler for steering discrete diffusion language models using sparse autoencoders, demonstrating that targeting interventions based on when specific attributes commit improves control quality and strength over uniform methods.
Recovering Hidden Reward in Diffusion-Based Policies
This research paper explores methods for recovering hidden rewards within diffusion-based policies, likely aiming to improve the alignment or efficiency of such models.
Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
Proposes Model-Based Diffusion Policy Optimization (MBDPO), a framework that unifies search and policy optimization in world models using diffusion policy representations, achieving consistent scaling behavior and superior performance across offline and online reinforcement learning tasks.