SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

Hugging Face Daily Papers 05/18/26, 12:00 AM Papers

diffusion-models safety reinforcement-learning grpo clip fine-tuning generative-ai

Summary

SafeDiffusion-R1 introduces an online reinforcement learning framework using GRPO and a steering reward mechanism to improve safety in diffusion models without requiring supervised data or reward tuning, achieving state-of-the-art performance on multiple harm categories.
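As a rough illustration of the GRPO component named in the summary, the sketch below computes group-relative advantages: each prompt gets a group of sampled images, and every image's reward is standardized against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` constant, and the toy reward values are assumptions made for this example; the paper's actual training loop is not shown on this page.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within each group (one group per prompt).

    rewards: array of shape (num_prompts, group_size).
    Returns an array of the same shape. This is a hypothetical helper,
    not code from the SafeDiffusion-R1 repository.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    # eps guards against division by zero when all rewards in a group match.
    return (rewards - mean) / (std + eps)


# Toy rewards (made up): 2 prompts, 4 sampled images per prompt.
rewards = np.array([
    [0.2, 0.5, 0.9, 0.4],
    [0.1, 0.1, 0.3, 0.7],
])
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed; samples that score above their group average receive positive advantages and the rest receive negative ones.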

Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe-text paired with safe-image ground truth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: steering text representations toward positive safety directions and away from negative ones in the embedding space. Our online-policy approach enables the model to learn from diverse prompts, including explicit unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 baseline) while improving compositional generation quality from 42.08% to 47.83% on GenEval. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: https://github.com/MAXNORM8650/SafeDiffusion-R1.
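The steering reward is described only in words here, so the following is one possible reading, not the authors' formula. It assumes the prompt's text embedding is shifted by `alpha * (safe - unsafe)`, where `safe` and `unsafe` are embeddings of positive and negative safety concepts, and that each generated image is scored by its cosine similarity to the shifted target. The names `steering_reward`, `alpha`, and the 3-dimensional toy vectors are all hypothetical stand-ins for real CLIP embeddings.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def steering_reward(image_embs, prompt_emb, safe_emb, unsafe_emb, alpha=1.0):
    """Cosine similarity between images and a safety-steered text target.

    Assumed form (not confirmed by the source):
        target = normalize(prompt + alpha * (safe - unsafe))
    image_embs has shape (num_images, dim); the other inputs have shape (dim,).
    """
    direction = l2_normalize(safe_emb) - l2_normalize(unsafe_emb)
    target = l2_normalize(l2_normalize(prompt_emb) + alpha * direction)
    return l2_normalize(image_embs) @ target


# Toy 3-d stand-ins for CLIP embeddings (made up for illustration).
safe_emb = np.array([1.0, 0.0, 0.0])    # e.g. a "safe" concept
unsafe_emb = np.array([0.0, 1.0, 0.0])  # e.g. an "unsafe" concept
prompt_emb = np.array([0.0, 0.0, 1.0])  # the user's prompt
image_embs = np.array([
    [1.0, 0.0, 1.0],  # on-prompt image leaning toward the safe concept
    [0.0, 1.0, 1.0],  # on-prompt image leaning toward the unsafe concept
])
steer_rewards = steering_reward(image_embs, prompt_emb, safe_emb, unsafe_emb)
print(steer_rewards)
```

Under these assumptions the reward needs no trained safety classifier: the first image scores higher than the second purely because of where it sits relative to the safe and unsafe directions.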

Original Article

Cached at: 05/19/26, 10:31 AM

Source: https://huggingface.co/papers/2605.18719

Abstract

A novel online reinforcement learning framework for diffusion models that improves safety without requiring supervised paired data or reward tuning, achieving state-of-the-art performance on multiple harm categories.

Get this paper in your agent:

hf papers read 2605.18719

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### ItsMaxNorm/SafeDiffusion-R1 Text-to-Image • Updated about 2 hours ago

Datasets citing this paper: 1

#### ItsMaxNorm/SafeDiffusion-R1-dataset Viewer • Updated 1 minute ago • 236k

Similar Articles

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

@svlevine: A new way to do off-policy RL with diffusion: if we have off-policy data, we need to figure out what the diffusion late…

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

Recovering Hidden Reward in Diffusion-Based Policies

Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
