SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

Abstract

Removing unsafe content that diffusion models learn during pre-training has been widely studied. Existing methods require expensive supervised data, either unsafe text paired with safe ground-truth images or negative/positive image pairs, which makes them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, which degrades generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation by post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need to fine-tune specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: text representations can be steered toward positive safety directions and away from negative ones in the embedding space. Our online approach enables the model to learn from diverse prompts, including explicitly unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 for the baseline) while improving compositional generation quality on GenEval from 42.08% to 47.83%. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: https://github.com/MAXNORM8650/SafeDiffusion-R1.
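To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the reward formula, so `steer` and `steering_reward` are hypothetical: they assume the prompt embedding is shifted along a (safe minus unsafe) direction and that the reward is the cosine similarity between an image embedding and the steered text embedding. The toy 3-D vectors stand in for real CLIP embeddings, and `alpha` is an assumed steering strength. The group-relative advantage follows the standard GRPO normalization (reward minus group mean, divided by group standard deviation); using the population standard deviation is an assumption here.

```python
import math
from statistics import mean, pstdev


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def steer(text_emb, safe_emb, unsafe_emb, alpha=1.0):
    """Hypothetical steering rule: move the prompt embedding toward the
    safe direction and away from the unsafe one."""
    direction = [s - u for s, u in zip(safe_emb, unsafe_emb)]
    return normalize([t + alpha * d for t, d in zip(text_emb, direction)])


def steering_reward(image_emb, text_emb, safe_emb, unsafe_emb, alpha=1.0):
    """Hypothetical reward: similarity of the image to the steered prompt."""
    return cosine(image_emb, steer(text_emb, safe_emb, unsafe_emb, alpha))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy stand-ins for CLIP embeddings (illustrative values only).
text_emb = [1.0, 0.0, 0.0]    # prompt embedding
safe_emb = [0.0, 1.0, 0.0]    # embedding of a "safe" concept
unsafe_emb = [0.0, 0.0, 1.0]  # embedding of an "unsafe" concept

# Two sampled images for the same prompt: one leaning safe, one leaning unsafe.
group_images = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

rewards = [steering_reward(img, text_emb, safe_emb, unsafe_emb) for img in group_images]
advantages = group_relative_advantages(rewards)
```

In this toy group, the sample that leans toward the safe concept receives the higher reward and therefore a positive advantage, while the other sample receives a negative one. Because advantages are computed relative to the group, no separately trained safe/unsafe reward model is needed to rank the samples.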