SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

May 18, 2026
Authors: Komal Kumar, Ankan Deria, Abhishek Basu, Fahad Shamshad, Hisham Cholakkal, Karthik Nandakumar
cs.AI

Abstract

Removing unsafe content that diffusion models learn during pre-training has been widely studied. Existing methods require expensive supervised data, either unsafe text paired with safe-image ground truth or negative/positive image pairs, which makes them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, which degrades generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: text representations can be steered toward positive safety directions and away from negative ones in the embedding space. Our online-policy approach enables the model to learn from diverse prompts, including explicit unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 for the baseline), while improving compositional generation quality on GenEval from 42.08% to 47.83%. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: https://github.com/MAXNORM8650/SafeDiffusion-R1.
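
The abstract does not give the reward formula, so the following is only a minimal sketch of one possible reading of the steering reward and of the group-relative normalization used in GRPO. It assumes precomputed CLIP-style embeddings passed in as plain lists of floats; the function names (`steer`, `steering_reward`, `group_relative_advantages`), the `alpha` steering strength, and the choice of cosine similarity between the image embedding and the steered text embedding are all illustrative assumptions, not the authors' implementation.

```python
import math
from statistics import mean, pstdev


def normalize(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def steer(text_emb, safe_emb, unsafe_emb, alpha=1.0):
    """Shift a text embedding toward the safe direction and away from the unsafe one.

    Assumption: the steering direction is the normalized difference between
    a 'safe' and an 'unsafe' concept embedding, scaled by alpha.
    """
    direction = normalize([s - u for s, u in zip(safe_emb, unsafe_emb)])
    shifted = [t + alpha * d for t, d in zip(normalize(text_emb), direction)]
    return normalize(shifted)


def steering_reward(image_emb, text_emb, safe_emb, unsafe_emb, alpha=1.0):
    """Hypothetical reward: similarity of a generated image to the steered prompt."""
    return cosine(image_emb, steer(text_emb, safe_emb, unsafe_emb, alpha))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of samples
    drawn for the same prompt, so no learned value function is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, a group of images sampled for the same prompt would each be scored with `steering_reward`, and the resulting list passed to `group_relative_advantages`; samples closer to the steered (safer) prompt embedding receive positive advantages and the others negative ones. In practice the embeddings would come from a CLIP text and image encoder rather than hand-written vectors.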