SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Model Post-Training

Abstract

Removing unsafe content that diffusion models acquire during pre-training has been widely studied. Existing methods require expensive supervised data, either unsafe text paired with safe ground-truth images or negative/positive image pairs, which makes them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, which degrades generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: text representations can be steered toward positive safety directions and away from negative ones in the embedding space. Our online approach enables the model to learn from diverse prompts, including explicitly unsafe ones, without catastrophic forgetting. Extensive experiments show that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 for the baseline) while improving compositional generation quality on GenEval from 42.08% to 47.83%. Notably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: https://github.com/MAXNORM8650/SafeDiffusion-R1.
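The two ingredients named above, a steering reward in an embedding space and GRPO's group-relative normalization, can be illustrated with a minimal sketch. The abstract does not give the exact reward formula, so the code below assumes one plausible form: the cosine similarity of a generated sample's embedding to a "safe" direction minus its similarity to an "unsafe" direction. Random vectors stand in for CLIP embeddings, and the function names (`steering_reward`, `group_relative_advantages`) are hypothetical, not taken from the authors' repository.

```python
import numpy as np


def unit(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def steering_reward(sample_emb, safe_dir, unsafe_dir):
    """Assumed reward form: cos(sample, safe) - cos(sample, unsafe).

    Higher values mean the sample sits closer to the safe direction
    than to the unsafe one in the embedding space.
    """
    s = unit(sample_emb)
    return s @ unit(safe_dir) - s @ unit(unsafe_dir)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Stand-ins for CLIP embeddings (assumption: real embeddings would come
# from a CLIP text/image encoder, which is not loaded here).
rng = np.random.default_rng(0)
dim = 8
safe_dir = unit(rng.normal(size=dim))
unsafe_dir = unit(rng.normal(size=dim))

# One group of four samples generated for the same prompt, ordered from
# "fully safe" to "fully unsafe" as mixtures of the two directions.
group = np.stack([
    safe_dir,
    0.7 * safe_dir + 0.3 * unsafe_dir,
    0.5 * safe_dir + 0.5 * unsafe_dir,
    unsafe_dir,
])

rewards = steering_reward(group, safe_dir, unsafe_dir)
advantages = group_relative_advantages(rewards)
```

In this sketch the sample aligned with the safe direction receives the largest advantage and the one aligned with the unsafe direction the smallest, so a policy-gradient update weighted by `advantages` would push generation toward the safe side without a separately trained reward model.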