

Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models

November 5, 2025
Authors: Minghao Fu, Guo-Hua Wang, Tianyu Cui, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
cs.AI

Abstract

Text-to-image diffusion models deliver high-quality images, yet aligning them with human preferences remains challenging. We revisit diffusion-based Direct Preference Optimization (DPO) for these models and identify a critical pathology: enlarging the preference margin does not necessarily improve generation quality. In particular, the standard Diffusion-DPO objective can increase the reconstruction error of both the winner and loser branches. Consequently, degradation of the less-preferred outputs can become severe enough that the preferred branch is also adversely affected, even as the margin grows. To address this, we introduce Diffusion-SDPO, a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the preferred output is non-increasing at each optimization step. Our method is simple and model-agnostic, is broadly compatible with existing DPO-style alignment frameworks, and adds only marginal computational overhead. Across standard text-to-image benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt-alignment metrics. Code is publicly available at https://github.com/AIDC-AI/Diffusion-SDPO.
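The safeguard described in the abstract can be made concrete with a small sketch. To first order, a parameter step along -(g_w - alpha * g_l), where g_w and g_l are the gradients of the winner and loser reconstruction errors, changes the winner's error by -lr * (||g_w||^2 - alpha * <g_w, g_l>); clipping alpha at ||g_w||^2 / <g_w, g_l> whenever the inner product is positive keeps that change non-positive. The PyTorch sketch below illustrates one such closed form on a toy quadratic objective; the names (safeguard_alpha, err_w, err_l) and the exact coefficient are illustrative assumptions, not the paper's implementation.

```python
import torch

def safeguard_alpha(g_w: torch.Tensor, g_l: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """One plausible closed-form safeguard (an assumption, not the paper's code):
    cap alpha at ||g_w||^2 / <g_w, g_l> when the gradients are positively
    aligned, so the winner's first-order error change under the step
    -(g_w - alpha * g_l) is guaranteed to be <= 0."""
    dot = torch.dot(g_w, g_l)
    if dot <= 0:                              # loser term cannot hurt the winner
        return g_w.new_ones(())
    return torch.clamp(torch.dot(g_w, g_w) / (dot + eps), max=1.0)

# Toy "denoiser": reconstruction errors are squared residuals against the
# preferred (winner) and rejected (loser) targets.
theta = torch.randn(8, requires_grad=True)
x_w, x_l = torch.randn(8), torch.randn(8)

err_w = (theta - x_w).pow(2).sum()
err_l = (theta - x_l).pow(2).sum()

(g_w,) = torch.autograd.grad(err_w, theta)
(g_l,) = torch.autograd.grad(err_l, theta)

alpha = safeguard_alpha(g_w, g_l)
lr = 1e-2
with torch.no_grad():
    # A DPO-style step widens the margin (lower err_w, higher err_l); the
    # scaled loser term cannot increase err_w to first order.
    theta -= lr * (g_w - alpha * g_l)
```

In this sketch, alpha = 1 leaves the update unchanged when the loser gradient does not oppose the winner, and shrinks the loser contribution only when it would raise the winner's error, matching the abstract's description of an adaptive, alignment-dependent scaling.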