Margin-aware Preference Optimization for Aligning Diffusion Models without Reference
June 10, 2024
作者: Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, Jongheon Jeong
cs.AI
Abstract
Modern alignment techniques based on human preferences, such as RLHF and DPO,
typically employ divergence regularization relative to the reference model to
ensure training stability. However, this often limits the flexibility of models
during alignment, especially when there is a clear distributional discrepancy
between the preference data and the reference model. In this paper, we focus on
the alignment of recent text-to-image diffusion models, such as Stable
Diffusion XL (SDXL), and find that this "reference mismatch" is indeed a
significant problem in aligning these models due to the unstructured nature of
visual modalities: e.g., a preference for a particular stylistic aspect can
easily induce such a discrepancy. Motivated by this observation, we propose a
novel and memory-friendly preference alignment method for diffusion models that
does not depend on any reference model, coined margin-aware preference
optimization (MaPO). MaPO jointly maximizes the likelihood margin between the
preferred and dispreferred image sets and the likelihood of the preferred sets,
simultaneously learning general stylistic features and preferences. For
evaluation, we introduce two new pairwise preference datasets, which comprise
self-generated image pairs from SDXL, Pick-Style and Pick-Safety, simulating
diverse scenarios of reference mismatch. Our experiments validate that MaPO can
significantly improve alignment on Pick-Style and Pick-Safety, as well as general
preference alignment when used with Pick-a-Pic v2, surpassing the base SDXL and
other existing methods. Our code, models, and datasets are publicly available
via https://mapo-t2i.github.io.
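As a rough illustration of the reference-free objective described in the abstract, the sketch below combines a margin term between preferred and dispreferred images with a likelihood term on the preferred images. This is a minimal sketch, not the paper's exact loss: using per-sample denoising MSE losses as likelihood surrogates, the log-sigmoid margin, and the hyperparameters `beta` and `margin_weight` are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def margin_aware_preference_loss(loss_chosen: torch.Tensor,
                                 loss_rejected: torch.Tensor,
                                 beta: float = 0.1,
                                 margin_weight: float = 1.0) -> torch.Tensor:
    """Hedged sketch of a reference-free, margin-aware preference loss.

    loss_chosen / loss_rejected: per-sample diffusion denoising (MSE) losses
    for the preferred and dispreferred images under the same model being
    fine-tuned; no reference model is queried anywhere.
    """
    # A lower denoising loss roughly corresponds to a higher image likelihood,
    # so (loss_rejected - loss_chosen) acts as a likelihood margin between
    # the preferred and dispreferred images.
    margin = loss_rejected - loss_chosen

    # Margin term: push preferred images to be more likely than dispreferred
    # ones; the log-sigmoid keeps the term bounded and smooth.
    margin_term = -F.logsigmoid(beta * margin).mean()

    # Preference term: also directly maximize the likelihood of the preferred
    # images by minimizing their denoising loss.
    preferred_term = loss_chosen.mean()

    return preferred_term + margin_weight * margin_term
```

In practice, `loss_chosen` and `loss_rejected` would be the standard noise-prediction losses computed for the winning and losing images of each pair at the same sampled timestep; `beta` and `margin_weight` are illustrative knobs, not values from the paper.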