Margin-aware Preference Optimization for Aligning Diffusion Models without Reference
June 10, 2024
作者: Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, Jongheon Jeong
cs.AI
Abstract
Modern alignment techniques based on human preferences, such as RLHF and DPO,
typically employ divergence regularization relative to the reference model to
ensure training stability. However, this often limits the flexibility of models
during alignment, especially when there is a clear distributional discrepancy
between the preference data and the reference model. In this paper, we focus on
the alignment of recent text-to-image diffusion models, such as Stable
Diffusion XL (SDXL), and find that this "reference mismatch" is indeed a
significant problem in aligning these models due to the unstructured nature of
visual modalities: e.g., a preference for a particular stylistic aspect can
easily induce such a discrepancy. Motivated by this observation, we propose a
novel and memory-friendly preference alignment method for diffusion models that
does not depend on any reference model, coined margin-aware preference
optimization (MaPO). MaPO jointly maximizes the likelihood margin between the
preferred and dispreferred image sets and the likelihood of the preferred sets,
simultaneously learning general stylistic features and preferences. For
evaluation, we introduce two new pairwise preference datasets, Pick-Style and
Pick-Safety, which comprise self-generated image pairs from SDXL and simulate
diverse scenarios of reference mismatch. Our experiments validate that MaPO can
significantly improve alignment on Pick-Style and Pick-Safety, as well as
general preference alignment when used with Pick-a-Pic v2, surpassing the base SDXL and
other existing methods. Our code, models, and datasets are publicly available
via https://mapo-t2i.github.io
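As a rough illustration of the objective sketched in the abstract, the PyTorch snippet below combines a sigmoid margin term between per-sample denoising errors of the preferred and dispreferred images with a plain denoising term on the preferred images, without any reference model. The function name, the `beta` and `margin_weight` hyperparameters, and the use of per-sample MSE as a likelihood proxy are assumptions made for illustration; the exact MaPO formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F


def margin_aware_preference_loss(eps_pred_w, eps_w, eps_pred_l, eps_l,
                                 beta=0.1, margin_weight=1.0):
    """Illustrative sketch of a reference-free, margin-aware preference loss.

    eps_pred_w / eps_pred_l: noise predictions for the preferred ("winning")
    and dispreferred ("losing") images at a shared timestep.
    eps_w / eps_l: the ground-truth noise added to each image.
    The per-sample denoising MSE serves as a proxy for negative log-likelihood,
    as is standard practice for diffusion models.
    """
    # Per-sample denoising errors averaged over (C, H, W); shape: (batch,).
    err_w = F.mse_loss(eps_pred_w, eps_w, reduction="none").mean(dim=(1, 2, 3))
    err_l = F.mse_loss(eps_pred_l, eps_l, reduction="none").mean(dim=(1, 2, 3))

    # Margin term: push the preferred error below the dispreferred error.
    margin_loss = -F.logsigmoid(beta * (err_l - err_w))

    # Preferred-likelihood term: keep fitting the preferred images themselves.
    preferred_loss = err_w

    return (margin_weight * margin_loss + preferred_loss).mean()
```

Because no reference-model forward pass is needed, only the policy model is kept in memory, which is consistent with the memory-friendly property claimed for MaPO.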