

Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

April 27, 2026
作者: Xinxin Liu, Ming Li, Zonglin Lyu, Yuzhang Shang, Chen Chen
cs.AI

Abstract

Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data. Our method starts by training on a consensus-filtered clean subset, then uses this model as an implicit classifier to generate pseudo-labels for the noisy set for iterative refinement. Experimental results demonstrate that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training. We will release our code and models at: https://github.com/L-CodingSpace/semi-dpo
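The two-stage pipeline the abstract describes — consensus filtering into a clean labeled subset, then pseudo-labeling the conflicting pairs with the clean-trained model acting as an implicit classifier — can be sketched as below. This is a minimal illustration, not the authors' implementation: the per-dimension vote format, the `implicit_score` callable (standing in for the DPO implicit reward, i.e. the policy-vs-reference log-probability margin), and all names are assumptions.

```python
def split_by_consensus(pairs):
    """Split preference pairs into a clean subset (all annotation
    dimensions agree on the winner) and a noisy subset (dimensions
    conflict), mirroring the consensus-filtering step.

    Each pair carries hypothetical per-dimension votes:
    +1 if image A is preferred on that dimension, -1 if image B is.
    """
    clean, noisy = [], []
    for p in pairs:
        votes = p["dims"]
        if all(v == votes[0] for v in votes):
            clean.append(p)   # consistent pair -> treated as clean labeled data
        else:
            noisy.append(p)   # conflicting pair -> treated as unlabeled data
    return clean, noisy


def pseudo_label(pair, implicit_score):
    """Label a noisy pair using the clean-trained model as an implicit
    classifier: whichever image receives the higher implicit reward
    (a stand-in for the DPO log-prob margin) is marked the winner."""
    return "A" if implicit_score(pair, "A") >= implicit_score(pair, "B") else "B"


# Toy usage with hypothetical data.
pairs = [
    {"id": 0, "dims": [+1, +1, +1]},  # aesthetics, fidelity, alignment all agree
    {"id": 1, "dims": [+1, -1, +1]},  # dimensions conflict -> label noise
]
clean, noisy = split_by_consensus(pairs)

# A dummy implicit scorer standing in for the clean-trained DPO model.
label = pseudo_label(noisy[0], lambda p, side: 0.7 if side == "A" else 0.3)
```

In the iterative refinement the abstract mentions, the pseudo-labeled noisy pairs would be folded back into training and the scorer re-derived from the updated model; that loop is omitted here.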