
ViPO: Visual Preference Optimization at Scale

April 29, 2026
Authors: Ming Li, Jie Wu, Justin Cui, Xiaojie Li, Rui Wang, Chen Chen
cs.AI

Abstract

While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm remains largely unexplored. Current open-source preference datasets contain conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn preferences, hindering effective scaling. To enhance robustness against noise, we propose Poly-DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence based on dataset characteristics, enabling effective learning across diverse data distributions. Beyond biased patterns, existing datasets suffer from low resolution, limited prompt diversity, and imbalanced distributions. To facilitate large-scale visual preference optimization by tackling data bottlenecks, we construct ViPO, a massive-scale preference dataset with 1M image pairs at 1024px across five categories and 300K video pairs at 720p+ across three categories. State-of-the-art generative models and diverse prompts ensure reliable preference signals with balanced distributions. Remarkably, when applying Poly-DPO to our high-quality dataset, the optimal configuration converges to standard DPO. This convergence validates dataset quality and Poly-DPO's adaptive nature: sophisticated optimization becomes unnecessary with sufficient data quality, yet remains valuable for imperfect datasets. We validate our approach across visual generation models. On noisy datasets like Pick-a-Pic V2, Poly-DPO achieves 6.87 and 2.32 gains over Diffusion-DPO on GenEval for SD1.5 and SDXL, respectively. For ViPO, models achieve performance far exceeding those trained on existing open-source preference datasets. These results confirm that addressing both algorithmic adaptability and data quality is essential for scaling visual preference optimization.
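The abstract states only that Poly-DPO extends the DPO objective with a polynomial term that adjusts model confidence; the exact functional form is not given. The sketch below shows one plausible way such an extension could look: the standard DPO implicit-reward margin is augmented with a polynomial term in that margin. The names `alpha` and `degree`, and the specific form `beta * m + alpha * m**degree`, are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F


def poly_dpo_loss(policy_logp_w, policy_logp_l,
                  ref_logp_w, ref_logp_l,
                  beta=0.1, alpha=0.05, degree=3):
    """Hypothetical Poly-DPO-style loss (illustrative only).

    Inputs are per-sample log-probabilities of the winning (w) and
    losing (l) responses under the policy and a frozen reference model.
    With alpha = 0 this reduces to the standard DPO loss.
    """
    # Implicit reward margin between winner and loser, as in DPO.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    # Standard DPO logit plus an assumed polynomial term in the margin,
    # which could temper or sharpen confidence on noisy preference pairs.
    logits = beta * margin + alpha * margin.pow(degree)
    return -F.logsigmoid(logits).mean()


# Toy usage with made-up log-probabilities for two preference pairs.
lpw = torch.tensor([-1.0, -0.5])   # policy log p(winner)
lpl = torch.tensor([-2.0, -1.5])   # policy log p(loser)
rw = torch.tensor([-1.2, -0.8])    # reference log p(winner)
rl = torch.tensor([-1.8, -1.4])    # reference log p(loser)
loss = poly_dpo_loss(lpw, lpl, rw, rl)
```

Setting `alpha = 0` recovers vanilla DPO, which mirrors the abstract's observation that on a high-quality dataset the optimal Poly-DPO configuration converges to standard DPO.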