ViPO: Visual Preference Optimization at Scale
April 29, 2026
Authors: Ming Li, Jie Wu, Justin Cui, Xiaojie Li, Rui Wang, Chen Chen
cs.AI
Abstract
While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm remains largely unexplored. Current open-source preference datasets contain conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn preferences, hindering effective scaling. To enhance robustness against noise, we propose Poly-DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence based on dataset characteristics, enabling effective learning across diverse data distributions. Beyond biased patterns, existing datasets suffer from low resolution, limited prompt diversity, and imbalanced distributions. To facilitate large-scale visual preference optimization by tackling data bottlenecks, we construct ViPO, a massive-scale preference dataset with 1M image pairs at 1024px across five categories and 300K video pairs at 720p+ across three categories. State-of-the-art generative models and diverse prompts ensure reliable preference signals with balanced distributions. Remarkably, when Poly-DPO is applied to our high-quality dataset, the optimal configuration converges to standard DPO. This convergence validates the dataset's quality and Poly-DPO's adaptive nature: sophisticated optimization becomes unnecessary with sufficient data quality, yet remains valuable for imperfect datasets. We validate our approach across visual generation models. On noisy datasets like Pick-a-Pic V2, Poly-DPO achieves gains of 6.87 and 2.32 points over Diffusion-DPO on GenEval for SD1.5 and SDXL, respectively. Models trained on ViPO far outperform those trained on existing open-source preference datasets. These results confirm that addressing both algorithmic adaptability and data quality is essential for scaling visual preference optimization.
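The abstract does not give the exact form of the polynomial term, so the following is a minimal PyTorch sketch of one plausible reading: the standard DPO logistic loss on the implicit reward margin, plus a hypothetical polynomial penalty of degree `p` whose weight `alpha` would be tuned to the dataset. All names here (`poly_dpo_loss`, `alpha`, `p`) are illustrative assumptions, not the paper's API. Note that with `alpha = 0` the extra term vanishes and the loss reduces to standard DPO, consistent with the convergence behavior the abstract describes on high-quality data.

```python
import torch
import torch.nn.functional as F

def poly_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                  beta=0.1, alpha=0.0, p=2):
    """Hypothetical Poly-DPO-style loss (illustrative sketch only).

    logp_w / logp_l:         policy log-probs of winner / loser samples
    ref_logp_w / ref_logp_l: reference-model log-probs of the same samples
    beta:  DPO temperature on the implicit reward margin
    alpha: weight of the assumed polynomial term (alpha=0 -> standard DPO)
    p:     assumed polynomial degree
    """
    # Implicit reward margin, as in standard DPO.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

    # Standard DPO term: -log sigmoid(margin).
    dpo_term = -F.logsigmoid(margin)

    # Assumed polynomial term: penalizes overconfident margins, which
    # could temper the model on noisy preference pairs. This exact form
    # is a guess, not the paper's definition.
    poly_term = alpha * torch.sigmoid(margin).pow(p)

    return (dpo_term + poly_term).mean()

# Toy usage with random log-probs for a batch of 4 preference pairs.
lw, ll = torch.randn(4), torch.randn(4)
rw, rl = torch.randn(4), torch.randn(4)
loss = poly_dpo_loss(lw, ll, rw, rl, beta=0.1, alpha=0.5, p=2)
```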