Robust Preference Optimization via Dynamic Target Margins
June 4, 2025
Authors: Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, Xiang Wang
cs.AI
Abstract
The alignment of Large Language Models (LLMs) is crucial for ensuring their
safety and reliability in practical applications. Direct Preference
Optimization (DPO) has emerged as an efficient method that directly optimizes
models using preference pairs, significantly reducing resource demands.
However, the effectiveness of DPO heavily depends on data quality, which is
frequently compromised by noise. In this work, we propose gamma-PO, a
dynamic target margin preference optimization algorithm that adjusts reward
margins at the pairwise level. By introducing instance-specific margin
calibration, gamma-PO strategically prioritizes high-confidence pairs (those
demonstrating higher reward margins) while suppressing potential noise from
ambiguous pairs. Moreover, gamma-PO is a plug-and-play method, compatible
with variants of DPO that rely on the reward margin between preference pairs.
Across benchmarks such as AlpacaEval2 and Arena-Hard, gamma-PO achieves an
average 4.4% improvement over other baselines, setting a new state of the
art. Additionally, gamma-PO requires minimal code changes and has a
negligible impact on training efficiency, making it a robust solution for
enhancing LLM alignment. Our code is available at
https://github.com/sunjie279/gammaPO.
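To make the pair-level margin idea concrete, here is a minimal PyTorch sketch of a DPO-style loss in which each preference pair receives its own target margin. The margin rule, the function name dpo_loss_with_dynamic_margin, and the gamma_scale parameter are illustrative assumptions of ours, not the gamma-PO calibration from the paper; the repository above contains the actual implementation.

# Hedged sketch (not the authors' exact formulation): a DPO-style loss with a
# per-pair dynamic target margin, showing how an instance-specific margin can
# be plugged into losses that already depend on the reward margin.
import torch
import torch.nn.functional as F

def dpo_loss_with_dynamic_margin(policy_chosen_logps, policy_rejected_logps,
                                 ref_chosen_logps, ref_rejected_logps,
                                 beta=0.1, gamma_scale=0.5):
    # Implicit rewards under the standard DPO parameterization.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    reward_margin = chosen_rewards - rejected_rewards

    # Hypothetical instance-specific target margin: larger for high-confidence
    # pairs (margin above the batch mean), smaller for ambiguous ones.
    # Computed without gradients so it acts as a fixed per-pair target.
    with torch.no_grad():
        gamma = gamma_scale * torch.sigmoid(reward_margin - reward_margin.mean())

    # DPO-style objective with the dynamic margin subtracted inside the sigmoid.
    losses = -F.logsigmoid(reward_margin - gamma)
    return losses.mean(), gamma

# Toy usage with dummy sequence log-probabilities for a batch of 4 pairs.
if __name__ == "__main__":
    pol_c = torch.tensor([-12.0, -15.0, -10.0, -20.0])
    pol_r = torch.tensor([-14.0, -15.5, -18.0, -19.5])
    ref_c = torch.tensor([-13.0, -15.0, -12.0, -20.0])
    ref_r = torch.tensor([-13.5, -15.0, -16.0, -19.0])
    loss, gamma = dpo_loss_with_dynamic_margin(pol_c, pol_r, ref_c, ref_r)
    print(loss.item(), gamma.tolist())

Because the margin only rescales the target inside the log-sigmoid, the same pattern drops into other margin-based DPO variants with a one-line change, which is consistent with the plug-and-play claim in the abstract.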