

Robust Preference Optimization via Dynamic Target Margins

June 4, 2025
作者: Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, Xiang Wang
cs.AI

Abstract

The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models on preference pairs, significantly reducing resource demands. However, the effectiveness of DPO depends heavily on data quality, which is frequently compromised by noise. In this work, we propose gamma-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, gamma-PO strategically prioritizes high-confidence pairs (those exhibiting higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, gamma-PO is a plug-and-play method, compatible with DPO variants that rely on the reward margin between preference pairs. Across benchmarks such as AlpacaEval 2 and Arena-Hard, gamma-PO achieves an average 4.4% improvement over other baselines, setting a new state of the art. Additionally, gamma-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at https://github.com/sunjie279/gammaPO.
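
To make the dynamic-margin idea concrete, the sketch below shows how a per-pair target margin could plug into a DPO-style logistic loss. The abstract does not give the actual formulation, so this is only a minimal sketch under stated assumptions: the `gamma_po_loss` name, the `gamma0` hyperparameter, and the batch-softmax calibration rule are all hypothetical illustrations, not the paper's method.

```python
import torch
import torch.nn.functional as F

def gamma_po_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps,
                  beta=0.1, gamma0=0.5):
    """Sketch of a DPO-style loss with pairwise dynamic target margins.

    Hypothetical calibration: each pair's target margin is scaled by how
    its implicit reward margin compares to the rest of the batch, so
    high-confidence pairs face a larger target while ambiguous pairs
    receive a smaller one (and thus less gradient pressure).
    """
    # Implicit reward margin for each preference pair, as in standard DPO.
    margins = beta * ((policy_chosen_logps - ref_chosen_logps)
                      - (policy_rejected_logps - ref_rejected_logps))

    # Instance-specific margin calibration (assumed rule, not from the paper):
    # softmax-normalize margins within the batch, rescaled so the average
    # weight is 1; detached so calibration itself receives no gradients.
    with torch.no_grad():
        weights = torch.softmax(margins, dim=0) * margins.numel()
        gammas = gamma0 * weights

    # DPO logistic loss with a per-pair target margin subtracted.
    return -F.logsigmoid(margins - gammas).mean()
```

Because the margin enters the loss only as a scalar offset per pair, a change of this shape would leave the rest of a DPO training loop untouched, which is consistent with the abstract's claim of minimal code changes and negligible training-efficiency impact.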