Robust Preference Optimization via Dynamic Target Margins

Abstract

The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO depends heavily on data quality, which is frequently compromised by noise. In this work, we propose gamma-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, gamma-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, gamma-PO is a plug-and-play method, compatible with variants of DPO that rely on the reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, gamma-PO achieves an average 4.4% improvement over other baselines, setting new benchmarks for state-of-the-art performance. Additionally, gamma-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at https://github.com/sunjie279/gammaPO.
