Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Abstract

Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data: the model overfits on easy examples and under-learns from informative ones. Several recent methods attempt to address this. IPO counters general overfitting, but its uniform regularization can be overly conservative. The more targeted beta-DPO has its own limitations: its batch-level adaptation applies a single, compromised temperature to pairs with mixed margins, its linear update rule can produce unstable negative beta values, and its filtering mechanism discards potentially useful training signals. In this work, we introduce Margin-Adaptive Direct Preference Optimization (MADPO), a method that provides a stable, data-preserving, instance-level solution. MADPO takes a practical two-step approach: it first trains a reward model to estimate preference margins, then uses these margins to apply a continuous, adaptive weight to the DPO loss of each individual training sample. This re-weighting scheme creates an effective target margin that is amplified for hard pairs and dampened for easy pairs, allowing granular control over the learning signal. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape and is robust to reward model estimation errors. We validate the theory with experiments on a sentiment generation task, where MADPO consistently and significantly outperforms strong baselines across datasets of varying quality. It achieves performance gains of up to +33.3% on High Quality data and +10.5% on Low Quality data over the next-best method. Our results establish MADPO as a more robust and principled approach to preference alignment.
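The two-step idea in the abstract (estimate a margin with a reward model, then re-weight each sample's DPO loss) can be illustrated with a minimal sketch. The abstract does not state the weighting function, so `margin_weight` below, with its `tau` and `scale` parameters, is a hypothetical choice made only for illustration and is not the paper's formula. The sketch also assumes that "hard" pairs are those with a small estimated reward margin, and that the reward-model margin is already available as a number.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_logratio: float, ref_logratio: float, beta: float = 0.1) -> float:
    """Standard per-sample DPO loss.

    policy_logratio: log pi(y_w | x) - log pi(y_l | x) under the policy.
    ref_logratio:    the same quantity under the frozen reference model.
    """
    return -log_sigmoid(beta * (policy_logratio - ref_logratio))


def margin_weight(margin: float, tau: float = 1.0, scale: float = 1.0) -> float:
    """Hypothetical continuous weight in (0, 2), NOT the paper's formula.

    Equals 1 when margin == tau, rises toward 2 for small margins
    (assumed hard pairs) and decays toward 0 for large margins (easy pairs).
    """
    z = min((margin - tau) / scale, 700.0)  # clamp to avoid overflow in exp
    return 2.0 / (1.0 + math.exp(z))


def weighted_dpo_loss(
    policy_logratio: float,
    ref_logratio: float,
    rm_margin: float,
    beta: float = 0.1,
) -> float:
    """Per-sample DPO loss scaled by the margin-dependent weight."""
    return margin_weight(rm_margin) * dpo_loss(policy_logratio, ref_logratio, beta)


if __name__ == "__main__":
    # Same policy state, different reward-model margins.
    print("hard pair:", weighted_dpo_loss(0.5, 0.0, rm_margin=0.1))
    print("easy pair:", weighted_dpo_loss(0.5, 0.0, rm_margin=3.0))
```

Because the weight is computed per sample and is always positive, no pair is filtered out and no temperature can turn negative, which is the property the abstract contrasts with beta-DPO. Any smooth, positive, monotonically decreasing function of the margin would serve the same illustrative purpose.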
