Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
October 6, 2025
Author: Hyung Gyu Rho
cs.AI
Abstract
Direct Preference Optimization (DPO) has emerged as a simple and effective
method for aligning large language models. However, its reliance on a fixed
temperature parameter leads to suboptimal training on diverse preference data,
causing overfitting on easy examples and under-learning from informative ones.
Recent methods attempt to counter this. While IPO addresses general
overfitting, its uniform regularization can be overly conservative. The more
targeted approach of beta-DPO suffers from its own limitations: its
batch-level adaptation applies a single, compromised temperature to
mixed-margin pairs, its linear update rule can produce unstable negative
beta values, and its filtering mechanism discards potentially useful
training signals. In this work, we introduce Margin-Adaptive Direct Preference
Optimization (MADPO), a method that provides a stable, data-preserving, and
instance-level solution. MADPO employs a practical two-step approach: it first
trains a reward model to estimate preference margins and then uses these
margins to apply a continuous, adaptive weight to the DPO loss for each
individual training sample. This re-weighting scheme creates an effective
target margin that is amplified for hard pairs and dampened for easy pairs,
allowing for granular control over the learning signal. We provide a
comprehensive theoretical analysis, proving that MADPO has a well-behaved
optimization landscape and is robust to reward model estimation errors. We
validate our theory with experiments on a sentiment generation task, where
MADPO consistently and significantly outperforms strong baselines across
datasets of varying quality. It achieves performance gains of up to +33.3% on
High Quality data and +10.5% on Low Quality data over the next-best method.
Our results establish MADPO as a more robust and principled approach to
preference alignment.
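
To make the re-weighting idea concrete, below is a minimal sketch of a margin-adaptive DPO loss in the spirit of the abstract. It is not the paper's exact formulation: the function name madpo_loss, the weight w(m) = 2*sigmoid(-m/tau), and the parameters tau and beta are illustrative assumptions. The abstract only specifies that a separately trained reward model supplies per-pair margin estimates and that the resulting continuous, instance-level weights are larger for hard pairs and smaller for easy pairs.

import torch
import torch.nn.functional as F

def madpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps,
               reward_margins, beta=0.1, tau=1.0):
    """Margin-adaptive re-weighting of the per-pair DPO loss (sketch).

    reward_margins: estimated r(x, y_w) - r(x, y_l) from the separately
    trained reward model (step 1 of the two-step approach).
    """
    # Standard DPO implicit reward difference between policy and reference.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios

    # Per-pair DPO loss with a fixed temperature beta.
    per_pair_dpo = -F.logsigmoid(beta * logits)

    # Continuous, instance-level weight: close to 2 for hard pairs
    # (small or negative estimated margin), close to 0 for easy pairs
    # (large positive margin). The form w(m) = 2 * sigmoid(-m / tau)
    # is an assumed, illustrative choice.
    weights = 2.0 * torch.sigmoid(-reward_margins / tau)

    # Weights come from the frozen reward model, so no gradient flows
    # through them.
    return (weights.detach() * per_pair_dpo).mean()

In use, the four log-probability tensors would be the summed token log-probabilities of the chosen and rejected responses under the policy and the frozen reference model, as in standard DPO, and reward_margins would be computed once per pair from the reward model before preference fine-tuning begins.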