Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
October 6, 2025
Author: Hyung Gyu Rho
cs.AI
Abstract
Direct Preference Optimization (DPO) has emerged as a simple and effective
method for aligning large language models. However, its reliance on a fixed
temperature parameter leads to suboptimal training on diverse preference data,
causing overfitting on easy examples and under-learning from informative ones.
Recent methods have emerged to counter this. While IPO addresses general
overfitting, its uniform regularization can be overly conservative. The more
targeted approach of beta-DPO suffers from its own limitations: its
batch-level adaptation applies a single compromise temperature to
mixed-margin pairs, its linear update rule can produce unstable negative
beta values, and its filtering mechanism discards potentially useful
training signals. In this work, we introduce Margin-Adaptive Direct Preference
Optimization (MADPO), a method that provides a stable, data-preserving, and
instance-level solution. MADPO employs a practical two-step approach: it first
trains a reward model to estimate preference margins and then uses these
margins to apply a continuous, adaptive weight to the DPO loss for each
individual training sample. This re-weighting scheme creates an effective
target margin that is amplified for hard pairs and dampened for easy pairs,
allowing for granular control over the learning signal. We provide a
comprehensive theoretical analysis, proving that MADPO has a well-behaved
optimization landscape and is robust to reward model estimation errors. We
validate our theory with experiments on a sentiment generation task, where
MADPO consistently and significantly outperforms strong baselines across
datasets of varying quality. It achieves performance gains of up to +33.3% on
High Quality data and +10.5% on Low Quality data over the next-best method.
Our results establish MADPO as a more robust and principled approach to
preference alignment.
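
The abstract describes MADPO's two-step recipe only at a high level, so the following PyTorch sketch is an illustration of the general idea rather than the paper's implementation: it computes the standard per-sample DPO loss and re-weights it with a continuous function of a reward-model margin estimate. The `madpo_style_loss` name, the `2 * sigmoid(-margin / tau)` weighting, and all arguments are assumptions, since the abstract does not give the exact weighting rule.

```python
import torch
import torch.nn.functional as F

def madpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps,
                     estimated_margins, beta=0.1, tau=1.0):
    """Illustrative margin-adaptive re-weighting of the per-sample DPO loss.

    `estimated_margins` are reward-model estimates of r(x, y_w) - r(x, y_l);
    the weighting function below is a hypothetical choice, not the paper's rule.
    """
    # Standard per-sample DPO logits: beta times the difference between the
    # policy and reference log-ratios of chosen vs. rejected responses.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios)

    # Per-sample DPO loss: negative log-sigmoid of the scaled margin.
    per_sample_loss = -F.logsigmoid(logits)

    # Continuous adaptive weights: larger for hard pairs (small or negative
    # estimated margin), smaller for easy pairs (large positive margin).
    # The 2 * sigmoid(-m / tau) form keeps weights in (0, 2) and equals 1 at m = 0.
    weights = 2.0 * torch.sigmoid(-estimated_margins / tau)

    # Weighted mean; weights are treated as constants (no gradient through them).
    return (weights.detach() * per_sample_loss).mean()
```

Under this illustrative choice, weights above 1 amplify the loss for hard pairs (low or negative estimated margin) and weights below 1 dampen it for easy pairs, mirroring the "effective target margin" behavior the abstract attributes to MADPO.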