Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
October 6, 2025
Author: Hyung Gyu Rho
cs.AI
Abstract
Direct Preference Optimization (DPO) has emerged as a simple and effective
method for aligning large language models. However, its reliance on a fixed
temperature parameter leads to suboptimal training on diverse preference data,
causing overfitting on easy examples and under-learning from informative ones.
Recent methods have emerged to counter this. While IPO addresses general
overfitting, its uniform regularization can be overly conservative. The more
targeted approach of beta-DPO suffers from its own limitations: its
batch-level adaptation applies a single compromise temperature to
mixed-margin pairs, its linear update rule can produce unstable negative
beta values, and its filtering mechanism discards potentially useful
training signals. In this work, we introduce Margin-Adaptive Direct Preference
Optimization (MADPO), a method that provides a stable, data-preserving, and
instance-level solution. MADPO employs a practical two-step approach: it first
trains a reward model to estimate preference margins and then uses these
margins to apply a continuous, adaptive weight to the DPO loss for each
individual training sample. This re-weighting scheme creates an effective
target margin that is amplified for hard pairs and dampened for easy pairs,
allowing for granular control over the learning signal. We provide a
comprehensive theoretical analysis, proving that MADPO has a well-behaved
optimization landscape and is robust to reward model estimation errors. We
validate our theory with experiments on a sentiment generation task, where
MADPO consistently and significantly outperforms strong baselines across
datasets of varying quality. It achieves performance gains of up to +33.3% on
High Quality data and +10.5% on Low Quality data over the next-best method.
Our results establish MADPO as a more robust and principled approach to
preference alignment.
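
The abstract describes MADPO's two-step recipe only at a high level, so the following PyTorch sketch is an illustration of the general idea rather than the paper's implementation: it computes the standard per-sample DPO loss and re-weights it with a continuous function of a reward-model margin estimate. The `madpo_style_loss` name, the `2 * sigmoid(-margin / tau)` weighting, and all arguments are assumptions, since the abstract does not give the exact weighting rule.

```python
import torch
import torch.nn.functional as F

def madpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps,
                     estimated_margins, beta=0.1, tau=1.0):
    """Illustrative margin-adaptive re-weighting of the per-sample DPO loss.

    `estimated_margins` are reward-model estimates of r(x, y_w) - r(x, y_l);
    the weighting function below is a hypothetical choice, not the paper's rule.
    """
    # Standard per-sample DPO logits: beta times the difference between the
    # policy and reference log-ratios of chosen vs. rejected responses.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios)

    # Per-sample DPO loss: negative log-sigmoid of the scaled margin.
    per_sample_loss = -F.logsigmoid(logits)

    # Continuous adaptive weights: larger for hard pairs (small or negative
    # estimated margin), smaller for easy pairs (large positive margin).
    # The 2 * sigmoid(-m / tau) form keeps weights in (0, 2) and equals 1 at m = 0.
    weights = 2.0 * torch.sigmoid(-estimated_margins / tau)

    # Weighted mean; weights are treated as constants (no gradient through them).
    return (weights.detach() * per_sample_loss).mean()
```

Under this illustrative choice, weights above 1 amplify the loss for hard pairs (low or negative estimated margin) and weights below 1 dampen it for easy pairs, mirroring the "effective target margin" behavior the abstract attributes to MADPO.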