Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Abstract

Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data: the model overfits easy examples and under-learns from informative ones. Several recent methods attempt to address this. IPO counters general overfitting, but its uniform regularization can be overly conservative. The more targeted beta-DPO has limitations of its own: its batch-level adaptation applies a single, compromise temperature to pairs with mixed margins, its linear update rule can produce unstable negative beta values, and its filtering mechanism discards potentially useful training signal. In this work, we introduce Margin-Adaptive Direct Preference Optimization (MADPO), a method that is stable, preserves all training data, and operates at the instance level. MADPO takes a practical two-step approach: it first trains a reward model to estimate preference margins, then uses these margins to apply a continuous, adaptive weight to the DPO loss of each individual training sample. This re-weighting scheme yields an effective target margin that is amplified for hard pairs and dampened for easy pairs, allowing granular control over the learning signal. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape and is robust to reward model estimation errors. We validate the theory with experiments on a sentiment generation task, where MADPO consistently and significantly outperforms strong baselines across datasets of varying quality, with performance gains over the next-best method of up to +33.3% on High Quality data and +10.5% on Low Quality data. These results establish MADPO as a more robust and principled approach to preference alignment.
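
The two-step idea in the abstract (estimate a per-pair margin with a reward model, then weight each sample's DPO loss by a continuous function of that margin) can be illustrated with a minimal sketch. The abstract does not give the form of the weight, so `margin_weight` below, with its parameters `tau`, `lam`, `w_min`, and `w_max`, is a hypothetical choice made for illustration only and is not the paper's formula. The sketch also assumes that "hard" pairs are those with a small reward margin and that the margins have already been computed by a separately trained reward model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard per-sample DPO loss with a fixed temperature beta."""
    # Difference of policy-vs-reference log-ratios for chosen and rejected.
    h = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp)
    return -log_sigmoid(beta * h)


def margin_weight(reward_margin: float, tau: float = 1.0, lam: float = 1.0,
                  w_min: float = 0.5, w_max: float = 2.0) -> float:
    """Hypothetical continuous weight (an assumption, not the paper's rule).

    Margins below tau (assumed "hard" pairs) get a weight above 1,
    margins above tau (assumed "easy" pairs) get a weight below 1.
    Clamping keeps the weight positive and bounded.
    """
    w = math.exp(-lam * (reward_margin - tau))
    return min(w_max, max(w_min, w))


def weighted_dpo_loss(batch: list[dict], beta: float = 0.1) -> float:
    """Mean of per-sample DPO losses, each scaled by its own margin weight."""
    total = 0.0
    for ex in batch:
        loss = dpo_loss(ex["policy_chosen_logp"], ex["policy_rejected_logp"],
                        ex["ref_chosen_logp"], ex["ref_rejected_logp"], beta)
        total += margin_weight(ex["reward_margin"]) * loss
    return total / len(batch)
```

Because the weight is computed per sample, pairs with different margins in the same batch are treated differently and no sample is filtered out. The clamp keeps every weight strictly positive, which is one simple way to avoid the sign problems the abstract attributes to a linear update rule.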
