MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Abstract

As large language models (LLMs) increasingly act on behalf of users, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. Whereas existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants but also improves alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. MaPPO can also be used as a plugin that consistently improves DPO variants, including the widely used SimPO, IPO, and CPO. Extensive empirical evaluations across model sizes and model series on three standard benchmarks (MT-Bench, AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
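For context, the MLE view that the abstract attributes to DPO can be illustrated with the standard DPO loss for a single preference pair. The sketch below covers only that baseline and is not MaPPO itself, whose objective is not stated in this abstract. The function name, argument names, and the default `beta` value are illustrative assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (baseline only, not MaPPO).

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    beta: scaling factor; 0.1 is an illustrative default, not taken from the source.
    """
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), evaluated in a stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference model, the margin is zero and the loss equals log 2. The loss decreases as the policy raises the chosen response's log-probability relative to the rejected one. It depends only on which response was preferred, not on how much better it was, which is the binary treatment the abstract describes.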
