MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
July 27, 2025
Authors: Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton
cs.AI
Abstract
As the era of large language models (LLMs) acting on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin that yields consistent improvements over DPO variants, including the widely used SimPO, IPO, and CPO. Extensive empirical evaluations across different model sizes and model families on three standard benchmarks, including MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.