MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
July 27, 2025
Authors: Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton
cs.AI
Abstract
As the era of large language models (LLMs) acting on behalf of users unfolds,
Preference Optimization (PO) methods have become a central approach to aligning
LLMs with human preferences and improving performance. We propose Maximum a
Posteriori Preference Optimization (MaPPO), a framework for learning from
preferences that explicitly incorporates prior reward knowledge into the
optimization objective. While existing methods such as Direct Preference
Optimization (DPO) and its variants treat preference learning as a Maximum
Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating
prior reward estimates into a principled Maximum a Posteriori (MaP) objective.
This not only generalizes DPO and its variants, but also enhances alignment by
mitigating the oversimplified binary classification of responses. More
importantly, MaPPO introduces no additional hyperparameters and supports
preference optimization in both offline and online settings. In addition, MaPPO
can be used as a plugin, yielding consistent improvements on DPO variants, including
the widely used SimPO, IPO, and CPO. Extensive empirical evaluations across different
model sizes and model series on three standard benchmarks, including MT-Bench,
AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in
alignment performance without sacrificing computational efficiency.
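For context on the MLE framing the abstract contrasts with: a MaP estimate augments the MLE log-likelihood with a log-prior term, and MaPPO instantiates that prior with reward knowledge. The sketch below shows the standard DPO (MLE) loss computed from per-sequence log-probabilities, followed by a purely hypothetical prior-adjusted variant; the `prior_margin` argument and where it enters are illustrative assumptions, not the paper's actual MaPPO objective.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO (MLE) objective from per-sequence log-probabilities."""
    # Implicit rewards: beta-scaled log-ratios of the policy against the frozen reference.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Bradley-Terry MLE: maximize the probability that the chosen response beats the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def prior_adjusted_dpo_loss(logp_chosen, logp_rejected,
                            ref_logp_chosen, ref_logp_rejected,
                            prior_margin, beta=0.1):
    """Hypothetical sketch (NOT the paper's MaPPO objective): an external prior
    reward estimate shifts the preference margin, so a pair is no longer treated
    as a pure binary win/loss."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # prior_margin could come from a prior reward model, e.g. r(x, y_w) - r(x, y_l) (assumption).
    return -F.logsigmoid(chosen_reward - rejected_reward - prior_margin).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
lp_c, lp_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
print(prior_adjusted_dpo_loss(lp_c, lp_r, ref_c, ref_r, prior_margin=torch.randn(4)))
```

The first function matches the published DPO loss; the second only gestures at how prior reward knowledge might enter a margin term and should be read as a sketch, with the paper's derivation as the authoritative form.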